**DataTypes in PySpark:**  

RDD:
* Records

DataFrames:
* Rows
* Columns

Groupby:
* Groups

DataSets:


**Joins:**  
*multiple joins*  
You can join several DataFrames in a single expression by chaining `join` calls.

```python
>>> df = sc.sql.createDataFrame([('Alice', 2), ('Bob', 5)], ['name','age'])
>>> df_2 = sc.sql.createDataFrame([('Alice', 'female'), ('Bob', 'male')], ['name','gender'])
>>> df_3 = sc.sql.createDataFrame([('female', 'pink'), ('male', 'blue')], ['gender','color'])
>>> df.join(df_2, 'name')\
...   .join(df_3, 'gender')\
...   .collect()
[
    Row(gender=u'female', name=u'Alice', age=2, color=u'pink'),
    Row(gender=u'male', name=u'Bob', age=5, color=u'blue')
]
```

**F.struct in PySpark:**  
`pyspark.sql.functions.`**`struct`**`(*cols)`

Creates a new struct column.  
**Parameters:** **cols** – list of column names (strings) or list of **Column** expressions

Example:
```python
>>> from pyspark.sql.functions import struct
>>> df = sc.sql.createDataFrame([('Alice', 2), ('Bob', 5)], ['name','age'])

>>> df.select(struct('age', 'name').alias("struct")).collect()
[
    Row(struct=Row(age=2, name=u'Alice')), 
    Row(struct=Row(age=5, name=u'Bob'))
]

>>> df.select(struct([df.age, df.name]).alias("struct")).collect()
[
    Row(struct=Row(age=2, name=u'Alice')), 
    Row(struct=Row(age=5, name=u'Bob'))
]
```

Use cases:
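1. `groupBy` drops every column that you are not grouping by or aggregating on. Packing the extra columns into a struct lets you aggregate them as a single column and keep them in the result.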

**F.coalesce in PySpark:**  
`pyspark.sql.functions.`**`coalesce`**`(*cols)`

For each row, returns the value of the first column that is not null, or null if all of them are null.

Example:
```python
>>> from pyspark.sql.functions import coalesce, lit
>>> cDf = sc.sql.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
>>> cDf.show()
+----+----+
|   a|   b|
+----+----+
|null|null|
|   1|null|
|null|   2|
+----+----+
```

```python
>>> cDf.select(coalesce(cDf["a"], cDf["b"])).show()
+--------------+
|coalesce(a, b)|
+--------------+
|          null|
|             1|
|             2|
+--------------+
```

```python
>>> cDf.select('*', coalesce(cDf["a"], lit(0.0))).show()
+----+----+----------------+
|   a|   b|coalesce(a, 0.0)|
+----+----+----------------+
|null|null|             0.0|
|   1|null|             1.0|
|null|   2|             0.0|
+----+----+----------------+
```

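**Collapsing multi-dimensional arrays in DataFrames:**  
`F.explode(col)` returns one row per element of the array in `col`. Exploding and then re-aggregating with `F.collect_list` merges the arrays of each group into a single flat array.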

```python
>>> from pyspark.sql import functions as F
>>> df = sc.sql.createDataFrame([(['a'],'a'),
...                             (['a', 'b'],'a'),
...                             (['c'],'b'),
...                             (['d', 'e'],'b')],
...                             ['arrays', 'group'])
>>> df = df.withColumn('arrays', F.explode('arrays'))
>>> df.groupBy('group').agg(F.collect_list('arrays').alias('arrays')).collect()
[Row(group=u'b', arrays=[u'c', u'd', u'e']), Row(group=u'a', arrays=[u'a', u'a', u'b'])]
```

**Python Functions**
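* Functions are ordinary values in Python, so a long column expression can live in a named function. That keeps the `withColumn` call short and the expression reusable, compared with writing it in-line. Despite its name, the function below is not a Spark UDF: it simply returns a `Column` expression.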

```python
>>> # timestamp_key is assumed to be defined elsewhere as the name of the timestamp column
>>> def derived_session_token_udf():
...    return F.concat(
...        F.col("shop_id").cast("string"), F.lit(":"),
...        F.col("user_token"), F.lit(":"),
...        F.col("session_token"), F.lit(":"),
...        F.year(F.col(timestamp_key)), F.lit(":"),
...        F.dayofyear(F.col(timestamp_key))
...    )

>>> new_df = df.withColumn("derived_session_token", derived_session_token_udf())
```
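
The point about functions being values can be seen without Spark at all. The sketch below is plain Python under stated assumptions: `join_fields`, `builders`, and the sample record are hypothetical names made up for illustration, not part of the notes above.

```python
def join_fields(*fields, sep=":"):
    """Join the string form of each field with `sep`."""
    return sep.join(str(field) for field in fields)

# A function can be bound to another name like any other value.
make_token = join_fields

# Functions can also be stored in a dict and selected by key at run time.
builders = {
    "session": lambda rec: make_token(rec["shop_id"], rec["user_token"], rec["session_token"]),
    "user": lambda rec: make_token(rec["shop_id"], rec["user_token"]),
}

# Hypothetical sample record, standing in for one row of a DataFrame.
record = {"shop_id": 7, "user_token": "u1", "session_token": "s1"}

# Call every stored function on the same record.
tokens = {name: build(record) for name, build in builders.items()}
print(tokens)
```

The same idea carries over to PySpark: a function that returns a `Column` expression can be named, passed around, and reused across several `withColumn` calls.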