# Additional Functions

##### Objectives
1. Apply built-in functions to generate data for new columns
1. Apply DataFrame NA functions to handle null values
1. Join DataFrames

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html" target="_blank">DataFrameNaFunctions</a>: `fill`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>:
  - Aggregate: `collect_set`
  - Collection: `explode`
  - Non-aggregate and miscellaneous: `col`, `lit`

### DataFrameNaFunctions
<a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html" target="_blank">DataFrameNaFunctions</a> is a DataFrame submodule with methods for handling null values. Obtain an instance of DataFrameNaFunctions by accessing the `na` attribute of a DataFrame.

| Method | Description |
| --- | --- |
| drop | Returns a new DataFrame omitting rows with any, all, or a specified number of null values, considering an optional subset of columns |
| fill | Returns a new DataFrame replacing null values with the specified value, considering an optional subset of columns |
| replace | Returns a new DataFrame replacing a value with another value, considering an optional subset of columns |

### Non-aggregate and Miscellaneous Functions
Here are a few additional non-aggregate and miscellaneous built-in functions.

| Method | Description |
| --- | --- |
| col / column | Returns a Column based on the given column name. |
| lit | Creates a Column of a literal value |
| isnull | Returns true if and only if the column value is null |
| rand | Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0) |

### Joining DataFrames
The DataFrame <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html?highlight=join#pyspark.sql.DataFrame.join" target="_blank">`join`</a> method joins two DataFrames based on a given join expression. Several different types of joins are supported. For example:

```
# Inner join based on equal values of a shared column called 'name' (i.e., an equi join)
df1.join(df2, 'name')

# Inner join based on equal values of the shared columns called 'name' and 'age'
df1.join(df2, ['name', 'age'])

# Full outer join based on equal values of a shared column called 'name'
df1.join(df2, 'name', 'outer')

# Left outer join based on an explicit column expression
df1.join(df2, df1['customer_name'] == df2['account_name'], 'left_outer')
```

# Abandoned Carts Lab
Get the abandoned cart items for users (identified by email) who have not made a purchase.
1. Get emails of converted users from transactions
2. Join emails with user IDs
3. Get cart item history for each user
4. Join cart item history with emails
5. Filter for emails with abandoned cart items

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: `join`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>: `collect_set`, `explode`, `lit`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameNaFunctions.html" target="_blank">DataFrameNaFunctions</a>: `fill`

### Setup
Run the cells below to create DataFrames **`salesDF`**, **`usersDF`**, and **`eventsDF`**.

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
# sale transactions at BedBricks
salesDF = spark.read.parquet(salesPath)
display(salesDF)

In [0]:
# user IDs and emails at BedBricks
usersDF = spark.read.parquet(usersPath)
display(usersDF)

In [0]:
# events logged on the BedBricks website
eventsDF = spark.read.parquet(eventsPath)
display(eventsDF)

### 1-A: Get emails of converted users from transactions
- Select the **`email`** column in **`salesDF`** and remove duplicates
- Add a new column **`converted`** with the value **`True`** for all rows

Save the result as **`convertedUsersDF`**.

In [0]:
# TODO
from pyspark.sql.functions import *
convertedUsersDF = (salesDF.FILL_IN
)
display(convertedUsersDF)

#### 1-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["email", "converted"]

expectedCount = 210370

assert convertedUsersDF.columns == expectedColumns, "convertedUsersDF does not have the correct columns"

assert convertedUsersDF.count() == expectedCount, "convertedUsersDF does not have the correct number of rows"

assert convertedUsersDF.select(col("converted")).first()[0] == True, "converted column not correct"

### 2-A: Join emails with user IDs
- Perform an outer join of **`convertedUsersDF`** and **`usersDF`** on the **`email`** field
- Filter for users where **`email`** is not null
- Fill null values in **`converted`** with **`False`**

Save the result as **`conversionsDF`**.

In [0]:
# TODO
conversionsDF = (usersDF.FILL_IN
)
display(conversionsDF)

#### 2-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["email", "user_id", "user_first_touch_timestamp", "converted"]

expectedCount = 782749

expectedFalseCount = 572379

assert conversionsDF.columns == expectedColumns, "Columns are not correct"

assert conversionsDF.filter(col("email").isNull()).count() == 0, "Email column contains null"

assert conversionsDF.count() == expectedCount, "There is an incorrect number of rows"

assert conversionsDF.filter(col("converted") == False).count() == expectedFalseCount, "There is an incorrect number of false entries in converted column"

### 3-A: Get cart item history for each user
- Explode the **`items`** field in **`eventsDF`** with the results replacing the existing **`items`** field
- Group by **`user_id`**
  - Collect a set of all **`items.item_id`** values for each user and alias the column to **`cart`**

Save the result as **`cartsDF`**.

In [0]:
# TODO
cartsDF = (eventsDF.FILL_IN
)
display(cartsDF)

#### 3-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["user_id", "cart"]

expectedCount = 488403

assert cartsDF.columns == expectedColumns, "Incorrect columns"

assert cartsDF.count() == expectedCount, "Incorrect number of rows"

assert cartsDF.select(col("user_id")).drop_duplicates().count() == expectedCount, "Duplicate user_ids present"

### 4-A: Join cart item history with emails
- Perform a left join of **`conversionsDF`** with **`cartsDF`** on the **`user_id`** field

Save the result as **`emailCartsDF`**.

In [0]:
# TODO
emailCartsDF = conversionsDF.FILL_IN
display(emailCartsDF)

#### 4-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["user_id", "email", "user_first_touch_timestamp", "converted", "cart"]

expectedCount = 782749

expectedCartNullCount = 397799

assert emailCartsDF.columns == expectedColumns, "Columns do not match"

assert emailCartsDF.count() == expectedCount, "Counts do not match"

assert emailCartsDF.filter(col("cart").isNull()).count() == expectedCartNullCount, "Cart null counts incorrect from join"

### 5-A: Filter for emails with abandoned cart items
- Filter **`emailCartsDF`** for users where **`converted`** is **`False`**
- Filter for users with non-null carts

Save the result as **`abandonedCartsDF`**.

In [0]:
# TODO
abandonedCartsDF = (emailCartsDF.FILL_IN
)
display(abandonedCartsDF)

#### 5-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
expectedColumns = ["user_id", "email", "user_first_touch_timestamp", "converted", "cart"]

expectedCount = 204272

assert abandonedCartsDF.columns == expectedColumns, "Columns do not match"

assert abandonedCartsDF.count() == expectedCount, "Counts do not match"

### 6-A: Bonus Activity
Plot the number of abandoned cart items by product, saving the per-product counts as **`abandonedItemsDF`**.

In [0]:
# TODO
abandonedItemsDF = (abandonedCartsDF.FILL_IN
)
display(abandonedItemsDF)

#### 6-B: Check Your Work

Run the following cell to verify that your solution works:

In [0]:
abandonedItemsDF.count()

In [0]:
expectedColumns = ["items", "count"]

expectedCount = 12

assert abandonedItemsDF.count() == expectedCount, "Counts do not match"

assert abandonedItemsDF.columns == expectedColumns, "Columns do not match"

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup