
##### Objectives
1. Group data by specified columns
1. Apply grouped data methods to aggregate data
1. Apply built-in functions to aggregate data

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: `groupBy`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.html#pyspark.sql.GroupedData" target="_blank">Grouped Data</a>: `agg`, `avg`, `count`, `max`, `sum`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>: `approx_count_distinct`, `avg`, `sum`

In [0]:
%run ./Includes/Classroom-Setup

Let's use the BedBricks events dataset.

In [0]:
df = spark.read.parquet(eventsPath)
display(df)

### Grouping data

<img src="https://files.training.databricks.com/images/aspwd/aggregation_groupby.png" width="60%" />

### groupBy
Use the DataFrame `groupBy` method to create a grouped data object. 

This grouped data object is called `RelationalGroupedDataset` in Scala and `GroupedData` in Python.

In [0]:
df.groupBy("event_name")

In [0]:
df.groupBy("geo.state", "geo.city")

### Grouped data methods
Various aggregation methods are available on the <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.html" target="_blank">GroupedData</a> object.


| Method | Description |
| --- | --- |
| agg | Compute aggregates by specifying a series of aggregate columns |
| avg | Compute the mean value of each numeric column for each group |
| count | Count the number of rows for each group |
| max | Compute the max value of each numeric column for each group |
| mean | Compute the average value of each numeric column for each group |
| min | Compute the min value of each numeric column for each group |
| pivot | Pivots a column of the current DataFrame and performs the specified aggregation |
| sum | Compute the sum of each numeric column for each group |

In [0]:
eventCountsDF = df.groupBy("event_name").count()
display(eventCountsDF)

Here, we're getting the average purchase revenue for each state.

In [0]:
avgStatePurchasesDF = df.groupBy("geo.state").avg("ecommerce.purchase_revenue_in_usd")
display(avgStatePurchasesDF)

And here, we're getting the total quantity and the sum of purchase revenue for each combination of state and city.

In [0]:
cityPurchaseQuantitiesDF = df.groupBy("geo.state", "geo.city").sum("ecommerce.total_item_quantity", "ecommerce.purchase_revenue_in_usd")
display(cityPurchaseQuantitiesDF)
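
To make the semantics of the grouped data methods concrete, here is a minimal plain-Python sketch of what `groupBy(...)` followed by `count`, `avg`, or `sum` computes. The sample rows and the helper `group_rows` are hypothetical and exist only for illustration; this is not how Spark executes the query, since Spark distributes the work across partitions.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the events data (not real BedBricks values).
rows = [
    {"state": "CA", "city": "Irvine", "quantity": 2, "revenue": 100.0},
    {"state": "CA", "city": "Irvine", "quantity": 1, "revenue": 50.0},
    {"state": "CA", "city": "Fresno", "quantity": 3, "revenue": 150.0},
    {"state": "NY", "city": "Albany", "quantity": 1, "revenue": 75.0},
]

def group_rows(rows, *keys):
    """Bucket rows by the values of the given key columns (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)

# Analogue of groupBy("state").count() and groupBy("state").avg("revenue")
by_state = group_rows(rows, "state")
counts = {key: len(group) for key, group in by_state.items()}
avg_revenue = {
    key: sum(r["revenue"] for r in group) / len(group)
    for key, group in by_state.items()
}

# Analogue of groupBy("state", "city").sum("quantity", "revenue")
by_state_city = group_rows(rows, "state", "city")
totals = {
    key: (sum(r["quantity"] for r in group), sum(r["revenue"] for r in group))
    for key, group in by_state_city.items()
}

print(counts)
print(avg_revenue)
print(totals)
```

Each distinct combination of key values yields one output row, which is why grouping by both state and city produces more rows than grouping by state alone.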

## Built-In Functions
In addition to DataFrame and Column transformation methods, there are a ton of helpful functions in Spark's built-in <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-functions-builtin.html" target="_blank">SQL functions</a> module.

In Scala, this module is <a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/functions$.html" target="_blank">`org.apache.spark.sql.functions`</a>; in Python, it is <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions" target="_blank">`pyspark.sql.functions`</a>. Functions from this module must be imported into your code.

### Aggregate Functions

Here are some of the built-in functions available for aggregation.

| Method | Description |
| --- | --- |
| approx_count_distinct | Returns the approximate number of distinct items in a group |
| avg | Returns the average of the values in a group |
| collect_list | Returns a list of objects with duplicates |
| corr | Returns the Pearson Correlation Coefficient for two columns |
| max | Returns the maximum value of the expression in a group |
| mean | Returns the average of the values in a group (alias for avg) |
| stddev_samp | Returns the sample standard deviation of the expression in a group |
| sumDistinct | Returns the sum of distinct values in the expression (deprecated since Spark 3.2 in favor of sum_distinct) |
| var_pop | Returns the population variance of the values in a group |

Use the grouped data method <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.agg.html#pyspark.sql.GroupedData.agg" target="_blank">`agg`</a> to apply built-in aggregate functions.

This allows you to apply other transformations on the resulting columns, such as <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.alias.html" target="_blank">`alias`</a>.

In [0]:
# Note: this import shadows Python's built-in sum for the rest of the notebook
from pyspark.sql.functions import sum

statePurchasesDF = df.groupBy("geo.state").agg(sum("ecommerce.total_item_quantity").alias("total_purchases"))
display(statePurchasesDF)

Apply multiple aggregate functions on grouped data.

In [0]:
from pyspark.sql.functions import avg, approx_count_distinct

stateAggregatesDF = (df
                     .groupBy("geo.state")
                     .agg(avg("ecommerce.total_item_quantity").alias("avg_quantity"),
                          approx_count_distinct("user_id").alias("distinct_users"))
                    )

display(stateAggregatesDF)

### Math Functions
Here are some of the built-in functions for math operations.

| Method | Description |
| --- | --- |
| ceil | Computes the ceiling of the given column. |
| cos | Computes the cosine of the given value. |
| log | Computes the natural logarithm of the given value. |
| round | Returns the value of the column rounded to 0 decimal places with HALF_UP round mode. |
| sqrt | Computes the square root of the specified float value. |

In [0]:
from pyspark.sql.functions import cos, sqrt

display(
    spark.range(10)  # Create a DataFrame with a single column called "id" with a range of integer values
    .withColumn("sqrt", sqrt("id"))
    .withColumn("cos", cos("id"))
)
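
The table above notes that Spark's `round` uses HALF_UP mode. The plain-Python sketch below, which uses the standard `decimal` module rather than Spark, shows how HALF_UP differs from HALF_EVEN on values that sit exactly halfway between two integers. Python's own built-in `round` follows the half-to-even rule, so it can disagree with Spark's `round` on such ties. The variable names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Values exactly halfway between two integers, where the rounding mode matters
values = [Decimal("0.5"), Decimal("1.5"), Decimal("2.5")]

# HALF_UP: ties always round away from zero
half_up = [int(v.quantize(Decimal("1"), rounding=ROUND_HALF_UP)) for v in values]

# HALF_EVEN: ties round to the nearest even integer
half_even = [int(v.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)) for v in values]

# Python's built-in round also rounds ties to the nearest even integer
python_builtin = [round(0.5), round(1.5), round(2.5)]

print(half_up)         # [1, 2, 3]
print(half_even)       # [0, 2, 2]
print(python_builtin)  # [0, 2, 2]
```

If HALF_EVEN behavior is what you need in Spark, the built-in function `bround` is documented as using that mode.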

# Revenue by Traffic Lab
Get the 3 traffic sources generating the highest total revenue.
1. Aggregate revenue by traffic source
2. Get top 3 traffic sources by total revenue
3. Clean revenue columns to have two decimal places

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html" target="_blank">DataFrame</a>: `groupBy`, `sort`, `limit`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.html?highlight=column#pyspark.sql.Column" target="_blank">Column</a>: `alias`, `desc`, `cast`, operators
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-in Functions</a>: `avg`, `sum`

### Setup
Run the cell below to create the starting DataFrame **`df`**.

In [0]:
from pyspark.sql.functions import col

# Purchase events logged on the BedBricks website
df = (spark.read.parquet(eventsPath)
      .withColumn("revenue", col("ecommerce.purchase_revenue_in_usd"))
      .filter(col("revenue").isNotNull())
      .drop("event_name")
     )

display(df)

### 1. Aggregate revenue by traffic source
- Group by **`traffic_source`**
- Get sum of **`revenue`** as **`total_rev`**
- Get average of **`revenue`** as **`avg_rev`**

Remember to import any necessary built-in functions.

In [0]:
# TODO

trafficDF = (df.FILL_IN
)

display(trafficDF)

**CHECK YOUR WORK**

In [0]:
# Note: this import shadows Python's built-in round for the rest of the notebook
from pyspark.sql.functions import round

expected1 = [(12704560.0, 1083.175), (78800000.3, 983.2915), (24797837.0, 1076.6221), (47218429.0, 1086.8303), (16177893.0, 1083.4378), (8044326.0, 1087.218)]
testDF = trafficDF.sort("traffic_source").select(round("total_rev", 4).alias("total_rev"), round("avg_rev", 4).alias("avg_rev"))
result1 = [(row.total_rev, row.avg_rev) for row in testDF.collect()]

assert(expected1 == result1)

### 2. Get top three traffic sources by total revenue
- Sort by **`total_rev`** in descending order
- Limit to first three rows
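
The order of these two steps matters. As a plain-Python analogue (not the Spark solution), the sketch below uses hypothetical `(traffic_source, total_rev)` pairs to show that sorting before limiting gives the true top three, while limiting first only sorts whichever three rows happened to come first.

```python
# Hypothetical (traffic_source, total_rev) pairs; not the lab's real results.
traffic = [
    ("email", 120.0),
    ("direct", 300.0),
    ("google", 450.0),
    ("youtube", 80.0),
    ("facebook", 210.0),
]

def by_revenue(pair):
    """Sort key: the total_rev element of each pair."""
    return pair[1]

# Sort descending, then keep the first three rows
sort_then_limit = sorted(traffic, key=by_revenue, reverse=True)[:3]

# Keep the first three rows, then sort: a different (wrong) answer
limit_then_sort = sorted(traffic[:3], key=by_revenue, reverse=True)

print(sort_then_limit)  # [('google', 450.0), ('direct', 300.0), ('facebook', 210.0)]
print(limit_then_sort)  # [('google', 450.0), ('direct', 300.0), ('email', 120.0)]
```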

In [0]:
# TODO
topTrafficDF = (trafficDF.FILL_IN
)
display(topTrafficDF)

**CHECK YOUR WORK**

In [0]:
expected2 = [(78800000.3, 983.2915), (47218429.0, 1086.8303), (24797837.0, 1076.6221)]
testDF = topTrafficDF.select(round("total_rev", 4).alias("total_rev"), round("avg_rev", 4).alias("avg_rev"))
result2 = [(row.total_rev, row.avg_rev) for row in testDF.collect()]

assert(expected2 == result2)

### 3. Limit revenue columns to two decimal places
- Modify columns **`avg_rev`** and **`total_rev`** to contain numbers with two decimal places
  - Use **`withColumn()`** with the same names to replace these columns
  - To limit to two decimal places, multiply each column by 100, cast to long, and then divide by 100
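
The arithmetic behind this trick can be sketched in plain Python (this is not the Spark solution). Casting to an integer drops the fractional part, so the result is truncated rather than rounded, which is why it can differ from a true rounding function in the last digit. The helper name `truncate_two_places` is hypothetical, and `builtins.round` is used explicitly because this notebook imports Spark's `round` over the Python built-in.

```python
import builtins

def truncate_two_places(x: float) -> float:
    """Multiply by 100, drop the fractional part, divide by 100 (hypothetical helper)."""
    return int(x * 100) / 100

value = 1083.4378

truncated = truncate_two_places(value)  # keeps 1083.43, discarding the trailing 0.0078
rounded = builtins.round(value, 2)      # rounds up to 1083.44

print(truncated)  # 1083.43
print(rounded)    # 1083.44
```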

In [0]:
# TODO
finalDF = (topTrafficDF.FILL_IN
)

display(finalDF)

**CHECK YOUR WORK**

In [0]:
expected3 = [(78800000.29, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result3 = [(row.total_rev, row.avg_rev) for row in finalDF.collect()]

assert(expected3 == result3)

### 4. Bonus: Rewrite using a built-in math function
Find a built-in math function that rounds to a specified number of decimal places.

In [0]:
# TODO
bonusDF = (topTrafficDF.FILL_IN
)

display(bonusDF)

**CHECK YOUR WORK**

In [0]:
expected4 = [(78800000.3, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result4 = [(row.total_rev, row.avg_rev) for row in bonusDF.collect()]

assert(expected4 == result4)

### 5. Chain all the steps above

In [0]:
# TODO
chainDF = (df.FILL_IN
)

display(chainDF)

**CHECK YOUR WORK**

In [0]:
expected5 = [(78800000.3, 983.29), (47218429.0, 1086.83), (24797837.0, 1076.62)]
result5 = [(row.total_rev, row.avg_rev) for row in chainDF.collect()]

assert(expected5 == result5)

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup