# Welcome to Aggregation

Aggregation: a cluster of things that have come or been brought together.

- [groupBy](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.groupBy.html) - Groups the `DataFrame` by the specified columns so that aggregations can be run on each group. Returns a `GroupedData` object.

In [None]:
from pyspark.sql import Row, SparkSession

# Building a SparkSession first ensures rdd.toDF() is available.
spark = SparkSession.builder.getOrCreate()
spark_context = spark.sparkContext
rdd = spark_context.parallelize(
    [
        Row("Adrian", "Cake",   1.23),
        Row("Adrian", "Wine",   7.00),
        Row("Dan",    "Tea",    3.20),
        Row("Dan",    "Rum",    9.20),
        Row("Fraser", "Cheese", 3.20),
        Row("Fraser", "Cheese", 4.00),
        Row("Fraser", "Cheese", 2.00),
    ]
)
# Name the columns when converting the RDD of Rows to a DataFrame.
shopping_dataframe = rdd.toDF(['name', 'product', 'price'])

# Count the number of purchases made by each person.
shopping_dataframe.groupBy('name').count().show()
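
To make the semantics concrete, here is a minimal Spark-free sketch of what `groupBy(...).count()` computes, using the same rows as plain tuples. The names `purchases`, `counts_by_name` and `counts_by_name_and_product` are made up for this illustration and are not part of the PySpark API; Spark itself does not guarantee the order of the output rows, so the sorting here is only for readability.

```python
from collections import Counter

# Same rows as shopping_dataframe, as plain (name, product, price) tuples.
purchases = [
    ("Adrian", "Cake", 1.23),
    ("Adrian", "Wine", 7.00),
    ("Dan", "Tea", 3.20),
    ("Dan", "Rum", 9.20),
    ("Fraser", "Cheese", 3.20),
    ("Fraser", "Cheese", 4.00),
    ("Fraser", "Cheese", 2.00),
]

# Analogue of groupBy('name').count(): one entry per distinct name.
counts_by_name = Counter(name for name, _product, _price in purchases)

# Analogue of groupBy('name', 'product').count(): one entry per distinct pair.
counts_by_name_and_product = Counter(
    (name, product) for name, product, _price in purchases
)

for name, count in sorted(counts_by_name.items()):
    print(name, count)
```

Grouping by more columns produces more, smaller groups: Fraser appears once in `counts_by_name` with a count of 3, and the pair `("Fraser", "Cheese")` also has a count of 3 because all of Fraser's purchases are the same product.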

- [crosstab](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.crosstab.html#pyspark.sql.DataFrame.crosstab) - Computes a pair-wise frequency table of the given columns.

In [None]:
shopping_dataframe.crosstab('name', 'product').show()
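
The shape of a pair-wise frequency table can be sketched without Spark as a nested dictionary: one row per distinct `name`, one column per distinct `product`, and a count of 0 for pairs that never occur. The names `pairs`, `pair_counts` and `table` below are hypothetical and chosen only for this illustration; the sorted row and column order is an assumption of the sketch, not something `crosstab` promises.

```python
from collections import Counter

# (name, product) pairs taken from the rows of shopping_dataframe.
pairs = [
    ("Adrian", "Cake"),
    ("Adrian", "Wine"),
    ("Dan", "Tea"),
    ("Dan", "Rum"),
    ("Fraser", "Cheese"),
    ("Fraser", "Cheese"),
    ("Fraser", "Cheese"),
]

# Count how often each (name, product) pair occurs.
pair_counts = Counter(pairs)

# Distinct values become the row labels and the column labels.
names = sorted({name for name, _product in pairs})
products = sorted({product for _name, product in pairs})

# Build the table; a Counter returns 0 for pairs that were never seen.
table = {
    name: {product: pair_counts[(name, product)] for product in products}
    for name in names
}

# Print a header row followed by one row of counts per name.
print("name", *products)
for name in names:
    print(name, *(table[name][product] for product in products))
```

In the Spark output, the first column is named after both inputs (`name_product` for this call) and holds the distinct names, while the remaining columns are the distinct products.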

## Exercises

1. Using the existing `shopping_dataframe`, print to the console a new DataFrame that shows the total price of all products bought by each person.
1. Using the existing `shopping_dataframe`, print to the console a new DataFrame that shows the average price across all products.
    <details>
      <summary>Hint</summary>
      groupBy can group across any number of columns, including zero.
    </details>

## Resources

- [GroupedData API](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.GroupedData.html#pyspark.sql.GroupedData)