# ADS2 - Tutorial 11 - Aggregations

Learning Outcomes:

1.   Use Aggregation functions to explore the properties of a DataFrame
2.   Use GroupedData to perform multiple aggregations at once, over specific subsets of data




In [None]:
# Apache Spark uses Java, so first we must install that
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Unpack Spark from Google Drive
!tar xzf /content/drive/MyDrive/spark-3.3.0-bin-hadoop3.tgz

In [None]:
# Set up environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

In [None]:
# Install findspark, which helps Python locate the pyspark module files
!pip install -q findspark
import findspark
findspark.init()

In [None]:
# Finally, we initialise a "SparkSession", which handles the computations
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

In [None]:
from pyspark.sql import functions as F

# Exercise 1

Upload the AmazonBooks.csv file from the Canvas page and read it into a DataFrame. The dataset is described [here](https://www.kaggle.com/palanjali007/amazons-top-50-bestselling-novels-20092020?select=AmazonBooks+-+Sheet1.csv).

In [None]:
### Read in the .csv data, ensure the schema is appropriate

CsvPath = 

# Load .csv with header and ',' separators


In [None]:
### Show the top 50 books from the year 2020, ordered by User Rating
# .filter, .sort/.orderBy, .show
# Use the 'truncate=False' kwarg in .show to display the full row


# Exercise 2

Aggregate [functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions) can be accessed through `pyspark.sql.functions`, which has been imported as `F` for ease of use. To perform a simple aggregation, call the function on a column name, then pass the result to the `.select` method.

In [None]:
### EXAMPLE: Find the highest price of any book in the dataset

# Find max price, select that column

# To access the number in the DataFrame, use .first()[0]


In [None]:
### Find the mean User Rating of all books in the dataset
### Then use the .filter method to find the mean rating for
### fiction and non fiction books
# .select, .mean, .filter



In [None]:
### Use the .count aggregate function to find the number of fiction and
### non fiction entries in the dataset



In [None]:
### You aren't limited to selecting a single aggregate column
### Using the .filter and .count_distinct functions, find the 
### number of unique books and authors in both genres



In [None]:
### Use the .collect_set function to get a list of all the
### unique authors in the dataset, in alphabetical order
# .select, .sort_array, .collect_set



# Exercise 3

The `.groupBy()` method produces a GroupedData object, which can in turn be used to perform aggregations. You can even group over multiple columns.

In [None]:
### EXAMPLE: Find the mean and standard deviation of prices for each year

BooksDF.groupBy('Year')\
       .agg(F.mean('Price').alias('Mean_price'),
            F.stddev('Price').alias('StdDev_price'))\
       .show()

In [None]:
### Use the .groupBy method to produce a single DataFrame containing
### the mean User Rating, number of entries, unique book count,
### and unique author count for Fiction and Non Fiction books
# You may find it useful to use Column expressions
# .count, .count_distinct, .mean, .groupBy, .agg

# set up the aggregations as new columns

# group by genre, then feed the aggregates into .agg


In [None]:
### Find the top rated book (in terms of Rating and number of Reviews) for
### each year in the dataset. Display both the name of the book, and the
### author
# .sort, .groupBy, .agg, .first
# Optional: .col, .desc



In [None]:
### As above, but do this separately for Fiction and Non Fiction
# .sort, .groupBy, .agg, .first
# Optional: .col, .desc


In [None]:
### Group the data by author, and show their highest rated book, the number
### of times they appear in the dataset, and the number of distinct books
# .sort, .groupBy, .agg, .first, .count, .count_distinct
# Optional: .col, .desc
