## Aggregate Functions
Aggregate functions compute a single value over the set of values of a column, for example:
- avg(column)
- count(column)
- sum(column)
- max(column)
- etc.

In [9]:
# Create a Spark Session object
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame from persons_age.csv
df = spark.read.load("./databases/persons_age.csv", format="csv", header=True, inferSchema=True)

# Compute the average of age
averageAge = df.agg({"age": "avg"})

In [10]:
df.show()
averageAge.show()

+----------+---+
|      name|age|
+----------+---+
| Arcangelo| 23|
|  Leonardo| 24|
|Margherita| 10|
|  Veronica| 21|
|Alessandro| 61|
+----------+---+

+--------+
|avg(age)|
+--------+
|    27.8|
+--------+



In [11]:
# Compute the average value over column age and the maximum over name
dfAverageAge = df.agg({"age": "avg", "name": "max"})
dfAverageAge.show()

+---------+--------+
|max(name)|avg(age)|
+---------+--------+
| Veronica|    27.8|
+---------+--------+



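**NOTE:** with the dictionary syntax you can't compute two aggregates over the same column, because each column name can appear only once as a key. To do that, pass functions from pyspark.sql.functions to agg():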

In [12]:
import pyspark.sql.functions as f

dfAvgMaxAge = df.agg(f.avg("age"), f.max("age"))
dfAvgMaxAge.printSchema()
dfAvgMaxAge.show()

root
 |-- avg(age): double (nullable = true)
 |-- max(age): integer (nullable = true)

+--------+--------+
|avg(age)|max(age)|
+--------+--------+
|    27.8|      61|
+--------+--------+



In [18]:
dfAvgMaxCount = df.agg({"age": "avg", "name": "max", "*":"count"})
dfAvgMaxCount.printSchema()
dfAvgMaxCount.show()

root
 |-- max(name): string (nullable = true)
 |-- count(1): long (nullable = false)
 |-- avg(age): double (nullable = true)

+---------+--------+--------+
|max(name)|count(1)|avg(age)|
+---------+--------+--------+
| Veronica|       5|    27.8|
+---------+--------+--------+



### Group By and aggregate functions
The groupBy(col1, .., coln) method of the DataFrame class, combined with a set of aggregate methods, can be used to split the input data into groups and compute an aggregate function over each group.

In [21]:
# Create a DataFrame from persons_avg_group.csv
df = spark.read.load("./databases/persons_avg_group.csv", format="csv", header=True, inferSchema=True)
grouped = df.groupBy("name").avg("age")
df.show()
grouped.show()

+--------+---+
|    name|age|
+--------+---+
|   Marco| 15|
| Antonio| 22|
|Giovanni| 23|
| Antonio| 27|
| Antonio| 12|
|   Marco| 30|
+--------+---+

+--------+------------------+
|    name|          avg(age)|
+--------+------------------+
|Giovanni|              23.0|
| Antonio|20.333333333333332|
|   Marco|              22.5|
+--------+------------------+



In [24]:
# Create a DataFrame from persons_avg_group.csv
df = spark.read.load("./databases/persons_avg_group.csv", format="csv", header=True, inferSchema=True)

grouped = df.groupBy("name").agg({"age": "avg", "name": "count"})
df.show()
grouped.show()

+--------+---+
|    name|age|
+--------+---+
|   Marco| 15|
| Antonio| 22|
|Giovanni| 23|
| Antonio| 27|
| Antonio| 12|
|   Marco| 30|
+--------+---+

+--------+-----------+------------------+
|    name|count(name)|          avg(age)|
+--------+-----------+------------------+
|Giovanni|          1|              23.0|
| Antonio|          3|20.333333333333332|
|   Marco|          2|              22.5|
+--------+-----------+------------------+



### SORT
The sort(col1, .., coln, ascending=True) method of the DataFrame class returns a new DataFrame that:
- contains the same data as the input one
- has its content sorted by col1, .., coln
- is sorted in ascending or descending order depending on the ascending parameter

### DFs with SQL language
Spark also allows querying the content of a DataFrame by using the SQL language.

The createOrReplaceTempView(tableName) method of the DataFrame class can be used to assign a “table name” to the DataFrame on which it is invoked.

The **sql(query)** method of the SparkSession class can be used to execute an SQL-like query.

**NOTE:** errors in the query (for example a misspelled keyword or column name) are raised only at runtime.

In [25]:
# Example 1

# Create a DataFrame from persons.json
df = spark.read.load("./databases/persons.json", format="json")

# Assign the “table name” people to the df DataFrame
df.createOrReplaceTempView("people")

# Select the persons with age between 20 and 31
# by querying the people table               
selectedPersons = spark.sql("SELECT * FROM people WHERE age>=20 and age<=31")

# Print the result on the standard output
selectedPersons.show()

+---+----+
|age|name|
+---+----+
| 30|John|
+---+----+



In [None]:
# Example 2

# Read persons_id.csv and store it in a DataFrame
dfPersons = spark.read.load("persons_id.csv", format="csv", header=True, inferSchema=True)

# Assign the “table name” people to dfPersons
dfPersons.createOrReplaceTempView("people")

# Read liked_sports.csv and store it in a DataFrame
dfUidSports = spark.read.load("liked_sports.csv", format="csv", header=True, inferSchema=True)

# Assign the “table name” liked to dfUidSports
dfUidSports.createOrReplaceTempView("liked")

# Join the two input tables by using the
# SQL-like syntax
dfPersonLikes = spark.sql("SELECT * FROM people, liked WHERE people.uid = liked.uid")

# Print the result on the standard output
dfPersonLikes.show()

In [None]:
# Example 3

# Create a DataFrame from persons.json
df = spark.read.load("persons.json", format="json")

# Assign the “table name” people to the df DataFrame
df.createOrReplaceTempView("people")

# Define groups based on the value of name and
# compute average and number of records for each group
nameAvgAgeCount = spark.sql("SELECT name, avg(age), count(name) FROM people GROUP BY name")

# Print the result on the standard output
nameAvgAgeCount.show()