## Exercise 48 - SparkSQL
- Input:
    - A CSV file containing a list of user profiles
    - The first line is the header: name,age,gender
    - Each line of the file contains the information about one user
- Output:
    - Select the names that occur at least twice and store, in the output folder, the name and average(age) of each selected name.
    - The output does not contain the header line


- Example of input data:
    - name,age,gender
    - Paul,40,male
    - Paul,38,male
    - David,15,male
    - Susan,40,female
    - Susan,34,female
- Example of expected output:
    - Paul,39
    - Susan,37
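
The expected output above can be checked by hand with a few lines of plain Python. This is only a sketch of the required logic (group by name, keep names with at least two records, average the ages), not the Spark solution; `example_rows` and `ages_by_name` are illustrative names that do not appear in the notebook.

```python
from collections import defaultdict

# Rows copied from "Example of input data" (header line excluded)
example_rows = [
    ("Paul", 40, "male"),
    ("Paul", 38, "male"),
    ("David", 15, "male"),
    ("Susan", 40, "female"),
    ("Susan", 34, "female"),
]

# Group the ages by name
ages_by_name = defaultdict(list)
for name, age, _gender in example_rows:
    ages_by_name[name].append(age)

# Keep only the names occurring at least twice and compute their average age
expected = sorted(
    (name, sum(ages) / len(ages))
    for name, ages in ages_by_name.items()
    if len(ages) >= 2
)
print(expected)  # [('Paul', 39.0), ('Susan', 37.0)]
```

David is dropped because his name occurs only once.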

In [12]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame from persons_ex.csv (the file has a header line; column types are inferred)
dfPersons = spark.read.load('./databases/persons_ex.csv',
                            format='csv',
                            header=True,
                            inferSchema=True)

In [13]:
dfPersons.printSchema()
dfPersons.show()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)

+-----+---+------+
| name|age|gender|
+-----+---+------+
| Paul| 40|  male|
| Paul| 28|  male|
|David| 15|  male|
|Susan| 40|female|
|Susan| 34|female|
+-----+---+------+



## **Solution 1 - SQL syntax**

In [14]:
# Register the DataFrame as a temporary view with a simpler name so it can be queried with SQL
dfPersons.createOrReplaceTempView('persons')

In [22]:
dfNameAvg = spark.sql("""SELECT name, avg(age) as avg
                                 FROM persons
                                 GROUP BY name
                                 HAVING count(*)>1""")

In [23]:
dfNameAvg.printSchema()
dfNameAvg.show()

root
 |-- name: string (nullable = true)
 |-- avg: double (nullable = true)

+-----+----+
| name| avg|
+-----+----+
|Susan|37.0|
| Paul|34.0|
+-----+----+



## **Solution 2 - SparkSQL API**

In [28]:
dfNameCountRecord = dfPersons.groupBy("name").agg({"*": "count", "age": "avg"})

In [29]:
dfNameCountRecord.printSchema()
dfNameCountRecord.show()

root
 |-- name: string (nullable = true)
 |-- count(1): long (nullable = false)
 |-- avg(age): double (nullable = true)

+-----+--------+--------+
| name|count(1)|avg(age)|
+-----+--------+--------+
|Susan|       2|    37.0|
|David|       1|    15.0|
| Paul|       2|    34.0|
+-----+--------+--------+



In [32]:
dfSelectedNames = dfNameCountRecord.filter("count(1)>=2").select('name','avg(age)')

In [33]:
dfSelectedNames.printSchema()
dfSelectedNames.show()

root
 |-- name: string (nullable = true)
 |-- avg(age): double (nullable = true)

+-----+--------+
| name|avg(age)|
+-----+--------+
|Susan|    37.0|
| Paul|    34.0|
+-----+--------+



In [None]:
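# The output must not contain the header line, hence header=False.
# outputFolder is assumed to hold the output path; it is not defined in the cells shown here.
dfSelectedNames.write.csv(outputFolder, header=False)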

## **Solution 3 - RDDs**

In [34]:
# Long solution