## Exercise 49 - SparkSQL
- Input:
    - A csv file containing a list of profiles
    - Header: name,surname,age
    - Each line of the file contains one profile
    - name,surname,age
- Output:
    - A csv file containing one line for each profile. The original age attribute is replaced by a new attribute called rangeage of type String.
    - rangeage = "[" + (age/10)*10 + "-" + ((age/10)*10 + 9) + "]", where age/10 is integer division


- Example input:
    - name,surname,age
    - Paolo,Garza,42
    - Luca,Boccia,41
    - Maura,Bianchi,16
- Expected output:
    - name,surname,rangeage
    - Paolo,Garza,[40-49]
    - Luca,Boccia,[40-49]
    - Maura,Bianchi,[10-19]
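
As a reference for what the Spark job should produce, the transformation can be sketched in plain Python without Spark. This is only a minimal sketch: the function name `discretize_profiles` and the variable `example_input` are hypothetical, and the input is assumed to be small enough to handle as an in-memory string.

```python
import csv
import io

def discretize_profiles(csv_text):
    """Replace the age column with rangeage, following the exercise's formula."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "surname", "rangeage"])
    for row in reader:
        # Integer division drops the units digit: 42 // 10 * 10 == 40
        low = (int(row["age"]) // 10) * 10
        writer.writerow([row["name"], row["surname"], f"[{low}-{low + 9}]"])
    return out.getvalue()

# Hypothetical in-memory copy of the example input above
example_input = "name,surname,age\nPaolo,Garza,42\nLuca,Boccia,41\nMaura,Bianchi,16\n"
print(discretize_profiles(example_input), end="")
```

The printed lines should match the expected output listed above, which makes this a convenient check when validating the Spark result on small samples.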

In [1]:
from pyspark.sql import SparkSession

# Create a spark session
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame from the csv data in the input folder
# (header line: name,surname,age; column types are inferred)
dfPersons = spark.read.load('/data/students/bigdata-01QYD/ex_data/Ex49/data/',
                            format='csv',
                            header=True,
                            inferSchema=True)

In [2]:
dfPersons.printSchema()
dfPersons.show()

root
 |-- name: string (nullable = true)
 |-- surname: string (nullable = true)
 |-- age: integer (nullable = true)

+-----+-------+---+
| name|surname|age|
+-----+-------+---+
|Paolo|  Garza| 42|
| Luca| Boccia| 41|
|Maura|Bianchi| 16|
+-----+-------+---+



In [3]:
# Register a user-defined function called AgeCategory(age)
# that returns the age range of the profile as a string, e.g. "[40-49]".
# No return type is given, so the UDF returns a string by default.
spark.udf.register("AgeCategory",
                   lambda age: "[" + str((age//10)*10) + "-" + str((age//10)*10 + 9) + "]")

<function __main__.<lambda>(age)>
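
The lambda above raises a `TypeError` if `age` is missing (`None`), which can happen when the input csv has empty age fields. A possible null-safe variant, written as a named function so it can be tested without Spark, might look like the following sketch. The name `age_category` is hypothetical and not part of the original solution.

```python
def age_category(age):
    """Return the decade range of age as a string, or None for a missing age."""
    if age is None:
        # Propagate missing values instead of failing on None // 10
        return None
    low = (age // 10) * 10
    return f"[{low}-{low + 9}]"

print(age_category(42))
print(age_category(None))
```

Assuming the Spark session from the first cell, this function could be registered in place of the lambda with `spark.udf.register("AgeCategory", age_category)`.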

In [8]:
# Define a DataFrame that replaces age with rangeage, using the registered UDF
profileDiscrizedAge = dfPersons.selectExpr("name", "surname", "AgeCategory(age) as rangeage")

# Or, equivalently, in SQL

# Register the DataFrame as a temporary view so it can be queried by name
dfPersons.createOrReplaceTempView('profiles')

profileDiscrizedAgeSQL = spark.sql("""SELECT name, surname,
                                      AgeCategory(age) as rangeage
                                      FROM profiles""")

In [5]:
profileDiscrizedAge.printSchema()
profileDiscrizedAge.show()

root
 |-- name: string (nullable = true)
 |-- surname: string (nullable = true)
 |-- rangeage: string (nullable = true)

+-----+-------+--------+
| name|surname|rangeage|
+-----+-------+--------+
|Paolo|  Garza| [40-49]|
| Luca| Boccia| [40-49]|
|Maura|Bianchi| [10-19]|
+-----+-------+--------+



In [9]:
profileDiscrizedAgeSQL.printSchema()
profileDiscrizedAgeSQL.show()

root
 |-- name: string (nullable = true)
 |-- surname: string (nullable = true)
 |-- rangeage: string (nullable = true)

+-----+-------+--------+
| name|surname|rangeage|
+-----+-------+--------+
|Paolo|  Garza| [40-49]|
| Luca| Boccia| [40-49]|
|Maura|Bianchi| [10-19]|
+-----+-------+--------+



In [None]:
outputFolder = 'ex49_out/'  # hypothetical output path; not defined elsewhere in this notebook
profileDiscrizedAgeSQL.write.csv(outputFolder, header=True)