# Data Analysis with PySpark

We will use PySpark from a notebook, relying on the `findspark` package installed via `conda`. PySpark itself must also be installed, either through `conda` or by downloading it from its [home page](https://spark.apache.org/downloads.html).

The `$SPARK_HOME` environment variable must be defined and must point to the PySpark installation path.
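Before starting the notebook, it can help to verify that the variable is actually visible to Python. The following is a minimal sketch, not part of the original setup: the helper name `check_spark_home` is hypothetical, and it only checks that `SPARK_HOME` is set and refers to an existing directory, not that the directory contains a working Spark installation.

```python
import os

def check_spark_home(env=os.environ):
    """Return the SPARK_HOME path if it is set and is an existing directory, else None."""
    path = env.get("SPARK_HOME")
    if path and os.path.isdir(path):
        return path
    return None

# Report the result for the current process environment
spark_home = check_spark_home()
if spark_home is None:
    print("SPARK_HOME is not set or does not point to a directory")
else:
    print("SPARK_HOME =", spark_home)
```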

The `findspark` package gives us a simple way to make `pyspark` importable from within any IDE, so that we can import the package with all its modules and then initialize a `SparkSession`.

Assuming that Python and Apache Hadoop are already correctly installed and that `$SPARK_HOME` is defined, it is enough to use the following prologue at the start of every PySpark application:

```python
import findspark
location = findspark.find()
findspark.init(location)

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('MyAppName').getOrCreate()

```

The `SparkSession` gives us access to the _DataFrame API_ as well as to the various components of the framework:

- `SparkContext`
- `SQLContext`
- `StreamingContext`
- `HiveContext`

In particular, the `SparkContext` provides access to the _RDD API_ and can be obtained with the following code:

```python
sc = spark.sparkContext

```

In [29]:
# Create the SparkSession for our application

import findspark

location = findspark.find()
findspark.init(location)

import pyspark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SurvivedPassengers').getOrCreate()

In [30]:
# The SparkContext is displayed as follows

spark.sparkContext

In [31]:
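# Import the data from the titanic.csv data set
from pyspark.sql.types import (StructType, StructField, ShortType,
                               StringType, FloatType, IntegerType)

# Build the DataFrame schema
# (note: the printSchema() output further down reports every column as nullable,
#  regardless of the nullable flags set here)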

passengerSchema = StructType([
    StructField('PassengerID',ShortType(),False),
    StructField('Survived',ShortType(),False),
    StructField('Pclass',ShortType(),False),
    StructField('Name',StringType(),False),
    StructField('Sex',StringType(),False),
    StructField('Age',FloatType(),True),
    StructField('SibSp',IntegerType(),True),
    StructField('Parch',IntegerType(),True),
    StructField('Ticket',StringType(),True),
    StructField('Fare',FloatType(),True),
    StructField('Cabin',StringType(),True),
    StructField('Embarked',StringType(),True)
])

titanic = spark.read.format('csv')\
    .option('header','true')\
    .option('mode','FAILFAST')\
    .schema(passengerSchema)\
    .load('hdfs://localhost:9099/user/pirrone/spark/input/titanic.csv')
                    

In [32]:
titanic.printSchema()

root
 |-- PassengerID: short (nullable = true)
 |-- Survived: short (nullable = true)
 |-- Pclass: short (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: float (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: float (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [33]:
titanic.show()

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerID|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
|          6|       0|     3|    Moran, Mr. James|  male|null|    0|    0|      

In [34]:
# passengers who embarked, by gender and class

from pyspark.sql.functions import col, expr

passengers = titanic.groupBy(expr('Sex as Gender'),expr('Pclass as Class'))\
                    .agg(expr('count(1) as pass_num'))

In [35]:
passengers.show()

+------+-----+--------+
|Gender|Class|pass_num|
+------+-----+--------+
|  male|    3|     347|
|female|    3|     144|
|female|    1|      94|
|female|    2|      76|
|  male|    2|     108|
|  male|    1|     122|
+------+-----+--------+



In [36]:
# survivors, by gender and class

survived = titanic.filter('Survived == 1').groupBy('Sex','Pclass')\
            .agg(expr('avg(Age)'),expr('count(1) as surv_num'))


In [37]:
survived.show()

+------+------+------------------+--------+
|   Sex|Pclass|          avg(Age)|surv_num|
+------+------+------------------+--------+
|  male|     3| 22.27421052597071|      47|
|female|     3|19.329787234042552|      72|
|female|     1|  34.9390243902439|      91|
|female|     2|28.080882352941178|      70|
|  male|     2| 16.02199999888738|      17|
|  male|     1|36.248000000417235|      45|
+------+------+------------------+--------+



In [38]:
# Inner join between passengers and survivors on the grouping key
# drop the duplicated key columns
# create the survival-percentage column

survivedPassengers = passengers.join(survived,
             (survived.Sex == passengers.Gender) & (survived.Pclass == passengers.Class),
              'inner')\
        .drop('Sex','Pclass')\
        .withColumn('surv_perc',100.0 * col('surv_num') / col('pass_num'))

In [39]:
survivedPassengers.show()

+------+-----+--------+------------------+--------+------------------+
|Gender|Class|pass_num|          avg(Age)|surv_num|         surv_perc|
+------+-----+--------+------------------+--------+------------------+
|  male|    3|     347| 22.27421052597071|      47|13.544668587896254|
|female|    3|     144|19.329787234042552|      72|              50.0|
|female|    1|      94|  34.9390243902439|      91| 96.80851063829788|
|female|    2|      76|28.080882352941178|      70| 92.10526315789474|
|  male|    2|     108| 16.02199999888738|      17| 15.74074074074074|
|  male|    1|     122|36.248000000417235|      45|36.885245901639344|
+------+-----+--------+------------------+--------+------------------+
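
As a quick arithmetic check of the `surv_perc` column, the percentages can be recomputed in plain Python from the `pass_num` and `surv_num` values printed in the tables above. This is only a sketch: the dictionaries are hand-copied from the output, and the names `pass_num`, `surv_num` and `surv_perc` simply mirror the column names.

```python
# Counts copied by hand from the passengers and survived tables above
pass_num = {("male", 3): 347, ("female", 3): 144, ("female", 1): 94,
            ("female", 2): 76, ("male", 2): 108, ("male", 1): 122}
surv_num = {("male", 3): 47, ("female", 3): 72, ("female", 1): 91,
            ("female", 2): 70, ("male", 2): 17, ("male", 1): 45}

# Same formula as the withColumn call: 100.0 * surv_num / pass_num
surv_perc = {key: 100.0 * surv_num[key] / pass_num[key] for key in pass_num}

for key in sorted(surv_perc):
    print(key, round(surv_perc[key], 2))
```

Every (gender, class) group has at least one survivor, which is why the inner join keeps all six groups.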



In [40]:
# Save the result in tab-separated format as survived.tsv on HDFS
# (Spark writes the output as a directory of part files at this path)

survivedPassengers.write\
                  .format('csv')\
                  .mode('overwrite')\
                  .option('sep','\t')\
                  .save('hdfs://localhost:9099/user/pirrone/spark/output/survived.tsv')

In [13]:
# Using the SQL API
# Create the temporary views

titanic.createOrReplaceTempView('titanicTable')

In [41]:
passSQL = spark.sql("""
SELECT Sex AS Gender, Pclass AS Class, Count(1) as Pass_num FROM titanicTable 
GROUP BY Sex, PClass
""")
passSQL.createOrReplaceTempView('passTable')

In [42]:
passSQL.show()

+------+-----+--------+
|Gender|Class|Pass_num|
+------+-----+--------+
|  male|    3|     347|
|female|    3|     144|
|female|    1|      94|
|female|    2|      76|
|  male|    2|     108|
|  male|    1|     122|
+------+-----+--------+



In [43]:
survSQL = spark.sql("""
SELECT Sex, Pclass, AVG(Age) as Avg_age, Count(1) as Surv_num FROM titanicTable
WHERE Survived = 1
GROUP BY Sex, PClass
""")
survSQL.createOrReplaceTempView('survTable')

In [44]:
survSQL.show()

+------+------+------------------+--------+
|   Sex|Pclass|           Avg_age|Surv_num|
+------+------+------------------+--------+
|  male|     3| 22.27421052597071|      47|
|female|     3|19.329787234042552|      72|
|female|     1|  34.9390243902439|      91|
|female|     2|28.080882352941178|      70|
|  male|     2| 16.02199999888738|      17|
|  male|     1|36.248000000417235|      45|
+------+------+------------------+--------+



In [45]:
survPercSQL = spark.sql("""
SELECT passTable.Gender, passTable.Class, passTable.Pass_num, 
survTable.Avg_age, survTable.Surv_num, 
100.0 * survTable.Surv_num / passTable.Pass_num AS Surv_perc
FROM passTable
INNER JOIN survTable 
ON (passTable.Gender = survTable.Sex AND passTable.Class = survTable.PClass)
""")

In [46]:
survPercSQL.show()

+------+-----+--------+------------------+--------+-----------------+
|Gender|Class|Pass_num|           Avg_age|Surv_num|        Surv_perc|
+------+-----+--------+------------------+--------+-----------------+
|  male|    3|     347| 22.27421052597071|      47|13.54466858789625|
|female|    3|     144|19.329787234042552|      72|50.00000000000000|
|female|    1|      94|  34.9390243902439|      91|96.80851063829787|
|female|    2|      76|28.080882352941178|      70|92.10526315789474|
|  male|    2|     108| 16.02199999888738|      17|15.74074074074074|
|  male|    1|     122|36.248000000417235|      45|36.88524590163934|
+------+-----+--------+------------------+--------+-----------------+



In [47]:
# Using the RDD API
# Extract the RDD from the titanic DataFrame

titanicRDD = titanic.rdd

In [48]:
# titanicRDD is an RDD of Row objects, which behave like named tuples
# show the first 20 objects

titanicRDD.take(20)

[Row(PassengerID=1, Survived=0, Pclass=3, Name='Braund, Mr. Owen Harris', Sex='male', Age=22.0, SibSp=1, Parch=0, Ticket='A/5 21171', Fare=7.25, Cabin=None, Embarked='S'),
 Row(PassengerID=2, Survived=1, Pclass=1, Name='Cumings, Mrs. John Bradley (Florence Briggs Thayer)', Sex='female', Age=38.0, SibSp=1, Parch=0, Ticket='PC 17599', Fare=71.2833023071289, Cabin='C85', Embarked='C'),
 Row(PassengerID=3, Survived=1, Pclass=3, Name='Heikkinen, Miss. Laina', Sex='female', Age=26.0, SibSp=0, Parch=0, Ticket='STON/O2. 3101282', Fare=7.925000190734863, Cabin=None, Embarked='S'),
 Row(PassengerID=4, Survived=1, Pclass=1, Name='Futrelle, Mrs. Jacques Heath (Lily May Peel)', Sex='female', Age=35.0, SibSp=1, Parch=0, Ticket='113803', Fare=53.099998474121094, Cabin='C123', Embarked='S'),
 Row(PassengerID=5, Survived=0, Pclass=3, Name='Allen, Mr. William Henry', Sex='male', Age=35.0, SibSp=0, Parch=0, Ticket='373450', Fare=8.050000190734863, Cabin=None, Embarked='S'),
 Row(PassengerID=6, Survived=0

In [49]:
# map the titanic RDD to keep only the Sex, Pclass and Age columns
# and build the key (Sex, Pclass)

keyPass = titanicRDD\
            .keyBy(lambda row: (row.Sex, row.Pclass))\
            .map(lambda row: (row[0], row[1].Age))


In [50]:
keyPass.take(5)

[(('male', 3), 22.0),
 (('female', 1), 38.0),
 (('female', 3), 26.0),
 (('female', 1), 35.0),
 (('male', 3), 35.0)]
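
The effect of `keyBy` followed by `map` can be mimicked on an ordinary Python list, which may make the shape of the resulting pairs easier to see. This is a sketch without Spark: the `Passenger` named tuple, the `key_by` helper and the three sample rows are stand-ins for the `Row` objects of `titanicRDD`.

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, limited to the three fields used here
Passenger = namedtuple("Passenger", ["Sex", "Pclass", "Age"])

rows = [
    Passenger("male", 3, 22.0),
    Passenger("female", 1, 38.0),
    Passenger("female", 3, 26.0),
]

def key_by(records, key_func):
    """Pair every record with its key, like RDD.keyBy: record -> (key, record)."""
    return [(key_func(record), record) for record in records]

# Step 1: (key, whole record); step 2: keep only the Age as the value
keyed = key_by(rows, lambda r: (r.Sex, r.Pclass))
key_age = [(key, record.Age) for key, record in keyed]

print(key_age)
```

A single `map` that returns `((row.Sex, row.Pclass), row.Age)` directly would produce the same pairs.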

In [51]:
# Count by key to obtain the number of embarked passengers per gender and class

embarkedPass = keyPass.countByKey()

embarkedPass

defaultdict(int,
            {('male', 3): 347,
             ('female', 1): 94,
             ('female', 3): 144,
             ('male', 1): 122,
             ('female', 2): 76,
             ('male', 2): 108})
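
`countByKey` is an action: it returns an ordinary Python dictionary to the driver rather than a new RDD. Its counting logic can be sketched in plain Python as follows, where `count_by_key` is a hypothetical helper and the sample pairs are illustrative. Note that a pair is counted even when its value is `None`, so passengers with a missing age are still included.

```python
from collections import defaultdict

def count_by_key(pairs):
    """Count how many (key, value) pairs share each key; values are ignored."""
    counts = defaultdict(int)
    for key, _value in pairs:
        counts[key] += 1
    return counts

# Illustrative pairs, including one with a missing age
sample = [
    (("male", 3), 22.0),
    (("female", 1), 38.0),
    (("male", 3), None),
]

print(dict(count_by_key(sample)))
```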

In [52]:
# Same processing for the survivors, but
# first filter on Survived == 1

keySurv = titanicRDD\
            .filter(lambda row: row.Survived == 1)\
            .keyBy(lambda row: (row.Sex, row.Pclass))\
            .map(lambda row: (row[0], row[1].Age))


In [53]:
keySurv.take(10)

[(('female', 1), 38.0),
 (('female', 3), 26.0),
 (('female', 1), 35.0),
 (('female', 3), 27.0),
 (('female', 2), 14.0),
 (('female', 3), 4.0),
 (('female', 1), 58.0),
 (('female', 2), 55.0),
 (('male', 2), None),
 (('female', 3), None)]

In [54]:
survPass = keySurv.countByKey()

survPass

defaultdict(int,
            {('female', 1): 91,
             ('female', 3): 72,
             ('female', 2): 70,
             ('male', 2): 17,
             ('male', 1): 45,
             ('male', 3): 47})

In [55]:
# Reduce by key to sum the ages,
# using a sum function that skips null values,
# then apply the following map:
# (<key>, <age_sum>) -->
# (<key>, <num_survivors>,
#         100 * <num_survivors> / <num_embarked>,
#         <age_sum> / <num_survivors>)
# Note: the last field divides by all survivors, including those with a null Age,
# so it is lower than the avg(Age) computed earlier, which ignores nulls

def addNoNull(x, y):
    if x is None:
        return y
    elif y is None:
        return x
    else:
        return x + y

keySurv.reduceByKey(addNoNull)\
        .map(lambda row: (row[0], 
                          survPass[row[0]], 
                          100.0*survPass[row[0]]/embarkedPass[row[0]], 
                          row[1]/survPass[row[0]]))\
        .collect()

[(('female', 1), 91, 96.80851063829788, 31.483516483516482),
 (('female', 3), 72, 50.0, 12.618055555555555),
 (('female', 2), 70, 92.10526315789474, 27.27857142857143),
 (('male', 2), 17, 15.74074074074074, 14.137058822547688),
 (('male', 1), 45, 36.885245901639344, 32.22044444481532),
 (('male', 3), 47, 13.544668587896254, 18.008936169933765)]
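
The average age in the last field differs from the `avg(Age)` column of the DataFrame and SQL results (for example 31.48 versus 34.94 for `('female', 1)`), because the sum of the non-null ages is divided by the count of all survivors. One way to reproduce the null-ignoring average is to carry a `(sum, count)` pair per key and count only the non-null ages. The plain-Python sketch below illustrates the idea; `to_sum_count`, `add_pairs` and `average_by_key` are hypothetical helpers and the sample pairs are illustrative. In PySpark the first two functions could presumably be passed to `mapValues` and `reduceByKey` respectively.

```python
from collections import defaultdict

def to_sum_count(age):
    """Turn an age into a (sum, count) pair; a missing age contributes nothing."""
    return (0.0, 0) if age is None else (age, 1)

def add_pairs(a, b):
    """Component-wise sum of two (sum, count) pairs."""
    return (a[0] + b[0], a[1] + b[1])

def average_by_key(pairs):
    """Average the non-null values per key; None if a key has no non-null value."""
    acc = defaultdict(lambda: (0.0, 0))
    for key, age in pairs:
        acc[key] = add_pairs(acc[key], to_sum_count(age))
    return {key: (total / n if n else None) for key, (total, n) in acc.items()}

# Illustrative pairs in the same shape as keySurv
sample = [
    (("female", 1), 38.0),
    (("female", 1), 35.0),
    (("female", 1), None),
    (("male", 2), None),
]

print(average_by_key(sample))
```

With this sample, the `('female', 1)` average is computed over two ages instead of three pairs, and a key with only missing ages yields `None` instead of raising an error.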