### Lab2 : Working with Spark SQL

#### We will review :

1. Loading CSV file formats using SparkSession
2. Creating DataFrame without inferring Schema 
3. Creating DataFrame inferring Schema 
4. Doing some preliminary analysis using Spark SQL on this dataset
5. Creating UDFs (User Defined Functions) and using them on the dataset
6. Saving a DataFrame as partitioned Parquet files

#### Small (Lab) Dataset :

* Air flight data - subset of ~ 100 MB (for demonstration purposes)
* Available in the IE cluster @: /data/shared/spark/flight_data/csv_tiny

#### Larger Dataset (Further Labs) :

* Air flight data - subset of ~ 2.5 GB (for cluster operation purposes)
* Available in the IE cluster @: /data/shared/spark/flight_data/csv_small


In [71]:
# First, let's start by:
# 1. Defining the SPARK_HOME variable
# 2. Using findspark to let us work with the Spark installation in the cluster

In [72]:
# import os
# os.environ['SPARK_HOME']= '/usr/hdp/current/spark2-client'
# os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'
# from jupyter_core.paths import jupyter_data_dir
# print(jupyter_data_dir())

In [73]:
import findspark
findspark.init()
import pyspark

In [74]:
# Update the Spark configuration to use the cluster
conf = pyspark.SparkConf()
conf.set('spark.master'         , 'yarn://172.31.34.139:7077')
conf.set('spark.executor.memory', '2g')
conf.set('spark.executor.cores' , '3')
conf.set('spark.cores.max'      , '3')
conf.set('spark.driver.memory'  , '2g')

<pyspark.conf.SparkConf at 0x7fc31180ac18>

In [75]:
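# Create a SparkSession
# Note: the SparkConf object defined above only takes effect if it is passed
# to the builder, e.g. with .config(conf=conf); it is not passed here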
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark-SQL-Lab2") \
    .getOrCreate()

In [76]:
dataset_path="/data/shared/spark/flight_data/csv_tiny/"

In [77]:
# Read in all available data files into a data frame
df = spark.read \
    .csv("file://"+dataset_path+"*.csv")   

### Now check the data schema

In [78]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- _c20: string (nullable = true)
 |-- _c21: string (nullable = true)
 |-- _c22: string (nullable = true)
 |-- _c23: string (nullable = true)
 |-- _c24: string (nullable = true)
 |-- _c25: string (nullable = true)
 |-- _c26: string (nullable = true)
 |-- _c27: string (nullable = tru

* OK, but the column names are not very telling.
* How to improve this? By telling Spark to use the header row (if one exists)

In [79]:
df = spark.read \
    .option("header", "true") \
    .csv("file://"+dataset_path+"*.csv")

In [80]:
df.printSchema()

root
 |-- Year: string (nullable = true)
 |-- Quarter: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- FlightDate: string (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- AirlineID: string (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- FlightNum: string (nullable = true)
 |-- OriginAirportID: string (nullable = true)
 |-- OriginAirportSeqID: string (nullable = true)
 |-- OriginCityMarketID: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- OriginCityName: string (nullable = true)
 |-- OriginState: string (nullable = true)
 |-- OriginStateFips: string (nullable = true)
 |-- OriginStateName: string (nullable = true)
 |-- OriginWac: string (nullable = true)
 |-- DestAirportID: string (nullable = true)
 |-- DestAirportSeqID: string (nullable = true)
 |-- DestCityMarketID: string (nullable = true)
 |-

* Better, but one caveat remains: all values are interpreted as strings, while some of them (actually most) are numeric, e.g. Year, Month, FlightNum
* How to improve this? Either tell Spark which schema to use, or tell it to infer the schema from the data
* Note: asking Spark to infer the schema requires an extra pass over the data, so it may have a performance impact that grows with the number of rows read
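
One way to limit the cost of inference, sketched here under the assumption that every CSV file in the folder has the same column layout, is to infer the schema from a single file and reuse it for the full read. The helper name `read_csv_with_sampled_schema` is hypothetical and not part of the lab; types inferred from one file may not hold for every other file.

```python
def read_csv_with_sampled_schema(spark, sample_path, full_path):
    """Infer the schema from one file, then reuse it to read all files."""
    # Inference scans only the sample file instead of every CSV file.
    sample_schema = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(sample_path)
        .schema
    )
    # With an explicit schema, no inference pass over the full data is needed.
    return (
        spark.read
        .option("header", "true")
        .schema(sample_schema)
        .csv(full_path)
    )

# Usage in the notebook (requires the SparkSession `spark` and `dataset_path`):
# df = read_csv_with_sampled_schema(
#     spark,
#     "file://" + dataset_path + "On_Time_On_Time_Performance_2014_9.csv",
#     "file://" + dataset_path + "*.csv",
# )
```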

In [81]:
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"*.csv")

In [82]:
df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Quarter: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- FlightDate: timestamp (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- AirlineID: integer (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- OriginAirportID: integer (nullable = true)
 |-- OriginAirportSeqID: integer (nullable = true)
 |-- OriginCityMarketID: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- OriginCityName: string (nullable = true)
 |-- OriginState: string (nullable = true)
 |-- OriginStateFips: integer (nullable = true)
 |-- OriginStateName: string (nullable = true)
 |-- OriginWac: integer (nullable = true)
 |-- DestAirportID: integer (nullable = true)
 |-- DestAirportSeqID: integer (nullable = true)
 |-- DestCityMarketID: integer (nu

### Now that we have a way to read data with a schema, let's read in some more data

In [83]:
# READ IN ONE MONTH WORTH OF DATA : ~ 200MB
# Note we use here * to read all the data from the specified folder
info = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("file://"+dataset_path+"*.csv") 

In [84]:
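# Register the DataFrame as a temporary view named "flights"
# (createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
info.createOrReplaceTempView("flights")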

#### Worth Noting

* registerTempTable() (deprecated since Spark 2.0 in favor of createOrReplaceTempView()) registers the DataFrame under a table name that is visible only within the SparkSession that created it. No data is copied or written: the name simply refers to the DataFrame's query plan, and it only 'lives' for the duration of the session.

* saveAsTable() creates a persistent, physical table, stored in Parquet format by default. The table is accessible between sessions and to any application that uses the same metastore. The table metadata, including the location of the file(s), is stored in the Hive metastore.
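
The difference between the two calls can be sketched as follows. The helper `register_and_persist` and the table name `flights_persistent` are hypothetical, and persisting a table assumes the session is allowed to write to its warehouse location.

```python
def register_and_persist(df, view_name, table_name=None):
    """Register a session-scoped view and optionally persist a table."""
    # Session-scoped name for the DataFrame; nothing is written to disk.
    df.createOrReplaceTempView(view_name)
    if table_name is not None:
        # Writes the data files and records the table in the metastore,
        # so the table survives after the session ends.
        df.write.mode("overwrite").saveAsTable(table_name)
    return view_name

# Usage in the notebook (requires the DataFrame `info`):
# register_and_persist(info, "flights")                        # temporary only
# register_and_persist(info, "flights", "flights_persistent")  # also persisted
```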

In [85]:
info.columns

['Year',
 'Quarter',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'FlightDate',
 'UniqueCarrier',
 'AirlineID',
 'Carrier',
 'TailNum',
 'FlightNum',
 'OriginAirportID',
 'OriginAirportSeqID',
 'OriginCityMarketID',
 'Origin',
 'OriginCityName',
 'OriginState',
 'OriginStateFips',
 'OriginStateName',
 'OriginWac',
 'DestAirportID',
 'DestAirportSeqID',
 'DestCityMarketID',
 'Dest',
 'DestCityName',
 'DestState',
 'DestStateFips',
 'DestStateName',
 'DestWac',
 'CRSDepTime',
 'DepTime',
 'DepDelay',
 'DepDelayMinutes',
 'DepDel15',
 'DepartureDelayGroups',
 'DepTimeBlk',
 'TaxiOut',
 'WheelsOff',
 'WheelsOn',
 'TaxiIn',
 'CRSArrTime',
 'ArrTime',
 'ArrDelay',
 'ArrDelayMinutes',
 'ArrDel15',
 'ArrivalDelayGroups',
 'ArrTimeBlk',
 'Cancelled',
 'CancellationCode',
 'Diverted',
 'CRSElapsedTime',
 'ActualElapsedTime',
 'AirTime',
 'Flights',
 'Distance',
 'DistanceGroup',
 'CarrierDelay',
 'WeatherDelay',
 'NASDelay',
 'SecurityDelay',
 'LateAircraftDelay',
 'FirstDepTime',
 'TotalAddGTime',
 

#### Select the following columns from the full dataset

    Year
    Month
    DayOfMonth
    DayOfWeek
    FlightNum
    Origin
    Carrier
    Dest ( destination )
    DepTime ( departure time )
    DepDelay ( departure delay )
    ArrTime ( arrival time )
    ArrDelay ( arrival delay )
    Cancelled
    CancellationCode
    AirTime
    Distance


In [86]:
flights=spark.sql(
    "select " 
    +"year,month,dayofmonth,dayofweek,"
    +"flightnum,origin,carrier,dest,deptime,depdelay,"
    +"arrtime,arrdelay,cancelled,cancellationcode,"
    +"airtime,distance "
    +"FROM flights"
    )
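# OR, using the DataFrame API (one column name per list element):
# selection=["year","month","dayofmonth","dayofweek",
#       "flightnum","origin","carrier","dest","deptime","depdelay",
#       "arrtime","arrdelay","cancelled","cancellationcode",
#       "airtime","distance"]
# flights=info.select(selection)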

In [87]:
# Cache this DataFrame
flights.cache()

DataFrame[year: int, month: int, dayofmonth: int, dayofweek: int, flightnum: int, origin: string, carrier: string, dest: string, deptime: int, depdelay: double, arrtime: int, arrdelay: double, cancelled: double, cancellationcode: string, airtime: double, distance: double]
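
Note that `cache()` is lazy: it only marks the DataFrame for caching, and the data is stored the first time an action runs. A minimal sketch of forcing that first evaluation, with a hypothetical helper name:

```python
def cache_and_materialize(df):
    """Mark a DataFrame for caching and run an action to populate the cache."""
    # cache() alone does not read any data.
    df.cache()
    # count() is an action, so it triggers the computation and fills the cache.
    return df.count()

# Usage in the notebook (requires the DataFrame `flights`):
# n_rows = cache_and_materialize(flights)
```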

In [88]:
# Show the first 5 rows of the subset data to get a feeling of what to expect
flights.head(5)

[Row(year=2014, month=9, dayofmonth=1, dayofweek=1, flightnum=1, origin='JFK', carrier='AA', dest='LAX', deptime=851, depdelay=-9.0, arrtime=1144, arrdelay=-26.0, cancelled=0.0, cancellationcode=None, airtime=325.0, distance=2475.0),
 Row(year=2014, month=9, dayofmonth=2, dayofweek=2, flightnum=1, origin='JFK', carrier='AA', dest='LAX', deptime=902, depdelay=2.0, arrtime=1210, arrdelay=0.0, cancelled=0.0, cancellationcode=None, airtime=312.0, distance=2475.0),
 Row(year=2014, month=9, dayofmonth=3, dayofweek=3, flightnum=1, origin='JFK', carrier='AA', dest='LAX', deptime=849, depdelay=-11.0, arrtime=1215, arrdelay=5.0, cancelled=0.0, cancellationcode=None, airtime=330.0, distance=2475.0),
 Row(year=2014, month=9, dayofmonth=4, dayofweek=4, flightnum=1, origin='JFK', carrier='AA', dest='LAX', deptime=852, depdelay=-8.0, arrtime=1133, arrdelay=-37.0, cancelled=0.0, cancellationcode=None, airtime=316.0, distance=2475.0),
 Row(year=2014, month=9, dayofmonth=5, dayofweek=5, flightnum=1, ori

### Do some SQL queries (use both the DataFrame API and direct SQL queries)

1. Find the number of departing flights from a given airport
2. Find the total number of delayed flights at a given airport (consider delayed = depdelay >= 5 min)
3. Find the average delay per flight at that airport
4. Find the top 5 airports with the highest average delays
5. Find general statistics on flight departure delay for a given airport
6. Find the top 5 airports in terms of flight cancellations


In [105]:
# how many records do we have in total?
total=flights.count()
print('Total nb.of flights: %d' % total)

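# 1. and 2.
def statsByAirport(airport_id,delay_time):
    # Flights departing from the given airport
    departing=flights.filter(flights['origin']==airport_id)
    # Of those, the flights delayed by at least delay_time minutes.
    # Chaining the filter avoids intersect(), which shuffles both DataFrames
    # and also removes duplicate rows.
    departDelay=departing.filter(departing['depdelay']>=delay_time)
    ndep=departing.count()
    ndel=departDelay.count()
    return (ndep,ndel)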
    
airport='JFK'
delay=5.0
n,m=statsByAirport(airport,delay)

print('Departing from %s : %d ' %(airport,n))
print('Delayed   from %s : %d ' %(airport,m))
print('Delayed Percentage : %f ' %((m/n)*100))

Total nb.of flights: 469489
Departing from JFK : 8275 
Delayed   from JFK : 1693 
Delayed Percentage : 20.459215 


In [104]:
# We could repeat this process for all available airports and get
# an overview of the 'best and worst'
# How many airports do we have?
flights.select('origin').distinct().count()

312
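
Tasks 3 to 6 from the list above can be sketched as direct SQL queries against the registered `flights` table. The helper names are hypothetical, and the queries assume that `cancelled` is 1 for a cancelled flight and 0 otherwise, as the sample rows suggest.

```python
def _airport_literal(airport):
    # Guard against malformed input before embedding it in a SQL string.
    if not airport.isalnum():
        raise ValueError("unexpected airport code: %r" % airport)
    return "'%s'" % airport

def avg_delay_sql(airport):
    # 3. Average departure delay per flight at one airport.
    return ("SELECT AVG(depdelay) AS avg_delay FROM flights "
            "WHERE origin = %s" % _airport_literal(airport))

def top_avg_delay_sql(limit=5):
    # 4. Airports with the highest average departure delay.
    return ("SELECT origin, AVG(depdelay) AS avg_delay FROM flights "
            "GROUP BY origin ORDER BY avg_delay DESC LIMIT %d" % limit)

def delay_stats_sql(airport):
    # 5. General statistics on departure delay at one airport.
    return ("SELECT COUNT(depdelay) AS n_flights, "
            "AVG(depdelay) AS mean_delay, STDDEV(depdelay) AS stddev_delay, "
            "MIN(depdelay) AS min_delay, MAX(depdelay) AS max_delay "
            "FROM flights WHERE origin = %s" % _airport_literal(airport))

def top_cancellations_sql(limit=5):
    # 6. Airports with the most cancelled departures.
    return ("SELECT origin, SUM(cancelled) AS n_cancelled FROM flights "
            "GROUP BY origin ORDER BY n_cancelled DESC LIMIT %d" % limit)

# Usage in the notebook (requires the SparkSession `spark`):
# spark.sql(avg_delay_sql("JFK")).show()
# spark.sql(top_avg_delay_sql()).show()
# spark.sql(delay_stats_sql("JFK")).show()
# spark.sql(top_cancellations_sql()).show()
```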

### Create a UDF (User Defined Function) and use it
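
A minimal sketch of a UDF on this dataset: a plain Python function that labels a departure delay, wrapped with `udf()` so it can be applied to a column. The function name, the category labels, and the 60-minute threshold are assumptions made for illustration; the 5-minute threshold follows the definition of a delayed flight used earlier.

```python
def delay_category(minutes):
    """Label a departure delay given in minutes."""
    # depdelay is null for flights without a departure time.
    if minutes is None:
        return None
    if minutes < 5:
        return "on_time"
    if minutes < 60:
        return "delayed"
    return "very_delayed"

def add_delay_category(df, column="depdelay"):
    """Return df with an extra string column 'delay_category'."""
    # Imported here so that delay_category can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    delay_category_udf = udf(delay_category, StringType())
    return df.withColumn("delay_category", delay_category_udf(df[column]))

# Usage in the notebook (requires the DataFrame `flights`):
# add_delay_category(flights).groupBy("delay_category").count().show()
```

Python UDFs process rows in the Python worker, so they are usually slower than built-in column functions; they are best kept for logic that the built-ins cannot express.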

* Save this DataFrame in Parquet (columnar) format to speed up later loading
* To do this efficiently, partition the data by specific attributes, in this case Year and Month

In [None]:
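# THIS MAY RUN OUT OF MEMORY: save the data into my HOME directory
import os

# Assumption: df_small is not defined earlier in this notebook; the cached
# `flights` subset is used here as a stand-in.
df_small = flights
my_home=os.environ['HOME']
out_dir="airline_data"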
df_small.write.partitionBy(
        "Year","Month"
    ).parquet(
        "file://"
        + my_home
        +'/'
        + out_dir,
        mode='overwrite'
    )
print('Done!')
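
Writing with `partitionBy` produces one directory per partition value, named `column=value`. Reading a single partition back can be sketched as below; the helper names are hypothetical, and the sketch assumes the directory names use `Year` and `Month` exactly as passed to `partitionBy`. The `basePath` option keeps the partition columns in the resulting DataFrame.

```python
def partition_path(base_path, year, month, year_col="Year", month_col="Month"):
    """Build the directory path of one Year/Month partition."""
    return "%s/%s=%s/%s=%s" % (
        base_path.rstrip("/"), year_col, year, month_col, month)

def read_partition(spark, base_path, year, month):
    """Read only the files of one partition of the Parquet dataset."""
    return (
        spark.read
        .option("basePath", base_path)
        .parquet(partition_path(base_path, year, month))
    )

# Usage in the notebook (requires `spark`, `my_home` and `out_dir`):
# sept = read_partition(spark, "file://" + my_home + "/" + out_dir, 2014, 9)
```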

In [42]:
# Read each monthly CSV file into a dictionary of DataFrames, inferring the schema directly from the data
import itertools
year_list = ['2014']
month_list = ['1','2','3','4','5','6','7','8','9','10','11','12']

dict_df = {}

for (year_str,month_str) in list(itertools.product(year_list,month_list)):
    year_month_str = '%s_%s'%(year_str,month_str)
    print('Reading input data for year:%s month:%s'%(year_str,month_str))
    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("file://"+dataset_path+"On_Time_On_Time_Performance_%s.csv"%(year_month_str))  
    df.cache()
    dict_df[year_month_str]=df
print('Done!')

Reading input data for year:2014 month:1
Reading input data for year:2014 month:2
Reading input data for year:2014 month:3
Reading input data for year:2014 month:4
Reading input data for year:2014 month:5
Reading input data for year:2014 month:6
Reading input data for year:2014 month:7
Reading input data for year:2014 month:8
Reading input data for year:2014 month:9
Reading input data for year:2014 month:10
Reading input data for year:2014 month:11
Reading input data for year:2014 month:12
Reading input data for year:2015 month:1
Reading input data for year:2015 month:2
Reading input data for year:2015 month:3
Reading input data for year:2015 month:4
Reading input data for year:2015 month:5
Reading input data for year:2015 month:6
Reading input data for year:2015 month:7
Reading input data for year:2015 month:8
Reading input data for year:2015 month:9
Reading input data for year:2015 month:10
Reading input data for year:2015 month:11
Reading input data for year:2015 month:12
Done!
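
If a single DataFrame is needed afterwards, the monthly DataFrames in the dictionary can be combined with `union`. This is a sketch with a hypothetical helper name; `union` matches columns by position, so it assumes every monthly file was read with the same columns in the same order.

```python
from functools import reduce

def union_all(dataframes):
    """Combine several DataFrames with identical columns into one."""
    dataframes = list(dataframes)
    if not dataframes:
        raise ValueError("no DataFrames to combine")
    # Fold the list pairwise: ((df1 union df2) union df3) ...
    return reduce(lambda left, right: left.union(right), dataframes)

# Usage in the notebook (requires the dictionary `dict_df`):
# all_flights = union_all(dict_df.values())
```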


In [14]:
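# Create DataFrame : inferring the schema using the Databricks CSV format name
# (since Spark 2.0 the CSV reader is built in, so this name maps to the same reader;
# with an explicit format(), load() is used instead of csv())
df_with_dbricks = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("file://"+dataset_path+"On_Time_On_Time_Performance_2014_9.csv")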

In [15]:
df_with_dbricks.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Quarter: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- FlightDate: timestamp (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- AirlineID: integer (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- OriginAirportID: integer (nullable = true)
 |-- OriginAirportSeqID: integer (nullable = true)
 |-- OriginCityMarketID: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- OriginCityName: string (nullable = true)
 |-- OriginState: string (nullable = true)
 |-- OriginStateFips: integer (nullable = true)
 |-- OriginStateName: string (nullable = true)
 |-- OriginWac: integer (nullable = true)
 |-- DestAirportID: integer (nullable = true)
 |-- DestAirportSeqID: integer (nullable = true)
 |-- DestCityMarketID: integer (nu