### Assignment 3 : Working with Spark SQL

#### Concepts :

* Creating DataFrames from CSV input data
* Performing basic data analysis using Spark SQL
* Saving a DataFrame as partitioned Parquet files

#### Assignment Dataset :

* Air flight data: a subset of ~100 MB (for demonstration purposes)
* Available on the IE cluster at: `/data/shared/spark/flight_data/csv_tiny`

#### Compulsory Part :

1. Create a DataFrame by loading the input CSV dataset.
2. Report the names of the columns and the number of rows in the dataset.
3. Create an in-memory DataFrame and a permanently stored table from this DataFrame.
4. Report the top 10 airports with the most departures in the dataset, using both the DataFrame API and a direct SQL query.
5. Save a subset DataFrame containing only the carrier, airport and departure delay columns as Parquet, partitioning the output by carrier and airport.

#### Bonus Part :

1. Which flight has the longest departure delay?
2. Report the top 5 carriers (column carrier) with the smallest average departure delay across all airports. Consider a flight delayed when depdelay > 0 min.
3. Which destinations are most likely to get delays from JFK?

### Note : 

User Defined Functions (UDFs) are *not* strictly required for this assignment.

In [9]:
# findspark adds the cluster's Spark installation to sys.path so that pyspark can be imported
import findspark
findspark.init()
import pyspark

In [10]:
# LOCAL VM DATASET PATH
#import os 
# dataset_path=os.environ['HOME']+"/spark-course/Notebooks/Lab2/"
# CLUSTER DATASET PATH
dataset_path="/data/shared/spark/flight_data/csv_tiny/"

In [11]:
# Create a SparkSession and specify configuration
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("Assignment3Solution") \
    .getOrCreate()

### Compulsory Part

In [12]:
# 1. Create a DataFrame by loading the input CSV dataset.
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("file://"+dataset_path+"*.csv")
    
# Note : caching the dataframe might be a good idea for repeating actions on it
df.cache()

DataFrame[Year: int, Quarter: int, Month: int, DayofMonth: int, DayOfWeek: int, FlightDate: timestamp, UniqueCarrier: string, AirlineID: int, Carrier: string, TailNum: string, FlightNum: int, OriginAirportID: int, OriginAirportSeqID: int, OriginCityMarketID: int, Origin: string, OriginCityName: string, OriginState: string, OriginStateFips: int, OriginStateName: string, OriginWac: int, DestAirportID: int, DestAirportSeqID: int, DestCityMarketID: int, Dest: string, DestCityName: string, DestState: string, DestStateFips: int, DestStateName: string, DestWac: int, CRSDepTime: int, DepTime: int, DepDelay: double, DepDelayMinutes: double, DepDel15: double, DepartureDelayGroups: int, DepTimeBlk: string, TaxiOut: double, WheelsOff: int, WheelsOn: int, TaxiIn: double, CRSArrTime: int, ArrTime: int, ArrDelay: double, ArrDelayMinutes: double, ArrDel15: double, ArrivalDelayGroups: int, ArrTimeBlk: string, Cancelled: double, CancellationCode: string, Diverted: double, CRSElapsedTime: double, ActualE

In [5]:
# 2. Report the names of the columns and the number of rows in the dataset.
df.columns

['Year',
 'Quarter',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'FlightDate',
 'UniqueCarrier',
 'AirlineID',
 'Carrier',
 'TailNum',
 'FlightNum',
 'OriginAirportID',
 'OriginAirportSeqID',
 'OriginCityMarketID',
 'Origin',
 'OriginCityName',
 'OriginState',
 'OriginStateFips',
 'OriginStateName',
 'OriginWac',
 'DestAirportID',
 'DestAirportSeqID',
 'DestCityMarketID',
 'Dest',
 'DestCityName',
 'DestState',
 'DestStateFips',
 'DestStateName',
 'DestWac',
 'CRSDepTime',
 'DepTime',
 'DepDelay',
 'DepDelayMinutes',
 'DepDel15',
 'DepartureDelayGroups',
 'DepTimeBlk',
 'TaxiOut',
 'WheelsOff',
 'WheelsOn',
 'TaxiIn',
 'CRSArrTime',
 'ArrTime',
 'ArrDelay',
 'ArrDelayMinutes',
 'ArrDel15',
 'ArrivalDelayGroups',
 'ArrTimeBlk',
 'Cancelled',
 'CancellationCode',
 'Diverted',
 'CRSElapsedTime',
 'ActualElapsedTime',
 'AirTime',
 'Flights',
 'Distance',
 'DistanceGroup',
 'CarrierDelay',
 'WeatherDelay',
 'NASDelay',
 'SecurityDelay',
 'LateAircraftDelay',
 'FirstDepTime',
 'TotalAddGTime',
 

In [6]:
print('The number of records in the dataset is : %d' % df.count())

The number of records in the dataset is : 469489


In [5]:
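# 3a. Register the DataFrame as an in-memory temporary view
# (createOrReplaceTempView replaces the deprecated registerTempTable)
df.createOrReplaceTempView("flights")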

In [8]:
# 3b. And a permanently stored table from this DataFrame.
df.write.saveAsTable("flights_table")

In [6]:
# 4. Report the top 10 airports with the most departures in the dataset, using both the DataFrame API and a direct SQL query.
# 4a. With the DataFrame API (note: this counts every row, cancelled flights included)
df.groupBy('origin').count().sort('count',ascending=False).show(10)

+------+-----+
|origin|count|
+------+-----+
|   ATL|30196|
|   ORD|24870|
|   DFW|23025|
|   DEN|18935|
|   LAX|17589|
|   SFO|13878|
|   IAH|13496|
|   PHX|12126|
|   LAS|11231|
|   SEA| 9316|
+------+-----+
only showing top 10 rows



In [7]:
# 4b. With SQL API
query='SELECT origin,COUNT(*) as count FROM flights GROUP BY origin ORDER BY count DESC LIMIT 10'
top_10=spark.sql(query)
top_10.show()

+------+-----+
|origin|count|
+------+-----+
|   ATL|30196|
|   ORD|24870|
|   DFW|23025|
|   DEN|18935|
|   LAX|17589|
|   SFO|13878|
|   IAH|13496|
|   PHX|12126|
|   LAS|11231|
|   SEA| 9316|
+------+-----+



In [28]:
# 5. Save a subset DataFrame that only contains :
# carrier , airport and departure delays , partitioning the output by carrier and airport into parquet format.
sub_df=df.select('origin','carrier','depdelay')

In [29]:
import os

my_home=os.environ['HOME']
out_dir="assignment3_out"
sub_df.write.partitionBy(
        "carrier","origin"
    ).parquet(
        "file://"
        + my_home
        +'/'
        + out_dir,
        mode='overwrite'
    )
print('Saved dataset subset!')

Saved dataset subset!


### Bonus Part

In [35]:
# 1. Which flight has the longest departure delay?
query='SELECT origin,dest,flightNum,carrier,FlightDate,depdelay FROM flights ORDER BY depdelay DESC LIMIT 1'
top_1_delayed=spark.sql(query)
top_1_delayed.show()

+------+----+---------+-------+-------------------+--------+
|origin|dest|flightNum|carrier|         FlightDate|depdelay|
+------+----+---------+-------+-------------------+--------+
|   AUS| JFK|      290|     AA|2014-09-09 00:00:00|  1727.0|
+------+----+---------+-------+-------------------+--------+



In [9]:
# 2. Report the top 5 carriers (column carrier)
# in terms of smallest average departure delay across all airports.
# Consider a flight delayed when depdelay > 0 min.

# TODO: recheck this. The query averages over all flights, including early
# departures (negative depdelay); it does not apply the depdelay > 0 threshold.
query = "SELECT carrier,AVG(depdelay) FROM flights GROUP BY carrier ORDER BY avg(depdelay) ASC LIMIT 5"

best_carriers=spark.sql(query)
best_carriers.show()

+-------+------------------+
|carrier|     avg(depdelay)|
+-------+------------------+
|     HA| 0.938279604730277|
|     AS|1.0673555589078292|
|     B6|1.6728524865944696|
|     US| 2.728345857799142|
|     FL| 4.245584874298774|
+-------+------------------+



In [10]:
# Alternative: average only over delayed flights (depdelay > 0), using the DataFrame API
df \
    .filter(df['depdelay']>0.0) \
    .groupBy('carrier') \
    .avg('depdelay') \
    .sort('avg(depdelay)',ascending=True) \
    .show(5)

+-------+------------------+
|carrier|     avg(depdelay)|
+-------+------------------+
|     FL|22.203951561504145|
|     WN|24.824700077173127|
|     DL|25.077322801912285|
|     US|26.357843137254903|
|     AS|26.533774834437086|
+-------+------------------+
only showing top 5 rows



In [48]:
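# 3. Which destinations are most likely to get delays from JFK?
# Note: this ranks destinations by average departure delay among delayed flights,
# which is a proxy for delay likelihood rather than a direct measure of it.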
from_JFK=df.filter(df['origin']=='JFK')
from_JFK.filter(from_JFK['depdelay'] > 0) \
    .groupBy('origin', 'dest') \
    .avg('depdelay') \
    .sort('avg(depdelay)',ascending=False) \
    .show(5)

+------+----+------------------+
|origin|dest|     avg(depdelay)|
+------+----+------------------+
|   JFK| IAD| 49.64705882352941|
|   JFK| IND|49.266666666666666|
|   JFK| SRQ|              48.0|
|   JFK| CVG|              46.0|
|   JFK| DTW|              45.0|
+------+----+------------------+
only showing top 5 rows

