### Lab 2: Working with Spark SQL

We will review:

1. Loading CSV files using SparkSession
2. Creating a DataFrame without inferring the schema
3. Creating a DataFrame with schema inference
4. Creating a DataFrame using the Databricks CSV library, which helps infer the schema
5. Writing partitioned Parquet files
6. Doing some preliminary analysis of this dataset

Dataset:

* Air flight data: a subset of a few GB (reference to be added)

Dataset path (on the cluster):

* /data/shared/spark/flight_data/csv_small


In [1]:
# First, let's start by:
# 1. Defining the SPARK_HOME variable
# 2. Using findspark to locate the Spark installation on the cluster

In [12]:
import os
os.environ['SPARK_HOME']="/usr/hdp/current/spark2-client"

In [3]:
import findspark
findspark.init()
import pyspark
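
If the path is wrong, `findspark.init()` or the `pyspark` import fails, so it can help to check `SPARK_HOME` first. The following is a minimal sketch, not part of the original lab: `resolve_spark_home` and `init_spark` are hypothetical helper names, and the default path is the one used in the cell above.

```python
import os

# Default taken from the cell above; adjust for other clusters.
DEFAULT_SPARK_HOME = "/usr/hdp/current/spark2-client"


def resolve_spark_home(default=DEFAULT_SPARK_HOME):
    """Return SPARK_HOME from the environment, falling back to `default`."""
    return os.environ.get("SPARK_HOME") or default


def init_spark(default=DEFAULT_SPARK_HOME):
    """Check that the Spark directory exists, then hand it to findspark."""
    spark_home = resolve_spark_home(default)
    if not os.path.isdir(spark_home):
        raise FileNotFoundError("SPARK_HOME does not exist: %s" % spark_home)
    import findspark  # imported lazily so the directory check runs first
    findspark.init(spark_home)
    return spark_home
```

Calling `init_spark()` once at the top of the notebook would replace the two setup cells above.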

In [4]:
# Create a SparkSession
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark-SQL-Lab2") \
    .getOrCreate()

In [16]:
print(spark)

<pyspark.sql.session.SparkSession object at 0x7f682c7ceeb8>


In [28]:
dataset_path="/data/shared/spark/flight_data/csv_small/"

In [34]:
# Read one of the data files into a DataFrame (no schema inference, so every column is a string)
df = spark.read \
    .option("header", "true") \
    .csv("file://"+dataset_path+"On_Time_On_Time_Performance_2014_9.csv")

In [36]:
df.printSchema()

root
 |-- Year: string (nullable = true)
 |-- Quarter: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- FlightDate: string (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- AirlineID: string (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- FlightNum: string (nullable = true)
 |-- OriginAirportID: string (nullable = true)
 |-- OriginAirportSeqID: string (nullable = true)
 |-- OriginCityMarketID: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- OriginCityName: string (nullable = true)
 |-- OriginState: string (nullable = true)
 |-- OriginStateFips: string (nullable = true)
 |-- OriginStateName: string (nullable = true)
 |-- OriginWac: string (nullable = true)
 |-- DestAirportID: string (nullable = true)
 |-- DestAirportSeqID: string (nullable = true)
 |-- DestCityMarketID: string (nullable = true)
 |-
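
Because no schema was inferred, every column in the output above is a string. One way to get usable types without re-reading the file is to cast selected columns afterwards. The sketch below assumes that the listed columns hold integers, which is a guess based on their names; `cast_exprs` and `with_int_columns` are hypothetical helpers.

```python
# Columns assumed to hold integers (an assumption based on the column names).
INT_COLUMNS = ["Year", "Quarter", "Month", "DayofMonth", "DayOfWeek"]


def cast_exprs(columns, int_columns=INT_COLUMNS):
    """Build one SQL expression per column, casting the integer ones."""
    wanted = set(int_columns)
    return [
        "CAST(`%s` AS INT) AS `%s`" % (name, name) if name in wanted
        else "`%s`" % name
        for name in columns
    ]


def with_int_columns(df, int_columns=INT_COLUMNS):
    """Return a DataFrame whose `int_columns` are cast from string to int."""
    return df.selectExpr(*cast_exprs(df.columns, int_columns))
```

For example, `with_int_columns(df).printSchema()` should then report `integer` for the five listed columns while leaving the others as strings.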

In [42]:
# Read the CSV data into a dictionary of DataFrames, inferring the schema directly from the data
import itertools
year_list = ['2014','2015']
month_list = ['1','2','3','4','5','6','7','8','9','10','11','12']

dict_df = {}
# Loop over every (year, month) pair. cache() is lazy: nothing is cached until an action runs.
for year_str, month_str in itertools.product(year_list, month_list):
    year_month_str = '%s_%s' % (year_str, month_str)
    print('Reading input data for year:%s month:%s' % (year_str, month_str))
    df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("file://" + dataset_path + "On_Time_On_Time_Performance_%s.csv" % year_month_str)
    df.cache()
    dict_df[year_month_str] = df
print('Done!')

Reading input data for year:2014 month:1
Reading input data for year:2014 month:2
Reading input data for year:2014 month:3
Reading input data for year:2014 month:4
Reading input data for year:2014 month:5
Reading input data for year:2014 month:6
Reading input data for year:2014 month:7
Reading input data for year:2014 month:8
Reading input data for year:2014 month:9
Reading input data for year:2014 month:10
Reading input data for year:2014 month:11
Reading input data for year:2014 month:12
Reading input data for year:2015 month:1
Reading input data for year:2015 month:2
Reading input data for year:2015 month:3
Reading input data for year:2015 month:4
Reading input data for year:2015 month:5
Reading input data for year:2015 month:6
Reading input data for year:2015 month:7
Reading input data for year:2015 month:8
Reading input data for year:2015 month:9
Reading input data for year:2015 month:10
Reading input data for year:2015 month:11
Reading input data for year:2015 month:12
Done!
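
To measure how long a read or an action takes, a small wall-clock helper is enough. This is a sketch rather than part of the original lab, and `time_call` is a hypothetical name. Keep in mind that `cache()` does not load anything by itself, so the first action on a DataFrame (for example `count()`) pays the cost of reading and caching the data.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call `fn` and return a (result, elapsed_seconds) pair."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Usage with the dictionary built above (requires a running SparkSession):
#   n_rows, seconds = time_call(dict_df['2014_9'].count)
#   print('2014_9: %d rows in %.1f s' % (n_rows, seconds))
```

Running the same `count()` a second time should be noticeably faster, since the DataFrame is then served from the cache.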


In [None]:
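# Create a DataFrame, inferring the schema with the Databricks CSV library.
# In Spark 2.x this format name resolves to the built-in CSV reader.
# The "file://" prefix reads from the local file system, as in the cells above.
df_with_db = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("file://" + dataset_path + "*.csv")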
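
Schema inference requires an extra pass over the data, which adds up across many files. An alternative is to supply the schema explicitly. The sketch below builds a DDL schema string. It assumes a Spark version whose `DataFrameReader.schema` accepts DDL strings (2.3 or later, as far as I know). The column names come from the `printSchema()` output above, while the types are assumptions. The list is deliberately partial. A real schema would list every column in file order. `ddl_schema` and `read_with_schema` are hypothetical helpers.

```python
# Partial, illustrative schema: names from printSchema(), types assumed.
FLIGHT_COLUMNS = [
    ("Year", "INT"),
    ("Quarter", "INT"),
    ("Month", "INT"),
    ("DayofMonth", "INT"),
    ("DayOfWeek", "INT"),
    ("FlightDate", "STRING"),
    ("UniqueCarrier", "STRING"),
]


def ddl_schema(columns):
    """Join (name, type) pairs into a DDL string such as '`Year` INT, ...'."""
    return ", ".join("`%s` %s" % (name, sql_type) for name, sql_type in columns)


def read_with_schema(spark, path, columns):
    """Read CSV files at `path` with an explicit schema instead of inference."""
    return spark.read \
        .option("header", "true") \
        .schema(ddl_schema(columns)) \
        .csv(path)
```

With a complete column list, `read_with_schema(spark, "file://" + dataset_path + "*.csv", columns)` would load all files without the inference pass.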