#Flight Delays Data
In this section we’ll walk through a few examples of queries on the Airline On-Time
Performance and Causes of Flight Delays data set, which contains data on US flights
including date, delay, distance, origin, and destination. It’s available as a CSV file with
over a million records.
We'll run SQL from a Spark application through the spark.sql programmatic interface. Like the
DataFrame API, this interface is declarative and lets you query structured data from
within your application.

In [2]:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = (SparkSession
         .builder
         .appName("SparkSQLExampleApp")
         .getOrCreate())

In [3]:
# Path to data set
csv_file = "/databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

# Read and create a temporary view
# Infer schema (note that for larger files you
# may want to specify the schema)
df = (spark.read.format("csv")
      .option("inferSchema", "true")
      .option("header", "true")
      .load(csv_file))

df.createOrReplaceTempView("us_delay_flights_tbl")
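
The comment above suggests specifying the schema for larger files. A minimal sketch of that, assuming the five columns used by the queries in this section; the column types, the `FLIGHTS_SCHEMA` name, and the `read_flights` helper are assumptions, not part of the original notebook. Keeping `date` as a string is a guess intended to preserve any leading zeros.

```python
# Assumed column types; only the column names come from the queries in this section.
FLIGHTS_SCHEMA = "date STRING, delay INT, distance INT, origin STRING, destination STRING"


def read_flights(spark, path):
    """Read the flights CSV with an explicit schema instead of inferSchema."""
    return (spark.read.format("csv")
            .schema(FLIGHTS_SCHEMA)    # DDL-formatted string, so no inference pass is needed
            .option("header", "true")  # first row holds the column names
            .load(path))
```

With the session and path defined earlier, this would be called as `df = read_flights(spark, csv_file)`.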

In [4]:
df.show()

Now that we have a temporary view, we can issue SQL queries using Spark SQL.

#Query 1:
Find all flights whose distance between origin and destination is greater than 1,000 miles.

In [7]:
(spark.sql("""SELECT distance, origin, destination 
              FROM us_delay_flights_tbl 
              WHERE distance > 1000 
              ORDER BY distance DESC""")
.show(10, truncate=False))

In [8]:
"""Equivalent in DataFrame"""

# Import only the functions used here; a wildcard import would shadow
# Python built-ins such as sum, min, and max
from pyspark.sql.functions import col, desc

(df.select("distance", "origin", "destination")
 .where(col("distance") > 1000)
 .orderBy(desc("distance"))
 .show(10, truncate=False))

#Query 2:
Find all flights from San Francisco (SFO) to Chicago (ORD) that were delayed by more than two hours.

In [10]:
(spark.sql("""SELECT date, delay, origin, destination 
              FROM us_delay_flights_tbl 
              WHERE delay > 120 AND origin = 'SFO' AND destination = 'ORD'
              ORDER BY delay DESC""")
.show(10, truncate=False))
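
As with Query 1, the same result can be expressed with the DataFrame API. The sketch below passes the filter as a SQL expression string, so it needs no extra imports; the function name `sfo_ord_long_delays` and the `min_delay` parameter are hypothetical, and the delay is assumed to be in minutes.

```python
def sfo_ord_long_delays(df, min_delay=120):
    """Flights from SFO to ORD delayed more than min_delay minutes, longest first."""
    # where() accepts a SQL expression string, mirroring the WHERE clause above
    condition = f"delay > {int(min_delay)} AND origin = 'SFO' AND destination = 'ORD'"
    return (df.select("date", "delay", "origin", "destination")
            .where(condition)
            .orderBy("delay", ascending=False))
```

Calling `sfo_ord_long_delays(df).show(10, truncate=False)` should display the same rows as the SQL version.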

#Query 3:
For a more complicated SQL query, let's label every US flight, regardless of origin and destination, by the length of its delay: very long, long, short, tolerable, or no delay.

In [12]:
# CASE branches are checked in order and the first match wins, so each branch
# only needs a lower bound. This also labels boundary values (exactly 60, 120,
# or 360), which would otherwise fall through to ELSE.
spark.sql("""SELECT delay, origin, destination,
              CASE
                  WHEN delay > 360 THEN 'Very Long Delays'
                  WHEN delay > 120 THEN 'Long Delays'
                  WHEN delay > 60 THEN 'Short Delays'
                  WHEN delay > 0 THEN 'Tolerable Delays'
                  ELSE 'No Delays'
              END AS Flight_Delays
              FROM us_delay_flights_tbl
              ORDER BY origin, delay DESC""").show(10, truncate=False)
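
A CASE expression is evaluated top to bottom and returns the first branch that matches. A plain-Python sketch of contiguous delay buckets built on that rule can help when checking boundary values such as exactly 120 minutes. The function name `label_delay` is hypothetical, and three choices here are assumptions: a boundary value belongs to the lower bucket, negative delays (early departures) count as 'No Delays', and a missing delay is treated like the ELSE branch.

```python
def label_delay(delay):
    """Return a delay label; checks run top to bottom and the first match wins."""
    if delay is None:          # a SQL NULL matches no WHEN branch and falls to ELSE
        return "No Delays"
    if delay > 360:
        return "Very Long Delays"
    if delay > 120:            # reached only when delay <= 360
        return "Long Delays"
    if delay > 60:             # reached only when delay <= 120
        return "Short Delays"
    if delay > 0:              # reached only when delay <= 60
        return "Tolerable Delays"
    return "No Delays"         # zero, or negative (assumed to mean an early departure)


# Print the label for a few boundary values
for minutes in (-5, 0, 60, 120, 360, 361):
    print(minutes, label_delay(minutes))
```

Because each check is reached only when the ones above it failed, no branch needs an upper bound, and no value between the thresholds is left unlabeled.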