## Basic Query Examples

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Create a SparkSession
spark = (SparkSession
    .builder
    .appName("SparkSQLExampleApp")
    .getOrCreate())

In [3]:
# Path to data set
csv_file = "../../../databricks-datasets/learning-spark-v2/flights/departuredelays.csv"

# Read and create a temporary view
# Infer schema (note that for larger files you may want to specify the schema)
df = (spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load(csv_file))
df.createOrReplaceTempView("us_delay_flights_tbl")
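The comment in the cell above suggests specifying the schema for larger files. A minimal sketch of what that might look like, assuming the file's five columns are date, delay, distance, origin, and destination. The helper name `read_delays` and the column types are illustrative assumptions, not part of the original notebook:

```python
# Hypothetical explicit schema as a DDL string (the types are assumptions).
# Reading `date` as STRING keeps any leading zero in values like 02190925.
DELAYS_SCHEMA = "date STRING, delay INT, distance INT, origin STRING, destination STRING"


def read_delays(spark, path):
    """Read the flight-delays CSV with a fixed schema instead of inferring it."""
    return spark.read.csv(path, schema=DELAYS_SCHEMA, header=True)
```

With an existing session this could be called as `df = read_delays(spark, csv_file)`. Supplying the schema lets Spark skip the extra pass over the data that inference requires.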

Now that we have a temporary view, we can issue SQL queries using Spark SQL. These queries are no different from those you might issue against a SQL table in, say, a MySQL or PostgreSQL database. The point here is to show that Spark SQL offers an ANSI:2003–compliant SQL interface, and to demonstrate the interoperability between SQL and DataFrames.

The US flight delays data set has five columns:

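- The date column contains a string like 02190925. When converted, this maps to 02-19 09:25 am. Because the schema is inferred here, Spark reads this column as an integer, so the leading zero is dropped and the value appears as 2190925 in query results.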
- The delay column gives the delay in minutes between the scheduled and actual departure times. Early departures show negative numbers.
- The distance column gives the distance in miles from the origin airport to the destination airport.
- The origin column contains the origin IATA airport code.
- The destination column contains the destination IATA airport code.
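The date conversion described in the list above can be sketched in plain Python. This assumes the value is laid out as month, day, hour, minute (two digits each), as the 02190925 example implies; the function name `format_flight_date` is hypothetical:

```python
def format_flight_date(raw):
    """Turn an MMddHHmm value such as '02190925' into '02-19 09:25 am'."""
    # If the column was read as an integer, restore the dropped leading zero.
    s = str(raw).zfill(8)
    month, day, hour, minute = s[0:2], s[2:4], int(s[4:6]), s[6:8]
    suffix = "am" if hour < 12 else "pm"
    hour12 = hour % 12 or 12  # hour 0 becomes 12 am, hour 13 becomes 1 pm
    return f"{month}-{day} {hour12:02d}:{minute} {suffix}"


print(format_flight_date("02190925"))  # 02-19 09:25 am
print(format_flight_date(1031755))     # 01-03 05:55 pm
```

The data carries no year, so the sketch formats the string directly rather than building a full `datetime`.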

With that in mind, let’s try some example queries against this data set. First, we’ll find all flights whose distance is greater than 1,000 miles:

In [4]:
sqlQuery = """
select distance, origin, destination
from us_delay_flights_tbl
where distance > 1000
order by distance desc
"""

spark.sql(sqlQuery).show(10)

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
+--------+------+-----------+
only showing top 10 rows



As the results show, all of the longest flights were from Honolulu (HNL) to New York (JFK). Next, we’ll find all flights from San Francisco (SFO) to Chicago (ORD) that were delayed by more than two hours:

In [5]:
sqlQuery = """
select date, delay, origin, destination
from us_delay_flights_tbl
where delay > 120
    and origin = 'SFO'
    and destination = 'ORD'
order by delay desc
"""

spark.sql(sqlQuery).show(10)

+-------+-----+------+-----------+
|   date|delay|origin|destination|
+-------+-----+------+-----------+
|2190925| 1638|   SFO|        ORD|
|1031755|  396|   SFO|        ORD|
|1022330|  326|   SFO|        ORD|
|1051205|  320|   SFO|        ORD|
|1190925|  297|   SFO|        ORD|
|2171115|  296|   SFO|        ORD|
|1071040|  279|   SFO|        ORD|
|1051550|  274|   SFO|        ORD|
|3120730|  266|   SFO|        ORD|
|1261104|  258|   SFO|        ORD|
+-------+-----+------+-----------+
only showing top 10 rows



It seems there were many significantly delayed flights between these two cities, on different dates. (As an exercise, convert the `date` column into a readable format and find the days or months when these delays were most common. Were the delays related to winter months or holidays?)
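The SFO-to-ORD query above can also be written with the DataFrame API. The following is a sketch rather than part of the original notebook: the function name `long_sfo_ord_delays` is hypothetical, and `df` is assumed to be the DataFrame loaded from the CSV file earlier.

```python
def long_sfo_ord_delays(df):
    """DataFrame API version of the SFO -> ORD query (delay > 120 minutes)."""
    return (df.select("date", "delay", "origin", "destination")
        # where() accepts a SQL condition string, so the filter can be reused as-is
        .where("delay > 120 AND origin = 'SFO' AND destination = 'ORD'")
        .orderBy("delay", ascending=False))
```

Calling `long_sfo_ord_delays(df).show(10)` should list the same rows as the SQL version, although rows with equal delays may appear in a different order.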

Let’s try a more complicated query that uses the SQL `CASE` expression. In the following example, we want to label all US flights, regardless of origin and destination, with an indication of the delays they experienced: Very Long Delays (> 6 hours), Long Delays (2–6 hours), and so on. We’ll add these human-readable labels in a new column called `Flight_Delays`:

In [6]:
# Spark SQL API
sqlQuery = """
select
    delay, origin, destination,
    case
        when delay > 360 then 'Very Long Delays'
        when delay > 120 and delay <= 360 then 'Long Delays'
        when delay > 60 and delay <= 120 then 'Short Delays'
        when delay > 0 and delay <= 60 then 'Tolerable Delays'
        when delay = 0 then 'No Delays'
        else 'Early'
    end as Flight_Delays
from us_delay_flights_tbl
order by origin, delay desc
"""

spark.sql(sqlQuery).show(10)

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|  333|   ABE|        ATL|  Long Delays|
|  305|   ABE|        ATL|  Long Delays|
|  275|   ABE|        ATL|  Long Delays|
|  257|   ABE|        ATL|  Long Delays|
|  247|   ABE|        DTW|  Long Delays|
|  247|   ABE|        ATL|  Long Delays|
|  219|   ABE|        ORD|  Long Delays|
|  211|   ABE|        ATL|  Long Delays|
|  197|   ABE|        DTW|  Long Delays|
|  192|   ABE|        ORD|  Long Delays|
+-----+------+-----------+-------------+
only showing top 10 rows
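
To check where the bucket boundaries fall, the `CASE` logic can be mirrored in plain Python. This is an illustrative sketch: `label_delay` is a hypothetical helper, and it assumes `delay` is an integer number of minutes (it does not model SQL NULL handling).

```python
def label_delay(delay):
    """Mirror the CASE expression: the first matching branch wins."""
    if delay > 360:
        return "Very Long Delays"
    if delay > 120:
        return "Long Delays"
    if delay > 60:
        return "Short Delays"
    if delay > 0:
        return "Tolerable Delays"
    if delay == 0:
        return "No Delays"
    return "Early"


# Probe the boundaries: exactly 360 minutes is still a "Long Delays" flight.
for minutes in (361, 360, 120, 60, 0, -5):
    print(minutes, label_delay(minutes))
```

Because branches are evaluated top to bottom, the upper bounds in the SQL version (such as `delay <= 360`) are implied by the branches before them; they make the ranges easier to read but do not change the result.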



In [7]:
# Spark SQL API: count the flights in each delay category using a CTE
sqlQuery = """
with cte as
(
    select
        delay, origin, destination,
        case
            when delay > 360 then 'Very Long Delays'
            when delay > 120 and delay <= 360 then 'Long Delays'
            when delay > 60 and delay <= 120 then 'Short Delays'
            when delay > 0 and delay <= 60 then 'Tolerable Delays'
            when delay = 0 then 'No Delays'
            else 'Early'
        end as Flight_Delays
    from us_delay_flights_tbl
)
select Flight_Delays, count(*) as counts
from cte
group by Flight_Delays
order by 2 desc
"""

spark.sql(sqlQuery).show(10)

+----------------+------+
|   Flight_Delays|counts|
+----------------+------+
|           Early|668729|
|Tolerable Delays|497742|
|       No Delays|131122|
|    Short Delays| 61039|
|     Long Delays| 31447|
|Very Long Delays|  1499|
+----------------+------+
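
The same aggregation can be sketched with the DataFrame API. This is an assumption-labeled illustration, not part of the original notebook: `DELAY_LABEL` and `count_by_delay_label` are hypothetical names, and `df` is assumed to be the flight-delays DataFrame.

```python
# The CASE expression from the query above, kept as a SQL string.
DELAY_LABEL = """
case
    when delay > 360 then 'Very Long Delays'
    when delay > 120 and delay <= 360 then 'Long Delays'
    when delay > 60 and delay <= 120 then 'Short Delays'
    when delay > 0 and delay <= 60 then 'Tolerable Delays'
    when delay = 0 then 'No Delays'
    else 'Early'
end as Flight_Delays
"""


def count_by_delay_label(df):
    """Label each flight, then count flights per label, largest group first."""
    return (df.selectExpr(DELAY_LABEL)
        .groupBy("Flight_Delays")
        .count()  # adds a column named "count"
        .orderBy("count", ascending=False))
```

Note that `count()` names its output column `count`, whereas the SQL version aliases it as `counts`.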



In [8]:
# DataFrame API: the distance query from In [4], using Column expressions
from pyspark.sql.functions import col, desc

(df.select("distance", "origin", "destination")
    .where(col("distance") > 1000)
    .orderBy(desc("distance"))).show(10)

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
+--------+------+-----------+
only showing top 10 rows



In [9]:
# DataFrame API: the same query, using a SQL string filter and a keyword sort order

(df.select("distance", "origin", "destination")
    .where("distance > 1000")
    .orderBy("distance", ascending=False)).show(10)

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
|    4330|   HNL|        JFK|
+--------+------+-----------+
only showing top 10 rows



In [10]:
from pyspark.sql.functions import expr, asc, desc

# DataFrame API: the CASE expression is a SQL string passed through expr()
case = """
case
    when delay > 360 then 'Very Long Delays'
    when delay > 120 and delay <= 360 then 'Long Delays'
    when delay > 60 and delay <= 120 then 'Short Delays'
    when delay > 0 and delay <= 60 then 'Tolerable Delays'
    when delay = 0 then 'No Delays'
    else 'Early'
end as Flight_Delays
"""

(df.select("delay", "origin", "destination", expr(case))
    .orderBy(asc("origin"), desc("delay"))).show(10)

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|  333|   ABE|        ATL|  Long Delays|
|  305|   ABE|        ATL|  Long Delays|
|  275|   ABE|        ATL|  Long Delays|
|  257|   ABE|        ATL|  Long Delays|
|  247|   ABE|        DTW|  Long Delays|
|  247|   ABE|        ATL|  Long Delays|
|  219|   ABE|        ORD|  Long Delays|
|  211|   ABE|        ATL|  Long Delays|
|  197|   ABE|        DTW|  Long Delays|
|  192|   ABE|        ORD|  Long Delays|
+-----+------+-----------+-------------+
only showing top 10 rows



In [11]:
# Same query, ordering with Column methods instead of the asc()/desc() functions
(df.select("delay", "origin", "destination", expr(case))
    .orderBy(col("origin").asc(), col("delay").desc())).show(10)

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|  333|   ABE|        ATL|  Long Delays|
|  305|   ABE|        ATL|  Long Delays|
|  275|   ABE|        ATL|  Long Delays|
|  257|   ABE|        ATL|  Long Delays|
|  247|   ABE|        ATL|  Long Delays|
|  247|   ABE|        DTW|  Long Delays|
|  219|   ABE|        ORD|  Long Delays|
|  211|   ABE|        ATL|  Long Delays|
|  197|   ABE|        DTW|  Long Delays|
|  192|   ABE|        ORD|  Long Delays|
+-----+------+-----------+-------------+
only showing top 10 rows

