# SQL Common Table Expressions

SQL is a very powerful language that allows the construction of very complex queries, but it offers only limited support for structuring a query to improve readability. So-called "Common Table Expressions" (CTEs) are a common way to split large SQL queries into smaller, more manageable chunks. This approach works with most modern SQL databases, not only with Spark SQL.

**Attention!** In Spark, CTEs are optimized away, so there is no difference in execution speed. Other (relational) databases might handle CTEs differently, and a CTE might act as an optimization barrier. Please consult the manual of your database before blindly using CTEs!

For this notebook, we pick up the weather example again and use Spark SQL to calculate some aggregates per year and country. We deliberately do not use the most efficient implementation: the more complicated structure better shows the power of common table expressions.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

In [None]:
storageLocation = "s3://dimajix-training/data/weather"
#storageLocation = "/dimajix/data/weather-noaa-sample"

# 1. Register Temp Views

In a first step, we will load the input data via Spark. In order to use Spark SQL, we will also immediately register the loaded DataFrames as temp views.

## 1.1 Load Raw Data

Like before, we will load the raw measurement data and register it as a temp view called `raw_weather`.

In [None]:
from functools import reduce

# Read in all years, store them in a Python list
raw_weather_per_year = [spark.read.text(storageLocation + "/" + str(i)).withColumn("year", f.lit(i)) for i in range(2003,2020)]

# Union all years together
raw_weather = reduce(lambda l, r: l.union(r), raw_weather_per_year)

# Register Spark DataFrame as named temporary view called 'raw_weather'
raw_weather.createOrReplaceTempView("raw_weather")

In [None]:
# Display first 10 records
spark.sql("SELECT * FROM raw_weather LIMIT 10").toPandas()

## 1.2 Load Master Data

Now we will load the stations master data and register a temp view called `stations`.

In [None]:
stations = spark.read.csv(storageLocation + "/isd-history", header=True)

# Register Spark DataFrame as named temporary view called 'stations'
# YOUR CODE HERE

In [None]:
# Display first 10 records
# YOUR CODE HERE

# 2. Using Intermediate Tables

Let us first perform the weather analysis step by step using several intermediate temporary views. This approach is specific to Apache Spark and not directly available in other databases, but additional tools like [dbt](https://getdbt.com) provide similar capabilities for generic SQL databases.

## 2.1 Extract Measurements

In the first step, we need to extract all the interesting attributes from the raw measurement data. We do this by using `substring` together with appropriate casts and/or scaling.
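One detail that is easy to get wrong: SQL `substring(str, pos, len)` counts positions starting at 1, whereas Python slices start at 0. The following plain-Python sketch mirrors three of the extractions on a synthetic record. The helpers `sql_substring` and `make_record` and all sample values are made up for illustration; they are not part of the notebook and not real measurements.

```python
def sql_substring(value: str, pos: int, length: int) -> str:
    """Mimic SQL substring(value, pos, length) for 1-based positions (pos >= 1)."""
    start = pos - 1
    return value[start:start + length]


def make_record(usaf: str, wind_speed: str, air_temperature: str) -> str:
    """Build a synthetic fixed-width line (hypothetical helper, not real data)."""
    chars = ["0"] * 105
    for pos, text in ((5, usaf), (66, wind_speed), (88, air_temperature)):
        # Place each field at its 1-based position, as used in the SQL query
        chars[pos - 1:pos - 1 + len(text)] = text
    return "".join(chars)


value = make_record(usaf="123456", wind_speed="0036", air_temperature="-0056")

usaf = sql_substring(value, 5, 6)
# The query divides by 10.0, i.e. the raw fields are treated as tenths of a unit
wind_speed = float(sql_substring(value, 66, 4)) / 10.0
air_temperature = float(sql_substring(value, 88, 5)) / 10.0

print(usaf, wind_speed, air_temperature)
```

The same offsets and lengths appear unchanged in the SQL query; only the 1-based indexing differs from Python.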

In [None]:
query = """
    SELECT
        year,
        substring(value,5,6) AS usaf,
        substring(value,11,5) AS wban,
        substring(value,16,8) AS `date`,
        substring(value,24,4) AS `time`,
        substring(value,42,5) AS report_type,
        substring(value,61,3) AS wind_direction,
        substring(value,64,1) AS wind_direction_qual,
        substring(value,65,1) AS wind_observation,
        CAST(substring(value,66,4) AS FLOAT) / 10.0 AS wind_speed,
        substring(value,70,1) AS wind_speed_qual,
        CAST(substring(value,88,5) AS FLOAT) / 10.0 AS air_temperature,
        substring(value,93,1) AS air_temperature_qual
    FROM raw_weather
"""

# Create a Spark DataFrame for the SQL query above
# YOUR CODE HERE

# Register the DataFrame as a temp view with name 'weather'
# YOUR CODE HERE

# Display the first 10 records from the newly created temp view 'weather'
# YOUR CODE HERE


## 2.2 Join Data

In the next step, we join the extracted measurement data with the stations master data. We use the two columns `usaf` and `wban` for joining. Since the join represents an enrichment of the measurements, we choose a *left join*, which keeps every measurement even if no matching station exists. In order to access the extracted measurements, we can simply use the `weather` temp view we just created above.
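To see what the left join does to unmatched measurements, here is a small self-contained sketch. It uses Python's built-in `sqlite3` module instead of Spark so that it runs anywhere; the join syntax is standard SQL. The table contents and the column name `country` are assumptions for illustration, and the real stations file may name its columns differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weather (usaf TEXT, wban TEXT, air_temperature REAL);
    CREATE TABLE stations (usaf TEXT, wban TEXT, country TEXT);
    -- Synthetic rows for illustration only
    INSERT INTO weather VALUES ('111111', '00001', 12.5), ('999999', '99999', 3.0);
    INSERT INTO stations VALUES ('111111', '00001', 'DE');
""")

rows = conn.execute("""
    SELECT w.usaf, w.wban, w.air_temperature, s.country
    FROM weather w
    LEFT JOIN stations s
        ON w.usaf = s.usaf AND w.wban = s.wban
    ORDER BY w.usaf
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

The measurement without a matching station survives the join, with `country` set to NULL (`None` in Python). An inner join would have dropped it.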

In [None]:
query = """
    -- YOUR CODE HERE
"""

# Create a Spark DataFrame for the SQL query above
joined_weather = spark.sql(query)

# Register the DataFrame as a temp view with name 'joined_weather'
joined_weather.createOrReplaceTempView("joined_weather")

# Display first 10 records from the newly created temp view 'joined_weather'
spark.sql("SELECT * FROM joined_weather LIMIT 10").toPandas()

## 2.3 Aggregate Temperature

We now aggregate the min, max and average air temperature per country and year. We use a simple `WHERE` condition to ignore all records with an invalid air temperature. Of course, we already know a more efficient overall solution, but we deliberately ignore it here to make the example more complicated.
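The shape of such an aggregation can be sketched on a tiny synthetic table. The sketch again uses Python's built-in `sqlite3` module rather than Spark. The data is made up, and the filter assumes that the quality code `'1'` marks a valid reading, which should be checked against the data format documentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE joined_weather (
        year INTEGER, country TEXT,
        air_temperature REAL, air_temperature_qual TEXT
    );
    -- Synthetic rows; 999.9 with quality '9' stands in for an invalid reading
    INSERT INTO joined_weather VALUES
        (2003, 'DE', -5.0, '1'),
        (2003, 'DE', 15.0, '1'),
        (2003, 'DE', 999.9, '9'),
        (2003, 'FR', 20.0, '1');
""")

rows = conn.execute("""
    SELECT
        year,
        country,
        MIN(air_temperature) AS min_air_temperature,
        MAX(air_temperature) AS max_air_temperature,
        AVG(air_temperature) AS avg_air_temperature
    FROM joined_weather
    WHERE air_temperature_qual = '1'   -- assumption: '1' marks a valid value
    GROUP BY year, country
    ORDER BY year, country
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Without the `WHERE` clause, the placeholder value 999.9 would distort both the maximum and the average for the first group.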

In [None]:
query = """
    -- YOUR CODE HERE
"""

# Create a Spark DataFrame for the SQL query above and register it as a temp view called 'year_country_temperature'
year_country_temperature = spark.sql(query)
year_country_temperature.createOrReplaceTempView("year_country_temperature")

# Display first 10 records from the newly created temp view
spark.sql("SELECT * FROM year_country_temperature LIMIT 10").toPandas()

## 2.4 Aggregate Wind (Exercise)

As the next step, create a similar query for calculating the min, max and average wind speed per country and year.

In [None]:
query = """
    -- YOUR CODE HERE
"""

# Create a Spark DataFrame for the SQL query above and register it as a temp view called 'year_country_wind'
# YOUR CODE HERE

# Display first 10 records from the newly created temp view
spark.sql("SELECT * FROM year_country_wind LIMIT 10").toPandas()

## 2.5 Join Final Result

So far, we have created the two temp views `year_country_temperature` and `year_country_wind`. Both contain some aggregated attributes per country and year. Now join both views using the columns `country` and `year`. Here we choose a *full outer join*, since each side of the join can provide relevant information without the other side. The final result should contain the following columns:

* `year`
* `country`
* `min_wind_speed`
* `max_wind_speed`
* `avg_wind_speed`
* `min_air_temperature`
* `max_air_temperature`
* `avg_air_temperature`

The SQL function `COALESCE` returns its first non-NULL argument. It can be used to merge the columns `year` and `country` from the left and right sides of the join into two final columns `year` and `country`.
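Why `COALESCE` is needed becomes clearer when the outer join is spelled out by hand. The following plain-Python sketch imitates a full outer join on `(year, country)`; it is an illustration of the semantics, not the SQL solution itself. The `coalesce` helper and all values are made up for this example.

```python
def coalesce(*values):
    """Return the first argument that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


# Synthetic aggregates keyed by (year, country); values are made up
temperature = {(2003, "DE"): 9.5, (2003, "FR"): 12.0}
wind = {(2003, "DE"): 3.4, (2003, "US"): 4.1}

result = []
for key in sorted(temperature.keys() | wind.keys()):
    # In a full outer join, the side without a match contributes NULLs (None)
    t_year, t_country = key if key in temperature else (None, None)
    w_year, w_country = key if key in wind else (None, None)
    result.append({
        "year": coalesce(t_year, w_year),
        "country": coalesce(t_country, w_country),
        "avg_air_temperature": temperature.get(key),
        "avg_wind_speed": wind.get(key),
    })

for row in result:
    print(row)
```

Selecting only the left side's `year` and `country` would yield NULL keys for rows that exist only on the right side, which is exactly what `COALESCE` avoids.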

In [None]:
query = """
    -- YOUR CODE HERE
"""

# Execute the query
result = spark.sql(query)
# Convert the result to a Pandas DataFrame - this time without a 'limit'
result.toPandas()

# 3. Common Table Expressions

This section provides a SQL query that performs essentially the same calculation, but implements all steps within a single query using *Common Table Expressions* (CTEs). A CTE can be seen as a local temp view that is only accessible within a single query. In that sense, a CTE can be interpreted as a kind of *table-valued function* without parameters.
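As a minimal sketch of the syntax, the following example chains two CTEs in one `WITH` clause, where the second CTE reads from the first. It uses Python's built-in `sqlite3` module so that it runs without Spark. The two-field record layout and the data are invented for this example and are much simpler than the real weather format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_weather (year INTEGER, value TEXT);
    -- Synthetic records: characters 1-2 country, 3-5 temperature in tenths
    INSERT INTO raw_weather VALUES (2003, 'DE105'), (2003, 'DE095'), (2004, 'DE120');
""")

rows = conn.execute("""
    WITH weather AS (
        -- First CTE: extract typed columns from the raw text
        SELECT year,
               substr(value, 1, 2) AS country,
               CAST(substr(value, 3, 3) AS REAL) / 10.0 AS air_temperature
        FROM raw_weather
    ),
    year_country_temperature AS (
        -- Second CTE: may refer to the CTE defined before it
        SELECT year, country, MAX(air_temperature) AS max_air_temperature
        FROM weather
        GROUP BY year, country
    )
    SELECT * FROM year_country_temperature ORDER BY year
""").fetchall()
print(rows)

# A CTE is scoped to its query: outside of it, the name 'weather' is unknown
try:
    conn.execute("SELECT * FROM weather")
    cte_visible_outside = True
except sqlite3.OperationalError:
    cte_visible_outside = False
print(cte_visible_outside)
conn.close()
```

In contrast to the temp views from section 2, nothing is registered in the session: the names `weather` and `year_country_temperature` exist only while the query runs.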

## 3.1 Air Temperature

In order to provide an instructive example, let's start with a simpler query which only calculates the metrics related to the air temperature.

In [None]:
query = """
-- Extract measurements
-- YOUR CODE HERE

-- Join measurements and master data
-- YOUR CODE HERE

-- Calculate min/max/avg air temperature per country and year
-- YOUR CODE HERE
"""

# Execute the query and display result
result = spark.sql(query)
result.toPandas()

## 3.2 Full Query (Exercise)

In the next and final step, we construct a more complicated query, which also calculates the wind speed metrics and which performs the final join between the wind speed and air temperature metrics.
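Structurally, the full query needs two sibling CTEs that both read from the same input and are combined in the final `SELECT`. The sketch below shows only that skeleton on synthetic data, using Python's built-in `sqlite3` module. As a simplification it uses a plain inner `JOIN`, because older SQLite versions lack `FULL OUTER JOIN`; it also computes only the averages and starts from a ready-made `joined_weather` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE joined_weather (
        year INTEGER, country TEXT, wind_speed REAL, air_temperature REAL
    );
    -- Synthetic rows for illustration only
    INSERT INTO joined_weather VALUES
        (2003, 'DE', 2.0, 10.0),
        (2003, 'DE', 4.0, 20.0);
""")

rows = conn.execute("""
    WITH year_country_temperature AS (
        SELECT year, country, AVG(air_temperature) AS avg_air_temperature
        FROM joined_weather
        GROUP BY year, country
    ),
    year_country_wind AS (
        SELECT year, country, AVG(wind_speed) AS avg_wind_speed
        FROM joined_weather
        GROUP BY year, country
    )
    SELECT t.year, t.country, w.avg_wind_speed, t.avg_air_temperature
    FROM year_country_temperature t
    JOIN year_country_wind w   -- simplification: inner join instead of full outer join
        ON t.year = w.year AND t.country = w.country
""").fetchall()
print(rows)
conn.close()
```

All CTEs share a single `WITH` keyword and are separated by commas; only the last statement produces the result of the query.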

In [None]:
query = """
-- Extract measurements
-- YOUR CODE HERE

-- Join measurements and master data
-- YOUR CODE HERE

-- Aggregate air temperature
-- YOUR CODE HERE

-- Aggregate wind speed
-- YOUR CODE HERE

-- Join aggregated wind speed and aggregated air temperature
-- YOUR CODE HERE
"""

# Execute the query and display result
result = spark.sql(query)
result.toPandas()