# SQL Introduction

SQL is a powerful language that can take you extremely far. It is concise and reasonably simple once you get used to it. Spark supports SQL very well, and its implementation keeps improving through new features and better compatibility with existing databases such as Postgres.

In this notebook we give a solid introduction to the all-important `SELECT` statement, which provides capabilities similar to those of the Spark DataFrame transformations.

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

if 'spark' not in locals():
    spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory","4G") \
        .getOrCreate()

spark

# 1. Persons Sample Data

First we load the data, which is provided as a single CSV file, a format that is well supported by Apache Spark.

In [None]:
basedir = "s3://dimajix-training/data/"

In [None]:
persons = spark.read.csv(basedir + "persons_header.csv", header=True, inferSchema=True)
persons.toPandas()

In order to work with SQL, we immediately register the Spark DataFrame as a temporary view.

In [None]:
persons.createOrReplaceTempView("persons")

# 2. Simple Transformations

Following the original Spark introductory notebook, we start with very simple transformations. In SQL these are always expressed inside a `SELECT` statement. The first version uses the simplest structure:

```sql
SELECT
    expression_1 AS result_column_1,
    expression_2 AS result_column_2,
    ...
FROM some_table_name
```
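
As a rough illustration of this structure, the sketch below runs such a statement against a tiny table. It uses Python's built-in `sqlite3` module as a lightweight stand-in for Spark, and the rows are made up; only the column names `name`, `age`, `height` and `sex` are taken from the exercises in this notebook.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption); in the notebook the
# `persons` temp view registered above plays the role of this table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER, height INTEGER, sex TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [("Alice", 23, 156, "female"), ("Bob", 21, 181, "male"), ("Frank", 15, 175, "male")],
)

# Every entry in the SELECT list is an expression with an optional alias
query = """
    SELECT
        name,
        age * 2 AS double_age,
        upper(name) AS upper_name
    FROM persons
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

With Spark, the same query string could be passed to `spark.sql(query)` instead of `con.execute(query)`.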

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
# YOUR CODE HERE

Let's look at a different example, where we want to create a new DataFrame with the appropriate salutation in front of the name. We achieve this with the following `SELECT` statement:

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
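
One possible way to build such a salutation is a `CASE WHEN` expression. The sketch below again uses `sqlite3` as a stand-in engine with made-up rows; the values `'male'` and `'female'` in the `sex` column and the alias `salutation_name` are assumptions for illustration.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER, height INTEGER, sex TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [("Alice", 23, 156, "female"), ("Bob", 21, 181, "male")],
)

# CASE WHEN picks the salutation per row; || concatenates strings
query = """
    SELECT
        CASE
            WHEN sex = 'male' THEN 'Mr. ' || name
            WHEN sex = 'female' THEN 'Ms. ' || name
            ELSE name
        END AS salutation_name,
        age
    FROM persons
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

In Spark SQL the `concat` function could be used in place of the `||` operator.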

## Exercise

Using the `persons` temp view, perform the following operations:
* Add a new column `status` which should be `child` if the person is younger than 18 and `adult` otherwise
* Replace the column `name` by a new column `hashed_name` containing the hash value of the name
* Drop the column `sex`

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

# 3. Filtering

*Filtering* denotes the process of keeping only those rows that meet a certain filter criterion. SQL uses a `WHERE` clause to specify which records should be kept. Note that the conditions in the `WHERE` clause refer to the columns of the original table, not to the columns defined in the `SELECT` list.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
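
The following sketch shows a filter on a column that does not appear in the `SELECT` list at all, which works because `WHERE` refers to the original table. It relies on `sqlite3` as a stand-in engine and on made-up rows.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER, height INTEGER, sex TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [
        ("Alice", 23, 156, "female"),
        ("Bob", 21, 181, "male"),
        ("Charlie", 27, 176, "male"),
        ("Frank", 15, 175, "male"),
    ],
)

# `age` is used for filtering although only `name` is selected
query = """
    SELECT name
    FROM persons
    WHERE age > 21
"""
names = [row[0] for row in con.execute(query)]
print(names)
```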

## Exercise
Perform two different filter operations (with two different result sets):
* Select all women with a height of at least 160
* Select all persons who are younger than 20 or older than 30

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

# 4. Simple Aggregations

SQL supports aggregations without grouping inside a `SELECT` statement.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
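
An aggregation without grouping collapses the whole table into a single row. The sketch below illustrates this with `sqlite3` as a stand-in engine; the rows and the result column names are made up.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER, height INTEGER, sex TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [
        ("Alice", 23, 156, "female"),
        ("Bob", 21, 181, "male"),
        ("Charlie", 27, 176, "male"),
        ("Frank", 15, 175, "male"),
    ],
)

# Aggregate functions over all rows produce exactly one result row
query = """
    SELECT
        count(*) AS num_persons,
        avg(age) AS avg_age,
        max(height) AS max_height
    FROM persons
"""
summary = con.execute(query).fetchone()
print(summary)
```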

# 5. Grouping & Aggregation

An important class of operations is grouping and aggregation, which SQL provides via a `SELECT aggregation GROUP BY grouping` statement. Note that in contrast to PySpark, you need to explicitly add the grouping column to the list of expressions in order to see it in the result.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
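
A grouped aggregation might look like the sketch below, which lists the grouping column `sex` explicitly in the `SELECT` list so that it shows up in the result. It uses `sqlite3` as a stand-in engine with made-up rows.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER, height INTEGER, sex TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [
        ("Alice", 23, 156, "female"),
        ("Eve", 24, 158, "female"),
        ("Bob", 21, 181, "male"),
        ("Frank", 15, 175, "male"),
    ],
)

# One result row per distinct value of the grouping column
query = """
    SELECT
        sex,
        min(age) AS min_age,
        max(age) AS max_age
    FROM persons
    GROUP BY sex
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```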

## Exercise

Using the `persons` temp view, calculate the average height and the number of records per sex.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

# 6. Sorting

SQL also supports sorting data with the `ORDER BY` clause. For example, we can sort all persons by their height as follows:

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

If nothing else is specified, SQL sorts the records in ascending order of the sort columns. If you require descending order, append the `DESC` modifier to the sort column as follows:

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
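
As a small sketch of descending sort order, the following example sorts made-up rows by height, tallest first, using `sqlite3` as a stand-in engine.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER, height INTEGER, sex TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [("Alice", 23, 156, "female"), ("Bob", 21, 181, "male"), ("Charlie", 27, 176, "male")],
)

# DESC after the sort column reverses the default ascending order
query = """
    SELECT name, height
    FROM persons
    ORDER BY height DESC
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```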

## Exercise

As an exercise, we want to sort all persons first by their sex and then by descending age. Sorting by multiple columns is easily achieved by listing the columns, separated by commas, in the `ORDER BY` clause.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

# 7. Joining Data

Relational algebra also includes join operations, which let you combine multiple tables by a matching criterion. SQL supports joins of multiple tables or views. To shed some light on this, we need a second DataFrame in addition to the `persons` temp view. Therefore we load some address data as follows:

In [None]:
addresses = spark.read.json(basedir + "addresses.json")
addresses.createOrReplaceTempView("addresses")
addresses.toPandas()

Now that we have the `addresses` view, we want to combine it with the `persons` view such that the city of every person is added as a new column. This is achieved with the `JOIN` clause, which takes two parameters: the first specifies the second view to join with, and the second (introduced by `ON`) specifies the join condition. In this case we want to join all records where the `name` column matches.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

A few remarks on this result:
* The resulting DataFrame now contains two `name` columns: one comes from the `persons` view, the other from the `addresses` view. This behaviour is only logical, because the join condition could be a more complex expression and SQL cannot assume that every join simply compares two column values directly. For example, we could also have transformed the column on the fly by converting the name to upper case inside the join condition.
* The result contains only persons for whom an address was found, although the original `persons` view contains more persons.
* There are no records for addresses without a matching person, although the `addresses` view contains information about some persons who are not present in the `persons` view.

So let us tackle the first observation. We can easily get rid of the duplicated `name` column by explicitly selecting only the desired columns. The two `name` columns can be told apart by their table alias:

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
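
The sketch below illustrates an inner join with table aliases and an explicit column list. It uses `sqlite3` as a stand-in engine; the rows, the aliases `p` and `a`, and the exact layout of the `addresses` table (just `name` and `city`) are assumptions for illustration.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER)")
con.execute("CREATE TABLE addresses (name TEXT, city TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?)",
    [("Alice", 23), ("Bob", 21), ("Charlie", 27)],
)
con.executemany(
    "INSERT INTO addresses VALUES (?, ?)",
    [("Alice", "Hamburg"), ("Bob", "Frankfurt"), ("Zoe", "Berlin")],
)

# Aliases p and a disambiguate the two `name` columns;
# only p.name is selected, so the result has a single name column
query = """
    SELECT p.name, p.age, a.city
    FROM persons p
    JOIN addresses a ON p.name = a.name
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

Charlie has no address and Zoe has no person record, so neither appears in the result of this join.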

Now let us explain the last two observations. They are a consequence of the join type used, which was a so-called *inner* join. In this case, only records with information from both sides of the `JOIN` are included in the result.

In addition to the *inner* join, SQL also supports some additional join types:
* A *full outer join* contains records for all elements from both views. If either the left or the right view does not contain any matching information, the result contains `NULL` values (shown as `None` in Python) for the corresponding columns.
* In a *right join*, the second view (the right view, named after the `JOIN` keyword) is the leading element. The result contains at least one record for every record in that view.
* In a *left join*, the first view (the left view, named in the `FROM` clause) is the leading element. The result contains at least one record for every record in that view.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
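
As an illustration of a non-inner join, the sketch below performs a left join, so every person is kept even without a matching address. It uses `sqlite3` as a stand-in engine with made-up rows; older SQLite releases do not support right or full outer joins, which is why only the left join is shown here.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER)")
con.execute("CREATE TABLE addresses (name TEXT, city TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?)",
    [("Alice", 23), ("Bob", 21), ("Charlie", 27)],
)
con.executemany(
    "INSERT INTO addresses VALUES (?, ?)",
    [("Alice", "Hamburg"), ("Bob", "Frankfurt"), ("Zoe", "Berlin")],
)

# The left view (persons) is the leading element: Charlie is kept,
# and his missing city becomes NULL (None in Python)
query = """
    SELECT p.name, a.city
    FROM persons p
    LEFT JOIN addresses a ON p.name = a.name
    ORDER BY p.name
"""
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

In Spark SQL, replacing `LEFT JOIN` with `RIGHT JOIN` or `FULL OUTER JOIN` selects the other join types described above.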

# 8. Combining all Features

A general `SELECT` statement looks as follows (we omit CTEs for the moment):

```sql
SELECT
    expression_1 AS column_1,
    expression_2 AS column_2,
    ...
FROM table_1
JOIN table_2 ON ...
JOIN table_3 ON ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ...
LIMIT n
```

The different parts need to be written in exactly this order. Conceptually, they are evaluated in the following order:
* In the first step, all records are read from the table specified in the `FROM` clause.
* Then the `JOIN` clauses are executed by reading the appropriate tables and matching the records. There can be multiple `JOIN` clauses.
* Then the `WHERE` clause is executed, i.e. all records are filtered.
* Then the `GROUP BY` clause is executed.
* Then the `HAVING` clause is executed. It serves as an additional filter criterion *after* the grouped aggregation.
* Now all column expressions in the `SELECT` part are evaluated.
* The result set is sorted according to the `ORDER BY` clause.
* The first `n` records are taken according to the `LIMIT` clause.

Of course, the SQL optimizer may execute things in a different order. But the conceptual ordering is exactly as described above.

In [None]:
query = """
    # YOUR CODE HERE
"""

# Execute query and display the result
result = spark.sql(query)
result.toPandas()
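
To make the conceptual evaluation order more tangible, here is a sketch that combines all clauses in one query. It uses `sqlite3` as a stand-in engine; the rows, the aliases, and the result column names are made up for illustration.

```python
import sqlite3

# Stand-in engine with made-up rows (assumption)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE persons (name TEXT, age INTEGER)")
con.execute("CREATE TABLE addresses (name TEXT, city TEXT)")
con.executemany(
    "INSERT INTO persons VALUES (?, ?)",
    [("Alice", 23), ("Bob", 21), ("Charlie", 27), ("Eve", 24), ("Frank", 15), ("Gustav", 40)],
)
con.executemany(
    "INSERT INTO addresses VALUES (?, ?)",
    [
        ("Alice", "Hamburg"), ("Charlie", "Hamburg"),
        ("Bob", "Frankfurt"), ("Eve", "Frankfurt"),
        ("Frank", "Berlin"), ("Gustav", "Berlin"),
    ],
)

query = """
    SELECT
        a.city,
        count(*) AS num_adults,
        avg(p.age) AS avg_age
    FROM persons p
    JOIN addresses a ON p.name = a.name
    WHERE p.age >= 18          -- drops Frank before grouping
    GROUP BY a.city
    HAVING count(*) >= 2       -- drops Berlin, which has one adult left
    ORDER BY avg_age DESC
    LIMIT 1
"""
rows = con.execute(query).fetchall()
print(rows)
```

Berlin would have the highest average age, but it is removed by `HAVING` before sorting takes place, so the single remaining top row is Hamburg.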