## Spark SQL

### Code to be executed before lecture

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col
import requests

In [None]:
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "512M")
         .config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

#### If you did not bring the Spark cluster down, you can skip the cells below. If you did bring it down, run them to reload the data.

In [None]:
!wget https://ms.sites.cs.wisc.edu/cs544/data/sf.zip

In [None]:
!unzip sf.zip

In [None]:
!ls -lah

In [None]:
!hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv

In [None]:
df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("hdfs://nn:9000/sf.csv"))

In [None]:
cols = [col(c).alias(c.replace(" ", "_")) for c in df.columns]
df.select(cols).write.format("parquet").mode("overwrite").save("hdfs://nn:9000/sf.parquet")
df = spark.read.format("parquet").load("hdfs://nn:9000/sf.parquet")

In [None]:
!hdfs dfs -rm hdfs://nn:9000/sf.csv

### Hive view

Let's rename "Neighborhooods_-_Analysis_Boundaries" to "area".

In [None]:
df.withColumnRenamed("Neighborhooods_-_Analysis_Boundaries", "area").createOrReplaceTempView("calls")

In [None]:
spark.sql("SHOW TABLES").show()

In [None]:
spark.sql("SELECT * FROM calls LIMIT 3").toPandas()

### Hive table

In [None]:
spark.sql("""

""")

In [None]:
spark.sql("SHOW TABLES").show()

In [None]:
spark.sql("SELECT * FROM stinky LIMIT 3").toPandas()

### Hive data on HDFS

Where is it located?

```python
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "512M")
         .config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())
```

In [None]:
!hdfs dfs -ls hdfs://nn:9000/user/hive/warehouse/stinky/

In [None]:
df.rdd.getNumPartitions()

### Number of partitions: writing vs reading data

In [None]:
spark.sql("SELECT * FROM calls")

In [None]:
spark.sql("SELECT * FROM stinky")

### Create a DataFrame from a Hive view or table

### Grouping

#### What are the unique area column values?

#### How many calls are there per area?

In [None]:
pandas_df = spark.sql("""
SELECT 
FROM 
GROUP BY 
ORDER BY 
""").toPandas()
pandas_df

In [None]:
pandas_df.set_index("area").plot.bar()

#### How many calls are there per group/type?

In [None]:
spark.sql("""
SELECT 
FROM calls
GROUP BY 
""").toPandas().head()

#### For each call group, what percentage of calls are represented by the biggest type?
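One possible shape for this query is a nested aggregation: count calls per (group, type) in a subquery, then divide the largest count by the total within each group. The sketch below runs against an in-memory SQLite database (a stand-in, not Spark) so it works without a cluster. The column name `Call_Type_Group` and the sample rows are assumptions made for illustration; the query text uses only standard SQL and should also be accepted by `spark.sql`.

```python
import sqlite3

# In-memory SQLite stand-in for the "calls" view (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (Call_Type_Group TEXT, Call_Type TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("Fire", "Alarms")] * 3 + [("Fire", "Structure Fire")] * 1
    + [("Alarm", "Alarms")] * 2 + [("Alarm", "Other")] * 2,
)

query = """
SELECT Call_Type_Group,
       100.0 * MAX(n) / SUM(n) AS percent
FROM (
    -- inner query: one row per (group, type) with its call count
    SELECT Call_Type_Group, Call_Type, COUNT(*) AS n
    FROM calls
    GROUP BY Call_Type_Group, Call_Type
) AS per_type
GROUP BY Call_Type_Group
ORDER BY Call_Type_Group
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The outer `GROUP BY` sees one row per type, so `MAX(n)` is the size of the biggest type and `SUM(n)` is the size of the whole group.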

In [None]:
spark.sql("""
SELECT 
FROM 
GROUP BY 
""").toPandas()

Let's use the DataFrame API to answer the same question.

### Window functions

#### What are the three smallest call numbers for each area?
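A window function can number the rows within each area, and an outer query can then keep only the first three. The sketch below is one way to write it, run against an in-memory SQLite database (a stand-in, not Spark; window functions need SQLite 3.25 or newer). The `Call_Number` column and the sample rows are assumptions for illustration; `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` is also supported by Spark SQL.

```python
import sqlite3

# In-memory SQLite stand-in for the "calls" view (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (area TEXT, Call_Number INTEGER)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("Mission", 5), ("Mission", 2), ("Mission", 9), ("Mission", 7),
     ("Sunset", 4), ("Sunset", 1)],
)

query = """
SELECT area, Call_Number
FROM (
    -- number the rows of each area from smallest to largest call number
    SELECT area, Call_Number,
           ROW_NUMBER() OVER (PARTITION BY area ORDER BY Call_Number ASC) AS rownum
    FROM calls
) AS ranked
WHERE rownum <= 3
ORDER BY area, Call_Number
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The filter has to live in the outer query because a window function's result cannot be referenced in the `WHERE` clause of the same `SELECT`.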

In [None]:
spark.sql("""
SELECT 
FROM 
""")

### Holidays dataset

In [None]:
!hdfs dfs -cp holidays2.csv hdfs://nn:9000/holidays2.csv

In [None]:
(spark.read
 .format("csv")
 .option("inferSchema", True)
 .option("header", True)
 .load("hdfs://nn:9000/holidays2.csv")
 .createOrReplaceTempView("holidays")
)

In [None]:
spark.table("holidays")

### Joining the SF fire data with holidays data

SQL version

In [None]:
spark.sql("""
SELECT *
FROM calls
LIMIT 5
""").toPandas()

In [None]:
spark.sql("""
SELECT *
FROM holidays
LIMIT 5
""").toPandas()

In [None]:
spark.sql("""
SELECT
FROM 

LIMIT 5
""").toPandas()

DataFrame version

In [None]:
# Create the DataFrames
calls = spark.table("calls")
holidays = spark.table("holidays")

In [None]:
# unlike in pandas, this doesn't trigger any computation in Spark (evaluation is lazy)


#### How many calls on each kind of holiday?

In [None]:
calls.join(holidays, on=calls["Call_Date"] == holidays["date"], how="inner").toPandas()

#### What percent of fire dept calls are on holidays?

In [None]:
(calls
 .join(holidays, on=calls["Call_Date"] == holidays["date"], how=???)
 
 .toPandas())
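The join type matters here: an inner join drops every call that is not on a holiday, so the total number of calls is lost. The sketch below illustrates the idea with a left outer join on an in-memory SQLite database (a stand-in, not Spark). The `holiday` column name, the date format, and the sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite stand-ins for the "calls" and "holidays" views (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (Call_Number INTEGER, Call_Date TEXT)")
conn.execute("CREATE TABLE holidays (date TEXT, holiday TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [(1, "2020-01-01"), (2, "2020-01-02"), (3, "2020-07-04"), (4, "2020-07-05")],
)
conn.executemany(
    "INSERT INTO holidays VALUES (?, ?)",
    [("2020-01-01", "New Year's Day"), ("2020-07-04", "Independence Day")],
)

query = """
SELECT 100.0 * COUNT(holidays.holiday) / COUNT(*) AS percent_on_holidays
FROM calls
LEFT JOIN holidays ON calls.Call_Date = holidays.date
"""
(percent,) = conn.execute(query).fetchone()
print(percent)
```

`COUNT(holidays.holiday)` skips the NULLs produced for calls with no matching holiday, while `COUNT(*)` counts every call, so their ratio is the fraction of calls on holidays.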

### Web server REST API

Documentation: https://spark.apache.org/docs/latest/monitoring.html#rest-api

```
http://localhost:4040/api/v1/applications
http://localhost:4040/api/v1/applications/{app_id}/executors
# look for "totalTasks"
```

#### Applications information
```
http://localhost:4040/api/v1/applications
```

In [None]:
r = requests.get("http://localhost:4040/api/v1/applications")
r.raise_for_status()
r.json()

Extracting app id.

In [None]:
app_id = 
app_id
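A minimal sketch of the extraction, assuming the endpoint returns a JSON list with one object per application, each carrying an `id` field. The `sample_response` below is a hypothetical stand-in for `r.json()` (its values are made up), so the snippet runs without a cluster.

```python
# Hypothetical stand-in for r.json(); a real response has more fields.
sample_response = [
    {"id": "app-20240101000000-0000", "name": "cs544"},
]

# The endpoint returns a list, so take the first (here, the only) application.
app_id = sample_response[0]["id"]
print(app_id)
```

In the notebook, the same indexing would be applied to `r.json()` instead of `sample_response`.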

#### Executors information

For example, how much work each executor has done.

In [None]:
r = requests.get(f"http://localhost:4040/api/v1/applications/{app_id}/executors")
r.raise_for_status()
r.json()

#### How many total tasks have been run by each executor?
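One way to answer this, sketched on a hypothetical payload: the executors endpoint returns a list of objects, each with an `id` and a `totalTasks` field. The `sample_executors` values are invented for illustration; in the notebook you would use `r.json()` instead.

```python
# Hypothetical stand-in for r.json(); real entries carry many more fields.
# The driver typically shows up as an entry alongside the executors.
sample_executors = [
    {"id": "driver", "totalTasks": 0},
    {"id": "0", "totalTasks": 42},
    {"id": "1", "totalTasks": 37},
]

# Map each executor id to the number of tasks it has run so far.
tasks_per_executor = {e["id"]: e["totalTasks"] for e in sample_executors}
print(tasks_per_executor)
```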

### Caching

Let's sample the data, create a single partition and try some caching.

In [None]:
# cache() on a DataFrame defaults to MEMORY_AND_DISK; MEMORY_ONLY is the default for RDD.cache()
df = spark.table("calls")

Let's count the number of rows in our sample.

Let's take a look at the total tasks executed by each executor.

In [None]:
r = requests.get(f"http://localhost:4040/api/v1/applications/{app_id}/executors")
r.raise_for_status()
[exec["totalTasks"] for exec in r.json()]

In [None]:
# Repeating count computation 30 times
for i in range(30):
    df.count()

In [None]:
r = requests.get(f"http://localhost:4040/api/v1/applications/{app_id}/executors")
r.raise_for_status()
[exec["totalTasks"] for exec in r.json()]

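How can both executors share the work when the data is cached? Persist with the `MEMORY_ONLY_2` `StorageLevel`, which keeps a copy of each cached partition on two nodes, so tasks can be scheduled on either executor.
<br>**Try it yourself using the `persist` method instead of the `cache` method.**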

### Hash partitioning

What is a hash function?
- takes anything (e.g., just some bytes containing some data)
- returns a number (deterministic, but ideally without an obvious pattern)

In [None]:
print(hash(b"a"))
print(hash(b"b"))
print(hash(b"c"))
print(hash(b"d"))
print(hash(b"e"))
print(hash(b"f"))

We can use the modulo operator (`%`) to determine which hash partition a particular value should go into.

For example, if we need 5 partitions:

In [None]:
print(hash(b"a") % 5)
print(hash(b"b") % 5)
print(hash(b"c") % 5)
print(hash(b"d") % 5)
print(hash(b"e") % 5)
print(hash(b"f") % 5)

In [None]:
random_string = "aaaabbbefghihijkllmlm"

In [None]:
partitions = [[], [], [], [], []]


partitions
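One way to fill the partitions, sketched in plain Python: hash each letter, take the result modulo the number of partitions, and append the letter to that partition. The name `num_partitions` is introduced here for illustration. Note that Python salts `hash()` for strings and bytes per process, so the exact placement can differ between runs, but within one run equal letters always land in the same partition.

```python
random_string = "aaaabbbefghihijkllmlm"
num_partitions = 5  # illustrative name; matches the 5 partitions above

# Build independent lists (avoid [[]] * 5, which repeats one shared list).
partitions = [[] for _ in range(num_partitions)]

for letter in random_string:
    # Equal letters hash to the same value, so they land in the same partition.
    partitions[hash(letter) % num_partitions].append(letter)

for i, part in enumerate(partitions):
    print(i, part)
```

Because the partition depends only on the value, all rows with the same key end up together, which is what makes hash partitioning useful for grouping and joins.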

#### Spark execution explanation

`.explain()` or `.explain("formatted")`

In [None]:
spark.sql("""
SELECT Call_Type, COUNT(*) as count
FROM calls
GROUP BY Call_Type
""")

In [None]:
spark.sql("""
SELECT Call_Type, COUNT(*) as count
FROM calls
GROUP BY Call_Type
""").explain()