### Spark DataFrames + SQL

## WARNING: do not run this notebook without swap enabled, and check that you have sufficient free RAM (`htop`) before you start

Add swap space (disk-backed storage for anonymous memory pages):
1. `sudo fallocate -l 1G /swapfile`
2. `sudo chmod g-r /swapfile`
3. `sudo chmod o-r /swapfile`
4. `sudo mkswap /swapfile`
5. `sudo swapon /swapfile`
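To confirm that the swap file is active after the steps above, one option is to read the kernel's memory report. The sketch below parses text in the `/proc/meminfo` format (Linux only). The helper name `parse_meminfo` and the sample values are illustrative assumptions, not part of the original notebook.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        parts = value.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info

# Hypothetical sample; on a real Linux machine use open("/proc/meminfo").read()
sample = """MemTotal:        2030000 kB
MemAvailable:     900000 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB
"""
mem = parse_meminfo(sample)
swap_enabled = mem.get("SwapTotal", 0) > 0
print("swap enabled:", swap_enabled)
print("available MiB:", mem["MemAvailable"] // 1024)
```

A `SwapTotal` of 0 would mean the `swapon` step did not take effect.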

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col

In [None]:
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "512M")
         .config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

### SF fire dataset

Data source: https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3/data

In [None]:
! wget https://ms.sites.cs.wisc.edu/cs544/data/sf.zip

In [None]:
!unzip sf.zip

In [None]:
!ls -lah

Let's copy sf.csv into HDFS.

In [None]:
!hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv

Without schema inference (every column is read as a string).

In [None]:
df = spark.read.format("csv").option("header", True).load("hdfs://nn:9000/sf.csv")
df

With schema inference.

In [None]:
df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("hdfs://nn:9000/sf.csv"))
df
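To make the idea of schema inference concrete, here is a plain-Python sketch that scans the values of each column and picks the narrowest type that fits all of them. It is a simplification for illustration, not Spark's actual algorithm, which recognizes more types. The rows are made up, and the `Delay` column is hypothetical.

```python
import csv
import io

def infer_type(values):
    """Return the narrowest of int, double, string that fits every non-empty value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return name
        except ValueError:
            continue
    return "string"

# Made-up rows; "Delay" is a hypothetical numeric column
raw = "Call Number,Call Type,Delay\n1030101,Medical Incident,2.5\n1030104,Alarms,3\n"
rows = list(csv.DictReader(io.StringIO(raw)))
schema = {name: infer_type([r[name] for r in rows]) for name in rows[0]}
print(schema)
```

Inference has to look at the data before the real read can start, which is why `inferSchema` costs an extra pass over the file.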

### How to transform the data with functions on columns?

In [None]:
col("Call Date")

In [None]:
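# Without backticks this parses as column `Call` aliased to `Date`; see the backticked version below
expr("Call Date")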

In [None]:
df.select(col("Call Date")).limit(5).toPandas()

In [None]:
df.select(expr("`Call Date`")).limit(5).toPandas()

`alias` method.

In [None]:
df.select(expr("`Call Date`").alias("Date")).limit(5).toPandas()

Convert the date string to a proper date type using `to_date`.

In [None]:
df.select(expr("to_date(`Call Date`, 'MM/dd/yyyy')").alias("Date")).limit(5).toPandas()
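For comparison, the same parse can be sketched in plain Python. The Spark pattern `'MM/dd/yyyy'` (month, day, four-digit year) corresponds to `'%m/%d/%Y'` in `datetime.strptime`. The helper name `parse_call_date` and the sample value are assumptions made for illustration.

```python
from datetime import datetime

def parse_call_date(text):
    """Parse a month/day/year string such as '01/11/2002' into a date."""
    return datetime.strptime(text, "%m/%d/%Y").date()

d = parse_call_date("01/11/2002")  # made-up sample value
print(d.isoformat())
```

The month comes first in the pattern, so the sample above is January 11, not November 1.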

#### GOAL: create a parquet file with this data, with no spaces in the column names

In [None]:
columns = [col(c).alias(c.replace(" ", "_")) for c in df.columns]
columns[:5]
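The renaming rule in the list comprehension above is ordinary string replacement, so it can be checked without Spark. The sketch below applies it to three column names with spaces. The spaced spellings are assumed from the underscore versions used later in this notebook.

```python
# Assumed original names, reconstructed from the renamed columns used later
original = ["Call Date", "Call Type Group", "Neighborhooods - Analysis Boundaries"]

# Same rule as the notebook: replace every space with an underscore
renamed = {c: c.replace(" ", "_") for c in original}
for old, new in renamed.items():
    print(f"{old!r} -> {new!r}")
```

Only spaces are replaced, so the hyphen in the neighborhood column survives as `_-_`.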

In [None]:
df.rdd.getNumPartitions()

Write data to HDFS using parquet file format.

In [None]:
(df.select(columns)
 .write
 .format("parquet")
 .mode("overwrite")
 .save("hdfs://nn:9000/sf.parquet"))

Let's check the files on HDFS.

In [None]:
!hdfs dfs -ls hdfs://nn:9000/

In [None]:
!hdfs dfs -ls hdfs://nn:9000/sf.parquet

Let's read the data from the parquet file that we wrote.

In [None]:
df = spark.read.format("parquet").load("hdfs://nn:9000/sf.parquet")
df

In [None]:
df.rdd.getNumPartitions()

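Why does Spark use fewer partitions now? Parquet's compressed, columnar encoding makes the files much smaller than the CSV, and Spark sizes its input partitions by bytes, so fewer partitions are needed.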
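A rough way to see the effect of file size on partition count is to divide the input bytes by the maximum split size. The sketch below is a deliberately simplified model: Spark's real calculation also weighs the cost of opening files and the cluster's parallelism, and it can pack several small files into one partition. The file sizes are hypothetical.

```python
import math

MAX_PARTITION_BYTES = 128 * 1024**2  # default of spark.sql.files.maxPartitionBytes

def estimate_partitions(file_bytes, max_split=MAX_PARTITION_BYTES):
    """Rough split count for the input (ignores open cost and parallelism)."""
    return max(1, math.ceil(file_bytes / max_split))

csv_bytes = 2_000 * 1024**2     # hypothetical: about 2 GB of CSV
parquet_bytes = 500 * 1024**2   # hypothetical: the same rows as compressed Parquet
print(estimate_partitions(csv_bytes), estimate_partitions(parquet_bytes))
```

Comparing the sizes reported by `hdfs dfs -ls` for the CSV and Parquet paths gives the real inputs to this estimate.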

Now let's remove sf.csv from HDFS.

In [None]:
!hdfs dfs -rm hdfs://nn:9000/sf.csv

### HIVE View

In [None]:
df.createTempView("calls")
df

In [None]:
df.createOrReplaceTempView("calls")
df

Let's rename "Neighborhooods_-_Analysis_Boundaries" to "area".

In [None]:
df.withColumnRenamed("Neighborhooods_-_Analysis_Boundaries", "area").createOrReplaceTempView("calls")

### `show` method

- not a pretty output

In [None]:
spark.sql("SELECT * FROM calls")

In [None]:
#spark.sql("SELECT * FROM calls").show()

### `toPandas` method

In [None]:
spark.sql("SELECT * FROM calls LIMIT 3").toPandas()

In [None]:
spark.sql("SHOW TABLES").show()

### HIVE table

In [None]:
spark.sql("""
SELECT *
FROM calls
WHERE Call_Type LIKE 'Odor%'
""").write.mode("overwrite").saveAsTable("stinky")

In [None]:
spark.sql("SHOW TABLES").show()

In [None]:
spark.sql("SELECT * FROM stinky LIMIT 3").toPandas()

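Let's take a look at the data on HDFS. Recall that the session was created with `spark.sql.warehouse.dir` pointing at `hdfs://nn:9000/user/hive/warehouse`: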

```python
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "512M")
         .config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())
```

In [None]:
!hdfs dfs -ls hdfs://nn:9000/user/hive/warehouse/stinky/

In [None]:
spark.sql("SELECT * FROM calls").rdd.getNumPartitions()

In [None]:
spark.sql("SELECT * FROM stinky").rdd.getNumPartitions()

In [None]:
spark.table("calls")

In [None]:
spark.table("stinky")

### Grouping

### What are the unique values in the area column?

In [None]:
spark.sql("SELECT DISTINCT area FROM calls").collect()

### How many calls are there per area?

In [None]:
pandas_df = spark.sql("""
SELECT area, COUNT(*) as count
FROM calls
GROUP BY area
ORDER BY count DESC
""").toPandas()
pandas_df

In [None]:
pandas_df.set_index("area").plot.bar()

### How many calls are there per group/type?

In [None]:
spark.sql("""
SELECT Call_Type_Group, Call_Type, COUNT(*) as count
FROM calls
GROUP BY Call_Type_Group, Call_Type
""").toPandas().head()

### For each call group, what percentage of calls are represented by the biggest type?

In [None]:
spark.sql("""
SELECT Call_Type_Group, MAX(count) / SUM(count)
FROM (
    SELECT Call_Type_Group, Call_Type, COUNT(*) as count
    FROM calls
    GROUP BY Call_Type_Group, Call_Type
)
GROUP BY Call_Type_Group
""").toPandas()

Let's use the DataFrame API to answer the same question.

In [None]:
(spark.table("calls")
 .groupby("Call_Type_Group", "Call_Type")
 .count()
 .groupby("Call_Type_Group")
 .agg(expr("MAX(count) / SUM(count)").alias("perc"))
).toPandas()
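The two-level aggregation above (count per group/type pair, then `MAX / SUM` per group) can be traced by hand on a few rows. The sketch below reproduces the same logic in plain Python. The rows are made up and only meant to show the arithmetic.

```python
from collections import Counter, defaultdict

# Made-up (Call_Type_Group, Call_Type) pairs
calls = [
    ("Fire", "Structure Fire"), ("Fire", "Structure Fire"), ("Fire", "Outside Fire"),
    ("Alarm", "Alarms"), ("Alarm", "Alarms"), ("Alarm", "Alarms"),
    ("Alarm", "Odor (Strange / Unknown)"),
]

# Inner query: COUNT(*) ... GROUP BY Call_Type_Group, Call_Type
pair_counts = Counter(calls)

# Outer query: MAX(count) / SUM(count) ... GROUP BY Call_Type_Group
per_group = defaultdict(list)
for (group, _call_type), n in pair_counts.items():
    per_group[group].append(n)
share = {g: max(ns) / sum(ns) for g, ns in per_group.items()}
print(share)
```

The result is a fraction between 0 and 1. Multiply by 100 to report it as a percentage.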

### Window functions

### What are the three smallest call numbers for each area?

In [None]:
spark.sql("""
SELECT area, Call_Number, row_number() OVER (PARTITION BY area ORDER BY Call_Number ASC) AS rownum
FROM calls
""").where("rownum <= 3").toPandas()
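To see what `row_number() OVER (PARTITION BY area ORDER BY Call_Number ASC)` computes, the sketch below reproduces it in plain Python: sort the rows, split them by area, and number the rows within each area starting at 1. The rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Made-up (area, Call_Number) rows
rows = [("Mission", 40), ("Mission", 10), ("Mission", 30), ("Mission", 20),
        ("Sunset", 7), ("Sunset", 5)]

ranked = []
# Sorting the tuples orders by area, then Call_Number; groupby then yields one partition per area
for area, part in groupby(sorted(rows), key=itemgetter(0)):
    for rownum, (_, call_number) in enumerate(part, start=1):
        ranked.append((area, call_number, rownum))

# Equivalent of .where("rownum <= 3")
top3 = [r for r in ranked if r[2] <= 3]
print(top3)
```

The filter runs as a separate step because `rownum` exists only after the window function has been evaluated, so the `WHERE` clause of the same `SELECT` cannot reference it.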