## WARNING: do not run this notebook without swap enabled, and make sure to stop the kernel in the "spark_intro" notebook first.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, col

### Spark + parquet

Add swap space (caching for anonymous data):
1. `sudo fallocate -l 1G /swapfile`
2. `sudo chmod g-r /swapfile`
3. `sudo chmod o-r /swapfile`
4. `sudo mkswap /swapfile`
5. `sudo swapon /swapfile`
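
After the steps above, it can help to confirm that swap is actually active before starting Spark. The sketch below parses `SwapTotal` out of `/proc/meminfo`-style text. The helper name `swap_total_kib` and the sample text are hypothetical, and the approach assumes a Linux host with the usual `Name:  value kB` layout.

```python
def swap_total_kib(meminfo_text):
    """Return SwapTotal (in KiB) from /proc/meminfo-style text, or 0 if it is missing."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like "SwapTotal:       1048572 kB"
            return int(line.split()[1])
    return 0

# Hypothetical sample; on a real Linux host, use open("/proc/meminfo").read() instead.
sample = (
    "MemTotal:        2030000 kB\n"
    "SwapTotal:       1048572 kB\n"
    "SwapFree:        1048572 kB\n"
)
print(swap_total_kib(sample) > 0)  # True when swap is enabled
```

A result of 0 would suggest that `swapon` did not take effect.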

### SF fire dataset

Data source: https://data.sfgov.org/Public-Safety/Fire-Department-Calls-for-Service/nuek-vuh3/data

In [None]:
! wget https://ms.sites.cs.wisc.edu/cs544/data/sf.zip

In [None]:
!unzip sf.zip

In [None]:
!ls -lah

Hive lets you take files in HDFS and convert them into tables in a database. Then, we can run SQL queries.

In [None]:
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "512M")
         .config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

In [None]:
471859200 / 1024**2 # min needed in MB
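
One likely origin of the number 471859200, stated here as an assumption rather than something this notebook confirms: Spark's unified memory manager reserves 300 MiB and requires at least 1.5 times that amount. The variable names below are ours; the sketch only checks that the arithmetic works out and that the `512M` setting above clears the minimum.

```python
MIB = 1024**2

reserved_bytes = 300 * MIB              # assumed reserved system memory
min_bytes = int(reserved_bytes * 1.5)   # assumed minimum: 1.5 x reserved
requested_bytes = 512 * MIB             # the "512M" executor memory set above

print(min_bytes)                        # 471859200
print(min_bytes / MIB)                  # 450.0
print(requested_bytes >= min_bytes)     # True
```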

Let's copy sf.csv into HDFS.

In [None]:
!hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv

In [None]:
df = spark.read.format("csv").load("hdfs://nn:9000/sf.csv")

Let's convert the first three rows to a pandas DataFrame.

In [None]:
df.limit(3).toPandas()

In [None]:
df = (spark.read.format("csv")
      .option("header", True)
      .load("hdfs://nn:9000/sf.csv"))

In [None]:
df.limit(3).toPandas()

In [None]:
df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("hdfs://nn:9000/sf.csv"))

In [None]:
df

### How to transform the data with functions on columns?

In [None]:
col("Call Date")

In [None]:
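# Without backticks, Spark parses this as column `Call` aliased to `Date`, not as the column "Call Date".
expr("Call Date")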

In [None]:
df.select(col("Call Date")).limit(5).toPandas()

In [None]:
df.select(expr("`Call Date`")).limit(5).toPandas()

In [None]:
df.select(expr("`Call Date`").alias("Date")).limit(5).toPandas()

In [None]:
df.select(expr("to_date(`Call Date`, 'MM/dd/yyyy')").alias("Date")).limit(5).toPandas()
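
To see what the `'MM/dd/yyyy'` pattern means without a cluster, here is the same parse in plain Python. This is only an analogy: `datetime.strptime` uses different pattern letters than Spark, and the sample string is a hypothetical value in the same layout, not a row from the dataset.

```python
from datetime import datetime

raw = "04/12/2000"  # hypothetical value laid out as month/day/year

# Spark's 'MM/dd/yyyy' corresponds to Python's '%m/%d/%Y'
parsed = datetime.strptime(raw, "%m/%d/%Y").date()
print(parsed)  # 2000-04-12
```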

#### GOAL: create a parquet file with this data, with no spaces in the column names

In [None]:
columns = [col(c).alias(c.replace(" ", "_")) for c in df.columns]
columns[:5]
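
The renaming itself is plain string handling, so it can be checked without Spark. The sketch below uses a hypothetical sample of column names (only `Call Date` appears earlier in this notebook) and shows that `str.replace` handles names with more than one space.

```python
# Hypothetical sample standing in for df.columns
names = ["Call Number", "Call Date", "Unit sequence in call dispatch"]

# Same transformation as above, without the Spark col(...).alias(...) wrapper
renamed = {c: c.replace(" ", "_") for c in names}

for old, new in renamed.items():
    print(f"{old!r} -> {new!r}")
```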

In [None]:
df.rdd.getNumPartitions()

In [None]:
(df.select(columns)
 .write
 .format("parquet")
 .mode("overwrite")
 .save("hdfs://nn:9000/sf.parquet"))

Let's check the files on HDFS.

In [None]:
!hdfs dfs -ls hdfs://nn:9000/

In [None]:
!hdfs dfs -ls hdfs://nn:9000/sf.parquet

Let's read the data from the parquet file that we wrote.

In [None]:
df = spark.read.format("parquet").load("hdfs://nn:9000/sf.parquet")

In [None]:
df

In [None]:
df.rdd.getNumPartitions()

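Why does Spark use fewer partitions now? Parquet compresses the data, so the same rows take fewer bytes on disk, and Spark sizes its input partitions by bytes (up to `spark.sql.files.maxPartitionBytes` each), not by row count.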
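
The link between file size and partition count can be sketched with a rough estimate. This is a simplification: real Spark also accounts for file open cost, cluster parallelism, and file boundaries. The helper `estimate_partitions` and both file sizes are hypothetical; 128 MiB is the default for `spark.sql.files.maxPartitionBytes`.

```python
import math

MAX_PARTITION_BYTES = 128 * 1024**2  # default spark.sql.files.maxPartitionBytes

def estimate_partitions(total_bytes, max_partition_bytes=MAX_PARTITION_BYTES):
    """Rough estimate: one partition per max_partition_bytes of input, at least one."""
    return max(1, math.ceil(total_bytes / max_partition_bytes))

csv_bytes = 2000 * 1024**2     # hypothetical uncompressed CSV size
parquet_bytes = 500 * 1024**2  # hypothetical compressed Parquet size

print(estimate_partitions(csv_bytes))      # 16
print(estimate_partitions(parquet_bytes))  # 4
```

Comparing the two `getNumPartitions()` results above with the sizes reported by `hdfs dfs -ls` is a quick way to see how closely the real numbers follow this estimate.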