# Spark ML

### Code to be executed before lecture

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr

In [None]:
spark = (SparkSession.builder.appName("cs544")
         .master("spark://boss:7077")
         .config("spark.executor.memory", "512M")
         .config("spark.sql.warehouse.dir", "hdfs://nn:9000/user/hive/warehouse")
         .enableHiveSupport()
         .getOrCreate())

### WARNING: do not keep multiple copies of `sf.csv` on your VM as it will eat up disk space

Please make sure to move `sf.csv` into the `25-spark-ml/nb` directory.

In [None]:
!hdfs dfs -cp sf.csv hdfs://nn:9000/sf.csv

In [None]:
df = (spark.read.format("csv")
      .option("header", True)
      .option("inferSchema", True)
      .load("hdfs://nn:9000/sf.csv"))

In [None]:
cols = [col(c).alias(c.replace(" ", "_")) for c in df.columns]
df.select(cols).write.mode("ignore").format("parquet").save("hdfs://nn:9000/sf.parquet")

In [None]:
!hdfs dfs -rm hdfs://nn:9000/sf.csv

In [None]:
(spark.read
 .format("parquet")
 .load("hdfs://nn:9000/sf.parquet")
 .createOrReplaceTempView("calls")
)

### Lecture starts here

### Spark execution explanation
`.explain()` or `.explain("formatted")`

In [None]:
spark.sql("""
SELECT Call_Type, COUNT(*)
FROM calls
GROUP BY Call_Type
""").explain("formatted")

### Bucketed data

`bucketBy(<numBuckets>, <col>)`
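
The idea behind bucketing can be sketched in plain Python, without Spark. This is only a toy model: `NUM_BUCKETS`, `bucket_of`, and the sample `Call_Type` values are made up for illustration, and `zlib.crc32` stands in for Spark's own hash function. Rows are routed to a bucket by hashing the key at write time, so every row with a given key ends up in the same bucket and a later `GROUP BY` on that key can count each bucket independently.

```python
import zlib
from collections import Counter, defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch

def bucket_of(key, num_buckets=NUM_BUCKETS):
    # crc32 is a deterministic stand-in for Spark's own hash function
    return zlib.crc32(key.encode()) % num_buckets

# Hypothetical Call_Type values, made up for the sketch
rows = ["Medical Incident", "Alarms", "Structure Fire", "Alarms",
        "Medical Incident", "Medical Incident", "Traffic Collision"]

# "Write" phase: route each row to a bucket based on the hash of its key
buckets = defaultdict(list)
for call_type in rows:
    buckets[bucket_of(call_type)].append(call_type)

# "Query" phase: count each bucket on its own. No key spans two buckets,
# so per-bucket counts can be collected without merging partial counts.
counts = {}
for bucket_rows in buckets.values():
    counts.update(Counter(bucket_rows))

print(counts)
```

Note that `counts.update(...)` overwrites rather than adds, which is only correct because a key never appears in more than one bucket.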

In [None]:
# would work without sampling, just using it to make it faster


Let's repeat the same SQL query, now against the bucketed table `call_by_type`.

In [None]:
spark.sql("""
SELECT Call_Type, COUNT(*)
FROM call_by_type
GROUP BY Call_Type
""").explain("formatted")

### JOIN Algorithms (for a single machine)

In [None]:
# kind_id, color
fruits = [
    ("B", "Yellow"),
    ("A", "Green"),
    ("C", "Orange"),
    ("A", "Red"),
    ("C", "Purple"),
    ("B", "Green")
]

# kind_id, name (assume no duplicate kind_id's)
kinds = [
    ("A", "Apple"),
    ("B", "Banana"),
    ("C", "Carrot")
]

#### GOAL: print Yellow Banana, Green Apple, etc (any order)

### Option 1: Hash join
- Move smaller table to in-memory Python `dict`
- Iterate over larger table and find matches using `dict` lookup

In [None]:
kind_lookup = dict(kinds)  # kind_id -> name (kind_id's are unique)
kind_lookup
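
The two steps of Option 1 can be sketched as follows. This is one possible version, not necessarily the one written in lecture; the helper name `hash_join` is made up for illustration, and the data is repeated so the block runs on its own.

```python
# Same toy data as in the cells above
fruits = [("B", "Yellow"), ("A", "Green"), ("C", "Orange"),
          ("A", "Red"), ("C", "Purple"), ("B", "Green")]
kinds = [("A", "Apple"), ("B", "Banana"), ("C", "Carrot")]

def hash_join(big, small):
    """Join (key, value) rows of `big` against unique-key rows of `small`."""
    lookup = dict(small)           # build phase: smaller table -> dict
    out = []
    for key, value in big:         # probe phase: one dict lookup per row
        if key in lookup:
            out.append(f"{value} {lookup[key]}")
    return out

joined = hash_join(fruits, kinds)
for row in joined:
    print(row)
```

The larger table is scanned once and never needs to be sorted or held in memory as a whole; only the smaller table has to fit in the `dict`.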

### Option 2: Sort merge join

- Sort both tables (can be done using disk too)
- Iterate over smaller table
  - Conditionally iterate over bigger table to find matches

In [None]:
fruits.sort()
kinds.sort()

In [None]:
fruits

In [None]:
kinds

In [None]:
fruit_idx = 0
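
One way the merge step could look, as a self-contained sketch (the helper name `sort_merge_join` is made up for illustration, and the data is repeated so the block runs on its own). It assumes both inputs are already sorted by key and that keys in the smaller table are unique, as stated above.

```python
fruits = sorted([("B", "Yellow"), ("A", "Green"), ("C", "Orange"),
                 ("A", "Red"), ("C", "Purple"), ("B", "Green")])
kinds = sorted([("A", "Apple"), ("B", "Banana"), ("C", "Carrot")])

def sort_merge_join(big, small):
    """Both inputs must be sorted by key; keys in `small` must be unique."""
    out = []
    big_idx = 0
    for key, name in small:
        # Skip rows of `big` whose key sorts before the current key (no match)
        while big_idx < len(big) and big[big_idx][0] < key:
            big_idx += 1
        # Emit every row of `big` that shares the current key
        while big_idx < len(big) and big[big_idx][0] == key:
            out.append(f"{big[big_idx][1]} {name}")
            big_idx += 1
    return out

merged = sort_merge_join(fruits, kinds)
for row in merged:
    print(row)
```

Because `big_idx` only moves forward, each table is scanned once after sorting, and neither table has to fit in memory at the same time as the other.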


### Spark ML

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({"x1": np.random.randint(0, 10, 100).astype(float), 
                   "x2": np.random.randint(0, 3, 100).astype(float)})
df["y"] = df["x1"] + df["x2"] + np.random.rand(len(df))

Let's convert the pandas DataFrame to a Spark DataFrame.

In [None]:
df = spark.createDataFrame(df)  # pandas DataFrame -> Spark DataFrame
df

Recall that a seed in Spark does not make results deterministic overall, because the data may be split into different partitions on each run; results are deterministic only at the partition level.
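
A toy simulation in plain Python may help make this concrete. It is not Spark's sampler: the function `sample_rows`, the contiguous partitioning, and the choice of seeding each partition with `seed + idx` are all assumptions of this sketch. With the same seed and the same partitioning the sample repeats exactly; changing only the number of partitions will typically give a different sample.

```python
import random

def sample_rows(rows, num_partitions, fraction, seed):
    """Toy model: split rows into contiguous partitions, one RNG per partition."""
    size = -(-len(rows) // num_partitions)   # ceiling division
    picked = []
    for idx in range(num_partitions):
        rng = random.Random(seed + idx)      # assumed per-partition seeding
        for row in rows[idx * size:(idx + 1) * size]:
            if rng.random() < fraction:
                picked.append(row)
    return picked

rows = list(range(100))
a = sample_rows(rows, num_partitions=2, fraction=0.5, seed=544)
b = sample_rows(rows, num_partitions=2, fraction=0.5, seed=544)
c = sample_rows(rows, num_partitions=4, fraction=0.5, seed=544)

print(a == b)  # same seed, same partitioning
print(a == c)  # same seed, different partitioning
```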

Let's write the data to Parquet and read it back from the Parquet file.
We now need to use `mode("ignore")`, which skips the write when the output already exists, so that we keep working with the same deterministic sample.

In [None]:
train.count(), test.count()

In [None]:
from pyspark.ml.regression import DecisionTreeRegressor

- `DecisionTreeRegressor`: unfitted model
- `DecisionTreeRegressionModel`: fitted model
    - In Spark, names ending in "Model" are the fitted ones

In [None]:
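# ALWAYS needs a vector column - even for a single feature!
dt = DecisionTreeRegressor(featuresCol="x1", labelCol="y")
# dt.fit(train) would fail here: "x1" is a double column, not a vector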


### VectorAssembler

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
va = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
dt = DecisionTreeRegressor(featuresCol="features", labelCol="y")

model = dt.fit(va.transform(train))

In [None]:
type(dt), type(model)