# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates, but __you are not allowed to copy any part of your answers from your classmates or from the internet__.
- You can put the homework files anywhere you want in your https://jupyterhub.ischool.syr.edu/ workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out any code that contains `raise NotImplementedError`. These lines are there mainly to make the `assert` statements fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answers. Think about cases your code should handle even if it passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [None]:
# Load the packages needed for this part
# create spark and sparkcontext objects
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

from pyspark.ml import feature, regression, Pipeline, pipeline
from pyspark.sql import types, Row, functions as fn
from pyspark import sql
import pandas as pd
import matplotlib.pyplot as plt

# Part 1: PCA and feature engineering

The government of Syracuse is trying to understand how to better keep its streets in good condition. Luckily, they have a dataset obtained from patching said streets. Using this dataset, they want to easily visualize the characteristics of the city.

In [None]:
# load the data
syracuse_streets = spark.read.json("syracuse.json.gz")

In particular, the city is interested in understanding the following features:

- `Latitude` 
- `Longitude`
- `crack`: number of cracks on the street (visually inspected)
- `patch`: number of patches on the street (visually inspected)
- `pavement`: quality of the pavement
- `length`: street length
- `width`: street width

In [None]:
# take a look at the numerical features
syracuse_streets.select(['Latitude', 
               'Longitude', 
               'crack', 
               'patch',  
               'pavement',
               'length', 
               'width']).limit(10).toPandas()

For some of the questions, you will use the following user-defined function that transforms a vector into an array.

In [None]:
@fn.udf(returnType=types.ArrayType(types.FloatType()))
def to_array(col):
    return col.toArray().tolist()

For example, suppose you have latitude and longitude in a single column of type vector. Spark encodes vectors differently from arrays, so they are not easy to manipulate. The function above transforms a vector into an array for easier manipulation.

In [None]:
(feature
 .VectorAssembler(inputCols=['Latitude', 'Longitude'], outputCol='feature_vector')
 .transform(syracuse_streets)
 .select('feature_vector')
 .show(5)
)

After converting the vector with `to_array`, we can take the array apart using the `fn.expr('feature_array[i]')` notation to extract the ith element:

In [None]:
(feature
 .VectorAssembler(inputCols=['Latitude', 'Longitude'], outputCol='feature_vector')
 .transform(syracuse_streets)
 .select(to_array('feature_vector').alias('feature_array'))
 .select(fn.expr('feature_array[0]').alias('latitude'), 
         fn.expr('feature_array[1]').alias('longitude'))
 .show(5)
)

You can do a lot more things with arrays. Take a look at https://sparkbyexamples.com/spark/spark-sql-array-functions/

## Question 1: (30 pts) Simple PCA

In this question, you will perform PCA to understand and visualize the data. You will analyze the features `Latitude`, `Longitude`, `crack`, `patch`,  `pavement`, `length`, and `width`.

In [None]:
feature_list = ['Latitude', 
               'Longitude', 
               'crack', 
               'patch',  
               'pavement',
               'length', 
               'width']

In the cell below, create a pipeline model (i.e., a fitted pipeline) called `pca_2d` that takes the features above and projects them onto **two** principal components. Before fitting the PCA model, make sure to **standardize** your data (i.e., center **and** divide by the standard deviation) using `feature.StandardScaler`. Make sure the PCA stage of the model generates an output column called `pc`.
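To make the two steps concrete, here is a minimal NumPy sketch of what standardizing and then projecting onto two principal components means numerically. It is not the Spark pipeline requested above: the toy matrix `X` and every variable name are hypothetical, and the sketch assumes the sample standard deviation (`ddof=1`) is used for scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 200 rows, 4 features on very different scales
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 0.1])

# Standardize: center each column and divide by its sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Principal directions are the eigenvectors of the covariance matrix of Z
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # indices sorted by decreasing eigenvalue
components = eigvecs[:, order[:2]]       # keep the two leading directions

# Project the standardized data onto the two leading directions
pc = Z @ components                      # shape (200, 2)
```

Without the standardization step, the third column (the one with the largest scale) would dominate the leading component simply because of its units.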

In [None]:
# create pipeline to produce principal components of data
# YOUR CODE HERE
raise NotImplementedError()

Check that the fitted pipeline works:

In [None]:
pca_2d.transform(syracuse_streets).select(feature_list + ['pc']).show(5)

In [None]:
# 15 pts
assert type(pca_2d) == pipeline.PipelineModel
assert type(pca_2d.stages[-1]) == feature.PCAModel
assert set(pca_2d.stages[0].extractParamMap()[(pca_2d.stages[0].inputCols)]) == \
 {'Longitude', 'Latitude', 'crack', 'patch', 'pavement', 'length', 'width'}
assert feature.StandardScalerModel in list(map(type, pca_2d.stages))

Now, let's visualize the principal components. Create a Pandas dataframe that contains `Longitude`, PC 1 (name it `pc1`), and PC 2 (name it `pc2`). You should use the udf `to_array` defined above to extract the first and second elements of the `pc` vector into their own columns. Call this Pandas dataframe `syracuse_2d`.

In [None]:
# put your code here to produce a pandas dataframe with columns Longitude, pc1, and pc2
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# 10 pts
assert set(syracuse_2d.columns) == {'Longitude', 'pc1', 'pc2'}
assert syracuse_2d.shape[0] == syracuse_streets.count()

The code below will plot the dataframe and color points by Longitude:

In [None]:
import seaborn as sns
sns.set(style="darkgrid")
sns.scatterplot(data=syracuse_2d, x='pc1', y='pc2', hue=syracuse_2d.Longitude.tolist())

**(10 pts)**: Given the plot above, do you think the *loading* on Longitude is bigger in principal component 1 or principal component 2? Elaborate.

YOUR ANSWER HERE

## Question 2: (10 pts) More PCA

In the previous question, we limited our analysis to only two principal components. However, it is unclear whether two dimensions capture enough of the variation in the data.
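One common way to judge whether a few dimensions are enough is the proportion of variance each principal component explains. The NumPy sketch below illustrates the idea on hypothetical toy data; the matrix `X` and all variable names are assumptions made for illustration and are not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical toy data: column 1 is nearly a copy of column 0
base = rng.normal(size=(300, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=300),
    base[:, 1],
    rng.normal(size=300),
])

# Standardize, then take the eigenvalues of the covariance matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending order

# Each eigenvalue is the variance along one principal component
explained = eigvals / eigvals.sum()   # proportion of variance per component
cumulative = np.cumsum(explained)     # variance captured by the first k components
```

Reading `cumulative[k - 1]` gives the share of total variance retained when only the first `k` components are kept.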

In this question, fit a new PCA model, similar to that of Question 1, that includes all principal components (seven, because the length of `feature_list` is 7). Call this PCA pipeline `pca_all`. You can reuse some of the components of the pipeline above.

In [None]:
# create pipeline to produce principal components of data
# YOUR CODE HERE
raise NotImplementedError()

Test your pipeline below:

In [None]:
pca_all.transform(syracuse_streets).select(to_array('pc')).show(5)

In [None]:
# 10 pts
assert type(pca_all) == pipeline.PipelineModel
assert type(pca_all.stages[-1]) == feature.PCAModel