<h1 style='text-align:center;color:Navy'>  Big Data Systems - Fall 2021  </h1>
<h1 style='text-align:center;color:Navy'>  Assignment 4  </h1>

***

<b>Submission Deadline: This assignment is due Friday, Mar 31 at 8:59 P.M.</b>

A few notes before you start:
- Directly sharing answers is not okay, but discussing problems with other students is encouraged.
- You should start early so that you have time to get help if you're stuck.

- Complete all the exercises below and turn in a write-up in the form of a Jupyter notebook, that is, an .ipynb file. The write-up should include your code and answers to exercise questions. You will submit your assignment online as an attachment (*.ipynb), through Canvas under Assignment 4.

# <span style="color:#3665af">Big Data Learning with Scikit-learn </span>
<hr>

###### Goal
In this assignment, we will learn how to use linear regression in Scikit-learn to estimate values in a connected vehicles dataset.

###### Prerequisites
This assignment has the following dependencies:
- Jupyter Notebook, along with the following libraries (which should be installed on the Computing Platform):
  - Scikit-learn
  - NumPy
  - Pandas
  - matplotlib


<div style="font-size:30px;color:#3665af;background-color:#E9E9F5;padding:10px;">Assignment Hands-on </div>

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;">1. Setup </div>

- Visualize the position data, get some intuition about the geography
- Reduce the columns to the ones related to position and speed

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split
# import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

In [None]:
# this data comes from the US DoT data website
# it is from a trial of connected vehicles travelling between Laramie and Cheyenne, WY
# https://data.transportation.gov/Automobiles/Wyoming-CV-Pilot-Basic-Safety-Message-One-Day-Samp/9k4m-a3jc
dataSrc = pd.read_csv("data/Wyoming_CV_Pilot_Basic_Safety_Message_One_Day_Sample.csv", low_memory=False)

In [None]:
# looking for the "coreData" columns instead of the "metaData" columns
allColumns = [colName for colName in dataSrc.columns if colName.startswith('coreData')]
allColumns

In [None]:
# this should resemble the roadway from Laramie, WY to Cheyenne, WY
plt.scatter(dataSrc['coreData_position_lat'], dataSrc['coreData_position_long'])

<img src="map-projection-of-data.png" alt="Map projection of data" style="width: 400px;"/>

Hmm, that roadway looks similar, but a little off. That's because we are treating latitude and longitude as Euclidean coordinates, when they are actually angular coordinates on a sphere.
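To make the distortion concrete, here is a minimal sketch that computes a great-circle distance with the haversine formula and shows how much shorter a degree of longitude is than a degree of latitude at this latitude. The city coordinates are approximate values assumed for illustration (they are not read from the dataset), and `haversine_km` is a hypothetical helper name.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treats the Earth as a sphere

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate city-center coordinates (assumed values, not taken from the dataset).
laramie = (41.31, -105.59)
cheyenne = (41.14, -104.82)

distance_km = haversine_km(*laramie, *cheyenne)

# Ground length of one degree of longitude relative to one degree of latitude.
lon_shrink = math.cos(math.radians(41.2))

print(f"great-circle distance: {distance_km:.1f} km")
print(f"a degree of longitude covers about {lon_shrink:.2f} of a degree of latitude here")
```

Because the shrink factor is well below 1 at this latitude, a raw scatter plot of degrees stretches the east-west direction relative to the true ground distances.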

Can we see a relationship between latitude values and elevation?

In [None]:
coordinateSortedData = dataSrc.sort_values(by=['coreData_position_long', 'coreData_position_lat'])
plt.plot(coordinateSortedData['coreData_position_lat'], coordinateSortedData['coreData_elevation'])

One issue we note with this immediately: elevation is not a well-defined function of latitude, since some latitude values map to multiple elevation values. In other words, knowing the latitude alone does not determine a unique elevation.

What about with respect to the longitudinal data points?

In [None]:
plt.plot(coordinateSortedData['coreData_position_long'], coordinateSortedData['coreData_elevation'])

Elevation appears to be much closer to a proper function of longitude: each longitude value maps to roughly one elevation value. We will consider this later when we build our linear model.
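The same check can be done numerically rather than by eye. The sketch below uses a small made-up table (the values and the helper name `is_single_valued` are assumptions for illustration) and tests whether each distinct x value maps to exactly one y value.

```python
import pandas as pd

# Toy data: two rows share a latitude but have different elevations.
toy = pd.DataFrame({
    "lat": [41.1, 41.1, 41.2, 41.3],
    "lon": [-105.5, -105.4, -105.3, -105.2],
    "elevation": [2200.0, 2350.0, 2300.0, 2100.0],
})

def is_single_valued(frame, x_col, y_col):
    """Return True if every distinct x value maps to exactly one y value."""
    return bool((frame.groupby(x_col)[y_col].nunique() <= 1).all())

lat_ok = is_single_valued(toy, "lat", "elevation")
lon_ok = is_single_valued(toy, "lon", "elevation")
print("elevation is a function of lat:", lat_ok)
print("elevation is a function of lon:", lon_ok)
```

On real sensor data an exact test like this is strict, since measurement noise can produce slightly different elevations at the same coordinate, so treat it as a rough diagnostic.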

Wait, how many columns did our raw data have?

In [None]:
print(f"{len(allColumns)} columns")

We don't need all 24. Let's focus on the columns that are related to position and movement. We can even rename them to something easier on the eye.

In [None]:
dataSubset = dataSrc[['coreData_position_lat','coreData_position_long','coreData_secMark','coreData_elevation','coreData_speed', 'coreData_heading']]
dataSubset = dataSubset.rename(columns={'coreData_position_lat':'lat','coreData_position_long':'lon','coreData_secMark':'time','coreData_elevation':'height','coreData_speed':'speed', 'coreData_heading': 'direction'})
dataSubset

Two of the columns use metric units. Here are two transforms we can use to switch from the metric system to US customary units (that is, feet and miles per hour).

In [None]:
def metersToFeet(x):
    return x * 3.28084
def metersPerSecToMPH(x):
    return (metersToFeet(x) * 3600) / 5280

Let's convert the data in `dataSubset` from metric to our American distance measures, and store that in a new copy of the `dataSubset`. We do this because we want to manipulate data for analysis but maintain a copy without edits for any future analysis.
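Before converting the whole frame, a quick sanity check on the conversion factors can be useful. The sketch below repeats the two helpers so that it runs on its own and compares them against two well-known reference values: 1 m/s is roughly 2.237 mph, and one mile (1609.344 m) is 5280 ft.

```python
def metersToFeet(x):
    # Same factor as the helper above: 1 m is about 3.28084 ft.
    return x * 3.28084

def metersPerSecToMPH(x):
    # feet per second -> feet per hour -> miles per hour
    return (metersToFeet(x) * 3600) / 5280

one_mps_in_mph = metersPerSecToMPH(1.0)
one_mile_in_feet = metersToFeet(1609.344)

print(f"1 m/s is about {one_mps_in_mph:.3f} mph")
print(f"1609.344 m is about {one_mile_in_feet:.1f} ft")
```

Both helpers only multiply by constants, so they work unchanged on a single number or on a whole pandas column.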

In [None]:
dataFt = dataSubset.copy()
dataFt['height'] = metersToFeet(dataSubset['height'])
dataFt['speed'] = metersPerSecToMPH(dataSubset['speed'])
dataFt

Now, what can we say about these columns? One way to find out is to call `describe()` on each of them.
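As a side note, `describe()` also works on several columns at once, which returns one summary table instead of separate outputs. The sketch below uses a tiny made-up frame (the values are assumptions, not the assignment data) with the same column names.

```python
import pandas as pd

# Toy stand-in for dataFt with the renamed columns.
toy = pd.DataFrame({
    "lat": [41.1, 41.2, 41.3],
    "lon": [-105.5, -105.3, -105.1],
    "height": [7200.0, 7400.0, 7000.0],
})

# One row per statistic (count, mean, std, min, quartiles, max), one column per feature.
summary = toy[["lat", "lon", "height"]].describe()
print(summary)
```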

In [None]:
dataFt['lat'].describe()

In [None]:
dataFt['lon'].describe()

In [None]:
dataFt['height'].describe()

<br>

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> 2. A Regression Proof-of-concept </div>

Let's show ourselves that building a linear model does what we think it should do.

- Let's predict elevation based on position
  - Let's make our first linear model _fit_ to the longitudinal data
  - We will try different features and different linear regression models
  - Recall from above, there was a fairly obvious relationship between longitude and elevation. We will let the models show us that this was a relevant detail.

Our first model makes the assumption that a linear model on an x/y coordinate pair can be used to infer height. Put another way, we will use `lat` and `lon` (our _features_) to predict `height` (our _label_), using a model of the form `c_1(lat) + c_2(lon) + b`, where we are attempting to learn the coefficients `c_1` and `c_2` and the intercept `b` that best fit our training data. Note that `svm.SVR()` uses an RBF kernel by default, so the model fit below is only linear in this sense if you pass `kernel='linear'`.
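To see what "learning the coefficients" means, here is a minimal sketch on synthetic data where the true relationship is known. It uses ordinary least squares (`LinearRegression`) rather than SVR, purely because it exposes the coefficients directly; the data ranges and the true coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for lat and lon (assumed ranges, not the real data).
lat = rng.uniform(41.0, 41.4, size=200)
lon = rng.uniform(-105.6, -104.8, size=200)

# A known linear relationship with no noise: height = 3*lat - 2*lon + 5.
height = 3.0 * lat - 2.0 * lon + 5.0

X = np.column_stack([lat, lon])
model = LinearRegression().fit(X, height)

c_1, c_2 = model.coef_
b = model.intercept_
print(f"c_1 = {c_1:.3f}, c_2 = {c_2:.3f}, intercept = {b:.3f}")
```

With noise-free data the fit recovers the coefficients used to generate it; on real data the learned values are only the best linear approximation.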

In [None]:
# this may take some time.
latLonModel = svm.SVR()
X_train, X_test, y_train, y_test = train_test_split(dataFt[['lat','lon']], dataFt['height'], test_size=0.2)
latLonModel.fit(X_train, y_train)

In [None]:
predictions = latLonModel.predict(X_test)

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")

In [None]:
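# how does it score?
# note: for regressors, score() returns R^2 (the coefficient of determination), not a classification accuracy
print("Model Accuracy: {0:.2f}%".format(latLonModel.score(X_test, y_test) * 100))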

This is not _bad_. It's also not _good_. Let's see if we can do better.
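It helps to know what the number above measures. For scikit-learn regressors, `score` returns the coefficient of determination, R^2, which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up values and compares the result with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)
print(f"manual R^2 = {r2_manual:.4f}, sklearn R^2 = {r2_sklearn:.4f}")
```

A score of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values are possible for models that do worse than that.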

One common technique for improving a regression model is to create **polynomial features**: additional columns that hold polynomial terms of the inputs. For example, for some input column `x`, we can create additional columns to represent `x^2` and `x^3`.
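As a small illustration, scikit-learn can generate these columns with `PolynomialFeatures`. The sketch below expands a single made-up input column `x` into `x`, `x^2`, and `x^3`; the input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One input column x with two samples (arbitrary values).
x = np.array([[2.0], [3.0]])

# include_bias=False drops the constant column of ones.
poly = PolynomialFeatures(degree=3, include_bias=False)
expanded = poly.fit_transform(x)

# Columns are x, x^2, x^3 for each row.
print(expanded)
```

With more than one input column, `PolynomialFeatures` also produces cross terms (products of different columns) up to the chosen degree.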

In our case, let's see whether creating an additional feature, `latlon` (computed as `lat * lon`), helps our score.

In [None]:
polyFeatures = dataFt[['lat','lon']].copy()

# this is the line that adds a new column to polyFeatures that is the product of the columns lat and lon
polyFeatures['latlon'] = polyFeatures['lat'] * polyFeatures['lon']
polyFeatures

Ok! Let's try with our new feature!

In [None]:
polyLonModel = svm.SVR()
X_train, X_test, y_train, y_test = train_test_split(polyFeatures[['lat','lon','latlon']], dataFt['height'], test_size=0.2)
polyLonModel.fit(X_train, y_train)

In [None]:
predictions = polyLonModel.predict(X_test)

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel("True Values")
plt.ylabel("Predictions")

In [None]:
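# how does it score?
# note: for regressors, score() returns R^2 (the coefficient of determination), not a classification accuracy
print("Model Accuracy: {0:.2f}%".format(polyLonModel.score(X_test, y_test) * 100))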

Much better! We doubled our accuracy without changing our model or data, but by coming up with more features from our data. Learning from the correct features is essential to good modeling.

<hr style="border-top: 5px solid purple; margin-top: 1px; margin-bottom: 1px"></hr>

<div style="font-size:30px;color:#3665af;background-color:#e1dfb1;padding:10px;">Exercise </div>

<div style="font-size:20px;color:#F1F8FC;background-color:#0095EA;padding:10px;"> 3. Let's Predict Speed</div>

Predicting speed will be a bit more challenging, but, we have some intuition. Perhaps there are properties related to the highway that might produce similar speed ratings, such as a steep climb on an eastbound section, or passing through a section where there was construction all day long. Think about what vehicle travel is like while you work on your model. 

#### Do not get lost attempting perfection! Grading will be based mostly on your answers to the questions below. Make a reasonable effort at refining your model, put a few hours into it, and explain your process for a passing grade.

##### In order to answer these questions, do the following:

- Use our data points to fit some of the data to a linear model
  - _important_: you will need to set `speed` to be your training label
- Experiment with different polynomial features
  - see polyLonModel, above, for an example
- Try changing the parameters of SVR, such as kernel, C, gamma, and degree, when appropriate
  - see the [SVR documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR) for a detailed description of these options.
- Try _at least one more_ of the following linear regression models that Scikit-Learn offers
  - [Lasso](http://scikit-learn.org/stable/modules/linear_model.html#lasso)
  - [ElasticNet](http://scikit-learn.org/stable/modules/linear_model.html#elastic-net)
  - [Ridge](http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression)
- [optional] play with integrating other learners and tools from the Scikit-Learn toolkit, such as PCA
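
The mechanics of comparing several regressors can be sketched as follows. This is not a solution: it runs on synthetic data (made-up features and a made-up label), and the parameter values (`alpha`, `l1_ratio`, `C`, `gamma`) are arbitrary starting points chosen for illustration. Wrapping each estimator in a pipeline with `StandardScaler` is a design choice assumed here, since SVR and the regularized linear models are sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic features and label (stand-ins, not the assignment data).
X = rng.uniform(-1.0, 1.0, size=(300, 3))
y = 60.0 + 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 1.0, size=300)

# A fixed random_state keeps the split identical across models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "SVR (rbf)": SVR(kernel="rbf", C=10.0, gamma="scale"),
}

scores = {}
for name, estimator in candidates.items():
    # Scale features, then fit the estimator, as one pipeline.
    model = make_pipeline(StandardScaler(), estimator)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Using the same train/test split for every candidate makes the scores directly comparable, which is not the case if each model gets its own random split.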


<div style="font-size:20px;background-color:#0B6713;color:#F1E6E7;padding:10px;">
    TO-DO:
</div>

In [None]:
###### Your code here, showing results on your models.
###### Feel free to add additional cells. Good luck!

<div style="font-size:20px;background-color:#A74A54;color:#F1E6E7;padding:10px;">
    Questions 
</div>

<div style="width:100%;">
    <div style="width:15%;float:left;font-size:20px;background-color:#557aba;color:#eff3f9;padding:6px;font-weight:bold;text-align:center;">
    Question 1
    </div>
    <div style="width:85%;float:right;font-size:16px;background-color:#dce4f2;font-weight:normal;color:black;padding:6px;">
    What was your best model and accuracy? 
    </div>
</div>

<div style="width:100%;">
    <div style="width:15%;float:left;font-size:20px;background-color:#557aba;color:#eff3f9;padding:6px;font-weight:bold;text-align:center;">
    Question 2
    </div>
    <div style="width:85%;float:right;font-size:16px;background-color:#dce4f2;font-weight:normal;color:black;padding:6px;">
    What parameter settings did you use to achieve that accuracy?
    </div>
</div>

<div style="width:100%;">
    <div style="width:15%;float:left;font-size:20px;background-color:#557aba;color:#eff3f9;padding:6px;font-weight:bold;text-align:center;">
    Question 3
    </div>
    <div style="width:85%;float:right;font-size:16px;background-color:#dce4f2;font-weight:normal;color:black;padding:6px;">
    What features did you choose? Why?
    </div>
</div>

<div style="width:100%;">
    <div style="width:15%;float:left;font-size:20px;background-color:#557aba;color:#eff3f9;padding:6px;font-weight:bold;text-align:center;">
    Question 4
    </div>
    <div style="width:85%;float:right;font-size:16px;background-color:#dce4f2;font-weight:normal;color:black;padding:12px;">
    For the model of your most successful experiment (SVR, Lasso, ElasticNet, etc.), what can you say about its strengths related to this problem?
    </div>
</div>

<div style="width:100%;">
    <div style="width:15%;float:left;font-size:20px;background-color:#557aba;color:#eff3f9;padding:6px;font-weight:bold;text-align:center;">
    Question 5
    </div>
    <div style="width:85%;float:right;font-size:16px;background-color:#dce4f2;font-weight:normal;color:black;padding:6px;">
    Why is it harder to predict speed than height?
    </div>
</div>

<hr style="border-top: 5px solid purple; margin-top: 1px; margin-bottom: 1px"></hr>

<h2>Submission</h2>

<hr style="border-top: 5px solid orange; margin-top: 1px; margin-bottom: 1px"></hr>

<p style="text-align: justify;">You need to submit a Jupyter Notebook (*.ipynb) file that contains your completed code. <span>The file name should be in <strong>FirstName_LastName</strong> format</span>.</p>
<p style="text-align: justify;"><span>DO NOT INCLUDE EXTRA FILES, SUCH AS THE INPUT DATASETS</span>, in your submission.</p>
<p style="text-align: justify;">Please download your assignment after submission and make sure it is not corrupted or empty! We will not be responsible for corrupted submissions and will not take a resubmission after the deadline.</p>

Need Help?
If you need help with this assignment, please get in touch with the TAs on MS Teams or via their emails, or go to their office hours.
You are highly encouraged to ask your questions on the designated channel for Assignment 4 on Microsoft Teams (not necessarily monitored by the instructor/TAs). Feel free to help other students with general questions. However, DO NOT share your solution.

<hr style="border-top: 5px solid orange; margin-top: 1px; margin-bottom: 1px"></hr>