# Machine Learning with scikit-learn

* What is Machine Learning?
  * Types of Machine Learning
* Train-Test split
* Use `sklearn` to build a linear regression model
* One-Hot Encoding
* Pipelines
* Evaluation Metrics

We will reference the [scikit-learn docs](https://scikit-learn.org/stable/user_guide.html) and [pandas docs](https://pandas.pydata.org/pandas-docs/stable/index.html) where relevant, and analyze the New York Times COVID-19 US States dataset from https://github.com/nytimes/covid-19-data.

**Disclaimer: Linear regression is not the most suitable algorithm for this dataset, but we are using it to illustrate how to use scikit-learn**

## What is Machine Learning?

* Learning patterns in your data without being explicitly programmed
* A function that maps features to an output

![](https://brookewenig.com/img/DL/al_ml_dl.png)

## Types of Machine Learning
* Supervised Learning
  * Regression <img src="https://miro.medium.com/max/640/1*LEmBCYAttxS6uI6rEyPLMQ.png" style="height: 250px; padding: 10px"/>
  * Classification
    <img src="https://cdn3-www.dogtime.com/assets/uploads/2018/10/puppies-cover.jpg" style="height: 250px; padding: 10px"/>
    <img src="https://images.unsplash.com/photo-1529778873920-4da4926a72c2?ixlib=rb-1.2.1&w=1000&q=80" style="height: 250px; padding: 10px"/>
* Unsupervised Learning
<img src="https://www.iotforall.com/wp-content/uploads/2018/01/Screen-Shot-2018-01-17-at-8.10.14-PM.png" style="height: 250px; padding: 10px"/>
* Reinforcement Learning
<img src="https://brookewenig.com/img/ReinforcementLearning/Rl_agent.png" style="height: 250px; padding: 10px"/>

Today we're going to start simple and focus on a supervised learning (regression) problem. Here we will use a linear regression model to predict the number of deaths resulting from COVID-19.

In [5]:
%fs ls databricks-datasets/COVID/covid-19-data/us-states.csv

| path | name | size |
|---|---|---|
| dbfs:/databricks-datasets/COVID/covid-19-data/us-states.csv | us-states.csv | 102167 |


In [6]:
import pandas as pd

df = pd.read_csv("/dbfs/databricks-datasets/COVID/covid-19-data/us-states.csv")
df.head()
df.date.max()

In [7]:
df.shape

## Relationship between Cases & Deaths

In [9]:
# To allow us to print out plots
%matplotlib inline

In [10]:
# Filter to 2020-05-01
df_05_01 = df[df["date"] == "2020-05-01"]

ax = df_05_01.plot(x="cases", y="deaths", kind="scatter", 
                   figsize=(12,8), s=100, title="Deaths vs Cases on 2020-05-01 - All States")

df_05_01[["cases", "deaths", "state"]].apply(lambda row: ax.text(*row), axis=1);

## New York & New Jersey are Outliers

In [12]:
# Filter to states that are NOT New York and NOT New Jersey
not_ny = df[(df["state"] != "New York") & (df["state"] != "New Jersey")]
not_ny.head()

| | date | state | fips | cases | deaths |
|---|---|---|---|---|---|
| 0 | 2020-01-21 | Washington | 53 | 1 | 0 |
| 1 | 2020-01-22 | Washington | 53 | 1 | 0 |
| 2 | 2020-01-23 | Washington | 53 | 1 | 0 |
| 3 | 2020-01-24 | Illinois | 17 | 1 | 0 |
| 4 | 2020-01-24 | Washington | 53 | 1 | 0 |
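As a side note, the two chained `!=` comparisons can also be written with `Series.isin`, which scales better when excluding more states. The sketch below uses a made-up four-row frame (`toy_states` is a hypothetical name, not part of the notebook) to show that both masks select the same rows.

```python
import pandas as pd

# Made-up rows for illustration only
toy_states = pd.DataFrame({
    "state": ["New York", "New Jersey", "Washington", "Illinois"],
    "cases": [4, 3, 2, 1],
})

# Chained comparisons, as in the cell above
mask_chain = (toy_states["state"] != "New York") & (toy_states["state"] != "New Jersey")

# Equivalent mask: negate membership in a list of excluded states
mask_isin = ~toy_states["state"].isin(["New York", "New Jersey"])

print(toy_states[mask_isin])
```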


In [13]:
# Filter to 2020-05-01
not_ny_05_01 = not_ny[not_ny["date"] == "2020-05-01"]

ax = not_ny_05_01.plot(x="cases", y="deaths", kind="scatter", 
                   figsize=(12,8), s=50, title="Deaths vs Cases on 2020-05-01 - All States but NY and NJ")

not_ny_05_01[["cases", "deaths", "state"]].apply(lambda row: ax.text(*row), axis=1);

## New York versus California COVID-19 deaths comparison

In [15]:
df_ny_cali = df[(df["state"] == "New York") | (df["state"] == "California")]

# Let's pivot our df_ny_cali DataFrame so that we can plot deaths over time for both states
df_ny_cali_pivot = df_ny_cali.pivot(index='date', columns='state', values='deaths').fillna(0)
df_ny_cali_pivot

| date | California | New York |
|---|---|---|
| 2020-01-25 | 0.0 | 0.0 |
| 2020-01-26 | 0.0 | 0.0 |
| 2020-01-27 | 0.0 | 0.0 |
| 2020-01-28 | 0.0 | 0.0 |
| 2020-01-29 | 0.0 | 0.0 |
| 2020-01-30 | 0.0 | 0.0 |
| 2020-01-31 | 0.0 | 0.0 |
| 2020-02-01 | 0.0 | 0.0 |
| 2020-02-02 | 0.0 | 0.0 |
| 2020-02-03 | 0.0 | 0.0 |
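To make the long-to-wide reshaping concrete, here is a minimal sketch on made-up rows (`long_df` and `wide_df` are hypothetical names). It also suggests why the pivoted values show up as floats: a date with no row for one state yields `NaN`, which forces a float column before `fillna(0)` replaces it.

```python
import pandas as pd

# Made-up long-format rows: one row per (date, state)
long_df = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-02"],
    "state": ["California", "California", "New York"],
    "deaths": [1, 2, 5],
})

# Wide format: one row per date, one column per state.
# New York has no row for 2020-03-01, so that cell is NaN until fillna(0).
wide_df = long_df.pivot(index="date", columns="state", values="deaths").fillna(0)
print(wide_df)
```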


In [16]:
df_ny_cali_pivot.plot.line(title="Deaths 2020-01-25 to 2020-05-01 - CA and NY", figsize=(12,8))

## Train-Test Split

![](https://brookewenig.com/img/IntroML/trainTest.png)

Because this is temporal data, instead of doing a random split, we will use data from March 1 to April 7 to train our model, and test our model by predicting values for April 8 - 14.

In [19]:
train_df = df[(df["date"] >= "2020-03-01") & (df["date"] <= "2020-04-07")]
test_df = df[(df["date"] >= "2020-04-08") & (df["date"] <= "2020-04-14")]

X_train = train_df[["cases"]]
y_train = train_df["deaths"]

X_test = test_df[["cases"]]
y_test = test_df["deaths"]
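The date comparisons above work on plain strings because ISO-8601 dates (`YYYY-MM-DD`) sort lexicographically in chronological order. A minimal sketch on a made-up frame (`toy`, `toy_train`, and `toy_test` are hypothetical names) illustrates the idea of a temporal split:

```python
import pandas as pd

# Hypothetical toy frame standing in for the COVID-19 data (values are made up)
toy = pd.DataFrame({
    "date": ["2020-03-30", "2020-04-06", "2020-04-07",
             "2020-04-08", "2020-04-14", "2020-04-15"],
    "cases": [10, 20, 30, 40, 50, 60],
})

# String comparison is enough to cut the data at a date
cutoff = "2020-04-07"
toy_train = toy[toy["date"] <= cutoff]
toy_test = toy[toy["date"] > cutoff]

# Every training date precedes every test date, so the model never sees the future
print(toy_train["date"].max(), "<", toy_test["date"].min())
```

If the `date` column held another format (for example `MM/DD/YYYY`), string comparison would no longer be chronological, and converting with `pd.to_datetime` first would be the safer choice.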

## Linear Regression

* Goal: Find the line of best fit
$$\hat{y} = w_0 + w_1x$$

$$y \approx \hat{y} + \epsilon$$
* *x*: feature
* *y*: label
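As a rough illustration of what "line of best fit" means, the sketch below fits scikit-learn's `LinearRegression` to made-up points that lie exactly on y = 2 + 3x, and compares the result with the textbook closed-form least-squares estimates for a single feature. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points lying exactly on y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# scikit-learn expects a 2-D feature matrix, hence the reshape
model = LinearRegression().fit(x.reshape(-1, 1), y)

# Closed-form least-squares estimates for one feature
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()

print(f"sklearn:     w0={model.intercept_:.2f}, w1={model.coef_[0]:.2f}")
print(f"closed form: w0={w0:.2f}, w1={w1:.2f}")
```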

![](https://miro.medium.com/max/640/1*LEmBCYAttxS6uI6rEyPLMQ.png)

Here we will be fitting a [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) model from scikit-learn.

In [22]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(X_train, y_train)
print(f"num_deaths = {lr.intercept_:.4f} + {lr.coef_[0]:.4f}*cases")

Hmmm... if we have no cases, then there should be no deaths caused by COVID-19, so let's set the intercept to be 0.

In [24]:
lr = LinearRegression(fit_intercept=False).fit(X_train, y_train)
print(f"num_deaths = {lr.coef_[0]:.4f}*cases")
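With `fit_intercept=False` and a single feature, the least-squares slope reduces to sum(x*y) / sum(x^2), and the model predicts exactly zero at zero. The sketch below checks this on made-up numbers (`cases`, `deaths`, and `origin_model` here are illustrative names, not the notebook's data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up cumulative counts for illustration only
cases = np.array([100.0, 200.0, 400.0, 800.0])
deaths = np.array([2.0, 7.0, 11.0, 25.0])

origin_model = LinearRegression(fit_intercept=False).fit(cases.reshape(-1, 1), deaths)

# Least squares through the origin: w = sum(x*y) / sum(x^2)
w = np.sum(cases * deaths) / np.sum(cases ** 2)

print(f"sklearn slope: {origin_model.coef_[0]:.4f}, closed form: {w:.4f}")
print(f"prediction at 0 cases: {origin_model.predict([[0.0]])[0]:.1f}")
```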

So this model is implying that there is a 2.9% mortality rate in our dataset. But we know that some states have higher mortality rates than others. Let's include the state as a feature!

## One-Hot Encoding
How do we handle non-numeric features, such as the state?

One idea:
* Create single numerical feature to represent non-numeric one
* Categorical features:
  * state = {'New York', 'California', 'Louisiana'}
  * 'New York' = 1, 'California' = 2, 'Louisiana' = 3
  
BUT this implies California is 2x New York!

Better idea:
* Create a ‘dummy’ feature for each category
* 'New York' => [1, 0, 0], 'California' => [0, 1, 0], 'Louisiana' => [0, 0, 1]

This technique is known as ["One Hot Encoding"](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
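Here is a minimal sketch of the idea on the three example states, using a hypothetical one-column frame called `states`. Note that scikit-learn sorts the categories, so the column order differs from the bullets above, and that `.toarray()` is used only to get a dense array for printing.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

states = pd.DataFrame({"state": ["New York", "California", "Louisiana", "California"]})

toy_enc = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; .toarray() makes it dense
encoded = toy_enc.fit_transform(states).toarray()

# Categories are sorted, so the columns are California, Louisiana, New York
print(toy_enc.categories_[0])
print(encoded)

# With handle_unknown="ignore", a state unseen during fit becomes an all-zero row
unknown = toy_enc.transform(pd.DataFrame({"state": ["Texas"]})).toarray()
print(unknown)
```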

In [27]:
from sklearn.preprocessing import OneHotEncoder

X_train = train_df[["cases", "state"]]
X_test = test_df[["cases", "state"]]

# scikit-learn >= 1.2 uses `sparse_output`; on older versions pass `sparse=False` instead
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(X_train).transform(X_train)

Let's check the shape

In [29]:
enc.fit(X_train).transform(X_train).shape

Yikes! It one-hot encoded the cases variable too

In [31]:
enc.categories_

We need the [column transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) to only apply the one hot encoding to a single column.

In [33]:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([("enc", enc, ["state"])], remainder="passthrough")
ct.fit_transform(X_train).shape
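To see what the column transformer produces, here is a small sketch on a made-up frame (`toy_X` and `toy_ct` are hypothetical names). The encoded columns come first and the passthrough columns are appended at the end, which is why the `cases` coefficient ends up in the last position later on. `sparse_threshold=0` is set here only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only
toy_X = pd.DataFrame({
    "cases": [100, 250, 40],
    "state": ["New York", "California", "New York"],
})

toy_ct = ColumnTransformer(
    [("enc", OneHotEncoder(handle_unknown="ignore"), ["state"])],
    remainder="passthrough",  # keep the untouched columns (here: cases)
    sparse_threshold=0,       # always return a dense array
)
out = toy_ct.fit_transform(toy_X)

# Column order: [state=California, state=New York, cases]
print(out)
```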

## Pipelines

We can chain together a series of data transformations with a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). This way we also ensure that whatever operations we apply to our training set, we also apply in the same order to our test set.

In [35]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[("ct", ct), ("lr", lr)])
pipeline_model = pipeline.fit(X_train, y_train)

y_pred = pipeline_model.predict(X_test)
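The sketch below, on made-up data with hypothetical names (`toy_pipe`, `toy_train_X`, and so on), illustrates what the pipeline does at prediction time: calling `predict` on the raw frame gives the same result as transforming the frame with the fitted column transformer and then calling the fitted regressor by hand.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test rows
toy_train_X = pd.DataFrame({"cases": [100, 200, 100, 300],
                            "state": ["A", "A", "B", "B"]})
toy_train_y = pd.Series([2.0, 4.0, 5.0, 15.0])
toy_test_X = pd.DataFrame({"cases": [150, 250], "state": ["A", "B"]})

toy_pipe = Pipeline(steps=[
    ("ct", ColumnTransformer(
        [("enc", OneHotEncoder(handle_unknown="ignore"), ["state"])],
        remainder="passthrough", sparse_threshold=0)),
    ("lr", LinearRegression(fit_intercept=False)),
])
toy_pipe.fit(toy_train_X, toy_train_y)

# One call: the pipeline transforms the raw frame, then predicts
pipe_pred = toy_pipe.predict(toy_test_X)

# The same thing done by hand with the fitted steps
manual_pred = toy_pipe.named_steps["lr"].predict(
    toy_pipe.named_steps["ct"].transform(toy_test_X))

print(pipe_pred, manual_pred)
```

Looking steps up by name through `named_steps` tends to be easier to read than positional access such as `steps[1][1]`.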

## How are the different states performing?

You'll notice that by adding in additional features, our coefficient for our cases feature changed as well.

In [37]:
print(f"num_deaths = {pipeline_model.steps[1][1].coef_[-1]:.4f}*cases + state_coef")

In [38]:
import pandas as pd
pd.set_option('display.float_format', '{:.2f}'.format)

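# Look up the fitted encoder inside the fitted pipeline; it was fit on the "state" column only
categories = pipeline_model.named_steps["ct"].named_transformers_["enc"].categories_[0]

pd.DataFrame(zip(categories, pipeline_model.named_steps["lr"].coef_[:-1]), columns=["State", "Coefficient"])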

| | State | Coefficient |
|---|---|---|
| 0 | Alabama | -7.8 |
| 1 | Alaska | -1.31 |
| 2 | Arizona | -4.3 |
| 3 | Arkansas | -5.74 |
| 4 | California | -35.19 |
| 5 | Colorado | -9.42 |
| 6 | Connecticut | -5.66 |
| 7 | Delaware | -2.17 |
| 8 | District of Columbia | -3.51 |
| 9 | Florida | -44.47 |


## Evaluation Metrics

![](https://brookewenig.com/img/IntroML/RMSE.png)

Let's compute the MSE and RMSE of our predictions on the test set using [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html?highlight=mean_squared_error) from `sklearn.metrics`.

In [41]:
from sklearn.metrics import mean_squared_error
import numpy as np

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f"MSE is {mse:.1f}, RMSE is {rmse:.1f}")
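For intuition, MSE is the mean of the squared differences between labels and predictions, and RMSE is its square root, which puts the error back in the units of the label (here, deaths). A small sketch on made-up numbers (`y_true` and `y_hat` are hypothetical) computes both by hand and checks the MSE against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up labels and predictions
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Squared errors are 4, 4 and 9, so MSE = 17 / 3
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(f"manual MSE={mse_manual:.3f}, sklearn MSE={mse_sklearn:.3f}")
print(f"RMSE={rmse_manual:.3f}")
```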

## Visualize Predictions

In [43]:
pred = pd.concat([test_df.reset_index(drop=True), pd.DataFrame(y_pred, columns=["predicted_deaths"])], axis=1)
pred

| | date | state | fips | cases | deaths | predicted_deaths |
|---|---|---|---|---|---|---|
| 0 | 2020-04-08 | Alabama | 1 | 2499 | 67 | 69.46 |
| 1 | 2020-04-08 | Alaska | 2 | 224 | 5 | 5.62 |
| 2 | 2020-04-08 | Arizona | 4 | 2726 | 80 | 79.98 |
| 3 | 2020-04-08 | Arkansas | 5 | 1077 | 18 | 27.56 |
| 4 | 2020-04-08 | California | 6 | 19043 | 506 | 553.53 |
| 5 | 2020-04-08 | Colorado | 8 | 5655 | 192 | 165.41 |
| 6 | 2020-04-08 | Connecticut | 9 | 8781 | 335 | 265.82 |
| 7 | 2020-04-08 | Delaware | 10 | 1116 | 19 | 32.33 |
| 8 | 2020-04-08 | District of Columbia | 11 | 1440 | 27 | 41.01 |
| 9 | 2020-04-08 | Florida | 12 | 15690 | 322 | 440.60 |


Voila! You have successfully built a machine learning pipeline using scikit-learn!

To keep exploring with scikit-learn, check out the datasets at [UCI ML Repository](https://archive.ics.uci.edu/ml/index.php) and [Kaggle](https://www.kaggle.com/)!