
Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model, and which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" information from the future?
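
The baseline step in the checklist can be sketched as follows for a regression target. This is a minimal illustration on synthetic data, not the project's actual baseline: the target values are made up, and mean absolute error is only one possible choice of metric.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in for a regression target (hypothetical values).
rng = np.random.default_rng(0)
y_train = rng.normal(loc=100.0, scale=20.0, size=500)
y_val = rng.normal(loc=100.0, scale=20.0, size=200)

# DummyRegressor ignores the features, so a placeholder column is enough.
X_train = np.zeros((len(y_train), 1))
X_val = np.zeros((len(y_val), 1))

# The mean baseline always predicts the training-set mean of the target.
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_val)

baseline_mae = mean_absolute_error(y_val, baseline_pred)
baseline_r2 = r2_score(y_val, baseline_pred)
print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Baseline R^2: {baseline_r2:.3f}")
```

A constant prediction cannot score above 0 in R^2 on the data it is evaluated on, so any real model should beat both numbers.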

## Reading

### ROC AUC
- [Machine Learning Meets Economics](http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/)
- [ROC curves and Area Under the Curve explained](https://www.dataschool.io/roc-curves-and-auc-explained/)
- [The philosophical argument for using ROC curves](https://lukeoakdenrayner.wordpress.com/2018/01/07/the-philosophical-argument-for-using-roc-curves/)

### Imbalanced Classes
- [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn)
- [Learning from Imbalanced Classes](https://www.svds.com/tbt-learning-imbalanced-classes/)

### Last lesson
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

In [0]:
import numpy as np
import pandas as pd
import seaborn as sns
import category_encoders as ce
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.linear_model import LinearRegression
from sklearn.utils.multiclass import unique_labels
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [0]:
df1 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Aotizhongxin_20130301-20170228.csv')
df2 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Changping_20130301-20170228.csv')
df3 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Dingling_20130301-20170228.csv')
df4 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Dongsi_20130301-20170228.csv')
df5 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Guanyuan_20130301-20170228.csv')
df6 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Gucheng_20130301-20170228.csv')
df7 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Huairou_20130301-20170228.csv')
df8 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Nongzhanguan_20130301-20170228.csv')
df9 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Shunyi_20130301-20170228.csv')
df10 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Tiantan_20130301-20170228.csv')
df11 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Wanliu_20130301-20170228.csv')
df12 = pd.read_csv('../data/PRSA_Data_20130301-20170228/PRSA_Data_Wanshouxigong_20130301-20170228.csv')
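
The twelve station files share a naming pattern, so they could also be read in a loop. The sketch below is one possible way to do that, demonstrated on two tiny made-up files in a temporary directory; the helper name `load_station_files`, the glob pattern, and the demo station names are assumptions for illustration. It uses `ignore_index=True`, which renumbers the rows, unlike a plain `pd.concat`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_station_files(data_dir, pattern="PRSA_Data_*.csv"):
    """Read every CSV in data_dir matching pattern and stack them into one frame."""
    paths = sorted(Path(data_dir).glob(pattern))
    frames = [pd.read_csv(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Demonstration with two tiny made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for station in ["StationA", "StationB"]:
        pd.DataFrame(
            {"No": [1, 2], "CO": [200.0, 300.0], "station": station}
        ).to_csv(Path(tmp) / f"PRSA_Data_{station}_demo.csv", index=False)
    demo = load_station_files(tmp)

print(demo.shape)
```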

In [0]:
df12.head()

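| | No | year | month | day | hour | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | wd | WSPM | station |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2013 | 3 | 1 | 0 | 9.0 | 9.0 | 6.0 | 17.0 | 200.0 | 62.0 | 0.3 | 1021.9 | -19.0 | 0.0 | WNW | 2.0 | Wanshouxigong |
| 1 | 2 | 2013 | 3 | 1 | 1 | 11.0 | 11.0 | 7.0 | 14.0 | 200.0 | 66.0 | -0.1 | 1022.4 | -19.3 | 0.0 | WNW | 4.4 | Wanshouxigong |
| 2 | 3 | 2013 | 3 | 1 | 2 | 8.0 | 8.0 | NaN | 16.0 | 200.0 | 59.0 | -0.6 | 1022.6 | -19.7 | 0.0 | WNW | 4.7 | Wanshouxigong |
| 3 | 4 | 2013 | 3 | 1 | 3 | 8.0 | 8.0 | 3.0 | 16.0 | NaN | NaN | -0.7 | 1023.5 | -20.9 | 0.0 | NW | 2.6 | Wanshouxigong |
| 4 | 5 | 2013 | 3 | 1 | 4 | 8.0 | 8.0 | 3.0 | NaN | 300.0 | 36.0 | -0.9 | 1024.1 | -21.7 | 0.0 | WNW | 2.5 | Wanshouxigong |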


In [0]:
all_df = [df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12]
df_combined = pd.concat(all_df)
df_combined.shape

(420768, 18)

In [0]:
print(df_combined.dtypes)
df_combined.describe(exclude='number')

No           int64
year         int64
month        int64
day          int64
hour         int64
PM2.5      float64
PM10       float64
SO2        float64
NO2        float64
CO         float64
O3         float64
TEMP       float64
PRES       float64
DEWP       float64
RAIN       float64
wd          object
WSPM       float64
station     object
dtype: object


| | wd | station |
|---|---|---|
| count | 418946 | 420768 |
| unique | 16 | 12 |
| top | NE | Shunyi |
| freq | 43335 | 35064 |


In [0]:
df_combined.describe()

| | No | year | month | day | hour | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | WSPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 412029.0 | 414319.0 | 411747.0 | 408652.0 | 400067.0 | 407491.0 | 420370.0 | 420375.0 | 420365.0 | 420378.0 | 420450.0 |
| mean | 17532.5 | 2014.66256 | 6.52293 | 15.729637 | 11.5 | 79.793428 | 104.602618 | 15.830835 | 50.638586 | 1230.766454 | 57.372271 | 13.538976 | 1010.746982 | 2.490822 | 0.064476 | 1.729711 |
| std | 10122.116943 | 1.177198 | 3.448707 | 8.800102 | 6.922195 | 80.822391 | 91.772426 | 21.650603 | 35.127912 | 1160.182716 | 56.661607 | 11.436139 | 10.474055 | 13.793847 | 0.821004 | 1.246386 |
| min | 1.0 | 2013.0 | 1.0 | 1.0 | 0.0 | 2.0 | 2.0 | 0.2856 | 1.0265 | 100.0 | 0.2142 | -19.9 | 982.4 | -43.4 | 0.0 | 0.0 |
| 25% | 8766.75 | 2014.0 | 4.0 | 8.0 | 5.75 | 20.0 | 36.0 | 3.0 | 23.0 | 500.0 | 11.0 | 3.1 | 1002.3 | -8.9 | 0.0 | 0.9 |
| 50% | 17532.5 | 2015.0 | 7.0 | 16.0 | 11.5 | 55.0 | 82.0 | 7.0 | 43.0 | 900.0 | 45.0 | 14.5 | 1010.4 | 3.1 | 0.0 | 1.4 |
| 75% | 26298.25 | 2016.0 | 10.0 | 23.0 | 17.25 | 111.0 | 145.0 | 20.0 | 71.0 | 1500.0 | 82.0 | 23.3 | 1019.0 | 15.1 | 0.0 | 2.2 |
| max | 35064.0 | 2017.0 | 12.0 | 31.0 | 23.0 | 999.0 | 999.0 | 500.0 | 290.0 | 10000.0 | 1071.0 | 41.6 | 1042.8 | 29.1 | 72.5 | 13.2 |


In [0]:
df_combined['year'].value_counts()

2016    105408
2015    105120
2014    105120
2013     88128
2017     16992
Name: year, dtype: int64

In [0]:
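# Fill every missing value with 0. Note that this also zero-fills the target
# column (CO) and turns missing wind directions (wd) into an extra 0 category.
df_combined = df_combined.fillna(0)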

In [0]:
df_combined.isnull().sum()

No         0
year       0
month      0
day        0
hour       0
PM2.5      0
PM10       0
SO2        0
NO2        0
CO         0
O3         0
TEMP       0
PRES       0
DEWP       0
RAIN       0
wd         0
WSPM       0
station    0
dtype: int64
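
Filling with 0 removes every null, but it also writes 0 into the target column and adds a 17th value to `wd`, as the later outputs show. One alternative, sketched here on a tiny made-up frame rather than the real data, is to drop only the rows whose target is missing and leave feature gaps for an imputer in the modeling pipeline. The frame `demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with gaps in both the target and the features.
demo = pd.DataFrame({
    "CO": [200.0, np.nan, 300.0, 500.0],
    "NO2": [17.0, 14.0, np.nan, 16.0],
    "wd": ["WNW", "NW", None, "WNW"],
})

target = "CO"

# Rows without a target value carry no label to learn from, so drop them.
labelled = demo.dropna(subset=[target])

# Feature gaps are left as NaN here; an imputer inside the modeling
# pipeline can fill them using statistics from the training split only.
print(labelled.isnull().sum())
```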

In [0]:
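# Keep an independent copy; plain assignment would only create a second name
# for the same DataFrame.
df_combined2 = df_combined.copy()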

In [0]:
df_combined.describe()

| | No | year | month | day | hour | PM2.5 | PM10 | SO2 | NO2 | CO | O3 | TEMP | PRES | DEWP | RAIN | WSPM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 | 420768.0 |
| mean | 17532.5 | 2014.66256 | 6.52293 | 15.729637 | 11.5 | 78.136185 | 102.999401 | 15.491432 | 49.180449 | 1170.215042 | 55.561935 | 13.526169 | 1009.802938 | 2.488436 | 0.064416 | 1.728403 |
| std | 10122.116943 | 1.177198 | 3.448707 | 8.800102 | 6.922195 | 80.784157 | 91.968604 | 21.539656 | 35.639163 | 1162.179098 | 56.655252 | 11.438304 | 32.602211 | 13.787455 | 0.820626 | 1.246821 |
| min | 1.0 | 2013.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -19.9 | 0.0 | -43.4 | 0.0 | 0.0 |
| 25% | 8766.75 | 2014.0 | 4.0 | 8.0 | 5.75 | 19.0 | 34.0 | 2.0 | 21.0 | 400.0 | 8.0 | 3.1 | 1002.2 | -8.9 | 0.0 | 0.9 |
| 50% | 17532.5 | 2015.0 | 7.0 | 16.0 | 11.5 | 53.0 | 81.0 | 7.0 | 42.0 | 800.0 | 43.0 | 14.5 | 1010.4 | 3.0 | 0.0 | 1.4 |
| 75% | 26298.25 | 2016.0 | 10.0 | 23.0 | 17.25 | 109.0 | 144.0 | 19.0 | 70.0 | 1500.0 | 80.0 | 23.2 | 1019.0 | 15.1 | 0.0 | 2.2 |
| max | 35064.0 | 2017.0 | 12.0 | 31.0 | 23.0 | 999.0 | 999.0 | 500.0 | 290.0 | 10000.0 | 1071.0 | 41.6 | 1042.8 | 29.1 | 72.5 | 13.2 |


In [0]:
# Hold out the most recent year (2017) as the test set.
testing = df_combined[df_combined['year'] == 2017]

In [0]:
testing['station'].value_counts()

Aotizhongxin     1416
Wanshouxigong    1416
Nongzhanguan     1416
Tiantan          1416
Gucheng          1416
Huairou          1416
Guanyuan         1416
Dongsi           1416
Changping        1416
Shunyi           1416
Wanliu           1416
Dingling         1416
Name: station, dtype: int64

In [0]:
# Everything before 2017 is available for training and validation.
training = df_combined[df_combined['year'] < 2017]
print(training.shape)
training['year'].value_counts()

(403776, 18)


2016    105408
2015    105120
2014    105120
2013     88128
Name: year, dtype: int64

In [0]:
# Randomly split the pre-2017 rows into 80% train and 20% validation.
train, val = train_test_split(training,
                              test_size=0.20,
                              random_state=42)
train.shape, val.shape

((323020, 18), (80756, 18))
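
Because the rows are hourly readings, a random split places neighboring hours in both train and validation. A time-based split is one alternative; it mirrors how 2017 is already held out for testing. The sketch below runs on a small made-up frame, and the helper `split_by_year` and the choice of 2016 as the validation year are assumptions for illustration only.

```python
import pandas as pd

# Made-up records spanning several years.
demo = pd.DataFrame({
    "year": [2013, 2014, 2015, 2016, 2016, 2017],
    "CO": [200.0, 300.0, 400.0, 500.0, 600.0, 700.0],
})


def split_by_year(df, val_year, test_year):
    """Train on years before val_year, validate on val_year, test on test_year."""
    train = df[df["year"] < val_year]
    val = df[df["year"] == val_year]
    test = df[df["year"] == test_year]
    return train, val, test


train_demo, val_demo, test_demo = split_by_year(demo, val_year=2016, test_year=2017)
print(len(train_demo), len(val_demo), len(test_demo))
```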

In [0]:
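# Predict carbon monoxide concentration (CO); this is a regression problem.
target = 'CO'

# Features and target for the train, validation, and test splits.
X_train = train.drop(columns=target)
y_train = train[target]

X_val = val.drop(columns=target)
y_val = val[target]

X_test = testing.drop(columns=target)
y_test = testing[target]

# Full pre-2017 data (train + validation) for use with cross-validation.
X_train_CV = training.drop(columns=target)
y_train_CV = training[target]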

In [0]:
X_train.dtypes

No           int64
year         int64
month        int64
day          int64
hour         int64
PM2.5      float64
PM10       float64
SO2        float64
NO2        float64
O3         float64
TEMP       float64
PRES       float64
DEWP       float64
RAIN       float64
wd          object
WSPM       float64
station     object
dtype: object

In [0]:
training.describe(exclude='number').T.sort_values(by='unique')

| | count | unique | top | freq |
|---|---|---|---|---|
| station | 403776 | 12 | Shunyi | 33648 |
| wd | 403776 | 17 | NE | 40049 |


In [0]:
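from sklearn.ensemble import RandomForestRegressor

# Encode the categorical columns (wd, station) as integers, impute any
# remaining missing values with the column mean, then fit a random forest.
pipe = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestRegressor(n_estimators=100, n_jobs=-1)
)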



In [0]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['wd', 'station'], drop_invariant=False,
                                handle_missing='value', handle_unknown='value',
                                mapping=[{'col': 'wd', 'data_type': dtype('O'),
                                          'mapping': N       1
W       2
NNE     3
NW      4
NNW     5
E       6
SSE     7
WNW     8
SSW     9
ENE    10
SW     11
NE     12
ESE    13
SE     14
WSW    15
S      16
0      17
NaN    -2
dtype: int64},
                                         {'col': 'station',
                                          'data_type': dtype('O'),
                                          'mapping': Dongsi            1
Wanshouxi...
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=True, criterion='mse',
                                       max_depth=None, max_features='auto',
                                       max_leaf_nodes=

In [0]:
y_pred = pipe.predict(X_val)

In [0]:
# For a regressor, score() returns R^2 on the given data.
pipe.score(X_val, y_val)

0.8949036147562612
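
The score above is R^2, which is unitless. An error metric in the target's own units can be reported next to it. The sketch below computes R^2, MAE, and RMSE on made-up arrays that stand in for `y_val` and `y_pred`; the values are hypothetical and are not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions standing in for y_val and y_pred.
y_true = np.array([200.0, 300.0, 500.0, 900.0, 1500.0])
y_hat = np.array([250.0, 280.0, 450.0, 1000.0, 1400.0])

r2 = r2_score(y_true, y_hat)
mae = mean_absolute_error(y_true, y_hat)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))   # penalizes large errors more

print(f"R^2:  {r2:.3f}")
print(f"MAE:  {mae:.1f}")
print(f"RMSE: {rmse:.1f}")
```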