<a href="https://colab.research.google.com/github/elliotgunn/DS-Unit-2-Applied-Modeling/blob/master/assignment_applied_modeling_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [x] Choose your target. Which column in your tabular dataset will you predict?
- [x] Choose which observations you will use to train, validate, and test your model. And which observations, if any, to exclude.
- [x] Determine whether your problem is regression or classification.
- [x] Choose your evaluation metric.
- [x] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [x] Begin to clean and explore your data.
- [x] Choose which features, if any, to exclude. Would some features "leak" information from the future? **No**

## Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), **by Lambda DS3 student** Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python package:
    # category_encoders, version >= 2.0
    !pip install --upgrade category_encoders
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Applied-Modeling.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module1')

## Research

https://www.sciencedirect.com/science/article/pii/S0968090X18311021

https://academic.oup.com/tse/advance-article/doi/10.1093/tse/tdy001/5306170

## Import multiple files from directory and `concat` into single df

Examples:

- https://stackoverflow.com/questions/20908018/import-multiple-excel-files-into-python-pandas-and-concatenate-them-into-one-dat
- https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe


In [0]:
# TRAIN: January 2014 through April 2017
df_2014_to_2017_url = 'https://www.toronto.ca/ext/open_data/catalog/data_set_files/Subway%20&%20SRT%20Logs%20(Jan01_14%20to%20April30_17).xlsx'

# TRAIN: May and June 2017
df_2017_May_url = "https://www.toronto.ca/ext/open_data/catalog/data_set_files/Subway%20&%20SRT%20Logs%20(May%202017).xlsx"

df_2017_June_url = "https://www.toronto.ca/ext/open_data/catalog/data_set_files/SubwayDelay201706.xlsx"


# TEST
df_2019_Jan_url = "https://www.toronto.ca/ext/open_data/catalog/data_set_files/Subway_&_SRT_Logs_January_2019.xlsx"

df_2019_Feb_url = "https://www.toronto.ca/ext/open_data/catalog/data_set_files/Subway_&_SRT_Logs_February2019.xlsx"

import pandas as pd

df1 = pd.read_excel(df_2014_to_2017_url)
df2 = pd.read_excel(df_2017_May_url)
df3 = pd.read_excel(df_2017_June_url)

test1 = pd.read_excel(df_2019_Jan_url)
test2 = pd.read_excel(df_2019_Feb_url)

In [0]:
print(df1.shape)
print(df2.shape)
print(df3.shape)

(69016, 10)
(1634, 10)
(1568, 10)


Keep in mind that, unlike the `append()` and `extend()` methods of Python lists, the `append()` method in pandas does not modify the original object; instead, it creates a new object with the combined data. It is also not very efficient, because it involves creating a new index and data buffer. Thus, if you plan to do multiple append operations, it is generally better to build a list of DataFrames and pass them all at once to the `concat()` function. Source: [Python Data Science Handbook, "Combining Datasets: Concat and Append"](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html). Note that `DataFrame.append()` was removed in pandas 2.0, so `concat()` is the only option on current versions.
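As a minimal sketch of that advice, the cell below stacks a list of small frames with a single `concat()` call. The frames and the `delay_minutes` column are toy stand-ins invented for illustration, not the TTC logs.

```python
import pandas as pd

# Three small stand-in frames with identical columns (toy data, not the TTC logs)
frames = [
    pd.DataFrame({"Code": ["MUSC", "TUSC"], "delay_minutes": [3, 5]}),
    pd.DataFrame({"Code": ["MUSC"], "delay_minutes": [0]}),
    pd.DataFrame({"Code": ["PUMEL", "MUSC"], "delay_minutes": [7, 4]}),
]

# One concat call over the whole list; ignore_index=True builds a fresh 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Without `ignore_index=True`, each frame keeps its own row labels, so the combined index would contain repeated values.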



In [0]:
# in future: try to do this programmatically...

train_dfs = [df1, df2, df3]
train = pd.concat(train_dfs)

test_dfs = [test1, test2]
test = pd.concat(test_dfs)

# NOTE: the name `all` shadows Python's built-in all(); it is kept because later cells use it
all_dfs = train_dfs + test_dfs
all = pd.concat(all_dfs)

train.shape, test.shape, all.shape

((72218, 10), (3469, 10), (75687, 10))
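One possible way to do the loading programmatically is a small helper that reads every source in a list and concatenates the results. This is only a sketch: `load_and_concat` and the `fake_files` dictionary are hypothetical names, and the demo swaps in an in-memory reader so it runs without downloading anything.

```python
import pandas as pd

def load_and_concat(sources, reader=pd.read_excel):
    """Read every source with `reader` and stack the results into one DataFrame."""
    frames = [reader(source) for source in sources]
    return pd.concat(frames, ignore_index=True)

# Demo with an in-memory stand-in for pd.read_excel so the sketch runs offline.
# The file names and contents here are made up.
fake_files = {
    "jan.xlsx": pd.DataFrame({"Code": ["MUSC", "TUSC"]}),
    "feb.xlsx": pd.DataFrame({"Code": ["MUSC"]}),
}
demo = load_and_concat(fake_files.keys(), reader=fake_files.__getitem__)
print(demo.shape)
```

With the default reader, passing the list of training URLs defined earlier would presumably replace the separate `pd.read_excel` calls. Unlike the `pd.concat` calls above, this sketch resets the index.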

## Baselines

In [0]:
# majority class baseline

all['Code'].value_counts(normalize=True)

In [0]:
# majority: MUSC     0.178512

majority_class = all['Code'].mode()[0]
pred = [majority_class] * len(all['Code'])

# use metric: accuracy for classification
from sklearn.metrics import accuracy_score
accuracy_score(all['Code'], pred)

0.1785115013146247
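The baseline above is computed on the combined train and test data. An alternative, sketched here with toy labels, is to learn the majority class from the training labels only and score it on held-out labels, using scikit-learn's `DummyClassifier`. The `*_toy` variables are invented for illustration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels standing in for the 'Code' column (not real TTC data)
y_train_toy = pd.Series(["MUSC", "MUSC", "TUSC", "MUSC", "PUMEL"])
y_val_toy = pd.Series(["MUSC", "TUSC", "MUSC", "PUMEL"])

# DummyClassifier ignores the features, so a placeholder column is enough
X_train_toy = [[0]] * len(y_train_toy)
X_val_toy = [[0]] * len(y_val_toy)

# Learn the most frequent class from the training labels only
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

# Score the constant prediction on the held-out labels
baseline_acc = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(baseline_acc)  # 2 of the 4 validation labels are 'MUSC', so 0.5
```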

## Choose your target. Which column in your tabular dataset will you predict?

1. Predict delay class:  `Code` (Classification)
2. Predict delay time:  `Min Time` (Regression)
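For the regression target, the matching baseline is to predict the training mean for every row. The sketch below uses made-up delay durations and mean absolute error as the metric; MAE is an assumption here, since the notebook does not yet choose a regression metric.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy delay durations in minutes (hypothetical values)
y_train_toy = np.array([0, 3, 5, 4, 8])
y_val_toy = np.array([2, 6, 4])

# Predict the training mean for every validation row
mean_guess = y_train_toy.mean()  # (0 + 3 + 5 + 4 + 8) / 5 = 4.0
y_pred_toy = np.full(len(y_val_toy), mean_guess)

# Average absolute error of the constant prediction
mae = mean_absolute_error(y_val_toy, y_pred_toy)
print(mae)
```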


## Feature engineering approaches:

> But Stockholmståg has found a way to use that data to also predict the ripple effect a single delay has on its entire system. An accident somewhere along its route means a train will be delayed before it rolls into the next station. But that also affects the train behind it, and the train behind it, and so forth. Eventually a single incident can throw off the scheduling of an entire commuter system, even if the original source of the disruption has already been resolved. [link](https://gizmodo.com/a-new-algorithm-can-predict-subway-delays-two-hours-bef-1729539784)
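One hedged way to start capturing that ripple effect is a lag feature: the time elapsed since the previous delay on the same line. The sketch below uses a toy log; `Line`, `timestamp`, and `minutes_since_prev_delay` are hypothetical column names, and the real data would first need its date and time columns combined into a single timestamp.

```python
import pandas as pd

# Toy delay log; 'Line' and 'timestamp' are hypothetical column names
log = pd.DataFrame({
    "Line": ["YU", "BD", "YU", "YU", "BD"],
    "timestamp": pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 08:05", "2019-01-01 08:10",
        "2019-01-01 09:30", "2019-01-01 08:20",
    ]),
})

# Sort chronologically so "previous" means the preceding delay in time
log = log.sort_values("timestamp").reset_index(drop=True)

# Minutes since the previous delay on the same line (NaN for a line's first delay)
gap = log.groupby("Line")["timestamp"].diff()
log["minutes_since_prev_delay"] = gap.dt.total_seconds() / 60
print(log)
```

Because each value depends only on earlier rows, a feature like this does not use information from the future.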



## Wrangle + train/val/test + pipeline

In [0]:
def wrangle(X):
  
  X = X.copy()
  
  # Drop rows whose 'Code' class appears only once
  # (the stratified split below needs at least 2 rows per class)
  X = X[X.groupby('Code').Day.transform(len) > 1]
  
  # Convert 'Date' to datetime
  X['Date'] = pd.to_datetime(X['Date'], infer_datetime_format=True)
  
  # Extract components from 'Date', then drop original column
  X['year'] = X['Date'].dt.year
  X['month'] = X['Date'].dt.month
  X['day'] = X['Date'].dt.day
  X = X.drop(columns='Date')
  
  # 'Time' is a timestamp: we have hour and minute information
  # Extract components from 'Time', then drop original column
  X['Time'] = pd.to_datetime(X['Time'], infer_datetime_format=True)
  X['hour'] = X['Time'].dt.hour
  X['minute'] = X['Time'].dt.minute
  X = X.drop(columns='Time')
  
  # Return wrangled dataframe
  return X

train = wrangle(train)
test = wrangle(test)

In [0]:
from sklearn.model_selection import train_test_split

# Split train into train & val. Make val the same size as test.
train, val = train_test_split(train, test_size=len(test),  
                              stratify=train['Code'], random_state=42)

# arrange data into X features matrix and y target vector
target = 'Code'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test.drop(columns=target)
y_test = test[target]

In [0]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.4238122827346466


## Eval metric

In [0]:
# generate predicted
y_pred = pipeline.predict(X_val)



In [0]:
# 'Code' is multiclass, so precision_score and recall_score need an explicit `average` argument

from sklearn.metrics import precision_score, recall_score

print('Precision Score (micro): ', precision_score(y_val, y_pred, average='micro'))
print('Precision Score (macro): ', precision_score(y_val, y_pred, average='macro'))
print('Precision Score (weighted): ', precision_score(y_val, y_pred, average='weighted'))
print('')
print('Recall Score (micro): ', recall_score(y_val, y_pred, average='micro'))
print('Recall Score (macro): ', recall_score(y_val, y_pred, average='macro'))
print('Recall Score (weighted): ', recall_score(y_val, y_pred, average='weighted'))

Precision Score (micro):  0.4238122827346466
Precision Score (macro):  0.17160970923370134
Precision Score (weighted):  0.3726795602253628

Recall Score (micro):  0.4238122827346466
Recall Score (macro):  0.13043691600932608
Recall Score (weighted):  0.4238122827346466




In [0]:
from sklearn.metrics import classification_report
print(classification_report(y_val, y_pred))
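The micro, macro, and weighted averages printed above differ only in how per-class scores are combined. The toy example below (labels invented for illustration) shows why micro-averaged precision equals accuracy for single-label multiclass problems, and why a class that is never predicted pulls the macro average down. `zero_division=0` sets the precision of such a class to 0 without a warning.

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy labels: class 'c' is never predicted
y_true_toy = ["a", "a", "a", "a", "b", "b", "c"]
y_hat_toy = ["a", "a", "a", "b", "b", "a", "a"]

# Per-class precision: a = 3/5, b = 1/2, c = 0 (no predictions for 'c')
acc = accuracy_score(y_true_toy, y_hat_toy)
micro = precision_score(y_true_toy, y_hat_toy, average="micro", zero_division=0)
macro = precision_score(y_true_toy, y_hat_toy, average="macro", zero_division=0)
weighted = precision_score(y_true_toy, y_hat_toy, average="weighted", zero_division=0)

print(acc, micro)   # both 4/7: micro pools all true positives and all predictions
print(macro)        # (0.6 + 0.5 + 0) / 3: every class counts equally
print(weighted)     # (0.6*4 + 0.5*2 + 0*1) / 7: classes weighted by support
```

This matches the pattern in the notebook output, where micro precision, micro recall, and validation accuracy are the same number while the macro scores are much lower.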