<a href="https://colab.research.google.com/github/KryssyCo/DS-Unit-2-Applied-Modeling/blob/master/Krista_Shepard_DSPT2_U2S7M1_Assignment_1_Applied_Modeling_Caterpillar_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model. And which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" information from the future?

## Reading

### ROC AUC
- [Machine Learning Meets Economics](http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/)
- [ROC curves and Area Under the Curve explained](https://www.dataschool.io/roc-curves-and-auc-explained/)
- [The philosophical argument for using ROC curves](https://lukeoakdenrayner.wordpress.com/2018/01/07/the-philosophical-argument-for-using-roc-curves/)

### Imbalanced Classes
- [imbalanced-learn](https://github.com/scikit-learn-contrib/imbalanced-learn)
- [Learning from Imbalanced Classes](https://www.svds.com/tbt-learning-imbalanced-classes/)

### Last lesson
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

### I chose a new dataset from Kaggle to work with. I had worked with it the last time we covered this material in DS5 and didn't understand it then, so it will be fun to look at it again with fresh, healthy, more educated eyes.

In [1]:
!wget https://github.com/KryssyCo/DS-Unit-2-Applied-Modeling/blob/master/caterpillar-tube-pricing.zip?raw=true

--2019-10-10 18:17:10--  https://github.com/KryssyCo/DS-Unit-2-Applied-Modeling/blob/master/caterpillar-tube-pricing.zip?raw=true
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/KryssyCo/DS-Unit-2-Applied-Modeling/raw/master/caterpillar-tube-pricing.zip [following]
--2019-10-10 18:17:10--  https://github.com/KryssyCo/DS-Unit-2-Applied-Modeling/raw/master/caterpillar-tube-pricing.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/KryssyCo/DS-Unit-2-Applied-Modeling/master/caterpillar-tube-pricing.zip [following]
--2019-10-10 18:17:10--  https://raw.githubusercontent.com/KryssyCo/DS-Unit-2-Applied-Modeling/master/caterpillar-tube-pricing.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.1

In [2]:
!ls *.zip

data.zip


In [3]:
!unzip /content/caterpillar-tube-pricing.zip?raw=true

Archive:  /content/caterpillar-tube-pricing.zip?raw=true
replace sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: sample_submission.csv   
  inflating: data.zip                


In [5]:
!unzip data.zip

Archive:  data.zip
replace competition_data/bill_of_materials.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: competition_data/bill_of_materials.csv  
  inflating: competition_data/comp_adaptor.csv  
  inflating: competition_data/comp_boss.csv  
  inflating: competition_data/comp_elbow.csv  
  inflating: competition_data/comp_float.csv  
  inflating: competition_data/comp_hfl.csv  
  inflating: competition_data/comp_nut.csv  
  inflating: competition_data/comp_other.csv  
  inflating: competition_data/comp_sleeve.csv  
  inflating: competition_data/comp_straight.csv  
  inflating: competition_data/comp_tee.csv  
  inflating: competition_data/comp_threaded.csv  
  inflating: competition_data/components.csv  
  inflating: competition_data/specs.csv  
  inflating: competition_data/test_set.csv  
  inflating: competition_data/train_set.csv  
  inflating: competition_data/tube.csv  
  inflating: competition_data/tube_end_form.csv  
  inflating: competition_data/type_component.csv 
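
The two `unzip` steps above (the outer archive, then the nested `data.zip`) can also be done from Python with the standard-library `zipfile` module, which overwrites existing files without prompting. This is only a sketch: the helper name `extract_nested` and the destination path are hypothetical, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_nested(outer_zip, dest):
    """Extract outer_zip into dest, then extract any .zip members it contained."""
    dest = Path(dest)
    with zipfile.ZipFile(outer_zip) as outer:
        outer.extractall(dest)
        # Remember which members are themselves archives (here: data.zip)
        inner_names = [n for n in outer.namelist() if n.endswith('.zip')]
    for name in inner_names:
        with zipfile.ZipFile(dest / name) as inner:
            inner.extractall(dest)
    # Return the CSV files now available, relative to dest
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob('*.csv'))
```

Calling `extract_nested('caterpillar-tube-pricing.zip?raw=true', '.')` should leave the same `competition_data/*.csv` files as the shell commands, assuming the download was saved under that name.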

In [0]:
from glob import glob
import pandas as pd

In [7]:
# Get filenames and shapes
for path in glob('competition_data/*.csv'):
  df = pd.read_csv(path)
  print(path, df.shape)

competition_data/type_end_form.csv (8, 2)
competition_data/type_component.csv (29, 2)
competition_data/specs.csv (21198, 11)
competition_data/comp_other.csv (1001, 3)
competition_data/comp_tee.csv (4, 14)
competition_data/tube_end_form.csv (27, 2)
competition_data/comp_elbow.csv (178, 16)
competition_data/comp_nut.csv (65, 11)
competition_data/train_set.csv (30213, 8)
competition_data/components.csv (2048, 3)
competition_data/comp_adaptor.csv (25, 20)
competition_data/bill_of_materials.csv (21198, 17)
competition_data/type_connection.csv (14, 2)
competition_data/comp_sleeve.csv (50, 10)
competition_data/comp_straight.csv (361, 12)
competition_data/comp_threaded.csv (194, 32)
competition_data/comp_boss.csv (147, 15)
competition_data/tube.csv (21198, 16)
competition_data/comp_hfl.csv (6, 9)
competition_data/test_set.csv (30235, 8)
competition_data/comp_float.csv (16, 7)


In [8]:
# Install category_encoders
!pip install category_encoders



In [0]:
# Import libraries
import category_encoders as ce 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, mean_squared_error


### - [X] Choose your target. Which column in your tabular dataset will you predict? 
**Cost**

### - [X] Determine whether your problem is regression or classification.
**Regression**

In [10]:
# Explore train set
df = pd.read_csv('/content/competition_data/train_set.csv')
df.head()

| | tube_assembly_id | supplier | quote_date | annual_usage | min_order_quantity | bracket_pricing | quantity | cost |
|---|---|---|---|---|---|---|---|---|
| 0 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 1 | 21.905933 |
| 1 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 2 | 12.341214 |
| 2 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 5 | 6.601826 |
| 3 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 10 | 4.68777 |
| 4 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 25 | 3.541561 |


### - [X] Begin to clean and explore your data.


In [0]:
# Compare dates in the first 10 rows of the train and test sets
# The dates overlap in train and test
trainval = pd.read_csv('competition_data/train_set.csv')
test = pd.read_csv('competition_data/test_set.csv')

In [12]:
trainval.head(10)

| | tube_assembly_id | supplier | quote_date | annual_usage | min_order_quantity | bracket_pricing | quantity | cost |
|---|---|---|---|---|---|---|---|---|
| 0 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 1 | 21.905933 |
| 1 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 2 | 12.341214 |
| 2 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 5 | 6.601826 |
| 3 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 10 | 4.68777 |
| 4 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 25 | 3.541561 |
| 5 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 50 | 3.224406 |
| 6 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 100 | 3.082521 |
| 7 | TA-00002 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 250 | 2.99906 |
| 8 | TA-00004 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 1 | 21.972702 |
| 9 | TA-00004 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 2 | 12.407983 |


In [13]:
test.head(10)

| | id | tube_assembly_id | supplier | quote_date | annual_usage | min_order_quantity | bracket_pricing | quantity |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 1 |
| 1 | 2 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 2 |
| 2 | 3 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 5 |
| 3 | 4 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 10 |
| 4 | 5 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 25 |
| 5 | 6 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 50 |
| 6 | 7 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 100 |
| 7 | 8 | TA-00001 | S-0066 | 2013-06-23 | 0 | 0 | Yes | 250 |
| 8 | 9 | TA-00003 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 1 |
| 9 | 10 | TA-00003 | S-0066 | 2013-07-07 | 0 | 0 | Yes | 2 |


In [0]:
# Parse quote_date as datetime in both train and test
# (infer_datetime_format is deprecated in pandas 2.x; ISO dates parse without it)
trainval['quote_date'] = pd.to_datetime(trainval['quote_date'])

test['quote_date'] = pd.to_datetime(test['quote_date'])

In [15]:
# Explore quote_date info.
trainval['quote_date'].describe()

count                   30213
unique                   1781
top       2013-10-01 00:00:00
freq                     2877
first     1982-09-22 00:00:00
last      2017-01-01 00:00:00
Name: quote_date, dtype: object

In [16]:
# There is definite overlap in the two data sets
test['quote_date'].describe()

count                   30235
unique                   1778
top       2013-09-01 00:00:00
freq                     2992
first     1985-11-16 00:00:00
last      2017-01-01 00:00:00
Name: quote_date, dtype: object
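
The two `describe()` outputs suggest that the train and test date ranges overlap almost entirely. A minimal sketch of how that could be checked directly, assuming toy Series in place of `trainval['quote_date']` and `test['quote_date']`; the helper name `date_range_overlap` is hypothetical.

```python
import pandas as pd


def date_range_overlap(a, b):
    """Return (start, end) of the overlap between two datetime Series, or None."""
    start = max(a.min(), b.min())
    end = min(a.max(), b.max())
    return (start, end) if start <= end else None


# Toy stand-ins built from the first/top/last dates reported above
train_dates = pd.to_datetime(pd.Series(['1982-09-22', '2013-10-01', '2017-01-01']))
test_dates = pd.to_datetime(pd.Series(['1985-11-16', '2013-09-01', '2017-01-01']))

overlap = date_range_overlap(train_dates, test_dates)
print(overlap)
```

Because the ranges overlap, splitting by date would probably not mimic how the test set was built, which is one reason to look at `tube_assembly_id` next.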

In [18]:
# Check whether the test set contains different tube assembly IDs

# Collect the unique tube assembly IDs in trainval and in test

trainval_tube_assembly_id = trainval['tube_assembly_id'].unique()
test_tube_assembly_id = test['tube_assembly_id'].unique()
len(trainval_tube_assembly_id), len(test_tube_assembly_id)

(8855, 8856)

In [19]:
# Check whether any assembly IDs appear in both trainval and test

set(trainval_tube_assembly_id) & set(test_tube_assembly_id)

set()

### - [X] Choose which observations you will use to train, validate, and test your model. And which observations, if any, to exclude.

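I am going to split on tube_assembly_id. The train and test sets share no assembly IDs, so I split the unique IDs from the training data into train and validation groups. That way every validation row belongs to an assembly the model has not seen, just like the test set.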

In [0]:
train_tube_assembly_id, val_tube_assembly_id = train_test_split(
    trainval_tube_assembly_id, random_state=42
)

In [28]:
# Look at the number of train and validation assembly IDs

len(train_tube_assembly_id), len(val_tube_assembly_id)

(6641, 2214)

In [27]:
train = trainval[trainval.tube_assembly_id.isin(train_tube_assembly_id)]
val = trainval[trainval.tube_assembly_id.isin(val_tube_assembly_id)]
train.shape, val.shape, trainval.shape

((22628, 8), (7585, 8), (30213, 8))

In [32]:
# Make sure the train and val row counts add up to the length of trainval

len(train) + len(val) == len(trainval)

True
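
The idea behind the split above is that all rows for one assembly land on the same side. Here is a self-contained sketch of that idea on a toy frame. It swaps in a plain NumPy shuffle instead of `train_test_split`, and the helper `split_by_group`, its `val_fraction` parameter, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_group(df, group_col, val_fraction=0.25, seed=42):
    """Split df so each value of group_col lands entirely in train or in val."""
    groups = df[group_col].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(groups)
    n_val = int(round(len(shuffled) * val_fraction))
    val_groups = list(shuffled[:n_val])
    is_val = df[group_col].isin(val_groups)
    return df[~is_val], df[is_val]


toy = pd.DataFrame({
    'tube_assembly_id': ['TA-1', 'TA-1', 'TA-2', 'TA-2', 'TA-3', 'TA-4'],
    'quantity': [1, 2, 1, 5, 1, 1],
})
train_toy, val_toy = split_by_group(toy, 'tube_assembly_id')

# No assembly appears on both sides, and no rows are lost
assert set(train_toy['tube_assembly_id']).isdisjoint(val_toy['tube_assembly_id'])
assert len(train_toy) + len(val_toy) == len(toy)
```

The 25% validation fraction mirrors the default `test_size` of `train_test_split`, which is consistent with the (6641, 2214) split of IDs above.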

### - [X] Choose your evaluation metric.
**Root mean squared log error (RMSLE) and root mean squared error (RMSE)**

In [0]:
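# Define the 'rmsle' and 'rmse' metric functions

import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

def rmsle(y_true, y_pred):
  # Root mean squared log error: penalizes relative (ratio) errors
  return np.sqrt(mean_squared_log_error(y_true, y_pred))

def rmse(y_true, y_pred):
  # Root mean squared error, in the same units as the target
  return np.sqrt(mean_squared_error(y_true, y_pred))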
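
To see how the two metrics differ, it may help to write them out with NumPy alone. RMSLE is simply RMSE computed on `log(1 + y)`, so it responds to ratio errors rather than absolute ones. The function names `rmse_numpy` and `rmsle_numpy` and the toy values below are illustrative, not part of the notebook.

```python
import numpy as np


def rmse_numpy(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


def rmsle_numpy(y_true, y_pred):
    # RMSLE is RMSE applied after the log1p transform, log(1 + y)
    return rmse_numpy(np.log1p(y_true), np.log1p(y_pred))


# Toy values: both predictions are twice the true cost
small_rmse, large_rmse = rmse_numpy([10], [20]), rmse_numpy([100], [200])
small_rmsle, large_rmsle = rmsle_numpy([10], [20]), rmsle_numpy([100], [200])
print(small_rmse, large_rmse)     # absolute error grows tenfold
print(small_rmsle, large_rmsle)   # log error stays close to the same value
```

Since `cost` spans a wide range of values, a ratio-based metric like RMSLE keeps the few very expensive quotes from dominating the score.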

### - [X] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.

In [35]:
# Mean baseline: predict the training-set mean cost for every validation row
target = 'cost'
y_train = train[target]
y_val = val[target]
y_pred = np.full_like(y_val, fill_value=y_train.mean())
print('Validation RMSLE, Mean Baseline:', rmsle(y_val, y_pred))

Validation RMSLE, Mean Baseline: 0.9418101276064408


In [36]:
# Same mean baseline, scored with RMSE
target = 'cost'
y_train = train[target]
y_val = val[target]
y_pred = np.full_like(y_val, fill_value=y_train.mean())
print('Validation RMSE, Mean Baseline:', rmse(y_val, y_pred))

Validation RMSE, Mean Baseline: 31.56520559484162


In [37]:
# Decided to compute R^2 as well because I understand the output better

from sklearn.metrics import r2_score
print('Validation R^2, Mean Baseline:', r2_score(y_val, y_pred))

Validation R^2, Mean Baseline: -4.701447715138585e-06
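
An R^2 this close to zero is what a mean baseline should give. R^2 compares the model's squared error with the squared error of predicting the validation mean. Predicting the training mean instead is almost the same, only slightly worse, so the score comes out just below zero. A small NumPy sketch with made-up numbers; the `r2` helper and the toy arrays are hypothetical.

```python
import numpy as np


def r2(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting y_true's own mean
    return 1 - ss_res / ss_tot


y_train_toy = np.array([2.0, 4.0, 6.0, 9.0])  # mean is 5.25
y_val_toy = np.array([3.0, 5.0, 7.0])         # mean is 5.0

train_mean_baseline = np.full(len(y_val_toy), y_train_toy.mean())
val_mean_baseline = np.full(len(y_val_toy), y_val_toy.mean())

print(r2(y_val_toy, train_mean_baseline))  # slightly below 0
print(r2(y_val_toy, val_mean_baseline))    # exactly 0
```

Any model worth keeping should therefore score clearly above zero on validation R^2.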


### - [X] Begin to choose which features, if any, to exclude. Would some features "leak" information from the future?

To begin, I am going to use only one feature: quantity.