# Final Project


For your final project, you will build a classifier for
the **Backorder Prediction** dataset by following our
operationalized machine learning pipeline.

![AppliedML_Workflow IMAGE MISSING](../images/AppliedML_Workflow.png)


--- 

## Data

Details of the dataset are located here:

Dataset: https://www.kaggle.com/tiredgeek/predict-bo-trial

The files are accessible in the JupyterHub environment:
 * `/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv`
 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`


## Exploration, Training, and Validation

You will examine the _training_ dataset and then:
 * perform **data preparation and exploratory data analysis**,
 * perform **anomaly detection / removal**,
 * apply **dimensionality reduction**, and
 * **train and validate 3 different models**.

For the 3 models, you are free to pick any estimator from scikit-learn 
or any of the models we have covered so far using TensorFlow.

### Validation Assessment

Your first, intermediate result will be an **assessment** of the models' performance.
This assessment should be grounded in a 10-fold cross-validation methodology.

It should include the confusion matrix and F-score for each classifier.
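As a rough sketch of what such an assessment could look like, the snippet below produces out-of-fold predictions with 10 stratified folds and then derives the confusion matrix and F-score from them. The synthetic data, the choice of `LogisticRegression`, and the names `X_demo` and `y_demo` are assumptions for illustration only; substitute your own prepared features and models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced stand-in for the prepared backorder data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=15, weights=[0.9, 0.1], random_state=0
)

# 10 stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Each sample is predicted by a model that did not see it during training.
y_cv_pred = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv
)

cm = confusion_matrix(y_demo, y_cv_pred)  # rows: true class, columns: predicted
f1 = f1_score(y_demo, y_cv_pred)          # F-score of the positive class
print(cm)
print("F1:", round(f1, 3))
```

Because every row is predicted exactly once, the single confusion matrix summarizes all 10 folds.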


---

## Testing

Once you have chosen your final model, you will need to re-train it using all the training data.


--- 
##  Overview / Roadmap

**General steps**:

* Dataset carpentry & Exploratory Data Analysis
  * Develop functions to perform the necessary steps; you will have to apply the same carpentry to both the Training and the Testing data.
* Create 3 pipelines, each of which performs:
    * Anomaly detection
    * Dimensionality reduction
    * Model training/validation
* Train the chosen model on the full training data
* Evaluate the model against the testing data
* Write a summary of your processing and an analysis of the model performance


#### <span style="background:yellow">Note:</span> The use of sklearn Pipelines and FeatureUnion is optional.   
However, your three models should follow a readable path from data to cross-validation statistics.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

## Load dataset

**Description**
~~~
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommended amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag
went_on_backorder - Product actually went on backorder. **This is the target value.**
~~~

**Note**: This is a real-world dataset without any processing.  
There will also be warnings due to the fact that the 1st column mixes integer and string values.  
The last column is what we are trying to predict.
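One way to sidestep the mixed-type warning is to tell pandas up front that `sku` is a string, then look at the rows whose `sku` is not numeric. The sketch below uses a tiny in-memory CSV as a stand-in; its contents (including the non-numeric last row) are hypothetical and only illustrate the idea.

```python
import io
import pandas as pd

# Tiny hypothetical stand-in for the CSV: the last row has a non-numeric sku
# and no other values (assumption made for illustration).
csv_text = (
    "sku,national_inv,went_on_backorder\n"
    "1026827,0,No\n"
    "1043384,2,No\n"
    "(2 rows),,\n"
)

# Declaring sku as a string means pandas never has to guess its type.
demo = pd.read_csv(io.StringIO(csv_text), dtype={"sku": str})

# Rows whose sku is not purely digits are worth inspecting before modeling.
odd_rows = demo[~demo["sku"].str.isdigit()]
print(demo.dtypes)
print(odd_rows)
```

Since `sku` is only an identifier, reading it as text loses nothing for modeling.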

In [2]:
# Dataset location

DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()



|  | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1687860.0 | 1586967.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 | 1687860.0 |
| mean | 496.1118 | 7.872267 | 44.05202 | 178.1193 | 344.9867 | 506.3644 | 55.92607 | 175.0259 | 341.7288 | 525.2697 | 52.7723 | 2.043724 | -6.872059 | -6.437947 | 0.6264507 |
| std | 29615.23 | 7.056024 | 1342.742 | 5026.553 | 9795.152 | 14378.92 | 1928.196 | 5192.378 | 9613.167 | 14838.61 | 1254.983 | 236.0165 | 26.55636 | 25.84333 | 33.72224 |
| min | -27256.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -99.0 | -99.0 | 0.0 |
| 25% | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.66 | 0.0 |
| 50% | 15.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.82 | 0.81 | 0.0 |
| 75% | 80.0 | 9.0 | 0.0 | 4.0 | 12.0 | 20.0 | 4.0 | 15.0 | 31.0 | 47.0 | 3.0 | 0.0 | 0.97 | 0.95 | 0.0 |
| max | 12334400.0 | 52.0 | 489408.0 | 1427612.0 | 2461360.0 | 3777304.0 | 741774.0 | 1105478.0 | 2146625.0 | 3205172.0 | 313319.0 | 146496.0 | 1.0 | 1.0 | 12530.0 |


## Processing

In this section, the goal is to figure out:

* which columns we can use directly,  
* which columns are usable after some processing,  
* and which columns are not processable or obviously irrelevant (like product id) that we will discard.

Then process and prepare this dataset for creating a predictive model.
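A per-column summary can make this triage easier. The sketch below builds one on a small hypothetical frame (`demo` and its values are assumptions, not the real data); the same three lines could be pointed at `dataset`.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the same kinds of columns as the dataset.
demo = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1004"],
    "lead_time": [8.0, np.nan, 2.0, 8.0],
    "deck_risk": ["No", "Yes", "No", None],
})

# One row per column: its dtype, how many values are missing, and how many
# distinct (non-missing) values it holds.
summary = pd.DataFrame({
    "dtype": demo.dtypes.astype(str),
    "n_missing": demo.isna().sum(),
    "n_unique": demo.nunique(),
})
print(summary)
```

A text column with one distinct value per row behaves like an identifier, a text column with only a couple of distinct values is a candidate for binary encoding, and numeric columns with missing values need imputation.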

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687861 entries, 0 to 1687860
Data columns (total 23 columns):
sku                  1687861 non-null object
national_inv         1687860 non-null float64
lead_time            1586967 non-null float64
in_transit_qty       1687860 non-null float64
forecast_3_month     1687860 non-null float64
forecast_6_month     1687860 non-null float64
forecast_9_month     1687860 non-null float64
sales_1_month        1687860 non-null float64
sales_3_month        1687860 non-null float64
sales_6_month        1687860 non-null float64
sales_9_month        1687860 non-null float64
min_bank             1687860 non-null float64
potential_issue      1687860 non-null object
pieces_past_due      1687860 non-null float64
perf_6_month_avg     1687860 non-null float64
perf_12_month_avg    1687860 non-null float64
local_bo_qty         1687860 non-null float64
deck_risk            1687860 non-null object
oe_constraint        1687860 non-null object
ppap_risk        

### Take samples and examine the dataset

In [4]:
dataset.iloc[:3,:6]

|  | sku | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month |
|---|---|---|---|---|---|---|
| 0 | 1424310 | 644.0 | 8.0 | 240.0 | 960.0 | 1680.0 |
| 1 | 1936046 | 76.0 | 8.0 | 0.0 | 67.0 | 67.0 |
| 2 | 2025677 | 4235.0 | 12.0 | 0.0 | 3850.0 | 3850.0 |


In [5]:
dataset.iloc[:3,6:12]

|  | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank |
|---|---|---|---|---|---|---|
| 0 | 2520.0 | 265.0 | 793.0 | 1581.0 | 2592.0 | 240.0 |
| 1 | 67.0 | 17.0 | 48.0 | 104.0 | 159.0 | 26.0 |
| 2 | 3850.0 | 0.0 | 5.0 | 5.0 | 5.0 | 1.0 |


In [6]:
dataset.iloc[:3,12:18]

|  | potential_issue | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk |
|---|---|---|---|---|---|---|
| 0 | No | 0.0 | 0.9 | 0.94 | 0.0 | No |
| 1 | No | 67.0 | 0.86 | 0.88 | 0.0 | No |
| 2 | No | 0.0 | 0.58 | 0.58 | 0.0 | Yes |


In [7]:
dataset.iloc[:3,18:24]

|  | oe_constraint | ppap_risk | stop_auto_buy | rev_stop | went_on_backorder |
|---|---|---|---|---|---|
| 0 | No | No | Yes | No | No |
| 1 | No | No | Yes | No | No |
| 2 | No | No | Yes | No | No |


### Drop columns that are obviously irrelevant or not processable

In [8]:
# Add code below this comment  (Question #E8001)
# ----------------------------------
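# Drop by name (same columns as positions 0, 12, 17-21) so the intent is readable
dataset.drop(columns=['sku', 'potential_issue', 'deck_risk', 'oe_constraint',
                      'ppap_risk', 'stop_auto_buy', 'rev_stop'], inplace=True)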
dataset.head()

|  | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | went_on_backorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 644.0 | 8.0 | 240.0 | 960.0 | 1680.0 | 2520.0 | 265.0 | 793.0 | 1581.0 | 2592.0 | 240.0 | 0.0 | 0.9 | 0.94 | 0.0 | No |
| 1 | 76.0 | 8.0 | 0.0 | 67.0 | 67.0 | 67.0 | 17.0 | 48.0 | 104.0 | 159.0 | 26.0 | 67.0 | 0.86 | 0.88 | 0.0 | No |
| 2 | 4235.0 | 12.0 | 0.0 | 3850.0 | 3850.0 | 3850.0 | 0.0 | 5.0 | 5.0 | 5.0 | 1.0 | 0.0 | 0.58 | 0.58 | 0.0 | No |
| 3 | 4.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.73 | 0.78 | 0.0 | No |
| 4 | 4.0 | 8.0 | 0.0 | 6.0 | 12.0 | 18.0 | 3.0 | 6.0 | 10.0 | 15.0 | 2.0 | 0.0 | 0.93 | 0.96 | 0.0 | No |


### Find unique values of string columns

Now make sure that these Yes/No columns really contain only Yes or No.  
If so, convert them into binary values (0s and 1s).

**Tip**: use [unique()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) function of pandas Series.

Example

~~~python
print('went_on_backorder', dataset['went_on_backorder'].unique())
~~~

In [9]:
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype!=np.float64, dataset.columns))
print(yes_no_columns)

# Add code below this comment  (Question #E8002)
# ----------------------------------

print('went_on_backorder', dataset['went_on_backorder'].unique())


['went_on_backorder']
went_on_backorder ['No' 'Yes' nan]


You may also see **nan** among the values; it represents missing values in the dataset.

We fill them with the most frequent value, the [Mode](https://en.wikipedia.org/wiki/Mode_%28statistics%29) in statistics.

In [10]:
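for column_name in yes_no_columns:
    # mode() skips NaN by default, so this is the most frequent actual value
    mode = dataset[column_name].mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    # assign back instead of calling fillna(inplace=True) on a column selection
    dataset[column_name] = dataset[column_name].fillna(mode)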

Filling missing values of went_on_backorder with No


In [11]:
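# Fill numeric gaps with column means; numeric_only skips the Yes/No text column
dataset.fillna(dataset.mean(numeric_only=True), inplace=True)
dataset.head(10)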

|  | national_inv | lead_time | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | went_on_backorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 644.0 | 8.000000 | 240.0 | 960.0 | 1680.0 | 2520.0 | 265.0 | 793.0 | 1581.0 | 2592.0 | 240.0 | 0.0 | 0.90 | 0.94 | 0.0 | No |
| 1 | 76.0 | 8.000000 | 0.0 | 67.0 | 67.0 | 67.0 | 17.0 | 48.0 | 104.0 | 159.0 | 26.0 | 67.0 | 0.86 | 0.88 | 0.0 | No |
| 2 | 4235.0 | 12.000000 | 0.0 | 3850.0 | 3850.0 | 3850.0 | 0.0 | 5.0 | 5.0 | 5.0 | 1.0 | 0.0 | 0.58 | 0.58 | 0.0 | No |
| 3 | 4.0 | 4.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.73 | 0.78 | 0.0 | No |
| 4 | 4.0 | 8.000000 | 0.0 | 6.0 | 12.0 | 18.0 | 3.0 | 6.0 | 10.0 | 15.0 | 2.0 | 0.0 | 0.93 | 0.96 | 0.0 | No |
| 5 | 4.0 | 8.000000 | 0.0 | 0.0 | 3.0 | 5.0 | 0.0 | 3.0 | 5.0 | 9.0 | 2.0 | 0.0 | 0.94 | 0.88 | 0.0 | No |
| 6 | 179.0 | 7.872267 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 13.0 | 34.0 | 1.0 | 0.0 | -99.00 | -99.00 | 0.0 | No |
| 7 | 35.0 | 8.000000 | 5.0 | 34.0 | 67.0 | 92.0 | 16.0 | 38.0 | 68.0 | 106.0 | 24.0 | 0.0 | 0.72 | 0.77 | 0.0 | No |
| 8 | 15.0 | 12.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.70 | 0.76 | 0.0 | No |
| 9 | 1062.0 | 8.000000 | 4.0 | 799.0 | 1515.0 | 2305.0 | 317.0 | 1187.0 | 2285.0 | 3460.0 | 349.0 | 0.0 | 0.99 | 0.98 | 0.0 | No |


### Convert yes/no columns into binary (0s and 1s)

In [12]:
# Add code below this comment  (Question #E8003)
# ----------------------------------
# Vectorized mapping: 'No' -> 0, 'Yes' -> 1
dataset['went_on_backorder'] = dataset['went_on_backorder'].map({'No': 0, 'Yes': 1})

Now all columns should be either int64 or float64.

In [13]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687861 entries, 0 to 1687860
Data columns (total 16 columns):
national_inv         1687861 non-null float64
lead_time            1687861 non-null float64
in_transit_qty       1687861 non-null float64
forecast_3_month     1687861 non-null float64
forecast_6_month     1687861 non-null float64
forecast_9_month     1687861 non-null float64
sales_1_month        1687861 non-null float64
sales_3_month        1687861 non-null float64
sales_6_month        1687861 non-null float64
sales_9_month        1687861 non-null float64
min_bank             1687861 non-null float64
pieces_past_due      1687861 non-null float64
perf_6_month_avg     1687861 non-null float64
perf_12_month_avg    1687861 non-null float64
local_bo_qty         1687861 non-null float64
went_on_backorder    1687861 non-null int64
dtypes: float64(15), int64(1)
memory usage: 206.0 MB


## Pipeline

In this section, design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality Reduction
* Train a model

You can add more notebook cells or import any Python modules as needed.
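Detectors such as `IsolationForest` and `EllipticEnvelope` return `+1` for inliers and `-1` for outliers from `predict`, which makes it straightforward to mask rows before training. The snippet below is a minimal sketch of that masking step on synthetic 2-D data; the data, the `contamination` value, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 well-behaved points plus one extreme point (assumption).
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 2)), [[25.0, 25.0]]])
y_demo = np.zeros(len(X_demo), dtype=int)

# predict() returns +1 for inliers and -1 for outliers.
detector = IsolationForest(contamination=0.05, random_state=0)
inliers = detector.fit(X_demo).predict(X_demo) == 1

# Filter features and labels with the same boolean mask so they stay aligned.
X_clean, y_clean = X_demo[inliers], y_demo[inliers]
print(X_demo.shape, "->", X_clean.shape)
```

Only training rows should be filtered this way; rows to be scored later still need a prediction, however unusual they look.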

In [14]:
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV,cross_val_score



In [15]:
dataset.isnull().any()

national_inv         False
lead_time            False
in_transit_qty       False
forecast_3_month     False
forecast_6_month     False
forecast_9_month     False
sales_1_month        False
sales_3_month        False
sales_6_month        False
sales_9_month        False
min_bank             False
pieces_past_due      False
perf_6_month_avg     False
perf_12_month_avg    False
local_bo_qty         False
went_on_backorder    False
dtype: bool

### Your 1st pipeline 
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [16]:
# Add code below this comment  (Question #E8004)
# ----------------------------------

    
class CleanPipeline(Pipeline):
    def __init__(self, cleaning, steps):
        self.cleaning = cleaning
        super(CleanPipeline, self).__init__(steps)

    def fit(self, X, y):
        inliers = self.cleaning.fit(X,y).predict(X) == 1
        return super(CleanPipeline, self).fit(X[inliers], y[inliers])

pipeline = CleanPipeline(
    EllipticEnvelope(contamination = 0.3),
    [
        ('scale', StandardScaler()),
        ('pca', PCA()),
        ('classify', GaussianNB()),
    ])

grid1 = GridSearchCV(pipeline, cv=3, n_jobs=1, param_grid=[{
        'pca__n_components': [5, 10, 15],
}])


X = np.array(dataset.iloc[:3000,:-1])
y = np.array(dataset.went_on_backorder[:3000])
grid1.fit(X, y)

GridSearchCV(cv=3, error_score='raise',
       estimator=CleanPipeline(cleaning=EllipticEnvelope(assume_centered=False, contamination=0.3, random_state=None,
         store_precision=True, support_fraction=None),
       steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('classify', GaussianNB(priors=None))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'pca__n_components': [5, 10, 15]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

### Your 2nd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [17]:
# Add code below this comment  (Question #E8005)
# ---------------------------------------------


    
class CleanPipeline(Pipeline):
    def __init__(self, cleaning, steps):
        self.cleaning = cleaning
        super(CleanPipeline, self).__init__(steps)

    def fit(self, X, y):
        inliers = self.cleaning.fit(X,y).predict(X) == 1
        return super(CleanPipeline, self).fit(X[inliers], y[inliers])

pipeline = CleanPipeline(
    EllipticEnvelope(contamination = 0.3),
    [
        ('scale', StandardScaler()),
        ('pca', PCA()),
        ('classify', LogisticRegression()),
    ])

grid2 = GridSearchCV(pipeline, cv=3, n_jobs=1, param_grid=[{
        'pca__n_components': [5, 10, 15],
}])


X = np.array(dataset.iloc[:3000,:-1])
y = np.array(dataset.went_on_backorder[:3000])
grid2.fit(X, y)


GridSearchCV(cv=3, error_score='raise',
       estimator=CleanPipeline(cleaning=EllipticEnvelope(assume_centered=False, contamination=0.3, random_state=None,
         store_precision=True, support_fraction=None),
       steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_com...y='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'pca__n_components': [5, 10, 15]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

### Your 3rd pipeline
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [18]:
# Add code below this comment  (Question #E8006)
# ----------------------------------

    
class CleanPipeline(Pipeline):
    def __init__(self, cleaning, steps):
        self.cleaning = cleaning
        super(CleanPipeline, self).__init__(steps)

    def fit(self, X, y):
        inliers = self.cleaning.fit(X,y).predict(X) == 1
        return super(CleanPipeline, self).fit(X[inliers], y[inliers])

pipeline = CleanPipeline(
    EllipticEnvelope(contamination = 0.3),
    [
        ('scale', StandardScaler()),
        ('pca', PCA()),
        ('classify', RandomForestClassifier()),
    ])

grid3 = GridSearchCV(pipeline, cv=3, n_jobs=1, param_grid=[{
        'pca__n_components': [5, 10, 15]
}])


X = np.array(dataset.iloc[:3000,:-1])
y = np.array(dataset.went_on_backorder[:3000])
grid3.fit(X, y)

GridSearchCV(cv=3, error_score='raise',
       estimator=CleanPipeline(cleaning=EllipticEnvelope(assume_centered=False, contamination=0.3, random_state=None,
         store_precision=True, support_fraction=None),
       steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_com..._jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'pca__n_components': [5, 10, 15]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

## Document the cross-validation analysis for the three models

In [24]:
#  Note the grid has cross-validation results stored in .cv_results_['mean_test_score']
mean_scores = np.array(grid1.cv_results_['mean_test_score'])

# select the best mean score across the pca__n_components grid
mean_scores = mean_scores.max(axis=0)
print(mean_scores)

0.857333333333


In [26]:
#  Note the grid has cross-validation results stored in .cv_results_['mean_test_score']
mean_scores = np.array(grid2.cv_results_['mean_test_score'])

# select the best mean score across the pca__n_components grid
mean_scores = mean_scores.max(axis=0)
print(mean_scores)

0.948666666667


In [30]:
#  Note the grid has cross-validation results stored in .cv_results_['mean_test_score']
mean_scores = np.array(grid3.cv_results_['mean_test_score'])

# select the best mean score across the pca__n_components grid
mean_scores = mean_scores.max(axis=0)
print(mean_scores)

0.992666666667
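The single best score hides how each grid setting performed. As a sketch, `cv_results_` can be loaded into a DataFrame to show the mean and spread per setting. The example below uses synthetic data and a plain `Pipeline` so that it runs on its own; `demo_grid`, `X_demo`, and `y_demo` are assumed names, and the same DataFrame call could be applied to `grid1`, `grid2`, and `grid3`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared training sample (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=15, random_state=0)

demo_grid = GridSearchCV(
    Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA()),
        ("classify", GaussianNB()),
    ]),
    param_grid={"pca__n_components": [5, 10, 15]},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per grid setting, best-ranked first.
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_pca__n_components", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
print("best:", demo_grid.best_params_, round(demo_grid.best_score_, 3))
```

With `scoring=None`, as in the grids above, these scores are the classifier's default accuracy; passing `scoring="f1"` to `GridSearchCV` would report the F-score instead.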


**<span style="background:yellow">Don't forget to share your chosen models and their cross-validation performance with the class on the discussion board for module 8.</span>** 

---

# Retrain a model using the full training data set

## Train
Use the full training data set to train the model.

In [35]:
# Add code below this comment  (Question #E8008)
# ----------------------------------
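X = np.array(dataset.iloc[:, :-1])
y = np.array(dataset.went_on_backorder)
# Re-run the grid search for pipeline 3 (random forest) on every training row;
# refit=True leaves the best pipeline trained on all of X, y.
random_forest_clf = grid3.fit(X, y)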




### Save the trained model with the pickle library.

In [36]:
# Add code below this comment  (Question #E8009)
# ----------------------------------
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
# save the model to disk
joblib.dump(random_forest_clf, 'random_forest_clf.pkl')


['random_forest_clf.pkl']
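A quick way to gain confidence in a saved model is to reload it and confirm that it predicts exactly as the in-memory model does. The sketch below does this round trip with a small stand-in model in a temporary directory; the model, data, and file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=15, random_state=0)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled models are generally only safe to reload with the same scikit-learn version that wrote them, so it helps to record that version next to the file.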

### Reload the trained model from the pickle file
### Load the Testing Data and evaluate your model

 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`

In [40]:
# Add code below this comment  (Question #E8010)
# ----------------------------------
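# NOTE: these rows are sliced from the training DataFrame, which the model
# was already fit on, so the scores below are not a held-out test result.
# The commented-out path is the actual test file; it would need the same
# carpentry as the training data before it can be scored.
X_test = np.array(dataset.iloc[3000:,:-1])
y_test = np.array(dataset.went_on_backorder[3000:])
#DATASET_TEST = '/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
#assert os.path.exists(DATASET_TEST)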

loaded_model = joblib.load('random_forest_clf.pkl')

## Test
Test your new model using the testing data set.
 * `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`
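The testing file has to go through the same carpentry as the training file before the model can score it. The sketch below wraps those steps in one function, `prepare`, which is a hypothetical helper and not part of the notebook. It reuses the training column means to fill gaps in the test features. Unlike the cells above, it drops rows with a missing label instead of filling them with the mode, which is a choice of this sketch. The tiny frames stand in for the two CSV files.

```python
import numpy as np
import pandas as pd

# Columns the notebook discards before modeling.
DROP_COLUMNS = ["sku", "potential_issue", "deck_risk", "oe_constraint",
                "ppap_risk", "stop_auto_buy", "rev_stop"]

def prepare(df, fill_values=None):
    """Hypothetical helper: return (X, y, fill_values) for a raw frame."""
    df = df.drop(columns=DROP_COLUMNS)
    # Rows without a label cannot be scored, so drop them (choice of this sketch).
    df = df.dropna(subset=["went_on_backorder"])
    y = df["went_on_backorder"].map({"No": 0, "Yes": 1}).to_numpy()
    features = df.drop(columns="went_on_backorder")
    if fill_values is None:
        # Training call: learn the fill values from this frame.
        fill_values = features.mean()
    X = features.fillna(fill_values).to_numpy()
    return X, y, fill_values

def tiny_frame(lead_times, labels):
    """Build a small frame with the dataset's column names (illustrative data)."""
    n = len(labels)
    data = {name: ["No"] * n for name in DROP_COLUMNS}
    data["sku"] = [str(i) for i in range(n)]
    data["national_inv"] = [float(i) for i in range(n)]
    data["lead_time"] = lead_times
    data["went_on_backorder"] = labels
    return pd.DataFrame(data)

train_df = tiny_frame([2.0, 4.0, np.nan], ["No", "Yes", "No"])
test_df = tiny_frame([np.nan, 8.0], ["Yes", "No"])

X_train_demo, y_train_demo, train_means = prepare(train_df)
# The test gap is filled with the training mean, not the test mean.
X_test_demo, y_test_demo, _ = prepare(test_df, fill_values=train_means)
print(X_test_demo)
```

Reusing the training means keeps information from the test file out of the preprocessing, which is the same reason the scaler and PCA are fit inside the pipeline.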

In [42]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Add code below this comment  (Question #E8011)
# ----------------------------------
y_pred=loaded_model.predict(X_test)

print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))



0.97408569609
[[1634032   39557]
 [   4105    7167]]
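Accuracy alone can be misleading on data this imbalanced, so it is worth deriving precision, recall, and the F-score from the confusion matrix printed above. The sketch below redoes that arithmetic with the four printed counts, using scikit-learn's convention that rows are true classes and columns are predicted classes. It also computes the accuracy of always predicting "No" on the same rows as a baseline.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[1634032, 39557],
               [4105, 7167]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted backorders that were real
recall = tp / (tp + fn)      # share of real backorders that were caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a model that always predicts "No" on the same rows.
baseline_accuracy = (tn + fp) / cm.sum()

print(f"accuracy  {accuracy:.4f}  (always-No baseline {baseline_accuracy:.4f})")
print(f"precision {precision:.4f}")
print(f"recall    {recall:.4f}")
print(f"f1        {f1:.4f}")
```

The always-"No" baseline scores higher on accuracy than the model does, so the F-score and recall on the backorder class are the more informative numbers to carry into the summary for management.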


## Conclusion

## Reflect

Imagine you are a data scientist who has been tasked with developing a system to save your 
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!