# Part 3: Unbiased Evaluation using a New Test Set

In this part, we are given a new test set (`/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`). We can now take advantage of the entire smart sample that we created in Part 1.

* Retrain the pipeline using the optimal hyperparameters found by the grid search in Part 2. We don't need to repeat the grid search here.
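
One way to retrain without a new search is to copy the winning parameters onto a fresh pipeline and fit it on all of the data. The sketch below is only illustrative: the synthetic data, the small parameter grid, and the names `pipe`, `search`, and `final_model` are assumptions, and the step names `PCA` and `SVC` are chosen to match the `PCA__`/`SVC__` parameter prefixes used later in this notebook.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the sampled training data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical pipeline; step names mirror the 'PCA__' / 'SVC__' prefixes.
pipe = Pipeline([('PCA', PCA()), ('SVC', SVC())])

# Stand-in for the grid search that was already run in an earlier part.
search = GridSearchCV(pipe, {'PCA__n_components': [5], 'SVC__C': [1, 10]}, cv=3)
search.fit(X[:150], y[:150])

# Copy the best parameters onto an unfitted clone and train on all the data.
final_model = clone(pipe).set_params(**search.best_params_)
final_model.fit(X, y)
```

Here `best_params_` carries the selected values, so the final fit costs a single training run instead of one per grid point and fold.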

## Import modules as needed

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
import joblib
import pickle as pkl

## Load the smart sample and the best pipeline from Part 2

In [2]:
sample_data, sampled_X, sampled_y = joblib.load('sample-data.pkl')
sampled_X = pd.DataFrame(sampled_X,
                        columns = ['national_inv', 'in_transit_qty', 'forecast_3_month', 'forecast_6_month', 'forecast_9_month',
                                  'sales_1_month', 'sales_3_month', 'sales_6_month', 'sales_9_month', 'min_bank',
                                   'potential_issue', 'pieces_past_due', 'perf_6_month_avg', 'perf_12_month_avg',
                                   'local_bo_qty', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop'])
sampled_y = pd.DataFrame(sampled_y,
                        columns = ['went_on_backorder'])

dataset = sample_data

pipe1, model_grid, pipe2, model_grid2, pipe3, model_grid3 = joblib.load('pipelines.pkl')


## Retrain a pipeline using the full sampled training data set

Use the full sampled training data set to train the pipeline.

In [3]:
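from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Scale the features to [0, 1]. Separate names keep the feature scaler
# from being overwritten by the target scaler below.
x_scaler = MinMaxScaler().fit(sampled_X)
X_norm = x_scaler.transform(sampled_X)

# Scale the target column the same way.
y_scaler = MinMaxScaler().fit(sampled_y)
y_norm = y_scaler.transform(sampled_y)

# Hold out 25% of the sample.
X_train, X_test, y_train, y_test = train_test_split(X_norm, y_norm, test_size=0.25)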

In [6]:
# Add code below this comment  (Question #E301)
# ----------------------------------
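from sklearn.model_selection import GridSearchCV

# Search over the SVC regularization strength (C) and RBF kernel width (gamma).
param_grid2 = {'SVC__C': [100, 200, 300, 400, 500],
               'SVC__gamma': [0, 1, 2, 3, 4],
               'PCA__n_components': [20],
               'SVC__kernel': ['rbf']}

model_grid2 = GridSearchCV(pipe2, param_grid2, cv=10, n_jobs=2)

# ravel() turns the (n, 1) label column into the 1-D array scikit-learn expects.
fitted_model = model_grid2.fit(X_train, y_train.ravel())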







### Save the trained model with joblib

In [7]:
# Add code below this comment  
# -----------------------------

joblib.dump([fitted_model], 'model.pkl')




['model.pkl']
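
A model saved this way can be read back with `joblib.load`. The sketch below is a minimal round trip under stated assumptions: synthetic data and a bare `SVC` stand in for the fitted grid search, and the file is written to a temporary directory rather than to `model.pkl`.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data and a plain SVC stand in for the real fitted model (assumption).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC().fit(X, y)

# Save the model inside a one-element list, as in the cell above.
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
joblib.dump([model], path)

# Loading returns the same list, so unpack its single element.
(restored,) = joblib.load(path)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Because the model was dumped inside a list, the loaded object is a list as well and has to be unpacked before calling `predict`.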


## Load the Testing Data and evaluate your model

* Test set: `/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv`
* Preprocess this test data, following steps similar to those in Part 1.
* If a normalizer/standardizer was fitted in Part 2, transform the test data with that fitted normalizer/standardizer.
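
The last point can be sketched as follows: fit the scaler once on the training features, then call only `transform` on the test features. The toy arrays and the names `X_train`, `X_new`, and `scaler` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy arrays standing in for the training and test features (assumption).
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_new = np.array([[5.0, 40.0]])

# Fit the scaler on the training data only.
scaler = MinMaxScaler().fit(X_train)

# Reuse the same fitted scaler on the test data; do not refit it.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)  # [[0.5 1.5]]
```

The second value lands outside [0, 1] because 40 exceeds the training maximum of 30. That is expected: the test data is mapped with the training ranges, so the model sees features on the scale it was trained on.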

In [49]:
# Preprocess the given test set  (Question #E302)
# ----------------------------------

DS_2 = '/dsa/data/all_datasets/back_order/Kaggle_Test_Dataset_v2.csv'
assert os.path.exists(DS_2)
ds_2 = pd.read_csv(DS_2).sample(frac = 1).reset_index(drop=True)





In [50]:
ds_2.head().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| sku | 3335764 | 3405761 | 3491812 | 3507059 | 3309068 |
| national_inv | 3516.0 | 276.0 | 41.0 | 10.0 | 32.0 |
| lead_time | 8.0 | NaN | 9.0 | 8.0 | 8.0 |
| in_transit_qty | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| forecast_3_month | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| forecast_6_month | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| forecast_9_month | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sales_1_month | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 |
| sales_3_month | 0.0 | 38.0 | 0.0 | 0.0 | 0.0 |
| sales_6_month | 0.0 | 56.0 | 0.0 | 0.0 | 0.0 |


In [51]:
ds_2 = ds_2.drop(columns=['sku', 'lead_time'])

In [52]:
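# Yes/No flag columns are the ones not stored as floats.
yes_no_columns = list(filter(lambda i: ds_2[i].dtype != np.float64, ds_2.columns))

for column_name in yes_no_columns:
    # Most frequent value, computed on the string form of the column.
    mode = ds_2[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    # Assign the result back rather than relying on a chained inplace=True.
    ds_2[column_name] = ds_2[column_name].fillna(mode)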

ds_2.info()

Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242076 entries, 0 to 242075
Data columns (total 21 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   national_inv       242075 non-null  float64
 1   in_transit_qty     242075 non-null  float64
 2   forecast_3_month   242075 non-null  float64
 3   forecast_6_month   242075 non-null  float64
 4   forecast_9_month   242075 non-null  float64
 5   sales_1_month      242075 non-null  float64
 6   sales_3_month      242075 non-null  float64
 7   sales_6_month      242075 non-null  float64
 8   sales_9_month      242075 non-null  float64
 9   min_bank       

In [53]:
# Encode the Yes/No flag columns as 1/0.
for column_name in ['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk',
                    'stop_auto_buy', 'rev_stop', 'went_on_backorder']:
    ds_2[column_name] = ds_2[column_name].map({'Yes': 1, 'No': 0})

In [54]:
ds_2

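| | national_inv | in_transit_qty | forecast_3_month | forecast_6_month | forecast_9_month | sales_1_month | sales_3_month | sales_6_month | sales_9_month | min_bank | ... | pieces_past_due | perf_6_month_avg | perf_12_month_avg | local_bo_qty | deck_risk | oe_constraint | ppap_risk | stop_auto_buy | rev_stop | went_on_backorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3516.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.00 | 1.00 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 276.0 | 2.0 | 0.0 | 0.0 | 0.0 | 13.0 | 38.0 | 56.0 | 100.0 | 22.0 | ... | 0.0 | -99.00 | -99.00 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.85 | 0.82 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.84 | 0.86 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.97 | 0.97 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 242071 | 2.0 | 0.0 | 1.0 | 2.0 | 4.0 | 0.0 | 1.0 | 3.0 | 4.0 | 2.0 | ... | 0.0 | 0.84 | 0.82 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 242072 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.36 | 0.44 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 242073 | 453.0 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 33.0 | 87.0 | 157.0 | 27.0 | ... | 0.0 | 0.95 | 0.95 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 242074 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.98 | 0.97 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 |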


In [56]:
ds_2 = ds_2.dropna()

In [57]:
X = np.array(ds_2.iloc[:,:-1])
y = np.array(ds_2.went_on_backorder)

In [58]:
from sklearn.preprocessing import MinMaxScaler

norm1 = MinMaxScaler().fit(ds_2)

ds_norm = norm1.transform(ds_2)

In [59]:
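# Note: this fits a new scaler on the test features; it does not reuse
# the scaler that was fitted on the training sample.
norm2 = MinMaxScaler().fit(X)

X_norm = norm2.transform(X)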

In [60]:
y = y.reshape(-1, 1)

norm3 = MinMaxScaler().fit(y)

y_norm = norm3.transform(y)

We can now predict and evaluate with the preprocessed test set. It would be interesting to compare performance with and without outlier removal from the test set. We can report the confusion matrix, precision, recall, F1-score, accuracy, and any other relevant measures.

In [64]:
# Add code below this comment  (Question #E303)
# ----------------------------------

# With outlier removal
from sklearn.cluster import KMeans

def kmeans_session():
    # run k-means clustering
    km_clusters = KMeans(n_clusters=3, algorithm="full").fit_predict(X_norm, y_norm)
    
    # create cluster distribution, this time they are in tuples so we can sort easily
    dist_clusters = ((np.sum(km_clusters==z), z) for z in np.unique(km_clusters))
    
    # sort clusters in descending order of the number of data entries they contain
    dist_clusters = sorted(dist_clusters, reverse = True)
    
    # find out the cluster with max number of data entries
    max_cluster = dist_clusters[0][1]

    # select data in max_cluster as inliers
    inliers = km_clusters == max_cluster
    
    X_inliers = X_norm[inliers]
    y_inliers = y_norm[inliers]
    
    return X_inliers, y_inliers

X_inliers, y_inliers = kmeans_session()



In [66]:
from sklearn.metrics import classification_report, confusion_matrix

predicted_y1 = model_grid2.predict(X_inliers)

pd.DataFrame(confusion_matrix(y_inliers, predicted_y1))

| | 0 | 1 |
|---|---|---|
| 0 | 335 | 183711 |
| 1 | 5 | 2231 |


In [67]:
print(classification_report(y_inliers, predicted_y1))

              precision    recall  f1-score   support

         0.0       0.99      0.00      0.00    184046
         1.0       0.01      1.00      0.02      2236

    accuracy                           0.01    186282
   macro avg       0.50      0.50      0.01    186282
weighted avg       0.97      0.01      0.00    186282



In [68]:
# Without outlier removal

predicted_y2 = model_grid2.predict(X_norm)

pd.DataFrame(confusion_matrix(y_norm, predicted_y2))

| | 0 | 1 |
|---|---|---|
| 0 | 16153 | 223234 |
| 1 | 91 | 2597 |


In [69]:
print(classification_report(y_norm, predicted_y2))

              precision    recall  f1-score   support

         0.0       0.99      0.07      0.13    239387
         1.0       0.01      0.97      0.02      2688

    accuracy                           0.08    242075
   macro avg       0.50      0.52      0.07    242075
weighted avg       0.98      0.08      0.13    242075
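
The report above can be cross-checked by hand from the confusion matrix of the run without outlier removal. The snippet below recomputes a few of the figures, reading rows as actual classes and columns as predicted classes (the scikit-learn convention); the variable names are only illustrative.

```python
# Confusion matrix from the run without outlier removal
# (rows = actual class, columns = predicted class).
tn, fp = 16153, 223234
fn, tp = 91, 2597

# Class 1 is "went on backorder".
precision_1 = tp / (tp + fp)   # share of predicted backorders that were real
recall_1 = tp / (tp + fn)      # share of real backorders that were caught
recall_0 = tn / (tn + fp)      # share of non-backorders correctly left alone
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"precision (class 1): {precision_1:.2f}")  # 0.01
print(f"recall    (class 1): {recall_1:.2f}")     # 0.97
print(f"recall    (class 0): {recall_0:.2f}")     # 0.07
print(f"accuracy           : {accuracy:.2f}")     # 0.08
```

The high recall for class 1 comes almost entirely from predicting class 1 for nearly every row, which is why precision and accuracy are so low.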



## Conclusion

## Reflect

Imagine you are a data scientist who has been tasked with developing a system to save your
company money by predicting and preventing back orders of parts in the supply chain.

Write a **brief summary** for "management" that details your findings, 
your level of certainty and trust in the models, 
and recommendations for operationalizing these models for the business.

# Save your notebook!
## Then `File > Close and Halt`