## Operationalizing a Machine Learning Pipeline for Product Backorder
### Developing the Model (Part 2)

In this part, we will develop three unique pipelines for predicting backorder. We'll use the smart sample from the MLBackorder_Preprocessing notebook to fit and evaluate these pipelines. 

In [78]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd
import joblib
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split 
from sklearn.metrics import classification_report



In [79]:

# Reload the previous smart sampling from local file 
# ----------------------------------
X_sampled, y_sampled, rus = joblib.load('sampledata-Part1-V2.pkl')

X = X_sampled
y = y_sampled



In [80]:
X.shape

(9964, 20)

## Ensure the Data Is Normalized/Standardized

In [81]:
# Scaling is left disabled in this notebook; the commented-out code is kept for reference.
# Only the features X are scaled, never the target y.

#from sklearn import preprocessing

#processor = preprocessing.MinMaxScaler()
#range_scaled = processor.fit_transform(X)

#print("Range-scaled conversion of X")
#print(range_scaled)
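
One way to enable scaling without leaking test-set statistics is to put the scaler inside a `Pipeline`, so that it is refit on each training fold. The sketch below is an assumption about how that could look, not the approach used in this notebook: it runs on synthetic data, and the names `X_demo`, `y_demo`, and `scaled_pipe` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sampled data (hypothetical, 20 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 20))
y_demo = (X_demo[:, 0] > 50.0).astype(int)

# Inside a pipeline, the scaler is fit on the training fold of each split only.
scaled_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(scaled_pipe, X_demo, y_demo, cv=5)
print(scores.mean())

# Standalone check: after scaling, each column has mean ~0 and unit variance.
X_scaled = StandardScaler().fit_transform(X_demo)
```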

## Split the data into Train/Test

In [83]:
# test_size = 0.7 holds out 70% of the rows, so only 30% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.7)


In [84]:
X_train.shape

(2989, 20)
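
The shape above follows from how `train_test_split` interprets `test_size`: it is the fraction held out, so `test_size = 0.7` keeps only 30% of the 9,964 rows for training. The sketch below reproduces the row counts on placeholder arrays of the same shape (`X_demo` and `y_demo` are hypothetical stand-ins) and shows what swapping the fractions would give.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the sampled data.
X_demo = np.zeros((9964, 20))
y_demo = np.tile([0, 1], 4982)

# test_size is the held-out fraction: 70% test, 30% train.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.7, random_state=0)
print(X_tr.shape, X_te.shape)

# The more common 70/30 split, stratified so both classes stay balanced.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)
print(X_tr2.shape, X_te2.shape)
```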

## Developing Pipeline

Below I design an operationalized machine learning pipeline, which includes:

* Anomaly detection
* Dimensionality reduction
* Model training

For simplicity, I avoid fitting the anomaly detection method inside the pipeline and build the workflow in two steps:

* Step I: Fit an outlier detector on the training set and remove the rows it flags.
* Step II: Define a pipeline from a feature selection (or dimensionality reduction) step and a classifier, then cross-validate the pipeline on the training data with the outliers removed.

Once the pipelines are fit, we identify the best model and give an unbiased evaluation on a held-out test split. For each evaluation, I report the confusion matrix, precision, recall, f1-score, and accuracy.

**Note:** Below, I'll be using Grid Search to find the optimal parameters of the pipelines.
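
The two-step workflow described above can be sketched as a small helper. This is a minimal illustration on synthetic data, not the exact cells used below: the function name `fit_after_outlier_removal`, the demo arrays, and the parameter grid are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def fit_after_outlier_removal(detector, pipeline, param_grid, X, y):
    """Step I: drop rows the detector flags; Step II: grid-search the pipeline."""
    inlier_mask = detector.fit_predict(X) == 1   # fit_predict returns -1 for outliers
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(X[inlier_mask], y[inlier_mask])
    return search, inlier_mask


# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 8))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

pipe = Pipeline([('PCA', PCA()), ('dtc', DecisionTreeClassifier(random_state=0))])
grid = {'PCA__n_components': [2, 4], 'dtc__max_depth': [2, 4]}

search, inlier_mask = fit_after_outlier_removal(
    IsolationForest(random_state=42), pipe, grid, X_demo, y_demo)
print(f"Removed {np.sum(~inlier_mask)} rows; best params: {search.best_params_}")
```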

In [85]:
from sklearn.svm import OneClassSVM, SVC
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest

from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif, f_regression, chi2, SelectFromModel

from sklearn.pipeline import Pipeline


### Pipeline #1
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation
  

In [86]:
# Anomaly Detection
# ----------------------------------
iso_forest = IsolationForest(contamination = 'auto', random_state = 42)
iso_outliers = iso_forest.fit_predict(X_train) == -1
print(f"Num of outliers = {np.sum(iso_outliers)}")

X_iso = X_train[~iso_outliers]
y_iso = y_train[~iso_outliers]


Num of outliers = 84


In [87]:
# Hold out 30% of the outlier-free training data for evaluation
# ----------------------------------
X_train_iso, X_test_iso, y_train_iso, y_test_iso = train_test_split(X_iso, y_iso, test_size = 0.3)


In [88]:
X_train_iso.shape

(2033, 20)

In [89]:
#iso_model = LogisticRegression(solver = 'liblinear', max_iter = 10000)
#pred_fit = iso_model.fit(X_iso, y_iso) 


In [90]:
# Feature selection and classification pipeline with grid search
# ----------------------------------

n_components = 5

#pca = PCA(n_components = n_components, svd_solver = 'randomized', 
       #   whiten = True).fit(X_iso)
#pca_fit = pca.fit(X_iso)

pca = PCA(n_components = n_components)
dtc = DecisionTreeClassifier(random_state = 0)

pipe_1 = Pipeline([('PCA', pca),
                   ('dtc', dtc)])

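# Note: a list of dicts makes GridSearchCV search each dict separately
# (5 + 2 + 5 = 12 candidates), varying one parameter at a time while the
# others keep the values set on the pipeline above.
param_grid1 = [{'PCA__n_components': [3, 5, 7, 10, 12]},
               {'dtc__criterion': ["gini", "entropy"]},
               {'dtc__max_depth': [2, 4, 6, 8, 10]}]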





In [91]:
##Grid Search##
clf1 = GridSearchCV(pipe_1, param_grid1)
clf1 = clf1.fit(X_train_iso, y_train_iso)

print(clf1.cv_results_)


{'mean_fit_time': array([0.03528666, 0.04278698, 0.06184106, 0.05258293, 0.12212863,
       0.05865579, 0.07750239, 0.02291837, 0.04020157, 0.00934958,
       0.01019602, 0.0595417 ]), 'std_fit_time': array([0.03362005, 0.03846305, 0.03823092, 0.03929012, 0.02875233,
       0.03872733, 0.0314253 , 0.0325633 , 0.03890322, 0.00011281,
       0.00015134, 0.0389185 ]), 'mean_score_time': array([0.00064397, 0.00059438, 0.00062904, 0.03184218, 0.01613898,
       0.00062895, 0.00062633, 0.00059133, 0.01621528, 0.00056272,
       0.01587906, 0.0005867 ]), 'std_score_time': array([7.80154264e-05, 1.08452940e-05, 6.48180449e-05, 3.82749371e-02,
       3.06684702e-02, 3.85271440e-05, 5.34238529e-05, 3.31192156e-05,
       3.10850934e-02, 1.51978707e-05, 3.06549605e-02, 2.34395415e-05]), 'param_PCA__n_components': masked_array(data=[3, 5, 7, 10, 12, --, --, --, --, --, --, --],
             mask=[False, False, False, False, False,  True,  True,  True,
                    True,  True,  True,  True]

In [92]:
print(clf1.best_params_)


{'dtc__max_depth': 6}
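
`best_params_` contains a single key because `param_grid1` is a list of three separate dicts: `GridSearchCV` searches each dict on its own, leaving the other parameters at the values set when the pipeline was built. A sketch using `ParameterGrid` to count the candidates shows the difference from a single combined dict (the variable names `separate_grids` and `combined_grid` are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same values as param_grid1, written as a list of dicts (searched one dict at a time).
separate_grids = [{'PCA__n_components': [3, 5, 7, 10, 12]},
                  {'dtc__criterion': ['gini', 'entropy']},
                  {'dtc__max_depth': [2, 4, 6, 8, 10]}]

# One dict: every combination of the three parameters is tried.
combined_grid = {'PCA__n_components': [3, 5, 7, 10, 12],
                 'dtc__criterion': ['gini', 'entropy'],
                 'dtc__max_depth': [2, 4, 6, 8, 10]}

n_separate = len(ParameterGrid(separate_grids))   # 5 + 2 + 5
n_combined = len(ParameterGrid(combined_grid))    # 5 * 2 * 5
print(n_separate, n_combined)
```

Passing `combined_grid` to `GridSearchCV` would make `best_params_` report all three settings, at the cost of fitting roughly four times as many candidates.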


In [93]:
# Unbiased evaluation
# ----------------------------------

y_pred1 = clf1.predict(X_test_iso)
pd.DataFrame(confusion_matrix(y_test_iso, y_pred1))

print(classification_report(y_test_iso, y_pred1))

# Note: r2_score is a regression metric, not classification accuracy;
# the accuracy is the 'accuracy' row of the report printed above.
print('R^2 score: {}\n'.format(r2_score(y_test_iso, y_pred1)))




              precision    recall  f1-score   support

           0       0.91      0.94      0.93       431
           1       0.94      0.91      0.93       441

    accuracy                           0.93       872
   macro avg       0.93      0.93      0.93       872
weighted avg       0.93      0.93      0.93       872

R^2 score: 0.7109711634073583



In [94]:
#iso_scores = cross_val_score(estimator = iso_model, X = X_iso, y = y_iso)
#print(iso_scores)
#print("Mean CV score w/ IsolationForest Model:", np.mean(iso_scores))
#print('Overall model accuracy: {}\n'.format(r2_score(y_test_iso, (iso_model.predict(X_test_iso)))))

#iso_predictions = iso_model.predict(X_test_iso)
#print('Mean Absolute Error: {}\n'.mean_absolute_error(y_test_iso, iso_predictions))



#### <center>Optimal hyperparameters and performance resulting from Pipeline #1.</center>

The grid search selected a max depth of 6 for the DecisionTreeClassifier. The model works well, with about 93% accuracy on the held-out split (precision, recall, and f1-score are all about 0.93). The 0.71 printed by the evaluation cell is an R² value from `r2_score`, not an accuracy. That still leaves roughly 7% of predictions that may be wrong, so the model could be improved upon.
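
The 0.71 figure printed by the evaluation cell comes from `r2_score`, which is a regression metric; on 0/1 labels it is not the fraction of correct predictions. The toy example below (made-up labels, not the backorder data) shows how far apart the two numbers can be for the same predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Made-up balanced labels: 50 negatives, 50 positives.
y_true = np.array([0] * 50 + [1] * 50)

# Predictions with 7 mistakes out of 100 (4 false positives, 3 false negatives).
y_pred = y_true.copy()
y_pred[:4] = 1
y_pred[50:53] = 0

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
r2 = r2_score(y_true, y_pred)          # 1 - SS_res / SS_tot, treats labels as numbers

print(f"accuracy = {acc:.2f}, R^2 = {r2:.2f}")   # accuracy = 0.93, R^2 = 0.72
```

For balanced 0/1 labels, R² works out to 1 - 4 × (error rate), so a 7% error rate gives 0.72. That is roughly why a classifier with about 93% accuracy can show an R² near 0.71.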





### Pipeline #2
  * Anomaly detection
  * Dimensionality reduction
  * Model training/validation

In [95]:
# Anomaly detection 
# ----------------------------------
one_class_svm = OneClassSVM(kernel = 'rbf', nu = 0.01)
oc_outliers = one_class_svm.fit_predict(X_train) == -1   # fit_predict fits the model itself
print(f"Num of outliers = {np.sum(oc_outliers)}")


X_one_class = X_train[~oc_outliers]
y_one_class = y_train[~oc_outliers]


Num of outliers = 66


In [96]:
X_train_oc, X_test_oc, y_train_oc, y_test_oc = train_test_split(X_one_class, 
                                                                y_one_class, test_size = 0.3)


In [97]:
X_train_oc.shape

(2046, 20)

In [98]:
# Feature selection and classification pipeline with grid search
# ----------------------------------
##Feature Selection - KBest##

#def mutual_info_session(): 
#    selector = SelectKBest(chi2, k = 5)
#    selector.fit(X_train_oc, y_train_oc)
#    print(selector.get_support(True))
#    model11 = GausssianNB()
#    model.fit(selector.transform(X_train_oc), y_train_oc)
#    return model.score(selector.transform(X_test_oc), y_test_oc)

#mutual_info_session()


#pipeline

pipe_2 = Pipeline([('KBest', SelectKBest(f_classif, k = 5)),
                  ('lr', LogisticRegression())])

#pipe_2 = Pipeline([('KBest', SelectKBest(mutual_info_session(), k = 5)),
#                  ('lr', LogisticRegression())])

#hyperparams

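# Note: the two dicts are searched separately (5 + 4 = 9 candidates).
# Keyword form of the original positional call np.logspace(-8, -4, 4, 6, 8):
# four values from 8**-8 to 8**-4 (base 8, endpoint included).
param_grid2 = [{'KBest__k': [3, 5, 7, 10, 12]},
               {'lr__C': np.logspace(-8, -4, num = 4, endpoint = True, base = 8)}]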
                


#gridsearch

clf2 = GridSearchCV(pipe_2, param_grid2)
clf2 = clf2.fit(X_train_oc, y_train_oc)

print(clf2.cv_results_)



                

{'mean_fit_time': array([0.00496159, 0.00544186, 0.00612664, 0.0396759 , 0.05397038,
       0.02362671, 0.00624399, 0.00552759, 0.00521464]), 'std_fit_time': array([3.41842779e-04, 2.44717057e-04, 6.84344024e-04, 3.95429497e-02,
       3.73559485e-02, 3.08306123e-02, 1.94167095e-04, 7.81752089e-05,
       5.61641306e-05]), 'mean_score_time': array([0.00049024, 0.00051622, 0.00051265, 0.00071735, 0.00073428,
       0.00065441, 0.00050282, 0.00047607, 0.00047994]), 'std_score_time': array([1.90325536e-05, 2.56063774e-05, 1.99306524e-05, 1.11419958e-04,
       3.87127171e-05, 8.74246047e-05, 3.18604306e-05, 2.05756201e-05,
       2.44538906e-05]), 'param_KBest__k': masked_array(data=[3, 5, 7, 10, 12, --, --, --, --],
             mask=[False, False, False, False, False,  True,  True,  True,
                    True],
       fill_value='?',
            dtype=object), 'param_lr__C': masked_array(data=[--, --, --, --, --, 5.960464477539063e-08,
                   9.536743164062494e-07, 1.525

In [99]:
# Unbiased evaluation
# ----------------------------------
print(clf2.best_params_)

y_pred2 = clf2.predict(X_test_oc)

pd.DataFrame(confusion_matrix(y_test_oc, y_pred2))

print(classification_report(y_test_oc, y_pred2))

# Note: r2_score is a regression metric, not classification accuracy;
# the accuracy is the 'accuracy' row of the report printed above.
print('R^2 score: {}\n'.format(r2_score(y_test_oc, y_pred2)))



{'KBest__k': 7}
              precision    recall  f1-score   support

           0       0.89      0.89      0.89       421
           1       0.90      0.90      0.90       456

    accuracy                           0.89       877
   macro avg       0.89      0.89      0.89       877
weighted avg       0.89      0.89      0.89       877

R^2 score: 0.5705817393840896



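#### <center>Optimal hyperparameters and performance resulting from Pipeline #2.</center>

The grid search selected `KBest__k = 7`, i.e. the seven highest-scoring features. On the held-out split the pipeline reaches about 89% accuracy, with precision, recall, and f1-score between 0.89 and 0.90. The 0.57 printed by the evaluation cell is an R² value from `r2_score`, not an accuracy.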
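
The `lr__C` candidates in this pipeline come from `np.logspace` with base 8 rather than the default base 10, so all four values are extremely small, which means very strong regularization. The sketch below prints them next to a more conventional base-10 range; `conventional_C` is an illustrative choice, not something this notebook uses.

```python
import numpy as np

# Four values from 8**-8 to 8**-4, endpoints included
# (the first two match the values shown in cv_results_).
searched_C = np.logspace(-8, -4, num=4, endpoint=True, base=8)
print(searched_C)

# A commonly used base-10 range for comparison (assumption, for illustration only).
conventional_C = np.logspace(-3, 3, num=7)
print(conventional_C)
```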





### Pipeline #3

In [100]:
# Anomaly detection 
# ----------------------------------
#LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors = 5)
lof_outliers = lof.fit_predict(X_train) == -1   # fit_predict fits the model itself
print(f"Num of outliers = {np.sum(lof_outliers)}")

X_lof = X_train[~lof_outliers]
y_lof = y_train[~lof_outliers]


Num of outliers = 2


In [101]:
X_train_lof, X_test_lof, y_train_lof, y_test_lof = train_test_split(X_lof, y_lof, test_size = 0.3)


In [102]:
X_train_lof.shape

(2090, 20)

In [103]:
# Feature selection and classification pipeline with grid search
# ----------------------------------
#LinearSVC
#from sklearn.svm import LinearSVC

n_components = 5

#pipe_3 = Pipeline({'fa', FactorAnalysis(n_components = n_components, random_state = 0)}, 
#                  {'lsvc', LinearSVC(random_state = 0)})

pipe_3 = Pipeline([('fa', FactorAnalysis(n_components = n_components, random_state = 0)), 
                  ('SVC', SVC())])

# Note: the four dicts are searched separately (5 + 4 + 2 + 2 = 13 candidates).
param_grid3 = [{'fa__n_components': [3, 5, 7, 10, 12]}, 
               {'SVC__C': [1, 10, 100, 1000]}, 
               {'SVC__gamma': [0.001, 0.0001]},
               {'SVC__kernel': ['rbf', 'linear']}]




In [104]:
clf3 = GridSearchCV(pipe_3, param_grid3)
clf3 = clf3.fit(X_train_lof, y_train_lof)

print(clf3.cv_results_)


{'mean_fit_time': array([0.75502992, 1.67042456, 1.96780086, 1.96705608, 1.40234399,
       1.04546828, 1.08716216, 1.03823357, 1.40704947, 1.18395143,
       1.10017757, 1.10930657, 1.0889781 ]), 'std_fit_time': array([0.08929708, 0.21981655, 0.92241111, 0.12682547, 0.15983107,
       0.10048561, 0.07002614, 0.16492182, 0.34456113, 0.3347494 ,
       0.13339451, 0.01902695, 0.06736827]), 'mean_score_time': array([0.0398067 , 0.01096716, 0.01158414, 0.01375289, 0.01587076,
       0.01014738, 0.00859952, 0.01084933, 0.00821271, 0.02648463,
       0.04518042, 0.01228862, 0.00593596]), 'std_score_time': array([0.0381594 , 0.00177215, 0.0007319 , 0.00056095, 0.00086173,
       0.00056497, 0.0005205 , 0.00376312, 0.00076044, 0.00043212,
       0.01133899, 0.00345337, 0.00111269]), 'param_fa__n_components': masked_array(data=[3, 5, 7, 10, 12, --, --, --, --, --, --, --, --],
             mask=[False, False, False, False, False,  True,  True,  True,
                    True,  True,  True,  Tr

In [105]:

pd.concat([pd.DataFrame(clf3.cv_results_["params"]),pd.DataFrame(clf3.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)


   fa__n_components  SVC__C  SVC__gamma  SVC__kernel  Accuracy
0               3.0     NaN         NaN          NaN  0.936842
1               5.0     NaN         NaN          NaN  0.933971
2               7.0     NaN         NaN          NaN  0.931579
3              10.0     NaN         NaN          NaN   0.92488
4              12.0     NaN         NaN          NaN  0.922967
5               NaN     1.0         NaN          NaN  0.933971
6               NaN    10.0         NaN          NaN  0.938278
7               NaN   100.0         NaN          NaN  0.933014
8               NaN  1000.0         NaN          NaN  0.916268
9               NaN     NaN       0.001          NaN  0.900478


In [106]:

rank_tbl = pd.DataFrame(
    {
        'Model': clf3.cv_results_['params'],
        'Mean Test Score': clf3.cv_results_['mean_test_score'],
        'Std Test Score': clf3.cv_results_['std_test_score'],
        'Rank': clf3.cv_results_['rank_test_score']
    }
)

rank_tbl.sort_values('Rank')

                       Model  Mean Test Score  Std Test Score  Rank
 6            {'SVC__C': 10}         0.938278        0.019981     1
 0   {'fa__n_components': 3}         0.936842        0.017554     2
 1   {'fa__n_components': 5}         0.933971        0.018005     3
 5             {'SVC__C': 1}         0.933971        0.018005     3
11    {'SVC__kernel': 'rbf'}         0.933971        0.018005     3
 7           {'SVC__C': 100}         0.933014        0.019079     6
 2   {'fa__n_components': 7}         0.931579        0.017749     7
 3  {'fa__n_components': 10}          0.92488        0.014716     8
 4  {'fa__n_components': 12}         0.922967        0.017606     9
 8          {'SVC__C': 1000}         0.916268        0.015797    10


In [107]:
# Unbiased evaluation
# ----------------------------------
print(clf3.best_params_)

y_pred3 = clf3.predict(X_test_lof)

print(pd.DataFrame(confusion_matrix(y_test_lof, y_pred3)))

print(classification_report(y_test_lof, y_pred3))

# Note: r2_score is a regression metric, not classification accuracy;
# the accuracy is the 'accuracy' row of the report printed above.
print('R^2 score: {}\n'.format(r2_score(y_test_lof, y_pred3)))



{'SVC__C': 10}
     0    1
0  423   14
1   35  425
              precision    recall  f1-score   support

           0       0.92      0.97      0.95       437
           1       0.97      0.92      0.95       460

    accuracy                           0.95       897
   macro avg       0.95      0.95      0.95       897
weighted avg       0.95      0.95      0.95       897

R^2 score: 0.7813501144164761



#### <center>Optimal hyperparameters and performance resulting from Pipeline #3.</center>

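This is the best-performing model, with about 95% accuracy on the held-out split (the 0.78 printed by the evaluation cell is an R² value, not an accuracy). The grid search selected `SVC__C = 10`, with `fa__n_components = 3` ranked a close second. Although this is the best model, the result could be skewed because the anomaly detection method used (LocalOutlierFactor) flagged only 2 outliers, so almost no rows were removed before training.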
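
The accuracy, precision, and recall in the report can be recomputed directly from the confusion matrix printed in the Pipeline #3 output, which is a useful sanity check. A minimal sketch using the four counts from that output:

```python
import numpy as np

# Confusion matrix from the Pipeline #3 output: rows = true class, columns = predicted class.
cm = np.array([[423, 14],
               [35, 425]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / actually that class

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {np.round(precision, 2)}")
print(f"recall    = {np.round(recall, 2)}")
```

The result is 848 correct predictions out of 897, which rounds to the 0.95 accuracy shown in the classification report.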





## Overall
The number of outliers flagged varied considerably across the anomaly detection methods: Pipelines 1 and 2 removed 84 and 66 rows, while Pipeline 3 removed only 2. This may have contributed to Pipeline 3 appearing to perform better than the others. However, because random resampling was performed in Part I of this assignment, that step could also contribute to the low number of outliers overall (for all models).
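
To compare the detectors side by side, their outlier counts can be collected in one loop. The sketch below uses synthetic data with a few injected extreme rows, so the counts it prints will not match the ones reported in this notebook; the detector settings mirror the ones used above, and `X_demo` is a hypothetical stand-in for `X_train`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic stand-in for X_train: 500 ordinary rows plus 10 extreme ones.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(size=(500, 20)),
                    rng.normal(loc=8.0, size=(10, 20))])

detectors = {
    'IsolationForest': IsolationForest(contamination='auto', random_state=42),
    'OneClassSVM': OneClassSVM(kernel='rbf', nu=0.01),
    'LocalOutlierFactor': LocalOutlierFactor(n_neighbors=5),
}

# fit_predict labels outliers as -1 and inliers as 1 for all three detectors.
outlier_counts = {name: int(np.sum(det.fit_predict(X_demo) == -1))
                  for name, det in detectors.items()}

for name, count in outlier_counts.items():
    print(f"{name}: {count} outliers out of {len(X_demo)} rows")
```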




In [108]:
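# Pickle the best pipeline found by the grid search.
# GridSearchCV fits clones, so pipe_3 itself is never fitted;
# clf3.best_estimator_ holds the refit pipeline with the best parameters.
joblib.dump(clf3.best_estimator_, 'sampledata-Part2-V1.pkl')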


['sampledata-Part2-V1.pkl']
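
In a later step the saved file can be read back with `joblib.load` and used like any other estimator. The round trip below is a sketch on a small synthetic pipeline written to a temporary directory; the file name `demo-pipeline.pkl` and the data are hypothetical, and it assumes the object that was saved had already been fitted. An unfitted pipeline can be reloaded the same way but must be fit before calling `predict`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small synthetic problem (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)

fitted_pipe = Pipeline([('fa', FactorAnalysis(n_components=3, random_state=0)),
                        ('SVC', SVC(C=10))]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo-pipeline.pkl')
    joblib.dump(fitted_pipe, path)      # returns the list of files written
    reloaded = joblib.load(path)

# The reloaded pipeline should predict exactly like the original.
same = np.array_equal(fitted_pipe.predict(X_demo), reloaded.predict(X_demo))
print(same)
```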

# This is the End. 