# STUDY GROUP - M03S36
## Pipelines

### Objectives

You will be able to:
* understand the Machine Learning Maxims (MLM)
* understand the concept of pipelining and how it relates to the MLM
* understand how to organize code for ease of use and reproducibility

### Pipelines

What are the steps in a machine learning workflow? 
* data cleaning
* EDA
* feature selection/engineering/reduction (PCA)
* model running/selection
* evaluation & interpretation

### Machine Learning Maxims 

**Better data beats fancier algorithms**

**Algorithms are commodities**

**Overfitting is the Devil!**

Why use a pipeline? - A pipeline chains the sequential steps for data transformation and model training/selection into a single object, so the same steps run in the same order every time.

How would a pipeline help us honor our Machine Learning Maxims? - It standardizes the process: every candidate model is trained and evaluated the same way, which makes the maxims easier to honor.
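
To make the idea concrete, here is a minimal sketch of a two-step pipeline. The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, and `pipe` are assumptions for illustration only; they are not the dataset or variables used later in these notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the dataset used in the example below)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# One object holds both the preprocessing step and the model
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=123))

# fit() learns the scaler's mean and std from the training rows only, then fits the model
pipe.fit(X_tr, y_tr)

# score() scales X_te with the training statistics before predicting
test_accuracy = pipe.score(X_te, y_te)

# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))
print(round(test_accuracy, 3))
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call to forget or to apply to the wrong split.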

### GridSearch and Cross Validation

Now that we're tuning models, how does this affect our cross-validation? - 
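
The notes leave this question open. One common line of reasoning, offered here as an assumption rather than the group's answer: once cross-validation is used to pick hyperparameters, the winning cross-validation score is slightly optimistic, so a held-out test set is still needed for the final estimate. Passing the whole pipeline to `GridSearchCV` also means the scaler is refit on the training portion of each fold. The sketch below uses synthetic data and a hypothetical grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid; keys follow the '<step name>__<parameter>' convention
grid = {'logisticregression__C': [0.01, 0.1, 1, 10]}

# Each of the 5 folds refits the scaler and the model on that fold's training part
search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_tr, y_tr)

cv_score = search.best_score_          # score used to choose the hyperparameters
test_score = search.score(X_te, y_te)  # score on data never seen during tuning

print('best C:', search.best_params_['logisticregression__C'])
print('CV accuracy:', round(cv_score, 3))
print('test accuracy:', round(test_score, 3))
```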



### Now for an example...

In [11]:
# numpy and pandas for data manipulation
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

# visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns

# sklearn and appropriate algorithms to choose between
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# model selection, data processing
from sklearn.model_selection import train_test_split 
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Evaluation metrics, classification
from sklearn.metrics import roc_curve, auc, confusion_matrix

# # Pickle for saving model files
# import pickle


In [15]:
# import cleaned data
df = pd.read_csv('backup_analytical_base_table.csv')

df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14068 entries, 0 to 14067
Data columns (total 26 columns):
avg_monthly_hrs            14068 non-null int64
filed_complaint            14068 non-null float64
last_evaluation            14068 non-null float64
n_projects                 14068 non-null int64
recently_promoted          14068 non-null float64
satisfaction               14068 non-null float64
status                     14068 non-null int64
tenure                     14068 non-null float64
last_evaluation_missing    14068 non-null int64
underperformer             14068 non-null int64
unhappy                    14068 non-null int64
overachiever               14068 non-null int64
department_IT              14068 non-null int64
department_Missing         14068 non-null int64
department_admin           14068 non-null int64
department_engineering     14068 non-null int64
department_finance         14068 non-null int64
department_management      14068 non-null int64
department_market

In [16]:
# split dataset
y = df.status
X = df.drop('status', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=123)
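
If the target classes are imbalanced, a stratified split may be worth considering. The sketch below uses hypothetical toy labels (80 zeros, 20 ones) to show how the `stratify` argument of `train_test_split` keeps the class proportions equal in both splits; whether the `status` column above needs this is an assumption to check against the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Share of positive labels in each split
print(y_tr.mean(), y_te.mean())
```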

In [18]:
# build pipelines

# model pipelines 
pipelines = {'l1': make_pipeline(StandardScaler(),
                                 # the 'l1' penalty needs a solver that supports it, e.g. liblinear
                                 LogisticRegression(penalty='l1', solver='liblinear', random_state=123)),
             'l2': make_pipeline(StandardScaler(),
                                 LogisticRegression(penalty='l2', random_state=123)),
             'rf': make_pipeline(StandardScaler(),
                                 RandomForestClassifier(random_state=123)),
             'gb': make_pipeline(StandardScaler(),
                                 GradientBoostingClassifier(random_state=123))
            }


In [20]:
# declare hyperparameters
# research proper range for each hyperparameter to allow for best performance in GridSearchCV
l1_hp = {'logisticregression__C': np.linspace(1e-3, 1e3, 10)}
l2_hp = {'logisticregression__C': np.linspace(1e-3, 1e3, 10)}
rf_hp = {
    'randomforestclassifier__n_estimators': [100, 200],
    # 'auto' was an alias for 'sqrt' on classifiers and is rejected by newer scikit-learn releases
    'randomforestclassifier__max_features': ['sqrt', 0.33]
    }
gb_hp = {
    'gradientboostingclassifier__n_estimators': [100, 200],
    'gradientboostingclassifier__learning_rate': [0.05, 0.1, 0.2],
    'gradientboostingclassifier__max_depth': [1, 3, 5]
    }

hyperparameters = {'l1': l1_hp,
                   'l2': l2_hp,
                   'rf': rf_hp,
                   'gb': gb_hp
                  }
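
The grid keys follow the `<step name>__<parameter>` convention, where `make_pipeline` derives each step name from the lowercased class name. A misspelled key only fails once `GridSearchCV.fit` runs, so a quick check against `get_params()` can save time. This is a small sketch; the variable names `pipe`, `grid`, and `unknown` are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=123))
grid = {
    'randomforestclassifier__n_estimators': [100, 200],
    'randomforestclassifier__max_features': ['sqrt', 0.33],
}

# get_params() lists every tunable name, including nested '<step>__<param>' entries
valid_names = pipe.get_params().keys()

# Any grid key missing from valid_names would raise an error during the grid search
unknown = [key for key in grid if key not in valid_names]
print(unknown)
```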

In [22]:
# fit and tune with cross-validation
fitted_models = {}

for name, pipeline in pipelines.items():
    # Create cross-validation object from pipeline and hyperparameters
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1)
    
    # Fit model on X_train, y_train
    model.fit(X_train, y_train)
    
    # Store model in fitted_models[name] 
    fitted_models[name] = model
    
    # Print '{name} has been fitted'
    print(name, 'has been fitted.')


l1 has been fitted.
l2 has been fitted.
rf has been fitted.
gb has been fitted.
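
After fitting, each `GridSearchCV` object keeps the winning settings and the per-candidate scores. The sketch below shows one way to inspect them on a small synthetic problem; the data, the reduced grid, and the names `search` and `results` are assumptions chosen so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data and a deliberately tiny grid (assumptions for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=123)
pipe = make_pipeline(StandardScaler(),
                     GradientBoostingClassifier(n_estimators=20, random_state=123))
grid = {'gradientboostingclassifier__max_depth': [1, 3]}

search = GridSearchCV(pipe, grid, cv=3).fit(X_demo, y_demo)

# cv_results_ holds one row per hyperparameter combination
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(search.cv_results_)[columns]
print(results.sort_values('rank_test_score'))

# best_params_ is the winning combination; best_estimator_ is the pipeline
# refit on all of the data passed to fit() using those settings
print(search.best_params_)
best_pipeline = search.best_estimator_
```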


In [24]:
# best mean cross-validated score (accuracy by default) for each tuned model
for name, model in fitted_models.items():
    print(name, model.best_score_)


l1 0.8459214501510574
l2 0.8459214501510574
rf 0.979651679402879
gb 0.9765416740714412


In [25]:
# calculate AUC for each fitted model
for name, model in fitted_models.items():
    # probability of the positive class (second column of predict_proba)
    pred = model.predict_proba(X_test)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test, pred)
    print(name, auc(fpr, tpr))

l1 0.9027979220500826
l2 0.902798609111952
rf 0.9883690731466678
gb 0.9869925446916569
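
The two-step `roc_curve` plus `auc` computation can also be done in a single call with `roc_auc_score`, and `confusion_matrix` (imported earlier but not used) gives a per-class breakdown at a chosen threshold. The sketch below uses hypothetical toy labels and probabilities, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_auc_score, roc_curve

# Hypothetical true labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Two-step AUC, as in the cell above
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc_two_step = auc(fpr, tpr)

# One-step equivalent
auc_direct = roc_auc_score(y_true, y_prob)
print(auc_two_step, auc_direct)

# Confusion matrix at a 0.5 threshold: rows are true classes, columns are predictions
y_pred = (y_prob >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
print(cm)
```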


