# Water Pumps - Modeling Pipeline

In this notebook, I will take what I learned in the Modeling notebook to create a pipeline that fits all three models. The notebook can be configured to either over sample or under sample the data. For each model, hyperparameter tuning will be performed, using the strategy developed in the Modeling notebook. The best parameters will then be selected for the final model for each model type. All results are saved to a csv file. At the end of the notebook, the results between each model can be compared.

## Import Libraries

In [1]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

## Configurations
Set up configurations for current run of this notebook.

In [206]:
binary = True
debug = True
# sampling_type = 'over'
sampling_type = 'under'
results_filename = 'pipeline_results.csv'
if binary:
    split_filename = results_filename.split('_')
    results_filename = '_'.join([split_filename[0], 'binary', split_filename[1]])
results_filepath = f'../results/{results_filename}'

## Load Train and Test Sets

In [207]:
def load_train_test():
    file_list = ['X_train', 'X_test', 'y_train', 'y_test']
    data_sets = []
    for filename in file_list:
        data_sets.append(pickle.load(open(f'../data/clean/{filename}', 'rb')))
    return tuple(data_sets)

In [208]:
X_train, X_test, y_train, y_test = load_train_test()

In [209]:
# Use this block to shortening training data for debugging.
if debug:
    row_cut = 100
    X_train = X_train[:100]
    y_train = y_train[:100]

### Baseline Model
Load predictions from baseline model.

In [210]:
y_pred_base = pickle.load(open(f'../data/clean/y_pred_base', 'rb'))

## Prepare Training Data

In [211]:
def prepare_data(X_train, X_test, y_train, y_test, sampling_type):
    
    if binary:
        y_train = pd.Series(y_train).replace({'functional needs repair': 'faulty', 'non functional': 'faulty'}).values
        y_test = pd.Series(y_test).replace({'functional needs repair': 'faulty', 'non functional': 'faulty'}).values
    
    scaler = MinMaxScaler().fit(X_train)
    X_train_rescaled = scaler.transform(X_train)
    X_test_pipe = scaler.transform(X_test)
    
    if sampling_type == 'over':
        X_train_pipe, y_train_pipe = SMOTE().fit_resample(X_train_rescaled, y_train)
    elif sampling_type == 'under':
        X_train_pipe, y_train_pipe = RandomUnderSampler(random_state=42).fit_resample(X_train_rescaled, y_train)
    else:
        raise Exception("sampling_type must be 'over' or 'under'. Please try again.")
    
    y_test_pipe = y_test
    
    return X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe

In [212]:
X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe = prepare_data(X_train, X_test, y_train, y_test, sampling_type)

## Modeling

### Load Saved Results
If a results file exists, load it, otherwise set it to `None`.

In [215]:
if os.path.isfile(results_filepath):
    df_results = pd.read_csv(results_filepath, index_col=0, header=[0, 1])
    df_results.drop('class count', level=1, axis=1, inplace=True)
else:
    df_results = None

In [216]:
df_results

Unnamed: 0_level_0,logreg_over,logreg_over,logreg_under,logreg_under,rf_over,rf_over,rf_under,rf_under,xgb_over,xgb_over,xgb_under,xgb_under
Unnamed: 0_level_1,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall
faulty,0.528846,0.567761,0.511537,0.622022,0.569132,0.515352,0.530914,0.588671,0.514713,0.546321,0.516355,0.54738
functional,0.677973,0.642737,0.684977,0.580482,0.67911,0.724434,0.685298,0.632642,0.665038,0.636194,0.666146,0.637876


In [217]:
def store_results(y_test, y_pred, model_type, df=None):
    
    if df is not None:
        if model_type in df.columns:
            df.drop(model_type, level=0, axis=1, inplace=True)
    else:
        pass
    
    results = classification_report(y_test, y_pred, output_dict=True)
    df_results = pd.DataFrame(results).T
    df_results.drop(columns=['f1-score', 'support'], inplace=True)
    df_results.drop(['accuracy', 'macro avg', 'weighted avg'], inplace=True)
    
    multi_columns = [(model_type, x) for x in df_results.columns]
    df_results.columns = pd.MultiIndex.from_tuples(multi_columns)
    
    if df is None:
        return df_results
    else:
        df_results = pd.concat([df, df_results], axis=1)
        df_results.sort_index(axis=1, level=0, inplace=True)
        return df_results

### Baseline Model
Set up baseline model results and store in data frame.

In [218]:
model_type = 'base_line'

In [219]:
if binary == False:
    df_results = store_results(y_test, y_pred_base, model_type, df=df_results)
    df_results

### Logistic Regression

In [220]:
model_type = f'logreg_{sampling_type}'

In [221]:
logreg_rs = LogisticRegression(solver='saga', multi_class='multinomial', max_iter=10000)
rs_logreg_params = {'C': np.arange(0.2, 2.4, 0.4), 'penalty': ['l1', 'l2']}
rs_logreg = RandomizedSearchCV(logreg_rs, rs_logreg_params, random_state=42, n_jobs=-1)
rs_logreg.fit(X_train_pipe, y_train_pipe)

RandomizedSearchCV(estimator=LogisticRegression(max_iter=10000,
                                                multi_class='multinomial',
                                                solver='saga'),
                   n_jobs=-1,
                   param_distributions={'C': array([0.2, 0.6, 1. , 1.4, 1.8, 2.2]),
                                        'penalty': ['l1', 'l2']},
                   random_state=42)

In [222]:
def print_best_params(best_params):
    for key, value in best_params.items():
        if type(best_params[key]) == str:
            print(f' * {key}: {best_params[key]}')
        else:
            print(f' * {key}: {best_params[key]:0.3f}')

In [223]:
print('The best hyperparameters of logistic regression are:')
print_best_params(rs_logreg.best_params_)

The best hypeparameters of logistic regression are:
 * penalty: l1
 * C: 2.200


In [224]:
def save_hyper_params(model_type, sampling, binary, best_params):
    
    next_row = {
        'sampling': sampling,
        'num_classes': 3 - int(True == binary),
    }
    
    for key, value in best_params.items():
        next_row[key] = best_params[key]
    
    hyper_params_files = f'../results/{model_type}_hyperparams.csv'
    if os.path.isfile(hyper_params_files):
        df = pd.read_csv(hyper_params_files)
        next_index = len(df)
        df.loc[next_index] = next_row
    else:
        df = pd.DataFrame(next_row, index=[0])
    df.to_csv(hyper_params_files, index=False)
    
    return df

In [225]:
df_logreg_params = save_hyper_params('logreg', sampling_type, binary, rs_logreg.best_params_)

In [226]:
df_logreg_params

Unnamed: 0,sampling,num_classes,penalty,C
0,over,3,l1,2.2
1,under,3,l1,0.2
2,over,2,l1,2.2
3,under,2,l1,2.2


In [227]:
logreg_best = LogisticRegression(
    solver='saga', 
    multi_class='multinomial', 
    C=rs_logreg.best_params_['C'], 
    penalty=rs_logreg.best_params_['penalty'], 
    max_iter=10000
)
logreg_best.fit(X_train_pipe, y_train_pipe)

LogisticRegression(C=2.2000000000000006, max_iter=10000,
                   multi_class='multinomial', penalty='l1', solver='saga')

In [228]:
y_pred_logreg_best = logreg_best.predict(X_test_pipe)

In [229]:
df_results = store_results(y_test_pipe, y_pred_logreg_best, model_type, df=df_results)

In [230]:
df_results

Unnamed: 0_level_0,logreg_over,logreg_over,logreg_under,logreg_under,rf_over,rf_over,rf_under,rf_under,xgb_over,xgb_over,xgb_under,xgb_under
Unnamed: 0_level_1,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall
faulty,0.528846,0.567761,0.511653,0.621758,0.569132,0.515352,0.530914,0.588671,0.514713,0.546321,0.516355,0.54738
functional,0.677973,0.642737,0.684965,0.580856,0.67911,0.724434,0.685298,0.632642,0.665038,0.636194,0.666146,0.637876


### Random Forest

In [231]:
model_type = f'rf_{sampling_type}'

In [232]:
rf_rs = RandomForestClassifier(random_state = 42, n_jobs=-1)

In [233]:
max_depth_list = list(np.arange(10, 110, 10))
max_depth_list.append(None)

In [234]:
rs_rf_params = {
    'bootstrap': [True, False],
    'max_depth': max_depth_list,
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': list(np.arange(1, 11, 1)),
    'min_samples_split': list(np.arange(1, 11, 1)),
    'n_estimators': list(np.arange(200, 2200, 200))
}

In [235]:
rs_rf = RandomizedSearchCV(rf_rs, rs_rf_params, random_state=42, n_jobs=-1)
rs_rf.fit(X_train_pipe, y_train_pipe)

RandomizedSearchCV(estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 3, 4, 5, 6,
                                                             7, 8, 9, 10],
                                        'min_samples_split': [1, 2, 3, 4, 5, 6,
                                                              7, 8, 9, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42)

In [236]:
print('The best hyperparameters for random forest are:')
print_best_params(rs_rf.best_params_)

The best hyperparameters for random forest are:
 * n_estimators: 1000.000
 * min_samples_split: 10.000
 * min_samples_leaf: 2.000
 * max_features: sqrt
 * max_depth: 80.000
 * bootstrap: 0.000


Double the number of trees for the final random forest model.

In [237]:
rs_rf.best_params_['n_estimators'] *= 2

In [238]:
df_rf_params = save_hyper_params('rf', sampling_type, binary, rs_rf.best_params_)

In [239]:
df_rf_params

Unnamed: 0,sampling,num_classes,n_estimators,min_samples_split,min_samples_leaf,max_features,max_depth,bootstrap
0,over,3,1600,3,1,auto,90,True
1,under,3,2400,7,3,auto,40,True
2,over,2,2000,9,3,sqrt,60,True
3,under,2,2000,10,2,sqrt,80,False


In [240]:
rf_best = RandomForestClassifier(
    n_estimators = rs_rf.best_params_['n_estimators'],
    min_samples_split = rs_rf.best_params_['min_samples_split'],
    min_samples_leaf = rs_rf.best_params_['min_samples_leaf'],
    max_features = rs_rf.best_params_['max_features'],
    max_depth = rs_rf.best_params_['max_depth'],
    bootstrap = rs_rf.best_params_['bootstrap'],
    random_state = 42, 
    n_jobs=-1
)

In [241]:
rf_best.fit(X_train_pipe, y_train_pipe)

RandomForestClassifier(bootstrap=False, max_depth=80, max_features='sqrt',
                       min_samples_leaf=2, min_samples_split=10,
                       n_estimators=2000, n_jobs=-1, random_state=42)

In [242]:
y_pred_rf = rf_best.predict(X_test_pipe)

In [243]:
df_results = store_results(y_test_pipe, y_pred_rf, model_type, df=df_results)

In [244]:
df_results

Unnamed: 0_level_0,logreg_over,logreg_over,logreg_under,logreg_under,rf_over,rf_over,rf_under,rf_under,xgb_over,xgb_over,xgb_under,xgb_under
Unnamed: 0_level_1,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall
faulty,0.528846,0.567761,0.511653,0.621758,0.569132,0.515352,0.530914,0.588671,0.514713,0.546321,0.516355,0.54738
functional,0.677973,0.642737,0.684965,0.580856,0.67911,0.724434,0.685298,0.632642,0.665038,0.636194,0.666146,0.637876


### XGBoost

In [245]:
model_type = f'xgb_{sampling_type}'

In [246]:
if binary:
    class_mapping = {
        'functional': 0,
        'faulty': 1
    }
else:
    class_mapping = {
        'functional': 0,
        'functional needs repair': 1,
        'non functional': 2
    }

In [247]:
y_train_encoded = pd.Series(y_train_pipe).replace(class_mapping).values
y_test_encoded = pd.Series(y_test_pipe).replace(class_mapping).values

In [248]:
xgb_rs = XGBClassifier(n_estimators=1000)

In [249]:
if debug:
    rs_params_xgb_over = {
        'max_depth': [1],
        'min_child_weight': [1],
        'gamma': [0]
    }
else:
    rs_params_xgb_over = {
        'max_depth': list(np.arange(1, 7, 2)),
        'min_child_weight': list(np.arange(1, 7, 2)),
        'gamma': [0, 1, 5]
    }

In [250]:
rs_xgb = RandomizedSearchCV(xgb_rs, rs_params_xgb_over, random_state=42, n_jobs=-1, n_iter=100)
rs_xgb.fit(X_train_pipe, y_train_encoded)





RandomizedSearchCV(estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=nan,
                                           monotone_constraints=None,
                                           n_estimators=1000, n_jobs=None,
                                           num_parallel_tree=None,
                                           random_state=None, reg_alpha=None,
                                           reg_lam

In [251]:
print('The best parameters for XGBoost are:')
print_best_params(rs_xgb.best_params_)

The best parameters for XGBoost are:
 * min_child_weight: 1.000
 * max_depth: 1.000
 * gamma: 0.000


Set the total number of trees to 1000.

In [252]:
rs_xgb.best_params_['n_estimators'] = 1000

In [254]:
df_xgb_params = save_hyper_params('xgb', sampling_type, binary, rs_xgb.best_params_)

In [255]:
df_xgb_params

Unnamed: 0,sampling,num_classes,min_child_weight,max_depth,gamma,n_estimators
0,over,3,1,1,0,1000
1,under,3,1,1,0,1000
2,over,2,1,1,0,1000
3,under,2,1,1,0,1000


In [256]:
xgb_best = XGBClassifier(
    n_estimators=rs_xgb.best_params_['n_estimators'],
    max_depth=rs_xgb.best_params_['max_depth'],
    min_child_weight=rs_xgb.best_params_['min_child_weight'],
    gamma=rs_xgb.best_params_['gamma']
)

In [257]:
xgb_best.fit(X_train_pipe, y_train_encoded)





XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=1,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=16, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [258]:
y_pred_encoded = xgb_best.predict(X_test_pipe)

In [259]:
reverse_mapping = {}
if binary:
    reverse_mapping = {
        0:'functional',
        1:'faulty'
    }
else:
    reverse_mapping = {
        0:'functional',
        1:'functional needs repair',
        2:'non functional'
    }

In [260]:
y_pred_xgb = pd.Series(y_pred_encoded).replace(reverse_mapping).values

In [261]:
df_results = store_results(y_test_pipe, y_pred_xgb, model_type, df=df_results)

### Add Class Counts

In [262]:
def add_class_counts(df, y_test):
    df_class_counts = pd.DataFrame(pd.Series(y_test).value_counts())
    df_class_counts.columns = pd.MultiIndex.from_tuples([('', 'class count')])
    df_results = pd.concat([df, df_class_counts], axis=1)
    return df_results

In [263]:
df_results = add_class_counts(df_results, y_test_pipe)

In [264]:
df_results

Unnamed: 0_level_0,logreg_over,logreg_over,logreg_under,logreg_under,rf_over,rf_over,rf_under,rf_under,xgb_over,xgb_over,xgb_under,xgb_under,Unnamed: 13_level_0
Unnamed: 0_level_1,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,class count
faulty,0.528846,0.567761,0.511653,0.621758,0.569132,0.515352,0.530914,0.588671,0.514713,0.546321,0.516355,0.54738,3778
functional,0.677973,0.642737,0.684965,0.580856,0.67911,0.724434,0.685298,0.632642,0.665038,0.636194,0.666146,0.637876,5349


### Save Results
Save the results from the pipeline to the results directory.

In [265]:
df_results.to_csv(results_filepath)

### Display Results

In [266]:
def get_results(get_binary=False):
    results_filename = 'pipeline_results.csv'
    if get_binary:
        split_filename = results_filename.split('_')
        results_filename = '_'.join([split_filename[0], 'binary', split_filename[1]])
    results_filepath = f'../results/{results_filename}'
    if os.path.isfile(results_filepath):
        df = pd.read_csv(results_filepath, index_col=0, header=[0, 1])
        unnnamed = [x[0] for x in df.columns if 'Unnamed' in x[0]][0]
        new_index = [(x[0].replace(unnnamed, ' '), x[1]) if unnnamed in x else x for x in df.columns.values.tolist()]
        df.columns = pd.MultiIndex.from_tuples(new_index)
        return df
    else:
        return None

In [267]:
df_final_results = get_results(get_binary=False)

In [268]:
df_final_binary_results = get_results(get_binary=True)

### Final Results:



#### All Three Classes

In [269]:
df_final_results

Unnamed: 0_level_0,base_line,base_line,logreg_over,logreg_over,logreg_under,logreg_under,rf_over,rf_over,rf_under,rf_under,xgb_over,xgb_over,xgb_under,xgb_under,Unnamed: 15_level_0
Unnamed: 0_level_1,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,class count
functional,0.767559,0.911198,0.679266,0.678632,0.0,0.0,0.670067,0.749112,0.603381,0.373715,0.672786,0.739951,0.62165,0.438026,5349
functional needs repair,0.632258,0.151938,0.124478,0.231008,0.0,0.0,0.11039,0.052713,0.088183,0.460465,0.086262,0.083721,0.073048,0.269767,645
non functional,0.787948,0.659432,0.535963,0.442387,0.343267,1.0,0.522015,0.473029,0.484464,0.378232,0.549274,0.458985,0.468078,0.444622,3133


#### Two Classes:

In [270]:
df_final_binary_results

Unnamed: 0_level_0,logreg_over,logreg_over,logreg_under,logreg_under,rf_over,rf_over,rf_under,rf_under,xgb_over,xgb_over,xgb_under,xgb_under,Unnamed: 13_level_0
Unnamed: 0_level_1,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,precision,recall,class count
faulty,0.528846,0.567761,0.511653,0.621758,0.569132,0.515352,0.530914,0.588671,0.514713,0.546321,0.516355,0.54738,3778
functional,0.677973,0.642737,0.684965,0.580856,0.67911,0.724434,0.685298,0.632642,0.665038,0.636194,0.666146,0.637876,5349


#### Hyperparameters
Here is a summary of the hyperparameters use for each model.

* Logistic Regression:

In [271]:
df_logreg_params

Unnamed: 0,sampling,num_classes,penalty,C
0,over,3,l1,2.2
1,under,3,l1,0.2
2,over,2,l1,2.2
3,under,2,l1,2.2


* Random Forest

In [272]:
df_rf_params

Unnamed: 0,sampling,num_classes,n_estimators,min_samples_split,min_samples_leaf,max_features,max_depth,bootstrap
0,over,3,1600,3,1,auto,90,True
1,under,3,2400,7,3,auto,40,True
2,over,2,2000,9,3,sqrt,60,True
3,under,2,2000,10,2,sqrt,80,False


* XGBoost

In [273]:
df_xgb_params

Unnamed: 0,sampling,num_classes,min_child_weight,max_depth,gamma,n_estimators
0,over,3,1,1,0,1000
1,under,3,1,1,0,1000
2,over,2,1,1,0,1000
3,under,2,1,1,0,1000


## Conclusions
### All Classes
* 