# Water Pumps - Modeling Pipeline

In this notebook, I apply what I learned in the Modeling notebook to build a pipeline that fits all three models: logistic regression, random forest, and XGBoost. The notebook can be configured to either oversample or undersample the training data. For each model type, hyperparameters are tuned using the strategy developed in the Modeling notebook, and the best parameters are then used to fit the final model. All results are saved to a CSV file so that the models can be compared at the end of the notebook.

## Import Libraries

In [1]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

## Configurations
Set the configuration for the current run of this notebook.

In [104]:
binary = True
debug = False
# sampling_type = 'over'
sampling_type = 'under'
results_filename = 'pipeline_results.csv'
if binary:
    split_filename = results_filename.split('_')
    results_filename = '_'.join([split_filename[0], 'binary', split_filename[1]])
results_filepath = f'../results/{results_filename}'
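
The filename logic above can be wrapped in a small helper so the same rule is reused when saving and when reloading results. This is a minimal sketch; `build_results_path` and its default arguments are hypothetical names, not something defined elsewhere in this notebook.

```python
def build_results_path(binary, base_name='pipeline_results.csv', results_dir='../results'):
    """Return the results path, inserting 'binary' into the filename when binary is True."""
    if binary:
        # 'pipeline_results.csv' -> ['pipeline', 'results.csv']
        prefix, suffix = base_name.split('_')
        base_name = '_'.join([prefix, 'binary', suffix])
    return f'{results_dir}/{base_name}'


print(build_results_path(True))
print(build_results_path(False))
```

The helper assumes the base filename contains exactly one underscore, as `pipeline_results.csv` does.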

## Load Train and Test Sets

In [105]:
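def load_train_test():
    # Load the pickled train/test splits from the clean data directory.
    file_list = ['X_train', 'X_test', 'y_train', 'y_test']
    data_sets = []
    for filename in file_list:
        # The context manager closes each file after it is read.
        with open(f'../data/clean/{filename}', 'rb') as f:
            data_sets.append(pickle.load(f))
    return tuple(data_sets)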

In [106]:
X_train, X_test, y_train, y_test = load_train_test()

In [107]:
# Use this block to shorten the training data for debugging.
if debug:
    row_cut = 100
    X_train = X_train[:row_cut]
    y_train = y_train[:row_cut]

Load the predictions from the baseline model.

In [108]:
with open('../data/clean/y_pred_base', 'rb') as f:
    y_pred_base = pickle.load(f)

## Prepare Training Data

In [109]:
np.unique(np.where(y_train == 'functional needs repair', 'non functional', y_train))

array(['functional', 'non functional'], dtype=object)

In [110]:
def prepare_data(X_train, X_test, y_train, y_test, sampling_type):
    # Relies on the global `binary` flag set in the Configurations section.
    if binary:
        # Collapse 'functional needs repair' into 'non functional' for the two-class problem.
        y_train = np.where(y_train == 'functional needs repair', 'non functional', y_train)
        y_test = np.where(y_test == 'functional needs repair', 'non functional', y_test)

    # Fit the scaler on the training set only, then apply it to both sets.
    scaler = MinMaxScaler().fit(X_train)
    X_train_rescaled = scaler.transform(X_train)
    X_test_pipe = scaler.transform(X_test)

    # Resample the training set only; the test set keeps its original class balance.
    if sampling_type == 'over':
        X_train_pipe, y_train_pipe = SMOTE().fit_resample(X_train_rescaled, y_train)
    elif sampling_type == 'under':
        X_train_pipe, y_train_pipe = RandomUnderSampler(random_state=42).fit_resample(X_train_rescaled, y_train)
    else:
        raise ValueError("sampling_type must be 'over' or 'under'.")

    y_test_pipe = y_test

    return X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe

In [111]:
X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe = prepare_data(X_train, X_test, y_train, y_test, sampling_type)
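
To make the order of operations in `prepare_data` concrete, here is a hand-rolled NumPy sketch of the same two ideas: scale with training-set statistics only, then undersample the training set only. It is an illustration on made-up toy data, not how `MinMaxScaler` or `RandomUnderSampler` are implemented internally, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 8 majority-class rows and 2 minority-class rows (made up for illustration).
X_train_toy = rng.uniform(0, 100, size=(10, 2))
y_train_toy = np.array(['functional'] * 8 + ['non functional'] * 2)
X_test_toy = rng.uniform(0, 100, size=(4, 2))

# Min-max scaling using statistics from the training set only.
col_min = X_train_toy.min(axis=0)
col_range = X_train_toy.max(axis=0) - col_min
X_train_scaled = (X_train_toy - col_min) / col_range
X_test_scaled = (X_test_toy - col_min) / col_range  # test values may fall outside [0, 1]

# Random undersampling: keep as many rows per class as the rarest class has.
classes, counts = np.unique(y_train_toy, return_counts=True)
n_keep = counts.min()
keep_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_train_toy == c), size=n_keep, replace=False)
    for c in classes
])
X_train_bal, y_train_bal = X_train_scaled[keep_idx], y_train_toy[keep_idx]

print(np.unique(y_train_bal, return_counts=True))
print(X_test_scaled.shape)  # the test set keeps all of its rows
```

Resampling only the training data keeps the test-set metrics representative of the original class distribution.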

## Modeling

### Load Saved Results
If a results file exists, load it, otherwise set it to `None`.

In [112]:
if os.path.isfile(results_filepath):
    df_results = pd.read_csv(results_filepath, index_col=0, header=[0, 1])
else:
    df_results = None
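
The `header=[0, 1]` argument matters because the results frame has two column levels (model type, then metric). A small round-trip sketch with placeholder numbers, using an in-memory buffer instead of the real results file; `model_a` is a made-up model name:

```python
import io

import pandas as pd

# Small frame with (model, metric) column pairs; the numbers are placeholders.
columns = pd.MultiIndex.from_tuples([('model_a', 'precision'), ('model_a', 'recall')])
df_demo = pd.DataFrame(
    [[0.8, 0.9], [0.7, 0.6]],
    index=['functional', 'non functional'],
    columns=columns,
)

# to_csv writes the two column levels as two header rows.
csv_text = df_demo.to_csv()

# Reading both header rows back restores the two-level columns.
df_roundtrip = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1])
print(df_roundtrip.columns.nlevels)
print(list(df_roundtrip.columns))
```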

In [113]:
df_results

In [114]:
def store_results(y_test, y_pred, model_type, df=None):
    """Add per-class precision and recall for model_type to the results frame."""
    # Drop any earlier results for this model so a rerun replaces them.
    if df is not None and model_type in df.columns:
        df = df.drop(model_type, level=0, axis=1)

    results = classification_report(y_test, y_pred, output_dict=True)
    df_results = pd.DataFrame(results).T
    df_results.drop(columns=['f1-score', 'support'], inplace=True)
    df_results.drop(['accuracy', 'macro avg', 'weighted avg'], inplace=True)

    # Label the columns as (model_type, metric) pairs.
    multi_columns = [(model_type, x) for x in df_results.columns]
    df_results.columns = pd.MultiIndex.from_tuples(multi_columns)

    if df is None:
        return df_results
    df_results = pd.concat([df, df_results], axis=1)
    df_results.sort_index(axis=1, level=0, inplace=True)
    return df_results

### Baseline Model
Compute the baseline model's results and store them in the results data frame.

In [115]:
model_type = 'base_line'

In [116]:
df_results = store_results(y_test, y_pred_base, model_type, df=df_results)

In [117]:
df_results

| | base_line precision | base_line recall |
|---|---|---|
| functional | 0.767559 | 0.911198 |
| functional needs repair | 0.632258 | 0.151938 |
| non functional | 0.787948 | 0.659432 |


### Logistic Regression

In [118]:
model_type = f'logreg_{sampling_type}'

In [119]:
logreg_rs = LogisticRegression(solver='saga', multi_class='multinomial', max_iter=10000)
rs_logreg_params = {'C': np.arange(0.2, 2.4, 0.4), 'penalty': ['l1', 'l2']}
rs_logreg = RandomizedSearchCV(logreg_rs, rs_logreg_params, random_state=42, n_jobs=-1)
rs_logreg.fit(X_train_pipe, y_train_pipe)

RandomizedSearchCV(estimator=LogisticRegression(max_iter=10000,
                                                multi_class='multinomial',
                                                solver='saga'),
                   n_jobs=-1,
                   param_distributions={'C': array([0.2, 0.6, 1. , 1.4, 1.8, 2.2]),
                                        'penalty': ['l1', 'l2']},
                   random_state=42)

In [120]:
best_C = rs_logreg.best_estimator_.get_params()['C']
best_penalty = rs_logreg.best_estimator_.get_params()['penalty']
print(f'The best value for C is {best_C:0.3f}.')
print(f'The best penalty is {best_penalty}.')

The best value for C is 1.800.
The best penalty is l2.
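
For context on how much of the search space the randomized search covers, the candidate combinations can be enumerated directly. This sketch only counts candidates and does not fit any model. `RandomizedSearchCV` samples `n_iter=10` candidates by default, so with the settings above it tries most, but not all, of the combinations.

```python
from itertools import product

import numpy as np

C_values = np.arange(0.2, 2.4, 0.4)
penalties = ['l1', 'l2']

# Every (C, penalty) pair the randomized search could draw from.
candidates = list(product(C_values, penalties))

print(np.round(C_values, 1))
print(len(candidates))
```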


In [121]:
logreg_best = LogisticRegression(solver='saga', multi_class='multinomial', C=best_C, penalty=best_penalty, max_iter=10000)
logreg_best.fit(X_train_pipe, y_train_pipe)

LogisticRegression(C=1.8000000000000003, max_iter=10000,
                   multi_class='multinomial', solver='saga')

In [122]:
y_pred_logreg_best = logreg_best.predict(X_test_pipe)

In [123]:
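# Note: y_test holds the original three-class labels. When binary is True the
# predictions contain only two classes, so y_test_pipe (the collapsed labels
# returned by prepare_data) is probably the intended argument here. The same
# applies to the random forest and XGBoost cells below.
df_results = store_results(y_test, y_pred_logreg_best, model_type, df=df_results)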


In [124]:
df_results

| | base_line precision | base_line recall | logreg_under precision | logreg_under recall |
|---|---|---|---|---|
| functional | 0.767559 | 0.911198 | 0.815215 | 0.803328 |
| functional needs repair | 0.632258 | 0.151938 | 0.0 | 0.0 |
| non functional | 0.787948 | 0.659432 | 0.635373 | 0.781998 |
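
The 0.0 entries in the `functional needs repair` row are consistent with scoring two-class predictions against three-class true labels: a class that is never predicted has zero recall, and its precision is undefined (scikit-learn reports 0 and emits a warning). A small hand-computed illustration with made-up labels follows; `precision_recall` is a hypothetical helper, not part of scikit-learn.

```python
def precision_recall(y_true, y_pred, label):
    """Return (precision, recall) for one label, using 0.0 when a denominator is zero."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# True labels keep three classes; predictions come from a two-class model.
y_true = ['functional', 'functional needs repair', 'non functional', 'functional']
y_pred = ['functional', 'non functional', 'non functional', 'functional']

for label in sorted(set(y_true)):
    print(label, precision_recall(y_true, y_pred, label))
```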


### Random Forest

In [125]:
model_type = f'rf_{sampling_type}'

In [126]:
rf_rs = RandomForestClassifier(random_state = 42, n_jobs=-1)

In [127]:
max_depth_list = list(np.arange(10, 110, 10))
max_depth_list.append(None)

In [128]:
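rs_rf_params = {
    'bootstrap': [True, False],
    'max_depth': max_depth_list,
    # For classifiers 'auto' meant the same as 'sqrt'; later scikit-learn releases removed 'auto'.
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': list(np.arange(1, 11, 1)),
    # Note: scikit-learn requires an integer min_samples_split >= 2, so candidates that draw 1 fail to fit.
    'min_samples_split': list(np.arange(1, 11, 1)),
    'n_estimators': list(np.arange(200, 2200, 200))
}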

In [129]:
rs_rf = RandomizedSearchCV(rf_rs, rs_rf_params, random_state=42, n_jobs=-1)
rs_rf.fit(X_train_pipe, y_train_pipe)

RandomizedSearchCV(estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 3, 4, 5, 6,
                                                             7, 8, 9, 10],
                                        'min_samples_split': [1, 2, 3, 4, 5, 6,
                                                              7, 8, 9, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42)

In [130]:
print('The best hyperparameters for random forest are:')
print(rs_rf.best_params_)

The best hyperparameters for random forest are:
{'n_estimators': 1800, 'min_samples_split': 6, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 90, 'bootstrap': False}


In [131]:
rf_best = RandomForestClassifier(
    # Use twice the tuned number of trees for the final model.
    n_estimators=rs_rf.best_params_['n_estimators'] * 2,
    min_samples_split=rs_rf.best_params_['min_samples_split'],
    min_samples_leaf=rs_rf.best_params_['min_samples_leaf'],
    max_features=rs_rf.best_params_['max_features'],
    max_depth=rs_rf.best_params_['max_depth'],
    bootstrap=rs_rf.best_params_['bootstrap'],
    random_state=42,
    n_jobs=-1
)

In [132]:
rf_best.fit(X_train_pipe, y_train_pipe)

RandomForestClassifier(bootstrap=False, max_depth=90, min_samples_leaf=2,
                       min_samples_split=6, n_estimators=3600, n_jobs=-1,
                       random_state=42)
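
An equivalent, more compact way to build the final model is to unpack the tuned parameters and override only the ones that should differ. This is a sketch with a hypothetical `best_params` dictionary standing in for `rs_rf.best_params_`; the values are illustrative, not the tuned ones.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical tuned values standing in for rs_rf.best_params_.
best_params = {
    'n_estimators': 200,
    'min_samples_split': 6,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'max_depth': 90,
    'bootstrap': False,
}

# Start from the tuned values, then override n_estimators with double the tuned count.
final_params = {**best_params, 'n_estimators': best_params['n_estimators'] * 2}
rf_demo = RandomForestClassifier(**final_params, random_state=42, n_jobs=-1)

print(rf_demo.get_params()['n_estimators'])
```

This avoids repeating each parameter name by hand, which helps when the search space changes.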

In [133]:
y_pred_rf = rf_best.predict(X_test_pipe)

In [134]:
df_results = store_results(y_test, y_pred_rf, model_type, df=df_results)


In [135]:
df_results

| | base_line precision | base_line recall | logreg_under precision | logreg_under recall | rf_under precision | rf_under recall |
|---|---|---|---|---|---|---|
| functional | 0.767559 | 0.911198 | 0.815215 | 0.803328 | 0.853146 | 0.82651 |
| functional needs repair | 0.632258 | 0.151938 | 0.0 | 0.0 | 0.0 | 0.0 |
| non functional | 0.787948 | 0.659432 | 0.635373 | 0.781998 | 0.66692 | 0.83977 |


### XGBoost

In [136]:
model_type = f'xgb_{sampling_type}'

In [137]:
class_mapping = {
    'functional': 0,
    'functional needs repair': 1,
    'non functional': 2
}

In [138]:
y_train_encoded = pd.Series(y_train_pipe).replace(class_mapping).values
y_test_encoded = pd.Series(y_test_pipe).replace(class_mapping).values

In [139]:
xgb_rs = XGBClassifier(n_estimators=1000)

In [140]:
rs_params_xgb_over = {
    'max_depth': list(np.arange(1, 7, 2)),
    'min_child_weight': list(np.arange(1, 7, 2)),
    'gamma': [0, 1, 5]
}

# rs_params_xgb_over = {
#     'max_depth': [1],
#     'min_child_weight': [1],
#     'gamma': [0]
# }

In [141]:
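# The search space has only 3 * 3 * 3 = 27 combinations, so scikit-learn
# runs 27 iterations (with a warning) even though n_iter=100 is requested.
rs_xgb = RandomizedSearchCV(xgb_rs, rs_params_xgb_over, random_state=42, n_jobs=-1, n_iter=100)
rs_xgb.fit(X_train_pipe, y_train_encoded)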





RandomizedSearchCV(estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=nan,
                                           monotone_constraints=None,
                                           n_estimators=1000, n_jobs=None,
                                           num_parallel_tree=None,
                                           random_state=None, reg_alpha=None,
                                           reg_lam

In [142]:
print('The best parameters for XGBoost are:')
print(rs_xgb.best_params_)

The best parameters for XGBoost are:
{'min_child_weight': 3, 'max_depth': 5, 'gamma': 1}


In [143]:
xgb_best = XGBClassifier(
    n_estimators=1000,
    max_depth=rs_xgb.best_params_['max_depth'],
    min_child_weight=rs_xgb.best_params_['min_child_weight'],
    gamma=rs_xgb.best_params_['gamma']
)

In [144]:
xgb_best.fit(X_train_pipe, y_train_encoded)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=1, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=3, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=16, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [145]:
y_pred_encoded = xgb_best.predict(X_test_pipe)

In [146]:
reverse_mapping = {
    0:'functional',
    1:'functional needs repair',
    2:'non functional'
}

In [147]:
y_pred_xgb = pd.Series(y_pred_encoded).replace(reverse_mapping).values
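
The encode and decode steps can share a single source of truth by deriving the reverse mapping from `class_mapping`. A minimal sketch of the round trip on a few made-up labels, using `Series.map` in place of `replace`:

```python
import pandas as pd

class_mapping = {'functional': 0, 'functional needs repair': 1, 'non functional': 2}
# Invert the forward mapping rather than typing it out a second time.
reverse_mapping = {code: label for label, code in class_mapping.items()}

labels = pd.Series(['non functional', 'functional', 'functional needs repair'])
encoded = labels.map(class_mapping)     # strings -> integer codes
decoded = encoded.map(reverse_mapping)  # integer codes -> strings

print(encoded.tolist())
print(decoded.tolist() == labels.tolist())
```

Note that `map` turns any label missing from the dictionary into `NaN`, whereas `replace` leaves it unchanged.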

In [148]:
df_results = store_results(y_test, y_pred_xgb, model_type, df=df_results)


### Save Results
Save the results from the pipeline to the results directory.

In [149]:
df_results.to_csv(results_filepath)

### Display Results

In [150]:
def get_results(get_binary=False):
    results_filename = 'pipeline_results.csv'
    if get_binary:
        split_filename = results_filename.split('_')
        results_filename = '_'.join([split_filename[0], 'binary', split_filename[1]])
    results_filepath = f'../results/{results_filename}'
    if os.path.isfile(results_filepath):
        return pd.read_csv(results_filepath, index_col=0, header=[0, 1])
    else:
        return None

In [151]:
df_final_results = get_results(get_binary=False)

In [152]:
df_final_binary_results = get_results(get_binary=True)

### Final Results:

In [153]:
df_final_results

| | base_line precision | base_line recall | logreg_over precision | logreg_over recall | logreg_under precision | logreg_under recall | rf_over precision | rf_over recall | rf_under precision | rf_under recall | xgb_over precision | xgb_over recall | xgb_under precision | xgb_under recall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| functional | 0.767559 | 0.911198 | 0.84921 | 0.663302 | 0.838948 | 0.637876 | 0.834596 | 0.865021 | 0.851116 | 0.648719 | 0.825203 | 0.872873 | 0.852985 | 0.646476 |
| functional needs repair | 0.632258 | 0.151938 | 0.226057 | 0.72093 | 0.218284 | 0.725581 | 0.467372 | 0.410853 | 0.259851 | 0.756589 | 0.480151 | 0.393798 | 0.235776 | 0.75814 |
| non functional | 0.787948 | 0.659432 | 0.736515 | 0.67986 | 0.717764 | 0.66805 | 0.809019 | 0.778806 | 0.715637 | 0.724545 | 0.809524 | 0.759655 | 0.737579 | 0.706033 |
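
To compare models on a single metric, one option is to slice the second column level with `xs`. The sketch below rebuilds three of the model columns from the table above (`base_line`, `rf_under`, `xgb_under`) by hand instead of reading the CSV; `df_demo` is a hypothetical stand-in for `df_final_results`.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['base_line', 'rf_under', 'xgb_under'], ['precision', 'recall']]
)
df_demo = pd.DataFrame(
    [[0.767559, 0.911198, 0.851116, 0.648719, 0.852985, 0.646476],
     [0.632258, 0.151938, 0.259851, 0.756589, 0.235776, 0.758140],
     [0.787948, 0.659432, 0.715637, 0.724545, 0.737579, 0.706033]],
    index=['functional', 'functional needs repair', 'non functional'],
    columns=columns,
)

# Select one metric across all models from the second column level.
recall = df_demo.xs('recall', axis=1, level=1)
print(recall)

# Model with the highest recall for each class.
print(recall.idxmax(axis=1))
```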


### Final Results (Two Classes):

In [154]:
df_final_binary_results

| | base_line precision | base_line recall | logreg_under precision | logreg_under recall | rf_under precision | rf_under recall | xgb_under precision | xgb_under recall |
|---|---|---|---|---|---|---|---|---|
| functional | 0.767559 | 0.911198 | 0.815215 | 0.803328 | 0.853146 | 0.82651 | 0.838447 | 0.827631 |
| functional needs repair | 0.632258 | 0.151938 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| non functional | 0.787948 | 0.659432 | 0.635373 | 0.781998 | 0.66692 | 0.83977 | 0.662334 | 0.813278 |


## Conclusions
### All Classes
* 