# Water Pumps: Modeling
## Business Problem:
Tanzania is a developing country, and access to water is vital for the health of its population. For this reason, it is important that all water pumps work properly. Currently, the only way to monitor a pump's working status is to visit the site in person, which is time-consuming and costly. A more intelligent way to monitor water pump status is therefore desirable.

This project will address the following question: How can the government of Tanzania improve water pump maintenance by knowing the pump functional status in advance?

**Goal:** The client would rather predict that a pump is failing when it is in fact functional than miss a pump that needs attention. This means reducing type II errors (false negatives) for the _non-functional_ and _functional needs repair_ classes. The modeling strategy will therefore focus on improving recall, especially for these two classes.
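
To make the target metric concrete, here is a minimal sketch of per-class recall computed from true and predicted labels. The labels and the `per_class_recall` helper are hypothetical and are not taken from the project data. A type II error for a class is a sample of that class predicted as something else, so the per-class miss rate is one minus recall.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {class: recall}, where recall = true positives / actual members of the class."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / count for label, count in actual.items()}

# Hypothetical labels for illustration only.
y_true = ['functional', 'non functional', 'non functional', 'functional needs repair', 'functional']
y_pred = ['functional', 'functional', 'non functional', 'functional needs repair', 'non functional']

recall = per_class_recall(y_true, y_pred)
miss_rate = {label: 1 - r for label, r in recall.items()}  # type II error rate per class
print(recall)
print(miss_rate)
```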
    
## Import libraries

In [1]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

## Load Train and Test Sets

In [2]:
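def load_train_test():
    """Load the pickled train and test splits from ../data/clean."""
    file_list = ['X_train', 'X_test', 'y_train', 'y_test']
    data_sets = []
    for filename in file_list:
        # Use a context manager so each file handle is closed after loading.
        with open(f'../data/clean/{filename}', 'rb') as f:
            data_sets.append(pickle.load(f))
    return tuple(data_sets)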

In [3]:
X_train, X_test, y_train, y_test = load_train_test()

Load predictions from baseline model.

In [4]:
with open('../data/clean/y_test_base', 'rb') as f:
    y_test_base = pickle.load(f)

## Rescaling
Rescale the features to values between 0 and 1. Since the categorical variables are one-hot-encoded, this puts the continuous variables on the same scale as the 0/1 indicator columns.

In [5]:
scaler = MinMaxScaler().fit(X_train)
X_train_rescaled = scaler.transform(X_train)
X_test_rescaled = scaler.transform(X_test)

Check the value range after rescaling the training data.

In [6]:
X_train_rescaled.min(), X_train_rescaled.max()

(0.0, 1.0)

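Check the value range after rescaling the test data. Because the scaler was fit on the training data only, test values can fall slightly outside the 0 to 1 range.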

In [7]:
X_test_rescaled.min(), X_test_rescaled.max()

(-7.388478523218112e-06, 1.0018616381450016)
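
The test range above extends slightly beyond 0 and 1. That is expected when the minimum and maximum are learned from the training data alone and then reused on the test data. The sketch below reproduces the effect on a single hypothetical feature; `fit_min_max` and `transform_min_max` are illustrative helpers, not part of scikit-learn.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of one feature from the training data."""
    return min(column), max(column)

def transform_min_max(column, lo, hi):
    """Scale values with the training minimum and maximum: (x - lo) / (hi - lo)."""
    return [(x - lo) / (hi - lo) for x in column]

# Hypothetical feature values for illustration only.
train_col = [10.0, 20.0, 30.0]
test_col = [9.0, 25.0, 31.0]

lo, hi = fit_min_max(train_col)
train_scaled = transform_min_max(train_col, lo, hi)
test_scaled = transform_min_max(test_col, lo, hi)
print(train_scaled)  # stays within [0, 1]
print(test_scaled)   # first and last values fall outside [0, 1]
```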

Check the shape of the training and test features.

In [8]:
X_train_rescaled.shape, X_test_rescaled.shape

((21296, 230), (9127, 230))

## Resampling
Counter the class imbalance in the data set by resampling the training data. I will consider both over sampling and under sampling.

### Over Sampling

In [9]:
X_train_over, y_train_over = SMOTE().fit_resample(X_train_rescaled, y_train)

In [10]:
print(pd.Series(y_train_over).value_counts())

functional needs repair    12482
non functional             12482
functional                 12482
dtype: int64
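
SMOTE creates each synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step for a single pair of points. The neighbour search is omitted, and the points and the `interpolate` helper are hypothetical rather than imbalanced-learn's implementation.

```python
import random

def interpolate(sample, neighbor, rng):
    """Return a synthetic point on the line segment between sample and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)
sample = [0.2, 0.5]    # hypothetical minority-class point
neighbor = [0.4, 0.9]  # hypothetical nearest minority-class neighbour
synthetic = interpolate(sample, neighbor, rng)
print(synthetic)
```

Repeating this step until every class matches the majority class gives the balanced counts shown above (12,482 samples per class).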


### Under Sampling

In [11]:
X_train_under, y_train_under = RandomUnderSampler(random_state=42).fit_resample(X_train_rescaled, y_train)

In [12]:
print(pd.Series(y_train_under).value_counts())

functional                 1505
non functional             1505
functional needs repair    1505
dtype: int64
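
Random under sampling works in the opposite direction: it discards samples until every class has as many samples as the smallest class (1,505 here). The sketch below is a simplified stand-in for that idea on hypothetical labels; `random_under_sample` is an illustrative helper, not imbalanced-learn's implementation.

```python
import random
from collections import Counter, defaultdict

def random_under_sample(labels, seed=42):
    """Return sorted indices that keep an equal number of samples per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))  # sample without replacement
    return sorted(kept)

# Hypothetical imbalanced labels for illustration only.
labels = ['functional'] * 6 + ['non functional'] * 4 + ['functional needs repair'] * 2
kept = random_under_sample(labels)
counts = Counter(labels[i] for i in kept)
print(counts)
```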


## Modeling
I will create two sets of models, one for over sampled training sets, and another for under sampled training sets. For each set of models, I will consider the following models:
* Logistic Regression.
* Random Forest.
* XGBoost.

### Models with Over Sampling
#### Logistic Regression
First, I will try a basic logistic regression model to get a baseline for the over sampled data.

In [13]:
logreg_over = LogisticRegression(solver='saga', multi_class='multinomial')
logreg_over.fit(X_train_over, y_train_over)

LogisticRegression(multi_class='multinomial', solver='saga')

In [14]:
y_pred_train_logreg_over = logreg_over.predict(X_train_over)
y_pred_logreg_over = logreg_over.predict(X_test_rescaled)

In [15]:
print(classification_report(y_train_over, y_pred_train_logreg_over))

                         precision    recall  f1-score   support

             functional       0.69      0.66      0.68     12482
functional needs repair       0.69      0.77      0.73     12482
         non functional       0.74      0.69      0.71     12482

               accuracy                           0.70     37446
              macro avg       0.71      0.70      0.70     37446
           weighted avg       0.71      0.70      0.70     37446



In [16]:
print(classification_report(y_test, y_pred_logreg_over))

                         precision    recall  f1-score   support

             functional       0.84      0.66      0.74      5349
functional needs repair       0.23      0.73      0.34       645
         non functional       0.74      0.68      0.70      3133

               accuracy                           0.67      9127
              macro avg       0.60      0.69      0.60      9127
           weighted avg       0.76      0.67      0.70      9127



In [17]:
cr_train = classification_report(y_train_over, y_pred_train_logreg_over, output_dict=True)
df_cr_train = pd.DataFrame(cr_train).T

In [19]:
df_cr_train.drop(columns=['f1-score', 'support'], inplace=True)

In [20]:
df_cr_train.drop(['accuracy', 'macro avg', 'weighted avg'], inplace=True)

In [21]:
df_cr_train

                         precision    recall
functional                0.690854  0.662073
functional needs repair   0.688775  0.765422
non functional            0.738224  0.686829


In [22]:
model_type = 'logreg_over'
multi_columns = [(model_type, x) for x in df_cr_train.columns]
df_cr_train.columns = pd.MultiIndex.from_tuples(multi_columns)

In [23]:
df_cr_train

                        logreg_over
                          precision    recall
functional                 0.690854  0.662073
functional needs repair    0.688775  0.765422
non functional             0.738224  0.686829


**Observations:**
Accuracy drops from 0.70 on the training set to 0.67 on the test set, and recall for the _functional needs repair_ class drops from 0.77 to 0.73, which indicates that overfitting could be a problem. I will try to address this using cross-validation. I will also use a randomized search to identify the best regularization parameter, which could reduce the overfitting.

In [19]:
logreg_over_rs = LogisticRegression(solver='saga', multi_class='multinomial', max_iter=10000)
rs_logreg_params = {'C': np.arange(0.2, 2.4, 0.4), 'penalty': ['l1', 'l2']}
rs_logreg = RandomizedSearchCV(logreg_over_rs, rs_logreg_params, random_state=42, n_jobs=-1)
rs_logreg.fit(X_train_over, y_train_over)

RandomizedSearchCV(estimator=LogisticRegression(max_iter=10000,
                                                multi_class='multinomial',
                                                solver='saga'),
                   n_jobs=-1,
                   param_distributions={'C': array([0.2, 0.6, 1. , 1.4, 1.8, 2.2]),
                                        'penalty': ['l1', 'l2']},
                   random_state=42)

In [20]:
best_C = rs_logreg.best_estimator_.get_params()['C']
best_penalty = rs_logreg.best_estimator_.get_params()['penalty']
print(f'The best value for C is {best_C:0.3f}.')
print(f'The best penalty is {best_penalty}.')

The best value for C is 2.200.
The best penalty is l2.


In [21]:
logreg_over_2 = LogisticRegression(solver='saga', multi_class='multinomial', C=best_C, penalty=best_penalty, max_iter=10000)
logreg_over_2.fit(X_train_over, y_train_over)

LogisticRegression(C=2.2000000000000006, max_iter=10000,
                   multi_class='multinomial', solver='saga')

In [22]:
y_pred_train_logreg_over_2 = logreg_over_2.predict(X_train_over)
y_pred_logreg_over_2 = logreg_over_2.predict(X_test_rescaled)

In [23]:
print(classification_report(y_train_over, y_pred_train_logreg_over_2))

                         precision    recall  f1-score   support

             functional       0.69      0.66      0.68     12482
functional needs repair       0.69      0.77      0.73     12482
         non functional       0.74      0.69      0.71     12482

               accuracy                           0.71     37446
              macro avg       0.71      0.71      0.71     37446
           weighted avg       0.71      0.71      0.71     37446



In [24]:
print(classification_report(y_test, y_pred_logreg_over_2))

                         precision    recall  f1-score   support

             functional       0.85      0.66      0.74      5349
functional needs repair       0.23      0.73      0.35       645
         non functional       0.73      0.68      0.70      3133

               accuracy                           0.67      9127
              macro avg       0.60      0.69      0.60      9127
           weighted avg       0.76      0.67      0.70      9127



**Observations:**
After searching for better parameters using a randomized search, I do not see an improvement in the recall.
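
One possible reason is that `RandomizedSearchCV` was run with its default scoring, which for a classifier is accuracy rather than recall. scikit-learn accepts `scoring='recall_macro'` to rank candidates by the unweighted mean of per-class recall instead. That change was not tried in this notebook and is only a suggestion. The sketch below computes both metrics by hand on hypothetical labels to show how they can disagree.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        members = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(members.count(label) / len(members))
    return sum(recalls) / len(recalls)

# Hypothetical encoded labels: class 1 is rare and always missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

print(accuracy(y_true, y_pred))      # 0.8
print(macro_recall(y_true, y_pred))  # 0.5
```

A model that ignores the rare class still scores well on accuracy, while macro recall penalizes it.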

#### Random Forest
Next, I will try random forest. As with the logistic regression model, I will fit a model with no hyperparameter tuning to get a baseline model.

In [25]:
rf_over = RandomForestClassifier(n_estimators=100, random_state = 42, n_jobs=-1)
rf_over.fit(X_train_over, y_train_over)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [26]:
y_pred_rf_over = rf_over.predict(X_test_rescaled)

In [27]:
print(classification_report(y_test, y_pred_rf_over))

                         precision    recall  f1-score   support

             functional       0.83      0.86      0.85      5349
functional needs repair       0.46      0.42      0.44       645
         non functional       0.81      0.77      0.79      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.69      0.69      9127
           weighted avg       0.80      0.80      0.80      9127



**Observations:** Without any hyperparameter tuning, the results are mixed compared with the logistic regression model. The minority class (_functional needs repair_) shows an improvement in precision (0.23 to 0.46) but a decrease in recall (0.73 to 0.42). On the other hand, the majority class (_functional_) shows an improvement in recall (0.66 to 0.86).

In [28]:
rf_over_2 = RandomForestClassifier(n_estimators=100, random_state = 42, max_features='sqrt', n_jobs=-1)
rf_over_2.fit(X_train_over, y_train_over)

RandomForestClassifier(max_features='sqrt', n_jobs=-1, random_state=42)

In [29]:
y_pred_rf_over_2 = rf_over_2.predict(X_test_rescaled)

In [30]:
print(classification_report(y_test, y_pred_rf_over_2))

                         precision    recall  f1-score   support

             functional       0.83      0.86      0.85      5349
functional needs repair       0.46      0.42      0.44       645
         non functional       0.81      0.77      0.79      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.69      0.69      9127
           weighted avg       0.80      0.80      0.80      9127



In [31]:
rf_over_3 = RandomForestClassifier(n_estimators=100, random_state = 42, max_features='sqrt', max_depth=None, min_samples_split=2, n_jobs=-1)
rf_over_3.fit(X_train_over, y_train_over)

RandomForestClassifier(max_features='sqrt', n_jobs=-1, random_state=42)

In [32]:
y_pred_rf_over_3 = rf_over_3.predict(X_test_rescaled)

In [33]:
print(classification_report(y_test, y_pred_rf_over_3))

                         precision    recall  f1-score   support

             functional       0.83      0.86      0.85      5349
functional needs repair       0.46      0.42      0.44       645
         non functional       0.81      0.77      0.79      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.69      0.69      9127
           weighted avg       0.80      0.80      0.80      9127



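**Observations:** After adjusting the parameters using the [suggestions](https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters) from scikit-learn, there is still no improvement in recall for the minority classes. This is expected: the suggested values (`max_features='sqrt'`, `max_depth=None`, `min_samples_split=2`) are equivalent to the defaults used by the baseline, so the three reports are identical. I will need to do a randomized search with cross-validation to find better hyperparameters.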

In [34]:
rf_over_rs = RandomForestClassifier(random_state = 42, n_jobs=-1)

In [35]:
max_depth_list = list(np.arange(10, 110, 10))
max_depth_list.append(None)

In [36]:
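rs_params_rf_over = {
    'bootstrap': [True, False],
    'max_depth': max_depth_list,
    # For classifiers, 'auto' is equivalent to 'sqrt'; it is deprecated in newer scikit-learn releases.
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': list(np.arange(1, 11, 1)),
    # An integer min_samples_split must be at least 2.
    'min_samples_split': list(np.arange(2, 11, 1)),
    'n_estimators': list(np.arange(200, 2200, 200))
}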
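
To see why a randomized search is a practical choice here, it helps to count the candidates an exhaustive grid search would have to fit. The sketch below mirrors the search space above using plain `range` objects. It assumes `min_samples_split` covers 2 to 10, because scikit-learn does not accept an integer value below 2.

```python
from math import prod

# Number of candidate values per hyperparameter, mirroring the search space above.
# Assumption: min_samples_split ranges over 2..10.
n_values = {
    'bootstrap': 2,
    'max_depth': len(range(10, 110, 10)) + 1,  # ten depths plus None
    'max_features': 2,
    'min_samples_leaf': len(range(1, 11)),
    'min_samples_split': len(range(2, 11)),
    'n_estimators': len(range(200, 2200, 200)),
}

n_combinations = prod(n_values.values())
print(n_combinations)  # 39600
```

With the default `n_iter=10`, `RandomizedSearchCV` samples only ten of these combinations, each fitted once per cross-validation fold.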

In [37]:
# rs_rf_over = RandomizedSearchCV(rf_over_rs, rs_params_rf_over, random_state=42, n_jobs=-1)
# rs_rf_over.fit(X_train_over, y_train_over)

In [38]:
# rs_rf_over.best_params_

In [39]:
rf_over_4 = RandomForestClassifier(n_estimators=1600, 
                                   random_state = 42, 
                                   max_features='auto', 
                                   max_depth=90, 
                                   min_samples_split=3, 
                                   min_samples_leaf=1,
                                   bootstrap=True,
                                   n_jobs=-1)
rf_over_4.fit(X_train_over, y_train_over)

RandomForestClassifier(max_depth=90, min_samples_split=3, n_estimators=1600,
                       n_jobs=-1, random_state=42)

In [40]:
y_pred_rf_over_4 = rf_over_4.predict(X_test_rescaled)

In [41]:
print(classification_report(y_test, y_pred_rf_over_4))

                         precision    recall  f1-score   support

             functional       0.84      0.86      0.85      5349
functional needs repair       0.47      0.42      0.44       645
         non functional       0.81      0.78      0.79      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.69      0.70      9127
           weighted avg       0.80      0.80      0.80      9127



**Observations:** I do not see any large improvement in either recall or precision after tuning the hyperparameters.

#### XGBoost
I will now try an XGBoost algorithm with the over sampled data.

In [44]:
np.unique(y_train_over)

array(['functional', 'functional needs repair', 'non functional'],
      dtype=object)

In [20]:
class_mapping = {
    'functional': 0,
    'functional needs repair': 1,
    'non functional': 2
}

In [21]:
# Map each class name to its integer code.
y_train_over_encoded = pd.Series(y_train_over).map(class_mapping).values
y_test_encoded = pd.Series(y_test).map(class_mapping).values

In [48]:
np.unique(y_train_over_encoded)

array([0, 1, 2])

In [53]:
np.unique(y_test_encoded)

array([0, 1, 2])
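
Because the XGBoost reports below list the classes as 0, 1 and 2, it can be handy to translate encoded predictions back to class names. A minimal sketch, assuming the same `class_mapping` as above; `inverse_mapping` and the example predictions are hypothetical.

```python
class_mapping = {
    'functional': 0,
    'functional needs repair': 1,
    'non functional': 2,
}
# Invert the mapping to translate encoded predictions back to class names.
inverse_mapping = {code: name for name, code in class_mapping.items()}

encoded_predictions = [2, 0, 1]  # hypothetical model output
decoded = [inverse_mapping[code] for code in encoded_predictions]
print(decoded)
```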

In [49]:
xgb_over_1 = XGBClassifier()
xgb_over_1.fit(X_train_over, y_train_over_encoded)
print(xgb_over_1)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=16, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


In [50]:
y_pred_xgb_over = xgb_over_1.predict(X_test_rescaled)

In [54]:
print(classification_report(y_test_encoded, y_pred_xgb_over))

              precision    recall  f1-score   support

           0       0.82      0.84      0.83      5349
           1       0.38      0.47      0.42       645
           2       0.81      0.73      0.77      3133

    accuracy                           0.78      9127
   macro avg       0.67      0.68      0.67      9127
weighted avg       0.79      0.78      0.78      9127



**Observations:**
Compared with the random forest models, the XGBoost algorithm performs slightly better in recall on the minority class (0.47 vs. 0.42), but worse in precision (0.38 vs. 0.46). Precision and recall are no better for the other two classes.

Next, I will try two sets of hyperparameters suggested by a couple of blog posts. The first set of hyperparameters is taken from [here](https://towardsdatascience.com/fine-tuning-xgboost-in-python-like-a-boss-b4543ed8b1e).

In [56]:
xgb_over_2 = XGBClassifier(
    learning_rate=0.01,
    n_estimators=1000,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=1,
    gamma=1
)
xgb_over_2.fit(X_train_over, y_train_over_encoded)
print(xgb_over_2)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=1, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.01, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=16, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=0.8,
              tree_method='exact', validate_parameters=1, verbosity=None)


In [57]:
y_pred_xgb_over_2 = xgb_over_2.predict(X_test_rescaled)

In [58]:
print(classification_report(y_test_encoded, y_pred_xgb_over_2))

              precision    recall  f1-score   support

           0       0.82      0.72      0.77      5349
           1       0.26      0.58      0.36       645
           2       0.72      0.68      0.70      3133

    accuracy                           0.70      9127
   macro avg       0.60      0.66      0.61      9127
weighted avg       0.74      0.70      0.72      9127



**Observations:**
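Recall has improved for the _functional needs repair_ class (0.47 to 0.58), but it has dropped for the other two classes, and overall accuracy fell from 0.78 to 0.70.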

Now, I will try a different set of initial hyperparameters suggested in this [article](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/).

In [59]:
xgb_over_3 = XGBClassifier(
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=1,
    n_estimators=1000
)
xgb_over_3.fit(X_train_over, y_train_over_encoded)
print(xgb_over_3)



Parameters: { scale_pos_weight } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=16, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=0.8,
              tree_method='exact', validate_parameters=1, verbosity=None)
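
The warning above appears because `scale_pos_weight` is intended for binary classification and is not used with the multi-class objective. For multi-class problems, one alternative to resampling is to pass per-sample weights to `fit` through its `sample_weight` argument. This was not done in the notebook. The sketch below shows one way such weights could be derived with the common "balanced" heuristic, n_samples / (n_classes * class_count), using hypothetical labels; `balanced_sample_weights` is an illustrative helper.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    class_weight = {label: n_samples / (n_classes * count) for label, count in counts.items()}
    return [class_weight[label] for label in labels]

# Hypothetical encoded labels: 6 samples of class 0, 3 of class 2, 1 of class 1.
labels = [0] * 6 + [2] * 3 + [1]
weights = balanced_sample_weights(labels)
print(sorted(set(weights)))  # the rarest class gets the largest weight
```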


In [60]:
y_pred_xgb_over_3 = xgb_over_3.predict(X_test_rescaled)

In [61]:
print(classification_report(y_test_encoded, y_pred_xgb_over_3))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84      5349
           1       0.46      0.38      0.42       645
           2       0.79      0.77      0.78      3133

    accuracy                           0.80      9127
   macro avg       0.69      0.67      0.68      9127
weighted avg       0.79      0.80      0.79      9127



**Observations:** Recall for the minority class is actually worse (0.38, down from 0.58), but recall for the _non functional_ class is better (0.77, up from 0.68).

##### Hyperparameter Tuning
After trying a few sets of initial hyperparameter values, it is time to tune them to find the optimal set.

The low recall score could be due to over fitting. The XGBoost documentation suggests that over fitting can be reduced by optimizing the hyperparameters _max_depth_, _min_child_weight_ and _gamma_.

First, I will optimize _max_depth_ and _min_child_weight_.

In [66]:
rs_params_xgb_over = {
    'max_depth': list(np.arange(1, 7, 2)),
    'min_child_weight': list(np.arange(1, 7, 2))
}

In [68]:
xgb_over_4 = XGBClassifier(n_estimators=1000)
rs_xgb_over_1 = RandomizedSearchCV(xgb_over_4, rs_params_xgb_over, random_state=42, n_jobs=-1, n_iter=100)
rs_xgb_over_1.fit(X_train_over, y_train_over_encoded)
print(rs_xgb_over_1.best_params_)



{'min_child_weight': 1, 'max_depth': 5}


Next, I will optimize _gamma_.

In [17]:
rs_params_xgb_over_2 = {
    'gamma': [0, 1, 5]
}

In [22]:
xgb_over_5 = XGBClassifier(n_estimators=1000, min_child_weight=1, max_depth=5)
rs_xgb_over_2 = RandomizedSearchCV(xgb_over_5, rs_params_xgb_over_2, random_state=42, n_jobs=-1, n_iter=100)
rs_xgb_over_2.fit(X_train_over, y_train_over_encoded)
print(rs_xgb_over_2.best_params_)



{'gamma': 0}


In [23]:
xgb_over_6 = XGBClassifier(n_estimators=1000, min_child_weight=1, max_depth=5, gamma=0)
xgb_over_6.fit(X_train_over, y_train_over_encoded)
print(xgb_over_6)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=16, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


In [24]:
y_pred_xgb_over_6 = xgb_over_6.predict(X_test_rescaled)

In [25]:
print(classification_report(y_test_encoded, y_pred_xgb_over_6))

              precision    recall  f1-score   support

           0       0.83      0.87      0.85      5349
           1       0.46      0.37      0.41       645
           2       0.81      0.77      0.79      3133

    accuracy                           0.80      9127
   macro avg       0.70      0.67      0.68      9127
weighted avg       0.80      0.80      0.80      9127



Finally, I will increase the number of trees and lower the learning rate, along with using the best hyperparameter values I found for _max_depth_, _min_child_weight_ and _gamma_. A lower learning rate with more trees often generalizes better, so this may improve the model performance.
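
The trade-off between learning rate and number of trees can be illustrated with a deliberately simplified model of boosting: if every round corrects a fixed fraction `learning_rate` of the remaining residual, the fraction left after `n_rounds` is (1 - learning_rate) ** n_rounds. This toy calculation is an assumption for illustration and ignores how real trees fit the data, but it suggests why a rate of 0.01 needs far more rounds than the default of 0.3.

```python
def residual_fraction(learning_rate, n_rounds):
    """Fraction of the initial residual left if each round removes learning_rate of it."""
    return (1 - learning_rate) ** n_rounds

for learning_rate, n_rounds in [(0.3, 100), (0.01, 100), (0.01, 5000)]:
    print(learning_rate, n_rounds, residual_fraction(learning_rate, n_rounds))
```

More rounds at a low learning rate do not guarantee a better model, so the result still has to be checked on held-out data.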

In [26]:
xgb_over_7 = XGBClassifier(n_estimators=5000, min_child_weight=1, max_depth=5, gamma=0, learning_rate=0.01)
xgb_over_7.fit(X_train_over, y_train_over_encoded)
print(xgb_over_7)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.01, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=5000, n_jobs=16, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


In [27]:
y_pred_xgb_over_7 = xgb_over_7.predict(X_test_rescaled)

In [28]:
print(classification_report(y_test_encoded, y_pred_xgb_over_7))

              precision    recall  f1-score   support

           0       0.82      0.85      0.83      5349
           1       0.39      0.47      0.43       645
           2       0.81      0.74      0.77      3133

    accuracy                           0.78      9127
   macro avg       0.67      0.68      0.68      9127
weighted avg       0.79      0.78      0.78      9127

