# Water Pumps: Modeling
## Business Problem:
Tanzania is a developing country, and access to water is very important for the health of the population. For this reason, it is vital that all water pumps are working properly. Currently, the only way to monitor a pump's working status is to visit the site in person, which is time-consuming and costly. A more intelligent way to monitor water pump status is therefore desirable.

This project will address the following question: How can the government of Tanzania improve water pump maintenance by knowing the pump functional status in advance?

**Goal:** The client would rather have a working pump flagged as failing than miss a pump that is actually failing. This means reducing Type II errors (false negatives) for the _non-functional_ and _functional needs repair_ classes. The modeling strategy will therefore focus on improving recall, especially for these two classes.
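Recall for a class is the share of its true members that the model finds, so every failing pump that is labeled as working lowers the recall of its class. A minimal sketch of the per-class calculation, using made-up labels rather than the project data; the helper name `per_class_recall` is hypothetical:

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall}, where recall = correct predictions / actual members of the class."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / count for label, count in actual.items()}


# Toy labels for illustration only (not taken from the pump data).
y_true = ["functional", "functional", "non functional", "non functional",
          "functional needs repair", "functional needs repair"]
y_pred = ["functional", "non functional", "non functional", "functional",
          "functional needs repair", "functional needs repair"]

recalls = per_class_recall(y_true, y_pred)
print(recalls)
```

The `classification_report` calls later in this notebook report the same quantity in the `recall` column.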
    
## Import libraries

In [1]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import auc
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

## Load Train and Test Sets

In [2]:
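def load_train_test():
    """Load the pickled train/test splits from ../data/clean."""
    file_list = ['X_train', 'X_test', 'y_train', 'y_test']
    data_sets = []
    for filename in file_list:
        # Use a context manager so each file handle is closed after loading.
        with open(f'../data/clean/{filename}', 'rb') as f:
            data_sets.append(pickle.load(f))
    return tuple(data_sets)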

In [3]:
X_train, X_test, y_train, y_test = load_train_test()

In [4]:
X_train.shape

(21296, 230)

Load predictions from baseline model.

In [5]:
with open('../data/clean/y_test_base', 'rb') as f:
    y_test_base = pickle.load(f)

In [6]:
y_test_base.shape

(9127,)

## Rescaling
Rescale the features to values between 0 and 1. The one-hot-encoded categorical variables already take only the values 0 and 1, so this puts the continuous variables on the same scale. The scaler is fit on the training set only and then applied to both sets.

In [7]:
scaler = MinMaxScaler().fit(X_train)
X_train_rescaled = scaler.transform(X_train)
X_test_rescaled = scaler.transform(X_test)

In [8]:
X_train_rescaled.min(), X_train_rescaled.max()

(0.0, 1.0)

In [9]:
X_test_rescaled.min(), X_test_rescaled.max()

(-7.388478523218112e-06, 1.0018616381450016)
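The test-set range falls slightly outside [0, 1] because the scaler learns each column's minimum and maximum from the training set only, so test values beyond those bounds map outside the interval. This is expected and keeps test information out of the preprocessing step. A small sketch with toy numbers (not the project data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data: the test set contains values outside the training range.
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[-1.0], [12.0]])

toy_scaler = MinMaxScaler().fit(toy_train)   # learns min=0 and max=10 from train only
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_test_scaled.ravel())  # values below 0 and above 1 are possible
```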

In [10]:
X_train_rescaled.shape, X_test_rescaled.shape

((21296, 230), (9127, 230))

## Resampling
Counter the class imbalance in the data set by resampling the training set. I will consider both over sampling and under sampling.

### Over Sampling

In [11]:
X_train_over, y_train_over = SMOTE().fit_resample(X_train_rescaled, y_train)

In [12]:
print(pd.Series(y_train_over).value_counts())

functional                 12482
functional needs repair    12482
non functional             12482
dtype: int64
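SMOTE balances the classes by creating synthetic minority samples rather than duplicating existing rows: each new point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbors. The sketch below illustrates only that interpolation step with NumPy. It is a simplified illustration, not imbalanced-learn's implementation, and `smote_like_samples` is a hypothetical helper:

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))      # pick a random minority sample
        nn = rng.choice(neighbors[base])          # pick one of its neighbors
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nn] - X_minority[base])
    return synthetic


# Toy minority class with four 2-D points.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_samples(X_min, n_new=5)
print(X_new.shape)
```

Because every synthetic point lies between two existing minority points, the new rows stay inside the region already occupied by that class.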


### Under Sampling

In [13]:
X_train_under, y_train_under = RandomUnderSampler(random_state=42).fit_resample(X_train_rescaled, y_train)

In [14]:
print(pd.Series(y_train_under).value_counts())

non functional             1505
functional needs repair    1505
functional                 1505
dtype: int64
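Random under sampling goes the other way: it discards rows from the larger classes until every class matches the size of the smallest one, which is why each class above ends up with 1,505 rows. A minimal standard-library sketch of the idea (not imbalanced-learn's implementation; `undersample_indices` is a hypothetical helper):

```python
import random
from collections import Counter, defaultdict


def undersample_indices(labels, seed=42):
    """Return sorted row indices that keep an equal number of rows per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_keep = min(len(rows) for rows in by_class.values())  # size of the smallest class
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_keep))  # sample without replacement
    return sorted(kept)


# Toy labels for illustration only.
toy_labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
kept = undersample_indices(toy_labels)
balanced_counts = Counter(toy_labels[i] for i in kept)
print(balanced_counts)
```

The trade-off is that the discarded rows carry information the model never sees, which matters more when the smallest class is very small.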


## Modeling
I will create two sets of models, one for over sampled training sets, and another for under sampled training sets. For each set of models, I will consider the following models:
* Logistic Regression.
* Random Forest.
* XGBoost.

### Models with Over Sampling
#### Logistic Regression

In [15]:
logreg_over = LogisticRegression(solver='saga', multi_class='multinomial')
logreg_over.fit(X_train_over, y_train_over)

LogisticRegression(multi_class='multinomial', solver='saga')

In [16]:
y_pred_train_logreg_over = logreg_over.predict(X_train_over)
y_pred_logreg_over = logreg_over.predict(X_test_rescaled)

In [17]:
print(classification_report(y_train_over, y_pred_train_logreg_over))

                         precision    recall  f1-score   support

             functional       0.69      0.67      0.68     12482
functional needs repair       0.69      0.77      0.73     12482
         non functional       0.74      0.69      0.72     12482

               accuracy                           0.71     37446
              macro avg       0.71      0.71      0.71     37446
           weighted avg       0.71      0.71      0.71     37446



In [18]:
print(classification_report(y_test, y_pred_logreg_over))

                         precision    recall  f1-score   support

             functional       0.85      0.66      0.74      5349
functional needs repair       0.23      0.73      0.35       645
         non functional       0.73      0.68      0.70      3133

               accuracy                           0.67      9127
              macro avg       0.60      0.69      0.60      9127
           weighted avg       0.76      0.67      0.70      9127



**Observations:**
Accuracy drops from 0.71 on the over sampled training set to 0.67 on the test set, which indicates that over fitting could be a problem. I will try to address this using cross-validation. I will also use a randomized search to identify the best regularization parameter, which could reduce the over fitting.

In [21]:
logreg_over_rs = LogisticRegression(solver='saga', multi_class='multinomial', max_iter=10000)
rs_logreg_params = {'C': np.arange(0.2, 2.4, 0.4), 'penalty': ['l1', 'l2']}
rs_logreg = RandomizedSearchCV(logreg_over_rs, rs_logreg_params, random_state=42, n_jobs=-1)
rs_logreg.fit(X_train_over, y_train_over)

RandomizedSearchCV(estimator=LogisticRegression(max_iter=10000,
                                                multi_class='multinomial',
                                                solver='saga'),
                   n_jobs=-1,
                   param_distributions={'C': array([0.2, 0.6, 1. , 1.4, 1.8, 2.2]),
                                        'penalty': ['l1', 'l2']},
                   random_state=42)

In [22]:
best_C = rs_logreg.best_params_['C']
best_penalty = rs_logreg.best_params_['penalty']
print(f'The best value for C is {best_C:0.3f}.')
print(f'The best penalty is {best_penalty}.')

The best value for C is 2.200.
The best penalty is l2.


In [23]:
logreg_over_2 = LogisticRegression(solver='saga', multi_class='multinomial', C=best_C, penalty=best_penalty, max_iter=10000)
logreg_over_2.fit(X_train_over, y_train_over)

LogisticRegression(C=2.2000000000000006, max_iter=10000,
                   multi_class='multinomial', solver='saga')

In [24]:
y_pred_train_logreg_over_2 = logreg_over_2.predict(X_train_over)
y_pred_logreg_over_2 = logreg_over_2.predict(X_test_rescaled)

In [25]:
print(classification_report(y_train_over, y_pred_train_logreg_over_2))

                         precision    recall  f1-score   support

             functional       0.69      0.67      0.68     12482
functional needs repair       0.69      0.77      0.73     12482
         non functional       0.74      0.69      0.72     12482

               accuracy                           0.71     37446
              macro avg       0.71      0.71      0.71     37446
           weighted avg       0.71      0.71      0.71     37446



In [26]:
print(classification_report(y_test, y_pred_logreg_over_2))

                         precision    recall  f1-score   support

             functional       0.85      0.66      0.74      5349
functional needs repair       0.23      0.73      0.35       645
         non functional       0.73      0.68      0.70      3133

               accuracy                           0.67      9127
              macro avg       0.60      0.69      0.60      9127
           weighted avg       0.76      0.67      0.70      9127



**Observations:**
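After searching for better parameters using a randomized search, I do not see an improvement in recall: the classification reports match the untuned model to two decimal places. The selected value, C = 2.2, is the largest value in the search range, so the search may have been limited by that range.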
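One thing that may contribute: when `scoring` is not set, `RandomizedSearchCV` ranks candidates by the estimator's default score, which is accuracy for classifiers, so the search above was not optimizing recall directly. A minimal sketch of a search ranked by macro-averaged recall instead, run on synthetic data from `make_classification` (an assumption for illustration; the candidate values of `C` are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced three-class problem standing in for the pump data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.6, 0.1, 0.3], random_state=42,
)

recall_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.1, 1.0, 10.0]},
    n_iter=3,                  # try all three candidates
    scoring="recall_macro",    # rank by mean per-class recall, not accuracy
    cv=3,
    random_state=42,
)
recall_search.fit(X_demo, y_demo)

print(recall_search.best_params_, round(recall_search.best_score_, 3))
```

Macro averaging gives each class the same weight regardless of its size, which fits the goal of protecting recall on the two smaller classes.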

#### Random Forest

In [27]:
rf_over = RandomForestClassifier(n_estimators=100, random_state = 42, n_jobs=-1)
rf_over.fit(X_train_over, y_train_over)

RandomForestClassifier(n_jobs=-1, random_state=42)

In [28]:
y_pred_rf_over = rf_over.predict(X_test_rescaled)

In [29]:
print(classification_report(y_test, y_pred_rf_over))

                         precision    recall  f1-score   support

             functional       0.83      0.86      0.85      5349
functional needs repair       0.46      0.42      0.44       645
         non functional       0.81      0.77      0.78      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.68      0.69      9127
           weighted avg       0.79      0.80      0.80      9127



**Observations:** Without any hyperparameter tuning, the random forest gives mixed results compared with the logistic regression model. The minority class (_functional needs repair_) shows an improvement in precision (0.23 to 0.46) but a decrease in recall (0.73 to 0.42). On the other hand, the majority class (_functional_) shows an improvement in recall (0.66 to 0.86).

In [31]:
rf_over_2 = RandomForestClassifier(n_estimators=100, random_state = 42, max_features='sqrt', n_jobs=-1)
rf_over_2.fit(X_train_over, y_train_over)

RandomForestClassifier(max_features='sqrt', n_jobs=-1, random_state=42)

In [32]:
y_pred_rf_over_2 = rf_over_2.predict(X_test_rescaled)

In [33]:
print(classification_report(y_test, y_pred_rf_over_2))

                         precision    recall  f1-score   support

             functional       0.83      0.86      0.85      5349
functional needs repair       0.46      0.42      0.44       645
         non functional       0.81      0.77      0.78      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.68      0.69      9127
           weighted avg       0.79      0.80      0.80      9127



In [34]:
rf_over_3 = RandomForestClassifier(n_estimators=100, random_state = 42, max_features='sqrt', max_depth=None, min_samples_split=2, n_jobs=-1)
rf_over_3.fit(X_train_over, y_train_over)

RandomForestClassifier(max_features='sqrt', n_jobs=-1, random_state=42)

In [35]:
y_pred_rf_over_3 = rf_over_3.predict(X_test_rescaled)

In [36]:
print(classification_report(y_test, y_pred_rf_over_3))

                         precision    recall  f1-score   support

             functional       0.83      0.86      0.85      5349
functional needs repair       0.46      0.42      0.44       645
         non functional       0.81      0.77      0.78      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.68      0.69      9127
           weighted avg       0.79      0.80      0.80      9127



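**Observations:** After setting the parameters to the values [suggested](https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters) by scikit-learn, there is still no improvement in recall for the minority classes. This is expected: `max_features='sqrt'`, `max_depth=None`, and `min_samples_split=2` are equivalent to the defaults already used by `rf_over`, so all three models are configured identically. I will need to do a randomized search with cross-validation to find better hyperparameters.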
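A quick way to check whether explicitly passed values actually change a model is to compare them with the estimator's defaults via `get_params()`. A minimal sketch (the variable names are illustrative, and `max_features` is left out because the spelling of its default differs between scikit-learn versions):

```python
from sklearn.ensemble import RandomForestClassifier

# Parameters of an untouched estimator, i.e. the library defaults.
defaults = RandomForestClassifier().get_params()

# Values passed explicitly to rf_over_3 in the cell above.
explicit = {"n_estimators": 100, "max_depth": None, "min_samples_split": 2}

# Keep only the settings that differ from the defaults.
changed = {name: value for name, value in explicit.items() if defaults[name] != value}
print(changed)  # an empty dict means these settings match the defaults
```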

In [37]:
rf_over_rs = RandomForestClassifier(random_state = 42, n_jobs=-1)

In [38]:
max_depth_list = list(np.arange(10, 110, 10))
max_depth_list.append(None)

In [41]:
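rs_params_rf_over = {
    'bootstrap': [True, False],
    'max_depth': max_depth_list,
    # For classifiers, 'auto' is equivalent to 'sqrt'; newer scikit-learn releases deprecate and then remove 'auto'.
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': list(np.arange(1, 11, 1)),
    # Integer values of min_samples_split must be at least 2, so candidates that draw 1 are invalid and fail to fit.
    'min_samples_split': list(np.arange(1, 11, 1)),
    'n_estimators': list(np.arange(200, 2200, 200))
}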

In [42]:
rs_rf_over = RandomizedSearchCV(rf_over_rs, rs_params_rf_over, random_state=42, n_jobs=-1)
rs_rf_over.fit(X_train_over, y_train_over)

RandomizedSearchCV(estimator=RandomForestClassifier(n_jobs=-1, random_state=42),
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 3, 4, 5, 6,
                                                             7, 8, 9, 10],
                                        'min_samples_split': [1, 2, 3, 4, 5, 6,
                                                              7, 8, 9, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42)

In [43]:
rs_rf_over.best_params_

{'n_estimators': 800,
 'min_samples_split': 3,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 90,
 'bootstrap': True}
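Because `RandomizedSearchCV` refits the winning configuration on the full training data by default (`refit=True`), the tuned model is available as `best_estimator_`, and `best_params_` can also be unpacked into a new estimator. Either route avoids retyping the values by hand. A minimal sketch on synthetic data with a deliberately tiny search space (both are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic data set so the example runs quickly.
X_small, y_small = make_classification(n_samples=200, n_features=6, random_state=0)

small_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, None]},
    n_iter=3,
    cv=3,
    random_state=42,
)
small_search.fit(X_small, y_small)

# Option 1: use the estimator that the search already refit on all of X_small.
tuned_model = small_search.best_estimator_

# Option 2: build a fresh estimator from the winning parameters.
rebuilt_model = RandomForestClassifier(random_state=42, **small_search.best_params_)
rebuilt_model.fit(X_small, y_small)
```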

In [45]:
# Note: n_estimators is set to 1600 here, although the search above selected 800.
rf_over_4 = RandomForestClassifier(n_estimators=1600, 
                                   random_state = 42, 
                                   max_features='auto', 
                                   max_depth=90, 
                                   min_samples_split=3, 
                                   min_samples_leaf=1,
                                   bootstrap=True,
                                   n_jobs=-1)
rf_over_4.fit(X_train_over, y_train_over)

RandomForestClassifier(max_depth=90, min_samples_split=3, n_estimators=1600,
                       n_jobs=-1, random_state=42)

In [46]:
y_pred_rf_over_4 = rf_over_4.predict(X_test_rescaled)

In [47]:
print(classification_report(y_test, y_pred_rf_over_4))

                         precision    recall  f1-score   support

             functional       0.84      0.87      0.85      5349
functional needs repair       0.46      0.41      0.44       645
         non functional       0.81      0.78      0.80      3133

               accuracy                           0.80      9127
              macro avg       0.70      0.69      0.69      9127
           weighted avg       0.80      0.80      0.80      9127



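**Observations:** After tuning the hyperparameters, I do not see any large improvement in either recall or precision. Recall for _functional needs repair_ is 0.41 (previously 0.42), and recall for _non functional_ is 0.78 (previously 0.77).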