<h1>Predictive Maintenance of Water Pumps in Africa</h1>

This project uses data analysis to predict when a water pump will become non-functional. The dataset comes from drivendata.org. <br>
To start, we import the data into our notebook and run some basic analysis on it.

In [100]:
import pandas as pd
import numpy as np

df_values = pd.read_csv("Training_Set_Values.csv")
df_labels = pd.read_csv("Training_set_labels.csv")

df_values.shape, df_labels.shape


((59400, 40), (59400, 2))

As the result above shows, the dataset has 59,400 observations and 40 columns. <br>
The first column is the ID number of the pump, which we will ignore.<br> 
Of the other 39 features, we will drop some manually.<br>
Possible reasons for dropping a feature are:<br>
* Duplicate/redundant column
* Feature assumed to have little to no correlation with the pump status
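
One way to back up the first reason is to check the columns directly before dropping them. The sketch below is a minimal illustration on a hypothetical toy frame (the values and the helper `is_redundant` are assumptions, not part of the original notebook): it flags pairs of columns that map one-to-one onto each other and columns that hold a single constant value.

```python
import pandas as pd

# Hypothetical toy rows; the real check would run on df_values instead.
toy = pd.DataFrame({
    "payment": ["pay monthly", "never pay", "pay monthly", "pay annually"],
    "payment_type": ["monthly", "never pay", "monthly", "annually"],
    "quantity": ["enough", "dry", "dry", "enough"],
    "recorded_by": ["A", "A", "A", "A"],
})

def is_redundant(df, col_a, col_b):
    """Return True when col_a and col_b map one-to-one onto each other."""
    a_to_b = df.groupby(col_a)[col_b].nunique().max() == 1
    b_to_a = df.groupby(col_b)[col_a].nunique().max() == 1
    return bool(a_to_b and b_to_a)

# Columns with a single distinct value carry no information for a classifier.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

print(is_redundant(toy, "payment", "payment_type"))
print(is_redundant(toy, "payment", "quantity"))
print(constant_cols)
```

A one-to-one pair only needs one of its two columns; whether a weaker overlap justifies a drop is still a judgment call.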



In [101]:
dropped_df = df_values.drop(columns=["longitude", "latitude", "wpt_name", "num_private",
                              "recorded_by", "permit", "payment", "payment_type", 
                              "waterpoint_type_group", "basin", "subvillage", 
                              "region", "source", "extraction_type_group", 
                              "extraction_type_class", "district_code", "quantity_group", "lga", "ward"])

#Include Status in the DataFrame so that any row filtering applied to the features is applied to the labels as well.
dropped_df["Status"] = df_labels["status_group"]
dropped_df.shape

(59400, 22)

After dropping these features, we are left with 20 features (plus the id and Status columns). However, not all of the observations have complete data, so we will have to preprocess the new DataFrame. Some of the possible preprocessing steps are: <br>
* Dropping observations whose missing values cannot be predicted and belong to a feature considered important, e.g. population
* Predicting missing values where that seems feasible
* Leaving the NaN values as they are and using one-hot encoding.<br>

Columns that hold string values, such as funder, installer, extraction type, etc., will be one-hot encoded.<br> For these columns, we will not drop rows that have NaN values, to avoid losing too much data.<br>
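
To make the handling of NaN values concrete, here is a small sketch on a hypothetical toy column (the values are made up for illustration). By default `pd.get_dummies` gives a row with a missing value zeros in every indicator column, so the row is kept; passing `dummy_na=True` adds an explicit indicator column for missing values instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"installer": ["DWE", np.nan, "Gov", "DWE"]})

# Default behaviour: the NaN row is kept, with no indicator set.
encoded = pd.get_dummies(toy, columns=["installer"])

# Alternative: add a dedicated indicator column for missing values.
encoded_with_nan = pd.get_dummies(toy, columns=["installer"], dummy_na=True)

print(encoded)
print(encoded_with_nan)
```

Either way no rows are lost, which matches the goal stated above; the second variant just lets a model tell "missing" apart from "none of the known categories".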


In [102]:
#Drop rows with population = 0
dropped_df1= dropped_df[dropped_df["population"] != 0]
dropped_df1=pd.get_dummies(dropped_df1, columns=["funder", "installer", "public_meeting", 
                                    "scheme_management", "scheme_name", "extraction_type", 
                                    "management", "management_group", "water_quality", 
                                    "quality_group", "quantity", "source_type", 
                                    "source_class", "waterpoint_type"])


After one-hot encoding with the pandas get_dummies function, the dataset looks almost ready for fitting.<br>
However, one feature that can still be a problem is date_recorded, as it is not in a suitable form for fitting.<br>
A convenient way to work with a date feature is to convert it to the datetime64 dtype, which comes with many built-in methods and accessors.<br>


In [103]:
#Used to turn off the warning that occurs because of chained assignment
pd.set_option('mode.chained_assignment', None)

dropped_df1["date_recorded"] =pd.to_datetime(dropped_df1["date_recorded"])
dropped_df1.dtypes["date_recorded"]

dtype('<M8[ns]')

Now that the date_recorded column is in datetime64 format, let's split it into separate year and month columns that will be <br>
fitted into the classifier that we will create later. As the day of the month is unlikely to be as important as the month or year,<br>
we will not use it in this case.

In [104]:
dropped_df1["year"] = dropped_df1["date_recorded"].dt.year
dropped_df1["month"] = dropped_df1["date_recorded"].dt.month
dropped_df1.shape

(38019, 5044)

Now the DataFrame is almost ready. We just need to drop the 3 remaining unnecessary columns, which are:
* id
* date_recorded
* Status (needs to be stored in a series)

In [105]:
target= dropped_df1["Status"]
data=dropped_df1.drop(columns=["date_recorded", "id", "Status"])

Now that all the data is ready to be fitted into a model,<br>
we can begin thinking about the model that we want to use.<br>
<br>
We start by splitting the DataFrame into a training set and a test set.<br>

In [106]:
from sklearn.model_selection import train_test_split

x_train,x_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
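
The split above is neither seeded nor stratified, so the exact numbers change between runs and the small "functional needs repair" class may be unevenly represented. The following is a sketch of a reproducible, stratified variant on synthetic labels; the class counts, `random_state=42`, and the `toy_` names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the Status column.
labels = ["functional", "functional needs repair", "non functional"]
toy_y = pd.Series(np.repeat(labels, [550, 70, 380]))
toy_X = pd.DataFrame({"feature": np.arange(len(toy_y))})

# stratify keeps the class proportions the same in both splits;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

train_share = y_tr.value_counts(normalize=True)["functional needs repair"]
test_share = y_te.value_counts(normalize=True)["functional needs repair"]
print(round(train_share, 3), round(test_share, 3))
```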

In [107]:
# from imblearn.over_sampling import RandomOverSampler

# ros=RandomOverSampler()
# x_resampled, y_resampled = ros.fit_resample(x_train,y_train)

With both the training set and the test set ready, we now have to decide which model and hyperparameters to use.<br>
We will try 3 types of model, each with a small parameter grid, and compare their scores.<br>

<h1>KNeighborsClassifier</h1>


In [108]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

KNN_pipe= Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())])
KNN_grid= {
    "clf__n_neighbors" : np.arange(5,15,2)
}
KNN_clf = GridSearchCV(KNN_pipe, KNN_grid, n_jobs=3, cv=3, scoring="f1_micro")
KNN_clf.fit(x_train,y_train)
y_pred_knn=KNN_clf.predict(x_test)

print(classification_report(y_test, y_pred_knn))


                         precision    recall  f1-score   support

             functional       0.76      0.85      0.81      4194
functional needs repair       0.40      0.23      0.29       508
         non functional       0.78      0.70      0.74      2902

               accuracy                           0.75      7604
              macro avg       0.65      0.60      0.61      7604
           weighted avg       0.74      0.75      0.75      7604



In [109]:
KNN_clf.best_score_

0.7467696162812153

In [110]:
KNN_clf.best_params_

{'clf__n_neighbors': 5}

In [111]:
KNN_clf.cv_results_

{'mean_fit_time': array([4.58695046, 3.88454167, 3.6301988 , 3.72567081, 3.97735898]),
 'std_fit_time': array([0.12743563, 0.2603673 , 0.19799002, 0.30992832, 0.26942537]),
 'mean_score_time': array([42.61925729, 40.90636047, 40.36663397, 41.46009644, 39.52946083]),
 'std_score_time': array([0.29269849, 0.24879807, 0.04957744, 0.28416687, 0.25813879]),
 'param_clf__n_neighbors': masked_array(data=[5, 7, 9, 11, 13],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'clf__n_neighbors': 5},
  {'clf__n_neighbors': 7},
  {'clf__n_neighbors': 9},
  {'clf__n_neighbors': 11},
  {'clf__n_neighbors': 13}],
 'split0_test_score': array([0.74889042, 0.74543841, 0.74050695, 0.73281389, 0.73074268]),
 'split1_test_score': array([0.74590649, 0.73722628, 0.73268889, 0.72943381, 0.72923654]),
 'split2_test_score': array([0.74551194, 0.73959361, 0.73495759, 0.72637601, 0.72223318]),
 'mean_test_score': array([0.74676962, 0.74075276, 0

<h1>LogisticRegression</h1>

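Next is the LogisticRegression classifier, which is essentially a classifier with linear decision boundaries.<br>
The important thing to note here is that, since there are more than 2 possible outcomes (a non-binary problem), a single 0.5 probability threshold no longer applies: 
the classifier estimates a probability for each class and predicts the most probable one. Depending on the scikit-learn version and solver, this is done with a one-vs-rest or a multinomial (softmax) formulation.<br>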
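
As a rough illustration of the multiclass behaviour, the sketch below fits LogisticRegression on synthetic three-class data (generated with `make_classification`, not the pump dataset; `max_iter=1000` is an assumption to avoid convergence warnings). It shows that each row of `predict_proba` sums to 1 and that the predicted class is the one with the highest probability, rather than the result of a single 0.5 threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class problem used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One probability per class for each of the first five rows.
proba = clf.predict_proba(X[:5])

# The predicted label is the class with the highest probability.
pred_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba.round(3))
print(pred_from_proba, clf.predict(X[:5]))
```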


In [112]:
from sklearn.linear_model import LogisticRegression

Logres_pipe= Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
Logres_grid= {
    # Note: np.logspace takes exponents, so this grid spans 10**0.1 to 10**1000 and overflows to inf.
    "clf__C" : np.logspace(0.1,1000, num=5)
}
Logres_clf = GridSearchCV(Logres_pipe, Logres_grid, n_jobs=3, cv=3, scoring="f1_micro")
Logres_clf.fit(x_train,y_train)
y_pred_logres=Logres_clf.predict(x_test)


overflow encountered in power


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
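
The overflow warning above comes from the grid definition: `np.logspace(start, stop)` treats its arguments as exponents of 10, not as the endpoint values. The sketch below reproduces the grid from the cell and contrasts it with a grid running from 0.1 to 1000, which is an assumption about what was intended, not something the original notebook states.

```python
import numpy as np

# Grid as written in the notebook: 10**0.1 up to 10**1000.
# Values beyond the float64 range overflow to inf.
with np.errstate(over="ignore"):
    original_grid = np.logspace(0.1, 1000, num=5)

# Assumed intent: five values of C between 0.1 and 1000,
# i.e. exponents from -1 to 3.
intended_grid = np.logspace(-1, 3, num=5)

print(original_grid)
print(intended_grid)
```

With the original grid, four of the five candidates are effectively "no regularization", which is consistent with their identical scores in `cv_results_`. The convergence warning is a separate issue; as the message itself suggests, raising `max_iter` on LogisticRegression is the usual first step.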






In [113]:
print(classification_report(y_test, y_pred_logres))

                         precision    recall  f1-score   support

             functional       0.77      0.88      0.82      4194
functional needs repair       0.48      0.23      0.31       508
         non functional       0.80      0.72      0.75      2902

               accuracy                           0.77      7604
              macro avg       0.68      0.61      0.63      7604
           weighted avg       0.76      0.77      0.76      7604



In [114]:
Logres_clf.best_score_

0.7687323587547699

In [115]:
Logres_clf.best_params_

{'clf__C': 1.2589254117941673}

In [116]:
Logres_clf.cv_results_

{'mean_fit_time': array([73.99089638, 72.81114594, 74.72217051, 73.23563401, 75.61051059]),
 'std_fit_time': array([0.38986447, 0.67152299, 0.91132576, 1.5790553 , 1.50450102]),
 'mean_score_time': array([1.10713013, 1.25000866, 1.4388001 , 1.54596384, 1.4276433 ]),
 'std_score_time': array([0.2218039 , 0.52474658, 0.69916789, 0.71721255, 0.77255803]),
 'param_clf__C': masked_array(data=[1.2589254117941673, 1.1885022274369873e+250, inf, inf,
                    inf],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'clf__C': 1.2589254117941673},
  {'clf__C': 1.1885022274369873e+250},
  {'clf__C': inf},
  {'clf__C': inf},
  {'clf__C': inf}],
 'split0_test_score': array([0.77404083, 0.77374495, 0.77374495, 0.77374495, 0.77374495]),
 'split1_test_score': array([0.766522  , 0.76701519, 0.76701519, 0.76701519, 0.76701519]),
 'split2_test_score': array([0.76563425, 0.76494378, 0.76494378, 0.76494378, 0.76494378]),
 'mean

<h1>Decision Tree Classifier</h1>

In [117]:
from sklearn.tree import DecisionTreeClassifier

DTC_pipe= Pipeline([('scaler', StandardScaler()), ('clf', DecisionTreeClassifier())])
DTC_grid= {
    # Note: for classifiers "auto" is equivalent to "sqrt"; it was removed in scikit-learn 1.3.
    "clf__max_features" : ["auto", "sqrt", "log2"]
}
DTC_clf = GridSearchCV(DTC_pipe, DTC_grid, n_jobs=3, cv=3, scoring="f1_micro")
DTC_clf.fit(x_train,y_train)
y_pred_dtc=DTC_clf.predict(x_test)


print(classification_report(y_test, y_pred_dtc))

                         precision    recall  f1-score   support

             functional       0.79      0.80      0.79      4194
functional needs repair       0.34      0.33      0.34       508
         non functional       0.75      0.74      0.74      2902

               accuracy                           0.74      7604
              macro avg       0.63      0.62      0.63      7604
           weighted avg       0.74      0.74      0.74      7604



In [118]:
DTC_clf.best_score_


0.7375306252234468

In [119]:
DTC_clf.best_params_

{'clf__max_features': 'auto'}

In [120]:
DTC_clf.cv_results_

{'mean_fit_time': array([5.95701464, 5.60126241, 5.79463585]),
 'std_fit_time': array([0.0414389 , 0.02780252, 0.06129084]),
 'mean_score_time': array([1.10058069, 1.09391006, 1.29895878]),
 'std_score_time': array([0.01634289, 0.01821377, 0.01062719]),
 'param_clf__max_features': masked_array(data=['auto', 'sqrt', 'log2'],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'clf__max_features': 'auto'},
  {'clf__max_features': 'sqrt'},
  {'clf__max_features': 'log2'}],
 'split0_test_score': array([0.74356445, 0.74178913, 0.73241937]),
 'split1_test_score': array([0.73199842, 0.72292365, 0.72173999]),
 'split2_test_score': array([0.737029  , 0.74097455, 0.73377392]),
 'mean_test_score': array([0.73753063, 0.73522911, 0.72931109]),
 'std_test_score': array([0.00473512, 0.00870763, 0.00538206]),
 'rank_test_score': array([1, 2, 3])}

<h2> Data Visualization </h2>

First, let us prepare some lists that will be reused several times below.<br>

In [121]:
#Add all the classifier models to a list for future use
from sklearn.metrics import f1_score

model_list= [KNN_clf,Logres_clf,DTC_clf]
model_name_list = ["kNearestNeighbors", "Logistic Regression", "Decision Tree Classifier"]
score_list=[]
label_list=["functional", "functional needs repair","non functional"]

In [122]:
y_pred_knn=KNN_clf.predict(x_test)
y_pred_logres=Logres_clf.predict(x_test)

In [123]:
score_list.append(round(f1_score(y_test,y_pred_knn, average='micro'),4))
score_list.append(round(f1_score(y_test,y_pred_logres, average='micro'),4))
score_list.append(round(f1_score(y_test,y_pred_dtc, average='micro'),4))

In [124]:
score_list

[0.7542, 0.7722, 0.7449]
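
The scores above are micro-averaged F1 values. For single-label multiclass predictions, micro-averaged F1 coincides with plain accuracy, which is why they agree with the accuracy rows of the classification reports. A minimal check on hypothetical toy labels (the label lists below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels for ten pumps.
toy_true = (["functional"] * 4
            + ["functional needs repair"] * 2
            + ["non functional"] * 4)
toy_pred = ["functional", "functional", "functional", "non functional",
            "functional", "functional needs repair",
            "non functional", "non functional", "non functional", "functional"]

micro_f1 = f1_score(toy_true, toy_pred, average="micro")
accuracy = accuracy_score(toy_true, toy_pred)

# Both metrics count the same thing here: the share of correct predictions.
print(micro_f1, accuracy)
```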

Now let us compare the scores of the classifier models side by side using a bar chart.<br>
We will be using the plotly module for visualization<br>

In [125]:
import plotly.express as px
import plotly.graph_objects as go

fig1= px.bar(y=model_name_list, x=score_list, hover_name=score_list, width=800, height=750)
fig1.update_layout(title_text="Classifier Score", title_x=0.5)
fig1.update_xaxes(title_text="Accuracy Score")
fig1.update_yaxes(title_text="Classification Model")

In [126]:
from sklearn.metrics import f1_score

f1_score(y_test,y_pred_knn, average='micro')

0.7542083114150447

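The bar chart shows the score of each model (micro-averaged F1, which equals accuracy for single-label multiclass predictions). Next we visualize a confusion matrix for each model,<br>
which allows us to compare, for each true class, the proportion of observations assigned to each predicted label.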
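
To show how a row-normalized confusion matrix is read, here is a small sketch on hypothetical toy labels (the label lists are made up for illustration). With `normalize="true"`, each row sums to 1 and the diagonal holds the per-class recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

toy_labels = ["functional", "functional needs repair", "non functional"]

# Hypothetical true and predicted labels for ten pumps.
toy_true = (["functional"] * 4
            + ["functional needs repair"] * 2
            + ["non functional"] * 4)
toy_pred = ["functional", "functional", "functional", "non functional",
            "functional", "functional needs repair",
            "non functional", "non functional", "non functional", "functional"]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels, normalize="true")

# Recall per class, in the same label order as the matrix.
per_class_recall = recall_score(toy_true, toy_pred, labels=toy_labels, average=None)

print(cm)
print(per_class_recall)
```

Off-diagonal entries in a row show where the missed observations of that class ended up, which is the part a single score hides.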

In [127]:
from sklearn.metrics import confusion_matrix

cf1=confusion_matrix(y_test, y_pred_knn, normalize="true", labels=label_list)
cf2=confusion_matrix(y_test, y_pred_logres, normalize="true", labels=label_list)
cf3=confusion_matrix(y_test, y_pred_dtc, normalize="true", labels=label_list)

In [128]:
fig2 =go.Figure()
for cf in [cf1,cf2,cf3]:
    fig2.add_trace(go.Heatmap(
        x=label_list,y=label_list,z=cf, hoverinfo="text", hovertext=np.round(cf,5)
    ))
dropdown_buttons = [  
    {'label': 'KNN', 'method': 'update','args': [{'visible': [True, False, False]}, {'title': 'KNearestNeighbors classification Matrix'}]},  
    {'label': 'LogisticRegression', 'method': 'update','args': [{'visible': [False, True, False]}, {'title': 'LogisticRegression classification Matrix'}]},  
    {'label': "DecisionTreeClassifier", 'method': "update",'args': [{"visible": [False, False, True]}, {'title': 'DecisionTreeClassifier classification Matrix'}]}
    ]


fig2.update_xaxes(title_text="Predicted Label")
fig2.update_yaxes(title_text="True Label")
fig2.update_layout(title_text="KNearestNeighbors classification Matrix", width=800, height=600)
fig2.data[1].visible=False
fig2.data[2].visible=False
fig2.update_layout(updatemenus=[{'type': "dropdown",'x': 1.25,'y': 1.18,'showactive': True,'active': 0,'buttons': dropdown_buttons}], title_x=0.5)
fig2.show()