<h1>Predictive Maintenance of Water Pumps in Africa</h1>

This project uses data analysis to predict whether a water pump is non-functional. The dataset comes from drivendata.org. <br>
To start, we import the data into the notebook and run some basic checks on it.

In [1]:
import pandas as pd
import numpy as np

df_values = pd.read_csv("Training_Set_Values.csv")
df_labels = pd.read_csv("Training_set_labels.csv")

df_values.shape, df_labels.shape


((59400, 40), (59400, 2))

As the result above shows, the dataset has 59400 observations and 40 columns. <br>
The first column is the ID number of the pump, which we will ignore.<br>
Of the other 39 features, we will drop some manually.<br>
Possible reasons for dropping a feature are:<br>
* Duplicate/redundant column
* Feature assumed to have little to no correlation with the pump status
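
The two drop reasons above can be checked in code rather than by eye. The sketch below is only an illustration on a small made-up frame: the column names mirror the dataset, but the values and the helper `is_relabelling` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for df_values; the values are made up for illustration.
toy = pd.DataFrame({
    "quantity": ["enough", "dry", "enough", "insufficient"],
    "quantity_group": ["enough", "dry", "enough", "insufficient"],
    "recorded_by": ["X", "X", "X", "X"],
})

def is_relabelling(df, col_a, col_b):
    """True if each value of col_a pairs with exactly one value of col_b and vice versa."""
    pairs = df[[col_a, col_b]].drop_duplicates()
    return bool(pairs[col_a].is_unique and pairs[col_b].is_unique)

# A column that is a one-to-one relabelling of another carries no extra information.
redundant = is_relabelling(toy, "quantity", "quantity_group")

# A column with a single distinct value cannot help a classifier.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
```

On the real data, the same two checks could be run on each candidate pair before deciding which column to drop.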



In [2]:
dropped_df = df_values.drop(columns=["longitude", "latitude", "wpt_name", "num_private",
                              "recorded_by", "permit", "payment", "payment_type", 
                              "waterpoint_type_group", "basin", "subvillage", 
                              "region", "source", "extraction_type_group", 
                              "extraction_type_class", "district_code", "quantity_group", "lga", "ward"])

# Add Status to the DataFrame so that any row filtering applied to the df is applied to the labels as well.
dropped_df["Status"] = df_labels["status_group"]
dropped_df.shape

(59400, 22)

After dropping these columns, we are left with 20 features (22 columns including id and Status). However, not all observations have complete data, so the new df needs some preprocessing. Possible steps are: <br>
* Dropping observations where an important value, e.g. population, is missing and cannot be predicted
* Predicting missing values where that seems feasible
* Leaving the NaN values as they are and using one-hot encoding.<br>

Columns that hold string values, such as funder, installer, extraction type, etc., will be one-hot encoded.<br> For these columns, we will not drop rows that have NaN values, to avoid losing too much data.<br>
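
To make the effect of leaving NaN values in place concrete, here is a minimal sketch on a made-up column (the values are illustrative only). By default `pd.get_dummies` gives a missing value no indicator column at all, while the optional `dummy_na=True` adds an explicit one.

```python
import numpy as np
import pandas as pd

# Made-up installer values; the second row is missing.
toy = pd.DataFrame({"installer": ["DWE", np.nan, "Gov", "DWE"]})

# Default behaviour: the NaN row gets zeros in every indicator column.
default_dummies = pd.get_dummies(toy, columns=["installer"])

# dummy_na=True adds an extra indicator column that flags missing values.
with_na = pd.get_dummies(toy, columns=["installer"], dummy_na=True)
```

Either way no rows are lost; the difference is only whether "missing" is visible to the model as its own category.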


In [3]:
# Drop rows where population is recorded as 0 (treated here as missing)
dropped_df1 = dropped_df[dropped_df["population"] != 0]
# One-hot encode the string-valued columns; NaN values get no indicator column
dropped_df1 = pd.get_dummies(dropped_df1, columns=["funder", "installer", "public_meeting",
                                                   "scheme_management", "scheme_name", "extraction_type",
                                                   "management", "management_group", "water_quality",
                                                   "quality_group", "quantity", "source_type",
                                                   "source_class", "waterpoint_type"])


After one-hot encoding with pandas get_dummies, the dataset looks almost ready for fitting.<br>
However, one feature that can still cause problems is date_recorded, as it is stored as strings, which is not a suitable form for fitting.<br>
A convenient way to work with a date feature is to convert it to the pandas datetime64 dtype, which comes with many built-in methods through the .dt accessor.<br>


In [4]:
#Used to turn off the warning that occurs because of chained assignment
pd.set_option('mode.chained_assignment', None)

dropped_df1["date_recorded"] =pd.to_datetime(dropped_df1["date_recorded"])
dropped_df1.dtypes["date_recorded"]

dtype('<M8[ns]')

Now that the date_recorded column is in datetime64 format, let's split it into separate columns that will be <br>
fitted into the classifier that we will create later. As the day of the month is unlikely to be as important as the month or year,<br>
we will not use the day value in this case.

In [5]:
dropped_df1["year"] = dropped_df1["date_recorded"].dt.year
dropped_df1["month"] = dropped_df1["date_recorded"].dt.month
dropped_df1.shape

(38019, 5044)

Now the dataframe is almost ready. We just need to drop the 3 remaining columns that are not features:
* id
* date_recorded
* Status (stored in a separate Series first, as it is the target)

In [6]:
target= dropped_df1["Status"]
data=dropped_df1.drop(columns=["date_recorded", "id", "Status"])

We will now use LabelEncoder from sklearn to encode the "target" series.<br>

In [7]:
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()
le.fit(target)
encoded_target=le.transform(target)
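
To see which integer each class receives, the fitted encoder can be inspected. The sketch below uses a short hand-written list of labels (assumed to match the strings in status_group) rather than the real target series.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written sample of labels, assumed to match the status_group strings.
labels = ["functional", "non functional", "functional needs repair", "functional"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the sorted class names; the position of a name is its integer code.
mapping = {name: code for code, name in enumerate(le_demo.classes_)}

# inverse_transform turns integer predictions back into readable labels.
decoded = le_demo.inverse_transform(codes)
```

Keeping the fitted encoder around is useful later, because model predictions come back as integers and `inverse_transform` recovers the original class names.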

Now that all the data is ready to be fitted into a model,<br>
we can start thinking about which model to use.<br>
<br>
We start by splitting the dataframe into a training set and a test set.<br>

In [8]:
from sklearn.model_selection import train_test_split

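# Split on the label-encoded target created above, holding out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(data, encoded_target, test_size=0.2, random_state=42)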
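
If the classes are imbalanced, a plain random split can give the test set slightly different class proportions from the training set. One optional variation, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data with made-up class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced 3-class target (proportions are made up).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 540 + [1] * 70 + [2] * 390)

# stratify=y_demo keeps the class proportions the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
```

Comparing `train_share` and `test_share` shows whether the rare class is represented equally in both sets.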

With both the training set and the test set ready, we now have to decide which models and hyperparameters to use.<br>
In this project, I will try out 2 classifiers: LogisticRegression and KNeighborsClassifier.<br>
For each model, we will use GridSearchCV with a grid of 10 to 50 hyperparameter combinations.<br>

<h1>LogisticRegression</h1>

We start with the LogisticRegression classifier, which is a linear classifier: its decision boundaries are linear in the features.<br>
An important thing to note is that, since there are more than 2 possible classes, a single 0.5 probability threshold is not enough. Depending on the scikit-learn version and solver, the classifier uses either a one-vs-rest or a multinomial (softmax) scheme and predicts the class with the highest probability.<br>
<br>
The hyperparameters we will tune for this model are C (the inverse regularization strength) and the penalty type.<br>
We will use 10 logarithmically spaced values of C from 0.01 to 100 and 3 penalty types in the grid search.<br> This gives a total of 30 different hyperparameter combinations.
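
The grid size and the three penalty types can be tried out on a small scale first. The following is a sketch under stated assumptions, not the notebook's actual run: it uses synthetic 3-class data, and it swaps in `solver="saga"` because that solver accepts the l1, l2 and elasticnet penalties. The `l1_ratio=0.5` value for elasticnet is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Ten logarithmically spaced values from 0.01 to 100
c_values = np.geomspace(0.01, 100, 10)
grid = {"clf__C": c_values, "clf__penalty": ["l1", "l2", "elasticnet"]}
n_combinations = len(ParameterGrid(grid))  # 10 values of C times 3 penalties

# Synthetic 3-class data standing in for the pump data (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

train_accuracy = {}
for penalty in ["l1", "l2", "elasticnet"]:
    # l1_ratio is only meaningful for elasticnet
    extra = {"l1_ratio": 0.5} if penalty == "elasticnet" else {}
    clf = LogisticRegression(solver="saga", penalty=penalty, C=1.0, max_iter=2000, **extra)
    train_accuracy[penalty] = clf.fit(X_demo, y_demo).score(X_demo, y_demo)
```

Running a tiny version like this before the full grid search is a cheap way to confirm that every penalty in the grid can actually be fitted with the chosen solver.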


In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

c_values=np.geomspace(0.01,100,10)
logreg_hyperparam= {'clf__C': c_values, 'clf__penalty': ["l1", "l2", "elasticnet"]}
logreg_pipe= Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])

logreg_clf= GridSearchCV(logreg_pipe, logreg_hyperparam, scoring="accuracy")
logreg_clf.fit(x_train,y_train)
logreg_clf.best_score_



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.7763603485122472

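As it turns out, with the default max_iter=100 and after 45 min of waiting, the solver stops before converging. The features are already standardized in the pipeline, so the remaining options are to increase the number of iterations, try a different solver, or preprocess the data further, for example by reducing the number of one-hot encoded columns. Note also that the default lbfgs solver only supports the l2 penalty, so the l1 and elasticnet candidates in the grid cannot be fitted with it, and the best score above should come from the l2 candidates only. For now we will set LogisticRegression aside and try a completely different type of classifier.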
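
One possible form of further preprocessing, which this notebook does not apply, is to merge rarely seen categories into a single "other" value before one-hot encoding, so that columns such as funder or installer produce far fewer indicator columns. The helper `collapse_rare` and the threshold below are hypothetical, and the values are made up.

```python
import pandas as pd

def collapse_rare(series, min_count):
    """Replace categories seen fewer than min_count times with 'other'; NaN stays NaN."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), "other")

# Made-up installer values: two frequent categories, two one-off categories, one missing.
toy = pd.Series(["DWE", "DWE", "DWE", "Gov", "Gov", "A", "B", None], name="installer")

collapsed = collapse_rare(toy, min_count=2)
dummies = pd.get_dummies(collapsed, prefix="installer")
```

Here the four original categories shrink to three indicator columns. On the real data the threshold would have to be chosen with care, since collapsing too aggressively can throw away useful signal.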

<h1>KNeighborsClassifier</h1>
The 2nd model we will create is KNeighborsClassifier. For this model,<br>
we will use 5 different values of n_neighbors in the GridSearchCV parameter grid.<br>
The other parameter that we will vary is weights. We will use two weighting schemes in this project:<br>
uniform and distance. In total there are 10 different combinations in the grid.
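
The difference between the two weighting schemes is easiest to see on a tiny hand-made example (the points below are invented for illustration). With uniform weights each of the k neighbours gets one vote; with distance weights each vote is scaled by the inverse of the distance, so a single very close neighbour can outvote two farther ones.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three training points on a line: one of class 0 near the query, two of class 1 farther away.
X_demo = np.array([[0.0], [2.0], [2.1]])
y_demo = np.array([0, 1, 1])
query = np.array([[0.5]])

# Uniform: class 1 wins the vote 2 to 1.
uniform_pred = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_demo, y_demo).predict(query)[0]

# Distance: class 0 has weight 1/0.5 = 2.0, class 1 has 1/1.5 + 1/1.6 (about 1.29), so class 0 wins.
distance_pred = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_demo, y_demo).predict(query)[0]
```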

In [10]:
from sklearn.neighbors import KNeighborsClassifier

KNN_pipe= Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())])
KNN_hyperparam= {'clf__n_neighbors': [3,4,5,6,7], 'clf__weights': ["uniform", "distance"]}

KNN_clf= GridSearchCV(KNN_pipe, KNN_hyperparam, scoring="accuracy")
KNN_clf.fit(x_train,y_train)


GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('clf', KNeighborsClassifier())]),
             param_grid={'clf__n_neighbors': [3, 4, 5, 6, 7],
                         'clf__weights': ['uniform', 'distance']},
             scoring='accuracy')

In [12]:
KNN_clf.best_score_

0.761597895775111
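
Note that best_score_ is the mean cross-validated accuracy on the training data; the held-out test set has not been used yet. A possible next step is sketched below on synthetic data (an assumption, so the numbers it produces say nothing about the pump dataset): read off the winning hyperparameters with `best_params_` and score the refitted best model on the test split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the pump data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())])
param_grid = {"clf__n_neighbors": [3, 4, 5, 6, 7], "clf__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, param_grid, scoring="accuracy")
search.fit(X_tr, y_tr)

best_params = search.best_params_         # winning combination from the grid
cv_accuracy = search.best_score_          # mean cross-validated accuracy on the training split
test_accuracy = search.score(X_te, y_te)  # accuracy of the refitted best model on unseen data
```

In the notebook itself the same last three lines would apply to KNN_clf with x_test and y_test.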