# 6. Modelling
The purpose of this notebook is to build several models and determine which one works best for the project task. As well as trying various models, I will experiment with different features and perform feature engineering to find the best predictors. Because modelling is an iterative process, I will keep exploring the data (EDA) along the way.
In this chapter I am going to build machine learning models to classify whether pumps in Tanzania are functional, non functional, or functional but in need of repair. This is a multiclass problem with three target classes.

In [798]:
#importing necessary modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import roc_auc_score, roc_curve, precision_score, recall_score
from sklearn.metrics import accuracy_score, auc, f1_score,  classification_report
import scipy


In [775]:
#reading our data
modelling_data = pd.read_csv("modelling_data.csv")
#printing the first five rows
modelling_data.head()

Unnamed: 0,status_group,amount_tsh,gps_height,longitude,latitude,basin,region,lga,population,extraction_type_group,...,management_group,payment_type,water_quality,quantity,source,source_class,waterpoint_type,installer,permit,public_meeting
0,functional,50.0,1390,34.938093,-9.856322,Lake Nyasa,Iringa,Ludewa,109,gravity,...,user-group,annually,soft,enough,spring,groundwater,communal standpipe,Roman,False,True
1,functional,0.0,1399,34.698766,-2.147466,Lake Victoria,Mara,Serengeti,280,gravity,...,user-group,never pay,soft,insufficient,rainwater harvesting,surface,communal standpipe,GRUMETI,True,True
2,functional,25.0,686,37.460664,-3.821329,Pangani,Manyara,Simanjiro,250,gravity,...,user-group,per bucket,soft,enough,dam,surface,communal standpipe multiple,World vision,True,True
3,non functional,0.0,263,38.486161,-11.155298,Ruvuma / Southern Coast,Mtwara,Nanyumbu,58,submersible,...,user-group,never pay,soft,dry,machine dbh,groundwater,communal standpipe multiple,UNICEF,True,True
4,functional,0.0,0,31.130847,-1.825359,Lake Victoria,Kagera,Karagwe,0,gravity,...,other,never pay,soft,seasonal,rainwater harvesting,surface,communal standpipe,Artisan,True,True


Summary statistics of our numerical columns

In [776]:
#summary statistics
modelling_data.describe()

Unnamed: 0,amount_tsh,gps_height,longitude,latitude,population
count,59400.0,59400.0,59400.0,59400.0,59400.0
mean,12.748566,668.297239,34.077427,-5.706033,179.909983
std,20.976109,693.11635,6.567432,2.946019,471.482176
min,0.0,-90.0,0.0,-11.64944,0.0
25%,0.0,0.0,33.090347,-8.540621,0.0
50%,0.0,369.0,34.908743,-5.021597,25.0
75%,20.0,1319.25,37.178387,-3.326156,215.0
max,50.0,2770.0,40.345193,-2e-08,30500.0


I will now select the target and the features from our data set so we can start building our models. Our target column is "status_group" and, to begin with, only the numeric columns will be our features. We will also create a mapper to encode our target classes as 0, 1 and 2.

In [777]:
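#mapping the target classes to integers; assigning the result back avoids calling inplace on a column selection
modelling_data["status_group"] = modelling_data["status_group"].map({"functional": 0, "non functional": 1, "functional needs repair": 2})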
#selecting target
y = modelling_data["status_group"]
#selecting features
X = modelling_data.select_dtypes(["float", "int"])
X = X.drop("status_group", axis= 1)
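
As a quick illustration of the target encoding step, here is a minimal sketch on a toy Series. The values are made up and `label_map` is only an illustrative name; the point is that `Series.map` turns any label missing from the mapper into NaN, so it is worth checking for that.

```python
import pandas as pd

# toy labels standing in for modelling_data["status_group"] (made-up values)
status = pd.Series(["functional", "non functional",
                    "functional needs repair", "functional"])

# illustrative mapper, matching the encoding described above
label_map = {"functional": 0, "non functional": 1, "functional needs repair": 2}
encoded = status.map(label_map)

# map() returns NaN for labels that are not in the mapper, so verify none slipped through
assert encoded.notna().all()
print(encoded.tolist())
```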



Our numeric features sit on very different scales (compare longitude with population), so we will standardize them to zero mean and unit variance. This helps logistic regression converge and keeps the regularization penalty comparable across features. Before that, we will split the data into three sets: train, validate and test, so that we can fit on one set, compare models on another and report performance on data the model has never seen. Train will take 70% of the data while validate and test will take 15% each.

In [778]:
#splitting data into train and combine(val and test)
X_train, X_combined, y_train, y_combined = train_test_split(X, y, train_size= 0.7, random_state= 1)
#splitting combined into validate and test
X_val, X_test, y_val, y_test = train_test_split(X_combined, y_combined, train_size= 0.5, random_state= 1)

To check that we have split our data well, let's look at how the classes are distributed in each split.
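
One way to make this check is to compare the class shares in each split. The sketch below uses made-up data (`X_demo`, `y_demo` are hypothetical stand-ins, not our pump data) and also passes `stratify`, which the split above does not use, to show how the class shares can be kept nearly identical across the three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# hypothetical stand-in data: 1000 rows, three imbalanced classes
y_demo = pd.Series(rng.choice([0, 1, 2], size=1000, p=[0.54, 0.38, 0.08]))
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})

# 70% train, then split the remaining 30% in half for validation and test
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=1, stratify=y_demo)
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, train_size=0.5, random_state=1, stratify=y_rest)

# share of each class per split, one column per split
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "val": y_va.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(shares.round(3))
```

If a plot is preferred, `shares.plot.bar()` draws the same comparison as grouped bars.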

Standardizing our numeric columns and one-hot encoding our categorical feature columns using pipelines. We use pipelines to keep our workflow tidy and to avoid data leakage, since the preprocessing steps are fitted together with the model.

In [779]:
#standardizing numeric columns using StandardScaler()
scaler_pipeline =  Pipeline(steps= [("scaler", StandardScaler())])
#onehotencoding our categorical column
ohe_pipeline = Pipeline(steps=[("ohe", OneHotEncoder(drop= "first"))])
#creating a transformer
transformer = ColumnTransformer(transformers= [
                                ("scaler", scaler_pipeline, [0, 1, 2, 3, 4])],remainder= "passthrough")

## 6.1 Vanilla LogisticRegression Model.
This is a plain logistic regression classifier with no tuning.

In [780]:
#logistic regression pipeline
logistic_pipeline = Pipeline(steps= [
                        ("transformer", transformer),
                        ("logreg", LogisticRegression(max_iter= 200, random_state= 42))
                                ])
#fit our data
logistic_pipeline.fit(X_train, y_train)

In [781]:
#getting the score of our train model
logistic_pipeline.score(X_train, y_train)

0.574050024050024

In [782]:
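#validating our model
#note: fit() retrains the pipeline on the validation set, so the score below is measured on data the model has just seen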
logistic_pipeline.fit(X_val, y_val)
#validation score
logistic_pipeline.score(X_val, y_val)

0.5741863075196408

In [783]:
#testing our model
#note: the pipeline is refitted on the test set here, so this is not a held-out test score
logistic_pipeline.fit(X_test, y_test)
#test score
logistic_pipeline.score(X_test, y_test)

0.5737373737373738
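
Calling `fit` on the validation or test set retrains the pipeline on that data, so a score obtained that way behaves like a training score. A minimal sketch of held-out evaluation, assuming synthetic data from `make_classification` in place of our pump features (all `demo` names are hypothetical), fits once on the training split and only calls `score` on the other two:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic three-class data standing in for the pump features
X_demo, y_demo = make_classification(n_samples=600, n_features=5, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=1)
X_tr, X_rest, y_tr, y_rest = train_test_split(X_demo, y_demo, train_size=0.7, random_state=1)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, train_size=0.5, random_state=1)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=200, random_state=42)),
])
demo_pipeline.fit(X_tr, y_tr)  # the only call to fit

scores = {
    "train": demo_pipeline.score(X_tr, y_tr),
    "val": demo_pipeline.score(X_va, y_va),   # scored without refitting
    "test": demo_pipeline.score(X_te, y_te),  # scored without refitting
}
print(scores)
```

With this pattern, a large gap between the train score and the validation score is a sign of overfitting, which a refit on each split would hide.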

## 6.2 LogisticRegressionCV
This is a tuned logistic regression model that searches for the best regularization strength using cross-validation. Remember that our target classes are imbalanced, which can make the model favour the most frequent class. I will handle the class imbalance using weights that are inversely proportional to our class frequencies, by passing "balanced" to the class_weight parameter.

In [784]:
#building LogisticRegressionCV with balanced classes
logistic_pipeline = Pipeline(steps= [
                                    ("transformer", transformer),
                                    ("logregCV", LogisticRegressionCV(max_iter= 100, cv= 5, random_state=1, class_weight= "balanced"))])
#fit the data
logistic_pipeline.fit(X_train, y_train)
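
To see what "balanced" does, here is a small worked example on made-up labels. scikit-learn computes each weight as n_samples / (n_classes * count of that class), and the sketch compares that formula with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# made-up labels: 6 samples of class 0, 3 of class 1 and 1 of class 2
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_demo)

# the "balanced" formula written out by hand
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# the same weights as computed by scikit-learn
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The rarest class gets the largest weight, so misclassifying it costs the model more during training.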

Fitting our model and getting the score of our model

In [785]:

#train score
logistic_pipeline.score(X_train, y_train)

0.5235209235209235

Validating our model and obtaining the validation score

In [786]:
#fit X_val (this retrains the pipeline on the validation set before scoring it)
logistic_pipeline.fit(X_val, y_val)
#val score
logistic_pipeline.score(X_val, y_val)

0.5108866442199775

Finally we will test our model and also obtain the test score

In [787]:
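#fit the test data (this retrains the pipeline on the test set)
logistic_pipeline.fit(X_test, y_test)
#score; note that this call scores X_val and y_val rather than the test set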
logistic_pipeline.score(X_val, y_val)

0.5129068462401796

In [788]:
def metrics(model, X, y):
    print(f"The score is: {model.score(X, y)}")
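
Because our classes are imbalanced, accuracy alone can hide poor performance on the rare class. The following is a sketch of a fuller helper, assuming a fitted classifier with a `predict` method; `report_metrics` is a hypothetical name and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score


def report_metrics(model, X, y):
    """Print a per-class report and return accuracy and macro F1 (hypothetical helper)."""
    y_pred = model.predict(X)
    results = {
        "accuracy": accuracy_score(y, y_pred),
        # macro F1 averages the per-class F1 scores, so rare classes count equally
        "f1_macro": f1_score(y, y_pred, average="macro"),
    }
    print(classification_report(y, y_pred, zero_division=0))
    return results


# synthetic three-class data used only to exercise the helper
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=1)
demo_model = LogisticRegression(max_iter=200).fit(X_demo, y_demo)
results = report_metrics(demo_model, X_demo, y_demo)
print(results)
```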
    

## 6.3 DecisionTreeClassifier

This classifier recursively partitions the sample space into regions containing similar data points, until each region is close to homogeneous and can be used to predict the class of new data points.
I am going to build a basic tree to see how it performs on our data before tuning it to see whether it improves. I will also add categorical features to the model, since I have been using only numerical columns so far. I will use pipelines here as well.

In [789]:
X_new = modelling_data.drop(["status_group", "basin", "region", "lga", "management", "source_class", "installer"], axis = 1)


In [790]:
#performing train, validate, test split for our new added features
X_train, X_combined, y_train, y_combined = train_test_split(X_new, y, train_size= 0.7, random_state=1)
#val and test split
X_val, X_test, y_val, y_test = train_test_split(X_combined, y_combined, train_size= 0.5, random_state=1)

In [791]:
#using pipeline to encode and fit our model
#scaling our numerical columns and encoding categorical columns
transformer2 = ColumnTransformer(transformers= [
                                ("scaler", scaler_pipeline, [0, 1, 2, 3, 4]),
                                ("ohe", ohe_pipeline, [5, 6, 7, 8, 9, 10, 11, 12, 13])], remainder= "passthrough")

In [792]:
tree_pipeline = Pipeline(steps = [
                        ("transformer2", transformer2),
                        ("tree_clf", DecisionTreeClassifier())
])
#fitting our train values
tree_pipeline.fit(X_train, y_train)

In [793]:
#train score and accuracy
tree_pipeline.score(X_train, y_train)

0.9950456950456951

In [794]:
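#fit our validation set and see the score
#note: the tree is retrained on the validation set here, so the score reflects data it has memorised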
tree_pipeline.fit(X_val, y_val)
#validation score
tree_pipeline.score(X_val, y_val)

0.9962962962962963

In [795]:
#fit our test set (the tree is retrained on the test set before it is scored)
tree_pipeline.fit(X_test, y_test)
#test score
tree_pipeline.score(X_test, y_test)

0.9967452300785634
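
A tree with no depth limit can keep splitting until it memorises whatever data it is fitted on, so a near-perfect score on that same data says little about new pumps. The sketch below, on synthetic noisy data (not our pump data, and with hypothetical names), fits once on a training split and compares the train and validation scores for an unlimited tree and a depth-limited one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic three-class data with 20% label noise, so memorising cannot generalise
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, n_informative=4,
                                     n_classes=3, flip_y=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, train_size=0.7, random_state=1)

results = {}
for label, depth in [("unlimited", None), ("max_depth=5", 5)]:
    # fit on the training split only, then score both splits
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    results[label] = {
        "train": tree.score(X_tr, y_tr),
        "val": tree.score(X_va, y_va),
        "depth": tree.get_depth(),
    }
print(results)
```

The unlimited tree scores perfectly on the data it was trained on, while its validation score shows how much of that is memorisation.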

### Hyperparameter tuned DecisionTree
I am going to tune my model's hyperparameters, such as "max_depth" and "min_samples_leaf" among others, to see whether this improves my model's performance. This is often referred to as searching the hyperparameter space for the optimum values. I'll use combinatoric grid searching, which is probably the most popular method because it performs an exhaustive search of all possible combinations. Grid search works by training a model on the data for each unique combination of parameters and then returning the parameters of the model that performed best. To guard against a lucky or unlucky split, I will use K-fold cross-validation during this step.

In [811]:
#creating dictionary grid
param_grid = {
                "tree_clf__min_samples_split": [2, 3, 4, 5, 6],
                "tree_clf__min_samples_leaf": [1, 2, 3, 4, 5],
                "tree_clf__criterion": ["gini", "entropy"]}
#instantiate GridSearchCV
grid = GridSearchCV(tree_pipeline , param_grid,
                    cv=3, return_train_score= True)
grid.fit(X_train, y_train)
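
The grid above has 5 x 5 x 2 = 50 parameter combinations, and with cv=3 that means 150 fits before the final refit. Once a search has finished, the fitted object exposes the winning combination and the per-combination scores. Here is a sketch on synthetic data with a deliberately smaller grid; `demo_grid` and the other `demo` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# synthetic three-class data standing in for the training split
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_classes=3, random_state=1)

demo_pipeline = Pipeline(steps=[("tree_clf", DecisionTreeClassifier(random_state=1))])
demo_param_grid = {
    "tree_clf__min_samples_leaf": [1, 5],
    "tree_clf__criterion": ["gini", "entropy"],
}
demo_grid = GridSearchCV(demo_pipeline, demo_param_grid, cv=3, return_train_score=True)
demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)           # best combination found
print(round(demo_grid.best_score_, 3))  # its mean cross-validated accuracy

# one row per combination; comparing train and test means helps spot overfitting
results = pd.DataFrame(demo_grid.cv_results_)[
    ["params", "mean_train_score", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score"))
```

Here `mean_test_score` is the average score on the held-out folds, so it is the number to compare across combinations, not the training score.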