## Final Project Submission

Please fill out:
* Student name: Christopher Hollman
* Student pace: self paced
* Scheduled project review date/time: 
* Instructor name: Abhineet Kulkarni
* Blog post URL:


# Project Overview:

The government of Tanzania is seeking insight into the functionality of existing water sources throughout the country. This project aims to use machine learning to identify sources that are in need of repair or replacement in order to ensure a given area continues to have access to clean water. 

In [1]:
#importing necessary tools
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.tree import DecisionTreeClassifier
from xgboost.sklearn import XGBClassifier 
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

In [2]:
training_labels = pd.read_csv('data/training_set_labels.csv')
training_values = pd.read_csv('data/training_set_values.csv')

## Initial Data Exploration:
The target variable for this dataset is currently split into three categories. For our purposes
any source that is not labeled as functional will need attention, eliminating the need to delineate
between the two other categories.

In [3]:
training_labels.status_group.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

In [4]:
#Converting 3 labels into binary categories. 
label_conversions = {
    'functional':0,
    'functional needs repair':1,
    'non functional':1
}
y = training_labels['status_group'].replace(label_conversions) 
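
As a quick check on the conversion above, the sketch below applies the same mapping to a small hand-made Series and then derives the class balance implied by the counts printed earlier. The toy labels are an assumption for illustration only, not rows from the dataset.

```python
import pandas as pd

# Same mapping as in the notebook: anything not 'functional' becomes 1.
label_conversions = {
    'functional': 0,
    'functional needs repair': 1,
    'non functional': 1,
}

# Toy labels for illustration only (not taken from the dataset).
toy_labels = pd.Series(
    ['functional', 'non functional', 'functional needs repair', 'functional']
)
toy_y = toy_labels.map(label_conversions)
print(toy_y.tolist())

# Class balance implied by the value_counts() output shown above.
counts = {'functional': 32259, 'non functional': 22824, 'functional needs repair': 4317}
needs_attention = counts['non functional'] + counts['functional needs repair']
total = sum(counts.values())
print(needs_attention, total, round(needs_attention / total, 3))
```

Roughly 46% of the sources end up in the "needs attention" class, so the binary target is reasonably balanced.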

## Exploration of Predictors:
There are quite a few columns that seem to be very similar to one another, if not outright redundant.
We will drop a number of these to simplify our model.

In [5]:
training_values.dtypes

id                         int64
amount_tsh               float64
date_recorded             object
funder                    object
gps_height                 int64
installer                 object
longitude                float64
latitude                 float64
wpt_name                  object
num_private                int64
basin                     object
subvillage                object
region                    object
region_code                int64
district_code              int64
lga                       object
ward                      object
population                 int64
public_meeting            object
recorded_by               object
scheme_management         object
scheme_name               object
permit                    object
construction_year          int64
extraction_type           object
extraction_type_group     object
extraction_type_class     object
management                object
management_group          object
payment                   object
payment_ty

In [6]:
#These columns are either duplicates or more poorly categorized versions of columns that we kept

unusable_columns = ['id','date_recorded', 'wpt_name', 'num_private', 
                    'subvillage', 'lga', 'ward', 'recorded_by', 'extraction_type_group', 
                    'extraction_type', 'scheme_name', 'management', 'waterpoint_type_group', 
                    'source', 'source_class', 'quantity_group', 'quality_group', 
                    'payment_type', 'latitude', 'longitude']

In [7]:
X_vals = training_values.drop(unusable_columns, axis=1)
X_vals.columns

Index(['amount_tsh', 'funder', 'gps_height', 'installer', 'basin', 'region',
       'region_code', 'district_code', 'population', 'public_meeting',
       'scheme_management', 'permit', 'construction_year',
       'extraction_type_class', 'management_group', 'payment', 'water_quality',
       'quantity', 'source_type', 'waterpoint_type'],
      dtype='object')

### Dealing with missing values:
There are quite a few missing values, and we will deal with each column in a slightly different way. For construction
year, zeros stand in for missing values, so we will take the median of the non-zero years (2000) and replace any zeros with this number. 
For scheme management we will label missing values as 'Unknown'. For funder and installer, missing values will be grouped with the low-frequency entries as 'Other'. I used the 'pad' (forward fill) method for public meeting and permit in order to roughly preserve the ratio of yes/no values.
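
The two less obvious strategies above, the median over non-zero years and 'pad' filling, can be sketched on a toy frame. The values below are made up for illustration, and `toy`, `nonzero_years` and `median_year` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; values are made up.
toy = pd.DataFrame({
    'construction_year': [0, 1995, 2000, 0, 2010],
    'permit': [True, np.nan, False, np.nan, True],
})

# Median computed over the recorded (non-zero) years only.
nonzero_years = toy.loc[toy['construction_year'] != 0, 'construction_year']
median_year = int(nonzero_years.median())
toy['construction_year'] = toy['construction_year'].replace(0, median_year)

# 'pad' is forward fill: each gap takes the last observed value above it.
toy['permit'] = toy['permit'].ffill().astype(int)

print(median_year)
print(toy)
```

One caveat with forward fill: a missing value in the very first row has nothing above it to copy, so it stays missing and the integer conversion would fail.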

In [8]:
X_vals.isna().sum()

amount_tsh                  0
funder                   3635
gps_height                  0
installer                3655
basin                       0
region                      0
region_code                 0
district_code               0
population                  0
public_meeting           3334
scheme_management        3877
permit                   3056
construction_year           0
extraction_type_class       0
management_group            0
payment                     0
water_quality               0
quantity                    0
source_type                 0
waterpoint_type             0
dtype: int64

In [9]:
X_vals['quantity'].value_counts()

enough          33186
insufficient    15129
dry              6246
seasonal         4050
unknown           789
Name: quantity, dtype: int64

In [10]:
#Zeros seem to be okay here, most of our values are quite low, indicating that this is likely a real number
#rather than a numerical placeholder.
X_vals['population'].describe()

count    59400.000000
mean       179.909983
std        471.482176
min          0.000000
25%          0.000000
50%         25.000000
75%        215.000000
max      30500.000000
Name: population, dtype: float64

In [11]:
#Categorizing unknown values as such

X_vals['scheme_management'] = X_vals['scheme_management'].fillna('Unknown')

In [12]:
#padding true/false categories and converting to integers

X_vals['public_meeting'] = X_vals['public_meeting'].ffill().astype(int)

X_vals['permit'] = X_vals['permit'].ffill().astype(int)

### Funder/Installer Columns:
These columns posed a challenge in that they appear to have been entered manually, resulting in
extremely high cardinality and many entries that are close to one another but misspelled or abbreviated.
I limited my categories to the top 30, cleaned those entries up as best I could and categorized the rest as 'Other'.
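
The top-N-plus-'Other' approach described above can be sketched as a small reusable helper. `collapse_rare_categories` is a hypothetical function, not part of pandas, and the toy data is made up for illustration.

```python
import pandas as pd

def collapse_rare_categories(series, replacements, top_n, other_label='Other'):
    """Hypothetical helper: merge known spelling variants, then keep only the top_n labels."""
    cleaned = series.replace(replacements)
    top_labels = cleaned.value_counts().index[:top_n]
    # Values outside the top labels (including missing values) become other_label.
    return cleaned.where(cleaned.isin(top_labels), other_label)

# Toy data for illustration only.
toy_installer = pd.Series(
    ['DWE', 'DW', 'DWE', 'Commu', 'Community', 'Community', 'Rare Co', None]
)
toy_replacements = {'DW': 'DWE', 'Commu': 'Community'}

result = collapse_rare_categories(toy_installer, toy_replacements, top_n=2)
print(result.value_counts())
```

Merging the variants before counting matters: otherwise a frequent label that is split across several spellings may fall out of the top N.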

In [13]:
#replacing unknown construction years with median value

print(X_vals['construction_year'].value_counts())
X_vals['construction_year'] = X_vals['construction_year'].replace(0, 2000)

0       20709
2010     2645
2008     2613
2009     2533
2000     2091
2007     1587
2006     1471
2003     1286
2011     1256
2004     1123
2012     1084
2002     1075
1978     1037
1995     1014
2005     1011
1999      979
1998      966
1990      954
1985      945
1980      811
1996      811
1984      779
1982      744
1994      738
1972      708
1974      676
1997      644
1992      640
1993      608
2001      540
1988      521
1983      488
1975      437
1986      434
1976      414
1970      411
1991      324
1989      316
1987      302
1981      238
1977      202
1979      192
1973      184
2013      176
1971      145
1960      102
1967       88
1963       85
1968       77
1969       59
1964       40
1962       30
1961       21
1965       19
1966       17
Name: construction_year, dtype: int64


In [14]:

top_installers = set(X_vals['installer'].value_counts().index[:30].values)
top_funders = set(X_vals['funder'].value_counts().index[:30].values)
print(top_installers)
print(top_funders)

{'DANIDA', 'KKKT', 'OXFAM', 'DANID', 'WU', 'Government', 'ACRA', 'World vision', 'Hesawa', 'Community', 'WEDECO', 'RWE', 'District Council', 'LGA', 'District council', 'Gover', 'TASAF', 'World Vision', 'TWESA', 'Commu', 'CES', 'SEMA', '0', 'TCRS', 'HESAWA', 'AMREF', 'DWE', 'Dmdd', 'DW', 'Central government'}
{'World Bank', 'Tcrs', 'Rc Church', 'Dhv', 'Netherlands', 'Amref', 'Dwe', 'Fini Water', 'Private Individual', 'Hesawa', 'Hifab', 'Germany Republi', 'District Council', 'Government Of Tanzania', 'Tasaf', 'Lga', 'World Vision', 'Wateraid', 'Oxfam', 'Unicef', 'Danida', 'Adb', '0', 'Rwssp', 'Water', 'Dwsp', 'Norad', 'Isf', 'Kkkt', 'Ministry Of Water'}


In [15]:
#combining similar labels within top 30 and replacing remaining values with 'other'

installer_replace = {
    'Commu':'Community',
    '0':'Unknown',
    'DANID':'DANIDA',
    'District council':'District Council',
    'DW':'DWE',
    'Gov':'Government',
    'Gover':'Government',
    #the data spells this 'Central government', so this key does not match and that label survives below
    'Central Government':'Government',
    'HESAWA':'Hesawa',
    'World vision':'World Vision'
}

X_vals['installer'] = X_vals['installer'].replace(installer_replace)

X_vals['funder'] = X_vals['funder'].replace('0','Unknown')

top_installers = set(X_vals['installer'].value_counts().index[:30].values)
top_funders = set(X_vals['funder'].value_counts().index[:30].values)

In [16]:
#replacing every value outside the top 30 (including missing values) with 'Other'
X_vals['installer'] = X_vals['installer'].where(X_vals['installer'].isin(top_installers), 'Other')

X_vals['funder'] = X_vals['funder'].where(X_vals['funder'].isin(top_funders), 'Other')

In [17]:
X_vals['installer'].value_counts()

Other                         23679
DWE                           17648
Government                     2318
Community                      1613
DANIDA                         1602
Hesawa                         1379
RWE                            1206
District Council                943
KKKT                            898
Unknown                         780
TCRS                            707
World Vision                    678
Central government              622
CES                             610
LGA                             408
WEDECO                          397
TASAF                           396
AMREF                           329
TWESA                           316
WU                              301
Dmdd                            287
ACRA                            278
SEMA                            249
OXFAM                           234
Da                              224
Gove                            222
Idara ya maji                   222
UNICEF                      

In [18]:
X_vals['funder'].value_counts()

Other                     26129
Government Of Tanzania     9084
Danida                     3114
Hesawa                     2202
Rwssp                      1374
World Bank                 1349
Kkkt                       1287
World Vision               1246
Unicef                     1057
Tasaf                       877
District Council            843
Dhv                         829
Private Individual          826
Dwsp                        811
Unknown                     781
Norad                       765
Germany Republi             610
Tcrs                        602
Ministry Of Water           590
Water                       583
Dwe                         484
Netherlands                 470
Hifab                       450
Adb                         448
Lga                         442
Amref                       425
Fini Water                  393
Oxfam                       359
Wateraid                    333
Rc Church                   321
Isf                         316
Name: fu

## Fitting Decision Tree Model:
Our baseline model will be an untuned decision tree. This is a relatively simple yet effective model 
that doesn't require any scaling of our data. 

In [19]:
#Creating dummy variables for categorical data

X_vals = pd.get_dummies(X_vals)
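
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does to a frame that mixes numeric and text columns. The toy frame is an assumption for illustration only.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    'population': [25, 0, 215],
    'basin': ['Rufiji', 'Pangani', 'Rufiji'],
})

# Numeric columns pass through unchanged; each text column becomes
# one indicator column per distinct value, named '<column>_<value>'.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded)
```

Because the encoding here is applied before the train/test split, both splits share the same set of indicator columns.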

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_vals, y, test_size=.25, random_state=0)

In [21]:
# Fitting Decision Tree
clf_dt = DecisionTreeClassifier()
clf_dt.fit(X_train, y_train)

print('Decision Tree Scores: \n', classification_report(y_test, clf_dt.predict(X_test)))


Decision Tree Scores: 
               precision    recall  f1-score   support

           0       0.78      0.82      0.80      8022
           1       0.77      0.74      0.75      6828

    accuracy                           0.78     14850
   macro avg       0.78      0.78      0.78     14850
weighted avg       0.78      0.78      0.78     14850



### Untuned Decision Tree Performance:
The metrics to focus on are precision and recall for 1 values. This model has .77 precision, meaning that
it is correct 77% of the time when it predicts that a water source is in need of repair or replacement. The .74 recall score indicates that this model correctly identifies 74% of all sources that are in our target category. This is a very serviceable baseline model but can likely be improved.
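
The relationship between these metrics can be worked through by hand. The confusion-matrix counts below are hypothetical, chosen only so that the ratios land close to the report above; they are not the model's actual counts.

```python
# Hypothetical confusion-matrix counts for class 1 (needs attention).
true_positives = 148   # flagged, and really in need of attention
false_positives = 45   # flagged, but actually functional
false_negatives = 52   # missed: in need of attention but predicted functional

# Precision: of everything flagged, how much was right?
precision = true_positives / (true_positives + false_positives)
# Recall: of everything that needed flagging, how much was caught?
recall = true_positives / (true_positives + false_negatives)
# F1: harmonic mean of the two.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```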

In [31]:
param_grid_dt = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 4, 6, 8, 10, 15, 20, 25, 30],
    'min_samples_split': [10, 12, 14, 15, 16, 17]
}

gs_tree = GridSearchCV(clf_dt, param_grid_dt, cv=3)
gs_tree.fit(X_train, y_train)

print(gs_tree.best_params_)
print(gs_tree.best_score_)

{'criterion': 'gini', 'max_depth': 20, 'min_samples_split': 17}
0.780628507295174
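
A minimal sketch of two optional variations on the search above, run on synthetic stand-in data rather than the water point dataset. It assumes recall on class 1 is the metric to optimize, and it reuses `best_estimator_` instead of retyping the winning parameters. All `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only; not the water point dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

demo_grid = {'max_depth': [2, 4, 8], 'min_samples_split': [2, 10, 20]}

# scoring='recall' ranks candidates by recall on class 1 instead of the default accuracy.
demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), demo_grid, cv=3, scoring='recall'
)
demo_search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
# using best_params_, so the winning values never need to be copied by hand.
best_tree = demo_search.best_estimator_
print(demo_search.best_params_)
print(best_tree.score(X_te, y_te))
```

Using the refitted estimator directly avoids any mismatch between the parameters the search reports and the ones typed into a follow-up model.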


In [28]:
# Note: min_samples_split=10 here, whereas the grid search output above reports 17
clf_dt_2 = DecisionTreeClassifier(
    criterion='gini', 
    max_depth=20, 
    min_samples_split=10
)
clf_dt_2.fit(X_train, y_train)
print('Decision Tree Scores: \n', classification_report(y_test, clf_dt_2.predict(X_test)))
 

Decision Tree Scores: 
               precision    recall  f1-score   support

           0       0.77      0.87      0.81      8022
           1       0.82      0.69      0.75      6828

    accuracy                           0.79     14850
   macro avg       0.79      0.78      0.78     14850
weighted avg       0.79      0.79      0.78     14850



### Tuned Decision Tree Performance:
With these tuned parameters our model is more precise (82%, up from 77%) when categorizing a source as in need of repair. However,
recall dropped from 74% to 69%, meaning it misses more sources that should be flagged as in need of service. The overall
accuracy of the model improved marginally, from 78% to 79%. 

## Fitting a Random Forest Model:
Given the amount of noise present in our data, it is likely that we can improve our model by fitting a random
forest model made up of a consensus of many simple decision trees. 
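
As a rough illustration of what "consensus" means here, the sketch below combines hypothetical per-tree outputs for a single water point in two ways. The probabilities are made up. scikit-learn's `RandomForestClassifier` uses the first approach: it predicts the class with the highest mean probability across trees rather than counting hard votes.

```python
# Hypothetical per-tree probabilities that one water point needs attention (class 1).
tree_probabilities = [0.9, 0.2, 0.7, 0.6, 0.4]

# Soft voting: average the per-tree probabilities, then threshold at 0.5.
mean_probability = sum(tree_probabilities) / len(tree_probabilities)
soft_vote = int(mean_probability >= 0.5)

# Hard voting: each tree casts one vote and the majority wins.
votes = [int(p >= 0.5) for p in tree_probabilities]
hard_vote = int(sum(votes) > len(votes) / 2)

print(round(mean_probability, 2), soft_vote, hard_vote)
```

Averaging over many trees smooths out the errors any single tree makes on noisy rows, which is the motivation given above.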

In [32]:
#fit random forest 
clf_rf = RandomForestClassifier(random_state=42) 
clf_rf.fit(X_train,y_train)
print('Random Forest Scores \n', classification_report(y_test,clf_rf.predict(X_test)))

Random Forest Scores 
               precision    recall  f1-score   support

           0       0.81      0.85      0.83      8022
           1       0.82      0.76      0.79      6828

    accuracy                           0.81     14850
   macro avg       0.81      0.81      0.81     14850
weighted avg       0.81      0.81      0.81     14850



### Untuned Random Forest Performance:
This more complex model matched the tuned decision tree's precision for our target category (82%) without the drop-off in recall (76%, versus 69%).
Overall accuracy is better as well, at 81%. Already a good improvement here. 

In [34]:
param_grid_rf = {
    'n_estimators': [50, 100, 150],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10],
    'min_samples_split': [6,8,10,12],
    'min_samples_leaf': [1, 3, 5]
}


gs_forest = GridSearchCV(clf_rf, param_grid_rf, cv=3)
gs_forest.fit(X_train, y_train)

print(gs_forest.best_params_)
print(gs_forest.best_estimator_)
print(gs_forest.best_score_)

KeyboardInterrupt: 

In [None]:
clf_rf_2 = RandomForestClassifier(
    n_estimators=150,
    criterion='entropy',
    min_samples_split=6,
    min_samples_leaf=1,
    random_state=42) 
clf_rf_2.fit(X_train, y_train)
print('Tuned Random Forest Scores \n', classification_report(y_test, clf_rf_2.predict(X_test)))

### Tuned Random Forest Performance:

## Fitting XGBoost Model:
XGBoost tends to be one of the more accurate models readily available for classification tasks. It should perform similarly to, or better than, the Random Forest model. 

In [26]:
# XGBoost
clf_xgb = XGBClassifier()
clf_xgb.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [27]:
print('Random Forest Scores \n', classification_report(y_test,clf_rf.predict(X_test)))
print('XG Boost Scores \n', classification_report(y_test,clf_xgb.predict(X_test)))

Random Forest Scores 
               precision    recall  f1-score   support

           0       0.80      0.86      0.83      8022
           1       0.82      0.76      0.79      6828

    accuracy                           0.81     14850
   macro avg       0.81      0.81      0.81     14850
weighted avg       0.81      0.81      0.81     14850

XG Boost Scores 
               precision    recall  f1-score   support

           0       0.78      0.88      0.83      8022
           1       0.84      0.71      0.77      6828

    accuracy                           0.80     14850
   macro avg       0.81      0.80      0.80     14850
weighted avg       0.81      0.80      0.80     14850
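
To compare the models side by side, the sketch below collects the class-1 metrics reported in the classification reports throughout this notebook. Ranking by recall is an assumption made here for illustration, on the reasoning that a missed failing source may be the costlier error. The `reported` dictionary is a hypothetical helper structure.

```python
# Class-1 metrics copied from the classification reports in this notebook.
reported = {
    'Decision tree (untuned)': {'precision': 0.77, 'recall': 0.74, 'f1': 0.75},
    'Decision tree (tuned)': {'precision': 0.82, 'recall': 0.69, 'f1': 0.75},
    'Random forest (untuned)': {'precision': 0.82, 'recall': 0.76, 'f1': 0.79},
    'XGBoost (untuned)': {'precision': 0.84, 'recall': 0.71, 'f1': 0.77},
}

# Assumption: rank by recall on class 1, highest first.
ranked = sorted(reported.items(), key=lambda item: item[1]['recall'], reverse=True)

for name, scores in ranked:
    print(f"{name:<26} P={scores['precision']:.2f} "
          f"R={scores['recall']:.2f} F1={scores['f1']:.2f}")
```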

