I am going to test three machine learning algorithms: a random forest classifier, a gradient boosting classifier, and an AdaBoost classifier.  I chose these because they tend to perform well on classification tasks without excessive computation time, which can be a problem with SVMs.  They also typically need less data to perform well than some other techniques, particularly neural networks.  One of their biggest downsides is that they do not handle NaNs or text data easily, which is why I used naive Bayes on the text columns as a workaround.  Additionally, these algorithms expose feature importances that can be plotted, which could be useful if Taarifa would like a sense of what is likely to predict a breakdown.  I am going to tune the gradient boosting model, since it is often the best performer of the three and was the best performer when using the default sklearn models.
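
To make the naive Bayes workaround concrete, here is a minimal sketch of one way a text column can be turned into numeric features for a tree ensemble. It is an assumption about the general approach, not the preprocessing actually used to build the cleaned CSVs loaded below: the toy strings, the labels, and the names `nb_pipeline` and `text_proba` are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one free-text column and a binary status label.
texts = np.array([
    "district water engineer", "village community", "district council",
    "community volunteers", "water engineer office", "village volunteers",
    "district engineer", "community village group",
])
labels = np.array([
    "functional", "non functional", "functional", "non functional",
    "functional", "non functional", "functional", "non functional",
])

# Bag-of-words counts feed a multinomial naive Bayes classifier.
nb_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Out-of-fold class probabilities: each row is predicted by a model that
# never saw that row, so the new columns do not leak the target.
text_proba = cross_val_predict(
    nb_pipeline, texts, labels, cv=2, method="predict_proba"
)

# One numeric column per class, ready to join onto the other features.
print(text_proba.shape)
```

The resulting probability columns are plain floats, so they can sit alongside the numeric and dummy-encoded columns that the ensemble models consume.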

In [25]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(42)

In [26]:
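# Load the cleaned, dummy-encoded training and test sets.
train_df = pd.read_csv('../data/processed/clean_train_dummies.csv')
test_df = pd.read_csv('../data/processed/clean_test_dummies.csv')
# date_recorded is not used as a feature, so drop it from both frames.
train_df.drop('date_recorded', axis=1, inplace=True)
test_df.drop('date_recorded', axis=1, inplace=True)
train_df.info()
test_df.info()

# Separate the target from the features and convert to NumPy arrays.
train_y = train_df.pop('status_group').values
train_X = train_df.values
test_X = test_df.values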

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 81 columns):
id                                    59400 non-null int64
amount_tsh                            59400 non-null float64
gps_height                            59400 non-null int64
longitude                             59400 non-null float64
latitude                              59400 non-null float64
num_private                           59400 non-null int64
population                            59400 non-null float64
public_meeting                        59400 non-null bool
permit                                59400 non-null bool
construction_year                     59400 non-null float64
status_group                          59400 non-null object
dwe                                   59400 non-null float64
government                            59400 non-null float64
hesawa                                59400 non-null float64
rwe                                   59400 non-nu

In [27]:
# Hold out a validation set (train_test_split keeps 25% by default).
model_train_X, model_valid_X, model_train_y, model_valid_y = train_test_split(train_X, train_y)

In [4]:
model_rf = RandomForestClassifier()
model_rf.fit(model_train_X, model_train_y)
model_rf.score(model_valid_X, model_valid_y)

0.76693602693602692

In [28]:
model_gb = GradientBoostingClassifier(n_estimators=250, max_depth=10, learning_rate=.1)
model_gb.fit(model_train_X, model_train_y)
model_gb.score(model_valid_X, model_valid_y)

0.81528619528619528
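
The hyperparameters above can also be searched systematically with the `GridSearchCV` class imported earlier. The following is a minimal sketch on synthetic data, not the search that produced the values above: the grid, the dataset, and the variable names are assumptions chosen to run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the pump data (assumption: three classes).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# Hypothetical, deliberately small grid so the example finishes quickly.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.1],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42), param_grid, cv=3
)
search.fit(X_train, y_train)

print(search.best_params_)
# GridSearchCV refits the best combination on all of X_train by default,
# so it can be scored on the held-out validation set directly.
print(search.score(X_valid, y_valid))
```

On the real data the same pattern would apply to `model_train_X` and `model_train_y`, although a grid around 250 trees of depth 10 would take far longer to fit than this toy example.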

In [6]:
model_ab = AdaBoostClassifier()
model_ab.fit(model_train_X, model_train_y)
model_ab.score(model_valid_X, model_valid_y)

0.76033670033670031
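
All three fitted models expose a `feature_importances_` attribute, which is what a feature importance plot would be built from. Below is a minimal sketch on synthetic data; the feature names and the dataset are placeholders rather than the pump features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data with placeholder feature names (assumption).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3,
    n_classes=3, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X, y)

# Pair each importance with its feature name and rank from most to least important.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)

# With matplotlib installed, importances.plot.barh() draws the bar chart.
```

In this notebook the names would come from `train_df.columns`, since `status_group` has already been popped off by the time `train_X` is built.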

Gradient boosting validation scores as features were added:
GB w/o construction_year = .745
GB w/ construction_year = .752
GB w/ construction and dummies = .7802
GB w/ construction, dummies, installer = .7804
GB w/ construction, dummies, installer, population = .7858
GB w/ construction, dummies, installer, population, quantity = .8094
GB w/ construction, dummies, installer, population, quantity, payment = .8112
GB w/ construction, dummies, installer, population, quantity, payment = .8152

In [29]:
# Predict the test set with the gradient boosting model and write the submission file.
test_df['status_group'] = model_gb.predict(test_X)
submission = test_df[['id', 'status_group']]
submission.to_csv('../models/submission.csv', index=False)