# Algorithm Experimentation

Here I'm going to play with a bunch of different models and see which ones look promising for further tuning.

In [1]:
import sqlite3
import numpy as np
conn = sqlite3.connect('./sqlite/training_incidents.sqlite')

### Get data into memory in trainable shape


In [2]:
# get data from SQLite into array in memory
all_data = []
db_results = conn.execute("SELECT * from incidents")
for rec in db_results:
    all_data.append(np.array(rec))

In [3]:
# split data into inputs and labels
inputs = []
labels = []
for rec in all_data:
    labels.append(rec[1])
    inputs.append(rec[2:])

all_inputs = np.array(inputs)
all_labels = np.array(labels)

print len(all_inputs)
print len(all_labels)

print all_inputs[0]
print all_labels[0]

162256
162256
[  0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   1.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   9.04255319e-02   2.97619048e-02
   0.00000000e+00   1.49253731e-02   2.10629656e-04   0.00000000e+00
   1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   1.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   1.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   1.00000000e+00
   0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
   7.09467456e-04   

### Split into training and test sets

In [4]:
from sklearn.cross_validation import train_test_split
inputs_train, inputs_test, labels_train, labels_test = train_test_split(all_inputs, all_labels, test_size=0.10, random_state=30)

from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score

### Dummy Regression


In [5]:
from sklearn.dummy import DummyRegressor

clf = DummyRegressor()
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)

print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)



EVS 0.0
MAE 1.56258541876
MSE 4.43610116624
MedAE 1.39117735278
r^2 -3.33379834123e-05
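
The slightly negative r^2 is what a constant predictor should produce on held-out data. A minimal sketch with hypothetical toy numbers, assuming the dummy model predicts the training mean (the default `strategy` of scikit-learn's `DummyRegressor`): the score is exactly 0 only when the test mean equals the training mean, and it drops below 0 otherwise.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels, chosen only to make the arithmetic easy to follow.
train_labels = [1.0, 2.0, 3.0, 4.0]  # mean 2.5
test_labels = [2.0, 3.0, 4.0]        # mean 3.0

# The "model": always predict the mean of the training labels.
train_mean = sum(train_labels) / len(train_labels)

r2_on_train = r2(train_labels, [train_mean] * len(train_labels))
r2_on_test = r2(test_labels, [train_mean] * len(test_labels))

print("r^2 on train: %s" % r2_on_train)
print("r^2 on test:  %s" % r2_on_test)
```

In the notebook the train and test means differ only slightly, which is why the dummy's r^2 is about -3e-05 rather than a large negative number.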


### Naive LinearRegression

In [6]:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)




[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 8.23730469  7.23046875  5.32910156 ...,  7.70507812  7.28320312
  9.21484375]
EVS 0.347285692111
MAE 1.21626071288
MSE 2.89541094851
MedAE 0.937364216181
r^2 0.347285517809


### Ridge Regression

In [7]:
clf = linear_model.Ridge(alpha = 1.0)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 8.2337465   7.22991219  5.27001242 ...,  7.66760418  7.28340093
  9.23120098]
EVS 0.347364112726
MAE 1.21621885287
MSE 2.89506232303
MedAE 0.938697121699
r^2 0.34736410869


### Lasso Regression

In [8]:
clf = linear_model.Lasso(alpha = 0.1)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 7.94187503  7.44677656  7.14776276 ...,  7.9156455   7.45113925
  8.39462487]
EVS 0.154660444835
MAE 1.4200431766
MSE 3.74992400501
MedAE 1.13946967413
r^2 0.154652051569


### ElasticNet Regression

In [9]:
clf = linear_model.ElasticNet(alpha=0.1, l1_ratio=0.85)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 7.92789745  7.48147052  7.01870574 ...,  8.01079787  7.48514401
  8.46332887]
EVS 0.162612066188
MAE 1.41469766203
MSE 3.71464885882
MedAE 1.13659862355
r^2 0.162604152044


### Stochastic Gradient Descent

In [10]:
clf = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001, l1_ratio=0.15, n_iter=50,
                                learning_rate='invscaling', eta0=0.01, power_t=0.25)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 8.28875994  7.26963642  5.36622562 ...,  7.84278183  7.32357312
  9.36043443]
EVS 0.346702814386
MAE 1.21042123882
MSE 2.90106638084
MedAE 0.928586210792
r^2 0.346010609807


### Bayesian Regression


In [11]:
clf = linear_model.BayesianRidge(n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 8.23379852  7.23072882  5.27335338 ...,  7.66901131  7.28458753
  9.23769374]
EVS 0.347298323291
MAE 1.21627237018
MSE 2.89535417793
MedAE 0.939232250163
r^2 0.347298315639


### Passive Aggressive Regression

In [12]:
clf = linear_model.PassiveAggressiveRegressor(C=5.0, n_iter=50, loss='epsilon_insensitive', epsilon=0.5, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 9.65134992  9.78907633  8.63205319 ...,  7.17820072  9.87824068
  9.91837588]
EVS 0.0942227227386
MAE 1.72025464872
MSE 5.40196804218
MedAE 1.35588450982
r^2 -0.217769372351


### RANSAC regression

In [13]:
clf = linear_model.RANSACRegressor(random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 9.45288086  7.08081055  7.45922852 ...,  8.68212891  7.05786133
  9.16699219]
EVS -7.72965385936
MAE 2.35748594294
MSE 38.7751449271
MedAE 1.02439986228
r^2 -7.74110759857


### TheilSenRegressor

In [14]:
clf = linear_model.TheilSenRegressor(random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 8.24306248  7.21379487  5.44141752 ...,  7.42865163  7.29338415
  9.27517088]
EVS 0.282489148493
MAE 1.24444780116
MSE 3.18531519239
MedAE 0.940333868429
r^2 0.281932204654


### DecisionTree Regression

In [15]:
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor(min_samples_leaf=2, min_samples_split=5, random_state=31)

clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[  7.52898194   6.76391424   7.27395938 ...,   8.65234823   8.20657186
  10.24817881]
EVS 0.416083706276
MAE 1.01082551069
MSE 2.59044823332
MedAE 0.618765996789
r^2 0.416033472511


### ExtraTree Regression

In [16]:
from sklearn.tree import ExtraTreeRegressor
clf = ExtraTreeRegressor(min_samples_split=5, min_samples_leaf=2, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 7.60090246  5.98355904  5.37269855 ...,  6.90775528  8.40562142
  9.80759353]
EVS 0.260549402262
MAE 1.2321955925
MSE 3.28019766204
MedAE 0.840093846785
r^2 0.26054278424


### Support Vector Machine Regression

In [17]:
from sklearn import svm
clf = svm.SVR()

# SVM takes forever on the full dataset, so fit on a 10% subsample of the training set
svm_inputs, other_inputs, svm_labels, other_labels = train_test_split(inputs_train, labels_train, test_size=0.90, random_state=30)
clf.fit(svm_inputs, svm_labels)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [18]:
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 8.74553845  7.40230594  5.46685168 ...,  8.32342493  7.4466364   9.5860955 ]
EVS 0.345625791876
MAE 1.19843029269
MSE 2.95048549975
MedAE 0.890304183378
r^2 0.334870023831


### AdaBoost Regression

In [19]:
from sklearn.ensemble import AdaBoostRegressor

clf = AdaBoostRegressor(n_estimators=50, learning_rate=1.0, loss='linear', random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 7.29476505  6.60219495  5.91612847 ...,  5.91612847  6.60219495
  7.88586816]
EVS 0.52086896463
MAE 1.26715586397
MSE 2.63043661006
MedAE 1.02591720935
r^2 0.407018865231
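
Note the gap between EVS (0.52) and r^2 (0.41) here; for most of the other models the two agree closely. By the standard definitions the two scores differ only in whether the mean of the residuals is subtracted first, and MSE = Var(residuals) + (mean residual)^2, so a gap points to predictions that are systematically offset. A back-of-the-envelope sketch using only the numbers printed above (the variable names are mine, and the sign of the offset cannot be recovered this way):

```python
import math

# Scores copied from the AdaBoost output above.
evs = 0.52086896463
r2 = 0.407018865231
mse = 2.63043661006

# r^2 = 1 - MSE / Var(labels)  =>  Var(labels) = MSE / (1 - r^2)
label_variance = mse / (1.0 - r2)

# EVS = 1 - Var(residuals) / Var(labels)
residual_variance = (1.0 - evs) * label_variance

# MSE = Var(residuals) + mean_residual^2
mean_residual_magnitude = math.sqrt(mse - residual_variance)

# Share of the squared error that comes from the constant offset alone.
offset_share = (mse - residual_variance) / mse

print("label variance:    %.3f" % label_variance)
print("residual variance: %.3f" % residual_variance)
print("|mean residual|:   %.2f" % mean_residual_magnitude)
print("offset share:      %.2f" % offset_share)
```

The implied offset is roughly 0.7 in label units, or about a fifth of AdaBoost's MSE, which suggests a meaningful part of its error is a constant shift rather than scatter.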


### Bagging Regression


In [20]:
from sklearn.ensemble import BaggingRegressor

clf = BaggingRegressor(n_estimators=10, n_jobs=-1, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 7.17364127  6.87377547  4.48069071 ...,  8.31286294  8.24400372
  9.37252642]
EVS 0.594771267463
MAE 0.869870997385
MSE 1.8004320005
MedAE 0.552146091786
r^2 0.594127375415


### Extra Trees Ensemble Regression

In [21]:
from sklearn.ensemble import ExtraTreesRegressor

clf = ExtraTreesRegressor(n_estimators=10, min_samples_split=5, min_samples_leaf=2, n_jobs=-1, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 8.24286966  6.55576324  4.9489376  ...,  8.13284527  8.44511369
  9.68104999]
EVS 0.51120645102
MAE 0.995278386013
MSE 2.16837362787
MedAE 0.675744425944
r^2 0.511182041209


### Gradient Boost Regression

In [22]:
from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0,
                                 min_samples_split=5, min_samples_leaf=2, random_state=31, alpha=0.9)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 7.57316049  6.74993082  6.48124198 ...,  7.02860952  7.83939272
  9.3038018 ]
EVS 0.633519637141
MAE 0.845661823286
MSE 1.62569279953
MedAE 0.568967408315
r^2 0.63351895371


### Random Forest Regression

In [23]:
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=10, min_samples_split=5, min_samples_leaf=2, n_jobs=-1, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 7.60090246  6.2146081   7.13089883 ...,  9.61580548  7.52240023
  9.90348755]
[ 7.32236257  6.77827536  5.66432801 ...,  8.65809819  8.07618883
  9.72288033]
EVS 0.618824479808
MAE 0.843130942967
MSE 1.69186743076
MedAE 0.530264030961
r^2 0.618601161064


## Analysis

Lots of useful results here. Let's start with the dummy model. `DummyRegressor` ignores the inputs and, with its default settings, always predicts the mean of the training labels, so it produced predictably bad results:

```
EVS 0.0
MAE 1.56258541876
MSE 4.43610116624
MedAE 1.39117735278
r^2 -3.33379834123e-05
```

That is an r^2 just below zero and a mean absolute error of 1.56, which is the baseline every other model has to beat. The basic naive linear regression produced surprisingly decent results:

```
EVS 0.347285692111
MAE 1.21626071288
MSE 2.89541094851
MedAE 0.937364216181
r^2 0.347285517809
```

An r^2 of about 0.35 is not too bad when we're talking about damages of up to 5-6 digits, especially because precision isn't nearly as important as order of magnitude in this use case.

Among the other non-ensemble methods, Ridge, SGD, Bayesian ridge, and SVR (trained on only a 10% subsample) score almost the same as the naive regression. TheilSen and a single ExtraTree land somewhat lower, and Lasso, ElasticNet, Passive Aggressive, and RANSAC do noticeably worse, at least with these untuned settings. DecisionTree is the one that stands out: its r-squared is > 0.4 and its mean and median absolute errors are quite low compared to the other simple regressions. It's not surprising to me that decision trees work well, given the simple categorical shape of many of the features.

However, even with no parameter tuning yet, the ensemble methods are looking really good right out of the gate. Both Random Forest and Gradient Boost have r-squared scores over 0.6, and Bagging is close behind at 0.59. Based on r^2, Random Forest, Gradient Boost, and Bagging are all candidates for hyperparameter tuning and optimization; we can see from there which ones play out.
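
The shortlist can be reproduced mechanically from the r^2 values reported above. A small sketch: the scores are copied from the cell outputs, while the dictionary, the short model names, and the 0.55 cutoff are my own choices for illustration rather than anything the notebook defines.

```python
# r^2 on the held-out 10% test split, copied from the outputs above.
r2_scores = {
    "Dummy": -3.33379834123e-05,
    "LinearRegression": 0.347285517809,
    "Ridge": 0.34736410869,
    "Lasso": 0.154652051569,
    "ElasticNet": 0.162604152044,
    "SGD": 0.346010609807,
    "BayesianRidge": 0.347298315639,
    "PassiveAggressive": -0.217769372351,
    "RANSAC": -7.74110759857,
    "TheilSen": 0.281932204654,
    "DecisionTree": 0.416033472511,
    "ExtraTree": 0.26054278424,
    "SVR (10% subsample)": 0.334870023831,
    "AdaBoost": 0.407018865231,
    "Bagging": 0.594127375415,
    "ExtraTrees": 0.511182041209,
    "GradientBoosting": 0.63351895371,
    "RandomForest": 0.618601161064,
}

CUTOFF = 0.55  # arbitrary illustrative threshold, not from the notebook

# Sort best-first and keep everything at or above the cutoff.
ranked = sorted(r2_scores.items(), key=lambda item: item[1], reverse=True)
shortlist = [name for name, score in ranked if score >= CUTOFF]

for name, score in ranked:
    marker = "*" if name in shortlist else " "
    print("%s %-20s %8.3f" % (marker, name, score))
```

Ranking on a single test split is only a first pass. The gaps between the top three are small enough that tuning could easily reorder them.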