# Algorithm Experimentation

Here I'm going to play with a bunch of different models and see which ones look promising for further tuning.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import sqlite3
import numpy as np
import pandas as pd
conn = sqlite3.connect('./sqlite/training_incidents.sqlite')

### Get data into memory in trainable shape


In [2]:
# get data from SQLite into array in memory
all_data = []
db_results = conn.execute("SELECT * from incidents")
for rec in db_results:
    all_data.append(np.array(rec))

In [3]:
# split data into inputs and labels
inputs = []
labels = []
for rec in all_data:
    # index 0 is just the db id, we don't need that
    # index 1 is the label; keep only rows above the low-end outlier cutoff
    if rec[1] > 5.0:
        labels.append(rec[1])
        inputs.append(rec[2:])

all_inputs = np.array(inputs)
all_labels = np.array(labels)

print len(all_inputs)
print len(all_labels)
# 162448 in full set,                            r^2 range 0.327
# 158313 after removing low-end outliers < 5,    r^2 range 0.337
# 157531 after removing low-end outliers < 10,   r^2 range 0.336
# 156664 after removing low-end outliers < 25,   r^2 range 0.328
# 154567 after removing low-end outliers < 50,   r^2 range 0.313
# 149192 after removing low-end outliers < 100,  r^2 range 0.313
# 143318 after removing low-end outliers < 250,  r^2 range 0.31
# 128288 after removing low-end outliers < 500,  r^2 range 0.30
# 109824 after removing low-end outliers < 1000, r^2 range 0.27
# 103628 after removing low-end outliers < 1500, r^2 range 0.25
# 56820 after removing low-end outliers < 5000,  r^2 range 0.18
# 35461 after removing low-end outliers < 10000, r^2 range 0.13

158313
158313
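The row counts in the comments above came from re-running that cell with different cutoffs. Here is a minimal sketch of the same sweep done as a SQL query instead. The in-memory table, the `damage` column name, and the sample rows are all made up for illustration, since the real schema of `incidents` isn't shown in this notebook. It is written for Python 3, unlike the cells above.

```python
import sqlite3

# Hypothetical stand-in for the incidents table: (id, damage, one feature).
# `damage` is an assumed name for the label column (index 1 in the notebook).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE incidents (id INTEGER PRIMARY KEY, damage REAL, feature REAL)"
)
rows = [
    (1, 0.0, 1.0), (2, 3.0, 2.0), (3, 40.0, 3.0),
    (4, 800.0, 4.0), (5, 12000.0, 5.0), (6, 30000.0, 6.0),
]
demo.executemany("INSERT INTO incidents VALUES (?, ?, ?)", rows)


def rows_above(conn, cutoff):
    """Count the rows that survive a low-end cutoff on the label."""
    query = "SELECT COUNT(*) FROM incidents WHERE damage > ?"
    return conn.execute(query, (cutoff,)).fetchone()[0]


# Sweep a few cutoffs, the way the comment block does by hand.
counts = {cutoff: rows_above(demo, cutoff) for cutoff in (5, 50, 1000, 10000)}
for cutoff, n in sorted(counts.items()):
    print(cutoff, n)
```

Counting in SQL avoids pulling every row into memory just to see how many a cutoff would drop; the actual training data still has to be loaded the usual way.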


### Split into training and test sets

In [4]:
from sklearn.cross_validation import train_test_split
inputs_train, inputs_test, labels_train, labels_test = train_test_split(all_inputs, all_labels, test_size=0.10, random_state=30)

from sklearn.metrics import explained_variance_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import median_absolute_error
from sklearn.metrics import r2_score
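To make the five scores easier to read, here is a plain-Python restatement of their standard definitions, plus a small `report` helper that prints them in the same order as the cells below. This is a sketch for intuition only (Python 3, standard library), not scikit-learn's implementation, and the helper names are my own.

```python
from statistics import mean, median


def _variance(values):
    """Population variance: mean squared deviation from the mean."""
    m = mean(values)
    return mean([(v - m) ** 2 for v in values])


def report(y_true, y_pred):
    """Compute and print EVS, MAE, MSE, MedAE and r^2 for two sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    mse = mean([e ** 2 for e in errors])
    scores = {
        # 1 - Var(error) / Var(truth): ignores any constant bias
        "EVS": 1.0 - _variance(errors) / _variance(y_true),
        "MAE": mean(abs_errors),
        "MSE": mse,
        "MedAE": median(abs_errors),
        # 1 - MSE / Var(truth): a constant bias is penalized here
        "r^2": 1.0 - mse / _variance(y_true),
    }
    for name in ("EVS", "MAE", "MSE", "MedAE", "r^2"):
        print(name, scores[name])
    return scores


scores = report([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
```

Since MSE equals the error variance plus the squared mean error, EVS and r^2 only differ when the predictions are biased on average. That is worth keeping in mind when the two numbers diverge in a result below.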

### Dummy Regression


In [131]:
from sklearn.dummy import DummyRegressor

clf = DummyRegressor()
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)

print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)



EVS 0.0
MAE 8994.24919087
MSE 160592565.919
MedAE 7324.2225209
r^2 -1.25578303172e-05
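With default settings `DummyRegressor` predicts the mean of the training labels for every row. The sketch below, with made-up labels and plain Python 3, shows why that gives an r^2 at or just below zero: predicting the test set's own mean scores exactly 0, and the training mean is always a little off from that.

```python
from statistics import mean


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative labels only, not taken from the incidents data.
train_labels = [2000.0, 5000.0, 10000.0, 18500.0, 30000.0]
test_labels = [2500.0, 9000.0, 23000.0]

# Constant prediction learned from the training set (what a mean baseline does).
baseline = mean(train_labels)
baseline_score = r2(test_labels, [baseline] * len(test_labels))

# Constant prediction that "cheats" by using the test mean: the r^2 = 0 reference.
oracle_score = r2(test_labels, [mean(test_labels)] * len(test_labels))

print("training-mean baseline r^2:", baseline_score)
print("test-mean reference r^2:", oracle_score)
```

With a large random split the two means are nearly identical, which fits a score as close to zero as the one above.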


### Naive LinearRegression

In [132]:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)


[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  8004.   3042.  12362. ...,  21878.  33852.   5904.]
EVS 0.337502798405
MAE 6771.33527034
MSE 106391864.925
MedAE 3994.0
r^2 0.337496101598


### Ridge Regression

In [133]:
clf = linear_model.Ridge(alpha = 1.0)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  8003.57916827   2911.55516439  12370.11936152 ...,  21896.27040623
  33883.47337838   5933.8220702 ]
EVS 0.3374757206
MAE 6768.88649223
MSE 106396246.012
MedAE 3991.02905013
r^2 0.337468820496


### Lasso Regression

In [134]:
clf = linear_model.Lasso(alpha = 0.1)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  8005.60313718   2912.27966517  12369.41278766 ...,  21877.95079448
  33884.24611259   5932.32082179]
EVS 0.337472809327
MAE 6768.47699638
MSE 106396705.082
MedAE 3990.60742437
r^2 0.337465961861


### ElasticNet Regression

In [135]:
clf = linear_model.ElasticNet(alpha=0.1, l1_ratio=0.85)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  7831.49736691   4391.61459373  10610.21465554 ...,  19309.84392238
  29287.11014411   6599.80727267]
EVS 0.31633636811
MAE 6951.26896467
MSE 109790831.742
MedAE 4283.40607024
r^2 0.31633067914


### Stochastic Gradient Descent

In [136]:
clf = linear_model.SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.001, l1_ratio=0.15, n_iter=50,
                                learning_rate='invscaling', eta0=0.01, power_t=0.25)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  7931.65997832   3185.74547516  12302.94909154 ...,  21293.10545569
  33174.44252656   5515.72341942]
EVS 0.336706197879
MAE 6725.92039159
MSE 106539140.143
MedAE 3911.35367241
r^2 0.336579016382


### Bayesian Regression


In [138]:
clf = linear_model.BayesianRidge(n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  8013.72573099   2916.93669418  12404.31805054 ...,  21910.38131485
  33896.96181085   5934.30404982]
EVS 0.337427187161
MAE 6769.08317424
MSE 106404030.463
MedAE 3990.26945881
r^2 0.337420346588


### Passive Aggressive Regression

In [147]:
clf = linear_model.PassiveAggressiveRegressor(C=5.0, n_iter=50, loss='epsilon_insensitive', epsilon=0.5, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  5277.67733441    938.73392571   7300.24258968 ...,  13631.17092476
  24605.95804115   3648.63878451]
EVS 0.310113593215
MAE 6266.91543719
MSE 119943719.986
MedAE 2711.20730322
r^2 0.253108476519


### RANSAC regression

In [151]:
clf = linear_model.RANSACRegressor(random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[ -1607.75    980.75  12146.75 ...,  34669.    11165.75   9964.75]
EVS -0.0820842407075
MAE 8677.73799899
MSE 174987299.368
MedAE 4909.75
r^2 -0.0896488005356


### TheilSenRegressor

In [152]:
clf = linear_model.TheilSenRegressor(random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  7901.82461543   1490.83642784  10425.69831101 ...,  20724.46848336
  34882.7099878    5773.47003079]
EVS 0.263020817546
MAE 6940.71036693
MSE 118396903.948
MedAE 3945.9399042
r^2 0.262740525513


### DecisionTree Regression

In [11]:
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor(min_samples_leaf=2, min_samples_split=5, random_state=31)

clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  3250.    1250.    5000.  ...,  20000.   55000.    9562.5]
EVS 0.234324949447
MAE 5807.30401372
MSE 122960196.502
MedAE 1566.66666667
r^2 0.234324827482


### ExtraTree Regression

In [13]:
from sklearn.tree import ExtraTreeRegressor
clf = ExtraTreeRegressor(min_samples_split=5, min_samples_leaf=2, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  9375.            550.           9000.         ...,  41666.66666667
  37975.           6000.        ]
EVS 0.202306107868
MAE 6333.2717283
MSE 128102567.891
MedAE 2250.0
r^2 0.202303195988


### Support Vector Machine Regression

In [5]:
from sklearn import svm
clf = svm.SVR()

# SVM takes forever on the full dataset, so fit on a 10% subsample of the training set
svm_inputs, other_inputs, svm_labels, other_labels = train_test_split(inputs_train, labels_train, test_size=0.90, random_state=30)
clf.fit(svm_inputs, svm_labels)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [6]:
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[ 3615.43161602  3615.93483242  3549.11169049 ...,  3488.57380239
  3741.87372973  3362.23151733]
EVS 0.00799937935024
MAE 7567.1136987
MSE 187228764.648
MedAE 2862.27354508
r^2 -0.165876606827


### AdaBoost Regression

In [19]:
from sklearn.ensemble import AdaBoostRegressor

clf = AdaBoostRegressor(n_estimators=50, learning_rate=1.0, loss='linear', random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[ 13414.66618564   7394.49039007   3448.2567983  ...,  15554.23573383
  30606.04948456  21229.84598719]
EVS 0.420871189241
MAE 6683.69374767
MSE 97023743.9995
MedAE 4394.49039007
r^2 0.395831545177


### Bagging Regression


In [20]:
from sklearn.ensemble import BaggingRegressor

clf = BaggingRegressor(n_estimators=10, n_jobs=-1, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[ 12900.     810.    4600.  ...,  18650.   37800.   17034.5]
EVS 0.487894169029
MAE 5101.09792322
MSE 82331602.4696
MedAE 1917.68849206
r^2 0.487319752911


### Extra Trees Ensemble Regression

In [21]:
from sklearn.ensemble import ExtraTreesRegressor

clf = ExtraTreesRegressor(n_estimators=10, min_samples_split=5, min_samples_leaf=2, n_jobs=-1, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  5972.5          1215.5         21035.83333333 ...,  35933.33333333
  34380.83333333  17070.83333333]
EVS 0.481069108776
MAE 5368.46049118
MSE 83336201.7904
MedAE 2365.83333333
r^2 0.481064096364


### Gradient Boost Regression

In [22]:
from sklearn.ensemble import GradientBoostingRegressor

clf = GradientBoostingRegressor(loss='ls', learning_rate=0.1, n_estimators=100, subsample=1.0,
                                 min_samples_split=5, min_samples_leaf=2, random_state=31, alpha=0.9)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  9704.1573131    2268.02472045   3556.80589375 ...,  16526.5530498
  37174.70792291  15274.09137617]
EVS 0.530806727221
MAE 5193.65907787
MSE 75350317.5651
MedAE 2432.56661458
r^2 0.530792329207


### Random Forest Regression

In [23]:
from sklearn.ensemble import RandomForestRegressor

clf = RandomForestRegressor(n_estimators=10, min_samples_split=5, min_samples_leaf=2, n_jobs=-1, random_state=31)
clf.fit(inputs_train, labels_train)
labels_predict = clf.predict(inputs_test)
print labels_test
print labels_predict
print "EVS", explained_variance_score(labels_test, labels_predict)
print "MAE", mean_absolute_error(labels_test, labels_predict)
print "MSE", mean_squared_error(labels_test, labels_predict)
print "MedAE", median_absolute_error(labels_test, labels_predict)
print "r^2", r2_score(labels_test, labels_predict)

[ 10000.   2000.   5000. ...,  18500.  30000.  23000.]
[  9516.66666667    564.43253968   4778.57142857 ...,  21407.
  35871.42857143  16646.3452381 ]
EVS 0.511044580666
MAE 4998.38333244
MSE 78560649.9964
MedAE 1919.21230159
r^2 0.510801536185


## Analysis

Lots of useful results here. Let's start with the dummy model. It ignores the inputs entirely (with default settings, `DummyRegressor` just predicts the mean of the training labels every time), and it produced predictably bad results:

```
EVS 0.0
MAE 8994.24919087
MSE 160592565.919
MedAE 7324.2225209
r^2 -1.25578303172e-05
```

That is an r-squared of essentially zero (very slightly negative) and a mean absolute error of $8,994, which was actually surprisingly small; I expected this to be worse. The basic naive linear regression produced surprisingly good results:

```
EVS 0.337502798405
MAE 6771.33527034
MSE 106391864.925
MedAE 3994.0
r^2 0.337496101598
```

A median error of $3,994 when we're talking about damages of up to 5-6 digits is not too bad, especially because precision isn't nearly as important as order of magnitude in this use case. Most of the other non-ensemble methods have scores very similar to the naive regression. DecisionTree is one that stands out: although its r-squared is lower than most of the other simple regressions, its mean and median absolute errors are quite low. I expect there are a few values that it's _way_ off on that are skewing the squared-error scores. It's not surprising to me that decision trees work well based upon ...

However, even with no parameter tuning yet, the ensemble methods are looking really good right out of the gate. Both Random Forest and Gradient Boost regressions have r-squared scores over 0.5 with no particular tuning. Based on MSE, it seems like Random Forest, Gradient Boost, and Bagging are all candidates for hyperparameter tuning and optimization; we can see from there which ones play out.
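Going back to the DecisionTree scores: the idea that a few badly wrong predictions can hurt the squared-error scores while leaving the absolute-error scores low is easy to check on toy numbers. The sketch below uses made-up labels and two imaginary models (nothing here comes from the actual predictions), in plain Python 3.

```python
from statistics import mean, median


def error_summary(y_true, y_pred):
    """Return MAE, MedAE and MSE for a pair of sequences."""
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {
        "MAE": mean(abs_errors),
        "MedAE": median(abs_errors),
        "MSE": mean([e ** 2 for e in abs_errors]),
    }


# Illustrative labels only.
y_true = [2000.0, 5000.0, 10000.0, 18500.0, 30000.0,
          23000.0, 8000.0, 12000.0, 4000.0, 15000.0]

# "Steady" model: off by 3000 on every single row.
steady_pred = [t + 3000.0 for t in y_true]

# "Spiky" model: off by only 500 on nine rows, but by 20000 on the last one.
spiky_pred = [t + 500.0 for t in y_true[:-1]] + [y_true[-1] + 20000.0]

steady = error_summary(y_true, steady_pred)
spiky = error_summary(y_true, spiky_pred)
print("steady", steady)
print("spiky ", spiky)
```

The spiky model wins on both mean and median absolute error but loses badly on MSE, because squaring lets the single large miss dominate. Whether that is really what is happening with the decision tree would need a look at its actual residuals.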