# Predicting County-Level Crime Statistics using Machine Learning
IBM Group: Jiaqi Chen, Ruixi Liu, Anuj Patel, Guangzhe Zhu

## File Importing

Import all the necessary data files. See the "Instructions to Run.docx" file to make sure every file import completes correctly.

In [1]:
import glob
import os

import numpy as np
import pandas as pd

In [2]:
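# Ensure a folder named 'VA' exists in the same directory as this notebook
data_dir = './VA/'

# OS flag: True on Unix-like systems, False on Windows
unix_flag = (os.name == 'posix')

# One sub-folder per year; the first entry yielded by os.walk is data_dir itself.
# Sorting keeps the years in chronological order regardless of listing order.
subdir = sorted(x[0] for x in os.walk(data_dir))[1:]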

In [3]:
years = [x[-4:] for x in subdir]
varList = ['agencies', 'cde_agencies', 'nibrs_offense', 'nibrs_incident', 'nibrs_offense_type','nibrs_location_type']

In [4]:
years

['2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017']

In [5]:
for i,d in enumerate(subdir):
    filepath = d + "/*.csv"
    paths = glob.glob(filepath)
    files = []
    varnames = []
    for p in paths:
        # take the file name from the path, independent of the OS path separator
        f = os.path.basename(p)
        files.append(f)
        varnames.append(f[:-4])
    for idx,var in enumerate(varnames):
        if var.lower() in varList:
            pdName = var.lower() + "_" + years[i]
            exec(pdName + "= pd.read_csv(r\"" + paths[idx] + "\", encoding = 'cp1252', low_memory=False)")
            exec(pdName + ".columns = " + pdName + ".columns.str.lower()")
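
The loop above derives a table name from each CSV path and creates one variable per table and year with `exec`. As a minimal alternative sketch (not the method used in this notebook), the same bookkeeping can be kept in a dictionary keyed by `(table, year)`, with `pathlib` handling the path separator on both Windows and Unix. The function name `index_csv_files` and the toy folder and file names are illustrative assumptions.

```python
from pathlib import Path
import tempfile


def index_csv_files(root, wanted):
    """Map (table_name, year) -> CSV path for every wanted table under root.

    Assumes one sub-folder per year whose name ends in the four-digit year,
    mirroring the './VA/' layout used in this notebook.
    """
    index = {}
    for csv_path in sorted(Path(root).glob("*/*.csv")):
        table = csv_path.stem.lower()      # file name without '.csv'
        year = csv_path.parent.name[-4:]   # last four characters of the folder
        if table in wanted:
            index[(table, year)] = csv_path
    return index


# Toy directory tree standing in for './VA/' (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("VA-2009", "AGENCIES.csv"),
                         ("VA-2009", "other.csv"),
                         ("VA-2010", "nibrs_offense.csv")]:
        target = Path(tmp) / folder
        target.mkdir(exist_ok=True)
        (target / name).write_text("a,b\n1,2\n")
    found = index_csv_files(tmp, {"agencies", "nibrs_offense"})
    keys = sorted(found)
    print(keys)
```

Each path in the dictionary could then be passed to `pd.read_csv(path, encoding='cp1252', low_memory=False)`, as in the cell above.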

In [6]:
house_prices = pd.read_csv("Median Housing Prices_2017_2009.csv")

In [7]:
population = pd.read_csv("population.csv")

In [8]:
employment = pd.read_csv("employment.csv")

***

## Feature Extraction and Cleaning

In this section, we extract features from our data and perform data cleaning. We begin by handling the 2016 and 2017 data files, whose structure differs from that of the earlier years, to extract the total number of officers.

In [9]:
cde_agencies_2016 = agencies_2016
cde_agencies_2016['total_officers'] = cde_agencies_2016['male_officer'] + cde_agencies_2016['female_officer']
cde_agencies_2016['primary_county'] = cde_agencies_2016['county_name'].str.split(';').str[0]

cde_agencies_2017 = agencies_2017
cde_agencies_2017['total_officers'] = cde_agencies_2017['male_officer'] + cde_agencies_2017['female_officer']
cde_agencies_2017['primary_county'] = cde_agencies_2017['county_name'].str.split(';').str[0]

We go on to join all the features we have selected into one dataframe named `nibrs`.

In [10]:
for y in years:
    exec("officerNum_" + y + " = cde_agencies_" + y + "[['primary_county', 'total_officers']]")
    exec("officerNum_" + y + " = officerNum_" + y + ".groupby(['primary_county'], as_index=False).sum()")
    exec("nibrs_offense_" + y + " = nibrs_offense_" + y + "[['offense_id', 'incident_id', 'offense_type_id']]")
    exec("nibrs_incident_" + y + " = nibrs_incident_" + y + "[['agency_id', 'incident_id', 'incident_date', 'incident_hour']]")
    exec("nibrs_offense_type_" + y + " = nibrs_offense_type_" + y + "[['offense_type_id', 'offense_category_name']]")
    exec("cde_agencies_" + y + " = cde_agencies_" + y + "[['agency_id', 'primary_county']]")
    exec("nibrs_" + y + " = pd.merge(nibrs_offense_" + y + ", nibrs_incident_" + y + ", how = 'left', on = 'incident_id')")
    exec("nibrs_" + y + " = pd.merge(nibrs_" + y + ", nibrs_offense_type_" + y + ", how = 'left', on = 'offense_type_id')")
    exec("nibrs_" + y + " = pd.merge(nibrs_" + y + ", cde_agencies_" + y + ", how = 'left', on = 'agency_id')")
    exec("nibrs_" + y + " = pd.merge(nibrs_" + y + ", officerNum_" + y + ", how = 'left', on = 'primary_county')")

In [11]:
varListStr = "["
for y in years:
    varListStr += "nibrs_" + y + ","
varListStr = varListStr[:-1] + "]"

In [12]:
exec("nibrs = pd.concat(" + varListStr + ", ignore_index=True, sort=False)")
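
The cells above build one joined dataframe per year with left merges and then stack the years with `pd.concat`. The sketch below shows the same pattern on toy data, holding the yearly frames in a dictionary rather than in `exec`-generated variables. The values are made up, `join_year` is a hypothetical helper, and `validate="many_to_one"` is an optional pandas check (not used in the notebook) that raises an error if a lookup table contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for one year of the NIBRS tables (values are made up).
offense = pd.DataFrame({"offense_id": [1, 2, 3],
                        "incident_id": [10, 10, 11],
                        "offense_type_id": [45, 16, 45]})
incident = pd.DataFrame({"incident_id": [10, 11],
                         "agency_id": [100, 101],
                         "incident_hour": [16, 3]})
offense_type = pd.DataFrame({"offense_type_id": [45, 16],
                             "offense_category_name": ["Larceny/Theft Offenses",
                                                       "Drug/Narcotic Offenses"]})


def join_year(offense, incident, offense_type):
    """Left-join the lookup tables onto the offense table (one row per offense)."""
    merged = offense.merge(incident, how="left", on="incident_id",
                           validate="many_to_one")
    merged = merged.merge(offense_type, how="left", on="offense_type_id",
                          validate="many_to_one")
    return merged


# One joined frame per year, held in a dict; the same toy data is reused for both years.
frames = {year: join_year(offense, incident, offense_type)
          for year in ["2009", "2010"]}
nibrs_demo = pd.concat(list(frames.values()), ignore_index=True, sort=False)
print(nibrs_demo.shape)
```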

Delete the intermediate dataframes that are no longer needed to free up memory.

In [13]:
for y in years:
    exec("del nibrs_offense_" + y)
    exec("del nibrs_incident_" + y)
    exec("del nibrs_offense_type_" + y)
    exec("del cde_agencies_" + y)
    exec("del nibrs_" + y)
    exec("del officerNum_" + y)

Preview of the `nibrs` dataframe.

In [14]:
nibrs

,offense_id,incident_id,offense_type_id,agency_id,incident_date,incident_hour,offense_category_name,primary_county,total_officers
0,55428804,51846574,45,20092,2009-03-10 00:00:00,16.0,Larceny/Theft Offenses,Campbell,75.0
1,55423732,51846575,49,20092,2009-03-23 00:00:00,,Burglary/Breaking & Entering,Campbell,75.0
2,55428805,51846576,8,20092,2009-03-28 00:00:00,15.0,Pornography/Obscene Material,Campbell,75.0
3,55423733,51846577,36,20092,2009-04-18 00:00:00,3.0,Sex Offenses,Campbell,75.0
4,55431980,51846578,16,20092,2009-04-23 00:00:00,16.0,Drug/Narcotic Offenses,Campbell,75.0
5,55423734,51846579,45,20092,2009-04-15 00:00:00,,Larceny/Theft Offenses,Campbell,75.0
6,55423735,51846580,16,20092,2009-04-30 00:00:00,22.0,Drug/Narcotic Offenses,Campbell,75.0
7,55424498,51846581,45,20092,2009-05-20 00:00:00,10.0,Larceny/Theft Offenses,Campbell,75.0
8,55423736,51846582,7,20092,2009-06-09 00:00:00,1.0,Larceny/Theft Offenses,Campbell,75.0
9,55423737,51846583,47,20092,2009-06-12 00:00:00,21.0,Larceny/Theft Offenses,Campbell,75.0


We convert the `incident_date` column to datetime format and extract year, month, day, and weekday.

In [15]:
nibrs['incident_date'] = pd.to_datetime(nibrs['incident_date'])

In [16]:
nibrs['year'] = nibrs['incident_date'].dt.year
nibrs['month'] = nibrs['incident_date'].dt.month
nibrs['day'] = nibrs['incident_date'].dt.day
nibrs['weekday'] = nibrs['incident_date'].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
nibrs.drop('incident_date', axis=1, inplace=True)
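
As a small self-contained illustration of the date handling above, the sketch below applies the same steps to two dates taken from the preview table. It uses `Series.dt.day_name()`, the accessor available in current pandas releases; the frame name `dates` is an assumption made for this example.

```python
import pandas as pd

dates = pd.DataFrame({"incident_date": ["2009-03-10 00:00:00",
                                        "2009-03-28 00:00:00"]})
dates["incident_date"] = pd.to_datetime(dates["incident_date"])

# Split the timestamp into separate calendar features.
dates["year"] = dates["incident_date"].dt.year
dates["month"] = dates["incident_date"].dt.month
dates["day"] = dates["incident_date"].dt.day
dates["weekday"] = dates["incident_date"].dt.day_name()

# The raw timestamp is no longer needed once the features exist.
dates = dates.drop(columns="incident_date")
print(dates.to_dict("records"))
```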

In [17]:
nibrs['primary_county'] = nibrs['primary_county'].str.title()

For agencies that list multiple counties, we treat the first listed county as the agency's primary county.

In [18]:
nibrs['county'] = nibrs['primary_county'].str.split(';').str[0]
nibrs.drop('primary_county', axis=1, inplace=True)

Drop all rows where there is no listed county or incident time.

In [19]:
nibrs.dropna(subset=['county', 'incident_hour'], inplace=True)

***

## Joining Outside Datasets

We join demographic datasets obtained from the US Census Bureau (median house prices, population, and employment), performing some minor data cleaning along the way.

### Median House Prices

In [20]:
house_prices = house_prices.rename(columns = {'Geography':'county', 'Estimate; Median value (dollars)':'median_house_price', 'Year':'year'})

In [21]:
house_prices['county'] = house_prices['county'].str.title()

In [22]:
nibrs = pd.merge(nibrs, house_prices, how = 'left', on = ['county', 'year'])

### Population

In [23]:
population = population[['CTYNAME', 'YEAR', 'TOT_POP']]

In [24]:
population = population.rename(columns = {'CTYNAME':'county', 'TOT_POP':'population', 'YEAR':'year'})

In [25]:
population['county'] = population['county'].str.replace(" County", '')
population['county'] = population['county'].str.title()

In [26]:
nibrs = pd.merge(nibrs, population, how = 'left', on = ['county', 'year'])

In [27]:
nibrs = nibrs[nibrs.year >= 2009]

### Employment

In [28]:
employment = employment.rename(columns = {'Geography':'county', 'EmploymentPopulationRatio':'employment', 'Year':'year'})

In [29]:
employment['county'] = employment['county'].str.title()

In [30]:
nibrs = pd.merge(nibrs, employment, how = 'left', on = ['county', 'year'])
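
Left joins on `['county', 'year']` silently leave missing values wherever a county name is spelled differently in the two tables. One way to check for this, sketched below on toy data, is the `indicator=True` option of `merge`, which adds a `_merge` column marking unmatched rows. This check is not part of the original notebook; the frames `crimes` and `population_demo` and the population figure for Lee are made up for illustration.

```python
import pandas as pd

crimes = pd.DataFrame({"county": ["campbell", "Norton City", "Lee"],
                       "year": [2009, 2009, 2010]})
population_demo = pd.DataFrame({"CTYNAME": ["Campbell County", "Lee County"],
                                "YEAR": [2009, 2010],
                                "TOT_POP": [54472, 25000]})

# Normalize both key columns the same way before joining.
crimes["county"] = crimes["county"].str.title()
population_demo = population_demo.rename(
    columns={"CTYNAME": "county", "YEAR": "year", "TOT_POP": "population"})
population_demo["county"] = (population_demo["county"]
                             .str.replace(" County", "", regex=False)
                             .str.title())

# indicator=True adds a '_merge' column that flags rows without a match.
joined = crimes.merge(population_demo, how="left", on=["county", "year"],
                      indicator=True)
unmatched = joined.loc[joined["_merge"] == "left_only", "county"].tolist()
print(unmatched)
```

Any county that appears in `unmatched` would end up with a missing `population` value after the join.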

In [31]:
nibrs

,offense_id,incident_id,offense_type_id,agency_id,incident_hour,offense_category_name,total_officers,year,month,day,weekday,county,median_house_price,population,employment
0,55428804,51846574,45,20092,16.0,Larceny/Theft Offenses,75.0,2009,3,10,Tuesday,Campbell,127500,54472,61.2
1,55428805,51846576,8,20092,15.0,Pornography/Obscene Material,75.0,2009,3,28,Saturday,Campbell,127500,54472,61.2
2,55423733,51846577,36,20092,3.0,Sex Offenses,75.0,2009,4,18,Saturday,Campbell,127500,54472,61.2
3,55431980,51846578,16,20092,16.0,Drug/Narcotic Offenses,75.0,2009,4,23,Thursday,Campbell,127500,54472,61.2
4,55423735,51846580,16,20092,22.0,Drug/Narcotic Offenses,75.0,2009,4,30,Thursday,Campbell,127500,54472,61.2
5,55424498,51846581,45,20092,10.0,Larceny/Theft Offenses,75.0,2009,5,20,Wednesday,Campbell,127500,54472,61.2
6,55423736,51846582,7,20092,1.0,Larceny/Theft Offenses,75.0,2009,6,9,Tuesday,Campbell,127500,54472,61.2
7,55423737,51846583,47,20092,21.0,Larceny/Theft Offenses,75.0,2009,6,12,Friday,Campbell,127500,54472,61.2
8,55428806,51846584,47,20092,18.0,Larceny/Theft Offenses,75.0,2009,6,26,Friday,Campbell,127500,54472,61.2
9,55436530,51846585,21,20092,22.0,Motor Vehicle Theft,75.0,2009,7,11,Saturday,Campbell,127500,54472,61.2


***

## Model Building

Using H2O, we build and evaluate different models for predictive purposes.

In [32]:
import h2o
import numpy as np
import pandas as pd
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator 
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch

In [33]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
  Starting server from C:\Users\anujp\Anaconda3\lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\anujp\AppData\Local\Temp\tmpj4j6qf7_
  JVM stdout: C:\Users\anujp\AppData\Local\Temp\tmpj4j6qf7_\h2o_anujp_started_from_python.out
  JVM stderr: C:\Users\anujp\AppData\Local\Temp\tmpj4j6qf7_\h2o_anujp_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O cluster uptime:,02 secs
H2O cluster timezone:,America/New_York
H2O data parsing timezone:,UTC
H2O cluster version:,3.23.0.4586
H2O cluster version age:,2 months and 21 days
H2O cluster name:,H2O_from_python_anujp_mjc0mr
H2O cluster total nodes:,1
H2O cluster free memory:,3.535 Gb
H2O cluster total cores:,8
H2O cluster allowed cores:,8


In [34]:
h2o_nibrs = h2o.H2OFrame(nibrs)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [35]:
h2o_nibrs['year'] = h2o_nibrs['year'].asfactor()
h2o_nibrs['month'] = h2o_nibrs['month'].asfactor()
h2o_nibrs['day'] = h2o_nibrs['day'].asfactor()

In [36]:
h2o_nibrs.types

{'offense_id': 'int',
 'incident_id': 'int',
 'offense_type_id': 'int',
 'agency_id': 'int',
 'incident_hour': 'int',
 'offense_category_name': 'enum',
 'total_officers': 'int',
 'year': 'enum',
 'month': 'enum',
 'day': 'enum',
 'weekday': 'enum',
 'county': 'enum',
 'median_house_price': 'int',
 'population': 'int',
 'employment': 'real'}

Split the data into 40% training, 30% validation, and 30% test.

In [37]:
training, validation, test = h2o_nibrs.split_frame(ratios=[0.4, 0.3], seed = 12345)
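
`ratios=[0.4, 0.3]` names only the first two fractions; the remaining 30% forms the third frame. As far as we understand from the H2O documentation, `split_frame` assigns rows probabilistically, so the resulting frame sizes are approximate rather than exact. The pure-Python sketch below illustrates that idea only; it is not H2O's implementation, and `probabilistic_split` is a hypothetical helper.

```python
import random


def probabilistic_split(n_rows, ratios, seed):
    """Assign each row index to a split by comparing a uniform draw with
    the cumulative ratios; the last split receives the remainder."""
    rng = random.Random(seed)
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        for k, t in enumerate(thresholds):
            if u < t:
                splits[k].append(row)
                break
        else:
            splits[-1].append(row)  # draw fell beyond the last threshold
    return splits


train_idx, valid_idx, test_idx = probabilistic_split(100_000, [0.4, 0.3], seed=12345)
sizes = [len(train_idx), len(valid_idx), len(test_idx)]
print(sizes)
```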

In [38]:
y = 'offense_category_name'
X = [name for name in h2o_nibrs.columns if name not in ['offense_id', 'incident_id', 'offense_type_id', 'agency_id','year',y]]

In [39]:
print(y)
print(X)

offense_category_name
['incident_hour', 'total_officers', 'month', 'day', 'weekday', 'county', 'median_house_price', 'population', 'employment']


### Random Forest

In [40]:
## Random forest with random hyperparameter search
# train many different random forest models with random hyperparameters
# and select the best model based on validation error

# define random grid search parameters
# note: range(0, 100, 10) starts at 0, so the search can draw a forest with zero trees
hyper_parameters = {'ntrees':list(range(0, 100, 10)),
                    'max_depth':list(range(1, 11, 1))}

# define search strategy
search_criteria = {'strategy':'RandomDiscrete',
                   'max_models':60,
                   'max_runtime_secs':600,
                   'seed': 12345}

# initialize grid search
rf_gsearch = H2OGridSearch(H2ORandomForestEstimator,
                        hyper_params=hyper_parameters,
                        search_criteria=search_criteria)

# execute training w/ grid search
rf_gsearch.train(x=X,
          y=y,
          training_frame=training,
          validation_frame=validation)

# view detailed results at http://localhost:54321/flow/index.html

drf Grid Build progress: |████████████████████████████████████████████████| 100%
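
The `RandomDiscrete` strategy draws hyperparameter combinations at random from the grid until `max_models` or `max_runtime_secs` is reached. The sketch below mimics the sampling step in plain Python under simplifying assumptions: it ignores the time budget, and `random_discrete_search` is a hypothetical helper, not an H2O function. It also shows that a grid whose `ntrees` range starts at 0 contains zero-tree candidates.

```python
import itertools
import random


def random_discrete_search(hyper_params, max_models, seed):
    """Sample distinct hyperparameter combinations from the full grid."""
    names = sorted(hyper_params)
    grid = list(itertools.product(*(hyper_params[n] for n in names)))
    rng = random.Random(seed)
    chosen = rng.sample(grid, min(max_models, len(grid)))
    return [dict(zip(names, combo)) for combo in chosen]


hyper_parameters = {"ntrees": list(range(0, 100, 10)),
                    "max_depth": list(range(1, 11, 1))}
candidates = random_discrete_search(hyper_parameters, max_models=60, seed=12345)

# Count how many sampled candidates would train a forest with no trees.
n_zero_tree = sum(1 for c in candidates if c["ntrees"] == 0)
print(len(candidates), n_zero_tree)
```

The grid here has 100 combinations, so at most 60 are sampled. In the run recorded in this notebook, the results table lists only six models, presumably because the 600-second budget ran out first.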


In [41]:
# show grid search results
rf_gsearch.show()

# select best model
rf_gsearch_model = rf_gsearch.get_grid()[0]

# print model information
rf_gsearch_model

    max_depth ntrees  \
0           8     30   
1           7     90   
2           4     20   
3           3     10   
4           1     80   
5          10      0   

                                                     model_ids  \
0  Grid_DRF_py_6_sid_b7da_model_python_1557965135576_1_model_1   
1  Grid_DRF_py_6_sid_b7da_model_python_1557965135576_1_model_6   
2  Grid_DRF_py_6_sid_b7da_model_python_1557965135576_1_model_3   
3  Grid_DRF_py_6_sid_b7da_model_python_1557965135576_1_model_5   
4  Grid_DRF_py_6_sid_b7da_model_python_1557965135576_1_model_4   
5  Grid_DRF_py_6_sid_b7da_model_python_1557965135576_1_model_2   

              logloss  
0  1.9318066392280022  
1  1.9361418800791015  
2  1.9564919743935514  
3  1.9640345303605409  
4  1.9898964355856665  
5    34.5387763949111  
Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  Grid_DRF_py_6_sid_b7da_model_python_1557965135576_1_model_1


ModelMetricsMultinomial: drf
** Reported on train data. **

Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
0.0,816.0,0.0,0.0,0.0,14.0,154.0,0.0,0.0,7.0,0.0,0.0,0.0,2741.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"3,732 / 3,732"
0.0,60253.0,0.0,4.0,0.0,480.0,9083.0,0.0,0.0,612.0,0.0,0.0,0.0,239969.0,0.0,0.0,0.0,0.0,10.0,0.0,5.0,0.8058960,"250,163 / 310,416"
0.0,12.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,59.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,72 / 72
0.0,10526.0,0.0,4.0,0.0,159.0,2032.0,0.0,0.0,132.0,0.0,0.0,0.0,76752.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.9999554,"89,603 / 89,607"
0.0,1519.0,0.0,0.0,3.0,27.0,466.0,0.0,0.0,119.0,1.0,0.0,0.0,20537.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.9998677,"22,671 / 22,674"
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,3337.0,0.0,0.0,0.0,24.0,305.0,0.0,0.0,19.0,0.0,1.0,0.0,13794.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"17,480 / 17,480"
0.0,2773.0,0.0,0.0,0.0,29.0,620.0,0.0,0.0,127.0,0.0,0.0,0.0,14281.0,0.0,0.0,0.0,1.0,29.0,0.0,0.0,0.9983763,"17,831 / 17,860"
0.0,529.0,0.0,0.0,0.0,10.0,159.0,0.0,0.0,15.0,0.0,0.0,0.0,4151.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"4,864 / 4,864"



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


k,hit_ratio
1,0.3242200
2,0.5275633
3,0.6842743
4,0.7977454
5,0.8641324
6,0.9055583
7,0.930344
8,0.9488916
9,0.9635908



ModelMetricsMultinomial: drf
** Reported on validation data. **

MSE: 0.6595363697496395
RMSE: 0.8121184456405601
LogLoss: 1.9318066392280022
Mean Per-Class Error: 0.9420189921259252
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
1.0,574.0,0.0,0.0,0.0,12.0,111.0,0.0,0.0,7.0,0.0,0.0,0.0,1951.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9996235,"2,655 / 2,656"
0.0,45198.0,0.0,1.0,0.0,248.0,6815.0,0.0,0.0,335.0,0.0,0.0,0.0,179721.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.8054494,"187,122 / 232,320"
0.0,14.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,49.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,65 / 65
0.0,7903.0,0.0,0.0,0.0,86.0,1575.0,0.0,0.0,62.0,0.0,0.0,0.0,57538.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"67,164 / 67,164"
0.0,1175.0,0.0,0.0,0.0,6.0,327.0,0.0,0.0,60.0,0.0,0.0,0.0,15284.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,"16,853 / 16,853"
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,2612.0,0.0,0.0,0.0,17.0,241.0,0.0,0.0,17.0,0.0,0.0,0.0,10512.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,"13,400 / 13,400"
0.0,1952.0,0.0,0.0,0.0,21.0,473.0,0.0,0.0,92.0,0.0,0.0,0.0,10621.0,0.0,0.0,0.0,0.0,16.0,0.0,0.0,0.9987856,"13,159 / 13,175"
0.0,394.0,0.0,0.0,0.0,5.0,107.0,0.0,0.0,4.0,0.0,0.0,0.0,3080.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"3,590 / 3,590"



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


k,hit_ratio
1,0.3256396
2,0.5280359
3,0.6850148
4,0.798106
5,0.8642516
6,0.9057376
7,0.9304815
8,0.9490889
9,0.9639991


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,validation_rmse,validation_logloss,validation_classification_error
,2019-05-15 20:06:40,0.251 sec,0.0,,,,,,
,2019-05-15 20:07:01,21.076 sec,4.0,0.8129087,1.9825773,0.6781555,0.8126018,1.9417059,0.6755650
,2019-05-15 20:08:28,1 min 48.513 sec,24.0,0.8123635,1.9355382,0.6758191,0.8122049,1.9320725,0.6743740
,2019-05-15 20:09:43,3 min 3.227 sec,30.0,0.8122719,1.9344609,0.6757800,0.8121184,1.9318066,0.6743604


Variable Importances: 


variable,relative_importance,scaled_importance,percentage
incident_hour,219721.5,1.0,0.3674239
county,210569.9062500,0.9583491,0.3521203
weekday,37016.0546875,0.1684681,0.0618992
day,27135.8046875,0.1235009,0.0453772
total_officers,24982.4863281,0.1137007,0.0417763
median_house_price,24953.4960938,0.1135687,0.0417279
population,20053.4003906,0.0912674,0.0335338
employment,18639.7558594,0.0848336,0.0311699
month,14933.1054688,0.0679638,0.0249715
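
The numeric columns of the variable importance table appear to be related in a simple way: `scaled_importance` is the relative importance divided by the largest value, and `percentage` is the relative importance divided by the column total. The sketch below recomputes both from the random forest values above; `normalize_importances` is a hypothetical helper written for this check.

```python
# Relative importances copied from the random forest table above.
relative = {
    "incident_hour": 219721.5,
    "county": 210569.90625,
    "weekday": 37016.0546875,
    "day": 27135.8046875,
    "total_officers": 24982.4863281,
    "median_house_price": 24953.4960938,
    "population": 20053.4003906,
    "employment": 18639.7558594,
    "month": 14933.1054688,
}


def normalize_importances(relative):
    """Return {variable: (scaled_importance, percentage)}."""
    top = max(relative.values())
    total = sum(relative.values())
    return {name: (value / top, value / total)
            for name, value in relative.items()}


table = normalize_importances(relative)
for name, (scaled, share) in table.items():
    print(f"{name:20s} {scaled:.7f} {share:.7f}")
```

Read this way, `incident_hour` and `county` together account for roughly 72% of the total importance in the random forest.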




In [42]:
# measure random forest logloss
rf_gsearch_train = rf_gsearch_model.logloss(train=True)
rf_gsearch_validation = rf_gsearch_model.logloss(valid=True)
rf_gsearch_test = rf_gsearch_model.model_performance(test_data=test).logloss()
print(rf_gsearch_train)
print(rf_gsearch_validation)
print(rf_gsearch_test)

1.9344609204837788
1.9318066392280022
1.9331329906653252
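
Logloss is the mean negative logarithm of the probability that the model assigns to the true class, so lower is better. The sketch below is a plain-Python version of that definition; the clipping constant `eps=1e-15` is an assumption of this sketch, as are the toy probabilities. With that constant, a model that gives the true class a probability of zero scores about 34.5388, which would be consistent with the logloss reported for the zero-tree model in the grid results above.

```python
import math


def multiclass_logloss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log of the probability given to the true class.

    Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs;
    the eps value is an assumption of this sketch.
    """
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


labels = ["Larceny/Theft Offenses", "Assault Offenses"]
# Toy predictions: one reasonable model and one that is always wrong.
confident = [{"Larceny/Theft Offenses": 0.7, "Assault Offenses": 0.3},
             {"Larceny/Theft Offenses": 0.4, "Assault Offenses": 0.6}]
hopeless = [{"Larceny/Theft Offenses": 0.0, "Assault Offenses": 1.0},
            {"Larceny/Theft Offenses": 1.0, "Assault Offenses": 0.0}]

good_loss = multiclass_logloss(labels, confident)
worst_loss = multiclass_logloss(labels, hopeless)
print(good_loss, worst_loss)
```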


### Gradient Boosting Machine

In [48]:
## GBM with random hyperparameter search
# train many different GBM models with random hyperparameters
# and select best model based on validation error

# define random grid search parameters
hyper_parameters = {'ntrees':list(range(0, 500, 50)),
                    'max_depth':list(range(1, 11, 1)),
                    'sample_rate':[s/float(10) for s in range(1, 11)],
                    'col_sample_rate':[s/float(10) for s in range(1, 11)]}

# define search strategy
search_criteria = {'strategy':'RandomDiscrete',
                   'max_models':60,
                   'max_runtime_secs':600,
                   'seed': 12345}

# initialize grid search
gbm = H2OGridSearch(H2OGradientBoostingEstimator,
                        hyper_params=hyper_parameters,
                        search_criteria=search_criteria)

# execute training w/ grid search
gbm.train(x=X,
          y=y,
          training_frame=training,
          validation_frame=validation)

# view detailed results at http://localhost:54321/flow/index.html

gbm Grid Build progress: |████████████████████████████████████████████████| 100%


In [49]:
# show grid search results
gbm.show()

# select best model
gbm_model = gbm.get_grid()[0]

# print model information
gbm_model

    col_sample_rate max_depth ntrees sample_rate  \
0               0.9         7    450         1.0   

                                                     model_ids  \
0  Grid_GBM_py_6_sid_ad1d_model_python_1555600338471_4_model_1   

              logloss  
0  1.9257459052284973  
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  Grid_GBM_py_6_sid_ad1d_model_python_1555600338471_4_model_1


ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.6520483143177295
RMSE: 0.8074950862498975
LogLoss: 1.8917034705497446
Mean Per-Class Error: 0.9229925975985735
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
61.0,911.0,0.0,3.0,0.0,45.0,229.0,0.0,0.0,40.0,0.0,0.0,0.0,2442.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9836549,"3,671 / 3,732"
1.0,70551.0,1.0,249.0,6.0,2639.0,16773.0,1.0,0.0,3446.0,0.0,1.0,2.0,216677.0,1.0,4.0,9.0,0.0,40.0,0.0,15.0,0.7727211,"239,865 / 310,416"
0.0,11.0,5.0,0.0,0.0,1.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,51.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.9305556,67 / 72
0.0,12401.0,0.0,608.0,5.0,845.0,3595.0,1.0,0.0,1205.0,0.0,0.0,0.0,70930.0,1.0,0.0,0.0,0.0,13.0,0.0,3.0,0.9932148,"88,999 / 89,607"
0.0,1694.0,0.0,16.0,66.0,101.0,960.0,1.0,0.0,747.0,0.0,0.0,0.0,19083.0,0.0,0.0,0.0,0.0,5.0,0.0,1.0,0.9970892,"22,608 / 22,674"
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,3844.0,0.0,20.0,0.0,204.0,821.0,0.0,0.0,155.0,0.0,1.0,0.0,12422.0,0.0,0.0,0.0,11.0,1.0,0.0,1.0,0.9993707,"17,469 / 17,480"
0.0,2945.0,0.0,11.0,1.0,146.0,1122.0,0.0,0.0,629.0,0.0,0.0,0.0,12770.0,0.0,2.0,0.0,0.0,234.0,0.0,0.0,0.9868981,"17,626 / 17,860"
0.0,604.0,0.0,4.0,0.0,52.0,314.0,1.0,0.0,80.0,0.0,0.0,0.0,3800.0,0.0,0.0,0.0,0.0,1.0,8.0,0.0,0.9983553,"4,856 / 4,864"



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


k,hit_ratio
1,0.3361941
2,0.546463
3,0.7003663
4,0.8081261
5,0.8728432
6,0.9132826
7,0.9389794
8,0.9570242
9,0.9698805



ModelMetricsMultinomial: gbm
** Reported on validation data. **

MSE: 0.6570362656928761
RMSE: 0.8105777357495554
LogLoss: 1.9257459052284973
Mean Per-Class Error: 0.9349758746984413
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
5.0,666.0,0.0,6.0,0.0,47.0,166.0,0.0,0.0,22.0,0.0,0.0,0.0,1743.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.9981175,"2,651 / 2,656"
4.0,50545.0,0.0,171.0,4.0,2137.0,13332.0,2.0,0.0,2560.0,0.0,2.0,2.0,163493.0,1.0,7.0,4.0,0.0,43.0,0.0,13.0,0.7824337,"181,775 / 232,320"
0.0,14.0,0.0,0.0,0.0,2.0,8.0,0.0,0.0,2.0,0.0,0.0,0.0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,65 / 65
1.0,9225.0,0.0,284.0,5.0,673.0,2803.0,1.0,0.0,908.0,1.0,1.0,1.0,53248.0,2.0,0.0,0.0,0.0,8.0,0.0,3.0,0.9957715,"66,880 / 67,164"
0.0,1333.0,0.0,11.0,8.0,89.0,732.0,0.0,0.0,533.0,0.0,0.0,0.0,14138.0,0.0,1.0,2.0,0.0,4.0,0.0,2.0,0.9995253,"16,845 / 16,853"
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,3023.0,0.0,16.0,0.0,162.0,646.0,0.0,0.0,108.0,0.0,0.0,0.0,9440.0,0.0,0.0,0.0,1.0,0.0,0.0,4.0,0.9999254,"13,399 / 13,400"
0.0,2170.0,1.0,8.0,0.0,98.0,817.0,1.0,0.0,478.0,0.0,1.0,0.0,9525.0,0.0,1.0,0.0,0.0,73.0,0.0,2.0,0.9944592,"13,102 / 13,175"
0.0,453.0,0.0,7.0,0.0,35.0,245.0,1.0,0.0,51.0,0.0,0.0,0.0,2797.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.9997214,"3,589 / 3,590"



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


k,hit_ratio
1,0.3298971
2,0.5358435
3,0.6909885
4,0.8021515
5,0.8671895
6,0.9067456
7,0.93128
8,0.9498254
9,0.9643933


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,validation_rmse,validation_logloss,validation_classification_error
,2019-04-18 11:57:19,0.007 sec,0.0,0.9523810,3.0445224,0.8207598,0.9523810,3.0445224,0.8211191
,2019-04-18 11:58:01,42.071 sec,5.0,0.8858607,2.2915593,0.6710387,0.8863232,2.2971349,0.6726594
,2019-04-18 12:00:15,2 min 56.067 sec,21.0,0.8186402,1.9387992,0.6663646,0.8206756,1.9596696,0.6709568
,2019-04-18 12:06:51,9 min 31.646 sec,221.0,0.8074951,1.8917035,0.6638059,0.8105777,1.9257459,0.6701029
,2019-04-18 12:09:31,12 min 12.004 sec,222.0,0.8074951,1.8917035,0.6638059,0.8105777,1.9257459,0.6701029


Variable Importances: 


variable,relative_importance,scaled_importance,percentage
county,76164.1796875,1.0,0.4091745
incident_hour,63511.5820312,0.8338773,0.3412013
day,16221.7675781,0.2129842,0.0871477
weekday,11888.6308594,0.1560922,0.0638689
month,7988.6582031,0.1048873,0.0429172
median_house_price,3287.2460938,0.0431600,0.0176600
employment,3170.7724609,0.0416308,0.0170342
population,3104.9516602,0.0407666,0.0166806
total_officers,803.2937012,0.0105469,0.0043155
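
The `Error` column of each confusion matrix is the share of rows of that actual class that were predicted as something else, and the mean per-class error averages those shares. The sketch below recomputes the error for "Assault Offenses" from the GBM validation matrix above (50,545 correct out of 232,320). As a simplification, all misclassified rows are lumped into a single predicted class, the second row is made up, and skipping classes with no rows is an assumption of this sketch.

```python
def per_class_error(confusion):
    """Return {class: error rate} from {actual: {predicted: count}}."""
    errors = {}
    for actual, row in confusion.items():
        total = sum(row.values())
        if total == 0:
            continue  # class absent from the data: no error rate defined
        errors[actual] = 1.0 - row.get(actual, 0) / total
    return errors


def mean_per_class_error(confusion):
    """Unweighted average of the per-class error rates."""
    errors = per_class_error(confusion)
    return sum(errors.values()) / len(errors)


confusion = {
    # Totals taken from the GBM validation matrix; misses lumped into one column.
    "Assault Offenses": {"Assault Offenses": 50545,
                         "Larceny/Theft Offenses": 181775},
    # Made-up row for illustration only.
    "Larceny/Theft Offenses": {"Assault Offenses": 100,
                               "Larceny/Theft Offenses": 900},
}
errors = per_class_error(confusion)
mpce = mean_per_class_error(confusion)
print(errors, mpce)
```

Because every class counts equally in this average, rare classes that are almost never predicted push the mean per-class error close to 1 even when the overall classification error is much lower.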




In [50]:
# measure gbm Logloss
gbm_train = gbm_model.logloss(train=True)
gbm_validation = gbm_model.logloss(valid=True)
gbm_test = gbm_model.model_performance(test_data=test).logloss()
print(gbm_train)
print(gbm_validation)
print(gbm_test)

1.8917034705497446
1.9257459052284973
1.9271377540259977
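
For context, the first row of the GBM scoring history above (zero trees) reports a logloss of 3.0445224 and an RMSE of 0.9523810. Both values match what a model that assigns equal probability to each of the 21 offense categories would score, assuming the RMSE is computed from one minus the probability given to the true class. The sketch below reproduces the two numbers under that assumption.

```python
import math

N_CLASSES = 21  # number of offense categories in the confusion matrices

# A uniform model gives every class the same probability.
uniform_p = 1.0 / N_CLASSES

# Logloss of the uniform model: -ln(1/21) = ln(21).
baseline_logloss = -math.log(uniform_p)

# Per-row error is (1 - p_true), identical for every row, so RMSE = 1 - 1/21.
baseline_rmse = math.sqrt((1.0 - uniform_p) ** 2)

print(round(baseline_logloss, 7), round(baseline_rmse, 7))
```

Against this uniform baseline of about 3.04, the test logloss values of roughly 1.93 for the tree-based models represent a clear but modest improvement.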


### Neural Network

We use a random grid search for hyperparameter selection.

In [51]:
# NN with random hyperparameter search
# train many different NN models with random hyperparameters
# and select best model based on validation error

# define random grid search parameters
hyper_parameters = {'hidden':[[160, 320], [90, 180], [320, 160, 80], [100], [50, 50, 50, 50]],
                    'l1':[s/1e4 for s in range(0, 1000, 100)],
                    'l2':[s/1e5 for s in range(0, 1000, 100)],
                    'input_dropout_ratio':[s/1e2 for s in range(0,20,2)],
                    'epochs':list(range(0,100,10))}

# define search strategy
search_criteria = {'strategy':'RandomDiscrete',
                   'max_models':60,
                   'max_runtime_secs':600,
                   'seed':12345}

# initialize grid search
nn_gsearch = H2OGridSearch(H2ODeepLearningEstimator,
                        hyper_params=hyper_parameters,
                        search_criteria=search_criteria)

# execute training w/ grid search
nn_gsearch.train(x=X,
                 y=y,
                 training_frame=training,
                 validation_frame=validation)

# view detailed results at http://localhost:54321/flow/index.html

deeplearning Grid Build progress: |███████████████████████████████████████| 100%


In [52]:
# show grid search results
nn_gsearch.show()

# select best model
nn_gsearch_model = nn_gsearch.get_grid()[0]

# print model information
nn_gsearch_model

    epochs hidden input_dropout_ratio    l1     l2  \
0     80.0  [100]                0.12  0.06  0.009   

                                                              model_ids  \
0  Grid_DeepLearning_py_6_sid_ad1d_model_python_1555600338471_5_model_1   

             logloss  
0  2.113442617303093  
Model Details
H2ODeepLearningEstimator :  Deep Learning
Model Key:  Grid_DeepLearning_py_6_sid_ad1d_model_python_1555600338471_5_model_1


ModelMetricsMultinomial: deeplearning
** Reported on train data. **

MSE: 0.7059461673274332
RMSE: 0.8402060267145394
LogLoss: 2.124250458847267
Mean Per-Class Error: 0.8571428571428571
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,19 / 19
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1972.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"1,972 / 1,972"
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0 / 0
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,560.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,560 / 560
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,140.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,140 / 140
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,109.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,109 / 109
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,112.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,112 / 112
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,28 / 28



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


k,hit_ratio
1,0.3030635
2,0.4992043
3,0.6482992
4,0.7705391
5,0.8262383
6,0.8992441
7,0.9211259
8,0.9323652
9,0.9466878



ModelMetricsMultinomial: deeplearning
** Reported on validation data. **

MSE: 0.7038299452393195
RMSE: 0.8389457343829334
LogLoss: 2.113442617303093
Mean Per-Class Error: 0.9523809523809523
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2656.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"2,656 / 2,656"
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,232320.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"232,320 / 232,320"
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,65.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,65 / 65
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,67164.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"67,164 / 67,164"
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16853.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"16,853 / 16,853"
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13400.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"13,400 / 13,400"
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13175.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"13,175 / 13,175"
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3590.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"3,590 / 3,590"



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


k,hit_ratio
1,0.3071403
2,0.5049207
3,0.6530960
4,0.7750444
5,0.8322229
6,0.9004688
7,0.9210079
8,0.9324157
9,0.9461595


Scoring History: 


,timestamp,duration,training_speed,epochs,iterations,samples,training_rmse,training_logloss,training_r2,training_classification_error,validation_rmse,validation_logloss,validation_r2,validation_classification_error
,2019-04-18 12:54:05,0.000 sec,,0.0,0,0.0,,,,,,,,
,2019-04-18 12:54:06,7.219 sec,69813 obs/sec,0.0637478,1,99903.0,0.8422367,2.1550335,0.9742772,0.6969365,0.8410612,2.1449283,0.9743355,0.6928597
,2019-04-18 12:55:02,1 min 2.658 sec,147780 obs/sec,4.8508008,76,7601981.0,0.8351546,2.1403867,0.9747080,0.6969365,0.8336234,2.1275179,0.9747874,0.6928597
,2019-04-18 12:55:56,1 min 57.208 sec,151093 obs/sec,9.7024682,152,15205320.0000000,0.8391310,2.1256002,0.9744665,0.6969365,0.8378036,2.1148412,0.9745339,0.6928597
,2019-04-18 12:56:50,2 min 51.158 sec,150731 obs/sec,14.3589397,225,22502756.0000000,0.8397532,2.1244614,0.9744287,0.6969365,0.8384688,2.1137838,0.9744935,0.6928597
,2019-04-18 12:57:44,3 min 44.364 sec,151685 obs/sec,19.0830668,299,29906219.0000000,0.8312518,2.1234555,0.9749438,0.6969365,0.8297188,2.1116918,0.9750231,0.6928597
,2019-04-18 12:58:37,4 min 37.984 sec,152056 obs/sec,23.8056574,373,37307274.0000000,0.8358211,2.1280995,0.9746676,0.6969365,0.8343771,2.1166798,0.9747418,0.6928597
,2019-04-18 12:59:32,5 min 32.713 sec,152429 obs/sec,28.6543869,449,44906009.0000000,0.8432472,2.1211191,0.9742154,0.6969365,0.8420450,2.1117610,0.9742754,0.6928597
,2019-04-18 13:00:27,6 min 27.532 sec,152597 obs/sec,33.5041974,525,52506438.0000000,0.8443660,2.1390967,0.9741470,0.6969365,0.8431062,2.1281369,0.9742106,0.6928597


Variable Importances: 


0,1,2,3
variable,relative_importance,scaled_importance,percentage
county.Gloucester,0.7837845,1.0,0.0084076
county.Lee,0.7431087,0.9481034,0.0079713
county.Norton City,0.7369671,0.9402676,0.0079054
county.Nottoway,0.7282716,0.9291733,0.0078121
county.Madison,0.7051723,0.8997018,0.0075643
---,---,---,---
incident_hour,0.0033012,0.0042119,0.0000354
county.missing(NA),0.0,0.0,0.0
day.missing(NA),0.0,0.0,0.0



See the whole table with table.as_data_frame()




In [53]:
# measure neural network Logloss
nn_gsearch_train = nn_gsearch_model.logloss(train=True)
nn_gsearch_validation = nn_gsearch_model.logloss(valid=True)
nn_gsearch_test = nn_gsearch_model.model_performance(test_data=test).logloss()
print(nn_gsearch_train)
print(nn_gsearch_validation)
print(nn_gsearch_test)

2.124250458847267
2.113442617303093
2.1141615259841475
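
To put these log loss values in context, the sketch below computes multinomial log loss by hand, along with the score a model would get by assigning equal probability to each of the 21 offense categories listed in the confusion matrix header. The function name `multinomial_logloss` and the toy inputs are illustrative and not part of the H2O API; the natural logarithm is assumed.

```python
import math

def multinomial_logloss(true_labels, predicted_probs):
    """Mean negative natural log of the probability assigned to the true class."""
    eps = 1e-15  # guard against log(0)
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        total += -math.log(max(probs[label], eps))
    return total / len(true_labels)

# Baseline: equal probability for each of the 21 offense categories.
n_classes = 21
uniform = [1.0 / n_classes] * n_classes
baseline = multinomial_logloss([0, 5, 13], [uniform, uniform, uniform])
print(round(baseline, 4))  # ln(21), about 3.0445
```

By that yardstick, the neural network's log loss of about 2.11 is well below the uniform baseline of about 3.04, even though every row shown in its confusion matrix has all predictions in a single column.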


### Naive Bayes

Using a random grid search over the Laplace smoothing parameter for hyperparameter selection.

In [55]:
from h2o.estimators.naive_bayes import H2ONaiveBayesEstimator

In [56]:
## Naive Bayes with random hyperparameter search

# define random grid search parameters
hyper_parameters = {# try more parameters later
                    'laplace':list(range(0,10,1))}

# define search strategy
search_criteria = {'strategy':'RandomDiscrete',
                   'max_models':60,
                   'max_runtime_secs':600,
                   'seed': 12345}

# initialize grid search
nb = H2OGridSearch(H2ONaiveBayesEstimator,
                   hyper_params=hyper_parameters,
                   search_criteria=search_criteria)

# execute training w/ grid search
nb.train(x=X,
         y=y,
         training_frame=training,
         validation_frame=validation)

# view detailed results at http://localhost:54321/flow/index.html

naivebayes Grid Build progress: |█████████████████████████████████████████| 100%
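
As a rough illustration of what a `RandomDiscrete` strategy does with this search space, the sketch below visits candidate `laplace` values in random order until a model budget is hit. It is a plain-Python stand-in and not H2O's implementation: `random_discrete` is a hypothetical helper, and the order in which values are drawn will not match H2O's.

```python
import random

def random_discrete(candidates, max_models, seed):
    """Visit candidate values in random order, without repeats, up to max_models."""
    rng = random.Random(seed)
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    return shuffled[:max_models]

laplace_values = list(range(0, 10, 1))  # same search space as the grid above
tried = random_discrete(laplace_values, max_models=60, seed=12345)
print(len(tried))  # 10: the budget of 60 exceeds the 10 available values
```

Because only ten values are available, the budget of 60 models is never reached and every value is tried, which is consistent with the ten models listed in the grid results.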


In [57]:
# show grid search results
nb.show()

# select best model (lowest log loss first)
nb_model = nb.get_grid(sort_by='logloss', decreasing=False)[0]

# print model information
nb_model

    laplace  \
0       4.0   
1       5.0   
2       3.0   
3       6.0   
4       7.0   
5       2.0   
6       8.0   
7       9.0   
8       1.0   
9       0.0   

                                                             model_ids  \
0  Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_10   
1   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_2   
2   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_6   
3   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_4   
4   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_8   
5   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_1   
6   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_5   
7   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_7   
8   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_3   
9   Grid_NaiveBayes_py_6_sid_ad1d_model_python_1555600338471_6_model_9   

              loglo

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
0.0,1378.0,0.0,304.0,0.0,0.0,172.0,0.0,0.0,61.0,0.0,0.0,0.0,1817.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"3,732 / 3,732"
0.0,118328.0,0.0,16593.0,0.0,24.0,8608.0,0.0,0.0,5404.0,0.0,0.0,0.0,161458.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.6188083,"192,088 / 310,416"
0.0,16.0,0.0,7.0,0.0,0.0,2.0,0.0,0.0,5.0,0.0,0.0,0.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,72 / 72
0.0,28201.0,0.0,10892.0,0.0,10.0,3009.0,0.0,0.0,960.0,0.0,0.0,0.0,46535.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8784470,"78,715 / 89,607"
0.0,5885.0,0.0,1719.0,0.0,1.0,785.0,0.0,0.0,443.0,0.0,0.0,0.0,13840.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,"22,674 / 22,674"
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,6259.0,0.0,921.0,0.0,0.0,190.0,0.0,0.0,361.0,0.0,0.0,0.0,9749.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"17,480 / 17,480"
0.0,6180.0,0.0,1332.0,0.0,1.0,776.0,0.0,0.0,563.0,0.0,0.0,0.0,9007.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.9999440,"17,859 / 17,860"
0.0,1218.0,0.0,295.0,0.0,1.0,179.0,0.0,0.0,134.0,0.0,0.0,0.0,3037.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"4,864 / 4,864"



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


0,1
k,hit_ratio
1,0.2923894
2,0.4941263
3,0.6366082
4,0.7566783
5,0.8512411
6,0.8957037
7,0.9219677
8,0.9413589
9,0.9562221



ModelMetricsMultinomial: naivebayes
** Reported on validation data. **

MSE: 0.6765688237320546
RMSE: 0.8225380378633286
LogLoss: 2.0546927505037793
Mean Per-Class Error: 0.9399998008511768
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
Arson,Assault Offenses,Bribery,Burglary/Breaking & Entering,Counterfeiting/Forgery,Destruction/Damage/Vandalism of Property,Drug/Narcotic Offenses,Embezzlement,Extortion/Blackmail,Fraud Offenses,Gambling Offenses,Homicide Offenses,Kidnapping/Abduction,Larceny/Theft Offenses,Motor Vehicle Theft,Pornography/Obscene Material,Prostitution Offenses,Robbery,Sex Offenses,Stolen Property Offenses,Weapon Law Violations,Error,Rate
0.0,984.0,0.0,223.0,0.0,0.0,111.0,0.0,0.0,44.0,0.0,0.0,0.0,1294.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"2,656 / 2,656"
0.0,88202.0,0.0,12429.0,0.0,27.0,6539.0,0.0,0.0,4115.0,0.0,0.0,0.0,121007.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.6203426,"144,118 / 232,320"
0.0,20.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,2.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,65 / 65
0.0,21194.0,0.0,8030.0,0.0,9.0,2308.0,0.0,0.0,710.0,0.0,0.0,0.0,34913.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.8804419,"59,134 / 67,164"
0.0,4407.0,0.0,1307.0,0.0,1.0,584.0,0.0,0.0,328.0,0.0,0.0,0.0,10226.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"16,853 / 16,853"
---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---,---
0.0,4880.0,0.0,719.0,0.0,0.0,154.0,0.0,0.0,261.0,0.0,0.0,0.0,7386.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"13,400 / 13,400"
0.0,4545.0,0.0,1027.0,0.0,2.0,581.0,0.0,0.0,470.0,0.0,0.0,0.0,6550.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"13,175 / 13,175"
0.0,947.0,0.0,273.0,0.0,1.0,149.0,0.0,0.0,80.0,0.0,0.0,0.0,2140.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,"3,590 / 3,590"



See the whole table with table.as_data_frame()
Top-10 Hit Ratios: 


0,1
k,hit_ratio
1,0.2920437
2,0.4938559
3,0.6366423
4,0.7562190
5,0.8515132
6,0.895606
7,0.9216209
8,0.9407595
9,0.955977
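
The hit ratio at k is the fraction of rows whose true class is among the k classes with the highest predicted probability. A minimal sketch with made-up probabilities follows; the helper `hit_ratio` and the data are illustrative only.

```python
def hit_ratio(true_labels, predicted_probs, k):
    """Fraction of rows whose true class is among the k most probable classes."""
    hits = 0
    for label, probs in zip(true_labels, predicted_probs):
        # Class indices sorted from most to least probable
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Three toy rows over four classes
probs = [
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
labels = [0, 0, 1]
print([hit_ratio(labels, probs, k) for k in (1, 2, 3)])
```

Read this way, the k = 1 row is ordinary accuracy, so about 29% of validation rows have their true offense category ranked first, and the ratio can only grow as k increases.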




In [58]:
# measure naive bayes Logloss
nb_train = nb_model.logloss(train=True)
nb_validation = nb_model.logloss(valid=True)
nb_test = nb_model.model_performance(test_data=test).logloss()
print(nb_train)
print(nb_validation)
print(nb_test)

2.052907390701993
2.0546927505037793
2.0565497794435688
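
The `Error` column in the validation confusion matrix above is the share of a class's rows that were assigned to any other class. The sketch below recomputes it for the second row shown (232,320 actual rows), taking the second column as that row's diagonal entry; `row_error` is an illustrative helper, not an H2O function.

```python
def row_error(row_counts, true_index):
    """Share of a class's rows that were predicted as some other class."""
    total = sum(row_counts)
    return (total - row_counts[true_index]) / total

# Second row of the Naive Bayes validation confusion matrix (21 predicted classes)
assault_row = [0, 88202, 0, 12429, 0, 27, 6539, 0, 0, 4115, 0,
               0, 0, 121007, 0, 0, 0, 0, 1, 0, 0]
error = row_error(assault_row, true_index=1)
print(round(error, 7))  # 0.6203426, matching the Error column
```

Rows with an error of 1.0 are classes the model never predicts correctly, which is why the mean per-class error is so much higher than the overall classification error.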


### Results

View and compare model log loss metrics across the training, validation, and test sets.

In [61]:
logloss = pd.DataFrame(np.array([[rf_gsearch_train, rf_gsearch_validation, rf_gsearch_test],
                                 [gbm_train, gbm_validation, gbm_test],
                                 [nn_gsearch_train, nn_gsearch_validation, nn_gsearch_test],
                                 [nb_train, nb_validation, nb_test]]),
                       columns=['Train', 'Validation', 'Test'],
                       index=['Random Forest', 'Gradient Boosting Machine', 'Neural Network', 'Naive Bayes'])

In [62]:
logloss

Model,Train,Validation,Test
Random Forest,1.934461,1.931807,1.933133
Gradient Boosting Machine,1.891703,1.925746,1.927138
Neural Network,2.12425,2.113443,2.114162
Naive Bayes,2.052907,2.054693,2.05655
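
A quick way to read the table is to rank the models by held-out log loss. The sketch below copies the rounded values from the table into a plain dictionary, rather than reusing the `logloss` DataFrame, so that it runs on its own.

```python
# (train, validation, test) log loss, copied from the table above
results = {
    "Random Forest": (1.934461, 1.931807, 1.933133),
    "Gradient Boosting Machine": (1.891703, 1.925746, 1.927138),
    "Neural Network": (2.12425, 2.113443, 2.114162),
    "Naive Bayes": (2.052907, 2.054693, 2.05655),
}

# Lower log loss is better, so sort ascending by the test column
ranking = sorted(results, key=lambda name: results[name][2])
best = ranking[0]
print(ranking)
print(best)
```

With the DataFrame itself, `logloss['Test'].idxmin()` should give the same answer.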


The Gradient Boosting Machine (GBM) is our best model: it has the lowest log loss on the training, validation, and test sets (test log loss 1.927138). To export this model, open the H2O Flow page (http://localhost:54321/) BEFORE running the shutdown command below. In the top navigation bar, click "Model" > "Export Model". In the form that appears, select the model to export from the dropdown menu and, under "Path", enter the location on your computer where the model should be saved **followed by** the model name. Example: "C:/Documents/GBM_Model"
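
As an alternative to the Flow interface, the H2O Python client provides `h2o.save_model`. The sketch below is an example under stated assumptions, not the procedure used for this project: `export_model` is a hypothetical wrapper, and the target directory is whatever the caller chooses. It must be called while the cluster is still running.

```python
def export_model(model, directory):
    """Save a trained H2O model under `directory` and return the saved file's path."""
    # Imported here so the helper can be defined before h2o is loaded;
    # in this notebook h2o is already imported and initialized.
    import h2o
    # force=True overwrites an existing file with the same model id
    return h2o.save_model(model=model, path=directory, force=True)
```

For example, `export_model(gbm_model, "./models")`, where `gbm_model` is a placeholder for the notebook's trained GBM object, would write the model under `./models` and return the full path.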

In [35]:
h2o.cluster().shutdown(prompt=False)

H2O session _sid_ad28 closed.
