# Hail Model Notebook

The hail model notebook performed rudimentary feature engineering from the cleaned dataset before submission to Darwin for automatic processing. The order of operations was:

1) Import Data and Libraries

2) Feature Engineering

3) Uploading and Cleaning the data in Darwin

4) Training a Model in Darwin

5) Analyzing the Model in Darwin

6) Testing a model with Darwin

7) Analyzing the model using the results of the test data

## Import Data and Libraries

Import the required libraries, start the darwin session, and upload the dataset from the cwd.

In [1]:
import pandas as pd
import datetime
from datetime import timedelta
import numpy as np
import sklearn.metrics

#displays all datasets' columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


from amb_sdk.sdk import DarwinSdk
s = DarwinSdk()
s.set_url('https://amb-demo-api.sparkcognition.com/v1/')
s.auth_login_user('--User Name--','--Password--')

(True,
 'Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJleHAiOjE1NTYwODQwOTMsImlhdCI6MTU1NjA3Njg5MywibmJmIjoxNTU2MDc2ODkzLCJqdGkiOiI0Y2VmOWYwZi1lZDExLTQ2ZWYtODMxMC1mN2UyM2FjZTQzYmYiLCJpZGVudGl0eSI6ImEzZjlkNjBhLTRmMzgtMTFlOS1iNTZiLTFiYmE4ZjhlNzJhZiIsImZyZXNoIjpmYWxzZSwidHlwZSI6ImFjY2VzcyJ9.1AE98f7ubYPgRIMtDDUu_8Ikktbh3E5eztiO0ozEP1U')

In [2]:
data = pd.read_csv('clean_storm_dataset.csv') # path needs to be set to wherever the clean data is, if not in the cwd
data.head()

Unnamed: 0,STATE,YEAR,MONTH_NAME,EVENT_TYPE,CZ_NAME,BEGIN_DATE_TIME,DAMAGE_PROPERTY,SOURCE,MAGNITUDE,MAGNITUDE_TYPE,FLOOD_CAUSE,TOR_F_SCALE,TOR_LENGTH,TOR_WIDTH,BEGIN_LOCATION,END_LOCATION,BEGIN_LAT,BEGIN_LON,END_LAT,END_LON,EPISODE_NARRATIVE,EVENT_NARRATIVE,DURATION_seconds,WIND_SPEED,HAIL_SIZE
0,ARKANSAS,2008,February,Hail,SCOTT,05-FEB-08 16:15:00,0.0,Law Enforcement,1.75,,,,,,HON,HON,34.93,-94.18,34.93,-94.18,"Early on the 5th, a strong storm system approa...",,0.0,,1.75
1,ARKANSAS,2008,January,Thunderstorm Wind,MONROE,08-JAN-08 13:20:00,0.0,Law Enforcement,50.0,EG,,,,,HOLLY GROVE,HOLLY GROVE,34.6,-91.2,34.6,-91.2,Severe thunderstorms affected a large part of ...,Trees and power lines were blown down.,0.0,50.0,
2,ARIZONA,2008,January,Flood,PIMA,28-JAN-08 03:00:00,0.0,Newspaper,,,Heavy Rain,,,,CASCABEL,CASCABEL,32.375,-111.0101,32.3691,-111.0156,A trough of low pressure off the Western U.S. ...,A swift water rescue occurred about 4 am at th...,7200.0,,
3,ILLINOIS,2008,December,Thunderstorm Wind,IROQUOIS,27-DEC-08 14:04:00,0.0,Public,65.0,EG,,,,,ASHKUM,ASHKUM,40.88,-87.95,40.88,-87.95,Heavy rain fell across northern Illinois durin...,A farmer reported buildings and vehicles moved...,0.0,65.0,
4,LAKE MICHIGAN,2008,December,Marine Thunderstorm Wind,GARY TO BURNS HARBOR IN,27-DEC-08 15:20:00,0.0,C-MAN Station,39.0,MG,,,,,BURNS HARBOR,BURNS HARBOR,41.647,-87.147,41.647,-87.147,Strong thunderstorms moved across parts of far...,,0.0,39.0,


## Feature Engineering

The property damage values were binned to reflect the rounding of the inputs and allow the problem to be solved as a labeling instead of regression problem.

In [3]:
#bin the property damages
bins = [0, 1000, 10000, 50000, 200000, 1000000, 1000000000,1000000000000]
labels = [str('0 to 1k'), str('1k to 10k'),str('10k to 50k'),str('50k to 200k'),
          str('200k to 1M'),str('1M to 1B'), str('greater than 1B')]

data['BINNED_PROPERTY_DAMAGE'] = pd.cut(data['DAMAGE_PROPERTY'],include_lowest=True, bins=bins, labels=labels)
hail_data = data.loc[(data['EVENT_TYPE'] == 'Hail') | (data['EVENT_TYPE'] == 'Marine Hail')]

#divide the data into 12 distinct groups by event type:
print(pd.value_counts(hail_data.BINNED_PROPERTY_DAMAGE))

0 to 1k            121072
1k to 10k            4463
10k to 50k           2406
50k to 200k           774
200k to 1M            585
1M to 1B              366
greater than 1B         3
Name: BINNED_PROPERTY_DAMAGE, dtype: int64


Select the subset of data that is applicable only to this model, as defined by the breakdown under the explore_clean_engineer notebook.

In [4]:
hail_data = data.loc[(data['EVENT_TYPE'] == 'Hail') | (data['EVENT_TYPE'] == 'Marine Hail')]

Drop unnecessary columns. The `DAMAGE_PROPERTY` column is redundant with the binned values, and the other values are not populated in the subset.

In [5]:
# Drop data that we don't want included in the model
hail_data = hail_data.drop('DAMAGE_PROPERTY', axis = 1)

# Drap data irrevelent to the Wind Model
hail_data = hail_data.drop(['WIND_SPEED','MAGNITUDE_TYPE','FLOOD_CAUSE','TOR_F_SCALE', 'TOR_LENGTH', 'TOR_WIDTH','HAIL_SIZE'], axis = 1)
hail_data.head(50)

Unnamed: 0,STATE,YEAR,MONTH_NAME,EVENT_TYPE,CZ_NAME,BEGIN_DATE_TIME,SOURCE,MAGNITUDE,BEGIN_LOCATION,END_LOCATION,BEGIN_LAT,BEGIN_LON,END_LAT,END_LON,EPISODE_NARRATIVE,EVENT_NARRATIVE,DURATION_seconds,BINNED_PROPERTY_DAMAGE
0,ARKANSAS,2008,February,Hail,SCOTT,05-FEB-08 16:15:00,Law Enforcement,1.75,HON,HON,34.93,-94.18,34.93,-94.18,"Early on the 5th, a strong storm system approa...",,0.0,0 to 1k
8,SOUTH CAROLINA,2008,May,Hail,CHESTERFIELD,05-MAY-08 18:30:00,Public,0.88,CHERAW MUNI ARPT,CHERAW MUNI ARPT,34.72,-79.97,34.72,-79.97,A cluster of pulse storms moved through the Mi...,Nickel size hail was reported near the Cheraw ...,0.0,0 to 1k
9,SOUTH CAROLINA,2008,May,Hail,SUMTER,05-MAY-08 14:15:00,Law Enforcement,1.75,CLAREMONT,CLAREMONT,33.946,-80.628,33.946,-80.628,A cluster of pulse storms moved through the Mi...,The Highway Patrol reported golf ball size hai...,0.0,0 to 1k
10,GEORGIA,2008,May,Hail,LINCOLN,10-MAY-08 04:30:00,Law Enforcement,0.88,LINCOLNTON,LINCOLNTON,33.78,-82.48,33.78,-82.48,Cluster pulse storms moved through the CSRA an...,Sheriff reported nickel size hail in Lincolnton.,0.0,0 to 1k
11,GEORGIA,2008,May,Hail,BURKE,10-MAY-08 19:47:00,Fire Department/Rescue,1.0,WAYNESBORO,WAYNESBORO,33.08,-82.1927,33.08,-82.1582,Cluster pulse storms moved through the CSRA an...,Fire department reported quarter size hail.,720.0,0 to 1k
12,SOUTH CAROLINA,2008,May,Hail,MCCORMICK,10-MAY-08 04:42:00,Law Enforcement,0.88,MERIWETHER,MERIWETHER,33.65,-82.17,33.65,-82.17,Clusters of pulse storms moved across the CSRA...,Sheriff reported nickel size hail.,0.0,0 to 1k
14,SOUTH CAROLINA,2008,May,Hail,EDGEFIELD,10-MAY-08 05:12:00,Amateur Radio,0.88,MORGANA,MORGANA,33.59,-82.02,33.59,-82.02,Clusters of pulse storms moved across the CSRA...,SKYWARN HAM radio operator reported nickel siz...,0.0,0 to 1k
15,SOUTH CAROLINA,2008,May,Hail,AIKEN,10-MAY-08 05:30:00,Amateur Radio,1.75,AIKEN,AIKEN,33.5211,-81.72,33.55,-81.72,Clusters of pulse storms moved across the CSRA...,SKYWARN HAM radio spotters reported quarter to...,900.0,0 to 1k
16,TEXAS,2008,April,Hail,BELL,25-APR-08 17:49:00,Amateur Radio,1.75,TAYLOR VLY,TAYLOR VLY,31.03,-97.42,31.03,-97.42,A strong line of storms as well as several dis...,Golfball-size hail was reported between Belton...,0.0,1k to 10k
17,TEXAS,2008,April,Hail,BELL,25-APR-08 17:58:00,Amateur Radio,1.75,LITTLE RIVER - ACADEMY,LITTLE RIVER - ACADEMY,30.9864,-97.3554,30.9864,-97.3554,A strong line of storms as well as several dis...,,0.0,1k to 10k


## Uploading Information to Darwin

The data is sampled to extract a test dataset. The test and train datasets are then saved to CSV files under the 'event_subsets' folder and uploaded to Darwin. Darwin is then ordered to start cleaning the data.

In [7]:
# Take out a 500 row subset of test data
hail_data_test=hail_data.sample(1500)
hail_data_train=hail_data.drop(hail_data_test.index)

# convert the test data and the main data to a csv
hail_data_train.to_csv('event_subsets/hail_data_train.csv')
hail_data_test.to_csv('event_subsets/hail_data_test.csv')

# Upload the train to Darwin
s.upload_dataset('event_subsets/hail_data_train.csv', 'hail_data_train')
s.upload_dataset('event_subsets/hail_data_test.csv', 'hail_data_test')

# Clean the training dataset using Darwin
s.clean_data('hail_data_train',target = 'BINNED_PROPERTY_DAMAGE')

(True,
 {'job_name': 'c0a5a4bb5c504d9cbf1666b9f7ee6299',
  'artifact_name': 'bd8ba0443f7e44868a4de710d6b44919'})

In [8]:
s.wait_for_job('f7ee1fc7e5c24158b686fe57389cfc17')

{'status': 'Complete', 'starttime': '2019-04-22T18:21:55.870332', 'endtime': '2019-04-22T18:25:16.682359', 'percent_complete': 100, 'job_type': 'CleanDataTiny', 'loss': None, 'generations': None, 'dataset_names': ['hail_data_train'], 'artifact_names': ['b45f3718c4e14ac0ba5f13f0678378ff'], 'model_name': None, 'job_error': ''}


(True, 'Job completed')

We lookup the datasets to confirm that they have been properly uploaded.

In [73]:
s.lookup_dataset()

(True,
 [{'name': 'hr_train',
   'mbytes': 0.6898021697998047,
   'minimum_recommeded_train_time': '5 minutes',
   'updated_at': '2019-04-20T23:56:19.042932',
   'categorical': None,
   'sequential': None,
   'imbalanced': None},
  {'name': 'hr_test',
   'mbytes': 0.17252063751220703,
   'minimum_recommeded_train_time': '5 minutes',
   'updated_at': '2019-04-20T23:56:19.405261',
   'categorical': None,
   'sequential': None,
   'imbalanced': None},
  {'name': 'flood_data_train',
   'mbytes': 26.48654079437256,
   'minimum_recommeded_train_time': '5 minutes',
   'updated_at': '2019-04-22T11:50:56.337543',
   'categorical': None,
   'sequential': None,
   'imbalanced': None},
  {'name': 'hail_data_train',
   'mbytes': 74.30458927154541,
   'minimum_recommeded_train_time': '5 minutes',
   'updated_at': '2019-04-22T18:21:54.373038',
   'categorical': None,
   'sequential': None,
   'imbalanced': None},
  {'name': 'hail_data_test',
   'mbytes': 0.9263954162597656,
   'minimum_recommeded_tra

## Creation of the Model

Next, the Darwin is ordered to create a model based on the training dataset.

In [74]:
print(s.download_dataset('hail_data_test')[1]['filename'])

(True,
 {'filename': 'C:\\Users\\freeb\\AppData\\Local\\Temp\\hail_data_test-part0-_4hxk_5c.csv',
  'part': 0,
  'note': 'part 0 of 0'})

In [75]:
print('C:\\Users\\freeb\\AppData\\Local\\Temp\\hail_data_test-part0-_4hxk_5c.csv')

C:\Users\freeb\AppData\Local\Temp\hail_data_test-part0-_4hxk_5c.csv


In [77]:
# Create the model
s.delete_model('hail_model')
s.create_model(dataset_names='hail_data_train', model_name='hail_model', max_train_time = '00:30')

(True,
 {'job_name': 'b4a747d0bc2641b384f80c95af86682d',
  'job_id': '5c98b9b4-6556-11e9-880e-7bf2e0f0470e',
  'model_name': 'hail_model'})

In [82]:
s.wait_for_job('b4a747d0bc2641b384f80c95af86682d')

{'status': 'Complete', 'starttime': '2019-04-22T18:28:40.660924', 'endtime': '2019-04-22T19:07:37.443729', 'percent_complete': 100, 'job_type': 'TrainModel', 'loss': 1.249639630317688, 'generations': 8, 'dataset_names': ['hail_data_train'], 'artifact_names': None, 'model_name': 'hail_model', 'job_error': ''}


(True, 'Job completed')

We lookup the model to confirm that the model has been properly constructed.

In [83]:
s.lookup_model()

(True,
 [{'id': '5301ba46-63f2-11e9-8c21-3b32e5393f40',
   'name': 'heavy_rain_model_20190421000001',
   'type': None,
   'problem_type': None,
   'updated_at': None,
   'trained_on': [],
   'generations': 0,
   'loss': None,
   'complete': False,
   'parameters': {'train_time': '00:05'},
   'description': None,
   'train_time_seconds': 0,
   'algorithm': None,
   'running_job_id': None},
  {'id': '5c98461e-6556-11e9-880e-57541405989a',
   'name': 'hail_model',
   'type': 'Supervised',
   'problem_type': None,
   'updated_at': '2019-04-22T19:07:37.419949',
   'trained_on': ['hail_data_train'],
   'generations': 8,
   'loss': 1.249639630317688,
   'complete': True,
   'parameters': {'train_time': '00:30',
    'target': 'BINNED_PROPERTY_DAMAGE',
    'recurrent': True,
    'max_unique_values': 50,
    'max_int_uniques': 15,
    'impute': 'ffill',
    'big_data': False},
   'description': {'best_genome': {'type': 'RandomForestClassifier',
     'parameters': {'bootstrap': True,
      'crite

We began to clean test data using the cleaning method applied to the model which was just created.

In [84]:
s.clean_data('hail_data_test',target = 'BINNED_PROPERTY_DAMAGE', model_name = 'hail_model')

(True,
 {'job_name': 'b544cddd9adb4ca1b107e78a7411223a',
  'artifact_name': '7123e897eba34e4f91dea165ccae9fd6'})

In [85]:
s.wait_for_job('b544cddd9adb4ca1b107e78a7411223a')

{'status': 'Running', 'starttime': '2019-04-22T19:12:15.64045', 'endtime': None, 'percent_complete': 0, 'job_type': 'CleanDataTiny', 'loss': None, 'generations': None, 'dataset_names': ['hail_data_test'], 'artifact_names': ['7123e897eba34e4f91dea165ccae9fd6'], 'model_name': None, 'job_error': ''}
{'status': 'Complete', 'starttime': '2019-04-22T19:12:15.64045', 'endtime': '2019-04-22T19:12:23.05567', 'percent_complete': 100, 'job_type': 'CleanDataTiny', 'loss': None, 'generations': None, 'dataset_names': ['hail_data_test'], 'artifact_names': ['7123e897eba34e4f91dea165ccae9fd6'], 'model_name': None, 'job_error': ''}


(True, 'Job completed')

## Analysis of the Model

Concurrently, the exisisting model was analyzed.

In [2]:
s.analyze_model('hail_model')

(True,
 {'job_name': '495274422a0242ec856d87f01e82ad24',
  'artifact_name': '44458749de5b415d82555c4a685c6dac'})

In [9]:
s.wait_for_job('495274422a0242ec856d87f01e82ad24')

{'status': 'Complete', 'starttime': '2019-04-22T22:12:02.012618', 'endtime': '2019-04-22T22:15:39.282038', 'percent_complete': 100, 'job_type': 'AnalyzeModel', 'loss': 1.249639630317688, 'generations': 8, 'dataset_names': None, 'artifact_names': ['44458749de5b415d82555c4a685c6dac'], 'model_name': 'hail_model', 'job_error': ''}


(True, 'Job completed')

The retrieved artifact indicated in descending order of importance the properties most important to the model.

In [7]:
results = s.download_artifact('44458749de5b415d82555c4a685c6dac')
results[1]

MAGNITUDE                             1.869166e-01
BEGIN_LON                             1.859738e-01
END_LON                               1.606392e-01
BEGIN_LAT                             1.323900e-01
END_LAT                               1.310332e-01
Unnamed: 0                            7.057317e-02
DURATION_seconds                      2.732105e-02
YEAR = 2014                           7.610505e-03
MONTH_NAME = June                     6.031469e-03
MONTH_NAME = May                      5.935985e-03
SOURCE = Public                       5.617114e-03
YEAR = 2011                           5.088064e-03
YEAR = 2013                           4.965488e-03
SOURCE = Amateur Radio                4.795108e-03
YEAR = 2015                           4.670378e-03
MONTH_NAME = March                    4.576735e-03
YEAR = 2012                           4.449586e-03
SOURCE = Trained Spotter              3.664951e-03
YEAR = 2009                           3.496391e-03
SOURCE = Emergency Manager     

The model's parameters were also analyzed.

In [33]:
s.display_population('wind_model')

(True,
 {'population': {'model_types': {'DeepNeuralNetwork': {'model_description': [{'layer 1': {'type': 'LSTMGene',
        'parameters': {'activation': 'relu', 'numunits': 70, 'numlayers': 1}}},
      {'layer 2': {'type': 'LinearGene',
        'parameters': {'activation': 'relu', 'numunits': 19}}}],
     'loss_function': 'CrossEntropy',
     'fitness': 1.6725940973962325},
    'RandomForest': {'model_description': {'type': 'RandomForestClassifier',
      'parameters': {'bootstrap': False,
       'criterion': 'gini',
       'max_depth': 7,
       'max_features': 0.8507892505977884,
       'max_leaf_nodes': None,
       'min_impurity_decrease': 0.0,
       'min_samples_leaf': 1,
       'min_samples_split': 16,
       'n_jobs': -1,
       'min_weight_fraction_leaf': 0.0,
       'n_estimators': 65}},
     'loss_function': 'CrossEntropy',
     'fitness': 1.6787765860710828},
    'GradientBoosted': {'model_description': {'type': 'XGBClassifier',
      'parameters': {'base_score': 0.5,
    

## Analysis of the Test Results

Finally, the test data property damage was analyzed using the model and cleaned training data.

In [94]:
s.run_model('hail_data_test','hail_model')

(True,
 {'job_name': 'ec27c1b0ddbc4fa9bc984b19e94239d7',
  'artifact_name': '839d1d5494004dceb0eeef2a68f0cc6c'})

In [10]:
s.wait_for_job('ec27c1b0ddbc4fa9bc984b19e94239d7')

{'status': 'Complete', 'starttime': '2019-04-22T19:20:43.447575', 'endtime': '2019-04-22T19:21:30.114283', 'percent_complete': 100, 'job_type': 'RunModel', 'loss': 1.249639630317688, 'generations': 8, 'dataset_names': ['hail_data_test'], 'artifact_names': ['839d1d5494004dceb0eeef2a68f0cc6c'], 'model_name': 'hail_model', 'job_error': ''}


(True, 'Job completed')

In [96]:
s.download_artifact('839d1d5494004dceb0eeef2a68f0cc6c','C:\\Users\\freeb\\Documents\\GitHub\\storm_analytics')

(True,
 {'filename': 'C:\\Users\\freeb\\Documents\\GitHub\\storm_analytics\\artifact.csv'})

The resulting artifact determined that the model was over 99% accurate, but was actually mostly missing the rare class, as demonstrated by f1 scores of 0 for every category except the majority class.

In [100]:
test_results = pd.read_csv('artifact.csv')
test_actual = pd.read_csv('event_subsets\\hail_data_test.csv')
correct_count = 0
incorrect_count = 0

for i, row in test_results.iterrows():
    j = 0
    if test_results.at[j,'BINNED_PROPERTY_DAMAGE'] == row['BINNED_PROPERTY_DAMAGE']:
        correct_count += 1
    else:
        incorrect_count+=1
    j+=1
        
print(correct_count)
print(incorrect_count)
print('Accuracy: ' + str(correct_count/1500))

#sklearn.metrics.precision_recall_fscore_support(test_results,test_actual)

y_pred = test_results['BINNED_PROPERTY_DAMAGE']
y_actual = test_actual['BINNED_PROPERTY_DAMAGE']

precision, recall, fscore, support = sklearn.metrics.precision_recall_fscore_support(y_pred,y_actual)

#labels = [str('0 to 2.5k'), str('2.5k to 5k'),str('5k to 10k'),str('10k to 20k'),str('20k to 30k'),str('30k to 40k'),
#          str('40k to 50k'),str('50k to 75k'),str('75k to 100k'),str('100k to 200k'),str('200k to 300k'),
#          str('300k to 400k'),str('400k to 500k'),str('500k to 600k'),str('600k to 700k'),str('700k to 800k'),
#          str('800k to 900k'),str('900k to 1B'),str('greater than 1B')]

results = pd.DataFrame({'precision': precision, 'Recall' : recall, 'fscore': fscore, 'support' : support})
results

1454
46
Accuracy: 2.908
19
7


Unnamed: 0,precision,Recall,fscore,support
0,0.992063,0.945667,0.96831,1454
1,0.108108,0.666667,0.186047,6
2,0.0,0.0,0.0,0
3,0.321429,0.473684,0.382979,38
4,0.0,0.0,0.0,2
5,0.0,0.0,0.0,0
6,0.0,0.0,0.0,0
