# RI Prediction Model

### Prediction of Hurricane Rapid Intensification

Project Summary: Storms that undergo rapid intensification (RI) are associated with the highest forecat errors and larger economic losses.
    To reduce the damage caused by the hurricane, accurate prediction of hurricane intensity is critical. This project aims to improve previous models that predict whether or not hurricane will experience RI within 24 hours.
    
Project Goal: Finding a model that effectively predicts whether the hurricane will or will not experience a Rapid Intensification (RI)

In [210]:
# Python code to build Machine Learning model for hurricane intensity forecast  
import pandas as pd # For data manipulation and analysis
pd.set_option('display.max_columns', 500)
import numpy as np # For scientific computing

#Machine Learning Tools
import sklearn # For machine learning library
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

#from sklearn.ensemble import ExtraTreesClassifier    # Extra tree classifier
from sklearn.metrics import confusion_matrix # Compute confusion matrix to evaluate the accuracy of a classification.
from sklearn.metrics import brier_score_loss  # Compute the Brier score
from sklearn import metrics
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import brier_score_loss  # Compute the Brier score


import matplotlib.pyplot as plt  #plotting library


from imblearn.over_sampling import SMOTE #SMOTE oversampling

#Displaying More Rows
pd.options.display.max_rows = 500

## Importing Data and Data Inspection

Let's take a look closer at each data set. Few of the data characteristics that we are looking for:
- column names
- data type
- number of rows
- missing values (null)
- plausible features that could impact survival rate
- target column

In [211]:
# Set up the location of the data
fname='Dataset_SHIPS_RII_ATL.csv'
#fname='Dataset_SHIPS_RII_EPAC.csv'

# Read SHIPS data
ships = pd.read_csv(fname)
ships

Unnamed: 0,NAME,DATE,HOUR,VMX0,LAT,LON,MSLP,ID,DELV12,DELV24,DELV36,DELV48,PER,SHRD,D200,RHLO,PX30,SDBT,POT,OHC,TPW,PC2,U200,TPWC,AVBT,RSST
0,ALEX,980727,12,25,11.3,-25.4,1009,AL011998,0,5,10,10,9999,6.3,103,68,72,13.8,-101,12,0,-58,-7.9,55.7,-473,27.4
1,ALEX,980727,18,25,11.7,-27.2,1009,AL011998,0,5,10,10,9999,11.2,118,69,55,12.6,-102,17,0,-10,-6.4,55.7,-360,27.4
2,ALEX,980728,0,25,12.2,-29.2,1009,AL011998,5,10,10,10,0,8.6,116,71,70,12.8,-105,21,0,-3,-8.8,56.9,-381,27.4
3,ALEX,980728,6,25,12.6,-31.3,1008,AL011998,5,10,10,15,0,12.2,91,71,57,12.2,-100,29,0,-44,-6.0,52.7,-481,27.2
4,ALEX,980728,12,30,12.9,-33.3,1007,AL011998,5,5,5,10,5,10.5,88,71,83,10.1,-89,15,190,-46,-6.5,55.3,-516,27.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7400,RINA,171108,0,45,34.6,-48.7,999,AL192017,0,0,9999,9999,5,20.5,29,71,31,24.3,-37,0,1000,123,22.0,42.9,-35,23.5
7401,RINA,171108,6,50,36.4,-48.7,996,AL192017,-5,9999,9999,9999,10,20.8,19,73,15,14.7,-28,0,969,90,21.8,43.0,-121,22.9
7402,RINA,171108,12,45,38.3,-48.8,994,AL192017,0,9999,9999,9999,0,21.2,12,74,9,13.5,-29,0,966,126,23.0,43.0,-38,22.7
7403,RINA,171108,18,45,40.1,-49.0,992,AL192017,9999,9999,9999,9999,-5,22.5,14,70,7,11.5,-14,0,948,75,28.4,44.3,-166,21.3


In [212]:
# Set all 9999s as NaNs
ships = ships.replace(9999,np.NaN)

In [213]:
# Function: Counts all null values in each columns
def count_null(data):
    null_each_col={}

    for col in data.columns:
        null_count=data[col].isnull().sum()
        null_each_col[col]=null_count
    return null_each_col

In [214]:
null_counts = count_null(ships)
null_counts

# There are a lot of null values in DELV12, DELV24, DELV36, DELV48, PER, PX30, SDBT, OHC, PC2, AVBT
# DELV12, DELV24, DELV36, DELV48 will be dropped during the data cleaning since they will not be used as the predictor.


{'NAME': 0,
 'DATE': 0,
 'HOUR': 0,
 'VMX0': 0,
 'LAT': 0,
 'LON': 0,
 'MSLP': 0,
 'ID': 0,
 'DELV12': 718,
 'DELV24': 1427,
 'DELV36': 2082,
 'DELV48': 2682,
 'PER': 1456,
 'SHRD': 0,
 'D200': 0,
 'RHLO': 0,
 'PX30': 402,
 'SDBT': 402,
 'POT': 0,
 'OHC': 19,
 'TPW': 0,
 'PC2': 387,
 'U200': 0,
 'TPWC': 0,
 'AVBT': 402,
 'RSST': 0}

## Parameters Set-up

Setting up RI threshold, train/forecast threshold, target variable/feature, and climatoloogy constant

In [215]:
# Year range for training and validating
year_train=['1998','2008']

# Year range for forecast
year_fcst=['2009','2017']

# Variable name for predictand
TargetName='DELV24'

# Threshold of Rapid Intensification 
RIValue=30

# Climatology of RI (30 kt) frequency at Atlantic basin (Kaplan et al. 2015)
clim=0.125   #ATL 30 kt
#clim=0.084   #EPAC 30 kt

## Data Cleaning

- Cleaning Date Column
- Creating Target Column 'TAR'
- Dropping all DELV Columns
- Dropping all null values

In [216]:
# Pad the date columns with 00 for the year 2000
ships['DATE'] = ships['DATE'].apply(lambda x: str(x).zfill(6))

# Extract month from date
ships['MONTH'] = ships['DATE'].apply(lambda x: str(x)[2:4])

# Extract year from date
ships['YEAR'] = ships['DATE'].apply(lambda x: ('19' + str(x)[0:2]) if (str(x)[0:1]!= '0' and str(x)[0:1]!= '1') else ('20' + str(x)[0:2]))
ships.head()

# Set the target column
ships['TAR'] = ships[TargetName].apply(lambda x: 1 if x >= RIValue else 0)

# Dropping all DELV columns
# - Dropping DELV columns before droping null values saved 1971 data points
ships=ships.drop(['DELV12', 'DELV24', 'DELV36', 'DELV48'], axis=1)

# drop NaNs
ships=ships.dropna()
ships

Unnamed: 0,NAME,DATE,HOUR,VMX0,LAT,LON,MSLP,ID,PER,SHRD,D200,RHLO,PX30,SDBT,POT,OHC,TPW,PC2,U200,TPWC,AVBT,RSST,MONTH,YEAR,TAR
2,ALEX,980728,0,25,12.2,-29.2,1009,AL011998,0.0,8.6,116,71,70.0,12.8,-105,21.0,0,-3.0,-8.8,56.9,-381.0,27.4,07,1998,0
3,ALEX,980728,6,25,12.6,-31.3,1008,AL011998,0.0,12.2,91,71,57.0,12.2,-100,29.0,0,-44.0,-6.0,52.7,-481.0,27.2,07,1998,0
4,ALEX,980728,12,30,12.9,-33.3,1007,AL011998,5.0,10.5,88,71,83.0,10.1,-89,15.0,190,-46.0,-6.5,55.3,-516.0,27.1,07,1998,0
5,ALEX,980728,18,30,13.1,-35.1,1006,AL011998,5.0,9.7,44,72,35.0,15.9,-86,22.0,15,36.0,-7.2,56.6,-270.0,27.1,07,1998,0
6,ALEX,980729,0,35,13.3,-36.8,1005,AL011998,5.0,9.9,37,74,56.0,14.5,-80,24.0,0,-6.0,-10.3,57.7,-443.0,27.1,07,1998,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7400,RINA,171108,0,45,34.6,-48.7,999,AL192017,5.0,20.5,29,71,31.0,24.3,-37,0.0,1000,123.0,22.0,42.9,-35.0,23.5,11,2017,0
7401,RINA,171108,6,50,36.4,-48.7,996,AL192017,10.0,20.8,19,73,15.0,14.7,-28,0.0,969,90.0,21.8,43.0,-121.0,22.9,11,2017,0
7402,RINA,171108,12,45,38.3,-48.8,994,AL192017,0.0,21.2,12,74,9.0,13.5,-29,0.0,966,126.0,23.0,43.0,-38.0,22.7,11,2017,0
7403,RINA,171108,18,45,40.1,-49.0,992,AL192017,-5.0,22.5,14,70,7.0,11.5,-14,0.0,948,75.0,28.4,44.3,-166.0,21.3,11,2017,0


## Creating Sets for Data and Model Hyperparameter Tuning

- Predictor Sets: Sets with different predictor features
- Data Sets:      Data Sets with different sampling methods
- Model Sets:     Different types of models 

In [217]:
#Predictor Sets

Predictor_Sets=[]

# Adding different features into the list
Predictor_Sets.append(('Set1', ['PER','SHRD','D200','TPW','PC2','SDBT','POT','OHC','VMX0']))
Predictor_Sets.append(('Set2', ['PER','SHRD','D200','TPW','PC2','POT','OHC','VMX0']))
Predictor_Sets.append(('Set3', ['PER','SHRD','D200','TPW','SDBT','POT','OHC','VMX0']))
Predictor_Sets.append(('Kaplan 2015', ['PER','SHRD','D200','RHLO','PX30','SDBT','POT','OHC']))
Predictor_Sets.append(('Kaplan 2015 Imp', ['PER','SHRD','D200','TPW','PC2','SDBT','POT','OHC','VMX0'])) #ICDA didn't exist in the dataset

In [218]:
# Data Sets

Data_Sets=[]

# Data Set 1: Diving Data Sets into Train and Forecast By Year
data_train_by_year = ships[(ships['YEAR']>=year_train[0]) & (ships['YEAR']<=year_train[1])]
data_fcst_by_year = ships[(ships['YEAR']>=year_fcst[0]) & (ships['YEAR']<=year_fcst[1])]

TAR_0 = data_train_by_year[data_train_by_year['TAR']==0]
TAR_1 = data_train_by_year[data_train_by_year['TAR']==1]

TAR_0_count_train,TAR_1_count_train=data_train_by_year['TAR'].value_counts()

# Data Set 2: Undersampling Data Set 1
### Undersampling RI=0 Rows

TAR_0_under=TAR_0.sample(TAR_1_count_train)
data_train_under = pd.concat([TAR_0_under,TAR_1], axis=0)

# data_train_under['TAR'].value_counts().plot(kind='bar', title='count (target)')

# Data Set 3: Oversampling Data Set 1
### Undersampling RI=1 Rows

TAR_1_over=TAR_1.sample(TAR_0_count_train,replace=True)
data_train_over = pd.concat([TAR_0,TAR_1_over], axis=0)

# data_train_over['TAR'].value_counts().plot(kind='bar', title='count (target)')
# plt.title('TAR Count After Oversampling')


# Data Set 4: Normalized Data Set 1

def normalization(data,column_names):
    data_nur=data[column_names]
    data_nur=(data_nur-data_nur.mean())/(data_nur.std()) #standard normal distribution
    data[column_names]=data_nur[column_names]
    return data

num_col=['VMX0','PER','SHRD','D200','RHLO','PX30','SDBT','POT','OHC','TPW','PC2','U200','TPWC','AVBT','RSST']
ships_nor=normalization(ships,num_col)

data_train_nor = ships_nor[(ships_nor['YEAR']>=year_train[0]) & (ships_nor['YEAR']<=year_train[1])]
data_fcst_nor = ships_nor[(ships_nor['YEAR']>=year_fcst[0]) & (ships_nor['YEAR']<=year_fcst[1])]

TAR_0_nor = data_train_nor[data_train_nor['TAR']==0]
TAR_1_nor = data_train_nor[data_train_nor['TAR']==1]

# Data Set 5: Undersampling Data Set 4 (Normalized)

TAR_0_under_nor=TAR_0_nor.sample(TAR_1_count_train)
data_train_under_nor = pd.concat([TAR_0_under_nor,TAR_1_nor], axis=0)

# Data Set 6: Oversampling Data Set 5 (Normalized)

TAR_1_over_nor=TAR_1_nor.sample(TAR_0_count_train,replace=True)
data_train_over_nor = pd.concat([TAR_0_nor,TAR_1_over_nor], axis=0)



# Adding created data set into a series
Data_Sets.append(('By Year', data_train_by_year,data_fcst_by_year))
Data_Sets.append(('Undersample', data_train_under,data_fcst_by_year))
Data_Sets.append(('Oversample', data_train_over,data_fcst_by_year))
Data_Sets.append(('Normalized', data_train_nor,data_fcst_nor))
Data_Sets.append(('Undersample (Normalized)', data_train_under_nor,data_fcst_by_year))
Data_Sets.append(('Oversample (Normalized)', data_train_over_nor,data_fcst_by_year))


In [219]:
# Model Sets

Models = []

# Adding different machine learning models into the list
Models.append(('Logistic_Regression', LogisticRegression(solver='liblinear', multi_class='ovr')))
Models.append(('K_Neighbors_Classifier', KNeighborsClassifier(n_neighbors=15)))
Models.append(('Decision_Tree', DecisionTreeClassifier(random_state=1,min_samples_split=10,max_depth=5)))
Models.append(('Random_Forest', RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=66, max_depth=6, min_samples_leaf=2, class_weight='balanced')))
Models.append(('Neural_Network_Logistic', MLPClassifier((10,10), activation='logistic',max_iter=3000)))
Models.append(('Neural_Network_ReLU', MLPClassifier((10,10), activation='relu',max_iter=3000)))

## Machine Learning Calculator

ML_Calc function takes in predictor,data, and model sets to predict the given target variable of the forecast data

In [220]:
def ML_Calc(predictors,data_sets,models,target):
   
    results=pd.DataFrame()

    for name_pred,features in predictors:
        for name_data,data_train,data_fcst in data_sets:
        # All predictors for training and validating
            XData_train = data_train[features]
            YData_train = data_train[target]

            # All predictors for training and validating
            XData_fcst = data_fcst[features]
            YData_fcst = data_fcst[target]
            # For loop will iterate through each models in 'models' list
            
            for name_ML, model in models:
                model.fit(XData_train,YData_train)
                prediction_train=model.predict(XData_train)
                prediction_fcst=model.predict(XData_fcst)
                
                # Performance Metrics
                cmatrix_fcst = confusion_matrix(YData_fcst, prediction_fcst)
                
                false_neg=cmatrix_fcst[1,0]
                false_pos=cmatrix_fcst[0,1]
                true_pos=cmatrix_fcst[1,1]
                true_neg=cmatrix_fcst[0,0]
                
                pss=((cmatrix_fcst[0,0] * cmatrix_fcst[1,1]) - (cmatrix_fcst[0,1] * cmatrix_fcst[1,0])) * 1.0 / ((cmatrix_fcst[1,1] + cmatrix_fcst[1,0]) * (cmatrix_fcst[0,1] + cmatrix_fcst[0,0]))
                far=(cmatrix_fcst[0,1] * 1.0) / (cmatrix_fcst[0,1] + cmatrix_fcst[1,1])
                pod=(cmatrix_fcst[1,1] * 1.0) / (cmatrix_fcst[1,0] + cmatrix_fcst[1,1]) 
                precision=precision_score(YData_fcst, prediction_fcst, average=None)[1]
                recall=recall_score(YData_fcst, prediction_fcst)
                f1=f1_score(YData_fcst, prediction_fcst)
                brier_score=brier_score_loss(YData_fcst,prediction_fcst)
                
                results=results.append({"Predictor Set":name_pred,"Data Set":name_data, 
                                        "ML Name":name_ML,"False Negative":false_neg,"False Positive":false_pos,
                                        "PSS":pss,"FAR":far,"POD":pod, "Recall":recall,"Precision":precision,"F1":f1, 
                                        "Brier Score":brier_score}, 
                                       ignore_index = True) 




    results=results[["Data Set","Predictor Set", "ML Name","False Negative",
                     "False Positive","PSS","FAR","POD","F1","Precision","Brier Score"]]
    results=results.sort_values('False Negative')
    return results


In [221]:
Result=ML_Calc(Predictor_Sets,Data_Sets,Models,'TAR')

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_pr

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [222]:
Result.sort_values('PSS',ascending=False)

Unnamed: 0,Data Set,Predictor Set,ML Name,False Negative,False Positive,PSS,FAR,POD,F1,Precision,Brier Score
84,Oversample,Set3,Logistic_Regression,32.0,533.0,0.518005,0.844691,0.753846,0.257556,0.155309,0.236402
80,Undersample,Set3,Decision_Tree,22.0,767.0,0.491389,0.876571,0.830769,0.214925,0.123429,0.330126
140,Oversample (Normalized),Kaplan 2015,Decision_Tree,14.0,908.0,0.490538,0.886719,0.892308,0.20104,0.113281,0.385774
78,Undersample,Set3,Logistic_Regression,32.0,597.0,0.489687,0.858993,0.753846,0.237576,0.141007,0.26318
44,Undersample,Set2,Decision_Tree,26.0,704.0,0.488496,0.871287,0.8,0.221748,0.128713,0.305439
152,Undersample,Kaplan 2015 Imp,Decision_Tree,26.0,704.0,0.488496,0.871287,0.8,0.221748,0.128713,0.305439
8,Undersample,Set1,Decision_Tree,26.0,704.0,0.488496,0.871287,0.8,0.221748,0.128713,0.305439
89,Oversample,Set3,Neural_Network_ReLU,44.0,413.0,0.478795,0.827655,0.661538,0.27345,0.172345,0.191213
6,Undersample,Set1,Logistic_Regression,35.0,576.0,0.475902,0.85842,0.730769,0.237203,0.14158,0.255649
150,Undersample,Kaplan 2015 Imp,Logistic_Regression,35.0,576.0,0.475902,0.85842,0.730769,0.237203,0.14158,0.255649
