# Intro

The goal of this project is to build a model that takes tropical cyclone tracking data and accurately classifies whether a reading indicates a severe Tropical Storm or a less disruptive disturbance.

## Business Case

The resulting model will be used by meteorologists to understand whether an incoming storm is a major threat to a certain area, and therefore to inform news agencies, local governments, and the public so they can prepare accordingly.

## Data Understanding

The data for this project is from the National Oceanic and Atmospheric Administration's International Best Track Archive for Climate Stewardship (IBTrACS) project. The goal of IBTrACS is to make tropical cyclone best track data available to aid understanding of the distribution, frequency, and intensity of tropical cyclones worldwide.

Because the idea is to have a global data source, this data is pulled from many source agencies worldwide, and therefore has many columns that are duplicative, inconsistent, or difficult to interpret. Throughout this analysis, I referred to the data documentation saved in this repository.

[Source](https://www.ncdc.noaa.gov/ibtracs/index.php)

I'll start by importing my data (and all my other imports) to describe it further.

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, accuracy_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier 
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

import warnings
warnings.filterwarnings('ignore')

In [58]:
# skiprows=[1] skips the second line of the file: it is a second header row (multi index), not data
df = pd.read_csv('data/ibtracs.since1980.list.v04r00.csv', dtype='object', parse_dates=True, skiprows=[1], na_values=' ')

pd.set_option('display.max_columns', None)
df.head(3)

Unnamed: 0,SID,SEASON,NUMBER,BASIN,SUBBASIN,NAME,ISO_TIME,NATURE,LAT,LON,WMO_WIND,WMO_PRES,WMO_AGENCY,TRACK_TYPE,DIST2LAND,LANDFALL,IFLAG,USA_AGENCY,USA_ATCF_ID,USA_LAT,USA_LON,USA_RECORD,USA_STATUS,USA_WIND,USA_PRES,USA_SSHS,USA_R34_NE,USA_R34_SE,USA_R34_SW,USA_R34_NW,USA_R50_NE,USA_R50_SE,USA_R50_SW,USA_R50_NW,USA_R64_NE,USA_R64_SE,USA_R64_SW,USA_R64_NW,USA_POCI,USA_ROCI,USA_RMW,USA_EYE,TOKYO_LAT,TOKYO_LON,TOKYO_GRADE,TOKYO_WIND,TOKYO_PRES,TOKYO_R50_DIR,TOKYO_R50_LONG,TOKYO_R50_SHORT,TOKYO_R30_DIR,TOKYO_R30_LONG,TOKYO_R30_SHORT,TOKYO_LAND,CMA_LAT,CMA_LON,CMA_CAT,CMA_WIND,CMA_PRES,HKO_LAT,HKO_LON,HKO_CAT,HKO_WIND,HKO_PRES,NEWDELHI_LAT,NEWDELHI_LON,NEWDELHI_GRADE,NEWDELHI_WIND,NEWDELHI_PRES,NEWDELHI_CI,NEWDELHI_DP,NEWDELHI_POCI,REUNION_LAT,REUNION_LON,REUNION_TYPE,REUNION_WIND,REUNION_PRES,REUNION_TNUM,REUNION_CI,REUNION_RMW,REUNION_R34_NE,REUNION_R34_SE,REUNION_R34_SW,REUNION_R34_NW,REUNION_R50_NE,REUNION_R50_SE,REUNION_R50_SW,REUNION_R50_NW,REUNION_R64_NE,REUNION_R64_SE,REUNION_R64_SW,REUNION_R64_NW,BOM_LAT,BOM_LON,BOM_TYPE,BOM_WIND,BOM_PRES,BOM_TNUM,BOM_CI,BOM_RMW,BOM_R34_NE,BOM_R34_SE,BOM_R34_SW,BOM_R34_NW,BOM_R50_NE,BOM_R50_SE,BOM_R50_SW,BOM_R50_NW,BOM_R64_NE,BOM_R64_SE,BOM_R64_SW,BOM_R64_NW,BOM_ROCI,BOM_POCI,BOM_EYE,BOM_POS_METHOD,BOM_PRES_METHOD,NADI_LAT,NADI_LON,NADI_CAT,NADI_WIND,NADI_PRES,WELLINGTON_LAT,WELLINGTON_LON,WELLINGTON_WIND,WELLINGTON_PRES,DS824_LAT,DS824_LON,DS824_STAGE,DS824_WIND,DS824_PRES,TD9636_LAT,TD9636_LON,TD9636_STAGE,TD9636_WIND,TD9636_PRES,TD9635_LAT,TD9635_LON,TD9635_WIND,TD9635_PRES,TD9635_ROCI,NEUMANN_LAT,NEUMANN_LON,NEUMANN_CLASS,NEUMANN_WIND,NEUMANN_PRES,MLC_LAT,MLC_LON,MLC_CLASS,MLC_WIND,MLC_PRES,USA_GUST,BOM_GUST,BOM_GUST_PER,REUNION_GUST,REUNION_GUST_PER,USA_SEAHGT,USA_SEARAD_NE,USA_SEARAD_SE,USA_SEARAD_SW,USA_SEARAD_NW,STORM_SPEED,STORM_DIR
0,1980001S13173,1980,1,SP,MM,PENI,1980-01-01 00:00:00,TS,-12.5,172.5,,,,main,647,647,O________OO_O_,jtwc_sh,SH051980,-12.5,172.5,,,25,,-1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-12.5,172.5,TC,25,,-12.5,172.5,1,25.0,,,,,,,-12.5,172.5,TC,25,,,,,,,,,,,,,,,,,6,351
1,1980001S13173,1980,1,SP,MM,PENI,1980-01-01 03:00:00,TS,-12.1927,172.441,,,,main,653,653,P________PP_P_,,SH051980,-12.1825,172.432,,,25,,-1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-12.1825,172.432,TC,25,,-12.2234,172.469,1,,,,,,,,-12.1825,172.432,TC,25,,,,,,,,,,,,,,,,,6,351
2,1980001S13173,1980,1,SP,MM,PENI,1980-01-01 06:00:00,TS,-11.9144,172.412,,,,main,670,670,O________OP_O_,jtwc_sh,SH051980,-11.9,172.4,,,25,,-1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-11.9,172.4,TC,25,,-11.9575,172.45,1,,,,,,,,-11.9,172.4,TC,25,,,,,,,,,,,,,,,,,5,358
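
The two less obvious arguments in the `read_csv` call above can be checked on a tiny in-memory file. This is a minimal sketch: the three-column CSV and the contents of its second header line are made up for illustration and only mimic the layout of the real file.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: a header row, a second header row,
# then data in which a missing reading is stored as a single space.
raw = (
    "SID,SEASON,WMO_WIND\n"
    " ,Year,kts\n"
    "1980001S13173,1980, \n"
    "1980001S13173,1980,25\n"
)

# skiprows=[1] drops the second line; na_values=' ' turns single-space fields into NaN.
toy = pd.read_csv(io.StringIO(raw), dtype="object", skiprows=[1], na_values=" ")
print(toy)
```

The second header line never reaches the frame, and the single-space field becomes NaN while every other value stays a string.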


In [3]:
print(df.shape)

(271883, 163)


The file is large, but it will shrink during the cleaning process. To start, there are 163 columns, all read in as object dtype, so I'll need to go through and convert them.
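
As a rough sketch of what that conversion can look like, `pd.to_numeric` can be applied column by column. The toy frame and the `'n/a'` entry are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy frame with every value stored as a string, like the freshly loaded file.
toy = pd.DataFrame(
    {"lat": ["-12.5", "-12.1927"], "usa_wind": ["25", "n/a"]},
    dtype="object",
)

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
converted = toy.apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)
```

Coercing is convenient for exploration, but it silently hides malformed entries, so it is worth comparing null counts before and after the conversion.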

In [59]:
df.columns = [x.lower() for x in df.columns]
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271883 entries, 0 to 271882
Data columns (total 163 columns):
sid                 object
season              object
number              object
basin               object
subbasin            object
name                object
iso_time            object
nature              object
lat                 object
lon                 object
wmo_wind            object
wmo_pres            object
wmo_agency          object
track_type          object
dist2land           object
landfall            object
iflag               object
usa_agency          object
usa_atcf_id         object
usa_lat             object
usa_lon             object
usa_record          object
usa_status          object
usa_wind            object
usa_pres            object
usa_sshs            object
usa_r34_ne          object
usa_r34_se          object
usa_r34_sw          object
usa_r34_nw          object
usa_r50_ne          object
usa_r50_se          object
usa_r50_sw          obje

The dataset has readings for storms at multiple points in their progression. There are 4,458 unique storms tracked.

In [6]:
df['sid'].nunique()

4458

My classification task is to identify whether a reading comes from a severe Tropical Storm or a minor storm. My target column, 'nature', has six classes, which I want to collapse into two, making this a binary problem: severe storm or not severe.

NR (not reported) and MX (mixture) will be removed because they don't map to either class. TS (tropical storm) will be my '1', a severe storm. ET, DS, and SS (extratropical, disturbance, and subtropical) are less severe storms and will be my '0' class.

In [7]:
df['nature'].unique()

array(['TS', 'NR', 'ET', 'MX', 'SS', 'DS'], dtype=object)

## Data Exploration & Cleaning

The first thing I'm going to do is engineer my Y by making a binary column.

In [13]:
# remove unclassified rows (NR = not reported, MX = mixture)
df = df[~df['nature'].isin(['NR', 'MX'])].copy()

# new binary column: 1 where the storm is a Tropical Storm (TS), 0 otherwise
# (a vectorized comparison avoids the slow row loop and chained assignment)
df['target'] = (df['nature'] == 'TS').astype(int)

In [14]:
df['target'].value_counts(normalize=True)

1    0.897215
0    0.102785
Name: target, dtype: float64

So there is a pretty severe class imbalance here: about 90% of the readings are Tropical Storms. Before I address that, I want to define my features. Using a preliminary look through the columns in the dataframe and the documentation as a guide, I'm selecting the following as potential features to explore. This is far fewer than 163! Many of the other columns were blank, duplicated, or not useful.
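
To make the imbalance concrete, here is a small sketch on synthetic labels with roughly the same 90/10 split (the labels are made up, not drawn from the dataset). It shows what a model that always predicts the majority class would score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 897 severe storms (1) and 103 less severe ones (0).
y_true = np.array([1] * 897 + [0] * 103)

# A "model" that always predicts the majority class.
y_pred = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=0)
print(baseline_accuracy, minority_recall)
```

Accuracy alone looks strong here even though no less severe storm is ever identified, which is why metrics such as recall per class or ROC AUC are worth watching alongside it.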

In [48]:
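# .dt.week was removed in pandas 2.0; isocalendar().week is its replacement
df['week_of_year'] = pd.to_datetime(df['iso_time']).dt.isocalendar().week.astype(int)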

0          1
1          1
2          1
3          1
4          1
...

In [50]:
initial_feats = ['sid', 'season', 'basin', 'subbasin', 'lat', 'lon',
                 'wmo_wind', 'dist2land', 'ds824_wind', 'td9636_stage', 'storm_speed', 'storm_dir']
# .copy() so the assignments below modify a new frame rather than a view of df
xs_df = df[initial_feats].copy()

# convert the numeric features from object to numeric dtypes
num_feats = ['lat', 'lon', 'dist2land', 'season', 'wmo_wind', 'ds824_wind', 'td9636_stage', 'storm_speed', 'storm_dir']
xs_df[num_feats] = xs_df[num_feats].apply(pd.to_numeric)

#taking my datetime object and pulling out the week as a feature
# (.dt.week was removed in pandas 2.0; isocalendar().week is its replacement)
df['iso_time'] = pd.to_datetime(df['iso_time'])
xs_df['week_of_year'] = df['iso_time'].dt.isocalendar().week.astype(int)

In [51]:
xs_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226423 entries, 0 to 271072
Data columns (total 13 columns):
sid             226423 non-null object
season          226423 non-null int64
basin           190696 non-null object
subbasin        197687 non-null object
lat             226423 non-null float64
lon             226423 non-null float64
wmo_wind        91501 non-null float64
dist2land       226423 non-null int64
ds824_wind      4139 non-null float64
td9636_stage    40790 non-null float64
storm_speed     226422 non-null float64
storm_dir       226422 non-null float64
week_of_year    226423 non-null int64
dtypes: float64(7), int64(3), object(3)
memory usage: 34.2+ MB


So now I have a clearer picture of what my data looks like. I'm going to drop ds824_wind, which has only 4,139 non-null values out of 226,423 rows, and inspect the other columns with nulls more closely. For now I am focusing on the numeric columns.

In [57]:
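# errors='ignore' keeps this cell safe to re-run; a second run otherwise raises
# KeyError: "['ds824_wind'] not found in axis" because the column is already gone
xs_df = xs_df.drop(columns='ds824_wind', errors='ignore')
y_df = df['target'].to_frame()
clean_df = pd.concat([xs_df, y_df], axis=1)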

In [53]:
clean_df.head()

Unnamed: 0,sid,season,basin,subbasin,lat,lon,wmo_wind,dist2land,td9636_stage,storm_speed,storm_dir,week_of_year,target
0,1980001S13173,1980,SP,MM,-12.5,172.5,,647,1.0,6.0,351.0,1,1
1,1980001S13173,1980,SP,MM,-12.1927,172.441,,653,1.0,6.0,351.0,1,1
2,1980001S13173,1980,SP,MM,-11.9144,172.412,,670,1.0,5.0,358.0,1,1
3,1980001S13173,1980,SP,MM,-11.6863,172.435,,682,1.0,4.0,12.0,1,1
4,1980001S13173,1980,SP,MM,-11.5,172.5,,703,1.0,4.0,22.0,1,1


In [54]:
clean_df['wmo_wind'].isna().sum()/len(clean_df['wmo_wind'])

0.5958846936927785

In [55]:
clean_df['td9636_stage'].isna().sum()/len(clean_df['td9636_stage'])

0.8198504568882137

Most of wmo_wind (about 60%) and td9636_stage (about 82%) is also null. I want to look at my data grouped by storm ID to see whether there are any other trends I can identify.
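
Row-level and storm-level missingness can tell different stories, because a storm counts as missing a feature only if none of its readings has a value. The sketch below uses a made-up six-row frame (the `sid` labels and values are illustrative) to compute both shares.

```python
import numpy as np
import pandas as pd

# Toy readings for three hypothetical storms A, B, and C.
toy = pd.DataFrame({
    "sid": ["A", "A", "A", "B", "B", "C"],
    "wmo_wind": [np.nan, 35.0, np.nan, np.nan, np.nan, 50.0],
    "td9636_stage": [1.0, np.nan, np.nan, np.nan, np.nan, np.nan],
})
cols = ["wmo_wind", "td9636_stage"]

# Share of individual readings that are null, for every column at once.
row_null_share = toy[cols].isna().mean()

# Share of storms with no value at all: max() of an all-null group is NaN.
storm_null_share = toy.groupby("sid")[cols].max().isna().mean()
print(row_null_share, storm_null_share, sep="\n")
```

In this toy frame wmo_wind is null in two thirds of the rows but for only one storm in three, the same kind of gap that can appear between per-reading and per-storm figures in the real data.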

In [44]:
pd.set_option('display.max_rows', None)
grouped_df = clean_df.groupby(['sid']).max()
grouped_df

sid,season,lat,lon,wmo_wind,dist2land,td9636_stage,storm_speed,storm_dir,month,target
1980001S13173,1980,-11.1525,189.5,65.0,934,4.0,22.0,358.0,1,1
1980002S15081,1980,-13.8825,80.0,29.0,2230,,7.0,320.0,1,1
1980003S15137,1980,-14.66,161.0,50.0,818,2.0,26.0,237.0,1,1
1980005S11059,1980,-11.0,59.0,25.0,1014,,35.0,308.0,1,1
1980005S14120,1980,-13.6333,120.7,115.0,490,4.0,20.0,339.0,1,1
1980009S14066,1980,-12.0,66.8,20.0,1328,,7.0,355.0,1,1
1980010S20043,1980,-16.0,53.0,29.0,282,,15.0,354.0,1,1
1980010S22048,1980,-25.3,51.3,29.0,410,,16.0,170.0,1,1
1980015S18060,1980,-16.7584,68.0,,1227,4.0,29.0,353.0,1,1
1980018S10130,1980,-10.1,130.4,100.0,2494,4.0,22.0,337.0,1,1


In [41]:
grouped_df['wmo_wind'].isna().sum()/len(grouped_df)

0.1196319018404908

In [42]:
grouped_df['td9636_stage'].isna().sum()/len(grouped_df)

0.8105828220858896

Looking at the data grouped by storm ID, I can see that the 'stage' feature stopped being recorded altogether in the 90s, and about 81% of storms have no value for it at all. This feature isn't going to be useful to me, so I'm going to drop it. About 12% of the storms have no wind speed recorded, so I could potentially fill those gaps with a mean value.
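
Mean filling could be done with the `SimpleImputer` imported at the top of the notebook. This is only a sketch on made-up wind speeds, and the `wmo_wind_filled` column name is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy wind speeds with two gaps (values are illustrative).
toy = pd.DataFrame({"wmo_wind": [65.0, np.nan, 50.0, 25.0, np.nan, 115.0]})

# strategy="mean" learns the column mean from the non-null values
# and substitutes it wherever a value is missing.
imputer = SimpleImputer(strategy="mean")
toy["wmo_wind_filled"] = imputer.fit_transform(toy[["wmo_wind"]]).ravel()
print(toy)
```

When this feeds a model, fitting the imputer on the training split only (for example inside a `Pipeline`) keeps information from the test rows out of the learned mean.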

In [56]:
clean_df = clean_df.drop(['td9636_stage'], axis=1)
clean_df.head()

Unnamed: 0,sid,season,basin,subbasin,lat,lon,dist2land,td9636_stage,storm_speed,storm_dir,week_of_year,target
0,1980001S13173,1980,SP,MM,-12.5,172.5,647,1.0,6.0,351.0,1,1
1,1980001S13173,1980,SP,MM,-12.1927,172.441,653,1.0,6.0,351.0,1,1
2,1980001S13173,1980,SP,MM,-11.9144,172.412,670,1.0,5.0,358.0,1,1
3,1980001S13173,1980,SP,MM,-11.6863,172.435,682,1.0,4.0,12.0,1,1
4,1980001S13173,1980,SP,MM,-11.5,172.5,703,1.0,4.0,22.0,1,1
