# Exploratory Data Analysis of Hurricane Data

Source of data: [National Oceanic and Atmospheric Administration](https://www.nhc.noaa.gov/data/)  
Database name: Atlantic HURDAT2    
Description from website: Atlantic Hurricane Data 1851-2017. This dataset has a comma-delimited, text format with six-hourly information on the location, maximum winds, central pressure, and (beginning in 2004) size of all known tropical cyclones and subtropical cyclones.  
Database format notes: [Link](https://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf)  
Wikipedia link to list of costliest hurricanes: [Link](https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes)

Storm events database data downloads: [Link](https://www.ncdc.noaa.gov/stormevents/ftp.jsp)

In [23]:
# Necessary Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import datetime as dt

%matplotlib inline

The data is already comma-delimited, which should make it a perfect candidate for `pandas.read_csv()`. However, inconsistencies in the CSV structure break the call. I've left the attempt in for demonstration purposes and show my workaround below.

In [2]:
url = 'https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt'
hurdat_full = pd.read_csv(url)
hurdat_full.head(15)

URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

# Dealing with variable column lengths
The comma-separated format above doesn't play well with the pandas method because the data effectively has two types of rows.
Each storm gets its own "header" line (4 columns), and all the rows underneath (21 columns) contain time-series data corresponding to that storm.  
To counter this, I read the file line by line and make adjustments every time there is a new storm header row.
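
To make the two row shapes concrete, here is a minimal sketch that runs `csv.reader` over a two-line in-memory sample. The sample lines are typed by hand to mimic the HURDAT2 layout (an assumption, not an excerpt read from the file). The trailing comma on each line is what produces the extra empty field.

```python
import csv
import io

# Hand-written sample mimicking the HURDAT2 layout (illustrative only)
sample = (
    "AL011851,            UNNAMED,     14,\n"
    "18510625, 0000,  , HU, 28.0N,  94.8W,  80, -999, "
    "-999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999, -999,\n"
)

# One header row followed by one data row
row_lengths = [len(row) for row in csv.reader(io.StringIO(sample))]
print(row_lengths)  # [4, 21]
```

Checking `len(row)` is therefore enough to tell a storm header from a time-series record.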

In [3]:
# Basic initial exploration of row structure
file_name = 'hurdat2-1851-2017-050118.txt'
path = './'
with open(path+file_name) as file:
    reader = csv.reader(file)
    
    total = 0
    count = 0
    landfall = 0
    intensity = 0
    for row in reader:
        
        total += 1
        if len(row) < 21:
            count += 1
        elif ' L' in row:
            landfall += 1
        elif ' I' in row:
            intensity += 1
    
    print('total entries           ', total)
    print('no. of storms           ', count)
    print('no. of landfalls        ', landfall)
    print('no. of intensity peaks  ', intensity)    
# Check values below after consolidating database by storm      

total entries            52151
no. of storms            1848
no. of landfalls         943
no. of intensity peaks   28


In [32]:
file_name = 'hurdat2-1851-2017-050118.txt'
path = './'
with open(path+file_name) as file:
    
    reader = csv.reader(file)
    
    # Distinct labels for distinct rows.
    # See database format link in header for details.
    storm_cols = ['stormID','name', 'entries_n','extra']
    data_cols = ['date', 'time', 'record_type','status',
                 'latitude','longitude','max_sust_v', 'min_p',
                '34kt_r_ne', '34kt_r_se', '34kt_r_sw', '34kt_r_nw',
                '50kt_r_ne', '50kt_r_se', '50kt_r_sw', '50kt_r_nw',
                '64kt_r_ne', '64kt_r_se', '64kt_r_sw', '64kt_r_nw']
    
    # Adjust column names to accommodate the stormID/name prefix and the trailing empty field
    data_cols_new = ['stormID','name'] + data_cols + ['empty']
    
    # METHODOLOGY:
    # Recognize when a new header row occurs and adjust labels
    # Assign the time-series rows after each header row to that stormID and name
    
    stormID, name = '', ''
    storms = []
    
    for row in reader:
        
        # Determine if header row & re-assign ID & name
        if len(row) == 4:
            stormID = row[0].strip()
            name = row[1].strip()
            
        else:
            storms.append([stormID,name] + row)    
    
    all_storms = pd.DataFrame(storms, columns = data_cols_new)

all_storms.head()

Unnamed: 0,stormID,name,date,time,record_type,status,latitude,longitude,max_sust_v,min_p,...,34kt_r_nw,50kt_r_ne,50kt_r_se,50kt_r_sw,50kt_r_nw,64kt_r_ne,64kt_r_se,64kt_r_sw,64kt_r_nw,empty
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,


The data needs to be cleaned by:
* dropping the empty final column
* stripping all strings of whitespace
* combining the date and time columns into a single datetime object
* converting latitude and longitude to signed floats

Note that all values of `-999` indicate missing data.
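
Because `-999` is a sentinel rather than a measurement, it has to be kept out of any statistics. A minimal plain-Python sketch of one way to do that follows. The helper name `to_missing` and the sample values are hypothetical, and on a DataFrame a similar effect could be had with `DataFrame.replace`.

```python
import math

MISSING_SENTINEL = -999  # HURDAT2 marker for missing data

def to_missing(values, sentinel=MISSING_SENTINEL):
    """Return a copy of values with the sentinel replaced by NaN."""
    return [math.nan if v == sentinel else float(v) for v in values]

pressures = [-999, 1005, 998, -999]  # made-up sample readings
cleaned = to_missing(pressures)
n_missing = sum(math.isnan(v) for v in cleaned)
print(cleaned, n_missing)
```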

In [33]:
# Removing unnecessary columns
all_storms.drop(labels = ['empty'], axis = 1, inplace = True)
all_storms.head()

Unnamed: 0,stormID,name,date,time,record_type,status,latitude,longitude,max_sust_v,min_p,...,34kt_r_sw,34kt_r_nw,50kt_r_ne,50kt_r_se,50kt_r_sw,50kt_r_nw,64kt_r_ne,64kt_r_se,64kt_r_sw,64kt_r_nw
0,AL011851,UNNAMED,18510625,0,,HU,28.0N,94.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
1,AL011851,UNNAMED,18510625,600,,HU,28.0N,95.4W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999


Data should now be summarized by storm into the appropriate independent variables for the model. Those steps include:  

* storm length: difference between final and first datetime objects
* storm distance travelled: difference between final and first coordinates
* made landfall: 1 or 0
* maximum max sustained windspeed (`max_sust_v`), knots (kt)
* minimum max sustained windspeed, knots (kt)
* max sustained windspeed at landfall, knots (kt)
* minimum pressure, millibar
* maximum 34kt radius, nautical miles (nm)
* maximum 50kt radius, nm
* maximum 64kt radius, nm
* hurricane diameter
* forward speed
* hurricane severity index [Link](https://en.wikipedia.org/wiki/Hurricane_Severity_Index)
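
Two of the planned features, distance travelled and forward speed, depend on turning coordinate pairs into a distance. One common option is the haversine great-circle formula. The sketch below is an assumption about how these features could be computed, not the method used later in this notebook; the function names, the Earth radius constant, and the sample fixes are illustrative.

```python
import math

EARTH_RADIUS_NM = 3440.065  # approximate mean Earth radius in nautical miles (assumed)

def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two points in signed degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))

def forward_speed_kt(distance_nm, hours):
    """Average forward speed in knots; NaN when no time has elapsed."""
    return distance_nm / hours if hours > 0 else math.nan

# Two fixes six hours apart (made-up values)
d = haversine_nm(28.0, -94.8, 28.0, -95.4)
print(round(d, 1), round(forward_speed_kt(d, 6), 1))
```

Summing `haversine_nm` over consecutive fixes would give total track length, whereas applying it only to the first and last fix gives net displacement.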

In [34]:
# Transform latitude & longitude coordinates
def transform_coord(coord):
    '''
    Accept coordinate in string form, return signed float representation.
    Argument: str: example '28.2N'
    Return: float: example 28.2
    '''
    new_coord = 0
    value = float(coord[:-1])
    
    if coord[-1] == 'N' or coord[-1] == 'E':
        new_coord += value
    elif coord[-1] == 'S' or coord[-1] == 'W':
        new_coord -= value
    else:
        print('Unexpected direction received')
        return -999
    
    return new_coord
    

In [35]:
all_storms.latitude = all_storms['latitude'].apply(transform_coord)
all_storms.longitude = all_storms['longitude'].apply(transform_coord)

In [36]:
all_storms.head()

Unnamed: 0,stormID,name,date,time,record_type,status,latitude,longitude,max_sust_v,min_p,...,34kt_r_sw,34kt_r_nw,50kt_r_ne,50kt_r_se,50kt_r_sw,50kt_r_nw,64kt_r_ne,64kt_r_se,64kt_r_sw,64kt_r_nw
0,AL011851,UNNAMED,18510625,0,,HU,28.0,-94.8,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
1,AL011851,UNNAMED,18510625,600,,HU,28.0,-95.4,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
2,AL011851,UNNAMED,18510625,1200,,HU,28.0,-96.0,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
3,AL011851,UNNAMED,18510625,1800,,HU,28.1,-96.5,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2,-96.8,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,-999


In [37]:
# Print the labels to confirm the column order
# Convert latitude, longitude, and every column after them to numeric
print(all_storms.columns)
for column in all_storms.columns[6:]:
    all_storms[column] = pd.to_numeric(all_storms[column], errors = 'raise')
    
# Confirm successful change
all_storms.dtypes

Index(['stormID', 'name', 'date', 'time', 'record_type', 'status', 'latitude',
       'longitude', 'max_sust_v', 'min_p', '34kt_r_ne', '34kt_r_se',
       '34kt_r_sw', '34kt_r_nw', '50kt_r_ne', '50kt_r_se', '50kt_r_sw',
       '50kt_r_nw', '64kt_r_ne', '64kt_r_se', '64kt_r_sw', '64kt_r_nw'],
      dtype='object')


stormID         object
name            object
date            object
time            object
record_type     object
status          object
latitude       float64
longitude      float64
max_sust_v       int64
min_p            int64
34kt_r_ne        int64
34kt_r_se        int64
34kt_r_sw        int64
34kt_r_nw        int64
50kt_r_ne        int64
50kt_r_se        int64
50kt_r_sw        int64
50kt_r_nw        int64
64kt_r_ne        int64
64kt_r_se        int64
64kt_r_sw        int64
64kt_r_nw        int64
dtype: object

In [109]:
all_storms['datetime'] = [dt.datetime.strptime(time_str,'%Y%m%d %H%M')
                        for time_str in all_storms.date+all_storms.time]
all_storms.head()

Unnamed: 0,stormID,name,date,time,record_type,status,latitude,longitude,max_sust_v,min_p,...,34kt_r_nw,50kt_r_ne,50kt_r_se,50kt_r_sw,50kt_r_nw,64kt_r_ne,64kt_r_se,64kt_r_sw,64kt_r_nw,datetime
0,AL011851,UNNAMED,18510625,0,,HU,28.0,-94.8,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,1851-06-25 00:00:00
1,AL011851,UNNAMED,18510625,600,,HU,28.0,-95.4,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,1851-06-25 06:00:00
2,AL011851,UNNAMED,18510625,1200,,HU,28.0,-96.0,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,1851-06-25 12:00:00
3,AL011851,UNNAMED,18510625,1800,,HU,28.1,-96.5,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,1851-06-25 18:00:00
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2,-96.8,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,1851-06-25 21:00:00


In [39]:
all_storms.datetime[1] - all_storms.datetime[0]

Timedelta('0 days 06:00:00')
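
The list comprehension above concatenates `date` and `time` directly, so it matches `'%Y%m%d %H%M'` only while the raw `time` strings still carry their leading space. A slightly more defensive sketch strips both fields and zero-pads the time before parsing. The helper name `parse_timestamp` is hypothetical.

```python
import datetime as dt

def parse_timestamp(date_str, time_str):
    """Combine HURDAT2-style date ('YYYYMMDD') and time ('HHMM') strings."""
    date_part = date_str.strip()
    time_part = time_str.strip().zfill(4)  # '600' -> '0600', '0' -> '0000'
    return dt.datetime.strptime(date_part + time_part, '%Y%m%d%H%M')

first = parse_timestamp('18510625', ' 0000')
second = parse_timestamp('18510625', '600')
print(second - first)  # 6:00:00
```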

### Creating a dataframe of individual storm data
The following code goes through each stormID to derive summary characteristics from the time-series data. 
Note that while some storms share names (NOAA reuses Atlantic storm names on a six-year rotation), the ID is still unique.  

HURDAT2 marks any missing measurement as `-999`. To deal with this, the step that gathers max and median values for wind velocity, pressure, and wind radii excludes any values < 0 from the calculation.
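
The per-storm aggregation described here can be sketched in plain Python on a toy table. The records, the helper `valid`, and the output keys are made up for illustration; the notebook itself works on the `all_storms` DataFrame.

```python
from itertools import groupby
from operator import itemgetter
from statistics import median

# (stormID, max_sust_v, min_p) rows; -999 marks missing data (toy values)
records = [
    ('AL011851', 80, -999),
    ('AL011851', 70, -999),
    ('AL012017', 40, 994),
    ('AL012017', 55, 986),
    ('AL012017', -999, 990),
]

def valid(values):
    """Keep only non-negative readings, dropping the -999 sentinel."""
    return [v for v in values if v >= 0]

summary = {}
# groupby only merges adjacent rows, so sort by stormID first
ordered = sorted(records, key=itemgetter(0))
for storm_id, rows in groupby(ordered, key=itemgetter(0)):
    rows = list(rows)
    winds = valid(r[1] for r in rows)
    pressures = valid(r[2] for r in rows)
    summary[storm_id] = {
        'wind_v_max': max(winds) if winds else None,
        'wind_v_med': median(winds) if winds else None,
        'p_min': min(pressures) if pressures else None,
    }

print(summary)
```

A storm with no valid pressure readings ends up with `None` here, which mirrors the `NaN` entries that early storms receive in the summary table.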

In [115]:
# Isolate storm specific properties

old_cols = ['stormID', 'name', 'date', 'time', 'record_type','status',
                 'latitude','longitude','max_sust_v', 'min_p',
                '34kt_r_ne', '34kt_r_se', '34kt_r_sw', '34kt_r_nw',
                '50kt_r_ne', '50kt_r_se', '50kt_r_sw', '50kt_r_nw',
                '64kt_r_ne', '64kt_r_se', '64kt_r_sw', '64kt_r_nw']

new_cols = ['stormID', 'name', 'duration', 'landfall',
            'lat_delta','lon_delta', 'wind_v_max', 'wind_v_med',
            'p_min', 'p_med',
            '34kt_r_max', '34kt_r_med',
            '50kt_r_max', '50kt_r_med',
            '64kt_r_max', '64kt_r_med']

# Create total storm variables for each storm, corresponding to above columns
storms_to_concat = []
for storm in all_storms.stormID.unique():
    this_storm = all_storms[all_storms.stormID == storm].reset_index()
    
    # Initializing the storm's individual row data
    this_row = pd.DataFrame([storm], columns = ['stormID'])
    
    # Getting storm name
    # Picking last value in list of unique names to account for possible instance
    # of unnamed storm later receiving name
    this_row['name'] = this_storm.name.unique()[-1]
    
    # Time duration as datetime timedelta object
    this_row['duration'] = this_storm.datetime.iloc[-1] - this_storm.datetime.iloc[0]
    
    #Determining landfall
    if " L" in this_storm.record_type.unique():
        this_row['landfall'] = 1
    else:
        this_row['landfall'] = 0
    
    #Getting coordinate deltas
    #Using net displacement rather than total distance travelled
    this_row['lat_delta'] = abs(this_storm.latitude.iloc[-1] - this_storm.latitude.iloc[0])
    this_row['lon_delta'] = abs(this_storm.longitude.iloc[-1] - this_storm.longitude.iloc[0])
    
    # Added precaution for the following variables. All -999 values are excluded.
    
    # Getting max/med sustained windspeed and pressure stats
    # For pressure we are looking for the minimum
    this_row['wind_v_max'] = this_storm.max_sust_v[this_storm.max_sust_v>=0].max()
    this_row['wind_v_med'] = this_storm.max_sust_v[this_storm.max_sust_v>=0].median()
    this_row['p_min'] = this_storm.min_p[this_storm.min_p>=0].min()
    this_row['p_med'] = this_storm.min_p[this_storm.min_p>=0].median()
    
    # Extracting the maximum wind radius and the mean of the per-quadrant median radii (nautical miles)
    for val in ['34', '50', '64']:
        directions = [val+'kt_r_ne', val+'kt_r_se', val+'kt_r_sw', val+'kt_r_nw']
        label_max = val+'kt_r_max'
        label_med = val+'kt_r_med'
        this_row[label_max] = np.max(this_storm[directions]
                                     [this_storm[directions]>=0].max())
        this_row[label_med] = np.mean(this_storm[directions]
                                      [this_storm[directions]>=0].median())
        
    # Append storm dataframe to container for later concatenation
    storms_to_concat.append(this_row)

# Dataframe where every row is a storm
ind_storms = pd.concat(storms_to_concat).reset_index()
    

In [126]:
print(ind_storms.shape)
ind_storms.head()

(1848, 17)


Unnamed: 0,index,stormID,name,duration,landfall,lat_delta,lon_delta,wind_v_max,wind_v_med,p_min,p_med,34kt_r_max,34kt_r_med,50kt_r_max,50kt_r_med,64kt_r_max,64kt_r_med
0,0,AL011851,UNNAMED,3 days 00:00:00,1,3.0,5.4,80.0,60.0,,,,,,,,
1,0,AL021851,UNNAMED,0 days 00:00:00,0,0.0,0.0,80.0,80.0,,,,,,,,
2,0,AL031851,UNNAMED,0 days 00:00:00,0,0.0,0.0,50.0,50.0,,,,,,,,
3,0,AL041851,UNNAMED,11 days 18:00:00,1,35.1,6.2,100.0,70.0,,,,,,,,
4,0,AL051851,UNNAMED,3 days 18:00:00,0,0.0,0.0,50.0,50.0,,,,,,,,


In [117]:
ind_storms.tail(20)

Unnamed: 0,index,stormID,name,duration,landfall,lat_delta,lon_delta,wind_v_max,wind_v_med,p_min,p_med,34kt_r_max,34kt_r_med,50kt_r_max,50kt_r_med,64kt_r_max,64kt_r_med
1828,0,AL152016,NICOLE,15 days 06:00:00,0,35.8,23.3,120.0,60.0,950.0,975.0,420.0,92.5,180.0,32.5,90.0,0.0
1829,0,AL162016,OTTO,8 days 18:00:00,1,3.4,12.6,100.0,45.0,975.0,1000.0,50.0,32.5,30.0,0.0,15.0,0.0
1830,0,AL012017,ARLENE,6 days 12:00:00,0,3.9,3.9,55.0,40.0,986.0,994.0,360.0,105.0,180.0,0.0,0.0,0.0
1831,0,AL022017,BRET,1 days 15:00:00,1,3.1,13.0,45.0,40.0,1007.0,1008.0,120.0,40.0,0.0,0.0,0.0,0.0
1832,0,AL032017,CINDY,4 days 12:00:00,1,15.3,10.1,50.0,37.5,991.0,996.5,240.0,36.25,80.0,0.0,0.0,0.0
1833,0,AL042017,UNNAMED,2 days 00:00:00,0,3.6,13.8,25.0,25.0,1009.0,1010.0,0.0,0.0,0.0,0.0,0.0,0.0
1834,0,AL052017,DON,1 days 12:00:00,0,1.1,9.4,45.0,35.0,1005.0,1009.0,30.0,15.0,0.0,0.0,0.0,0.0
1835,0,AL062017,EMILY,2 days 06:00:00,1,2.3,7.3,50.0,30.0,1001.0,1008.0,50.0,0.0,20.0,0.0,0.0,0.0
1836,0,AL072017,FRANKLIN,3 days 18:00:00,1,4.3,16.8,75.0,50.0,981.0,996.0,160.0,76.25,50.0,7.5,30.0,0.0
1837,0,AL082017,GERT,6 days 18:00:00,0,28.1,30.3,95.0,47.5,962.0,995.0,330.0,46.25,70.0,8.75,30.0,0.0


From the output above, storms with missing data seem to fall into two categories (for now):
* those from times when it was impossible to gather such data
* recent, smaller storms

I'm going to pull out those storms to have a closer look.

In [123]:
storms_na = ind_storms[ind_storms.isnull().any(axis = 1)]
storms_na[storms_na.landfall == 1].shape

(499, 17)
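
One way to check the first category above (storms from times when the data could not be gathered) is to tabulate incomplete storms by decade, using the season year embedded in the last four characters of each `stormID` (as in `AL011851`). The sketch below uses a hand-made list of IDs rather than the real `storms_na` frame, and the helper name `decade_of` is hypothetical.

```python
from collections import Counter

def decade_of(storm_id):
    """Return the decade (e.g. 1850) from a HURDAT2 ID such as 'AL011851'."""
    year = int(storm_id[-4:])
    return year - year % 10

# Stand-in for storms_na.stormID (made-up selection of IDs)
ids_with_gaps = ['AL011851', 'AL041851', 'AL021900', 'AL042017']
gaps_per_decade = Counter(decade_of(s) for s in ids_with_gaps)
print(sorted(gaps_per_decade.items()))
```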