# Exploratory Data Analysis of Hurricane Data

Source of data: [National Oceanic and Atmospheric Administration](https://www.nhc.noaa.gov/data/)  
Database name: Atlantic HURDAT2    
Description from website: Atlantic Hurricane Data 1851-2017. This dataset has a comma-delimited, text format with six-hourly information on the location, maximum winds, central pressure, and (beginning in 2004) size of all known tropical cyclones and subtropical cyclones.  
Database format notes: [Link](https://www.nhc.noaa.gov/data/hurdat/hurdat2-format-atlantic.pdf)  
Wikipedia link to list of costliest hurricanes: [Link](https://en.wikipedia.org/wiki/List_of_costliest_Atlantic_hurricanes)

Storm events database data downloads: [Link](https://www.ncdc.noaa.gov/stormevents/ftp.jsp)

In [1]:
# Necessary Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import csv
import datetime

%matplotlib inline

The data we want to read is already in a comma-delimited text format, which should make it a perfect candidate for `pandas.read_csv()`. However, inconsistencies in the file's row structure break the call. I've left the attempt in for demonstration purposes and show my workaround below.

In [2]:
url = 'https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2017-050118.txt'
hurdat_full = pd.read_csv(url)
hurdat_full.head(15)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3.1,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,AL011851,UNNAMED,14,Unnamed: 3
18510625,0000,,HU,28.0N,94.8W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510625,0600,,HU,28.0N,95.4W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510625,1200,,HU,28.0N,96.0W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510625,1800,,HU,28.1N,96.5W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510625,2100,L,HU,28.2N,96.8W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510626,0000,,HU,28.2N,97.0W,70.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510626,0600,,TS,28.3N,97.6W,60.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510626,1200,,TS,28.4N,98.3W,60.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510626,1800,,TS,28.6N,98.9W,50.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,
18510627,0000,,TS,29.0N,99.4W,50.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,


# Dealing with variable column lengths
The comma-separated format above doesn't play well with `pandas.read_csv()` because the data effectively has two types of rows.
Each storm gets its own "header" line (4 columns), and all the rows underneath it (21 columns) contain time-series data corresponding to that storm.  
To work around this, I read the file line by line and update the current storm ID and name every time a new storm header row appears.
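As a quick illustration of the two row types, the sketch below runs `csv.reader` over two sample lines. The lines are typed in by hand to mimic the HURDAT2 layout (an assumption for illustration, not read from the file), and `sample` and `row_lengths` are hypothetical names.

```python
import csv
import io

# Two hand-typed lines that mimic the HURDAT2 layout (illustrative only)
sample = (
    "AL011851,            UNNAMED,     14,\n"
    "18510625, 0000,  , HU, 28.0N,  94.8W,  80, -999, -999, -999, -999, -999,"
    " -999, -999, -999, -999, -999, -999, -999, -999,\n"
)

# The trailing comma on each line produces one extra, empty field
row_lengths = [len(row) for row in csv.reader(io.StringIO(sample))]
print(row_lengths)
```

The header line splits into 4 fields and the data line into 21, which is the distinction the parsing loop relies on.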

In [3]:
# Basic initial exploration of row structure
file_name = 'hurdat2-1851-2017-050118.txt'
with open(file_name) as file:
    reader = csv.reader(file)
    
    total = 0
    count = 0
    landfall = 0
    intensity = 0
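    for row in reader:
        
        total += 1
        # Storm header rows have 4 fields; data rows have 21
        if len(row) < 21:
            count += 1
        # Record identifier ' L' marks a landfall
        elif ' L' in row:
            landfall += 1
        # Record identifier ' I' marks an intensity peak
        elif ' I' in row:
            intensity += 1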
    
    print('total entries           ', total)
    print('no. of storms           ', count)
    print('no. of landfalls        ', landfall)
    print('no. of intensity peaks  ', intensity)    
# Check values below after consolidating database by storm      

total entries            52151
no. of storms            1848
no. of landfalls         943
no. of intensity peaks   28


In [7]:
file_name = 'hurdat2-1851-2017-050118.txt'
with open(file_name) as file:
    
    reader = csv.reader(file)
    
    # Distinct labels for distinct rows.
    # See database format link in header for details.
    storm_cols = ['stormID','name', 'entries_n','extra']
    data_cols = ['date', 'time', 'record_type','status',
                 'latitude','longitude','max_sust_v', 'min_p',
                '34kt_r_ne', '34kt_r_se', '34kt_r_sw', '34kt_r_nw',
                '50kt_r_ne', '50kt_r_se', '50kt_r_sw', '50kt_r_nw',
                '64kt_r_ne', '64kt_r_se', '64kt_r_sw', '64kt_r_nw']
    
    # Adjust column names to accommodate the prepended storm ID and name and the trailing empty field
    data_cols_new = ['stormID','name'] + data_cols + ['empty']
    
    # METHODOLOGY:
    # Recognize when a new header row occurs and update the current stormID and name
    # Prefix each time-series row that follows with that stormID and name
    
    stormID, name = '', ''
    storms = []
    
    for row in reader:
        
        # Determine if header row & re-assign ID & name
        if len(row) == 4:
            stormID = row[0].strip()
            name = row[1].strip()
            
        else:
            storms.append([stormID,name] + row)    
    
    all_storms = pd.DataFrame(storms, columns = data_cols_new)

all_storms

Unnamed: 0,stormID,name,date,time,record_type,status,latitude,longitude,max_sust_v,min_p,...,34kt_v_nw,50kt_v_ne,50kt_v_se,50kt_v_sw,50kt_v_nw,64kt_v_ne,64kt_v_se,64kt_v_sw,64kt_v_nw,empty
0,AL011851,UNNAMED,18510625,0000,,HU,28.0N,94.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
1,AL011851,UNNAMED,18510625,0600,,HU,28.0N,95.4W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
2,AL011851,UNNAMED,18510625,1200,,HU,28.0N,96.0W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
3,AL011851,UNNAMED,18510625,1800,,HU,28.1N,96.5W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
4,AL011851,UNNAMED,18510625,2100,L,HU,28.2N,96.8W,80,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
5,AL011851,UNNAMED,18510626,0000,,HU,28.2N,97.0W,70,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
6,AL011851,UNNAMED,18510626,0600,,TS,28.3N,97.6W,60,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
7,AL011851,UNNAMED,18510626,1200,,TS,28.4N,98.3W,60,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
8,AL011851,UNNAMED,18510626,1800,,TS,28.6N,98.9W,50,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,
9,AL011851,UNNAMED,18510627,0000,,TS,29.0N,99.4W,50,-999,...,-999,-999,-999,-999,-999,-999,-999,-999,-999,


The data needs to be cleaned by:
* dropping the empty final column
* stripping all strings of whitespace
* combining the date and time columns into a single datetime object
* converting latitude and longitude to floats

Note that all values of `-999` indicate missing data.
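The cleaning steps listed above could be sketched as follows. This is a minimal sketch, not the notebook's actual cleaning code: `parse_coord`, `clean_storms`, and the one-row `demo` frame are hypothetical names introduced for illustration, and the sketch assumes every column still holds strings, as it does right after the `csv` parsing step.

```python
import numpy as np
import pandas as pd

def parse_coord(value):
    """Convert a string such as '28.0N' or '94.8W' to signed decimal degrees."""
    value = value.strip()
    sign = -1.0 if value[-1] in 'SW' else 1.0
    return sign * float(value[:-1])

def clean_storms(df):
    """Return a cleaned copy of a frame shaped like all_storms (hypothetical helper)."""
    # Drop the empty trailing column if it is still present
    out = df.drop(columns=['empty'], errors='ignore')
    # Strip whitespace from every value (assumes all columns hold strings)
    out = out.apply(lambda col: col.str.strip())
    # Combine date and time into a single datetime column
    out['datetime'] = pd.to_datetime(out['date'] + out['time'], format='%Y%m%d%H%M')
    out = out.drop(columns=['date', 'time'])
    # Signed decimal degrees: south and west become negative
    out['latitude'] = out['latitude'].map(parse_coord)
    out['longitude'] = out['longitude'].map(parse_coord)
    # Convert measurement columns to numbers and mark -999 as missing
    num_cols = [c for c in out.columns if c in ('max_sust_v', 'min_p') or '_r_' in c]
    for c in num_cols:
        out[c] = pd.to_numeric(out[c]).replace(-999, np.nan)
    return out

# One hand-typed row with a reduced set of columns, for illustration only
demo = pd.DataFrame(
    [['AL011851', 'UNNAMED', '18510625', ' 2100', ' L', ' HU',
      ' 28.2N', '  96.8W', '  80', ' -999', ' -999', '']],
    columns=['stormID', 'name', 'date', 'time', 'record_type', 'status',
             'latitude', 'longitude', 'max_sust_v', 'min_p', '34kt_r_ne', 'empty'])
cleaned = clean_storms(demo)
print(cleaned.T)
```

Working on a copy keeps the raw `all_storms` frame available for comparison while the cleaning logic is still being checked.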

In [6]:
# Removing unnecessary columns
all_storms.drop(labels = ['empty'], axis = 1, inplace = True)

The data should now be summarized by storm into the appropriate independent variables for the model. Those variables include:  

* storm length: difference between final and first datetime objects
* storm distance travelled: difference between final and first coordinates
* made landfall: 1 or 0
* maximum max sustained windspeed (`max_sust_v`), knots (kt)
* minimum max sustained windspeed, knots (kt)
* max sustained windspeed at landfall, knots (kt)
* minimum pressure, millibar
* maximum 34kt radius, nautical miles (nm)
* maximum 50kt radius, nm
* maximum 64kt radius, nm
* hurricane diameter
* forward speed
* hurricane severity index [Link](https://en.wikipedia.org/wiki/Hurricane_Severity_Index)
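
A few of the per-storm variables above can be sketched with a group-by over `stormID`. This is a hedged sketch under several assumptions: it expects an already cleaned frame with a `datetime` column, signed decimal-degree coordinates, and `NaN` in place of `-999`; the `demo` values are hand-typed for illustration; `haversine_nm` and `summarize_storm` are hypothetical helpers; and reading "distance travelled" as the great-circle distance between the first and last fix is one possible interpretation. The wind radii, diameter, forward speed, and severity index are left out.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_NM = 6371.0088 / 1.852  # mean Earth radius converted from km to nautical miles

def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two points in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_NM * np.arcsin(np.sqrt(a))

def summarize_storm(group):
    """Collapse one storm's time series into a dict of summary variables."""
    group = group.sort_values('datetime')
    first, last = group.iloc[0], group.iloc[-1]
    landfall_rows = group[group['record_type'] == 'L']
    return {
        'duration_h': (last['datetime'] - first['datetime']) / pd.Timedelta(hours=1),
        'distance_nm': haversine_nm(first['latitude'], first['longitude'],
                                    last['latitude'], last['longitude']),
        'made_landfall': int(not landfall_rows.empty),
        'max_wind_kt': group['max_sust_v'].max(),
        'min_wind_kt': group['max_sust_v'].min(),
        'landfall_wind_kt': landfall_rows['max_sust_v'].max(),  # NaN when no landfall
        'min_pressure_mb': group['min_p'].min(),  # NaN when pressure was never recorded
    }

# Hand-typed, already cleaned sample rows (illustrative values only)
demo = pd.DataFrame({
    'stormID': ['AL011851'] * 3 + ['AL021851'] * 2,
    'datetime': pd.to_datetime(['1851-06-25 00:00', '1851-06-25 21:00',
                                '1851-06-26 06:00', '1851-07-05 12:00',
                                '1851-07-05 18:00']),
    'record_type': ['', 'L', '', '', ''],
    'latitude': [28.0, 28.2, 28.3, 22.2, 22.2],
    'longitude': [-94.8, -96.8, -97.6, -97.6, -97.6],
    'max_sust_v': [80, 80, 60, 80, 70],
    'min_p': [np.nan] * 5,
})

rows = {storm_id: summarize_storm(g) for storm_id, g in demo.groupby('stormID')}
summary = pd.DataFrame.from_dict(rows, orient='index')
print(summary)
```

Building the summary from a dict of per-storm results keeps each variable's definition in one place, so further variables can be added as extra keys once their formulas are settled.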