## Preparing the data

The data for this project is in the HURDAT2 format from the National Hurricane Center (NHC). The file interleaves rows of storm positions with header rows, each of which identifies the storm that the subsequent position rows belong to.

Because of this, the raw data table has a few problems:
- No column names are provided.
- Columns contain a mix of data types.
- Many rows are full of missing data for most columns.
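
To make these problems concrete, here is a minimal sketch that reads a three-line excerpt in the same layout. The excerpt, the shortened column list, and the names `sample`, `cols`, `raw`, and `missing_per_row` are illustrative assumptions rather than part of the project data; the wind-radii columns are left out for brevity.

```python
import io
import pandas as pd

# Hypothetical excerpt: one header row followed by two position rows.
sample = io.StringIO(
    "AL011851,UNNAMED,2\n"
    "18510625,0000,,HU,28.0N,94.8W,80,-999\n"
    "18510625,0600,,HU,28.0N,95.4W,80,-999\n"
)

# Shortened, assumed column list (no wind-radii columns).
cols = ['date', 'time', 'recordID', 'status', 'lat', 'long',
        'maxSustWind', 'minPressure']

raw = pd.read_csv(sample, names=cols)

# Count the missing values in each row: the short header row leaves
# every column after the third one empty.
missing_per_row = raw.isna().sum(axis=1)
print(raw)
print(missing_per_row.tolist())
```

In this toy frame the `date` column holds both a storm ID and calendar dates, and the `time` column holds both a storm name and clock times, which is the mixed-type problem described above.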

As a result, our first goal will be to convert the data into a more usable format. We begin by importing the raw data as is into a Pandas DataFrame, `atl`. In doing so, we also assign column names which correspond to the information in the storm position data rows. We will later separate the header rows into a new DataFrame and assign them their own column names.

In [1]:
import pandas as pd
import numpy as np


# create a list of column names

header = ['date', 'time', 'recordID', 'status', 'lat', 'long', 'maxSustWind', 'minPressure', 'extNE34', 'extSE34', 'extSW34', 'extNW34', 'extNE50', 'extSE50', 'extSW50', 'extNW50', 'extNE64', 'extSE64', 'extSW64', 'extNW64']


# import Best Track Data (HURDAT2) using our column names, and verify the new DataFrame.

atl = pd.read_csv('atlantic.csv', names = header)
atl.head()

Unnamed: 0,date,time,recordID,status,lat,long,maxSustWind,minPressure,extNE34,extSE34,extSW34,extNW34,extNE50,extSE50,extSW50,extNW50,extNE64,extSE64,extSW64,extNW64
0,AL011851,UNNAMED,14.0,,,,,,,,,,,,,,,,,
1,18510625,0000,,HU,28.0N,94.8W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
2,18510625,0600,,HU,28.0N,95.4W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
3,18510625,1200,,HU,28.0N,96.0W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
4,18510625,1800,,HU,28.1N,96.5W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0


From the NHC documentation for the HURDAT2 database, column names are as follows:

- `date`: timecode for the position entry, format `YYYYMMDD`
- `time`: timecode for the position entry, format `HHMM` in 24-Hr UTC
- `recordID`: special designation for significant position entries. Can be empty or contain values:
  - **C**: closest approach to coast when not followed by landfall
  - **G**: genesis
  - **I**: intensity peak in both pressure and wind
  - **L**: landfall
  - **P**: minimum pressure
  - **R**: additional intensity detail during rapid changes
  - **S**: change of status
  - **T**: additional track/position detail
  - **W**: maximum wind speed
- `status`: tropical depression, tropical storm, hurricane, extratropical cyclone, subtropical depression, subtropical storm, low pressure system, tropical wave, or disturbance
- `lat`: latitude of center of storm
- `long`: longitude of center of storm
- `maxSustWind`: maximum sustained wind
- `minPressure`: minimum central pressure
- `extDDXX`: maximum extent, in nautical miles, of `XX`-knot winds in the `DD` quadrant (`NE`, `SE`, `SW`, or `NW`)
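
The sample output above shows `-999` in many numeric fields; HURDAT2 uses that value to mark missing data. The sketch below shows one way to turn the sentinel into `NaN` and to combine `date` and `time` into a single timestamp. The frame `demo`, the list `numeric_cols`, and the `timestamp` column are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame that reuses this section's column names.
demo = pd.DataFrame({
    'date': ['18510625', '18510625'],
    'time': ['0000', '0600'],
    'maxSustWind': [80.0, 80.0],
    'minPressure': [-999.0, -999.0],
    'extNE34': [-999.0, -999.0],
})

# Replace the -999 sentinel with NaN, restricted to the numeric columns.
numeric_cols = ['maxSustWind', 'minPressure', 'extNE34']
demo[numeric_cols] = demo[numeric_cols].replace(-999, np.nan)

# Join the YYYYMMDD and HHMM strings and parse them as UTC timestamps.
demo['timestamp'] = pd.to_datetime(
    demo['date'] + demo['time'], format='%Y%m%d%H%M', utc=True
)
print(demo)
```

With the sentinel replaced, aggregations such as a mean or minimum ignore the missing entries instead of being skewed by `-999`.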

Next, we want to separate the rows of `atl` into two new DataFrames, `storms` for header rows and `positions` for position data.

In [4]:
# first, create a new column to denote whether rows are header rows (True) or data rows (False)


isHeader = [] # list to be used as new column atl['header']; named isHeader so it does not overwrite the column-name list `header`


for entry in atl['date']:
    
    if entry.find('AL') != -1: # all header rows, and only header rows, contain 'AL'
        isHeader.append(True)
    else:
        isHeader.append(False)

        
atl['header'] = pd.Series(isHeader) # add the list as a pandas Series into a column of the atl DataFrame


# create dataframes of only storm headers and only position data so we can edit the columns and dtypes

storms = atl[atl['header'] == True].copy() # all header rows of atl copied into new dataframe storms
positions = atl[atl['header'] == False].copy() # all data rows of atl copied into new dataframe positions


# confirm all rows were sorted into one dataframe or the other

print(atl.shape[0], "=", storms.shape[0], "+", positions.shape[0])

53747 = 1894 + 51853
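
The same split can be sketched without an explicit loop by using pandas string methods. The frame `mini` and the names `is_header`, `storms_mini`, and `positions_mini` below are hypothetical stand-ins for `atl`, `storms`, and `positions`. The sketch uses `str.startswith`, which is slightly stricter than the substring search above.

```python
import pandas as pd

# Hypothetical miniature of `atl`: header rows carry a storm ID in 'date'.
mini = pd.DataFrame({
    'date': ['AL011851', '18510625', '18510625', 'AL021851', '18510705'],
    'time': ['UNNAMED', '0000', '0600', 'UNNAMED', '1200'],
})

# Boolean mask: True for header rows, False for position rows.
is_header = mini['date'].str.startswith('AL')

storms_mini = mini[is_header].copy()
positions_mini = mini[~is_header].copy()

# Every row should land in exactly one of the two frames.
print(len(mini), "=", len(storms_mini), "+", len(positions_mini))
```

A boolean mask also avoids adding a helper column that has to be dropped again later.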


Now that we have our DataFrames, `storms` and `positions`, each one needs a bit more preparation.

In [None]:
# for the storms dataframe, we need to remove unnecessary columns, rename existing columns, create a new
#     year column, and assign the correct dtypes to all columns


# drop unnecessary columns, rename remaining columns, and clean up indices

storms.drop(['status', 'lat', 'long', 'maxSustWind', 'minPressure', 'extNE34', 'extSE34', 'extSW34', 'extNW34', 'extNE50', 'extSE50', 'extSW50', 'extNW50', 'extNE64', 'extSE64', 'extSW64', 'extNW64', 'header'], axis = 1, inplace = True)
storms.columns = ['stormID', 'name', 'numPositions']
storms.reset_index(drop=True, inplace=True)

In [None]:
# column names are as follows:
#
# 'stormID': an individual identifier for each storm in the form ALXXYYYY denoting the storm was the XXth storm
#     of (A)t(L)antic Hurricane Season YYYY. Useful when storms in different years share the same name, and for
#     unnamed storms.
# 'name': name of storm.
# 'numPositions': the number of position entries in positions dataFrame corresponding to this storm

In [None]:
#convert the new columns to the correct dtypes

stormYears = [] #create a new list to be used as a numeric years column

for stormID in storms['stormID']:
    stormYears.append(stormID[4:8]) # strip out the year from the stormID.
                                    # note that this year may not necessarily correspond to the calendar dates
                                    #      during which the storm existed, but rather the Hurricane Season to which
                                    #      it belonged.
    
    
storms['year'] = pd.Series(stormYears).astype('int') # assign new year column as integer dtype
storms['numPositions'] = storms['numPositions'].astype('int') # reassign number of positions integer dtype
storms['name'] = storms['name'].astype('str').str.strip() # reassign storm names string dtype and strip whitespace
storms['stormID'] = storms['stormID'].astype('str').str.strip() # reassign stormID string dtype and strip whitespace


# this completes work on the storms dataFrame. We can verify it now.

storms.head()
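
Because a storm ID has the fixed layout `ALXXYYYY`, its parts can also be sliced out with vectorized string operations. This is a sketch on a hypothetical Series `ids`; the names `clean`, `basin`, `number`, and `year` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical IDs; the last one carries stray whitespace on purpose.
ids = pd.Series(['AL011851', 'AL132003', ' AL092011 '])

# Strip whitespace first so the fixed character positions line up.
clean = ids.str.strip()

basin = clean.str[:2]                 # 'AL' for the Atlantic basin
number = clean.str[2:4].astype(int)   # storm number within the season
year = clean.str[4:8].astype(int)     # hurricane season (four digits)

print(pd.DataFrame({'basin': basin, 'number': number, 'year': year}))
```

Stripping before slicing matters: with leading whitespace, the character positions shift and the year slice would pick up the wrong digits.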

In [None]:
# for the positions dataFrame we need to clean up the indices, reformat the latitude and longitude columns
#     to make them usable by geopandas, and create new columns for the storm name and stormID to make the
#     dataframe searchable by these criteria


# clean up the indices

positions.reset_index(drop=True, inplace=True)

In [None]:
# we can convert the latitude and longitude strings into signed floats by removing the cardinal direction.
# we write XX.XW as -XX.X and XX.XE as XX.X.
# likewise, we write XX.XN as XX.X and XX.XS as -XX.X.


intLat = [] # create lists to be used as new series for latitude and longitude 
intLong = []
        
    
for cardLat in positions['lat']:
    if cardLat.find('N') != -1: # for latitudes of degrees north, strip the whitespace and N
        intLat.append(cardLat.strip(" N"))
    else: # for latitudes of degrees south, strip the whitespace and S, and add a negative to the front
        intLat.append('-'+cardLat.strip(" S"))
    
for cardLong in positions['long']:
    if cardLong.find('E') != -1: #for longitudes of degrees east, strip the whitespace and E
        intLong.append(cardLong.strip(" E"))
    else: # for longitudes of degrees west, strip the whitespace and W, and add a negative to the front
        intLong.append('-'+cardLong.strip(" W"))

        
# replace the existing longitude and latitude columns with the new ones

positions['lat'] = pd.Series(intLat).astype('float')
positions['long'] = pd.Series(intLong).astype('float')
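
The sign convention above can be packaged as one helper that handles both latitude and longitude. This is a sketch, not the notebook's method; `to_signed_degrees` is a hypothetical function name.

```python
import pandas as pd

def to_signed_degrees(coords: pd.Series) -> pd.Series:
    """Convert strings such as '28.0N' or ' 94.8W' to signed floats."""
    coords = coords.str.strip()
    hemisphere = coords.str[-1]                 # trailing N, S, E, or W
    magnitude = coords.str[:-1].astype(float)   # numeric part
    # North and east are positive; south and west are negative.
    sign = hemisphere.map({'N': 1.0, 'E': 1.0, 'S': -1.0, 'W': -1.0})
    return magnitude * sign

example = to_signed_degrees(pd.Series(['28.0N', ' 94.8W', '10.5S', '5.0E']))
print(example.tolist())
```

An unexpected hemisphere letter maps to `NaN` here rather than silently receiving a negative sign, which makes malformed entries easier to spot.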


In [None]:
#use the number of position updates for each storm to create a column for the positions dataframe
#containing the appropriate names


stormNames = [] # create a list to be used as the names column for the positions dataFrame


for i in range(len(storms)): # for each storm in the storms dataFrame...
    for j in range(storms['numPositions'][i]): # for the number of rows indicated, add the storm name to the list
        stormNames.append(storms['name'][i])
        
        
#add the new list containing a name for every row of positions into positions dataFrame
        
positions['name'] = pd.Series(stormNames)

In [None]:
#repeat the process for storm IDs


stormIDs = []


for i in range(len(storms)):
    for j in range(storms['numPositions'][i]):
        stormIDs.append(storms['stormID'][i])
        
        
positions['stormID'] = pd.Series(stormIDs)
positions.head()
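
The two nested loops above repeat each storm's label once per position row. As a sketch of the same idea, `numpy.repeat` can do this in one call per column. The frames `storms_mini` and `positions_mini` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `storms` and `positions`.
storms_mini = pd.DataFrame({
    'stormID': ['AL011851', 'AL021851'],
    'name': ['UNNAMED', 'UNNAMED'],
    'numPositions': [2, 1],
})
positions_mini = pd.DataFrame({'date': ['18510625', '18510625', '18510705']})

counts = storms_mini['numPositions'].to_numpy()

# The counts must account for every position row, or the labels would misalign.
assert counts.sum() == len(positions_mini)

# Repeat each storm's ID and name once per position row.
positions_mini['stormID'] = np.repeat(storms_mini['stormID'].to_numpy(), counts)
positions_mini['name'] = np.repeat(storms_mini['name'].to_numpy(), counts)
print(positions_mini)
```

Both approaches rely on the position rows still being in their original file order, directly after the header row they belong to.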

In [None]:
# positions can now be filtered by storm. first, look up a storm's entry (including its stormID) by name.
#     names are reused across seasons, so this can return more than one row... e.g.:

def lookupID(stormName):
    return storms[storms['name'] == stormName]

lookupID('ISABEL')

In [None]:
# with the unique stormID found by searching the name of a storm, we can query the position table
#     for data on a specific storm... e.g.:

positions[positions['stormID'] == 'AL132003'].head()
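
Since a name alone may match several seasons, the two lookups can be combined into one helper that takes a name and a year. This is a sketch: `track_for` is a hypothetical function, and the toy frames hold made-up values rather than real records.

```python
import pandas as pd

# Made-up data: the same name used in two different seasons.
storms_mini = pd.DataFrame({
    'stormID': ['AL011990', 'AL052000'],
    'name': ['DEMO', 'DEMO'],
    'year': [1990, 2000],
})
positions_mini = pd.DataFrame({
    'stormID': ['AL011990', 'AL052000', 'AL052000'],
    'lat': [20.0, 13.8, 13.9],
    'long': [-60.0, -31.0, -32.1],
})

def track_for(name, year, storms_df, positions_df):
    """Return the position rows for the storm with this name in this season."""
    match = storms_df[(storms_df['name'] == name.upper())
                      & (storms_df['year'] == year)]
    if match.empty:
        raise KeyError(f"no storm named {name!r} in {year}")
    return positions_df[positions_df['stormID'].isin(match['stormID'])]

track = track_for('demo', 2000, storms_mini, positions_mini)
print(track)
```

Raising an error for an unknown name and year keeps a typo from quietly producing an empty track.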

### Creating Plots based on StormID

Now that we can look up stormIDs based on name, we can use the stormID to make other queries.

Using GeoPandas, we can create a simple plot of the track of the storm we request using a stormID.

In [None]:
import geopandas as gpd

In [None]:
def plotTrack(toPlotID):
    
    # create a temporary dataframe containing position entries for the given stormID
    
    plotTrackDF = positions[positions['stormID'] == toPlotID].copy()
                                                                    
    #create a geodataframe from the x and y coordinates of the selected stormID
    
    plotTrackGDF = gpd.GeoDataFrame(plotTrackDF, geometry = gpd.points_from_xy(plotTrackDF['long'], plotTrackDF['lat']))
    
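    #import the world map
    #note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0, so this call needs an older release
    world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))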
    
    #plot the selected storm's coordinates over a background map of North America
    plotTrackGDF.plot(ax = world[world['continent'] == 'North America'].plot(color = 'white', edgecolor = 'black'), color = 'red')

In [None]:
plotTrack('AL132003')