# November 14, 2020

## Some additional considerations

### A month column for the storms dataframe

It will be easier to create graphics based on data for specific months if we add a column to the `storms` dataset that gives the month the storm formed in.

Extracting the month is easier if we convert the `date` column in `positions` to a datetime variable, which we should have done from the start.

In [9]:
import pandas as pd

positions = pd.read_csv("../data/02_intermediate/Atlantic_positions.csv")
storms = pd.read_csv("../data/02_intermediate/Atlantic_storms.csv")

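# Parse the YYYYMMDD dates so the month can be read off directly
positions["date"] = pd.to_datetime(positions["date"], format="%Y%m%d")

# For each storm, take the month of its first recorded position
# (rows are assumed to be in chronological order within each storm)
month_formed = []
for stormID in storms["stormID"]:
    months = positions[positions["stormID"] == stormID]["date"].dt.month
    month_formed.append(months.iloc[0])

storms["month_formed"] = month_formed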

storms.head()

Unnamed: 0,stormID,name,numPositions,year,month_formed
0,AL011851,UNNAMED,14,1851,6
1,AL021851,UNNAMED,1,1851,7
2,AL031851,UNNAMED,1,1851,7
3,AL041851,UNNAMED,49,1851,8
4,AL051851,UNNAMED,16,1851,9


This looks like it worked, but let's check a few cases farther down the storm list to make sure.

In [16]:
storms[storms["year"] >= 2017]

Unnamed: 0,stormID,name,numPositions,year,month_formed
1839,AL012017,ARLENE,27,2017,4
1840,AL022017,BRET,9,2017,6
1841,AL032017,CINDY,20,2017,6
1842,AL042017,FOUR,9,2017,7
1843,AL052017,DON,7,2017,7
1844,AL062017,EMILY,11,2017,7
1845,AL072017,FRANKLIN,18,2017,8
1846,AL082017,GERT,28,2017,8
1847,AL092017,HARVEY,74,2017,8
1848,AL112017,IRMA,66,2017,8


Seems like this should be right, then!

I'll add this code into the data cleaning function and run it again.
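
When folding this into the cleaning function, a vectorized variant may be worth considering, since the loop above filters `positions` once per storm. The following is only a sketch on toy data: the helper name `add_month_formed` is hypothetical, and it takes the earliest date per storm rather than the first row, which gives the same result only if each storm's positions are stored in chronological order.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only).
positions = pd.DataFrame({
    "stormID": ["AL011851", "AL011851", "AL021851", "AL031851", "AL031851"],
    "date": ["18510625", "18510626", "18510705", "18510731", "18510801"],
})
storms = pd.DataFrame({"stormID": ["AL011851", "AL021851", "AL031851"]})


def add_month_formed(storms, positions):
    """Return a copy of `storms` with a `month_formed` column (hypothetical helper)."""
    dates = pd.to_datetime(positions["date"], format="%Y%m%d")
    # Earliest recorded date per storm, indexed by stormID.
    first_date = dates.groupby(positions["stormID"]).min()
    out = storms.copy()
    # Look up each storm's first date by ID, then keep only the month number.
    out["month_formed"] = out["stormID"].map(first_date).dt.month
    return out


storms = add_month_formed(storms, positions)
print(storms)
```

A storm with no rows in `positions` would end up with a missing value here instead of raising an error, so a quick check for nulls after the call is probably sensible.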

### Web scraping for latest version of dataset

Earlier I ran into the problem that the dataset's filename changes with each new release. However, the link to the dataset always appears in the same place on the NHC Data Archive page, so a simple BeautifulSoup scraper lets the download function always fetch the latest version.

There's really no need to keep a historical version of the dataset, since any rows newer than the period of interest can simply be filtered out.
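
As a rough illustration of that kind of filtering, the sketch below keeps only storms up to a chosen season. The rows are toy data shaped like the `storms` table, and the `last_year` cutoff is a hypothetical value picked for the example.

```python
import pandas as pd

# Toy rows shaped like the storms table (values are illustrative only).
storms = pd.DataFrame({
    "stormID": ["AL011851", "AL092017", "AL012018"],
    "name": ["UNNAMED", "HARVEY", "ALBERTO"],
    "year": [1851, 2017, 2018],
})

last_year = 2017  # hypothetical cutoff

# Keep only storms from seasons up to and including the cutoff year.
storms_subset = storms[storms["year"] <= last_year]
print(storms_subset)
```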

In [17]:
import os, sys
import pandas as pd

import requests
from bs4 import BeautifulSoup

root_dir = os.path.join(os.getcwd(), '..')
sys.path.append(root_dir)

def download_atlantic_hurdat_raw(dest_filename = "Atlantic"):
    
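    URL = "https://www.nhc.noaa.gov/data/"
    r = requests.get(URL)
    r.raise_for_status()  # stop early if the archive page could not be fetched

    soup = BeautifulSoup(r.content, 'html5lib')

    # Find the <span> that labels the HURDAT2 Atlantic entry on the archive page
    target = None
    for element in soup.find_all('span'):
        if "Atlantic hurricane database (HURDAT2)" in element.text:
            target = element

    if target is None:
        raise ValueError("HURDAT2 Atlantic entry not found on the NHC Data Archive page")

    # The download link sits two siblings after the label; this depends on the page layout
    source = "https://www.nhc.noaa.gov" + target.next_sibling.next_sibling.attrs['href']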

    download_dataset = pd.read_csv(source, header = None, names = list(range(0, 20)))

    download_dataset.to_csv(f"../data/01_raw/{dest_filename}.csv", 
                            header = False, index = False)

    print(f"Downloaded data to /data/01_raw/{dest_filename}.csv")
    
    # Show the newly downloaded dataset
    return download_dataset.head()

In [18]:
download_atlantic_hurdat_raw()

Downloaded data to /data/01_raw/Atlantic.csv


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,AL011851,UNNAMED,14.0,,,,,,,,,,,,,,,,,
1,18510625,0000,,HU,28.0N,94.8W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
2,18510625,0600,,HU,28.0N,95.4W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
3,18510625,1200,,HU,28.0N,96.0W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
4,18510625,1800,,HU,28.1N,96.5W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0


These minor changes should leave the project in a good state for a first push to GitHub!