# Project Introduction

This is a practice data science project through which I hope to gain a better understanding of Python and Pandas and to explore packages for data visualization, particularly those for mapping and handling geographic data. Many of my personal and academic interests have a geographic component, including climate and weather, politics, and demography.

These notebooks will catalog each stage of development of this project and will remain in the repository to serve as an explanation of my thought process and a record of learning. 

# July 31, 2020

## Downloading Data

We want to use `pandas.read_csv()` to grab the HURDAT2 dataset from [NOAA's NHC Data Archive](https://www.nhc.noaa.gov/data).

The most recent version of the data is linked from the page above, and earlier releases remain available under the same standardized URL format. This means we can write a function that takes the release details as input, which also lets us fetch historical records.

We'll also use `DataFrame.to_csv()` to save the raw data to our `/data/01_raw` directory, so that we keep a copy of the raw data that is never altered during analysis.
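
One thing to watch for when saving: `to_csv()` does not create missing folders, so the write fails if the raw data directory doesn't exist yet. Below is a minimal sketch of one way to guard against that. The helper name `raw_data_path` and its `raw_dir` parameter are my own assumptions for illustration, not part of the project code.

```python
from pathlib import Path


def raw_data_path(dest_filename, raw_dir="../data/01_raw"):
    """Return the path for a raw CSV, creating the directory if needed.

    Hypothetical helper for illustration; not part of the project code.
    """
    directory = Path(raw_dir)
    # parents=True builds any missing intermediate folders;
    # exist_ok=True makes the call safe to repeat on later runs.
    directory.mkdir(parents=True, exist_ok=True)
    return directory / f"{dest_filename}.csv"
```

The returned path could then be passed directly to `to_csv()` in place of a hard-coded string.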

In [1]:
import pandas as pd

# The URL format is slightly different for the Atlantic and Northeast Pacific datasets, so we'll write a function for each.

# Each function takes the most recent season and the update date, which together identify the file (defaulting to the
# most recent version at the time of writing), plus a destination filename that defaults to the name of the
# appropriate ocean.
def download_atlantic_hurdat_raw(recent_season = "2019", update_date = "052520", dest_filename = "Atlantic"):
    
    # Set up the URL as an f-string.
    url = f"https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-{recent_season}-{update_date}.txt"

    download_dataset = pd.read_csv(url, header = None, names = list(range(0, 20)))
    download_dataset.to_csv(f"../data/01_raw/{dest_filename}.csv", header = False, index = False)

    # Print a message confirming the download.
    print(f"Downloaded data to /data/01_raw/{dest_filename}.csv")
    
    # Return the first few entries so we can check that the data downloaded correctly.
    return download_dataset.head()

def download_pacific_hurdat_raw(recent_season = "2019", update_date = "042320", dest_filename = "Pacific"):
   
    url = f"https://www.nhc.noaa.gov/data/hurdat/hurdat2-nepac-1949-{recent_season}-{update_date}.txt"

    download_dataset = pd.read_csv(url, header = None, names = list(range(0, 20)))
    download_dataset.to_csv(f"../data/01_raw/{dest_filename}.csv", header = False, index = False)
    
    print(f"Downloaded data to /data/01_raw/{dest_filename}.csv")
    
    return download_dataset.head()
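
The two functions above differ only in the filename prefix of the URL, so the URL logic could be factored into one place. This is a sketch of that idea, not a replacement for the functions above. `HURDAT_PREFIXES` and `build_hurdat_url` are hypothetical names, and the prefixes are copied from the two URLs already used in this notebook.

```python
# Filename prefixes copied from the two URLs used in the functions above.
HURDAT_PREFIXES = {
    "atlantic": "hurdat2-1851",
    "pacific": "hurdat2-nepac-1949",
}


def build_hurdat_url(basin, recent_season, update_date):
    """Build the download URL for one basin (hypothetical helper)."""
    try:
        prefix = HURDAT_PREFIXES[basin.lower()]
    except KeyError:
        # Fail early with a clear message instead of requesting a bad URL.
        raise ValueError(f"Unknown basin: {basin!r}") from None
    return (
        "https://www.nhc.noaa.gov/data/hurdat/"
        f"{prefix}-{recent_season}-{update_date}.txt"
    )
```

With the defaults from this notebook, `build_hurdat_url("atlantic", "2019", "052520")` produces the same Atlantic URL as the f-string in `download_atlantic_hurdat_raw`.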

Let's test the functions out.

In [2]:
download_atlantic_hurdat_raw()

Downloaded data to /data/01_raw/Atlantic.csv


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,AL011851,UNNAMED,14.0,,,,,,,,,,,,,,,,,
1,18510625,0000,,HU,28.0N,94.8W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
2,18510625,0600,,HU,28.0N,95.4W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
3,18510625,1200,,HU,28.0N,96.0W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
4,18510625,1800,,HU,28.1N,96.5W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0


In [3]:
download_pacific_hurdat_raw()

Downloaded data to /data/01_raw/Pacific.csv


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,EP011949,UNNAMED,7.0,,,,,,,,,,,,,,,,,
1,19490611,0000,,TS,20.2N,106.3W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
2,19490611,0600,,TS,20.2N,106.4W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
3,19490611,1200,,TS,20.2N,106.7W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
4,19490611,1800,,TS,20.3N,107.7W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0


## Modularizing

Ultimately I'd like to be able to import these functions into other Python files when analyzing the data. The download functions should also only need to run once, unless we're replacing our current data with an updated release, so it would be nice not to keep them in the body of the analysis code. They could really just be run from a command line.

We'll make a module in the `/src/d01_data` directory, `data_download.py`.
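
For the command-line idea, here is a rough sketch of how arguments might be parsed with the standard library's `argparse`. The function name `parse_download_args` and the flag names are assumptions of mine. They simply mirror the keyword arguments of the download functions, and nothing like this exists in the module yet.

```python
import argparse


def parse_download_args(argv=None):
    """Parse command-line options for a download (hypothetical helper).

    Returns the chosen basin and a dict containing only the options the
    user actually supplied, so each download function's own defaults
    still apply to everything else.
    """
    parser = argparse.ArgumentParser(description="Download a raw HURDAT2 dataset.")
    parser.add_argument("basin", choices=["atlantic", "pacific"])
    parser.add_argument("--recent-season", default=None)
    parser.add_argument("--update-date", default=None)
    parser.add_argument("--dest-filename", default=None)
    args = parser.parse_args(argv)

    # argparse stores --recent-season as args.recent_season, and so on,
    # which matches the keyword names of the download functions.
    kwargs = {
        key: value
        for key, value in vars(args).items()
        if key != "basin" and value is not None
    }
    return args.basin, kwargs
```

Under an `if __name__ == "__main__":` guard, the module could then call the matching download function with `**kwargs`.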

We'll also add the root directory for the project to `sys.path` so that we'll be able to import our modules.

In [4]:
import os, sys
import pandas as pd

root_dir = os.path.join(os.getcwd(), '..')
sys.path.append(root_dir)
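
One caveat with the cell above is that rerunning it appends the same entry to `sys.path` again each time. A slightly more defensive variant might look like the sketch below. `add_project_root` is a hypothetical helper, and it assumes, as the cell above does, that the notebook lives one level below the project root.

```python
import sys
from pathlib import Path


def add_project_root(notebook_dir=None):
    """Add the parent of the notebook directory to sys.path.

    Hypothetical helper; assumes the notebook sits one level below the
    project root.
    """
    base = Path(notebook_dir) if notebook_dir is not None else Path.cwd()
    # resolve() gives an absolute path, so no ".." segment is left in it.
    root = str(base.resolve().parent)
    # Skip the append when the entry is already there, so rerunning the
    # cell does not pile up duplicates.
    if root not in sys.path:
        sys.path.append(root)
    return root
```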

Now we can test the functions!

In [5]:
from src.d01_data import data_download as raw

raw.download_atlantic_hurdat_raw()

Downloaded data to /data/01_raw/Atlantic.csv


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,AL011851,UNNAMED,14.0,,,,,,,,,,,,,,,,,
1,18510625,0000,,HU,28.0N,94.8W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
2,18510625,0600,,HU,28.0N,95.4W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
3,18510625,1200,,HU,28.0N,96.0W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
4,18510625,1800,,HU,28.1N,96.5W,80.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0


In [6]:
raw.download_pacific_hurdat_raw()

Downloaded data to /data/01_raw/Pacific.csv


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,EP011949,UNNAMED,7.0,,,,,,,,,,,,,,,,,
1,19490611,0000,,TS,20.2N,106.3W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
2,19490611,0600,,TS,20.2N,106.4W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
3,19490611,1200,,TS,20.2N,106.7W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
4,19490611,1800,,TS,20.3N,107.7W,45.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0


Everything seems to work as expected! Now we have our data and can move on to cleaning.