In this problem we look at the Global Historical Climatology Network (GHCN).

**You are free to add implementation or markdown cells to make your notebook clearer!**

## Data:

The following dataset is our focus:

* Weather data [NOAA-GHCN](https://registry.opendata.aws/noaa-ghcn/)

## Part 1: Download The Weather Data




Download a year of weather data.

The raw GHCN files don't have column headers, so we add them manually. At this point it is safer to read everything in as an object and parse to the correct type once you have extracted the variables you're interested in.
The column definitions can be found in https://docs.opendata.aws/noaa-ghcn-pds/readme.html
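
As a minimal sketch of the "read as object, parse later" idea, the cell below uses plain pandas on a tiny in-memory sample. The three rows are made up for illustration (they are not real observations); only the column layout follows the GHCN CSV format.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the header-less GHCN CSV layout.
sample = io.StringIO(
    "USW00094728,19920101,TMAX,61,,,0,2400\n"
    "USW00094728,19920101,TMIN,-11,,,0,2400\n"
    "USW00094728,19920101,PRCP,0,T,,0,2400\n"
)
names = ['ID', 'DATE', 'ELEMENT', 'DATA_VALUE',
         'M-FLAG', 'Q-FLAG', 'S-FLAG', 'OBS-TIME']

# Read every column as a string first so nothing is silently coerced.
raw = pd.read_csv(sample, names=names, dtype=str)

# Extract the variable of interest, then convert only the columns we need.
tmax = raw[raw['ELEMENT'] == 'TMAX'].copy()
tmax['DATE'] = pd.to_datetime(tmax['DATE'], format='%Y%m%d')
tmax['DATA_VALUE'] = pd.to_numeric(tmax['DATA_VALUE'], errors='coerce')

print(tmax[['ID', 'DATE', 'ELEMENT', 'DATA_VALUE']])
```

Passing `errors='coerce'` turns any unparseable value into `NaN` instead of raising, which makes bad records easy to spot after the conversion.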

In [1]:
import urllib 

import pandas as pd

import dask.dataframe as dd
import dask.bag as db
import dask.diagnostics as dg

We're using Dask for its lazy evaluation: because the dataset is very large, computations only run at the end, ideally after the data has been filtered down. We set the storage options to `anon=True` because this data is public; otherwise this kwarg is where we'd pass in the AWS authorization keys.

In [2]:
# Let's load in the data for 1992
YEAR = 1992

names = ['ID', 'DATE', 'ELEMENT', 'DATA_VALUE', 'M-FLAG', 'Q-FLAG', 'S-FLAG', 'OBS-TIME']
ds = dd.read_csv(f's3://noaa-ghcn-pds/csv/{YEAR}.csv', storage_options={'anon':True},  names=names, memory_map=False, 
                  dtype={'DATA_VALUE':'object'}, parse_dates=['DATE', 'OBS-TIME'])

In [3]:
mos = dd.read_csv('Examples/mos/modelrun/mav2019*.csv')

In [4]:
ls -lh Examples/mos/

total 16K
drwxrwsr-x 2 jovyan users 4.0K Jul 28 02:59 log/
drwxrwsr-x 2 jovyan users  12K Jul 29 00:20 modelrun/


In [5]:
# You can check the data
print(ds.columns)
print(ds.dtypes)

Index(['ID', 'DATE', 'ELEMENT', 'DATA_VALUE', 'M-FLAG', 'Q-FLAG', 'S-FLAG',
       'OBS-TIME'],
      dtype='object')
ID                    object
DATE          datetime64[ns]
ELEMENT               object
DATA_VALUE            object
M-FLAG                object
Q-FLAG                object
S-FLAG                object
OBS-TIME              object
dtype: object


In [6]:
# Print out the first few rows
ds.head()

Unnamed: 0,ID,DATE,ELEMENT,DATA_VALUE,M-FLAG,Q-FLAG,S-FLAG,OBS-TIME
0,AE000041196,1992-01-01,TMAX,269,,,I,
1,AE000041196,1992-01-01,TMIN,97,,,I,
2,AE000041196,1992-01-01,TAVG,179,H,,S,
3,AEM00041194,1992-01-01,TMAX,273,,,S,
4,AEM00041194,1992-01-01,TMIN,130,,,S,


Now we want to parse out the station ID list. We are using [pandas.read_fwf](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_fwf.html#pandas.read_fwf) because this file is a fixed-width table rather than a CSV file.
We explicitly pass in the extents of the fixed-width fields because pandas has trouble inferring what belongs in the `STATE` column versus the `NAME` column. We obtained these extents from the readme: https://docs.opendata.aws/noaa-ghcn-pds/readme.html

In [7]:
# {column name:extents of the fixed-width fields}
columns = {"ID": (0,11), "LATITUDE": (12, 20), "LONGITUDE": (21, 30), "ELEVATION": (31, 37),"STATE": (38, 40),
           "NAME": (41, 71), "GSN FLAG": (72, 75), "HCN/CRN FLAG": (76, 79),"WMO ID": (80, 85)}

In [8]:
df = pd.read_fwf("http://noaa-ghcn-pds.s3.amazonaws.com/ghcnd-stations.txt", 
                    colspecs=list(columns.values()), names=list(columns.keys()))

In [9]:
df.head()

Unnamed: 0,ID,LATITUDE,LONGITUDE,ELEVATION,STATE,NAME,GSN FLAG,HCN/CRN FLAG,WMO ID
0,ACW00011604,17.1167,-61.7833,10.1,,ST JOHNS COOLIDGE FLD,,,
1,ACW00011647,17.1333,-61.7833,19.2,,ST JOHNS,,,
2,AE000041196,25.333,55.517,34.0,,SHARJAH INTER. AIRP,GSN,,41196.0
3,AEM00041194,25.255,55.364,10.4,,DUBAI INTL,,,41194.0
4,AEM00041217,24.433,54.651,26.8,,ABU DHABI INTL,,,41217.0
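
To see how explicit `colspecs` keep `STATE` and `NAME` apart, here is a small self-contained sketch. The helper `make_line` is hypothetical and only pads two sample rows to the same field extents used above; the real station file is not downloaded.

```python
import io

import pandas as pd

# Same {column name: (start, end)} extents as above; intervals are half-open.
columns = {"ID": (0, 11), "LATITUDE": (12, 20), "LONGITUDE": (21, 30),
           "ELEVATION": (31, 37), "STATE": (38, 40), "NAME": (41, 71),
           "GSN FLAG": (72, 75), "HCN/CRN FLAG": (76, 79), "WMO ID": (80, 85)}


def make_line(id_, lat, lon, elev, state, name, gsn="", hcn="", wmo=""):
    """Hypothetical helper: pad each field to its fixed width."""
    return (f"{id_:<11} {lat:>8.4f} {lon:>9.4f} {elev:>6.1f} {state:<2} "
            f"{name:<30} {gsn:<3} {hcn:<3} {wmo:<5}")


text = "\n".join([
    make_line("USW00094728", 40.7789, -73.9692, 39.6, "NY",
              "NEW YORK CNTRL PK TWR", hcn="HCN", wmo="72506"),
    # No state code here: the STATE field is blank, not merged into NAME.
    make_line("AE000041196", 25.333, 55.517, 34.0, "",
              "SHARJAH INTER. AIRP", gsn="GSN", wmo="41196"),
])

stations = pd.read_fwf(io.StringIO(text), colspecs=list(columns.values()),
                       names=list(columns.keys()), header=None)
print(stations[["ID", "STATE", "NAME"]])
```

With fixed extents, a blank `STATE` field is read as missing rather than shifting the first word of the station name into the `STATE` column.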


In [10]:
# You should be looking for those in the New York area like Central Park, JFK, LGA and Newark airport.
NYNJ = df[df['STATE'].isin(['NY', 'NJ'])]
NYNJ.head()

Unnamed: 0,ID,LATITUDE,LONGITUDE,ELEVATION,STATE,NAME,GSN FLAG,HCN/CRN FLAG,WMO ID
73674,US1NJAT0001,39.5483,-74.8671,31.4,NJ,BUENA VISTA TWP 2.6 NNE,,,
73675,US1NJAT0002,39.5565,-74.8048,14.0,NJ,FOLSOM 3.2 SE,,,
73676,US1NJAT0003,39.4747,-74.7107,5.5,NJ,HAMILTON TWP 2.1 SE,,,
73677,US1NJAT0005,39.6404,-74.8261,29.9,NJ,HAMMONTON 3.3 WSW,,,
73678,US1NJAT0009,39.3346,-74.5759,5.8,NJ,LINWOOD 0.7 SSW,,,


Central Park is coded in shorthand, so we used the NOAA web portal to look up the correct ID:
https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail

In [11]:
NYNJ[NYNJ['ID'].str.contains('USW00094728')]

Unnamed: 0,ID,LATITUDE,LONGITUDE,ELEVATION,STATE,NAME,GSN FLAG,HCN/CRN FLAG,WMO ID
114226,USW00094728,40.7789,-73.9692,39.6,NY,NEW YORK CNTRL PK TWR,,HCN,72506.0


In [12]:
# Airports + Central Park
apcp = NYNJ[NYNJ['NAME'].str.endswith('AP') | NYNJ['ID'].str.contains('USW00094728')]
apcp.head()

Unnamed: 0,ID,LATITUDE,LONGITUDE,ELEVATION,STATE,NAME,GSN FLAG,HCN/CRN FLAG,WMO ID
100219,USC00305840,43.1139,-78.9353,179.2,NY,NIAGARA FALLS INTL AP,,,
112764,USW00004724,43.1072,-78.9453,178.3,NY,NIAGARA FALLS INTL AP,,,
112769,USW00004742,44.65,-73.4667,71.9,NY,PLATTSBURGH INTL AP,,,
112775,USW00004781,40.7939,-73.1017,25.6,NY,ISLIP LI MACARTHUR AP,,,72505.0
112779,USW00004789,41.5092,-74.265,111.3,NY,MONTGOMERY ORANGE AP,,,


We are interested in the IDs, which we use to restrict our dataset to the stations of interest. We join the two dataframes on the `ID` column so that every row carries the full station information. We drop the station flag columns (`GSN FLAG`, `HCN/CRN FLAG`) because they are needed neither for computation nor for identification.
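
The join can be illustrated on toy data with plain pandas. This is only a sketch: the observation values below are invented, and `obs`, `stations`, and `keep` are hypothetical names.

```python
import pandas as pd

# Made-up observations for two stations.
obs = pd.DataFrame({
    'ID': ['USW00094728', 'USW00094728', 'AE000041196'],
    'DATE': pd.to_datetime(['1992-01-01'] * 3),
    'ELEMENT': ['TMAX', 'TMIN', 'TMAX'],
    'DATA_VALUE': ['61', '-11', '269'],
})

# Station metadata for the one station we care about.
stations = pd.DataFrame({
    'ID': ['USW00094728'],
    'LATITUDE': [40.7789],
    'LONGITUDE': [-73.9692],
    'NAME': ['NEW YORK CNTRL PK TWR'],
    'GSN FLAG': [None],
})

# Selecting columns before the merge drops the station flag columns.
keep = ['ID', 'LATITUDE', 'LONGITUDE', 'NAME']

# merge() is an inner join by default, so observations from stations
# that are not in `stations` are filtered out.
joined = obs.merge(stations[keep], on='ID')
print(joined)
```

Because the join is inner, it both attaches the station metadata and acts as the station filter in a single step.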

We do not call `.compute()` to resolve the computation here because it's better to hold off until the feature selection and engineering described below are complete. If you'd like a fully computed dataframe, call `.compute()` on the merged dataframe (for example, `nyds.compute()`).


In [13]:
nyds = ds.merge(apcp[['ID', 'LATITUDE', 'LONGITUDE', 'ELEVATION', 'STATE', 'NAME']], on='ID')

In [14]:
nyds.head()

Unnamed: 0,ID,DATE,ELEMENT,DATA_VALUE,M-FLAG,Q-FLAG,S-FLAG,OBS-TIME,LATITUDE,LONGITUDE,ELEVATION,STATE,NAME
0,USC00305840,1992-01-01,TMAX,22,,,0,2400.0,43.1139,-78.9353,179.2,NY,NIAGARA FALLS INTL AP
1,USC00305840,1992-01-01,TMIN,-89,,,0,2400.0,43.1139,-78.9353,179.2,NY,NIAGARA FALLS INTL AP
2,USC00305840,1992-01-01,TOBS,-89,,I,0,2400.0,43.1139,-78.9353,179.2,NY,NIAGARA FALLS INTL AP
3,USC00305840,1992-01-01,PRCP,0,T,,0,2400.0,43.1139,-78.9353,179.2,NY,NIAGARA FALLS INTL AP
4,USC00305840,1992-01-01,SNOW,0,T,,0,,43.1139,-78.9353,179.2,NY,NIAGARA FALLS INTL AP


## Part 2: Creating and Selecting Variables

Pull out and encode the variables listed below, and set them up, at least initially, in a pandas data frame.

### Weather variables

* raining:
    - 0 - wasn't raining
    - 1 - was raining
* rain intensity:
    - 0 - low
    - 1 - medium
    - 2 - high
* rain duration in hours
* snowing:
    - 0 - wasn't snowing
    - 1 - was snowing
* snow intensity:
    - 0 - low
    - 1 - medium
    - 2 - high
* snow duration in hours
* windy:
    - 0 - low
    - 1 - medium
    - 2 - high
    
Make sure you have aligned the data by date in a pandas data frame. Show the counts and the summary stats.
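
One possible way to start, sketched on invented data: pivot the long table so each date is one row, then derive the flags and intensity classes. The bin edges in `rain_bins` and `wind_bins` are hypothetical cut points chosen for illustration, not official definitions, and the sketch assumes `PRCP` is in tenths of mm, `SNOW` in mm, and `AWND` in tenths of m/s as described in the GHCN readme. Snow intensity and the duration variables are not covered here.

```python
import pandas as pd

# Invented long-format data for a single station over three days.
long_df = pd.DataFrame({
    'DATE': pd.to_datetime(['1992-01-01'] * 3
                           + ['1992-01-02'] * 3
                           + ['1992-01-03'] * 3),
    'ELEMENT': ['PRCP', 'SNOW', 'AWND'] * 3,
    'DATA_VALUE': ['0', '0', '20', '30', '0', '45', '300', '150', '90'],
})
long_df['DATA_VALUE'] = pd.to_numeric(long_df['DATA_VALUE'])

# Align by date: one row per day, one column per element.
wide = long_df.pivot(index='DATE', columns='ELEMENT', values='DATA_VALUE')

features = pd.DataFrame(index=wide.index)
features['raining'] = (wide['PRCP'] > 0).astype(int)
features['snowing'] = (wide['SNOW'] > 0).astype(int)

# Hypothetical thresholds (assumptions, adjust to your own definition).
rain_bins = [0, 25, 100, float('inf')]  # tenths of mm
wind_bins = [0, 30, 80, float('inf')]   # tenths of m/s
features['rain_intensity'] = pd.cut(
    wide['PRCP'], bins=rain_bins, labels=[0, 1, 2], include_lowest=True
).astype(int)
features['windy'] = pd.cut(
    wide['AWND'], bins=wind_bins, labels=[0, 1, 2], include_lowest=True
).astype(int)

# Counts and summary statistics.
print(features['raining'].value_counts())
print(features.describe())
```

With several stations, the station `ID` would need to be part of the pivot index (or the data aggregated per day first) so that each date and station pair stays unique.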