![Whiteboard[2]-01.png](Whiteboard[2]-01.png)

In [None]:
%matplotlib inline

import dask.dataframe as dd
import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

Read in GHCN

In [None]:
YEAR = 2018
names = ['ID', 'DATE', 'ELEMENT', 'DATA_VALUE', 'M-FLAG', 'Q-FLAG', 'S-FLAG', 'OBS-TIME']
ds = dd.read_csv(f's3://noaa-ghcn-pds/csv/{YEAR}.csv', storage_options={'anon':True},  
                 names=names, parse_dates=['DATE'], dtype={'DATA_VALUE':'object'})

In [None]:
slookup = pd.read_csv('ghcn_mos_lookup.csv')

Filter down to US stations, keeping only the columns we need for visualization and only the TAVG variable

In [None]:
ghcn = ds[['ID', 'DATE', 'ELEMENT', 'DATA_VALUE']][ds['ID'].isin(slookup['ID']) & ds['ELEMENT'].str.match('TAVG')].compute()

In [None]:
ghcn.head()
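
To see what the filter above does, here is a minimal sketch on a made-up table. The IDs and values are invented (not real GHCN data), and it uses plain pandas instead of dask so it runs without S3 access. Note that `str.match` anchors at the start of the string, so `'TAVG'` keeps only elements that begin with `TAVG`.

```python
import pandas as pd

# Toy records in the same long format as the GHCN file (invented values)
toy = pd.DataFrame({
    'ID': ['US1', 'US1', 'CA1', 'US2'],
    'ELEMENT': ['TAVG', 'TMAX', 'TAVG', 'TAVG'],
    'DATA_VALUE': ['15', '80', '-30', '22'],
})

# Hypothetical stand-in for slookup['ID']
wanted_ids = pd.Series(['US1', 'US2'])

# Row mask: station is in the lookup AND the element is TAVG
keep = toy['ID'].isin(wanted_ids) & toy['ELEMENT'].str.match('TAVG')

# Apply the row mask and pick the columns in one step
subset = toy.loc[keep, ['ID', 'ELEMENT', 'DATA_VALUE']]
print(subset)
```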

Turn this into a station by date table

In [None]:
ghcnt = ghcn.pivot(index='DATE', columns='ID', values='DATA_VALUE')
ghcnt.head()
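
The pivot above reshapes one-row-per-observation data into one row per date and one column per station. A small sketch with invented values shows the shape change. One thing to keep in mind: `pivot` raises a `ValueError` if the same (DATE, ID) pair appears more than once, so this assumes each station reports TAVG at most once per day.

```python
import pandas as pd

# Toy long-format records (invented values, not real GHCN data)
toy = pd.DataFrame({
    'ID': ['A', 'B', 'A', 'B'],
    'DATE': pd.to_datetime(['2018-01-01', '2018-01-01',
                            '2018-01-02', '2018-01-02']),
    'DATA_VALUE': ['10', '20', '30', '40'],
})

# One row per date, one column per station
toy_wide = toy.pivot(index='DATE', columns='ID', values='DATA_VALUE')
print(toy_wide)
```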

In [None]:
# Compute our station order, which is how we'll line up GHCN against MOS

In [None]:
mos_0 = pd.read_csv('mos2018/MOS_2018_0_days_06:00:00.csv')[['station', 'runtime','TMP']]
mos_0.head()

In [None]:
in_both = (slookup['ID'].isin(ghcnt.columns) & slookup['Station'].isin(mos_0['station'].unique()))
station_order = slookup[['ID','Station']][in_both]

In [None]:
station_order.head()
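
As a sketch of how the `in_both` mask works, the toy lookup below keeps only the rows whose GHCN ID and MOS station name both appear in the data. All IDs and station names here are invented for illustration.

```python
import pandas as pd

# Hypothetical lookup table pairing GHCN IDs with MOS station names
lookup = pd.DataFrame({'ID': ['G1', 'G2', 'G3'],
                       'Station': ['KAAA', 'KBBB', 'KCCC']})

# Stand-ins for ghcnt.columns and mos_0['station']
ghcn_ids = pd.Index(['G1', 'G3'])
mos_stations = pd.Series(['KAAA', 'KBBB', 'KAAA'])

# A row survives only if it is present on both sides
mask = lookup['ID'].isin(ghcn_ids) & lookup['Station'].isin(mos_stations.unique())
order = lookup.loc[mask, ['ID', 'Station']]
print(order)
```

Only `G1`/`KAAA` survives: `G2` has no GHCN column and `KCCC` has no MOS rows. Indexing both tables with the same `order` rows is what keeps the two plots aligned station by station.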

# Let's get the GHCN observations in the same order & plot

In [None]:
ghcnt[station_order['ID']].head()

In [None]:
fig, ax = plt.subplots()
ax.set_title("GHCN Observations")
im = ax.pcolormesh(ghcnt[station_order['ID']].values.astype(float).T, cmap='RdBu')
ax.set(ylabel="stations", xlabel="day of year")
ax.set_yticklabels([]) # removed individual station ids cause not super helpful here
ax.tick_params(axis='y', length=0)
fig.colorbar(im, ax=ax)
fig.savefig("GHCN_observations.png")

# Plot GFS MAV 6 hour ahead

Open our first prediction (0 days), pivot and line it up w/ GHCN

In [None]:
df = pd.read_csv('mos2018/MOS_2018_0_days_06:00:00.csv',
                 parse_dates=['runtime', 'ftime']).drop_duplicates()

In [None]:
#somehow there's a row where the header names got repeated
df.drop(df[df['station'].str.match('station')].index, inplace=True)

In [None]:
#filter & convert
dfc = df[['station', 'TMP']].astype({'TMP':float})
dfc['runtime'] = pd.to_datetime(df['runtime'])

In [None]:
# we have 4 predictions per day, so let's average to one prediction per day
dfc['runtime'].unique()

For each station, we need to take the average of the 4 predictions per day. We'll end up with a table structured like the GHCN one above

In [None]:
mos_0_table = dfc.groupby(['station', pd.Grouper(freq='D', key='runtime')]).mean().unstack()['TMP'].T

In [None]:
mos_0_table.head()
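
To make the groupby step concrete, here is a minimal sketch on invented numbers: two made-up stations with four runs on a single day. `pd.Grouper(freq='D', key='runtime')` buckets the timestamps by calendar day, `mean()` averages within each (station, day) bucket, and `unstack()['TMP'].T` turns the result into a date-by-station table.

```python
import pandas as pd

# Four runs in one day for each of two hypothetical stations
toy = pd.DataFrame({
    'station': ['KAAA'] * 4 + ['KBBB'] * 4,
    'runtime': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 06:00',
                               '2018-01-01 12:00', '2018-01-01 18:00'] * 2),
    'TMP': [10.0, 20.0, 30.0, 40.0, 0.0, 2.0, 4.0, 6.0],
})

# Average per station per day, then reshape to dates x stations
daily = (toy.groupby(['station', pd.Grouper(freq='D', key='runtime')])
            .mean()
            .unstack()['TMP']
            .T)
print(daily)
```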

Same plotting code, now with the MOS table in place of the GHCN one

In [None]:
fig, ax = plt.subplots()
ax.set_title("GFS MAV 6HR ahead prediction")
im = ax.pcolormesh(mos_0_table[station_order['Station']].values.astype(float).T, cmap='RdBu')
ax.set(ylabel="stations", xlabel="day of year")
ax.set_yticklabels([]) # removed individual station ids cause not super helpful here
ax.tick_params(axis='y', length=0)
fig.colorbar(im, ax=ax)
fig.savefig("MOS_6hr.png")

# To do:
Repeat the "Plot GFS MAV 6 hour ahead" steps for
* MOS_2018_0_days_06:00:00.csv  
* MOS_2018_0_days_09:00:00.csv  
* MOS_2018_0_days_12:00:00.csv
* MOS_2018_0_days_15:00:00.csv  
* MOS_2018_0_days_18:00:00.csv  
* MOS_2018_0_days_21:00:00.csv  
* MOS_2018_1_days_00:00:00.csv  
* MOS_2018_1_days_03:00:00.csv  
* MOS_2018_1_days_06:00:00.csv  
* MOS_2018_1_days_09:00:00.csv  
* MOS_2018_1_days_12:00:00.csv 
* MOS_2018_1_days_15:00:00.csv
* MOS_2018_1_days_18:00:00.csv
* MOS_2018_1_days_21:00:00.csv
* MOS_2018_2_days_00:00:00.csv
* MOS_2018_2_days_03:00:00.csv
* MOS_2018_2_days_06:00:00.csv
* MOS_2018_2_days_09:00:00.csv
* MOS_2018_2_days_12:00:00.csv
* MOS_2018_2_days_18:00:00.csv
* MOS_2018_368_days_00:00:00.csv
* MOS_2018_3_days_00:00:00.csv

You can ignore those last two for now
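
One possible way to avoid copy-pasting the cells for every file is to wrap the read, clean, and average steps in a function. The sketch below is only a suggestion: `load_mos_table` is a hypothetical helper name, it assumes every file has the same `station`, `runtime`, and `TMP` columns as the first one, and it reads everything as strings (`dtype=str`) rather than using `parse_dates`, so that the stray repeated-header row can be dropped before converting types.

```python
import io
import pandas as pd

def load_mos_table(path_or_buffer):
    """Hypothetical helper: one MOS CSV -> date-by-station table of daily mean TMP."""
    df = pd.read_csv(path_or_buffer, dtype=str).drop_duplicates()
    # Drop any row where the header names got repeated in the data
    df = df[~df['station'].str.match('station', na=False)]
    # Keep what we need and convert types
    dfc = df[['station', 'TMP']].astype({'TMP': float})
    dfc['runtime'] = pd.to_datetime(df['runtime'])
    # Average the runs within each day, then reshape to dates x stations
    return (dfc.groupby(['station', pd.Grouper(freq='D', key='runtime')])
               .mean()
               .unstack()['TMP']
               .T)

# Tiny in-memory example with invented rows, including a repeated header
sample = io.StringIO(
    "station,runtime,ftime,TMP\n"
    "KAAA,2018-01-01 00:00:00,2018-01-01 06:00:00,10\n"
    "KAAA,2018-01-01 06:00:00,2018-01-01 12:00:00,20\n"
    "station,runtime,ftime,TMP\n"
    "KBBB,2018-01-01 00:00:00,2018-01-01 06:00:00,5\n"
)
table = load_mos_table(sample)
print(table)
```

With real data, the same function could be called in a loop over the file names listed above, for example `load_mos_table('mos2018/MOS_2018_0_days_09:00:00.csv')`, and each result passed to the plotting code with `station_order['Station']` selecting the columns.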