## Obtain Data

In this notebook we perform the following steps:
* Establish the first hour of the dataset
* For the first month,
  * Obtain a list of available stations by state
  * Obtain temperature observations from weather stations in the MISO footprint
    * Stations are organized into MISO regions by state boundaries
    * Stations are predominantly clustered in population centers, making many observations redundant
    * Some stations have many lacunae (gaps) in their records
  * Obtain the actual hourly MISO Load data and historical Medium-Term Load Forecasts (MTLF)
  * Join Load and MTLF data with weather observations to complete the raw data
  * Persist the data and demonstrate use of dataset wrapper class
* Update the dataset to the current day
* Identify and mitigate lacunae
* Publish the data

### Define Date Ranges

In [436]:
%load_ext autoreload
%autoreload 2

from datetime import datetime, timezone, timedelta
from weather_data import prevailing_time as est

# beginning of MISO's historical records that include the southern region (zones 8-10)
first_hour = est(2015, 2, 1, 0)
# five hours of weather data were lost on 2015-06-17
# first_hour = est(2015, 7, 1, 0)

# latest date with actual load data available is
# l = date.today() - timedelta(days=2)
# last_hour = est(l.year, l.month, l.day, 23)
# instead, fix the date for repeatability
last_hour = est(2022, 3, 31, 23)

test_split = last_hour - timedelta(days=364, hours=23)
validation_split = test_split - timedelta(days=365)


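
To make the split arithmetic concrete, here is a minimal sketch that reproduces the split boundaries with the standard library only. It assumes that `est` returns a timezone-aware `datetime` at a fixed UTC-5 offset; `est_hour` is a hypothetical stand-in for that helper, not the notebook's implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for the notebook's `est` helper.
EST = timezone(timedelta(hours=-5))

def est_hour(year, month, day, hour):
    """Hypothetical stand-in for weather_data.prevailing_time."""
    return datetime(year, month, day, hour, tzinfo=EST)

first_hour = est_hour(2015, 2, 1, 0)
last_hour = est_hour(2022, 3, 31, 23)

# The test set covers the final 365 days, ending at last_hour inclusive.
test_split = last_hour - timedelta(days=364, hours=23)
# The validation set covers the 365 days immediately before the test set.
validation_split = test_split - timedelta(days=365)

# Number of hourly rows in the test set, counting both endpoints.
test_hours = int((last_hour - test_split) / timedelta(hours=1)) + 1

print(test_split, validation_split, test_hours)
```

Subtracting 364 days and 23 hours, rather than a full 365 days, keeps the test set at exactly 8760 hourly rows when both endpoints are included.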


### Obtain Weather Data for Zones

[Iowa State ASOS Network Downloads](https://mesonet.agron.iastate.edu/request/download.phtml)

In [450]:
from weather_data import ASOS
from pathlib import Path
from os.path import isfile
import pandas as pd

zones = { 1 : {'MN': ['MSP',  # Minneapolis / St. Paul (STP)
                      'RST',  # Rochester
                      'DYT'], # Duluth
               'ND': ['FAR',  # Fargo
                      'BIS',  # Bismarck
                      'GFK'], # Grand Forks
               'SD': ['ABR'], # Aberdeen
               'WI': ['LSE'], # La Crosse
               'IL': ['SFY']  # Savanna 
              },
          2 : {'WI': ['MSN', 'MKE', 'EAU', 'GRB'],
               'MI': ['ANJ', 'SAW', 'IWD']},
          3 : {'IA': ['DSM', 'CID', 'DVN', 'SUX', 'ALO', 'MCW']}
        }

def download_file_path(zone, state, station):
    zone_data = f"./data/zone_{zone}"
    Path(zone_data).mkdir(parents=True, exist_ok=True)
    return f'{zone_data}/{state}_{station}.parquet'

asos = ASOS()
def download_station(zone, state, station):
    path = download_file_path(zone, state, station) 
    if isfile(path):
        return pd.read_parquet(path)

    # keep the station code intact so the failure message can report it
    observations = asos.get_hourly_observations(station, first_hour, last_hour)
    if observations is None:
        print(f'Retrieve {station} failed')
        return None
    observations.to_parquet(path)
    return observations

from multiprocessing import cpu_count
from joblib import Parallel, delayed
def do_parallel(func, zones):
    parallel = Parallel(n_jobs=cpu_count())
    result = {}
    for zone in zones:
        stations = [(state, station) for state in zones[zone].keys() for station in zones[zone][state]]
        result[zone] = pd.concat(parallel(delayed(func)(zone, state, station) for (state, station) in stations))
    return result

In [408]:
zone_results = do_parallel(download_station, zones)
all_zones = pd.concat(zone_results.values())

In [409]:
interpolated = all_zones[pd.isna(all_zones['observation_time'])].copy()
f = interpolated.groupby([interpolated.index.date, 'station']).temp.count()
f[f > 4].groupby(level='station').count()

station
ABR     15
ALO      9
ANJ     34
BIS     10
CID     17
DSM      1
DVN     27
DYT     59
EAU      9
FAR      9
GFK      9
GRB      7
IWD    102
LSE     30
MCW     43
MKE      4
MSN      9
MSP      4
RST      9
SAW      6
SFY     98
SUX     38
Name: temp, dtype: int64
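
The counting logic is easier to follow on a small synthetic frame. The sketch below assumes, as the cell above does, that a missing `observation_time` marks an interpolated hour. The station codes `AAA` and `BBB` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two days of hourly timestamps (synthetic).
hours = pd.Timestamp('2015-02-01') + pd.to_timedelta(np.arange(48), unit='h')

def station_frame(name, missing_positions):
    """Build a hypothetical station frame with gaps at the given row positions."""
    observed = pd.Series(hours, index=hours)
    observed.iloc[missing_positions] = pd.NaT  # no observation: the hour was interpolated
    return pd.DataFrame({'station': name, 'temp': 20.0, 'observation_time': observed})

sample = pd.concat([
    station_frame('AAA', list(range(3, 9)) + [30, 31]),  # 6 gaps on day 1, 2 on day 2
    station_frame('BBB', list(range(10, 15))),           # 5 gaps on day 1
])

interpolated = sample[sample['observation_time'].isna()]
# Interpolated hours per calendar day and station.
per_day = interpolated.groupby([interpolated.index.date, 'station']).temp.count()

# Days with more than four interpolated hours, and the worst such day, per station.
days_over_4 = per_day[per_day > 4].groupby(level='station').count()
worst_day = per_day[per_day > 4].groupby(level='station').max()
print(days_over_4, worst_day, sep='\n')
```

Only days that exceed the threshold contribute, so station `AAA`'s second day, with two interpolated hours, is ignored in both summaries.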

In [422]:
f[f > 4].groupby(level='station').max()

station
ALO    23
BIS    11
DSM     5
EAU    24
FAR    14
GFK    10
GRB    10
MKE     8
MSN     5
MSP    10
RST    20
SAW     6
Name: temp, dtype: int64

In [451]:
zones2 = { 1 : {'MN': ['MSP'],
               'ND': ['BIS',  # Bismarck
                      'GFK']  # Grand Forks
              },
          2 : {'WI': ['MSN', 'MKE', 'GRB'],
               'MI': ['SAW']},
          3 : {'IA': ['DSM']}
        }

results2 = do_parallel(download_station, zones2)
all_zones2 = pd.concat(results2.values())

In [457]:
interpolated = all_zones2[pd.isna(all_zones2['observation_time'])].copy()
f = interpolated.groupby([interpolated.index.date, 'station']).temp.count()

f[f > 4].groupby(level='station').max()

station
BIS    11
DSM     6
GFK    11
GRB    11
MKE     9
MSN     8
MSP    11
SAW     8
Name: temp, dtype: int64

In [458]:
f[f > 5].groupby(level='station').count()

station
BIS    10
DSM     1
GFK    12
GRB     9
MKE     4
MSN     9
MSP     4
SAW     8
Name: temp, dtype: int64

### Best Station for Each Zone

In [None]:
zone1_station = 'MSP'
zone2_station = 'MKE'
zone3_station = 'DSM'
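
These choices are consistent with picking, in each zone, the station with the fewest days that exceed five interpolated hours. That criterion is an assumption, since the notebook does not state it. A minimal sketch, using the counts from the last output above and the zone membership from `zones2`:

```python
import pandas as pd

# Days with more than five interpolated hours, copied from the output above.
gap_days = pd.Series({'BIS': 10, 'DSM': 1, 'GFK': 12, 'GRB': 9,
                      'MKE': 4, 'MSN': 9, 'MSP': 4, 'SAW': 8})

# Zone membership flattened from `zones2`.
zone_stations = {1: ['MSP', 'BIS', 'GFK'],
                 2: ['MSN', 'MKE', 'GRB', 'SAW'],
                 3: ['DSM']}

# Assumed criterion: the fewest gap-heavy days wins.
best = {zone: gap_days[stations].idxmin() for zone, stations in zone_stations.items()}
print(best)
```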

#### Consider Population-Weighted Average Temperature Change across Stations

In [441]:
df = results2[1]
df1 = df.pivot(columns='station', values='temp')
df2 = df1.shift(periods=1)
df2.iloc[0] = df2.iloc[1]  # the first row has no predecessor; reuse the first observation so its change is zero
df2.head()

| hour | BIS | GFK | MSP |
|---|---|---|---|
| 2015-02-01 00:00:00-05:00 | -2.02 | -7.96 | 23.0 |
| 2015-02-01 01:00:00-05:00 | -2.02 | -7.96 | 23.0 |
| 2015-02-01 02:00:00-05:00 | -2.92 | -7.96 | 21.02 |
| 2015-02-01 03:00:00-05:00 | -4.0 | -9.04 | 19.04 |
| 2015-02-01 04:00:00-05:00 | -5.08 | -9.94 | 19.04 |


In [447]:
zone1 = (df1 - df2)
zone1.mean(axis=1).value_counts()

 0.000000    2955
-0.666667    1520
-0.300000    1470
-0.366667    1170
 0.666667     957
             ... 
 2.000000       1
 0.800000       1
-3.133333       1
 1.666667       1
-1.433333       1
Length: 3492, dtype: int64
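
The cell above takes an unweighted mean of the hourly changes. A population-weighted version might look like the sketch below. The weights are hypothetical placeholders, not real population figures, and the temperatures are rounded synthetic values.

```python
import pandas as pd

# Synthetic hourly temperatures for three stations.
temps = pd.DataFrame({'BIS': [-2.0, -3.0, -4.0],
                      'GFK': [-8.0, -8.0, -9.0],
                      'MSP': [23.0, 21.0, 19.0]})

# Hypothetical weights; substitute real population shares per station.
weights = pd.Series({'BIS': 1.0, 'GFK': 1.0, 'MSP': 2.0})

# Hour-over-hour change; the first row has no predecessor, so its change is zero.
delta = temps.diff().fillna(0.0)

unweighted = delta.mean(axis=1)
# Weighted mean: scale each station's change by its weight, then normalize.
weighted = delta.mul(weights, axis=1).sum(axis=1) / weights.sum()
print(pd.DataFrame({'unweighted': unweighted, 'weighted': weighted}))
```

With the larger weight on `MSP`, its two-degree drops pull the weighted change below the unweighted mean.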

### Obtain the Regional MTLF and Actual Load for each Observation Hour

In [None]:
from rf_al_data import get_daily_rf_al_df

forecast_output_dir = './data/mtlf'
Path(forecast_output_dir).mkdir(parents=True, exist_ok=True)
# the actuals aren't available until the next day
actuals = get_daily_rf_al_df(first_hour, last_hour + timedelta(days=2.0), forecast_output_dir)
actuals

### Harmonize Features with Actuals

There are a number of lacunae in the weather observations.

In [None]:
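from datetime import time

def mktime_idx(row):
    # HourEnding labels the end of each interval, so subtract one to get the hour it starts;
    # the offset is fixed at UTC-5 to match the weather observations
    return datetime.combine(row['Market Day'].date(), time(row['HourEnding'] - 1), timezone(timedelta(hours = -5)))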

actuals['time_idx'] = actuals.apply(mktime_idx, axis = 1)
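
Row-wise `apply` can be slow on several years of hourly data. A vectorized alternative is sketched below on a three-row sample. It assumes `Market Day` is a datetime column at midnight and `HourEnding` is an integer from 1 to 24.

```python
from datetime import timedelta, timezone

import pandas as pd

# Fixed UTC-5 offset, matching the row-wise helper above.
EST = timezone(timedelta(hours=-5))

# Synthetic sample using the column names from the cell above.
sample = pd.DataFrame({
    'Market Day': pd.to_datetime(['2015-02-01', '2015-02-01', '2015-02-01']),
    'HourEnding': [1, 2, 24],
})

# Hour ending N starts at hour N - 1, so add N - 1 hours to midnight.
time_idx = (
    sample['Market Day'] + pd.to_timedelta(sample['HourEnding'] - 1, unit='h')
).dt.tz_localize(EST)
print(time_idx)
```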

In [None]:
actuals.to_parquet('./data/actuals_mtlf.parquet')
actuals

In [None]:
p = df.pivot(index='valid', columns='station', values='tmpf').dropna()
data = p.join(actuals.set_index('time_idx'), how='inner')

(n, _) = data.shape
(num_weather_observations, _) = df.groupby('valid').count().shape
(num_load_observations, _) = actuals.shape
(num_weather_observations, num_load_observations, len(observation_hours), n)
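
The effect of `dropna()` followed by an inner join is that an hour survives only if every station reported a temperature and a load row exists for it. The sketch below shows this on synthetic data. The `ActualLoad` column name and all values are hypothetical.

```python
import pandas as pd

hours = pd.to_datetime(['2015-02-01 00:00', '2015-02-01 01:00', '2015-02-01 02:00'])

# Long-format weather observations; MKE is missing the 01:00 reading.
weather = pd.DataFrame({
    'valid': list(hours) * 2,
    'station': ['MSP'] * 3 + ['MKE'] * 3,
    'tmpf': [23.0, 21.0, 19.0, 30.0, None, 28.0],
})

# Load data covers only the first two hours (hypothetical column name and values).
load = pd.DataFrame({'time_idx': hours[[0, 1]], 'ActualLoad': [70000.0, 69000.0]})

# One column per station; drop any hour in which a station is missing.
wide = weather.pivot(index='valid', columns='station', values='tmpf').dropna()
# Keep only the hours present in both the weather and the load data.
joined = wide.join(load.set_index('time_idx'), how='inner')
print(len(wide), len(joined))
```

Here the 01:00 row is lost to the weather gap and the 02:00 row to the missing load, leaving a single hour. Comparing row counts before and after the join is a quick check on how much data the lacunae cost.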

## Feature Engineering: Business Hours

Can we improve the performance of the model by introducing business hours into the feature set?

In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay, BusinessHour

federal_business_days = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bh = BusinessHour()
def is_biz_hour(d):
    return federal_business_days.is_on_offset(d) and bh.is_on_offset(d)
data['IsBusinessHour'] = data.index.to_series().apply(lambda d: 1 if is_biz_hour(d) else 0)
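
For a long hourly index, the same kind of flag can be computed without a per-row `apply`. The sketch below is an approximation under stated assumptions: it expects a timezone-naive index (a timezone-aware one could be converted with `tz_localize(None)` first) and uses a half-open 09:00 to 17:00 window, which may differ from `BusinessHour.is_on_offset` at the closing boundary. `business_hour_flags` is a hypothetical helper name.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

def business_hour_flags(index, open_hour=9, close_hour=17):
    """Flag weekday, non-holiday timestamps with open_hour <= hour < close_hour."""
    days = index.normalize()
    holidays = USFederalHolidayCalendar().holidays(start=days.min(), end=days.max())
    is_weekday = index.dayofweek < 5   # Monday=0 ... Friday=4
    is_holiday = days.isin(holidays)
    in_window = (index.hour >= open_hour) & (index.hour < close_hour)
    return pd.Series((is_weekday & ~is_holiday & in_window).astype(int), index=index)

sample_index = pd.to_datetime([
    '2022-07-01 10:00',  # Friday, working hours
    '2022-07-02 10:00',  # Saturday
    '2022-07-04 10:00',  # Monday, Independence Day
    '2022-07-05 03:00',  # Tuesday, before opening
    '2022-07-05 16:00',  # Tuesday, working hours
])
flags = business_hour_flags(sample_index)
print(flags)
```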