## Obtain Data

In this notebook we perform the following steps:
* Establish the first hour of the dataset
* For the first month,
  * Obtain a list of available stations by state
  * Obtain temperature observations from weather stations in the MISO footprint
    * Stations are organized into MISO regions by state boundaries
    * Stations are predominantly clustered in population centers, making many observation redundant
    * There are many lacunae in some stations
  * Obtain the actual hourly MISO Load data and historical Medium-Term Load Forecasts (MTLF)
  * Join Load and MTLF data with weather observations to complete the raw data
  * Persist the data and demonstrate use of dataset wrapper class
* Update the dataset to the current day
* Identify and mitigate lacunae
* Publish the data

### Define Date Ranges

In [1]:
from datetime import datetime, timezone, timedelta
def est(yyyy, mm, dd, hh):
    return datetime(yyyy, mm, dd, hh, tzinfo=timezone(timedelta(hours=-5)))

# beginning of MISO's historical records that include the southern region (zones 8-10)
first_hour = est(2015, 2, 1, 0)

# latest date with actual load data available is
# l = date.today() - timedelta(days=2)
# last_hour = est(l.year, l.month, l.day, 23)
# instead, fix the date for repeatbility
last_hour = est(2022, 4, 22, 23)

test_split = last_hour - timedelta(days=364, hours=23)
validation_split = test_split - timedelta(days=365)


NameError: name 'pd' is not defined

### Obtain Weather Data for Zones

[Iowa State ASOS Network Downloads](https://mesonet.agron.iastate.edu/request/download.phtml)

In [9]:
from weather_data import get_station_df
from pathlib import Path
from os.path import isfile
import pandas as pd

zones = { 1 : {'MN': ['MSP',  # Minneapolis / St. Paul (STP)
                      'RST',  # Rochester
                      'DLH'], # Duluth
               'ND': ['FAR',  # Fargo
                      'BIS',  # Bismarck
                      'GFK'], # Grand Forks
               'SD': ['ABR'], # Aberdeen
               'WI': ['LSE'], # La Crosse
               'IL': ['SFY']  # Savanna 
              },
          2 : {'WI': ['MSN', 'MKE', 'EAU', 'GRB'],
               'MI': ['ANJ', 'SAW', 'IWD']},
          3 : {'IA': ['DSM', 'CID', 'DVN', 'SUX', 'ALO', 'MCW']}
        }

def download_file_path(zone, state, station):
    zone_data = f"./data/zone_{zone}"
    Path(zone_data).mkdir(exist_ok=True)
    return f'{zone_data}/{state}_{station}.parquet'

def download_station(zone, state, station):
    path = download_file_path(zone, state, station) 
    if isfile(path):
        return pd.read_parquet(path)

    station = get_station_df(station, first_hour, last_hour)
    if station is None:
        print(f'Retrieve {station} failed')
        return None
    return station.to_parquet(path)

from multiprocessing import cpu_count
from joblib import Parallel, delayed
parallel = Parallel(n_jobs=cpu_count())
for zone in zones:
  stations = [(state, station) for state in zones[zone].keys() for station in zones[zone][state]]
  _ = parallel(delayed(download_station)(zone, state, station) for (state, station) in stations)

We have obtained raw weather observations at various intervals, usually 15 minutes, but there is a lot of missing data.

In [19]:
raw_df = pd.DataFrame()
for zone in zones:
  for state in zones[zone]:
      for station in zones[zone][state]:
        raw_df = pd.concat([raw_df, pd.read_parquet(download_file_path(zone, state, station))])
raw_df[raw_df['tmpf'] == 'M'].head()

Unnamed: 0,station,valid,tmpf,lat,lon,feel
13537,MSP,2016-04-30 17:00,M,44.8854,-93.2313,M
13538,MSP,2016-04-30 17:15,M,44.8854,-93.2313,M
13539,MSP,2016-04-30 17:20,M,44.8854,-93.2313,M
13540,MSP,2016-04-30 17:35,M,44.8854,-93.2313,M
13542,MSP,2016-04-30 18:00,M,44.8854,-93.2313,M


We need a temperature observation for each hour, since that is how the MISO MTLF is reported. We will deal with lacunae in the data first.

In [3]:
#observation_dates = pd.date_range(start = first_hour, end = last_hour)
#observation_hours = [d.replace(hour = h) for d in observation_dates for h in range(0, 24)]

def build_hourly_df(zone, state, station, observation_hours):
    w = pd.read_parquet(download_file_path(zone, state, station))
    df = w[w['tmpf'] != 'M'].copy()
    df['valid'] = pd.to_datetime(df['valid'], utc=True)
    numeric_cols = ['tmpf', 'lat', 'lon']
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, axis=1)
    df = df.drop(columns=['feel'])
    idx = df.drop_duplicates('valid').set_index('valid').index.get_indexer(observation_hours, method='nearest')
    df = df.iloc[idx]
    df.loc[: , 'valid'] = df['valid'].dt.round(freq='H')
    return df.drop_duplicates('valid') 

dfs = parallel(delayed(build_hourly_df)(state, station, observation_hours) for (state, station) in stations)
df = pd.concat(dfs, ignore_index=True)
df

Unnamed: 0,station,valid,tmpf,lat,lon
0,DSM,2015-02-01 05:00:00+00:00,33.98,41.5339,-93.6531
1,DSM,2015-02-01 06:00:00+00:00,33.08,41.5339,-93.6531
2,DSM,2015-02-01 07:00:00+00:00,33.08,41.5339,-93.6531
3,DSM,2015-02-01 08:00:00+00:00,33.08,41.5339,-93.6531
4,DSM,2015-02-01 09:00:00+00:00,33.08,41.5339,-93.6531
...,...,...,...,...,...
2275820,LFK,2021-12-31 19:00:00+00:00,78.10,31.2340,-94.7500
2275821,LFK,2021-12-31 20:00:00+00:00,81.00,31.2340,-94.7500
2275822,LFK,2021-12-31 21:00:00+00:00,81.00,31.2340,-94.7500
2275823,LFK,2021-12-31 22:00:00+00:00,81.00,31.2340,-94.7500


### Obtain the Regional MTLF and Actual Load for each Observation Hour

In [4]:
from rf_al_data import 

Path("./data/mtlf").mkdir(exist_ok=True)
forecast_output_dir = './data/mtlf'
# the actuals aren't available until the next day
actuals = get_daily_rf_al_df(first_hour, last_hour + timedelta(days=2.0), forecast_output_dir)
actuals

Unnamed: 0,Market Day,HourEnding,Central MTLF (MWh),Central ActualLoad (MWh),North MTLF (MWh),North ActualLoad (MWh),South MTLF (MWh),South ActualLoad (MWh),MISO MTLF (MWh),MISO ActualLoad (MWh)
1,2015-02-01,1,37796,36585.39,15885,16150.30,16067,16163.39,69748,68899.08
2,2015-02-01,2,36589,35509.04,15517,15713.40,15591,15571.34,67697,66793.78
3,2015-02-01,3,36067,34970.90,15187,15381.56,15272,15137.39,66526,65489.85
4,2015-02-01,4,35855,34708.83,15060,15239.59,15122,14935.61,66037,64884.03
5,2015-02-01,5,35835,34709.81,15027,15229.09,15080,14750.43,65942,64689.33
...,...,...,...,...,...,...,...,...,...,...
20,2021-12-31,20,37097,35527.32,20035,19940.40,20204,20083.71,77336,75551.43
21,2021-12-31,21,36329,34442.06,19699,19477.29,19658,19543.14,75686,73462.49
22,2021-12-31,22,35383,33472.41,19408,19040.22,19096,18934.14,73887,71446.77
23,2021-12-31,23,33996,32390.22,19038,18672.17,18403,18469.10,71437,69531.49


### Harmonize Features with Actuals

There are a number of lacunae in the weather observations.

In [5]:
def mktime_idx(row): 
    return datetime.combine(row['Market Day'].date(), time(row['HourEnding'] - 1), timezone(timedelta(hours = -5)))

actuals['time_idx'] = actuals.apply(mktime_idx, axis = 1)

In [6]:
actuals.to_parquet('./data/actuals_mtlf.parquet')
actuals

Unnamed: 0,Market Day,HourEnding,Central MTLF (MWh),Central ActualLoad (MWh),North MTLF (MWh),North ActualLoad (MWh),South MTLF (MWh),South ActualLoad (MWh),MISO MTLF (MWh),MISO ActualLoad (MWh),time_idx
1,2015-02-01,1,37796,36585.39,15885,16150.30,16067,16163.39,69748,68899.08,2015-02-01 00:00:00-05:00
2,2015-02-01,2,36589,35509.04,15517,15713.40,15591,15571.34,67697,66793.78,2015-02-01 01:00:00-05:00
3,2015-02-01,3,36067,34970.90,15187,15381.56,15272,15137.39,66526,65489.85,2015-02-01 02:00:00-05:00
4,2015-02-01,4,35855,34708.83,15060,15239.59,15122,14935.61,66037,64884.03,2015-02-01 03:00:00-05:00
5,2015-02-01,5,35835,34709.81,15027,15229.09,15080,14750.43,65942,64689.33,2015-02-01 04:00:00-05:00
...,...,...,...,...,...,...,...,...,...,...,...
20,2021-12-31,20,37097,35527.32,20035,19940.40,20204,20083.71,77336,75551.43,2021-12-31 19:00:00-05:00
21,2021-12-31,21,36329,34442.06,19699,19477.29,19658,19543.14,75686,73462.49,2021-12-31 20:00:00-05:00
22,2021-12-31,22,35383,33472.41,19408,19040.22,19096,18934.14,73887,71446.77,2021-12-31 21:00:00-05:00
23,2021-12-31,23,33996,32390.22,19038,18672.17,18403,18469.10,71437,69531.49,2021-12-31 22:00:00-05:00


In [None]:
p = df.pivot(index='valid', columns='station', values='tmpf').dropna()
data = p.join(actuals.set_index('time_idx'), how='inner')

(n, _) = data.shape
(num_weather_observations, _) = df.groupby('valid').count().shape
(num_load_observations, _) = actuals.shape
(num_weather_observations, num_load_observations, len(observation_hours), n)

(60610, 60624, 60624, 40639)

## Feature Engineering: Business Hours

Can we improve the performance of the model by introducing business hours into the feature set?

In [8]:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay, BusinessHour

federal_business_days = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bh = BusinessHour()
def is_biz_hour(d):
    return federal_business_days.is_on_offset(d) and bh.is_on_offset(d)
data['IsBusinessHour'] = data.index.to_series().apply(lambda d: 1 if is_biz_hour(d) else 0)