## Obtain Data

In this notebook we perform the following steps:
* Establish the first hour of the dataset
* For the first month,
  * Obtain a list of available stations by state
  * Obtain temperature observations from weather stations in the MISO footprint
    * Stations are organized into MISO regions by state boundaries
    * Stations are predominantly clustered in population centers, making many observations redundant
    * Some stations have many lacunae (gaps in their records)
  * Obtain the actual hourly MISO Load data and historical Medium-Term Load Forecasts (MTLF)
  * Join Load and MTLF data with weather observations to complete the raw data
  * Persist the data and demonstrate use of dataset wrapper class
* Update the dataset to the current day
* Identify and mitigate lacunae
* Publish the data

### Define Date Ranges

In [2]:
%load_ext autoreload
%autoreload 2

from datetime import datetime, timezone, timedelta
from weather_data import prevailing_time as est

# Beginning of MISO's historical records that include the southern region (zones 8-10).
# Five hours of weather data were lost on 2015-06-17.
first_hour = est(2015, 6, 18, 4)

# The latest date with actual load data available would be:
# l = date.today() - timedelta(days=2)
# last_hour = est(l.year, l.month, l.day, 23)
# Instead, fix the date for repeatability.
last_hour = est(2022, 4, 22, 23)

test_split = last_hour - timedelta(days=364, hours=23)
validation_split = test_split - timedelta(days=365)


### Obtain Weather Data for Zones

[Iowa State ASOS Network Downloads](https://mesonet.agron.iastate.edu/request/download.phtml)

In [3]:
from weather_data import ASOS
from pathlib import Path
from os.path import isfile
import pandas as pd

zones = { 1 : {'MN': ['MSP',  # Minneapolis / St. Paul (STP)
                      'RST',  # Rochester
                      'DLH'], # Duluth
               'ND': ['FAR',  # Fargo
                      'BIS',  # Bismarck
                      'GFK'], # Grand Forks
               'SD': ['ABR'], # Aberdeen
               'WI': ['LSE'], # La Crosse
               'IL': ['SFY']  # Savanna 
              },
          2 : {'WI': ['MSN', 'MKE', 'EAU', 'GRB'],
               'MI': ['ANJ', 'SAW', 'IWD']},
          3 : {'IA': ['DSM', 'CID', 'DVN', 'SUX', 'ALO', 'MCW']}
        }

def download_file_path(zone, state, station):
    zone_data = f"./data/zone_{zone}"
    # parents=True also creates ./data when it does not exist yet
    Path(zone_data).mkdir(parents=True, exist_ok=True)
    return f'{zone_data}/{state}_{station}.parquet'

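def download_station(zone, state, station):
    path = download_file_path(zone, state, station)
    # Reuse the cached file if this station was already downloaded
    if isfile(path):
        return pd.read_parquet(path)

    asos = ASOS()

    # Keep the result under its own name so `station` still holds the station ID
    station_df = asos.get_station_df(station, first_hour, last_hour)
    if station_df is None:
        print(f'Retrieve {station} failed')
        return None
    # to_parquet returns None, so write first and then return the frame
    station_df.to_parquet(path)
    return station_df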

from multiprocessing import cpu_count
from joblib import Parallel, delayed
def do_parallel(func):
    parallel = Parallel(n_jobs=cpu_count())
    result = {}
    for zone in zones:
        stations = [(state, station) for state in zones[zone].keys() for station in zones[zone][state]]
        result[zone] = parallel(delayed(func)(zone, state, station) for (state, station) in stations)
    return result

In [218]:
_ = do_parallel(download_station)

The raw weather observations arrive at irregular intervals, usually about 15 minutes apart, and many temperature readings are missing. Missing readings appear as `M` in the `tmpf` column.
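To see where the missing readings are concentrated, it can help to break them down by station. The following is a minimal sketch on made-up rows, not the notebook's data; the `temp` and `missing_share` names are chosen here for illustration.

```python
import pandas as pd

# Toy rows that mimic the raw layout; the values are invented for illustration.
raw = pd.DataFrame({
    "station": ["MSP", "MSP", "MSP", "RST"],
    "tmpf": ["68.0", "M", "66.92", "M"],
})

# 'M' cannot be parsed as a number, so errors='coerce' turns it into NaN.
raw["temp"] = pd.to_numeric(raw["tmpf"], errors="coerce")

# Share of missing temperature readings per station.
missing_share = raw["temp"].isna().groupby(raw["station"]).mean()
print(missing_share)
```

A station with a high share of missing readings is a candidate for gap filling from a neighboring station.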

In [5]:
# Collect every station's frame, then concatenate once instead of growing raw_df in the loop
frames = []
for zone in zones:
    for state in zones[zone]:
        for station in zones[zone][state]:
            frames.append(pd.read_parquet(download_file_path(zone, state, station)))
raw_df = pd.concat(frames)
raw_df[raw_df['tmpf'] == 'M'].head()

Unnamed: 0,station,valid,tmpf,lat,lon,feel
9559,MSP,2016-04-30 17:00,M,44.8854,-93.2313,M
9560,MSP,2016-04-30 17:15,M,44.8854,-93.2313,M
9561,MSP,2016-04-30 17:20,M,44.8854,-93.2313,M
9562,MSP,2016-04-30 17:35,M,44.8854,-93.2313,M
9564,MSP,2016-04-30 18:00,M,44.8854,-93.2313,M


In [6]:
raw_df[raw_df['tmpf'] == 'M'].shape

(10880084, 6)

In [9]:
raw_df['temp'] = pd.to_numeric(raw_df['tmpf'], errors='coerce')
raw_df.head()

Unnamed: 0,station,valid,tmpf,lat,lon,feel,temp
0,MSP,2015-06-18 00:53,68.0,44.8854,-93.2313,68.0,68.0
1,MSP,2015-06-18 01:53,66.92,44.8854,-93.2313,66.92,66.92
2,MSP,2015-06-18 02:27,66.92,44.8854,-93.2313,66.92,66.92
3,MSP,2015-06-18 02:48,66.2,44.8854,-93.2313,66.2,66.2
4,MSP,2015-06-18 02:53,66.92,44.8854,-93.2313,66.92,66.92


In [10]:
missing_temps = raw_df[pd.isna(raw_df['temp'])]
missing_temps.shape

(10880084, 7)

In [233]:
dlh_df.tail(10)

Unnamed: 0_level_0,station,observation_time,temp
hour,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2016-06-18 14:00:00-05:00,DLH,2016-06-18 19:00:00+00:00,80.970473
2016-06-18 15:00:00-05:00,DLH,2016-06-18 19:55:00+00:00,82.94
2016-06-18 16:00:00-05:00,DLH,2016-06-18 21:00:00+00:00,80.063491
2016-06-18 17:00:00-05:00,DLH,2016-06-18 21:55:00+00:00,77.0
2016-06-18 18:00:00-05:00,DLH,2016-06-18 22:55:00+00:00,78.08
2016-06-18 19:00:00-05:00,DLH,2016-06-18 23:55:00+00:00,78.08
2016-06-18 20:00:00-05:00,,NaT,78.468646
2016-06-18 21:00:00-05:00,,NaT,81.10177
2016-06-18 22:00:00-05:00,,NaT,85.987049
2016-06-18 23:00:00-05:00,,NaT,93.644189


In [239]:
from weather_data import ASOS
asos = ASOS()
dlh_df = asos.get_station_df('DLH', est(2016, 6, 16, 0), est(2016, 6, 19, 3))
dlh_df.tail(10)

Fetching https://mesonet.agron.iastate.edu/cgi-bin/request/asos.py?data=tmpf&data=feel&tz=Etc/UTC&format=comma&latlon=yes&year1=2016&month1=6&day1=15&year2=2016&month2=6&day2=21&station=DLH


Unnamed: 0_level_0,station,observation_time,temp
hour,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2016-06-18 18:00:00-05:00,DLH,2016-06-18 22:55:00+00:00,78.08
2016-06-18 19:00:00-05:00,DLH,2016-06-19 00:00:00+00:00,76.343357
2016-06-18 20:00:00-05:00,DLH,2016-06-19 01:00:00+00:00,74.98466
2016-06-18 21:00:00-05:00,DLH,2016-06-19 02:00:00+00:00,73.637452
2016-06-18 22:00:00-05:00,DLH,2016-06-19 03:05:00+00:00,72.345319
2016-06-18 23:00:00-05:00,DLH,2016-06-19 04:00:00+00:00,71.151849
2016-06-19 00:00:00-05:00,DLH,2016-06-19 04:55:00+00:00,69.98
2016-06-19 01:00:00-05:00,DLH,2016-06-19 05:55:00+00:00,69.98
2016-06-19 02:00:00-05:00,DLH,2016-06-19 06:55:00+00:00,69.08
2016-06-19 03:00:00-05:00,DLH,2016-06-19 08:05:00+00:00,68.2099


In [231]:
est(2021, 1, 1, 0).astimezone(timezone.utc)

datetime.datetime(2021, 1, 1, 5, 0, tzinfo=datetime.timezone.utc)

We need a temperature observation for each hour, since that is how the MISO MTLF is reported. Our initial approach will be to simply drop all missing observations, then choose the observation that is closest in time to the top of the hour.
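The nearest-observation selection can be illustrated on a handful of made-up readings. This sketch uses `DatetimeIndex.get_indexer` with `method="nearest"`. The `tolerance` argument shown at the end is not part of the notebook's approach; it is included only to show that, without it, an hour with no nearby reading silently borrows a distant one.

```python
import pandas as pd

# Toy observations at irregular times; the values are invented for illustration.
obs = pd.DataFrame({
    "valid": pd.to_datetime(
        ["2015-06-18 00:53", "2015-06-18 01:53",
         "2015-06-18 02:27", "2015-06-18 02:48"],
        utc=True,
    ),
    "tmpf": [68.0, 66.92, 66.92, 66.2],
})

# Hours we want a reading for: 01:00 through 04:00 UTC.
hours = pd.date_range("2015-06-18 01:00", "2015-06-18 04:00",
                      freq=pd.Timedelta(hours=1), tz="UTC")

index = obs.set_index("valid").index

# Position of the observation closest in time to each hour.
idx = index.get_indexer(hours, method="nearest")

# Same lookup, but hours with no reading within 30 minutes get -1.
idx_tol = index.get_indexer(hours, method="nearest",
                            tolerance=pd.Timedelta(minutes=30))

print(list(idx))      # 04:00 reuses the 02:48 reading
print(list(idx_tol))  # 04:00 is flagged as missing
```

Here 04:00 has no reading within 72 minutes, so the plain nearest lookup reuses the 02:48 observation, while the tolerance variant marks the hour as a gap.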

In [35]:
observation_dates = pd.date_range(start = first_hour, end = last_hour)
observation_hours = [d.replace(hour = h) for d in observation_dates for h in range(0, 24)]

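def build_hourly_df(zone, state, station):
    w = pd.read_parquet(download_file_path(zone, state, station))
    # Drop observations whose temperature is missing ('M')
    df = w[w['tmpf'] != 'M'].copy()
    df['valid'] = pd.to_datetime(df['valid'], utc=True)
    numeric_cols = ['tmpf', 'lat', 'lon']
    # Convert column by column; axis=1 would convert row by row, which is much slower
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
    df = df.drop(columns=['feel'])
    # Drop duplicate timestamps first so the positions from get_indexer line up with df
    df = df.drop_duplicates('valid')
    idx = df.set_index('valid').index.get_indexer(observation_hours, method='nearest')
    df = df.iloc[idx]
    # Snap each selected observation to its hour, keeping one row per hour
    df.loc[: , 'valid'] = df['valid'].dt.round(freq='H')
    return df.drop_duplicates('valid')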


In [36]:
dfs = do_parallel(build_hourly_df)
zonal_weather_data = {}
for zone in zones:
    zonal_weather_data[zone] = pd.concat(dfs[zone], ignore_index=True)

In [37]:
df = zonal_weather_data[1]
for station in pd.unique(zonal_weather_data[1]['station']):
    count = df[df['station'] == station].shape[0]
    print(f'{station} observation count {count}')
print(f'Required observations {len(observation_hours)}')

MSP observation count 59898
RST observation count 59820
DLH observation count 49334
FAR observation count 59806
BIS observation count 59792
GFK observation count 59780
ABR observation count 59714
LSE observation count 59456
SFY observation count 57519
Required observations 60024


In [44]:
merged = pd.merge(df[df['station'] == 'DLH'], pd.DataFrame(observation_hours, columns=['Hours']), how='right', left_on='valid', right_on='Hours')
# Count the missing hours on each calendar day
merged.loc[merged['station'].isna(), 'Hours'].dt.date.value_counts()[0:50]

2021-08-22    24
2020-10-30    23
2020-09-23    22
2018-09-12    21
2021-05-02    21
2018-08-22    20
2020-10-25    19
2021-05-03    18
2018-04-25    17
2019-07-07    17
2018-06-20    17
2018-05-14    17
2018-05-07    17
2018-07-30    17
2018-05-10    17
2018-07-14    17
2020-07-20    16
2018-07-06    16
2019-03-25    16
2018-03-22    16
2018-07-11    16
2018-03-21    16
2018-07-13    16
2020-02-06    16
2019-04-27    16
2018-07-17    16
2019-06-13    16
2018-03-17    16
2018-03-16    16
2018-07-05    16
2018-03-13    16
2018-07-23    16
2018-07-24    16
2018-07-27    16
2018-03-09    16
2018-03-14    16
2018-06-19    16
2019-06-20    16
2019-07-16    16
2018-05-11    16
2018-05-12    16
2018-05-13    16
2019-07-31    16
2018-05-15    16
2018-05-06    16
2018-05-18    16
2018-05-05    16
2018-05-20    16
2018-05-22    16
2019-07-13    16
Name: Hours, dtype: int64

DLH is missing entire days of observations. To fill these lacunae, we try to obtain observations from nearby stations.
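As a rough illustration of the idea, once an hourly series from a nearby station is available, the gaps in the target station can be patched with `Series.combine_first`. This is a sketch on made-up values with a hypothetical neighbor series; the notebook's `get_nearest_observation` helper may work differently.

```python
import numpy as np
import pandas as pd

hours = pd.date_range("2021-08-22 00:00", periods=4,
                      freq=pd.Timedelta(hours=1), tz="UTC")

# Hourly temperatures; the values are invented, NaN marks a missing hour.
target = pd.Series([70.0, np.nan, np.nan, 73.0], index=hours, name="DLH")
neighbor = pd.Series([69.0, 69.5, 71.0, 72.5], index=hours, name="neighbor")

# Keep the target's own readings and fall back to the neighbor where they are missing.
filled = target.combine_first(neighbor)

# Record which hours were borrowed so that downstream steps can treat them differently.
was_filled = target.isna() & filled.notna()

print(filled)
print(was_filled)
```

Keeping the `was_filled` mask makes it possible to audit later how much of the record came from a substitute station.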

In [39]:
hour = merged[merged['station'].isna()].iloc[0]['Hours']
from weather_data import get_nearest_observation, get_session
tmpf = get_nearest_observation('DLH', 'MN', hour, get_session())

Found 105 valid sites for MN
Pandas(Index=0, elevation=367.0, sname='AITKIN NDB', time_domain='(1991-Now)', state='MN', country='US', climate_site='MN0059', wfo='DLH', tzname='America/Chicago', ncdc81='USC00210059', ncei91='USC00210059', ugc_county='MNC001', ugc_zone='MNZ036', county='Aitkin', sid='AIT', latlon=(46.5484, -93.6768), distance=73.35199477929798)


In [None]:
import pandas as pd
from weather_data import get_stations, get_session
duluth = pd.read_parquet(download_file_path(1, 'MN', 'DLH'))
stations_raw = get_stations('MN', 2015, get_session('https://'))

In [None]:
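# Rank Minnesota stations by their distance from DLH.
# stations_MN_df is not built in this notebook; it is assumed to be a DataFrame
# derived from stations_raw with 'sid' and 'latlon' columns.
from geopy import distance
dlh = stations_MN_df[stations_MN_df['sid'] == 'DLH'].iloc[0]['latlon']
stations_MN_df['distance_from_DLH'] = stations_MN_df['latlon'].apply(lambda c: distance.distance(dlh, c).mi)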

In [None]:
stations_MN_df.sort_values(['distance_from_DLH'])

In [None]:
# Sanity check: the distance from DLH to itself should be zero
distance.distance(dlh, dlh)

### Obtain the Regional MTLF and Actual Load for each Observation Hour

In [None]:
from rf_al_data import get_daily_rf_al_df

Path("./data/mtlf").mkdir(exist_ok=True)
forecast_output_dir = './data/mtlf'
# the actuals aren't available until the next day
actuals = get_daily_rf_al_df(first_hour, last_hour + timedelta(days=2.0), forecast_output_dir)
actuals

### Harmonize Features with Actuals

There are a number of lacunae in the weather observations.

In [None]:
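from datetime import time

def mktime_idx(row):
    # HourEnding labels the end of each hour, so subtract one to get the hour that starts it;
    # the fixed UTC-5 offset matches the est() helper used above
    return datetime.combine(row['Market Day'].date(), time(row['HourEnding'] - 1), timezone(timedelta(hours = -5)))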

actuals['time_idx'] = actuals.apply(mktime_idx, axis = 1)

In [None]:
actuals.to_parquet('./data/actuals_mtlf.parquet')
actuals

In [None]:
p = df.pivot(index='valid', columns='station', values='tmpf').dropna()
data = p.join(actuals.set_index('time_idx'), how='inner')

(n, _) = data.shape
(num_weather_observations, _) = df.groupby('valid').count().shape
(num_load_observations, _) = actuals.shape
(num_weather_observations, num_load_observations, len(observation_hours), n)

## Feature Engineering: Business Hours

Can we improve the performance of the model by introducing business hours into the feature set?
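For illustration, a business-hour indicator can also be computed in a vectorized way directly from a `DatetimeIndex`. This is a sketch under stated assumptions rather than the notebook's method: it assumes a 09:00 to 17:00 window with the closing hour excluded, timestamps already expressed in local time, and the federal holiday rules in `USFederalHolidayCalendar`.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Toy timestamps, assumed to already be in local time.
idx = pd.DatetimeIndex([
    "2020-12-23 10:00",  # Wednesday, inside the assumed 09:00-17:00 window
    "2020-12-23 20:00",  # Wednesday evening
    "2020-12-25 10:00",  # Christmas Day (Friday), a federal holiday
    "2020-12-26 10:00",  # Saturday
])

holidays = USFederalHolidayCalendar().holidays(
    start=idx.min().normalize(), end=idx.max().normalize()
)

is_weekday = idx.dayofweek < 5                 # Monday=0 ... Friday=4
is_holiday = idx.normalize().isin(holidays)    # compare on the calendar date
in_hours = (idx.hour >= 9) & (idx.hour < 17)   # assumed window, end excluded

is_business_hour = (is_weekday & ~is_holiday & in_hours).astype(int)
print(list(is_business_hour))
```

Working on the whole index at once avoids a Python-level call per row, which matters for a dataset with tens of thousands of hourly rows.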

In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay, BusinessHour

federal_business_days = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bh = BusinessHour()
def is_biz_hour(d):
    return federal_business_days.is_on_offset(d) and bh.is_on_offset(d)
data['IsBusinessHour'] = data.index.to_series().apply(lambda d: 1 if is_biz_hour(d) else 0)