# Goal

In this notebook, I explore the hourly 
[EIA Electric Power Operations data](https://www.eia.gov/opendata/browser/electricity/rto/region-data).

It is available in two main formats:

1. Spreadsheets (many types at different resolutions and facet hierarchies)
1. [API](https://www.eia.gov/opendata/documentation.php)

**Main Question**: Are these two data sources consistent? (That is, could I train on bulk historical data downloaded via spreadsheet, then retrain periodically on recent data pulled from the API?)

**Answer**: Yes, but be careful. The spreadsheet version of the data includes imputed/adjusted values (for missing and outlier data), whereas the API returns only the raw values.

Other questions:

**Q**: Can I use the API to pull all historical data? 
**A**: Yes. But you need to paginate when pulling more than 5k rows.


In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# Spreadsheet Data Source

This is hourly data with a lot of columns (demand, demand forecast, many generation sources, etc.) and a lot of rows (every hour back to July 2015). The original spreadsheet had this data for all balancing authorities. I filtered it down to only PJM and removed a number of columns to produce the CSV file in the `data/` directory of this project.

For each metric, it includes raw values, imputed values, and adjusted values:

- Raw: The raw metrics reported by the balancing authority.
- Imputed: Where there are missing data or outliers in the raw timeseries, EIA imputes more realistic values.
- Adjusted: A merge of the raw and imputed values. Where imputed values were required they replace the raw values.
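The relationship between the three columns can be sketched in a few lines of pandas. This is a minimal illustration on made-up numbers, assuming "Adjusted" is simply the imputed value where one exists and the raw value otherwise. EIA's actual imputation procedure is not reproduced here, and the column name `Adjusted D (rebuilt)` is my own.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only (not real EIA data)
toy = pd.DataFrame({
    'D': [84024.0, np.nan, 0.0, 74931.0],             # raw: one gap, one suspicious zero
    'Imputed D': [np.nan, 79791.0, 76760.0, np.nan],  # filled only where raw was rejected
})

# Take the imputed value where present, otherwise fall back to the raw value
toy['Adjusted D (rebuilt)'] = toy['Imputed D'].combine_first(toy['D'])
print(toy)
```

If the assumption holds, the rebuilt column should agree with the spreadsheet's own `Adjusted D` column wherever both are populated.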

In [3]:
# Original source: https://www.eia.gov/electricity/gridmonitor/knownissues/xls/PJM.xlsx
# Downloaded 6/3/24

df = pd.read_csv('../data/pjm_hourly_published_data_20240603.csv') 


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78840 entries, 0 to 78839
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   BA                78840 non-null  object 
 1   UTC time          78840 non-null  object 
 2   Local date        78840 non-null  object 
 3   Hour              78840 non-null  int64  
 4   Local time        78840 non-null  object 
 5   Time zone         78840 non-null  object 
 6   Generation only?  78840 non-null  object 
 7   DF                78554 non-null  object 
 8   D                 78646 non-null  object 
 9   NG                78621 non-null  object 
 10  TI                78622 non-null  object 
 11  Imputed D         175 non-null    object 
 12  Imputed NG        196 non-null    object 
 13  Imputed TI        4 non-null      float64
 14  Adjusted D        78813 non-null  object 
 15  Adjusted NG       78813 non-null  object 
 16  Adjusted TI       78622 non-null  object

In [5]:
# Data column descriptions from EIA
column_descr = {
    'BA': '2-4 letter code that identifies the balancing authority',
    'UTC time': 'The end of the hour in Coordinated Universal Time (UTC)',
    'Local date': 'The date (using local time zone) for which data has been reported',
    'Hour': 'The hour number for the day.  Hour 1 corresponds to the time period 12:00 AM - 1:00 AM',
    'Local time': 'The end of the hour in local time',
    'Time zone': 'The local time zone',
    'Generation only?': ' Y indicates the balancing authority is a generation-only BA. Generation-only BAs consist of a power plant or group of power plants and do not directly serve retail customers. Therefore, they only report net generation and interchange and do not report demand or demand forecasts.',
    'DF': 'Demand forecast (DF): Each BA produces a day-ahead electricity demand forecast for every hour of the next day. These forecasts help BAs plan for and coordinate the reliable operation of their electric system on the following day. This column displays the actual data reported to EIA in MWh.',
    'D': 'Demand (D): A calculated value representing the amount of electricity load within the balancing authority’s electric system. A BA derives its demand value by taking the total metered net electricity generation within its electric system and subtracting the total metered net electricity interchange occurring between the BA and its neighboring BAs. This column displays the actual data reported to EIA in MWh.',
    'NG': 'Net generation (NG): the metered output of electric generating units in the balancing authority’s electric system. This generation only includes generating units that are managed by the balancing authority or whose operations are visible to the balancing authority.  This column displays the actual data reported to EIA in MWh.',
    'TI': 'Total Interchange (TI): the net metered tie line flow from one BA to another directly interconnected BA. Total net interchange is the net sum of all interchange occurring between a BA and it\'s directly interconnected neighboring BAs.  Negative interchange values indicate net inflows, and positive interchange values indicate net outflows.  This column displays the actual data reported to EIA in MWh.',
    'Imputed D': 'EIA imputes for anomalous values for total demand (D) if the value is missing or reported as negative, zero, or at least 1.5 times greater than the maximum of past total demand values reported by that BA. This column displays imputed values in MWh when they are made.',
    'Imputed NG': 'EIA imputes for anomalous values for total net generation (NG) if the value is missing or reported as negative, zero, or at least 1.5 times greater than the maximum of past total net generation values reported by that BA. This column displays imputed values in MWh when they are made.',
    'Imputed TI': 'EIA imputes for anomalous values for total interchange (TI) if the value is as at least 1.5 times greater than the maximum of past positive total interchange values reported by that BA or at least 1.5 times less than the minimum of past negative total interchange values reported by that BA. This column displays imputed values in MWh when they are made.',
    'Adjusted D': 'This column displays the demand (D) reported by the balancing authority in MWh unless imputation was required. When imputation was required, this column displays the imputed demand.',
    'Adjusted NG': 'This column displays the net generation (NG) reported by the balancing authority in MWh unless imputation was required. When imputation was required, this column displays the imputed net generation.',
    'Adjusted TI': 'This column displays the total interchange (TI) reported by the balancing authority in MWh unless imputation was required. When imputation was required, this column displays the imputed total interchange.',
}

Drop columns we're ignoring for now

In [6]:
# Copy the selection so later column assignments don't trigger SettingWithCopyWarning
df = df[['UTC time', 'Time zone', 'DF', 'D', 'Adjusted D']].copy()

Convert columns to appropriate types

In [7]:
df['UTC time'] = pd.to_datetime(df['UTC time'], format='%d%b%Y %H:%M:%S', utc=True)
#df['Local time'] = pd.to_datetime(df['Local time'], format='%d%b%Y %H:%M:%S').dt.tz_localize('EST')
for col in ['DF', 'D', 'Adjusted D']:
    # Handle commas in string-encoded integers
    # df.loc[:, col] = pd.to_numeric(df[col].str.replace(',', ''))
    df[col] = pd.to_numeric(df[col].str.replace(',', ''))
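As an alternative to stripping commas after loading, `pd.read_csv` accepts a `thousands` argument that handles them at parse time. The sketch below uses a small in-memory CSV that I made up for illustration. The timestamp strings are assumed to follow the `'%d%b%Y %H:%M:%S'` format used above, and the real file may differ in quoting and layout.

```python
import io
import pandas as pd

# Hypothetical two-column excerpt, not copied from the real file
csv_text = (
    'UTC time,D\n'
    '01Jul2015 05:00:00,"84,024"\n'
    '01Jul2015 06:00:00,"79,791"\n'
    '01Jul2015 07:00:00,\n'
)

# thousands=',' tells the parser to treat commas inside numbers as digit separators
toy = pd.read_csv(io.StringIO(csv_text), thousands=',')
toy['UTC time'] = pd.to_datetime(toy['UTC time'], format='%d%b%Y %H:%M:%S', utc=True)
print(toy.dtypes)
```

The empty field in the last row is read as NaN, so the column comes back as a float rather than an integer type.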

In [8]:
df.head()

Unnamed: 0,UTC time,Time zone,DF,D,Adjusted D
0,2015-07-01 05:00:00+00:00,Eastern,29415.0,84024.0,84024.0
1,2015-07-01 06:00:00+00:00,Eastern,27687.0,79791.0,79791.0
2,2015-07-01 07:00:00+00:00,Eastern,26574.0,76760.0,76760.0
3,2015-07-01 08:00:00+00:00,Eastern,26029.0,74931.0,74931.0
4,2015-07-01 09:00:00+00:00,Eastern,26220.0,74368.0,74368.0


In [9]:
start = df['UTC time'].min()
end = df['UTC time'].max()

In [47]:
print(end)
pd.Timestamp('2024-06-28 04:00:00+00:00')

2024-06-28 04:00:00+00:00


Timestamp('2024-06-28 04:00:00+0000', tz='UTC')

# API Data Source

Below I query the API for all the hours covered in the spreadsheet to:

- Test out bulk querying with pagination (TLDR: works well)
- Check for consistency with the spreadsheet data (TLDR: The API gives the raw values found in the spreadsheet, not the adjusted values. So I'll need to impute missing and anomalous values myself.)
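Since the API only returns raw values, a first step toward handling them myself is flagging which hours need attention. The sketch below follows the rule quoted in the `Imputed D` column description (missing, negative, zero, or at least 1.5 times the maximum of past values). The function name `flag_anomalous_demand` is hypothetical, and the code makes one simplifying assumption: the running maximum includes earlier values that were themselves flagged, which EIA may or may not do.

```python
import numpy as np
import pandas as pd

def flag_anomalous_demand(demand: pd.Series, factor: float = 1.5) -> pd.Series:
    """Return a boolean mask of values that the stated rule would reject."""
    # Largest value seen strictly before each row (NaN for the first row)
    past_max = demand.expanding().max().shift(1)
    missing = demand.isna()
    non_positive = demand <= 0
    spike = demand >= factor * past_max  # False wherever past_max is NaN
    return missing | non_positive | spike

# Toy series for illustration only
toy = pd.Series([100.0, 110.0, np.nan, 0.0, -5.0, 200.0, 120.0])
flags = flag_anomalous_demand(toy)
cleaned = toy.where(~flags)  # rejected values become NaN, ready for imputation
print(pd.DataFrame({'raw': toy, 'flagged': flags, 'cleaned': cleaned}))
```

This only marks the values. How to fill them (interpolation, a seasonal estimate, or something else) is a separate choice that this sketch does not make.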

In [10]:
# Calculate the number of rows to fetch from the API between start and end
time_span = end - start
hours = int(time_span.total_seconds() / 3600)

# Calculate how many paginated API requests will be required to fetch all the 
# timeseries data
REQUEST_ROWS = 5000
num_full_requests = hours // REQUEST_ROWS
final_request_length = hours % REQUEST_ROWS
print(f'Fetching {hours} hours of data. Start: {start}. End: {end}')
print(f'Will make {num_full_requests} {REQUEST_ROWS}-length requests and one {final_request_length}-length request.')

Fetching 78839 hours of data. Start: 2015-07-01 05:00:00+00:00. End: 2024-06-28 04:00:00+00:00
Will make 15 5000-length requests and one 3839-length request.


Let's check whether the `Adjusted D` values in this dataset match the demand values available through the API.

In [11]:
import requests

url = "https://api.eia.gov/v2/electricity/rto/region-data/data/?frequency=hourly&data[0]=value&facets[respondent][]=PJM&facets[type][]=D&facets[type][]=DF&sort[0][column]=period&sort[0][direction]=asc"

# Build a list of dataframes storing each API request (page)'s response
response_dfs = []

def append_EIA_page_response_df(start, end, offset, length, result_list):
    print(f'Fetching API page. offset:{offset}. length:{length}')

    params = {
        'offset': offset,
        'length': length,
        'api_key': os.environ['EIA_API_KEY'],
        'start': start.strftime('%Y-%m-%dT%H'),
        'end': end.strftime('%Y-%m-%dT%H'),
    }

    r = requests.get(url, params=params)
    r.raise_for_status() 
    result_df = pd.DataFrame(r.json()['response']['data'])
    assert len(result_df) == length
    result_list.append(result_df)

# Make the full-length requests
for i in range(num_full_requests):
    offset = i * REQUEST_ROWS
    append_EIA_page_response_df(start, end, offset, REQUEST_ROWS, response_dfs)
# Make the final remainder request
append_EIA_page_response_df(start, end, num_full_requests * REQUEST_ROWS, final_request_length, response_dfs)

# ignore_index avoids repeating each page's 0..N-1 index in the combined frame
api_df = pd.concat(response_dfs, ignore_index=True)

Fetching API page. offset:0. length:5000
Fetching API page. offset:5000. length:5000
Fetching API page. offset:10000. length:5000
Fetching API page. offset:15000. length:5000
Fetching API page. offset:20000. length:5000
Fetching API page. offset:25000. length:5000
Fetching API page. offset:30000. length:5000
Fetching API page. offset:35000. length:5000
Fetching API page. offset:40000. length:5000
Fetching API page. offset:45000. length:5000
Fetching API page. offset:50000. length:5000
Fetching API page. offset:55000. length:5000
Fetching API page. offset:60000. length:5000
Fetching API page. offset:65000. length:5000
Fetching API page. offset:70000. length:5000
Fetching API page. offset:75000. length:3839
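The page plan above depends on knowing the row count up front. Because the query asks for both `D` and `DF`, each hour can contribute up to two rows, so a row budget equal to the number of hours may stop short of `end`. A minimal sketch of an alternative is to keep requesting pages until one comes back shorter than the page size. The names `fetch_all_pages` and `fake_fetch` are hypothetical, and the fake fetcher stands in for the real HTTP call.

```python
from typing import Callable, Dict, List

def fetch_all_pages(fetch_page: Callable[[int, int], List[Dict]],
                    page_size: int = 5000) -> List[Dict]:
    """Collect rows page by page until a page comes back short."""
    rows: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < page_size:
            break
        offset += page_size
    return rows

# Stand-in for the API: 12 fake rows served in slices
fake_rows = [{'period': i} for i in range(12)]

def fake_fetch(offset: int, length: int) -> List[Dict]:
    return fake_rows[offset:offset + length]

all_rows = fetch_all_pages(fake_fetch, page_size=5)
print(len(all_rows))
```

In the notebook, `fetch_page` could wrap the `requests.get` call and return `r.json()['response']['data']`. The cost of this approach is one extra request when the total happens to be an exact multiple of the page size.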


Cast columns from string to appropriate types

In [12]:
api_df['UTC period'] = pd.to_datetime(api_df['period'], utc=True)
api_df['value'] = pd.to_numeric(api_df['value'])

In [13]:
len(api_df)

78839

In [14]:
api_df.head()

Unnamed: 0,period,respondent,respondent-name,type,type-name,value,value-units,UTC period
0,2015-07-01T05,PJM,"PJM Interconnection, LLC",D,Demand,84024.0,megawatthours,2015-07-01 05:00:00+00:00
1,2015-07-01T05,PJM,"PJM Interconnection, LLC",DF,Day-ahead demand forecast,29415.0,megawatthours,2015-07-01 05:00:00+00:00
2,2015-07-01T06,PJM,"PJM Interconnection, LLC",DF,Day-ahead demand forecast,27687.0,megawatthours,2015-07-01 06:00:00+00:00
3,2015-07-01T06,PJM,"PJM Interconnection, LLC",D,Demand,79791.0,megawatthours,2015-07-01 06:00:00+00:00
4,2015-07-01T07,PJM,"PJM Interconnection, LLC",DF,Day-ahead demand forecast,26574.0,megawatthours,2015-07-01 07:00:00+00:00


In [15]:
api_demand_df = api_df[api_df['type'] == 'D']

In [16]:
merged_df = pd.merge(df[['UTC time', 'D', 'Adjusted D']], api_demand_df[['UTC period', 'value']], 
                     left_on='UTC time', right_on='UTC period')

merged_df.head()

Unnamed: 0,UTC time,D,Adjusted D,UTC period,value
0,2015-07-01 05:00:00+00:00,84024.0,84024.0,2015-07-01 05:00:00+00:00,84024.0
1,2015-07-01 06:00:00+00:00,79791.0,79791.0,2015-07-01 06:00:00+00:00,79791.0
2,2015-07-01 07:00:00+00:00,76760.0,76760.0,2015-07-01 07:00:00+00:00,76760.0
3,2015-07-01 08:00:00+00:00,74931.0,74931.0,2015-07-01 08:00:00+00:00,74931.0
4,2015-07-01 09:00:00+00:00,74368.0,74368.0,2015-07-01 09:00:00+00:00,74368.0
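An inner merge silently drops hours that appear in only one source, so it can hide coverage gaps. A hedged sketch of how to audit that, on toy frames I made up: `how='outer'` keeps every hour, `indicator=True` adds a `_merge` column saying where each row came from, and `validate='one_to_one'` raises if either side has duplicate timestamps.

```python
import pandas as pd

# Toy frames for illustration only; the right side is missing one hour
left = pd.DataFrame({
    'UTC time': pd.to_datetime(
        ['2015-07-01 05:00', '2015-07-01 06:00', '2015-07-01 07:00'], utc=True),
    'D': [84024.0, 79791.0, 76760.0],
})
right = pd.DataFrame({
    'UTC period': pd.to_datetime(
        ['2015-07-01 05:00', '2015-07-01 06:00'], utc=True),
    'value': [84024.0, 79791.0],
})

audit = pd.merge(left, right, left_on='UTC time', right_on='UTC period',
                 how='outer', indicator=True, validate='one_to_one')
print(audit['_merge'].value_counts())
```

Any `left_only` or `right_only` rows point to hours that the consistency checks would never see after an inner merge.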


Null values are imputed in the spreadsheet's `Adjusted D` column

In [17]:
assert merged_df['Adjusted D'].isna().sum() == 0

The raw `D` column still contains those nulls, and it matches the demand values returned by the API.

In [18]:
assert merged_df['D'].isna().sum() > 0
assert merged_df['D'].isna().sum() == merged_df['value'].isna().sum()

The API reported values match the spreadsheet-reported raw values.

In [19]:
equal_mask = merged_df['D'] == merged_df['value']
both_nan_mask = pd.isna(merged_df['D']) & pd.isna(merged_df['value'])
equal_or_nan_mask = equal_mask | both_nan_mask
assert len(merged_df[~equal_or_nan_mask]) == 0
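When an assertion like this fails, it helps to see the offending rows rather than just a count. Here is a small sketch of a NaN-aware comparison that returns them. The helper name `find_mismatches` is hypothetical and the data is a toy example.

```python
import numpy as np
import pandas as pd

def find_mismatches(frame: pd.DataFrame, left: str, right: str) -> pd.DataFrame:
    """Rows where two columns differ, treating NaN on both sides as equal."""
    a, b = frame[left], frame[right]
    same = (a == b) | (a.isna() & b.isna())
    return frame[~same]

# Toy frame: row 2 differs in value, row 3 is NaN on one side only
toy = pd.DataFrame({
    'D': [1.0, np.nan, 3.0, np.nan],
    'value': [1.0, np.nan, 4.0, 5.0],
})
mismatches = find_mismatches(toy, 'D', 'value')
print(mismatches)
```

An empty result corresponds to the assertion above passing.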

# Create separate columns for D and DF in API results

In [30]:
demand_df = api_df[api_df.type == 'D']
d_forecast_df = api_df[api_df.type == 'DF']
new_api_df = pd.merge(demand_df[['UTC period', 'respondent', 'value']].rename(columns={'value': 'D'}), 
                      d_forecast_df[['UTC period', 'value']].rename(columns={'value': 'DF'}), 
                      on='UTC period')
new_api_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39333 entries, 0 to 39332
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   UTC period  39333 non-null  datetime64[ns, UTC]
 1   respondent  39333 non-null  object             
 2   D           39333 non-null  float64            
 3   DF          39333 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(2), object(1)
memory usage: 1.2+ MB
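The merge above is an inner join, so an hour only survives if both a `D` row and a `DF` row exist for it. A possible alternative, sketched here on toy data, is `DataFrame.pivot`, which spreads the `type` values into columns and leaves NaN where one series is missing. It assumes each (`UTC period`, `type`) pair appears at most once, since `pivot` raises on duplicates.

```python
import pandas as pd

# Toy long-format frame for illustration only; 06:00 has no DF row
toy_api = pd.DataFrame({
    'UTC period': pd.to_datetime(
        ['2015-07-01 05:00', '2015-07-01 05:00', '2015-07-01 06:00'], utc=True),
    'type': ['D', 'DF', 'D'],
    'value': [84024.0, 29415.0, 79791.0],
})

# One row per hour, one column per series type
wide = toy_api.pivot(index='UTC period', columns='type', values='value')
print(wide)
```

Whether to keep or drop the partially populated hours then becomes an explicit decision, for example via `wide.dropna()`.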


In [43]:
import io
buffer = io.StringIO()
new_api_df.info(buf=buffer)
print(buffer.getvalue())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39333 entries, 0 to 39332
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype              
---  ------      --------------  -----              
 0   UTC period  39333 non-null  datetime64[ns, UTC]
 1   respondent  39333 non-null  object             
 2   D           39333 non-null  float64            
 3   DF          39333 non-null  float64            
dtypes: datetime64[ns, UTC](1), float64(2), object(1)
memory usage: 1.2+ MB

