
Create notebook exploring bias in load growth projections #3910

Open
zaneselvans opened this issue Oct 16, 2024 · 10 comments
Labels
  • analysis: Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE.
  • community
  • ferc714: Anything having to do with FERC Form 714
  • good-first-issue: Good issues for first-time contributors. Self-contained, low context, no credentials required.
  • kaggle: Sharing our data and analysis with the Kaggle community

Comments


zaneselvans commented Oct 16, 2024

Overview

Regulated utilities have a habit of overestimating load growth in order to justify expanding their rate base. @arengel at RMI explored this in 2017 in The Billion Dollar Costs of Forecasting Electricity Demand, and the issue has become relevant once again with the rush to build gas plants and delay coal plant retirements to serve "hyperscale" data centers and AI training. To what extent are utilities simply taking advantage of the hype around this narrative to justify an "emergency" build-out of new fossil infrastructure? Data reported by planning areas in the FERC-714 can provide some context, and would also make a nice example analysis notebook for our PUDL Examples repo.

[Figure: chart from RMI's "The Billion Dollar Costs of Forecasting Electricity Demand" showing utility demand forecasts repeatedly overshooting actual demand]

Outline

  • Create a notebook using the PUDL Dataset on Kaggle
  • Use the projection data in the core_ferc714__yearly_planning_area_demand_forecast table to analyze the biases in demand forecasts.
  • Try different ways of visualizing the data to make it clear what's going on. The RMI plot above is one possibility.
  • Because different planning areas have wildly different levels of demand, it will probably make sense to normalize the projections, looking at them relative to actual demand, rather than in absolute MW or MWh units.
  • One complication is that the territory served by a given respondent can change from year to year. You can get a sense of how this might impact the results by looking at out_ferc714__summarized_demand and, if necessary, make maps of the service territories and see how they evolved over time with the out_eia861__yearly_utility_service_territory and out_eia861__yearly_balancing_authority_service_territory tables and the geometries in the Census DP1 database.
  • The names of the planning areas (which often coincide with utilities or balancing authorities) can be merged in from the core_ferc714__respondent_id table for readability.
  • Being able to look at the prediction record of a single respondent as well as ensembles of respondents would be helpful.
  • The actual peak demand numbers can be found in the historical hourly data: out_ferc714__hourly_planning_area_demand (only available as Parquet) -- you'll need to look at what the definitions of winter and summer peak demand are.
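The normalization step in the outline might look something like this sketch. The column names here are illustrative assumptions, not the actual PUDL schema:

```python
import pandas as pd

# Toy stand-ins for the forecast table and the actual peaks derived from
# the hourly data; column names are hypothetical.
forecasts = pd.DataFrame({
    "respondent_id": [1, 1, 2, 2],
    "forecast_year": [2010, 2011, 2010, 2011],
    "summer_peak_demand_mw": [1100.0, 1150.0, 520.0, 515.0],
})
actuals = pd.DataFrame({
    "respondent_id": [1, 1, 2, 2],
    "report_year": [2010, 2011, 2010, 2011],
    "summer_peak_demand_mw": [1000.0, 1020.0, 500.0, 530.0],
})

# Join each forecast to the actual peak observed in the forecasted year,
# then normalize the error by actual demand so planning areas of very
# different sizes are comparable.
bias = forecasts.merge(
    actuals,
    left_on=["respondent_id", "forecast_year"],
    right_on=["respondent_id", "report_year"],
    suffixes=("_forecast", "_actual"),
)
bias["pct_error"] = (
    bias["summer_peak_demand_mw_forecast"] - bias["summer_peak_demand_mw_actual"]
) / bias["summer_peak_demand_mw_actual"]
```

A positive pct_error is an over-forecast; aggregating it by respondent or by year gets at most of the questions below.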

Questions

  • Are some respondents consistently better than others at predicting their actual future demand?
  • When respondents are bad at predicting future demand, is the error random? Or is it systematic?
  • Are whole regions better or worse at predicting future demand? Does that correlate with whether the utilities are competitive or regulated monopolies?
  • Has the quality of predictions changed over time?
  • Has the level of load growth being predicted changed over time?
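One way to get at the random-vs-systematic question is to compare each respondent's mean signed error against its mean absolute error. This sketch uses made-up percentage errors, not real FERC-714 results:

```python
import pandas as pd

# Hypothetical normalized forecast errors for two respondents.
errors = pd.DataFrame({
    "respondent_id": [1, 1, 1, 2, 2, 2],
    "pct_error": [0.08, 0.10, 0.09, -0.05, 0.06, -0.01],
})

# |bias| close to mae means the errors point consistently in one direction
# (systematic); bias near zero with a large mae suggests random error.
summary = errors.groupby("respondent_id")["pct_error"].agg(
    bias="mean",
    mae=lambda s: s.abs().mean(),
)
```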

Background Reading

@zaneselvans zaneselvans converted this from a draft issue Oct 16, 2024
@zaneselvans zaneselvans added the analysis, ferc714, kaggle, community, and good-first-issue labels Oct 16, 2024

zaneselvans commented Oct 31, 2024

Hey @nilaykumar, just comment in here if you run into anything strange or have questions about the data or the electricity system background. (It looks like I can't assign the issue to you until you've engaged with it, though.)

@zaneselvans zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Oct 31, 2024

nilaykumar commented Nov 13, 2024

What are the formal definitions (according to form 714) of summer and winter? I looked through the form documentation but couldn't find an answer (maybe I missed it!).

The EIA's glossary defines summer as May through October and winter as November through April. I'll stick with this for the moment, but let me know if you're familiar with the precise definitions.

Edit: and would April 2025, for instance, still count as the winter of 2024?

Edit#2: Aha, I should have checked the data dictionary:

  • summer: June through September
  • winter: January through March

I imagine these are what Form 714 expects? This renders my question about the year associated with winters moot, but it does make me wonder about peaks in December, say.
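As a minimal sketch, those windows could be encoded like this. It's just one labeling convention, not an official FERC definition:

```python
# Season windows from the PUDL data dictionary: summer is June-September,
# winter is January-March; everything else is treated as shoulder season.
def season_for_month(month: int) -> str:
    if 6 <= month <= 9:
        return "summer"
    if 1 <= month <= 3:
        return "winter"
    return "shoulder"
```

Under this convention a December peak lands in the shoulder season, which is exactly the edge case I'm wondering about.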


zaneselvans commented Nov 13, 2024

I suspect that the column descriptions in the data dictionary ultimately came from the EIA column definitions even though they're in a FERC table. Unfortunately the FERC-714 instructions are totally vague on the summer/winter definition, which could mean that every respondent is applying their own criteria and it's not standardized.

This EIA post from 2020 gives a little insight into how peaks vary by region, month, and hour. The "summer peaking" pattern is a single daily peak late in the afternoon, driven by AC, while the "winter peaking" pattern is two (smaller) daily peaks, in the morning and evening, driven by heating. And when the load curve shifts between these two patterns differs by region. E.g. the US Southwest has a "summer" style pattern in all of April, July, and October, and only has the winter pattern in January, but the Northeast has a winter pattern in all of October, January, and April, and only looks like "summer" in July. So maybe it's not unreasonable that respondents in different climatic regions choose different cutoffs?

[Figure: hourly load curves by region and month from the EIA post linked above, showing single summer afternoon peaks and double winter morning/evening peaks]

For the purposes of this visualization/analysis it probably doesn't matter too much -- if we just make the summer/winter cutoff dates a parameter, we can tweak them later if need be. And looking at all those regional curves, the "winter" peaking demand pattern is always highest in January while the "summer" peaking pattern is always highest in July, so windows that exclude the shoulder seasons are probably fine. It's probably simpler initially to just find a global peak rather than calling out the summer and winter peaks separately. It looks like the RMI analysis didn't differentiate.
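Making the cutoffs a parameter might look like this sketch, with synthetic hourly demand standing in for out_ferc714__hourly_planning_area_demand:

```python
import pandas as pd

# One year of synthetic hourly demand with an artificial summer bump.
idx = pd.date_range("2020-01-01", "2020-12-31 23:00", freq="h")
demand = pd.Series(1000 + 300 * idx.month.isin([6, 7, 8, 9]).astype(int), index=idx)

def seasonal_peaks(series, summer_months=(6, 7, 8, 9), winter_months=(1, 2, 3)):
    """Global, summer, and winter peak demand under tunable season windows."""
    month = series.index.month
    return {
        "global": series.max(),
        "summer": series[month.isin(summer_months)].max(),
        "winter": series[month.isin(winter_months)].max(),
    }

peaks = seasonal_peaks(demand)
```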

Are you actually seeing winter peaks that happen in December for some respondents?

@nilaykumar

Thanks for the detailed explanation! This question about varying summer/winter designations by geo is an interesting one, but agreed -- it makes sense to stick with a simple global peak for now.

I am actually seeing peaks throughout the year, but I might be wrangling the data incorrectly. My notebook is here and I believe it should be visible (I'm new to Kaggle, so let me know if there's something missing). I've got a simple histogram of the 10-year-forecast-vs-realized over-forecast percentage at the bottom. Hopefully that looks reasonable.

@zaneselvans

Yes, I can access the notebook!

I'm suspicious of the relatively flat distribution of months in which peak demand occurred. I would expect it to be primarily centered around a summer peak in July or August, with a smaller set of planning areas (if any) peaking in ~January.

peak_month
8     18722
7     18182
6     17323
1     16075
5     15973
12    15964
3     15864
10    15634
4     15503
9     15495
11    15176
2     14254
Name: count, dtype: int64

One thing that might be happening is that the planning areas reporting FERC-714 vary wildly in size, and the smaller ones probably have a much more variable pattern of demand. You might try just looking at planning areas above a certain total demand threshold? The out_ferc714__summarized_demand table has some annual summary statistics for the FERC-714 respondents in it that you can use to identify a set of respondent IDs associated with larger demand that you can focus on.
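The size-threshold filter could be as simple as this sketch, with illustrative column names standing in for out_ferc714__summarized_demand and an arbitrary cutoff:

```python
import pandas as pd

# Toy annual demand summaries; MIN_ANNUAL_MWH is an assumed cutoff.
summarized = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "demand_annual_mwh": [9_000_000.0, 150_000.0, 4_500_000.0],
})
MIN_ANNUAL_MWH = 1_000_000

# The resulting set of IDs can be used to filter the hourly demand data.
big_respondents = set(
    summarized.loc[summarized["demand_annual_mwh"] >= MIN_ANNUAL_MWH, "respondent_id"]
)
```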

Also the number of peak values being reported in peak_month seems very large. There are only ~200 respondents and 18 years of data, so there should be a total of ~3600 instances of peak annual demand -- if the peak is unique, which I guess it won't be, but should it really be as non-unique as it appears to be here? It looks like there are ~200,000 instances of actual demand matching peak demand.

It might be good to spot check a couple of big regional respondents and make sure they look reasonable. E.g. the California ISO and ERCOT should both have a clear summer peak. Maybe aggregate to the max value per day and plot those curves to see what the seasonal patterns look like for various planning areas.
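The daily-max spot check might be sketched like this, again with synthetic hourly data in place of the real Parquet table:

```python
import pandas as pd

# Three days of synthetic hourly demand for one respondent.
idx = pd.date_range("2020-06-01", "2020-06-03 23:00", freq="h")
hourly = pd.DataFrame({"demand_mwh": range(len(idx))}, index=idx)

# Aggregate to the daily maximum; daily_max.plot() would then show the
# seasonal shape for a planning area like CAISO or ERCOT.
daily_max = hourly["demand_mwh"].resample("D").max()
```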

@zaneselvans

I think the histogram looks generally like what I would expect. More or less centered around 0, but with a right-skew.

Given the wide range of total demand in the different planning areas, it might be more informative to do a histogram that's weighted either by peak demand or total demand.
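A demand-weighted histogram can be built by passing weights to numpy; the numbers here are made up for illustration:

```python
import numpy as np

# Forecast errors and the peak demand of the respondent that made them.
pct_error = np.array([0.10, -0.02, 0.05, 0.30])
peak_mw = np.array([5000.0, 2000.0, 8000.0, 300.0])

# Each error contributes its respondent's peak demand instead of a count
# of 1, so large planning areas dominate the distribution.
counts, edges = np.histogram(pct_error, bins=4, weights=peak_mw)
```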

@nilaykumar

Nice catch, I had a lot of duplicates there from demand numbers that were either identically zero or hitting the yearly peak quite often (e.g. during the summer). Dropping duplicates appropriately seems to give more reasonable results (though there are still a decent number of peaks in December, for example):

peak_month
7     601
8     574
1     444
6     207
12    124
2      90
9      80
11     54
3      19
4      13
5      10
10     10
Name: count, dtype: int64
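The dedupe looks roughly like this sketch (toy data, not the actual hourly table):

```python
import pandas as pd

# Toy hourly records where several hours tie the annual maximum and one
# row is a zero-demand artifact.
hourly = pd.DataFrame({
    "respondent_id": [1, 1, 1, 1],
    "year": [2020, 2020, 2020, 2020],
    "month": [7, 7, 12, 12],
    "demand_mwh": [900.0, 900.0, 900.0, 0.0],
})

# Keep rows matching each respondent-year maximum, then keep only one of
# them; ties across months are broken arbitrarily by first occurrence.
annual_max = hourly.groupby(["respondent_id", "year"])["demand_mwh"].transform("max")
peaks = hourly[hourly["demand_mwh"] == annual_max].drop_duplicates(
    subset=["respondent_id", "year"]
)
```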

I've started to sketch out some plots similar to the RMI plot. I'm not familiar enough with the data yet to have much confidence in them (the peak-weighted curve is all over the place), but getting there!

@zaneselvans

Okay, that distribution looks much more like I would expect -- almost all the peaks are in clear summer or winter months and not the shoulder seasons.

@zaneselvans

Annnnd now Georgia Power has an even more bonkers load projection that would triple its overall generating capacity by 2030, almost entirely driven by datacenter loads. Just deranged fantasy. The IRP will be 🍿🔥

@zaneselvans

Hey @nilaykumar I'm going to be offline for December, but @cmgosnell and @arengel would both have context on this analysis if you need to check in with someone.
