Estimate monthly plant-level fuel prices w/o using the EIA API #1343

Closed · 14 of 17 tasks
zaneselvans opened this issue Nov 10, 2021 · 49 comments
Labels: analysis, data-repair, eia923

@zaneselvans commented Nov 10, 2021

Instead of using the EIA API to pull monthly average fuel costs by state and fuel when individual fuel deliveries have their costs redacted in the fuel_receipts_costs_eia923 table, calculate it for ourselves.

Motivation

This change will address several issues:

  • The EIA API is missing a fair amount of the data anyway. Sometimes whole state-months are missing. It also only contains data for coarse fuel categories (coal, petroleum, natural gas) rather than the specific fuel types.
  • Relying on the API means asking users to register for an API key and manage environment variables. This is a barrier for many of our less technical users.
  • Whenever something goes wrong with the API, our CI tests fail, and we can't work with this data locally. Over time this has been happening more frequently: HTML gets returned instead of JSON, or the network is down.
  • EIA is discontinuing the v1 API in November 2022, so our current setup will stop working anyway.
  • There's a lot of information in the fuel_receipts_costs_eia923 table, and in related tables describing the plants, mines, and suppliers involved. It should be possible to do a fairly good estimation of the fuel prices from scratch given all that context.

Approach

  • Estimate fuel prices using a variety of aggregations and use them to fill missing values.
  • Start with the most granular / accurate and progressively apply less specific estimates until everything is filled in.
  • Tag each record indicating which estimation was used to fill it in.
  • Pre-calculate all of the aggregations so that we can look at how they compare with actual values first.
  • Add each of these aggregations to the original FRC dataframe for plotting.
  • We should also include the EIA API values for comparison / constraint based on the redacted values.
  • Looking at the EIA API, only PEL, PC, COW, and NG really have values for $/MMBTU at census region and state level.
  • Seems like the very granular fuel types only have prices for US Total, at least at the monthly level.
  • Use median values of the fuel prices in $/MMBTU
  • Maybe calculate a weighted median? Want typical MMBTU, not typical delivery.
  • @gschivley and Neha suggested using both spatial and temporal interpolation -- averaging prices from the adjacent states, and filling in gaps in the monthly time series when possible.
  • We could also use a low-effort, but powerful estimator like XGBoost or a random forest to try and incorporate much more information, without designing something bespoke from scratch.
  • We should be able to benchmark these calculations against the data from the API or the specific information reported in the FRC table by doing some random knockouts to see how well we can recreate the reported values.
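
For example, a hedged sketch of such a knockout benchmark, where fill_func stands in for whatever estimator is being tested (the helper name and defaults are hypothetical):

import numpy as np
import pandas as pd

def knockout_benchmark(frc: pd.DataFrame, fill_func, frac: float = 0.1, seed: int = 0) -> float:
    """Hide a random fraction of known fuel prices, refill them, and report median relative error."""
    rng = np.random.default_rng(seed)
    # Only knock out records that actually have a reported price:
    knocked = frc.fuel_cost_per_mmbtu.notna() & (rng.random(len(frc)) < frac)
    test = frc.copy()
    test.loc[knocked, "fuel_cost_per_mmbtu"] = np.nan
    filled = fill_func(test)
    rel_err = (
        filled.loc[knocked, "fuel_cost_per_mmbtu"] - frc.loc[knocked, "fuel_cost_per_mmbtu"]
    ) / frc.loc[knocked, "fuel_cost_per_mmbtu"]
    return rel_err.abs().median()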

Choosing Aggregations

  • How do we decide how to prioritize aggregations?
  • Coal prices don't vary much month to month, so aggregating annually would have little impact.
  • Gas & petroleum prices can vary dramatically month to month, so aggregating across time is bad.
  • Petroleum fuel prices are highly correlated nationwide, so aggregating geographically has little impact.

Intuitive Aggregation Priorities

  1. Most precise: ["state", "energy_source_code", "report_date"]
  2. Annual aggregation (coal): ["state", "energy_source_code", "report_year"]
  3. Regional aggregation (petroleum): ["census_region", "energy_source_code", "report_date"]
  4. Fuel type aggregation: ["state", "fuel_group_code", "report_date"]
  5. Both regional and fuel type aggregation: ["census_region", "fuel_group_code", "report_date"]
  6. Annual, regional, and fuel type aggregations: ["census_region", "fuel_group_code", "report_year"]
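
A minimal sketch of this cascading fill, assuming frc already carries state, census_region, report_year, and fuel_group_code columns (illustrative only, not the final implementation):

import pandas as pd

AGG_PRIORITY = [
    ["state", "energy_source_code", "report_date"],
    ["state", "energy_source_code", "report_year"],
    ["census_region", "energy_source_code", "report_date"],
    ["state", "fuel_group_code", "report_date"],
    ["census_region", "fuel_group_code", "report_date"],
    ["census_region", "fuel_group_code", "report_year"],
]

def cascade_fill(frc: pd.DataFrame) -> pd.DataFrame:
    """Fill missing fuel prices from group medians, most specific aggregation first."""
    frc = frc.copy()
    reported = frc.fuel_cost_per_mmbtu.copy()  # medians come from reported values only
    frc["filled_by"] = pd.NA
    for cols in AGG_PRIORITY:
        group_median = reported.groupby([frc[c] for c in cols]).transform("median")
        to_fill = frc.fuel_cost_per_mmbtu.isna() & group_median.notna()
        frc.loc[to_fill, "fuel_cost_per_mmbtu"] = group_median[to_fill]
        # Tag each filled record with the aggregation that filled it:
        frc.loc[to_fill, "filled_by"] = "_".join(cols)
    return frc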

Questions:

  • Should we use a MMBTU weighted median rather than delivery weighted median?
  • How should we identify outlier values in the fuel prices which should be replaced? Some are totally whacked.

Other Potential Refinements

  • Automatically fill using aggregations in order of increasing dispersion of the error distribution (e.g. IQR) rather than hard-coding the order based on intuition and eyeballing it (see the sketch below this list).
  • Calculate the dispersion of the error distribution on an annual basis, rather than across the entire timeline, in case the temporal, fuel type & spatial correlations change over time.
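
A sketch of that ranking, reusing the hypothetical AGG_PRIORITY list from the cascading-fill sketch above (no train/test split, so this is a best-case ranking):

import pandas as pd

def relative_error_iqr(frc: pd.DataFrame, cols: list[str]) -> float:
    """IQR of the relative error between group-median estimates and reported prices."""
    est = frc.groupby(cols)["fuel_cost_per_mmbtu"].transform("median")
    rel_err = (est - frc.fuel_cost_per_mmbtu) / frc.fuel_cost_per_mmbtu
    return rel_err.quantile(0.75) - rel_err.quantile(0.25)

# Fill with the tightest-error aggregations first:
fill_order = sorted(AGG_PRIORITY, key=lambda cols: relative_error_iqr(frc, cols))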

Remaining tasks:

  • Always include plant_state in the fuel_receipts_costs_eia923 output table.
  • Add the census regions to state mappings into the metadata enums / constants.
  • Replace the existing roll & fill method in the fuel_receipts_costs_eia923 output routine.
  • Update tests to work with the new version of frc_eia923
  • Remove API_KEY_EIA infrastructure from everywhere in the code, so we aren't unknowingly relying on it.
  • Make filling in missing fuel prices the default behavior
  • Fix the filled_by labeling, which currently shows all filled values as having national_fgc_year, the last aggregation.
  • Remove fuel_group_code from the fuel_receipts_costs_eia923 table and add it to the energy_sources_eia coding table, and add it back into the output function.
  • Understand why these changes are apparently affecting output row counts
  • Pull the fuel price filling out into its own separate function
  • Understand why date_merge() is removing ~10k frc_eia923 records.
  • Implement weighted median function to use in filling & identifying outliers
  • Add weighted median unit tests
  • Identify outlying fuel prices using modified z-score with MMBTU weighted median
  • Have @cmgosnell look for weirdness in the results of a new MCOE calculation in an RMI context.
  • Update release notes
  • After merging into main remove API_KEY_EIA from the GitHub secrets.
@zaneselvans commented:

Hey @katherinelamb I'd like to talk to you about how we might use some ML / regressions to fill in these missing fuel costs if that's an issue you might like to take on.

@katie-lamb commented Mar 17, 2022

This definitely sounds interesting and doable.

Potential data for features

  • who is selling the fuel, who is buying it, what kind of fuel, how much, what kind of contract, prices in neighboring states, historical timeseries

It's probably best to start by trying to reduce the problem to just be either spatial or temporal and then try some regressions and common modeling frameworks. Seems like it's harder but still doable to model both spatially and temporally.
A novel framework for spatio-temporal prediction

@zaneselvans commented:

While we do have the lat/lon location for all the plants, for the "spatial" dimension I was imagining something like looking at the prices reported in adjacent states if they're available, since I imagine that there are generally regional price trends. I think this is what Aranya / Greg did. E.g. if a given state-month-fuel combination is missing a price, fill it in with the average of the reported price for adjacent states for that fuel and month, and then that filled-in state-month-fuel table could be used to estimate the cost per unit of fuel for any records in the fuel_receipts_costs_eia923 table that have redacted values. But maybe that's more vague and complex than it needs to be?

I guess in an ideal world we'd be able to make predictions / estimations for every record with a missing value, given all of the information available in a very generic way, but that seems like asking a lot. There would be a lot of categorical variables (state, buyer, seller, fuel type, spot market vs. contract purchase) and numerical values (quantity of fuel purchased, heat content per unit of fuel, cost of fuel for that month in all adjacent states, or the region, or the country as a whole). And really you could bring in any number of utility, plant, and generator attributes if they were helpful.
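
To make the adjacent-state idea concrete, a rough sketch, assuming a state -> adjacent-states mapping and a prices series indexed by (state, report_date, fuel); none of these names are from the actual codebase:

import pandas as pd

def fill_from_neighbor_states(prices: pd.Series, state_neighbors: dict[str, list[str]]) -> pd.Series:
    """Fill missing (state, month, fuel) prices with the mean of adjacent states' reported prices."""
    filled = prices.copy()
    for state, date, fuel in prices.index[prices.isna()]:
        neighbor_prices = [
            prices.get((neighbor, date, fuel))
            for neighbor in state_neighbors.get(state, [])
        ]
        neighbor_prices = [p for p in neighbor_prices if pd.notna(p)]
        if neighbor_prices:
            filled.loc[(state, date, fuel)] = sum(neighbor_prices) / len(neighbor_prices)
    return filled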

@zaneselvans commented:

I looked into this today and found some things.

  • The API information we're using to fill in fuel prices right now is coarse in terms of fuel type. It only uses 3 high level categories, one each for all coal (COW), all petroleum derived fuels (PEL), and all natural gas (NG).
  • However, the FRC table has the much more detailed energy_source_code column that we could use for missing price estimation if we wanted to.
  • We map these three high level fuel categories to our own fuel_type_code_pudl (coal, oil, gas)... but we lump different energy_source_code values into our categories than EIA does, which has to introduce some misalignment between the filled-in prices we're using and what we'll get if we calculate monthly per-state, per-fuel costs.
  • Even so, looking just at these imperfectly mapped high level categories, and dropping all the NA before calculating the average fuel prices, the correlation is very high between the average coal/oil/gas price per unit as calculated from the FRC, and the value we get from the API.
  • I suspect we chose to use these high level categories because it was more likely to give us a value to fill in when it was redacted from the FRC table.
  • However we didn't understand the hierarchies of categories that the EIA was using at the time. Our categorization of coal, oil, and gas is more like solid, liquid, gas, while theirs is more indicative of the original resource the fuel was derived from (e.g. petcoke is solid so we called it coal, but it's derived from oil, so they call it petroleum).
  • Now that we have these different fuel attributes explicitly broken out in the energy_sources_eia coding table, we could choose to organize things differently if we wanted to, or potentially just use their categorizations instead of rolling our own.

API vs. frc_eia923

Here's how the fuel API prices compare to an aggregation from the reported data using the (slightly different) high level categories. Each point corresponds to a particular (report_date, state, fuel_type) combo.

Coal

[image]

Oil

[image]

Gas

(note the different X and Y scales because of some outliers)
[image]

What to do?

  • Right now we could fill in missing values using the coal/oil/gas aggregated state prices. This would yield slightly different results than we're currently getting because those categories correspond to slightly different sets of records in the EIA and PUDL universes right now. But I don't think the different results would obviously be less correct than the current ones.
  • We could reform the fuel_type_code_pudl so that it does correspond to the EIA categorizations, and that ought to bring the two calculations closer together.
  • I think the "right" thing to do is probably to use a regression (random forest?) to try and estimate prices for each combination of (report_date, state, energy_source_code) rather than fuel_type_code_pudl, using whatever information is available in the FRC table. Or I guess we don't necessarily have to restrict ourselves to just looking at state-level estimates. We could try and estimate individual plant level prices and see how it does.

@zaneselvans commented:

I'm curious what @cmgosnell thinks about the simpler options above as an interim change.

@cmgosnell commented:

For context on the original choice to use the larger categories: iirc we went with fuel_type_code_pudl so that we'd get higher coverage. iirc the original way we were doing it (i.e. not what is currently in dev) had some (unnecessary) limiting factor, so we went with the broad fuel types to get broader coverage.

I agree generally that we can and probably should use the more granular energy_source_code. I assume that whatever we do won't align perfectly with EIA (in part because I could never find their methodology for how they generated these state-level values). I would personally not work with option 2 above (reform fuel_type_code_pudl). The desire imo is something defensible, not something that perfectly matches EIA.

@zaneselvans commented Jun 14, 2022

Do you think it makes sense to swap in our own average fuel prices based on the fuel_type_code_pudl aggregations for the time being, if it gives us coverage that's similar to what we're getting from the EIA API (which IIRC, wasn't great)?

Another thought: does it make more sense to try and estimate the price per unit or the price per MMBTU? (where unit is ton, barrel, or mcf). I would think that within these big categories there would be less dispersion on a $/MMBTU basis, since what you're really buying is heat content, not mass (i.e. BIT costs much more than LIG per ton, but they're closer to each other per MMBTU). I think we have the option of doing either one.

@cmgosnell commented:

I definitely think using fuel_type_code_pudl for the time being makes sense, but I assume it's similar math to try it with energy_source_code, so I'd personally check the coverage of the latter before defaulting to fuel_type_code_pudl.

I am less familiar with the unit vs btu distinction, but I think you are on to something! If all the info is there (which I believe it is in the frc table), then yeah, go for it... but again I'd check energy_source_code first. iirc the original coverage of the API fuel source was artificially limited because of something silly I did that was later fixed. My gut tells me that we'd get pretty good overall coverage using just energy_source_code with our own data.

I would almost use state-level energy_source_code prices and then try nearby states or just a national average.

@zaneselvans commented:

Okay, I'll try the simplistic (report_date, state, energy_source_code) and (report_date, state, fuel_type_code_pudl) options and see how the coverage compares to what we're currently doing.

Beyond that, rather than immediately jumping to our own bespoke estimator based on adjacent states, I'd like to try a few generic regression options from scikit-learn that can make use of a lot more features in the FRC table, and that can be trained and validated on the records in the table that do have fuel prices (which is most of them).

I guess we can also easily check whether the price per physical unit or the price per MMBTU is more consistent just by plotting the distributions of observed prices in the dataset per date / state / fuel. It seems like whichever variable has less dispersion would be the easier one to estimate accurately. But curious what @katie-lamb and @TrentonBush think with their stats background.
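
One quick way to run that check, sketched under the assumption that price per unit can be reconstructed as fuel_cost_per_mmbtu * fuel_mmbtu_per_unit (the helper and column name fuel_cost_per_unit are hypothetical):

import pandas as pd

def relative_iqr(s: pd.Series) -> float:
    """Scale-free dispersion: IQR divided by the median."""
    return (s.quantile(0.75) - s.quantile(0.25)) / s.median()

frc["fuel_cost_per_unit"] = frc.fuel_cost_per_mmbtu * frc.fuel_mmbtu_per_unit
groups = frc.groupby(["report_date", "state", "energy_source_code"])
dispersion = pd.DataFrame({
    "per_mmbtu": groups["fuel_cost_per_mmbtu"].apply(relative_iqr),
    "per_unit": groups["fuel_cost_per_unit"].apply(relative_iqr),
})
# Whichever column typically has less dispersion should be easier to estimate:
print(dispersion.median())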

@zaneselvans commented:

Maybe @priyald17 or @joshdr83 would have suggestions on how best to approach this too.

@zaneselvans commented:

For context here's the fuel_receipts_costs_eia923 table on Datasette and the field we're trying to estimate is fuel_cost_per_mmbtu. There are a bunch of additional attributes associated with both the plant_id_eia and the mine_id_pudl including the lat/lon of the plants and the county FIPS code associated with many of the coal mines, that could also be used to construct features.

zaneselvans changed the title from "Estimate monthly per-state fuel costs instead of using the EIA API" to "Estimate monthly plant-level fuel prices w/o using the EIA API" on Jun 14, 2022
@TrentonBush commented Jun 14, 2022

To clarify the problem:

  • we want to impute redacted fuel costs
  • we have previously filled them with one of the values from the EIA API, but that is a pain for various operational reasons
  • those EIA values are mysterious (to us) calculations that we think are something like a weighted average price that includes:
    • all the data we have
    • AND the redacted data we don't have

Is that right?

I see two separate parts to this:

  1. How well can we model known prices given all the other info we have in PUDL?
  2. Given the EIA API has information we don't have, how can we use that to validate our imputations of unknown prices?

Relatedly, the first few scoping questions I'd ask are:

  • how much variation is there in fuel price at a given point in time and/or region? (what are the stakes)
  • how much data is redacted? Are those particularly important for any reason? (what are the stakes)
  • do we have reason to think redacted costs are systematically different from known costs? (is this method even valid)

Ideally we test the validity, but remember we do have the option to just "assume and caveat" if the stakes are low. Plus, if EIA has done their job, there won't be a good way to test it.

With respect to modelling, I'd let the "what are the stakes" question guide the degree of complexity (simple averages vs fancier models). I'm not familiar with how redaction works (are the same plants always redacted? Does it come and go? etc), but in principle I think we have everything we need to construct a well formulated ML solution, if that is warranted.

@katie-lamb commented Jun 14, 2022

I wrote down these questions that are pretty similar to what Trenton asks above:

  • Are there particular states that are missing more data than other states or is it a somewhat random dispersion of missing data?
    • If Nevada was missing a lot of data then there might not be enough recorded Nevada data to provide an accurate estimate based on state average. It also might not necessarily be a good idea to use California as a proxy for fuel costs just because they are geographically close.
  • Similarly, are there particular time series that are missing data? Are we missing more data from some fuel types than others? Is there a reason for this?

If the redacted data is kind of "one off" and there is still enough representative data, then doing a regression to estimate prices for each date, state, and fuel type could definitely work. But I also agree that we have all the tools and data needed to do something more comprehensive if it seems that the missing data can't be represented that simplistically.

Although I think I got a little lost at some point about the construction of fuel_type_code_pudl vs. energy_source_code, and about price per MMBTU vs. price per unit, I think it makes sense to do some feature selection with the attributes of the fuel receipts cost table + energy_source_code and attributes from related tables for each of the potential price attributes we might estimate, instead of assuming that nearest states or something is the best estimator. The scores from this feature selection could also reveal something about which price attribute is best to estimate. Maybe the k best features are similar for each estimated variable, or maybe there's one variable that stands out as more easily modeled.

@zaneselvans commented Jun 14, 2022

  • The fuel_receipts_costs_eia923 table has about 560,000 records in it, and about 67% of them have reported fuel prices.
  • I'm not sure the monthly statewide average fuel costs do account for the redacted values, though that would be helpful.
  • There are cases in which no state-level average is available, and I'm not sure why.
  • Using the statewide averages and the coarse fuel types from the API, we get 91% coverage. Using rolling averages only gets us up to 93%, so the time gaps are pretty big.

The distribution of missing values is skewed towards the early years (2008-2011) before filling with the API:

[image]

After filling with the API it's pretty uniform across all the years:

[image]

@zaneselvans commented:

What do you think would be a good way to evaluate the dispersion of fuel prices? It seems like there are at least 3 dimensions to consider in sampling and estimating fuel prices:

  • Spatial: multi-state regions, single states, or individual plants.
  • Temporal: the entire price history, a single year, a single month, or some seasonally correlated periods.
  • Fuel Type: coarse fuel groups (coal, oil, gas) or individual energy_source_code values.

Using the more granular options (plants, months, energy_source_code) seems like it should give us lower dispersion results, but it'll also reduce the sample sizes pretty dramatically, reducing confidence in the numbers we get out. For example, here are some state-level distributions for a single year and a single energy source code. I chose a mix of big states (CA, FL, NY, TX), medium states (CO, WY), and small states (ID, ME). NG and SUB are the most common fuel types.

Price distribution by state for NG (2020)

[image]

Price distribution by state for SUB (2020)

[image]

@zaneselvans commented:

Rather than just counting individual fuel deliveries up, would it make more sense to weight them by the total heat content of the fuel delivered?

@TrentonBush commented Jun 15, 2022

The following is ONLY for gas prices (filtering not shown)

  • monthly median gets you within about 15% of the true value
  • a state-month median gets you to within about 10% (+- $0.30ish) of the true value

I'm sure we could tighten that much further with a better model. I'm just not sure what our goals should be here without an application in mind. Maybe a reasonable next step would be to throw a low-effort, high-power model like XGBoost at it and see what kind of accuracy is possible.

# I didn't bother with any train-test split here, so these error values should
# be considered best-case indicators.
import numpy as np

frc['global_median_fuel_price'] = frc['fuel_cost_per_mmbtu'].median()
frc['month_median_fuel_price'] = frc.groupby('report_date')['fuel_cost_per_mmbtu'].transform(np.median)
frc['state_month_median_fuel_price'] = frc.groupby(['state', 'report_date'])['fuel_cost_per_mmbtu'].transform(np.median)

error = frc.loc[:, [
    'global_median_fuel_price',
    'month_median_fuel_price',
    'state_month_median_fuel_price',
]].sub(frc['fuel_cost_per_mmbtu'], axis=0)

pct_error = error.div(frc['fuel_cost_per_mmbtu'], axis=0)

[images: pct_error distributions for the global, month, and state-month median estimators]

@katie-lamb commented:

To first address the goal of removing the EIA API calls from the CI asap:

Do you have an accuracy metric on how well the most granular option (plants, months, energy_source_code) performs against the API results? Or how swapping in state level instead of plant affects accuracy? I think for now you could just sub in the median of whatever combo of the three dimensions gets the highest API accuracy. It seems like it might be state, month, energy code. Although @zaneselvans mentioned there are states with no data (I think you actually said a state-level average is not available), so does this necessitate multi-state aggregations?

After this quick fix, I think looking at what other features we could bring into a more powerful model would be useful.

@TrentonBush commented:

Are the API results our target or are the redacted values our target? I thought the API was basically just an outsourced imputation model. If it includes redacted data in its aggregates then it has value for validation. Is there anything more to the EIA API?

@katie-lamb commented:

I believe you're right and the API is just modeling these values. After talking to Zane, I think to get a quick fix for today/tomorrow the API results are our target since people currently using the table don't want the values to change significantly. But to develop a more accurate and comprehensive solution in the less short term, the redacted values are indeed the target.

@TrentonBush commented:

Oh I didn't know there was a timeline on this at all, much less tomorrow! Do what you gotta do

@katie-lamb commented:

LOL ya I don't think this discussion is really framed around a timeline and is in fact more long term ideating of modeling tactics. My comment was really just spurred by annoyance that the CI is unreliable and wanting something to stand in there while a better solution is developed.

@zaneselvans commented:

@TrentonBush The earlier scatter plots are comparing all the reported data -- so the ones where there actually was data in the FRC table, and they're only being aggregated by [state, month, fuel_group]

The more recent scatter plots are only looking at data points that were not present in the FRC table, comparing the values filled in by our new method (broken out by the kind of aggregation used) vs. the API values. So it's not surprising that the correlation is worse in general.

@TrentonBush commented Jun 17, 2022

Oh ok, now if I squint I can see how they are related -- basically remove the line and just look at the dispersion, plus add some x-coordinate noise to account for mean to median. Ya this model makes sense to me. I think using the median is justified given the handful of big outliers I ran into.
We definitely want to use the mean for the purpose of disaggregating the API results in the future though.

one of our main applications of this data is to identify unusually expensive coal plants for advocates

That's an awesome use case I'd love to learn more about! There is certainly more accuracy to wring out of this data, and that kind of anomaly detection would probably require it. Maybe we find a day or two next sprint to try other models

@zaneselvans commented Jun 18, 2022

Yeah, there are some wild outliers. I want to figure out how to remove them and fill them in with reasonable values. Some of them are out of bounds by factors of like 1000x -- natural gas is particularly messy. The mean values were quite skewed.

Given that the values aren't really normally distributed, would it still make sense to replace anything more than 3–4σ away from the mean? I could drop the top and bottom 0.1% of the distribution too, but there aren't always big outliers, and sometimes more than 0.1% of the points are bad. Not sure what the right thing to do is.

@zaneselvans commented:

I'm doing some additional normalization to the table and moving the fuel_group_code to the energy_sources_eia table, since the value is entirely determined by energy_source_code with this mapping:

assert (frc.groupby("energy_source_code").fuel_group_code.nunique() == 1).all()
dict(frc.groupby("energy_source_code").fuel_group_code.first())
{'BFG': 'other_gas',
 'BIT': 'coal',
 'DFO': 'petroleum',
 'JF': 'petroleum',
 'KER': 'petroleum',
 'LIG': 'coal',
 'NG': 'natural_gas',
 'OG': 'other_gas',
 'PC': 'petroleum_coke',
 'PG': 'other_gas',
 'RFO': 'petroleum',
 'SC': 'coal',
 'SGP': 'other_gas',
 'SUB': 'coal',
 'WC': 'coal',
 'WO': 'petroleum'}

I'm also renaming the column to fuel_group_eiaepm since it's the fuel grouping that's used for reporting in the EIA Electric Power Monthly.

@zaneselvans commented Jun 18, 2022

That's an awesome use case I'd love to learn more about!

This is basically the RMI ReFi project! The fuel is by far the biggest component of the OpEx for coal plants, and this data has been used to identify which plants are vulnerable to getting refinanced out of existence.

@zaneselvans commented:

Okay, definitely some room for improvement. E.g. look at how coal prices in NY drop dramatically (in the filled-in data) starting in 2012 here:

from typing import Literal

import pandas as pd
import seaborn as sns


def facet_fuel(
    frc: pd.DataFrame,
    states: list[str],
    fuel_facet: Literal["energy_source_code", "fuel_group_eiaepm", "fuel_type_code_pudl"],
    by: Literal["fuel", "state"] = "state",
    fuels: list[str] | None = None,
    max_price: float = 100.0,
    facet_kws: dict | None = None,
) -> None:
    mask = (
        (frc.state.isin(states))
        & (frc.fuel_cost_per_mmbtu <= max_price)
        & (frc[fuel_facet].isin(fuels) if fuels else True)
    )
    facet = fuel_facet if by == "state" else "state"
    sns.relplot(
        data=frc[mask],
        x="report_date",
        y="fuel_cost_per_mmbtu",
        hue=facet,
        style=facet,
        row=fuel_facet if by == "fuel" else "state",
        kind="line",
        height=4,
        aspect=2.5,
        facet_kws=facet_kws,
    );

facet_fuel(
    frc,
    fuel_facet="fuel_group_eiaepm",
    states=["CA", "NY", "FL", "TX"],
    fuels=["coal", "natural_gas", "petroleum"],
    by="fuel",
    facet_kws={"sharey": False},
)

[image]

@zaneselvans commented:

I ran tox -e nuke to generate a new database with the slight schema changes, and validate its contents, and ended up getting the capacity_mw and row count errors that @katie-lamb was encountering after the date_merge updates, so I merged the changes from her current PR (#1695) and re-ran the validations, but I'm still getting a bunch of row count errors, so maybe they really are due to something that changed here?

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 214669 rows, expected 214613. Off by 0.026%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 213382 rows, expected 213329. Off by 0.025%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 1964 rows, expected 1961. Off by 0.153%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gens_eia860-490824-490824-490824] - ValueError: gens_eia860: found 491469 rows, expected 490824. Off by 0.131%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-2507040-2507040-214613] - ValueError: gf_eia923: found 2507830 rows, expected 2507040. Off by 0.032%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nonuclear_eia923-2492441-2491704-213329] - ValueError: gf_nonuclear_eia923: found 2492441 rows, expected 2491704. Off by 0.030%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_nuclear_eia923-23498-23445-1961] - ValueError: gf_nuclear_eia923: found 23498 rows, expected 23445. Off by 0.226%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-plants_eia860-171148-171148-171148] - ValueError: plants_eia860: found 171570 rows, expected 171148. Off by 0.247%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-170370-170370-170370] - ValueError: pu_eia860: found 170792 rows, expected 170370. Off by 0.248%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-113357-113357-113357] - ValueError: utils_eia860: found 113464 rows, expected 113357. Off by 0.094%, allowed margin of 0.000%

It's not clear off the top of my head why the changes in this PR would result in small changes to the number of rows in these other tables though. Does that make sense to anyone else?

@zaneselvans commented Jun 19, 2022

Comparing old expectations, new expectations, and the rows we've got here...

| table | observed rows | old rows | date_merge rows | matches |
| --- | --- | --- | --- | --- |
| gens_eia860_raw | 491,469 | 491,469 | 490,824 | old |
| gens_eia860_annual | 491,469 | 491,469 | 490,824 | old |
| gens_eia860_monthly | 491,469 | 491,469 | 490,824 | old |
| gf_eia923_raw | 2,507,830 | 2,507,870 | 2,507,040 | neither |
| gf_eia923_annual | 214,669 | 214,709 | 214,613 | neither |
| gf_eia923_monthly | 2,507,830 | 2,507,870 | 2,507,040 | neither |
| gf_nonuclear_eia923_annual | 213,382 | 213,422 | 213,329 | neither |
| gf_nonuclear_eia923_monthly | 2,492,441 | 2,492,481 | 2,491,704 | neither |
| gf_nuclear_eia923_annual | 1,964 | 1,964 | 1,961 | old |
| gf_nuclear_eia923_monthly | 23,498 | 23,498 | 23,445 | old |
| plants_eia860_raw | 171,570 | 171,570 | 171,148 | old |
| plants_eia860_annual | 171,570 | 171,570 | 171,148 | old |
| plants_eia860_monthly | 171,570 | 171,570 | 171,148 | old |
| pu_eia860_raw | 170,792 | 170,792 | 170,370 | old |
| pu_eia860_annual | 170,792 | 170,792 | 170,370 | old |
| pu_eia860_monthly | 170,792 | 170,792 | 170,370 | old |
| utils_eia860_raw | 113,464 | 113,464 | 113,357 | old |
| utils_eia860_annual | 113,464 | 113,464 | 113,357 | old |
| utils_eia860_monthly | 113,464 | 113,464 | 113,357 | old |

@zaneselvans commented:

More weirdness.

The number of FRC records filled with each of the different aggregations seems a little different now than it was when I was originally developing the process in a notebook.

  • After inserting it into the midst of the pudl.output.fuel_receipts_costs_eia923() function, only one of the national aggregations has any records associated with it.
  • We lose 11,039 records when date_merge() is run on the FRC output dataframe near the end of the output function. Need to figure out what those records are and why.
  • It also turns out there are about 10,500 records in the FRC table whose plant_id_eia values don't have states associated with them in the plants_entity_eia table, which seems pretty weird. They all start with 88, and their plant records are almost entirely null.
  • The stateless FRC records are relatively flatly distributed across years. They're 7236 coal, 1605 natural gas, 1504 petroleum, and 166 petcoke.

The lack of states is surprising, but seems like that's just how it is. The loss of a bunch of records in the date_merge() process is confusing. @katie-lamb and @cmgosnell could we talk about what might be going on there? I'm wondering if it might be due to the fact that the FRC table isn't a "time series" in that the report_date field isn't part of a primary key -- it's just a bunch of transactions, and there are some valid duplicate records. We use an autogenerated pseudo-key in the database. I'll compile a dataframe of the records that are getting lost.

@katie-lamb commented:

After running the ETL, I updated the row counts and merged that PR into dev so that should fix some of your issues @zaneselvans

@zaneselvans commented:

Ah okay so some of the row counts in your branch weren't actually the expected ones?

@zaneselvans commented:

@katie-lamb any intuition as to why date_merge() would be causing ~11,000 rows to get dropped from the FRC table?

@zaneselvans commented:

Comparing the records that exist before and after date_merge() it looks like all of the records for plants that are missing a state field (10,511) are getting dropped for some reason, plus 528 additional records that do have states associated with them. Note that the plant IDs that don't have states are also overwhelmingly NULL in the entity tables, so if there's any additional merging that's happening on other columns in there maybe that would mess things up?

Are these valid duplicate records that are getting treated strangely? If I list all of the duplicate records in the FRC table that I've pulled directly from the DB, it gives me 3510 of them, and all but one of them has a state, so it doesn't seem to be that.

In the same method chain where date_merge() is happening there's also a dropna(subset=["utility_id_eia"]) -- I bet that's where the problem is. The plants that don't have states are also missing almost everything else, so probably they're missing utility IDs. And then a smattering of the other plants just happen to be missing Utility IDs for some other reason.

Counting the rows in the dataframe before / after that dropna() specifically, that's indeed where the 11,000 records drop out. Whoops.

For the purposes of the FRC table is it absolutely necessary that we have the utility ID? The data is really about plants, fuels, and dates, so it seems like this isn't strictly required. If I remove it what happens...

In that case, we keep all the FRC records, but all of the fields which are brought in by merging with the results of pudl.output.eia860.plants_utils_eia860() are NA (plant name, PUDL ID, etc.). which may cause other unexpected problems downstream, so this probably isn't ideal, and the real issue is with how plants_utils_eia860() is working in cases where there are some missing ID fields, so I think I should make an issue to deal with this... #1700

@zaneselvans commented:

Looking for ways to identify outliers in not-so-normal distributions I came across the modified z-score, which is analogous to the z-score in normal distributions, but instead of looking at how many standard deviations away from the mean a data point is, it looks at how many Median Absolute Deviations (MADs) away from the median it is:

# Note: pandas .mad() computes the MEAN absolute deviation; the modified z-score
# needs the MEDIAN absolute deviation, so compute it explicitly:
mad = (data - data.median()).abs().median()
mod_zscore = abs(0.6745 * (data - data.median()) / mad)

This seems like the kind of measure we want.

In messing around with this I also ended up looking at the median (aka delivery weighted median) vs. (mmbtu) weighted median values of fuel prices in the FRC table, and they're pretty different. I think this is because crazy expensive fuel deliveries tend to be very small -- so they make up a disproportionate number of deliveries (records) compared to how much heat content they represent. The delivery weighted median is $3.25 / MMBTU but the MMBTU weighted median is $2.09 / MMBTU.

A weighted median function that can be used with df.groupby().apply():

import numpy as np
import pandas as pd

def weighted_median(df: pd.DataFrame, data: str, weights: str) -> float:
    """Median of df[data], weighted by df[weights]."""
    df = df.dropna(subset=[data, weights])
    # Sort both columns together, ordered by the data values:
    s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))
    midpoint = 0.5 * np.sum(s_weights)
    if any(df[weights] > midpoint):
        # One record holds more than half the total weight, so it is the weighted median:
        w_median = df[data][df[weights] == np.max(df[weights])].array[0]
    else:
        cs_weights = np.cumsum(s_weights)
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights[idx] == midpoint:
            w_median = np.mean(s_data[idx:idx + 2])
        else:
            w_median = s_data[idx + 1]
    return w_median

Grouping by report_year and fuel_group_eiaepm and calculating the modified z-score within each group, we get a pretty clean normal-ish distribution, with some long straggling tails that I think we can probably cut off.

# Assumes a notebook context where `logger` and `pudl_engine` are already defined.
import pandas as pd
from pudl.metadata.fields import apply_pudl_dtypes  # assumed import path

DATA_COLS = [
    "plant_id_eia",
    "report_date",
    "energy_source_code",
    "fuel_cost_per_mmbtu",
    "fuel_received_units",
    "fuel_mmbtu_per_unit",
]

GB_COLS = ["report_year", "fuel_group_eiaepm"]

MAX_MOD_ZSCORE = 5.0

logger.info("Query PUDL DB")
frc_db = pd.read_sql("fuel_receipts_costs_eia923", pudl_engine, columns=DATA_COLS)
plant_states = pd.read_sql("SELECT plant_id_eia, state FROM plants_entity_eia", pudl_engine)
fuel_group_eiaepm = pd.read_sql("SELECT code AS energy_source_code, fuel_group_eiaepm FROM energy_sources_eia", pudl_engine)

logger.info("Join tables and calculate derived values")
frc_db = (
    frc_db
    .merge(plant_states, on="plant_id_eia", how="left", validate="many_to_one")
    .merge(fuel_group_eiaepm, on="energy_source_code", how="left", validate="many_to_one")
    .assign(
        report_year=lambda x: x.report_date.dt.year,
        fuel_mmbtu_total=lambda x: x.fuel_received_units * x.fuel_mmbtu_per_unit,
    )
    .pipe(apply_pudl_dtypes, group="eia")
)

logger.info("Calculate weighted median fuel price")
weighted_median_fuel_price = (
    frc_db
    .groupby(GB_COLS)
    .apply(weighted_median, data="fuel_cost_per_mmbtu", weights="fuel_mmbtu_total")
)
weighted_median_fuel_price.name = "weighted_median_fuel_price"
weighted_median_fuel_price = weighted_median_fuel_price.to_frame().reset_index()
frc_db = frc_db.merge(weighted_median_fuel_price, how="left", on=GB_COLS, validate="many_to_one")

logger.info("Calculate weighted fuel price MAD")
frc_db["delta"] = abs(frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price)
frc_db["fuel_price_mad"] = frc_db.groupby(GB_COLS)["delta"].transform("median")

logger.info("Calculate modified z-score")
frc_db["fuel_price_mod_zscore"] = (0.6745 * (frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price) / frc_db.fuel_price_mad).abs()

fraction_outliers = sum(frc_db.fuel_price_mod_zscore > MAX_MOD_ZSCORE) / len(frc_db)
logger.info(f"Modified z-score {MAX_MOD_ZSCORE} labels {fraction_outliers:0.2%} of all prices as outliers")

All fuels:

import matplotlib.pyplot as plt

plt.hist(frc_db.fuel_price_mod_zscore, bins=80, range=(0,8), density=True)
plt.title("Modified z-score when grouped by fuel and year");

[image]

Fuel groups

import seaborn as sns

sns.displot(
    data=frc_db[frc_db.fuel_group_eiaepm.isin(["coal", "natural_gas", "petroleum"])],
    kind="hist",
    x="fuel_price_mod_zscore",
    row="fuel_group_eiaepm",
    bins=80,
    binrange=(0,8),
    height=2.5,
    aspect=3,
    stat="density",
    linewidth=0,
);

[image]

@zaneselvans commented Jun 21, 2022

I think I have weighted, modified z-score outlier detection working, but I'm wondering how to apply it appropriately. Two options:

  • Calculate the wmod_zscore for high-level aggregations (like year, fuel group) at the beginning and then let each of the subsequent aggregations fill in the NAs.
  • Calculate the wmod_zscore for each individual aggregation as we go, so that "outliers" are being identified relative to the distribution whose median value would be getting used to fill in missing / outlying values.

The latter sounds more correct, but some of the groupings won't have a large number of records in them.

Also: what's the right wmod_zscore value to use as the cutoff for identifying outliers? At wmod_zscore == 5.0 it identifies about 1.5% of all records as outliers, mostly natural gas. Those outlier records account for only 0.2% of all reported MMBTU.

@zaneselvans commented:

Outliers identified using modified z-score of aggregations, with max_mod_zscore == 5.0:

  • agg_mod_zscore=["report_year", "fuel_group_eiaepm"]: 1.5% of records (0.2% of mmbtu). No empty groups.
  • agg_mod_zscore=["report_year", "fuel_group_eiaepm", "census_region"]: 1.73% of records (0.37% of mmbtu). A few empty groups.
  • agg_mod_zscore=["report_year", "fuel_group_eiaepm", "state"]: 1.89% of records (0.74% of mmbtu). Lots of empty groups.
  • agg_mod_zscore=["report_date", "fuel_group_eiaepm", "census_region"]: 2.18% of records (0.89% of mmbtu). Lots of empty groups.

@TrentonBush commented Jun 22, 2022

"Robust" outlier detection metrics are great! Their downsides are that 1) they can be too aggressive, especially on skewed data and 2) as you've noticed, they are slow because of all the O(n*logn) sorting.

I'm of two minds here. On one hand, it'd be fun and beneficial to tune this model for both accuracy and performance. On the other, we're considering replacing all this with a better model anyway, so maybe we should wait on any tuning until we make that determination. This outlier detection problem probably fits in better with the "improve the imputation model" tasks, because both imputation and outlier detection might use the same model.

Also, outlier detection is a scope increase from the original "remove the EIA API" goals. The very linear plots of state-month averages vs EIA API imply that EIA isn't doing much outlier removal, which means our users have been using data with outliers averaged in the whole time. I think it makes sense to change things once and well rather than possibly twice.

Accuracy improvement ideas

I'd start by seeing if we can un-skew the price distributions before (and maybe instead of) applying the robust metrics. A variance-stabilizing transform like log or Box-Cox might make the data "normal" enough to use regular z-scores.

Second, have we checked what the minimum sample sizes are for some of these bins? As far as I can tell, the existing method calculates a bunch of types of aggregates and iteratively fills in NaN values in order of model precision. Sometimes a key combination like ["Rhode Island", "2008-01-01", "Coal"] might be empty and produce np.nan, which is later filled by a coarser aggregate, but other times it might only have a handful of plants and produce weird metrics as a result. Now that we're using this model for both prediction and outlier detection, it could cause problems in both.

Finally, the model we've built here is basically a decision tree, just manually constructed instead of automated. Problems like choosing the right grain of aggregation or enforcing a minimum sample size per bin are taken care of automatically by those algorithms!

Performance improvement ideas

Medians are always going to be slower, but the line
s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))
jumped out at me because of the use of python iterators on pandas/numpy objects. I think we could probably get the same result faster with something involving df.loc[:, [data, weights]].sort_values(data), df.loc[:, weights].cumsum(), and np.searchsorted
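
Something like this hedged sketch of that vectorized version (the function name is hypothetical; it returns the lower weighted median rather than averaging at an exact midpoint, which should rarely matter with float weights):

import numpy as np
import pandas as pd

def weighted_median_fast(df: pd.DataFrame, data: str, weights: str) -> float:
    """Weighted median via sort_values + cumsum + searchsorted, with no python-level loops."""
    df = df.dropna(subset=[data, weights]).sort_values(data)
    if df.empty:
        return np.nan
    cum_weights = df[weights].cumsum().to_numpy()
    # First index where the cumulative weight reaches half the total weight:
    idx = np.searchsorted(cum_weights, 0.5 * cum_weights[-1])
    return df[data].iloc[idx]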

@zaneselvans commented:

I will admit that

 s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))

was just something I adapted from an example which was operating on lists or numpy arrays, and I didn't completely parse what it is doing to try and re-write it intelligently... like what is going on with the zip inside a zip and unpacking the sorted list. Is it obvious to you what it's doing? I'm not even sure what it means to sort a list of tuples necessarily.

@zaneselvans commented:

The minimum sample size in some of these bins is definitely small, but it'll use the small sample if it can (even if that's a bad idea). Requiring a bigger minimum sample size would result in more frequently resorting to the less specific estimations, but with many more samples.

Should we have just gone straight to using one of the Swiss-army-knife regression models instead, and just started with a modest set of input columns as features, to be expanded / engineered more later? I haven't really worked with them so I don't know how they're finicky or what the pitfalls are.

@TrentonBush commented:

I will admit that was just something I adapted from an example

Let he who has not copied from StackOverflow cast the first stone 😂

I had no clue what sorting tuples did, but I just tried it and it sorts them by their first value. I think that pattern is designed for python objects: when given iterables representing columns, the inner zip links the values by row so that sorted doesn't break the connection between them, then the outer zip puts them back into columnar layout. Mapping np.array then turns each column into a separate array.
But pandas can accomplish the same thing with df.sort_values(<my_column>) and some indexing.
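
A tiny demonstration of the pattern with plain python lists:

data = [3.0, 1.0, 2.0]
weights = [30, 10, 20]
pairs = sorted(zip(data, weights))   # sorts by the first element: [(1.0, 10), (2.0, 20), (3.0, 30)]
s_data, s_weights = zip(*pairs)      # back to columns: (1.0, 2.0, 3.0) and (10, 20, 30)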

Also I just noticed the python any() function, which I think will also resort to python iteration. Using df.any() or np.any() will be much faster (relatively, probably not even 100ms difference at n=500k).

Should we have just gone straight to using one of the Swiss-army-knife regression models instead

No, starting with groupby is much simpler, plus it matched the algorithm we were trying to replace. But once we start adding complexity like conditionally switching between groupbys and checking sample sizes, I think that effort is better spent on a model with a higher ceiling.

@zaneselvans commented Jun 22, 2022

Okay here's a pandas-ified version

import numpy as np
import pandas as pd

def weighted_median(df: pd.DataFrame, data: str, weights: str, dropna=True) -> float:
    if dropna:
        df = df.dropna(subset=[data, weights])
    if df.empty | df[[data, weights]].isna().any(axis=None):
        return np.nan
    df = df.loc[:, [data, weights]].sort_values(data)
    midpoint = 0.5 * df[weights].sum()
    if (df[weights] > midpoint).any():
        w_median = df.loc[df[weights].idxmax(), data]
    else:
        cs_weights = df[weights].cumsum()
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights.iloc[idx] == midpoint:
            w_median = df[data].iloc[idx : idx + 2].mean()
        else:
            w_median = df[data].iloc[idx + 1]
    return w_median

Old version: 4min 26s
New version: 5min 33s 🤣

But they do produce exactly the same output!

@TrentonBush commented:

Moving some bits from Slack to here for posterity.

I tried (notebook) a quick implementation of an XGBoost equivalent (LightGBM). Using the same state, month, fuel features as the groupby model, it gave very similar accuracy. After adding a few new features, it cut relative error roughly in half. Training and inference took around 10 seconds total.

Cross validation shows that the groupby model overfits and performs a bit worse on unseen data. Also, note the difference in counts below. I checked whether the GBDT accuracy was different on the records that the groupby model could predict, but the only significant difference was that other_gas results (now only 9 of them) were much better.

Test set relative error for groupby(['state', 'report_date', 'fuel_group_code']) medians
[image]

Test set results for GBDT. Features: ['state', 'report_date', 'fuel_group_code', 'contract_type_code', 'primary_transportation_mode_code', 'energy_source_code', 'fuel_received_units', 'latitude', 'longitude',]
[image]
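
For reference, a hedged sketch of that kind of GBDT setup (not the exact notebook code; assumes lightgbm is installed, frc has the listed columns, and the derived report_epoch column is my own addition to make the date numeric):

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

CATEGORICAL = [
    "state", "fuel_group_code", "contract_type_code",
    "primary_transportation_mode_code", "energy_source_code",
]
NUMERIC = ["fuel_received_units", "latitude", "longitude"]

df = frc.dropna(subset=["fuel_cost_per_mmbtu"]).copy()
df["report_epoch"] = df.report_date.astype("int64")  # hypothetical numeric date feature
for col in CATEGORICAL:
    df[col] = df[col].astype("category")  # LightGBM handles category dtype natively

X = df[CATEGORICAL + NUMERIC + ["report_epoch"]]
y = df["fuel_cost_per_mmbtu"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)
rel_err = (model.predict(X_test) - y_test) / y_test
print(rel_err.abs().median())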

@zaneselvans commented:

I'm closing this because it's effectively metastasized into #1708
