# Estimate monthly plant-level fuel prices w/o using the EIA API #1343
Hey @katherinelamb, I'd like to talk to you about how we might use some ML / regressions to fill in these missing fuel costs, if that's an issue you might like to take on.
This definitely sounds interesting and doable. Potential data for features:
It's probably best to start by trying to reduce the problem to just be either spatial or temporal and then try some regressions and common modeling frameworks. It seems harder but still doable to model both spatially and temporally.
While we do have the lat/lon location for all the plants, for the "spatial" dimension I was imagining something like looking at the prices reported in adjacent states if they're available, since I imagine that there are generally regional price trends. I think this is what Aranya / Greg did. E.g. if a given state-month-fuel combination is missing a price, fill it in with the average of the reported price for adjacent states for that fuel and month, and then that filled-in state-month-fuel table could be used to estimate the cost per unit of fuel for any FRC records that are missing it (sketched below).

I guess in an ideal world we'd be able to make predictions / estimations for every record with a missing value, given all of the information available in a very generic way, but that seems like asking a lot. There would be a lot of categorical variables (state, buyer, seller, fuel type, spot market vs. contract purchase) and numerical values (quantity of fuel purchased, heat content per unit of fuel, cost of fuel for that month in all adjacent states, or the region, or the country as a whole). And really you could bring in any number of utility, plant, and generator attributes if they were helpful.
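A minimal sketch of that adjacent-state fill, assuming a hypothetical `ADJACENT_STATES` mapping and a pre-aggregated table with one row per (state, report_date, fuel) and a possibly-NaN `price` column:

```python
import numpy as np
import pandas as pd

# Hypothetical adjacency mapping, shown only partially:
ADJACENT_STATES = {
    "NY": ["CT", "MA", "NJ", "PA", "VT"],
    "NJ": ["DE", "NY", "PA"],
    # ... and so on for the other states
}

def fill_from_neighbors(smf: pd.DataFrame) -> pd.DataFrame:
    """Fill missing state-month-fuel prices with the mean of neighboring states' prices."""
    lookup = smf.set_index(["state", "report_date", "fuel"])["price"]

    def neighbor_mean(row: pd.Series) -> float:
        if pd.notna(row["price"]):
            return row["price"]  # keep reported prices as-is
        neighbors = ADJACENT_STATES.get(row["state"], [])
        vals = [lookup.get((n, row["report_date"], row["fuel"]), np.nan) for n in neighbors]
        return np.nanmean(vals) if vals else np.nan

    out = smf.copy()
    out["price"] = out.apply(neighbor_mean, axis=1)
    return out
```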
I'm curious what @cmgosnell thinks about the simpler options above as an interim change.
For context on the original choice to use the larger categories: iirc we went with the `fuel_group_code` aggregation originally. I agree generally that we can and probably should use the more granular `energy_source_code`.
Do you think it makes sense to swap in our own average fuel prices based on the reported FRC data?

Another thought: does it make more sense to try and estimate the price per unit or the price per MMBTU? (where unit is ton, barrel, or mcf) I would think that within these big categories there would be less dispersion on a $/MMBTU basis, since what you're really buying is heat content, not mass (i.e. BIT costs much more than LIG per ton, but they're closer to each other per MMBTU). I think we have the option of doing either one.
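For example, with illustrative round numbers (not actual prices or heat contents):

```python
# Hypothetical values, per short ton:
bit_price, bit_mmbtu_per_ton = 50.0, 24.0  # bituminous
lig_price, lig_mmbtu_per_ton = 20.0, 13.0  # lignite

print(bit_price / bit_mmbtu_per_ton)  # ~2.08 $/MMBTU
print(lig_price / lig_mmbtu_per_ton)  # ~1.54 $/MMBTU
# 2.5x apart per ton, but only ~1.4x apart per MMBTU
```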
I definitely think using `energy_source_code` makes sense. I am less familiar with the unit vs. btu distinction, but I think you are on to something! If all the info is there (which I believe it is in the frc table), then yeah, go for it... but again I'd check the dispersion first. I would almost use state-level averages.
Okay, I'll try the simplistic (state, month, fuel) option first.

Beyond that, rather than immediately jumping to our own bespoke estimator based on adjacent states, I'd like to try a few generic regression options from scikit-learn.

I guess we can also easily check whether the price per physical unit or the price per MMBTU is more consistent just by plotting the distributions of observed prices in the dataset per date / state / fuel. It seems like whichever variable has less dispersion would be the easier one to estimate accurately. But I'm curious what @katie-lamb and @TrentonBush think with their stats background.
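One low-effort version of that dispersion check (a sketch; `fuel_cost_per_unit` is a derived column, not something already in the table):

```python
import pandas as pd

def relative_spread(frc: pd.DataFrame, price_col: str) -> pd.Series:
    # Normalized spread (std / median) within each state-month-fuel group:
    grouped = frc.groupby(["state", "report_date", "energy_source_code"])[price_col]
    return grouped.std() / grouped.median()

# Derive the per-physical-unit price from the per-MMBTU price and heat content:
frc["fuel_cost_per_unit"] = frc.fuel_cost_per_mmbtu * frc.fuel_mmbtu_per_unit
print(relative_spread(frc, "fuel_cost_per_mmbtu").median())
print(relative_spread(frc, "fuel_cost_per_unit").median())
```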
Maybe @priyald17 or @joshdr83 would have suggestions on how best to approach this too. |
For context, here's the fuel_receipts_costs_eia923 table on Datasette, and the field we're trying to estimate is `fuel_cost_per_mmbtu`.
To clarify the problem:
Is that right? I see two separate parts to this:
Relatedly, the first few scoping questions I'd ask are:
Ideally we test the validity, but remember we do have the option to just "assume and caveat" if the stakes are low. Plus, if EIA has done their job, there won't be a good way to test it. With respect to modelling, I'd let the "what are the stakes" question guide the degree of complexity (simple averages vs fancier models). I'm not familiar with how redaction works (are the same plants always redacted? Does it come and go? etc), but in principle I think we have everything we need to construct a well formulated ML solution, if that is warranted.
I wrote down these questions that are pretty similar to what Trenton asks above:
If the redacted data is kind of "one off" and there is still enough representative data, then doing a regression to estimate prices for each date, state, and fuel type could definitely work. But I also agree that we have all the tools and data needed to do something more comprehensive if it seems that the missing data can't be represented that simplistically. Although I think I got a little lost at some point about the construction of the aggregation tables.
The distribution of missing values is skewed towards the early years (2008-2011) before filling with the API; after filling with the API it's pretty uniform across all the years. (histograms omitted)
What do you think would be a good way to evaluate the dispersion of fuel prices? It seems like there are at least 3 dimensions to consider in sampling and estimating fuel prices:

- space: plant, state, or census region
- time: month or year
- fuel granularity: `energy_source_code` or `fuel_group_code`
Using the more granular options (plants, months, `energy_source_code`):

(figure omitted: price distribution by state)
Rather than just counting up individual fuel deliveries, would it make more sense to weight them by the total heat content of the fuel delivered?
The following is ONLY for gas prices (filtering not shown)
I'm sure we could tighten that much further with a better model. I'm just not sure what our goals should be here without an application in mind. Maybe a reasonable next step would be to throw a low-effort, high-power model like XGBoost at it and see what kind of accuracy is possible.

```python
import numpy as np
import pandas as pd

# frc is the fuel_receipts_costs_eia923 dataframe.
# NOTE: I didn't bother with any train-test split here, so these error
# values should be considered best-case indicators.
frc['global_median_fuel_price'] = frc['fuel_cost_per_mmbtu'].median()
frc['month_median_fuel_price'] = frc.groupby('report_date')['fuel_cost_per_mmbtu'].transform(np.median)
frc['state_month_median_fuel_price'] = frc.groupby(['state', 'report_date'])['fuel_cost_per_mmbtu'].transform(np.median)
error = frc.loc[:, [
    'global_median_fuel_price',
    'month_median_fuel_price',
    'state_month_median_fuel_price',
]].sub(frc['fuel_cost_per_mmbtu'], axis=0)
pct_error = error.div(frc['fuel_cost_per_mmbtu'], axis=0)
```
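A quick way to summarize how each baseline does (a follow-up sketch, not from the original snippet):

```python
# Median absolute percent error for each baseline price column:
print(pct_error.abs().median())
```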
To first address the goal of removing the EIA API calls from the CI asap: do you have an accuracy metric on how well the most granular option (plants, months, `energy_source_code`) performs against the API results? Or how swapping in state level instead of plant affects accuracy? I think for now you could just sub in the median of whatever combo of the three dimensions gets the highest API accuracy. It seems like it might be state, month, energy code. Although @zaneselvans mentioned there are states with no data (I think you actually said a state-level average is not available), so does this necessitate multi-state aggregations? After this quick fix, I think looking at what other features we could bring into a more powerful model would be useful.
Are the API results our target, or are the redacted values our target? I thought the API was basically just an outsourced imputation model. If it includes redacted data in its aggregates then it has value for validation. Is there anything more to the EIA API?
I believe you're right and the API is just modeling these values. After talking to Zane, I think to get a quick fix for today/tomorrow the API results are our target, since people currently using the table don't want the values to change significantly. But to develop a more accurate and comprehensive solution in the less short term, the redacted values are indeed the target.
Oh, I didn't know there was a timeline on this at all, much less tomorrow! Do what you gotta do.
LOL ya, I don't think this discussion is really framed around a timeline -- it's more long-term ideating about modeling tactics. My comment was really just spurred by annoyance that the CI is unreliable, and wanting something to stand in there while a better solution is developed.
@TrentonBush The earlier scatter plots are comparing all the reported data -- so the ones where there actually was data in the FRC table, and they're only being aggregated by state, month, and fuel.

The more recent scatter plots are only looking at data points that were not present in the FRC table, and comparing the values which were filled in by our new method (breaking it out into all the different kinds of aggregation used) vs. the API values. So it's not surprising that the correlation is worse in general.
Oh ok, now if I squint I can see how they are related -- basically remove the line and just look at the dispersion, plus add some x-coordinate noise to account for mean to median. Ya this model makes sense to me. I think using the median is justified given the handful of big outliers I ran into.
That's an awesome use case I'd love to learn more about! There is certainly more accuracy to wring out of this data, and that kind of anomaly detection would probably require it. Maybe we find a day or two next sprint to try other models.
Yeah, there are some wild outliers. I want to figure out how to remove them and fill them in with reasonable values. Some of them are out of bounds by factors of like 1000x -- natural gas is particularly messy. The mean values were quite skewed. Given that the values aren't really normally distributed, would it still make sense to replace anything more than 3-4 $\sigma$ away from the mean? I could drop the top and bottom 0.1% of the distribution too, but there aren't always big outliers, and sometimes more than 0.1% of the points are bad. Not sure what the right thing to do is.
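For comparison, a per-group quantile trim (a sketch; the 0.5% / 99.5% bounds are arbitrary placeholders, and the grouping columns are assumptions):

```python
import pandas as pd

def trim_outliers(frc: pd.DataFrame, lo: float = 0.005, hi: float = 0.995) -> pd.Series:
    grouped = frc.groupby(["energy_source_code", "report_date"])["fuel_cost_per_mmbtu"]
    lower = grouped.transform(lambda s: s.quantile(lo))
    upper = grouped.transform(lambda s: s.quantile(hi))
    price = frc["fuel_cost_per_mmbtu"]
    # NaN out anything beyond the per-group quantile bounds:
    return price.mask((price < lower) | (price > upper))
```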
I'm doing some additional normalization to the table, moving `fuel_group_code` over to the `energy_sources_eia` coding table. Each `energy_source_code` maps to exactly one fuel group:

```python
assert (frc.groupby("energy_source_code").fuel_group_code.nunique() == 1).all()
dict(frc.groupby("energy_source_code").fuel_group_code.first())
```

```
{'BFG': 'other_gas',
 'BIT': 'coal',
 'DFO': 'petroleum',
 'JF': 'petroleum',
 'KER': 'petroleum',
 'LIG': 'coal',
 'NG': 'natural_gas',
 'OG': 'other_gas',
 'PC': 'petroleum_coke',
 'PG': 'other_gas',
 'RFO': 'petroleum',
 'SC': 'coal',
 'SGP': 'other_gas',
 'SUB': 'coal',
 'WC': 'coal',
 'WO': 'petroleum'}
```

I'm also renaming the column to `fuel_group_eiaepm`.
This is basically the RMI ReFi project! The fuel is by far the biggest component of the OpEx for coal plants, and this data has been used to identify which plants are vulnerable to getting refinanced out of existence.
Okay, definitely some room for improvement. E.g. look at how coal prices in NY drop dramatically (in the filled-in data) starting in 2012 (figure omitted):

```python
from typing import Literal

import pandas as pd
import seaborn as sns


def facet_fuel(
    frc: pd.DataFrame,
    states: list[str],
    fuel_facet: Literal["energy_source_code", "fuel_group_eiaepm", "fuel_type_code_pudl"],
    by: Literal["fuel", "state"] = "state",
    fuels: list[str] | None = None,
    max_price: float = 100.0,
    facet_kws: dict | None = None,
) -> None:
    mask = (
        (frc.state.isin(states))
        & (frc.fuel_cost_per_mmbtu <= max_price)
        & (frc[fuel_facet].isin(fuels) if fuels else True)
    )
    facet = fuel_facet if by == "state" else "state"
    sns.relplot(
        data=frc[mask],
        x="report_date",
        y="fuel_cost_per_mmbtu",
        hue=facet,
        style=facet,
        row=fuel_facet if by == "fuel" else "state",
        kind="line",
        height=4,
        aspect=2.5,
        facet_kws=facet_kws,
    )


facet_fuel(
    frc,
    fuel_facet="fuel_group_eiaepm",
    states=["CA", "NY", "FL", "TX"],
    fuels=["coal", "natural_gas", "petroleum"],
    by="fuel",
    facet_kws={"sharey": False},
)
```
I ran the data validations and got some unexpected row counts.
It's not clear off the top of my head why the changes in this PR would result in small changes to the number of rows in these other tables though. Does that make sense to anyone else? |
Comparing old expectations, new expectations, and the rows we've got here... (table omitted)
More weirdness. The number of FRC records filled with each of the different aggregations seems a little different now than it was when I was originally developing the process in a notebook.
The lack of states is surprising, but it seems like that's just how it is. The loss of a bunch of records in the `frc_eia923` output is more concerning.
After running the ETL, I updated the row counts and merged that PR into dev, so that should fix some of your issues @zaneselvans
Ah okay so some of the row counts in your branch weren't actually the expected ones? |
@katie-lamb any intuition as to why `merge_date()` is removing ~10k `frc_eia923` records?
Comparing the records that exist before and after `merge_date()`: are these valid duplicate records that are getting treated strangely? If I list all of the duplicate records in the FRC table that I've pulled directly from the DB, it gives me 3510 of them, and all but one of them has a state, so it doesn't seem to be that.

In the same method chain where `merge_date()` is applied, counting the rows in the dataframe before / after that merge confirms it's where the records are going missing.

For the purposes of the FRC table, is it absolutely necessary that we have the utility ID? The data is really about plants, fuels, and dates, so it seems like this isn't strictly required. If I remove it, what happens... In that case, we keep all the FRC records, but lose all of the fields which are brought in by merging with the results of that step.
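For reference, the kind of checks described above look roughly like this (a sketch; `utils` is a placeholder for the dataframe being merged in, and the merge key is an assumption):

```python
# List whole-row duplicates in the FRC table and check how many lack a state:
dupes = frc[frc.duplicated(keep=False)]
print(len(dupes), dupes["state"].isna().sum())

# Count rows lost across the suspect merge:
before = len(frc)
after = len(frc.merge(utils, on="utility_id_eia", how="inner"))
print(f"{before - after} records lost in the merge")
```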
Looking for ways to identify outliers in not-so-normal distributions, I came across the modified z-score, which is analogous to the z-score for normal distributions, but instead of looking at how many standard deviations away from the mean a data point is, it looks at how many Median Absolute Deviations (MADs) away from the median it is:

```python
mod_zscore = abs(0.6745 * (data - data.median()) / data.mad())
```

This seems like the kind of measure we want. In messing around with this, I also ended up looking at the median (aka delivery-weighted median) vs. the MMBTU-weighted median values of fuel prices in the FRC table, and they're pretty different. I think this is because crazy expensive fuel deliveries tend to be very small -- so they make up a disproportionate number of deliveries (records) compared to how much heat content they represent. The delivery-weighted median is $3.25/MMBTU but the MMBTU-weighted median is $2.09/MMBTU.

A weighted median function that can be used with `df.groupby().apply()`:

```python
import numpy as np
import pandas as pd


def weighted_median(df: pd.DataFrame, data: str, weights: str) -> float:
    df = df.dropna(subset=[data, weights])
    s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))
    midpoint = 0.5 * np.sum(s_weights)
    if any(df[weights] > midpoint):
        # One delivery carries more than half the total weight:
        w_median = df[data][df[weights] == df[weights].max()].array[0]
    else:
        cs_weights = np.cumsum(s_weights)
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights[idx] == midpoint:
            w_median = np.mean(s_data[idx:idx + 2])
        else:
            w_median = s_data[idx + 1]
    return w_median
```

Grouping by `report_year` and `fuel_group_eiaepm`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# pudl_engine, logger, and apply_pudl_dtypes are assumed to be set up already.
DATA_COLS = [
"plant_id_eia",
"report_date",
"energy_source_code",
"fuel_cost_per_mmbtu",
"fuel_received_units",
"fuel_mmbtu_per_unit",
]
GB_COLS = ["report_year", "fuel_group_eiaepm"]
MAX_MOD_ZSCORE = 5.0
logger.info("Query PUDL DB")
frc_db = pd.read_sql("fuel_receipts_costs_eia923", pudl_engine, columns=DATA_COLS)
plant_states = pd.read_sql("SELECT plant_id_eia, state FROM plants_entity_eia", pudl_engine)
fuel_group_eiaepm = pd.read_sql("SELECT code AS energy_source_code, fuel_group_eiaepm FROM energy_sources_eia", pudl_engine)
logger.info("Join tables and calculate derived values")
frc_db = (
frc_db
.merge(plant_states, on="plant_id_eia", how="left", validate="many_to_one")
.merge(fuel_group_eiaepm, on="energy_source_code", how="left", validate="many_to_one")
.assign(
report_year=lambda x: x.report_date.dt.year,
fuel_mmbtu_total=lambda x: x.fuel_received_units * x.fuel_mmbtu_per_unit,
)
.pipe(apply_pudl_dtypes, group="eia")
)
logger.info("Calculate weighted median fuel price")
weighted_median_fuel_price = (
frc_db
.groupby(GB_COLS)
.apply(weighted_median, data="fuel_cost_per_mmbtu", weights="fuel_mmbtu_total")
)
weighted_median_fuel_price.name = "weighted_median_fuel_price"
weighted_median_fuel_price = weighted_median_fuel_price.to_frame().reset_index()
frc_db = frc_db.merge(weighted_median_fuel_price, how="left", on=GB_COLS, validate="many_to_one")
logger.info("Calculate weighted fuel price MAD")
frc_db["delta"] = abs(frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price)
frc_db["fuel_price_mad"] = frc_db.groupby(GB_COLS)["delta"].transform("median")
logger.info("Calculate modified z-score")
frc_db["fuel_price_mod_zscore"] = (0.6745 * (frc_db.fuel_cost_per_mmbtu - frc_db.weighted_median_fuel_price) / frc_db.fuel_price_mad).abs()
fraction_outliers = sum(frc_db.fuel_price_mod_zscore > MAX_MOD_ZSCORE) / len(frc_db)
logger.info(f"Modified z-score {MAX_MOD_ZSCORE} labels {fraction_outliers:0.2%} of all prices as outliers")
```

All fuels:

```python
plt.hist(frc_db.fuel_price_mod_zscore, bins=80, range=(0, 8), density=True)
plt.title("Modified z-score when grouped by fuel and year")
```

Fuel groups:

```python
sns.displot(
data=frc_db[frc_db.fuel_group_eiaepm.isin(["coal", "natural_gas", "petroleum"])],
kind="hist",
x="fuel_price_mod_zscore",
row="fuel_group_eiaepm",
bins=80,
binrange=(0,8),
height=2.5,
aspect=3,
stat="density",
linewidth=0,
)
```
I think I have weighted, modified z-score outlier detection working, but am wondering how to apply it appropriately: across the whole dataset at once, or separately within each of the aggregation groups?

The latter sounds more correct, but some of the groupings won't have a large number of records in them. Also: what's the right `MAX_MOD_ZSCORE` cutoff to use? (see the quick sweep below)
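One empirical way to feel out that cutoff (my own follow-up sketch, using `frc_db` and `fuel_price_mod_zscore` from the snippet above):

```python
# How sensitive is the flagged-outlier fraction to the cutoff?
for cutoff in [3.5, 5.0, 7.5, 10.0]:
    frac = (frc_db["fuel_price_mod_zscore"] > cutoff).mean()
    print(f"mod z-score > {cutoff}: {frac:0.2%} of deliveries flagged")
```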
Outliers identified using the modified z-score of the aggregations, with `MAX_MOD_ZSCORE = 5.0`:
(plots omitted)

"Robust" outlier detection metrics are great! Their downsides are that 1) they can be too aggressive, especially on skewed data, and 2) as you've noticed, they are slow because of all the O(n log n) sorting.

I'm of two minds here. On one hand, it'd be fun and beneficial to tune this model for both accuracy and performance. On the other, we're considering replacing all this with a better model anyway, so maybe we should wait on any tuning until we make that determination. This outlier detection problem probably fits in better with the "improve the imputation model" tasks, because both imputation and outlier detection might use the same model.

### Accuracy improvement ideas

I'd start by seeing if we can un-skew the price distributions before (and maybe instead of) applying the robust metrics. A variance-stabilizing transform like log or Box-Cox might make the data "normal" enough to use regular z-scores.

Second, have we checked what the minimum sample sizes are for some of these bins? As far as I can tell, the existing method calculates a bunch of types of aggregates and iteratively fills in NaN values in order of model precision. Sometimes a key combination will have only a handful of samples behind it.

Finally, the model we've built here is basically a decision tree, just manually constructed instead of automated. Problems like choosing the right grain of aggregation or enforcing a minimum sample size per bin are taken care of automatically by those algorithms!

### Performance improvement ideas

Medians are always going to be slower, but the line `s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))` looks like a prime candidate for a rewrite.
I will admit that `s_data, s_weights = map(np.array, zip(*sorted(zip(df[data], df[weights]))))` was just something I adapted from an example which was operating on lists or numpy arrays, and I didn't completely parse what it is doing to try and re-write it intelligently... like what is going on with the zip inside a zip, and unpacking the sorted list? Is it obvious to you what it's doing? I'm not even sure what it means to sort a list of tuples, necessarily.
The minimum sample size in some of these bins is definitely small, but it'll use the small sample if it can (even if that's a bad idea). Requiring a bigger minimum sample size would result in more frequently resorting to the less specific estimations, but with many more samples. Should we have just gone straight to using one of the Swiss-army-knife regression models instead, and just started with a modest set of input columns as features, to be expanded / engineered more later? I haven't really worked with them, so I don't know where they're finicky or what the pitfalls are.
Let he who has not copied from StackOverflow cast the first stone 😂

I had no clue what sorting tuples did, but I just tried it and it sorts them by their first value. I think that pattern is designed for python objects: when given iterables representing columns, the inner `zip` joins them into rows, `sorted` orders the rows, and the outer `zip(*...)` transposes them back into columns. Also, I just noticed the python...
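Spelled out on a toy example (my own illustration, using plain lists):

```python
data, weights = [3.0, 1.0, 2.0], [30, 10, 20]
pairs = sorted(zip(data, weights))  # [(1.0, 10), (2.0, 20), (3.0, 30)]: rows sorted by data
columns = zip(*pairs)               # transpose the sorted rows back into two columns
s_data, s_weights = map(list, columns)
print(s_data, s_weights)            # [1.0, 2.0, 3.0] [10, 20, 30]
```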
No, starting with groupby is much simpler, plus it matched the algorithm we were trying to replace. But once we start adding complexity like conditionally switching between groupbys and checking sample sizes, I think that effort is better spent on a model with a higher ceiling.
Okay, here's a pandas-ified version:

```python
def weighted_median(df: pd.DataFrame, data: str, weights: str, dropna=True) -> float:
    if dropna:
        df = df.dropna(subset=[data, weights])
    if df.empty | df[[data, weights]].isna().any(axis=None):
        return np.nan
    df = df.loc[:, [data, weights]].sort_values(data)
    midpoint = 0.5 * df[weights].sum()
    if (df[weights] > midpoint).any():
        w_median = df.loc[df[weights].idxmax(), data]
    else:
        cs_weights = df[weights].cumsum()
        idx = np.where(cs_weights <= midpoint)[0][-1]
        if cs_weights.iloc[idx] == midpoint:
            w_median = df[data].iloc[idx : idx + 2].mean()
        else:
            w_median = df[data].iloc[idx + 1]
    return w_median
```

Old version: 4min 26s. But they do produce exactly the same output!
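If it ever needed to be faster still, a fully vectorized numpy version is possible (a sketch; it returns the lower weighted median rather than averaging exact ties like the versions above):

```python
import numpy as np

def weighted_median_np(values: np.ndarray, weights: np.ndarray) -> float:
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # First element where the cumulative weight reaches half the total weight:
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return float(v[idx])
```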
Moving some bits from Slack to here for posterity. I tried (notebook) a quick implementation of an XGBoost equivalent (lightGBM). Using the same state, month, fuel features as the groupby model, it gave very similar accuracy. After adding a few new features, it cut relative error in about half. Training/inference time were around 10 seconds total.

Test set relative error for the groupby model: (table omitted)

Test set results for GBDT: (table and feature list omitted)
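For the record, the rough shape of that lightGBM experiment (a sketch under assumptions: the real notebook's feature list, parameters, and error metric aren't reproduced here):

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X = frc[["state", "energy_source_code", "report_date"]].copy()
for col in ["state", "energy_source_code"]:
    X[col] = X[col].astype("category")  # lightGBM handles categoricals natively
X["report_date"] = X["report_date"].astype("int64")  # datetimes as ordinal ints
y = frc["fuel_cost_per_mmbtu"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = lgb.LGBMRegressor(n_estimators=500).fit(X_train, y_train)
rel_err = ((model.predict(X_test) - y_test) / y_test).abs().median()
print(f"median relative error: {rel_err:0.2%}")
```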
I'm closing this because it's effectively metastasized into #1708 |
Instead of using the EIA API to pull monthly average fuel costs by state and fuel when individual fuel deliveries have their costs redacted in the `fuel_receipts_costs_eia923` table, calculate it for ourselves.

### Motivation

This change will address several issues. We have a lot of context available in the `fuel_receipts_costs_eia923` table, and related to the plants and mines and suppliers involved. It should be possible to do a fairly good estimation of the fuel prices from scratch given all that context.

### Approach

#### Choosing Aggregations

Intuitive aggregation priorities, filled in iteratively from most to least granular (a sketch of this cascade follows the list):

1. `["state", "energy_source_code", "report_date"]`
2. `["state", "energy_source_code", "report_year"]`
3. `["census_region", "energy_source_code", "report_date"]`
4. `["state", "fuel_group_code", "report_date"]`
5. `["census_region", "fuel_group_code", "report_date"]`
6. `["census_region", "fuel_group_code", "report_year"]`
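A minimal sketch of that iterative fill (assuming `frc` already carries `census_region`, `report_year`, and fuel group columns):

```python
AGG_PRIORITY = [
    ["state", "energy_source_code", "report_date"],
    ["state", "energy_source_code", "report_year"],
    ["census_region", "energy_source_code", "report_date"],
    ["state", "fuel_group_code", "report_date"],
    ["census_region", "fuel_group_code", "report_date"],
    ["census_region", "fuel_group_code", "report_year"],
]

def fill_fuel_prices(frc):
    filled = frc["fuel_cost_per_mmbtu"].copy()
    for cols in AGG_PRIORITY:
        # Median price within this aggregation, aligned back to each delivery:
        agg = frc.groupby(cols)["fuel_cost_per_mmbtu"].transform("median")
        filled = filled.fillna(agg)  # only fills what's still missing
    return filled
```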
#### Questions

#### Other Potential Refinements

### Remaining tasks

- [ ] Merge `plant_state` into the `fuel_receipts_costs_eia923` output table all the time.
- [ ] Calculate our own filled-in fuel prices in the `fuel_receipts_costs_eia923` output routine.
- [ ] Remove the EIA API calls from `frc_eia923`.
- [ ] Remove the `API_KEY_EIA` infrastructure from everywhere in the code, so we aren't unknowingly relying on it.
- [ ] Fix the `filled_by` labeling, which is now showing all filled values having `national_fgc_year`, which is the last aggregation.
- [ ] Remove `fuel_group_code` from the `fuel_receipts_costs_eia923` table, add it to the `energy_sources_eia` coding table, and add it back into the output function.
- [ ] Figure out why `merge_date()` is removing ~10k `frc_eia923` records.
- [ ] Once merged into `main`, remove `API_KEY_EIA` from the GitHub secrets.