Sanity check EIA fuel price predictions visually #1720
I looked at all of the per-state distributions. But then I looked at the coal prices, and of course there the opposite was true -- all the predictions are biased high. How can we make the model differentiate more strongly between the fuel types? Would it make sense to train separate models for each fuel type? The long tail of high price excursions, which I think comes entirely from gas and petroleum, also shows up in the coal price distributions.
One thing that seems to be going on is that it predicts coal/natural-gas-like price distributions for petroleum, and brings petroleum-like high price outliers into the coal and gas distributions. There are far fewer records (and MMBTU) representing petroleum, so it will be hard for the model to integrate petroleum-specific pricing information unless it knows that information only applies to the petroleum records.
It seems to have this same issue with or without the weighting by received MMBTU, and it persists even when I narrow the features down to just fuel type, state, report month, and elapsed days. I feel like I must be doing something wrong. Isn't the idea with this kind of regression that it can identify different regimes in one variable (like fuel type) that indicate different ranges of desirable predictions? Would it help to have an anchoring value that can be modulated by other variables -- e.g. a state/regional/national average price for the type of fuel each record contains?
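The anchoring idea could be tried by merging a group-level typical price back in as a feature, so the model only has to learn deviations from it rather than absolute levels. A minimal sketch, assuming the column names used elsewhere in this thread (`add_anchor_price` and `anchor_price` are hypothetical names; in practice the anchor would need to be computed on training data only, to avoid leakage):

```python
import pandas as pd

def add_anchor_price(frc, keys=("state", "energy_source_code")):
    """Attach a per-group median fuel price as an 'anchor' feature that
    other variables can modulate. NOTE: for real use, compute the anchor
    from training rows only to avoid target leakage."""
    anchor = (
        frc.groupby(list(keys), observed=True)["fuel_cost_per_mmbtu"]
        .median()
        .rename("anchor_price")
    )
    return frc.merge(anchor, on=list(keys), how="left")
```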
The numbers are larger than the number of samples because they're weighted by MMBTU. It seems crazy to me that a unit train (~10,000 tons of coal) would get the same prominence as 1 Mcf of natural gas or 1 bbl of diesel fuel oil, which is what the unweighted distribution would show, right? Your scatter plot looks way, way better! Let me re-run with a single model in GridSearchCV and see what the score looks like now. I think the last one I saw was -0.48. What kind of scores are you getting?
The one I have from last week uses these parameters:

```python
param_grid = {
    "hist_gbr__max_depth": [7],
    "hist_gbr__max_leaf_nodes": [2**7],
    "hist_gbr__learning_rate": [0.1],
    "hist_gbr__min_samples_leaf": [25],
}
```

to get a
I was curious how the quality of predictions varies across the different fuels, and it looks like we do well on all the major ones: all types of coal and natural gas, plus DFO/RFO. Not so great on petcoke. But overall this seems great!

```python
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error

def fuel_price_wmape(df, predict_col):
    """MMBTU-weighted MAPE of a predicted price column vs. the reported price."""
    return mean_absolute_percentage_error(
        df["fuel_cost_per_mmbtu"],
        df[predict_col],
        sample_weight=df["fuel_received_mmbtu"],
    )

def err_by_fuel(frc):
    """Weighted MAPE of the weighted-median and HGBR predictions, by fuel."""
    fuel_cols = ["energy_source_code", "fuel_group_eiaepm"]
    out_df = pd.DataFrame()
    for col in fuel_cols:
        valid_rows = (
            frc["fuel_cost_per_mmbtu"].notna()
            & frc["fuel_received_mmbtu"].notna()
            & frc[col].notna()
        )
        gb = frc.loc[valid_rows].groupby(col, observed=True)
        wm_err = gb.apply(fuel_price_wmape, predict_col="fuel_cost_per_mmbtu_wm").to_frame(name="wm_wmape")
        hgbr_err = gb.apply(fuel_price_wmape, predict_col="fuel_cost_per_mmbtu_predicted").to_frame(name="hgbr_wmape")
        total_mmbtu = gb["fuel_received_mmbtu"].sum().to_frame(name="total_mmbtu")
        tmp = pd.concat([wm_err, hgbr_err, total_mmbtu], axis="columns")
        out_df = pd.concat([out_df, tmp])
    return out_df.sort_values("total_mmbtu", ascending=False)

err_by_fuel(frc_predicted)
```
Looking at just the fuels with more than 1e8 MMBTU reported (SUB through DFO):

```python
big_fuels = ["SUB", "BIT", "NG", "LIG", "PC", "RFO", "DFO"]
valid_rows = (
    frc_predicted["fuel_cost_per_mmbtu"].notna()
    & frc_predicted["fuel_received_mmbtu"].notna()
    & frc_predicted["fuel_cost_per_mmbtu_predicted"].notna()
)
# Combine the masks before indexing so the boolean indexer stays aligned
# with the frame it's applied to.
fuel_price_wmape(
    frc_predicted.loc[valid_rows & frc_predicted.energy_source_code.isin(big_fuels)],
    predict_col="fuel_cost_per_mmbtu_predicted",
)
# Result: 0.06948276509771906
```

Within the coal category, we seem to do equally well on all the coal types, and better than the weighted medians. With petroleum and some other fuels, the HGBR has higher relative error. In the weighted median approach, a number of fuel deliveries have prices exactly identical to the weighted median (any time there's only a single delivery of a given fuel in a given state and month) -- e.g. the perfect 1:1 lines in the middle of these plots.

[Figure: Weighted Medians vs. Hist Gradient Boosted Regression]
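The single-delivery artifact mentioned above falls directly out of how a grouped weighted median works: with one record in a group, the cumulative weight reaches half the total at that record, so the baseline reproduces the record's own price exactly. A sketch of that baseline (the function and column names here are illustrative, not the project's actual implementation):

```python
import numpy as np
import pandas as pd

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    order = np.argsort(values)
    v = np.asarray(values)[order]
    w = np.asarray(weights)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def wm_baseline(frc):
    """Per state/month/fuel weighted-median price. Any group containing a
    single delivery reproduces its own price exactly -- the source of the
    perfect 1:1 points in the comparison plots."""
    return (
        frc.groupby(["state", "report_date", "energy_source_code"], observed=True)
        .apply(lambda g: weighted_median(
            g["fuel_cost_per_mmbtu"], g["fuel_received_mmbtu"]))
        .rename("fuel_cost_per_mmbtu_wm")
    )
```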
See continued work in #1767
The model error metric gives an extremely condensed (zero-dimensional) view of whether or not we're predicting fuel prices correctly. Develop some additional visualizations that let us see whether things look right, or at least good enough.
Compare by state & fuel type:
Visualizations
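One visualization that fits this goal is a predicted-vs-actual scatter with a 1:1 reference line, sized by MMBTU so big deliveries stand out. A sketch under assumptions (`pred_vs_actual` is a hypothetical helper, not existing project code):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

def pred_vs_actual(ax, actual, predicted, mmbtu, title=""):
    """Log-log scatter of predicted vs. actual $/MMBTU, point size scaled
    by MMBTU received, with a dashed 1:1 reference line."""
    ax.scatter(actual, predicted, s=np.sqrt(mmbtu) / 10.0, alpha=0.3)
    lo = min(actual.min(), predicted.min())
    hi = max(actual.max(), predicted.max())
    ax.plot([lo, hi], [lo, hi], "r--", lw=1)  # 1:1 line
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("actual $/MMBTU")
    ax.set_ylabel("predicted $/MMBTU")
    ax.set_title(title)
```

Faceting this by state or fuel type (one axes per group) would make the per-regime biases discussed above directly visible.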