Remove external EIA & Zenodo API calls from CI #1696
Conversation
Implement a method of estimating fuel prices which are missing from the fuel_receipts_costs_eia923 table, using aggregations across space (state, census region, or nationally), across time (annual), and across fuel categories (energy_source_code vs. fuel_type_code). Incrementally apply these methods in order of decreasing precision, until all missing values in the original table have been filled. In this commit the code for using the EIA API and the new method are both still present, enabling direct comparison between them in the `fuel_receipts_costs_eia923` table. See GitHub issue #1343 for more extensive discussion of the issues and potential refinements.
Removed EIA API infrastructure:
* Several functions in the `pudl.output.eia923` module.
* Environment variables passed into the test environment via the tox-pytest GitHub workflow.
* Environment variables passed through via tox.ini.
* Output tests pudl_out fixtures.
* Switch to looking at `filled_by` rather than `fuel_cost_from_eiaapi` in MCOE calcs.

Also:
* Remove the rolling average option from the frc_eia923 outputs.
* Switch to filling in the fuel prices by default in `PudlTabl.frc_eia923()` and `pudl.output.eia923.fuel_receipts_costs_eia923`.
* Add a `PudlTabl.debug` boolean attribute that can be used to preserve intermediate columns in the output dataframes (like the aggregation and error columns for the fuel price estimates).
The fuel_receipts_costs_eia923 table reports an energy group, which must be one of coal, petroleum, natural_gas, petroleum_coke, or other_gas. This is the category that the fuels show up under in the EIA's Electric Power Monthly (EPM). It's entirely determined by the value of `energy_source_code`, and so can be stored as one of the columns in the `energy_sources_eia` coding table. I've renamed it `fuel_group_eiaepm` to give a better indication of where it comes from, as there are a number of other very generic energy categorizations in that table already.
Codecov Report
@@ Coverage Diff @@
## dev #1696 +/- ##
=======================================
- Coverage 83.5% 83.5% -0.1%
=======================================
Files 65 66 +1
Lines 7310 7334 +24
=======================================
+ Hits 6106 6125 +19
- Misses 1204 1209 +5
Continue to review full report at Codecov.
* Use a weighted median value within each aggregation when filling fuel prices in the FRC table.
* Also use the weighted median when identifying outlying fuel price values that should be replaced, in the calculation of their modified z-score.
* Add some unit tests for the weighted_median function.
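For context, a weighted median like the one described above can be sketched as follows. This is an illustrative implementation, not necessarily identical to the one added in this PR: it returns the first value at which the cumulative weight reaches half the total weight.

```python
import numpy as np

def weighted_median(values, weights):
    """Return the value at which the cumulative weight first reaches
    half of the total weight (a simple weighted median)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    # First index where the cumulative weight covers half the total weight:
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return values[idx]

# With equal weights this reduces to (roughly) the ordinary median:
print(weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))  # -> 3.0
# A heavily weighted observation pulls the median toward itself:
print(weighted_median([1, 2, 100], [1, 1, 10]))  # -> 100.0
```

In the FRC context, the weights would be something like fuel delivery volumes, so that a huge delivery counts for more than a tiny one when computing a "typical" price within an aggregation.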
@@ -293,7 +293,7 @@ def fuel_cost(pudl_out):
    {
        "total_fuel_cost": pudl.helpers.sum_na,
        "fuel_consumed_mmbtu": pudl.helpers.sum_na,
-       "fuel_cost_from_eiaapi": "any",
+       "filled_by": "any",
Hmm, this can only ever be `True`, since we put something in this column for every record (`original` if it's not filled). Probably not what we want.
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
@@ -0,0 +1,1080 @@
Line #41. `frc.loc[:, string_cols] = frc[string_cols].fillna("NULL")`
I think filling with "NULL" probably works, since I've seen strings like "missing" be used. You could also add this step to a categorical column transformer pipeline by using `SimpleImputer` (with the `constant` strategy) as the "imputer" step.
Not that familiar with pipelines but it seems like you might be able to move some of these preprocessing steps into a pipeline for categorical cols and a pipeline for numerical cols.
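The suggestion above might look something like this sketch, which wraps `SimpleImputer` and `OrdinalEncoder` in a per-column-type pipeline inside a `ColumnTransformer`. The column names and data here are made-up stand-ins for the notebook's actual columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column names standing in for the notebook's string/numeric cols:
cat_cols = ["energy_source_code", "state"]
num_cols = ["fuel_consumed_mmbtu"]

cat_pipe = Pipeline([
    # Replace missing category values with a constant placeholder...
    ("imputer", SimpleImputer(strategy="constant", fill_value="NULL")),
    # ...then integer-encode, mapping categories unseen at fit time to -1:
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocessor = ColumnTransformer([
    ("cat", cat_pipe, cat_cols),
    ("num", "passthrough", num_cols),
])

df = pd.DataFrame({
    "energy_source_code": ["BIT", np.nan, "NG"],
    "state": ["CO", "TX", np.nan],
    "fuel_consumed_mmbtu": [100.0, 250.0, 75.0],
})
out = preprocessor.fit_transform(df)
print(out.shape)  # (3, 3)
```

The nice part is that the whole preprocessor can then be the first step of a single pipeline ending in the regressor, so imputation and encoding are fit only on training folds during cross validation.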
I'm frustrated here because it seems like this isn't needed at all -- supposedly you can get the `OrdinalEncoder` to pass the NA values through, and the HistGBR model is supposedly fine with NA values on its own, but when I leave the NA values in there it complains about getting a mix of `NAType` and `str` values in the column.
@@ -0,0 +1,1080 @@
Line #5. `frc_train_test = frc[frc.fuel_cost_per_mmbtu.notna()]`
Mentioned this in Slack, but you probably want to use something like sklearn `train_test_split()`.
I might not understand what `train_test_split()` does. I thought it split up a bunch of records that you know the target value for into different chunks for training vs. testing, which is also something that's built into GridSearchCV?
Here I was trying to separate the records we do know the target value for from those where the target value is missing (since in this imputation problem they all show up in the same table to begin with). So I thought `frc_train_test` was the set of all records to use in training and testing, and the remaining records are the ones where we don't know the correct answer and we want to eventually generate one using the trained model.
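The two splits being discussed are compatible: first separate known-target rows from rows needing imputation, then apply `train_test_split()` only to the known rows. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the FRC table; fuel_cost_per_mmbtu is the imputation target.
frc = pd.DataFrame({
    "fuel_cost_per_mmbtu": [2.1, np.nan, 3.5, np.nan, 1.8, 2.9],
    "fuel_consumed_mmbtu": [100, 200, 150, 300, 120, 180],
})

# First split: rows where the target is known vs. rows needing imputation.
known = frc[frc.fuel_cost_per_mmbtu.notna()]
to_impute = frc[frc.fuel_cost_per_mmbtu.isna()]

# Second split: hold out part of the *known* rows to evaluate the model;
# random_state makes the split reproducible.
train, test = train_test_split(known, test_size=0.25, random_state=42)
print(len(known), len(to_impute), len(train), len(test))  # 4 2 3 1
```

So `frc_train_test` above corresponds to `known` here, and `train_test_split()` (or a CV splitter inside GridSearchCV) only carves up that known subset.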
@@ -0,0 +1,1080 @@
Line #10. `num_cols = [`
You might want to scale these numerical columns, using `StandardScaler` for example.
I had read that with the tree-based models and this one in particular, the numerical features didn't need to be scaled or normalized, since it's just picking some value to partition by such that the child categories are maximally distinct (which seems to be a selling point of this model, since you can avoid those finicky choices).
@@ -0,0 +1,1080 @@
Line #16. `cat_pre = OrdinalEncoder(`
I'm not that familiar with sklearn pipelines, but if you added something like `SimpleImputer` to the categorical columns, then `cat_pre` could be a pipeline that you pass into `ColumnTransformer`.
@@ -0,0 +1,1080 @@
I think this warning is coming from the values in your categorical columns. Not sure what `quoted_name` is? Maybe look at the initial merges and SQL query you do?
I think it has something to do with some of the data having labels (pandas df/series) and some of it having gotten reduced to numpy arrays somewhere, and it sees that these two things look different and thinks maybe you're not using the same kind of data for training / prediction?
@@ -0,0 +1,1080 @@
Oh, interesting that you can only use sample weights with `cross_val_score`. You could use `search.fit(X_train, y_train)`, where search is your cross validator and `hist_gbr__sample_weights` is part of your parameter grid.
@@ -0,0 +1,1133 @@
Line #31. `"plant_id_eia",`
My first instinct here was "oh no, we definitely don't want that", because this picks out individual plants, whereas most of our actual imputations will be of 800 plants the model has never seen before (plus 100 it has seen pieces of). I think the model literally knowing `plant_id_eia` will only be useful for the second, much smaller group.
But after more thought, this feature, because it is assigned sequentially over time (right?), probably is a direct proxy for construction date and likely correlates with lots of other temporally dependent things like maybe plant type or location, etc. It might be better to just bring those in directly? I'm undecided here. I guess we can just leave it up to cross validation to decide.
How do you mean `plant_id_eia` is ordinal in time? You mean as new plants get built the number increments? Is this still an issue if the ID is treated as a categorical feature rather than a numerical one?
I was imagining that this would be helpful in filling in gaps within a plant's fuel price history, but based on what you've seen that seems like a secondary consideration, since almost all the plants either have all their fuel prices, or none of them.
I think a useful geographic feature would be the distance to the coalmine associated with the record, which we could estimate as the geographic distance to the centroid of the geometry associated with `mine_county_id_fips`.
I'd be surprised if plant age had a lot to do with the fuel price.
@@ -0,0 +1,1133 @@
Line #14. `("hist_gbr", HistGradientBoostingRegressor(loss="absolute_error", categorical_features=cat_cols_idx)),`
The default value of `max_iter` is 100, which is probably too low (also why the best learning rate is pretty high -- the model is trying to make the most of a small number of iterations). We can probably crank that up to like 1000 and rely on early stopping.
BUT this will lead to much longer training times, and as you've seen the output is not that sensitive to this stuff. So for the purposes of experimentation, I think it's good to leave as is, but at the very end of this whole process let's do one big round of grid searching to dial it in.
I've switched over to using a single set of parameters for basic experimentation, and it seems to train in a reasonable amount of time, even with all these columns. But yeah, I agree a big parameter-space search at the end seems reasonable!
I'll mess around with some of the other model params in a single-shot and see how they behave. Do you have an intuitive sense of what the minimum number of samples per leaf does? Does it prevent overfitting on smaller datasets because the model can only get so granular?
Yep, `min_samples_leaf` helps prevent overfitting, aka fitting to noise in the training data. A groupby slices your data up into equally spaced bins and then calculates an aggregate of the points in each bin. A decision tree also slices your data up into bins (called "leaves") and aggregates the points, but instead of equal spacing, the sizes/shapes of the bins are calculated dynamically. `min_samples_leaf` sets the minimum allowable sample size in the bins so they don't get overly fit on a small, potentially noisy sample. If you had only 3 points in a bin, you can imagine the aggregate could get thrown way off by a rogue data point, whereas a 30 point aggregate would be less bothered.
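That intuition can be demonstrated directly. This sketch uses a plain `DecisionTreeRegressor` on made-up noisy data just to keep the demo small; the same `min_samples_leaf` parameter exists on HistGBR:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)  # noisy target

# Tiny leaves can memorize the noise; larger leaves average it away.
loose = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X, y)
strict = DecisionTreeRegressor(min_samples_leaf=30, random_state=0).fit(X, y)

print(round(loose.score(X, y), 2))   # ~1.0 on training data (memorized)
print(round(strict.score(X, y), 2))  # lower training R^2, less noise-fitting
```

The `loose` tree scores near-perfectly on its own training data because each leaf can hold a single point; the `strict` tree's lower training score reflects that it is aggregating over at least 30 points per leaf and can't chase individual outliers.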
@@ -0,0 +1,1133 @@
Line #3. `gkf = GroupKFold(n_splits=5)`
Two things here:
- For the sake of reproducibility, I think we should set `random_state=<your favorite number>` for every piece that takes that arg. That includes this CV splitter, HistGBR, and maybe GridSearch too.
- KFold splits data in the order it comes in (GroupKFold by the order the groups appear). Sometimes that is a useful thing; I'm not sure if it is here. I think frc comes roughly sorted by `report_date`, so this might be holding out plants roughly based on their age group. That sounds good for forecasting, but for imputation it might just add noise? I'm not totally sure. Why did you move from GroupShuffleSplit to GroupKFold?
I didn't really understand the difference between the various splitters, beyond ensuring that groups are partitioned, and that unevenly sized categories get proportionally represented in the different splits. It seems like there's no reason not to shuffle all the records within the group partitions that we want for test/train splits, and it seems like that's what `GroupShuffleSplit` does?
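The key property both splitters share is that no group straddles the train/test boundary; `GroupShuffleSplit` additionally randomizes which groups land where, and accepts `random_state` for reproducibility. A small sketch with a made-up group column standing in for `plant_id_eia`:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
groups = rng.integers(0, 10, size=100)  # stand-in for plant_id_eia

gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(X, groups=groups))

# No group (plant) appears on both sides of the split:
overlap = set(groups[train_idx]) & set(groups[test_idx])
print(overlap)  # set()
```

Passing `groups=frc_data["state"]` instead would give the held-out-whole-states experiment discussed below in the thread.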
@@ -0,0 +1,1133 @@
Line #4. `gkf_split = gkf.split(frc_data, groups=frc_data["plant_id_eia"])`
In light of the geographic clustering of missing data, I'm wondering if we might try doing a grouped split by `state` to test the spatial dependency issue. Which do you think is more representative of our actual missing data problem?
So this would be grouping for a different reason than avoiding intra-group information leakage right? In this case we'd be checking whether or not there's enough regional correlation that we can hope to infer prices in states where we have little to no information, based on information from other states?
So if we were to hold out whole states, and saw model performance go down substantially, then we'd be worried, since it would mean that maybe we can't effectively get at what's going on in the NE and other IPP-dominated areas.
Looking at the aggregate EIA API data for the states that aren't well represented, and seeing what kinds of correlations it has with our predicted (and the observed) prices in adjacent states might also be helpful?
@@ -0,0 +1,1133 @@
Line #5. `param_grid = {`
Another hyperparameter note (again, best left for the very end of this process due to model insensitivity): another param worth exploring is `min_samples_leaf`, which directly controls the minimum number of samples going into each aggregate.
@@ -0,0 +1,1133 @@
It would be helpful to print the best score somewhere for reference! I didn't find the actual number anywhere in this notebook
Well, the scoring metric has been changing a lot too, so maybe not so comparable. But yeah, once we've settled on something it'd be good to be able to compare across time and efforts. We could even compete. :grimacing:
@@ -0,0 +1,1133 @@
I like this sanity check. Another useful output for model evaluation is to compute the residuals and relative % error and do some error analysis: plot histograms of the error, do a `.describe()`, print the rows with the `.nlargest()` and `.nsmallest()` absolute and relative errors, and make sure the inputs correspond with the outputs in a way that makes sense.
Okay I'll look at doing some of this.
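The error-analysis recipe suggested above might look like this in outline, using synthetic actual/predicted prices in place of the real model outputs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
actual = rng.uniform(1, 5, size=200)                 # stand-in observed prices
predicted = actual + rng.normal(scale=0.3, size=200) # stand-in model outputs

err = pd.DataFrame({"actual": actual, "predicted": predicted})
err["residual"] = err.predicted - err.actual
err["rel_pct_error"] = 100 * err.residual / err.actual

# Summary statistics of the errors:
print(err[["residual", "rel_pct_error"]].describe())
# Worst over- and under-predictions, to eyeball against their inputs:
print(err.nlargest(5, "residual"))
print(err.nsmallest(5, "residual"))
```

On the real FRC data you'd also want to keep the feature columns in `err` so the worst rows can be inspected alongside the inputs that produced them, and plot histograms of `residual` and `rel_pct_error`.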
…l into fuel-price-calc
For more information, see https://pre-commit.ci
Brought some notebook functions into the module to share / discuss, especially the function that builds the dataframe of features to pass off to sklearn. Includes estimating distance between plant and mine when possible based on county FIPS codes. Also some more visualizations and error metrics in the notebook
…l into fuel-price-calc
This PR has gotten so out of date that it's probably not worth trying to merge it in anymore. I will leave the branch around though, so we can poach any code we might need from the HistGBDT work directly. See #1708 for ongoing work on this.
EIA API
Implement a method of estimating fuel prices which are missing from the `fuel_receipts_costs_eia923` table by aggregating across space, time, and fuel categories. The aggregate median values are used to `fillna()` in order of decreasing precision, until all missing fuel price values in the output table have been filled. See #1343

Zenodo API
Add a GCS Cache layer to the Datastore that's used in CI so that whenever the GitHub Action runner datastore cache gets evicted, the data is downloaded from GCS instead of Zenodo if possible. See #1679
Review Questions
I'm passing `google.storage.Client().bucket` a project to bill to for requester pays, via an environment variable, but it seems kinda hacky. I don't know if this is really the right way to do it. I'm also not sure how this will play with the changes that you've made in the GCE deploy PR. Any suggestions as to how this is really supposed to work would be great.

Closes #1491
Closes #1343
Closes #1679