
Remove external EIA & Zenodo API calls from CI #1696

Closed · wants to merge 38 commits

Conversation

@zaneselvans (Member) commented Jun 17, 2022

EIA API

Implement a method of estimating fuel prices which are missing from the fuel_receipts_costs_eia923 table by aggregating across:

  • space (state, census region, or nationally)
  • time (annual), and
  • fuel categories (energy_source_code vs. fuel_group_eiaepm)

The aggregate median values are used with fillna() in order of decreasing precision, until all missing fuel price values in the output table have been filled. See #1343
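A minimal sketch of this cascading fill (the column names here are hypothetical, and the real implementation uses weighted medians, outlier removal, and more aggregation levels such as census region and fuel group):

```python
import pandas as pd

def fill_fuel_prices(frc: pd.DataFrame) -> pd.DataFrame:
    """Fill missing fuel prices from group medians, most precise first.

    Illustrative sketch only: column names are hypothetical, and the PR
    uses weighted medians rather than plain medians.
    """
    frc = frc.copy()
    # Track provenance: original values vs. which aggregation filled them.
    frc["filled_by"] = frc["fuel_cost_per_mmbtu"].notna().map(
        {True: "original", False: pd.NA}
    )
    aggregations = [
        ("state_esc_median", ["report_year", "state", "energy_source_code"]),
        ("national_esc_median", ["report_year", "energy_source_code"]),
    ]
    for label, group_cols in aggregations:
        agg = frc.groupby(group_cols)["fuel_cost_per_mmbtu"].transform("median")
        fillable = frc["fuel_cost_per_mmbtu"].isna() & agg.notna()
        frc.loc[fillable, "fuel_cost_per_mmbtu"] = agg[fillable]
        frc.loc[fillable, "filled_by"] = label
    return frc
```

Each pass only touches rows that every more-precise pass left empty, so the provenance column records exactly one aggregation level per filled value.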

Zenodo API

Add a GCS Cache layer to the Datastore that's used in CI so that whenever the GitHub Action runner datastore cache gets evicted, the data is downloaded from GCS instead of Zenodo if possible. See #1679

Review Questions

  • @bendnorman I got authentication working for GCP, and gave the google.storage.Client().bucket a project to bill to for requester pays, via an environment variable, but it seems kinda hacky, I don't know if this is really the right way to do it. I'm also not sure how this will play with the changes that you've made in the GCE deploy PR. Any suggestions as to how this is really supposed to work would be great.
  • @TrentonBush @katie-lamb What do you think of the outlier identification process, weighted median, and modified z-scores? Also... is there any obvious way it can be made faster? With the weighted medians, the filling process now takes ~5 minutes.
  • @cmgosnell It would be great if you could see whether the new fuel price estimates do anything weird to the MCOE / RMI outputs.
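For reference, the modified z-score mentioned above is conventionally built from the median and the median absolute deviation (MAD). An unweighted sketch (the PR substitutes a weighted median):

```python
from statistics import median

def modified_zscores(values: list[float]) -> list[float]:
    """Modified z-scores (Iglewicz & Hoaglin): like z-scores, but centered
    on the median and scaled by the MAD, so a few extreme fuel prices
    can't drag the center and spread around. |score| > 3.5 is a common
    outlier threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]
```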

Closes #1491
Closes #1343
Closes #1679

Implement a method of estimating fuel prices which are missing from the
fuel_receipts_costs_eia923 table, using aggregations across space
(state, census region, or nationally), across time (annual), and across
fuel categories (energy_source_code vs. fuel_type_code).

Incrementally apply these methods in order of decreasing precision,
until all missing values in the original table have been filled.

In this commit the code for using the EIA API and the new method are
both still present, enabling direct comparison between them in the
`fuel_receipts_costs_eia923` table. See GitHub issue #1343 for more extensive
discussion of the issues and potential refinements.
@zaneselvans zaneselvans linked an issue Jun 17, 2022 that may be closed by this pull request
Removed EIA API infrastructure:
* Several functions in the `pudl.output.eia923` module.
* Environment variables passed into the test environment via the tox-pytest GitHub workflow.
* Environment variables passed through via tox.ini.
* The pudl_out fixtures in the output tests.
* Switched to looking at `filled_by` rather than `fuel_cost_from_eiaapi` in the MCOE calcs.

Also:
* Remove the rolling average option from the frc_eia923 outputs.
* Switch to filling in the fuel prices by default in `PudlTabl.frc_eia923()` and
  `pudl.output.eia923.fuel_receipts_costs_eia923`
* Add a `PudlTabl.debug` boolean attribute that can be used to preserve intermediate
  columns in the output dataframes (like the aggregation and error columns for the fuel
  price estimates).
The fuel_receipts_costs_eia923 table reports an energy group, which must be one of coal,
petroleum, natural_gas, petroleum_coke, or other_gas. This is the category that the
fuels show up under in the EIA's Electric Power Monthly (EPM). It's entirely determined
by the value of `energy_source_code`, and so can be stored as one of the columns in the
`energy_sources_eia` coding table.

I've renamed it `fuel_group_eiaepm` to give a better indication of where it comes from,
as there are a number of other very generic energy categorizations in that table
already.
@zaneselvans zaneselvans linked an issue Jun 19, 2022 that may be closed by this pull request
@zaneselvans zaneselvans changed the title Estimate fuel prices without using the EIA API Remove external EIA & Zenodo APIs from CI Jun 19, 2022
@zaneselvans zaneselvans linked an issue Jun 19, 2022 that may be closed by this pull request
@codecov (bot) commented Jun 19, 2022

Codecov Report

Merging #1696 (6677dc0) into dev (2384719) will decrease coverage by 0.0%.
The diff coverage is 98.7%.

❗ Current head 6677dc0 differs from pull request most recent head 2933ed8. Consider uploading reports for the commit 2933ed8 to get more accurate results

@@           Coverage Diff           @@
##             dev   #1696     +/-   ##
=======================================
- Coverage   83.5%   83.5%   -0.1%     
=======================================
  Files         65      66      +1     
  Lines       7310    7334     +24     
=======================================
+ Hits        6106    6125     +19     
- Misses      1204    1209      +5     
Impacted Files Coverage Δ
src/pudl/analysis/mcoe.py 90.3% <ø> (ø)
src/pudl/metadata/codes.py 100.0% <ø> (ø)
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/helpers.py 97.8% <ø> (ø)
src/pudl/metadata/resources/eia.py 100.0% <ø> (ø)
src/pudl/metadata/resources/eia923.py 100.0% <ø> (ø)
src/pudl/settings.py 94.7% <ø> (ø)
src/pudl/transform/eia923.py 93.9% <ø> (ø)
src/pudl/output/eia923.py 98.5% <80.0%> (+1.7%) ⬆️
src/pudl/analysis/fuel_price.py 100.0% <100.0%> (ø)
... and 6 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@zaneselvans zaneselvans marked this pull request as ready for review June 19, 2022 05:45
* Use a weighted median value within each aggregation when filling fuel
  prices in the FRC table.
* Also use the weighted median when identifying outlying fuel price
  values that should be replaced, in the calculation of their modified
  z-score.
* Add some unit tests for the weighted_median function.
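A sketch of what a weighted_median can look like (illustration only, not the PR's actual implementation; tie-breaking and NA handling may differ):

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half of the
    total weight. Sketch only: the real implementation may interpolate
    between values or handle NAs differently."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Sort values and carry their weights along.
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    # First index where cumulative weight crosses half the total.
    cumweight = np.cumsum(weights)
    return values[np.searchsorted(cumweight, weights.sum() / 2.0)]
```

The point of weighting by fuel delivery volume is that one huge coal delivery should pull the "typical" price more than many tiny ones.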
@@ -293,7 +293,7 @@ def fuel_cost(pudl_out):
     {
         "total_fuel_cost": pudl.helpers.sum_na,
         "fuel_consumed_mmbtu": pudl.helpers.sum_na,
-        "fuel_cost_from_eiaapi": "any",
+        "filled_by": "any",
Member Author:
Hmm, this can only ever be True since we put something in this column for every record (original if it's not filled). Probably not what we want.


@zaneselvans zaneselvans removed the request for review from cmgosnell June 23, 2022 16:37
@@ -0,0 +1,1080 @@

Member:

Line #41.    frc.loc[:, string_cols] = frc[string_cols].fillna("NULL")

I think filling with "NULL" probably works since I've seen strings like "missing" be used. You could also add this step to a categorical column transformer pipeline by using SimpleImputer (with the constant strategy) as the "imputer" step.

Not that familiar with pipelines but it seems like you might be able to move some of these preprocessing steps into a pipeline for categorical cols and a pipeline for numerical cols.


Member Author:

I'm frustrated here because it seems like this isn't needed at all -- supposedly you can get the OrdinalEncoder to pass the NA values through, and the HistGBR model is supposedly fine with NA values on its own, but when I leave the NA values in there it complains about getting a mix of NAType and str values in the column.

@@ -0,0 +1,1080 @@

Member:

Line #5.    frc_train_test = frc[frc.fuel_cost_per_mmbtu.notna()]

Mentioned this in Slack, but you probably want to use something like sklearn train_test_split()


Member Author:

I might not understand what train_test_split() does. I thought it split up a bunch of records that you know the target value for into different chunks for training vs. testing on, which is also something that's also built into the GridSearchCV?

Here I was trying to separate the records we do know the target value for from those where the target value is missing (since in this imputation problem they all show up in the same table to begin with). So I thought frc_train_test was the set of all records to use in training and testing, and the remaining records are the ones where we don't know the correct answer and we want to eventually generate one using the trained model.
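Both readings are compatible, in fact: the imputation split (known vs. missing targets) happens first, and train_test_split only ever sees the known-target rows. A sketch with hypothetical column names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical FRC-like frame: some target fuel prices are known, some missing.
frc = pd.DataFrame(
    {
        "heat_content_mmbtu_per_unit": [20.5, 21.0, 19.8, 22.1, 20.0, 21.5],
        "fuel_cost_per_mmbtu": [2.1, None, 3.4, None, 2.8, 3.0],
    }
)

# Split 1 (imputation): rows with a known target can be used for training
# and evaluation; rows with a missing target are what we want to predict.
known = frc[frc["fuel_cost_per_mmbtu"].notna()]
to_impute = frc[frc["fuel_cost_per_mmbtu"].isna()]

# Split 2 (evaluation): within the known rows, hold out a final test set.
# GridSearchCV does its own internal CV on the training portion, but a
# separate held-out set still gives an unbiased final score.
X_train, X_test, y_train, y_test = train_test_split(
    known[["heat_content_mmbtu_per_unit"]],
    known["fuel_cost_per_mmbtu"],
    test_size=0.25,
    random_state=0,
)
```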

@@ -0,0 +1,1080 @@

Member:

Line #10.    num_cols = [

You might want to scale these numerical columns, like StandardScaler for example.


Member Author:

I had read that with the tree-based models and this one in particular, the numerical features didn't need to be scaled or normalized, since it's just picking some value to partition by such that the child categories are maximally distinct (which seems to be a selling point of this model, since you can avoid those finicky choices).

@@ -0,0 +1,1080 @@

Member:

Line #16.    cat_pre = OrdinalEncoder(

I'm not that familiar with sklearn pipelines but if you added something like SimpleImputer to the categorical columns, then cat_pre could be a pipeline that you pass into ColumnTransformer
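A sketch of that arrangement (column names hypothetical): SimpleImputer with the constant strategy fills the NA categories, and the resulting Pipeline becomes one branch of a ColumnTransformer:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["energy_source_code"]           # hypothetical
num_cols = ["heat_content_mmbtu_per_unit"]  # hypothetical

# Impute, then encode, as a single categorical preprocessing branch.
cat_pre = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="constant", fill_value="NULL")),
        ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]
)
preprocessor = ColumnTransformer(
    [("cat", cat_pre, cat_cols), ("num", "passthrough", num_cols)]
)

df = pd.DataFrame(
    {
        "energy_source_code": ["BIT", np.nan, "NG"],
        "heat_content_mmbtu_per_unit": [20.5, 21.0, 1.03],
    }
)
X = preprocessor.fit_transform(df)
```

The same pattern extends to a numerical branch (a scaler or numeric imputer in place of "passthrough") if one is ever needed.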



@@ -0,0 +1,1080 @@

Member:

I think this warning is coming from the values in your categorical columns. Not sure what quoted_name is? Maybe look at the initial merges and SQL query you do?


Member Author:

I think it has something to do with some of the data having labels (pandas df/series) and some of it having gotten reduced to numpy arrays somewhere, and it sees that these two things look different and thinks maybe you're not using the same kind of data for training / prediction?

@@ -0,0 +1,1080 @@

Member:

Oh interesting that you can only use sample weights with cross_val_score. You could use search.fit(X_train, y_train) where search is your cross validator and hist_gbr__sample_weights is part of your parameter grid



@@ -0,0 +1,1133 @@

Member:

Line #31.        "plant_id_eia", 

My first instinct here was "oh no we definitely don't want that", because this picks out individual plants, whereas most of our actual imputations will be of 800 plants the model has never seen before (plus 100 it has seen pieces of). I think the model literally knowing plant_id_eia will only be useful for the second, much smaller group.

But after more thought, this feature, because it is assigned sequentially over time (right?), probably is a direct proxy for construction date and likely correlates with lots of other temporally dependent things like maybe plant type or location, etc. It might be better to just bring those in directly? I'm undecided here. I guess we can just leave it up to cross validation to decide.


Member Author:

How do you mean plant_id_eia is ordinal in time? You mean as new plants get built the number increments? Is this still an issue if the ID is treated as a categorical feature rather than a numerical one?

I was imagining that this would be helpful in filling in gaps within a plant's fuel price history, but based on what you've seen that seems like a secondary consideration, since almost all the plants either have all their fuel prices, or none of them.

I think a useful geographic feature would be distance to the coalmine associated with the record, which we could estimate as the geographic distance to the centroid of the geometry associated with mine_county_id_fips

I'd be surprised if plant age had a lot to do with the fuel price.

@@ -0,0 +1,1133 @@

Member:

Line #14.        ("hist_gbr", HistGradientBoostingRegressor(loss="absolute_error", categorical_features=cat_cols_idx)),

The default value of max_iter is 100, which is probably too low (also why the best learning rate is pretty high - the model is trying to make the most of a small number of iterations). We can probably crank that up to like 1000 and rely on early stopping.

BUT this will lead to much longer training times, and as you've seen the output is not that sensitive to this stuff. So for the purposes of experimentation, I think it's good to leave as is, but at the very end of this whole process let's do one big round of grid searching to dial it in.


Member Author:

I've switched over to using a single set of parameters for basic experimentation, and it seems to train in a reasonable amount of time, even with all these columns. But yeah I agree a big parameters space search at the end seems reasonable!

I'll mess around with some of the other model params in a single-shot and see how they behave. Do you have an intuitive sense of what the minimum number of samples per leaf does? Does it prevent overfitting on smaller datasets because the model can only get so granular?

Member:

Yep, min_samples_leaf helps prevent overfitting aka fitting to noise in the training data. A groupby slices your data up into equally spaced bins and then calculates an aggregate of the points in each bin. A decision tree also slices your data up into bins (called "leaves") and aggregates the points, but instead of equal spacing the sizes/shapes of the bins are calculated dynamically. min_samples_leaf sets the minimum allowable sample size in the bins so they don't get overly fit on a small, potentially noisy sample. If you had only 3 points in a bin, you can imagine the aggregate could get thrown way off by a rogue data point, whereas a 30 point aggregate would be less bothered.

@@ -0,0 +1,1133 @@

Member:

Line #3.    gkf = GroupKFold(n_splits=5)

Two things here:

  1. For the sake of reproducibility, I think we should set random_state=<your favorite number> for every piece that takes that arg. That includes this CV splitter, HistGBR, and maybe GridSearch too.
  2. KFold splits data in the order it comes in (GroupKFold by the order the groups appear). Sometimes that is a useful thing, I'm not sure if it is here. I think frc comes roughly sorted by report_date, so this might be holding out plants roughly based on their age group. That sounds good for forecasting, but for imputation it might just add noise? I'm not totally sure. Why did you move from GroupShuffleSplit to GroupKFold?
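For point 1, a GroupShuffleSplit with a pinned random_state gives shuffled, reproducible, group-aware folds (toy data below):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in: 10 records across 5 plants, two records per plant.
X = np.arange(20, dtype=float).reshape(10, 2)
groups = np.repeat([101, 102, 103, 104, 105], 2)  # plant_id_eia

# Shuffled and reproducible: random_state pins the shuffle, and a plant's
# records never straddle the train/test boundary.
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
splits = list(gss.split(X, groups=groups))
```

Unlike GroupKFold, the held-out groups are drawn at random on each split rather than in the order the groups appear, which sidesteps the report_date-ordering concern in point 2.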


Member Author:

I didn't really understand the difference between the various splitters, beyond ensuring that groups are partitioned, and that unevenly sized categories get proportionally represented in the different splits. It seems like there's no reason not to shuffle all the records within the group partitions that we want for test/train splits, and it seems like that's what the GroupShuffleSplit does?

@@ -0,0 +1,1133 @@

Member:

Line #4.    gkf_split = gkf.split(frc_data, groups=frc_data["plant_id_eia"])

In light of the geographic clustering of missing data, I'm wondering if we might try doing a grouped split by state to test the spatial dependency issue. Which do you think is more representative of our actual missing data problem?


Member Author:

So this would be grouping for a different reason than avoiding intra-group information leakage right? In this case we'd be checking whether or not there's enough regional correlation that we can hope to infer prices in states where we have little to no information, based on information from other states?

So if we were to hold out whole states, and saw model performance go down substantially, then we'd be worried, since it would mean that maybe we can't effectively get at what's going on in the NE and other IPP-dominated areas.

Looking at the aggregate EIA API data for the states that aren't well represented, and seeing what kinds of correlations it has with our predicted (and the observed) prices in adjacent states might also be helpful?

@@ -0,0 +1,1133 @@

Member:

Line #5.    param_grid = {

Another hyperparameter note (again best left for the very end of this process due to model insensitivity), but another param worth exploring is min_samples_leaf, which directly controls the minimum number of samples going into each aggregate.



@@ -0,0 +1,1133 @@

Member:

It would be helpful to print the best score somewhere for reference! I didn't find the actual number anywhere in this notebook


Member Author:

Well, the scoring metric has been changing a lot too, so maybe not so comparable. But yeah, once we've settled on something it'd be good to be able to compare across time and efforts. We could even compete. :grimace:

@@ -0,0 +1,1133 @@

Member:

I like this sanity check. Another useful output for model evaluation is to compute the residuals and relative % error and do some error analysis: plot histograms of error, do a .describe(), print the rows with the .nlargest() and .nsmallest() absolute and relative errors and make sure the inputs correspond with the outputs in a way that makes sense.
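That error analysis might be sketched like this (toy numbers, purely illustrative):

```python
import pandas as pd

# Hypothetical held-out predictions vs. observed fuel prices ($/MMBTU).
results = pd.DataFrame(
    {
        "observed": [2.0, 3.5, 1.8, 4.2, 2.9],
        "predicted": [2.1, 3.0, 1.9, 6.0, 2.8],
    }
)
results["residual"] = results["predicted"] - results["observed"]
results["pct_error"] = 100.0 * results["residual"] / results["observed"]

# Distribution of errors, then zoom in on the worst offenders so the
# inputs can be checked against the outputs by hand.
summary = results["residual"].describe()
worst = results.loc[results["residual"].abs().nlargest(3).index]
```

Histograms of residual and pct_error (e.g. results["residual"].hist() in a notebook) round out the picture when running interactively.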


Member Author:

Okay I'll look at doing some of this.

zaneselvans and others added 9 commits June 27, 2022 17:13
Brought some notebook functions into the module to share / discuss, especially the
function that builds the dataframe of features to pass off to sklearn. Includes
estimating distance between plant and mine when possible based on county FIPS codes.

Also some more visualizations and error metrics in the notebook
@zaneselvans zaneselvans linked an issue Jul 19, 2022 that may be closed by this pull request
@zaneselvans zaneselvans self-assigned this Sep 12, 2022
@zaneselvans zaneselvans changed the title Remove external EIA & Zenodo APIs from CI Remove external EIA & Zenodo API calls from CI Oct 18, 2022
@zaneselvans (Member Author):

This PR has gotten so out of date that it's probably not worth trying to merge it in anymore. I will leave the branch around though, so we can poach any code we might need from the HistGBDT work directly. See #1708 for ongoing work on this.

Successfully merging this pull request may close these issues:

  • Sanity check EIA fuel price predictions visually
  • Remove API calls from our CI tests