Remove external EIA & Zenodo API calls from CI #1696
Conversation
Implement a method of estimating fuel prices which are missing from the fuel_receipts_costs_eia923 table, using aggregations across space (state, census region, or nationally), across time (annual), and across fuel categories (energy_source_code vs. fuel_type_code). Incrementally apply these methods in order of decreasing precision, until all missing values in the original table have been filled. In this commit the code for using the EIA API and the new method are both still present, enabling direct comparison between them in the `fuel_receipts_costs_eia923` table. See GitHub issue #1343 for more extensive discussion of the issues and potential refinements.
Removed EIA API infrastructure:
* Several functions in the `pudl.output.eia923` module.
* Environment variables passed into the test environment via the tox-pytest GitHub workflow.
* Environment variables passed through via tox.ini.
* Output tests pudl_out fixtures.
* Switch to looking at `filled_by` rather than `fuel_cost_from_eiaapi` in MCOE calcs.

Also:
* Remove the rolling average option from the frc_eia923 outputs.
* Switch to filling in the fuel prices by default in `PudlTabl.frc_eia923()` and `pudl.output.eia923.fuel_receipts_costs_eia923`.
* Add a `PudlTabl.debug` boolean attribute that can be used to preserve intermediate columns in the output dataframes (like the aggregation and error columns for the fuel price estimates).
The fuel_receipts_costs_eia923 table reports an energy group, which must be one of coal, petroleum, natural_gas, petroleum_coke, or other_gas. This is the category that the fuels show up under in the EIA's Electric Power Monthly (EPM). It's entirely determined by the value of `energy_source_code`, and so can be stored as one of the columns in the `energy_sources_eia` coding table. I've renamed it `fuel_group_eiaepm` to give a better indication of where it comes from, as there are a number of other very generic energy categorizations in that table already.
Codecov Report
@@ Coverage Diff @@
## dev #1696 +/- ##
=======================================
- Coverage 83.5% 83.5% -0.1%
=======================================
Files 65 66 +1
Lines 7310 7334 +24
=======================================
+ Hits 6106 6125 +19
- Misses 1204 1209 +5
Continue to review full report at Codecov.
* Use a weighted median value within each aggregation when filling fuel prices in the FRC table.
* Also use the weighted median when identifying outlying fuel price values that should be replaced, in the calculation of their modified z-score.
* Add some unit tests for the weighted_median function.
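For context, a weighted median like the one described above can be sketched as follows. This is an illustrative implementation, not necessarily identical to the one added in this PR: it returns the first value at which the cumulative weight reaches half the total weight.

```python
import numpy as np

def weighted_median(values, weights):
    """Return the value at which the cumulative weight first reaches
    half of the total weight (a simple weighted median)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    # First index where the cumulative weight covers half the total weight:
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return values[idx]

# With equal weights this reduces to (roughly) the ordinary median:
print(weighted_median([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))  # -> 3.0
# A heavily weighted observation pulls the median toward itself:
print(weighted_median([1, 2, 100], [1, 1, 10]))  # -> 100.0
```

In the FRC context, the weights would be something like fuel delivery volumes, so that a huge delivery counts for more than a tiny one when computing a "typical" price within an aggregation.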
@@ -293,7 +293,7 @@ def fuel_cost(pudl_out):
    {
        "total_fuel_cost": pudl.helpers.sum_na,
        "fuel_consumed_mmbtu": pudl.helpers.sum_na,
-       "fuel_cost_from_eiaapi": "any",
+       "filled_by": "any",
Hmm, this can only ever be `True`, since we put something in this column for every record (`original` if it's not filled). Probably not what we want.
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
@@ -0,0 +1,1080 @@
Line #41. `frc.loc[:, string_cols] = frc[string_cols].fillna("NULL")`
I think filling with "NULL" probably works, since I've seen strings like "missing" be used. You could also add this step to a categorical column transformer pipeline by using `SimpleImputer` (with the `constant` strategy) as the "imputer" step.
Not that familiar with pipelines but it seems like you might be able to move some of these preprocessing steps into a pipeline for categorical cols and a pipeline for numerical cols.
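The suggestion above might look something like this sketch, which wraps `SimpleImputer` and `OrdinalEncoder` in a per-column-type pipeline inside a `ColumnTransformer`. The column names and data here are made-up stand-ins for the notebook's actual columns:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column names standing in for the notebook's string/numeric cols:
cat_cols = ["energy_source_code", "state"]
num_cols = ["fuel_consumed_mmbtu"]

cat_pipe = Pipeline([
    # Replace missing category values with a constant placeholder...
    ("imputer", SimpleImputer(strategy="constant", fill_value="NULL")),
    # ...then integer-encode, mapping categories unseen at fit time to -1:
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
])
preprocessor = ColumnTransformer([
    ("cat", cat_pipe, cat_cols),
    ("num", "passthrough", num_cols),
])

df = pd.DataFrame({
    "energy_source_code": ["BIT", np.nan, "NG"],
    "state": ["CO", "TX", np.nan],
    "fuel_consumed_mmbtu": [100.0, 250.0, 75.0],
})
out = preprocessor.fit_transform(df)
print(out.shape)  # (3, 3)
```

The nice part is that the whole preprocessor can then be the first step of a single pipeline ending in the regressor, so imputation and encoding are fit only on training folds during cross validation.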
I'm frustrated here because it seems like this isn't needed at all -- supposedly you can get the `OrdinalEncoder` to pass the NA values through, and the HistGBR model is supposedly fine with NA values on its own, but when I leave the NA values in there it complains about getting a mix of `NAType` and `str` values in the column.
@@ -0,0 +1,1080 @@
Line #5. `frc_train_test = frc[frc.fuel_cost_per_mmbtu.notna()]`
Mentioned this in Slack, but you probably want to use something like sklearn `train_test_split()`.
I might not understand what `train_test_split()` does. I thought it split up a bunch of records that you know the target value for into different chunks for training vs. testing, which is also something that's built into GridSearchCV?
Here I was trying to separate the records we do know the target value for from those where the target value is missing (since in this imputation problem they all show up in the same table to begin with). So I thought `frc_train_test` was the set of all records to use in training and testing, and the remaining records are the ones where we don't know the correct answer and we want to eventually generate one using the trained model.
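The two splits being discussed are compatible: first separate known-target rows from rows needing imputation, then apply `train_test_split()` only to the known rows. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the FRC table; fuel_cost_per_mmbtu is the imputation target.
frc = pd.DataFrame({
    "fuel_cost_per_mmbtu": [2.1, np.nan, 3.5, np.nan, 1.8, 2.9],
    "fuel_consumed_mmbtu": [100, 200, 150, 300, 120, 180],
})

# First split: rows where the target is known vs. rows needing imputation.
known = frc[frc.fuel_cost_per_mmbtu.notna()]
to_impute = frc[frc.fuel_cost_per_mmbtu.isna()]

# Second split: hold out part of the *known* rows to evaluate the model;
# random_state makes the split reproducible.
train, test = train_test_split(known, test_size=0.25, random_state=42)
print(len(known), len(to_impute), len(train), len(test))  # 4 2 3 1
```

So `frc_train_test` above corresponds to `known` here, and `train_test_split()` (or a CV splitter inside GridSearchCV) only carves up that known subset.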
@@ -0,0 +1,1080 @@
Line #10. `num_cols = [`
You might want to scale these numerical columns, using `StandardScaler` for example.
I had read that with the tree-based models and this one in particular, the numerical features didn't need to be scaled or normalized, since it's just picking some value to partition by such that the child categories are maximally distinct (which seems to be a selling point of this model, since you can avoid those finicky choices).
@@ -0,0 +1,1080 @@
Line #16. `cat_pre = OrdinalEncoder(`
I'm not that familiar with sklearn pipelines, but if you added something like `SimpleImputer` to the categorical columns, then `cat_pre` could be a pipeline that you pass into `ColumnTransformer`.
@@ -0,0 +1,1080 @@
I think this warning is coming from the values in your categorical columns. Not sure what `quoted_name` is? Maybe look at the initial merges and SQL query you do?
I think it has something to do with some of the data having labels (pandas df/series) and some of it having gotten reduced to numpy arrays somewhere, and it sees that these two things look different and thinks maybe you're not using the same kind of data for training / prediction?
@@ -0,0 +1,1080 @@
Oh, interesting that you can only use sample weights with `cross_val_score`. You could use `search.fit(X_train, y_train)`, where search is your cross validator and `hist_gbr__sample_weights` is part of your parameter grid.
@@ -0,0 +1,1133 @@
Line #31. `"plant_id_eia",`
My first instinct here was "oh no, we definitely don't want that", because this picks out individual plants, whereas most of our actual imputations will be of 800 plants the model has never seen before (plus 100 it has seen pieces of). I think the model literally knowing `plant_id_eia` will only be useful for the second, much smaller group.
But after more thought, this feature, because it is assigned sequentially over time (right?), probably is a direct proxy for construction date and likely correlates with lots of other temporally dependent things like maybe plant type or location, etc. It might be better to just bring those in directly? I'm undecided here. I guess we can just leave it up to cross validation to decide.
How do you mean `plant_id_eia` is ordinal in time? You mean as new plants get built the number increments? Is this still an issue if the ID is treated as a categorical feature rather than a numerical one?
I was imagining that this would be helpful in filling in gaps within a plant's fuel price history, but based on what you've seen that seems like a secondary consideration, since almost all the plants either have all their fuel prices, or none of them.
I think a useful geographic feature would be the distance to the coalmine associated with the record, which we could estimate as the geographic distance to the centroid of the geometry associated with `mine_county_id_fips`.
I'd be surprised if plant age had a lot to do with the fuel price.
@@ -0,0 +1,1133 @@
Line #14. `("hist_gbr", HistGradientBoostingRegressor(loss="absolute_error", categorical_features=cat_cols_idx)),`
The default value of `max_iter` is 100, which is probably too low (also why the best learning rate is pretty high -- the model is trying to make the most of a small number of iterations). We can probably crank that up to like 1000 and rely on early stopping.
BUT this will lead to much longer training times, and as you've seen the output is not that sensitive to this stuff. So for the purposes of experimentation, I think it's good to leave as is, but at the very end of this whole process let's do one big round of grid searching to dial it in.
I've switched over to using a single set of parameters for basic experimentation, and it seems to train in a reasonable amount of time, even with all these columns. But yeah, I agree a big parameter-space search at the end seems reasonable!
I'll mess around with some of the other model params in a single-shot and see how they behave. Do you have an intuitive sense of what the minimum number of samples per leaf does? Does it prevent overfitting on smaller datasets because the model can only get so granular?
Yep, `min_samples_leaf` helps prevent overfitting, aka fitting to noise in the training data. A groupby slices your data up into equally spaced bins and then calculates an aggregate of the points in each bin. A decision tree also slices your data up into bins (called "leaves") and aggregates the points, but instead of equal spacing, the sizes/shapes of the bins are calculated dynamically. `min_samples_leaf` sets the minimum allowable sample size in the bins so they don't get overly fit on a small, potentially noisy sample. If you had only 3 points in a bin, you can imagine the aggregate could get thrown way off by a rogue data point, whereas a 30 point aggregate would be less bothered.
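That intuition can be demonstrated directly. This sketch uses a plain `DecisionTreeRegressor` on made-up noisy data just to keep the demo small; the same `min_samples_leaf` parameter exists on HistGBR:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=300)  # noisy target

# Tiny leaves can memorize the noise; larger leaves average it away.
loose = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X, y)
strict = DecisionTreeRegressor(min_samples_leaf=30, random_state=0).fit(X, y)

print(round(loose.score(X, y), 2))   # ~1.0 on training data (memorized)
print(round(strict.score(X, y), 2))  # lower training R^2, less noise-fitting
```

The `loose` tree scores near-perfectly on its own training data because each leaf can hold a single point; the `strict` tree's lower training score reflects that it is aggregating over at least 30 points per leaf and can't chase individual outliers.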
@@ -0,0 +1,1133 @@
Line #3. `gkf = GroupKFold(n_splits=5)`
Two things here:
- For the sake of reproducibility, I think we should set `random_state=<your favorite number>` for every piece that takes that arg. That includes this CV splitter, HistGBR, and maybe GridSearch too.
- KFold splits data in the order it comes in (GroupKFold by the order the groups appear). Sometimes that is a useful thing; I'm not sure if it is here. I think frc comes roughly sorted by `report_date`, so this might be holding out plants roughly based on their age group. That sounds good for forecasting, but for imputation it might just add noise? I'm not totally sure. Why did you move from GroupShuffleSplit to GroupKFold?
I didn't really understand the difference between the various splitters, beyond ensuring that groups are partitioned, and that unevenly sized categories get proportionally represented in the different splits. It seems like there's no reason not to shuffle all the records within the group partitions that we want for test/train splits, and it seems like that's what `GroupShuffleSplit` does?
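The key property both splitters share is that no group straddles the train/test boundary; `GroupShuffleSplit` additionally randomizes which groups land where, and accepts `random_state` for reproducibility. A small sketch with a made-up group column standing in for `plant_id_eia`:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
groups = rng.integers(0, 10, size=100)  # stand-in for plant_id_eia

gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(X, groups=groups))

# No group (plant) appears on both sides of the split:
overlap = set(groups[train_idx]) & set(groups[test_idx])
print(overlap)  # set()
```

Passing `groups=frc_data["state"]` instead would give the held-out-whole-states experiment discussed below in the thread.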
@@ -0,0 +1,1133 @@
Line #4. `gkf_split = gkf.split(frc_data, groups=frc_data["plant_id_eia"])`
In light of the geographic clustering of missing data, I'm wondering if we might try doing a grouped split by `state` to test the spatial dependency issue. Which do you think is more representative of our actual missing data problem?
So this would be grouping for a different reason than avoiding intra-group information leakage right? In this case we'd be checking whether or not there's enough regional correlation that we can hope to infer prices in states where we have little to no information, based on information from other states?
So if we were to hold out whole states, and saw model performance go down substantially, then we'd be worried, since it would mean that maybe we can't effectively get at what's going on in the NE and other IPP-dominated areas.
Looking at the aggregate EIA API data for the states that aren't well represented, and seeing what kinds of correlations it has with our predicted (and the observed) prices in adjacent states might also be helpful?
@@ -0,0 +1,1133 @@
Line #5. `param_grid = {`
Another hyperparameter note (again, best left for the very end of this process due to model insensitivity): another param worth exploring is `min_samples_leaf`, which directly controls the minimum number of samples going into each aggregate.
@@ -0,0 +1,1133 @@
It would be helpful to print the best score somewhere for reference! I didn't find the actual number anywhere in this notebook
Well, the scoring metric has been changing a lot too, so maybe not so comparable. But yeah, once we've settled on something it'd be good to be able to compare across time and efforts. We could even compete. :grimacing:
@@ -0,0 +1,1133 @@
I like this sanity check. Another useful output for model evaluation is to compute the residuals and relative % error and do some error analysis: plot histograms of the error, do a `.describe()`, print the rows with the `.nlargest()` and `.nsmallest()` absolute and relative errors, and make sure the inputs correspond with the outputs in a way that makes sense.
Okay I'll look at doing some of this.
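The error-analysis recipe suggested above might look like this in outline, using synthetic actual/predicted prices in place of the real model outputs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
actual = rng.uniform(1, 5, size=200)                 # stand-in observed prices
predicted = actual + rng.normal(scale=0.3, size=200) # stand-in model outputs

err = pd.DataFrame({"actual": actual, "predicted": predicted})
err["residual"] = err.predicted - err.actual
err["rel_pct_error"] = 100 * err.residual / err.actual

# Summary statistics of the errors:
print(err[["residual", "rel_pct_error"]].describe())
# Worst over- and under-predictions, to eyeball against their inputs:
print(err.nlargest(5, "residual"))
print(err.nsmallest(5, "residual"))
```

On the real FRC data you'd also want to keep the feature columns in `err` so the worst rows can be inspected alongside the inputs that produced them, and plot histograms of `residual` and `rel_pct_error`.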
…l into fuel-price-calc
For more information, see https://pre-commit.ci
Brought some notebook functions into the module to share / discuss, especially the function that builds the dataframe of features to pass off to sklearn. Includes estimating distance between plant and mine when possible based on county FIPS codes. Also some more visualizations and error metrics in the notebook
…l into fuel-price-calc
This PR has gotten so out of date that it's probably not worth trying to merge it in anymore. I will leave the branch around though, so we can poach any code we might need from the HistGBDT work directly. See #1708 for ongoing work on this.
EIA API
Implement a method of estimating fuel prices which are missing from the `fuel_receipts_costs_eia923` table by aggregating across space, time, and fuel categories. The aggregate median values are used to `fillna()` in order of decreasing precision, until all missing fuel price values in the output table have been filled. See #1343

Zenodo API
Add a GCS Cache layer to the Datastore that's used in CI so that whenever the GitHub Action runner datastore cache gets evicted, the data is downloaded from GCS instead of Zenodo if possible. See #1679
Review Questions
I'm passing `google.storage.Client().bucket` a project to bill to for requester pays, via an environment variable, but it seems kinda hacky. I don't know if this is really the right way to do it. I'm also not sure how this will play with the changes that you've made in the GCE deploy PR. Any suggestions as to how this is really supposed to work would be great.

Closes #1491
Closes #1343
Closes #1679