Parallelize extraction of Excel spreadsheets #2943

Merged
merged 19 commits into from Oct 18, 2023

e-belfer
Member

@e-belfer e-belfer commented Oct 16, 2023

PR Overview

Taking over #2673. This builds on top of #2644 to add parallel data loading to EIA 861 and EIA 923 datasets. All of the logic introduced in #2644 has been factored out to excel.py so it can be re-used across each of the datasets.

This is more work towards #2385, but it doesn't quite fix it because the EIA 860m data isn't being loaded in parallel yet. At the moment that dataset is handled a bit differently from the others in this PR, so I think it's best to tackle it in a follow-up PR.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures.
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@e-belfer e-belfer self-assigned this Oct 16, 2023
@e-belfer e-belfer mentioned this pull request Oct 16, 2023
@e-belfer e-belfer linked an issue Oct 16, 2023 that may be closed by this pull request
@e-belfer
Member Author

Current validation failures when comparing against the 10-12 full dev SQL database:

FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-bga_eia860-141638-141638-141638] - ValueError: bga_eia860: found 138256 rows, expected 141638. Off by -2.388%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-bga_eia860-141638-141638-141638] - ValueError: bga_eia860: found 138256 rows, expected 141638. Off by -2.388%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5178892-433286] - ValueError: gen_eia923: found 433278 rows, expected 433286. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-387694-32416] - ValueError: hr_by_unit: found 31714 rows, expected 32416. Off by -2.166%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-599496-50070] - ValueError: hr_by_gen: found 48867 rows, expected 50070. Off by -2.403%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-599496-50070] - ValueError: fuel_cost: found 48867 rows, expected 50070. Off by -2.403%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5178892-433286] - ValueError: capacity_factor: found 433278 rows, expected 433286. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5179300-433320] - ValueError: mcoe: found 433311 rows, expected 433320. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-bga_eia860-141638-141638-141638] - ValueError: bga_eia860: found 138256 rows, expected 141638. Off by -2.388%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5178892-433286] - ValueError: gen_eia923: found 5178682 rows, expected 5178892. Off by -0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-387694-32416] - ValueError: hr_by_unit: found 379321 rows, expected 387694. Off by -2.160%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-599496-50070] - ValueError: hr_by_gen: found 585094 rows, expected 599496. Off by -2.402%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-599496-50070] - ValueError: fuel_cost: found 585094 rows, expected 599496. Off by -2.402%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5178892-433286] - ValueError: capacity_factor: found 5178682 rows, expected 5178892. Off by -0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5179300-433320] - ValueError: mcoe: found 5179082 rows, expected 5179300. Off by -0.004%, allowed margin of 0.000%
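Each failure above reports a found row count, an expected row count, and the relative difference between them. As a rough illustration of how such a check might be computed (the function name `check_row_count` and its signature are hypothetical and are not taken from the PUDL test suite), a minimal sketch:

```python
def check_row_count(table: str, found: int, expected: int, margin: float = 0.0) -> None:
    """Raise ValueError when the found row count is outside the allowed margin.

    Hypothetical helper for illustration only; the message format simply
    mirrors the failures quoted in this thread.
    """
    # Relative difference: negative means rows were lost.
    offset = found / expected - 1.0
    if abs(offset) > margin:
        raise ValueError(
            f"{table}: found {found} rows, expected {expected}. "
            f"Off by {offset:.3%}, allowed margin of {margin:.3%}"
        )


if __name__ == "__main__":
    try:
        check_row_count("bga_eia860", found=138256, expected=141638)
    except ValueError as err:
        print(err)
```

With a margin of zero, any missing row fails the check, which is why even an 8-row difference in gen_eia923 shows up in the list.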

@e-belfer
Member Author

e-belfer commented Oct 16, 2023

We were losing rows in _core_eia923__generation because the index is duplicated in the raw assets (for example, the first row of each one has an index of 0). I've reset the index, which should hopefully fix the issue.
Update: fixed it!
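The comment does not say exactly which downstream step dropped the rows, so the following is only a plain-Python analogy, not the pandas code from the PR. It assumes that some step treats the index label as a unique row key, and shows how rows sharing a label collapse unless the labels are renumbered first. All names here are hypothetical.

```python
def concat_keep_labels(frames):
    """Concatenate per-year lists of (label, row) pairs, keeping each label."""
    return [pair for frame in frames for pair in frame]


def reset_labels(rows):
    """Renumber rows 0..n-1, analogous to resetting a DataFrame index."""
    return [(i, row) for i, (_, row) in enumerate(rows)]


def key_by_label(rows):
    """Build a label -> row mapping; a repeated label overwrites the earlier row."""
    return dict(rows)


# Each "year" starts numbering at 0, like the raw assets described above.
year_a = [(0, "gen-a0"), (1, "gen-a1")]
year_b = [(0, "gen-b0"), (1, "gen-b1")]

combined = concat_keep_labels([year_a, year_b])
lost = key_by_label(combined)                # labels 0 and 1 collide
kept = key_by_label(reset_labels(combined))  # labels are unique after the reset
```

In pandas terms, the equivalent of `reset_labels` would be resetting the index after concatenation so that every row carries a unique label.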

@e-belfer e-belfer marked this pull request as ready for review October 17, 2023 03:37
Member

@zaneselvans zaneselvans left a comment

excel.py seems very clever and concise but I'm having a hard time understanding how it works with the multiple nested layers of factories. Can we expand the docstrings and/or add some comments and argument descriptions to help clarify how they all fit together?
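For readers unfamiliar with the nested-factory pattern being discussed, here is a minimal standard-library sketch. It is not the Dagster-based code in excel.py: it swaps in a `ThreadPoolExecutor` for the orchestrator, and the names `make_year_extractor`, `make_dataset_extractor`, and `fake_read_year` are all hypothetical. The merge loop follows the snippet quoted later in this review.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_year_extractor(read_year):
    """Inner factory: return a function that extracts all pages for one year."""
    def extract_single_year(year):
        # Returns a dict mapping page name -> data for that year.
        return read_year(year)
    return extract_single_year


def make_dataset_extractor(read_year, years):
    """Outer factory: return a function that extracts every year in parallel."""
    extract_single_year = make_year_extractor(read_year)

    def extract_all_years():
        # pool.map preserves the order of `years` in its results.
        with ThreadPoolExecutor() as pool:
            paged_dfs = list(pool.map(extract_single_year, years))
        # Group the per-year results by page.
        all_data = defaultdict(list)
        for dfs in paged_dfs:
            for page in dfs:
                all_data[page].append(dfs[page])
        return dict(all_data)

    return extract_all_years


def fake_read_year(year):
    """Stand-in reader; a real one would parse that year's spreadsheet."""
    return {"plants": f"plants-{year}", "generators": f"generators-{year}"}


if __name__ == "__main__":
    extract = make_dataset_extractor(fake_read_year, [2020, 2021])
    print(extract())
```

The outer factory is configured once per dataset and the inner one once per year, which is what lets the same machinery be reused across several datasets.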

@codecov

codecov bot commented Oct 17, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (b939d28) 88.5% compared to head (8d3fbcc) 88.5%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2943   +/-   ##
=====================================
  Coverage   88.5%   88.5%           
=====================================
  Files         91      91           
  Lines      10805   10808    +3     
=====================================
+ Hits        9569    9572    +3     
  Misses      1236    1236           
Files Coverage Δ
src/pudl/extract/eia860.py 100.0% <100.0%> (ø)
src/pudl/extract/eia861.py 94.4% <100.0%> (-0.3%) ⬇️
src/pudl/extract/eia923.py 100.0% <100.0%> (ø)
src/pudl/extract/excel.py 96.8% <100.0%> (+0.7%) ⬆️
src/pudl/settings.py 98.8% <ø> (ø)


@zaneselvans zaneselvans added dagster Issues related to our use of the Dagster orchestrator excel Issues involving data in Microsoft Excel spreadsheets labels Oct 17, 2023
Member Author

@e-belfer e-belfer left a comment

Responded to some comments and added some more docstrings.

e-belfer and others added 4 commits October 17, 2023 10:49
- Expanded on some of the docstrings and comments to try and clarify the
  highly abstracted nested factories that construct the Excel extraction
  graph asset.
- Changed some of the function/object names to use extract_ rather
  than load_ to match the stage-names we're using elsewhere.
- It turned out that not including the empty dataframes created errors,
  since those dataframes are being constructed such that they always
  contain all mapped columns, even if the particular year being
  extracted doesn't have every column... and if we don't end up with all
  of the expected columns (even if they're null) that causes problems
  downstream.
Comment on lines +388 to +391
all_data = defaultdict(list)
for dfs in paged_dfs:
    for page in dfs:
        all_data[page].append(dfs[page])
Member

I switched this back to not checking for the empty dataframes because it turns out we need them to ensure all of the mapped columns are present downstream. Otherwise the fast tests and other subsets of the data fail because they don't contain columns that are only reported in the earlier years.
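To illustrate why the empty dataframes matter, here is a small plain-Python sketch that models a frame as a dict of column names and rows. It is an analogy only: `empty_frame` and `concat_frames` are hypothetical helpers, and `legacy_code` is a made-up column standing in for one reported only in earlier years. The concatenation takes the union of columns, similar in spirit to an outer-join concatenation of dataframes.

```python
def empty_frame(columns):
    """A frame with all mapped columns but no rows."""
    return {"columns": list(columns), "rows": []}


def concat_frames(frames):
    """Concatenate frames, keeping the union of their columns in first-seen order."""
    columns = []
    for frame in frames:
        for col in frame["columns"]:
            if col not in columns:
                columns.append(col)
    # Fill columns missing from a row with None, like nulls in a dataframe.
    rows = [
        {col: row.get(col) for col in columns}
        for frame in frames
        for row in frame["rows"]
    ]
    return {"columns": columns, "rows": rows}


# The empty frame carries every mapped column; the recent year lacks one.
mapped = empty_frame(["plant_id", "legacy_code"])
recent_year = {"columns": ["plant_id"], "rows": [{"plant_id": 1}]}

with_empty = concat_frames([mapped, recent_year])
without_empty = concat_frames([recent_year])
```

Here `with_empty` still has the `legacy_code` column (filled with nulls), while `without_empty` does not, which is the kind of gap that would break downstream steps expecting every mapped column.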

return all_data


def year_extractor_factory(
Member

On this and a couple of other names, I switched from load to extract so they match the sense in which we're using those words to describe the different ETL stages.

@zaneselvans zaneselvans merged commit b69c009 into dev Oct 18, 2023
11 checks passed
@zaneselvans zaneselvans deleted the parallel_extraction branch October 18, 2023 03:29
Labels
dagster Issues related to our use of the Dagster orchestrator excel Issues involving data in Microsoft Excel spreadsheets
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Parallelize extraction of Excel spreadsheets
4 participants