Parallelize extraction of Excel spreadsheets #2943
Conversation
Current validation failures when comparing to the 10-12 full dev SQL database:

- Losing rows in
`excel.py` seems very clever and concise, but I'm having a hard time understanding how it works with the multiple nested layers of factories. Can we expand the docstrings and/or add some comments and argument descriptions to help clarify how they all fit together?
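For readers following along, the nested-factory structure being discussed might look roughly like the sketch below. This is a hypothetical illustration only; the actual names, signatures, and return types in `excel.py` differ, and `extract_year` / `extractor_factory` here are made-up stand-ins.

```python
from typing import Callable

def year_extractor_factory(dataset: str) -> Callable[[int], dict]:
    """Inner factory: return a function that extracts one year of one dataset."""
    def extract_year(year: int) -> dict:
        # The real code would read that year's Excel workbook into a
        # page -> DataFrame mapping; a placeholder string stands in here.
        return {"fuel_receipts": f"{dataset}-{year}-rows"}
    return extract_year

def extractor_factory(datasets: list[str]) -> dict[str, Callable[[int], dict]]:
    """Outer factory: build a per-year extractor for each dataset."""
    return {ds: year_extractor_factory(ds) for ds in datasets}

extractors = extractor_factory(["eia923", "eia861"])
print(extractors["eia923"](2020))  # {'fuel_receipts': 'eia923-2020-rows'}
```

The layering exists so that dataset-specific configuration is bound once in the outer factory, leaving a simple per-year callable that a parallel executor can map over.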
Codecov Report: all modified lines are covered by tests ✅

```
@@ Coverage Diff @@
##             dev   #2943   +/- ##
===================================
  Coverage   88.5%   88.5%
===================================
  Files         91      91
  Lines      10805   10808    +3
===================================
+ Hits        9569    9572    +3
  Misses      1236    1236
```
Responding to some comments; added some more docstrings in here.

- Expanded on some of the docstrings and comments to try to clarify the highly abstracted nested factories that construct the Excel extraction graph asset.
- Changed some of the function/object names to use `extract_` rather than `load_` to match the stage names we're using elsewhere.
- It turned out that excluding the empty dataframes created errors, since those dataframes are constructed so that they always contain all mapped columns, even if the particular year being extracted doesn't have every column. If we don't end up with all of the expected columns (even if they're null), that causes problems downstream.
```python
all_data = defaultdict(list)
for dfs in paged_dfs:
    for page in dfs:
        all_data[page].append(dfs[page])
```
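The accumulation pattern in this hunk can be demonstrated standalone. In the sketch below, lists of row-dicts stand in for the per-year pandas DataFrames, and the page names are invented for illustration:

```python
from collections import defaultdict

# Stand-in for the per-year page -> DataFrame dicts produced by the
# parallel year extractors (lists of row-dicts instead of DataFrames).
paged_dfs = [
    {"generation": [{"year": 2020, "mwh": 1}], "fuel": [{"year": 2020, "btu": 2}]},
    {"generation": [{"year": 2021, "mwh": 3}]},
]

# Regroup by page: each page maps to the list of per-year chunks,
# ready to be concatenated into one table per page downstream.
all_data = defaultdict(list)
for dfs in paged_dfs:
    for page in dfs:
        all_data[page].append(dfs[page])

print(len(all_data["generation"]))  # 2
print(len(all_data["fuel"]))  # 1
```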
I switched this back to not checking for the empty dataframes because it turns out we need them to ensure all of the mapped columns are present downstream. Otherwise the fast tests and other subsets of the data fail because they don't contain columns that are only reported in the earlier years.
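The column-preservation issue described here can be illustrated with a small pandas sketch. The column names below are hypothetical, and this assumes (as the comment describes) that each per-year frame is built with every mapped column even when that year reports no data for some of them:

```python
import pandas as pd

# Hypothetical mapped columns; suppose "old_col" is only reported in
# early years, which a fast-test subset of the data never extracts.
mapped_cols = ["plant_id", "old_col"]

# Building the recent-year frame against the full mapped column set
# keeps "old_col" present (as nulls) even though the year lacks it.
recent_year = pd.DataFrame({"plant_id": [1, 2]}).reindex(columns=mapped_cols)

print(list(recent_year.columns))  # ['plant_id', 'old_col']
print(bool(recent_year["old_col"].isna().all()))  # True
```

Dropping such empty or all-null frames would leave downstream code that expects `old_col` with a missing-column failure, which is why they are kept.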
```python
    return all_data


def year_extractor_factory(
```
On this and a couple of other names, I switched from `load` to `extract` so they match the sense in which we're using those words to describe the different ETL stages.
PR Overview

Taking over #2673. This builds on top of #2644 to add parallel data loading to the EIA 861 and EIA 923 datasets. All of the logic introduced in #2644 has been factored out into `excel.py` so it can be re-used across each of the datasets.

This is more work towards #2385, but doesn't quite fix it because the EIA 860m data isn't being parallel loaded yet. That's done a bit differently from the other datasets in this PR at the moment, so I think it's best to tackle that in a follow-up PR.
PR Checklist