Parallelize extraction of Excel spreadsheets #2943
Conversation
Current validation failures when comparing to the 10-12 full dev SQL database:

- Losing rows in
`excel.py` seems very clever and concise, but I'm having a hard time understanding how it works with the multiple nested layers of factories. Can we expand the docstrings and/or add some comments and argument descriptions to help clarify how they all fit together?
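For readers following along, the nested-factory structure being discussed might look roughly like the sketch below. This is a hypothetical illustration only; the actual names, signatures, and return types in `excel.py` differ, and `extract_year` / `extractor_factory` here are made-up stand-ins.

```python
from typing import Callable

def year_extractor_factory(dataset: str) -> Callable[[int], dict]:
    """Inner factory: return a function that extracts one year of one dataset."""
    def extract_year(year: int) -> dict:
        # The real code would read that year's Excel workbook into a
        # page -> DataFrame mapping; a placeholder string stands in here.
        return {"fuel_receipts": f"{dataset}-{year}-rows"}
    return extract_year

def extractor_factory(datasets: list[str]) -> dict[str, Callable[[int], dict]]:
    """Outer factory: build a per-year extractor for each dataset."""
    return {ds: year_extractor_factory(ds) for ds in datasets}

extractors = extractor_factory(["eia923", "eia861"])
print(extractors["eia923"](2020))  # {'fuel_receipts': 'eia923-2020-rows'}
```

The layering exists so that dataset-specific configuration is bound once in the outer factory, leaving a simple per-year callable that a parallel executor can map over.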
Codecov Report: all modified lines are covered by tests ✅

```
@@ Coverage Diff @@
##             dev   #2943   +/- ##
===================================
  Coverage   88.5%   88.5%
===================================
  Files         91      91
  Lines      10805   10808    +3
===================================
+ Hits        9569    9572    +3
  Misses      1236    1236
```
Responding to some comments; added some more docstrings in here.

- Expanded on some of the docstrings and comments to try to clarify the highly abstracted nested factories that construct the Excel extraction graph asset.
- Changed some of the function/object names to use `extract_` rather than `load_` to match the stage names we're using elsewhere.
- It turned out that excluding the empty dataframes created errors, since those dataframes are constructed so that they always contain all mapped columns, even if the particular year being extracted doesn't have every column. If we don't end up with all of the expected columns (even if they're null), that causes problems downstream.
```python
all_data = defaultdict(list)
for dfs in paged_dfs:
    for page in dfs:
        all_data[page].append(dfs[page])
```
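The accumulation pattern in this hunk can be demonstrated standalone. In the sketch below, lists of row-dicts stand in for the per-year pandas DataFrames, and the page names are invented for illustration:

```python
from collections import defaultdict

# Stand-in for the per-year page -> DataFrame dicts produced by the
# parallel year extractors (lists of row-dicts instead of DataFrames).
paged_dfs = [
    {"generation": [{"year": 2020, "mwh": 1}], "fuel": [{"year": 2020, "btu": 2}]},
    {"generation": [{"year": 2021, "mwh": 3}]},
]

# Regroup by page: each page maps to the list of per-year chunks,
# ready to be concatenated into one table per page downstream.
all_data = defaultdict(list)
for dfs in paged_dfs:
    for page in dfs:
        all_data[page].append(dfs[page])

print(len(all_data["generation"]))  # 2
print(len(all_data["fuel"]))  # 1
```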
I switched this back to not checking for the empty dataframes because it turns out we need them to ensure all of the mapped columns are present downstream. Otherwise the fast tests and other subsets of the data fail because they don't contain columns that are only reported in the earlier years.
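The column-preservation issue described here can be illustrated with a small pandas sketch. The column names below are hypothetical, and this assumes (as the comment describes) that each per-year frame is built with every mapped column even when that year reports no data for some of them:

```python
import pandas as pd

# Hypothetical mapped columns; suppose "old_col" is only reported in
# early years, which a fast-test subset of the data never extracts.
mapped_cols = ["plant_id", "old_col"]

# Building the recent-year frame against the full mapped column set
# keeps "old_col" present (as nulls) even though the year lacks it.
recent_year = pd.DataFrame({"plant_id": [1, 2]}).reindex(columns=mapped_cols)

print(list(recent_year.columns))  # ['plant_id', 'old_col']
print(bool(recent_year["old_col"].isna().all()))  # True
```

Dropping such empty or all-null frames would leave downstream code that expects `old_col` with a missing-column failure, which is why they are kept.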
```python
    return all_data


def year_extractor_factory(
```
On this and a couple of other names, I switched from `load` to `extract` so they match the sense in which we're using those words to describe the different ETL stages.
PR Overview

Taking over #2673. This builds on top of #2644 to add parallel data loading to the EIA 861 and EIA 923 datasets. All of the logic introduced in #2644 has been factored out into `excel.py` so it can be re-used across each of the datasets.

This is more work towards #2385, but doesn't quite fix it because the EIA 860m data isn't being parallel loaded yet. That's done a bit differently from the other datasets in this PR at the moment, so I think it's best to tackle that in a follow-up PR.
PR Checklist