-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read EIA860 data in parallel #2644
Read EIA860 data in parallel #2644
Conversation
Hey, thanks for taking this on! I think it'll only really be exercised fully in a multi-year run. Would you mind trying to generate the raw EIA-860 dataframes in the full ETL and see how it goes? |
Also we're super curious what you're up to with the PUDL data! |
👍 will do
(perhaps disappointingly for you folks 😄) I'm not doing anything with the data. I've been doing Python OSS development for a while now, partly day job and partly for fun. Also worried about the climate, and looking for some fresh projects to contribute too, and this seemed like a great option! |
That looks right. There's monthly data from the most recent year that's integrated into the EIA-860 extraction on top of the most recent year of annual data in the fast ETL. It seems like there's an unexpected change to the contents of the data as a result of the parallel loading though, which is resulting in the generator |
Not at all! @rousik has also been helping us out on the side with his software and DevOps experience. If you'd ever like to hop on a call and get some more context on how the data feeds into the energy policy machine in the US or want help finding an appropriate issue to tackle given your skills, please let us know. |
8f7cfce
to
9441450
Compare
The full ETL seems to work fine through the new EIA-860 loading (at least up to the epacems steps where I get a bunch of timeouts from zenodo). I think I've fixed the problem with the data contents too, will wait and see what the integration tests say to be sure. |
9441450
to
bd6ef93
Compare
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## dev #2644 +/- ##
=====================================
Coverage 87.1% 87.1%
=====================================
Files 86 86
Lines 10004 10027 +23
=====================================
+ Hits 8719 8740 +21
- Misses 1285 1287 +2
☔ View full report in Codecov by Sentry. |
What did you have to change to fix whatever was going wrong before? It all seems to be in one big commit. Was it just needing to add a Tagging @jdangerx for a review since this super slow Excel reading has been a pet peeve of his. If this works well should we try and use it for the |
By habit I squash stuff into a single commit until a proper review on pull requests - it was previously broken because in I was accidentally overriting previous years when merging the dataframes into a single dictionary. Now fixed by the new
Sounds good - happy to do that in this PR or a subsequent PR if this workflow is good, let me know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks good, thanks so much for the patch! But it'd be nice to catch that error you found in some sort of data validation test somewhere - hence "request changes."
@zaneselvans , you have a lot more context on our data validations than I do - do you have any thoughts on what to do here? If you feel comfortable pushing this through without data validation tests, too, I defer to your judgement and will happily approve the PR 😅
future/out-of-scope: if this pattern works well in our nightly build, it'd be nice to apply this everywhere we use that excel.GenericExtractor
- then we could probably even delete a bunch of that GenericExtractor
code...
|
||
|
||
@graph_asset | ||
def eia860_raw_dfs() -> dict[str, pd.DataFrame]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern is pretty clean and I bet it would apply well to our other "read a lot of excel files in series" operations, as @zaneselvans pointed out! I don't think that needs to be part of this PR, though - happy to merge this chunk first and see how it performs before moving more into this pattern 😄
@jdangerx If you mean the accidental clobbering of other years bug, it broke a bunch of things downstream with more than one year of data, so it's probably okay as is? It wouldn't be a silent error. |
Sounds good! @dstansby any last thoughts before merge? |
Nope, all good from my end - I'll look into rolling this out to other data sources in a follow up |
PR Overview
This fixes (partially) #2385, by parallelising the reading in of EIA860 Excel spreadsheets. As suggested in that issue I've largely followed previous work in #2472 for the implementation.
I've tested this locally with the ETL_FAST job and it seems to work fine.
PR Checklist
dev
).