Fix ferc2 full etl integration test issues #2652
When run over the full ETL, the tests get confused because multiple archives may be present. This test checks that the filtering works as expected.
Ahh, okay it's not a problem with the actual extraction in the ETL which I ran locally and worked fine -- just this test of the datastore. I think the only downside to testing all of the years in all of the tests is that all of the data will end up getting downloaded from Zenodo, and at some point we'll hit the data cache / disk space limits on GitHub runners. Speaking of download size, does it seem at all odd that the input data for the |
Glad we solved the mystery of the missing early years too. If it would be easy to raise an exception in the case of invalid paths that seems like it would be a good idea. If it'll need more debugging / exploration then let's make an issue and track it down later.
Does it seem odd to you that the ratio between the input and output file sizes is so different for FERC 1 and FERC 2? Both have ~2GB inputs, but the FERC 1 outputs are ~3x larger (760MB vs. 250MB).
1996,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
1997,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
1998,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
1999,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
2000,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
2001,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
2002,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
2003,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC | ||
1996,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC | ||
1997,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC | ||
1998,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC | ||
1999,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC | ||
2000,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC | ||
2001,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC | ||
2002,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC | ||
2003,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC |
Would it be easy to raise an exception in the case of invalid / non-existent DBC paths rather than silently failing?
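For illustration, here is a minimal sketch of what failing loudly could look like. It assumes the archive is a plain zip file; `MissingDbcError`, `require_dbc_path` and `make_archive` are hypothetical names for this example and not part of the existing extraction code.

```python
import io
import zipfile


class MissingDbcError(FileNotFoundError):
    """Hypothetical error type for a DBC path that is absent from an archive."""


def require_dbc_path(archive: zipfile.ZipFile, year: int, dbc_path: str) -> str:
    """Return dbc_path if the archive contains it, otherwise raise."""
    if dbc_path not in archive.namelist():
        raise MissingDbcError(f"{year}: {dbc_path!r} not found in archive")
    return dbc_path


def make_archive(names: list[str]) -> zipfile.ZipFile:
    """Build a small in-memory zip with empty members, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    buffer.seek(0)
    return zipfile.ZipFile(buffer)
```

With a check like this, a year whose archive only contains the UPLOADERS path would raise as soon as the FORMSADMIN path is requested, instead of quietly producing no data.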
The output size differences are for a separate investigation I'm planning in the coming days.
Codecov Report
Patch and project coverage have no change.
Additional details and impacted files
@@ Coverage Diff @@
## dev #2652 +/- ##
=====================================
Coverage 87.1% 87.1%
=====================================
Files 86 86
Lines 10001 10001
=====================================
Hits 8716 8716
Misses 1285 1285
An attempted fix for the ferc2 integration test issues.
The non-uniqueness of the early (1996) archives is a known issue that was mentioned in a note in the code that crashed, but was never addressed. The problem was not caught earlier because the set of years validated depends on the actual ETL configuration, and by default (during PR tests and locally) these tests use etl_fast.yml, which only runs on 2020, a year that doesn't have the problem.
I have refactored the test to also check that, when is_valid_partition is used for filtering, exactly one partition remains. This should be a good invariant for ferc2 at the moment.
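The invariant can be sketched roughly as follows. This is a simplified stand-in and not the actual test: the real signature of is_valid_partition is not shown here, so the version below simply assumes that partitions and filters are dictionaries, and `select_single_partition` and the `part` key are hypothetical.

```python
def is_valid_partition(filters: dict, partition: dict) -> bool:
    """Stand-in: a partition matches when it agrees with every filter key."""
    return all(partition.get(key) == value for key, value in filters.items())


def select_single_partition(partitions: list[dict], **filters) -> dict:
    """Return the one partition matching the filters, or fail loudly."""
    matches = [p for p in partitions if is_valid_partition(filters, p)]
    if len(matches) != 1:
        raise AssertionError(
            f"expected exactly one partition for {filters}, found {len(matches)}"
        )
    return matches[0]
```

Under these assumptions, a year that has two archives fails the check unless the filter also names which archive is wanted, which is the ambiguity the full ETL run exposed.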
Due to somewhat hungry dependencies in the integration tests, I'm still running the full etl test locally to validate whether this addresses the issue.
It might be a good idea to always test all supported years, assuming that we can avoid actually processing the full set of data here. This should generally be possible, but pytest fixtures may make it a little tricky to select just the relevant dependencies.
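One cheap way to cover every year without touching the data itself might be to validate only the year-to-path mapping. The sketch below assumes the mapping is available as two-column CSV text (year, DBC path), like the rows in the diff above; `load_dbc_map` and `years_missing_paths` are hypothetical helpers, not existing functions.

```python
import csv
import io
from collections import defaultdict


def load_dbc_map(text: str) -> dict[int, list[str]]:
    """Parse 'year,path' CSV rows into a mapping of year to candidate DBC paths."""
    paths = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        year, path = row
        paths[int(year)].append(path)
    return dict(paths)


def years_missing_paths(dbc_map: dict[int, list[str]], years) -> list[int]:
    """List the requested years that have no DBC path in the mapping."""
    return [year for year in years if not dbc_map.get(year)]
```

A test built on this would only read the small mapping file, so it could loop over every supported year without triggering any Zenodo downloads.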