Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ferc2 full etl integration test issues #2652

Merged
merged 4 commits into from Jun 10, 2023
Merged

Fix ferc2 full etl integration test issues #2652

merged 4 commits into from Jun 10, 2023

Conversation

rousik
Copy link
Collaborator

@rousik rousik commented Jun 9, 2023

An attempted fix for the ferc2 integration tests issues.

The non-uniqueness of the early (1996) archives is a known issue that was mentioned as a note in the code that crashed, but was not addressed. The problem was not caught earlier because the years validated depends on the actual etl configuration and by default (during PR tests and locally), these tests use etl_fast.yml which only runs this on 2020 which doesn't have the problem.

I have refactored the test to also check that:

  1. for every year in the config, there's at least one partition present, and
  2. when is_valid_partition is used for filtering, exactly one partition remains

This should be a good invariant for ferc2 at the moment.

Due to somewhat hungry dependencies in the integration tests, I'm still running the full etl test locally to validate whether this addresses the issue.

It might be good idea to test all supported years always, assuming that we can avoid actually processing the full set of data here. This should be generally possible, but pytest fixtures may make this a little tricky to select just the relevant deps.

When run over full ETL, tests get confused because multiple archives
could be present. This test checks that the filtering works as expected.
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@zaneselvans zaneselvans changed the base branch from main to dev June 9, 2023 22:31
@zaneselvans
Copy link
Member

Ahh, okay it's not a problem with the actual extraction in the ETL which I ran locally and worked fine -- just this test of the datastore.

I think the only downside to testing all of the years in all of the tests is that all of the data will end up getting downloaded from Zenodo, and at some point we'll hit the data cache / disk space limits on GitHub runners.

Speaking of download size, does it seem at all odd that the input data for the ferc2 is almost 2GB, but the SQLite DB that results is only 250MB? While the input data for ferc1 is about the same size but results in a 760MB DB?

@zaneselvans zaneselvans linked an issue Jun 9, 2023 that may be closed by this pull request
@zaneselvans zaneselvans added testing Writing tests, creating test data, automating testing, etc. ferc2 Issues related to the FERC Form 2 dataset labels Jun 9, 2023
@zaneselvans zaneselvans added this to the 2023 Spring milestone Jun 9, 2023
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad we solved the mystery of the missing early years too. If it would be easy to raise an exception in the case of invalid paths that seems like it would be a good idea. If it'll need more debugging / exploration then let's make an issue and track it down later.

Does it seem odd to you that the ratio between the input and output file sizes for FERC 1 and FERC 2 are so different? They're both ~2GB inputs, but the FERC 1 outputs are ~3x larger (760MB vs. 250MB).

Comment on lines -2 to +9
1996,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
1997,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
1998,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
1999,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
2000,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
2001,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
2002,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
2003,UPLOADERS/FORM2/tmpwork/F2_PUB.DBC
1996,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
1997,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
1998,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
1999,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
2000,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
2001,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
2002,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
2003,FORMSADMIN/FORM2/tmpwork/F2_PUB.DBC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be easy to raise an exception in the case of invalid / non-existend DBC paths rather than silently failing?

@rousik
Copy link
Collaborator Author

rousik commented Jun 9, 2023

That's (output size differences) for a separate investigation I'm planning in the coming days.

@codecov
Copy link

codecov bot commented Jun 10, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (7f08ddf) 87.1% compared to head (4246451) 87.1%.

❗ Current head 4246451 differs from pull request most recent head 5183494. Consider uploading reports for the commit 5183494 to get more accurate results

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2652   +/-   ##
=====================================
  Coverage   87.1%   87.1%           
=====================================
  Files         86      86           
  Lines      10001   10001           
=====================================
  Hits        8716    8716           
  Misses      1285    1285           

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@zaneselvans zaneselvans merged commit fa40d03 into dev Jun 10, 2023
6 checks passed
@zaneselvans zaneselvans deleted the ferc2-fix-tests branch June 10, 2023 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ferc2 Issues related to the FERC Form 2 dataset testing Writing tests, creating test data, automating testing, etc.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

FERC From 2 DBF seems to have gap in the early years
2 participants