Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix XBRL extraction clobber #3026

Merged
merged 5 commits into from Nov 10, 2023
Merged

Fix XBRL extraction clobber #3026

merged 5 commits into from Nov 10, 2023

Conversation

jdangerx
Copy link
Member

@jdangerx jdangerx commented Nov 7, 2023

On dev, if you set clobber: True in the ferc_to_sqlite job config, it will clobber the existing database when extracting 2021 data, and then clobber it again when extracting 2022 data.

  • clobbering happens on PUDL side now, instead of on ferc-xbrl-extractor side - so it gets called once per op instead of once per form
  • you can still clobber when running ferc-xbrl-extractor on its own

This was probably the cause of the missing 2021 XBRL data that @cmgosnell ran into yesterday.

@jdangerx jdangerx changed the base branch from main to dev November 7, 2023 22:52
src/pudl/io_managers.py Outdated Show resolved Hide resolved
Comment on lines 137 to 138
sql_path=sql_path,
clobber=False, # if we set clobber=True, clobbers on *every* call to run_main
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this makes the clobber option non-functional, at least in this location. The idea with clobber is to functionally rm -f the_database.sqlite before starting a new ETL run. Maybe it should only exist at the highest level in the ferc_to_sqlite script and actually unlink the paths if set -- and then everything in here can just be like "You gave me a DB URL and I'm going to write to it. Not my problem if there was already something incompatible there."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the unlink() on line 92 does the rm -f you're looking for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously when I failed to clobber, the XBRL extractor would write duplicate data - when I ran xbrl_to_sqlite multiple times I still only had one copy of the data afterwards. I guess it could have not written any data at all, the later times. I can run again and see if the database is gone between the unlink() and the database write.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the database gets deleted at the unlink() step if clobber is set to True.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DELETED. Strongbad would be proud. I guess the other half of the logic we've previously associated with --clobber is that if the DB already exists and you attempt to run the ETL, it bails out early, to avoid unintentionally modifying an existing DB. It's not trying to be anything fancy -- just a binary switch "Do you want me to make something new and blow everything that exists away if it's there? Or do you expect that there's a clean slate, and there should be no possible conflicts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the current behavior though. If you don't say clobber and there is an existing DB what happens?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past if you don't say clobber and there's an existing DB, we actually just append to the existing DB. Which is pretty bad. I'll add a quick check to just bail out if there's an existing DB instead.

Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh i don't understand all of the whys for the changes in here but the broad banner of make pudl do a big clobber, always clobber=False our run of the year-based extractor makes sense to me and looks good!

Comment on lines 89 to 100
sql_path = Path(urlparse(PudlPaths().sqlite_db(f"ferc{form.value}_xbrl")).path)

if clobber:
sql_path.unlink(missing_ok=True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay so this is now the guy that removes the db and we always run clobber=True in pudl when we call the extractor's run_main bc each run is per year. cool cool!

src/pudl/extract/xbrl.py Show resolved Hide resolved
Copy link

codecov bot commented Nov 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1bb33dd) 88.7% compared to head (200a3ca) 88.7%.
Report is 3 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #3026   +/-   ##
=====================================
  Coverage   88.7%   88.7%           
=====================================
  Files         90      90           
  Lines      10988   10994    +6     
=====================================
+ Hits        9752    9758    +6     
  Misses      1236    1236           
Files Coverage Δ
src/pudl/extract/xbrl.py 96.0% <100.0%> (+0.4%) ⬆️
src/pudl/workspace/setup.py 91.8% <100.0%> (+0.1%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@jdangerx jdangerx merged commit fa096f2 into dev Nov 10, 2023
11 checks passed
@jdangerx jdangerx deleted the fix-xbrl-clobber branch November 10, 2023 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants