Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use PUDL_INPUT not hard-coded data dir in datastore CLI #2651

Merged
merged 1 commit into from
Jun 9, 2023

Conversation

zaneselvans
Copy link
Member

PR Overview

While debugging FERC 2 DBF extraction build failure I found a lingering hard-coded PUDL input directory in the datastore CLI and fixed it.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans zaneselvans added cli Scripts and other command line interfaces to PUDL. ferc2 Issues related to the FERC Form 2 dataset labels Jun 9, 2023
@zaneselvans zaneselvans requested a review from rousik June 9, 2023 15:50
@zaneselvans zaneselvans self-assigned this Jun 9, 2023
@codecov
Copy link

codecov bot commented Jun 9, 2023

Codecov Report

Patch coverage: 93.8% and project coverage change: +0.2 🎉

Comparison is base (fe395c9) 86.9% compared to head (7f08ddf) 87.1%.

Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2651     +/-   ##
=======================================
+ Coverage   86.9%   87.1%   +0.2%     
=======================================
  Files         84      86      +2     
  Lines       9720   10001    +281     
=======================================
+ Hits        8447    8716    +269     
- Misses      1273    1285     +12     
Impacted Files Coverage Δ
src/pudl/metadata/fields.py 100.0% <ø> (ø)
src/pudl/metadata/resources/ferc1.py 100.0% <ø> (ø)
src/pudl/metadata/sources.py 100.0% <ø> (ø)
src/pudl/transform/params/ferc1.py 100.0% <ø> (ø)
src/pudl/extract/xbrl.py 94.2% <33.3%> (-2.8%) ⬇️
src/pudl/workspace/datastore.py 73.8% <55.5%> (+0.6%) ⬆️
src/pudl/ferc_to_sqlite/cli.py 70.7% <71.4%> (-0.7%) ⬇️
src/pudl/extract/dbf.py 88.8% <87.8%> (-0.2%) ⬇️
src/pudl/extract/ferc.py 91.6% <91.6%> (ø)
src/pudl/transform/ferc1.py 96.6% <98.6%> (+0.4%) ⬆️
... and 4 more

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@@ -534,15 +534,15 @@ def _get_pudl_in(args: dict) -> Path:
if args.pudl_in:
return Path(args.pudl_in)
else:
return Path(pudl.workspace.setup.get_defaults()["pudl_in"])
return Path(pudl.workspace.setup.get_defaults()["PUDL_INPUT"])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that pudl_in and PUDL_INPUT both exist in the settings dict and have different semantics seems like a dangerous thing. Additionally, it seems that PUDL_INPUT is set into data_dir. Should we then use data_dir here instead of PUDL_INPUIT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely not ideal! I went with PUDL_INPUT rather than data_dir because I think we're eventually just going to depend on a real settings class & have it take cues from the environment variables.

There's a ton of vestigial mess in this settings system and I would ❤️❤️❤️ to blow it all away and replace it with a rational Pydantic settings class, but that'll mean chasing down and removing all the pudl_settings dictionaries that are floating around right now which will be a bit of work (but totally worth it in the medium term...). I'm inclined toward that option rather than trying to rationalize the current mess in-place. But we haven't been able to prioritize it. 😭

@@ -534,15 +534,15 @@ def _get_pudl_in(args: dict) -> Path:
if args.pudl_in:
return Path(args.pudl_in)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we could drop support for --pudl_in, we could purely rely on get_defaults() and drop this function altogether. I think that could make sense.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC the use-case for setting a different input directory is/was being able to test the downloads of archives from Zenodo independent of the user's working cache directory.

As you've already noticed... there's a ton of room for simplification in the settings system.



def _create_datastore(args: argparse.Namespace) -> Datastore:
"""Constructs datastore instance."""
# Configure how we want to obtain raw input data:
ds_kwargs = dict(gcs_cache_path=args.gcs_cache_path, sandbox=args.sandbox)
if not args.bypass_local_cache:
ds_kwargs["local_cache_path"] = _get_pudl_in(args) / "data"
ds_kwargs["local_cache_path"] = _get_pudl_in(args)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if we could simply replace this with get_defaults()["data_dir"]

Copy link
Collaborator

@rousik rousik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for tactical fix, but I think there's a deeper issue that might be worth cleaning up (e.g. pudl_in, PUDL_INPUT and data_dir all being valid options, causing confusion)

@zaneselvans zaneselvans merged commit 1c734f6 into dev Jun 9, 2023
10 checks passed
@zaneselvans zaneselvans deleted the datastore-input-dir branch June 9, 2023 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cli Scripts and other command line interfaces to PUDL. ferc2 Issues related to the FERC Form 2 dataset
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

2 participants