Use PUDL_INPUT not hard-coded data dir in datastore CLI #2651
Codecov Report
Patch coverage:
Additional details and impacted files
@@           Coverage Diff           @@
##             dev   #2651    +/-   ##
=======================================
+ Coverage   86.9%   87.1%   +0.2%
=======================================
  Files         84      86      +2
  Lines       9720   10001    +281
=======================================
+ Hits        8447    8716    +269
- Misses      1273    1285     +12
@@ -534,15 +534,15 @@ def _get_pudl_in(args: dict) -> Path:
     if args.pudl_in:
         return Path(args.pudl_in)
     else:
-        return Path(pudl.workspace.setup.get_defaults()["pudl_in"])
+        return Path(pudl.workspace.setup.get_defaults()["PUDL_INPUT"])
The fact that pudl_in and PUDL_INPUT both exist in the settings dict and have different semantics seems dangerous. Additionally, it seems that the value of PUDL_INPUT is assigned to data_dir. Should we then use data_dir here instead of PUDL_INPUT?
It's definitely not ideal! I went with PUDL_INPUT rather than data_dir because I think we're eventually just going to depend on a real settings class and have it take its cues from the environment variables.
There's a ton of vestigial mess in this settings system and I would ❤️❤️❤️ to blow it all away and replace it with a rational Pydantic settings class, but that'll mean chasing down and removing all the pudl_settings dictionaries that are floating around right now, which will be a bit of work (but totally worth it in the medium term...). I'm inclined toward that option rather than trying to rationalize the current mess in place. But we haven't been able to prioritize it. 😭
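To make the "settings class that takes cues from the environment" idea concrete, here is a minimal stdlib-only sketch. The class name PudlPaths, the from_env constructor, and the choice to raise when the variable is missing are all hypothetical and not PUDL's actual API; a real version would presumably be built on Pydantic rather than a plain dataclass.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class PudlPaths:
    """Hypothetical single source of truth for the raw input directory."""

    pudl_input: Path

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "PudlPaths":
        """Build settings from PUDL_INPUT; env defaults to os.environ."""
        env = os.environ if env is None else env
        try:
            return cls(pudl_input=Path(env["PUDL_INPUT"]))
        except KeyError:
            # Fail loudly instead of silently using a hard-coded directory.
            raise RuntimeError("PUDL_INPUT is not set") from None
```

With something like this, callers would ask one object for the input path instead of choosing between several dictionary keys.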
@@ -534,15 +534,15 @@ def _get_pudl_in(args: dict) -> Path:
     if args.pudl_in:
         return Path(args.pudl_in)
If we could drop support for --pudl_in, we could rely purely on get_defaults() and drop this function altogether. I think that could make sense.
IIRC the use-case for setting a different input directory is/was being able to test the downloads of archives from Zenodo independent of the user's working cache directory.
As you've already noticed... there's a ton of room for simplification in the settings system.
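The override-then-fallback behaviour discussed in this thread can be sketched roughly as follows. This assumes the fallback is read directly from the PUDL_INPUT environment variable rather than from get_defaults(); resolve_input_dir and its env parameter are hypothetical names used only for illustration.

```python
import argparse
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_input_dir(
    args: argparse.Namespace, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Prefer an explicit --pudl_in value, else fall back to PUDL_INPUT."""
    env = os.environ if env is None else env
    if getattr(args, "pudl_in", None):
        return Path(args.pudl_in)
    return Path(env["PUDL_INPUT"])


parser = argparse.ArgumentParser()
parser.add_argument("--pudl_in", default=None)

# An explicit flag wins, which lets a test point at a scratch directory.
override = resolve_input_dir(
    parser.parse_args(["--pudl_in", "/tmp/scratch"]),
    env={"PUDL_INPUT": "/data/pudl"},
)
# Without the flag, the environment variable decides.
fallback = resolve_input_dir(parser.parse_args([]), env={"PUDL_INPUT": "/data/pudl"})
```

Keeping the flag preserves the Zenodo download-testing use case, while dropping it would reduce the function to the final return line.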
 def _create_datastore(args: argparse.Namespace) -> Datastore:
     """Constructs datastore instance."""
     # Configure how we want to obtain raw input data:
     ds_kwargs = dict(gcs_cache_path=args.gcs_cache_path, sandbox=args.sandbox)
     if not args.bypass_local_cache:
-        ds_kwargs["local_cache_path"] = _get_pudl_in(args) / "data"
+        ds_kwargs["local_cache_path"] = _get_pudl_in(args)
It would be nice if we could simply replace this with get_defaults()["data_dir"]
Approving as a tactical fix, but I think there's a deeper issue that might be worth cleaning up (e.g. pudl_in, PUDL_INPUT and data_dir all being valid options, causing confusion).
PR Overview
While debugging a FERC 2 DBF extraction build failure, I found a lingering hard-coded PUDL input directory in the datastore CLI and fixed it.