This repository has been archived by the owner on Jan 12, 2024. It is now read-only.

Allow local file caching to be disabled when appropriate #6

Closed
Tracked by #1564
zaneselvans opened this issue Apr 7, 2022 · 1 comment
Labels
intake (Intake data catalogs), performance (Make data go faster by using less memory, disk, network, compute, etc.)

Comments

@zaneselvans
Member

Local file caching via `simplecache::` is hugely valuable when you have a lot of cheap disk and a slower network connection (e.g. working from home), but it's not necessarily appropriate in a cloud computing context (e.g. our JupyterHub or CI/CD), where the network is extremely fast, there are no data egress fees, and fast disk is more likely to be constrained.

If we are going to use our Intake data catalog as a primary means of accessing versioned, processed data, the user should be able to turn off caching when appropriate. Is this as easy as not setting PUDL_INTAKE_CACHE so there's no designated location for the cache? Or can it / should it be set explicitly in the arguments to the data source?
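
One way the first option could look, as a minimal sketch rather than the catalog's actual implementation: treat an unset or empty `PUDL_INTAKE_CACHE` as "no caching" and only prepend `simplecache::` when a cache directory is given. The helper name `build_urlpath` and the bucket path are hypothetical; the `cache_storage` option is the one fsspec's `simplecache` filesystem accepts.

```python
import os


def build_urlpath(base_url, cache_dir=None):
    """Return (urlpath, storage_options) for an fsspec-style data source.

    Hypothetical helper: caching is enabled only when cache_dir is a
    non-empty string; otherwise the remote URL is used directly.
    """
    if cache_dir:
        # Chain the simplecache protocol in front of the remote URL and
        # tell it where to keep local copies.
        return (
            f"simplecache::{base_url}",
            {"simplecache": {"cache_storage": cache_dir}},
        )
    # No cache location designated: read straight from the remote store.
    return (base_url, {})


# Hypothetical bucket path, used for illustration only.
base = "gs://example-bucket/data.parquet"
urlpath, storage_options = build_urlpath(base, os.environ.get("PUDL_INTAKE_CACHE"))
```

The resulting `urlpath` and `storage_options` could then be handed to the data source, so the same catalog entry works with or without a local cache depending on the environment.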

zaneselvans added the intake and performance labels, and added and then removed the Epic label, Apr 7, 2022
zaneselvans changed the title from "Ensure users can disable local file caching when appropriate" to "Allow local file caching to be disabled when appropriate" Apr 11, 2022
zaneselvans added a commit that referenced this issue Apr 19, 2022
With some pointers from @martindurant in
[this issue](intake/intake-parquet#26) I got
anonymous public access working, and caching can now be turned off when
appropriate.

Accessing the partitioned data is still very slow in a variety of
contexts for reasons I don't understand. I also hit a snag attempting to
create a consolidated external `_metadata` file to hopefully speed up
access to the partitioned data so... not sure what to do there.

The current Tox/pytest setup expects to find data locally, which won't
work right now on GitHub. The tests need to be set up for real-world
use rather than for exploring different catalog configurations.

Closes #5, #6
@zaneselvans
Member Author

Fixed in 7fb38ff

@zaneselvans zaneselvans self-assigned this Apr 21, 2022