EQR archiver concurrency/disk limits #44

Closed
6 tasks
jdangerx opened this issue Jan 27, 2023 · 0 comments


jdangerx commented Jan 27, 2023

In #31 and #43 @e-belfer ran into an issue where downloading all the EQR data at once uses a ton of disk space: the full EQR dataset is 15.5 GB.

The problem, apart from this just being slow, is that GitHub Actions runners only have 14 GB of disk space, so we'd have to either manually partition the work across multiple runners or reduce disk usage by only keeping a small subset of the data on disk at any one time.

I think we can try to basically lazily load the files from the downloader:

  • initialize new deposition version
  • for each resource we have, individually download/checksum/upload to deposition, with some concurrency set at the dataset level
  • delete files that we didn't see in the above step from the pending deposition
  • regenerate datapackage
  • update settings etc.
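
The steps above could look roughly like the sketch below. It is only an illustration under assumptions: `archive_lazily`, `stale_files`, and the `download`/`upload` coroutines are hypothetical names, not the archiver's actual API. Each resource is downloaded into a temporary directory, checksummed, uploaded, and then removed from disk, with an `asyncio.Semaphore` acting as the dataset-level concurrency limit.

```python
import asyncio
import hashlib
import tempfile
from pathlib import Path


async def archive_lazily(names, download, upload, concurrency=1):
    """Download, checksum, and upload each resource, then drop it from disk.

    `download(name, dest_dir)` and `upload(name, path, md5)` are hypothetical
    coroutines standing in for the real downloader and deposition client.
    Returns the set of resource names that were processed.
    """
    semaphore = asyncio.Semaphore(concurrency)  # dataset-level limit
    seen = set()

    async def process(name):
        async with semaphore:
            with tempfile.TemporaryDirectory() as tmp:
                path = await download(name, Path(tmp))
                md5 = hashlib.md5(path.read_bytes()).hexdigest()
                await upload(name, path, md5)
                seen.add(name)
            # The temporary directory, and the file in it, is deleted here.

    await asyncio.gather(*(process(name) for name in names))
    return seen


def stale_files(deposition_files, seen):
    """Names in the pending deposition that were not seen in this run."""
    return sorted(set(deposition_files) - seen)
```

With `concurrency=1`, at most one resource is on disk at any moment, so peak disk usage is bounded by the largest single file rather than the whole dataset.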

Also, only 3 users globally can download EQR data at a time.

Scope:

  • we only run one EQR download at a time
  • we only store one EQR dataset on disk at a time

Next steps:

  • allow datasets to force aiohttp client pool size
  • add a method to AbstractDatasetArchiver that spits out a generator of (name, resource) tuples instead of a dict from name to resources
    • hope this is effectively limited by the session concurrency limits above
  • in DepositorOrchestrator, diff files by md5 one-by-one instead of all at once
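
A minimal sketch of the generator and one-by-one diff ideas, assuming hypothetical helpers (`iter_resources`, `md5_of`, `sync_one_by_one`, and the `fetch`/`upload` callables are illustrative and not the real `AbstractDatasetArchiver` or `DepositorOrchestrator` methods). Because the generator is lazy, the next download only starts once the previous file has been compared, uploaded if needed, and deleted.

```python
import hashlib
from pathlib import Path
from typing import Iterator


def iter_resources(names, fetch) -> Iterator[tuple[str, Path]]:
    """Yield (name, path) pairs one at a time instead of building a dict.

    `fetch(name)` is a hypothetical callable that downloads one resource and
    returns its local path. Nothing is fetched until the consumer asks for
    the next item.
    """
    for name in names:
        yield name, fetch(name)


def md5_of(path, chunk_size=1 << 20):
    """Compute an MD5 hex digest without reading the whole file into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_one_by_one(resources, remote_md5s, upload):
    """Diff each local file against the deposition's MD5s as it arrives.

    `remote_md5s` maps file names already in the deposition to their MD5s.
    `upload(name, path)` is a hypothetical callable. Returns name -> action.
    """
    actions = {}
    for name, path in resources:
        if name not in remote_md5s:
            actions[name] = "create"
            upload(name, path)
        elif remote_md5s[name] != md5_of(path):
            actions[name] = "update"
            upload(name, path)
        else:
            actions[name] = "unchanged"
        Path(path).unlink()  # free disk space before the next download
    return actions
```

On the pool-size item, aiohttp's `TCPConnector` accepts a `limit` argument that caps simultaneous connections, which is one possible way to let a dataset force the client pool size.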
@zaneselvans changed the title from "Reduce peak disk usage" to "Reduce EQR Archiver peak disk usage" on Feb 13, 2023
@jdangerx changed the title from "Reduce EQR Archiver peak disk usage" to "EQR archiver concurrency/disk limits" on Feb 24, 2023
@e-belfer e-belfer linked a pull request Aug 1, 2023 that will close this issue
@zschira zschira closed this as completed Aug 18, 2023