Skip to content

Conversation

@georgeboot
Copy link
Contributor

Currently, Azure is implemented without streams. This means that when uploading a file, the full file is first read into memory, before it is actually uploaded. The same goes for downloads.

This PR changes that by using streams.

@pjbull
Copy link
Member

pjbull commented May 25, 2023

@georgeboot Thanks, I'm interested in including something like this. A few questions/needs:

  • (1) Can we make similar changes for other backends? We generally like to implement things at parity so we don't have to track what does/doesn't work in different backends or have divergent behaviors.
  • (2) What happens on network failures? Is this better/worse than the current behavior? E.g., can we end up with partial files in the cache that then confuse future checks because they exist and the modified time is right, but the content is wrong?
  • (3) Passing tests

@georgeboot
Copy link
Contributor Author

Sure, will have a look!

S3 uses streams internally already. GCP I'll have to check.

When a network error happens, there indeed is a partial file on disk. If you want, I can clean that up? What should happen if we are overwriting an existing file and we fail half way through?

Will fix tests as well. Wanted to see if this had a chance to get merged first.

@pjbull
Copy link
Member

pjbull commented May 25, 2023

What should happen if we are overwriting an existing file and we fail half way through?

Maybe this should append .partial to the end of the local file until download completes successfully and then rename? This is maybe annoying in that you need potentially 2x the file size for the cache, but it seems like a good balance of simple to implement and reliable.

@pjbull
Copy link
Member

pjbull commented Jul 22, 2023

Hi @georgeboot, this change still seems worth having. Would you have a chance to wrap what this PR needs? I think that should be:

  • Rebase on to lastest so unrelated tests don't fail
  • Management of partially downloaded files as noted above
  • Confirm GCS also streams
  • Tests as needed (particularly, the partially downloaded files logic)

@pjbull pjbull changed the base branch from master to 334-local February 17, 2024 19:59
@pjbull
Copy link
Member

pjbull commented Feb 17, 2024

Merging in to local branch to finish implementation.

@pjbull pjbull merged commit c878782 into drivendataorg:334-local Feb 17, 2024
@pjbull pjbull mentioned this pull request Feb 17, 2024
pjbull added a commit that referenced this pull request Feb 17, 2024
* Stream to and from Azure (#334)

* Stream uploads for azure

* Download using a stream

* Tests and partials

* changelog

* use pathlib open

* use replace instead of rename

* another try at flaky tests

---------

Co-authored-by: George Boot <884482+georgeboot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants