Improve resumable upload: track completion at the batch/session level #5744

xuang7 · 2026-06-17T02:43:56Z

xuang7
Jun 17, 2026
Collaborator

Context

During recent large-dataset upload testing, we ran into an issue that may be worth discussing as a group, since the solution may affect both the upload and dataset commit behavior.

A user uploaded ~1300 files (~1.3 TB total) into a dataset. After the upload "finished", only ~1200 files had actually been uploaded; the other ~100 failed mid-process.

This exposed two usability issues:

- No visibility into what failed: There's no clear report of which files are missing, so the user can't tell what still needs uploading.
- Re-dragging the entire folder wastes huge amounts of time and bandwidth: Resumability today is file-level only. When the user re-drops the same directory:
  - the ~100 failed files resume correctly, but
  - the ~1200 already-uploaded files are treated as new uploads and re-sent in full.

Currently, resumability has no batch/session-level state. We do not track which files in the batch have already completed, so re-dropping the same directory cannot distinguish "already uploaded" files from "new" files.

There are two categories of files that get uploaded again today:

- Committed files: These files already exist in a dataset version, but the upload path does not check for them before uploading. They are re-uploaded into LakeFS staging and deduplicated only later, during createDatasetVersion, which diffs the branch. Because LakeFS is content-addressed, the re-sent bytes match the committed checksum and do not appear in the final diff, so they are not committed again.
- Uncommitted files: These files finished uploading but have not yet been committed. The upload session row is deleted after finishMultipartUpload, so a completed file becomes invisible to later upload attempts.
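
To make the missing batch-level state concrete, here is a minimal sketch of what such a record could look like. It is only an illustration: `BatchSession`, `FileStatus`, and all method names are hypothetical and do not correspond to the existing upload session table.

```python
from dataclasses import dataclass, field
from enum import Enum


class FileStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class BatchSession:
    """Hypothetical batch-level record (an assumption, not the current schema)."""

    # Maps each relative file path in the dropped folder to its status.
    files: dict[str, FileStatus] = field(default_factory=dict)

    def register(self, paths):
        for path in paths:
            # Files seen in an earlier attempt keep their recorded status.
            self.files.setdefault(path, FileStatus.PENDING)

    def mark_completed(self, path):
        # Keep the record instead of deleting it once the file finishes.
        self.files[path] = FileStatus.COMPLETED

    def mark_failed(self, path):
        self.files[path] = FileStatus.FAILED

    def remaining(self):
        """Paths that still need uploading, sorted for a stable report."""
        return sorted(
            path
            for path, status in self.files.items()
            if status is not FileStatus.COMPLETED
        )
```

With a record like this, re-dropping the same directory calls `register` again, and `remaining()` doubles as the "what still needs uploading" report.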

How other systems handle this

In some other platforms, when uploading a folder that contains files already uploaded before, the system detects the existing files and prompts the user with options such as:

- Replace existing files (upload becomes a new version)
- Keep existing files (for example by creating data(1).txt)
- Skip already uploaded files

However, some platforms still seem to re-upload the existing files instead of truly skipping them, which is effectively what we do today.

Proposed Solution

Move existence detection to the beginning of the upload flow instead of relying on commit-time deduplication.

- Before uploading, detect files that already exist in either of two states:
  - committed in the target dataset (LakeFS object listing), or
  - uncommitted on the branch (LakeFS staging/uncommitted listing).
- Use a lightweight existence check, such as path and file size.
- When existing files are detected, ask users to confirm whether they want to skip them before continuing the upload. (Future work could explore support for replace/overwrite.)
- Avoid re-uploading files that are already successfully uploaded when the user chooses to skip them.
- Resume only the missing or failed files/parts after an interrupted upload.
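
The existence check above could be sketched roughly as follows. This is an illustration under assumptions, not the actual implementation: `plan_upload` and `UploadPlan` are hypothetical names, and the `committed` and `uncommitted` dictionaries stand in for the two LakeFS listings, whose retrieval is out of scope here.

```python
from typing import NamedTuple


class UploadPlan(NamedTuple):
    skip_candidates: list  # same path and size as an existing object
    to_upload: list        # new paths, or same path with a different size


def plan_upload(local_files, committed, uncommitted):
    """Split local files into skip candidates and files to upload.

    Each argument maps a relative path to a size in bytes.
    """
    # Uncommitted entries override committed ones for the same path,
    # since they reflect the newest state of the branch.
    existing = {**committed, **uncommitted}
    skip_candidates, to_upload = [], []
    for path, size in sorted(local_files.items()):
        if existing.get(path) == size:
            skip_candidates.append(path)
        else:
            to_upload.append(path)
    return UploadPlan(skip_candidates, to_upload)
```

The function only proposes skip candidates. Because path and size are not a guarantee of identical content, the user still confirms whether to skip them.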

This would be especially useful for large datasets, where re-uploading already completed files can waste a lot of time and bandwidth.

Please feel free to share any suggestions or concerns. Thanks!

xuang7 · 2026-06-17T02:44:28Z

xuang7
Jun 17, 2026
Collaborator Author

CC @chenlica @aicam @carloea2

0 replies

aicam · 2026-06-17T16:58:25Z

aicam
Jun 17, 2026
Collaborator

I agree with the proposed solution of simply allowing the user to upload the remaining files; this would be an immediate fix. However, I believe our file-service lacks proper logging. We should add verbosity levels to the application and to each service so that, as needed, we can enable detailed logging for just the services that require it.

1 reply

xuang7 Jun 24, 2026
Collaborator Author

Agreed! Improving logging would also be important. Currently, it can be a bit hard to trace the root cause of upload failures. Adding more detailed and configurable logging in file-service would help us debug these cases more effectively.
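
As a rough illustration of per-service verbosity, here is a minimal sketch using Python's standard `logging` module. It is only meant to show the idea: the `configure_logging` helper and the logger names are assumptions, and file-service would express the same thing in its own logging framework.

```python
import logging


def configure_logging(levels, default="WARNING"):
    """Apply a default level globally and override it per service logger."""
    # force=True replaces any handlers installed by an earlier call.
    logging.basicConfig(level=default, force=True)
    for name, level in levels.items():
        logging.getLogger(name).setLevel(level)


# Enable detailed logging only for the service being debugged.
configure_logging({"file-service": "DEBUG"})
logging.getLogger("file-service").debug("multipart upload part failed")
```

Loggers without an override inherit the default level, so other services stay quiet while one is being investigated.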

carloea2 · 2026-06-18T01:17:07Z

carloea2
Jun 18, 2026

I agree.

0 replies

Yicong-Huang · 2026-06-18T03:45:46Z

Yicong-Huang
Jun 18, 2026
Collaborator

I like the idea. Besides path and size, could (or should?) we also use a checksum?

1 reply

xuang7 Jun 24, 2026
Collaborator Author

Thanks for the suggestion. I think we should lean toward avoiding checksums as the default existence check on the client side. At the scale we are targeting, potentially TB-scale folders, the browser would need to read every byte of every file to compute the checksum before deciding whether to upload. That could be very slow and resource-heavy, and would partly reintroduce the cost that batch-level resume is trying to avoid. We could consider sampling, but that would still have accuracy limitations.
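
To give a sense of the scale involved, here is a back-of-the-envelope estimate of the time needed just to read and hash the ~1.3 TB from the example above. The throughput figures are illustrative assumptions, not measurements; real browser hashing speed depends on the disk, the hash function, and the machine.

```python
def hashing_time_hours(total_bytes, throughput_bytes_per_s):
    """Time to read and hash every byte once at a sustained throughput."""
    return total_bytes / throughput_bytes_per_s / 3600


total_bytes = 1.3e12  # ~1.3 TB, as in the upload described in this thread

# Assumed sustained throughputs in MB/s (hypothetical values).
for mb_per_s in (100, 500):
    hours = hashing_time_hours(total_bytes, mb_per_s * 1e6)
    print(f"{mb_per_s} MB/s -> {hours:.1f} h")
# prints:
# 100 MB/s -> 3.6 h
# 500 MB/s -> 0.7 h
```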

I think a lightweight path + file size check is a better starting point. It is not a perfect guarantee, so instead of silently deciding for the user, we can surface the uploaded file records and let the user choose whether to skip those files or re-upload/restart them. Basically, we make the user aware of the existing files and let them make the final decision. Checksums could still be revisited later if needed.

chenlica · 2026-06-18T05:00:59Z

chenlica
Jun 18, 2026
Collaborator

Per our offline discussion, I agree that it's important for the server to detect uploaded files and give the user an option to skip them. A solution is for the server to keep track of the uploaded files.

A technical challenge is whether we want the client side to efficiently compute a signature of a file (e.g., a checksum), which the server can then use to check whether a server-side file with the same file name and checksum is likely the same file. It would be good to know if other systems do this type of checking. My guess is that they possibly do not, due to efficiency concerns.
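
One way a cheap client-side signature might look is a sampled fingerprint that hashes only the file size plus the first and last chunk. The sketch below is purely illustrative: `sampled_fingerprint` and its `sample_bytes` parameter are hypothetical, and the approach is not something the thread has agreed on.

```python
import hashlib
import os


def sampled_fingerprint(path, sample_bytes=1024 * 1024):
    """Hash the file size plus the head and tail of the file.

    Reads at most 2 * sample_bytes, so the cost does not grow with file size.
    """
    size = os.path.getsize(path)
    digest = hashlib.sha256(str(size).encode())
    with open(path, "rb") as f:
        digest.update(f.read(sample_bytes))
        if size > sample_bytes:
            # Start the tail after the head so the two samples never overlap.
            f.seek(max(size - sample_bytes, sample_bytes))
            digest.update(f.read(sample_bytes))
    return digest.hexdigest()
```

The accuracy limitation is visible in the code: two files of the same size that differ only in the unsampled middle produce the same fingerprint, so this can flag likely matches but cannot prove that contents are identical.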

0 replies

zuozhiw · 2026-06-18T05:03:41Z

zuozhiw
Jun 18, 2026
Collaborator

This sounds like a very good proposal, and both the technical solution and the user experience are good.

0 replies

xuang7 · 2026-06-24T18:54:29Z

xuang7
Jun 24, 2026
Collaborator Author

Thanks everyone for the input. We will move forward with improving resumable upload by tracking completion at the batch/session level. There is an ongoing PR for this issue, so please feel free to share any additional suggestions there: #5929

0 replies
