fix(dataset): enforce backend max single file size limit#4059
Closed
carloea2 wants to merge 20 commits into apache:main from
Conversation
**Contributor:**

> @carloea2 Please explain the reason for closing this PR and what the plan is next.

**carloea2 (Author):**

> It was closed to simplify the review process and split the work into two steps.
What changes were proposed in this PR?
**Backend (`DatasetResource`)**

- Introduce `singleFileUploadMaxSizeMib` and `maxSingleFileUploadBytes` in `DatasetResource`, reading the `single_file_upload_max_size_mib` site setting with a `DefaultsConfig` fallback.
- Single-part upload (`/{did}/upload`): track `totalBytesRead` while streaming the request body. If `totalBytesRead` exceeds `maxSingleFileUploadBytes`, abort the LakeFS multipart upload and return `413 REQUEST_ENTITY_TOO_LARGE` with a descriptive error message.
- Multipart upload (new server-proxied flow):
  - Add a `SessionState` case class and an in-memory `TrieMap[String, SessionState]` (`uploadSessions`) to track ongoing multipart uploads on the backend (upload token, repo name, `uploadId`, presigned URLs, total bytes, collected part ETags, etc.).
  - `POST /dataset/multipart-upload?type=init`: creates a session in `uploadSessions` keyed by a generated `uploadToken` and returns `{ uploadToken }` to the client.
  - `POST /dataset/multipart-upload/part?token=<uploadToken>&partNumber=<n>`: maintains a `totalBytes` counter (an `AtomicLong`) that accumulates the total uploaded size for this file. If `totalBytes` exceeds `maxSingleFileUploadBytes`, it marks the session as aborted, aborts the LakeFS multipart upload, removes the session from `uploadSessions`, and returns `413 REQUEST_ENTITY_TOO_LARGE` ("File exceeds maximum allowed size … MiB."). Otherwise it records `(partNumber, eTag)` in the session.
  - `POST /dataset/multipart-upload?type=finish`: looks up the session by `uploadToken`, completes the LakeFS multipart upload using the collected parts, then reads `sizeBytes` from the returned object stats. If `sizeBytes > maxSingleFileUploadBytes`, it calls `resetObjectUploadOrDeletion` in LakeFS to roll back the object and returns `413 REQUEST_ENTITY_TOO_LARGE`.
  - `POST /dataset/multipart-upload?type=abort`: looks up the session by `uploadToken`, aborts the underlying LakeFS multipart upload, removes the session from `uploadSessions`, and returns a success message.
- All multipart endpoints enforce that only the user who started the session (the same `uid`) can upload parts to, finish, or abort that session.

Together, this means the server now enforces the single-file size limit for both single-part and multipart uploads. Modifying or removing the size check in `main.js` no longer allows oversized files to be stored in LakeFS/S3.

**Frontend (Angular)**
- Update `DatasetService.multipartUpload(...)` to use the new server-proxied multipart API:
  - `type=init` → obtain an `uploadToken`.
  - Use `XMLHttpRequest` to `POST /dataset/multipart-upload/part?token=...&partNumber=...`, honoring a configurable concurrency limit.
  - `type=finish` with `{ uploadToken }`.
  - On failure, `type=abort` with `{ uploadToken }` to ensure the LakeFS multipart upload is cleaned up.
- Update `finalizeMultipartUpload(...)` and its caller in `DatasetDetailComponent` to match the new backend signature (`uploadToken` plus `type=finish`/`abort`, without passing parts or `physicalAddress` from the client).
- Keep the existing frontend size check for fast UX feedback; it is now only a convenience guard, and the authoritative limit is enforced on the backend.
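The single-part enforcement described in the backend section above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `defaultMaxSizeMib`, `UploadLimits`, and `copyWithLimit` are hypothetical stand-ins for the `DefaultsConfig` fallback and the `DatasetResource` helpers; only the MiB-to-bytes arithmetic and the mid-stream size check are the point.

```scala
import java.io.InputStream

object UploadLimits {
  // Stand-in for the DefaultsConfig fallback value (assumed, not from the PR).
  val defaultMaxSizeMib: Long = 10240L

  // single_file_upload_max_size_mib site setting, falling back to the default.
  def maxSingleFileUploadBytes(siteSettingMib: Option[Long]): Long =
    siteSettingMib.getOrElse(defaultMaxSizeMib) * 1024L * 1024L

  /** Copies `in` via `write`, failing once more than `limitBytes` has been
    * consumed, so an oversized upload is rejected mid-stream instead of
    * after the whole body has been read. */
  def copyWithLimit(in: InputStream, limitBytes: Long)(
      write: (Array[Byte], Int) => Unit): Long = {
    val buf = new Array[Byte](8192)
    var totalBytesRead = 0L
    var n = in.read(buf)
    while (n != -1) {
      totalBytesRead += n
      if (totalBytesRead > limitBytes)
        // the real resource aborts the LakeFS upload and maps this to 413
        throw new IllegalStateException(
          s"File exceeds maximum allowed size ${limitBytes / (1024L * 1024L)} MiB.")
      write(buf, n)
      n = in.read(buf)
    }
    totalBytesRead
  }
}
```

Counting bytes as they are read (rather than trusting a `Content-Length` header) is what makes the check robust against a client that lies about the file size.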
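The in-memory session bookkeeping for the multipart flow could look roughly like this. Names given in the PR description (`SessionState`, `uploadSessions`, `totalBytes`, `uploadToken`) are kept; everything else, including the `Either`-based return type and the error strings, is an assumption, and the actual LakeFS abort call is elided.

```scala
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong
import scala.collection.concurrent.TrieMap

final case class SessionState(
    uid: Int,                       // owner: only this user may touch the session
    repoName: String,
    uploadId: String,               // LakeFS multipart upload id
    totalBytes: AtomicLong = new AtomicLong(0L),
    parts: TrieMap[Int, String] = TrieMap.empty[Int, String], // partNumber -> ETag
    var aborted: Boolean = false
)

object UploadSessions {
  val uploadSessions = new TrieMap[String, SessionState]()

  /** type=init: register a session and hand the generated token to the client. */
  def init(uid: Int, repoName: String, uploadId: String): String = {
    val uploadToken = UUID.randomUUID().toString
    uploadSessions.put(uploadToken, SessionState(uid, repoName, uploadId))
    uploadToken
  }

  /** Part upload: accumulate the size, enforce ownership and the byte limit.
    * Left(message) corresponds to a 4xx response (403/413) in the resource. */
  def addPart(token: String, uid: Int, partNumber: Int, eTag: String,
              partSize: Long, limitBytes: Long): Either[String, Unit] =
    uploadSessions.get(token) match {
      case None                    => Left("Unknown upload session.")
      case Some(s) if s.uid != uid => Left("Session belongs to another user.")
      case Some(s) =>
        if (s.totalBytes.addAndGet(partSize) > limitBytes) {
          s.aborted = true
          uploadSessions.remove(token) // real code also aborts the LakeFS upload
          Left(s"File exceeds maximum allowed size ${limitBytes / (1024L * 1024L)} MiB.")
        } else {
          s.parts.put(partNumber, eTag)
          Right(())
        }
    }
}
```

`TrieMap` and `AtomicLong` keep the bookkeeping safe under concurrent part uploads without a global lock; the ownership check on `uid` is what prevents one user from finishing or aborting another user's session.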
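The finish-time double check described above (verifying the completed object's size even when every part passed) reduces to a small guard. `ObjectStats` and the `rollback` callback here are hypothetical stand-ins for the LakeFS object stats and the `resetObjectUploadOrDeletion` call.

```scala
final case class ObjectStats(sizeBytes: Long)

object FinishUpload {
  /** Left(message) corresponds to a 413 response after rolling the object back. */
  def verifyCompleted(stats: ObjectStats, limitBytes: Long)(
      rollback: () => Unit): Either[String, ObjectStats] =
    if (stats.sizeBytes > limitBytes) {
      rollback() // stand-in for resetObjectUploadOrDeletion in LakeFS
      Left(s"File exceeds maximum allowed size ${limitBytes / (1024L * 1024L)} MiB.")
    } else Right(stats)
}
```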
Any related issues, documentation, discussions?
How was this PR tested?
- Set `single_file_upload_max_size_mib` to a known value (e.g., 20 MiB or 10 GiB).
- Multipart upload (new flow):
  - With the unmodified frontend, upload a file below the limit → all parts succeed, `type=finish` completes, and the object is committed in LakeFS.
  - Modify `main.js` in the browser to relax or remove the frontend size check, then attempt to upload a file larger than the configured limit: a `/multipart-upload/part` call returns `413` and the server aborts the LakeFS multipart upload. (The frontend may not observe the `413`, since the request can be aborted while the body is still streaming.)

Was this PR authored or co-authored using generative AI tooling?
Co-authored with ChatGPT.