
fix(dataset): enforce backend max single file size limit#4059

Closed
carloea2 wants to merge 20 commits into apache:main from carloea2:single_max_size_bk_enf

Conversation


@carloea2 carloea2 commented Nov 16, 2025

What changes were proposed in this PR?

Backend (DatasetResource)

  • Introduce singleFileUploadMaxSizeMib and maxSingleFileUploadBytes in DatasetResource, reading the single_file_upload_max_size_mib site setting with a DefaultsConfig fallback.

  • Single-part upload (/{did}/upload)

    • While streaming the request body, track totalBytesRead.
    • If totalBytesRead exceeds maxSingleFileUploadBytes, abort the LakeFS multipart upload, and return 413 REQUEST_ENTITY_TOO_LARGE with a descriptive error message.
  • Multipart upload (new server-proxied flow)

    • Add a SessionState case class and an in-memory TrieMap[String, SessionState] (uploadSessions) to track ongoing multipart uploads on the backend (upload token, repo name, uploadId, presigned URLs, total bytes, collected part ETags, etc.).

    • POST /dataset/multipart-upload?type=init

      • Validates dataset access and initializes a LakeFS multipart upload.
      • Stores the session in uploadSessions keyed by a generated uploadToken.
      • Returns { uploadToken } to the client.
    • POST /dataset/multipart-upload/part?token=<uploadToken>&partNumber=<n>

      • Reads the incoming part stream into an 8 KiB buffer and forwards it to the corresponding LakeFS presigned URL.
      • Uses the session’s totalBytes counter (AtomicLong) to accumulate the total uploaded size for this file.
      • As soon as totalBytes exceeds maxSingleFileUploadBytes, marks the session as aborted, aborts the LakeFS multipart upload, removes the session from uploadSessions, and returns 413 REQUEST_ENTITY_TOO_LARGE (“File exceeds maximum allowed size … MiB.”).
      • On success, stores (partNumber, eTag) in the session.
    • POST /dataset/multipart-upload?type=finish

      • Looks up the session by uploadToken, completes the LakeFS multipart upload using the collected parts, and then reads sizeBytes from the returned object stats.
      • If sizeBytes > maxSingleFileUploadBytes, calls resetObjectUploadOrDeletion in LakeFS to roll back the object and returns 413 REQUEST_ENTITY_TOO_LARGE.
      • Otherwise, returns a success payload with the final path.
    • POST /dataset/multipart-upload?type=abort

      • Looks up the session by uploadToken, aborts the underlying LakeFS multipart upload, removes the session from uploadSessions, and returns a success message.
    • All multipart endpoints enforce that only the user who started the session (same uid) can upload parts / finish / abort that session.

Together, this means the server now enforces the single-file size limit for both single-part and multipart uploads. Modifying or removing the size check in main.js no longer allows oversized files to be stored in LakeFS/S3.
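The server-side bookkeeping described above can be sketched as follows. This is an illustrative TypeScript model of the logic only, not the actual Scala code in DatasetResource; the names (`SessionState`, `registerPart`, `MAX_BYTES`) and the returned status codes are assumptions drawn from the description.

```typescript
// Illustrative model of the per-session size accounting. In the PR this lives
// in DatasetResource (Scala), with uploadSessions as a TrieMap and totalBytes
// as an AtomicLong; here a plain Map and number stand in for them.

type SessionState = {
  uid: string;                 // owner; only this user may touch the session
  totalBytes: number;          // stand-in for the AtomicLong counter
  parts: Map<number, string>;  // partNumber -> ETag
};

const MAX_BYTES = 20 * 1024 * 1024; // e.g. single_file_upload_max_size_mib = 20

const uploadSessions = new Map<string, SessionState>();

function initSession(token: string, uid: string): void {
  uploadSessions.set(token, { uid, totalBytes: 0, parts: new Map() });
}

// Returns an HTTP-style status: 200 on success, 403 for a foreign user,
// 404 for an unknown token, 413 once the cumulative size exceeds the limit
// (the session is then dropped, mirroring the LakeFS abort described above).
function registerPart(
  token: string, uid: string, partNumber: number,
  partSize: number, eTag: string,
): number {
  const s = uploadSessions.get(token);
  if (!s) return 404;
  if (s.uid !== uid) return 403;
  s.totalBytes += partSize;
  if (s.totalBytes > MAX_BYTES) {
    // The real server would also abort the LakeFS multipart upload here.
    uploadSessions.delete(token);
    return 413;
  }
  s.parts.set(partNumber, eTag);
  return 200;
}
```

Because the counter is per-session and checked on every part, the limit holds regardless of how the client splits the file into parts.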

Frontend (Angular)

  • Update DatasetService.multipartUpload(...) to use the new server-proxied multipart API:

    • type=init → get uploadToken.
    • Upload each chunk with XMLHttpRequest to POST /dataset/multipart-upload/part?token=...&partNumber=..., honoring a configurable concurrency limit.
    • Track per-part and total progress, and compute smoothed upload speed, ETA, and total time for UI display.
    • After all parts succeed, call type=finish with { uploadToken }.
    • On any error, call type=abort with { uploadToken } to ensure the LakeFS multipart upload is cleaned up.
  • Update finalizeMultipartUpload(...) and its caller in DatasetDetailComponent to align with the new backend signature (uploadToken + type=finish/abort, without passing parts/physicalAddress from the client).

  • Keep the existing frontend size check for UX (fast feedback), but it is now only a convenience guard; the authoritative limit is enforced on the backend.
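The client-side init → parts → finish orchestration with a bounded worker pool and abort-on-error can be sketched roughly as below. This is a simplified model, not the actual DatasetService code: the real implementation uses XMLHttpRequest for per-part progress events, so the transport is injected here as `sendPart`, and all names and signatures are illustrative.

```typescript
// Sketch of the client orchestration: type=init -> upload parts with bounded
// concurrency -> type=finish; on any failure, type=abort so the server can
// clean up the LakeFS multipart upload. The `api` object abstracts the four
// endpoints described in the PR.

async function multipartUpload(
  chunks: Uint8Array[],
  concurrency: number,
  api: {
    init: () => Promise<string>;                                  // POST ?type=init -> uploadToken
    sendPart: (token: string, partNumber: number,
               chunk: Uint8Array) => Promise<void>;               // POST /part?token=..&partNumber=..
    finish: (token: string) => Promise<void>;                     // POST ?type=finish
    abort: (token: string) => Promise<void>;                      // POST ?type=abort
  },
): Promise<void> {
  const token = await api.init();
  let next = 0;
  try {
    // Simple worker pool: `concurrency` workers each pull the next chunk
    // index; single-threaded JS guarantees no index is taken twice.
    const worker = async () => {
      while (next < chunks.length) {
        const i = next++;
        await api.sendPart(token, i + 1, chunks[i]); // part numbers are 1-based
      }
    };
    await Promise.all(
      Array.from({ length: Math.min(concurrency, chunks.length) }, worker),
    );
    await api.finish(token);
  } catch (err) {
    await api.abort(token); // ensure the server-side session is cleaned up
    throw err;
  }
}
```

A part rejected with 413 surfaces here as a failed `sendPart` promise, which routes through the catch branch and triggers `type=abort`.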

Any related issues, documentation, discussions?

How was this PR tested?

  • Set single_file_upload_max_size_mib to a known value (e.g., 20 MiB / 10 GiB).

  • Multipart upload (new flow):

    • With the unmodified frontend, upload a file below the limit → all parts succeed, type=finish completes, and the object is committed in LakeFS.

    • Modify main.js in the browser to relax or remove the frontend size check and attempt to upload a file larger than the configured limit:

      • Part uploads start, but once the cumulative size exceeds the limit the /multipart-upload/part call returns 413 and the server aborts the LakeFS multipart upload. The frontend may not observe the 413 response directly, because the server can abort the request while the part body is still streaming.
      • No oversized object exists in LakeFS after the failed upload.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with ChatGPT.

github-actions bot removed the common label Nov 16, 2025
carloea2 changed the title from "Single max size bk enf" to "fix(upload): enforce max file size on backend" Nov 16, 2025
carloea2 changed the title from "fix(upload): enforce max file size on backend" to "fix(dataset): enforce single max size file upload limit" Nov 16, 2025
carloea2 changed the title from "fix(dataset): enforce single max size file upload limit" to "fix(dataset): enforce backend max single file size limit" Nov 16, 2025
github-actions bot added the "frontend (Changes related to the frontend GUI)" label Dec 5, 2025
carloea2 marked this pull request as ready for review December 5, 2025 06:53

carloea2 commented Dec 5, 2025

@aicam Good evening. The PR is ready for review; please let me know your comments. Thank you.


carloea2 closed this Dec 5, 2025

chenlica commented Dec 6, 2025

@carloea2 Please explain the reason for closing this PR and what our plan is next.


carloea2 commented Dec 6, 2025

It was closed to simplify the review process and split the work into two steps:

  1. Handle this issue first: task(dataset): Redirect multipart upload through File Service #4110

  2. Once that is completed, I’ll create a new issue to enforce the actual max file size in the new architecture.


Labels

frontend (Changes related to the frontend GUI), service


Development

Successfully merging this pull request may close these issues.

Dataset max file size limit can be bypassed (frontend-only check)

2 participants