I agree with the proposed solution of simply allowing the user to upload the remaining files; this would be an immediate fix. However, I believe our file-service lacks proper logging. We should add verbosity levels to the application and to each service so that, when needed, we can enable detailed logging for just the services involved.
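To make the per-service verbosity idea concrete, here is a minimal sketch using Python's standard `logging` module. It assumes each service logs under its own logger name; `configure_verbosity` and the `other-service` name are hypothetical and used only for illustration, and our services may well need a different mechanism.

```python
import logging


def configure_verbosity(levels, default="WARNING"):
    """Apply a default log level globally, then per-service overrides.

    `levels` maps a logger name (assumed to be one per service) to a
    standard level name such as "DEBUG" or "INFO".
    """
    logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
    logging.getLogger().setLevel(default)
    for service, level in levels.items():
        logging.getLogger(service).setLevel(level)


# Enable detailed logging only for the service under investigation.
configure_verbosity({"file-service": "DEBUG"})
logging.getLogger("file-service").debug("multipart upload started")
logging.getLogger("other-service").debug("suppressed at the default level")
```

With this shape, turning on detailed logs for one service is a configuration change rather than a code change.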
I agree.
I like the idea. Besides path and size, could (or should) we also use a checksum?
Per our offline discussion, I agree that it's important for the server to detect already-uploaded files and give the user an option to skip them. One solution is for the server to keep track of the uploaded files. A technical challenge is whether the client side can efficiently compute a signature of a file (e.g., a checksum), which the server could then use to check whether a server-side file with the same file name and checksum is likely the same file. It would be good to know whether other systems do this type of checking. My guess is possibly not, due to efficiency concerns.
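For reference, a client-side checksum can be computed in fixed-size chunks so memory stays flat regardless of file size, though the whole file still has to be read once, which is the efficiency concern at the ~1.3 TB scale. A minimal sketch, assuming SHA-256 and a server-side index that maps file name to checksum (both `likely_same_file` and `server_index` are hypothetical names, not existing APIs):

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB chunks without loading it fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def likely_same_file(name, checksum, server_index):
    """True if the server already holds `name` with the same checksum.

    `server_index` is an assumed mapping of file name -> hex checksum.
    """
    return server_index.get(name) == checksum
```

Whether this is worth the extra read pass over every file, compared with a cheaper path-and-size comparison, is the open question.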
This sounds like a very good proposal; both the technical solution and the user experience are good.
Thanks everyone for the input. We will move forward with improving resumable upload by tracking completion at the batch/session level. There is an ongoing PR for this issue, so please feel free to share any additional suggestions there: #5929
Context
During recent large-dataset upload testing, we ran into an issue that may be worth discussing as a group, since the solution may affect both the upload and dataset commit behavior.
A user uploaded ~1300 files (~1.3 TB total) into a dataset. After the upload "finished", only ~1200 files were uploaded; ~100 failed mid-process.
This exposed two usability issues:
Currently, resumability has no batch/session-level state. We do not track which files in the batch have already completed, so re-dropping the same directory cannot distinguish "already uploaded" files from "new" files.
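As a rough illustration of what batch/session-level state could look like, the sketch below records each completed file (relative path and size) and reloads that record when the same session is resumed. It is a sketch under assumptions, not the design in the PR: `UploadSession` is a hypothetical name, and a local JSON file stands in for wherever the state would really live (most likely on the server).

```python
import json
from pathlib import Path


class UploadSession:
    """Tracks which files in a batch have finished uploading."""

    def __init__(self, state_file):
        self.state_file = Path(state_file)
        self.completed = {}  # relative path -> size in bytes
        if self.state_file.exists():
            self.completed = json.loads(self.state_file.read_text())

    def mark_completed(self, rel_path, size):
        """Record one finished file and persist the state immediately."""
        self.completed[rel_path] = size
        self.state_file.write_text(json.dumps(self.completed))

    def pending(self, files):
        """Return the paths in `files` (path -> size) not yet completed."""
        return [p for p, size in files.items() if self.completed.get(p) != size]
```

Re-dropping the same directory would then call `pending(...)` and upload only what it returns, instead of starting the whole batch again.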
There are two categories of files that get uploaded again today:
How other systems handle this
In some other platforms, when uploading a folder that contains files already uploaded before, the system detects the existing files and prompts the user with options such as:
However, some platforms still seem to re-upload the existing files instead of truly skipping them, which is effectively what we do today.
Proposed Solution
Move existence detection to the beginning of the upload flow instead of relying on commit-time deduplication.
This would be especially useful for large datasets, where re-uploading already completed files can waste a lot of time and bandwidth.
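The up-front existence check could be sketched roughly as follows: before any bytes are sent, the local file list is compared with what the dataset already holds, and matching files are set aside. This assumes a match on relative path and size only; `plan_upload`, `remote_files`, and the `skip_existing` flag are hypothetical names for illustration.

```python
def plan_upload(local_files, remote_files, skip_existing=True):
    """Split a batch into files to skip and files to upload.

    Both arguments map relative path -> size in bytes. A file counts as
    already uploaded when the same path exists remotely with the same size.
    """
    to_skip, to_upload = [], []
    for path, size in local_files.items():
        if skip_existing and remote_files.get(path) == size:
            to_skip.append(path)
        else:
            to_upload.append(path)
    return to_skip, to_upload


local = {"img/001.tif": 1024, "img/002.tif": 2048, "img/003.tif": 4096}
remote = {"img/001.tif": 1024, "img/002.tif": 100}  # 002 is incomplete
skipped, queued = plan_upload(local, remote)
print(skipped)  # ['img/001.tif']
print(queued)   # ['img/002.tif', 'img/003.tif']
```

A size mismatch is treated as "not uploaded", so a partially transferred file is queued again rather than silently skipped.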
Please feel free to share any suggestions or concerns. Thanks!