Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic UrlOperations.upload for ssh- and file-URLs #649

Merged
merged 3 commits into from
Jul 1, 2024

Conversation

christian-monch
Copy link
Contributor

@christian-monch christian-monch commented Mar 27, 2024

This PR addresses issue datalad-ria #102 and touches on issue #454 where atomic upload operations are discussed as well.

The PR also adds atomic upload to files. This can prevent data loss in the case where multiple threads write to the same file concurrently.

All ssh- and file-uploads should be atomic, which means an atomic-parameter that controls atomicity on a per-call basis is not necessary.

@christian-monch christian-monch changed the title Atomic UrlOperations.upload for ssh- and file-URLs Atomic UrlOperations.upload for ssh-URLs Mar 27, 2024
@christian-monch christian-monch changed the title Atomic UrlOperations.upload for ssh-URLs Atomic UrlOperations.upload for ssh- and file-URLs Mar 27, 2024
@christian-monch christian-monch force-pushed the atomic-upload branch 4 times, most recently from 0c248eb to 8c7fc58 Compare March 27, 2024 17:03
Copy link

codecov bot commented Mar 27, 2024

Codecov Report

Attention: Patch coverage is 97.36842% with 3 lines in your changes missing coverage. Please review.

Project coverage is 92.47%. Comparing base (556bb81) to head (f07a613).

Files Patch % Lines
datalad_next/url_operations/ssh.py 96.20% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #649      +/-   ##
==========================================
+ Coverage   89.23%   92.47%   +3.24%     
==========================================
  Files         195      195              
  Lines       14272    14301      +29     
  Branches     2164     2162       -2     
==========================================
+ Hits        12735    13225     +490     
+ Misses       1258      812     -446     
+ Partials      279      264      -15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this PR together. I hope we can get this done quickly now. Given the age of this PR, I'd like to ask for the first and the last commit of this series to be posted as two individual PRs. They are largely unrelated to the actual topic, valid and complete fixes that should not be hanging here. Thanks!

The main diff of the PR changes the URL operations upload from a straight copy to a copy-to-tmp-and-final-mv implementation. This is good IMHO.

However, the diff introduces two random number reinitializations. We cannot do this, and I do not understand why it would be needed either. Please add rationale on the necessity, and add the protective scaffolding (getstate(), setstate()) to limit the impact of the reinitialization to only this very function and without impairing the ability of any Python script importing DataLad to run reproducibly.

Thanks!

This commit adds atomic upload to `url_operations.ssh`.
It uses `datalad_next.shell` to implement all
ssh-operations.

The commit also adds an explicit test for
`url_operations.ssh.stat`.
This commit adds an atomic upload for files. This
can prevent errors where two threads are
writing to the same file concurrently.

The commit also fixes type errors in
`url_operations.file`.
This commit adds a description of the atomicity
properties of upload to the doc-string of the method
`url_operations.base.UrlOperations.upload`.
@christian-monch
Copy link
Contributor Author

Thanks for putting this PR together. I hope we can get this done quickly now. Given the age of this PR, I'd like to ask for the first and the last commit of this series to be posted as two individual PRs. They are largely unrelated to the actual topic, valid and complete fixes that should not be hanging here. Thanks!

Done in PR #736 and #737

[...]
However, the diff introduces two random number reinitializations. We cannot do this, and I do not understand why it would be needed either. Please add rationale on the necessity, and add the protective scaffolding (getstate(), setstate()) to limit the impact of the reinitialization to only this very function and without impairing the ability of any Python script importing DataLad to run reproducibly.
[...]

I removed the reinitialization of random numbers and added a time-stamp modifier to the nonce-values. That should prevent name collisions, even if the random state is identically on calls to upload (which is already unlikely).

@christian-monch christian-monch requested a review from mih July 1, 2024 14:44
Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new implementation adds a longer suffix (38chars) (like .transfer-8267388836_17682337760925293 that include a random number and a timestamp to make collisions unlikely. This is not cheap on filesystems with path length limits, but also an implementation detail that can be addressed, if and when there is a real need.

@mih mih merged commit f00cfdb into datalad:main Jul 1, 2024
9 checks passed
@mih
Copy link
Member

mih commented Jul 1, 2024

Thanks @christian-monch !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants