Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Config: make writing to disk as atomic as possible #4607

Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Dec 2, 2020

Potentially addresses #4604

The original implementation of the Config.store method, that writes
the configuration from memory to disk, was using shutil.copy which is
not an atomic operation. It is therefore possible that if the copy
operation is interrupted the configuration is not written entirely and
is left either empty or in a corrupted state.

Making this an atomic operation in 100% of the cases is not trivial and
even on the application level there are no guarantees as even on the
file system level operations that should be atomic can be interrupted
under very rare circumstances.

However, we can make a significant improvement by replacing the use of
shutil.copy with os.rename, which because it is a file move
operation should be atomic and therefore less prone to being
interrupted. The new algorithm for storing the config is as follows:

  1. Create temporary file in same directory as target file
  2. Write content of in memory config to the temporary file and flush
  3. Use os.rename to move the temporary file to target destination

This should prevent most common cases that could corrupt the existing
configuration file.

Finally, in Config.from_method now only FileNotFoundError is caught
when trying to read the config file from disk. Originally both IOError
and OSError were being caught, which can also be raised if an existing
file could not be read. Simply overwriting in this case with a default
config is not desirable and so in this case we simply let the exception
bubble up.

@sphuber sphuber force-pushed the fix/4604/config-file-deleted branch 3 times, most recently from 91714e0 to 6d08b19 Compare December 2, 2020 11:12
@codecov
Copy link

codecov bot commented Dec 2, 2020

Codecov Report

Merging #4607 (4f2ab82) into release/1.5.1 (d604916) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@                Coverage Diff                @@
##           release/1.5.1    #4607      +/-   ##
=================================================
+ Coverage          79.50%   79.50%   +0.01%     
=================================================
  Files                482      482              
  Lines              35315    35320       +5     
=================================================
+ Hits               28073    28078       +5     
  Misses              7242     7242              
Flag Coverage Δ
django 73.67% <100.00%> (+0.01%) ⬆️
sqlalchemy 72.84% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
aiida/manage/configuration/config.py 92.44% <100.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d604916...4f2ab82. Read the comment docs.

Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sphuber , just some small comments

aiida/manage/configuration/config.py Show resolved Hide resolved
timestamp = os.path.getmtime(self.config_filepath)
Config.from_file(self.config_filepath)

# Sleep a second, because for some operating systems the time resolution is of the order of a second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, which OS has 1s time resolution?
It's interesting that this can work with processors that have clock cycles below ns ;-)

As far as I'm aware, tests should run "in the blink of an eye", i.e. 1s is already too long.. 0.1s would already be better..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

l'histoire se répete
TLDR: blame MacOS :)

Copy link
Member

@ltalirz ltalirz Dec 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure what type of time resolution @giovannipizzi was referring to, but on my MacOS it is quite substantially better

In [1]: from os.path import getmtime

In [5]: from time import sleep

In [7]: from pathlib import Path

In [8]: a = Path('a')

In [9]: for i in range(10):
   ...:     sleep(0.1)
   ...:     a.touch()
   ...:     print(getmtime('a'))
   ...:
1606909606.250035
1606909606.350413
1606909606.452756
1606909606.556789
1606909606.658411
1606909606.758832
1606909606.864015
1606909606.968789
1606909607.069461
1606909607.171968

In [10]: for i in range(10):
    ...:     sleep(0.01)
    ...:     a.touch()
    ...:     print(getmtime('a'))
    ...:
1606909614.754383
1606909614.765453
1606909614.775664
1606909614.785863
1606909614.796015
1606909614.806161
1606909614.817056
1606909614.827201
1606909614.837322
1606909614.848603

In [11]: for i in range(10):
    ...:     sleep(0.001)
    ...:     a.touch()
    ...:     print(getmtime('a'))
    ...:
1606909621.740552
1606909621.741794
1606909621.74325
1606909621.744598
1606909621.745944
1606909621.747289
1606909621.748634
1606909621.749978
1606909621.751066
1606909621.752403

In [12]: for i in range(10):
    ...:     sleep(0.0001)
    ...:     a.touch()
    ...:     print(getmtime('a'))
    ...:
1606909728.212414
1606909728.212609
1606909728.212802
1606909728.213136
1606909728.213311
1606909728.213494
1606909728.21371
1606909728.213929
1606909728.21422
1606909728.214399

In [13]: for i in range(10):
    ...:     a.touch()
    ...:     print(getmtime('a'))
    ...:
1606909932.093822
1606909932.093957
1606909932.094021
1606909932.094065
1606909932.094104
1606909932.094144
1606909932.094179
1606909932.094213
1606909932.094248
1606909932.094292

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this problem has been addressed in at least some of the MacOS versions, maybe it would be best to remove these sleeps. Note that this is not the only place where I am adding it. It is used for a lot of the verdi tests that need to use the interactive writer. I would leave that for a separate PR though. I would be fine to remove it since the only problem it tackles is that certain unit tests will fail on certain MacOS systems. Probably not reason enough to slow down many of these tests all the time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So:

  • I have fixed the (testing) issue already upstream in click some time ago - this will appear in v8.0 (it's now in v8.0.0a1 according to GitHub)
  • According to my old comments in that commit, the precision is 2 seconds (!!) on FAT32!
  • According to my commit, at the time I was using Mac OS X 10.12. In 10.13 Apple switched to APFS that I guess fixed this and that must be why Leo does not see it anymore (BTW, it introduced a much more serious bug IMO with atomicity of file replacements, that I unveiled when hard-testing disk-objectstore! Still no reply from the developers, though). Since this is only for testing so it matters only to us developers, and I think nowadays probably everybody is developing on at least 10.13, probably we could do a PR and clean up these sleeps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to bring references: in the official Microsoft docs they say

[...] the resolution of create time on FAT is 10 milliseconds, while write time has a resolution of 2 seconds and access time has a resolution of 1 day, so it is really the access date. The NTFS file system delays updates to the last access time for a file by up to 1 hour after the last access.

And then you guys complain about the 1s resolution on older Macs? :-D

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @giovannipizzi for the info, that's good to know!

Does this mean that since we are now always creating a temporary file and renaming it, we actually have a time resolution of 10ms on FAT32 as well?

In general, one could also think of an alternative way of testing this: as long as we know where in our code the config file can be written (e.g. the atomic_write function), we could use pytest-mock to mock this function and just check whether it is called. Then we don't need to rely on timing updates at the FS level.

The original implementation of the `Config.store` method, that writes
the configuration from memory to disk, was using `shutil.copy` which is
not an atomic operation. It is therefore possible that if the copy
operation is interrupted the configuration is not written entirely and
is left either empty or in a corrupted state.

Making this an atomic operation in 100% of the cases is not trivial and
even on the application level there are no guarantees as even on the
file system level operations that should be atomic can be interrupted
under very rare circumstances.

However, we can make a significant improvement by replacing the use of
`shutil.copy` with `os.rename`, which because it is a file move
operation should be atomic and therefore less prone to being
interrupted. The new algorithm for storing the config is as follows:

 1. Create temporary file in same directory as target file
 2. Write content of in memory config to the temporary file and flush
 3. Use `os.rename` to move the temporary file to target destination

This should prevent most common cases that could corrupt the existing
configuration file.

Finally, in `Config.from_method` now only `FileNotFoundError` is caught
when trying to read the config file from disk. Originally both `IOError`
and `OSError` were being caught, which can also be raised if an existing
file could not be read. Simply overwriting in this case with a default
config is not desirable and so in this case we simply let the exception
bubble up.
@sphuber sphuber force-pushed the fix/4604/config-file-deleted branch from 6d08b19 to 4f2ab82 Compare December 2, 2020 11:50
@ltalirz ltalirz self-requested a review December 2, 2020 11:59
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, in that case good to merge from my side.

let's follow up on the timing issue though

@sphuber
Copy link
Contributor Author

sphuber commented Dec 2, 2020

ok, in that case good to merge from my side.

Do you think this fixes issue #4604 and so I close it? Or do we leave it open?

@ltalirz
Copy link
Member

ltalirz commented Dec 2, 2020

Let's close it and reopen in case it reoccurs

@sphuber sphuber merged commit b9fcbd8 into aiidateam:release/1.5.1 Dec 2, 2020
@sphuber sphuber deleted the fix/4604/config-file-deleted branch December 2, 2020 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants