Skip to content
This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

Replace custom atomicWrite by more robust library call. #334

Merged
merged 2 commits into from
Jan 4, 2019
Merged

Replace custom atomicWrite by more robust library call. #334

merged 2 commits into from
Jan 4, 2019

Conversation

benjaminfuchs
Copy link
Contributor

Issue

We are sometimes seeing following exception when we run our build with many cores:

Failed to execute script clcache_main
         Traceback (most recent call last):
           File "clcache_main.py", line 2, in <module>
           File "clcache\__main__.py", line 1596, in main
           File "clcache\__main__.py", line 1623, in processCompileRequest
           File "clcache\__main__.py", line 1675, in scheduleJobs
           File "concurrent\futures\_base.py", line 398, in result
           File "concurrent\futures\_base.py", line 357, in __get_result
           File "concurrent\futures\thread.py", line 55, in run
           File "clcache\__main__.py", line 1696, in processSingleSource
           File "clcache\__main__.py", line 1760, in processDirect
           File "clcache\__main__.py", line 1785, in ensureArtifactsExist
           File "clcache\__main__.py", line 756, in __exit__
           File "clcache\__main__.py", line 666, in save
           File "contextlib.py", line 66, in __exit__
           File "clcache\__main__.py", line 125, in atomicWrite
         PermissionError: [WinError 5] Access is denied: 'XXXX\\stats.txt.new' -> 'XXXX\\stats.txt'

Searching a bit makes it look like there is a race condition:

Anyway, the os.replace docs could state in general that: "[o]n Windows, a PermissionError will be raised if dst is a read-only file, or if either src or dst is currently open, or if src is a directory with an open file". https://bugs.python.org/issue27886

Looking with that in mind at the atomicWrite function:

@contextlib.contextmanager
def atomicWrite(fileName):
    tempFileName = fileName + '.new'
    with open(tempFileName, 'w') as f:
# One thread can have a open file handler on the tempFile thats always "stats.txt" ...
        yield f
    os.replace(tempFileName, fileName)
# ... while another thread is calling os.replace() causing PermssionError.

Solution

Replacing the atomicWrite function with atomic_write from the atomicwrites package seems to solve the issue (https://pypi.org/project/atomicwrites/). The package takes care of such race conditions so we no longer have to think about it.

@frerich frerich merged commit ea46042 into frerich:master Jan 4, 2019
@frerich
Copy link
Owner

frerich commented Jan 4, 2019

The atomicwrites package is a great find, I didn't even know that exists!

The patch seems plausible to me, merging this. Thanks!

@izmmisha
Copy link

izmmisha commented Jan 4, 2019

Hi, just checked this patch with this simple test:

import concurrent.futures
import tempfile
import os

#os.environ['TRACKER_ENABLED']="1"

from clcache.__main__ import (
    Statistics,
)

STATS_FILE = 'C:\\Users\\im\\clcache\\stats.txt' #problem
#STATS_FILE = tempfile.NamedTemporaryFile(dir='C:\\Users\\im\\clcache', suffix='.txt').name #problem
#STATS_FILE = tempfile.NamedTemporaryFile(suffix='.txt').name #no problem
#STATS_FILE = tempfile.NamedTemporaryFile(dir='C:\\Users\\im\\clcache').name #problem without txt extension too
#STATS_FILE = 'C:\\clcache\\stats.txt' #no problem

def change_stats():
    stats = Statistics(STATS_FILE)
    with stats.lock, stats as s:
        s.registerCallWithInvalidArgument()


if __name__ == '__main__':
    print(STATS_FILE)
    while True:
        with concurrent.futures.ThreadPoolExecutor(4) as executor:
            futures = [executor.submit(change_stats) for _ in range(1000)]
            for future in concurrent.futures.as_completed(futures):
                future.result()

Without this fix:

C:\Users\im\clcache\stats.txt
Traceback (most recent call last):
  File "..\awtest.py", line 29, in <module>
    future.result()
  File "C:\Python34\lib\concurrent\futures\_base.py", line 395, in result
    return self.__get_result()
  File "C:\Python34\lib\concurrent\futures\_base.py", line 354, in __get_result
    raise self._exception
  File "C:\Python34\lib\concurrent\futures\thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "..\awtest.py", line 20, in change_stats
    s.registerCallWithInvalidArgument()
  File "C:\Python34\lib\site-packages\clcache\__main__.py", line 795, in __exit__
    self._stats.save()
  File "C:\Python34\lib\site-packages\clcache\__main__.py", line 703, in save
    json.dump(self._dict, f, sort_keys=True, indent=4)
  File "C:\Python34\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "C:\Python34\lib\site-packages\clcache\__main__.py", line 162, in atomicWrite
    os.replace(tempFileName, fileName)
PermissionError: [WinError 5] Отказано в доступе: 'C:\\Users\\im\\clcache\\stats.txt.new' -> 'C:\\Users\\im\\clcache\\stats.txt'

With this fix:

C:\repository\git\clcache>\Python34\python.exe ..\awtest.py
C:\Users\im\clcache\stats.txt
Traceback (most recent call last):
  File "..\awtest.py", line 29, in <module>
    future.result()
  File "C:\Python34\lib\concurrent\futures\_base.py", line 395, in result
    return self.__get_result()
  File "C:\Python34\lib\concurrent\futures\_base.py", line 354, in __get_result
    raise self._exception
  File "C:\Python34\lib\concurrent\futures\thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "..\awtest.py", line 20, in change_stats
    s.registerCallWithInvalidArgument()
  File "C:\Python34\lib\site-packages\clcache\__main__.py", line 749, in __exit__
    self._stats.save()
  File "C:\Python34\lib\site-packages\clcache\__main__.py", line 659, in save
    json.dump(self._dict, f, sort_keys=True, indent=4)
  File "C:\Python34\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "C:\Python34\lib\site-packages\atomicwrites\__init__.py", line 152, in _open
    self.commit(f)
  File "C:\Python34\lib\site-packages\atomicwrites\__init__.py", line 177, in commit
    replace_atomic(f.name, self._path)
  File "C:\Python34\lib\site-packages\atomicwrites\__init__.py", line 89, in replace_atomic
    return _replace_atomic(src, dst)
  File "C:\Python34\lib\site-packages\atomicwrites\__init__.py", line 71, in _replace_atomic
    _windows_default_flags | _MOVEFILE_REPLACE_EXISTING
  File "C:\Python34\lib\site-packages\atomicwrites\__init__.py", line 66, in _handle_errors
    raise WinError()
PermissionError: [WinError 5] Отказано в доступе.

@benjaminfuchs
Copy link
Contributor Author

@izmmisha Thanks for the test. I did some test runs with it. Both implementations with 200.000 calls to change_stats():

old implementation: 10 fails
new implementation: 2 fails

Did you also see such kind of frequency of fails?

I think the race condition that tempFileName was always stats.txt.new for every thread is solved.
But there is still a race condition if 2 threads try to write the new content to stats.txt.
I think this can only be solved by some kind of mutex guarding (what is aleardy done in some other parts of clcache).

@izmmisha
Copy link

izmmisha commented Jan 4, 2019

It need more investigation, the problem exists only in user home directory (except temp directory), the simplest solution right now is move clcache folder out of user home directory.

@benjaminfuchs
Copy link
Contributor Author

@izmmisha Looking at this post nodejs/node-gyp#1124 maybe the default CLCACHE_DIR location should be in %LOCALAPPDATA% on windows? I ran the test with this folder and didn't get any errors. No idea what magic happens in %USERPROFILE%.

shijunjing added a commit to shijunjing/clcache that referenced this pull request Mar 15, 2019
It is not unusual that we observed the atomic_write() has data race
in multipul threads and can raise exception to stop build as below:
PermissionError: [WinError 5] Access is denied.

Even the %LOCALAPPDATA% suggestion in below links still cannot
totally fix the exceptions in our CI build system
frerich#334 (comment)
frerich#342

So, simply retry the atomic_write() several times first and then just
pass through if still fails. The save() work around looks only impact
the statistics data accurate. Since The build expection hang is always
show stopper issue for us with top priority, it's OK for us get harden
stable compiler cache with cost of inaccurate statistics info.
@punitchandra
Copy link

When can this be released?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants