Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slave: use system rmdir on windows #540

Closed
wants to merge 2 commits into from

Conversation

benallard
Copy link
Contributor

And try twice if the first time is not working.

This seems to be the only way of getting rid of annoying trouble when some files are not deleted, making the traversal cleanup fail.

@tomprince
Copy link
Member

I certainly want to support windows better.

How did you determine that calling rmdir (at most twice) is more reliable than what we have now. The fact that you have to retry once seems to me like it might be more reliable with your usage patterns, but not generally.

Also, rmdirRecursive is used in a few places on the slave, so it would be better to factor the logic into a single place.

@benallard
Copy link
Contributor Author

Hey Tom,

I tried a couple of things, the main trouble there is that windows (on the fs level) when deleting a file, actually only flip a switch, that only delete the file when all the file descriptors are closed. So, if a service (name any antivirus for instance) is browsing your file system at this moment, the rmdirRecursive will fail whatever you try as removing the parent directory will not succeed, this one not being empty. Trying another time will often succeed as the service (or whatever was browsing at that time) will surely have changed file in the meantime.

We often had trouble with the clobber functionality of the mercurial source step (on 64-bits), this, running for some months now seems to be pretty reliable ... Calling 'rmdir' in itself, instead if the previously complicated function maybe does not really change things, but as we call 'rm -rf' on Linux, I thought we could better unify everything, as the current one did not worked anyway.

I did not looked at trying to factor things, can you point me to some places where that could be done ?

Regards,
Benoît.

On 1 okt. 2012, at 21:46, Tom Prince notifications@github.com wrote:

I certainly want to support windows better?

How did you determine that calling rmdir (at most twice) is more reliable than what we have now. The fact that you have to retry once seems to me like it might be more reliable with your usage patterns, but not generally.

Also, rmdirRecursive is used in a few places on the slave, so it would be better to factor the logic into a single place.


Reply to this email directly or view it on GitHub.

@tomprince
Copy link
Member

Benoît Allard notifications@github.com writes:

I tried a couple of things, ...

Certainly retrying can cause things to succeed, but it isn't clear that
retrying once is better than retying multiple times. (I'm not suggesting
you rewrite this to retry multiple times.)

We often had trouble with the clobber functionality of the mercurial source step (on 64-bits), this, running for some months now seems to be pretty reliable ... Calling 'rmdir' in itself, instead if the previously complicated function maybe does not really change things, but as we call 'rm -rf' on Linux, I thought we could better unify everything, as the current one did not worked anyway.

How much of the improvement is from switching to rmdir, and how much is
from retrying? I worry that this might make things more reliable for
you, and less reliable for someone else. Without some analysis of why/if
this is better, I am relucant to change things around based on anecdotal
evidence.

I did not looked at trying to factor things, can you point me to some places where that could be done ?

git grep rmdirRecursive and git grep 'rm.*-rf'.

Tom

@benallard
Copy link
Contributor Author

Ok, fine, I don't know either how much of the improvements come from each modification.

One of the trouble I had is that the old rmdirRecursive was simply crashing with an exception:

exceptions.WindowsError: [Error 145] The directory is not empty: u'c:\...'

That should'nt happen, so I thought that it was old and buggy ans that I could simply replace it with rmdir as it was anyway only used on windows. Not trying to justify myself, just explaining my chain of thought.

What's the way forward now ?

@tomprince
Copy link
Member

It wouldn't suprise me if rmdir is more reliable, but I don't want to
just randomly change things (maybe @Callek can chime in here).

Retrying just feels like it is papering over the issue, but maybe that
is the best that can be done one windows. (@djmitche: Thoughts?)

@djmitche
Copy link
Member

djmitche commented Oct 5, 2012

There are lots of inconclusive discussions of this sort of thing around the web. In fact, some point to our implementation of rmdirRecursive. One of the things rmdirRecursive gets right is setting permissions while removing. I don't think rmdir does so.

This code has been around for a while, and works well on a variety of Windows systems against a variety of common issues. I'd rather improve it to cover the failure case you're seeing, than replace it and start that years-long refinement process over.

The most obvious solution is to turn off virus scanning, or disable it for the slave's basedir. It might be worth adding this suggestion to the docs. Assuming that's impossible, I suspect that the solution here is delayed retries, retrying infinitely as long as progress is being made. So, given the list of entries in a directory, keep a list of the entries for which you encounter an error, and when you encountered that error, then re-try those, say, 0.1s and 1s later, while working on other entries in the interim.

@benallard
Copy link
Contributor Author

Just got the error on another slave (also Win7 x64), so I can show the full traceback:

exception from rmdirRecursive
Traceback (most recent call last):
  File "C:\Python27\lib\threading.py", line 524, in __bootstrap
    self.__bootstrap_inner()
  File "C:\Python27\lib\threading.py", line 551, in __bootstrap_inner
    self.run()
  File "C:\Python27\lib\threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\python\threadpool.py", line 167, in _worker
    result = context.call(ctx, function, *args, **kwargs)
  File "C:\Python27\lib\site-packages\twisted\python\context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "C:\Python27\lib\site-packages\twisted\python\context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 91, in rmdirRecursive
    rmdirRecursive(full_name)
  File "C:\Python27\lib\site-packages\buildslave\commands\utils.py", line 96, in rmdirRecursive
    os.rmdir(dir)
exceptions.WindowsError: [Error 145] The directory is not empty: u'C:\\Users\\buildbot\\buildslave\\slave_win_inst_installer_pack\\build\\.hg\\store\\data\\long\\long\\long\\long\\long\\long\\directory'
program finished with exit code -1

This error is something we are working on for a long time already, It was one of Harry's first assignment to get used to buildbot before starting with real development ... And he tried a couple of retry stuff before we decide to switch to the system rmdir. Too bad we did not documented the progresses, but I think there is an old email where he asks for advices over there. Not saying he tried smart stuff though, he did not knew how to write a line of python at that time.

I just mean to say that this solution evolved a lot in the past.

Enough about the past, let's look forward ; @djmitche: You proposed to re-try with delay. Do you mean in a deffered way ? Are you thinking of something like callLater or such ? Can you elaborate ?

@djmitche
Copy link
Member

In this context, I don't think it's necessary to delay asynchronously, particularly if, as I suggested, other directory entries are deleted while waiting for the retry delay to complete. If the algorithm runs out of work, then a sleep() is OK.

In a bit more detail, I'd design the algorithm as follows: first, convert the recursive algorithm to an iterative algorithm in the customary way, using a stack to hold incomplete work (in this case, directory entries that have yet to be deleted).

Aside from this, keep a queue (perhaps a heap?) of delayed tasks, each tagged with the time of the next try and the number of retries. Before popping a work item off the stack (or maybe every 100th pop?), check the queue for timed-out delayed tasks, and perform any that are due. If the work stack is empty, but the queue is not, sleep until the next item in the queue is due. Note that when that item executes successfully, it may place more items on the stack.

Each directory entry should have a small, finite number of retries (I suggested two, one at 0.1s and one at 1s, above). That way, the algorithm can perform lots of retries, without retrying a single obstinate file too many times.

@djmitche
Copy link
Member

Wow, step back for a minute. That's the best we can come up with for removing directories recursively on Windows? YUCK!

@tomprince
Copy link
Member

Using @inlineCallbacks and deferLater(reactor, delay, lambda: None) instead of sleep will prevent this stalling other builds.

@djmitche
Copy link
Member

@tardyp - do you want to pick this back up? If not (and that's fine), let's close the pull req.

@djmitche
Copy link
Member

djmitche commented Apr 7, 2014

This can always be picked back up in a new pull request..

@djmitche djmitche closed this Apr 7, 2014
@djmitche
Copy link
Member

djmitche commented Sep 3, 2014

see also http://trac.buildbot.net/ticket/2885

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants