Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows: run Process.open_files() in a separate thread #597

Merged
merged 7 commits into from Jul 9, 2015

Conversation

mrjefftang
Copy link
Collaborator

This PR fixes issue #340 and supersedes PR #596.

This will spawn a new thread on every call to get_open_files() and time it out after 1 second.

Based off the skeleton reverse engineered code from Sysinternal's Handles here:
http://forum.sysinternals.com/discussion-howto-enumerate-handles_topic19403_post99587.html#99587

Tested on Python 2.7.2 and Python 3.4.2 on Windows 7 x64.

All tests pass and the only memory leak test failures are related to net_if_addrs as seen in #596 as well.

@mrjefftang
Copy link
Collaborator Author

I'm working on an updated version that won't rely on a global secondary thread.

@giampaolo
Copy link
Owner

I can't say I like the thread-based solution but considering that this is a very difficult (apparently unfixable) and ancient issue I suppose it may be OK to stick with this solution, also because the alternative I had in mind was to get rid of open_files() support on Windows once and for all.
Question: what happens in case of timeout? If you haven't decided yet I think we should raise AccessDenied. Also, 1 second seems too much to me. I would stick with 0.3 or 0.5 secs.

@giampaolo
Copy link
Owner

PS: in order to raise AccesDenied form the C extension you'll have to include "../_psutil_common.h", then:

 AccessDenied();
 return NULL;

...and decorate open_files() method with wrap_exceptions() decorator in _pswindows.py.

@mrjefftang
Copy link
Collaborator Author

On case of timeout, we just skip the file handle. I've been debug printing the PID/handle on timeouts and this behavior mimics ProcessExplorer. Raising an AD exception doesn't sound right because there may be actual files, we just can't get the name of the synchronous named pipe for that corresponding handle.

I agree 1 second is a long timeout, this can be changed. Right now I'm focused on trying to remove the use of global variables and a global thread so calls to get_open_files will be thread safe but I'm running into some heap corruption issues I haven't been able to track down.

@giampaolo
Copy link
Owner

Thanks for looking into this. If this finally gets fixed it's gonna be a big step forward.

I've been debug printing the PID/handle on timeouts and this behavior mimics ProcessExplorer

What do you mean? That ProcessExplorer uses this same technique (using a thread)?

@mrjefftang
Copy link
Collaborator Author

As far as I can tell there are 2 ways for Process Explorer to work.

If it can inject a driver it will use the driver and this issue does not exist.

If it cannot inject a driver, it will create a thread and terminate it if it hangs. The handle that causes NtQueryObject to hang is just ignored and not present in the output for Process Explorer.

I've taken a slightly different approach which is just resetting the current execution point for the hanging thread back to the start so it's waiting for a signal. In theory should avoid all memory leaks since I'm essentially forcing the thread to goto the beginning of the function. I did some testing to verify that the stack contents aren't reallocated over and over.

However, running make.bat test-memleaks shows that it's still leaking memory and I'm not sure where.

I've also discovered get_open_files doesn't like file handles to network resources because the call to resolve \Device\HardDisk1\Partition0 to a drive letter doesn't understand the prefix for a file name that looks like: \Device\mup\sharename\path\to\file.txt

@giampaolo
Copy link
Owner

I've also discovered get_open_files doesn't like file handles to network resources because the call to
resolve \Device\HardDisk1\Partition0 to a drive letter doesn't understand the prefix for a file name that
looks like: \Device\mup\sharename\path\to\file.txt

This looks like a separate issue.
What happens in that case? Do you get an error? Can you attach the traceback? Any idea how to solve that? If we find a robust stratagem to preemptively recognize network files I think they should be represented like this:

\\path\to\file.ext

@mrjefftang
Copy link
Collaborator Author

I've opened issue #600 to track the network handles issue.

@giampaolo
Copy link
Owner

thanks

@giampaolo
Copy link
Owner

Any news about this? I think I'm gonna release 3.0.0 this week and I'd like to get this in. I don't want to rush it though, just wanted to let you know my plans.

@mrjefftang
Copy link
Collaborator Author

I don't think this will be ready in time. I would recommend merging in #596 to solve files not being shown due to the access flags filter and the possibility of memory/handle leaks in error edge cases.

Right now, I have an issue on Python 3 x64 where I'm getting a PUNICODE_STRING with a random null byte in the middle like: \Device\Harddisk1\Program Files\x00Microsoft

Passing this to PyUnicode_FromWideChar causes it to segfault.

I can't figure out why NtQueryObject with ObjectNameInformation is causing this to return an invalid string to me.

@mrjefftang
Copy link
Collaborator Author

Just an update: I should have this done by the end of the week.

@giampaolo
Copy link
Owner

Cool, thanks.

@mrjefftang
Copy link
Collaborator Author

So I'm content with the code at this point. I'd like to move a lot of the defines/structs into the header file but it causes a bunch of compiler errors for struct redefinition. It looks like it's due to ntextapi.h and the requirement to support mingw.

@giampaolo
Copy link
Owner

I already want to drop mingw support (which is already broken, see #591) so if that is blocking you feel free to either fix #591 first or remove only the mingw-related parts which are giving you troubles.

@mrjefftang
Copy link
Collaborator Author

I thought I saw an issue referring to dropping MingW but I couldn't find it. I'll see what I can do.

@mrjefftang
Copy link
Collaborator Author

Done.

I removed the conflicts that were affecting this PR so this should get you part way through #591.

@giampaolo
Copy link
Owner

It seems there are still conflicts as the merge button is grey.
I see stuff which should not be there, like this one:

-            if self._ppid is None:
-                ppid = self._proc.ppid()
-            self._ppid = ppid
-        self._proc._ppid = ppid
+            # Cache the PPID if it is not set
+            self._ppid = self._ppid or self._proc.ppid()
+            ppid = self._ppid

get_open_files spawns a worker thread to process calls to
NtQueryObject().

If the call timesout (100ms), the thread is terminated and the file
handle is skipped.  The ConvertThreadToFiber call prevents the initial
stack from leaking when the thread is terminated on Windows XP/2003.

The code has been cleaned up and struct redefinitions in ntextapi have
been removed and with conflicts using #define's.
@mrjefftang
Copy link
Collaborator Author

I rebased it to remove that commit.

@giampaolo
Copy link
Owner

OK, I'll try/test this someday during this week and hopefully merge it.

@mrjefftang
Copy link
Collaborator Author

@giampaolo Any chance of this getting merged in soon?

@giampaolo
Copy link
Owner

Sorry, I'm going to NYC on Sunday so this is gonna take a little while.

* master:
  use 'with open' to make sure file is closed
  test_process_create_time: always test against the rounded time too
  don't test physical cpu count on systems that don't include it
  don't test num_ctx_switches on unsupported kernels
  fix compilation warning about possible misuse of XDECREF
  fix race condition in wait_for_file
  fix flake8 error
  fix permission errors when running from /root
  fix giampaolo#607: DUPLEX_UNKNOWN is not defined on old RedHat versions
  fix giampaolo#606: Process.parent() may swallow NoSuchProcess exceptions (#race-condition)
  attempt to fix xargs on OSX
  giampaolo#602: move pre-commit hook into a separate file
  fix giampaolo#602: add GIT pre-commit hook
  fix failing test on Windows
  C styling: if unification
  forgot to close the handle
  fix giampaolo#599 (Windows): process name() can now be determined for all PIDs
  add test
  windows: refactoring of the alternative process info implementation
  fix typo in documentation
…read

* upstream/master:
  get rid of unreliable if_addrs() test on Linux
  better skip msg on windows
@mrjefftang
Copy link
Collaborator Author

I have reason to believe this breaks compatibility on Windows XP 32-bit. I will continue to investigate further.

@mrjefftang
Copy link
Collaborator Author

Looks like the Windows XP failure is unrelated.

It's a missing reference to inet_ntop [1] in net_if_addrs which isn't implemented in Windows until Vista [2]. This was introduced in PR #588

[1]

intRet = inet_ntop(AF_INET, &(sa_in->sin_addr), buff,

[2] https://msdn.microsoft.com/en-us/library/windows/desktop/cc805843%28v=vs.85%29.aspx

There's an edge case on Windows XP where an infinite loop is possible.

A call to NtQueryObject with a NULL buffer and a buffer size of 0 will
return STATUS_INFO_LENGTH_MISMATCH but dwLength will be 0.

This change will pre-allocate a buffer of size (MAX_PATH + 1) *
sizeof(WCHAR) and skips the handle if the required dwLenght is 0.
@mrjefftang
Copy link
Collaborator Author

Ran into some other issues, don't think this should be merged yet (Windows XP 32-bit running as SYSTEM).

@mrjefftang
Copy link
Collaborator Author

Turns out the threads won't die on WinXP under certain conditions.

For Vista+, it will spawn a new thread.
For WinXP it will use GetMappedFileName as suggested in the MSDN documentation. This implementation is safe from hangs but it will miss some files, specifically it fails the test case.

giampaolo added a commit that referenced this pull request Jul 9, 2015
@giampaolo giampaolo merged commit 1c1af13 into giampaolo:master Jul 9, 2015
@giampaolo
Copy link
Owner

I've just merged this. Tested on Win 7 and it seems everything is fine. Thanks for working on this. It was a big one.

@giampaolo
Copy link
Owner

@mrjefftang: also, I added this to doc:

Warning: On Windows this is not fully reliable as the underlying implementation may hang when
retrieving certain file handles. In order to work around that psutil spawns a thread and kills it if it’s not
responding. Bottom line is this method on Windows may be slow and may not return all file handles
opened by process.

Is it correct? Do you think we should add anything else (e.g. differences between XP and Vista)?

@mrjefftang
Copy link
Collaborator Author

I think that's correct. I would add a note that psutil spawns a thread on Vista+ only.

@giampaolo
Copy link
Owner

What happens on XP? The thread isn't spawn so we still have the hanging issue?
Also, what's the thread timeout?

@mrjefftang
Copy link
Collaborator Author

Timeout is 100ms.

Windows XP uses an alternative method that was documented on MSDN which misses file handles to valid files. It's a trade off of hanging vs being accurate.

Also despite all my testing, I could not get the thread cleanup to work on Windows XP which is why I went that route. We could spawn helper threads and call TerminateThread but the thread never dies off until python.exe quits. This lead to hundreds of threads being left around and in generally behaving badly.

@giampaolo
Copy link
Owner

...so as far as I understand:

  • it will never hang on XP but may skip valid file handles
  • it will never hang on Vista+ but as far as you know it will never skip valid file handles

Is it correct?

@mrjefftang
Copy link
Collaborator Author

That is correct.

@giampaolo
Copy link
Owner

Mmm... since we're killing the thread if it's hanging I suppose we may as well skip valid file handles.
Maybe it's because I'm pessimistic but I prefer to assume the worst case scenario and inform the user we're not sure this is 100% reliable.
I've change the doc to:

   Warning: on Windows this is not fully reliable as the underlying implementation
   may hang when retrieving certain file handles. In order to work around
   that psutil on Windows Vista (and higher) spawns a thread and kills it
   if it's not responding after 100ms.
   That implies valid file handles may be skipped.

@mrjefftang
Copy link
Collaborator Author

There's a chance that given the 100ms timeout, we end up killing the thread when it didn't hang but the CPU was just slow.

If a thread hangs, we kill it, spawn a new thread, and continue with the next file handle. The theory is that the thread only hangs on asynchronous pipes.

@giampaolo
Copy link
Owner

There's a chance that given the 100ms timeout, we end up killing the thread when it didn't hang but the CPU was just slow.

That's exactly because I prefer to declare this is not fully reliable.

If a thread hangs, we kill it, spawn a new thread, and continue with the next file handle. The theory is that the thread only hangs on asynchronous pipes.

I suppose there's no way to pre-emptively check whether the handle is an async pipe and skip it, is that right?

@mrjefftang
Copy link
Collaborator Author

There is no way to check if a handle is async from userspace (hence the need for a kernel driver).

We could probably just leave a disclaim that applies to all versions of Windows.

Warning: This method is not guaranteed to enumerate all file handles due to limitations of the Windows API.

@giampaolo giampaolo changed the title get_open_files in a separate thread Windows: run Process.open_files() in a separate thread Jul 9, 2015
@mrjefftang mrjefftang deleted the get_open_files_thread branch August 3, 2015 14:40
@jrfl
Copy link

jrfl commented Sep 5, 2015

I know this is a somewhat old conversation, but...
I've been working with psutils a little using the open_files function, which lead me to this conversation. I just wanted to mention. At some point in the open_files code, you're creating a thread with CreateThread. If you use any compiler runtime library calls in that thread, you're almost guaranteed to leak memory because the runtime lib relies on the setup done by (_)beginthread to clean up properly.

@giampaolo
Copy link
Owner

Where exactly does it leak? Can you provide a PR?

@mrjefftang
Copy link
Collaborator Author

I believe @jrfl is talking about this https://support.microsoft.com/en-us/kb/104641

@jrfl
Copy link

jrfl commented Sep 11, 2015

@mrjefftang is correct. I should have been more clear. Sorry for that.
At some point in the conversation, Jeff was talking about a memory leak he could not find. It's likely that clib calls were being used in that code that need the initialization and tear down that is done by beginthread and maybe endthread. I'm referring to the comment made by Jeff on Mar 1.

@mrjefftang
Copy link
Collaborator Author

There were a lot of changes since March 1. I might have been using CRT calls back then as well.

Since the merge, I have not seen any evidence of a memory leak, have you seen otherwise @jrfl ?

@jrfl
Copy link

jrfl commented Sep 11, 2015

There were some leaks when mucking about looking through all processes for all open handles every 15 seconds or so. The python process grew to over a gig in about an 2 hours. That's one of the things that lead me to this discussion. It turned out not to be a problem for me as I had to abandon using this code and start using handle from sysinternals and parse its output. Handle was finding open things that psutils did not. (Sorry if this sounds like bitching. My windows internals skills have degraded as I haven't had to use them for years otherwise I'd attempt to help)

@giampaolo
Copy link
Owner

Looking back at this it appears the leaking issue is a pretty severe one:

There were some leaks when mucking about looking through all processes
for all open handles every 15 seconds or so. The python process grew to over
a gig in about an 2 hours.

@mrjefftang any chance you could look into this again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants