Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean path prefix from filename in profile #4765

Open
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

ncclementi
Copy link
Member

We remove the sys.exec_prefix from the filename in the profile. I think we are still missing couple of cases where the filename won't have the sys.exec_prefix but some other information that we want to remove.
For example in the current test, if instead of doing x = await c.submit(sleep, 1) we use a function like slowinc that lives in distributed.utils_test we will have a path of the form /Users/username/Documents/distributed/distributed/utils_test.py. we should see for cases like this one, what do we want to remove.

@mrocklin
Copy link
Member

I think we are still missing couple of cases where the filename won't have the sys.exec_prefix but some other information that we want to remove.

It would be ideal to remoe these, but not essential I don't think. A 90% solution sounds pretty good to me :)

Quick question, have you spot checked that things look good in the /profile tab or the performance report?

@ncclementi
Copy link
Member Author

Good question Matt, no I didn't check how the report looks like. I'm trying to assign a filename to the profile when generating it so I can take a look at it, but this is breaking the test. So I'll work on that and see what we get.

@mrocklin
Copy link
Member

mrocklin commented Apr 29, 2021 via email

@jrbourbeau
Copy link
Member

For example, you can run the following code:

import dask.array as da
from distributed import Client, performance_report

if __name__ == "__main__":

    with Client() as client:
        with performance_report("my-performance-report.html"):
            x = da.random.random((1_000, 1_000), chunks=(50, 50))
            result = (x + x.T).mean(axis=0).sum()
            result.compute()

which will generate a performance report and save it to a local file named my-performance-report.html which you can open in your browser. When viewing the performance report, you'll see three profile tabs (two for the workers and one for the scheduler) where you'll be able to see the impact of the changes here on the filename that's displayed.

@ncclementi
Copy link
Member Author

Thank you both, so in the performance report things look good except for the cases we are missing. Still need to find a good way to clean the path when we don't have the sys.exec_prefix.
See a clean path example:
clean

... and a not clean one here:
not_clean

distributed/profile.py Outdated Show resolved Hide resolved
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To Matt's point about a 90% solution, I agree it totally fine for us to cover some obvious cases now and address other edge cases as they pop up. To that end, removing sys.exec_prefix as you already have seems like a great start.

We could also remove the user's home directory (os.path.expanduser("~")) if it's present in the filename. As an attempt to cover other cases and to prevent the filename from taking up too much space in profile plots we could also only display the last, for example, 80 characters.

A _format_filename utility function something like what's below might do the trick (note I've not tested the code below at all)

import functools

@functools.lru_cache(maxsize=256)
def _format_filename(filename):
    if filename.startswith(sys.exec_prefix):
        return filename.replace(sys.exec_prefix, "...")
    elif filename.startswith(os.path.expanduser("~")):
        return filename.replace(os.path.expanduser("~"), "...")
    else:
        return "..." + filename[-80:]

@ncclementi
Copy link
Member Author

Thanks James, I was going in a similar direction, except I was just thinking of when there is no sys.exec_prefix, arbitrarily keep let's say the last 3 components of the path to keep it some how short.

if filename.startswith(sys.exec_prefix):
        return filename.replace(sys.exec_prefix, "...")
else:
       return '.../' + '/'.join(filename.split('/')[-3:])

But that can end up including '~', I like the idea of using os.path.expanduser("~"). Will include that and modify the test accordingly.

@mrocklin
Copy link
Member

I might suggest that we drop the /lib/python3.X/site-packages/ bit as well. That seems not super informative.

+1 on using ~ as well. Past that I don't know that there is much to do.

@ncclementi
Copy link
Member Author

Added a function _clean_filename function to utils.py that it seems covers all the cases. I can show couple of screenshots of the scheduler's profile (I wasn't sure if it was worth it to create a gist to show the whole profile).
Note I checked the worker profiles and they are all clean too.

Screen Shot 2021-04-29 at 6 40 43 PM
Screen Shot 2021-04-29 at 6 40 59 PM
Screen Shot 2021-04-29 at 6 41 18 PM

@ncclementi
Copy link
Member Author

I still have some comments in the tests, if we think this looks good I'll delete them (they were helping in the debugging process)

@mrocklin
Copy link
Member

The images here look great to me

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the /lib/python3.X/site-packages/ part provides some value as, while not directly useful, gives indirect context on where modules live. For example, in the last screenshot the filename is .../threading.py. I don't know if that's threading.py somewhere in the standard library or a threading.py module I have in my home directory.

I'll propose that maybe we should keep this PR simple and just remove sys.exec_prefix and os.path.expanduser("~"). That accomplishes the goal of removing potentially sensitive information for obvious cases, while still keeping other useful information around.

filename = filename.replace(os.path.expanduser("~"), "")
filename = ".../" + "/".join(filename.split("/")[-3:])
else:
filename = ".../" + "/".join(filename.split("/")[-3:])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, and in a few other places, we're using / as the path separator. This is true for POSIX systems, but not for Windows which uses \\. We should use os.sep instead of /

filename = filename.replace(sys.exec_prefix, "...")
elif filename.startswith(os.path.expanduser("~")):
filename = filename.replace(os.path.expanduser("~"), "")
filename = ".../" + "/".join(filename.split("/")[-3:])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using only the last three parts of the path is too limiting. Some modules, for example dask.dataframe, will have a deeply nested hierarchy where only using the last three parts will remove valuable information:

In [1]: import dask.dataframe as dd

In [2]: dd.read_parquet.__code__.co_filename
Out[2]: '/Users/james/projects/dask/dask/dask/dataframe/io/parquet/core.py'

In [3]: "/".join(dd.read_parquet.__code__.co_filename.split("/")[-3:])
Out[3]: 'io/parquet/core.py'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks James, that makes sense. Now do we want to put a cap on the amount of characters (before you suggested 80) which in this case will keep /projects/dask/dask/dask/dataframe/io/parquet/core.py (len =53) after removing the os.path.expanduser("~"). I'm thinking that 80 characters could end up bringing many directories that are not necessary useful and might be too long, but that depends on the length of the directories name. Not sure how do we want to go about it...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll roll back my previous suggestion about limiting the path in these cases. I suspect in the common case we'll be dealing with either Python stdlib modules or third-party packages in site-packages. We can always truncate things in the future if needed, but let's keep things simple for now.

else:
filename = ".../" + "/".join(filename.split("/")[-3:])

path_remove = r"lib/python[.0-9]*/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some platforms use lib64 instead of lib. See https://docs.python.org/3/library/sys.html#sys.platlibdir for more information.

@mrocklin
Copy link
Member

I think the /lib/python3.X/site-packages/ part provides some value as, while not directly useful, gives indirect context on where modules live.

Maybe we keep site-packages then? Dropping /lib/python3.X would be nice, if only so that we get to ambiguate Python versions.

@jrbourbeau
Copy link
Member

Thinking about this more, I was mistaken. Python standard library packages will look like .../lib/pythonX.Y/threading.py while third-party libraries will have a site-packages part. So to disambiguate between standard library modules (like threading.py) and a threading.py module in one's home directory we need to keep the pythonX.Y part. Though if we're keeping that we might also want to keep the lib part too (it's not that much more) to avoid code complexity in our utility function since some platforms use lib64 instead of lib. If we assume there's no name clashes between third-party and stdlib packages (this seems okay to me), then we can drop lib/pythonX.Y/site-packages in those cases.

Here's a function that should handle the cases mentioned above:

@functools.lru_cache(maxsize=256)
def _clean_filename(filename):
    """Removes potentially sensitive information from a Python module file path"""
    if filename.startswith(sys.exec_prefix):
        filename = filename.replace(sys.exec_prefix, "...")
        if "site-packages" not in filename:
            # Python standard library module
            return filename
        else:
            # Third-party package
            return filename.replace(
                f"lib/python{sys.version_info[0]}.{sys.version_info[1]}/site-packages/",
                "",
            )
    elif filename.startswith(os.path.expanduser("~")):
        return filename.replace(os.path.expanduser("~"), "...")

Also, apologies for all the back-and-forth. My intention isn't to overly nitpick this PR.

@ncclementi
Copy link
Member Author

So what I understand is that we want to remove things of the form lib/pythonX.Y/site-packages/file.py but not things like lib/pythonX.Y/file.py. Now one question I have is, would we ever have the version of python be like pythonX.Y.Z? If so the fstring f"lib/python{sys.version_info[0]}.{sys.version_info[1]}/site-packages/" won't work. That's why I was using this r"lib/python[.0-9]*/. This will catch something like python3.8.5.

The question is, will we ever encounter something in the form pythonX.Y.Z in those paths?

@jrbourbeau
Copy link
Member

I believe it's always pythonX.Y https://docs.python.org/3/library/sys.html#sys.prefix

@ncclementi
Copy link
Member Author

Everything is clean as we want it now. I still want to add a test for the _clean_filename util function, and probably replace the regex checking for lib/lib64. After talking to James, it seems these are the only two cases that could appear so we might want to avoid the regex.

@ncclementi
Copy link
Member Author

After timing the clean_filename function using regex and a version without regex, the version without regex is faster by ~30% so I took out the regex.
Added a test for the clean_filename function.

@ncclementi
Copy link
Member Author

Well Windows tests are failing:
It seems the last test I added to the clean filename PR, to tests the utils function is failing in windows, looks like the path for path_no_sp = os.path.join.__code__.co_filename leads to a different file. ntpath.py instead of posixpath.py.
I can do a separate case like:

if os.sep == "\\":
 assert clean_path_no_sp =="...\lib\ntpath.py"  

Unless we have a cleaner way of doing this...suggestions?

@@ -635,3 +637,33 @@ def foo():

with pytest.warns(DeprecationWarning, match="removed in version 1.2.3"):
assert foo() == "bar"


def test__clean_filename():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of testing for an exact match for the output of _clean_filename, we could try making slightly looser assert statements (e.g. assert sys.exec_prefix not in result) which still provide solid test coverage, but are less platform-specific and probably more robust. Additionally, you may find os.path.join useful for this test.

For example, something along the lines of:

diff --git a/distributed/tests/test_utils.py b/distributed/tests/test_utils.py
index 8d1f1e9d..3c1fcc6b 100644
--- a/distributed/tests/test_utils.py
+++ b/distributed/tests/test_utils.py
@@ -642,28 +642,18 @@ def test_deprecated():
 def test__clean_filename():
     # python standard library case
     # path with sys.exec_prefix and no site-packages
-    path_no_sp = os.path.join.__code__.co_filename
-    clean_path_no_sp = _clean_filename(path_no_sp)
-
-    if f"lib{os.sep}" in path_no_sp:
-        assert (
-            clean_path_no_sp
-            == f"...{os.sep}lib{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}posixpath.py"
-        )
-    elif f"lib64{os.sep}" in path_no_sp:
-        assert (
-            clean_path_no_sp
-            == f"...{os.sep}lib64{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}posixpath.py"
-        )
+    filename = os.path.join.__code__.co_filename
+    result = _clean_filename(filename)
+    assert sys.exec_prefix not in result
+    assert result.startswith(os.path.join("...", "lib"))
+    assert f"python{sys.version_info[0]}.{sys.version_info[1]}" in result

     # Third-party package - site-packages
-    path_third_party = toolz.merge.__code__.co_filename
-    clean_path_third_party = _clean_filename(path_third_party)
-    assert clean_path_third_party == f"...{os.sep}toolz{os.sep}dicttoolz.py"
-
-    # Other case where esername in path
-    path_usr = (
-        os.path.expanduser("~") + f"{os.sep}repository{os.sep}folder{os.sep}filename.py"
-    )
-    clean_path_usr = _clean_filename(path_usr)
-    assert clean_path_usr == f"...{os.sep}repository{os.sep}folder{os.sep}filename.py"
+    filename = toolz.merge.__code__.co_filename
+    result = _clean_filename(filename)
+    assert sys.exec_prefix not in result
+    assert result.startswith(os.path.join("...", "toolz"))
+
+    # Other case where username in path
+    filename = os.path.join(os.path.expanduser("~"), "dir", "filename.py")
+    assert _clean_filename(filename) == filename.replace(os.path.expanduser("~"), "...")



@functools.lru_cache(maxsize=256)
def _clean_filename(filename):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make the lib / lib64 parsing more compact. Also, I forgot to mention that if filename doesn't start with sys.exec_prefix or the user's home directory, we should probably return filename (right now we return None)

diff --git a/distributed/utils.py b/distributed/utils.py
index 1f9cc3a4..8d66f539 100644
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -1555,16 +1555,20 @@ def _clean_filename(filename):
             # Python standard library module
             return filename
         else:
-            if f"lib{os.sep}" in filename:
-                return filename.replace(
-                    f"lib{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}site-packages{os.sep}",
-                    "",
-                )
-            elif f"lib64{os.sep}" in filename:
-                return filename.replace(
-                    f"lib64{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}site-packages{os.sep}",
-                    "",
-                )
-
+            # TODO: Use `sys.platlibdir` once Python 3.9 is the minimum supported Python version
+            if filename.startswith(os.path.join("...", "lib64")):
+                platlibdir = "lib64"
+            else:
+                platlibdir = "lib"
+            return filename.replace(
+                os.path.join(
+                    os.sep + platlibdir,
+                    f"python{sys.version_info[0]}.{sys.version_info[1]}",
+                    "site-packages",
+                ),
+                "",
+            )
     elif filename.startswith(os.path.expanduser("~")):
         return filename.replace(os.path.expanduser("~"), "...")
+    else:
+        return filename

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests kept failing in windows after the modifications, it seems that in windows os.path.join.__code__.co_filename doesn't have pythonX.Y in the path so it was failing when asserting for the version. I put an if statement to only check when the separator is not \\ not sure if it's the cleanest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll re-propose that we just drop everything before site-packages and be done with it.  Trying to reconstruct paths with version numbers and such seems like we're deep in a rabbit hole.

I'll let you two decide on what's best here, but this feels like we're spending too much time being careful about a thing that isn't at all important.

@ncclementi ncclementi changed the title Clean path prefix from filename in profile (WIP) Clean path prefix from filename in profile May 28, 2021
@fjetter
Copy link
Member

fjetter commented Jun 29, 2021

what's the status on this @ncclementi ?

@ncclementi
Copy link
Member Author

@fjetter It's been ready for a while, @jrbourbeau mentioned he wanted to take a final look. James, do you have any extra comments?

@ncclementi
Copy link
Member Author

Checking in here, is this PR something that we still want to get in? If so I can work on the fixes to get CI passing. @jrbourbeau you mentioned you had some comments on this, want to make sure we still want this in if not, we can close it.

cc: @mrocklin since you open the original issue #4716

@fjetter
Copy link
Member

fjetter commented Sep 28, 2021

Test failures seem to be genuine failures. There is also a linting issue. Can you have a look please? I think this still makes sense, especially if it is already done

Comment on lines +1444 to +1460
if os.sep == "\\":
return filename.replace(
os.path.join(
os.sep + platlibdir,
"site-packages",
),
"",
)
else:
return filename.replace(
os.path.join(
os.sep + platlibdir,
f"python{sys.version_info[0]}.{sys.version_info[1]}",
"site-packages",
),
"",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a comment about us not dealing with version numbers and path reconstruction, etc. I don't mind going ahead with this since it's been sitting for so long. However, a rather simple and possibly generic implementation would be

In [1]: import os

In [2]: def func(filename):
   ...:     filename_cut = filename.split("site-packages")[-1]
   ...:     filename_stripped = filename_cut.replace(os.sep, "", 1 if filename.startswith(os.sep) else 0)
   ...:     return filename_stripped
   ...:

In [3]: func("foo")
Out[3]: 'foo'

In [4]: func("/Users/fjetter/mambaforge/envs/dask-distributed/lib/python3.8/site-packages/pandas/core/computation/expr.py")
Out[4]: 'pandas/core/computation/expr.py'

In [5]: func("~/my_very_own_directory/foo/bar")
Out[5]: '~/my_very_own_directory/foo/bar'

In [6]: func("/root/is/stripped/if/no/site-package/")
Out[6]: 'root/is/stripped/if/no/site-package/'

the only assumption is that the paths are sitting in site-packages but that's it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if we don't care about keeping the python version we can go with striping everything to the left from site-packages included. I think at the moment @jrbourbeau mentioned that it would be useful to have the python version and that's why we went that route.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove path prefix from filename in profile
4 participants