-
-
Notifications
You must be signed in to change notification settings - Fork 710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean path prefix from filename in profile #4765
base: main
Are you sure you want to change the base?
Conversation
It would be ideal to remoe these, but not essential I don't think. A 90% solution sounds pretty good to me :) Quick question, have you spot checked that things look good in the /profile tab or the performance report? |
Good question Matt, no I didn't check how the report looks like. I'm trying to assign a filename to the profile when generating it so I can take a look at it, but this is breaking the test. So I'll work on that and see what we get. |
The two common ways to view a profile are in the /profile tab of the
dashboard and with the performance_report context manager. Taking the time
to familiarize yourself with both of them is time well spent I think :)
…On Thu, Apr 29, 2021 at 9:28 AM Naty Clementi ***@***.***> wrote:
Good question Matt, no I didn't check how the report looks like. I'm
trying to assign a filename to the profile when generating it so I can take
a look at it, but this is breaking the test. So I'll work on that and see
what we get.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#4765 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTGQFNZTY5HSPZSZSMLTLFUJRANCNFSM43X4PCCQ>
.
|
For example, you can run the following code: import dask.array as da
from distributed import Client, performance_report
if __name__ == "__main__":
with Client() as client:
with performance_report("my-performance-report.html"):
x = da.random.random((1_000, 1_000), chunks=(50, 50))
result = (x + x.T).mean(axis=0).sum()
result.compute() which will generate a performance report and save it to a local file named |
Co-authored-by: James Bourbeau <jrbourbeau@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To Matt's point about a 90% solution, I agree it totally fine for us to cover some obvious cases now and address other edge cases as they pop up. To that end, removing sys.exec_prefix
as you already have seems like a great start.
We could also remove the user's home directory (os.path.expanduser("~")
) if it's present in the filename. As an attempt to cover other cases and to prevent the filename from taking up too much space in profile plots we could also only display the last, for example, 80 characters.
A _format_filename
utility function something like what's below might do the trick (note I've not tested the code below at all)
import functools
@functools.lru_cache(maxsize=256)
def _format_filename(filename):
if filename.startswith(sys.exec_prefix):
return filename.replace(sys.exec_prefix, "...")
elif filename.startswith(os.path.expanduser("~")):
return filename.replace(os.path.expanduser("~"), "...")
else:
return "..." + filename[-80:]
Thanks James, I was going in a similar direction, except I was just thinking of when there is no if filename.startswith(sys.exec_prefix):
return filename.replace(sys.exec_prefix, "...")
else:
return '.../' + '/'.join(filename.split('/')[-3:]) But that can end up including '~', I like the idea of using |
I might suggest that we drop the +1 on using |
…ed into clean_fname_profile
I still have some comments in the tests, if we think this looks good I'll delete them (they were helping in the debugging process) |
The images here look great to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the /lib/python3.X/site-packages/
part provides some value as, while not directly useful, gives indirect context on where modules live. For example, in the last screenshot the filename is .../threading.py
. I don't know if that's threading.py
somewhere in the standard library or a threading.py
module I have in my home directory.
I'll propose that maybe we should keep this PR simple and just remove sys.exec_prefix
and os.path.expanduser("~")
. That accomplishes the goal of removing potentially sensitive information for obvious cases, while still keeping other useful information around.
distributed/utils.py
Outdated
filename = filename.replace(os.path.expanduser("~"), "") | ||
filename = ".../" + "/".join(filename.split("/")[-3:]) | ||
else: | ||
filename = ".../" + "/".join(filename.split("/")[-3:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, and in a few other places, we're using /
as the path separator. This is true for POSIX systems, but not for Windows which uses \\
. We should use os.sep
instead of /
distributed/utils.py
Outdated
filename = filename.replace(sys.exec_prefix, "...") | ||
elif filename.startswith(os.path.expanduser("~")): | ||
filename = filename.replace(os.path.expanduser("~"), "") | ||
filename = ".../" + "/".join(filename.split("/")[-3:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using only the last three parts of the path is too limiting. Some modules, for example dask.dataframe
, will have a deeply nested hierarchy where only using the last three parts will remove valuable information:
In [1]: import dask.dataframe as dd
In [2]: dd.read_parquet.__code__.co_filename
Out[2]: '/Users/james/projects/dask/dask/dask/dataframe/io/parquet/core.py'
In [3]: "/".join(dd.read_parquet.__code__.co_filename.split("/")[-3:])
Out[3]: 'io/parquet/core.py'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks James, that makes sense. Now do we want to put a cap on the amount of characters (before you suggested 80) which in this case will keep /projects/dask/dask/dask/dataframe/io/parquet/core.py
(len =53) after removing the os.path.expanduser("~")
. I'm thinking that 80 characters could end up bringing many directories that are not necessary useful and might be too long, but that depends on the length of the directories name. Not sure how do we want to go about it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll roll back my previous suggestion about limiting the path in these cases. I suspect in the common case we'll be dealing with either Python stdlib modules or third-party packages in site-packages
. We can always truncate things in the future if needed, but let's keep things simple for now.
distributed/utils.py
Outdated
else: | ||
filename = ".../" + "/".join(filename.split("/")[-3:]) | ||
|
||
path_remove = r"lib/python[.0-9]*/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some platforms use lib64
instead of lib
. See https://docs.python.org/3/library/sys.html#sys.platlibdir for more information.
Maybe we keep |
Thinking about this more, I was mistaken. Python standard library packages will look like Here's a function that should handle the cases mentioned above: @functools.lru_cache(maxsize=256)
def _clean_filename(filename):
"""Removes potentially sensitive information from a Python module file path"""
if filename.startswith(sys.exec_prefix):
filename = filename.replace(sys.exec_prefix, "...")
if "site-packages" not in filename:
# Python standard library module
return filename
else:
# Third-party package
return filename.replace(
f"lib/python{sys.version_info[0]}.{sys.version_info[1]}/site-packages/",
"",
)
elif filename.startswith(os.path.expanduser("~")):
return filename.replace(os.path.expanduser("~"), "...") Also, apologies for all the back-and-forth. My intention isn't to overly nitpick this PR. |
So what I understand is that we want to remove things of the form The question is, will we ever encounter something in the form |
I believe it's always |
Everything is clean as we want it now. I still want to add a test for the |
After timing the |
Well Windows tests are failing:
Unless we have a cleaner way of doing this...suggestions? |
@@ -635,3 +637,33 @@ def foo(): | |||
|
|||
with pytest.warns(DeprecationWarning, match="removed in version 1.2.3"): | |||
assert foo() == "bar" | |||
|
|||
|
|||
def test__clean_filename(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of testing for an exact match for the output of _clean_filename
, we could try making slightly looser assert
statements (e.g. assert sys.exec_prefix not in result
) which still provide solid test coverage, but are less platform-specific and probably more robust. Additionally, you may find os.path.join
useful for this test.
For example, something along the lines of:
diff --git a/distributed/tests/test_utils.py b/distributed/tests/test_utils.py
index 8d1f1e9d..3c1fcc6b 100644
--- a/distributed/tests/test_utils.py
+++ b/distributed/tests/test_utils.py
@@ -642,28 +642,18 @@ def test_deprecated():
def test__clean_filename():
# python standard library case
# path with sys.exec_prefix and no site-packages
- path_no_sp = os.path.join.__code__.co_filename
- clean_path_no_sp = _clean_filename(path_no_sp)
-
- if f"lib{os.sep}" in path_no_sp:
- assert (
- clean_path_no_sp
- == f"...{os.sep}lib{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}posixpath.py"
- )
- elif f"lib64{os.sep}" in path_no_sp:
- assert (
- clean_path_no_sp
- == f"...{os.sep}lib64{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}posixpath.py"
- )
+ filename = os.path.join.__code__.co_filename
+ result = _clean_filename(filename)
+ assert sys.exec_prefix not in result
+ assert result.startswith(os.path.join("...", "lib"))
+ assert f"python{sys.version_info[0]}.{sys.version_info[1]}" in result
# Third-party package - site-packages
- path_third_party = toolz.merge.__code__.co_filename
- clean_path_third_party = _clean_filename(path_third_party)
- assert clean_path_third_party == f"...{os.sep}toolz{os.sep}dicttoolz.py"
-
- # Other case where esername in path
- path_usr = (
- os.path.expanduser("~") + f"{os.sep}repository{os.sep}folder{os.sep}filename.py"
- )
- clean_path_usr = _clean_filename(path_usr)
- assert clean_path_usr == f"...{os.sep}repository{os.sep}folder{os.sep}filename.py"
+ filename = toolz.merge.__code__.co_filename
+ result = _clean_filename(filename)
+ assert sys.exec_prefix not in result
+ assert result.startswith(os.path.join("...", "toolz"))
+
+ # Other case where username in path
+ filename = os.path.join(os.path.expanduser("~"), "dir", "filename.py")
+ assert _clean_filename(filename) == filename.replace(os.path.expanduser("~"), "...")
|
||
|
||
@functools.lru_cache(maxsize=256) | ||
def _clean_filename(filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make the lib
/ lib64
parsing more compact. Also, I forgot to mention that if filename
doesn't start with sys.exec_prefix
or the user's home directory, we should probably return filename
(right now we return None
)
diff --git a/distributed/utils.py b/distributed/utils.py
index 1f9cc3a4..8d66f539 100644
--- a/distributed/utils.py
+++ b/distributed/utils.py
@@ -1555,16 +1555,20 @@ def _clean_filename(filename):
# Python standard library module
return filename
else:
- if f"lib{os.sep}" in filename:
- return filename.replace(
- f"lib{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}site-packages{os.sep}",
- "",
- )
- elif f"lib64{os.sep}" in filename:
- return filename.replace(
- f"lib64{os.sep}python{sys.version_info[0]}.{sys.version_info[1]}{os.sep}site-packages{os.sep}",
- "",
- )
-
+ # TODO: Use `sys.platlibdir` once Python 3.9 is the minimum supported Python version
+ if filename.startswith(os.path.join("...", "lib64")):
+ platlibdir = "lib64"
+ else:
+ platlibdir = "lib"
+ return filename.replace(
+ os.path.join(
+ os.sep + platlibdir,
+ f"python{sys.version_info[0]}.{sys.version_info[1]}",
+ "site-packages",
+ ),
+ "",
+ )
elif filename.startswith(os.path.expanduser("~")):
return filename.replace(os.path.expanduser("~"), "...")
+ else:
+ return filename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests kept failing in windows after the modifications, it seems that in windows os.path.join.__code__.co_filename
doesn't have pythonX.Y
in the path so it was failing when asserting for the version. I put an if statement to only check when the separator is not \\
not sure if it's the cleanest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll re-propose that we just drop everything before site-packages and be done with it. Trying to reconstruct paths with version numbers and such seems like we're deep in a rabbit hole.
I'll let you two decide on what's best here, but this feels like we're spending too much time being careful about a thing that isn't at all important.
what's the status on this @ncclementi ? |
@fjetter It's been ready for a while, @jrbourbeau mentioned he wanted to take a final look. James, do you have any extra comments? |
Checking in here, is this PR something that we still want to get in? If so I can work on the fixes to get CI passing. @jrbourbeau you mentioned you had some comments on this, want to make sure we still want this in if not, we can close it. |
Test failures seem to be genuine failures. There is also a linting issue. Can you have a look please? I think this still makes sense, especially if it is already done |
if os.sep == "\\": | ||
return filename.replace( | ||
os.path.join( | ||
os.sep + platlibdir, | ||
"site-packages", | ||
), | ||
"", | ||
) | ||
else: | ||
return filename.replace( | ||
os.path.join( | ||
os.sep + platlibdir, | ||
f"python{sys.version_info[0]}.{sys.version_info[1]}", | ||
"site-packages", | ||
), | ||
"", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a comment about us not dealing with version numbers and path reconstruction, etc. I don't mind going ahead with this since it's been sitting for so long. However, a rather simple and possibly generic implementation would be
In [1]: import os
In [2]: def func(filename):
...: filename_cut = filename.split("site-packages")[-1]
...: filename_stripped = filename_cut.replace(os.sep, "", 1 if filename.startswith(os.sep) else 0)
...: return filename_stripped
...:
In [3]: func("foo")
Out[3]: 'foo'
In [4]: func("/Users/fjetter/mambaforge/envs/dask-distributed/lib/python3.8/site-packages/pandas/core/computation/expr.py")
Out[4]: 'pandas/core/computation/expr.py'
In [5]: func("~/my_very_own_directory/foo/bar")
Out[5]: '~/my_very_own_directory/foo/bar'
In [6]: func("/root/is/stripped/if/no/site-package/")
Out[6]: 'root/is/stripped/if/no/site-package/'
the only assumption is that the paths are sitting in site-packages but that's it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well if we don't care about keeping the python version we can go with striping everything to the left from site-packages included. I think at the moment @jrbourbeau mentioned that it would be useful to have the python version and that's why we went that route.
We remove the
sys.exec_prefix
from the filename in the profile. I think we are still missing couple of cases where the filename won't have thesys.exec_prefix
but some other information that we want to remove.For example in the current test, if instead of doing
x = await c.submit(sleep, 1)
we use a function likeslowinc
that lives indistributed.utils_test
we will have a path of the form/Users/username/Documents/distributed/distributed/utils_test.py.
we should see for cases like this one, what do we want to remove.black distributed
/flake8 distributed
/isort distributed