Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support to upload modules from zip files #886

Merged
merged 4 commits into from Feb 22, 2017
Merged

Support to upload modules from zip files #886

merged 4 commits into from Feb 22, 2017

Conversation

bmaisonn
Copy link
Contributor

Fix issue (#865).

Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally this looks pretty good to me. A few comments.

if os.path.exists('myfile.zip'):
os.remove('myfile.zip')

sleep(1) # TODO: why is this necessary?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this is not present?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one second resolution may depend on the filesystem. Try importlib.invalidate_caches() instead. (see https://docs.python.org/3/library/importlib.html#importlib.invalidate_caches ). It isn't needed on Python 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no case where we call invalidate_cache in the upload_file method.
Should I add it in all cases ?
In the same way the sleep call is in all the upload_file tests.
Should I remove it from all the tests ?

                if ext in ('.py', '.pyc'):
                    logger.info("Reload module %s from .py file", name)
                    name = name.split('-')[0]
                    reload(import_module(name))
                if ext == '.egg':
                    import pkg_resources
                    sys.path.append(out_filename)
                    pkgs = pkg_resources.find_distributions(out_filename)
                    for pkg in pkgs:
                        logger.info("Load module %s from egg", pkg.project_name)
                        reload(import_module(pkg.project_name))
                    if not pkgs:
                        logger.warning("Found no packages in egg file")
                if ext == '.zip':
                    logger.info("Reload module %s from .zip file", name)
                    if out_filename not in sys.path:
                        sys.path.insert(0, out_filename)
                    invalidate_caches()
                    reload(import_module(name))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should at least try it and see it if works, yes :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the invalid_caches and commonalized some code:

try:
                name, ext = os.path.splitext(filename)
                names_to_import = []
                if ext in ('.py', '.pyc'):
                    names_to_import.append(name)
                if ext in ('.egg', '.zip'):
                    if out_filename not in sys.path:
                        sys.path.insert(0, out_filename)
                    if ext == '.egg':
                        import pkg_resources
                        pkgs = pkg_resources.find_distributions(out_filename)
                        for pkg in pkgs:
                            names_to_import.append(pkg.project_name)
                    elif ext == '.zip':
                        names_to_import.append(name)
                
                if not names_to_import:
                    logger.warning("Found nothing to import from %s", filename)
                else:
                    invalidate_caches()
                    for name in names_to_import:
                        logger.info("Reload module %s from %s file", name, ext)
                        reload(import_module(name))

I removed the sleep in my test case and it works fine (But it also works without the call to invalidate_caches).
However the test still fails in the 'test_upload_file' case which only uploads .py or .pyc files

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be the case indeed. You may want to try to remove the cached bytecode file. On Python 3, see https://docs.python.org/3/library/importlib.html#importlib.util.cache_from_source . On Python 2, you will need to reimplement that function yourself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok i'll try this. It starts to be much bigger than the original change :)
By the way, I'm not used to write code for Python2/Python3
Is there a good way to do that ? What I see on forums is something like that:

            if hasattr(importlib, "invalidate_caches"):
                importlib.invalidate_caches()

Is that right ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, though we have a distributed.compatibility module where you could add a invalidate_import_caches function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made the changes to delete the pyc file
It works fine now without the sleep(1) call

os.remove('myfile.zip')

sleep(1) # TODO: why is this necessary?
x = c.submit(g, pure=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't need pure=False here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same way pure=False is used in the other upload_file test but i agree that it could be removed.
Should I remove it from the other tests ?

return myfile.f()

with tmp_text('myfile.py', 'def f():\n return 123') as fn_my_file, \
tmp_text('init.py', '') as fn_init:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the file init.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually you're right. I thought that the init.py was necessary to make a valid module but actually it's not. I tried to remove it and the test passed as well.
Thx for the reveiw

@@ -1116,6 +1117,27 @@ def g():
result = yield y._result()
assert result == 456

@gen_cluster(client=True)
def test_upload_file_zip(c, s, a, b):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test would be more robust if it scrubbed out myfile from sys.modules at the start and at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to get this one.
This module is uploaded locally on workers , right ? The workers aren't restarted before each test ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With gen_cluster, the workers should run in the current process, so they would inherit the current (process-wide) import state. For example, perhaps the myfile module seen in this test actually comes from another one of the tests here.

It seems all tests for upload_file are written in this style. I think it would deserve fixing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so i'm going to apply all those modifications to all tests :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the upload_file_* tests i mean

logger.info("Reload module %s from .zip file", name)
if out_filename not in sys.path:
sys.path.insert(0, out_filename)
name = name.split('-')[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question: why this line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept this line from ('.py', '.pyc') case

                if ext in ('.py', '.pyc'):
                    logger.info("Reload module %s from .py file", name)
                    name = name.split('-')[0]
                    reload(import_module(name))

Actually I don't know what it is for, so i can remove it in the zip case

if out_filename not in sys.path:
sys.path.insert(0, out_filename)
name = name.split('-')[0]
reload(import_module(name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invalidate_caches should probably be called just before this line.

@bmaisonn
Copy link
Contributor Author

It starts to be confusing here...
I modified my test to integrate your reviews (Using invalid_cache, remove sleep and pure=False).
I also modified the test to make the upload twice with different code and check if the modifications are correctly taken into account (As it's done in test_upload_file):

@gen_cluster(client=True)
def test_upload_file_zip(c, s, a, b):
    def g():
        import myfile
        return myfile.f()

    try:
        for value in [123, 456]:
            with tmp_text('myfile.py', 'def f():\n    return {}'.format(value)) as fn_my_file:
                with zipfile.ZipFile('myfile.zip', 'w') as z:
                    z.write(fn_my_file, arcname=os.path.basename(fn_my_file))
                yield c._upload_file('myfile.zip')
                    
                x = c.submit(g)
                result = yield x._result()
                assert result == value
    finally:
        if os.path.exists('myfile.zip'):
            os.remove('myfile.zip')
        if 'myfile' in sys.modules:
            del sys.modules['myfile']
        for path in sys.path:
            if os.path.basename(path) == 'myfile.zip':
                sys.path.remove(path)
                break

With this code the second test assert result == 456 fails as result == 123. So the second file isn't updated.
The same happens in test_upload_file is I remove pure=False OR sleep(1).
I really don't understand why it fails here ...

@mrocklin
Copy link
Member

mrocklin commented Feb 21, 2017 via email

@bmaisonn
Copy link
Contributor Author

Ok, i misunderstood the meaning of pure.
It works now
Thx

…e now remove the pyc files corresponding to the .py file being uploaded

As importlib.invalidate_caches and importlib.util.cache_from_source are python3 specifc move them in compatibility (+1 squashed commits)

Squashed commits:

[92d94b4] Code reviews:
In worker.py::upload_file:
Commonalize code,
Remove split('-') as it seems useless,
Use invalid_caches before reloading modules

In test_client.py::test_upload_file_*:
commonalize code
Make two tries to check for upload updates
@bmaisonn
Copy link
Contributor Author

I commited a new change to integrate the reviews
Both tests pass in python 2 and python 3

else:
for name in names_to_import:
logger.info("Reload module %s from %s file", name, ext)
invalidate_caches()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a nit: you can pull invalidate_caches() out of the for loop.

@pitrou
Copy link
Member

pitrou commented Feb 22, 2017

Thanks a lot!
There's one thing missing: you should update the Client.upload_file docstring.

@bmaisonn
Copy link
Contributor Author

I moved the call to invalidate_caches and updated the documentation
Thx for you help!

@pitrou
Copy link
Member

pitrou commented Feb 22, 2017

I don't have anything to add here, perhaps @mrocklin wants to take a last look.

@mrocklin
Copy link
Member

Happy to trust you two on this. Thank you both for the effort and review.

@pitrou pitrou merged commit 050c2f9 into dask:master Feb 22, 2017
@pitrou pitrou mentioned this pull request Feb 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants