
Require additional dependencies: cloudpickle, partd, fsspec, toolz #7345

Merged
merged 30 commits into from Mar 12, 2021

Conversation

@jsignell (Member) commented Mar 9, 2021

I am planning on adding commits to remove handling of cases where these packages aren't available.

@jrbourbeau (Member) left a comment

Nice! Thanks @jsignell -- let me know when you'd like someone to review

@jsignell (Member, Author) commented Mar 9, 2021

This failure has me kind of stumped.

=================================== FAILURES ===================================
_______________________ test_works_with_highlevel_graph ________________________

    def test_works_with_highlevel_graph():
        """Previously `dask.multiprocessing.get` would accidentally forward
        `HighLevelGraph` graphs through the dask optimization/scheduling routines,
        resulting in odd errors. One way to trigger this was to have a
        non-indexable object in a task. This is just a smoketest to ensure that
        things work properly even if `HighLevelGraph` objects get passed to
        `dask.multiprocessing.get`. See https://github.com/dask/dask/issues/7190.
        """
    
        class NoIndex:
            def __init__(self, x):
                self.x = x
    
            def __getitem__(self, key):
                raise Exception("Oh no!")
    
        x = delayed(lambda x: x)(NoIndex(1))
        (res,) = get(x.dask, x.__dask_keys__())
>       assert isinstance(res, NoIndex)
E       AssertionError: assert False
E        +  where False = isinstance(<dask.tests.test_multiprocessing.NoIndex object at 0x7fd42c727490>, <class 'dask.tests.test_multiprocessing.test_works_with_highlevel_graph.<locals>.NoIndex'>)

dask/tests/test_multiprocessing.py:174: AssertionError
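(For context, a stdlib-only sketch of why this round trip is tricky: plain pickle stores classes by reference to an importable module-level name, so a class defined inside a function body cannot be pickled at all. cloudpickle serializes such classes by value instead; the fix in its 1.1.1 release, noted below, is presumably what makes the deserialized class compare equal to the original again.)

```python
import pickle

def make_local_class():
    # Mirrors NoIndex in the failing test: the class is defined in a local scope
    class NoIndex:
        def __init__(self, x):
            self.x = x
    return NoIndex

cls = make_local_class()
try:
    pickle.dumps(cls)
    picklable = True
except Exception:
    # pickle refers to classes by module.qualname;
    # 'make_local_class.<locals>.NoIndex' is not importable, so it fails outright
    picklable = False
print(picklable)
```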

@jrbourbeau (Member):

That looks to be an issue with cloudpickle that was fixed in the 1.1.1 release. We can fix things either by moving the NoIndex class definition outside the body of test_works_with_highlevel_graph, or by updating our minimum cloudpickle to >=1.1.1 (which is coming up on two years old).

@jsignell (Member, Author) commented Mar 9, 2021

Oh nice find! I say let's bump up our cloudpickle.

@@ -74,6 +74,28 @@ jobs:
shell: bash -l {0}
run: pytest -v --doctest-modules --ignore-glob='*/test_*.py' dask

doctest-pip:
@jsignell (Member, Author):

I'm not sure if we care about this, but since it came up in #7358 I thought I'd try it out. I added a skip in conftest.py to support this.

@jsignell jsignell marked this pull request as ready for review March 10, 2021 16:30
Resolved review threads: dask/bytes/__init__.py, setup.py
@jsignell (Member, Author):

Ok @crusaderky and @jrbourbeau I think this is ready for you to take a look. I put some questions inline in comments.

@jrbourbeau (Member) left a comment

Overall the changes here look good, thanks for working on this @jsignell. Some comments in addition to the ones below:

  • There are a couple of pytest.importorskip("tlz") occurrences we can remove
  • Same with pytest.importorskip("dask.bag") and import_or_none("dask.bag")

Resolved review threads: .github/workflows/ci-additional.yml, conftest.py, dask/bag/text.py, dask/bytes/__init__.py, dask/bytes/core.py, docs/source/install.rst, setup.py, dask/array/core.py, dask/bag/avro.py, dask/dataframe/io/json.py, dask/diagnostics/profile_visualize.py, dask/tests/test_delayed.py, dask/tests/test_multiprocessing.py
@crusaderky (Collaborator):

Please fix the docstring of dask.multiprocessing.get:

    func_dumps : function
        Function to use for function serialization
        (defaults to cloudpickle.dumps if available, otherwise pickle.dumps)
    func_loads : function
        Function to use for function deserialization
        (defaults to cloudpickle.loads if available, otherwise pickle.loads)
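A possible rewrite of that fragment once cloudpickle is unconditionally required (a suggestion only, not the merged wording):

```python
# Suggested docstring fragment for dask.multiprocessing.get now that
# cloudpickle is a hard dependency: the pickle fallback clause goes away.
DOC_FRAGMENT = """\
func_dumps : function
    Function to use for function serialization (defaults to cloudpickle.dumps)
func_loads : function
    Function to use for function deserialization (defaults to cloudpickle.loads)
"""
print("fallback clause removed:", "otherwise" not in DOC_FRAGMENT)
```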

@crusaderky (Collaborator):

In docs/source/bag.rst, please remove:

Because the multiprocessing scheduler requires moving functions between multiple
processes, we encourage that Dask Bag users also install the cloudpickle_ library to
enable the transfer of more complex functions.

[...]
.. _cloudpickle: https://github.com/cloudpipe/cloudpickle

@jsignell (Member, Author):

Ok! I have addressed all the comments. Thanks for the reviews @crusaderky and @jrbourbeau!

@jsignell (Member, Author):

Note that this will require changes in the conda recipes. @martindurant I don't know if we need to let the Anaconda conda team know about this. I'll open a PR in the conda-forge recipe now.

@martindurant (Member):

Thanks @jsignell. I don't think so normally, but I passed it on anyway.

@jsignell (Member, Author):

🎉 green! 🎉 @jrbourbeau @crusaderky are you happy with the state of this?

Comment on lines +11 to +13
"array": ["numpy >= 1.15.1"],
"bag": [], # keeping for backwards compatibility
"dataframe": ["numpy >= 1.15.1", "pandas >= 0.25.0"],
Member:

No need to do anything in this PR, but I'm curious: when do we want to bump these, and what versions should we bump them to?

Member Author:

We bumped them a few months ago to these values. There was some talk in the maintenance meeting about using "released more than n months ago" as the floor. Apparently xarray does something like this? But we didn't choose a specific policy.

Collaborator:

At xarray and pint we use a rolling policy based on NEP-29:
https://xarray.pydata.org/en/stable/installing.html#minimum-dependency-versions
The number of months listed is arbitrary and subject to negotiation. The key benefit of such a policy is that it allows developers to bump minimum dependencies as needed, without initiating a discussion every time.
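As an illustration, such a rolling floor can be computed mechanically (the 24-month window and 30-day month below are illustrative assumptions, not an agreed dask policy):

```python
from datetime import date, timedelta

def rolling_support_floor(today: date, months: int = 24) -> date:
    """NEP-29-style rule of thumb: dependency releases older than
    `months` months are fair game to drop without a discussion."""
    return today - timedelta(days=30 * months)  # approximate a month as 30 days

# Example: the cutoff as of this PR's merge date
cutoff = rolling_support_floor(date(2021, 3, 12), months=24)
# Dependency versions released before `cutoff` could then be dropped.
```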

Member:

Leaving it alone is fine. I think we have some workarounds for NumPy pre-1.17 that could be cleaned out once that is the minimum.

cc @pentschev (in case you have thoughts here 🙂)

Member:

Yeah, I also agree we could consider dropping older versions of NumPy. Since @crusaderky mentioned NEP-29, it's worth noting that NumPy supports Python releases for 18 months. Is there any reason for us to keep supporting NumPy versions (or any other dependencies, for that matter) that are much older than that?

Member Author:

I opened #7378 to continue this conversation

- install_requires = ["pyyaml"]
+ install_requires = [
+     "pyyaml",
+     "cloudpickle >= 1.1.1",
Member:

I wonder what we think about bumping this to 1.5.0. That version added support for the pickle5 backport package. Admittedly that is only needed for Python pre-3.8, which for Dask now means just Python 3.7.

Collaborator:

I don't think dask/dask has any need to directly acknowledge pickle protocol 5? 1.5.0 is very recent...

Member Author:

distributed pins to 1.5.0, but since that is less than a year old it seemed like overkill for dask/dask.

Member:

Cloudpickle will use it under the hood if installed.

Yeah I don't have strong feelings here. Just wanted to raise for discussion. This also will become irrelevant once Python 3.7 is dropped (guessing that will coincide with the Python 3.10 release)
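For reference, a sketch of the usual conditional-import pattern for the pickle5 backport mentioned above (the pickle5 package name is real; whether dask would need this directly is exactly the open question in this thread):

```python
import sys

if sys.version_info >= (3, 8):
    import pickle  # protocol 5 is in the stdlib from Python 3.8 onward
else:
    import pickle5 as pickle  # backport package, needed only on Python 3.7

# Protocol 5 round trip; on 3.8+ this needs no extra dependency at all
data = {"chunks": list(range(10))}
stream = pickle.dumps(data, protocol=5)
restored = pickle.loads(stream)
```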

Resolved review threads: dask/diagnostics/profile_visualize.py, dask/multiprocessing.py
jsignell and others added 2 commits March 11, 2021 12:55
Co-authored-by: crusaderky <crusaderky@gmail.com>
@jrbourbeau (Member) left a comment

Thanks @jsignell!

@jrbourbeau jrbourbeau merged commit a62fced into dask:main Mar 12, 2021
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Mar 12, 2021
dask/dask#7345 removed some imports that we were improperly using from a dask module. Fix the imports to properly target `fsspec`.

Authors:
  - Keith Kraus (@kkraus14)

Approvers:
  - @jakirkham
  - Ashwin Srinath (@shwina)

URL: #7580
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
Development

Successfully merging this pull request may close these issues.

Make trivial dependencies mandatory
6 participants