Use dask.delayed instead of the distributed client #222

leouieda · 2020-01-16T19:00:04Z

Started work on using dask.delayed instead of the futures interface
(client.submit). It's easier and allows building the entire graph lazily
before executing. Still need tests and to port the SplineCV code to
this. Will deprecate the client argument and remove in 2.0.0.
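
For readers unfamiliar with the difference, here is a minimal sketch of the lazy pattern that dask.delayed enables. The `score_fold` function and the `folds` data are made-up stand-ins and are not Verde code; only `dask.delayed` and `dask.compute` are real Dask calls.

```python
import dask


def score_fold(fold):
    """Hypothetical stand-in for fitting and scoring one cross-validation fold."""
    train, test = fold
    return sum(test) / (sum(train) + sum(test))


# Made-up folds: each entry is a (train, test) pair
folds = [([1, 2, 3], [4]), ([1, 2, 4], [3]), ([1, 3, 4], [2])]

# Wrapping the function defers the call. Nothing is computed here: each
# entry is a Delayed object representing one node of the task graph.
lazy_scores = [dask.delayed(score_fold)(fold) for fold in folds]

# The whole graph is executed only when compute is called.
scores = dask.compute(*lazy_scores)
```

With the futures interface, each `client.submit` call would start running as soon as it is issued, so there is no point at which the full graph exists before execution.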

Reminders

Run make format and make check to make sure the code follows the style guide.
Add tests for new features or tests that would have caught the bug that you're fixing.
Add new public functions/methods/classes to doc/api/index.rst and verde/__init__.py.
Write detailed docstrings for all functions/classes/methods.
If adding new functionality, add an example to the docstring, gallery, and/or tutorials.

leouieda · 2020-01-17T11:37:14Z

Having Dask be optional is causing headaches:

Can't use it in the documentation AND have builds that don't install Dask to test that it's really optional. Since we always build the docs, there is no easy way to have both.
Need to wrap all calls to dask in functions that raise exceptions when it's not installed. This increases the number of tests for each of these functions and is a bit tedious to write.
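
To give an idea of the wrapping involved, here is a rough sketch of a guard for an optional dependency. The `require_optional` helper is a hypothetical name used for illustration and is not part of Verde; it relies only on the standard library.

```python
import importlib


def require_optional(name):
    """
    Import an optional module by name (hypothetical helper).

    Raises an ImportError with a clearer message if the module is missing.
    """
    try:
        return importlib.import_module(name)
    except ImportError as error:
        raise ImportError(
            f"Missing optional dependency '{name}'. Install it to use this feature."
        ) from error


# A module that is always available imports normally
math_module = require_optional("math")
```

Every function that touches the optional package needs a call like this plus a test for the failure path, which is the overhead that goes away if Dask becomes a required dependency.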

Dask is a really lightweight dependency (pure Python with minimal requirements) and I would like to add more support for it in the future (to enable parallel fitting and larger than RAM Jacobians).

I'm thinking we should just adopt Dask as a dependency. Scikit-learn is a much heavier dependency and we already have it. This is nothing compared to GDAL and the geospatial stack.

@santisoler @jessepisel any thoughts/objections on this?

Separated so we can revert this easily if needed.

santisoler · 2020-01-20T12:46:27Z

I'm seeing Dask in the future of Verde, especially for solving problems that include larger-than-memory arrays. So I think we could add it as a dependency right away and save the headaches of writing annoying lines on test functions and struggling with building the docs.
It's lightweight, so I don't see why we shouldn't add it as a dependency.

leouieda · 2020-01-20T13:34:10Z

👍 alright, I'll try to get this one finished soon. Thanks, @santisoler

jessepisel · 2020-01-20T15:11:11Z

I agree with @santisoler that Dask should be a dependency. It will be easier to deprecate the futures interface and not have to write new tests for both options. This will be nice to have for big grids that don't fit in memory @leouieda !

leouieda · 2020-01-20T15:30:35Z

Thanks for the input @jessepisel and @santisoler 👍

> This will be nice to have for big grids that don't fit in memory @leouieda !

I'm still struggling with ways of doing this. The dask-ml package has some optimization methods for linear problems that we could use to run the least-squares fit. But it's not stable enough to be a dependency. Right now, solving the system with the linear algebra in dask is bad because there are a lot of limitations (need square chunks, for one). That's where we should start looking at pylops. Matteo is working on a distributed version of that using dask. I'm eager to play around with it and see if we can make it work.

leouieda · 2020-01-20T15:33:15Z

Another option is to break up the fit into windows and run each window separately (using dask).
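
A toy sketch of the windowing idea, in plain Python. The helper names, the 1D data, and the use of a mean as the "fit" are all assumptions made for illustration; a real version would fit a gridder in each window.

```python
def split_into_windows(points, window_size):
    """Group (x, value) points into consecutive windows of the given width."""
    windows = {}
    for x, value in points:
        index = int(x // window_size)
        windows.setdefault(index, []).append((x, value))
    return [windows[index] for index in sorted(windows)]


def fit_window(window):
    """Stand-in for a per-window fit: here it is just the mean of the values."""
    values = [value for _, value in window]
    return sum(values) / len(values)


# Made-up data: (x coordinate, observed value)
points = [(0.5, 1.0), (1.5, 3.0), (2.5, 10.0), (3.5, 20.0)]
windows = split_into_windows(points, window_size=2.0)

# Each call depends only on its own window, so these are the independent
# units of work that could be handed to dask.
fits = [fit_window(window) for window in windows]
```

Because no window needs the data of another, only one window's system has to be in memory at a time.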

leouieda · 2020-01-20T17:24:52Z

OK, I added the delayed argument to the tutorials as well and made Dask a dependency. @santisoler and/or @jessepisel could you please review this PR?

jessepisel

Nice work on the dask conversion @leouieda. I went through and reviewed all the changes. The conf.py, install.rst, environment.yml, requirements-dev.txt, requirements.txt, and setup.py look good to go.

The tutorials look really good. They have a nice logical flow and explain the difference between delayed dask computation and serial computation. I added some comments here and there for clarity and grammar, but overall I think they work really well.

As for the bulk of the changes in the rest of the package, I looked through them and I think you got all the deprecation warnings covered.
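
For reference, the deprecation pattern being discussed usually looks something like the sketch below. The function name, message text, and warning category are assumptions for illustration and may not match what was committed in this PR.

```python
import warnings


def cross_val_score_sketch(scores, client=None, delayed=False):
    """Hypothetical stand-in that only demonstrates the deprecation warning."""
    if client is not None:
        # Assumed wording and category; the real message may differ
        warnings.warn(
            "The 'client' argument is deprecated and will be removed in 2.0.0. "
            "Use 'delayed' instead.",
            DeprecationWarning,
        )
    return list(scores)


# Record warnings so the behaviour can be checked, as a test would do
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = cross_val_score_sketch([0.9], client=object())
```

A test can then assert that exactly one warning of the expected category was raised when `client` is passed, and none when it is omitted.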

The new tests for test_model_selection.py should work nicely for the model selection. In the test for test_spline.py the mindists changed from 1e-5 to 1e6, is there any reason why you changed the minimum distances to fit by such a large value? Is it ensuring that the 1e-7 is returned for the test? That's all I noticed for the tests.

I think this looks great. The builds are passing and I think there is enough documentation and warnings for users to make the transition to the dask implementation in verde.

jessepisel · 2020-01-20T21:00:57Z

tutorials/model_evaluation.py

+# :class:`~verde.Spline` aren't optimal for this dataset. We could try
+# different combinations manually until we get a good score. A better way is to
+# do this automatically. In :ref:`model_selection` we'll go over how to do just
+# that.


I think the note is appropriate for reminding users that it will be memory intensive. The last bit on improving the score is a great way to transition to model_selection.

leouieda · 2020-01-21T09:46:59Z

Thanks for the review @jessepisel!

> is there any reason why you changed the minimum distances to fit by such a large value? Is it ensuring that the 1e-7 is returned for the test? That's all I noticed for the tests.

That is exactly it. I just wanted to make it easier for the CV by giving it obviously bad values and only 1 good combination. This will avoid future headaches due to floating point round-off.

leouieda added 2 commits January 16, 2020 18:57

Simplify dispatch and add delayed to SplineCV

e2dae9b

leouieda added 4 commits January 17, 2020 11:41

Add Dask as a dependency

f8c0af9

Separated so we can revert this easily if needed.

Simplify the code for cross_val_score

0a3d9a8

Experiments in SplineCV

6c55ed3

Clean up code and unify splinecv tests

eb53acc

leouieda mentioned this pull request Jan 17, 2020

Explicitly set numba config for each jit function #221

Merged

5 tasks

leouieda added 2 commits January 20, 2020 15:03

Add tests for delayed and client

7b21b7e

Merge branch 'master' into delayed

17c5d6c

leouieda added 7 commits January 20, 2020 15:37

Merge branch 'master' into delayed

da3acdb

Add deprecation warning for cross_val_score client

78abb89

Add deprecation warnings and docstrings

6a3ca34

Clone the estimator for safer parallel

1a03265

Use delayed in model evaluation tutorial

859f3aa

Use delayed in tutorials

aa23f24

Fix formatting

ae27c7d

leouieda changed the title from "WIP Setup dask.delayed instead of futures" to "Use dask.delayed instead of the distributed client" Jan 20, 2020

leouieda requested review from jessepisel and santisoler January 20, 2020 17:24

leouieda added this to the v1.3.0 milestone Jan 20, 2020

Reduce the CV tutorial a bit

ad2050f

jessepisel suggested changes Jan 20, 2020

Update verde/spline.py

0e051b1

Co-Authored-By: Jesse Pisel <jessepisel@users.noreply.github.com>

leouieda and others added 3 commits January 21, 2020 09:43

Tunning -> tuning

9404a25

Update verde/utils.py

f154c7d

Co-Authored-By: Jesse Pisel <jessepisel@users.noreply.github.com>

Merge branch 'delayed' of github.com:fatiando/verde into delayed

0b4a6f0

Set numba threads to 1

2df8ba8

leouieda merged commit f680661 into master Jan 21, 2020

leouieda deleted the delayed branch January 21, 2020 14:34
