
Add CV / shuffle #172

Merged
7 commits merged into dask:master on May 22, 2018

Conversation

2 participants
@TomAugspurger
Member

TomAugspurger commented May 21, 2018

Arrays only right now.

Blockwise-only right now, to keep things performant.

Will follow up with a train_test_split helper later tonight.
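Roughly, "blockwise" means rows are only permuted within the chunk they already live in. A minimal sketch of the idea, not the implementation in this PR (the `shuffle_block` helper here is hypothetical):

```python
import numpy as np
import dask.array as da

X = da.random.uniform(size=(1000, 4), chunks=(250, 4))

def shuffle_block(block):
    # Permute rows within this single chunk; rows never cross chunk
    # boundaries, so no data has to move between workers.
    return block[np.random.permutation(len(block))]

X_shuffled = X.map_blocks(shuffle_block, dtype=X.dtype)
```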

@TomAugspurger


Member

TomAugspurger commented May 22, 2018

Added docs, if anyone is able to take a look.

The basic version is: we can now do blockwise shuffling of dask arrays. Doing

X_train, X_test, y_train, y_test = train_test_split(X, y)

will not require shuffling any data between workers.

A version that does a full shuffle, including between blocks, will hopefully be done tomorrow (separate PR).
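For anyone who wants to try it, a fuller, runnable version of the snippet above (assuming the import path dask_ml.model_selection.train_test_split and that blockwise splitting is the default):

```python
import dask.array as da
from dask_ml.model_selection import train_test_split

# Random data, chunked along the first axis.
X = da.random.uniform(size=(10000, 20), chunks=(1000, 20))
y = da.random.randint(0, 2, size=(10000,), chunks=1000)

# Each chunk is shuffled and split independently, so no rows
# move between workers.
X_train, X_test, y_train, y_test = train_test_split(X, y)
```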

Commits pushed:
Fixed chunk size determination
Additional tests
@TomAugspurger


Member

TomAugspurger commented May 22, 2018

I'm going to merge this in a bit. Will continue to iterate on things, including some benchmarks, later. For now, I think the graph looks good (this is the graph for train_test_split(X)).

[task graph image: mydask]
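If anyone wants to reproduce a graph like this, something along these lines should work (graphviz needs to be installed; the exact layout and key names will differ):

```python
import dask.array as da
from dask_ml.model_selection import train_test_split

X = da.random.uniform(size=(1000, 4), chunks=(250, 4))
X_train, X_test = train_test_split(X)

# Render the task graph for the training half; with graphviz installed
# this writes an image similar to the one above.
X_train.visualize(filename="mydask.png")
```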

@mrocklin


Member

mrocklin commented May 22, 2018

No objection to merging. The graph does look good. However, it would look better if there were no A-like branches, each of which is an opportunity for communication in a distributed setting. This might be premature optimization though. I say go ahead and move on to other things.

@TomAugspurger


Member

TomAugspurger commented May 22, 2018

> no A-like branches, each of which is an opportunity for communication in a distributed setting.

Just to confirm my understanding: with A-like branches, we could end up with the data on one machine (like sub-#3) and the indexes on another (like (concatenate-#1, 0, 0)), which would require moving one?

@TomAugspurger


Member

TomAugspurger commented May 22, 2018

I think I'll still go ahead with merging this, since ShuffleSplit().split(X), which generates the index arrays, may still be useful.

In a follow-up we can optimize train_test_split, which returns the split arrays.
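Roughly the usage I have in mind, sketched under the assumption that the parameters mirror scikit-learn's ShuffleSplit plus blockwise behavior:

```python
import dask.array as da
from dask_ml.model_selection import ShuffleSplit

X = da.random.uniform(size=(1000, 4), chunks=(250, 4))

# Blockwise splitting keeps every index inside the chunk it came from,
# so selecting rows does not move data between workers.
splitter = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X):
    # train_idx / test_idx are dask arrays of row positions.
    X_train, X_test = X[train_idx], X[test_idx]
```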

@TomAugspurger TomAugspurger merged commit 859a18f into dask:master May 22, 2018

3 checks passed

ci/circleci: py27 Your tests passed on CircleCI!
ci/circleci: py36 Your tests passed on CircleCI!
ci/circleci: sklearn_dev Your tests passed on CircleCI!