Add CV / shuffle #172
Added docs, if anyone is able to take a look.
The short version: we can now shuffle dask arrays blockwise. Calling
X_train, X_test, y_train, y_test = train_test_split(X, y)
no longer has to move any data between workers, because each block is shuffled and split on its own.
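To make the blockwise idea concrete, here is a minimal plain-Python sketch. It is not the code in this PR: the helper name `blockwise_train_test_split` is hypothetical, and ordinary lists stand in for the blocks of a dask array. It only illustrates that every block is shuffled and split independently, so no element ever leaves its block.

```python
import random


def blockwise_train_test_split(blocks, test_size=0.25, seed=0):
    """Hypothetical helper: split each block on its own.

    `blocks` is a list of lists standing in for the chunks of a dask
    array (an assumption made purely for illustration).
    """
    rng = random.Random(seed)
    train_blocks, test_blocks = [], []
    for block in blocks:
        # Shuffle positions within this block only.
        indices = list(range(len(block)))
        rng.shuffle(indices)
        n_test = int(round(len(block) * test_size))
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        test_blocks.append([block[i] for i in test_idx])
        train_blocks.append([block[i] for i in train_idx])
    return train_blocks, test_blocks


# Three blocks, standing in for chunks held by three different workers.
blocks = [list(range(0, 8)), list(range(8, 16)), list(range(16, 24))]
train_blocks, test_blocks = blockwise_train_test_split(blocks, test_size=0.25)
```

The trade-off is that samples are only randomized within a block. If the data is ordered across blocks, a full shuffle between blocks is still needed to mix it.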
A version that does a full shuffle, including between blocks, will hopefully be done tomorrow (separate PR).
No objection to merging. The graph does look good. However, it would look better without the A-like branches, each of which is an opportunity for communication in a distributed setting. This might be premature optimization, though. I say go ahead and move on to other things.
Just to confirm my understanding: with A-like branches, we could end up with the data on one machine (like