
No support for stratified split in dask_ml.model_selection.train_test_split #535

Open
chauhankaranraj opened this issue Aug 9, 2019 · 20 comments


@chauhankaranraj
Contributor

The scikit-learn implementation of train-test split (sklearn.model_selection.train_test_split) supports splitting data according to class labels (a stratified split) via the stratify argument. This is especially useful when a dataset has high class imbalance. It would be really helpful to have this feature in dask_ml as well.
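For reference, the scikit-learn behavior being requested looks like this on a small imbalanced toy dataset (the data and variable names below are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

# stratify=y preserves the 90/10 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Without stratify, a rare class can easily end up over- or under-represented in the test set.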

@TomAugspurger
Member

TomAugspurger commented Aug 9, 2019 via email

@chauhankaranraj
Contributor Author

Tempted to say yes, but I don't know the codebase/internals very well (specifically, I'm not sure how we can get a stratified split with blockwise=False not implemented for the ShuffleSplit class).

So it'd be faster if someone more knowledgeable could volunteer. If not then I'd be happy to give it a shot, but it might take some time.

@TomAugspurger
Member

TomAugspurger commented Aug 12, 2019 via email

@tiagofassoni

Hey, Tom, I'm thinking of picking this up. My question is:

Say we have a big CSV file with two categories, split into two partitions of the data.

So file_0 contains both categories 0 and 1, while file_1 contains only category 1.

My first thought was to just use the stratify parameter of scikit-learn, but in this case that wouldn't work. Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but that seems overly complicated and prone to a ton of edge cases.

I'd be glad to pick this up, as it would help in some research I'm doing.

@TomAugspurger
Member

@tiagofassoni great! dask-ml's OneHotEncoder may be helpful here. It will use the Categorical dtype for pandas dataframes. Otherwise, you may need to pass the categories manually as a list / array. Does that make sense?

In other places that just work with arrays, like Incremental, we require that the classes (groups in this case) be specified ahead of time.
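The "specify classes up front" pattern Tom mentions can be sketched with plain scikit-learn partial_fit (which dask-ml's Incremental wraps); the example below is illustrative, not dask-ml's actual code:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)

X_batch = np.array([[0.0], [1.0]])
y_batch = np.array([0, 0])  # this batch happens to contain only class 0

# Declaring classes=[0, 1] up front lets the estimator handle class 1
# later, even though it has not been seen yet -- analogous to passing
# the full set of categories before splitting.
clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
```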

@jerrytim

Is there any luck with this feature request? In the case of a huge imbalanced dataset, the stratify argument in train_test_split is very useful.

@TomAugspurger
Member

TomAugspurger commented Feb 21, 2020 via email

@tiagofassoni

Hello, @TomAugspurger, @jerrytim. I got to try my hand at this just last week and... gotta say, I have no idea how to do it. I don't know why OneHotEncoder would be helpful, if at all.

I was thinking of using something like pandas' value_counts on the resulting series and then trying to do a shuffle, but I don't know if such an approach is feasible.

@chauhankaranraj
Contributor Author

@TomAugspurger I agree with @tiagofassoni - I'm not sure how OneHotEncoder can be used. But I also don't understand how value_counts can be used - @tiagofassoni, could you please elaborate?

There are two things I wanted to bring into the discussion that might help us decide how to implement this. IIUC, splitting is handled differently for da.Array and dd.Series/dd.DataFrame, correct?

  1. For dd.Series/dd.DataFrame, the heavy lifting is done by random_split, but I couldn't find its source code, so I'm not 100% sure how to deal with that case.
  2. For da.Array, the heavy lifting is done by ShuffleSplit and _blockwise_slice. Could we get the parts of the input array that belong to a particular class, compute the chunks of this subarray, apply the same ShuffleSplit + _blockwise_slice strategy to it, repeat for all classes, and finally concatenate the results? This would be roughly along the same lines as @tiagofassoni's comment:

Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but seems overly complicated and prone to a ton of edge cases.

@TomAugspurger
Member

random_split but I couldn't find its source code. So I'm not 100% sure how to deal with that case.

That's in dask.dataframe.DataFrame.random_split

compute all the categories beforehand and pass those to the stratify parameter, but seems overly complicated and prone to a ton of edge cases.

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

@chauhankaranraj
Contributor Author

That's in dask.dataframe.DataFrame.random_split

Gotcha, thanks! I'll take a look :)

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

Yeah, I agree - having the classes up front would be ideal. We could still compute the classes (da.unique on the stratify array), but I don't think that can be done lazily, so it wouldn't be ideal.
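For illustration (toy data), this is the eager step in question: da.unique builds a lazy graph, but consuming its result forces a pass over the data.

```python
import numpy as np
import dask.array as da

stratify = da.from_array(np.array([0, 1, 0, 0, 1, 0]), chunks=3)

# The unique values are only known after .compute(), i.e. after
# actually scanning the stratify array -- this is the non-lazy part.
classes = da.unique(stratify).compute()
```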

Maybe I'm missing something here, but do we really need the frequencies? This might be a little far from optimal, but could we do something along these lines:

train_test_pairs = []
for arr in arrays:

    # create subarrays for each class, apply split on subarrays individually
    arr_train_test_pairs = [[], []]
    for ci in classes:
        ci_arr = arr[_stratify==ci]
        ci_arr.compute_chunk_sizes()
        train_idx, test_idx = next(splitter.split(ci_arr))
        arr_train_test_pairs[0].append(_blockwise_slice(ci_arr, train_idx))
        arr_train_test_pairs[1].append(_blockwise_slice(ci_arr, test_idx))

    # concat all train subarr as 1 train arr, all test subarr as 1 test arr
    arr_train_test_pairs[0] = da.concatenate(arr_train_test_pairs[0])
    arr_train_test_pairs[1] = da.concatenate(arr_train_test_pairs[1])
    train_test_pairs.append(arr_train_test_pairs)

return list(itertools.chain.from_iterable(train_test_pairs))

@trail-coffee

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

Note: I'm a data scientist, not a developer...

train_test_pairs = []
for arr in arrays:

    # create subarrays for each class, apply split on subarrays individually
    arr_train_test_pairs = [[], []]
    for ci in classes:
        ci_arr = arr[_stratify==ci]
        ci_arr.compute_chunk_sizes()
        train_idx, test_idx = next(splitter.split(ci_arr))
        arr_train_test_pairs[0].append(_blockwise_slice(ci_arr, train_idx))
        arr_train_test_pairs[1].append(_blockwise_slice(ci_arr, test_idx))

    # concat all train subarr as 1 train arr, all test subarr as 1 test arr
    arr_train_test_pairs[0] = da.concatenate(arr_train_test_pairs[0])
    arr_train_test_pairs[1] = da.concatenate(arr_train_test_pairs[1])
    train_test_pairs.append(arr_train_test_pairs)

return list(itertools.chain.from_iterable(train_test_pairs))

Sklearn uses np.bincount in the StratifiedShuffleSplit class in sklearn.model_selection._split to get class frequencies and split accordingly.
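For concreteness, here is a rough sketch of that frequency-based allocation (the simple rounding below is an approximation of sklearn's internal _approximate_mode logic, not its actual code):

```python
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0])

# Per-class frequencies: 8 samples of class 0, 2 of class 1
class_counts = np.bincount(y)

# Allocate test samples per class in proportion to those frequencies,
# e.g. a 50% test split takes 4 samples of class 0 and 1 of class 1.
n_test = 5
per_class_test = np.round(class_counts * n_test / y.size).astype(int)
```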

@chauhankaranraj
Contributor Author

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

@ericbassett It should split them in whatever train/test ratio is provided as input. The splitter used here is the instance of ShuffleSplit that gets created here. IIUC, it takes care of splitting by the provided ratios.

I'll submit a WIP PR soon so this discussion becomes more concrete :)

@trail-coffee

Very nice, makes sense.

@chauhankaranraj
Contributor Author

Hey folks,

I made an attempt to implement the stratified split here. I could do it lazily for dask Series and DataFrames, but not completely lazily for dask Array (it requires calling compute_chunk_sizes()).

Does anyone have ideas for getting around this? Would it be possible to "enforce" the chunk size instead of computing it? [e.g. if the chunk size for the whole array is (x, 10), then the chunk size for the part of the array that belongs to a class with weight 15% would be (0.15x, 10)]
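To make the non-lazy step concrete: boolean-mask indexing is what produces the unknown chunk sizes, e.g. on toy data:

```python
import numpy as np
import dask.array as da

arr = da.from_array(np.arange(20).reshape(10, 2), chunks=(5, 2))
labels = da.from_array(np.array([0] * 8 + [1] * 2), chunks=5)

# Selecting the rows of one class leaves the first-axis chunk sizes
# unknown (NaN), because dask cannot know how many rows match per chunk
# without looking at the data...
subset = arr[labels == 0]

# ...so resolving them currently requires this eager call.
subset.compute_chunk_sizes()
```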

Any feedback in general would be highly appreciated 🙏

Also, if you feel this discussion should be moved to a WIP PR, I can open that too.

@TomAugspurger
Member

It may be easiest to move to a PR. We might be able to do things lazily for dask array; we'll just probably end up with unknown chunk sizes.

@chauhankaranraj
Contributor Author

@TomAugspurger Sure thing. Opened this WIP PR yesterday

@ashokrayal

Any progress on this task? :)

@kennylids

I need the stratify feature in train_test_split as well for my imbalanced dataset. Any updates?

@chauhankaranraj
Contributor Author

Hey folks, sorry but I haven't had the chance to continue working on this. I did open a WIP PR (#635) so if anyone would like to fork off of it or just start from scratch, feel free to do so! Let me know if you'd like anything from me in doing so.


7 participants