
No support for stratified split in dask_ml.model_selection.train_test_split #535

Open
chauhankaranraj opened this issue Aug 9, 2019 · 20 comments


@chauhankaranraj
Contributor

The scikit-learn implementation of train-test split (sklearn.model_selection.train_test_split) supports splitting data according to class labels (a stratified split) via the stratify argument. This is especially useful when a dataset has high class imbalance. It would be really helpful to have this feature in dask_ml as well.
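For reference, the scikit-learn behavior being requested looks like this on a small imbalanced toy dataset (the data and variable names below are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

# stratify=y preserves the 90/10 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

Without stratify, a rare class can easily end up over- or under-represented in the test set.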

@TomAugspurger
Member

TomAugspurger commented Aug 9, 2019 via email

@chauhankaranraj
Contributor Author

Tempted to say yes, but I don't know the codebase/internals very well (specifically, I'm not sure how we can get a stratified split with blockwise=False not implemented for the ShuffleSplit class).

So it'd be faster if someone more knowledgeable could volunteer. If not then I'd be happy to give it a shot, but it might take some time.

@TomAugspurger
Member

TomAugspurger commented Aug 12, 2019 via email

@tiagofassoni

Hey, Tom, I'm thinking of picking this up. My question is:

Say we have a big CSV file with two categories, split into two partitions of the data.

So file_0 contains both categories 0 and 1, while file_1 contains only category 1.

My first thought was to just use the stratify parameter of scikit-learn, but in this case that wouldn't work. Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but that seems overly complicated and prone to a ton of edge cases.

I'd be glad to pick this up, as it would help in some research I'm doing.

@TomAugspurger
Member

@tiagofassoni great! dask-ml's OneHotEncoder may be helpful here. It will use the Categorical dtype for pandas dataframes. Otherwise, you may need to pass the categories manually as a list / array. Does that make sense?

In other places that just work with arrays, like Incremental, we require that the classes (groups in this case) be specified ahead of time.
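The "specify classes up front" pattern Tom mentions can be sketched with plain scikit-learn partial_fit (which dask-ml's Incremental wraps); the example below is illustrative, not dask-ml's actual code:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)

X_batch = np.array([[0.0], [1.0]])
y_batch = np.array([0, 0])  # this batch happens to contain only class 0

# Declaring classes=[0, 1] up front lets the estimator handle class 1
# later, even though it has not been seen yet -- analogous to passing
# the full set of categories before splitting.
clf.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
```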

@jerrytim

Is there any luck with this feature request? In the case of a huge imbalanced dataset, the stratify argument in train_test_split is very useful.

@TomAugspurger
Member

TomAugspurger commented Feb 21, 2020 via email

@tiagofassoni

Hello, @TomAugspurger, @jerrytim. I got to try my hand at this just last week and... gotta say, I have no idea how to do it. I don't know why OneHotEncoder would be helpful, if at all.

I was thinking of using something like pandas' value_counts on the resulting series and then trying to do a shuffle, but I don't know if such an approach is feasible.

@chauhankaranraj
Contributor Author

@TomAugspurger I agree with @tiagofassoni - I'm not sure how OneHotEncoder can be used. But I also don't understand how value_counts can be used - @tiagofassoni, could you please elaborate?

There are two things I wanted to bring into the discussion that might help us decide how to implement this. IIUC, splitting is handled differently for da.Array and dd.Series/dd.DataFrame, correct?

  1. For dd.Series/dd.DataFrame, the heavy lifting is done by random_split, but I couldn't find its source code, so I'm not 100% sure how to deal with that case.
  2. For da.Array, the heavy lifting is done by ShuffleSplit and _blockwise_slice. Could we get the parts of the input array that belong to a particular class, compute the chunks of this subarray, apply the same ShuffleSplit + _blockwise_slice strategy to it, repeat for all classes, and finally concatenate the results? This would be roughly along the same lines as @tiagofassoni's comment:

Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but seems overly complicated and prone to a ton of edge cases.

@TomAugspurger
Member

random_split but I couldn't find its source code. So I'm not 100% sure how to deal with that case.

That's in dask.dataframe.DataFrame.random_split

compute all the categories beforehand and pass those to the stratify parameter, but seems overly complicated and prone to a ton of edge cases.

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

@chauhankaranraj
Contributor Author

That's in dask.dataframe.DataFrame.random_split

Gotcha, thanks! I'll take a look :)

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

Yeah, I agree - having the classes up front would be ideal. We could still compute the classes (da.unique on the stratify array), but I don't think that can be done lazily, so it wouldn't be ideal.
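For illustration (toy data), this is the eager step in question: da.unique builds a lazy graph, but consuming its result forces a pass over the data.

```python
import numpy as np
import dask.array as da

stratify = da.from_array(np.array([0, 1, 0, 0, 1, 0]), chunks=3)

# The unique values are only known after .compute(), i.e. after
# actually scanning the stratify array -- this is the non-lazy part.
classes = da.unique(stratify).compute()
```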

Maybe I'm missing something here, but do we really need the frequencies? This might be a little far from optimal, but could we do something along these lines:

train_test_pairs = []
for arr in arrays:

    # create subarrays for each class, apply split on subarrays individually
    arr_train_test_pairs = [[], []]
    for ci in classes:
        ci_arr = arr[_stratify==ci]
        ci_arr.compute_chunk_sizes()
        train_idx, test_idx = next(splitter.split(ci_arr))
        arr_train_test_pairs[0].append(_blockwise_slice(ci_arr, train_idx))
        arr_train_test_pairs[1].append(_blockwise_slice(ci_arr, test_idx))

    # concat all train subarr as 1 train arr, all test subarr as 1 test arr
    arr_train_test_pairs[0] = da.concatenate(arr_train_test_pairs[0])
    arr_train_test_pairs[1] = da.concatenate(arr_train_test_pairs[1])
    train_test_pairs.append(arr_train_test_pairs)

return list(itertools.chain.from_iterable(train_test_pairs))

@trail-coffee

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

Note: I'm a data scientist, not a developer...

train_test_pairs = []
for arr in arrays:

    # create subarrays for each class, apply split on subarrays individually
    arr_train_test_pairs = [[], []]
    for ci in classes:
        ci_arr = arr[_stratify==ci]
        ci_arr.compute_chunk_sizes()
        train_idx, test_idx = next(splitter.split(ci_arr))
        arr_train_test_pairs[0].append(_blockwise_slice(ci_arr, train_idx))
        arr_train_test_pairs[1].append(_blockwise_slice(ci_arr, test_idx))

    # concat all train subarr as 1 train arr, all test subarr as 1 test arr
    arr_train_test_pairs[0] = da.concatenate(arr_train_test_pairs[0])
    arr_train_test_pairs[1] = da.concatenate(arr_train_test_pairs[1])
    train_test_pairs.append(arr_train_test_pairs)

return list(itertools.chain.from_iterable(train_test_pairs))

Sklearn uses np.bincount in the StratifiedShuffleSplit class in sklearn.model_selection._split to get class frequencies and split accordingly.
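For concreteness, here is a rough sketch of that frequency-based allocation (the simple rounding below is an approximation of sklearn's internal _approximate_mode logic, not its actual code):

```python
import numpy as np

y = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0])

# Per-class frequencies: 8 samples of class 0, 2 of class 1
class_counts = np.bincount(y)

# Allocate test samples per class in proportion to those frequencies,
# e.g. a 50% test split takes 4 samples of class 0 and 1 of class 1.
n_test = 5
per_class_test = np.round(class_counts * n_test / y.size).astype(int)
```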

@chauhankaranraj
Contributor Author

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

@ericbassett It should split them in whatever train/test ratio is provided as input. The splitter used here is the instance of ShuffleSplit that gets created here. IIUC, it takes care of splitting by the provided ratios.

I'll submit a WIP PR soon so this discussion becomes more concrete :)

@trail-coffee

Very nice, makes sense.

@chauhankaranraj
Contributor Author

Hey folks,

I made an attempt to implement the stratified split here. I could do it lazily for dask Series and DataFrames, but not completely lazily for dask Array (it requires calling compute_chunk_sizes()).

Does anyone have ideas for getting around this? Would it be possible to "enforce" the chunk size instead of computing it? [e.g. if the chunk size for the whole array is (x, 10), then the chunk size for the part of the array that belongs to a class with weight 15% would be (0.15x, 10)]
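To make the non-lazy step concrete: boolean-mask indexing is what produces the unknown chunk sizes, e.g. on toy data:

```python
import numpy as np
import dask.array as da

arr = da.from_array(np.arange(20).reshape(10, 2), chunks=(5, 2))
labels = da.from_array(np.array([0] * 8 + [1] * 2), chunks=5)

# Selecting the rows of one class leaves the first-axis chunk sizes
# unknown (NaN), because dask cannot know how many rows match per chunk
# without looking at the data...
subset = arr[labels == 0]

# ...so resolving them currently requires this eager call.
subset.compute_chunk_sizes()
```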

Any feedback in general would be highly appreciated 🙏

Also, if you feel this discussion should be moved to a WIP PR, I can open that too.

@TomAugspurger
Member

It may be easiest to move to a PR. We might be able to do things lazily for dask array; we'll just probably end up with unknown chunk sizes.

@chauhankaranraj
Contributor Author

@TomAugspurger Sure thing. Opened this WIP PR yesterday

@ashokrayal

Any progress on this task? :)

@kennylids

I need the stratify feature in train_test_split as well for my imbalanced dataset. Any updates?

@chauhankaranraj
Contributor Author

Hey folks, sorry but I haven't had the chance to continue working on this. I did open a WIP PR (#635) so if anyone would like to fork off of it or just start from scratch, feel free to do so! Let me know if you'd like anything from me in doing so.


7 participants