No support for stratified split in dask_ml.model_selection.train_test_split #535
Comments
Agreed. Are you interested in working on this?
On Fri, Aug 9, 2019 at 3:18 PM Karanraj Chauhan wrote:
scikit-learn implementation of train test split
(sklearn.model_selection.train_test_split) supports splitting data
according to class labels (stratified split) by using the argument
stratify. This is especially useful when datasets have high class
imbalance. It would be really helpful to have this feature in dask_ml as
well.
|
Tempted to say yes, but I don't know the codebase/internals very well (specifically, I'm not sure how we can get a stratified split with blockwise=False not implemented for the ShuffleSplit class). So it'd be faster if someone more knowledgeable could volunteer. If not then I'd be happy to give it a shot, but it might take some time. |
That's great if you're willing to try. Let us know if you get stuck.
On Sun, Aug 11, 2019 at 6:40 PM Karanraj Chauhan wrote:
Tempted to say yes, but I don't know the codebase/internals very well
(specifically, I'm not sure how we can get a stratified split with
blockwise=False not implemented for the ShuffleSplit class).
So it'd be faster if someone more knowledgeable could volunteer. If not
then I'd be happy to give it a shot, but it might take some time.
|
Hey, Tom, I'm thinking of picking this up. My doubt is this: say we have a big CSV file with two categories and two partitions of the data. My first thought was to just use the stratify parameter of scikit-learn, but in this case that wouldn't work. Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but that seems overly complicated and prone to a ton of edge cases. I'd be glad to pick this up, as it would help in some research I'm doing. |
@tiagofassoni great! dask-ml's OneHotEncoder may be helpful here. It will use the Categorical dtype for pandas dataframes. Otherwise you can (or maybe need?) to pass the In other places that just work with arrays, like |
Is there any luck with this feature request? In the case of a huge imbalanced dataset, the stratify argument in train_test_split is useful. |
I’m not aware of any progress. Perhaps Tiago can share a status update.
On Feb 21, 2020, at 7:37 AM, Tim Huang wrote:
Is there any luck with this feature request? In the case of a huge imbalanced dataset, the stratify argument in train_test_split is useful.
|
Hello, @TomAugspurger, @jerrytim. I got to try my hand at this just last week and... gotta say, I have no idea how to do it. I don't know why OneHotEncoder would be helpful, if at all. I was thinking of using something like pandas' value_counts on the label series and then trying to do a shuffle, but I don't know if such an approach is feasible. |
@TomAugspurger I agree with @tiagofassoni - I'm not sure how There's two things I wanted to bring into discussion that might help us better decide how to implement. IIUC splitting is handled differently for
|
That's in
In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily? |
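One way to picture the "pass over the data" question is a per-partition count that is merged afterwards. The sketch below is only an illustration: plain Python lists stand in for the partitions of a dask collection, and `class_counts` is a hypothetical helper, not part of dask-ml.

```python
from collections import Counter
from functools import reduce


def class_counts(partitions):
    """Count labels in each partition, then merge the per-partition counts."""
    per_partition = [Counter(labels) for labels in partitions]
    return reduce(lambda a, b: a + b, per_partition, Counter())


# Hypothetical label partitions standing in for the chunks of a dask collection.
partitions = [["a", "a", "b"], ["a", "b", "b", "b"]]

counts = class_counts(partitions)
total = sum(counts.values())
# Relative frequency of each class over the whole dataset.
frequencies = {label: n / total for label, n in counts.items()}
```

Because each partition is counted independently and only the small count tables are combined, this kind of reduction could in principle be expressed lazily, but it still has to touch every label once.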
Gotcha, thanks! I'll take a look :)
Yeah, I agree - having the classes up front would be ideal. We could still compute the classes ( Maybe I'm missing something here, but do we really need the frequencies? This might be a little far from optimal, but could we do something along these lines:
|
@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train? Note: I'm a data scientist, not a developer...
Sklearn uses |
@ericbassett It should split it in whatever train/test ratio is provided as input. The I'll submit a WIP PR soon so this discussion becomes more concrete :) |
Very nice, makes sense. |
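For readers following along, the per-class idea discussed above can be sketched in plain NumPy. This is a minimal in-memory illustration, not the code from the thread or from dask-ml; `stratified_split` and its parameters are hypothetical names, and a dask version would have to deal with laziness and chunking.

```python
import numpy as np


def stratified_split(X, y, test_size=0.25, seed=0):
    """Split each class separately with the same ratio, then recombine."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)  # row positions of this class
        rng.shuffle(idx)                  # shuffle within the class
        n_test = int(round(test_size * len(idx)))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    train_idx = np.concatenate(train_idx)
    test_idx = np.concatenate(test_idx)
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Imbalanced toy data: 90 samples of class 0 and 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)
X_train, X_test, y_train, y_test = stratified_split(X, y, test_size=0.2)
```

The ratio is applied inside each class, so the split is not fixed at 0.5/0.5. Because the test count is rounded per class, the overall test fraction is only approximately the requested one when class sizes do not divide evenly.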
Hey folks, I made an attempt to implement the stratified split here. I could do it lazily for dask Series and DataFrames, but not completely lazily for dask Array (calling Does anyone have ideas to get around this? Would it be possible to "enforce" chunk size instead of computing it? [e.g. if chunk size for the whole array is (x, 10) then chunk size for the part of the array that belongs to a class with weight 15% should be (0.15x, 10)] Any feedback in general would be highly appreciated 🙏 Also, if you feel this discussion should be moved to a WIP PR, I can open that too. |
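The "enforced" chunk size arithmetic from the example above can be written out as follows. This is only a sketch of the proposed estimate; `estimated_class_chunks` is a hypothetical helper, and the real number of rows of a class in any given chunk depends on how the labels are distributed, so these values are estimates rather than true chunk sizes.

```python
def estimated_class_chunks(row_chunks, n_cols, class_weight):
    """Scale each row-chunk length by the class weight; columns are unchanged."""
    return [(int(round(rows * class_weight)), n_cols) for rows in row_chunks]


# An array with row chunks of 1000, 1000 and 500 rows and 10 columns,
# filtered down to a class that makes up 15% of the data.
chunks = estimated_class_chunks([1000, 1000, 500], n_cols=10, class_weight=0.15)
```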
May be easiest to move to a PR. We might be able to do things lazily for dask array, we'll just probably end up with unknown chunk sizes. |
@TomAugspurger Sure thing. Opened this WIP PR yesterday |
Any progress on this task? :) |
I need the stratify feature in train_test_split as well for my imbalanced dataset. Any updates? |
Hey folks, sorry but I haven't had the chance to continue working on this. I did open a WIP PR (#635) so if anyone would like to fork off of it or just start from scratch, feel free to do so! Let me know if you'd like anything from me in doing so. |
The scikit-learn implementation of train test split (sklearn.model_selection.train_test_split) supports splitting data according to class labels (stratified split) by using the argument stratify. This is especially useful when datasets have high class imbalance. It would be really helpful to have this feature in dask_ml as well.
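For reference, this is roughly what the requested behaviour looks like in scikit-learn today. The toy data is made up for illustration; the call itself uses the existing stratify argument of sklearn.model_selection.train_test_split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 90 samples of class 0 and 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions of y in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

train_minority = int((y_train == 1).sum())
test_minority = int((y_test == 1).sum())
```

Without stratify, a purely random split of such data can leave very few (or no) minority samples in the test set, which is the situation the request aims to avoid.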