Resampling with imbalanced-learn samplers #15
Thanks Matthias. I'll have to look this over later this week. Thanks again for the contribution.
I rebased the commits on the current development branch.
Hi David, I added an option to shuffle the resampled results. The reason is that samplers such as the RandomUnderSampler seem to sort the X/y arrays by the class of y. This is problematic when a Keras classifier is fitted on resampled and segmented data with a fixed validation_split. Cheers,
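The ordering problem described above can be sketched with plain NumPy. This is only an illustration under stated assumptions: the class-sorted arrays stand in for the sampler output, `validation_tail` is a hypothetical helper that mimics how Keras takes the last fraction of the data as the validation set before shuffling, and none of it is the code from this pull request.

```python
import numpy as np


def validation_tail(y, validation_split=0.2):
    """Return the labels a fixed tail split would hold out (hypothetical helper)."""
    split_at = int(len(y) * (1.0 - validation_split))
    return y[split_at:]


# Stand-in for class-sorted sampler output: all 0s first, then all 1s.
y_sorted = np.repeat([0, 1], 50)
# The first feature column copies the label so that alignment can be checked later.
X_sorted = np.column_stack([y_sorted, np.arange(100)])

# The held-out tail of the sorted data contains a single class only.
tail_classes_sorted = set(validation_tail(y_sorted).tolist())

# Shuffle X and y with the same permutation so that rows stay aligned.
rng = np.random.RandomState(0)
perm = rng.permutation(len(y_sorted))
X_shuffled, y_shuffled = X_sorted[perm], y_sorted[perm]
tail_classes_shuffled = set(validation_tail(y_shuffled).tolist())
```

Applying one permutation to both arrays keeps each sample paired with its label, which is the property a shuffle option has to preserve.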
Thanks Matthias - I have to spend some more time looking at this. I am working on Here is the thread for the discussion: scikit-learn/scikit-learn#3855
This is really good to know.
'''
Circumvent the check whether dim(Xt) == 2.
'''
Xt_2d = Xt.reshape(Xt.shape[0], -1)
This reshape creates a copy of Xt when the data was segmented with Fortran-like ordering (the default). #24 solves this issue by choosing C-like ordering for the segmentation. Shall I emit a warning if Fortran-like ordering is used, or shall I remove the "faked" check altogether?
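The copy behaviour mentioned in this comment can be checked with NumPy alone. The following is a minimal sketch, assuming small zero-filled 3D arrays as stand-ins for segmented data; the array names and shapes are illustrative and do not come from the patch.

```python
import numpy as np

# Two arrays with identical shape but different memory layout.
Xt_c = np.zeros((4, 10, 3), order="C")
Xt_f = np.zeros((4, 10, 3), order="F")

# The same flattening as in the reviewed line: keep axis 0, merge the rest.
flat_c = Xt_c.reshape(Xt_c.shape[0], -1)
flat_f = Xt_f.reshape(Xt_f.shape[0], -1)

# A view shares memory with its source; a copy does not.
c_is_view = np.shares_memory(Xt_c, flat_c)
f_is_view = np.shares_memory(Xt_f, flat_f)
```

Under these assumptions the C-ordered array is reshaped as a view, while the Fortran-ordered one has to be copied, because its trailing axes cannot be merged in C order without moving data.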
Replace deprecated six functionality with Python 3 code and adapt to new version requirements.
Drop Python 2 support by using scikit-learn 0.21.3
Hi David, I rebased the resampling patches onto the master branch and squashed the commits so that they would be easier to revert. What do you think about merging this patch set? It seems that scikit-learn needs some more time until they might provide this feature (cf. scikit-learn/scikit-learn#13269). Should I change the target of this pull request from the dev branch to the master branch? Cheers,
Hi Matthias, I really appreciate your work on this. I am pretty busy over the next two weeks but promise to look over this again soon. Last time I wasn't too keen on adding the imblearn dependency. Let me look it over again and then let us discuss. David
…deviation add median absolute deviation
Hi David, any news? Cheers,
Matthias - I truly apologize for the delay; I am currently writing my thesis. This looks great. Can you please rebase onto the current master? I will merge and deploy as soon as that's done. I appreciate all your work on this really useful patch. David
This function dynamically patches an imbalanced-learn sampler so that it can be used inside a seglearn Pype. It ensures that the objects created from this metaclass are picklable. Additionally, shuffling is implemented for imbalanced-learn samplers. The reason is that imbalanced-learn sorts the output by class, which is problematic when splitting the resampled data (e.g. using the validation split of the Keras fit function). Finally, calling repr() on a dynamically patched sampler returns the parameters of the imbalanced-learn base class along with the additional parameters introduced in the PickableSampler class (shuffle and random_state).
Additionally, add tests for a dynamically created PickableSampler object and imbalanced-learn sampler shuffling.
Hi David, no worries, all the best for your thesis :). Cheers,
Thanks Matthias
Hi David,
I added the patch_sampler(imblearn_sampler_class) function, which derives a dynamically created (and picklable) sampler class that is compatible with Pype.
The derived class implements a transform method that returns the data unchanged. Its fit_transform method calls the fit_resample method of the imbalanced-learn sampler, which resamples the data.
This split is important to ensure that resampling applies only to training data and not to test data (the example shows that Pype.fit calls the fit_transform method, whereas score calls the transform method).
Cheers,
Matthias
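The fit_transform/transform split described in this comment can be illustrated with a self-contained sketch. It is not the patch_sampler implementation from this pull request. `ToyUnderSampler` is a hypothetical stand-in for an imbalanced-learn sampler, `patch_sampler_sketch` is a hypothetical name, the method signatures are simplified, and pickling and shuffling are left out.

```python
import numpy as np


class ToyUnderSampler:
    """Hypothetical stand-in for an imbalanced-learn sampler (only fit_resample)."""

    def fit_resample(self, X, y):
        # Keep the first n_min samples of every class, where n_min is the minority size.
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([np.flatnonzero(y == c)[:n_min] for c in classes])
        return X[keep], y[keep]


def patch_sampler_sketch(sampler_class):
    """Derive a class whose fit_transform resamples and whose transform is a no-op."""

    def fit_transform(self, X, y):
        # Used while fitting: resample the training data.
        return self.fit_resample(X, y)

    def transform(self, X, y):
        # Used while scoring or predicting: pass the data through unchanged.
        return X, y

    return type(
        "Patched" + sampler_class.__name__,
        (sampler_class,),
        {"fit_transform": fit_transform, "transform": transform},
    )


X = np.arange(12, dtype=float).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])

sampler = patch_sampler_sketch(ToyUnderSampler)()
X_train, y_train = sampler.fit_transform(X, y)  # balanced: two samples per class
X_test, y_test = sampler.transform(X, y)        # untouched: all six samples
```

In this sketch only the fitting path changes the class distribution, so any evaluation that goes through transform still sees the original data.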