Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resampling with imbalanced-learn samplers #15

Merged
merged 15 commits into from Nov 8, 2019
Merged

Conversation

qtux
Copy link
Contributor

@qtux qtux commented Dec 8, 2018

Hi David,

I added the patch_sampler(imblearn_sampler_class) function which can be used to derive a dynamically created (and pickable) sampler class compatible with Pype.
The derived class implements a transform method which returns the data unchanged. The fit_transform method calls the fit_resample method of the imbalanced-learn sampler which resamples the data.
These steps are important to ensure that resampling only applies to training data but not to test data (the example shows that Pype.fit calls the fit_transform method, whereas score calls the transform method).

Cheers,
Matthias

@qtux qtux force-pushed the resampling branch 2 times, most recently from f86a465 to ebdd07e Compare December 8, 2018 09:50
@coveralls
Copy link

coveralls commented Dec 8, 2018

Pull Request Test Coverage Report for Build 225

  • 282 of 288 (97.92%) changed or added relevant lines in 11 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.5%) to 95.104%

Changes Missing Coverage Covered Lines Changed/Added Lines %
seglearn/pipe.py 1 2 50.0%
seglearn/transform.py 66 68 97.06%
seglearn/util.py 15 18 83.33%
Totals Coverage Status
Change from base Build 197: 0.5%
Covered Lines: 1923
Relevant Lines: 2022

💛 - Coveralls

@dmbee
Copy link
Owner

dmbee commented Dec 8, 2018

Thanks Matthias. I'll have to look this over later this week. Thanks again for the contribution.
David

@qtux
Copy link
Contributor Author

qtux commented Dec 12, 2018

I rebased the commits on the current development branch.

@qtux
Copy link
Contributor Author

qtux commented Dec 13, 2018

Hi David,

I added the possibility to shuffle the resampled results. The reason for this feature is that e.g. the RandomUnderSampler seems to sort the X/y arrays by the class of y. This turns out to be problematic when using the fixed validation_split to fit a Keras classifier on resampled and segmented data.
A solution to provide validation_data without using the validation_split seems to be more complex.

Cheers,
Matthias

@dmbee
Copy link
Owner

dmbee commented Dec 14, 2018

Thanks Matthias - I have to spend some more time looking at this. I am working on sklearn on how best to integrate resampling into their pipeline as well.

Here is the thread for the discussion: scikit-learn/scikit-learn#3855

@dmbee
Copy link
Owner

dmbee commented Dec 14, 2018

The reason for this feature is that e.g. the RandomUnderSampler seems to sort the X/y arrays by the class of y.

This is really good to know.

@qtux
Copy link
Contributor Author

qtux commented Dec 18, 2018

I rebased the commits on the current development branch.

@qtux
Copy link
Contributor Author

qtux commented Jan 5, 2019

I rebased the commits on the current development branch.

'''
Circumvent the check whether dim(Xt) == 2.
'''
Xt_2d = Xt.reshape(Xt.shape[0], -1)
Copy link
Contributor Author

@qtux qtux Jan 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a copy of Xt when Fortran-like (default) ordering is used when segmenting data. #24 solves this issue by choosing C-like ordering for the segmentation. Shall I give a warning if Fortran-like ordering is used, or shall I remove the "faked" check altogether?

dmbee and others added 3 commits May 24, 2019 16:14
Replace deprecated six functionality with Python 3 code and adapt to new
version requirements.
Drop Python 2 support by using scikit-learn 0.21.3
@qtux qtux force-pushed the resampling branch 2 times, most recently from 9039c6e to c24a839 Compare August 27, 2019 16:46
@qtux
Copy link
Contributor Author

qtux commented Aug 27, 2019

Hi David,

I rebased the resampling patches to the master branch and squashed the commits such that it would be easier to revert them. What do you think about merging this patch set? It seems that scikit-learn needs some more time until they might provide this feature (c.f. scikit-learn/scikit-learn#13269).

Should I change this pull request from the dev to the master branch?

Cheers,
Matthias

@dmbee
Copy link
Owner

dmbee commented Aug 30, 2019

Hi Matthias,

I really appreciate your work on this. I am pretty busy over the next two weeks but promise to look over this again soon. Last time I wasn't too keen on adding the dependency of imblearn.

Let me look it over again and let us then discuss.

David

@qtux
Copy link
Contributor Author

qtux commented Nov 7, 2019

Hi David,

any news?

Cheers,
Matthias

@dmbee
Copy link
Owner

dmbee commented Nov 7, 2019

Matthias - truly apologize for the delay as I am writing my thesis currently. This looks great. Can you please rebase to the current master and I will merge and deploy soon as that's done.

I appreciate all your work on this really useful patch.

David

This functions dynamically patches an imbalanced-learn Sampler
transformer to be usable inside a seglearn Pype. It ensures that the
objects created from this metaclass are pickable.

Additionally, shuffling is implemented for imbalanced-learn samplers.
The reason is that imbalanced-learn sorts the output by classes. This is
problematic when splitting the resampled data (e.g. using the validation
split from the Keras fit function).

Finally, calling repr() on a dynamically patched sampler will return the
parameters of the imbalanced-learn base class along with the additional
parameters introduced in the PickableSampler class (shuffle and
random_state).
Additionally, add tests for a dynamically created PickableSampler object
and imbalanced-learn sampler shuffling.
@qtux
Copy link
Contributor Author

qtux commented Nov 7, 2019

Hi David,

no worries, all the best for your thesis :).

Cheers,
Matthias

@dmbee dmbee merged commit a61f90f into dmbee:dev Nov 8, 2019
@dmbee
Copy link
Owner

dmbee commented Nov 8, 2019

Thanks Matthias

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants