Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dictionary support for Undersampler base class #2235

Merged
merged 9 commits into from
May 13, 2021
Merged

Conversation

bchen1116
Copy link
Contributor

@bchen1116 bchen1116 commented May 6, 2021

Part 1 of #2105

This PR adds support for accepting dictionary input in the undersampler base class. I will add support for adding the dictionary support for Undersampler in a separate PR, since I want to handle the dictionary processing/conversion from ratios to number of samples in the sampler base class, which requires this doc to be approved and implemented.

@bchen1116 bchen1116 self-assigned this May 6, 2021
@codecov
Copy link

codecov bot commented May 6, 2021

Codecov Report

Merging #2235 (c9e795d) into main (e051731) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2235     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        280     280             
  Lines      24294   24330     +36     
=======================================
+ Hits       24266   24302     +36     
  Misses        28      28             
Impacted Files Coverage Δ
.../data_splitters/balanced_classification_sampler.py 100.0% <100.0%> (ø)
...sing_tests/test_balanced_classification_sampler.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e051731...c9e795d. Read the comment docs.

y_dict = y.value_counts().to_dict()
new_dic = {}
for k, v in self.sampling_ratio_dict.items():
new_dic[k] = max(y_dict[k] - v, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bchen1116 The doc said that we'd accept ratios as the values of the dictionary but it seems like we're still expecting counts. Is that coming in the part 2 pr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! I plan on adding support for converting the ratios to the number of samples in the base_sampler class so both the over and undersamplers have access to it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I think I understand the plan now!

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great @bchen1116 !

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Left nit-picky comment about docstring and how sampling_ratio interacts with sampling_ratio_dict but otherwise 🚢

@bchen1116 bchen1116 merged commit 6499428 into main May 13, 2021
@freddyaboulton freddyaboulton deleted the bc_2105_dictionary branch May 13, 2022 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants