Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SMOTE Oversampler #2590

Merged
merged 18 commits into from
Aug 9, 2021
Merged

Fix SMOTE Oversampler #2590

merged 18 commits into from
Aug 9, 2021

Conversation

eccabay
Copy link
Contributor

@eccabay eccabay commented Aug 4, 2021

Fixes bug where non-categorical and non-numeric data types (boolean, DateTime, etc) were being counted as numeric columns when determining which SMOTE oversampler to add, but SMOTE counts them as categorical.

@eccabay eccabay changed the title 74 smotenc Fix SMOTE Oversampler Aug 4, 2021
@codecov
Copy link

codecov bot commented Aug 4, 2021

Codecov Report

Merging #2590 (0947608) into main (f46b94d) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main   #2590     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        295     295             
  Lines      26811   26847     +36     
=======================================
+ Hits       26765   26801     +36     
  Misses        46      46             
Impacted Files Coverage Δ
...s/components/transformers/samplers/oversamplers.py 100.0% <ø> (ø)
evalml/automl/utils.py 100.0% <100.0%> (ø)
evalml/tests/automl_tests/test_automl_utils.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_oversamplers.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f46b94d...0947608. Read the comment docs.

@eccabay eccabay marked this pull request as ready for review August 5, 2021 15:28
@eccabay eccabay requested a review from a team August 5, 2021 15:29
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Left one additional suggestion that could be nice to test but not blocking.

X = pd.DataFrame(X)
y = pd.Series([i % 5 == 0 for i in range(100)])
X[0] = [i % 2 for i in range(100)]
X_ww = infer_feature_types(X, feature_types={0: "boolean", 1: "Categorical"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, could we see what happens with 'datetime' as well?

@@ -193,7 +193,7 @@ def get_best_sampler_for_data(X, y, sampler_method, sampler_balanced_ratio):
import_or_raise(
"imblearn.over_sampling", error_msg="imbalanced-learn is not installed"
)
cat_cols = X.ww.select("Categorical").columns
cat_cols = X.ww.select(exclude="numeric").columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will cause a problem if users pass in non-categorical and non-numeric features that get converted to numeric by components in the pipeline.

For example, this counts text as categoricals. But the text would get turned into doubles by the text featurizer. So they wouldn't be categoricals by the time it got to the sampler. That would cause a stacktrace because SmoteNC expects categoricals to be present, right?

from evalml.pipelines import BinaryClassificationPipeline
from evalml.automl.utils import get_best_sampler_for_data
import pandas as pd
import woodwork as ww
import pytest

X = pd.DataFrame({"a": ["Hello this is my text"] * 50 + ["Hello this is my bad text"] * 50,
                  "b": list(range(100)),
                  "c": pd.date_range("2020-01-01", periods=100)})
y = pd.Series([1] * 90 + [0] * 10)
X.ww.init(logical_types={"a": "NaturalLanguage", "c": "Datetime"})

sampler = get_best_sampler_for_data(X, y, sampler_method="Oversampler",
                                    sampler_balanced_ratio=0.5)

pipeline = BinaryClassificationPipeline({"Dt": ["DateTime Featurization Component", "X", "y"],
                                         "Text": ["Text Featurization Component", "Dt.x", "y"],
                                         "Oversampler": [sampler, "Text.x", "y"],
                                         "RF": ["Random Forest Classifier", "Oversampler.x", "Oversampler.y"]
                                        })

with pytest.raises(IndexError):
    pipeline.fit(X, y)

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @freddyaboulton, we agreed the base issue is that what's considered numeric/categorical/neither when we select the proper sampler is different than what's numeric/categorical when the data has made it to the sampler in the pipeline.

Right now, boolean columns are the only non-numeric, non-categorical columns that are considered categorical by SMOTE. Therefore for the purposes of this PR, the fix is just to select for both categorical and boolean columns when determining which SMOTE Oversampler to use. Freddy will file an issue to track future work to make oversampler selection more scalable, since that is out of scope for this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue filed here! #2605

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @eccabay !

name_output = get_best_sampler_for_data(X_ww, y, "Oversampler", 0.5)
assert name_output == "SMOTEN Oversampler"

X_ww = infer_feature_types(X, feature_types={0: "boolean", 1: "boolean"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


X = X.drop([i for i in range(2, 20)], axis=1) # drop all numeric columns
X_ww = infer_feature_types(X, feature_types={0: "boolean", 1: "categorical"})
snc = SMOTENCOversampler()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is covered elsewhere already but maybe also check that SMOTEN does work when it's only boolean and boolean/categorical?

@eccabay eccabay merged commit efcf5d1 into main Aug 9, 2021
@eccabay eccabay deleted the 74_smotenc branch August 9, 2021 13:14
@chukarsten chukarsten mentioned this pull request Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants