-
Notifications
You must be signed in to change notification settings - Fork 92
Fix SMOTE Oversampler #2590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix SMOTE Oversampler #2590
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2590 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 295 295
Lines 26811 26847 +36
=======================================
+ Hits 26765 26801 +36
Misses 46 46
Continue to review full report at Codecov.
|
bchen1116
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left one additional suggestion that could be nice to test but not blocking.
| X = pd.DataFrame(X) | ||
| y = pd.Series([i % 5 == 0 for i in range(100)]) | ||
| X[0] = [i % 2 for i in range(100)] | ||
| X_ww = infer_feature_types(X, feature_types={0: "boolean", 1: "Categorical"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could we see what happens with 'datetime' as well?
evalml/automl/utils.py
Outdated
| "imblearn.over_sampling", error_msg="imbalanced-learn is not installed" | ||
| ) | ||
| cat_cols = X.ww.select("Categorical").columns | ||
| cat_cols = X.ww.select(exclude="numeric").columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will cause a problem if users pass in non-categorical and non-numeric features that get converted to numeric by components in the pipeline.
For example, this counts text as categoricals. But the text would get turned into doubles by the text featurizer. So they wouldn't be categoricals by the time it got to the sampler. That would cause a stacktrace because SmoteNC expects categoricals to be present, right?
from evalml.pipelines import BinaryClassificationPipeline
from evalml.automl.utils import get_best_sampler_for_data
import pandas as pd
import woodwork as ww
import pytest
X = pd.DataFrame({"a": ["Hello this is my text"] * 50 + ["Hello this is my bad text"] * 50,
"b": list(range(100)),
"c": pd.date_range("2020-01-01", periods=100)})
y = pd.Series([1] * 90 + [0] * 10)
X.ww.init(logical_types={"a": "NaturalLanguage", "c": "Datetime"})
sampler = get_best_sampler_for_data(X, y, sampler_method="Oversampler",
sampler_balanced_ratio=0.5)
pipeline = BinaryClassificationPipeline({"Dt": ["DateTime Featurization Component", "X", "y"],
"Text": ["Text Featurization Component", "Dt.x", "y"],
"Oversampler": [sampler, "Text.x", "y"],
"RF": ["Random Forest Classifier", "Oversampler.x", "Oversampler.y"]
})
with pytest.raises(IndexError):
pipeline.fit(X, y)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with @freddyaboulton, we agreed the base issue is that what's considered numeric/categorical/neither when we select the proper sampler is different than what's numeric/categorical when the data has made it to the sampler in the pipeline.
Right now, boolean columns are the only non-numeric, non-categorical columns that are considered categorical by SMOTE. Therefore for the purposes of this PR, the fix is just to select for both categorical and boolean columns when determining which SMOTE Oversampler to use. Freddy will file an issue to track future work to make oversampler selection more scalable, since that is out of scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue filed here! #2605
freddyaboulton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @eccabay !
| name_output = get_best_sampler_for_data(X_ww, y, "Oversampler", 0.5) | ||
| assert name_output == "SMOTEN Oversampler" | ||
|
|
||
| X_ww = infer_feature_types(X, feature_types={0: "boolean", 1: "boolean"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
|
||
| X = X.drop([i for i in range(2, 20)], axis=1) # drop all numeric columns | ||
| X_ww = infer_feature_types(X, feature_types={0: "boolean", 1: "categorical"}) | ||
| snc = SMOTENCOversampler() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is covered elsewhere already but maybe also check that SMOTEN does work when it's only boolean and boolean/categorical?

Fixes bug where non-categorical and non-numeric data types (boolean, DateTime, etc) were being counted as numeric columns when determining which SMOTE oversampler to add, but SMOTE counts them as categorical.