Add Undersampler Component to AutoMLSearch #2128

Merged: bchen1116 merged 115 commits into main from bc_2098_sampling on May 4, 2021

Conversation

@bchen1116 (Contributor) commented Apr 12, 2021:

fix #2053
Part 1 of #2098 (only adds the Undersampler for now; adding oversampler components will be a separate PR)
Initial implementation and testing. #2053 was fixed in order to support adding samplers to pipelines.

Quip doc here

Perf tests here

if sampler_name is not None:
    try:
        import_or_raise("imblearn.over_sampling", error_msg="imbalanced-learn is not installed")
        sampler_components = {
bchen1116 (author):

While this PR adds only the Undersampler to AutoMLSearch, I wanted to include the oversamplers in make_pipelines to simplify the follow-up work of adding the oversamplers to AutoMLSearch.

Contributor:

👍

@dsherry (Contributor) left a comment:

@bchen1116 excellent!!!

The perf test results look great to me.

I had a few thoughts on little simplifications to the way the pieces connect, and what test coverage is necessary. Otherwise, I think this is almost ready to ship.

" max_batches=1,\n",
" optimize_thresholds=True)\n",
" optimize_thresholds=True,\n",
" sampler_method=None)\n",
Contributor:

I'm curious, why add this here?

@bchen1116 (author) commented May 3, 2021:

I found that if we do over/undersampling, the performance on this model becomes very poor, with the AUC at 0.5. I removed the sampling to improve the performance of the model here.

[image attachment]

It might be good to file an issue to look deeper into lead_scoring, since it looks like we have similar issues in previous docs here and here.

Contributor:

Understood. Yeah agreed. Would you mind filing that?

bchen1116 (author):

Filed one here!

output = component_instance.transform(input_x, input_y)
if isinstance(output, tuple):
    output_x, output_y = output[0], output[1]
    most_recent_y = output_y
@dsherry commented Apr 30, 2021:

I see, so we will pass forward the target values from the last component to output a target, or the original target if no component has modified the target. And yes, I see how this would fix #2053. Makes sense. Good call!

I would like to point out that this change could be a great separate PR. It's needed for using the Undersampler component, but it's an independent alteration to ComponentGraph behavior and requires separate test coverage. Doing it in a separate PR would make it easy for reviewers to follow the change and the tests. No need to change things here; it was great that you added this. Just pointing that out as food for thought.
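
A standalone sketch of the pass-forward behavior described above (class and function names are illustrative, not EvalML's actual implementation):

```python
class DropRows:
    """A sampler-like component: returns a modified X *and* a modified y."""
    def transform(self, X, y):
        return X[:5], y[:5]  # a tuple output signals that the target changed

class Identity:
    """A transformer that only touches X and returns it alone."""
    def transform(self, X, y):
        return X

def run_graph(components, X, y):
    most_recent_y = y  # the original target, used if no component modifies it
    for component in components:
        output = component.transform(X, most_recent_y)
        if isinstance(output, tuple):
            # the component modified the target; pass the new y forward
            X, most_recent_y = output[0], output[1]
        else:
            X = output
    return X, most_recent_y

X_out, y_out = run_graph([Identity(), DropRows()], list(range(10)), list(range(10)))
```

Here the final target comes from DropRows, the last component that modified it; with only Identity in the graph, the original y would be returned untouched.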

if len(categorical_cols.columns) > 0 and estimator_class not in {CatBoostClassifier, CatBoostRegressor}:
    pp_components.append(OneHotEncoder)

if sampler_name is not None:
Contributor:

Did this comment get put in the wrong place? I'm not sure I understand how it's connected with this code change.

if sampler_name is not None:
    try:
        import_or_raise("imblearn.over_sampling", error_msg="imbalanced-learn is not installed")
        sampler_components = {
Contributor:

👍

# check that we do add the sampler properly
component_to_check = pipeline.component_graph
if samplers is not None and problem_type != 'regression':
    # we add the sampler before the scaler if it exists
Contributor:

Oh, I think I missed this detail. Why do this?

bchen1116 (author):

We want the sampler to come before the scaler, since the scaler turns boolean values into int values ([0, 1] -> [-1, 1]), and the literature suggests scaling data should come after sampling.

@bchen1116 bchen1116 requested a review from dsherry May 3, 2021 17:16
@bchen1116 bchen1116 mentioned this pull request May 3, 2021
@dsherry (Contributor) left a comment:

@bchen1116 thanks, looking excellent 😁

if is_classification(self.problem_type):
    self._sampler_name = self.sampler_method
    if self.sampler_method == 'auto':
        self._sampler_name = get_best_sampler_for_data(self.X_train, self.y_train, self.sampler_method, self.sampler_balanced_ratio)
Contributor:

Thanks! Looks great

@freddyaboulton (Contributor) left a comment:

@bchen1116 This looks great! I have a question about the behavior of sampler_balanced_ratio. I think that should get resolved before merging!

if is_classification(self.problem_type):
    self._sampler_name = self.sampler_method
    if self.sampler_method == 'auto':
        self._sampler_name = get_best_sampler_for_data(self.X_train, self.y_train, self.sampler_method, self.sampler_balanced_ratio)
Contributor:

If _sampler_name is only used for making pipelines, maybe we should let make_pipeline handle the 'auto' to sampler name conversion? Or not store _sampler_name in the AutoMLSearch state?

bchen1116 (author):

Fixing, but I will leave this in place in order to set the balanced ratio for the component.
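
As a rough illustration of the 'auto' resolution being discussed (a hypothetical stand-in for get_best_sampler_for_data, not its real logic, which also considers other data characteristics):

```python
from collections import Counter

def resolve_sampler_name(sampler_method, y, sampler_balanced_ratio=0.25):
    """Hypothetical sketch: keep an explicit sampler choice as-is; for
    'auto', skip sampling when the minority/majority class ratio already
    meets sampler_balanced_ratio, otherwise pick the Undersampler."""
    if sampler_method != "auto":
        return sampler_method
    counts = Counter(y)
    minority, majority = min(counts.values()), max(counts.values())
    if minority / majority >= sampler_balanced_ratio:
        return None  # data is balanced enough; no sampler needed
    return "Undersampler"
```

With y = [0] * 90 + [1] * 10 the class ratio is about 0.11, so a balanced ratio of 0.25 selects the Undersampler, while a ratio of 0.05 selects no sampler.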

X = pd.DataFrame({"a": [i for i in range(100)],
                  "b": [i % 3 for i in range(100)]})
y = pd.Series([0] * 90 + [1] * 10)
component_list = ['Imputer', 'SMOTE Oversampler', 'Random Forest Classifier']
Contributor:

How come we need both component_graph.fit and component_graph2.fit if we're only checking the output of the mocks after component_graph2.fit? I think we need a reset_mock in between if we want to keep both.

bchen1116 (author):

These are two different mocks. One is mocking the RFC pipeline, and the other mocks the DTC pipeline
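
The distinction can be illustrated with unittest.mock (the mock names here are made up): two separate mocks keep independent call histories, whereas a single shared mock would need reset_mock between fits.

```python
from unittest.mock import MagicMock

# Two independent mocks, one per pipeline: no reset needed between fits.
mock_rfc_fit = MagicMock(name="rfc_fit")
mock_dtc_fit = MagicMock(name="dtc_fit")
mock_rfc_fit("X", "y")
mock_dtc_fit("X", "y")

# A single shared mock would accumulate calls across both fits, so
# reset_mock would be required to check only the second fit.
shared_fit = MagicMock(name="shared_fit")
shared_fit("first fit")
shared_fit.reset_mock()  # clears call history and call_args
shared_fit("second fit")
```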

@pytest.mark.parametrize("sampling_ratio", [0.8, 0.5, 0.25, 0.2, 0.1, 0.05])
def test_automl_search_sampler_ratio(sampling_ratio, size, categorical_features, problem_type, mock_imbalanced_data_X_y, has_minimal_dependencies):
    X, y = mock_imbalanced_data_X_y(problem_type, categorical_features, size)
    automl = AutoMLSearch(X_train=X, y_train=y, problem_type=problem_type, sampler_method='auto', sampler_balanced_ratio=sampling_ratio)
Contributor:

For paranoia's sake, can you search with max_batches=2 and make sure each pipeline gets the right value during the search? Checking the rankings table should be sufficient.

I see we use the _frozen_pipelines_parameter argument in IterativeAlgorithm, and I don't think we have coverage for that (at least in test_iterative_algorithm.py).

Feel free to merge after that!

Contributor:

Ah, not a bad idea. So you want to run two batches and assert that all the pipelines were given the right undersampler parameters? That makes sense.

I wonder if this sort of test would fit better in test_iterative_algorithm.py, since it's testing _frozen_pipelines_parameter functionality and isn't specific to the undersampler component, right?

@bchen1116 if you run search please don't forget to mock fit and score :)
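
A minimal sketch of that mocking pattern with unittest.mock.patch, using a toy pipeline class in place of EvalML's real ones:

```python
from unittest.mock import patch

class ToyPipeline:
    """Toy stand-in; a real test would patch EvalML's pipeline classes."""
    def fit(self, X, y):
        raise RuntimeError("expensive training should never run in the test")
    def score(self, X, y, objectives):
        raise RuntimeError("expensive scoring should never run in the test")

def run_search(pipeline):
    # stand-in for the part of search that fits and scores each pipeline
    pipeline.fit(None, None)
    return pipeline.score(None, None, ["Log Loss Binary"])

with patch.object(ToyPipeline, "fit") as mock_fit, \
        patch.object(ToyPipeline, "score", return_value={"Log Loss Binary": 0.5}) as mock_score:
    scores = run_search(ToyPipeline())
```

With fit and score patched, the search-like loop runs instantly, and the test can assert on the mocks' call counts instead of on real model quality.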

bchen1116 (author):

Yep, adding the test to test_iterative_algorithm.py!

@bchen1116 bchen1116 merged commit fcacca2 into main May 4, 2021
@freddyaboulton freddyaboulton deleted the bc_2098_sampling branch May 13, 2022 15:35

Development

Successfully merging this pull request may close these issues.

Fix and clean component graph list support

3 participants