Add Undersampler Component to AutoMLSearch #2128
Conversation
…into bc_oversampler_component
evalml/pipelines/utils.py
Outdated
if sampler_name is not None:
    try:
        import_or_raise("imblearn.over_sampling", error_msg="imbalanced-learn is not installed")
        sampler_components = {
While this PR only adds the Undersampler to AutoMLSearch, I wanted to include the oversamplers in make_pipeline to simplify the follow-up work of adding the oversamplers to AutoMLSearch.
dsherry left a comment
@bchen1116 excellent!!!
The perf test results look great to me.
I had a few thoughts on little simplifications to the way the pieces connect, and what test coverage is necessary. Otherwise, I think this is almost ready to ship.
  " max_batches=1,\n",
- " optimize_thresholds=True)\n",
+ " optimize_thresholds=True,\n",
+ " sampler_method=None)\n",
I'm curious, why add this here?
I found that if we do over/undersampling, the performance on this model becomes very poor, with the AUC at 0.5. I removed the sampling to improve the performance of the model here.
It might be good to file an issue to look deeper into lead_scoring, since it looks like we have similar issues in previous docs here and here
Understood. Yeah agreed. Would you mind filing that?
output = component_instance.transform(input_x, input_y)
if isinstance(output, tuple):
    output_x, output_y = output[0], output[1]
    most_recent_y = output_y
I see, so we will pass forward the target values from the last component that outputs a target, or the original target if no component has modified the target. And yes, I see how this would fix #2053. Makes sense. Good call!
I would like to point out that this change could be a great separate PR. It's needed for using the Undersampler component, but it's an independent alteration to ComponentGraph behavior and requires separate test coverage. Doing it in a separate PR would make it easier for reviewers to follow the change and the tests. No need to change things here; it was great that you added this, just pointing it out as food for thought.
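The behavior described in that comment can be sketched outside EvalML. Everything below (class and function names included) is illustrative, not the actual ComponentGraph implementation: a component's transform may return either X alone or an (X, y) tuple, and the chain keeps the most recent y it has seen.

```python
def fit_transform_chain(components, X, y):
    """Run each component's transform, tracking the latest target.

    Toy sketch: if a component returns an (X, y) tuple, its y replaces
    the target passed to downstream components; otherwise the most
    recently seen target is carried forward unchanged.
    """
    most_recent_y = y
    for component in components:
        output = component.transform(X, most_recent_y)
        if isinstance(output, tuple):
            X, most_recent_y = output[0], output[1]
        else:
            X = output
    return X, most_recent_y


class DropFirstRow:
    """Toy sampler-like component that modifies both X and y."""
    def transform(self, X, y):
        return X[1:], y[1:]


class Identity:
    """Toy transformer that only returns X."""
    def transform(self, X, y):
        return X
```

With this, a sampler in the middle of a chain can shrink or rebalance y, and the final target reflects that change, which is the behavior needed to fix #2053.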
evalml/pipelines/components/transformers/samplers/oversamplers.py
Outdated
if len(categorical_cols.columns) > 0 and estimator_class not in {CatBoostClassifier, CatBoostRegressor}:
    pp_components.append(OneHotEncoder)

if sampler_name is not None:
Did this comment get put in the wrong place? I'm not sure I understand how it's connected with this code change.
evalml/pipelines/utils.py
Outdated
if sampler_name is not None:
    try:
        import_or_raise("imblearn.over_sampling", error_msg="imbalanced-learn is not installed")
        sampler_components = {

# check that we do add the sampler properly
component_to_check = pipeline.component_graph
if samplers is not None and problem_type != 'regression':
    # we add the sampler before the scaler if it exists
Oh, I think I missed this detail. Why do this?
We want the sampler to come before the scaler since the scaler turns boolean values into int values ([0, 1] -> [-1, 1]), and the literature suggests that scaling should come after sampling.
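A minimal sketch of that ordering rule, with a hypothetical helper and illustrative component names (this is not EvalML's make_pipeline): the sampler is placed just before the scaler when one exists, otherwise just before the final estimator.

```python
def insert_sampler(component_names, sampler_name, scaler_name="Standard Scaler"):
    """Toy sketch: insert a sampler ahead of the scaler so that
    sampling happens on unscaled data.

    If no scaler is present, place the sampler just before the final
    estimator instead. Component names here are illustrative.
    """
    names = list(component_names)
    if scaler_name in names:
        names.insert(names.index(scaler_name), sampler_name)
    else:
        names.insert(len(names) - 1, sampler_name)
    return names
```

For example, inserting 'Undersampler' into ['Imputer', 'Standard Scaler', 'Logistic Regression Classifier'] places it between the imputer and the scaler.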
dsherry left a comment
@bchen1116 thanks, looking excellent 😁
if is_classification(self.problem_type):
    self._sampler_name = self.sampler_method
    if self.sampler_method == 'auto':
        self._sampler_name = get_best_sampler_for_data(self.X_train, self.y_train, self.sampler_method, self.sampler_balanced_ratio)
freddyaboulton left a comment
@bchen1116 This looks great! I have a question about the behavior of sampler_balanced_ratio. I think that should get resolved before merging!
if is_classification(self.problem_type):
    self._sampler_name = self.sampler_method
    if self.sampler_method == 'auto':
        self._sampler_name = get_best_sampler_for_data(self.X_train, self.y_train, self.sampler_method, self.sampler_balanced_ratio)
If _sampler_name is only used for making pipelines, maybe we should let make_pipeline handle the 'auto' to sampler name conversion? Or not store _sampler_name in the AutoMLSearch state?
Fixing, but I will leave this in so we can set the balanced ratio for the component.
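For illustration, the 'auto'-to-sampler-name flow discussed here could look like the toy sketch below. The decision rule is invented for the example; the real get_best_sampler_for_data inspects the training data and the balanced ratio, as the quoted code shows.

```python
def resolve_sampler_name(sampler_method, X, y):
    """Toy sketch of resolving sampler_method='auto' to a concrete name.

    A non-'auto' value is passed through unchanged. The size-based rule
    below is purely illustrative and is not EvalML's actual heuristic.
    """
    if sampler_method != "auto":
        return sampler_method
    # Illustrative rule: undersample large datasets, oversample small ones.
    return "Undersampler" if len(y) > 10000 else "SMOTE Oversampler"
```

Keeping this resolution in one place (whether in AutoMLSearch or make_pipeline) avoids storing a derived name in two states that could drift apart.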
X = pd.DataFrame({"a": [i for i in range(100)],
                  "b": [i % 3 for i in range(100)]})
y = pd.Series([0] * 90 + [1] * 10)
component_list = ['Imputer', 'SMOTE Oversampler', 'Random Forest Classifier']
How come we need both component_graph.fit and component_graph2.fit if we're only checking the output of the mocks after component_graph2.fit? I think we need a reset_mock in between if we want to keep both.
These are two different mocks. One is mocking the RFC pipeline, and the other mocks the DTC pipeline
@pytest.mark.parametrize("sampling_ratio", [0.8, 0.5, 0.25, 0.2, 0.1, 0.05])
def test_automl_search_sampler_ratio(sampling_ratio, size, categorical_features, problem_type, mock_imbalanced_data_X_y, has_minimal_dependencies):
    X, y = mock_imbalanced_data_X_y(problem_type, categorical_features, size)
    automl = AutoMLSearch(X_train=X, y_train=y, problem_type=problem_type, sampler_method='auto', sampler_balanced_ratio=sampling_ratio)
For paranoia's sake, can you search with max_batches=2 and make sure each pipeline gets the right value during the search? Checking the rankings table should be sufficient.
I see we use the _frozen_pipelines_parameter argument in IterativeAlgorithm and I don't think we have coverage for that (at least in test_iterative_algorithm.py)
Feel free to merge after that!
Ah not a bad idea. So you want to run two batches, and assert that all the pipelines were given the right undersampler parameters? That makes sense.
I wonder if this sort of test would fit better in test_iterative_algorithm.py, since it's testing _frozen_pipelines_parameter functionality and isn't specific to the undersampler component, right?
@bchen1116 if you run search please don't forget to mock fit and score :)
Yep, adding the test to test_iterative_algorithm.py!
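The "mock fit and score" suggestion above can be illustrated with a toy searcher; the Pipeline and run_search names below are hypothetical stand-ins, not AutoMLSearch itself. Patching the expensive methods lets the test exercise search orchestration without any real training.

```python
from unittest.mock import patch


class Pipeline:
    """Toy pipeline whose real fit/score would be expensive in tests."""
    def fit(self, X, y):
        raise RuntimeError("expensive: should be mocked in tests")

    def score(self, X, y):
        raise RuntimeError("expensive: should be mocked in tests")


def run_search(pipelines, X, y):
    """Toy search loop: fit each pipeline, then collect its score."""
    results = []
    for pipeline in pipelines:
        pipeline.fit(X, y)
        results.append(pipeline.score(X, y))
    return results


# Patch fit and score so the search runs instantly and deterministically.
with patch.object(Pipeline, "fit", return_value=None), \
     patch.object(Pipeline, "score", return_value=0.5):
    scores = run_search([Pipeline(), Pipeline()], X=[[0]], y=[0])
```

Inside the patch context, every pipeline reports the mocked score, so assertions can focus on what the search passed to each pipeline rather than on model quality.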
fix #2053
Part 1 of support for #2098 (only adds the Undersampler for now; adding the oversampler components will be a separate PR)
Initial implementation and testing. #2053 was fixed in order to support adding samplers to pipelines.
Quip doc here
Perf tests here