Add Undersampler Component #2030

Merged: bchen1116 merged 44 commits into main from bc_2020_undersampler_component on Apr 1, 2021
Conversation

bchen1116 (Contributor):

fix #2020

bchen1116 self-assigned this Mar 24, 2021

codecov bot commented Mar 24, 2021

Codecov Report

Merging #2030 (12a4f37) into main (8873ed1) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #2030     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         284      288      +4     
  Lines       23271    23425    +154     
=========================================
+ Hits        23261    23415    +154     
  Misses         10       10             
Impacted Files Coverage Δ
evalml/pipelines/components/__init__.py 100.0% <ø> (ø)
evalml/pipelines/component_graph.py 100.0% <100.0%> (ø)
...alml/pipelines/components/transformers/__init__.py 100.0% <100.0%> (ø)
...lines/components/transformers/samplers/__init__.py 100.0% <100.0%> (ø)
...s/components/transformers/samplers/base_sampler.py 100.0% <100.0%> (ø)
...s/components/transformers/samplers/undersampler.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_components.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_undersampler.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_utils.py 100.0% <100.0%> (ø)
...valml/tests/pipeline_tests/test_component_graph.py 100.0% <100.0%> (ø)
... and 5 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -54,7 +54,10 @@ def from_list(cls, component_list, random_seed=0):

             component_dict[component_name] = [component_class]
             if previous_component is not None:
-                component_dict[component_name].append(f"{previous_component}.x")
+                if "sampler" in previous_component:
+                    component_dict[component_name].extend([f"{previous_component}.x", f"{previous_component}.y"])
+                else:
+                    component_dict[component_name].append(f"{previous_component}.x")
bchen1116 (Author):

Need to track both X and y outputs from the undersampler (using 'sampler' to prepare for potential future oversampler transformers)
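
For illustration, a hedged sketch of the edges from_list now builds for a linear pipeline containing a sampler (the component names here are assumed, not from this PR). Every component gets its predecessor's .x edge, and a sampler's successor additionally gets its .y:

# Sketch only: in the real component_dict the first list element is the
# component class; strings stand in for classes to keep the example runnable.
component_dict = {
    "Imputer": ["Imputer"],
    "Undersampler": ["Undersampler", "Imputer.x"],
    "Random Forest Classifier": ["Random Forest Classifier", "Undersampler.x", "Undersampler.y"],
}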

Contributor:

Sounds good!

bchen1116 marked this pull request as ready for review March 24, 2021 21:54
chukarsten (Collaborator) left a comment:

Bryan, solid work. It's nice to see the base class of the Undersampler get put to work. A few things I'd like to resolve before merging:

  1. What to do about transform()? Can we just relegate the transform() task to the base class?
  2. The change in the transform() API with the return values. I think we should do our best to have all components' transform() return the same number of values, with each return value having the same type. But I'd like to know other people's thoughts, too.
  3. The test added in test_component_graph: should it be more generalized, or should it specifically reference the undersampler?

I think a lot of these can be addressed in a post-standup discussion! Still, overall great work. I definitely appreciate the clarity of the tests.

        train_indices = index_df[index_df.isin(indices)].dropna().index.values.tolist()
        return X.iloc[train_indices], y.iloc[train_indices]

    def transform(self, X, y):
Collaborator:

Is this necessary? I need to check if this is how we handle other components that don't need transform. Is there any ambiguity in someone calling .transform() on their data and perhaps expecting the sampling to be applied?

bchen1116 (Author):

In our base Transformer class, we end up trying to transform with self._component_obj, which would raise an error since we don't have a transform method with the undersampler. I overrode it to avoid this issue.
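
A minimal sketch of the delegation being described (the base-class shape here is an assumption for illustration, not evalml's exact code):

class Transformer:
    def __init__(self, component_obj=None):
        # Wrapped third-party object that usually does the real work.
        self._component_obj = component_obj

    def transform(self, X, y=None):
        # Raises AttributeError when the wrapped object has no transform
        # method, which is why the undersampler overrides this.
        return self._component_obj.transform(X)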

Collaborator:

Hmmm, this is curious. Does it make more sense to raise a NotImplementedError?

bchen1116 (Author):

Hmm, I don't think so. When we score or predict with a pipeline, we call transform() on all of the transformer components rather than fit_transform(), which makes sense since we don't want to refit components on a pipeline that has already been trained. In this scenario our sampler shouldn't do anything (we only sample during training), so users should be able to call transform() and get a result back rather than an error.

Contributor:

I think this is a cool way of specifying "train only" transformers! I was thinking we'd have to pass info from the pipeline down to the component, so it's great that we can do something simpler.

The one confusing thing is that fit().transform() != fit_transform(), which is not what someone coming fresh off sklearn would expect. Maybe we should just be super clear about which transformers are "train only" and what users should expect from them. I think "train only" transformers are a cool concept we can use to differentiate ourselves.
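
A minimal sketch of the "train only" behavior being described, with a toy undersampler standing in for the real component (all names here are illustrative, not this PR's code):

import pandas as pd

class TrainOnlySampler:
    """Toy 'train only' transformer: fit_transform() resamples, but
    transform() is a pass-through, so fit().transform() != fit_transform()."""

    def fit(self, X, y):
        return self  # nothing to learn in this toy version

    def fit_transform(self, X, y):
        # Training path: undersample every class down to the size of the
        # smallest class.
        self.fit(X, y)
        n = y.value_counts().min()
        keep = y.groupby(y, group_keys=False).apply(
            lambda s: s.sample(n, random_state=0)
        ).index
        return X.loc[keep], y.loc[keep]

    def transform(self, X, y=None):
        # Predict/score path: sampling only applies during training, so the
        # data passes through unchanged.
        return X, y

X = pd.DataFrame({"a": range(6)})
y = pd.Series([0, 0, 0, 0, 1, 1])
sampler = TrainOnlySampler()
X_t, y_t = sampler.fit_transform(X, y)  # 4 rows: 2 per class
X_p, y_p = sampler.transform(X, y)      # 6 rows: unchanged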

bchen1116 (Author):

Yeah, that's a good point! Would it be best to state that this is a 'train only' transformer in the documentation/description of the component?

(Resolved review threads on evalml/tests/component_tests/test_components.py ×2 and evalml/tests/component_tests/test_undersampler.py)
assert component_graph.get_parents('OneHot') == ['Imputer.x']
assert component_graph.get_parents('Undersampler') == ['OneHot.x']
assert component_graph.get_parents('Random Forest') == ['Undersampler.x', 'Undersampler.y']
assert component_graph.get_parents('Elastic Net') == ['Undersampler.x', 'Undersampler.y']
Collaborator:

Interesting, I think I'm a little out of the loop with respect to the component graphs. Is it expected that there's a branch from the undersampler to both Elastic Net and Random Forest? I assume that because they're both estimators, both become children of the last component?

That kind of leads me to another question: is this a test specific to the undersampler, or would the test look like this without the undersampler?

assert component_graph.get_parents('Imputer') == []
assert component_graph.get_parents('OneHot') == ['Imputer.x']
assert component_graph.get_parents('Random Forest') == ['OneHot.x', 'OneHot.y']
assert component_graph.get_parents('Elastic Net') == ['OneHot.x', 'OneHot.y']

If it would look like the above with OneHot taking the place of Undersampler, I think we really just need a more generalized test that tests the behavior with respect to the transformer/estimator boundary in the component graph without necessarily using the Undersampler.

bchen1116 (Author):

@chukarsten Currently, the component graph stores only the x value among a component's parents, rather than both X and y, since our previous transformers didn't alter y. I added support for y because our sampling methods (both over and under) will transform both X and y, which is why the tests I added are specific to the undersampler. For your example, it would look like:

assert component_graph.get_parents('Imputer') == []
assert component_graph.get_parents('OneHot') == ['Imputer.x']
assert component_graph.get_parents('Random Forest') == ['OneHot.x']
assert component_graph.get_parents('Elastic Net') == ['OneHot.x']

Collaborator:

So is this test checking that the y value gets passed forward?

bchen1116 (Author):

Yep, this makes sure that when we add an undersampler component to the pipeline, we keep track of both the X and y outputs of that component, while for other components we still only track the X output. This is definitely something we can discuss in conjunction with your point about normalizing the output of transform() across all transformers.

)


class BaseSampler(Transformer):
bchen1116 (Author) commented Mar 29, 2021:

Added this base class for all of our future under/oversamplers to build on.

Collaborator:

I think this is a solid add and we'll reap the rewards of it later.

ParthivNaresh (Contributor) left a comment:

Looks excellent!

@@ -736,3 +737,44 @@ def test_component_graph_dataset_with_different_types():
    assert input_feature_names['Elastic Net'] == ['column_3', 'column_1_a', 'column_1_b', 'column_1_c', 'column_1_d',
                                                  'column_2_1', 'column_2_2', 'column_2_3', 'column_2_4', 'column_2_5', 'column_2_6']
    assert input_feature_names['Logistic Regression'] == ['Random Forest', 'Elastic Net']


def test_component_graph_sampler():
Contributor:

I was thinking about this re: the target imputer, but can you try adding a test that fits and predicts with a component graph containing an undersampler, to confirm that it works?

bchen1116 (Author):

Added a test in the pipeline tests!
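
For reference, a hedged sketch of the kind of end-to-end check being asked for here (the construction and toy data below are assumptions, not the test that was actually added):

import pandas as pd
from evalml.pipelines import ComponentGraph

def test_component_graph_fit_predict_with_undersampler():
    # Imbalanced toy data.
    X = pd.DataFrame({"feature": range(100)})
    y = pd.Series([0] * 90 + [1] * 10)

    component_graph = ComponentGraph.from_list(
        ["Imputer", "Undersampler", "Random Forest Classifier"]
    )
    component_graph.instantiate({})
    component_graph.fit(X, y)

    # Predicting goes through transform(), so no rows are dropped here.
    preds = component_graph.predict(X)
    assert len(preds) == len(y)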

bchen1116 merged commit 90b70be into main Apr 1, 2021
chukarsten mentioned this pull request Apr 6, 2021
freddyaboulton deleted the bc_2020_undersampler_component branch May 13, 2022 15:18
Development

Successfully merging this pull request may close these issues:

Create an undersampler component