Add Components to Create Features from URL and Email Logical Types #2550

Merged
merged 11 commits into from Jul 28, 2021

Conversation

Contributor

@freddyaboulton freddyaboulton commented Jul 23, 2021

Pull Request Description

Fixes #2444

build_conda_pkg will fail until after we merge, because we need to bump the lower bound on featuretools to 0.26.1.


codecov bot commented Jul 23, 2021

Codecov Report

Merging #2550 (62a4313) into main (1f0b96d) will increase coverage by 0.1%.
The diff coverage is 99.6%.

@@           Coverage Diff           @@
##            main   #2550     +/-   ##
=======================================
+ Coverage   99.9%   99.9%   +0.1%     
=======================================
  Files        285     287      +2     
  Lines      26115   26338    +223     
=======================================
+ Hits       26081   26304    +223     
  Misses        34      34             
| Impacted Files | Coverage Δ |
|---|---|
| `evalml/pipelines/components/__init__.py` | 100.0% <ø> (ø) |
| `...alml/pipelines/components/transformers/__init__.py` | 100.0% <ø> (ø) |
| `evalml/tests/component_tests/test_utils.py` | 97.5% <66.7%> (ø) |
| `.../components/transformers/preprocessing/__init__.py` | 100.0% <100.0%> (ø) |
| `...rs/preprocessing/transform_primitive_components.py` | 100.0% <100.0%> (ø) |
| `evalml/pipelines/utils.py` | 99.1% <100.0%> (+0.1%) ⬆️ |
| `...nt_tests/test_ft_transform_primitive_components.py` | 100.0% <100.0%> (ø) |
| `evalml/tests/conftest.py` | 98.6% <100.0%> (+0.1%) ⬆️ |
| `...s/prediction_explanations_tests/test_explainers.py` | 100.0% <100.0%> (ø) |
| `...understanding_tests/test_permutation_importance.py` | 100.0% <100.0%> (ø) |

... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1f0b96d...62a4313.

from evalml.utils import infer_feature_types


class _ExtractFeaturesWithTransformPrimitives(Transformer):
Contributor Author

Added a parent class to minimize the code duplication that comes from the following pattern:

  1. Specify some featuretools primitives at class definition
  2. Run dfs to extract features with those primitives
  3. Delete the original columns

Maybe this could be used for the TextFeaturizer later as well. But if we find this abstraction useful, we may need to rethink how the casting to categorical at the end of transform happens.
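
The three-step pattern described in this comment can be sketched in plain Python. This is only an illustration under stated assumptions, not the evalml implementation: the class names `_ExtractFeaturesSketch`, `URLFeaturizerSketch`, and `EmailFeaturizerSketch`, the `primitives` attribute, the generated feature names, and the list-of-dicts data format are all hypothetical, and ordinary functions stand in for featuretools primitives and the `dfs` call.

```python
from urllib.parse import urlparse


class _ExtractFeaturesSketch:
    """Hypothetical parent class: subclasses only declare their primitives."""

    # Step 1: (name, function) pairs declared at class definition.
    # Plain functions stand in for featuretools transform primitives.
    primitives = []

    def __init__(self, columns):
        self.columns = list(columns)

    def transform(self, rows):
        transformed = []
        for row in rows:
            # Step 3: copy the row without the original columns.
            new_row = {k: v for k, v in row.items() if k not in self.columns}
            # Step 2: apply every primitive to every target column
            # (a stand-in for running dfs).
            for col in self.columns:
                for name, func in self.primitives:
                    new_row[f"{name}({col})"] = func(row[col])
            transformed.append(new_row)
        return transformed


class URLFeaturizerSketch(_ExtractFeaturesSketch):
    primitives = [
        ("DOMAIN", lambda url: urlparse(url).netloc),
        ("PROTOCOL", lambda url: urlparse(url).scheme),
    ]


class EmailFeaturizerSketch(_ExtractFeaturesSketch):
    primitives = [
        ("DOMAIN", lambda email: email.rsplit("@", 1)[-1]),
    ]


rows = [{"id": 1, "url": "https://example.com/docs", "email": "user@example.org"}]
url_features = URLFeaturizerSketch(["url"]).transform(rows)
email_features = EmailFeaturizerSketch(["email"]).transform(rows)
```

With this shape, each concrete featurizer is only a list of primitives; the extraction and column-dropping logic lives once in the parent.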

Contributor

I think this is a great idea

Collaborator

love this!

Contributor

❤️ this!

@freddyaboulton freddyaboulton requested a review from a team July 26, 2021 16:42
@freddyaboulton freddyaboulton marked this pull request as ready for review July 26, 2021 16:42
Contributor

@ParthivNaresh ParthivNaresh left a comment

Awesome work, looks great to me. Thanks for setting up a parent class for feature extraction, I have no doubt it'll come in handy. Love the tests too!

"""{}"""

def __init__(self, random_seed=0, **kwargs):
    self._primitives_provenance = {}
Contributor

Does this variable get used?


elif is_running_py_39_or_above:
    n_components = 47
else:
    n_components = 49
Contributor

Kind of confused as to why this isn't throwing a codecov warning for lack of coverage. Did Codecov not check this change because n_components = 49 seems to have come from the else statement before? That's wild

    ],
)
def test_component_fit_transform(
    component, make_data, make_expected, make_expected_ltypes, df_with_url_and_email
Contributor

Love the way you parametrized this with helper functions

)
assert (
    not pd.Series(
        explanations["explanations"][1]["explanations"][0][
Contributor

After reading this test, the word explanations is starting to sound weird in my head

Contributor

@eccabay eccabay left a comment

Nice, this is all super clean! I think the parent class was a great idea

# featuretools expects str-type column names
X_to_transform.rename(columns=str, inplace=True)
ft_variable_types = self._get_feature_types_for_featuretools(X)
# all_email_variable_types =
Contributor

Leftover comment?

Contributor Author

Yes! Will delete, thank you.

Comment on lines 8 to 15
def test_url_featurizer_init():
    url = URLFeaturizer()
    assert url.parameters == {}


def test_email_featurizer_init():
    email = EmailFeaturizer()
    assert email.parameters == {}
Contributor

nit: it might be cleaner/more scalable to parameterize these
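
One way the suggestion could look, using `pytest.mark.parametrize`. This is a sketch only: `URLFeaturizerStub` and `EmailFeaturizerStub` are hypothetical stand-ins so the example runs without evalml installed, and the test name `test_featurizer_init` is likewise an assumption. The real test would pass the actual `URLFeaturizer` and `EmailFeaturizer` classes instead.

```python
import pytest


# Stand-ins for evalml's URLFeaturizer and EmailFeaturizer so the sketch runs
# without evalml installed; the real test would import the actual components.
class URLFeaturizerStub:
    parameters = {}


class EmailFeaturizerStub:
    parameters = {}


@pytest.mark.parametrize("component_class", [URLFeaturizerStub, EmailFeaturizerStub])
def test_featurizer_init(component_class):
    # One test body covers every featurizer in the parameter list.
    component = component_class()
    assert component.parameters == {}
```

Adding a new featurizer then means adding one entry to the parameter list rather than another near-identical test function.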

Collaborator

@chukarsten chukarsten left a comment

This is extremely thorough. I'm very glad for your complete product knowledge as a lot of these extra tests I would not have known to add. Looks good.

@@ -2,6 +2,8 @@ Release Notes
-------------
**Future Release**
    * Enhancements
        * Added components to extract features from ``URL`` and ``EmailAddress`` Logical Types :pr:`2550`
        * Added support for `NaN` values in ``TextFeaturizer`` :pr:`2532`
Collaborator

good catch


Comment on lines +93 to +102
provenance = {}
for feature in features:
    input_col = feature.base_features[0].get_name()
    # Return a copy because `get_feature_names` returns a reference to the names
    output_features = [name for name in feature.get_feature_names()]
    if input_col not in provenance:
        provenance[input_col] = output_features
    else:
        provenance[input_col] += output_features
return provenance
Collaborator

Shame this is different from the component graph provenance, but nothing to do there.
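
To make the shape of this provenance mapping concrete, here is a small self-contained sketch. `FakeFeature` is a hypothetical stand-in exposing only the attributes the snippet above relies on (`base_features`, `get_name`, `get_feature_names`), the function name `get_primitives_provenance` is an assumption, and the feature names are illustrative rather than actual featuretools output. The copy matters because extending a list in place would otherwise also modify the list owned by the feature object.

```python
class FakeFeature:
    """Hypothetical stand-in for a featuretools feature object."""

    def __init__(self, name, base_features=(), output_names=None):
        self._name = name
        self.base_features = list(base_features)
        self._output_names = output_names or [name]

    def get_name(self):
        return self._name

    def get_feature_names(self):
        # Returns a reference to the internal list, like the case the
        # original code comment warns about.
        return self._output_names


def get_primitives_provenance(features):
    """Map each input column to the names of the features derived from it."""
    provenance = {}
    for feature in features:
        input_col = feature.base_features[0].get_name()
        # Copy the names so the mapping never aliases the feature's own list.
        output_features = list(feature.get_feature_names())
        provenance.setdefault(input_col, []).extend(output_features)
    return provenance


url = FakeFeature("url")
email = FakeFeature("email")
features = [
    FakeFeature("URL_TO_DOMAIN(url)", [url]),
    FakeFeature("URL_TO_PROTOCOL(url)", [url]),
    FakeFeature("EMAIL_ADDRESS_TO_DOMAIN(email)", [email]),
]
provenance = get_primitives_provenance(features)
```

Each key is an original column and each value lists the engineered columns that replaced it.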

@@ -69,12 +71,22 @@ def _get_preprocessing_components(
    if len(all_null_cols) > 0:
        pp_components.append(DropNullColumns)

    email_columns = list(X.ww.select("EmailAddress", return_schema=True).columns)
    if len(email_columns) > 0:
        pp_components.append(EmailFeaturizer)
Collaborator

classic
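
The selection logic in this hunk can be illustrated without Woodwork. In the sketch below, `select_featurizers` is a hypothetical helper, a plain dict of column name to logical type name stands in for the Woodwork schema queried by `X.ww.select(...)`, and component names are returned as strings. The `URL` branch is an assumption based on the PR title, since the visible hunk only shows the `EmailAddress` case.

```python
def select_featurizers(logical_types):
    """Return featurizer names for the logical types present (hypothetical helper).

    `logical_types` maps column name -> logical type name, standing in for the
    Woodwork schema that the real code queries with `X.ww.select(...)`.
    """
    # Assumed pairing of logical type to component name, following the PR title.
    featurizer_for_type = {
        "EmailAddress": "EmailFeaturizer",
        "URL": "URLFeaturizer",
    }
    components = []
    for type_name, featurizer in featurizer_for_type.items():
        columns = [col for col, ltype in logical_types.items() if ltype == type_name]
        # Only add the featurizer when at least one matching column exists.
        if len(columns) > 0:
            components.append(featurizer)
    return components


selected = select_featurizers({"age": "Integer", "contact": "EmailAddress"})
```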

Comment on lines +1312 to +1315
"abalone_0@gmail.com",
"AbaloneRings@yahoo.com",
"abalone_2@abalone.com",
"$titanic_data%&@hotmail.com",
Collaborator

these are the most depressing email addresses I've ever seen. I'm tempted to try to register the hotmail one

Development

Successfully merging this pull request may close these issues.

Define Behavior for Woodwork's EmailAddress and URL logical types
5 participants