-
Notifications
You must be signed in to change notification settings - Fork 92
Add Components to Create Features from URL and Email Logical Types #2550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e14e712 to
143eafa
Compare
Codecov Report
@@ Coverage Diff @@
## main #2550 +/- ##
=======================================
+ Coverage 99.9% 99.9% +0.1%
=======================================
Files 285 287 +2
Lines 26115 26338 +223
=======================================
+ Hits 26081 26304 +223
Misses 34 34
Continue to review full report at Codecov.
|
469ee09 to
f24e7ba
Compare
| from evalml.utils import infer_feature_types | ||
|
|
||
|
|
||
| class _ExtractFeaturesWithTransformPrimitives(Transformer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a parent class to minimize the code duplication that comes from the following pattern:
- specify some ft primitives at class definition
- Run dfs to extract features with those primitives
- Delete original columns
Maybe this could be used for the TextFeaturizer later as well. But if we find this abstraction useful, we may need to rethink how the casting to categorical at the end of transform happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ this!
e92d642 to
7837cd9
Compare
ParthivNaresh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work, looks great to me. Thanks for setting up a parent class for feature extraction, I have no doubt it'll come in handy. Love the tests too!
| """{}""" | ||
|
|
||
| def __init__(self, random_seed=0, **kwargs): | ||
| self._primitives_provenance = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this variable get used?
| from evalml.utils import infer_feature_types | ||
|
|
||
|
|
||
| class _ExtractFeaturesWithTransformPrimitives(Transformer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a great idea
| elif is_running_py_39_or_above: | ||
| n_components = 47 | ||
| else: | ||
| n_components = 49 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of confused as to why this isn't throwing a codecov warning for lack of coverage. Did Codecov not check this change because n_components = 49 seems to have come from the else statement before? That's wild
| ], | ||
| ) | ||
| def test_component_fit_transform( | ||
| component, make_data, make_expected, make_expected_ltypes, df_with_url_and_email |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the way you parametrized this with helper functions
| ) | ||
| assert ( | ||
| not pd.Series( | ||
| explanations["explanations"][1]["explanations"][0][ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading this test, the word explanations is starting to sound weird in my head
eccabay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is all super clean! I think the parent class was a great idea
| # featuretools expects str-type column names | ||
| X_to_transform.rename(columns=str, inplace=True) | ||
| ft_variable_types = self._get_feature_types_for_featuretools(X) | ||
| # all_email_variable_types = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Will delete thank you
| def test_url_featurizer_init(): | ||
| url = URLFeaturizer() | ||
| assert url.parameters == {} | ||
|
|
||
|
|
||
| def test_email_featurizer_init(): | ||
| email = EmailFeaturizer() | ||
| assert email.parameters == {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it might be cleaner/more scalable to parameterize these
a3fe6e9 to
97dbb04
Compare
chukarsten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is extremely thorough. I'm very glad for your complete product knowledge as a lot of these extra tests I would not have known to add. Looks good.
| **Future Release** | ||
| * Enhancements | ||
| * Added components to extract features from ``URL`` and ``EmailAddress`` Logical Types :pr:`2550` | ||
| * Added support for `NaN` values in ``TextFeaturizer`` :pr:`2532` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
| from evalml.utils import infer_feature_types | ||
|
|
||
|
|
||
| class _ExtractFeaturesWithTransformPrimitives(Transformer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love this!
| provenance = {} | ||
| for feature in features: | ||
| input_col = feature.base_features[0].get_name() | ||
| # Return a copy because `get_feature_names` returns a reference to the names | ||
| output_features = [name for name in feature.get_feature_names()] | ||
| if input_col not in provenance: | ||
| provenance[input_col] = output_features | ||
| else: | ||
| provenance[input_col] += output_features | ||
| return provenance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shame this is different from the component graph provenance, but nothing to do there.
|
|
||
| email_columns = list(X.ww.select("EmailAddress", return_schema=True).columns) | ||
| if len(email_columns) > 0: | ||
| pp_components.append(EmailFeaturizer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
classic
| "abalone_0@gmail.com", | ||
| "AbaloneRings@yahoo.com", | ||
| "abalone_2@abalone.com", | ||
| "$titanic_data%&@hotmail.com", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are the most depressing email addresses I've ever seen. I'm tempted to try to register the hotmail one
97dbb04 to
62a4313
Compare
Pull Request Description
Fixes #2444
build_conda_pkg will fail until after we merge cause we need to bump the lower bound on feature tools to 0.26.1