Add a tutorial for text data #1357
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##            main    #1357   +/-   ##
=======================================
  Coverage   99.95%   99.95%
=======================================
  Files         213      213
  Lines       13857    13857
=======================================
  Hits        13850    13850
  Misses          7        7
```

Continue to review full report at Codecov.
Thanks! I'll check back when the docs are done building; they should be visible here. The RTD build is queued at the moment, since there's a lot building.
@eccabay This is great!
"metadata": {},
"outputs": [],
"source": [
"automl.describe_pipeline(automl.rankings.iloc[0][\"id\"])"
I wouldn't expect the best pipeline to have a one hot encoder since in this case there is nothing to encode because the TextFeaturizer converts the Message column (the only input feature) into 5 numeric columns.
I don't think the OneHotEncoder is negatively impacting performance (maybe impacting the fit time a little bit?) but I think we should remove it from the pipeline in the future.
@freddyaboulton that's a really interesting point! The OneHotEncoder does nothing on this dataset, but it's included in the dynamically generated pipelines nonetheless, because `_get_preprocessing_components` considers the text column a categorical column. Once Woodwork is integrated into this part of AutoML, this should no longer be the case.
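To make the issue above concrete, here is a minimal sketch (not EvalML's actual `_get_preprocessing_components` implementation, and the component names are illustrative) of how dtype-based inference ends up attaching a One Hot Encoder to a text column: in pandas, free text and categoricals both show up as dtype `object`, so they are indistinguishable at this level.

```python
# Hypothetical sketch: naive dtype-based component selection.
# Not EvalML source code; component names are for illustration only.
def naive_preprocessing_components(column_dtypes):
    """Pick preprocessing components from raw pandas dtype strings."""
    components = ["Imputer"]
    if any(dtype == "object" for dtype in column_dtypes.values()):
        # A free-text column looks identical to a categorical one here,
        # so it gets a One Hot Encoder too.
        components.append("One Hot Encoder")
    return components

# A dataset whose only feature is the free-text "Message" column:
print(naive_preprocessing_components({"Message": "object"}))
```

With typed columns (e.g. via Woodwork logical types such as `NaturalLanguage`), the selection logic could instead route text columns to a text featurizer and skip the encoder entirely.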
Ah very good point! I'm working on a PR now which will patch this.
"cell_type": "markdown",
"metadata": {},
"source": [
"Without the `Text Featurization Component`, the `'Message'` column was treated as a categorical column, and therefore the conversion of this text to numerical features happened in the `One Hot Encoder`. The best pipeline encoded the top 10 most frequent \"categories\" of these texts, meaning 10 text messages were one-hot encoded and all the others were dropped. Clearly, this removed almost all of the information from the dataset, as we can see the `best_pipeline_no_text` did not beat the random guess of predicting \"ham\" in every case."
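The "top 10 categories" effect the tutorial cell describes can be sketched with a toy example (the messages below are invented, not the SMS dataset): when whole messages are treated as categories, everything outside the top 10 encodes to an all-zero row, i.e. its content is thrown away.

```python
from collections import Counter

# Toy data: two repeated messages plus many unique ones, mimicking
# free text where almost every "category" appears exactly once.
messages = ["ok", "ok", "see you soon", "WIN a FREE prize now!!"] + [
    f"unique message {i}" for i in range(20)
]

# Keep only the 10 most frequent whole messages as one-hot "categories".
top_10 = [msg for msg, _ in Counter(messages).most_common(10)]

def one_hot_top_10(message):
    # A message outside the top 10 encodes to an all-zero row:
    # its content is effectively dropped.
    return [int(message == category) for category in top_10]

encoded = [one_hot_top_10(m) for m in messages]
dropped = sum(1 for row in encoded if not any(row))
print(f"{dropped} of {len(messages)} messages carry no information")
```

On real free text, where nearly every message is unique, this is why the no-text pipeline could not beat always predicting "ham".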
Might be interesting to compare against a bag-of-words encoding in the future, to get a sense of the performance improvement the TextFeaturizer gives over a naive approach.
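For reference, such a naive baseline could look like the following, a minimal bag-of-words sketch using only the standard library (toy data; not the tutorial's code or scikit-learn's `CountVectorizer`): each message becomes a vector of word counts over a shared vocabulary, so unlike one-hot encoding whole messages, previously unseen messages still produce informative features.

```python
from collections import Counter

def fit_vocabulary(messages):
    """Build a word -> column-index mapping from training messages."""
    vocab = sorted({word for msg in messages for word in msg.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def bag_of_words(message, vocab):
    """Encode one message as word counts over the fitted vocabulary."""
    counts = Counter(message.lower().split())
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # out-of-vocabulary words are ignored
            vector[vocab[word]] = count
    return vector

train = ["win a free prize", "see you at lunch"]
vocab = fit_vocabulary(train)
print(bag_of_words("free free lunch", vocab))
```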
I love this, great description!
Agree @freddyaboulton that would be a cool extension
I see the docs generated! Reviewing now.
@eccabay this is really great!! I love it. I especially like that you're showing why it's important to use woodwork to select which columns are text.
I left a handful of little suggestions, but otherwise this is definitely ready to 🚢 !
I see the Windows checkin tests failed. There's been a new flake lately with the conda installation on Windows. If you rerun, it should be all set.
Adds a tutorial highlighting the Text Featurizer and the significant performance improvements it allows!