
Add a tutorial for text data #1357

Merged: 9 commits merged into main from text_tutorial on Oct 29, 2020
Conversation

eccabay (Contributor) commented Oct 28, 2020

Adds a tutorial highlighting the Text Featurizer and the significant performance improvements it enables!
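For orientation, the workflow the tutorial walks through looks roughly like the following. This is a minimal sketch, assuming a pandas DataFrame X with a single text column 'Message' (the SMS spam/ham data used in the demo) and binary labels y; the exact AutoMLSearch call signatures have changed across EvalML releases, so treat the shapes below as illustrative rather than definitive.

    from evalml import AutoMLSearch

    # X: DataFrame with one text column 'Message'; y: binary labels ("spam"/"ham")
    automl = AutoMLSearch(problem_type="binary")
    automl.search(X, y)  # newer EvalML releases instead take X_train/y_train in the constructor

    # Inspect the leaderboard and describe the top pipeline, as the notebook does
    automl.rankings
    automl.describe_pipeline(automl.rankings.iloc[0]["id"])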

codecov bot commented Oct 28, 2020

Codecov Report

Merging #1357 into main will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main    #1357   +/-   ##
=======================================
  Coverage   99.95%   99.95%           
=======================================
  Files         213      213           
  Lines       13857    13857           
=======================================
  Hits        13850    13850           
  Misses          7        7           

Last update 4aa95d3...d4268bd.

eccabay requested a review from dsherry on October 28, 2020 22:04
dsherry (Contributor) commented Oct 28, 2020

Thanks! I'll check back when the docs are done building; they should be visible here. The RTD build is queued at the moment, there's a lot building.

freddyaboulton (Contributor) left a comment

@eccabay This is great!

"metadata": {},
"outputs": [],
"source": [
"automl.describe_pipeline(automl.rankings.iloc[0][\"id\"])"
freddyaboulton (Contributor) commented Oct 28, 2020

I wouldn't expect the best pipeline to have a one hot encoder since in this case there is nothing to encode because the TextFeaturizer converts the Message column (the only input feature) into 5 numeric columns.

I don't think the OneHotEncoder is negatively impacting performance (maybe impacting the fit time a little bit?) but I think we should remove it from the pipeline in the future.
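To make that transformation concrete, here is a rough sketch of what is being described; the TextFeaturizer's constructor arguments and the exact set of generated features have varied across EvalML versions, so the text_columns argument and the feature names mentioned in the comments are assumptions for illustration only.

    from evalml.pipelines.components import TextFeaturizer

    # Older releases were told explicitly which columns are text;
    # newer ones infer it from Woodwork logical types.
    featurizer = TextFeaturizer(text_columns=["Message"])

    # The single 'Message' column comes back as a handful of numeric features
    # (e.g. diversity score, LSA components, mean characters per word, polarity),
    # leaving nothing for a One Hot Encoder to act on.
    features = featurizer.fit_transform(X, y)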

eccabay (Contributor, Author) replied:

@freddyaboulton that's a really interesting point! The OneHotEncoder does nothing on this dataset, but it's included in the dynamically generated pipelines nonetheless because _get_preprocessing_components considers the text column a categorical column. Once Woodwork is integrated into this part of AutoML, this should no longer be the case.
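For reference, the Woodwork-based typing that avoids this looks roughly like the sketch below. It assumes the DataFrame accessor API (df.ww.init); very early Woodwork releases used a ww.DataTable wrapper instead, so adjust for the version in use.

    import pandas as pd
    import woodwork as ww

    X = pd.DataFrame({"Message": ["Free entry in a weekly comp!", "I'll call you later"]})

    # Declaring the column as NaturalLanguage (instead of letting it default to
    # Categorical) is what lets AutoML route it to the Text Featurizer rather
    # than the One Hot Encoder.
    X.ww.init(logical_types={"Message": "NaturalLanguage"})
    print(X.ww.schema)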

Contributor replied:

Ah very good point! I'm working on a PR now which will patch this.

docs/source/demos/text_input.ipynb (resolved review thread)
"cell_type": "markdown",
"metadata": {},
"source": [
"Without the `Text Featurization Component`, the `'Message'` column was treated as a categorical column, and therefore the conversion of this text to numerical features happened in the `One Hot Encoder`. The best pipeline encoded the top 10 most frequent \"categories\" of these texts, meaning 10 text messages were one-hot encoded and all the others were dropped. Clearly, this removed almost all of the information from the dataset, as we can see the `best_pipeline_no_text` did not beat the random guess of predicting \"ham\" in every case."
freddyaboulton (Contributor) commented:

Might be interesting to compare against a bag-of-words encoding in the future to get a sense of the improvement in performance of using the TextFeaturizer over a naive approach.
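As an aside, such a naive baseline could be wired up outside EvalML with scikit-learn along these lines; the X_train/X_holdout names and the 'Message' column are assumptions matching the tutorial's setup, not part of this PR.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Bag-of-words: each message becomes a sparse vector of token counts,
    # with none of the TextFeaturizer's engineered features.
    bow_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    bow_model.fit(X_train["Message"], y_train)
    print(bow_model.score(X_holdout["Message"], y_holdout))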

Contributor commented:

I love this, great description!

Agree @freddyaboulton that would be a cool extension

docs/source/demos/text_input.ipynb (outdated, resolved)
dsherry (Contributor) commented Oct 29, 2020

I see the docs generated! Reviewing now

dsherry (Contributor) left a comment

@eccabay this is really great!! I love it. I especially like that you're showing why it's important to use woodwork to select which columns are text.

I left a handful of little suggestions, but otherwise this is definitely ready to 🚢 !

docs/source/tutorials.ipynb (outdated, resolved)
docs/source/demos/text_input.ipynb (outdated, resolved)
docs/source/demos/text_input.ipynb (outdated, resolved)
docs/source/demos/text_input.ipynb (outdated, resolved)
docs/source/demos/text_input.ipynb (outdated, resolved)
docs/source/demos/text_input.ipynb (outdated, resolved)
docs/source/demos/text_input.ipynb (resolved)
"cell_type": "markdown",
"metadata": {},
"source": [
"Without the `Text Featurization Component`, the `'Message'` column was treated as a categorical column, and therefore the conversion of this text to numerical features happened in the `One Hot Encoder`. The best pipeline encoded the top 10 most frequent \"categories\" of these texts, meaning 10 text messages were one-hot encoded and all the others were dropped. Clearly, this removed almost all of the information from the dataset, as we can see the `best_pipeline_no_text` did not beat the random guess of predicting \"ham\" in every case."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this, great description!

Agree @freddyaboulton that would be a cool extension

docs/source/demos/text_input.ipynb (outdated, resolved)
docs/source/demos/text_input.ipynb (outdated, resolved)
dsherry (Contributor) commented Oct 29, 2020

I see the Windows check-in tests failed. There's been a new flake lately with the conda installation on Windows. If you rerun, it should be all set.

eccabay merged commit 4166fbb into main on Oct 29, 2020
dsherry mentioned this pull request on Oct 29, 2020
eccabay deleted the text_tutorial branch on November 2, 2020