Add a tutorial for text data #1357
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##            main    #1357   +/-   ##
=======================================
  Coverage   99.95%   99.95%
=======================================
  Files         213      213
  Lines       13857    13857
=======================================
  Hits        13850    13850
  Misses          7        7
```

Continue to review full report at Codecov.
Thanks! I'll check back when the docs are done building; they should be visible here. The RTD build is queued at the moment, since there's a lot building.
@eccabay This is great!
"metadata": {},
"outputs": [],
"source": [
"automl.describe_pipeline(automl.rankings.iloc[0][\"id\"])"
I wouldn't expect the best pipeline to have a one hot encoder since in this case there is nothing to encode because the TextFeaturizer converts the Message column (the only input feature) into 5 numeric columns.
I don't think the OneHotEncoder is negatively impacting performance (maybe impacting the fit time a little bit?) but I think we should remove it from the pipeline in the future.
@freddyaboulton that's a really interesting point! The OneHotEncoder does nothing on this dataset, but it's included in the dynamically generated pipelines nonetheless, because `_get_preprocessing_components` considers the text column a categorical column. Once Woodwork is integrated into this part of AutoML, this should no longer be the case.
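To make the issue above concrete, here is a minimal sketch (not EvalML's actual `_get_preprocessing_components` implementation, and the component names are illustrative) of how dtype-based inference ends up attaching a One Hot Encoder to a text column: in pandas, free text and categoricals both show up as dtype `object`, so they are indistinguishable at this level.

```python
# Hypothetical sketch: naive dtype-based component selection.
# Not EvalML source code; component names are for illustration only.
def naive_preprocessing_components(column_dtypes):
    """Pick preprocessing components from raw pandas dtype strings."""
    components = ["Imputer"]
    if any(dtype == "object" for dtype in column_dtypes.values()):
        # A free-text column looks identical to a categorical one here,
        # so it gets a One Hot Encoder too.
        components.append("One Hot Encoder")
    return components

# A dataset whose only feature is the free-text "Message" column:
print(naive_preprocessing_components({"Message": "object"}))
```

With typed columns (e.g. via Woodwork logical types such as `NaturalLanguage`), the selection logic could instead route text columns to a text featurizer and skip the encoder entirely.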
Ah very good point! I'm working on a PR now which will patch this.
"cell_type": "markdown",
"metadata": {},
"source": [
"Without the `Text Featurization Component`, the `'Message'` column was treated as a categorical column, and therefore the conversion of this text to numerical features happened in the `One Hot Encoder`. The best pipeline encoded the top 10 most frequent \"categories\" of these texts, meaning 10 text messages were one-hot encoded and all the others were dropped. Clearly, this removed almost all of the information from the dataset, as we can see the `best_pipeline_no_text` did not beat the random guess of predicting \"ham\" in every case."
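The "top 10 categories" effect the tutorial cell describes can be sketched with a toy example (the messages below are invented, not the SMS dataset): when whole messages are treated as categories, everything outside the top 10 encodes to an all-zero row, i.e. its content is thrown away.

```python
from collections import Counter

# Toy data: two repeated messages plus many unique ones, mimicking
# free text where almost every "category" appears exactly once.
messages = ["ok", "ok", "see you soon", "WIN a FREE prize now!!"] + [
    f"unique message {i}" for i in range(20)
]

# Keep only the 10 most frequent whole messages as one-hot "categories".
top_10 = [msg for msg, _ in Counter(messages).most_common(10)]

def one_hot_top_10(message):
    # A message outside the top 10 encodes to an all-zero row:
    # its content is effectively dropped.
    return [int(message == category) for category in top_10]

encoded = [one_hot_top_10(m) for m in messages]
dropped = sum(1 for row in encoded if not any(row))
print(f"{dropped} of {len(messages)} messages carry no information")
```

On real free text, where nearly every message is unique, this is why the no-text pipeline could not beat always predicting "ham".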
Might be interesting to compare against a bag-of-words encoding in the future, to get a sense of the performance improvement the TextFeaturizer gives over a naive approach.
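For reference, such a naive baseline could look like the following, a minimal bag-of-words sketch using only the standard library (toy data; not the tutorial's code or scikit-learn's `CountVectorizer`): each message becomes a vector of word counts over a shared vocabulary, so unlike one-hot encoding whole messages, previously unseen messages still produce informative features.

```python
from collections import Counter

def fit_vocabulary(messages):
    """Build a word -> column-index mapping from training messages."""
    vocab = sorted({word for msg in messages for word in msg.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def bag_of_words(message, vocab):
    """Encode one message as word counts over the fitted vocabulary."""
    counts = Counter(message.lower().split())
    vector = [0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # out-of-vocabulary words are ignored
            vector[vocab[word]] = count
    return vector

train = ["win a free prize", "see you at lunch"]
vocab = fit_vocabulary(train)
print(bag_of_words("free free lunch", vocab))
```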
I love this, great description!
Agree @freddyaboulton that would be a cool extension
I see the docs generated! Reviewing now.
@eccabay this is really great!! I love it. I especially like that you're showing why it's important to use woodwork to select which columns are text.
I left a handful of little suggestions, but otherwise this is definitely ready to 🚢 !
I see the Windows checkin tests failed. There's been a new flake lately with the conda installation on Windows. If you rerun, it should be all set.
Adds a tutorial highlighting the Text Featurizer and the significant performance improvements it allows!