alteryx · eccabay · Jul 29, 2021 · Jul 27, 2021 · Jul 27, 2021 · Jul 28, 2021
diff --git a/docs/source/demos/text_input.ipynb b/docs/source/demos/text_input.ipynb
@@ -237,6 +237,94 @@
     "As you can see, this model performs relatively well on this dataset, even on unseen data."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## What does the Text Featurization Component do?\n",
+    "\n",
+    "Most machine learning models cannot handle non-numeric data, so any text must be broken down into numeric features that provide useful information about that text. The Text Featurization component first normalizes your text by removing punctuation and other non-alphanumeric characters and converting capital letters to lowercase. It then passes the normalized text to [featuretools](https://www.featuretools.com/)' `dfs` search using the [nlp_primitives](https://docs.featuretools.com/en/v0.16.0/api_reference.html#natural-language-processing-primitives), which produces several informative features that replace the original column in your dataset: Diversity Score, Mean Characters per Word, Polarity Score, and LSA (Latent Semantic Analysis).\n",
+    "\n",
+    "**Diversity Score** is the ratio of unique words to total words.\n",
+    "\n",
+    "**Mean Characters per Word** is the average number of characters per word.\n",
+    "\n",
+    "**Polarity Score** is a prediction of how \"polarized\" the text is, on a scale from -1 (extremely negative) to 1 (extremely positive).\n",
+    "\n",
+    "**Latent Semantic Analysis** is an abstract representation of how important each word is with respect to the entire text, reduced to two values per text. While each of the other text features is a single column, this feature adds two columns to your data, `LSA(column_name)[0]` and `LSA(column_name)[1]`.\n",
+    "\n",
+    "Let's see what this looks like with our spam/ham example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "best_pipeline.input_feature_names"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, the Text Featurization component takes in a single \"Message\" column, but the next component in the pipeline, the Imputer, receives five columns of input. These five columns are the result of featurizing the text-type \"Message\" column. Most importantly, these featurized columns are what is ultimately passed to the estimator.\n",
+    "\n",
+    "If the dataset had any non-text columns, this process would leave them unchanged. If the dataset had more than one text column, each would be independently broken into these five feature columns."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The features, more directly\n",
+    "\n",
+    "Rather than just checking the new column names, let's examine the output of this component directly. We can see this by running the component on its own."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_featurizer = evalml.pipelines.components.TextFeaturizer()\n",
+    "X_featurized = text_featurizer.fit_transform(X_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can compare the input data to the output from the text featurizer:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_train.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_featurized.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "These numeric values now represent important information about the original text that the estimator at the end of the pipeline can successfully use to make predictions."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -361,4 +449,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
+}
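The Diversity Score and Mean Characters per Word definitions in the tutorial cell above can be illustrated with a small pure-Python sketch. This is not the `nlp_primitives` implementation. The helper names (`normalize`, `diversity_score`, `mean_characters_per_word`) are hypothetical, and replacing punctuation with spaces (rather than deleting it outright) is an assumption made so that adjacent words do not merge.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and replace non-alphanumeric characters with spaces.

    Assumption: punctuation becomes a space so neighboring words stay separate.
    """
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def diversity_score(text: str) -> float:
    """Ratio of unique words to total words (0.0 for empty text)."""
    words = normalize(text).split()
    return len(set(words)) / len(words) if words else 0.0


def mean_characters_per_word(text: str) -> float:
    """Average number of characters per word (0.0 for empty text)."""
    words = normalize(text).split()
    return sum(len(word) for word in words) / len(words) if words else 0.0


# Hypothetical spam-like message, used only for illustration.
message = "Free entry! Free prizes, free cash."
print(normalize(message).split())
print(diversity_score(message))
print(mean_characters_per_word(message))
```

A repetitive message such as this one scores lower on diversity than a message in which every word is distinct, which is the kind of signal the estimator can pick up from the featurized columns.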
diff --git a/docs/source/release_notes.rst b/docs/source/release_notes.rst
@@ -13,6 +13,7 @@ Release Notes
         * Renamed ``ComponentGraph``'s ``get_parents`` to ``get_inputs`` :pr:`2540`
         * Removed ``ComponentGraph.linearized_component_graph`` and ``ComponentGraph.from_list`` :pr:`2556`
     * Documentation Changes
+        * Improved detail of ``TextFeaturizer`` docstring and tutorial :pr:`2568`
     * Testing Changes
         * Added test that makes sure ``split_data`` does not shuffle for time series problems :pr:`2552`
 

diff --git a/evalml/pipelines/components/transformers/preprocessing/text_featurizer.py b/evalml/pipelines/components/transformers/preprocessing/text_featurizer.py
@@ -14,6 +14,12 @@
 class TextFeaturizer(TextTransformer):
     """Transformer that can automatically featurize text columns using featuretools' nlp_primitives.
 
+    Since most models cannot handle non-numeric data, any text must be broken down into features that
+    provide useful information about that text. This component splits each text column into
+    several informative features: Diversity Score, Mean Characters per Word, Polarity Score, and
+    LSA (Latent Semantic Analysis). Calling transform on this component will replace any text columns
+    in the given dataset with these numeric columns.
+
     Arguments:
         random_seed (int): Seed for the random number generator. Defaults to 0.
     """