# Using Text Data with EvalML

This demo shows how to use EvalML to build models that use text data.

In [None]:
import evalml
from evalml import AutoMLSearch

## Dataset

We will use a dataset of SMS text messages, some of which are categorized as spam and the rest as not spam ("ham"). This dataset is originally from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset), but has been modified to produce a slightly more even distribution of spam to ham.

In [None]:
from urllib.request import urlopen
import pandas as pd

input_data = urlopen(
    "https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv"
)
data = pd.read_csv(input_data)[:750]

X = data.drop(["Category"], axis=1)
y = data["Category"]

display(X.head())

The ham vs spam distribution of the data is 3:1, so any machine learning model must get above 75% [accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification) in order to perform better than a trivial baseline model which simply classifies everything as ham. 

In [None]:
y.value_counts(normalize=True)
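
The 75% figure is the accuracy of always predicting the majority class. As a minimal sketch of that arithmetic, the snippet below uses a small made-up label list rather than the real `y`, and `majority_baseline_accuracy` is a hypothetical helper name, not part of EvalML.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Made-up labels with the same 3:1 ham-to-spam ratio described above.
labels = ["ham"] * 3 + ["spam"]
baseline = majority_baseline_accuracy(labels)
print(baseline)
```

Any model worth keeping should score above this number on the same data.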

To use Woodwork's `NaturalLanguage` logical type, we need to specify it explicitly for the `Message` column when initializing Woodwork on `X`. Otherwise, the column will be treated as an `Unknown` type and dropped during the search.

In [None]:
X.ww.init(logical_types={"Message": "NaturalLanguage"})

## Search for best pipeline

In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.

In [None]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)
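
One common way to build a holdout set for classification is a stratified split, which keeps the class ratio roughly the same in both parts. The sketch below is a hand-rolled illustration of that idea on made-up labels. `stratified_holdout_split` is a hypothetical helper, and it is not meant to reproduce what `split_data` does internally.

```python
import random
from collections import defaultdict


def stratified_holdout_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, holdout_indices), sampling each class separately."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    train_idx, holdout_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_holdout = round(len(indices) * test_size)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Made-up labels with a 3:1 ham-to-spam ratio.
labels = ["ham"] * 75 + ["spam"] * 25
train_idx, holdout_idx = stratified_holdout_split(labels)
holdout_spam = sum(1 for i in holdout_idx if labels[i] == "spam")
```

With a 3:1 class ratio, a stratified holdout keeps the majority-class baseline the same in both the training and holdout sets, which makes holdout scores easier to interpret.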

EvalML relies on [Woodwork](https://woodwork.alteryx.com/en/stable/) typing to identify which columns are text columns, so you can run search normally, as you would if there were no text data. We can print the Woodwork schema of the training data and check that the `Message` column carries the natural language logical type we set above.

In [None]:
X_train.ww

Because the spam/ham labels are binary, we will use `AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')`. When we call `.search()`, the search for the best pipeline will begin. 

In [None]:
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    max_batches=1,
    optimize_thresholds=True,
    verbose=True,
)

automl.search(interactive_plot=False)

### View rankings and select pipeline

Once the fitting process is done, we can see all of the pipelines that were searched.

In [None]:
automl.rankings

To select the best pipeline, we can use the `automl.best_pipeline` attribute.

In [None]:
best_pipeline = automl.best_pipeline

### Describe pipeline

You can get more details about any pipeline, including how it performed on other objective functions.

In [None]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])

In [None]:
best_pipeline.graph()

Notice that the first step in the pipeline above is a `Natural Language Featurizer`. AutoMLSearch uses the Woodwork accessor to recognize that `'Message'` is a text column, and converts this text into numerical values that can be handled by the estimator.

## Evaluate on holdout

Now, we can score the pipeline on the holdout data using the ranking objectives for binary classification problems.

In [None]:
scores = best_pipeline.score(
    X_holdout, y_holdout, objectives=evalml.objectives.get_ranking_objectives("binary")
)
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')

Compare this holdout accuracy against the 75% baseline from earlier: a score well above it indicates that the model performs well on this dataset, even on unseen data.

## What does the Natural Language Featurizer do?

Most machine learning models cannot handle non-numeric data directly. Any text must be broken down into numeric features that provide useful information about that text. The Natural Language Featurizer first normalizes your text by removing any punctuation and other non-alphanumeric characters and converting any capital letters to lowercase. From there, it passes the text into [featuretools](https://www.featuretools.com/)' [nlp_primitives](https://docs.featuretools.com/en/v0.16.0/api_reference.html#natural-language-processing-primitives) `dfs` search, resulting in several informative features that replace the original column in your dataset: Diversity Score, Mean Characters per Word, Polarity Score, LSA (Latent Semantic Analysis), Number of Characters, and Number of Words.

**Diversity Score** is the ratio of unique words to total words.

**Mean Characters per Word** is the average number of characters per word.

**Polarity Score** is a prediction of how "polarized" the text is, on a scale from -1 (extremely negative) to 1 (extremely positive).

**Latent Semantic Analysis** is an abstract representation of how important each word is with respect to the entire text, reduced down into two values per text. While the other text features are each a single column, this feature adds two columns to your data, `LSA(column_name)[0]` and `LSA(column_name)[1]`.

**Number of Characters** is the number of characters in the text.

**Number of Words** is the number of words in the text.
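
To make the simpler features concrete, here is a minimal pure-Python sketch of the normalization step and four of the features defined above. It is an illustration under stated assumptions, not the featurizer's actual implementation: the function names are hypothetical, words are split on whitespace, and characters are counted after normalization, which may differ from what the real primitives do. Polarity Score and LSA are omitted because they require a sentiment model and a matrix decomposition.

```python
import re


def normalize(text):
    """Drop non-alphanumeric characters (keeping whitespace) and lowercase."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text).lower()


def simple_text_features(text):
    """Compute a few of the described features for one piece of text."""
    normalized = normalize(text)
    words = normalized.split()  # assumption: whitespace tokenization
    n_words = len(words)
    return {
        # Ratio of unique words to total words.
        "diversity_score": len(set(words)) / n_words if n_words else 0.0,
        # Average word length in characters.
        "mean_characters_per_word": (
            sum(len(w) for w in words) / n_words if n_words else 0.0
        ),
        # Assumption: counted on the normalized text, spaces included.
        "number_of_characters": len(normalized),
        "number_of_words": n_words,
    }


features = simple_text_features("Free entry! Free prizes, text WIN now.")
print(features)
```

In this made-up message, the repeated word "free" is what pulls the diversity score below 1.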

Let's see what this looks like with our spam/ham example.

In [None]:
best_pipeline.input_feature_names

Here, the Natural Language Featurizer takes in a single "Message" column, but the next component in the pipeline, the Imputer, receives several columns of input: one for each feature described above, with LSA contributing two. These columns are the result of featurizing the text-type "Message" column. Most importantly, these featurized columns are what ends up being passed to the estimator.

If the dataset had any non-text columns, those would be left alone by this process. If the dataset had more than one text column, each would be broken into its own set of feature columns independently.
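
The per-column behavior can be sketched with plain dictionaries. In this illustration every text column is replaced by its own derived columns and every other column is copied through unchanged. The helper name, the `Subject` and `sender_id` columns, and the output column names are all hypothetical, and only two trivial features are computed to keep the example short.

```python
def featurize_text_columns(rows, text_columns):
    """Replace each text column with derived numeric columns; copy the rest."""
    featurized = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name in text_columns:
                # Each text column gets its own independent set of features.
                new_row[f"num_words({name})"] = len(value.split())
                new_row[f"num_characters({name})"] = len(value)
            else:
                # Non-text columns pass through untouched.
                new_row[name] = value
        featurized.append(new_row)
    return featurized


rows = [
    {"Subject": "Win now", "Message": "Claim your free prize", "sender_id": 42},
]
result = featurize_text_columns(rows, text_columns={"Subject", "Message"})
print(result[0])
```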

### The features, more directly

Rather than just checking the new column names, let's examine the output of this component directly. We can see this by running the component on its own.

In [None]:
natural_language_featurizer = evalml.pipelines.components.NaturalLanguageFeaturizer()
X_featurized = natural_language_featurizer.fit_transform(X_train)

Now we can compare the input data to the output from the Natural Language Featurizer:

In [None]:
X_train.head()

In [None]:
X_featurized.head()

These numeric values now represent important information about the original text that the estimator at the end of the pipeline can successfully use to make predictions.

## Why encode text this way?

To demonstrate the importance of text-specific modeling, let's train a model on the same dataset without letting `AutoMLSearch` treat `'Message'` as a text column. We can do this by explicitly setting the logical type of the `'Message'` column in Woodwork to `Categorical` using the utility method `infer_feature_types`.

In [None]:
from evalml.utils import infer_feature_types

X = infer_feature_types(X, {"Message": "Categorical"})
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)

In [None]:
automl_no_text = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    max_batches=1,
    optimize_thresholds=True,
    verbose=True,
)

automl_no_text.search(interactive_plot=False)

Like before, we can look at the rankings and pick the best pipeline.

In [None]:
automl_no_text.rankings

In [None]:
best_pipeline_no_text = automl_no_text.best_pipeline

Here, changing the data type of the text column removed the `Natural Language Featurizer` from the pipeline.

In [None]:
best_pipeline_no_text.graph()

In [None]:
automl_no_text.describe_pipeline(automl_no_text.rankings.iloc[0]["id"])

In [None]:
# get standard performance metrics on holdout data
scores = best_pipeline_no_text.score(
    X_holdout, y_holdout, objectives=evalml.objectives.get_ranking_objectives("binary")
)
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')

Without the `Natural Language Featurizer`, the `'Message'` column was treated as a categorical column, and therefore the conversion of this text to numerical features happened in the `One Hot Encoder`. The best pipeline encoded the top 10 most frequent "categories" of these texts, meaning 10 text messages were one-hot encoded and all the others were dropped. Clearly, this removed almost all of the information from the dataset, so `best_pipeline_no_text` performs very similarly to the trivial baseline of predicting "ham" in every case.
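
The information loss is easy to reproduce in isolation. The sketch below is a hand-rolled top-N one-hot encoding, not EvalML's `One Hot Encoder`; `top_n_one_hot` is a hypothetical helper and the messages are made up. When nearly every message is unique, only the rows matching one of the 10 kept "categories" get a nonzero encoding.

```python
from collections import Counter


def top_n_one_hot(values, top_n=10):
    """One-hot encode only the top_n most frequent values; others become all zeros."""
    kept = [value for value, _ in Counter(values).most_common(top_n)]
    return [[1 if value == category else 0 for category in kept] for value in values]


# 100 made-up messages, every one of them distinct.
messages = [f"unique message {i}" for i in range(100)]
encoded = top_n_one_hot(messages, top_n=10)

# Rows that carry no information at all after encoding.
all_zero_rows = sum(1 for row in encoded if not any(row))
print(all_zero_rows, "of", len(messages), "rows are all zeros")
```

Every all-zero row looks identical to the estimator, so it has nothing to go on except the overall class balance.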