# Using Text Data with EvalML

In this demo, we will show you how to use text data in a model using EvalML. 

In [None]:
import evalml
from evalml import AutoMLSearch

## Dataset

We will be utilizing a dataset of SMS text messages, some of which are categorized as spam, and others which are not ("ham").

In [None]:
from urllib.request import urlopen
import pandas as pd

input_data = urlopen('https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv')
data = pd.read_csv(input_data)

X = data.drop(['Category'], axis=1)
y = data['Category']

display(X.head())

The ham vs spam distribution of the data is 3:1, so any machine learning model must get above 75% accuracy in order to perform better than simply classifying everything as ham. 

In [None]:
y.value_counts(normalize=True)
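
The 75% figure is the accuracy of a majority-class baseline. A minimal sketch of that arithmetic follows, using a made-up label list with the same 3:1 ratio. The `labels` list and the `majority_baseline_accuracy` helper are illustrative assumptions, not part of EvalML or the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    # most_common(1) returns a one-element list of (label, count)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical labels with a 3:1 ham-to-spam ratio (not the real data)
labels = ["ham"] * 750 + ["spam"] * 250
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Any pipeline found by the search should be compared against this number rather than against 50%.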

## Search for best pipeline

In order to validate the results of the pipeline creation and optimization process, we will save some of our data as a holdout set.

In [None]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=0.2, random_state=0)

print(X.dtypes)

EvalML uses [Woodwork](https://woodwork.alteryx.com/en/stable/) to automatically detect which columns are text columns, so you can run the search normally, as you would if there were no text data.

Because the spam/ham labels are binary, we will use `AutoMLSearch(problem_type='binary')`. When we call `.search()`, the search for the best pipeline will begin. 

In [None]:
automl = AutoMLSearch(problem_type='binary',
                      max_iterations=5,
                      optimize_thresholds=True)

automl.search(X_train, y_train)

### View rankings and select pipeline

Once the fitting process is done, we can see all of the pipelines that were searched, ranked by their score on the search objective. We did not pass an `objective` to `AutoMLSearch`, so the default objective for binary problems is used.

In [None]:
automl.rankings

To select the best pipeline, we can run:

In [None]:
best_pipeline = automl.best_pipeline

### Describe pipeline

You can get more details about any pipeline, including how it performed on other objective functions.

In [None]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])

In [None]:
best_pipeline.graph()

Notice above that there is a `Text Featurization Component` as the second step in the pipeline. It recognizes that `'Message'` is a text column, and converts this text into numerical values that can be handled by the estimator.
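
To make "converts this text into numerical values" concrete, here is a minimal sketch of hand-written text features. It is not the featurization that EvalML performs. The `simple_text_features` helper and the particular features it computes are assumptions chosen only to illustrate mapping a string to a fixed set of numbers that an estimator can consume.

```python
def simple_text_features(message):
    """Map one text message to a small dict of numeric features."""
    words = message.split()
    n_chars = len(message)
    n_words = len(words)
    # Average word length; 0.0 for an empty message to avoid division by zero
    mean_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    # Share of all characters that are digits (phone numbers, prices, ...)
    digit_share = sum(ch.isdigit() for ch in message) / n_chars if n_chars else 0.0
    # Share of alphabetic characters that are uppercase
    letters = [ch for ch in message if ch.isalpha()]
    upper_share = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "n_chars": n_chars,
        "n_words": n_words,
        "mean_word_len": mean_word_len,
        "digit_share": digit_share,
        "upper_share": upper_share,
    }


# Hypothetical messages, not taken from the dataset
for text in ["WIN a FREE prize, call 0800 now", "see you at lunch"]:
    print(simple_text_features(text))
```

Every message, whatever its length, becomes a row with the same numeric columns, which is the property the estimator needs.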

## Evaluate on hold out

Finally, we retrain the best pipeline on all of the training data and evaluate it on the holdout set.

In [None]:
best_pipeline.fit(X_train, y_train)

Now, we can score the pipeline on the holdout data using the core objectives for binary classification problems.

In [None]:
best_pipeline.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))

As you can see, this model performs relatively well on this dataset, even on unseen data.
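
Because the classes are imbalanced, accuracy alone can hide poor spam detection, which is one reason to score on several objectives. As a rough sketch of what such metrics measure, the snippet below computes accuracy, precision, recall, and F1 by hand for made-up predictions. The `binary_metrics` helper and the sample labels are illustrative assumptions, and this tutorial does not confirm which of these metrics EvalML includes in its core objectives.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute basic binary classification metrics from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Guard each ratio against an empty denominator
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical holdout labels with a 3:1 ham-to-spam ratio
y_true = ["ham", "ham", "ham", "spam", "ham", "ham", "ham", "spam"]
# A "model" that labels everything as ham
y_all_ham = ["ham"] * len(y_true)
print(binary_metrics(y_true, y_all_ham))
```

The all-ham predictor reaches the baseline accuracy while catching no spam at all, so recall and F1 expose what accuracy conceals.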

## Why encode text this way?

To demonstrate the importance of text-specific modeling, let's train a model with the same dataset, without letting `AutoMLSearch` detect the text column. We can change this by explicitly setting the data type of the `'Message'` column in Woodwork.

In [None]:
from woodwork import DataTable

X_train_datatable = DataTable(X_train, logical_types={'Message': 'Categorical'})

In [None]:
automl_no_text = AutoMLSearch(problem_type='binary',
                              max_iterations=5,
                              optimize_thresholds=True)

automl_no_text.search(X_train_datatable, y_train)

As before, we can look at the rankings and pick the best pipeline.

In [None]:
automl_no_text.rankings

In [None]:
best_pipeline_no_text = automl_no_text.best_pipeline
best_pipeline_no_text.graph()

Here, changing the data type of the text column removed the `Text Featurization Component` from the pipeline.

In [None]:
# describe the best pipeline from the search that ran without text detection
automl_no_text.describe_pipeline(automl_no_text.rankings.iloc[0]["id"])

In [None]:
# train on the full training data
best_pipeline_no_text.fit(X_train, y_train)

# get standard performance metrics on holdout data
best_pipeline_no_text.score(X_holdout, y_holdout, objectives=evalml.objectives.get_core_objectives('binary'))

Without the `Text Featurization Component`, the `'Message'` column was treated as a categorical column, so the conversion of this text to numerical features happened in the `One Hot Encoder`. The best pipeline encoded the 10 most frequent "categories" of these texts, meaning 10 text messages were one-hot encoded and all the others were dropped. This removed almost all of the information from the dataset: `best_pipeline_no_text` did not beat the baseline of predicting "ham" in every case.
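
The information loss can be sketched with a toy example. The snippet below one-hot encodes only the most frequent values of a column and leaves every other value as an all-zero row. It is a simplified illustration, not EvalML's `One Hot Encoder` implementation; the `one_hot_top_n` helper and the sample messages are hypothetical.

```python
from collections import Counter


def one_hot_top_n(values, top_n):
    """One-hot encode only the top_n most frequent values.

    Values outside the top_n get an all-zero row.
    """
    top = [value for value, _ in Counter(values).most_common(top_n)]
    encoded = [[int(value == category) for category in top] for value in values]
    return top, encoded


# Hypothetical messages: mostly unique, with one repeated string
messages = [
    "ok", "ok", "ok",
    "call me later",
    "FREE prize inside",
    "where are you?",
    "lunch at noon",
]
categories, encoded = one_hot_top_n(messages, top_n=2)
all_zero_rows = sum(1 for row in encoded if not any(row))
print(categories)
print(f"{all_zero_rows} of {len(messages)} rows are all zeros after encoding")
```

Since text messages are almost all unique, a real dataset would leave nearly every row all-zero, and the encoded columns would only identify a handful of exact repeated strings.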