# 07 __Working with text data__

There are four kinds of string data you might see:
 - Categorical data
 - Free strings that can be semantically mapped to categories
 - Structured string data
 - Text data

Categorical data is data that comes from a fixed list. Say you collect data via a survey
where you ask people their favorite color, with a drop-down menu that allows them
to select from “red,” “green,” “blue,” “yellow,” “black,” “white,” “purple,” and “pink.”
This will result in a dataset with exactly eight different possible values, which clearly
encode a categorical variable. You can check whether this is the case for your data by
eyeballing it (if you see very many different strings it is unlikely that this is a categorical variable) and confirm it by computing the unique values over the dataset, and
possibly a histogram over how often each appears. You also might want to check
whether each variable actually corresponds to a category that makes sense for your
application. Maybe halfway through the existence of your survey, someone found that
“black” was misspelled as “blak” and subsequently fixed the survey. As a result, your
dataset contains both “blak” and “black,” which correspond to the same semantic
meaning and should be consolidated.
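A minimal sketch of this check in plain Python. The `responses` list and the `corrections` mapping are made up for illustration and do not come from a real survey:

```python
from collections import Counter

# Hypothetical survey responses; "blak" is the misspelling mentioned in the text
responses = ["red", "black", "blak", "blue", "black", "pink", "blak", "red"]

# Unique values and how often each one appears
counts = Counter(responses)
print(sorted(counts))
print(counts.most_common())

# Consolidate values that carry the same semantic meaning
corrections = {"blak": "black"}
cleaned = [corrections.get(value, value) for value in responses]
cleaned_counts = Counter(cleaned)
print(cleaned_counts.most_common())
```

With a real dataset the same idea applies to the whole column: inspect the counts first, then decide which values to merge.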

Now imagine instead of providing a drop-down menu, you provide a text field for the
users to provide their own favorite colors. Many people might respond with a color
name like “black” or “blue.” Others might make typographical errors, use different
spellings like “gray” and “grey,” or use more evocative and specific names like
“midnight blue.” You will also have some very strange entries. Some good examples come
from the xkcd Color Survey, where people had to name colors and came up with
names like “velociraptor cloaka” and “my dentist’s office orange. I still remember his
dandruff slowly wafting into my gaping maw,” which are hard to map to colors automatically
(or at all). The responses you can obtain from a text field belong to the
second category in the list, free strings that can be semantically mapped to categories. It will probably be best to encode this data as a categorical variable, where you can
select the categories either by using the most common entries, or by defining
categories that will capture responses in a way that makes sense for your application. You
might then have some categories for standard colors, maybe a category
“multicolored” for people that gave answers like “green and red stripes,” and an “other” category for things that cannot be encoded otherwise. This kind of preprocessing of
strings can take a lot of manual effort and is not easily automated. If you are in a position where you can influence data collection, we highly recommend avoiding manually entered values for concepts that are better captured using categorical variables.
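The following sketch shows one possible way to encode such free-text answers as categories. The color list, the table of spelling variants, and the rule "more than one known color means multicolored" are all assumptions chosen for illustration; a real application would need its own rules and usually some manual review:

```python
# Hypothetical category list and spelling variants, chosen for illustration
KNOWN_COLORS = {"red", "green", "blue", "yellow", "black",
                "white", "purple", "pink", "gray"}
SPELLING_VARIANTS = {"grey": "gray", "blak": "black"}


def map_to_category(response):
    """Map a free-text color answer to a known color, 'multicolored', or 'other'."""
    # Lowercase, split on whitespace, and strip simple punctuation
    words = [word.strip(".,!?") for word in response.lower().split()]
    # Normalize known spelling variants
    words = [SPELLING_VARIANTS.get(word, word) for word in words]
    found = {word for word in words if word in KNOWN_COLORS}
    if len(found) == 1:
        return found.pop()
    if len(found) > 1:
        return "multicolored"
    return "other"


for answer in ["Grey", "midnight blue", "green and red stripes",
               "velociraptor cloaka"]:
    print(answer, "->", map_to_category(answer))
```

Note that this collapses “midnight blue” into “blue”, which may or may not be what your application needs.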

Often, manually entered values do not correspond to fixed categories, but still have
some underlying structure, like addresses, names of places or people, dates, telephone
numbers, or other identifiers. These kinds of strings are often very hard to parse, and
their treatment is highly dependent on context and domain.
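As a small illustration of how context-dependent this parsing is, the sketch below tries a handful of date formats in turn. The format list is an assumption, and so is the decision to read `03/04/2020` as day first; a dataset from another country might need the opposite choice:

```python
from datetime import datetime

# Assumed formats; the order matters when a string could match several
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]


def parse_date(text):
    """Return a date for the first matching format, or None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


for raw in ["2020-06-15", "15/06/2020", "15.06.2020", "next Tuesday"]:
    print(repr(raw), "->", parse_date(raw))
```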

The final category of string data is freeform text data that consists of phrases or sentences. Examples include tweets, chat logs, and hotel reviews, as well as the collected
works of Shakespeare, the content of Wikipedia, or the Project Gutenberg collection
of 50,000 ebooks. All of these collections contain information mostly as sentences
composed of words. For simplicity’s sake, let’s assume all our documents are in one
language, English. In the context of text analysis, the dataset is often called the corpus, and each data point, represented as a single text, is called a document. These
terms come from the information retrieval (IR) and natural language processing (NLP)
communities, which both deal mostly with text data.

## __Example Application: Sentiment Analysis of Movie Reviews__

As a running example in this chapter, we will use a dataset of movie reviews from the
IMDb (Internet Movie Database) website collected by Stanford researcher Andrew
Maas. This dataset contains the text of the reviews, together with a label that indicates whether a review is “positive” or “negative.” The IMDb website itself contains
ratings from 1 to 10. To simplify the modeling, this annotation is summarized as a
two-class classification dataset where reviews with a score of 7 or higher are labeled as
positive and reviews with a score of 4 or lower as negative (reviews with more neutral scores are left out). We will leave the question of whether this is a good
representation of the data open, and simply use the data as provided by Andrew
Maas.

After unpacking the data, the dataset is provided as text files in two separate folders,
one for the training data and one for the test data. Each of these in turn has two subfolders, one called pos and one called neg:

In [1]:
!tree -L 2 input/aclImdb/

input/aclImdb/
├── imdbEr.txt
├── imdb.vocab
├── README
├── test
│   ├── labeledBow.feat
│   ├── neg
│   ├── pos
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg
    ├── pos
    ├── unsup
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt

7 directories, 11 files


The pos folder contains all the positive reviews, each as a separate text file, and similarly
for the neg folder. There is a helper function in `scikit-learn` to load files stored
in such a folder structure, where each subfolder corresponds to a label, called
`load_files`. We apply the `load_files` function first to the training data:

In [2]:
from sklearn.datasets import load_files

reviews_train = load_files('input/aclImdb/train/')

# load_files returns a Bunch, containing the texts and the training labels
text_train, y_train = reviews_train.data, reviews_train.target

In [3]:
print(f'type of text_train: {type(text_train)}')
print(f'length of text_train: {len(text_train)}')
print(f'text_train[1]:\n{text_train[1]}')

type of text_train: <class 'list'>
length of text_train: 75000
text_train[1]:
b"Amount of disappointment I am getting these days seeing movies like Partner, Jhoom Barabar and now, Heyy Babyy is gonna end my habit of seeing first day shows.<br /><br />The movie is an utter disappointment because it had the potential to become a laugh riot only if the d\xc3\xa9butant director, Sajid Khan hadn't tried too many things. Only saving grace in the movie were the last thirty minutes, which were seriously funny elsewhere the movie fails miserably. First half was desperately been tried to look funny but wasn't. Next 45 minutes were emotional and looked totally artificial and illogical.<br /><br />OK, when you are out for a movie like this you don't expect much logic but all the flaws tend to appear when you don't enjoy the movie and thats the case with Heyy Babyy. Acting is good but thats not enough to keep one interested.<br /><br />For the positives, you can take hot actresses, last 30 minutes,

You can see that `text_train` is a list of length 75,000, where each entry is a bytes string
containing a review. That is 50,000 more than the 25,000 labeled training reviews, because
`load_files` also picked up the unlabeled reviews in the unsup subfolder. We printed the
review with index 1. You can also see that the review contains some HTML line breaks
(`<br />`). While these are unlikely to have a
large impact on our machine learning models, it is better to clean the data and
remove this formatting before we proceed:

In [4]:
text_train = [doc.replace(b'<br />', b' ') for doc in text_train]
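The entries are `bytes` objects rather than `str`, which is why the replacement above uses `b'...'` literals. A small sketch on a shortened, hand-copied fragment of the review shown earlier illustrates the cleanup and the UTF-8 decoding; the variable names are only for illustration:

```python
# Shortened fragment of the raw review bytes shown above
raw = b"the d\xc3\xa9butant director<br /><br />The movie"

# bytes.replace needs bytes arguments; each <br /> becomes one space
cleaned = raw.replace(b"<br />", b" ")

# Decoding as UTF-8 turns the escaped bytes \xc3\xa9 into the character é
text = cleaned.decode("utf-8")
print(text)
```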

In [5]:
# count the samples per class (the third entry is the unlabeled unsup folder)
import numpy as np
print(f'samples per class (training): {np.bincount(y_train)}')

samples per class (training): [12500 12500 50000]


### As of this date, Jun 2020, the aclImdb train dir contains 3 classes (neg, pos, and the unlabeled unsup)
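Because `load_files` turns every subfolder into a class, the unsup folder shows up as a third label. One way to keep only the two labeled classes is the `categories` parameter of `load_files`. The sketch below demonstrates this on a tiny temporary folder structure that is created just for the example; it is not the real dataset, and the file contents are placeholders:

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny stand-in for the train folder: one file per subfolder
with tempfile.TemporaryDirectory() as root:
    for label in ["neg", "pos", "unsup"]:
        folder = Path(root) / label
        folder.mkdir()
        (folder / "0.txt").write_text(f"a {label} review")

    # Without categories, every subfolder becomes a class
    everything = load_files(root)
    # With categories, only the listed subfolders are loaded
    labeled = load_files(root, categories=["neg", "pos"])

print(everything.target_names)
print(labeled.target_names)
print(len(labeled.data))
```

Applied to the real data, the same `categories` argument would leave out the 50,000 unlabeled reviews, so the counts and matrix shapes would differ from the outputs shown in this notebook.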

The task we want to solve is as follows: given a review, we want to assign the label
“positive” or “negative” based on the text content of the review. This is a standard
binary classification task. However, the text data is not in a format that a machine
learning model can handle. We need to convert the string representation of the text
into a numeric representation that we can apply our machine learning algorithms to.

## __Representing Text Data as a Bag of Words__

One of the simplest but most effective and commonly used ways to represent text for
machine learning is the bag-of-words representation. When using this
representation, we discard most of the structure of the input text, like chapters, paragraphs,
sentences, and formatting, and only count how often each word appears in each text in
the corpus. Discarding the structure and counting only word occurrences leads to the
mental image of representing text as a “bag.”

Computing the bag-of-words representation for a corpus of documents consists of
the following three steps:

1. _Tokenization_: Split each document into the words that appear in it
   (called _tokens_), for example by splitting it on whitespace and punctuation.
2. _Vocabulary building_: Collect a vocabulary of all words that appear in
   any of the documents, and number them (say, in alphabetical order).
3. _Encoding_: For each document, count how often each of the words in the
   vocabulary appears in this document.
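The three steps can be sketched in plain Python, here on the two toy sentences used in this chapter. This is a simplified illustration, not how `scikit-learn` implements it. The tokenizer is assumed to lowercase the text and keep runs of two or more word characters:

```python
import re
from collections import Counter

docs = [
    "the fool doth think he is wise,",
    "but the wise man knows himself to be a fool.",
]

# 1. Tokenization: lowercase, then keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# 2. Vocabulary building: number all distinct tokens in alphabetical order
words = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {word: index for index, word in enumerate(words)}

# 3. Encoding: one row per document, one count per vocabulary entry
bag = []
for tokens in tokenized:
    counts = Counter(tokens)
    bag.append([counts[word] for word in words])

print(len(vocabulary))
print(vocabulary)
for row in bag:
    print(row)
```

The single-letter word “a” is dropped by the tokenizer, which is why the vocabulary has 13 entries rather than 14.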

In [6]:
# apply bag-of-words to a toy dataset
# this is implemented in CountVectorizer
bards_words = [
    'the fool doth think he is wise,',
    'but the wise man knows himself to be a fool.'
]

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [8]:
# fitting the CountVectorizer consists of the
# tokenization of the training data and building
# of the vocabulary, which we can access as the
# vocabulary_ attribute
print(f'Vocabulary size: {len(vect.vocabulary_)}')
print(f'Vocabulary content:\n{vect.vocabulary_}')

Vocabulary size: 13
Vocabulary content:
{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}


In [9]:
bag_of_words = vect.transform(bards_words)

print(f'bag_of_words: {repr(bag_of_words)}')

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>


To look at the actual content of the sparse matrix, we can convert it to a “dense” NumPy array (which also stores
all the 0 entries) using the `toarray` method:

In [10]:
print(f'dense representation of bag_of_words:\n{bag_of_words.toarray()}')

dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]


### Bag of Words for Movie Reviews

In [11]:
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)

print(f'X_train:\n{repr(X_train)}')

X_train:
<75000x124255 sparse matrix of type '<class 'numpy.int64'>'
	with 10315542 stored elements in Compressed Sparse Row format>
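A rough back-of-the-envelope calculation shows why calling `toarray` on this matrix is not advisable. The shape and the number of stored elements are taken from the output above. The 4-byte index size is an assumption about how SciPy stores the index arrays of a CSR matrix of this size:

```python
# Numbers from the repr of X_train above
n_docs, n_features = 75000, 124255
n_stored = 10315542

bytes_per_count = 8  # int64 counts, as reported in the repr
bytes_per_index = 4  # assumed int32 column indices and row pointers

# Dense array: every entry is stored, including all the zeros
dense_bytes = n_docs * n_features * bytes_per_count

# CSR format: one count and one column index per stored element,
# plus one row pointer per row (and one extra)
sparse_bytes = (n_stored * (bytes_per_count + bytes_per_index)
                + (n_docs + 1) * bytes_per_index)

print(f"dense:  {dense_bytes / 1e9:.1f} GB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
```

Under these assumptions the dense array would need roughly 75 GB, compared to about 124 MB for the sparse matrix.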


In [12]:
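# the vocabulary contains 124255 entries
# let's take a closer look
# (get_feature_names was removed in scikit-learn 1.2;
# newer versions use vect.get_feature_names_out() instead)
feature_names = vect.get_feature_names()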
print(f'Number of features: {len(feature_names)}')
print(f'First 20 features:\n{feature_names[:20]}')
print(f'Features 20010 to 20030:\n{feature_names[20010:20030]}')
print(f'Every 2000th feature:\n{feature_names[::2000]}')

Number of features: 124255
First 20 features:
['00', '000', '0000', '0000000000000000000000000000000001', '0000000000001', '000000001', '000000003', '00000001', '000001745', '00001', '0001', '00015', '0002', '0007', '00083', '000ft', '000s', '000th', '001', '002']
Features 20010 to 20030:
['cheapen', 'cheapened', 'cheapening', 'cheapens', 'cheaper', 'cheapest', 'cheapie', 'cheapies', 'cheapjack', 'cheaply', 'cheapness', 'cheapo', 'cheapozoid', 'cheapquels', 'cheapskate', 'cheapskates', 'cheapy', 'chearator', 'cheat', 'cheata']
Every 2000th feature:
['00', '_require_', 'aideed', 'announcement', 'asteroid', 'banquière', 'besieged', 'bollwood', 'btvs', 'carboni', 'chcialbym', 'clotheth', 'consecration', 'cringeful', 'deadness', 'devagan', 'doberman', 'duvall', 'endocrine', 'existent', 'fetiches', 'formatted', 'garard', 'godlie', 'gumshoe', 'heathen', 'honoré', 'immatured', 'interested', 'jewelry', 'kerchner', 'köln', 'leydon', 'lulu', 'mardjono', 'meistersinger', 'misspells', 'mumblecore'

All these numbers appear somewhere in the reviews, and are therefore
extracted as words. Most of these numbers don’t have any immediate semantic
meaning, apart from "007", which in the particular context of movies is likely to refer to
the James Bond character. Weeding out the meaningful from the nonmeaningful
“words” is sometimes tricky. Looking further along in the vocabulary, we find a
collection of English words starting with “chea”. You might notice that for "cheapie"
and "cheapskate" both the singular and plural forms are contained in the
vocabulary as distinct words. These words have very closely related semantic
meanings, and counting them as different words, corresponding to different features,
might not be ideal.

Before we try to improve our feature extraction, let’s obtain a quantitative measure of
performance by actually building a classifier. We have the training labels stored in
`y_train` and the bag-of-words representation of the training data in `X_train`, so we
can train a classifier on this data. For high-dimensional, sparse data like this, linear
models like LogisticRegression often work best.

In [13]:
# let's start by evaluating LogisticRegression using cv
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(
    LogisticRegression(max_iter = 10000), X_train, y_train, cv = 5, n_jobs = -1, verbose = 2
)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:   29.8s remaining:   44.7s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   29.8s remaining:    0.0s



In [None]:
print(f'Mean cross-validation accuracy: {np.mean(scores):.2f}')

We know that `LogisticRegression` has a regularization parameter, `C`, which we can tune via cross-validation:

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10]
}

# use the same max_iter as in the other cells to avoid convergence warnings
grid = GridSearchCV(LogisticRegression(max_iter = 10000), param_grid, cv = 5, n_jobs = 8, verbose = 2)
grid.fit(X_train, y_train)

print(f'best cross-validation score: {grid.best_score_:.2f}')
print(f'best params: {grid.best_params_}')

We can now assess the generalization performance of this parameter setting on the test set:

In [None]:
reviews_test = load_files('input/aclImdb/test/')

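# load_files returns a Bunch, containing the texts and the test labels
text_test, y_test = reviews_test.data, reviews_test.target
# remove the HTML line breaks, as we did for the training data
text_test = [doc.replace(b'<br />', b' ') for doc in text_test]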

In [None]:
X_test = vect.transform(text_test)

In [None]:
print(f'{grid.score(X_test, y_test):.2f}')

Now, let’s see if we can improve the extraction of words. The CountVectorizer
extracts tokens using a regular expression. By default, the regular expression that is
used is `"\b\w\w+\b"`. If you are not familiar with regular expressions, this means it
finds all sequences of characters that consist of at least two letters, digits, or underscores (`\w`)
and that are separated by word boundaries (`\b`). It does not find single-letter words,
and it splits up contractions like “doesn’t” and strings like “bit.ly”, but it matches “h8ter” as a single
word. The CountVectorizer then converts all words to lowercase characters, so that
“soon”, “Soon”, and “sOon” all correspond to the same token (and therefore feature).
This simple mechanism works quite well in practice, but as we saw earlier, we get
many uninformative features (like the numbers). One way to cut back on these is to
only use tokens that appear in at least two documents (or at least five documents, and
so on). A token that appears only in a single document is unlikely to appear in the test
set and is therefore not helpful. We can set the minimum number of documents a
token needs to appear in with the `min_df` parameter:

In [None]:
vect = CountVectorizer(min_df = 5).fit(text_train)
X_train = vect.transform(text_train)

print(f'X_train with min_df: {repr(X_train)}')
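To get a feeling for the default token pattern and for what `min_df` does, both can be imitated in plain Python. This is a simplified sketch, not the `CountVectorizer` implementation; the three short documents and the helper name `tokenize` are made up for illustration:

```python
import re
from collections import Counter

# Same pattern as the CountVectorizer default, without the inline flag
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return all tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("It doesn't matter"))
print(tokenize("see bit.ly, h8ter"))
print(tokenize("soon Soon sOon"))

# Hypothetical mini-corpus to illustrate the effect of min_df
docs = [
    "A good movie",
    "a bad movie",
    "Good plot, but sooo long",
]

# Document frequency: in how many documents does each token appear?
document_frequency = Counter(
    token for doc in docs for token in set(tokenize(doc))
)

# Keep only tokens that appear in at least min_df documents
min_df = 2
vocabulary = sorted(
    token for token, df in document_frequency.items() if df >= min_df
)
print(vocabulary)
```

Only “good” and “movie” appear in at least two of the three documents, so every other token is dropped from the vocabulary.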

In [None]:
feature_names = vect.get_feature_names()

In [None]:
print(f'first 50 features:\n{feature_names[:50]}')
print(f'features 20010 to 20030:\n{feature_names[20010:20030]}')
print(f'every 700th feature:\n{feature_names[::700]}')

There are clearly many fewer numbers, and some of the more obscure words or misspellings
seem to have vanished. Let’s see how well our model performs by doing a
grid search again:

In [None]:
grid = GridSearchCV(LogisticRegression(max_iter = 10000), param_grid, cv = 5, verbose = 2, n_jobs = -1)
grid.fit(X_train, y_train)

print(f'Best cross-validation score: {grid.best_score_:.2f}')

## Stopwords
Another way that we can get rid of uninformative words is by discarding words that
are too frequent to be informative. There are two main approaches: using a language-specific
list of stopwords, or discarding words that appear too frequently. `scikit-learn`
has a built-in list of English stopwords in the `feature_extraction.text`
module: