In [3]:
import numpy
import matplotlib.pyplot as plt 
from matplotlib import cm
import pandas
import mglearn

import os
import scipy

import sklearn
# Submodules must be imported explicitly; "import sklearn" alone does not load them.
import sklearn.datasets
import sklearn.ensemble
import sklearn.neural_network
import sklearn.feature_extraction.text
import sklearn.feature_selection
import sklearn.linear_model
import sklearn.model_selection
from sklearn.cluster import KMeans

import graphviz
import mpl_toolkits.mplot3d as plt3dd

import time

# Data represented as strings

There are four kinds of string data you might see:

- Categorical data
- Free strings that can be semantically mapped to categories
- Structured string data
- Text data

Categorical data is data that comes from a fixed list. Say you collect data via a survey where you ask people their favorite color, with a drop-down menu that allows them to select from “red,” “green,” “blue,” “yellow,” “black,” “white,” “purple,” and “pink.” This will result in a dataset with exactly eight different possible values, which clearly encode a categorical variable. You can check whether this is the case for your data by eyeballing it (if you see very many different strings, it is unlikely that this is a categorical variable) and confirm it by computing the unique values over the dataset, and possibly a histogram of how often each appears. You also might want to check whether each value actually corresponds to a category that makes sense for your application. Maybe halfway through the existence of your survey, someone found that “black” was misspelled as “blak” and subsequently fixed the survey. As a result, your dataset contains both “blak” and “black,” which correspond to the same semantic meaning and should be consolidated.
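The check described above can be sketched with pandas. The survey responses below are made up for illustration; only `unique`, `value_counts`, and `replace` are assumed from the library.

```python
import pandas as pd

# Hypothetical survey responses; "blak" is a misspelling of "black".
responses = pd.Series(
    ["red", "black", "blak", "blue", "black", "red", "blak", "green"],
    name="favorite_color",
)

# Unique values and how often each one appears.
print(responses.unique())
print(responses.value_counts())

# Consolidate values that carry the same meaning.
cleaned = responses.replace({"blak": "black"})
counts = cleaned.value_counts()
print(counts)
```

A short list of unique values with sensible counts suggests a categorical variable; a long tail of values that each appear once suggests free text.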


Now imagine instead of providing a drop-down menu, you provide a text field for the users to provide their own favorite colors. Many people might respond with a color name like “black” or “blue.” Others might make typographical errors, use different spellings like “gray” and “grey,” or use more evocative and specific names like “midnight blue.” You will also have some very strange entries. Some good examples come from the xkcd Color Survey, where people had to name colors and came up with names like “velociraptor cloaka” and “my dentist’s office orange. I still remember his dandruff slowly wafting into my gaping yaw,” which are hard to map to colors automatically (or at all). The responses you can obtain from a text field belong to the second category in the list, free strings that can be semantically mapped to categories. It will probably be best to encode this data as a categorical variable, where you can select the categories either by using the most common entries, or by defining categories that will capture responses in a way that makes sense for your application. You might then have some categories for standard colors, maybe a category “multicolored” for people that gave answers like “green and red stripes,” and an “other” category for things that cannot be encoded otherwise. This kind of preprocessing of strings can take a lot of manual effort and is not easily automated. If you are in a position where you can influence data collection, we highly recommend avoiding manually entered values for concepts that are better captured using categorical variables.
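As a rough sketch of mapping free strings to categories, the function below matches whole words against a small keyword table. The table `COLOR_KEYWORDS` and the helper `map_to_category` are hypothetical; a real application would choose categories by inspecting the most common responses, and would still need manual review.

```python
# Hypothetical category keywords, chosen only for this illustration.
COLOR_KEYWORDS = {
    "black": ["black"],
    "blue": ["blue", "navy"],
    "gray": ["gray", "grey"],
    "green": ["green"],
    "red": ["red"],
}

def map_to_category(response):
    """Map a free-text answer to one category, 'multicolored', or 'other'."""
    words = response.lower().replace(",", " ").split()
    matched = sorted(
        category
        for category, keywords in COLOR_KEYWORDS.items()
        if any(keyword in words for keyword in keywords)
    )
    if len(matched) == 1:
        return matched[0]
    if len(matched) > 1:
        return "multicolored"
    return "other"

answers = ["Grey", "midnight blue", "green and red stripes", "velociraptor cloaka"]
categories = [map_to_category(a) for a in answers]
print(categories)
```

Matching whole words rather than substrings avoids counting "red" inside a word like "colored", but it does nothing for typographical errors, which is part of why this preprocessing is hard to automate.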


Often, manually entered values do not correspond to fixed categories, but still have some underlying structure, like addresses, names of places or people, dates, telephone numbers, or other identifiers. These kinds of strings are often very hard to parse, and their treatment is highly dependent on context and domain. A systematic treatment of these cases is beyond the scope of this book.


The final category of string data is freeform text data that consists of phrases or sentences. Examples include tweets, chat logs, and hotel reviews, as well as the collected works of Shakespeare, the content of Wikipedia, or the Project Gutenberg collection of 50,000 ebooks. All of these collections contain information mostly as sentences composed of words. For simplicity’s sake, let’s assume all our documents are in one language, English. In the context of text analysis, the dataset is often called the corpus, and each data point, represented as a single text, is called a document. These terms come from the information retrieval (IR) and natural language processing (NLP) communities, which both deal mostly with text data.

# Example application: sentiment analysis of movie reviews

In [8]:
path_data = r"./Raw Data/aclImdb/"
reviews_train = sklearn.datasets.load_files(path_data+"train", categories=["pos", "neg"]);

In [9]:
text_train, y_train = reviews_train.data, reviews_train.target;

print("type of text_train: {}".format(type(text_train)));
print("length of text_train: {}".format(len(text_train)));
print("\ntext_train[1]:\n{}".format(text_train[1]));
print("\ny_train[1]:\n{}".format(y_train[1]));

# replace HTML line-break tags ("<br />") with spaces
text_train = [doc.replace(b"<br />",b" ") for doc in text_train];



type of text_train: <class 'list'>
length of text_train: 25000

text_train[1]:
b'Words can\'t describe how bad this movie is. I can\'t explain it by writing only. You have too see it for yourself to get at grip of how horrible a movie really can be. Not that I recommend you to do that. There are so many clich\xc3\xa9s, mistakes (and all other negative things you can imagine) here that will just make you cry. To start with the technical first, there are a LOT of mistakes regarding the airplane. I won\'t list them here, but just mention the coloring of the plane. They didn\'t even manage to show an airliner in the colors of a fictional airline, but instead used a 747 painted in the original Boeing livery. Very bad. The plot is stupid and has been done many times before, only much, much better. There are so many ridiculous moments here that i lost count of it really early. Also, I was on the bad guys\' side all the time in the movie, because the good guys were so stupid. "Executive Decisi

In [617]:
reviews_test = sklearn.datasets.load_files(path_data+"test", categories=["pos", "neg"]);

In [618]:
text_test, y_test = reviews_test.data, reviews_test.target;
text_test = [doc.replace(b"<br />", b" ") for doc in text_test];
print("Number of documents in test data: {}".format(len(text_test)));
print("Samples per class (test): {}".format(numpy.bincount(y_test)));

Number of documents in test data: 25000
Samples per class (test): [12500 12500]


## Representing Text Data as a Bag of Words

One of the simplest but most effective and commonly used ways to represent text for machine learning is the bag-of-words representation. When using this representation, we discard most of the structure of the input text, like chapters, paragraphs, sentences, and formatting, and only count how often each word appears in each text in the corpus. Discarding the structure and counting only word occurrences leads to the mental image of representing text as a “bag.”

Computing the bag-of-words representation for a corpus of documents consists of the following three steps:

1) *Tokenization*. Split each document into the words that appear in it (called tokens), for example by splitting them on whitespace and punctuation.
2) *Vocabulary building*. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
3) *Encoding*. For each document, count how often each of the words in the vocabulary appears in this document.
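The three steps can be sketched by hand with the standard library. This is only an illustration of the idea, not how CountVectorizer is implemented; the token pattern used here (lowercase, runs of two or more word characters) is an assumption chosen to mirror its default behavior.

```python
import re
from collections import Counter

docs = ["The fool doth think he is wise,",
        "but the wise man knows himself to be a fool"]

# 1) Tokenization: lowercase, then keep runs of two or more word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# 2) Vocabulary building: number the distinct tokens in alphabetical order.
all_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {word: index for index, word in enumerate(all_tokens)}

# 3) Encoding: count how often each vocabulary word appears in each document.
counts = []
for tokens in tokenized:
    token_counts = Counter(tokens)
    counts.append([token_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

Note that the single-letter word "a" is dropped by the token pattern, so the vocabulary has 13 entries rather than 14.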

In [428]:
vect = sklearn.feature_extraction.text.CountVectorizer();

In [430]:
bards_words =["The fool doth think he is wise,",
"but the wise man knows himself to be a fool"];
vect.fit(bards_words);

In [432]:
print("Vocabulary size: {}".format(len(vect.vocabulary_)));
print("Vocabulary content:\n {}".format(vect.vocabulary_));

Vocabulary size: 13
Vocabulary content:
 {'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}


In [254]:
bag_of_words = vect.transform(bards_words);
print("bag_of_words: {}\n".format(repr(bag_of_words)));
print(bag_of_words);

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 6)	1
  (0, 9)	1
  (0, 10)	1
  (0, 12)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 5)	1
  (1, 7)	1
  (1, 8)	1
  (1, 9)	1
  (1, 11)	1
  (1, 12)	1


The bag-of-words representation is stored in a SciPy sparse matrix that only stores the entries that are nonzero (see Chapter 1). The matrix is of shape 2×13, with one row for each of the two data points and one feature for each of the words in the vocabulary. A sparse matrix is used as most documents only contain a small subset of the words in the vocabulary, meaning most entries in the feature array are 0. Think about how many different words might appear in a movie review compared to all the words in the English language (which is what the vocabulary models). Storing all those zeros would be prohibitive, and a waste of memory. To look at the actual content of the sparse matrix, we can convert it to a “dense” NumPy array (that also stores all the 0 entries) using the toarray method.
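To get a feel for why sparse storage matters, the sketch below builds a made-up count matrix (the shape and the three nonzero entries are arbitrary assumptions) and compares the number of stored entries with the number a dense array would hold.

```python
import numpy as np
from scipy import sparse

# A hypothetical count matrix: 1,000 documents, 50,000 vocabulary words,
# with only three nonzero counts.
rows = np.array([0, 0, 999])
cols = np.array([3, 49999, 42])
vals = np.array([1, 2, 1])
X = sparse.csr_matrix((vals, (rows, cols)), shape=(1000, 50000))

stored = X.nnz                      # entries actually stored
total = X.shape[0] * X.shape[1]     # entries a dense array would hold
print("stored entries:", stored)
print("dense entries: ", total)
print("density: {:.2e}".format(stored / total))

# Converting a small slice to dense is cheap; converting everything may not be.
print(X[:1, :5].toarray())
```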

In [150]:
print("Dense representation of bag_of_words:\n{}".format(bag_of_words.toarray()));

Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]


We can see that the word counts for each word are either 0 or 1; neither of the two strings in bards_words contains a word twice. Let’s take a look at how to read these feature vectors. The first string ("The fool doth think he is wise,") is represented as the first row, and it contains the first word in the vocabulary, "be", zero times. It also contains the second word in the vocabulary, "but", zero times. It contains the third word, "doth", once, and so on. Looking at both rows, we can see that the fourth word, "fool", the tenth word, "the", and the thirteenth word, "wise", appear in both strings.

## Bag-of-words for movie reviews

In [621]:
vect = sklearn.feature_extraction.text.CountVectorizer();
X_train = vect.fit_transform(text_train);

In [445]:
print("X_train:\n{}".format(repr(X_train)))

X_train:
<25000x74849 sparse matrix of type '<class 'numpy.int64'>'
	with 3431196 stored elements in Compressed Sparse Row format>


In [446]:
feature_names = vect.get_feature_names_out();

In [447]:
print("Number of features: {}\n".format(len(feature_names)))
print("First 20 features:\n{}\n".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}\n".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}\n".format(feature_names[::2000]))

Number of features: 74849

First 20 features:
['00' '000' '0000000000001' '00001' '00015' '000s' '001' '003830' '006'
 '007' '0079' '0080' '0083' '0093638' '00am' '00pm' '00s' '01' '01pm' '02']

Features 20010 to 20030:
['dratted' 'draub' 'draught' 'draughts' 'draughtswoman' 'draw' 'drawback'
 'drawbacks' 'drawer' 'drawers' 'drawing' 'drawings' 'drawl' 'drawled'
 'drawling' 'drawn' 'draws' 'draza' 'dre' 'drea']

Every 2000th feature:
['00' 'aesir' 'aquarian' 'barking' 'blustering' 'bête' 'chicanery'
 'condensing' 'cunning' 'detox' 'draper' 'enshrined' 'favorit' 'freezer'
 'goldman' 'hasan' 'huitieme' 'intelligible' 'kantrowitz' 'lawful' 'maars'
 'megalunged' 'mostey' 'norrland' 'padilla' 'pincher' 'promisingly'
 'receptionist' 'rivals' 'schnaas' 'shunning' 'sparse' 'subset'
 'temptations' 'treatises' 'unproven' 'walkman' 'xylophonist']



As you can see, possibly a bit surprisingly, the first 10 entries in the vocabulary are all
numbers. All these numbers appear somewhere in the reviews, and are therefore
extracted as words. Most of these numbers don’t have any immediate semantic
meaning, apart from "007", which in the particular context of movies is likely to refer to
the James Bond character. Weeding out the meaningful from the nonmeaningful
“words” is sometimes tricky. Looking further along in the vocabulary, we find a
collection of English words starting with “dra”. You might notice that for "draught",
"drawback", and "drawer" both the singular and plural forms are contained in the
vocabulary as distinct words. These words have very closely related semantic
meanings, and counting them as different words, corresponding to different features,
might not be ideal.

Challenge: Numbers appear in the first 20 words of the given data. The word "007" likely refers to James Bond. Verify this.

In [504]:
# Steps
# Collect the indices of the reviews in which "007" appears. This could be done with map/filter, but it would be messy in a single line.
idx = [];
for i, x in enumerate(X_train):
    word_vals = x.toarray()[0];
    if word_vals[vect.vocabulary_["007"]] > 0: idx.append(i);        # keep reviews that contain "007"

# Apply idx to the list of reviews
arrTxt = numpy.array(text_train);    # convert the text_train list to an array so it can be indexed with a list of indices
arrBondJamesBond = arrTxt[idx];

In [518]:
# print a random review
print(arrBondJamesBond[numpy.random.randint(0,len(idx))]);

b'EA have shown us that they can make a classic 007 agent and make you feel in the 60\'s world. The graphics of the game are outstanding and also the voice recording is very professional. I got this game April 2007 (two years after release), and I am still impressed with the gameplay. It\'s a shame that EA will no longer make 007 games.  I give this game 10/10 for the levels it contains, especially the "consulate" level. I would recommend this game to anyone from the age of 13 and over. The only thing I didn\'t like in the game is the Russian boat level, it was too much pressure. On the whole I like the game A LOT!!'
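The row-by-row loop above converts every review to a dense array, which is slow for 25,000 documents. A possible alternative, sketched here on a tiny made-up corpus, is to slice the single column that belongs to "007" and ask the sparse matrix for its nonzero rows. The names `reviews`, `vect_demo`, and `X_demo` are placeholders for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for text_train (byte strings, as returned by load_files).
reviews = [b"A classic 007 adventure",
           b"A quiet drama about a family",
           b"Bond, James Bond: 007 returns"]

vect_demo = CountVectorizer()
X_demo = vect_demo.fit_transform(reviews)

# Column index of the token "007", then the rows where that column is nonzero.
col = vect_demo.vocabulary_["007"]
idx = X_demo[:, col].nonzero()[0]

matches = [reviews[i] for i in idx]
print(sorted(idx.tolist()))
print(matches)
```

Because only one column is inspected, no dense copy of a full row is ever created.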


### Building the model

Before we try to improve our feature extraction, let’s obtain a quantitative measure of performance by actually building a classifier. We have the training labels stored in y_train and the bag-of-words representation of the training data in X_train, so we can train a classifier on this data. For high-dimensional, sparse data like this, linear models like LogisticRegression often work best. Let’s start by evaluating LogisticRegression using cross-validation:

In [533]:
logreg = sklearn.linear_model.LogisticRegression(max_iter=1000);
scores = sklearn.model_selection.cross_val_score(logreg, X_train, y_train, cv=5,n_jobs = 3);

In [548]:
print("Mean cross-validation accuracy: {:4.2f} %".format(100*scores.mean()));

Mean cross-validation accuracy: 88.13 %


We obtain a mean cross-validation score of 88%, which indicates reasonable performance for a balanced binary classification task. We know that LogisticRegression has a regularization parameter, C, which we can tune via cross-validation:

In [565]:
param_grid = {'C' : numpy.logspace(-3,1,5)};
grid = sklearn.model_selection.GridSearchCV(logreg, param_grid=param_grid, n_jobs=3);
grid.fit(X_train, y_train);

In [646]:
print("Best cross-validation score: {:4.2f} %".format(100*grid.best_score_))
print("Best parameters: ", grid.best_params_)
best_para = grid.best_params_;

Best cross-validation score: 88.80 %
Best parameters:  {'C': 0.1}


In [629]:
X_test = vect.transform(text_test)
print("Test score: {:4.2f} %".format(100*grid.score(X_test, y_test)))

Test score: 87.89 %


Now, let’s see if we can improve the extraction of words. The CountVectorizer
extracts tokens using a regular expression. By default, the regular expression that is
used is "\b\w\w+\b". If you are not familiar with regular expressions, this means it
finds all sequences of at least two word characters (\w matches letters, digits, and
the underscore) that are separated by word boundaries (\b). It does not find
single-letter words, and it splits up contractions like “doesn’t” and strings like
“bit.ly”, but it matches “h8ter” as a single word. The CountVectorizer also converts
all text to lowercase characters, so that “soon”, “Soon”, and “sOon” all correspond
to the same token (and therefore feature).
This simple mechanism works quite well in practice, but as we saw earlier, we get
many uninformative features (like the numbers). One way to cut back on these is to
only use tokens that appear in at least two documents (or at least five documents, and
so on). A token that appears only in a single document is unlikely to appear in the test
set and is therefore not helpful. We can set the minimum number of documents a
token needs to appear in with the min_df parameter:

In [674]:
vect_2 = sklearn.feature_extraction.text.CountVectorizer(min_df=5);
X_train = vect_2.fit_transform(text_train, y_train);

In [665]:
print("X_train with min_df: {}".format(repr(X_train)))

X_train with min_df: <25000x27271 sparse matrix of type '<class 'numpy.int64'>'
	with 3354014 stored elements in Compressed Sparse Row format>
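Both the default token pattern and the effect of min_df are easy to check on a toy example. The sentence and the four-document corpus below are made up for illustration; the regular expression is applied by hand with `re` after lowercasing, which is an approximation of what the vectorizer does internally.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# The default token pattern, applied by hand to a lowercased string.
text = "A h8ter doesn't click bit.ly links SOON"
tokens = re.findall(r"\b\w\w+\b", text.lower())
print(tokens)

# min_df=2 keeps only tokens that occur in at least two documents.
toy_corpus = ["good movie", "bad movie", "good plot", "xyzzy typo"]
vect_all = CountVectorizer().fit(toy_corpus)
vect_min2 = CountVectorizer(min_df=2).fit(toy_corpus)

vocab_all = sorted(vect_all.vocabulary_)
vocab_min2 = sorted(vect_min2.vocabulary_)
print(vocab_all)
print(vocab_min2)
```

The single-letter word "a" and the "t" of "doesn't" are dropped, "bit.ly" is split in two, and with min_df=2 only the tokens shared by at least two documents survive.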


By requiring each token to appear in at least five documents, we can bring down the number of features to 27,271, as the shape of X_train shows. That is only about a third of the original features. Let’s look at some tokens again:

In [669]:
feature_names = vect_2.get_feature_names_out();

print("First 50 features:\n{}\n".format(feature_names[0:50]));
print("Features 20010 to 20030:\n{}\n".format(feature_names[20010:20030]));
print("Every 1000th feature:\n{}\n".format(feature_names[::1000]));

First 50 features:
['00' '000' '007' '00s' '01' '02' '03' '04' '05' '06' '07' '08' '09' '10'
 '100' '1000' '100th' '101' '102' '103' '104' '105' '107' '108' '10s'
 '10th' '11' '110' '112' '116' '117' '11th' '12' '120' '12th' '13' '135'
 '13th' '14' '140' '14th' '15' '150' '15th' '16' '160' '1600' '16mm' '16s'
 '16th']

Features 20010 to 20030:
['repentance' 'repercussions' 'repertoire' 'repetition' 'repetitions'
 'repetitious' 'repetitive' 'rephrase' 'replace' 'replaced' 'replacement'
 'replaces' 'replacing' 'replay' 'replayable' 'replayed' 'replaying'
 'replays' 'replete' 'replica']

Every 1000th feature:
['00' 'alternatively' 'baked' 'bothersome' 'centipede' 'complicity'
 'cutlery' 'disgraceful' 'elton' 'fatal' 'gaining' 'hamburgers' 'ideals'
 'ivory' 'leering' 'martin' 'moxy' 'opportunist' 'picasso' 'prudish'
 'repartee' 'sas' 'silvers' 'standup' 'talkative' 'trend' 'verisimilitude'
 'wreaking']



There are clearly many fewer numbers, and some of the more obscure words or misspellings seem to have vanished. Let’s see how well our model performs by doing a grid search again:

In [680]:
param_grid = {'C' : numpy.logspace(-3,1,5)};
logreg = sklearn.linear_model.LogisticRegression(max_iter=10000);
grid = sklearn.model_selection.GridSearchCV(logreg, param_grid=param_grid, cv=5, n_jobs=4);
grid.fit(X_train, y_train);

In [686]:
print("Best cross-validation score: {:4.2f} %".format(100*grid.best_score_))

Best cross-validation score: 88.81 %


The best validation accuracy of the grid search is still 89%, unchanged from before.
We didn’t improve our model, but having fewer features to deal with speeds up
processing, and throwing away useless features might make the model more interpretable.

If the transform method of CountVectorizer is called on a document that contains words that were not contained in the training data, these words will be ignored as they are not part of the dictionary. This is not really an issue for classification, as it’s not possible to learn anything about words that are not in the training data. For some applications, like spam detection, it might be helpful to add a feature that encodes how many so-called “out of vocabulary” words there are in a particular document, though. For this to work, you need to set min_df; otherwise, this feature will never be active during training.
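One way such an out-of-vocabulary feature might be computed is sketched below. CountVectorizer does not provide this count directly, so the helper `count_oov` is a hypothetical addition; it reuses the vectorizer's own analyzer so that tokenization matches what transform does. The toy corpus and min_df=2 are assumptions for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad movie", "good plot"]

# With min_df=2, rare training tokens ("bad", "plot") fall outside the vocabulary.
vect_demo = CountVectorizer(min_df=2).fit(train_docs)
analyzer = vect_demo.build_analyzer()   # same preprocessing and tokenization as transform

def count_oov(doc):
    """Count tokens in doc that are not part of the fitted vocabulary."""
    return sum(token not in vect_demo.vocabulary_ for token in analyzer(doc))

train_oov = [count_oov(doc) for doc in train_docs]
n_oov = count_oov("good movie with zzzqx plot twist")
print(train_oov)
print(n_oov)
```

Because min_df removes some training tokens, the count is nonzero for part of the training set, which is what lets a model learn a weight for it.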

## Stopwords

Another way that we can get rid of uninformative words is by discarding words that are too frequent to be informative. There are two main approaches: using a language-specific list of stopwords, or discarding words that appear too frequently. scikit-learn has a built-in list of English stopwords in the feature_extraction.text module:

In [694]:
print("Number of stop words: {}".format(len(sklearn.feature_extraction.text.ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(sklearn.feature_extraction.text.ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword:
['cant', 'whence', 'noone', 'up', 'or', 'above', 'per', 'whoever', 'move', 'i', 'else', 'somewhere', 'forty', 'cannot', 'ever', 'put', 'last', 'anywhere', 'get', 'whatever', 'became', 'five', 'within', 'eight', 'very', 'have', 'hers', 'yourself', 'you', 'this', 'bottom', 'nobody']


In [778]:
vect = sklearn.feature_extraction.text.CountVectorizer(stop_words="english", min_df=5).fit(text_train);
X_train = vect.transform(text_train);

In [779]:
print("X_train with stop words:\n{}".format(repr(X_train)));

X_train with stop words:
<25000x26966 sparse matrix of type '<class 'numpy.int64'>'
	with 2149958 stored elements in Compressed Sparse Row format>


There are now 305 (27,271–26,966) fewer features in the dataset, which means that
most, but not all, of the 318 stopwords appeared in the vocabulary. Let’s run the grid search again:

In [781]:
param_grid = {'C' : numpy.logspace(-3,1,5)};
logreg = sklearn.linear_model.LogisticRegression(max_iter=10000);
grid = sklearn.model_selection.GridSearchCV(logreg, param_grid=param_grid, cv=5, n_jobs=4);
grid.fit(X_train, y_train);

In [782]:
print("Best cross-validation score: {:4.2f} %".format(100*grid.best_score_))

Best cross-validation score: 88.30 %


The grid search performance decreased slightly using the stopwords—not enough to worry about, but given that excluding 305 features out of over 27,000 is unlikely to change performance or interpretability a lot, it doesn’t seem worth using this list. Fixed lists are mostly helpful for small datasets, which might not contain enough information for the model to determine which words are stopwords from the data itself. As an exercise, you can try out the other approach, discarding frequently appearing words, by setting the max_df option of CountVectorizer and see how it
influences the number of features and the performance.
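As a starting point for that exercise, the sketch below shows what max_df does on a made-up four-document corpus. A float value is interpreted as a proportion of documents, so max_df=0.5 drops every token that appears in more than half of them. The threshold is an arbitrary choice for illustration, not a recommendation for the movie reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["the movie was good",
              "the movie was bad",
              "the plot was thin",
              "the acting was fine"]

vect_all = CountVectorizer().fit(toy_corpus)
vect_max = CountVectorizer(max_df=0.5).fit(toy_corpus)

# Tokens removed because they appear in more than 50% of the documents.
dropped = sorted(set(vect_all.vocabulary_) - set(vect_max.vocabulary_))
print("features without max_df:", len(vect_all.vocabulary_))
print("features with max_df=0.5:", len(vect_max.vocabulary_))
print("dropped:", dropped)
```

On the real data, the same comparison of vocabulary sizes, followed by the grid search used earlier, shows how the setting affects both the number of features and the score.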

## Rescaling the data with tf–idf

In [789]:
#335