There is a third kind of feature that can be found in many applications, which is **text**.

Text data is usually represented as **strings**, made up of characters. The length of the text varies from one data point to the next. This kind of feature is clearly very different from numeric features, and we need to process the data before we can apply our machine learning algorithms to it.

## Types of Data Represented As Strings

Text is usually just a **string** in your dataset, but not all string features should be treated as text. A string feature can sometimes represent categorical variables. There is no way to know how to treat a string feature before looking at the data.

There are four kinds of string data you might see:
* Categorical data
* Free strings that can be semantically mapped to categories
* Structured string data
* Text data

**Categorical data** is data that comes from a **fixed list**, for example [red, green, blue]. You can check whether this is the case for your data by eyeballing it, and confirm it by computing the unique values over the dataset, possibly together with a histogram of how often each value appears.
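As a minimal sketch of that check, assuming a hypothetical pandas column named `color` (the column name and its values are invented for illustration), the unique values and their frequencies could be computed like this:

```python
import pandas as pd

# Hypothetical column; the values are made up for illustration.
df = pd.DataFrame({"color": ["red", "green", "blue", "red", "red", "green"]})

# Unique values: a short, stable list suggests a categorical variable.
unique_colors = sorted(df["color"].unique())
print(unique_colors)

# Frequency of each value (the numbers behind a histogram).
color_counts = df["color"].value_counts()
print(color_counts)
```

If the list of unique values keeps growing with the size of the dataset, the column is more likely free text than a categorical variable.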

**Free strings that can be semantically mapped to categories** are, for example, the responses obtained from a free-text field. It will probably be best to encode this data as a categorical variable, where you can select the categories either by using the most common entries, or by defining categories that will capture responses in a way that makes sense for your application. This kind of preprocessing of strings can take a lot of manual effort and is not easily automated. If you are in a position where you can influence data collection, we highly recommend avoiding manually entered values for concepts that are better captured using categorical variables.

**Structured string data** consists of manually entered values that do not correspond to fixed categories but still have some underlying structure, like addresses, names of places or people, dates, or telephone numbers. These kinds of strings are often very hard to parse, and their treatment is highly dependent on context and domain.

**Text data** consists of phrases or sentences, for example tweets, chat logs, and reviews. In the context of text analysis, the dataset is often called the **corpus**, and each data point, represented as a single text, is called a **document**. These terms come from the information retrieval (IR) and natural language processing (NLP) communities, which both deal mostly with text data.


### Example Application : Sentiment Analysis of Movie Reviews

This dataset contains the text of the reviews, together with a label that indicates whether a review is “positive” or “negative”. After unpacking the data, the dataset is provided as text files in two separate folders, one for the training data and one for the test data. Each of these in turn has two subfolders, one called pos and one called neg.

In [25]:
import numpy as np
import pandas as pd
import mglearn
from sklearn.exceptions import ConvergenceWarning
import warnings
# Ignore the specific warnings related to convergence
warnings.filterwarnings("ignore", category=ConvergenceWarning)

In [5]:
from sklearn.datasets import load_files
reviews_train = load_files("data/aclImdb/train/")
# load_files returns a bunch, containing training texts and training labels
text_train, y_train = reviews_train.data, reviews_train.target
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[1]:\n{}".format(text_train[1]))

type of text_train: <class 'list'>
length of text_train: 75000
text_train[1]:
b"Amount of disappointment I am getting these days seeing movies like Partner, Jhoom Barabar and now, Heyy Babyy is gonna end my habit of seeing first day shows.<br /><br />The movie is an utter disappointment because it had the potential to become a laugh riot only if the d\xc3\xa9butant director, Sajid Khan hadn't tried too many things. Only saving grace in the movie were the last thirty minutes, which were seriously funny elsewhere the movie fails miserably. First half was desperately been tried to look funny but wasn't. Next 45 minutes were emotional and looked totally artificial and illogical.<br /><br />OK, when you are out for a movie like this you don't expect much logic but all the flaws tend to appear when you don't enjoy the movie and thats the case with Heyy Babyy. Acting is good but thats not enough to keep one interested.<br /><br />For the positives, you can take hot actresses, last 30 minutes,

In [9]:
# Review contains HTML line breaks. Clean the data
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]
print("Samples per class (training): {}".format(np.bincount(y_train)))

Samples per class (training): [12500 12500 50000]
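The third class with 50,000 samples is consistent with the `unsup` folder of unlabeled reviews that ships in the `train` directory of the original dataset: `load_files` treats every subfolder as a class. The following is a hedged sketch of one way to load only the labeled reviews, using the `categories` parameter of `load_files`. The tiny folder tree built here is only a stand-in for `data/aclImdb/train/` so that the snippet runs on its own.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import load_files

# Build a tiny stand-in for data/aclImdb/train/ (illustrative files only).
root = tempfile.mkdtemp()
for folder, text in [("neg", "awful"), ("pos", "great"), ("unsup", "unknown")]:
    os.makedirs(os.path.join(root, folder))
    with open(os.path.join(root, folder, "0.txt"), "w") as f:
        f.write(text)

# Without `categories`, every subfolder becomes a class, including unsup.
all_folders = load_files(root)
print(all_folders.target_names)

# Restrict loading to the two labeled classes.
labeled = load_files(root, categories=["neg", "pos"])
print(labeled.target_names)
print(np.bincount(labeled.target))
```

With the real dataset the call would be `load_files("data/aclImdb/train/", categories=["neg", "pos"])`. The outputs shown in the rest of this section were produced with all three folders loaded.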


In [11]:
reviews_test = load_files("data/aclImdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]


Number of documents in test data: 25000
Samples per class (test): [12500 12500]


The task we want to solve is as follows: given a review, we want to assign the label “positive” or “negative” based on the text content of the review. This is a standard
binary classification task. However, the text data is not in a format that a machine learning model can handle. We need to convert the string representation of the text
into a numeric representation that we can apply our machine learning algorithms to.

## Representing Text Data as a Bag of Words

One of the simplest but most effective and commonly used ways to represent text for machine learning is the **bag-of-words representation**. When using this representation, we discard most of the structure of the input text, like chapters, paragraphs, sentences, and formatting, and **only count how often each word appears** in each text in the corpus.

Computing the bag-of-words representation for a corpus of documents consists of the following three steps:
1. **Tokenization.** Split each document into the words that appear in it (called tokens), for example by splitting them on whitespace and punctuation.
2. **Vocabulary building**. Collect a vocabulary of all words that appear in any of the documents, and number them (say, in alphabetical order).
3. **Encoding.** For each document, count how often each of the words in the vocabulary appears in this document.
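The three steps above can be sketched in plain Python. This is only an illustration under simple assumptions (lowercasing, tokens defined as runs of word characters, two made-up example sentences), not how scikit-learn implements it:

```python
import re

docs = ["The fool doth think he is wise,",
        "but the wise man knows himself to be a fool"]

# 1. Tokenization: lowercase, then extract runs of word characters,
#    which splits the text on whitespace and punctuation.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# 2. Vocabulary building: number all distinct words alphabetically.
all_words = sorted({word for tokens in tokenized for word in tokens})
vocabulary = {word: index for index, word in enumerate(all_words)}

# 3. Encoding: count how often each vocabulary word occurs per document.
encoded = []
for tokens in tokenized:
    counts = [0] * len(vocabulary)
    for word in tokens:
        counts[vocabulary[word]] += 1
    encoded.append(counts)

print(vocabulary)
print(encoded)
```

Each document becomes a vector with one entry per vocabulary word. Note that this sketch keeps the single-letter word "a", so its vocabulary has 14 entries.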


### Applying Bag of Words to a Toy Dataset

The bag-of-words representation is implemented in **CountVectorizer**, which is a transformer. 

In [14]:
bards_words =["The fool doth think he is wise,",
 "but the wise man knows himself to be a fool"]
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(bards_words)
print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))


Vocabulary size: 13
Vocabulary content:
 {'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}


In [16]:
bag_of_words = vect.transform(bards_words)
print("bag_of_words: {}".format(repr(bag_of_words)))
print("Dense representation of bag_of_words:\n{}".format(
 bag_of_words.toarray()))

bag_of_words: <2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>
Dense representation of bag_of_words:
[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]


### Bag-of-Words for Movie Reviews

In [17]:
vect = CountVectorizer().fit(text_train)
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))


X_train:
<75000x124255 sparse matrix of type '<class 'numpy.int64'>'
	with 10315542 stored elements in Compressed Sparse Row format>


In [19]:
feature_names = vect.get_feature_names_out()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

Number of features: 124255
First 20 features:
['00' '000' '0000' '0000000000000000000000000000000001' '0000000000001'
 '000000001' '000000003' '00000001' '000001745' '00001' '0001' '00015'
 '0002' '0007' '00083' '000ft' '000s' '000th' '001' '002']
Features 20010 to 20030:
['cheapen' 'cheapened' 'cheapening' 'cheapens' 'cheaper' 'cheapest'
 'cheapie' 'cheapies' 'cheapjack' 'cheaply' 'cheapness' 'cheapo'
 'cheapozoid' 'cheapquels' 'cheapskate' 'cheapskates' 'cheapy' 'chearator'
 'cheat' 'cheata']
Every 2000th feature:
['00' '_require_' 'aideed' 'announcement' 'asteroid' 'banquière'
 'besieged' 'bollwood' 'btvs' 'carboni' 'chcialbym' 'clotheth'
 'consecration' 'cringeful' 'deadness' 'devagan' 'doberman' 'duvall'
 'endocrine' 'existent' 'fetiches' 'formatted' 'garard' 'godlie' 'gumshoe'
 'heathen' 'honoré' 'immatured' 'interested' 'jewelry' 'kerchner' 'köln'
 'leydon' 'lulu' 'mardjono' 'meistersinger' 'misspells' 'mumblecore'
 'ngah' 'oedpius' 'overwhelmingly' 'penned' 'pleading' 'previlag

Before we try to improve our feature extraction, let’s obtain a quantitative measure of performance by actually building a classifier. We have the training labels stored in
y_train and the bag-of-words representation of the training data in X_train, so we can train a classifier on this data. For high-dimensional, sparse data like this, linear
models like LogisticRegression often work best.

In [26]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

Mean cross-validation accuracy: 0.67


In [27]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)
X_test = vect.transform(text_test)
print("{:.2f}".format(grid.score(X_test, y_test)))

Best cross-validation score: 0.68
Best parameters:  {'C': 0.01}
0.11


Now, let’s see if we can improve the extraction of words. The CountVectorizer extracts tokens using a regular expression. By default, the regular expression that is used is **"\b\w\w+\b"**. This means it finds all sequences of characters that consist of at least two letters or numbers (\w) and that are separated by word boundaries (\b). It does not find single-letter words, and it splits up contractions like “doesn’t” and strings like “bit.ly”, but it matches “h8ter” as a single word.
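To see what this pattern does, here is a small sketch that applies it with Python's `re` module to the examples mentioned above. The extra string "a b cd" is made up to show the single-letter case:

```python
import re

# Default token pattern of CountVectorizer, as quoted above.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

examples = ["doesn't", "bit.ly", "h8ter", "a b cd"]
tokens = {text: token_pattern.findall(text) for text in examples}
for text, found in tokens.items():
    print(text, "->", found)
```

The contraction loses its trailing "t" because a single character never matches, "bit.ly" is split at the dot, and "h8ter" survives intact because digits count as word characters.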

The CountVectorizer then converts all words to lowercase characters, so that “soon”, “Soon”, and “sOon” all correspond to the same token (and therefore feature).
This simple mechanism works quite well in practice, but as we saw earlier, we get many uninformative features (like the numbers). One way to cut back on these is to
only use tokens that appear in at least two documents (or at least five documents, and so on). A token that appears only in a single document is unlikely to appear in the test
set and is therefore not helpful. We can set the minimum number of documents a token needs to appear in with the min_df parameter.
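The effect of `min_df` is easiest to see on a tiny corpus. A minimal sketch, using two short example sentences chosen for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["The fool doth think he is wise,",
            "but the wise man knows himself to be a fool"]

# Default: every token that appears anywhere is kept.
vect_all = CountVectorizer().fit(toy_docs)
# min_df=2: keep only tokens that appear in at least two documents.
vect_min2 = CountVectorizer(min_df=2).fit(toy_docs)

print(len(vect_all.vocabulary_))
print(sorted(vect_min2.vocabulary_))
```

Only the words shared by both sentences survive the `min_df=2` filter.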

In [29]:
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))
feature_names = vect.get_feature_names_out()
print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))

X_train with min_df: <75000x44532 sparse matrix of type '<class 'numpy.int64'>'
	with 10191240 stored elements in Compressed Sparse Row format>
First 50 features:
['00' '000' '001' '007' '00am' '00pm' '00s' '01' '02' '03' '04' '05' '06'
 '07' '08' '09' '10' '100' '1000' '1001' '100k' '100th' '100x' '101'
 '101st' '102' '103' '104' '105' '106' '107' '108' '109' '10am' '10pm'
 '10s' '10th' '10x' '11' '110' '1100' '110th' '111' '112' '1138' '115'
 '116' '117' '11pm' '11th']
Features 20010 to 20030:
['inert' 'inertia' 'inescapable' 'inescapably' 'inevitability'
 'inevitable' 'inevitably' 'inexcusable' 'inexcusably' 'inexhaustible'
 'inexistent' 'inexorable' 'inexorably' 'inexpensive' 'inexperience'
 'inexperienced' 'inexplicable' 'inexplicably' 'inexpressive'
 'inextricably']
Every 700th feature:
['00' 'accountability' 'alienate' 'appetite' 'austen' 'battleground'
 'bitten' 'bowel' 'burton' 'cat' 'choreographing' 'collide' 'constipation'
 'creatively' 'dashes' 'descended' 'dishing' 'dramat

In [30]:
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.68


## Stopwords

Another way that we can **get rid of uninformative words** is by discarding words that are **too frequent to be informative**. 

There are two main approaches: using a **language-specific list of stopwords**, or **discarding words that appear too frequently**.
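The cells that follow demonstrate the first approach. For the second one, a minimal sketch is given here, assuming the `max_df` parameter of `CountVectorizer` is used for it and working on a made-up four-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus for illustration.
toy_reviews = ["the movie was great",
               "the movie was awful",
               "the plot was thin",
               "the acting was fine"]

# max_df=0.75: drop tokens that appear in more than 75% of the documents.
vect_max = CountVectorizer(max_df=0.75).fit(toy_reviews)
kept = sorted(vect_max.vocabulary_)
print(kept)
```

Here "the" and "was" occur in all four documents and are dropped, while "movie" (two documents) is kept. Unlike a fixed stopword list, this threshold adapts to the corpus, so it can also remove frequent domain-specific words.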

In [31]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print("Number of stop words: {}".format(len(ENGLISH_STOP_WORDS)))
print("Every 10th stopword:\n{}".format(list(ENGLISH_STOP_WORDS)[::10]))

Number of stop words: 318
Every 10th stopword:
['onto', 'hereupon', 'enough', 'on', 'take', 'even', 'top', 'bottom', 'herein', 'any', 'how', 'thru', 'twenty', 'show', 'so', 'here', 'cant', 'with', 'why', 'and', 'eleven', 'until', 'she', 'amoungst', 'whole', 'down', 'someone', 'one', 'becomes', 'couldnt', 'must', 'therefore']


In [32]:
# Specifying stop_words="english" uses the built-in list.
# We could also augment it and pass our own.
vect = CountVectorizer(min_df=5, stop_words="english").fit(text_train)
X_train = vect.transform(text_train)
print("X_train with stop words:\n{}".format(repr(X_train)))


X_train with stop words:
<75000x44223 sparse matrix of type '<class 'numpy.int64'>'
	with 6577418 stored elements in Compressed Sparse Row format>


In [33]:
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.71


## Rescaling the Data with tf-idf

Instead of dropping features that are deemed unimportant, another approach is to **rescale features by how informative we expect them to be**. One of the most common
ways to do this is using the **term frequency–inverse document frequency (tf–idf)** method. 

The intuition of this method is to give high weight to any term that appears often in a particular document, but not in many documents in the corpus. If a word appears often in a particular document, but not in very many documents, it is likely to be very descriptive of the content of that document.

scikit-learn implements the tf–idf method in two classes: **TfidfTransformer**, which takes in the sparse matrix output produced by CountVectorizer and transforms it, and **TfidfVectorizer**, which takes in the text data and does both the bag-of-words feature extraction and the tf–idf transformation.

The tf–idf score for word w in document d as implemented in both the TfidfTransformer and TfidfVectorizer classes is given by:

$$
\text{tfidf}(w, d) = \text{tf} \cdot \left(\log\left(\frac{N + 1}{N_w + 1}\right) + 1\right)
$$

where $N$ is the number of documents in the training set, $N_w$ is the number of documents in the training set that the word w appears in, and tf (the term frequency) is the number of times that the word w appears in the query document d (the document you want to transform or encode).
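As a worked check of this definition, the sketch below computes tf multiplied by (log((N + 1) / (N_w + 1)) + 1) by hand and compares it with `TfidfTransformer`. The count matrix is made up for illustration, and `norm=None` is passed so that no length normalization is applied afterwards:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up term counts: 3 documents (rows) x 3 words (columns).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [1, 1, 0]])

N = counts.shape[0]                    # number of documents
N_w = (counts > 0).sum(axis=0)         # documents containing each word
idf = np.log((N + 1) / (N_w + 1)) + 1  # inverse document frequency
manual = counts * idf                  # tf * (log((N+1)/(N_w+1)) + 1)

# norm=None switches off the normalization that is applied by default.
reference = TfidfTransformer(norm=None).fit_transform(counts).toarray()
matches = np.allclose(manual, reference)
print(manual)
print(matches)
```

The first word appears in every document, so its logarithm term is zero and its weight equals its raw count. The rarer words are scaled up.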

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(TfidfVectorizer(min_df=5, norm=None),
 LogisticRegression())
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

Best cross-validation score: 0.72


In [36]:
vectorizer = grid.best_estimator_.named_steps["tfidfvectorizer"]
# transform the training dataset
X_train = vectorizer.transform(text_train)
# find maximum value for each of the features over the dataset
max_value = X_train.max(axis=0).toarray().ravel()
sorted_by_tfidf = max_value.argsort()
# get feature names
feature_names = np.array(vectorizer.get_feature_names_out())
print("Features with lowest tfidf:\n{}".format(
 feature_names[sorted_by_tfidf[:20]]))
print("Features with highest tfidf: \n{}".format(
 feature_names[sorted_by_tfidf[-20:]]))


Features with lowest tfidf:
['remained' 'acclaimed' 'combines' 'rapidly' 'uniformly' 'diverse'
 'avoiding' 'fills' 'feeble' 'admired' 'wherever' 'admission' 'abound'
 'starters' 'assure' 'pivotal' 'comprehend' 'deliciously' 'strung'
 'inadvertently']
Features with highest tfidf: 
['nukie' 'reno' 'dominick' 'taz' 'ling' 'rob' 'victoria' 'turtles'
 'khouri' 'lorenzo' 'id' 'zizek' 'elwood' 'nikita' 'rishi' 'timon'
 'titanic' 'zohan' 'pammy' 'godzilla']


Features with low tf–idf are those that either are very commonly used across documents or are only used sparingly, and only in very long documents. Interestingly, many of the high-tf–idf features actually identify certain shows or movies. These words are unlikely to help us in our sentiment classification task, but they certainly contain a lot of specific information about the reviews.

In [None]:
sorted_by_idf = np.argsort(vectorizer.idf_)
print("Features with lowest idf:\n{}".format(
 feature_names[sorted_by_idf[:100]]))

As expected, these are mostly English stopwords like "the" and "no". But some are clearly domain-specific to the movie reviews, like "movie", "film", "time", "story",
and so on.

## Investigating Model Coefficients

Because there are so many features (44,532 after removing the infrequent ones), we clearly cannot look at all of the coefficients at the same time.
However, we can look at the largest coefficients and see which words these correspond to. We will use the last model that we trained, based on the tf–idf features.

In [None]:
mglearn.tools.visualize_coefficients(
 grid.best_estimator_.named_steps["logisticregression"].coef_,
 feature_names, n_top_features=40)

The negative coefficients on the left belong to words that according to the model are indicative of negative reviews, while the positive coefficients on the right belong to
words that according to the model indicate positive reviews.

## Bag-of-Words with More Than One Word (n-Grams)

One of the main disadvantages of using a bag-of-words representation is that word order is completely discarded. Therefore, the two strings “it’s bad, not good at all” and
“it’s good, not bad at all” have exactly the same representation, even though the meanings are inverted. Putting “not” in front of a word is only one example (if an extreme
one) of how context matters. 

Fortunately, there is a way of capturing context when using a bag-of-words representation, by not only considering the counts of single tokens, but also the counts of pairs or triplets of tokens that appear next to each other.

Pairs of tokens are known as **bigrams**, triplets of tokens are known as **trigrams**, and more generally sequences of tokens are known as **n-grams**. We can change the range
of tokens that are considered as features by changing the ngram_range parameter of CountVectorizer or TfidfVectorizer.

In [None]:
#Unigrams
print("bards_words:\n{}".format(bards_words))
cv = CountVectorizer(ngram_range=(1, 1)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))


In [None]:
#bigrams
cv = CountVectorizer(ngram_range=(2, 2)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))


In [None]:
print("Transformed data (dense):\n{}".format(cv.transform(bards_words).toarray()))

For most applications, the minimum number of tokens should be one, as single words often capture a lot of meaning. Adding bigrams helps in most cases. Adding
longer sequences—up to 5-grams—might help too, but this will lead to an explosion of the number of features and might lead to overfitting, as there will be many very
specific features. In principle, the number of bigrams could be the number of unigrams squared and the number of trigrams could be the number of unigrams to
the power of three, leading to very large feature spaces. 

In [None]:
cv = CountVectorizer(ngram_range=(1, 3)).fit(bards_words)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))

In [None]:
pipe = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression())
# running the grid search takes a long time because of the
# relatively large grid and the inclusion of trigrams
param_grid = {"logisticregression__C": [0.001, 0.01, 0.1, 1, 10, 100],
 "tfidfvectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(text_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters:\n{}".format(grid.best_params_))

In [None]:
import matplotlib.pyplot as plt
# extract scores from grid_search
scores = grid.cv_results_['mean_test_score'].reshape(-1, 3).T
# visualize heat map
heatmap = mglearn.tools.heatmap(
 scores, xlabel="C", ylabel="ngram_range", cmap="viridis", fmt="%.3f",
 xticklabels=param_grid['logisticregression__C'],
 yticklabels=param_grid['tfidfvectorizer__ngram_range'])
plt.colorbar(heatmap)

In [None]:
# extract feature names and coefficients
vect = grid.best_estimator_.named_steps['tfidfvectorizer']
feature_names = np.array(vect.get_feature_names_out())
coef = grid.best_estimator_.named_steps['logisticregression'].coef_
mglearn.tools.visualize_coefficients(coef, feature_names, n_top_features=40)

There are particularly interesting features containing the word “worth” that were not
present in the unigram model: "not worth" is indicative of a negative review, while
"definitely worth" and "well worth" are indicative of a positive review. 

In [None]:
# find 3-gram features
mask = np.array([len(feature.split(" ")) for feature in feature_names]) == 3
# visualize only 3-gram features
mglearn.tools.visualize_coefficients(coef.ravel()[mask],
 feature_names[mask], n_top_features=40)

## Advanced Tokenization, Stemming, and Lemmatization

One particular step that is often improved in more sophisticated text-processing applications is the first step in the bag-of-words model: **tokenization**. This step defines what
constitutes a word for the purpose of feature extraction.

We saw earlier that the vocabulary often contains singular and plural versions of some words. For the purposes of a bag-of-words model, the semantics
of "drawback" and "drawbacks" are so close that distinguishing them will only increase overfitting, and not allow the model to fully exploit the training data. Similarly to having singular and plural forms of a noun, treating different verb forms and related words as distinct tokens is disadvantageous for building a model that generalizes well.

This problem can be overcome by representing each word using its word **stem**, which involves identifying (or conflating) all the words that have the same word stem. If this
is done by using a rule-based heuristic, like dropping common suffixes, it is usually referred to as **stemming**. 

If instead a dictionary of known word forms is used (**an explicit and human-verified system**), and the role of the word in the sentence is taken into account, the process is referred to as **lemmatization** and the standardized form of the word is referred to as the **lemma**. 

Both processing methods, lemmatization and stemming, are forms of normalization that try to extract some normal form of a word.

In [None]:
import spacy
import nltk
# load spacy's English-language model
# (assumes spaCy 3 with the en_core_web_sm model installed)
en_nlp = spacy.load('en_core_web_sm')
# instantiate nltk's Porter stemmer
stemmer = nltk.stem.PorterStemmer()
# define function to compare lemmatization in spacy with stemming in nltk
def compare_normalization(doc):
 # tokenize document in spacy
 doc_spacy = en_nlp(doc)
 # print lemmas found by spacy
 print("Lemmatization:")
 print([token.lemma_ for token in doc_spacy])
 # print tokens found by Porter stemmer
 print("Stemming:")
 print([stemmer.stem(token.norm_.lower()) for token in doc_spacy])

In [None]:
compare_normalization(u"Our meeting today was worse than yesterday, "
 "I'm scared of meeting the clients tomorrow.")

In general, lemmatization is a much more involved process than stemming, but it usually produces better results than stemming when used for normalizing tokens for machine learning.

While scikit-learn implements neither form of normalization, CountVectorizer
allows specifying your own tokenizer to convert each document into a list of tokens
using the tokenizer parameter. We can use the lemmatization from spacy to create a
callable that will take a string and produce a list of lemmas.

In [None]:
# Technicality: we want to use the regexp-based tokenizer
# that is used by CountVectorizer and only use the lemmatization
# from spacy. To this end, we replace en_nlp.tokenizer (the spacy tokenizer)
# with the regexp-based tokenization.
# Note: this assumes spaCy 3 with the en_core_web_sm model installed.
import re
from spacy.tokens import Doc
# regexp used in CountVectorizer
regexp = re.compile('(?u)\\b\\w\\w+\\b')
# load spacy language model without the parser and entity recognizer,
# which are not needed for lemmatization
en_nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# replace the tokenizer with the preceding regexp
en_nlp.tokenizer = lambda string: Doc(en_nlp.vocab,
 words=regexp.findall(string))
# create a custom tokenizer using the spacy document processing pipeline
# (now using our own tokenizer)
def custom_tokenizer(document):
 doc_spacy = en_nlp(document)
 return [token.lemma_ for token in doc_spacy]
# define a count vectorizer with the custom tokenizer
lemma_vect = CountVectorizer(tokenizer=custom_tokenizer, min_df=5)

# transform text_train using CountVectorizer with lemmatization
X_train_lemma = lemma_vect.fit_transform(text_train)
print("X_train_lemma.shape: {}".format(X_train_lemma.shape))
# standard CountVectorizer for reference
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train.shape: {}".format(X_train.shape))


In [None]:
# build a grid search using only 1% of the data as the training set
from sklearn.model_selection import StratifiedShuffleSplit
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.99,
 train_size=0.01, random_state=0)
grid = GridSearchCV(LogisticRegression(), param_grid, cv=cv)
# perform grid search with standard CountVectorizer
grid.fit(X_train, y_train)
print("Best cross-validation score "
 "(standard CountVectorizer): {:.3f}".format(grid.best_score_))
# perform grid search with lemmatization
grid.fit(X_train_lemma, y_train)
print("Best cross-validation score "
 "(lemmatization): {:.3f}".format(grid.best_score_))

## Topic Modeling and Document Clustering

One particular technique that is often applied to text data is **topic modeling**, which is an umbrella term describing the task of assigning each document to one or multiple topics, usually without supervision.

Often, when people talk about topic modeling, they refer to one particular decomposition method called **Latent Dirichlet Allocation (often LDA for short)**.

### Latent Dirichlet Allocation

Intuitively, the LDA model tries to find groups of words (the topics) that appear together frequently. LDA also requires that each document can be understood as a “mixture” of a subset of the topics. It is important to understand that for the machine learning model a “topic” might not be what we would normally call a topic in everyday speech, but that it resembles more the components extracted by PCA or NMF. Even if there is a semantic meaning for an LDA “topic”, it might not be something we’d usually call a topic.

In [None]:
vect = CountVectorizer(max_features=10000, max_df=.15)
X = vect.fit_transform(text_train)
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, learning_method="batch",
 max_iter=25, random_state=0)
# We build the model and transform the data in one step
# Computing transform takes some time,
# and we can save time by doing both at  once
document_topics = lda.fit_transform(X)

In [None]:
# For each topic (a row in the components_), sort the features (ascending)
# Invert rows with [:, ::-1] to make sorting descending
sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
# Get the feature names from the vectorizer
feature_names = np.array(vect.get_feature_names_out())
# Print out the 10 topics:
mglearn.tools.print_topics(topics=range(10), feature_names=feature_names,
 sorting=sorting, topics_per_chunk=5, n_words=10)

Judging from the important words, topic 1 seems to be about historical and war movies, topic 2 might be about bad comedies, and topic 3 might be about TV series.

In [None]:
lda100 = LatentDirichletAllocation(n_components=100, learning_method="batch",
 max_iter=25, random_state=0)
document_topics100 = lda100.fit_transform(X)
topics = np.array([7, 16, 24, 25, 28, 36, 37, 45, 51, 53, 54, 63, 89, 97])
sorting = np.argsort(lda100.components_, axis=1)[:, ::-1]
feature_names = np.array(vect.get_feature_names_out())
mglearn.tools.print_topics(topics=topics, feature_names=feature_names,
 sorting=sorting, topics_per_chunk=7, n_words=20)

The topics we extracted this time seem to be more specific, though many are hard to
interpret.

For example, topic 45 seems to be about music. Let’s check which kinds of
reviews are assigned to this topic.

In [None]:
# sort by weight of "music" topic 45
music = np.argsort(document_topics100[:, 45])[::-1]
# print the ten documents where the topic is most important
for i in music[:10]:
 # show first two sentences
 print(b".".join(text_train[i].split(b".")[:2]) + b".\n")

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 10))
topic_names = ["{:>2} ".format(i) + " ".join(words)
 for i, words in enumerate(feature_names[sorting[:, :2]])]
# two column bar chart:
for col in [0, 1]:
 start = col * 50
 end = (col + 1) * 50
 ax[col].barh(np.arange(50), np.sum(document_topics100, axis=0)[start:end])
 ax[col].set_yticks(np.arange(50))
 ax[col].set_yticklabels(topic_names[start:end], ha="left", va="top")
 ax[col].invert_yaxis()
 ax[col].set_xlim(0, 2000)
 yax = ax[col].get_yaxis()
 yax.set_tick_params(pad=130)
plt.tight_layout()

Topic models like LDA are interesting methods to understand large text corpora in the absence of labels (or, as here, even if labels are available). The LDA algorithm is randomized, though, and changing the random_state parameter can lead to quite different outcomes. While identifying topics can be helpful, any conclusions you draw from an unsupervised model should be taken with a grain of salt, and we recommend verifying your intuition by looking at the documents in a specific topic. The topics produced by the transform method of LDA can also sometimes be used as a compact representation for supervised learning.
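A minimal sketch of that last idea, assuming a pipeline that feeds the document-topic weights from LDA into a classifier. The six documents, their labels, and the choice of two topics are made up for illustration, and a corpus this small says nothing about how well the approach works in practice:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up corpus and labels for illustration.
toy_docs = ["great songs and great music", "the band plays loud music",
            "music and songs all night", "a war film about soldiers",
            "soldiers fight in the war", "a long film about the war"]
toy_labels = [1, 1, 1, 0, 0, 0]

# Counts -> topic weights (2 features per document) -> classifier.
topic_pipe = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LogisticRegression())
topic_pipe.fit(toy_docs, toy_labels)

# Inspect the compact representation the classifier actually sees.
topic_features = topic_pipe[:-1].transform(toy_docs)
print(topic_features.shape)
print(topic_pipe.predict(toy_docs))
```

The classifier now works with one feature per topic instead of one feature per word, which is the compactness referred to above.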