# Learning to Classify Text

With the data we gathered in the previous section, we can now build our text
analysis service. At its core, the service will use a classifier similar to the one we
created earlier, in chapter 2.2. The main difference is that the input for this
classifier won’t be a single word, but an entire block of text.

This means we can use words as features; more precisely, we can use word
counts as features. We will then train a classifier to learn the correlations
between the word counts and the categories for us. The key observation is that
each category tends to contain certain words with a certain frequency.

## Text Feature Extractor

Remember that in the previous chapter we transformed a word, a name to be more
specific, into a feature vector. Now, we will perform a similar operation on a block of
text. Let’s take a random sentence and transform it into the feature space:

>How much wood does a woodchuck chuck if a woodchuck could chuck wood

This is how this sentence would look transformed into the feature space:

**Tongue Twister to Feature Space**

```python

{
'wood': 2,
'a': 2,
'woodchuck': 2,
'chuck': 2,
'how': 1,
'much': 1,
'does': 1,
'if': 1,
'could': 1
}
```

Here’s how simple the transformation is:

**Convert Text to Dict**

In [2]:
import collections

text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""
collections.Counter(text.lower().split())

Counter({'how': 1,
         'much': 1,
         'wood': 2,
         'does': 1,
         'a': 2,
         'woodchuck': 2,
         'chuck': 2,
         'if': 1,
         'could': 1})

One thing that might throw you off at this point is that this method doesn’t take
into consideration the order of the words inside a text. This is one of the known
drawbacks of this method. Using word counts is an approximation, and this type of
approximation is called **Bag of Words**.
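
To see the loss of word order concretely, here is a minimal sketch. The helper `bag_of_words` and the two example sentences are made up for illustration; the helper uses the same lowercase-and-split approach as the cell above.

```python
import collections


def bag_of_words(text):
    """Lowercase the text, split it on whitespace, and count each token."""
    return collections.Counter(text.lower().split())


first = "the dog bit the man"
second = "the man bit the dog"

# The two sentences mean different things, but they contain the same words
# the same number of times, so their Bag of Words features are identical.
same_features = bag_of_words(first) == bag_of_words(second)
print(same_features)
```

A classifier that only sees these counts has no way to tell the two sentences apart.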


There are methods that deal with actual sequences, such as Hidden Markov Models
or Recurrent Neural Networks, but these types of methods are not the subject
of this book.


Bag of Words models are really popular and widely used because they are simple
and can perform pretty well. We can get even better approximations of the
sequence using bigram or trigram models.

As we discussed in the first chapters, a bigram is a pair of adjacent words inside a
sentence and a trigram is a triplet of such words. Here’s how to compute them for
a given text using nltk utils:

**Compute Bigram and Trigram Features**

In [3]:
import nltk
import collections
from pprint import pprint

text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""

bigram_features = collections.Counter(
    list(nltk.bigrams(text.lower().split())))

trigram_features = collections.Counter(
    list(nltk.trigrams(text.lower().split())))

In [4]:
pprint(bigram_features)

Counter({('a', 'woodchuck'): 2,
         ('how', 'much'): 1,
         ('much', 'wood'): 1,
         ('wood', 'does'): 1,
         ('does', 'a'): 1,
         ('woodchuck', 'chuck'): 1,
         ('chuck', 'if'): 1,
         ('if', 'a'): 1,
         ('woodchuck', 'could'): 1,
         ('could', 'chuck'): 1,
         ('chuck', 'wood'): 1})


In [5]:
pprint(trigram_features)

Counter({('how', 'much', 'wood'): 1,
         ('much', 'wood', 'does'): 1,
         ('wood', 'does', 'a'): 1,
         ('does', 'a', 'woodchuck'): 1,
         ('a', 'woodchuck', 'chuck'): 1,
         ('woodchuck', 'chuck', 'if'): 1,
         ('chuck', 'if', 'a'): 1,
         ('if', 'a', 'woodchuck'): 1,
         ('a', 'woodchuck', 'could'): 1,
         ('woodchuck', 'could', 'chuck'): 1,
         ('could', 'chuck', 'wood'): 1})
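
If you are curious about what happens under the hood, n-grams can also be built without nltk by sliding a window over the token list. This is only a sketch of the idea, not nltk's actual implementation, and the helper name `ngrams` is chosen here for illustration.

```python
import collections


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # zip stops at the shortest shifted copy, so no padding is needed
    return list(zip(*(tokens[i:] for i in range(n))))


text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""
tokens = text.lower().split()

bigram_counts = collections.Counter(ngrams(tokens, 2))
trigram_counts = collections.Counter(ngrams(tokens, 3))

print(bigram_counts[('a', 'woodchuck')])
print(len(bigram_counts), len(trigram_counts))
```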


Something to keep in mind before we move forward: using bigrams and trigrams
makes the feature space much larger because of the combinatorial explosion, meaning:


- The vocabulary size is $|V|=N$
- The size of feature space of the Bag Of Words model is $N$
- The size of feature space of the Bigram model is at most $N^2$
- The size of feature space of the Trigram model is at most $N^3$
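
As a rough illustration of these bounds, the sketch below compares the number of distinct n-grams that actually occur in the tongue twister with the theoretical maximum. The variable names are made up for this example. In a single short sentence the observed counts stay far below the bounds; the bounds matter because, over a large corpus, the number of distinct n-grams can keep growing toward them.

```python
text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""
tokens = text.lower().split()

vocabulary_size = len(set(tokens))                    # N
observed_bigrams = len(set(zip(tokens, tokens[1:])))
observed_trigrams = len(set(zip(tokens, tokens[1:], tokens[2:])))

# Upper bounds on the feature space: N, N^2 and N^3
max_bigrams = vocabulary_size ** 2
max_trigrams = vocabulary_size ** 3

print("unigrams:", vocabulary_size)
print("bigrams: ", observed_bigrams, "of at most", max_bigrams)
print("trigrams:", observed_trigrams, "of at most", max_trigrams)
```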



## Scikit-Learn Feature Extraction

Now that we’ve covered how a text gets transformed to features, we can move on to
how this is actually done in practice. Scikit-Learn has special vectorizers for dealing
with text that come in handy.

In [7]:
#Scikit-Learn CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""
vectorizer = CountVectorizer(lowercase=True)
# "train" the vectorizer, aka compute the vocabulary
vectorizer.fit([text])
# transform text to features
print(vectorizer.transform([text]))

  (0, 0)	2
  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 5)	1
  (0, 6)	2
  (0, 7)	2


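Analyzing the results, you will notice that the right column holds the word counts,
which match the values we computed earlier with `Counter`, while the left column
holds the indices `(sample_index, word_index_in_vocabulary)`. There is one
difference: the word "a" is missing, because the default token pattern of
`CountVectorizer` only keeps tokens of two or more characters. That is why we get
8 entries instead of 9.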
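
To check which word each column index stands for, we can look at the `vocabulary_` attribute of the fitted vectorizer, which maps every word to its column. The following is a small sketch; the names `index_to_word` and `word_counts` are introduced here only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""

vectorizer = CountVectorizer(lowercase=True)
counts = vectorizer.fit_transform([text])

# vocabulary_ maps word -> column index; invert it to read the matrix
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# Convert the single sparse row to a dense array and pair counts with words
row = counts.toarray()[0]
word_counts = {index_to_word[i]: int(row[i]) for i in range(len(row))}
print(word_counts)

# Single-character tokens such as 'a' are skipped by the default token pattern
print('a' in vectorizer.vocabulary_)
```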

Scikit-Learn works mainly with matrices and almost all components require a
matrix as input. The vectorizer transforms a list of texts (notice how both `fit`
and `transform` get a list of texts as input) into a matrix of size
`(sample_count, vocabulary_size)`. The purpose of the `fit` method is to compute the
vocabulary (and thus the vocabulary size) by collecting the different words we have.
Let’s find out what happens when we use words outside the vocabulary:

**Scikit-Learn CountVectorizer**

In [8]:
result = vectorizer.transform(["Unseen words", "BLT sandwich"])
print(type(result), result, result.shape)

<class 'scipy.sparse.csr.csr_matrix'>  (2, 8)


As you can see, we get a matrix of shape `(2, 8)` with all values set to 0. That means
none of the known features (words) have been detected. Because the result is a
sparse matrix, only non-zero entries are printed, and that’s why our print
statement doesn’t output anything for it.
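
To make the zeros visible, we can convert the sparse result to a dense array with `toarray()`. The sketch below also adds a third sample, "Wood sandwich", which is made up here to show that unknown words are simply ignored while known ones are still counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = """
How much wood does a woodchuck chuck
if a woodchuck could chuck wood
"""

vectorizer = CountVectorizer(lowercase=True)
vectorizer.fit([text])

result = vectorizer.transform(
    ["Unseen words", "BLT sandwich", "Wood sandwich"])

# A sparse matrix only stores non-zero entries; toarray() shows every cell
dense = result.toarray()
print(dense)

# nnz is the number of stored non-zero values in the sparse matrix
print(result.nnz)
```

Only the last row contains a non-zero value, in the column for "wood".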


## Text Classification with Naive Bayes

In my opinion, understanding how vectorizers work on text is probably the hardest
part of text classification. By now we have covered all the essentials, so we’re
ready to read the data and train our classifier.

**Scikit-Learn MultinomialNB**

In [10]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

In [11]:
# Let's print a snippet of the documentation for Naive Bayes classifier
print(MultinomialNB.__doc__[:415])


    Naive Bayes classifier for multinomial models

    The multinomial Naive Bayes classifier is suitable for classification with
    discrete features (e.g., word counts for text classification). The
    multinomial distribution normally requires integer feature counts. However,
    in practice, fractional counts such as tf-idf may also work.

    Read more in the :ref:`User Guide <multinomial_naive_bayes>`.




In [12]:
# Remember this is where we saved all the data we crawled previously
data = pd.read_csv('./text_analysis_data.csv')


In [9]:
# Where we keep the actual texts and their categories
text_samples, labels = [], []
for idx, row in data.iterrows():
    with open('./clean_data/{0}'.format(row['file_name']), 'r') as text_file:
        text = text_file.read()
    text_samples.append(text)
    labels.append(row['category'])

# Split the data for training and for testing and shuffle it:
# keep 20% for testing, and use the rest for training.
# Shuffling is important because the classes are not randomly ordered in our dataset
X_train, X_test, y_train, y_test = train_test_split(
    text_samples, labels, test_size=0.2, shuffle=True)

vectorizer = CountVectorizer(lowercase=True)
# Compute the vocabulary using only the training data
vectorizer.fit(X_train)
# Transform the text list to a matrix form
X_train_vectorized = vectorizer.transform(X_train)

classifier = MultinomialNB()

In [None]:
# Train the classifier
classifier.fit(X_train_vectorized, y_train)
# Vectorize the test data
X_test_vectorized = vectorizer.transform(X_test)
# Check our classifier performance
score = classifier.score(X_test_vectorized, y_test)
print("Accuracy=", score)
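
If you don't have the crawled dataset at hand, the same pipeline can be tried on a few in-memory sentences. The following is only a minimal sketch: the four training texts and the labels "sports" and "cooking" are invented for illustration and are far too few for a real classifier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, made up for this example
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "bake the cake in a hot oven",
    "stir the sauce and add salt",
]
train_labels = ["sports", "sports", "cooking", "cooking"]

# Compute the vocabulary and vectorize the training texts in one step
vectorizer = CountVectorizer(lowercase=True)
X_train_vectorized = vectorizer.fit_transform(train_texts)

classifier = MultinomialNB()
classifier.fit(X_train_vectorized, train_labels)

# New texts must go through the same fitted vectorizer
new_texts = ["a goal in the football match", "add salt to the cake"]
predictions = classifier.predict(vectorizer.transform(new_texts))
print(list(zip(new_texts, predictions)))
```

Note that the word "to" in the second new text is not in the vocabulary, so it does not influence the prediction.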