# Text analysis (part 2)

In this notebook, we will:

- Preprocess text.
- Extract features.
- Build a couple of types of classification models.
- Examine the performance of those models.
- Use topic modeling to find emergent topics.

In [1]:
from textblob import TextBlob, Word

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jkiley/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Preprocess text

There are a number of common ways to preprocess text for use in machine learning and other text analysis models.
While these things are often helpful, feel free to experiment with your own models and text corpora.


Note that we will look at some of this functionality in TextBlob, though, as we will see, we often use scikit-learn's tools for these tasks.
However, TextBlob makes it easy to see how these tools work.
We will look at the following steps:

- lower case
- punctuation removed
- POS tagging
- lemmatization
- n-grams
- stop words removed

In [2]:
example_text_1 = ('Ultimately, we want to turn our text into a matrix that '
                 'gives the algorithm information to categorize text. That '
                 'is more difficult if we miss the same words due to case, '
                 'punctuation, or common words that don\'t help predict. '
                 'So, we can clean our text to potentially make our '
                 'predictions better.')
example_text_1

"Ultimately, we want to turn our text into a matrix that gives the algorithm information to categorize text. That is more difficult if we miss the same words due to case, punctuation, or common words that don't help predict. So, we can clean our text to potentially make our predictions better."

In [3]:
e_blob_1 = TextBlob(example_text_1)
e_blob_1.word_counts

defaultdict(int,
            {'ultimately': 1,
             'we': 3,
             'want': 1,
             'to': 4,
             'turn': 1,
             'our': 3,
             'text': 3,
             'into': 1,
             'a': 1,
             'matrix': 1,
             'that': 3,
             'gives': 1,
             'the': 2,
             'algorithm': 1,
             'information': 1,
             'categorize': 1,
             'is': 1,
             'more': 1,
             'difficult': 1,
             'if': 1,
             'miss': 1,
             'same': 1,
             'words': 2,
             'due': 1,
             'case': 1,
             'punctuation': 1,
             'or': 1,
             'common': 1,
             'do': 1,
             "n't": 1,
             'help': 1,
             'predict': 1,
             'so': 1,
             'can': 1,
             'clean': 1,
             'potentially': 1,
             'make': 1,
             'predictions': 1,
             'better': 1})

Notice a few things about the dictionary above.

1. This text has been **tokenized**, meaning that it has been split into tokens that have meaning (words in this case).
1. TextBlob makes the words lowercase before counting them. The word "that" appears in the original text both capitalized and lower case. Lowercasing is perhaps the most common transformation of all, so it is not surprising that TextBlob does it for us automatically.
1. The punctuation has been removed. That's not always something we will want, but it is quite helpful in most cases.
1. The word "don't" was split into `'do'` and ``"n't"``. The tokenizer is smart enough to separate it so that the negation is captured separately.
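
The first three points can be imitated in a few lines of plain Python. This is only a rough sketch: the regular expression and the `simple_word_counts` helper are assumptions made for illustration, not TextBlob's tokenizer, and unlike TextBlob this version keeps `don't` as a single token.

```python
import re
from collections import Counter


def simple_word_counts(text):
    """Lowercase the text, drop punctuation, and count the remaining tokens."""
    # Keep runs of letters, optionally with one internal apostrophe (don't).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


counts = simple_word_counts("That is text. We clean that text, so don't worry.")
print(counts.most_common(3))
```

Because the text is lowercased before matching, `That` and `that` land in the same bucket, and punctuation never becomes a token.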

Many times, we would like to consider parts of speech, and there are quite good models for finding this information for words.
textblob has this functionality built in.
For some tasks, it can be helpful to treat words used as different parts of speech as different words.
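
One way to do that is to fuse each word with its tag, so the same spelling used as different parts of speech becomes different features. The sketch below assumes a list of `(word, tag)` pairs shaped like TextBlob's `.tags` output; the pairs are hand-written for illustration, and the `word_TAG` naming is an arbitrary choice.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape returned by TextBlob's .tags.
tagged = [('we', 'PRP'), ('text', 'VBP'), ('them', 'PRP'),
          ('the', 'DT'), ('text', 'NN'), ('is', 'VBZ'), ('short', 'JJ')]

# Fuse word and tag so 'text' as a verb and as a noun are counted separately.
pos_tokens = [f'{word.lower()}_{tag}' for word, tag in tagged]
pos_counts = Counter(pos_tokens)
print(pos_counts)
```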

In [4]:
# Use slicing to look at the first ten.
e_blob_1.tags[:10]

[('Ultimately', 'RB'),
 ('we', 'PRP'),
 ('want', 'VBP'),
 ('to', 'TO'),
 ('turn', 'VB'),
 ('our', 'PRP$'),
 ('text', 'NN'),
 ('into', 'IN'),
 ('a', 'DT'),
 ('matrix', 'NN')]

Similarly, we may want to reduce words to their base or **lemmatized** form in order to construct better counts.

In [5]:
Word('learning')

'learning'

In [6]:
# We tell the lemmatize method the part of speech.
Word('learning').lemmatize('v')

'learn'

Another common transformation is using more than one word at a time to capture context.
These multi-word groups are called **n-grams**.
We do have to be careful here, as the dimensionality (and, thus, computational intensity) grows very quickly.

**Note:** we would typically add the n-grams to the single words as features.
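
As a plain-Python sketch of that idea, the hypothetical `ngrams` helper below (not a TextBlob or scikit-learn function) builds adjacent word pairs and appends them to the single words, which is roughly what an n-gram range of (1, 2) means.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ['we', 'can', 'clean', 'our', 'text']
# Features for an n-gram range of (1, 2): single words plus adjacent pairs.
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
```

Five tokens already yield nine features here, which hints at how quickly the feature space grows on a real corpus.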

In [7]:
print(f'Length of words alone:  {len(e_blob_1.word_counts)}')
print(f'Length of n-grams of 2: {len(e_blob_1.ngrams(2))}')

Length of words alone:  39
Length of n-grams of 2: 51


In [8]:
e_blob_1.words.lower()

WordList(['ultimately', 'we', 'want', 'to', 'turn', 'our', 'text', 'into', 'a', 'matrix', 'that', 'gives', 'the', 'algorithm', 'information', 'to', 'categorize', 'text', 'that', 'is', 'more', 'difficult', 'if', 'we', 'miss', 'the', 'same', 'words', 'due', 'to', 'case', 'punctuation', 'or', 'common', 'words', 'that', 'do', "n't", 'help', 'predict', 'so', 'we', 'can', 'clean', 'our', 'text', 'to', 'potentially', 'make', 'our', 'predictions', 'better'])

In [9]:
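# Build the stop word set once rather than on every loop iteration.
english_stops = set(stopwords.words('english'))
e_blob_1_stop = [w for w in e_blob_1.words.lower()
                 if w not in english_stops]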
e_blob_1_stop

['ultimately',
 'want',
 'turn',
 'text',
 'matrix',
 'gives',
 'algorithm',
 'information',
 'categorize',
 'text',
 'difficult',
 'miss',
 'words',
 'due',
 'case',
 'punctuation',
 'common',
 'words',
 "n't",
 'help',
 'predict',
 'clean',
 'text',
 'potentially',
 'make',
 'predictions',
 'better']

# Feature extraction

We're going to use a built-in dataset of `sklearn` as an example for expediency, though the patterns we will see are generally quite common.

In [10]:
_NEWS_CATS = ['comp.sys.mac.hardware',
              'sci.electronics',
              'rec.sport.baseball',
              'rec.sport.hockey']
news_train = fetch_20newsgroups(subset='train',
                                remove=('headers', 'footers', 'quotes'),
                                categories=_NEWS_CATS)
news_test = fetch_20newsgroups(subset='test',
                               remove=('headers', 'footers', 'quotes'),
                               categories=_NEWS_CATS)

In [11]:
# Nice description that comes with this dataset.
# You can uncomment and run it yourself if you like.
# print(news_test['DESCR'])

`sklearn`'s text utilities do a lot of feature extraction for us relatively easily.
We will look at them in a few examples.

In [12]:
# Let's look at the defaults.
test_cv = CountVectorizer()
test_cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Note a few things:

1. By default, `lowercase=True`. As we discussed before, this is a transform that is nearly universal.
1. The default is `ngram_range=(1, 1)` (single words only), but we can widen the range to include n-grams.
1. It can filter stop words, but that is off by default. As the [documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#stop-words) notes, there are reasons to be cautious about stop word lists.
1. If we want to override the built-in behavior, it allows us to pass in our own functions for the `preprocessor` and `tokenizer` arguments.
1. Note that we do not have POS tagging built-in, but we could preprocess the text ourselves to feed in data with tags.
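
To illustrate the `tokenizer` argument, here is a minimal sketch that swaps in a whitespace-only tokenizer. The `whitespace_tokenizer` function and the two short documents are made up for this example; the point is only that hyphenated words such as `built-in` survive as one token, whereas the default pattern splits them.

```python
from sklearn.feature_extraction.text import CountVectorizer


def whitespace_tokenizer(text):
    """Hypothetical tokenizer: split on whitespace only, keeping hyphens."""
    return text.split()


# token_pattern=None signals that the default regex is intentionally unused.
custom_cv = CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)
custom_cv.fit(['the built-in behavior', 'override the built-in tokenizer'])
print(sorted(custom_cv.vocabulary_))
```

Lowercasing still happens before the custom tokenizer runs, because it belongs to the preprocessing step rather than to tokenization.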

Let's see some output.

In [13]:
test_sentences = ['If we want to override the built-in behavior, '
                  'it allows us to pass in our own functions for the '
                  ' preprocessor and tokenizer arguments.',
                  'Note that we do not have POS tagging built-in, '
                  'but we could preprocess the '
                  'text ourselves to feed in data with tags.']
test_sent_vec = test_cv.fit_transform(test_sentences)
# On scikit-learn 1.0+, use get_feature_names_out(); get_feature_names() was removed in 1.2.
print(test_cv.get_feature_names())
print(test_sent_vec.toarray())

['allows', 'and', 'arguments', 'behavior', 'built', 'but', 'could', 'data', 'do', 'feed', 'for', 'functions', 'have', 'if', 'in', 'it', 'not', 'note', 'our', 'ourselves', 'override', 'own', 'pass', 'pos', 'preprocess', 'preprocessor', 'tagging', 'tags', 'text', 'that', 'the', 'to', 'tokenizer', 'us', 'want', 'we', 'with']
[[1 1 1 1 1 0 0 0 0 0 1 1 0 1 2 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 2 2 1 1 1 1
  0]
 [0 0 0 0 1 1 1 1 1 1 0 0 1 0 2 0 1 1 0 1 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 2
  1]]


In [14]:
# Let's see what happens with n-grams of 2.
test_cv_2 = CountVectorizer(ngram_range=(1, 2))
test_sent_vec_2 = test_cv_2.fit_transform(test_sentences)
# As before, newer scikit-learn versions use get_feature_names_out() here.
print(test_cv_2.get_feature_names())
print(test_sent_vec_2.toarray())

['allows', 'allows us', 'and', 'and tokenizer', 'arguments', 'behavior', 'behavior it', 'built', 'built in', 'but', 'but we', 'could', 'could preprocess', 'data', 'data with', 'do', 'do not', 'feed', 'feed in', 'for', 'for the', 'functions', 'functions for', 'have', 'have pos', 'if', 'if we', 'in', 'in behavior', 'in but', 'in data', 'in our', 'it', 'it allows', 'not', 'not have', 'note', 'note that', 'our', 'our own', 'ourselves', 'ourselves to', 'override', 'override the', 'own', 'own functions', 'pass', 'pass in', 'pos', 'pos tagging', 'preprocess', 'preprocess the', 'preprocessor', 'preprocessor and', 'tagging', 'tagging built', 'tags', 'text', 'text ourselves', 'that', 'that we', 'the', 'the built', 'the preprocessor', 'the text', 'to', 'to feed', 'to override', 'to pass', 'tokenizer', 'tokenizer arguments', 'us', 'us to', 'want', 'want to', 'we', 'we could', 'we do', 'we want', 'with', 'with tags']
[[1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 1 2 1 0 0 1 1 1 0 0
  0 0 1 

One issue that may seem obvious from our discussion of stop words earlier is that some words don't do a lot for us in terms of prediction.
Another strategy for dealing with that issue is weighting terms such that those that are less frequent receive a higher weight and vice versa.
We call this **term frequency times inverse document frequency** or tf-idf.

Another issue you may have thought of is that we're using raw counts above.
Longer documents will naturally have higher counts, so we can normalize those values if we choose (like the example below).
It is not that important for our examples, but some models are sensitive to that.

In [15]:
# Again, let's look at it.
test_tt_1 = TfidfTransformer()
test_tt_1

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

We can see that, by default, it both normalizes and uses idf, but we can change those arguments if we choose.

In [16]:
test_sent_tfidf_1 = test_tt_1.fit_transform(test_sent_vec_2.toarray())
print(test_sent_tfidf_1.toarray())

[[0.14809752 0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.14809752 0.1053726  0.1053726  0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.14809752 0.14809752 0.14809752 0.14809752 0.
  0.         0.14809752 0.14809752 0.2107452  0.14809752 0.
  0.         0.14809752 0.14809752 0.14809752 0.         0.
  0.         0.         0.14809752 0.14809752 0.         0.
  0.14809752 0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.         0.         0.         0.         0.14809752 0.14809752
  0.         0.         0.         0.         0.         0.
  0.         0.2107452  0.14809752 0.14809752 0.         0.2107452
  0.         0.14809752 0.14809752 0.14809752 0.14809752 0.14809752
  0.14809752 0.14809752 0.14809752 0.1053726  0.         0.
  0.14809752 0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.10840958 0.10840958 0.15236588 0.15236588 0.15236588
  0.15236588 0.15
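
To make the numbers above less mysterious, here is a hand computation of the same weighting on a tiny made-up count matrix. It is a sketch of the default settings (`smooth_idf=True`, `norm='l2'`) as I understand them, with a smoothed idf of ln((1 + n) / (1 + df)) + 1; the three-term matrix is an assumption chosen to keep the arithmetic visible.

```python
import math

# Raw term counts for two tiny documents over a three-term vocabulary.
# Columns: a term in both documents, a term only in doc 0, a term only in doc 1.
counts = [[2, 1, 0],
          [1, 0, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    # L2-normalize each row so document length does not dominate.
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 4) for v in row])
```

The same pattern shows up in the two-sentence output above: 0.14809752 / 0.1053726 is about 1.405, which is ln(3/2) + 1, the smoothed idf of a term found in only one of two documents.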

While we looked at a number of the intermediate states, these tasks are common enough that the `TfidfVectorizer` class bundles together both `CountVectorizer` and `TfidfTransformer` into one step.
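
A quick check of that equivalence, using three made-up sentences and default arguments throughout:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ['we clean our text', 'we count words in our text', 'words help predict']

# Two steps: raw counts first, then tf-idf weighting.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One step: TfidfVectorizer does both with the same defaults.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```

Keeping the two steps separate, as the pipelines in this notebook do, makes it possible to tune the counting and weighting options independently.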

# Classification and performance: Naive Bayes

In [17]:
pipeline_1 = Pipeline([('cv', CountVectorizer()),
                       ('tfidf', TfidfTransformer()),
                       ('model', MultinomialNB())
                      ])

pipeline_1.fit(news_train.data, news_train.target)
news_pred = pipeline_1.predict(news_test.data)

A confusion matrix is a good start for evaluation.
The values on the diagonal are correct classifications, and the off-diagonals are misses.
In scikit-learn's layout, each row is a true label and each column is a predicted label.

In [18]:
print(confusion_matrix(news_test.target, news_pred))

[[339   3  16  27]
 [  4 339  48   6]
 [  1   8 386   4]
 [ 73   9  16 295]]


In [19]:
print(classification_report(news_test.target, news_pred,
                            target_names=news_test.target_names))

                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.81      0.88      0.85       385
   rec.sport.baseball       0.94      0.85      0.90       397
     rec.sport.hockey       0.83      0.97      0.89       399
      sci.electronics       0.89      0.75      0.81       393

            micro avg       0.86      0.86      0.86      1574
            macro avg       0.87      0.86      0.86      1574
         weighted avg       0.87      0.86      0.86      1574
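
The precision and recall columns can be recovered from the confusion matrix by hand. The sketch below copies the Naive Bayes matrix printed above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels); the variable names are my own.

```python
# Confusion matrix from the Naive Bayes model: rows are true labels,
# columns are predicted labels.
cm = [[339, 3, 16, 27],
      [4, 339, 48, 6],
      [1, 8, 386, 4],
      [73, 9, 16, 295]]

n = len(cm)
# Row totals are the true counts per class (the "support" column).
row_totals = [sum(cm[i]) for i in range(n)]
# Column totals are how often each class was predicted.
col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]

# Recall: share of each true class that was found.
recall = [cm[k][k] / row_totals[k] for k in range(n)]
# Precision: share of each predicted class that was right.
precision = [cm[k][k] / col_totals[k] for k in range(n)]
accuracy = sum(cm[k][k] for k in range(n)) / sum(row_totals)

for k in range(n):
    print(f'class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f}')
print(f'accuracy={accuracy:.2f}')
```

Reading it this way also explains the weak spot: the last row sends 73 electronics posts to the Mac hardware column, which is what pulls electronics recall down to 0.75.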



# Classification and performance: Support Vector Machine

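SVM models are known for good performance on text, so let's see how one compares.
Here we use `SGDClassifier`, which with its default hinge loss fits a linear SVM by stochastic gradient descent.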

In [20]:
# Note how similar this code is to the prior version.
# We could have just changed the classifier.
pipeline_2 = Pipeline([('cv', CountVectorizer()),
                       ('tfidf', TfidfTransformer()),
                       ('model', SGDClassifier(penalty='l2',
                                             alpha=1e-3,
                                             max_iter=5, tol=None))
                      ])

pipeline_2.fit(news_train.data, news_train.target)
news_pred_2 = pipeline_2.predict(news_test.data)
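
For intuition about what this classifier optimizes: with its default settings, `SGDClassifier` uses hinge loss, the loss behind a linear SVM. The function below is a hand-written illustration of that loss for a single example in the two-class case, not scikit-learn's implementation; the scores are made-up numbers.

```python
def hinge_loss(y_true, score):
    """Hinge loss for one example; y_true is +1 or -1, score is w.x + b."""
    return max(0.0, 1.0 - y_true * score)


# A confidently correct prediction costs nothing.
loss_confident = hinge_loss(+1, 2.5)
# A correct prediction inside the margin still incurs some loss.
loss_in_margin = hinge_loss(+1, 0.3)
# A prediction on the wrong side of the boundary costs the most.
loss_wrong = hinge_loss(-1, 0.3)

print(loss_confident, loss_in_margin, loss_wrong)
```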

In [21]:
print(confusion_matrix(news_test.target, news_pred_2))

[[316  19   3  47]
 [  7 349  29  12]
 [  2  33 354  10]
 [ 46  23   2 322]]


In [22]:
print(classification_report(news_test.target, news_pred_2,
                            target_names=news_test.target_names))

                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.85      0.82      0.84       385
   rec.sport.baseball       0.82      0.88      0.85       397
     rec.sport.hockey       0.91      0.89      0.90       399
      sci.electronics       0.82      0.82      0.82       393

            micro avg       0.85      0.85      0.85      1574
            macro avg       0.85      0.85      0.85      1574
         weighted avg       0.85      0.85      0.85      1574



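Interestingly, overall performance is very slightly worse (weighted F1 of 0.85 versus 0.86), but it is more consistent across labels: per-class recall ranges from 0.82 to 0.89, compared with 0.75 to 0.97 for Naive Bayes.
As with Naive Bayes, performance is good in general.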
This isn't unusual with categories that are fairly distinct (like many categories of firm press releases).

However, let's try one more important part of fitting machine learning models: hyperparameter tuning.
In this case, we are going to do a grid search with a couple of options to see if performance improves.

In [23]:
p2_params = {'cv__ngram_range': [(1, 1), (1, 2), (1, 3)],
             'tfidf__use_idf': (True, False),
             'model__alpha': (0.01, 0.001)}

p2_grid = GridSearchCV(pipeline_2, p2_params, cv=3, n_jobs=-1)
p2_grid.fit(news_train.data, news_train.target)

news_pred_3 = p2_grid.predict(news_test.data)
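
It helps to know how much work a grid implies before running it. The sketch below enumerates the same grid with `itertools.product`; the variable names are my own, and the keys follow the pipeline's `<step name>__<parameter>` convention.

```python
from itertools import product

# Same grid as above; keys use the '<step name>__<parameter>' convention.
param_grid = {'cv__ngram_range': [(1, 1), (1, 2), (1, 3)],
              'tfidf__use_idf': (True, False),
              'model__alpha': (0.01, 0.001)}
n_folds = 3

# Every combination of one value per parameter.
keys = list(param_grid)
combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print(f'{len(combos)} candidates x {n_folds} folds = {len(combos) * n_folds} fits')
for combo in combos[:3]:
    print(combo)
```

With the default `refit=True`, the search then fits the best candidate once more on the full training set, which is the model `p2_grid.predict` uses.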

In [24]:
print(confusion_matrix(news_test.target, news_pred_3))

[[321  18   4  42]
 [  9 341  36  11]
 [  2  28 359  10]
 [ 49  22   4 318]]


In [25]:
print(classification_report(news_test.target, news_pred_3,
                            target_names=news_test.target_names))

                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.84      0.83      0.84       385
   rec.sport.baseball       0.83      0.86      0.85       397
     rec.sport.hockey       0.89      0.90      0.90       399
      sci.electronics       0.83      0.81      0.82       393

            micro avg       0.85      0.85      0.85      1574
            macro avg       0.85      0.85      0.85      1574
         weighted avg       0.85      0.85      0.85      1574



In [26]:
p2_grid.best_params_

{'cv__ngram_range': (1, 1), 'model__alpha': 0.001, 'tfidf__use_idf': True}

Interestingly enough, the grid search results suggest that the words-only model outperforms the ones using n-grams.
That's not surprising for a model predicting clear categories like this, but n-grams are often helpful in cases where the categories are more nuanced.

Looking at our confusion matrices, note that the models miss more often on two pairs of groups: hardware and electronics, and baseball and hockey.
Given the overlap, that's logical.

Our overall results here also illustrate another common pattern.
Models like Naive Bayes give solid results with little hyperparameter tuning.
As models increase in complexity, good performance often depends on more work on hyperparameters.
I suggest starting with simpler models and working your way up as needed.

For reporting results to reviewers, it's worth noting what you tried and the associated results.
Many reviewers will not be familiar with machine learning, but chances are that the editor will look for at least one who is, and that reviewer will likely want to see a bit of the process.

# Topic modeling