# Natural Language Processing: The Term-Document Matrix and Topic Modeling

## Term-Document Matrix

In this course, we've already seen a few examples of working with text. We've used basic string operations and `pandas` `str` operations in order to manipulate text data. Now that we have some array programming and machine learning skills under our belt, we can take our exploration of text data much further. 

In this lecture, we'll introduce one of the most important constructs for analyzing text data: the [term-document matrix](https://en.wikipedia.org/wiki/Document-term_matrix).

This might sound intimidating, but the idea is very simple. Consider the following three sentences. We regard each of them as a "document."

1. I like Harry Potter. 
2. You like Harry Potter. 
3. I like Totoro.

We can think of the term-document matrix as a data frame with a row for each document and a column for each possible word. In each column, we count up how many times that word appears in each document. For example, using the three short "documents" above, the term-document matrix is: 

| document | I | you | like | harry | potter | totoro |
|----------|---|-----|------|------|------|--------|
| 1        | 1 | 0   | 1    | 1    | 1    | 0      |
| 2        | 0 | 1   | 1    | 1    | 1    | 0      |
| 3        | 1 | 0   | 1    | 0    | 0    | 1      |

This turns out to be an extremely convenient format for working with text data, and we'll see soon how to use it for both sentiment analysis (figuring out how "positive" a word or sentence is) and topic modeling (figuring out the main "ideas" in a set of documents). 

If you were very persistent, you could make a term-document matrix using a lot of `for`-loops and basic string operations. However, `scikit-learn` offers a much more convenient approach. In this lecture, we'll see an example of organizing our data and constructing the term-document matrix. In coming lectures, we'll start to use our construction for data analysis. 
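
To make the "lots of `for`-loops" route concrete, here is a minimal sketch that rebuilds the small table above by hand. The helper `tokenize` and the variable names are made up for this illustration, and the punctuation handling is deliberately crude.

```python
# The three toy "documents" from the example above.
docs = ["I like Harry Potter.", "You like Harry Potter.", "I like Totoro."]

def tokenize(doc):
    # hypothetical helper: split on whitespace, strip simple punctuation, lowercase
    return [w.strip(".,!?;:").lower() for w in doc.split()]

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for word in tokenize(doc):
        if word not in vocab:
            vocab.append(word)

# One row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    words = tokenize(doc)
    matrix.append([words.count(term) for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

The columns come out in order of first appearance, so they are arranged differently from the table above, but the counts are the same.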

## Data

Our data for this lecture is the complete text of the short book *Alice’s Adventures in Wonderland* by Lewis Carroll. The package `nltk` (Natural Language ToolKit) makes it wonderfully easy to obtain this data set. 

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import nltk

# only have to do this once
nltk.download("gutenberg")

from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/yourth/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


In [2]:
s = gutenberg.raw('carroll-alice.txt')
s[:200]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once"

We observe that the chapters are demarcated by the all-caps word "CHAPTER". So, we can simply split on this word to break the book up into chapters. We need to exclude the very first part of the split, since this isn't a real chapter -- it just contains the title and author information. 

In [3]:
chapters = s.split("CHAPTER")[1:]
len(chapters)

12

In [4]:
# number of characters in each chapter
[len(c) for c in chapters]

[11452,
 10989,
 9552,
 13871,
 11986,
 13860,
 12688,
 13656,
 12618,
 11528,
 10392,
 11661]
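
To see why we drop the first piece of the split, here is a small sketch on a made-up string with the same layout as the book (the text of `toy_book` is invented for illustration).

```python
# A made-up miniature "book" with front matter and two chapters.
toy_book = (
    "[A Toy Book by Nobody 1900]\n\n"
    "CHAPTER I. First Things\n\nSome text.\n\n"
    "CHAPTER II. Second Things\n\nMore text."
)

parts = toy_book.split("CHAPTER")

# parts[0] holds only the title line, so we skip it.
front_matter = parts[0]
toy_chapters = [p.strip() for p in parts[1:]]

print(len(parts))        # pieces before dropping the front matter
print(toy_chapters[0])   # starts with the chapter numeral and title
```

Note that this approach assumes the all-caps word "CHAPTER" never appears inside the body of a chapter; if it did, that chapter would be split in two.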

There are lots of punctuation marks and special characters in the text, but we don't have to worry about those this time -- there are built-in functions that will filter these out for us. 

It's helpful to keep ourselves organized by placing the text of each chapter into a data frame. 

In [5]:
df = pd.DataFrame({
    "chapter" : range(1, len(chapters) + 1),
    "text"    : chapters
})

In [6]:
df

|   | chapter | text |
|---|---------|------|
| 0 | 1 | I. Down the Rabbit-Hole\n\nAlice was beginnin... |
| 1 | 2 | II. The Pool of Tears\n\n'Curiouser and curio... |
| 2 | 3 | III. A Caucus-Race and a Long Tale\n\nThey we... |
| 3 | 4 | IV. The Rabbit Sends in a Little Bill\n\nIt w... |
| 4 | 5 | V. Advice from a Caterpillar\n\nThe Caterpill... |
| 5 | 6 | VI. Pig and Pepper\n\nFor a minute or two she... |
| 6 | 7 | VII. A Mad Tea-Party\n\nThere was a table set... |
| 7 | 8 | VIII. The Queen's Croquet-Ground\n\nA large r... |
| 8 | 9 | IX. The Mock Turtle's Story\n\n'You can't thi... |
| 9 | 10 | X. The Lobster Quadrille\n\nThe Mock Turtle s... |


Next, we are going to grab the `CountVectorizer` class from the `sklearn.feature_extraction.text` module. This module gives a whole range of tools for turning unstructured text into delicious, quantitative numbers that we can feed into algorithms. 

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

We now create a `CountVectorizer` object. This is an object which will construct the term-document matrix for us. As usual, this object accepts various parameters. In this case, I've only specified the use of common English-language "stop words." A stop word is a word that's considered uninteresting for the purposes of natural language processing. For example, "she," "can," and "the" are common stop words. 

In [8]:
vec = CountVectorizer(stop_words = "english")

Creating the term-document matrix is easy, using the `fit_transform()` method on the appropriate column of `df`. 

In [9]:
counts = vec.fit_transform(df['text'])

However, there is a small hitch...

In [10]:
counts

<12x2312 sparse matrix of type '<class 'numpy.int64'>'
	with 5302 stored elements in Compressed Sparse Row format>

Hmmm, we haven't really worked with sparse matrices before. While these are very useful in general, for this course we can just convert it into a regular matrix (i.e. a 2d `numpy` array). 

In [11]:
counts = counts.toarray()
counts

array([[0, 0, 0, ..., 0, 1, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])
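
If you're curious what "stored elements" means, here is a minimal sketch on a tiny made-up matrix (the names `toy_dense` and `toy_sparse` are just for illustration). A sparse matrix keeps track of only some entries, typically the nonzero ones, which saves a lot of memory when most entries are zero.

```python
import numpy as np
from scipy import sparse

# A small made-up matrix that is mostly zeros.
toy_dense = np.array([[1, 0, 0, 2],
                      [0, 0, 3, 0]])

# Store it in Compressed Sparse Row format, like the output of fit_transform().
toy_sparse = sparse.csr_matrix(toy_dense)

stored = toy_sparse.nnz                  # number of stored entries
total = toy_sparse.shape[0] * toy_sparse.shape[1]
print(stored, "of", total, "entries are stored")

# Converting back recovers the original 2d array.
back = toy_sparse.toarray()
print(back)
```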

Even better, let's convert it into a `DataFrame` with appropriately labeled columns! 

In [None]:
count_df = pd.DataFrame(counts, columns = vec.get_feature_names_out())
count_df

If we'd like to be extra-organized, we can now add all this information to our original data frame: 

In [None]:
df = pd.concat((df, count_df), axis = 1)

In [None]:
df
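
Here is a small sketch of what the `pd.concat()` step does, using made-up data frames (`toy_df` and `toy_counts` are invented for this illustration). With `axis = 1`, the frames are glued together side by side, and rows are matched up by index.

```python
import pandas as pd

# Made-up stand-ins for the chapter frame and the count frame.
toy_df = pd.DataFrame({
    "chapter": [1, 2],
    "text": ["alice rabbit", "alice queen"]
})
toy_counts = pd.DataFrame({
    "alice": [1, 1],
    "queen": [0, 1],
    "rabbit": [1, 0]
})

# Both frames have the default index 0, 1, so the rows line up.
combined = pd.concat((toy_df, toy_counts), axis=1)
print(combined)
```

Because the matching is done by index, this works cleanly here only because both frames use the same default index. If one frame had been filtered or re-indexed first, the rows might not line up and missing values could appear.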

## Interpreting the Term-Document Matrix

We can now use the term-document matrix to check how frequently a given term appears in each chapter of the novel. For example: 

In [None]:
df['alice']

We can also plot terms to see how often they appear from chapter to chapter: 

In [None]:
fig, ax = plt.subplots(1)

for term in ['alice', 'dinah', 'queen', 'hatter']:
    ax.plot(df[term], label = term)
    
ax.set(ylim = (0, None))
ax.legend()

We can see that Alice is a prominent character throughout the entirety of the book. In contrast, Dinah (Alice's pet cat) only appears in the first half of the book, and the Mad Hatter appears in just a few specific chapters. 

## Sidebar: Normalization

In many applications, it is desirable to use something other than the raw number of times that a word appears. Various normalizations are possible, each of which provides a quantification of how important a word is within a document. For example, one could compute what proportion of a document is allocated to each word. This approach automatically accounts for the fact that some documents are longer than others. 
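
As a minimal sketch of the proportion idea, we can divide each row of a count matrix by its row total. The array `toy_counts` below is just the small three-sentence table from the start of the lecture, typed in by hand.

```python
import numpy as np

# Counts from the three-sentence example: one row per document.
toy_counts = np.array([[1, 0, 1, 1, 1, 0],
                       [0, 1, 1, 1, 1, 0],
                       [1, 0, 1, 0, 0, 1]])

# Total number of counted words in each document, kept as a column
# so that broadcasting divides each row by its own total.
row_totals = toy_counts.sum(axis=1, keepdims=True)

proportions = toy_counts / row_totals
print(proportions)
```

Each row of `proportions` now sums to 1, so a short document and a long document are on the same scale.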

The most popular way to normalize is slightly more mathematically complex: it is called [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). We can compute a tf-idf term-document matrix easily, replacing the `CountVectorizer` above with the `TfidfVectorizer`. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(stop_words = "english")

In [None]:
tfidf = vec.fit_transform(df['text'])
tfidf.toarray()

The entries of `tfidf` are no longer integers, but rather floats that estimate a weight for a word within each document. 
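
For the curious, here is a hedged sketch of how these weights can be reproduced by hand. It assumes the default settings of `TfidfVectorizer` (raw counts, a smoothed inverse document frequency, and rows rescaled to unit length) and uses made-up documents; other settings or other tf-idf variants will give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration.
toy_docs = ["harry potter harry", "potter totoro", "totoro totoro totoro"]

toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = toy_counts.shape[0]

# Number of documents in which each word appears at least once.
doc_freq = (toy_counts > 0).sum(axis=0)

# Smoothed inverse document frequency: rarer words get larger weights.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then rescale each row to have Euclidean length 1.
weighted = toy_counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

auto = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, auto))
```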

We won't worry much about the difference between count vectorization and tf-idf vectorization in this course, but feel free to try both when working with models to see whether you can improve your results. 

## Topic Modeling

Let's work through an example of *topic modeling*. The idea of topic modeling is to find "topics" in documents that tie together many words. Here are some examples of hypothetical topics that you might find in a newspaper: 

1. **Finance**: "dollar", "stock", "banks"
2. **Politics**: "party", "vote", "election"
3. **Sports**: "team", "win", "game"

We'll see how to use the term-document matrix, in combination with some nice algorithms from `scikit-learn`, to perform topic modeling. Our overall aim is to get a coarse, topic-level summary of the plot of the short book *Alice’s Adventures in Wonderland* by Lewis Carroll. 

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

import nltk
from nltk.corpus import gutenberg
# need to do this once to download the data
# nltk.download('gutenberg')

Let's briefly review the steps that we took to construct our term-document matrix. First, we used the `gutenberg` module to read in the raw text of the book, and split it into chapters. 

In [None]:
s = gutenberg.raw("carroll-alice.txt")
chapters = s.split("CHAPTER")[1:]

Then, we created a nice, tidy data frame in which we stored the complete text of each chapter. 

In [None]:
df = pd.DataFrame({
    "chapter" : range(1, len(chapters) + 1),
    "text" : chapters
})
df

Then, and this is the complex part, we used the `CountVectorizer` from `sklearn` to construct the term-document matrix. In this example, I've used a few more of the arguments for `CountVectorizer`. In particular, because I'd like to eventually be able to see how topics evolve between chapters, I use the `max_df` argument to specify that I'd like to include only words that appear in at most 50% of the chapters. 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(max_df = 0.5, min_df = 0.0, stop_words = "english")

Next, we can use this `CountVectorizer` to create the term-document matrix and collect it all as a nice, tidy data frame. 

In [None]:
counts = vec.fit_transform(df['text'])
counts = counts.toarray()
count_df = pd.DataFrame(counts, columns = vec.get_feature_names_out())
df = pd.concat((df, count_df), axis = 1)

In [None]:
df.head(3)

## On To Topic Modeling

Now we are ready to run our model! Topic modeling is an *unsupervised* machine learning framework, which means that there's no set of true labels `y`. So, we just need to create the variables `X`. To do this, we can ignore the `text` and `chapter` columns. 

In [None]:
X = df.drop(['text', 'chapter'], axis = 1)

There are many algorithms for topic modeling. We will use *nonnegative matrix factorization* or NMF for now. As usual, there are three easy steps: 

1. Import the model we want. 
2. Initialize an instance of the model. 
3. Fit the model on data. 

NMF requires us to specify `n_components`, which is the number of topics to find. Choosing the right number of topics is a bit of an art, but there are also quantitative approaches based on Bayesian statistics that we won't go into here. 

In [None]:
from sklearn.decomposition import NMF
model = NMF(n_components = 4, init = "random", random_state = 0)
model.fit(X)
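
As a rough sketch of what NMF computes: it looks for two matrices with no negative entries, one of per-document topic weights and one of per-topic word weights, whose product approximates the original matrix. The toy data and the names `X_toy`, `W`, and `H` below are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# A made-up 6-document, 10-word count matrix.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(6, 10)).astype(float)

toy_model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)

W = toy_model.fit_transform(X_toy)   # documents x topics
H = toy_model.components_            # topics x words

# The product of the two factors has the same shape as the data
# and serves as a low-rank approximation of it.
approx = W @ H
print(W.shape, H.shape, approx.shape)
```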

There are two important parts of NMF. First, we have the topics themselves, which are stored in the `components_` attribute of the model. 

In [None]:
model.components_

In [None]:
model.components_.shape

Uh, what does that mean? We can think of each component as a collection of **weights** for each word. We can find the most important words in each component by finding the words where the weights are highest within that component. We can do this with a handy function called `np.argsort()`, which returns the indices that would sort an array from smallest to largest. The last entries of its output therefore point to the largest weight, the second largest, and so on.

In [None]:
orders = np.argsort(model.components_, axis = 1)
orders

We can then use `numpy` "fancy" indexing to arrange the words in the needed orders. 

In [None]:
important_words = np.array(X.columns)[orders]
important_words
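
Here is the same pair of steps on a tiny made-up example, where it's easy to check the result by eye. The arrays `toy_weights` and `toy_words` are invented for illustration.

```python
import numpy as np

# Made-up weights: 2 topics (rows) over 3 words (columns).
toy_weights = np.array([[0.1, 0.7, 0.2],
                        [0.5, 0.0, 0.9]])
toy_words = np.array(["cat", "hatter", "queen"])

# Indices that sort each row from smallest weight to largest.
toy_orders = np.argsort(toy_weights, axis=1)
print(toy_orders)      # [[0 2 1], [1 0 2]]

# Fancy indexing rearranges the words into that order, row by row.
ranked = toy_words[toy_orders]
print(ranked)          # most important word is last in each row

# Take the last two words of each row and reverse them,
# so the most important word comes first.
top2 = ranked[:, -2:][:, ::-1]
print(top2)            # [['hatter' 'queen'], ['queen' 'cat']]
```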

It's convenient to write a function to automate this for us: 

In [None]:
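def top_words(X, model, component, num_words):
    """Return the num_words highest-weight words in one topic, in ascending order of weight."""
    # indices that sort each topic's weights from smallest to largest
    orders = np.argsort(model.components_, axis = 1)
    # reorder the vocabulary (the columns of X) by weight, separately for each topic
    important_words = np.array(X.columns)[orders]
    # the largest weights sit at the end of each row
    return important_words[component][-num_words:]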

In [None]:
top_words(X, model, 3, 5)

The next important aspect of topic modeling is the assignment of topics to each document. This is done via weights. We can access these weights by using the `transform()` method of the model. 

In [None]:
weights = model.transform(X)
weights

In [None]:
fig, ax = plt.subplots(1)
ax.imshow(weights.T)

The weights indicate the relative presence of each topic in each chapter. For example, Topic 2 is highly present in the first six chapters, but then mostly absent for the rest of the book. Topic 3 appears in Chapters 7 and 11, and so on. 
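
If we'd like a quick numerical summary instead of reading the image, one simple option is to ask which topic has the largest weight in each document. This is a minimal sketch on made-up weights (`toy_weights` is invented for illustration, with topics numbered from 0), not something the model reports for us.

```python
import numpy as np

# Made-up weights: 3 documents (rows) and 3 topics (columns).
toy_weights = np.array([[0.9, 0.1, 0.0],
                        [0.2, 0.0, 1.5],
                        [0.0, 0.8, 0.3]])

# Index of the topic with the largest weight in each document.
dominant = toy_weights.argmax(axis=1)
print(dominant)

# Rescale each row to sum to 1, to compare topics within a document.
shares = toy_weights / toy_weights.sum(axis=1, keepdims=True)
print(shares.round(2))
```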

We can also visualize the same information as a line chart. Let's add as labels some of the top words for each topic. 

In [None]:
fig, ax = plt.subplots(1)

for i in range(4):
    ax.plot(df['chapter'], weights[:,i], label = top_words(X, model, i, 5))

ax.legend(bbox_to_anchor=(1.05, 0.65), loc="upper left")

This chart allows us to easily see several major features of the novel's plot, including the tea party with the March Hare, the Mad Hatter, and the Dormouse (Chapter 7), the croquet game in the court of the Queen of Hearts (Chapter 8), the appearance of the Mock Turtle and the Lobster (Chapters 9 and 10), and the reappearance of many characters in Chapter 11. 