# Topic modeling

We are going to look at data from the [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset.  These are postings to newsgroups in 20 different categories.

Scikit-learn has a function for downloading the data.  See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

## Supervised Learning with Naive Bayes

We are going to start out using Naive Bayes for supervised ML.

In [None]:
from sklearn.datasets import fetch_20newsgroups

In [None]:
ntrain = fetch_20newsgroups(subset='train')

Here is the description of the dataset, along with an interesting example and tidbits of advice: 

In [None]:
print(ntrain.DESCR)

An example of the first five newsgroup posts:

In [None]:
ntrain.data[:5]

We have 11314 newsgroup postings, with 20 classes.

In [None]:
ntrain.target.shape

In [None]:
ntrain.target[:10]

In [None]:
[ntrain.target_names[i] for i in ntrain.target[:10]]

We'll keep it a bit simpler for the moment and only consider two of the categories:

In [None]:
cats = ['sci.space', 'comp.graphics']

Re-import the training data for just these two categories, along with the test data.

In [None]:
ntrain = fetch_20newsgroups(subset='train', categories=cats)
ntest = fetch_20newsgroups(subset='test',categories=cats)

In [None]:
print(ntrain.data[0])

How does this compare with what I showed on the slide?  And does it matter that it's different?

In [None]:
ntrain.target_names

In [None]:
ntrain.target.shape

In [None]:
ntrain.target[:10]

In [None]:
ntest.target.shape

## How can we think of text as numbers for quantitative analysis?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize, sent_tokenize

import nltk
nltk.download('punkt')

## Bag-of-Words (BoW)

BoW represents a document as a multiset (a "bag") of words without regard for word order.  Each word in the vocabulary is assigned a unique index, and a document is represented as a vector whose value at each word's index is the number of times that word occurs in the document.

In [None]:
corpus = ["The cat slept and then meowed.", 
          "The tiger slept and then roared.", 
          "The boy ran home and then the boy laughed."]

vectorizer = CountVectorizer()

X = vectorizer.fit_transform(corpus)

Even though we are using Scikit-Learn to do the CountVectoriz-ing, there is no reason that we couldn't manually do it ourselves too with a bit of Python.  It's just convenient to do it the Scikit-Learn way.
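
As a rough sketch of what doing it by hand could look like, the snippet below rebuilds the count matrix for the same three sentences using only the standard library. It assumes the same tokenization as `CountVectorizer`'s defaults (lowercasing, tokens of two or more word characters); the names `toy_corpus`, `tokenize`, `vocab`, and `bow` are made up for this illustration.

```python
import re
from collections import Counter

toy_corpus = ["The cat slept and then meowed.",
              "The tiger slept and then roared.",
              "The boy ran home and then the boy laughed."]

def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# vocabulary: every distinct token, sorted; a word's position is its column index
vocab = sorted({tok for doc in toy_corpus for tok in tokenize(doc)})

# one row of counts per document, one column per vocabulary word
bow = []
for doc in toy_corpus:
    counts = Counter(tokenize(doc))
    bow.append([counts[word] for word in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each row lines up with one sentence; the third row has a 2 in the `boy` and `the` columns because those words occur twice.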

In [None]:
vectorizer.get_feature_names_out()

In [None]:
pd.DataFrame(X.toarray(), 
             columns=vectorizer.get_feature_names_out())

In [None]:
# for comparison against our corpus:
corpus

## Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF extends BoW by accounting for the uniqueness of words in distinguishing between documents.  The word counts of BoW are weighted by words' relative rarity across the entire corpus.

* Scikit-Learn's TF-IDF calculation is [described here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer)

In [None]:
vectorizer = TfidfVectorizer()

X_tfidf = vectorizer.fit_transform(corpus)

In [None]:
pd.DataFrame(X_tfidf.toarray(), 
             columns=vectorizer.get_feature_names_out())

There are a lot of mathematical details involved in getting well-behaved forms of TF-IDF, and it's a messy business to reconstruct scikit-learn's values from the word counts and frequencies.

You can ignore the following if you want to, but here is how one would go directly from the matrix of counts to scikit-learn's version of the TF-IDF measure.

In [None]:
x_bow = pd.DataFrame(X.toarray(), 
             columns=vectorizer.get_feature_names_out())

In [None]:
x_bow

In [None]:
# Getting the term frequencies in each of the three documents
(x_bow.T / x_bow.T.sum(axis=0)).T

In [None]:
# Getting the number of documents in which each word occurs
(x_bow > 0).sum(axis=0)

In [None]:
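tf = (x_bow.T / x_bow.T.sum(axis=0)).T

n_docs = len(x_bow)                  # 3 documents in our toy corpus
doc_freq = (x_bow > 0).sum(axis=0)   # number of documents containing each word

# Smoothed IDF, as in scikit-learn's default (smooth_idf=True):
# the +1 in the numerator and the +1 in the denominator act like one extra
# document that contains every word, which avoids division by zero;
# the +1 at the end is so that even words that occur across all docs
# still have a non-zero TF-IDF
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

tf * idf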

... and then one has to do an L2 (cosine) normalization, so that the squares of the elements in each row add up to 1.  This is convenient because the inner (dot) product of two rows is then their cosine similarity.  Cosine similarity ranges from -1 to 1 in general, but because TF-IDF values are never negative, here it falls between 0 and 1.
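
In symbols, here is a sketch of the whole calculation, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), with $n$ the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the frequency of $t$ in document $d$:

$$
\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1, \qquad
w(t,d) = \mathrm{tf}(t,d)\,\mathrm{idf}(t), \qquad
\hat{w}(t,d) = \frac{w(t,d)}{\sqrt{\sum_{t'} w(t',d)^2}}
$$

Because the last step rescales each document's row to unit length, it makes no difference whether $\mathrm{tf}$ holds raw counts or counts divided by the document length: any per-document scaling cancels.  The cosine similarity between documents $d_1$ and $d_2$ is then $\sum_t \hat{w}(t,d_1)\,\hat{w}(t,d_2)$.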

In [None]:
tfidf = tf * idf
tfidf = (tfidf.T / np.sqrt((tfidf.T * tfidf.T).sum(axis=0))).T
tfidf

In [None]:
np.dot(tfidf.loc[0], tfidf.loc[1])

## Back to the topic modeling

We can choose how to convert our posts and words into a document-term matrix.  Let's use TF-IDF:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()
vectors_train = vectorizer.fit_transform(ntrain.data)
vectors_test = vectorizer.transform(ntest.data)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(vectors_train.toarray(),
                  columns=vectorizer.get_feature_names_out())

In [None]:
df

What do the number of rows and columns in the dataframe represent?

In [None]:
df[['earth','graphics','image','nasa','algorithms','astronomy']]

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [None]:
clf = MultinomialNB()
clf.fit(vectors_train, ntrain.target)

In [None]:
pred = clf.predict(vectors_test)

In [None]:
from sklearn.metrics import classification_report

In [None]:
metrics.accuracy_score(ntest.target, pred)

In [None]:
print(classification_report(ntest.target, pred))

So good!

Too good??

Let's look at the words that have the highest probability for each of these two classes.

We can get the feature names with:

In [None]:
vectorizer.get_feature_names_out()

And we can get the probabilities (actually log of probabilities because they're so small) with:

In [None]:
clf.feature_log_prob_

In [None]:
clf.feature_log_prob_.shape

There are two rows and 23882 columns because this holds *log P(word | class)* for each of the two classes and each word in the vocabulary.
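
To make the role of these per-class word probabilities concrete, here is a minimal from-scratch sketch of how a multinomial Naive Bayes classifier combines log P(word | class) with class priors to score a document. The word counts, the three-word vocabulary, and the equal priors are all invented for illustration, and the smoothing value `alpha = 1.0` is chosen to match `MultinomialNB`'s default.

```python
import math

# Hypothetical word counts per class (not taken from the newsgroup data)
word_counts = {
    "comp.graphics": {"image": 8, "nasa": 1, "orbit": 1},
    "sci.space":     {"image": 2, "nasa": 5, "orbit": 3},
}
vocab = ["image", "nasa", "orbit"]
alpha = 1.0  # additive (Laplace) smoothing

def log_word_given_class(counts):
    # log P(word | class) with additive smoothing; sums to 1 over the vocabulary
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}

feature_log_prob = {c: log_word_given_class(cnt) for c, cnt in word_counts.items()}
log_prior = {c: math.log(0.5) for c in word_counts}  # assumed equal class priors

def posterior(doc_counts):
    # Bayes' rule in log space: log P(class) + sum of count * log P(word | class)
    scores = {c: log_prior[c] + sum(n * feature_log_prob[c][w]
                                    for w, n in doc_counts.items())
              for c in word_counts}
    # normalize so that the class probabilities sum to 1
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {c: math.exp(s - norm) for c, s in scores.items()}

post = posterior({"nasa": 1, "orbit": 1})
print(post)
```

A post containing "nasa" and "orbit" once each comes out as far more likely to be sci.space, because both words have a higher P(word | class) under that class.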

In [None]:
import numpy as np

In [None]:
def show_top10(classifier, vectorizer, categories):
    """Print the 10 words with the highest log P(word | class) for each class."""
    feature_names = vectorizer.get_feature_names_out()
    for i,category in enumerate(categories):
        # argsort of the negated values puts the most probable words first
        top10 = np.argsort(-classifier.feature_log_prob_[i])[:10]
        print('%s:' % (category))
        for j in top10:
            print("%s: %.2f" % (feature_names[j], classifier.feature_log_prob_[i][j]))
        print('\n')

In [None]:
show_top10(clf, vectorizer, ntrain.target_names)

We could stand to do more data processing.

First, remove all the meta-data in the headers and footers, along with quotes.

In [None]:
ntrain = fetch_20newsgroups(subset='train',categories=cats,remove=('headers','footers','quotes'))
ntest = fetch_20newsgroups(subset='test',categories=cats,remove=('headers','footers','quotes'))

In [None]:
vectors = vectorizer.fit_transform(ntrain.data)
vectors_test = vectorizer.transform(ntest.data)

In [None]:
clf = MultinomialNB()
clf.fit(vectors, ntrain.target)
pred = clf.predict(vectors_test)

In [None]:
print(classification_report(ntest.target, pred))

The performance dropped a little, most likely because the classifier can no longer lean on words in the headers, footers, and quoted text, which give away the newsgroup without being part of the meaningful content of the post.

How do the word probabilities change?

In [None]:
show_top10(clf, vectorizer, ntrain.target_names)

Still have a lot of fluff.

We can remove stopwords.  The TfidfVectorizer also lets us drop words that are really, really common (`max_df=0.5` ignores words that appear in more than half of the documents) and words that are too rare (`min_df=2` requires a word to appear in at least two documents):

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, min_df=2)
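
To see what the two document-frequency thresholds do, here is a small standard-library sketch of the filtering rule, assuming a float `max_df` is read as a fraction of documents and an integer `min_df` as a document count. The four-sentence corpus and the names `docs`, `doc_freq`, and `kept` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-corpus, for illustration only
docs = ["the rocket reached orbit",
        "the shuttle reached orbit",
        "the image shows pixels",
        "the image uses shading"]

max_df = 0.5   # as a fraction of documents
min_df = 2     # as an absolute number of documents

# document frequency: the number of documents each word appears in
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(re.findall(r"\b\w\w+\b", doc.lower())))

# keep words that are neither too common nor too rare
max_count = max_df * len(docs)
kept = sorted(w for w, n in doc_freq.items() if min_df <= n <= max_count)
print(kept)
```

Here "the" is dropped for appearing in every document, and words such as "rocket" are dropped for appearing in only one.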

In [None]:
vectors = vectorizer.fit_transform(ntrain.data)
vectors_test = vectorizer.transform(ntest.data)

In [None]:
clf = MultinomialNB()
clf.fit(vectors, ntrain.target)
pred = clf.predict(vectors_test)

In [None]:
print(classification_report(ntest.target, pred))

The performance did not drop; it may even have slightly improved.

In [None]:
show_top10(clf, vectorizer, ntrain.target_names)

That is looking like a much more meaningful set of words.

Let me try to make a prediction on a new (hypothetical) newsgroup post.

In [None]:
mypost = '''
I was sitting outside with my cat, looking up at the sky through our telescope, 
when lo and behold, to my little set of eyes I spied an anomalous signal emanating 
from a distant galaxy.  I've scoured my astronomy books and can't find any 
description of what I saw.  Is this a new type of celestial phenomenon? or, 
dare I say it, something extraterrestrial in origin?

Please let me know if you're up for sharing insights on my dataset.
'''

Do remember that if you want to make a prediction on a new data point, you'll need to transform it in the same way that you transformed your training data.
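
One way to make sure the same transformation is always applied is to bundle the vectorizer and the classifier into a single scikit-learn pipeline, so that fitting and predicting both go through the same steps.  The sketch below uses a tiny made-up training set (`toy_posts`, `toy_labels`) in place of the newsgroup data, so the numbers it prints mean nothing beyond illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set, standing in for the newsgroup posts
toy_posts = ["the rocket reached orbit around the moon",
             "nasa launched a telescope into space",
             "render the image with a shading algorithm",
             "the graphics card draws polygons and pixels"]
toy_labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]

# the pipeline vectorizes raw text and then classifies it
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(toy_posts, toy_labels)

# raw text goes straight in; the fitted vectorizer is applied automatically
new_post = ["a telescope in orbit"]
toy_pred = model.predict(new_post)
toy_proba = model.predict_proba(new_post)
print(toy_pred, toy_proba)
```

With string labels, the columns of `predict_proba` follow the order of `model.classes_`.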

In [None]:
vectorizer.transform([mypost])

In [None]:
vectorizer.transform([mypost]).toarray()

There are so many zeroes that we can't really see much by looking at our vector representation.

In [None]:
clf.predict(vectorizer.transform([mypost]))

In [None]:
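# look up the name of the predicted class rather than hard-coding its index
ntrain.target_names[clf.predict(vectorizer.transform([mypost]))[0]]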

After all the transforming, predictions, and decoding, indeed we find that the approach correctly classifies mypost as belonging to sci.space.

... though only with a certain probability:

In [None]:
clf.predict_proba(vectorizer.transform([mypost]))