# Topic Modeling

## Using sklearn to build tf-idf matrices

Now, rather than implementing everything ourselves, we will use a well-known Python library, scikit-learn (`sklearn`), to compute the tf-idf matrix for us.

**I will move this section to the end of the previous notebook. It doesn't belong here**

In [None]:
# Importing dependencies

import numpy as np
import pandas as pd

from nlpia.data.loaders import get_data
from nltk.tokenize.casual import casual_tokenize
from pugnlp.stats import Confusion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Same small corpus as before
docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]
docs.append("Harry is hairy and faster than Jill.")
docs.append("Jill is not as hairy as Harry.")
corpus = docs

In [None]:
# min_df: ignore terms that have a document frequency
# strictly lower than the given threshold (aka cut-off)
vectorizer = TfidfVectorizer(min_df=1)

# model is a sparse tf-idf matrix (mostly zeros);
# the zeros are not stored, which saves resources
model = vectorizer.fit_transform(corpus)

print(model)

In [None]:
# We can convert it into a dense matrix in one line!
print("\n--\n".join(corpus))
print(model.todense().round(2))

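To see what the library computed, the following is a minimal sketch that rebuilds the same kind of matrix by hand. It assumes the `TfidfVectorizer` defaults (raw term counts, smoothed idf, L2-normalised rows) and approximates the default tokenisation with a regular expression, so treat it as an illustration rather than a drop-in replacement.

```python
import math
import re
from collections import Counter

corpus = [
    "The faster Harry got to the store, the faster and faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

# Assumption: lowercase the text and keep tokens of two or more word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted(set(tok for doc in tokenized for tok in doc))
n_docs = len(corpus)

# Document frequency: in how many documents does each term appear?
df = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalised tf-idf vector of one tokenised document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf_by_hand = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in tfidf_by_hand:
    print([round(v, 2) for v in row])
```

A term that occurs in every document, such as "harry", gets the minimum idf of 1, so only its count and the row normalisation determine its weight.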
Back to today's topic...

## Thought Exercise

Training a topic model with _common sense_

In [None]:
topic = {}
# zip returns an iterator of tuples, where the i-th tuple 
# contains the i-th element from each of the argument sequences
tfidf = dict(zip('cat dog apple lion NYC love'.split(),
                 np.random.rand(6)))
# Random tf-idf vector for our single document
tfidf

In [None]:
# I have "created" common-sense weights
# Now, we multiply the tf-idf vector by the 
# "hand-crafted" weights (notice the subtractions)
topic['petness'] = (.3 * tfidf['cat'] +\
                .3 * tfidf['dog'] +\
                0 * tfidf['apple'] +\
                0 * tfidf['lion'] -\
                .2 * tfidf['NYC'] +\
                .2 * tfidf['love'])
topic['animalness'] = (.1 * tfidf['cat'] +\
                .1 * tfidf['dog'] -\
                .1 * tfidf['apple'] +\
                .5 * tfidf['lion'] +\
                .1 * tfidf['NYC'] -\
                .1 * tfidf['love'])
topic['cityness'] = ( 0 * tfidf['cat'] -\
                .1 * tfidf['dog'] +\
                .2 * tfidf['apple'] -\
                .1 * tfidf['lion'] +\
                .5 * tfidf['NYC'] +\
                .1 * tfidf['love'])
topic

Transposing the 3x6 matrix of weights (3 topics by 6 words) gives a 6x3 matrix: a vector of topic weights for each word
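The same bookkeeping can be written as a small matrix computation. This is a sketch, assuming the hand-crafted weights from the cell above are stored row by row in a NumPy array; the names `weights`, `topic_vec` and `word_topic` are illustrative only. Note that it shows the plain transposed weights, whereas the next cell additionally scales each weight by the topic score of our document.

```python
import numpy as np

words = 'cat dog apple lion NYC love'.split()
topics = ['petness', 'animalness', 'cityness']

# Rows are topics, columns are words: a 3x6 matrix of hand-crafted weights
weights = np.array([
    [.3,  .3,   0,   0, -.2,  .2],   # petness
    [.1,  .1, -.1,  .5,  .1, -.1],   # animalness
    [ 0, -.1,  .2, -.1,  .5,  .1],   # cityness
])

rng = np.random.default_rng(0)        # seed chosen only for reproducibility
tfidf_vec = rng.random(len(words))    # random tf-idf vector for one document

# Topic vector of the document: one weighted sum per topic
topic_vec = weights @ tfidf_vec       # shape (3,)

# Transposing gives one row of topic weights per word
word_topic = weights.T                # shape (6, 3)

print(dict(zip(topics, topic_vec.round(3))))
print(dict(zip(words, word_topic.tolist())))
```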

In [None]:
word_vector = {}
# word_vector['cat'] = .3*topic['petness'] +\
#                     .1*topic['animalness'] +\
#                     0*topic['cityness']

# word_vector['dog'] = .3*topic['petness'] +\
#                     .1*topic['animalness'] -\
#                     .1*topic['cityness']

# word_vector['apple']= 0*topic['petness'] -\
#                     .1*topic['animalness'] +\
#                     .2*topic['cityness']

# word_vector['lion'] = 0*topic['petness'] +\
#                     .5*topic['animalness'] -\
#                     .1*topic['cityness']
# word_vector['NYC'] = -.2*topic['petness'] +\
#                     .1*topic['animalness'] +\
#                     .5*topic['cityness']
# word_vector['love'] = .2*topic['petness'] -\
#                     .1*topic['animalness'] +\
#                     .1*topic['cityness']

word_vector['cat'] = [.3*topic['petness'],
                    .1*topic['animalness'],
                    0*topic['cityness']]

word_vector['dog'] = [.3*topic['petness'],
                    .1*topic['animalness'], 
                    -.1*topic['cityness']]

# The animalness weight of 'apple' is negative (-.1), as in topic['animalness']
word_vector['apple'] = [0*topic['petness'],
                    -.1*topic['animalness'],
                    .2*topic['cityness']]

word_vector['lion'] = [0*topic['petness'],
                    .5*topic['animalness'],
                    -.1*topic['cityness']]
word_vector['NYC'] = [-.2*topic['petness'],
                    .1*topic['animalness'],
                    .5*topic['cityness']]
word_vector['love'] = [.2*topic['petness'],
                    -.1*topic['animalness'],
                    .1*topic['cityness']]
word_vector

## Training a Linear Discriminant Analysis classifier

In [None]:
# Loading a labeled corpus: spam
sms = get_data('sms-spam')
print(sms[:-10])

# Just setting up the printing properties
pd.options.display.width = 120

In [None]:
# For display purposes: spam instances have a "!" added to the label
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in \
         zip(range(len(sms)), sms.spam)]
print(index[:20])

In [None]:
#'!'*0
#'!'*1
#'!'*4

In [None]:
# Creating a pandas df, using the data and the new index
sms = pd.DataFrame(sms.values, columns=sms.columns, index=index)
sms['spam'] = sms.spam.astype(int)
print(sms)
# len(sms)

In [None]:
# QUESTION: what am I getting with this sum?
sms.spam.sum()

In [None]:
# Vectorising the corpus
tfidf_model = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf_model.fit_transform(raw_documents=sms.text).toarray()
# QUESTION: what is the number on the right?
# (print the shape explicitly; a notebook only displays the last expression of a cell)
print(tfidf_docs.shape)
tfidf_docs




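We have:
* 4837 messages
* 638 positive instances (spam)
* 9232 types (distinct tokens in the vocabulary)

That is too many features for a Naive Bayes classifier: the vocabulary is almost twice as large as the number of messages and more than 14 times the number of spam messages, so most words carry very little evidence about either class.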

### Implementing the LDA

For this simple version of LDA we only need the centroids of the spam and non-spam (ham) messages, so we implement it ourselves

(keep in mind that sklearn has an [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html))
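For comparison, this is roughly how the sklearn class linked above is used. It is a minimal sketch on hypothetical toy data (two well-separated 2-D clusters standing in for ham and spam), not the SMS corpus, and the cluster positions and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(42)   # seed chosen only for reproducibility

# Hypothetical toy data: 20 "ham" points around (0, 0), 20 "spam" points around (5, 5)
ham = rng.normal(loc=[0, 0], scale=0.5, size=(20, 2))
spam = rng.normal(loc=[5, 5], scale=0.5, size=(20, 2))
X = np.vstack([ham, spam])
y = np.array([0] * 20 + [1] * 20)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

pred = lda.predict(X)
print(pred)
print(lda.score(X, y))   # mean accuracy on the training data
```

Unlike the centroid-only version in this notebook, the sklearn estimator also takes the spread of the classes into account when it places the decision boundary.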

In [None]:
# A mask (or "filter") to select only spam messages
mask = sms.spam.astype(bool).values
print(mask)

In [None]:
# Computing the spam centroid
spam_centroid = tfidf_docs[mask].mean(axis=0)
# axis=0 tells numpy to compute the mean for each column independently
print(spam_centroid.round(2))
len(spam_centroid)

In [None]:
# Computing the ham centroid
ham_centroid = tfidf_docs[~mask].mean(axis=0)
print(ham_centroid.round(2))
len(ham_centroid)

In [None]:
spam_centroid - ham_centroid

In [None]:
# Projecting each tf-idf vector onto the centroid difference: "the line between spam and ham"
spamminess_score = tfidf_docs.dot(spam_centroid - ham_centroid)
print(spamminess_score.round(2))
len(spamminess_score)

We did more than subtract the centroids: we also computed a dot product!

**spamminess_score** for message $i$ is $\mathbf{x}_i \cdot (centroid_{(spam)} - centroid_{(ham)})$, where $\mathbf{x}_i$ is the TF-IDF vector of that message

We compute it by projecting each TF-IDF vector onto the line between the centroids using the dot product (those were indeed 4,837 dot products computed at once!)
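The projection step is easier to inspect on a tiny example. The sketch below uses hypothetical toy data (5 "documents" with 3 features) and checks that a single matrix-vector product gives the same result as one dot product per row.

```python
import numpy as np

# Hypothetical toy data: 5 "documents" with 3 features each
X = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.1, 0.9, 0.1],
    [0.0, 0.8, 0.3],
    [0.2, 0.7, 0.0],
])
is_spam = np.array([True, True, False, False, False])

# Centroids: the mean of each column, computed separately per class
spam_centroid = X[is_spam].mean(axis=0)
ham_centroid = X[~is_spam].mean(axis=0)
direction = spam_centroid - ham_centroid

# One matrix-vector product computes one dot product per row of X
scores = X.dot(direction)

# The same computation written as an explicit loop
scores_loop = np.array([np.dot(row, direction) for row in X])

print(scores.round(2))
```

In this toy example every spam row gets a higher score than every ham row, which is what makes a single threshold on the score usable as a classifier.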

In [None]:
# Turning into "probabilities" and predictions
sms['lda_score'] = MinMaxScaler().fit_transform(spamminess_score.reshape(-1,1))
sms['lda_predict'] = (sms.lda_score > .5).astype(int)

sms['spam lda_predict lda_score'.split()].round(2).head(6)

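With its default range of 0 to 1, `MinMaxScaler` rescales a column as (x - min) / (max - min). The sketch below applies that formula by hand to a few hypothetical scores and then thresholds at 0.5, mirroring the cell above.

```python
import numpy as np

# Hypothetical raw spamminess scores
scores = np.array([-0.6, -0.4, 0.1, 0.6])

# Min-max scaling: the smallest score maps to 0, the largest to 1
scaled = (scores - scores.min()) / (scores.max() - scores.min())

# Threshold at 0.5, i.e. the midpoint of the observed score range
predictions = (scaled > 0.5).astype(int)

print(scaled.round(2))
print(predictions)
```

The scaled values are convenient for thresholding, but they are not calibrated probabilities: they depend only on the smallest and largest score in the data.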

In [None]:
# What is the accuracy of the model?
(1. - (sms.spam - sms.lda_predict).abs().sum() / len(sms)).round(3)

In [None]:
# Getting a confusion matrix
Confusion(sms['spam lda_predict'.split()])
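To make the two evaluation steps concrete, here is a small sketch that computes the accuracy and a 2x2 confusion matrix by hand with NumPy. The labels and predictions are hypothetical, and the layout (rows are true labels, columns are predicted labels) is an assumption of this sketch, not necessarily the layout that `Confusion` prints.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])   # hypothetical gold labels
y_pred = np.array([0, 1, 1, 0, 0, 1])   # hypothetical predictions

# Accuracy: 1 minus the fraction of mismatches (same formula as above)
accuracy = 1 - np.abs(y_true - y_pred).sum() / len(y_true)

# Confusion matrix: rows are true labels, columns are predicted labels
confusion = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    confusion[t, p] += 1

print(round(accuracy, 3))
print(confusion)
```

The diagonal holds the correct predictions, so the accuracy is also the sum of the diagonal divided by the total number of instances.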