### Obtain the TF-IDF matrix for the following documents

In [1]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.decomposition import NMF

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(documents)
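# get_feature_names() was removed in recent scikit-learn releases; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())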

['100', 'app', 'belly', 'best', 'came', 'cat', 'chrome', 'climbing', 'eating', 'extension', 'face', 'feedback', 'google', 'impressed', 'incredible', 'key', 'kitten', 'kitty', 'little', 'map', 'merley', 'ninja', 'open', 'photo', 'play', 'promoter', 'restaurant', 'smiley', 'squooshy', 'tab', 'taken', 'translate', 've']


There are 8 sentences and 33 vocabulary words, so the TF-IDF matrix has dimensions **8x33**.
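
As a quick sanity check on those dimensions, here is a small sketch that rebuilds the matrix and looks at its first row. The names `docs`, `row0` and `nonzero` are only for illustration. Every remaining word in the first sentence occurs once and appears in no other document, so, assuming `TfidfVectorizer`'s default L2 normalization, all six weights should equal 1/sqrt(6).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This little kitty came to play when I was eating at a restaurant.",
        "Merley has the best squooshy kitten belly.",
        "Google Translate app is incredible.",
        "If you open 100 tab in google you get a smiley face.",
        "Best cat photo I've ever taken.",
        "Climbing ninja cat.",
        "Impressed with google map feedback.",
        "Key promoter extension for Google Chrome."]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)
print(tfidf.shape)  # (number of documents, vocabulary size)

# Non-zero weights of the first document
row0 = tfidf.toarray()[0]
nonzero = row0[row0 > 0]
print(len(nonzero), nonzero)

# Six equally weighted terms in a unit-length row: each should be 1/sqrt(6)
print(np.allclose(nonzero, 1 / np.sqrt(6)))
```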

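Now we want to factor the TF-IDF matrix into two smaller matrices, W (documents x topics) and H (topics x terms), so that ```Tf_IDF ≈ W * H``` (a matrix product, from lin alg). The product is only an approximation because we keep far fewer topics than terms.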

In [4]:
num_of_topics = 2  # let's say we have 2 topics
nmf = NMF(n_components=num_of_topics, random_state=1).fit(tfidf)

### Why do we call it non-negative matrix factorization?

* The BoW and TF-IDF matrix elements are all non-negative (zero or positive)
* We want to build matrices W and H such that WH approximates the BoW or TF-IDF matrix
* All elements of W and H must also be non-negative

In [5]:
W = nmf.fit_transform(tfidf)
W

array([[0.        , 0.        ],
       [0.        , 0.45217213],
       [0.55735742, 0.        ],
       [0.49414046, 0.        ],
       [0.        , 0.74849032],
       [0.        , 0.5964714 ],
       [0.55735742, 0.        ],
       [0.52368298, 0.        ]])

In [6]:
H = nmf.components_
H.shape

(2, 33)
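
To tie the shapes back to the non-negativity requirement, here is a small self-contained sketch (the variable `docs` is just a local copy of the sentences above). It checks that W is documents x topics, H is topics x terms, that their product has the same shape as the TF-IDF matrix, and that none of the three matrices contains a negative entry.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This little kitty came to play when I was eating at a restaurant.",
        "Merley has the best squooshy kitten belly.",
        "Google Translate app is incredible.",
        "If you open 100 tab in google you get a smiley face.",
        "Best cat photo I've ever taken.",
        "Climbing ninja cat.",
        "Impressed with google map feedback.",
        "Key promoter extension for Google Chrome."]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

nmf = NMF(n_components=2, random_state=1)
W = nmf.fit_transform(tfidf)  # document-topic weights
H = nmf.components_           # topic-term weights

# Shapes: (docs, topics), (topics, terms), and the product matches tfidf
print(W.shape, H.shape, (W @ H).shape)

# Non-negativity of the input and of both factors
print(tfidf.min() >= 0, W.min() >= 0, H.min() >= 0)
```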

### Check how closely WH reconstructs the TF-IDF matrix

In [7]:
import numpy as np

# tfidf - np.dot(W, H)

In [10]:
tfidf - np.dot(W, H)

matrix([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.40824829,
          0.        ,  0.        ,  0.        ,  0.40824829,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.40824829,  0.40824829,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.40824829,
          0.        ,  0.40824829,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.37700057,  0.19533809,  0.        ,
         -0.24333424,  0.        , -0.14642602,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.37700057,  0.        ,  0.        ,  0.        ,
          0.37700057, -0.14642602,  0.        , -0.14392194,  0.        ,
          0.        ,  0.        ,  0.        ,  0.37700057,  0.        ,
         -0.14392194,  0.        , -0.14392194],
        [-0.10
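
The residual above is not zero, since two topics cannot reproduce all 33 term columns exactly. One way to summarize it as a single number is the Frobenius norm of the difference. The sketch below (with `docs`, `residual` and `frobenius` as illustrative names) computes that norm by hand and prints it next to `nmf.reconstruction_err_`, which should agree as long as NMF is run with its default Frobenius loss.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This little kitty came to play when I was eating at a restaurant.",
        "Merley has the best squooshy kitten belly.",
        "Google Translate app is incredible.",
        "If you open 100 tab in google you get a smiley face.",
        "Best cat photo I've ever taken.",
        "Climbing ninja cat.",
        "Impressed with google map feedback.",
        "Key promoter extension for Google Chrome."]

tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

nmf = NMF(n_components=2, random_state=1)
W = nmf.fit_transform(tfidf)
H = nmf.components_

# Dense residual between the original matrix and its rank-2 approximation
residual = tfidf.toarray() - W @ H

# Frobenius norm: square root of the sum of squared residual entries
frobenius = np.linalg.norm(residual)
print(frobenius, nmf.reconstruction_err_)
```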

### Obtain the top 10 words from H matrix:

In [9]:
n_top_words = 10
feature_names = vectorizer.get_feature_names_out()  # replaces the removed get_feature_names()

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic #0:
google feedback map app impressed incredible translate key extension chrome
Topic #1:
cat best climbing ninja ve photo taken belly merley kitten
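
The slice `topic.argsort()[:-n_top_words - 1:-1]` is compact but easy to misread. Here is a toy sketch with made-up weights and a hypothetical four-word vocabulary showing that it is the same as sorting the indices, reversing them, and keeping the first `n_top` entries.

```python
import numpy as np

weights = np.array([0.1, 0.9, 0.3, 0.7])    # made-up topic weights for four terms
terms = ['app', 'google', 'map', 'chrome']  # hypothetical vocabulary
n_top = 2

# Compact form used in the loop above: walk backwards from the largest weight
compact = weights.argsort()[:-n_top - 1:-1]

# Equivalent two-step form: reverse the ascending order, then take the first n_top
explicit = weights.argsort()[::-1][:n_top]

print(compact, explicit)
print([terms[i] for i in compact])
```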


# Now let's apply LDA to the same documents.

In [13]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [14]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

# no_features = 100
# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# LDA is a probabilistic model of word counts, so it is fit on raw term counts rather than tf-idf
tf_vectorizer = CountVectorizer(stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()

no_topics = 2

# Run NMF
nmf = NMF(n_components=no_topics, random_state=1).fit(tfidf)
# Run LDA
lda = LatentDirichletAllocation(n_components=no_topics).fit(tf)

no_top_words = 10
print('Topic Modelling with NMF:')
display_topics(nmf, tfidf_feature_names, no_top_words)
print(' ---- ')
print('Topic Modelling with LDA')
display_topics(lda, tf_feature_names, no_top_words)

Topic Modelling with NMF:
Topic 0:
google feedback map app impressed incredible translate key extension chrome
Topic 1:
cat best climbing ninja ve photo taken belly merley kitten
 ---- 
Topic Modelling with LDA
Topic 0:
google came kitty restaurant play eating little extension promoter key
Topic 1:
best cat kitten belly squooshy merley photo ve taken open


### Unsupervised -- thus, no labels to score against -- gotta evaluate the topics ourselves

# LDA: Which sentence belongs to which topic?

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# the dataset to predict on (the same sentences the model is trained on, so one can compare)
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

# Vectorize the documents; the vectorizer learns its vocabulary from them
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(documents)

# n_components sets the number of topics (older scikit-learn releases called this n_topics)
lda = LatentDirichletAllocation(n_components=2).fit(tf)

# transform returns a matrix with one row per document; each column is a topic weight
predict = lda.transform(tf)
print(predict)

[[0.04459171 0.95540829]
 [0.06658954 0.93341046]
 [0.91159976 0.08840024]
 [0.95671856 0.04328144]
 [0.07558235 0.92441765]
 [0.13248486 0.86751514]
 [0.1020867  0.8979133 ]
 [0.08515002 0.91484998]]
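
One simple way to read this matrix is to assign each sentence to its highest-weight topic. In this particular run the two Google sentences at index 6 and 7 end up with the cat sentences. The sketch below copies the printed weights and, as a rough do-it-yourself evaluation, compares the assignment with hand labels using `adjusted_rand_score`. The labels in `hand_labels` are my own assumption about which sentences belong together, not something the notebook defines.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Document-topic weights copied from the output above
predict = np.array([[0.04459171, 0.95540829],
                    [0.06658954, 0.93341046],
                    [0.91159976, 0.08840024],
                    [0.95671856, 0.04328144],
                    [0.07558235, 0.92441765],
                    [0.13248486, 0.86751514],
                    [0.1020867, 0.8979133],
                    [0.08515002, 0.91484998]])

# Hard assignment: index of the largest weight in each row
assigned = predict.argmax(axis=1)
print(assigned)

# Hypothetical hand labels: 1 = cat/kitten sentences, 0 = Google sentences
hand_labels = [1, 1, 0, 0, 1, 1, 0, 0]

# 1.0 would mean the grouping matches the hand labels exactly
ari = adjusted_rand_score(hand_labels, assigned)
print(ari)
```

The adjusted Rand index ignores how the topics are numbered, so it does not matter whether LDA calls the Google topic 0 or 1.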




### How does this work exactly?

Not sure about the details. BUT word order is NOT incorporated into this model, so

m = 'cat is kitty'
n = 'kitty is cat'

will have the same count vector. Each column only records how many times a vocabulary word appears:

 this little is ... cat kitty chrome

[0    0      1  ... 1   1     0     ]

There are ways to take word order into consideration, but NOT with TF-IDF or bag of words.

We can also do some kind of preprocessing to allow for the word order to be taken into account.

In [17]:
m = tf_vectorizer.transform(['cat is kitty'])
lda.transform(m)

array([[0.37686744, 0.62313256]])

One such preprocessing step involves **zero padding**.

We want to keep the word order.

And we want every document to have the same vector length.

All models need pre-defined features, so we can first create a dictionary of our vocab, use it to transform each document into a sequence of word indices, and then zero-pad the shorter sequences.

We usually use keras for deep learning, but here, we're gonna use keras for _text preprocessing_!

We are transforming our text into sequences.
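
Before handing this to a library, here is a plain-Python sketch of what "create a dictionary of our vocab and transform the text" can look like. It is not how Keras implements it. The `tokenize` helper and the ranking rule (most frequent word first, ties broken by first appearance) are assumptions chosen to mimic the behavior. Indices start at 1 so that 0 stays free for padding.

```python
import re
from collections import Counter

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs (hypothetical helper)."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Count every word across the corpus; Counter remembers first-appearance order
counts = Counter()
for doc in docs:
    counts.update(tokenize(doc))

# Most frequent words get the smallest indices; sorted() is stable, so ties keep
# their first-appearance order
ranked = sorted(counts, key=lambda word: -counts[word])
word_index = {word: i for i, word in enumerate(ranked, start=1)}

# Replace each word by its index; word order inside each document is preserved
sequences = [[word_index[w] for w in tokenize(doc)] for doc in docs]

print(word_index)
print(sequences)
```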

### Keras

In [19]:
from keras.preprocessing.text import Tokenizer
documents = ['This is the first document.',
            'This document is the second document.',
            'And this is the third one.',
            'Is this the first document?']
tok = Tokenizer()
tok.fit_on_texts(documents)
mat_texts = tok.texts_to_matrix(documents, mode='tfidf')
print(mat_texts)
X = tok.texts_to_sequences(documents)
print(X)
print('Word Index (This is a DICTIONARY)')
print(tok.word_index)
print('Word Counts:')
print(tok.word_counts)
print('Document Count:')
print(tok.document_count)
print('Words in Doc:')
print(tok.word_docs)

[[0.         0.58778666 0.58778666 0.58778666 0.69314718 0.84729786
  0.         0.         0.         0.        ]
 [0.         0.58778666 0.58778666 0.58778666 1.17360019 0.
  1.09861229 0.         0.         0.        ]
 [0.         0.58778666 0.58778666 0.58778666 0.         0.
  0.         1.09861229 1.09861229 1.09861229]
 [0.         0.58778666 0.58778666 0.58778666 0.69314718 0.84729786
  0.         0.         0.         0.        ]]
[[1, 2, 3, 5, 4], [1, 4, 2, 3, 6, 4], [7, 1, 2, 3, 8, 9], [2, 1, 3, 5, 4]]
Word Index (This is a DICTIONARY)
{'this': 1, 'is': 2, 'the': 3, 'document': 4, 'first': 5, 'second': 6, 'and': 7, 'third': 8, 'one': 9}
Word Counts:
OrderedDict([('this', 4), ('is', 4), ('the', 4), ('first', 2), ('document', 4), ('second', 1), ('and', 1), ('third', 1), ('one', 1)])
Document Count:
4
Words in Doc:
defaultdict(<class 'int'>, {'the': 4, 'this': 4, 'is': 4, 'first': 2, 'document': 3, 'second': 1, 'one': 1, 'third': 1, 'and': 1})
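
The sequences printed above have different lengths (5 or 6 indices), which is where the zero padding mentioned earlier comes in. Below is a minimal NumPy sketch of the idea. `zero_pad` is a hypothetical helper written for illustration. It pads at the front, which is also the default of Keras's own `pad_sequences` utility, and it relies on index 0 never being assigned to a real word.

```python
import numpy as np

# Sequences copied from the texts_to_sequences output above
sequences = [[1, 2, 3, 5, 4],
             [1, 4, 2, 3, 6, 4],
             [7, 1, 2, 3, 8, 9],
             [2, 1, 3, 5, 4]]

def zero_pad(seqs, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in seqs)
    padded = np.full((len(seqs), max_len), pad_value, dtype=int)
    for row, seq in enumerate(seqs):
        # Write the sequence into the right-hand end of its row
        padded[row, max_len - len(seq):] = seq
    return padded

padded = zero_pad(sequences)
print(padded.shape)
print(padded)
```

Every row now has the same length while the word order inside each document is unchanged.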
