# Text Analysis

In this lecture, we look at more recent methods of feature extraction and topic modeling.

We will cover the following:

- word2vec
- latent semantic analysis
- non-negative matrix factorization
- latent Dirichlet allocation

A technical guide to topic modeling can be found in these [lecture notes](http://pages.cs.wisc.edu/~jerryzhu/cs769/latent.pdf) but is outside the scope of this class.

## Similarity

In order to find similar words or documents after they have been vectorized, we need definitions of similarity. Similarity measures often used in text analysis include

- edit 
- cosine 
- Hellinger 
- Kullback-Leibler 
- Jaccard

Each of these may be expressed either as a similarity or as a distance.

### Edit 

The edit distance between two strings is the minimum number of changes needed to convert one string into the other. These changes may be weighted, for example, by giving a deletion a different weight than an insertion. When the allowed changes are single-character insertions, deletions and substitutions, this is also known as the Levenshtein distance.

Such distance metrics are the basis for aligning DNA, RNA and protein sequences.
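To make the definition concrete, here is a minimal dynamic-programming sketch of a weighted edit distance. The function name `edit_distance` and its cost parameters are illustrative assumptions; this is not how `textdistance` is implemented.

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Weighted edit distance between sequences a and b (hypothetical helper)."""
    # prev[j] holds the cost of turning the first i-1 items of a into the first j items of b
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]  # turning a[:i] into the empty string needs i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,                           # delete ca
                curr[j - 1] + ins_cost,                       # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(edit_distance('slaves', 'salve'))
print(edit_distance('abc', 'abd', sub_cost=3))  # a deletion plus an insertion is now cheaper
```

With unit costs this is the Levenshtein distance; raising `sub_cost` makes the algorithm prefer a deletion followed by an insertion over a substitution.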

In [1]:
! python3 -m pip install --quiet textdistance

In [2]:
import textdistance as td

In [3]:
td.levenshtein.distance('slaves', 'salve')

3

In [4]:
td.levenshtein.similarity('slaves', 'salve')

3

### Jaccard 

The Jaccard similarity is the size of the intersection divided by the size of the union of two sets; the Jaccard distance is one minus the similarity.

In [5]:
td.jaccard.similarity('the quick brown fox'.split(), 'the quick brown dog'.split())

0.6

Note that the implementation is actually for multisets.

In [6]:
td.jaccard.similarity('slaves', 'salve')

0.8333333333333334
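The difference between the set and multiset versions can be sketched with `collections.Counter`. The helper `multiset_jaccard` is a hypothetical name used for illustration, and returning 1.0 for two empty inputs is a convention chosen here, not something taken from `textdistance`.

```python
from collections import Counter

def multiset_jaccard(a, b):
    """Jaccard similarity where repeated items count (hypothetical helper)."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())  # element-wise minimum of counts
    union = sum((ca | cb).values())         # element-wise maximum of counts
    return intersection / union if union else 1.0

def set_jaccard(a, b):
    """Plain set-based Jaccard similarity, ignoring repeats."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

print(multiset_jaccard('slaves', 'salve'))  # the repeated 's' lowers the score
print(set_jaccard('slaves', 'salve'))       # both strings use the same set of letters
```

On sets the two strings are identical, so only the multiset version distinguishes them.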

### Cosine 

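Cosine similarity is defined for two real-valued vectors $x$ and $y$ as $\frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$, the cosine of the angle between them.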

In [7]:
s1 = 'the quick brown fox'
s2 = 'the quick brown dog'

In [8]:
td.cosine.similarity(s1.split(), s2.split())

0.75

Cosine similarity works on vectors - the default is just to use the bag of words counts.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
cv = CountVectorizer()

In [11]:
t = cv.fit_transform([s1, s2]).toarray()
t

array([[1, 0, 1, 1, 1],
       [1, 1, 0, 1, 1]])

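Cosine similarity is equivalent to the inner product of the two vectors after each has been normalized to length 1. Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, which is one minus the similarity.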

In [12]:
from scipy.spatial.distance import cosine

In [13]:
import numpy as np

In [14]:
np.around(1 - cosine(t[0], t[1]), 2)

0.75

In [15]:
np.dot(t[0]/np.linalg.norm(t[0]), t[1]/np.linalg.norm(t[1]))

0.75

### Hellinger

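The Hellinger distance is defined for two discrete probability distributions $p$ and $q$ as $H(p, q) = \frac{1}{\sqrt{2}} \lVert \sqrt{p} - \sqrt{q} \rVert_2$, and takes values between 0 and 1.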

In [16]:
from gensim.matutils import hellinger

In [17]:
p = t[0]/t[0].sum()
q = t[1]/t[1].sum()
p, q

(array([0.25, 0.  , 0.25, 0.25, 0.25]), array([0.25, 0.25, 0.  , 0.25, 0.25]))

In [18]:
hellinger(p, q)

0.5

In [19]:
def discrete_hellinger(p, q):
    return 1/np.sqrt(2) * np.linalg.norm(np.sqrt(p) - np.sqrt(q))

In [20]:
discrete_hellinger(p, q)

0.5

### Kullback-Leibler

In [21]:
t = cv.fit_transform(['one two three', 'one one one two two three']).toarray()
t

array([[1, 1, 1],
       [3, 1, 2]])

In [22]:
p = t[0]/t[0].sum()
q = t[1]/t[1].sum()
p, q

(array([0.33333333, 0.33333333, 0.33333333]),
 array([0.5       , 0.16666667, 0.33333333]))

In [23]:
from gensim.matutils import  kullback_leibler

In [24]:
kullback_leibler(p, q)

0.09589402415059362

The KL divergence is not symmetric: swapping the arguments generally gives a different value.

In [25]:
kullback_leibler(q, p)

0.08720802396075798

In [26]:
def discrete_dkl(p, q):
    return -np.sum(p * (np.log(q) - np.log(p)))

In [27]:
discrete_dkl(p, q)

0.09589402415059356

In [28]:
discrete_dkl(q, p)

0.08720802396075805
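For reference, the quantity computed by `discrete_dkl` is the discrete KL divergence. The worked arithmetic below is a hand check using the `p` and `q` from this example, with natural logarithms.

$$
D_{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}
$$

$$
D_{KL}(p \,\|\, q) = \tfrac{1}{3}\log\tfrac{1/3}{1/2} + \tfrac{1}{3}\log\tfrac{1/3}{1/6} + \tfrac{1}{3}\log\tfrac{1/3}{1/3} = \tfrac{1}{3}\log\tfrac{4}{3} \approx 0.0959
$$

$$
D_{KL}(q \,\|\, p) = \tfrac{1}{2}\log\tfrac{1/2}{1/3} + \tfrac{1}{6}\log\tfrac{1/6}{1/3} + \tfrac{1}{3}\log\tfrac{1/3}{1/3} = \tfrac{1}{2}\log\tfrac{3}{2} - \tfrac{1}{6}\log 2 \approx 0.0872
$$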

## Word2Vec

The `word2vec` family of algorithms is a powerful method for converting a word into a vector that takes into account its context. There are two main ideas - in continuous bag of words, we try to predict the current word from nearby words; in continuous skip-gram, the current word is used to predict nearby words. The phrase "nearby words" is intentionally vague - in the simplest case, it is a sliding window of words centered on the current word.

Suppose we have the sentence

```
I do not like green eggs and ham
```

and suppose we use a centered window of length 3,

```
((I, not), do), ((do, like), not), ((not, green), like), ((like, eggs), green), ((green, and), eggs), ((eggs, ham), and)
```

In continuous bag of words, we make the (input, output) pairs to be
```
(I, do)
(not, do)
(do, not)
(like, not)
(not, like)
(green, like)
(like, green)
(eggs, green)
(green, eggs)
(and, eggs)
(eggs, and)
(ham, and)
```

That is, we try to predict `do` when we see `I`, `do` when we see `not` and so on.

In continuous skip-gram, we do the inverse for (input, output) pairs
```
(do, I)
(do, not)
(not, do)
(not, like)
(like, not)
(like, green)
(green, like)
(green, eggs)
(eggs, green)
(eggs, and)
(and, eggs)
(and, ham)
```

That is, we try to predict `I` when we see `do`, `not` when we see `do` and so on.
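The two lists of pairs above can be generated with a few lines of Python. This is only a sketch of the windowing step; the helper name `window_pairs` and its `half_width` parameter are assumptions made for illustration and are not part of `gensim`.

```python
def window_pairs(tokens, half_width=1):
    """Return (context word, center word) pairs for a centered sliding window."""
    pairs = []
    # only positions with a full window on both sides are used, as in the example above
    for i in range(half_width, len(tokens) - half_width):
        center = tokens[i]
        context = tokens[i - half_width:i] + tokens[i + 1:i + half_width + 1]
        pairs.extend((c, center) for c in context)
    return pairs

tokens = "I do not like green eggs and ham".split()

cbow_pairs = window_pairs(tokens)                            # (input=context, output=center)
skipgram_pairs = [(center, c) for c, center in cbow_pairs]   # (input=center, output=context)

print(cbow_pairs[:4])
print(skipgram_pairs[:4])
```

A window of length 3 corresponds to `half_width=1`; larger windows simply add more context words per center word.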

To do this prediction, we first assign each word to a vector of some fixed length $n$ - i.e. we embed each word as an $\mathbb{R}^n$ vector. To do a prediction for all words in the vocabulary using `softmax` would be prohibitively expensive, and is unnecessary if we are just trying to find a good embedding vector. Instead we select $k$ noise words, typically from the unigram distribution, and just train the classifier to distinguish the target word from the noise words using logistic regression (negative sampling). We use stochastic gradient descent to move the embedding word vectors (initialized randomly) until the model gives a high probability to the target words and a low probability to the noise ones. If successful, words that are meaningful when substituted in the same context will be close together in $\mathbb{R}^n$. For instance, `dog` and `cat` are likely to be close together because they appear in similar contexts like

- `My pet dog|cat`
- `Raining dogs|cats and cats|dogs`
- `The dog|cat chased the rat`
- `Common pets are dogs|cats`

while `dog` and `apple` are less likely to occur in the same context and hence will end up further apart in the embedding space. Interestingly, the vectors resulting from vector subtraction are also meaningful since they represent analogies - the vector between `man` and `woman` is likely to be similar to that between `king` and `queen`, or `boy` and `girl`.
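The negative-sampling idea described above can be sketched in NumPy as a single stochastic gradient step. This is a toy illustration under stated assumptions, not the `gensim` implementation: the vocabulary size, dimension, learning rate and word indices are all made up, and the noise words are fixed rather than sampled from the unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 50, 8                                 # hypothetical toy sizes
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # embedding vectors of input words
W_out = np.zeros((vocab_size, dim))                     # classifier weights for output words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, target, noise, lr=0.1):
    """One SGD step on a (center, target) pair plus noise words; returns the loss before the update."""
    ids = np.concatenate(([target], noise))   # true output word first, then the noise words
    labels = np.zeros(len(ids))
    labels[0] = 1.0                           # 1 for the true word, 0 for noise
    v = W_in[center].copy()
    u = W_out[ids]                            # fancy indexing returns a copy
    p = sigmoid(u @ v)                        # predicted probability that each word is "real"
    loss = -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    grad = p - labels                         # derivative of the logistic loss w.r.t. each score
    W_in[center] -= lr * (grad @ u)
    np.subtract.at(W_out, ids, lr * np.outer(grad, v))
    return loss

noise = np.array([11, 19, 23, 31, 42])        # fixed here; normally drawn from the unigram distribution
losses = [sgns_step(center=3, target=7, noise=noise) for _ in range(200)]
print(losses[0], losses[-1])
```

Each step only touches the vectors of the $k + 1$ words involved, which is what makes negative sampling so much cheaper than a full `softmax` over the vocabulary.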

Note: you will encounter `word2vec` again if you take a deep learning class - it is a very influential idea and has many applications beyond text processing since you can apply it to any discrete distribution where local context is meaningful (e.g. genomes). 

There is a very nice tutorial on Word2Vec that you should read if you want to learn more about the algorithm: [Part 1](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [Part 2](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/).

Word2Vec learns about the feature representations of *words* - there is an extension `doc2vec` that generates a feature vector for paragraphs or documents in the same way; we may cover this in the next lecture along with other document retrieval algorithms.

There are several other word-to-vector algorithms inspired by `word2vec`, for example [`fasttext`](https://fasttext.cc) and `wordrank`, and the resulting vectors can be searched efficiently with [approximate nearest neighbors](https://github.com/spotify/annoy) libraries. Conveniently, many of these are available in the `gensim.models` package.

We illustrate the mechanics of `word2vec` using `gensim` on the tiny newsgroup corpus; however, you really need much larger corpora for `word2vec` to learn effectively.

In [29]:
import re
import numpy as np
import pandas as pd

In [30]:
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.collocations import QuadgramCollocationFinder, TrigramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures, TrigramAssocMeasures
import string

In [31]:
import gensim
from gensim.models.word2vec import Word2Vec

In [32]:
from sklearn.datasets import fetch_20newsgroups

In [33]:
import warnings

warnings.simplefilter('ignore', FutureWarning)

In [34]:
newsgroups_train = fetch_20newsgroups(
        subset='train',
        remove=('headers', 'footers', 'quotes')
)

In [35]:
newsgroups_test = fetch_20newsgroups(
        subset='test',
        remove=('headers', 'footers', 'quotes')
)

In [36]:
def gen_sentences(corpus):
    for item in corpus:
        yield from nltk.tokenize.sent_tokenize(item)

In [37]:
for i, t in enumerate(newsgroups_train.target[:20]):
    print('%-24s:%s' % (newsgroups_train.target_names[t], 
                        newsgroups_train.data[i].strip().replace('\n', ' ')[:50]))

rec.autos               :I was wondering if anyone out there could enlighte
comp.sys.mac.hardware   :A fair number of brave souls who upgraded their SI
comp.sys.mac.hardware   :well folks, my mac plus finally gave up the ghost 
comp.graphics           :Do you have Weitek's address/phone number?  I'd li
sci.space               :From article <C5owCB.n3p@world.std.com>, by tombak
talk.politics.guns      :Of course.  The term must be rigidly defined in an
sci.med                 :There were a few people who responded to my reques
comp.sys.ibm.pc.hardware:ALL this shows is that YOU don't know much about S
comp.os.ms-windows.misc :I have win 3.0 and downloaded several icons and BM
comp.sys.mac.hardware   :I've had the board for over a year, and it does wo
rec.motorcycles         :I have a line on a Ducati 900GTS 1978 model with 1
talk.religion.misc      :Yep, that's pretty much it. I'm not a Jew but I un
comp.sys.mac.hardware   :--
sci.space               :{Description of "External Tank" opt

In [38]:
list(newsgroups_train.data[:3])

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't an

In [39]:
list(gen_sentences(newsgroups_train.data[:2]))[:3]

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day.',
 'It was a 2-door sports car, looked to be from the late 60s/\nearly 70s.',
 'It was called a Bricklin.']

In [40]:
from gensim.parsing.preprocessing import STOPWORDS

In [41]:
docs = [gensim.utils.simple_preprocess(s) 
        for s in newsgroups_train.data]
docs = [[s for s in doc if s not in STOPWORDS] for doc in docs]

In [42]:
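try:
    model = Word2Vec.load('newsgroup_w2v.model')
except FileNotFoundError:
    # Note: `size` is the gensim 3.x argument name; gensim 4 renamed it to `vector_size`
    model = Word2Vec(docs,
                     size=64, # we use 64 dimensions to represent each word
                     window=5, # size of each context window
                     min_count=3, # ignore words with frequency less than this
                     workers=4)
    # passing `docs` to the constructor already trains the model; this trains for 10 more epochs
    model.train(docs, total_examples=len(docs), epochs=10)
    model.save('newsgroup_w2v.model')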

In [43]:
len(model.wv.vocab)

28036

The embedding vector for the word `england`

In [44]:
model.wv.word_vec('england')

array([-0.56349105, -0.69237125,  0.62378424,  0.42315495,  0.6060761 ,
        0.8513135 , -1.4870658 ,  1.8713725 ,  1.0691187 , -0.40534988,
        0.13010648, -0.22015332,  1.2252256 ,  0.43524563,  0.16700111,
       -0.870683  , -0.3986146 ,  0.8673006 , -1.3044095 ,  0.27732578,
        0.13746515, -0.27391872, -0.40350547,  0.1196682 , -0.12516008,
        1.094012  ,  0.35382795,  0.28467032, -0.06392957, -0.5844203 ,
       -0.6216453 , -0.9612679 , -1.0871466 ,  0.8269012 ,  1.8319201 ,
        0.05405332, -0.870298  , -0.63468206, -0.53045857,  0.17161274,
       -1.5805042 ,  0.57658744, -0.6095063 ,  2.0426319 ,  0.4949136 ,
        0.37475356,  0.80208707, -0.60350955,  0.8224899 ,  0.33021414,
       -0.64974153, -0.21154499,  0.26340187, -0.68231577, -1.5057738 ,
        0.20943005,  0.938601  ,  1.1808133 ,  0.5425614 ,  1.5063051 ,
        0.28405553,  0.98842007, -0.6394381 ,  0.41372457], dtype=float32)

In [45]:
model.wv.most_similar('england', topn=5)

[('york', 0.8661391139030457),
 ('mexico', 0.844334602355957),
 ('london', 0.7788676023483276),
 ('hampshire', 0.7766897678375244),
 ('county', 0.76786208152771)]

In [46]:
model.wv.similarity('england', 'france')

0.69574744

In [47]:
model.wv.similarity('england', 'rabbit')

0.31536517

Apparently, woman is to baseball as man is to the NHL. Who knew?

In [48]:
model.wv.most_similar(positive=['baseball', 'man'], negative=['woman'], topn=3)

[('nhl', 0.7390564680099487),
 ('hockey', 0.7029756903648376),
 ('games', 0.6984614133834839)]

Because of the small and very biased data sets (including `soc.religion.christian` and `alt.atheism`), some of the analogies found are pretty weird.

In [49]:
model.wv.most_similar(positive=['father', 'son'],
                      negative=['mother'])

[('ye', 0.8620442748069763),
 ('abraham', 0.8424699902534485),
 ('unto', 0.8400495052337646),
 ('angel', 0.8375562429428101),
 ('isaiah', 0.8345378637313843),
 ('prophet', 0.82969069480896),
 ('hath', 0.8248409032821655),
 ('apostles', 0.8244911432266235),
 ('ascended', 0.8211110234260559),
 ('messenger', 0.820909857749939)]

## Doc2Vec

The `doc2vec` algorithm is basically the same as `word2vec` with the addition of a paragraph or document context vector. That is, certain words may be used differently in different types of documents, and this is captured in the vector representing the paragraph or document.

![img](https://cdn-images-1.medium.com/max/1600/0*x-gtU4UlO8FAsRvL.)

In [50]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [51]:
tagged_docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]

In [52]:
try:
    model = Doc2Vec.load('newsgroup_d2v.model')
except FileNotFoundError:
    model = Doc2Vec(tagged_docs,
                    vector_size=10, # we use 10 dimensions to represent each doc
                    window=5, # size of each context window
                    min_count=3, # ignore words with frequency less than this
                    workers=4)
    # passing `tagged_docs` to the constructor already trains the model; this trains for 10 more epochs
    model.train(tagged_docs, total_examples=len(tagged_docs), epochs=10)
    model.save('newsgroup_d2v.model')

In [53]:
query = newsgroups_test.data[9]

In [54]:
query = [token for token in gensim.utils.simple_preprocess(query) 
         if token not in STOPWORDS]

In [55]:
vector = model.infer_vector(query)
vector

array([ 0.61702985,  0.36884686, -0.07321872,  0.05467742,  0.06055934,
       -0.43449533,  0.19031854,  0.05017118,  0.15060216,  0.7359674 ],
      dtype=float32)

In [56]:
model.docvecs.most_similar([vector])

[(215, 0.9346252679824829),
 (3840, 0.9333208799362183),
 (4535, 0.9313441514968872),
 (7400, 0.9279981851577759),
 (10697, 0.9263129234313965),
 (4083, 0.9233070611953735),
 (4915, 0.9226728677749634),
 (8966, 0.9175481796264648),
 (1398, 0.9162013530731201),
 (4715, 0.9149473905563354)]

In [57]:
print(newsgroups_test.data[9])
for i, score in model.docvecs.most_similar([vector], topn=5):
    print('-'*80)
    print(newsgroups_train.data[i])

:  
: well, i have lots of experience with scanning in images and altering
: them.  as for changing them back into negatives, is that really possible?

: (stuff deleted)

: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful
: undertones)

I use Aldus Photostyler on the PC and I can turn a colour or black and white
image into a negative or turn a negative into a colour or black and white
image.  I don't know how it does it but it works well.  To test it I scanned
a negative and used Aldus to create a positive.  It looked better than the
print that the film developers gave me.


-- 
--------------------------------------------------------------------------------


MSG is mono sodium glutamate, a fairly straight forward compound. If it is
pure, the source should not be a problem. Your comment suggests that 
impurities may be the cause.
My experience of MSG effects (as part of a double blind study) was that the
pure stuff caused me some rather severe effects.


Soya bean

## Latent Semantic Indexing (LSI)

### Concept

Latent semantic indexing is basically using SVD to find a low rank approximation to the document/word feature matrix.

Recall that with SVD, $A = U \Sigma V^T$. With LSI, we interpret the matrices as

$$
\begin{array}{ccccc}
A &= & T & \Sigma & D^T  \\
(t \times d) &= & (t \times n) & (n \times n) & (n \times d)
\end{array}
$$

where $T$ is a mnemonic for Term and $D$ is a mnemonic for Document.

If we use $r$ singular values, we reconstruct the rank-$r$ matrix $A_r$ as 

$$
\begin{array}{ccccc}
A_r &= & \hat{T} & \hat{\Sigma} & \hat{D}^T  \\
(t \times d) &= & (t \times r) & (r \times r) & (r \times d)
\end{array}
$$

or as the sum of outer products

$$
A_r = \sum_{k=1}^{r} \sigma_k t_k d_k^T
$$

The $r$ columns of $\hat{T}$ are the basis vectors for the rotated lower-dimensional coordinate system, and we can consider each of the $r$ columns of $\hat{T}$ as representing a topic. The value of $\hat{T}_{ij}$ is the weight of the $i^\text{th}$ term for topic $j$.

### Queries

Suppose we have a new document $x$ with dimensions $t \times 1$. We convert it to the $\hat{T}$ space by a change-of-basis transformation

$$
x^* = \hat{T}^T x
$$

which you can check will have dimensions $r \times 1$.

To find which documents are similar to $x$, we look for the original documents that are close to $x^*$ in the $\hat{T}$ space, that is, for the columns of $\hat{\Sigma} \hat{D}^T$ (with dimensions $r \times d$) that are closest to $x^*$.
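The query procedure can be sketched directly with NumPy's SVD. The small term-document matrix below is made up for illustration, and reusing document 0 as the query is just a convenient sanity check; `gensim` handles all of this internally in the example that follows.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents (made-up counts)
A = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 2],
              [1, 1, 0, 0]], dtype=float)

r = 2                                              # number of singular values to keep
T, s, Dt = np.linalg.svd(A, full_matrices=False)   # A = T @ diag(s) @ Dt
T_hat, S_hat, Dt_hat = T[:, :r], np.diag(s[:r]), Dt[:r, :]

docs_r = S_hat @ Dt_hat       # r x d: each column is a document in the reduced space
x = A[:, 0]                   # reuse document 0 as the query
x_star = T_hat.T @ x          # change of basis: r-dimensional representation of the query

def cosine_sim(u, V):
    """Cosine similarity between vector u and every column of V."""
    return (V.T @ u) / (np.linalg.norm(V, axis=0) * np.linalg.norm(u))

sims = cosine_sim(x_star, docs_r)
print(np.round(sims, 3))
```

Because the query is one of the original documents, its projection coincides with the corresponding column of $\hat{\Sigma} \hat{D}^T$, so its similarity to itself is 1.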

### Example of LSI

In [58]:
len(docs)

11314

In [59]:
dictionary = gensim.corpora.Dictionary(docs)

In [60]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [61]:
lsi = gensim.models.LsiModel(corpus, num_topics=10, id2word=dictionary)

In [62]:
for i, topic in lsi.print_topics(num_words=5):
    print(topic)

0.997*"ax" + 0.072*"max" + 0.009*"pl" + 0.007*"ei" + 0.006*"tm"
-0.230*"cx" + -0.225*"db" + -0.225*"file" + -0.177*"hz" + -0.159*"edu"
-0.267*"file" + 0.214*"cx" + -0.182*"edu" + 0.172*"hz" + 0.152*"ww"
-0.868*"db" + -0.216*"mov" + -0.189*"bh" + -0.136*"si" + -0.127*"cs"
0.438*"file" + 0.272*"output" + -0.181*"people" + 0.177*"entry" + -0.175*"know"
0.246*"file" + -0.236*"edu" + 0.230*"mr" + 0.224*"di" + 0.188*"pl"
0.290*"di" + 0.251*"pl" + 0.223*"wm" + 0.222*"tm" + 0.188*"um"
-0.493*"jpeg" + -0.261*"file" + -0.251*"image" + -0.201*"gif" + -0.177*"mr"
-0.303*"mr" + -0.259*"file" + 0.253*"jpeg" + -0.235*"stephanopoulos" + -0.178*"edu"
0.371*"stephanopoulos" + 0.309*"mr" + -0.269*"file" + 0.181*"president" + -0.175*"gun"


In [63]:
for i, t in enumerate(newsgroups_test.target[:20]):
    print('%02d %-24s:%s' % (i, newsgroups_test.target_names[t], 
                        newsgroups_test.data[i].strip().replace('\n', ' ')[:50]))

00 rec.autos               :I am a little confused on all of the models of the
01 comp.windows.x          :I'm not familiar at all with the format of these "
02 alt.atheism             :In a word, yes.
03 talk.politics.mideast   :They were attacking the Iraqis to drive them out o
04 talk.religion.misc      :I've just spent two solid months arguing that no s
05 sci.med                 :Elisabeth, let's set the record straight for the n
06 soc.religion.christian  :Dishonest money dwindles away, but he who gathers 
07 soc.religion.christian  :A friend of mine managed to get a copy of a comput
08 comp.windows.x          :Hi,     We have a requirement for dynamically clos
09 comp.graphics           ::   : well, i have lots of experience with scannin
10 comp.os.ms-windows.misc :I have uploaded the Windows On-Line Review sharewa
11 comp.windows.x          :Most graphics systems I have seen have drawing rou
12 talk.politics.mideast   :You *know* that putting something like this out on
13 rec.m

#### Find topics in document

Note that topics are rather hard to interpret. After all, they are just the words with the largest weights in the low rank approximation.

In [64]:
query = newsgroups_test.data[9]
query

":  \n: well, i have lots of experience with scanning in images and altering\n: them.  as for changing them back into negatives, is that really possible?\n\n: (stuff deleted)\n\n: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful\n: undertones)\n\nI use Aldus Photostyler on the PC and I can turn a colour or black and white\nimage into a negative or turn a negative into a colour or black and white\nimage.  I don't know how it does it but it works well.  To test it I scanned\na negative and used Aldus to create a positive.  It looked better than the\nprint that the film developers gave me.\n\n\n-- "

In [65]:
query = gensim.utils.simple_preprocess(query)
query[:3]

['well', 'have', 'lots']

In [66]:
query = dictionary.doc2bow(query)

In [67]:
query = lsi[query]

In [68]:
sorted(query, key=lambda x: -x[1])[:5]

[(8, 0.6821816482262105),
 (3, 0.08967435368416013),
 (9, 0.05678673314421907),
 (0, 6.550817096540486e-05),
 (6, -0.027705407390045133)]

In [69]:
topics = [i for i, score in sorted(query, key=lambda x: -x[1])[:5]]

In [70]:
lsi.print_topic(topics[0])

'-0.303*"mr" + -0.259*"file" + 0.253*"jpeg" + -0.235*"stephanopoulos" + -0.178*"edu" + 0.162*"people" + -0.153*"gun" + -0.134*"president" + 0.124*"image" + 0.121*"output"'

In [71]:
pat = re.compile(r'.*?(-)?\d+.*?\"(\w+)\"')

In [72]:
for topic in topics:
    words = [''.join(pair) for pair in pat.findall(lsi.print_topic(topic))]
    print(','.join(words))

-mr,-file,jpeg,-stephanopoulos,-edu,people,-gun,-president,image,output
-db,-mov,-bh,-si,-cs,-byte,hz,-bl,-di,-al
stephanopoulos,mr,-file,president,-gun,output,edu,-people,program,entry
ax,max,pl,ei,tm,bhj,giz,di,ey,wm
di,pl,wm,tm,um,edu,-stephanopoulos,bxn,-know,giz


#### Find similar documents

In [73]:
index = gensim.similarities.MatrixSimilarity(lsi[corpus])

In [74]:
sims = index[query]

In [75]:
hits = sorted(enumerate(sims), key=lambda x: -x[1])[:5]
hits

[(5755, 0.993729),
 (8226, 0.99354684),
 (16, 0.99055254),
 (10697, 0.9892783),
 (6864, 0.9879426)]

In [76]:
print(newsgroups_test.data[9])
for match in [newsgroups_train.data[k] for k, score in hits]:
    print('-'*80)
    print(match)

:  
: well, i have lots of experience with scanning in images and altering
: them.  as for changing them back into negatives, is that really possible?

: (stuff deleted)

: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful
: undertones)

I use Aldus Photostyler on the PC and I can turn a colour or black and white
image into a negative or turn a negative into a colour or black and white
image.  I don't know how it does it but it works well.  To test it I scanned
a negative and used Aldus to create a positive.  It looked better than the
print that the film developers gave me.


-- 
--------------------------------------------------------------------------------

Why didn't you create 8 grey-level images, and display them for
1,2,4,8,16,32,64,128... time slices?

This requires the same total exposure time, and the same precision in
timing, but drastically reduces the image-preparation time, no?






---------------------------------------------------------------------

$$
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
$$

## Non-negative Matrix Factorization

The topics generated by LSI can be hard to understand because they include negative weights for words. They may also be hard to understand because they need not correspond to topics as a human reader would define them. Remember, an LSI topic is just one component of a low rank approximation that minimizes the Frobenius norm of the reconstruction error. An alternative is non-negative matrix factorization (NMF), which does not assign negative weights to the words in a topic.

NMF performs the following decomposition

$$
\begin{array}{cccc}
A &= & W & H  \\
(t \times d) &= & (t \times n) & (n \times d)
\end{array}
$$

using an iterative procedure to minimize the squared Frobenius norm $\norm{A - WH}_F^2$ subject to the constraint that $W, H \geq 0$ elementwise. There are several different methods to perform this iterative minimization that do not concern us here.

NMF basically finds a different set of basis vectors (not the eigenvectors of the covariance matrix) to project onto. The vectors point in the direction of clusters of word features that appear in common across multiple documents.

![nmf_svd](https://qph.fs.quoracdn.net/main-qimg-9b4e31ec4b57f4baf7d08d5df17c6bc0)
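For the curious, one of the simplest iterative schemes is the multiplicative update rule of Lee and Seung. The sketch below is only meant to show that the factors stay non-negative while the reconstruction error goes down; it is not necessarily the solver that `scikit-learn` uses, and the function name, toy matrix and iteration count are assumptions chosen for illustration.

```python
import numpy as np

def nmf_multiplicative(A, n_components, n_iter=200, eps=1e-10, seed=0):
    """Approximate A (t x d, non-negative) as W @ H using multiplicative updates."""
    rng = np.random.default_rng(seed)
    t, d = A.shape
    W = rng.random((t, n_components)) + eps   # non-negative random start
    H = rng.random((n_components, d)) + eps
    errors = []
    for _ in range(n_iter):
        # each update multiplies by a non-negative ratio, so W and H never become negative
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(A - W @ H, 'fro') ** 2)
    return W, H, errors

# Toy term-document matrix (made-up counts): rows are terms, columns are documents
A_toy = np.array([[2, 0, 1, 0],
                  [1, 0, 2, 0],
                  [0, 3, 0, 1],
                  [0, 1, 0, 2],
                  [1, 1, 0, 0]], dtype=float)

W_toy, H_toy, errors = nmf_multiplicative(A_toy, n_components=2)
print(errors[0], errors[-1])
```

Here the columns of `W_toy` play the role of topics (weights over terms) and the columns of `H_toy` give the topic weights for each document.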

In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

In [78]:
import warnings

In [79]:
warnings.simplefilter('ignore', FutureWarning)

In [80]:
from sklearn.decomposition import NMF

In [81]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups_train.data)
X = normalize(X, norm='l1', axis=1)
model = NMF(n_components=10, init='random', random_state=0)
W = model.fit_transform(X)

In [82]:
W.shape

(11314, 10)

In [83]:
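# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
vocab = vectorizer.get_feature_names()
for i, topic in enumerate(model.components_):
    print("Topic %d:" % i, end=' ')
    # indices of the 10 largest weights in this topic, in decreasing order
    print(" ".join([vocab[j] for j in topic.argsort()[:-10 - 1:-1]]))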

Topic 0: the of to and in is that on not it
Topic 1: you to it for have your if thanks me what
Topic 2: hello testing please networld xelm mailreader looking andreas am mail
Topic 3: each chris 00 postage for answered are sale includes usa
Topic 4: was that he thought oh it just bet sunpost411ld chicago
Topic 5: hi thanks appreciated anyone advance any windows for anybody card
Topic 6: test this is thanks message only tesrt it putt david
Topic 7: ax max g9v b8f a86 pl 1d9 1t 3t 145
Topic 8: ditto me too here for cdt he copy let hillary
Topic 9: deletion god why atheism alt is concidered gospels exist drivel


How would you find documents similar to the query document? 

## Latent Dirichlet Allocation

- [Original paper](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

In [84]:
import numpy as np

### The Dirichlet distribution

A random sample from a Dirichlet distribution is a vector of non-negative numbers that sum to 1, that is, the parameter vector of a multinomial distribution; it has the same length as the Dirichlet concentration parameter $\alpha$.

In [85]:
α = np.array([1,2,3])
for i in range(5):
    print(np.random.dirichlet(α))

[0.42000365 0.2330118  0.34698455]
[0.04387792 0.42570748 0.5304146 ]
[0.0853653  0.14306953 0.77156517]
[0.01721692 0.59394126 0.38884182]
[0.04981342 0.21369137 0.73649521]


#### Relationship between $\alpha$ and samples

In [86]:
α/α.sum()

array([0.16666667, 0.33333333, 0.5       ])

In [87]:
n = int(1e6)
np.random.dirichlet(α, n).mean(axis=0)

array([0.16672605, 0.33334275, 0.4999312 ])
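The agreement between the two outputs above reflects the mean of the Dirichlet distribution. Writing $\theta$ for a sample (a symbol introduced here only for illustration), the expected value of each component is its share of the total concentration:

$$
\mathbb{E}[\theta_i] = \frac{\alpha_i}{\sum_j \alpha_j}, \qquad \text{so for } \alpha = (1, 2, 3): \quad \mathbb{E}[\theta] = \left(\tfrac{1}{6}, \tfrac{1}{3}, \tfrac{1}{2}\right)
$$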

### Concept of LDA

LDA is a generative model - that is, it provides a probability distribution from which we can generate documents, each of which is composed of generated words. We sketch the generative process here; the MCMC machinery that is used for implementation is not covered in this course (but will be in STA 663).

- There are $M$ documents
  - A document consists of the words $w_{1:N}$
- There are $K$ topics $\varphi_{1:K}$ from which we can choose words from a vocabulary of length $V$
  - For each topic
    - Sample a topic $\varphi$ from a Dirichlet distribution with parameter $\beta$
    - Each topic $\varphi$ is a multinomial distribution of size $V$
- There are $N$ words in a document
  - For each document
    - Sample a topic multinomial $\theta$ of size $K$ from a different Dirichlet distribution with parameter $\alpha$
    - Repeat for each word position in the document
      - Sample the integer index $z$ from $\theta$
      - Sample a word $w$ from the topic $\varphi_z$ 
      
![lda](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)
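The generative process above can be sketched in a few lines of NumPy. This only simulates documents from the model; it does not perform inference, and the sizes and Dirichlet parameters are arbitrary values assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 8, 4, 10          # topics, vocabulary size, documents, words per document
alpha = np.full(K, 0.5)           # Dirichlet parameter for the document-topic distributions
beta = np.full(V, 0.5)            # Dirichlet parameter for the topic-word distributions

# For each topic, sample a multinomial over the vocabulary: phi has shape (K, V)
phi = rng.dirichlet(beta, size=K)

toy_docs = []
for _ in range(M):
    theta = rng.dirichlet(alpha)                         # topic proportions for this document
    z = rng.choice(K, size=N, p=theta)                   # topic index for each word position
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # word index drawn from the chosen topic
    toy_docs.append(w)

print(toy_docs[0])
```

Each generated document is an array of word indices; fitting LDA amounts to recovering `phi` and the per-document `theta` from such data alone.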

### Example of LDA

In [88]:
from gensim.models.ldamodel import LdaModel

In [89]:
lda = LdaModel(corpus, num_topics=10, id2word=dictionary)

In [90]:
for i, topic in lda.print_topics(num_words=5):
    print(topic)

0.006*"use" + 0.005*"people" + 0.004*"law" + 0.004*"like" + 0.004*"government"
0.010*"like" + 0.008*"drive" + 0.007*"know" + 0.005*"ve" + 0.005*"new"
0.015*"la" + 0.011*"det" + 0.011*"van" + 0.011*"ax" + 0.010*"vs"
0.616*"ax" + 0.046*"max" + 0.008*"pl" + 0.005*"ei" + 0.004*"tm"
0.017*"cx" + 0.014*"c_" + 0.009*"ax" + 0.009*"hz" + 0.009*"qs"
0.011*"edu" + 0.008*"com" + 0.006*"mail" + 0.005*"space" + 0.005*"ripem"
0.010*"people" + 0.009*"god" + 0.006*"think" + 0.006*"know" + 0.004*"like"
0.008*"use" + 0.007*"windows" + 0.007*"key" + 0.006*"file" + 0.006*"data"
0.005*"good" + 0.005*"time" + 0.005*"space" + 0.004*"team" + 0.004*"think"
0.006*"year" + 0.006*"car" + 0.004*"gun" + 0.004*"new" + 0.003*"good"


In [91]:
query = newsgroups_test.data[9]
query

":  \n: well, i have lots of experience with scanning in images and altering\n: them.  as for changing them back into negatives, is that really possible?\n\n: (stuff deleted)\n\n: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful\n: undertones)\n\nI use Aldus Photostyler on the PC and I can turn a colour or black and white\nimage into a negative or turn a negative into a colour or black and white\nimage.  I don't know how it does it but it works well.  To test it I scanned\na negative and used Aldus to create a positive.  It looked better than the\nprint that the film developers gave me.\n\n\n-- "

In [92]:
query = gensim.utils.simple_preprocess(query)
query[:3]

['well', 'have', 'lots']

In [93]:
query = dictionary.doc2bow(query)

In [94]:
query = lda[query]

#### Topics in query document

In [95]:
sorted(query, key=lambda x: -x[1])[:5]

[(1, 0.3583182),
 (7, 0.3494905),
 (6, 0.1866449),
 (9, 0.05416738),
 (8, 0.039437424)]

In [96]:
index = gensim.similarities.MatrixSimilarity(lda[corpus])

In [97]:
sims = index[query]

In [98]:
topics = [i for i, score in sorted(query, key=lambda x: -x[1])[:5]]

In [99]:
lda.print_topic(topics[0])

'0.010*"like" + 0.008*"drive" + 0.007*"know" + 0.005*"ve" + 0.005*"new" + 0.004*"thanks" + 0.004*"think" + 0.004*"good" + 0.004*"ll" + 0.004*"got"'

#### Find similar documents

In [100]:
hits = sorted(enumerate(sims), key=lambda x: -x[1])[:5]
hits

[(60, 0.99013793),
 (9747, 0.98868847),
 (6464, 0.98546094),
 (1645, 0.98302364),
 (2521, 0.9799195)]

In [101]:
print(newsgroups_test.data[9])
for match in [newsgroups_train.data[k] for k, score in hits]:
    print('-'*80)
    print(match)

:  
: well, i have lots of experience with scanning in images and altering
: them.  as for changing them back into negatives, is that really possible?

: (stuff deleted)

: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful
: undertones)

I use Aldus Photostyler on the PC and I can turn a colour or black and white
image into a negative or turn a negative into a colour or black and white
image.  I don't know how it does it but it works well.  To test it I scanned
a negative and used Aldus to create a positive.  It looked better than the
print that the film developers gave me.


-- 
--------------------------------------------------------------------------------
Hello netters:)  Does anyone out there know any FTP sites for projects,
plans, etc of an electrical nature?  

-Jason
--------------------------------------------------------------------------------

   Sean, the 68070 exists! :-)



   Sean, I don't want to get into a 'mini-war' by what I am going to say,
but 