# Text Analysis

In this lecture, we look at more recent methods of feature extraction and topic modeling.

We will cover the following:

- word2vec
- latent semantic analysis
- non-negative matrix factorization
- latent Dirichlet allocation

## Similarity

In order to find similar words or documents after they have been vectorized, we need definitions of similarity. Similarity measures often used in text analysis include

- edit
- cosine
- Hellinger
- Kullback-Leibler
- Jaccard

Each of these may be expressed either as a similarity or as a distance.

### Edit 

The edit distance between two strings is the minimum number of changes (insertions, deletions or substitutions) needed to convert one string into the other. These changes may be weighted, for example by giving a deletion a different weight than an insertion. The unweighted version is also known as the Levenshtein distance.

Such distance metrics are the basis for aligning DNA, RNA and protein sequences.

In [1]:
import textdistance as td

In [2]:
td.levenshtein.distance('slaves', 'salve')

3

In [3]:
td.levenshtein.similarity('slaves', 'salve')

3
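
To make the definition concrete, here is a minimal dynamic-programming sketch of the unweighted edit distance. The function name `levenshtein` is our own, and this is not the `textdistance` implementation; it is only an illustration that should agree with it on simple inputs such as the pair used in the cells above.

```python
def levenshtein(a, b):
    """Unweighted edit distance between sequences a and b (row-by-row DP)."""
    # prev[j] is the distance between the first i-1 items of a and the first j items of b
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning the first i items of a into an empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x with y (free if they are equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein('slaves', 'salve'))
```

A weighted variant would replace the three `+ 1` costs with operation-specific weights.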

### Jaccard

The Jaccard similarity is the size of the intersection divided by the size of the union of two sets; the Jaccard distance is one minus the similarity.

In [4]:
td.jaccard.similarity('the quick brown fox'.split(), 'the quick brown dog'.split())

0.6

Note that the `textdistance` implementation actually works on multisets, so repeated elements are counted.

In [5]:
td.jaccard.similarity('slaves', 'salve')

0.8333333333333334
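
The difference between the set and multiset versions can be sketched with the standard library. The two function names are hypothetical, and the multiset definition used here (minimum counts over maximum counts) is one common choice that happens to reproduce the value above; it is not taken from the `textdistance` source.

```python
from collections import Counter

def jaccard_set(a, b):
    """Jaccard similarity treating a and b as sets (duplicates ignored)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_multiset(a, b):
    """Jaccard similarity treating a and b as multisets (duplicates counted)."""
    a, b = Counter(a), Counter(b)
    # For Counters, & keeps the minimum count and | the maximum count of each element
    return sum((a & b).values()) / sum((a | b).values())

print(jaccard_set('slaves', 'salve'))       # both strings use the same five letters
print(jaccard_multiset('slaves', 'salve'))  # the second 's' in 'slaves' is unmatched
```

Both functions raise `ZeroDivisionError` when both inputs are empty; a real implementation would need to decide what to return in that case.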

### Cosine 

Cosine similarity is defined for two real-valued vectors: it is the cosine of the angle between them.

In [6]:
s1 = 'the quick brown fox'
s2 = 'the quick brown dog'

In [7]:
td.cosine.similarity(s1.split(), s2.split())

0.75

Cosine similarity works on vectors - the default is just to use the bag of words counts.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
cv = CountVectorizer()

In [10]:
t = cv.fit_transform([s1, s2]).toarray()
t

array([[1, 0, 1, 1, 1],
       [1, 1, 0, 1, 1]], dtype=int64)

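Cosine similarity is equivalent to the inner product of the two vectors after each has been normalized to length 1. Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, so we subtract it from 1 to get the similarity.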

In [11]:
from scipy.spatial.distance import cosine

In [12]:
import numpy as np

In [13]:
np.around(1- cosine(t[0], t[1]), 2)

0.75

In [14]:
np.dot(t[0]/np.linalg.norm(t[0]), t[1]/np.linalg.norm(t[1]))

0.75

### Hellinger

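The Hellinger distance measures how far apart two probability distributions are. It ranges from 0 (identical distributions) to 1 (distributions with no overlap).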
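
For reference, for two discrete distributions $p$ and $q$ over the same $n$ outcomes, the usual definition (and the one the `discrete_hellinger` function further down computes) is

$$
H(p, q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}
$$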

In [15]:
from gensim.matutils import  hellinger

In [16]:
p = t[0]/t[0].sum()
q = t[1]/t[1].sum()
p, q

(array([0.25, 0.  , 0.25, 0.25, 0.25]), array([0.25, 0.25, 0.  , 0.25, 0.25]))

In [17]:
hellinger(p, q)

0.5

In [18]:
def discrete_hellinger(p, q):
    return 1/np.sqrt(2) * np.linalg.norm(np.sqrt(p) - np.sqrt(q))

In [19]:
discrete_hellinger(p, q)

0.5

### Kullback-Leibler
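
The Kullback-Leibler divergence compares two probability distributions. For discrete distributions $p$ and $q$ over the same outcomes, with $q_i > 0$ wherever $p_i > 0$, the standard definition (matching the `discrete_dkl` function further down) is

$$
D_{KL}(p \,\|\, q) = \sum_{i} p_i \log \frac{p_i}{q_i}
$$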

In [20]:
t = cv.fit_transform(['one two three', 'one one one two two three']).toarray()
t

array([[1, 1, 1],
       [3, 1, 2]], dtype=int64)

In [21]:
p = t[0]/t[0].sum()
q = t[1]/t[1].sum()
p, q

(array([0.33333333, 0.33333333, 0.33333333]),
 array([0.5       , 0.16666667, 0.33333333]))

In [22]:
from gensim.matutils import  kullback_leibler

In [23]:
kullback_leibler(p, q)

0.09589402415059362

Note that the Kullback-Leibler divergence is not symmetric: swapping the arguments gives a different value.

In [24]:
kullback_leibler(q, p)

0.08720802396075798

In [25]:
def discrete_dkl(p, q):
    return -np.sum(p * (np.log(q) - np.log(p)))

In [26]:
discrete_dkl(p, q)

0.09589402415059356

In [27]:
discrete_dkl(q, p)

0.08720802396075805

## Word2Vec

The `word2vec` family of algorithms is a powerful method for converting a word into a vector that takes its context into account. There are two main ideas - in continuous bag of words (CBOW), we try to predict the current word from nearby words; in continuous skip-gram, the current word is used to predict nearby words. The phrase "nearby words" is intentionally vague - in the simplest case, it is a sliding window of words centered on the current word.

Suppose we have the sentence

```
I do not like green eggs and ham
```

and suppose we use a centered window of length 3,

```
((I, not), do), ((do, like), not), ((not, green), like), ((like, eggs), green), ((green, and), eggs), ((eggs, ham), and)
```

In continuous bag of words, treating each context word as a separate input for simplicity, we take the (input, output) pairs to be
```
(I, do)
(not, do)
(do, not)
(like, not)
(not, like)
(green, like)
(like, green)
(eggs, green)
(green, eggs)
(and, eggs)
(eggs, and)
(ham, and)
```

That is, we try to predict `do` when we see `I`, `do` when we see `not` and so on.

In continuous skip-gram, we do the inverse for (input, output) pairs
```
(do, I)
(do, not)
(not, do)
(not, like)
(like, not)
(like, green)
(green, like)
(green, eggs)
(eggs, green)
(eggs, and)
(and, eggs)
(and, ham)
```

That is, we try to predict `I` when we see `do`, `not` when we see `do` and so on.
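
The two listings above can be generated mechanically. The sketch below assumes the same simplification as the text (one pair per context word, and only positions with a full window on both sides); the helper name `context_windows` and the `radius` parameter are our own.

```python
def context_windows(tokens, radius=1):
    """Yield (context_words, current_word) for every position with a full window."""
    for i in range(radius, len(tokens) - radius):
        context = tokens[i - radius:i] + tokens[i + 1:i + radius + 1]
        yield context, tokens[i]

tokens = 'I do not like green eggs and ham'.split()

# CBOW: context word is the input, current word is the output
cbow_pairs = [(c, word) for context, word in context_windows(tokens) for c in context]

# Skip-gram: current word is the input, context word is the output
skipgram_pairs = [(word, c) for context, word in context_windows(tokens) for c in context]

for pair in cbow_pairs[:4]:
    print(pair)
for pair in skipgram_pairs[:4]:
    print(pair)
```

A centered window of length 3 corresponds to `radius=1`; a larger radius produces more pairs per word.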

To do this prediction, we first assign each word a vector of some fixed length $n$ - i.e. we embed each word as a vector in $\mathbb{R}^n$. Predicting over all words in the vocabulary using `softmax` would be prohibitively expensive, and is unnecessary if we are just trying to find a good embedding vector. Instead we select $k$ noise words, typically from the unigram distribution, and just train a classifier to distinguish the target word from the noise words using logistic regression (negative sampling). We use stochastic gradient descent to move the word embedding vectors (initialized randomly) until the model gives a high probability to the target words and a low probability to the noise words. If successful, words that are meaningful when substituted in the same context will be close together in $\mathbb{R}^n$. For instance, `dog` and `cat` are likely to be close together because they appear in similar contexts like

- `My pet dog|cat`
- `Raining dogs|cats and cats|dogs`
- `The dog|cat chased the rat`
- `Common pets are dogs|cats`

while `dog` and `apple` are less likely to occur in the same context and hence will end up further apart in the embedding space. Interestingly, the vectors resulting from vector subtraction are also meaningful since they represent analogies - the vector between `man` and `woman` is likely to be similar to that between `king` and `queen`, or `boy` and `girl`.
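
As a rough illustration of negative sampling, the following NumPy sketch performs stochastic gradient descent on a single (input word, target word) pair with a fixed set of noise words. Everything here is a simplifying assumption rather than how `gensim` implements it: the function name `sgns_step`, the toy vocabulary size, the word ids, and the fact that the noise ids are passed in by hand instead of being sampled from the unigram distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, noise, lr=0.05):
    """One SGD step for one (center, context) pair and the given noise word ids.

    Returns the loss computed before the update.
    """
    v = W_in[center].copy()                 # embedding of the input word
    ids = np.array([context, *noise])       # true target word first, then the noise words
    labels = np.zeros(len(ids))
    labels[0] = 1.0                         # 1 for the true word, 0 for noise
    U = W_out[ids]                          # output vectors, shape (k + 1, n)
    probs = sigmoid(U @ v)                  # logistic regression: is this the true word?
    loss = -np.log(probs[0]) - np.sum(np.log(1.0 - probs[1:]))
    err = probs - labels                    # gradient of the loss with respect to the scores
    W_in[center] -= lr * (err @ U)
    np.subtract.at(W_out, ids, lr * np.outer(err, v))  # also correct if an id repeats
    return loss

rng = np.random.default_rng(0)
vocab_size, n = 10, 8                       # toy sizes, chosen arbitrarily
W_in = rng.normal(scale=0.1, size=(vocab_size, n))
W_out = rng.normal(scale=0.1, size=(vocab_size, n))

losses = [sgns_step(W_in, W_out, center=1, context=2, noise=[5, 7, 9])
          for _ in range(200)]
print(losses[0], losses[-1])
```

Each step costs work proportional to $k + 1$ rather than to the vocabulary size, which is the point of avoiding the full `softmax`.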

Note: you will encounter `word2vec` again if you take a deep learning class - it is a very influential idea and has many applications beyond text processing since you can apply it to any sequence of discrete tokens where local context is meaningful (e.g. genomes).

There is a very nice tutorial on Word2Vec that you should read if you want to learn more about the algorithm - [Part 1](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [Part 2](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/)

Word2Vec learns feature representations of *words* - there is an extension, `doc2vec`, that generates a feature vector for paragraphs or documents in the same way; we may cover this in the next lecture along with other document retrieval algorithms.

There are several other word to vector algorithms inspired by `word2vec` - for example, [`fasttext`](https://fasttext.cc) and `wordrank` - as well as tools such as [approximate nearest neighbors](https://github.com/spotify/annoy) for fast similarity search over the resulting vectors. Conveniently, many of these are available through `gensim`.

We illustrate the mechanics of `word2vec` using `gensim` on the tiny newsgroup corpus; however, you really need a much larger corpus for `word2vec` to learn effectively.

In [28]:
import re
import numpy as np
import pandas as pd

In [29]:
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.collocations import QuadgramCollocationFinder, TrigramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures, TrigramAssocMeasures
import string

In [30]:
import gensim
from gensim.models.word2vec import Word2Vec

In [31]:
from sklearn.datasets import fetch_20newsgroups

In [32]:
import warnings

warnings.simplefilter('ignore', FutureWarning)

In [33]:
newsgroups_train = fetch_20newsgroups(
        subset='train',
        remove=('headers', 'footers', 'quotes')
)

In [34]:
newsgroups_test = fetch_20newsgroups(
        subset='test',
        remove=('headers', 'footers', 'quotes')
)

In [35]:
def gen_sentences(corpus):
    for item in corpus:
        yield from nltk.tokenize.sent_tokenize(item)

In [36]:
for i, t in enumerate(newsgroups_train.target[:20]):
    print('%-24s:%s' % (newsgroups_train.target_names[t], 
                        newsgroups_train.data[i].strip().replace('\n', ' ')[:50]))

rec.autos               :I was wondering if anyone out there could enlighte
comp.sys.mac.hardware   :A fair number of brave souls who upgraded their SI
comp.sys.mac.hardware   :well folks, my mac plus finally gave up the ghost 
comp.graphics           :Do you have Weitek's address/phone number?  I'd li
sci.space               :From article <C5owCB.n3p@world.std.com>, by tombak
talk.politics.guns      :Of course.  The term must be rigidly defined in an
sci.med                 :There were a few people who responded to my reques
comp.sys.ibm.pc.hardware:ALL this shows is that YOU don't know much about S
comp.os.ms-windows.misc :I have win 3.0 and downloaded several icons and BM
comp.sys.mac.hardware   :I've had the board for over a year, and it does wo
rec.motorcycles         :I have a line on a Ducati 900GTS 1978 model with 1
talk.religion.misc      :Yep, that's pretty much it. I'm not a Jew but I un
comp.sys.mac.hardware   :--
sci.space               :{Description of "External Tank" opt

In [37]:
list(newsgroups_train.data[:3])

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't an

In [38]:
list(gen_sentences(newsgroups_train.data[:2]))[:3]

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day.',
 'It was a 2-door sports car, looked to be from the late 60s/\nearly 70s.',
 'It was called a Bricklin.']

In [39]:
from gensim.parsing.preprocessing import STOPWORDS

In [40]:
docs = [gensim.utils.simple_preprocess(s) 
        for s in newsgroups_train.data]
docs = [[s for s in doc if s not in STOPWORDS] for doc in docs]

In [41]:
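try:
    model = Word2Vec.load('newsgroup_w2v.model')
except FileNotFoundError:
    # No cached model on disk, so train one and save it for next time.
    # Note: `size` is the gensim 3.x name; gensim 4+ calls it `vector_size`.
    model = Word2Vec(docs,
                     size=64, # we use 64 dimensions to represent each word
                     window=5, # size of each context window
                     min_count=3, # ignore words with frequency less than this
                     workers=4)
    # The constructor has already trained on docs; this continues for 10 more epochs
    model.train(docs, total_examples=len(docs), epochs=10)
    model.save('newsgroup_w2v.model')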

In [42]:
len(model.wv.vocab)

28037

The embedding vector for the word `england`

In [43]:
model.wv.word_vec('england')

array([-0.0745109 , -0.3727787 , -0.07858007, -0.02495546, -0.81768256,
        0.3675847 , -0.04986222, -0.35944948, -1.3577297 , -0.82298636,
        0.6177153 , -0.65133417, -0.4350872 , -0.5583063 ,  0.18579966,
        1.0602158 , -0.3022152 ,  0.5284609 , -0.59862494, -0.45161876,
       -0.27300555,  0.60734206, -0.6525729 , -0.23259163,  0.13352758,
        0.5370736 ,  0.7715225 , -0.12659687, -0.44654194,  0.22047612,
       -0.16561052,  0.4706513 , -0.33596152,  1.2309055 , -0.49149442,
        0.86923444,  1.4756995 , -1.8942418 , -0.00781839,  2.0033727 ,
       -0.09868202,  1.1259419 , -0.189596  , -0.2777818 , -2.4286313 ,
       -0.38005677, -1.182714  , -1.6038613 ,  0.84564364,  0.21023819,
       -0.53427446,  0.646102  , -0.58787376, -0.97272074,  1.0672528 ,
        0.23760247, -0.18388252, -0.36993474, -0.22132048, -0.96092176,
       -0.52498657, -0.48206168,  1.1901151 ,  0.4976162 ], dtype=float32)

In [44]:
model.wv.most_similar('england', topn=5)

[('york', 0.8504003286361694),
 ('mexico', 0.8040080070495605),
 ('london', 0.7945983409881592),
 ('rhode', 0.7761226296424866),
 ('county', 0.7622696161270142)]

In [45]:
model.wv.similarity('england', 'france')

0.6483306

In [46]:
model.wv.similarity('england', 'rabbit')

0.329547

Apparently, woman is to baseball as man is to `nhl`. Who knew?

In [47]:
model.wv.most_similar(positive=['baseball', 'man'], negative=['woman'], topn=3)

[('nhl', 0.7642770409584045),
 ('hockey', 0.7287185192108154),
 ('stats', 0.7006800770759583)]

Because of the small and very biased data sets (including `soc.religion.christian` and `alt.atheism`), some of the analogies found are pretty weird.

In [48]:
model.wv.most_similar(positive=['father', 'son'],
                   negative=['mother'])

[('ye', 0.8762507438659668),
 ('angel', 0.8611559867858887),
 ('unto', 0.8579176664352417),
 ('messenger', 0.8458547592163086),
 ('abraham', 0.8432290554046631),
 ('thy', 0.8288504481315613),
 ('isaiah', 0.8286863565444946),
 ('allah', 0.8268882632255554),
 ('wicked', 0.8183465003967285),
 ('apostles', 0.8173311352729797)]

## Doc2Vec

The `doc2vec` algorithm is basically the same as `word2vec` with the addition of a paragraph or document context vector. That is, certain words may be used differently in different types of documents, and this is captured in the vector representing the paragraph or document.

![img](https://cdn-images-1.medium.com/max/1600/0*x-gtU4UlO8FAsRvL.)

In [49]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [50]:
tagged_docs = [TaggedDocument(doc, [i]) for i, doc in enumerate(docs)]

In [51]:
try:
    model = Doc2Vec.load('newsgroup_d2v.model')
except FileNotFoundError:
    # No cached model on disk, so train one and save it for next time
    model = Doc2Vec(tagged_docs,
                    vector_size=64, # we use 64 dimensions to represent each doc
                    window=5, # size of each context window
                    min_count=3, # ignore words with frequency less than this
                    workers=4)
    model.train(tagged_docs, total_examples=len(tagged_docs), epochs=10)
    model.save('newsgroup_d2v.model')

In [52]:
query = newsgroups_test.data[9]

In [53]:
query = [token for token in gensim.utils.simple_preprocess(query) 
         if token not in STOPWORDS]

In [54]:
vector = model.infer_vector(query)
vector

array([-0.31516454,  0.00805197, -0.0253187 ,  0.19353893,  0.0273648 ,
        0.1886811 ,  0.10788542,  0.22791667,  0.03705199, -0.01411314,
       -0.17832312,  0.05631726, -0.10725331,  0.14338617,  0.18527962,
       -0.01110269,  0.06654119, -0.04594845, -0.22378922, -0.04476319,
       -0.21860364,  0.12073784, -0.16528177,  0.16585809, -0.05773527,
        0.3454903 , -0.19849953, -0.333022  , -0.16839017,  0.10700481,
        0.17470066,  0.24749497,  0.18888773, -0.07537188,  0.18191707,
       -0.1095916 ,  0.02935272, -0.4249575 ,  0.0251768 ,  0.0349746 ,
        0.12637445,  0.24568476, -0.2119108 ,  0.48085764,  0.20985557,
        0.00770369, -0.2141663 , -0.18078327,  0.12313445, -0.21511662,
        0.24245998,  0.02400354, -0.00676978, -0.00284921, -0.12853989,
       -0.22861975, -0.20369759,  0.08046672,  0.2529689 , -0.11732776,
        0.02008816, -0.18802854,  0.10185857,  0.06126981], dtype=float32)

In [55]:
model.docvecs.most_similar([vector])

[(5391, 0.7474963665008545),
 (5085, 0.7442615032196045),
 (1289, 0.7347975373268127),
 (1727, 0.7336825132369995),
 (5990, 0.7317161560058594),
 (336, 0.7278159856796265),
 (4120, 0.7259418964385986),
 (6215, 0.7255467176437378),
 (6604, 0.7246582508087158),
 (2326, 0.7232733964920044)]

In [56]:
print(newsgroups_test.data[9])
for i, score in model.docvecs.most_similar([vector], topn=5):
    print('-'*80)
    print(newsgroups_train.data[i])

:  
: well, i have lots of experience with scanning in images and altering
: them.  as for changing them back into negatives, is that really possible?

: (stuff deleted)

: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful
: undertones)

I use Aldus Photostyler on the PC and I can turn a colour or black and white
image into a negative or turn a negative into a colour or black and white
image.  I don't know how it does it but it works well.  To test it I scanned
a negative and used Aldus to create a positive.  It looked better than the
print that the film developers gave me.


-- 
--------------------------------------------------------------------------------
Hi guys.

I am scanning in a color image and it looks fine on the screen.  When I 
converted it into PCX,BMP,GIF files so as to get it into MS Windows the colors
got much lighter.  For example the yellows became white.  Any ideas?
--------------------------------------------------------------------------------


## Latent Semantic Indexing (LSI)

Latent semantic indexing (also called latent semantic analysis) uses the SVD to find a low-rank approximation to the document/word feature matrix.

Recall that with SVD, $X = U \Sigma V^T$. With LSI, we interpret the matrices as

![img](https://i.stack.imgur.com/s9K8q.jpg)

where the weights (the singular values) indicate the importance of each topic, and a topic is a particular weighted combination of words.
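
The decomposition can be tried directly with NumPy on a small made-up matrix. This is only a sketch of the linear algebra, not of how `gensim`'s `LsiModel` computes it; the counts in `X` and the names `doc_topics` and `topic_words` are hypothetical.

```python
import numpy as np

# Hypothetical document-term counts: 5 documents (rows) x 4 words (columns)
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1],
              [1, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                            # number of topics to keep
doc_topics = U[:, :k] * s[:k]    # each document expressed in topic space
topic_words = Vt[:k]             # each topic as signed weights over the words
X_k = doc_topics @ topic_words   # rank-k approximation of X

# The Frobenius error equals the root sum of squares of the discarded singular values
print(np.linalg.norm(X - X_k), np.sqrt(np.sum(s[k:] ** 2)))
```

The rows of `topic_words` can contain negative entries, which is one reason LSI topics are awkward to read.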

In [57]:
len(docs)

11314

In [58]:
dictionary = gensim.corpora.Dictionary(docs)

In [59]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [60]:
lsi = gensim.models.LsiModel(corpus, num_topics=10, id2word=dictionary)

In [61]:
for i, topic in lsi.print_topics(num_words=5):
    print(topic)

0.997*"ax" + 0.072*"max" + 0.009*"pl" + 0.007*"ei" + 0.006*"tm"
-0.230*"cx" + -0.225*"db" + -0.225*"file" + -0.177*"hz" + -0.159*"edu"
0.267*"file" + -0.214*"cx" + 0.182*"edu" + -0.172*"hz" + -0.152*"ww"
-0.868*"db" + -0.216*"mov" + -0.189*"bh" + -0.136*"si" + -0.127*"cs"
-0.438*"file" + -0.272*"output" + 0.181*"people" + -0.177*"entry" + 0.175*"know"
-0.246*"file" + 0.236*"edu" + -0.230*"mr" + -0.224*"di" + -0.188*"pl"
-0.290*"di" + -0.251*"pl" + -0.223*"wm" + -0.222*"tm" + -0.188*"um"
0.493*"jpeg" + 0.261*"file" + 0.251*"image" + 0.201*"gif" + 0.177*"mr"
-0.303*"mr" + -0.259*"file" + 0.253*"jpeg" + -0.235*"stephanopoulos" + -0.178*"edu"
0.371*"stephanopoulos" + 0.309*"mr" + -0.269*"file" + 0.181*"president" + -0.175*"gun"


In [62]:
for i, t in enumerate(newsgroups_test.target[:20]):
    print('%02d %-24s:%s' % (i, newsgroups_test.target_names[t], 
                        newsgroups_test.data[i].strip().replace('\n', ' ')[:50]))

00 rec.autos               :I am a little confused on all of the models of the
01 comp.windows.x          :I'm not familiar at all with the format of these "
02 alt.atheism             :In a word, yes.
03 talk.politics.mideast   :They were attacking the Iraqis to drive them out o
04 talk.religion.misc      :I've just spent two solid months arguing that no s
05 sci.med                 :Elisabeth, let's set the record straight for the n
06 soc.religion.christian  :Dishonest money dwindles away, but he who gathers 
07 soc.religion.christian  :A friend of mine managed to get a copy of a comput
08 comp.windows.x          :Hi,     We have a requirement for dynamically clos
09 comp.graphics           ::   : well, i have lots of experience with scannin
10 comp.os.ms-windows.misc :I have uploaded the Windows On-Line Review sharewa
11 comp.windows.x          :Most graphics systems I have seen have drawing rou
12 talk.politics.mideast   :You *know* that putting something like this out on
13 rec.m

#### Find topics in document

Note that topics are rather hard to interpret. After all, they are just the words with the largest weights in the low-rank approximation.

In [63]:
query = newsgroups_test.data[9]
query

":  \n: well, i have lots of experience with scanning in images and altering\n: them.  as for changing them back into negatives, is that really possible?\n\n: (stuff deleted)\n\n: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful\n: undertones)\n\nI use Aldus Photostyler on the PC and I can turn a colour or black and white\nimage into a negative or turn a negative into a colour or black and white\nimage.  I don't know how it does it but it works well.  To test it I scanned\na negative and used Aldus to create a positive.  It looked better than the\nprint that the film developers gave me.\n\n\n-- "

In [64]:
query = gensim.utils.simple_preprocess(query)
query[:3]

['well', 'have', 'lots']

In [65]:
query = dictionary.doc2bow(query)

In [66]:
query = lsi[query]

In [67]:
sorted(query, key=lambda x: -x[1])[:5]

[(7, 0.6934390798666026),
 (8, 0.682172125680393),
 (2, 0.6238479703880672),
 (4, 0.4296537870003311),
 (5, 0.1288588802535491)]

In [68]:
topics = [i for i, score in sorted(query, key=lambda x: -x[1])[:5]]

In [69]:
lsi.print_topic(topics[0])

'0.493*"jpeg" + 0.261*"file" + 0.251*"image" + 0.201*"gif" + 0.177*"mr" + -0.144*"edu" + 0.139*"color" + 0.138*"stephanopoulos" + -0.138*"entry" + -0.132*"output"'

In [70]:
pat = re.compile(r'.*?(-)?\d+.*?\"(\w+)\"')

In [71]:
for topic in topics:
    words = [''.join(pair) for pair in pat.findall(lsi.print_topic(topic))]
    print(','.join(words))

jpeg,file,image,gif,mr,-edu,color,stephanopoulos,-entry,-output
-mr,-file,jpeg,-stephanopoulos,-edu,people,-gun,-president,image,output
file,-cx,edu,-hz,-ww,-c_,-uw,-qs,-ck,use
-file,-output,people,-entry,know,stephanopoulos,mr,said,-oname,think
-file,edu,-mr,-di,-pl,-tm,-wm,-stephanopoulos,-output,-um


#### Find similar documents

In [72]:
index = gensim.similarities.MatrixSimilarity(lsi[corpus])

In [73]:
sims = index[query]

In [74]:
hits = sorted(enumerate(sims), key=lambda x: -x[1])[:5]
hits

[(5755, 0.9937236),
 (8226, 0.9935441),
 (16, 0.9905483),
 (10697, 0.9892752),
 (6864, 0.9879291)]

In [75]:
print(newsgroups_test.data[9])
for match in [newsgroups_train.data[k] for k, score in hits]:
    print('-'*80)
    print(match)

:  
: well, i have lots of experience with scanning in images and altering
: them.  as for changing them back into negatives, is that really possible?

: (stuff deleted)

: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful
: undertones)

I use Aldus Photostyler on the PC and I can turn a colour or black and white
image into a negative or turn a negative into a colour or black and white
image.  I don't know how it does it but it works well.  To test it I scanned
a negative and used Aldus to create a positive.  It looked better than the
print that the film developers gave me.


-- 
--------------------------------------------------------------------------------

Why didn't you create 8 grey-level images, and display them for
1,2,4,8,16,32,64,128... time slices?

This requires the same total exposure time, and the same precision in
timing, but drastically reduces the image-preparation time, no?






---------------------------------------------------------------------

## Non-negative Matrix Factorization

The topics generated by LSI can be hard to understand because they include words with negative weights. They may also be hard to understand because they need not correspond to topics in the way that we would define them. Remember, the LSI topics are just the directions of a low-rank approximation that minimizes the Frobenius norm of the reconstruction error. An alternative is non-negative matrix factorization (NMF), which constrains every word weight in a topic to be non-negative.

NMF basically finds a different set of basis vectors (not the eigenvectors of the covariance matrix) to project onto.

![nmf_svd](https://qph.fs.quoracdn.net/main-qimg-9b4e31ec4b57f4baf7d08d5df17c6bc0)
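
To see what the factorization does, here is a small NumPy sketch using the classic multiplicative update rules for the Frobenius objective. This is a stand-in for illustration: scikit-learn's `NMF` uses a coordinate descent solver by default, and the function name, the toy matrix and the iteration count are all assumptions.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (docs x words) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k))   # document-by-topic weights
    H = rng.random((k, X.shape[1]))   # topic-by-word weights
    errors = []
    for _ in range(n_iter):
        # Multiplying by ratios of non-negative terms keeps every entry non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors

# Hypothetical document-term counts: 5 documents (rows) x 4 words (columns)
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1],
              [1, 0, 0, 1]], dtype=float)

W, H, errors = nmf_multiplicative(X, k=2)
print(np.round(H, 2))
print(errors[0], errors[-1])
```

Because every entry of `H` is non-negative, each topic reads as a list of words that are present to some degree, with no words that count against it.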

In [76]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

In [77]:
import warnings

In [78]:
warnings.simplefilter('ignore', FutureWarning)

In [79]:
from sklearn.decomposition import NMF

In [80]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(newsgroups_train.data)
X = normalize(X, norm='l1', axis=1)
model = NMF(n_components=10, init='random', random_state=0)
W = model.fit_transform(X)

In [81]:
W.shape

(11314, 10)

In [82]:
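# In recent scikit-learn versions this method is called get_feature_names_out()
vocab = vectorizer.get_feature_names()
for i, topic in enumerate(model.components_):
    print("Topic %d:" % i, end=' ')
    # indices of the 10 highest-weighted words, largest first
    print(" ".join(vocab[j] for j in topic.argsort()[:-10 - 1:-1]))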

Topic 0: the of to and in is that on not it
Topic 1: you to it for have your if thanks me what
Topic 2: hello testing please networld xelm mailreader looking andreas am mail
Topic 3: each chris 00 postage for answered are sale includes usa
Topic 4: was that he thought oh it just bet sunpost411ld chicago
Topic 5: hi thanks appreciated anyone advance any windows for anybody card
Topic 6: test this is thanks message only tesrt it putt david
Topic 7: ax max g9v b8f a86 pl 1d9 1t 3t 145
Topic 8: ditto me too here for cdt he copy let hillary
Topic 9: deletion god why atheism alt is concidered gospels exist drivel


How would you find documents similar to the query document? (Homework)

## Latent Dirichlet Allocation

![lda](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)

- [Original paper](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

LDA is an example of a hierarchical Bayesian generative model, and understanding it properly needs quite a bit more background in Bayesian machinery than we can cover in this course. For now, just see how it is used. If you are interested, we will probably cover this in more detail in STA 663.
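
Without going into inference, the generative story that LDA assumes is short enough to simulate. The sketch below draws one toy document; the vocabulary, the number of topics, the document length and the prior values `alpha` and `eta` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ['game', 'team', 'god', 'bible', 'file', 'windows']  # toy vocabulary
n_topics, doc_len = 3, 8
alpha = np.full(n_topics, 0.5)     # Dirichlet prior on per-document topic proportions
eta = np.full(len(vocab), 0.5)     # Dirichlet prior on per-topic word distributions

# Each topic is a probability distribution over the vocabulary
topics = rng.dirichlet(eta, size=n_topics)   # shape (n_topics, len(vocab))

def generate_document():
    theta = rng.dirichlet(alpha)                      # topic proportions for this document
    z = rng.choice(n_topics, size=doc_len, p=theta)   # one topic per word position
    words = [vocab[rng.choice(len(vocab), p=topics[t])] for t in z]
    return theta, words

theta, words = generate_document()
print(np.round(theta, 2))
print(words)
```

Fitting the model goes in the opposite direction: given only the words, estimate `topics` and each document's `theta`.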

In [83]:
from gensim.models.ldamodel import LdaModel

In [84]:
lda = LdaModel(corpus, num_topics=10, id2word=dictionary)

In [85]:
for i, topic in lda.print_topics(num_words=5):
    print(topic)

0.008*"team" + 0.007*"year" + 0.006*"game" + 0.005*"hockey" + 0.005*"season"
0.011*"key" + 0.010*"drive" + 0.005*"use" + 0.004*"like" + 0.004*"number"
0.020*"god" + 0.008*"jesus" + 0.005*"bible" + 0.005*"people" + 0.004*"believe"
0.007*"like" + 0.005*"know" + 0.004*"time" + 0.004*"good" + 0.003*"use"
0.011*"cx" + 0.010*"c_" + 0.006*"hz" + 0.006*"qs" + 0.005*"ck"
0.012*"edu" + 0.007*"com" + 0.006*"mail" + 0.005*"information" + 0.005*"available"
0.007*"space" + 0.004*"stephanopoulos" + 0.004*"government" + 0.003*"president" + 0.003*"technology"
0.010*"people" + 0.006*"think" + 0.005*"know" + 0.005*"like" + 0.004*"said"
0.007*"use" + 0.007*"windows" + 0.005*"file" + 0.005*"program" + 0.004*"card"
0.609*"ax" + 0.045*"max" + 0.008*"pl" + 0.005*"ei" + 0.004*"tm"


In [86]:
query = newsgroups_test.data[9]
query

":  \n: well, i have lots of experience with scanning in images and altering\n: them.  as for changing them back into negatives, is that really possible?\n\n: (stuff deleted)\n\n: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful\n: undertones)\n\nI use Aldus Photostyler on the PC and I can turn a colour or black and white\nimage into a negative or turn a negative into a colour or black and white\nimage.  I don't know how it does it but it works well.  To test it I scanned\na negative and used Aldus to create a positive.  It looked better than the\nprint that the film developers gave me.\n\n\n-- "

In [87]:
query = gensim.utils.simple_preprocess(query)
query[:3]

['well', 'have', 'lots']

In [88]:
query = dictionary.doc2bow(query)

In [89]:
query = lda[query]

#### Topics in query document

In [90]:
sorted(query, key=lambda x: -x[1])[:5]

[(8, 0.48419613), (7, 0.29503086), (3, 0.1578797), (2, 0.050125718)]

In [91]:
index = gensim.similarities.MatrixSimilarity(lda[corpus])

In [92]:
sims = index[query]

In [93]:
topics = [i for i, score in sorted(query, key=lambda x: -x[1])[:5]]

In [94]:
lda.print_topic(topics[0])

'0.007*"use" + 0.007*"windows" + 0.005*"file" + 0.005*"program" + 0.004*"card" + 0.004*"thanks" + 0.004*"software" + 0.004*"bit" + 0.004*"like" + 0.004*"window"'

#### Find similar documents

In [95]:
hits = sorted(enumerate(sims), key=lambda x: -x[1])[:5]
hits

[(5076, 0.99254215),
 (3779, 0.9913205),
 (4535, 0.99104667),
 (9218, 0.98717034),
 (8927, 0.98634976)]

In [96]:
print(newsgroups_test.data[9])
for match in [newsgroups_train.data[k] for k, score in hits]:
    print('-'*80)
    print(match)

:  
: well, i have lots of experience with scanning in images and altering
: them.  as for changing them back into negatives, is that really possible?

: (stuff deleted)

: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful
: undertones)

I use Aldus Photostyler on the PC and I can turn a colour or black and white
image into a negative or turn a negative into a colour or black and white
image.  I don't know how it does it but it works well.  To test it I scanned
a negative and used Aldus to create a positive.  It looked better than the
print that the film developers gave me.


-- 
--------------------------------------------------------------------------------
Hi... what alternatives to the Express modem do Duo owners have (if
they want to go at least 9600 baud)?

Every place in town says they are back ordered, and part of the reason
I want a laptop mac is so I can use it as a remote terminal from
wherever I am, but I really would hate to have to wait 2 months to get