# Text Analysis

In this lecture, we look at more recent methods of feature extraction and topic modeling.

We will cover the following:

- word2vec
- latent semantic analysis
- non-negative matrix factorization
- latent Dirichlet allocation

## Word2Vec

The `word2vec` family of algorithms is a powerful method for converting a word into a vector that takes its context into account. There are two main variants - in continuous bag of words, we try to predict the current word from nearby words; in continuous skip-gram, the current word is used to predict nearby words. The phrase "nearby words" is intentionally vague - in the simplest case, it is a sliding window of words centered on the current word.

Suppose we have the sentence

```
I do not like green eggs and ham
```

and suppose we use a centered window of length 3,

```
((I, not), do), ((do, like), not), ((not, green), like), ((like, eggs), green), ((green, and), eggs), ((eggs, ham), and)
```

In continuous bag of words, the (input, output) pairs are (context word, current word):
```
(I, do)
(not, do)
(do, not)
(like, not)
(not, like)
(green, like)
(like, green)
(eggs, green)
(green, eggs)
(and, eggs)
(eggs, and)
(ham, and)
```

That is, we try to predict `do` when we see `I`, `do` when we see `not` and so on.

In continuous skip-gram, we reverse each pair, so the (input, output) pairs are (current word, context word):
```
(do, I)
(do, not)
(not, do)
(not, like)
(like, not)
(like, green)
(green, like)
(green, eggs)
(eggs, green)
(eggs, and)
(and, eggs)
(and, ham)
```

That is, we try to predict `I` when we see `do`, `not` when we see `do` and so on.
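
The pair construction above can be sketched in a few lines of plain Python. This is only an illustration of the windowing idea, not how `gensim` builds its training examples; the helper name `context_pairs` is made up for this example, and it only uses positions that have a full window on both sides, as in the listing above.

```python
def context_pairs(tokens, window=3):
    """Return (cbow_pairs, skipgram_pairs) for a centered window of odd length."""
    half = window // 2
    cbow, skipgram = [], []
    # Only positions with a full window on both sides are used
    for i in range(half, len(tokens) - half):
        target = tokens[i]
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        for c in context:
            cbow.append((c, target))       # context word -> current word
            skipgram.append((target, c))   # current word -> context word
    return cbow, skipgram


sentence = "I do not like green eggs and ham".split()
cbow, skipgram = context_pairs(sentence)
print(cbow[:4])
print(skipgram[:4])
```

The two lists contain the same pairs with input and output swapped, which is the only difference between the two training setups at this level of detail.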

To do this prediction, we first assign each word a vector of some fixed length $n$ - i.e. we embed each word as a vector in $\mathbb{R}^n$. Predicting over all words in the vocabulary with `softmax` would be prohibitively expensive, and is unnecessary if we are just trying to find a good embedding. Instead we select $k$ noise words, typically drawn from the unigram distribution, and train a classifier to distinguish the target word from the noise words using logistic regression (negative sampling). We use stochastic gradient descent to move the word vectors (initialized randomly) until the model gives a high probability to the target words and a low probability to the noise words. If successful, words that can be meaningfully substituted for one another in the same context will be close together in $\mathbb{R}^n$. For instance, `dog` and `cat` are likely to be close together because they appear in similar contexts like

- `My pet dog|cat`
- `Raining dogs|cats and cats|dogs`
- `The dog|cat chased the rat`
- `Common pets are dogs|cats`

while `dog` and `apple` are less likely to occur in the same context and hence will end up further apart in the embedding space. Interestingly, the vectors resulting from vector subtraction are also meaningful since they represent analogies - the vector between `man` and `woman` is likely to be similar to that between `king` and `queen`, or `boy` and `girl`.
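
To make the negative-sampling step described above concrete, here is a minimal NumPy sketch of one (input, target) pair being trained against randomly chosen noise words. It is a toy under several assumptions: the six-word vocabulary, the dimension, the learning rate, and the function name `sgd_step` are all made up, and noise words are drawn uniformly rather than from the unigram distribution to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["dog", "cat", "pet", "apple", "rat", "ham"]   # toy vocabulary (assumed)
n = 8                                                  # embedding dimension
W_in = rng.normal(scale=0.1, size=(len(vocab), n))     # input word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), n))    # output word vectors


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgd_step(center, target, noise, lr=0.2):
    """One negative-sampling update; returns the loss before the update."""
    ids = np.array([target] + list(noise))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                      # 1 for the true target, 0 for noise words
    v = W_in[center].copy()
    u = W_out[ids]                       # fancy indexing returns a copy
    p = sigmoid(u @ v)                   # predicted probability of "real pair"
    loss = -np.log(p[0]) - np.log(1.0 - p[1:]).sum()
    err = p - labels                     # gradient of the loss w.r.t. the scores
    W_in[center] -= lr * (err @ u)
    W_out[ids] -= lr * np.outer(err, v)
    return loss


center, target = vocab.index("pet"), vocab.index("dog")
candidates = [i for i in range(len(vocab)) if i not in (center, target)]
losses = []
for _ in range(300):
    # k = 2 distinct noise words per step (uniform sampling is a simplification)
    noise = rng.choice(candidates, size=2, replace=False)
    losses.append(sgd_step(center, target, noise))

print(losses[0], losses[-1])
```

Each step only touches the vectors of the input word, the target word, and the $k$ noise words, which is why this is so much cheaper than a full `softmax` over the vocabulary.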

Note: you will encounter `word2vec` again if you take a deep learning class - it is a very influential idea and has many applications beyond text processing since you can apply it to any discrete distribution where local context is meaningful (e.g. genomes). 

There is a very nice tutorial on Word2Vec that you should read if you want to learn more about the algorithm - [Part 1](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) and [Part 2](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/).

Word2Vec learns feature representations of *words* - there is an extension, `doc2vec`, that generates a feature vector for paragraphs or documents in the same way; we may cover this in the next lecture along with other document retrieval algorithms.

We illustrate the mechanics of `word2vec` using `gensim` on the tiny newsgroup corpus; however, you really need a much larger corpus for `word2vec` to learn effectively.

In [213]:
import re
import numpy as np
import pandas as pd

In [1]:
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.collocations import QuadgramCollocationFinder, TrigramCollocationFinder
from nltk.metrics.association import QuadgramAssocMeasures, TrigramAssocMeasures
import string

In [14]:
import gensim
from gensim.models.word2vec import Word2Vec

In [4]:
from sklearn.datasets import fetch_20newsgroups

In [62]:
newsgroups_train = fetch_20newsgroups(
        subset='train',
        remove=('headers', 'footers', 'quotes')
)

In [63]:
newsgroups_test = fetch_20newsgroups(
        subset='test',
        remove=('headers', 'footers', 'quotes')
)

In [9]:
def gen_sentences(corpus):
    for item in corpus:
        yield from nltk.tokenize.sent_tokenize(item)

In [187]:
for i, t in enumerate(newsgroups_train.target[:20]):
    print('%-24s:%s' % (newsgroups_train.target_names[t], 
                        newsgroups_train.data[i].strip().replace('\n', ' ')[:50]))

rec.autos               :I was wondering if anyone out there could enlighte
comp.sys.mac.hardware   :A fair number of brave souls who upgraded their SI
comp.sys.mac.hardware   :well folks, my mac plus finally gave up the ghost 
comp.graphics           :Do you have Weitek's address/phone number?  I'd li
sci.space               :From article <C5owCB.n3p@world.std.com>, by tombak
talk.politics.guns      :Of course.  The term must be rigidly defined in an
sci.med                 :There were a few people who responded to my reques
comp.sys.ibm.pc.hardware:ALL this shows is that YOU don't know much about S
comp.os.ms-windows.misc :I have win 3.0 and downloaded several icons and BM
comp.sys.mac.hardware   :I've had the board for over a year, and it does wo
rec.motorcycles         :I have a line on a Ducati 900GTS 1978 model with 1
talk.religion.misc      :Yep, that's pretty much it. I'm not a Jew but I un
comp.sys.mac.hardware   :--
sci.space               :{Description of "External Tank" opt

In [188]:
list(newsgroups_train.data[:3])

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.',
 "A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't an

In [12]:
list(gen_sentences(newsgroups_train.data[:2]))[:3]

['I was wondering if anyone out there could enlighten me on this car I saw\nthe other day.',
 'It was a 2-door sports car, looked to be from the late 60s/\nearly 70s.',
 'It was called a Bricklin.']

In [145]:
from gensim.parsing.preprocessing import STOPWORDS

In [230]:
gensim.utils.simple_preprocess?

In [231]:
docs = [gensim.utils.simple_preprocess(s) 
        for s in newsgroups_train.data]

In [147]:
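try:
    # Reuse a previously trained model if one was saved
    model = Word2Vec.load('newsgroup_w2v.model')
except FileNotFoundError:
    # Note: `size` is the gensim 3.x name; gensim 4 renamed it to `vector_size`
    model = Word2Vec(docs,
                     size=64, # we use 64 dimensions to represent each word
                     window=5, # size of each context window
                     min_count=3, # ignore words with frequency less than this
                     workers=4)
    # The constructor already trains on docs; this runs 10 further epochs
    model.train(docs, total_examples=len(docs), epochs=10)
    model.save('newsgroup_w2v.model')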

In [148]:
len(model.wv.vocab)

28361

The embedding vector for the word `england`

In [149]:
model.wv.word_vec('england')

array([ 1.0021844 , -0.07934759,  0.4993077 ,  0.22183499, -0.88920945,
       -0.2514776 , -0.7221043 ,  0.6758521 , -0.5281115 ,  0.6384827 ,
       -0.3575983 , -0.55267096, -1.3098922 , -0.49298698,  0.16711435,
        0.18228245, -0.08765895, -1.1436514 , -0.9509327 ,  1.2340423 ,
        1.4055172 , -0.16391546, -0.6297922 , -0.42145184, -0.747315  ,
        0.49057895, -1.4533808 , -0.79050404,  0.6408485 , -0.41635767,
       -0.84332716, -2.3325257 , -0.33703378, -1.034308  ,  0.82547283,
       -0.7613817 ,  0.92971116, -1.0731293 ,  0.2729064 , -1.4177942 ,
        0.36007303, -0.9504615 ,  0.7302042 , -0.64634454, -0.59395856,
       -0.24954185,  1.1206496 ,  0.995071  ,  0.34501752, -0.35099226,
        0.17384943,  0.42027408, -0.45629948,  0.2814209 ,  1.3862259 ,
       -0.90324163,  0.65404916, -0.7192859 ,  0.00362594,  0.7246876 ,
        0.07577156, -2.1167848 ,  0.8846294 ,  0.12508965], dtype=float32)

In [150]:
model.wv.most_similar('england', topn=5)

[('mexico', 0.8245236277580261),
 ('france', 0.7831640839576721),
 ('county', 0.7622649669647217),
 ('york', 0.7313283681869507),
 ('london', 0.7193907499313354)]

In [151]:
model.wv.similarity('england', 'france')

0.7831641125198537

In [152]:
model.wv.similarity('england', 'rabbit')

0.1040321903123281
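
The scores reported by `similarity` and `most_similar` are cosine similarities between embedding vectors. A minimal sketch of that computation, using made-up three-dimensional vectors rather than the trained embeddings:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


u = [1.0, 2.0, 0.0]
v = [2.0, 4.0, 0.0]   # same direction as u, different length
w = [0.0, 0.0, 3.0]   # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Because only the direction matters, two words can be judged similar even when their vectors have very different lengths.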

In [153]:
model.wv.most_similar(negative=['man'], topn=3)

[('priced', 0.44513508677482605),
 ('duplicating', 0.4120890200138092),
 ('advertise', 0.41152292490005493)]

Because of the small and very biased data sets (including `soc.religion.christian` and `alt.atheism`), some of the analogies found are pretty weird.

In [154]:
model.wv.most_similar(positive=['father', 'son'],
                      negative=['mother'])

[('christ', 0.7456965446472168),
 ('spirit', 0.736154317855835),
 ('allah', 0.7279092073440552),
 ('prophet', 0.7225679159164429),
 ('grace', 0.6986494064331055),
 ('holy', 0.6917949914932251),
 ('messenger', 0.6914761066436768),
 ('luke', 0.6900230646133423),
 ('praise', 0.686742901802063),
 ('prophets', 0.6823266744613647)]

## Latent Semantic Indexing (LSI)

Latent semantic indexing is basically using SVD to find a low rank approximation to the document/word feature matrix.

Recall that with SVD, $X = U \Sigma V^T$. With LSI, we interpret the matrices as

![img](https://i.stack.imgur.com/s9K8q.jpg)

Here the weights (the singular values in $\Sigma$) indicate the importance of each topic, and each topic is a particular weighted combination of words.
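
As a small illustration of this interpretation, the sketch below applies a truncated SVD to a toy document-term matrix with NumPy. The counts and the five-word vocabulary are invented for the example; `gensim` computes the decomposition differently on a large sparse corpus, but the factors play the same roles.

```python
import numpy as np

# Toy document-term count matrix (rows: documents, columns: words); values are made up
words = ["image", "jpeg", "file", "gun", "president"]
X = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 1, 3, 2],
    [0, 0, 0, 2, 3],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                            # number of topics to keep
doc_topic = U[:, :k] * s[:k]     # each document expressed in topic space
topic_word = Vt[:k]              # each topic as signed weights over words
X_k = doc_topic @ topic_word     # rank-k approximation of X

print(np.round(s, 3))                # topic weights (singular values)
print(np.round(topic_word, 2))       # word weights per topic; signs may be negative
print(np.linalg.norm(X - X_k))       # Frobenius norm of the reconstruction error
```

The reconstruction error equals the root of the sum of the squared discarded singular values, which is the sense in which the rank-$k$ SVD is the best low rank approximation.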

In [155]:
len(docs)

11314

In [156]:
dictionary = gensim.corpora.Dictionary(docs)

In [158]:
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [237]:
lsi = gensim.models.LsiModel(corpus, num_topics=10, id2word = dictionary)

In [238]:
for i, topic in lsi.print_topics(num_words=5):
    print(topic)

0.997*"ax" + 0.072*"max" + 0.009*"pl" + 0.007*"ei" + 0.006*"tm"
-0.230*"cx" + -0.225*"db" + -0.225*"file" + -0.177*"hz" + -0.159*"edu"
-0.267*"file" + 0.214*"cx" + -0.182*"edu" + 0.172*"hz" + 0.152*"ww"
-0.868*"db" + -0.216*"mov" + -0.189*"bh" + -0.136*"si" + -0.127*"cs"
0.438*"file" + 0.272*"output" + -0.181*"people" + 0.177*"entry" + -0.175*"know"
0.246*"file" + -0.236*"edu" + 0.230*"mr" + 0.224*"di" + 0.188*"pl"
-0.290*"di" + -0.251*"pl" + -0.223*"wm" + -0.222*"tm" + -0.188*"um"
0.493*"jpeg" + 0.261*"file" + 0.251*"image" + 0.201*"gif" + 0.177*"mr"
0.303*"mr" + 0.259*"file" + -0.253*"jpeg" + 0.235*"stephanopoulos" + 0.178*"edu"
-0.371*"stephanopoulos" + -0.309*"mr" + 0.269*"file" + -0.181*"president" + 0.175*"gun"


In [239]:
for i, t in enumerate(newsgroups_test.target[:20]):
    print('%02d %-24s:%s' % (i, newsgroups_test.target_names[t], 
                        newsgroups_test.data[i].strip().replace('\n', ' ')[:50]))

00 rec.autos               :I am a little confused on all of the models of the
01 comp.windows.x          :I'm not familiar at all with the format of these "
02 alt.atheism             :In a word, yes.
03 talk.politics.mideast   :They were attacking the Iraqis to drive them out o
04 talk.religion.misc      :I've just spent two solid months arguing that no s
05 sci.med                 :Elisabeth, let's set the record straight for the n
06 soc.religion.christian  :Dishonest money dwindles away, but he who gathers 
07 soc.religion.christian  :A friend of mine managed to get a copy of a comput
08 comp.windows.x          :Hi,     We have a requirement for dynamically clos
09 comp.graphics           ::   : well, i have lots of experience with scannin
10 comp.os.ms-windows.misc :I have uploaded the Windows On-Line Review sharewa
11 comp.windows.x          :Most graphics systems I have seen have drawing rou
12 talk.politics.mideast   :You *know* that putting something like this out on
13 rec.m

In [240]:
query = newsgroups_test.data[9]
query

":  \n: well, i have lots of experience with scanning in images and altering\n: them.  as for changing them back into negatives, is that really possible?\n\n: (stuff deleted)\n\n: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful\n: undertones)\n\nI use Aldus Photostyler on the PC and I can turn a colour or black and white\nimage into a negative or turn a negative into a colour or black and white\nimage.  I don't know how it does it but it works well.  To test it I scanned\na negative and used Aldus to create a positive.  It looked better than the\nprint that the film developers gave me.\n\n\n-- "

In [241]:
query = gensim.utils.simple_preprocess(query)
query[:3]

['well', 'have', 'lots']

In [242]:
query = dictionary.doc2bow(query)

In [243]:
query = lsi[query]

In [244]:
sorted(query, key=lambda x: -x[1])[:3]

[(7, 0.6933926498136153), (3, 0.08967606908424179), (6, 0.027687905053157384)]

In [245]:
lsi.print_topic(7)

'0.493*"jpeg" + 0.261*"file" + 0.251*"image" + 0.201*"gif" + 0.177*"mr" + -0.144*"edu" + 0.139*"color" + 0.138*"stephanopoulos" + -0.138*"entry" + -0.132*"output"'

In [262]:
pat = re.compile(r'.*?(-)?\d+.*?\"(\w+)\"')

In [266]:
words = [''.join(pair) for pair in pat.findall(lsi.print_topic(7))]
words

['jpeg',
 'file',
 'image',
 'gif',
 'mr',
 '-edu',
 'color',
 'stephanopoulos',
 '-entry',
 '-output']
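
The pattern above keeps only the sign of each weight. If the numeric weights are also wanted, one possible variation (a sketch that parses the printed topic string shown earlier, copied here as a literal) is:

```python
import re

# Topic 7 as printed above, copied as a string literal
topic = ('0.493*"jpeg" + 0.261*"file" + 0.251*"image" + 0.201*"gif" + 0.177*"mr" + '
         '-0.144*"edu" + 0.139*"color" + 0.138*"stephanopoulos" + '
         '-0.138*"entry" + -0.132*"output"')

# Capture the signed weight and the quoted word of each term
pairs = [(word, float(weight))
         for weight, word in re.findall(r'(-?\d+\.\d+)\*"(\w+)"', topic)]

positive = [word for word, weight in pairs if weight > 0]
negative = [word for word, weight in pairs if weight < 0]
print(positive)
print(negative)
```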

#### Find similar documents

In [267]:
index = gensim.similarities.MatrixSimilarity(lsi[corpus])

In [268]:
sims = index[query]

In [272]:
hits = sorted(enumerate(sims), key=lambda x: -x[1])[:5]
hits

In [282]:
print(newsgroups_test.data[9])
for match in [newsgroups_train.data[k] for k, score in hits]:
    print('-'*80)
    print(match)

:  
: well, i have lots of experience with scanning in images and altering
: them.  as for changing them back into negatives, is that really possible?

: (stuff deleted)

: jennifer urso:  the oh-so bitter woman of utter blahness(but cheerful
: undertones)

I use Aldus Photostyler on the PC and I can turn a colour or black and white
image into a negative or turn a negative into a colour or black and white
image.  I don't know how it does it but it works well.  To test it I scanned
a negative and used Aldus to create a positive.  It looked better than the
print that the film developers gave me.


-- 
--------------------------------------------------------------------------------

Why didn't you create 8 grey-level images, and display them for
1,2,4,8,16,32,64,128... time slices?

This requires the same total exposure time, and the same precision in
timing, but drastically reduces the image-preparation time, no?






---------------------------------------------------------------------

## Non-negative Matrix Factorization

The topics generated by LSI can be hard to interpret because they include negatively weighted words. They also may not correspond to topics as a human would define them. Remember, the LSI topics are just the basis of the low rank approximation that minimizes the Frobenius norm of the reconstruction error. An alternative is non-negative matrix factorization (NMF), which constrains all word weights in a topic to be non-negative.

NMF basically finds a different set of basis vectors (not the eigenvectors of the covariance matrix) to project onto.

![nmf_svd](https://qph.fs.quoracdn.net/main-qimg-9b4e31ec4b57f4baf7d08d5df17c6bc0)
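
A minimal sketch of NMF on a toy document-term matrix, using the classic multiplicative update rules for the Frobenius objective. The matrix, the number of topics, and the iteration count are made up for illustration, and a library implementation (for example in `scikit-learn`) would normally be used instead of hand-written updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-term count matrix (rows: documents, columns: words); values are made up
X = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 1, 3, 2],
    [0, 0, 0, 2, 3],
], dtype=float)

k = 2                                      # number of topics
W = rng.random((X.shape[0], k)) + 0.1      # document-topic weights, non-negative
H = rng.random((k, X.shape[1])) + 0.1      # topic-word weights, non-negative
eps = 1e-9                                 # avoids division by zero

errors = []
for _ in range(200):
    # Multiplicative updates keep W and H non-negative at every step
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
    errors.append(np.linalg.norm(X - W @ H))

print(np.round(H, 2))            # every word weight in every topic is >= 0
print(errors[0], errors[-1])
```

Since every entry of `H` is non-negative, each topic can be read directly as a set of words with weights, without having to interpret negative contributions as in LSI.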

## Latent Dirichlet Allocation

![lda](https://upload.wikimedia.org/wikipedia/commons/4/4d/Smoothed_LDA.png)