# Module 4: Topic Modeling

## Semantic Text Similarity
### Applications of semantic similarity
1. Grouping similar words into semantic concepts
2. As a building block in natural language understanding tasks
    - Textual entailment
    - Paraphrasing

### WordNet
1. Semantic dictionary of (mostly) English words, interlinked by semantic relations
2. Includes rich linguistic information
    - part of speech, word senses, synonyms, hypernyms, meronyms, derivationally related forms, ...
3. Machine-readable, freely available

### Semantic similarity using WordNet
1. WordNet organizes information in a hierarchy
2. Many similarity measures use the hierarchy in some way
3. Verbs, nouns, adjectives all have separate hierarchies
For example,
<img src="https://img.ceclinux.org/39/686ed8dc683f3a7218cf38af0063965d493d87.png">

### Path Similarity
1. Find the shortest path between the two concepts
2. Similarity measure inversely related to path distance
    - PathSim(deer, elk) = 1/(1 + distance) = 1/(1 + 1) = 0.5
    - PathSim(deer, giraffe) = 1/(1 + 2) = 0.33
    - PathSim(deer, horse) = 1/(1 + 6) = 0.14
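
The path similarity values above can be reproduced with a small sketch. The hierarchy below is a hypothetical toy taxonomy, chosen only so that the distances match the examples (1, 2 and 6 edges); it is not taken from WordNet itself, and the helper names are illustrative.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parent), assumed for illustration only.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
}

def neighbors(node):
    """Return the children and the parent of a node (edges are undirected)."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def path_distance(start, goal):
    """Breadth-first search for the number of edges on the shortest path."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected

def path_sim(u, v):
    """PathSim(u, v) = 1 / (1 + shortest path distance)."""
    dist = path_distance(u, v)
    return None if dist is None else 1 / (1 + dist)
```

Under this toy hierarchy, `path_sim("deer", "elk")` gives 0.5 and `path_sim("deer", "horse")` gives 1/7.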

### Lowest Common Subsumer (LCS)
1. Find the closest ancestor to both concepts
    - LCS(deer, elk) = deer
    - LCS(deer, giraffe) = ruminant
    - LCS(deer, horse) = ungulate
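
A minimal sketch of finding the LCS by walking up ancestor chains. The child-to-parent table is a hypothetical toy hierarchy assumed for illustration (it is not read from WordNet), arranged so that the three examples above come out the same way.

```python
# Hypothetical toy hierarchy (child -> parent), assumed for illustration only.
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
}

def ancestors(node):
    """Return the node followed by its ancestors, nearest first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lcs(u, v):
    """Return the first concept on u's chain that also subsumes v."""
    v_chain = set(ancestors(v))
    for concept in ancestors(u):
        if concept in v_chain:
            return concept
    return None  # no common ancestor
```

A concept counts as its own ancestor here, which is why `lcs("deer", "elk")` returns `"deer"`.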
    
### Lin Similarity
1. Similarity measure based on the information contained in the LCS of the two concepts
    - LinSim(u,v) = 2 * log P(LCS(u,v)) / (log P(u) + log P(v))
2. P(u) is given by the information content learnt over a large corpus
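
To make the formula concrete, here is a small worked sketch. The probabilities are made-up values assumed for illustration, not estimates from any real corpus; the only constraint respected is that an LCS is at least as probable as the concepts it subsumes.

```python
import math

# Hypothetical concept probabilities, assumed for illustration only.
P = {"deer": 0.001, "elk": 0.0002, "horse": 0.002, "ungulate": 0.01}

def lin_sim(p_u, p_v, p_lcs):
    """LinSim(u, v) = 2 * log P(LCS(u, v)) / (log P(u) + log P(v))."""
    return 2 * math.log(p_lcs) / (math.log(p_u) + math.log(p_v))

# LCS(deer, elk) = deer and LCS(deer, horse) = ungulate, as in the LCS examples
deer_elk = lin_sim(P["deer"], P["elk"], P["deer"])
deer_horse = lin_sim(P["deer"], P["horse"], P["ungulate"])
```

With these assumed values, `deer_elk` is larger than `deer_horse`: the more specific (less probable) the LCS, the closer the score gets to 1.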

```python3
import nltk
from nltk.corpus import wordnet as wn

# requires the WordNet data: nltk.download('wordnet')
# find appropriate sense of the words
deer = wn.synset('deer.n.01')
elk = wn.synset('elk.n.01')
horse = wn.synset('horse.n.01')

# find path similarity
deer.path_similarity(elk)
deer.path_similarity(horse)
```

```python3
# use an information content file to find Lin Similarity
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

deer.lin_similarity(elk, brown_ic)
deer.lin_similarity(horse, brown_ic)
```

### Collocations and Distributional similarity
1. "You know a word by the company it keeps"
2. Two words that frequently appear in similar contexts are more likely to be semantically related. 
    e.g. 
    - The friends met at a cafe (met, at, a)
    - Shayne met Ray at a pizzeria (met, at, a)

### Distributional Similarity: Context
1. Words before, after, within a small window
2. Parts of speech of words before, after, in a small window
3. Specific syntactic relation to the target word
4. Words in the same sentence, same document, ..
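
As a sketch of the first kind of context (words within a small window), the helper below collects the words around a target token. The function name and the window size of 4 are assumptions made for illustration; the two sentences are the ones from the example above, lowercased and split on whitespace.

```python
def context_words(tokens, target, size=4):
    """Collect the words within `size` positions before or after `target`."""
    words = set()
    for i, tok in enumerate(tokens):
        if tok == target:
            words.update(tokens[max(0, i - size):i])   # words before
            words.update(tokens[i + 1:i + 1 + size])   # words after
    return words

s1 = "the friends met at a cafe".split()
s2 = "shayne met ray at a pizzeria".split()

# Context words that "cafe" and "pizzeria" have in common
shared = context_words(s1, "cafe") & context_words(s2, "pizzeria")
```

Here `shared` contains `met`, `at` and `a`, which is the overlap that suggests "cafe" and "pizzeria" are related.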

### Strength of association between words
1. Not similar if two words don't occur together often
2. It is also important to consider how frequent the individual words are
    - e.g. "the" co-occurs with almost every word
3. Pointwise Mutual Information 
    **PMI(w,c) = log[P(w,c) / (P(w)P(c))]**
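
The PMI formula can be sketched with plain counts. The toy corpus below is assumed for illustration, probabilities are simple maximum-likelihood estimates over adjacent word pairs, and base-2 logarithms are an arbitrary choice (another base only rescales the scores).

```python
import math
from collections import Counter

# Hypothetical toy corpus, assumed for illustration only.
tokens = "the friends met at a cafe and shayne met ray at a pizzeria".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_unigrams = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(w, c):
    """PMI(w, c) = log2[P(w, c) / (P(w) P(c))] for the adjacent pair (w, c)."""
    if bigrams[(w, c)] == 0:
        return float("-inf")  # the pair never occurs together
    p_wc = bigrams[(w, c)] / n_bigrams
    p_w = unigrams[w] / n_unigrams
    p_c = unigrams[c] / n_unigrams
    return math.log2(p_wc / (p_w * p_c))
```

In this corpus `pmi("at", "a")` is higher than `pmi("met", "at")`, because "at a" occurs twice while "met at" occurs once and all three words are equally frequent on their own.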

```python3
# use NLTK collocations and association measures
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
# `text` is a list of tokens
finder = BigramCollocationFinder.from_words(text)
finder.nbest(bigram_measures.pmi, 10)

# finder also has other useful functions, e.g. frequency filter
# (apply it before calling nbest to drop bigrams seen fewer than 10 times)
finder.apply_freq_filter(10)
```

## Topic Modeling
1. A coarse-level analysis of what's in a text collection
2. Topic: the subject (theme) of a discourse
3. Topics are represented as a word distribution
4. A document is assumed to be a mixture of topics
5. What's known: 
    - the text collection or corpus
    - number of topics
6. What's not known:
    - the actual topics
    - topic distribution for each document
7. Essentially, topic modeling is a text clustering problem
    - Documents and words clustered simultaneously
8. Different topic modeling approaches available
    - Probabilistic Latent Semantic Analysis (PLSA) (Hofmann 99)
    - Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 03)
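
To illustrate "topics are word distributions" and "a document is a mixture of topics", here is a minimal sketch. The two topics, the three-word vocabulary and all the numbers are hypothetical; in real topic modeling these distributions are exactly what is unknown and has to be inferred from the corpus.

```python
# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
topics = {
    "sports": {"game": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"game": 0.1, "team": 0.1, "vote": 0.8},
}

# Hypothetical topic mixture for one document (weights sum to 1).
doc_mixture = {"sports": 0.7, "politics": 0.3}

def word_distribution(mixture, topics):
    """P(w | d) = sum over topics k of P(k | d) * P(w | k)."""
    dist = {}
    for topic, weight in mixture.items():
        for word, prob in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * prob
    return dist

doc_words = word_distribution(doc_mixture, topics)
```

The resulting `doc_words` is itself a probability distribution over the vocabulary, weighted toward the words of the document's dominant topic.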

## Generative Models and LDA