<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# Latent variable models
 
---


Many advances in NLP come from using data to learn the rules of grammar and language. We saw several such tools earlier:

- Tokenization

- Stemming or lemmatization

- Parsing and tagging

Each of these is based on a classical or theoretical understanding of language.


Latent variable models are models in which we assume the data we are observing has some hidden, underlying structure that we can’t see, and which we’d like to learn.

These hidden, underlying structures are the latent (i.e. hidden) variables we want our model to understand.

Text processing is a common application of latent variables.

Latent variable models take a different approach: they try to understand language based on how the words are used.

- For example, instead of learning that ‘bad’ and ‘badly’ are related because they share the same root, we’ll determine that they are related because they are often used in the same way or near the same words.
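
To make the idea of "used near the same words" concrete, here is a minimal sketch that compares words by the contexts they appear in. The three-sentence corpus, the window size, and the helper names `context_counts` and `cosine` are all assumptions made up for illustration, not part of any particular library.

```python
import math
from collections import Counter

# Tiny made-up corpus, for illustration only.
sentences = [
    "the team played bad last night",
    "the team played badly last night",
    "python is a popular programming language",
]

def context_counts(target, sentences, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                start = max(0, i - window)
                counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

def cosine(a, b):
    """Cosine similarity between two context-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

bad = context_counts("bad", sentences)
sim_related = cosine(bad, context_counts("badly", sentences))
sim_unrelated = cosine(bad, context_counts("python", sentences))
```

In this toy corpus ‘bad’ and ‘badly’ share all of their neighbors, so `sim_related` is higher than `sim_unrelated`, without the code knowing anything about word roots.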

We’ll use unsupervised techniques (discovering patterns or structure) to extract the information.


| Traditional NLP Models |      Latent Variable Models   |
|------|------|
|  ‘bad’ and ‘badly’ are related because they share a common root. | ‘bad’ and ‘badly’ are related because they are used the same way or near the same words.|
| | |
| ‘Python’ and ‘C++’ are both programming languages because they are often a noun preceded by the verb ‘program’ or ‘code’. | ‘Python’ and ‘C++’ are both programming languages because they are often used in the same context.| 

Latent variable techniques are often used for recommending news articles or mining large troves of data to find commonalities.

Topic modeling, a method we’ll cover today, is used in The New York Times recommendation engine.

The New York Times attempts to map its articles to a latent space of topics using the content of each article.


<img src="assets/images/nyt_lda.png" style="width:60%" />

## Dimensionality reduction in text

Our previous ‘representation’ of a set of text documents (articles) for classification was a matrix with one row per document and one column per word (or n-gram).
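
As a reminder of what that matrix looks like, here is a small sketch built with the standard library only. The two example documents are made up, and splitting on whitespace is a simplifying assumption in place of a real tokenizer.

```python
from collections import Counter

# Two made-up documents, for illustration only.
docs = [
    "the ipo raised the stock price",
    "the team won the game",
]

# Naive whitespace tokenization (an assumption; real pipelines do more).
tokenized = [doc.lower().split() for doc in docs]

# One column per distinct word, in alphabetical order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document; each cell is the count of that word in the document.
counts_per_doc = [Counter(tokens) for tokens in tokenized]
matrix = [[counts[word] for word in vocab] for counts in counts_per_doc]
```

Each row of `matrix` lines up with `vocab`, so `matrix[0][vocab.index("the")]` is the number of times "the" appears in the first document.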

While this sums up most of the information, it does drop a few things, mainly sentence structure and word order.

Additionally, many of the columns may be correlated.


For example, an article that contains the word `‘IPO’` is likely to contain the word `‘stock’` or `‘NASDAQ’`.

Therefore, those columns are repetitive and likely to represent the same concept or idea.

For classification, we may only care that there are finance-related words.


To reduce this redundancy, we perform dimensionality reduction, where we first identify the correlated columns and then replace them with a column that represents the concept they have in common.

For instance, we could replace `‘IPO’`, `‘stocks’`, and `‘NASDAQ’` with a single `‘HasFinancialWords’` column.
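
A hand-rolled version of that replacement might look like the sketch below. The count matrix, the extra `touchdown` column, and the helper `merge_columns` are hypothetical, and using a 0/1 flag for the merged column is just one possible choice (summing the counts would be another).

```python
# Made-up count matrix: one row per document, one column per word.
columns = ["IPO", "stocks", "NASDAQ", "touchdown"]
rows = [
    [1, 2, 1, 0],
    [0, 0, 0, 3],
    [0, 1, 0, 0],
]

def merge_columns(columns, rows, group, new_name):
    """Replace every column in `group` with a single 0/1 indicator column."""
    keep = [i for i, name in enumerate(columns) if name not in group]
    merged = [i for i, name in enumerate(columns) if name in group]
    new_columns = [columns[i] for i in keep] + [new_name]
    new_rows = [
        [row[i] for i in keep] + [int(any(row[i] > 0 for i in merged))]
        for row in rows
    ]
    return new_columns, new_rows

new_columns, new_rows = merge_columns(
    columns, rows, {"IPO", "stocks", "NASDAQ"}, "HasFinancialWords"
)
```

After the merge, `new_columns` is `["touchdown", "HasFinancialWords"]`, and each document keeps a single flag saying whether it used any of the finance words.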


There are many techniques to do this automatically and most follow a very similar approach.

- Identify correlated columns.

- Replace them with a new column that encapsulates the others.
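
The first of these steps can be sketched with a plain Pearson correlation between columns. The data, the 0.8 threshold, and the helper names `pearson` and `correlated_pairs` are assumptions for illustration; real techniques typically find correlated groups in a more principled way.

```python
import math
from itertools import combinations

# Made-up count matrix: one row per document, one column per word.
columns = ["IPO", "stock", "game"]
rows = [
    [2, 3, 0],
    [1, 2, 0],
    [0, 0, 3],
    [0, 1, 2],
]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / (spread_x * spread_y)

def correlated_pairs(columns, rows, threshold=0.8):
    """Return the column pairs whose correlation is at least `threshold`."""
    series = {name: [row[i] for row in rows] for i, name in enumerate(columns)}
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if pearson(series[a], series[b]) >= threshold
    ]

pairs = correlated_pairs(columns, rows)
```

With this toy data only the `IPO` and `stock` columns are flagged, which makes them candidates for being replaced by one combined column.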

<img src="assets/images/correlated_reduction.png" style="width:60%" />



### Mixture models


Mixture models (specifically LDA or **Latent Dirichlet Allocation**) take this concept further and generate more structure around the documents.

Instead of just replacing correlated columns, we create clusters of common words and generate probability distributions to explicitly state how related words are.


This ‘model’ of text assumes that each document is some mixture of topics.

A document may be mostly about science but may also contain some business or news information.

The latent structure we want to uncover is the set of topics (or concepts) that generate that text.


Latent Dirichlet Allocation is a model that assumes this is the way text is generated and then attempts to learn two things:

- The word distribution of each topic.

- The topic distribution of each document.
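
The assumed generative story can be sketched as follows: to write each word, first pick a topic from the document's topic distribution, then pick a word from that topic's word distribution. This is a simplified illustration, not a full LDA implementation. The distributions here are made up and fixed by hand, whereas LDA also places Dirichlet priors on them and learns them from data. The function `generate_document` is a hypothetical helper.

```python
import random

# Made-up word distributions for two topics (each sums to 1).
topic_words = {
    "sports": {"football": 0.5, "baseball": 0.5},
    "business": {"stocks": 0.6, "ipo": 0.4},
}

def generate_document(topic_mix, topic_words, n_words, rng):
    """Sample `n_words` words: pick a topic per word, then a word from it."""
    topics = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topics]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=topic_weights)[0]
        dist = topic_words[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # seeded so the sketch is repeatable
doc = generate_document({"sports": 0.8, "business": 0.2}, topic_words, 10, rng)
```

Learning works in the opposite direction: given only documents like `doc`, the model tries to recover distributions that could plausibly have generated them.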


The word distribution is a multinomial distribution for each topic, representing which words are most likely to come from that topic.

<img src="assets/images/word_dist.png" style="width:60%" />

For example, let’s say we have three topics: sports, science, and business.

For each topic, we uncover the words most likely to come from it:

- sports: [football: 0.3, basketball: 0.2, baseball: 0.2, touchdown: 0.02 ... genetics: 0.0001]
- science: [genetics: 0.2, drug: 0.2, ... baseball: 0.0001]
- business: [stocks: 0.1, ipo: 0.08,  ... baseball: 0.0001]

For each word and topic pair, we learn some probability: **P(word|topic)**.
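
One simple way to hold these values in code is a nested dictionary keyed by topic and then by word. The sketch below uses only the probabilities listed above, so the distributions are truncated and do not sum to 1. The name `top_words` is a hypothetical helper.

```python
# P(word | topic), restricted to the example values listed in the text.
word_given_topic = {
    "sports": {
        "football": 0.3,
        "basketball": 0.2,
        "baseball": 0.2,
        "touchdown": 0.02,
        "genetics": 0.0001,
    },
    "science": {"genetics": 0.2, "drug": 0.2, "baseball": 0.0001},
    "business": {"stocks": 0.1, "ipo": 0.08, "baseball": 0.0001},
}

def top_words(topic, k=2):
    """Return the `k` most probable words for `topic`."""
    dist = word_given_topic[topic]
    return sorted(dist, key=dist.get, reverse=True)[:k]

business_top = top_words("business")
sports_top = top_words("sports", k=1)
```

Listing the top words per topic like this is a common way to inspect and label the topics a model has learned.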


The topic distribution is a multinomial distribution for each document representing what topics are most likely to appear in that document.

For each of our sample documents, we have a distribution over {sports, science, business}.

- ESPN article: [sports: 0.8, business: 0.2, science: 0.0]
- Bloomberg article: [business: 0.7, science: 0.2, sports: 0.1]

For each topic and document pair, we learn some probability: **P(topic|document)**.
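
Under the mixture assumption, the two learned quantities combine to give the probability of a word in a document: sum P(word|topic) × P(topic|document) over all topics. The sketch below works this through for the word ‘baseball’ in the ESPN article, using only the example numbers above. The function name `word_given_document` is a hypothetical helper.

```python
# P(word | topic) for a single word, taken from the example values above.
word_given_topic = {
    "sports": {"baseball": 0.2},
    "science": {"baseball": 0.0001},
    "business": {"baseball": 0.0001},
}

# P(topic | document) for the ESPN article.
espn = {"sports": 0.8, "business": 0.2, "science": 0.0}

def word_given_document(word, topic_given_doc, word_given_topic):
    """Mix the per-topic word probabilities by the document's topic weights."""
    return sum(
        p_topic * word_given_topic[topic].get(word, 0.0)
        for topic, p_topic in topic_given_doc.items()
    )

# 0.8 * 0.2 + 0.2 * 0.0001 + 0.0 * 0.0001, roughly 0.16002
p_baseball = word_given_document("baseball", espn, word_given_topic)
```

Almost all of the probability comes from the sports topic, which matches the intuition that an ESPN article mentions ‘baseball’ because it is mostly about sports.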


Topic models are useful for organizing a collection of documents and uncovering the main underlying concepts.

There are many variants that attempt to add even more structure to the ‘model’:

- Supervised topic models guide the process with pre-decided topics.

- Position-dependent topic models ignore which words occur in which document and instead focus on where they occur.

- Variable number topic models test different numbers of topics to find the best model.
