# DSCI 575: Advanced Machine Learning (in the context of Natural Language Processing (NLP) applications)

UBC Master of Data Science program, 2019-20

Instructor: Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]

## Lecture 3: Markov models

### Learning outcomes

By the end of this lesson, you will be able to

- Define Markov chains.
- Carry out generation and inference with Markov chains. 
- Compute the probability of a sequence of states. 
- Explain the general idea of a stationary distribution. 
- Justify and apply Markov chains to compute the probability of natural language sentences. 

In [2]:
import re
from urllib.request import urlopen

import pandas as pd
import numpy as np
import os, sys
from IPython.display import display, HTML

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import time

from collections import defaultdict
from collections import Counter

### Does this look like Python code?

```
import sys
import warnings
import ast
import numpy.core.overrides import set_module
# While not in __all__, matrix_power used to be defined here, so we import
# it for backward compatibility
    getT = T.fget
    getI = I.fget
```

```
def _from_string(data):
    for char in '[]':
        data = data.replace(char, '')
    rows = str.split(';')
    rowtup = []
        for row in rows:
        trow = newrow
        coltup.append(thismat)
        rowtup.append(concatenate(coltup, axis=-1))
            return NotImplemented
```

### A high-level description of how the text was generated? 

- Suppose you have a corpus of thousands of Python programs. 
- Assume a discrete probability distribution over all unique words in the corpus.
- You scroll through the corpus and note down the first word you see on the page. 
- Choose each successive word based on the current word and continue for a while...   

<img src="images/Python_generation_Markov.png" height="550" width="550"> 


### Markov chains you have seen in MDS so far 

- DSCI 512
    - You wrote code to generate text using Markov models of language. 
- DSCI 553
    - You used it as a mathematical tool to simulate from the posterior distribution. 

### Markov chain idea and applications 

- Often we need to make inferences about evolving environments.
- Represent the state of the world at each specific point via a series of snapshots or time slices. 
- Predict future depending upon 
    - what the current state is and 
    - the probability of change    

- Examples: 
    - Weather: Given that today is cold, what will be the weather tomorrow? 
    - Stock prices: Given the current market conditions what will be the stock prices tomorrow?

### How are Markov chains relevant in NLP?

- Language is a temporal phenomenon. When we speak, we produce streams of indefinite length. 
- A simplistic model of language is an n-gram language model which is based on Markov chains.

<img src="images/Markov_autocompletion.png" height="500" width="500"> 


<img src="images/bigram_probabilities.png" height="500" width="500"> 



### Markov’s own application of his chains  (1913)


- Studied the sequence of 20,000 letters in A. S. Pushkin's poem _Eugeny Onegin_.

 
<img src="images/Markov_Pushkin.png" height="800" width="800"> 



### Markov assumption


<img src="images/Markov_assumption.png" height="550" width="550"> 

**Markov assumption: The future is conditionally independent of the past given present**

- In the example above 

    $$P(S_{3}|S_0, S_1, S_2) \approx P(S_{3} | S_2)$$

- Generalizing it to $t$ time steps

$$P(S_{t+1}|S_0, \dots, S_t) \approx P(S_{t+1} | S_t)$$
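
Combined with the chain rule of probability, the Markov assumption gives the factorization that is used to score whole sequences of states. A short derivation in the same notation (the product index $i$ is introduced here for illustration):

$$
\begin{aligned}
P(S_0, S_1, \dots, S_t) &= P(S_0)\prod_{i=1}^{t} P(S_i \mid S_0, \dots, S_{i-1}) && \text{(chain rule)}\\
&\approx P(S_0)\prod_{i=1}^{t} P(S_i \mid S_{i-1}) && \text{(Markov assumption)}
\end{aligned}
$$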
        

### Let's look at the details of Markov chains

### Discrete Markov chain example 


<img src="images/Markov_weather_example.png" height="1000" width="1000"> 

### Discrete Markov chain ingredients: State space

<img src="images/Markov_chain.png" height="300" width="300"> 

- We have discrete timesteps: $t = 0, t = 1, \dots$.
- **State space**: We have a finite set of possible states we can be in at time $t$
    - Represent the unique observations in the world. 
    - We can be in only one state at a given time. 
    - Here $S = \{HOT, COLD, WARM\}$.

### Discrete Markov chain ingredients: Initial probability distribution over states
<img src="images/Markov_chain.png" height="300" width="300"> 

- State space: $S = \{\text{HOT, COLD, WARM}\}$, 
- We could start in any state. The probability of starting with a particular state is given by an **initial discrete probability distribution over states**.        
    - Here, $\pi_0 = \begin{bmatrix} P(\text{HOT at time 0}) & P(\text{COLD at time 0}) & P(\text{WARM at time 0}) \end{bmatrix} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$    
    

### Discrete Markov chain ingredients: Transition probability matrix

<img src="images/Markov_chain.png" height="300" width="300"> 


- State space: $S = \{\text{HOT, COLD, WARM}\}$, initial probability distribution: $\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$
- **Transition probability matrix** $T$, where each $a_{ij}$ represents the probability of moving from state $s_i$ to state $s_j$, such that $\sum_{j=1}^{n} a_{ij} = 1, \forall i$ 

$$ T = 
\begin{bmatrix}
\text{P(HOT|HOT)} & \text{P(COLD|HOT)} & \text{P(WARM|HOT)}\\
\text{P(HOT|COLD)} & \text{P(COLD|COLD)} & \text{P(WARM|COLD)}\\
\text{P(HOT|WARM)} & \text{P(COLD|WARM)} & \text{P(WARM|WARM)}\\
\end{bmatrix}
=
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$$ 

- Note that each row sums to 1.0. 
- Each state has a probability of staying in the same state (or transitioning to itself).
- _Note that some people use the notation where the columns sum to one._
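
A minimal sketch of how the three ingredients could be stored and sanity-checked with NumPy. The state order (index 0 = HOT, 1 = COLD, 2 = WARM) and the helper name `is_valid_chain` are assumptions made for this illustration.

```python
import numpy as np

# Assumed state order for this sketch: index 0 = HOT, 1 = COLD, 2 = WARM.
states = ["HOT", "COLD", "WARM"]
pi0 = np.array([0.5, 0.3, 0.2])
T = np.array([
    [0.5, 0.2, 0.3],  # transitions out of HOT
    [0.2, 0.5, 0.3],  # transitions out of COLD
    [0.3, 0.1, 0.6],  # transitions out of WARM
])


def is_valid_chain(pi0, T, tol=1e-9):
    """Hypothetical helper: check that pi0 and every row of T are distributions."""
    pi0 = np.asarray(pi0, dtype=float)
    T = np.asarray(T, dtype=float)
    # T must be square and match the number of states in pi0.
    if T.ndim != 2 or T.shape[0] != T.shape[1] or T.shape[0] != pi0.shape[0]:
        return False
    # Probabilities cannot be negative.
    if (pi0 < 0).any() or (T < 0).any():
        return False
    # pi0 sums to 1 and each row of T sums to 1 (row convention).
    return bool(abs(pi0.sum() - 1) < tol and np.allclose(T.sum(axis=1), 1, atol=tol))


print(is_valid_chain(pi0, T))
```

Passing `T.T` instead of `T` fails the check here, which is a quick way to notice that a matrix follows the column convention.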

### Weather example: state space, initial probability distribution, transition probability matrix

$S = \{\text{HOT, COLD, WARM}\}$, $\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$, T = 
$
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$



<img src="images/Markov_chain.png" height="550" width="550"> 


### Markov chain general definition 

- A set of $n$ states: $S = \{s_1, s_2, ..., s_n\}$
- A discrete initial probability distribution over states $\pi_0 = \begin{bmatrix} \pi_{s_1} & \pi_{s_2} & \dots & \pi_{s_n} \end{bmatrix}$

- Transition probability matrix $T$, where each $a_{ij}$ represents the probability of moving from state $s_i$ to state $s_j$, such that $\sum_{j=1}^{n} a_{ij} = 1, \forall i$ 


$$ T = 
\begin{bmatrix}
    a_{11}       & a_{12} & a_{13} & \dots & a_{1n} \\
    a_{21}       & a_{22} & a_{23} & \dots & a_{2n} \\
    \dots \\
    a_{n1}       & a_{n2} & a_{n3} & \dots & a_{nn}
\end{bmatrix}
$$


### Homogeneous Markov chains

- Transition probabilities are the same for all $t$.
- In this class we will assume homogeneous Markov chains.


### What can we do with Markov chains? 

- **Predict probabilities of sequences of states**
- **Inference**: compute probability of being in a particular state at time $t$.    
- **Stationary distribution**: Find the steady state after running for a long time
- Generation: generate sequences that follow the probabilities of the states. 
    - You will be doing this in the lab. 
- Decoding: compute most likely sequences of states
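
As an illustration of generation, here is a minimal sketch that samples a state sequence from the weather chain. The function name `generate_sequence` and the `seed` parameter are assumptions of this sketch, not part of the lab.

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]
pi0 = np.array([0.5, 0.3, 0.2])
T = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.5, 0.3],
              [0.3, 0.1, 0.6]])


def generate_sequence(pi0, T, states, length, seed=None):
    """Sample `length` states: the first from pi0, the rest from rows of T."""
    rng = np.random.default_rng(seed)
    current = rng.choice(len(states), p=pi0)
    sequence = [states[current]]
    for _ in range(length - 1):
        # The next state depends only on the current state (Markov assumption).
        current = rng.choice(len(states), p=T[current])
        sequence.append(states[current])
    return sequence


print(generate_sequence(pi0, T, states, length=7, seed=575))
```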

### Predict probabilities of sequences of states 

- Given the Markov model: $S = \{\text{HOT, COLD, WARM}\}$, 
$\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$, T = 
$
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$

- Compute the probability of the sequence: HOT, HOT, WARM, COLD
    - Markov assumption: $P(S_{t+1}|S_{0}, S_1, \dots, S_t) \approx P(S_{t+1}| S_t)$

$$\begin{equation}
\begin{split}
P(\textrm{HOT, HOT, WARM, COLD}) =& P(\textrm{HOT}) \times P(\textrm{HOT|HOT})\\ 
                                  & \times P(\textrm{WARM|HOT})\\
                                  & \times P(\textrm{COLD|WARM})\\
                                 =& 0.5  \times 0.5 \times 0.3 \times 0.1 = 0.0075\\
\end{split}
\end{equation}$$
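
The same calculation can be sketched in code. The helper `sequence_probability` and the `index` lookup are names assumed for this illustration.

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]
index = {state: i for i, state in enumerate(states)}
pi0 = np.array([0.5, 0.3, 0.2])
T = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.5, 0.3],
              [0.3, 0.1, 0.6]])


def sequence_probability(sequence, pi0, T, index):
    """P(s_0) times the product of P(s_i | s_{i-1}) along the sequence."""
    ids = [index[state] for state in sequence]
    prob = pi0[ids[0]]
    for prev, curr in zip(ids, ids[1:]):
        prob *= T[prev, curr]
    return float(prob)


# Approximately 0.0075, up to floating-point rounding.
print(sequence_probability(["HOT", "HOT", "WARM", "COLD"], pi0, T, index))
```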

<img src="images/Markov_chain.png" height="300" width="300"> 

### Your turn (Activity: 5 minutes)

- Pause the video and predict probabilities of the following sequences of states on your own. 
    1. COLD, COLD, WARM
    2. HOT, COLD, HOT, COLD
    
Hint: If we want to predict the future, all that matters is the current state.

$S = \{\text{HOT, COLD, WARM}\}$, 
$\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$, T = 
$
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$

<img src="images/Markov_chain.png" height="500" width="500"> 

### Inference

- **Compute probability of being in a particular state at time $t$.**
- Example: What is the probability of HOT at time 1?
    * P(HOT at time zero) $\times$ P(HOT|HOT) + P(COLD at time zero) $\times$ P(HOT|COLD) + P(WARM at time zero) $\times$ P(HOT|WARM) = $0.5 \times 0.5 + 0.3 \times 0.2 + 0.2\times 0.3 = 0.37$
    
<img src="images/Markov_chain.png" height="400" width="400"> 

### Inference: What is the probability of HOT at time 1?
- P(HOT at time zero) $\times$ P(HOT|HOT) + P(COLD at time zero) $\times$ P(HOT|COLD) + P(WARM at time zero) $\times$ P(HOT|WARM) = $0.5 \times 0.5 + 0.3 \times 0.2 + 0.2\times 0.3 = 0.37$    
- Dot product between $\pi_0$ and the first column of the transition matrix!

$$\pi_0 = \begin{bmatrix} P(\text{HOT at time 0}) & P(\text{COLD at time 0}) & P(\text{WARM at time 0}) \end{bmatrix} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$$

$$ T = 
\begin{bmatrix}
\text{P(HOT|HOT)} & \text{P(COLD|HOT)} & \text{P(WARM|HOT)}\\
\text{P(HOT|COLD)} & \text{P(COLD|COLD)} & \text{P(WARM|COLD)}\\
\text{P(HOT|WARM)} & \text{P(COLD|WARM)} & \text{P(WARM|WARM)}\\
\end{bmatrix}
= \begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix} $$
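
A small sketch of this dot product in NumPy, assuming the state order HOT, COLD, WARM for the indices.

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]
pi0 = np.array([0.5, 0.3, 0.2])
T = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.5, 0.3],
              [0.3, 0.1, 0.6]])

# Column 0 of T holds P(HOT | previous state) for each previous state.
p_hot_at_1 = pi0 @ T[:, 0]
print("P(HOT at time 1):", p_hot_at_1)  # approximately 0.37

# The same dot product with each of the other columns gives the other states.
for j, state in enumerate(states):
    print(state, pi0 @ T[:, j])
```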

<img src="files/images/Markov_chain.png" height="400" width="400"> 


### Inference: What is the probability of HOT, COLD, WARM at time 1?

$$\pi_1 = \pi_0T$$

$$\pi_0 = \begin{bmatrix} P(\text{HOT at time 0}) & P(\text{COLD at time 0}) & P(\text{WARM at time 0}) \end{bmatrix} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$$

$$ T = 
\begin{bmatrix}
\text{P(HOT|HOT)} & \text{P(COLD|HOT)} & \text{P(WARM|HOT)}\\
\text{P(HOT|COLD)} & \text{P(COLD|COLD)} & \text{P(WARM|COLD)}\\
\text{P(HOT|WARM)} & \text{P(COLD|WARM)} & \text{P(WARM|WARM)}\\
\end{bmatrix}
= \begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix} $$

$$\pi_1 = \begin{bmatrix} P(\text{HOT at time 1}) & P(\text{COLD at time 1}) & P(\text{WARM at time 1}) \end{bmatrix} =  \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}\begin{bmatrix} 0.5 & 0.2 & 0.3\\ 0.2 & 0.5 & 0.3\\ 0.3 & 0.1 & 0.6\\ \end{bmatrix} = \begin{bmatrix}0.37 & 0.27 & 0.36\end{bmatrix}$$

<img src="files/images/Markov_chain.png" height="300" width="300"> 


### Inference: What is the probability of HOT, COLD, WARM at time 2?
- Multiply $\pi_1$ by the transition matrix
    $$\pi_2 = \pi_1T$$

$$\pi_1 = \begin{bmatrix} P(\text{HOT at time 1}) & P(\text{COLD at time 1}) & P(\text{WARM at time 1}) \end{bmatrix} =  \begin{bmatrix}0.37 & 0.27 & 0.36\end{bmatrix}$$

$$ T = 
\begin{bmatrix}
\text{P(HOT|HOT)} & \text{P(COLD|HOT)} & \text{P(WARM|HOT)}\\
\text{P(HOT|COLD)} & \text{P(COLD|COLD)} & \text{P(WARM|COLD)}\\
\text{P(HOT|WARM)} & \text{P(COLD|WARM)} & \text{P(WARM|WARM)}\\
\end{bmatrix}
= \begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix} $$

$$\pi_2 = \begin{bmatrix} P(\text{HOT at time 2}) & P(\text{COLD at time 2}) & P(\text{WARM at time 2}) \end{bmatrix} = \pi_1T = \begin{bmatrix}0.347 & 0.245 & 0.408\end{bmatrix}$$

<img src="files/images/Markov_chain.png" height="300" width="300"> 


### Inference: probability of being in a particular state at time $t$

- Calculate 

$$\pi_t = \pi_{t-1} \times \text{transition probability matrix } T$$  

- Applying the matrix multiplication to the current state probabilities does an update to the state probabilities!

In [10]:
pi0 = np.matrix('0.5, 0.3, 0.2')
T = np.matrix('0.5 0.2 0.3; 0.2 0.5 0.3; 0.3 0.1 0.6')
print("pi_0: ", pi0)
print("pi_1: ", pi0@T)
print("pi_2: ", pi0@T@T)
print("pi_3: ", pi0@T@T@T)
print("pi_4: ", pi0@T@T@T@T)

pi_0:  [[0.5 0.3 0.2]]
pi_1:  [[0.37 0.27 0.36]]
pi_2:  [[0.347 0.245 0.408]]
pi_3:  [[0.3449 0.2327 0.4224]]
pi_4:  [[0.34571 0.22757 0.42672]]


In [12]:
pi0*np.linalg.matrix_power(T,4)

matrix([[0.34571, 0.22757, 0.42672]])

In [4]:
def print_pi_over_time(pi0, T, steps=10):
    current = pi0
    for i in range(steps):    
        print('State probabilities at time step ', i, current)
        current = current@T
        
pi0 = np.matrix('0.5, 0.3, 0.2')
print("Initial probability distribution over states: ", pi0)
T = np.matrix('0.5 0.2 0.3; 0.2 0.5 0.3; 0.3 0.1 0.6')
print("The transition probability matrix: \n", T)
print_pi_over_time(pi0, T, steps=15)

Initial probability distribution over states:  [[0.5 0.3 0.2]]
The transition probability matrix: 
 [[0.5 0.2 0.3]
 [0.2 0.5 0.3]
 [0.3 0.1 0.6]]
State probabilities at time step  0 [[0.5 0.3 0.2]]
State probabilities at time step  1 [[0.37 0.27 0.36]]
State probabilities at time step  2 [[0.347 0.245 0.408]]
State probabilities at time step  3 [[0.3449 0.2327 0.4224]]
State probabilities at time step  4 [[0.34571 0.22757 0.42672]]
State probabilities at time step  5 [[0.346385 0.225599 0.428016]]
State probabilities at time step  6 [[0.3467171 0.2248781 0.4284048]]
State probabilities at time step  7 [[0.34685561 0.22462295 0.42852144]]
State probabilities at time step  8 [[0.34690883 0.22453474 0.42855643]]
State probabilities at time step  9 [[0.34692829 0.22450478 0.42856693]]
State probabilities at time step  10 [[0.34693518 0.22449474 0.42857008]]
State probabilities at time step  11 [[0.34693756 0.22449141 0.42857102]]
State probabilities at time step  12 [[0.34693837 0.22449032

### Stationary distribution

- A stationary distribution of a Markov chain is a probability distribution over states that remains unchanged in the Markov chain as time progresses.

- A probability distribution $\pi$ on states $S$ is stationary if the following holds for the transition matrix $T$.    


$$\pi T=\pi$$ 


### Stationary distribution: SkyTrain example scenario

<blockquote>
Suppose TransLink launches Downtown to UBC SkyTrain. In the first month of operation it was found that 20% of the commuters going to UBC started using it and 80% of the commuters were still using other modes of transportation. The following transition matrix was determined from the records of other transit systems. 
</blockquote>


$$S = \{\text{SkyTrain, Other}\}, \pi_0 = \begin{bmatrix} 0.20 & 0.80 \end{bmatrix}, 
T = \begin{bmatrix}
0.9 & 0.1\\
0.4 & 0.6\\
\end{bmatrix}
$$

$$
Labeled\_T = 
\begin{bmatrix}
     & \text{SkyTrain} & \text{Other}\\
\text{SkyTrain}  & 0.9 & 0.1\\
\text{Other} & 0.4 & 0.6\\
\end{bmatrix}
$$

### We might want to answer the following questions

1. What percentage of the commuters will be using the SkyTrain after two months? 
2. What about after three months?
3. What's the percentage of the commuters using the SkyTrain after the service has been in place for a long time? 

### What percentage of the commuters will be using the SkyTrain after two months and after three months?

- State probability distribution after **one** month (initial state probability distribution)
    - $\pi_0 = \begin{bmatrix} 0.20 & 0.80 \end{bmatrix}$
- State probability distribution after **two** months:
    - $\pi_1 = \pi_0 T = \begin{bmatrix} 0.20 & 0.80 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1\\ 0.4 & 0.6\\ \end{bmatrix} = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}$  
- State probability distribution after **three** months:
    - $\pi_2 = \pi_1 T =  \begin{bmatrix} 0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1\\ 0.4 & 0.6\\ \end{bmatrix} = \begin{bmatrix} 0.65 & 0.35 \end{bmatrix}$

- Big improvement at each time step!! How long does this continue?

In [7]:
pi_0 = np.matrix('0.2 0.8')
T = np.matrix('0.9 0.1; 0.4 0.6')
print_pi_over_time(pi_0, T, steps = 40)

State probabilities at time step  0 [[0.2 0.8]]
State probabilities at time step  1 [[0.5 0.5]]
State probabilities at time step  2 [[0.65 0.35]]
State probabilities at time step  3 [[0.725 0.275]]
State probabilities at time step  4 [[0.7625 0.2375]]
State probabilities at time step  5 [[0.78125 0.21875]]
State probabilities at time step  6 [[0.790625 0.209375]]
State probabilities at time step  7 [[0.7953125 0.2046875]]
State probabilities at time step  8 [[0.79765625 0.20234375]]
State probabilities at time step  9 [[0.79882813 0.20117188]]
State probabilities at time step  10 [[0.79941406 0.20058594]]
State probabilities at time step  11 [[0.79970703 0.20029297]]
State probabilities at time step  12 [[0.79985352 0.20014648]]
State probabilities at time step  13 [[0.79992676 0.20007324]]
State probabilities at time step  14 [[0.79996338 0.20003662]]
State probabilities at time step  15 [[0.79998169 0.20001831]]
State probabilities at time step  16 [[0.79999084 0.20000916]]
State pro

### Stationary distribution

- It seems that after the $27^{th}$ time step, the printed state probabilities no longer change.
- It seems we have reached a steady state at $\pi = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix}$ such that

$$\begin{bmatrix} 0.80 & 0.20 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1\\ 0.4 & 0.6\\ \end{bmatrix} = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix} $$

- So the distribution $\pi = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix}$ is a stationary distribution in this case because we have $\pi T = \pi$. 
- What's the percentage of the commuters using the SkyTrain after the service has been in place for a long time? 
    - In the long run we can expect 80% of the commuters using the SkyTrain.
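
For a two-state chain like this one, the steady state can also be found by hand by solving $\pi T = \pi$ together with the constraint that the entries of $\pi$ sum to one. Writing $\pi = \begin{bmatrix} p & 1-p \end{bmatrix}$ (the symbol $p$ is introduced here for illustration) and looking at the SkyTrain entry:

$$
\begin{aligned}
p &= 0.9\,p + 0.4\,(1-p)\\
p &= 0.4 + 0.5\,p\\
p &= 0.8
\end{aligned}
$$

This gives $\pi = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix}$, which agrees with the values the simulation approaches.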

### Conditions for stationary distribution

- Does a stationary distribution $\pi$ exist and is it unique?
- Under mild assumptions, a Markov chain has a stationary distribution. 

### Conditions for stationary distribution

- A sufficient condition for existence/uniqueness is that all transition probabilities are positive
    * $P(s_t | s_{t-1}) > 0$
    
- Weaker sufficient conditions for existence/uniqueness
    * _Irreducible_ 
        - A Markov chain is irreducible if it is possible to get to any state from any state.
        - It does not get stuck in part of the graph.     
    * _Aperiodic_        
        - Loosely, a Markov chain is aperiodic if it does not keep repeating the same sequence. 
        - The formal definition is a bit more involved; see [this](https://en.wikipedia.org/wiki/Markov_chain#Periodicity) if you want the details. 
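
To see what periodicity looks like in practice, here is a small sketch with a hypothetical two-state chain that always switches state. It is irreducible but periodic: repeatedly multiplying by the transition matrix just swaps the two probabilities, so the iteration from this starting point never settles, even though the uniform distribution satisfies $\pi T = \pi$.

```python
import numpy as np

# Hypothetical chain (not from the lecture): each state always moves to the other one.
T_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])

pi = np.array([0.9, 0.1])
history = [pi]
for _ in range(4):
    pi = pi @ T_periodic
    history.append(pi)

# The probabilities keep swapping between [0.9 0.1] and [0.1 0.9].
for step, dist in enumerate(history):
    print(step, dist)

# The uniform distribution is unchanged by one step of the chain.
uniform = np.array([0.5, 0.5])
print(np.allclose(uniform @ T_periodic, uniform))
```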

### Irreducibility and aperiodicity

- Which chains are irreducible? Which ones are aperiodic?
    * _Irreducible_ (doesn’t get stuck in part of the graph)
    * _Aperiodic_ (doesn’t keep repeating same sequence).    
<img src="images/Markov_irreducibility_aperiodicity.png" height="900" width="900"> 

### How to estimate the stationary distribution?

- Power iteration method
    - Multiply $\pi_0$ by powers of the transition matrix $T$ until the product looks stable. 
- Taking the eigenvalue decomposition of the transition matrix: a stationary distribution $\pi$ is a left eigenvector of $T$ with eigenvalue 1, because
$$\pi T=\pi$$
- Through Monte Carlo simulation.
- In some cases (not always) simply counting the occurrences (lab). 

There are other ways too! 
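
A minimal sketch of the eigenvalue approach, assuming the chain has a unique stationary distribution. The function name `stationary_from_eig` is hypothetical; `np.linalg.eig` returns right eigenvectors, so the sketch applies it to the transpose of $T$.

```python
import numpy as np


def stationary_from_eig(T):
    """Return the left eigenvector of T for eigenvalue 1, scaled to sum to 1."""
    T = np.asarray(T, dtype=float)
    # pi T = pi is equivalent to T^T pi^T = pi^T, so we decompose the transpose.
    eigenvalues, eigenvectors = np.linalg.eig(T.T)
    # Pick the eigenvalue closest to 1 (assumed unique in this sketch).
    k = np.argmin(np.abs(eigenvalues - 1.0))
    v = np.real(eigenvectors[:, k])
    # Dividing by the sum fixes both the scale and the sign.
    return v / v.sum()


T_skytrain = np.array([[0.9, 0.1],
                       [0.4, 0.6]])
print(stationary_from_eig(T_skytrain))  # approximately [0.8 0.2]
```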

### Learning Markov chains

- Learning Markov chains is just counting.
    * Similar to naive Bayes
    
- Given $n$ samples, MLE for homogeneous Markov chain is:

    * Initial: $P(s_i) = \frac{\text{number of times we start in } s_i}{n} $

    * Transition: $P(s_j|s_{i}) = \frac{\text{number of times we moved from } s_{i} \text{ to } s_j}{\text{number of times we moved from } s_{i} \text{ to } anything}$ 
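
The counting recipe above can be sketched as follows. The function name `fit_markov_chain` and the three training sequences are made up for this illustration, and the sketch assumes every sequence is non-empty.

```python
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """MLE by counting: initial probabilities and transition probabilities."""
    start_counts = Counter(seq[0] for seq in sequences)
    transition_counts = defaultdict(Counter)
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            transition_counts[prev][curr] += 1

    n = len(sequences)
    initial = {state: count / n for state, count in start_counts.items()}
    transitions = {
        prev: {curr: count / sum(counts.values()) for curr, count in counts.items()}
        for prev, counts in transition_counts.items()
    }
    return initial, transitions


# Hypothetical training data: three short weather sequences.
samples = [
    ["HOT", "HOT", "WARM"],
    ["COLD", "COLD", "WARM", "HOT"],
    ["HOT", "WARM", "WARM"],
]
initial, transitions = fit_markov_chain(samples)
print(initial)             # HOT starts 2 of the 3 sequences, COLD starts 1
print(transitions["HOT"])  # HOT is followed by HOT once and by WARM twice
```

States and transitions that never occur in the data simply get no entry, which corresponds to an estimated probability of zero.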

### Language example

- Suppose you want to learn a Markov chain for words in a corpus of $n$ documents.
- Set of states is the set of all unique words in the corpus.
- Calculate the initial probability distribution $\pi_0$
    - For all states (unique words) $w_i$, compute $\frac{\text{number of times a document starts in } w_i}{n} $ 
    
- Calculate the transition probabilities for all state combinations $w_i$ and $w_j$
    - $\frac{\text{number of times } w_i \text{ is followed by } w_j}{\text{number of times } w_i \text{ is followed by anything}}$ 
     

In [6]:
toy_corpus = "What’s in a name? A rose by any other name would smell as sweet. "
toy_corpus_tokens = nltk.word_tokenize(toy_corpus.lower())

# frequencies[w_i][w_j] counts how many times w_i is immediately followed by w_j
frequencies = defaultdict(Counter)
for current_word, next_word in zip(toy_corpus_tokens, toy_corpus_tokens[1:]):
    frequencies[current_word][next_word] += 1
    
freq_df = pd.DataFrame(frequencies).transpose()
freq_df = freq_df.fillna(0)
freq_df

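| | ’ | s | in | a | name | rose | ? | would | by | any | other | smell | as | sweet | . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| what | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ’ | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| s | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| in | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| a | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| name | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ? | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| rose | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| by | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| any | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |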


### Applications of Markov chains in NLP

### Application 1: Markov’s own application of his chains
- Studied the sequence of 20,000 letters in A. S. Pushkin's poem _Eugeny Onegin_.
- Gave the stationary distribution for vowels and consonants
    * $\pi = [0.432, 0.568]$, $S = \{\text{vowel, consonant}\}, T =
    \begin{bmatrix}
         & \text{vowel} & \text{consonant}\\
    \text{vowel}  & 0.128 & 0.872\\
    \text{consonant} & 0.663 & 0.337\\
    \end{bmatrix}
    $

- Stationary distribution in this case can be calculated as: 
$$\begin{bmatrix}
    \frac{\text{# vowels}}{\text{total number of characters}} & \frac{\text{# consonants}}{\text{total number of characters}}\\
    \end{bmatrix} $$

<img src="images/Markov_Pushkin.png" height="500" width="500"> 

In [36]:
# Markov's Pushkin Onegin consonant vowel probabilities
pi = np.matrix('0.432 0.568')
T = np.matrix('0.128 0.872; 0.663 0.337')
print(pi@T)
print(pi@T@T)
print(pi@T@T@T)

[[0.43188 0.56812]]
[[0.4319442 0.5680558]]
[[0.43190985 0.56809015]]


### Markov’s own application of his chains

- Markov also studied the sequence of 100,000 letters in S. T. Aksakov's novel "The Childhood of Bagrov, the Grandson" (tedious calculation)
- Gave the stationary distribution for vowels and consonants.
    * $\pi = [0.449,0.551]$ 
    * $S = \{\text{vowel, consonant}\}$ 
    * $ T = 
    \begin{bmatrix}
         & \text{vowel} & \text{consonant}\\
    \text{vowel}  & 0.552 & 0.448\\
    \text{consonant} & 0.365 & 0.635\\
    \end{bmatrix}
    $
    
- Stationary distribution in this case can be calculated as: 
$$\begin{bmatrix}
    \frac{\text{# vowels}}{\text{total number of characters}} & \frac{\text{# consonants}}{\text{total number of characters}}\\
    \end{bmatrix} $$  

In [38]:
# Markov's stationary distribution for S. T. Aksakov's novel "The Childhood of Bagrov, the Grandson"
pi = np.matrix('0.449,0.551')
T = np.matrix('0.552 0.448; 0.365 0.635')
print_pi_over_time(pi, T)

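State probabilities at time step  0 [[0.449 0.551]]
State probabilities at time step  1 [[0.448963 0.551037]]
State probabilities at time step  2 [[0.44895608 0.55104392]]
State probabilities at time step  3 [[0.44895479 0.55104521]]
State probabilities at time step  4 [[0.44895455 0.55104545]]
State probabilities at time step  5 [[0.4489545 0.5510455]]
State probabilities at time step  6 [[0.44895449 0.55104551]]
State probabilities at time step  7 [[0.44895449 0.55104551]]
State probabilities at time step  8 [[0.44895449 0.55104551]]
State probabilities at time step  9 [[0.44895449 0.55104551]]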


### Application 2: Ngram language models

- Suppose your states are words and you are computing probabilities of sequences of words.  
- What does it tell us? 
    - Which sequence of words is more likely to occur in English?

<blockquote>
P(In the age of data algorithms have the answer) $>$ <br>
P(The age data of in algorithms answer the have) 
</blockquote>


### Language models

- Compute the probability of a sentence or a sequence of words
$P(w_1, w_2,\dots,w_t)$

- A related task: What's the probability of an upcoming word? 
$P(w_t|w_1,w_2,\dots,w_{t-1})$ 
    - Example: Your smartphone's or Gmail's feature of next word(s) suggestion

A model that computes either of these probabilities is called a _language model_.

### Language modeling: Why should we care?

Language modeling is a powerful idea in NLP and helps in many tasks.
- Machine translation 
    * P(In the age of data algorithms have the answer) > P(the age data of in algorithms answer the have)
- Spelling correction
    * My office is a 10  <span style="color:red">minuet</span> bus ride from my home.  
        * P(10 <span style="color:blue">minute</span> bus ride from my home) > P(10 <span style="color:red">minuet</span> bus ride from my home)
- Speech recognition 
    * P(<span style="color:blue">I read</span> a book) > P(<span style="color:red">Eye red</span> a book)

### Calculating probabilities of a sequence by applying chain rule 

Example: Suppose we want to calculate the probability of the following sequence of words: 

$$\begin{equation}
\begin{split}
P(\textrm{In the age of data algorithms have the answer}) =& P(\textrm{In}) \times P(\textrm{the|In})\\ 
                                              & \times P(\textrm{age|In the}) \times P(\textrm{of|In the age})\\
                                              & \times P(\textrm{data|In the age of})\\
                                              & \times P(\textrm{algorithms|In the age of data}) \\
                                              &  \times P(\textrm{have|In the age of data algorithms}) \\
                                              & \dots 
\end{split}
\end{equation}$$

- What if we just count occurrences to get conditional probabilities? 
- <span style="color:red">BAD IDEA!!</span> The counts will be tiny and the model will be very sparse. 

### Markov model of language

When predicting the future, the past doesn't matter; only the present does. 

- Bigram language model
    
$$
P(\textrm{algorithms|In the age of data}) \approx P(\textrm{algorithms|data})
$$

### Markov model of language (bigram language model)

- Use Markov assumption and calculate the probability of a sequence as follows!
\begin{equation}
\begin{split}
P(\textrm{In the age of data algorithms have the answer}) =& P(\textrm{In}) \times P(\textrm{the|In})\\ 
                                              & \times P(\textrm{age|the})\\
                                              & \times P(\textrm{of|age})\\
                                              & \times P(\textrm{data|of})\\
                                              & \times P(\textrm{algorithms|data}) \\                 
                                              & \times P(\textrm{have|algorithms}) \\                             
                                              & \times P(\textrm{the|have}) \\                                   
                                              & \times P(\textrm{answer|the}) \\                                                                                 
\end{split}
\end{equation}

### Estimating probabilities for the bigram language model

- Example
$$P(\textrm{algorithms|data}) = \frac{Count(\textrm{data algorithms})}{Count(\textrm{data})}$$
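
A minimal sketch of this count-based estimate. The toy corpus and the whitespace tokenization are assumptions made for illustration, and the names `bigram_prob` and `conditional_sequence_prob` are hypothetical. The second function multiplies only the conditional terms and leaves out the probability of the first word.

```python
from collections import Counter

# Hypothetical toy corpus; splitting on whitespace is a simplifying assumption.
corpus = "in the age of data algorithms have the answer and data algorithms need data"
tokens = corpus.split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) as Count(prev_word word) / Count(prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # prev_word never seen, so there is nothing to estimate from
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


def conditional_sequence_prob(words):
    """Product of P(w_i | w_{i-1}) over the sequence; P(w_1) is not included."""
    prob = 1.0
    for prev_word, word in zip(words, words[1:]):
        prob *= bigram_prob(prev_word, word)
    return prob


print(bigram_prob("data", "algorithms"))                    # 2 of the 3 "data" tokens are followed by "algorithms"
print(conditional_sequence_prob("the age of data".split())) # 0.5 * 1.0 * 1.0
```

Any bigram that does not occur in the corpus makes the whole product zero, which is the sparsity issue raised earlier.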

### Considering more history 

- Example: trigram or four-gram language models
    - Trigram language model
$$
P(\textrm{algorithms|In the age of data}) \approx P(\textrm{algorithms|of data})
$$
    - Four-gram language model
$$
P(\textrm{algorithms|In the age of data}) \approx P(\textrm{algorithms|age of data})
$$


- One way to incorporate more history is to make each state larger and apply the Markov framework we know well. 
    - **You will be doing this in the lab.**

### Considering more history: Example 

Consider this corpus = {a rose is a rose}, Vocabulary size: $V = 3$

- Word bigram model of language (n = 1)    
    * Each state consists of a word (unigram) from the vocabulary. 
    * State space = {a, rose, is} ($3$ states)  
    * Markov assumption: $P(s_{t+1}|s_0, \dots s_t) = P(s_{t+1}|s_t)$ 
    * $P(\text{rose|a rose is a}) = P(\text{rose|a})$

### Considering more history: Example

Consider this corpus = {a rose is a rose}, Vocabulary size: $V = 3$

- Word trigram model of language (n = 2)
    * Each state consists of a sequence of two words (bigrams) from the vocabulary. 
    * State space = {(a a), (a rose), (a is), (rose a), (rose rose), (rose is), (is a), (is rose), (is is)} ($3^2$ states)
        - Many of these might not occur in the corpus
    * Markov assumption: $P(s_{t+1}|s_0, \dots s_t) = P(s_{t+1}|s_t)$ 
    * $P(\text{(a rose)|(a rose), (rose is), (is a)}) = P(\text{(a rose)}|\text{(is a)})$
    * What transitions are possible?
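
A small sketch of the "larger states" idea on this corpus, where each state is a pair of consecutive words. Variable names are illustrative and tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict

tokens = "a rose is a rose".split()

# Each state is a pair of consecutive words (a bigram).
states = list(zip(tokens, tokens[1:]))
print(states)

# Count transitions between consecutive states.
transition_counts = defaultdict(Counter)
for current, nxt in zip(states, states[1:]):
    transition_counts[current][nxt] += 1

# Normalize the counts of each state into transition probabilities.
for current, counts in transition_counts.items():
    total = sum(counts.values())
    for nxt, count in counts.items():
        print(current, "->", nxt, count / total)
```

Because consecutive states overlap by one word, a transition from $(w_1, w_2)$ to $(w_3, w_4)$ can only have non-zero probability when $w_2 = w_3$.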

### Google Ngram release (2006)

- [All Our N-gram are Belong to You](https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html)
<blockquote>
<p>Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others.</p>
<p>That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.</p>
</blockquote>


### Are ngrams a good model of language?


- In many cases, we can get by with ngram models. 
- But in general, is it a good assumption that the next word that I utter will be dependent on the last $n$ words?

<blockquote>
    The computer I was talking about yesterday when we were having dinner crashed. 
</blockquote>    

- Language has long-distance dependencies.  
- We can extend it to $3$-grams, $4$-grams, $5$-grams. But then there is a sparsity problem. 
- Also, ngram models have huge RAM requirements.

### Language models with word embeddings

- Ngram models are great, but they represent context as exact words.
- Suppose in your training data you have the sequence "feed the cat" but you do not have the sequence "feed the dog".

<blockquote>
I have to make sure to feed the cat.
</blockquote>

- Trigram model: P(dog|feed the) = 0
- If we represent words with embeddings instead, we will be able to generalize to dog even if we haven't seen it in the corpus.
- We'll come back to this when we learn about Recurrent Neural Networks (RNNs). 

### Application 3: PageRank
- Graph-based ranking algorithm, which assigns a rank to a webpage.
- The rank indicates a relative score of the page's importance and authority.
- Intuition
    - Important webpages are linked from other important webpages.
    - Don't just look at the number of links coming to a webpage but consider who the links are coming from 
<img src="files/images/wiki_page_rank.jpg" height="500" width="500"> 

[Credit](https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg)


### PageRank: scoring

- Imagine a browser doing a random walk 
    - At time t=0, start at a random webpage.
    - At time t=1, follow a random link on the current page.
    - At time t=2, follow a random link on the current page. 
    
- Intuition
    - In the "steady state" each page has a long-term visit rate, which is the page's score (rank). 

### PageRank as a Markov chain

- A state is a web page
- Transition probabilities represent probabilities of moving from one page to another
- We derive these from the adjacency matrix of the web graph
    - Adjacency matrix $M$ is an $n \times n$ matrix, where $n$ is the number of states (web pages)
    - $M_{ij} = 1$ if there is a hyperlink from page $i$ to page $j$.      
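
One simple way to turn an adjacency matrix into transition probabilities is to divide each row by its number of outgoing links, which matches the idea of following a random link on the current page. The 4-page graph and the function name `transition_from_adjacency` below are made up for illustration; the sketch does not handle pages without outgoing links or the damping used in the full PageRank algorithm.

```python
import numpy as np

# Hypothetical 4-page web graph: M[i, j] = 1 if page i links to page j.
M = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])


def transition_from_adjacency(M):
    """Row-normalize M so each outgoing link of a page is equally likely."""
    M = np.asarray(M, dtype=float)
    out_degree = M.sum(axis=1, keepdims=True)
    if (out_degree == 0).any():
        raise ValueError("a page has no outgoing links; not handled in this sketch")
    return M / out_degree


T = transition_from_adjacency(M)
print(T)
```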

### Calculate page rank: power iteration method

- Start with a random initial probability distribution $\pi_0$
- Multiply $\pi_0$ by powers of the transition matrix $T$ until the product looks stable 
    - After one step, we are at $\pi_0 T$
    - After two steps, we are at $\pi_0 T^2$
    - After three steps, we are at $\pi_0 T^3$
    - Eventually (for a large $k$), $\pi_0 T^k \approx \pi$, where $\pi T = \pi$ 
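
A minimal sketch of power iteration with an explicit stopping rule, tried on the SkyTrain chain from earlier in the lecture. The function name `power_iteration`, the tolerance `tol`, and the cap `max_steps` are assumptions of this sketch.

```python
import numpy as np


def power_iteration(pi0, T, tol=1e-10, max_steps=10_000):
    """Multiply by T until the distribution changes by less than tol."""
    pi = np.asarray(pi0, dtype=float)
    T = np.asarray(T, dtype=float)
    for step in range(1, max_steps + 1):
        new_pi = pi @ T
        # Stop when no entry moved by more than the tolerance.
        if np.abs(new_pi - pi).max() < tol:
            return new_pi, step
        pi = new_pi
    raise RuntimeError("did not converge within max_steps")


pi, steps = power_iteration([0.2, 0.8], [[0.9, 0.1], [0.4, 0.6]])
print(pi, steps)
```

The loop raises an error instead of returning a value when the distribution keeps changing, which can happen for periodic chains.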
    
    
#### Want to know more details? 

- Check out lecture4 from last year in the archive folder.     

### Summary

- A discrete Markov chain is a random process that has 
    * a finite set of states 
    * an initial probability distribution over states
    * a transition probability matrix
- We can do a number of things with Markov chains
    - Generate a sequence of states. 
    - Calculate the probability of a sequence.  
    - Compute the probability of being in a particular state at time $t$. 
    - Calculate stationary distribution which is a probability distribution that remains unchanged in the Markov chain as time progresses. 
- Example applications of Markov chains in NLP
    - Language modeling
    - PageRank

### Other fun things with Markov chains 

- [Create and visualize Markov chains](https://www.stat.auckland.ac.nz/~wild/MarkovChains/)
- [Markov chains "explained visually"](http://setosa.io/ev/markov-chains)
- [Snakes and ladders](http://datagenetics.com/blog/november12011/index.html)
- [Candyland](http://www.datagenetics.com/blog/december12011/index.html)
- [Yahtzee](http://www.datagenetics.com/blog/january42012)
- [Chess pieces returning home and K-pop vs. ska](https://www.youtube.com/watch?v=63HHmjlh794)
- [The Life and Work of A. A. Markov](http://www.meyn.ece.ufl.edu/archive/spm_files/Markov-Work-and-life.pdf)
