![](img/575_banner.png)

# Lecture 1: Markov Models 

UBC Master of Data Science program, 2022-23

Instructor: Varada Kolhatkar

## Lecture plan, imports, LO

### Lecture plan 

- Motivation and high-level description (~15 mins)
- Markov chains definition (~10)
- Markov chains tasks (~15)
- Break (~5 mins)
- Q&A and iClicker questions (~5 mins)
- Stationary distribution (~15)
- Q&A and iClicker questions (~5 mins) 
- Final comments, summary, reflection (~5 mins)

### Imports 

In [1]:
import os
import re
import sys
import time
from collections import Counter, defaultdict

import IPython
import nltk
import numpy as np
import pandas as pd
from IPython.display import HTML, display
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

### Learning outcomes

From this lesson you will be able to

- Explain the general idea of a language model and name some of its applications. 
- Define Markov chains and explain terminology (states, initial probability distribution over states, and transition matrix) related to Markov chains.
- State the Markov assumption.
- Compute the probability of a sequence of states. 
- Compute the probability of being in a state at time $t$. 
- Explain the general idea of a stationary distribution.

<br><br><br><br>

## Motivation and high-level description 

### Activity

Go to [this Google doc](https://docs.google.com/document/d/1ZSvfUsGo7uY82mK_O1Hso_nRbbD8oIhELVMCwzOAR0s/edit?usp=sharing) and tell us how you would complete the sentences below. 

> #### In unsupervised learning the goal is to discover patterns or structure in __. 
> #### I am __. 

In [2]:
url = "https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html"

IPython.display.IFrame(url, width=1000, height=900)

And we all know the state-of-the-art language model [ChatGPT](https://chat.openai.com)!   

### An example of a state-of-the-art language model
- [ChatGPT](https://chat.openai.com/chat)
    - Doesn't need any introduction 

Both these things are based on the idea of **a language model**!! 

### What is a language model? 

A language model computes **the probability distribution over sequences of tokens**. Given some vocabulary $V$, a language model assigns a probability (a number between 0 and 1) to every sequence of tokens drawn from $V$. 

Intuitively, this probability tells us how "good" or plausible a sequence of tokens is. 

<!-- A model that computes the probability of a sequence of words (or characters) or the probability of an upcoming word (or character) is called a **language model**. -->

![](img/voice-assistant-ex.png)

<!-- <img src="img/voice-assistant-ex.png" height="1400" width="1400"> -->


- Compute the probability of a sentence or a sequence of words.
    - $P(w_1, w_2,\dots,w_t)$
    - P(I have read this book) > P(eye have red this book)

- A related task: What's the probability of an upcoming word? 
    - $P(w_t|w_1,w_2,\dots,w_{t-1})$ 
    - P(book | read this) > P(book | red this)



### Language modeling: Why should we care?

- Language modeling is a powerful idea in NLP and helps in many tasks.
- Over the last couple of years, large language models have started affecting billions of people, so it's important to understand some history and at least some fundamentals of language models.  
- In the old days they were used as a component of a larger system. 
    - Machine translation 
        * P(In the age of data algorithms have the answer) > P(the age data of in algorithms answer the have)
    - Spelling correction
        * My office is a 20  <span style="color:red">minuet</span> bike ride from my home.  
            * P(20 <span style="color:blue">minute</span> bike ride from my home) > P(20 <span style="color:red">minuet</span> bike ride from my home)
    - Speech recognition 
        * P(<span style="color:blue">I read</span> a book) > P(<span style="color:red">Eye red</span> a book)
- Now they are capable of being standalone systems (e.g., ChatGPT)
    - Question answering (e.g., Andrei Markov was born in __)
    - Generating news articles
    - Summarization
    - Writing assistants (e.g., https://www.ai21.com/)
    - ...

Why is this hard?
- It requires linguistic knowledge as well as world knowledge.     

**One of the simplest models of language is a Markov model of language!** And that's the topic for this week. 
- Today we'll go through the theory 
- In the next lecture we'll look at some applications. 

### Markov chains you have seen in MDS so far 

- DSCI 512
    - You wrote code to generate text using Markov models of language. 
- DSCI 553
    - You used it as a mathematical tool to simulate from the posterior distribution. 
    
The model we are going to look at is similar to what you've seen in DSCI 512.     

<br><br><br><br>

## Markov model intuition and definition

### Does this look like Python code?

```
import sys
import warnings
import ast
import numpy.core.overrides import set_module
# While not in __all__, matrix_power used to be defined here, so we import
# it for backward compatibility
    getT = T.fget
    getI = I.fget
```

```
def _from_string(data):
    for char in '[]':
        data = data.replace(char, '')
    rows = str.split(';')
    rowtup = []
        for row in rows:
        trow = newrow
        coltup.append(thismat)
        rowtup.append(concatenate(coltup, axis=-1))
            return NotImplemented
```

Let's look at a high-level description of how the text was generated. 

- Suppose you have a corpus of thousands of Python programs. 
- Assume a discrete probability distribution over all unique words in the corpus.
- You scroll through the corpus and note down the first word you see on that page. 
- Choose each successive word based on the current word, and continue for a while ...   
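
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the snippets above: the toy corpus, the whitespace tokenization, and the helper names `build_successors` and `generate` are all assumptions made for this example.

```python
import random
from collections import defaultdict

# Hypothetical toy corpus standing in for "thousands of Python programs".
toy_corpus = "import sys import os import numpy as np import pandas as pd"


def build_successors(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    successors = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        successors[current].append(nxt)
    return successors


def generate(successors, start, length, seed=0):
    """Generate up to `length` words, choosing each word given only the current one."""
    rng = random.Random(seed)
    sequence = [start]
    # Stop early if the current word was never followed by anything in the corpus.
    while len(sequence) < length and successors.get(sequence[-1]):
        # Successors that occur more often appear more often in the list,
        # so a uniform choice samples in proportion to the observed counts.
        sequence.append(rng.choice(successors[sequence[-1]]))
    return sequence


successors = build_successors(toy_corpus)
print(" ".join(generate(successors, start="import", length=8)))
```

Storing the successors as a plain list keeps the sketch short; a real implementation would more likely store counts or normalized probabilities.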

![](img/Python_generation_Markov.png)
<!-- <img src="img/Python_generation_Markov.png" height="550" width="550">  -->


### Markov chain idea

- Often we need to make inferences about evolving environments.
- Represent the state of the world at each specific point via a series of snapshots. 
- Markov chain idea: Predict the future based on 
    - the current state
    - the probability of change    

- Examples: 
    - Weather: Given that today is cold, what will be the weather tomorrow? 
    - Stock prices: Given the current market conditions what will be the stock prices tomorrow?
    - Text: Given that the speaker has uttered the word **data** what will be the next word? 

## Markov assumption

Suppose we want to calculate the probability of the sequence "there is a crack in everything that 's how the light gets in" (a phrase from Leonard Cohen's poem [Anthem](https://genius.com/Leonard-cohen-anthem-lyrics)). A naive approach to calculate this probability would be: 

\begin{equation}
\begin{split}
P(\text{there is a crack in everything that 's how the light gets in}) = &P(\text{there}) \times P(\text{is}\mid \text{there})\\ 
                                              & \times P(\text{a} \mid \text{there is}) \times P(\text{crack}\mid \text{there is a})\\
                                              & \times P(\text{in}\mid \text{there is a crack})\\
                                              & \times P(\text{everything} \mid \text{there is a crack in}) \\
                                              & \times P(\text{that} \mid \text{there is a crack in everything}) \\                                              
                                              & \dots\\        
\end{split}
\end{equation}


You can also express it compactly as a product of conditional probabilities, where $w_i$ is the $i$-th word. 
$$P(w_{1:n}) = \prod_{i=1}^n  P(w_i \mid w_{1:i-1})$$

But this doesn't take us too far, as calculating the probability of a word given its entire history (e.g., $P(\text{light} \mid \text{there is a crack in everything that 's how the})$) is not easy: language is creative, and any particular context might never have occurred before.  

The intuition of Markov models of language (ngram models) is that instead of computing the probability of the next word given its entire history we **approximate** it by considering just the last few words. 

**Markov assumption: The future is conditionally independent of the past given present**

![](img/bigram-ex.png)

<!-- <img src="img/bigram-ex.png" height="700" width="700"> -->


$$
P(\text{everything} \mid \text{a crack in}) \approx P(\text{everything}\mid\text{in})
$$


![](img/Markov_assumption.png)

<!-- <img src="img/Markov_assumption.png" height="550" width="550">  -->
    
**Markov assumption: The future is conditionally independent of the past given present**

- In the example above 

$$P(S_{3} \mid S_0, S_1, S_2) \approx P(S_{3} \mid S_2)$$

- Generalizing it to $t$ time steps

$$P(S_{t+1} \mid S_0, \dots, S_t) \approx P(S_{t+1} \mid S_t)$$
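
Applying this assumption to every factor of the chain rule gives a first-order (bigram) approximation of the probability of a whole sequence. As a worked form, writing $w_i$ for the $i$-th word:

$$P(w_{1:n}) = \prod_{i=1}^{n} P(w_i \mid w_{1:i-1}) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

For the example phrase, the approximation begins $P(\text{there}) \times P(\text{is} \mid \text{there}) \times P(\text{a} \mid \text{is}) \times P(\text{crack} \mid \text{a}) \times \dots$, so every factor conditions on a single word rather than on the full history.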
        

### Simplistic auto-complete

- Suppose we have typed "und" so far and we want to predict the next letter, i.e., the state we will be in at the next time step. 
- Imagine that you have access to the conditional probability distribution for the next letter given the current letter. 
- We sample the next letter from this distribution.

![](img/autocomplete_Markov.png)

<!-- <img src="img/autocomplete_Markov.png" height="1200" width="1200">  -->

### (ASIDE) Markov’s own application of his chains  (1913)


- Markov studied the sequence of 20,000 letters in A. S. Pushkin's poem _Eugeny Onegin_.
- He also studied the sequence of 100,000 letters in S. T. Aksakov's novel "The Childhood of Bagrov, the Grandson".

![](img/Markov_Pushkin.png)

<!-- <img src="img/Markov_Pushkin.png" height="800" width="800">  -->


### Let's look at the details of discrete-time Markov chains


This is an unrolled version or a single realization of a Markov chain. 

<img src="img/weather-unrolled.png" height="1000" width="1000"> 

One way to graphically represent the overall behaviour of a Markov chain is with the following representation which shows the states with their initial probabilities and transition probabilities between states. 

<img src="img/Markov_chain.png" height="400" width="400"> 

### Discrete Markov chain ingredients: State space

<img src="img/Markov_chain.png" height="300" width="300"> 


- We have discrete timesteps: $t = 0, t = 1, \dots$.
- **State space**: We have a finite set of possible states we can be in at time $t$
    - Represent the unique observations in the world. 
    - We can be in only one state at a given time. 
    - Here $S = \{HOT, COLD, WARM\}$.

### Discrete Markov chain ingredients: Initial probability distribution over states

<img src="img/Markov_chain.png" height="300" width="300"> 


- State space: $S = \{\text{HOT, COLD, WARM}\}$, 
- We could start in any state. The probability of starting with a particular state is given by an **initial discrete probability distribution over states**.        
    - Here, 
    $$\pi_0 = \begin{bmatrix} P(\text{HOT at time 0}) & P(\text{COLD at time 0}) & P(\text{WARM at time 0}) \end{bmatrix} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$$    
    

### Discrete Markov chain ingredients: Transition probability matrix

<img src="img/Markov_chain.png" height="300" width="300"> 

- State space: $S = \{\text{HOT, COLD, WARM}\}$, initial probability distribution: $\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$
- **Transition probability matrix** $T$, where each $a_{ij}$ represents the probability of moving from state $s_i$ to state $s_j$, such that $\sum_{j=1}^{n} a_{ij} = 1, \forall i$ 

$$ T = 
\begin{bmatrix}
P(\text{HOT} \mid \text{HOT}) & P(\text{COLD} \mid \text{HOT}) & P(\text{WARM} \mid \text{HOT})\\
P(\text{HOT} \mid \text{COLD}) & P(\text{COLD} \mid \text{COLD}) & P(\text{WARM} \mid \text{COLD})\\
P(\text{HOT} \mid \text{WARM}) & P(\text{COLD} \mid \text{WARM}) & P(\text{WARM} \mid \text{WARM})\\
\end{bmatrix}
=
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$$ 


- Note that each row sums to 1.0. 
- Each state has a probability of staying in the same state (or transitioning to itself).
- _Note that some people use the notation where the columns sum to one._
- You can think of the transition matrix as a data structure used to organize all the conditional probabilities concisely and efficiently.  
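
To make the "data structure" view concrete, here is a small sketch that stores the weather transition matrix as a NumPy array and looks up one conditional probability. The `states` list, the `index` dictionary, and the helper `transition_prob` are names introduced for this illustration only.

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]
index = {state: i for i, state in enumerate(states)}

# Row i holds P(next state | current state = states[i]).
T = np.array([
    [0.5, 0.2, 0.3],  # from HOT
    [0.2, 0.5, 0.3],  # from COLD
    [0.3, 0.1, 0.6],  # from WARM
])

# Every row is a probability distribution, so each row must sum to 1.
assert np.allclose(T.sum(axis=1), 1.0)


def transition_prob(current, nxt):
    """Look up P(nxt | current) in the transition matrix."""
    return T[index[current], index[nxt]]


print(transition_prob("HOT", "WARM"))  # the entry in row HOT, column WARM
```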

In our weather example, the state space, initial probability distribution, and transition probability matrix are as follows: 

$S = \{\text{HOT, COLD, WARM}\}$, $\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$, 
$T = 
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$

<img src="img/Markov_chain.png" height="550" width="550"> 



### Markov chain general definition 

- A set of $n$ states: $S = \{s_1, s_2, ..., s_n\}$
- A discrete initial probability distribution over states $\pi_0 = \begin{bmatrix} \pi_{s_1} & \pi_{s_2} & \dots & \pi_{s_n} \end{bmatrix}$

- Transition probability matrix $T$, where each $a_{ij}$ represents the probability of moving from state $s_i$ to state $s_j$, such that $\sum_{j=1}^{n} a_{ij} = 1, \forall i$ 


$$ T = 
\begin{bmatrix}
    a_{11}       & a_{12} & a_{13} & \dots & a_{1n} \\
    a_{21}       & a_{22} & a_{23} & \dots & a_{2n} \\
    \dots \\
    a_{n1}       & a_{n2} & a_{n3} & \dots & a_{nn}
\end{bmatrix}
$$


### Homogeneous Markov chains

- Transition probabilities tell you how your state probabilities are going to change over time. 
- Usually we assume a **homogeneous Markov chain**, where transition probabilities are the same for all time steps $t$. 
- In this class, we will assume homogeneous Markov chains.

<br><br><br><br>

## ❓❓ Question for you 

### Conditioning and marginalization (revision)

Many of these tasks are based on two rules of probability: 
- Conditioning
    - the process of calculating the probability of an event or variable given certain conditions
- Marginalization 
    - the process of summing (or integrating) over all possible values of a variable to obtain the probability of another variable

Imagine that you have a bag with 4 letters: A, A, A, and N, as shown below. $X_0$ is the first time step and $X_1, X_2$ are the next time steps. 

![](img/bananagrams.png)

- $X \in \{A, N\}$
- $P(X_0 = A) = \frac{3}{4}, P(X_0 = N) = \frac{1}{4}$
- $P(X_1 = A \lvert X_0 = A) = \frac{1}{3}, P(X_1 = N \lvert X_0 = A) = ?$
- $P(X_1 = A \lvert X_0 = N) = 1.0, P(X_1 = N \lvert X_0 = N) = ?$
- $P(X_2 = A)$ = ? 

<br><br><br><br>

## Markov chains tasks


### What can we do with Markov chains? 

- **Predict probabilities of sequences of states.**
- **Inference**: Compute probability of being in a particular state at time $t$.    
- **Stationary distribution**: Find the steady state after running the chain for a long time.
- Generation: generate sequences that follow the probabilities of the states. 
    - You will be doing this in the lab. 
- Decoding: Compute most likely sequences of states. 

### Predict probabilities of sequences of states 

- Given the Markov model: $S = \{\text{HOT, COLD, WARM}\}$, 
$\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$, 
$T = 
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$

- Compute the probability of the sequence: HOT, HOT, WARM, COLD
    - Markov assumption: $P(S_{t+1}\mid S_{0}, S_1, \dots, S_t) \approx P(S_{t+1} \mid S_t)$

<img src="img/Markov_chain.png" height="400" width="400"> 


\begin{equation}
\begin{split}
P(\textrm{HOT, HOT, WARM, COLD}) =& P(\text{HOT}) \times P(\text{HOT} \mid \text{HOT})\\ 
                                  & \times P(\text{WARM} \mid \text{HOT})\\
                                  & \times P(\text{COLD}\mid \text{WARM})\\
                                 =& 0.5  \times 0.5 \times 0.3 \times 0.1 = 0.0075\\
\end{split}
\end{equation}
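
The same calculation can be written as a short function. This is a sketch under the Markov assumption stated above; the helper name `sequence_probability` and the `index` lookup are introduced here for illustration.

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]
index = {state: i for i, state in enumerate(states)}
pi_0 = np.array([0.5, 0.3, 0.2])
T = np.array([[0.5, 0.2, 0.3], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]])


def sequence_probability(sequence, pi_0, T):
    """P(s_0, ..., s_k) = pi_0[s_0] times the product of T[s_t, s_{t+1}]."""
    ids = [index[s] for s in sequence]
    prob = pi_0[ids[0]]  # probability of the starting state
    for current, nxt in zip(ids, ids[1:]):
        prob *= T[current, nxt]  # one transition per consecutive pair
    return prob


print(sequence_probability(["HOT", "HOT", "WARM", "COLD"], pi_0, T))
```

The loop multiplies one transition probability per consecutive pair of states, which is exactly the product written out in the equation above.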

### Your turn (Activity: 3 minutes)

- Predict probabilities of the following sequences of states on your own. 
    1. COLD, COLD, WARM
    2. HOT, COLD, HOT, COLD
    
Hint: If we want to predict the future, all that matters is the current state.

$S = \{\text{HOT, COLD, WARM}\}$, 
$\pi_0 = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$, 
$T = 
\begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix}
$

<img src="img/Markov_chain.png" height="300" width="300"> 


<br><br>

### Inference

- **Compute probability of being in a particular state at time $t$.**
- Example: Assuming that the time starts at 0, what is the probability of HOT at time 1?

$$\begin{equation}
\begin{split}
P(\textrm{HOT at time 1}) =& P(\textrm{HOT at time 0}) \times P(\textrm{HOT} \mid \textrm{HOT})\\ 
                                  & + P(\textrm{COLD at time 0}) \times P(\textrm{HOT} \mid \textrm{COLD})\\
                                  &  + P(\textrm{WARM at time 0}) \times P(\textrm{HOT} \mid \textrm{WARM})\\
                                 =& 0.5 \times 0.5 + 0.3 \times 0.2 + 0.2\times 0.3 = 0.37\\
\end{split}
\end{equation}$$
    
<img src="img/Markov_chain.png" height="400" width="400"> 

### Inference: What is the probability of HOT at time 1?
- You can conveniently calculate it as the dot product between $\pi_0$ and the first column of the transition matrix!

$$\pi_0 = \begin{bmatrix} P(\text{HOT at time 0}) & P(\text{COLD at time 0}) & P(\text{WARM at time 0}) \end{bmatrix} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$$



$$ T = 
\begin{bmatrix}
P(\text{HOT} \mid \text{HOT}) & P(\text{COLD} \mid \text{HOT}) & P(\text{WARM} \mid \text{HOT})\\
P(\text{HOT} \mid \text{COLD}) & P(\text{COLD} \mid \text{COLD}) & P(\text{WARM} \mid \text{COLD})\\
P(\text{HOT} \mid \text{WARM}) & P(\text{COLD} \mid \text{WARM}) & P(\text{WARM} \mid \text{WARM})\\
\end{bmatrix}
= \begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix} $$
$$\begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}  \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix} = 0.37$$
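
As a quick check, the explicit sum over starting states and the dot product can be compared directly. The variable names `p_hot_explicit` and `p_hot_dot` are chosen for this sketch.

```python
import numpy as np

states = ["HOT", "COLD", "WARM"]
pi_0 = np.array([0.5, 0.3, 0.2])
T = np.array([[0.5, 0.2, 0.3], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]])

hot = states.index("HOT")

# Marginalization: sum over every possible state at time 0.
p_hot_explicit = sum(pi_0[i] * T[i, hot] for i in range(len(states)))

# The same sum written as a dot product with the HOT column of T.
p_hot_dot = pi_0 @ T[:, hot]

print(p_hot_explicit, p_hot_dot)
```

Both expressions compute the same three products and add them up, so they should agree up to floating-point rounding.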


<img src="img/Markov_chain.png" height="400" width="400"> 


### Inference: What is the probability of HOT, COLD, WARM at time 1?
- You can get probabilities of all states HOT, COLD, WARM at time 1 by multiplying $\pi_0$ by the transition matrix. 

$$\pi_1 = \pi_0T$$

$$\pi_0 = \begin{bmatrix} P(\text{HOT at time 0}) & P(\text{COLD at time 0}) & P(\text{WARM at time 0}) \end{bmatrix} = \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}$$

$$ T = 
\begin{bmatrix}
P(\text{HOT} \mid \text{HOT}) & P(\text{COLD} \mid \text{HOT}) & P(\text{WARM} \mid \text{HOT})\\
P(\text{HOT} \mid \text{COLD}) & P(\text{COLD} \mid \text{COLD}) & P(\text{WARM} \mid \text{COLD})\\
P(\text{HOT} \mid \text{WARM}) & P(\text{COLD} \mid \text{WARM}) & P(\text{WARM} \mid \text{WARM})\\
\end{bmatrix}
= \begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix} $$

$$\pi_1 = \begin{bmatrix} P(\text{HOT at time 1}) & P(\text{COLD at time 1}) & P(\text{WARM at time 1}) \end{bmatrix} =  \begin{bmatrix} 0.5 & 0.3 & 0.2 \end{bmatrix}\begin{bmatrix} 0.5 & 0.2 & 0.3\\ 0.2 & 0.5 & 0.3\\ 0.3 & 0.1 & 0.6\\ \end{bmatrix} = \begin{bmatrix}0.37 & 0.27 & 0.36\end{bmatrix}$$


<img src="img/Markov_chain.png" height="300" width="300"> 



### Inference: What is the probability of HOT, COLD, WARM at time 2?
- Similarly, you can get probabilities of all states HOT, COLD, WARM at time 2 by multiplying $\pi_1$ by the transition matrix. 
    $$\pi_2 = \pi_1T$$

$$\pi_1 = \begin{bmatrix} P(\text{HOT at time 1}) & P(\text{COLD at time 1}) & P(\text{WARM at time 1}) \end{bmatrix} =  \begin{bmatrix}0.37 & 0.27 & 0.36\end{bmatrix}$$

$$ T = 
\begin{bmatrix}
P(\text{HOT} \mid \text{HOT}) & P(\text{COLD} \mid \text{HOT}) & P(\text{WARM} \mid \text{HOT})\\
P(\text{HOT} \mid \text{COLD}) & P(\text{COLD} \mid \text{COLD}) & P(\text{WARM} \mid \text{COLD})\\
P(\text{HOT} \mid \text{WARM}) & P(\text{COLD} \mid \text{WARM}) & P(\text{WARM} \mid \text{WARM})\\
\end{bmatrix}
= \begin{bmatrix}
0.5 & 0.2 & 0.3\\
0.2 & 0.5 & 0.3\\
0.3 & 0.1 & 0.6\\    
\end{bmatrix} $$

$$\pi_2 = \begin{bmatrix} P(\text{HOT at time 2}) & P(\text{COLD at time 2}) & P(\text{WARM at time 2}) \end{bmatrix} = \pi_1T = \begin{bmatrix}0.347 & 0.245 & 0.408\end{bmatrix}$$


<img src="img/Markov_chain.png" height="300" width="300"> 



### Inference: probability of being in a particular state at time $t$

- Calculate 

$$\pi_t = \pi_{t-1} \times \text{transition probability matrix } T$$  

- Applying the matrix multiplication to the current state probabilities does an **update** to the state probabilities!

Let's try it out with numpy. 

In [3]:
pi_0 = np.array([0.5, 0.3, 0.2])  # initial state probability dist
T = np.array([[0.5, 0.2, 0.3], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]])  # transition matrix

print("Initial probability distribution over states: ", pi_0)
print("The transition probability matrix: \n", T)

Initial probability distribution over states:  [0.5 0.3 0.2]
The transition probability matrix: 
 [[0.5 0.2 0.3]
 [0.2 0.5 0.3]
 [0.3 0.1 0.6]]


In [4]:
pi_0 @ np.linalg.matrix_power(T, 18)

array([0.34693878, 0.2244898 , 0.42857143])

In [5]:
imat = np.eye(3)

In [6]:
pi_0

array([0.5, 0.3, 0.2])

In [7]:
np.linalg.matrix_power(T, 18)

array([[0.34693877, 0.2244898 , 0.42857143],
       [0.34693877, 0.2244898 , 0.42857143],
       [0.34693878, 0.22448979, 0.42857143]])

In [8]:
0.5 * 0.34 + 0.3 * 0.34 + 0.2 * 0.34

0.34

In [9]:
np.linalg.matrix_power(T, 19)

array([[0.34693878, 0.2244898 , 0.42857143],
       [0.34693878, 0.2244898 , 0.42857143],
       [0.34693878, 0.2244898 , 0.42857143]])

In [10]:
def print_pi_over_time(pi_0, T, time_step=10):
    current = pi_0
    print("Initial state probability distribution (pi_0)", pi_0)
    for i in range(time_step):
        current = current @ T
        print(
            "State probabilities at time step %d (pi_%d = pi_%d@T) = %s"
            % (i + 1, i + 1, i, current)
        )

In [11]:
import panel as pn
from panel import widgets
from panel.interact import interact
import matplotlib

pn.extension()

def f(time_steps):
    return print_pi_over_time(pi_0, T, time_steps)

interact(f, time_steps=widgets.IntSlider(start=0, end=30, step=2, value=0))

Initial state probability distribution (pi_0) [0.5 0.3 0.2]


You can also get state probabilities at time $t$ by multiplying `pi_0` by the $t^{th}$ power of the transition matrix. 

In [12]:
def get_pi_at_time_t(pi_0, T, time_step=10):
    print(
        "State probabilities at time step %d (pi_%d = pi_0@T^%d) = %s"
        % (time_step, time_step, time_step, pi_0 @ np.linalg.matrix_power(T, time_step))
    )

In [13]:
def f(time_steps):
    return get_pi_at_time_t(pi_0, T, time_steps)

interact(f, time_steps=widgets.IntSlider(start=0, end=30, step=2, value=0))

# interactive(
#     lambda time_steps=1: get_pi_at_time_t(pi_0, T, time_steps), time_steps=(1, 30, 1)
# )

State probabilities at time step 0 (pi_0 = pi_0@T^0) = [0.5 0.3 0.2]


Any interesting observations? 

<br><br>

### Break (~5 mins)

![](img/eva-coffee.png)

<br><br>

## ❓❓ Questions for you

iClicker cloud join link: https://join.iclicker.com/4QVT4

### Exercise 1.1 Select all of the following statements which are **True** (iClicker)

- (A) According to the Markov assumption the probability of being at a future state $s_{t+1}$ is independent of the past states $s_1$ to $s_{t-1}$. 
- (B) In the notation we are using, each row in the transition matrix of a Markov chain should sum to one. 
- (C) In a Markov chain, the probabilities associated with self loops (staying in the same state) of all states should sum to one. 
- (D) Suppose you are running a Markov chain with $\pi_0 = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix}$.  $\pi_t = \begin{bmatrix} 0.40 & 0.60 \end{bmatrix}$ at time step $t$ means that at time $t$, the probability of being in $s_1$ is $0.40$ and being in state $s_2$ is $0.60$. 
- (E) Given $\pi_0$ as initial state probability distribution, and $T$ as transition matrix, we can calculate the probability distribution over states at time step $k$ by multiplying $\pi_0$ and $T^k$.

<br><br><br><br>

## Markov chains tasks: Stationary distribution

After time step 18 or so, the state probabilities stop changing!! 

### Stationary distribution

- A stationary distribution of a Markov chain is a probability distribution over states that remains unchanged in the Markov chain as time progresses.

- A probability distribution $\pi$ on states $S$ is stationary if the following holds for the transition matrix $T$.    


$$\pi T=\pi$$ 

Why is this useful? This tells us about the behaviour of a Markov chain in the long run.  


### Stationary distribution: SkyTrain example scenario

Suppose TransLink launches a Downtown-to-UBC SkyTrain line. In the first month of operation, it was found that 20% of the commuters going to UBC started using it and 80% of the commuters were still using other modes of transportation. The following transition matrix was determined from the records of other transit systems. 


$$S = \{\text{SkyTrain, Other}\}, \pi_0 = \begin{bmatrix} 0.20 & 0.80 \end{bmatrix}, 
T = \begin{bmatrix}
0.9 & 0.1\\
0.4 & 0.6\\
\end{bmatrix}
$$

| From / To     | SkyTrain  | Other |
| ------------- |:---------:|:-----:|
| SkyTrain      | 0.9       | 0.1   |
| Other         | 0.4       | 0.6   |


### We might want to answer the following questions

Assuming that each time step is a month, 

1. What percentage of the commuters will be using the SkyTrain after two months? 
2. What about after three months?
3. What's the percentage of the commuters using the SkyTrain after the service has been in place for a long time? 

### What percentage of the commuters will be using the SkyTrain after two months and after three months?

- State probability distribution after **one** month (initial state probability distribution)
    - $\pi_0 = \begin{bmatrix} 0.20 & 0.80 \end{bmatrix}$
- State probability distribution after **two** months:
    - $\pi_1 = \pi_0 T = \begin{bmatrix} 0.20 & 0.80 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1\\ 0.4 & 0.6\\ \end{bmatrix} = \begin{bmatrix} 0.5 & 0.5 \end{bmatrix}$  
- State probability distribution after **three** months:
    - $\pi_2 = \pi_1 T =  \begin{bmatrix} 0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1\\ 0.4 & 0.6\\ \end{bmatrix} = \begin{bmatrix} 0.65 & 0.35 \end{bmatrix}$

- Big improvement at each time step!! How long does this continue?

In [14]:
def stationary_dist(pi_0, T, time_step=20):
    print("pi_0 =", pi_0)
    pi_time_step = pi_0 @ np.linalg.matrix_power(T, time_step)
    print("pi_%d = %s" % (time_step, pi_time_step))
    if not np.allclose(pi_time_step @ T, pi_time_step):
        print("Not steady state yet: pi_%d@T != pi_%d" % (time_step, time_step))
    else:
        print("Steady state: pi_%d@T == pi_%d" % (time_step, time_step))

In [15]:
pi_0 = np.array([0.2, 0.8])  # initial state probability dist
T = np.array([[0.9, 0.1], [0.4, 0.6]])  # transition matrix

In [16]:
pi_0 @ np.linalg.matrix_power(T, 31)

array([0.8, 0.2])

In [17]:
def f(time_steps):
    return stationary_dist(pi_0, T, time_steps)

interact(f, time_steps=widgets.IntSlider(start=0, end=40, step=2, value=0))

# interactive(
#     lambda time_steps=1: stationary_dist(pi_0, T, time_steps), time_steps=(1, 40, 1)
# )

pi_0 = [0.2 0.8]
pi_0 = [0.2 0.8]
Not steady state yet: pi_0@T != pi_0


### Stationary distribution

- Seems like after the $18^{th}$ time step, the state probabilities stay the same (within a tolerance). 
- Seems like we have reached a steady state at $\pi = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix}$ such that

$$\begin{bmatrix} 0.80 & 0.20 \end{bmatrix} \begin{bmatrix} 0.9 & 0.1\\ 0.4 & 0.6\\ \end{bmatrix} = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix} $$

- So the distribution $\pi = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix}$ is a stationary distribution in this case because we have $\pi T = \pi$. 
- What's the percentage of the commuters using the SkyTrain after the service has been in place for a long time? 
    - In the long run we can expect 80% of the commuters using the SkyTrain.
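
For a two-state chain the stationary distribution can also be found by hand. As a worked check, write $\pi = \begin{bmatrix} p & 1-p \end{bmatrix}$, where $p$ is a symbol introduced here for the long-run SkyTrain share. The first component of $\pi T = \pi$ gives

$$0.9\,p + 0.4\,(1 - p) = p \implies 0.4 = 0.5\,p \implies p = 0.8$$

so $\pi = \begin{bmatrix} 0.80 & 0.20 \end{bmatrix}$, which agrees with the value the repeated multiplication settled on.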

### Conditions for stationary distribution

- Stationary distribution looks like a desirable property. 
- Does a stationary distribution $\pi$ exist and is it unique?
- Under mild assumptions, a Markov chain has a stationary distribution. 

### Conditions for stationary distribution

- Sufficient condition for existence/uniqueness is positive transitions. 
    * $P(s_t \mid s_{t-1}) > 0$
- But very often at least some of the transition probabilities are zero.

### Conditions for stationary distribution
- Weaker sufficient conditions for existence/uniqueness
    * _Irreducible_ 
        - A finite Markov chain is irreducible if it is possible to get to any state from any state.
        - In other words, a finite Markov chain is irreducible if and only if its transition graph is strongly connected.    
    * _Aperiodic_        
        - Loosely, a Markov chain is aperiodic if it does not keep repeating the same sequence.         

### (Optional) Periodicity formal definition 

A state in a Markov chain is periodic if the chain can return to the state only at multiples of some integer larger than 1. Thus, starting in state 'i', the chain can return to 'i' only at multiples of the period 'k', and k is the largest such integer. State 'i' is aperiodic if k = 1 and periodic if k > 1.
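
The definition can be explored numerically. The sketch below approximates the period of a state as the greatest common divisor of the step counts at which a return is possible, looking only up to a finite horizon. The function name `state_period`, the `max_steps` cutoff, and the two-state `flip_T` chain are assumptions made for illustration.

```python
from functools import reduce
from math import gcd

import numpy as np


def state_period(T, state, max_steps=50):
    """gcd of the step counts n <= max_steps at which a return to `state` is possible."""
    reachable = (np.asarray(T) > 0).astype(int)  # 1 where a one-step move is possible
    paths = np.eye(len(reachable), dtype=int)
    return_times = []
    for n in range(1, max_steps + 1):
        # After this update, paths[i, j] == 1 iff j can be reached from i in exactly n steps.
        paths = ((paths @ reachable) > 0).astype(int)
        if paths[state, state]:
            return_times.append(n)
    # Returns 0 if the state is never revisited within the horizon.
    return reduce(gcd, return_times, 0)


weather_T = np.array([[0.5, 0.2, 0.3], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]])
flip_T = np.array([[0.0, 1.0], [1.0, 0.0]])  # hypothetical chain that alternates forever

print(state_period(weather_T, 0), state_period(flip_T, 0))
```

In the weather chain every state has a self loop, so a return is possible after one step and the period is 1. The alternating chain can only return after an even number of steps.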

### Irreducibility and aperiodicity

- Which chains are irreducible? Which ones are aperiodic?
    * _Irreducible_ (doesn’t get stuck in part of the graph)
    * _Aperiodic_ (doesn’t keep repeating same sequence).    
![](img/Markov_irreducibility_aperiodicity.png)    
<!-- <img src="img/Markov_irreducibility_aperiodicity.png" height="900" width="900">  -->

### Some ways to examine irreducibility

- Check whether the graph is strongly connected or not  
    - Check out [Kosaraju's algorithm](https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm). 
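
For small chains, a plain reachability search from every state is enough to test strong connectivity. The sketch below is this simpler check, not Kosaraju's algorithm; the function names and the `absorbing_T` example are introduced here for illustration.

```python
import numpy as np


def reachable_from(T, start):
    """Set of states reachable from `start` along transitions with positive probability."""
    T = np.asarray(T)
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for nxt in np.flatnonzero(T[current] > 0):
            if int(nxt) not in seen:
                seen.add(int(nxt))
                frontier.append(int(nxt))
    return seen


def is_irreducible(T):
    """True if every state can reach every other state."""
    n = len(T)
    return all(len(reachable_from(T, s)) == n for s in range(n))


weather_T = np.array([[0.5, 0.2, 0.3], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]])
absorbing_T = np.array([[1.0, 0.0], [0.5, 0.5]])  # hypothetical: state 0 is never left

print(is_irreducible(weather_T), is_irreducible(absorbing_T))
```

This runs one search per state, which is fine for the small chains in this lecture; Kosaraju's algorithm achieves the same result with only two passes over the graph.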

### How to estimate the stationary distribution?

- Power iteration method
    - Multiply $\pi_0$ by powers of the transition matrix $T$ until the product looks stable. 
- Taking the eigenvalue decomposition of the transpose of the transition matrix.
$$\pi T=\pi$$
- Through Monte Carlo simulation.
- In some cases (not always) simply counting the occurrences (lab). 

There are other ways too! 
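
Two of these approaches, power iteration and Monte Carlo simulation, can be sketched as follows for the SkyTrain chain. The function names, the tolerance `tol`, the number of simulated steps, and the random seed are all choices made for this illustration.

```python
import numpy as np


def power_iteration(pi_0, T, tol=1e-10, max_steps=10_000):
    """Repeatedly apply pi <- pi @ T until the update is smaller than `tol`."""
    pi = np.asarray(pi_0, dtype=float)
    for _ in range(max_steps):
        nxt = pi @ T
        if np.max(np.abs(nxt - pi)) < tol:
            return nxt
        pi = nxt
    return pi  # may not have converged, e.g. for a periodic chain


def monte_carlo(T, n_steps=20_000, start=0, seed=0):
    """Simulate one long run and return the fraction of time spent in each state."""
    rng = np.random.default_rng(seed)
    n = len(T)
    counts = np.zeros(n)
    state = start
    for _ in range(n_steps):
        state = rng.choice(n, p=T[state])  # sample the next state from the current row
        counts[state] += 1
    return counts / n_steps


pi_0 = np.array([0.2, 0.8])
T = np.array([[0.9, 0.1], [0.4, 0.6]])

print(power_iteration(pi_0, T))  # close to [0.8, 0.2]
print(monte_carlo(T))            # noisy estimate of the same distribution
```

Power iteration is deterministic and converges quickly here, while the Monte Carlo estimate is noisy and only approaches the same values as the number of simulated steps grows.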

### (Optional) Eigendecomposition to get stationary distribution 

- Note that $\pi T = \pi$ looks very similar to the eigenvalue equation $Av = \lambda v$ for eigenvalues and eigenvectors, with $\lambda = 1$.
- If you transpose both sides of $\pi T = \pi$: 

$$(\pi T)^T = \pi^T \implies T^T \pi^T = \pi^T$$ 

In other words, if we transpose the transition matrix and take its eigendecomposition, the eigenvector with eigenvalue 1, rescaled so that its entries sum to 1, is the stationary distribution.

If there are multiple eigenvectors with eigenvalue 1.0, then the stationary distribution is not unique. 
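
A minimal sketch of this idea with `np.linalg.eig` is shown below. It assumes that the eigenvalue 1 is not repeated, so the stationary distribution is unique; the function name `stationary_from_eig` is introduced here for illustration.

```python
import numpy as np


def stationary_from_eig(T):
    """Eigenvector of T transposed for the eigenvalue closest to 1, rescaled to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(np.asarray(T).T)
    closest = np.argmin(np.abs(eigvals - 1.0))
    # Drop any tiny imaginary parts introduced by numerical error.
    vec = np.real(eigvecs[:, closest])
    # Eigenvectors are only defined up to scale (and sign), so normalize.
    return vec / vec.sum()


T = np.array([[0.5, 0.2, 0.3], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]])
pi = stationary_from_eig(T)
print(pi)
print(np.allclose(pi @ T, pi))  # checks that pi T = pi holds numerically
```

Dividing by the sum matters because `np.linalg.eig` returns eigenvectors with unit Euclidean norm, which are not probability distributions as returned.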

<br><br><br><br>

## Final thoughts, summary, reflection

We define a discrete Markov chain as 
* a finite set of states 
* an initial probability distribution over states
* a transition probability matrix

We can do a number of things with Markov chains
- Calculate the probability of a sequence.  
- Compute the probability of being in a particular state at time $t$. 
- Calculate the stationary distribution, which is a probability distribution that remains unchanged in the Markov chain as time progresses. 
- Generate a sequence of states. 

- Learning Markov chains is just counting (next lecture). 
- Example applications of Markov chains in NLP (next lecture)
    - Language modeling
    - PageRank

<br><br><br><br>

## ❓❓ Questions for you

iClicker cloud join link: https://join.iclicker.com/4QVT4

### Exercise 1.2: Select all of the following statements which are **True** (iClicker)

- (A) To have a stationary distribution, we must satisfy  $\pi_0T=\pi_0$, where $\pi_0$ is the initial state probability distribution at time 0.  
- (B) If a state has only one possible transition, the transition probability for that transition would be 1.0.
- (C) If each row in the transition matrix of a Markov chain has only one possible transition, the chain would be deterministic.
- (D) If we have a self loop transition with probability 1.0 for state A in a Markov chain and we happen to be at state A, the chain is going to get stuck in that state forever. 

<br><br>

### Exercise 1.3: Questions for class discussion  

1. Let's say P(sunny today | sunny yesterday) = 0.8 and P(sunny today | no sun yesterday) = 0.4. What is the transition matrix? State your assumptions as needed.

2. Consider the Markov chain below: 

![](img/Markov_ex.png)
<!-- <img src="img/Markov_ex.png" height="500" width="500">  -->

Does a stationary distribution exist for this chain? Why or why not? 

<br><br><br><br>

### Resources and fun things with Markov chains 

- [Create and visualize Markov chains](https://www.stat.auckland.ac.nz/~wild/MarkovChains/)
- [Markov chains "explained visually"](http://setosa.io/ev/markov-chains)
- [Snakes and ladders](http://datagenetics.com/blog/november12011/index.html)
- [Candyland](http://www.datagenetics.com/blog/december12011/index.html)
- [Yahtzee](http://www.datagenetics.com/blog/january42012)
- [Chess pieces returning home and K-pop vs. ska](https://www.youtube.com/watch?v=63HHmjlh794)
- [The Life and Work of A. A. Markov](http://www.meyn.ece.ufl.edu/archive/spm_files/Markov-Work-and-life.pdf)
