![](img/575_banner.png)

# Lecture 2: Applications of Markov Models and Text Preprocessing

UBC Master of Data Science program, 2022-23

Instructor: Varada Kolhatkar

## Lecture plan, imports, LO

### Lecture plan 

- Stationary distribution recap and leftover T/F (~15 mins)
- Learning Markov models (~5 mins)
- Language models (~20 mins)
- Break (~5 mins)
- Q&A and T/F (~5 mins)
- PageRank (~5 mins)
- Preprocessing (~15 mins)
- Final comments, summary, reflection (~5 mins)
- Questions for class discussion (~5 mins)

### Imports 

In [1]:
import os
import re
import string
import sys
import time
from collections import Counter, defaultdict

import IPython
import nltk
import numpy as np
import numpy.random as npr
import pandas as pd
from IPython.display import HTML
from ipywidgets import interactive
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

###  Learning outcomes <a name="lo"></a>

By the end of this class you will be able to 
- Explain learning of Markov models. 
- Given sequence data, estimate the transition matrix from the data. 
- Build a Markov model of language. 
- Generalize Markov model of language to consider more history. 
- Explain states, state space, and transition matrix in Markov models of language. 
- Explain and calculate stationary distribution over states in Markov models of language. 
- Generate text using Markov model of language. 
- Explain the intuition of the PageRank algorithm. 
- Carry out basic text preprocessing using `nltk` and `spaCy`

## Recap

- Stationary distribution 

<br><br><br><br>

## Learning Markov models

- Similar to naive Bayes, learning Markov models is just counting.
- Given $n$ samples ($n$ sequences), the MLE for a homogeneous Markov model is:

    * Initial: $P(s_i) = \frac{\text{number of times we start in } s_i}{n} $

    * Transition: $P(s_j \mid s_{i}) = \frac{\text{number of times we moved from } s_{i} \text{ to } s_j}{\text{number of times we moved from } s_{i} \text{ to } anything}$ 

- Suppose you want to learn a Markov chain for words in a corpus of $n$ documents.
- Set of states is the set of all unique words in the corpus.
- Calculate the initial probability distribution $\pi_0$
    - For all states (unique words) $w_i$, compute $\frac{\text{number of times a document starts with } w_i}{n} $ 
    
- Calculate transition probabilities for all state combinations $w_i$ and $w_j$
    - $\frac{\text{number of times } w_i \text{ precedes } w_j}{\text{number of times } w_i \text{ precedes anything}}$ 
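The counting recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecture's implementation: the function name `estimate_markov_chain` and the activity sequences are hypothetical, and every sequence is assumed to be non-empty.

```python
from collections import Counter, defaultdict


def estimate_markov_chain(sequences):
    """Estimate initial and transition probabilities by counting (MLE)."""
    n = len(sequences)
    # Initial distribution: how often each state starts a sequence
    start_counts = Counter(seq[0] for seq in sequences)
    transition_counts = defaultdict(Counter)
    for seq in sequences:
        # zip pairs each state with the state that follows it
        for current, following in zip(seq, seq[1:]):
            transition_counts[current][following] += 1

    initial = {state: count / n for state, count in start_counts.items()}
    transitions = {}
    for current, followers in transition_counts.items():
        total = sum(followers.values())  # times we moved from `current` to anything
        transitions[current] = {s: c / total for s, c in followers.items()}
    return initial, transitions


# Hypothetical toy data: three short activity sequences
sequences = [
    ["sleep", "eat", "study", "eat"],
    ["sleep", "study", "eat", "sleep"],
    ["eat", "study", "study", "sleep"],
]
initial, transitions = estimate_markov_chain(sequences)
print(initial)
print(transitions["study"])
```

States that never start a sequence, or transitions that never occur, simply do not appear in the dictionaries, which corresponds to an estimated probability of zero.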
     

### Example with daily activities as states

![](img/activity-seqs.png)

- What's $\pi_0$(😴)?
- What's $\pi_0$(🍎)?
- What is P(🍎|📚)? 
    - $\frac{\text{number of times 📚 precedes 🍎}}{\text{number of times 📚 precedes anything}}$ 

### Markov model of language toy example 
- Work through this example on your own.
- Consider this tiny corpus. 

In [2]:
toy_corpus = "a rose is a rose is a rose a rose."
toy_corpus_tokens = nltk.word_tokenize(toy_corpus.lower())
circ_corpus = toy_corpus_tokens + toy_corpus_tokens[:1]
print(circ_corpus)

['a', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'a', 'rose', '.', 'a']


- What are the states in this model?
    - Unique words in the corpus: {., a, is, rose}

> We make it circular so that we have some non-zero transition from the last word in the sequence. 

### $\pi_0$ of the toy corpus
- Calculate the initial probability distribution $\pi_0$.
    - For all states (unique words) $w_i$, compute $\frac{\text{number of times a document starts with } w_i}{n} $   
- Here we only have one sentence which starts with word "a". So $\pi_0 = \begin{bmatrix} 0.0 & 1.0 & 0.0 & 0.0\end{bmatrix}$

> In NLP, _document_ is a general term used for sentences or small snippets of text. For example, a text message could be referred to as a document.    

### Transition matrix of the toy corpus
- Calculate transition probabilities for all state combinations $w_i$ and $w_j$
    - $\frac{\text{number of times } w_i \text{ precedes } w_j}{\text{number of times } w_i \text{ precedes anything}}$ 
    
- Let's calculate the transition matrix.  
    - Calculate transition probabilities for all state combinations. 

### How often does a word $w_i$ occur before $w_j$?

- Code that you used in [512 lab1](https://github.ubc.ca/MDS-2021-22/DSCI_512_alg-data-struct_students/blob/master/solutions/lab1/lab1.ipynb). 

In [4]:
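n = 1  # number of previous words used as history
frequencies = defaultdict(Counter)
for i in range(len(circ_corpus) - n):
    # circ_corpus[i] is the current word; circ_corpus[i + n] is the word that follows it
    frequencies[circ_corpus[i]][circ_corpus[i + n]] += 1

# Rows are current words, columns are next words
freq_df = pd.DataFrame(frequencies).transpose()
freq_df = freq_df.fillna(0)
freq_df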

Unnamed: 0,rose,is,a,.
a,4.0,0.0,0.0,0.0
rose,0.0,2.0,1.0,1.0
is,0.0,0.0,2.0,0.0
.,0.0,0.0,1.0,0.0


In [5]:
frequencies

defaultdict(collections.Counter,
            {'a': Counter({'rose': 4}),
             'rose': Counter({'is': 2, 'a': 1, '.': 1}),
             'is': Counter({'a': 2}),
             '.': Counter({'a': 1})})

### How often does a word $w_i$ occur before anything?

- Let's normalize the counts into probabilities. 
    - Divide each element in a row by the sum of the elements in that row. 

In [8]:
trans_df = freq_df.div(freq_df.sum(axis=1), axis=0)
trans_df

Unnamed: 0,rose,is,a,.
a,1.0,0.0,0.0,0.0
rose,0.0,0.5,0.25,0.25
is,0.0,0.0,1.0,0.0
.,0.0,0.0,1.0,0.0


This is our transition matrix for our tiny corpus! 

<br><br><br><br>

## Markov models of language 

### What is a language model? 

A model that computes the probability of a sequence of words (or characters) or the probability of an upcoming word (or character) is called a **language model**.

![](img/voice-assistant-ex.png)
<!-- <img src="img/voice-assistant-ex.png" height="1400" width="1400"> -->


- Compute the probability of a sentence or a sequence of words.
    - $P(w_1, w_2,\dots,w_t)$
    - P(I have read this book) > P(eye have red this book)

- A related task: What's the probability of an upcoming word? 
    - $P(w_t|w_1,w_2,\dots,w_{t-1})$ 
    - P(book | read this) > P(book | red this)



### Language modeling: Why should we care?

Language modeling is a powerful idea in NLP that helps in many tasks.
- Machine translation 
    * P(In the age of data algorithms have the answer) > P(the age data of in algorithms answer the have)
- Spelling correction
    * My office is a 20  <span style="color:red">minuet</span> bike ride from my home.  
        * P(20 <span style="color:blue">minute</span> bike ride from my home) > P(20 <span style="color:red">minuet</span> bike ride from my home)
- Speech recognition 
    * P(<span style="color:blue">I read</span> a book) > P(<span style="color:red">Eye red</span> a book)

### The simplest model of language is a Markov model of language! 

<br><br><br><br>

### A naive way to calculate probability of a sentence

- Calculate probability of a sequence by applying the chain rule.  
- Example: Suppose we want to calculate the probability of the following sequence of words: 

$$
\begin{equation}
\begin{split}
P(\text{there is a crack in everything that 's how the light gets in}) \\
=& P(\text{there}) \times P(\text{is}\mid \text{there})\\ 
                                              & \times P(\text{a} \mid \text{there is}) \times P(\text{crack}\mid \text{there is a})\\
                                              & \times P(\text{in}\mid \text{there is a crack})\\
                                              & \times P(\text{everything} \mid \text{there is a crack in}) \\
                                              & \times P(\text{that} \mid \text{there is a crack in everything}) \\                                              
                                              & \dots 
\end{split}
\end{equation}
$$

$$P(\text{that} \mid \text{there is a crack in everything}) = \frac{Count(\text{there is a crack in everything that})}{Count(\text{there is a crack in everything})} $$

### Problem with the naive approach 

$$P(\text{that} \mid \text{there is a crack in everything}) = \frac{Count(\text{there is a crack in everything that})}{Count(\text{there is a crack in everything})} $$

- How often would the exact same long sequence of words occur in text? For example, how often is the sequence "there is a crack in everything" likely to occur in your data? 
- The counts will be tiny and the model will be very sparse and specific. 
- <span style="color:red">BAD IDEA!!</span> 

## A Markov model of language

**Markov assumption: The future is conditionally independent of the past given the present**

![](img/bigram-ex.png)
<!-- <center> -->
<!-- <img src="img/bigram-ex.png" height="500" width="500"> -->
<!-- </center> -->

$$
P(\text{everything} \mid \text{a crack in}) \approx P(\text{everything}\mid\text{in})
$$

- How do we estimate the probabilities? 
$$P(\text{everything} \mid\text{in}) = \frac{Count(\text{in everything})}{Count(\text{in \{any word\}})}$$

- The model would be more generalizable now. 
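Under the Markov assumption, the probability of a word sequence becomes a product of bigram probabilities, each estimated from counts as above. The following is a minimal sketch on a single hypothetical line of text: the helper names `bigram_prob` and `sentence_log_prob` are made up for illustration, the probability of the first word is ignored, and logs are summed to avoid multiplying many small numbers.

```python
import math
from collections import Counter, defaultdict

# Hypothetical one-line corpus, split on whitespace for simplicity
corpus = "there is a crack in everything that is how the light gets in".split()

bigram_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    bigram_counts[current][following] += 1


def bigram_prob(following, current):
    """P(following | current) = Count(current following) / Count(current {any word})."""
    total = sum(bigram_counts[current].values())
    return bigram_counts[current][following] / total if total else 0.0


def sentence_log_prob(words):
    """Sum of log P(w_t | w_{t-1}) over the sequence; P(w_1) is ignored here."""
    log_prob = 0.0
    for current, following in zip(words, words[1:]):
        p = bigram_prob(following, current)
        if p == 0.0:
            return float("-inf")  # an unseen bigram makes the whole sequence impossible
        log_prob += math.log(p)
    return log_prob


print(bigram_prob("a", "is"))
print(sentence_log_prob("there is a crack".split()))
print(sentence_log_prob("crack a is".split()))
```

A single unseen bigram drives the whole sequence probability to zero, which is the same sparsity issue as in the naive approach, only less severe.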

### n-gram language model

- In NLP, a Markov model of language is also referred to as **an n-gram language model**. 
- So far we have been talking about approximating the conditional probability $P(s_{t+1} \mid s_{1}s_{2}\dots s_{t})$ by $P(s_{t+1} \mid s_{t})$, which uses only the current state. 
- In our setup, $n$ denotes the number of states used as history, so this is the case $n=1$: we only consider the current state in predicting the future.    
- Such a model is referred to as a **2-gram (bigram) language model**, because we only consider sequences of 2 states at a time: the current state and the next state. 

Let's try this out on a toy corpus. 

In [10]:
# Use a context manager so the file is closed after reading
with open("data/cohen_poem.txt") as toy_data:
    toy_corpus = toy_data.read()
print(toy_corpus[0:512])

The birds they sang
At the break of day
Start again
I heard them say
Don't dwell on what
Has passed away
Or what is yet to be
Yeah the wars they will
Be fought again
The holy dove
She will be caught again
Bought and sold
And bought again
The dove is never free
Ring the bells (ring the bells) that still can ring
Forget your perfect offering
There is a crack in everything (there is a crack in everything)
That's how the light gets in
We asked for signs
The signs were sent
The birth betrayed
The marriage spent



Let's calculate word co-occurrence frequencies. 

In [11]:
toy_corpus_tokens = nltk.word_tokenize(toy_corpus.lower())

frequencies = defaultdict(Counter)
for i in range(len(toy_corpus_tokens) - 1):
    frequencies[toy_corpus_tokens[i]][toy_corpus_tokens[i + 1]] += 1

freq_df = pd.DataFrame(frequencies).transpose()
freq_df = freq_df.fillna(0)
freq_df

Unnamed: 0,birds,break,wars,holy,dove,bells,light,signs,birth,marriage,...,out,loud,but,like,summoned,up,going,from,me,wo
the,1.0,1.0,1.0,1.0,1.0,6.0,6.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
birds,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
they,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sang,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
at,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
heart,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
love,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
come,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
like,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's calculate the transition matrix. 

In [12]:
trans_df = freq_df.div(freq_df.sum(axis=1), axis=0)
trans_df

Unnamed: 0,birds,break,wars,holy,dove,bells,light,signs,birth,marriage,...,out,loud,but,like,summoned,up,going,from,me,wo
the,0.04,0.04,0.04,0.04,0.04,0.24,0.24,0.04,0.04,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
birds,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
they,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sang,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
at,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
heart,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
love,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
come,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
like,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
print("Conditional probability P(everything | in) = ", trans_df.loc["in", "everything"])

Conditional probability P(everything | in) =  0.5714285714285714


### State space

- The states in our model are unique words in our corpus

In [14]:
states = np.unique(list(toy_corpus_tokens))
print("States:", states)
S = len(states)
print("Number of states:", S)

States: ["'re" "'s" "'ve" '(' ')' ',' 'a' 'add' 'again' 'all' 'and' 'asked' 'at'
 'away' 'be' 'bells' 'betrayed' 'birds' 'birth' 'bought' 'break' 'but'
 'ca' 'can' 'caught' 'come' 'crack' 'crowd' 'day' 'do' 'dove' 'drum'
 'dwell' 'every' 'everything' 'for' 'forget' 'fought' 'free' 'from' 'gets'
 'going' 'government' 'has' 'have' 'hear' 'heard' 'heart' 'high' 'holy'
 'how' 'i' 'in' 'is' 'killers' 'lawless' 'light' 'like' 'loud' 'love'
 'march' 'marriage' 'me' 'more' "n't" 'never' 'no' 'of' 'offering' 'on'
 'or' 'out' 'parts' 'passed' 'perfect' 'places' 'prayers' 'refugee' 'ring'
 'run' 'sang' 'say' 'see' 'sent' 'she' 'signs' 'sold' 'spent' 'start'
 'still' 'strike' 'sum' 'summoned' 'that' 'the' 'their' 'them' 'there'
 'they' 'thundercloud' 'to' 'up' 'wars' 'we' 'were' 'what' 'while'
 'widowhood' 'will' 'with' 'wo' 'yeah' 'yet' 'you' 'your']
Number of states: 115


### Text generation using Markov models of language 

- How can we predict the next word given a sequence of words? 

In [15]:
seed = "in"
seq_len = 50
seq = ""
word = seed
for i in range(seq_len):
    seq += " " + word
    next_word = npr.choice(
        trans_df.columns.tolist(),
        p=trans_df.loc[word].values,
    )
    word = next_word
print("THE GENERATED SEQUENCE:", seq)

THE GENERATED SEQUENCE:  in ring forget your perfect offering there is a crack , a crack in everything ) that lawless crowd while the light gets in everything ) that 's how the bells that still can strike up a crack in that 's how the light gets in everything ( there is
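The same sampling idea can be sketched without a DataFrame, drawing each next word in proportion to its bigram count. This is an illustrative variant under stated assumptions, not the cell above: the function `generate` is hypothetical, it uses `random.choices` from the standard library instead of `npr.choice`, it wraps the token list around (as in the circular rose example) so every word has a successor, and it assumes the seed word occurs in the tokens.

```python
import random
from collections import Counter, defaultdict


def generate(tokens, seed_word, length, rng):
    """Generate `length` words by repeatedly sampling a successor of the current word."""
    counts = defaultdict(Counter)
    # Pair each token with its successor; wrapping around to the first token
    # guarantees every state has at least one outgoing transition.
    for current, following in zip(tokens, tokens[1:] + tokens[:1]):
        counts[current][following] += 1

    word = seed_word
    generated = [word]
    for _ in range(length - 1):
        followers = counts[word]
        # Sampling with counts as weights is equivalent to using normalized probabilities
        word = rng.choices(list(followers.keys()), weights=list(followers.values()))[0]
        generated.append(word)
    return generated


tokens = "a rose is a rose is a rose a rose .".split()
rng = random.Random(0)  # fixed seed so the run is repeatable
print(" ".join(generate(tokens, "a", 12, rng)))
```

Passing an explicit `random.Random` instance keeps the example reproducible without touching global random state.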


In practice, the corpus (dataset) is huge: for example, the full Wikipedia, the text available on the entire Internet, or all the New York Times articles from the last 20 years. 

### Extending the bigram model

- So far we have been talking about bigram models, where we only consider the current word when predicting the next word. 
- If we want to predict the future accurately, it's probably a good idea to use more history. 
- Can we generalize the bigram language model ($n=1$) so that we can incorporate more history in the model ($n \gt 1$)? 

![](img/bigram-ex.png)
<!-- <center> -->
<!-- <img src="img/bigram-ex.png" height="500" width="500"> -->
<!-- </center> -->

### Extending the bigram model
- One way to incorporate more history is by extending the definition of a state. 
    - Instead of defining the state space as the unique words in the corpus, we can define it as sequences of two words (trigram model), three words (4-gram model), or in general $n-1$ words (n-gram model) from the corpus.

- Trigram language model ($n=2$):  
$$
P(\text{everything} \mid \text{there is a crack in}) \approx P(\text{everything} \mid \text{crack in})
$$

![](img/trigram-ex.png)

<!-- <center> -->
<!-- <img src="img/trigram-ex.png" height="500" width="500"> -->
<!-- </center> -->


### Considering more history 

- When we consider three state sequences at a time (2 previous states as history), the model is a **trigram (3-gram) language model**.
- When we consider $n-1$ states as history, it's an **n-gram language model**. 

- Example: trigram and four-gram language models
    - Trigram language model
$$
P(\text{everything} \mid \text{there is a crack in}) \approx P(\text{everything} \mid \text{crack in})
$$
    - Four-gram language model
$$
P(\text{everything} \mid \text{there is a crack in}) \approx P(\text{everything} \mid \text{a crack in})
$$


### Language model example ($n=2$)

- The state space would be all 2-word combinations of unique words in the corpus. 
    - What would be the size of the state space in our toy corpus? 
- Not all transitions would be valid transitions. 
    - Example: _a crack_ to _in everything_ is not a valid transition
- Some example states with valid transitions could be: 
![](img/trigram-ex.png)

<!-- <center> -->
<!-- <img src="img/trigram-ex.png" height="500" width="500"> -->
<!-- </center> -->


### Language model example ($n=3$)
- The state space would be all 3-word combinations of unique words in the corpus. 
    - What would be the size of the state space in our toy corpus?  
- Some example states with valid transitions could be: 

![](img/4-gram-ex.png)

<!-- <center> -->
<!-- <img src="img/4-gram-ex.png" height="500" width="500"> -->
<!-- </center> -->

- Now we are able to incorporate more history, without really changing the math of Markov models! 
- We can calculate probability of sequences, predict the state at a given time step, or calculate the stationary distribution the same way. 
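To make the "extended state" idea concrete, here is a minimal sketch that counts transitions between states made of $n$ consecutive words, following this lecture's convention that $n$ is the history length. The function name `ngram_transition_counts` and the one-line corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def ngram_transition_counts(tokens, n):
    """Count transitions between states made of n consecutive tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n):
        state = tuple(tokens[i : i + n])
        # The next state drops the oldest word and appends the upcoming one
        next_state = tuple(tokens[i + 1 : i + n + 1])
        counts[state][next_state] += 1
    return counts


tokens = "there is a crack in everything that is how the light gets in".split()
counts = ngram_transition_counts(tokens, n=2)
print(counts[("crack", "in")])
```

Because consecutive states overlap in all but one word, only transitions such as _crack in_ to _in everything_ can ever receive a count; normalizing each row of counts then gives the transition probabilities exactly as in the single-word case.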

### Are n-grams a good model of language?

- In many cases, we can get by with n-gram models. 
- But in general, is it a good assumption that the next word that I utter will be dependent on the last 3 words or 4 words?

<blockquote>
    The computer I was talking about yesterday when we were having dinner __.     
</blockquote>    

- Language has long-distance dependencies.  
- We can extend it to $3$-grams, $4$-grams, $5$-grams. But then there is a sparsity problem. 
- Also, n-gram models have huge RAM requirements.

### Language models with word embeddings

- N-gram models are great, but they represent context as exact words.
- Suppose in your training data you have the sequence "feed the cat" but you do not have the sequence "feed the dog".

<blockquote>
I have to make sure to feed the cat.
</blockquote>

- Trigram model: $P(\text{dog} \mid \text{feed the}) = 0$
- If we represent words with word embeddings instead, we will be able to generalize to dog even if we haven't seen it in the corpus.
- We'll come back to this when we learn about Recurrent Neural Networks (RNNs). 

### (Optional) [Google n-gram viewer](https://books.google.com/ngrams)
 
- All Our N-gram are Belong to You
    - https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-toyou.html

<blockquote>
Here at Google Research we have been using word n-gram models for a variety
of R&D projects, such as statistical machine translation, speech recognition,
spelling correction, entity detection, information extraction, and others.
That's why we decided to share this enormous dataset with everyone. We
processed 1,024,908,267,229 words of running text and are publishing the
counts for all 1,176,470,663 five-word sequences that appear at least 40
times. There are 13,588,391 unique words, after discarding words that appear
less than 200 times.
</blockquote>

In [16]:
url = "https://books.google.com/ngrams/"
HTML("<iframe src=%s width=1000 height=800></iframe>" % url)



<br><br><br><br>

## ❓❓ Questions for you

iClicker cloud join link: https://join.iclicker.com/4QVT4

### Exercise 2.1: Select all of the following statements which are **True** (iClicker)

- (A) Building a transition matrix for a Markov model means calculating the proportion of how often sequences of states occur in your data. 
- (B) In our setup, when n=3, each state has 3 letters. So given a three letter sequence, we predict the next three letter sequence. 
- (C) Suppose you have two corpora from completely different domains. You build two word-based n-gram models one for each corpus. The stationary distribution would be the same for both n-gram models.
- (D) For $V=10$ and $n=4$ in our n-gram model setup, the dimension of the transition matrix is $10 \times 10$. 

### Exercise 2.2: Questions for class discussion

- In our setup, n=1 means a bigram model. I'm not sure if our implementation would work with n=0, but think about what n=0 would mean and how you would generate text in that case.  

<br><br><br><br>

## Applications of Markov models

### Markov’s own application of his chains
- Markov studied the transitions between vowels and consonants in a sequence of 20,000 letters in A. S. Pushkin's poem _Eugeny Onegin_. 

 $S = \{\text{vowel, consonant}\}, T =
    \begin{bmatrix}
            0.128 & 0.872\\ 
            0.663 & 0.337\\
    \end{bmatrix}
    $
    
|               | vowel     | consonant |
| ------------- |:---------:| -----:|
| vowel         | 0.128       | 0.872   |
| consonant     | 0.663      | 0.337   |   

### Stationary distribution in the context of text 

- He also gave the stationary distribution for vowels and consonants: $\pi = \begin{bmatrix}0.432 & 0.568\end{bmatrix}$       

- In the context of text, stationary distribution is calculated as how often each state occurs in the corpus. 
- So stationary distribution in this case can be calculated as: 

$$
\begin{bmatrix}
    \frac{\text{# vowel occurrences}}{\text{total number of letters}} & \frac{\text{# consonant occurrences} }{\text{total number of letters}}\\
\end{bmatrix} 
$$

- Let's check whether we get $\pi T = \pi$ with this stationary distribution.     

In [17]:
pi = np.array([0.432, 0.568])
T = np.array([[0.128, 0.872], [0.663, 0.337]])
np.allclose(pi @ T, pi)

False

In [18]:
# Markov's Pushkin Onegin consonant vowel probabilities
print(pi @ T)
print(pi @ np.linalg.matrix_power(T, 2))
print(pi @ np.linalg.matrix_power(T, 3))

[0.43188 0.56812]
[0.4319442 0.5680558]
[0.43190985 0.56809015]


- The state probabilities are not quite the same but they are pretty close.
- Note that Markov calculated the stationary distribution **by hand** and probably rounded the probabilities to 3 decimal places. (Tedious calculation!!) 
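One way to check the rounding explanation is to compute the stationary distribution of $T$ numerically and compare it with the reported values. The sketch below uses the left eigenvector of $T$ for eigenvalue 1 via `np.linalg.eig`; the variable names `pi_exact` and `pi_reported` are introduced here for illustration, and the tolerance `atol=1e-3` is an assumption chosen to match three-decimal rounding.

```python
import numpy as np

T = np.array([[0.128, 0.872], [0.663, 0.337]])
pi_reported = np.array([0.432, 0.568])

# A stationary distribution satisfies pi T = pi, so it is a left eigenvector of T
# for eigenvalue 1, i.e. a right eigenvector of T transposed.
eigenvalues, eigenvectors = np.linalg.eig(T.T)
idx = np.argmin(np.abs(eigenvalues - 1.0))
pi_exact = np.real(eigenvectors[:, idx])
pi_exact = pi_exact / pi_exact.sum()  # rescale so the entries sum to 1

print(pi_exact)
print(np.allclose(pi_exact @ T, pi_exact))
print(np.allclose(pi_reported @ T, pi_reported, atol=1e-3))
```

`np.allclose` uses very tight default tolerances, which is why the earlier check returns `False` even though the reported values agree with the numerical solution to three decimal places.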

### Markov’s own application of his chains

- Markov also studied the sequence of 100,000 letters in S. T. Aksakov's novel "The Childhood of Bagrov, the Grandson".
    * $S = \{\text{vowel, consonant}\}$ 
    * $T = \begin{bmatrix} 0.552 & 0.448\\ 0.365 & 0.635 \end{bmatrix}$

|               | vowel     | consonant |
| ------------- |:---------:| -----:|
| vowel         | 0.552       | 0.448   |
| consonant     | 0.365       | 0.635   |

- He gave the stationary distribution for vowels and consonants.
    * $\pi = [0.449,0.551]$ 
- Stationary distribution in this case can be calculated as: 

$$
\begin{bmatrix}
    \frac{\text{# vowels}}{\text{total number of letters}} & \frac{\text{# consonants}}{\text{total number of letters}}\\
\end{bmatrix} 
$$  

In [19]:
pi = np.array([0.449, 0.551])
T = np.array([[0.552, 0.448], [0.365, 0.635]])
np.allclose(pi @ T, pi)

False

In [20]:
# Markov's vowel/consonant probabilities for Aksakov's novel
print(pi @ T)
print(pi @ np.linalg.matrix_power(T, 2))

[0.448963 0.551037]
[0.44895608 0.55104392]


Again, the state probabilities are not quite the same but they are pretty close. 
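The link between state frequencies and the stationary distribution can be verified exactly on a small circular sequence, where every occurrence of a state is followed by a transition. The sketch below reuses the rose sentence, split on whitespace for simplicity; the variable names are illustrative, and the exact equality relies on the wrap-around from the last token to the first.

```python
from collections import Counter

import numpy as np

tokens = "a rose is a rose is a rose a rose .".split()
states = sorted(set(tokens))
index = {state: i for i, state in enumerate(states)}

# Transition counts on the circular sequence (the last token wraps to the first)
counts = np.zeros((len(states), len(states)))
for current, following in zip(tokens, tokens[1:] + tokens[:1]):
    counts[index[current], index[following]] += 1
T = counts / counts.sum(axis=1, keepdims=True)

# Relative frequency of each state in the sequence
freq = Counter(tokens)
pi = np.array([freq[state] / len(tokens) for state in states])

print(dict(zip(states, pi.round(3))))
print(np.allclose(pi @ T, pi))
```

For a non-circular corpus the match is only approximate, because the final token has one fewer outgoing transition than occurrences.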

## PageRank 

- One of the algorithms used by Google Search to rank web pages in their search engine results.
- Graph-based ranking algorithm, which assigns a rank to a webpage.
- The rank indicates a relative score of the page's importance and authority.
- Intuition
    - Important webpages are linked from other important webpages.
    - Don't just look at the number of links coming to a webpage but consider who the links are coming from. 

![](img/wiki_page_rank.jpg)
<!-- <center>     -->
<!-- <img src="img/wiki_page_rank.jpg" height="400" width="400">  -->
<!-- </center> -->

[Credit](https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.jpg)


### PageRank: scoring

- Imagine a browser doing a random walk 
    - At time t=0, start at a random webpage.
    - At time t=1, follow a random link on the current page.
    - At time t=2, follow a random link on the current page. 
    
- Intuition
    - In the "steady state" each page has a long-term visit rate, which is the page's score (rank). 

### PageRank as a Markov chain

- A state is a web page.
- Transition probabilities represent probabilities of moving from one page to another.
- We derive these from the adjacency matrix of the web graph
    - Adjacency matrix $M$ is an $n \times n$ matrix, where $n$ is the number of states (web pages)
    - $M_{ij} = 1$ if there is a hyperlink from page $i$ to page $j$.   
- (Optional) If you want to know more details, check out [AppendixA-PageRank](AppendixA-PageRank.ipynb).    

### Calculate page rank: power iteration method

- Start with a random initial probability distribution $\pi_0$.
- Multiply $\pi_0$ by powers of the transition matrix $T$ until the product looks stable.  
    - After one step, we are at $\pi_0 T$
    - After two steps, we are at $\pi_0 T^2$
    - After three steps, we are at $\pi_0 T^3$
    - Eventually (for a large $k$), $\pi_0 T^k$ approaches a stationary distribution.       
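The power iteration steps can be sketched on a tiny hypothetical web graph. This is a simplified illustration, not the full PageRank algorithm: the three-page adjacency matrix is made up, every page is assumed to have at least one outgoing link, the walk starts from a uniform distribution rather than a random one, and refinements such as a damping factor are left out.

```python
import numpy as np

# Hypothetical web graph: M[i, j] = 1 if page i links to page j
M = np.array(
    [
        [0, 1, 1],
        [0, 0, 1],
        [1, 0, 0],
    ],
    dtype=float,
)
# Row-normalize: from each page, follow one of its links uniformly at random
T = M / M.sum(axis=1, keepdims=True)


def power_iteration(T, tol=1e-10, max_iter=1000):
    """Repeatedly multiply the distribution by T until it stops changing."""
    pi = np.full(T.shape[0], 1.0 / T.shape[0])  # uniform start
    for _ in range(max_iter):
        next_pi = pi @ T
        if np.abs(next_pi - pi).sum() < tol:
            return next_pi
        pi = next_pi
    return pi


ranks = power_iteration(T)
print(ranks.round(3))
```

Updating the distribution one step at a time gives the same result as computing $\pi_0 T^k$ directly, but avoids forming large matrix powers.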

### Modern ranking methods are more advanced:

- Guarding against methods that exploit the algorithm.
- Removing offensive/illegal content.
- Supervised and personalized ranking methods.
- Take into account that you often only care about top rankings.
- Also work on diversity of rankings:
    - E.g., divide objects into sub-topics and do weighted "covering" of topics.
- Persistence/freshness as in recommender systems (news articles).

<br><br><br><br>

## Basic text preprocessing [[video](https://www.youtube.com/watch?v=7W5Q8gzNPBc)]

### Introduction 
- Why do we need preprocessing?
    - Text data is unstructured and messy. 
    - We need to "normalize" it before we do anything interesting with it. 
- Example:     
    - **Lemma**: Same stem, same part-of-speech, roughly the same meaning
        - Vancouver's &rarr; Vancouver
        - computers &rarr; computer 
        - rising, rose, rises &rarr; rise    

### Tokenization

- Sentence segmentation
    - Split text into sentences
- Word tokenization 
    - Split sentences into words

### Tokenization: sentence segmentation

<blockquote>
MDS is a Master's program at UBC in British Columbia. MDS teaching team is truly multicultural!! Dr. Beuzen did his Ph.D. in Australia. Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. Dr. George did his in Scotland. Dr. Gelbart did his PhD in the U.S.
</blockquote>

- How many sentences are there in this text? 

In [21]:
### Let's do sentence segmentation on "."
text = (
    "MDS is a Master's program at UBC in British Columbia. "
    "MDS teaching team is truly multicultural!! "
    "Dr. Beuzen did his Ph.D. in Australia. "
    "Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. "
    "Dr. George did his in Scotland. "
    "Dr. Gelbart did his PhD in the U.S."
)

print(text.split("."))

["MDS is a Master's program at UBC in British Columbia", ' MDS teaching team is truly multicultural!! Dr', ' Beuzen did his Ph', 'D', ' in Australia', ' Dr', ' Timbers, Dr', ' Ostblom, Dr', ' Rodríguez-Arelis, and Dr', ' Kolhatkar did theirs in Canada', ' Dr', ' George did his in Scotland', ' Dr', ' Gelbart did his PhD in the U', 'S', '']


### Sentence segmentation

- In English, period (.) is quite ambiguous. (In Chinese, it is unambiguous.)
    - Abbreviations like Dr., U.S., Inc.  
    - Numbers like 60.44%, 0.98
- ! and ? are relatively unambiguous.
- How about writing regular expressions? 
- A common way is using off-the-shelf models for sentence segmentation. 
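To see why hand-written rules are fragile, here is a minimal sketch of a naive regular expression that treats `.`, `!`, or `?` followed by whitespace as a sentence boundary. The pattern and the shortened example text are assumptions made for illustration.

```python
import re

text = "Dr. Beuzen did his Ph.D. in Australia. Dr. George did his in Scotland."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)
```

The rule keeps the punctuation attached, but it still breaks after abbreviations such as `Dr.` and `Ph.D.`, so the two sentences come out as five fragments. Handling such cases means maintaining lists of exceptions, which is what off-the-shelf segmenters do for you.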

In [22]:
### Let's try to do sentence segmentation using nltk
from nltk.tokenize import sent_tokenize

sent_tokenized = sent_tokenize(text)
print(sent_tokenized)

["MDS is a Master's program at UBC in British Columbia.", 'MDS teaching team is truly multicultural!!', 'Dr. Beuzen did his Ph.D. in Australia.', 'Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada.', 'Dr. George did his in Scotland.', 'Dr. Gelbart did his PhD in the U.S.']


### Word tokenization

<blockquote>
MDS is a Master's program at UBC in British Columbia. 
</blockquote>

- How many words are there in this sentence?  
- Is whitespace a sufficient condition for a word boundary?

- What's our definition of a word?
    - Should British Columbia be one word or two words? 
    - Should punctuation be considered a separate word?
    - What about the punctuation in `U.S.`?
    - What do we do with words like `Master's`?
- This process of identifying word boundaries is referred to as **tokenization**.
- You can use regex, but it is better to use off-the-shelf ML models.  

In [23]:
### Let's do word segmentation on white spaces
print("Splitting on whitespace: ", [sent.split() for sent in sent_tokenized])

### Let's try to do word segmentation using nltk
from nltk.tokenize import word_tokenize

word_tokenized = [word_tokenize(sent) for sent in sent_tokenized]
# This is similar to the input format of word2vec algorithm
print("\n\n\nTokenized: ", word_tokenized)

Splitting on whitespace:  [['MDS', 'is', 'a', "Master's", 'program', 'at', 'UBC', 'in', 'British', 'Columbia.'], ['MDS', 'teaching', 'team', 'is', 'truly', 'multicultural!!'], ['Dr.', 'Beuzen', 'did', 'his', 'Ph.D.', 'in', 'Australia.'], ['Dr.', 'Timbers,', 'Dr.', 'Ostblom,', 'Dr.', 'Rodríguez-Arelis,', 'and', 'Dr.', 'Kolhatkar', 'did', 'theirs', 'in', 'Canada.'], ['Dr.', 'George', 'did', 'his', 'in', 'Scotland.'], ['Dr.', 'Gelbart', 'did', 'his', 'PhD', 'in', 'the', 'U.S.']]



Tokenized:  [['MDS', 'is', 'a', 'Master', "'s", 'program', 'at', 'UBC', 'in', 'British', 'Columbia', '.'], ['MDS', 'teaching', 'team', 'is', 'truly', 'multicultural', '!', '!'], ['Dr.', 'Beuzen', 'did', 'his', 'Ph.D.', 'in', 'Australia', '.'], ['Dr.', 'Timbers', ',', 'Dr.', 'Ostblom', ',', 'Dr.', 'Rodríguez-Arelis', ',', 'and', 'Dr.', 'Kolhatkar', 'did', 'theirs', 'in', 'Canada', '.'], ['Dr.', 'George', 'did', 'his', 'in', 'Scotland', '.'], ['Dr.', 'Gelbart', 'did', 'his', 'PhD', 'in', 'the', 'U.S', '.']]


### Word segmentation 

For some languages you need much more sophisticated tokenizers. 
- For languages such as Chinese, there are no spaces between words.
    - [jieba](https://github.com/fxsjy/jieba) is a popular tokenizer for Chinese. 
- German doesn't separate compound words.
    * Example: _rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz_
    * (the law for the delegation of monitoring beef labeling)
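
To get a feel for why segmentation without spaces is hard, here is a minimal sketch of greedy maximum matching, a classic dictionary-based baseline. This is an illustration only: the vocabularies are made up, the example uses English text with the spaces removed, and real tokenizers for these languages are more sophisticated than this.

```python
def max_match(text, vocab):
    """Greedy left-to-right segmentation.

    At each position take the longest substring found in vocab;
    if nothing matches, emit a single character and move on.
    """
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


text = "thetabledownthere"
small_vocab = {"the", "table", "down", "there"}
bigger_vocab = small_vocab | {"theta", "bled", "own"}

print(max_match(text, small_vocab))
print(max_match(text, bigger_vocab))
```

With the bigger vocabulary the greedy choice of `theta` derails the rest of the segmentation, so a larger dictionary alone does not fix the problem.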

### Types and tokens
- Usually in NLP, we distinguish between
    - **Type**: an element in the vocabulary
    - **Token**: an instance of that type in running text
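
A minimal sketch of the distinction, assuming a deliberately simple tokenizer (lowercase the text and keep runs of word characters). The example sentence is made up, and a different tokenizer or normalization would give different counts.

```python
import re
from collections import Counter

text = "the cat sat on the mat because the mat was warm"

# Simplifying assumption: lowercase and keep runs of word characters
tokens = re.findall(r"\w+", text.lower())
counts = Counter(tokens)

n_tokens = len(tokens)  # every occurrence counts
n_types = len(counts)   # each distinct word counts once

print("Number of tokens:", n_tokens)
print("Number of types: ", n_types)
print("Most common types:", counts.most_common(2))
```

Here _the_ is one type that contributes three tokens.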


### Exercise for you 

<blockquote>    
UBC is located in the beautiful province of British Columbia. It's very close 
to the U.S. border. You'll get to the USA border in about 45 mins by car.     
</blockquote>  

- Consider the example above. 
    - How many types? (task dependent)
    - How many tokens? 

### Other commonly used preprocessing steps

- Punctuation and stopword removal
- Stemming and lemmatization

### Punctuation and stopword removal

- The most frequently occurring words in English are not very informative in many NLP tasks.
    - Example: _the_, _is_, _a_, and punctuation
- They are often removed during preprocessing.

In [24]:
# Let's use `nltk.stopwords`.
# Add punctuations to the list.
stop_words = list(set(stopwords.words("english")))
import string

punctuation = string.punctuation
stop_words += list(punctuation)
# stop_words.extend(['``','`','br','"',"”", "''", "'s"])
print(stop_words)

['o', 'out', 'few', 'with', 'had', 'shouldn', 'needn', 'after', 'themselves', "haven't", "you'll", 'ourselves', 'their', 'now', 'was', "hasn't", 'it', 'down', 'on', 'before', 'only', 'being', "needn't", 'should', 'didn', "aren't", 'those', 'are', 've', 'here', "isn't", 'until', "it's", 'of', 't', 'why', 'then', 'm', 'haven', 'during', 'and', 'doing', 'such', 'can', 'more', 'ma', 'so', 're', 'any', "that'll", 'mightn', 'having', 'this', "mightn't", 'doesn', 'who', 'that', 'him', 'while', 'whom', 'its', 'they', 'where', 'the', 'don', 'ain', 'up', 'his', 'myself', 'd', "shouldn't", 'himself', 'which', 'some', 'above', 'in', 'between', 'yourself', "mustn't", 'do', 'been', 'what', 'won', 'through', "she's", 'herself', "should've", 'hers', 'all', "you've", 'weren', 'them', 'does', 'into', "shan't", 'an', 'for', 'about', 'i', "you'd", 'aren', 'were', 'very', 'against', 'if', 'under', 's', "you're", "wasn't", "don't", 'from', 'over', 'you', "won't", "didn't", "doesn't", "wouldn't", 'couldn', '

In [25]:
### Get rid of stop words
preprocessed = []
for sent in word_tokenized:
    for token in sent:
        token = token.lower()
        if token not in stop_words:
            preprocessed.append(token)
print(preprocessed)

['mds', 'master', "'s", 'program', 'ubc', 'british', 'columbia', 'mds', 'teaching', 'team', 'truly', 'multicultural', 'dr.', 'beuzen', 'ph.d.', 'australia', 'dr.', 'timbers', 'dr.', 'ostblom', 'dr.', 'rodríguez-arelis', 'dr.', 'kolhatkar', 'canada', 'dr.', 'george', 'scotland', 'dr.', 'gelbart', 'phd', 'u.s']


### Lemmatization 

- For many NLP tasks (e.g., web search) we want to ignore morphological differences between words
    - Example: If your search term is "studying for ML quiz" you might want to include pages containing "tips to study for an ML quiz" or "here is how I studied for my ML quiz"
- Lemmatization converts inflected forms into the base form. 
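
To see why this helps in the search example, here is a minimal sketch that uses a tiny hand-written lemma table. The table `LEMMAS` and the helper `lemma_set` are hypothetical and only cover this example; real lemmatizers rely on a lexicon and usually on part-of-speech information.

```python
# Hypothetical hand-written lemma table, just for this example
LEMMAS = {"studying": "study", "studied": "study", "tips": "tip"}


def lemma_set(text):
    """Lowercase, split on whitespace, and map each word to its lemma if known."""
    return {LEMMAS.get(word, word) for word in text.lower().split()}


query = "studying for ML quiz"
pages = [
    "tips to study for an ML quiz",
    "here is how I studied for my ML quiz",
]

for page in pages:
    raw_overlap = set(query.lower().split()) & set(page.lower().split())
    lemma_overlap = lemma_set(query) & lemma_set(page)
    print("Without lemmatization:", sorted(raw_overlap))
    print("With lemmatization:   ", sorted(lemma_overlap))
```

Without lemmatization the key term _studying_ does not match either page; with lemmatization all three texts share the lemma _study_.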

In [26]:
import nltk

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /Users/kvarada/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [27]:
nltk.download("omw-1.4")

[nltk_data] Downloading package omw-1.4 to /Users/kvarada/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [28]:
# nltk has a lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("Lemma of studying: ", lemmatizer.lemmatize("studying", "v"))
print("Lemma of studied: ", lemmatizer.lemmatize("studied", "v"))

Lemma of studying:  study
Lemma of studied:  study


### Stemming

- Stemming has a similar purpose to lemmatization, but it is a crude chopping of affixes.
    * _automates, automatic, automation_ are all reduced to _automat_.
- Usually these reduced forms (stems) are not actual words themselves.  
- A popular stemming algorithm for English is the Porter stemmer (`PorterStemmer` in `nltk`). 
- Beware that it can be aggressive sometimes.
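
To illustrate what "crude chopping" means, here is a toy suffix stripper. It is not the Porter algorithm: the suffix list, the minimum stem length, and the name `crude_stem` are assumptions chosen only to reproduce the _automat_ example.

```python
# Hypothetical suffix list, ordered so that longer suffixes are tried first
SUFFIXES = ["ion", "ing", "ed", "es", "ic", "s"]


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["automates", "automatic", "automation", "located", "news", "is"]:
    print(w, "->", crude_stem(w))
```

Even this toy version shows both sides of stemming: the three _automat_ words collapse to one form, but _news_ is wrongly reduced to _new_.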

In [29]:
from nltk.stem.porter import PorterStemmer

text = (
    "UBC is located in the beautiful province of British Columbia... "
    "It's very close to the U.S. border."
)
ps = PorterStemmer()
tokenized = word_tokenize(text)
stemmed = [ps.stem(token) for token in tokenized]
print("Before stemming: ", text)
print("\n\nAfter stemming: ", " ".join(stemmed))

Before stemming:  UBC is located in the beautiful province of British Columbia... It's very close to the U.S. border.


After stemming:  ubc is locat in the beauti provinc of british columbia ... it 's veri close to the u.s. border .


### Other tools for preprocessing 

- We used the [Natural Language Toolkit (nltk)](https://www.nltk.org/) above.
    - You have already used it in 571 and 573.
- Many other tools are available, for example [spaCy](https://spacy.io/).

### [spaCy](https://spacy.io/)

- We have already used spaCy in 573 and 563. 
- Industrial-strength NLP library. 
- Lightweight, fast, and convenient to use. 
- spaCy does many of the things we did above in one line of code! 
- Also has [multilingual](https://spacy.io/models/xx) support. 

In [33]:
# !python -m spacy download en_core_web_md

In [34]:
import spacy

# Load the model
nlp = spacy.load("en_core_web_md")
text = (
    "MDS is a Master's program at UBC in British Columbia. "
    "MDS teaching team is truly multicultural!! "
    "Dr. Beuzen did his Ph.D. in Australia. "
    "Dr. George did his in Scotland. "
    "Dr. Timbers, Dr. Ostblom, Dr. Rodríguez-Arelis, and Dr. Kolhatkar did theirs in Canada. "
    "Dr. Gelbart did his PhD in the U.S."
)

doc = nlp(text)

In [35]:
# Accessing tokens
tokens = [token for token in doc]
print("\nTokens: ", tokens)

# Accessing lemma
lemmas = [token.lemma_ for token in doc]
print("\nLemmas: ", lemmas)

# Accessing pos
pos = [token.pos_ for token in doc]
print("\nPOS: ", pos)


Tokens:  [MDS, is, a, Master, 's, program, at, UBC, in, British, Columbia, ., MDS, teaching, team, is, truly, multicultural, !, !, Dr., Beuzen, did, his, Ph.D., in, Australia, ., Dr., George, did, his, in, Scotland, ., Dr., Timbers, ,, Dr., Ostblom, ,, Dr., Rodríguez, -, Arelis, ,, and, Dr., Kolhatkar, did, theirs, in, Canada, ., Dr., Gelbart, did, his, PhD, in, the, U.S.]

Lemmas:  ['mds', 'be', 'a', 'Master', "'s", 'program', 'at', 'UBC', 'in', 'British', 'Columbia', '.', 'mds', 'teaching', 'team', 'be', 'truly', 'multicultural', '!', '!', 'Dr.', 'Beuzen', 'do', 'his', 'ph.d.', 'in', 'Australia', '.', 'Dr.', 'George', 'do', 'his', 'in', 'Scotland', '.', 'Dr.', 'Timbers', ',', 'Dr.', 'Ostblom', ',', 'Dr.', 'Rodríguez', '-', 'Arelis', ',', 'and', 'Dr.', 'Kolhatkar', 'do', 'theirs', 'in', 'Canada', '.', 'Dr.', 'Gelbart', 'do', 'his', 'phd', 'in', 'the', 'U.S.']

POS:  ['NOUN', 'AUX', 'DET', 'PROPN', 'PART', 'NOUN', 'ADP', 'PROPN', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'NOUN', 'NOUN', 'NOUN

### Other typical NLP tasks 
To understand text, we are usually interested in extracting information from it. Some common tasks in an NLP pipeline are: 
- Part of speech tagging
    - Assigning part-of-speech tags to all words in a sentence.
- Named entity recognition
    - Labelling named “real-world” objects, like persons, companies or locations.    
- Coreference resolution
    - Deciding whether two strings (e.g., UBC vs University of British Columbia) refer to the same entity
- Dependency parsing
    - Representing grammatical structure of a sentence

### Extracting named-entities using spaCy

In [36]:
from spacy import displacy

doc = nlp(
    "University of British Columbia "
    "is located in the beautiful "
    "province of British Columbia."
)
displacy.render(doc, style="ent")
# Text and label of named entity span
print("Named entities:\n", [(ent.text, ent.label_) for ent in doc.ents])
print("\nORG means: ", spacy.explain("ORG"))
print("GPE means: ", spacy.explain("GPE"))

Named entities:
 [('University of British Columbia', 'ORG'), ('British Columbia', 'GPE')]

ORG means:  Companies, agencies, institutions, etc.
GPE means:  Countries, cities, states


### Dependency parsing using spaCy

In [37]:
doc = nlp("I like cats")
displacy.render(doc, style="dep")

### Many other things possible

- spaCy is a powerful tool. 
- All my Capstone groups last year used this tool. 
- You can build your own rule-based searches. 
- You can also access word vectors using spaCy with the medium and large models. (We are currently using the `en_core_web_md` model.)

<br><br>

## ❓❓ Questions for you

### Exercise 2.3: Discuss the following questions with your neighbours 

1. Why might your text become unreadable after stemming? 
2. What's the difference between sentence segmentation and word tokenization? Which step would you carry out first: sentence segmentation or word tokenization?
3. Tokenize the following sentence and identify named entities in the sentence manually. Compare your annotations with what you get with spaCy. 

> The MadeUpOrg founder John Fakename lists his Point Grey penthouse for $15 million.  

```{admonition} Exercise 2.3: V's Solutions!
:class: tip, dropdown
1. Stemming carries out crude chopping of affixes and converts words to reduced forms called stems. Often these reduced forms are not actual words. For instance, in the example we saw, _located_ was reduced to _locat_ and _beautiful_ was reduced to _beauti_. So after applying stemming the text might become unreadable. 
2. Sentence segmentation is about identifying sentence boundaries and splitting text into sentences whereas word tokenization is about identifying word boundaries and splitting sentences into words. The general practice is to carry out sentence segmentation before word tokenization. 
3. Manual NER: 
The [MadeUpOrg ORGANIZATION] founder [John Fakename PERSON] lists his [Point Grey LOCATION] penthouse for [$15 million MONEY] . 
```

In [38]:
from spacy import displacy

doc = nlp(
    "The MadeUpOrg founder John Fakename lists his Point Grey penthouse for $15 million."
)
displacy.render(doc, style="ent")

spaCy was not able to identify ORGANIZATION and LOCATION entities in the sentence.   

<br><br><br><br>

## Final comments, summary, and reflection

### Summary: Markov models 

- Markov models are a class of probabilistic models which assume that we can predict the probability of being in a particular state in the future without looking too far into the past. 
- We looked at two applications of Markov models in language.
    - N-gram language models
    - PageRank

### Summary: Language models  

- A model that computes the probability of a sequence of words (or characters) or the probability of an upcoming word (or character) is called a **language model**.
- Language models are central to many NLP applications such as smart compose, spelling correction, machine translation, and voice assistants. 
- Markov models are among the simplest models of language. 
- They are also referred to as **n-gram models**. 

### Summary: N-gram language models

- We can build character-based or word-based n-gram models.
- We build a bigram model of language by treating unique words or characters as states. 
- We can extend a bigram model to consider more history by extending the definition of a state (e.g., a state can be a pair of consecutive words). 
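
As a recap of how little machinery this needs, here is a minimal sketch of estimating transition probabilities by counting, assuming a toy corpus and no smoothing. The function name `train_bigram` and the corpus are made up for illustration. Passing pairs of consecutive words as the states gives a model with more history without changing the counting code.

```python
from collections import Counter, defaultdict


def train_bigram(states):
    """Estimate P(next state | current state) by counting observed transitions."""
    counts = defaultdict(Counter)
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
        for current, ctr in counts.items()
    }


tokens = "i like cats and i like dogs and i love cats".split()

# States are single words
word_model = train_bigram(tokens)
print(word_model["i"])

# States are pairs of consecutive words (more history)
pair_states = list(zip(tokens, tokens[1:]))
pair_model = train_bigram(pair_states)
print(pair_model[("i", "like")])
```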

### Summary: PageRank
- Another application of Markov chains in language is the PageRank algorithm. 
    - The intuition is that important webpages are linked from other important webpages.
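
A minimal sketch of this intuition, assuming a made-up three-page web, a damping factor of 0.85, and that every page has at least one outgoing link. The function `pagerank` below is an illustrative power iteration, not a production implementation.

```python
def pagerank(links, damping=0.85, n_iter=100):
    """Power iteration on a dict mapping each page to the pages it links to.

    Assumes every page appears as a key and has at least one outgoing link.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1 / n for page in pages}
    for _ in range(n_iter):
        # Every page gets a small baseline share (random jump)
        new_rank = {page: (1 - damping) / n for page in pages}
        # Each page passes its current rank equally to the pages it links to
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C ends up ranked highest because both other pages link to it, and A benefits in turn from being linked from C.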

### Summary: Preprocessing

- Preprocessing is an important step when you deal with text data.
- Some common preprocessing steps include:
    - Sentence segmentation
    - Word tokenization
    - Lemmatization
    - Stemming
- Some common tasks in an NLP pipeline are:
    - POS tagging
    - Named-entity recognition
    - Coreference resolution
    - Dependency parsing 

<br><br>

## Resources

- [GPT-3 AI Automation](https://www.nytimes.com/2020/07/29/opinion/gpt-3-ai-automation.html)
- [spaCy's Python for data science cheat sheet](http://datacamp-community-prod.s3.amazonaws.com/29aa28bf-570a-4965-8f54-d6a541ae4e06)

- [Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf)
- [Try preprocessing using unix shell](https://web.stanford.edu/class/cs124/lec/124-2020-UnixForPoets.pdf)
- [Flair](https://github.com/flairNLP/flair) is another library with state-of-the-art NLP tools.  