# A Probabilistic Language Model

[Open in Colab](https://colab.research.google.com/github/febse/ta2024/blob/main/03-Probabilistic-Language-Models/01-The-Bigram-Model.ipynb)




## The Bi-gram Model

A very simple probabilistic model for a natural language is the Markov bi-gram model.

:::{.callout-note}
## Bi-grams

Bi-grams are sequences of two consecutive words in a sentence. For example, consider the sentence

```
In another moment down went Alice after it, never once considering how in the world she was to get out again. 
```

The bi-grams here are

1. "In another"
2. "another moment"
3. "moment down"
4. ...
:::
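As a minimal sketch, the bi-grams of a sentence can be listed by pairing each word with its successor. The whitespace split used here is a simplifying assumption for illustration; the implementation later in this section tokenizes with `spaCy` instead.

```python
# Pair each word with the word that follows it.
sentence = "In another moment down went Alice after it"
words = sentence.split()  # naive whitespace tokenization (illustrative assumption)

bigrams = list(zip(words[:-1], words[1:]))

for first, second in bigrams[:3]:
    print(first, second)
```

A sentence with $n$ words yields $n - 1$ bi-grams, which is why the list above is one element shorter than `words`.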

Applications of the bi-gram model include

- Speech recognition
- Optical character recognition
- Spelling and grammar correction, nonsense detection (low probability sequences)
- Machine translation


In most languages, sentences (sequences of words and punctuation) are the basic units of meaning. A natural question is: what is the probability of seeing a specific sentence in a corpus of text?

Take the following quote from Oscar Wilde as an example:

```
Be yourself, everyone else is already taken.
```

Ignoring the punctuation, there are 7 words in this sentence: $w_1$ to $w_7$.

$w_1$ = "Be"
$w_2$ = "yourself"
$w_3$ = "everyone"
...
$w_7$ = "taken"

We could write the probability of this sentence as

$$
P(w_7, w_6, w_5, w_4, w_3, w_2, w_1)
$$

and (in principle) we could estimate this probability by counting the number of times this sentence occurs in a large corpus of text. However, this may not work well because the number of times a whole sentence (or any other long sequence of tokens) occurs in a corpus is likely to be very small or zero.

Let's look into a way to model this probability in a more tractable way.

Using the chain rule of probability, we can write the probability of a sentence as the product of the probabilities of each word given the preceding words. With the 7 words in the sentence, we can write

$$
\begin{align*}
P(w_7, w_6, w_5, w_4, w_3, w_2, w_1) & = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6, w_5, w_4, w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5, w_4, w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4, w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4 | w_3, w_2, w_1) \cdot P(w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4 | w_3, w_2, w_1) \cdot P(w_3 | w_2, w_1) \cdot P(w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4 | w_3, w_2, w_1) \cdot P(w_3 | w_2, w_1) \cdot P(w_2 | w_1) \cdot P(w_1)
\end{align*}
$$
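In compact form, the same decomposition for a sentence of $T$ words can be written as follows. This only restates the expansion above in product notation; no additional assumption is involved.

$$
P(w_T, w_{T - 1}, \ldots, w_1) = P(w_1) \prod_{t = 2}^{T} P(w_t | w_{t - 1}, \ldots, w_1)
$$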

These conditional probabilities are hardly more tractable than the probability of the whole sentence, so we need an assumption that simplifies the model. A strong simplifying assumption is that the conditional probability of a word depends only on the word immediately preceding it. This is called the (first-order) **Markov assumption**. With this assumption, the probability of the 7-th word given the preceding words can be written as

$$
P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) = P(w_7 | w_6)
$$

Analogously, we can write the probability of the sentence as

$$
P(w_7, w_6, w_5, w_4, w_3, w_2, w_1) = P(w_7 | w_6) \cdot P(w_6 | w_5) \cdot P(w_5 | w_4) \cdot P(w_4 | w_3) \cdot P(w_3 | w_2) \cdot P(w_2 | w_1) \cdot P(w_1)
$$

The probabilities of the bi-grams (the probability of seeing the second word given the first word) are much easier to estimate than the probability of a whole sentence. We can estimate these probabilities by counting the number of times a word $w_t$ follows a word $w_{t - 1}$ in a corpus of text.

$$
\hat{P}(w_t | w_{t - 1}) = \frac{\text{Count}(w_{t - 1}, w_t)}{\text{Count}(w_{t - 1})}
$$
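The estimate above can be sketched with two counters. The toy corpus is hypothetical, and the sketch assumes that $\text{Count}(w_{t - 1})$ counts only the occurrences of a word as the first element of a bi-gram, so that the conditional probabilities for each first word sum to one.

```python
from collections import Counter

# Toy corpus: two already-tokenized sentences (hypothetical example data).
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
]

bigram_counts = Counter()
first_word_counts = Counter()

for sentence in corpus:
    for prev, curr in zip(sentence[:-1], sentence[1:]):
        bigram_counts[(prev, curr)] += 1
        first_word_counts[prev] += 1


def mle_prob(prev: str, curr: str) -> float:
    """Relative frequency of `curr` among the words that follow `prev`."""
    return bigram_counts[(prev, curr)] / first_word_counts[prev]


print(mle_prob("the", "cat"))  # 1.0
print(mle_prob("cat", "sat"))  # 0.5
```

Note that `mle_prob` fails with a `ZeroDivisionError` for a first word that never starts a bi-gram in the corpus, and returns zero for any unseen bi-gram.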

More generally, the probability of a sentence with $T$ words can be written as

$$
P(w_T, w_{T - 1}, \ldots, w_1) = P(w_1) \prod_{t = 2}^{T} P(w_t | w_{t - 1})
$$
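A minimal sketch of this product, using made-up probabilities: the dictionaries `p_first` and `p_next` and their values are assumptions chosen only to illustrate the formula, not estimates from any corpus.

```python
import math

# Hypothetical probabilities, chosen only for illustration.
p_first = {"be": 0.1}
p_next = {
    ("be", "yourself"): 0.2,
    ("yourself", "everyone"): 0.05,
}


def sentence_prob(words):
    """P(w_1) times the product of P(w_t | w_{t-1}) for t = 2..T."""
    prob = p_first[words[0]]
    for prev, curr in zip(words[:-1], words[1:]):
        prob *= p_next[(prev, curr)]
    return prob


p = sentence_prob(["be", "yourself", "everyone"])
print(math.isclose(p, 0.1 * 0.2 * 0.05))  # True
```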

Because each of these probabilities tends to be small, multiplying many of them in a long sequence can underflow to zero in floating-point arithmetic. A common solution to this problem is to work with log-probabilities instead of probabilities, which turns the product into a sum.


$$
\log P(w_T, w_{T - 1}, \ldots, w_1) = \log P(w_1) +  \sum_{t = 2}^{T} \log P(w_t | w_{t - 1})
$$
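The underflow problem is easy to reproduce. The sketch below uses 200 hypothetical conditional probabilities of 0.01 each: their product, $10^{-400}$, is too small to represent as a 64-bit float, while the sum of the logs is an ordinary number.

```python
import numpy as np

# 200 conditional probabilities of 0.01 each (hypothetical values).
probs = np.full(200, 0.01)

direct_product = np.prod(probs)          # 1e-400 is below the float64 range
log_probability = np.sum(np.log(probs))  # 200 * log(0.01)

print(direct_product)   # 0.0
print(log_probability)
```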


Another problem is that log probabilities favor shorter sentences: every additional word contributes a non-positive term, so longer sentences receive lower scores regardless of how plausible they are. To correct for this, we can normalize the log probability by dividing it by the number of words in the sentence.
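A small sketch of the effect, assuming for illustration that every bi-gram has the same hypothetical probability of 0.1 and dividing by the number of terms rather than the exact word count:

```python
import numpy as np

# Every bi-gram gets the same hypothetical probability of 0.1.
short_logps = np.log(np.full(3, 0.1))   # sentence with 3 bi-grams
long_logps = np.log(np.full(12, 0.1))   # sentence with 12 bi-grams

# The raw sums differ only because of the sentence length.
print(short_logps.sum(), long_logps.sum())

# The per-term averages are the same for both sentences.
print(short_logps.mean(), long_logps.mean())
```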

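The count-based (maximum likelihood) estimates assign zero probability to every bi-gram that does not appear in the corpus. Instead, we can use a smoothed version of these estimates that reserves a small probability for unseen bi-grams. A common approach, known as add-one (Laplace) smoothing, is to add one to the count of each bi-gram before normalizing. With $V$ denoting the size of the vocabulary (the number of unique words), the estimate becomes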

$$
\hat{P}(w_t | w_{t - 1}) = \frac{\text{Count}(w_{t - 1}, w_t) + 1}{\text{Count}(w_{t - 1}) + V}
$$
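The following sketch applies add-one smoothing to a small matrix of made-up counts, where $V$ is the vocabulary size (3 in this example). It shows that unseen bi-grams receive a positive probability and that each row still sums to one.

```python
import numpy as np

# Hypothetical bi-gram counts for a vocabulary of V = 3 words.
# Rows index the first word, columns the second word.
counts = np.array([
    [0, 2, 1],
    [3, 0, 0],
    [0, 0, 0],  # this word never occurs as the first word of a bi-gram
])
V = counts.shape[0]

# Add one to every count; the denominator grows by V so that rows sum to one.
smoothed = (counts + 1) / (counts.sum(axis=1, keepdims=True) + V)

print(smoothed)
print(smoothed.sum(axis=1))
```

A first word without any observed successor (the last row) ends up with a uniform distribution over the vocabulary.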

## Implementation of the Bi-gram Model

Let's implement the bi-gram model in Python.

:::{.callout-important}
## SpaCy Installation

The following relies on the `spaCy` library and its English language model. You can install `spaCy` and the English language model by running the following commands in your Python environment:

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

or you can uncomment the commands in the next cell and run it.
:::

The bi-gram model relies on a matrix of bi-gram probabilities to estimate the probability of a sentence. We will create this matrix by counting the number of times each bi-gram appears in "Alice's Adventures in Wonderland" by Lewis Carroll.

The steps are as follows:

1. We will tokenize the text using `spaCy`. For convenience, we will write a small function that accepts a string and returns a list of sentences, each sentence being a list of tokens.
2. We will introduce two special tokens "BEGINNING" and "END" to mark the beginning and the end of each sentence.
3. We will count the number of times each bi-gram appears in the text, starting every count at one (the add-one smoothing described above).
4. We will normalize each row of counts to get the smoothed probabilities.
5. We will write a function that accepts a sentence and returns the log probability of the sentence.



In [6]:
# The following code uses the `!` operator to run shell commands in the notebook and % to run magic commands in the notebook
# This step is not necessary if you are running the notebook in Google Colab

# %pip install spacy
# !python -m spacy download en_core_web_sm

In [7]:
import numpy as np

# Import nltk in order to download the gutenberg corpus which contains a large collection of text data, including our Alice in Wonderland example

import nltk
nltk.download("gutenberg")

from nltk.corpus import gutenberg

# Load spaCy and the English model (in case of errors check if you have downloaded the model)
import spacy
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package gutenberg to /home/amarov/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [8]:
# Load the text of Alice in Wonderland

alice = gutenberg.raw(fileids="carroll-alice.txt")

# Check the first few characters of the text
alice[:1000]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seeme

Next, we will pass the whole string that contains the book to the `nlp` object. It returns an object of class `Doc` that we can use to extract sentences and tokens. We will use the `Doc.sents` generator to iterate over the sentences in the book and save them in a list. See the examples below.

In [9]:
# First we will pass the whole text through spacy's pipeline

doc = nlp(alice)
type(doc)

spacy.tokens.doc.Doc

In [None]:
# Have a look at the first 10 tokens (lowercased) and a couple of their properties 
# POS = Part of Speech
# is_stop = is the token a stop word?
# is_punct = is the token punctuation?
# is_space = is the token a space?

print(f"{'Token':10} {'POS':15} {'is_stop':10} {'is_punct':10} {'is_space':10}")

for token in doc[0:10]:
    print(f"{token.lower_:10} {token.pos_:10} {token.is_stop:10} {token.is_punct:10} {token.is_space:10}")

Token      POS             is_stop    is_punct   is_space  
[          X                   0          1          0
alice      PROPN               0          0          0
's         PART                1          0          0
adventures PROPN               0          0          0
in         ADP                 1          0          0
wonderland PROPN               0          0          0
by         ADP                 1          0          0
lewis      PROPN               0          0          0
carroll    PROPN               0          0          0
1865       NUM                 0          0          0


In [None]:
# You can iterate over the first few sentences in the document

sent_n = 0

for sent in doc.sents:
    print(sent)
    sent_n += 1
    if sent_n == 5:
        break

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversation?'


So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.


There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!

Oh dear!
I shall be late!'


In [None]:
# Here we define a function that tokenizes the text into sentences and words
# It removes punctuation and spaces and returns a list of sentences (each a list of words)
# together with the word-to-index and index-to-word dictionaries

def tokenize_doc(text: str) -> tuple:
    # Create an empty list to store the sentences
    sentences = []

    # Pass the text through spacy's pipeline
    text_doc = nlp(text)
    
    # Create a dictionary to store the word to index mapping
    word2idx = {
        "BEGINNING": 0,
        "END": 1
    }

    # Create a dictionary to store the index to word mapping
    idx2word = {
        0: "BEGINNING",
        1: "END"
    }
    
    # Iterate over the sentences in the text
    for sentence in text_doc.sents:
        # For each sentence, create a list to store the tokens
        # The first token is the "BEGINNING" token (beginning of the sentence)

        tokens = ["BEGINNING"]
        
        # Iterate over the tokens in the sentence
        for token in sentence:
            # Omit spaces and punctuation
            if token.is_space or token.is_punct:
                continue

            # Lowercase the token
            token_normalized = token.lower_ 

            # Append the lowercased token to the list of tokens
            tokens.append(token_normalized)
            
            # If the token is not in the word2idx dictionary, add it
            if token_normalized not in word2idx:
                # The indices of the tokens must be unique, 
                # so taking the number of entries in the word2idx dictionary will give us the next index            
                idx = len(word2idx)

                # Add the token to the word2idx and idx2word dictionaries
                word2idx[token_normalized] = idx
                idx2word[idx] = token_normalized
        
        # Here we have already finished iterating over the tokens in the sentence
        # so we append the "END" (end of sentence) token to the list of tokens
        tokens.append("END")
        # Append the list of tokens to the list of sentences
        sentences.append(tokens)

    return sentences, word2idx, idx2word


In [28]:
# Apply the function to a sample string. 

sample_sents, sample_word2idx, sample_idx2word = tokenize_doc("This is a sample sentence. This is another sample sentence.")

In [29]:
sample_sents

[['BEGINNING', 'this', 'is', 'a', 'sample', 'sentence', 'END'],
 ['BEGINNING', 'this', 'is', 'another', 'sample', 'sentence', 'END']]

In [30]:
sample_word2idx

{'BEGINNING': 0,
 'END': 1,
 'this': 2,
 'is': 3,
 'a': 4,
 'sample': 5,
 'sentence': 6,
 'another': 7}

In [31]:
sample_idx2word

{0: 'BEGINNING',
 1: 'END',
 2: 'this',
 3: 'is',
 4: 'a',
 5: 'sample',
 6: 'sentence',
 7: 'another'}

Now we are ready to start counting the bi-grams. We will use a matrix to store the counts. The rows of the matrix correspond to the first word of the bi-gram, and the columns correspond to the second word. The matrix is square because both its rows and its columns are indexed by the same vocabulary: the unique words in the text plus the two special tokens.

In [32]:
# Now we can create a V x V matrix where V is the size of the vocabulary (number of unique tokens)

def compute_bigrams_prob_mtx(sentences, word2idx: dict):
    
    # Get the vocabulary size (this is the number of unique tokens in the text plus the "BEGINNING" and "END" tokens)
    V = len(word2idx)
    
    # Let's first create a matrix of counts
    # We will initialize it with ones to avoid zero probabilities (this is the smoothing we mentioned earlier)
    bigram_counts = np.ones((V, V))

    # Now let us loop over all sentences, extract the bi-grams and count their occurrences
    # Each time we encounter the sequence "is strong", for example, we increment the entry at the
    # row index of the first word (is) and the column index of the second word (strong)

    for sent in sentences:
        for i, word in enumerate(sent):
            # Skip the first word to avoid indexing errors as we have
            # no word before the start of sentence token "BEGINNING"
            if i == 0:            
                continue
            
            # Here i is greater than 0, so we can get the previous word by 
            # subtracting 1 from the index
            first_word = sent[i - 1]
            
            # We will use the word2idx dictionary to get the index of the word
            first_word_idx = word2idx[first_word]
            second_word_idx = word2idx[word]

            # We will use the index to increment the count of the word
            bigram_counts[first_word_idx, second_word_idx] += 1
    
    # Now we can divide each row by the sum of the row to get the probabilities
    
    BGP = bigram_counts / bigram_counts.sum(axis=1, keepdims=True)
    return BGP

In [34]:
# Apply the function to the sample sentences

sample_bigram_prob_mtx = compute_bigrams_prob_mtx(sample_sents, sample_word2idx)
sample_bigram_prob_mtx

array([[0.1       , 0.1       , 0.3       , 0.1       , 0.1       ,
        0.1       , 0.1       , 0.1       ],
       [0.125     , 0.125     , 0.125     , 0.125     , 0.125     ,
        0.125     , 0.125     , 0.125     ],
       [0.1       , 0.1       , 0.1       , 0.3       , 0.1       ,
        0.1       , 0.1       , 0.1       ],
       [0.1       , 0.1       , 0.1       , 0.1       , 0.2       ,
        0.1       , 0.1       , 0.2       ],
       [0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111,
        0.22222222, 0.11111111, 0.11111111],
       [0.1       , 0.1       , 0.1       , 0.1       , 0.1       ,
        0.1       , 0.3       , 0.1       ],
       [0.1       , 0.3       , 0.1       , 0.1       , 0.1       ,
        0.1       , 0.1       , 0.1       ],
       [0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111,
        0.22222222, 0.11111111, 0.11111111]])
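The entries of this matrix can be checked by hand. The sketch below rebuilds it from the two sample sentences and the index mapping shown earlier, then verifies two properties: every row sums to one, and the probability of "is" following "this" is $3 / 10$ (two observed occurrences plus one from smoothing, divided by the row total of $2 + 8$). The names `counts` and `probs` are introduced here for illustration only.

```python
import numpy as np

sample_sents = [
    ["BEGINNING", "this", "is", "a", "sample", "sentence", "END"],
    ["BEGINNING", "this", "is", "another", "sample", "sentence", "END"],
]
word2idx = {"BEGINNING": 0, "END": 1, "this": 2, "is": 3,
            "a": 4, "sample": 5, "sentence": 6, "another": 7}

V = len(word2idx)
counts = np.ones((V, V))  # start at one: add-one smoothing
for sent in sample_sents:
    for prev, curr in zip(sent[:-1], sent[1:]):
        counts[word2idx[prev], word2idx[curr]] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

# Each row is a probability distribution over the next word.
print(np.allclose(probs.sum(axis=1), 1.0))  # True

# P("is" | "this"): row of "this", column of "is".
print(probs[word2idx["this"], word2idx["is"]])
```

The row of "END" is uniform (0.125 everywhere) because no word ever follows the end-of-sentence token, so all of its counts come from smoothing.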

## Sentence Scores

Once we have obtained the bi-gram probabilities, we can use them to estimate the probability of a sentence. We will write a function that accepts a sentence and returns the log probability of the sentence. The function iterates over the words in the sentence and sums the log probabilities of the bi-grams. It also normalizes the log probability by dividing by the number of tokens in the sentence, a count that includes the two special tokens "BEGINNING" and "END".



In [35]:
def score_sentence_str(sentence_str: str, word2idx: dict, bigram_probs: np.ndarray):
    # First we tokenize the sentence
    sents, _, _ = tokenize_doc(sentence_str)
    
    # Set the sentence score to zero initially
    sentence_score = 0
    
    # As our tokenize_doc function returns a list of sentences, we will only have one sentence
    # which is the first element of the list

    words = sents[0]
    
    for i, word in enumerate(words):
        if i == 0:
            continue
        
        try:
            first_word_idx = word2idx[words[i - 1]]
        except KeyError:
            raise KeyError(f"Word {words[i - 1]} not in vocabulary")
        
        try:
            second_word_idx = word2idx[word]
        except KeyError:
            raise KeyError(f"Word {word} not in vocabulary")
        
        # Get the log probability of the bigram and add it to the sentence score
        sentence_score += np.log(bigram_probs[first_word_idx, second_word_idx])
        
    # Normalize the score by the number of tokens in the sentence (this count includes BEGINNING and END)
    return sentence_score / len(words)


In [41]:
# Now let us run the whole thing

alice_sentences, alice_word2idx, alice_idx2word = tokenize_doc(alice)
BGP = compute_bigrams_prob_mtx(alice_sentences, alice_word2idx)

In [42]:
BGP.shape

(2680, 2680)

In [48]:
second_sentence_list = alice_sentences[32]
# second_sentence
second_sentence_list

['BEGINNING',
 'no',
 'it',
 "'ll",
 'never',
 'do',
 'to',
 'ask',
 'perhaps',
 'i',
 'shall',
 'see',
 'it',
 'written',
 'up',
 'somewhere',
 'END']

In [55]:
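# Caution: the joined string still contains "BEGINNING" and "END"; tokenize_doc lowercases them
# and scores them as ordinary words. Use second_sentence_list[1:-1] to exclude the markers.
second_sentence_str = " ".join(second_sentence_list)
score_sentence_str(second_sentence_str, alice_word2idx, BGP)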

-5.7696548515779575

In [None]:
# Let's try it out with a valid sentence that is not in the text

score_sentence_str("Be nice, everybody else is already taken.", alice_word2idx, BGP)

-7.0020184778362875

In [63]:
# Let's try it out with some nonsense sentence (note that the words must be in the vocabulary)

score_sentence_str("Rude foot fun ran egg.", alice_word2idx, BGP)

-6.734067793628513
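Because `score_sentence_str` raises a `KeyError` for words outside the vocabulary, it can help to check a sentence against the vocabulary before scoring it. The helper `find_unknown_words` below is a hypothetical addition, not part of the notebook, and it assumes that the sentence has already been tokenized and lowercased. The tiny vocabulary is a stand-in for `alice_word2idx`.

```python
def find_unknown_words(words, word2idx):
    """Return the words that are missing from the vocabulary, in order."""
    return [word for word in words if word not in word2idx]


# Tiny stand-in vocabulary (an assumption for this sketch).
toy_word2idx = {"BEGINNING": 0, "END": 1, "rude": 2, "foot": 3, "egg": 4}

print(find_unknown_words(["rude", "foot", "unicorn", "egg"], toy_word2idx))
# ['unicorn']
```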