# A Probabilistic Language Model




A very simple model for a natural language is the Markov bi-gram model.

Bi-grams are sequences of two consecutive words.

```
In another moment down went Alice after it, never once considering how in the world she was to get out again. 
```

The bi-grams of the sentence are

1. "In another"
2. "another moment"
3. "moment down"
4. ...
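
As a minimal sketch of the idea (assuming we simply split on whitespace after removing the punctuation, which is cruder than the tokenization used later in this notebook), the bi-grams of the example sentence can be listed by pairing each word with its successor:

```python
# The example sentence from above, with punctuation removed for simplicity
sentence = (
    "In another moment down went Alice after it never once considering "
    "how in the world she was to get out again"
)
words = sentence.split()

# Pair each word with its successor: (w_1, w_2), (w_2, w_3), ...
bigrams = list(zip(words, words[1:]))

print(len(words), len(bigrams))
print(bigrams[:3])
```

A sentence with $n$ words always yields $n - 1$ bi-grams.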

The bi-gram model is a probabilistic language model based on the probabilities of bi-grams. In this model, the probability of a bi-gram is the conditional probability of the second word given the first word.

$$
P(\text{another} | \text{in}) \\
P(\text{moment} | \text{another}) \\
P(\text{down} | \text{moment})
$$

Applications of the bi-gram model include
- Speech recognition
- Optical character recognition
- Spelling and grammar correction, nonsense detection (low probability sequences)
- Machine translation

# The Bi-gram model

In most languages, the units of meaning are sentences (sequences of words and punctuation). An interesting question to ask is: what is the probability of seeing a specific sentence in a corpus of text?

Take the following quote from Oscar Wilde as an example:

```
Be yourself, everyone else is already taken.
```

Ignoring the punctuation, there are 7 words in this sentence: $w_1$ to $w_7$.

$w_1$ = "Be"
$w_2$ = "yourself"
$w_3$ = "everyone"
...
$w_7$ = "taken"

We could write the probability of this sentence as

$$
P(w_7, w_6, w_5, w_4, w_3, w_2, w_1)
$$

and (in principle) we could estimate this probability by counting the number of times this sentence occurs in a large corpus of text. However, this may not work well because the number of times a whole sentence (or any other long sequence of tokens) occurs in a corpus is likely to be very small or zero.

Let's look into a way to model this probability in a more tractable way.

Using the chain rule of probability, we can write the probability of a sentence as the product of the probabilities of each word given the preceding words. With the 7 words in the sentence, we can write

$$
\begin{align*}
P(w_7, w_6, w_5, w_4, w_3, w_2, w_1) & = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6, w_5, w_4, w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5, w_4, w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4, w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4 | w_3, w_2, w_1) \cdot P(w_3, w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4 | w_3, w_2, w_1) \cdot P(w_3 | w_2, w_1) \cdot P(w_2, w_1) \\
& = P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) \cdot P(w_6 | w_5, w_4, w_3, w_2, w_1) \cdot P(w_5 | w_4, w_3, w_2, w_1) \cdot P(w_4 | w_3, w_2, w_1) \cdot P(w_3 | w_2, w_1) \cdot P(w_2 | w_1) \cdot P(w_1)
\end{align*}
$$

These probabilities are hardly more tractable than the probability of the whole sentence, so we need to make an assumption to simplify the model. The strictest assumption that we can make is that the conditional probability of a word only depends on the preceding word. This is called the **Markov assumption**. With this assumption, the probability of the 7-th word given the preceding words can be written as

$$
P(w_7 | w_6, w_5, w_4, w_3, w_2, w_1) = P(w_7 | w_6)
$$

Analogously, we can write the probability of the sentence as

$$
P(w_7, w_6, w_5, w_4, w_3, w_2, w_1) = P(w_7 | w_6) \cdot P(w_6 | w_5) \cdot P(w_5 | w_4) \cdot P(w_4 | w_3) \cdot P(w_3 | w_2) \cdot P(w_2 | w_1) \cdot P(w_1)
$$

The probabilities of the bi-grams (the probability of seeing the second word given the first word) are much easier to estimate than the probability of a whole sentence. We can estimate these probabilities by counting the number of times a word $w_t$ follows a word $w_{t - 1}$ in a corpus of text.

$$
\hat{P}(w_t | w_{t - 1}) = \frac{\text{Count}(w_{t - 1}, w_t)}{\text{Count}(w_{t - 1})}
$$
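
The counting estimate can be sketched on a tiny, made-up corpus. The corpus and the helper `bigram_prob` below are hypothetical and only serve as an illustration. Note that in this sketch $\text{Count}(w_{t - 1})$ is taken as the number of times a word appears as the first element of a bi-gram, so that the estimates for a given first word sum to one.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "dog", "sat"],
]

bigram_counts = Counter()
first_word_counts = Counter()

for sentence in corpus:
    for first, second in zip(sentence, sentence[1:]):
        bigram_counts[(first, second)] += 1
        first_word_counts[first] += 1


def bigram_prob(first: str, second: str) -> float:
    """Relative-frequency estimate of P(second | first)."""
    return bigram_counts[(first, second)] / first_word_counts[first]


print(bigram_prob("the", "cat"))  # "the" is followed by "cat" in 2 of 3 cases
print(bigram_prob("cat", "sat"))  # "cat" is followed by "sat" in 1 of 2 cases
```

This plain estimate fails with a division by zero for a first word that never occurs in the corpus, and it assigns probability zero to any unseen bi-gram.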

More generally, the probability of a sentence with $T$ words can be written as

$$
P(w_T, w_{T - 1}, \ldots, w_1) = P(w_1) \prod_{t = 2}^{T} P(w_t | w_{t - 1})
$$

As these probabilities would tend to be small, multiplying them in long sequences may lead to underflow problems because computer precision is limited. A common solution to this problem is to use the log-probabilities instead of the probabilities.


$$
\log P(w_T, w_{T - 1}, \ldots, w_1) = \log P(w_1) +  \sum_{t = 2}^{T} \log P(w_t | w_{t - 1})
$$
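
A small numerical illustration of the underflow problem, using 100 made-up probabilities of $10^{-4}$ each (the numbers are an assumption chosen to make the effect visible):

```python
import numpy as np

# 100 hypothetical bi-gram probabilities, each equal to 1e-4
probs = np.full(100, 1e-4)

# The true product is 1e-400, which is below the smallest positive float64
direct_product = np.prod(probs)

# The sum of the logs is an ordinary number: 100 * log(1e-4)
log_prob = np.sum(np.log(probs))

print(direct_product)
print(log_prob)
```

The direct product is rounded to zero, while the sum of log-probabilities stays well within the range of floating point numbers.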


Another problem is that the log probabilities of sentences will be biased toward shorter sentences, simply because there are fewer terms in a short sentence. To solve this problem, we can normalize the log probabilities by dividing by the number of words in the sentence.
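
To see the length bias, consider two made-up sentences whose bi-grams all have the same probability of 0.1 but which differ in length. This sketch divides by the number of terms in the sum; dividing by the number of words differs only slightly for longer sentences.

```python
import numpy as np

# Hypothetical per-bi-gram probabilities for a short and a long sentence
sentence_probs = {
    "short": np.full(5, 0.1),
    "long": np.full(20, 0.1),
}

scores = {}
for name, probs in sentence_probs.items():
    total = np.log(probs).sum()      # raw log-probability
    per_term = total / len(probs)    # length-normalized log-probability
    scores[name] = (total, per_term)
    print(name, total, per_term)
```

The raw log-probability of the long sentence is four times as negative as that of the short one, although each individual bi-gram is equally likely. After normalization both sentences receive the same score.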

In [1]:
import numpy as np
import nltk
nltk.download("gutenberg")
from nltk.corpus import gutenberg
import spacy

nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package gutenberg to /home/amarov/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [2]:
gutenberg.fileids()
alice = gutenberg.raw(fileids="carroll-alice.txt")

alice[:1000]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seeme

In [3]:
# First we will pass the first 200 characters of the text through spacy's pipeline

doc = nlp(alice[:200])
type(doc)

spacy.tokens.doc.Doc

In [4]:
for token in doc:
    print(f"{token.lower_:10} {token.pos_}")

[          X
alice      PROPN
's         PART
adventures PROPN
in         ADP
wonderland PROPN
by         ADP
lewis      PROPN
carroll    PROPN
1865       NUM
]          PUNCT


         SPACE
chapter    NOUN
i.         PROPN
down       ADP
the        DET
rabbit     PROPN
-          PUNCT
hole       PROPN


         SPACE
alice      PROPN
was        AUX
beginning  VERB
to         PART
get        VERB
very       ADV
tired      ADJ
of         ADP
sitting    VERB
by         ADP
her        PRON
sister     NOUN
on         ADP
the        DET

          SPACE
bank       NOUN
,          PUNCT
and        CCONJ
of         ADP
having     VERB
nothing    PRON
to         PART
do         VERB
:          PUNCT
once       ADV


In [5]:
for sent in doc.sents:
    print(sent)

[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once


The result is an object of class `Doc` that we can iterate over as a sequence of sentences and tokens. We will use the `sents` property of the `Doc` to iterate over the sentences and save them in a list. Next, we will create a small function to tokenize the sentences and remove the punctuation and space tokens. It will also create a word-to-index dictionary and an index-to-word dictionary.

In [6]:
def tokenize_doc(text: str):
    sentences = []
    text_doc = nlp(text)
    
    word2idx = {
        "BEGINNING": 0,
        "END": 1
    }
    idx2word = {
        0: "BEGINNING",
        1: "END"
    }
    
    # Each sentence becomes a list of lowercase tokens wrapped in BEGINNING/END markers
    for sentence in text_doc.sents:
        tokens = ["BEGINNING"]
        
        for token in sentence:            
            if token.is_space or token.is_punct:
                continue
            token_normalized = token.lower_ 
            tokens.append(token_normalized)
            
            if token_normalized not in word2idx:
                idx = len(word2idx)
                word2idx[token_normalized] = idx
                idx2word[idx] = token_normalized
        
        tokens.append("END")
        sentences.append(tokens)

    return sentences, word2idx, idx2word

tmp_, tmp_word2idx, tmp_idx2word = tokenize_doc(alice[0:1111])
len(tmp_word2idx)

129

In [7]:
# Now we can create a V x V matrix where V is the size of the vocabulary

def compute_bigrams_prob_mtx(sentences, word2idx: dict, smoothing: float = 1.0):
    # Get the vocabulary size
    vocab_size = len(word2idx)
    
    # Let's first create a matrix of counts
    BGC = np.ones((vocab_size, vocab_size)) * smoothing

    # Now let us loop over all sentences, extract the bi-grams and count their occurrences
    # Each time we encounter the sequence "is strong" for example, we will increment the count of the
    # Row index of the first word and the column index of the second word
    for sent in sentences:
        for i, word in enumerate(sent):
            if i == 0:
                continue
                
            first_word = sent[i - 1]
            
            # We will use the word2idx dictionary to get the index of the word
            first_word_idx = word2idx[first_word]
            second_word_idx = word2idx[word]
            # We will use the index to increment the count of the word
            BGC[first_word_idx, second_word_idx] += 1
    
    # Now we can normalize the counts to get the probabilities
    
    BGP = BGC / BGC.sum(axis=1, keepdims=True)
    return BGP
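
The `smoothing` argument above adds a constant pseudo-count to every cell of the count matrix before normalizing, so that bi-grams that never occur in the corpus still receive a small non-zero probability (otherwise their log-probability would be minus infinity). A minimal sketch with a made-up 3-word count matrix:

```python
import numpy as np

# Hypothetical bi-gram counts for a 3-word vocabulary
# (rows: first word, columns: second word)
counts = np.array([
    [0.0, 2.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],   # this word was never seen as a first word
])

smoothing = 1.0
smoothed = counts + smoothing

# Normalize each row so that it sums to one
probs = smoothed / smoothed.sum(axis=1, keepdims=True)

print(probs)
```

The first row becomes `[0.2, 0.6, 0.2]`, and the row of the unseen word becomes a uniform distribution instead of causing a division by zero.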

In [8]:
def score_sentence_str(sentence_str: str, word2idx: dict, bigram_probs: np.ndarray):
    # First we tokenize the sentence and remove the punctuation and spaces tokens
    sents, _, _ = tokenize_doc(sentence_str)
    
    sentence_score = 0
    
    words = sents[0]
    
    for i, word in enumerate(words):
        if i == 0:
            continue
        
        try:
            first_word_idx = word2idx[words[i - 1]]
        except KeyError:
            raise KeyError(f"Word {words[i - 1]} not in vocabulary")
        
        try:
            second_word_idx = word2idx[word]
        except KeyError:
            raise KeyError(f"Word {word} not in vocabulary")
                
        sentence_score += np.log(bigram_probs[first_word_idx, second_word_idx])
        
    return sentence_score / len(words)


In [9]:
# Now let us run the whole thing

sentences, word2idx, idx2word = tokenize_doc(alice)
BGP = compute_bigrams_prob_mtx(sentences, word2idx)

In [10]:
BGP[0:2, 0:2]

array([[0.00023596, 0.0261916 ],
       [0.00037313, 0.00037313]])

In [11]:
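second_sentence = sentences[32]
# Note: sentences[32] already contains the BEGINNING and END markers. After joining,
# tokenize_doc lowercases them to the ordinary words "beginning" and "end" and adds
# its own markers, so the scored token sequence includes these two extra words.
score_sentence_str(" ".join(second_sentence), word2idx, BGP)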

-5.7696548515779575

In [12]:
sentences[32]

['BEGINNING',
 'no',
 'it',
 "'ll",
 'never',
 'do',
 'to',
 'ask',
 'perhaps',
 'i',
 'shall',
 'see',
 'it',
 'written',
 'up',
 'somewhere',
 'END']

In [13]:
# Let's try it out with a valid sentence
score_sentence_str("Be nice, everybody else is already taken.", word2idx, BGP)

-7.0020184778362875

In [14]:
score_sentence_str("Rude foot fun egg.", word2idx, BGP)

-6.539825013894565

# Logistic Regression Model

Instead of counting the number of times a word follows another word in the corpus, we can use a logistic regression model to estimate the probability of a word given the preceding word. The logistic regression model will learn a vector of weights for each word in the vocabulary (one row of a $V \times V$ weight matrix). The probabilities of all possible next words given the preceding word are obtained by applying the softmax function to the weight vector of the preceding word.

First we need to map the words to numbers, because the logistic regression model operates on matrices of numbers. What we can do is create a vocabulary (all unique words in our corpus) and represent each word with a $V$-dimensional vector, where $V$ is the size of the vocabulary. The vector will have a 1 at the index of the word and zeros everywhere else. This is called a one-hot encoding.

For example, if our vocabulary is

```
["be", "yourself", "everyone", "else", "is", "already", "taken"]
```

then the one-hot encoding of "yourself" will be

```
[0, 1, 0, 0, 0, 0, 0]
```

the one-hot encoding of "is" will be

```
[0, 0, 0, 0, 1, 0, 0]
```

Let's create a function to create these one-hot encodings.


In [15]:
# This will return a one-hot encoded vector with 1 at the index of idx
def one_hot_encode_word(idx: int, vocab_size: int):
    v = np.zeros(vocab_size)
    v[idx] = 1
    return v

In [16]:
# It is convenient to have a function that takes the tokenized sentences and returns the word indices
def text_to_indexed_sentences(sentences: list, word2idx: dict):
    sentences_with_idx = []
    
    for sentence in sentences:
        sentence_with_idx = []
        
        for word in sentence:
            idx = word2idx[word]    
            sentence_with_idx.append(idx)
            
        sentences_with_idx.append(sentence_with_idx)
    
    return sentences_with_idx

In [17]:
sample_sentences, sample_word2idx, sample_idx2word = tokenize_doc("Hello, my name is John. What is your name?")
sample_sentences[0]

['BEGINNING', 'hello', 'my', 'name', 'is', 'john', 'END']

In [18]:

sample_alice_sentences_idx = text_to_indexed_sentences(sample_sentences, sample_word2idx)
sample_alice_sentences_idx[0]

[0, 2, 3, 4, 5, 6, 1]

Now we need to consider what the training data for our problem should look like. In a classification problem we normally have an $N \times K$ matrix $\mathbf{X}$, representing the $K$ features of $N$ observations.

$$
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \ldots & x_{1K} \\
x_{21} & x_{22} & \ldots & x_{2K} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \ldots & x_{NK} \\
\end{bmatrix}
$$

and an $N$-dimensional vector $\mathbf{y}$ representing the labels of the $N$ observations. It is convenient to represent the labels as one-hot encoded vectors. For example, with $C = 3$ classes, the labels will be an $N \times C$ matrix $\mathbf{Y}$ of one-hot encoded vectors.

$$
\mathbf{Y} = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
\vdots & \vdots & \vdots \\
0 & 1 & 0
\end{pmatrix}
$$

In the example above, the first observation belongs to the first class, the second observation belongs to the second class, the third observation belongs to the third class and the last observation belongs to the second class.

In our specific case the labels are the second words in the bi-grams and each word is represented by a one-hot encoded vector. So the labels will be an $N \times V$ matrix $\mathbf{Y}$ where $V$ is the size of the vocabulary.

The predictor matrix $\mathbf{X}$ will also be an $N \times V$ matrix. The $i$-th row of $\mathbf{X}$ will be the one-hot encoded vector of the first word in the $i$-th bi-gram.

$$
\mathbf{X} = \begin{pmatrix}
1 & 0 & 0 & \ldots & 0 \\
0 & 1 & 0 & \ldots & 0 \\
0 & 0 & 1 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 1 & 0 & \ldots & 0
\end{pmatrix}
$$

Using the example vocabulary above, you can read this matrix row by row: the first row encodes "be", the first word of the bi-gram "be yourself"; the second row encodes "yourself", the first word of the bi-gram "yourself everyone"; the third row encodes "everyone", the first word of the bi-gram "everyone else"; and so on.
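
As a sketch, here is how the two matrices could be built for a single indexed sentence, using the sample sentence `[0, 2, 3, 4, 5, 6, 1]` from above. The vocabulary size of 9 is an assumption for this example (the seven distinct words of the two sample sentences plus the two markers).

```python
import numpy as np

# Indexed sentence from the sample above: BEGINNING hello my name is john END
sentence_idx = [0, 2, 3, 4, 5, 6, 1]
vocab_size = 9  # assumed here; in practice use len(word2idx)

n_bigrams = len(sentence_idx) - 1

X = np.zeros((n_bigrams, vocab_size))
Y = np.zeros((n_bigrams, vocab_size))

# Row i of X encodes the first word of the i-th bi-gram,
# row i of Y encodes the second word of the same bi-gram
X[np.arange(n_bigrams), sentence_idx[:-1]] = 1
Y[np.arange(n_bigrams), sentence_idx[1:]] = 1

print(X.shape, Y.shape)
print(X.argmax(axis=1))  # indices of the first words
print(Y.argmax(axis=1))  # indices of the second words
```

Each row contains exactly one 1, and the targets are simply the inputs shifted by one position.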


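In binary logistic regression, the probability of the positive class is modeled with the sigmoid function

$$
p(y = 1 | \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}
$$

Here we have $V$ classes (one for each possible second word), so we use the multi-class generalization of the sigmoid, the softmax function. With a $V \times V$ weight matrix $W$ whose $j$-th column is $\mathbf{w}_j$, the predicted probability that the $i$-th observation belongs to class $j$ is

$$
\hat{y}_{ij} = \frac{\exp(\mathbf{w}_j^T \mathbf{x}_i)}{\sum_{k = 1}^{V} \exp(\mathbf{w}_k^T \mathbf{x}_i)}
$$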
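
Because each input is a one-hot vector, the product of an input with a $V \times V$ weight matrix simply selects one row of that matrix. A minimal sketch, assuming a small random weight matrix as a stand-in for trained weights and a softmax over the selected row (as in the training code further below):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5                                  # hypothetical small vocabulary
W = rng.normal(size=(vocab_size, vocab_size))   # random stand-in for trained weights

x = np.zeros(vocab_size)
x[2] = 1                                        # one-hot vector of the word with index 2

logits = x @ W                                  # identical to the row W[2]

# Softmax turns the selected row into a probability distribution over next words
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

print(np.allclose(logits, W[2]))
print(probs.sum())
```

In other words, row $i$ of the weight matrix holds the unnormalized scores of all possible next words after word $i$.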


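Using the cross-entropy loss

$$
J(W) = -\frac{1}{N}\sum_{i = 1}^{N} \sum_{j = 1}^{V} y_{ij} \log \hat{y}_{ij}
$$

where $\hat{y}_{ij}$ is the predicted probability of class $j$ for observation $i$, the gradient descent update rule for the weights is

$$
W^{\text{new}} = W^{\text{old}} - \eta \nabla_{W} J(W)
$$

where $\eta$ is the learning rate and the gradient of the loss with respect to the weights is

$$
\nabla_{W} J(W) = \frac{1}{N} X^T (\hat{Y} - Y)
$$

The constant factor $\frac{1}{N}$ only rescales the step, so it can be absorbed into the learning rate.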
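
The gradient formula can be checked numerically. The sketch below uses small random one-hot inputs and targets (made up for this check) and compares the analytic gradient of the softmax cross-entropy loss with a central finite-difference approximation. Because the loss above averages over the $N$ observations, the analytic gradient here includes the factor $\frac{1}{N}$.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def loss(W, X, Y):
    # Cross-entropy averaged over the N observations
    return -np.sum(Y * np.log(softmax(X @ W))) / X.shape[0]


rng = np.random.default_rng(0)
N, V = 6, 4
X = np.eye(V)[rng.integers(0, V, size=N)]   # random one-hot inputs
Y = np.eye(V)[rng.integers(0, V, size=N)]   # random one-hot targets
W = rng.normal(size=(V, V))

analytic = X.T @ (softmax(X @ W) - Y) / N

# Central finite differences, one weight at a time
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(V):
    for j in range(V):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss(W_plus, X, Y) - loss(W_minus, X, Y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The largest difference between the two gradients should be tiny compared with the size of the gradient entries.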



In [19]:
# Next we define the softmax function. It takes an np.array of shape (N, D) and returns an np.array of shape (N, D)
# where each row is the softmax of the corresponding row in the input array

def softmax(x):
    # Subtract the row-wise maximum before exponentiating to avoid overflow; this does not change the result
    exp_x = np.exp(x - x.max(axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)


def train_logistic(sentences: list[list[int]], vocab_size: int, learning_rate: float = 0.01, epochs: int = 100):    
    losses = []
    
    # Initialize weights
    W = np.random.randn(vocab_size, vocab_size) / np.sqrt(vocab_size)

    for epoch in range(epochs):
        # shuffle sentences at each epoch
        np.random.shuffle(sentences)
        
        j = 0 # keep track of iterations
        for sentence in sentences:
            # convert sentence into one-hot encoded inputs and targets
            
            # An example sentence has the form ["BEGINNING", "hello", "my", "name", "is", "john", "END"]
            # Only with the word indices instead of the words
            # It has n = 7 words and therefore n - 1 = 6 bi-grams
            # So the inputs and targets matrices will each have the shape (n - 1, vocab_size)
            
            n = len(sentence)
            
            inputs = np.zeros((n - 1, vocab_size))
            targets = np.zeros((n - 1, vocab_size))
            inputs[np.arange(n - 1), sentence[:n-1]] = 1
            targets[np.arange(n - 1), sentence[1:]] = 1

            #print("Inputs matrix")
            #print(inputs)

            #print("Targets matrix")
            #print(targets)
            
            # Compute the predictions
            predictions = softmax(inputs.dot(W))
            
            # Perform a gradient descent update
            W = W - learning_rate * inputs.T.dot(predictions - targets)
            
            # Save the loss at each iteration (we don't use it here, but you may want to plot it later)
            loss = - np.sum(targets * np.log(predictions)) / (n - 1)
            losses.append(loss)     
            
            if j % 10 == 0:
                print("epoch:", epoch, "sentence: %s/%s" % (j, len(sentences)), "loss:", loss)
            j += 1
    
    return W, losses

In [20]:
full_sentences, full_word2idx, full_idx2word = tokenize_doc(alice)
full_sentences_idx = text_to_indexed_sentences(full_sentences, full_word2idx)

In [21]:
W, losses = train_logistic(full_sentences_idx, len(full_word2idx))

epoch: 0 sentence: 0/1558 loss: 7.897522305635123
epoch: 0 sentence: 10/1558 loss: 7.881285444011558
epoch: 0 sentence: 20/1558 loss: 7.894268054175027
epoch: 0 sentence: 30/1558 loss: 7.9014236352812315
epoch: 0 sentence: 40/1558 loss: 7.885451681302905
epoch: 0 sentence: 50/1558 loss: 7.850217431299732
epoch: 0 sentence: 60/1558 loss: 7.888500423235992
epoch: 0 sentence: 70/1558 loss: 7.8958181550691915
epoch: 0 sentence: 80/1558 loss: 7.898439318337972
epoch: 0 sentence: 90/1558 loss: 7.886170716644597
epoch: 0 sentence: 100/1558 loss: 7.9001266550996965
epoch: 0 sentence: 110/1558 loss: 7.897777072212502
epoch: 0 sentence: 120/1558 loss: 7.874714513968414
epoch: 0 sentence: 130/1558 loss: 7.895134820084948
epoch: 0 sentence: 140/1558 loss: 7.88157574950131
epoch: 0 sentence: 150/1558 loss: 7.9002320783536275
epoch: 0 sentence: 160/1558 loss: 7.879115143252103
epoch: 0 sentence: 170/1558 loss: 7.897435626651721
epoch: 0 sentence: 180/1558 loss: 7.880925049921342
epoch: 0 sentence: 1

KeyboardInterrupt: 