## Description

Understand how human languages are modelled for machines to understand

## Overview

The concept will introduce you to the language modeling concepts. In the concept you will learn

- Problem of Modeling Language

- Language Models

- N-grams

- Perplexity

- Smoothing Techniques

## Pre-requisite

Before you start learning this concept, be sure you have already covered

- Data wrangling with Pandas
- Manipulating Data with NumPy
- Summarizing Data with Statistics
- Foundations of Text Analytics


## Learning Outcomes

By the end of this concept, you will be able to do the following

- Understand why language modeling is hard

- Learn how to create language models using n-gram

- Understand how to evaluate language models using perplexity

- Learn the need for smoothing and how to implement it

# 1. Problem of Modeling Language

Description: In this chapter you will understand the difficulty in modeling languages with respect to machines

## 1.1 Difficulty of natural languages

**Difficulty of natural languages**

Consider you are designing a voice assistant like Siri(or Alexa) and you pass a voice command to it. The assistant outputs two sentences.

s1= "It's hard to recognize speech" 

s2= "It's hard to wreck a nice beach"

The above two sentences have the same sound signals. Sentence s1 is more likely but how do you tell the machine to have some intrinsic preference for one sentence over another?

Consider another scenario where machine has to understand the meaning of the following sentences:
- 7 foot doctors sue the hospital for negligence

- The woman had to make a toast with a very old microphone

- Andrew saw Max with a telescope

- Look at the dog with one eye

It's clear that the above sentences are ambigous and can have more than one intepretation.

`Natural languages`(Languages spoken by humans) can never be fully specified. Reason for that is natural languages are not designed;they emerge. 

Formal rules for language does exist but often while conversing natural language that does not confirm is used. It also involves many terms that can be used in ways that result in complex ambiguities. Furthermore, languages change and along with that word usages change. 

Despite all this, natural languages are understood by other humans.

Machines on the other hand usually work with `formal languages` that can be fully specified. All the words(terms) and rules for defining them are precisely defined.  

We humans are able to understand natural languages majorly due to `"context"`. 

We intrinsically know that for the sentence "7 foot doctors sue the hospital for negligence", the meaning "Seven doctors specialising in foot sue the hospital" `is more likely` than the meaning "Doctors who have seven foot sue the hospital". We know the first meaning is `more probable` than the second meaning.

Put simply, all we are doing is calculating probability of a sentence. Can't we have machines do that?

This is what led to the development of something called `Language Models`

**What is a Language model?**

A Language model is a probabilistic model which estimates the relative likelihood of sentences(sequence of words).

Language models were originally developed for the problem of speech recognition( and still play a central role in modern speech recognition systems). A language model learns the probability of word occurrence based on text examples we give. 

For simplicity, let's say it receives a sentence. The language model score for a sentence x is P(x) which is a score between 0 and 1, that can be interpreted as the probability of composing this sentence in English. 

Language models are able to capture some interesting language phenomena like the following: 

Which sentence is grammatically correct? - P("he eat pizza") < P("he eats pizza") 

Which word order is correct? - P("love I cats") < P("I love cats")

Even some logic and world knowledge: What is more likely? - P("good British food") < P("good Italian food")

In other words, language models generate score (probability) of a sentence, which will tell us whether the sentence is good or bad

**Applications of Language Models**

The probabilities returned by a language model are useful in many practical tasks, such as:

***

***Automatic Speech Recognition***: 

Speech recognition refers to the task of converting spoken words into written text.

It has input in the form of sounds; 

A first layer in it predicts the candidate words based on the sound.

For eg: Candidate words for a sound can be night, knight, right 


The language model(second layer) then helps in ranking the most likely sequence of words compatible with the candidate words produced by the first layer.

For eg: "I ate a cherry" is a more likely sentence than "Eye eight uh jerry"

![](spr.jpg)

***

***Machine Translation***: 

Machine Translation is the task of automatically translating one natural language into another while retaining the meaning of the original text. 

Each word from the source language is mapped to multiple candidate words of the target language; the language model of the target language then can rank the most likely sequence of candidate target words. This works because more likely sentences are probably better translations. 

![](mt_2.jpg)


For eg: Translating a sentence refering about employees who left, model would probably state that `P(former employee) > P(older employee)` as the ‘older’ might also refer to age of the employee and thus, not as probable as ‘former’


![](mt.png)

***

***Spell checking***: 


The task of spellchecking involves checking for spelling errors and possibly suggesting alternatives depending upon the context.

Spell checking is done by the machine when it observes a word which is not recognized as a `known word` (i.e. the word does not occur in a list of known words). It then finds the closest known words to the unknown words.

For eg: If someone writes `fomr`, the closest known words will be `from` and `form`. These are the candidate corrections. How can we select among these candidates the most likely correction for the error `fomr`?

We then compare the Language Model probability of the sentences:

- `P(name into form)`
- `P(name into from)`

and we hope that the right correction [name into form] will be selected.

![](lm.jpg)

***
Let's now see how we can go about creating a language model.

# 2. Statistical Language Modeling

Description: In this chapter, we will learn one of the most popular method of modeling language i.e. Statistical Language Modeling

## 2.1 N-grams

So we established that one way to model natural languagues is by measuring the probability of sentences. So our aim is to get the probability of a given sentence as being good, but before that let's first focus on getting the probability of given word as being the next word in a given sentence.

To build a language model, let's start with simple task of calculating probability $P(w|h)$, the probability of a word w given some history h. 

Suppose the history h is "<i>its lake is so clear that</i>" and we want to know the probability that the next word w is "the":


$$ P (\text{the} | \text{its lake is so transparent that})$$
<br/>

One of the ways to estimate this probability is from frequency counts.

Take a very large corpus(collection of written texts), count the number of times "<i>its lake is so clear that</i>" is present, and then count the number of times "<i>this</i>" is followed by "<i>the</i>". 

This would be similar to answering the question "<i>Out of the times we saw the history h, how many times was it followed by the word w</i>"

So the probability will be calculated as:

$$ P(\text{the}|\text{its lake is so clear that}) = \frac{C(\text{its lake is so clear that the})}{C(\text{its lake is so clear that})} $$

With a large enough corpus(***which internet is***), we can compute these counts and estimate the probability from above equation.

While this method of estimating probabilities directly from counts is intuitive and works fine in many cases, it turns out that even the web isn’t big enough to give us good estimates in most of the cases. That is because as mentioned before, language is creative and new things emerge; new sentences are created all the time, and we won’t always be able to count entire sentences. Even simple extensions of the example sentence may have counts of 0 (such as “Amsterdam’s lake is so clear that the”).

Additionally, if we wanted to know the probability(joint) of the entire sequence of words(which is what a language model has to do) in 'its lake is so clear', the question we need to solve is "out of all possible sequences of five words, how many of them are 'its lake is so clear'?" 

Mathematically, joint probability for a n-word sentence  will look something like:

$$ \begin{align} P(w_1^n) &= P(w_1).P(w_2 | w_1).P(w_3 | w_1^2).P(w_4 | w_1^3) \dots\dots P(w_n | w_1^{n-1})\\ &= \prod_{k=1}^{n}P(w_k|w_1^{k-1}) \end{align} $$


We have to find probability of a word as being the next word with this history among all the possible history. 


Applying that to our sentence we will get


$$ \begin{align} P(\text{its lake  is  so  clear that the})\end{align}$$

$$\begin{align} = P(\text{its}).P(\text{lake} | \text{its}).P(\text{is}| \text{its lake}).P(\text{so}| \text{its lake is}).P(\text{clear}|\text{its lake is so}).P(\text{that}|\text{its lake is so clear}).P(\text{the}|\text{its lake is so clear that})......\end{align} $$

This seems a lot of work doesn't it?

Conclusion: We need better ways of estimating the probability of a word w given a history h.

Construction of N-grams model is one of the solutions. 

**What is N-grams?**

N-grams are the simplest type of tool available to construct a language model. 

An N-gram is a sequence of N words.

*The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can **approximate** the history by just the last few words.*


The bigram(2- grams) model, for example, approximates the probability of a word w by using only the conditional probability of the preceding word $P(w_n |w_{n−1})$. Put in other way, instead of computing the probability 

$$P(\text{the}|\text{Amsterdam’s lake is so clear that})$$

we approximate it with just the probability:
$$P(\text{the}|\text{that})$$


Similar to bigram, we also have unigram(n=1), trigram(n=3), 4-grams and so on.

Following is an image explaining the difference:


![](ngram.jpg)


Let's try to understand n-grams better using an example.
 
 
Consider a corpus containing the following four sentences and we would like to find the probability that “You” starts the sentence.

$<s>$ You are a data scientist $</s>$

$<s>$ Data scientist you are $</s>$

$<s>$ You love statistics $</s>$


Here $<s>$ and $</s>$ denote the start and end of the sentence respectively.

**Note:** We need $<s>$ at the beginning of the sentence to get the bigram context of the first word. Similarly, we need the end-symbol $</s>$ to get the bigram context of the last word. 


Following will be the conditional probabilities of the words of corpus:

\begin{align}
& P(You|<s>) = \frac{2}{3} = .67  & \qquad &  P(Data|<s>) = \frac{1}{3} = .33  \\
& P(</s>|scientist) = \frac{1}{2} = .5  & \qquad &  P(</s>|are) = \frac{1}{2} = .5  \\
& P(</s>|statistics) = \frac{1}{1} = 1  & \qquad &  P(are|You) = \frac{2}{3} = .67  \\
& P(a|are) = \frac{1}{2} = .5  & \qquad &  P(data|a) = \frac{1}{1} = 1  \\
& P(scientist|data) = \frac{2}{2} = 1  & \qquad &  P(statistics|love) = \frac{1}{1}=1  \\
& P(love|you) = \frac{1}{3} = 0.33
\end{align}
    
Let's take the example of $P(are|You)$ above. First we calculate all possible bigrams in the corpus that is "<i>You are</i>". Then we calculate all possible bigrams in corpus with You(or you) as its first word.

In the above corpus we have 2 instances of "<i>You are</i>" and 3 total instances where you is first word in a bigram("<i>You are</i>" in first line, "<i>you are</i>" in second line and "<i>You love</i>" in third line). 

Therefore $P(are|You)=\frac{2}{3}$



When we use a bigram model to predict the probability, we are making the following approximation:
$$ P(w_n |w^{n−1}_1 ) \approx P(w_n |w_{n−1} ) $$ 

**Here $w_n$ is the $n_{th}$ word and $w^{n−1}_1$ is sequence of all n-1 words**

This assumption that the probability of a word depends only on the previous word is called a Markov assumption

**Markov Assumption**
***
A random process has the Markov property if we can predict the probability of  future states of the process without looking at every event in the past i.e next step depends only upon the present state and not on the sequence of events that preceded it. 

This helps in generalizing the bigram (which looks one word into the past) to trigram (which looks two words into the past) and ultimately to the n-gram (which looks n − 1 words into the past).
***


Markov assumption fits nicely with the n-gram model because of natural language's underlying property that in most of the cases, the probability of the word depends on its surrounding words.


Let's now look at how we can calculate probabilities of sentences.

# Task

- Test sentence tokens(Test sentence broken down to words) ,corpus and frequency counts(for unigram and bigram) is already given.(Print and see the values of the different values created)

- Function definition of `get_bigram_probability()` with parameters `first`,`second` is given.


- Inside the function:
    - Store the conditional frequency of the `first` and `second` term(`conditional_freq[first][second]`) in a variable called `'bigram_freq'`.

    - Store the frequency of the `first` term(`updated_uni_freq[first]`) in a variable called `'unigram_freq'`.

    - Return the value calculated by dividing `'bigram_freq'` with `'unigram_freq'`


- Create an empty list called `'prob_list'`. 


- Create a variable called `'previous'` and save the string `'*start_end*'` in it.(This will be our sentence beginner mark `<s>` that we encountered while learning bigram probabilities)


- Run a loop `for token in test_sentence_tokens`. Inside the loop 

    - Calculate the bigram probability by calling the function `"get_bigram_probability(previous, token)"` and store it in a variable called `'next_probability'`
    - Save the current `'token'` as `'previous'`
    - Append `'next_probability'` to `'prob_list'`
    
    
**Note:** Calculation of the final term is still left.

- Calculate the bigram prob. of the final term by calling the function `"get_bigram_probability()"` with `'previous'`(This will be store the final term after coming out of the loop) and `'*start_end*'`.


- Append the above calculated value to `'prob_list'`





In [1]:
import nltk
from nltk.corpus import brown

# Corpus
words = brown.words()
words=[w.lower() for w in words]

# Unigram frequency 
uni_freq = nltk.FreqDist(w.lower() for w in words)

# Size of corpus
total_words = len(words)

print('Frequency of tokens of the sample sentence:',total_words)

#Sentence 
test_sentence_tokens=['this','is','a','sunny','day','.']


for word in test_sentence_tokens:
    print('Frequency of "',word,'" is ',uni_freq[word])

print('\n\n')
    
# Creating bigrams

bigram_words = []
previous = 'EMPTY'
sentences = 0
for word in words:
    if previous in ['EMPTY','.','?','!']:
        ## insert word_boundaries at beginning of Brown,
        bigram_words.append('*start_end*')
    else:
        bigram_words.append(word)
    
    previous = word


    
    
bigram_words.append('*start_end*') ## assume one additional *start_end* at the end of Brown

updated_uni_freq  = nltk.FreqDist(w.lower() for w in bigram_words)


print('Calculating bigram probalities for sentence, including bigrams with sentence boundaries, i.e., *start_end*')


# Bigram corpus
bigrams = nltk.bigrams(w.lower() for w in bigram_words)


# Bigram probabilities
conditional_freq = nltk.ConditionalFreqDist(bigrams)



# Code begins here


# Function to calculate bigram probability
def get_bigram_probability(first,second):
    
    bigram_freq = conditional_freq[first][second]
    unigram_freq = updated_uni_freq[first]

    bigram_prob = (bigram_freq)/(unigram_freq)
    
    return bigram_prob

## Calculating the bigram probability

prob_list=[]
previous = '*start_end*'

for token in test_sentence_tokens:
    next_probability = get_bigram_probability(previous,token)
    print(previous,token,(float('%.3g' % next_probability)))
    previous = token
    prob_list.append(next_probability)


    
# For the final term    
next_probability = get_bigram_probability(previous,'*start_end*')
print(previous,'*start_end*',next_probability)
prob_list.append(next_probability)    

# print(prob_list)    



Frequency of tokens of the sample sentence: 1161192
Frequency of " this " is  5145
Frequency of " is " is  10109
Frequency of " a " is  23195
Frequency of " sunny " is  13
Frequency of " day " is  687
Frequency of " . " is  49346



Calculating bigram probalities for sentence, including bigrams with sentence boundaries, i.e., *start_end*
*start_end* this 0.0083
this is 0.0503
is a 0.0861
a sunny 4.51e-05
sunny day 0.154
day . 0.163
. *start_end* 1.0


# Hints

You can find the bigram probabilities by writing code similar to:

```python
prob_list=[]
previous = '*start_end*'
for token in test_sentence_tokens:
    next_probability = get_bigram_probability(previous,token)
    print(previous,token,(float('%.3g' % next_probability)))
    previous = token
    prob_list.append(next_probability)

    
# For the final term    
next_probability = get_bigram_probability(previous,'*start_end*')
print(previous,'*start_end*',next_probability)
prob_list.append(next_probability)    

print(prob_list)    
```

# Test Cases

#prob_list


Variable declaration


round(prob_list[1],2)==0.05

# Success Message

Congrats! You have successfully found out bigram probabilities of the sentence tokens

## 2.2 Language model using n-gram

In the previous topic, we constructed an N-gram model(bigram model in our case) for words, how will we now obtain a complete language model on the basis of this ?

Simple, we will calculate the `joint probability` by multiplying the respective n-gram probabilites.

Applying that to our sentence(from the previous chapter) we see that the following calculation: 


$$ \begin{align} P(\text{its lake  is  so  clear that the})\end{align}$$

$$\begin{align} = P(\text{its}).P(\text{lake} | \text{its}).P(\text{is}| \text{its lake}).P(\text{so}| \text{its lake is}).P(\text{clear}|\text{its lake is so}).P(\text{that}|\text{its lake is so clear}).P(\text{the}|\text{its lake is so clear that})\end{align}..... $$

gets reduced to

$$ \begin{align} P(\text{its lake  is  so  clear that the})\end{align}$$
$$\begin{align} = P(\text{its}|P\text{<s>}).P(\text{lake} | \text{its}).P(\text{is}| \text{lake}).P(\text{so}| \text{is}).P(\text{clear}|\text{so}).P(\text{that}|\text{clear}).P(\text{the}|\text{that})\end{align}.... $$


***
**Deep Dive(Optional)**

*Mathematical representation of how joint probability is calculated:*


Joint-probability of n-word sentence is:

$$ \begin{align} P(w_1^n) &= P(w_1).P(w_2 | w_1).P(w_3 | w_1^2).P(w_4 | w_1^3) \dots\dots P(w_n | w_1^{n-1})\\
 &= \prod_{k=1}^{n}P(w_k|w_1^{k-1}) \end{align} $$

Using Markov's assumption, we know

$$ P(w_n |w^{n−1}_1 ) \approx P(w_n |w_{n−1} ) $$


That will help obtain the following approximation:

\begin{align} P(w_1^n) = \prod_{n=1}^{n}P(w_k|w_{k-1}) \end{align}


***

Let us now try to understand how language model work better using a bigram example:

<br>

<center>Given below is the count of random 8 words(out of 100 distinct words) from a food delivery app(Also known as the unigram table)</center>


|i|want|to|eat|italian|food|lunch|breakfast
|-----|-----|-----|-----|-----|-----|-----|-----|
|2532|928|2419|746|158|1012|342|277|

<br><br>

<center>Following is the bigram table for the same words</center> 

<br>

|-|i|want|to|eat|italian|food|lunch|buy
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|i|6|818|0|9|0|0|0|2|
|want|2|0|608|1|6|7|6|1|
|to|2|0|4|685|2|0|6|211|
|eat|0|0|2|0|15|3|42|0|
|italian|1|0|0|0|0|82|1|0|
|food|14|0|15|0|1|5|0|0|
|lunch|2|0|0|0|0|0|0|1|
|buy|1|0|0|0|0|14|0|0|

<br>
    
<center>Here row word is the first word and column word is the second word.</center>


<center>For example "<i>i want</i>" bigram appears 818 times in the corpus, "<i>eat italian</i>" bigram appears 15 times in the corpus.</center> 


You can make a lot of interesting observations when comparing unigram table with the bigram.

For e.g. Out of the 928 times the word `want` appears, 818 times it appears after the word `I`. 


<br><br>
<center>After calculating the bigram probability($P(w_n |w_{n−1} )$) we get the following table</center>
<br>

|-|i|want|to|eat|italian|food|lunch|buy
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|i|0.002|0.32|0|0.003|0|0|0|0.0007|
|want|0.002|0|0.65|0.001|0.006|0.007|0.006|0.001|
|to|0.0008|0|0.001|0.28|0.0008|0|0.002|0.08|
|eat|0|0|0.002|0|0.02|0.004|0.05|0|
|italian|0.006|0|0|0|0|0.51|0.006|0|
|food|0.01|0|0.01|0|0.0009|0.01|0|0|
|lunch|0.005|0|0|0|0|0|0|0.002|
|breakfast|0.004|0|0|0|0|0.05|0|0|

<br>

<center>To get the probabilities we divide each count value of our bigram table with the unigram count of the first word of the bigram.<\center>

<br><br>
For eg: To get the bigram probability of "<i>eat italian</i>", we divide the bigram count of "<i>eat italian</i>" which is 15 by the total no. of times(unigram count), eat(first word of bigram) appears in the corpus which is 746.

Therefore, $$ \begin{align} P(\text{eat italian}) = \frac{15}{746}= 0.02\end{align}$$

Using the bigram probability table, we can now easily compute the probability of sentences like `I want italian lunch` by simply multiplying the appropriate bigram probabilities together, as follows:

$$\begin{align}
P(\text{<s> i want italian lunch </s>})
&= P(\text{i|<s>}).P(\text{want|i}).P(\text{italian|want}).P(\text{lunch|italian}).P(\text{</s>|lunch}) \\
&= .22 \times .32 \times .006 \times 0.006 \times 0.7 \\
&= 0.0000017
\end{align}
$$

**Note:** $P(\text{i|<s>})$ and $P(\text{</s>|lunch})$ were not in the above bigram probability table but can be easily calculated from the corpus in a similar way.


You can refresh about n-gram model by going through this video on [N-gram by Machine Learning TV](https://www.youtube.com/watch?v=GiyMGBuu45w)

## 2.3 Evaluating LMs: Perplexity

We just successfully constructed one language model using n-grams.

We now need an evaluation method to check how good it is(Especially because we have taken the Markov assumption) from the "actual" probability of the sentences?

Therefore we need a measure to evaluate language models.Following are the two popular ways:

**Extrinsic evaluation:** This is the best and most intuitive way to evaluate a model . It involves testing different models in how much they help the application 

For e.g. We want to evaluate the language model for a spell checker. Thus, for spell checker, we can compare the performance of two language models by running the spell checker twice, once with each language model, and seeing which gives the more accurate correction. 

Unfortunately, running big NLP systems end-to-end is an expensive form of evaluation.

**Intrinsic evaluation**: 

It would be convenient to have a method that can be used to quickly evaluate potential improvements in a language model. 

An intrinsic evaluation method is one that measures the quality of a model independent of any application.

Just like most of the statistical models in data science field, the probabilities of an n-gram model come from the corpus it is trained on, known as the training set. We can then measure the performance of the n-gram model by its performance on the unseen data also known as the test set.

Whichever model assigns a higher probability to the sentences present in the test set is a better model. 


*Q:* So should we just compare the raw probabilities of different models to decide which model is intrinsically better?

*A:* NO

The reason for it is that not all probability distributions are created equal.

For eg: 

There's a lot more uncertainty about the outcome of the word `surprise` in a 1000 word article as compared to a novel.(Novel has a bigger corpus than the Article) 


Another reason is if two distributions have the same number of outcomes, how likely those outcomes are also affects your uncertainty.

For eg:
Given two 500 word essays written one on 'Global Warming' and 'Formula F1 Race', you are a lot less uncertain about the word  'Polar Bears' on the first essay than you are in the second essay.


In practice we don’t use raw probability as our metric for evaluating language models, but a variant called **perplexity**. 

Perplexity gives measures of complexity in a way that accounts for the above two reasons.

The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words. 

For a test set $W = w_1 w_2 \dots w_N ,$: 

$$\begin{align} PP(W) &= {(P(w_1 w_2 \dots w_N))}^{-\frac{1}{N}}\\\end{align}$$
<br>
$$\begin{align}&= \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}} \\\end{align}$$
<br>
$$\begin{align}&= \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1 w_2 \dots w_{i-1})}}\\\end{align}$$
<br>

$$\begin{align}&\text{Replacing the perplexity with a n-gram model, say a bigram language model, we get :}\\\end{align}$$
$$\begin{align}PP(W) &= \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_{i-1})}}\\\end{align}$$


**Note:** 

1. The inverse in the formula means the higher the conditional probability of the word sequence, the lower the perplexity. Therefore, minimizing perplexity is equivalent to maximizing the test set probability of the language model.

2. The term 1/N where N is the number of words, helps normalize for the length of the probability by the number of words. This way the longer the sentence the less probable it will be.     

3. Based on multiple experiments, it's observed that of all the n-gram models, trigram(n=3) models perform the best in predicting the 'real' world probabilities  

You can have a better understanding about Evaluating language models by going through the video on [Evaluation and Perplexity by André Ribeiro de Miranda](https://www.youtube.com/watch?v=BAN3NB_SNHY)


**Evaluation problem**

Consider the following bigram table from the previous topic:

|-|i|want|to|eat|italian|food|lunch|buy
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|i|0.002|0.32|0|0.003|0|0|0|0.0007|
|want|0.002|0|0.65|0.001|0.006|0.007|0.006|0.001|
|to|0.0008|0|0.001|0.28|0.0008|0|0.002|0.08|
|eat|0|0|0.002|0|0.02|0.004|0.05|0|
|italian|0.006|0|0|0|0|0.51|0.006|0|
|food|0.01|0|0.01|0|0.0009|0.01|0|0|
|lunch|0.005|0|0|0|0|0|0|0.002|
|breakfast|0.004|0|0|0|0|0.05|0|0|



In the table majority of the values are zero. A matrix selected from a random set of 10 words would be even more sparse. 
 

The model we have assumed so far suffers from two drastic problems:

**1. Sparsity**

For any n-gram that has occurred a sufficient no. of times, we might have a good estimate of its probability. But because any test corpus is limited, some perfectly acceptable word sequences are bound to be missing from it. 

Since there are a combinatorial no. of possible strings, many rare(but not impossible) combinations never occur in training resulting in system incorrectly assigning zero probability to many parameters

For eg: If the bank data training set has the following sentences(among many others):

"he was denied the loan"

"he was denied the loan offer"

"loan was refused to him"

But suppose our test set had a phrase like

"loan was denied to him"

That sentence makes perfect sense but the $ P(\text {to|denied}) $ will be 0 resulting in the overall probability of the test sentence to be equal to 0.


**2. Limited vocabulary**

We assume our model knows all the words in the vocabulary which is rarely the case.

Consider the following sentences of news data training set:

"denied the rumours"

"denied the report"

"denied the allegations"

"denied the news"

But suppose our test set had a phrase like

"denied the speculations"

Even though the test phrase makes perfect sense, since "speculations" word was not in the training set, $ P(\text {speculations|the})$ will be 0 resulting in the overall probability of the test sentence to be equal to 0.

We could choose a vocabulary (word list) that is fixed in advance but in doing so we are limiting our model immensely.


How do we solve the dual problem of limited words and limited sentence combinations that are in train set but appear in a test set in an unseen context?

To keep a language model from assigning zero probability to these unseen events, we’ll have to take a bit of probability from some more frequent events and give it to the events we’ve never seen.

This modification is called smoothing. 

Let's look at smoothing in detail in the next topic.

## Task

- `prob_list_1` contains the bigram probabilities of `this is a sunny day`.

- Multiply all the values of `'prob_list_1'` to find the bigram model probability of the sentence and store the result in a variable called `total_prob_1`

***
Following is a sample code calculation of perplexity

**Input:**

```python

prob_list=[0.1, 0.023 ,0.09]


perplexity=1

# Calculating N
N=len(prob_list)-2


# Calculating the perplexity
for val in prob_list:
    perplexity = perplexity * (1/val)

perplexity = pow(perplexity, 1/float(N)) 

print("Perplexity= :",perplexity)
```
**Output:**

```python
Perplexity= : 69.5048046856916
```
***

- Calculate the perplexity of the values of `'prob_list_1'`(similar to the above code) and store the result in a variable called `'perplexity_1'`. 


- `prob_list_2` contains the bigram probabilities of `this place is beautiful`.

- Multiply all the values of `'prob_list_2'` to find the bigram model probability of the sentence and store the result in a variable called `total_prob_2`

- Calculate the perplexity of the values of `'prob_list_2'`(similar to the above code) and store the result in a variable called `'perplexity_2'`


**Things to ponder upon:**

- Which sentence has a lower perplexity?

- Between perplexity and total probability, which metric gives a better intuitive understanding of more probable sentence?



In [2]:
prob_list=[0.1, 0.023 ,0.09]


perplexity=1

# Calculating N
N=len(prob_list)-1


# Calculating the perplexity
for val in prob_list:
    perplexity = perplexity * (1/val)

perplexity = pow(perplexity, 1/float(N)) 

print("Perplexity= :",perplexity)


Perplexity= : 69.5048046856916


In [1]:


"""For the sentence: 'this is a sunny day' """ 
prob_list_1=[0.008303975842979365, 0.05030826140567201, 0.08609535184632229, 4.5083630133898384e-05, 0.15384615384615385]



total_prob_1 = 1

# Multiplying all the values of the probability and storing it
for val in prob_list_1:
    total_prob_1 *= val


print("For the sentence- 'this is a sunny day'")
print("Total probability:",total_prob_1)


perplexity_1=1

# Calculating N
N=len(prob_list_1)-1


# Calculating the perplexity
for val in prob_list_1:
    perplexity_1 = perplexity_1 * (1/val)

perplexity_1 = pow(perplexity_1, 1/float(N)) 

print("Perplexity:",perplexity_1)



"""For the sentence: 'this place is beautiful' """
prob_list_2=[0.008303975842979365, 0.0022194821208384712, 0.02185792349726776, 9.953219866626854e-05]

total_prob_2 = 1

# Multiplying all the values of the probability and storing it
for val in prob_list_2:
    total_prob_2 *= val

print("\n\nFor the sentence- 'this place is beautiful'")    
print("Total probability: ",total_prob_2)


perplexity_2=1

# Calculating N
N=len(prob_list_2)-1

# Calculating perplexity
for val in prob_list_2:
    perplexity_2 = perplexity_2 * (1/val)

perplexity_2 = pow(perplexity_2, 1/float(N)) 

print("Perplexity: ",perplexity_2)



For the sentence- 'this is a sunny day'
Total probability: 2.494655687321879e-10
Perplexity: 251.62126814544143


For the sentence- 'this place is beautiful'
Total probability:  4.009684736463708e-11
Perplexity:  2921.6616783932823


# Hints
You can find perplexity of sentence 1 by writing code similar to:
```python
for val in prob_list_1:
    perplexity_1 = perplexity_1 * (1/val)

perplexity_1 = pow(perplexity_1, 1/float(N)) 
```
Similarly, you can find perplexity of the other sentence.



# Test Cases


#total_prob_1
Variable declaration
round(total_prob_1,10)==2e-10


#perplexity_1
Variable declaration
round(perplexity_1,2)==251.62

#total_prob_2
Variable declaration
round(total_prob_2,11)==4e-11

#perplexity_2
Variable declaration
round(perplexity_2,2)==2921.66

# Success Message

Congrats! You have successfully found the perplexity and total probabilities of the given sentences!

# 3. Smoothing

Description: In this chapter, we will learn the different types of Smoothing that can be done

## 3.1 Add-K Smoothing

We just understood that to keep a language model from assigning zero probability to these unseen events, we could take a bit of probability from some more frequent events and give it to the events we’ve never seen by a method called smoothing. 

Following are some of the popular smoothing techniques:

- Laplace Smoothing/Add-K smoothing

- Interpolation

- Backoff

Let's try to understand them one by one.

**Laplace Smoothing**

The simplest way to do smoothing would be to add one to all the bigram(or any n-gram) counts, before we normalize them into probabilities. All the counts that used to be 0 will now have a count of 1, the counts of 1 will be 2, and so on and so forth. 

This kind of smoothing is called Laplace smoothing. 

Let’s start with the application of Laplace smoothing to unigram(single word) probabilities. 

Mathematically If unsmoothed unigram probability of the word $w_i$ is its count $c_i$ normalized by the total number of word tokens $N$:

$$P(w_i) = \frac{c_i}{N}$$

Laplace smoothing merely adds one to each count (Its also called one-smoothing). If there are V words in the vocabulary and each one is incremented, we also need to adjust the denominator to take into account the extra V observations.

$$P_{Laplace}(w_i) = \frac{c_i + 1}{N + V}$$

Let us try to understand Laplace smoothing of bigrams with the previous food delivery app example.

<br>
<center>Following is the original bigram count table</center>

|-|i|want|to|eat|italian|food|lunch|buy
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|i|6|818|0|9|0|0|0|2|
|want|2|0|608|1|6|7|6|1|
|to|2|0|4|685|2|0|6|211|
|eat|0|0|2|0|15|3|42|0|
|italian|1|0|0|0|0|82|1|0|
|food|14|0|15|0|1|5|0|0|
|lunch|2|0|0|0|0|0|0|1|
|buy|1|0|0|0|0|14|0|0|

<br><br>
<center>After Laplace smoothing, the table transforms to</center>

|-|i|want|to|eat|italian|food|lunch|buy
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|i|7|819|1|10|1|1|1|3|
|want|3|1|609|2|7|8|7|2|
|to|3|1|5|686|3|1|7|212|
|eat|1|1|3|1|16|4|43|1|
|italian|2|1|1|1|1|83|2|1|
|food|15|1|16|1|2|6|1|1|
|lunch|3|1|1|1|1|1|1|2|
|buy|2|1|1|1|1|15|1|1|


We know that normal bigram probabilities are computed using the following:
$$ P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})} $$

This resulted in the following bigram probability table:

|-|i|want|to|eat|italian|food|lunch|buy
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|i|0.002|0.32|0|0.003|0|0|0|0.0007|
|want|0.002|0|0.65|0.001|0.006|0.007|0.006|0.001|
|to|0.0008|0|0.001|0.28|0.0008|0|0.002|0.08|
|eat|0|0|0.002|0|0.02|0.004|0.05|0|
|italian|0.006|0|0|0|0|0.51|0.006|0|
|food|0.01|0|0.01|0|0.0009|0.01|0|0|
|lunch|0.005|0|0|0|0|0|0|0.002|
|breakfast|0.004|0|0|0|0|0.05|0|0|



For add-one smoothed bigram counts, we just need to normalize the unigram count(denominator) by the number of distinct words V(in this case V=100) in the vocabulary:
$$ P^*_{Laplace}(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V} $$

This will result in the following bigram probability table

|-|i|want|to|eat|italian|food|lunch|buy
|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|i|0.002|0.31|0.0003|0.003|0.0003|0.0003|0.0003|0.001|
|want|0.002|0.0009|0.59|0.001|0.006|0.007|0.006|0.001|
|to|0.001|0.0003|0.001|0.27|0.001|0.0003|0.002|0.08|
|eat|0.001|0.001|0.003|0.001|0.01|0.004|0.05|0.001|
|italian|0.007|0.003|0.003|0.003|0.003|0.32|0.007|0.003|
|food|0.01|0.0008|0.01|0.0008|0.001|0.005|0.0008|0.0008|
|lunch|0.006|0.002|0.002|0.002|0.002|0.002|0.002|0.004|
|breakfast|0.005|0.002|0.002|0.002|0.002|0.03|0.002|0.002|


You can see that 0 probabilities have been converted to some non zero value and at the same time, the value of earlier non zero probabilities has also reduced for overall probability distribution(For e.g. $P(\text{to|want}$) changed from 0.65 to 0.59)

The sharp change in probabilities occur because too much probability mass is moved to all the zeros


Let's calculate the probability of the sentence "i want italian lunch" again 

$$\begin{align}
P(\text{<s> i want italian lunch </s>})
&= P(\text{i|<s>}).P(\text{want|i}).P(\text{italian|want}).P(\text{lunch|italian}).P(\text{</s>|lunch}) \\
&= .22 \times .31 \times .006 \times 0.007 \times 0.7 \\
&= 0.000002
\end{align}$$


Though a practical smoothing algorithm for tasks like text classification, unfortunately, Laplace smoothing doesn't perform well for n-gram models.

**Add-K Smoothing**

One way to move a bit less of the `probability mass` from the seen to the unseen events is instead of adding `1` to each count, we just add a fractional count `k` (.5? .02?). 

This modified add-1(Laplace) smoothing is called add-k smoothing.

Here instead of incrementing count by 1 we increment count by a fractional value, helping us transfer a lesser amount of probability from seen values of corpus to the unseen values of corpus. 

Mathematical formula of finding probabilities after add-k smoothing is 

$$ P^*_{add-k}(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV} $$

There are multiple methods for selecting the optimum k value. For example, by optimizing  it on the testset. 

Though better than Add-1 smoothing, Add-k smoothing still doesn’t work well for language modeling often leading to poor variance

Let's look at some other alternatives

# TASK

- The working code for the first task you completed is given with a new sentence `sunset looks magnificient.`

- Run the code once as it is.

**We get an error of `division by 0` because magnificient is not in our corpus.**

**Let's resolve that using Laplace Smoothing**


- Calculate Vocabulary of the corpus by finding the no. of unique words in the list `'words'` and save the the count in a variable called `'V'`

- Inside the function:
        -Update the calculation  of the term `'bigram_prob'` by adding `1` to `'bigram_freq'` and `V` to `'unigram_freq'`
        
- Multiply all the values of `'prob_list'` to find the bigram model probability of the sentence and store the result in a variable called `total_prob`


In [3]:
import nltk
from nltk.corpus import brown



import nltk
from nltk.corpus import brown

# Corpus
words = brown.words()
words=[w.lower() for w in words]

# Unigram frequency 
uni_freq = nltk.FreqDist(w.lower() for w in words)

# Size of corpus
total_words = len(words)

print('Frequency of tokens of the sample sentence:')

for word in test_sentence_tokens:
    print(word,uni_freq[word])

    
# Creating bigrams

bigram_words = []
previous = 'EMPTY'
sentences = 0
for word in words:
    if previous in ['EMPTY','.','?','!']:
        ## insert word_boundaries at beginning of Brown,
        bigram_words.append('*start_end*')
    else:
        bigram_words.append(word)
    
    previous = word


    
    
bigram_words.append('*start_end*') ## assume one additional *start_end* at the end of Brown

updated_uni_freq  = nltk.FreqDist(w.lower() for w in bigram_words)


print('\nCalculating bigram counts for sentence, including bigrams with sentence boundaries, i.e., *BEGIN* and *END*')


# Bigram corpus
bigrams = nltk.bigrams(w.lower() for w in bigram_words)


# Bigram probabilities
conditional_freq = nltk.ConditionalFreqDist(bigrams)

#Sentence 
test_sentence_tokens=['sunset','looks','magnificient','.']

# Code begins here



V=len(set(words))


# Function to calculate bigram probability
def get_bigram_probability(first,second):
    
    bigram_freq = conditional_freq[first][second]
    unigram_freq = updated_uni_freq[first]

    bigram_prob = (bigram_freq + 1)/(unigram_freq + V)
    
    return bigram_prob

# Calculating the bigram probability

prob_list=[]
previous = '*start_end*'
for token in test_sentence_tokens:
    next_probability = get_bigram_probability(previous,token)
    print(previous,token,(float('%.3g' % next_probability)))
    previous = token
    prob_list.append(next_probability)

    
# For the final term    
next_probability = get_bigram_probability(previous,'*start_end*')
print(previous,'*start_end*',next_probability)
prob_list.append(next_probability)    

print(prob_list)    



# Calculating the total probability

total_prob = 1
for val in prob_list:
    total_prob *= val

print("\nTotal probability:",total_prob)

Frequency of tokens of the sample sentence:
this 5145
is 10109
a 23195
sunny 13
day 687
. 49346

Calculating bigram counts for sentence, including bigrams with sentence boundaries, i.e., *BEGIN* and *END*
*start_end* sunset 9.48e-06
sunset looks 2.01e-05
looks magnificient 2e-05
magnificient . 2.01e-05
. *start_end* 0.49764524359375156
[9.48307744829352e-06, 2.0068634730779264e-05, 2.004329351399022e-05, 2.007427481682224e-05, 0.49764524359375156]

Total probability: 3.8106225670516194e-20


# Hints

Inside the function, `bigram_prob` has to be updated in the following way:

```python
    bigram_prob = (bigram_freq + 1)/(unigram_freq + V)
```

# Test Cases

#prob_list
Variable declaration
round(prob_list[4],2)==0.5


#total_prob
Variable declaration
round(total_prob,20)==4e-20


# Success Message

Congrats! You have successfully applied Laplace Smoothing!

## 3.2 Other methods of smoothing

We saw the inefficiency of Add k smoothing.

This is because different n-grams have different problems. 


The unigram estimate will never have the problem of its numerator or denominator having value equal to 0. 
However, the unigram ignores the context (previous n words), and hence discards valuable information. 

In contrast, the n-gram models where n>=2 estimate do make use of context but has the sparsity problem. As n increases, the power of n-gram model increases but the smoothing problem too gets worse.

Instead of relying on a single model, what if we tried to solve this problem using the strength of different models?

**Interpolation(Jelinek-Mercer smoothing)**

The idea in linear interpolation is to use all the available models in a linear combination.

For eg: For estimating the trigram probability $P(w_n |w_{n−2}w_{n−1})$ we will mix(interpolate) together the unigram, bigram, and trigram probabilities, each weighted by a weight λ :

\begin{align} \hat{P}(w_n | w_{n−2}w_{n−1}) &= \lambda_1 P(w_n | w_{n−2} w_{n−1}) \\
&+ \lambda_2 P(w_n | w_{n−1}) \\
&+ \lambda_3 P(w_n) \end{align}

such that the λ's sum to 1:
$$\sum_{i}^{} \lambda_i = 1$$


How do we calculate $\lambda$ values set? 

There are multiple ways to do that:

*1. Calculation using counts*

If we have a high count of trigrams then we give them relatively higher weight otherwise more weight is put on the unigram and bigram models.


*2. Calculation using held out corpus*

A held-out corpus is an additional training corpus that we use to set hyperparameters like the $\lambda$ values, by choosing the $\lambda$ values that maximize the likelihood of the held-out corpus. 

In this method we fix the n-gram model and then search for the $\lambda$ values that when plugged into the equation will give us the highest probability in the held-out set. So if we have particularly accurate counts for  unigram, we assume that the counts  based on this unigram will be more trustworthy, so we can make the λ s for that unigram higher and thus give that unigram more weight in the final interpolation. 



Consider the same corpus example we encountered before:

$<s>$ You are a data scientist $</s>$

$<s>$ Data scientist you are $</s>$

$<s>$ You love statistics $</s>$


Here $<s>$ and $</s>$ denote the start and end of the sentence respectively.

Let's assume $\lambda_1 = \lambda_1= \frac{1}{2} $

If we wanted to calculate the probability of the bigram 'you love', we get:

\begin{align} \hat{P}(\text{you love})&=\hat{P}(w_n | w_{n−1}) = \lambda_1 P(w_n | w_{n−1}) + \lambda_2 P(w_n) = \frac{1}{2}.\frac{1}{3} + \frac{1}{2}.1 =\frac{1}{6} + \frac{1}{2} =\frac{4}{6} \end{align} 


**Backoff(Katz Smoothing)**

This method is another way we can use multiple n-gram models to our advantage

In this method, if the n-gram we are calculating has zero counts, we approximate it by backing off to the (N-1)-gram. 

We continue backing off until we reach a model that has some counts.


So if we are trying to compute trigram probability $P(w_n |w_{n−2} w_{n−1})$ but we have no examples of a particular trigram $w_{n−2} w_{n−1} w_n$ , we "backoff" and estimate its probability by using the bigram probability $P(w_n |w_{n−1})$. 

Similarly, if we don’t have counts to compute for the bigram $P(w_n|w_{n−1})$, we go to the unigram $P(w_n)$[Which will never be 0].


For eg:

We are using 6-grams to calculate the probability of a word in text. You have "<i>this is a very rainy</i>" followed by "night". Let's suppose "<i>night</i>" never ocurred in this context in our corpus "<i>this is a very rainy</i>" so for the 5-grams model "<i>night</i>" has 0 probability which is not good because we know "<i>night</i>" is more probable than something like "<i>peacock</i>".

In other words, $P(\text{night|this is a very rainy})=0$ 

To resolve that, we will use a 5-gram model or a 4-gram model to calculate the probability of the sentence.

We see that if we use 4-gram model i.e "<i>night</i>" in the context "<i>a very rainy</i>", we are able to get a non-zero probability. We will hence use $P(\text{night|a very rainy})$ instead of the 6-gram prob

This method works because sometimes using less context is more benificial(like we saw in our rainy night example), helping to tackle for contexts that the model hasn’t learned much about. 


**Note:** For a backoff model to give the correct probability distribution, we have to `discount` the higher-order n-grams to save some probability mass for the lower order n-grams. Similar to how we changed the denominator with add-one smoothing. If higher-order n-grams aren’t discounted, the total probability assigned to all possible strings by the language model would be greater than 1! 


Studies shows that the two most widely used techniques are Interpolation and Backoff. Both of which perform consistently well across training set sizes for both bigram and trigram models, with Backoff technique performing better on trigram models in large training sets and on bigram models in general.


# Task


- The working code for the first task you completed is given with a new sentence `this is a very sunny day.`

- Run the code once as it is.

**We get the final probability value as 0 because the bigram 'very sunny' is not in our corpus.**

**Let's try to resolve that using Backoff**

Inside the function `get_bigram_probability()` we  need to implement a condition such that if bigram probability is 0, it should return the unigram probability of the `second` term. 

- Inside the function `get_bigram_probability()`:
        
        - Just above the bigram probability calculation, put up an if condition `"if not second in conditional_freq[first]:"` to check if that particular bigram exists
        
        - Inside the if condition calculate the unigram probability `"unigram_prob"` by dividing `"updated_uni_freq[second]"`(Unigram frequencies are stored in the dictionary `"updated_uni_freq"`) by `"len(words)"`. Return the variable `"unigram_prob"`
       
(**Note:** Don't remove the previous code from `get_bigram_probability()`, you just need to add an extra if condition in the beginning of the function )
        

**Things to ponder upon**

- Try to calculate the sentence probability by using Laplace Smoothing instead of Backoff method. Do we get different results?


In [4]:

import nltk
from nltk.corpus import brown



import nltk
from nltk.corpus import brown

#Sentence 
test_sentence_tokens=['this','is','a','very','sunny','day','.']


# Corpus
words = brown.words()
words=[w.lower() for w in words]

# Unigram frequency 
uni_freq = nltk.FreqDist(w.lower() for w in words)

# Size of corpus
total_words = len(words)

print('Frequency of tokens of the sample sentence:')

for word in test_sentence_tokens:
    print(word,uni_freq[word])

    
# Creating bigrams

bigram_words = []
previous = 'EMPTY'
sentences = 0
for word in words:
    if previous in ['EMPTY','.','?','!']:
        ## insert word_boundaries at beginning of Brown,
        bigram_words.append('*start_end*')
    else:
        bigram_words.append(word)
    
    previous = word


    
    
bigram_words.append('*start_end*') ## assume one additional *start_end* at the end of Brown

updated_uni_freq  = nltk.FreqDist(w.lower() for w in bigram_words)


print('\nCalculating bigram counts for sentence, including bigrams with sentence boundaries, i.e., *BEGIN* and *END*')


# Bigram corpus
bigrams = nltk.bigrams(w.lower() for w in bigram_words)


# Bigram probabilities
conditional_freq = nltk.ConditionalFreqDist(bigrams)


# Code begins here


V=len(set(words))


# Function to calculate bigram probability
def get_bigram_probability(first,second):

    if not second in conditional_freq[first]:
        print('Backing Off to Unigram Probability for',second)
        unigram_prob = updated_uni_freq[second]/len(words)
        return unigram_prob 
    

    bigram_freq = conditional_freq[first][second]
    unigram_freq = updated_uni_freq[first]
    bigram_prob = bigram_freq/unigram_freq
    
    return bigram_prob


# Calculating the bigram probability

prob_list=[]
previous = '*start_end*'
for token in test_sentence_tokens:
    next_probability = get_bigram_probability(previous,token)
    print(previous,token,(float('%.3g' % next_probability)))
    previous = token
    prob_list.append(next_probability)

    
# For the final term    
next_probability = get_bigram_probability(previous,'*start_end*')
print(previous,'*start_end*',next_probability)
prob_list.append(next_probability)    

print(prob_list)    



# Calculating the total probability

total_prob = 1
for val in prob_list:
    total_prob *= val

print("\nTotal probability:",total_prob)

Frequency of tokens of the sample sentence:
this 5145
is 10109
a 23195
very 796
sunny 13
day 687
. 49346

Calculating bigram counts for sentence, including bigrams with sentence boundaries, i.e., *BEGIN* and *END*
*start_end* this 0.0083
this is 0.0503
is a 0.0861
a very 0.00613
Backing Off to Unigram Probability for sunny
very sunny 1.12e-05
sunny day 0.154
day . 0.163
. *start_end* 1.0
[0.008303975842979365, 0.05030826140567201, 0.08609535184632229, 0.00613137369821018, 1.1195392320994288e-05, 0.15384615384615385, 0.16251830161054173, 1.0]

Total probability: 6.172926606098926e-14


# Hints

You can write the function `get_bigram_probability()` in the following way

```python

def get_bigram_probability(first,second):

    if not second in cfd[first]:
        print('Backing Off to Unigram Probability for',second)
        unigram_prob = updated_uni_freq[second]/len(words)
        return unigram_prob 
    
    else:
        bigram_freq = conditional_freq[first][second]
        unigram_freq = updated_uni_freq[first]
        bigram_prob = bigram_freq/unigram_freq
    
    return bigram_prob

```


# Test Cases

#prob_list
Variable declaration
round(prob_list[0],3)==0.008
round(prob_list[1],3)==0.05


#total_prob
Variable declaration
round(total_prob,14)==6e-14

# Success Message

Congrats! You have successfully implemented backoff smoothing

# END OF NOTEBOOK

**********************************************************************************************************************************

# 5. Spell Check Project

You often would have observed when you google anything on web and make a spelling mistake, google corrects the spelling mistake for you and perform the search on the corrected word afterwards. The full details of an industrial-strength spell corrector are quite complex, though we can still make a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second in about half a page of code.

<img src="lm17.png" width="700">

**Aim**

Here our aim is to construct a function *correction()*, such that *correction(speling)* would output *spelling*. 

We will achieve this task in few steps described below.

Your first step is to understand how to make a list of candidates words which may be the correct word for any misspelled word.

From now on we refer to the word w which we have to get correct spelling for as query word. We solve the above problem in three steps.

First we collect all the words which can be formed from query word using a simple edit. A simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter). You have been provided with a function edits1 which outputs list of all possible words formed with only one simple edit. You have to implement using this given function a function which output lists of all the strings (whether words or not) that can be made with one simple edit. Note that most of the strings made using one simple edit would not be a word.

One important thing to note here is that our spell checker should return the word itself if it is in vocabulary. That is if that word is in vocabulary so that is the correct word, similarly if you get a word with one simple edit which is in vocabulary then you don't need to look further for words with 2 simple edits and look at their probability. This is an assumption we have made here that any word with one simple edit is always better substitute for a mis-spelled word than a number with 2 simple edits and similar is the case in words with 2 and 3 edits. In the end if we can find no meaningful word using 1, 2 or 3 edits then we simply return the query word itself. 

In [5]:
import re
from collections import Counter

import numpy as np

In [6]:
# Inside functions (utility)

def edits1(word):
    #All edits that are one edit away from `word`.
    
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def known(words): 
    return list(w for w in words if w in WORDS)


In [7]:
# PROBLEM CODE

# Function to get list of all possible correct words
# from the given word

def edits(word):
    # word : Contain the query word
    # return two lists, words_1edit and words_2edit
    # which are list of all words with one edit 
    # and 2 edits respectively.
    
    # Initialization
    words_1edit = []
    words_2edit = []
    
    # Write code here
    #####################
    
    # In line 1 
    
    words_1edit = None
    
    for word in None:
        for word2 in None:
            # Append word2 into list words_2edit below
            None
    
    #####################
    # End code here
    
    return [words_1edit, words_2edit]


# A function to find probability of a list of given words
# Given below N is the total number of words in corpus
def P(word, N=sum(WORDS.values())):
    
    # Initialize prob_list which stores probability of 
    # corresponding word in the word list
    prob_list = []
    
    # Write code here
    #####################
    
    # 1. In first line iterate over "word" list
    # 2. In 2nd line find probability of each 
    # word by taking ratio of count of that word
    # and total words in corpus (N)
    # Hints: WORDS['apple'] give count of word "apple" in corpus 
    # 3. In 3rd line append probability of each word into 
    # list prob_list which contain probability of all words.
    
    for w in None:
        p_w = None
        None
        
    #####################
    # End code here
        
    return prob_list



def correction(word):
    
    # Write code here
    #####################
    
    # Follow given steps below.
    # 1. First get lists of all possible words you can
    #  get by making one simple edit and two simple edits
    #  in the query word and store it in words_1edit and 
    #  words_2edit respectively. Use the 'edits()' function we 
    #  implemented earlier.
    # 2. From lists found in above step you have to slect only
    #  correct words, use 'known()' function which takes any 
    #  list of words and return only known words from vocabulary.
    #  From list 'words_1edit' and 'words_1edit' select only 
    #  known words using 'known()' function and save it in lists
    #  'known_words_1edit' and 'known_words_2edit' respectively.
    #  Observe that 'known_word' variable is a list which will 
    #  contain the query word if it is known in vocabulary and
    #  would be empty otherwise
    # 3. Using if-else condition you have to implement three possible
    #  conditions
    #  (a) return query word itself if it is in vocabulary,
    #      so check if length of 'known_word' list is 0, 
    #      if not zero it means word is in vocabulary and so
    #      return it. If it is zero then move to next condition.
    #  (b) if list 'known_words_1edit' is non-empty then
    #      you don't need to look further return word with maximum
    #      probability in this list
    #      in 1st line get probability of all the words in this list
    #      in 2nd line get the index with maximum value of probability
    #      in 3rd line return word with maximum probability
    #  (c) perform same operation with 'known_words_2edit' and
    #      'known_words_3edit'.
    #  (d) return query word if none of these last four cases
    #      give a known word
    
    words_1edit, words_2edit = None
    
    known_word = known([word])
    known_words_1edit = None
    known_words_2edit = None
    
    if len(known_word) != None:
        return None
    
    elif len(None) != 0:
        probability_list = None
        max_index = None
        return None
    
    elif len(None) != 0:
        probability_list = None
        max_index = None
        return None
    
    elif len(None) != 0:
        probability_list = None
        max_index = None
        return None
    
    else:
        return word
    
    #####################
    # End code here

In [8]:
# SOLUTION CODE

def edits(word):
    # All edits that are two edits away from `word`.
    
    words_1edit = []
    words_2edit = []
    words_3edit = []
    
    words_1edit = edits1(word)
    
    for word in words_1edit:
        for word2 in edits1(word):
            words_2edit.append(word2)
            
    for word in words_2edit:
        for word2 in edits1(word):
            words_3edit.append(word2)

    return [words_1edit, words_2edit, words_3edit]

def P(word, N=sum(WORDS.values())):
    prob_list = []
    for w in word:
        p_w = WORDS[w] / N
        prob_list.append(p_w)

    return prob_list



def correction(word):
    words_1edit, words_2edit, words_3edit = edits(word)
    
    known_word = known([word])
    known_words_1edit = known(words_1edit)
    known_words_2edit = known(words_2edit)
    known_words_3edit = known(words_2edit)
                       
        
    if len(known_word) != 0:
        return word
    
    elif len(known_words_1edit) != 0:
        probability_list = P(known_words_1edit)
        max_index = np.argmax(probability_list)
        return known_words_1edit[max_index]
    
    elif len(known_words_2edit) != 0:
        probability_list = P(known_words_2edit)
        max_index = np.argmax(probability_list)
        return known_words_2edit[max_index]
    
    elif len(known_words_3edit) != 0:
        probability_list = P(known_words_3edit)
        max_index = np.argmax(probability_list)
        return known_words_3edit[max_index]
    
    else:
        return word

In [9]:
correction('aple')

'able'

In [10]:
# INITIAL CODE OF SPELL CHECK
# LINK:- https://github.com/norvig/pytudes/blob/master/py/spell.py

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [11]:
# TESTS

def unit_tests():
    assert correction('speling') == 'spelling'              # insert
    assert correction('korrectud') == 'corrected'           # replace 2
    assert correction('bycycle') == 'bicycle'               # replace
    assert correction('inconvient') == 'inconvenient'       # insert 2
    assert correction('arrainged') == 'arranged'            # delete
    assert correction('peotry') =='poetry'                  # transpose
    assert correction('peotryy') =='poetry'                 # transpose + delete
    assert correction('word') == 'word'                     # known
    assert correction('quintessential') == 'quintessential' # unknown
    assert words('This is a TEST.') == ['this', 'is', 'a', 'test']
    assert Counter(words('This is a test. 123; A TEST this is.')) == (
           Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))
    assert len(WORDS) == 32198
    assert sum(WORDS.values()) == 1115585
    assert WORDS.most_common(10) == [
     ('the', 79809),
     ('of', 40024),
     ('and', 38312),
     ('to', 28765),
     ('in', 22023),
     ('a', 21124),
     ('that', 12512),
     ('he', 12401),
     ('was', 11410),
     ('it', 10681)]
    assert WORDS['the'] == 79809
    assert P('quintessential') == 0
    assert 0.07 < P('the') < 0.08
    return 'unit_tests pass'

def spelltest(tests, verbose=False):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.clock()
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w = correction(wrong)
        good += (w == right)
        if w != right:
            unknown += (right not in WORDS)
            if verbose:
                print('correction({}) => {} ({}); expected {} ({})'
                      .format(wrong, w, WORDS[w], right, WORDS[right]))
    dt = time.clock() - start
    print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '
          .format(good / n, n, unknown / n, n / dt))
    
def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

In [12]:
## This file assumes Python 3
## To work with Python 2, you would need to adjust
## at least: the print statements (remove parentheses)
## and the instances of division (convert
## arguments of / to floats), and possibly other things
## -- I have not tested this.

import nltk
from nltk.corpus import brown

# test_sentence_tokens = ['a','fact','about','the','unicorn','is','the','same','as','an','alternative','fact','about','the','unicorn','.']

test_sentence_tokens=['this','is','a','very','sunny','day','.']

words = brown.words()
fdist1 = nltk.FreqDist(w.lower() for w in words)

total_words = len(words)

print('Frequency of tokens in sample sententence in Brown according to NLTK:')

for word in test_sentence_tokens:
    print(word,fdist1[word])

# input('Pausing: Hit Return when Ready.')

print('Given that there are',total_words,'in the Brown Corpus, the unigram probability of these words')
print('is as follows (rounded to 3 significant digits):')

for word in test_sentence_tokens:
    unigram_probability = fdist1[word]/total_words
    print(word,float('%.3g' % unigram_probability))
    ## print(word,round((fdist1[word]/total_words),3))

# input('Pausing: Hit Return when Ready.')

## ADD convert single count items to OOV
## make simple assumption about sentence endings,
## and the position of START and END (sentence boundaries)

words2 = []
previous = 'EMPTY'
sentences = 0
for word in words:
    if previous in ['EMPTY','.','?','!']:
        ## insert word_boundaries at beginning of Brown,
        ## and after end-of-sentence markers (overgenerate due to abbreviations, etc.)
        words2.append('*start_end*')
    if fdist1[word]==1:
        ## words occurring only once are treated as Out of Vocabulary Words
        words2.append('*oov*')
    else:
        words2.append(word)
    previous = word
words2.append('*start_end*') ## assume one additional *start_end* at the end of Brown

fdist2 = nltk.FreqDist(w.lower() for w in words2)
## get Unigram counts for all words occuring more than once
## and also a count for OOV words

print('There are',fdist2['*oov*'],'instances of OOVs')

print('Unigram probabilities including OOV probabilities.')

def get_unigram_probability(word):
    if word in fdist1:
        unigram_probability = fdist2[word]/total_words
    else:
        unigram_probability = fdist2['*oov*']/total_words
    return(unigram_probability)

for word in test_sentence_tokens:
    unigram_probability = get_unigram_probability(word)
    print(word,float('%.3g' % unigram_probability))

# input('Pausing: Hit Return when Ready.')
## make new version that models Out of Vocabulary (OOV) words

print('Calculating bigram counts for sentence, including bigrams with sentence boundaries, i.e., *BEGIN* and *END*')
print('Assuming some idealizations: all periods, questions and exclamation marks end sentences;')

bigrams = nltk.bigrams(w.lower() for w in words2)
## get bigrams for words2 (words plus OOV)

cfd = nltk.ConditionalFreqDist(bigrams)

# for token1 in cfd:
#     if not '*oov*' in cfd[token1]:
#         cfd[token1]['*oov*']=1
#         ## fudge so there can be no 
#         ## 0 bigram

def multiply_list(inlist):
    out = 1
    for number in inlist:
        out *= number
    return(out)

def get_bigram_probability(first,second):
    if not second in cfd[first]:
        print('Backing Off to Unigram Probability for',second)
        unigram_probability = get_unigram_probability(second)
        return(unigram_probability)
    else:
        bigram_frequency = cfd[first][second]
    unigram_frequency = fdist2[first]
    bigram_probability = bigram_frequency/unigram_frequency
    return(bigram_probability)

def calculate_bigram_freq_of_sentence_token_list(tokens):
    prob_list = []
    ## assume that 'START' precedes the first token
    previous = '*start_end*'
    for token in tokens:
        if not token  in fdist2:
            token = '*oov*'
        next_probability = get_bigram_probability(previous,token)
        print(previous,token,(float('%.3g' % next_probability)))
        prob_list.append(next_probability)
        previous = token
    ## assume that 'END' follows the last token
    next_probability = get_bigram_probability(previous,'*start_end*')
    print(previous,'*start_end*',next_probability)
    prob_list.append(next_probability)
    probability = multiply_list(prob_list)
    print('Total Probability',float('%.3g' % probability))
    return(probability)



result = calculate_bigram_freq_of_sentence_token_list(test_sentence_tokens)


Frequency of tokens in sample sententence in Brown according to NLTK:
this 5145
is 10109
a 23195
very 796
sunny 13
day 687
. 49346
Given that there are 1161192 in the Brown Corpus, the unigram probability of these words
is as follows (rounded to 3 significant digits):
this 0.00443
is 0.00871
a 0.02
very 0.000686
sunny 1.12e-05
day 0.000592
. 0.0425
There are 15673 instances of OOVs
Unigram probabilities including OOV probabilities.
this 0.00443
is 0.00871
a 0.02
very 0.000686
sunny 1.12e-05
day 0.000592
. 0.0425
Calculating bigram counts for sentence, including bigrams with sentence boundaries, i.e., *BEGIN* and *END*
Assuming some idealizations: all periods, questions and exclamation marks end sentences;
*start_end* this 0.0196
this is 0.0842
is a 0.0858
a very 0.00604
Backing Off to Unigram Probability for sunny
very sunny 1.12e-05
sunny day 0.154
day . 0.162
. *start_end* 1.0
Total Probability 2.38e-13


In [13]:
# # Add perplexity
# import collections, nltk
# # we first tokenize the text corpus
# corpus ="""
# Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
# that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
# into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
# The group's influence on comedy has been compared to The Beatles' influence on music."""

# tokens = nltk.word_tokenize(corpus)
# #here you construct the unigram language model 
# def unigram(tokens):    
#     model = collections.defaultdict(lambda: 0.01)
#     for f in tokens:
#         try:
#             model[f] += 1
#         except KeyError:
#             model [f] = 1
#             continue
#     for word in model:
#         model[word] = model[word]/float(sum(model.values()))
#     return model



    


#computes perplexity of the unigram model on a testset  
# def perplexity(testset):
#     testset = testset.split()
#     perplexity = 1
#     N = 0
#     previous='*start_end*'
#     for word in testset:
#         N += 1
#         next_probability = get_bigram_probability(previous,word)
#         previous = word
#         print(word, next_probability)
#         perplexity = perplexity * (1/next_probability)
    
#     perplexity = pow(perplexity, 1/float(N)) 
#     return perplexity

# 4. Neural Language Models



Recently, the use of neural networks in the development of language models has become very popular, to the point that it may now be the preferred approach. The use of neural networks in language modeling is often called Neural Language Modeling, or NLM for short. Neural network approaches are achieving better results than classical methods both on standalone language models and when models are incorporated into larger models on challenging tasks like speech recognition and machine translation. A key reason for the leaps in improved performance may be the method’s ability to generalize.

Specifically, a word embedding is adopted that uses a real-valued vector to represent each word in a project vector space. This learned representation of words based on their usage allows words with a similar meaning to have a similar representation. This generalization is something that the representation used in classical statistical language models can not easily achieve. 

The neural network approach to language modeling can be described using the three following model properties:

* Associate each word in the vocabulary with a distributed word feature vector. That is each word is represented using a feature vector.
* Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence.
* This feature vector of words which we talked abobve and parameters of the probability function are learnt simultaneously.

Summary:

* Language models offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
* n-grams are Markov models that estimate words from a fixed window of previous words. n-gram probabilities can be estimated by counting in a corpus and normalizing
* n-gram language models are evaluated extrinsically in some task, or intrinsically using perplexity.
* The perplexity of a test set according to a language model is the geometric mean of the inverse test set probability computed by the model.
* Smoothing algorithms provide a more sophisticated way to estimate the probability of n-grams. Commonly used smoothing algorithms for n-grams rely on lower-order n-gram counts through backoff or interpolation.
* In neural models probability is given by a neural network model.
* We get feature vector representation of words instead of sparse representations which are later used in the model to get probability.


In [14]:

import nltk.lm
# f_in = open("science.txt", 'r');
# ln = f_in.read()    

# words = nltk.word_tokenize(ln)
my_bigrams = nltk.bigrams(words)
my_trigrams = nltk.trigrams(words)

s=""
tText = Text(words)
tText1 = Text(my_bigrams)
tText2 = Text(my_trigrams)
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)


tt=NgramModel(1, tText, estimator)
tt1=NgramModel(2, tText1, estimator)
tt2=NgramModel(3, tText2, estimator)


print (tt.perplexity(tText))
print (tt1.perplexity(tText1))
print (tt2.perplexity(tText2))

ModuleNotFoundError: No module named 'nltk.lm'