Language Model and Application for Spelling Error Correction

Objective: Develop a simple English spelling error correction program.

Exercise 1:

a) Build a language model based on n-grams using the Laplace smoothing method for the following models:

1-gram
2-gram
3-gram

b) Calculate the probability of a sentence and compute the Perplexity of a sentence based on 1-gram, 2-gram, and 3-gram models.

c) Analyze the results. Provide your own examples of spelling errors, and compare the probabilities of two similar sentences: one with the correct word order and one with an incorrect word order.
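
For reference, the quantities asked for in parts a) and b) can be written as follows. This is a sketch of the standard add-one (Laplace) estimates; the notation c(·) for a count and M for the number of scored tokens is an assumption introduced here, while N (total number of tokens) and V (vocabulary size) match the variable names used in the code cells.

```latex
\begin{align*}
P(w) &= \frac{c(w) + 1}{N + V} \\
P(w_i \mid w_{i-1}) &= \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V} \\
P(w_i \mid w_{i-2}, w_{i-1}) &= \frac{c(w_{i-2}, w_{i-1}, w_i) + 1}{c(w_{i-2}, w_{i-1}) + V} \\
\mathrm{PP}(w_1 \dots w_M) &= \exp\!\Big(-\frac{1}{M} \sum_{i=1}^{M} \log P(w_i \mid \text{context})\Big)
\end{align*}
```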

In [16]:
%pip install requests tqdm pandas zipfile36 patool



In [17]:
import os
import glob
import codecs
import sys
import re
import requests
import zipfile

In [18]:
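textword_URL = "https://raw.githubusercontent.com/cudnah124/Natural-Language-Processing/main/lab3/tedtalk.txt"
textword_PATH = "tedtalk.txt"

# Download the corpus and stop early if the request failed.
response = requests.get(textword_URL)
response.raise_for_status()

with open(textword_PATH, "wb") as f:
    f.write(response.content)

# Read the corpus line by line; the set drops duplicate lines and does not preserve order.
with open(textword_PATH, 'r', encoding='utf-8') as f:
    texttalk = set(line.strip() for line in f)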

In [19]:
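import re

# Lowercase each line, keep only letters and whitespace,
# and wrap the tokens in sentence boundary markers.
sentences = []
for line in texttalk:
    line = line.lower()
    line = re.sub(r'[^a-z\s]', '', line)
    tokens = ['<s>'] + line.split() + ['</s>']
    sentences.append(tokens)

print(sentences[0])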

['<s>', 'i', 'want', 'you', 'now', 'to', 'imagine', 'a', 'wearable', 'robot', 'that', 'gives', 'you', 'superhuman', 'abilities', 'or', 'another', 'one', 'that', 'takes', 'wheelchair', 'users', 'up', 'standing', 'and', 'walking', 'again', 'we', 'at', 'berkeley', 'bionics', 'call', 'these', 'robots', 'exoskeletons', 'these', 'are', 'nothing', 'else', 'than', 'something', 'that', 'you', 'put', 'on', 'in', 'the', 'morning', 'and', 'it', 'will', 'give', 'you', 'extra', 'strength', 'and', 'it', 'will', 'further', 'enhance', 'your', 'speed', 'and', 'it', 'will', 'help', 'you', 'for', 'instance', 'to', 'manage', 'your', 'balance', 'it', 'is', 'actually', 'the', 'true', 'integration', 'of', 'the', 'man', 'and', 'the', 'machine', 'but', 'not', 'only', 'that', 'it', 'will', 'integrate', 'and', 'network', 'you', 'to', 'the', 'universe', 'and', 'other', 'devices', 'out', 'there', 'this', 'is', 'just', 'not', 'some', 'blue', 'sky', 'thinking', 'to', 'show', 'you', 'now', 'what', 'we', 'are', 'workin
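
The effect of the cleaning step is easiest to see on a single short line. The sketch below wraps the same three operations in a helper; the name `tokenize` and the sample sentence are hypothetical and only serve as an illustration. Note that the regular expression removes apostrophes and digits as well as punctuation, so contractions are merged into one token.

```python
import re

def tokenize(line):
    """Hypothetical helper: same cleaning steps as the loop above, for one line."""
    line = line.lower()
    line = re.sub(r'[^a-z\s]', '', line)  # drop everything except a-z and whitespace
    return ['<s>'] + line.split() + ['</s>']

tokens = tokenize("Don't panic: it's 2024!")
print(tokens)
# ['<s>', 'dont', 'panic', 'its', '</s>']
```

Because "don't" becomes "dont", any sentence scored later should be cleaned the same way, otherwise its words will not match the counted vocabulary.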

In [20]:
from collections import Counter

unigrams = Counter()
bigrams = Counter()
trigrams = Counter()

# Count every 1-, 2-, and 3-token window in each sentence.
for sent in sentences:
    unigrams.update(sent)
    bigrams.update(zip(sent[:-1], sent[1:]))
    trigrams.update(zip(sent[:-2], sent[1:-1], sent[2:]))

V = len(unigrams)           # vocabulary size (includes <s> and </s>)
N = sum(unigrams.values())  # total number of tokens

In [21]:
# Laplace (add-one) smoothing: add 1 to the n-gram count and V to the context count.
def P_unigram(w):
    return (unigrams[w] + 1) / (N + V)

def P_bigram(w_prev, w):
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def P_trigram(w1, w2, w3):
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + V)
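
To see what add-one smoothing does to the estimates, it helps to work through a corpus small enough to check by hand. The sketch below uses a made-up two-sentence corpus (not the TED data); all names prefixed with `toy_` are hypothetical. It shows that an unseen bigram still gets a non-zero probability and that the smoothed bigram probabilities for a given context word sum to 1 over the vocabulary.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
toy_sentences = [
    ["<s>", "ideas", "can", "change", "the", "world", "</s>"],
    ["<s>", "ideas", "change", "the", "world", "</s>"],
]

toy_unigrams = Counter()
toy_bigrams = Counter()
for sent in toy_sentences:
    toy_unigrams.update(sent)
    toy_bigrams.update(zip(sent[:-1], sent[1:]))

toy_V = len(toy_unigrams)  # vocabulary size: 7 distinct tokens

def toy_P_bigram(w_prev, w):
    # Same add-one formula as P_bigram, applied to the toy counts.
    return (toy_bigrams[(w_prev, w)] + 1) / (toy_unigrams[w_prev] + toy_V)

seen = toy_P_bigram("the", "world")    # (2 + 1) / (2 + 7)
unseen = toy_P_bigram("the", "ideas")  # (0 + 1) / (2 + 7)
total = sum(toy_P_bigram("the", w) for w in toy_unigrams)

print(seen, unseen, total)
```

Here "the world" occurs twice and "the" occurs twice, so the seen bigram gets 3/9 and every unseen continuation gets 1/9.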

In [22]:
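import math

# Each function expects a token list that already contains <s> and </s>.
def sentence_prob_unigram(sent):
    prob = 1
    for w in sent:
        prob *= P_unigram(w)
    return prob

def sentence_prob_bigram(sent):
    prob = 1
    # Start at index 1 so that every token has one word of context.
    for i in range(1, len(sent)):
        prob *= P_bigram(sent[i-1], sent[i])
    return prob

def sentence_prob_trigram(sent):
    prob = 1
    # Start at index 2 so that every token has two words of context.
    for i in range(2, len(sent)):
        prob *= P_trigram(sent[i-2], sent[i-1], sent[i])
    return prob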
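
Multiplying many small probabilities works for short sentences, but the product can underflow to 0.0 for long ones. A common workaround is to sum log probabilities instead. The sketch below is one way to do this; `sentence_logprob` and the per-token probabilities are hypothetical and chosen only to trigger the underflow.

```python
import math

def sentence_logprob(token_probs):
    """Hypothetical helper: sum of log probabilities of the per-token estimates."""
    return sum(math.log(p) for p in token_probs)

# Made-up per-token probabilities for a 100-token sentence.
probs = [1e-4] * 100

direct = 1.0
for p in probs:
    direct *= p  # 1e-400 is below the smallest positive float, so this ends at 0.0

log_p = sentence_logprob(probs)  # stays finite

print(direct, log_p)
```

The log value can be compared across sentences directly, since a larger log probability corresponds to a larger probability.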

In [23]:
# Perplexity = exp of the negative average log probability per scored token.
def perplexity_unigram(sent):
    log_prob = 0
    for w in sent:
        log_prob += math.log(P_unigram(w))
    return math.exp(-log_prob / len(sent))

def perplexity_bigram(sent):
    log_prob = 0
    for i in range(1, len(sent)):
        log_prob += math.log(P_bigram(sent[i-1], sent[i]))
    # len(sent) - 1 bigrams are scored.
    return math.exp(-log_prob / (len(sent)-1))

def perplexity_trigram(sent):
    log_prob = 0
    for i in range(2, len(sent)):
        log_prob += math.log(P_trigram(sent[i-2], sent[i-1], sent[i]))
    # len(sent) - 2 trigrams are scored.
    return math.exp(-log_prob / (len(sent)-2))
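
The perplexity functions above compute the inverse of the sentence probability, normalized by the number of scored tokens. The sketch below checks that equivalence on made-up numbers; `perplexity_from_probs` is a hypothetical helper that takes the per-token probabilities directly rather than looking them up in the model.

```python
import math

def perplexity_from_probs(token_probs):
    """Hypothetical helper: exp of the negative mean log probability."""
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / len(token_probs))

# Made-up per-token probabilities.
probs = [0.1, 0.2, 0.05, 0.1]

pp = perplexity_from_probs(probs)

# Same quantity computed as P(sentence) ** (-1 / n).
product = 1.0
for p in probs:
    product *= p
pp_direct = product ** (-1 / len(probs))

# If every token has probability 1/50, the perplexity is 50.
uniform_pp = perplexity_from_probs([1 / 50] * 10)

print(pp, pp_direct, uniform_pp)
```

The uniform case gives a useful reading of the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing among k equally likely words.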

In [26]:
print("P_unigram: ", P_unigram("ideas"))
print("P_bigram: ", P_bigram("the", "world"))
print("P_trigram: ", P_trigram("can", "change", "the"))

P_unigram:  0.0002572277456701963
P_bigram:  0.020233866975541533
P_trigram:  0.0014453928104167964


In [28]:
test_sentence = "<s> ideas can change the world </s>".split()
wrong_order = "<s> ideas world the change can </s>".split()
spelling_error = "<s> ideas can chnage the world </s>".split()

tests = {
    "Correct sentence": test_sentence,
    "Wrong word order": wrong_order,
    "Spelling error": spelling_error
}

for name, sent in tests.items():
    print("\n", name)
    print("Unigram PP :", perplexity_unigram(sent))
    print("Bigram  PP :", perplexity_bigram(sent))
    print("Trigram PP:", perplexity_trigram(sent))


 Correct sentence
Unigram PP : 638.2662319912381
Bigram  PP : 1076.2463412324455
Trigram PP: 7149.133722973036

 Wrong word order
Unigram PP : 638.2662319912381
Bigram  PP : 10596.054946895363
Trigram PP: 80025.32849153919

 Spelling error
Unigram PP : 2164.5537383520136
Bigram  PP : 8669.044863897736
Trigram PP: 52558.14505235626
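
One pattern in the output above is that the unigram perplexity is identical (638.27) for the correct sentence and the reordered one, while the bigram and trigram perplexities rise sharply. A unigram model scores each word independently, so reordering the same words cannot change the result; only models that condition on previous words react to word order. The sketch below illustrates this with made-up unigram probabilities (the values in `toy_probs` are hypothetical, not taken from the corpus).

```python
import math

# Hypothetical unigram probabilities, for illustration only.
toy_probs = {
    "<s>": 0.05, "ideas": 0.001, "can": 0.01, "change": 0.002,
    "the": 0.04, "world": 0.003, "</s>": 0.05,
}

def toy_unigram_perplexity(sent):
    # Same formula as perplexity_unigram, using the toy probabilities.
    log_prob = sum(math.log(toy_probs[w]) for w in sent)
    return math.exp(-log_prob / len(sent))

correct = "<s> ideas can change the world </s>".split()
shuffled = "<s> ideas world the change can </s>".split()

pp_correct = toy_unigram_perplexity(correct)
pp_shuffled = toy_unigram_perplexity(shuffled)

# Equal up to floating-point rounding, because the same terms are summed.
print(math.isclose(pp_correct, pp_shuffled, rel_tol=1e-9))
```

The misspelled sentence behaves differently: "chnage" has a much lower smoothed unigram probability than "change", so even the unigram perplexity increases there (2164.55 against 638.27 in the output above).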
