<a href="https://colab.research.google.com/github/afaundez/CS74040-2021-fall/blob/homework-1/homework-1/part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PART I:

(10 points) Do exercise 3.4 from Chapter 3 in the textbook: [https://web.stanford.edu/~jurafsky/slp3/3.pdf](https://web.stanford.edu/~jurafsky/slp3/3.pdf)

We are given the following corpus, modified from the one in the chapter:

    <s> I am Sam </s>
    <s> Sam I am </s>
    <s> I am Sam </s>
    <s> I do not like green eggs and Sam </s>

Using a bigram language model with add-one smoothing, what is P(Sam | am)? Include \<s> and \</s> in your counts just like any other token.

In [2]:
sentences = ['<s> I am Sam </s>', '<s> Sam I am </s>', '<s> I am Sam </s>', '<s> I do not like green eggs and Sam </s>']
sentences = [ sentence.split(' ') for sentence in sentences ]

unigrams_count_by_unigram = {}
for sentence in sentences:
    for word in sentence:
        if word not in unigrams_count_by_unigram:
            unigrams_count_by_unigram[word] = 0
        unigrams_count_by_unigram[word] += 1

vocabulary = set(unigrams_count_by_unigram.keys())

In [3]:
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

bigrams_count_by_bigram = {}
for sentence in sentences:
    for bigram in pairwise(sentence):
        if bigram not in bigrams_count_by_bigram:
            bigrams_count_by_bigram[bigram] = 0
        bigrams_count_by_bigram[bigram] += 1

In [4]:
from collections import defaultdict

MLE_probabilities_by_token_given_condition = {
    (token, condition): (condition_token_count + 1) / (unigrams_count_by_unigram[condition] + len(vocabulary))
    for (condition, token), condition_token_count in bigrams_count_by_bigram.items()
}

MLE_probabilities_by_token_given_condition = defaultdict(lambda: (1 / len(vocabulary)), MLE_probabilities_by_token_given_condition)

MLE_probabilities_by_token_given_condition[('Sam', 'am')]

0.21428571428571427

# PART II:

In this assignment, you will train several language models and will evaluate them on a test corpus. You can discuss in groups, but the homework is to be completed and submitted individually. Two files are provided with this assignment:

1. train.txt
2. test.txt

Each file is a collection of texts, one sentence per line. train.txt contains 10,000 sentences from the NewsCrawl corpus. You will use this corpus to train the language models. The test corpus test.txt is from the same domain and will be used to evaluate the language models that you trained.

## 1.1 PRE-PROCESSING

Prior to training, please complete the following pre-processing steps:


### 1. Pad each sentence in the training and test corpora with start and end symbols (you can use \<s> and \</s>, respectively).

In [5]:
train_sentences = open('train.txt').read().strip().split('\n')
print('TRAIN', train_sentences[1])
assert len(train_sentences) == 100000, f'train must have 10000 sentences, got {len(train_sentences)} instead'

TRAIN Man charged over drugs seizure


In [6]:
test_sentences = open('test.txt').read().strip().split('\n')
print('TEST', test_sentences[2])
assert len(test_sentences) == 100, f'test must have 100 sentences, got {len(test_sentences)} instead'

TEST The road was pitted with tank treads .


In [7]:
def pad_sentence(sentence, start_pad = '<s>', stop_pad = '</s>'):
    return ' '.join([start_pad] + sentence.split() + [stop_pad])

In [8]:
import re

padded_sentence_pattern = re.compile("^<s> .* </s>$")

padded_train_sentences = [ pad_sentence(sentence) for sentence in train_sentences ]
print('TRAIN', train_sentences[1], '->', padded_train_sentences[1])
assert padded_sentence_pattern.match(padded_train_sentences[1]), f'padded_train_sentences[0] must start with <s> and end with </s>, got {padded_train_sentences[0]} instead'

padded_test_sentences = [pad_sentence(sentence) for sentence in test_sentences]
print('TEST', test_sentences[1], '->', padded_test_sentences[2])
assert padded_sentence_pattern.match(padded_test_sentences[2]), f'padded_test_sentences[0] must start with <s> and end with </s>, got {padded_test_sentences[0]} instead'

TRAIN Man charged over drugs seizure -> <s> Man charged over drugs seizure </s>
TEST If you have owned the property for more than three years , you can apply for " taper relief , " by which you can reduce any taxable gain by 5% for each year of ownership , up to a maximum 40% . -> <s> The road was pitted with tank treads . </s>


### 2. Lowercase all words in the training and test corpora. Note that the data already has been tokenized (i.e. the punctuation has been split off words).

In [9]:
lowercased_padded_train_sentences = [ sentence.lower() for sentence in padded_train_sentences ]
print('TRAIN', padded_train_sentences[1], '->', lowercased_padded_train_sentences[1])
assert lowercased_padded_train_sentences[1] == '<s> man charged over drugs seizure </s>', f'lowercased_padded_train_sentences[1] must be "<s> man charged over drugs seizure </s>", got {lowercased_padded_train_sentences[1]} instead'
lowercased_padded_train_vocabulary = set(word for sentence in lowercased_padded_train_sentences for word in sentence.split())

lowercased_padded_test_sentences = [ sentence.lower() for sentence in padded_test_sentences ]
print('TEST', padded_test_sentences[2], '->', lowercased_padded_test_sentences[2])
assert lowercased_padded_test_sentences[2] == '<s> the road was pitted with tank treads . </s>', f'lowercased_padded_test_sentences[2] must be "<s> the road was pitted with tank treads . </s>", got {lowercased_padded_test[1]} instead'
lowercased_padded_test_vocabulary = set(word for sentence in lowercased_padded_test_sentences for word in sentence.split())


TRAIN <s> Man charged over drugs seizure </s> -> <s> man charged over drugs seizure </s>
TEST <s> The road was pitted with tank treads . </s> -> <s> the road was pitted with tank treads . </s>


### 3. Replace all words occurring in the training data once with the token \<unk>. Everyword in the test data not seen in training should be treated as \<unk>.

In [10]:
train_tokens_count_by_token = {}
for sentence in lowercased_padded_train_sentences:
    for word in sentence.split():
        if word not in train_tokens_count_by_token:
            train_tokens_count_by_token[word] = 0
        train_tokens_count_by_token[word] += 1

print(len(train_tokens_count_by_token))
assert train_tokens_count_by_token['<s>'] == 100000, f'start token must be 10000, instead got {train_tokens_count_by_token["<s>"]}'

once_ocurring_words = set(word for word in train_tokens_count_by_token if train_tokens_count_by_token[word] == 1)

assert 'chasnoff' in once_ocurring_words, 'chasnoff must be in once_ocurring_words'

def replace_word(sentence, words, token = '<unk>'):
    return ' '.join([
        token if word in words else word
        for word in sentence.split()
    ])

assert 'chasnoff' in lowercased_padded_train_sentences[2], f'chasnoff must be in {lowercased_padded_train_sentences[2]}'
replaced_lowercased_padded_train_sentences = [ replace_word(sentence, once_ocurring_words) for sentence in lowercased_padded_train_sentences ]
assert 'chasnoff' not in replaced_lowercased_padded_train_sentences[2], f'chasnoff must not be in {replaced_lowercased_padded_train_sentences[2]}'
assert '<unk>' in replaced_lowercased_padded_train_sentences[2], f'<unk> must be in {replaced_lowercased_padded_train_sentences[2]}'
replaced_lowercased_padded_train_vocabulary = set(word for sentence in replaced_lowercased_padded_train_sentences for word in sentence.split())

replaced_lowercased_padded_train_vocabulary = set(word for sentence in replaced_lowercased_padded_train_sentences for word in sentence.split())
assert 'leuthard' not in replaced_lowercased_padded_train_vocabulary, 'leuthard is not in train vocabulary'
assert 'septa' not in replaced_lowercased_padded_train_vocabulary, 'septa is not in train vocabulary (replaced with <unk>)'
replaced_lowercased_padded_train_vocabulary = set(word for sentence in replaced_lowercased_padded_train_sentences for word in sentence.split())

lowercased_padded_test_vocabulary = set(word for sentence in lowercased_padded_test_sentences for word in sentence.split())
assert 'leuthard' in lowercased_padded_test_vocabulary, 'leuthard is in test vocabulary'
assert 'septa' in lowercased_padded_test_vocabulary, 'septa is in test vocabulary'

test_words_not_in_replaced_lowercased_padded_train_vocabulary = lowercased_padded_test_vocabulary.difference(replaced_lowercased_padded_train_vocabulary)
assert 'leuthard' in test_words_not_in_replaced_lowercased_padded_train_vocabulary, 'leuthard is in test but not in train vocabulary'
assert 'septa' in test_words_not_in_replaced_lowercased_padded_train_vocabulary, 'leuthard is in test but not in train vocabulary'

replaced_lowercased_padded_test_sentences = [ replace_word(sentence, test_words_not_in_replaced_lowercased_padded_train_vocabulary) for sentence in lowercased_padded_test_sentences ]
assert 'leuthard' not in replaced_lowercased_padded_train_vocabulary, 'leuthard is not in train vocabulary'
assert 'septa' not in replaced_lowercased_padded_test_sentences[86], f'septa is not in "{replaced_lowercased_padded_test_sentences[86]}" (replaced with <unk>)'
replaced_lowercased_padded_test_vocabulary = set(word for sentence in replaced_lowercased_padded_test_sentences for word in sentence.split())

83045


In [169]:
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def sentenceToWords(sentence, startToken='<s>', stopToken='</s>', unknownToken='<unk>', unknownWords=[]):
    words = sentence.strip().lower().split(' ')
    words = [
        word if word not in unknownWords else unknownToken
        for word in words
    ]
    return [startToken, *words, stopToken]

def textToSentences(text, startToken='<s>', stopToken='</s>', unknownToken='<unk>', unknownWords=[]):
    raw_sentences = text.strip().split('\n')
    unigrams = {}
    bigrams = {}
    sentences = []
    for sentence in raw_sentences:
        token_words = sentenceToWords(sentence, startToken, stopToken, unknownToken, unknownWords)
        for (condition, word) in pairwise(token_words):
            if condition not in bigrams:
                bigrams[condition] = {}
            if word not in bigrams[condition]:
                bigrams[condition][word] = 0
            bigrams[condition][word] += 1
        for token_type in token_words:
            if token_type not in unigrams:
                unigrams[token_type] = 0
            unigrams[token_type] += 1
        sentences.append(token_words)
    return sentences, unigrams, bigrams


In [170]:
train_sentences, train_unigrams, train_bigrams = textToSentences(open('train.txt').read())
print(len(train_sentences))
print(len(train_unigrams))
print(sum(train_unigrams.values()))
print(sum(len(condition_word.values()) for condition_word in train_bigrams.values()))
print(sum(sum(condition_word.values()) for condition_word in train_bigrams.values()))

100000
83045
2568210
809973
2468210


In [171]:
test_sentences, test_unigrams, test_bigrams = textToSentences(open('test.txt').read())
print(len(test_sentences))
print(len(test_unigrams))
print(sum(test_unigrams.values()))
print(sum(len(condition_word.values()) for condition_word in test_bigrams.values()))
print(sum(sum(condition_word.values()) for condition_word in test_bigrams.values()))

100
1249
2869
2421
2769


In [173]:
test_only_unigrams = { token_type: token_words for token_type, token_words in test_unigrams.items() if token_type not in train_unigrams }
print(len(test_only_unigrams.keys()))
print(sum(test_only_unigrams.values()))

test_only_bigrams = { condition: { word: count for word, count in word_count.items() if word in train_bigrams[condition]} for condition, word_count in test_bigrams.items() if condition in train_bigrams }
print(sum(len(token_count.values()) for token_count in test_only_bigrams.values()))
print(sum(sum(token_count.values()) for token_count in test_only_bigrams.values()))

45
46
1699
2044


In [174]:
train_once_unigrams = { word: count for word, count in train_unigrams.items() if count == 1 }
print(len(train_once_unigrams.keys()))
print(sum(train_once_unigrams.values()))

train_sentences_with_replacing, train_unigrams_with_replacing, train_bigrams_with_replacing = textToSentences(open('train.txt').read(), unknownWords=train_once_unigrams)
print(len(train_sentences_with_replacing))
print(len(train_unigrams_with_replacing))
print(sum(train_unigrams_with_replacing.values()))
print(sum(len(condition_word.values()) for condition_word in train_bigrams_with_replacing.values()))
print(sum(sum(condition_word.values()) for condition_word in train_bigrams_with_replacing.values()))

41307
41307
100000
41739
2568210
742294
2468210


In [175]:
test_only_unigrams = { word: count for word, count in test_unigrams.items() if word not in train_unigrams }
print(len(test_only_unigrams.keys()))
print(sum(test_only_unigrams.values()))

test_sentences_with_replacing, test_unigrams_with_replacing, test_bigrams_with_replacing = textToSentences(open('test.txt').read(), unknownWords=(set(test_only_unigrams.keys()).union(set(train_once_unigrams.keys()))))
print(len(test_sentences_with_replacing))
print(len(test_unigrams_with_replacing))
print(sum(test_unigrams_with_replacing.values()))
print(sum(len(condition_word.values()) for condition_word in test_bigrams_with_replacing.values()))
print(sum(sum(condition_word.values()) for condition_word in test_bigrams_with_replacing.values()))

45
46
100
1175
2869
2365
2769


## 1.2 TRAINING THE MODELS

Please use train.txt to train the following language models:

### 1. A unigram maximum likelihood model.

In [186]:
import math

class UnigramModel:
    def __init__(self, unigrams):
        self.unigrams = unigrams
    
    def unigramMLE(self, unigram, log=False):
        denominator = sum(self.unigrams.values())
        if unigram in self.unigrams:
            numerator = self.unigrams[unigram]
            result = numerator / denominator
        else:
            numerator = 0
            result = 0
        equation = f'count({unigram}) / count() = {numerator} / {denominator}'
        return result, equation

    def sentenceMLE(self, sentence, log=False):
        words = sentenceToWords(sentence)
        equation = {}
        for unigram in words:
            probability, probability_equation = self.unigramMLE(unigram, log)
            key = f'P({unigram})'
            equation[key] = probability
            print(f'{key} = {probability_equation} = {probability}')

        result = math.prod(equation.values())
        print(f'P({sentence}) = ', ' * '.join(equation.keys()), ' = ', ' * '.join([ str(v) for v in equation.values()]), ' = ', result, '\n')
        return result

unigramModel = UnigramModel(train_unigrams_with_replacing)
unigramModel.sentenceMLE('man charged over drugs')
unigramModel.sentenceMLE('man charged over drugs lalalllala')

P(<s>) = count(<s>) / count() = 100000 / 2568210 = 0.03893762581720342
P(man) = count(man) / count() = 942 / 2568210 = 0.00036679243519805626
P(charged) = count(charged) / count() = 245 / 2568210 = 9.539718325214838e-05
P(over) = count(over) / count() = 2876 / 2568210 = 0.0011198461185027704
P(drugs) = count(drugs) / count() = 275 / 2568210 = 0.00010707847099730941
P(</s>) = count(</s>) / count() = 100000 / 2568210 = 0.03893762581720342
P(man charged over drugs) =  P(<s>) * P(man) * P(charged) * P(over) * P(drugs) * P(</s>)  =  0.03893762581720342 * 0.00036679243519805626 * 9.539718325214838e-05 * 0.0011198461185027704 * 0.00010707847099730941 * 0.03893762581720342  =  6.361438993281194e-18 

P(<s>) = count(<s>) / count() = 100000 / 2568210 = 0.03893762581720342
P(man) = count(man) / count() = 942 / 2568210 = 0.00036679243519805626
P(charged) = count(charged) / count() = 245 / 2568210 = 9.539718325214838e-05
P(over) = count(over) / count() = 2876 / 2568210 = 0.0011198461185027704
P(dru

0.0

### 2. A bigram maximum likelihood model.

In [14]:
# import math

# assert sum([ count_by_tokens['<s>'] for count_by_tokens in MLE_probabilities_by_token_given_condition.values() if '<s>' in count_by_tokens ]) == 0
# assert train_unigrams_count_by_unigram['401'] == 15, 'there are 15 instances of token 401 in train'
# assert MLE_probabilities_by_unigram['401'] == train_unigrams_count_by_unigram['401'] / train_unigrams_count
# assert set(condition for condition, count_by_tokens in train_bigrams_count_by_bigram.items() if '401' in count_by_tokens) == set(['your', 'corp.', 'their', ',', 'of', 'his', ';', 'a', 'a', 'most', 'in', 'of', 'the', 'his', '<s>'])
# assert math.isclose(sum([ count_by_tokens['401'] * MLE_probabilities_by_unigram[condition] for condition, count_by_tokens in MLE_probabilities_by_token_given_condition.items() if '401' in count_by_tokens ]), MLE_probabilities_by_unigram['401'])

In [195]:
import math

class BigramModel:
    def __init__(self, unigrams, bigrams):
        self.unigrams = unigrams
        self.bigrams = bigrams
    
    def bigramMLE(self, word, condition, log=False):
        if condition in self.bigrams and word in self.bigrams[condition]:
            numerator = self.bigrams[condition][word]
            denominator = self.unigrams[condition]
            result = numerator / denominator
        else:
            numerator = 0
            denominator = 0
            result = 0
        equation = f'count({condition}, {word}) / count({condition}) = {numerator} / {denominator}'
        return result, equation

    def sentenceMLE(self, sentence, log=False):
        words = sentenceToWords(sentence)
        bigrams = list(pairwise(words))

        equation = {}
        for condition, word in bigrams:
            probability, probability_equation = self.bigramMLE(word, condition)
            key = f'P({word} | {condition})'
            equation[key] = probability
            print(f'{key} = {probability}')

        result = math.prod(equation.values())
        print(f'P({sentence}) = ', ' * '.join(equation.keys()), ' = ', ' * '.join([ str(v) for v in equation.values()]), ' = ', result)
        print()
        return result

bigramModel = BigramModel(train_unigrams_with_replacing, train_bigrams_with_replacing)
bigramModel.sentenceMLE('Man charged over drugs seizure')
bigramModel.sentenceMLE('Man charged over drugs seizure lalalala')

P(man | <s>) = 0.00075
P(charged | man) = 0.0074309978768577496
P(over | charged) = 0.02857142857142857
P(drugs | over) = 0.0006954102920723226
P(seizure | drugs) = 0.0036363636363636364
P(</s> | seizure) = 0.07692307692307693
P(Man charged over drugs seizure) =  P(man | <s>) * P(charged | man) * P(over | charged) * P(drugs | over) * P(seizure | drugs) * P(</s> | seizure)  =  0.00075 * 0.0074309978768577496 * 0.02857142857142857 * 0.0006954102920723226 * 0.0036363636363636364 * 0.07692307692307693  =  3.097457984376298e-14

P(man | <s>) = 0.00075
P(charged | man) = 0.0074309978768577496
P(over | charged) = 0.02857142857142857
P(drugs | over) = 0.0006954102920723226
P(seizure | drugs) = 0.0036363636363636364
P(lalalala | seizure) = 0
P(</s> | lalalala) = 0
P(Man charged over drugs seizure lalalala) =  P(man | <s>) * P(charged | man) * P(over | charged) * P(drugs | over) * P(seizure | drugs) * P(lalalala | seizure) * P(</s> | lalalala)  =  0.00075 * 0.0074309978768577496 * 0.028571428571

0.0

### 3. A bigram model with Add-One smoothing.

In [204]:
import math

class SmoothBigramModel:
    def __init__(self, unigrams, bigrams):
        self.unigrams = unigrams
        self.bigrams = bigrams
    
    def bigramMLE(self, word, condition, log=False):
        if condition in self.bigrams and word in self.bigrams[condition]:
            numerator = self.bigrams[condition][word]
        else:
            numerator = 0
        if word in self.unigrams:
            denominator = self.unigrams[word]
        else:
            denominator = 0
        vocab_size = len(self.bigrams.keys())
        equation = f'(count({condition}, {word}) + 1) / (count({condition}) + |V|) = ({numerator} + 1) / ({denominator} + {vocab_size})'

        result = (numerator + 1) / (denominator + vocab_size)
        return result, equation

    def sentenceMLE(self, sentence, log=False):
        words = sentenceToWords(sentence)
        bigrams = list(pairwise(words))

        equation = {}
        for condition, word in bigrams:
            probability, probability_equation = self.bigramMLE(word, condition)
            key = f'P({word} | {condition})'
            equation[key] = probability
            print(f'{key} = {probability_equation} = {probability}')

        result = math.prod(equation.values())
        print(f'P({sentence}) = ', ' * '.join(equation.keys()), ' = ', ' * '.join([ str(v) for v in equation.values()]), ' = ', result)
        print()
        return result

smoothBigramModel = SmoothBigramModel(train_unigrams_with_replacing, train_bigrams_with_replacing)
smoothBigramModel.sentenceMLE('Man charged over drugs seizure')
smoothBigramModel.sentenceMLE('Man charged over drugs seizure lalalala')


P(man | <s>) = (count(<s>, man) + 1) / (count(<s>) + |V|) = (75 + 1) / (942 + 41738) = 0.001780693533270853
P(charged | man) = (count(man, charged) + 1) / (count(man) + |V|) = (7 + 1) / (245 + 41738) = 0.00019055331920062884
P(over | charged) = (count(charged, over) + 1) / (count(charged) + |V|) = (7 + 1) / (2876 + 41738) = 0.00017931590980409736
P(drugs | over) = (count(over, drugs) + 1) / (count(over) + |V|) = (2 + 1) / (275 + 41738) = 7.140646942613001e-05
P(seizure | drugs) = (count(drugs, seizure) + 1) / (count(drugs) + |V|) = (1 + 1) / (13 + 41738) = 4.7903044238461356e-05
P(</s> | seizure) = (count(seizure, </s>) + 1) / (count(seizure) + |V|) = (1 + 1) / (100000 + 41738) = 1.4110541985917679e-05
P(Man charged over drugs seizure) =  P(man | <s>) * P(charged | man) * P(over | charged) * P(drugs | over) * P(seizure | drugs) * P(</s> | seizure)  =  0.001780693533270853 * 0.00019055331920062884 * 0.00017931590980409736 * 7.140646942613001e-05 * 4.7903044238461356e-05 * 1.411054198591

3.5180925719142713e-29

### 4. A bigram model with discounting and Katz backoff. Please use a discount constant of 0.5 (see lecture on smoothing).

In [231]:
class KatzBigramModel:
    def __init__(self, unigrams, bigrams):
        self.unigrams = unigrams
        self.bigrams = bigrams

        self.bigrams_star = {}
        self.leftovers = {}
        for condition, word_count in self.bigrams.items():
            self.leftovers[condition] = 1
            if condition not in self.bigrams_star:
                self.bigrams_star[condition] = {}
            for word, count in word_count.items():
                if word not in self.bigrams_star[condition]:
                    self.bigrams_star[condition][word] = 0
                self.bigrams_star[condition][word] = count - 0.5
                self.leftovers[condition] += 0.5
    
    def unigramMLE(self, unigram, log=False):
        denominator = sum(self.unigrams.values())
        if unigram in self.unigrams:
            numerator = self.unigrams[unigram]
            result = numerator / denominator
        else:
            numerator = 0
            result = 0
        equation = f'count({unigram}) / count() = {numerator} / {denominator}'
        return result, equation

    def bigramMLE(self, word, condition):
        if condition in self.bigrams_star:
            A_condition = set(self.bigrams_star[condition].keys())
        else:
            A_condition = set()

        B_condition = set(self.unigrams.keys()) - A_condition
        if word in A_condition:
            numerator = self.bigrams_star[condition][word]
            denominator = self.unigrams[condition]
            result = numerator / denominator
            return result, f'count*({condition}, {word}) / count({condition}) = {numerator} / {denominator}'
        else:
            alpha_numerator = sum(self.bigrams_star[condition].values())
            alpha_denominator = self.unigrams[condition]
            alpha_condition = 1 - alpha_numerator / alpha_denominator
            print(f'alpha({condition}) = 1 - Sigma(count*({condition}, word) / count({condition})) = 1 - {alpha_numerator} / {alpha_denominator} = {alpha_condition}')
            numerator, _ = self.unigramMLE(word)
            denominator = sum(self.unigramMLE(unigram)[0] for unigram in self.unigrams if unigram in B_condition)
            result = alpha_condition * numerator / denominator
            return result, f'alpha({condition}) * P({word}) / Sigma_w_in_B(P(w)) = {alpha_condition} * {numerator} / {denominator}'


    def sentenceMLE(self, sentence, log=False):
        words = sentenceToWords(sentence)
        bigrams = list(pairwise(words))

        equation = {}
        for condition, word in bigrams:
            probability, probability_equation = self.bigramMLE(word, condition)
            key = f'P({word} | {condition})'
            equation[key] = probability
            print(f'{key} = {probability_equation} = {probability}')

        result = math.prod(equation.values())
        print(f'P({sentence}) = ', ' * '.join(equation.keys()), ' = ', ' * '.join([ str(v) for v in equation.values()]), ' = ', result)
        print()
        return result

katzBigramModel = KatzBigramModel(train_unigrams_with_replacing, train_bigrams_with_replacing)
katzBigramModel.sentenceMLE('Man charged over drugs seizure')
katzBigramModel.sentenceMLE('Man charged over drugs seizure man')

P(man | <s>) = count*(<s>, man) / count(<s>) = 74.5 / 100000 = 0.000745
P(charged | man) = count*(man, charged) / count(man) = 6.5 / 942 = 0.006900212314225053
P(over | charged) = count*(charged, over) / count(charged) = 6.5 / 245 = 0.026530612244897958
P(drugs | over) = count*(over, drugs) / count(over) = 1.5 / 2876 = 0.000521557719054242
P(seizure | drugs) = count*(drugs, seizure) / count(drugs) = 0.5 / 275 = 0.0018181818181818182
P(</s> | seizure) = count*(seizure, </s>) / count(seizure) = 0.5 / 13 = 0.038461538461538464
P(Man charged over drugs seizure) =  P(man | <s>) * P(charged | man) * P(over | charged) * P(drugs | over) * P(seizure | drugs) * P(</s> | seizure)  =  0.000745 * 0.006900212314225053 * 0.026530612244897958 * 0.000521557719054242 * 0.0018181818181818182 * 0.038461538461538464  =  4.974304177587982e-15

P(man | <s>) = count*(<s>, man) / count(<s>) = 74.5 / 100000 = 0.000745
P(charged | man) = count*(man, charged) / count(man) = 6.5 / 942 = 0.006900212314225053
P(over

1.2072568204594095e-19

## 1.3 QUESTIONS

Please answer the questions below:

### 1. (5 points) How many word types (unique words) are there in the training corpus? Please include the end-of-sentence padding symbol \</s> and the unknown token \<unk>. Do not include the start of sentence padding symbol \<s>.

In [19]:
train_unigrams = set(train_unigrams_count_by_unigram.keys()) - {'<s>'}
print('train unique words count', len(train_unigrams))

test_unigrams = set(test_unigrams_count_by_unigram.keys()) - {'<s>'}
print('test unique words count', len(test_unigrams))

train unique words count 41738
test unique words count 1174


### 2. (5 points) How many word tokens are there in the training corpus? Do not include the start of sentence padding symbol \<s>.

In [20]:
train_unigrams_count = sum([ value for word, value in train_unigrams_count_by_unigram.items() if word not in {'<s>'} ])
print('train unigrams words count', train_unigrams_count)

test_unigrams_count = sum([ value for word, value in test_unigrams_count_by_unigram.items() if word not in {'<s>'} ])
print('test unigrams words count', test_unigrams_count)

train unigrams words count 2468210
test unigrams words count 2769


### 3. (10 points) What percentage of word tokens and word types in the test corpus did not occur in training (before you mapped the unknown words to \<unk> in training and test data)? Please include the padding symbol \</s> in your calculations. Do not include the start of sentence padding symbol \<s>.

In [21]:
print(len(lowercased_padded_test_vocabulary - {'<s>'}), len(replaced_lowercased_padded_test_vocabulary - {'<s>'}))

1248 1174


### 4. (15 points) Now replace singletons in the training data with \<unk> symbol and map words (in the test corpus) not observed in training to \<unk>. What percentage of bigrams (bigram types and bigram tokens) in the test corpus did not occur in training (treat \<unk> as a regular token that has been observed). Please include the padding symbol \</s> in your calculations. Do not include the start of sentence padding symbol \<s>.

In [50]:
excluded_unigrams =  {'<s>'}

train_bigram_type = {}
for sentence in replaced_lowercased_padded_train_sentences:
    for bigram in pairwise(sentence.split()):
        if bigram[0] in excluded_unigrams:
            continue
        if bigram not in train_bigram_type:
            train_bigram_type[bigram] = 0
        train_bigram_type[bigram] += 1

test_bigram_types = {}
for sentence in replaced_lowercased_padded_test_sentences:
    for bigram in pairwise(sentence.split()):
        if bigram[0] in excluded_unigrams:
            continue
        if bigram not in test_bigram_types:
            test_bigram_types[bigram] = 0
        test_bigram_types[bigram] += 1

test_bigram_types_count = len(test_bigram_types)
print('bigrams types in test', test_bigram_types_count)

only_test_bigram_types = test_bigram_types.keys() - train_bigram_type.keys()
only_test_bigram_types_count = len(only_test_bigram_types)
print('bigrams types in test but not in train', only_test_bigram_types_count, f'({only_test_bigram_types_count / test_bigram_types_count * 100}%)')


test_bigram_words_count = sum(test_bigram_types.values())
print('bigrams words in test', sum(test_bigram_types.values()))

only_test_bigram_types_words_count = sum([ count for bigram, count in test_bigram_types.items() if bigram in only_test_bigram_types ])
print('bigrams words in test but not in train', only_test_bigram_types_words_count, f'({only_test_bigram_types_words_count / test_bigram_words_count*100}%)')



bigrams types in test 2300
bigrams types in test but not in train 595 (25.869565217391305%)
bigrams words in test 2669
bigrams words in test but not in train 597 (22.367928062944923%)


### 5. (15 points) Compute the log probability of the following sentence under the three models (ignore capitalization and pad each sentence as described above). Please list all of the parameters required to compute the probabilities and show the complete calculation. Which of the parameters have zero values under each model? Use log base 2 in your calculations. Map words not observed in the training corpus to the \<unk> token.

- I look forward to hearing your reply.

In [74]:
unigram_model('I look forward to hearing your reply .', MLE_probabilities_by_unigram)

P(<s>) = 0.03893762581720342
P(i) = 0.0028576323587245593
P(look) = 0.000238687646259457
P(forward) = 0.00018456434637354423
P(to) = 0.02065563174351007
P(hearing) = 8.137963795795515e-05
P(your) = 0.00047387090619536564
P(reply) = 5.061891356236445e-06
P(.) = 0.03422383683577278
P(</s>) = 0.03893762581720342
P(I look forward to hearing your reply .) =  P(<s>) * P(i) * P(look) * P(forward) * P(to) * P(hearing) * P(your) * P(reply) * P(.) * P(</s>)  =  0.03893762581720342 * 0.0028576323587245593 * 0.000238687646259457 * 0.00018456434637354423 * 0.02065563174351007 * 8.137963795795515e-05 * 0.00047387090619536564 * 5.061891356236445e-06 * 0.03422383683577278 * 0.03893762581720342  =  2.633776012975432e-29



2.633776012975432e-29

In [81]:
bigram_model('I look forward to hearing your reply .', MLE_probabilities_by_token_given_condition)

P(i | <s>) = 0.02006
P(look | i) = 0.0020438751873552256
P(forward | look) = 0.05546492659053834
P(to | forward) = 0.2109704641350211
P(hearing | to) = 0.00011310511235107827
P(your | hearing) = 0
P(reply | your) = 0
P(. | reply) = 0
P(</s> | .) = 0.9430450315152342
P(I look forward to hearing your reply .) =  P(i | <s>) * P(look | i) * P(forward | look) * P(to | forward) * P(hearing | to) * P(your | hearing) * P(reply | your) * P(. | reply) * P(</s> | .)  =  0.02006 * 0.0020438751873552256 * 0.05546492659053834 * 0.2109704641350211 * 0.00011310511235107827 * 0 * 0 * 0 * 0.9430450315152342  =  0.0



0.0

### 6. (20 points) Compute the perplexity of the sentence above under each of the models.

In [76]:
bigram_model('I look forward to hearing your reply .', MLE_laplace_probabilities_by_token_given_condition)

P(i | <s>) = 0.014159828981437713
P(look | i) = 0.0003260116549166633
P(forward | look) = 0.0008264072534945221
P(to | forward) = 0.002392627863454386
P(hearing | to) = 7.384978952809984e-05
P(your | hearing) = 2.395840820335897e-05
P(reply | your) = 2.395840820335897e-05
P(. | reply) = 2.395840820335897e-05
P(</s> | .) = 0.6394128038385287
P(I look forward to hearing your reply .) =  P(i | <s>) * P(look | i) * P(forward | look) * P(to | forward) * P(hearing | to) * P(your | hearing) * P(reply | your) * P(. | reply) * P(</s> | .)  =  0.014159828981437713 * 0.0003260116549166633 * 0.0008264072534945221 * 0.002392627863454386 * 7.384978952809984e-05 * 2.395840820335897e-05 * 2.395840820335897e-05 * 2.395840820335897e-05 * 0.6394128038385287  =  5.9274088158499185e-30



5.9274088158499185e-30

In [77]:
bigram_katz_model('I look forward to hearing your reply .', backoff_probabilities, train_discounted_counts_by_bigram, MLE_probabilities_by_unigram, train_unigrams_count_by_unigram)

P(i | <s>) = 0.002857618070562231
P(look | i) = 0.00023867138466748843
P(forward | look) = 0.00018441380449229155
P(to | forward) = 0.02063384310242656
P(hearing | to) = 8.137887092018207e-05
P(your | hearing) = 0.00047273724374026445
P(reply | your) = 5.059811696680759e-06
P(. | reply) = 0.03290753541900613
P(</s> | .) = 0.038937404313875
P(I look forward to hearing your reply .) =  P(i | <s>) * P(look | i) * P(forward | look) * P(to | forward) * P(hearing | to) * P(your | hearing) * P(reply | your) * P(. | reply) * P(</s> | .)  =  0.002857618070562231 * 0.00023867138466748843 * 0.00018441380449229155 * 0.02063384310242656 * 8.137887092018207e-05 * 0.00047273724374026445 * 5.059811696680759e-06 * 0.03290753541900613 * 0.038937404313875  =  6.473009959346732e-28



6.473009959346732e-28

### 7. (20 points) Compute the perplexity of the entire test corpus under each of the models.Discuss the differences in the results you obtained.