<a href="https://colab.research.google.com/github/afaundez/CS74040-2021-fall/blob/homework-1/homework-1/part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PART I:

(10 points) Do exercise 3.4 from Chapter 3 in the textbook: [https://web.stanford.edu/~jurafsky/slp3/3.pdf](https://web.stanford.edu/~jurafsky/slp3/3.pdf)

We are given the following corpus, modified from the one in the chapter:

    <s> I am Sam </s>
    <s> Sam I am </s>
    <s> I am Sam </s>
    <s> I do not like green eggs and Sam </s>

Using a bigram language model with add-one smoothing, what is P(Sam | am)? Include \<s> and \</s> in your counts just like any other token.

In [292]:
p1_corpora = filenameToCorpora('p1.txt', startToken='<s>', stopToken='</s>')
p1_unigrams, p1_bigrams = processGrams(p1_corpora)
p1Model = SmoothBigramModel(p1_unigrams, p1_bigrams)
result, equation = p1Model.bigramMLE('sam', 'am', log=False, verbose=False)
print(equation, ' = ', result)

(count(am, sam) + 1) / (count(am) + |V|) = (2 + 1) / (3 + 11)  =  0.21428571428571427


In [266]:
sentences = ['<s> I am Sam </s>', '<s> Sam I am </s>', '<s> I am Sam </s>', '<s> I do not like green eggs and Sam </s>']
sentences = [ sentence.split(' ') for sentence in sentences ]

unigrams_count_by_unigram = {}
for sentence in sentences:
    for word in sentence:
        if word not in unigrams_count_by_unigram:
            unigrams_count_by_unigram[word] = 0
        unigrams_count_by_unigram[word] += 1

vocabulary = set(unigrams_count_by_unigram.keys())

In [267]:
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

bigrams_count_by_bigram = {}
for sentence in sentences:
    for bigram in pairwise(sentence):
        if bigram not in bigrams_count_by_bigram:
            bigrams_count_by_bigram[bigram] = 0
        bigrams_count_by_bigram[bigram] += 1

In [268]:
from collections import defaultdict

MLE_probabilities_by_token_given_condition = {
    (token, condition): (condition_token_count + 1) / (unigrams_count_by_unigram[condition] + len(vocabulary))
    for (condition, token), condition_token_count in bigrams_count_by_bigram.items()
}

MLE_probabilities_by_token_given_condition = defaultdict(lambda: (1 / len(vocabulary)), MLE_probabilities_by_token_given_condition)

MLE_probabilities_by_token_given_condition[('Sam', 'am')]

0.21428571428571427

# PART II:

In this assignment, you will train several language models and will evaluate them on a test corpus. You can discuss in groups, but the homework is to be completed and submitted individually. Two files are provided with this assignment:

1. train.txt
2. test.txt

Each file is a collection of texts, one sentence per line. train.txt contains 10,000 sentences from the NewsCrawl corpus. You will use this corpus to train the language models. The test corpus test.txt is from the same domain and will be used to evaluate the language models that you trained.

## 1.1 PRE-PROCESSING

Prior to training, please complete the following pre-processing steps:


### 1. Pad each sentence in the training and test corpora with start and end symbols (you can use \<s> and \</s>, respectively).

In [366]:
def loadSentences(filename):
    with open(filename, 'r') as f:
        corpora = f.read().strip()
        return [ sentence.strip().split(' ') for sentence in corpora.split('\n') ]

In [371]:
train_corpora = loadSentences('train.txt')
print('TRAIN', train_corpora[1])
assert len(train_corpora) == 100000, f'train must have 10000 sentences, got {len(train_corpora)} instead'

test_corpora = loadSentences('test.txt')
print('TEST', test_corpora[2])
assert len(test_corpora) == 100, f'test must have 100 sentences, got {len(test_corpora)} instead'

TRAIN ['Man', 'charged', 'over', 'drugs', 'seizure']
TEST ['The', 'road', 'was', 'pitted', 'with', 'tank', 'treads', '.']


In [372]:
def padSentence(sentence, startToken='<s>', stopToken='</s>'):
    return [startToken] + sentence + [stopToken]

In [373]:
train_padded_corpora = [ padSentence(sentence) for sentence in train_corpora ]
print('TRAIN', train_corpora[1], '->', train_padded_corpora[1])

test_padded_corpora = [padSentence(sentence) for sentence in test_corpora]
print('TEST', test_corpora[1], '->', test_padded_corpora[2])

TRAIN ['Man', 'charged', 'over', 'drugs', 'seizure'] -> ['<s>', 'Man', 'charged', 'over', 'drugs', 'seizure', '</s>']
TEST ['If', 'you', 'have', 'owned', 'the', 'property', 'for', 'more', 'than', 'three', 'years', ',', 'you', 'can', 'apply', 'for', '"', 'taper', 'relief', ',', '"', 'by', 'which', 'you', 'can', 'reduce', 'any', 'taxable', 'gain', 'by', '5%', 'for', 'each', 'year', 'of', 'ownership', ',', 'up', 'to', 'a', 'maximum', '40%', '.'] -> ['<s>', 'The', 'road', 'was', 'pitted', 'with', 'tank', 'treads', '.', '</s>']


### 2. Lowercase all words in the training and test corpora. Note that the data already has been tokenized (i.e. the punctuation has been split off words).

In [374]:
def lowerSentence(sentence):
    return [ word.lower() for word in sentence ]


In [376]:
train_lowercased_padded_corpora = [ lowerSentence(sentence) for sentence in train_padded_corpora ]
print('TRAIN', train_padded_corpora[1], '->', train_lowercased_padded_corpora[1])
assert ' '.join(train_lowercased_padded_corpora[1]) == '<s> man charged over drugs seizure </s>', f'train_lowercased_padded_corpora[1] must be "<s> man charged over drugs seizure </s>", got {train_lowercased_padded_corpora[1]} instead'
train_lowercased_padded_vocabulary = set(word for sentence in train_lowercased_padded_corpora for word in sentence)

test_lowercased_padded_corpora = [ lowerSentence(sentence) for sentence in test_padded_corpora ]
print('TEST', test_padded_corpora[2], '->', test_lowercased_padded_corpora[2])
assert ' '.join(test_lowercased_padded_corpora[2]) == '<s> the road was pitted with tank treads . </s>', f'test_lowercased_padded_corpora[2] must be "<s> the road was pitted with tank treads . </s>", got {lowercased_padded_test[1]} instead'
lowercased_padded_test_vocabulary = set(word for sentence in test_lowercased_padded_corpora for word in sentence)

TRAIN ['<s>', 'Man', 'charged', 'over', 'drugs', 'seizure', '</s>'] -> ['<s>', 'man', 'charged', 'over', 'drugs', 'seizure', '</s>']
TEST ['<s>', 'The', 'road', 'was', 'pitted', 'with', 'tank', 'treads', '.', '</s>'] -> ['<s>', 'the', 'road', 'was', 'pitted', 'with', 'tank', 'treads', '.', '</s>']


### 3. Replace all words occurring in the training data once with the token \<unk>. Everyword in the test data not seen in training should be treated as \<unk>.

In [377]:
from itertools import tee

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def sentenceToWords(sentence, startToken='<s>', stopToken='</s>', unknownToken='<unk>', unknownWords=set(), knownWords=set()):
    words = sentence.strip().lower().split(' ')
    words = [
        word if (len(knownWords) > 0 and word in knownWords) or (word not in unknownWords) else unknownToken
        for word in words
    ]
    print(words)
    return [startToken, *words, stopToken]

def replaceWordsInSentence(sentence, wordsForReplacing=[], replacingToken='<unk>'):
    return [
        word if word not in wordsForReplacing else replacingToken
        for word in sentence
    ]

def filenameToCorpora(filename, startToken='<s>', stopToken='</s>', unknownToken='<unk>', unknownWords=[]):
    return [
        replaceWordsInSentence(
            lowerSentence(
                padSentence(sentence, startToken, stopToken)),
            wordsForReplacing=unknownWords, replacingToken=unknownToken)
        for sentence in loadSentences(filename)
    ]

def processGrams(corpora):
    unigrams = {}
    bigrams = {}
    for sentence in corpora:
        for (condition, word) in pairwise(sentence):
            if condition not in bigrams:
                bigrams[condition] = {}
            if word not in bigrams[condition]:
                bigrams[condition][word] = 0
            bigrams[condition][word] += 1
        for token_type in sentence:
            if token_type not in unigrams:
                unigrams[token_type] = 0
            unigrams[token_type] += 1
    return unigrams, bigrams


In [378]:
train_corpora = filenameToCorpora('train.txt', startToken='<s>', stopToken='</s>')
train_unigrams, train_bigrams = processGrams(train_corpora)

In [342]:
print('total sentences', len(train_corpora))

print('total words', sum(len([word for word in sentence if word not in {'<s>', '</s>'}]) for sentence in train_corpora))
print('total words with start/stop', sum(count for count  in train_unigrams.values()))

print('total unigrams types', len([ count for unigram, count in train_unigrams.items() if unigram not in {'<s>', '</s>'} ]))
print('total unigrams types with start/stop', len(train_unigrams))

print('total unigrams words', sum(count for unigram, count in train_unigrams.items() if unigram not in {'<s>', '</s>'}))
print('total unigrams words with start/stop', sum(train_unigrams.values()))

print('total bigrams types with start/stop', len(train_bigrams))
print('total bigrams words with start/stop', sum(len(condition_word.values()) for condition_word in train_bigrams.values()))

total sentences 100000
total words 2368210
total words with start/stop 2568210
total unigrams types 83043
total unigrams types with start/stop 83045
total unigrams words 2368210
total unigrams words with start/stop 2568210
total bigrams types with start/stop 83044
total bigrams words with start/stop 809973


In [379]:
test_corpora = filenameToCorpora('test.txt', startToken='<s>', stopToken='</s>')
test_unigrams, test_bigrams = processGrams(test_corpora)

In [380]:
print('total sentences', len(test_corpora))

print('total words', sum(len([word for word in sentence if word not in {'<s>', '</s>'}]) for sentence in test_corpora))
print('total words with start/stop', sum(count for count  in test_unigrams.values()))

print('total unigrams types', len([ count for unigram, count in test_unigrams.items() if unigram not in {'<s>', '</s>'} ]))
print('total unigrams types with start/stop', len(test_unigrams))

print('total unigrams words', sum(count for unigram, count in test_unigrams.items() if unigram not in {'<s>', '</s>'}))
print('total unigrams words with start/stop', sum(test_unigrams.values()))

print('total bigrams types with start/stop', len(test_bigrams))
print('total bigrams words with start/stop', sum(len(condition_word.values()) for condition_word in test_bigrams.values()))

total sentences 100
total words 2669
total words with start/stop 2869
total unigrams types 1247
total unigrams types with start/stop 1249
total unigrams words 2669
total unigrams words with start/stop 2869
total bigrams types with start/stop 1248
total bigrams words with start/stop 2421


In [381]:
test_only_unigrams = { unigram: count for unigram, count in test_unigrams.items() if unigram not in train_unigrams }

test_only_bigrams = {}
for condition, word_count in test_bigrams.items():
    if condition not in train_bigrams:
        test_only_bigrams[condition] = word_count
    else:
        for word, count in word_count.items():
            if word not in train_bigrams[condition]:
                if condition not in test_only_bigrams:
                    test_only_bigrams[condition] = {}
                test_only_bigrams[condition][word] = count

In [382]:
print('only test unigrams types', len(test_only_unigrams.keys()))
print('only test unigrams words', sum(test_only_unigrams.values()))

print('only test bigram types', sum(len(token_count.values()) for token_count in test_only_bigrams.values()))
print('only test bigram words', sum(sum(token_count.values()) for token_count in test_only_bigrams.values()))

only test unigrams types 45
only test unigrams words 46
only test bigram types 722
only test bigram words 725


In [383]:
train_once_unigrams = { word: count for word, count in train_unigrams.items() if count == 1 }

train_corpora_with_replacing = filenameToCorpora('train.txt', startToken='<s>', stopToken='</s>', unknownToken='<unk>', unknownWords=set(train_once_unigrams.keys()))
train_unigrams_with_replacing, train_bigrams_with_replacing = processGrams(train_corpora_with_replacing)

In [384]:
print('total sentences', len(train_corpora_with_replacing))

print('total words', sum(len([word for word in sentence if word not in {'<s>', '</s>'}]) for sentence in train_corpora_with_replacing))
print('total words with start/stop', sum(count for count  in train_unigrams_with_replacing.values()))

print('total unigrams types', len([ count for unigram, count in train_unigrams_with_replacing.items() if unigram not in {'<s>', '</s>'} ]))
print('total unigrams types with start/stop', len(train_unigrams_with_replacing))

print('total unigrams words', sum(count for unigram, count in train_unigrams_with_replacing.items() if unigram not in {'<s>', '</s>'}))
print('total unigrams words with start/stop', sum(train_unigrams_with_replacing.values()))

print('total bigrams types with start/stop', len(train_bigrams_with_replacing))
print('total bigrams words with start/stop', sum(len(condition_word.values()) for condition_word in train_bigrams_with_replacing.values()))


total sentences 100000
total words 2368210
total words with start/stop 2568210
total unigrams types 41737
total unigrams types with start/stop 41739
total unigrams words 2368210
total unigrams words with start/stop 2568210
total bigrams types with start/stop 41738
total bigrams words with start/stop 742294


In [385]:
train_once_test_only_unigrams = set(train_once_unigrams.keys()).union(set(test_only_unigrams.keys()))

test_corpora_with_replacing = filenameToCorpora('test.txt', startToken='<s>', stopToken='</s>', unknownToken='<unk>', unknownWords=train_once_test_only_unigrams)
test_unigrams_with_replacing, test_bigrams_with_replacing = processGrams(test_corpora_with_replacing)
print(len(test_corpora_with_replacing))
print(len(test_unigrams_with_replacing))
print(sum(test_unigrams_with_replacing.values()))
print(sum(len(condition_word.values()) for condition_word in test_bigrams_with_replacing.values()))
print(sum(sum(condition_word.values()) for condition_word in test_bigrams_with_replacing.values()))

100
1175
2869
2365
2769


## 1.2 TRAINING THE MODELS

Please use train.txt to train the following language models:

### 1. A unigram maximum likelihood model.

In [411]:
import math

class UnigramModel:
    def __init__(self, unigrams):
        self.unigrams = unigrams
    
    def unigramMLE(self, unigram, log=False, verbose=False):
        denominator = sum(self.unigrams.values())
        if unigram in self.unigrams:
            numerator = self.unigrams[unigram]
            result = numerator / denominator
        else:
            numerator = 0
            result = 0
        if log:
            result = math.log(result) if result > 0 else -math.inf
            leftSide = '\log_{2} (\\frac{count(\\texttt{'+ unigram + '})}{count()})'
            rightSide = '\log_{2} (\\frac{' + str(numerator) +  '}{' + str(denominator) + '})'
            key = '\log_{2} (P(\\texttt{'+ unigram + '}))'
        else:
            leftSide = '\\frac{count(\\texttt{'+ unigram + '})}{count()}'
            rightSide = '\\frac{' + str(numerator) +  '}{' + str(denominator) + '}'
            key = 'P(\\texttt{'+ unigram + '})'
        fullEquation = f'{leftSide} = {rightSide}'
        if verbose:
            print(key +  ' &= ' + fullEquation +  ' = ' + str(result) + '\\\\')
        return result, key, fullEquation

    def sentenceMLE(self, sentence, log=False, verbose=False):
        words = sentenceToWords(sentence, knownWords=set(self.unigrams.keys()))
        equation = {}
        for unigram in words:
            probability, probability_key, probability_equation = self.unigramMLE(unigram, log=log, verbose=verbose)
            equation[probability_key] = probability
        if log:
            result = sum(equation.values())
            key = '\log_{2}(P(\\texttt{'+ sentence + '}))'
            operatorSymbol = ' + '
        else:
            result = math.prod(equation.values())
            key = 'P(\\texttt{'+ sentence + '})'
            operatorSymbol = ' \\times '
        leftSide = operatorSymbol.join(equation.keys())
        rightSide = operatorSymbol.join([ str(v) for v in equation.values()])
        fullEquation = key + ' &= ' + leftSide + '\\\\ &= ' + rightSide
        if verbose:
            print(f"{fullEquation} \\\\ &= {result}\n")
        return result, fullEquation

unigramModel = UnigramModel(train_unigrams_with_replacing)
output = unigramModel.sentenceMLE('man charged over drugs', verbose=True)
output = unigramModel.sentenceMLE('man charged over drugs', verbose=True, log=True)
output = unigramModel.sentenceMLE('man charged over drugs lalalllala', verbose=True)
output = unigramModel.sentenceMLE('man charged over drugs lalalllala', verbose=True, log=True)

['man', 'charged', 'over', 'drugs']
P(\texttt{<s>}) &= \frac{count(\texttt{<s>})}{count()} = \frac{100000}{2568210} = 0.03893762581720342\\
P(\texttt{man}) &= \frac{count(\texttt{man})}{count()} = \frac{942}{2568210} = 0.00036679243519805626\\
P(\texttt{charged}) &= \frac{count(\texttt{charged})}{count()} = \frac{245}{2568210} = 9.539718325214838e-05\\
P(\texttt{over}) &= \frac{count(\texttt{over})}{count()} = \frac{2876}{2568210} = 0.0011198461185027704\\
P(\texttt{drugs}) &= \frac{count(\texttt{drugs})}{count()} = \frac{275}{2568210} = 0.00010707847099730941\\
P(\texttt{</s>}) &= \frac{count(\texttt{</s>})}{count()} = \frac{100000}{2568210} = 0.03893762581720342\\
P(\texttt{man charged over drugs}) &= P(\texttt{<s>}) \times P(\texttt{man}) \times P(\texttt{charged}) \times P(\texttt{over}) \times P(\texttt{drugs}) \times P(\texttt{</s>})\\ &= 0.03893762581720342 \times 0.00036679243519805626 \times 9.539718325214838e-05 \times 0.0011198461185027704 \times 0.00010707847099730941 \time

### 2. A bigram maximum likelihood model.

In [282]:
import math

class BigramModel(UnigramModel):
    def __init__(self, unigrams, bigrams):
        super().__init__(unigrams)
        self.bigrams = bigrams
    
    def bigramMLE(self, word, condition, log=False, verbose=False):
        if condition in self.bigrams and word in self.bigrams[condition]:
            bigramCount = self.bigrams[condition][word]
            unigramCount = self.unigrams[condition]
            probability = bigramCount / unigramCount
        else:
            bigramCount = 0
            unigramCount = 0
            probability = 0
        equationText = f'count({condition}, {word}) / count({condition}) = {bigramCount} / {unigramCount}'
        if verbose:
            print(f'P({word} | {condition}) = ', equationText, ' = ', probability)
        return probability, equationText

    def sentenceMLE(self, sentence, log=False, verbose=False):
        words = sentenceToWords(sentence)
        bigrams = list(pairwise(words))

        equation = {}
        for condition, word in bigrams:
            probability, probability_equation = self.bigramMLE(word, condition, log=log, verbose=verbose)
            key = f'P({word} | {condition})'
            equation[key] = probability
            print(f'{key} = {probability_equation} = {probability}')

        result = math.prod(equation.values())
        print(f'P({sentence}) = ', ' * '.join(equation.keys()), ' = ', ' * '.join([ str(v) for v in equation.values()]), ' = ', result)
        print()
        return result

bigramModel = BigramModel(train_unigrams_with_replacing, train_bigrams_with_replacing)
bigramModel.sentenceMLE('Man charged over drugs seizure')
bigramModel.sentenceMLE('Man charged over drugs seizure lalalala')

['man', 'charged', 'over', 'drugs', 'seizure']
['man', 'charged', 'over', 'drugs', 'seizure']
P(man | <s>) = count(<s>, man) / count(<s>) = 75 / 100000 = 0.00075
P(charged | man) = count(man, charged) / count(man) = 7 / 942 = 0.0074309978768577496
P(over | charged) = count(charged, over) / count(charged) = 7 / 245 = 0.02857142857142857
P(drugs | over) = count(over, drugs) / count(over) = 2 / 2876 = 0.0006954102920723226
P(seizure | drugs) = count(drugs, seizure) / count(drugs) = 1 / 275 = 0.0036363636363636364
P(</s> | seizure) = count(seizure, </s>) / count(seizure) = 1 / 13 = 0.07692307692307693
P(Man charged over drugs seizure) =  P(man | <s>) * P(charged | man) * P(over | charged) * P(drugs | over) * P(seizure | drugs) * P(</s> | seizure)  =  0.00075 * 0.0074309978768577496 * 0.02857142857142857 * 0.0006954102920723226 * 0.0036363636363636364 * 0.07692307692307693  =  3.097457984376298e-14

['man', 'charged', 'over', 'drugs', 'seizure', 'lalalala']
['man', 'charged', 'over', 'drugs

0.0

### 3. A bigram model with Add-One smoothing.

In [283]:
import math

class SmoothBigramModel(BigramModel):
    def bigramMLE(self, word, condition, log=False, verbose=False):
        if condition in self.bigrams and word in self.bigrams[condition]:
            numerator = self.bigrams[condition][word]
        else:
            numerator = 0
        if condition in self.unigrams:
            denominator = self.unigrams[condition]
        else:
            denominator = 0
        vocab_size = len(self.unigrams.keys())
        equation = f'(count({condition}, {word}) + 1) / (count({condition}) + |V|) = ({numerator} + 1) / ({denominator} + {vocab_size})'

        result = (numerator + 1) / (denominator + vocab_size)
        return result, equation

smoothBigramModel = SmoothBigramModel(train_unigrams_with_replacing, train_bigrams_with_replacing)
smoothBigramModel.sentenceMLE('Man charged over drugs seizure')
smoothBigramModel.sentenceMLE('Man charged over drugs seizure lalalala')


['man', 'charged', 'over', 'drugs', 'seizure']
['man', 'charged', 'over', 'drugs', 'seizure']
P(man | <s>) = (count(<s>, man) + 1) / (count(<s>) + |V|) = (75 + 1) / (100000 + 41739) = 0.0005361968124510544
P(charged | man) = (count(man, charged) + 1) / (count(man) + |V|) = (7 + 1) / (942 + 41739) = 0.00018743703287176964
P(over | charged) = (count(charged, over) + 1) / (count(charged) + |V|) = (7 + 1) / (245 + 41739) = 0.00019054878048780488
P(drugs | over) = (count(over, drugs) + 1) / (count(over) + |V|) = (2 + 1) / (2876 + 41739) = 6.724195898240502e-05
P(seizure | drugs) = (count(drugs, seizure) + 1) / (count(drugs) + |V|) = (1 + 1) / (275 + 41739) = 4.760317989241681e-05
P(</s> | seizure) = (count(seizure, </s>) + 1) / (count(seizure) + |V|) = (1 + 1) / (13 + 41739) = 4.790189691511784e-05
P(Man charged over drugs seizure) =  P(man | <s>) * P(charged | man) * P(over | charged) * P(drugs | over) * P(seizure | drugs) * P(</s> | seizure)  =  0.0005361968124510544 * 0.00018743703287176

3.517570419932478e-29

### 4. A bigram model with discounting and Katz backoff. Please use a discount constant of 0.5 (see lecture on smoothing).

In [284]:
class KatzBigramModel(BigramModel):
    def __init__(self, unigrams, bigrams):
        BigramModel.__init__(self, unigrams, bigrams)

        self.bigrams_star = {}
        # self.leftovers = {}
        for condition, word_count in self.bigrams.items():
            # self.leftovers[condition] = 1
            if condition not in self.bigrams_star:
                self.bigrams_star[condition] = {}
            for word, count in word_count.items():
                if word not in self.bigrams_star[condition]:
                    self.bigrams_star[condition][word] = 0
                self.bigrams_star[condition][word] = count - 0.5
                # self.leftovers[condition] += 0.5

    def bigramMLE(self, word, condition, log=False, verbose=False):
        if condition in self.bigrams_star:
            A_condition = set(self.bigrams_star[condition].keys())
        else:
            A_condition = set()

        B_condition = set(self.unigrams.keys()) - A_condition
        if word in A_condition:
            numerator = self.bigrams_star[condition][word]
            denominator = self.unigrams[condition]
            result = numerator / denominator
            return result, f'count*({condition}, {word}) / count({condition}) = {numerator} / {denominator}'
        else:
            alpha_numerator = sum(self.bigrams_star[condition].values())
            alpha_denominator = self.unigrams[condition]
            alpha_condition = 1 - alpha_numerator / alpha_denominator
            print(f'alpha({condition}) = 1 - Sigma(count*({condition}, word) / count({condition})) = 1 - {alpha_numerator} / {alpha_denominator} = {alpha_condition}')
            numerator, _ = self.unigramMLE(word)
            denominator = sum(self.unigramMLE(unigram, log=log, verbose=verbose)[0] for unigram in self.unigrams if unigram in B_condition)
            result = alpha_condition * numerator / denominator
            return result, f'alpha({condition}) * P({word}) / Sigma_w_in_B(P(w)) = {alpha_condition} * {numerator} / {denominator}'

katzBigramModel = KatzBigramModel(train_unigrams_with_replacing, train_bigrams_with_replacing)
katzBigramModel.sentenceMLE('Man charged over drugs seizure')
katzBigramModel.sentenceMLE('Man charged over drugs seizure man')

['man', 'charged', 'over', 'drugs', 'seizure']
['man', 'charged', 'over', 'drugs', 'seizure']
P(man | <s>) = count*(<s>, man) / count(<s>) = 74.5 / 100000 = 0.000745
P(charged | man) = count*(man, charged) / count(man) = 6.5 / 942 = 0.006900212314225053
P(over | charged) = count*(charged, over) / count(charged) = 6.5 / 245 = 0.026530612244897958
P(drugs | over) = count*(over, drugs) / count(over) = 1.5 / 2876 = 0.000521557719054242
P(seizure | drugs) = count*(drugs, seizure) / count(drugs) = 0.5 / 275 = 0.0018181818181818182
P(</s> | seizure) = count*(seizure, </s>) / count(seizure) = 0.5 / 13 = 0.038461538461538464
P(Man charged over drugs seizure) =  P(man | <s>) * P(charged | man) * P(over | charged) * P(drugs | over) * P(seizure | drugs) * P(</s> | seizure)  =  0.000745 * 0.006900212314225053 * 0.026530612244897958 * 0.000521557719054242 * 0.0018181818181818182 * 0.038461538461538464  =  4.974304177587982e-15

['man', 'charged', 'over', 'drugs', 'seizure', 'man']
['man', 'charged',

ValueError: too many values to unpack (expected 2)

## 1.3 QUESTIONS

Please answer the questions below:

### 1. (5 points) How many word types (unique words) are there in the training corpus? Please include the end-of-sentence padding symbol \</s> and the unknown token \<unk>. Do not include the start of sentence padding symbol \<s>.

In [345]:
excluded_unigrams =  {'<s>'}
train_unigrams_with_replacing_without_start = set(train_unigrams_with_replacing.keys()) - excluded_unigrams
print('word types in training', len(train_unigrams_with_replacing_without_start))

word types in training 41738


### 2. (5 points) How many word tokens are there in the training corpus? Do not include the start of sentence padding symbol \<s>.

In [346]:
excluded_unigrams =  {'<s>'}

train_unigrams_with_replacing_without_start = sum([ value for word, value in train_unigrams_with_replacing.items() if word not in excluded_unigrams ])
print('word tokens in training', train_unigrams_with_replacing_without_start)

# test_unigrams_with_replacing_without_start = sum([ value for word, value in test_unigrams_with_replacing.items() if word not in {'<s>'} ])
# print('test unigrams words count', test_unigrams_with_replacing_without_start)

word tokens in training 2468210


### 3. (10 points) What percentage of word tokens and word types in the test corpus did not occur in training (before you mapped the unknown words to \<unk> in training and test data)? Please include the padding symbol \</s> in your calculations. Do not include the start of sentence padding symbol \<s>.

In [359]:
excluded_unigrams =  {'<s>'}
test_only_unigrams_without_start = { word: count for word, count in test_only_unigrams.items() if word not in excluded_unigrams }
test_unigrams_without_start = { word: count for word, count in test_unigrams.items() if word not in excluded_unigrams }
print('unigrams types only in test', len(test_only_unigrams_without_start))
print('unigrams types', len(test_unigrams_without_start))
print('percentage unigrams types only in test', len(test_only_unigrams_without_start) / len(test_unigrams_without_start) * 100)


print('unigrams words only in test', sum(test_only_unigrams_without_start.values()))
print('unigrams words', sum(test_unigrams_without_start.values()))
print('percentage unigrams words only in test', sum(test_only_unigrams_without_start.values()) / sum(test_unigrams_without_start.values()) * 100)

unigrams types only in test 45
unigrams types 1248
percentage unigrams types only in test 3.6057692307692304
unigrams words only in test 46
unigrams words 2769
percentage unigrams words only in test 1.6612495485734922


### 4. (15 points) Now replace singletons in the training data with \<unk> symbol and map words (in the test corpus) not observed in training to \<unk>. What percentage of bigrams (bigram types and bigram tokens) in the test corpus did not occur in training (treat \<unk> as a regular token that has been observed). Please include the padding symbol \</s> in your calculations. Do not include the start of sentence padding symbol \<s>.

In [365]:
excluded_unigrams =  {'<s>'}

test_only_bigrams_types_without_start_count = sum(len(word_count.keys()) for condition, word_count in test_only_bigrams.items() if condition not in excluded_unigrams)
print('bigrams types only in test', test_only_bigrams_types_without_start_count)
test_bigrams_types_without_start_count = sum(len(word_count.keys()) for condition, word_count in test_bigrams.items() if condition not in excluded_unigrams)
print('bigrams types in test', test_bigrams_types_without_start_count)
print('percentage bigrams types only in test', test_only_bigrams_types_without_start_count / test_bigrams_types_without_start_count * 100)

test_only_bigrams_words_without_start_count = sum(sum(word_count.values()) for condition, word_count in test_only_bigrams.items() if condition not in excluded_unigrams)
print('bigrams types only in test', test_only_bigrams_words_without_start_count)
test_bigrams_words_without_start_count = sum(sum(word_count.values()) for condition, word_count in test_bigrams.items() if condition not in excluded_unigrams)
print('bigrams types in test', test_bigrams_words_without_start_count)
print('percentage bigrams types only in test', test_only_bigrams_words_without_start_count / test_bigrams_words_without_start_count * 100)

bigrams types only in test 716
bigrams types in test 2353
percentage bigrams types only in test 30.429239269018275
bigrams types only in test 719
bigrams types in test 2669
percentage bigrams types only in test 26.938928437617083


### 5. (15 points) Compute the log probability of the following sentence under the three models (ignore capitalization and pad each sentence as described above). Please list all of the parameters required to compute the probabilities and show the complete calculation. Which of the parameters have zero values under each model? Use log base 2 in your calculations. Map words not observed in the training corpus to the \<unk> token.

- I look forward to hearing your reply.

In [231]:
unigramModel.sentenceMLE('I look forward to hearing your reply .')

P(<s>) = count(<s>) / count() = 100000 / 2568210 = 0.03893762581720342
P(i) = count(i) / count() = 7339 / 2568210 = 0.0028576323587245593
P(look) = count(look) / count() = 613 / 2568210 = 0.000238687646259457
P(forward) = count(forward) / count() = 474 / 2568210 = 0.00018456434637354423
P(to) = count(to) / count() = 53048 / 2568210 = 0.02065563174351007
P(hearing) = count(hearing) / count() = 209 / 2568210 = 8.137963795795515e-05
P(your) = count(your) / count() = 1217 / 2568210 = 0.00047387090619536564
P(reply) = count(reply) / count() = 13 / 2568210 = 5.061891356236445e-06
P(.) = count(.) / count() = 87894 / 2568210 = 0.03422383683577278
P(</s>) = count(</s>) / count() = 100000 / 2568210 = 0.03893762581720342
P(I look forward to hearing your reply .) =  P(<s>) * P(i) * P(look) * P(forward) * P(to) * P(hearing) * P(your) * P(reply) * P(.) * P(</s>)  =  0.03893762581720342 * 0.0028576323587245593 * 0.000238687646259457 * 0.00018456434637354423 * 0.02065563174351007 * 8.137963795795515e-

2.633776012975432e-29

In [232]:
bigramModel.sentenceMLE('I look forward to hearing your reply .')

P(i | <s>) = count(<s>, i) / count(<s>) = 2006 / 100000 = 0.02006
P(look | i) = count(i, look) / count(i) = 15 / 7339 = 0.0020438751873552256
P(forward | look) = count(look, forward) / count(look) = 34 / 613 = 0.05546492659053834
P(to | forward) = count(forward, to) / count(forward) = 100 / 474 = 0.2109704641350211
P(hearing | to) = count(to, hearing) / count(to) = 6 / 53048 = 0.00011310511235107827
P(your | hearing) = count(hearing, your) / count(hearing) = 0 / 0 = 0
P(reply | your) = count(your, reply) / count(your) = 0 / 0 = 0
P(. | reply) = count(reply, .) / count(reply) = 0 / 0 = 0
P(</s> | .) = count(., </s>) / count(.) = 82888 / 87894 = 0.9430450315152342
P(I look forward to hearing your reply .) =  P(i | <s>) * P(look | i) * P(forward | look) * P(to | forward) * P(hearing | to) * P(your | hearing) * P(reply | your) * P(. | reply) * P(</s> | .)  =  0.02006 * 0.0020438751873552256 * 0.05546492659053834 * 0.2109704641350211 * 0.00011310511235107827 * 0 * 0 * 0 * 0.9430450315152342

0.0

In [234]:
smoothBigramModel.sentenceMLE('I look forward to hearing your reply .')

P(i | <s>) = (count(<s>, i) + 1) / (count(<s>) + |V|) = (2006 + 1) / (100000 + 41739) = 0.014159828981437713
P(look | i) = (count(i, look) + 1) / (count(i) + |V|) = (15 + 1) / (7339 + 41739) = 0.0003260116549166633
P(forward | look) = (count(look, forward) + 1) / (count(look) + |V|) = (34 + 1) / (613 + 41739) = 0.0008264072534945221
P(to | forward) = (count(forward, to) + 1) / (count(forward) + |V|) = (100 + 1) / (474 + 41739) = 0.002392627863454386
P(hearing | to) = (count(to, hearing) + 1) / (count(to) + |V|) = (6 + 1) / (53048 + 41739) = 7.384978952809984e-05
P(your | hearing) = (count(hearing, your) + 1) / (count(hearing) + |V|) = (0 + 1) / (209 + 41739) = 2.3839038809955182e-05
P(reply | your) = (count(your, reply) + 1) / (count(your) + |V|) = (0 + 1) / (1217 + 41739) = 2.3279634975323588e-05
P(. | reply) = (count(reply, .) + 1) / (count(reply) + |V|) = (0 + 1) / (13 + 41739) = 2.395094845755892e-05
P(</s> | .) = (count(., </s>) + 1) / (count(.) + |V|) = (82888 + 1) / (87894 + 417

5.728997390142119e-30

In [235]:
katzBigramModel.sentenceMLE('I look forward to hearing your reply .')

P(i | <s>) = count*(<s>, i) / count(<s>) = 2005.5 / 100000 = 0.020055
P(look | i) = count*(i, look) / count(i) = 14.5 / 7339 = 0.001975746014443385
P(forward | look) = count*(look, forward) / count(look) = 33.5 / 613 = 0.05464926590538336
P(to | forward) = count*(forward, to) / count(forward) = 99.5 / 474 = 0.20991561181434598
P(hearing | to) = count*(to, hearing) / count(to) = 5.5 / 53048 = 0.00010367968632182174
alpha(hearing) = 1 - Sigma(count*(hearing, word) / count(hearing)) = 1 - 170.0 / 209 = 0.1866028708133971
P(your | hearing) = alpha(hearing) * P(your) / Sigma_w_in_B(P(w)) = 0.1866028708133971 * 0.00047387090619536564 / 0.6730450391519687 = 0.00013138150695296243
alpha(your) = 1 - Sigma(count*(your, word) / count(your)) = 1 - 874.5 / 1217 = 0.2814297452752671
P(reply | your) = alpha(your) * P(reply) / Sigma_w_in_B(P(w)) = 0.2814297452752671 * 5.061891356236445e-06 / 0.8819099684217718 = 1.6153199827710787e-06
alpha(reply) = 1 - Sigma(count*(reply, word) / count(reply)) = 1 - 

8.095318855724841e-23

### 6. (20 points) Compute the perplexity of the sentence above under each of the models.

### 7. (20 points) Compute the perplexity of the entire test corpus under each of the models.Discuss the differences in the results you obtained.