# Language Modeling using Ngram

In this exercise, we are going to create a bigram language model and several variations of it. We will build one model of each of the following types and calculate its perplexity:
- Unigram Model
- Bigram Model
- Bigram Model with Laplace smoothing
- Bigram Model with Interpolation
- Bigram Model with Kneser-Ney Interpolation

We will also use NLTK, a natural language processing library for Python, to make our lives easier.



In [2]:
# #download corpus
!wget --no-check-certificate https://github.com/ekapolc/nlp_2019/raw/master/HW4/BEST2010.zip
!unzip BEST2010.zip


In [1]:
!wget https://www.dropbox.com/s/jajdlqnp5h0ywvo/tokenized_wiki_sample.csv


In [1]:
#First, we import the necessary libraries: math, nltk (including its bigram helper), and collections.
import math
import nltk
import io
import random
from random import shuffle
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
random.seed(999)

BEST2010 is a free Thai NLP dataset by NECTEC that is commonly used as a standard benchmark for various NLP tasks, including language modeling. It is divided into 4 domains: article, encyclopedia, news, and novel. The data is already tokenized using '|' as a separator.

For example,

ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น|

In [2]:
total_word_count = 0
best2010 = []
with open('BEST2010/news.txt','r',encoding='utf-8') as f:
  for i,line in enumerate(f):
    line=line.strip()[:-1] #remove the trailing |
    total_word_count += len(line.split("|"))
    best2010.append(line)

In [3]:
#For simplicity, we assume that each line is a sentence.
print (f'Total sentences in BEST2010 news dataset :\t{len(best2010)}')
print (f'Total word counts in BEST2010 news dataset :\t{total_word_count}')

Total sentences in BEST2010 news dataset :	30969
Total word counts in BEST2010 news dataset :	1660190


We split the input into two sets, training and test data, with a 70:30 ratio.

In [4]:
sentences = best2010
# The data is separated to train and test set with 70:30 ratio.
train = sentences[:int(len(sentences)*0.7)]
test = sentences[int(len(sentences)*0.7):]

#Training data
train_word_count =0
for line in train:
    for word in line.split('|'):
        train_word_count+=1
print ('Total sentences in BEST2010 news training dataset :\t'+ str(len(train)))
print ('Total word counts in BEST2010 news training dataset :\t'+ str(train_word_count))

Total sentences in BEST2010 news training dataset :	21678
Total word counts in BEST2010 news training dataset :	1042797


Here we load the data from Wikipedia, which is also already tokenized. It will be used for answering questions in MyCourseVille.

In [5]:
import pandas as pd
wiki_data = pd.read_csv("tokenized_wiki_sample.csv")

## Data Preprocessing

Before training any language model, the first step is always to process the data into a format suited to the LM.

For this exercise, we will use NLTK to help process our data.

In [141]:
from nltk.lm.preprocessing import pad_both_ends, flatten
from nltk.lm.vocabulary import Vocabulary
from nltk import ngrams

We begin by "tokenizing" our training set. Note that the data is already tokenized so we can just split it.

In [142]:
tokenized_train = [t.split("|") for t in train] # "tokenize" each sentence

Next we create a vocabulary with the ```Vocabulary``` class from NLTK. It accepts a list of tokens, so we first flatten our sentences into one long sequence.

In [174]:
flat_tokens = list(flatten(tokenized_train)) #join all sentences into one long sentence
vocab = Vocabulary(flat_tokens, unk_cutoff=3) #Words with frequency **below** 3 (not exactly 3) will not be considered in our vocab and will be converted to <UNK>.

In [177]:
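# A token is only "in" the vocabulary once its count reaches unk_cutoff (3 here),
# so we add each padding symbol three times.
vocab.update(["<s>", "</s>"] * 3)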

In [179]:
print('<UNK>' in vocab, '<s>' in vocab, '</s>' in vocab)

True True True


Then we replace low frequency words and pad each sentence with \<s\> in the front and \</s\> in the back of each sentence.

Now *each* sentence is going to look something like this:
\["\<s\>", "hello", "my", "name", "is", "\<UNK\>", "\</s\>" \]

In [9]:
tokenized_train = [[token if token in vocab else "<UNK>" for token in sentence] for sentence in tokenized_train]
padded_tokenized_train = [list(pad_both_ends(sentence, n=2)) for sentence in tokenized_train]
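
To make the two preprocessing steps concrete, here is a minimal standard-library sketch of the same idea on a toy corpus. The helper names `build_vocab` and `preprocess` and the toy data are hypothetical and only for illustration; the notebook itself relies on NLTK's `Vocabulary` and `pad_both_ends`.

```python
from collections import Counter

def build_vocab(sentences, cutoff=3):
    """Keep tokens seen at least `cutoff` times, plus the special symbols."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = {tok for tok, c in counts.items() if c >= cutoff}
    return kept | {"<UNK>", "<s>", "</s>"}

def preprocess(sentence, vocab):
    """Map out-of-vocabulary tokens to <UNK>, then pad for a bigram model."""
    mapped = [tok if tok in vocab else "<UNK>" for tok in sentence]
    return ["<s>"] + mapped + ["</s>"]

# Toy corpus: "a" and "b" occur 3 times each, "c" only once.
toy = [["a", "b", "a"], ["a", "c"], ["b", "b"]]
toy_vocab = build_vocab(toy, cutoff=3)
print(preprocess(["a", "c", "b"], toy_vocab))
```

In this sketch the rare token "c" falls below the cutoff and is therefore replaced by `<UNK>` before padding.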

Finally, we do the same for the test set and the wiki dataset.

In [10]:
tokenized_test = [t.split("|") for t in test]
tokenized_test = [[token if token in vocab else "<UNK>" for token in sentence] for sentence in tokenized_test]
padded_tokenized_test = [list(pad_both_ends(sentence, n=2)) for sentence in tokenized_test]

tokenized_wiki_test = [t.split("|") for t in wiki_data['tokenized'].tolist()]
tokenized_wiki_test = [[token if token in vocab else "<UNK>" for token in sentence] for sentence in tokenized_wiki_test]
padded_tokenized_wiki_test = [list(pad_both_ends(sentence, n=2)) for sentence in tokenized_wiki_test]

# Unigram

In this section, we will demonstrate how to build a unigram language model <br>
**Important note:** <br>
**\<s\>** = sentence start symbol <br>
**\</s\>** = sentence end symbol

# VERY IMPORTANT:
- In this notebook, we will *not* default the unknown token probability to ```1/len(vocab)``` but instead will treat it as a normal word and let the model learn its probability so that we can compare our results to NLTK.
- **Also make sure that the code in this notebook can be executed without any problem. If we find that you used NLTK to answer questions in MyCourseVille and did not finish the assignment, you will receive a grade of 0 for this assignment.**

In [11]:
class UnigramModel():
  def __init__(self, data, vocab):
    self.unigram_count = defaultdict(lambda: 0.0)
    self.word_count = 0
    self.vocab = vocab
    for sentence in data:
        for w in sentence: #[(word1, ), (word2, ), (word3, )...]
          w = w[0]
          if w in self.vocab:
            self.unigram_count[w] +=1.0
          else:
            self.unigram_count["<UNK>"] += 1.0
          self.word_count+=1

  def __getitem__(self, w):
    w = w[0]  #[(word1, ), (word2, ), (word3, )...]
    if w in self.vocab:
      return self.unigram_count[w]/(self.word_count)
    else:
      return self.unigram_count["<UNK>"]/(self.word_count)

In [12]:
train_unigrams = [list(ngrams(sent, n=1)) for sent in padded_tokenized_train] #creating the unigrams by setting n=1
model = UnigramModel(train_unigrams, vocab)

In [13]:
train_unigrams

[[('<s>',),
  ('สงสัย',),
  ('ติด',),
  ('หวัด',),
  ('นก',),
  (' ',),
  ('อีก',),
  ('คน',),
  ('ยัง',),
  ('น่า',),
  ('ห่วง',),
  ('</s>',)],
 ...]

In [14]:
def getLnValue(x):
      return math.log(x)

In [15]:
#The model is indexed by 1-tuples (as produced by ngrams(..., n=1));
#a bare string would be indexed by its first character only.

#probability of 'นายก'
print(getLnValue(model[('นายก',)]))

#for example, the probability of 'นายกรัฐมนตรี', which is an unknown word, is equal to the probability of <UNK>
print(getLnValue(model[('นายกรัฐมนตรี',)]))

#probability of 'นายก' 'ได้' 'ให้' 'สัมภาษณ์' 'กับ' 'สื่อ' followed by the end-of-sentence symbol
words = ['นายก', 'ได้', 'ให้', 'สัมภาษณ์', 'กับ', 'สื่อ', '</s>']
prob = sum(getLnValue(model[(w,)]) for w in words)
print ('Probability of a sentence', math.exp(prob))


# Perplexity

To compare language models, we need to calculate perplexity. In this task, you should write the perplexity calculation code for the unigram model. The resulting perplexity should be around 420.67 and 345.12 on the train and test data, respectively.
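
As a reference for the calculation, perplexity can be written as the exponentiated average negative log-probability. The notation is introduced here for illustration only: $N$ is the number of tokens scored and $P(w_i)$ is the unigram probability the model assigns to token $w_i$.

$$
\mathrm{PP}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i)\right)
$$

Working with the sum of log-probabilities rather than the product of probabilities avoids numerical underflow on long texts.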

## TODO #1 Calculate perplexity

In [16]:
def getLnValue(x):
    return math.log(x)

def calculate_sentence_ln_prob(sentence, model):
    prob = 0
    for word in sentence:
        prob += getLnValue(model[word])
    return prob

def perplexity(test,model):
    sum_ln_prob = 0
    sum_len = 0
    for sent in test:
        sum_ln_prob += calculate_sentence_ln_prob(sent,model)
        sum_len += len(sent)

    pp = math.exp(-sum_ln_prob/sum_len)
   
    return pp

In [17]:
test_unigrams = [list(ngrams(sent, n=1)) for sent in padded_tokenized_test]

In [18]:
print(perplexity(train_unigrams,model))
print(perplexity(test_unigrams,model))

420.6667499545085
345.11942873411994


## Q1 MCV
Calculate the perplexity of the model on the wiki test set and answer in MyCourseVille

In [19]:
wiki_test_unigrams = [list(ngrams(sent, n=1)) for sent in padded_tokenized_wiki_test]

In [20]:
print(perplexity([list(flatten(wiki_test_unigrams))], model))

419.56889624155025


# Bigram

Next, you will create a better language model than the unigram model (which is not much of a baseline). Counting every pair of words that occurs in our corpus by hand would be very tedious. Luckily, NLTK provides a simple helper that simplifies the process.

In [21]:
#example of nltk usage for bigram
sentence = 'I always search google for an answer .'
padded_sentence = list(pad_both_ends(sentence.split(), n=2))

print('This is how nltk generates bigrams.')
for w1,w2 in bigrams(padded_sentence):
    print(w1,w2)
print('\n<s> and </s> are used as the start and end of sentence symbols, respectively.')

This is how nltk generates bigrams.
<s> I
I always
always search
search google
google for
for an
an answer
answer .
. </s>

<s> and </s> are used as the start and end of sentence symbols, respectively.


Now you should be able to implement a bigram model by yourself. You must also write a new perplexity calculation for the bigram model. The resulting perplexity should be around 56.46 and 85.38 on the train and test data, respectively.
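
For reference, a plain maximum-likelihood bigram model and its perplexity can be written as follows. The notation is an assumption made for illustration: $c(\cdot)$ denotes a count in the training data and $N$ is the number of bigrams scored.

$$
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})},
\qquad
\mathrm{PP}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_{i-1})\right)
$$

Note that this estimate is zero for any bigram not seen in training, so an unsmoothed model needs some way to handle unseen bigrams in the test data before taking the logarithm.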

## TODO #3 Write Bigram Model

In [245]:
class BigramModel():
  def __init__(self, data, vocab):
    self.unigram_count = defaultdict(lambda: 0.0)
    self.bigram_count = defaultdict(lambda: 0.0)
    self.vocab = vocab
    # Count the first word of every bigram, i.e. every token except the final </s>.
    for sentence in data:
        for w in sentence: #[(word1, word2), (word2, word3), ...]
          w = w[0]
          if w in self.vocab:
            self.unigram_count[w] += 1.0
          else:
            self.unigram_count["<UNK>"] += 1.0
        # Account for the sentence-final </s>, which never appears as a first word.
        self.unigram_count["</s>"] += 1.0

    # Count bigrams; pairs containing an out-of-vocabulary word share one <UNK> bucket.
    for sentence in data:
        for w1,w2 in sentence:
          if w1 in self.vocab and w2 in self.vocab:
            self.bigram_count[(w1, w2)] += 1.0
          else:
            self.bigram_count["<UNK>"] += 1.0

  def __getitem__(self, bigram):
    w1, w2 = bigram
    if w1 in self.vocab and w2 in self.vocab:
      if self.bigram_count[(w1,w2)] == 0:
        # Unseen bigram: use a pseudo-count of 1 so that log() stays defined
        # (an ad hoc fallback, not a normalized probability estimate).
        return 1.0/self.unigram_count[w1]
      return self.bigram_count[(w1,w2)]/self.unigram_count[w1]
    else:
      return self.bigram_count["<UNK>"]/self.unigram_count["<UNK>"]

## TODO #4 Write Perplexity for Bigram Model

Sum the perplexity score at the sentence level instead of the word level.

In [243]:
def getLnValueBigram(x):
    return math.log(x)

def calculate_sentence_ln_prob(sentence, model):
    prob = 0
    for bigram in sentence:
        prob += getLnValueBigram(model[bigram])
    return prob

def perplexity(bigram_data,model):
    sum_ln_prob = 0
    sum_len = 0
    for sent in bigram_data:
        sum_ln_prob += calculate_sentence_ln_prob(sent, model)
        sum_len += len(sent)

    print(sum_ln_prob, sum_len) #total log-probability and number of bigrams scored
    pp = math.exp(-sum_ln_prob/sum_len)

    return pp

In [235]:
train_bigrams = [list(ngrams(sent, n=2)) for sent in padded_tokenized_train]
test_bigrams = [list(ngrams(sent, n=2)) for sent in padded_tokenized_test]

In [51]:
train_bigrams

[[('<s>', 'สงสัย'),
  ('สงสัย', 'ติด'),
  ('ติด', 'หวัด'),
  ('หวัด', 'นก'),
  ('นก', ' '),
  (' ', 'อีก'),
  ('อีก', 'คน'),
  ('คน', 'ยัง'),
  ('ยัง', 'น่า'),
  ('น่า', 'ห่วง'),
  ('ห่วง', '</s>')],
 ...]

In [246]:
bigram_model_scratch = BigramModel(train_bigrams, vocab)

In [247]:
print("<s>" in bigram_model_scratch.vocab, "</s>" in bigram_model_scratch.vocab, "<UNK>" in bigram_model_scratch.vocab)

True True True


In [248]:
print(bigram_model_scratch.unigram_count['<s>'], bigram_model_scratch.unigram_count['</s>'], bigram_model_scratch.unigram_count['<UNK>'])
print(bigram_model_scratch.bigram_count[('<UNK>')])
print(bigram_model_scratch.bigram_count[('<s>', 'สงสัย')])
# bigram_model_scratch.bigram_count[('<UNK>', ' ')]
print(bigram_model_scratch.unigram_count['แถลง'], bigram_model_scratch.unigram_count['ยอด'], bigram_model_scratch.bigram_count[('แถลง', 'ยอด')])
print(bigram_model_scratch[('จ.นครพนม', 'ว่า')])


21678.0 21678.0 20869.0
41273.0
4.0
390.0 149.0 0.0
0.16666666666666666



In [249]:
print(perplexity([list(flatten(train_bigrams))], bigram_model_scratch))
print(perplexity([list(flatten(test_bigrams))[:17]], bigram_model_scratch))
print(perplexity([list(flatten(test_bigrams))], bigram_model_scratch))

-4129330.95654965 1064475
48.38637922888255
-44.08324796875132 17
13.37158901653334
-2392513.6153604826 626684
45.50104435915717


## Q2 MCV

In [74]:
wiki_test_bigrams = [list(ngrams(sent, n=2)) for sent in padded_tokenized_wiki_test]

In [239]:
print(perplexity([list(flatten(wiki_test_bigrams))],bigram_model_scratch))

-980052.4740734707 215930
93.57384063684898


# Smoothing

N-gram models usually suffer from a sparsity problem: the training data does not contain every possible n-gram. Smoothing techniques can alleviate this problem. In this section, you will implement three basic smoothing methods: Laplace smoothing, interpolation for bigrams, and Kneser-Ney smoothing.

## TODO #5 Write Bigram with Laplace Smoothing (Add-One Smoothing)

The resulting perplexity on the training and test data should be:

    370.23, 361.25 for Laplace smoothing
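
As a hedged illustration of the idea (not a solution that is guaranteed to reproduce the numbers above), add-one smoothing adds 1 to every bigram count and the vocabulary size to the context count. The function name `laplace_bigram_prob` and the toy corpus below are hypothetical and only for illustration.

```python
from collections import Counter

def laplace_bigram_prob(w1, w2, bigram_counts, context_counts, vocab_size):
    """Add-one estimate: (c(w1, w2) + 1) / (c(w1) + |V|)."""
    return (bigram_counts[(w1, w2)] + 1.0) / (context_counts[w1] + vocab_size)

# Toy padded corpus (an assumption for illustration only).
toy = [["<s>", "a", "b", "</s>"], ["<s>", "a", "a", "</s>"]]
toy_vocab = {"<s>", "a", "b", "</s>"}

# Count bigrams and how often each word appears as a context (first word).
toy_bigram_counts = Counter(bg for sent in toy for bg in zip(sent, sent[1:]))
toy_context_counts = Counter(w1 for sent in toy for w1 in sent[:-1])

p_seen = laplace_bigram_prob("a", "b", toy_bigram_counts, toy_context_counts, len(toy_vocab))
p_unseen = laplace_bigram_prob("b", "a", toy_bigram_counts, toy_context_counts, len(toy_vocab))
print(p_seen, p_unseen)
```

Because every word in the vocabulary receives a non-zero probability after any context, unseen bigrams no longer need special handling before taking the logarithm.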

In [None]:
class BigramWithLaplaceSmoothing():

  def __init__(self, data, vocab):
    pass

  def __getitem__(self, bigram):
    pass

model = BigramWithLaplaceSmoothing(train_bigrams, vocab)
print(perplexity([list(flatten(train_bigrams))],model))
print(perplexity([list(flatten(test_bigrams))], model))

## Q3 MCV

In [None]:
print(perplexity([list(flatten(wiki_test_bigrams))],model))

## TODO #6 Write Bigram with Interpolation
Set the lambda values to 0.7 for the bigram, 0.25 for the unigram, and 0.05 for the unknown word.

The resulting perplexity on the training and test data should be:

    70.07, 102.67 for Interpolation
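
With the weights given above, the interpolated estimate can be written as below. The notation is an assumption made for illustration: $c(\cdot)$ is a training count, $N$ is the total number of training tokens, and $|V|$ is the vocabulary size, so the last term is a uniform distribution over the vocabulary.

$$
P_{\text{interp}}(w_i \mid w_{i-1}) = 0.7\,\frac{c(w_{i-1}, w_i)}{c(w_{i-1})} + 0.25\,\frac{c(w_i)}{N} + 0.05\,\frac{1}{|V|}
$$

The three weights sum to 1, which keeps the result a valid probability distribution over $w_i$.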

In [None]:
class BigramWithInterpolation():

  def __init__(self, data, vocab, l = 0.7):
    self.unigram_count = defaultdict(lambda: 0.0)
    self.bigram_count = defaultdict(lambda: 0.0)
    self.total_word_count = 0
    self.vocab = vocab
    self.l = l #l for lambda (weight of the bigram term)
    for sentence in data:
        if not sentence: #skip empty sentences, which have no last word to count
            continue
        for w1, w2 in sentence:
            self.bigram_count[(w1,w2)] += 1.0
            self.unigram_count[w1] += 1.0
            self.total_word_count += 1

        #account for the last word of each sentence
        self.unigram_count[w2] += 1.0
        self.total_word_count += 1

  def __getitem__(self, bigram):
      w1, w2 = bigram
      unigram_prob = self.unigram_count[w2]/self.total_word_count
      #guard against contexts never seen in training to avoid dividing by zero
      bigram_prob = self.bigram_count[(w1,w2)]/self.unigram_count[w1] if self.unigram_count[w1] > 0 else 0.0

      #weights: self.l (default 0.7) for bigram, 0.25 for unigram, 0.05 for the uniform unknown-word term
      return self.l*bigram_prob + 0.25*unigram_prob + 0.05*(1/len(self.vocab))

model = BigramWithInterpolation(train_bigrams, vocab)
print(perplexity([list(flatten(train_bigrams))],model))
print(perplexity([list(flatten(test_bigrams))], model))

## Q4 MCV

In [None]:
print(perplexity([list(flatten(wiki_test_bigrams))],model))

## Language modeling on multiple domains

Sometimes, we do not have enough data to create a language model for a new domain. In that case, we can improvise by combining several models to improve results on the new domain.

In this exercise you will try to merge two language models from news and article domains to create a language model for the encyclopedia domain.

In [None]:
# create encyclopedia data (test data)
encyclo_data=[]
with open('BEST2010/encyclopedia.txt','r',encoding='utf-8') as f:
    for i,line in enumerate(f):
        encyclo_data.append(line.strip()[:-1])

(news) First, you should calculate the perplexity of your bigram model with interpolation on the encyclopedia data. The perplexity should be around 236.329.

In [None]:
encyclopedia_bigrams = ...

In [None]:
# 1) news only on "encyclopedia"
print(perplexity([list(flatten(encyclopedia_bigrams))], model))

## TODO #7 - Language Modeling on Multiple Domains
Combine news and article datasets to create another bigram model and evaluate it on the encyclopedia data.



(article) For your information, a bigram model with interpolation trained on the article data and tested on the encyclopedia data has a perplexity of 218.55.

In [None]:
# 2) article only on "encyclopedia"
best2010_article=[]
with open('BEST2010/article.txt','r',encoding='utf-8') as f:
    for i,line in enumerate(f):
        best2010_article.append(line.strip()[:-1])

combined_total_word_count = 0
for line in best2010_article:
    combined_total_word_count += len(line.split('|'))

article_bigrams = ...
article_vocab = ...

model = BigramWithInterpolation(article_bigrams, article_vocab)

In [None]:
print('Perplexity of the bigram model using article data with interpolation smoothing on encyclopedia test data',perplexity([list(flatten(encyclopedia_bigrams))], model))

In [None]:
# 3) train on news + article, test on "encyclopedia"
best2010_article_and_news = best2010_article.copy()
with open('BEST2010/news.txt','r',encoding='utf-8') as f:
    for i,line in enumerate(f):
        best2010_article_and_news.append(line.strip()[:-1])

combined_bigrams = ...
combined_vocab = ...

combined_model = BigramWithInterpolation(combined_bigrams, combined_vocab)
print('Perplexity of the combined Bigram model with interpolation smoothing on encyclopedia test data',perplexity([list(flatten(encyclopedia_bigrams))], combined_model))

## Q5 MCV

Did you get a better or worse result when using combined data?

## TODO #8 - Kneser-Ney on "News"

<!-- Reimplement equation 4.33 in SLP textbook (https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf) -->

Implement a bigram Kneser-Ney LM. The resulting perplexity should be around 65.81 and 92.88 on the train and test data, respectively. Be careful not to mix up the vocab from the section above!
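
The following is a minimal sketch of interpolated Kneser-Ney for bigrams on a toy corpus, not a reference solution. The function name `train_kneser_ney`, the toy data, and the discount of 0.75 are assumptions made for illustration; the assignment does not state which discount yields the expected perplexities. The sketch also assumes the context word was seen in training.

```python
from collections import Counter, defaultdict

def train_kneser_ney(sentences, discount=0.75):
    """Return prob(w1, w2) for an interpolated Kneser-Ney bigram model."""
    bigram_counts = Counter(bg for sent in sentences for bg in zip(sent, sent[1:]))
    context_counts = Counter()        # c(w1): how often w1 starts a bigram
    followers = defaultdict(set)      # w1 -> distinct words seen after w1
    preceders = defaultdict(set)      # w2 -> distinct words seen before w2
    for (w1, w2), c in bigram_counts.items():
        context_counts[w1] += c
        followers[w1].add(w2)
        preceders[w2].add(w1)
    n_bigram_types = len(bigram_counts)

    def prob(w1, w2):
        # Continuation probability: share of bigram types that end in w2.
        p_cont = len(preceders.get(w2, ())) / n_bigram_types
        # Discounted bigram estimate, floored at zero.
        discounted = max(bigram_counts[(w1, w2)] - discount, 0.0) / context_counts[w1]
        # Mass removed by discounting, redistributed via p_cont.
        backoff_weight = discount * len(followers.get(w1, ())) / context_counts[w1]
        return discounted + backoff_weight * p_cont

    return prob

# Toy padded corpus (an assumption for illustration only).
toy = [["<s>", "a", "b", "</s>"], ["<s>", "a", "a", "</s>"]]
kn_prob = train_kneser_ney(toy)
kn_total = sum(kn_prob("a", w) for w in ["<s>", "a", "b", "</s>"])
print(kn_total)
```

The design point worth noting is that the lower-order term uses how many distinct contexts a word follows rather than its raw frequency, and the backoff weight is exactly the probability mass removed by discounting, so the distribution over the next word still sums to 1.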


In [None]:
class BigramKneserNey():

  def __init__(self, data, vocab):
    pass


  def __getitem__(self, x):
    pass

model = BigramKneserNey(train_bigrams, vocab)
print(perplexity([list(flatten(train_bigrams))],model))
print(perplexity([list(flatten(train_bigrams))[:1000]],model))
print(perplexity([list(flatten(test_bigrams))[:1000]], model))
print(perplexity([list(flatten(test_bigrams))], model))

## Q6 MCV

In [None]:
print(perplexity([list(flatten(wiki_test_bigrams))],model))