# Language Modeling using Ngram

In this exercise, you will use NLTK, a natural language processing library for Python, to create a bigram language model and its variations. You will build one model of each of the following types and calculate its perplexity:
- Unigram Model
- Bigram Model
- Bigram Model with Laplace smoothing
- Bigram Model with Interpolation
- Bigram Model with Kneser-Ney Interpolation
- Neural LM (optional)



In [None]:
#download corpus
!wget --no-check-certificate https://github.com/ekapolc/nlp_2019/raw/master/HW4/BEST2010.zip
!unzip BEST2010.zip

--2021-02-09 13:35:08--  https://github.com/ekapolc/nlp_2019/raw/master/HW4/BEST2010.zip
Resolving github.com (github.com)... 140.82.121.4
Connecting to github.com (github.com)|140.82.121.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ekapolc/nlp_2019/master/HW4/BEST2010.zip [following]
--2021-02-09 13:35:09--  https://raw.githubusercontent.com/ekapolc/nlp_2019/master/HW4/BEST2010.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7423530 (7.1M) [application/zip]
Saving to: ‘BEST2010.zip’


2021-02-09 13:35:10 (73.6 MB/s) - ‘BEST2010.zip’ saved [7423530/7423530]

Archive:  BEST2010.zip
   creating: BEST2010/
  inflating: BEST2010/article.txt    
  inflating: BEST2010/encyclopedia.txt  
  inflati

In [None]:
# First, import the necessary libraries: math, nltk (for bigrams), and collections.
import math
import nltk
import io
import random
from random import shuffle
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
random.seed(999)

BEST2010 is a free Thai NLP dataset from NECTEC that is commonly used as a standard benchmark for various NLP tasks, including language modeling. BEST2010 is divided into four domains: article, encyclopedia, news, and novel. The data is already tokenized, using '|' as a separator.

For example,

ตาม|ที่|นางประนอม ทองจันทร์| |กับ| |ด.ช.กิตติพงษ์ แหลมผักแว่น| |และ| |ด.ญ.กาญจนา กรองแก้ว| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|ขณะ|นี้|ยัง|ไม่|ดี|ขึ้น|
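As a minimal sketch of how such a line turns into tokens, the snippet below parses a shortened version of the example above. The variable names `raw_line` and `tokens` are illustrative only; the strip-then-split steps mirror what the loading cell in this notebook does.

```python
# Shortened sample in the BEST2010 style: tokens separated by '|',
# with a trailing separator before the newline.
raw_line = "ตาม|ที่| |กับ| |และ| |ป่วย|สงสัย|ติด|เชื้อ|ไข้|\n"

# Strip the newline, drop the trailing '|', then split on the separator.
tokens = raw_line.strip()[:-1].split("|")

print(len(tokens))  # 12
print(tokens[:3])   # the third token is a single space, kept as a token
```

Note that spaces are tokens in their own right, so they count toward both the vocabulary and the word count.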

In [None]:
# We choose the news domain as our dataset
best2010=[]
with io.open('BEST2010/news.txt','r',encoding='utf-8') as fp:
    for line in fp:
        # strip the newline, then drop the trailing '|' separator
        best2010.append(line.strip()[:-1])
all_vocabulary =set()
total_word_count =0
for line in best2010:
    for word in line.split('|'):        
        all_vocabulary.add(word)
        total_word_count+=1

In [None]:
# For simplicity, we assume that each line is a sentence.
print ('Total sentences in BEST2010 news dataset :\t'+ str(len(best2010)))
print ('Total word counts in BEST2010 news dataset :\t'+ str(total_word_count))
print ('Total vocabulary in BEST2010 news dataset :\t'+ str(len(all_vocabulary)))

Total sentences in BEST2010 news dataset :	30969
Total word counts in BEST2010 news dataset :	1660190
Total vocabulary in BEST2010 news dataset :	35488


We separate our input into two sets, train and test, with a 70:30 ratio.

In [None]:
sentences = best2010
# The data is separated to train and test set with 70:30 ratio.
train = sentences[:int(len(sentences)*0.7)]
test = sentences[int(len(sentences)*0.7):]

#Training data
train_vocabulary =set()
train_word_count =0
for line in train:
    for word in line.split('|'):        
        train_vocabulary.add(word)
        train_word_count+=1
print ('Total sentences in BEST2010 news training dataset :\t'+ str(len(train)))
print ('Total word counts in BEST2010 news training dataset :\t'+ str(train_word_count))
print ('Total vocabulary in BEST2010 news training dataset :\t'+ str(len(train_vocabulary)))
# We will use 1/vocab_size as the default probability for unknown words
unk_value = math.pow(len(train_vocabulary),-1)

Total sentences in BEST2010 news training dataset :	21678
Total word counts in BEST2010 news training dataset :	1042797
Total vocabulary in BEST2010 news training dataset :	26240


In [None]:
def resultEvaluator(TAG, gt, pp) :
    if gt==pp :
        print("Check on {}: {} ".format(TAG, pp==gt))
    else :
        print("Check on {} which ground truth = {}, calculated perplexity = {}".format(TAG, gt, pp))

## Utilities

In [None]:
# https://www.reddit.com/r/Python/comments/27crqg/making_defaultdict_create_defaults_that_are_a/
class key_dependent_dict(defaultdict):
    def __init__(self,f_of_x):
        super().__init__(None) # base class doesn't get a factory
        self.f_of_x = f_of_x # save f(x)
    def __missing__(self, key): # called when a default needed
        ret = self.f_of_x(key) # calculate default value
        self[key] = ret # and install it in the dict
        return ret

In [None]:
def getCountingTable(data) :
    bi = defaultdict(lambda: defaultdict(lambda: 0.0))
    uni = defaultdict(lambda: 0.0)
    for sentence in data:
        for (w1, w2) in bigrams(sentence.split("|"), pad_right=True, pad_left=True,left_pad_symbol='<s>', right_pad_symbol='</s>' ) :
            bi[w1][w2] += 1
            uni[w1] += 1
    uni['</s>'] = int(uni['<s>'])
    return bi, uni

# Unigram

In this section, we will demonstrate how to build a unigram language model <br>
**Important note:** <br>
**\<s\>** = sentence start symbol <br>
**\</s\>** = sentence end symbol 

In [None]:
def getUnigramModel(data):
    model = defaultdict(lambda: 0)
    word_count =0
    for sentence in data:
        sentence +=  u'|</s>' #for unigram model we can always ignore <s>, since p(w0=<s>)=1
        for w1 in sentence.split('|'):
            model[w1] +=1.0
            word_count+=1
    for w1 in model:
        model[w1] = model[w1]/(word_count)
    return model

In [None]:
unimodel = getUnigramModel(train)

In [None]:
def getLnValue(x):
    if x >0.0:
        return math.log(x)
    else:
        return math.log(unk_value)

In [None]:
# log probability of 'นายก'
print(getLnValue(unimodel[u'นายก']))
# for example, the log probability of 'นายกรัฐมนตรี', which is an unknown word, falls back to ln(unk_value)
print(getLnValue(unimodel[u'นายกรัฐมนตรี']))
# probability of the sentence 'นายก' 'ได้' 'ให้' 'สัมภาษณ์' 'กับ' 'สื่อ', including the end symbol
prob = getLnValue(unimodel[u'นายก'])+getLnValue(unimodel[u'ได้'])+ getLnValue(unimodel[u'ให้'])+getLnValue(unimodel[u'สัมภาษณ์'])+getLnValue(unimodel[u'กับ'])+getLnValue(unimodel[u'สื่อ'])+getLnValue(unimodel['</s>'])
print ('Probability of a sentence', math.exp(prob))

-6.551526663995246
-10.175040243058024
Probability of a sentence 5.617210748667918e-18


# Perplexity

To compare language models, we need to calculate perplexity. In this task, you should write the perplexity calculation code for the unigram model. The resulting perplexity should be around 556.39 and 476.07 on the train and test data, respectively.
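Perplexity is the exponential of the negative average log probability per token. The sketch below illustrates the formula on toy numbers; the helper name `toy_perplexity` is hypothetical and not part of the exercise code.

```python
import math

def toy_perplexity(log_probs):
    """Perplexity from a list of per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/4 to every token is as uncertain as a
# fair choice among 4 options, so its perplexity is approximately 4.
uniform_log_probs = [math.log(0.25)] * 10
print(toy_perplexity(uniform_log_probs))
```

A lower perplexity means the model assigns higher probability to the evaluated text.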

# TODO #0
**Fill your name and ID here** <br>
**Name**: Thammakorn Kobkuachaiyapong
<br>
**ID**: 6030272021

## TODO #1 **Calculate perplexity**

In [None]:
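def uni_calculate_sentence_ln_prob(sentence, model):
    # Sum the log probabilities of all tokens in a '|'-separated sentence.
    total = 0
    words = sentence.split("|")
    for word in words :
        total += getLnValue(model[word])
    return total, len(words)

def uni_perplexity(data,model):
    total = 0
    wc = 0
    for sentence in data :
        sentence += "|</s>"  # count the end-of-sentence token as a word
        s, c = uni_calculate_sentence_ln_prob(sentence, model)
        total += s
        wc += c
    # Perplexity = exp of the negative average log probability per token.
    return math.exp(-1*total/wc)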

In [None]:
# train = 556.3925994212195, test = 476.0687892303532
print(uni_perplexity(train,unimodel))
print(uni_perplexity(test,unimodel))

556.3925994212195
476.0687892303532


## TODO #2 **Please explain why this model gives us such a high perplexity.**

**Your answer**: This model considers only how often each word occurs in the corpus and ignores important information such as meaning, context, and sentence structure. As a result, it is highly uncertain when guessing the next word, which leads to a high perplexity.

# Bigram

Next, you will create a better language model than the unigram model (which does not set a high bar). Counting every pair of words that occurs in the corpus by hand is tedious, so we use a helper from nltk that does it for us.

In [None]:
#example of nltk usage for bigram
sentence = 'I always search google for an answer .'

print('This is how nltk generates bigrams.')
for w1,w2 in bigrams(sentence.split(), pad_right=True, pad_left=True,left_pad_symbol='<s>', right_pad_symbol='</s>'):
    print (w1,w2)
print('<s> and </s> are used as the start and end of sentence symbols.')

This is how nltk generates bigrams.
<s> I
I always
always search
search google
google for
for an
an answer
answer .
. </s>
<s> and </s> are used as the start and end of sentence symbols.


Now you should be able to implement a bigram model yourself. You must also write a new perplexity calculation for the bigram model. The resulting perplexity should be around 58.78 and 146.26 on the train and test data, respectively.
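As a rough illustration of the maximum-likelihood estimate P(w2 | w1) = c(w1, w2) / c(w1), here is a self-contained sketch on a toy corpus. It pads sentences by hand instead of calling nltk, and `toy_bigram_mle` and `toy_corpus` are hypothetical names used only here.

```python
from collections import Counter, defaultdict

def toy_bigram_mle(sentences):
    """MLE bigram probabilities from a list of token lists."""
    pair_counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for w1, w2 in zip(padded, padded[1:]):
            pair_counts[w1][w2] += 1
    # Normalize each history's counts so that its row sums to 1.
    return {
        w1: {w2: c / sum(following.values()) for w2, c in following.items()}
        for w1, following in pair_counts.items()
    }

toy_corpus = [["a", "b"], ["a", "c"], ["b", "a"]]
toy_model = toy_bigram_mle(toy_corpus)
print(toy_model["<s>"])  # 'a' starts 2 of 3 sentences, 'b' starts 1 of 3
print(toy_model["a"])    # 'b', 'c' and '</s>' each follow 'a' once
```

Any bigram that never occurs in the training data gets no entry at all, which is the sparsity problem that smoothing addresses.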

## TODO #3 **Create a Bigram Model**

In [None]:
def getBigramModel(data):
    ###FILL YOUR CODE HERE###
    bi, uni = getCountingTable(data)

    model = defaultdict(lambda: defaultdict(lambda: 0.0))
    for w1 in bi :
        for w2 in bi[w1] :
            model[w1][w2] = bi[w1][w2]/uni[w1]
    return model

bimodel = getBigramModel(train)

## TODO #4 **Calculate Perplexity for Bigram Model**
**Note:** Please review the word count used here. The count comes from `len(sentence.split("|"))`, which includes every word except the start and stop tokens, so it seems the count should be incremented by 2 rather than 1. Is that right?
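To make the counting question concrete, the sketch below pads a short sentence by hand and counts tokens, padded symbols, and bigrams. The helper `pad_and_pair` is hypothetical; it only mimics the padding that `bigrams(..., pad_left=True, pad_right=True)` applies.

```python
def pad_and_pair(tokens):
    """Pad a token list with <s> and </s>, then list its adjacent pairs."""
    padded = ["<s>"] + tokens + ["</s>"]
    return padded, list(zip(padded, padded[1:]))

tokens = "นายก|ได้|ให้|สัมภาษณ์".split("|")
padded, pairs = pad_and_pair(tokens)

# n tokens give n + 2 padded symbols and n + 1 bigrams.
print(len(tokens), len(padded), len(pairs))  # 4 6 5
```

Which of these counts is used as the denominator changes the reported perplexity; the reference values in this notebook were reproduced with `len(words)+2`.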


In [None]:
def bi_calculate_sentence_ln_prob(sentence, model):
    # Sum the log probabilities of all bigrams in a padded '|'-separated sentence.
    total = 0
    words = sentence.split("|")
    for (w1, w2) in bigrams(words, pad_right=True, pad_left=True,left_pad_symbol='<s>', right_pad_symbol='</s>') :
        total += getLnValue(model[w1][w2])
    # return total, len(words)+1
    return total, len(words)+2

def bi_perplexity(data,model):
    total = 0
    wc = 0
    for sentence in data :
        s, c = bi_calculate_sentence_ln_prob(sentence, model)
        total += s
        wc += c
    # Perplexity = exp of the negative average log probability per counted word.
    return math.exp(-1*total/wc)

In [None]:
# train  58.78942889767147
# test 146.26539331038614
print(bi_perplexity(train, bimodel))
print(bi_perplexity(test, bimodel))

58.78942889767147
146.26539331038614


# Smoothing

N-gram models usually suffer from a sparsity problem: the training data does not contain every possible n-gram. Smoothing techniques can alleviate this problem. In this section, you will implement two basic smoothing methods for bigrams: Laplace smoothing and interpolation.
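As a small illustration of the first method, add-one (Laplace) smoothing estimates P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + V), where V is the vocabulary size. The sketch below uses toy counts, and all names in it are hypothetical.

```python
from collections import Counter, defaultdict

def toy_laplace_prob(w1, w2, pair_counts, vocab_size):
    """Add-one estimate: (c(w1, w2) + 1) / (c(w1) + V)."""
    history_count = sum(pair_counts[w1].values())
    return (pair_counts[w1][w2] + 1) / (history_count + vocab_size)

toy_pairs = defaultdict(Counter)
for w1, w2 in [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a")]:
    toy_pairs[w1][w2] += 1
toy_vocab = {"a", "b", "c"}

seen = toy_laplace_prob("a", "b", toy_pairs, len(toy_vocab))    # (2 + 1) / (3 + 3)
unseen = toy_laplace_prob("a", "a", toy_pairs, len(toy_vocab))  # (0 + 1) / (3 + 3)
print(seen, unseen)
```

The denominator depends on the history `w1`, so an unseen bigram still receives a probability conditioned on the word that precedes it.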

## TODO #5 **Bigram with Laplace smoothing (Add-One Smoothing)**

In [None]:
#Laplace Smoothing
def getBigramWithLaplaceSmoothing(data):
    #Fill code here
    bi, uni = getCountingTable(data)

    nvocab = len(uni)
    # Default for an unseen bigram (w1, w2): 1 / (c(w1) + V), conditioned on the history w1.
    model = key_dependent_dict(lambda w1: key_dependent_dict(lambda w2: 1/(uni[w1]+nvocab)))
    for w1 in bi :
        for w2 in bi[w1] :
            model[w1][w2] = (bi[w1][w2]+1)/(uni[w1]+nvocab)
    return model

bilaplacemodel = getBigramWithLaplaceSmoothing(train)
# train 974.8134581679766
# test 1098.1622194979489
print(bi_perplexity(train,bilaplacemodel))
print(bi_perplexity(test,bilaplacemodel))

974.8428254812111
1146.9954981655972


## TODO #6 **Bigram with Interpolation**
The lambda values are 0.7 for the bigram term, 0.25 for the unigram term, and 0.05 for the unknown-word (uniform) term.
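With these weights, the interpolated estimate is 0.7 * P(w2 | w1) + 0.25 * P(w2) + 0.05 / V. The sketch below evaluates that expression on made-up probabilities; `toy_interpolated_prob` and the input numbers are hypothetical.

```python
def toy_interpolated_prob(p_bigram, p_unigram, vocab_size,
                          l2=0.70, l1=0.25, l0=0.05):
    """Linear interpolation of bigram, unigram and uniform estimates."""
    return l2 * p_bigram + l1 * p_unigram + l0 / vocab_size

# Seen bigram: all three terms contribute.
print(toy_interpolated_prob(0.5, 0.1, 1000))
# Unseen bigram of a known word: the unigram and uniform terms remain.
print(toy_interpolated_prob(0.0, 0.1, 1000))
# Unknown word: only the uniform term remains, so the result is never zero.
print(toy_interpolated_prob(0.0, 0.0, 1000))
```

Because the three weights sum to 1, the result stays a valid probability distribution when each component is one.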

In [None]:
#interpolation
def getBigramWithInterpolation(data):
    #Fill code here
    h2 = 0.70
    h1 = 0.25
    h0 = 0.05
    
    bi, uni = getCountingTable(data)
    nvocab = len(uni)

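    # The default closes over `uni`, which is normalized to probabilities below,
    # so an unseen bigram falls back to h1*P(w2) + h0/V at lookup time.
    model = key_dependent_dict(lambda w1: key_dependent_dict(lambda w2: (h1*uni[w2]) + (h0/nvocab)))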
    for w1 in bi :
        for w2 in bi[w1] :
            bi[w1][w2] = bi[w1][w2]/uni[w1]
    
    wc = 0
    for w in uni:
        wc += uni[w]
    
    for w in uni :
        uni[w] /= wc

    for w1 in bi :
        for w2 in bi[w1] :
            model[w1][w2] = (h2*bi[w1][w2]) + (h1*uni[w2]) + (h0/nvocab) 
    return model
    
biipmodel = getBigramWithInterpolation(train)

# train 73.38409869825665
# test 172.67485908813356
print (bi_perplexity(train,biipmodel))        
print (bi_perplexity(test,biipmodel))

73.69859278127929
151.4903747109964


The resulting perplexity on the training and test data should be

    974.81, 1098.16 for Laplace smoothing
    73.38, 172.67 for Interpolation

# Language modeling on multiple domains

Sometimes we do not have enough data to create a language model for a new domain. In that case, we can improvise by combining several models to improve results on the new domain.

In this exercise you will try to merge two language models from news and article domains to create a language model for the encyclopedia domain.

In [None]:
# create encyclopedia data
encyclo_data=[]
with io.open('BEST2010/encyclopedia.txt','r',encoding='utf-8') as fp:
    for line in fp:
        # strip the newline, then drop the trailing '|' separator
        encyclo_data.append(line.strip()[:-1])

In [None]:
# create news and article data
combined_data=[]
for path in ['BEST2010/news.txt', 'BEST2010/article.txt']:
    with io.open(path,'r',encoding='utf-8') as fp:
        for line in fp:
            # strip the newline, then drop the trailing '|' separator
            combined_data.append(line.strip()[:-1])

First, calculate the perplexity of your bigram model with interpolation on the encyclopedia data. The resulting perplexity should be around 656.63.
For your information, a bigram model with interpolation trained on the article data and tested on the encyclopedia data has a perplexity of 459.51.

In [None]:
# print perplexity of the news bigram model with interpolation on encyclopedia data
# 727.3502637212223   
print(bi_perplexity(encyclo_data,biipmodel))

592.1317500708113


## TODO #7 
Write a model that produces a perplexity of 400.0 or less on the encyclopedia data without using encyclopedia data for training. (Hint: Try combining a model trained on news data with a model trained on article data.)
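One possible reading of the hint is a weighted mixture of two separately trained models. The sketch below shows that idea on toy nested dictionaries; `mix_models`, the toy models, and the weight 0.3 are all assumptions for illustration, and the returned function would need adapting before it could be passed to `bi_perplexity`, which expects `model[w1][w2]` lookups.

```python
def mix_models(model_a, model_b, weight_a=0.5):
    """Return prob(w1, w2) = weight_a * P_a(w2|w1) + (1 - weight_a) * P_b(w2|w1)."""
    def prob(w1, w2):
        p_a = model_a.get(w1, {}).get(w2, 0.0)
        p_b = model_b.get(w1, {}).get(w2, 0.0)
        return weight_a * p_a + (1.0 - weight_a) * p_b
    return prob

# Toy conditional distributions for the history "the" in two domains.
toy_news = {"the": {"minister": 0.6, "cat": 0.4}}
toy_article = {"the": {"cat": 0.5, "theory": 0.5}}

toy_mixed = mix_models(toy_news, toy_article, weight_a=0.3)
print(toy_mixed("the", "cat"))     # 0.3 * 0.4 + 0.7 * 0.5
print(toy_mixed("the", "theory"))  # only the article model contributes
```

The mixture weight can be tuned on held-out data from a domain other than the target; simply pooling the training data, as in the cell below, is a simpler alternative.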

In [None]:
# Fill code here
combined_model = getBigramWithInterpolation(combined_data)
# 428.85251789073953 (on combined data)
print('Perplexity of combine Bigram model with interpolation smoothing on encyclopedia test data',bi_perplexity(encyclo_data, combined_model))

Perplexity of combine Bigram model with interpolation smoothing on encyclopedia test data 405.8501128620644


## TODO #8 
## Kneser-Ney on "News"

<!-- Reimplement equation 4.33 in SLP textbook (https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf) -->

Implement a bigram Kneser-Ney LM. The resulting perplexity should be around 54.84 and 124.28 on the train and test data, respectively.
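For reference, interpolated Kneser-Ney for bigrams combines a discounted bigram estimate with a continuation probability: P(w2 | w1) = max(c(w1, w2) - d, 0) / c(w1) + lambda(w1) * P_cont(w2), where lambda(w1) = d * (number of distinct words following w1) / c(w1) and P_cont(w2) is the number of distinct words preceding w2 divided by the number of distinct bigram types. The sketch below is one way to write this on toy pairs, not the reference solution; `toy_kneser_ney` is a hypothetical name, and the fallback for an unseen history is an assumption.

```python
from collections import Counter, defaultdict

def toy_kneser_ney(pairs, d=0.75):
    """Return prob(w1, w2) for interpolated Kneser-Ney over (w1, w2) pairs."""
    counts = defaultdict(Counter)
    for w1, w2 in pairs:
        counts[w1][w2] += 1
    history_total = {w1: sum(c.values()) for w1, c in counts.items()}
    # Continuation count: number of distinct histories each word follows.
    continuation = Counter(w2 for c in counts.values() for w2 in c)
    bigram_types = sum(continuation.values())

    def prob(w1, w2):
        p_cont = continuation[w2] / bigram_types
        if w1 not in counts:
            # Assumption: an unseen history falls back to the continuation probability.
            return p_cont
        discounted = max(counts[w1][w2] - d, 0) / history_total[w1]
        backoff_weight = d * len(counts[w1]) / history_total[w1]
        return discounted + backoff_weight * p_cont
    return prob

toy_pairs = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
toy_kn = toy_kneser_ney(toy_pairs)
print(toy_kn("a", "b"))  # seen bigram
print(toy_kn("a", "a"))  # unseen bigram still gets mass from the continuation term
```

In this toy example the probabilities for history `a` sum to 1 over the words `a`, `b`, and `c`, which is a quick sanity check for any implementation.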


In [None]:
# Fill codehere
def getKNmodel(data) :
    print("Counting table")
    bi, uni = getCountingTable(data)
    
    model = defaultdict(lambda: defaultdict(lambda:0))
    d = 0.75
    cnz = 0

    print("Calculate continuation dict")
    conc = defaultdict(lambda: 0)
    for w1 in bi :
        cnz += len(bi[w1])
        for w2 in uni:
            if w2 in bi[w1] :
                conc[w2] += 1

    print(conc)
    print("Model create")
    for w1 in bi:
        for w2 in bi[w1]:
            model[w1][w2] = (max(bi[w1][w2]-d,0)/uni[w1])+(d*(len(bi[w1])/uni[w1])*(conc[w2]/cnz))
    
    return model
knmodel = getKNmodel(train)
# 55.069882021166585
# 140.76248767095316
print(bi_perplexity(train,knmodel)) 
print(bi_perplexity(test,knmodel))

Counting table
Calculate continuation dict
defaultdict(<function getKNmodel.<locals>.<lambda> at 0x7f1625af9620>, {'สงสัย': 49, 'ติด': 206, 'หวัด': 48, 'นก': 43, 'อีก': 695, 'คน': 723, 'ยัง': 573, ...

## TODO #9
## Neural LM (optional)
Do this on the news corpus that we split into train and test sets at the beginning of this exercise.

In [None]:
#there are many ways to do this. e.g.:
%tensorflow_version 2.x
import numpy as np
import numpy.random
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.optimizers import Adam

#https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/

# Evaluation

In [None]:
resultEvaluator("Unigram-train", 556.3925994212195, uni_perplexity(train,unimodel))
resultEvaluator("Unigram-test", 476.0687892303532, uni_perplexity(test,unimodel))
resultEvaluator("Bigram-train", 58.78942889767147, bi_perplexity(train,bimodel))
resultEvaluator("Bigram-test", 146.26539331038614, bi_perplexity(test,bimodel))
resultEvaluator("Bigram with laplace-train", 974.8134581679766, bi_perplexity(train,bilaplacemodel))
resultEvaluator("Bigram with laplace-test", 1098.1622194979489, bi_perplexity(test,bilaplacemodel))
resultEvaluator("Bigram with interpolation-train", 73.38409869825665, bi_perplexity(train,biipmodel))
resultEvaluator("Bigram with interpolation-test", 172.67485908813356, bi_perplexity(test,biipmodel))
resultEvaluator("Bigram with interpolation-combined", 428.85251789073953, bi_perplexity(encyclo_data,combined_model))
resultEvaluator("Bigram with Kneser-ney-train", 55.069882021166585, bi_perplexity(train,knmodel))
resultEvaluator("Bigram with Kneser-ney-test", 140.76248767095316, bi_perplexity(test,knmodel))

Check on Unigram-train: True 
Check on Unigram-test: True 
Check on Bigram-train: True 
Check on Bigram-test: True 
Check on Bigram with laplace-train which ground truth = 974.8134581679766, calculated perplexity = 974.8428254812111
Check on Bigram with laplace-test which ground truth = 1098.1622194979489, calculated perplexity = 1146.9954981655972
Check on Bigram with interpolation-train which ground truth = 73.38409869825665, calculated perplexity = 73.69859278127929
Check on Bigram with interpolation-test which ground truth = 172.67485908813356, calculated perplexity = 151.4903747109964
Check on Bigram with interpolation-combined which ground truth = 428.85251789073953, calculated perplexity = 405.8501128620644
Check on Bigram with Kneser-ney-train which ground truth = 55.069882021166585, calculated perplexity = 71.14054002208687
Check on Bigram with Kneser-ney-test which ground truth = 140.76248767095316, calculated perplexity = 155.09274968738495
