<a href="https://colab.research.google.com/github/NULabTMN/hw2-aidasharif1365/blob/master/LanguageModeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Your task is to train *character-level* language models. 
You will train unigram, bigram, and trigram character-level models on a collection of books from Project Gutenberg. You will then use these trained English language models to distinguish English documents from Brazilian Portuguese documents in the test set.

In [1]:
import pandas as pd
import httpimport
import collections
import math
import nltk
import time

with httpimport.remote_repo(['lm_helper'], 'https://raw.githubusercontent.com/jasoriya/CS6120-PS2-support/master/utils/'):
  from lm_helper import get_train_data, get_test_data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package mac_morpho to /root/nltk_data...
[nltk_data]   Unzipping corpora/mac_morpho.zip.


This code loads the training and test data. Each dataset is a list of books. Each book contains a list of sentences, and each sentence contains a list of words. For building a character language model, you should join the words of a sentence together with a space character.
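As a small illustration of the joining step described above, the sketch below turns one sentence (a list of words) into a character sequence. The helper name `sentence_to_chars` and the sample sentence are hypothetical and are not part of `lm_helper`.

```python
def sentence_to_chars(words):
    """Join the words of one sentence with single spaces, then split into characters."""
    return list(" ".join(words))

# Hypothetical sample sentence, shaped like one entry of a book (a list of words).
sample_sentence = ["The", "cat", "sat"]
chars = sentence_to_chars(sample_sentence)
print(chars)
```

The space character becomes an ordinary symbol of the model, so word boundaries are learned like any other character transition.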

In [2]:
# get the train and test data
train = get_train_data()
test, test_files = get_test_data()

## 1.1
Collect statistics on the unigram, bigram, and trigram character counts.

If your machine takes a long time to perform this computation, you may save these counts to files in your GitHub repository and load them on request. This is not necessary, however.
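One possible way to cache the counts is sketched below using `pickle`, which handles tuple keys directly. The function names, the toy character sequence, and the file name `bigram_counts_example.pkl` are assumptions made for illustration only.

```python
import collections
import os
import pickle
import tempfile

def count_ngrams(chars, n):
    """Count all n-grams (as tuples) in a sequence of characters."""
    return collections.Counter(zip(*(chars[i:] for i in range(n))))

def save_counts(counts, path):
    """Write the counts to disk as a plain dict."""
    with open(path, "wb") as f:
        pickle.dump(dict(counts), f)

def load_counts(path):
    """Read the counts back and wrap them in a Counter again."""
    with open(path, "rb") as f:
        return collections.Counter(pickle.load(f))

# Toy data standing in for the real training characters.
toy_chars = list("abab")
bigram_counts = count_ngrams(toy_chars, 2)

# Hypothetical cache location; a path inside the repository would work the same way.
path = os.path.join(tempfile.gettempdir(), "bigram_counts_example.pkl")
save_counts(bigram_counts, path)
reloaded = load_counts(path)
```

Only load pickle files that you created yourself, since unpickling untrusted data can execute arbitrary code.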

In [3]:
badcharaceters=['_','-','(',')','*','?','"','!',':',';','[',']','`',"'",',','1','2','3','4','5','6','7','8','9','0','.'
,'&','@', '/', '$', '+', '%', '~']

In [4]:
#build a single character corpus from the training documents: drop unwanted characters and wrap each sentence in start ('*') and end ('**') markers
corpus_char=['*']
for i in range(len(train)):
  for sen in train[i]:
    corpus_char.append('*')
    for word in sen:
      if word not in badcharaceters:
        corpus_char.append(' ')
        for ch in word:
          if ch not in badcharaceters:
            corpus_char.append(ch.lower())
    corpus_char.append('**')

In [5]:
def show_ngram_stat():
  total_unigram=0
  total_bigram=0
  total_trigram=0
  unique_unigram=collections.defaultdict(int)
  unique_bigram=collections.defaultdict(int)
  unique_trigram=collections.defaultdict(int)

  #counting unigrams
  for unigram in corpus_char:
    total_unigram+=1
    unique_unigram[unigram] += 1

  #counting bigrams
  for bigram in nltk.bigrams(corpus_char):
    total_bigram+=1
    unique_bigram[bigram] += 1

  #counting trigrams
  for trigram in nltk.trigrams(corpus_char):
    total_trigram+=1
    unique_trigram[trigram] += 1

  print('Number of total unigrams in the corpus are: ',total_unigram)
  print('Number of unique unigrams in the corpus are: ',len(unique_unigram))
  print('********************************************************************')

  print('Number of total bigrams in the corpus are: ',total_bigram)
  print('Number of unique bigrams in the corpus are: ',len(unique_bigram))
  print('********************************************************************')

  print('Number of total trigrams in the corpus are: ',total_trigram)
  print('Number of unique trigrams in the corpus are: ',len(unique_trigram))
  print('********************************************************************')

show_ngram_stat()

Number of total unigrams in the corpus are:  11328339
Number of unique unigrams in the corpus are:  38
********************************************************************
Number of total bigrams in the corpus are:  11328338
Number of unique bigrams in the corpus are:  670
********************************************************************
Number of total trigrams in the corpus are:  11328337
Number of unique trigrams in the corpus are:  7511
********************************************************************


## 1.2
Calculate the perplexity for each document in the test set using the linear interpolation smoothing method. For determining λs for linear interpolation of the trigram, bigram, and unigram models, you can divide the training data into a new training set (80%) and a held-out set (20%).
Then choose ~10 random pairs of $(\lambda_3, \lambda_2)$ such that $1 > \lambda_3 > \lambda_2 > 0$ and $\sum_{i=1}^3 \lambda_i = 1$ to test on held-out data.
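The random choice of weights can be sketched with rejection sampling, as below. The function name `sample_lambdas` and the fixed seed are assumptions. The sketch also assumes the unigram weight $\lambda_1 = 1 - \lambda_3 - \lambda_2$ should be strictly positive, which the task statement does not say explicitly.

```python
import random

def sample_lambdas(n_pairs=10, seed=0):
    """Draw (lambda3, lambda2, lambda1) triples with 1 > lambda3 > lambda2 > 0 that sum to 1."""
    rng = random.Random(seed)
    triples = []
    while len(triples) < n_pairs:
        lam3 = rng.uniform(0.0, 1.0)
        lam2 = rng.uniform(0.0, 1.0)
        lam1 = 1.0 - lam3 - lam2
        # Keep only draws that satisfy the ordering constraint and
        # leave a positive weight for the unigram model (assumption).
        if 0.0 < lam2 < lam3 < 1.0 and lam1 > 0.0:
            triples.append((lam3, lam2, lam1))
    return triples

candidates = sample_lambdas()
for lam3, lam2, lam1 in candidates:
    print(f"lambda3={lam3:.3f} lambda2={lam2:.3f} lambda1={lam1:.3f}")
```

Each triple can then be scored by its perplexity on the held-out set, and the one with the lowest perplexity kept.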

Some documents in the test set are in Brazilian Portuguese. Identify them as follows: 
  - Sort by perplexity and set a cut-off threshold. All the documents above this threshold score should be categorized as Brazilian Portuguese. 
  - Print the file names (from `test_files`) and perplexities of the documents above the threshold

    ```
        file name, score
        file name, score
        . . .
        file name, score
    ```

  - Copy this list of filenames and manually annotate them as being correctly or incorrectly labeled as Portuguese.
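The sort-and-threshold procedure above can be sketched as follows. Placing the cut-off in the middle of the largest gap between consecutive sorted scores is one possible heuristic, not something the task prescribes, and the file names and scores here are made up for illustration.

```python
def split_by_largest_gap(scores):
    """scores: list of (file_name, perplexity). Returns (threshold, flagged documents)."""
    ranked = sorted(scores, key=lambda item: item[1], reverse=True)
    # Gap between each score and the next lower one.
    gaps = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    # Put the threshold halfway across the largest gap.
    threshold = (ranked[cut][1] + ranked[cut + 1][1]) / 2
    flagged = [item for item in ranked if item[1] > threshold]
    return threshold, flagged

# Hypothetical file names and perplexities.
toy_scores = [("doc_a.txt", 8.5), ("doc_b.txt", 23.9), ("doc_c.txt", 9.1), ("doc_d.txt", 22.4)]
threshold, flagged = split_by_largest_gap(toy_scores)
for name, score in flagged:
    print(f"{name}, {score:.2f}")
```

A hand-picked threshold works equally well when the two groups of scores are as clearly separated as they usually are between languages.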




In [6]:
#dividing the training corpus into 80% training and 20% validation
training_set=corpus_char[0:int(0.8*len(corpus_char))]
validation_set=corpus_char[int(0.8*len(corpus_char)):]

In [7]:
#cleaning the test set, adding start and end markers to each sentence
#a separate name (doc_chars) is used so the training corpus in corpus_char is not overwritten
cleaned_test=[]
for t in test:
  doc_chars=[]
  for sen in t:
    doc_chars.append('*')
    for word in sen:
      if word not in badcharaceters:
        doc_chars.append(' ')
        for ch in word:
          if ch not in badcharaceters:
            doc_chars.append(ch.lower())
    doc_chars.append('**')
  cleaned_test.append(doc_chars)

In [25]:
#making unigram, bigram, trigrams counts & probabilities
def make_ngrams(alpha):
  train_char_unigram=[]
  train_char_bigram=[]
  train_char_trigram=[]

  unigram_c = collections.defaultdict(int)
  bigram_c = collections.defaultdict(int)
  trigram_c = collections.defaultdict(int)
  unigram_p=collections.defaultdict(int)
  bigram_p=collections.defaultdict(int)
  trigram_p=collections.defaultdict(int)

  unigrams_len=0

  #counting unigrams
  for unigram in training_set:
    unigrams_len+=1
    unigram_c[unigram] += 1

  #counting bigrams
  for bigram in nltk.bigrams(training_set):
    bigram_c[bigram] += 1

  #counting trigrams
  for trigram in nltk.trigrams(training_set):
    trigram_c[trigram] += 1

  #finding probabilities in unigram, bigram, and trigrams
  for key in unigram_c:
    unigram_p[key]=(unigram_c[key])/(unigrams_len)

  for key in bigram_c:
    bigram_p[key]=(bigram_c[key])/(unigram_c[key[0]])

  for key in trigram_c:
    trigram_p[key]=(trigram_c[key]+alpha)/(bigram_c[(key[0],key[1])]+alpha*len(unigram_c))

  #fallback probability for n-grams never seen in training (set once, outside the loop)
  trigram_p['UNK']=0.1/(len(bigram_c) + 0.1* len(unigram_c))

  return unigram_p,bigram_p,trigram_p

In [28]:
#finding the best lambda on validation set
def best_lambda(lambda_set,unigram_p,bigram_p,trigram_p):
  for l in lambda_set:
    sum_log=0
    interpolated_p=collections.defaultdict(int)
    for ch in range(2,len(validation_set)):
      interpolated_p[(validation_set[ch-2],validation_set[ch-1],validation_set[ch])]=l[0]*trigram_p[(validation_set[ch-2],validation_set[ch-1],validation_set[ch])]+l[1]*bigram_p[(validation_set[ch-1],validation_set[ch])]+l[2]*unigram_p[validation_set[ch]]
      if interpolated_p[(validation_set[ch-2],validation_set[ch-1],validation_set[ch])]==0:
        interpolated_p[(validation_set[ch-2],validation_set[ch-1],validation_set[ch])]=trigram_p['UNK']
      sum_log+=math.log2(interpolated_p[(validation_set[ch-2],validation_set[ch-1],validation_set[ch])])
    perplexity=2**((-sum_log)/len(validation_set))
    print("Perpelexity for lambda set ["+' '.join([str(elem) for elem in l]) +"] is ",round(perplexity,2))


In [29]:
#an array with different set of lambdas
lambda_set=[[0.1,0.1,0.8],[0.5,0.3,0.2],[0.6,0.3,0.1],[0.7,0.2,0.1],[0.8,0.1,0.1],[0.4,0.3,0.3],[0.05,0.05,0.9],
            [0.4,0.5,0.1],[0.15,0.7,0.15],[0.5,0.4,0.1],[0.3,0.3,0.4],[0.2,0.6,0.2],[0.1,0.2,0.7],[0.9,0.05,0.05]]

In [30]:
alpha=0
unigram_p,bigram_p,trigram_p=make_ngrams(alpha)
best_lambda(lambda_set,unigram_p,bigram_p,trigram_p)

Perpelexity for lambda set [0.1 0.1 0.8] is  12.95
Perpelexity for lambda set [0.5 0.3 0.2] is  8.25
Perpelexity for lambda set [0.6 0.3 0.1] is  7.84
Perpelexity for lambda set [0.7 0.2 0.1] is  7.67
Perpelexity for lambda set [0.8 0.1 0.1] is  7.55
Perpelexity for lambda set [0.4 0.3 0.3] is  8.76
Perpelexity for lambda set [0.05 0.05 0.9] is  14.7
Perpelexity for lambda set [0.4 0.5 0.1] is  8.33
Perpelexity for lambda set [0.15 0.7 0.15] is  9.45
Perpelexity for lambda set [0.5 0.4 0.1] is  8.06
Perpelexity for lambda set [0.3 0.3 0.4] is  9.43
Perpelexity for lambda set [0.2 0.6 0.2] is  9.31
Perpelexity for lambda set [0.1 0.2 0.7] is  12.14
Perpelexity for lambda set [0.9 0.05 0.05] is  7.41


In [43]:
#Finding all perplexities in a test set, using interpolation,  in order to find a threshold
def perplexities_interpol(unigram_p,bigram_p,trigram_p,bestLambda):
  results=[]
  count=0
  for test in cleaned_test:
    sum_log=0
    interpolated_p=collections.defaultdict(int)
    for ch in range(2,len(test)):
      interpolated_p[(test[ch-2],test[ch-1],test[ch])]=bestLambda[0]*trigram_p[(test[ch-2],test[ch-1],test[ch])]+bestLambda[1]*bigram_p[(test[ch-1],test[ch])]+bestLambda[2]*unigram_p[test[ch]]
      if interpolated_p[(test[ch-2],test[ch-1],test[ch])]==0:
        #this unknown fallback is for special characters from other languages; for English-only text the unigram term would already cover every character
        interpolated_p[(test[ch-2],test[ch-1],test[ch])]=trigram_p['UNK']
      sum_log+=math.log2(interpolated_p[(test[ch-2],test[ch-1],test[ch])])
      
    perplexity=2**(-sum_log/len(test))
    results.append((count,perplexity))
    count+=1
  results=sorted(results, key=lambda x: -x[1])
  for i in range(len(results)):
    print(results[i][1])
  return results

In [44]:
#alpha is zero in interpolation
alpha=0
#best lambda from previous step
l=[0.9,0.05,0.05]
unigram_p,bigram_p,trigram_p=make_ngrams(alpha)
results=perplexities_interpol(unigram_p,bigram_p,trigram_p,l)

24.373833198577003
24.07363751344276
23.937818294311814
23.641773031058005
23.618843537381306
23.561812459417595
23.534998878911164
23.41340417621807
23.30565670915398
23.19703522856929
23.05482666992589
22.99330567422551
22.988721183515295
22.950882854170835
22.936954562451586
22.902657068784276
22.836680775663005
22.80204935678454
22.720144645469087
22.409046558607695
9.060471872595729
9.053264626936926
8.926573872086584
8.909810665152946
8.817352567873632
8.81092583487065
8.805629115605779
8.802926910451152
8.755994853396691
8.724397796208107
8.708490109440259
8.707292212037013
8.690714991428734
8.679817031714396
8.678462521139002
8.65743088878441
8.648183622444265
8.646631377385248
8.638702390275489
8.632179257396814
8.629233561273992
8.618694914054243
8.612728463550125
8.570994744840435
8.561037024479669
8.545300796885074
8.52423863464934
8.514187266759041
8.511032718099397
8.493307380024746
8.490442214897612
8.489084118586023
8.477493196070855
8.471061088576977
8.462364984004296


In [35]:
#the perplexities drop sharply from about 22 to about 9, so a threshold of 15 separates the two languages well
#I get the same perplexities in all tests
print('The cutoff threshold is 15, everything more than 15 is Brazilian')
for i in range(len(results)):
  if round(results[i][1],2)>15:
    print('Perplexity='+ str(round(results[i][1],2))+' file is predicted Brazillian, file name is: '+str(test_files[results[i][0]]))
  else:
    print('Perplexity= '+ str(round(results[i][1],2))+' file is predicted English, file name is: '+str(test_files[results[i][0]]))

Threshold is 15, everything more than 15 is Brazilian
Perplexity=24.37 file is predicted Brazillian, file name is: ag94ju07.txt
Perplexity=24.07 file is predicted Brazillian, file name is: ag94ma03.txt
Perplexity=23.94 file is predicted Brazillian, file name is: br94ag01.txt
Perplexity=23.64 file is predicted Brazillian, file name is: br94ma01.txt
Perplexity=23.62 file is predicted Brazillian, file name is: ag94se06.txt
Perplexity=23.56 file is predicted Brazillian, file name is: ag94ab12.txt
Perplexity=23.53 file is predicted Brazillian, file name is: ag94de06.txt
Perplexity=23.41 file is predicted Brazillian, file name is: br94fe1.txt
Perplexity=23.31 file is predicted Brazillian, file name is: ag94jl12.txt
Perplexity=23.2 file is predicted Brazillian, file name is: br94ja04.txt
Perplexity=23.05 file is predicted Brazillian, file name is: ag94ou04.txt
Perplexity=22.99 file is predicted Brazillian, file name is: ag94ag02.txt
Perplexity=22.99 file is predicted Brazillian, file name is:

**Using a threshold of 15 and checking the predictions against the file names, the function detected all non-English texts correctly.**

## 1.3
Build a trigram language model with add-λ smoothing (use λ = 0.1).
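For reference, the add-λ estimate of a trigram probability is $P(c_3 \mid c_1 c_2) = \frac{\mathrm{count}(c_1 c_2 c_3) + \lambda}{\mathrm{count}(c_1 c_2) + \lambda V}$, where $V$ is the number of distinct characters. The sketch below is a minimal version of this formula on toy data. The function name, the toy string, and the choice to count only bigrams that act as a trigram context are assumptions for illustration.

```python
import collections

def add_lambda_trigram_prob(trigram, trigram_counts, bigram_counts, vocab_size, lam=0.1):
    """Add-lambda estimate of P(c3 | c1, c2) from raw counts."""
    context = trigram[:2]
    return (trigram_counts[trigram] + lam) / (bigram_counts[context] + lam * vocab_size)

# Toy training text standing in for the real corpus.
toy = list("abcabd")
trigram_counts = collections.Counter(zip(toy, toy[1:], toy[2:]))
# Count only bigrams that are followed by a character, so each context's distribution sums to 1.
bigram_counts = collections.Counter(zip(toy[:-1], toy[1:-1]))
vocab = sorted(set(toy))

seen = add_lambda_trigram_prob(("a", "b", "c"), trigram_counts, bigram_counts, len(vocab))
unseen = add_lambda_trigram_prob(("a", "b", "a"), trigram_counts, bigram_counts, len(vocab))
print(round(seen, 4), round(unseen, 4))
```

Because a `Counter` returns 0 for missing keys, the same function covers unseen trigrams and unseen contexts without a separate fallback value.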

Sort the test documents by perplexity and perform a check for Brazilian Portuguese documents as above:

  - Observe the perplexity scores and set a cut-off threshold. All the documents above this threshold score should be categorized as Brazilian Portuguese. 
  - Print the file names and perplexities of the documents above the threshold

  ```
      file name, score
      file name, score
      . . .
      file name, score
  ```

  - Copy this list of filenames and manually annotate them for correctness.

In [38]:
#Finding all perplexities in a test set, using smoothing,  in order to find a threshold
def perplexities_smoothing(trigram_p):
  results=[]
  count=0
  for test in cleaned_test:
    sum_log=0
    for ch in range(2,len(test)):
      if trigram_p[(test[ch-2],test[ch-1],test[ch])]==0:
        #for unseen features we use unknown probability
        trigram_p[(test[ch-2],test[ch-1],test[ch])]=trigram_p['UNK']
      sum_log+=math.log2(trigram_p[(test[ch-2],test[ch-1],test[ch])])
      
    perplexity=2**(-sum_log/len(test))
    results.append((count,perplexity))
    count+=1
  results=sorted(results, key=lambda x: -x[1])
  for i in range(len(results)):
    print(results[i][1])
  return results

In [39]:
alpha=0.1
unigram_p,bigram_p,trigram_p=make_ngrams(alpha)
results=perplexities_smoothing(trigram_p)

33.36871923492745
33.30708283163418
33.139279534317275
32.59083857233003
32.561993280766025
32.223620942226106
32.07471209622475
32.014452912611276
31.922403920111087
31.85827435697232
31.512127893762177
31.434546269564958
31.4145589044988
31.406022015435322
31.26165014422557
31.111236736103248
31.00782705527124
30.990644839931125
30.92936189835053
30.209970718664177
9.366080735093076
9.276031069102164
9.265151721518373
9.203225058535544
9.180042259600484
9.170814428271527
9.146796143323309
9.08179520691154
9.035510991123125
9.00988052932453
9.007187174482937
8.996340502289513
8.96331040804744
8.958086432192255
8.931111493888784
8.915257771388267
8.9120064408254
8.904675145695997
8.904404813327568
8.895174645655782
8.879916267731064
8.850582515194
8.831708006153976
8.81852475785547
8.776145154135468
8.725162224740926
8.701318851550674
8.685024849684673
8.684177335591896
8.673770898708062
8.671771457185065
8.66725456418907
8.664800211134128
8.658086361327193
8.647099464179414
8.64572678

In [40]:
#the perplexities drop sharply from about 30 to about 9, so a threshold of 20 separates the two languages well
#I get the same perplexities in all tests
print('The cutoff threshold is 20, everything more than 20 is Brazilian')
for i in range(len(results)):
  if round(results[i][1],2)>20:
    print('Perplexity='+ str(round(results[i][1],2))+' file is Brazillian, file name is: '+str(test_files[results[i][0]]))
  else:
    print('Perplexity= '+ str(round(results[i][1],2))+' file is English, file name is: '+str(test_files[results[i][0]]))

Threshold is 20, everything more than 20 is Brazilian
Perplexity=33.37 file is Brazillian, file name is: ag94ju07.txt
Perplexity=33.31 file is Brazillian, file name is: br94ag01.txt
Perplexity=33.14 file is Brazillian, file name is: ag94ma03.txt
Perplexity=32.59 file is Brazillian, file name is: br94fe1.txt
Perplexity=32.56 file is Brazillian, file name is: br94ma01.txt
Perplexity=32.22 file is Brazillian, file name is: ag94se06.txt
Perplexity=32.07 file is Brazillian, file name is: ag94ab12.txt
Perplexity=32.01 file is Brazillian, file name is: ag94de06.txt
Perplexity=31.92 file is Brazillian, file name is: ag94jl12.txt
Perplexity=31.86 file is Brazillian, file name is: br94ja04.txt
Perplexity=31.51 file is Brazillian, file name is: ag94ag02.txt
Perplexity=31.43 file is Brazillian, file name is: br94jl01.txt
Perplexity=31.41 file is Brazillian, file name is: br94ju01.txt
Perplexity=31.41 file is Brazillian, file name is: ag94mr1.txt
Perplexity=31.26 file is Brazillian, file name is: a

**Using the cutoff threshold of 20 and checking the predictions against the file names, the function detected all non-English texts correctly.**

## 1.4
Based on your observation from above questions, compare linear interpolation and add-λ smoothing by listing out their pros and cons.

**Add-λ smoothing:**

**Pros:**
  - It distributes some probability to unseen trigrams.
  - It is a robust method: unseen trigrams need no special handling.
  - It accounts for the diversity of character/word combinations in unseen texts.

**Cons:**
  - It relies only on trigrams, so it can perform poorly on smaller texts.
  - It oversmooths the common trigrams. That is why interpolation performs slightly better here: on the English test documents, perplexity is around 0.3 higher with add-λ smoothing than with interpolation.
  





**Linear interpolation:**

**Pros:**
  - It takes advantage of both lower- and higher-order n-grams (unigrams, bigrams, and trigrams in this homework). The weight of each n-gram order can be tuned on held-out data.

**Cons:**
  - It assigns zero probability to unseen characters, so some modification is needed to keep a zero probability out of the perplexity computation. Add-λ smoothing solves this problem by adding λ to the numerator and λV to the denominator.
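The contrast in the last point can be reproduced on toy data. The sketch below assumes unsmoothed maximum-likelihood estimates inside the interpolation, a toy training string, and a vocabulary size that reserves one slot for an unseen symbol. All names are hypothetical.

```python
import collections
import math

# Toy training text; "ç" never occurs in it.
toy_train = list("banana")
uni = collections.Counter(toy_train)
bi = collections.Counter(zip(toy_train, toy_train[1:]))
tri = collections.Counter(zip(toy_train, toy_train[1:], toy_train[2:]))
V = len(uni) + 1  # assumption: one extra slot for an unseen character

def interpolated(c1, c2, c3, lambdas=(0.9, 0.05, 0.05)):
    """Linear interpolation of unsmoothed trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas
    p3 = tri[(c1, c2, c3)] / bi[(c1, c2)] if bi[(c1, c2)] else 0.0
    p2 = bi[(c2, c3)] / uni[c2] if uni[c2] else 0.0
    p1 = uni[c3] / len(toy_train)
    return l3 * p3 + l2 * p2 + l1 * p1

def add_lambda(c1, c2, c3, lam=0.1):
    """Add-lambda trigram estimate."""
    return (tri[(c1, c2, c3)] + lam) / (bi[(c1, c2)] + lam * V)

p_interp = interpolated("a", "n", "ç")   # every component count is zero
p_add = add_lambda("a", "n", "ç")        # strictly positive
print(p_interp, p_add)
```

A zero from the interpolated model cannot be passed to `math.log2`, which is why the notebook substitutes an `'UNK'` probability in that case, whereas the add-λ estimate is always positive.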
