
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Important Notes from the Given Problem Statement:**
- It is essential that the probabilities for all 48 characters with a given history sum to 1.0. In other words, you must produce a conditional probability distribution at each step. Returning probabilities that sum to more than 1.0 is an easy way to “cheat”, and I'll be checking carefully that this isn't the case.
- Evaluate functions that return the probability of a character given the history of characters preceding it (with trivial implementations).

- Your program should be a call to the “evaluate” function, which displays the cross-entropies of your four language models on the test sets.
- Evaluation: Your goal is to produce language models with the smallest cross-entropy (in units of bits per character).
- $H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(c_i \mid c_1 \ldots c_{i-1})$, where $c_1, \ldots, c_N$ are the characters of the test corpus.
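
The two requirements above (a distribution that sums to 1.0, and cross-entropy in bits per character) can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `uniform_prob`, `sums_to_one`, and `cross_entropy` are hypothetical names, and the alphabet is assumed to be the 48-character set that appears later as `ASCII_Chars`.

```python
import math

# Assumed 48-character alphabet (same string as ASCII_Chars in the model class).
ALPHABET = ' !"\'(),-.0123456789:;?abcdefghijklmnopqrstuvwxyz'

def uniform_prob(history, char):
    """Trivial (hypothetical) model: every character is equally likely."""
    return 1.0 / len(ALPHABET)

def sums_to_one(prob_fn, history, tol=1e-9):
    """Check that prob_fn defines a proper distribution for this history."""
    total = sum(prob_fn(history, c) for c in ALPHABET)
    return abs(total - 1.0) < tol

def cross_entropy(prob_fn, text):
    """Average -log2 P(c_i | c_1..c_{i-1}) over the text, in bits per character."""
    total_bits = 0.0
    for i, char in enumerate(text):
        p = prob_fn(text[:i], char)
        # A zero probability means infinite loss; it must not be skipped.
        total_bits += -math.log2(p) if p > 0 else math.inf
    return total_bits / len(text)
```

For the uniform model the cross-entropy is log2(48) bits per character regardless of the text, which gives a simple baseline that any trained model should beat.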

**Approach of the Code**

======================================================================= 1 Loading   
1) Load train dataset  

======================================================================= 2 Cleaning  
2) Prepare the useful characters: ASCII_Chars  
3) Clean dataset:  
  - Remove punctuation characters  
  - List of all the characters without losing the order of the characters  

======================================================================= 3 Training     
4) Find the ngrams & ngram_nextCharacters of these characters  
5) Find ngrams count & ngram_nextCharacters count and store them  

======================================================================= 4 Probability  
6) Find the probability of ngram & nextChar  

======================================================================= 5 Evaluating  
7) For each given history & nextChar,  
  - get the probability of nextChar given the history (ngram)  
  - apply -log2 to the probability  
  - sum these values and divide by the number of characters to get the log2loss.  
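
The training and probability steps (4 to 6) can be sketched compactly with `collections.Counter`. This is an illustrative sketch under the same padding scheme as the class defined later (`n` start tokens and `n` end tokens per sentence); the function names `ngram_pairs`, `count_ngrams`, and `mle_prob` are hypothetical and do not appear in the notebook.

```python
from collections import Counter

START, END = "<START>", "<END>"

def ngram_pairs(sentence, n):
    """Yield (history, next_char) pairs for one sentence, padded on both sides."""
    tokens = [START] * n + list(sentence) + [END] * n
    for i in range(len(tokens) - n):
        yield tuple(tokens[i:i + n]), tokens[i + n]

def count_ngrams(sentences, n):
    """Return (history counts, (history, next_char) counts)."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        for history, nxt in ngram_pairs(sentence, n):
            history_counts[history] += 1
            pair_counts[(history, nxt)] += 1
    return history_counts, pair_counts

def mle_prob(history_counts, pair_counts, history, char):
    """Relative-frequency estimate count(history, char) / count(history)."""
    seen = history_counts.get(history, 0)
    return pair_counts.get((history, char), 0) / seen if seen else 0.0
```

For example, after counting the two sentences `"ab."` and `"ac."` with `n = 2`, the history `(START, "a")` has been seen twice, once followed by `"b"`, so the estimate for `"b"` is 0.5.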

### **Defining Language Model**

In [2]:
import re
import os
import sys
import numpy as np
import time
import math
from datetime import datetime

In [3]:
class DictionaryFunctions:
  def __init__(self) -> None:
    pass
  def addVal_to_dictKey(self, D, k, val=1): # Inputs: Dictionary, Key, Value; Output: Dictionary (Updated)
    if k in D:
      if type(val)==int:
        D[k] += val
      elif (type(val)==str):
        D[k].append(val)
    else:
      if type(val)==int:
        D[k] = val
      elif (type(val)==str):
        D[k] = [val]
    return D

  def find_probs_of_dict(self, D):  # Normalize the counts in place so the values sum to 1
    total = sum(D.values())
    for key in D:
      D[key] /= total
    return D

  def get_dict_from_list(self, L):
    D = {}
    for char in L:
      if char in D:
        D[char] += 1
      else:
        D[char] = 1
    return D

  def max_of_dict(self, D):  
    maxProb = 0
    maxKey = ''
    for key, probVal in D.items():
      if probVal>maxProb:
        maxProb = probVal
        maxKey = key
    return (maxKey, maxProb)

In [4]:
class Language_Model:
  def __init__(self, n, lang="") -> None:
    self.n = n
    self.lang = lang
    self.DF = DictionaryFunctions()
    self.ASCII_Chars = ' !"\'(),-.0123456789:;?abcdefghijklmnopqrstuvwxyz'
    self.sentence_EndChars = '.!?'
    self.ngrams = {}
    self.ngrams_nextChars = {}
    self._ngrams_List = []
    self._ngrams_nextChars_List = []
    self._ngrams_possChars = {}
    self.startPad = ['<START>']
    self.endPad = ['<END>']
    self._ngrams_notAvailable = []
    self._ngrams_nextChars_notAvailable = []
  
  def _get_ngrams(self, text):
    n = self.n
    chars_tokens = (n)*self.startPad + text + (n)*self.endPad
    ngrams_nextChars = [(tuple(chars_tokens[i:i+n]),chars_tokens[i+n]) for i in range(len(chars_tokens)-n)]
    return ngrams_nextChars

  def clean_data(self, dataset: str) -> list:  
    ASCII_Chars = self.ASCII_Chars
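    # re.escape keeps '-' (and other special characters) literal inside the character class
    all_characters_in_data = re.findall(r"[%s]" % re.escape(ASCII_Chars), dataset)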
    # data_clean = "".join(all_characters_in_data)
    return all_characters_in_data
    
  def get_Sentences_from_Text(self, text):
    if type(text)==list:
      all_characters_in_data = text
    elif type(text) == str:
      all_characters_in_data = list(text)
    else:
      return "Please pass input either 'List_of_Characters' or 'Text_String'."
    endChars = self.sentence_EndChars # endChars=".!?" ; where '.' -> (period), '!' -> (exclamation mark) and '?' -> (question mark)
    sentences = []
    sentence = []
    for c in all_characters_in_data:
      if c not in endChars:
        sentence.append(c)
      else:
        sentence.append(c)
        sentences.append(sentence)
        sentence = []
    return sentences
  
  def store_ngrams(self, sentences):
    for sent in sentences:
      ngrams_nextChars = self._get_ngrams(sent)
      for ngram_nextChar in ngrams_nextChars:
        self._ngrams_nextChars_List.append(ngram_nextChar)
        self._ngrams_List.append(ngram_nextChar[0])
        self._ngrams_possChars = self.DF.addVal_to_dictKey(self._ngrams_possChars, ngram_nextChar[0], str(ngram_nextChar[1]))

    self.ngrams_nextChars = self.DF.get_dict_from_list(self._ngrams_nextChars_List)
    self.ngrams = self.DF.get_dict_from_list(self._ngrams_List)

    return "Success: ngrams & ngrams_nextChars are stored successfully"
  
  def fit(self, data):
    all_characters_in_data = self.clean_data(data)
    sentences = self.get_Sentences_from_Text(all_characters_in_data)
    statusMessage = self.store_ngrams(sentences)
    print("==== TRAINING IS COMPLETED ====")
    print(statusMessage)
  
  def get_prob(self, ngram, char):
    if ngram in self.ngrams:
      A = self.ngrams[ngram]
    else:
      self._ngrams_notAvailable.append(ngram)
      A = 0
    ngram_nextChar = tuple((ngram, char))

    if ngram_nextChar in self.ngrams_nextChars:
      B = self.ngrams_nextChars[ngram_nextChar]
    else:
      self._ngrams_nextChars_notAvailable.append(ngram_nextChar)
      B = 0
    if A!=0:
      result = float(B/A)
    else:
      result = 0
    return result
  
  def evaluate(self, text: str):
    total_log2loss = 0
    ngram = self.n * self.startPad
    inputList = list(text)
    for char in text:
      result = self.get_prob(tuple(ngram), char)
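      # NOTE: characters with probability 0 are skipped here, so they add nothing to the
      # loss even though they still count in the denominator below. The reported value
      # can therefore understate the true cross-entropy.
      if result != 0:
        total_log2loss -= np.log2(result)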
      
      ngram = ngram[1:]+[char]
    return total_log2loss/len(inputList)
  
  def evaluation_Status(self):
    print(f"Not available 'ngrams' are : {'-'*10}")
    print(self._ngrams_notAvailable)
    print(f"Not available 'ngrams_nextChars' are : {'-'*10}")
    print(self._ngrams_nextChars_notAvailable)
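
`get_prob` is an unsmoothed relative-frequency estimate: it returns 0 for any unseen continuation, and for an unseen history all 48 probabilities are 0, so they do not sum to 1.0. One common remedy is add-k smoothing, sketched below. This is not the notebook's method; `add_k_prob`, `k`, and `ALPHABET` are illustrative names, and `pair_counts` is assumed to be a dict keyed by `(history_tuple, next_char)` like `ngrams_nextChars`.

```python
# Assumed 48-character alphabet (same string as ASCII_Chars in Language_Model).
ALPHABET = ' !"\'(),-.0123456789:;?abcdefghijklmnopqrstuvwxyz'

def add_k_prob(pair_counts, history, char, k=0.1):
    """Add-k estimate over the alphabet only.

    The denominator sums counts of alphabet characters after `history`,
    ignoring padding tokens such as '<END>', so the 48 probabilities
    sum to 1 for every history, seen or unseen.
    """
    in_alphabet = sum(pair_counts.get((history, c), 0) for c in ALPHABET)
    count = pair_counts.get((history, char), 0)
    return (count + k) / (in_alphabet + k * len(ALPHABET))
```

With this estimate an unseen history falls back to the uniform distribution (1/48 per character), and no character ever receives probability 0, so every character contributes a finite loss. The value of `k` is a tuning choice and would need to be selected on held-out data.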


### **Load the Datasets**

In [5]:
def load_data(filePath):
  # Read the whole file, lowercase it, and close the file handle when done
  with open(filePath, 'r') as f:
    data_cwe = f.read().lower()
  return data_cwe

In [7]:
filePath = "/content/drive/MyDrive/cwe-train.txt"
train_data = load_data(filePath)

In [6]:
filePath = "/content/drive/MyDrive/cwe-test.txt"
test_data = load_data(filePath)

### **Train the Language Model : CWE**

**5-gram Model**

In [8]:
startTime = datetime.now()

LM_5 = Language_Model(5)
LM_5.fit(train_data)

endTime = datetime.now()
print("\n")
print(f"Code running time : {endTime-startTime}")

==== TRAINING IS COMPLETED ====
Success: ngrams & ngrams_nextChars are stored successfully


Code running time : 0:00:02.357309


In [9]:
log2loss = LM_5.evaluate(train_data)
print(log2loss)

1.209649119639972


In [10]:
log2loss = LM_5.evaluate(test_data)
print(log2loss)

1.1544019103445355


**10-gram Model**

In [11]:
n = 10

startTime = datetime.now()

LM_10 = Language_Model(n)
LM_10.fit(train_data)

print("\n")
print(f"Training Log2Loss : {LM_10.evaluate(train_data)}") 
print(f"Testing Log2Loss : {LM_10.evaluate(test_data)}") 

endTime = datetime.now()
print("\n")
print(f"Code running time : {endTime-startTime}")

==== TRAINING IS COMPLETED ====
Success: ngrams & ngrams_nextChars are stored successfully


Training Log2Loss : 0.3780673061919561
Testing Log2Loss : 0.27591094078507133


Code running time : 0:00:05.487034


**15-gram Model**

In [12]:
n = 15

startTime = datetime.now()

LM_15 = Language_Model(n)
LM_15.fit(train_data)

print("\n")
print(f"Training Log2Loss : {LM_15.evaluate(train_data)}") 
print(f"Testing Log2Loss : {LM_15.evaluate(test_data)}") 

endTime = datetime.now()
print("\n")
print(f"Code running time : {endTime-startTime}")

==== TRAINING IS COMPLETED ====
Success: ngrams & ngrams_nextChars are stored successfully


Training Log2Loss : 0.10429658363680185
Testing Log2Loss : 0.057930800274652756


Code running time : 0:00:06.140572
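
Because `evaluate` skips characters whose probability is 0 but still divides by the full text length, it can help to check how often that happens before comparing the losses above. The sketch below is a rough diagnostic under stated assumptions: `zero_prob_fraction` and its parameters are hypothetical names, and `prob_fn` is assumed to take `(history_tuple, char)` like `get_prob`.

```python
def zero_prob_fraction(prob_fn, text, n, start_token="<START>"):
    """Fraction of characters in `text` that prob_fn assigns probability 0.

    The history is a sliding window of the previous n symbols, starting
    from n start tokens, mirroring how `evaluate` walks through the text.
    """
    if not text:
        return 0.0
    history = [start_token] * n
    zeros = 0
    for char in text:
        if prob_fn(tuple(history), char) == 0:
            zeros += 1
        history = history[1:] + [char]
    return zeros / len(text)
```

A call such as `zero_prob_fraction(LM_5.get_prob, test_data, 5)` would report the share of test characters that contribute no loss. A large share would suggest that the reported log2loss is not a reliable cross-entropy estimate.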
