# SNLP Assignment 4 - The Plagues of N-Gram Modelling

Name 1: William LaCroix<br/>
Student id 1: 7038732<br/>
Email 1: williamplacroix@gmail.com<br/>


Name 2: Nicholas Jennings<br/>
Student id 2: 2573492<br/>
Email 2: s8nijenn@stud.uni-saarland.de<br/>

**Instructions:** Read each question carefully. <br/>
Make sure you appropriately comment your code wherever required. Your final submission should contain the completed Notebook and the respective Python files for any additional exercises necessary. There is no need to submit the data files should they exist. <br/>
Upload the zipped folder on CMS. Please follow the naming convention of **Name1_studentID1_Name2_studentID2.zip**. Make sure to click on "Turn-in" (or the equivalent on CMS) after you upload your submission, otherwise the assignment will not be considered as submitted. Only one member of the group should make the submission.

---

# The Plague of OOVs (5)

Out-Of-Vocabulary words or OOVs are a major problem for any language model because of insufficient data and/or the emergence of new words. Let's see how this affects our statistical models.

1. What happens to the perplexity when there is an OOV in the evaluation sentence? (0.5 points)


2. The go-to solution for modelling OOVs in the N-gram setting is to introduce a new `<unk>` token in the vocabulary for all unknown words. The `<unk>` token replaces all OOVs and is then modelled like any other word. (4 points)
    - Split your data into train:test datasets using a 70:30 ratio. (0.5 points) 
    - Complete the function to create a vocabulary with the *top_n* most frequent words in the train set. (0.5 points)
    - Complete the function that restricts a corpus into the given vocabulary. (1 point)
    - Vary *top_n* and plot how the OOV rate for the test set changes with an increase in the size of the vocabulary (use a log-log scale). What do you observe? (2 points)
  

3. A very common practice is to build the vocabulary using all words that occur twice or more in the training data. Why would we restrict the vocabulary if OOVs are a headache in the first place? (0.5 points)




## 1
If the model assigns a probability of zero to an unknown word, the log-likelihood of the sentence is negative infinity. Since perplexity is the exponential of the average negative log-likelihood, the perplexity is infinite.
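As a small illustration of this answer, the sketch below computes perplexity from a list of per-token probabilities. The `perplexity` helper and the toy probabilities are made up for this example and are not part of the assignment code.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    total_log_prob = 0.0
    for p in token_probs:
        if p == 0.0:
            # log(0) is -infinity, so a single zero-probability token
            # drives the whole perplexity to infinity.
            return math.inf
        total_log_prob += math.log(p)
    return math.exp(-total_log_prob / len(token_probs))

# Toy probabilities (illustrative only)
seen_only = perplexity([0.1, 0.2, 0.05])   # every token has non-zero probability
with_oov = perplexity([0.1, 0.0, 0.05])    # the middle token is an OOV with p = 0
print(seen_only, with_oov)
```

A single zero-probability token is enough to make the result infinite, regardless of how well the rest of the sentence is modelled.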

In [None]:
from collections import Counter
import oov
from importlib import reload
oov=reload(oov)
import numpy as np

#Loading the WSJ treebank, Implement preprocessing
corpus=oov.load_and_preprocess_data()

#Implement corpus splitting. Do not randomize anything here.
train,test=oov.train_test_split(corpus)

train_rates = []
test_rates = []

for top_n in np.linspace(100,5000,10,dtype=int):
  # Create Vocabulary with most popular words in the train set
  vocab=oov.make_vocab(train,top_n)


  # Force the train data and test data into this vocabulary by replacement with '<unk>'
  vocabulary_restricted_train=oov.restrict_vocab(train,vocab)
  vocabulary_restricted_test=oov.restrict_vocab(test,vocab)

  train_oov_rate = oov.oov_rate(vocabulary_restricted_train)
  test_oov_rate = oov.oov_rate(vocabulary_restricted_test)

  train_rates.append(train_oov_rate)
  test_rates.append(test_oov_rate)



[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\Nicho\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!


In [None]:
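import matplotlib.pyplot as plt
# Same vocabulary sizes as in the loop above, used as the x-axis
vocab_sizes = np.linspace(100,5000,10,dtype=int)
plt.figure("plot")
plt.loglog(vocab_sizes, train_rates, label="train oov rate")
plt.loglog(vocab_sizes, test_rates, label="test oov rate")
plt.xlabel("vocabulary size (top_n)")
plt.ylabel("OOV rate")
plt.legend()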

At small vocabulary sizes, the test OOV rate is only slightly higher than the train OOV rate, but the gap between the two grows as the vocabulary size increases. The test OOV rate falls at roughly the same rate throughout the graph, while the train OOV rate falls more sharply as the vocabulary size increases.
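The `oov` module itself is not shown in this notebook. A minimal sketch of what its three helpers might look like is given below, assuming the corpus is a flat list of tokens; the function bodies and the toy data are assumptions for illustration, not the submitted implementation.

```python
from collections import Counter

UNK = "<unk>"

def make_vocab(tokens, top_n):
    """Return the set of the top_n most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(top_n)}

def restrict_vocab(tokens, vocab):
    """Replace every token outside the vocabulary with <unk>."""
    return [tok if tok in vocab else UNK for tok in tokens]

def oov_rate(restricted_tokens):
    """Fraction of tokens that were replaced by <unk>."""
    return restricted_tokens.count(UNK) / len(restricted_tokens)

# Toy data (hypothetical), with a vocabulary of the 2 most frequent train words
toy_train = "the cat sat on the mat the cat ran".split()
toy_test = "the dog sat on the cat".split()
toy_vocab = make_vocab(toy_train, top_n=2)

toy_train_rate = oov_rate(restrict_vocab(toy_train, toy_vocab))
toy_test_rate = oov_rate(restrict_vocab(toy_test, toy_vocab))
print(toy_train_rate, toy_test_rate)
```

The vocabulary is built from the train split only, so the test OOV rate also counts words that never appeared in training at all.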

## 3
The reason to restrict the vocabulary is to reduce overfitting on infrequent words. Moreover, even if the vocabulary is not restricted, there will "always" be words in the test set that do not occur in the training set, so the model needs `<unk>` occurrences in the training data to estimate a probability for them.
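The count-threshold vocabulary mentioned in the question can be sketched as follows. The helper name `make_vocab_min_count` and the toy token list are hypothetical; the point is only to show how many word types a threshold of two removes compared with how many tokens end up as `<unk>`.

```python
from collections import Counter

def make_vocab_min_count(tokens, min_count=2):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

# Toy token list (illustrative only)
toy_tokens = "a a a b b c d e".split()
toy_counts = Counter(toy_tokens)
kept = make_vocab_min_count(toy_tokens)

# Word types removed from the vocabulary vs. tokens that become <unk>
dropped_types = len(toy_counts) - len(kept)
dropped_tokens = sum(c for w, c in toy_counts.items() if w not in kept)
print(kept, dropped_types, dropped_tokens)
```

In this toy case most word types are dropped while most tokens are kept, and the dropped tokens provide the training counts for `<unk>`.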

# The Plague of Unseen N-Grams (5 points)

A major and very common issue with N-gram modelling is the estimation of probabilities for ngram sequences that are unobserved in the training data. The most popular technique to tackle unseen ngrams is to smooth the MLE distribution. We will deal with more complicated smoothing techniques later in the course, but let's look at a rudimentary smoothing technique called Laplace Smoothing. You can look it up in the [Jurafsky Book](https://web.stanford.edu/~jurafsky/slp3/old_dec21/3.pdf) 

The idea of Laplace smoothing is simple: You add a count of alpha to every bigram count, including the bigrams that never occur in the corpus. With alpha = 1 (add-one smoothing), you "pretend" that you observed the previously unseen ngrams **once**. For intuition, you can look at Figures 3.1 and 3.6 in the Jurafsky book. Your task now is to implement a Bigram model with add-one smoothing.
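Written out, the standard add-alpha estimate for a bigram is shown below. The symbols are notation introduced here: $C(\cdot)$ is a count in the training data and $|V|$ is the vocabulary size.

$$
P_{\alpha}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i) + \alpha}{C(w_{i-1}) + \alpha\,|V|}
$$

Setting $\alpha = 1$ gives add-one smoothing, and the $\alpha\,|V|$ term in the denominator keeps the probabilities for a given history summing to one.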

1. The main task is to complete the `BigramModel` class in `smoothed_lm.py`. Finish the functions to count ngrams and compute the Laplace smoothed probability for a given bigram. (2 points)

2. Now calculate the average perplexity on the test set. You can start conditional probability estimations from the second word (pay attention to the consequent normalization factor for perplexity). Use vocabulary-restricted train and test sets. Use a top_5000 vocabulary. (2 points)

3. What happens when you vary alpha in the range 0-1? Why is this smoothing inefficient? (1 point)
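The contents of `smoothed_lm.py` are not shown here, so the following is only a sketch of how an add-alpha bigram model and its perplexity could be written. The class name `ToyBigramModel`, its method signatures, and the toy data are assumptions, and the sketch expects flat token lists and alpha greater than zero.

```python
import math
from collections import Counter

class ToyBigramModel:
    """Add-alpha smoothed bigram model over a flat list of tokens (sketch)."""

    def __init__(self, train_tokens, alpha=1.0):
        self.alpha = alpha
        self.vocab_size = len(set(train_tokens))
        # Count each word as a history, i.e. as the first element of a bigram
        self.history_counts = Counter(train_tokens[:-1])
        self.bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))

    def logprob(self, bigram):
        """Natural-log probability of bigram[1] given bigram[0]."""
        history, _ = bigram
        numerator = self.bigram_counts[bigram] + self.alpha
        denominator = self.history_counts[history] + self.alpha * self.vocab_size
        return math.log(numerator / denominator)

    def perplexity(self, test_tokens):
        """Perplexity over test bigrams, starting prediction at the second word."""
        bigrams = list(zip(test_tokens, test_tokens[1:]))
        total = sum(self.logprob(bg) for bg in bigrams)
        # Normalize by the number of predicted tokens, not the total token count
        return math.exp(-total / len(bigrams))

# Toy data (illustrative only)
toy_model = ToyBigramModel("a b a c a b".split(), alpha=0.5)
toy_ppl = toy_model.perplexity("a b a".split())
print(toy_ppl)
```

Because prediction starts at the second word, the normalizer is the number of bigrams, which is one less than the number of tokens.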

In [17]:
import smoothed_lm
from importlib import reload
smoothed_lm=reload(smoothed_lm)

# Create Vocabulary with most popular 5000 words in the train set
vocab=oov.make_vocab(train,5000)

# Force the train data and test data into the vocabulary
vocabulary_restricted_train=oov.restrict_vocab(train,vocab)
vocabulary_restricted_test=oov.restrict_vocab(test,vocab)


#Complete the class
for alpha in [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 1]:
    model=smoothed_lm.BigramModel(vocabulary_restricted_train,vocabulary_restricted_test,alpha)
    model.logprob(('and','the'))

    #Calculate Average test perplexity
    print("For alpha =", alpha, "perplexity is", model.perplexity())

NameError: name 'List' is not defined

# Bonus Question - Alternate tokenizations for OOV handling (2 points)

We saw in the first part that the OOV issue can be mitigated by using the `<unk>` token. Another option to deal with the OOV problem is to change your tokenization schema by making it more granular. A very popular method along these lines is subword modelling. The idea here is simple: you tokenize your sentence into a sequence of sub-words and subsequently span your vocabulary across all possible subwords in the corpus. This makes sure that there is no OOV in the test set, since new words are still composed of the subwords in your vocabulary. 
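To make the idea concrete, here is a small sketch of character-level tokenization. The `char_tokenize` helper, the `_` boundary marker, and the toy sentence are choices made for this example only.

```python
def char_tokenize(sentence, boundary="_"):
    """Split a sentence into characters, marking word boundaries with a symbol."""
    return list(sentence.replace(" ", boundary))

# Toy training text (illustrative only)
train_chars = set(char_tokenize("the cat sat on the mat"))

# "chat" never occurs as a word in the toy training text,
# but every one of its characters does.
unseen_word_tokens = char_tokenize("chat")
covered = all(ch in train_chars for ch in unseen_word_tokens)
print(unseen_word_tokens, covered)
```

A word that is unseen at the word level is covered here because all of its characters were seen. A character that never appears in the training text would, in principle, still be unknown.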


1. Tokenize the corpus using character-level tokenization and compute the perplexity of the resulting bigram model. (0.5 points)

2. Extend this to another tokenization scheme: the very popular [Byte-Pair Encoding](https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt) scheme. You can use Huggingface's [GPT2Tokenizer](https://huggingface.co/transformers/v3.0.2/model_doc/gpt2.html#gpt2tokenizer) to do this. Find the resulting bigram model perplexity. (0.5 points)

3. Can you compare these perplexities? In general, is it okay to compare perplexities when we use different tokenization schemes? (1 point)
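One point relevant to the last question is that perplexity is normalized by the number of tokens, and that number depends on the tokenization. The sketch below uses made-up numbers (a total log-probability of -600 nats for a text of 100 words or 500 characters) to show how the same total log-likelihood gives very different perplexities per unit.

```python
import math

def perplexity_per_unit(total_log_prob, n_units):
    """Perplexity when the total log-probability (in nats) is spread over n_units."""
    return math.exp(-total_log_prob / n_units)

# Hypothetical values for one and the same test text
total_log_prob = -600.0
n_words = 100
n_chars = 500

per_word = perplexity_per_unit(total_log_prob, n_words)
per_char = perplexity_per_unit(total_log_prob, n_chars)
print(per_word, per_char)
```

One possible way to put models with different tokenizations on a common footing is to normalize each model's total log-probability by the same unit, for example the number of words in the test text.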


In [1]:
!pip install transformers

