
**Submission Deadline**: Feb 07, 2024; 11:59 PM

Welcome to the programming assignments (PAs) for *ECE 5995: Large Language Models*. For the PAs, we'll be using Python. Python has fantastic community support, and we have posted several resources in ICON to help you become more familiar with it.

**Learning Objectives**

In this assignment, we will perform language modeling using the *WikiText* dataset, which contains a large collection of text from Wikipedia articles. The task involves training models to generate coherent and contextually appropriate text, making it a fundamental problem in natural language processing.

**Writing Code**

Look for the keyword "TODO" and fill in your code in the empty space. Feel free to add or remove arguments in function signatures, but note that you may then need to update the corresponding function calls that are already present in the notebook.

# 1: N-Gram Language Modeling:  

In the context of n-grams, language modeling involves estimating the likelihood of a sequence of words/tokens based on the history of preceding words (for an n-gram model, the previous $n-1$ tokens).
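As a small illustration of this idea, the sketch below estimates bigram probabilities by counting on a toy sentence. The token list and the helper name `bigram_prob` are made up for illustration and are not part of the assignment code; it uses plain maximum-likelihood counts rather than the `nltk` model used later in the notebook.

```python
from collections import Counter

# Toy token sequence (hypothetical example, not taken from WikiText)
tokens = ['<bos>', 'the', 'cat', 'sat', 'on', 'the', 'mat', '<eos>']

# Count each adjacent pair and each token that appears as a context
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def bigram_prob(prev: str, word: str) -> float:
    """Maximum-likelihood estimate of P(word | prev) from the toy counts."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: no estimate available
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob('the', 'cat'))  # 'the' is followed once by 'cat' and once by 'mat'
```

The same counting scheme extends to larger $n$ by using the previous $n-1$ tokens as the context.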


The goal of this portion of the assignment is to implement n-gram language models for values of $n \in \{2, 4, 6, 8, 10\}$, generate sample text, and calculate the perplexity of each n-gram model on the train set.

*Note: The dataset for this PA can be found in ICON. Please do not forget to upload the data folder (i.e., the wiki dataset) to your notebook.*

## 1.1 Data Preprocessing :

###### Complete the following code block to create the tokenizer necessary for the experiments that follow

- Create the train tokenizer with the following properties
    - Add a special **\<unk\>** token to replace any out-of-vocabulary (OOV) tokens
    - Replace numeric tokens with the **\<num\>** token
    - Remove punctuation and symbols
    - Ensure the tokenizer prepends a **\<bos\>** and appends an **\<eos\>** token to every sequence
    
**TODO**
- Train the tokenizer on the train set of Wiki-Text

- Print the vocabulary size of the tokenizer
- **Note:** You may want to make use of the [huggingface tokenizer docs.](https://huggingface.co/docs/tokenizers/components)
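
To make the target behaviour concrete, here is a plain-Python sketch of what the listed properties should produce on a sample sentence. It deliberately does not use the Hugging Face API: the function name `toy_tokenize` and the tiny vocabulary are assumptions for illustration only, and it lowercases text because the sanity check later in the notebook expects lowercase tokens.

```python
import string

def toy_tokenize(text, vocab):
    """Illustrates the expected output only; not the Hugging Face tokenizer."""
    text = text.lower()
    # Remove ASCII punctuation and symbols
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = []
    for word in text.split():
        if word.isdigit():
            tokens.append('<num>')   # purely numeric token
        elif word in vocab:
            tokens.append(word)
        else:
            tokens.append('<unk>')   # out-of-vocabulary token
    # Wrap every sequence in <bos> ... <eos>
    return ['<bos>'] + tokens + ['<eos>']

# Hypothetical vocabulary; the real one is learned from the train set
vocab = {'the', 'quick', 'fox', 'jumped', 'over', 'lazy', 'dog'}
print(toy_tokenize('The quick brown233 133 fox jumped over the lazy dog!', vocab))
```

In the actual tokenizer, each of these steps would map onto a normalizer, pre-tokenizer, or post-processor component described in the linked docs.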

In [3]:
from typing import List, Tuple, Callable
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from nltk.util import ngrams
from nltk.lm import MLE
from nltk.tokenize import word_tokenize
import string as string
import numpy as np

In [6]:
import tokenizers
from tokenizers.pre_tokenizers import WhitespaceSplit, Sequence
from tokenizers import Tokenizer, normalizers, Regex
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.processors import TemplateProcessing

In [7]:
# define the tokenizer with the help of the huggingface docs:
# https://huggingface.co/docs/tokenizers/index
# more tokenizers
from tokenizers.pre_tokenizers import Digits, Punctuation
from tokenizers.normalizers import Lowercase, Replace

def train_tokenizer(fname: str) -> tokenizers.Tokenizer:
    """
    Args:
        fname: the name of the wiki.txt file

    Returns: Huggingface Tokenizer
    """
    PAD_TOKEN = '<pad>'
    UNK_TOKEN = '<unk>'
    NUM_TOKEN = '<num>'
    START_TOKEN= '<bos>'
    END_TOKEN= '<eos>'
    tokenizer = Tokenizer(WordLevel(unk_token=UNK_TOKEN))
    # =============================
    # TODO:


    return tokenizer

In [None]:
# =============================
# TODO: Train the tokenizer


In [None]:
# =============================
# TODO: Print the Vocab size


In [None]:
def sanity_check(tokenizer: tokenizers.Tokenizer,
                sample_text: str):
    """
    Args:
        tokenizer: the trained tokenizer to check
        sample_text: the text to encode for the check
    """
    try:
        tokens = tokenizer.encode(sample_text).tokens
        assert tokens[0] == '<bos>'
        assert tokens[-1] == '<eos>'
        assert len(tokens) == len(sample_text.split(' ')) + 2
        assert all(token.islower() for token in tokens)
        print('Sanity Check Passed')
        print(tokens)
    except AssertionError as e:
        print('Tokenizer failed sanity check')
        print(tokens)
    return

In [None]:
sample_text= 'The quick brown233 133 fox jumped over the lazy dog!'
sanity_check(tokenizer, sample_text)

# 1.2 Train the N-Gram Model :  

If the tokenizer passed the basic sanity check then proceed.

**TODO**
 - Train n-gram models for $n \in \{ 2, 4, 6, 8, 10\}$
 - Complete the *get_ngrams* function below, to return a list of all the n-grams
     - Each entry in this list represents all the n-grams for a given sentence
 - For each n-gram model, fit the model to its respective set of n-grams
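
As a rough illustration of what "all the n-grams for a given sentence" means, the sketch below slides a window of width $n$ over one tokenized sentence. The helper name `toy_ngrams` and the example sentence are assumptions for illustration; in the notebook you would more likely rely on `nltk.util.ngrams`, which is already imported.

```python
from typing import List, Tuple

def toy_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens as a tuple."""
    # range() is empty when the sentence is shorter than n,
    # so short sentences simply contribute no n-grams
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical tokenized sentence
sentence = ['<bos>', 'the', 'lazy', 'dog', '<eos>']
print(toy_ngrams(sentence, 2))
```

Note that a sentence with fewer than $n$ tokens yields an empty list here, which is worth keeping in mind for the larger values of $n$.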

In [None]:
def get_ngrams(fname: str,
               tokenizer: tokenizers.Tokenizer,
               n: int) -> List[str]:
    """
    Args:
        fname: the name of the text file to read
        tokenizer: the trained tokenizer
        n: the n-gram size

    Returns:
        all_ngrams: list of all n-grams
        for all sentences
    """
    all_ngrams = []
    # =============================
    # TODO: implement the function
    #       to get the ngrams for
    #       all the sentences


    return all_ngrams

In [None]:
# =============================
# TODO: Train the n-gram models

# 1.3  Compute N-Gram Perplexity

### Perplexity is a measure of how well a given distribution predicts a sample. In the context of language modeling, the perplexity is based on how well the model predicts a given corpus. For the n-gram model, nltk provides a [function which computes the perplexity](https://www.nltk.org/api/nltk.lm.api.html#nltk.lm.api.LanguageModel.perplexity).
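
To show the arithmetic behind this definition, the sketch below computes perplexity directly from a list of per-token probabilities as $2^{H}$, where $H$ is the average negative $\log_2$ probability. The function name `toy_perplexity` and the probabilities are made up for illustration; for the assignment itself, the `nltk` function linked above does this for you.

```python
import math

def toy_perplexity(probs):
    """Perplexity from the model probability assigned to each token."""
    if any(p == 0 for p in probs):
        return math.inf  # a single zero-probability token makes perplexity infinite
    # Cross-entropy in bits: average negative log2-probability
    entropy = -sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** entropy

# A model that assigns probability 1/4 to every token is as "confused"
# as a uniform choice among 4 options
print(toy_perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

Lower perplexity means the model assigns higher probability to the observed tokens.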

**TODO**
- For $n \in \{2, 4, 6, 8, 10\}$, compute the perplexity of each n-gram model on the train set
- You will need the $\texttt{get\_ngrams}$ function in your $\texttt{ngram\_perplexity}$ function

In [None]:
from itertools import chain

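def ngram_perplexity(fname: str,
                     n: int,
                     tokenizer: tokenizers.Tokenizer,
                     ) -> float:
    """
    Args:
        fname: the name of the text file to evaluate on
        n: the n-gram size of the model to evaluate
        tokenizer: the trained tokenizer

    Returns:
        the perplexity of the n-gram model on the text in fname
    """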
    # =============================
    # TODO: Implement the function
    #       to compute perplexity for a given
    #       n-gram model

In [None]:
# =============================
# TODO: compute the perplexity for varying values of n

# 1.4 Plot N-gram Size vs Perplexity :

**TODO**:
 - For $n \in \{2, 4, 6, 8, 10\}$, plot the n-gram size vs. perplexity for each of the n-gram models

In [None]:
# =============================
# TODO: create the plots

**TODO**: In the Markdown cell below, explain the effect you observe. Why do you think this is the case?