<a href="https://colab.research.google.com/github/elhamod/IS883/blob/main/Week2/IS883_2024_Week2_pre_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS883 Week 2: Basic Language Modeling





1. Use Google Colab for this assignment.

2. **You are allowed to use ChatGPT for this assignment. However, as per the syllabus, you are required to cite your usage and submit the prompts and responses used as a PDF file. You are also responsible for understanding the solution and defending it when asked in class.**

3. For each question, fill in the answer in the cell(s) right below it. The answer could be code or text. You can add as many cells as you need for clarity.

4. Enter your BUID (only numerical part) below.

5. **Your submission on Blackboard should be the downloaded notebook (i.e., ipynb file). It should be prepopulated with your solution (i.e., the TA and/or instructor need not rerun the notebook to inspect the output). The code, when executed by the TA and/or instructor, should run with no runtime errors.**

# Part 1: Pre-class Work

## 1.1 Setup

In [None]:
BUID = 123456 #e.g., 123456 ONLY NUMERICAL PART

Machine learning is generally stochastic, meaning you can get different results on different runs. To make your results reproducible, you can "seed" your code. This code uses your BU ID (numeric part only) as the seed for all random number generators.

In [None]:
import random
import numpy as np
import torch

# Set a seed for the built-in Python random module
random.seed(BUID)
# Set a seed for NumPy
np.random.seed(BUID)
# Set a seed for PyTorch (imported above, so it needs seeding as well)
torch.manual_seed(BUID)
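
To see what seeding does in practice, here is a minimal sketch that uses only Python's built-in `random` module. The helper name `draw` and the seed values are placeholders chosen for illustration; they are not part of the assignment.

```python
import random

def draw(seed, k=5):
    """Return k pseudo-random integers produced after seeding with `seed`."""
    rng = random.Random(seed)  # a private generator, so global state is untouched
    return [rng.randint(0, 99) for _ in range(k)]

first = draw(123456)
second = draw(123456)
other = draw(654321)

print(first == second)  # the same seed reproduces the same sequence
print(first == other)   # a different seed usually gives a different sequence
```

The same idea applies to NumPy and PyTorch: each library keeps its own generator, so each one needs its own seed.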

## 1.2 Language Modeling with N-grams



Let's focus on language modeling now! In this section, you will create some n-grams and experiment with how they work.

### 1.2.1 Setup

Use `nltk` to create a 2-gram (i.e., bigram). Extract the bigrams in the sentence

> This is a sample sentence.

In [None]:
!pip install nltk
import nltk
# Ensure you have the tokenizers
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
from nltk.util import ngrams

sentence = "This is a sample sentence."

n = 2

### Get the tokens of the sentence.
tokens = nltk.word_tokenize(sentence)

### Create the bigram model
bigrams = list(ngrams(tokens, n))


# Print the bigrams
print(bigrams)
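
If you are curious what extracting n-grams amounts to, the following is a rough pure-Python sketch of the idea: slide a window of size `n` over the token list. The function name `extract_ngrams` is hypothetical, and whitespace splitting stands in for `nltk.word_tokenize` here.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

# Whitespace splitting is a simplification: the period is already written
# as a separate token so the result resembles a real tokenizer's output.
tokens = "This is a sample sentence .".split()

print(extract_ngrams(tokens, 2))
print(extract_ngrams(tokens, 3))
```

A sentence with `k` tokens yields `k - n + 1` n-grams, which is why longer n-grams become scarce quickly in short texts.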



### 1.2.2 Creating an N-gram Based on a Text Corpus.

Using [this functionality](https://www.nltk.org/api/nltk.lm.api.html) in `nltk`, create a bigram based on the following dataset of sentences:

        - to be or not to be. that is the question!
        - ask not what your country can do for you. Ask what you can do for your country.
        - is this the real life? is this just fantasy?

1. Show the bigram you have constructed (i.e., the dictionary).
2. Generate 10 new sentences. What do you notice about these sentences? Explain what's interesting about your observation(s).

In [None]:
sentences = [
    "To be or not to be. that is the question!",
    "Ask not what your country can do for you. Ask what you can do for your country.",
    "Is this the real life? is this just fantasy?"
]

In [None]:
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

### tokenize the sentences and put them in a list.
tokenized_text = [nltk.word_tokenize(s) for s in sentences]

# prepare the data using padded_everygram_pipeline
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

### train the model
model = MLE(n)
model.fit(train_data, padded_sents)



Let's print the frequencies of the different n-grams.

In [None]:
from collections import Counter

ngrams_freq = Counter()

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

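# Each item of train_data holds all the n-grams (up to order n) of one sentence
for sentence_ngrams in train_data:
    ngrams_freq.update(sentence_ngrams)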

# Display n-grams and their frequencies
print("N-grams and their frequencies:")
for ngram, freq in ngrams_freq.items():
    print(f"{ngram}: {freq}")
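
The frequencies above are what a maximum-likelihood model turns into probabilities. As a minimal sketch (plain Python, not the `nltk` implementation), the probability of a word given the previous word is the bigram count divided by the count of everything that followed that previous word. The names `following` and `mle_prob` are made up for this illustration.

```python
from collections import Counter, defaultdict

# First sentence of the dataset, pre-split so that punctuation is its own token
tokens = "to be or not to be . that is the question !".split()

# Count how often each word follows each one-word context
following = defaultdict(Counter)
for prev, word in zip(tokens, tokens[1:]):
    following[prev][word] += 1

def mle_prob(word, prev):
    """P(word | prev) = count(prev, word) / count(prev, anything)."""
    total = sum(following[prev].values())
    return following[prev][word] / total if total else 0.0

print(mle_prob("be", "to"))       # "to" is always followed by "be" here
print(mle_prob("or", "be"))       # "be" is followed by "or" once and "." once
print(mle_prob("fantasy", "to"))  # never observed, so the estimate is zero
```

Note how any pair that never occurred gets probability zero; this matters later when you compute perplexity.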

Now, let's generate some sentences with these n-grams!

In [None]:
def generate_sentence(model, num_words=15, start="to"):
    content = []
    current_word = start

    for _ in range(num_words):
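        ### update the prefix
        content.append(current_word)

        ### Generate the next word based on the last n-1 words
        # (generate() trims text_seed to the model's context length itself)
        current_word = model.generate(1, text_seed=content)
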

        # if "end of sentence" is reached, exit
        if current_word == '</s>':
            break

    # Convert the list of words into a string.
    return ' '.join(content)


In [None]:
num_sentences = 10
start_word = "Ask"

for i in range(num_sentences):
    print(f"Sentence {i + 1}:")
    print(generate_sentence(model, 15, start=start_word))
    print("------")
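
To reason about what the generated sentences can and cannot contain, it may help to look at a stripped-down bigram sampler. This is a sketch under simplifying assumptions (whitespace tokens, no padding symbols, a hypothetical `sample_sentence` helper), not the `nltk` generation routine.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(tokens):
    """Map each word to a Counter of the words observed right after it."""
    following = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        following[prev][word] += 1
    return following

def sample_sentence(following, start, max_words=15, seed=0):
    """Sample words one at a time, each conditioned only on the previous word."""
    rng = random.Random(seed)
    words = [start]
    while len(words) < max_words:
        options = following.get(words[-1])
        if not options:  # dead end: the last word was never followed by anything
            break
        candidates = sorted(options)
        weights = [options[w] for w in candidates]
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)

tokens = ("ask not what your country can do for you . "
          "ask what you can do for your country .").split()
following = train_bigram_counts(tokens)

for seed in range(3):
    print(sample_sentence(following, "ask", seed=seed))
```

Every adjacent pair of words in the output was seen in the training text, yet the sentence as a whole may never have appeared there. Keep this in mind when you describe what you notice.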



**Answers**

*Add answer here.*

# Part 2: In-class Work

Now, you will read the file `https://raw.githubusercontent.com/elhamod/IS883/main/Assignments/Week1/IS883_Week1_bustlingcity.txt`. You will create multiple __n-grams__, where n ∈ {2, 3, 4, 5, 10}. Then, using each n-gram, you will generate a text of similar length to the original file.

1. Compare the different generated texts. What observations do you make? Explain your observations with examples. __(0.5 points)__

In [None]:
import requests

## Read the text
url = "https://raw.githubusercontent.com/elhamod/IS883/main/Assignments/Week1/IS883_Week1_bustlingcity.txt"
response = requests.get(url)
text_content = response.text

# print the original text
print("original text:")
print(text_content)
print("--------")



original text:
In the heart of the bustling city, there is a park. The park is not just any park; it is a park of dreams. Dreams that come alive every morning as people gather here, each with their unique aspirations and stories. 

Under the shade of an ancient oak tree, children play. Their laughter and innocent chatter fill the air. They chase butterflies, imagining they are on an epic adventure. To them, the park is a playground of infinite possibilities.

Near the serene pond, a young writer sits on a weathered bench. The park is her refuge, a place where she finds inspiration among the dancing ripples of the water. She observes the ducks and swans, penning down verses that capture the essence of the natural world.

A group of elderly folks assembles by the chess tables, eager for their daily match. Here, the park transforms into a battlefield of strategic thinking and camaraderie. Every move is a calculated decision, and every game tells a different story.

As the sun sets, the pa

In [None]:
for n in [2, 3, 4, 5, 10]:
  print("n =", n)

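  ### tokenize the sentences
  tokenized_text = nltk.word_tokenize(text_content)

  # prepare the data using padded_everygram_pipeline
  train_data, padded_sents = padded_everygram_pipeline(n, [tokenized_text])

  ### train the model
  model = MLE(n)
  model.fit(train_data, padded_sents)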

  start_word = "<s>" # start a new sentence
  generated_text = generate_sentence(model, len(tokenized_text), start=start_word)
  print(generated_text)
  print("----------")
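
One way to put numbers behind your comparison is to measure how often a context leaves the model only one choice for the next word. The sketch below is an illustrative diagnostic, not part of the assignment: the function name `unique_continuation_rate` is hypothetical, and the short `text` is just the opening of the file, lowercased and pre-split by hand.

```python
from collections import defaultdict

def unique_continuation_rate(tokens, n):
    """Fraction of (n-1)-token contexts followed by exactly one distinct token."""
    continuations = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        continuations[context].add(tokens[i + n - 1])
    unique = sum(1 for nxt in continuations.values() if len(nxt) == 1)
    return unique / len(continuations)

text = ("in the heart of the bustling city , there is a park . "
        "the park is not just any park ; it is a park of dreams .")
tokens = text.split()

for n in [2, 3, 4, 5]:
    print(n, round(unique_continuation_rate(tokens, n), 2))
```

When the rate reaches 1.0, every context has a single possible continuation, so the model can only replay the training text from wherever it starts.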

**Answer**

*Leave answer here.*


# Part 3: Homework

## 3.1 Right-to-Left Language Modeling

1. Now that you have experimented with n-grams, construct a __"reversed n-gram"__. That is, construct n-grams that use right-to-left context (i.e., start with the last word and predict backwards).   __(10 points)__

2. How does the quality of the reverse-generated text compare to that of the text generated using vanilla n-grams in Part 2? Comment and explain with examples.  __(10 points)__

In [None]:
print("original text:")
print(text_content)
print("--------")

for n in [2,3,4,5,10]:
  print("n =", n)

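  ### tokenize the "reversed" sentences
  tokenized_text = nltk.word_tokenize(text_content)
  reversed_tokenized_text = tokenized_text[::-1]

  # prepare the data using padded_everygram_pipeline
  train_data, padded_sents = padded_everygram_pipeline(n, [reversed_tokenized_text])

  ### train the model
  model = MLE(n)
  model.fit(train_data, padded_sents)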


  # print
  start_word = "in"
  generated_text = generate_sentence(model, len(tokenized_text), start=start_word)
  print(" ".join(generated_text.split(" ")[::-1]))
  print("----------")
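
A small sanity check can make the reversal idea concrete: the bigrams of a reversed token list are the original bigrams with each pair flipped. The sketch below assumes whitespace tokens and a hypothetical `extract_ngrams` helper.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the park is a park of dreams".split()

forward = extract_ngrams(tokens, 2)
backward = extract_ngrams(tokens[::-1], 2)

# Each right-to-left bigram (word, previous word) is a left-to-right
# bigram (previous word, word) with its two elements swapped.
print(backward[:3])
print(sorted(backward) == sorted((b, a) for a, b in forward))
```

The counts are therefore the same in both directions; what changes is which word is treated as the context and which as the prediction.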

**Answer**

- The reverse-generated text is still legible. This suggests that reading left to right is a human convention rather than something the model needs: for a machine, either direction works, because useful context exists on both sides of a word.


3. Finally, [calculate the perplexity](https://www.nltk.org/api/nltk.lm.api.html) of the following sentences for the original `n in [2,3,4,5]` models from Part 2 (i.e., not including the reverse models). __(10 points)__

4. Comment on the results and elaborate on your findings.  __(10 points)__

 > In the heart of the bustling city,

 > There is a park. The park is beautiful.

In [None]:
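def calculate_perplexity(text, model, n):
  # Score the text as unpadded n-grams (no <s>/</s> boundary tokens are added)
  tokens = nltk.word_tokenize(text)
  return model.perplexity(list(ngrams(tokens, n)))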

In [None]:
for n in [2,3,4,5]:

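  ### tokenize the sentences
  tokenized_text = nltk.word_tokenize(text_content)

  # prepare the data using padded_everygram_pipeline
  train_data, padded_sents = padded_everygram_pipeline(n, [tokenized_text])

  ### Create and train the model
  model = MLE(n)
  model.fit(train_data, padded_sents)

  ### print
  for test_sentence in ["In the heart of the bustling city,", "There is a park. The park is beautiful."]:
    print(f"n = {n}, perplexity = {calculate_perplexity(test_sentence, model, n)}: {test_sentence}")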


  print("*****")

**Answer**

 - Perplexity gets lower and lower as n gets larger: with long contexts the model is no longer surprised, because it is effectively "copy-pasting" the training data. So, beyond showing whether a model is underfitting, perplexity is particularly useful for detecting text that comes from the training data.

 - The second sentence has infinite perplexity because "beautiful" never comes up in the original training text.
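
Both observations follow from the standard base-2 definition of perplexity: two raised to the negative average log-probability of the tokens. The sketch below works from hand-picked probabilities rather than a trained model, and the function name `perplexity` is only illustrative.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence, given the model's probability for each token."""
    if any(p == 0 for p in probs):
        return math.inf  # one impossible token makes the whole sequence impossible
    avg_log2 = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-avg_log2)

print(perplexity([0.5, 0.5, 0.5]))  # the model hesitates between two options each time
print(perplexity([1.0, 1.0, 1.0]))  # the model is certain of every token
print(perplexity([0.5, 0.0]))       # an unseen word gets probability zero
```

A model that has memorized its training text assigns probability 1 to every next token of that text, which drives perplexity down to 1, while a single zero-probability token drives it to infinity.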

## 3.2 AI Legal Assistant



To measure how well machine learning could be used for legal assistance, the bar association has hired you to curate a large corpus of legal documents for training and testing different machine learning models. Once the dataset is curated [(e.g. this)](https://www.kaggle.com/datasets/anudit/india-legal-cases-dataset), many researchers and practitioners will bid on and use the publicly released dataset to demonstrate the superiority of their models.

1. Can you think of a potential issue with such a practice in terms of model quality? __(5 points)__
2. Can you suggest remedies that are easy to implement for such issue(s)? __(5 points)__

**Answers**

*Leave answer here.*