<a href="https://colab.research.google.com/github/elhamod/IS883/blob/main/Week3/IS883_Week3_pre_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS883 Week 3: Advanced Language Modeling


1. Use Google Colab for this assignment.

2. **You are NOT allowed to use ChatGPT for this assignment. However, you may use Google and other online resources. As per the syllabus, you are required to cite your usage. You are also responsible for understanding the solution and defending it when asked in class.**

3. For each question, fill in the answer in the cell(s) right below it. The answer could be code or text. You can add as many cells as you need for clarity.

4. Enter your BUID (only numerical part) below.

5. **Your submission on Blackboard should be the downloaded notebook (i.e., ipynb file). It should be prepopulated with your solution (i.e., the TA and/or instructor need not rerun the notebook to inspect the output). The code, when executed by the TA and/or instructor, should run with no runtime errors.**

# Part 1: Pre-class Work

## 1.1 Setup

Install the HuggingFace packages used in this notebook.

In [None]:
!pip install transformers datasets

In [None]:
BUID = 123456 #e.g., 123456 ONLY NUMERICAL PART

Machine learning is generally stochastic, meaning you get different results on different runs. To avoid that, you can "seed" your code. This code uses your BU ID (only the numeric part) as a seed for all random number generators.
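
As a minimal sketch of what seeding does, the snippet below uses only Python's built-in `random` module and an arbitrary illustrative seed. The helper name `draw_numbers` is hypothetical and not part of the assignment. Reseeding with the same value before each run makes the two runs produce the same sequence.

```python
import random

def draw_numbers(seed, count=3):
    """Reseed the generator, then draw `count` pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 99) for _ in range(count)]

# 123456 is only an illustrative seed; the notebook uses your BUID instead.
first_run = draw_numbers(123456)
second_run = draw_numbers(123456)

# Both runs start from the same seed, so the sequences are identical.
print(first_run == second_run)
```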

In [None]:
import random
import numpy as np
import transformers
from transformers import set_seed

# Set a seed for the built-in Python random module
random.seed(BUID)
# Set a seed for NumPy
np.random.seed(BUID)
# Set a seed for HuggingFace
set_seed(BUID)

## 1.2 Using a Pre-trained GPT Model

### 1.2.1 Complete the Sentence...

Let's get our feet wet by loading a GPT2 model and using it to generate some text based on a prompt. You may want to refer to [this webpage](https://huggingface.co/openai-community/gpt2) for help. **(10 Points)**

- You will generate completions for two prefixes:
  - "Damascus is a"
  - "Barcelona is a"
- For each prefix, generate 10 completions.
- Limit the maximum length of each completion to 20 *tokens*.

In [None]:
from transformers import pipeline

### Create a GPT2 generator pipeline


### Generate completions for the prefix "Damascus is a"


In [None]:
### Generate completions for the prefix "Barcelona is a"

### 1.2.2 Reflective Questions

1. Based on our previous class discussions and material, what explains GPT2's ability to generate a diverse set of completions from the same prefix? **(5 Points)**

2. What do you notice about the generated texts for the two prompts? Any interesting commonalities or stark differences? What is the underlying explanation for your observations? **(10 Points)**

3. Based on our previous class discussions and material, what is the underlying sequence of steps the model takes to generate the completions, given the prefix? **(5 Points)**

**Answers**

Write your answers below

# Part 2: In-class Work

## 2.1 Training a model from scratch

In 1.2, we saw how we could load a pre-trained model and use it to complete sentences. Now, let's see how we could train our own model.

### 2.1.1 Loading an untrained model

First, let's load an untrained GPT2 model. Take a look at [this documentation](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Config). Generate a maximum of 20 tokens per completion.

In [None]:
from transformers import GPT2Config, GPT2LMHeadModel, AutoTokenizer

### Initializing a GPT2 configuration.


### Initializing an untrained model using the configuration.


### Load a GPT2 tokenizer



In [None]:
prompt = "Damascus is a"

### Tokenize the prompt


### Generate the completion as tokens

### Print the result after converting it back to text.


### 2.1.2 Reflective questions



- How could you verify that the model is untrained?

**Answer:**

Write your answers here

### 2.1.3 Training the model

Now, let's train the model on a small corpus about Damascus.

In [None]:
corpus = """Damascus, the capital of Syria, is one of the oldest continuously inhabited cities in the world, offering a blend of history, culture, and charm. Its ancient streets are lined with historical landmarks, from the grand Umayyad Mosque to the Citadel, where layers of history from different civilizations can be traced. The Old City, with its narrow alleyways and bustling souks, provides a window into the city's rich past, where traders, artisans, and visitors alike mingle in a timeless setting. The scent of jasmine and citrus trees, which dot the courtyards of traditional Damascene houses, adds to the city's allure, making every corner feel like a step back in time.

Beyond its historical significance, Damascus is known for its hospitality and warmth. Locals welcome visitors with open arms, eager to share their stories and offer traditional Syrian delights like shawarma, kibbeh, and baklava. The city's cafés, where people gather over tea and coffee, offer a relaxed atmosphere, making it easy to soak in the daily rhythm of life. From the bustling Hamidiyeh Bazaar to the quieter, tucked-away cafés in the Old City, Damascus offers a unique mix of old and new, where centuries of history coexist with modern life.

Damascus is not just a city of the past but one with a thriving, vibrant culture. Its art galleries, music festivals, and poetry readings showcase a lively creative scene that continues to grow, despite the challenges the city has faced. The natural beauty of nearby Mount Qasioun provides a stunning backdrop, especially at sunset when the city is bathed in a golden glow. Whether strolling through its ancient streets or enjoying the breathtaking views, visitors are sure to be captivated by the beauty, resilience, and charm of Damascus."""
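
The next cell builds its training set by sliding a window over the tokenized corpus, starting one sequence at every token position. As a rough illustration of that idea, the sketch below applies the same slicing to a short list of made-up integer ids. The function name `sliding_windows` and the ids are hypothetical; no tokenizer is involved.

```python
def sliding_windows(token_ids, max_length):
    """Return one window per start position, each at most `max_length` long."""
    return [token_ids[start:start + max_length] for start in range(len(token_ids))]

toy_ids = [10, 11, 12, 13, 14]  # hypothetical ids, not real GPT2 token ids
windows = sliding_windows(toy_ids, max_length=3)

for window in windows:
    print(window)
```

Windows that start near the end of the text are shorter than `max_length`, so the resulting sequences do not all have the same length.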

In [None]:
from datasets import Dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

### Initializing an untrained model using the configuration.



### We need a collator to batch the sequences and build the labels used to calculate the loss (mlm=False selects causal language modeling)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# This function creates a dataset of subsequences of the paragraph
def create_subsequences(text, max_length=512):
    input_ids = tokenizer(text, return_tensors='pt', add_special_tokens=False).input_ids[0]

    sequences = []
    for start_idx in range(len(input_ids)):
        # Slice the input_ids to create a sequence starting at each token
        # sequence = input_ids[start_idx- max_length+1:start_idx+1 ]
        sequence = input_ids[start_idx:start_idx+max_length ]
        sequences.append({'input_ids': sequence})
    return sequences

# Create the dataset and tokenize it
data = create_subsequences(corpus)
tokenized_dataset = Dataset.from_list(data)



# Define the training arguments


# Set up the Trainer


# Train the model


Now, let's generate some completions using the trained model.

In [None]:
prompt = "Damascus"

### Tokenize the prompt


### Set the model in evaluation mode.


### Generate the completion as tokens


### Convert from tokens back to text.

### Print the result

## 2.2 Other LLM Applications.

Now that we have tried text completion with LLMs, let's try other possible tasks. Generally, if you intend to use a model locally, you can find one on [HuggingFace](https://huggingface.co/models).

### 2.2.1 Translation from English to French.

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

sentence_to_translate =

### Create the pipeline

### Use the pipeline


### Print the translation


Let's evaluate the model's ability to translate. We will try two different scoring schemes:
1. [BLEU score](https://huggingface.co/spaces/evaluate-metric/bleu)
2. [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)
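
To give a feel for what an overlap-based metric such as BLEU measures, here is a minimal sketch of clipped n-gram precision, the building block that BLEU combines across several n-gram sizes. It is not the full BLEU score (it has no brevity penalty and no averaging over n), the function names are hypothetical, and the two sentences are toy examples rather than real translations.

```python
from collections import Counter

def ngrams(tokens, n):
    """List all consecutive n-token tuples in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n=1):
    """Fraction of predicted n-grams that also occur in the reference.

    Each predicted n-gram is credited at most as many times as it
    appears in the reference (this is the "clipping").
    """
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0

prediction = "the cat sat on the mat"  # toy system output
reference = "the cat is on the mat"    # toy reference

unigram_precision = clipped_precision(prediction, reference, n=1)
bigram_precision = clipped_precision(prediction, reference, n=2)
print(unigram_precision, bigram_precision)
```

Because the comparison is exact string matching, a synonym in the prediction counts as a miss even when the meaning is preserved.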

In [None]:
!pip install evaluate
!pip install rouge_score
!pip install bert_score

Let's calculate the BLEU score.

In [None]:
import evaluate

Google_translation =

### the predictions and references
predictions =
references =

### Create a BLEU evaluator


### Calculate the score


Let's calculate the BERTScore.

In [None]:

### the predictions and references
predictions =
references =

### Create a BERTScore evaluator


### Calculate the score


**Reflective Questions:**

- Which metric would you say is a better measure of translation quality? Why?

**Answer:**

BERTScore is the better metric because it compares meaning through contextual embeddings rather than exact n-gram matching, leading to a more meaningful assessment of the translation.

### 2.2.2 Language Detection

Finally, let's see if we can find a `HuggingFace` model that detects a text's language.
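
The last step in the cell below turns the classifier's raw outputs (logits) into probabilities with a softmax. The sketch here shows that step alone in plain Python. The three labels and logit values are invented for illustration; a real language-detection model defines its own label set.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

labels = ["en", "fr", "ar"]   # hypothetical label set
logits = [2.0, 0.5, -1.0]     # hypothetical model outputs

probabilities = softmax(logits)
best_label = labels[probabilities.index(max(probabilities))]
print(best_label, probabilities)
```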

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


### Load the model and tokenizer



### Tokenize the input


### Get the output


### Apply softmax to get probabilities


# Part 3: Homework

## 3.1 Evaluate The GPT2 Models

After training the GPT2 model on your custom paragraph in 2.1.3, calculate the perplexity of both the untrained and trained models (from 2.1.1 and 2.1.3, respectively) *using the same paragraph*. **(10 Points)**

Use this [webpage](https://medium.com/@priyankads/perplexity-of-language-models-41160427ed72) for guidance.
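
As a reminder of the definition, perplexity is the exponential of the average negative log-likelihood per token. The sketch below applies that formula to made-up per-token probabilities. The function name `perplexity` and the numbers are assumptions for illustration only; in the homework, the likelihoods come from the models themselves.

```python
import math

def perplexity(token_probabilities):
    """exp of the mean negative log-probability assigned to each token."""
    count = len(token_probabilities)
    average_nll = -sum(math.log(p) for p in token_probabilities) / count
    return math.exp(average_nll)

# A model that spreads its belief evenly over 4 choices at every step.
uniform_guesser = [0.25, 0.25, 0.25, 0.25]
# A model that assigns high probability to each observed token.
confident_model = [0.9, 0.8, 0.95, 0.85]

print(perplexity(uniform_guesser))
print(perplexity(confident_model))
```

A lower value means the model assigned higher probability to the text it was scored on.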




In [None]:
import torch

### Set both trained and untrained models in evaluation mode.


### Load a GPT2 tokenizer



### Calculate and print the perplexity of the untrained model.



### Calculate and print the perplexity of the trained model.



1. How do the perplexities compare? What is the underlying justification for this observation? **(5 Points)**
2. *Based on the perplexities you calculated*, can you say one of the models is better than the other *as a general purpose text generator*? Explain your reasoning. **(5 Points)**

**Answer**

Write your answers here

## 3.2 Using HuggingFace LLMs for Rating Yelp Reviews

The CEO of your company wants to use AI to rate the Yelp reviews of the company's product.
You are provided with the following [Yelp dataset](https://huggingface.co/datasets/codyburker/yelp_review_sampled).

1. You are to decide whether to use a pre-trained model or train your own. What is the basis of your rationale? Show your full work on how you made the decision. **(10 Points)**
2. Whether you choose to train your own or use a pre-trained model, what's the accuracy you achieve? **(5 Points)**
  - For computational considerations, shuffle the *test* subset of the provided Yelp dataset and then select only 1000 reviews. Also, make sure you [use the GPU device](https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html) for this task.

You may consult the following guides:

- [How to evaluate a model.](https://huggingface.co/docs/evaluate/en/base_evaluator) Note that evaluation may take a few minutes.
- [How to take a subset of a dataset](https://huggingface.co/docs/datasets/en/process#sort-shuffle-select-split-and-shard).

**Important note**: This question is only concerned with the process of decision making. It is not concerned with actually obtaining a model with high accuracy.

**Answers:**

Write your answers here

Loading the dataset

In [None]:
### install the necessary packages
!pip install datasets transformers evaluate

In [None]:
from datasets import load_dataset

### Load the provided test dataset from the link above.

### Shuffle the data and take 1000 reviews from it as your test subset.


Load the pre-trained model or train your own.

In [None]:
from transformers import pipeline

### Getting the pipeline of the model.


Evaluate the model on the test set. *(This may take a few moments)*

In [None]:
from evaluate import evaluator

### Load the evaluator


### Compute and print the accuracy


**Answer:**

Write your answers here