<a href="https://colab.research.google.com/github/elhamod/IS883/blob/main/Week3/IS883_Week3_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS883 Week 3: Advanced Language Modeling


1. Use Google Colab for this assignment.

2. **You are NOT allowed to use ChatGPT for this assignment. However, you may use Google and other online resources. As per the syllabus, you are required to cite your usage. You are also responsible for understanding the solution and defending it when asked in class.**

3. For each question, fill in the answer in the cell(s) right below it. The answer could be code or text. You can add as many cells as you need for clarity.

4. Enter your BUID (only numerical part) below.

5. **Your submission on Blackboard should be the downloaded notebook (i.e., ipynb file). It should be prepopulated with your solution (i.e., the TA and/or instructor need not rerun the notebook to inspect the output). The code, when executed by the TA and/or instructor, should run with no runtime errors.**

# Part 1: Pre-class Work

## 1.1 Setup

Install some important HuggingFace packages

In [4]:
!pip install transformers datasets



In [5]:
BUID = 123456 #e.g., 123456 ONLY NUMERICAL PART

Machine learning is generally stochastic, meaning you get different results on different runs. To avoid that, you can "seed" your code. This code uses your BUID (numeric part only) as the seed for all random number generators.

In [6]:
import random
import numpy as np
import transformers
from transformers import set_seed

# Set a seed for the built-in Python random module
random.seed(BUID)
# Set a seed for NumPy
np.random.seed(BUID)
# Set a seed for HuggingFace
set_seed(BUID)

## 1.2 Using a Pre-trained GPT Model

### 1.2.1 Complete the Sentence...

Let's get our feet wet by loading a GPT2 model and using it to generate some text based on a prompt. You may want to refer to [this webpage](https://huggingface.co/openai-community/gpt2) for help. **(10 Points)**

- You will generate completions for two prefixes:
  - "Damascus is a"
  - "Barcelona is a"
- For each prefix, generate 10 completions.
- Limit the maximum length of each completion to 20 *tokens*.

In [7]:
from transformers import pipeline

### Create a GPT2 generator pipeline
generator = pipeline('text-generation', model='gpt2')


### Generate the answer to the question "Damascus is a"
generator("Damascus is a", max_length=20, num_return_sequences=10, truncation=True)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Damascus is a country whose name means little in Arabic- and there it is, in fact,'},
 {'generated_text': "Damascus is a target for ISIS – which is what drove Al Qaeda's attack on the capital last"},
 {'generated_text': 'Damascus is a significant obstacle to the transition to democracy\n\nSyrian forces have been the primary obstacle'},
 {'generated_text': 'Damascus is a big city that is surrounded by the Syrian borders. In fact, it hosts the'},
 {'generated_text': 'Damascus is a city in which there are five major factions with a population of over 500,000'},
 {'generated_text': 'Damascus is a city of over 9 million people who are predominantly Muslim but have been fighting the government'},
 {'generated_text': 'Damascus is a large and well-known city bordering Syria that shares its northern border with the rest'},
 {'generated_text': "Damascus is a hotbed of jihad since it's the only country where Islamic State has seized huge"},
 {'generated_text': 'Damascus is a 

In [8]:
### Generate the answer to the question "Barcelona is a"
generator("Barcelona is a", max_length=20, num_return_sequences=10, truncation=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Barcelona is a small city that is also famous for its beautiful mountains. A good idea is one'},
 {'generated_text': 'Barcelona is a great team. Now every one of our players will have played here or in the'},
 {'generated_text': "Barcelona is a city so beloved of the Barcelona people it's easy to imagine them doing something different"},
 {'generated_text': 'Barcelona is a world-class talent. The team is more important than the league table, so'},
 {'generated_text': 'Barcelona is a great place to get your next project ready!'},
 {'generated_text': 'Barcelona is a relatively small city and makes relatively little noise, and this should not affect our ability'},
 {'generated_text': 'Barcelona is a new team, with the big name being Cristiano Ronaldo and Cristiano Ronaldo himself'},
 {'generated_text': "Barcelona is a long way out of Bayern Munich's territory, especially after signing an important signing in"},
 {'generated_text': 'Barcelona is a world class football 

### 1.2.2 Reflective Questions

1. Based on our previous class discussions and material, what explains GPT2's ability to generate a diverse set of completions from the same prefix? **(5 Points)**

2. What do you notice about the generated texts for the two prompts? Any interesting commonalities or stark differences? What is the underlying explanation for your observations? **(10 Points)**

3. Based on our previous class discussions and material, what is the underlying set of steps the model takes to generate the completions, given the prefix? **(5 Points)**

**Answers**

1. The model outputs a probability distribution over what the next token could be. Sampling from this distribution, rather than always picking the most likely token, can yield a different completion each time.

2. Most of the generated texts for Damascus read like news, particularly about war and terrorism. On the other hand, most of the generated sentences for Barcelona are about football. This bias is due to the nature of the training data: it comes straight from the data and is not intrinsic to the model.

3. Tokenization -> prediction -> sampling -> decoding.
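
The sampling step in answer 1 can be illustrated with a toy next-token distribution. This is only a sketch: the candidate tokens and their scores below are made-up values, not actual GPT2 outputs, and the helper `softmax` is written here for illustration.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates for the token after "Damascus is a" (made-up scores).
vocab = [" city", " country", " target", " hotbed"]
logits = [2.0, 1.0, 0.5, 0.2]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [rng.choices(vocab, weights=probs, k=1)[0] for _ in range(10)]

# Greedy decoding always returns the single most likely token.
greedy = vocab[probs.index(max(probs))]

print(samples)  # sampled tokens can differ from draw to draw
print(greedy)
```

Greedy decoding would return the same token on every call, whereas sampling draws each token in proportion to its probability, which is one way to see why ten completions of the same prefix can differ.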

# Part 2: In-class Work

## 2.1 Training a model from scratch

In 1.2, we saw how we could load a pre-trained model and use it to complete sentences. Now, let's see how we could train our own model.

### 2.1.1 Loading an untrained model

First, let's load an untrained GPT2 model. Take a look at [this documentation](https://huggingface.co/docs/transformers/en/model_doc/gpt2#transformers.GPT2Config). Generate a maximum of 20 tokens per completion.

In [9]:
from transformers import GPT2Config, GPT2LMHeadModel, AutoTokenizer

### Initializing a GPT2 configuration.
configuration = GPT2Config()

### Initializing an untrained model using the configuration.
untrained_model = GPT2LMHeadModel(configuration)

### Load a GPT2 tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

In [10]:
prompt = "Damascus is a"

### Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

### Generate the completion as tokens
generated_tokens = untrained_model.generate(
    input_ids,
    do_sample=True,
    max_length=20,
    num_return_sequences=10,
)

### Print the result after converting it back to text.
for completion in tokenizer.batch_decode(generated_tokens):
  print(completion)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Damascus is aector6 carriage Jonah � adrenalineuclearagogue cripp Academic husbandacket eyeing Software owes Yak
Damascus is aitude acres � �article jumped polite 357 Cas lewdthrenthren courts carriageneapolis Straw
Damascus is a emblem Royal ValueKBesses 229 progressivelywkcarb Juventus Contacket ogreSwitchbyteKat
Damascus is a pathological Render Somalia foods courtselectric discriminatorywidthgrown clim Haas platinum predators arthritis Wemipes
Damascus is a Desireuters afterlifenie overwellation provisionsauntletsHAM administr 357 AcademicOF adrenaline adolescentSelf
Damascus is a cooks Spears Roundsm Badge relies Improvement federal Roc gateway Horror 276 276 understandingneapolispg
Damascus is a cath 67 dreaded immun exceptions coats–anche409 simulacodes sound attribute�322 cozy
Damascus is aiard ProceedingsllerssteinRED preacherormal songsPsyNetMessage Mixed Confederationmp'); combined � Grad
Damascus is a pixels relations unworthy About fascismд cleaning meanings mL348…) conces

### 2.1.2 Reflective Questions



- How could you verify that the model is untrained?

**Answer:**

Simply by looking at the generated text: the model produces gibberish, which indicates that it has not been trained.

### 2.1.3 Training the model

Now, let's train the model on some corpus about Damascus.

In [11]:
corpus = """Damascus, the capital of Syria, is one of the oldest continuously inhabited cities in the world, offering a blend of history, culture, and charm. Its ancient streets are lined with historical landmarks, from the grand Umayyad Mosque to the Citadel, where layers of history from different civilizations can be traced. The Old City, with its narrow alleyways and bustling souks, provides a window into the city's rich past, where traders, artisans, and visitors alike mingle in a timeless setting. The scent of jasmine and citrus trees, which dot the courtyards of traditional Damascene houses, adds to the city's allure, making every corner feel like a step back in time.

Beyond its historical significance, Damascus is known for its hospitality and warmth. Locals welcome visitors with open arms, eager to share their stories and offer traditional Syrian delights like shawarma, kibbeh, and baklava. The city's cafés, where people gather over tea and coffee, offer a relaxed atmosphere, making it easy to soak in the daily rhythm of life. From the bustling Hamidiyeh Bazaar to the quieter, tucked-away cafés in the Old City, Damascus offers a unique mix of old and new, where centuries of history coexist with modern life.

Damascus is not just a city of the past but one with a thriving, vibrant culture. Its art galleries, music festivals, and poetry readings showcase a lively creative scene that continues to grow, despite the challenges the city has faced. The natural beauty of nearby Mount Qasioun provides a stunning backdrop, especially at sunset when the city is bathed in a golden glow. Whether strolling through its ancient streets or enjoying the breathtaking views, visitors are sure to be captivated by the beauty, resilience, and charm of Damascus."""

In [12]:
from datasets import Dataset
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

### Initializing an untrained model using the configuration.
trained_model = GPT2LMHeadModel(configuration)

### The collator batches the examples and copies input_ids into labels (mlm=False),
### which lets the model compute the causal language modeling loss.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# This function creates a dataset of subsequences of the paragraph
def create_subsequences(text, max_length=512):
    input_ids = tokenizer(text, return_tensors='pt', add_special_tokens=False).input_ids[0]

    sequences = []
    for start_idx in range(len(input_ids)):
        # Take up to max_length tokens starting at each token position
        sequence = input_ids[start_idx:start_idx + max_length]
        sequences.append({'input_ids': sequence})
    return sequences

# Create the dataset and tokenize it
data = create_subsequences(corpus)
tokenized_dataset = Dataset.from_list(data)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-damascus-model",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    logging_strategy="epoch",
)

# Set up the Trainer
trainer = Trainer(
    model=trained_model.to('cuda'),
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset,
)

# Train the model
trainer.train()

| Step | Training Loss |
|------|---------------|
| 377  | 1.4874 |
| 754  | 0.1147 |
| 1131 | 0.0662 |
| 1508 | 0.0466 |
| 1885 | 0.0357 |


TrainOutput(global_step=1885, training_loss=0.3501128553395562, metrics={'train_runtime': 338.317, 'train_samples_per_second': 5.572, 'train_steps_per_second': 5.572, 'total_flos': 181814855040000.0, 'train_loss': 0.3501128553395562, 'epoch': 5.0})
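
To make the dataset construction in `create_subsequences` more concrete, here is a minimal sketch of the same slicing applied to a short list of made-up token ids. The function name `sliding_windows` and the ids are illustrative stand-ins, not part of the notebook.

```python
def sliding_windows(token_ids, max_length):
    """Return one window per start position, each at most max_length tokens long."""
    return [token_ids[start:start + max_length] for start in range(len(token_ids))]

toy_ids = [10, 11, 12, 13, 14]  # stand-in for the tokenizer output
windows = sliding_windows(toy_ids, max_length=3)

for window in windows:
    print(window)
# [10, 11, 12]
# [11, 12, 13]
# [12, 13, 14]
# [13, 14]
# [14]
```

There is one window per token, so the number of training examples equals the number of tokens in the corpus. With a batch size of 1, that also matches the number of steps per epoch, which is consistent with the log above: 377 steps after the first epoch and 1885 = 5 × 377 steps in total.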

Now, let's generate some completions using the trained model.

In [None]:

prompt = "Damascus"

### Tokenize the prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

### Set the model in evaluation mode.
trained_model.eval()

### Generate and print 10 completions
for i in range(10):
  ### Generate the completion as tokens
  generated_tokens = trained_model.generate(
      input_ids.to('cuda'),
      do_sample=True,
      max_length=20,
      pad_token_id=tokenizer.eos_token_id,
  )

  ### Convert from tokens back to text.
  print(tokenizer.batch_decode(generated_tokens))

['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']
['Damascus is not just a city of the past but one with a thriving, vibrant culture. Its']


## 2.2 Other LLM Applications

Now that we have tried text completion with LLMs, let's try other possible tasks. Generally, if you intend to use a model locally, you can find one on [HuggingFace](https://huggingface.co/models).

### 2.2.1 Translation from English to French

In [13]:
# Use a pipeline as a high-level helper
from transformers import pipeline

sentence_to_translate = "I am Mohannad, and I love walking in the mountains"

### Create the pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

### Use the pipeline
result = pipe(sentence_to_translate)




Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [16]:
print(result[0]['translation_text'])

Je suis Mohannad, et j'adore marcher dans les montagnes


Let's evaluate the model's ability to translate. We will try two different scoring schemes:
1. [BLEU score](https://huggingface.co/spaces/evaluate-metric/bleu)
2. [BERTScore](https://huggingface.co/spaces/evaluate-metric/bertscore)

In [None]:
!pip install evaluate
!pip install rouge_score
!pip install bert_score


Let's calculate the BLEU score.

In [None]:
import evaluate

### The model's translation from the pipeline above
translation = result[0]['translation_text']

Google_translation = "Je m'appelle Mohannad et j'adore marcher en montagne"

### The predictions and references
predictions = [translation] # The model's prediction
references = [
    [Google_translation]
]

### Create a BLEU evaluator
bleu = evaluate.load("bleu")

### Calculate the score (max_order=1 counts unigram matches only)
results = bleu.compute(predictions=predictions, references=references, max_order=1)
print(results)


{'bleu': 0.5, 'precisions': [0.5], 'brevity_penalty': 1.0, 'length_ratio': 1.25, 'translation_length': 10, 'reference_length': 8}
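
The 0.5 reported above can be checked by hand. The sketch below computes unigram BLEU (clipped unigram precision times the brevity penalty) in plain Python. It assumes a simplified tokenization, splitting on spaces with the comma as its own token, which happens to give the same lengths (10 and 8) as the output above; the function `unigram_bleu` is written for illustration and is not the `evaluate` implementation.

```python
import math
from collections import Counter

def unigram_bleu(pred_tokens, ref_tokens):
    """BLEU with max_order=1: clipped unigram precision times the brevity penalty."""
    pred_counts = Counter(pred_tokens)
    ref_counts = Counter(ref_tokens)
    # Each predicted token is credited at most as often as it occurs in the reference.
    matches = sum(min(count, ref_counts[tok]) for tok, count in pred_counts.items())
    precision = matches / len(pred_tokens)
    c, r = len(pred_tokens), len(ref_tokens)
    # Only predictions shorter than the reference are penalized.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return precision * brevity_penalty

# Simplified tokenization (an assumption): whitespace split, comma as a separate token.
prediction = "Je suis Mohannad , et j'adore marcher dans les montagnes".split()
reference = "Je m'appelle Mohannad et j'adore marcher en montagne".split()

score = unigram_bleu(prediction, reference)
print(score)  # 0.5
```

Five of the ten predicted tokens (`Je`, `Mohannad`, `et`, `j'adore`, `marcher`) appear in the reference, and the prediction is longer than the reference, so no brevity penalty applies.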


Let's calculate the BERTScore.

In [None]:
### The predictions and references
predictions = [translation]
references = [Google_translation]

### Create a BERTScore evaluator
bertscore = evaluate.load("bertscore")

### Calculate the score
### Note: distilbert-base-uncased is an English model; a multilingual model may suit French text better.
results = bertscore.compute(predictions=predictions, references=references, lang='en', model_type="distilbert-base-uncased")
print(results)


{'precision': [0.9094492197036743], 'recall': [0.9290704727172852], 'f1': [0.9191551208496094], 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.44.2)'}


**Reflective Questions:**

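- Which would you say is the better measure for the translation task, and why?

**Answer:**

BERTScore is the better measure here because it compares the meaning of the two sentences through contextual embeddings rather than counting exact token matches. For example, "dans les montagnes" and "en montagne" mean nearly the same thing but share no tokens, so BLEU penalizes the difference (0.5) while BERTScore does not (F1 of about 0.92).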
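
The difference between string matching and embedding matching can be sketched with toy vectors. The example below follows the greedy matching idea behind BERTScore: each token is paired with its most similar token on the other side by cosine similarity, and the best-match similarities are averaged. The 2-D "embeddings" are made up for illustration; a real run would use vectors from a BERT-style model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match(cand_vecs, ref_vecs):
    """Precision, recall and F1 from best-match cosine similarities."""
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up 2-D "embeddings": near-synonyms point in almost the same direction.
candidate = {"marcher": [0.0, 1.0], "montagnes": [1.0, 0.1]}
reference = {"marcher": [0.0, 1.0], "montagne": [1.0, 0.0]}

# Exact string overlap only credits "marcher".
exact_overlap = len(set(candidate) & set(reference)) / len(candidate)
precision, recall, f1 = greedy_match(list(candidate.values()), list(reference.values()))

print(exact_overlap)
print(round(f1, 3))
```

Exact matching gives credit for only one of the two candidate tokens, while the embedding-based score stays close to 1 because `montagnes` and `montagne` are assigned nearly parallel vectors in this toy setup.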

### 2.2.2 Language Detection

Finally, let's see if we can find a `HuggingFace` model that detects a text's language.

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


### Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("jb2k/bert-base-multilingual-cased-language-detection")
model = AutoModelForSequenceClassification.from_pretrained("jb2k/bert-base-multilingual-cased-language-detection")


### Tokenize the input
inputs = tokenizer("أنا إسمي مهند", return_tensors="pt")

### Get the output
outputs = model(**inputs)

### Apply softmax to get probabilities
torch.nn.functional.softmax(outputs.logits, dim=-1)


tensor([[9.9873e-01, 3.7610e-05, 9.2826e-06, 3.3912e-05, 3.4278e-05, 2.8881e-05,
         1.2983e-05, 3.4877e-05, 8.4551e-06, 2.7197e-05, 1.7408e-05, 4.8076e-05,
         8.1769e-06, 2.4809e-05, 4.9340e-06, 1.4162e-05, 1.8846e-05, 4.4260e-05,
         2.4217e-05, 1.2915e-05, 3.9069e-05, 3.4452e-05, 5.9625e-06, 2.6981e-05,
         1.2403e-05, 1.3418e-05, 2.4628e-05, 7.9015e-06, 3.0631e-05, 4.5484e-05,
         2.9203e-04, 3.3104e-05, 3.2801e-05, 2.3564e-05, 7.4263e-06, 3.5806e-05,
         1.6958e-05, 2.5047e-05, 2.1486e-05, 1.1459e-05, 1.6625e-05, 3.6385e-05,
         7.2173e-06, 1.5677e-05, 5.5567e-06]], grad_fn=<SoftmaxBackward0>)
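
The tensor above holds one probability per candidate language, with almost all of the mass on index 0. A minimal sketch of turning such a vector into a prediction follows. The probabilities are a shortened, rounded stand-in for the 45 values above, and the label names are hypothetical; in the notebook, `model.config.id2label` would be the place to look up what each index means for this particular model.

```python
# Shortened, rounded stand-in for the probability vector printed above.
probs = [0.9987, 0.00004, 0.00001, 0.00029]

# Hypothetical label names, used only to show the lookup step.
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2", 3: "LABEL_3"}

# The predicted class is the index with the highest probability.
best_index = max(range(len(probs)), key=lambda i: probs[i])
print(best_index, id2label[best_index], probs[best_index])
```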

# Part 3: Homework

## 3.1 Evaluate The GPT2 Models

After having trained the GPT2 model on your custom paragraph in 2.1.3, can you calculate the perplexity for both the untrained and trained models in 2.1.1 and 2.1.3, respectively, *using the same paragraph*? **(10 Points)**

Use this [webpage](https://medium.com/@priyankads/perplexity-of-language-models-41160427ed72) for guidance.
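
As a reminder of what the metric measures, perplexity is the exponential of the average negative log-probability that a model assigns to the tokens it actually observes. The sketch below uses made-up token probabilities rather than model outputs, and the function `perplexity` is written for illustration only.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

uniform = perplexity([0.25, 0.25, 0.25, 0.25])  # model guesses among 4 equally likely options
confident = perplexity([0.9, 0.8, 0.95, 0.9])   # model mostly expects each observed token

print(uniform)
print(confident)
```

A model that spreads its probability evenly over 4 options has a perplexity of about 4, while a model that assigns high probability to each observed token has a perplexity close to 1. Lower perplexity on a text therefore means the model finds that text less surprising.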




In [None]:
import torch

### Set both trained and untrained models in evaluation mode.
trained_model.eval()
untrained_model.eval()

### Load a GPT2 tokenizer
tokenizer =

### Calculate and print the perplexity of the untrained model.
inputs =
loss =
untrained_perplexity =
print("Untrained perplexity = ", untrained_perplexity.item())

### Calculate and print the perplexity of the trained model.
inputs =
loss =
trained_perplexity =
print("Trained perplexity = ", trained_perplexity.item())


1. How do the perplexities compare? What is the underlying justification for this observation? **(5 Points)**
2. *Based on the perplexities you calculated*, can you say one of the models is better than the other *as a general purpose text generator*? Explain your reasoning. **(5 Points)**

**Answer**

*Write your answers here*

## 3.2 Using HuggingFace LLMs for Rating Yelp Reviews

The CEO of your company wants to use AI to rate the company's product's Yelp reviews.
You are provided with the following [Yelp dataset](https://huggingface.co/datasets/codyburker/yelp_review_sampled).

1. You are to decide whether to use a pre-trained model or train your own. What is the basis of your rationale? Show your full work on how you made the decision. **(10 Points)**
2. Whether you choose to train your own or use a pre-trained model, what's the accuracy you achieve? **(5 Points)**
  - For computational considerations, shuffle the *test* subset of the provided Yelp dataset and then select only 1000 reviews. Also, make sure you [use the GPU device](https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html) for this task.

You may consult the following guides:

- [How to evaluate a model](https://huggingface.co/docs/evaluate/en/base_evaluator). Note that evaluation may take a few minutes.
- [How to take a subset of a dataset](https://huggingface.co/docs/datasets/en/process#sort-shuffle-select-split-and-shard).

**Important note**: This question is only concerned with the process of decision making. It is not concerned with actually obtaining a model with high accuracy.

**Answers:**

Write your answers here

Loading the dataset

In [None]:
### install the necessary packages
!pip install datasets transformers evaluate



In [None]:
from datasets import load_dataset

### Load the provided test dataset from the link above.
dataset =

### Shuffle the data and take 1000 reviews from it as your test subset.
subset =

Load the pre-trained model or train your own.

In [None]:
from transformers import pipeline

### Getting the pipeline of the model or training your own.



Evaluate the model on the test set. *(This may take a few minutes)*

In [None]:
from evaluate import evaluator

### Load the evaluator
task_evaluator =

### Compute and print the accuracy
eval_results =


**Answer:**

*Write your answers here*