In [20]:
################################   Limitations and tasks I could not complete in the code below:
# 1. Although I am using the accelerate package for model parallelism, the code was tested on a single T4 GPU available on GCP.
# 2. For Task9: Adjusting the dependent layers is not completed.

# 3. For Task11: Training is done on only 8K samples from squad['train']. Due to limited time availability in Google Colab the data was cut short,
#                but the training code will work for the whole data as well. The same goes for evaluation, which is done on only 1K samples.
#                Implications: The fine-tuned model won't see all contexts in the train data. SQuAD repeats each context across several questions,
#                so 8K samples cover far fewer than 8K distinct contexts.
#
# 4. The train and validation data are stored in Python lists rather than NumPy arrays. Lists are expensive in memory, which reduces the feasible batch size.

# 5. For Task12: I used F1 score (%) as the evaluation metric, which is the token overlap between the predicted answer and the reference answer.
#    Pitfall of using F1 score: F1 only rewards literal token overlap between predicted and reference answers.
#                               If the model provides an answer that is semantically correct but worded differently from the reference answer,
#                               the F1 score will penalize it.

# 6. Did not create a separate conda environment to make the code portable. Directly used the Google Colab notebook, which provides pre-installed
#    packages like Hugging Face transformers, PyTorch, etc.
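
The F1 pitfall noted in item 5 can be illustrated with a small, simplified sketch of token-overlap F1. It is an assumption-laden stand-in, not the official SQuAD scorer: it only lowercases and splits on whitespace, while the official metric also normalizes punctuation and articles. The function name `token_f1` and the example strings are hypothetical.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection: a shared token counts as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

exact = token_f1("Delhi", "Delhi")                    # identical answer
partial = token_f1("the city of Delhi", "Delhi")      # correct but wordier
paraphrase = token_f1("India's capital city", "Delhi")  # no shared tokens at all
print(exact, partial, paraphrase)
```

The wordier answer is still correct but scores well below the exact one, and an answer with no shared tokens scores zero regardless of its meaning.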

In [3]:
!pip install accelerate # Load a model on multiple GPUs with device_map="auto". Accelerate provides for model parallelism.
!pip install datasets # To download SQuAD dataset

import torch



In [4]:
# Task1: Import GPU version of google/flan-t5-small from Hugging-face library

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")

print("Vocabulary size:", tokenizer.vocab_size) # Vocab Size
# Vocabulary and tokenizer: T5 uses a SentencePiece-based tokenizer; the vocabulary (32000 tokens) is built with SentencePiece and covers multiple languages.

# Load the language model
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small", device_map="auto")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Vocabulary size: 32000


In [5]:
# Task2: Verify if the summarization task works.

# Random 100 words text in english
data = "In the heart of a bustling city, skyscrapers towered over crowded streets. People hurriedly navigated the urban maze, each with a unique story to tell. Neon lights flickered, casting a vibrant glow on the pavement. The aroma of diverse cuisines wafted from street vendors, creating a sensory symphony. Amidst the chaos, a sense of energy and possibility permeated the air. Time seemed to dance between the relentless pace of progress and the timeless essence of human connection. In this dynamic tapestry of life, every encounter held the potential to unravel new narratives, intertwining the threads of destiny in an ever-evolving urban landscape."

# Tokenize the input text, truncating it to at most 512 tokens
token_ids = tokenizer(data, return_tensors="pt", max_length=512, truncation=True).input_ids.to("cuda")

# Summary token ids
summ_ids = model.generate(token_ids, max_length=150, length_penalty=2.0, num_beams=3, early_stopping=True)
# length_penalty is an exponent applied to the sequence length when beam scores are normalized (it only matters with beam search).
# If length_penalty > 0.0, longer sequences are favored.
# If length_penalty < 0.0, shorter sequences are favored.

# Generated summary of around 25 words.
print(tokenizer.decode(summ_ids[0], skip_special_tokens=True))

In this dynamic tapestry of life, every encounter held the potential to unravel new narratives, intertwining the threads of destiny in an ever-evolving urban landscape.
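
How `length_penalty` steers beam search can be sketched with plain arithmetic. The Hugging Face generation docs describe it as an exponent on the sequence length that divides the (negative) sum of log-probabilities; the sketch below assumes exactly that normalization. The two candidates and their scores are hypothetical numbers chosen for illustration.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: total log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)

# Hypothetical candidates: (sum of token log-probabilities, number of tokens).
short_candidate = (-4.0, 5)
long_candidate = (-9.0, 15)

def preferred(length_penalty: float) -> str:
    """Name the candidate with the higher normalized score."""
    short = beam_score(*short_candidate, length_penalty)
    long = beam_score(*long_candidate, length_penalty)
    return "short" if short > long else "long"

for lp in (-1.0, 0.0, 1.0, 2.0):
    print(f"length_penalty={lp:+.1f} -> prefers the {preferred(lp)} candidate")
```

With these numbers the short candidate wins for penalties of 0.0 and below, and the long candidate wins for 1.0 and above, which matches the documented direction of the parameter.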


In [6]:
# Task3: Verify if the Q&A task works.

# Context, Question
context = "Delhi is the capital city of India."
question = "What is the capital of India?"
data = f"context: {context} question: {question}"

# Tokenize the data
token_ids = tokenizer(data, return_tensors="pt").input_ids.to("cuda")
answer_ids = model.generate(token_ids)

# Decode the answer
answer = tokenizer.decode(answer_ids[0], skip_special_tokens=True)
print(answer)

Delhi




In [7]:
# Task4: Verify if the English to French translation task works

# English text
data = "English to French: My name is Shivam. Working on ServiceNow assignment."

# Tokenize the data
token_ids = tokenizer(data, return_tensors="pt").input_ids.to("cuda")

french_ids = model.generate(token_ids) # Tensor containing the generated sequence in french.
print(tokenizer.decode(french_ids[0], skip_special_tokens=True)) # French Text

M'ai nom Shivam. Travail en oeuvre de ServiceNow.


In [8]:
# Task5: Programmatically print the names of all the model layers and their dimensions

# Name and dimension of all model layers
for name, param in model.named_parameters():
    print(f"Name: {name}, Dimension: {param.size()}")

Name: shared.weight, Dimension: torch.Size([32128, 512])
Name: encoder.block.0.layer.0.SelfAttention.q.weight, Dimension: torch.Size([384, 512])
Name: encoder.block.0.layer.0.SelfAttention.k.weight, Dimension: torch.Size([384, 512])
Name: encoder.block.0.layer.0.SelfAttention.v.weight, Dimension: torch.Size([384, 512])
Name: encoder.block.0.layer.0.SelfAttention.o.weight, Dimension: torch.Size([512, 384])
Name: encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight, Dimension: torch.Size([32, 6])
Name: encoder.block.0.layer.0.layer_norm.weight, Dimension: torch.Size([512])
Name: encoder.block.0.layer.1.DenseReluDense.wi_0.weight, Dimension: torch.Size([1024, 512])
Name: encoder.block.0.layer.1.DenseReluDense.wi_1.weight, Dimension: torch.Size([1024, 512])
Name: encoder.block.0.layer.1.DenseReluDense.wo.weight, Dimension: torch.Size([512, 1024])
Name: encoder.block.0.layer.1.layer_norm.weight, Dimension: torch.Size([512])
Name: encoder.block.1.layer.0.SelfAttention.q.weigh

In [9]:
# Task6: Programmatically print the total number of parameters/weights in this model.

# Number of parameters
params = sum(p.numel() for p in model.parameters())
print(f"# of parameters: {params}")

# of parameters: 76961152
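
The total can be cross-checked by hand for the layers whose shapes appear in the Task 5 printout. This is a minimal sketch that multiplies out those printed shapes for the shared embedding and encoder block 0 only; the remaining blocks are not included because their shapes are cut off in the output above.

```python
from math import prod

# Shapes copied from the Task 5 printout (embedding and encoder block 0 only).
shapes = {
    "shared.weight": (32128, 512),
    "encoder.block.0.layer.0.SelfAttention.q.weight": (384, 512),
    "encoder.block.0.layer.0.SelfAttention.k.weight": (384, 512),
    "encoder.block.0.layer.0.SelfAttention.v.weight": (384, 512),
    "encoder.block.0.layer.0.SelfAttention.o.weight": (512, 384),
    "encoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight": (32, 6),
    "encoder.block.0.layer.0.layer_norm.weight": (512,),
    "encoder.block.0.layer.1.DenseReluDense.wi_0.weight": (1024, 512),
    "encoder.block.0.layer.1.DenseReluDense.wi_1.weight": (1024, 512),
    "encoder.block.0.layer.1.DenseReluDense.wo.weight": (512, 1024),
    "encoder.block.0.layer.1.layer_norm.weight": (512,),
}

# numel() of a tensor is the product of its dimensions.
counts = {name: prod(shape) for name, shape in shapes.items()}
embedding_params = counts["shared.weight"]
block0_params = sum(n for name, n in counts.items() if name.startswith("encoder.block.0."))

total_params = 76961152  # value printed by Task 6
print(f"embedding: {embedding_params}, encoder block 0: {block0_params}")
print(f"embedding share of total: {embedding_params / total_params:.1%}")
```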


In [10]:
# Task7: Set the tensor in final layer (decoder.final_layer_norm.weight) to all zeros.
model.decoder.final_layer_norm.weight.data = torch.zeros_like(model.decoder.final_layer_norm.weight.data)
print(f"Dimension of final decoder layer: {len(model.decoder.final_layer_norm.weight.data)}")
print(f"Tensor for final output layer:\n{model.decoder.final_layer_norm.weight.data}")

Dimension of final decoder layer: 512
Tensor for final output layer:
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,

In [11]:
# Task8: Verify if the Q&A task works after resetting the weights of the above layer.

data = f"context: {context} question: {question}"
token_ids = tokenizer(data, return_tensors="pt").input_ids.to("cuda")

answer_ids = model.generate(token_ids)
answer = tokenizer.decode(answer_ids[0]) # Default max_length = 20
print(answer)

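# It does not work after resetting the final decoder layer-norm weight to zero. The norm output is multiplied by that weight, so the decoder
# output becomes all zeros, every vocabulary logit is identical, and greedy decoding keeps picking token id 0, which is <pad>.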

<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
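
Why a zeroed norm weight yields only `<pad>` can be sketched on a toy example in plain Python. The sketch assumes an RMS-style layer norm (rescale, then multiply by a learned weight, with no bias) followed by a bias-free output projection, which is my understanding of the T5 decoder head. The vector and matrix values are hypothetical.

```python
import math

def rms_layer_norm(hidden, weight, eps=1e-6):
    """RMS-style layer norm: divide by the root-mean-square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in hidden) / len(hidden) + eps)
    return [w * v / rms for w, v in zip(weight, hidden)]

def lm_head(hidden, vocab_matrix):
    """Bias-free linear projection from hidden size to vocabulary size."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in vocab_matrix]

hidden = [0.5, -1.2, 2.0]            # toy decoder state (hypothetical values)
vocab_matrix = [
    [0.3, -0.1, 0.8],                # row 0 plays the role of <pad>
    [1.5, 0.2, -0.4],
    [-0.7, 0.9, 0.1],
    [0.6, 0.6, 0.6],
]

normed = rms_layer_norm(hidden, weight=[0.0, 0.0, 0.0])  # zeroed scale, as in Task 7
logits = lm_head(normed, vocab_matrix)
# Greedy pick; when all logits tie, the lowest id wins.
next_id = max(range(len(logits)), key=logits.__getitem__)
print(logits, next_id)
```

Because every logit is zero, the choice is decided purely by the tie-break, and id 0 is selected at every step until the default maximum length is reached.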


In [13]:
# Task9: Replace the decoder.final_layer_norm.weight with a layer of smaller dimensions and adjust all the dependent layers to match the dimension

new_dimensions = (model.config.d_model // 4,)  # Dimensionality of hidden states(512) / 4 = 128
print("New Output Dimension for final decoder layer:", new_dimensions)

# New tensor with smaller dimensions (created on the same device as the model)
small_dimension = torch.randn(new_dimensions, device=model.device) # Random tensor of 128 values
# Tile the 128 values 4 times so the result is 512 long again and still matches d_model
small_dimension = small_dimension.repeat(model.decoder.final_layer_norm.weight.data.shape[0] // small_dimension.shape[0])

# Update the model's weights
model.decoder.final_layer_norm.weight.data = small_dimension

# Adjust dependent layers: not implemented (remaining)



New Output Dimension for final decoder layer: (128,)
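
One way to reason about which layers depend on the width of the final norm is plain shape bookkeeping. The sketch below is not a working model surgery: it only checks that each layer's input width matches the previous layer's output width. The `down_projection` layer is a hypothetical addition, and the `lm_head` width of 512 inputs is assumed from `d_model`.

```python
def check_chain(input_width, layers):
    """Walk (name, in_width, out_width) layers and verify each input matches the previous output."""
    width = input_width
    for name, in_width, out_width in layers:
        if in_width != width:
            raise ValueError(f"{name} expects width {in_width} but receives {width}")
        width = out_width
    return width

d_model, small, vocab = 512, 128, 32128

# Only the norm is shrunk: the decoder still emits 512-wide states, so the chain breaks.
broken = [
    ("final_layer_norm", small, small),
    ("lm_head", d_model, vocab),
]
# Hypothetical adjustment: project 512 -> 128 before the norm and rebuild lm_head with 128 inputs.
adjusted = [
    ("down_projection", d_model, small),
    ("final_layer_norm", small, small),
    ("lm_head", small, vocab),
]

try:
    check_chain(d_model, broken)
except ValueError as err:
    print("broken:", err)

print("adjusted output width:", check_chain(d_model, adjusted))
```

Under these assumptions, a truly smaller norm needs both an upstream projection and a rebuilt output head, and any new layers would have to be trained before the model produces useful text again.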


In [14]:
# Task 10: Reload the original google/flan-t5-small model
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")

In [17]:
# Task11: Train the model for a Q&A task

from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW # transformers.AdamW is deprecated; note that torch's AdamW defaults to weight_decay=0.01
from datasets import load_dataset, load_metric
from tqdm import tqdm

squad_dataset = load_dataset("squad") # Load SQuAD dataset
# Number of samples in the training set: 87599
# Number of samples in the validation set: 10570

Data = squad_dataset["train"].select(range(10000)) # Using only 10K samples from the train set (8K train + 2K validation after the split)
Full_train_data = []
task_prefix = "Question Answering" # task prefix/trigger word

# Build context/question/answer triplets -> tokenize to input_ids and attention_mask with max_length = 256
for inst in Data:
  context = inst["context"]
  question = inst["question"]
  answer = inst["answers"]["text"][0]  # "text" is a list; SQuAD train has one answer per question

  data = f"{task_prefix}: context: {context} question: {question} answer: {answer}"
  token_ids = tokenizer(data, return_tensors="pt", max_length=256, truncation=True, padding='max_length')
  Full_train_data.append({'input_ids': token_ids['input_ids'].flatten(),'attention_mask': token_ids['attention_mask'].flatten()})

# Full data split in train(80%) and validation(20%)
train_prepared_data = Full_train_data[:int(0.8 * len(Full_train_data))]
val_prepared_data = Full_train_data[int(0.8 * len(Full_train_data)):]
print(f"Train Data Size: {len(train_prepared_data)}, Validation Data Size: {len(val_prepared_data)}")

Train Data Size: 8000, Validation Data Size: 2000
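
A common alternative for sequence-to-sequence fine-tuning is to keep the prompt and the answer as separate source and target texts, and to mask padding in the labels so it does not contribute to the loss. The sketch below shows only that data shaping in plain Python; it is not the method used in this notebook. The helper names are hypothetical, the pad id of 0 is T5's pad token id, and -100 is the ignore index used by PyTorch's cross-entropy loss.

```python
def build_pair(inst, task_prefix="Question Answering"):
    """Split one SQuAD-style record into (source text, target text)."""
    source = f"{task_prefix}: context: {inst['context']} question: {inst['question']}"
    target = inst["answers"]["text"][0]  # first reference answer
    return source, target

def mask_padding(label_ids, pad_id=0, ignore_index=-100):
    """Replace padding ids in the labels so the loss skips those positions."""
    return [ignore_index if tok == pad_id else tok for tok in label_ids]

# Record laid out like a SQuAD row, reusing the Task 3 example.
record = {
    "context": "Delhi is the capital city of India.",
    "question": "What is the capital of India?",
    "answers": {"text": ["Delhi"], "answer_start": [0]},
}

source, target = build_pair(record)
labels = mask_padding([3, 17, 1, 0, 0])  # hypothetical token ids, padded with 0
print(source)
print(target)
print(labels)
```

With this layout the tokenized source would feed `input_ids` and the masked target ids would feed `labels`, so the prompt at training time has the same shape as the prompt used in evaluation.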


In [18]:
train_loader = DataLoader(train_prepared_data, batch_size=8, shuffle=True)
val_loader = DataLoader(val_prepared_data, batch_size=8, shuffle=False)
epochs = 5
optimizer = AdamW(model.parameters(), lr=1e-5)


# Training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(epochs):
    model.train()
    for batch in tqdm(train_loader, desc=f'Epoch {epoch + 1}/{epochs}'):
        inputs = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)

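        # labels=inputs: the target sequence is the prompt itself (context, question and answer), padding included
        outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=inputs)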
        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
      for val_batch in tqdm(val_loader, desc=f'Validation Epoch {epoch + 1}/{epochs}'):
        val_inputs = val_batch['input_ids'].to(device)
        val_attention_mask = val_batch['attention_mask'].to(device)
        val_outputs = model(input_ids=val_inputs, attention_mask=val_attention_mask, labels=val_inputs)
        val_loss += val_outputs.loss.item()

    # "Train Loss" below is the loss of the last training batch only, not the mean over the epoch
    print(f'Epoch {epoch + 1}/{epochs}, Train Loss: {loss.item()}, Mean Validation Loss: {val_loss / len(val_loader)}')


# Save the model and tokenizer
model.save_pretrained("model-flan-t5-small-squad")
tokenizer.save_pretrained("tokenizer-flan-t5-small-squad")

Epoch 1/5: 100%|██████████| 1000/1000 [05:32<00:00,  3.01it/s]
Validation Epoch 1/5: 100%|██████████| 250/250 [00:26<00:00,  9.36it/s]


Epoch 1/5, Train Loss: 0.5570664405822754, Mean Validation Loss: 0.4405475752800703


Epoch 2/5: 100%|██████████| 1000/1000 [05:31<00:00,  3.01it/s]
Validation Epoch 2/5: 100%|██████████| 250/250 [00:26<00:00,  9.39it/s]


Epoch 2/5, Train Loss: 0.10793539881706238, Mean Validation Loss: 0.022706895373761655


Epoch 3/5: 100%|██████████| 1000/1000 [05:34<00:00,  2.99it/s]
Validation Epoch 3/5: 100%|██████████| 250/250 [00:27<00:00,  9.23it/s]


Epoch 3/5, Train Loss: 0.04521865397691727, Mean Validation Loss: 0.006208317199023441


Epoch 4/5: 100%|██████████| 1000/1000 [05:34<00:00,  2.99it/s]
Validation Epoch 4/5: 100%|██████████| 250/250 [00:26<00:00,  9.37it/s]


Epoch 4/5, Train Loss: 0.04251306131482124, Mean Validation Loss: 0.002639464108389802


Epoch 5/5: 100%|██████████| 1000/1000 [05:34<00:00,  2.99it/s]
Validation Epoch 5/5: 100%|██████████| 250/250 [00:26<00:00,  9.37it/s]


Epoch 5/5, Train Loss: 0.028646066784858704, Mean Validation Loss: 0.001378522665530909


('tokenizer-flan-t5-small-squad/tokenizer_config.json',
 'tokenizer-flan-t5-small-squad/special_tokens_map.json',
 'tokenizer-flan-t5-small-squad/spiece.model',
 'tokenizer-flan-t5-small-squad/added_tokens.json')
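
The log above compares a single-batch train loss with a mean validation loss. A small sketch of tracking the epoch mean alongside the last-batch value is given here, so the two numbers are on the same footing. The per-batch losses are hypothetical and the helper name is an assumption.

```python
def epoch_summary(batch_losses):
    """Return (mean loss over the epoch, loss of the final batch)."""
    mean_loss = sum(batch_losses) / len(batch_losses)
    return mean_loss, batch_losses[-1]

# Hypothetical per-batch losses collected during one epoch.
losses = [2.0, 1.0, 0.6, 0.4]
mean_loss, last_loss = epoch_summary(losses)
print(f"mean train loss: {mean_loss:.3f}, last-batch loss: {last_loss:.3f}")
```

In the training loop this would amount to appending `loss.item()` to a list inside the batch loop and summarizing it once per epoch.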

In [None]:
# Task 12: Evaluate the quality of the model

tokenizer = T5Tokenizer.from_pretrained("tokenizer-flan-t5-small-squad")
model = T5ForConditionalGeneration.from_pretrained("model-flan-t5-small-squad", device_map="auto")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

f1_metric = load_metric("squad")
cumulative_F1 = 0.0
val_dataSize = 0

for inst in squad_dataset["validation"].select(range(1000)):  # Using only 1K samples from the validation set

    data = f"{task_prefix}: context: {inst['context']} question: {inst['question']}"
    token_ids = tokenizer(data, return_tensors="pt", max_length=256, truncation=True, padding='max_length')
    inputs = {key: value.to(device) for key, value in token_ids.items()}

    with torch.no_grad():
        answer_ids = model.generate(**inputs, max_length=256, num_beams=1)

    # Put the predicted text and the reference answers in the format expected by f1_metric.compute
    answer = tokenizer.decode(answer_ids[0], skip_special_tokens=True)
    predictions = [{'prediction_text': answer, 'id': inst['id']}]
    references = [{'answers': {'answer_start': inst['answers']['answer_start'], 'text': inst['answers']['text']}, 'id': inst['id']}]

    f1 = f1_metric.compute(predictions=predictions, references=references)
    cumulative_F1 += f1['f1']
    val_dataSize += 1

print(f"Mean F1 Score in %: {cumulative_F1 / val_dataSize}")



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
