# **IMPLEMENTING SEQ2SEQ MODEL FOR QUESTION ANSWERING USING BART MODEL**

**Team Name :** E47

**Team Members:**

1.   Amit Das
2.   Vibhor Joshi
3.   Shilpa Tichkule
4.   Sanika Nandurkar
5.   Medhavi Nasare

**Project Guide:** Amit Pandey



# **Step 1:** Install Required Libraries
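This cell installs the **datasets** package with pip and then imports the names listed below (transformers and torch are assumed to be already available in the environment):


*   **datasets:** A library for accessing and processing datasets.
*   **transformers:** A library for state-of-the-art natural language processing models.
*   **torch:** The core PyTorch library for tensor computations.
*   **os:** A Python standard-library module for interacting with the operating system (no installation needed).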
*   **BartForConditionalGeneration:** A model class from the transformers library specifically for BART, useful for tasks like text generation.
*   **BartTokenizer:** A tokenizer class to convert text into tokens suitable for BART.
*   **Trainer:** A utility class to simplify training of models.
*   **TrainingArguments:** A class to define parameters for training.



In [None]:
!pip install datasets
import torch
import os
from transformers import BartForConditionalGeneration, BartTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 26.1 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 9.8 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl (

# **Step 2:** Loading the SQuAD (Stanford Question Answering Dataset) dataset and the tokenizer for the BART model


In [None]:
squad_dataset = load_dataset("squad")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]



# **Step 3:** Defining a function to preprocess the data
This step involves:
*   **Combine Questions and Contexts:** Each question is joined with its corresponding context into a single string formatted as "question: <question> context: <context>".
*   **Tokenize Inputs:** The combined inputs are tokenized using the BART tokenizer.

      **1. max_length=512:** Sets the maximum length of the tokenized inputs.

      **2. padding='max_length':** Pads all sequences to the maximum length.

      **3. truncation=True:** Truncates sequences that exceed the maximum length.

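*   **Process Answers:** For each example, the first reference answer is taken (or an empty string if there is none) and tokenized to a fixed length of 128 tokens.
*   **Pad/Truncate Labels:** A final pass guarantees that every label sequence is exactly 128 tokens long, padding with the tokenizer's pad token or truncating as needed. Because the tokenizer call already pads and truncates to 128, this pass acts only as a safeguard.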
*   **Add Labels to Model Inputs:** The processed labels are added under the key "labels" in the model_inputs dictionary.
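
The combine and answer-selection steps can be tried without the tokenizer. The sketch below runs them on a toy batch in the column-oriented layout that a batched `map` call passes to the function. The batch contents and the helper names `build_inputs` and `first_answers` are made up for illustration and are not part of the notebook.

```python
# Toy batch: one list per column, as received by a function used with batched=True.
# These records are invented for illustration; they are not taken from SQuAD.
toy_batch = {
    "question": ["What colour is the sky?", "Who wrote the note?"],
    "context": ["The sky is blue.", "The note was unsigned."],
    "answers": [
        {"text": ["blue"], "answer_start": [11]},
        {"text": [], "answer_start": []},  # example with no reference answer
    ],
}

def build_inputs(batch):
    """Join each question with its context using the Step 3 input format."""
    return [
        "question: " + q + " context: " + c
        for q, c in zip(batch["question"], batch["context"])
    ]

def first_answers(batch):
    """Pick the first reference answer, or an empty string if none exists."""
    return [a["text"][0] if len(a["text"]) > 0 else "" for a in batch["answers"]]

inputs = build_inputs(toy_batch)
targets = first_answers(toy_batch)
print(inputs[0])
print(targets)
```

Only the first reference answer is used as the training target, so any additional reference answers attached to an example are ignored.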




In [None]:
def preprocess_function(data):
    inputs = ["question: " + q + " context: " + c for q, c in zip(data["question"], data["context"])]

    # Tokenize inputs (questions + context) for the batch
    model_inputs = tokenizer(inputs, max_length=512, padding='max_length', truncation=True)

    # Process answers
    labels = []
    for answer in data["answers"]:
        answer_text = answer["text"][0] if len(answer["text"]) > 0 else ""  # Handle empty answers
        # Tokenize the answer
        tokenized_answer = tokenizer(answer_text, max_length=128, padding='max_length', truncation=True)
        labels.append(tokenized_answer["input_ids"])

    # We will pad the labels to the same length (128 tokens in this case)
    labels = [label + [tokenizer.pad_token_id] * (128 - len(label)) if len(label) < 128 else label[:128] for label in labels]

    model_inputs["labels"] = labels

    return model_inputs
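
The function above pads labels with the tokenizer's pad token, so padded positions count toward the training loss. A common alternative, which this notebook does not use, is to replace pad tokens in the labels with -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below shows that replacement on plain lists; `mask_padding` is a hypothetical helper, and the pad id of 1 is an assumption for the demo (real code would read `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(labels, pad_token_id, ignore_index=IGNORE_INDEX):
    """Return a copy of `labels` with every pad token replaced by `ignore_index`."""
    return [
        [ignore_index if token == pad_token_id else token for token in sequence]
        for sequence in labels
    ]

# Toy label ids; 1 stands in for tokenizer.pad_token_id (an assumption for this demo).
toy_labels = [[0, 2440, 2, 1, 1], [0, 2, 1, 1, 1]]
masked = mask_padding(toy_labels, pad_token_id=1)
print(masked)
```

With short answers padded to 128 tokens, most label positions are padding, so masking them may give a loss value that better reflects how well the answer tokens themselves are predicted.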


# **Step 4:** Tokenizing and mapping the Dataset
This step involves:
*   **Tokenizing the Dataset:** This applies the preprocess_function to the squad_dataset, containing questions, contexts, and answers.
*   **Set Dataset Format for PyTorch:** This configures the format of the tokenized_datasets to be compatible with PyTorch.


In [None]:
tokenized_datasets = squad_dataset.map(preprocess_function, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

# Printing the tokenized dataset for verifying the format after preprocessing.

In [None]:
# (just to verify)
print(tokenized_datasets["train"][0])
print(tokenized_datasets["validation"][0])

{'input_ids': tensor([    0, 40018,    35,   598,  2661,   222,     5,  9880,  2708,  2346,
         2082,    11,   504,  4432,    11,   226,  2126, 10067,  1470,   116,
         5377,    35, 29474, 28108,     6,     5,   334,    34,    10,  4019,
         2048,     4,   497,  1517,     5,  4326,  6919,    18,  1637, 31346,
           16,    10,  9030,  9577,     9,     5,  9880,  2708,     4, 29261,
           11,   760,     9,     5,  4326,  6919,     8,  2114,    24,     6,
           16,    10,  7621,  9577,     9,  4845,    19,  3701,    62, 33161,
           19,     5,  7875,    22, 39043,  1459,  1614,  1464, 13292,  4977,
          845,  4130,     7,     5,  4326,  6919,    16,     5, 26429,  2426,
            9,     5, 25095,  6924,     4, 29261,   639,     5, 32394,  2426,
           16,     5,  7461, 26187,     6,    10, 19035,   317,     9,  9621,
            8, 12456,     4,    85,    16,    10, 24633,     9,     5, 11491,
        26187,    23,   226,  2126, 10067,     6, 

# **Step 5:** Defining Function for training the model
The train_model function is designed to train the model using specified training and evaluation datasets. It configures training parameters and initializes the training process.

**Parameters:**
* model: The model to be trained (e.g., BART or any other transformer model).
* train_dataset: The dataset used for training the model.
* eval_dataset: The dataset used for evaluating the model during training.
* epochs: The number of training epochs (default is 1).
* batch_size: The size of the training batches (default is 4).

In [None]:
def train_model(model, train_dataset, eval_dataset, epochs=1, batch_size=4):
    training_args = TrainingArguments(
        output_dir="./outputResult",
        eval_strategy="epoch",
        learning_rate=3e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=2,
        num_train_epochs=epochs,
        logging_dir="./logs",
        logging_steps=10,
        save_steps=500,
        save_total_limit=2,
        report_to="wandb",
        fp16=True,  # Mixed precision: runs most operations in half precision to reduce GPU memory use
        gradient_accumulation_steps=1,  # 1 = update after every batch; values > 1 simulate a larger batch
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )

    # Start training
    trainer.train()
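
To get a feel for what these settings imply, the arithmetic below estimates the number of optimizer updates in one run. It is a rough sketch that assumes a single device and no dropped last batch; `optimizer_steps` is a hypothetical helper, not a Trainer API.

```python
import math

def optimizer_steps(num_examples, per_device_batch_size, accumulation_steps=1,
                    num_devices=1, epochs=1):
    """Estimate optimizer updates for a run (assumes no dropped last batch)."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    return updates_per_epoch * epochs

train_examples = 87599  # size of the SQuAD train split reported in Step 2
steps = optimizer_steps(train_examples, per_device_batch_size=4)
effective_batch = 4 * 1 * 1  # batch size x accumulation steps x devices

print(steps, effective_batch)  # 21900 4
print(steps // 10)             # loss values logged at logging_steps=10 -> 2190
```

Under these assumptions, raising `gradient_accumulation_steps` to 4 would give an effective batch size of 16 and roughly a quarter as many updates per epoch.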


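# **Step 6:** Utilizing Ensemble learning and training the model
Each loop iteration loads a fresh pretrained "facebook/bart-base" model, fine-tunes it with train_model, and appends the fine-tuned instance to the 'models' list. With num_models = 2, the ensemble consists of two separately fine-tuned models.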

In [None]:
models =[]
num_models = 2

for i in range(num_models):
  model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
  train_model(model, tokenized_datasets["train"], tokenized_datasets["validation"])
  models.append(model)


model.safetensors:   0%|          | 0.00/558M [00:00<?, ?B/s]

wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.0208        | 0.020493        |


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 0.0207        | 0.020242        |


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}

# **Step 7:** Defining Functions to Generate answers using contexts for the questions asked.

**'generate_predictions'** takes a model, a question, and a context as parameters and returns the answer the model predicts.

**'ensemble_predict'** calls 'generate_predictions' once for each model, stores the answers in a list, and returns the answer predicted most often (a majority vote).



In [None]:
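def generate_predictions(model, question, context):
    # Note: Step 3 trained on "question: ... context: ..." strings, while this call
    # joins the raw question and context with a single space.
    inputs = tokenizer(question + " " + context, return_tensors='pt', max_length=512, truncation=True).to(device)
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model.generate(**inputs)
    # Decode the first generated sequence back into text
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer.strip()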
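
The training inputs in Step 3 use the format "question: <question> context: <context>", whereas generate_predictions passes the question and context joined by a space. One way to keep the two in sync is a single formatting helper used in both places. The sketch below is only an illustration: `build_qa_input` is a hypothetical name, and whether matching the formats changes the answers has not been tested here.

```python
def build_qa_input(question, context):
    """Format one question/context pair the way Step 3 formats training inputs."""
    return "question: " + question + " context: " + context

prompt = build_qa_input("What is my name?", "My name is Amit Das.")
print(prompt)
```

The returned string could be passed to the tokenizer in place of `question + " " + context`.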

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
def ensemble_predict(question, context):
    predictions = []

    for model in models:
        model.to(device)
        pred = generate_predictions(model, question, context)
        predictions.append(pred)

    final_predictions = max(set(predictions), key=predictions.count)
    return final_predictions
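
The vote in ensemble_predict picks the most frequent answer, but with `max(set(predictions), key=predictions.count)` a tie (for example, two models giving two different answers) is resolved by set iteration order, which is not predictable for strings. A minimal alternative sketch using `collections.Counter` resolves ties in favour of the earliest model; `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent prediction; ties go to the earliest one seen."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    # Counter.most_common orders equal counts by first occurrence (Python 3.7+).
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["Nagpur", "Mumbai", "Nagpur"]))
print(majority_vote(["Nagpur", "Mumbai"]))
```

With only two models, every disagreement is a tie, so this rule effectively returns the first model's answer whenever the two differ.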

In [None]:
context = "Delhi is the capital of India. Nagpur is the capital of maharashtra.My name is Amit Das. We are building Seq2seq Model using BART Model."
questions = ["What is the capital of Maharashtra?", "What is my name?", "What are we building?"]

for question in questions:
  ensemble_ans = ensemble_predict(question, context)
  print(f"Ensemble Prediction : {ensemble_ans}")





Ensemble Prediction : Nagpur
Ensemble Prediction : Amit Das
Ensemble Prediction : Seq2seq Model


# **Step 8:** Saving the model
This step saves the weights (state_dict) of each trained model to disk so that the models can be reused later.

In [None]:
def save_models(models, output_dir):
    os.makedirs(output_dir, exist_ok=True)

    for idx, model in enumerate(models):
        model_path = f"{output_dir}/model_{idx}.pt"
        torch.save(model.state_dict(), model_path)
        print(f"Model {idx} saved to {model_path}")


In [None]:
output_dir = './saved_models'

save_models(models, output_dir)

Model 0 saved to ./saved_models/model_0.pt
Model 1 saved to ./saved_models/model_1.pt
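
Because only the state_dict is saved, reloading a model requires recreating the BART architecture first and then restoring the weights into it. The sketch below shows one way this might be done; `model_paths` and `load_models` are hypothetical helpers, and the file layout is assumed to match what save_models writes.

```python
def model_paths(output_dir, num_models):
    """Rebuild the file names that save_models writes in Step 8."""
    return [f"{output_dir}/model_{idx}.pt" for idx in range(num_models)]

def load_models(output_dir, num_models, checkpoint="facebook/bart-base"):
    """Recreate each BART model and restore its saved weights."""
    # Imported lazily so the path helper above works without these packages.
    import torch
    from transformers import BartForConditionalGeneration

    loaded = []
    for path in model_paths(output_dir, num_models):
        # Build the architecture, then overwrite its weights with the saved ones.
        model = BartForConditionalGeneration.from_pretrained(checkpoint)
        model.load_state_dict(torch.load(path, map_location="cpu"))
        model.eval()  # disable dropout for inference
        loaded.append(model)
    return loaded

print(model_paths("./saved_models", 2))
```

In a fresh session, `models = load_models("./saved_models", 2)` would rebuild the list used by ensemble_predict. The tokenizer is not saved in Step 8, so it would need to be loaded again from "facebook/bart-base".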
