### **ArXiv Assistant Project - Model Evaluation and Batch Inference**

This notebook evaluates the fine-tuned arXiv assistant model through batch inference and automatic metrics. It loads a PEFT (Parameter-Efficient Fine-Tuning) adapter, merges it into its base model, and runs batch inference with the vLLM library for higher throughput.

The model is evaluated on a held-out test split. The notebook computes BLEU, ROUGE, and SacreBLEU scores and saves the generated outputs next to the reference texts, so the results can be inspected both quantitatively and qualitatively. This is the step that assesses how well fine-tuning worked and how the model handles arXiv-related queries.

The notebook includes the following key components:

1. **Setup and Configuration**:
   - Imports necessary libraries including transformers, peft, vllm, and evaluation metrics.
   - Sets up the environment for GPU usage and loads API tokens for Hugging Face.

2. **Model Loading and Merging**:
   - Loads the PEFT model and its corresponding base model.
   - Merges the PEFT model with the base model for inference.

3. **Tokenizer Configuration**:
   - Loads and configures the tokenizer to match the model's requirements.

4. **Dataset Preparation**:
   - Loads the dataset from Hugging Face and holds out 5% of it as the test split.
   - Prepares prompts for inference using a specified template.

5. **Batch Inference**:
   - Uses vLLM for efficient batch inference on the test dataset.

6. **Evaluation Metrics**:
   - Calculates BLEU, ROUGE, and SacreBLEU scores to evaluate model performance.

7. **Result Analysis and Storage**:
   - Generates detailed output comparing model predictions with reference texts.
   - Stores results and evaluation metrics in JSONL format for further analysis.

8. **Model Publishing** (Optional):
   - Includes steps to push the merged model to the Hugging Face Hub.

Together, these steps yield both quantitative metrics and qualitative outputs for evaluating the fine-tuned arXiv assistant model.

Author: Amr Sherif  
Created Date: 2024-06-13  
Updated Date: 2024-06-29  
Version: 2.0

In [1]:
from baseline.helpers import set_css

get_ipython().events.register('pre_run_cell', set_css)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

In [3]:
%cd "DRIVE_PATH"

In [4]:
%ls

In [6]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets
!pip install -q trl
# !pip install -q wandb

In [7]:
from google.colab import userdata
from huggingface_hub import notebook_login, login
from datasets import load_dataset

# os.environ['hf'] = userdata.get('hf')
hfToken = userdata.get('hf')
# notebook_login()
login(hfToken, add_to_git_credential=True, new_session=False)

#### Load the PEFT Model, Tokenizer, and Config, and Merge the Model

In [8]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM
import torch

config = PeftConfig.from_pretrained("amrachraf/arxiv-assistant-mistral7b")

In [9]:
config

In [10]:
!nvidia-smi

In [11]:
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path,
                                                  # device_map=device,
                                                  device_map={"":0},
                                                  trust_remote_code=True,
                                                  # torch_dtype=getattr(torch, "float16"),
                                                  token=hfToken)
model = PeftModel.from_pretrained(base_model, "amrachraf/arxiv-assistant-mistral7b")

In [12]:
model = model.merge_and_unload()
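
To illustrate what merging does, here is a minimal sketch on tiny matrices. It assumes the adapter is a LoRA adapter, in which case merging folds the low-rank update into the base weight as W' = W + (alpha / r) * B @ A. All values and helper names below are hypothetical and chosen only for the arithmetic; this is not the PEFT implementation.

```python
# Illustrative LoRA merge on tiny matrices (pure Python, hypothetical values).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

# Frozen base weight W (2x2) and low-rank factors B (2x1), A (1x2), so rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [-0.5]]
A = [[2.0, 4.0]]
lora_alpha, r = 2, 1
scaling = lora_alpha / r

# Merged weight: W' = W + (alpha / r) * B @ A
W_merged = add(W, scale(matmul(B, A), scaling))

x = [[3.0], [1.0]]  # a column-vector input

# Adapter form: base path plus the scaled low-rank path.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))
# Merged form: a single matrix multiply with the folded weight.
y_merged = matmul(W_merged, x)

print(W_merged)
print(y_adapter == y_merged)
```

Both forms give the same output, which is why the merged model can be saved and served as an ordinary checkpoint without the adapter layers.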

In [13]:
from transformers import AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path,
                                          use_fast=True,
                                          token=hfToken,
                                          trust_remote_code=True,
                                          # add_eos_token=True
                                          )

In [18]:
tokenizer.pad_token_id

In [19]:
model.config

In [16]:
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.unk_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

In [17]:
model.config.pad_token_id = tokenizer.pad_token_id

#### Push the Merged Model to the Hub

In [None]:
new_model_name = "amrachraf/arxiv-assistant-merged_peft_model"

from huggingface_hub import create_repo
create_repo(new_model_name, exist_ok=True, private=False)

In [23]:
model.push_to_hub(new_model_name)
tokenizer.push_to_hub(new_model_name)

#### Load the Dataset and Prepare the Data for Inference

In [None]:
data = load_dataset("amrachraf/arXiv-full-text-synthetic-instruct-tune", split="train")

# small dataset for testing
test = False
if test:
    data = data.select(range(21))

# Explore the data
df = data.to_pandas()
df.head(10)

In [None]:
system_instruction = """You are an arXiv assistant, your name is Marvin. You provide detailed, comprehensive and helpful responses to any request,
especially requests related to scientific papers published on arXiv. Structure your responses and reply in a clear scientific manner.
Greet the user at the start of the first message of the conversation only, and ask the user whether your response was clear and sufficient and whether they need any other help.
As an arXiv assistant, your task is to generate an appropriate response based on the conversation and context given.
The tone of your answer should be warm, kind and friendly."""

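# Note: system_instruction is accepted by the prompt builders below but is not
# inserted into the prompts; the <<SYS>> variant is kept only for reference.
# <<SYS>> {system_instruction} <</SYS>>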

def generate_prompt(example, system_instruction):
    return f"""<s>[INST] {example['instruction']}
{example['input']} [/INST]
{example['output']} </s>""".strip()

def generate_test_prompt(example, system_instruction):
    return f"""<s>[INST] {example['instruction']}
{example['input']} [/INST]""".strip()

# def generate_test_prompt(example, system_instruction):
#     return f"""<s>[INST]<<SYS>> {system_instruction} <</SYS>>
#     {example['instruction']}
#     {example['input']} [/INST]""".strip()

data = data.train_test_split(test_size=0.05, seed=3, shuffle=True)
train_data = data["train"]
test_data = data["test"]

print(train_data.shape, test_data.shape)  # a bare expression mid-cell would not be displayed

prompt_column_train = [generate_prompt(example, system_instruction) for example in train_data]
train_data = train_data.add_column("prompt", prompt_column_train)
# train_data = train_data.add_column("text", prompt_column_train)
# train_data = train_data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

prompt_column_test = [generate_test_prompt(example, system_instruction) for example in test_data]
test_data = test_data.add_column("prompt", prompt_column_test)
# test_data = test_data.add_column("text", prompt_column_test)
# test_data = test_data.map(lambda samples: tokenizer(samples["prompt"]), batched=True)

test_data, train_data
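
As a quick illustration of the test-time template, the sketch below renders one prompt. The record is hypothetical and only mirrors the `instruction`, `input`, and `output` fields used above; the function is a simplified copy of `generate_test_prompt` without the unused `system_instruction` argument.

```python
def generate_test_prompt(example):
    # Same layout as the notebook's test template: the prompt stops at [/INST]
    # so that the model has to produce the answer itself.
    return f"""<s>[INST] {example['instruction']}
{example['input']} [/INST]""".strip()

# Hypothetical record, for illustration only.
example = {
    "instruction": "Summarize the main contribution of the paper.",
    "input": "Title: An Example Paper",
    "output": "A reference answer would go here.",
}

prompt = generate_test_prompt(example)
print(prompt)
```

The reference answer never appears in the test prompt; it is only used later as the target for the metrics.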

In [None]:
print(test_data[1]['prompt'])

#### Inference - Merged Model

In [None]:
def extract_response(decoded_output):
    response = decoded_output.split("[INST]")[-1]  # Get the last segment
    response = response.split("[/INST]")[1]  # Get the part after the last instruction token
    response = response.strip()  # Remove any leading/trailing whitespace
    response = response.replace("</s>", "").strip()
    return response
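
The sketch below shows what `extract_response` returns for a made-up decoded string. The function is repeated so the example runs on its own. `extract_response_safe` is a hypothetical variant, not part of the notebook, that does not raise an `IndexError` when the `[/INST]` marker is missing.

```python
def extract_response(decoded_output):
    response = decoded_output.split("[INST]")[-1]   # text after the last [INST]
    response = response.split("[/INST]")[1]         # text after the closing [/INST]
    return response.replace("</s>", "").strip()

def extract_response_safe(decoded_output):
    # Hypothetical variant: rpartition yields the whole string as the tail
    # when "[/INST]" does not occur, so no IndexError is raised.
    _, _, tail = decoded_output.rpartition("[/INST]")
    return tail.replace("</s>", "").strip()

# Made-up decoded output, for illustration only.
decoded_example = "<s> [INST] What is arXiv?\n [/INST] arXiv is an open-access preprint repository.</s>"

print(extract_response(decoded_example))
print(extract_response_safe("text without any markers"))
```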

In [None]:
device = "cuda:0"

encodeds = tokenizer(test_data[0]['prompt'], return_tensors="pt",
                     add_special_tokens=False)

model_inputs = encodeds.to(device)

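# Sampling settings for a single-example sanity check; the config object is
# passed to generate() instead of being created and discarded.
generation_config = GenerationConfig(
    max_new_tokens=1000,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
)

generated_ids = model.generate(**model_inputs, generation_config=generation_config)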

decoded = tokenizer.batch_decode(generated_ids)

print(extract_response(decoded[0]))

In [None]:
print(decoded[0])

In [None]:
print(test_data[0]['output'])

#### Batch Inference - vLLM

In [None]:
!pip install vllm

In [None]:
model_dir = "./models/arxiv-merged_peft_model/"

In [None]:
# Save the merged model and tokenizer locally
model.save_pretrained(model_dir, safe_serialization=True)
tokenizer.save_pretrained(model_dir)

In [None]:
from vllm import LLM, SamplingParams

llm = LLM(model=model_dir)

In [None]:
llm.get_tokenizer()

In [None]:
sampling_params = SamplingParams(
                  max_tokens=1000,
                  min_tokens=1,
                  temperature=0.9,
                  top_k=50,
                  top_p=0.95,
                  # dtype="auto"
                  skip_special_tokens=True
                  )

In [None]:
test_data['output'][9]

In [None]:
outputs = llm.generate(test_data['prompt'], sampling_params) # Batch inference

In [None]:
# Read the reference column once instead of on every iteration
for output, reference in zip(outputs, test_data['output']):
    print(f"Generated text: {output.outputs[0].text}\n\nReference text: {reference}\n")

In [None]:
# Keep only the first completion of each request
generated_output = [output.outputs[0].text for output in outputs]

In [None]:
generated_output[9]

#### Evaluation

In [None]:
!pip install evaluate
!pip install rouge_score
!pip install sacrebleu

In [None]:
import evaluate

predictions = generated_output
references = test_data['output']

bleu = evaluate.load("bleu")
bleu_results = bleu.compute(predictions=predictions, references=references)
print(bleu_results)

In [None]:
rouge = evaluate.load("rouge")
rouge_results = rouge.compute(predictions=predictions, references=references)
print(rouge_results)

In [None]:
sacrebleu = evaluate.load("sacrebleu")
sacrebleu_results = sacrebleu.compute(predictions=predictions, references=references)
print(sacrebleu_results)
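
To give some intuition for these scores: BLEU-style metrics are built on n-gram precision (how much of the prediction appears in the reference), while ROUGE-style metrics emphasize recall (how much of the reference is covered). The sketch below computes both for unigrams on whitespace tokens. It is a deliberately simplified illustration with a hypothetical helper name; the library implementations use their own tokenization, higher-order n-grams, and aggregation, so the numbers will not match theirs.

```python
from collections import Counter

def unigram_overlap(prediction, reference):
    """Return (clipped unigram precision, unigram recall) on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token, which clips
    # repeated words in the prediction to their count in the reference.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall

# Made-up sentences, for illustration only.
precision, recall = unigram_overlap(
    "the model answers the question",
    "the model answers every question clearly",
)
print(precision, recall)
```

Here four of the five predicted tokens are matched after clipping the repeated "the", and four of the six reference tokens are covered.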

In [None]:
test_results_jsonl = []

for i, output in enumerate(outputs):
    generated_text = output.outputs[0].text
    reference_text = test_data['output'][i]
    prompt = test_data['prompt'][i]

    test_results_jsonl.append({"prompt": prompt, "generated_text": generated_text, "reference_text": reference_text})

In [None]:
import json

with open("test-data-predictions-ref.jsonl", "w") as f:
    for item in test_results_jsonl:
        f.write(json.dumps(item) + "\n")

In [None]:
results_file = "test-data-predictions-ref.jsonl"

with open(results_file, "r") as f:
    # Use a separate name so the `data` dataset dict is not overwritten
    saved_predictions = [json.loads(line) for line in f]

In [None]:
saved_predictions[0]

In [None]:
eval_results_jsonl = []

eval_results_jsonl.append(bleu_results)
eval_results_jsonl.append(rouge_results)
eval_results_jsonl.append(sacrebleu_results)

In [None]:
with open("eval_results.jsonl", "w") as f:
    for item in eval_results_jsonl:
        f.write(json.dumps(item) + "\n")

In [None]:
results_file = "eval_results.jsonl"

with open(results_file, "r") as f:
    evaluation_res = [json.loads(line) for line in f]

In [None]:
evaluation_res
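
The three metric dictionaries are written to `eval_results.jsonl` without any label, so a reader of the file has to infer which line belongs to which metric. One possible refinement, sketched below, adds a `metric` field to each record. The metric values are placeholders rather than real results, and the file name and variable names are hypothetical.

```python
import json
import os
import tempfile

# Placeholder dictionaries standing in for bleu_results, rouge_results and
# sacrebleu_results; the numbers are NOT real evaluation results.
metric_results = {
    "bleu": {"bleu": 0.0},
    "rouge": {"rouge1": 0.0, "rougeL": 0.0},
    "sacrebleu": {"score": 0.0},
}

# One JSON object per line, each tagged with the metric it came from.
records = [{"metric": name, **values} for name, values in metric_results.items()]

path = os.path.join(tempfile.mkdtemp(), "eval_results_labeled.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Round trip: read the file back and list the metric names.
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print([record["metric"] for record in loaded])
```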