## Imports, etc.

In [4]:
!pip install transformers datasets accelerate wandb peft

Collecting datasets
  Using cached datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Using cached dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Using cached xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Using cached multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
   480.6/480.6 kB 6.9 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   116.3/116.3 kB 10.0 MB/s eta 0:00:00
[?25hDownloading fsspec-2024.9.0-py3-none-any.w

In [5]:
from google.colab import drive
import pandas as pd
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
import torch

# Mount Google Drive
drive.mount('/content/drive')

# Load the dataset
data_path = '/content/drive/My Drive/CS394/medquad.csv'
df = pd.read_csv(data_path)
df = df.dropna(subset=['question', 'answer']).reset_index(drop=True)

# Convert pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['question', 'answer']])

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Base "meta-llama/Llama-3.2-1B"

### Q&A

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Define the input question
question = "What is glaucoma?"

# Tokenize the input question
inputs = tokenizer(question, return_tensors="pt")

# Generate a response
outputs = model.generate(**inputs, max_length=100, num_return_sequences=1)

# Decode and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


What is glaucoma? What are the symptoms of glaucoma? How is glaucoma diagnosed? What are the treatments for glaucoma?
Glaucoma is a group of eye diseases that damage the optic nerve and cause gradual loss of vision. Glaucoma can affect both eyes and is the leading cause of blindness in the United States. The eye pressure is too high, causing damage to the optic nerve.
The optic nerve carries the signals from the eye to the brain, and


In [14]:
# Define the input question
question = "What is chatGPT"

# Tokenize the input question
inputs = tokenizer(question, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_length=512,
    num_return_sequences=1,
    # temperature=0.7  # Adjust for more creative or deterministic responses
)


# Decode and print the response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


What is chatGPT?
ChatGPT is a large language model (LLM) that is trained on a massive amount of text data. It can generate human-like text based on the input it receives. ChatGPT is a powerful tool that can be used to generate text, provide advice, and perform various tasks.
ChatGPT is a large language model (LLM) that is trained on a massive amount of text data. It can generate human-like text based on the input it receives. ChatGPT is a powerful tool that can be used to generate text, provide advice, and perform various tasks.
ChatGPT is a large language model (LLM) that is trained on a massive amount of text data. It can generate human-like text based on the input it receives. ChatGPT is a powerful tool that can be used to generate text, provide advice, and perform various tasks.
ChatGPT is a large language model (LLM) that is trained on a massive amount of text data. It can generate human-like text based on the input it receives. ChatGPT is a powerful tool that can be used to gener
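The output above repeats the same paragraph until it reaches `max_length`. One common mitigation is n-gram blocking, which `generate` exposes through its `no_repeat_ngram_size` argument. The block below is a minimal pure-Python sketch of the idea, not the library's implementation; `banned_next_tokens` is a hypothetical helper and the token IDs are made up.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = [5, 6, 7, 8, 5, 6]   # made-up token IDs; "5 6" has been seen before
print(banned_next_tokens(history, 3))
```

Here the sequence already contains `5 6 7`, so after the second `5 6` the token `7` is blocked and the decoder has to pick something else.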

### Metrics

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding
import pandas as pd
from datasets import Dataset
from tqdm import tqdm
import math

# Clear GPU cache
torch.cuda.empty_cache()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
data_path = '/content/drive/My Drive/CS394/medquad.csv'
df = pd.read_csv(data_path)

# Drop rows with missing or empty 'question' or 'answer'
df = df.dropna(subset=['question', 'answer']).reset_index(drop=True)
df = df[(df['question'].str.strip() != '') & (df['answer'].str.strip() != '')]

# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['question', 'answer']])

# Perform 80/20 train-test split
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
test_dataset = split_dataset['test']

# Load the baseline tokenizer and model
model_name = "meta-llama/Llama-3.2-1B"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add a padding token if not already present
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Load the baseline model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # Use FP16 to save memory
).to(device)

# Resize token embeddings if new tokens were added
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# Set the model to evaluation mode
model.eval()

# Prepare the evaluation data collator
eval_data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding='longest',
    return_tensors='pt',
)

# Preprocess the test dataset for evaluation
def preprocess_function_eval(examples):
    inputs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]
    tokenized_inputs = tokenizer(
        inputs,
        padding=False,
        truncation=True,   # Enable truncation
        max_length=512,    # Set max_length to 512 (adjust as needed)
    )
    return tokenized_inputs

tokenized_test = test_dataset.map(
    preprocess_function_eval,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=1,  # Set num_proc=1 to reduce CPU memory usage
)

# Create the evaluation DataLoader
eval_dataloader = DataLoader(
    tokenized_test,
    batch_size=1,  # Reduce batch size to 1
    collate_fn=eval_data_collator,
)

# Compute perplexity and next-token accuracy
correct_predictions = 0
total_predictions = 0
total_loss = 0
total_tokens = 0
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100, reduction='sum')

for batch in tqdm(eval_dataloader, desc="Evaluating"):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)

    # Create labels for next-token prediction
    labels = input_ids.clone()
    labels[input_ids == tokenizer.pad_token_id] = -100  # Ignore padding tokens

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits  # Shape: [batch_size, seq_length, vocab_size]

    # Shift logits and labels to align for next-token prediction
    shift_logits = logits[:, :-1, :].contiguous()  # [batch_size, seq_length - 1, vocab_size]
    shift_labels = labels[:, 1:].contiguous()      # [batch_size, seq_length - 1]

    # Compute loss
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    total_loss += loss.item()

    # Count total tokens (excluding padding and ignored tokens)
    total_tokens += (shift_labels != -100).sum().item()

    # Predicted tokens
    predicted_tokens = torch.argmax(shift_logits, dim=-1)  # [batch_size, seq_length - 1]

    # Mask to ignore padding tokens
    mask = shift_labels != -100  # [batch_size, seq_length - 1]

    # Correct predictions
    correct = (predicted_tokens == shift_labels) & mask

    correct_predictions += correct.sum().item()
    total_predictions += mask.sum().item()

# Calculate perplexity
average_loss = total_loss / total_tokens
perplexity = math.exp(average_loss)
print(f"Perplexity: {perplexity:.4f}")

# Next-token accuracy
next_token_accuracy = correct_predictions / total_predictions
print(f"Next-token Accuracy: {next_token_accuracy:.4f}")


Evaluating: 100%|██████████| 3282/3282 [01:19<00:00, 41.32it/s]

Perplexity: 6.1104
Next-token Accuracy: 0.5907
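Both numbers come from per-token quantities: perplexity is the exponential of the mean cross-entropy per predicted token, and next-token accuracy is the share of positions where the argmax matches the true token. A toy sketch in plain Python, with made-up probabilities rather than model outputs (`perplexity_and_accuracy` is a hypothetical helper):

```python
import math

def perplexity_and_accuracy(target_probs, is_top1):
    """target_probs[i]: probability the model assigned to the true next token.
    is_top1[i]: whether that true token was also the argmax prediction."""
    nll_sum = -sum(math.log(p) for p in target_probs)   # summed cross-entropy
    avg_loss = nll_sum / len(target_probs)              # mean loss per token
    accuracy = sum(is_top1) / len(is_top1)
    return math.exp(avg_loss), accuracy

ppl, acc = perplexity_and_accuracy(
    [0.5, 0.25, 0.5, 0.25],          # made-up probabilities
    [True, False, True, False],
)
print(f"Perplexity: {ppl:.4f}, accuracy: {acc:.4f}")
```

A perplexity of about 6 can be read loosely as the model being as uncertain as if it were choosing uniformly among six tokens at each step.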





## Unfreeze Last Layer

### Fine Tune

In [9]:
# Install necessary libraries
!pip install transformers datasets evaluate torch

# Import libraries
import torch
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
import os

# Set environment variable for debugging CUDA (optional)
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Check for GPU availability
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU")

# Load tokenizer and model
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Add a padding token if not already present
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Load model onto GPU with BF16
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Use BF16
    use_cache=False
).to(device)

model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last transformer block
for param in model.model.layers[-1].parameters():
    param.requires_grad = True

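# Unfreeze lm_head. As the printed parameter list shows, lm_head shares its weight
# with model.embed_tokens here, so this also makes the embedding matrix trainable.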
for param in model.lm_head.parameters():
    param.requires_grad = True

# Verify trainable parameters
trainable_params = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"Trainable parameters: {trainable_params}")

# Load dataset
data_path = '/content/drive/My Drive/CS394/medquad.csv'
df = pd.read_csv(data_path)

# Drop rows with missing or empty 'question' or 'answer'
df = df.dropna(subset=['question', 'answer']).reset_index(drop=True)
df = df[(df['question'].str.strip() != '') & (df['answer'].str.strip() != '')]

# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['question', 'answer']])

# Perform 80/20 train-test split
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = split_dataset['train'], split_dataset['test']

print(f"Training dataset size: {len(train_dataset)} rows")
print(f"Test dataset size: {len(test_dataset)} rows")

# Preprocess function to tokenize and split sequences
def preprocess_function(examples):
    # Concatenate question and answer
    inputs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]

    # Tokenize without truncation
    tokenized_inputs = tokenizer(
        inputs,
        padding=False,
        truncation=False,
    )

    # Initialize lists for inputs and labels
    input_ids_list = []
    attention_mask_list = []
    labels_list = []

    for input_ids in tokenized_inputs["input_ids"]:
        # Split input_ids into chunks of max_seq_length
        for i in range(0, len(input_ids), max_seq_length):
            chunk = input_ids[i:i + max_seq_length]
            input_ids_list.append(chunk)
            attention_mask_list.append([1] * len(chunk))
            labels_list.append(chunk.copy())

    return {
        "input_ids": input_ids_list,
        "attention_mask": attention_mask_list,
        "labels": labels_list,
    }

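# Get the model's maximum sequence length, used as the chunk size in preprocess_function.
# It is 131072 for this checkpoint, so in practice no example gets split.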
max_seq_length = model.config.max_position_embeddings
print(f"Model max sequence length: {max_seq_length}")

# Apply the preprocess function to the datasets
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)
tokenized_test = test_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

# Data collator with dynamic padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    fp16=False,  # Disable FP16
    bf16=True,   # Enable BF16
    gradient_checkpointing=True,
    optim="adamw_torch",
    logging_steps=50,
    save_steps=0,
    evaluation_strategy="no",
    save_strategy="no",
    report_to="none",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./fine_tuned_one_layer_llama")


Using GPU: NVIDIA A100-SXM4-40GB
Trainable parameters: ['model.embed_tokens.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.15.post_attention_layernorm.weight']
Training dataset size: 13125 rows
Test dataset size: 3282 rows
Model max sequence length: 131072
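The printed list contains `model.embed_tokens.weight` but no `lm_head.weight`, even though the code only unfreezes the last block and `lm_head`. That is the pattern one would expect when the output head shares its weight tensor with the input embedding (weight tying): setting `requires_grad` on the head also sets it on the embedding, and the shared tensor is reported once under its first name. The toy below illustrates this in plain Python; `Param` and `named_parameters` are stand-ins written for this example, not PyTorch objects.

```python
class Param:
    """Stand-in for a weight tensor with a requires_grad flag."""
    def __init__(self):
        self.requires_grad = False

def named_parameters(modules):
    """Yield (name, param) pairs, reporting each shared object only once."""
    seen = set()
    for name, param in modules:
        if id(param) not in seen:
            seen.add(id(param))
            yield name, param

shared = Param()                      # one weight matrix playing two roles
modules = [
    ("model.embed_tokens.weight", shared),
    ("model.layers.15.mlp.up_proj.weight", Param()),
    ("lm_head.weight", shared),       # tied: the same object as the embedding
]

modules[2][1].requires_grad = True    # "unfreeze lm_head"
trainable = [name for name, p in named_parameters(modules) if p.requires_grad]
print(trainable)
```

The practical consequence is that this run trains the full embedding matrix in addition to the last transformer block, which is a larger set of weights than the section title suggests.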



| Step | Training Loss |
|------|---------------|
| 50   | 1.4263 |
| 100  | 1.3459 |
| 150  | 1.2986 |
| 200  | 1.2713 |
| 250  | 1.1899 |
| 300  | 1.2523 |
| 350  | 1.219  |
| 400  | 1.2693 |
| 450  | 1.1664 |
| 500  | 1.1803 |
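The split sizes and the rough number of optimizer steps follow from a little arithmetic. This sketch assumes 16,407 usable rows (the sum of the two reported split sizes), a test split that is rounded up, and no example long enough to be chunked. The step count is approximate, since the exact rounding inside `Trainer` may differ between versions.

```python
import math

total_rows = 13125 + 3282            # rows reported for the train and test splits
test_rows = math.ceil(0.2 * total_rows)
train_rows = total_rows - test_rows

per_device_batch = 1                 # per_device_train_batch_size
grad_accum = 8                       # gradient_accumulation_steps
epochs = 3                           # num_train_epochs

effective_batch = per_device_batch * grad_accum
steps_per_epoch = train_rows // effective_batch   # approximate optimizer steps
total_steps = steps_per_epoch * epochs

print(train_rows, test_rows, effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions a full three-epoch run is on the order of 4,900 optimizer steps, so the loss values shown here, which stop at step 500, would cover roughly the first tenth of such a run.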


### Q&A

In [11]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model
model_path = "./fine_tuned_one_layer_llama"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# Prepare input prompt
question = "What are the symptoms of diabetes?"
prompt = f"Question: {question}\nAnswer:"

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {key: value.to(device) for key, value in inputs.items()}

# Generate answer
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_length=inputs['input_ids'].shape[1] + 100,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )

# Decode and display answer
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
answer = generated_text[len(prompt):].strip()

print("Question:", question)
print("Answer:", answer)


Question: What are the symptoms of diabetes?
Answer: The symptoms of diabetes are different for each person. In the early stages of the disease, the person may not have any symptoms. As the disease gets worse, people may have increased thirst, frequent urination, and fatigue. Often, people with diabetes have problems with their eyes, feet, and kidneys. In the end, people with diabetes may have heart disease or stroke. Many people with diabetes have problems with their nerves, which can lead to amputations of the limbs.  In the early
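The generation call above samples with `top_k=50` and `top_p=0.95`. As a rough picture of what those two filters do, here is a pure-Python sketch over a made-up four-token distribution. `top_k_top_p_filter` is a hypothetical helper; the library works on logits and handles edge cases differently, so treat this only as an illustration of the idea.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p. Returns a renormalised dict."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}

# Made-up next-token distribution.
probs = {"thirst": 0.5, "fatigue": 0.3, "nausea": 0.15, "banana": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.75)
print(filtered)
```

Sampling then draws from the renormalised survivors, which is why repeated runs of the cell can give different answers.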


### Metrics

In [10]:
import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding, AutoTokenizer, AutoModelForCausalLM
import numpy as np
from tqdm import tqdm
import math
import pandas as pd
from datasets import Dataset

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load dataset
data_path = '/content/drive/My Drive/CS394/medquad.csv'
df = pd.read_csv(data_path)

# Drop rows with missing or empty 'question' or 'answer'
df = df.dropna(subset=['question', 'answer']).reset_index(drop=True)
df = df[(df['question'].str.strip() != '') & (df['answer'].str.strip() != '')]

# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df[['question', 'answer']])

# Perform 80/20 train-test split
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = split_dataset['train'], split_dataset['test']

# Load the fine-tuned model
model = AutoModelForCausalLM.from_pretrained(
    "./fine_tuned_one_layer_llama",
    torch_dtype=torch.bfloat16,
).to(device)
model.eval()

# Ensure the tokenizer is loaded from the same directory to include any special tokens
tokenizer = AutoTokenizer.from_pretrained("./fine_tuned_one_layer_llama")

# Prepare the evaluation data collator
eval_data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding='longest',
    return_tensors='pt',
)

# Preprocess the test dataset for evaluation
def preprocess_function_eval(examples):
    inputs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]
    tokenized_inputs = tokenizer(
        inputs,
        padding=False,
        truncation=False,
    )
    return tokenized_inputs

tokenized_test = test_dataset.map(
    preprocess_function_eval,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

# Create the evaluation DataLoader
eval_dataloader = DataLoader(
    tokenized_test,
    batch_size=8,  # Adjust based on your GPU memory
    collate_fn=eval_data_collator,
)

# Compute perplexity and next-token accuracy
correct_predictions = 0
total_predictions = 0
total_loss = 0
total_tokens = 0
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100, reduction='sum')

for batch in tqdm(eval_dataloader, desc="Evaluating"):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)

    # Create labels for next-token prediction
    labels = input_ids.clone()
    labels[input_ids == tokenizer.pad_token_id] = -100  # Ignore padding tokens

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits  # Shape: [batch_size, seq_length, vocab_size]

    # Shift logits and labels to align for next-token prediction
    shift_logits = logits[:, :-1, :].contiguous()  # [batch_size, seq_length - 1, vocab_size]
    shift_labels = labels[:, 1:].contiguous()      # [batch_size, seq_length - 1]
    shift_attention_mask = attention_mask[:, 1:].contiguous()  # [batch_size, seq_length - 1]

    # Compute loss
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    total_loss += loss.item()

    # Count total tokens (excluding padding and ignored tokens)
    total_tokens += (shift_labels != -100).sum().item()

    # Predicted tokens
    predicted_tokens = torch.argmax(shift_logits, dim=-1)  # [batch_size, seq_length - 1]

    # Mask to ignore padding tokens
    mask = shift_labels != -100  # [batch_size, seq_length - 1]

    # Correct predictions
    correct = (predicted_tokens == shift_labels) & mask

    correct_predictions += correct.sum().item()
    total_predictions += mask.sum().item()

# Calculate perplexity
average_loss = total_loss / total_tokens
perplexity = math.exp(average_loss)
print(f"Perplexity: {perplexity:.4f}")

# Next-token accuracy
next_token_accuracy = correct_predictions / total_predictions
print(f"Next-token Accuracy: {next_token_accuracy:.4f}")


Evaluating: 100%|██████████| 411/411 [01:12<00:00,  5.70it/s]

Perplexity: 2.9599
Next-token Accuracy: 0.7329
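With `batch_size=8`, shorter sequences are padded to the longest one in the batch, so the evaluation loop has to shift labels by one position and skip padded targets. A toy version of that bookkeeping in plain Python, assuming right padding and a made-up pad ID of 0 (`shift_and_mask` is a hypothetical helper):

```python
IGNORE = -100   # the same sentinel the loss function is told to ignore

def shift_and_mask(input_ids, pad_id):
    """Pair each position with the token that follows it, skipping padded targets."""
    labels = [IGNORE if t == pad_id else t for t in input_ids]
    contexts = input_ids[:-1]      # positions that make a prediction
    targets = labels[1:]           # the token each position should predict
    return [(c, t) for c, t in zip(contexts, targets) if t != IGNORE]

PAD = 0
row = [11, 12, 13, PAD, PAD]       # a 3-token example padded to length 5
pairs = shift_and_mask(row, PAD)
print(pairs)
```

Only the surviving pairs contribute to `total_loss`, `total_tokens`, and the accuracy counts, so padding does not dilute the metrics.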





# PEFT

## Fine Tune & Metrics


In [1]:
# Install necessary libraries
!pip install transformers datasets evaluate torch peft

import torch
import pandas as pd
import os
import math
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
)
from peft import LoraConfig, get_peft_model, PeftModel, TaskType
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Base model name
model_name = "meta-llama/Llama-3.2-1B"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Load model onto GPU with BF16
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)

base_model.resize_token_embeddings(len(tokenizer))
base_model.config.pad_token_id = tokenizer.pad_token_id

# Set up LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

# Wrap the model with PEFT (LoRA)
model = get_peft_model(base_model, lora_config)
print("Model wrapped with PEFT.")

# Verify trainable parameters
trainable_params = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"Trainable parameters: {trainable_params}")

# Load dataset from Drive
data_path = '/content/drive/My Drive/CS394/medquad.csv'
df = pd.read_csv(data_path)

# Drop rows with missing or empty question/answer
df = df.dropna(subset=['question', 'answer']).reset_index(drop=True)
df = df[(df['question'].str.strip() != '') & (df['answer'].str.strip() != '')]

# Convert DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df[['question', 'answer']])

# Perform 80/20 train-test split
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = split_dataset['train'], split_dataset['test']

print(f"Training dataset size: {len(train_dataset)} rows")
print(f"Test dataset size: {len(test_dataset)} rows")

# Read the model's maximum sequence length once, outside the mapped function,
# so it is not looked up and printed again for every batch
max_seq_length = model.config.max_position_embeddings
print(f"Model max sequence length: {max_seq_length}")

# Preprocessing function
def preprocess_function(examples):
    # Concatenate question and answer
    inputs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]
    # Tokenize
    tokenized_inputs = tokenizer(
        inputs,
        padding=False,
        truncation=False,
    )

    input_ids_list = []
    attention_mask_list = []
    labels_list = []

    for input_ids in tokenized_inputs["input_ids"]:
        # Split input_ids into chunks of max_seq_length
        for i in range(0, len(input_ids), max_seq_length):
            chunk = input_ids[i : i + max_seq_length]
            input_ids_list.append(chunk)
            attention_mask_list.append([1] * len(chunk))
            labels_list.append(chunk.copy())

    return {
        "input_ids": input_ids_list,
        "attention_mask": attention_mask_list,
        "labels": labels_list,
    }

# Tokenize train and test sets
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

tokenized_test = test_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

# Data collator with dynamic padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/CS394/fine_tuned_peft_llama_results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    fp16=False,
    bf16=True,
    gradient_checkpointing=False,
    optim="adamw_torch",
    logging_steps=50,
    save_steps=0,
    evaluation_strategy="no",
    save_strategy="no",
    report_to="none",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Directory to save the final model and tokenizer
final_model_dir = "/content/drive/My Drive/CS394/fine_tuned_peft_llama"
os.makedirs(final_model_dir, exist_ok=True)

# Save adapter and tokenizer. The model is PEFT-wrapped, so both calls below write the
# adapter files rather than a standalone copy of the base model; the second is redundant.
trainer.save_model(final_model_dir)
model.save_pretrained(final_model_dir)
tokenizer.save_pretrained(final_model_dir)

print("Model and tokenizer saved successfully to Google Drive.")

#########################################
# After saving the model, let's benchmark
#########################################

# Load the tokenizer and base model again for evaluation
tokenizer = AutoTokenizer.from_pretrained(final_model_dir)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)
base_model.resize_token_embeddings(len(tokenizer))
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load the PEFT model from the saved directory
model = PeftModel.from_pretrained(base_model, final_model_dir).to(device)
model.eval()

# Prepare data collator for evaluation
eval_data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding='longest',
    return_tensors='pt',
)

# Preprocess function for evaluation
def preprocess_function_eval(examples):
    inputs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]
    tokenized_inputs = tokenizer(inputs, padding=False, truncation=False)
    return tokenized_inputs

tokenized_test_for_eval = test_dataset.map(
    preprocess_function_eval,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

eval_dataloader = DataLoader(
    tokenized_test_for_eval,
    batch_size=1,  # Use smaller batch size if memory is an issue
    collate_fn=eval_data_collator,
)

# Compute perplexity and next-token accuracy
correct_predictions = 0
total_predictions = 0
total_loss = 0
total_tokens = 0
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100, reduction='sum')

for batch in tqdm(eval_dataloader, desc="Evaluating"):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)

    labels = input_ids.clone()
    labels[input_ids == tokenizer.pad_token_id] = -100  # Ignore padding tokens

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits  # [batch_size, seq_length, vocab_size]

    # Shift logits and labels for next-token prediction
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    total_loss += loss.item()

    # Count total tokens (excluding padding)
    valid_tokens_mask = (shift_labels != -100)
    total_tokens += valid_tokens_mask.sum().item()

    # Predicted tokens
    predicted_tokens = torch.argmax(shift_logits, dim=-1)

    # Correct predictions (only where label != -100)
    correct = (predicted_tokens == shift_labels) & valid_tokens_mask
    correct_predictions += correct.sum().item()
    total_predictions += valid_tokens_mask.sum().item()

# Calculate perplexity
average_loss = total_loss / total_tokens
perplexity = math.exp(average_loss)
print(f"Perplexity: {perplexity:.4f}")

# Next-token accuracy
next_token_accuracy = correct_predictions / total_predictions
print(f"Next-token Accuracy: {next_token_accuracy:.4f}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Using device: cuda


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Model wrapped with PEFT.
Trainable parameters: ['base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.1.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.1.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.1.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.1.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.2.self_attn.q_proj.lora_A.default.weight', 'base_model.model.model.layers.2.self_attn.q_proj.lora_B.default.weight', 'base_model.model.model.layers.2.self_attn.v_proj.lora_A.default.weight', 'base_model.model.model.layers.2.self_attn.v_proj.lora_B.default.weight', 'base_model.model.model.layers.3.self_attn.q_proj.lora_A.default
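The list above contains only `lora_A` and `lora_B` matrices on `q_proj` and `v_proj`. Each adapted linear layer of shape `out x in` gains `r * (in + out)` trainable weights, and the update is scaled by `lora_alpha / r`. The sketch below counts them; the projection sizes (2048 for the hidden and query dimension, 512 for the value projection) are assumptions about this checkpoint that the notebook does not state, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, r):
    """A is r x in_features and B is out_features x r, so r * (in + out) weights."""
    return r * (in_features + out_features)

# Assumed projection shapes for this checkpoint (not stated in the notebook).
hidden = 2048
kv_dim = 512
num_layers = 16     # the earlier output lists layers 0..15
r = 16
lora_alpha = 32

per_layer = lora_param_count(hidden, hidden, r) + lora_param_count(hidden, kv_dim, r)
total = per_layer * num_layers
scaling = lora_alpha / r            # factor applied to the low-rank update

print(per_layer, total, scaling)
```

If those shapes hold, the adapter has well under two million trainable weights, a small fraction of a model with roughly a billion parameters. On a live model, `model.print_trainable_parameters()` from PEFT reports the actual figure.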

Map (num_proc=4):   0%|          | 0/13125 [00:00<?, ? examples/s]

Model max sequence length: 131072

| Step | Training Loss |
|------|---------------|
| 50   | 1.6956 |
| 100  | 1.5493 |
| 150  | 1.4085 |
| 200  | 1.3703 |
| 250  | 1.2786 |
| 300  | 1.3251 |
| 350  | 1.2937 |
| 400  | 1.3332 |
| 450  | 1.2475 |
| 500  | 1.2649 |





Model and tokenizer saved successfully to Google Drive.


Evaluating: 100%|██████████| 3282/3282 [01:42<00:00, 32.15it/s]

Perplexity: 3.2522
Next-token Accuracy: 0.7146
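To put the three evaluation runs side by side, the snippet below converts each reported perplexity back into mean loss per token. The numbers are copied from the printed outputs. The settings are not identical across runs (the baseline cell truncates inputs to 512 tokens and loads the model in FP16, while the two fine-tuned evaluations use untruncated inputs and BF16), so the comparison is indicative rather than exact.

```python
import math

# Perplexity and next-token accuracy as printed by the three evaluation runs.
results = {
    "base": (6.1104, 0.5907),
    "last layer": (2.9599, 0.7329),
    "LoRA (q/v, r=16)": (3.2522, 0.7146),
}

for name, (ppl, acc) in results.items():
    avg_loss = math.log(ppl)   # perplexity = exp(mean loss), so invert with log
    print(f"{name:<18} loss/token={avg_loss:.3f}  ppl={ppl:.4f}  acc={acc:.4f}")
```

On these figures both fine-tuned variants roughly halve the baseline perplexity, with the last-layer run slightly ahead of the LoRA run.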





## Q&A

In [7]:
# Install necessary libraries
!pip install transformers datasets torch peft

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Directory where the fine-tuned model is saved
final_model_dir = "/content/drive/My Drive/CS394/fine_tuned_peft_llama"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(final_model_dir)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Load the base model and fine-tuned LoRA (PEFT) model
base_model_name = "meta-llama/Llama-3.2-1B"  # Replace with your base model name
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)

base_model.resize_token_embeddings(len(tokenizer))
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load the fine-tuned PEFT model
model = PeftModel.from_pretrained(base_model, final_model_dir).to(device)
model.eval()

# Function to ask a question
def ask_question(model, tokenizer, question, max_length=50, temperature=0.7, top_p=0.9):
    """
    Generates an answer to the given question using the fine-tuned model.

    Args:
        model: The fine-tuned model.
        tokenizer: The tokenizer corresponding to the model.
        question (str): The question to ask the model.
        max_length (int): The maximum total length in tokens, counting the prompt plus the generated answer.
        temperature (float): Sampling temperature (lower values are more deterministic).
        top_p (float): Top-p nucleus sampling value.

    Returns:
        str: The generated answer.
    """
    # Tokenize the question
    input_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)

    # Generate response
    output_ids = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        temperature=temperature,
        top_p=top_p,
        do_sample=True
    )

    # Decode the output
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return answer

# Example usage
question = "What are the symptoms of diabetes?"
answer = ask_question(model, tokenizer, question)
print(f"Question: {question}")
print(f"Answer: {answer}")


Using device: cuda


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Question: What are the symptoms of diabetes?
Answer: What are the symptoms of diabetes? How do you know if you have diabetes? The most common signs and symptoms of diabetes are increased thirst and frequent urination. Other signs and symptoms may include blurred vision, a tingling sensation in the hands or
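
The answer above begins by repeating the question because, for a decoder-only model, `generate` returns the prompt tokens followed by the continuation, and the whole sequence is decoded. One way to show only the new text is to slice off the prompt before decoding. The sketch below illustrates the slicing on plain lists; the helper name `split_generated` and the token ids are hypothetical.

```python
def split_generated(output_ids, prompt_length):
    """Return only the newly generated token ids, dropping the echoed prompt."""
    return output_ids[prompt_length:]

# Hypothetical token ids standing in for the tokenized question.
prompt_ids = [101, 7, 42, 9]
# A decoder-only model returns the prompt followed by the continuation.
output_ids = prompt_ids + [55, 56, 57]

new_ids = split_generated(output_ids, len(prompt_ids))
print(new_ids)
```

In `ask_question` the equivalent slice would be `output_ids[0][input_ids.shape[1]:]` passed to `tokenizer.decode`. Note also that `max_length=50` counts the prompt tokens, which is why the answer is cut off mid-sentence; `max_new_tokens` is the `generate` argument that budgets only the continuation.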


# BOFT

## Fine Tune & Metrics

In [5]:
# Install necessary libraries
!pip install transformers datasets evaluate torch peft
!pip install --upgrade peft

import torch
import pandas as pd
import os
import math
import numpy as np
from torch.utils.data import DataLoader
from tqdm import tqdm
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    DataCollatorWithPadding,
)
from peft import BOFTConfig, get_peft_model, PeftModel, TaskType
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Base model name
model_name = "meta-llama/Llama-3.2-1B"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Load model onto GPU with BF16
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)

base_model.resize_token_embeddings(len(tokenizer))
base_model.config.pad_token_id = tokenizer.pad_token_id


# Set up BOFT configuration
boft_config = BOFTConfig(
    boft_n_butterfly_factor=8,  # Number of butterfly factors across different layers
    target_modules=["q_proj", "v_proj"],  # Modules to apply the adapter to
    exclude_modules=None,  # Exclude specific modules (optional, None by default)
    boft_dropout=0.1,  # Dropout probability
    fan_in_fan_out=False,  # Set to True if weights are stored as (fan_in, fan_out)
    bias="boft_only",  # Bias type; 'none', 'all', or 'boft_only'
    modules_to_save=[],  # Additional modules to be saved (optional)
    layers_to_transform=[0, 2, 4, 6, 8],  # Transform specific layers (example: even layers)
    layers_pattern="layers",  # Pattern for the nn.ModuleList
    task_type=TaskType.CAUSAL_LM,  # Task type
)

# Wrap the model with BOFT
model = get_peft_model(base_model, boft_config)
print("Model wrapped with BOFT.")

# Verify trainable parameters
trainable_params = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"Trainable parameters: {trainable_params}")

# Load dataset from Drive
data_path = '/content/drive/My Drive/CS394/medquad.csv'
df = pd.read_csv(data_path)

# Drop rows with missing or empty question/answer
df = df.dropna(subset=['question', 'answer']).reset_index(drop=True)
df = df[(df['question'].str.strip() != '') & (df['answer'].str.strip() != '')]

# Convert DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df[['question', 'answer']])

# Perform 80/20 train-test split
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = split_dataset['train'], split_dataset['test']

print(f"Training dataset size: {len(train_dataset)} rows")
print(f"Test dataset size: {len(test_dataset)} rows")

# Preprocessing function
def preprocess_function(examples):
    inputs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]
    tokenized_inputs = tokenizer(inputs, padding=False, truncation=False)

    input_ids_list = []
    attention_mask_list = []
    labels_list = []

    max_seq_length = model.config.max_position_embeddings
    print(f"Model max sequence length: {max_seq_length}")

    for input_ids in tokenized_inputs["input_ids"]:
        for i in range(0, len(input_ids), max_seq_length):
            chunk = input_ids[i : i + max_seq_length]
            input_ids_list.append(chunk)
            attention_mask_list.append([1] * len(chunk))
            labels_list.append(chunk.copy())

    return {
        "input_ids": input_ids_list,
        "attention_mask": attention_mask_list,
        "labels": labels_list,
    }

# Tokenize train and test sets
tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

tokenized_test = test_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

# Data collator with dynamic padding
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/My Drive/CS394/fine_tuned_boft_llama_results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    fp16=False,
    bf16=True,
    gradient_checkpointing=False,
    optim="adamw_torch",
    logging_steps=50,
    save_steps=0,
    evaluation_strategy="no",
    save_strategy="no",
    report_to="none",
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train the model
trainer.train()

# Directory to save the final model and tokenizer
final_model_dir = "/content/drive/My Drive/CS394/fine_tuned_boft_llama"
os.makedirs(final_model_dir, exist_ok=True)
# Save model and tokenizer
trainer.save_model(final_model_dir)
model.save_pretrained(final_model_dir)
tokenizer.save_pretrained(final_model_dir)

print("Model and tokenizer saved successfully to Google Drive.")

#########################################
# Benchmarking after fine-tuning
#########################################

# Reload model and tokenizer for evaluation
tokenizer = AutoTokenizer.from_pretrained(final_model_dir)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)

base_model.resize_token_embeddings(len(tokenizer))
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load the fine-tuned BOFT model
model = PeftModel.from_pretrained(base_model, final_model_dir).to(device)
model.eval()

# Prepare evaluation
eval_data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding='longest',
    return_tensors='pt',
)

def preprocess_function_eval(examples):
    inputs = [q + " " + a for q, a in zip(examples['question'], examples['answer'])]
    tokenized_inputs = tokenizer(inputs, padding=False, truncation=False)
    return tokenized_inputs

tokenized_test_for_eval = test_dataset.map(
    preprocess_function_eval,
    batched=True,
    remove_columns=["question", "answer"],
    num_proc=4,
)

eval_dataloader = DataLoader(
    tokenized_test_for_eval,
    batch_size=1,
    collate_fn=eval_data_collator,
)

# Compute metrics
correct_predictions = 0
total_predictions = 0
total_loss = 0
total_tokens = 0
loss_fct = torch.nn.CrossEntropyLoss(ignore_index=-100, reduction='sum')

for batch in tqdm(eval_dataloader, desc="Evaluating"):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)

    labels = input_ids.clone()
    labels[input_ids == tokenizer.pad_token_id] = -100

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
    total_loss += loss.item()

    valid_tokens_mask = (shift_labels != -100)
    total_tokens += valid_tokens_mask.sum().item()

    predicted_tokens = torch.argmax(shift_logits, dim=-1)
    correct = (predicted_tokens == shift_labels) & valid_tokens_mask
    correct_predictions += correct.sum().item()
    total_predictions += valid_tokens_mask.sum().item()

# Perplexity
average_loss = total_loss / total_tokens
perplexity = math.exp(average_loss)
print(f"Perplexity: {perplexity:.4f}")

# Next-token accuracy
next_token_accuracy = correct_predictions / total_predictions
print(f"Next-token Accuracy: {next_token_accuracy:.4f}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Using device: cuda


Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu121/fbd_cuda...
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
No modifications detected for re-loaded extension module fbd_cuda, skipping build step...
Loading extension module fbd_cuda...

Model wrapped with BOFT.
Trainable parameters: ['base_model.model.model.layers.0.self_attn.q_proj.boft_R.default', 'base_model.model.model.layers.0.self_attn.q_proj.boft_s.default', 'base_model.model.model.layers.0.self_attn.v_proj.boft_R.default', 'base_model.model.model.layers.0.self_attn.v_proj.boft_s.default', 'base_model.model.model.layers.2.self_attn.q_proj.boft_R.default', 'base_model.model.model.layers.2.self_attn.q_proj.boft_s.default', 'base_model.model.model.layers.2.self_attn.v_proj.boft_R.default', 'base_model.model.model.layers.2.self_attn.v_proj.boft_s.default', 'base_model.model.model.layers.4.self_attn.q_proj.boft_R.default', 'base_model.model.model.layers.4.self_attn.q_proj.boft_s.default', 'base_model.model.model.layers.4.self_attn.v_proj.boft_R.default', 'base_model.model.model.layers.4.self_attn.v_proj.boft_s.default', 'base_model.model.model.layers.6.self_attn.q_proj.boft_R.default', 'base_model.model.model.layers.6.self_attn.q_proj.boft_s.default', 'base_model.mo

Map (num_proc=4):   0%|          | 0/13125 [00:00<?, ? examples/s]

Model max sequence length: 131072


Map (num_proc=4):   0%|          | 0/3282 [00:00<?, ? examples/s]

Model max sequence length: 131072




Step,Training Loss
50,1.7586
100,1.7756
150,1.7451
200,1.7276
250,1.749
300,1.7176
350,1.6918
400,1.7576
450,1.6506
500,1.6803




Model and tokenizer saved successfully to Google Drive.


Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
No modifications detected for re-loaded extension module fbd_cuda, skipping build step...
Loading extension module fbd_cuda...

Map (num_proc=4):   0%|          | 0/3282 [00:00<?, ? examples/s]

Evaluating: 100%|██████████| 3282/3282 [04:43<00:00, 11.59it/s]

Perplexity: 4.6205
Next-token Accuracy: 0.6433
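
A note on the preprocessing log earlier in this output: it reports a maximum sequence length of 131072, which is the block size `preprocess_function` chunks at. A question-answer pair is therefore split only if it is longer than 131072 tokens, so each example most likely stays whole. The sketch below isolates the same chunking loop in plain Python with a much smaller block size to show what splitting looks like; the helper name `chunk_token_ids` and the block size of 4 are assumptions for illustration and were not used in this run.

```python
def chunk_token_ids(input_ids, block_size):
    """Split one tokenized example into consecutive blocks of at most block_size tokens."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [input_ids[i:i + block_size] for i in range(0, len(input_ids), block_size)]

example = list(range(10))  # stand-in for the token ids of one question-answer pair

chunks_small = chunk_token_ids(example, block_size=4)
chunks_large = chunk_token_ids(example, block_size=131072)

print(chunks_small)       # three blocks, the last one shorter
print(len(chunks_large))  # a single block: the example is never split
```

If memory becomes a constraint, passing a smaller block size in place of `model.config.max_position_embeddings` would bound the length of every training sequence, at the cost of cutting some answers across blocks.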





## Q&A

In [6]:
# Install necessary libraries
# !pip install transformers datasets torch peft

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Directory where the fine-tuned model is saved
final_model_dir = "/content/drive/My Drive/CS394/fine_tuned_boft_llama"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(final_model_dir)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Load the base model and fine-tuned BOFT model
base_model_name = "meta-llama/Llama-3.2-1B"  # Replace with your base model name
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    use_cache=False
).to(device)

base_model.resize_token_embeddings(len(tokenizer))
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load the BOFT fine-tuned model
model = PeftModel.from_pretrained(base_model, final_model_dir).to(device)
model.eval()

# Function to ask a question
def ask_question(model, tokenizer, question, max_length=50, temperature=0.7, top_p=0.9):
    """
    Generates an answer to the given question using the fine-tuned BOFT model.

    Args:
        model: The fine-tuned model.
        tokenizer: The tokenizer corresponding to the model.
        question (str): The question to ask the model.
        max_length (int): The maximum total length in tokens, counting the prompt plus the generated answer.
        temperature (float): Sampling temperature (lower values are more deterministic).
        top_p (float): Top-p nucleus sampling value.

    Returns:
        str: The generated answer.
    """
    # Tokenize the question
    input_ids = tokenizer(question, return_tensors="pt").input_ids.to(device)

    # Generate response
    output_ids = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        temperature=temperature,
        top_p=top_p,
        do_sample=True
    )

    # Decode the output
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return answer

# Example usage
question = "What are the symptoms of diabetes?"
answer = ask_question(model, tokenizer, question)
print(f"Question: {question}")
print(f"Answer: {answer}")


Using device: cuda


Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
No modifications detected for re-loaded extension module fbd_cuda, skipping build step...
Loading extension module fbd_cuda...

Question: What are the symptoms of diabetes?
Answer: What are the symptoms of diabetes? The most common symptom of diabetes is feeling very tired. You may feel tired for no reason, or you may feel tired when you are active. You may also feel tired when you are not active. You may
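
Because `ask_question` samples with `do_sample=True`, `temperature=0.7`, and `top_p=0.9`, repeated runs can produce different answers, so a single sample is not a reliable basis for comparing the LoRA and BOFT models. As a rough illustration of what the `top_p` setting does, the sketch below applies nucleus filtering to a toy distribution. It is a simplified stand-in, not the library's implementation; the helper name `top_p_filter` and the probabilities are made up.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches top_p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Made-up next-token distribution.
toy = {"thirst": 0.5, "fatigue": 0.3, "blurred": 0.15, "banana": 0.05}
filtered = top_p_filter(toy, top_p=0.9)

for token, p in filtered.items():
    print(f"{token}: {p:.3f}")
```

With these numbers the three most likely tokens are kept and the low-probability tail is dropped before sampling. For a repeatable comparison between adapters, greedy decoding (`do_sample=False`) or a fixed random seed would remove this source of variation.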
