# Fine-Tuning Llama 3.1 on IMDB Movie Review Dataset

# Introduction
## This notebook demonstrates fine-tuning the Llama 3.1 model for sentiment analysis on the IMDB dataset.

# GPU/CPU Requirement
## This notebook can run only on GPU.

# Install required packages

In [None]:
!pip install unsloth "xformers==0.0.28.post2"
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Found existing installation: unsloth 2024.11.10
Uninstalling unsloth-2024.11.10:
  Successfully uninstalled unsloth-2024.11.10
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-1qtpc02b/unsloth_4aa18756eecd46939d56733e92ebd51f
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-1qtpc02b/unsloth_4aa18756eecd46939d56733e92ebd51f
  Resolved https://github.com/unslothai/unsloth.git to commit 8558bc92b06f9128499484ef737fa71b966ffc23
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2024.11.10-py3-no

# Import necessary libraries

In [None]:

import torch
from trl import SFTTrainer
from datasets import load_dataset
from transformers import TrainingArguments, TextStreamer
from unsloth.chat_templates import get_chat_template
from unsloth import FastLanguageModel, is_bfloat16_supported
import pandas as pd
import re

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Hugging Face Token Setup

This cell handles authentication with Hugging Face. It retrieves your Hugging Face token, which you need to obtain and store securely beforehand. This token allows us to download the pre-trained Llama model.

**Follow these steps to get your Hugging Face token and add it to Colab:**

1. **Get your Hugging Face Token:**
   * Go to the Hugging Face website (https://huggingface.co/) and log in to your account.
   * Click on your profile picture in the top right corner.
   * Select "Settings".
   * In the left sidebar, click on "Access Tokens".
   * Click on "New token".
   * Give your token a name (e.g., "Colab Token").
   * Under "Role", select "Read".
   * Click "Create".
   * **Important:** Copy the token value that's generated. This is your `HF_TOKEN`. Store it securely.

2. **Store your Token in Colab Secrets:**
   * In this Colab notebook, click on the "Secrets" tab in the left sidebar (it looks like a key).
   * Click on "Add a new secret".
   * In the "Name" field, enter `HF_TOKEN`.
   * In the "Value" field, paste your Hugging Face token that you copied earlier.
   * Click "Add".

Now you can run the code below to log in to Hugging Face.

In [None]:
import os
from google.colab import userdata
hf_token=userdata.get('HF_TOKEN')

# Now, you can use the token to authenticate with Hugging Face
!huggingface-cli login --token "$hf_token"

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
The token `READ_AGAIN` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `READ_AGAIN`


## Download the IMDB Dataset

This cell downloads the IMDB dataset directly from Kaggle.

**Before running this code, follow these steps to set up the Kaggle API:**

1. **Create a Kaggle Account:** If you don't have one already, create an account on Kaggle (https://www.kaggle.com/).
2. **Get your Kaggle API Token:** Go to your Kaggle account settings, and under the 'API' section, click 'Create New API Token'. This will download a `kaggle.json` file.
3. **Upload `kaggle.json` to Colab:** In the Files tab of this Colab notebook, click 'Upload' and select the downloaded `kaggle.json` file.
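
The recorded output of the shell cell in this section shows `cp: cannot stat 'kaggle.json'`, which is what the copy step prints when the file has not been uploaded yet. As a minimal sketch, the credentials could be checked and installed from Python so that a missing file fails early with a clear message. The helper name `install_kaggle_credentials` and its arguments are hypothetical and not part of the Kaggle CLI; the only assumption about the file is that it is JSON with `username` and `key` fields.

```python
import json
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(src="kaggle.json", dest_dir="~/.kaggle"):
    """Copy an uploaded kaggle.json into place, failing early if it is missing.

    Hypothetical helper for illustration; not part of the Kaggle CLI.
    """
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(f"{src_path} not found; upload it via the Files tab first.")

    # A Kaggle API token file is expected to hold "username" and "key".
    creds = json.loads(src_path.read_text())
    missing = {"username", "key"} - set(creds)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)  # Does not fail if the folder exists
    target = dest / "kaggle.json"
    shutil.copyfile(src_path, target)
    os.chmod(target, 0o600)  # Same permissions as `chmod 600` in the shell cell
    return target
```

Calling `install_kaggle_credentials()` before the download command would replace the `mkdir`, `cp`, and `chmod` lines; the `kaggle datasets download` and `unzip` steps stay as they are.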

In [None]:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
! unzip imdb-dataset-of-50k-movie-reviews.zip

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 58% 15.0M/25.7M [00:00<00:00, 156MB/s]
100% 25.7M/25.7M [00:00<00:00, 201MB/s]
Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


In [None]:
import os
os.rename('/content/IMDB Dataset.csv', '/content/imdb_dataset.csv')

## Key Customizable Variables
This section defines key variables that control the behavior of the model and training process. You can adjust these values to experiment with different settings and potentially improve performance.

The default values are chosen to work well on the free tier of Colab, which has limited memory. If you are running this notebook on a machine with more resources, you can increase some of these values.

In [None]:
# Model and tokenizer parameters
model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit"
max_seq_length = 2048  # Maximum sequence length for LLaMA
dtype = None  # Auto-detect dtype
load_in_4bit = True  # Use 4-bit quantization

# Dataset parameters
sample_size = 750  # Number of samples per class (1,500 reviews in total after balancing)
test_size = 0.2  # 20% for testing

# Training parameters
num_train_epochs = 3
learning_rate = 5e-5
train_batch_size = 8
eval_batch_size = 8
logging_steps = 1
eval_steps = 20
save_steps = 20
weight_decay = 0.1

## Load LLaMA Model and Tokenizer

Initialize the LLaMA model and tokenizer with the following configurations:
- 4-bit quantization for memory efficiency
- Maximum sequence length of 2048
- LoRA fine-tuning setup

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
    dtype=dtype,
)

==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
    use_rslora=True,
    use_gradient_checkpointing="unsloth"
)

Unsloth 2024.11.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
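
The LoRA settings above determine how many parameters are actually trained. As a rough sketch, the count can be worked out by hand: each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices with `r * (in_features + out_features)` weights in total. The layer dimensions below are assumptions taken from the published Llama 3.1 8B configuration, not from this notebook; only `r=16`, the seven target modules, and the 32 layers come from the cells above.

```python
# Assumed Llama 3.1 8B dimensions (from the published model config, not from this notebook)
HIDDEN = 4096          # Model width
KV_DIM = 1024          # Width of the key/value projections (8 KV heads x 128)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32        # Matches "patched 32 layers" in the Unsloth message
LORA_RANK = 16         # r=16 in get_peft_model

# (in_features, out_features) for every module listed in target_modules
PROJECTION_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """Weights in the A (rank x in) and B (out x rank) adapter matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in PROJECTION_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
# 1,310,720 per layer, 41,943,040 in total
```

Under these assumptions the total agrees with the "Number of trainable parameters" line that Unsloth prints when training starts. Setting `use_rslora=True` changes only how the adapter output is scaled, not the number of weights.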


## Loading the IMDB Movie Review Dataset

In [None]:
df = pd.read_csv('/content/imdb_dataset.csv')

## Balance the Dataset (Optional)

This function creates a balanced subset of the data, ensuring an equal number of positive and negative reviews. This can be helpful for preventing the model from being biased towards one sentiment class.

If you set the `sample_size` variable in the 'Key Customizable Variables' section, this function draws that many reviews from each class, so the balanced subset contains `2 * sample_size` rows (1,500 with the default of 750). If `sample_size` is set to `None`, the full dataset is used.

**Explanation of the `create_subset` function:**

1. Converts the 'sentiment' column to binary labels (1 for positive, 0 for negative).
2. If `sample_size` is provided, it samples an equal number of positive and negative reviews.
3. Concatenates the samples and shuffles the data.
4. If no `sample_size` is provided, it uses the full dataset.

In [None]:
def create_subset(dataframe, sample_size=None, random_seed=42):
    # Convert sentiment into binary labels (1 for positive, 0 for negative)
    dataframe['label'] = dataframe['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

    if sample_size:
        # Separate positive and negative reviews
        positive_reviews = dataframe[dataframe['label'] == 1]
        negative_reviews = dataframe[dataframe['label'] == 0]

        # Sample the specified number of rows from each class
        positive_sample = positive_reviews.sample(n=sample_size, random_state=random_seed)
        negative_sample = negative_reviews.sample(n=sample_size, random_state=random_seed)

        # Concatenate the samples and shuffle
        subset_df = pd.concat([positive_sample, negative_sample]).sample(frac=1, random_state=random_seed).reset_index(drop=True)
    else:
        # If no sample size is provided, use the full dataset
        subset_df = dataframe.copy()

    return subset_df

# Run the function to create a smaller subset of the data
reduced_df = create_subset(df, sample_size=sample_size)

## Data Preprocessing

Data preprocessing is a crucial step in Natural Language Processing (NLP). It helps to clean and standardize the text data, making it easier for the model to learn meaningful patterns.

This cell preprocesses the review text by replacing HTML line-break tags (`<br>`, `<br />`) with spaces; other markup and punctuation are left untouched. It also recomputes the numeric 'label' column (1 for positive, 0 for negative), which `create_subset` has already added.

In [None]:
def clean_text(text):
    # Replace HTML line-break tags (<br>, <br/>, <br />) with a space
    text = re.sub(r'<br\s*/?>', ' ', text)
    return text

reduced_df['review'] = reduced_df['review'].apply(clean_text)
reduced_df['label'] = reduced_df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
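
As a quick check of what the regular expression in `clean_text` does and does not touch, here is a small standalone sketch. The sample strings are made up for illustration; the function body is the same as in the cell above.

```python
import re


def clean_text(text):
    # Same pattern as in the preprocessing cell: matches <br>, <br/>, <br />
    return re.sub(r'<br\s*/?>', ' ', text)


sample = "Great film.<br /><br />Loved it.<br>Would watch again."
print(clean_text(sample))
# Great film.  Loved it. Would watch again.

# Other tags are not line breaks, so they pass through unchanged
print(clean_text("<i>so</i> good"))
# <i>so</i> good
```

Each tag becomes one space, so consecutive tags leave a double space behind. If that matters for a given use, collapsing whitespace afterwards is one option.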

## Create Hugging Face Datasets

The `SFTTrainer` class from `trl` builds on the `Trainer` class from the `transformers` library, and both work directly with Hugging Face Datasets. So, in this cell, we convert our DataFrame into a Hugging Face Dataset object.

This makes it easier to manage the data during training and leverage the features of the `Trainer` API.

In [None]:
from datasets import Dataset, DatasetDict
dataset = Dataset.from_pandas(reduced_df)

In [None]:
display(dataset[1])

{'review': 'Strange, often effective hippie zombie flick, starring the unforgettable husband/wife team of Alan and Anya Ornsby, this movie isn\'t as bad as most in its genre, but is still way high on the cheese-factor. Includes several bargain-basement zombies, outrageously campy dialogue, a scene-chewing performance by Alan Ormsby, several gay/kinky grave-robbers, and one straange soundtrack. Wife Anya puts on a performance that\'s so odd, one has to wonder if she\'s really acting at all. There are much worst pics of this kind during the era (look for any Al Adamson flic), but it\'s no Night of the Living Dead. Director/Writer "Benjamin" Clark, is really Bob Clark, who went on to create the purile "Porky\'s" early 80\'s teen exploitation disasters. He has only now resurfaced after 1 inexplicably good movie ("A Christmas Story") to return to his dreadful ways with "Baby Geniuses". Weirdo Alan Ormsby later wrote the kinky Nastasia Kinski/Malcolm McDowell version of "Cat People". Moocow 

# Data Formatting Function for Model Input

This cell implements a crucial formatting function that transforms raw IMDB reviews into a structured format suitable for training the LLaMA model. Here's what the function does:

1. **Instruction Template Creation**:
   - Creates a clear task description for sentiment classification
   - Defines specific rules for the expected response format (POSITIVE/NEGATIVE)
   - Provides criteria for distinguishing positive and negative reviews

2. **Input-Output Structure**:
   - Takes a review as input `example['review']`
   - Creates a human prompt combining instructions and the review text
   - Generates appropriate assistant response based on the sentiment

3. **Returns Formatted Data Dictionary**:
   ```python
   {
       'conversations': [
           {'from': 'human', 'value': human_prompt},
           {'from': 'gpt', 'value': assistant_response}
       ],
       'source': 'imdb-movie-reviews',
       'score': None,
       'text': combined_prompt_and_response
   }
   ```

The formatted output follows a conversation-style structure that LLaMA models are typically trained on, making it easier for the model to understand the task and generate appropriate responses during fine-tuning.

The last line applies this formatting function to the entire dataset using the `map()` function, transforming all reviews into this structured format.

In [None]:
def format_data(example):
    """Format data for training"""
    instruction = (
        "Task: Classify this movie review as positive or negative.\n"
        "Rules:\n"
        "- Answer with exactly one word: POSITIVE or NEGATIVE\n"
        "- Positive reviews express enjoyment, praise, or satisfaction\n"
        "- Negative reviews express dislike, criticism, or disappointment\n\n"
    )

    human_prompt = f"{instruction}Review: {example['review']}\n\nClassification:"
    assistant_response = "POSITIVE" if example['sentiment'] == 'positive' else "NEGATIVE"

    return {
        'conversations': [
            {'from': 'human', 'value': human_prompt},
            {'from': 'gpt', 'value': assistant_response}
        ],
        'source': 'imdb-movie-reviews',
        'score': None,
        'text': human_prompt + " " + assistant_response
    }

# Apply the formatting function to our dataset
formatted_dataset = dataset.map(format_data)

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [None]:
display(formatted_dataset[1])


# Chat Template Configuration and Message Formatting
This cell configures the tokenizer with the ChatML chat template and defines the function that renders each conversation as training text. ChatML wraps every turn in the special tokens `<|im_start|>` and `<|im_end|>`, so the model can distinguish the human prompt from the expected response. The `mapping` argument tells the template that this dataset stores roles under `from` (`human`/`gpt`) and message text under `value`.
The `apply_template` function applies the template to every conversation in a batch and stores the result in the `text` column, replacing the plain concatenation produced by `format_data`.
In effect, the template teaches the model "when to listen" (human input) and "when to speak" (model output) in a consistent way.

In [None]:
tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

def apply_template(examples):
    messages = examples["conversations"]
    text = [tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=False) for message in messages]
    return {"text": text}


# <|im_start|>human
# Task: Classify... Review: This movie was great...\nClassification:
# <|im_end|>
# <|im_start|>gpt
# POSITIVE
# <|im_end|>

Unsloth: Will map <|im_end|> to EOS = <|end_of_text|>.
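
To make the target format concrete, the following plain-Python sketch mimics what a ChatML-style template produces for one conversation. It is an illustration only: `render_chatml` is a hypothetical helper, and it labels each turn with the `from` value as in the commented example above, whereas the real text comes from `tokenizer.apply_chat_template` and may label the roles differently.

```python
def render_chatml(conversation, add_generation_prompt=False, assistant_role="gpt"):
    """Render a list of {'from': ..., 'value': ...} turns as ChatML-style text.

    Hypothetical helper for illustration; not the Unsloth implementation.
    """
    parts = []
    for message in conversation:
        # Each turn: start token, role name, newline, content, end token
        parts.append(f"<|im_start|>{message['from']}\n{message['value']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer
        parts.append(f"<|im_start|>{assistant_role}\n")
    return "".join(parts)


conversation = [
    {"from": "human", "value": "Review: This movie was great.\n\nClassification:"},
    {"from": "gpt", "value": "POSITIVE"},
]
print(render_chatml(conversation))
```

During training the full conversation is rendered, including the answer. At inference time only the human turn would be rendered, with `add_generation_prompt=True` so that the text ends where the model's answer should begin.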


In [None]:
formatted_dataset = formatted_dataset.map(apply_template, batched=True)

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [None]:
display(formatted_dataset[1])


# Dataset Train-Test Split Configuration
This cell handles the step of splitting the formatted dataset into training and test sets.

In [None]:
# Split the formatted dataset into training and test sets (80% / 20% with test_size = 0.2)

from datasets import DatasetDict

# Hold out `test_size` of the examples for evaluation; the fixed seed keeps the split reproducible
train_test_split = formatted_dataset.train_test_split(test_size=test_size, seed=42)

# Create a DatasetDict
dataset = DatasetDict({
    'train': train_test_split['train'],
    'test': train_test_split['test']
})

# Now you have 'dataset' which contains 'train' and 'test' splits.
dataset

DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment', 'label', 'conversations', 'source', 'score', 'text'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['review', 'sentiment', 'label', 'conversations', 'source', 'score', 'text'],
        num_rows: 300
    })
})

In [None]:
display(dataset['test'][1])

{'review': 'This is truly abysmal. I just got a copy of "Disco Beaver From Outer Space" after hearing good things about it, and I have to say, this was just so incredibly unfunny and bad, it will leave you numb and mystified how this ever got made.  I mean, what was it? Is it that this is typical late 70\'s humor? I don\'t think so. This is just so bad, and believe me, I don\'t mean "so bad it\'s good" either. This is a collection of extremely unfunny skits as if you are watching cable TV. Sure enough, this was an HBO program, and to think this may have been considered groundbreaking is scary.  There is one somewhat pretty girl in it, and there is some old NHL footage of the NY Islanders hockey team, which is fun to see even though I am a lifelong NY Rangers fan. But they even mess that up, as they try to get some humor out of two hockey players scuffling on the ice as if they are "dancing" and, even worse, reverse the videotape of two hockey players fighting to make it look like they 

# Training Configuration with SFTTrainer
This cell initializes the supervised fine-tuning trainer (`SFTTrainer`) that adapts the LLaMA model for sentiment analysis. It receives the training and test splits of the processed dataset together with the hyperparameters defined earlier. The configuration uses a cosine learning-rate schedule, the batch sizes from the variables cell, the 8-bit AdamW optimizer (`adamw_8bit`) to reduce optimizer memory, and mixed precision (bf16 where the GPU supports it, fp16 otherwise). With `packing=True`, several short examples are concatenated into each sequence of up to `max_seq_length` tokens, which is why the training log reports far fewer examples than the 1,200 rows in the training split. The trainer evaluates the model and saves a checkpoint every 20 steps, and at the end of training it reloads the checkpoint with the lowest evaluation loss.
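
To illustrate the cosine schedule, here is a minimal sketch of how the learning rate would decay over the run. It assumes no warmup (no warmup argument is set in the cell below) and a single half-cosine from the peak rate down to zero; `cosine_lr` is a hypothetical helper written for this illustration, not the scheduler object the trainer builds internally. The total of 81 steps is taken from the training log recorded in this notebook.

```python
import math


def cosine_lr(step, total_steps, peak_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 81  # From the training log in this notebook
for step in (0, 20, 40, 60, 81):
    print(f"step {step:>2}: lr = {cosine_lr(step, TOTAL_STEPS):.2e}")
```

The rate starts at `learning_rate` (5e-5), falls slowly at first, fastest around the midpoint, and reaches zero at the final step.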

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,
    args=TrainingArguments(
        learning_rate=learning_rate,
        lr_scheduler_type="cosine",
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=eval_batch_size,
        num_train_epochs=num_train_epochs,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=logging_steps,
        optim="adamw_8bit",
        weight_decay=weight_decay,
        output_dir="output",
        seed=42,
        evaluation_strategy="steps",
        eval_steps=eval_steps,
        save_strategy="steps",
        save_steps=save_steps,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss"
    ),
)



Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 212 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 1
\        /    Total batch size = 8 | Total steps = 81
 "-____-"     Number of trainable parameters = 41,943,040
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss,Validation Loss
20,2.1747,2.210957
40,2.2598,2.192583
60,2.165,2.188962
80,2.1422,2.188854


TrainOutput(global_step=81, training_loss=2.1749398001918085, metrics={'train_runtime': 4311.4693, 'train_samples_per_second': 0.148, 'train_steps_per_second': 0.019, 'total_flos': 5.898003904305562e+16, 'train_loss': 2.1749398001918085, 'epoch': 3.0})
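
The figures in the training banner and in `TrainOutput` can be cross-checked with a little arithmetic, which is a useful sanity check when changing batch size or epochs. The sketch below uses only numbers printed above; the step formula (batches per epoch rounded up, times epochs) is the usual one when gradient accumulation is 1 and is stated here as an assumption.

```python
import math

# Values copied from the training log above
packed_examples = 212      # "Num examples" after packing
batch_size = 8
grad_accum_steps = 1
num_epochs = 3
eval_every = 20
train_runtime_s = 4311.4693

steps_per_epoch = math.ceil(packed_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs
eval_points = list(range(eval_every, total_steps + 1, eval_every))

steps_per_second = total_steps / train_runtime_s
samples_per_second = packed_examples * num_epochs / train_runtime_s

print(steps_per_epoch, total_steps)   # 27 81
print(eval_points)                    # [20, 40, 60, 80]
print(round(steps_per_second, 3))     # 0.019
print(round(samples_per_second, 3))   # 0.148
```

These values line up with the log: 81 total steps, evaluations at steps 20 through 80, and the reported throughput of 0.019 steps and 0.148 samples per second.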

## Model Evaluation Implementation
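This class implements an evaluation framework for testing and comparing the LLaMA model's performance on sentiment analysis. It generates a prediction for each review, parses the answer into a label, and reports accuracy, precision, recall, F1 score, specificity, and the confusion matrix. Outputs that cannot be parsed are counted as failed predictions and excluded from the metrics.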

In [None]:
from tqdm import tqdm
from sklearn.metrics import classification_report, confusion_matrix

class Llama_model_evaluator:
    def __init__(self, trainer, batch_size: int = 1, max_length: int = 512):
        """Initialize evaluator with trained model"""
        self.model = FastLanguageModel.for_inference(trainer.model)
        self.tokenizer = trainer.tokenizer
        self.device = self.model.device
        self.batch_size = batch_size
        self.max_length = max_length

    def _prepare_input(self, review: str) -> dict:
        """Format and tokenize the input with the same prompt wording used in training"""
        instruction = (
            "Task: Classify this movie review as positive or negative.\n"
            "Rules:\n"
            "- Answer with exactly one word: POSITIVE or NEGATIVE\n"
            "- Positive reviews express enjoyment, praise, or satisfaction\n"
            "- Negative reviews express dislike, criticism, or disappointment\n\n"
        )
        prompt = f"{instruction}Review: {review}\n\nClassification:"

        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            padding=True,
            return_attention_mask=True
        )
        # Move the tensors to the same device as the model
        return {k: v.to(self.device) for k, v in inputs.items()}

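    def _extract_sentiment(self, text: str) -> int:
        """Get sentiment from model output (1 = positive, 0 = negative, -1 = unparseable)"""
        text = text.lower().strip()
        # If the prompt was echoed back, keep only what follows the last answer marker
        for marker in ("classification:", "answer:"):
            if marker in text:
                text = text.split(marker)[-1].strip()

        if "positive" in text:
            return 1
        elif "negative" in text:
            return 0

        return -1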

    def evaluate(self, test_dataset, num_samples=None):
        """Evaluate model on test dataset"""
        predictions = []
        true_labels = []

        if num_samples:
            test_dataset = test_dataset.select(range(min(num_samples, len(test_dataset))))

        for example in tqdm(test_dataset):
            try:
                inputs = self._prepare_input(example['review'])
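                with torch.no_grad():
                    outputs = self.model.generate(
                        **inputs,
                        max_new_tokens=10,
                        do_sample=False  # Greedy decoding, so no temperature is needed
                    )

                # Decode only the newly generated tokens, not the echoed prompt
                prompt_length = inputs["input_ids"].shape[1]
                predicted_text = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
                prediction = self._extract_sentiment(predicted_text)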

                predictions.append(prediction)
                true_labels.append(example['label'])

            except Exception as e:
                print(f"Error processing example: {e}")
                predictions.append(-1)
                true_labels.append(example['label'])

        # Filter out failed predictions (-1)
        valid_pairs = [(p, t) for p, t in zip(predictions, true_labels) if p != -1]
        if not valid_pairs:
            print("No valid predictions found!")
            return {
                "accuracy": 0,
                "precision": 0,
                "recall": 0,
                "f1_score": 0,
                "specificity": 0,
                "processed_samples": len(predictions),
                "valid_samples": 0,
                "failed_samples": predictions.count(-1),
                "confusion_matrix": [[0, 0], [0, 0]]
            }

        valid_predictions, valid_true_labels = zip(*valid_pairs)

        # Get confusion matrix
        cm = confusion_matrix(valid_true_labels, valid_predictions)

        # Calculate metrics
        report = classification_report(valid_true_labels, valid_predictions,
                                    target_names=['Negative', 'Positive'],
                                    output_dict=True)

        # Print results
        print("\n=== Model Evaluation Results ===")
        print(f"Total samples: {len(predictions)}")
        print(f"Valid predictions: {len(valid_pairs)}")
        print(f"Failed predictions: {predictions.count(-1)}")

        print("\nDetailed Metrics:")
        print(classification_report(valid_true_labels, valid_predictions,
                                 target_names=['Negative', 'Positive']))

        print("\nConfusion Matrix:")
        print("             Predicted")
        print("             Neg  Pos")
        print(f"Actual Neg  {cm[0][0]:<4} {cm[0][1]:<4}")
        print(f"      Pos  {cm[1][0]:<4} {cm[1][1]:<4}")

        # Calculate specificity
        tn = cm[0,0]
        fp = cm[0,1]
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0

        metrics_dict = {
            "accuracy": report['accuracy'] * 100,
            "precision": report['weighted avg']['precision'] * 100,
            "recall": report['weighted avg']['recall'] * 100,
            "f1_score": report['weighted avg']['f1-score'] * 100,
            "specificity": specificity * 100,
            "processed_samples": len(predictions),
            "valid_samples": len(valid_pairs),
            "failed_samples": predictions.count(-1),
            "confusion_matrix": cm.tolist()
        }
        print(metrics_dict)

        return metrics_dict

## Evaluation of Fine-Tuned Model
Testing the performance of the Fine-Tuned Llama Model

In [None]:

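evaluator = Llama_model_evaluator(
    trainer=trainer  # trainer from the fine-tuning
)

# Quick sanity check on a few examples before running the full evaluation
test_metrics = evaluator.evaluate(test_dataset=dataset['test'], num_samples=10)

if test_metrics['valid_samples'] > 0: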
    print("\n" + "="*50)
    print("Starting Full Dataset Evaluation".center(50))
    print("="*50)

    full_metrics = evaluator.evaluate(
        test_dataset=dataset['test'],
        num_samples=None
    )

    print("\n" + "="*50)
    print("Full Dataset Results Summary".center(50))
    print("="*50 + "\n")

    # Processing Statistics
    print("📊 Processing Statistics:")
    print(f"   • Total Samples: {full_metrics['processed_samples']}")
    print(f"   • Valid Predictions: {full_metrics['valid_samples']}")
    print(f"   • Failed Predictions: {full_metrics['failed_samples']}")

    # Model Performance
    print("\n🎯 Model Performance:")
    print(f"   • Accuracy: {full_metrics['accuracy']:.2f}%")
    print(f"   • Precision: {full_metrics['precision']:.2f}%")
    print(f"   • Recall: {full_metrics['recall']:.2f}%")
    print(f"   • F1 Score: {full_metrics['f1_score']:.2f}%")
    print(f"   • Specificity: {full_metrics['specificity']:.2f}%")

    print("\n" + "="*50)

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.



         Starting Full Dataset Evaluation         


100%|██████████| 300/300 [02:44<00:00,  1.82it/s]


=== Model Evaluation Results ===
Total samples: 300
Valid predictions: 300
Failed predictions: 0

Detailed Metrics:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00       161
    Positive       1.00      1.00      1.00       139

    accuracy                           1.00       300
   macro avg       1.00      1.00      1.00       300
weighted avg       1.00      1.00      1.00       300


Confusion Matrix:
             Predicted
             Neg  Pos
Actual Neg  161  0   
      Pos  0    139 
{'accuracy': 100.0, 'precision': 100.0, 'recall': 100.0, 'f1_score': 100.0, 'specificity': 100.0, 'processed_samples': 300, 'valid_samples': 300, 'failed_samples': 0, 'confusion_matrix': [[161, 0], [0, 139]]}

           Full Dataset Results Summary           

📊 Processing Statistics:
   • Total Samples: 300
   • Valid Predictions: 300
   • Failed Predictions: 0

🎯 Model Performance:
   • Accuracy: 100.00%
   • Precision: 100.00%
   • Recall: 1




## Evaluation of Base Model
Testing the performance of the Base Llama Model

In [None]:
import gc

torch.cuda.empty_cache()
gc.collect()

tokenizer = get_chat_template(
    tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)


# Load base model
print("Loading base model...")
base_model, base_tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    load_in_4bit=load_in_4bit,
    dtype=dtype,
)

base_tokenizer = get_chat_template(
    base_tokenizer,
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
    chat_template="chatml",
)

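# Create evaluator. The evaluator reads `.model` and `.tokenizer` from the object
# it receives, so wrap the base model and its tokenizer in a simple container.
from types import SimpleNamespace

base_evaluator = Llama_model_evaluator(
    trainer=SimpleNamespace(model=base_model, tokenizer=base_tokenizer)
)

# Quick sanity check on a few examples before running the full evaluation
test_metrics = base_evaluator.evaluate(test_dataset=dataset['test'], num_samples=10)

if test_metrics['valid_samples'] > 0: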
    print("\n" + "="*50)
    print("Starting Full Dataset Evaluation".center(50))
    print("="*50)

    full_metrics = base_evaluator.evaluate(
        test_dataset=dataset['test'],
        num_samples=None
    )

    print("\n" + "="*50)
    print("Full Dataset Results Summary".center(50))
    print("="*50 + "\n")

    # Processing Statistics
    print("📊 Processing Statistics:")
    print(f"   • Total Samples: {full_metrics['processed_samples']}")
    print(f"   • Valid Predictions: {full_metrics['valid_samples']}")
    print(f"   • Failed Predictions: {full_metrics['failed_samples']}")

    # Model Performance
    print("\n🎯 Model Performance:")
    print(f"   • Accuracy: {full_metrics['accuracy']:.2f}%")
    print(f"   • Precision: {full_metrics['precision']:.2f}%")
    print(f"   • Recall: {full_metrics['recall']:.2f}%")
    print(f"   • F1 Score: {full_metrics['f1_score']:.2f}%")
    print(f"   • Specificity: {full_metrics['specificity']:.2f}%")

    print("\n" + "="*50)

Loading base model...
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Starting base model evaluation...


100%|██████████| 300/300 [04:46<00:00,  1.05it/s]


          Base Model Evaluation Results           

📊 Processing Statistics:
   • Total Samples: 300
   • Valid Predictions: 222
   • Failed Predictions: 78

🎯 Performance Metrics:
   • Accuracy: 90.99%
   • Precision: 91.03%
   • Recall: 90.99%
   • F1 Score: 90.94%
   • Specificity: 94.57%

📉 Confusion Matrix:
             Predicted
             Neg  Pos
Actual Neg  122  7   
      Pos  13   80  
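
The reported metrics can be reproduced from the confusion matrix alone, which also makes clear how each one is defined. The sketch below is plain Python and uses the base-model matrix printed above; `metrics_from_confusion` is a hypothetical helper that mirrors the support-weighted averages the evaluator reads from the classification report, and it assumes no row or column of the matrix is empty.

```python
def metrics_from_confusion(cm):
    """Support-weighted metrics (in percent) from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    support_neg, support_pos = tn + fp, fn + tp

    precision_neg, precision_pos = tn / (tn + fn), tp / (tp + fp)
    recall_neg, recall_pos = tn / support_neg, tp / support_pos
    f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
    f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

    def weighted(neg_value, pos_value):
        # Average of the two per-class values, weighted by class support
        return (neg_value * support_neg + pos_value * support_pos) / total

    return {
        "accuracy": 100 * (tn + tp) / total,
        "precision": 100 * weighted(precision_neg, precision_pos),
        "recall": 100 * weighted(recall_neg, recall_pos),
        "f1_score": 100 * weighted(f1_neg, f1_pos),
        "specificity": 100 * recall_neg,  # TN / (TN + FP)
    }


base_cm = [[122, 7], [13, 80]]  # Base-model confusion matrix from the output above
base_metrics = metrics_from_confusion(base_cm)
for name, value in base_metrics.items():
    print(f"{name}: {value:.2f}%")

# The metrics cover only the 222 parseable answers; counting the 78 failed
# predictions as errors gives the accuracy over all 300 test reviews.
correct = base_cm[0][0] + base_cm[1][1]
accuracy_all_samples = 100 * correct / 300
print(f"accuracy over all samples: {accuracy_all_samples:.2f}%")
```

The per-metric values match the summary above (90.99%, 91.03%, 90.99%, 90.94%, and 94.57%). Over all 300 reviews, with failed predictions counted as wrong, the base model's accuracy works out to 67.33%.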



