# **Installations**

This section outlines the essential setup steps for working with the **Llama-3.1-8B-Instruct** model. It covers installing the necessary packages, authenticating with Hugging Face, and downloading the model weights.



1. **Install Required Packages**
   - `bitsandbytes` for efficient 8-bit optimizations.
   - `transformers`, `accelerate`, and `peft` for model loading, training acceleration, and parameter-efficient fine-tuning.
   - `python-dotenv` for managing environment variables.
   - `einops`, `scikit-learn`, and `scipy` for tensor operations, machine learning utilities, and scientific computing.
   - `matplotlib` for data visualization.
   - `tabulate` for neatly formatted textual tables.

In [None]:
%pip install -U bitsandbytes
%pip install -U transformers accelerate peft
%pip install python-dotenv
%pip install einops scikit-learn scipy
%pip install matplotlib
%pip install tabulate

2. **Hugging Face Authentication**  
   Authenticate with **Hugging Face** using an access token. The token grants access to the gated Llama model weights.

In [None]:
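# Hugging Face login
import os
from dotenv import load_dotenv
from huggingface_hub import login

# Read the token from the environment (HF_TOKEN is an assumed variable name, e.g. set in a local .env file)
load_dotenv()
hf_token = os.environ["HF_TOKEN"]
login(token=hf_token)  # Authenticate with Hugging Face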

3. **Download the Llama Model Weights**  
   The Hugging Face CLI is used to download the **Llama-3.1-8B-Instruct** model weights into a local directory while excluding unnecessary paths (`original/*`).

In [None]:
# Download Llama model weights
!huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir Llama-3.1-8B-Instruct --exclude "original/*"

# **Preprocessing**

This section defines the **DataProcessor** class, which preprocesses EEG data through normalization, segmentation, and quantization.

1. **Loading EEG Data**  
   `load_subject_data` reads `.dat` EEG files and extracts the data and labels.

    <div style="text-align: left;">
     <img src="Images/data_preprocessed_python_description.png" width="70%">
    </div>

    In addition to the 32 EEG channels, the dataset includes 8 channels that contain the following physiological signals:

    - **EOG (Electrooculography):** Horizontal (hEOG₁, hEOG₂) and Vertical (vEOG₁, vEOG₂).
    - **EMG (Electromyography):** Zygomaticus Major EMG (zEMG₁, zEMG₂) and Trapezius EMG (tEMG₁, tEMG₂).
    - **GSR (Galvanic Skin Response):** Values converted from Twente to Geneva format (ohms).
    - **Respiration Belt, Plethysmograph, and Temperature Sensors**.

    These 8 channels are excluded from the input data.

    The following preprocessing steps were applied to the EEG data:

    1. Downsampled to **128 Hz**.
    2. EOG artifacts removed as described in **[1]**.
    3. Bandpass frequency filter applied (**4.0 - 45.0 Hz**).
    4. Averaged to a **common reference**.
    5. EEG channels reordered to match the **Geneva order**.
    6. Segmented into **60-second trials**, removing the **3-second pre-trial baseline**.
    7. Trials reordered from **presentation order** to **video (Experiment_id) order**.


2. **Z-score Normalization**  
   `zscore_normalize` standardizes each EEG channel over all trials and time samples of a subject, giving every channel zero mean and unit variance. The transformation is given by:
   
   <div style="text-align: left;">
    <img src="Images/zscore.png" width="50%">
   </div>

   This normalization helps to make EEG signals comparable across different subjects and trials.

3. **Quantization Bins and Labels**  
   `analyze_distribution` flattens the EEG data, computes `num_bins + 1` evenly spaced percentile edges between the 5th and 95th percentiles, and assigns one label per bin. Labels can be **binary** (e.g., `000`, `001`, `010`) or **symbolic** (`A`, `B`, `C`).

   <div style="text-align: left;">
    <img src="Images/EEG Data Quantization Bins.png" width="50%">
   </div>


4. **EEG Segmentation**  
   `segment_eeg_data` splits each EEG trial into overlapping windows of `window_size` samples, advancing by a step of `int(window_size * (1 - overlap))` samples.
    
    <div style="text-align: left;">
     <img src="Images/EEG Segmentation With Overlapping Windows.png" width="50%">
    </div>

5. **Signal Quantization**  
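   `quantize_signal` flattens a segment, maps each EEG value to the label of its quantization bin (values outside the outer bin edges are clipped to the first or last bin), and joins the labels into a space-separated sequence.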

    <div style="text-align: left;">
     <img src="Images/EEG Quantization With Binary Encoding.png" width="50%">
    </div>

6. **Full Subject Preprocessing**  
   `preprocess_subject` applies the pipeline:
   - Extracts the first 32 EEG channels.
   - Normalizes EEG data (z-score).
   - Normalizes valence/arousal labels to \([0,1]\).
   - Computes quantization bins.
   - Segments and quantizes each trial.
   - Replicates labels for each segment.

7. **Processing the DEAP Dataset**  
   `preprocess_deap_data` iterates over `.dat` files, applies preprocessing, and saves sequences and labels as NumPy arrays.
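
The normalization and quantization steps above can be sketched on synthetic data. This is a minimal illustration rather than the `DataProcessor` code itself: the array shape `(4, 3, 256)`, the random seed, and the variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for one subject: (trials, channels, time); sizes are illustrative only
eeg = rng.normal(loc=5.0, scale=3.0, size=(4, 3, 256))

# Per-channel z-score over all trials and time samples (axes 0 and 2)
mean = eeg.mean(axis=(0, 2), keepdims=True)
std = eeg.std(axis=(0, 2), keepdims=True)
z = (eeg - mean) / (std + 1e-7)

# num_bins + 1 percentile edges between the 5th and 95th percentiles
num_bins = 8
edges = np.percentile(z.ravel(), np.linspace(5, 95, num_bins + 1))

# 3-bit binary labels for 8 bins: '000', '001', ..., '111'
width = len(bin(num_bins - 1)[2:])
labels = [format(i, f"0{width}b") for i in range(num_bins)]

# Map values to bin indices; values outside the edges are clipped into the first/last bin
segment = z[0, :, :4]  # a tiny (channels, time) slice
idx = np.clip(np.digitize(segment.ravel(), edges) - 1, 0, num_bins - 1)
text = " ".join(labels[i] for i in idx)
print(text)
```

The printed string has one label per value of the flattened slice, which is the same layout `quantize_signal` produces for a full segment.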


In [None]:
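# Utility class for EEG preprocessing and quantization
# (uses os, json, pickle and numpy as np, which are imported in the Imports and Directories section)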
class DataProcessor:

    def __init__(self, preprocessed_output_dir, num_bins, bin_encoding, window_size, overlap):
        
        """
        Handles subject-level EEG loading, normalization, segmentation, and quantization.

        Args:
            preprocessed_output_dir (str): Directory to save preprocessed data.
            num_bins (int): Number of bins for quantization.
            bin_encoding (str): Encoding method ('binary' or 'symbolic').
            window_size (int): Number of samples per segment.
            overlap (float): Fraction of overlap between windows.
        """
        self.preprocessed_output_dir = preprocessed_output_dir
        self.num_bins = num_bins
        self.bin_encoding = bin_encoding
        self.window_size = window_size
        self.overlap = overlap
        self.bins = None
        self.labels = None



    def load_subject_data(self, file_path):

        """
        Load EEG data and corresponding labels from a .dat file.

        Args:
            file_path (str): Path to the .dat file.

        Returns:
            tuple: EEG data and labels as numpy arrays.
        """

        print(f"Loading data from {file_path}...")
        with open(file_path, 'rb') as f:
            subject_data = pickle.load(f, encoding='latin1')
            print("Data loaded successfully.")
            return subject_data['data'], subject_data['labels']



    def zscore_normalize(self, eeg_data):
        
        """
        Perform z-score normalization per channel, across all trials and time samples of one subject.
        eeg_data shape: (num_trials, num_eeg_channels, time) -> (40, 32, 8064).

        Args:
            eeg_data (np.ndarray): EEG data to be normalized.

        Returns:
            np.ndarray: Z-score normalized EEG data.
        """

        # shape: (trial, channel, time)
        mean_vals = np.mean(eeg_data, axis=(0,2), keepdims=True)
        std_vals = np.std(eeg_data, axis=(0,2), keepdims=True)
        eeg_data = (eeg_data - mean_vals) / (std_vals + 1e-7)
        return eeg_data    
    


    def analyze_distribution(self, eeg_data):
        
        """
        Analyze EEG amplitude distribution and define quantization bins.

        Args:
            eeg_data (np.ndarray): EEG data of shape (num_trials, 32, time_steps).

        Returns:
            None. Computes the quantization bins and updates self.bins and self.labels in place.
        """
        
        flattened_data = eeg_data.flatten()
        # Compute percentiles from 5th to 95th to avoid outliers
        percentiles = np.linspace(5, 95, self.num_bins + 1)
        self.bins = np.percentile(flattened_data, percentiles)

        # Assign labels (binary or symbolic)
        if self.bin_encoding == "binary":
            # e.g. 3-bit if num_bins=8 => '000', '001', '010', ...
            self.labels = [
                format(i, f'0{len(bin(self.num_bins - 1)[2:])}b')
                for i in range(self.num_bins)
            ]
        else:
            # e.g. A, B, C, ...
            self.labels = [chr(65 + i) for i in range(self.num_bins)]

        print(f"Quantization Bins: {self.bins}")
        print(f"Assigned Labels: {self.labels}")


    
    def segment_eeg_data(self, eeg_data):

        """
        Segment EEG data into overlapping windows.

        Args:
            eeg_data (np.ndarray): EEG data of shape (32, 8064).

        Returns:
            np.ndarray: Segmented EEG data of shape (num_segments, 32, window_size).
        """

        step = int(self.window_size * (1 - self.overlap))
        num_windows = (eeg_data.shape[1] - self.window_size) // step + 1
        print(f"Segmenting EEG data into {num_windows} windows...")
        segments = [
            eeg_data[:, i * step:i * step + self.window_size]
            for i in range(num_windows)
        ]
        print("Segmentation complete.")
        return np.stack(segments, axis=0)

    

    def quantize_signal(self, signal):

        """
        Convert an EEG signal into a space-separated quantized representation.

        Args:
            signal (np.ndarray): Single EEG segment of shape (32, window_size).

        Returns:
            str: Space-separated quantized representation.
        """

        if self.bins is None:
            raise ValueError("Bins not initialized. Run analyze_distribution() first.")
        
        # Flatten the 32 channels for that segment
        flat = signal.flatten()
        quantized_indices = np.digitize(flat, self.bins, right=False) - 1
        quantized_indices = np.clip(quantized_indices, 0, len(self.labels) - 1)
        return ' '.join(self.labels[i] for i in quantized_indices)
    
    
    
    def preprocess_subject(self, subject_file):

        """
        Preprocess a single subject's EEG data: z-score, segment, quantize, normalize labels.

        Args:
            subject_file (str): Path to the subject's .dat file.

        Returns:
            tuple: z-scored, segmented, quantized EEG data and normalized labels.
        """

        print(f"Preprocessing data for {subject_file}...")
        eeg_data, labels = self.load_subject_data(subject_file)
    
        # eeg_data => (40, 40, 8064) video/trial x channel x data, 
        # labels => (40, 4) video/trial x label (valence, arousal, dominance, liking) 

        # We only need the first 32 channels, 
        # because the remaining 8 are other physiological data, so:

        # 1) Keep only the first 32 channels and time dimension
        eeg_data = eeg_data[:, :32, :]  

        # 2) Z-score per subject
        eeg_data = self.zscore_normalize(eeg_data)

        # 3) Valence & arousal only => columns 0 & 1, normalizing from [1,9] to [0,1]
        labels = labels[:, :2]  
        labels = (labels - 1) / 8

        # 4) Compute quantization bins based on the entire subject’s EEG distribution
        #    (Now that it’s z-scored).
        self.analyze_distribution(eeg_data)

        all_sequences = []
        all_labels = []

        for trial_idx, trial_data in enumerate(eeg_data):
            segments = self.segment_eeg_data(trial_data)
            # Quantize each segment
            quantized_segments = [self.quantize_signal(seg) for seg in segments]

            all_sequences.extend(quantized_segments)

            # Duplicate this trial's valence/arousal label for each segment
            trial_labels = np.tile(labels[trial_idx], (len(quantized_segments), 1))
            all_labels.append(trial_labels)

        sequences = np.array(all_sequences, dtype=object)
        labels = np.concatenate(all_labels, axis=0)

        # For debugging
        print(f"Preprocessed data dimensions => Sequences: {sequences.shape}, Labels: {labels.shape}")
        return sequences, labels



    def save_hyperparameters(self):
        hyperparams = {
            "num_bins": self.num_bins,
            "bin_encoding": self.bin_encoding,
            "window_size": self.window_size,
            "overlap": self.overlap
        }
        hyperparams_path = os.path.join(self.preprocessed_output_dir, "preprocessing_hyperparameters.json")
        with open(hyperparams_path, 'w') as f:
            json.dump(hyperparams, f, indent=4)
        print(f"Saved preprocessing hyperparameters to {hyperparams_path}")

        

    def preprocess_deap_data(self, data_path):

        """
        Preprocess all subjects' data in the DEAP dataset.

        Args:
            data_path (str): Path to the folder containing .dat files.

        Returns:
            None. For each subject, saves a sequences array of shape (num_segments,)
            (one quantized string per segment) and a labels array of shape (num_segments, 2).
        """

        os.makedirs(self.preprocessed_output_dir, exist_ok=True)

        self.save_hyperparameters()

        for subject_file in os.listdir(data_path):
            if subject_file.endswith(".dat"):
                print(f"Processing {subject_file}...")
                subject_path = os.path.join(data_path, subject_file)
                sequences, labels = self.preprocess_subject(subject_path)

                # Overwrite existing files without checking
                np.save(os.path.join(self.preprocessed_output_dir, f"{subject_file}_sequences.npy"), sequences)
                np.save(os.path.join(self.preprocessed_output_dir, f"{subject_file}_labels.npy"), labels)
                print(f"Saved preprocessed data for {subject_file}.")


In [None]:
# Preprocess data
# (data_path and preprocessed_output_dir are defined in the Imports and Directories section)

num_bins = 8
bin_encoding = "binary"
window_size = 1024
overlap = 0.1

processor = DataProcessor(preprocessed_output_dir, num_bins, bin_encoding, window_size, overlap)
processor.preprocess_deap_data(data_path)
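
As a worked check of the settings above, the number of windows per trial follows directly from the formula in `segment_eeg_data`. The sketch below only repeats that arithmetic; the helper name `count_windows` is hypothetical.

```python
def count_windows(num_samples: int, window_size: int, overlap: float) -> tuple[int, int]:
    """Return (step, num_windows) using the same formula as segment_eeg_data."""
    step = int(window_size * (1 - overlap))
    if step <= 0:
        raise ValueError("overlap must be < 1 so that the step is at least one sample")
    num_windows = (num_samples - window_size) // step + 1
    return step, num_windows

step, num_windows = count_windows(num_samples=8064, window_size=1024, overlap=0.1)
print(step, num_windows)   # step in samples, windows per trial
print(40 * num_windows)    # segments per subject (40 trials)
```

With these settings the step is 921 samples and each trial yields 8 windows, so one subject contributes 320 segments. The last window ends at sample 7471, which means the final 593 samples of each trial are not covered by any window.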

# **Imports and Directories**

This section configures GPU settings, ensures reproducibility, and sets up paths for dataset preprocessing and model usage.

In [1]:
import torch

num_gpus = torch.cuda.device_count()  # Get the number of available GPUs
print(f"Number of GPUs: {num_gpus}")

for i in range(num_gpus):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")


Number of GPUs: 4
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
GPU 2: NVIDIA RTX A6000
GPU 3: NVIDIA RTX A6000


**GPU Detection & Selection**  
   - Identifies available GPUs and selects a specific one for computations.  
   - Retrieves and displays GPU specifications, including memory, multiprocessors, and compute capability.
   - Sets the selected GPU as the active device. 

In [None]:
import os
import torch

gpu_id = 3  # Index of the GPU to use
num_gpus = torch.cuda.device_count()

if gpu_id >= num_gpus:
    raise ValueError(f"Invalid GPU ID {gpu_id}. Only {num_gpus} GPUs are available.")

# CUDA_VISIBLE_DEVICES is not set here: it only takes effect before CUDA is initialized,
# and it would renumber the selected GPU to cuda:0, invalidating the index used below.
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
torch.cuda.set_device(gpu_id)

print("Using device:", device)

# Print CUDA device information
print("Using GPU:", torch.cuda.get_device_name(gpu_id))
print("Device Count:", torch.cuda.device_count())
print("Current Device ID:", torch.cuda.current_device())
print("CUDA is Available:", torch.cuda.is_available())

# Get device properties
device_props = torch.cuda.get_device_properties(gpu_id)
print("\n GPU Specifications:")
print(f"   - Name: {device_props.name}")
print(f"   - Total Memory: {device_props.total_memory / 1e9:.2f} GB")
print(f"   - Multiprocessors: {device_props.multi_processor_count}")
print(f"   - Compute Capability: {device_props.major}.{device_props.minor}")
print(f"   - Max Threads per Multiprocessor: {device_props.max_threads_per_multi_processor}")

Using device: cuda:3
Using GPU: NVIDIA RTX A6000
Device Count: 4
Current Device ID: 3
CUDA is Available: True

 GPU Specifications:
   - Name: NVIDIA RTX A6000
   - Total Memory: 51.03 GB
   - Multiprocessors: 84
   - Compute Capability: 8.6
   - Max Threads per Multiprocessor: 1536


In [3]:
from torch.utils.data import Dataset, DataLoader, Subset
import numpy as np
import pickle
import json

**Reproducibility**  
A fixed PyTorch seed is set so that random operations, such as the dataset split, are repeatable across runs.

In [4]:
# Set reproducibility seed
seed = 42
torch.manual_seed(seed)

<torch._C.Generator at 0x7f237c1e39b0>
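
`torch.manual_seed` seeds PyTorch's random number generators only. If NumPy or Python's `random` module are also used for random operations, they need their own seeds. A minimal sketch, assuming a hypothetical helper `set_seed` that is not part of this notebook:

```python
import random
import numpy as np

def set_seed(seed: int) -> None:
    """Seed the Python, NumPy and (if installed) PyTorch RNGs. Hypothetical helper."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)  # seeds PyTorch's generators
    except ImportError:
        pass  # PyTorch not installed; only Python and NumPy are seeded

set_seed(42)
first = (random.random(), np.random.rand())
set_seed(42)
second = (random.random(), np.random.rand())
print(first == second)
```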

**Directories**  
Modify the paths to match your project folder layout.

In [5]:
# Paths
project_path = "./"
data_path = "../../data/DEAP_Dataset/data_preprocessed_python"
preprocessed_output_dir = os.path.join(project_path, "DEAP_preprocessed")
model_path = "Llama-3.1-8B-Instruct"

# **Dataset**

This section describes the structure of the **DEAP dataset**, the preprocessing steps applied to EEG data, and how the data is formatted for machine learning. The EEG signals are transformed into quantized text sequences, while valence and arousal labels are discretized into categorical classes for affective computing tasks.


**EEG Data Representation**

The DEAP dataset consists of EEG signals recorded at $128$ Hz across $32$ channels for each trial. Each participant undergoes $40$ trials, resulting in a raw EEG data matrix:

$$
\mathbf{X}_{\text{raw}} \in \mathbb{R}^{n_{\text{trials}} \times 32 \times 8064}
$$

where:
- $n_{\text{trials}}$ is the total number of trials across all participants,
- $32$ represents the EEG channels,
- $8064 = 128 \cdot (60 + 3)$ is the number of samples per trial (60-second recording plus 3-second baseline, sampled at $128$ Hz).

Each trial is associated with an affective label vector:

$$
\mathbf{y} \in \mathbb{R}^{n_{\text{trials}} \times 4}
$$

where the four elements represent **valence, arousal, dominance, and liking**, rated on a scale from 1 to 9.



<div style="text-align: left;">
 <img src="Images/experiment_participant_setup.png" width="50%">
</div>

   <div style="text-align: left;">
    <img src="Images/video_examples.png" width="50%">
   </div>

In [6]:
def load_all_preprocessed_subjects(preprocessed_output_dir, max_subjects=None):
    """
    Reads each file ending with "_sequences.npy" in `preprocessed_output_dir`,
    and finds the corresponding "_labels.npy" file.
    
    If `max_subjects` is None, load ALL available subject files.
    Otherwise, load only the first `max_subjects` files (sorted alphabetically).

    Returns:
        all_sequences: (N,) array of quantized EEG text segments
        all_labels: (N, 2) array of valence, arousal
    """
    all_seq_files = sorted(
        f for f in os.listdir(preprocessed_output_dir) if f.endswith("_sequences.npy")
    )

    if max_subjects is not None:
        all_seq_files = all_seq_files[:max_subjects]

    all_sequences = []
    all_labels = []

    for seq_filename in all_seq_files:
        seq_path = os.path.join(preprocessed_output_dir, seq_filename)
        lab_path = seq_path.replace("_sequences.npy", "_labels.npy")
        
        if not os.path.exists(lab_path):
            print(f"Warning: Labels file not found for {seq_filename}")
            continue
        
        subject_sequences = np.load(seq_path, allow_pickle=True)
        subject_labels = np.load(lab_path, allow_pickle=True)

        all_sequences.append(subject_sequences)
        all_labels.append(subject_labels)

    if len(all_sequences) == 0:
        raise ValueError("No preprocessed subject files found in the directory.")

    all_sequences = np.concatenate(all_sequences, axis=0)
    all_labels = np.concatenate(all_labels, axis=0)

    print(f"Total loaded sequences: {all_sequences.shape}")
    print(f"Total loaded labels: {all_labels.shape}")
    return all_sequences, all_labels

**Preprocessed Data Format**

The preprocessed dataset extracts EEG features and represents each trial as a sequence of text-encoded EEG segments. The dataset is stored as:

- **EEG sequences:** $\mathbf{S}$, stored in `_sequences.npy`
- **Corresponding labels:** $\mathbf{L}$, stored in `_labels.npy`

After preprocessing, the dataset is structured as:

$$
\mathbf{S} = (s_1, \dots, s_N), \quad s_i \text{ a text string}
$$

$$
\mathbf{L} \in [0,1]^{N \times 2}
$$

where:
- $N$ is the number of extracted EEG segments across all trials.
- Each sequence in $\mathbf{S}$ represents a quantized text-encoded EEG segment.
- The label matrix $\mathbf{L}$ contains valence and arousal scores normalized between 0 and 1.
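
Because each sequence is a Python string, the sequences are stored as a NumPy object array, which is why `allow_pickle=True` is needed when loading them. A small round-trip sketch with made-up values (the file name prefix `s01.dat` and the toy strings are assumptions for illustration):

```python
import os
import tempfile
import numpy as np

# Two toy quantized segments and their normalized [valence, arousal] labels
sequences = np.array(["000 011 111", "101 100 001"], dtype=object)
labels = np.array([[0.75, 0.25], [0.5, 1.0]])

with tempfile.TemporaryDirectory() as tmp:
    seq_path = os.path.join(tmp, "s01.dat_sequences.npy")
    lab_path = seq_path.replace("_sequences.npy", "_labels.npy")
    np.save(seq_path, sequences)
    np.save(lab_path, labels)

    # Object arrays are pickled on disk, so loading requires allow_pickle=True
    loaded_seq = np.load(seq_path, allow_pickle=True)
    loaded_lab = np.load(lab_path)

print(loaded_seq.shape, loaded_lab.shape)
```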

**Label Discretization**

The continuous valence ($v$) and arousal ($a$) scores are discretized into two classes:

$$
C(x) =
\begin{cases}
0, & x \leq 5/9 \\
1, & x > 5/9
\end{cases}
$$

This results in a new label matrix:

$$
\mathbf{L'} \in \{0,1\}^{N \times 2}
$$

where each label is transformed into discrete categories for classification tasks.
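
As a worked example, ratings on the 1 to 9 scale are first mapped to $[0,1]$ with $(r - 1)/8$ (as in `preprocess_subject`) and then thresholded. The sketch uses the threshold $5/9$ found in `DEAPDataset._continuous_to_class`; the sample ratings are made up.

```python
import numpy as np

ratings = np.array([[1.0, 9.0], [5.0, 5.5], [7.2, 3.1]])  # made-up [valence, arousal] ratings

normalized = (ratings - 1) / 8  # map [1, 9] -> [0, 1]
threshold = 5 / 9               # threshold used in DEAPDataset._continuous_to_class
classes = (normalized > threshold).astype(np.int64)

print(normalized)
print(classes)
```

On the original scale, a normalized threshold of $5/9$ corresponds to a rating of about 5.44, so a rating of 5 falls in class 0 and a rating of 5.5 falls in class 1.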


In [7]:
class DEAPDataset(Dataset):

    def __init__(self, sequences, labels, debug=False):

        """
        sequences: array/list of text strings (quantized EEG), one per segment
        labels: shape [num_segments, 2] => valence, arousal
        debug: print sample info for debugging
        """
        
        self.sequences = sequences
        self.debug = debug

        # Convert each [val, aro] from [0..1] to discrete {0,1}
        discrete = []
        for (v, a) in labels:
            v_class = self._continuous_to_class(v)
            a_class = self._continuous_to_class(a)
            discrete.append([v_class, a_class])
        self.labels = np.array(discrete, dtype=np.int64)


    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):

        """
        Return the raw text segment and label (no tokenization here).
        """

        text_segment = self.sequences[idx]
        label = self.labels[idx] # shape: (2,) => [val_class, aro_class]

        if self.debug and idx < 1:
            print(f"Example sequence: {self.sequences[0]}")
            print(f"Example label: {self.labels[0]}  (valence_class, arousal_class)")
        # Return the raw text and label as a tuple
        return text_segment, label

    def _continuous_to_class(self, value):
        if value <= 5/9:
            return 0
        else:
            return 1

**Dataset Construction**

A PyTorch `Dataset` class organizes the sequences and labels, enabling:
- Efficient access to EEG text segments and labels.
- On-the-fly retrieval of raw EEG sequences and corresponding valence/arousal categories.
- Debugging mode for inspecting sample data.

In [8]:
# Decide how many subject files to load
# e.g. set `max_subjects=2` to load only 2 subject files, or None for all
max_subjects = None  # or an integer such as 2

# Load preprocessed (optionally limited) subject files
sequences, labels = load_all_preprocessed_subjects(
    preprocessed_output_dir,
    max_subjects=max_subjects
)
dataset = DEAPDataset(sequences, labels)

Total loaded sequences: (10240,)
Total loaded labels: (10240, 2)


# **DataLoader**

This section defines the **collate function** used in the `DataLoader`. The function **tokenizes text segments in batches**, ensuring consistent input formatting for the Llama model. It also converts the labels to a tensor and moves all tensors to the specified device.


In [9]:
def dynamic_tokenize_collate_fn(tokenizer, max_length, device, debug=False):

    """
    Returns a function that can be used as collate_fn in the PyTorch DataLoader.
    The returned function tokenizes the raw text segments in batch.
    """

    def collate_fn(batch):

        """
        batch: list of (text_segment, label) tuples
        """
        
        # Separate text and labels
        text_segments = [item[0] for item in batch]
        labels = [item[1] for item in batch]  # shape: [val_class, aro_class]

        if debug and len(text_segments) > 0:
            print(f"\n[CollateFn] Example text: {text_segments[0]}")

        # Tokenize in batch
        encoded = tokenizer(
            text_segments,
            truncation=True,
            padding="max_length",
            max_length=max_length,
            return_tensors="pt"
        )

        # Extract tensors
        input_ids = encoded["input_ids"]
        attention_mask = encoded["attention_mask"]
        
        # Convert labels to tensor from numpy arrays
        labels_array = np.array(labels)
        labels_tensor = torch.from_numpy(labels_array).long()

        # Optional: move to GPU here
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        labels_tensor = labels_tensor.to(device)

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels_tensor
        }
    return collate_fn
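
To make the effect of `truncation=True` and `padding="max_length"` concrete, the sketch below mimics those two options with a toy whitespace tokenizer. It is not the Llama tokenizer: `toy_encode`, the vocabulary, and the pad id `0` are assumptions for illustration, and real token counts depend on the tokenizer.

```python
def toy_encode(text: str, vocab: dict[str, int], max_length: int, pad_id: int = 0):
    """Whitespace-tokenize, truncate to max_length, then right-pad to max_length."""
    ids = [vocab[tok] for tok in text.split()][:max_length]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad

vocab = {format(i, "03b"): i + 1 for i in range(8)}  # '000' -> 1, ..., '111' -> 8
short_ids, short_mask = toy_encode("000 111 010", vocab, max_length=5)
long_ids, long_mask = toy_encode("001 " * 10, vocab, max_length=5)

print(short_ids, short_mask)  # padded: mask marks the padding with 0
print(long_ids, long_mask)    # truncated: only the first max_length tokens survive
```

Each quantized segment holds 32 × `window_size` labels, so a sequence is expected to be far longer than a small `max_length`; in that case truncation keeps only the beginning of the segment.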


# **Model**


This section imports the **LlamaEmotionClassifier** from `model.py`, a transformer-based model designed for **emotion classification**. It uses a **fine-tuned Llama model** and a **fully connected classification head** to predict **valence** and **arousal** as binary categories (low, high).



In [10]:
# Import the model class definition from the model.py file
from model import LlamaEmotionClassifier


**Model Architecture**

The model consists of the following components:

1. **Llama Backbone**  
   - A **pre-trained Llama model** serves as the feature extractor.

   <div style="text-align: left;">
    <img src="Images/llama_architecture.png" width="50%">
   </div>

   - **LoRA Configuration**  
     - $r = 16$  
     - $\text{lora\_alpha} = 8$  
     - $\text{lora\_dropout} = 0.1$  
     - Applied to the following target modules: $q\_proj,\, k\_proj,\, v\_proj,\, o\_proj,\, gate\_proj,\, up\_proj$  

   <div style="text-align: left;">
    <img src="Images/LoRA.png" width="50%">
   </div>

   - **Partial Freezing**  
     During training, the Llama backbone is **frozen for the first 10 epochs**, then unfrozen thereafter to allow full fine-tuning.


2. **Mean + Max Pooling**  
   - The final hidden states are pooled in two ways—**mean-pooling** and **max-pooling**—to form a combined representation $Z$.

3. **Classification Head**  
   - A single **fully-connected (FC) layer** reduces $Z$ to **4 logits**—2 for **valence** and 2 for **arousal**.
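
For reference, the LoRA settings listed above correspond to a low-rank update $W' = W + \frac{\alpha}{r} BA$ with scaling $\alpha / r = 8 / 16 = 0.5$. The NumPy sketch below illustrates that update on a small random matrix; the matrix sizes and the initialization are assumptions, and it does not reproduce the code in `model.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 512, 512        # illustrative matrix sizes (assumption)
r, lora_alpha = 16, 8         # rank and alpha as listed in the LoRA configuration

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero at the start

scaling = lora_alpha / r
x = rng.normal(size=(d_in,))

# Adapted forward pass: frozen path plus scaled low-rank path
y = W @ x + scaling * (B @ (A @ x))

full_params = W.size
lora_params = A.size + B.size
print(scaling, full_params, lora_params)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` (a small fraction of the full matrix size here) would be trained.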

Mathematically, given an input sequence **$X$**, the model computes:

$$
H = \text{Llama}(X)
$$

where **$H \in \mathbb{R}^{B \times L \times d}$**, with:
- $B$: Batch size,
- $L$: Sequence length,
- $d$: Hidden state dimension.

- **Masked Mean-Pooling**  
     $$
     \overline{H} = \frac{\sum_{i=1}^{L} H_i \,\cdot\, M_i}{\sum_{i=1}^{L} M_i},
     $$
     where $M_i$ is the attention mask (1 for valid tokens, 0 for padding).

- **Max-Pooling**  
     $$
     H^{\max} = \max_{1 \le i \le L}(\,H_i \cdot M_i\,).
     $$

- **Concatenation**  
     $$
     Z = \bigl[\;\overline{H}\;\|\;H^{\max}\bigr],
     $$
     making $Z \in \mathbb{R}^{2d}$.

The final classification head projects $Z$ from $2d$ to **4 logits**—2 for **valence** and 2 for **arousal**.

   $$
   y = WZ + b,
   $$
   where $W \in \mathbb{R}^{4 \times 2d}$ and $b \in \mathbb{R}^{4}$.  
   - The first 2 logits correspond to **valence** (binary classes).  
   - The last 2 logits correspond to **arousal** (binary classes).
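
The pooling and projection steps can be sketched in NumPy as follows. This illustrates the formulas above on random data and is not the code in `model.py`; the sizes `B = 2`, `L = 4`, `d = 3` and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L, d = 2, 4, 3
H = rng.normal(size=(B, L, d))        # stand-in for the Llama hidden states
M = np.array([[1, 1, 1, 0],           # attention mask: 1 = token, 0 = padding
              [1, 1, 0, 0]], dtype=float)

Hm = H * M[:, :, None]                # zero out padded positions
mean_pool = Hm.sum(axis=1) / M.sum(axis=1, keepdims=True)  # masked mean, shape (B, d)
max_pool = Hm.max(axis=1)             # max over positions of H_i * M_i, shape (B, d)
Z = np.concatenate([mean_pool, max_pool], axis=1)          # shape (B, 2d)

W = rng.normal(size=(4, 2 * d))       # classification head weights
b = np.zeros(4)
logits = Z @ W.T + b                  # shape (B, 4)
val_logits, aro_logits = logits[:, :2], logits[:, 2:]
print(Z.shape, val_logits.shape, aro_logits.shape)
```

Multiplying by the mask sets padded positions to zero before the maximum is taken, so a padded position can win the maximum whenever all valid values in that dimension are negative.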

**Loss Function**

The model is trained using **categorical cross-entropy loss**:

$$
\mathcal{L} = \mathcal{L}_{val} + \mathcal{L}_{aro}
$$

where:

$$
\mathcal{L}_{val} = - \sum_{c=0}^{1} y_c^{val} \log(\hat{y}_c^{val})
$$

$$
\mathcal{L}_{aro} = - \sum_{c=0}^{1} y_c^{aro} \log(\hat{y}_c^{aro})
$$

- **$y_c$** is the true label (one-hot encoded),
- **$\hat{y}_c$** is the predicted probability after softmax over the 2 valence or 2 arousal logits.
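
A small numeric sketch of this loss for a single sample, written in NumPy with made-up logits. The helper `softmax_ce` is hypothetical; in PyTorch, `torch.nn.functional.cross_entropy` computes the same per-sample quantity from logits and integer class targets.

```python
import numpy as np

def softmax_ce(logits: np.ndarray, target: int) -> float:
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target])

logits = np.array([2.0, 0.0, -1.0, 1.0])  # made-up [val_0, val_1, aro_0, aro_1]
y_val, y_aro = 0, 1                       # made-up true classes

loss_val = softmax_ce(logits[:2], y_val)
loss_aro = softmax_ce(logits[2:], y_aro)
loss = loss_val + loss_aro
print(loss_val, loss_aro, loss)
```

With a one-hot target, the sum over classes in each term reduces to the negative log-probability of the true class, which is what the helper returns.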


In [11]:
# Initialize the Llama emotion classifier
llama_classifier = LlamaEmotionClassifier(model_path=model_path, device=device, train_folder="Trainings").to(device)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading model on device: cuda:3


Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.47s/it]


LlamaEmotionClassifier initialized.


# **Training**

This section defines the **hyperparameters** for model training, splits the dataset into **train, validation, and test sets**, sets up the **data loaders**, and trains the `LlamaEmotionClassifier` model.


In [None]:
# Hyperparameters

hparams = {
    "epochs": 20,
    "batch_size": 8,
    "learning_rate": 1e-3,
    "train_split": 0.67,   
    "val_split": 0.13,
    "max_length": 128
}

# Print Hyperparameters for verification
print("hparams:")
for key, value in hparams.items():
    print(f"{key}: {value}")

hparams:
epochs: 20
batch_size: 32
learning_rate: 0.001
train_split: 0.67
val_split: 0.13
max_length: 128


In [13]:
# Train/val/test split
total_len = len(dataset)
train_len = int(hparams["train_split"] * total_len)
val_len = int(hparams["val_split"] * total_len)
test_len = total_len - (train_len + val_len)

indices = torch.randperm(total_len).tolist()  # Fixed permutation
train_indices = indices[:train_len]
val_indices = indices[train_len:train_len+val_len]
test_indices = indices[train_len+val_len:]

index_dict = {
        "train_indices": train_indices,
        "val_indices": val_indices,
        "test_indices": test_indices
}

# Create dataset subsets
train_dataset = Subset(dataset, train_indices)
val_dataset = Subset(dataset, val_indices)
test_dataset = Subset(dataset, test_indices)

print(f"Dataset splits => train: {len(train_dataset)}, val: {len(val_dataset)}, test: {len(test_dataset)}")

# DataLoaders

collate_fn = dynamic_tokenize_collate_fn(
    tokenizer=llama_classifier.tokenizer,
    max_length=hparams["max_length"],
    device=device,
    debug=False
)

train_loader = DataLoader(train_dataset, batch_size=hparams["batch_size"], shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=hparams["batch_size"], shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_dataset, batch_size=hparams["batch_size"], shuffle=False, collate_fn=collate_fn)

Dataset splits => train: 6860, val: 1331, test: 2049


**Training Procedure**

The model is fine-tuned using **AdamW optimization** with a **cross-entropy loss** objective. The training loop includes:

1. **Batch Processing**  
   - Tokenization using **Llama's tokenizer**.
   - Padded sequences are masked to **ignore padding in pooling**.

2. **Forward Pass**  
   - Llama extracts **hidden states** from input sequences.
   - The **pooled hidden state** is passed through the classification head.

3. **Loss Computation & Backpropagation**  
   - The **cross-entropy loss** is computed for valence and arousal.
   - Gradients are computed and updated using **AdamW**.

4. **Validation & Accuracy Tracking**  
   - **Valence and arousal accuracy** are measured separately.
   - **Overall accuracy** considers both dimensions jointly.

The **overall accuracy** is defined as:

$$
\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i^{val} = y_i^{val} \land \hat{y}_i^{aro} = y_i^{aro})
$$

where:
- **$N$** is the number of samples,
- **$\mathbb{1}$** is an indicator function that returns 1 if both predictions match the true labels.
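
The three accuracy figures can be computed from predicted and true class pairs as in this NumPy sketch (the arrays are made-up examples):

```python
import numpy as np

# Columns: [valence_class, arousal_class]; values are made up
y_true = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y_pred = np.array([[0, 1], [1, 0], [0, 0], [0, 0]])

correct = y_pred == y_true                # per-dimension correctness, shape (N, 2)
val_acc = correct[:, 0].mean()
aro_acc = correct[:, 1].mean()
overall_acc = correct.all(axis=1).mean()  # both dimensions must match
print(val_acc, aro_acc, overall_acc)
```

Since a sample counts toward the overall accuracy only when both predictions are right, the overall figure can never exceed the smaller of the two per-dimension accuracies.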

**Model Deployment**

After training, the model is saved in the experiment folder, along with:
- Model weights and architecture.
- Training curves for **loss** and **accuracy**.
- A saved notebook for reproducibility.

In [14]:
# Train the model
llama_classifier.train_model(train_loader, val_loader, hparams, index_dict)

Starting training with cross-entropy classification...
Experiment folder created: Trainings/20250328_201357
Hyperparameters saved to: Trainings/20250328_201357/hyperparams.json


Epoch 1/20 [TRAIN]:   0%|          | 0/215 [00:00<?, ?it/s]


OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 3 has a total capacity of 47.53 GiB of which 200.88 MiB is free. Including non-PyTorch memory, this process has 47.33 GiB memory in use. Of the allocated memory 43.64 GiB is allocated by PyTorch, and 3.37 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
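
The run above stops with a CUDA out-of-memory error before the first training step completes. The error message itself suggests `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`; the sketch below shows one way to apply it from Python. The helper name is hypothetical, and the setting only targets fragmentation: since most of the memory here is actually allocated (43.64 GiB) rather than reserved but unallocated (3.37 GiB), lowering `hparams["batch_size"]` or `hparams["max_length"]` is likely to be needed as well.

```python
import os


def enable_expandable_segments() -> str:
    """Set the allocator option unless one is already configured; return the active value."""
    return os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


# Run this before CUDA is first used; the safest place is before `import torch`.
alloc_conf = enable_expandable_segments()
print(f"PYTORCH_CUDA_ALLOC_CONF={alloc_conf}")
```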

# **Testing**

This section **loads a trained model**, **evaluates it on the test set**, and **saves the results**. The trained model is retrieved from a saved experiment folder, and the **test dataset** is reconstructed to measure **valence, arousal, and overall classification accuracy**.


1. Load the Experiment Setup  
- The **experiment folder** containing model checkpoints and configurations is specified.
- The Python path is updated to allow importing the model class definition.

In [None]:
import sys
import os

# Specify experiment folder
experiment_folder = os.path.join(project_path, "Trainings/20250228_101952")  # Change this to the correct experiment name

sys.path.append(experiment_folder)

# Now import the class
from model_definition import LlamaEmotionClassifier

2. Load Dataset Indices & Hyperparameters  
- The **test set indices** are retrieved from `dataset_indices.json`, ensuring consistency with previous training/validation splits.
- Hyperparameters such as **batch size, sequence length**, and other settings are loaded from `hyperparams.json`.

3. Load the Pretrained Model  
- The model class (`LlamaEmotionClassifier`) is imported from the saved **model definition** file.
- A new instance of the classifier is initialized with the same **model path and device settings**.
- Trained **model weights** are loaded from `model_weights.pt`, ensuring the model is in the exact trained state.

In [None]:
import json

import torch

# Load dataset indices
index_path = os.path.join(experiment_folder, "dataset_indices.json")
with open(index_path, "r") as f:
    index_dict = json.load(f)

test_indices = index_dict["test_indices"]

# Load hyperparameters
hparams_path = os.path.join(experiment_folder, "hyperparams.json")
with open(hparams_path, "r") as f:
    hparams = json.load(f)

print(f"Hyperparameters loaded from {hparams_path}: {hparams}")

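# The class itself was imported from model_definition.py in the previous cell;
# the path is reported here only for traceability.
model_definition_path = os.path.join(experiment_folder, "model_definition.py")

print(f"Model class definition file: {model_definition_path}")

# Load trained model
model = LlamaEmotionClassifier(model_path=model_path, device=device, train_folder="Trainings")
weights_path = os.path.join(experiment_folder, "model_weights.pt")
# map_location keeps loading robust when the checkpoint was saved on a different GPU.
# strict=False tolerates key mismatches, so report them instead of ignoring them silently.
load_result = model.load_state_dict(torch.load(weights_path, map_location=device), strict=False)
print(f"Missing keys: {len(load_result.missing_keys)}, unexpected keys: {len(load_result.unexpected_keys)}")
model.to(device)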

print(f"Model weights loaded from {weights_path}")

4. Create the Test Dataset & DataLoader  
- The **test subset** is reconstructed using the previously saved test indices.
- A **data collate function** (`dynamic_tokenize_collate_fn`) processes test samples into tokenized tensors.
- The **test DataLoader** batches the dataset for efficient inference.

In [None]:
# Create test dataset
test_dataset = Subset(dataset, test_indices)

# Build the collate_fn, then create the test DataLoader
collate_fn = dynamic_tokenize_collate_fn(
    tokenizer=model.tokenizer,
    max_length=hparams["max_length"],
    device=device,
    debug=False
)

test_loader = DataLoader(test_dataset, batch_size=hparams["batch_size"], shuffle=False, collate_fn=collate_fn)

print(f"Test dataset loaded: {len(test_dataset)} samples")

5. Run Model Testing  
- The trained model is evaluated on the test set using **accuracy metrics**:
  - **Valence Accuracy**: Measures correct classification of valence levels (low, mid, high).
  - **Arousal Accuracy**: Measures correct classification of arousal levels.
  - **Overall Accuracy**: Counts instances where both valence and arousal predictions are correct.

6. Save Test Results  
- Test accuracy metrics are written to `test_results.txt` within the experiment folder.

7. Rename the Experiment Folder with Results  
- The experiment folder is renamed to include **final test accuracies**, making it easy to track performance.

In [None]:
# Run the test and get the results
test_results = model.test_model(test_loader)

# Save the results to test_results.txt inside experiment_folder
results_file_path = os.path.join(experiment_folder, "test_results.txt")
with open(results_file_path, "w") as f:
    f.write(f"Valence Accuracy: {test_results['valence_accuracy']:.4f}\n")
    f.write(f"Arousal Accuracy: {test_results['arousal_accuracy']:.4f}\n")
    f.write(f"Overall Accuracy: {test_results['overall_accuracy']:.4f}\n")

print(f"Test results saved to {results_file_path}")

# Format the new folder name with test accuracy values
new_experiment_folder = f"{experiment_folder}_val={test_results['valence_accuracy']:.4f}_aro={test_results['arousal_accuracy']:.4f}_total={test_results['overall_accuracy']:.4f}"

# Rename the folder
os.rename(experiment_folder, new_experiment_folder)

print(f"Experiment folder renamed to: {new_experiment_folder}")
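
Because the accuracies are embedded in the folder name, finished experiments can be compared without opening any files. The sketch below parses names that follow the `_val=..._aro=..._total=...` pattern produced above and ranks them by overall accuracy. The helper name `parse_accuracies` and the example folder names, including their accuracy values, are made up for illustration.

```python
import re
from typing import Dict, Optional

# Matches the suffix written by the renaming step, e.g. "_val=0.5000_aro=0.2500_total=0.1250".
ACCURACY_SUFFIX = re.compile(
    r"_val=(?P<val>\d\.\d{4})_aro=(?P<aro>\d\.\d{4})_total=(?P<total>\d\.\d{4})$"
)


def parse_accuracies(folder_name: str) -> Optional[Dict[str, float]]:
    """Return the accuracies encoded in a folder name, or None if there are none."""
    match = ACCURACY_SUFFIX.search(folder_name)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


# Placeholder folder names; the numbers are not real results.
folders = [
    "Trainings/run_a_val=0.5000_aro=0.2500_total=0.1250",
    "Trainings/run_b_val=0.7500_aro=0.5000_total=0.3750",
    "Trainings/run_c",  # not yet tested, so it carries no suffix
]

tested = [name for name in folders if parse_accuracies(name) is not None]
ranked = sorted(tested, key=lambda name: parse_accuracies(name)["total"], reverse=True)
for name in ranked:
    print(name, parse_accuracies(name))
```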