# **LLM Model Preference Prediction**
## **Competition Notebook**

### **Overview**
This notebook addresses the **LLM Model Preference Prediction** challenge. For each prompt, the goal is to predict which of two LLM responses was preferred, or whether the comparison ended in a tie.

---

### **Key Objectives**
1. **Feature Engineering**:
   - Extract embeddings using a pretrained RoBERTa model.
   - Engineer additional features such as text lengths to enhance prediction performance.

2. **Model Training**:
   - Use a LightGBM model with a multiclass objective to classify preferences.
   - Train on the full training dataset, or hold out a validation split when a held-out performance estimate is needed.

3. **Submission Preparation**:
   - Generate predictions for the test set and create a `submission.csv` file in the required format.

---




## **Setup and Environment**
### **Purpose**
This section installs and imports the required libraries, checks whether a GPU is available to speed up embedding extraction, and fixes random seeds for reproducibility.

---

### **Key Steps**
1. **Install Dependencies**:
   - Use popular libraries like NumPy, Pandas, LightGBM, Transformers, and Scikit-learn.
2. **Set up Device**:
   - Check for GPU availability to optimize embedding extraction using RoBERTa.
3. **Random Seed Initialization**:
   - Fix random seeds for reproducibility across all operations.

---

### **Why This Matters**
- **Smooth Execution**: Ensures the code runs seamlessly without missing dependencies.
- **Performance Optimization**: Utilizes GPU for faster embedding generation.
- **Reproducibility**: Consistent results every time the notebook is executed.
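
The seed and device steps listed above could look like the following minimal sketch. The helper name `set_seed` and the seed value `42` are assumptions made for illustration, not names taken from the notebook. PyTorch is imported lazily so the sketch still runs where it is not installed.

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> str:
    """Seed the common RNGs and return the device name that would be used.

    `set_seed` is a hypothetical helper; the notebook's own setup cell may differ.
    """
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional: only needed for embedding extraction
    except ImportError:
        return "cpu"
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
        return "cuda"
    return "cpu"


device = set_seed(42)
first_draw = np.random.rand(3)

# Re-seeding reproduces the same random draws
set_seed(42)
second_draw = np.random.rand(3)
print(device, np.allclose(first_draw, second_draw))
```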



## **Data Preprocessing with RoBERTa Embeddings and Feature Engineering**

This section preprocesses the text data for the classification task. It uses the pretrained **RoBERTa** model, loaded through the Hugging Face Transformers library, to generate text embeddings, and adds a few numeric features to complement them.

### Key Steps:
1. **Extracting Text Embeddings**: 
   - Uses the `roberta-base` model to generate embeddings for the text inputs (`prompt`, `response_a`, `response_b`).
   - Each embedding is the last-hidden-layer vector of the first token (`<s>`, RoBERTa's counterpart to BERT's `[CLS]`).
   
2. **Feature Engineering**: 
   - Adds numeric features such as the lengths of the `prompt` and responses (`response_a`, `response_b`) to enrich the feature space.

3. **Preparing Target Labels**:
   - Converts the one-hot columns `winner_model_a`, `winner_model_b`, and `winner_tie` into a single integer class label:
     - `0`: Model A is preferred.
     - `1`: Model B is preferred.
     - `2`: Tie.

4. **Saving Processed Data**: 
   - Saves the engineered features and target labels as `.npy` files for efficient storage and model training.
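
The label conversion in step 3 can be sketched on a toy frame as follows. The three rows are made-up illustration data, and the sketch assumes exactly one of the three winner columns is `1` in every row. It uses `argmax` as a compact alternative to the mask-based assignment in the class below.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train.csv (illustrative values only)
toy = pd.DataFrame({
    "winner_model_a": [1, 0, 0],
    "winner_model_b": [0, 1, 0],
    "winner_tie":     [0, 0, 1],
})

label_columns = ["winner_model_a", "winner_model_b", "winner_tie"]

# The position of the 1 in each row is the class index: 0 = A, 1 = B, 2 = tie
labels = toy[label_columns].to_numpy().argmax(axis=1)
print(labels.tolist())  # [0, 1, 2]
```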

---

### Input Requirements:
- **Train Dataset (`train.csv`)**:
  - Columns:
    - `prompt`: The textual input or query.
    - `response_a`: Response generated by Model A.
    - `response_b`: Response generated by Model B.
    - `winner_model_a`: Binary indicator for Model A being the preferred choice (1 if true, 0 otherwise).
    - `winner_model_b`: Binary indicator for Model B being the preferred choice (1 if true, 0 otherwise).
    - `winner_tie`: Binary indicator for a tie between Model A and Model B (1 if true, 0 otherwise).
    
- **Test Dataset (`test.csv`)**:
  - Columns:
    - `prompt`: The textual input or query.
    - `response_a`: Response generated by Model A.
    - `response_b`: Response generated by Model B.

---

### Outputs:
- **Processed Features**:
  - `x_train.npy`: A NumPy array containing the combined features (embeddings + numeric features) for training data.
  - `x_test.npy`: A NumPy array containing the combined features for test data.
  
- **Target Labels**:
  - `y_train.npy`: A NumPy array containing the target labels for training data.
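
As a quick check on the shape of these arrays, the sketch below stacks random stand-ins for the three embedding blocks and the three length features. The random values are placeholders, not real embeddings. `roberta-base` has a hidden size of 768, so the combined width is 3 × 768 + 3 = 2307, consistent with the feature count LightGBM reports in the training log.

```python
import numpy as np

n_rows = 4
hidden_size = 768  # hidden size of roberta-base

rng = np.random.default_rng(0)

# Random placeholders with the same shapes as the real embedding blocks
prompt_emb = rng.normal(size=(n_rows, hidden_size))
response_a_emb = rng.normal(size=(n_rows, hidden_size))
response_b_emb = rng.normal(size=(n_rows, hidden_size))

# Placeholder character lengths for prompt, response_a, response_b
lengths = rng.integers(1, 500, size=(n_rows, 3))

# Same column order as the pipeline: embeddings first, then numeric features
X = np.hstack([prompt_emb, response_a_emb, response_b_emb, lengths])
print(X.shape)  # (4, 2307)
```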

---

### Highlights:
- **Batch Processing**: Efficiently processes text data in batches to minimize memory usage.
- **GPU Acceleration**: Automatically leverages GPU if available for faster embedding generation.
- **Reproducibility**: Ensures consistent results by setting a fixed random seed.

This preprocessing pipeline processes text in batches, so it scales to large datasets, and produces a fixed-width numeric feature matrix suitable for a multiclass classifier.
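
The batching idea can be shown in isolation, without loading the model. The generator name `iter_batches` is hypothetical and is used here only for illustration. It slices the input into chunks of at most `batch_size` items, which is the same loop structure the embedding extraction uses.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each holding at most `batch_size` items."""
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


sample = [f"text {i}" for i in range(40)]
batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=16)]
print(batch_sizes)  # [16, 16, 8]
```

Only one batch of token tensors is held on the device at a time, so peak memory depends on `batch_size` and the maximum sequence length rather than on the dataset size.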


In [8]:
from transformers import AutoTokenizer, AutoModel
import torch
import pandas as pd
import numpy as np
import random

class DataPreprocessor:
    """
    A class for preprocessing text data and extracting features for a machine learning model. 
    It includes methods to tokenize and extract embeddings from text using a pre-trained transformer model 
    (RoBERTa), as well as for feature engineering and preparing target labels.

    Attributes:
        device (torch.device): The device to run computations on ('cuda' if available, otherwise 'cpu').
        tokenizer (AutoTokenizer): A tokenizer for tokenizing input text, initialized from 'roberta-base'.
        embedding_model (AutoModel): A pre-trained transformer model for extracting text embeddings, initialized from 'roberta-base'.
    """

    def __init__(self):
        """
        Initializes the DataPreprocessor class by setting the computation device (CPU/GPU) and loading 
        the tokenizer and transformer model (RoBERTa).
        """
        # Set the computation device to GPU if available, otherwise use CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Load RoBERTa tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained('roberta-base')

        # Load RoBERTa model and move it to the computation device
        self.embedding_model = AutoModel.from_pretrained('roberta-base').to(self.device)

    def extract_embeddings(self, texts, batch_size=16):
        """
        Extracts text embeddings using the RoBERTa model. The embeddings are taken from the first token 
        of the last hidden state (<s>, RoBERTa's counterpart to BERT's [CLS]).

        Args:
            texts (pd.Series or list): A list or pandas Series of strings (text data).
            batch_size (int): The batch size to use for processing text inputs.

        Returns:
            np.ndarray: A NumPy array of shape (len(texts), embedding_size) containing the embeddings for the input texts.
        """
        all_embeddings = []  # List to store embeddings for all text batches

        # Process texts in batches to handle memory efficiently
        for i in range(0, len(texts), batch_size):
            # Extract a batch of texts; list() works for both pandas Series and plain lists
            batch = list(texts[i:i + batch_size])

            # Tokenize the batch (convert text to token IDs) and move to the computation device
            inputs = self.tokenizer(batch, padding=True, truncation=True, return_tensors='pt', max_length=512).to(self.device)

            # Use the model to compute embeddings, disabling gradient computation for efficiency
            with torch.no_grad():
                outputs = self.embedding_model(**inputs)

            # Extract embeddings from the first token (<s>, RoBERTa's [CLS] equivalent) of the last hidden state
            embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()

            # Append the batch embeddings to the list
            all_embeddings.append(embeddings)

        # Concatenate all batch embeddings into a single NumPy array
        return np.concatenate(all_embeddings)

    def feature_engineering(self, df):
        """
        Performs feature engineering on the input DataFrame by extracting text embeddings and other numeric features.

        Args:
            df (pd.DataFrame): A pandas DataFrame containing 'prompt', 'response_a', and 'response_b' columns.

        Returns:
            np.ndarray: A NumPy array of shape (len(df), feature_size) containing engineered features.
        """
        # Simple numeric features: character lengths of the prompt and responses
        numeric_features = np.column_stack([
            df['prompt'].str.len(),
            df['response_a'].str.len(),
            df['response_b'].str.len(),
        ])

        # Extract text embeddings and stack them side by side (prompt, response A, response B).
        # Working with the arrays directly avoids mutating the caller's DataFrame.
        embeddings = np.hstack([
            self.extract_embeddings(df['prompt']),
            self.extract_embeddings(df['response_a']),
            self.extract_embeddings(df['response_b']),
        ])

        # Combine embeddings with numeric features
        X = np.hstack([embeddings, numeric_features])

        return X

    def prepare_target(self, df):
        """
        Prepares the target variable for classification. Converts the winner columns into a single integer label:
        0 for 'winner_model_a', 1 for 'winner_model_b', and 2 for 'winner_tie'.

        Args:
            df (pd.DataFrame): A pandas DataFrame containing 'winner_model_a', 'winner_model_b', and 'winner_tie' columns.

        Returns:
            np.ndarray: A NumPy array of shape (len(df),) containing integer labels.
        """
        # Initialize target array with zeros
        target = np.zeros(len(df))

        # Set target labels based on winner columns
        target[df['winner_model_a'] == 1] = 0
        target[df['winner_model_b'] == 1] = 1
        target[df['winner_tie'] == 1] = 2

        return target

    def preprocess_and_save(self, train_path, test_path):
        """
        Preprocesses training and test data by extracting features and targets, then saves them as NumPy arrays.

        Args:
            train_path (str): Path to the training CSV file.
            test_path (str): Path to the test CSV file.
        """
        # Load training and test datasets
        train_df = pd.read_csv(train_path)
        test_df = pd.read_csv(test_path)

        # Preprocess and save training data
        print("Processing training data...")
        X_train = self.feature_engineering(train_df)  # Extract features
        np.save('x_train.npy', X_train)              # Save features as a NumPy array
        y_train = self.prepare_target(train_df)      # Prepare target labels
        np.save('y_train.npy', y_train)              # Save target labels as a NumPy array

        # Preprocess and save test data
        print("Processing test data...")
        X_test = self.feature_engineering(test_df)   # Extract features
        np.save('x_test.npy', X_test)                # Save features as a NumPy array

        print("Data preprocessing complete. NumPy arrays saved.")


---

## DatasetLLMPredictor: Training and Evaluating LLM Preference Prediction Model

This script trains and evaluates a **LightGBM** model for the **LLM Model Preference Prediction** task. It is designed to handle multiclass classification, predicting the preference between two responses from different LLMs, or a tie.

### Key Features and Workflow:
1. **LightGBM Multiclass Classifier**:
   - Objective: `multiclass` (for three classes: `winner_model_a`, `winner_model_b`, `winner_tie`).
   - Runs on CPU by default. GPU training requires a GPU-enabled LightGBM build and an explicit `device_type` setting, which this notebook does not configure.

2. **Reproducibility**:
   - Includes methods to save and reload the trained model for future use, enabling reproducible workflows.

3. **Training and Evaluation**:
   - Trains on the entire training dataset without a validation split, so all available data is used for learning.
   - Reports two metrics, both computed on the training data, so they overestimate performance on unseen data:
     - **Accuracy Score**: Overall correctness of predictions.
     - **Classification Report**: Detailed metrics (precision, recall, F1-score) for each class.

4. **Predictor Class Structure**:
   - **Training**:
     - Fits the model on the full training dataset and evaluates its performance.

5. **Test Dataset Prediction**:
   - The trained model generates class probabilities for the separate test dataset, which are written to the submission file.
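
Because the reported metrics come from the training data, a held-out split is one way to get a more realistic estimate. The sketch below is a NumPy-only illustration. The helper `holdout_split`, the 20% validation fraction, and the seed are assumptions, not part of the notebook.

```python
import numpy as np


def holdout_split(X, y, val_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and validation parts.

    Hypothetical helper for illustration; it does not stratify by class.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]


# Tiny stand-in arrays; the real inputs would be x_train.npy and y_train.npy
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10) % 3

X_tr, X_val, y_tr, y_val = holdout_split(X_demo, y_demo)
print(X_tr.shape, X_val.shape)  # (8, 2) (2, 2)
```

With such a split, the model would be fitted on `X_tr`, `y_tr` and scored on `X_val`, `y_val`, for example by passing the predictions for `X_val` to `accuracy_score`.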



In [2]:
from sklearn.metrics import accuracy_score, classification_report
import numpy as np  # For numerical operations and loading preprocessed data
import pandas as pd  # For creating the submission DataFrame
import lightgbm as lgb

class DatasetLLMPredictor:
    """
    A class to train and evaluate a LightGBM model on the entire dataset 
    for predicting the preference between LLM responses.

    Attributes:
        accuracy (float): The accuracy score of the model on the training dataset.
        detailed_metrics (dict): Detailed classification metrics, including precision, recall, and F1-score.
    """

    def __init__(self):
        """
        Initializes the predictor with attributes to store accuracy and detailed metrics.
        """
        self.accuracy = None  # Placeholder for accuracy score
        self.detailed_metrics = None  # Placeholder for detailed classification metrics

    def train_model(self, X_train, y_train):
        """
        Trains a LightGBM model on the provided training data.

        Args:
            X_train (numpy.ndarray): Feature matrix for training.
            y_train (numpy.ndarray): Labels corresponding to the training features.

        Returns:
            lgb.LGBMClassifier: Trained LightGBM model.
        """
        # Initialize the LightGBM classifier with multiclass support
        model = lgb.LGBMClassifier(
            objective='multiclass',  # Specifies a multiclass classification objective
            num_class=3  # Number of target classes (Model A, Model B, Tie)
        )

        # Train the model on the entire training dataset
        model.fit(X_train, y_train)

        # Make predictions on the training data for evaluation
        train_pred = model.predict(X_train)

        # Calculate accuracy on the training data
        self.accuracy = accuracy_score(y_train, train_pred)

        # Generate a detailed classification report
        self.detailed_metrics = classification_report(y_train, train_pred, output_dict=True)

        # Print results for transparency
        print(f"Training Accuracy: {self.accuracy:.4f}")
        print("Detailed Classification Report:")
        print(classification_report(y_train, train_pred))  # Human-readable report

        return model  # Return the trained model

    def save_model(self, model, filepath):
        """
        Saves the trained LightGBM model to a file.

        Args:
            model (lgb.LGBMClassifier): The trained LightGBM model.
            filepath (str): Path to save the model.
        """
        model.booster_.save_model(filepath)
        print(f"Model saved to {filepath}")

    def load_model(self, filepath):
        """
        Loads a trained LightGBM model from a file.

        Args:
            filepath (str): Path to the saved model file.

        Returns:
            lgb.Booster: The loaded LightGBM model.
        """
        model = lgb.Booster(model_file=filepath)
        print(f"Model loaded from {filepath}")
        return model

    def generate_submission_csv(self, model, X_test, filepath, test_csv_path='test.csv'):
        """
        Generates and saves a CSV submission file with class probabilities for the test dataset.

        Args:
            model (lgb.Booster): The trained LightGBM model (Booster).
            X_test (numpy.ndarray): Feature matrix for the test data.
            filepath (str): Path to save the submission CSV file.
            test_csv_path (str): Path to the test CSV file, read only for its 'id' column.
        """
        # For a multiclass Booster, predict() returns class probabilities of shape (n_samples, 3);
        # raw_score=False (the default) requests probabilities rather than raw scores
        test_probs = model.predict(X_test, raw_score=False)

        # Load the IDs from the test dataset; rows must be in the same order as X_test
        test_ids = pd.read_csv(test_csv_path, usecols=['id'])['id']

        # Create a DataFrame for the submission file
        submission = pd.DataFrame({
            'id': test_ids,
            'winner_model_a': test_probs[:, 0],
            'winner_model_b': test_probs[:, 1],
            'winner_tie': test_probs[:, 2],
        })

        # Save the submission DataFrame as a CSV file
        submission.to_csv(filepath, index=False)
        print(f"Submission file saved to {filepath}")




---

## Usage: End-to-End Pipeline for Preprocessing, Training, and Prediction

This section demonstrates the usage of the `DataPreprocessor` and `DatasetLLMPredictor` classes to preprocess the data, train the model, and generate predictions. The steps include:

### Workflow:
1. **Preprocessing**:
   - Initialize the `DataPreprocessor` to preprocess both the training and test datasets.
   - Save the processed features (`X_train`, `X_test`) and labels (`y_train`) as `.npy` files for reproducibility.

2. **Model Training**:
   - Load the preprocessed features and labels.
   - Use `DatasetLLMPredictor` to train a LightGBM model with the training data.
   - Evaluate the model's performance on the training data.

3. **Model Saving and Loading**:
   - Save the trained LightGBM model to a file.
   - Load the saved model for future use or inference.

4. **Inference**:
   - Generate predictions on the test dataset using the trained model.
   - Save the predictions to a `.csv` file for submission.

### Key Features:
- Ensures reproducibility by saving intermediate outputs (`.npy` files and trained model).
- Handles end-to-end machine learning workflow: data preprocessing, training, and prediction.
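
The inference step ends with a table of one row per test example and one probability column per class. The sketch below builds such a table from made-up IDs and probabilities, both of which are placeholders. It also checks that each row of probabilities sums to 1 before the frame is written out.

```python
import numpy as np
import pandas as pd

# Placeholder IDs and class probabilities (illustrative values only)
test_ids = np.array([101, 102, 103])
probs = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
])

# Sanity checks: one row per ID, three classes, rows sum to 1
rows_ok = probs.shape == (len(test_ids), 3)
sums_ok = bool(np.allclose(probs.sum(axis=1), 1.0))

submission = pd.DataFrame({
    "id": test_ids,
    "winner_model_a": probs[:, 0],
    "winner_model_b": probs[:, 1],
    "winner_tie": probs[:, 2],
})
print(submission)
```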



In [4]:
# Step 1: Initialize the Data Preprocessor
preprocessor = DataPreprocessor()

# Step 2: Preprocess the data and save features and labels
train_csv_path = "train.csv"  # Replace with the path to your train dataset
test_csv_path = "test.csv"    # Replace with the path to your test dataset
preprocessor.preprocess_and_save(train_csv_path, test_csv_path)

# Step 3: Load the preprocessed features and labels
X_train = np.load("x_train.npy")
y_train = np.load("y_train.npy")
X_test = np.load("x_test.npy")

# Step 4: Initialize the Dataset Predictor
predictor = DatasetLLMPredictor()

# Step 5: Train the model on the training data
trained_model = predictor.train_model(X_train, y_train)

# Step 6: Save the trained model for reproducibility
predictor.save_model(trained_model, "trained_lightgbm_model.txt")

# Step 7: Load the trained model (if needed)
loaded_model = predictor.load_model("trained_lightgbm_model.txt")

# Step 8: Generate and save the submission CSV
predictor.generate_submission_csv(loaded_model, X_test, 'submission.csv')

print("Pipeline executed successfully. Test predictions saved to 'submission.csv'.")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.678153 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 588285
[LightGBM] [Info] Number of data points in the train set: 40000, number of used features: 2307
[LightGBM] [Info] Start training from score -1.052039
[LightGBM] [Info] Start training from score -1.067768
[LightGBM] [Info] Start training from score -1.180908
Training Accuracy: 0.7702
Detailed Classification Report:
              precision    recall  f1-score   support

         0.0       0.74      0.82      0.78     13969
         1.0       0.76      0.80      0.78     13751
         2.0       0.82      0.68      0.74     12280

    accuracy                           0.77     40000
   macro avg       0.78      0.77      0.77     40000
weighted avg       0.77      0.77      0.77     40000

Model saved to trained_lightgbm_model.txt
Model loaded from trained_lightgbm_model.txt
Submission file sav