# Overview

This competition challenges you to predict which responses users will prefer in a head-to-head battle between chatbots powered by large language models (LLMs). You'll be given a dataset of conversations from the Chatbot Arena, where different LLMs generate answers to user prompts. By developing a winning machine learning model, you'll help improve how chatbots interact with humans and ensure they better align with human preferences.

In [1]:
import pandas as pd

### Load Data

In [2]:
train_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')

In [3]:
train_data = train_data.drop(['model_a', "model_b"], axis=1)
train_data.head()

Unnamed: 0,id,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
0,30192,"[""Is it morally right to try to have a certain...","[""The question of whether it is morally right ...","[""As an AI, I don't have personal beliefs or o...",1,0,0
1,53567,"[""What is the difference between marriage lice...","[""A marriage license is a legal document that ...","[""A marriage license and a marriage certificat...",0,1,0
2,65089,"[""explain function calling. how would you call...","[""Function calling is the process of invoking ...","[""Function calling is the process of invoking ...",0,0,1
3,96401,"[""How can I create a test set for a very rare ...","[""Creating a test set for a very rare category...","[""When building a classifier for a very rare c...",1,0,0
4,198779,"[""What is the best way to travel from Tel-Aviv...","[""The best way to travel from Tel Aviv to Jeru...","[""The best way to travel from Tel-Aviv to Jeru...",0,1,0


In [4]:
# count the number of missing (NA) values per column
print(train_data.isna().sum())

id                0
prompt            0
response_a        0
response_b        0
winner_model_a    0
winner_model_b    0
winner_tie        0
dtype: int64


In [5]:
# parse the string into a Python object so the surrounding [ ] can be removed
import ast
s = train_data['response_a'][0]
obj = ast.literal_eval(s)

print(obj[0])
print(type(obj[0]))


The question of whether it is morally right to aim for a certain percentage of females in managerial positions is a complex ethical issue that involves considerations of fairness, equality, diversity, and discrimination.

Here are some arguments in favor of and against such policies:

**Arguments in favor:**

1. **Correcting Historical Inequities:** Women have historically been underrepresented in leadership roles due to various cultural, institutional, and social barriers. Aiming for a specific percentage can be seen as a corrective measure to address past and ongoing discrimination.

2. **Promoting Diversity:** Diverse leadership teams can enhance decision-making and represent a broader range of perspectives. This can lead to better outcomes for organizations and society as a whole.

3. **Equality of Opportunity:** Setting targets for female representation in management can help ensure that women have equal opportunities to advance in their careers.

4. **Role Modeling:** Increased v

In [6]:
def safe_eval(x):
    if not isinstance(x, str):
        return x  # skip non-string values
    
    x = x.strip()
    # Ensure it looks like a list, e.g. starts with '[' and ends with ']'
    if x.startswith('[') and x.endswith(']'):
        try:
            val = ast.literal_eval(x)
            if isinstance(val, list) and len(val) > 0:
                return val[0]
        except (ValueError, SyntaxError):
            pass
    return x  # fallback: return as-is if not valid

train_data['response_a'] = train_data['response_a'].apply(safe_eval)
train_data['response_b'] = train_data['response_b'].apply(safe_eval)
train_data['prompt'] = train_data['prompt'].apply(safe_eval)
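Note that `safe_eval` keeps only the first element of each list, and `ast.literal_eval` rejects JSON literals such as `null`, so such rows fall through unchanged. The sketch below is one possible alternative, not the notebook's method: it assumes the columns hold JSON-encoded lists of turns, and `parse_turns` is a hypothetical helper name.

```python
import json


def parse_turns(raw, sep="\n"):
    """Parse a JSON-encoded list of turns and join all turns into one string.

    Hypothetical helper: assumes the column stores JSON lists such as
    '["first turn", "second turn"]'. Anything else is returned unchanged.
    """
    if not isinstance(raw, str):
        return raw  # leave non-string values (e.g. NaN) untouched
    try:
        turns = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # not valid JSON: keep the original text
    if not isinstance(turns, list):
        return raw
    # JSON null becomes None in Python; treat it as an empty turn
    return sep.join("" if turn is None else str(turn) for turn in turns)


print(parse_turns('["Hello", "World"]'))
print(repr(parse_turns('["only turn", null]')))
```

Whether joining every turn helps depends on the tokenizer's length budget (`max_length` is 512 in this notebook), so treat this as an option to evaluate rather than a drop-in replacement.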



In [7]:
split_size = int(0.9 * len(train_data))
validation_data = train_data[split_size:]
train_data = train_data[:split_size]
print(f"Train data size: {len(train_data)}")
print(f"Validation data size: {len(validation_data)}")

Train data size: 51729
Validation data size: 5748
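The split above takes the last 10% of rows in file order. If the rows happen to be sorted (the `id` values in the tables look increasing), a shuffled split may give a more representative validation set. A minimal sketch, assuming a fixed seed is acceptable; `shuffled_split` and the demo frame are illustrative names, not part of the notebook:

```python
import pandas as pd


def shuffled_split(df, train_frac=0.9, seed=42):
    """Shuffle rows reproducibly, then cut into train and validation parts."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    cut = int(train_frac * len(shuffled))
    return shuffled.iloc[:cut], shuffled.iloc[cut:]


# Tiny stand-in frame; in the notebook this would be train_data
demo = pd.DataFrame({"id": range(10), "prompt": [f"p{i}" for i in range(10)]})
demo_train, demo_val = shuffled_split(demo)
print(len(demo_train), len(demo_val))
```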


In [8]:
train_data.tail()

Unnamed: 0,id,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
51724,3865500373,Please do a in depth analysis and share outcom...,Sigmund Freud and Stanislav Grof are two emine...,Sigmund Freud and Stanislav Grof are two figur...,1,0,0
51725,3865610759,You should capitalize the sentence according t...,A Giant Spider Blocks Your Path.,,1,0,0
51726,3865660213,Summarize the different ways automatic transmi...,There are several ways that automatic transmis...,Here is a summary of the main types of automat...,0,0,1
51727,3865766737,Fill in the blank: The cat chased its ____ .\n...,"A. Tail\n\nThe correct answer is ""tail."" The p...",The correct answer is A. Tail,0,0,1
51728,3865815908,Let's play a game of chess. Please only answer...,c5. Fight for the center!,1...e5 White played an aggressive opening move...,0,1,0


In [9]:
validation_data.head()

Unnamed: 0,id,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
51729,3865915001,who created the blue led and why was it so dif...,The blue LED was first invented by Japanese ch...,The blue LED (light-emitting diode) was invent...,1,0,0
51730,3866013973,If you overtake the person in 2nd place in the...,Hello! I'm happy to help you with your questio...,If you overtake the person in 2nd place in the...,0,1,0
51731,3866095506,In which country can you find the largest wate...,The largest waterfall by volume of water flowi...,The largest waterfall by volume of water flowi...,1,0,0
51732,3866185257,I have a bitmask and a value. I want to scat...,You're correct that x86 processors have instru...,"Yes, you can achieve this in C using bitwise o...",0,1,0
51733,3866194691,Bro it to the max,"What do you mean by ""Bro it to the max""? Do yo...","""Bro it to the max"" is a slang phrase that mea...",0,1,0


In [10]:
train_data.iloc[0]['response_a']

'The question of whether it is morally right to aim for a certain percentage of females in managerial positions is a complex ethical issue that involves considerations of fairness, equality, diversity, and discrimination.\n\nHere are some arguments in favor of and against such policies:\n\n**Arguments in favor:**\n\n1. **Correcting Historical Inequities:** Women have historically been underrepresented in leadership roles due to various cultural, institutional, and social barriers. Aiming for a specific percentage can be seen as a corrective measure to address past and ongoing discrimination.\n\n2. **Promoting Diversity:** Diverse leadership teams can enhance decision-making and represent a broader range of perspectives. This can lead to better outcomes for organizations and society as a whole.\n\n3. **Equality of Opportunity:** Setting targets for female representation in management can help ensure that women have equal opportunities to advance in their careers.\n\n4. **Role Modeling:*

### Transformer

In [11]:
config = {
    'model_name': 'google-bert/bert-base-uncased',
    'num_labels': 3,
    'max_length': 512,
    'batch_size': 16,
    'learning_rate': 2e-5,
    'num_epochs': 1000,
    'output_dir': './model_output',
    'device': 'cuda',
    'data_path': 'data/train.csv',
    'data_size': len(train_data),
    'cleaned_data array': False
}

In [12]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from model import ClassificationModel
from data import DataPreparation



torch.backends.cudnn.benchmark = True
device = torch.device("cuda:0")
model = ClassificationModel(backbone_model_name=config['model_name']).float().to(device)
optima = optim.AdamW(model.parameters(), lr=config['learning_rate'])

train_set = DataPreparation(df=train_data, tokenizer_name=config['model_name'])
val_set = DataPreparation(df=validation_data, tokenizer_name=config['model_name'])
dataloader_train = DataLoader(train_set, batch_size=config['batch_size'],
                                  shuffle=True, num_workers=4, pin_memory=True)
dataloader_val = DataLoader(val_set, batch_size=config['batch_size'], 
                                shuffle=False, num_workers=4, pin_memory=True)
criterion = nn.CrossEntropyLoss()



In [None]:
import mlflow
from train import train
from train import validate
import os
os.makedirs(config['output_dir'], exist_ok=True)
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("LLM Classification Fine-tuning")

with mlflow.start_run():
    # Log hyperparameters once
    mlflow.log_params(config)
    min_loss = float('inf')
    for epoch in range(config['num_epochs']):
        print(f"Epoch {epoch+1}/{config['num_epochs']}")

        total_loss = train(dataloader_train, model, optima, criterion, config)

        val_loss = validate(dataloader_val, model, criterion, config)

        # Log loss to MLflow
        mlflow.log_metric("loss", total_loss, step=epoch+1)
        mlflow.log_metric("val_loss", val_loss, step=epoch+1)

        # Save and log the model only when the validation loss improves
        model_path = f"{config['output_dir']}/best{val_loss:.4f}_epoch{epoch+1}"
        if val_loss < min_loss:
            torch.save(model.state_dict(), f"{model_path}.pth")
            mlflow.pytorch.log_model(model, name=f'best-{epoch+1}', step=epoch+1)
            min_loss = val_loss
            print(f"✅ Model for epoch {epoch+1} saved and logged to MLflow.")
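With `num_epochs` set to 1000, the loop keeps running long after the validation loss stops improving. One common remedy is early stopping. The class below is a minimal sketch, not part of the project's `train` module; `EarlyStopping`, `patience`, and `min_delta` are names introduced here for illustration.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation losses standing in for the values returned by validate()
losses = [1.00, 0.90, 0.95, 0.93, 0.92, 0.80]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In the training loop, `stopper.step(val_loss)` could be called right after validation, with a `break` when it returns `True`. Note that with a patience of 3 the simulated run stops before reaching the later, lower loss, so the patience value is a trade-off.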


In [14]:
test_data['response_a'] = test_data['response_a'].apply(safe_eval)
test_data['response_b'] = test_data['response_b'].apply(safe_eval)
test_data['prompt'] = test_data['prompt'].apply(safe_eval)

In [15]:
import mlflow.pytorch
import mlflow
from train import test
mlflow.set_tracking_uri("http://127.0.0.1:5000")
model = mlflow.pytorch.load_model("models:/LLM Classifcation Model/1")

Downloading artifacts: 100%|██████████| 6/6 [02:17<00:00, 23.00s/it] 


In [18]:
test_data

Unnamed: 0,id,prompt,response_a,response_b
0,136060,"I have three oranges today, I ate an orange ye...",You have two oranges today.,You still have three oranges. Eating an orange...
1,211333,You are a mediator in a heated political debat...,Thank you for sharing the details of the situa...,Mr Reddy and Ms Blue both have valid points in...
2,1233961,How to initialize the classification head when...,When you want to initialize the classification...,To initialize the classification head when per...


In [16]:
from train import test
from torch.utils.data import DataLoader
from data import DataPreparation
test_set = DataPreparation(df=test_data, tokenizer_name=config['model_name'])
dataloader_test = DataLoader(test_set, batch_size=config['batch_size'], 
                                shuffle=False, num_workers=4, pin_memory=True)
all_preds, all_labels, outputs = test(dataloader_test, model, config)
print(all_preds)

Testing:   0%|          | 0/1 [00:00<?, ?it/s]

⚠️ Error at index 0: 'winner_model_a'
   prompt: I have three oranges today, I ate an orange yesterday. How many oranges do I have?
   response_a: You have two oranges today.
   response_b: You still have three oranges. Eating an orange yesterday does not affect the number of oranges you have today.
⚠️ Error at index 1: 'winner_model_a'
   prompt: You are a mediator in a heated political debate between two opposing parties. Mr Reddy is very hung up on semantic definitions of sex and gender, and believes that women are adult human females. Meanwhile Ms Blue is extremely fluid with definitions and does not care about truth. He (Ms blue uses he\/him pronouns) insists that anybody can be any gender, gametes don't mean anything, and that men can get pregnant. You, Mr Goddy are tasked with helping them both find a middle ground.
   response_a: Thank you for sharing the details of the situation. As a mediator, I understand the importance of finding a middle ground that both parties can agree 

Testing: 100%|██████████| 1/1 [00:00<00:00,  2.84it/s]

tensor([2, 2, 2])
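The `'winner_model_a'` errors above arise because `test_data` has no label columns. The project's `DataPreparation` class is not shown here, so the following is only a sketch of how a label lookup could tolerate unlabeled rows; `extract_label` and the `-1` placeholder are assumptions, not the project's API.

```python
import pandas as pd

LABEL_COLUMNS = ["winner_model_a", "winner_model_b", "winner_tie"]


def extract_label(row, missing=-1):
    """Return the class index (0, 1, or 2) for a row, or `missing` if unlabeled."""
    if not all(col in row.index for col in LABEL_COLUMNS):
        return missing  # test rows carry no winner_* columns
    values = [int(row[col]) for col in LABEL_COLUMNS]
    return values.index(max(values))  # position of the 1 in the one-hot triple


labeled = pd.Series({"id": 53567, "winner_model_a": 0, "winner_model_b": 1, "winner_tie": 0})
unlabeled = pd.Series({"id": 136060, "prompt": "I have three oranges today..."})
print(extract_label(labeled), extract_label(unlabeled))
```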





In [None]:
print(all_preds)

tensor([[0., 0., 1.],
        [0., 0., 1.],
        [0., 0., 1.]])


In [28]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Initial data: predicted class indices
labels = all_preds
print(labels)
# Convert to a 2D array (the encoder requires 2D input)
labels_array = np.array(labels).reshape(-1, 1)
print(labels_array)
# One-hot encoding
# Assume there are 3 classes in total: 0, 1, 2
num_classes = 3
encoder = OneHotEncoder(categories=[np.arange(num_classes)], sparse_output=False)

one_hot = encoder.fit_transform(labels_array)
print(one_hot)
# Build the DataFrame
df_onehot = pd.DataFrame(one_hot, columns=["winner_model_a", "winner_model_b", "winner_tie"])

# Combine with the original table (example)
df_combined = pd.concat([test_data, df_onehot], axis=1)

print(df_combined)


tensor([2, 2, 2])
[[2]
 [2]
 [2]]
[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]
        id                                             prompt  \
0   136060  I have three oranges today, I ate an orange ye...   
1   211333  You are a mediator in a heated political debat...   
2  1233961  How to initialize the classification head when...   

                                          response_a  \
0                        You have two oranges today.   
1  Thank you for sharing the details of the situa...   
2  When you want to initialize the classification...   

                                          response_b  winner_model_a  \
0  You still have three oranges. Eating an orange...             0.0   
1  Mr Reddy and Ms Blue both have valid points in...             0.0   
2  To initialize the classification head when per...             0.0   

   winner_model_b  winner_tie  
0             0.0         1.0  
1             0.0         1.0  
2             0.0         1.0  
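For a fixed, known set of classes, the same one-hot table can be built without an encoder by indexing an identity matrix. This is a sketch of an equivalent alternative, assuming the predictions are integer class indices in the range 0 to 2; `one_hot_frame` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

LABEL_COLUMNS = ["winner_model_a", "winner_model_b", "winner_tie"]


def one_hot_frame(class_indices):
    """Turn class indices (0, 1, 2) into a one-hot DataFrame."""
    idx = np.asarray(class_indices, dtype=np.int64).reshape(-1)
    # Row i of the identity matrix is the one-hot vector for class i
    one_hot = np.eye(len(LABEL_COLUMNS), dtype=np.float64)[idx]
    return pd.DataFrame(one_hot, columns=LABEL_COLUMNS)


demo_onehot = one_hot_frame([2, 2, 2])
print(demo_onehot)
```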




In [30]:
df_combined.drop(['prompt', 'response_a', 'response_b'], axis=1, inplace=True)

In [32]:
df_combined.to_csv('test_results_with_onehot.csv', index=False)
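The file written above contains hard 0/1 predictions. If the submission is scored on predicted probabilities (for example with log loss), softmax probabilities are usually a safer choice than one-hot values. The sketch below assumes the `outputs` returned by `test` are raw logits of shape `(n_samples, 3)`, which the notebook does not confirm; the demo logits are made up for illustration.

```python
import numpy as np
import pandas as pd

LABEL_COLUMNS = ["winner_model_a", "winner_model_b", "winner_tie"]


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up logits standing in for the model outputs (assumption)
demo_logits = [[0.2, 0.1, 1.5], [2.0, 0.0, 0.0]]
probs = softmax(demo_logits)

submission = pd.DataFrame(probs, columns=LABEL_COLUMNS)
submission.insert(0, "id", [136060, 211333])  # ids taken from the test table
print(submission)
```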