# Cleaning the data set
First, load the data from the CSV file.

In [1]:
import pandas as pd

train_df = pd.read_csv("../data/train.csv")

Second, we can take a quick look at the data frame.

In [2]:
train_df.head()


Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
0,30192,gpt-4-1106-preview,gpt-4-0613,"[""Is it morally right to try to have a certain...","[""The question of whether it is morally right ...","[""As an AI, I don't have personal beliefs or o...",1,0,0
1,53567,koala-13b,gpt-4-0613,"[""What is the difference between marriage lice...","[""A marriage license is a legal document that ...","[""A marriage license and a marriage certificat...",0,1,0
2,65089,gpt-3.5-turbo-0613,mistral-medium,"[""explain function calling. how would you call...","[""Function calling is the process of invoking ...","[""Function calling is the process of invoking ...",0,0,1
3,96401,llama-2-13b-chat,mistral-7b-instruct,"[""How can I create a test set for a very rare ...","[""Creating a test set for a very rare category...","[""When building a classifier for a very rare c...",1,0,0
4,198779,koala-13b,gpt-3.5-turbo-0314,"[""What is the best way to travel from Tel-Aviv...","[""The best way to travel from Tel Aviv to Jeru...","[""The best way to travel from Tel-Aviv to Jeru...",0,1,0


In [3]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57477 entries, 0 to 57476
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              57477 non-null  int64 
 1   model_a         57477 non-null  object
 2   model_b         57477 non-null  object
 3   prompt          57477 non-null  object
 4   response_a      57477 non-null  object
 5   response_b      57477 non-null  object
 6   winner_model_a  57477 non-null  int64 
 7   winner_model_b  57477 non-null  int64 
 8   winner_tie      57477 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 3.9+ MB


Now, we can start exploring the data.

In [4]:
# Check for missing values
train_df.isnull().sum()

id                0
model_a           0
model_b           0
prompt            0
response_a        0
response_b        0
winner_model_a    0
winner_model_b    0
winner_tie        0
dtype: int64

There are no missing values in the current data frame.
Next, we can explore the distribution of the data.

* We have a total of 57,477 rows.
* The current ratio of model preference is 0.349:0.342:0.309 (model A : model B : tie). There is a slight imbalance between preference for model A and model B, which might be the result of position bias (people tending to prefer the first response they read), or it might be random chance.
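Whether a gap between A wins and B wins could plausibly be chance can be checked with a two-sided sign test on the non-tie rows. The sketch below is illustrative only: `sign_test` is a hypothetical helper, it relies on a normal approximation to the binomial, and the counts passed in are made up rather than taken from the data.

```python
import math


def sign_test(wins_a: int, wins_b: int) -> tuple[float, float]:
    """Two-sided z-test of H0: P(A wins | no tie) = 0.5 (normal approximation)."""
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("need at least one non-tie comparison")
    # Under H0 the number of A wins has mean n/2 and standard deviation sqrt(n)/2.
    z = (wins_a - n / 2) / (math.sqrt(n) / 2)
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts, only to show the call.
z, p = sign_test(60, 40)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With the real data, the two arguments would be `train_df["winner_model_a"].sum()` and `train_df["winner_model_b"].sum()`. A small p-value would only indicate that the gap is unlikely to be chance; it would not by itself show that position bias is the cause.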

In [5]:
train_df.describe()

Unnamed: 0,id,winner_model_a,winner_model_b,winner_tie
count,57477.0,57477.0,57477.0,57477.0
mean,2142564000.0,0.349079,0.341911,0.309011
std,1238327000.0,0.476683,0.474354,0.46209
min,30192.0,0.0,0.0,0.0
25%,1071821000.0,0.0,0.0,0.0
50%,2133658000.0,0.0,0.0,0.0
75%,3211645000.0,1.0,1.0,1.0
max,4294947000.0,1.0,1.0,1.0
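The means in the `describe()` output are the class shares, because the mean of a 0/1 column is the fraction of rows where it equals 1. A minimal sketch of two related checks follows: that exactly one flag is set per row, and how the three flags could be collapsed into a single label. The `toy` frame and the `winner` column are invented for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy = pd.DataFrame({
    "winner_model_a": [1, 0, 0, 1],
    "winner_model_b": [0, 1, 0, 0],
    "winner_tie":     [0, 0, 1, 0],
})
label_cols = ["winner_model_a", "winner_model_b", "winner_tie"]

# Each row should have exactly one of the three flags set.
one_hot_ok = (toy[label_cols].sum(axis=1) == 1).all()

# The mean of a 0/1 column is the share of rows in that class.
shares = toy[label_cols].mean()

# Collapse the three flags into one categorical label per row.
toy["winner"] = toy[label_cols].idxmax(axis=1)

print(one_hot_ok)
print(shares)
print(toy["winner"].value_counts())
```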


Now, we can check for duplicate rows.

* Apparently, we have none.

In [6]:
duplicates = train_df.duplicated()
print("Number of duplicate rows:", duplicates.sum())

Number of duplicate rows: 0
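Called without arguments, `duplicated()` compares entire rows, including `id`, so two rows with identical text but different ids are not counted. The sketch below shows how the check could be restricted to the text columns with the `subset` parameter. The `toy` frame is invented, and whether such repeats exist in `train.csv` is not checked here.

```python
import pandas as pd

# Toy frame: rows 0 and 1 share the same text but have different ids.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "prompt": ['["hi"]', '["hi"]', '["bye"]'],
    "response_a": ['["hello"]', '["hello"]', '["see you"]'],
    "response_b": ['["hey"]', '["hey"]', '["later"]'],
})

# Whole-row comparison: the differing ids hide the repeated text.
full_row_dupes = toy.duplicated().sum()

# Comparison on the text columns only.
text_cols = ["prompt", "response_a", "response_b"]
text_dupes = toy.duplicated(subset=text_cols).sum()

print("Full-row duplicates:", full_row_dupes)
print("Text-only duplicates:", text_dupes)
```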


Now, we can start processing the data.
First, we will convert all textual data (prompts, response A, response B) to lower case to reduce variation due to capitalization, strip special characters (keeping only common punctuation), and then tokenize the text and remove stop words.
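The tokenization and stop-word step mentioned above could look roughly like the following sketch. It is an assumption about how that step might be done, not the notebook's implementation: `tokenize` and `remove_stop_words` are hypothetical helpers, and `STOP_WORDS` is a tiny hand-written stand-in for NLTK's much longer `stopwords.words("english")` list.

```python
import re

# Tiny illustrative stop-word list; an assumption standing in for
# NLTK's stopwords.words("english"), which is much longer.
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "and", "it"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


tokens = tokenize("The best way to travel from Tel-Aviv to Jerusalem is a train.")
filtered = remove_stop_words(tokens)
print(filtered)
```

Note that this simple tokenizer splits hyphenated words such as "Tel-Aviv" into two tokens; whether that is desirable depends on the downstream model.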

Let's investigate the current status of the data in the cells. Take the response_a of the first row as an example.


In [7]:
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

# define stop words
stop_words = set(stopwords.words("english"))

tmp = train_df.loc[0, "response_a"]  # select row and column in a single lookup
tmp

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andyz\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'["The question of whether it is morally right to aim for a certain percentage of females in managerial positions is a complex ethical issue that involves considerations of fairness, equality, diversity, and discrimination.\\n\\nHere are some arguments in favor of and against such policies:\\n\\n**Arguments in favor:**\\n\\n1. **Correcting Historical Inequities:** Women have historically been underrepresented in leadership roles due to various cultural, institutional, and social barriers. Aiming for a specific percentage can be seen as a corrective measure to address past and ongoing discrimination.\\n\\n2. **Promoting Diversity:** Diverse leadership teams can enhance decision-making and represent a broader range of perspectives. This can lead to better outcomes for organizations and society as a whole.\\n\\n3. **Equality of Opportunity:** Setting targets for female representation in management can help ensure that women have equal opportunities to advance in their careers.\\n\\n4. **R

The text contains many literal escape sequences such as `\n`, which need to be removed. Markers such as `**...**` and the numbered lists suggest that the responses are raw Markdown-like text.
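The cell value shown above is a single string that looks like a serialized list, so the `\\n` in its repr is a backslash followed by `n` inside that string. Parsing the string turns those sequences into real newlines. The sketch below compares two standard-library parsers on made-up samples. That the column is JSON-encoded is an assumption based on the preview, not something the notebook states.

```python
import ast
import json

# Made-up sample in the same shape as the raw cell: a list serialized to a string.
raw = '["Line one\\n\\n**Bold** line two", "second turn"]'

as_python = ast.literal_eval(raw)  # parse as a Python literal
as_json = json.loads(raw)          # parse as JSON
print(as_python == as_json)
print(as_python[0])                # the escapes are now real newlines

# JSON-only tokens such as null are not Python literals,
# so ast.literal_eval rejects them while json.loads accepts them.
raw_with_null = '["ok", null]'
try:
    ast.literal_eval(raw_with_null)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
parsed_json = json.loads(raw_with_null)
print(literal_eval_ok, parsed_json)
```

If some cells do contain `null`, a parser that fails on them would leave those cells in their raw bracketed form, so it may be worth checking which parser fits the data.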

In [44]:
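import ast
import re


def clean_list_string(s):
    """Parse a string-encoded list and join its items into one string."""
    try:
        parsed = ast.literal_eval(s)  # turn the string into a Python object
        if isinstance(parsed, list):
            return " ".join(parsed)   # combine the list items into one string
        return s  # fallback if the parsed value is not a list
    except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
        # fallback if the string is malformed or a list item is not a string
        return s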

def clean(text):
    text = clean_list_string(text)
    # 1. Remove code blocks (multiline, using triple backticks)
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)

    # 2. Remove LaTeX/math segments enclosed in dollar signs
    text = re.sub(r"\$.*?\$", "", text)

    # 3. Remove URLs (anything starting with http)
    text = re.sub(r"http\S+", "", text)

    # 4. Replace literal "\n" sequences (a backslash followed by "n") with a space
    text = re.sub(r"\\n", " ", text)

    # 5. Remove leftover emoji sequences formed from hex digits (like "ud83cudf4dud83cudf55")
    # This pattern matches one or more repetitions of a surrogate-like pattern.
    text = re.sub(r"(ud83[c-d][a-f0-9]{4})+", "", text, flags=re.IGNORECASE)

    # 6. Insert a space after punctuation if it is immediately followed by a non-space character.
    # Note: this also splits numbers such as "0,0025" into "0, 0025".
    text = re.sub(r"([,.?!])(?=\S)", r"\1 ", text)

    # 7. Convert text to lowercase to normalize for sentiment analysis.
    text = text.lower()

    # 8. Remove any characters that are not:
    #    - Word characters (letters, digits, and underscore, including accented letters)
    #    - Common punctuation (comma, period, exclamation, question, single quote, hyphen)
    #    - Whitespace
    text = re.sub(r"[^\w\s,.?!'\-]", "", text)

    # 9. Normalize whitespace: remove extra spaces and strip leading/trailing spaces.
    text = re.sub(r"\s+", " ", text).strip()

    return text
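To see what the first three substitutions do, they can be run in isolation on a few hand-written strings. The sketch below reuses the same three patterns as steps 1 to 3 of `clean`. The helper name `strip_markup` and the sample strings are made up for illustration.

```python
import re


def strip_markup(text: str) -> str:
    """Apply only the code-block, math, and URL patterns from clean()."""
    text = re.sub(r"```.*?```", "", text, flags=re.DOTALL)  # fenced code blocks
    text = re.sub(r"\$.*?\$", "", text)                     # $...$ segments
    text = re.sub(r"http\S+", "", text)                     # URLs
    return text


with_code = "Run this:\n```python\nprint(1)\n```\nDone."
with_math = "Euler: $e^{i\\pi} + 1 = 0$ is famous."
with_url = "See https://example.com/docs for details."
with_prices = "It costs $5 today and $10 tomorrow."

for sample in (with_code, with_math, with_url, with_prices):
    print(repr(strip_markup(sample)))
```

The last sample shows a side effect: the math pattern treats any two dollar signs on the same line as delimiters, so the text between two prices is removed as well. Whether that matters depends on how often prices appear in the prompts and responses.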



Then we apply the cleaning function to each of the three text columns (`prompt`, `response_a`, `response_b`).

In [45]:
train_df['clean_prompt'] = train_df['prompt'].apply(clean)
train_df['clean_response_a'] = train_df['response_a'].apply(clean)
train_df['clean_response_b'] = train_df['response_b'].apply(clean)

Check out the first five rows:

In [46]:
# Display the first 5 rows
train_df[['prompt', 'clean_prompt', 'response_a', 'clean_response_a', 'response_b', 'clean_response_b']].head()




Unnamed: 0,prompt,clean_prompt,response_a,clean_response_a,response_b,clean_response_b
0,"[""Is it morally right to try to have a certain...",is it morally right to try to have a certain p...,"[""The question of whether it is morally right ...",the question of whether it is morally right to...,"[""As an AI, I don't have personal beliefs or o...","as an ai, i don't have personal beliefs or opi..."
1,"[""What is the difference between marriage lice...",what is the difference between marriage licens...,"[""A marriage license is a legal document that ...",a marriage license is a legal document that al...,"[""A marriage license and a marriage certificat...",a marriage license and a marriage certificate ...
2,"[""explain function calling. how would you call...",explain function calling. how would you call a...,"[""Function calling is the process of invoking ...",function calling is the process of invoking or...,"[""Function calling is the process of invoking ...",function calling is the process of invoking a ...
3,"[""How can I create a test set for a very rare ...",how can i create a test set for a very rare ca...,"[""Creating a test set for a very rare category...",creating a test set for a very rare category c...,"[""When building a classifier for a very rare c...",when building a classifier for a very rare cat...
4,"[""What is the best way to travel from Tel-Aviv...",what is the best way to travel from tel-aviv t...,"[""The best way to travel from Tel Aviv to Jeru...",the best way to travel from tel aviv to jerusa...,"[""The best way to travel from Tel-Aviv to Jeru...",the best way to travel from tel-aviv to jerusa...


In [30]:
# Alternatively, sample random rows to see a broader subset
print(train_df[['prompt', 'clean_prompt', 'response_a', 'clean_response_a', 'response_b', 'clean_response_b']].sample(5))

                                                  prompt  \
43386  ["give me words in french that are related to ...   
12289              ["how many legs does an apple have?"]   
14084  ["You are the captain of a ship. Write the ann...   
14757  ["write a script to convert an mp3 audio file ...   
48763                 ["Que veut dire cloud computing "]   

                                            clean_prompt  \
43386  give me words in french that are related to br...   
12289                  how many legs does an apple have?   
14084  You are the captain of a ship. Write the annou...   
14757  write a script to convert an mp3 audio file to...   
48763                      Que veut dire cloud computing   

                                              response_a  \
43386  ["Cro\u00fbton (kroo-tn) - A piece of bread, o...   
12289  ["Apples have no legs because they are fruits ...   
14084  ["\"Ladies and Gentlemen, this is your captain...   
14757  ["Here is a sample bash script 

Great! The cleaning process appears to work reasonably well.
Now we store the cleaned data in another CSV file and then proceed.

In [47]:
train_df.to_csv("../data/cleaned_train.csv", index=False)
train_df

Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie,clean_prompt,clean_response_a,clean_response_b,partial_clean_prompt,partial_clean_response_a,partial_clean_response_b
0,30192,gpt-4-1106-preview,gpt-4-0613,"[""Is it morally right to try to have a certain...","[""The question of whether it is morally right ...","[""As an AI, I don't have personal beliefs or o...",1,0,0,is it morally right to try to have a certain p...,the question of whether it is morally right to...,"as an ai, i don't have personal beliefs or opi...",is it morally right to try to have a certain p...,the question of whether it is morally right to...,"as an ai, i don't have personal beliefs or opi..."
1,53567,koala-13b,gpt-4-0613,"[""What is the difference between marriage lice...","[""A marriage license is a legal document that ...","[""A marriage license and a marriage certificat...",0,1,0,what is the difference between marriage licens...,a marriage license is a legal document that al...,a marriage license and a marriage certificate ...,what is the difference between marriage licens...,a marriage license is a legal document that al...,a marriage license and a marriage certificate ...
2,65089,gpt-3.5-turbo-0613,mistral-medium,"[""explain function calling. how would you call...","[""Function calling is the process of invoking ...","[""Function calling is the process of invoking ...",0,0,1,explain function calling. how would you call a...,function calling is the process of invoking or...,function calling is the process of invoking a ...,explain function calling. how would you call a...,function calling is the process of invoking or...,function calling is the process of invoking a ...
3,96401,llama-2-13b-chat,mistral-7b-instruct,"[""How can I create a test set for a very rare ...","[""Creating a test set for a very rare category...","[""When building a classifier for a very rare c...",1,0,0,how can i create a test set for a very rare ca...,creating a test set for a very rare category c...,when building a classifier for a very rare cat...,how can i create a test set for a very rare ca...,creating a test set for a very rare category c...,when building a classifier for a very rare cat...
4,198779,koala-13b,gpt-3.5-turbo-0314,"[""What is the best way to travel from Tel-Aviv...","[""The best way to travel from Tel Aviv to Jeru...","[""The best way to travel from Tel-Aviv to Jeru...",0,1,0,what is the best way to travel from tel-aviv t...,the best way to travel from tel aviv to jerusa...,the best way to travel from tel-aviv to jerusa...,what is the best way to travel from telaviv to...,the best way to travel from tel aviv to jerusa...,the best way to travel from telaviv to jerusal...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57472,4294656694,gpt-4-0613,claude-1,"[""A simple mnemonic for \u03c0:\n\""How I wish ...","[""Sure, let's break it down:\n\n1. \""How\"" has...","[""Here is how that mnemonic represents the dig...",1,0,0,a simple mnemonic for π how i wish i could enu...,"sure, let's break it down 1. how has 3 letters...",here is how that mnemonic represents the digit...,a simple mnemonic for u03c0nhow i wish i could...,"sure, let's break it downnn1. how has 3 letter...",here is how that mnemonic represents the digit...
57473,4294692063,claude-2.0,llama-2-13b-chat,"[""In python, implement a naive Bayes with gaus...","[""Here is an implementation of a naive Bayes c...","[""Sure! Here's an implementation of a naive Ba...",1,0,0,"in python, implement a naive bayes with gaussi...",here is an implementation of a naive bayes cla...,sure! here's an implementation of a naive baye...,"in python, implement a naive bayes with gaussi...",here is an implementation of a naive bayes cla...,sure! here's an implementation of a naive baye...
57474,4294710549,claude-1,alpaca-13b,"[""is it unethical to work on building weapons?...","[""Working on weapons technology raises some et...","[""It depends on the context. Weapons can be us...",1,0,0,is it unethical to work on building weapons? a...,working on weapons technology raises some ethi...,it depends on the context. weapons can be used...,"is it unethical to work on building weapons?,a...",working on weapons technology raises some ethi...,it depends on the context. weapons can be used...
57475,4294899228,palm-2,tulu-2-dpo-70b,"[""If a bait contains 0,0025% bromadiolon then ...","[""Bromadiolone is a rodenticide which is most ...","[""As an AI language model, I do not promote or...",0,1,0,"if a bait contains 0, 0025 bromadiolon then ho...",bromadiolone is a rodenticide which is most of...,"as an ai language model, i do not promote or c...","if a bait contains 0,0025 bromadiolon then how...",bromadiolone is a rodenticide which is most of...,"as an ai language model, i do not promote or c..."
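One thing to keep in mind when reloading `cleaned_train.csv`: because `clean` deletes whole code blocks, a cleaned cell could end up as an empty string (an assumption, not verified on this data), and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below illustrates the round trip on an invented two-row frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Invented frame with one empty cleaned response.
toy = pd.DataFrame({"id": [1, 2], "clean_response_a": ["hello", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["clean_response_a"].isna().sum())
print(kept_read["clean_response_a"].tolist())
```

Either `keep_default_na=False` or a `fillna("")` after loading would avoid surprises in later string operations.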
