# Cleaning the data set
First, load the data from the CSV file.

In [1]:
import pandas as pd

train_df = pd.read_csv("../data/train.csv")

Second, we take a quick look at the data frame.

In [3]:
train_df.head()


Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie
0,30192,gpt-4-1106-preview,gpt-4-0613,"[""Is it morally right to try to have a certain...","[""The question of whether it is morally right ...","[""As an AI, I don't have personal beliefs or o...",1,0,0
1,53567,koala-13b,gpt-4-0613,"[""What is the difference between marriage lice...","[""A marriage license is a legal document that ...","[""A marriage license and a marriage certificat...",0,1,0
2,65089,gpt-3.5-turbo-0613,mistral-medium,"[""explain function calling. how would you call...","[""Function calling is the process of invoking ...","[""Function calling is the process of invoking ...",0,0,1
3,96401,llama-2-13b-chat,mistral-7b-instruct,"[""How can I create a test set for a very rare ...","[""Creating a test set for a very rare category...","[""When building a classifier for a very rare c...",1,0,0
4,198779,koala-13b,gpt-3.5-turbo-0314,"[""What is the best way to travel from Tel-Aviv...","[""The best way to travel from Tel Aviv to Jeru...","[""The best way to travel from Tel-Aviv to Jeru...",0,1,0


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57477 entries, 0 to 57476
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              57477 non-null  int64 
 1   model_a         57477 non-null  object
 2   model_b         57477 non-null  object
 3   prompt          57477 non-null  object
 4   response_a      57477 non-null  object
 5   response_b      57477 non-null  object
 6   winner_model_a  57477 non-null  int64 
 7   winner_model_b  57477 non-null  int64 
 8   winner_tie      57477 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 3.9+ MB


Now, we can start exploring the data.

In [5]:
#check missing values
train_df.isnull().sum()

id                0
model_a           0
model_b           0
prompt            0
response_a        0
response_b        0
winner_model_a    0
winner_model_b    0
winner_tie        0
dtype: int64

There are no missing values in the current data frame.
Next, we explore the distribution of the data.

* We have a total of 57477 rows.
* The ratio of preferences (model A : model B : tie) is 0.349 : 0.342 : 0.309. There is a slight imbalance between model A and model B, which might reflect position bias (people tending to prefer the first response they read), or it might simply be due to random chance.
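The ratio above can be reproduced directly from the three one-hot winner columns. The following is a minimal sketch that uses a small made-up frame (`toy_df`, not part of the notebook) in place of `train_df`; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "winner_model_a": [1, 0, 0, 1],
    "winner_model_b": [0, 1, 0, 0],
    "winner_tie":     [0, 0, 1, 0],
})

winner_cols = ["winner_model_a", "winner_model_b", "winner_tie"]

# The mean of a 0/1 column is the share of rows with that outcome.
class_share = toy_df[winner_cols].mean()

# Each row should carry exactly one winner label.
one_label_per_row = bool((toy_df[winner_cols].sum(axis=1) == 1).all())

print(class_share)
print("Exactly one label per row:", one_label_per_row)
```

Running the same two expressions on `train_df` should give the shares that also appear in the `mean` row of `describe()`.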

In [7]:
train_df.describe()

Unnamed: 0,id,winner_model_a,winner_model_b,winner_tie
count,57477.0,57477.0,57477.0,57477.0
mean,2142564000.0,0.349079,0.341911,0.309011
std,1238327000.0,0.476683,0.474354,0.46209
min,30192.0,0.0,0.0,0.0
25%,1071821000.0,0.0,0.0,0.0
50%,2133658000.0,0.0,0.0,0.0
75%,3211645000.0,1.0,1.0,1.0
max,4294947000.0,1.0,1.0,1.0


Now we check for duplicate rows.

* There are none.

In [8]:
duplicates = train_df.duplicated()
print("Number of duplicate rows:", duplicates.sum())

Number of duplicate rows: 0
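Note that `duplicated()` without arguments compares every column, including `id`, so two rows with identical text but different ids are not counted. A possible follow-up check, sketched here on a made-up frame (`toy_df` and its values are assumptions for illustration), restricts the comparison to the text columns:

```python
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "prompt": ["hi", "hi", "bye"],
    "response_a": ["hello", "hello", "see you"],
    "response_b": ["hey", "hey", "later"],
})

# Comparing every column finds nothing, because the ids differ.
full_row_dupes = int(toy_df.duplicated().sum())

# Comparing only the text columns flags the second "hi" row.
text_cols = ["prompt", "response_a", "response_b"]
content_dupes = int(toy_df.duplicated(subset=text_cols).sum())

print("Full-row duplicates:", full_row_dupes)
print("Content duplicates:", content_dupes)
```

Whether such content duplicates exist in the real data is not something the output above tells us; it would have to be checked on `train_df` itself.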


Now, we can start processing the data.
First, we convert all textual data (prompt, response A, response B) to lower case to reduce variation due to capitalization, remove special characters and punctuation, split the text into tokens, and drop stop words.

Let's investigate the current status of the data in the cells. Take the response_a of the first row as an example.


In [18]:
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

# define stop words
stop_words = set(stopwords.words("english"))

tmp = train_df.loc[0]["response_a"]
tmp

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cody_\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


'["The question of whether it is morally right to aim for a certain percentage of females in managerial positions is a complex ethical issue that involves considerations of fairness, equality, diversity, and discrimination.\\n\\nHere are some arguments in favor of and against such policies:\\n\\n**Arguments in favor:**\\n\\n1. **Correcting Historical Inequities:** Women have historically been underrepresented in leadership roles due to various cultural, institutional, and social barriers. Aiming for a specific percentage can be seen as a corrective measure to address past and ongoing discrimination.\\n\\n2. **Promoting Diversity:** Diverse leadership teams can enhance decision-making and represent a broader range of perspectives. This can lead to better outcomes for organizations and society as a whole.\\n\\n3. **Equality of Opportunity:** Setting targets for female representation in management can help ensure that women have equal opportunities to advance in their careers.\\n\\n4. **R

The text contains many escape sequences such as `\n`, which need to be removed. Together with markers like `**`, they suggest that the cells hold the original raw Markdown-like text.
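The cell shown above also looks like a JSON-encoded list of strings (`'["..."]'` with escaped newlines). If that assumption holds for every row, which has not been verified here, the escapes could be decoded with the standard `json` module before any regex cleaning. The helper name `decode_cell` and the sample string are hypothetical:

```python
import json

def decode_cell(cell):
    """Decode a JSON-encoded list of strings into one plain string.

    Hypothetical helper; assumes every cell is valid JSON holding a list.
    Non-string items (for example null) are skipped.
    """
    parts = json.loads(cell)
    return "\n".join(part for part in parts if isinstance(part, str))

# Made-up cell in the same shape as the raw data: literal backslash-n inside.
raw = '["First line.\\n\\n**Second** line.", null]'
decoded = decode_cell(raw)
print(repr(decoded))
```

After decoding, the text contains real newline characters instead of the two-character sequence backslash plus `n`, so a plain whitespace split handles them.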

In [42]:
def clean(text):
    """Return `text` lower-cased, with markup, digits, punctuation and stop words removed."""
    if not isinstance(text, str):
        raise TypeError("Found non-string object.")
    text = text.lower()
    # Replace with a space, in order: escaped \n, \' and \"; bold markers (**);
    # backslashes and square brackets; a digit plus the character after it;
    # common punctuation.
    text = re.sub(r'(\\[n\'\"])|(\*\*)|([\\\[\]])|([0123456789].)|([\:,\.\"\!\?])', " ", text)
    # Split on whitespace and keep only the tokens that are not stop words.
    return " ".join(word for word in text.split() if word not in stop_words)

clean(tmp)

"question whether morally right aim certain percentage females managerial positions complex ethical issue involves considerations fairness equality diversity discrimination arguments favor policies arguments favor correcting historical inequities women historically underrepresented leadership roles due various cultural institutional social barriers aiming specific percentage seen corrective measure address past ongoing discrimination promoting diversity diverse leadership teams enhance decision-making represent broader range perspectives lead better outcomes organizations society whole equality opportunity setting targets female representation management help ensure women equal opportunities advance careers role modeling increased visibility female leaders inspire encourage women girls pursue leadership roles arguments reverse discrimination setting quota female representation might lead perception reality reverse discrimination men might overlooked positions despite qualified simply m
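One side effect of the pattern is worth knowing: the `[0123456789].` alternative matches a digit plus whatever character follows it, so tokens such as `3d` disappear completely, while symbols like `%` survive. The snippet below copies the pattern from `clean` so that it runs on its own; the sample sentence is made up.

```python
import re

# Same pattern as in clean(), copied here so the snippet runs on its own.
pattern = r'(\\[n\'\"])|(\*\*)|([\\\[\]])|([0123456789].)|([\:,\.\"\!\?])'

# Made-up sample with a list number, a digit-letter token and a percentage.
sample = "1. buy 3d printer, 25% off!"
replaced = re.sub(pattern, " ", sample.lower())
tokens = replaced.split()
print(tokens)  # ['buy', 'printer', '%', 'off']
```

If digit-letter tokens matter for the downstream model, a narrower pattern for numbers might be preferable; that would be a design change and is not what the notebook currently does.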

Then we apply the cleaning function to the three text columns.

In [45]:
train_df['clean_prompt'] = train_df['prompt'].apply(clean)
train_df['clean_response_a'] = train_df['response_a'].apply(clean)
train_df['clean_response_b'] = train_df['response_b'].apply(clean)

Compare the first three rows before and after cleaning:

In [46]:
train_df['response_a'].head(3)

0    ["The question of whether it is morally right ...
1    ["A marriage license is a legal document that ...
2    ["Function calling is the process of invoking ...
Name: response_a, dtype: object

In [47]:
train_df['clean_response_a'].head(3)

0    question whether morally right aim certain per...
1    marriage license legal document allows couple ...
2    function calling process invoking executing fu...
Name: clean_response_a, dtype: object

Great! The cleaning process appears to work reasonably well.
Now we store the cleaned data in a separate .csv file and proceed.

In [49]:
train_df.to_csv("../data/cleaned_train.csv", index=False)
train_df

Unnamed: 0,id,model_a,model_b,prompt,response_a,response_b,winner_model_a,winner_model_b,winner_tie,clean_prompt,clean_response_a,clean_response_b
0,30192,gpt-4-1106-preview,gpt-4-0613,"[""Is it morally right to try to have a certain...","[""The question of whether it is morally right ...","[""As an AI, I don't have personal beliefs or o...",1,0,0,morally right try certain percentage females m...,question whether morally right aim certain per...,ai personal beliefs opinions however tell ques...
1,53567,koala-13b,gpt-4-0613,"[""What is the difference between marriage lice...","[""A marriage license is a legal document that ...","[""A marriage license and a marriage certificat...",0,1,0,difference marriage license marriage certifica...,marriage license legal document allows couple ...,marriage license marriage certificate two diff...
2,65089,gpt-3.5-turbo-0613,mistral-medium,"[""explain function calling. how would you call...","[""Function calling is the process of invoking ...","[""Function calling is the process of invoking ...",0,0,1,explain function calling would call function,function calling process invoking executing fu...,function calling process invoking function pro...
3,96401,llama-2-13b-chat,mistral-7b-instruct,"[""How can I create a test set for a very rare ...","[""Creating a test set for a very rare category...","[""When building a classifier for a very rare c...",1,0,0,create test set rare category want build class...,creating test set rare category challenging ma...,building classifier rare category creating tes...
4,198779,koala-13b,gpt-3.5-turbo-0314,"[""What is the best way to travel from Tel-Aviv...","[""The best way to travel from Tel Aviv to Jeru...","[""The best way to travel from Tel-Aviv to Jeru...",0,1,0,best way travel tel-aviv jerusalem car bus plane,best way travel tel aviv jerusalem depends per...,best way travel tel-aviv jerusalem depends per...
...,...,...,...,...,...,...,...,...,...,...,...,...
57472,4294656694,gpt-4-0613,claude-1,"[""A simple mnemonic for \u03c0:\n\""How I wish ...","[""Sure, let's break it down:\n\n1. \""How\"" has...","[""Here is how that mnemonic represents the dig...",1,0,0,simple mnemonic u c wish could enumerate pi ea...,sure let's break letters letter wish letters l...,mnemonic represents digits u c = letters -> u ...
57473,4294692063,claude-2.0,llama-2-13b-chat,"[""In python, implement a naive Bayes with gaus...","[""Here is an implementation of a naive Bayes c...","[""Sure! Here's an implementation of a naive Ba...",1,0,0,python implement naive bayes gaussian mixture ...,implementation naive bayes classifier gaussian...,sure here's implementation naive bayes classif...
57474,4294710549,claude-1,alpaca-13b,"[""is it unethical to work on building weapons?...","[""Working on weapons technology raises some et...","[""It depends on the context. Weapons can be us...",1,0,0,unethical work building weapons deterrents nec...,working weapons technology raises ethical ques...,depends context weapons used defensive purpose...
57475,4294899228,palm-2,tulu-2-dpo-70b,"[""If a bait contains 0,0025% bromadiolon then ...","[""Bromadiolone is a rodenticide which is most ...","[""As an AI language model, I do not promote or...",0,1,0,bait contains % bromadiolon much adult rat nee...,bromadiolone rodenticide often used bait secon...,ai language model promote condone use poison h...
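One thing to keep in mind when reloading `cleaned_train.csv`: if `clean` reduces a cell to an empty string (for example, a prompt made up only of stop words), that cell is written as an empty field, and `pd.read_csv` reads empty fields back as `NaN` by default. A minimal sketch with an in-memory buffer and made-up rows:

```python
import io
import pandas as pd

# Toy frame; the second prompt is empty, as if it held only stop words.
toy_df = pd.DataFrame({"id": [1, 2], "clean_prompt": ["best way travel", ""]})

buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Default behaviour: the empty string comes back as NaN (a float).
buffer.seek(0)
reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["clean_prompt"].isnull().sum())

# Replacing NaN with "" restores string-only columns.
restored = reloaded.fillna({"clean_prompt": ""})

print("Missing after reload:", n_missing)
print(restored["clean_prompt"].tolist())
```

Whether any empty cells actually occur in the cleaned data is not shown above; rerunning `isnull().sum()` after loading the file would confirm it.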
