# Data Preprocessing
This is the first of three notebooks (1/3). It imports the data and formats it for clustering, including embedding and dimensionality reduction.

In [None]:
%%capture
# set up the Colab environment and suppress output (the cell magic must be the first line of the cell)

!pip install -U sentence-transformers
!pip install umap-learn

In [None]:
import ast
import csv
import pandas as pd
import numpy as np
import umap
from sentence_transformers import SentenceTransformer

## Load Dataset
This project uses the publicly available [Chatbot Arena human preference dataset](https://huggingface.co/datasets/lmsys/lmsys-arena-human-preference-55k), which contains over 55 thousand conversations between real humans and LLMs. Loading it may trigger a warning about a missing `HF_TOKEN`, but a token is not required for this public dataset.

In [None]:
df = pd.read_csv('hf://datasets/lmsys/lmsys-arena-human-preference-55k/train.csv')

# keep columns of interest
df = df[['id', 'prompt', 'winner_tie', 'model_a', 'winner_model_a', 'model_b', 'winner_model_b']]

# drop duplicates
df = df.drop_duplicates(
    subset=['prompt', 'winner_tie', 'model_a', 'winner_model_a', 'model_b', 'winner_model_b']
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


For the scope of this project, we only consider single-turn conversations, meaning conversations in which the human sends exactly one prompt. Rows with more than one prompt are dropped.

In [None]:
# drop any row with more than one list element in prompt
df = df[df['prompt'].apply(lambda x: len(ast.literal_eval(x)) == 1)]
df = df.reset_index(drop=True)

df['prompt'] = df['prompt'].apply(ast.literal_eval)

# convert prompts to strings
df['prompt'] = df['prompt'].apply(lambda x: x[0] if x else '')
df = df[df['prompt'] != ''] # drop empty strings
df = df.reset_index(drop=True)
df['prompt'] = df['prompt'].apply(str)

# remove characters that cannot be encoded as UTF-8 (e.g. lone surrogates)
df['prompt'] = df['prompt'].apply(
    lambda x: x.encode('utf-8', 'ignore').decode('utf-8')
)
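
To make the filtering above concrete, here is a minimal sketch on a few made-up prompt strings. The helper name `single_prompt` and the sample strings are hypothetical and only serve as illustration; the steps mirror the cell above (parse, keep single-prompt rows, drop empty prompts, strip characters that cannot be encoded as UTF-8).

```python
import ast

# Toy stand-ins for the raw 'prompt' column: each value is a stringified list.
raw_prompts = [
    '["What is UMAP?"]',               # one prompt: kept
    '["Hi", "Tell me more"]',          # two prompts: dropped
    '[""]',                            # one empty prompt: dropped
    '["emoji half \\ud83d here"]',     # contains a lone surrogate escape
]

def single_prompt(raw):
    """Return the cleaned prompt text, or None if the row should be dropped."""
    parsed = ast.literal_eval(raw)          # string -> Python list
    if len(parsed) != 1 or not parsed[0]:   # keep only one non-empty prompt
        return None
    text = str(parsed[0])
    # Characters that cannot be encoded as UTF-8 are silently dropped.
    return text.encode('utf-8', 'ignore').decode('utf-8')

kept = [p for p in map(single_prompt, raw_prompts) if p is not None]
print(kept)
```

Only the first and last toy rows survive, and the last one loses its unencodable character.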

In [None]:
df.head()

Unnamed: 0,id,prompt,winner_tie,model_a,winner_model_a,model_b,winner_model_b
0,65089,explain function calling. how would you call a...,1,gpt-3.5-turbo-0613,0,mistral-medium,0
1,96401,How can I create a test set for a very rare ca...,0,llama-2-13b-chat,1,mistral-7b-instruct,0
2,198779,What is the best way to travel from Tel-Aviv t...,0,koala-13b,0,gpt-3.5-turbo-0314,1
3,292873,"Construct a rap battle, in the style of Epic R...",0,vicuna-13b,0,gpt-4-0314,1
4,313413,Why water is not used in bath tub?,0,mixtral-8x7b-instruct-v0.1,1,vicuna-13b,0


## Model Size Data
We have manually labeled each model in the dataset as either 'big' or 'small', depending on whether it has more or fewer than 65 billion parameters. In the `size` column, 1 means big and 0 means small.

In [None]:
# read model size csv from github
github_url = "https://github.com/bttgroup45/DialogueDecoded/raw/main/Model%20Size%20Classification.csv"
model_size = pd.read_csv(github_url)

model_size.head()

Unnamed: 0,model,size
0,alpaca-13b,0
1,chatglm-6b,0
2,chatglm2-6b,0
3,chatglm3-6b,0
4,claude-1,1


In [None]:
# add size columns to df
df = df.merge(model_size, how='left', left_on='model_a', right_on='model')
df.rename(columns={'size': 'model_a_big'}, inplace=True)
df.drop(columns=['model'], inplace=True)
df = df.merge(model_size, how='left', left_on='model_b', right_on='model')
df.rename(columns={'size': 'model_b_big'}, inplace=True)
df.drop(columns=['model'], inplace=True)

# add column that indicates a small and big model tie
small_big_tie = (
    (df['model_a_big'] == 0) & (df['model_b_big'] == 1) & (df['winner_tie'] == 1)
) | (
    (df['model_a_big'] == 1) & (df['model_b_big'] == 0) & (df['winner_tie'] == 1)
)
df['small_big_tie'] = np.where(small_big_tie, 1, 0)

# add column that indicates if a small model beats a big model
small_beat_big = (
    (df['model_a_big'] == 0) & (df['model_b_big'] == 1) & (df['winner_model_a'] == 1)
) | (
    (df['model_a_big'] == 1) & (df['model_b_big'] == 0) & (df['winner_model_b'] == 1)
)
df['small_beat_big'] = np.where(small_beat_big, 1, 0)

# update column order
new_col_order = ['id', 'prompt', 'small_big_tie', 'small_beat_big', 'winner_tie', 'winner_model_a', 'model_a_big', 'model_a', 'winner_model_b', 'model_b_big', 'model_b']
df = df[new_col_order]

df.head()

Unnamed: 0,id,prompt,small_big_tie,small_beat_big,winner_tie,winner_model_a,model_a_big,model_a,winner_model_b,model_b_big,model_b
0,65089,explain function calling. how would you call a...,0,0,1,0,0,gpt-3.5-turbo-0613,0,0,mistral-medium
1,96401,How can I create a test set for a very rare ca...,0,0,0,1,0,llama-2-13b-chat,0,0,mistral-7b-instruct
2,198779,What is the best way to travel from Tel-Aviv t...,0,0,0,0,0,koala-13b,1,0,gpt-3.5-turbo-0314
3,292873,"Construct a rap battle, in the style of Epic R...",0,0,0,0,0,vicuna-13b,1,1,gpt-4-0314
4,313413,Why water is not used in bath tub?,0,0,0,1,0,mixtral-8x7b-instruct-v0.1,0,0,vicuna-13b
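
The flag logic above can be checked on a small toy table. The sketch below uses made-up model names and, as an alternative to the merge/rename/drop sequence, looks sizes up with `Series.map`; this is one possible formulation, not the code used in this notebook. It also flags rows whose model is missing from the size table, since those get `NaN` sizes and silently fail every comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical battles and size labels (names are invented for illustration).
battles = pd.DataFrame({
    'model_a': ['small-7b', 'big-70b', 'small-7b', 'unknown-model'],
    'model_b': ['big-70b', 'small-7b', 'small-13b', 'big-70b'],
    'winner_model_a': [1, 0, 1, 0],
    'winner_model_b': [0, 0, 0, 1],
    'winner_tie': [0, 1, 0, 0],
})
sizes = pd.DataFrame({'model': ['small-7b', 'small-13b', 'big-70b'],
                      'size': [0, 0, 1]})

# Look up each model's size; models absent from the table become NaN.
lookup = sizes.set_index('model')['size']
battles['model_a_big'] = battles['model_a'].map(lookup)
battles['model_b_big'] = battles['model_b'].map(lookup)

# Exactly one of the two models is big (NaN sizes compare as False).
mixed = battles['model_a_big'] + battles['model_b_big'] == 1
# In a mixed battle, the small model won if its side's winner flag is 1.
small_won = np.where(battles['model_a_big'] == 0,
                     battles['winner_model_a'],
                     battles['winner_model_b']) == 1

battles['small_big_tie'] = (mixed & (battles['winner_tie'] == 1)).astype(int)
battles['small_beat_big'] = (mixed & small_won).astype(int)

# Rows where at least one model had no size label.
unmatched = battles[['model_a_big', 'model_b_big']].isna().any(axis=1)
print(battles[['small_big_tie', 'small_beat_big']])
print(unmatched.tolist())
```

Checking `unmatched` before analysis helps confirm that every model in the dataset has a size label.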


## Prompt Embedding with Sentence Transformers
Strings have to be converted to numeric vectors (embeddings) before they can be processed. Traditionally, this is done with TF-IDF. A more recent approach is sentence transformers, which is what we use here.

Either way, each text is mapped to a point in a high-dimensional vector space. With sentence transformers, texts with similar semantic meanings end up close together.
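
For intuition about "text as vectors", here is a toy TF-IDF sketch in pure Python with cosine similarity as the closeness measure. It is not the method used in this notebook (we use sentence transformers); the three sentences and the simple weighting (term frequency times `log(N / document frequency)`) are assumptions chosen for illustration.

```python
import math
from collections import Counter

docs = [
    "how do i sort a list in python",
    "sorting a python list",
    "best pizza in new york",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Number of documents each word appears in.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    """One weight per vocabulary word: term frequency times inverse document frequency."""
    counts = Counter(doc)
    return [counts[w] / len(doc) * math.log(len(docs) / doc_freq[w]) for w in vocab]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vectors = [tfidf(doc) for doc in tokenized]
sim_related = cosine(vectors[0], vectors[1])    # both about sorting a Python list
sim_unrelated = cosine(vectors[0], vectors[2])  # share only the word "in"
print(sim_related > sim_unrelated)
```

Note that this toy TF-IDF treats "sort" and "sorting" as unrelated tokens, so the similarity comes only from exactly shared words. Sentence transformers are designed to capture meaning beyond exact word overlap.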

In [None]:
prompts = df['prompt'].tolist()

# embedding all sentences (resource intensive)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings = model.encode(prompts, convert_to_tensor=False)

# add embeddings to dataframe
embedList = embeddings.tolist()
df['embeddings'] = embedList
print(embeddings.shape)

(49853, 384)


## Dimensionality Reduction
The embeddings live in a 384-dimensional vector space. Before clustering, we use UMAP, a dimensionality reduction tool, to project them into fewer dimensions.

We reduce each embedding to 25 dimensions and, separately, to 2 dimensions, and store each result in its own column.

In [None]:
embeddings = np.array(df['embeddings'].tolist())

# reduce to 25
reduced_25 = umap.UMAP(
    n_neighbors=15,
    min_dist=0.0,
    n_components=25,
    random_state=42,
  ).fit_transform(embeddings)

df['reduced_25'] = reduced_25.tolist()

# reduce to 2
reduced_2 = umap.UMAP(
    n_neighbors=15,
    min_dist=0.0,
    n_components=2,
    random_state=42,
  ).fit_transform(embeddings)

df['reduced_2'] = reduced_2.tolist()



In [None]:
df.head(3)

Unnamed: 0,id,prompt,small_big_tie,small_beat_big,winner_tie,winner_model_a,model_a_big,model_a,winner_model_b,model_b_big,model_b,embeddings,reduced_25,reduced_2
0,65089,explain function calling. how would you call a...,0,0,1,0,0,gpt-3.5-turbo-0613,0,0,mistral-medium,"[-0.7643815279006958, -0.007776608690619469, -...","[8.414632797241211, 4.291245937347412, 5.09297...","[4.365998268127441, 3.940767526626587]"
1,96401,How can I create a test set for a very rare ca...,0,0,0,1,0,llama-2-13b-chat,0,0,mistral-7b-instruct,"[-0.2905693054199219, 0.005334902089089155, -0...","[8.428787231445312, 4.023136138916016, 4.84389...","[4.172776699066162, 1.9042508602142334]"
2,198779,What is the best way to travel from Tel-Aviv t...,0,0,0,0,0,koala-13b,1,0,gpt-3.5-turbo-0314,"[0.8432239294052124, 0.81766277551651, 0.25694...","[7.988311290740967, 3.9170477390289307, 5.0029...","[6.135153293609619, -2.5313286781311035]"
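
As the output above shows, each vector is stored as a Python list inside a single DataFrame column. The following minimal sketch shows that round trip on random stand-in data (the array here is not real UMAP output, and the column name `reduced` is hypothetical).

```python
import numpy as np
import pandas as pd

# Stand-in for a reduced embedding matrix: 4 rows, 3 dimensions.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 3))

toy = pd.DataFrame({'id': range(4)})
toy['reduced'] = vectors.tolist()   # one Python list per row

# Rebuild a 2-D array from the column of lists.
restored = np.array(toy['reduced'].tolist())
print(restored.shape)
```

Converting back with `np.array(column.tolist())` gives an array of shape `(rows, dimensions)`, the layout most clustering libraries expect.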


### Export for Next Notebook
Data can be passed between notebooks in a number of ways. We use Python pickle files because they load quickly and preserve data types.

This Pickle file will be loaded in at the beginning of the next notebook where we will cluster and label the prompts.

In [None]:
df.to_pickle('Notebook1_Output.pkl')
print('done')

done
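
To illustrate the point about data types, here is a small sketch comparing a pickle round trip with a CSV round trip for a list-valued column. The toy DataFrame and file name are made up for the example.

```python
import io
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'id': [1, 2], 'reduced_2': [[0.5, 1.5], [2.0, -1.0]]})

# Pickle round trip: list-valued cells come back as lists.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.pkl')
    toy.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# CSV round trip: the same cells come back as plain strings.
from_csv = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(type(from_pickle.loc[0, 'reduced_2']))
print(type(from_csv.loc[0, 'reduced_2']))
```

With CSV, the list columns would need to be parsed again (for example with `ast.literal_eval`) after loading. One caveat with pickle: only load pickle files from sources you trust, since unpickling can execute arbitrary code.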


If running this in Google Colab, the file can be downloaded with the following block.

In [None]:
# download from google colab env
from google.colab import files
files.download('Notebook1_Output.pkl')
