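This notebook walks through a concrete method for evaluating different embedding models.
- Prepare a list of candidate embedding models (OpenAI, Cohere, e5-base-v2)
- Convert the dataset you intend to use into embeddings
- Randomly select a test set and compute an evaluation metric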

---

In [1]:
import pandas as pd
import os
import random
import cohere
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
import openai
from openai import OpenAI
from tqdm.notebook import tqdm
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# initialize openai
# (expects OPENAI_API_KEY to be set in the environment; never hard-code API keys in a notebook)
openai.api_key = os.environ["OPENAI_API_KEY"]

# initialize cohere
# (expects CO_API_KEY to be set in the environment)
co = cohere.Client(api_key=os.environ["CO_API_KEY"])

import warnings
warnings.filterwarnings('ignore')


### Read dataset

In [2]:
df = pd.read_csv("quora_dataset.csv")

In [3]:
df.head()

Unnamed: 0,text,id,duplicated_questions,length
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1
2,How can I be a good geologist?,15,[16],1
3,What should I do to be a great geologist?,16,[15],1
4,How do I read and find my YouTube comments?,23,[24],1


### 1. Playground

In [4]:
text1 = df.loc[2, 'text']
print(text1)

How can I be a good geologist?


In [5]:
text2 = df.loc[3, 'text']
print(text2)

What should I do to be a great geologist?


In [6]:
def create_embeddings(txt_list, provider='openai'):
    """Return one embedding per input text from the chosen provider."""
    # accept a single string as well as a list of strings
    if isinstance(txt_list, str):
        txt_list = [txt_list]

    if provider == 'openai':
        client = OpenAI()
        response = client.embeddings.create(
            input=txt_list,
            model="text-embedding-3-small",
        )
        return [r.embedding for r in response.data]

    elif provider == 'cohere':
        doc_embeds = co.embed(
            texts=txt_list,
            input_type="search_document",
            model="embed-english-v3.0",
        )
        return doc_embeds.embeddings
    else:
        raise ValueError(f"Unknown provider: {provider!r}")

In [7]:
emb1 = create_embeddings(df.loc[2, 'text'])
emb2 = create_embeddings(df.loc[3, 'text'])

In [8]:
from utils import cosine_similarity
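
The `utils` module is not shown in this notebook. A minimal sketch of what its `cosine_similarity` helper might look like, assuming it takes two equal-length vectors and returns a float (the body below is an assumption, not the author's code):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors.

    Hypothetical stand-in for utils.cosine_similarity.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Parallel vectors score (approximately) 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

The score depends only on the angle between the vectors, not on their lengths, which is why it is a common choice for comparing embeddings from different models.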

In [9]:
# similarity between two embeddings
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb2[0]), text1, text2))

Cosine similarity : 0.9152950144221176.
Sentences used : 
How can I be a good geologist?
What should I do to be a great geologist?


In [10]:
text3 = df.loc[4, 'text']

emb3 = create_embeddings(text3)
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb3[0]), text1, text3))

Cosine similarity : 0.18173412921294915.
Sentences used : 
How can I be a good geologist?
How do I read and find my YouTube comments?


In [11]:
text4 = df.loc[6, 'text']

emb4 = create_embeddings(text4)
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb4[0]), text1, text4))

Cosine similarity : 0.27957031195045084.
Sentences used : 
How can I be a good geologist?
What can make Physics easy to learn?


---

### 2. Building the embedding vector dataset

openai embeddings

In [None]:
# create embeddings (openai)
# (note: this call incurs API cost)
openai_emb = create_embeddings(df.text.tolist(), provider='openai')

In [None]:
# df['openai_emb'] = openai_emb

cohere embeddings

In [None]:
# create embeddings (cohere)
# (note: this call incurs API cost)
cohere_emb = create_embeddings(df.text.tolist(), 'cohere')

In [None]:
df['cohere_emb'] = cohere_emb
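
The two calls above pass the whole question list at once. Embedding APIs typically cap the number of inputs per request, so a long list may need to be split into batches. A minimal sketch, assuming a hypothetical `embed_in_batches` helper and an illustrative `batch_size` (check each provider's current limit); `embed_fn` stands for any function that maps a list of texts to a list of vectors:

```python
def embed_in_batches(texts, embed_fn, batch_size=256):
    """Call embed_fn on consecutive slices of texts and concatenate the results.

    embed_in_batches and batch_size=256 are illustrative assumptions,
    not part of any provider's SDK.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder so the sketch runs without an API key:
# each text is mapped to a one-dimensional "vector" holding its length.
def fake_embed(batch):
    return [[float(len(t))] for t in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # → [[1.0], [2.0], [3.0]]
```

With the notebook's function, the call might read `embed_in_batches(df.text.tolist(), lambda b: create_embeddings(b, provider='openai'))`.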

e5 embeddings

In [None]:
# load gpu if possible
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "intfloat/e5-base-v2"

# init tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

In [None]:
def create_e5_emb(docs, model):
    """
    Create embedding vectors with the e5 embedding model.
    """
    # e5 expects a "query: " or "passage: " prefix on every input text
    docs = [f"query: {d}" for d in docs]
    # tokenize
    tokens = tokenizer(
        docs, padding=True, max_length=512, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
        last_hidden = out.last_hidden_state.masked_fill( # from last hidden state
            ~tokens["attention_mask"][..., None].bool(), 0.0
        )
        # average out embeddings per token (non-padding)
        doc_embeds = last_hidden.sum(dim=1) / tokens["attention_mask"].sum(dim=1)[..., None]
    return doc_embeds.cpu().numpy()
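
The pooling step in `create_e5_emb` averages the token vectors while ignoring padding positions. The same arithmetic on a toy example in plain Python, with made-up numbers, may make the masking easier to follow:

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1; padding (mask 0) is skipped."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# One sentence with three positions; the last one is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(masked_mean(token_vectors, attention_mask))  # → [2.0, 3.0]
```

Without the mask, the padding vector would pull the average toward `[100.0, 100.0]`, so sentences of different lengths in the same batch would not be comparable.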

Warning: long runtime (about 2 hours)

In [None]:
# data = df.text.tolist()
# batch_size = 128

# for i in tqdm(range(0, len(data), batch_size)):
#     i_end = min(len(data), i+batch_size)
#     data_batch = data[i:i_end]
#     # embed current batch
#     embed_batch = create_e5_emb(data_batch, model)
#     if i == 0:
#         emb3 = embed_batch.copy()
#     else:
#         emb3 = np.concatenate([emb3, embed_batch.copy()])

In [None]:
# emb3 = [e.tolist() for e in emb3]  # .tolist() yields plain Python floats, which stay JSON-parsable once written to CSV
# df['e5_emb'] = emb3

In [None]:
# df.to_csv("quora_dataset_emb.csv", index=False)

Load the dataset with the embeddings already computed

In [12]:
df = pd.read_csv("quora_dataset_emb.csv")
# convert str -> list
import json
df['openai_emb'] = df['openai_emb'].apply(json.loads)
df['cohere_emb'] = df['cohere_emb'].apply(json.loads)
df['e5_emb'] = df['e5_emb'].apply(json.loads)
df['duplicated_questions'] = df['duplicated_questions'].apply(json.loads)
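
The `json.loads` calls are needed because a list stored in a CSV cell is read back as a string. A small standard-library sketch of that round trip, using a made-up two-row table:

```python
import csv
import io
import json

rows = [
    {"id": 1, "emb": [0.1, 0.2]},
    {"id": 2, "emb": [0.3, 0.4]},
]

# Write the table to an in-memory CSV; each list is serialized as JSON text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "emb"])
writer.writeheader()
for row in rows:
    writer.writerow({"id": row["id"], "emb": json.dumps(row["emb"])})

# Read it back: every cell is a str until it is parsed.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["emb"]))  # → <class 'str'>

parsed = [json.loads(r["emb"]) for r in loaded]
print(parsed)  # → [[0.1, 0.2], [0.3, 0.4]]
```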

In [13]:
df.head()

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1,"[-0.005765771958976984, -0.018585262820124626,...","[-0.05834961, -0.010795593, -0.04522705, 0.035...","[0.059878636, -0.15769655, -0.14131568, -0.546..."
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1,"[0.026014558970928192, -0.014319832436740398, ...","[-0.022338867, -0.0063285828, -0.057128906, 0....","[0.08937627, -0.2954505, -0.33455396, -0.32940..."
2,How can I be a good geologist?,15,[16],1,"[0.005276682320982218, 0.004194203298538923, 0...","[-0.012535095, 0.005092621, -0.033233643, -0.0...","[0.0825816, -0.09264662, -0.78053623, -0.32416..."
3,What should I do to be a great geologist?,16,[15],1,"[0.015116829425096512, 0.0010464431252330542, ...","[-0.013465881, 0.0018148422, -0.052612305, 0.0...","[-0.1653303, 0.19044468, -0.8906647, -0.364357..."
4,How do I read and find my YouTube comments?,23,[24],1,"[0.03505030274391174, -0.0010134828044101596, ...","[-0.0047836304, 0.028137207, -0.037231445, -0....","[0.50644577, -0.62657785, -0.2523397, -0.17112..."


### 3. Test set selection and retrieval

Randomly select the questions used for testing

In [14]:
# draw 1000 question ids at random (with replacement, so some ids may repeat)
test_query = random.choices(df.id.tolist(), k=1000)

In [15]:
test_query[:5]

[5874, 8462, 2999, 14788, 1591]

In [16]:
test = df.loc[df.id.isin(test_query)]
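
`random.choices` draws with replacement, so the same id can be picked more than once and `test` may end up with fewer unique questions than requested. A short sketch contrasting it with `random.sample`, which draws without replacement, on a made-up id list (the seed and sizes are illustrative only):

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable
ids = list(range(100))

with_replacement = random.choices(ids, k=50)
without_replacement = random.sample(ids, k=50)

# choices may repeat ids, so the unique count can fall below 50;
# sample never repeats, so the unique count is always exactly 50.
print(len(set(with_replacement)))
print(len(set(without_replacement)))
```

If an exact test-set size matters, `random.sample` is the safer choice; with `random.choices`, the effective size is the number of unique ids drawn.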

For each test question, retrieve the top-k most similar questions

In [18]:
# note: this import shadows the cosine_similarity helper imported from utils earlier
from sklearn.metrics.pairwise import cosine_similarity

def search_top_k(search_df, search_df_column, id, topk):
    """
    search_df : DataFrame to search over
    search_df_column : name of the embedding column used for the search
    id : test query id
    topk : number of most similar rows to return
    """
    query = search_df.loc[search_df['id']==id, search_df_column].values[0]
    query_reshaped = np.array(query).reshape(1, -1)
    
    # drop the query itself; copy so that adding a column does not write into a slice of the caller's df
    search_df = search_df.loc[search_df['id']!=id].copy()
    # cosine similarity in batch
    similarities = cosine_similarity(query_reshaped, np.vstack(search_df[search_df_column].values)).flatten()
    
    search_df['similarity'] = similarities
    
    # Get top-k indices
    # argpartition returns the top-k indices in arbitrary order,
    # hence we sort the topk indices again by descending similarity
    topk_indices = np.argpartition(similarities, -topk)[-topk:]
    topk_indices_sorted = topk_indices[np.argsort(-similarities[topk_indices])]
    
    # Retrieve the top-k results
    search_result = search_df.iloc[topk_indices_sorted]
    
    return search_result
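
`np.argpartition` finds the k largest similarities without fully sorting the array, but it returns them in no particular order, which is why the function sorts those k indices afterwards. The end result, the indices of the k highest scores in descending order, can be sketched with the standard library's `heapq` on made-up scores:

```python
import heapq

def top_k_indices(scores, k):
    """Return the indices of the k largest scores, highest first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

scores = [0.18, 0.91, 0.27, 0.64, 0.59]
print(top_k_indices(scores, 3))  # → [1, 3, 4]
```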


- For each test question, compute cosine_similarity against the entire dataset
- Retrieve the top-k questions separately for the OpenAI, Cohere, and e5 embeddings
- search_result format :
```json
{
    'question id' : pd.DataFrame holding the top-k most similar questions by cosine_sim,
    'question id' : ...
}
```

In [19]:
# the test question itself would always come out as the most similar one,
# so retrieve the top-5 excluding the test question
query_results_openai = { k:search_top_k(df, 'openai_emb', k, 5) for k in test.id }
query_results_cohere = { k:search_top_k(df, 'cohere_emb', k, 5) for k in test.id }
query_results_e5 = { k:search_top_k(df, 'e5_emb', k, 5) for k in test.id }

A peek at the test results

In [21]:
test.loc[test.length==3].tail()

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
5274,What same food should I eat every day to prote...,14182,"[6194, 854, 14183]",3,"[-0.004403913859277964, -0.04025837033987045, ...","[0.023086548, 0.03036499, -0.08929443, -0.0823...","[0.18870209, -0.3007043, -0.9383021, -0.120102..."
5280,What are some of the best investment plans?,14211,"[12694, 14212, 12695]",3,"[0.01397707685828209, -0.005568686407059431, 0...","[-0.024749756, -0.0109939575, 0.0053482056, 0....","[-0.4714, -0.439601, -0.8333078, -0.062841795,..."
5281,What's the best investment?,14212,"[12695, 14211, 12694]",3,"[0.0035537381190806627, 0.007597768679261208, ...","[0.0030536652, -0.020111084, 0.018203735, -0.0...","[-0.6153567, -0.45850173, -0.8385687, -0.05946..."
5403,What can I do to increase my memory,14636,"[10682, 10681, 14637]",3,"[0.014318248257040977, -0.01865561679005623, -...","[0.03640747, 0.02482605, -0.06390381, -0.04617...","[-0.057518903, -0.39781347, -0.66775864, -0.14..."
5464,Why has Dhoni left the captaincy from ODI and ...,14807,"[12515, 12516, 14806]",3,"[0.04666692391037941, 0.012602170929312706, 0....","[-0.0054779053, 0.009407043, -0.017623901, -0....","[-0.5722597, -0.7670558, -0.5822621, 0.1379486..."


In [25]:
test.loc[test['id']==14182, 'text'].values

array(['What same food should I eat every day to protect my health?'],
      dtype=object)

In [22]:
query_results_openai[14182]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
325,What can I eat every day to be more healthy?,854,"[1038, 14182, 7834, 1039, 855, 14183, 8569, 6194]",8,"[0.011896367184817791, -0.01070474460721016, -...","[0.02798462, 0.04119873, -0.091918945, -0.0729...","[-0.025168298, -0.6016844, -1.0017023, -0.2510...",0.772527
5275,Is it healthy to eat a tomato every day?,14183,"[6194, 1039, 14182, 8569, 855, 854, 7834, 1038]",8,"[-0.004980060271918774, -0.03401864692568779, ...","[0.03652954, 0.015296936, -0.04888916, -0.0508...","[-0.19459125, -0.570077, -0.99812454, -0.33799...",0.595202
2976,Is it bad for health to eat eggs every day?,7834,"[1039, 854, 7835, 8569, 1038, 14183, 6194, 855]",8,"[0.029307972639799118, -0.015874303877353668, ...","[0.045318604, 0.013931274, -0.057281494, -0.03...","[-0.16033827, -0.26646888, -0.9629145, -0.3015...",0.594892
400,Is it healthy to eat one chicken every day?,1039,"[854, 7834, 1038, 8569, 6194, 14183, 855]",7,"[0.055902957916259766, -0.027588749304413795, ...","[0.030822754, 0.008201599, -0.07324219, -0.066...","[0.089812614, -0.37340665, -1.0105903, -0.2102...",0.587362
2365,Is it healthy to eat bread every day?,6194,"[1038, 854, 7834, 1039, 14182, 8569, 855, 14183]",8,"[0.022213390097022057, -0.010902013629674911, ...","[0.0317688, 0.041259766, -0.07977295, -0.05123...","[0.088843346, -0.28678697, -1.2354667, -0.3420...",0.586861


### 4. Defining the scoring function

- Assign an accuracy score to each question
    - Accuracy score : of the questions retrieved as similar, how many are actually labeled as duplicates?

In [26]:
def score_accuracy(full_df, tmp_df, test_id):
    """
    Count how many of the questions retrieved for a test question are listed
    in its duplicated_questions, and return that count as a fraction.
    """
    duplicated_questions = full_df.loc[full_df['id'] == test_id, 'duplicated_questions'].values[0]

    # exclude the test question's own id
    filtered_df = tmp_df[tmp_df['id'] != test_id]
    # count how many retrieved ids appear among the labeled duplicates
    match_count = filtered_df['id'].isin(duplicated_questions).sum()

    # divide by the smaller of: number of retrieved rows, number of labeled duplicates
    denominator = min(filtered_df.shape[0], len(duplicated_questions))
    return match_count / denominator
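
To make the metric concrete: the score is the number of retrieved ids that appear among the labeled duplicates, divided by the smaller of the number of retrieved rows and the number of labeled duplicates. A standalone sketch of that arithmetic on plain lists (the helper name `accuracy_from_ids` is hypothetical; the ids are taken from the examples for questions 14182 and 985 in this notebook):

```python
def accuracy_from_ids(retrieved_ids, duplicate_ids):
    """matches / min(number retrieved, number of labeled duplicates)."""
    denominator = min(len(retrieved_ids), len(duplicate_ids))
    if denominator == 0:
        raise ValueError("need at least one retrieved id and one labeled duplicate")
    duplicates = set(duplicate_ids)
    matches = sum(1 for i in retrieved_ids if i in duplicates)
    return matches / denominator

# Question 14182: all three labeled duplicates are in the top-5.
print(accuracy_from_ids([854, 14183, 7834, 1039, 6194], [6194, 854, 14183]))  # → 1.0

# Question 985: its only labeled duplicate (984) is not retrieved.
print(accuracy_from_ids([7033, 7034, 11331, 3902, 7803], [984]))  # → 0.0
```

Because of the `min`, the score behaves like recall at k when a question has fewer than k labeled duplicates, and like precision at k when it has k or more.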


In [27]:
accuracy_openai = [score_accuracy(df, query_results_openai[i], i) for i in query_results_openai.keys()]
accuracy_cohere = [score_accuracy(df, query_results_cohere[i], i) for i in query_results_cohere.keys()]
accuracy_e5 = [score_accuracy(df, query_results_e5[i], i) for i in query_results_e5.keys()]

In [28]:
np.mean(accuracy_openai)

0.9552459016393442

In [29]:
np.mean(accuracy_cohere)

0.9557194899817851

In [30]:
np.mean(accuracy_e5)

0.9483060109289617

A peek at the incorrect results

In [33]:
indices = [index for index, value in enumerate(accuracy_openai) if value <= 0.5]

In [34]:
indices

[57,
 60,
 102,
 116,
 120,
 121,
 226,
 228,
 285,
 340,
 364,
 393,
 427,
 455,
 459,
 472,
 478,
 479,
 497,
 523,
 527,
 538,
 547,
 608,
 646,
 683,
 738,
 768,
 796,
 883,
 885,
 910]

In [35]:
list(query_results_openai.keys())[60]

985

In [37]:
test.loc[test['id']==985]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
378,What are some tricks to study effectively?,985,[984],1,"[0.022660082206130028, -0.025483181700110435, ...","[0.012863159, 0.02041626, -0.017868042, 0.0209...","[0.2421261, -0.087377824, -0.7994292, -0.37665..."


In [38]:
test.loc[test['id']==984]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
377,How should I study,984,[985],1,"[0.004720842465758324, -0.04900241643190384, 0...","[-0.017440796, 0.02861023, -0.051086426, -0.03...","[0.44418278, -0.010970661, -0.6834363, -0.4839..."


In [36]:
query_results_openai[985]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
2674,What are some good tips for self study?,7033,[7034],1,"[0.011717529967427254, -0.036956727504730225, ...","[-0.004623413, 0.0098724365, -0.028701782, 0.0...","[0.042563524, -0.11200364, -0.805879, -0.17195...",0.709618
2675,What are some tips for self study?,7034,[7033],1,"[0.013052663765847683, -0.036678362637758255, ...","[-0.0050735474, 0.015113831, -0.035949707, 0.0...","[0.0115949465, -0.0983381, -0.8623934, -0.2426...",0.696768
4327,What are the best ways to study/memorize things?,11331,[11332],1,"[-0.01611529476940632, -0.020404629409313202, ...","[-0.018630981, -0.0075263977, -0.024902344, -0...","[0.3084511, 0.2693039, -0.69890785, -0.1166682...",0.661062
1456,How do I concentrate better in my studies?,3902,[3903],1,"[0.008100658655166626, -0.02210518717765808, -...","[0.010749817, 0.043182373, -0.024032593, -0.00...","[0.48886126, -0.24943542, -0.675636, -0.476426...",0.646608
2961,How do I study well without getting distracted?,7803,[7802],1,"[0.0016504509840160608, -0.03981846570968628, ...","[0.041503906, 0.03475952, -0.02571106, -0.0210...","[0.45737946, -0.23788102, -0.9406135, -0.41269...",0.64531


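#### Conclusion

- Cohere, OpenAI, and e5 all perform very well (mean accuracy of about 0.95 to 0.96 on this test set), so any of them can be used directly for most tasks.
- When you want to use a local embedding model, check its classification performance and resource requirements with the method above.
- Ways to evaluate performance
    - Use a labeled dataset
    - Qualitative evaluation
        - When there is not enough labor available to label the data
        - For domains where labeling is ambiguous (no single correct answer)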
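
For the resource check on a local model, one simple starting point is to time the embedding function on a sample of texts. A rough sketch, where `measure_throughput` and the dummy embedder are hypothetical stand-ins (a real call such as `lambda b: create_e5_emb(b, model)` could be swapped in):

```python
import time

def measure_throughput(embed_fn, texts, batch_size=32):
    """Return the number of texts embedded per second by embed_fn."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        embed_fn(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")

def dummy_embed(batch):
    # stand-in for a real model call
    return [[float(len(t))] for t in batch]

rate = measure_throughput(dummy_embed, ["some question"] * 1000)
print(f"{rate:.0f} texts/sec")
```

Wall-clock throughput is only one part of the picture; memory use and hardware cost would need to be checked separately.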

--END--