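This notebook walks through a concrete way to evaluate several embedding models.
- Prepare a list of candidate embedding models (OpenAI, Cohere, e5-base-v2)
- Convert the dataset you want to use into embeddings
- Randomly select a test set and compute an evaluation metric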

---

In [29]:
import pandas as pd
import os
import heapq
import random
import cohere
import torch
import numpy as np
from transformers import AutoModel, AutoTokenizer
import openai
from openai import OpenAI
from tqdm.notebook import tqdm
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# initialize openai
# the API key is read from the OPENAI_API_KEY environment variable; never hard-code it in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

# initialize cohere
# cohere.Client() picks up the key from the CO_API_KEY environment variable
co = cohere.Client()

import warnings
warnings.filterwarnings('ignore')



### Read dataset

In [3]:
df = pd.read_csv("quora_dataset.csv")

In [4]:
df.head(3)

Unnamed: 0,text,id,duplicated_questions,length
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1
2,How can I be a good geologist?,15,[16],1


### Generate test embeddings

In [5]:
text1 = df.loc[0, 'text']
print(text1)

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?


In [6]:
text2 = df.loc[1, 'text']
print(text2)

I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?


In [7]:
def create_embeddings(txt_list, provider='openai'):
    # accept a single string as well as a list of strings
    if isinstance(txt_list, str):
        txt_list = [txt_list]

    if provider == 'openai':
        client = OpenAI()
        response = client.embeddings.create(
            input=txt_list,
            model="text-embedding-3-small")
        return [r.embedding for r in response.data]

    elif provider == 'cohere':
        doc_embeds = co.embed(
            texts=txt_list,
            input_type="search_document",
            model="embed-english-v3.0")
        return doc_embeds.embeddings

    else:
        raise ValueError(f"Unknown provider: {provider!r}")

In [8]:
emb1 = create_embeddings(df.loc[0, 'text'])
emb2 = create_embeddings(df.loc[1, 'text'])

In [9]:
from utils import cosine_similarity

In [10]:
# similarity between two embeddings
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb2[0]), text1, text2))

Cosine similarity : 0.8214627029973013.
Sentences used : 
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?
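
The `utils` module that provides `cosine_similarity` is not shown in this notebook. As a rough idea of what such a helper computes, here is a minimal sketch in plain Python; the name `cosine_similarity_sketch` is made up for illustration and is not the helper the notebook imports.

```python
import math

def cosine_similarity_sketch(a, b):
    """Cosine similarity between two equal-length vectors (plain Python)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# vectors pointing the same way score 1, orthogonal vectors score 0
print(cosine_similarity_sketch([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity_sketch([1.0, 0.0], [0.0, 1.0]))
```

Because the score depends only on direction, the two paraphrased Capricorn questions land close to 1 while unrelated questions land close to 0.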


In [11]:
text3 = df.loc[100, 'text']
print(text3)

If I do not monetize YouTube videos & upload copyright content, then are there chances that Google may block my account?


In [12]:
emb3 = create_embeddings(text3)
print("Cosine similarity : {}.\nSentences used : \n{}\n{}".format(cosine_similarity(emb1[0], emb3[0]), text1, text3))

Cosine similarity : 0.09266203744424949.
Sentences used : 
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
If I do not monetize YouTube videos & upload copyright content, then are there chances that Google may block my account?


---

### Build a dataset that includes the embeddings

openai embeddings

In [13]:
# # create embeddings (openai)
# # (note: this call incurs API costs)
# openai_emb = create_embeddings(df.text.tolist(), provider='openai')

In [14]:
# df['openai_emb'] = openai_emb

cohere embeddings

In [159]:
# create embeddings (cohere)
# (note: this call incurs API costs)
# cohere_emb = create_embeddings(df.text.tolist(), 'cohere')

In [160]:
# df['cohere_emb'] = cohere_emb

e5 embeddings

In [15]:
# use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "intfloat/e5-base-v2"

# init tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [16]:
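def create_e5_emb(docs, model):
    """
    Create embedding vectors with the e5 embedding model.

    docs  : list of strings to embed
    model : the loaded e5 model (the module-level tokenizer and device are used as well)
    """
    # e5 models expect a task prefix on every input; "query: " is used for all texts here
    docs = [f"query: {d}" for d in docs]
    # tokenize
    tokens = tokenizer(
        docs, padding=True, max_length=512, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
        # zero out the hidden states of padding tokens
        last_hidden = out.last_hidden_state.masked_fill(
            ~tokens["attention_mask"][..., None].bool(), 0.0
        )
        # mean pooling over the real (non-padding) tokens only
        doc_embeds = last_hidden.sum(dim=1) / tokens["attention_mask"].sum(dim=1)[..., None]
    return doc_embeds.cpu().numpy()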
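
The pooling step in `create_e5_emb` averages token vectors while ignoring padding. The following toy example reproduces that arithmetic with NumPy on hand-made numbers so the effect of the attention mask is visible; `masked_mean_pool` and the arrays are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(last_hidden, attention_mask):
    """Average token vectors over real (non-padding) tokens only.

    last_hidden:    (batch, seq_len, hidden) array of token vectors
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(last_hidden.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden * mask).sum(axis=1)  # padding rows contribute 0
    counts = mask.sum(axis=1)                  # number of real tokens per sequence
    return summed / counts

# toy batch: 2 sequences, 3 token slots, hidden size 2
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # third slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The large values in the padded slot do not leak into the first sequence's average, which is the point of masking before the sum.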

Note: long runtime (about 2 hours)

In [14]:
# data = df.text.tolist()
# batch_size = 128

# for i in tqdm(range(0, len(data), batch_size)):
#     i_end = min(len(data), i+batch_size)
#     data_batch = data[i:i_end]
#     # embed current batch
#     embed_batch = create_e5_emb(data_batch, model)
#     if i == 0:
#         emb3 = embed_batch.copy()
#     else:
#         emb3 = np.concatenate([emb3, embed_batch.copy()])
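
The loop above grows the result with `np.concatenate` on every batch. An alternative, sketched here under the assumption that the embedding function returns one vector per input text, is to collect the vectors in a list and convert once at the end. `iter_batches`, `embed_in_batches`, and the stand-in `fake_embed` are hypothetical names used only for this illustration.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size):
    """Apply `embed_fn` batch by batch and return one flat list of vectors."""
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

def fake_embed(batch):
    """Stand-in embedder for illustration: one 1-d 'vector' per text."""
    return [[float(len(text))] for text in batch]

result = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(result)
```

In the notebook's setting, `embed_fn` would be something like `lambda batch: create_e5_emb(batch, model)`.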


In [15]:
# emb3 = [list(e) for e in emb3]
# df['e5_emb'] = emb3

In [16]:
# df.to_csv("quora_dataset_emb.csv", index=False)

Load the dataset with precomputed embeddings

In [18]:
df = pd.read_csv("quora_dataset_emb.csv")
# convert str -> list
import json
df['openai_emb'] = df['openai_emb'].apply(json.loads)
df['cohere_emb'] = df['cohere_emb'].apply(json.loads)
df['e5_emb'] = df['e5_emb'].apply(json.loads)
df['duplicated_questions'] = df['duplicated_questions'].apply(json.loads)
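
The conversion is needed because a list stored in a CSV cell comes back as text. A small standalone illustration of that round trip follows; the three numbers are the first values of an `e5_emb` cell shown in the output of this notebook.

```python
import json

# a Python list written to a CSV cell is stored as its string form
stored = str([0.059878636, -0.15769655, -0.14131568])
print(type(stored).__name__, stored)

# json.loads turns the bracketed text back into a list of floats
restored = json.loads(stored)
print(type(restored).__name__, len(restored))
```

This works as long as the cell text is valid JSON, which holds for plain lists of numbers.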

In [18]:
df.head()

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1,"[-0.005765771958976984, -0.018585262820124626,...","[-0.05834961, -0.010795593, -0.04522705, 0.035...","[0.059878636, -0.15769655, -0.14131568, -0.546..."
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1,"[0.026014558970928192, -0.014319832436740398, ...","[-0.022338867, -0.0063285828, -0.057128906, 0....","[0.08937627, -0.2954505, -0.33455396, -0.32940..."
2,How can I be a good geologist?,15,[16],1,"[0.005276682320982218, 0.004194203298538923, 0...","[-0.012535095, 0.005092621, -0.033233643, -0.0...","[0.0825816, -0.09264662, -0.78053623, -0.32416..."
3,What should I do to be a great geologist?,16,[15],1,"[0.015116829425096512, 0.0010464431252330542, ...","[-0.013465881, 0.0018148422, -0.052612305, 0.0...","[-0.1653303, 0.19044468, -0.8906647, -0.364357..."
4,How do I read and find my YouTube comments?,23,[24],1,"[0.03505030274391174, -0.0010134828044101596, ...","[-0.0047836304, 0.028137207, -0.037231445, -0....","[0.50644577, -0.62657785, -0.2523397, -0.17112..."


Select random questions to use for testing

In [19]:
# sample 1000 question ids at random
# (random.choices samples with replacement, so the same id can be drawn more than once)
test_query = random.choices(df.id, k=1000)

In [20]:
test_query[:5]

[14196, 8891, 570, 6892, 13756]

In [21]:
test = df.loc[df.id.isin(test_query)]
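
Since `random.choices` draws with replacement, the `isin` filter can end up with fewer distinct test rows than the number of draws. The sketch below contrasts it with `random.sample`, which draws without replacement; the id range and seed are arbitrary values chosen for the illustration.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same draws
ids = list(range(20))

with_replacement = random.choices(ids, k=15)    # the same id may appear twice
without_replacement = random.sample(ids, k=15)  # every id appears at most once

print(len(with_replacement), len(set(with_replacement)))
print(len(without_replacement), len(set(without_replacement)))
```

If an exact test-set size matters, `random.sample` is the safer choice; otherwise report the number of distinct ids actually evaluated.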

Retrieve the top-k most similar questions for each test question

In [26]:
from sklearn.metrics.pairwise import cosine_similarity


def search_top_k(search_df, search_df_column, id, topk):
    """
    search_df : dataframe to search in
    search_df_column : name of the embedding column used for the search
    id : test query id
    topk : number of most similar rows to return
    """
    query = search_df.loc[search_df['id']==id, search_df_column].values[0]
    query_reshaped = np.array(query).reshape(1, -1)

    # drop the query itself; copy so that adding a column does not raise SettingWithCopyWarning
    search_df = search_df.loc[search_df['id']!=id].copy()
    # cosine similarity in batch
    similarities = cosine_similarity(query_reshaped, np.vstack(search_df[search_df_column].values)).flatten()

    search_df['similarity'] = similarities

    # Get top-k indices
    # argpartition returns the top-k indices in arbitrary order,
    # hence we sort the topk indices again so they are ordered by similarity
    topk_indices = np.argpartition(similarities, -topk)[-topk:]
    topk_indices_sorted = topk_indices[np.argsort(-similarities[topk_indices])]

    # Retrieve the top-k results
    search_result = search_df.iloc[topk_indices_sorted]

    return search_result
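
The two-step selection in `search_top_k` (partition first, then sort only the k survivors) can be tried in isolation. This is a minimal sketch on made-up similarity scores; `top_k_indices` is a hypothetical helper, and the clamp of `k` to the array size is an addition that the notebook function does not have.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, ordered from highest to lowest."""
    scores = np.asarray(scores)
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, scores.size)  # avoid an out-of-range partition index
    candidates = np.argpartition(scores, -k)[-k:]       # top-k, arbitrary order
    return candidates[np.argsort(-scores[candidates])]  # sort only those k

sims = np.array([0.10, 0.88, 0.35, 0.79, 0.81, 0.05])
print(top_k_indices(sims, 3))
```

Partitioning avoids sorting the full similarity array, which helps when every query is compared against the whole dataset.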


- Note: long run time (about 10 minutes)
    - for every test question, cosine_similarity is computed against the entire dataset
    - this is repeated for each embedding column (openai, cohere, e5) over all sampled test questions

In [30]:
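# The original intent: the test question itself would rank as the most similar row,
# so top-6 is requested in order to obtain a top-5 that excludes it.
# Note that search_top_k already drops the query id, so six other questions are returned per query.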
query_results_openai = { k:search_top_k(df, 'openai_emb', k, 6) for k in test.id }
query_results_cohere = { k:search_top_k(df, 'cohere_emb', k, 6) for k in test.id }
query_results_e5 = { k:search_top_k(df, 'e5_emb', k, 6) for k in test.id }

A peek at the test results

In [31]:
test.loc[test.length==3].head()

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
216,How can I make money online with free of cost?,568,"[569, 8268, 5511]",3,"[0.020787376910448074, 0.011646389029920101, 0...","[0.023849487, 0.003293991, -0.027053833, -0.01...","[-0.07150113, -0.5132854, -0.5708973, 0.100144..."
312,Are there any good incest movie?,809,"[8160, 808, 8159]",3,"[-0.04257148504257202, 0.027882514521479607, -...","[-0.0037841797, 0.025253296, -0.023956299, 0.0...","[-0.46425477, -0.41904235, -0.80047005, -0.096..."
331,How I start prepare for UGC net English litera...,862,"[3133, 863, 3132]",3,"[0.005860482808202505, -0.014007982797920704, ...","[-0.034851074, 0.01550293, -0.014709473, 0.025...","[-0.050091486, -0.79798216, -0.450848, -0.0198..."
699,Does all Muslims hate Narendra Modi?,1829,"[2326, 1830, 2327]",3,"[0.0011884663254022598, -0.010927995666861534,...","[0.010787964, 0.009635925, -0.025405884, 0.013...","[-0.34569547, -0.45513502, -0.6565351, -0.1835..."
858,How can I meet Modi?,2267,"[11766, 2266, 7188]",3,"[0.004399363417178392, -0.006411267444491386, ...","[0.023269653, 0.022857666, -0.0099487305, -0.0...","[-0.07941816, -0.60199785, -0.50433916, -0.242..."


In [33]:
# query_results_openai[1259]

In [34]:
# query_results_cohere[1259]

In [35]:
# query_results_e5[1259]

---
- Assign an accuracy score to each question
    - Accuracy score : of the questions tagged as duplicates of the test question, how many were retrieved?

In [36]:
def score_accuracy(full_df, tmp_df, test_id):
    """
    Among the questions retrieved for a test question, count those that appear in its duplicated_questions list
    and normalise by the number of duplicates (capped at the number of retrieved rows).
    """
    duplicated_questions = full_df.loc[full_df['id'] == test_id, 'duplicated_questions'].values[0]

    # exclude the test question's own ID
    filtered_df = tmp_df[tmp_df['id'] != test_id]
    # count how many retrieved IDs are listed as duplicates of the test question
    match_count = filtered_df['id'].isin(duplicated_questions).sum()

    # Calculate the accuracy as a fraction between 0 and 1
    if filtered_df.shape[0]<len(duplicated_questions):
        percentage = (match_count / filtered_df.shape[0])
    else:
        percentage = (match_count / len(duplicated_questions))
    return percentage
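
To make the metric concrete, here is the same calculation on plain id lists, without pandas. `capped_recall` is a hypothetical name, and it uses sets, which matches `score_accuracy` only when the retrieved ids are unique. The first call uses the ids that appear further down in this notebook for test question 569; the second is a made-up toy case.

```python
def capped_recall(retrieved_ids, duplicate_ids):
    """Share of labelled duplicates that were retrieved.

    The denominator is capped at the number of retrieved ids, mirroring
    the branch in score_accuracy.
    """
    retrieved = set(retrieved_ids)
    duplicates = set(duplicate_ids)
    denominator = min(len(retrieved), len(duplicates))
    if denominator == 0:
        raise ValueError("need at least one retrieved id and one labelled duplicate")
    return len(retrieved & duplicates) / denominator

# ids taken from the notebook output for test question 569
retrieved_for_569 = [8037, 12852, 10234, 57, 7800, 12851]
duplicates_of_569 = [568, 8268]
print(capped_recall(retrieved_for_569, duplicates_of_569))

# toy case: two of three labelled duplicates are among the six retrieved ids
print(capped_recall([1, 2, 9, 8, 7, 6], [1, 2, 3]))
```

Question 569 scores zero even though every retrieved question is about making money online, which suggests that some low scores reflect incomplete duplicate labels rather than poor retrieval.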


In [37]:
accuracy_openai = [score_accuracy(df, query_results_openai[i], i) for i in query_results_openai.keys()]
accuracy_cohere = [score_accuracy(df, query_results_cohere[i], i) for i in query_results_cohere.keys()]
accuracy_e5 = [score_accuracy(df, query_results_e5[i], i) for i in query_results_e5.keys()]

In [38]:
indices = [index for index, value in enumerate(accuracy_openai) if value <= 0.5]

In [52]:
list(query_results_openai.keys())[37]

569

In [54]:
df.loc[df.id==569]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
217,How do I to make money online?,569,"[568, 8268]",2,"[0.016434762626886368, -0.014571025967597961, ...","[-0.0073127747, 0.0023593903, -0.036712646, 0....","[0.35096762, -0.6522574, -0.62547153, -0.17215...",0.17019


In [59]:
query_results_openai[569]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
3064,How do you make money online?,8037,"[12717, 13144, 1886, 6438, 4416, 7801, 1885, 1...",24,"[-0.00038440912612713873, -0.00698323501273989...","[-0.00063991547, 0.0026111603, -0.032287598, 0...","[0.19221021, -0.6304249, -0.5507541, -0.097675...",0.878572
4828,How can I start to make money online?,12852,"[57, 4415, 12191, 6100, 8154, 8037, 6800, 1285...",22,"[0.016455164179205894, -0.01352680567651987, 0...","[-0.019210815, 0.0059051514, -0.040740967, 0.0...","[0.44488332, -0.7642211, -0.7528671, 0.0212857...",0.836373
3911,How do we make money online?,10234,"[8154, 12852, 4037, 7801, 4893, 6438, 57, 4416...",24,"[-0.00665827002376318, 0.007081437390297651, 0...","[0.0028629303, 0.002275467, -0.03262329, 0.015...","[0.3740802, -0.63016343, -0.5129542, 0.1140559...",0.816287
18,What is best way to make money online?,57,"[6800, 12851, 13144, 6099, 4038, 8037, 6799, 1...",23,"[0.009966920129954815, -0.002238746965304017, ...","[-0.005718231, -0.013580322, -0.009460449, 0.0...","[-0.117242806, -0.5602809, -0.5706873, 0.13857...",0.808932
2958,What is a way to make money online?,7800,"[6800, 12852, 6438, 10234, 4415, 12717, 2561, ...",22,"[0.0023033299949020147, -0.01025333534926176, ...","[0.0073013306, 3.4749508e-05, -0.029083252, 0....","[0.09597785, -0.63416886, -0.63889635, 0.04895...",0.807559
4827,How can I earn money online?,12851,"[4037, 6099, 6100, 6171, 4038, 12717, 7801, 67...",26,"[0.010865186341106892, -0.007262030150741339, ...","[0.0073242188, 0.011703491, -0.03427124, -0.00...","[0.35610572, -0.77514553, -0.5421254, -0.08505...",0.792847


In [39]:
len(indices)

24

In [41]:
np.mean(accuracy_cohere)

0.9587187958883994

In [42]:
np.mean(accuracy_openai)

0.9609214390602056

In [43]:
np.mean(accuracy_e5)

0.9564610866372981

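#### Conclusion

- Cohere, OpenAI, and e5 all score very well on this test (mean accuracy of about 0.959, 0.961, and 0.956 respectively), so any of them can be used as-is for most tasks.
- When you want to use a local embedding model, check its classification performance and resource allocation with the same procedure.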

--END--