다양한 임베딩 모델을 평가하는 구체적인 방법을 알려드리도록 하겠습니다
- 임베딩 후보 리스트 준비 (OpenAI, Cohere, e5-base-v2)
- 활용하고자 하는 데이터셋을 임베딩 변환
- Test set 랜덤 선별 후 평가 지표 생성

---

In [289]:
import pandas as pd
import os
import heapq
import random
import cohere
import torch
from transformers import AutoModel, AutoTokenizer
import openai
from openai import OpenAI
from tqdm.notebook import tqdm
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# initialize openai
os.environ['OPENAI_API_KEY']= "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"
openai.api_key = os.environ["OPENAI_API_KEY"]

# initialize cohere
os.environ["CO_API_KEY"] = "KfiaxhA9zulTfhPEmSqZmh3JFMClCLBAj4hmCY3E"
co = cohere.Client()

### Read dataset

In [341]:
df = pd.read_csv("quora_dataset.csv")

In [342]:
df.head(3)

Unnamed: 0,text,id,duplicated_questions,length
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1
2,How can I be a good geologist?,15,[16],1


### Test Embedding 생성

In [297]:
text1 = df.loc[0, 'text']
print(text1)

Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?


In [298]:
text2 = df.loc[1, 'text']
print(text2)

I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?


In [303]:
def create_embeddings(txt_list, provider='openai'):
    if provider=='openai':
        client = OpenAI()

        response = client.embeddings.create(
        input=txt_list,
        model="text-embedding-3-small")
        responses = [r.embedding for r in response.data]

        return responses
    
    elif provider=='cohere':
        doc_embeds = co.embed(
        txt_list,
        input_type="search_document",
        model="embed-english-v3.0")
        return doc_embeds.embeddings
    else:
        assert False, "Double check provider name"

In [304]:
emb1 = create_embeddings(df.loc[0, 'text'])
emb2 = create_embeddings(df.loc[1, 'text'])

In [305]:
# 유사도 계산하기 from scratch
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vector_a, vector_b):
    """Calculate the cosine similarity between two vectors."""
    dot_product = np.dot(vector_a, vector_b)
    norm_a = norm(vector_a)
    norm_b = norm(vector_b)
    similarity = dot_product / (norm_a * norm_b)
    return similarity

In [307]:
# simarity between two embeddings
print("Cosine 유사도 : {}.\n사용된 문장 : \n{}\n{}".format(cosine_similarity(emb1[0], emb2[0]), text1, text2))

Cosine 유사도 : 0.8214627029973013.
사용된 문장 : 
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?


In [308]:
text3 = df.loc[100, 'text']
print(text3)

If I do not monetize YouTube videos & upload copyright content, then are there chances that Google may block my account?


In [309]:
emb3 = create_embeddings(text3)
print("Cosine 유사도 : {}.\n사용된 문장 : \n{}\n{}".format(cosine_similarity(emb1[0], emb3[0]), text1, text3))

Cosine 유사도 : 0.09266203744424949.
사용된 문장 : 
Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
If I do not monetize YouTube videos & upload copyright content, then are there chances that Google may block my account?


---

### Embedding이 포함된 Dataset 만들기

openai embeddings

In [310]:
# # create embeddings (openai)
# # (비용 발생 주의)
# openai_emb = create_embeddings(df.text.tolist(), provider='openai')

In [138]:
# df['openai_emb'] = openai_emb

cohere embeddings

In [159]:
# create embeddings (cohere)
# (비용 발생 주의)
# cohere_emb = create_embeddings(df.text.tolist(), 'cohere')

In [160]:
# df['cohere_emb'] = cohere_emb

e3 embeddings

In [451]:
# load gpu
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "intfloat/e5-base-v2"

# init tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).to(device)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [452]:
def create_e5_emb(docs):
    docs = [f"query: {d}" for d in docs]
    # tokenize
    tokens = tokenizer(
        docs, padding=True, max_length=512, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        out = model(**tokens)
        last_hidden = out.last_hidden_state.masked_fill(
            ~tokens["attention_mask"][..., None].bool(), 0.0
        )
        doc_embeds = last_hidden.sum(dim=1) / tokens["attention_mask"].sum(dim=1)[..., None]
    return doc_embeds.cpu().numpy()

긴 runtime 주의 (약 2시간)

In [None]:
# data = df.text.tolist()
# batch_size = 128

# for i in tqdm(range(0, len(data), batch_size)):
#     i_end = min(len(data), i+batch_size)
#     data_batch = data[i:i_end]
#     # embed current batch
#     embed_batch = create_e5_emb(data_batch)
#     if i == 0:
#         emb3 = embed_batch.copy()
#     else:
#         emb3 = np.concatenate([emb3, embed_batch.copy()])


In [481]:
# emb3 = [list(e) for e in emb3]
# df['e5_emb'] = emb3

In [528]:
df.to_csv("quora_dataset_emb.csv", index=False)

embedding이 이미 처리된 데이터 읽어오기

In [553]:
df = pd.read_csv("quora_dataset_emb.csv")
# str -> list 형태로 변환
# df['openai_emb'] = [ast.literal_eval(i) for i in df['openai_emb']]
# df['cohere_emb'] = [ast.literal_eval(i) for i in df['cohere_emb']]
# df['e5_emb'] = [ast.literal_eval(i) for i in df['e5_emb']]
# df['duplicated_questions'] = [ast.literal_eval(i) for i in df['duplicated_questions']]
import json
df['openai_emb'] = df['openai_emb'].apply(json.loads)
df['cohere_emb'] = df['cohere_emb'].apply(json.loads)
df['e5_emb'] = df['e5_emb'].apply(json.loads)
df['duplicated_questions'] = df['duplicated_questions'].apply(json.loads)

In [554]:
df.head()

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
0,Astrology: I am a Capricorn Sun Cap moon and c...,11,[12],1,"[-0.005765771958976984, -0.018585262820124626,...","[-0.05834961, -0.010795593, -0.04522705, 0.035...","[0.059878636, -0.15769655, -0.14131568, -0.546..."
1,"I'm a triple Capricorn (Sun, Moon and ascendan...",12,[11],1,"[0.026014558970928192, -0.014319832436740398, ...","[-0.022338867, -0.0063285828, -0.057128906, 0....","[0.08937627, -0.2954505, -0.33455396, -0.32940..."
2,How can I be a good geologist?,15,[16],1,"[0.005276682320982218, 0.004194203298538923, 0...","[-0.012535095, 0.005092621, -0.033233643, -0.0...","[0.0825816, -0.09264662, -0.78053623, -0.32416..."
3,What should I do to be a great geologist?,16,[15],1,"[0.015116829425096512, 0.0010464431252330542, ...","[-0.013465881, 0.0018148422, -0.052612305, 0.0...","[-0.1653303, 0.19044468, -0.8906647, -0.364357..."
4,How do I read and find my YouTube comments?,23,[24],1,"[0.03505030274391174, -0.0010134828044101596, ...","[-0.0047836304, 0.028137207, -0.037231445, -0....","[0.50644577, -0.62657785, -0.2523397, -0.17112..."


테스팅을 위해 필요한 랜덤 질문들 선별

In [555]:
# now choose random 10 rows of answers
test_query = random.choices(df.id, k=100)

In [556]:
test_query[:5]

[12487, 13677, 107, 320, 9342]

In [557]:
test = df.loc[df.id.isin(test_query)]

각 테스트 질문별로 가장 유사한 질문들 top-k개 retrieve

In [558]:
def search_top_k(search_df, search_df_column, query, topk):
    """
    search_df : search를 할 대상 dataframe
    search_df_column : search를 위해 사용될 embedding column name
    query : test query
    topk : 유사도 기반으로 top-k개 선별
    """
    search_df = search_df.reset_index(drop=True)
    search = [cosine_similarity(query, i) for i in search_df[search_df_column]]
    search_df['similarity'] = search
    most_similar = heapq.nlargest(topk, search)
    # find indices
    indices = [search.index(x) for x in most_similar]
    search_result = search_df.loc[indices]
    return search_result

- 긴 run time 주의
    - 각 테스트 질문당 데이터 전체를 대상으로 cosine_similarity를 계산하고
    - openai embedding, cohere embedding에 대해 각각 질문 100개씩 진행

In [559]:
# 각 질문들 중, test 질문과 동일한 질문이 가장 유사하게 도출될 것이기 때문에
# test 질문을 제외한 top-5를 얻기 위해 top-6를 계산함
query_results_openai = { k:search_top_k(df, 'openai_emb', v, 6) for k,v in zip(test.id, test.openai_emb) }
query_results_cohere = { k:search_top_k(df, 'cohere_emb', v, 6) for k,v in zip(test.id, test.cohere_emb) }
query_results_e5 = { k:search_top_k(df, 'e5_emb', v, 6) for k,v in zip(test.id, test.e5_emb) }

테스트 결과 엿보기

In [560]:
test.loc[test.length==3]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb
488,How can I increase my presence of mind?,1259,"[1258, 13480, 11348]",3,"[0.004138645716011524, -0.0256500206887722, -0...","[0.04598999, 0.025527954, -0.0073280334, -0.02...","[0.24511276, -0.42361587, -0.48127538, -0.3267..."
2410,Who is the better candidate for being the Pres...,6306,"[6672, 6307, 8624]",3,"[0.015344919636845589, -0.0261219572275877, 0....","[0.030960083, -0.0038719177, -0.00012165308, 0...","[-0.46373126, -0.3975775, -0.8627275, 0.116123..."
4352,What was the worst thing that happened to you ...,11425,"[10361, 10360, 11424]",3,"[-0.016111603006720543, -0.027796797454357147,...","[0.019241333, 0.019592285, -0.020812988, -0.03...","[-0.41243482, -0.36105016, -0.1343597, 0.35431..."
5419,Do you think there's life on other planets?,14686,"[12523, 12524, 14685]",3,"[0.01339676696807146, 0.008889188058674335, -0...","[-0.032592773, 0.010681152, -0.022476196, 0.04...","[-0.07575863, -0.05358007, -0.8569687, -0.0457..."


In [562]:
query_results_openai[1259]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
488,How can I increase my presence of mind?,1259,"[1258, 13480, 11348]",3,"[0.004138645716011524, -0.0256500206887722, -0...","[0.04598999, 0.025527954, -0.0073280334, -0.02...","[0.24511276, -0.42361587, -0.48127538, -0.3267...",1.0
5029,How can I improve my presence of mind?,13480,"[1258, 11348, 1259]",3,"[-0.0027638771571218967, -0.02685525268316269,...","[0.036376953, 0.015419006, 0.0017023087, -0.01...","[0.33292645, -0.35477617, -0.53358656, -0.2640...",0.961495
4333,How do I improve presence of mind?,11348,"[1258, 1259, 13480]",3,"[-0.004489744082093239, -0.01853412389755249, ...","[0.032714844, 0.011054993, 0.000875473, -0.000...","[0.15466276, -0.3943424, -0.5786618, -0.263803...",0.940601
487,How do I develop my presence of mind?,1258,"[11348, 1259, 13480]",3,"[0.002190226688981056, -0.02701457217335701, -...","[0.009765625, 0.014198303, -0.003704071, 0.006...","[0.3041871, -0.26109898, -0.2508774, -0.201882...",0.901103
2996,What are some examples of 'Presence of Mind'?,7890,[7891],1,"[0.0216471366584301, 0.028882700949907303, -0....","[0.05645752, 0.011474609, 0.034423828, 0.02581...","[-0.13150793, -0.294934, -1.0039504, 0.2830207...",0.733589
3003,What are some best examples of Presence of mind?,7905,"[11920, 7891, 11919]",3,"[0.024972885847091675, 0.001189756439998746, 0...","[0.053710938, 0.0063438416, 0.04827881, 0.0260...","[-0.17511466, -0.26348683, -0.99845076, 0.2703...",0.726639


In [563]:
query_results_cohere[1259]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
488,How can I increase my presence of mind?,1259,"[1258, 13480, 11348]",3,"[0.004138645716011524, -0.0256500206887722, -0...","[0.04598999, 0.025527954, -0.0073280334, -0.02...","[0.24511276, -0.42361587, -0.48127538, -0.3267...",1.0
5029,How can I improve my presence of mind?,13480,"[1258, 11348, 1259]",3,"[-0.0027638771571218967, -0.02685525268316269,...","[0.036376953, 0.015419006, 0.0017023087, -0.01...","[0.33292645, -0.35477617, -0.53358656, -0.2640...",0.938154
4333,How do I improve presence of mind?,11348,"[1258, 1259, 13480]",3,"[-0.004489744082093239, -0.01853412389755249, ...","[0.032714844, 0.011054993, 0.000875473, -0.000...","[0.15466276, -0.3943424, -0.5786618, -0.263803...",0.935045
487,How do I develop my presence of mind?,1258,"[11348, 1259, 13480]",3,"[0.002190226688981056, -0.02701457217335701, -...","[0.009765625, 0.014198303, -0.003704071, 0.006...","[0.3041871, -0.26109898, -0.2508774, -0.201882...",0.8835
3643,How do I increase my concentration?,9521,[9522],1,"[0.01781752146780491, -0.008437324315309525, 0...","[0.02532959, 0.05795288, -0.06384277, -0.03469...","[0.4149682, -0.13139981, -0.5308464, -0.496404...",0.728383
2996,What are some examples of 'Presence of Mind'?,7890,[7891],1,"[0.0216471366584301, 0.028882700949907303, -0....","[0.05645752, 0.011474609, 0.034423828, 0.02581...","[-0.13150793, -0.294934, -1.0039504, 0.2830207...",0.719023


In [564]:
query_results_e5[1259]

Unnamed: 0,text,id,duplicated_questions,length,openai_emb,cohere_emb,e5_emb,similarity
488,How can I increase my presence of mind?,1259,"[1258, 13480, 11348]",3,"[0.004138645716011524, -0.0256500206887722, -0...","[0.04598999, 0.025527954, -0.0073280334, -0.02...","[0.24511276, -0.42361587, -0.48127538, -0.3267...",1.0
5029,How can I improve my presence of mind?,13480,"[1258, 11348, 1259]",3,"[-0.0027638771571218967, -0.02685525268316269,...","[0.036376953, 0.015419006, 0.0017023087, -0.01...","[0.33292645, -0.35477617, -0.53358656, -0.2640...",0.975632
4333,How do I improve presence of mind?,11348,"[1258, 1259, 13480]",3,"[-0.004489744082093239, -0.01853412389755249, ...","[0.032714844, 0.011054993, 0.000875473, -0.000...","[0.15466276, -0.3943424, -0.5786618, -0.263803...",0.972107
487,How do I develop my presence of mind?,1258,"[11348, 1259, 13480]",3,"[0.002190226688981056, -0.02701457217335701, -...","[0.009765625, 0.014198303, -0.003704071, 0.006...","[0.3041871, -0.26109898, -0.2508774, -0.201882...",0.957242
5472,How do I increase my thinking skills?,14827,[14826],1,"[0.04178797826170921, -0.014722226187586784, 0...","[0.02470398, 0.006072998, -0.01322937, 0.01736...","[0.20495385, -0.2965035, -0.72152764, -0.28187...",0.924281
3643,How do I increase my concentration?,9521,[9522],1,"[0.01781752146780491, -0.008437324315309525, 0...","[0.02532959, 0.05795288, -0.06384277, -0.03469...","[0.4149682, -0.13139981, -0.5308464, -0.496404...",0.920644


---
각 질문별로 얼마나 맞췄는지 score 부여

1. Accuracy score : 현재 유사하다고 태그된 질문들 중 몇 개를 가져왔는가?
2. Recall@K : Retrieve 한 결과 값들 중, 가장 유사하다고 판단된 top-K개만 선택했을 때 얼마나 정확한 결과를 가져왔는가?

In [565]:
def score_accuracy(tmp_df, test_id):
    """
    각 테스트 질문과 유사하다고 판단된 질문들 중, 실제 duplicated_questions에 들어있는 질문들을 count
    """
    duplicated_questions = tmp_df.loc[tmp_df['id'] == test_id, 'duplicated_questions'].values[0]

    # 본인 ID는 제외
    filtered_df = tmp_df[tmp_df['id'] != test_id]
    # 현재 retrieve 해온 ID들이, 테스트 질문 내에 들어있는 아이디들인지 count
    match_count = filtered_df['id'].isin(duplicated_questions).sum()

    # Calculate the accuracy in terms of percentage
    if filtered_df.shape[0]<len(duplicated_questions):
        percentage = (match_count / filtered_df.shape[0])
    else:
        percentage = (match_count / len(duplicated_questions))
    return percentage


In [566]:
accuracy_openai = [score_accuracy(query_results_openai[i], i) for i in query_results_openai.keys()]
accuracy_cohere = [score_accuracy(query_results_cohere[i], i) for i in query_results_cohere.keys()]
accuracy_e5 = [score_accuracy(query_results_e5[i], i) for i in query_results_e5.keys()]

In [567]:
np.mean(accuracy_cohere)

0.9516835016835017

In [568]:
np.mean(accuracy_openai)

0.9446127946127946

In [569]:
np.mean(accuracy_e5)

0.9508417508417507

#### 결론

- cohere, openai, e5 모두 굉장히 성능이 좋기 때문에 대부분의 task에 곧바로 활용해도 무방함.
- Local embedding 모델을 활용하고자 할 때 위와 같은 방법으로 classification 성능 & 자원 할당 체크 필요.

--END--