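How to use embeddings for more than search
- use case
    - Identifying user intent
    - Frequently asked question (FAQ) set
- ABC news topic modeling
    - Clustering
    - Measuring information diversity
    - Outlier detection
- Frequently asked questions
    - Why doesn't it work for Korean?

=> A trigger that runs a different function depending on the user's input<br>
=> Review and preprocessing of the content to be stored in a VectorDB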

---

In [1]:
import pandas as pd
import os
import json
import openai
from openai import OpenAI
import numpy as np
from tqdm.notebook import tqdm, trange
from sklearn.cluster import KMeans
from utils import cosine_similarity

# initialize openai
# The key is read from the OPENAI_API_KEY environment variable; set it outside the notebook instead of hard-coding it here.
openai.api_key = os.environ["OPENAI_API_KEY"]
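
The `utils` module imported above is not shown in this notebook. As a rough idea of what its `cosine_similarity` helper might look like, here is a minimal sketch in plain Python. The body, including the choice to return 0.0 for zero vectors, is an assumption and not the actual `utils` code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical stand-in for utils.cosine_similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch; a zero vector has no direction
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])         # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])   # unrelated directions
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
```

The value depends only on the angle between the vectors, not on their length, which is why a single threshold can be applied to every sentence.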

## 1. Identifying user intent

In [4]:
politics = ["What are the key policies of the main political parties in the upcoming election?",
            "Who do you vote for the next president?",
            "I love the current Democratic Party.",
            "What is your opinion on the president's current political move?",
            "I love politics. Don't you?"]

ml = ["How does supervised learning differ from unsupervised learning in machine learning models?",
      "What are the ethical considerations of using machine learning in predictive policing?",
      "How do neural networks mimic the human brain in processing data and recognizing patterns?",
      "What are some examples of natural language processing?",
      "Can you describe how machine learning is being utilized in personalized medicine and healthcare?"]

In [2]:
def create_embeddings(txt_list):
    """Embed a list of strings and return a list of embedding vectors."""
    client = OpenAI()

    response = client.embeddings.create(
        input=txt_list,
        model="text-embedding-3-small",
    )
    return [r.embedding for r in response.data]

In [5]:
sentences = politics + ml  # raw texts: 5 politics sentences followed by 5 machine learning sentences
emb = create_embeddings(sentences)

#### Using clustering

In [10]:
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters)
clusters = kmeans.fit_predict(emb)

In [11]:
clusters

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=int32)

When the user asks a politics-related question

In [13]:
input_sentence = "I would like to have a talk about politics."
sent_emb = create_embeddings([input_sentence])

In [14]:
kmeans.predict(sent_emb)

array([0], dtype=int32)

When the user asks a machine learning-related question

In [15]:
input_sentence = "Tell me about machine learning."
sent_emb = create_embeddings([input_sentence])

In [16]:
kmeans.predict(sent_emb)

array([1], dtype=int32)
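
k-means cluster ids are arbitrary integers, so before they can trigger anything they need to be tied to a topic. A minimal sketch, assuming the topic of each seed sentence is known (five politics sentences followed by five machine learning sentences): take a majority vote per cluster. `name_clusters` and the `"politics"` / `"ml"` labels are hypothetical names introduced for illustration.

```python
from collections import Counter

def name_clusters(cluster_ids, topic_labels):
    """Map each cluster id to the most common known topic among its members."""
    votes = {}
    for cid, topic in zip(cluster_ids, topic_labels):
        votes.setdefault(cid, Counter())[topic] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}

# Cluster ids as printed above, plus the known topic of each seed sentence.
cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
topic_labels = ["politics"] * 5 + ["ml"] * 5

cluster_names = name_clusters(cluster_ids, topic_labels)

predicted_cluster = 0  # in the notebook this would be int(kmeans.predict(sent_emb)[0])
route = cluster_names[predicted_cluster]
```

Because the mapping is rebuilt from the seed sentences, it stays valid even if a different run of k-means swaps the ids.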

#### Using similarity search

In [17]:
politics_emb = create_embeddings(politics)
ml_emb = create_embeddings(ml)

In [18]:
def route_selection(emb_list, query_emb, threshold=0.5):
    """Return True if the query is more similar than `threshold` to at least one embedding in emb_list."""
    return any(cosine_similarity(e, query_emb) > threshold for e in emb_list)

In [19]:
input_sentence = "I would like to have a talk about politics."
sent_emb = create_embeddings([input_sentence])

print("{} for politics, {} for machine learning".format(route_selection(politics_emb, sent_emb[0]), route_selection(ml_emb, sent_emb[0])))

True for politics, False for machine learning


In [20]:
input_sentence = "How is the weather today?"
sent_emb = create_embeddings([input_sentence])

print("{} for politics, {} for machine learning".format(route_selection(politics_emb, sent_emb[0]), route_selection(ml_emb, sent_emb[0])))

False for politics, False for machine learning


In [21]:
input_sentence = "What is the best way to learn machine learning?"
sent_emb = create_embeddings([input_sentence])

print("{} for politics, {} for machine learning".format(route_selection(politics_emb, sent_emb[0]), route_selection(ml_emb, sent_emb[0])))

False for politics, False for machine learning


Because embeddings are used, clustering becomes possible with only a minimal amount of input <br>
##### __=> Identify the user's goal and run the function that fits each goal__ (guardrails or a semantic router)
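
One way to turn the per-route True/False checks into a trigger is to pick the single best-matching route and call its handler, falling back when nothing clears the threshold. The sketch below uses toy 3-dimensional vectors instead of real embeddings so that it runs without an API call. The handler functions, `best_route`, and `dispatch` are hypothetical names, not part of any library.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def best_route(routes, query_emb, threshold=0.5):
    """Return the route whose closest example is most similar to the query, or None below the threshold."""
    best_name, best_score = None, threshold
    for name, emb_list in routes.items():
        score = max(cosine_similarity(e, query_emb) for e in emb_list)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical handlers, one per intent.
def talk_politics():
    return "politics handler"

def talk_ml():
    return "ml handler"

def fallback():
    return "fallback handler"

handlers = {"politics": talk_politics, "ml": talk_ml}

# Toy stand-ins for politics_emb and ml_emb.
routes = {
    "politics": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "ml": [[0.0, 0.1, 1.0], [0.1, 0.2, 0.9]],
}

def dispatch(query_emb):
    """Run the handler of the best route, or the fallback if no route matches."""
    return handlers.get(best_route(routes, query_emb), fallback)()

politics_result = dispatch([0.95, 0.1, 0.05])
ml_result = dispatch([0.05, 0.1, 0.95])
unknown_result = dispatch([0.0, -1.0, 0.0])
```

The threshold is data dependent: a value that is too strict sends on-topic queries to the fallback, so it is worth checking against a handful of known queries per route.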

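## 2. Frequently asked questions

1. Store the frequently asked questions by category, in the same way as above
2. Set a threshold and search for similar questions
3. Provide the information linked to the similar question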
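
The three steps above can be sketched as follows. The FAQ entries, their toy 3-dimensional vectors, and the `answer_faq` helper are all made up for illustration; in practice the `emb` values would come from `create_embeddings`, and the threshold of 0.8 is an assumption to tune.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Step 1: FAQ entries stored by category together with their embeddings (toy values).
faq_entries = [
    {"category": "account", "question": "How do I reset my password?",
     "answer": "Use the reset link on the login page.", "emb": [1.0, 0.0, 0.2]},
    {"category": "billing", "question": "How do I get a refund?",
     "answer": "Refunds are handled from the billing page.", "emb": [0.0, 1.0, 0.1]},
]

def answer_faq(query_emb, entries, threshold=0.8):
    """Steps 2 and 3: find the most similar stored question and return its answer if it clears the threshold."""
    scored = [(cosine_similarity(e["emb"], query_emb), e) for e in entries]
    score, entry = max(scored, key=lambda pair: pair[0])
    return entry["answer"] if score >= threshold else None

close_match = answer_faq([0.9, 0.1, 0.2], faq_entries)  # near the password question
no_match = answer_faq([0.5, 0.5, 0.0], faq_entries)     # halfway between both questions
```

Returning `None` below the threshold lets the caller fall back to a general answer instead of serving an unrelated FAQ.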

---

# How To (ABC News)

## 1. Clustering
- What topics did the news cover in 2020?
##### => __Explore the topic of each document / group similar documents__

In [22]:
df = pd.read_csv("abcnews_2020.csv")

(Note: this step incurs API cost) Embed the headlines in batches

In [24]:
batch_size = 2000
headline_emb = list()

headline = df['headline_text'].tolist()

for i in trange(0, len(headline), batch_size):
    i_end = min(len(headline), i+batch_size)
    data_batch = headline[i:i_end]

    tmp_emb = create_embeddings(data_batch)
    headline_emb.extend(tmp_emb)


In [25]:
df['headline_emb'] = headline_emb

In [27]:
# df.head()

In [15]:
# df.to_csv("abcnews_2020_emb.csv", index=False)

Create a cluster for each major topic using k-means

In [30]:
df = pd.read_csv("abcnews_2020_emb.csv")

In [34]:
type(df.loc[0, 'headline_emb'])

str

In [35]:
df['headline_emb'] = df['headline_emb'].apply(json.loads)

In [36]:
type(df.loc[0, 'headline_emb'])

list
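
The `str` to `list` conversion is needed because a CSV cell can only hold text: when the DataFrame is saved, each embedding list is, as far as I know, written as its string representation and comes back as a string. The sketch below reproduces that round trip with the standard `csv` module and an in-memory buffer instead of pandas; the three-number embedding is a shortened, made-up stand-in.

```python
import csv
import io
import json

embedding = [-0.0299, 0.0276, 0.0113]  # shortened, made-up vector

# Write one row to an in-memory CSV; the list is stored as its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["headline_text", "headline_emb"])
writer.writerow(["a new type of resolution for the new year", str(embedding)])

# Read it back: the cell is now plain text, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row["headline_emb"]

# A list of floats printed by Python is also valid JSON, so json.loads restores it.
restored = json.loads(raw_cell)
```

For larger datasets a binary format that preserves list or array columns would avoid the parsing step, but that is a design choice outside this notebook.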

In [18]:
df.head(2)

Unnamed: 0,publish_date,headline_text,headline_emb
0,20200101,a new type of resolution for the new year,"[-0.029918815940618515, 0.027576250955462456, ..."
1,20200101,adelaide records driest year in more than a de...,"[0.02332385629415512, 0.024166574701666832, 0...."


In [38]:
clusters = KMeans(n_clusters=30, random_state=0).fit_predict(df['headline_emb'].tolist())
df['cluster'] = clusters

In [39]:
df.head(2)

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
0,20200101,a new type of resolution for the new year,"[-0.029918815940618515, 0.027576250955462456, ...",14
1,20200101,adelaide records driest year in more than a de...,"[0.02332385629415512, 0.024166574701666832, 0....",17


In [40]:
df.loc[df['cluster']==18]

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
212,20200104,china to identify cause of mystery pneumonia p...,"[-0.006448061671108007, -0.054372843354940414,...",18
360,20200106,mysterious illness in china is not sars,"[-0.0005059370887465775, -0.033365555107593536...",18
1045,20200115,who says new china coronavirus could spread; w...,"[-0.013549289666116238, -0.028695644810795784,...",18
1237,20200117,thailand finds second case of new chinese coro...,"[-0.02747490629553795, -0.044669169932603836, ...",18
1265,20200118,china reports new virus cases raising concern ...,"[0.004373899661004543, -0.031095510348677635, ...",18
...,...,...,...,...
2385,20200131,how deadly is the coronavirus,"[-0.01725398749113083, 0.0033517233096063137, ...",18
2437,20200131,wall street volatile coronavirus australian do...,"[-0.07781743258237839, -0.02185390144586563, 0...",18
2442,20200131,who coronavirus global emergency,"[-0.025180980563163757, -0.004652089439332485,...",18
2443,20200131,who declares coronavirus outbreak as global he...,"[-0.04890184849500656, -0.033849552273750305, ...",18


In [43]:
df.loc[df['cluster']==20]

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
78,20200102,david stern former nba basketball commissioner...,"[-0.03847704082727432, 0.007147535681724548, -...",20
118,20200102,the one question every female sport presenter ...,"[0.02246895618736744, -0.0231456495821476, 0.0...",20
252,20200105,awesome games done quick millions for charity,"[0.006493553053587675, -0.008684853091835976, ...",20
355,20200106,michele williams encourages women to vote in s...,"[0.06645362824201584, 0.059140317142009735, 0....",20
380,20200106,sam kerr back foot assist chelsea debut goal,"[-0.0035119790118187666, -0.04333627223968506,...",20
381,20200106,sam kerr lays on backheel assist wiped out in ...,"[-0.03333199769258499, -0.021290021017193794, ...",20
541,20200108,lance franklin a chance for round one after kn...,"[-0.0332072414457798, 0.011103429831564426, 0....",20
595,20200109,bbl becomes big bowl league with two hat trick...,"[-0.01639978028833866, -0.025932803750038147, ...",20
634,20200109,michael king speaks about rebuilding after the,"[0.003431144403293729, 0.02760828100144863, 0....",20
650,20200109,rashid khan and haris rauf both took hat trick...,"[-0.03229755908250809, -0.05221964046359062, 0...",20


## 2. Measuring information diversity

In [44]:
# Note: this import shadows the cosine_similarity imported from utils in the first cell.
from sklearn.metrics.pairwise import cosine_similarity

def calculate_diversity(df, column_name):
    """
    Measures the diversity of a set of embeddings via their average pairwise cosine similarity.

    :param df: DataFrame that holds the embeddings
    :param column_name: name of the column with one embedding (list of floats) per row
    :return: (pairwise cosine similarity matrix, average pairwise cosine similarity);
             a lower average means the embeddings are more diverse
    """
    # Compute the pairwise cosine similarity between all embeddings
    embeddings = np.vstack(df[column_name])
    cosine_sim = cosine_similarity(embeddings)

    # Average the cosine similarities, excluding self-comparisons (diagonal elements)
    np.fill_diagonal(cosine_sim, np.nan)  # exclude each item's similarity with itself
    avg_similarity = np.nanmean(cosine_sim)

    return cosine_sim, avg_similarity


In [45]:
sim, avg = calculate_diversity(df, 'headline_emb')  # sim is the pairwise similarity matrix

In [47]:
avg

0.19837456404442783

In [48]:
diversity_score = {k:calculate_diversity(df.loc[df['cluster']==k], 'headline_emb')[1] for k in range(0, 30)}

In [49]:
diversity_score

{0: 0.30127963292576626,
 1: 0.4339087803488167,
 2: 0.2147994063589832,
 3: 0.3984893541964128,
 4: 0.8579692271248459,
 5: 0.4212757317776546,
 6: 0.3205253159622736,
 7: 0.44165128926019925,
 8: 0.5049445264108715,
 9: 0.18107570874467785,
 10: 0.3482301300183521,
 11: 0.19773536876618047,
 12: 0.4786903719253222,
 13: 0.4870874258713398,
 14: 0.21972834313499837,
 15: 0.6949230479873828,
 16: 0.16821974781357854,
 17: 0.42249861961474716,
 18: 0.44025882277053147,
 19: 0.3107201382398215,
 20: 0.236653063288377,
 21: 0.3339562535194002,
 22: 0.30493444156468086,
 23: 0.4137818825692707,
 24: 0.379302359023832,
 25: 0.24157646125137347,
 26: 0.3319578912797995,
 27: 0.2827679031282283,
 28: 0.2528545554418479,
 29: 0.41047133156097776}
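
Since `calculate_diversity` averages `cosine_sim`, each score is a mean pairwise cosine similarity: a low value means the headlines in that cluster are spread out, a high value means they are close to one another. A small sketch of ranking clusters on that basis, using a subset of the scores printed above (the `1 - similarity` conversion to a "higher is more diverse" number is an added convenience, not something the notebook computes):

```python
# Subset of the per-cluster scores printed above: cluster id -> mean pairwise cosine similarity.
mean_similarity = {
    4: 0.8579692271248459,
    9: 0.18107570874467785,
    15: 0.6949230479873828,
    16: 0.16821974781357854,
    29: 0.41047133156097776,
}

# Sort from lowest to highest mean similarity, i.e. from most to least diverse.
ranked = sorted(mean_similarity.items(), key=lambda kv: kv[1])
most_diverse = ranked[0][0]
least_diverse = ranked[-1][0]

# Optional: express the score as an average cosine distance, where higher means more diverse.
mean_distance = {cluster: 1.0 - sim for cluster, sim in mean_similarity.items()}
```

Clusters with very high mean similarity are candidates for near-duplicate content, while very low values may indicate a catch-all cluster that mixes several topics.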

In [51]:
df.loc[df['cluster']==29]

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
41,20200101,queensland three premiers climate change warni...,"[-0.019185040146112442, -0.020576568320393562,...",29
63,20200102,anthony albanese continues to call for action,"[-0.018650749698281288, 0.002697843126952648, ...",29
90,20200102,group of cobargo residents vent anger at pm yo...,"[0.0205977875739336, 0.014941252768039703, 0.0...",29
112,20200102,scott morrison responds to unwelcome reception...,"[-0.02228597179055214, 0.013678975403308868, 0...",29
113,20200102,scott morrison urges patience and calm to deal...,"[0.010633979924023151, 0.013198329135775566, 0...",29
...,...,...,...,...
2228,20200129,scott morrison christmas island quarantine pro...,"[-0.040498774498701096, 0.029602359980344772, ...",29
2230,20200129,scott morrison hazard emissions reduction clim...,"[-0.013279592618346214, 0.0006563903298228979,...",29
2231,20200129,scott morrison knows his weaknesses but wont g...,"[-0.024246487766504288, 0.023730335757136345, ...",29
2232,20200129,scott morrison national emergency law review,"[-0.02942686155438423, 0.04149668291211128, 0....",29


## 3. Outlier detection
- Which news stories in 2020 did we fail to pay attention to?

In [58]:
from sklearn.ensemble import IsolationForest

In [59]:
cluster29 = df.loc[df['cluster']==29]

In [60]:
iso_forest = IsolationForest(contamination=0.05)  # Adjust contamination as needed
anomalies = iso_forest.fit_predict(cluster29['headline_emb'].tolist())

anomalous_headlines = np.array(cluster29['headline_text'].tolist())[anomalies == -1]
# print("Anomalous Headlines:", anomalous_headlines)

In [61]:
sum(anomalies==-1)

4

In [62]:
anomalous_headlines

array(['austrian conservatives greens sebastian kurz returned to power',
       'pm has been met with hostility and criticism in cobargo',
       'a load of rubbish: greg mullins wants more proactive govt',
       'aboriginals slam state government efforts to reset relationship'],
      dtype='<U64')
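
As a simpler cross-check on the `IsolationForest` result, one can flag the items that are least similar to the cluster centroid. This is a different technique, shown here only as a dependency-free sketch on toy 2-dimensional vectors; `centroid_outliers` is a hypothetical helper, and its `contamination` argument merely mirrors the idea of flagging a fixed fraction of the items.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid_outliers(embeddings, contamination=0.05):
    """Return the indices of the items least similar to the mean embedding."""
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    sims = [cosine_similarity(e, centroid) for e in embeddings]
    n_outliers = max(1, round(contamination * len(embeddings)))  # flag at least one item
    order = sorted(range(len(embeddings)), key=lambda i: sims[i])  # least similar first
    return sorted(order[:n_outliers])

# Four vectors pointing roughly the same way and one that points elsewhere.
embeddings = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.95, 0.15], [-0.2, 1.0]]
outliers = centroid_outliers(embeddings, contamination=0.2)
```

If both approaches flag the same headlines, that is a reasonable sign they really are off-topic for the cluster.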

In [63]:
cluster29

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
41,20200101,queensland three premiers climate change warni...,"[-0.019185040146112442, -0.020576568320393562,...",29
63,20200102,anthony albanese continues to call for action,"[-0.018650749698281288, 0.002697843126952648, ...",29
90,20200102,group of cobargo residents vent anger at pm yo...,"[0.0205977875739336, 0.014941252768039703, 0.0...",29
112,20200102,scott morrison responds to unwelcome reception...,"[-0.02228597179055214, 0.013678975403308868, 0...",29
113,20200102,scott morrison urges patience and calm to deal...,"[0.010633979924023151, 0.013198329135775566, 0...",29
...,...,...,...,...
2228,20200129,scott morrison christmas island quarantine pro...,"[-0.040498774498701096, 0.029602359980344772, ...",29
2230,20200129,scott morrison hazard emissions reduction clim...,"[-0.013279592618346214, 0.0006563903298228979,...",29
2231,20200129,scott morrison knows his weaknesses but wont g...,"[-0.024246487766504288, 0.023730335757136345, ...",29
2232,20200129,scott morrison national emergency law review,"[-0.02942686155438423, 0.04149668291211128, 0....",29


Going beyond simply embedding text, <br>
embeddings can also be used to preprocess the content itself, for example by grouping texts by their characteristics or excluding texts judged to be irrelevant

--END--