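This notebook shows practical ways to use embeddings:
- Identifying user intent
- Clustering
- Outlier detection

=> A trigger that runs a different function depending on the user's input<br>
=> Reviewing and preprocessing content before it is stored in a VectorDB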

---

In [22]:
import pandas as pd
import os
import ast
import json
import openai
from openai import OpenAI
import numpy as np
from tqdm.notebook import tqdm, trange
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import plotly.express as px
from scipy.spatial.distance import cosine

# initialize openai
# Never hardcode an API key in a notebook; export OPENAI_API_KEY in the shell before launching Jupyter.
openai.api_key = os.environ["OPENAI_API_KEY"]

### 1. Identifying user intent

In [2]:
politics = ["What are the key policies of the main political parties in the upcoming election?",
            "Who will you vote for as the next president?",
            "I love the current Democratic Party.",
            "What is your opinion on the president's current political move?",
            "I love politics. Don't you?"]

ml = ["How does supervised learning differ from unsupervised learning in machine learning models?",
      "What are the ethical considerations of using machine learning in predictive policing?",
      "How do neural networks mimic the human brain in processing data and recognizing patterns?",
      "What are some examples of natural language processing?",
      "Can you describe how machine learning is being utilized in personalized medicine and healthcare?"]

In [3]:
def create_embeddings(txt_list):
    """Return one embedding (a list of floats) per input string."""
    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    response = client.embeddings.create(
        input=txt_list,
        model="text-embedding-3-small",
    )
    return [r.embedding for r in response.data]

In [19]:
sentences = politics + ml  # first 5 are politics, last 5 are machine learning
emb = create_embeddings(sentences)

In [20]:
n_clusters = 2
# Cluster ids are arbitrary: without a fixed random_state, 0 and 1 may swap between runs
kmeans = KMeans(n_clusters=n_clusters)
clusters = kmeans.fit_predict(emb)

In [21]:
clusters

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=int32)

When the user asks a politics-related question

In [24]:
input_sentence = "I would like to have a talk about politics."
sent_emb = create_embeddings([input_sentence])

In [25]:
kmeans.predict(sent_emb)

array([0], dtype=int32)

When the user asks a machine-learning-related question

In [26]:
input_sentence = "Tell me about machine learning."
sent_emb = create_embeddings([input_sentence])

In [27]:
kmeans.predict(sent_emb)

array([1], dtype=int32)

Because embeddings are used, clustering works even with a minimal set of input sentences <br>
#### __Once the user's intent is identified, a function matching that intent can be run__
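
The routing idea can be sketched as follows. This is a minimal illustration, not the notebook's own code: the handler functions, the `route` helper, and the toy 3-dimensional vectors are all hypothetical stand-ins, and the mapping of cluster 0 to politics and cluster 1 to machine learning is an assumption that has to be checked after each fit, since k-means cluster ids are arbitrary.

```python
import numpy as np

# Hypothetical handlers; a real application would call search, a model, etc.
def handle_politics(text):
    return f"politics handler received: {text}"

def handle_ml(text):
    return f"ml handler received: {text}"

def route(embedding, centroids, handlers):
    """Pick the handler whose cluster centroid is nearest to the embedding.

    This mirrors KMeans.predict, which assigns a point to the closest
    centroid by Euclidean distance.
    """
    distances = np.linalg.norm(np.asarray(centroids) - np.asarray(embedding), axis=1)
    cluster_id = int(np.argmin(distances))
    return handlers[cluster_id]

# Toy vectors stand in for real embeddings and for kmeans.cluster_centers_
toy_centroids = np.array([[1.0, 0.0, 0.0],   # cluster 0 -> politics (assumed mapping)
                          [0.0, 1.0, 0.0]])  # cluster 1 -> machine learning (assumed mapping)
handlers = {0: handle_politics, 1: handle_ml}

toy_query_embedding = [0.9, 0.1, 0.0]  # closer to centroid 0 than to centroid 1
handler = route(toy_query_embedding, toy_centroids, handlers)
reply = handler("I would like to have a talk about politics.")
```

With the objects from this notebook, `handlers[int(kmeans.predict(sent_emb)[0])]` would play the same role as `route`.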

---

### 2. Using embeddings with ML

1. Clustering
- What topics did the news cover in 2020?

In [38]:
# df = pd.read_csv("abcnews_2020.csv")

(Caution: this incurs API cost) Embed the headlines batch by batch

In [39]:
# batch_size = 2000
# headline_emb = list()

# headline = df['headline_text'].tolist()

# for i in trange(0, len(headline), batch_size):
#     i_end = min(len(headline), i+batch_size)
#     data_batch = headline[i:i_end]

#     tmp_emb = create_embeddings(data_batch)
#     headline_emb.extend(tmp_emb)


In [40]:
# df['headline_emb'] = headline_emb

In [41]:
# df.to_csv("abcnews_2020_emb.csv", index=False)

Use k-means to create a cluster for each main topic

In [42]:
df = pd.read_csv("abcnews_2020_emb.csv")

In [43]:
df['headline_emb'] = df['headline_emb'].apply(json.loads)
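
The `json.loads` step is needed because a list-valued column comes back from a CSV file as plain text. The round trip below is a small sketch on made-up data (the `toy` frame and its values are illustrative, not the real headlines), using an in-memory buffer instead of a file.

```python
import io
import json

import pandas as pd

# Toy frame with a list-valued column, standing in for df['headline_emb']
toy = pd.DataFrame({
    "headline_text": ["first headline", "second headline"],
    "headline_emb": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],
})

# Write to CSV and read back, as the notebook does with abcnews_2020_emb.csv
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After the round trip each cell is a string such as "[0.1, -0.2, 0.3]"
type_before = type(loaded.loc[0, "headline_emb"])

# json.loads turns the text back into a list of floats
loaded["headline_emb"] = loaded["headline_emb"].apply(json.loads)
type_after = type(loaded.loc[0, "headline_emb"])
```

`ast.literal_eval` would work on the same strings; `json.loads` is used here to match the cell above.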

In [44]:
df.head(2)

| Unnamed: 0 | publish_date | headline_text | headline_emb |
|---|---|---|---|
| 0 | 20200101 | a new type of resolution for the new year | [-0.029918815940618515, 0.027576250955462456, ... |
| 1 | 20200101 | adelaide records driest year in more than a de... | [0.02332385629415512, 0.024166574701666832, 0.... |


In [59]:
clusters = KMeans(n_clusters=30, random_state=0).fit_predict(df['headline_emb'].tolist())
df['cluster'] = clusters

In [61]:
df.head(2)

| Unnamed: 0 | publish_date | headline_text | headline_emb | cluster |
|---|---|---|---|---|
| 0 | 20200101 | a new type of resolution for the new year | [-0.029918815940618515, 0.027576250955462456, ... | 14 |
| 1 | 20200101 | adelaide records driest year in more than a de... | [0.02332385629415512, 0.024166574701666832, 0.... | 17 |


- The hottest news topics
- and the topics that drew less attention

In [68]:
# df.loc[df['cluster']==13]

In [69]:
# df.loc[df['cluster']==4]
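
One way to decide which clusters to inspect is to count headlines per cluster. The sketch below assumes that "hot" simply means "the cluster with the most headlines"; the notebook does not state how clusters 13 and 4 were chosen, and the `toy` frame is made-up data.

```python
import pandas as pd

# Toy stand-in for df[['headline_text', 'cluster']]
toy = pd.DataFrame({
    "headline_text": ["h1", "h2", "h3", "h4", "h5", "h6"],
    "cluster": [0, 0, 0, 1, 2, 2],
})

# Number of headlines in each cluster, largest first
sizes = toy["cluster"].value_counts()

hottest_cluster = sizes.idxmax()    # cluster id with the most headlines
quietest_cluster = sizes.idxmin()   # cluster id with the fewest headlines

# Rows of each cluster, ready for manual reading
hottest_rows = toy.loc[toy["cluster"] == hottest_cluster]
quietest_rows = toy.loc[toy["cluster"] == quietest_cluster]
```

Reading a sample of `headline_text` from each selected cluster is still needed to name its topic, since k-means only returns numeric ids.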

2. Outlier detection
- Which 2020 news stories did we fail to pay attention to?

In [70]:
from sklearn.ensemble import IsolationForest

In [74]:
iso_forest = IsolationForest(contamination=0.05)  # Adjust contamination as needed
anomalies = iso_forest.fit_predict(df['headline_emb'].tolist())  # -1 = anomaly, 1 = normal

# Identifying Anomalous Headlines
anomalous_headlines = np.array(df['headline_text'].tolist())[anomalies == -1]
# print("Anomalous Headlines:", anomalous_headlines)

In [76]:
sum(anomalies==-1)

123

In [75]:
anomalous_headlines

array(['archaic legislation governing nt women property rights',
       'more heat bound for firegrounds but cool change to follow',
       'what do you do when you have a tick',
       'meteorologist describes why forecast conditions are so dangerous',
       'swearing in public is illegal unlikely to be charged if white',
       'whats going on with americas mayor rudy giuliani',
       'democrats broke fundraising records but donald trump is still k',
       'on the periphery of north and south korea',
       'qassem soleimani was head of elite quds force',
       'history why america and iran hate each other',
       'wrong anthem played for moldova at atp cup',
       'chinas communist party is at a fatal age for one party regimes',
       'sasha zhoya turns back on australia athletics for france',
       'golden globes red carpet celebrity outfits',
       'michele williams encourages women to vote in speech',
       'where to watch golden globe winning tv series and movies',
   

Beyond simply turning text into embeddings, <br>
embeddings can be used to preprocess the content itself, for example by grouping texts with similar characteristics or excluding texts judged to be unrelated.
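
As a rough sketch of that preprocessing step, the example below flags outliers with `IsolationForest` and keeps only the remaining rows, which could then be written to a vector store. Everything here is illustrative: the data is synthetic (random 8-dimensional vectors rather than real embeddings), and the `kept` / `dropped` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for embeddings: a tight group plus a few far-away points
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(195, 8))
outliers = rng.normal(0.0, 10.0, size=(5, 8))   # rows 195..199
vectors = np.vstack([inliers, outliers])

toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(len(vectors))],
    "emb": vectors.tolist(),
})

# contamination is the expected share of outliers; 0.05 matches the cell above
iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(toy["emb"].tolist())   # -1 = anomaly, 1 = normal

dropped = toy[labels == -1]                         # candidates to review or exclude
kept = toy[labels == 1].reset_index(drop=True)      # rows to store downstream
```

Whether flagged rows should be discarded or reviewed by a person depends on the use case; an outlier in embedding space is unusual, not necessarily low quality.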

--END--