This notebook shows how to use embeddings for tasks other than search
- ABC news topic modeling
    - Clustering
    - Measuring information diversity
    - Outlier detection

=> Reviewing and preprocessing content before it is stored in a VectorDB

---

In [1]:
import pandas as pd
import os
import json
import openai
from openai import OpenAI
import numpy as np
from tqdm.notebook import tqdm, trange
from sklearn.cluster import KMeans
from utils import create_embeddings

# initialize openai
# Read the API key from the environment; never hard-code a secret in the notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]

# How To (ABC News)

## 1. Clustering
- What topics did the news cover in 2020?
##### => __Explore the topic of each document / group similar documents__

In [2]:
df = pd.read_csv("abcnews_2020.csv")

(Caution: this step incurs API costs) Embed the headlines in batches

In [None]:
batch_size = 2000
headline_emb = list()

headline = df['headline_text'].tolist()

for i in trange(0, len(headline), batch_size):
    i_end = min(len(headline), i+batch_size)
    data_batch = headline[i:i_end]

    tmp_emb = create_embeddings(data_batch)
    headline_emb.extend(tmp_emb)
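
The `create_embeddings` helper is imported from `utils` and its body is not shown in this notebook. The following is only a minimal sketch of what such a helper might look like, assuming it wraps the `embeddings.create` call of an `OpenAI` client. The function name `create_embeddings_sketch`, the `client` parameter, and the default model name are assumptions for illustration. A fake client is used so the sketch runs offline without cost.

```python
from types import SimpleNamespace
from typing import List, Sequence


def create_embeddings_sketch(
    texts: Sequence[str],
    client,
    model: str = "text-embedding-3-small",  # assumed model; the notebook does not name one
) -> List[List[float]]:
    """Embed a batch of texts with one API call and return one vector per text."""
    response = client.embeddings.create(model=model, input=list(texts))
    # The response carries one item per input text, each with its vector in `.embedding`.
    return [item.embedding for item in response.data]


class _FakeEmbeddings:
    """Offline stand-in for `client.embeddings` so the sketch runs without API credits."""

    def create(self, model, input):
        data = [SimpleNamespace(embedding=[float(len(t)), 0.0]) for t in input]
        return SimpleNamespace(data=data)


fake_client = SimpleNamespace(embeddings=_FakeEmbeddings())
vectors = create_embeddings_sketch(["bushfire", "tennis"], fake_client)
print(vectors)
```

With a real client, `create_embeddings_sketch(data_batch, OpenAI())` would take the place of the fake call. Sending one request per batch, as the loop above does, keeps the number of API calls low.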

In [None]:
df['headline_emb'] = headline_emb

In [None]:
# df.head()

In [None]:
# df.to_csv("abcnews_2020_emb.csv", index=False)

Use k-means to create one cluster per major topic

<img src="https://static.javatpoint.com/tutorial/machine-learning/images/k-means-clustering-algorithm-in-machine-learning.png" width="500" height="300"/>
<br>
Source: https://static.javatpoint.com/tutorial/machine-learning/images/k-means-clustering-algorithm-in-machine-learning.png

In [3]:
df = pd.read_csv("abcnews_2020_emb.csv")

In [4]:
df.head()

Unnamed: 0,publish_date,headline_text,headline_emb
0,20200101,a new type of resolution for the new year,"[-0.029918815940618515, 0.027576250955462456, ..."
1,20200101,adelaide records driest year in more than a de...,"[0.02332385629415512, 0.024166574701666832, 0...."
2,20200101,adelaide riverbank catches alight after new ye...,"[0.008539590984582901, -0.00674605555832386, 0..."
3,20200101,adelaides 9pm fireworks spark blaze on riverbank,"[0.03185156360268593, -1.126696679421002e-05, ..."
4,20200101,archaic legislation governing nt women propert...,"[0.05419066920876503, 0.061877865344285965, 0...."


In [5]:
type(df.loc[0, 'headline_emb'])

str

In [6]:
df['headline_emb'] = df['headline_emb'].apply(json.loads)

In [7]:
type(df.loc[0, 'headline_emb'])

list
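
The embedding column comes back as `str` because CSV stores every field as text, so a list is written as its string form and has to be parsed again after loading. The sketch below reproduces that round trip with the standard library only (`csv` instead of pandas, with made-up values) to show why `json.loads` restores the list.

```python
import csv
import io
import json

# Write one row whose embedding column holds a Python list.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["headline_text", "headline_emb"])
writer.writerow(["turning 20 in 2020", [0.25, -0.5, 1.0]])  # the list is stored via str()

# Read it back: every CSV field is returned as text.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["headline_emb"]
parsed = json.loads(raw)  # "[0.25, -0.5, 1.0]" is valid JSON, so the list is restored

print(type(raw).__name__, type(parsed).__name__, parsed)
```

For larger datasets, a binary format that preserves arrays avoids this parsing step, but the CSV route is enough for a walkthrough like this one.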

In [8]:
df.head(2)

Unnamed: 0,publish_date,headline_text,headline_emb
0,20200101,a new type of resolution for the new year,"[-0.029918815940618515, 0.027576250955462456, ..."
1,20200101,adelaide records driest year in more than a de...,"[0.02332385629415512, 0.024166574701666832, 0...."


In [9]:
clusters = KMeans(n_clusters=15, random_state=0).fit_predict(df['headline_emb'].tolist())
df['cluster'] = clusters

In [10]:
df.head(2)

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
0,20200101,a new type of resolution for the new year,"[-0.029918815940618515, 0.027576250955462456, ...",10
1,20200101,adelaide records driest year in more than a de...,"[0.02332385629415512, 0.024166574701666832, 0....",6


In [13]:
df.loc[df['cluster']==1]

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
162,20200103,nick kyrgios kicks off australias atp cup chal...,"[-0.0559181347489357, 0.018481604754924774, 0....",1
199,20200104,australia marnus labuschagne stars vs new zeal...,"[-0.01846972480416298, 0.018750378862023354, 0...",1
201,20200104,bushfire help sparked by ashleigh barty pink a...,"[-0.006249888334423304, -0.03250924497842789, ...",1
249,20200104,wrong anthem played for moldova at atp cup,"[-0.033767540007829666, 0.01257616188377142, 0...",1
297,20200105,sasha zhoya turns back on australia athletics ...,"[0.026853417977690697, 0.03245200961828232, 0....",1
...,...,...,...,...
2314,20200130,rafael nadal agitated by chair umpire after gi...,"[-0.051331765949726105, 0.03031117469072342, 0...",1
2315,20200130,rafael nadal loses to dominic thiem australian...,"[-0.018239960074424744, 0.014115291647613049, ...",1
2354,20200131,australian open dominic thiem beats alexander ...,"[-0.010364953428506851, 0.03771139308810234, 0...",1
2355,20200131,australian open has delivered more that we cou...,"[-0.018959054723381996, 0.051743846386671066, ...",1
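
Scrolling through every row of a cluster is slow. One possible shortcut is to look only at the headlines closest to each cluster's mean vector. The helper below is a hypothetical sketch and not part of the original notebook; the name `representative_indices` and the toy data are assumptions.

```python
import numpy as np


def representative_indices(embeddings, labels, top_n=3):
    """For each cluster label, return the positions of the top_n rows closest to the cluster mean."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    result = {}
    for label in np.unique(labels):
        member_idx = np.flatnonzero(labels == label)
        centroid = embeddings[member_idx].mean(axis=0)
        # Euclidean distance to the centroid, the same geometry k-means optimizes.
        dists = np.linalg.norm(embeddings[member_idx] - centroid, axis=1)
        result[int(label)] = member_idx[np.argsort(dists)[:top_n]].tolist()
    return result


# Toy data: two well-separated groups in 2-D (illustrative values only).
toy_emb = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 5.2], [5.0, 7.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
print(representative_indices(toy_emb, toy_labels, top_n=1))
```

Applied to this notebook, `representative_indices(df['headline_emb'].tolist(), df['cluster'])` would return row positions that can be passed to `df.iloc[...]` to read a few typical headlines per cluster.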


## 2. Measuring information diversity

- How similar is the information carried by the news items within each cluster?

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

def calculate_diversity(df, column_name):
    """
    Measures how similar a set of embeddings is, using pairwise cosine similarity.
    
    :param df: DataFrame whose rows hold the embeddings
    :param column_name: name of the column containing one embedding (list of floats) per row
    :return: (pairwise cosine similarity matrix with NaN on the diagonal,
              mean off-diagonal cosine similarity); a higher mean means the
              embeddings are more alike, i.e. less diverse
    """
    # Compute the pairwise cosine similarity between all embeddings
    embeddings = np.vstack(df[column_name])
    cosine_sim = cosine_similarity(embeddings)
    
    # Exclude self-comparisons (diagonal elements) before averaging
    np.fill_diagonal(cosine_sim, np.nan)  # a row's similarity with itself is excluded
    avg_similarity = np.nanmean(cosine_sim)
    
    return cosine_sim, avg_similarity


In [15]:
sim, avg = calculate_diversity(df, 'headline_emb')

In [16]:
avg

0.19837456404442783

In [21]:
diversity_score = {k:calculate_diversity(df.loc[df['cluster']==k], 'headline_emb')[1] for k in range(0, 15)}

In [22]:
diversity_score

{0: 0.2974562651570661,
 1: 0.4269269249796021,
 2: 0.2866819311418472,
 3: 0.1914343118015805,
 4: 0.25824106832225036,
 5: 0.8579692271248459,
 6: 0.3731762928424978,
 7: 0.45046801441034473,
 8: 0.360956829178157,
 9: 0.37260189328386645,
 10: 0.15387536426567106,
 11: 0.3722846108917884,
 12: 0.42202751526105636,
 13: 0.31245127889885826,
 14: 0.19808206341634074}
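
The scores above are mean cosine similarities, so a higher value means the headlines in that cluster are more alike. Cluster 5 (about 0.86) is the most homogeneous and cluster 10 (about 0.15) the most varied. To read the number as a diversity score where higher means more diverse, one option is to report the mean cosine distance, `1 - similarity`. The sketch below is an illustration with made-up vectors; the function name `mean_cosine_distance` is an assumption, and it expects at least two non-zero vectors.

```python
import numpy as np


def mean_cosine_distance(embeddings):
    """Average pairwise cosine distance (1 - similarity) over distinct pairs."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize each row
    sim = x @ x.T  # cosine similarity of every pair
    upper = np.triu_indices(len(x), k=1)  # each unordered pair once, diagonal excluded
    return float(1.0 - sim[upper].mean())


near_duplicates = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(mean_cosine_distance(near_duplicates), mean_cosine_distance(mixed))
```

Because the similarity matrix is symmetric, averaging over the upper triangle gives the same mean as averaging over all off-diagonal entries, so this value equals one minus the score returned by `calculate_diversity`.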

In [23]:
df.loc[df['cluster']==7]

Unnamed: 0,publish_date,headline_text,headline_emb,cluster
3,20200101,adelaides 9pm fireworks spark blaze on riverbank,"[0.03185156360268593, -1.126696679421002e-05, ...",7
7,20200101,bushfire relief: how you can help frontline se...,"[-0.006826662924140692, -0.018459511920809746,...",7
8,20200101,bushfires what now for stranded,"[0.004304838832467794, 0.009886191226541996, 0...",7
14,20200101,family defends home from fire at goongerah,"[0.014011625200510025, 0.025837328284978867, 0...",7
21,20200101,karen lissa describes the bushfire swept throu...,"[-0.009976834058761597, -0.020278777927160263,...",7
...,...,...,...,...
2342,20200130,weather event making australian bushfires wors...,"[-0.02796875685453415, 0.04293641075491905, 0....",7
2350,20200131,act enters state of emergency amid bushfire th...,"[-0.015671929344534874, 0.011608334258198738, ...",7
2360,20200131,brumbies game threatened by smoke and heat,"[-0.033987779170274734, 0.020251505076885223, ...",7
2430,20200131,tourism industry questions nsw response to bus...,"[-0.018678519874811172, -0.02478831633925438, ...",7


## 3. Outlier detection
- Are there items that do not really belong to their cluster?

<img src="https://miro.medium.com/v2/resize:fit:725/1*y3wXEId0poYUIzCD3HBh4w.png"/>
<br>
Source: https://miro.medium.com/v2/resize:fit:725/1*y3wXEId0poYUIzCD3HBh4w.png

In [24]:
from sklearn.ensemble import IsolationForest

In [29]:
cluster = df.loc[df['cluster']==10]

In [30]:
iso_forest = IsolationForest(contamination=0.05)  # Adjust contamination as needed
anomalies = iso_forest.fit_predict(cluster['headline_emb'].tolist())

anomalous_headlines = np.array(cluster['headline_text'].tolist())[anomalies == -1]
# print("Anomalous Headlines:", anomalous_headlines)

In [31]:
anomalies

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,
        1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1

In [32]:
anomalous_headlines

array(['dozens of residents are queuing to get into',
       'cancer signs potentially missed hundreds of patients by doctor',
       'we were expecting something like this to happen:',
       'asic guides banks and brokers on home loans',
       'baby yoda name still secret gender probably male the mandalorian',
       'harry and meghan royal family in uncharted territory',
       'turning 20 in 2020',
       'women not attracted to men with beards study finds',
       'bodysurfing how to get started beach surfing',
       'hundreds central american migrants wade across river into mexico',
       'colonic irrigation is it safe effective',
       'what can parents do to help their childs teacher',
       'how you can send your child to school outside catchment zone',
       'suicide concerns mental health of international students'],
      dtype='<U64')
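
As a sanity check on the `IsolationForest` result, a much simpler rule can be tried: flag the rows farthest from the cluster's mean vector. This is a different and cruder technique than Isolation Forest, shown only as a hedged sketch. The function name `centroid_outliers` and the toy data are assumptions, and the two methods will not necessarily agree on which headlines are outliers.

```python
import numpy as np


def centroid_outliers(embeddings, contamination=0.05):
    """Flag the `contamination` fraction of rows farthest from the mean vector.

    Returns +1 for inliers and -1 for outliers, mirroring IsolationForest's labels.
    """
    x = np.asarray(embeddings, dtype=float)
    dists = np.linalg.norm(x - x.mean(axis=0), axis=1)
    n_out = max(1, int(round(contamination * len(x))))  # always flag at least one row
    labels = np.ones(len(x), dtype=int)
    labels[np.argsort(dists)[-n_out:]] = -1
    return labels


# Toy cluster: nine points near the origin and one far away (illustrative values).
toy = [[0.1 * i, 0.0] for i in range(9)] + [[10.0, 10.0]]
flags = centroid_outliers(toy, contamination=0.1)
print(flags)
```

With either method, the inliers can be kept with a boolean mask such as `cluster[anomalies == 1]` before the content is stored.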

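Beyond simply turning text into embeddings, <br>
embeddings can also be used to preprocess and explore the content itself, for example by grouping texts with shared characteristics or excluding texts judged to be unrelated.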

--END--