파이썬 머신러닝 완벽 가이드 개정2판
ch8. 7
 p.538-549

# 8. 텍스트 분석
## 07. 문서 군집화 소개와 실습 (Opinion Review 데이터 세트)

### [문서 군집화 개념]
문서 군집화 : 비슷한 텍스트 구성의 문서를 군집화하는 것
- 학습 데이터 세트가 필요 없는 비지도 학습 기반으로 동작

### [Opinion Review 데이터 세트를 이용한 문서 군집화 수행하기]
UCI 머신러닝 리포지토리에 있는 Opinion Review 데이터 세트
- 51개의 텍스트 파일로 구성
- 호텔, 자동차, 전자제품 사이트에서 가져온 리뷰 문서

In [29]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
import pandas as pd
import glob, os
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# 필수 NLTK 데이터 다운로드
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [31]:
import pandas as pd
import glob, os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# 디렉터리
path = r'/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics'
# path로 지정한 디렉터리 밑에 있는 모든 .data 파일들의 파일명을 리스트로 취합
all_files = glob.glob(os.path.join(path, '*.data'))
filename_list = []
opinion_text = []

# 개별 파일들의 파일명은 filename_list 리스트로 취합,
# 개별 파일들의 파일 내용은 DataFrame 로딩 후 다시 string으로 변환하여 opinion_text 리스트로 취합
for file_ in all_files:
    # 개별 파일을 읽어서 DataFrame으로 생성
    df = pd.read_table(file_, index_col=None, header=0, encoding='latin1')
    # 절대 경로로 주어진 파일명을 가공. Linux에서 수행 시에는 아래 \\를 / 변경.
    # 맨 마지막 .data 확장자도 제거
    filename_ = file_.split('\\')[-1]
    filename = filename_.split('.')[0]

    # 파일명 리스트와 파일 내용 리스트에 파일명과 파일 내용을 추가
    filename_list.append(filename)
    opinion_text.append(df.to_string())

# 파일명 리스트와 파일 내용 리스트를 DataFrame으로 생성
document_df = pd.DataFrame({'filename': filename_list, 'opinion_text': opinion_text})
document_df.head()

Unnamed: 0,filename,opinion_text
0,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi..."
1,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now..."
2,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ..."
3,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ..."
4,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...


-> 각 파일 이름 (filename) 자체만으로 의견(opinion)의 텍스트(text)가 어떠한 제품/서비스에 대한 리뷰인지 알 수 있음

문서를 TF-IDF 형태로 피처 벡터화하기
- LemNormalize() 함수 : 어근 변환 수행

In [32]:
# 텍스트 정규화 및 토큰화
def LemNormalize(text):
    wordnet_lemmatizer = WordNetLemmatizer()
    words = word_tokenize(text)  # nltk 토큰화 적용
    return [wordnet_lemmatizer.lemmatize(word) for word in words]


# 빈 문서 제거
document_df = document_df.dropna(subset=['opinion_text'])

# Ensure the 'opinion_text' column contains strings
document_df['opinion_text'] = document_df['opinion_text'].astype(str)

document_df = document_df[document_df['opinion_text'].str.len() > 0]

# TF-IDF 벡터화 (stop_words 제거, min_df 완화)
tfidf_vect = TfidfVectorizer(tokenizer=None, preprocessor=None,
                             ngram_range=(1,2), min_df=0.01, max_df=0.95)
feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

문서별 텍스트가 TF-IDF 변환된 피처 벡터화 행렬 데이터에 대해서 군집화를 수행 -> 어떤 문서끼리 군집되는지 확인하기
- 군집화 기법 : K-평균 적용
- 최대 반복 횟수 max_iter = 10000

In [33]:
# KMeans 클러스터링
km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0, n_init='auto')
km_cluster.fit(feature_vect)

cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_

document_df 에 'cluster_label' 칼럼을 추가해 저장하기

In [34]:
document_df['cluster_label'] = cluster_label
document_df.head()

Unnamed: 0,filename,opinion_text,cluster_label
0,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi...",1
1,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now...",1
2,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",0
3,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ...",2
4,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...,1


군집화 결과 확인해보기

- cluster_label=0 인 데이터 세트

In [35]:
document_df[document_df['cluster_label']==0].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label
2,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",0
8,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 Great gas mileage and comfortable on long trips ...",0
19,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/gas_mileage_toyota_camry_2007,Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 ...,0
21,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/interior_honda_accord_2008,I love the new body style and the interior is a simple pleasure except for the center dash .\n0 ...,0
15,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/interior_toyota_camry_2007,"First of all, the interior has way too many cheap plastic parts like the cheap plastic center piece that houses the clock .\n0 3 blown struts at 30,000 miles, interior trim coming loose and rattling squeaking, stains on paint, and bug splats taking paint off, premature uneven brake wear, on 3rd windsh...",0
16,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/mileage_honda_accord_2008,"It's quiet, get good gas mileage and looks clean inside and out .\n0 The mileage is great, and I've had to get used to stopping less for gas .\n1 Thought gas ...",0
27,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/performance_honda_accord_2008,"Very happy with my 08 Accord, performance is quite adequate it has nice looks and is a great long, distance cruiser .\n0 6, 4, 3 eco engine has poor performance and gas mileage of 22 highway .\n1 Overall performance is good but comfort level is poor .\n2 ...",0
28,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/quality_toyota_camry_2007,I previously owned a Toyota 4Runner which had incredible build quality and reliability .\n0 I bought the Camry because of Toyota reliability and qua...,0
37,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/seats_honda_accord_2008,"Front seats are very uncomfortable .\n0 No memory seats, no trip computer, can only display outside temp with trip odometer .\n1 ...",0
49,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/transmission_toyota_camry_2007,"After slowing down, transmission has to be kicked to speed up .\n0 ...",0


-> cluster #0은 호텔에 대한 리뷰로 군집화됨

In [36]:
document_df[document_df['cluster_label']==1].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label
0,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi...",1
6,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/battery-life_amazon_kindle,"After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !\n0 After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !\n1 ...",1
4,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...,1
1,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now...",1
11,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/display_garmin_nuvi_255W_gps,"3 quot widescreen display was a bonus .\n0 This made for smoother graphics on the 255w of the vehicle moving along displayed roads, where the 750's display was more of a jerky movement .\n1 ...",1
13,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/eyesight-issues_amazon_kindle,"It feels as easy to read as the K1 but doesn't seem any crisper to my eyes .\n0 the white is really GREY, and to avoid considerable eye, strain I had to refresh pages every other page .\n1 The dream has always been a portable electronic device that could hold a ton of reading material, automate subscriptions and fa...",1
12,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/features_windows7,"I had to uninstall anti, virus and selected other programs, some of which did not have listings in the Programs and Features Control Panel section .\n0 This review briefly touches upon some of the key features and enhancements of Microsoft's latest OS .\n1 ...",1
20,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/keyboard_netbook_1005ha,", I think the new keyboard rivals the great hp mini keyboards .\n0 Since the battery life difference is minimum, the only reason to upgrade would be to get the better keyboard .\n1 The keyboard is now as good as t...",1
18,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/navigation_amazon_kindle,"In fact, the entire navigation structure has been completely revised , I'm still getting used to it but it's a huge step forward .\n0 ...",1
31,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/performance_netbook_1005ha,"The Eee Super Hybrid Engine utility lets users overclock or underclock their Eee PC's to boost performance or provide better battery life depending on their immediate requirements .\n0 In Super Performance mode CPU, Z shows the bus speed to increase up to 169 .\n1 One...",1


-> Cluster #1 : 킨들, 아이팟, 넷북 등의 포터블 전자기기 및 주요 구성요소에 대한 리뷰로 군집화됨

In [37]:
document_df[document_df['cluster_label']==2].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label
3,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ...",2
5,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/buttons_amazon_kindle,"I thought it would be fitting to christen my Kindle with the Stephen King novella UR, so went to the Amazon site on my computer and clicked on the button to buy it .\n0 As soon as I'd clicked the button to confirm my order it appeared on my Kindle almost immediately !\n1 ...",2
7,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/directions_garmin_nuvi_255W_gps,You also get upscale features like spoken directions including street names and programmable POIs .\n0 I used to hesitate to go out of my directions but no...,2
9,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/fonts_amazon_kindle,"Being able to change the font sizes is awesome !\n0 For whatever reason, Amazon decided to make the Font on the Home Screen ...",2
10,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/food_holiday_inn_london,The room was packed to capacity with queues at the food buffets .\n0 The over zealous staff cleared our unfinished drinks while we were collecting cooked food and movement around the room with plates was difficult in the crowded circumstances .\n1 ...,2
23,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/food_swissotel_chicago,The food for our event was delicious .\n0 ...,2
22,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/free_bestwestern_hotel_sfo,The wine reception is a great idea as it is nice to meet other travellers and great having access to the free Internet access in our room .\n0 They also have a computer available with free internet which is a nice bonus but I didn't find that out till the day before we left but was still able to get on there to check our flight to Vegas the next day .\n1 ...,2
17,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/location_bestwestern_hotel_sfo,"Good Value good location , ideal choice .\n0 Great Location , Nice Rooms , Helpless Concierge\n1 ...",2
14,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/location_holiday_inn_london,"Great location for tube and we crammed in a fair amount of sightseeing in a short time .\n0 All in all, a normal chain hotel on a nice lo...",2
29,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/parking_bestwestern_hotel_sfo,Parking was expensive but I think this is common for San Fran .\n0 there is a fee for parking but well worth it seeing no where to park if you do have a car .\n1 ...,2


-> Cluster #2 : 주로 차량용 내비게이션 리뷰로 군집화됨

In [38]:
document_df[document_df['cluster_label']==3].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label


-> Cluster #3 : 킨들 리뷰가 한 개 섞여 있지만, 대부분 호텔에 대한 리뷰로 군집화됨

In [39]:
document_df[document_df['cluster_label']==4].sort_values(by='filename')

Unnamed: 0,filename,opinion_text,cluster_label


-> Cluster #4 : 토요타, 혼다 등의 자동차에 대한 리뷰로 군집화됨

=> 군집 개수가 약간 많게 설정돼 있어서 세분화되어 군집화된 경향이 있음

중심 개수를 5개에서 3개로 낮춰서 3개 그룹으로 군집화한 뒤 결과 확인해보기

In [40]:
from sklearn.cluster import KMeans

# 3개의 집합으로 군집화
km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_

# 소속 클러스터를 cluster_label 칼럼으로 할당하고 cluster_label 값으로 정렬
document_df['cluster_label'] = cluster_label
document_df.sort_values(by='cluster_label')

Unnamed: 0,filename,opinion_text,cluster_label
2,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",0
8,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 Great gas mileage and comfortable on long trips ...",0
15,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/interior_toyota_camry_2007,"First of all, the interior has way too many cheap plastic parts like the cheap plastic center piece that houses the clock .\n0 3 blown struts at 30,000 miles, interior trim coming loose and rattling squeaking, stains on paint, and bug splats taking paint off, premature uneven brake wear, on 3rd windsh...",0
28,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/quality_toyota_camry_2007,I previously owned a Toyota 4Runner which had incredible build quality and reliability .\n0 I bought the Camry because of Toyota reliability and qua...,0
27,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/performance_honda_accord_2008,"Very happy with my 08 Accord, performance is quite adequate it has nice looks and is a great long, distance cruiser .\n0 6, 4, 3 eco engine has poor performance and gas mileage of 22 highway .\n1 Overall performance is good but comfort level is poor .\n2 ...",0
21,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/interior_honda_accord_2008,I love the new body style and the interior is a simple pleasure except for the center dash .\n0 ...,0
19,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/gas_mileage_toyota_camry_2007,Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 ...,0
16,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/mileage_honda_accord_2008,"It's quiet, get good gas mileage and looks clean inside and out .\n0 The mileage is great, and I've had to get used to stopping less for gas .\n1 Thought gas ...",0
49,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/transmission_toyota_camry_2007,"After slowing down, transmission has to be kicked to speed up .\n0 ...",0
37,/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/seats_honda_accord_2008,"Front seats are very uncomfortable .\n0 No memory seats, no trip computer, can only display outside temp with trip odometer .\n1 ...",0


### [군집별 핵심 단어 추출하기]
각 군집을 구성하는 핵심 단어가 어떤 것이 있는지 확인하기

KMeans 객체의 clusters_centers_
- 각 군집을 구성하는 단어 피처가 군집의 중심을 기준으로 얼마나 가깝게 위치해 있ㄴ느지
- 배열 값으로 제공, 행은 개별 군집, 열은 개별 피처
- 각 배열 내의 값 : 개별 군집 내의 상대 위치를 숫자 값으로 표현한 일종의 좌표 값

In [41]:
cluster_centers = km_cluster.cluster_centers_
print('cluster_centers shape : ', cluster_centers.shape)
print(cluster_centers)

cluster_centers shape :  (3, 66040)
[[0.00071021 0.         0.         ... 0.         0.         0.        ]
 [0.00065802 0.00109736 0.         ... 0.00238505 0.00132443 0.00117224]
 [0.00299631 0.         0.00087601 ... 0.         0.         0.        ]]


- (3, 66040) 배열 : 군집이 3개, word 피처가 66040개로 구성됨
- 0에서 1까지의 값 : 1에 가까울수록 중심과 가까운 값


cluster_centers_ 속성값을 이용해 각 군집별 핵심 단어 찾아보기
- cluster_centers_ 속성 : 넘파이의 ndarray
- 위치 인덱스 값 : 핵심 단어 피처의 이름을 출력하기 위해 필요

get_cluster_details() 함수 생성
- cluster_centers_ 배열 내에서 가장 값이 큰 데이터의 위치 인덱스를 추출
- 해당 인덱스를 이용해 핵심 단어 이름과 그때의 상대 위치 값을 추출
- cluster_details라는 Dict 객체 변수에 기록하고 반환하기

In [49]:
# 군집별 top n 핵심 단어, 그 단어의 중심 위치 상댓값, 대상 파일명을 반환함
def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num, top_n_features=10):
    cluster_details = {}

    # cluster_centers array의 값이 큰 순으로 정렬된 인덱스 값을 반환
    # 군집 중심점별 할당된 word 피처들의 거리값이 큰 순으로 값을 구하기 위함
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:, ::-1]

    # 개별 군집별로 반복하면서 핵심 단어, 그 단어의 중심 위치 상댓값, 대상 파일명 입력
    for  cluster_num in range(clusters_num):
        # 개별 군집별 정보를 담을 데이터 초기화
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]['cluster'] = cluster_num

        # cluster_centers_.argsort()[:, ::-1]로 구한 인덱스를 이용해 top_n 피처 단어를 구함
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [ feature_names[ind] for ind in top_feature_indexes ]

        # top_feature_indexes를 이용해 해당 피처 단어의 중심 위치 상댓값 구함
        top_feature_values = cluster_model.cluster_centers_[cluster_num, top_feature_indexes].tolist()

        # cluster_details 딕셔너리 객체에 개별 군집별 핵심단어와 중심위치 상댓값, 해당 파일명 입력
        cluster_details[cluster_num]['top_features'] = top_features
        cluster_details[cluster_num]['top_features_value'] = top_feature_values
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()

        cluster_details[cluster_num]['filenames'] = filenames

    return cluster_details

get_cluster_details() 호출 : dictionary를 원소로 가지는 리스트인 cluster_details를 반환
- cluster_details : 개별 군집번호, 핵심 단어, 핵심 단어 중심 위치 상댓값, 파일명 속성 값 정보
- 이를 보기 좋게 print_cluster_details() 함수 제작

In [43]:
def print_cluster_details(cluster_details):
    for  cluster_num, cluster_detail in cluster_details.items():
        print('##### Cluster {0}'.format(cluster_num))
        print('Top features:', cluster_detail['top_features'])
        print('Reviews 파일명 : ', cluster_detail['filenames'][:7])
        print(' ============================================== ')

두 함수 호출하기

In [50]:
feature_names = tfidf_vect.get_feature_names_out()

cluster_details = get_cluster_details(cluster_model=km_cluster, cluster_data=document_df, feature_names=feature_names, clusters_num=3, top_n_features=10)

print_cluster_details(cluster_details)


##### Cluster 0
Top features: ['interior', 'mileage', 'seats', 'gas', 'comfortable', 'gas mileage', 'the interior', 'transmission', 'car', 'performance']
Reviews 파일명 :  ['/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_honda_accord_2008', '/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/comfort_toyota_camry_2007', '/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/interior_toyota_camry_2007', '/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/mileage_honda_accord_2008', '/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/gas_mileage_toyota_camry_2007', '/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/interior_honda_accord_2008', '/content/drive/MyDrive/2025-1/2025-1 ESAA OB/과제/data/topics/performance_honda_accord_2008']
##### Cluster 1
Top features: ['battery', 'screen', 'battery life', 'the screen', 'life', 'keyboard', 'the battery', 'video', 'speed', 'the keyboard']
Reviews 파일명 :  ['/content/drive/