# 감성 분석
- 문서의 주관적인 감성/의견/감정/기분 등을 파악하기 위한 방법
- 소셜미디어, 여론조사, 온라인리뷰, 피드백 등 다양한 분야에서 활용되고 있음
- 문서 내 텍스트가 나타내는 여러가지 주관적인 단어와 문맥을 기반으로 감성(Sentiment) 수치를 계산하는 방법 이용
- 지도학습/비지도학습 방식으로 나뉨
    - 지도학습: 학습 데이터와 타깃 레이블 값을 기반으로 감성 분석 학습 수행 후 다른 데이터의 감성 분석 예측
    - 비지도학습: 'Lexicon'이라는 일종의 감성 어휘 사전 이용, 용어와 문맥에 대한 다양한 정보를가지고 있으며, 이를 이용해 문서의 긍정/부정 여부 판단

## 실습: IMDB 영화평

## 지도학습 기반 감성분석

- id: 각 데이터의 id
- sentiment: 영화평 label값
- review: 영화평 텍스트

In [109]:
import pandas as pd

review_df = pd.read_csv('./labeledTrainData.tsv', header = 0, sep = '\t', quoting = 3)
review_df.head(3)

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."


In [110]:
print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

### 정규표현식으로 데이터 전처리

In [111]:
import re

#<br /> 태그 공백으로 변환
review_df['review'] = review_df['review'].str.replace('<br />',' ')

# 영어 문자열이 아닌 것은 모두 공백으로 변환
review_df['review'] = review_df['review'].apply(lambda x: re.sub('[^a-zA-Z]',' ',x))

In [112]:
# 확인
review_df['review'][0]

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for 

### 학습용/테트스용 데이터 분리

#### 안됨

In [81]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df['review']

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size = 0.3, random_state = 156)

X_train.shape, X_test.shape

((17500,), (7500,))

In [82]:
X_train

3724     ThisversionmovedalittleslowformytasteandIsuppo...
23599    IreallyenjoyedthisfilmbecauseIhaveatremendousi...
11331    Sawthisinthetheaterinandfelloutofmychairlaughi...
15745    RecentlyIwaslookingforthenewlyissuedWideScreen...
845      Escapingthelifeofbeingpimpedbyherfatherandthes...
                               ...                        
6955     Thisisagenerallynicefilmwithgoodstorygreatacto...
7653     TherealshameofTheGatheringisnotinthebadactingn...
9634     Inwhatcouldhavebeenanotherwiserunofthemillmedi...
6860     ExcellentPOWadventureadaptedbyEricWilliamsfrom...
24108    ThisonefeaturesallthebadeffectofPriorscheapomo...
Name: review, Length: 17500, dtype: object

In [83]:
X_test

1692     MygirlfriendandIwerestunnedbyhowbadthisfilmwas...
13392    Whatdoyouexpectwhenthereisnoscripttobeginwitha...
21063    ThisisaGermanfilmfromthatissomethingtodowithso...
10335    RichardTylerisalittleboywhoisscaredofeverythin...
16847    IrunagrouptostopcomedianexploitationandIjustsp...
                               ...                        
14848    Ifyouliketocommentonfilmswherethescriptarriveh...
8450     FirstletmesaythatNotoriousisanabsolutelycharmi...
8221     Realisticmoviesureexceptforthefactthatthechara...
10638    IwillspendafewdaysdedicatedtoRonHowardbeforeIs...
20673    JerryspiesTomlisteningtoacreepystoryontheradio...
Name: review, Length: 7500, dtype: object

#### 됨

In [113]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id','sentiment'], axis = 1, inplace = False)

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size = 0.3, random_state = 156)

X_train.shape, X_test.shape

((17500, 1), (7500, 1))

In [114]:
X_train

Unnamed: 0,review
3724,This version moved a little slow for my taste...
23599,I really enjoyed this film because I have a t...
11331,Saw this in the theater in and fell out o...
15745,Recently I was looking for the newly issued W...
845,Escaping the life of being pimped by her fath...
...,...
6955,This is a generally nice film with good stor...
7653,The real shame of The Gathering is not in...
9634,In what could have been an otherwise run of t...
6860,Excellent P O W adventure adapted by Eric W...


In [115]:
X_test

Unnamed: 0,review
1692,My girlfriend and I were stunned by how bad t...
13392,What do you expect when there is no script to...
21063,This is a German film from that is somet...
10335,Richard Tyler is a little boy who is scared o...
16847,I run a group to stop comedian exploitation a...
...,...
14848,If you like to comment on films where the scr...
8450,First let me say that Notorious is an absolu...
8221,Realistic movie sure except for the fact that...
10638,I will spend a few days dedicated to Ron Howa...


### 피처 벡터화

#### CounterVectorizer 기반

In [116]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score


# CounterVectorizer 파라미터 지정: 스톱워드는 english, ngram_range는 (1,2) 
# LogiticRegression의 C는 10으로 설정
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words = 'english', ngram_range = (1,2))),
    ('lr_clf', LogisticRegression(solver = 'liblinear', C = 10))
])

# pipeline 객체를 이용해 학습/예측/평가 수행
## 학습
pipeline.fit(X_train['review'], y_train)

## 예측
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[: ,1]

## 평가
accuracy = accuracy_score(y_test, pred)
roc_auc_score = roc_auc_score(y_test, pred_probs)

print(f'예측 정확도는 {np.round(accuracy,3)}, ROC-AUC는 {np.round(roc_auc_score,3)}')

예측 정확도는 0.886, ROC-AUC는 0.95


#### TF-IDF 기반

In [119]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# TfidfVectorizer 파라미터 지정: 스톱워드는 english, ngram_range는 (1,2) 
# LogiticRegression의 C는 10으로 설정
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words = 'english', ngram_range = (1,2))),
    ('lr_clf', LogisticRegression(solver = 'liblinear', C = 10))
])


# pipeline 객체를 이용해 학습/예측/평가 수행
## 학습
pipeline.fit(X_train['review'], y_train)

## 예측
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:,1]

## 평가
accuracy = accuracy_score(y_test, pred)
roc_auc = roc_auc_score(y_test, pred_probs)

print(f'예측 정확도는 {accuracy}, ROC-AUC는 {roc_auc}')

예측 정확도는 0.8936, ROC-AUC는 0.959800534941953


## 비지도학습 기반 감성분석

- 비지도 감성분석은 Lexicon을 기반으로 함
- 많은 감성 분석용 데이터는 결정된 레이블 값을 가지고 있지 않아서 Lexicon이 유용하게 사용됨
- 다만, 한글 지원이 되지 않음
- Lexicon은 일반적으로 어휘집을 의미하지만, 주로 감성만을 분석하기 위해 지원하는 감성 어휘사전
    - 감성 어휘사전은 긍정감성/부정감성의 정도를 의미하는 감성 지수를 가짐
    - 감성 지수는 단어의 위치나 주변 단어, 문맥, POS(품사) 등을 참고해 결정됨
    - NLTK 패키지에서 포함된 모듈임


- NLP 패키지
    - WordNet: 단순한 어휘사전이 아닌 시맨틱 분석을 제공하는 어휘사전임
        - 시맨틱(Semantic): 문맥상의 의미
        - 다양한 상황에서 같은 어휘라도 다르게 사용되는 어휘의 시맨틱 정보를 제공하며, 이를 위해 각각의 품사로 구성된 개별 단어를 Synset이라는 개념을 이용해 표현함
        - synset은 단순한 하나의 단어가 아니라 그 단어가 가지는 문맥, 시맨틱 정보를 제공함
    - SentiWordNet: NLTK패키지의 WordNet과 유사하게 감성 단어 전용의 WordNet을 구현한 것
        - WordNet의 synset 개념을 감성 분석에 적용한 것
        - WordNet의 synset 별로 긍정/부정/객관성 3가지 감성 점수를 할당함
    - VADER: 주로 소셜 미디어의 텍스트에 대한 감성 분석을 제공하기 위한 패키지
        - 뛰어난 감성 분석 결과를 제공하며, 비교적 빠른 수행시간을 보장해 대용량 텍스트 데이터에 잘 사용되는 패키지

In [41]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping grammars\basque_grammars.zip.
[nltk_data]    | Downloa

[nltk_data]    |   Unzipping corpora\nonbreaking_prefixes.zip.
[nltk_data]    | Downloading package nps_chat to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\nps_chat.zip.
[nltk_data]    | Downloading package omw to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package omw-1.4 to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Package omw-1.4 is already up-to-date!
[nltk_data]    | Downloading package opinion_lexicon to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\opinion_lexicon.zip.
[nltk_data]    | Downloading package panlex_swadesh to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package paradigms to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\paradigms.zip.
[nltk_data]    | Downloading package p

[nltk_data]    |   Unzipping corpora\webtext.zip.
[nltk_data]    | Downloading package wmt15_eval to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping models\wmt15_eval.zip.
[nltk_data]    | Downloading package word2vec_sample to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping models\word2vec_sample.zip.
[nltk_data]    | Downloading package wordnet to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Package wordnet is already up-to-date!
[nltk_data]    | Downloading package wordnet2021 to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package wordnet31 to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package wordnet_ic to
[nltk_data]    |     C:\Users\NEW\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\wordnet_ic.zip.
[nltk_data]    | Downloading package words t

True

### WordNet

In [122]:
from nltk.corpus import wordnet as wn

# 단어 지정
term = 'present'

# present라는 단어로 wordnet의 synsets 생성
synsets = wn.synsets(term)


print(f'synsets 반환 type: {type(synsets)}')
print(f'synsets 반환 값 개수: {len(synsets)}')
print(f'synsets 반환 값: \n{synsets}')

synsets 반환 type: <class 'list'>
synsets 반환 값 개수: 18
synsets 반환 값: 
[Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]


- synset 객체의 파라미터: 'present.n.01' -> POS태그를 나타냄
    - present: 의미
    - n: 품사
    - 01: 품사로서 가지는 의미가 여러가지가 있어서 이를 구분하는 인덱스

- synset객체의 속성:
    - synset.lexname() -> POS: 품사 
    - synset.definition() -> Definition: 정의
    - synset.lemma_names() -> Lemma: 부명제

In [125]:
for synset in synsets:
    print(f'***synset: {synset.name()} ****')
    print(f'POS: {synset.lexname()}')
    print(f'Definition: {synset.definition()}')
    print(f'Lemmas: {synset.lemma_names()}')
    print('\n')

***synset: present.n.01 ****
POS: noun.time
Definition: the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas: ['present', 'nowadays']


***synset: present.n.02 ****
POS: noun.possession
Definition: something presented as a gift
Lemmas: ['present']


***synset: present.n.03 ****
POS: noun.communication
Definition: a verb tense that expresses actions or states at the time of speaking
Lemmas: ['present', 'present_tense']


***synset: show.v.01 ****
POS: verb.perception
Definition: give an exhibition of to an interested audience
Lemmas: ['show', 'demo', 'exhibit', 'present', 'demonstrate']


***synset: present.v.02 ****
POS: verb.communication
Definition: bring forward and present to the mind
Lemmas: ['present', 'represent', 'lay_out']


***synset: stage.v.01 ****
POS: verb.creation
Definition: perform (a play), especially on a stage
Lemmas: ['stage', 'present', 'represent']


***synset: present.v.04 ****
POS: verb.possession
Defini

#### path_similarity 메서드: 어휘 간 유사도 측정

In [132]:
from nltk.corpus import wordnet as wn

# synset 객체를 단어별로 생성

tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.01')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')


# 단어 객체 리스트
entities = [tree, lion, tiger, cat, dog]

# 유사도 측정값 담을 빈 리스트
similarities = []

# 단어 객체 이름
entity_names = [entity.name().split('.')[0] for entity in entities]

# 단어별 synset을 반복하면서 다른 단어와의 synset유사도 측정
for entity in entities:
    similarity = [round(entity.path_similarity(compared_entity),2)
                     for compared_entity in entities ]
    similarities.append(similarity)
    
# 개별 단어별 synset과 다른 단어의 synset과의 유사도를 데이터프레임형태로 생성
similarity_df = pd.DataFrame(similarities, columns = entity_names, index = entity_names)
similarity_df

Unnamed: 0,tree,lion,tiger,cat,dog
tree,1.0,0.07,0.14,0.08,0.12
lion,0.07,1.0,0.08,0.25,0.17
tiger,0.14,0.08,1.0,0.09,0.17
cat,0.08,0.25,0.09,1.0,0.2
dog,0.12,0.17,0.17,0.2,1.0


### SentiWordNet

In [43]:
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print(f'senti_synsets 반환 타입: {type(senti_synsets)}')
print(f'senti_synsets 반환 값 개수: {len(senti_synsets)}')
print(f'senti_synsets 반환 값: \n{senti_synsets}')

senti_synsets 반환 타입: <class 'list'>
senti_synsets 반환 값 개수: 11
senti_synsets 반환 값: 
[SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]


#### father, fabulous의 긍정/부정/객관성 지수 추출

In [44]:
import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print(f'father 긍정감정 지수: {father.pos_score()}')
print(f'father 부정감정 지수: {father.neg_score()}')
print(f'father 객관성 지수: {father.obj_score()}')
print('\n')

fabulous = swn.senti_synset('fabulous.a.01')
print(f'fabulous 긍정감정 지수: {fabulous.pos_score()}')
print(f'fabulous 부정감정 지수: {fabulous.neg_score()}')
print(f'fabulous 객관성 지수: {fabulous.obj_score()}')

father 긍정감정 지수: 0.0
father 부정감정 지수: 0.0
father 객관성 지수: 1.0


fabulous 긍정감정 지수: 0.875
fabulous 부정감정 지수: 0.125
fabulous 객관성 지수: 0.0


- father는 객관적인 단어로 객관성 지수가 1.0, fabulous는 감성 단어로 긍정 감정 지수가 0.875, 부정 감정 지수가 0.125임

### 실습: SentiWordNet을 이용한 영화 감상평 감성 분석

1. 문서를 문장 단위로 분해
2. 다시 문장을 단어 단위로 토큰화하고 품사 태깅
3. 품사 태깅된 단어를 기반으로 synset 객체와 senti_synset 객체를 생성
4. senti_synset에서 긍정감성/부정 감성 지수를 구하고 이를 모두 합산해 특정 임계치 값 이상일 때 긍정 감성으로, 그렇지 않을 때는 부정 감성으로 결정

In [186]:
# 영화 감상평 데이터 불러오기

import pandas as pd

review_df = pd.read_csv('./labeledTrainData.tsv', header = 0, sep = '\t', quoting = 3)
review_df.head(3)

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur \""critics\"" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stac..."


#### 감상평 컬럼 정규표현식을 이용해 불필요한 단어 제거

In [187]:
review_df['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [188]:
import re

review_df['review'] = review_df['review'].str.replace('<br />', ' ')
review_df['review'] = review_df['review'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', x))

print(review_df['review'][0])

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for  

#### 품사 태깅을 수행하는 사용자 함수를 생성

In [189]:
from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ  # a 형용사
    elif tag.startswith('N'):
        return wn.NOUN  # n 명사
    elif tag.startswith('R'):
        return wn.ADV   # r 부사
    elif tag.startswith('V'):
        return wn.VERB   # v 동사

#### 문장-> 단어 토큰-> 품사 태깅-> SentiSynset 클래스 생성-> polarity Score 합산 함수 생성

In [190]:
from nltk import sent_tokenize, word_tokenize, pos_tag # 문장/단어 토큰화
from nltk.stem import WordNetLemmatizer  #품사 태깅
from nltk.corpus import sentiwordnet as swn #sentisynset클래스 생성
  
def swn_polarity(text):
    # 감성 지수 초기화
    sentiment = 0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    
    # 문단-> 문장 토큰화
    raw_sentences = sent_tokenize(text)
    
    for raw_sentence in raw_sentences:
        # 분해된 문장별로 단어 토큰화 후 품사분류 
        ## 예시: (단어, 품사태그)
        tagged_sentence = pos_tag(word_tokenize(raw_sentence)) 
#         print(tagged_sentence)
        for word, tag in tagged_sentence:
            # NTLK 기반의 품사 태깅 문장 추출
            wn_tag = penn_to_wn(tag)
#             print(wn_tag)
            
            # 명사, 형용사, 부사에 해당하지 않으면 빠이
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                continue
#             print(wn_tag)
            # 어근 추출
            lemma = lemmatizer.lemmatize(word, pos = wn_tag)
#             print(lemma)
            if not lemma:
                continue
#             print(lemma)
                
            # 어근을 추출한 단어와 wordnet 기반 품사 태깅을 입력해서 synset 객체 생성
            synsets = wn.synsets(lemma, pos = wn_tag)
#             print(synsets)
            # 없으면 돌아가잇!
            if not synsets:
                continue
#             print(synsets)
            # sentiwordnet의 감성 단어 분석으로 감성 synset 추출
            # 모든 단어에 대해 긍정 감정 지수는 +로, 부정 감정지수는 -로 합산해 감성 지수 계산
            synset = synsets[0]  # 왜 가장 첫번재걸 뽑지..
            ## sentisysset 객체화
            swn_synset = swn.senti_synset(synset.name())
            ## 감성지수 계산 후 합산
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            
            tokens_count += 1
    # 의미 있는 토큰이 없는 경우 0 반환
    if not tokens_count:
        return 0
    
    # 총 score가 0 이상인 경우 긍정의 1 반환
    if sentiment >= 0:
        return 1
    
    # 총 score가 0 미만인 경우 부정의 0
    return 0

In [191]:
# 확인용
text = ' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for    minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord  Why he wants MJ dead so bad is beyond me  Because MJ overheard his plans  Nah  Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates MJ s music   Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence  Also  the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene   Bottom line  this movie is for people who like MJ on one level or another  which i think is most people   If not  then stay away  It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl  Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty  Well  with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  He is either an extremely nice but stupid guy or one of the most sickest liars  I hope he is not the latter  '
swn_polarity(text)

0

#### 학습용/테스트용 데이터 생성

In [192]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(['id','sentiment'], axis = 1, inplace = False)

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, test_size = 0.3, random_state = 156)

X_train.shape, X_test.shape

((17500, 1), (7500, 1))

#### 영화감상평마다 긍정/부정을 나타내는 컬럼 생성

In [193]:
# 긍정/부정 결과를 나타내는 label 컬럼을 사용자함수를 적용하여 생성
review_df['pred'] = review_df['review'].apply(lambda x: swn_polarity(x))

In [194]:
# value 추출
y_target = review_df['sentiment'].values
preds = review_df['pred'].values


y_target, preds

(array([1, 1, 0, ..., 0, 0, 1], dtype=int64),
 array([0, 1, 0, ..., 1, 0, 0], dtype=int64))

#### 분석 예측 성능 확인하기

In [195]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

print(confusion_matrix(y_target, preds))
print(f'정확도: {np.round(accuracy_score(y_target, preds),4)}')
print(f'정밀도: {np.round(precision_score(y_target, preds),4)}')
print(f'재현율: {np.round(recall_score(y_target, preds),4)}')

[[7668 4832]
 [3636 8864]]
정확도: 0.6613
정밀도: 0.6472
재현율: 0.7091


In [196]:
review_df.head(3)

Unnamed: 0,id,sentiment,review,pred
0,"""5814_8""",1,With all this stuff going down at the moment with MJ i ve started listening to his music watching the odd documentary here and there watched The Wiz and watched Moonwalker again Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent Moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ i...,0
1,"""2381_9""",1,The Classic War of the Worlds by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H G Wells classic book Mr Hines succeeds in doing so I and those who watched his film with me appreciated the fact that it was not the standard predictable Hollywood fare that comes out every year e g the Spielberg version with Tom Cruise that had only the slightest resemblance to the book Obviously everyone looks for different things in a movie Those who envision themselves as amateur critics look only to criticize everything they can Others rate a movie on more important bases like being entertained which is why most ...,1
2,"""7759_3""",0,The film starts with a manager Nicholas Bell giving welcome investors Robert Carradine to Primal Park A secret project mutating a primal animal using fossilized DNA like Jurassik Park and some scientists resurrect one of nature s most fearsome predators the Sabretooth tiger or Smilodon Scientific ambition turns deadly however and when the high voltage fence is opened the creature escape and begins savagely stalking its prey the human visitors tourists and scientific Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre historical animals which are deadlier and bigger In addition a security agent Stac...,0


### VADER

In [197]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER 감성 분석 클래스 객체화
senti_analyzer = SentimentIntensityAnalyzer()

# polarity_scores 메서드로 감성 지수 출력
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)

{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}


- neg: 부정 감성 지수
- neu: 중립적인 감성 지수
- pos: 긍정 감성 지수
- compound: neg, new, pos 감성 지수를 적절히 조합해 -1~1 사이 감ㅁ성 지수를 표현한 값

#### 긍정/부정 구분해주는 사용자 함수 생성

In [198]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def vader_polarity(review, threshold = 0.1):
    # 감성분석해주는 클래스 객체화
    senti_analyzer = SentimentIntensityAnalyzer()
    
    # 감상평에 대한 여러 감성지수 추출
    senti_scores = senti_analyzer.polarity_scores(review)
    
    # neg, pos, neu를 고려한 감정지수를 객체화
    agg_score = senti_scores['compound']
    
    # 부정/긍정으로 나눔
    final_score = 1 if agg_score >= threshold else 0
    
    return final_score

In [199]:
# vader로 부정/긍정 감성지수 여부를 데이터프레임의 vader_preds 컬럼에 추가

review_df['vader_preds'] = review_df['review'].apply(lambda x: vader_polarity(x, 0.1))
review_df.head(5)

Unnamed: 0,id,sentiment,review,pred,vader_preds
0,"""5814_8""",1,With all this stuff going down at the moment with MJ i ve started listening to his music watching the odd documentary here and there watched The Wiz and watched Moonwalker again Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent Moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ i...,0,0
1,"""2381_9""",1,The Classic War of the Worlds by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H G Wells classic book Mr Hines succeeds in doing so I and those who watched his film with me appreciated the fact that it was not the standard predictable Hollywood fare that comes out every year e g the Spielberg version with Tom Cruise that had only the slightest resemblance to the book Obviously everyone looks for different things in a movie Those who envision themselves as amateur critics look only to criticize everything they can Others rate a movie on more important bases like being entertained which is why most ...,1,1
2,"""7759_3""",0,The film starts with a manager Nicholas Bell giving welcome investors Robert Carradine to Primal Park A secret project mutating a primal animal using fossilized DNA like Jurassik Park and some scientists resurrect one of nature s most fearsome predators the Sabretooth tiger or Smilodon Scientific ambition turns deadly however and when the high voltage fence is opened the creature escape and begins savagely stalking its prey the human visitors tourists and scientific Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre historical animals which are deadlier and bigger In addition a security agent Stac...,0,0
3,"""3630_4""",0,It must be assumed that those who praised this film the greatest filmed opera ever didn t I read somewhere either don t care for opera don t care for Wagner or don t care about anything except their desire to appear Cultured Either as a representation of Wagner s swan song or as a movie this strikes me as an unmitigated disaster with a leaden reading of the score matched to a tricksy lugubrious realisation of the text It s questionable that people with ideas as to what an opera or for that matter a play especially one by Shakespeare is about should be allowed anywhere near a theatre or film studio Syberberg very fashionably but without the smallest justifica...,0,1
4,"""9495_8""",1,Superbly trashy and wondrously unpretentious s exploitation hooray The pre credits opening sequences somewhat give the false impression that we re dealing with a serious and harrowing drama but you need not fear because barely ten minutes later we re up until our necks in nonsensical chainsaw battles rough fist fights lurid dialogs and gratuitous nudity Bo and Ingrid are two orphaned siblings with an unusually close and even slightly perverted relationship Can you imagine playfully ripping off the towel that covers your sister s naked body and then stare at her unshaven genitals for several whole minutes Well Bo does that to his sister and judging by her dubbed laughter sh...,0,1


#### 분석 예측 성능 확인하기

In [200]:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score

# y_test 
y_target = review_df['sentiment'].values

# y_pred
vader_preds = review_df['vader_preds'].values

print(f'오차행렬: \n{confusion_matrix(y_target, vader_preds)}')
print(f'정확도: {np.round(accuracy_score(y_target, vader_preds),3)}')
print(f'정밀도: {np.round(precision_score(y_target, vader_preds),3)}')
print(f'재현율: {np.round(recall_score(y_target, vader_preds),3)}')

오차행렬: 
[[ 6747  5753]
 [ 1858 10642]]
정확도: 0.696
정밀도: 0.649
재현율: 0.851


```
SentiWordNet을 활용한 감성분석 예측 성능 결과: 

[[7650 4850]
 [3549 8951]]
정확도: 0.664
정밀도: 0.6486
재현율: 0.7161
```
- 정확도가 약간 증가하고, 재현율이 큰 폭으로 증가함 

## 문서 군집화

In [201]:
import pandas as pd
import glob, os
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 700)

### 실습: Opinion Review 데이터세트를 이용한 무넛 군집화

#### 파일별 파일이름,파일 내용 데이터프레임 생성

In [237]:
# 기본 경로 지정
path = r'OpinosisDataset1.0\topics'

# path 경로의 디렉토리 속 모든 .data 파일들의 파일명을 리스트로 취합함
all_files = glob.glob(os.path.join(path, '*.data'))

filename_list = []
opinion_text = []
# 개별 파일들의 파일명은 filename_list로 취합
# 개별 파일들의 파일 내용은 데이터프레임 로딩 후 다시 string으로 변환하여 opinion_text 리스트로 취합
for file_ in all_files:
    # 개별 파일들을 읽어서 데이터프레임 형성
    df = pd.read_table(file_, index_col = None, header = 0, encoding = 'latin1')
#     display(df.head())
    
    # 절대 경로로 주어진 파일명을 가공
    filename_ = file_.split('\\')[-1].split('.')[0]
    
    # 파일이름 리스트에 추가, 파일 내용 추가
    filename_list.append(filename_)
#     print(df.to_string())
    opinion_text.append(df.to_string())

# 파일명 리스트와 파일 내용 리스트를 데이터프레임으로 생성
document_df = pd.DataFrame({'filename': filename_list, 
                            'opinion_text': opinion_text})
document_df

Unnamed: 0,filename,opinion_text
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi..."
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ..."
2,battery-life_amazon_kindle,"After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !\n0 After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !\n1 ..."
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now..."
5,buttons_amazon_kindle,"I thought it would be fitting to christen my Kindle with the Stephen King novella UR, so went to the Amazon site on my computer and clicked on the button to buy it .\n0 As soon as I'd clicked the button to confirm my order it appeared on my Kindle almost immediately !\n1 ..."
6,comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ..."
7,comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 Great gas mileage and comfortable on long trips ..."
8,directions_garmin_nuvi_255W_gps,You also get upscale features like spoken directions including street names and programmable POIs .\n0 I used to hesitate to go out of my directions but no...
9,display_garmin_nuvi_255W_gps,"3 quot widescreen display was a bonus .\n0 This made for smoother graphics on the 255w of the vehicle moving along displayed roads, where the 750's display was more of a jerky movement .\n1 ..."


- 각 파일 이름만으로는 의견의 텍스트가 어떠한 제품/서비스에 대한 리뷰인지 알 수 없음

#### LemNomalize 사용자함수 생성

In [238]:
from nltk.stem import WordNetLemmatizer
import nltk
import string

In [239]:
# 제거해야할 문장부호 딕셔너리화
## ord: 하나의 문자를 인자로 받고 해당 문자에 해당하는 유니코드 정수화
## string.punctuation: 따옴표,마침표 물음표 등의 문장부호
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)


# 표제어 추출: 어간 추출과 달리 단어의 형태가 적절히 보존되는 양상을 보이는 특징임
lemmar = WordNetLemmatizer()


# 입력으로 들어온 token 단어들에 대해 Lemmatization 어근 변환
def LemTokens(tokens):
    return [lemmar.lemmatize(token) for token in tokens]

# 입력으로 문장을 받아서 stop_word 제거(translate)-> 소문자 변환(lower) -> 단어 토큰화 -> lemmatization 어근 변환
## translate: 문자 변환 메서드
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

In [240]:
## 확인용

In [241]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [242]:
remove_punct_dict

{33: None,
 34: None,
 35: None,
 36: None,
 37: None,
 38: None,
 39: None,
 40: None,
 41: None,
 42: None,
 43: None,
 44: None,
 45: None,
 46: None,
 47: None,
 58: None,
 59: None,
 60: None,
 61: None,
 62: None,
 63: None,
 64: None,
 91: None,
 92: None,
 93: None,
 94: None,
 95: None,
 96: None,
 123: None,
 124: None,
 125: None,
 126: None}

In [243]:
text = "The wine reception is a great idea as it is nice to meet other travellers and great having access to the free Internet access in our room .\n0 They also have a computer available with free internet which is a nice bonus but I didn't find that out till the day before we left but was still able to get on there to check our flight to Vegas the next da"
text

"The wine reception is a great idea as it is nice to meet other travellers and great having access to the free Internet access in our room .\n0 They also have a computer available with free internet which is a nice bonus but I didn't find that out till the day before we left but was still able to get on there to check our flight to Vegas the next da"

In [209]:
text.lower().translate(remove_punct_dict)

'the wine reception is a great idea as it is nice to meet other travellers and great having access to the free internet access in our room \n0 they also have a computer available with free internet which is a nice bonus but i didnt find that out till the day before we left but was still able to get on there to check our flight to vegas the next da'

In [210]:
print(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

['the', 'wine', 'reception', 'is', 'a', 'great', 'idea', 'as', 'it', 'is', 'nice', 'to', 'meet', 'other', 'travellers', 'and', 'great', 'having', 'access', 'to', 'the', 'free', 'internet', 'access', 'in', 'our', 'room', '0', 'they', 'also', 'have', 'a', 'computer', 'available', 'with', 'free', 'internet', 'which', 'is', 'a', 'nice', 'bonus', 'but', 'i', 'didnt', 'find', 'that', 'out', 'till', 'the', 'day', 'before', 'we', 'left', 'but', 'was', 'still', 'able', 'to', 'get', 'on', 'there', 'to', 'check', 'our', 'flight', 'to', 'vegas', 'the', 'next', 'da']


In [211]:
print(LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict))))

['the', 'wine', 'reception', 'is', 'a', 'great', 'idea', 'a', 'it', 'is', 'nice', 'to', 'meet', 'other', 'traveller', 'and', 'great', 'having', 'access', 'to', 'the', 'free', 'internet', 'access', 'in', 'our', 'room', '0', 'they', 'also', 'have', 'a', 'computer', 'available', 'with', 'free', 'internet', 'which', 'is', 'a', 'nice', 'bonus', 'but', 'i', 'didnt', 'find', 'that', 'out', 'till', 'the', 'day', 'before', 'we', 'left', 'but', 'wa', 'still', 'able', 'to', 'get', 'on', 'there', 'to', 'check', 'our', 'flight', 'to', 'vega', 'the', 'next', 'da']


#### TF-IDF 벡터화

In [244]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(tokenizer = LemNormalize, 
                             stop_words = 'english',
                             ngram_range = (1,2),
                             min_df = 0.05,
                             max_df = 0.85)


feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])
feature_vect

<51x4611 sparse matrix of type '<class 'numpy.float64'>'
	with 30124 stored elements in Compressed Sparse Row format>

#### 군집화

In [245]:
from sklearn.cluster import KMeans

km_cluster = KMeans(n_clusters = 5, max_iter = 10000, random_state = 0)
km_cluster.fit(feature_vect)

# 군집화된 전체 레이블 값
cluster_label = km_cluster.labels_
# 군집의 중심점
cluster_centers = km_cluster.cluster_centers_

In [246]:
document_df['cluster_label'] = cluster_label
document_df

Unnamed: 0,filename,opinion_text,cluster_label
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi...",2
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ...",0
2,battery-life_amazon_kindle,"After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !\n0 After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !\n1 ...",1
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...,1
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now...",1
5,buttons_amazon_kindle,"I thought it would be fitting to christen my Kindle with the Stephen King novella UR, so went to the Amazon site on my computer and clicked on the button to buy it .\n0 As soon as I'd clicked the button to confirm my order it appeared on my Kindle almost immediately !\n1 ...",2
6,comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",4
7,comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 Great gas mileage and comfortable on long trips ...",4
8,directions_garmin_nuvi_255W_gps,You also get upscale features like spoken directions including street names and programmable POIs .\n0 I used to hesitate to go out of my directions but no...,2
9,display_garmin_nuvi_255W_gps,"3 quot widescreen display was a bonus .\n0 This made for smoother graphics on the 255w of the vehicle moving along displayed roads, where the 750's display was more of a jerky movement .\n1 ...",2


In [247]:
### 확인용

In [248]:
km_cluster.labels_

array([2, 0, 1, 1, 1, 2, 4, 4, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 3, 3,
       4, 2, 3, 4, 1, 3, 3, 4, 0, 0, 0, 2, 2, 2, 2, 4, 3, 3, 3, 1, 1, 2,
       1, 3, 3, 4, 2, 2, 2])

In [249]:
km_cluster.cluster_centers_

array([[0.        , 0.        , 0.00474778, ..., 0.        , 0.00159451,
        0.00067345],
       [0.01502839, 0.        , 0.        , ..., 0.00105331, 0.        ,
        0.        ],
       [0.00819396, 0.        , 0.        , ..., 0.01050909, 0.        ,
        0.        ],
       [0.        , 0.0012246 , 0.00068853, ..., 0.        , 0.00176658,
        0.00157225],
       [0.        , 0.00092551, 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [250]:
for i in  km_cluster.cluster_centers_:
    print(len(i))

4611
4611
4611
4611
4611


#### 군집화 레이블별 내용 확인

In [251]:
document_df['cluster_label'].value_counts()

2    16
3    13
4    10
1     8
0     4
Name: cluster_label, dtype: int64

##### 0번: 호텔

In [252]:
document_df[document_df['cluster_label'] == 0]

Unnamed: 0,filename,opinion_text,cluster_label
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ...",0
30,rooms_bestwestern_hotel_sfo,"Great Location , Nice Rooms , H...",0
31,rooms_swissotel_chicago,"The Swissotel is one of our favorite hotels in Chicago and the corner rooms have the most fantastic views in the city .\n0 The rooms look like they were just remodled and upgraded, there was an HD TV and a nice iHome docking station to put my iPod so I could set the alarm to wake up with my music instead of the radio .\n1 ...",0
32,room_holiday_inn_london,"We arrived at 23,30 hours and they could not recommend a restaurant so we decided to go to Tesco, with very limited choices but when you are hingry you do not careNext day they rang the bell at 8,00 hours to clean the room, not being very nice being waken up so earlyEvery day they gave u...",0


##### 1번: 킨들, 컴퓨터 등이 섞임

In [253]:
document_df[document_df['cluster_label'] == 1]

Unnamed: 0,filename,opinion_text,cluster_label
2,battery-life_amazon_kindle,"After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !\n0 After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !\n1 ...",1
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...,1
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now...",1
19,keyboard_netbook_1005ha,", I think the new keyboard rivals the great hp mini keyboards .\n0 Since the battery life difference is minimum, the only reason to upgrade would be to get the better keyboard .\n1 The keyboard is now as good as t...",1
26,performance_netbook_1005ha,"The Eee Super Hybrid Engine utility lets users overclock or underclock their Eee PC's to boost performance or provide better battery life depending on their immediate requirements .\n0 In Super Performance mode CPU, Z shows the bus speed to increase up to 169 .\n1 One...",1
41,size_asus_netbook_1005ha,"A few other things I'd like to point out is that you must push the micro, sized right angle end of the ac adapter until it snaps in place or the battery may not charge .\n0 The full size right shift k...",1
42,sound_ipod_nano_8gb,headphone jack i got a clear case for it and it i got a clear case for it and it like prvents me from being able to put the jack all the way in so the sound can b messsed up or i can get it in there and its playing well them go to move or something and it slides out .\n0 Picture and sound quality are excellent for this typ of devic .\n1 ...,1
44,speed_windows7,"Windows 7 is quite simply faster, more stable, boots faster, goes to sleep faster, comes back from sleep faster, manages your files better and on top of that it's beautiful to look at and easy to use .\n0 , faster about 20% to 30% faster at running applications than my Vista , seriously\n1 ...",1


##### 2번: 킨들, 컴퓨터, gps 등이 섞임

In [254]:
document_df[document_df['cluster_label'] == 2]

Unnamed: 0,filename,opinion_text,cluster_label
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi...",2
5,buttons_amazon_kindle,"I thought it would be fitting to christen my Kindle with the Stephen King novella UR, so went to the Amazon site on my computer and clicked on the button to buy it .\n0 As soon as I'd clicked the button to confirm my order it appeared on my Kindle almost immediately !\n1 ...",2
8,directions_garmin_nuvi_255W_gps,You also get upscale features like spoken directions including street names and programmable POIs .\n0 I used to hesitate to go out of my directions but no...,2
9,display_garmin_nuvi_255W_gps,"3 quot widescreen display was a bonus .\n0 This made for smoother graphics on the 255w of the vehicle moving along displayed roads, where the 750's display was more of a jerky movement .\n1 ...",2
10,eyesight-issues_amazon_kindle,"It feels as easy to read as the K1 but doesn't seem any crisper to my eyes .\n0 the white is really GREY, and to avoid considerable eye, strain I had to refresh pages every other page .\n1 The dream has always been a portable electronic device that could hold a ton of reading material, automate subscriptions and fa...",2
11,features_windows7,"I had to uninstall anti, virus and selected other programs, some of which did not have listings in the Programs and Features Control Panel section .\n0 This review briefly touches upon some of the key features and enhancements of Microsoft's latest OS .\n1 ...",2
12,fonts_amazon_kindle,"Being able to change the font sizes is awesome !\n0 For whatever reason, Amazon decided to make the Font on the Home Screen ...",2
23,navigation_amazon_kindle,"In fact, the entire navigation structure has been completely revised , I'm still getting used to it but it's a huge step forward .\n0 ...",2
33,satellite_garmin_nuvi_255W_gps,"It's fast to acquire satellites .\n0 If you've ever had a Brand X GPS take you on some strange route that adds 20 minutes to your trip, has you turn the wrong way down a one way road, tell you to turn AFTER you've passed the street, frequently loses the satellite signal, or has old maps missing streets, you know how important this stuff is .\n1 ...",2
34,screen_garmin_nuvi_255W_gps,It is easy to read and when touching the screen it works great !\n0 and zoom out buttons on the 255w to the same side of the screen which makes it a bit easier .\n1 ...,2


##### 3번: 호텔, 킨들이 섞임

In [255]:
document_df[document_df['cluster_label'] == 3]

Unnamed: 0,filename,opinion_text,cluster_label
13,food_holiday_inn_london,The room was packed to capacity with queues at the food buffets .\n0 The over zealous staff cleared our unfinished drinks while we were collecting cooked food and movement around the room with plates was difficult in the crowded circumstances .\n1 ...,3
14,food_swissotel_chicago,The food for our event was delicious .\n0 ...,3
15,free_bestwestern_hotel_sfo,The wine reception is a great idea as it is nice to meet other travellers and great having access to the free Internet access in our room .\n0 They also have a computer available with free internet which is a nice bonus but I didn't find that out till the day before we left but was still able to get on there to check our flight to Vegas the next day .\n1 ...,3
20,location_bestwestern_hotel_sfo,"Good Value good location , ideal choice .\n0 Great Location , Nice Rooms , Helpless Concierge\n1 ...",3
21,location_holiday_inn_london,"Great location for tube and we crammed in a fair amount of sightseeing in a short time .\n0 All in all, a normal chain hotel on a nice lo...",3
24,parking_bestwestern_hotel_sfo,Parking was expensive but I think this is common for San Fran .\n0 there is a fee for parking but well worth it seeing no where to park if you do have a car .\n1 ...,3
27,price_amazon_kindle,"If a case was included, as with the Kindle 1, that would have been reflected in a higher price .\n0 lower overall price, with nice leather cover .\n1 ...",3
28,price_holiday_inn_london,"All in all, a normal chain hotel on a nice location , I will be back if I do not find anthing closer to Picadilly for a better price .\n0 ...",3
38,service_bestwestern_hotel_sfo,"Both of us having worked in tourism for over 14 years were very disappointed at the level of service provided by this gentleman .\n0 The service was good, very friendly staff and we loved the free wine reception each night .\n1 ...",3
39,service_holiday_inn_london,"not customer, oriented hotelvery low service levelboor reception\n0 The room was quiet, clean, the bed and pillows were comfortable, and the serv...",3


In [256]:
document_df.loc[15,'opinion_text']

"                                                                                                                                                              The wine reception is a great idea as it is nice to meet other travellers and great having access to the free Internet access in our room .\n0                                                                                      They also have a computer available with free internet which is a nice bonus but I didn't find that out till the day before we left but was still able to get on there to check our flight to Vegas the next day .\n1                                                                                                                                                                                                             The service was good, very friendly staff and we loved the free wine reception each night .\n2                                                                                                     

##### 4번: 자동차 관련 내용

In [257]:
document_df[document_df['cluster_label'] == 4]

Unnamed: 0,filename,opinion_text,cluster_label
6,comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",4
7,comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 Great gas mileage and comfortable on long trips ...",4
16,gas_mileage_toyota_camry_2007,Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 ...,4
17,interior_honda_accord_2008,I love the new body style and the interior is a simple pleasure except for the center dash .\n0 ...,4
18,interior_toyota_camry_2007,"First of all, the interior has way too many cheap plastic parts like the cheap plastic center piece that houses the clock .\n0 3 blown struts at 30,000 miles, interior trim coming loose and rattling squeaking, stains on paint, and bug splats taking paint off, premature uneven brake wear, on 3rd windsh...",4
22,mileage_honda_accord_2008,"It's quiet, get good gas mileage and looks clean inside and out .\n0 The mileage is great, and I've had to get used to stopping less for gas .\n1 Thought gas ...",4
25,performance_honda_accord_2008,"Very happy with my 08 Accord, performance is quite adequate it has nice looks and is a great long, distance cruiser .\n0 6, 4, 3 eco engine has poor performance and gas mileage of 22 highway .\n1 Overall performance is good but comfort level is poor .\n2 ...",4
29,quality_toyota_camry_2007,I previously owned a Toyota 4Runner which had incredible build quality and reliability .\n0 I bought the Camry because of Toyota reliability and qua...,4
37,seats_honda_accord_2008,"Front seats are very uncomfortable .\n0 No memory seats, no trip computer, can only display outside temp with trip odometer .\n1 ...",4
47,transmission_toyota_camry_2007,"After slowing down, transmission has to be kicked to speed up .\n0 ...",4


- 0번: 호텔
- 1번: 킨들, 컴퓨터 등이 섞임
- 2번: 킨들, 컴퓨터 등이 섞임
- 3번: 호텔, 킨들 등이 섞임
- 4번: 자동차

#### 재군집화

In [258]:
from sklearn.cluster import KMeans

km_cluster2 = KMeans(n_clusters = 3, max_iter = 10000, random_state = 0)
km_cluster2.fit(feature_vect)

cluster_labels2 = km_cluster2.labels_
cluster_centers2 = km_cluster2.cluster_centers_ 

In [259]:
document_df['cluster_label2'] = cluster_labels2
document_df

Unnamed: 0,filename,opinion_text,cluster_label,cluster_label2
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi...",2,0
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ...",0,2
2,battery-life_amazon_kindle,"After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !\n0 After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !\n1 ...",1,0
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...,1,0
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now...",1,0
5,buttons_amazon_kindle,"I thought it would be fitting to christen my Kindle with the Stephen King novella UR, so went to the Amazon site on my computer and clicked on the button to buy it .\n0 As soon as I'd clicked the button to confirm my order it appeared on my Kindle almost immediately !\n1 ...",2,0
6,comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",4,1
7,comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 Great gas mileage and comfortable on long trips ...",4,1
8,directions_garmin_nuvi_255W_gps,You also get upscale features like spoken directions including street names and programmable POIs .\n0 I used to hesitate to go out of my directions but no...,2,0
9,display_garmin_nuvi_255W_gps,"3 quot widescreen display was a bonus .\n0 This made for smoother graphics on the 255w of the vehicle moving along displayed roads, where the 750's display was more of a jerky movement .\n1 ...",2,0


#### 재군집화 레이블별 내용 확인

In [260]:
document_df['cluster_label2'].value_counts()

0    25
2    16
1    10
Name: cluster_label2, dtype: int64

##### 0번 군집: 전자기기(킨들, gps, 컴퓨터)가 묶임

In [261]:
document_df[document_df['cluster_label2'] == 0]

Unnamed: 0,filename,opinion_text,cluster_label,cluster_label2
0,accuracy_garmin_nuvi_255W_gps,", and is very, very accurate .\n0 but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n1 This functi...",2,0
2,battery-life_amazon_kindle,"After I plugged it in to my USB hub on my computer to charge the battery the charging cord design is very clever !\n0 After you have paged tru a 500, page book one, page, at, a, time to get from Chapter 2 to Chapter 15, see how excited you are about a low battery and all the time it took to get there !\n1 ...",1,0
3,battery-life_ipod_nano_8gb,short battery life I moved up from an 8gb .\n0 I love this ipod except for the battery life .\n1 ...,1,0
4,battery-life_netbook_1005ha,"6GHz 533FSB cpu, glossy display, 3, Cell 23Wh Li, ion Battery , and a 1 .\n0 Not to mention that as of now...",1,0
5,buttons_amazon_kindle,"I thought it would be fitting to christen my Kindle with the Stephen King novella UR, so went to the Amazon site on my computer and clicked on the button to buy it .\n0 As soon as I'd clicked the button to confirm my order it appeared on my Kindle almost immediately !\n1 ...",2,0
8,directions_garmin_nuvi_255W_gps,You also get upscale features like spoken directions including street names and programmable POIs .\n0 I used to hesitate to go out of my directions but no...,2,0
9,display_garmin_nuvi_255W_gps,"3 quot widescreen display was a bonus .\n0 This made for smoother graphics on the 255w of the vehicle moving along displayed roads, where the 750's display was more of a jerky movement .\n1 ...",2,0
10,eyesight-issues_amazon_kindle,"It feels as easy to read as the K1 but doesn't seem any crisper to my eyes .\n0 the white is really GREY, and to avoid considerable eye, strain I had to refresh pages every other page .\n1 The dream has always been a portable electronic device that could hold a ton of reading material, automate subscriptions and fa...",2,0
11,features_windows7,"I had to uninstall anti, virus and selected other programs, some of which did not have listings in the Programs and Features Control Panel section .\n0 This review briefly touches upon some of the key features and enhancements of Microsoft's latest OS .\n1 ...",2,0
12,fonts_amazon_kindle,"Being able to change the font sizes is awesome !\n0 For whatever reason, Amazon decided to make the Font on the Home Screen ...",2,0


##### 1번 군집: 자동차가 묶임

In [262]:
document_df[document_df['cluster_label2'] == 1]

Unnamed: 0,filename,opinion_text,cluster_label,cluster_label2
6,comfort_honda_accord_2008,"Drivers seat not comfortable, the car itself compared to other models of similar class .\n0 ...",4,1
7,comfort_toyota_camry_2007,"Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 Seats are fine, in fact of all the smaller sedans this is the most comfortable I found for the price as I am 6', 2 and 250# .\n1 Great gas mileage and comfortable on long trips ...",4,1
16,gas_mileage_toyota_camry_2007,Ride seems comfortable and gas mileage fairly good averaging 26 city and 30 open road .\n0 ...,4,1
17,interior_honda_accord_2008,I love the new body style and the interior is a simple pleasure except for the center dash .\n0 ...,4,1
18,interior_toyota_camry_2007,"First of all, the interior has way too many cheap plastic parts like the cheap plastic center piece that houses the clock .\n0 3 blown struts at 30,000 miles, interior trim coming loose and rattling squeaking, stains on paint, and bug splats taking paint off, premature uneven brake wear, on 3rd windsh...",4,1
22,mileage_honda_accord_2008,"It's quiet, get good gas mileage and looks clean inside and out .\n0 The mileage is great, and I've had to get used to stopping less for gas .\n1 Thought gas ...",4,1
25,performance_honda_accord_2008,"Very happy with my 08 Accord, performance is quite adequate it has nice looks and is a great long, distance cruiser .\n0 6, 4, 3 eco engine has poor performance and gas mileage of 22 highway .\n1 Overall performance is good but comfort level is poor .\n2 ...",4,1
29,quality_toyota_camry_2007,I previously owned a Toyota 4Runner which had incredible build quality and reliability .\n0 I bought the Camry because of Toyota reliability and qua...,4,1
37,seats_honda_accord_2008,"Front seats are very uncomfortable .\n0 No memory seats, no trip computer, can only display outside temp with trip odometer .\n1 ...",4,1
47,transmission_toyota_camry_2007,"After slowing down, transmission has to be kicked to speed up .\n0 ...",4,1


##### 2번 군집: 호텔이 묶임

In [263]:
document_df[document_df['cluster_label2'] == 2]

Unnamed: 0,filename,opinion_text,cluster_label,cluster_label2
1,bathroom_bestwestern_hotel_sfo,"The room was not overly big, but clean and very comfortable beds, a great shower and very clean bathrooms .\n0 The second room was smaller, with a very inconvenient bathroom layout, but at least it was quieter and we were able to sleep .\n1 ...",0,2
13,food_holiday_inn_london,The room was packed to capacity with queues at the food buffets .\n0 The over zealous staff cleared our unfinished drinks while we were collecting cooked food and movement around the room with plates was difficult in the crowded circumstances .\n1 ...,3,2
14,food_swissotel_chicago,The food for our event was delicious .\n0 ...,3,2
15,free_bestwestern_hotel_sfo,The wine reception is a great idea as it is nice to meet other travellers and great having access to the free Internet access in our room .\n0 They also have a computer available with free internet which is a nice bonus but I didn't find that out till the day before we left but was still able to get on there to check our flight to Vegas the next day .\n1 ...,3,2
20,location_bestwestern_hotel_sfo,"Good Value good location , ideal choice .\n0 Great Location , Nice Rooms , Helpless Concierge\n1 ...",3,2
21,location_holiday_inn_london,"Great location for tube and we crammed in a fair amount of sightseeing in a short time .\n0 All in all, a normal chain hotel on a nice lo...",3,2
24,parking_bestwestern_hotel_sfo,Parking was expensive but I think this is common for San Fran .\n0 there is a fee for parking but well worth it seeing no where to park if you do have a car .\n1 ...,3,2
28,price_holiday_inn_london,"All in all, a normal chain hotel on a nice location , I will be back if I do not find anthing closer to Picadilly for a better price .\n0 ...",3,2
30,rooms_bestwestern_hotel_sfo,"Great Location , Nice Rooms , H...",0,2
31,rooms_swissotel_chicago,"The Swissotel is one of our favorite hotels in Chicago and the corner rooms have the most fantastic views in the city .\n0 The rooms look like they were just remodled and upgraded, there was an HD TV and a nice iHome docking station to put my iPod so I could set the alarm to wake up with my music instead of the radio .\n1 ...",0,2


#### 비교

In [264]:
pd.DataFrame(document_df[['cluster_label2','cluster_label']].value_counts()).sort_values('cluster_label2')

Unnamed: 0_level_0,Unnamed: 1_level_0,0
cluster_label2,cluster_label,Unnamed: 2_level_1
0,2,16
0,1,8
0,3,1
1,4,10
2,3,12
2,0,4


- 군집 n_cluster = 3
```
0번: 전자기기
1번: 자동차
2번: 호텔
```

- 군집 n_cluster = 5
```
0번: 호텔
1번: 킨들, 컴퓨터 등이 섞임
2번: 킨들, 컴퓨터 등이 섞임
3번: 호텔, 킨들 등이 섞임
4번: 자동차
```


### 군집별 핵심단어 추출

In [265]:
cluster_centers = km_cluster.cluster_centers_
print(f'군집 중심점 행렬 모양: {cluster_centers.shape}')
print(f'군집 중심점 행렬: \n{cluster_centers}')

군집 중심점 행렬 모양: (5, 4611)
군집 중심점 행렬: 
[[0.         0.         0.00474778 ... 0.         0.00159451 0.00067345]
 [0.01502839 0.         0.         ... 0.00105331 0.         0.        ]
 [0.00819396 0.         0.         ... 0.01050909 0.         0.        ]
 [0.         0.0012246  0.00068853 ... 0.         0.00176658 0.00157225]
 [0.         0.00092551 0.         ... 0.         0.         0.        ]]


In [266]:
def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num,
                       top_n_features = 10):
    cluster_detail = {}
    
    # cluster_centers: array의 값이 큰 순으로 정렬된 인덱스 값을 반환
    ## 군집 중심점(centroid)별 할당된 word 피처들의 거리값이 큰 순으로 값을 구하기 위함
    ## argsort(): 오름차순 값으로 인덱스 반환 ex) 1,3,2 -> 0,2,1
    ## argsort()[::-1]: 내림차순 값으로 인덱스 반환  ex) 1,3,2 -> 1,2,0
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:,::-1]
    
    # 개별 군집별 반복하면서 핵심 단어, 그 단어의 중심 위치 상댓값, 대상 파일명 입력
    for cluster_num in range(clusters_num):
        ## 개별 군집별 정보 담을 빈 데이터 생성
        cluster_detail[cluster_num] = {}
        cluster_detail[cluster_num]['cluster'] = cluster_num
        ### cluster_detail = {0: {'cluster': 0}}
        
        ## 인덱스 내림차순 정렬된 객체로 top n 피처 단어 구하기
        top_feature_indexs = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [feature_names[idx] for idx in top_feature_indexs]      
        
        ##  해당 피처 단어의 중심 위치 상댓값 구하기
        top_feature_values = cluster_model.cluster_centers_[cluster_num, 
                                                            top_feature_indexs].tolist()
        
        ## 핵심 단어, 중심 위치 상댓값, 해당 파일명 입력하기
        cluster_detail[cluster_num]['top_features'] = top_features
        cluster_detail[cluster_num]['top_features_values'] = top_feature_values
        
        ## 해당 군집에 해당되는 파일명 리스트화 후 딕셔너리에 추가
        filenames = cluster_data[cluster_data['cluster_label'] == cluster_num]['filename']
        filenames = filenames.values.tolist()
        cluster_detail[cluster_num]['filenames'] = filenames
    
    return cluster_detail

In [267]:
def print_cluster_details(cluster_details):
    for cluster_num, cluster_detail in cluster_details.items():
        print(f'******* 군집 {cluster_num}',"*"*110,'\n')
        print(f'핵심단어: {cluster_detail["top_features"]}\n')
        print(f'리뷰된 파일명: {cluster_detail["filenames"]}\n')
        print('*'*125)

In [268]:
feature_names = tfidf_vect.get_feature_names()

cluster_details = get_cluster_details(cluster_model = km_cluster2,
                                      cluster_data = document_df,
                                      feature_names = feature_names,
                                      clusters_num = 3,
                                      top_n_features = 10)

print_cluster_details(cluster_details)

******* 군집 0 ************************************************************************************************************** 

핵심단어: ['screen', 'battery', 'keyboard', 'battery life', 'life', 'kindle', 'direction', 'video', 'size', 'voice']

리뷰된 파일명: ['bathroom_bestwestern_hotel_sfo', 'rooms_bestwestern_hotel_sfo', 'rooms_swissotel_chicago', 'room_holiday_inn_london']

*****************************************************************************************************************************
******* 군집 1 ************************************************************************************************************** 

핵심단어: ['interior', 'seat', 'mileage', 'comfortable', 'gas', 'gas mileage', 'transmission', 'car', 'performance', 'quality']

리뷰된 파일명: ['battery-life_amazon_kindle', 'battery-life_ipod_nano_8gb', 'battery-life_netbook_1005ha', 'keyboard_netbook_1005ha', 'performance_netbook_1005ha', 'size_asus_netbook_1005ha', 'sound_ipod_nano_8gb', 'speed_windows7']

*************************

```
0번: 전자기기
1번: 자동차
2번: 호텔
```