Reference web page

https://lovit.github.io/nlp/machine%20learning/2018/10/16/spherical_kmeans/#more

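### Document Clustering and K-means

- k-means is fast and cheap in memory, which makes it well suited to clustering large document collections.
- scikit-learn's k-means uses Euclidean distance.
- However, documents are high-dimensional vectors, so the choice of distance measure between documents matters a great deal.
- For high-dimensional data represented as sparse vectors, such as bag-of-words, cosine distance is the better choice.
- k-means that uses cosine distance is called spherical k-means.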

### Types of Distance Functions for Clustering

#### 1. Euclidean Distance
![image](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbvOFDi%2FbtqEENhBkh4%2FkkBQiVKdTd0hyMXPpJkSGK%2Fimg.png)

However, this Euclidean distance formula is not very effective for NLP (natural language processing) problems. NLP features are typically word frequencies. Suppose the word 'economic' appears 10 times in document A and 100 times in document B. Whether the word appears 10, 100, or 1,000 times, both A and B are clearly about the economy. Euclidean distance, however, responds to the raw difference in frequency, so it places A and B far apart even though they share a topic. Cosine similarity was introduced to overcome this.
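
The frequency problem described above can be sketched with a few made-up count vectors. The vocabulary and counts below are hypothetical and chosen only for illustration; they are not taken from the Opinosis data used later.

```python
import numpy as np

# Hypothetical term counts over the vocabulary ["economic", "market", "soccer"].
doc_a = np.array([10.0, 2.0, 0.0])    # short document about the economy
doc_b = np.array([100.0, 20.0, 0.0])  # long document about the economy (10x the counts)
doc_c = np.array([0.0, 2.0, 10.0])    # short document about soccer

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))

dist_ab = euclidean_distance(doc_a, doc_b)  # same topic, different length
dist_ac = euclidean_distance(doc_a, doc_c)  # different topic, similar length

print(f"A vs B (same topic):      {dist_ab:.2f}")
print(f"A vs C (different topic): {dist_ac:.2f}")
```

In this toy setup, A ends up closer to the soccer document C than to the economy document B, purely because B is longer.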

#### 2. Cosine Similarity

![image](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FXZAqd%2FbtqEF4PZ1JU%2FzmeLrGkleeHHkX2KGsagRk%2Fimg.png)

Cosine similarity measures the angle between two vectors. It ignores the magnitude of the vectors and compares only their direction, which resolves the problem that Euclidean distance (item 1 above) runs into in NLP. Its formula is the one shown in the image above.

However, cosine similarity has a drawback: it does not measure similarity well for data whose features are correlated with one another.

#### 3. Cosine Distance

Closely related to cosine similarity is cosine distance, and the relationship is simple: cosine distance is 1 minus the cosine similarity.
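
A minimal sketch of both quantities, using hypothetical count vectors (the vocabulary and numbers are assumptions for illustration only):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    """Cosine distance is defined as 1 minus cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical term counts over the vocabulary ["economic", "market", "soccer"].
doc_a = np.array([10.0, 2.0, 0.0])    # short document about the economy
doc_b = np.array([100.0, 20.0, 0.0])  # same direction as doc_a, 10x the magnitude
doc_c = np.array([0.0, 2.0, 10.0])    # document about soccer

print(f"A vs B: {cosine_distance(doc_a, doc_b):.4f}")
print(f"A vs C: {cosine_distance(doc_a, doc_c):.4f}")
```

Because A and B point in the same direction, their cosine distance is essentially zero regardless of document length, while A and C are nearly orthogonal.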

![image](https://t1.daumcdn.net/cfile/tistory/25588F4A5934B49432)

![image](https://jiho-ml.com/content/images/2020/01/happy.png)

In the result above, the words closest to happy (Nearest points in the original space) are quiet, funny, you, love, remember, and other similar words.

In this way, cosine distance is used to measure how similar vectors are. It applies not only to word2vec but also to document embeddings, item2vec, and many other models and datasets.

Using cosine distance effectively turns every vector into a unit vector, so k-means clustering can be read as grouping documents with similar angles (similar word distributions) into one cluster, as in the figure below. For a document collection represented as sparse vectors, k-means should therefore use cosine distance, and k-means with cosine distance is called spherical k-means (Inderjit et al., 2001).

![image](https://lovit.github.io/assets/figures/spherical_kmeans_angle.png)

- unit vector: a vector whose magnitude (norm) is 1.
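
The idea of clustering by angle can be sketched in a few lines of NumPy. This is a toy version written under simplifying assumptions (dense input, random initial centroids, hypothetical function names); it is not how `soyclustering` or scikit-learn implement it.

```python
import numpy as np

def l2_normalize(X):
    """Scale every row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

def spherical_kmeans(X, n_clusters, n_iter=20, seed=0):
    """Toy spherical k-means on a dense array (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = l2_normalize(np.asarray(X, dtype=float))
    # Start from randomly chosen rows as centroids.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # On unit vectors, the dot product equals cosine similarity.
        similarity = X @ centroids.T
        new_labels = similarity.argmax(axis=1)
        for k in range(n_clusters):
            members = X[new_labels == k]
            if len(members) > 0:
                # Project the summed direction back onto the unit sphere.
                centroids[k] = l2_normalize(members.sum(axis=0, keepdims=True))[0]
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids

# Made-up 2-D data: rows differ greatly in length but follow two directions.
X = np.array([
    [10.0, 1.0], [100.0, 12.0], [3.0, 0.2],   # mostly along the first axis
    [1.0, 10.0], [9.0, 100.0], [0.3, 2.5],    # mostly along the second axis
])
labels, centroids = spherical_kmeans(X, n_clusters=2)
print(labels)
```

The only differences from ordinary k-means are the row normalization, the assignment by largest dot product, and the re-normalization of each centroid after the update.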

### Document Clustering with the Opinion Review Dataset

In [45]:
import pandas as pd
import glob
import os

path = r"C:\Users\HYUNJUN\anaconda3\envs\likelion\my_files\data\OpinosisDataset1.0\topics"

# Collect the paths of all .data files under the directory given by path
all_files = glob.glob(os.path.join(path, "*.data"))
filename_list = []
opinion_text = []

# Collect each file's name in filename_list, and each file's contents
# (loaded as a DataFrame, then converted back to a string) in opinion_text
for file_ in all_files:
    df = pd.read_table(file_, index_col=None, header=0, encoding="latin1")
    # os.path.basename uses the platform's path separator, unlike split("\\")
    filename_ = os.path.basename(file_)
    filename = filename_.split(".")[0]
    
    filename_list.append(filename)
    opinion_text.append(df.to_string())

document_df = pd.DataFrame({"filename":filename_list, "opinion_text":opinion_text})
document_df.head()

|  | filename | opinion_text |
|---|---|---|
| 0 | accuracy_garmin_nuvi_255W_gps | ... |
| 1 | bathroom_bestwestern_hotel_sfo | ... |
| 2 | battery-life_amazon_kindle | ... |
| 3 | battery-life_ipod_nano_8gb | ... |
| 4 | battery-life_netbook_1005ha | ... |


In [46]:
from nltk.stem import WordNetLemmatizer
import nltk
import string

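# Map every punctuation character to None so that str.translate() removes it
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    # Reduce each token to its lemma (dictionary form)
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))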

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(tokenizer=LemNormalize, stop_words="english", ngram_range=(1, 2),
                             min_df=0.05, max_df=0.85)

# Vectorize the opinion_text column into TF-IDF features
feature_vect = tfidf_vect.fit_transform(document_df["opinion_text"])



In [121]:
feature_vect.shape

(51, 4611)

In [48]:
from sklearn.cluster import KMeans

# Cluster the documents into 5 groups
km_cluster = KMeans(n_clusters=5, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)

cluster_label = km_cluster.labels_
cluster_centers = km_cluster.cluster_centers_

In [49]:
document_df["cluster_label"] = cluster_label
document_df.head()

|  | filename | opinion_text | cluster_label |
|---|---|---|---|
| 0 | accuracy_garmin_nuvi_255W_gps | ... | 2 |
| 1 | bathroom_bestwestern_hotel_sfo | ... | 0 |
| 2 | battery-life_amazon_kindle | ... | 1 |
| 3 | battery-life_ipod_nano_8gb | ... | 1 |
| 4 | battery-life_netbook_1005ha | ... | 1 |


In [50]:
from sklearn.metrics import silhouette_samples, silhouette_score

score_samples = silhouette_samples(feature_vect, document_df["cluster_label"])
print("silhouette_samples shape:", score_samples.shape)

document_df["silhouette_coeff"] = score_samples
average_score = silhouette_score(feature_vect, document_df["cluster_label"])
print("Silhouette Analysis Score:{0:.3f}".format(average_score))

silhouette_samples shape: (51,)
Silhouette Analysis Score:0.082


In [51]:
document_df[document_df["cluster_label"]==0].sort_values(by="filename")

|  | filename | opinion_text | cluster_label | silhouette_coeff |
|---|---|---|---|---|
| 1 | bathroom_bestwestern_hotel_sfo | ... | 0 | 0.199887 |
| 32 | room_holiday_inn_london | ... | 0 | 0.398365 |
| 30 | rooms_bestwestern_hotel_sfo | ... | 0 | 0.398066 |
| 31 | rooms_swissotel_chicago | ... | 0 | 0.371413 |


In [52]:
document_df[document_df["cluster_label"]==1].sort_values(by="filename")

|  | filename | opinion_text | cluster_label | silhouette_coeff |
|---|---|---|---|---|
| 2 | battery-life_amazon_kindle | ... | 1 | 0.204652 |
| 3 | battery-life_ipod_nano_8gb | ... | 1 | 0.214753 |
| 4 | battery-life_netbook_1005ha | ... | 1 | 0.243608 |
| 19 | keyboard_netbook_1005ha | ... | 1 | 0.056554 |
| 26 | performance_netbook_1005ha | ... | 1 | 0.063305 |
| 41 | size_asus_netbook_1005ha | ... | 1 | 0.072541 |
| 42 | sound_ipod_nano_8gb | headphone jack i got a clear case for it a... | 1 | 0.008014 |
| 44 | speed_windows7 | ... | 1 | 0.002584 |


In [53]:
document_df[document_df["cluster_label"]==2].sort_values(by="filename")

|  | filename | opinion_text | cluster_label | silhouette_coeff |
|---|---|---|---|---|
| 0 | accuracy_garmin_nuvi_255W_gps | ... | 2 | 0.03928 |
| 5 | buttons_amazon_kindle | ... | 2 | 0.017695 |
| 8 | directions_garmin_nuvi_255W_gps | ... | 2 | 0.050867 |
| 9 | display_garmin_nuvi_255W_gps | ... | 2 | 0.06189 |
| 10 | eyesight-issues_amazon_kindle | ... | 2 | 0.034234 |
| 11 | features_windows7 | ... | 2 | -0.014768 |
| 12 | fonts_amazon_kindle | ... | 2 | 0.021411 |
| 23 | navigation_amazon_kindle | ... | 2 | 0.02613 |
| 33 | satellite_garmin_nuvi_255W_gps | ... | 2 | 0.010174 |
| 34 | screen_garmin_nuvi_255W_gps | ... | 2 | 0.099431 |


In [54]:
document_df[document_df["cluster_label"]==3].sort_values(by="filename")

|  | filename | opinion_text | cluster_label | silhouette_coeff |
|---|---|---|---|---|
| 13 | food_holiday_inn_london | ... | 3 | 0.072858 |
| 14 | food_swissotel_chicago | ... | 3 | 0.062823 |
| 15 | free_bestwestern_hotel_sfo | ... | 3 | -0.003262 |
| 20 | location_bestwestern_hotel_sfo | ... | 3 | 0.025203 |
| 21 | location_holiday_inn_london | ... | 3 | 0.033107 |
| 24 | parking_bestwestern_hotel_sfo | ... | 3 | -0.001359 |
| 27 | price_amazon_kindle | ... | 3 | 0.003862 |
| 28 | price_holiday_inn_london | ... | 3 | -0.00149 |
| 38 | service_bestwestern_hotel_sfo | ... | 3 | 0.087802 |
| 39 | service_holiday_inn_london | ... | 3 | 0.032179 |


In [55]:
document_df[document_df["cluster_label"]==4].sort_values(by="filename")

|  | filename | opinion_text | cluster_label | silhouette_coeff |
|---|---|---|---|---|
| 6 | comfort_honda_accord_2008 | ... | 4 | 0.173206 |
| 7 | comfort_toyota_camry_2007 | ... | 4 | 0.161097 |
| 16 | gas_mileage_toyota_camry_2007 | ... | 4 | 0.14681 |
| 17 | interior_honda_accord_2008 | ... | 4 | 0.142784 |
| 18 | interior_toyota_camry_2007 | ... | 4 | 0.139788 |
| 22 | mileage_honda_accord_2008 | ... | 4 | 0.140784 |
| 25 | performance_honda_accord_2008 | ... | 4 | 0.031066 |
| 29 | quality_toyota_camry_2007 | ... | 4 | 0.058918 |
| 37 | seats_honda_accord_2008 | ... | 4 | 0.11082 |
| 47 | transmission_toyota_camry_2007 | ... | 4 | 0.024978 |


In [56]:
km_cluster = KMeans(n_clusters=3, max_iter=10000, random_state=0)
km_cluster.fit(feature_vect)
cluster_label = km_cluster.labels_

document_df["cluster_label"] = cluster_label
document_df.sort_values(by="cluster_label")

|  | filename | opinion_text | cluster_label | silhouette_coeff |
|---|---|---|---|---|
| 0 | accuracy_garmin_nuvi_255W_gps | ... | 0 | 0.03928 |
| 48 | updates_garmin_nuvi_255W_gps | ... | 0 | 0.032419 |
| 44 | speed_windows7 | ... | 0 | 0.002584 |
| 43 | speed_garmin_nuvi_255W_gps | ... | 0 | 0.044643 |
| 42 | sound_ipod_nano_8gb | headphone jack i got a clear case for it a... | 0 | 0.008014 |
| 41 | size_asus_netbook_1005ha | ... | 0 | 0.072541 |
| 36 | screen_netbook_1005ha | ... | 0 | -0.015675 |
| 35 | screen_ipod_nano_8gb | ... | 0 | 0.028021 |
| 34 | screen_garmin_nuvi_255W_gps | ... | 0 | 0.099431 |
| 33 | satellite_garmin_nuvi_255W_gps | ... | 0 | 0.010174 |


### Extracting Key Terms per Cluster

In [57]:
cluster_centers = km_cluster.cluster_centers_

print("cluster_centers shape:", cluster_centers.shape)
print(cluster_centers)

cluster_centers shape: (3, 4611)
[[0.01005322 0.         0.         ... 0.00706287 0.         0.        ]
 [0.         0.00092551 0.         ... 0.         0.         0.        ]
 [0.         0.00099499 0.00174637 ... 0.         0.00183397 0.00144581]]


In [58]:
# Return the top-n key terms per cluster, their relative centroid values, and the member filenames
def get_cluster_details(cluster_model, cluster_data, feature_names, clusters_num, top_n_features=10):
    cluster_details = {}
    
    # Indexes of cluster_centers_ sorted by value in descending order, per row,
    # so that each centroid's word features are ranked from largest weight to smallest
    centroid_feature_ordered_ind = cluster_model.cluster_centers_.argsort()[:, ::-1]
    
    # Iterate over the clusters and record key terms, their centroid values, and filenames
    for cluster_num in range(clusters_num):
        # Initialize the container for this cluster's information
        cluster_details[cluster_num] = {}
        cluster_details[cluster_num]["cluster"] = cluster_num
        
        # Use the indexes from cluster_centers_.argsort()[:, ::-1] to get the top-n feature terms
        top_feature_indexes = centroid_feature_ordered_ind[cluster_num, :top_n_features]
        top_features = [feature_names[ind] for ind in top_feature_indexes]
        
        # Use top_feature_indexes to look up the centroid value of each of those terms
        top_feature_values = cluster_model.cluster_centers_[cluster_num, top_feature_indexes].tolist()
        
        # Store the key terms, their centroid values, and the member filenames for this cluster
        cluster_details[cluster_num]["top_features"] = top_features
        cluster_details[cluster_num]["top_features_value"] = top_feature_values
        filenames = cluster_data[cluster_data["cluster_label"]==cluster_num]["filename"]
        filenames = filenames.values.tolist()
        cluster_details[cluster_num]["filenames"] = filenames
        
    return cluster_details
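
The ranking step in `get_cluster_details` rests on `argsort()[:, ::-1]`. A toy sketch of what that expression returns, using a made-up centroid matrix and vocabulary (both are assumptions for illustration, not values from the fitted model):

```python
import numpy as np

# Hypothetical centroid matrix: 2 clusters (rows) x 4 terms (columns).
toy_centers = np.array([
    [0.10, 0.00, 0.40, 0.05],
    [0.00, 0.30, 0.02, 0.20],
])
toy_terms = ["battery", "hotel", "screen", "room"]  # made-up vocabulary

# argsort() sorts each row in ascending order of value;
# [:, ::-1] reverses every row so the largest weight comes first.
ordered_idx = toy_centers.argsort()[:, ::-1]

top_n = 2
for cluster_num, row in enumerate(ordered_idx):
    top_terms = [toy_terms[i] for i in row[:top_n]]
    print(cluster_num, top_terms)
```

Each row of `ordered_idx` lists column indexes from the heaviest term to the lightest, so slicing the first `top_n` entries gives that cluster's most characteristic terms.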

In [59]:
def print_cluster_details(cluster_details):
    for cluster_num, cluster_detail in cluster_details.items():
        print("###### Cluster {0}".format(cluster_num))
        print("Top features:", cluster_detail["top_features"])
        print("Review filenames:", cluster_detail["filenames"][:7])
        print("==================================================")

In [60]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf_vect.get_feature_names_out().tolist()
cluster_details = get_cluster_details(cluster_model=km_cluster, cluster_data=document_df, 
                                      feature_names=feature_names, clusters_num=3, top_n_features=10)
print_cluster_details(cluster_details)

###### Cluster 0
Top features: ['screen', 'battery', 'keyboard', 'battery life', 'life', 'kindle', 'direction', 'video', 'size', 'voice']
Review filenames: ['accuracy_garmin_nuvi_255W_gps', 'battery-life_amazon_kindle', 'battery-life_ipod_nano_8gb', 'battery-life_netbook_1005ha', 'buttons_amazon_kindle', 'directions_garmin_nuvi_255W_gps', 'display_garmin_nuvi_255W_gps']
###### Cluster 1
Top features: ['interior', 'seat', 'mileage', 'comfortable', 'gas', 'gas mileage', 'transmission', 'car', 'performance', 'quality']
Review filenames: ['comfort_honda_accord_2008', 'comfort_toyota_camry_2007', 'gas_mileage_toyota_camry_2007', 'interior_honda_accord_2008', 'interior_toyota_camry_2007', 'mileage_honda_accord_2008', 'performance_honda_accord_2008']
###### Cluster 2
Top features: ['room', 'hotel', 'service', 'staff', 'food', 'location', 'bathroom', 'clean', 'price', 'parking']
Review filenames: ['bathroom_bestwestern_hotel_sfo', 'food_holiday_inn_london', 'food_swissotel_chicago', 'free_bestwestern_hotel_s

In [61]:
from sklearn.metrics import silhouette_samples, silhouette_score

score_samples = silhouette_samples(feature_vect, document_df["cluster_label"])
print("silhouette_samples shape:", score_samples.shape)

document_df["silhouette_coeff"] = score_samples
average_score = silhouette_score(feature_vect, document_df["cluster_label"])
print("Silhouette Analysis Score:{0:.3f}".format(average_score))

silhouette_samples shape: (51,)
Silhouette Analysis Score:0.085


In [62]:
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=1, min_samples=3, metric='euclidean')
dbscan_labels = dbscan.fit_predict(feature_vect)

In [63]:
dbscan_labels

array([-1,  2,  0,  0,  0, -1,  1,  1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  2,  2,  2, -1,
        3,  3,  3,  1,  4,  4,  4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1],
      dtype=int64)

In [64]:
document_df["dbscan_label"] = dbscan_labels
document_df.head()

|  | filename | opinion_text | cluster_label | silhouette_coeff | dbscan_label |
|---|---|---|---|---|---|
| 0 | accuracy_garmin_nuvi_255W_gps | ... | 0 | 0.031607 | -1 |
| 1 | bathroom_bestwestern_hotel_sfo | ... | 2 | 0.072249 | 2 |
| 2 | battery-life_amazon_kindle | ... | 0 | 0.086796 | 0 |
| 3 | battery-life_ipod_nano_8gb | ... | 0 | 0.083915 | 0 |
| 4 | battery-life_netbook_1005ha | ... | 0 | 0.08502 | 0 |


In [65]:
document_df[document_df["dbscan_label"]==-1]

|  | filename | opinion_text | cluster_label | silhouette_coeff | dbscan_label |
|---|---|---|---|---|---|
| 0 | accuracy_garmin_nuvi_255W_gps | ... | 0 | 0.031607 | -1 |
| 5 | buttons_amazon_kindle | ... | 0 | 0.03172 | -1 |
| 8 | directions_garmin_nuvi_255W_gps | ... | 0 | 0.033114 | -1 |
| 9 | display_garmin_nuvi_255W_gps | ... | 0 | 0.044499 | -1 |
| 10 | eyesight-issues_amazon_kindle | ... | 0 | 0.042511 | -1 |
| 11 | features_windows7 | ... | 0 | 0.019356 | -1 |
| 12 | fonts_amazon_kindle | ... | 0 | 0.032072 | -1 |
| 13 | food_holiday_inn_london | ... | 2 | 0.112198 | -1 |
| 14 | food_swissotel_chicago | ... | 2 | 0.109331 | -1 |
| 15 | free_bestwestern_hotel_sfo | ... | 2 | 0.044534 | -1 |


In [97]:
from soyclustering import SphericalKMeans
spherical_kmeans = SphericalKMeans(n_clusters=3, max_iter=10000, verbose=1, init="similar_cut",
                                   sparsity="minimum_df", minimum_df_factor=0.05)
labels = spherical_kmeans.fit_predict(feature_vect)

initialization_time=0.003020 sec, sparsity=0.117
n_iter=1, changed=33, inertia=40.774, iter_time=0.029 sec, sparsity=0.00759
n_iter=2, changed=1, inertia=31.907, iter_time=0.029 sec, sparsity=0.00745
Early converged.


In [98]:
labels

array([1, 0, 2, 2, 2, 1, 0, 0, 1, 2, 1, 2, 1, 0, 0, 0, 2, 1, 1, 2, 0, 0,
       2, 1, 0, 2, 2, 1, 0, 1, 0, 0, 0, 1, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2,
       1, 0, 0, 2, 2, 2, 2], dtype=int64)

In [99]:
document_df["spherical_kmeans_label"] = labels
document_df[document_df["spherical_kmeans_label"]==0]

|  | filename | opinion_text | cluster_label | silhouette_coeff | dbscan_label | spherical_kmeans_label |
|---|---|---|---|---|---|---|
| 1 | bathroom_bestwestern_hotel_sfo | ... | 2 | 0.072249 | 2 | 0 |
| 6 | comfort_honda_accord_2008 | ... | 1 | 0.182211 | 1 | 0 |
| 7 | comfort_toyota_camry_2007 | ... | 1 | 0.178219 | 1 | 0 |
| 13 | food_holiday_inn_london | ... | 2 | 0.112198 | -1 | 0 |
| 14 | food_swissotel_chicago | ... | 2 | 0.109331 | -1 | 0 |
| 15 | free_bestwestern_hotel_sfo | ... | 2 | 0.044534 | -1 | 0 |
| 20 | location_bestwestern_hotel_sfo | ... | 2 | 0.109194 | -1 | 0 |
| 21 | location_holiday_inn_london | ... | 2 | 0.114046 | -1 | 0 |
| 24 | parking_bestwestern_hotel_sfo | ... | 2 | 0.040031 | -1 | 0 |
| 28 | price_holiday_inn_london | ... | 2 | 0.107213 | -1 | 0 |
