<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#데이터-불러오기" data-toc-modified-id="데이터-불러오기-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>데이터 불러오기</a></span></li><li><span><a href="#전처리" data-toc-modified-id="전처리-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>전처리</a></span><ul class="toc-item"><li><span><a href="#명사추출" data-toc-modified-id="명사추출-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>명사추출</a></span></li><li><span><a href="#word2vec(유의어-처리)" data-toc-modified-id="word2vec(유의어-처리)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>word2vec(유의어 처리)</a></span></li><li><span><a href="#bow(불용어처리)" data-toc-modified-id="bow(불용어처리)-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>bow(불용어처리)</a></span></li></ul></li><li><span><a href="#새로운-query를-포함한-문서-추출" data-toc-modified-id="새로운-query를-포함한-문서-추출-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>새로운 query를 포함한 문서 추출</a></span><ul class="toc-item"><li><span><a href="#tf-idf" data-toc-modified-id="tf-idf-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>tf-idf</a></span><ul class="toc-item"><li><span><a href="#클러스터링" data-toc-modified-id="클러스터링-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>클러스터링</a></span><ul class="toc-item"><li><span><a href="#kmeans" data-toc-modified-id="kmeans-3.1.1.1"><span class="toc-item-num">3.1.1.1&nbsp;&nbsp;</span>kmeans</a></span></li></ul></li></ul></li><li><span><a href="#doc2vec" data-toc-modified-id="doc2vec-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>doc2vec</a></span><ul class="toc-item"><li><span><a href="#클러스터링" data-toc-modified-id="클러스터링-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>클러스터링</a></span></li></ul></li></ul></li><li><span><a href="#kmeans" data-toc-modified-id="kmeans-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>kmeans</a></span></li></ul></div>

In [0]:
import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import pickle
import datetime
import re

import tensorflow as tf

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

matplotlib.rc('font', family='NanumBarunGothic')
plt.rcParams['axes.unicode_minus'] = False

## 데이터 불러오기

In [0]:
def get_dataframe(data_name_with_route):
    ## load news data
    with open(data_name_with_route, 'rb') as file:
        data_list = []
        while True:
            try:
                data = pickle.load(file)
            except EOFError:
                break
            data_list.append(data)
    ## construct lists for dataframe
    title = []
    content = []
    date = []
    for news in data_list[0]['return_object']['documents']: 
        title.append(news['title'])
        content.append(news['content'])
        date.append(news['published_at'][:10]) #### 시간 조정이 필요하면 바꾸기
    ## make lists as dataframe
    news_data = pd.DataFrame([])
    news_data['date'] = date
    news_data['title'] = title
    news_data['content'] = content
    return news_data

In [0]:
data_son = get_dataframe('rawdata_손흥민.pickle')
data_kbl = get_dataframe('rawdata_프로농구.pickle')

In [0]:
data_son['date_tmp'] = data_son['date'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d').toordinal())
data_kbl['date_tmp'] = data_kbl['date'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d').toordinal())

In [0]:
data_son['label'] = 0
data_kbl['label'] = 1

In [0]:
data = pd.concat([data_son,data_kbl])
data.reset_index(inplace = True)
data.drop(['index'],axis = 1,inplace = True)

## 전처리

In [0]:
reg_reporter = re.compile('[가-힣]+\s[가-힣]*기자') # 기자
reg_email = re.compile('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$') # 이메일
reg_eng = re.compile('[a-z]+') # 소문자 알파벳, 이메일 제거용, 대문자는 남겨둔다
reg_chi = re.compile("[\u4e00-\u9fff]+") # 한자
reg_sc = re.compile("·|ㆍ|ㆍ|…|◆+|◇+|▶+|●+|▲+|“|”|‘|’|\"|\'|\(|\)") # 특수문자
reg_date = re.compile('\d*일|\d*월|\d*년|\d*시|\d*분|\(현지시간\)|\(현지시각\)|\d+') #날짜,시간,숫자

In [0]:
def preProcessing(doc):
    tmp = re.sub(reg_reporter, '', doc)
    tmp = re.sub(reg_email, '', tmp)
    tmp = re.sub(reg_eng, '', tmp)
    tmp = re.sub(reg_chi, '', tmp)
    tmp = re.sub(reg_sc, ' ', tmp)
    tmp = re.sub(reg_date, '', tmp)
    return tmp

In [0]:
data['content_re'] = data['content'].apply(preProcessing)

### 명사추출

In [0]:
from ckonlpy.tag import Twitter
twitter = Twitter()

  warn('"Twitter" has changed to "Okt" since KoNLPy v0.4.5.')


In [0]:
## 사전 계속 추가
add_dict = ['마우리시오', 
            '포체티노', 
            '아시안게임', 
            '그리고',]
twitter.add_dictionary(add_dict, 'Noun')

In [0]:
##불용어는 나중에 계속 추가!
##text로 정리하기
stop_basic = '''
아
휴
'''
stop_words = stop_basic.split()

In [0]:
tokenized_doc = data['content_re'].apply(lambda x: twitter.nouns(x))
tokenized_doc2 = tokenized_doc.apply(lambda x: [item.lower() for item in x if item not in stop_words])
display(tokenized_doc.head())
display(tokenized_doc2.head())

0    [내, 챔스리그, 강, 티, 강, 확정, 손, 마, 우리, 오, 포체티노, 사진, ...
1    [미국, 이, 아아, 최고, 스포츠스타, 손흥민, 토트넘, 집중, 조명, 은, 한국...
2    [크루이프, 후계, 자, 아약스, 크리스티아누, 호날두, 유벤투스, 를, 네덜란드,...
3    [네덜란드, 아약스, 크리스티아누, 호날두, 해, 이탈리아, 유벤투스, 유럽, 축구...
4    [지난, 경기, 광주, 경찰, 기숙, 학원, 오전, 이, 걸그룹, 트와이스, 예스,...
Name: content_re, dtype: object

0    [내, 챔스리그, 강, 티, 강, 확정, 손, 마, 우리, 오, 포체티노, 사진, ...
1    [미국, 이, 아아, 최고, 스포츠스타, 손흥민, 토트넘, 집중, 조명, 은, 한국...
2    [크루이프, 후계, 자, 아약스, 크리스티아누, 호날두, 유벤투스, 를, 네덜란드,...
3    [네덜란드, 아약스, 크리스티아누, 호날두, 해, 이탈리아, 유벤투스, 유럽, 축구...
4    [지난, 경기, 광주, 경찰, 기숙, 학원, 오전, 이, 걸그룹, 트와이스, 예스,...
Name: content_re, dtype: object

### word2vec(유의어 처리)

In [0]:
from gensim.models import Word2Vec
model = Word2Vec(sentences=tokenized_doc2, size=300, window=5, min_count=5, workers=4, sg=1)
tmp = model.wv.most_similar("손흥민")

### bow(불용어처리)

In [0]:
tmp_list = []
for doc in tokenized_doc2:
    tmp_list.extend(doc)

In [0]:
word_count = pd.Series(tmp_list).value_counts()

In [0]:
word_count_idx = list(word_count.index)

In [0]:
stop_words = word_count_idx[:10] + word_count_idx[-10:] + ['내']

In [0]:
data['content_token'] = tokenized_doc2.apply(lambda x: [item.lower() for item in x if item not in stop_words])

## 새로운 query를 포함한 문서 추출

In [0]:
query = '경기'

In [0]:
selected_news = []
for idx in range(tokenized_doc.shape[0]):
    if query in tokenized_doc.iloc[idx].content_re :
        selected_news.append(idx)

In [0]:
len(selected_news)

7407

In [0]:
query_doc = tokenized_doc.iloc[selected_news]

### tf-idf

In [0]:
tmp = query_doc['content_re'].apply(lambda x : ' '.join(x))

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

obj_tfidf = TfidfVectorizer()
x = obj_tfidf.fit_transform(tmp)

In [0]:
sim_mat = 1 - np.round(pairwise_distances(x, metric="cosine"),3)
np.fill_diagonal(sim_mat,-1) # 같은 문서는 유사도 -1로 처리

In [0]:
sim_mat

array([[-1.   ,  0.22 ,  0.115, ...,  0.019,  0.036,  0.06 ],
       [ 0.22 , -1.   ,  0.064, ...,  0.007,  0.028,  0.029],
       [ 0.115,  0.064, -1.   , ...,  0.005,  0.013,  0.004],
       ...,
       [ 0.019,  0.007,  0.005, ..., -1.   ,  0.372,  0.408],
       [ 0.036,  0.028,  0.013, ...,  0.372, -1.   ,  0.301],
       [ 0.06 ,  0.029,  0.004, ...,  0.408,  0.301, -1.   ]])

#### 클러스터링

kmeans, 코사인 유사도, lsa

##### kmeans

In [0]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2,random_state=0).fit(x)

In [0]:
kmeans_label = kmeans.labels_
label = data['label'][selected_news]

In [0]:
np.sum((label==kmeans_label)*1)/len(kmeans_label)

0.9823140272715

### doc2vec

In [0]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_data = [TaggedDocument(words=_d, tags=[str(i)]) for i, _d in enumerate(query_doc)]
model = Doc2Vec(
    dm=1,            # PV-DBOW => 0 / default 1
    dbow_words=1,    # w2v simultaneous with DBOW d2v / default 0
    window=8,        # distance between the predicted word and context words
    vector_size=50,        # vector size
    alpha=0.025,     # learning-rate
    seed=42,
    min_count=5,    # ignore with freq lower
    min_alpha=0.025, # min learning-rate
    workers=4,   # multi cpu
    hs = 0,          # hierarchical softmax / default 0
    negative = 10,   # negative sampling / default 5
)

In [0]:
model.build_vocab(tagged_data)
model.corpus_count

In [0]:
n_epochs = 50

for epoch in range(n_epochs):
    if epoch % 10 == 0: 
        print('epoch: ', epoch)
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    model.alpha -= 0.0002 # decrease the learning rate
    model.min_alpha = model.alpha # fix the learning rate, no decay
    
model.save("d2v.model")
print("Model Saved")

In [0]:
model.wv.most_similar(positive=['조현우'])

In [0]:
model.wv.most_similar(positive=['손흥민', '농구'], negative=['축구'])

In [0]:
model.wv.most_similar(positive=['손흥민', '이승우'], negative=['토트넘'])

In [0]:
from sklearn.preprocessing import StandardScaler

x_doc2vec = []
for i in range(len(query_doc)):
    x_doc2vec.append(model.docvecs[i])
x_doc2vec = np.array(x_doc2vec)

x_doc2vec_scaled = StandardScaler().fit_transform(x_doc2vec)

In [0]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

x_doc2vec_scaled_tmp = x_doc2vec_scaled[:100]

t_sne = TSNE(n_components=3, learning_rate=200, init='pca', random_state=10)
t_sne_doc2vec = t_sne.fit_transform(x_doc2vec_scaled_tmp)

pca1 = PCA(n_components=3)
pca_doc2vec = pca1.fit_transform(x_doc2vec_scaled_tmp)

kmeans

In [0]:
from sklearn.cluster import KMeans
n_clusters = 3
kmeans_doc2vec = KMeans(n_clusters = n_clusters, random_state=3).fit(x_doc2vec_scaled_tmp)

In [0]:
kmeans_labels = kmeans_doc2vec.labels_
kmeans_labels[:100]

In [0]:
np.sum((label==kmeat_sne_kmeans_pd = np.c_[t_sne_doc2vec, kmeans_labels]
pca_kmeans_pd = np.c_[pca_doc2vec, kmeans_labels]

t_sne_kmeans_pd = pd.DataFrame(t_sne_kmeans_pd, columns=['x', 'y', 'z', 'labels'])
pca_kmeans_pd = pd.DataFrame(pca_kmeans_pd, columns=['x', 'y', 'z', 'labels'])ns_doc2vec_label)*1)/len(kmeans_doc2vec_label)

3차원

In [0]:
from mpl_toolkits.mplot3d import Axes3D

groups = t_sne_kmeans_pd.groupby('labels')

fig = plt.figure()
ax = Axes3D(fig)
for name, group in groups:
    ax.scatter(group.x, 
            group.y,
            group.z,
            marker='o', 
           label=name)
#ax.legend(fontsize=12, loc='upper left') # legend position
plt.title('TSNE Plot of kmeans', fontsize=20)
plt.show()

In [0]:
from mpl_toolkits.mplot3d import Axes3D

groups = t_sne_kmeans_pd.groupby('labels')

fig = plt.figure()
ax = Axes3D(fig)
for name, group in groups:
    ax.scatter(group.x, 
            group.y,
            group.z,
            marker='o', 
           label=name)
#ax.legend(fontsize=12, loc='upper left') # legend position
plt.title('TSNE Plot of kmeans', fontsize=20)
plt.show()

2차원

In [0]:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

t_sne = TSNE(n_components=2, learning_rate=200, init='pca', random_state=10)
t_sne_doc2vec = t_sne.fit_transform(x_doc2vec_scaled_tmp)

pca1 = PCA(n_components=2)
pca_doc2vec = pca1.fit_transform(x_doc2vec_scaled_tmp)

In [0]:
t_sne_kmeans_pd = np.c_[t_sne_doc2vec, kmeans_labels]
pca_kmeans_pd = np.c_[pca_doc2vec, kmeans_labels]

t_sne_kmeans_pd = pd.DataFrame(t_sne_kmeans_pd, columns=['x', 'y', 'labels'])
pca_kmeans_pd = pd.DataFrame(pca_kmeans_pd, columns=['x', 'y', 'labels'])

In [0]:
groups = t_sne_kmeans_pd.groupby('labels')

fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.x, 
            group.y, 
            marker='o', 
            linestyle='',
            label=name)

#ax.legend(fontsize=12, loc='upper left') # legend position
plt.title('TSNE Plot of kmeans', fontsize=20)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.show()

In [0]:
groups = pca_kmeans_pd.groupby('labels')

fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.x, 
            group.y, 
            marker='o', 
            linestyle='',
            label=name)

#ax.legend(fontsize=12, loc='upper left') # legend position
plt.title('PCA Plot of kmeans', fontsize=20)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.show()

뉴스 확인

In [0]:
data_cluster_kmeans = data.iloc[selected_news].copy()
data_cluster_kmeans_tmp = data_cluster_kmeans.iloc[:100].copy()
data_cluster_kmeans_tmp['label'] = kmeans_labels

In [0]:
for n in range(n_clusters):
    print('*' * 100, n)
    print('*' * 100, n)
    print('*' * 100, n)
    for i in range(2):
            print(data_cluster_kmeans_tmp[data_cluster_kmeans_tmp['label'] == n]['content'].iloc[i])
            print('=' *100)
    print('#' * 100, n)

#### DBSCAN

In [0]:
from sklearn.cluster import DBSCAN
## eps 조절 잘해보기
dbscan_labels = DBSCAN(eps=7, min_samples=2).fit_predict(x_doc2vec_scaled_tmp)
dbscan_labels

시각화(2차원)

In [0]:
t_sne_dbscan_pd = np.c_[t_sne_doc2vec, dbscan_labels]
pca_dbscan_pd = np.c_[pca_doc2vec, dbscan_labels]

t_sne_dbscan_pd = pd.DataFrame(t_sne_dbscan_pd, columns=['x', 'y', 'labels'])
pca_dbscan_pd = pd.DataFrame(pca_dbscan_pd, columns=['x', 'y', 'labels'])

In [0]:
groups = t_sne_dbscan_pd.groupby('labels')

fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.x, 
            group.y, 
            marker='o', 
            linestyle='',
            label=name)

#ax.legend(fontsize=12, loc='upper left') # legend position
plt.title('TSNE Plot of dbscan', fontsize=20)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.show()

In [0]:
groups = pca_dbscan_pd.groupby('labels')

fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.x, 
            group.y, 
            marker='o', 
            linestyle='',
            label=name)

#ax.legend(fontsize=12, loc='upper left') # legend position
plt.title('PCA Plot of dbscan', fontsize=20)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.show()

뉴스 확인

In [0]:
groups = pca_dbscan_pd.groupby('labels')

fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.x, 
            group.y, 
            marker='o', 
            linestyle='',
            label=name)

#ax.legend(fontsize=12, loc='upper left') # legend position
plt.title('PCA Plot of dbscan', fontsize=20)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.show()

In [0]:
for n in range(-1, 10):
    print('*' * 100, n)
    print('*' * 100, n)
    print('*' * 100, n)    
    for i in range(2):
        try:
            print(data_cluster_dbscan_tmp[data_cluster_dbscan_tmp['label'] == n]['content'].iloc[i])
            print('=' *100)
        except:
            break
    print('#' * 100, n)


클러스터별 뉴스 선별 기준
1. BM25
2. TF-IDF
3. Page Rank
4. 수작업