*Table of Contents*
1. Import required modules & read files
2. Data preprocessing
3. Applying SVD
---

# Import required modules & read files
- Files / columns
    - metadata : information on the articles that readers viewed from October 1, 2018 to March 14, 2019
        - magazine id [magazine_id]
        - registration time [reg_ts]
        - author id [user_id]
        - article id [article_id]
        - article number / author info [id]
        - title [title]
        - subtitle [sub_title]
        - url [display_url]
        - keyword list (assigned by the author) [keyword_list]
    - users : user information
        - reader id [id]
        - list of followed authors [following_list]
        - keyword list [keyword_list]
    - read : information on the articles that were read
        - reader id [id]
        - article id [article_id]
    - magazine : 
        - magazine id [id]
        - magazine tag list [magazine_tag_list]

In [1]:
# Import modules
from collections import Counter
from datetime import timedelta, datetime
import glob
from itertools import chain
import json
import os
import re

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import register_matplotlib_converters
import seaborn as sns

In [3]:
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import hangul_font

Hangul font is set!


In [142]:
from sklearn.decomposition import TruncatedSVD

In [5]:
# Read the magazine file (raw string so the backslashes are not treated as escapes)
magazine = pd.read_json(r'.\data\magazine.json', lines=True)

In [6]:
magazine.head()

Unnamed: 0,magazine_tag_list,id
0,"[브런치북, 육아일기, 대화법, 들려주고픈이야기]",38842
1,"[tea, food]",11540
2,[food],11541
3,"[브런치북, 일상, 시, 사람]",11546
4,"[감성에세이, 노래, 음악에세이]",11544


In [67]:
# Read the metadata file (raw string so the backslashes are not treated as escapes)
metadata = pd.read_json(r'.\data\metadata.json', lines=True)

In [68]:
metadata = metadata[['article_id', 'display_url', 'id', 'keyword_list', 'magazine_id', 'reg_ts', 'sub_title', 'title', 'user_id']]

In [9]:
metadata.head()

Unnamed: 0,article_id,display_url,id,keyword_list,magazine_id,reg_ts,sub_title,title,user_id
0,782,https://brunch.co.kr/@bookdb/782,@bookdb_782,"[여행, 호주, 국립공원]",8982,1474944427000,세상 어디에도 없는 호주 Top 10,"사진으로 옮기기에도 아까운, 리치필드 국립공원",@bookdb
1,81,https://brunch.co.kr/@kohwang56/81,@kohwang56_81,"[목련꽃, 아지랑이, 동행]",12081,1463092749000,,[시] 서러운 봄,@kohwang56
2,4,https://brunch.co.kr/@hannahajink/4,@hannahajink_4,[],0,1447997287000,무엇 때문에,무엇을 위해,@hannahajink
3,88,https://brunch.co.kr/@bryceandjuli/88,@bryceandjuli_88,"[감정, 마음, 위로]",16315,1491055161000,,싫다,@bryceandjuli
4,34,https://brunch.co.kr/@mijeongpark/34,@mijeongpark_34,"[유럽여행, 더블린, 아일랜드]",29363,1523292942000,#7. 내 친구의 집은 어디인가,Dubliner#7,@mijeongpark


In [11]:
# Read the users file (the file name starts with u, so '\u' in a normal string raises a unicode error; prefixing the path with r makes it open)
users = pd.read_json(r'.\data\users.json', lines=True)
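The comment in this cell is about Python string escapes: in a normal string literal `\u` starts a Unicode escape, while a raw string keeps every backslash literally. A minimal sketch of the two spellings and of a `pathlib` alternative follows; the `data_dir` name is an assumption introduced only for illustration.

```python
from pathlib import Path

# A raw string keeps every backslash literally, so '\u' is not parsed as an escape.
raw_path = r'.\data\users.json'
escaped_path = '.\\data\\users.json'   # same text, with each backslash escaped by hand
same_text = raw_path == escaped_path

# pathlib builds the path from parts and avoids backslashes in literals altogether.
data_dir = Path('data')                # hypothetical location of the dataset
users_path = data_dir / 'users.json'
```

Building paths with `pathlib` also keeps the notebook portable to systems that do not use backslash separators.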

In [12]:
users = users[['following_list', 'id', 'keyword_list']]

In [13]:
users.head()

Unnamed: 0,following_list,id,keyword_list
0,"[@perytail, @brunch]",#901985d8bc4c481805c4a4f911814c4a,[]
1,"[@holidaymemories, @wadiz, @sciforus, @dailydu...",#1fd89e9dcfa64b45020d9eaca54e0eed,[]
2,"[@commerceguy, @sunsutu, @kakao-it, @joohoonja...",#1d94baaea71a831e1f33e1c6bd126ed5,[]
3,"[@amberjeon48, @forsy20, @nemotokki, @hawann, ...",#04641c01892b12dc018b1410e4928c0d,[]
4,"[@dwcha7342, @iammento, @kakao-it, @dkam, @ant...",#65bcaff862aadff877e461f54187ab62,[]


In [18]:
# Read the read files (every file in data > read)
read_file_lst = glob.glob('.\\data\\read\\*')

In [19]:
exclude_file_lst = ['read.tar']

In [20]:
read_df_lst = []
for f in read_file_lst:
    file_name = os.path.basename(f)
    if file_name in exclude_file_lst:
        print(file_name)
    else:
        df_temp = pd.read_csv(f, header=None, names=['raw'])
        df_temp['dt'] = file_name[:8]
        df_temp['hr'] = file_name[8:10]
        df_temp['user_id'] = df_temp['raw'].str.split(' ').str[0]
        df_temp['article_id'] = df_temp['raw'].str.split(' ').str[1:].str.join(' ').str.strip()
        read_df_lst.append(df_temp)

In [21]:
read = pd.concat(read_df_lst)
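To make the parsing loop easier to follow, here is a small sketch on made-up data. It assumes each line has the shape `<reader id> <article id> <article id> ...` and that the first 10 characters of the file name encode the date and hour, as the slicing in the loop implies; the file name and ids below are hypothetical.

```python
import pandas as pd

# Two made-up log lines in the same shape as the read files.
raw = pd.Series(['#u1 @a_1 @b_2 ', '#u2 @c_3'], name='raw')
file_name = '2018100100_2018100101'   # hypothetical file name

parts = raw.str.strip().str.split(' ')
toy = pd.DataFrame({
    'dt': file_name[:8],                          # date taken from the file name
    'hr': file_name[8:10],                        # hour taken from the file name
    'user_id': parts.str[0],                      # first token is the reader
    'article_id': parts.str[1:].str.join(' '),    # remaining tokens are the articles read
})
```

Each row still holds all articles of one session in a single string; the next section splits them into one row per article.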

## Reshaping the format

In [22]:
def chainer(s):
    return list(chain.from_iterable(s.str.split(' ')))

In [23]:
# Tidy the read data: number of articles in each session row
read_cnt_by_user = read['article_id'].str.split(' ').map(len)

In [24]:
read_raw = pd.DataFrame({'dt': np.repeat(read['dt'], read_cnt_by_user),
                         'hr': np.repeat(read['hr'], read_cnt_by_user),
                         'user_id': np.repeat(read['user_id'], read_cnt_by_user),
                         'article_id': chainer(read['article_id'])})

In [25]:
read_raw.head()

Unnamed: 0,dt,hr,user_id,article_id
0,20181001,0,#e208be4ffea19b1ceb5cea2e3c4dc32c,@kty0613_91
1,20181001,0,#0a3d493f3b2318be80f391eaa00bfd1c,@miamiyoung_31
1,20181001,0,#0a3d493f3b2318be80f391eaa00bfd1c,@banksalad_49
1,20181001,0,#0a3d493f3b2318be80f391eaa00bfd1c,@rlfrjsdn_95
1,20181001,0,#0a3d493f3b2318be80f391eaa00bfd1c,@readme999_140
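The `np.repeat` plus `chainer` combination above turns one row per session into one row per article. As a hedged alternative, the same reshaping can be sketched with `DataFrame.explode` (available in pandas 0.25 and later); the toy ids below are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': ['#u1', '#u2'],
    'article_id': ['@a_1 @b_2', '@c_3'],
})

# Split the space-separated ids into lists, then emit one row per list element.
toy_long = (
    toy.assign(article_id=toy['article_id'].str.split(' '))
       .explode('article_id')
)
```

As in the output above, the original row index is repeated for every article that came from the same session row.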


In [26]:
# Summary of the read data
print("Total rows:", read_raw.shape)
print("Rows excluding repeat reads:", read_raw[['user_id', 'article_id']].drop_duplicates().shape)
print("Unique readers:", len(read_raw['user_id'].unique()))
print("Unique articles read:", len(read_raw['article_id'].unique()))

Total rows: (22110706, 4)
Rows excluding repeat reads: (12597878, 2)
Unique readers: 306222
Unique articles read: 505841


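# Data preprocessing
- Select the top 200 authors (the 200 authors whose articles were read the most)
    - Reader-based author recommendation: run on the top 200 authors only
    - Reduce the data so that machine-learning recommendation is feasible
- Compute the number of articles read per reader / the number of articles written per author
- Build a pivot of 15,000 readers by 200 authors
    - Value: a weighted combination of the share of the reader's own reads that went to the author and the share of the author's total articles that the reader read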
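
Read against the loop later in this notebook, the pivot value appears to be the formula below. The symbols are introduced here for illustration only: $c_{ij}$ is the number of read events of reader $i$ on articles by author $j$, $R_i$ is reader $i$'s total read events across the top 200 authors, and $A_j$ is the total read events for author $j$. Both weights are 0.5 in the code.

$$
w_{ij} = 0.5\,\frac{c_{ij}}{R_i} + 0.5\,\frac{c_{ij}}{A_j}
$$

For example, a reader with $c_{ij} = 2$, $R_i = 4$ and an author with $A_j = 10$ gives $w_{ij} = 0.5 \cdot 0.5 + 0.5 \cdot 0.2 = 0.35$.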

## Select the top 200 most-read authors from the Read + Metadata files

In [69]:
read_raw.columns

Index(['dt', 'hr', 'user_id', 'article_id'], dtype='object')

In [74]:
read_raw

Unnamed: 0,dt,hr,user_id,article_id
0,20181001,00,#e208be4ffea19b1ceb5cea2e3c4dc32c,@kty0613_91
1,20181001,00,#0a3d493f3b2318be80f391eaa00bfd1c,@miamiyoung_31
1,20181001,00,#0a3d493f3b2318be80f391eaa00bfd1c,@banksalad_49
1,20181001,00,#0a3d493f3b2318be80f391eaa00bfd1c,@rlfrjsdn_95
1,20181001,00,#0a3d493f3b2318be80f391eaa00bfd1c,@readme999_140
...,...,...,...,...
1229,20190228,23,#3eec960b2ad12fc41ec986032effc8b2,@leewoosview_189
1230,20190228,23,#1eab0886c0f0f32156f9ab1e5d0fffab,@rory_7
1230,20190228,23,#1eab0886c0f0f32156f9ab1e5d0fffab,@rory_7
1231,20190228,23,#005be6888ba3f083eed1806ba427cc3a,@cliche-cliche_1


In [70]:
metadata.columns

Index(['article_id', 'display_url', 'id', 'keyword_list', 'magazine_id',
       'reg_ts', 'sub_title', 'title', 'user_id'],
      dtype='object')

In [77]:
metadata.rename(columns={"article_id":"article_number","id":"article_id", "user_id":"author_id"}, inplace=True)

In [78]:
metadata

Unnamed: 0,article_number,display_url,article_id,keyword_list,magazine_id,reg_ts,sub_title,title,author_id
0,782,https://brunch.co.kr/@bookdb/782,@bookdb_782,"[여행, 호주, 국립공원]",8982,1474944427000,세상 어디에도 없는 호주 Top 10,"사진으로 옮기기에도 아까운, 리치필드 국립공원",@bookdb
1,81,https://brunch.co.kr/@kohwang56/81,@kohwang56_81,"[목련꽃, 아지랑이, 동행]",12081,1463092749000,,[시] 서러운 봄,@kohwang56
2,4,https://brunch.co.kr/@hannahajink/4,@hannahajink_4,[],0,1447997287000,무엇 때문에,무엇을 위해,@hannahajink
3,88,https://brunch.co.kr/@bryceandjuli/88,@bryceandjuli_88,"[감정, 마음, 위로]",16315,1491055161000,,싫다,@bryceandjuli
4,34,https://brunch.co.kr/@mijeongpark/34,@mijeongpark_34,"[유럽여행, 더블린, 아일랜드]",29363,1523292942000,#7. 내 친구의 집은 어디인가,Dubliner#7,@mijeongpark
...,...,...,...,...,...,...,...,...,...
643099,24,https://brunch.co.kr/@uxstar/24,@uxstar_24,"[3D, UI, 제스처]",38917,1553502554000,GIS 서비스,3D 지도의 내비게이션 제스처,@uxstar
643100,575,https://brunch.co.kr/@reading15m/575,@reading15m_575,"[독서모임, 경험수집, 글쓰기]",28741,1540984479000,,월간 경험수집 vol.6,@reading15m
643101,118,https://brunch.co.kr/@hje3884/118,@hje3884_118,"[생각, 에세이, 괴로움]",19155,1509957398000,공기 조차 함께 하고 싶지 않을 때,왜 참으라고만 해요?,@hje3884
643102,12,https://brunch.co.kr/@julieleekgep/12,@julieleekgep_12,"[여행, 유럽여행, 리스본]",37504,1540993756000,"리스본, 길 위에서 만난 우정",넌 오늘 뭘 봤니?,@julieleekgep


In [93]:
read_final = pd.merge(left = read_raw, right=metadata, left_on="article_id", right_on="article_id", how="left")

In [94]:
read_final.drop(columns=["display_url", "keyword_list", "magazine_id", "reg_ts", 
                         "sub_title", "title"], inplace=True)

In [95]:
read_top_200_count = read_final.groupby("author_id")["user_id"].agg("count").sort_values(ascending=False).reset_index()

In [96]:
read_top_200_count.head()

Unnamed: 0,author_id,user_id
0,@brunch,402451
1,@tenbody,319864
2,@jordan777,311157
3,@dailylife,256239
4,@binkond,156974


In [97]:
# Keep only the top 200 (copy so the later in-place rename does not act on a slice)
read_top_200_count = read_top_200_count.iloc[:200, :].copy()

In [98]:
read_top_200_count.author_id.nunique()

200
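The group, count, sort, and slice steps above can also be written with `value_counts`, shown here as a sketch on toy data. It counts rows per author, which matches the `count` of `user_id` as long as `user_id` has no missing values; `top_n` and the toy ids are assumptions for illustration.

```python
import pandas as pd

toy_reads = pd.DataFrame({
    'user_id':   ['#u1', '#u2', '#u3', '#u1', '#u2', '#u3'],
    'author_id': ['@a',  '@a',  '@a',  '@b',  '@b',  '@c'],
})

# value_counts() counts rows per author and sorts in descending order,
# so head(top_n) keeps the most-read authors.
top_n = 2
top_authors = (
    toy_reads['author_id'].value_counts().head(top_n)
    .rename_axis('author_id').reset_index(name='reader_id_count')
)
```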

In [99]:
# Rename the column
read_top_200_count.rename(columns={"user_id":"reader_id_count"}, inplace=True)
read_top_200_count

Unnamed: 0,author_id,reader_id_count
0,@brunch,402451
1,@tenbody,319864
2,@jordan777,311157
3,@dailylife,256239
4,@binkond,156974
...,...,...
195,@ssuujin,16255
196,@finance1026,16251
197,@alicemelbourne,16248
198,@sustainability,16169


In [100]:
# Save as a CSV file
# read_top_200_count.to_csv("Read_Author_Top_200_Count.csv", index=False)

Final Read_Top_200 DataFrame

In [101]:
read_top_200 = pd.merge(left=read_final, right=read_top_200_count, left_on="author_id", right_on="author_id", how="right")

In [102]:
read_top_200.author_id.nunique()

200

In [103]:
# Save the top 200 data as a CSV file
# read_top_200.to_csv("Read_Author_Top_200.csv", index=False)

In [104]:
read_top_200

Unnamed: 0,dt,hr,user_id,article_id,article_number,author_id,reader_id_count
0,20181001,00,#0a3d493f3b2318be80f391eaa00bfd1c,@readme999_140,140.0,@readme999,31548
1,20181001,00,#88f2c6beaa352f808019befebd9f8bd0,@readme999_145,145.0,@readme999,31548
2,20181001,00,#b1e45aeff4915ce4e2ab350546bd6689,@readme999_165,165.0,@readme999,31548
3,20181001,00,#0e97d1dc7ee6bf7d6fa45eb5d3133343,@readme999_161,161.0,@readme999,31548
4,20181001,00,#0e97d1dc7ee6bf7d6fa45eb5d3133343,@readme999_161,161.0,@readme999,31548
...,...,...,...,...,...,...,...
7797174,20190228,23,#8cb5c6380bbe69cb13ae5c6a257ea240,@aemae-human_54,54.0,@aemae-human,25408
7797175,20190228,23,#8cb5c6380bbe69cb13ae5c6a257ea240,@aemae-human_54,54.0,@aemae-human,25408
7797176,20190228,23,#53c710fa00d99701f2dc2682c3e05e5f,@aemae-human_36,36.0,@aemae-human,25408
7797177,20190228,23,#53c710fa00d99701f2dc2682c3e05e5f,@aemae-human_36,36.0,@aemae-human,25408


## Total articles per author and total reads per reader

In [106]:
# Total number of articles per author (counted as read rows for that author's articles)
df_author = pd.DataFrame(read_top_200.groupby('author_id').count()['article_number']).T

In [108]:
# Total number of articles read per reader
df_reader = pd.DataFrame(read_top_200.groupby('user_id').count()['article_number'])

In [109]:
# Number of distinct authors read per reader
df_reader_author = pd.DataFrame(read_top_200.groupby('user_id').nunique()['author_id'])

In [110]:
# Sort readers by the number of distinct authors read and keep the top 15,000
user_id_df=df_reader_author.sort_values('author_id', ascending=False)[:15000].reset_index()
user_id_df

Unnamed: 0,user_id,author_id
0,#6f55b5508f0c31ae3456621c23c1f6a9,194
1,#71ca8074251ffb285a1ee286fde639b8,193
2,#9ef170cacaa30f62d6de34a5ad42a37e,193
3,#dcbfec386ba5af556b70bb98787c35ea,188
4,#64ed07be4247d4f03737119faad50903,185
...,...,...
14995,#446d2433642d440b48a45509a8ee37a5,24
14996,#4193ff1bcca6a3372f9faae2339d5f7a,24
14997,#db4e97d6b0ba8442cce7a3c89d08c58e,24
14998,#134b1867c4d70127b97d2ca5627d1293,24
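A possible shortcut for the two cells above, sketched on toy data: selecting `author_id` before calling `nunique` avoids computing distinct counts for every column, and `nlargest` replaces the sort and slice. The toy ids and the cut-off of 2 are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':   ['#u1', '#u1', '#u1', '#u2', '#u2', '#u3'],
    'author_id': ['@a',  '@b',  '@c',  '@a',  '@b',  '@b'],
})

# Select the column before aggregating so only author_id is counted.
authors_per_reader = toy.groupby('user_id')['author_id'].nunique()

# nlargest(n) returns the n readers with the most distinct authors.
top_readers = authors_per_reader.nlargest(2).reset_index()
```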


In [111]:
user_id_ls = user_id_df.drop('author_id', axis=1)
user_id_ls

Unnamed: 0,user_id
0,#6f55b5508f0c31ae3456621c23c1f6a9
1,#71ca8074251ffb285a1ee286fde639b8
2,#9ef170cacaa30f62d6de34a5ad42a37e
3,#dcbfec386ba5af556b70bb98787c35ea
4,#64ed07be4247d4f03737119faad50903
...,...
14995,#446d2433642d440b48a45509a8ee37a5
14996,#4193ff1bcca6a3372f9faae2339d5f7a
14997,#db4e97d6b0ba8442cce7a3c89d08c58e
14998,#134b1867c4d70127b97d2ca5627d1293


In [112]:
df_author.columns

Index(['@01038273527', '@13july', '@aemae-human', '@ahronjeon', '@alexkang',
       '@alicemelbourne', '@allstay', '@am327', '@anetmom', '@angiesongc9sx',
       ...
       '@workerhanee', '@worknlife', '@ws820512', '@x-xv', '@yemaya', '@yemyo',
       '@yeonboon', '@yoonjikwon', '@yoooong', '@yumileewyky'],
      dtype='object', name='author_id', length=200)

In [91]:
# read_top_200.drop(['dt', 'hr', 'article_id'], axis=1, inplace=True)

In [113]:
read_top_200.columns

Index(['dt', 'hr', 'user_id', 'article_id', 'article_number', 'author_id',
       'reader_id_count'],
      dtype='object')

## Build the author × reader pivot

In [None]:
df_pivot = read_top_200.pivot_table(values='article_number', index=['user_id'], columns='author_id', aggfunc='count')
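To show what this pivot contains, here is a minimal sketch on toy data. It counts read events per (reader, author) pair; the `fill_value=0` argument is an optional variation that replaces the later `fillna(0)` step, and the toy ids are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id':        ['#u1', '#u1', '#u1', '#u2'],
    'author_id':      ['@a',  '@a',  '@b',  '@b'],
    'article_number': [1,     2,     7,     7],
})

# Count read events per (reader, author); pairs with no reads become 0
# instead of NaN because of fill_value.
toy_pivot = toy.pivot_table(values='article_number', index='user_id',
                            columns='author_id', aggfunc='count', fill_value=0)
```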

In [116]:
df2 = pd.concat([df_pivot, df_author], axis=0)

In [121]:
df2

author_id,@01038273527,@13july,@aemae-human,@ahronjeon,@alexkang,@alicemelbourne,@allstay,@am327,@anetmom,@angiesongc9sx,...,@workerhanee,@worknlife,@ws820512,@x-xv,@yemaya,@yemyo,@yeonboon,@yoonjikwon,@yoooong,@yumileewyky
#00001ba6ca8d87d2fc34d626ba9cfe6f,,,,,,,,,,,...,,,,,,,,,,
#0000e87158c1426d6ffb72cebac6cb64,,,,,,,,,,,...,,,,,,,,,,
#0000fdba8f35c76eacab74c5c6bc7f1a,,,,,,,,,,,...,,,,,,,,,,
#000127ad0f1981cae1292efdb228f0e9,,,,,,,,,,,...,,,,,,,,,,
#0001485b31e8f02c1ce117ceb4f41560,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
#fffe67ecc0056dd26ae00511957c5a2b,,2.0,,,,,,,2.0,,...,,,,,,,,,,
#ffff69451ff594425637015500410a13,,,,,,,,,,,...,,,,,,,,,,
#ffff8d99b9caef8ad1b95cecf0b8eef4,,,,,,,,,,,...,,,,,,,,,,
#ffffc97f29a10678203330ec0b6bf138,,,,,,,,,,,...,,,,,,,,,,


In [117]:
df3 = pd.concat([df2, df_reader], axis=1)

In [123]:
df3.reset_index(inplace=True)
df3.rename(columns={'index':'user_id'}, inplace=True)
df3

Unnamed: 0,user_id,@01038273527,@13july,@aemae-human,@ahronjeon,@alexkang,@alicemelbourne,@allstay,@am327,@anetmom,...,@worknlife,@ws820512,@x-xv,@yemaya,@yemyo,@yeonboon,@yoonjikwon,@yoooong,@yumileewyky,article_number
0,#00001ba6ca8d87d2fc34d626ba9cfe6f,,,,,,,,,,...,,,,,,,,,,3.0
1,#0000e87158c1426d6ffb72cebac6cb64,,,,,,,,,,...,,,,,,,,,,1.0
2,#0000fdba8f35c76eacab74c5c6bc7f1a,,,,,,,,,,...,,,,,,,,,,5.0
3,#000127ad0f1981cae1292efdb228f0e9,,,,,,,,,,...,,,,,,,,,,27.0
4,#0001485b31e8f02c1ce117ceb4f41560,,,,,,,,,,...,,,,,,,,,,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239015,#fffe67ecc0056dd26ae00511957c5a2b,,2.0,,,,,,,2.0,...,,,,,,,,,,24.0
239016,#ffff69451ff594425637015500410a13,,,,,,,,,,...,,,,,,,,,,3.0
239017,#ffff8d99b9caef8ad1b95cecf0b8eef4,,,,,,,,,,...,,,,,,,,,,1.0
239018,#ffffc97f29a10678203330ec0b6bf138,,,,,,,,,,...,,,,,,,,,,2.0


In [124]:
# Build the pivot table for the 15,000 selected readers
pivot_15000 = pd.merge(left=user_id_ls, right=df3, left_on='user_id', right_on='user_id', how='left')

In [125]:
pivot_15000_fill = pivot_15000.fillna(0)

In [126]:
pivot_15000_fill = pd.concat([pivot_15000_fill, df_author], axis=0)

In [127]:
pivot_15000_fill.set_index('user_id', drop=True, inplace=True)
pivot_15000_fill

Unnamed: 0_level_0,@01038273527,@13july,@aemae-human,@ahronjeon,@alexkang,@alicemelbourne,@allstay,@am327,@anetmom,@angiesongc9sx,...,@worknlife,@ws820512,@x-xv,@yemaya,@yemyo,@yeonboon,@yoonjikwon,@yoooong,@yumileewyky,article_number
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
#6f55b5508f0c31ae3456621c23c1f6a9,12.0,41.0,7.0,2.0,61.0,3.0,56.0,36.0,17.0,55.0,...,40.0,36.0,4.0,56.0,60.0,18.0,0.0,10.0,5.0,5367.0
#71ca8074251ffb285a1ee286fde639b8,11.0,16.0,13.0,22.0,3.0,4.0,84.0,9.0,24.0,5.0,...,2.0,60.0,29.0,13.0,37.0,40.0,1.0,18.0,29.0,5845.0
#9ef170cacaa30f62d6de34a5ad42a37e,6.0,35.0,19.0,0.0,23.0,0.0,38.0,41.0,38.0,25.0,...,44.0,15.0,0.0,50.0,20.0,20.0,2.0,30.0,0.0,5380.0
#dcbfec386ba5af556b70bb98787c35ea,17.0,4.0,35.0,2.0,18.0,1.0,16.0,19.0,14.0,5.0,...,12.0,11.0,3.0,13.0,13.0,12.0,1.0,3.0,4.0,2056.0
#64ed07be4247d4f03737119faad50903,26.0,50.0,0.0,10.0,41.0,0.0,74.0,35.0,13.0,15.0,...,59.0,24.0,4.0,31.0,61.0,19.0,0.0,17.0,5.0,5012.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
#4193ff1bcca6a3372f9faae2339d5f7a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,85.0
#db4e97d6b0ba8442cce7a3c89d08c58e,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0
#134b1867c4d70127b97d2ca5627d1293,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,58.0
#0245db8de9afda01f63a641eb7a04889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,77.0


In [128]:
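# Column 200 holds each reader's total (article_number); row 15000 holds each author's total.
# Each cell becomes 0.5 * (share of the reader's reads) + 0.5 * (share of the author's reads).
# The total row and total column are overwritten as well; both are dropped afterwards.
for i in range(15001):
    for j in range(201):
        pivot_15000_fill.iloc[i,j] = (pivot_15000_fill.iloc[i,j] / pivot_15000_fill.iloc[i, 200]) * 0.5 + (pivot_15000_fill.iloc[i,j]/ pivot_15000_fill.iloc[15000, j]) * 0.5 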

pivot_15000_fill

Unnamed: 0_level_0,@01038273527,@13july,@aemae-human,@ahronjeon,@alexkang,@alicemelbourne,@allstay,@am327,@anetmom,@angiesongc9sx,...,@worknlife,@ws820512,@x-xv,@yemaya,@yemyo,@yeonboon,@yoonjikwon,@yoooong,@yumileewyky,article_number
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
#6f55b5508f0c31ae3456621c23c1f6a9,0.001309,0.004895,0.000790,0.000247,0.006674,0.000372,0.005868,0.003967,0.001726,0.006514,...,0.004675,0.004003,0.000412,0.006365,0.006691,0.001914,0.000000,0.001168,0.000548,
#71ca8074251ffb285a1ee286fde639b8,0.001116,0.001788,0.001368,0.002549,0.000305,0.000465,0.008162,0.000923,0.002254,0.000554,...,0.000219,0.006214,0.002766,0.001378,0.003844,0.003950,0.000104,0.001965,0.002960,
#9ef170cacaa30f62d6de34a5ad42a37e,0.000653,0.004171,0.002140,0.000000,0.002511,0.000000,0.003973,0.004508,0.003850,0.002955,...,0.005133,0.001664,0.000000,0.005671,0.002226,0.002123,0.000223,0.003497,0.000000,
#dcbfec386ba5af556b70bb98787c35ea,0.004405,0.001078,0.009200,0.000547,0.004670,0.000274,0.004077,0.004944,0.003522,0.001342,...,0.003203,0.002873,0.000759,0.003428,0.003400,0.003077,0.000262,0.000800,0.001039,
#64ed07be4247d4f03737119faad50903,0.003008,0.006299,0.000000,0.001301,0.004757,0.000000,0.008243,0.004088,0.001406,0.001876,...,0.007285,0.002827,0.000438,0.003728,0.007205,0.002146,0.000000,0.002098,0.000581,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
#4193ff1bcca6a3372f9faae2339d5f7a,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,
#db4e97d6b0ba8442cce7a3c89d08c58e,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.037054,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,
#134b1867c4d70127b97d2ca5627d1293,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.008632,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,
#0245db8de9afda01f63a641eb7a04889,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,
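
The element-wise loop performs about three million `iloc` assignments, which is slow. A vectorized sketch of the same weighting on toy data follows; it assumes the reader and author totals are kept as separate Series (`reader_total`, `author_total`, both hypothetical names) rather than as an extra column and row.

```python
import pandas as pd

# Toy counts: rows are readers, columns are authors.
counts = pd.DataFrame({'@a': [2, 0], '@b': [1, 1]}, index=['#u1', '#u2'])
reader_total = pd.Series({'#u1': 3, '#u2': 1})   # reads per reader
author_total = pd.Series({'@a': 2, '@b': 2})     # reads per author

# Divide each row by its reader total and each column by its author total,
# then average the two shares with equal weights.
weighted = (0.5 * counts.div(reader_total, axis=0)
            + 0.5 * counts.div(author_total, axis=1))
```

Because the totals live outside the matrix, no total row or column has to be dropped afterwards.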


In [131]:
pivot_15000_fill.drop(['article_number'], axis=1, inplace=True)
pivot_15000_fill

Unnamed: 0_level_0,@01038273527,@13july,@aemae-human,@ahronjeon,@alexkang,@alicemelbourne,@allstay,@am327,@anetmom,@angiesongc9sx,...,@workerhanee,@worknlife,@ws820512,@x-xv,@yemaya,@yemyo,@yeonboon,@yoonjikwon,@yoooong,@yumileewyky
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
#6f55b5508f0c31ae3456621c23c1f6a9,0.001309,0.004895,0.000790,0.000247,0.006674,0.000372,0.005868,0.003967,0.001726,0.006514,...,0.001105,0.004675,0.004003,0.000412,0.006365,0.006691,0.001914,0.000000,0.001168,0.000548
#71ca8074251ffb285a1ee286fde639b8,0.001116,0.001788,0.001368,0.002549,0.000305,0.000465,0.008162,0.000923,0.002254,0.000554,...,0.010773,0.000219,0.006214,0.002766,0.001378,0.003844,0.003950,0.000104,0.001965,0.002960
#9ef170cacaa30f62d6de34a5ad42a37e,0.000653,0.004171,0.002140,0.000000,0.002511,0.000000,0.003973,0.004508,0.003850,0.002955,...,0.008823,0.005133,0.001664,0.000000,0.005671,0.002226,0.002123,0.000223,0.003497,0.000000
#dcbfec386ba5af556b70bb98787c35ea,0.004405,0.001078,0.009200,0.000547,0.004670,0.000274,0.004077,0.004944,0.003522,0.001342,...,0.001754,0.003203,0.002873,0.000759,0.003428,0.003400,0.003077,0.000262,0.000800,0.001039
#64ed07be4247d4f03737119faad50903,0.003008,0.006299,0.000000,0.001301,0.004757,0.000000,0.008243,0.004088,0.001406,0.001876,...,0.001392,0.007285,0.002827,0.000438,0.003728,0.007205,0.002146,0.000000,0.002098,0.000581
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
#4193ff1bcca6a3372f9faae2339d5f7a,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#db4e97d6b0ba8442cce7a3c89d08c58e,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.037054,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#134b1867c4d70127b97d2ca5627d1293,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.008632,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#0245db8de9afda01f63a641eb7a04889,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


In [137]:
# Drop the last row, which holds the per-author totals appended earlier
pivot_15000_fill = pivot_15000_fill.iloc[:-1]

In [165]:
# Save as a CSV file
# pivot_15000_fill.to_csv('pivot_15000_fina.csv', encoding='utf-8')

In [138]:
pivot_15000_fill

Unnamed: 0_level_0,@01038273527,@13july,@aemae-human,@ahronjeon,@alexkang,@alicemelbourne,@allstay,@am327,@anetmom,@angiesongc9sx,...,@workerhanee,@worknlife,@ws820512,@x-xv,@yemaya,@yemyo,@yeonboon,@yoonjikwon,@yoooong,@yumileewyky
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
#6f55b5508f0c31ae3456621c23c1f6a9,0.001309,0.004895,0.000790,0.000247,0.006674,0.000372,0.005868,0.003967,0.001726,0.006514,...,0.001105,0.004675,0.004003,0.000412,0.006365,0.006691,0.001914,0.000000,0.001168,0.000548
#71ca8074251ffb285a1ee286fde639b8,0.001116,0.001788,0.001368,0.002549,0.000305,0.000465,0.008162,0.000923,0.002254,0.000554,...,0.010773,0.000219,0.006214,0.002766,0.001378,0.003844,0.003950,0.000104,0.001965,0.002960
#9ef170cacaa30f62d6de34a5ad42a37e,0.000653,0.004171,0.002140,0.000000,0.002511,0.000000,0.003973,0.004508,0.003850,0.002955,...,0.008823,0.005133,0.001664,0.000000,0.005671,0.002226,0.002123,0.000223,0.003497,0.000000
#dcbfec386ba5af556b70bb98787c35ea,0.004405,0.001078,0.009200,0.000547,0.004670,0.000274,0.004077,0.004944,0.003522,0.001342,...,0.001754,0.003203,0.002873,0.000759,0.003428,0.003400,0.003077,0.000262,0.000800,0.001039
#64ed07be4247d4f03737119faad50903,0.003008,0.006299,0.000000,0.001301,0.004757,0.000000,0.008243,0.004088,0.001406,0.001876,...,0.001392,0.007285,0.002827,0.000438,0.003728,0.007205,0.002146,0.000000,0.002098,0.000581
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
#446d2433642d440b48a45509a8ee37a5,0.000000,0.005708,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#4193ff1bcca6a3372f9faae2339d5f7a,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#db4e97d6b0ba8442cce7a3c89d08c58e,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.037054,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#134b1867c4d70127b97d2ca5627d1293,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.008632,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


# Applying SVD

In [139]:
pivot_t = pivot_15000_fill.values.T

In [143]:
SVD = TruncatedSVD(n_components=12)
matrix = SVD.fit_transform(pivot_t)
matrix.shape

(200, 12)
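
The choice of 12 components is not justified in this notebook. One way to sanity-check such a choice, sketched here on random toy data, is to look at `explained_variance_ratio_` after fitting; the matrix size, seed, and `n_components=5` below are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
toy = rng.random((20, 50))          # 20 "authors" x 50 "readers", made-up values

svd = TruncatedSVD(n_components=5, random_state=0)
latent = svd.fit_transform(toy)     # one 5-dimensional vector per author

# Share of the variance retained by the 5 components; a low value suggests
# trying a larger n_components.
retained = svd.explained_variance_ratio_.sum()
```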

In [144]:
corr = np.corrcoef(matrix)
corr.shape

(200, 200)

In [145]:
list(df_author.columns)

['@01038273527',
 '@13july',
 '@aemae-human',
 '@ahronjeon',
 '@alexkang',
 '@alicemelbourne',
 '@allstay',
 '@am327',
 '@anetmom',
 '@angiesongc9sx',
 '@ansyd',
 '@anti-essay',
 '@artinsight',
 '@bang1999',
 '@binkond',
 '@bjh4372',
 '@blue2046',
 '@boboc',
 '@bonfire',
 '@bookerbuker',
 '@bookfit',
 '@bookguru',
 '@boot0715',
 '@brunch',
 '@brunch1uhl',
 '@bzup',
 '@cathongzo',
 '@cathykimmd',
 '@chofang1',
 '@chojeremy',
 '@cli-annah',
 '@comeintothe',
 '@conbus',
 '@contigo',
 '@coolivaworld',
 '@cosmos-j',
 '@dahong',
 '@dahyun0421',
 '@dailylife',
 '@dalda',
 '@daljasee',
 '@dancingsnail',
 '@daro',
 '@ddamimovie',
 '@deckey1985',
 '@dizzo',
 '@dong02',
 '@dosa1000',
 '@doyeonsunim',
 '@dprnrn234',
 '@dryjshin',
 '@eastgo',
 '@ehahdp83',
 '@elara1020',
 '@ellieyang47uu',
 '@englishspeaking',
 '@eundang',
 '@expediakr',
 '@finance1026',
 '@flyjy724',
 '@forchoon',
 '@futurewave',
 '@glamjulie',
 '@gorrajeju',
 '@heaven',
 '@hitchwill',
 '@hjl0520',
 '@holidaymemories',
 '@honeytip

In [146]:
pivot_t

array([[0.00130935, 0.00111643, 0.00065332, ..., 0.        , 0.        ,
        0.        ],
       [0.00489485, 0.00178829, 0.00417065, ..., 0.        , 0.        ,
        0.        ],
       [0.00078989, 0.00136789, 0.0021397 , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.0001039 , 0.00022258, ..., 0.        , 0.        ,
        0.        ],
       [0.00116801, 0.00196529, 0.00349729, ..., 0.        , 0.        ,
        0.        ],
       [0.00054847, 0.00296017, 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [147]:
author = pivot_15000_fill.columns
author_list = list(df_author.columns)
coffey_hands = author_list.index("@roysday")

In [148]:
corr_coffey_hands  = corr[coffey_hands]
list(author[(corr_coffey_hands >= 0.9)])[:50]

['@forchoon',
 '@jinbread',
 '@kakao-it',
 '@mobiinside',
 '@plusx',
 '@puzzle87',
 '@roysday',
 '@thinkaboutlove',
 '@windydog']
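
The lookup above returns the target author as well and loses the correlation values. A small helper, sketched under the assumption that `corr` is the matrix from `np.corrcoef` and `authors` is a list in the same row order, ranks the matches and leaves out the target. The function name `similar_authors` and the toy data are hypothetical.

```python
import numpy as np

def similar_authors(corr, authors, target, threshold=0.9):
    """Return (author, correlation) pairs at or above `threshold`,
    most similar first, excluding the target itself."""
    idx = authors.index(target)
    order = np.argsort(-corr[idx])          # indices by descending correlation
    return [(authors[i], float(corr[idx, i]))
            for i in order
            if i != idx and corr[idx, i] >= threshold]

# Toy latent vectors for three made-up authors.
toy_authors = ['@a', '@b', '@c']
toy_matrix = np.array([[1.0, 2.0, 3.0],
                       [2.0, 4.0, 6.5],
                       [3.0, 1.0, 0.0]])
toy_corr = np.corrcoef(toy_matrix)
result = similar_authors(toy_corr, toy_authors, '@a')
```

With the notebook's objects, the call would look like `similar_authors(corr, author_list, "@roysday")`.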