# Recommender Systems

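* Recommender systems fall into two broad categories:
  * Content-based filtering
  * Collaborative filtering
* A hybrid approach that combines the two is also possible.
* Content-based filtering uses the user's previous actions and explicit feedback to recommend items similar to what the user already likes.
* Collaborative filtering uses similarities between users and between items simultaneously to make recommendations.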

## Surprise

* A library for building recommender systems
* Provides a variety of models and built-in datasets
* Its usage is similar to scikit-learn

In [36]:
!pip install surprise



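## Content-based Filtering

* Content-based filtering uses the user's previous actions and explicit feedback to recommend items similar to what the user likes.
  * e.g., comparing the attributes (such as genre) of the movies I have watched so far with those of other movies, and recommending the movies most similar to them
* Recommendations are based on similarity.
* Content-based filtering has the following advantages and disadvantages.
  * Advantages
    * Scales easily to a large number of users, since no data about other users is needed
    * Can recommend niche items that few other users are interested in
  * Disadvantages
    * Requires substantial domain knowledge, because the input features must be hand-engineered
    * Can only recommend based on the user's existing interests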
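
As a minimal sketch of content-based filtering, assume each movie is described by a hand-engineered binary genre vector. The titles, genres, and the `liked` list below are hypothetical and do not come from the MovieLens data. The user profile is the mean of the liked movies' vectors, and unseen movies are ranked by cosine similarity to that profile.

```python
import numpy as np

# Hypothetical hand-engineered features: columns = [action, comedy, romance, sci-fi]
titles = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
item_features = np.array([
    [1, 0, 0, 1],  # Movie A: action, sci-fi
    [0, 1, 1, 0],  # Movie B: comedy, romance
    [1, 0, 0, 0],  # Movie C: action
    [0, 0, 1, 0],  # Movie D: romance
    [1, 1, 0, 1],  # Movie E: action, comedy, sci-fi
], dtype=float)

liked = [0]  # indices of the movies the user liked (assumed)

# user profile = average feature vector of the liked movies
profile = item_features[liked].mean(axis=0)

# cosine similarity between the profile and every movie
norms = np.linalg.norm(item_features, axis=1) * np.linalg.norm(profile)
scores = item_features @ profile / norms

# rank the movies the user has not seen yet, most similar first
candidates = [i for i in np.argsort(-scores) if i not in liked]
ranked_titles = [titles[i] for i in candidates]
print(ranked_titles)
```

Note that every recommendation shares a genre with the liked movie, which illustrates why this approach stays within the user's existing interests.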

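## Collaborative Filtering

* Considers similarities between users and between items simultaneously to make recommendations
* Can recommend items even outside the user's existing interests
* Embeddings can be learned automatically
* Collaborative filtering has the following advantages and disadvantages.
  * Advantages
    * No domain knowledge is needed, because the embeddings are learned automatically
    * Can recommend items outside the user's existing interests
  * Disadvantages
    * Cannot create an embedding for an item that did not appear during training (the cold-start problem)
    * Hard to incorporate additional features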
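
To make "embeddings are learned automatically" concrete, here is a minimal sketch of matrix factorization trained with stochastic gradient descent. The ratings, embedding size, learning rate, and regularization strength are all hypothetical choices for illustration; this is one common way to learn such embeddings, not necessarily the method used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings as (user, item, rating) triples; unrated pairs are simply absent
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 5), (3, 0, 5), (3, 2, 1)]
n_users, n_items, dim = 4, 3, 2

# embeddings start as small random values and are learned from the ratings alone
user_emb = rng.normal(scale=0.1, size=(n_users, dim))
item_emb = rng.normal(scale=0.1, size=(n_items, dim))

def mse():
    """Mean squared error over the observed ratings."""
    return np.mean([(r - user_emb[u] @ item_emb[i]) ** 2 for u, i, r in ratings])

initial_error = mse()
lr, reg = 0.05, 0.01  # assumed learning rate and L2 regularization strength
for epoch in range(500):
    for u, i, r in ratings:
        err = r - user_emb[u] @ item_emb[i]
        u_old = user_emb[u].copy()
        # gradient step on the regularized squared error
        user_emb[u] += lr * (err * item_emb[i] - reg * user_emb[u])
        item_emb[i] += lr * (err * u_old - reg * item_emb[i])
final_error = mse()

print(initial_error, final_error)
# predicted rating for a pair that was never observed (user 0, item 2)
print(user_emb[0] @ item_emb[2])
```

No item features are supplied anywhere, yet an item with no ratings at all would keep its random initial embedding, which is the cold-start limitation listed above.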

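## Hybrid

* A combination of content-based filtering and collaborative filtering
* Many hybrid approaches exist
* In this exercise, we build a recommendation engine that learns embeddings with collaborative filtering and then makes similarity-based recommendations in the style of content-based filtering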

In [37]:
import numpy as np
from sklearn.decomposition import randomized_svd,non_negative_factorization
from surprise import Dataset

In [38]:
data=Dataset.load_builtin('ml-100k',prompt=False)
raw_data=np.array(data.raw_ratings,dtype=int)
raw_data[:,0]-=1
raw_data[:,1]-=1

In [39]:
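# the ids are zero-based, so these hold the largest index, not the count
n_users=np.max(raw_data[:,0])
n_movies=np.max(raw_data[:,1])
# the matrix needs max index + 1 rows and columns
shape=(n_users+1,n_movies+1)
shape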

(943, 1682)

In [40]:
n_users

942

In [41]:
# start from zeros so that 0 means "not rated"
adj_matrix=np.zeros(shape,dtype=int)
for user_id,movie_id,rating,time in raw_data:
  adj_matrix[user_id][movie_id]=rating
adj_matrix

array([[5, 3, 4, ..., 0, 0, 0],
       [4, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [5, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 5, 0, ..., 0, 0, 0]])
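
The loop above can also be written with NumPy fancy indexing. A minimal sketch on a hypothetical toy array with the same column layout as `raw_data` (user, movie, rating, timestamp), assuming each (user, movie) pair appears at most once:

```python
import numpy as np

# toy stand-in for raw_data: columns = user_id, movie_id, rating, timestamp (zero-based ids)
toy_raw = np.array([
    [0, 0, 5, 100],
    [0, 2, 3, 101],
    [1, 1, 4, 102],
    [2, 0, 1, 103],
])

toy_shape = (toy_raw[:, 0].max() + 1, toy_raw[:, 1].max() + 1)

# start from zeros so that 0 means "not rated", then assign all ratings at once
toy_adj = np.zeros(toy_shape, dtype=int)
toy_adj[toy_raw[:, 0], toy_raw[:, 1]] = toy_raw[:, 2]
print(toy_adj)
```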

In [42]:
import pandas as pd

In [43]:
df_adj_matrix=pd.DataFrame(adj_matrix)
df_adj_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,5,3,4,3,3,5,4,1,5,3,...,0,0,0,0,0,0,0,0,0,0
1,4,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,0,0,0,0,0,0,0,0,5,0,...,0,0,0,0,0,0,0,0,0,0
939,0,0,0,2,0,0,4,5,3,0,...,0,0,0,0,0,0,0,0,0,0
940,5,0,0,0,0,0,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0
941,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


`U` = user feature matrix, `S` = singular values, `V` = item feature matrix (one column per item)

In [44]:
U,S,V=randomized_svd(adj_matrix,n_components=2)
S=np.diag(S)



In [45]:
df_U=pd.DataFrame(U)
df_U

Unnamed: 0,0,1
0,0.065804,0.005975
1,0.014021,-0.046626
2,0.005658,-0.025618
3,0.005993,-0.020698
4,0.032747,0.009159
...,...,...
938,0.011282,-0.041178
939,0.030006,-0.008428
940,0.007445,-0.025021
941,0.024031,0.008096


In [46]:
df_V=pd.DataFrame(V)
df_V

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,0.095951,0.03518,0.019929,0.059952,0.021607,0.005367,0.085673,0.063814,0.065295,0.022075,...,0.00015,0.000358,0.000176,0.000117,0.00022,1.5e-05,4.6e-05,3e-05,0.000331,0.000317
1,-0.08724,-0.007025,-0.028618,0.01305,-0.015311,0.002251,-0.079269,0.027715,-0.042917,0.007884,...,0.000292,0.001386,-0.00077,-0.000513,-0.000215,-0.000224,-0.000672,-0.000448,0.000105,0.000203


In [47]:
print(U.shape)
print(S.shape)
print(V.shape)

(943, 2)
(2, 2)
(2, 1682)
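
To see what the three factors mean, here is a small sketch on a hypothetical 4×3 rating matrix. `randomized_svd` returns the singular values as a 1-D array and the item factors already transposed (one column per item), so the rank-k approximation is `U @ np.diag(s) @ Vt`. The import path `sklearn.utils.extmath` and `random_state=0` are choices made for this sketch, the latter only to make the run repeatable.

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

# hypothetical ratings: rows = users, columns = items
toy = np.array([
    [5, 4, 0],
    [4, 5, 1],
    [1, 0, 5],
    [0, 1, 4],
], dtype=float)

errors = []
for k in (1, 2):
    toy_U, toy_s, toy_Vt = randomized_svd(toy, n_components=k, random_state=0)
    approx = toy_U @ np.diag(toy_s) @ toy_Vt      # rank-k reconstruction
    errors.append(np.linalg.norm(toy - approx))   # Frobenius-norm error
    print(k, toy_U.shape, toy_s.shape, toy_Vt.shape)

print(errors)
```

Keeping more components lowers the reconstruction error, at the cost of larger feature vectors per user and item.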


In [48]:
df_USV=np.matmul(np.matmul(U,S),V)
df_USV=pd.DataFrame(df_USV)
df_USV

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1672,1673,1674,1675,1676,1677,1678,1679,1680,1681
0,3.917327,1.472766,0.798262,2.546465,0.888481,0.229530,3.495722,2.730697,2.689811,0.942156,...,0.006770,0.017122,0.006273,0.004182,0.008968,0.000312,0.000937,0.000625,0.014110,0.013655
1,1.857772,0.396191,0.505706,0.389534,0.368863,0.022513,1.674457,0.256811,1.076432,0.108290,...,-0.001976,-0.012612,0.010362,0.006908,0.004432,0.002694,0.008083,0.005389,0.001772,0.000527
2,0.894990,0.171578,0.251739,0.135453,0.174352,0.005336,0.807738,0.057469,0.505864,0.030567,...,-0.001283,-0.007399,0.005463,0.003642,0.002146,0.001460,0.004381,0.002921,0.000540,-0.000126
3,0.810501,0.170669,0.221544,0.164044,0.160548,0.009199,0.730645,0.104553,0.468185,0.044803,...,-0.000900,-0.005652,0.004574,0.003049,0.001935,0.001194,0.003581,0.002388,0.000738,0.000187
4,1.817325,0.722279,0.353912,1.287004,0.418963,0.117633,1.619587,1.400895,1.273581,0.480801,...,0.003811,0.010620,0.001956,0.001304,0.004137,-0.000184,-0.000552,-0.000368,0.007181,0.007103
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
938,1.573066,0.325097,0.432570,0.301751,0.310532,0.016099,1.418415,0.181812,0.904629,0.080072,...,-0.001852,-0.011391,0.009028,0.006018,0.003759,0.002369,0.007106,0.004738,0.001332,0.000242
939,2.024478,0.690753,0.442143,1.125537,0.446946,0.098519,1.810464,1.169505,1.343719,0.408090,...,0.002291,0.004022,0.004962,0.003308,0.004676,0.000754,0.002263,0.001509,0.006147,0.005672
940,0.992052,0.210815,0.270363,0.205977,0.196843,0.011807,0.894203,0.134556,0.574321,0.056985,...,-0.001068,-0.006786,0.005552,0.003701,0.002367,0.001445,0.004335,0.002890,0.000934,0.000267
941,1.304254,0.527670,0.250080,0.948844,0.302297,0.087081,1.161830,1.037358,0.920154,0.355483,...,0.002894,0.008260,0.001176,0.000784,0.002964,-0.000210,-0.000631,-0.000421,0.005305,0.005281


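* User-based recommendation
* Recommends items based on the behavior of other users whose tastes are similar to mine
* Uses the similarity between user feature vectors

* Recommendations use cosine similarity

\begin{equation}
\cos \theta = \frac{A \cdot B}{\|A\| \, \|B\|}
\end{equation}
* Computes the cosine of the angle between the two vectors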

In [50]:
def compute_cos_similarity(v1,v2):
  norm1=np.sqrt(np.sum(np.square(v1)))
  norm2=np.sqrt(np.sum(np.square(v2)))
  dot=np.dot(v1,v2)
  return dot/(norm1*norm2)
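
The same formula can be applied to all rows at once. A minimal sketch, where `cosine_similarity_matrix` is a hypothetical helper (not part of this notebook) and the small `eps` guard against zero-length vectors is an added assumption:

```python
import numpy as np

def cosine_similarity_matrix(X, eps=1e-12):
    """Return the cosine similarity between every pair of rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.maximum(norms, eps)  # scale each row to unit length
    return unit @ unit.T

vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0
    [0.0, 3.0],   # orthogonal to row 0
    [-1.0, 0.0],  # opposite to row 0
])
sim = cosine_similarity_matrix(vectors)
print(np.round(sim[0], 3))
```

The first row of `sim` shows the three characteristic values: 1 for the same direction, 0 for orthogonal vectors, and -1 for opposite directions, regardless of vector length.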

In [51]:
my_id,my_vector=0,U[0]
best_match,best_match_id,best_match_vector=-1,-1,[]

# find the user whose feature vector is most similar to mine (skipping myself)
for user_id,user_vector in enumerate(U):
  if my_id !=user_id:
    cos_similarity=compute_cos_similarity(my_vector,user_vector)
    if cos_similarity > best_match:
      best_match=cos_similarity
      best_match_id=user_id
      best_match_vector=user_vector

  
print('best Match: {}, Best Match ID : {}'.format(best_match,best_match_id))

best Match: 0.9999942295956321, Best Match ID : 235


In [54]:
adj_matrix[my_id]

array([5, 3, 4, ..., 0, 0, 0])

In [55]:
adj_matrix[best_match_id]

array([0, 0, 0, ..., 0, 0, 0])

We can now recommend the movies that I have not rated but that the most similar user has rated.

In [56]:
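recommend_list=[]
# log1: my rating, log2: the best match's rating (0 means not rated)
for i, log in enumerate(zip(adj_matrix[my_id],adj_matrix[best_match_id])):
  log1,log2=log
  # keep the movies I have not rated but the best match has
  if log1<1. and log2 > 0.:
    recommend_list.append(i)
print(recommend_list)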

[272, 273, 274, 281, 285, 288, 293, 297, 303, 306, 312, 317, 327, 332, 369, 410, 418, 419, 422, 426, 428, 431, 434, 442, 461, 475, 477, 482, 495, 503, 504, 505, 506, 509, 519, 520, 522, 525, 531, 545, 548, 590, 594, 595, 613, 631, 654, 658, 660, 672, 684, 685, 691, 695, 698, 704, 716, 728, 734, 749, 755, 863, 865, 933, 1012, 1038, 1101, 1327, 1400]
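
Relying on a single best match makes the list sensitive to that one user, and the list itself is unordered. A common variation, sketched here on hypothetical toy data rather than the notebook's matrices, takes the k most similar users and scores each unseen item by the similarity-weighted sum of their ratings.

```python
import numpy as np

# toy stand-ins: rows = users, columns = items (0 means not rated)
toy_ratings = np.array([
    [5, 4, 0, 0],  # target user
    [5, 5, 4, 0],
    [4, 5, 5, 1],
    [0, 1, 0, 5],
], dtype=float)
# hypothetical user feature vectors (for example, rows of U from an SVD)
user_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0]])

me, k = 0, 2
unit = user_vecs / np.linalg.norm(user_vecs, axis=1, keepdims=True)
sims = unit @ unit[me]              # cosine similarity to the target user
sims[me] = -np.inf                  # never pick myself
neighbors = np.argsort(-sims)[:k]   # k most similar users

# similarity-weighted sum of the neighbors' ratings
scores = sims[neighbors] @ toy_ratings[neighbors]
scores[toy_ratings[me] > 0] = -np.inf  # drop items I already rated
top_items = [int(i) for i in np.argsort(-scores) if np.isfinite(scores[i])]
print(neighbors, top_items)
```

Sorting by score puts the item that both neighbors rated highly ahead of the one that only a single neighbor rated poorly.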


* Item-based recommendation
* Recommends items similar to the items I have seen
* Uses the similarity between item feature vectors

In [57]:
my_id,my_vector=0,V.T[0]
best_match,best_match_id,best_match_vector=-1,-1,[]

# each row of V.T is the feature vector of one item
for item_id,item_vector in enumerate(V.T):
  if my_id !=item_id:
    cos_similarity=compute_cos_similarity(my_vector,item_vector)
    if cos_similarity > best_match:
      best_match=cos_similarity
      best_match_id=item_id
      best_match_vector=item_vector

  
print('best Match: {}, Best Match ID : {}'.format(best_match,best_match_id))

best Match: 0.9999999951364141, Best Match ID : 1287
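
Item similarities can also be turned into a per-user score. A minimal sketch under assumed toy data (the item factors and `my_ratings` are hypothetical): each unseen item is scored by its cosine similarity to the items the user has rated, weighted by those ratings.

```python
import numpy as np

# hypothetical item feature vectors, one column per item (same layout as V)
item_factors = np.array([
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.2, 0.9, 0.8],
])
my_ratings = np.array([5, 0, 0, 1], dtype=float)  # 0 means not rated

# cosine similarity between every pair of items
unit = item_factors / np.linalg.norm(item_factors, axis=0, keepdims=True)
item_sim = unit.T @ unit

# score each item by its similarity to the items I rated, weighted by my rating
item_scores = item_sim @ my_ratings
item_scores[my_ratings > 0] = -np.inf  # only recommend unseen items
best_item = int(np.argmax(item_scores))
print(best_item)
```

Item 1 wins because it is close to item 0, which received the high rating, while item 2 is close only to the poorly rated item 3.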


In [58]:
recommend_list=[]
# collect the users who have rated item my_id
for i in range(len(adj_matrix)):
  if adj_matrix[i][my_id]>0.9:
    recommend_list.append(i)
print(recommend_list)

[0, 1, 4, 5, 9, 12, 14, 15, 16, 17, 19, 20, 22, 24, 25, 37, 40, 41, 42, 43, 44, 48, 53, 55, 56, 57, 58, 61, 62, 63, 64, 65, 66, 69, 71, 72, 74, 76, 78, 80, 81, 82, 83, 88, 91, 92, 93, 94, 95, 96, 98, 100, 101, 105, 107, 108, 116, 119, 120, 123, 124, 127, 129, 130, 133, 136, 137, 140, 143, 144, 147, 149, 150, 156, 157, 159, 161, 167, 173, 176, 177, 180, 181, 183, 188, 192, 193, 197, 198, 199, 200, 201, 202, 203, 208, 209, 212, 215, 221, 222, 229, 230, 231, 233, 234, 241, 242, 243, 245, 246, 247, 248, 249, 250, 251, 252, 253, 255, 261, 262, 264, 267, 270, 273, 274, 275, 276, 278, 279, 285, 286, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 300, 302, 304, 306, 307, 310, 311, 312, 313, 319, 321, 323, 324, 325, 326, 329, 330, 331, 335, 337, 338, 339, 342, 343, 344, 346, 347, 349, 356, 358, 359, 362, 364, 370, 373, 377, 378, 379, 380, 386, 387, 388, 389, 392, 393, 394, 395, 397, 398, 400, 401, 402, 405, 406, 410, 411, 415, 416, 418, 421, 423, 424, 428, 431, 433, 434, 437, 440, 444, 