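## Content-based filtering
- Uses the content or attributes of the items themselves.
- Applies a notion of similarity between item contents to find items similar to a given item.
- Cosine similarity is the similarity measure used here.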

### Cosine similarity
<h2>$ \cos(\theta) = \frac{\sum_{i=1}^{n}A_{i}B_{i}}{\sqrt{\sum_{i=1}^{n}A_{i}^{2}}\,\sqrt{\sum_{i=1}^{n}B_{i}^{2}}}$ </h2>
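
The formula can be checked on a pair of small vectors. The sketch below is only an illustration: the function name `cosine_similarity` and the two rating vectors are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, since the formula is undefined in that case.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Undefined for an all-zero vector; 0.0 is an assumption made for this sketch
        return 0.0
    return float(np.dot(a, b) / denom)

# Hypothetical rating vectors of two users over the same four items
u = [4.0, 0.0, 4.0, 5.0]
v = [4.0, 0.0, 0.0, 5.0]
print(cosine_similarity(u, v))
```

Vectors pointing in the same direction give 1, orthogonal vectors give 0, so for non-negative rating vectors the score always falls between 0 and 1.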

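## Collaborative filtering
- A wisdom-of-the-crowd approach.
- To predict how much a particular user would like an item they have neither rated nor reviewed, it draws on the preferences that many other users have expressed for that item.
- It also operates on a notion of similarity.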
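
One common way to turn this idea into a number is a similarity-weighted average of the ratings other users gave the item. The sketch below assumes that scheme; it is not taken from the notebook. The function name `predict_rating`, the toy rating matrix, and the similarity values are all hypothetical, and 0 is assumed to mean "not rated".

```python
import numpy as np

def predict_rating(ratings, sims, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    ratings: (users x items) array, 0 meaning "not rated" (assumption)
    sims:    (users x users) non-negative similarity matrix
    """
    ratings = np.asarray(ratings, dtype=float)
    sims = np.asarray(sims, dtype=float)
    rated = ratings[:, item] > 0        # users who rated the item
    rated[user] = False                 # never use the target user's own rating
    weights = sims[user, rated]
    if weights.sum() == 0:
        return float("nan")             # nobody similar has rated the item
    return float(np.dot(weights, ratings[rated, item]) / weights.sum())

# Hypothetical data: 3 users x 3 items, user 0 has not rated item 1
R = [[5, 0, 3],
     [4, 2, 3],
     [1, 5, 0]]
S = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.2],
     [0.1, 0.2, 0.0]]
print(predict_rating(R, S, user=0, item=1))
```

Because user 1 is far more similar to user 0 than user 2 is, the prediction lands close to user 1's rating of 2 rather than user 2's rating of 5.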

### Advantages of collaborative filtering over content-based filtering
- There is no need to understand item content.
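- A new item needs no descriptive metadata: it can be scored as soon as some users have rated it. (An item with no ratings at all still cannot be scored, so the item cold-start problem is reduced rather than eliminated.)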
- It picks up changes in customers' preferences over time.
- It captures subtle, inherent characteristics.

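### Matrix factorization by alternating least squares for collaborative filtering
- Alternating least squares (ALS): an optimization method for solving the matrix factorization problem.
- Matrix factorization explains the observed data in terms of a relatively small number of unobserved (latent) factors that underlie it.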

<h2>  $   A_{(m \times n)} \approx X_{(m \times k)} \times Y_{(k \times n)}       $  </h2>

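-  Update X using Y, the learning rate ($\lambda$), and the original sparse matrix ($A$).
-  Update Y using X, the learning rate ($\lambda$), and the original sparse matrix ($A$).
-  The learning rate ($\lambda$) is described as controlling the speed of convergence; in the common regularized ALS formulation, $\lambda$ enters each update as a regularization term that keeps the factor values small.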

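- Derivation of the update for X (Y is $k \times n$ and not square, so $Y^{-1}$ below stands informally for a right pseudo-inverse):

$ A \approx X \times Y $<br>
$ X \approx A \times Y^{-1} $<br>
$ X \approx A \times (Y^{T} \times {(Y^{T})}^{-1}) \times Y^{-1} $<br>
$ X \approx A \times Y^{T} \times {(Y Y^{T})}^{-1} $<br>
$ \text{error minimization} = \min \lVert X - A \times Y^{T} \times {(Y Y^{T})}^{-1} \rVert $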
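
The alternating updates can be sketched in NumPy. This is a minimal illustration, not the notebook's implementation: the function name `als`, its parameters, and the toy matrix are hypothetical, every entry of $A$ (zeros included) is treated as observed, and $\lambda$ is added to the diagonal as a ridge (regularization) term, which is one common way of using it.

```python
import numpy as np

def als(A, k=2, lam=0.1, n_iter=10, seed=0):
    """Factor A (m x n) into X (m x k) and Y (k x n) by alternating least squares."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X = rng.random((m, k))
    Y = rng.random((k, n))
    I = np.eye(k)
    for _ in range(n_iter):
        # Fix Y, solve for X:  X = A Y^T (Y Y^T + lam I)^-1
        X = np.linalg.solve(Y @ Y.T + lam * I, Y @ A.T).T
        # Fix X, solve for Y:  Y = (X^T X + lam I)^-1 X^T A
        Y = np.linalg.solve(X.T @ X + lam * I, X.T @ A)
    return X, Y

# Hypothetical 4 x 3 rating matrix
A = np.array([[5.0, 3.0, 1.0],
              [4.0, 3.0, 1.0],
              [1.0, 1.0, 5.0],
              [1.0, 2.0, 4.0]])
X, Y = als(A, k=2, lam=0.01, n_iter=10)
print(np.round(X @ Y, 2))   # approximate reconstruction of A
```

With $\lambda = 0$ the first update reduces to $X = A Y^{T} (Y Y^{T})^{-1}$, the expression derived above. A production recommender would usually restrict each least-squares solve to the observed ratings only.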

## Evaluating the recommendation engine model

$ \text{MSE (mean squared error)} = \frac{\text{sum of squared errors}}{\text{total number of observations}} $<br>
$ \text{RMSE (root mean squared error)} = \sqrt{\text{MSE}} $
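
As a small worked example, the two formulas can be written as one function. The name `rmse` and the optional `mask` argument are hypothetical; the mask is included on the assumption that only observed ratings should count toward the error, which is the usual practice for sparse rating matrices.

```python
import numpy as np

def rmse(actual, predicted, mask=None):
    """Root mean squared error, optionally restricted to entries where mask is True."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if mask is None:
        mask = np.ones(actual.shape, dtype=bool)   # use every entry
    mask = np.asarray(mask, dtype=bool)
    squared_errors = (actual[mask] - predicted[mask]) ** 2
    return float(np.sqrt(squared_errors.mean()))   # sqrt(sum of squares / count)

# Hypothetical ratings (0 = not rated) and predictions
actual = np.array([[4.0, 0.0], [0.0, 2.0]])
predicted = np.array([[3.0, 5.0], [5.0, 2.0]])
print(rmse(actual, predicted, mask=actual > 0))    # scores only the two observed ratings
```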

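### Choosing recommendation engine hyperparameters with grid search
- Number of iterations: ALS is reported to converge within about 10 iterations in most cases.
- Number of latent factors
- Learning rate ($\lambda$)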
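
A grid search over these three hyperparameters can be sketched as follows. Everything here is illustrative: the helper names (`als`, `rmse_observed`, `grid_search`), the candidate values, and the toy matrix are hypothetical, and $\lambda$ is used as a ridge term. For brevity the score is computed on the same ratings used for fitting; in practice a held-out set would be the better basis for comparison.

```python
import itertools
import numpy as np

def als(A, k, lam, n_iter, seed=0):
    """Plain ALS treating every entry of A as observed (simplifying assumption)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    X, Y, I = rng.random((m, k)), rng.random((k, n)), np.eye(k)
    for _ in range(n_iter):
        X = np.linalg.solve(Y @ Y.T + lam * I, Y @ A.T).T
        Y = np.linalg.solve(X.T @ X + lam * I, X.T @ A)
    return X, Y

def rmse_observed(A, A_hat):
    """RMSE over the non-zero (rated) entries of A."""
    mask = A > 0
    return float(np.sqrt(np.mean((A[mask] - A_hat[mask]) ** 2)))

def grid_search(A, iters, factors, lams):
    """Try every combination and return (best RMSE, best parameters)."""
    best = None
    for n_iter, k, lam in itertools.product(iters, factors, lams):
        X, Y = als(A, k=k, lam=lam, n_iter=n_iter)
        score = rmse_observed(A, X @ Y)
        if best is None or score < best[0]:
            best = (score, {"n_iter": n_iter, "k": k, "lam": lam})
    return best

# Hypothetical 5 x 4 rating matrix (0 = not rated)
A = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0],
              [1.0, 0.0, 0.0, 4.0],
              [0.0, 1.0, 5.0, 4.0]])
print(grid_search(A, iters=[5, 10], factors=[1, 2], lams=[0.01, 0.1]))
```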

### Recommendation engine applied to the MovieLens data

In [1]:
# Load the data - ratings
import pandas as pd
ratings = pd.read_csv('./Data/ml-latest-small/ratings.csv')

In [2]:
# Check the columns and data types
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [4]:
# Check the shape of the data
ratings.shape

(100836, 4)

In [6]:
# Inspect the data - head()
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [7]:
# Load the data - movies
movies = pd.read_csv('./Data/ml-latest-small/movies.csv')

In [8]:
# Check the columns and data types
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [9]:
# Check the shape of the data
movies.shape

(9742, 3)

In [10]:
# Inspect the data - head()
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


- The `ratings` and `movies` DataFrames share the `movieId` column.
- Use it as the key to merge the two DataFrames.

In [18]:
# Merge ratings and movies on movieId (timestamp and genres are not needed here)
ratings = pd.merge(
    ratings.drop('timestamp', axis = 1),
    movies.drop('genres', axis = 1),
    how = 'left',
    on = 'movieId'
)

In [19]:
# Inspect the merged DataFrame
ratings.head()

| | userId | movieId | rating | title |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | Toy Story (1995) |
| 1 | 1 | 3 | 4.0 | Grumpier Old Men (1995) |
| 2 | 1 | 6 | 4.0 | Heat (1995) |
| 3 | 1 | 47 | 5.0 | Seven (a.k.a. Se7en) (1995) |
| 4 | 1 | 50 | 5.0 | Usual Suspects, The (1995) |


In [20]:
# Check the columns and data types of the DataFrame
ratings.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100836 non-null  int64  
 1   movieId  100836 non-null  int64  
 2   rating   100836 non-null  float64
 3   title    100836 non-null  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 3.8+ MB


In [22]:
# Use a pivot table to build a user-by-movie rating DataFrame
rp = ratings.pivot_table(columns = ['movieId'], index = ['userId'], values = 'rating')

In [23]:
# Inspect the data
rp

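| userId / movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | NaN | 4.0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | NaN | NaN | NaN | NaN | NaN | 2.5 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 607 | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 608 | 2.5 | 2.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 609 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |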


In [24]:
# NaN => the user did not rate that movie => replace with 0
rp = rp.fillna(0)

In [25]:
# Inspect the DataFrame
rp

| userId / movieId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | 0.0 | 4.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.5 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 607 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 608 | 2.5 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 609 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [44]:
# Convert the DataFrame to a NumPy array => faster computation
rp_array = rp.to_numpy()

In [45]:
# Check the shape of the array
rp_array.shape

(610, 9724)

#### User-user similarity matrix

In [46]:
# Compute cosine similarity
from scipy.spatial.distance import cosine
import numpy as np
m, n = rp_array.shape
array_users = np.zeros((m,m))

In [47]:
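for i in range(m):
    for j in range(m):
        if i != j:
            # scipy's cosine() returns a distance, so 1 - distance is the similarity
            array_users[i][j] = 1 - cosine(rp_array[i, :], rp_array[j, :])
        else:
            # self-similarity is set to 0 (not 1), which keeps a user out of their own top-N list
            array_users[i][j] = 0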
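
The double loop makes one `cosine` call per user pair, which is slow for 610 users. A vectorized alternative is sketched below as an assumption about how the same matrix could be produced with plain NumPy; the function name `cosine_similarity_matrix` and the toy input are hypothetical. Unlike `scipy.spatial.distance.cosine`, this version assigns similarity 0 to an all-zero row instead of producing NaN.

```python
import numpy as np

def cosine_similarity_matrix(R):
    """Pairwise cosine similarity between the rows of R, with a zero diagonal."""
    R = np.asarray(R, dtype=float)
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0        # avoid division by zero for all-zero rows
    unit = R / norms               # each row scaled to unit length
    S = unit @ unit.T              # dot products of unit vectors = cosines
    np.fill_diagonal(S, 0.0)       # same convention as the loop: no self-similarity
    return S

# Hypothetical 3 users x 3 movies rating array (0 = not rated)
toy = np.array([[4.0, 0.0, 4.0],
                [4.0, 0.0, 0.0],
                [0.0, 3.0, 0.0]])
print(cosine_similarity_matrix(toy))
```

Calling `cosine_similarity_matrix(rp_array)` should give the same values as `array_users` up to floating-point rounding.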

In [48]:
# Build a DataFrame indexed by userId on both axes
pd_users = pd.DataFrame(
    data = array_users,
    index = rp.index,
    columns = rp.index
)

In [50]:
# Inspect the DataFrame
pd_users

| userId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.000000 | 0.027283 | 0.059720 | 0.194395 | 0.129080 | 0.128152 | 0.158744 | 0.136968 | 0.064263 | 0.016875 | ... | 0.080554 | 0.164455 | 0.221486 | 0.070669 | 0.153625 | 0.164191 | 0.269389 | 0.291097 | 0.093572 | 0.145321 |
| 2 | 0.027283 | 0.000000 | 0.000000 | 0.003726 | 0.016614 | 0.025333 | 0.027585 | 0.027257 | 0.000000 | 0.067445 | ... | 0.202671 | 0.016866 | 0.011997 | 0.000000 | 0.000000 | 0.028429 | 0.012948 | 0.046211 | 0.027565 | 0.102427 |
| 3 | 0.059720 | 0.000000 | 0.000000 | 0.002251 | 0.005020 | 0.003936 | 0.000000 | 0.004941 | 0.000000 | 0.000000 | ... | 0.005048 | 0.004892 | 0.024992 | 0.000000 | 0.010694 | 0.012993 | 0.019247 | 0.021128 | 0.000000 | 0.032119 |
| 4 | 0.194395 | 0.003726 | 0.002251 | 0.000000 | 0.128659 | 0.088491 | 0.115120 | 0.062969 | 0.011361 | 0.031163 | ... | 0.085938 | 0.128273 | 0.307973 | 0.052985 | 0.084584 | 0.200395 | 0.131746 | 0.149858 | 0.032198 | 0.107683 |
| 5 | 0.129080 | 0.016614 | 0.005020 | 0.128659 | 0.000000 | 0.300349 | 0.108342 | 0.429075 | 0.000000 | 0.030611 | ... | 0.068048 | 0.418747 | 0.110148 | 0.258773 | 0.148758 | 0.106435 | 0.152866 | 0.135535 | 0.261232 | 0.060792 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 0.164191 | 0.028429 | 0.012993 | 0.200395 | 0.106435 | 0.102123 | 0.200035 | 0.099388 | 0.075898 | 0.088963 | ... | 0.178084 | 0.116534 | 0.300669 | 0.066032 | 0.148141 | 0.000000 | 0.153063 | 0.262558 | 0.069622 | 0.201104 |
| 607 | 0.269389 | 0.012948 | 0.019247 | 0.131746 | 0.152866 | 0.162182 | 0.186114 | 0.185142 | 0.011844 | 0.010451 | ... | 0.092525 | 0.199910 | 0.203540 | 0.137834 | 0.118780 | 0.153063 | 0.000000 | 0.283081 | 0.149190 | 0.139114 |
| 608 | 0.291097 | 0.046211 | 0.021128 | 0.149858 | 0.135535 | 0.178809 | 0.323541 | 0.187233 | 0.100435 | 0.077424 | ... | 0.158355 | 0.197514 | 0.232771 | 0.155306 | 0.178142 | 0.262558 | 0.283081 | 0.000000 | 0.121993 | 0.322055 |
| 609 | 0.093572 | 0.027565 | 0.000000 | 0.032198 | 0.261232 | 0.214234 | 0.090840 | 0.423993 | 0.000000 | 0.021766 | ... | 0.035653 | 0.335231 | 0.061941 | 0.236601 | 0.097610 | 0.069622 | 0.149190 | 0.121993 | 0.000000 | 0.053225 |


In [51]:
# Given a user ID and the number of similar users to show,
# return the most similar users ranked by cosine similarity score
def topn_simusers(uid = 16, n = 5):
    # similarity of `uid` to every other user, highest first
    users = pd_users.loc[uid, :].sort_values(ascending = False)
    topn_users = users.iloc[:n]
    topn_users = topn_users.rename('score')
    return pd.DataFrame(topn_users)

In [53]:
topn_simusers(uid = 17, n = 10)

| userId | score |
|---|---|
| 16 | 0.456096 |
| 400 | 0.452319 |
| 434 | 0.452304 |
| 247 | 0.438913 |
| 399 | 0.414196 |
| 362 | 0.40518 |
| 549 | 0.404689 |
| 131 | 0.403409 |
| 72 | 0.392264 |
| 464 | 0.386456 |


In [60]:
# Movies the given user rated most highly
def topn_movieratings(uid = 355, n_ratings = 10):
    # keep only this user's ratings, then sort from highest to lowest
    uid_ratings = ratings.loc[ratings['userId'] == uid]
    uid_ratings = uid_ratings.sort_values(by = 'rating', ascending = False)
    return uid_ratings.iloc[:n_ratings]

In [61]:
topn_movieratings(uid = 596, n_ratings = 10)

| | userId | movieId | rating | title |
|---|---|---|---|---|
| 91864 | 596 | 3000 | 5.0 | Princess Mononoke (Mononoke-hime) (1997) |
| 91974 | 596 | 33649 | 5.0 | Saving Face (2004) |
| 92107 | 596 | 122906 | 5.0 | Black Panther (2017) |
| 91905 | 596 | 4878 | 5.0 | Donnie Darko (2001) |
| 92109 | 596 | 122916 | 5.0 | Thor: Ragnarok (2017) |
| 91928 | 596 | 5971 | 5.0 | My Neighbor Totoro (Tonari no Totoro) (1988) |
| 92084 | 596 | 110102 | 5.0 | Captain America: The Winter Soldier (2014) |
| 91837 | 596 | 2288 | 5.0 | Thing, The (1982) |
| 91969 | 596 | 31658 | 5.0 | Howl's Moving Castle (Hauru no ugoku shiro) (2... |
| 92100 | 596 | 122882 | 5.0 | Mad Max: Fury Road (2015) |
