# Collaborative Filtering

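- A content-based recommender only recommends movies that are close to one another and cannot take the user's taste into account
- It keeps recommending movies of a similar genre<br>
<br>
- Collaborative filtering assumes that a user will like the movies that users with similar taste liked
- It recommends based on how users rated items in the past, not on the characteristics of the items
- For example, when users A and B have rated movies similarly, a movie that A rated highly is recommended to B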

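Advantages
- No information about the items is required

Disadvantages
- Movies without user ratings cannot be recommended, and recommending to newly registered users is especially hard (Sparsity problem; for new users this is also called the cold-start problem)
- Likewise, newly released movies are hard to recommend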

Contents<br>
1. Similarity-Based Recommendation
2. SVD
3. Funk SVD
4. Applying SVD to Data: MovieLens data
5. Applying Funk SVD to Data: MovieLens data
6. Other Approaches

# 1. Similarity-Based Recommendation

In [1]:
import math
import operator

#Building Custom Data for Movie Rating
review = {
'Marlon Brando': {
'The Godfather': 5.00, 
'The Godfather Part II': 4.29,
'Apocalypse Now': 5.00, 
'Jaws': 1.
},
'Stephen King': {
'The Shawshank Redemption': 4.89, 
'The Shining': 4.93 , 
'The Green Mile': 4.87,
'The Godfather': 1.33,
},
'Steven Spielberg': {
'Raiders of the Lost Ark': 5.0, 
'Jaws': 4.89,
'Saving Private Ryan': 4.78, 
'Star Wars Episode IV - A New Hope': 4.33,
'Close Encounters of the Third Kind': 4.77,
'The Godfather':  1.25,
'The Godfather Part II': 1.72
},
'George Lucas':{
'Star Wars Episode IV - A New Hope': 5.00	
},
'Al Pacino': {
'The Godfather': 4.02, 
'The Godfather Part II': 5.00,
},
'Robert DeNiro': {
'The Godfather': 3.07, 
'The Godfather Part II': 4.29, 
'Raging Bull': 5.00, 
'Goodfellas':  4.89
},
'Robert Duvall': {
'The Godfather': 3.80, 
'The Godfather Part II': 3.61,
'Apocalypse Now': 4.26 
},
'Jack Nicholson': {
'The Shining': 5.0,
'One Flew Over The Cuckoos Nest': 5.0,
'The Godfather': 2.22,
'The Godfather Part II': 3.34
},
'Morgan Freeman': {
'The Shawshank Redemption': 4.98,
'The Shining': 4.42,
'Apocalypse Now': 1.63,
'The Godfather': 1.12,
'The Godfather Part II': 2.16
},
'Harrison Ford': {
'Raiders of the Lost Ark': 5.0, 
'Star Wars Episode IV - A New Hope': 4.84,
},
'Tom Hanks': {
'Saving Private Ryan': 3.78, 
'The Green Mile': 4.96,
'The Godfather': 1.04,
'The Godfather Part II': 1.03
},
'Francis Ford Coppola': {
'The Godfather': 5.00, 
'The Godfather Part II': 5.0, 
'Jaws': 1.24,
'One Flew Over The Cuckoos Nest': 2.02
},
'Martin Scorsese': {
'Raging Bull': 5.0, 
'Goodfellas': 4.87,
'Close Encounters of the Third Kind': 1.14,
'The Godfather': 4.00
},
'Diane Keaton': {
'The Godfather': 2.98,
'The Godfather Part II': 3.93,
'Close Encounters of the Third Kind': 1.37
},
'Richard Dreyfuss': {
'Jaws': 5.0, 
'Close Encounters of the Third Kind': 5.0,
'The Godfather': 1.07,
'The Godfather Part II': 0.63
},
'Joe Pesci': {
'Raging Bull': 4.89, 
'Goodfellas': 5.0,
'The Godfather': 4.87,
'Star Wars Episode IV - A New Hope': 1.32
}
}

- Movie rating data from 16 users
- Each user rated the movies they had seen on a scale of 0 to 5

In [2]:
# Function to get common movies b/w Users
def get_common_movies(criticA,criticB):
    return [movie for movie in review[criticA] if movie in review[criticB]]
# Function to get reviews from the common movies
def get_reviews(criticA,criticB):
    common_movies = get_common_movies(criticA,criticB)
    return [(review[criticA][movie], review[criticB][movie]) for movie in common_movies]

In [3]:
print (get_common_movies('Marlon Brando','Robert DeNiro'))
print (get_reviews('Marlon Brando','Robert DeNiro') )
print (get_common_movies('Steven Spielberg','Tom Hanks'))
print (get_reviews('Steven Spielberg','Tom Hanks'))
print (get_common_movies('Martin Scorsese','Joe Pesci'))
print (get_reviews('Martin Scorsese','Joe Pesci'))

['The Godfather', 'The Godfather Part II']
[(5.0, 3.07), (4.29, 4.29)]
['Saving Private Ryan', 'The Godfather', 'The Godfather Part II']
[(4.78, 3.78), (1.25, 1.04), (1.72, 1.03)]
['Raging Bull', 'Goodfellas', 'The Godfather']
[(5.0, 4.89), (4.87, 5.0), (4.0, 4.87)]


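- Marlon Brando and Robert DeNiro have two movies in common: The Godfather and The Godfather Part II
- Marlon Brando rated The Godfather higher (5.00) than Part II (4.29)
- Robert DeNiro rated Part II higher (4.29) than The Godfather (3.07)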

In [4]:
# Function to get Euclidean Distance b/w 2 points 
def euclidean_distance(points):
    squared_diffs = [(point[0] - point[1]) ** 2 for point in points]
    summed_squared_diffs = sum(squared_diffs)
    distance = math.sqrt(summed_squared_diffs)
    return distance

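Similarity: $\frac{1}{1+d(x,y)}$
- The reciprocal is taken so that closer ratings (a smaller Euclidean distance) give a higher similarity
- 1 is added to the denominator because it would otherwise be 0 when the distance is 0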

In [5]:
# Convert a distance into a similarity: the smaller the distance, the higher the similarity
# 1 is added to the denominator to avoid division by zero when the distance is 0
def similarity(reviews):
    return 1/ (1 + euclidean_distance(reviews))

In [6]:
# Function to get similarity b/w 2 users
def get_critic_similarity(criticA, criticB):
    reviews = get_reviews(criticA,criticB)
    return similarity(reviews)

In [7]:
print(get_critic_similarity('Marlon Brando','Robert DeNiro') )
print(get_critic_similarity('Steven Spielberg','Tom Hanks') )
print(get_critic_similarity('Martin Scorsese','Joe Pesci'))

0.341296928327645
0.4478352722730117
0.5300793497254199


Martin Scorsese and Joe Pesci have more similar tastes than Marlon Brando and Robert DeNiro

In [8]:
# Function to give recommendation to users based on their reviews.
def recommend_movies(critic, num_suggestions):
    similarity_scores = [(get_critic_similarity(critic, other), other) for other in review if other != critic]
    # Get similarity Scores for all the critics
    similarity_scores.sort() 
    similarity_scores.reverse()
    similarity_scores = similarity_scores[0:num_suggestions]

    recommendations = {}
    # Dictionary to store recommendations
    for similarity, other in similarity_scores:
        reviewed = review[other]
        # Storing the review
        for movie in reviewed:
            if movie not in review[critic]:
                weight = similarity * reviewed[movie]
                # Weighing similarity with review
                if movie in recommendations:
                    sim, weights = recommendations[movie]
                    recommendations[movie] = (sim + similarity, weights + [weight])
                    # Similarity of movie along with weight
                else:
                    recommendations[movie] = (similarity, [weight])
                    

    for recommendation in recommendations:
        similarity, movie = recommendations[recommendation]
        recommendations[recommendation] = sum(movie) / similarity
        # Normalizing weights with similarity

    sorted_recommendations = sorted(recommendations.items(), key=operator.itemgetter(1), reverse=True)
    #Sorting recommendations with weight
    return sorted_recommendations

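- The similarity between users is computed from their ratings of movies both have seen, and the K users most similar to the target user are selected
- Among the movies those K users rated, the ones the target user has not seen become candidates
- Each candidate's recommendation score is the similarity-weighted average of the ratings given by those users
- Candidates are sorted by score in descending order
<br>
- For example, when recommending to Marlon Brando with K=4, the four most similar users are George Lucas, Harrison Ford, Joe Pesci, and Francis Ford Coppola
- George Lucas and Harrison Ford each have similarity 1 and Joe Pesci has 0.885; they rated Star Wars 5.00, 4.84, and 1.32, so the weighted average $(1 \cdot 5.00 + 1 \cdot 4.84 + 0.885 \cdot 1.32)/(1+1+0.885)$ gives Star Wars a recommendation score of 3.82<br>
<br>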
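
The weighted average for a single movie can be checked by hand. The sketch below uses the neighbours of Marlon Brando (K=4) who rated Star Wars; `weighted_score` is a helper name introduced here for illustration and is not part of the notebook's code.

```python
# Minimal sketch: similarity-weighted average of neighbour ratings for one movie.
# `weighted_score` is a hypothetical helper, not part of the notebook's code.
def weighted_score(neighbours):
    """neighbours: list of (similarity, rating) pairs for one movie."""
    total_similarity = sum(sim for sim, _ in neighbours)
    weighted_sum = sum(sim * rating for sim, rating in neighbours)
    return weighted_sum / total_similarity

# Neighbours of Marlon Brando who rated 'Star Wars Episode IV - A New Hope':
# George Lucas and Harrison Ford (no common movies, so similarity 1 each) and
# Joe Pesci (one common movie, The Godfather: 1 / (1 + |5.00 - 4.87|)).
star_wars = [
    (1.0, 5.00),
    (1.0, 4.84),
    (1 / (1 + 0.13), 1.32),
]
score = weighted_score(star_wars)
print(round(score, 4))  # 3.8157
```

The denominator is the sum of the similarities, not the number of neighbours, so a neighbour with a low similarity contributes little to the score.
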
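Problems with the code
- Marlon Brando and George Lucas have no movies in common, so no distance can be computed between them
- However, the distance over an empty list is 0, so the similarity becomes 1/(1+0) = 1 and George Lucas is wrongly treated as the most similar user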
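
One possible way around this problem is to treat pairs of users with no shared movies as having zero similarity. The sketch below is an assumption about how that could look, not the notebook's method; `toy_review` is a small subset of the `review` dictionary and `safe_critic_similarity` is a hypothetical name.

```python
import math

# Toy ratings: a small subset of the `review` dictionary above.
toy_review = {
    'Marlon Brando': {'The Godfather': 5.00, 'Jaws': 1.00},
    'George Lucas': {'Star Wars Episode IV - A New Hope': 5.00},
    'Joe Pesci': {'The Godfather': 4.87, 'Goodfellas': 5.00},
}

def safe_critic_similarity(ratings, critic_a, critic_b):
    """Euclidean similarity that returns 0.0 when no movie is shared.

    Returning 0.0 is an assumption made for this sketch; skipping such
    users entirely would be another reasonable choice.
    """
    common = [movie for movie in ratings[critic_a] if movie in ratings[critic_b]]
    if not common:
        return 0.0
    distance = math.sqrt(
        sum((ratings[critic_a][m] - ratings[critic_b][m]) ** 2 for m in common)
    )
    return 1 / (1 + distance)

print(safe_critic_similarity(toy_review, 'Marlon Brando', 'George Lucas'))
print(safe_critic_similarity(toy_review, 'Marlon Brando', 'Joe Pesci'))
```

With this change George Lucas no longer ranks above users who actually share rated movies with Marlon Brando.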

In [9]:
recommend_movies('Marlon Brando',4)

[('Goodfellas', 5.000000000000001),
 ('Raiders of the Lost Ark', 5.0),
 ('Raging Bull', 4.89),
 ('Star Wars Episode IV - A New Hope', 3.8157055214723923),
 ('One Flew Over The Cuckoos Nest', 2.02)]

In [10]:
recommend_movies('Robert DeNiro',4)

[('Raiders of the Lost Ark', 5.0),
 ('Star Wars Episode IV - A New Hope', 4.92),
 ('Close Encounters of the Third Kind', 1.2744773851327365)]

In [11]:
recommend_movies('Steven Spielberg',4)

[('The Shawshank Redemption', 4.928285762244913),
 ('The Green Mile', 4.87),
 ('The Shining', 4.71304734727882),
 ('Apocalypse Now', 1.63)]

In [12]:
recommend_movies('Tom Hanks',4)

[('Raiders of the Lost Ark', 5.0),
 ('Jaws', 5.0),
 ('Close Encounters of the Third Kind', 5.0),
 ('The Shining', 4.93),
 ('Star Wars Episode IV - A New Hope', 4.92),
 ('The Shawshank Redemption', 4.89)]

In [13]:
recommend_movies('Martin Scorsese',4)

[('Raiders of the Lost Ark', 5.0),
 ('Star Wars Episode IV - A New Hope', 4.92),
 ('The Godfather Part II', 4.3613513513513515),
 ('Apocalypse Now', 4.26)]

In [14]:
recommend_movies('Joe Pesci',4)

[('Apocalypse Now', 5.000000000000001),
 ('The Godfather Part II', 4.7280538302277435),
 ('One Flew Over The Cuckoos Nest', 2.02),
 ('Close Encounters of the Third Kind', 1.14),
 ('Jaws', 1.12)]

# 2. Singular Value Decomposition (SVD)

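- When the data has too many dimensions, the curse of dimensionality arises (even if each user rates only 100 movies, the data already has 100 dimensions)
- SVD reduces the dimensionality without losing much information<br>
<br>
- It also finds latent features that relate movies and users
- It can uncover useful information such as hidden correlations (latent factors) in the data and remove unneeded information<br>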

Advantages
- Regression metrics such as MSE and MAE can be used, so performance can be evaluated
- Widely used

Disadvantages
- SVD is suited to data that is not sparse, but most real-world data is sparse
- In real data, more than 95% of the entries are often missing

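What is a latent feature?
- A user gave two movies a rating of 9 and two other movies a rating of 2
- Looking at the content, the first two movies were about AI and the last two were about dogs
- This user likely enjoys AI movies and dislikes dog movies
- However, the data has no feature labelled AI or dog
- SVD can find such latent features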

<b>Interpreting SVD</b>

<br>
$A$ can be decomposed as $A=U \Sigma V^{T}$<br>
$U_{(m \times m)}$: orthogonal matrix (left singular vectors),<br>
$\Sigma_{(m \times n)}$: rectangular diagonal matrix with non-negative entries (singular values),<br>
$V_{(n \times n)}$: orthogonal matrix (right singular vectors)
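
The shapes and properties listed above can be checked numerically. This is a minimal sketch on a small made-up matrix `M`; it uses `np.linalg.svd`, which for this purpose is assumed to behave like the `scipy.linalg.svd` call used in the notebook (both return `U`, the singular values as a 1-D array, and `V^T`).

```python
import numpy as np

# Small made-up matrix with m=3 rows and n=2 columns.
M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U_full, sing, Vt_full = np.linalg.svd(M)  # full_matrices=True by default

# The singular values come back as a vector; place them on the diagonal
# of an m x n matrix to obtain Sigma.
Sigma = np.zeros(M.shape)
Sigma[:len(sing), :len(sing)] = np.diag(sing)

reconstructed = U_full @ Sigma @ Vt_full

print(U_full.shape, Sigma.shape, Vt_full.shape)     # (3, 3) (3, 2) (2, 2)
print(np.allclose(reconstructed, M))                # A = U Sigma V^T
print(np.allclose(U_full.T @ U_full, np.eye(3)))    # U is orthogonal
print(np.allclose(Vt_full @ Vt_full.T, np.eye(2)))  # V is orthogonal
print(np.all(sing >= 0))                            # singular values are non-negative
```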

In [15]:
import numpy as np
from scipy.linalg import svd

A=np.array([[1,1,1,0,0],[3,3,3,0,0],[4,4,4,0,0],[5,5,5,0,0],[0,2,0,4,4],[0,0,0,5,5],[0,1,0,2,2]])
U, s, VT = svd(A)
A

array([[1, 1, 1, 0, 0],
       [3, 3, 3, 0, 0],
       [4, 4, 4, 0, 0],
       [5, 5, 5, 0, 0],
       [0, 2, 0, 4, 4],
       [0, 0, 0, 5, 5],
       [0, 1, 0, 2, 2]])

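$A_{(7 \times 5)}$: users' movie ratings
- The columns are the movies The Matrix, Alien, Serenity, Casablanca, and Amélie
- The rows are user 1, ..., user 7<br>
<br>
- The Matrix, Alien, and Serenity are SF movies, mostly watched by users 1 through 4
- Casablanca and Amélie are romance movies, mostly watched by users 5 through 7
- So although it is not labelled in the data, there appear to be two factors: SF and romance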

In [41]:
print(np.abs(U[:,0]))
print(np.abs(U[:,1]) )

[0.13759913 0.41279738 0.5503965  0.68799563 0.15277509 0.07221651
 0.07638754]
[0.02361145 0.07083435 0.09444581 0.11805726 0.59110096 0.73131186
 0.29555048]


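$U_{(7 \times 7)}$
- Rows are users, columns are factors
- It shows the relationship between users and factors
- In the first column, user 4 has the largest value, and user 4 gave the highest SF ratings (5, 5, 5)
- In contrast, users 5, 6, and 7 have small values
- This suggests that the first column is a weight reflecting each user's preference for SF movies<br>
<br>
- In the second column, user 6 has the largest value, and user 6 gave the highest romance ratings (5, 5)
- This suggests that the second column is a weight reflecting each user's preference for romance movies<br>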

In [17]:
s

array([1.24810147e+01, 9.50861406e+00, 1.34555971e+00, 1.84716760e-16,
       9.74452038e-33])

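$\Sigma_{(7 \times 5)}$
- The third through fifth singular values are small compared with the first two (the fourth and fifth are numerically zero)
- Small singular values carry little information about the data, so the columns corresponding to the third through fifth values are dropped (only factors 1 and 2 are kept)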
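
One common way to judge how many singular values to keep is to look at the share of the total squared singular values (sometimes called "energy") retained by the first k of them. The sketch below applies that rule of thumb to the same 7 x 5 rating matrix; the name `A_demo` and the idea of using this particular criterion are assumptions of the sketch.

```python
import numpy as np

# Same 7 x 5 rating matrix as A above.
A_demo = np.array([[1, 1, 1, 0, 0],
                   [3, 3, 3, 0, 0],
                   [4, 4, 4, 0, 0],
                   [5, 5, 5, 0, 0],
                   [0, 2, 0, 4, 4],
                   [0, 0, 0, 5, 5],
                   [0, 1, 0, 2, 2]], dtype=float)

sing_demo = np.linalg.svd(A_demo, compute_uv=False)  # singular values only
energy = sing_demo ** 2

# cumulative[k-1] is the share of total energy kept by the first k singular values.
cumulative = np.cumsum(energy) / energy.sum()
print(np.round(cumulative, 4))
```

The first two singular values retain more than 99% of the energy here, which is consistent with keeping only factors 1 and 2.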

In [42]:
print(abs(VT[0,:]))
print(abs(VT[1,:]))

[0.56225841 0.5928599  0.56225841 0.09013354 0.09013354]
[0.12664138 0.02877058 0.12664138 0.69537622 0.69537622]


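$V^{T}_{(5 \times 5)}$
- Rows are factors, columns are movies
- It shows the relationship between movies and factors<br>
<br>
- The first movie has a weight of 0.56 on factor 1 (SF) and 0.13 on factor 2 (romance), so it is closer to factor 1
- Likewise, movies 1, 2, and 3 are close to factor 1, and movies 4 and 5 are close to factor 2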

<b>Reconstruction after dimensionality reduction</b>

In [25]:
U1=U[:,0:2]
s1=np.diag(s[0:2])
VT1=VT[0:2,:]

In [None]:
A2=np.matmul(  np.matmul(U1,s1),VT1  )
A2

array([[ 9.94042024e-01,  1.01170444e+00,  9.94042024e-01,
        -1.32719254e-03, -1.32719254e-03],
       [ 2.98212607e+00,  3.03511332e+00,  2.98212607e+00,
        -3.98157762e-03, -3.98157762e-03],
       [ 3.97616810e+00,  4.04681776e+00,  3.97616810e+00,
        -5.30877016e-03, -5.30877016e-03],
       [ 4.97021012e+00,  5.05852220e+00,  4.97021012e+00,
        -6.63596269e-03, -6.63596269e-03],
       [ 3.60313300e-01,  1.29216474e+00,  3.60313300e-01,
         4.08026301e+00,  4.08026301e+00],
       [-3.73850664e-01,  7.34429403e-01, -3.73850664e-01,
         4.91672142e+00,  4.91672142e+00],
       [ 1.80156650e-01,  6.46082370e-01,  1.80156650e-01,
         2.04013151e+00,  2.04013151e+00]])

In [78]:
np.mean ((A-A2)**2)

0.051729455444565274

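- Only two singular values were kept, yet the result differs little from the original matrix A
- $MSE=\frac{\sum (X-\hat{X})^{2}}{n}$ is 0.05, which is very small
- Almost no information is lost
- $\hat{X}$ is what would be used for prediction, but it is not needed here because there is no missing data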
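
The size of this error follows directly from the singular values that were dropped: for a rank-k truncation, the sum of squared errors equals the sum of the squared discarded singular values. The sketch below checks this on the same matrix; the variable names are introduced here for illustration.

```python
import numpy as np

# Same 7 x 5 rating matrix as A above.
A_demo = np.array([[1, 1, 1, 0, 0],
                   [3, 3, 3, 0, 0],
                   [4, 4, 4, 0, 0],
                   [5, 5, 5, 0, 0],
                   [0, 2, 0, 4, 4],
                   [0, 0, 0, 5, 5],
                   [0, 1, 0, 2, 2]], dtype=float)

U_d, s_d, Vt_d = np.linalg.svd(A_demo, full_matrices=False)

k = 2  # number of singular values kept
A_k = U_d[:, :k] @ np.diag(s_d[:k]) @ Vt_d[:k, :]

# MSE computed element-wise, as in the notebook.
mse_direct = np.mean((A_demo - A_k) ** 2)

# MSE computed from the discarded singular values only.
mse_from_singular_values = np.sum(s_d[k:] ** 2) / A_demo.size

print(mse_direct, mse_from_singular_values)
```

Both numbers agree with the 0.0517 reported above, since the only non-negligible discarded singular value is the third one (about 1.35).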

# 3. Funk SVD

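- Despite its name, Funk SVD does not actually compute an SVD
- SVD is hard to use when much of the data is missing, whereas Funk SVD performs better than SVD in that case
- $U$: user by latent-factor matrix
- $V^{T}$: latent-factor by movie matrix<br>
- Prediction matrix: $\hat{X}=UV^{T}$<br>

- Objective function: $\arg\min_{U,V}\sum_{(i,j)\in\Omega}(x_{ij}-\hat{x}_{ij})^{2}$, where $\Omega$ is the set of observed ratings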


In [82]:
B=np.array([[np.nan,np.nan,9,1   ],[3,np.nan,7,np.nan],[5,np.nan,np.nan,10],[np.nan,2,np.nan,np.nan]  ])
B

array([[nan, nan,  9.,  1.],
       [ 3., nan,  7., nan],
       [ 5., nan, nan, 10.],
       [nan,  2., nan, nan]])

In [106]:
np.random.seed(0)
U1=np.array([np.random.uniform(-5,5,3),np.random.uniform(-5,5,3),np.random.uniform(-5,5,3),np.random.uniform(-5,5,3)])
print(U1)

[[ 0.48813504  2.15189366  1.02763376]
 [ 0.44883183 -0.76345201  1.45894113]
 [-0.62412789  3.91773001  4.63662761]
 [-1.16558481  2.91725038  0.2889492 ]]


In [107]:
np.random.seed(1)
VT1=np.array([np.random.uniform(-5,5,4),np.random.uniform(-5,5,4),np.random.uniform(-5,5,4)])
print(VT1)

[[-0.82977995  2.20324493 -4.99885625 -1.97667427]
 [-3.53244109 -4.07661405 -3.13739789 -1.54439273]
 [-1.03232526  0.38816734 -0.80805486  1.852195  ]]


The $U$ and $V^{T}$ matrices are filled with random values

In [108]:
np.matmul(U1,VT1)

array([[ -9.06733456,  -7.29806503, -10.02184798,  -2.38487479],
       [  0.81831581,   4.66749893,  -1.02729755,   2.99411887],
       [-18.10776944, -15.54639241, -12.91820171,   3.77112228],
       [ -9.63612576, -14.34841209,  -3.55947106,  -1.66620851]])

This gives $\hat{X}_{1}=U_{1}V_{1}^{T}$

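Gradient Descent<br>
With $\hat{X}_{t}=U_{t}V_{t}^{T}$ and $E_{t}$ equal to $X-\hat{X}_{t}$ on the observed entries and 0 elsewhere:<br>
$U_{t+1}= U_{t}+ 2\alpha E_{t}V_{t}$<br>
$V_{t+1}= V_{t}+ 2\alpha E_{t}^{T}U_{t}$<br>
Repeating these updates minimizes the objective function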
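
The gradient-descent procedure can be sketched as follows on the small matrix `B` from above. This is a minimal full-batch version written for illustration: the function name `funk_svd`, the learning rate, the number of iterations, the small random initialization, and the optional L2 penalty `reg` are all assumptions of the sketch. Practical implementations typically update one rating at a time with stochastic gradient descent.

```python
import numpy as np

def funk_svd(X, n_factors=2, lr=0.005, reg=0.0, n_iters=5000, seed=0):
    """Factorize X ~ U @ V.T using only the observed (non-NaN) entries.

    Hypothetical helper for illustration; hyperparameter defaults are
    arbitrary choices, not values taken from the notebook.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)                  # True where a rating is observed
    X_filled = np.where(mask, X, 0.0)    # placeholder zeros, masked out below
    n_users, n_items = X.shape
    U = rng.normal(scale=0.5, size=(n_users, n_factors))
    V = rng.normal(scale=0.5, size=(n_items, n_factors))
    losses = []
    for _ in range(n_iters):
        E = mask * (X_filled - U @ V.T)  # error on observed entries, 0 elsewhere
        losses.append(float(np.sum(E ** 2)))
        # Gradient step on the squared error, with an optional L2 penalty.
        U_next = U + lr * (2 * E @ V - 2 * reg * U)
        V_next = V + lr * (2 * E.T @ U - 2 * reg * V)
        U, V = U_next, V_next
    return U, V, losses

B_demo = np.array([[np.nan, np.nan, 9, 1],
                   [3, np.nan, 7, np.nan],
                   [5, np.nan, np.nan, 10],
                   [np.nan, 2, np.nan, np.nan]])

U_hat, V_hat, losses = funk_svd(B_demo)
X_hat = U_hat @ V_hat.T  # predictions for every cell, including the missing ones
print(losses[0], losses[-1])
print(np.round(X_hat, 2))
```

Setting `reg` to a small positive value adds the regularization term and shrinks the factors toward zero, at the cost of a slightly larger error on the observed ratings.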

- However, even Funk SVD cannot make recommendations for newly registered users (Cold Start Problem)
- It overfits easily, so a regularization term is usually added


# 4. Applying SVD to Data: MovieLens data

Source: https://beckernick.github.io/matrix-factorization-recommender/

<b>Loading the data</b>

In [161]:
import os
import pandas as pd
os.chdir('C:/Users/bki19/Desktop/recommender_system')
md =  pd.read_csv('./data/the-movies-dataset/movies_metadata.csv', low_memory=False)

In [109]:
import os
import pandas as pd
os.chdir('C:/Users/bki19/Desktop/recommender_system')
ratings = pd.read_csv('./data/the-movies-dataset/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [124]:
import numpy as np
print(len(np.unique(ratings['userId'])))
print(len(np.unique(ratings['movieId'])))

671
9066


In [165]:
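# Drop three rows before casting 'id' to int (presumably rows with malformed ids)
md = md.drop([19730, 29503, 35587])
md['id'] = md['id'].astype('int')
# NOTE: this keeps the movies whose id matches a value in ratings['userId'];
# matching against ratings['movieId'] was probably intended. The outputs below
# were produced with the line as written.
md2=md.loc[md['id'].isin(ratings['userId'])]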

In [166]:
md2.shape

(527, 24)

In [173]:
md2.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
15,False,,52000000,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",,524,tt0112641,en,Casino,The life of the gambling paradise – Las Vegas ...,...,1995-11-22,116112375.0,178.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,No one stays at the top forever.,Casino,False,7.8,1343.0
17,False,,4000000,"[{'id': 80, 'name': 'Crime'}, {'id': 35, 'name...",,5,tt0113101,en,Four Rooms,It's Ted the Bellhop's first night on the job....,...,1995-12-09,4300000.0,98.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Twelve outrageous guests. Four scandalous requ...,Four Rooms,False,6.5,539.0
24,False,,3600000,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",http://www.mgm.com/title_title.do?title_star=L...,451,tt0113627,en,Leaving Las Vegas,"Ben Sanderson, an alcoholic Hollywood screenwr...",...,1995-10-27,49800000.0,112.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,I Love You... The Way You Are.,Leaving Las Vegas,False,7.1,365.0
31,False,,29500000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",,63,tt0114746,en,Twelve Monkeys,"In the year 2035, convict James Cole reluctant...",...,1995-12-29,168840000.0,129.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The future is history.,Twelve Monkeys,False,7.4,2470.0
44,False,,20000000,"[{'id': 14, 'name': 'Fantasy'}, {'id': 18, 'na...",,577,tt0114681,en,To Die For,Susan wants to work in television and will the...,...,1995-05-20,21284514.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,All she wanted was a little attention.,To Die For,False,6.7,177.0


671 users rated 9066 movies

<b>Pivot so that rows are users and columns are movie IDs</b>

In [113]:
R_df = ratings.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
R_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The data is sparse

In [119]:
R = R_df.to_numpy()  # as_matrix() was removed in pandas 1.0
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)


Center the data by subtracting each user's mean rating

<b>SVD</b>

In [122]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)

k: the number of dimensions (singular values) to keep<br>
It can be treated as a hyperparameter and chosen by cross validation

In [123]:
#Singular value matrix
sigma = np.diag(sigma)

<b>Prediction</b>

In [125]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)

In [145]:
preds_df.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
0,-0.054239,0.04513,-0.004835,-0.019817,-0.011284,0.041373,-0.007822,-0.017188,0.012246,0.03767,...,-0.005258,-0.005453,0.012369,-0.004991,-0.004639,-0.019055,0.021402,-0.006365,-0.006098,-0.004819
1,0.419835,1.40644,-0.188807,0.156658,0.268032,0.414698,0.052172,0.044728,-0.020198,2.220256,...,-0.005909,-0.003974,-0.012555,-0.003555,-0.002711,-0.071621,-0.016212,0.001047,-0.001468,-0.006577
2,1.345619,0.266505,-0.011962,0.012278,0.079508,0.09096,-0.122094,0.031327,-0.018023,0.141176,...,-0.002647,-0.002364,-0.010153,0.000277,-0.000116,-0.018063,-0.015761,0.010611,0.006792,-0.006357
3,1.133455,1.046982,0.141275,0.081841,-0.339675,-1.484659,-0.263096,-0.16975,-0.021862,1.611664,...,0.020805,0.00041,0.05604,-0.002817,-0.000767,0.159159,0.087519,-0.030854,-0.021279,0.048529
4,1.389578,1.466495,0.605557,-0.029647,0.72938,-0.118539,-0.026017,0.065577,-0.156655,0.307926,...,-0.007422,-0.01181,0.006644,-0.005159,-0.001249,-0.034658,0.016456,0.00171,-0.004166,-0.001864


$\hat{r}_{ij}$: predicted rating that user i would give to movie j<br>
$\hat{r}_{ij}=\bar{r}_{i}+(U \Sigma V^{T})_{ij}$<br>
$\bar{r}_{i}$: the user mean that was subtracted for centering before fitting the SVD, added back here<br>
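
The centering and add-back steps can be followed on a tiny example. The sketch below uses a made-up 4 x 5 rating matrix and a dense `np.linalg.svd` truncated to k=2 as a stand-in for the `svds` call used above; the matrix and variable names are assumptions made for illustration.

```python
import numpy as np

# Made-up ratings: 4 users x 5 movies, 0 standing in for "not rated"
# in the same way as fillna(0) above.
R_demo = np.array([[5., 4., 0., 1., 0.],
                   [4., 5., 1., 0., 0.],
                   [0., 1., 5., 4., 5.],
                   [1., 0., 4., 5., 4.]])

# Center each user's row on that user's mean rating.
user_mean = R_demo.mean(axis=1)
R_centered = R_demo - user_mean.reshape(-1, 1)

# Truncated SVD of the centered matrix (k latent factors).
U_c, s_c, Vt_c = np.linalg.svd(R_centered, full_matrices=False)
k = 2
low_rank = U_c[:, :k] @ np.diag(s_c[:k]) @ Vt_c[:k, :]

# r_hat[i, j] = user mean of i + (U Sigma V^T)[i, j]
predictions = low_rank + user_mean.reshape(-1, 1)
print(np.round(predictions, 2))
```

Because every centered row sums to zero, the low-rank part also has zero row means, so each user's average predicted rating equals that user's original mean.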

In [153]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.userId == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'movieId', right_on = 'id').
                     sort_values(['rating'], ascending=False)
                 )

    print ('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]) )
    print ('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations) )
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['id'].isin(user_full['movieId'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'id',
               right_on = 'movieId').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

In [177]:
already_rated, predictions = recommend_movies(preds_df, 2, md2, ratings, 10)

User 2 has already rated 76 movies.
Recommending the highest 10 predicted ratings movies not already rated.


In [180]:
already_rated[['userId','rating','title']].dropna(subset=['title']).head(10)

Unnamed: 0,userId,rating,title
63,2,5.0,The Poseidon Adventure
25,2,5.0,Contempt
71,2,5.0,The Conversation
70,2,5.0,The Hours
65,2,5.0,"Monsters, Inc."
17,2,5.0,Berlin: Symphony of a Great City
9,2,5.0,48 Hrs.
24,2,5.0,Lili Marleen
1,2,5.0,The Dark
36,2,4.0,The Devil Wears Prada


# 5. Applying Funk SVD to Data: MovieLens data

In [1]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

In [2]:
import os
import pandas as pd
os.chdir('C:/Users/bki19/Desktop/recommender_system')
ratings = pd.read_csv('./data/the-movies-dataset/ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [3]:
ratings.shape

(100004, 4)

In [4]:
import numpy as np
print(len(np.unique(ratings['userId'])))
print(len(np.unique(ratings['movieId'])))

671
9066


671 users rated 9066 movies

In [5]:
reader = Reader()
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [6]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9029  0.8947  0.9004  0.8897  0.8964  0.8968  0.0046  
MAE (testset)     0.6931  0.6901  0.6939  0.6855  0.6906  0.6906  0.0029  
Fit time          8.39    9.54    8.61    7.89    7.81    8.45    0.62    
Test time         0.23    0.34    0.20    0.20    0.22    0.24    0.05    


{'test_rmse': array([0.90292382, 0.8946516 , 0.90037901, 0.88967356, 0.89638446]),
 'test_mae': array([0.69309809, 0.69010149, 0.69387813, 0.68548234, 0.69059818]),
 'fit_time': (8.392886638641357,
  9.543054342269897,
  8.613791942596436,
  7.890775680541992,
  7.806208610534668),
 'test_time': (0.22590279579162598,
  0.34363627433776855,
  0.19997644424438477,
  0.20046520233154297,
  0.21593117713928223)}

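How are the train and test sets split for SVD?
- The rows of the ratings table (individual user-movie ratings) are randomly divided into train and test sets
- U and V are estimated from the training ratings only, and the reconstructed $\hat{X}$ is used to predict the held-out ratings

- Predictions are evaluated with 5-fold cross validation
- Each time, SVD is fitted on 80% of the data and used to predict the remaining 20%; this is repeated 5 times
- The mean RMSE is 0.8968 with little variation across folds, so SVD appears adequate for this data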
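
The splitting scheme can be sketched with NumPy alone. This is not Surprise's internal implementation; `kfold_indices` and `rmse` are hypothetical helpers, the ten ratings are user 1's first ratings from the table above, and the "model" is just the training mean, used as a stand-in so the example stays self-contained.

```python
import numpy as np

def kfold_indices(n_rows, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs over the rows of a ratings table."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, n_splits)
    for i in range(n_splits):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, test_idx

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Ratings given by user 1 (one entry per row of the ratings table).
ratings_demo = np.array([2.5, 3.0, 3.0, 2.0, 4.0, 2.0, 2.0, 2.0, 3.5, 2.0])

fold_scores = []
for train_idx, test_idx in kfold_indices(len(ratings_demo)):
    # Stand-in model: predict the mean of the training ratings.
    baseline = ratings_demo[train_idx].mean()
    predictions = np.full(len(test_idx), baseline)
    fold_scores.append(rmse(ratings_demo[test_idx], predictions))

print(np.round(fold_scores, 3), round(float(np.mean(fold_scores)), 3))
```

Each rating appears in exactly one test fold, so every fold's model is evaluated on ratings it did not see during fitting.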

In [7]:
trainset = data.build_full_trainset()
svd.fit(trainset)  # fit() replaces the deprecated train()



<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1fe4821c518>

In [8]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [14]:
svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.7991029599998747, details={'was_impossible': False})

User 1 is predicted to give movie 302 a rating of about 2.799

# 6. Other Approaches

Hybrid Recommender

- A method that uses both a content-based recommender and collaborative filtering
- The content-based recommender is applied first where no user rating data is available

Association Rules Learning
- A method for finding associations between one product and others
- Widely used in e-commerce to recommend products related to the main product a user is interested in
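
Association rules are usually scored with support, confidence, and lift. The sketch below computes these three measures for the rule "laptop -> mouse" on a handful of made-up shopping baskets; the data and function names are assumptions made for illustration.

```python
# Made-up shopping baskets, one set of products per purchase.
baskets = [
    {'laptop', 'mouse'},
    {'laptop', 'mouse', 'bag'},
    {'laptop', 'bag'},
    {'mouse'},
    {'laptop', 'mouse'},
]

def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how often `consequent` appears overall."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print(support({'laptop', 'mouse'}, baskets))       # 0.6
print(confidence({'laptop'}, {'mouse'}, baskets))  # 0.6 / 0.8
print(lift({'laptop'}, {'mouse'}, baskets))        # 0.75 / 0.8
```

A lift above 1 suggests the two products are bought together more often than their individual popularity would predict; here it is slightly below 1, so the toy data gives no reason to recommend a mouse specifically to laptop buyers.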

Sources:
- https://towardsdatascience.com/learning-to-make-recommendations-745d13883951
- https://www.datacamp.com/community/tutorials/recommender-systems-python
- https://github.com/rounakbanik/movies/blob/master/movies_recommender.ipynb
- https://medium.com/datadriveninvestor/how-funk-singular-value-decomposition-algorithm-work-in-recommendation-engines-36f2fbf62cac

Data sources:
- https://nbviewer.jupyter.org/github/BadreeshShetty/Learnings-to-make-Recommedations/tree/master/Content%20Filtering/
- https://www.kaggle.com/rounakbanik/the-movies-dataset/data