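# Movie recommendation with matrix factorization

Let's build a recommendation algorithm by applying low-rank matrix factorization to the one million movie ratings from the MovieLens project.

## Building the ratings data

Import the required libraries.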

In [1]:
import pandas as pd
import numpy as np
import codecs

Load the rating, user, and movie data needed for the recommender, and build the data frames.

In [2]:
def read_dat(path):
    # Each line is "::"-delimited; the context manager closes the file when done.
    with codecs.open(path, 'r', encoding='latin-1') as f:
        return [line.strip().split("::") for line in f]

ratings_list = read_dat('./data/ml-1m/ratings.dat')
users_list = read_dat('./data/ml-1m/users.dat')
movies_list = read_dat('./data/ml-1m/movies.dat')

# The parsed fields are strings, so convert the rating columns to integers explicitly.
ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp']).astype(int)
movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'Genres'])
movies_df['MovieID'] = pd.to_numeric(movies_df['MovieID'])
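
As an alternative to splitting lines by hand, the same `::`-delimited layout can probably be parsed directly with `pd.read_csv`. The sketch below is only an illustration: it reads a small in-memory sample (the first three rating rows shown later in this notebook) instead of the real file, and the name `ratings_sample` is made up for this example.

```python
import io
import pandas as pd

# In-memory stand-in for ratings.dat (assumption: same "::"-delimited layout).
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "1::914::3::978301968\n"
)

# A multi-character separator is treated as a regex and needs the Python engine.
ratings_sample = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    header=None,
    names=["UserID", "MovieID", "Rating", "Timestamp"],
)
print(ratings_sample.dtypes)
```

With this approach the numeric columns are inferred as integers, so no separate type conversion is needed. For the real files, passing the path and `encoding='latin-1'` in place of the `StringIO` object should work the same way.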

Let's take a look at each data frame.

The movie data frame looks like this.

In [3]:
movies_df.head()

,MovieID,Title,Genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


The ratings data frame looks like this.

In [4]:
ratings_df.head()

,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Build a matrix in which each row represents a user and each column holds the ratings for one movie. We pivot the ratings_df data frame and store the result in a data frame named R_df.

In [5]:
R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
R_df.head()

MovieID,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
UserID,,,,,,,,,,,,,,,,,,,,,
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
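
To make the pivot step concrete, here is a minimal sketch on a made-up four-row ratings table (`toy`, `toy_R`, and `density` are hypothetical names for this example only). It also measures what fraction of the cells hold a real rating, since the zero fill makes missing ratings and actual values share the same matrix.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame({
    "UserID": [1, 1, 2, 3],
    "MovieID": [10, 20, 20, 30],
    "Rating": [5, 3, 4, 2],
})

# Wide format: one row per user, one column per movie, 0 where no rating exists.
toy_R = toy.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)
print(toy_R)

# Fraction of cells that hold an actual rating (assumption: 0 means "unrated").
density = (toy_R.to_numpy() > 0).mean()
print(density)
```

Note that `pivot` expects each (UserID, MovieID) pair to appear at most once and raises a `ValueError` on duplicates.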


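Normalize the data by subtracting each user's mean rating (de-mean). Because unrated entries were filled with 0, this mean is taken over all movies, not only the ones the user rated.<br>The data frame is converted to a NumPy array for this step.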

In [6]:
R = R_df.to_numpy()  # as_matrix() was removed in pandas 1.0
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
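
The following sketch shows the de-meaning step on a tiny made-up matrix. It also computes, purely for comparison, a mean over rated entries only. That second variant is not what this notebook uses; it is included as an assumption-labeled alternative to show how much the zero fill pulls the mean down.

```python
import numpy as np

# Toy user-by-movie matrix; 0 stands for "not rated" (assumption for this sketch).
toy_R = np.array([[5.0, 3.0, 0.0],
                  [0.0, 4.0, 0.0],
                  [0.0, 0.0, 2.0]])

# Mean over all columns, zeros included (same computation as the cell above).
mean_all = toy_R.mean(axis=1)

# Alternative, not used in this notebook: mean over rated entries only.
rated = toy_R > 0
mean_rated = toy_R.sum(axis=1) / rated.sum(axis=1)

# Subtract each row's mean; reshape(-1, 1) makes it broadcast across columns.
demeaned = toy_R - mean_all.reshape(-1, 1)
print(mean_all, mean_rated)
print(demeaned)
```

After de-meaning, every row sums to zero, which is the property the subtraction is meant to produce.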

## Singular Value Decomposition

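Both SciPy and NumPy provide functions for singular value decomposition (SVD).<br> We use the SciPy function here because it lets us specify the number of latent factors used to approximate the original rating matrix. (The number is fixed up front rather than truncated afterward.)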

In [14]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50)
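
To illustrate the difference between the two routes, the sketch below compares `svds` with `k = 2` against a full `np.linalg.svd` truncated to its two largest singular values. The random 8 x 6 matrix and all variable names are assumptions made for this example; the point is only that both routes should give the same rank-k approximation.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))  # toy stand-in for the de-meaned rating matrix
k = 2

# Route 1: ask for k factors up front (k must be smaller than min(A.shape)).
U, s, Vt = svds(A, k=k)
approx_svds = U @ np.diag(s) @ Vt

# Route 2: compute the full SVD, then keep only the k largest singular values.
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)
approx_full = U_full[:, :k] @ np.diag(s_full[:k]) @ Vt_full[:k, :]

print(U.shape, s.shape, Vt.shape)
print(np.allclose(approx_svds, approx_full, atol=1e-6))
```

On a matrix the size of the MovieLens ratings, the first route avoids computing factors that would be discarded anyway.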

Here sigma is returned as a 1-D array of singular values rather than a diagonal matrix, so we convert it to a diagonal matrix to simplify the computation.

In [11]:
sigma = np.diag(sigma)
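
As a small illustration of what the conversion does, the sketch below uses made-up 3 x 2 and 2 x 2 matrices. It shows that multiplying by `np.diag(s)` is the same as scaling the columns of `U` by the singular values through broadcasting, which is an equivalent option and not something this notebook relies on.

```python
import numpy as np

# Toy factors (hypothetical values, not an actual SVD result).
s = np.array([3.0, 2.0])
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Vt = np.array([[1.0, 2.0],
               [3.0, 4.0]])

# Explicit diagonal matrix, as in the cell above.
with_diag = U @ np.diag(s) @ Vt

# Equivalent: scale each column of U by its singular value via broadcasting.
with_broadcast = (U * s) @ Vt

print(with_diag)
print(np.allclose(with_diag, with_broadcast))
```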

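## Making predictions from the decomposed matrices

We are now ready to predict movie ratings for every user.<br> Multiplying U, Σ, and V_transpose yields a rank k = 50 approximation of the de-meaned rating matrix.<br> To get predictions on the 5-point rating scale, we add each user's mean back.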

In [12]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)
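
Written out, the computation in the cell above corresponds to the following, where $\mu_i$ denotes user $i$'s mean rating (`user_ratings_mean`), $\sigma_f$ the $f$-th singular value, and $k = 50$. The symbols $\hat{R}$, $\mu$, and $\mathbf{1}$ (a vector of ones) are notation introduced here for illustration.

$$
\hat{R} = U_k \Sigma_k V_k^{\top} + \mu \mathbf{1}^{\top},
\qquad
\hat{r}_{ij} = \mu_i + \sum_{f=1}^{k} U_{if}\,\sigma_f\,V_{jf}
$$

Each predicted rating is therefore the user's mean plus a weighted inner product between the user's latent factors (row $i$ of $U_k$) and the movie's latent factors (column $j$ of $V_k^{\top}$).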

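## Generating movie recommendations

Using the prediction matrix, let's write a function that recommends movies for a given user.<br> The function returns the movies the user has not yet rated, ordered by predicted rating. <br>The prediction matrix carries no movie metadata (genre, title), so we merge that information in.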

In [14]:
def recommend_movies(predictions_df, userID, movies_df, original_ratings_df, num_recommendations=5):
    
    # Get and sort the user's predictions
    user_row_number = userID - 1 # UserID starts at 1, not 0
    sorted_user_predictions = predictions_df.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the movie information.
    user_data = original_ratings_df[original_ratings_df.UserID == (userID)]
    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'MovieID', right_on = 'MovieID').
                     sort_values(['Rating'], ascending=False)
                 )

    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))
    print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating movies that the user hasn't seen yet.
    recommendations = (movies_df[~movies_df['MovieID'].isin(user_full['MovieID'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'MovieID',
               right_on = 'MovieID').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

already_rated, predictions = recommend_movies(preds_df, 837, movies_df, ratings_df, 10)

User 837 has already rated 69 movies.
Recommending the highest 10 predicted ratings movies not already rated.
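
The core of the function (filter out rated movies, attach predictions, sort, keep the top N) can be sketched on made-up data as follows. All names and values here are hypothetical. Unlike `recommend_movies`, this sketch keeps the `Predictions` column in the result so the ordering is visible.

```python
import pandas as pd

# Toy movie catalogue and one user's predicted ratings, indexed by MovieID.
movies = pd.DataFrame({"MovieID": [10, 20, 30, 40],
                       "Title": ["A", "B", "C", "D"]})
user_preds = pd.Series([4.2, 1.1, 3.7, 4.9],
                       index=[10, 20, 30, 40], name="Predictions")
already_rated_ids = [40]  # movies this user has already rated

top_n = (
    movies[~movies["MovieID"].isin(already_rated_ids)]   # drop rated movies
    .merge(user_preds.rename_axis("MovieID").reset_index(),
           on="MovieID", how="left")                      # attach predictions
    .sort_values("Predictions", ascending=False)          # best first
    .head(2)                                              # keep the top 2
)
print(top_n)
```

Movie 40 has the highest predicted rating but is excluded because the user already rated it, so the sketch returns movies 10 and 30.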


Let's look at the movies the user has already rated.

In [15]:
already_rated.head(10)

,UserID,MovieID,Rating,Timestamp,Title,Genres
36,837,858,5,975360036,"Godfather, The (1972)",Action|Crime|Drama
35,837,1387,5,975360036,Jaws (1975),Action|Horror
65,837,2028,5,975360089,Saving Private Ryan (1998),Action|Drama|War
63,837,1221,5,975360036,"Godfather: Part II, The (1974)",Action|Crime|Drama
11,837,913,5,975359921,"Maltese Falcon, The (1941)",Film-Noir|Mystery
20,837,3417,5,975360893,"Crimson Pirate, The (1952)",Adventure|Comedy|Sci-Fi
34,837,2186,4,975359955,Strangers on a Train (1951),Film-Noir|Thriller
55,837,2791,4,975360893,Airplane! (1980),Comedy
31,837,1188,4,975360920,Strictly Ballroom (1992),Comedy|Romance
28,837,1304,4,975360058,Butch Cassidy and the Sundance Kid (1969),Action|Comedy|Western


The recommended movies are as follows.

In [16]:
predictions

,MovieID,Title,Genres
516,527,Schindler's List (1993),Drama|War
1848,1953,"French Connection, The (1971)",Action|Crime|Drama|Thriller
596,608,Fargo (1996),Crime|Drama|Thriller
1235,1284,"Big Sleep, The (1946)",Film-Noir|Mystery
2085,2194,"Untouchables, The (1987)",Action|Crime|Drama
1188,1230,Annie Hall (1977),Comedy|Romance
1198,1242,Glory (1989),Action|Drama|War
897,922,Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),Film-Noir
1849,1954,Rocky (1976),Action|Drama
581,593,"Silence of the Lambs, The (1991)",Drama|Thriller


Do the recommendations look good?