# **E15. Movie Recommendation Algorithm**

**INDEX**

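- 00. Importing the required modules

- 01. Data preparation & preprocessing

- 02. Analysis

- 03. Adding data

- 04. Building the CSR matrix

- 05. Training the model

- 06. Checking the trained model

- 07. Testing the hypothesis

- 08. Retrospective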

---

## **00. Importing the required modules**

In [346]:
# from google.colab import drive
# drive.mount('/content/mydrive')

In [347]:
# !pip install implicit

In [348]:
import os
import numpy as np
import pandas as pd

from implicit.als import AlternatingLeastSquares
from scipy.sparse import csr_matrix

print("Done!")

Done!


## **01. Data preparation & preprocessing**

In [349]:
rating_file_path = '/content/mydrive/MyDrive/AIFFEL/E15/data/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'ratings', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
original_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,ratings,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [350]:
# Keep only ratings of 3 or higher (copy so the later in-place rename acts on an independent frame)
ratings = ratings[ratings['ratings'] >= 3].copy()
filtered_data_size = len(ratings)

print(f'original data size: {original_data_size}, filtered data size: {filtered_data_size}')
print(f'Ratio of remaining data is {filtered_data_size / original_data_size}')

original data size: 1000209, filtered data size: 836478
Ratio of remaining data is 0.8363032126285607


In [351]:
# Rename the 'ratings' column to 'counts'
ratings.rename(columns = {'ratings':'counts'}, inplace=True)

In [352]:
ratings['counts']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: counts, Length: 836478, dtype: int64

In [353]:
# Load the movie metadata so we can look up titles
movie_file_path = '/content/mydrive/MyDrive/AIFFEL/E15/data/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [354]:
# Load the user information
user_file_path = '/content/mydrive/MyDrive/AIFFEL/E15/data/users.dat'
cols = ['user_id', 'gender', 'age', 'occupation', 'zip-code'] 
users = pd.read_csv(user_file_path, sep='::', names = cols, engine='python', encoding='ISO-8859-1')
users.head()

Unnamed: 0,user_id,gender,age,occupation,zip-code
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


---

## **02. Analysis**

In [355]:
ratings.head()

Unnamed: 0,user_id,movie_id,counts,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [356]:
# Number of unique movies in ratings
ratings['movie_id'].nunique()

3628

In [357]:
# Number of unique users in ratings
ratings['user_id'].nunique()

6039

In [358]:
# The 30 most popular movies, in order of popularity
# "Popular" here means watched by many users, i.e. many rating rows per movie
movie_count = ratings.groupby('movie_id')['user_id'].count()
top30 = movie_count.sort_values(ascending=False)[:30]

In [359]:
# Number of ratings for each of the top 30 movies
for i, k in zip(top30.index, top30.values):
    print(movies[movies['movie_id']==i]['title'].values[0], '\ncount:', k, '\n')

American Beauty (1999) 
count: 3211 

Star Wars: Episode IV - A New Hope (1977) 
count: 2910 

Star Wars: Episode V - The Empire Strikes Back (1980) 
count: 2885 

Star Wars: Episode VI - Return of the Jedi (1983) 
count: 2716 

Saving Private Ryan (1998) 
count: 2561 

Terminator 2: Judgment Day (1991) 
count: 2509 

Silence of the Lambs, The (1991) 
count: 2498 

Raiders of the Lost Ark (1981) 
count: 2473 

Back to the Future (1985) 
count: 2460 

Matrix, The (1999) 
count: 2434 

Jurassic Park (1993) 
count: 2413 

Sixth Sense, The (1999) 
count: 2385 

Fargo (1996) 
count: 2371 

Braveheart (1995) 
count: 2314 

Men in Black (1997) 
count: 2297 

Schindler's List (1993) 
count: 2257 

Princess Bride, The (1987) 
count: 2252 

Shakespeare in Love (1998) 
count: 2213 

L.A. Confidential (1997) 
count: 2210 

Shawshank Redemption, The (1994) 
count: 2194 

Godfather, The (1972) 
count: 2167 

Groundhog Day (1993) 
count: 2121 

E.T. the Extra-Terrestrial (1982) 
count: 2102 

Being J
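
The loop above filters `movies` once per title. A minimal sketch of the same report built with one `merge`, using tiny made-up frames (`toy_ratings`, `toy_movies`, and the `n_ratings` column name are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Toy stand-ins for the ratings and movies frames (values are made up)
toy_ratings = pd.DataFrame({
    'user_id':  [1, 2, 3, 1, 2, 1],
    'movie_id': [10, 10, 10, 20, 20, 30],
})
toy_movies = pd.DataFrame({
    'movie_id': [10, 20, 30],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})

# Count rating rows per movie, most-rated first
popularity = (
    toy_ratings.groupby('movie_id')['user_id']
    .count()
    .sort_values(ascending=False)
    .rename('n_ratings')
    .reset_index()
)

# Attach titles with a single join instead of filtering inside a loop
top_titles = popularity.merge(toy_movies, on='movie_id', how='left')
print(top_titles[['title', 'n_ratings']])
```

A left join keeps the popularity order, so the first rows are still the most-rated movies.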

## **03. Adding data**

In [360]:
# Add five of my favorite movies
# First check whether they are already in the dataset

# In Time
movies[movies['title'].str.contains('In Time')]

Unnamed: 0,movie_id,title,genre


In [361]:
# Life of Pi
movies[movies['title'].str.contains('Life of Pi')]

Unnamed: 0,movie_id,title,genre


In [362]:
# The Dark Knight
movies[movies['title'].str.contains('Knight')]

Unnamed: 0,movie_id,title,genre
166,168,First Knight (1995),Action|Adventure|Drama|Romance
324,328,Tales From the Crypt Presents: Demon Knight (1...,Horror
3492,3561,Stacy's Knights (1982),Drama
3550,3619,"Hollywood Knights, The (1980)",Comedy
3736,3805,Knightriders (1981),Action|Adventure|Drama


In [363]:
# The Platform
movies[movies['title'].str.contains('Platform')]

Unnamed: 0,movie_id,title,genre


In [364]:
# Dunkirk
movies[movies['title'].str.contains('Dunkirk')]

Unnamed: 0,movie_id,title,genre


Probably because the dataset is old, none of the five are in it.

Let's add them ourselves.

In [365]:
ratings.head()

Unnamed: 0,user_id,movie_id,counts,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [366]:
movies.tail()

Unnamed: 0,movie_id,title,genre
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama
3882,3952,"Contender, The (2000)",Drama|Thriller


In [367]:
users['user_id'].nunique()

6040

In [368]:
new_rat_col = ratings.columns
new_mov_col = movies.columns

new_rat = ratings.copy()
new_mov = movies.copy()

# Rows to add to ratings
my_ratings = [[6041, 3953, 5, 0],
              [6041, 3954, 5, 0],
              [6041, 3955, 4, 0],
              [6041, 3956, 4, 0],
              [6041, 3957, 5, 0]]

# Rows to add to movies
my_movies = [[3953, 'The Platform', 'Thriller'],
             [3954, 'Dunkirk', 'War'],
             [3955, 'Life of Pi', 'Adventure'],
             [3956, 'In Time', 'Action'],
             [3957, 'Batman: the Dark Knight', 'Action']]

print("Done!")

Done!


In [369]:
# Add the new rows to ratings
my_rat_df = pd.DataFrame(data=my_ratings, columns=new_rat_col)
new_ratings = pd.concat([new_rat, my_rat_df], axis=0)

# Reset the index
new_ratings.reset_index(drop=True, inplace=True)
new_ratings.tail(8)

Unnamed: 0,user_id,movie_id,counts,timestamp
836475,6040,562,5,956704746
836476,6040,1096,4,956715648
836477,6040,1097,4,956715569
836478,6041,3953,5,0
836479,6041,3954,5,0
836480,6041,3955,4,0
836481,6041,3956,4,0
836482,6041,3957,5,0


In [370]:
# Add the new rows to movies
my_mov_df = pd.DataFrame(data=my_movies, columns=new_mov_col)
new_movies = pd.concat([new_mov, my_mov_df], axis=0)

# Reset the index
new_movies.reset_index(drop=True, inplace=True)
new_movies.tail(8)

Unnamed: 0,movie_id,title,genre
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama
3882,3952,"Contender, The (2000)",Drama|Thriller
3883,3953,The Platform,Thriller
3884,3954,Dunkirk,War
3885,3955,Life of Pi,Adventure
3886,3956,In Time,Action
3887,3957,Batman: the Dark Knight,Action


---

## **04. Building the CSR matrix**

In [371]:
num_users = new_ratings['user_id'].nunique()
num_movies = new_ratings['movie_id'].nunique()

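# Raw IDs are used directly as row/column indices, so each axis is sized
# (largest id + 1) rather than num_users x num_movies
csr_data = csr_matrix((new_ratings['counts'], (new_ratings.user_id, new_ratings.movie_id)))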
csr_data

<6042x3958 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Row format>
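
The 6042x3958 shape follows from using raw IDs as indices: when no `shape` is passed, `csr_matrix` sizes each axis as the largest index plus one. A small sketch of that behavior, along with one possible way to map IDs to compact indices via `np.unique` (the toy IDs and the compact mapping are illustrative assumptions; the notebook itself keeps the raw IDs):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interaction data with gaps in the ID ranges (values are made up)
user_ids = np.array([1, 1, 5, 9])
movie_ids = np.array([3, 7, 7, 100])
counts = np.array([5, 3, 4, 5])

# Raw IDs as indices: the shape is inferred as (max id + 1) on each axis
raw = csr_matrix((counts, (user_ids, movie_ids)))
print(raw.shape)      # (10, 101)

# Compact indices: position of each ID among the sorted unique IDs
user_index, user_pos = np.unique(user_ids, return_inverse=True)
movie_index, movie_pos = np.unique(movie_ids, return_inverse=True)
compact = csr_matrix(
    (counts, (user_pos, movie_pos)),
    shape=(len(user_index), len(movie_index)),
)
print(compact.shape)  # (3, 3)

# user_index / movie_index translate a row or column back to the original ID
print(user_index[2], movie_index[2])  # 9 100
```

Keeping raw IDs, as the notebook does, leaves some empty rows and columns but lets `user_id` and `movie_id` index the factor matrices directly.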

---

## **05. Training the model**

In [372]:
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

print("Done!")

Done!


In [373]:
'''
factors : number of dimensions of the user and item vectors
regularization : strength of the regularization used to prevent overfitting
use_gpu : whether to train on a GPU
iterations : same idea as epochs; how many passes to make over the data
'''

factors = 100
regularization = 0.01
use_gpu = False
iterations = 20

# Create the model
als_model = AlternatingLeastSquares(factors=factors, regularization=regularization,
                                    use_gpu=use_gpu, iterations=iterations, dtype=np.float32)
print("Done!")

Done!
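
The hyperparameters above map onto the alternating updates that give ALS its name. Below is a minimal NumPy sketch under simplifying assumptions: every cell of a small dense matrix is treated as observed, and the confidence weighting used for implicit feedback is left out, so this is not what `implicit` runs internally. The function name `als_dense` and the toy matrix `R` are hypothetical.

```python
import numpy as np

def als_dense(R, factors=2, regularization=0.01, iterations=20, seed=0):
    """Simplified ALS: factorize a dense matrix R (users x items) as U @ V.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    reg = regularization * np.eye(factors)

    for _ in range(iterations):
        # Hold item factors fixed and solve a ridge regression for all users
        U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T
        # Hold user factors fixed and solve a ridge regression for all items
        V = np.linalg.solve(U.T @ U + reg, U.T @ R).T
    return U, V

# Two groups of users with disjoint tastes (values are made up)
R = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])
U, V = als_dense(R, factors=2, regularization=0.01, iterations=20)
print(np.round(U @ V.T, 2))
```

Each half-step has a closed-form solution, which is why `iterations` counts full alternations rather than gradient steps.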


In [374]:
# Train the model
als_model.fit(csr_data)

  0%|          | 0/20 [00:00<?, ?it/s]

---

## **06. Checking the trained model**

In [375]:
# Helper that looks up a movie's id by its exact title
def movie_name_to_id(name):
    return new_movies[new_movies['title']==name]['movie_id'].values[0]

print("Done!")

Done!
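
`movie_name_to_id` indexes `.values[0]`, so a title that is missing or misspelled raises an `IndexError`. A sketch of a more forgiving lookup, assuming exact title matching is still what is wanted (`find_movie_id` and `toy_movies` are hypothetical names used only for illustration):

```python
import pandas as pd

# Toy stand-in for the movies frame
toy_movies = pd.DataFrame({
    'movie_id': [1, 2],
    'title': ['Toy Story (1995)', 'Jumanji (1995)'],
})

def find_movie_id(movies_df, title):
    """Return the movie_id for an exact title match, or None if the title is absent."""
    matches = movies_df.loc[movies_df['title'] == title, 'movie_id']
    if matches.empty:
        return None
    return int(matches.iloc[0])

found = find_movie_id(toy_movies, 'Toy Story (1995)')
missing = find_movie_id(toy_movies, 'Dunkirk')
print(found, missing)
```

Returning `None` lets the caller decide whether a missing title is an error or simply a movie that still has to be added.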


In [376]:
my_vector = als_model.user_factors[6041]
in_time = als_model.item_factors[movie_name_to_id('In Time')]

print("Done!")

Done!


In [377]:
my_vector

array([ 7.4394983e-03, -9.0810787e-03, -8.1397435e-03, -9.7019300e-03,
        9.5495244e-04,  2.4862655e-03, -1.0287154e-03,  1.1384330e-02,
        1.0265752e-02, -2.2273099e-03,  5.7821985e-06,  6.8017752e-03,
        6.0402439e-04, -1.4512817e-02,  9.9828020e-03,  1.7585456e-02,
        1.0231132e-02,  6.5308698e-03, -1.7749913e-03,  1.2240225e-03,
       -4.9522775e-03, -1.2486773e-02,  7.0962560e-04, -9.5520038e-03,
       -1.8711858e-03, -3.3248166e-04, -7.9490207e-03,  2.0686362e-03,
        1.5943352e-02,  6.3145272e-03,  1.0280245e-03, -3.5898730e-03,
        5.4706279e-03, -4.5025803e-04,  1.0798507e-03, -8.0868294e-03,
        8.6225076e-03, -8.1410678e-03,  9.6352363e-04, -1.4228067e-02,
       -5.2207345e-03,  6.3038309e-04,  4.5926081e-05,  8.4217712e-03,
        3.3220847e-03,  9.1559142e-03,  4.2440621e-03, -8.1496825e-03,
       -1.6741534e-03, -9.5359627e-03, -2.0623399e-02,  1.5048973e-02,
        9.3879737e-03, -4.5395773e-03, -3.4989712e-03,  3.3811408e-03,
      

In [378]:
in_time

array([ 3.16440739e-04,  7.25033751e-05,  6.71900780e-05, -4.64333316e-05,
        2.06856101e-04,  2.59242253e-04,  1.70856845e-04,  3.63589992e-04,
        3.24031396e-04,  1.77308742e-04,  1.78054528e-04,  2.27754426e-04,
        2.11495470e-04, -8.09225894e-05,  1.91532759e-04,  4.50412801e-04,
        3.54381540e-04,  2.48031021e-04,  1.87092024e-04,  2.40243549e-04,
        1.72674016e-04,  3.36341191e-06,  1.90173625e-04, -9.85157112e-06,
        1.56106500e-04,  1.12766669e-04,  3.88699118e-05,  2.33482890e-04,
        5.13888022e-04,  2.20923859e-04,  2.67173571e-04,  1.23977297e-04,
        2.30881749e-04,  1.01812991e-04,  2.13878200e-04,  7.41099575e-05,
        3.62967665e-04, -1.19251581e-05,  2.10519051e-04, -5.79794505e-05,
        7.20791068e-05,  1.10737288e-04,  1.12004069e-04,  1.94212422e-04,
        1.86292804e-04,  3.81930731e-04,  2.13041494e-04,  1.41924349e-04,
        1.17547708e-04,  5.27089869e-05, -8.62300949e-05,  4.54321824e-04,
        3.02761386e-04,  

In [379]:
np.dot(my_vector, in_time)

0.00010176865
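
The dot product of a user vector and an item vector serves as the preference score, and ranking every item by that score yields recommendations. A toy NumPy sketch with made-up vectors shows that a near-zero score can come from a very small item vector (as with `in_time` above, whose entries are around 1e-4) and not only from a mismatch in direction:

```python
import numpy as np

user_vector = np.array([1.0, 0.0])
item_factors = np.array([
    [0.9, 0.1],     # item 0: points the same way as the user
    [0.0, 1.0],     # item 1: orthogonal to the user
    [0.001, 0.0],   # item 2: same direction as the user, but tiny norm
])

# One dot product per item
scores = item_factors @ user_vector

# Item indices ordered from highest to lowest score
ranking = np.argsort(-scores)
print(scores, ranking)
```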

Hmm, the dot product is nearly zero even though I rated this movie 4. Let's see how other movies score.

In [380]:
def mov_vec(movie_name):
    return als_model.item_factors[movie_name_to_id(movie_name)]

print("Done!")

Done!


In [381]:
np.dot(my_vector, mov_vec('Toy Story (1995)'))

-0.00015911096

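The dot product is still very small.

And I do like Toy Story.

All five of my favorites are new to the dataset, so each has only my single rating, and the model seems to have had too little to learn from.

So let's pick favorites from movies already in the dataset and ask for recommendations again.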

---

## **07. Testing the hypothesis**

In [382]:
# New rows to add to ratings
my_ratings2 = [[6041, 1, 5, 0],
               [6041, 2, 5, 0],
               [6041, 3, 5, 0],
               [6041, 4, 5, 0],
               [6041, 5, 5, 0]]

In [383]:
# Add the new rows to ratings
my_rat_df2 = pd.DataFrame(data=my_ratings2, columns=new_rat_col)
new_ratings2 = pd.concat([new_ratings, my_rat_df2], axis=0)

# Reset the index
new_ratings2.reset_index(drop=True, inplace=True)
new_ratings2.tail(8)

Unnamed: 0,user_id,movie_id,counts,timestamp
836480,6041,3955,4,0
836481,6041,3956,4,0
836482,6041,3957,5,0
836483,6041,1,5,0
836484,6041,2,5,0
836485,6041,3,5,0
836486,6041,4,5,0
836487,6041,5,5,0


In [384]:
num_users = new_ratings2['user_id'].nunique()
num_movies = new_ratings2['movie_id'].nunique()

csr_data = csr_matrix((new_ratings2['counts'], (new_ratings2.user_id, new_ratings2.movie_id)))
csr_data

<6042x3958 sparse matrix of type '<class 'numpy.longlong'>'
	with 836488 stored elements in Compressed Sparse Row format>

In [385]:
# Create the model
als_model = AlternatingLeastSquares(factors=factors, regularization=regularization,
                                    use_gpu=use_gpu, iterations=iterations, dtype=np.float32)

# Train the model
als_model.fit(csr_data)

  0%|          | 0/20 [00:00<?, ?it/s]

In [386]:
my_vector = als_model.user_factors[6041]
toy_story = als_model.item_factors[movie_name_to_id('Toy Story (1995)')]

np.dot(my_vector, toy_story)

0.49242544

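Now the score looks reasonable!

This supports the hypothesis that the model could not work properly because it had too little data about my taste.

Since we added the new movies anyway, let's also ask for similar movies.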

In [387]:
# Find movies similar to my favorites

in_time_id = movie_name_to_id('In Time')
life_of_pi_id = movie_name_to_id('Life of Pi')
dunkirk_id = movie_name_to_id('Dunkirk')
dark_knight_id = movie_name_to_id('Batman: the Dark Knight')
the_platform_id = movie_name_to_id('The Platform')

print("Done!")

Done!


In [388]:
similar_movie = als_model.similar_items(in_time_id, N=5)
similar_movie[0]

array([3956, 3955, 3957, 3954, 3953], dtype=int32)

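Hmm... it only returns the five movies I added, starting with the queried movie itself. I am the only user who rated them, so they look similar only to one another.

Let's try the other movies.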

In [391]:
similar_movie = als_model.similar_items(life_of_pi_id, N=5)
similar_movie[0]

array([3955, 3956, 3957, 3954, 3953], dtype=int32)

In [392]:
similar_movie = als_model.similar_items(dunkirk_id, N=5)
similar_movie[0]

array([3954, 3957, 3953, 3955, 3956], dtype=int32)

In [393]:
similar_movie = als_model.similar_items(dark_knight_id, N=5)
similar_movie[0]

array([3957, 3955, 3954, 3956, 3953], dtype=int32)

In [394]:
similar_movie = als_model.similar_items(the_platform_id, N=5)
similar_movie[0]

array([3953, 3954, 3957, 3955, 3956], dtype=int32)
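
Each result above starts with the queried movie's own ID, which is what a cosine-similarity ranking over item vectors would produce. The sketch below assumes `similar_items` works roughly this way (worth confirming against the installed `implicit` version); `most_similar` and the toy vectors are hypothetical.

```python
import numpy as np

def most_similar(item_factors, item_id, n=3):
    """Rank items by cosine similarity to item_id (the item itself comes first)."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-10  # avoid dividing by zero for all-zero vectors
    sims = item_factors @ item_factors[item_id] / (norms * norms[item_id])
    order = np.argsort(-sims)[:n]
    return order, sims[order]

item_factors = np.array([
    [1.0, 0.0],        # item 0
    [0.002, 0.0002],   # item 1: tiny norm, nearly the same direction as item 0
    [0.0, 1.0],        # item 2: orthogonal to item 0
])

ids, sims = most_similar(item_factors, item_id=0)
print(ids, np.round(sims, 3))
```

Because cosine similarity ignores vector length, barely trained items can still rank as highly similar to one another, which fits the five newly added movies always returning each other.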

It still doesn't seem to know my taste very well.

Can it at least recommend movies I might like?

In [395]:
# Recommendations for the user
user = 6041
movie_recommended = als_model.recommend(user, csr_data, N=10, filter_already_liked_items=True)
movie_recommended

ValueError: ignored

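It still doesn't work.

Looking up the error message, it is raised when `user_items.shape[0] != len(userid)`.

That check apparently runs only when `filter_already_liked_items=True`, and it fails here most likely because the full matrix is passed instead of one row per requested user. As a workaround, the next cell turns the filter off.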
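
The shape check quoted above suggests that `recommend` wants only the requesting user's row, not the whole matrix. A small SciPy sketch of what that row looks like, on a made-up matrix (`toy` is illustrative; the final commented call assumes `implicit` 0.5 or later and was not run for this write-up):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy user-item matrix: 4 users x 5 items (values are made up)
toy = csr_matrix(np.array([
    [5, 0, 0, 3, 0],
    [0, 4, 0, 0, 0],
    [0, 0, 5, 0, 4],
    [3, 0, 0, 0, 5],
]))

user = 3
user_row = toy[user]          # still a CSR matrix, with exactly one row
print(toy.shape)              # (4, 5): one row per user, fails a "one row per requested user" check
print(user_row.shape)         # (1, 5): matches a single requested user

# Items this user already interacted with, i.e. what the filter would drop
liked = user_row.indices
print(liked.tolist())         # [0, 4]

# With the notebook's objects, the call would then look like:
# als_model.recommend(user, csr_data[user], N=10, filter_already_liked_items=True)
```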



In [396]:
# Recommendations for the user (filter turned off)
user = 6041
movie_recommended = als_model.recommend(user, csr_data, N=10, filter_already_liked_items=False)
movie_recommended

(array([3114,    1, 2355,    2,   34,  317,    3, 3450, 3489,  367],
       dtype=int32),
 array([0.5623268 , 0.49242544, 0.34549582, 0.31780282, 0.27811497,
        0.27285343, 0.24783081, 0.2373155 , 0.23183182, 0.2162874 ],
       dtype=float32))

In [397]:
movie_recommended[0]

array([3114,    1, 2355,    2,   34,  317,    3, 3450, 3489,  367],
      dtype=int32)

In [398]:
for movie_id in movie_recommended[0]:
    # .values[0] extracts the title string instead of printing a whole Series
    print(movies.loc[movies['movie_id'] == movie_id, 'title'].values[0])

Toy Story 2 (1999)
Toy Story (1995)
Bug's Life, A (1998)
Jumanji (1995)
Babe (1995)
Santa Clause, The (1994)
Grumpier Old Men (1995)
Grumpy Old Men (1993)
Hook (1991)
Mask, The (1994)


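I know Jumanji, Toy Story, A Bug's Life, and The Mask.

They happen to match my taste, too. (Toy Story, Jumanji, and Grumpier Old Men show up because I rated them myself and the filter is off.)

The rest are older movies that I don't really know.

This looks like the Cold Start problem!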

---

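## **08. Retrospective**

- I got to see the cold start problem firsthand, though at first I had no idea what was going wrong and was stuck for a while.

- Digging through the library's code on GitHub took a long time. I also lost a good while because I didn't know exactly what the pandas methods return.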