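# MovieLens movie recommendation
## MovieLens 1M Dataset
- MovieLens provides user ratings of movies in several dataset sizes.
- Star ratings are a typical example of explicit data, but they can be treated as implicit data for testing.
- Here the star rating is interpreted as a **play count**.
- Ratings below 3 are assumed to mean the user did not like the movie, so they are excluded.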

## Data preparation

In [1]:
import pandas as pd
import os

rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
...,...,...,...,...
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648


In [2]:
# Keep only ratings of 3 or higher.
ratings = ratings[ratings['rating']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [3]:
# Rename the rating column to count.
ratings.rename(columns={'rating':'count'}, inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
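
The warning above appears because `ratings` was produced by filtering another DataFrame, so pandas cannot tell whether the in-place rename touches a copy or a view. The following is a minimal sketch of one common way to avoid it: take an explicit `.copy()` after filtering and use the non-in-place form of `rename`. The `toy` table and its values are made up for illustration.

```python
import pandas as pd

# Toy ratings table (values are made up for illustration).
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'movie_id': [10, 20, 10, 30],
    'rating': [5, 2, 3, 1],
})

# Take an explicit copy of the filtered rows so that later edits
# clearly apply to a new, independent DataFrame.
kept = toy[toy['rating'] >= 3].copy()

# rename without inplace=True returns a new DataFrame.
kept = kept.rename(columns={'rating': 'count'})

print(kept)
```

The original `toy` table keeps its `rating` column; only `kept` is renamed.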


In [4]:
ratings['count']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: count, Length: 836478, dtype: int64

In [5]:
# Load the movie metadata so that the titles can be shown.
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'movie', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies

Unnamed: 0,movie_id,movie,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


In [6]:
# Merge the ratings with the movie metadata
data = pd.merge(ratings, movies, on='movie_id')
data

Unnamed: 0,user_id,movie_id,count,timestamp,movie,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...
836473,5851,3607,5,957756608,One Little Indian (1973),Comedy|Drama|Western
836474,5854,3026,4,958346883,Slaughterhouse (1987),Horror
836475,5854,690,3,957744257,"Promise, The (Versprechen, Das) (1994)",Romance
836476,5938,2909,4,957273353,"Five Wives, Three Secretaries and Me (1998)",Documentary


In [7]:
# Drop the columns that are no longer needed
data = data.drop(columns=['movie_id', 'timestamp'])
data

Unnamed: 0,user_id,count,movie,genre
0,1,5,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,5,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,4,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,4,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,5,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...
836473,5851,5,One Little Indian (1973),Comedy|Drama|Western
836474,5854,4,Slaughterhouse (1987),Horror
836475,5854,3,"Promise, The (Versprechen, Das) (1994)",Romance
836476,5938,4,"Five Wives, Three Secretaries and Me (1998)",Documentary


In [8]:
# Check which movies the first user has watched
condition = (data['user_id']==data.loc[0, 'user_id'])
data.loc[condition]

Unnamed: 0,user_id,count,movie,genre
0,1,5,One Flew Over the Cuckoo's Nest (1975),Drama
1680,1,3,James and the Giant Peach (1996),Animation|Children's|Musical
2123,1,3,My Fair Lady (1964),Musical|Romance
2734,1,4,Erin Brockovich (2000),Drama
3957,1,5,"Bug's Life, A (1998)",Animation|Children's|Comedy
5556,1,3,"Princess Bride, The (1987)",Action|Adventure|Comedy|Romance
7808,1,5,Ben-Hur (1959),Action|Adventure|Drama
8474,1,5,"Christmas Story, A (1983)",Comedy|Drama
9764,1,4,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical
10471,1,4,"Wizard of Oz, The (1939)",Adventure|Children's|Drama|Musical


It could also be worthwhile to split the `movie` column into title and release year and analyze preferences by decade.
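
As a rough sketch of that idea, the year can be pulled out of the trailing parentheses with `str.extract` and rounded down to a decade before grouping. The `toy` table is made up for illustration, and the regular expression assumes every title ends with a four-digit year in parentheses.

```python
import pandas as pd

# Toy data in the same "Title (Year)" format as the movie column.
toy = pd.DataFrame({
    'movie': ['Toy Story (1995)', 'Ben-Hur (1959)',
              'Promise, The (Versprechen, Das) (1994)', 'My Fair Lady (1964)'],
    'count': [5, 4, 3, 3],
})

# Capture the four-digit year inside the final pair of parentheses.
toy['year'] = toy['movie'].str.extract(r'\((\d{4})\)\s*$', expand=False).astype(int)

# Round the year down to its decade, e.g. 1995 -> 1990.
toy['decade'] = toy['year'] // 10 * 10

# Average count per decade.
decade_mean = toy.groupby('decade')['count'].mean()
print(decade_mean)
```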

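## Data exploration
- Number of unique movies in ratings
- Number of unique users in ratings
- The 30 most popular movies (in order of popularity)
- The 30 most popular genres (in order of popularity)
- Statistics on how many movies each user has watched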

In [9]:
# Number of unique movies in ratings
data['movie'].nunique()

3628

In [10]:
# Number of unique users in ratings
data['user_id'].nunique()

6039

In [11]:
# Most popular movies
movie_count = data.groupby('movie')['user_id'].count()
movie_count.sort_values(ascending=False).head(30)

movie
American Beauty (1999)                                   3211
Star Wars: Episode IV - A New Hope (1977)                2910
Star Wars: Episode V - The Empire Strikes Back (1980)    2885
Star Wars: Episode VI - Return of the Jedi (1983)        2716
Saving Private Ryan (1998)                               2561
Terminator 2: Judgment Day (1991)                        2509
Silence of the Lambs, The (1991)                         2498
Raiders of the Lost Ark (1981)                           2473
Back to the Future (1985)                                2460
Matrix, The (1999)                                       2434
Jurassic Park (1993)                                     2413
Sixth Sense, The (1999)                                  2385
Fargo (1996)                                             2371
Braveheart (1995)                                        2314
Men in Black (1997)                                      2297
Schindler's List (1993)                                  2257
Pr

In [12]:
# Most popular genres
genre_count = data.groupby('genre')['user_id'].count()
genre_count.sort_values(ascending=False).head(30)

genre
Drama                               99388
Comedy                              94264
Comedy|Drama                        36871
Comedy|Romance                      35888
Drama|Romance                       24835
Action|Thriller                     22675
Drama|Thriller                      16133
Horror                              15260
Thriller                            14925
Action|Adventure|Sci-Fi             14277
Drama|War                           13766
Action|Sci-Fi|Thriller              11657
Action|Drama|War                    11316
Crime|Drama                         10960
Action|Sci-Fi                       10594
Action                               9930
Comedy|Drama|Romance                 9804
Action|Adventure                     8744
Action|Drama                         8611
Comedy|Sci-Fi                        7797
Comedy|Horror                        7523
Animation|Children's                 7461
Animation|Children's|Musical         7237
Animation|Children's|Comedy 
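
The counts above treat every pipe-separated combination, such as `Comedy|Drama`, as its own genre. The following is a sketch of counting individual genres instead, by splitting the string and giving each genre its own row with `explode`. The `toy` table is made up for illustration.

```python
import pandas as pd

# Toy data with pipe-separated genre strings.
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'genre': ['Comedy|Drama', 'Drama', 'Action|Comedy'],
})

# Split "A|B" into ['A', 'B'], then give each genre its own row.
exploded = toy.assign(genre=toy['genre'].str.split('|')).explode('genre')

# Count rows per individual genre.
genre_count = exploded.groupby('genre')['user_id'].count().sort_values(ascending=False)
print(genre_count)
```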

In [13]:
# Statistics on how many movies each user has watched
user_count = data.groupby('user_id')['movie'].count()
user_count.describe()

count    6039.000000
mean      138.512668
std       156.241599
min         1.000000
25%        38.000000
50%        81.000000
75%       177.000000
max      1968.000000
Name: movie, dtype: float64

## Adding five of my favorite movies to the ratings

In [14]:
# My favorite movies
my_favorite = ['Star Wars: Episode IV - A New Hope (1977)', 
               'Sixth Sense, The (1999)', 
               'Terminator 2: Judgment Day (1991)', 
               'Men in Black (1997)', 
               'E.T. the Extra-Terrestrial (1982)']

### Checking the genres of the five favorite movies

In [15]:
print(data['genre'].loc[data['movie'] == 'Star Wars: Episode IV - A New Hope (1977)'])

43536    Action|Adventure|Fantasy|Sci-Fi
43537    Action|Adventure|Fantasy|Sci-Fi
43538    Action|Adventure|Fantasy|Sci-Fi
43539    Action|Adventure|Fantasy|Sci-Fi
43540    Action|Adventure|Fantasy|Sci-Fi
                      ...               
46441    Action|Adventure|Fantasy|Sci-Fi
46442    Action|Adventure|Fantasy|Sci-Fi
46443    Action|Adventure|Fantasy|Sci-Fi
46444    Action|Adventure|Fantasy|Sci-Fi
46445    Action|Adventure|Fantasy|Sci-Fi
Name: genre, Length: 2910, dtype: object


In [16]:
print(data['genre'].loc[data['movie'] == 'Sixth Sense, The (1999)'])

35022    Thriller
35023    Thriller
35024    Thriller
35025    Thriller
35026    Thriller
           ...   
37402    Thriller
37403    Thriller
37404    Thriller
37405    Thriller
37406    Thriller
Name: genre, Length: 2385, dtype: object


In [17]:
print(data['genre'].loc[data['movie'] == 'Terminator 2: Judgment Day (1991)'])

89415    Action|Sci-Fi|Thriller
89416    Action|Sci-Fi|Thriller
89417    Action|Sci-Fi|Thriller
89418    Action|Sci-Fi|Thriller
89419    Action|Sci-Fi|Thriller
                  ...          
91919    Action|Sci-Fi|Thriller
91920    Action|Sci-Fi|Thriller
91921    Action|Sci-Fi|Thriller
91922    Action|Sci-Fi|Thriller
91923    Action|Sci-Fi|Thriller
Name: genre, Length: 2509, dtype: object


In [18]:
print(data['genre'].loc[data['movie'] == 'Men in Black (1997)'])

166207    Action|Adventure|Comedy|Sci-Fi
166208    Action|Adventure|Comedy|Sci-Fi
166209    Action|Adventure|Comedy|Sci-Fi
166210    Action|Adventure|Comedy|Sci-Fi
166211    Action|Adventure|Comedy|Sci-Fi
                       ...              
168499    Action|Adventure|Comedy|Sci-Fi
168500    Action|Adventure|Comedy|Sci-Fi
168501    Action|Adventure|Comedy|Sci-Fi
168502    Action|Adventure|Comedy|Sci-Fi
168503    Action|Adventure|Comedy|Sci-Fi
Name: genre, Length: 2297, dtype: object


In [19]:
print(data['genre'].loc[data['movie'] == 'E.T. the Extra-Terrestrial (1982)'])

27021    Children's|Drama|Fantasy|Sci-Fi
27022    Children's|Drama|Fantasy|Sci-Fi
27023    Children's|Drama|Fantasy|Sci-Fi
27024    Children's|Drama|Fantasy|Sci-Fi
27025    Children's|Drama|Fantasy|Sci-Fi
                      ...               
29118    Children's|Drama|Fantasy|Sci-Fi
29119    Children's|Drama|Fantasy|Sci-Fi
29120    Children's|Drama|Fantasy|Sci-Fi
29121    Children's|Drama|Fantasy|Sci-Fi
29122    Children's|Drama|Fantasy|Sci-Fi
Name: genre, Length: 2102, dtype: object
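
Each cell above prints one row per rating, so the same genre string repeats thousands of times. As a sketch of a more compact lookup, one row per movie can be kept with `drop_duplicates` and all favorites looked up at once. The `toy` table and the `favorites` list here are made up for illustration.

```python
import pandas as pd

# Toy merged table: several rating rows can share the same movie.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'movie': ['Men in Black (1997)', 'Men in Black (1997)',
              'Sixth Sense, The (1999)', 'Fargo (1996)'],
    'genre': ['Action|Adventure|Comedy|Sci-Fi', 'Action|Adventure|Comedy|Sci-Fi',
              'Thriller', 'Crime|Drama|Thriller'],
})
favorites = ['Men in Black (1997)', 'Sixth Sense, The (1999)']

# Keep one row per movie, then look up the favorites by title.
genre_by_movie = toy.drop_duplicates('movie').set_index('movie')['genre']
favorite_genres = genre_by_movie.loc[favorites]
print(favorite_genres)
```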


### Adding the data

In [20]:
# Assume that user 'hoseong' gave each of the movies above a count of 5
my_count = pd.DataFrame({'user_id' : ['hoseong']*5, 
                         'movie' : my_favorite, 
                         'count' : [5]*5, 
                         'genre' : ['Action|Adventure|Fantasy|Sci-Fi', 
                                    'Thriller', 
                                    'Action|Sci-Fi|Thriller', 
                                    'Action|Adventure|Comedy|Sci-Fi', 
                                    'Children\'s|Drama|Fantasy|Sci-Fi']
                        })
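# If no row with user_id 'hoseong' exists yet, append the my_favorite rows created above.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
if not data.isin({'user_id':['hoseong']})['user_id'].any():
    data = pd.concat([data, my_count])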

data.tail(10)

Unnamed: 0,user_id,count,movie,genre
836473,5851,5,One Little Indian (1973),Comedy|Drama|Western
836474,5854,4,Slaughterhouse (1987),Horror
836475,5854,3,"Promise, The (Versprechen, Das) (1994)",Romance
836476,5938,4,"Five Wives, Three Secretaries and Me (1998)",Documentary
836477,5948,5,Identification of a Woman (Identificazione di ...,Drama
0,hoseong,5,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Fantasy|Sci-Fi
1,hoseong,5,"Sixth Sense, The (1999)",Thriller
2,hoseong,5,Terminator 2: Judgment Day (1991),Action|Sci-Fi|Thriller
3,hoseong,5,Men in Black (1997),Action|Adventure|Comedy|Sci-Fi
4,hoseong,5,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi


## Preprocessing
- Split the `movie` column into title and year.
- Assign an integer index to each user and each movie so they are easier to manage.

https://yganalyst.github.io/data_handling/memo_9/#4-%EB%81%9D%EA%B8%80%EC%9E%90-%EC%9D%B8%EC%8B%9D-strendswith

In [21]:
# Split the movie column on '(' (titles that contain their own parentheses will not split cleanly)
data_split = data['movie'].str.split('(')
data['year'] = data_split.str[1]
data['movie'] = data_split.str[0]
data

Unnamed: 0,user_id,count,movie,genre,year
0,1,5,One Flew Over the Cuckoo's Nest,Drama,1975)
1,2,5,One Flew Over the Cuckoo's Nest,Drama,1975)
2,12,4,One Flew Over the Cuckoo's Nest,Drama,1975)
3,15,4,One Flew Over the Cuckoo's Nest,Drama,1975)
4,17,5,One Flew Over the Cuckoo's Nest,Drama,1975)
...,...,...,...,...,...
0,hoseong,5,Star Wars: Episode IV - A New Hope,Action|Adventure|Fantasy|Sci-Fi,1977)
1,hoseong,5,"Sixth Sense, The",Thriller,1999)
2,hoseong,5,Terminator 2: Judgment Day,Action|Sci-Fi|Thriller,1991)
3,hoseong,5,Men in Black,Action|Adventure|Comedy|Sci-Fi,1997)


In [22]:
# Clean up the split columns
data['year'] = data['year'].str.replace(pat=r'[^\w]', repl=r'', regex=True)
data['movie'] = data['movie'].str.strip()
data

Unnamed: 0,user_id,count,movie,genre,year
0,1,5,One Flew Over the Cuckoo's Nest,Drama,1975
1,2,5,One Flew Over the Cuckoo's Nest,Drama,1975
2,12,4,One Flew Over the Cuckoo's Nest,Drama,1975
3,15,4,One Flew Over the Cuckoo's Nest,Drama,1975
4,17,5,One Flew Over the Cuckoo's Nest,Drama,1975
...,...,...,...,...,...
0,hoseong,5,Star Wars: Episode IV - A New Hope,Action|Adventure|Fantasy|Sci-Fi,1977
1,hoseong,5,"Sixth Sense, The",Thriller,1999
2,hoseong,5,Terminator 2: Judgment Day,Action|Sci-Fi|Thriller,1991
3,hoseong,5,Men in Black,Action|Adventure|Comedy|Sci-Fi,1997
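
Splitting on the first `(` works for most titles, but a title with its own parentheses, such as `Promise, The (Versprechen, Das) (1994)` from the merged table above, ends up with part of the title in `year`. The following is a sketch of one alternative, which splits once from the right with `str.rsplit` so that only the final parenthesized group is treated as the year. It assumes every title ends with `(year)`.

```python
import pandas as pd

titles = pd.Series([
    'Men in Black (1997)',
    'Promise, The (Versprechen, Das) (1994)',
])

# Splitting on the first '(' puts the alternate title into the year slot.
naive_year = titles.str.split('(').str[1]

# Splitting once from the right keeps everything before the last '(' together.
parts = titles.str.rsplit('(', n=1)
title = parts.str[0].str.strip()
year = parts.str[1].str.rstrip(')')

print(naive_year.tolist())
print(title.tolist())
print(year.tolist())
```

Note that switching the notebook to this split would change the set of unique titles, so the counts and indices shown in later cells would differ.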


In [23]:
# Find the unique users and movies
user_unique = data['user_id'].unique()
movie_unique = data['movie'].unique()

# Map each user and movie to an integer index (idx is short for index)
user_to_idx = {v:k for k,v in enumerate(user_unique)}
movie_to_idx = {v:k for k,v in enumerate(movie_unique)}

In [24]:
# Check the mapping
print(user_to_idx['hoseong'])
print(movie_to_idx['Men in Black'])

6039
175
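
The `unique` plus `enumerate` pattern above can also be written with `pd.factorize`, which returns the integer codes and the distinct values in one call. This is a sketch on a made-up `toy_users` series, not a change to the notebook's own mapping.

```python
import pandas as pd

# Toy user column mixing integer ids and a string id, as in the notebook.
toy_users = pd.Series([1, 2, 2, 'hoseong', 1])

# codes: integer index per row; uniques: distinct values in first-seen order.
codes, uniques = pd.factorize(toy_users)

# Equivalent dictionary, built the same way as user_to_idx.
to_idx = {v: k for k, v in enumerate(uniques)}

print(codes.tolist())
print(to_idx)
```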


In [25]:
# Replace the values in each column with their integer indices

temp_user_data = data['user_id'].map(user_to_idx.get).dropna()
if len(temp_user_data) == len(data):
    print('user_id column indexing OK!!')
    data['user_id'] = temp_user_data
else:
    print('user_id column indexing Fail!!')
    
temp_movie_data = data['movie'].map(movie_to_idx.get).dropna()
if len(temp_movie_data) == len(data):
    print('movie column indexing OK!!')
    data['movie'] = temp_movie_data
else:
    print('movie column indexing Fail!!')
    
data

user_id column indexing OK!!
movie column indexing OK!!


Unnamed: 0,user_id,count,movie,genre,year
0,0,5,0,Drama,1975
1,1,5,0,Drama,1975
2,2,4,0,Drama,1975
3,3,4,0,Drama,1975
4,4,5,0,Drama,1975
...,...,...,...,...,...
0,6039,5,44,Action|Adventure|Fantasy|Sci-Fi,1977
1,6039,5,38,Thriller,1999
2,6039,5,92,Action|Sci-Fi|Thriller,1991
3,6039,5,175,Action|Adventure|Comedy|Sci-Fi,1997


## CSR matrix

In [26]:
data.user_id

0       0
1       1
2       2
3       3
4       4
     ... 
0    6039
1    6039
2    6039
3    6039
4    6039
Name: user_id, Length: 836483, dtype: int64

In [27]:
from scipy.sparse import csr_matrix

num_user = data['user_id'].nunique()
num_movie = data['movie'].nunique()

csr_data = csr_matrix((data['count'], (data.user_id, data.movie)), 
                      shape= (num_user, num_movie))
csr_data

<6040x3579 sparse matrix of type '<class 'numpy.longlong'>'
	with 834086 stored elements in Compressed Sparse Row format>
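
The matrix stores 834086 elements although `data` has 836483 rows. One likely reason, stated here as an assumption rather than something verified against the dataset, is that dropping the year makes different releases share one title, so a user can have several rows for the same `(user_id, movie)` pair. When `csr_matrix` is built from coordinate triples, such duplicates are summed into a single entry, as this toy sketch shows.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy triples: user 0 has two rows for movie 1.
rows = np.array([0, 0, 1])
cols = np.array([1, 1, 0])
vals = np.array([3, 4, 5])

m = csr_matrix((vals, (rows, cols)), shape=(2, 2))

print(m.nnz)        # number of stored elements
print(m.toarray())  # dense view of the matrix
```

If summed counts are not wanted, the duplicates would need to be resolved in `data` (for example with a `groupby`) before building the matrix.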

## Training the AlternatingLeastSquares model

In [40]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# Settings recommended by the implicit library. They are unrelated to the training logic itself.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

# Declare the implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=500, 
                                    regularization=0.01, 
                                    use_gpu=False, 
                                    iterations=200, 
                                    dtype=np.float32
                                   )

In [41]:
# The ALS model (in the implicit version used here) takes an item x user matrix as input, so we transpose.
csr_data_transpose = csr_data.T
csr_data_transpose

<3579x6040 sparse matrix of type '<class 'numpy.longlong'>'
	with 834086 stored elements in Compressed Sparse Column format>

In [42]:
# Train the model
als_model.fit(csr_data_transpose)

  0%|          | 0/200 [00:00<?, ?it/s]

## Checking the model's predicted preference for one of my five favorites and for one other movie

In [43]:
hoseong, Men_In_Black = user_to_idx['hoseong'], movie_to_idx['Men in Black']
hoseong_vector, Men_In_Black_vector = als_model.user_factors[hoseong], als_model.item_factors[Men_In_Black]


In [46]:
# Dot product of the hoseong and Men_In_Black vectors
np.dot(hoseong_vector, Men_In_Black_vector)

0.9366808

In [48]:
# Predict whether I would like American Beauty
American_Beauty = movie_to_idx['American Beauty']
American_Beauty_vector = als_model.item_factors[American_Beauty]
np.dot(hoseong_vector, American_Beauty_vector)

0.01606674
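
The two dot products above score one movie at a time. The following is a sketch of scoring every item in a single matrix-vector product and ranking the results. The made-up `user_vector` and `item_factors` stand in for `als_model.user_factors[hoseong]` and `als_model.item_factors`.

```python
import numpy as np

# Made-up latent factors: one user vector and three item vectors (2 factors each).
user_vector = np.array([1.0, 0.5], dtype=np.float32)
item_factors = np.array([
    [0.9, 0.2],   # item 0
    [0.0, 0.1],   # item 1
    [0.4, 0.8],   # item 2
], dtype=np.float32)

# One dot product per item.
scores = item_factors @ user_vector

# Item indices sorted from highest to lowest score.
ranking = np.argsort(-scores)

print(scores)
print(ranking)
```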

## Getting movies similar to the ones I like

In [52]:
idx_to_movie = {v:k for k,v in movie_to_idx.items()}
def get_similar_movie(movie_name: str):
    movie_id = movie_to_idx[movie_name]
    similar_movie = als_model.similar_items(movie_id)
    similar_movie = [idx_to_movie[i[0]] for i in similar_movie]
    return similar_movie

In [55]:
# Get movies similar to Men in Black
get_similar_movie('Men in Black')

['Men in Black',
 'Jurassic Park',
 'Schlafes Bruder',
 'Independence Day',
 'Digimon: The Movie',
 'Terminator 2: Judgment Day',
 'My Life So Far',
 'Total Recall',
 'Man of No Importance, A',
 'Bewegte Mann, Der']

In [56]:
# Get movies similar to Terminator 2: Judgment Day
get_similar_movie('Terminator 2: Judgment Day')

['Terminator 2: Judgment Day',
 'Terminator, The',
 'Matrix, The',
 'Total Recall',
 'City of the Living Dead',
 'Running Free',
 'Schlafes Bruder',
 'Grosse Fatigue',
 'Simon Sez',
 'Sorority House Massacre II']
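
`similar_items` ranks movies by how close their latent vectors are to the query movie's vector, which is why the query itself comes first in both lists. The sketch below ranks by cosine similarity on made-up vectors. It illustrates the idea only and is not implicit's actual implementation; `toy_names`, `toy_factors` and `most_similar` are hypothetical names.

```python
import numpy as np

# Hypothetical item names and latent vectors.
toy_names = ['Movie A', 'Movie B', 'Movie C']
toy_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
])

def most_similar(query_idx, factors, names, n=2):
    # Normalize rows so that a dot product equals cosine similarity.
    unit = factors / np.linalg.norm(factors, axis=1, keepdims=True)
    sims = unit @ unit[query_idx]
    # Take the n items with the highest similarity.
    order = np.argsort(-sims)[:n]
    return [names[i] for i in order]

print(most_similar(0, toy_factors, toy_names))
```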

## Getting recommendations for movies I am most likely to enjoy

In [64]:
user = user_to_idx['hoseong']
# recommend takes the user x item CSR matrix and returns movies the user is likely to enjoy.
movie_recommended = als_model.recommend(user, csr_data, N=20, 
                                        filter_already_liked_items=True)
[idx_to_movie[i[0]] for i in movie_recommended]

['Terminator, The',
 'Matrix, The',
 'Star Wars: Episode I - The Phantom Menace',
 'Fugitive, The',
 'Total Recall',
 'Jurassic Park',
 'Contact',
 'Arachnophobia',
 'Bad Boys',
 'Sleepless in Seattle',
 'Mystery Train',
 'Star Wars: Episode V - The Empire Strikes Back',
 'Talented Mr. Ripley, The',
 'Dead Poets Society',
 'Planet of the Apes',
 'Superman',
 'Alien',
 'Splash',
 'Insider, The',
 'Hackers']
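
The list comprehension above relies on `recommend` returning a list of `(item_idx, score)` pairs. Other implicit releases may instead return a pair of arrays `(ids, scores)`; that layout is an assumption here and should be checked against the installed version. The following is a sketch of a small hypothetical helper, `to_pairs`, that accepts either layout.

```python
import numpy as np

def to_pairs(result):
    """Return [(item_idx, score), ...] from either output layout."""
    # Assumed alternative layout: a tuple of two arrays (ids, scores).
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[0], np.ndarray):
        ids, scores = result
        return [(int(i), float(s)) for i, s in zip(ids, scores)]
    # Layout used in this notebook: a list of (id, score) pairs.
    return [(int(i), float(s)) for i, s in result]

# Made-up results in both layouts.
list_style = [(3, 0.9), (7, 0.5)]
array_style = (np.array([3, 7]), np.array([0.9, 0.5]))

print(to_pairs(list_style))
print(to_pairs(array_style))
```

With such a helper, the title lookup would read `[idx_to_movie[i] for i, _ in to_pairs(movie_recommended)]`.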

In [65]:
# Use the explain method to see how much each of my movies contributed to Hackers being recommended
Hackers = movie_to_idx['Hackers']
explain = als_model.explain(user, csr_data, itemid=Hackers)
[(idx_to_movie[i[0]], i[1]) for i in explain[1]]

[('E.T. the Extra-Terrestrial', 0.0407931228911812),
 ('Terminator 2: Judgment Day', 0.017843543401141488),
 ('Sixth Sense, The', 0.00511948833505832),
 ('Men in Black', 0.0045535627915906625),
 ('Star Wars: Episode IV - A New Hope', -0.0038984926965799534)]