### E7 Project: Movie Recommendation

- Objectives
    - Understand the concept and purpose of recommender systems.
    - Build a Matrix Factorization based recommendation model with the Implicit library.
    - Get familiar with the CSR matrix.

In [208]:
import pandas as pd
from scipy.sparse import csr_matrix

#### 01 Data Preparation and Preprocessing

In [209]:
import os
rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python')
orginal_data_size = len(ratings)
ratings.head()

| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |


- Keep only ratings of 3 or higher and drop the rest.

In [210]:
ratings = ratings[ratings['rating']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


- Rename `rating` to `count` so that the values can be treated as implicit feedback.
- Drop the `timestamp` column because it is not needed.

In [211]:
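# Reassign instead of renaming in place: `ratings` is a filtered slice, and in-place edits on it can trigger SettingWithCopyWarning.
ratings = ratings.rename(columns={'rating':'count'})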
select_col = ['user_id', 'movie_id', 'count']
ratings_df = ratings[select_col]
ratings_df.tail()

| | user_id | movie_id | count |
|---|---|---|---|
| 1000203 | 6040 | 1090 | 3 |
| 1000205 | 6040 | 1094 | 5 |
| 1000206 | 6040 | 562 | 5 |
| 1000207 | 6040 | 1096 | 4 |
| 1000208 | 6040 | 1097 | 4 |


In [212]:
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre'] 
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python')
movies.head()

| | movie_id | title | genre |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [213]:
merge_df = pd.merge(ratings_df, movies)
merge_df.tail()

| | user_id | movie_id | count | title | genre |
|---|---|---|---|---|---|
| 836473 | 5851 | 3607 | 5 | One Little Indian (1973) | Comedy\|Drama\|Western |
| 836474 | 5854 | 3026 | 4 | Slaughterhouse (1987) | Horror |
| 836475 | 5854 | 690 | 3 | Promise, The (Versprechen, Das) (1994) | Romance |
| 836476 | 5938 | 2909 | 4 | Five Wives, Three Secretaries and Me (1998) | Documentary |
| 836477 | 5948 | 1360 | 5 | Identification of a Woman (Identificazione di ... | Drama |


#### 02 Analysis

- Number of unique movies in ratings

In [214]:
movie_unique = merge_df['title'].unique()
merge_df['title'].nunique()


3628

- Number of unique users in ratings

In [215]:
user_unique = merge_df['user_id'].unique()
merge_df['user_id'].nunique()

6039

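- Top 30 most popular movies, in order of popularity (note that the code ranks titles by mean `count`, so rarely rated titles with perfect scores come first)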

In [216]:
# Rank titles by mean rating (stored in `count`); numeric_only skips the string `genre` column.
count_mean_df = merge_df.groupby('title').mean(numeric_only=True).sort_values(by='count', ascending=False)
count_mean_df.reset_index(inplace=True)

In [217]:
popular_ls = count_mean_df['title'][0:30]
pd.DataFrame(popular_ls)


| | title |
|---|---|
| 0 | Ulysses (Ulisse) (1954) |
| 1 | Country Life (1994) |
| 2 | Schlafes Bruder (Brother of Sleep) (1995) |
| 3 | Foreign Student (1994) |
| 4 | Follow the Bitch (1998) |
| 5 | One Little Indian (1973) |
| 6 | Criminal Lovers (Les Amants Criminels) (1999) |
| 7 | Message to Love: The Isle of Wight Festival (1... |
| 8 | Identification of a Woman (Identificazione di ... |
| 9 | Late Bloomers (1996) |
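
Sorting by the mean of `count` favors titles that received only a few high ratings. A minimal sketch of ranking by the number of ratings instead is shown here on a made-up toy frame; `toy_df`, `num_ratings`, and `mean_rating` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for merge_df; the rows are invented for illustration.
toy_df = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 2, 4],
    'title':   ['A', 'A', 'A', 'B', 'B', 'C'],
    'count':   [4, 3, 4, 5, 4, 5],
})

# Per title: how many users rated it, and the mean rating.
stats = toy_df.groupby('title')['count'].agg(num_ratings='count', mean_rating='mean')

# Popularity measured as the number of ratings.
by_popularity = stats.sort_values('num_ratings', ascending=False)

# Mean rating alone puts the single-rating title 'C' first.
by_mean = stats.sort_values('mean_rating', ascending=False)

print(by_popularity.index.tolist())
print(by_mean.index.tolist())
```

On the real data the same aggregation could be applied to `merge_df`, followed by `.head(30)`.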


#### 03 Pick five movies I like and add them to the ratings

In [218]:
my_movies = ['Titanic (1953)', 'Small Soldiers (1998)', 'Toy Story 2 (1999)', 'Miss Julie (1999)', 'Terminator 2: Judgment Day (1991)']
counts = [5, 4, 4, 5, 4]
genre = ['Drama|Romance', "Animation|Children's|Fantasy|War", "Animation|Children's|Comedy", 'Drama', 'Action|Sci-Fi|Thriller']
my_movies_df = pd.DataFrame({'user_id':[6041]*5, 'title':my_movies, 'count':counts,'genre':genre})
my_movies_df

| | user_id | title | count | genre |
|---|---|---|---|---|
| 0 | 6041 | Titanic (1953) | 5 | Drama\|Romance |
| 1 | 6041 | Small Soldiers (1998) | 4 | Animation\|Children's\|Fantasy\|War |
| 2 | 6041 | Toy Story 2 (1999) | 4 | Animation\|Children's\|Comedy |
| 3 | 6041 | Miss Julie (1999) | 5 | Drama |
| 4 | 6041 | Terminator 2: Judgment Day (1991) | 4 | Action\|Sci-Fi\|Thriller |


In [219]:
columns = ['user_id', 'title', 'count', 'genre']
total_df = merge_df[columns]
total_df = pd.concat([total_df, my_movies_df], ignore_index=True)
total_df.tail(10)

| | user_id | title | count | genre |
|---|---|---|---|---|
| 836473 | 5851 | One Little Indian (1973) | 5 | Comedy\|Drama\|Western |
| 836474 | 5854 | Slaughterhouse (1987) | 4 | Horror |
| 836475 | 5854 | Promise, The (Versprechen, Das) (1994) | 3 | Romance |
| 836476 | 5938 | Five Wives, Three Secretaries and Me (1998) | 4 | Documentary |
| 836477 | 5948 | Identification of a Woman (Identificazione di ... | 5 | Drama |
| 836478 | 6041 | Titanic (1953) | 5 | Drama\|Romance |
| 836479 | 6041 | Small Soldiers (1998) | 4 | Animation\|Children's\|Fantasy\|War |
| 836480 | 6041 | Toy Story 2 (1999) | 4 | Animation\|Children's\|Comedy |
| 836481 | 6041 | Miss Julie (1999) | 5 | Drama |
| 836482 | 6041 | Terminator 2: Judgment Day (1991) | 4 | Action\|Sci-Fi\|Thriller |


In [220]:
genre_unique = total_df['genre'].unique()
user_unique = total_df['user_id'].unique()
movie_unique = total_df['title'].unique()
genre_to_idx = {v:k for k, v in enumerate(genre_unique)}
user_to_idx = {v:k for k, v in enumerate(user_unique)}
movie_to_idx = {v:k for k, v in enumerate(movie_unique)}

In [221]:
temp_genre_data = total_df['genre'].map(genre_to_idx).dropna()
temp_user_data = total_df['user_id'].map(user_to_idx).dropna()
temp_movie_data = total_df['title'].map(movie_to_idx).dropna()
if len(temp_genre_data) == len(total_df):
    print('genre column indexing OK!!')
    total_df['genre'] = temp_genre_data
else:
    print('genre column indexing Fail!!')

if len(temp_user_data) == len(total_df):
    print('user_id column indexing OK!!')
    total_df['user_id'] = temp_user_data
else:
    print('user_id column indexing Fail!!')

if len(temp_movie_data) == len(total_df):
    print('movie_id column indexing OK!!')
    total_df['movie_id'] = temp_movie_data
else:
    print('movie_id column indexing Fail!!')

total_df

genre column indexing OK!!
user_id column indexing OK!!
movie_id column indexing OK!!


| | user_id | title | count | genre | movie_id |
|---|---|---|---|---|---|
| 0 | 0 | One Flew Over the Cuckoo's Nest (1975) | 5 | 0 | 0 |
| 1 | 1 | One Flew Over the Cuckoo's Nest (1975) | 5 | 0 | 0 |
| 2 | 2 | One Flew Over the Cuckoo's Nest (1975) | 4 | 0 | 0 |
| 3 | 3 | One Flew Over the Cuckoo's Nest (1975) | 4 | 0 | 0 |
| 4 | 4 | One Flew Over the Cuckoo's Nest (1975) | 5 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 836478 | 6039 | Titanic (1953) | 5 | 18 | 1626 |
| 836479 | 6039 | Small Soldiers (1998) | 4 | 271 | 1736 |
| 836480 | 6039 | Toy Story 2 (1999) | 4 | 3 | 50 |
| 836481 | 6039 | Miss Julie (1999) | 5 | 0 | 3319 |
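
The dictionary-based indexing above assigns each distinct value an integer in order of first appearance. As a hedged alternative, `pd.factorize` produces the same kind of codes in one call; the sketch below compares the two on a made-up frame (`toy_df` is illustrative only).

```python
import pandas as pd

# Toy stand-in for total_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'user_id': [10, 10, 42, 7, 42],
    'title':   ['A', 'B', 'A', 'C', 'C'],
})

# Dictionary approach, as in the cells above.
user_to_idx = {v: k for k, v in enumerate(toy_df['user_id'].unique())}
mapped = toy_df['user_id'].map(user_to_idx)

# pd.factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(toy_df['user_id'])

print(mapped.tolist())
print(codes.tolist())
print(list(uniques))
```

`uniques[code]` recovers the original value, so it can play the role of the reverse lookup dictionary.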


In [222]:
num_user = total_df['user_id'].nunique()
num_movie = total_df['title'].nunique()

#### 04 Building the CSR matrix
 - Based on count

In [223]:
csr_data = csr_matrix((total_df['count'], (total_df.user_id, total_df.movie_id)), shape=(num_user, num_movie))
csr_data

<6040x3628 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>
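
To make the `csr_matrix((data, (row, col)), shape=...)` call above more concrete, here is a small sketch with three made-up (user, movie, count) triples. It prints the dense view and the three arrays that CSR stores internally.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three invented (user, movie, count) triples for a 2-user x 3-movie matrix.
rows = np.array([0, 0, 1])   # user indices
cols = np.array([0, 2, 1])   # movie indices
vals = np.array([5, 3, 4])   # counts

toy_csr = csr_matrix((vals, (rows, cols)), shape=(2, 3))

# Dense view, only sensible for tiny matrices.
print(toy_csr.toarray())

# CSR keeps the non-zero values, their column indices, and row pointers.
print(toy_csr.data, toy_csr.indices, toy_csr.indptr)
```

Row `i` owns the slice `data[indptr[i]:indptr[i+1]]`, which is why only the 836,483 stored entries take up memory rather than all 6040 x 3628 cells.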

In [224]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# Settings recommended by the implicit library; they are unrelated to the learning content itself.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [225]:
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=20, dtype=np.float32)

In [226]:
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>
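
The output above reports "Compressed Sparse Column format" because transposing a CSR matrix in SciPy yields a CSC matrix. Whether the installed `implicit` version needs CSR input is an assumption worth checking; if it does, a conversion along these lines would apply (toy data, illustrative names).

```python
from scipy.sparse import csr_matrix

# Tiny user x item matrix, invented for illustration.
toy_csr = csr_matrix([[5, 0, 3], [0, 4, 0]])

# Transposing CSR gives CSC without copying the underlying arrays.
toy_t = toy_csr.T
print(toy_t.format)

# Explicit conversion back to CSR for code that expects it.
toy_t_csr = toy_t.tocsr()
print(toy_t_csr.format, toy_t_csr.shape)
```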

#### 05 Model Training

In [227]:
als_model.fit(csr_data_transpose)





In [228]:
toy_story2 = movie_to_idx['Toy Story 2 (1999)']

In [229]:
toy_stort2_vector = als_model.item_factors[toy_story2]

In [230]:
me = user_to_idx[6041]
me_vector = als_model.user_factors[me]

#### 06 My preferences as predicted by the model

- A movie I like

In [231]:
np.dot(me_vector, toy_stort2_vector)

0.50424635

- A movie outside my list

In [232]:
star_wars =movie_to_idx['Star Wars: Episode I - The Phantom Menace (1999)']

In [233]:
star_wars_vector = als_model.item_factors[star_wars]

In [234]:
np.dot(me_vector, star_wars_vector)

0.08793535
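
Both values above are dot products between one user vector and one item vector. The sketch below uses random stand-in factors (not the trained model) to show that the same operation, applied to every item at once, yields a full preference ranking for a user.

```python
import numpy as np

# Random stand-ins for als_model.user_factors and als_model.item_factors.
rng = np.random.default_rng(0)
user_factors = rng.normal(size=(3, 4)).astype(np.float32)  # 3 users, 4 latent factors
item_factors = rng.normal(size=(5, 4)).astype(np.float32)  # 5 items, 4 latent factors

user = 1

# Predicted preference for a single (user, item) pair.
single = np.dot(user_factors[user], item_factors[2])

# Predicted preference for every item in one matrix-vector product.
all_scores = item_factors @ user_factors[user]

# Item indices ordered from highest to lowest predicted preference.
ranking = np.argsort(-all_scores)

print(single, all_scores, ranking)
```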

In [235]:
similar_movies = als_model.similar_items(toy_story2, N = 15)
similar_movies

[(50, 0.24838728),
 (40, 0.19701295),
 (4, 0.1691786),
 (851, 0.10528524),
 (322, 0.09318946),
 (1706, 0.09312941),
 (474, 0.088070214),
 (126, 0.08670205),
 (1635, 0.085579604),
 (33, 0.08351587),
 (32, 0.08239664),
 (16, 0.07620292),
 (110, 0.074123695),
 (2137, 0.073202126),
 (1996, 0.069461524)]

- Movies similar to Toy Story 2

In [236]:
# Reverse lookup from movie index back to title.
idx_to_movie = {v:k for k, v in movie_to_idx.items()}
[idx_to_movie[i[0]] for i in similar_movies]

['Toy Story 2 (1999)',
 'Toy Story (1995)',
 "Bug's Life, A (1998)",
 'Iron Giant, The (1999)',
 'Babe (1995)',
 'Stuart Little (1999)',
 'Chicken Run (2000)',
 'Shakespeare in Love (1998)',
 'Dinosaur (2000)',
 'Aladdin (1992)',
 'Hercules (1997)',
 'Tarzan (1999)',
 'Groundhog Day (1993)',
 'Tigger Movie, The (2000)',
 'George of the Jungle (1997)']
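
Item similarity of this kind is usually computed from the item factor vectors. The sketch below assumes plain cosine similarity; the scores printed earlier give Toy Story 2 a self-similarity of about 0.25 rather than 1, so the installed `implicit` version evidently normalizes differently. `cosine_similar_items` is a hypothetical helper, not a library function.

```python
import numpy as np

def cosine_similar_items(item_factors, item_id, n=3):
    """Return the n items with the highest cosine similarity to item_id."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1e-10  # guard against division by zero
    scores = item_factors @ item_factors[item_id] / (norms * norms[item_id])
    top = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in top]

# Invented 2-dimensional item factors for illustration.
toy_item_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

similar = cosine_similar_items(toy_item_factors, 0)
print(similar)
```

As in the notebook output, the query item itself comes back first because nothing is more similar to a vector than the vector itself.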

- Movies recommended for me

In [237]:
movie_recommended = als_model.recommend(me,csr_data, N=20, filter_already_liked_items=True)
idx_to_movie = {v:k for k, v in movie_to_idx.items()}
[idx_to_movie[i[0]] for i in movie_recommended]

['Toy Story (1995)',
 "Bug's Life, A (1998)",
 'Matrix, The (1999)',
 'Iron Giant, The (1999)',
 'Fugitive, The (1993)',
 'Jurassic Park (1993)',
 'Total Recall (1990)',
 'Terminator, The (1984)',
 'Hunt for Red October, The (1990)',
 'Aladdin (1992)',
 'Braveheart (1995)',
 'Lion King, The (1994)',
 'Men in Black (1997)',
 'Chicken Run (2000)',
 'Mask, The (1994)',
 'Groundhog Day (1993)',
 'Nightmare Before Christmas, The (1993)',
 'Beauty and the Beast (1991)',
 'North by Northwest (1959)',
 'Nikita (La Femme Nikita) (1990)']

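- Retrain the model using genre instead of count (at this point `genre` holds the integer index from `genre_to_idx`, so the matrix values are category IDs rather than preference strengths)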

In [238]:
csr_data = csr_matrix((total_df['genre'], (total_df.user_id, total_df.movie_id)), shape=(num_user, num_movie))
csr_data

<6040x3628 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Row format>

In [239]:
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=20, dtype=np.float32)

In [240]:
csr_data_transpose = csr_data.T
csr_data_transpose

<3628x6040 sparse matrix of type '<class 'numpy.int64'>'
	with 836483 stored elements in Compressed Sparse Column format>

In [241]:
als_model.fit(csr_data_transpose)





In [242]:
toy_story2 = movie_to_idx['Toy Story 2 (1999)']

In [243]:
toy_stort2_vector = als_model.item_factors[toy_story2]
me = user_to_idx[6041]
me_vector = als_model.user_factors[me]

- A movie I like

In [244]:
np.dot(me_vector, toy_stort2_vector)

0.17622632

- A movie outside my list

In [245]:
star_wars =movie_to_idx['Star Wars: Episode I - The Phantom Menace (1999)']
star_wars_vector = als_model.item_factors[star_wars]
np.dot(me_vector, star_wars_vector)

0.40741587

- Movies similar to Toy Story 2

In [249]:
similar_movies = als_model.similar_items(toy_story2, N = 15)
# Reverse lookup from movie index back to title.
idx_to_movie = {v:k for k, v in movie_to_idx.items()}
[idx_to_movie[i[0]] for i in similar_movies]

['Toy Story 2 (1999)',
 'Toy Story (1995)',
 "Bug's Life, A (1998)",
 'Iron Giant, The (1999)',
 'Tarzan (1999)',
 'Mulan (1998)',
 'Aladdin (1992)',
 'Fantasia 2000 (1999)',
 'Chicken Run (2000)',
 'Beauty and the Beast (1991)',
 'Babe (1995)',
 'Lion King, The (1994)',
 'Pleasantville (1998)',
 'Balto (1995)',
 'Hunchback of Notre Dame, The (1996)']

**With genre fed in as the data, many animated films were returned as similar to Toy Story 2.**

- Recommended movies

In [247]:
movie_recommended = als_model.recommend(me,csr_data, N=20, filter_already_liked_items=True)
idx_to_movie = {v:k for k, v in movie_to_idx.items()}
[idx_to_movie[i[0]] for i in movie_recommended]

['Jurassic Park (1993)',
 'Men in Black (1997)',
 'Braveheart (1995)',
 'Matrix, The (1999)',
 'Saving Private Ryan (1998)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Raiders of the Lost Ark (1981)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Forrest Gump (1994)',
 'Stand by Me (1986)',
 'Star Wars: Episode I - The Phantom Menace (1999)',
 'Silence of the Lambs, The (1991)',
 'Fugitive, The (1993)',
 'Titanic (1997)',
 'Mission: Impossible (1996)',
 'Total Recall (1990)',
 'Get Shorty (1995)',
 'From Russia with Love (1963)',
 'Hoop Dreams (1994)']

#### 07 Discussion

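- I suspected that the genres of the movies a user watched would be a more effective basis for recommendation than ratings, so I trained a second model on genre. The two models recommended different movies, although some titles appeared in both lists.
- The genre-based results were close to what I expected, but I did not expect the rating-based results to look as if they also reflected genre.
- The rating-based model was never given genre information, yet its recommendations appear genre-aware. I need to think further about how to check whether this is a coincidence, with popular movies simply happening to share genres, or whether the model actually learned genre.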
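
One possible way to probe the last question, offered as a sketch rather than an established test, is to compare the average similarity of item vectors within the same genre against the average across genres. The code below runs on synthetic factors with two made-up genres; `genre_separation` is a hypothetical helper.

```python
import numpy as np

def genre_separation(item_factors, genre_labels):
    """Return (mean cosine similarity within a genre, mean across genres)."""
    unit = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
    sim = unit @ unit.T
    labels = np.asarray(genre_labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # drop self-similarity
    within = sim[same & off_diag].mean()
    across = sim[~same].mean()
    return within, across

# Synthetic factors: two invented genres clustered around different directions.
rng = np.random.default_rng(0)
animation = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(20, 2))
thriller = rng.normal(loc=[0.0, 1.0], scale=0.1, size=(20, 2))
factors = np.vstack([animation, thriller])
labels = ['Animation'] * 20 + ['Thriller'] * 20

within, across = genre_separation(factors, labels)
print(within, across)
```

On the trained model, `als_model.item_factors` and one genre label per movie would take the place of the synthetic inputs. A within-genre value clearly above the across-genre value, and above the same statistic computed with shuffled labels, would suggest that the factors encode genre rather than popularity alone.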
