# [E-8] Project: MovieLens Movie Recommendation Practice

## 1. Data Preparation and Preprocessing

### 1-1. Data Preparation

1) Download the data with wget
$ wget http://files.grouplens.org/datasets/movielens/ml-1m.zip

2) Move the downloaded file to the working directory
$ mv ml-1m.zip ~/aiffel/recommendata_iu/data

3) Change to the working directory
$ cd ~/aiffel/recommendata_iu/data

4) Extract the archive
$ unzip ml-1m.zip

### 1-2. Data Preprocessing

> * Interpret the star rating as a play count
> * Assume that a rating below 3 means the user dislikes the movie, and exclude those rows

In [1]:
import pandas as pd
import os

rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python')    
orginal_data_size = len(ratings)

ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [2]:
ratings.tail()

Unnamed: 0,user_id,movie_id,rating,timestamp
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648
1000208,6040,1097,4,956715569


In [3]:
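# Keep only ratings of 3 or higher.

ratings = ratings[ratings['rating']>=3].copy()   # copy so the later in-place rename does not act on a slice of the original frame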
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [4]:
# Rename the rating column to count.

ratings.rename(columns={'rating':'count'}, inplace=True)
ratings['count']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: count, Length: 836478, dtype: int64

Only rows with a play count of 3 to 5 remain.

In [5]:
ratings.describe()

Unnamed: 0,user_id,movie_id,count,timestamp
count,836478.0,836478.0,836478.0,836478.0
mean,3033.120626,1849.099114,3.958293,972162800.0
std,1729.255651,1091.870094,0.76228,12062160.0
min,1.0,1.0,3.0,956703900.0
25%,1531.0,1029.0,3.0,965279500.0
50%,3080.0,1747.0,4.0,972838800.0
75%,4485.0,2763.0,5.0,975206400.0
max,6040.0,3952.0,5.0,1046455000.0


### Result

> 'ratings.dat' consists of 'user_id' (user identifier), 'movie_id' (movie identifier), 'rating' (star rating, treated here as a play count), and 'timestamp', and the IDs are already indexed as integers.
> 'user_id' ranges from 1 to 6040 and 'movie_id' from 1 to 3952.

### 1-3. Merging the Datasets

> To see the movie titles behind the indexed values, merge in the 'movies.dat' dataset.

In [6]:
movies_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
col_names = ['movie_id', 'title', 'genre']   
movies = pd.read_csv(movies_file_path, sep='::', names=col_names, engine='python',  encoding = 'ISO-8859-1')   
orginal_data_size = len(movies)
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
movies.tail()          

Unnamed: 0,movie_id,title,genre
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama
3882,3952,"Contender, The (2000)",Drama|Thriller


In [8]:
movies.describe()

Unnamed: 0,movie_id
count,3883.0
mean,1986.049446
std,1146.778349
min,1.0
25%,982.5
50%,2010.0
75%,2980.5
max,3952.0


'movies.dat' lists 3,883 movies in total, with movie_id values running from 1 to 3952.

In [9]:
merge_data=ratings.merge(movies)
merge_data_sort = merge_data.sort_values(by="user_id")           # sort by user_id in ascending order
merge_data_sort.head()

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
31113,1,2294,4,978824291,Antz (1998),Animation|Children's
31674,1,3186,4,978300019,"Girl, Interrupted (1999)",Drama
32044,1,1566,4,978824330,Hercules (1997),Adventure|Animation|Children's|Comedy|Musical
32415,1,588,4,978824268,Aladdin (1992),Animation|Children's|Comedy|Musical


In [10]:
merge_data_sort.tail()

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
657728,6040,334,4,957717503,Vanya on 42nd Street (1994),Drama
393446,6040,1294,4,957716949,M*A*S*H (1970),Comedy|War
253075,6040,994,3,960972693,Big Night (1996),Drama
127665,6040,2396,3,956704475,Shakespeare in Love (1998),Comedy|Romance
738957,6040,2725,4,997454180,Twin Falls Idaho (1999),Drama


In [11]:
merge_data_sort['title'] = merge_data_sort['title'].str.lower() # lowercase the titles to make searching easier
merge_data_sort.head()

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
0,1,1193,5,978300760,one flew over the cuckoo's nest (1975),Drama
31113,1,2294,4,978824291,antz (1998),Animation|Children's
31674,1,3186,4,978300019,"girl, interrupted (1999)",Drama
32044,1,1566,4,978824330,hercules (1997),Adventure|Animation|Children's|Comedy|Musical
32415,1,588,4,978824268,aladdin (1992),Animation|Children's|Comedy|Musical


In [12]:
merge_data_sort['genre'] = merge_data_sort['genre'].str.lower() # lowercase the genres to make searching easier
merge_data_sort.head()

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
0,1,1193,5,978300760,one flew over the cuckoo's nest (1975),drama
31113,1,2294,4,978824291,antz (1998),animation|children's
31674,1,3186,4,978300019,"girl, interrupted (1999)",drama
32044,1,1566,4,978824330,hercules (1997),adventure|animation|children's|comedy|musical
32415,1,588,4,978824268,aladdin (1992),animation|children's|comedy|musical


### Result

> 'ratings.dat' and 'movies.dat' are merged, which adds the 'title' and 'genre' columns to the ratings.
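
As a sanity check on the merge step, the following sketch uses two toy frames (hypothetical values, not the MovieLens files) to show that `merge` with its defaults performs an inner join on the shared `movie_id` column, so ratings without matching metadata are dropped silently. A left join with `indicator=True` is one way to make such rows visible.

```python
import pandas as pd

# Toy stand-ins for ratings and movies (hypothetical values).
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [10, 20, 30],
    'count': [5, 4, 3],
})
toy_movies = pd.DataFrame({
    'movie_id': [10, 20],
    'title': ['movie a', 'movie b'],
})

# Default merge: inner join on the common column, so movie_id 30 disappears.
inner = toy_ratings.merge(toy_movies)

# Left join keeps every rating; the _merge column flags rows without a match.
left = toy_ratings.merge(toy_movies, on='movie_id', how='left', indicator=True)
unmatched = left[left['_merge'] == 'left_only']

print(len(inner), len(left), unmatched['movie_id'].tolist())
```

If `unmatched` is empty on the real data, the inner join used in In [9] loses no ratings.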

In [13]:
# Check the first user's movies

condition = (merge_data_sort['user_id']== merge_data_sort.loc[0, 'user_id'])
merge_data_sort.loc[condition]

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
0,1,1193,5,978300760,one flew over the cuckoo's nest (1975),drama
31113,1,2294,4,978824291,antz (1998),animation|children's
31674,1,3186,4,978300019,"girl, interrupted (1999)",drama
32044,1,1566,4,978824330,hercules (1997),adventure|animation|children's|comedy|musical
32415,1,588,4,978824268,aladdin (1992),animation|children's|comedy|musical
33643,1,1907,4,978824330,mulan (1998),animation|children's
34086,1,783,4,978824291,"hunchback of notre dame, the (1996)",animation|children's|musical
34399,1,1836,5,978300172,"last days of disco, the (1998)",drama
34497,1,1022,5,978300055,cinderella (1950),animation|children's|musical
35022,1,2762,4,978302091,"sixth sense, the (1999)",thriller


## 2. Analysis

> * Number of unique movies
> * Number of unique users
> * The 30 most popular movies (by popularity)

### 2-1. Basic Statistics

In [14]:
merge_data_sort['user_id'].nunique()        # number of users

6039

In [15]:
merge_data_sort['movie_id'].nunique()       # number of movies

3628

In [16]:
# Statistics on how many movies each user watched

user_count = merge_data_sort.groupby('user_id')['movie_id'].count()
user_count.describe()

count    6039.000000
mean      138.512668
std       156.241599
min         1.000000
25%        38.000000
50%        81.000000
75%       177.000000
max      1968.000000
Name: movie_id, dtype: float64

In [17]:
# Statistics on each user's median play count

user_median = merge_data_sort.groupby('user_id')['count'].median()
user_median.describe()

count    6039.000000
mean        4.055970
std         0.432143
min         3.000000
25%         4.000000
50%         4.000000
75%         4.000000
max         5.000000
Name: count, dtype: float64

### Result

> There are 6,039 users and 3,628 movies in total. Users watched 138.51 movies on average, and the typical per-user median play count is 4. The number of movies per user ranges from 1 to 1,968, and play counts range from 3 to 5.
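
The two per-user statistics above can also be computed in a single pass. A minimal sketch on toy data (hypothetical values), using pandas named aggregation; the output column names `n_movies` and `median_count` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'movie_id': [10, 20, 30, 10, 40],
    'count': [5, 4, 3, 3, 5],
})

# One pass over the groups: number of movies and median play count per user.
per_user = toy.groupby('user_id').agg(
    n_movies=('movie_id', 'count'),
    median_count=('count', 'median'),
)
print(per_user)
```

Calling `per_user.describe()` would then summarize both columns at once.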

### 2-2. Top 30 Most Popular Movies

In [18]:
movies['title'] = movies['title'].str.lower() # lowercase the titles to make searching easier
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,toy story (1995),Animation|Children's|Comedy
1,2,jumanji (1995),Adventure|Children's|Fantasy
2,3,grumpier old men (1995),Comedy|Romance
3,4,waiting to exhale (1995),Comedy|Drama
4,5,father of the bride part ii (1995),Comedy


In [19]:
movies['genre'] = movies['genre'].str.lower() # lowercase the genres to make searching easier
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,toy story (1995),animation|children's|comedy
1,2,jumanji (1995),adventure|children's|fantasy
2,3,grumpier old men (1995),comedy|romance
3,4,waiting to exhale (1995),comedy|drama
4,5,father of the bride part ii (1995),comedy


In [20]:
# The 30 most popular movies

movie_count = merge_data_sort.groupby('title')['user_id'].count()
movie_count.sort_values(ascending=False).head(30)              # sort in descending order

title
american beauty (1999)                                   3211
star wars: episode iv - a new hope (1977)                2910
star wars: episode v - the empire strikes back (1980)    2885
star wars: episode vi - return of the jedi (1983)        2716
saving private ryan (1998)                               2561
terminator 2: judgment day (1991)                        2509
silence of the lambs, the (1991)                         2498
raiders of the lost ark (1981)                           2473
back to the future (1985)                                2460
matrix, the (1999)                                       2434
jurassic park (1993)                                     2413
sixth sense, the (1999)                                  2385
fargo (1996)                                             2371
braveheart (1995)                                        2314
men in black (1997)                                      2297
schindler's list (1993)                                  2257
pr

> American Beauty (1999) ranks first among the popular movies, with 3,211 users having watched it.
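
The ranking counts rows per title, which equals the number of users only if each user appears at most once per movie. A small sketch on toy data (hypothetical values) that counts distinct users explicitly with `nunique`:

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 2, 3],
    'title': ['movie a', 'movie a', 'movie a', 'movie b', 'movie b', 'movie c'],
})

# Number of distinct users per title, most popular first.
popularity = toy.groupby('title')['user_id'].nunique().sort_values(ascending=False)
top2 = popularity.head(2)
print(top2)
```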

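## 3. Adding My 5 Favorite Movies to ratings

### 3-1. Choosing My 5 Favorite Movies

American Beauty (1999); Silence of the Lambs, The (1991); Matrix, The (1999); Forrest Gump (1994); Godfather, The (1972)

3211, 2498, 2434, 2022,  2167           # values used as movie_id below (the first three equal these titles' user counts in In [20], not IDs from movies.dat)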

In [21]:
# Titles must match the dataset exactly.
my_favorite = [3211, 2498, 2434, 2022,  2167]
my_title = ['american beauty (1999)', 'silence of the lambs, the (1991)', 'matrix, the (1999)', 'forrest gump (1994)', 'godfather, the (1972)' ]

# Assume a new user with user_id 6041 watched each of these movies 5 times.
# user_id is an int here so that it matches the dtype of the existing user_id column.
my_movielist = pd.DataFrame({'user_id': [6041]*5, 'movie_id': my_favorite, 'count':[5]*5, 'title': my_title })

if not merge_data_sort.isin({'user_id':[6041]})['user_id'].any():  # if user_id 6041 is not in the data yet
    merge_data_sort = pd.concat([merge_data_sort, my_movielist])       # add the rows built above (DataFrame.append was removed in pandas 2.0)

merge_data_sort.tail(10)       # check that the rows were added

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
657728,6040,334,4,957717503.0,vanya on 42nd street (1994),drama
393446,6040,1294,4,957716949.0,m*a*s*h (1970),comedy|war
253075,6040,994,3,960972693.0,big night (1996),drama
127665,6040,2396,3,956704475.0,shakespeare in love (1998),comedy|romance
738957,6040,2725,4,997454180.0,twin falls idaho (1999),drama
0,6041,3211,5,,american beauty (1999),
1,6041,2498,5,,"silence of the lambs, the (1991)",
2,6041,2434,5,,"matrix, the (1999)",
3,6041,2022,5,,forrest gump (1994),
4,6041,2167,5,,"godfather, the (1972)",
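
Because the IDs in `my_favorite` are typed by hand, looking them up by title is one way to make sure each `movie_id` really belongs to the intended movie. A minimal sketch on a toy table; the IDs and titles here are invented for illustration, and on the real data `toy_movies` would be the lowercased `movies` frame.

```python
import pandas as pd

# Toy stand-in for the movies table (IDs and titles are made up).
toy_movies = pd.DataFrame({
    'movie_id': [101, 102, 103],
    'title': ['movie a (1999)', 'movie b, the (1991)', 'movie c (1994)'],
})

my_title = ['movie a (1999)', 'movie c (1994)']

# Map title -> movie_id, then look up each favorite.
# A KeyError here signals a title that does not match the dataset exactly.
title_to_id = toy_movies.set_index('title')['movie_id']
my_favorite = [int(title_to_id[t]) for t in my_title]
print(my_favorite)
```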


### 3-2. Additional Preprocessing

In [22]:
del merge_data_sort['timestamp']  
del merge_data_sort['title']                          # drop columns that are no longer needed
del merge_data_sort['genre'] 
print(merge_data_sort.columns)

Index(['user_id', 'movie_id', 'count'], dtype='object')


In [23]:
merge_data_sort.isnull().sum()      # no missing values

user_id     0
movie_id    0
count       0
dtype: int64

In [24]:
merge_data_sort.duplicated()        # no duplicate rows

0        False
31113    False
31674    False
32044    False
32415    False
         ...  
0        False
1        False
2        False
3        False
4        False
Length: 836483, dtype: bool
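
The truncated boolean Series above shows only ten rows, so a single number is easier to read. A sketch on toy data (hypothetical values) that counts repeated `(user_id, movie_id)` pairs:

```python
import pandas as pd

toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'movie_id': [10, 20, 10, 10],
    'count': [5, 4, 3, 3],
})

# True for every row that repeats an earlier (user_id, movie_id) pair.
dup_mask = toy.duplicated(subset=['user_id', 'movie_id'])
n_duplicates = int(dup_mask.sum())
print(n_duplicates)
```

A result of 0 on `merge_data_sort` would confirm that no user rated the same movie twice.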

In [25]:
merge_data_sort.head()

Unnamed: 0,user_id,movie_id,count
0,1,1193,5
31113,1,2294,4
31674,1,3186,4
32044,1,1566,4
32415,1,588,4


In [26]:
merge_data_sort.tail()

Unnamed: 0,user_id,movie_id,count
0,6041,3211,5
1,6041,2498,5
2,6041,2434,5
3,6041,2022,5
4,6041,2167,5


## 4. Building the CSR (Compressed Sparse Row) Matrix

> A CSR matrix is a data structure that stores only the non-zero values of a sparse matrix together with their coordinate information, which keeps memory use low while still representing exactly the same matrix.
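
To make the description concrete, the sketch below converts a small dense array (toy values) to CSR with SciPy and prints the three arrays the format keeps: the stored values, their column indices, and the row pointers.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [5, 0, 0, 3],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
])
sparse = csr_matrix(dense)

print(sparse.data)     # non-zero values, row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row i's values live in data[indptr[i]:indptr[i+1]]
```

Only 3 of the 12 cells are stored, and `sparse.toarray()` reproduces the original dense array.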

In [27]:
merge_data_sort['user_id'].nunique()

6040

In [28]:
merge_data_sort['movie_id'].nunique() 

3628

In [29]:
from scipy.sparse import csr_matrix
import numpy as np

num_user = merge_data_sort['user_id'].nunique()    
num_movie = merge_data_sort['movie_id'].nunique()

In [30]:
csr_data = csr_matrix((merge_data_sort.count, (merge_data_sort.user_id, merge_data_sort.movie_id)), shape= (num_user, num_movie))

TypeError: len() of unsized object

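**Converting to a CSR matrix raises "TypeError: len() of unsized object". The likely cause is that `merge_data_sort.count` resolves to the `DataFrame.count` method rather than the `count` column, which has to be accessed as `merge_data_sort['count']`. Because the cause was hard to pin down at the time, the matrix is built directly instead.**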
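
For reference, a sketch of how the CSR construction could work, shown on toy data (hypothetical values) rather than the real frame. It assumes the raw IDs are used directly as row and column indices, so each dimension is sized as the largest ID plus one; `movie_id` goes up to 3952 in the real data, so a shape of `(num_user, num_movie)` would be too small. Mapping IDs to consecutive positions would be a more compact alternative.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for merge_data_sort (hypothetical values).
toy = pd.DataFrame({
    'user_id': [1, 1, 3],
    'movie_id': [2, 5, 2],
    'count': [5, 4, 3],
})

# Use bracket access for the column: toy.count would be the DataFrame.count method.
values = toy['count'].to_numpy()
rows = toy['user_id'].to_numpy()
cols = toy['movie_id'].to_numpy()

# Raw IDs serve as row/column indices, so each dimension must be max ID + 1
# (index 0 and any unused IDs simply stay empty).
shape = (int(rows.max()) + 1, int(cols.max()) + 1)
csr_data = csr_matrix((values, (rows, cols)), shape=shape)

print(csr_data.shape, csr_data.nnz, csr_data[1, 5])
```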

In [31]:
data_matrix = merge_data_sort.pivot_table('count', index='user_id', columns='movie_id')
data_matrix.head(3)

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,


In [32]:
# Fill every NaN value with 0
data_matrix = data_matrix.fillna(0)
data_matrix.head(3)

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
from sklearn.metrics.pairwise import cosine_similarity
data_matrix_T = data_matrix.transpose()
item_sim = cosine_similarity(data_matrix_T, data_matrix_T)
item_sim_df = pd.DataFrame(data=item_sim, index=data_matrix.columns,
                          columns=data_matrix.columns)
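
The item-to-item similarity computed above can be reproduced by hand. The sketch below does so with NumPy on a toy user x item matrix (hypothetical values), assuming every item has at least one non-zero count so that no norm is zero.

```python
import numpy as np

# Rows are users, columns are items (toy counts).
data = np.array([
    [5.0, 5.0, 0.0],
    [4.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
])

items = data.T                                        # one row per item
norms = np.linalg.norm(items, axis=1, keepdims=True)  # length of each item vector
item_sim = (items @ items.T) / (norms @ norms.T)      # cosine of the angle between item vectors

print(np.round(item_sim, 3))
```

Items 0 and 1 are watched by the same users in the same proportions, so their similarity is 1, while item 2 shares no users with them and gets 0.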

In [38]:
def predict_count(data_arr, item_sim_arr ):
    count_pred = data_arr.dot(item_sim_arr)/ np.array([np.abs(item_sim_arr).sum(axis=1)])
    return count_pred
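
Written out, `predict_count` computes a similarity-weighted average for each user and item. With $r_{u,j}$ denoting the entry of `data_arr` for user $u$ and item $j$, and $s_{i,j}$ the entry of `item_sim_arr` (symbols chosen here for illustration), the predicted value is:

```latex
\hat{r}_{u,i} = \frac{\sum_{j} r_{u,j}\, s_{j,i}}{\sum_{j} \lvert s_{i,j} \rvert}
```

The numerator is the matrix product `data_arr.dot(item_sim_arr)`, and the denominator is the row sum of absolute similarities for item $i$, broadcast across all users. Since the cosine similarity matrix is symmetric, $s_{j,i} = s_{i,j}$.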

In [39]:
movie_pred = predict_count(data_matrix.values , item_sim_df.values)
movie_pred_matrix = pd.DataFrame(data=movie_pred, index= data_matrix.index,
                                   columns = data_matrix.columns)
movie_pred_matrix.head(3)

movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.171004,0.130613,0.124445,0.112798,0.12403,0.120524,0.129862,0.132819,0.08135,0.125418,...,0.101566,0.091741,0.112906,0.080854,0.065347,0.134164,0.109472,0.087865,0.089632,0.118025
2,0.286873,0.240697,0.239858,0.206211,0.228211,0.291785,0.245049,0.204122,0.246315,0.291707,...,0.20247,0.189822,0.130006,0.198696,0.174892,0.254559,0.212125,0.184905,0.151912,0.245159
3,0.146762,0.123792,0.120613,0.088194,0.110527,0.134972,0.109669,0.098065,0.114908,0.143809,...,0.092013,0.068172,0.105541,0.100441,0.072191,0.124748,0.097244,0.081575,0.057358,0.102455


## 5. Building and Training als_model = AlternatingLeastSquares

In [40]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

# Settings recommended by the implicit library; unrelated to the lesson content.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

In [41]:
# Declare the implicit AlternatingLeastSquares model
als_model = AlternatingLeastSquares(factors=100, regularization=0.01, use_gpu=False, iterations=15, dtype=np.float32)
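
To illustrate what `factors`, `regularization`, and `iterations` control, here is a toy dense NumPy sketch of alternating least squares for implicit feedback. It is not the implicit library's implementation; the function name `als_implicit`, the `alpha` confidence weight, and the data are assumptions made for this example.

```python
import numpy as np

def als_implicit(R, factors=2, regularization=0.01, iterations=15, alpha=1.0, seed=0):
    """Toy implicit-feedback ALS: R is a dense user x item count matrix."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)          # binary preference: watched or not
    C = 1.0 + alpha * R                # confidence grows with the count
    X = rng.normal(scale=0.1, size=(n_users, factors))   # user factors
    Y = rng.normal(scale=0.1, size=(n_items, factors))   # item factors
    reg = regularization * np.eye(factors)

    def solve(fixed, conf, pref):
        # For each row k: (F^T C_k F + reg)^-1 F^T C_k p_k, with the other side held fixed.
        out = np.empty((conf.shape[0], factors))
        for k in range(conf.shape[0]):
            Ck = np.diag(conf[k])
            A = fixed.T @ Ck @ fixed + reg
            b = fixed.T @ Ck @ pref[k]
            out[k] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        X = solve(Y, C, P)         # update users with items fixed
        Y = solve(X, C.T, P.T)     # update items with users fixed
    return X, Y

# Two groups of users with disjoint tastes (toy data).
R = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 3],
    [0, 0, 4, 5],
], dtype=float)

X, Y = als_implicit(R)
scores = X @ Y.T
liked_mean = scores[R > 0].mean()
unseen_mean = scores[R == 0].mean()
print(liked_mean > unseen_mean)
```

Each iteration solves a small regularized least-squares problem per user and per item, which is why the method alternates between the two factor matrices.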

In [42]:
# The ALS model takes an (item x user) matrix as input, so transpose it.
data_transpose = movie_pred_matrix.T
data_transpose

user_id,1,2,3,4,5,6,7,8,9,10,...,6032,6033,6034,6035,6036,6037,6038,6039,6040,6041
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.171004,0.286873,0.146762,0.065236,0.279414,0.161217,0.107052,0.305255,0.279472,0.995300,...,0.253884,0.154212,0.039427,0.346389,1.242990,0.426051,0.046286,0.257872,0.591521,0.007384
2,0.130613,0.240697,0.123792,0.052145,0.202649,0.138923,0.092747,0.246597,0.200716,0.920035,...,0.189227,0.136406,0.027893,0.298968,1.022200,0.314655,0.032989,0.200157,0.419483,0.008254
3,0.124445,0.239858,0.120613,0.046033,0.200271,0.150033,0.089666,0.260581,0.211794,0.886715,...,0.169018,0.128465,0.024309,0.301417,0.953682,0.291469,0.033383,0.187995,0.379450,0.006675
4,0.112798,0.206211,0.088194,0.032079,0.209850,0.131803,0.064170,0.284765,0.183982,0.738915,...,0.137572,0.090236,0.018673,0.276519,0.926976,0.243900,0.030530,0.160454,0.367732,0.008043
5,0.124030,0.228211,0.110527,0.040281,0.186738,0.147721,0.079981,0.245789,0.198673,0.877871,...,0.155739,0.115423,0.020731,0.294539,0.907381,0.262944,0.030305,0.176334,0.350149,0.008236
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3948,0.134164,0.254559,0.124748,0.053545,0.277126,0.138636,0.100146,0.302004,0.273984,0.847699,...,0.193071,0.132720,0.033302,0.310612,1.122789,0.363769,0.036870,0.196683,0.523175,0.007552
3949,0.109472,0.212125,0.097244,0.045921,0.293586,0.107565,0.073728,0.279168,0.231350,0.685872,...,0.185051,0.103087,0.037280,0.266132,1.143982,0.354558,0.033581,0.182007,0.560369,0.007340
3950,0.087865,0.184905,0.081575,0.038842,0.214007,0.093183,0.059031,0.216735,0.176767,0.600053,...,0.165847,0.090744,0.033064,0.207974,0.998171,0.310428,0.030431,0.176976,0.464022,0.006668
3951,0.089632,0.151912,0.057358,0.028033,0.240598,0.093484,0.041007,0.209862,0.171168,0.480409,...,0.140657,0.055653,0.033603,0.192530,0.989162,0.281044,0.027598,0.153907,0.510118,0.005605


In [43]:
# Train the model
als_model.fit(data_transpose)

AttributeError: 'DataFrame' object has no attribute 'tocsr'
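
The error message suggests that `fit` calls `.tocsr()` on its input, which a SciPy sparse matrix provides and a pandas DataFrame does not. A sketch of the conversion on a toy table (hypothetical values shaped like `data_matrix`); the `fit` call is left as a comment because it needs the implicit library and the real data.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy user x item table shaped like data_matrix (hypothetical values).
toy_matrix = pd.DataFrame(
    [[5.0, 0.0, 3.0],
     [0.0, 4.0, 0.0]],
    index=pd.Index([1, 2], name='user_id'),
    columns=pd.Index([10, 20, 30], name='movie_id'),
)

# A DataFrame has no .tocsr(); wrapping its values in csr_matrix provides one.
user_item = csr_matrix(toy_matrix.values)
item_user = user_item.T.tocsr()   # item x user layout, as described in In [42]

print(hasattr(toy_matrix, 'tocsr'), hasattr(item_user, 'tocsr'), item_user.shape)
# als_model.fit(item_user)  # would take the place of als_model.fit(data_transpose)
```

Note that rows and columns of the converted matrix are positions in the table, not the original `user_id` and `movie_id` values.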

## 6. Checking My Preference for My Favorite Movies and Other Movies

In [None]:
user_vector, movie_vector = als_model.user_factors[6041], als_model.item_factors[3211]

In [None]:
user_vector

In [None]:
movie_vector

In [None]:
# Dot product of the vectors for user_id 6041 and movie_id 3211
np.dot(user_vector, movie_vector)
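
In a latent-factor model the dot product of a user vector and an item vector is the predicted preference. A minimal sketch with invented 2-factor embeddings (not taken from a trained model):

```python
import numpy as np

# Hypothetical 2-factor embeddings (not taken from a trained model).
user_factors = np.array([[0.9, 0.1],
                         [0.2, 0.8]])
item_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [0.5, 0.5]])

user_vector, movie_vector = user_factors[0], item_factors[0]
score = np.dot(user_vector, movie_vector)

# The same number appears in the full user x item score matrix.
all_scores = user_factors @ item_factors.T
print(score, all_scores[0, 0])
```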

## 7. Recommending Movies Similar to the Ones I Like

In [None]:
similar_movie = als_model.similar_items(3211, N=15)
similar_movie

In [None]:
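# Join with movies to look up the titles
# (assumes similar_items returns (movie_id, score) pairs, as in implicit releases before 0.5)
similar_movie_title = pd.merge(pd.DataFrame(similar_movie, columns=['movie_id', 'score']), movies, on='movie_id')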

In [None]:
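def get_similar_movie(movie_title: str):
    # Look up the movie_id for the lowercased title; assumes the model's item index is the raw movie_id.
    movie_id = movies.loc[movies['title'] == movie_title, 'movie_id'].iloc[0]
    similar_movie = als_model.similar_items(movie_id)        # assumed to be (movie_id, score) pairs
    id_to_title = movies.set_index('movie_id')['title']      # movie_id -> title lookup
    return [id_to_title[i[0]] for i in similar_movie]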

In [None]:
get_similar_movie('forrest gump (1994)')

## 8. Recommending Movies I Am Most Likely to Enjoy

In [None]:
user = 6041        # the user_id added in section 3
# recommend takes a user x item CSR matrix (csr_data is the CSR matrix attempted in In [30]).
movie_recommended = als_model.recommend(user, csr_data, N=20, filter_already_liked_items=True)
movie_recommended
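
Conceptually, a recommendation call of this kind scores every item against the user vector, removes items the user already liked, and keeps the top N. A toy NumPy sketch of that idea; the helper `recommend_top_n` and the factor values are hypothetical and not part of the implicit library.

```python
import numpy as np

def recommend_top_n(user_vector, item_factors, liked_items, n=2):
    """Rank items by dot-product score, skipping those the user already liked."""
    scores = (item_factors @ user_vector).astype(float)
    scores[list(liked_items)] = -np.inf          # analogue of filter_already_liked_items=True
    top = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in top]

item_factors = np.array([[1.0, 0.0],
                         [0.8, 0.2],
                         [0.0, 1.0],
                         [0.6, 0.1]])
user_vector = np.array([1.0, 0.0])

recommended = recommend_top_n(user_vector, item_factors, liked_items={0}, n=2)
print(recommended)
```

Item 0 has the highest raw score but is excluded because the user already liked it.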

In [None]:
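id_to_title = movies.set_index('movie_id')['title']      # movie_id -> title lookup (assumes item indices are raw movie_id values)
[id_to_title[i[0]] for i in movie_recommended]

In [None]:
explain = als_model.explain(user, data_matrix, itemid= )

In [None]:
[(id_to_title[i[0]], i[1]) for i in explain[1]]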

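## Overall Review

> Recommendation systems are used by services such as YouTube and music content portals to understand a user's tastes and offer tailored items. They fall broadly into content-based filtering and collaborative filtering, and collaborative filtering is further divided into nearest-neighbor and latent-factor approaches. Early recommendation systems relied mainly on content-based filtering or nearest-neighbor collaborative filtering, but since latent-factor collaborative filtering based on matrix factorization appeared, most systems have adopted it.
>
> This project also followed the latent-factor collaborative filtering approach based on matrix factorization. It was very interesting to work through how a recommendation system operates in this node, from data preprocessing to training and prediction. It is a pity, however, that training and prediction could not be completed with the matrix built in the project. I plan to try again carefully after submission. Comparing the results of training and prediction across different recommendation filtering methods would also be a worthwhile experiment.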