## Retrospective

### Data manipulation skills

- Build a name-to-index dictionary with a dict comprehension
    
    `user_to_idx = {v:k for k,v in enumerate(user_unique)}`
    
- Use groupby to apply one aggregate function to all rows that share the same value in a column
    
    `artist_count = data.groupby('artist')['user_id'].count()`
    
- To read a TSV file, just set sep to '\t' in read_csv
- How to use str methods on the string elements of a DataFrame column
    
    `data['artist'] = data['artist'].str.lower()`
    
- Use data.loc[] to change only the rows with a specific value (build a boolean condition first)
    
    `condition = (data['user_id']== data.loc[0, 'user_id'])`
    
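- `Series.nunique()` gives the same number as `len(Series.unique())` (as long as the column has no NaN)
- Use the DataFrame append() method to append rows directly to a DataFrame (deprecated in pandas 1.4 and removed in 2.0, where pd.concat replaces it)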
    
    `data = data.append(my_playlist, ignore_index=True) # Append the my_favorite data created above.`
    
- Use the pandas map method to replace an entire column with dictionary values
    
    `temp_user_data = data['user_id'].map(user_to_idx.get).dropna()`
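The idioms in the list above can be tried together on a tiny DataFrame. This is only a sketch: the `toy` frame and its values are made up for illustration and are not the data used in this notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for a real listening/rating log.
toy = pd.DataFrame({
    "user_id": ["amy", "bob", "amy", "cat"],
    "artist": ["Queen", "queen", "Muse", "MUSE"],
})

# Normalise strings column-wide with the .str accessor.
toy["artist"] = toy["artist"].str.lower()

# name -> index dictionary via a dict comprehension over enumerate().
user_to_idx = {name: idx for idx, name in enumerate(toy["user_id"].unique())}

# Replace every user name with its index; names missing from the dict would become NaN.
toy["user_idx"] = toy["user_id"].map(user_to_idx.get)

# Count how many rows each artist has.
artist_count = toy.groupby("artist")["user_id"].count()

print(user_to_idx)
print(artist_count)
```

Because the artist names are lower-cased before the groupby, "Queen" and "queen" are counted as the same artist.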
    

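### Rubric

- The CSR matrix was built correctly.
- The MF model trained without problems, and I computed my preference as the dot product of the vectors.
- I computed my movie preferences, the similarity between movies, and recommendations that exclude the movies I had already entered.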

### Review

- I feel I learned a very basic model for recommender systems.
    - It feels like learning k-NN while studying CNNs?

## Data preparation and preprocessing

In [137]:
import pandas as pd
import os
rating_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/ratings.dat'
ratings_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv(rating_file_path, sep='::', names=ratings_cols, engine='python', encoding = "ISO-8859-1")
orginal_data_size = len(ratings)
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


In [138]:
# Keep only ratings of 3 or higher.
ratings = ratings[ratings['rating']>=3]
filtered_data_size = len(ratings)

print(f'orginal_data_size: {orginal_data_size}, filtered_data_size: {filtered_data_size}')
print(f'Ratio of Remaining Data is {filtered_data_size / orginal_data_size:.2%}')

orginal_data_size: 1000209, filtered_data_size: 836478
Ratio of Remaining Data is 83.63%


In [139]:
# Rename the rating column to count.
ratings.rename(columns={'rating':'count'}, inplace=True)

In [140]:
ratings['count']

0          5
1          3
2          3
3          4
4          5
          ..
1000203    3
1000205    5
1000206    5
1000207    4
1000208    4
Name: count, Length: 836478, dtype: int64

In [141]:
# Load the movie metadata so that titles can be looked up.
movie_file_path=os.getenv('HOME') + '/aiffel/recommendata_iu/data/ml-1m/movies.dat'
cols = ['movie_id', 'title', 'genre']
movies = pd.read_csv(movie_file_path, sep='::', names=cols, engine='python', encoding='ISO-8859-1')
movies.head()

Unnamed: 0,movie_id,title,genre
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [142]:
movies.columns

Index(['movie_id', 'title', 'genre'], dtype='object')

In [143]:
from scipy.sparse import csr_matrix

In [144]:
ratings.columns

Index(['user_id', 'movie_id', 'count', 'timestamp'], dtype='object')

## Data analysis

In [145]:
# Unique movie and user number

num_user = ratings['user_id'].nunique()
num_movie = ratings['movie_id'].nunique()
print(num_movie, num_user)

3628 6039


In [146]:
data = ratings.copy().merge(movies, how='left', on='movie_id')
data

Unnamed: 0,user_id,movie_id,count,timestamp,title,genre
0,1,1193,5,978300760,One Flew Over the Cuckoo's Nest (1975),Drama
1,1,661,3,978302109,James and the Giant Peach (1996),Animation|Children's|Musical
2,1,914,3,978301968,My Fair Lady (1964),Musical|Romance
3,1,3408,4,978300275,Erin Brockovich (2000),Drama
4,1,2355,5,978824291,"Bug's Life, A (1998)",Animation|Children's|Comedy
...,...,...,...,...,...,...
836473,6040,1090,3,956715518,Platoon (1986),Drama|War
836474,6040,1094,5,956704887,"Crying Game, The (1992)",Drama|Romance|War
836475,6040,562,5,956704746,Welcome to the Dollhouse (1995),Comedy|Drama
836476,6040,1096,4,956715648,Sophie's Choice (1982),Drama


In [147]:
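# title_to_idx is only built further down (In [152]), so this cell relies on a value
# already present in the kernel from an earlier run.
idx_to_title = {i:val for val,i in title_to_idx.items()}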
idx_to_title[1]

'James and the Giant Peach (1996)'

In [148]:
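# Top popular 30
# Note: iterating over the Series yields only the counts, so idx_to_title[i] below
# looks up each count as if it were a movie index. The numbers printed are the 30
# highest rating counts, but the titles next to them are not the movies that received them.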

for i in data.groupby('title')['user_id'].count().sort_values(ascending=False)[:30]:
    print(i)
    print(idx_to_title[i])

3211
Dancemaker (1998)
2910
New Jersey Drive (1995)
2885
Nina Takes a Lover (1994)
2716
My Giant (1998)
2561
City, The (1998)
2509
Matewan (1987)
2498
Two Family House (2000)
2473
Sum of Us, The (1994)
2460
Blackbeard's Ghost (1968)
2434
Girl 6 (1996)
2413
Duets (2000)
2385
Wonderful, Horrible Life of Leni Riefenstahl, The (Die Macht der Bilder) (1993)
2371
Gate of Heavenly Peace, The (1995)
2314
I Saw What You Did (1965)
2297
Robert A. Heinlein's The Puppet Masters (1994)
2257
Where Eagles Dare (1969)
2252
Night Falls on Manhattan (1997)
2213
Morning After, The (1986)
2210
Brighton Beach Memoirs (1986)
2194
Violets Are Blue... (1986)
2167
Quatermass II (1957)
2121
All the King's Men (1949)
2102
Gulliver's Travels (1939)
2066
Don't Be a Menace to South Central While Drinking Your Juice in the Hood (1996)
2051
Tumbleweeds (1999)
2030
Different for Girls (1996)
2022
Oliver & Company (1988)
2019
Blank Check (1994)
2000
Baby Geniuses (1999)
1941
Sudden Death (1995)
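To pair each title with its own rating count, one option is to iterate over the Series with `.items()`, which yields (index label, value) pairs. The sketch below uses a made-up `toy` frame rather than this notebook's data, so it shows only the pattern, not real results.

```python
import pandas as pd

# Hypothetical ratings: one row per (user, title) rating.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 2, 1],
    "title": ["A", "A", "A", "B", "B", "C"],
})

# Number of ratings per title, most-rated first, top 2 only.
top = toy.groupby("title")["user_id"].count().sort_values(ascending=False).head(2)

# .items() yields (title, count) pairs, so the title always matches its count.
top_pairs = [(title, int(count)) for title, count in top.items()]
for title, count in top_pairs:
    print(title, count)
```

Because the groupby key is the title itself, no index-to-title dictionary is needed for this step.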


## Adding my five favorite movies

In [149]:
my_favorite = ['Toy Story (1995)', 'Shawshank Redemption, The (1994)',
          'Forrest Gump (1994)', 'Fargo (1996)', 'American Beauty (1999)']

# Assume that user_id 6041 ('taekyun') gave each of the movies above a rating of 5.
my_playlist = pd.DataFrame({'user_id': [6041]*5, 'title': my_favorite, 'count':[5]*5})

if not data.isin({'user_id':[6041]})['user_id'].any():  # if there is no row with user_id 6041 ('taekyun') yet
    data = data.append(my_playlist)                           # append the my_favorite data created above
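`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the cell above only runs on older versions. A minimal sketch of the same step with `pd.concat` follows. The small `data` frame here is a stand-in invented for the example.

```python
import pandas as pd

# Stand-in for the merged ratings table (hypothetical values).
data = pd.DataFrame({"user_id": [1, 2], "title": ["A", "B"], "count": [5, 3]})
my_playlist = pd.DataFrame({"user_id": [6041] * 2, "title": ["A", "C"], "count": [5] * 2})

# Only add the rows if user 6041 is not in the data yet.
if not (data["user_id"] == 6041).any():
    # ignore_index=True renumbers the rows 0..n-1, so no reset_index is needed later.
    data = pd.concat([data, my_playlist], ignore_index=True)

print(data)
```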

In [150]:
unique_user = data['user_id'].unique()
len(unique_user)

6040

In [151]:
unique_movie = data['title'].unique()
len(unique_movie)

3628

In [152]:
title_to_idx = {val:idx for idx,val in enumerate(unique_movie)}
title_to_idx['Toy Story (1995)']

40

In [153]:
temp_title_data = data['title'].map(title_to_idx.get).dropna()
len(temp_title_data)

836483

In [154]:
len(data) == len(temp_title_data)

True

In [155]:
data['title'] = temp_title_data
data['title'].head()

0    0
1    1
2    2
3    3
4    4
Name: title, dtype: int64

In [156]:
temp = data[['user_id','title','count']]
temp.head()

Unnamed: 0,user_id,title,count
0,1,0,5
1,1,1,3
2,1,2,3
3,1,3,4
4,1,4,5


In [157]:
data = temp

In [158]:
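# user_id values (up to 6041) are used directly as row indices, so the matrix needs
# max(user_id) + 1 = 6042 rows; nunique() + 2 happens to equal that here.
num_user = data['user_id'].nunique() + 2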
num_movie = data['title'].nunique()
print(num_user, num_movie)

6042 3628


In [159]:
data.reset_index(drop=True, inplace=True)

## Building the CSR matrix by hand

In [160]:
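# Rows are indexed by user_id and columns by title index. num_movie alone already
# covers indices 0..3627, so the extra + 2 only adds two empty columns.
csr_data = csr_matrix((data['count'], (data.user_id, data['title'])), shape= (num_user, num_movie + 2))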
csr_data

<6042x3630 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Row format>

In [161]:
max(data.user_id)

6041

In [162]:
max(data['title'])

3627
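The shape only has to be large enough to hold the largest row and column index, which is what the two `max(...)` checks above confirm. A tiny sketch of that rule follows. The toy arrays are made up, and deriving the shape from `max() + 1` is one possible approach, not what the cell above does.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical interactions: user row index, movie column index, count.
users = np.array([1, 1, 3])
movies = np.array([0, 2, 1])
counts = np.array([5, 3, 4])

# The smallest shape that fits every index is (max row + 1, max column + 1).
shape = (int(users.max()) + 1, int(movies.max()) + 1)
toy_csr = csr_matrix((counts, (users, movies)), shape=shape)

print(toy_csr.shape, toy_csr.nnz)
print(toy_csr.toarray())
```

Rows 0 and 2 stay empty because no user has those ids, just as row 0 stays empty in the real matrix.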

In [163]:
from implicit.als import AlternatingLeastSquares
import os
import numpy as np

In [164]:
# These settings are recommended by the implicit library. They are unrelated to the learning content.
os.environ['OPENBLAS_NUM_THREADS']='1'
os.environ['KMP_DUPLICATE_LIB_OK']='True'
os.environ['MKL_NUM_THREADS']='1'

## Setting up the AlternatingLeastSquares model (als_model)

In [165]:
# Declare the implicit AlternatingLeastSquares model.
# Could use_gpu=True be used here? It might need more setup.
# Increasing factors and iterations (epochs) fits the training data better, but risks overfitting.
als_model = AlternatingLeastSquares(factors=128, regularization=0.01, use_gpu=False, iterations=15, dtype=np.float32)

In [166]:
# In the implicit version used here, the ALS model takes an (item x user) matrix as input, so transpose it.
csr_data_transpose = csr_data.T
csr_data_transpose

<3630x6042 sparse matrix of type '<class 'numpy.longlong'>'
	with 836483 stored elements in Compressed Sparse Column format>
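As the output above shows, transposing a CSR matrix gives a matrix in CSC format. If a row-compressed layout is wanted again, `.tocsr()` converts it. A small sketch on a made-up 2 x 3 matrix:

```python
from scipy.sparse import csr_matrix

# Hypothetical 2 users x 3 movies matrix.
toy_csr = csr_matrix([[5, 0, 3], [0, 4, 0]])

toy_t = toy_csr.T          # 3 x 2, returned in CSC format
toy_t_csr = toy_t.tocsr()  # same values, stored row-compressed again

print(toy_t.format, toy_t_csr.format)
```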

In [167]:
# Train the model.
als_model.fit(csr_data_transpose)

  0%|          | 0/15 [00:00<?, ?it/s]

## One of my five favorite movies and one other movie

In [168]:
taekyun, toy_story_peas = 6041, title_to_idx['Toy Story (1995)']
taekyun_vector, toy_story_vector = als_model.user_factors[taekyun], als_model.item_factors[toy_story_peas]

print('슝=3')

슝=3


In [169]:
taekyun_vector

array([-2.28445772e-02, -3.55581731e-01, -1.51362821e-01,  4.48059440e-01,
       -2.77255833e-01,  1.04844965e-01,  2.10265726e-01,  3.11535746e-01,
       -5.06369352e-01, -5.42475991e-02,  5.55406868e-01, -8.17392766e-02,
       -4.77685750e-01, -3.84262210e-04, -3.58845800e-01,  7.21139237e-02,
        2.43006110e-01, -1.89841509e-01,  2.32996836e-01,  6.00283593e-02,
        7.24924430e-02,  3.06546301e-01,  2.77443141e-01, -8.36717486e-01,
        3.37124407e-01,  5.56117631e-02, -1.85433820e-01, -1.16437769e+00,
        2.17059165e-01,  2.41147950e-01, -6.39786124e-01,  2.89922327e-01,
       -3.35196525e-01,  1.41557634e-01,  4.35516238e-02, -6.58139408e-01,
       -5.53304434e-01,  1.67163000e-01, -6.29599988e-01,  5.09103954e-01,
        3.95877182e-01,  7.29851604e-01,  5.41852772e-01,  1.60975233e-02,
       -2.76753634e-01, -1.50116265e+00,  3.09041858e-01,  3.15028638e-01,
       -3.80887926e-01, -4.72187102e-01, -8.85699213e-01, -7.26929009e-01,
        5.74395135e-02,  

In [170]:
toy_story_vector

array([ 6.32811990e-03, -1.50140878e-02,  1.43374354e-02,  1.66987758e-02,
        2.09714677e-02, -9.11675952e-03,  1.69126783e-02,  2.88944747e-02,
       -1.06902197e-02, -1.61132514e-02, -3.97491641e-03, -3.46483453e-03,
        1.89545110e-03,  2.44080760e-02,  1.36322118e-02,  1.64025296e-02,
        6.64420286e-03,  5.61199337e-03,  1.48632834e-02,  3.85277555e-03,
       -2.30608229e-02, -1.04189413e-02,  1.07208723e-02, -4.17488907e-03,
        1.79513581e-02,  3.03279106e-02, -1.34938834e-02,  1.14382281e-04,
        5.85799571e-03, -1.67566650e-02, -8.03175662e-03, -5.95282577e-03,
       -2.35268322e-04,  2.03536935e-02, -2.22514533e-02, -2.30560414e-02,
        5.07773645e-03,  2.59428006e-02, -1.94527451e-02,  2.56992131e-02,
        6.04494824e-04, -3.57811514e-04,  1.92835908e-02, -6.85752602e-03,
       -1.88824686e-03, -2.11925544e-02,  1.36912521e-02, -1.89087391e-02,
       -1.25825126e-02,  7.67756160e-03,  1.59497559e-02, -3.95723917e-02,
       -1.13532937e-03,  

In [171]:
# Dot product of the taekyun user vector and the Toy Story item vector
np.dot(taekyun_vector, toy_story_vector)

0.5969763

About 0.60. A little disappointing.

In [172]:
duet = title_to_idx['Duets (2000)']
duet_vector = als_model.item_factors[duet]
np.dot(taekyun_vector, duet_vector)

-0.030773023

The preference score came out below 0.
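The two scores above are plain dot products between a user factor vector and an item factor vector. The sketch below repeats that calculation with tiny made-up 3-dimensional vectors (the real factors have 128 dimensions), to show how a positive and a negative score can arise.

```python
import numpy as np

# Hypothetical factor vectors, chosen by hand for illustration.
user_vector = np.array([1.0, 0.5, -0.5])
liked_item = np.array([0.5, 0.5, 0.0])   # points in a similar direction to the user
other_item = np.array([-0.5, 0.0, 0.5])  # points in the opposite direction

liked_score = float(np.dot(user_vector, liked_item))
other_score = float(np.dot(user_vector, other_item))

print(liked_score, other_score)
```

An item whose vector points away from the user vector gets a negative score, which matches the below-zero result for a movie outside the favorites.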

## Recommending movies similar to one I like

In [173]:
favorite_movie = 'Toy Story (1995)'
movie_id = title_to_idx[favorite_movie]
similar_movie = als_model.similar_items(movie_id, N=15)
similar_movie

[(40, 0.99999994),
 (50, 0.7670828),
 (4, 0.51179916),
 (33, 0.50434),
 (322, 0.48507348),
 (110, 0.4196053),
 (330, 0.39828762),
 (255, 0.3485177),
 (10, 0.32601625),
 (160, 0.32142982),
 (277, 0.31840315),
 (20, 0.3108548),
 (32, 0.30998418),
 (126, 0.29190722),
 (478, 0.2827667)]

In [174]:
# Invert title_to_idx to get a dict that maps an index back to a movie title.
idx_to_title = {v:k for k,v in title_to_idx.items()}
[idx_to_title[i[0]] for i in similar_movie]

['Toy Story (1995)',
 'Toy Story 2 (1999)',
 "Bug's Life, A (1998)",
 'Aladdin (1992)',
 'Babe (1995)',
 'Groundhog Day (1993)',
 'Lion King, The (1994)',
 "There's Something About Mary (1998)",
 'Beauty and the Beast (1991)',
 'Forrest Gump (1994)',
 'Babe: Pig in the City (1998)',
 'Pleasantville (1998)',
 'Hercules (1997)',
 'Shakespeare in Love (1998)',
 "Wayne's World (1992)"]

Wow, the list is full of animated movies. It seems to be working well.
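One way to think about the similarity scores above is as cosine similarity between item factor vectors. The self-score of roughly 1.0 is consistent with that, but it is an assumption here, not a statement about how the library computes it. The helper `toy_similar_items` and the factor values below are hypothetical.

```python
import numpy as np

def toy_similar_items(item_factors, item_id, n=3):
    """Return the n items with the highest cosine similarity to item_id."""
    norms = np.linalg.norm(item_factors, axis=1)
    # Dot product with the query item, divided by both vector lengths.
    scores = item_factors @ item_factors[item_id] / (norms * norms[item_id])
    best = np.argsort(-scores)[:n]  # highest similarity first
    return [(int(i), float(scores[i])) for i in best]

# Hypothetical 2-dimensional item factors for four items.
toy_factors = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
    [-1.0, 0.0],
])

result = toy_similar_items(toy_factors, 0, n=3)
print(result)
```

As in the notebook output, the query item itself comes first with a similarity of about 1.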

## Recommending the movies I am most likely to enjoy

In [175]:
movie_recommended = als_model.recommend(taekyun, csr_data, N=10, filter_already_liked_items=True)

In [176]:
[[idx_to_title[movie[0]] for movie in movie_recommended]]

[['Silence of the Lambs, The (1991)',
  "Schindler's List (1993)",
  'Toy Story 2 (1999)',
  'Pulp Fiction (1994)',
  'Saving Private Ryan (1998)',
  'Groundhog Day (1993)',
  'Being John Malkovich (1999)',
  "Bug's Life, A (1998)",
  'GoodFellas (1990)',
  'Babe (1995)']]

I got 10 recommended movies!
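Conceptually, this kind of recommendation can be sketched as: score every item by its dot product with the user vector, drop the items the user already rated, and keep the top N. The code below is a simplified illustration with made-up factors and a hypothetical helper `toy_recommend`. It is not the library's implementation.

```python
import numpy as np

def toy_recommend(user_vector, item_factors, liked_items, n=2):
    """Return the indices of the n best-scoring items the user has not liked yet."""
    scores = (item_factors @ user_vector).astype(float)
    scores[list(liked_items)] = -np.inf  # exclude items that were already rated
    best = np.argsort(-scores)[:n]       # highest score first
    return [int(i) for i in best]

# Hypothetical 2-dimensional factors for four items and one user.
item_factors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.5, 0.5],
])
user_vector = np.array([1.0, 0.2])

recs = toy_recommend(user_vector, item_factors, liked_items={0}, n=2)
print(recs)
```

Item 0 has the highest raw score but is excluded because it is already liked, which mirrors how the favorites do not appear in the list of 10 above.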