# **ch9 Recommender Systems**
Book: **머신러닝 완벽 가이드 (2019/3)** [**GitHub**](https://github.com/wikibook/ml-definitive-guide)

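## **1 Measuring Document Similarity**
Background reading: an introduction to natural language processing with deep learning [**(WikiBook)**](https://wikidocs.net/24603)
### **01 Measuring Cosine Similarity**
- Other books left this concept vague, so it is worked through here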

In [1]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
    return dot(A, B)/(norm(A)*norm(B))

doc1 = np.array([0,1,1,1])
doc2 = np.array([1,0,1,1])
doc3 = np.array([2,0,2,2])

print("Doc1|Doc2 Cos: {:.3f}\nDoc1|Doc3 Cos: {:.3f}\nDoc2|Doc3 Cos: {:.3f}".format(
    cos_sim(doc1, doc2), cos_sim(doc1, doc3), cos_sim(doc2, doc3))) # cosine similarity of each document pair

Doc1|Doc2 Cos: 0.667
Doc1|Doc3 Cos: 0.667
Doc2|Doc3 Cos: 1.000
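
Doc2 and Doc3 score 1.000 because `doc3` is exactly `2 * doc2`: cosine similarity compares direction only, so scaling a vector does not change the score. A minimal pure-Python sketch of that property (the helper name `cosine` is illustrative and not part of the notebook):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc1 = [0, 1, 1, 1]
doc2 = [1, 0, 1, 1]
doc3 = [2 * x for x in doc2]   # same direction as doc2, twice the length

print(round(cosine(doc2, doc3), 3))   # scaled copy of doc2
print(round(cosine(doc1, doc2), 3))   # two of three terms shared
```

This is also why a document that simply repeats another one is treated as identical to it under this measure.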


### **02 Extracting Vectors from Documents**
- scikit-learn documentation [**(official docs)**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- Extract vectors like the hand-written ones above **directly from documents**
- Distinguish between **CountVectorizer** and **TfidfVectorizer**

In [2]:
document = ["저는 사과 좋아요", 
            "저는 바나나 좋아요", 
            "저는 바나나 좋아요 저는 바나나 좋아요"]

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(document)
X.toarray()

array([[0.        , 0.76749457, 0.45329466, 0.45329466],
       [0.67325467, 0.        , 0.52284231, 0.52284231],
       [0.67325467, 0.        , 0.52284231, 0.52284231]])

In [3]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(document)
X.toarray()

array([[0, 1, 1, 1],
       [1, 0, 1, 1],
       [2, 0, 2, 2]], dtype=int64)
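
The two outputs above are related: the TF-IDF rows can be rebuilt from the count matrix. The sketch below assumes `TfidfVectorizer`'s default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses only NumPy. The variable names are illustrative.

```python
import numpy as np

# Count matrix printed above; columns follow the sorted vocabulary
# 바나나 (banana), 사과 (apple), 저는 (I), 좋아요 (like).
counts = np.array([[0, 1, 1, 1],
                   [1, 0, 1, 1],
                   [2, 0, 2, 2]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed idf (assumed default, smooth_idf=True)

weighted = counts * idf                       # term count times idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

print(np.round(tfidf, 8))
```

Because of the row normalization, the third document (the second one repeated twice) ends up with the same TF-IDF vector as the second, which matches the identical rows in the `TfidfVectorizer` output.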

## **2 Recommender Systems**
Building a **recommendation model** from data

### **01 Similarity of Movie Plot Descriptions**
- Build a recommender from **tf-idf vectors and cosine similarity**
- The **tf-idf vectors** are computed with **sklearn** [**explanation**](https://hoony-gunputer.tistory.com/148)
- The text comes from the **"overview"** field of the dataset
- Dataset download [**G-Drive**](https://drive.google.com/drive/folders/1JnQXDCsGAb75I4PRRMDHUO0WxmXT-usv)

```python
# TfidfVectorizer L2-normalizes each row by default, so a plain dot product already equals the cosine similarity score
from sklearn.metrics.pairwise import linear_kernel
```
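
Why the dot product is enough can be checked on a small example. The sketch below is pure Python with illustrative helper names. It normalizes two vectors to unit length and compares their dot product with the full cosine formula.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [0.0, 1.0, 1.0, 1.0]
b = [1.0, 0.0, 1.0, 1.0]

# After normalization both norms are 1, so the denominator of the cosine
# formula disappears and only the dot product is left.
plain_dot = dot(l2_normalize(a), l2_normalize(b))
print(round(plain_dot, 3), round(cosine(a, b), 3))
```

The shortcut only holds for rows that are already unit length. For raw count vectors, `cosine_similarity` is the safer call.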

In [1]:
# 45,466 movies in total
import pandas as pd
data = pd.read_csv("data/movies_metadata.csv", low_memory=False)
print(data.shape)
data.iloc[:2, 8:13]

(45466, 24)


Unnamed: 0,original_title,overview,popularity,poster_path,production_companies
0,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]"
1,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'na..."


In [2]:
# 5,000 movies x 22,304 terms (English stop words removed)
from sklearn.feature_extraction.text import TfidfVectorizer
data             = data.head(5000).copy()      # keep only the first 5,000 rows to reduce load
tfidf            = TfidfVectorizer(stop_words='english')
data['overview'] = data['overview'].fillna('') # replace missing overviews with an empty string
tfidf_matrix     = tfidf.fit_transform(data['overview']) # tf-idf on the overview text
print(tfidf_matrix.shape)                      # (number of movies, vocabulary size)

(5000, 22304)


In [5]:
data.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [6]:
# cosine_sim: cosine similarity matrix
# TF-IDF rows are L2-normalized, so their dot product is the cosine similarity score
from sklearn.metrics.pairwise import linear_kernel
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
# Map title -> row index; keep the first row when a title appears more than once
indices    = pd.Series(data.index, index=data['title'])
indices    = indices[~indices.index.duplicated(keep='first')]
idx        = indices['Father of the Bride Part II']
print("\nIndex of Father of the Bride Part II:", idx)


Index of Father of the Bride Part II: 4


In [7]:
# Measure similarity between movies using the overview text
def get_recommendations(title, cosine_sim=cosine_sim, rank=11):
    idx        = indices[title]                # look up the row index by title
    sim_scores = list(enumerate(cosine_sim[idx])) # similarity of this movie to every movie
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True) # sort by similarity, descending
    sim_scores = sim_scores[1: rank]           # skip position 0 (the movie itself), keep the next 10
    movie_indices = [i[0] for i in sim_scores] # row indices of the 10 most similar movies
    return data['title'].iloc[movie_indices]   # return the titles of those movies

get_recommendations('Heat')

2361    The Color of Money
4430                 Twins
4083             Manhunter
4363          Criminal Law
1017       Beautiful Thing
4350          Biloxi Blues
109            Taxi Driver
265           Mi Vida Loca
4158     A Home of Our Own
2904          Falling Down
Name: title, dtype: object

In [8]:
get_recommendations('Bad Boys')

3782               Under Suspicion
3933               The Amati Girls
339     Ace Ventura: Pet Detective
1960         The Trip to Bountiful
1596                       Witness
3714                   Phantasm II
4886             The Shipping News
1581             Hurricane Streets
3642                          F/X2
2840      Someone to Watch Over Me
Name: title, dtype: object

### **02 Recommender Based on Crew and Cast Similarity**
- A recommender that uses crew and cast information

In [14]:
# # Load the credits and keywords data
# credits  = pd.read_csv('data/credits.csv')
# keywords = pd.read_csv('data/keywords.csv')

# # Preprocessing: cast the id columns to int before merging
# keywords['id'] = keywords['id'].astype(int)
# credits['id']  = credits['id'].astype(int)
# data['id']     = data['id'].astype(int)

# data   = data.merge(credits,  on = 'id')
# data   = data.merge(keywords, on = 'id')
# # s_data = data[data['id'].isin(link_small)]
# data.columns

In [15]:
# print(data.head(1))

In [11]:
# from ast import literal_eval
# data['cast']      = data['cast'].apply(literal_eval) # parse the cast string into a list
# data['crew']      = data['crew'].apply(literal_eval)
# data['keywords']  = data['keywords'].apply(literal_eval)
# data['cast_size'] = data['cast'].apply(lambda x :len(x)) # number of cast members
# data['crew_size'] = data['crew'].apply(lambda x: len(x)) # number of crew members

In [12]:
# def get_director(x):
#     for i in x:
#         if i['job'] == 'Director':
#             return i['name']
#     return np.nan

# # Add the director, cast (top 3), and keywords fields
# data['director'] = data['crew'].apply(get_director)
# data['cast']     = data['cast'].apply(
#     lambda x:[i['name'] for i in x] if isinstance(x,list) else [])
# data['cast']     = data['cast'].apply(
#     lambda x:x[:3] if len(x)>3 else x)
# data['keywords'] = data['keywords'].apply(
#     lambda x:[i['name'] for i in x] if isinstance(x, list) else [])

# # Lowercase the names and remove the spaces between words
# data['cast']     = data['cast'].apply(
#     lambda x:[str.lower(i.replace(" ", "")) for i in x])
# data['director'] = data['director'].astype(str).apply(
#     lambda x:str.lower(x.replace(" ", "")))
# # The director matters more, so the name is repeated three times to up-weight it
# data['director'] = data['director'].apply(lambda x: [x, x, x])
# s      = data.apply(lambda x: pd.Series(x['keywords']), axis=1).stack().reset_index(level=1, drop=True)
# s.name = 'keywords'
# s = s.value_counts()

In [13]:
# # Convert the stringified list of dicts into a list of genre names
# from ast import literal_eval    
# data['genres'] = data['genres'].fillna('[]').apply(literal_eval).apply(
#     lambda x : [i['name'] for i in x] if isinstance(x, list) else [])
# data.genres[:3]

## **3 (9.4) Latent Factor Collaborative Filtering**
### **01 Matrix Factorization with Gradient Descent**
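
The cells in this section factor the rating matrix $R$ into user factors $P$ and item factors $Q$ so that $R \approx P Q^{\top}$ on the observed entries. The notation below is inferred from the code rather than quoted from the book: $\eta$ stands for `learning_rate`, $\lambda$ for `r_lambda`, and $p_i$, $q_j$ for row $i$ of `P` and row $j$ of `Q`. Under those assumptions, the per-rating SGD step can be sketched as:

$$
\begin{aligned}
\hat{r}_{ij} &= p_i \, q_j^{\top}, \qquad e_{ij} = r_{ij} - \hat{r}_{ij} \\
p_i &\leftarrow p_i + \eta \left( e_{ij}\, q_j - \lambda\, p_i \right) \\
q_j &\leftarrow q_j + \eta \left( e_{ij}\, p_i - \lambda\, q_j \right)
\end{aligned}
$$

Up to a constant factor absorbed into $\eta$, these are gradient steps on the regularized squared error $\sum_{(i,j)\,\text{observed}} \left( r_{ij} - p_i q_j^{\top} \right)^2 + \lambda \left( \lVert p_i \rVert^2 + \lVert q_j \rVert^2 \right)$. In the loop as written, the update of $q_j$ uses the $p_i$ that was just updated.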

In [14]:
import numpy as np
# Build the original matrix R (np.nan marks an unrated item); P and Q are initialized below with K = 3 latent factors.
R = np.array([[4, np.nan, np.nan, 2, np.nan ],
              [np.nan, 5, np.nan, 3, 1 ],
              [np.nan, np.nan, 3, 4, 4 ],
              [5, 2, 1, 2, np.nan ]])

# Size P and Q from R and fill them with random values drawn from a normal distribution.
num_users, num_items = R.shape
K = 3
np.random.seed(1234)
P = np.random.normal(scale=1./K, size=(num_users, K))
Q = np.random.normal(scale=1./K, size=(num_items, K))

In [15]:
from sklearn.metrics import mean_squared_error
def get_rmse(R, P, Q, non_zeros):
    full_pred_matrix = np.dot(P, Q.T)  # predicted R = P times Q transposed
    # Compare actual and predicted values only at the positions where R has a rating
    x_non_zero_ind = [non_zero[0] for non_zero in non_zeros]
    y_non_zero_ind = [non_zero[1] for non_zero in non_zeros]
    R_non_zeros    = R[x_non_zero_ind, y_non_zero_ind]
    full_pred_matrix_non_zeros = full_pred_matrix[x_non_zero_ind, y_non_zero_ind]
    mse  = mean_squared_error(R_non_zeros, full_pred_matrix_non_zeros)
    rmse = np.sqrt(mse)
    return rmse

In [16]:
# Store (row, column, value) for every rated entry; NaN fails the R > 0 test, so unrated entries are skipped.
non_zeros = [ (i, j, R[i,j]) for i in range(num_users) for j in range(num_items) if R[i,j] > 0 ]
steps         = 1000
learning_rate = 0.01
r_lambda      = 0.01

# Update P and Q repeatedly with SGD.
for step in range(steps):
    for i, j, r in non_zeros:
        eij = r - np.dot(P[i, :], Q[j, :].T) # error between the actual and the predicted rating
        # SGD update with L2 regularization
        P[i,:] = P[i,:] + learning_rate*(eij * Q[j, :] - r_lambda*P[i,:])
        Q[j,:] = Q[j,:] + learning_rate*(eij * P[i, :] - r_lambda*Q[j,:])

    rmse = get_rmse(R, P, Q, non_zeros)
    if (step % 100) == 0 :
        print("### iteration step : {:4,} rmse : {:.4f}".format(step, rmse))

### iteration step :    0 rmse : 3.2843
### iteration step :  100 rmse : 0.0920
### iteration step :  200 rmse : 0.0218
### iteration step :  300 rmse : 0.0157
### iteration step :  400 rmse : 0.0146
### iteration step :  500 rmse : 0.0145
### iteration step :  600 rmse : 0.0144
### iteration step :  700 rmse : 0.0144
### iteration step :  800 rmse : 0.0144
### iteration step :  900 rmse : 0.0144


In [17]:
pred_matrix = np.dot(P, Q.T)
print('Predicted matrix:\n', np.round(pred_matrix, 3))

Predicted matrix:
 [[3.992 1.68  1.197 1.998 1.458]
 [5.41  4.976 0.659 2.987 1.005]
 [5.296 2.325 2.988 3.98  3.985]
 [4.971 2.005 1.004 2.004 1.095]]
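
The predicted matrix reproduces the known ratings closely and also fills in the cells that were NaN in `R`. One way to turn it into a recommendation, sketched here under the assumption that the goal is "the unrated item with the highest predicted rating per user", is to mask out the items each user has already rated. The arrays are copied from the cells above. `unrated_pred` and `best_item` are illustrative names.

```python
import numpy as np

# Ratings and predictions copied from the cells above (np.nan = not yet rated).
R = np.array([[4, np.nan, np.nan, 2, np.nan],
              [np.nan, 5, np.nan, 3, 1],
              [np.nan, np.nan, 3, 4, 4],
              [5, 2, 1, 2, np.nan]])
pred_matrix = np.array([[3.992, 1.68, 1.197, 1.998, 1.458],
                        [5.41, 4.976, 0.659, 2.987, 1.005],
                        [5.296, 2.325, 2.988, 3.98, 3.985],
                        [4.971, 2.005, 1.004, 2.004, 1.095]])

# Keep predictions only where the user has no rating; rated items get -inf
# so they can never win the argmax.
unrated_pred = np.where(np.isnan(R), pred_matrix, -np.inf)
best_item = unrated_pred.argmax(axis=1)

for user, item in enumerate(best_item):
    print(f"user {user}: item {item} (predicted {unrated_pred[user, item]:.3f})")
```

With only four users and five items this is purely illustrative. Some predictions, such as 5.41 for user 1 on item 0, fall outside the 1 to 5 range of the original ratings.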


## **4 (9.5) Content-Based Filtering Hands-On**
TMDB 5000 Movie Dataset

In [18]:
import numpy as np
import pandas as pd
import warnings
# warnings.filterwarnings('ignore')

movies = pd.read_csv('data/tmdb_5000_movies.csv')
print(movies.shape)
movies.head(1)

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [19]:
movies_df = movies[['id','title', 'genres', 'vote_average', 'vote_count',
                 'popularity', 'keywords', 'overview']]

pd.set_option('max_colwidth', 100)
movies_df[['genres','keywords']][:1]

Unnamed: 0,genres,keywords
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""sp..."


In [20]:
from ast import literal_eval
movies_df = movies_df.copy()  # explicit copy, so the assignments below do not raise SettingWithCopyWarning
movies_df['genres']   = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)
movies_df['genres']   = movies_df['genres'].apply(lambda x : [ y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [ y['name'] for y in x])
movies_df[['genres', 'keywords']][:1]

Unnamed: 0,genres,keywords
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa..."
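
The conversion above happens in two steps: `literal_eval` turns the string stored in the CSV into a real Python list of dicts, and the list comprehension keeps only each `name`. A small standalone sketch on a shortened value in the same shape as the raw `genres` cell (the sample string is abbreviated by hand):

```python
from ast import literal_eval

# A shortened genres cell, in the same shape as the raw CSV value.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)                    # str -> list of dicts
names = [entry['name'] for entry in parsed]   # keep only the genre names

print(type(parsed).__name__, names)
```

`literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for text read from a file.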


In [21]:
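# Join each genre list into one space-separated string so CountVectorizer can tokenize it.
from sklearn.feature_extraction.text import CountVectorizer

movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : (' ').join(x))
# min_df=1 keeps every term, same as the original min_df=0; recent scikit-learn releases reject an integer 0
count_vect = CountVectorizer(min_df=1, ngram_range=(1,2))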
genre_mat  = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

(4803, 276)
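
The 276 columns are more than the number of distinct genres because `ngram_range=(1,2)` counts both single words and pairs of adjacent words. The sketch below approximates that tokenization in pure Python for one string. It assumes the vectorizer's default behavior (lowercasing, tokens of two or more word characters) and is not the library's own code. `unigrams_and_bigrams` is a hypothetical helper.

```python
import re

def unigrams_and_bigrams(text):
    """Approximate CountVectorizer(ngram_range=(1, 2)) tokenization for one string."""
    # Assumed defaults: lowercase the text, keep tokens of 2+ word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams("Action Adventure Fantasy Science Fiction")
print(features)
```

One side effect of joining on spaces is that a two-word genre such as "Science Fiction" is split into `science` and `fiction`, and pairs such as `fantasy science` become features even though they are not genres.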




In [22]:
from sklearn.metrics.pairwise import cosine_similarity
genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:2])

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]]


In [23]:
genre_sim_sorted_ind = genre_sim.argsort()[:, ::-1]
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]
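
`argsort()` returns column indices in ascending order of similarity, and the `[:, ::-1]` slice reverses each row so the most similar movie comes first. A toy 3x3 example (the similarity values are made up for illustration):

```python
import numpy as np

# Toy similarity matrix; entry [i, j] is the similarity of item i to item j.
sim = np.array([[1.0, 0.2, 0.8],
                [0.2, 1.0, 0.5],
                [0.8, 0.5, 1.0]])

ascending = sim.argsort()          # column indices, least similar first
descending = ascending[:, ::-1]    # reverse each row: most similar first

print(descending)
```

Here every row starts with its own index because the diagonal is the unique maximum. In the genre matrix many movies share an identical genre list and tie at 1.0, and ties are not ordered in any guaranteed way, so the query movie is not always in column 0.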


In [24]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    
    # From the DataFrame passed in (movies_df), select the rows whose 'title' equals title_name
    title_movie = df[df['title'] == title_name]
    
    # Take the index of that row as an ndarray, then read the top_n most similar
    # indices from sorted_ind (the genre_sim_sorted_ind array)
    title_index = title_movie.index.values
    similar_indexes = sorted_ind[title_index, :(top_n)]
    
    # Print the top_n indices. The result is 2-D, so flatten it to 1-D
    # before using it to index the DataFrame
    print(similar_indexes)
    similar_indexes = similar_indexes.reshape(-1)    
    return df.iloc[similar_indexes]

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average']]

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]


Unnamed: 0,title,vote_average
2731,The Godfather: Part II,8.3
1243,Mean Streets,7.2
3636,Light Sleeper,5.7
1946,The Bad Lieutenant: Port of Call - New Orleans,6.0
2640,Things to Do in Denver When You're Dead,6.7
4065,Mi America,0.0
1847,GoodFellas,8.2
4217,Kids,6.8
883,Catch Me If You Can,7.7
3866,City of God,8.1


In [25]:
movies_df[['title','vote_average','vote_count']].sort_values('vote_average', ascending=False)[:10]

Unnamed: 0,title,vote_average,vote_count
3519,Stiff Upper Lips,10.0,1
4247,Me You and Five Bucks,10.0,2
4045,"Dancer, Texas Pop. 81",10.0,1
4662,Little Big Top,10.0,1
3992,Sardaarji,9.5,2
2386,One Man's Hero,9.3,2
2970,There Goes My Baby,8.5,2
1881,The Shawshank Redemption,8.5,8205
2796,The Prisoner of Zenda,8.4,11
3337,The Godfather,8.4,5893


In [26]:
C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print('C:',round(C,3), 'm:',round(m,3))

C: 6.092 m: 370.2


In [27]:
percentile = 0.6
m = movies_df['vote_count'].quantile(percentile)
C = movies_df['vote_average'].mean()

def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    return ((v/(v+m))*R) + ((m/(m+v))*C)   

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis=1) 
movies_df[['title', 'vote_average', 'weighted_vote', 'vote_count']].sort_values(
    'weighted_vote', ascending=False)[:10]



Unnamed: 0,title,vote_average,weighted_vote,vote_count
1881,The Shawshank Redemption,8.5,8.396052,8205
3337,The Godfather,8.4,8.263591,5893
662,Fight Club,8.3,8.216455,9413
3232,Pulp Fiction,8.3,8.207102,8428
65,The Dark Knight,8.2,8.13693,12002
1818,Schindler's List,8.3,8.126069,4329
3865,Whiplash,8.3,8.123248,4254
809,Forrest Gump,8.2,8.105954,7927
2294,Spirited Away,8.3,8.105867,3840
2731,The Godfather: Part II,8.3,8.079586,3338
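
To see why the ranking changes, two rows from the tables above can be plugged into the formula by hand. The sketch uses the rounded constants printed earlier (`C: 6.092`, `m: 370.2`), so the results agree with the table only to about three decimals. `weighted_vote` is an illustrative stand-in for `weighted_vote_average`.

```python
# Constants as printed in the notebook (rounded).
C = 6.092   # mean vote_average over all movies
m = 370.2   # 60th-percentile vote_count

def weighted_vote(v, R):
    """Blend a movie's own rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

shawshank = weighted_vote(v=8205, R=8.5)   # many votes: stays close to its own 8.5
stiff_lips = weighted_vote(v=1, R=10.0)    # a single vote: pulled almost all the way to C

print(round(shawshank, 3), round(stiff_lips, 3))
```

The Shawshank Redemption keeps a score close to the 8.396052 shown in the table, while Stiff Upper Lips, which topped the plain `vote_average` ranking with a single 10.0 vote, drops to just above the global mean.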


In [28]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values
    # Take twice top_n candidates with the highest genre similarity
    similar_indexes = sorted_ind[title_index, :(top_n*2)]
    similar_indexes = similar_indexes.reshape(-1)
    # Exclude the query movie itself
    similar_indexes = similar_indexes[similar_indexes != title_index]
    # From those candidates, return the top_n ranked by weighted_vote
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average', 'weighted_vote']]

Unnamed: 0,title,vote_average,weighted_vote
2731,The Godfather: Part II,8.3,8.079586
1847,GoodFellas,8.2,7.976937
3866,City of God,8.1,7.759693
1663,Once Upon a Time in America,8.2,7.657811
883,Catch Me If You Can,7.7,7.557097
281,American Gangster,7.4,7.141396
4041,This Is England,7.4,6.739664
1149,American Hustle,6.8,6.717525
1243,Mean Streets,7.2,6.626569
2839,Rounders,6.9,6.530427
