## 9. Recommender Systems
<img src='../img/recommend.png' width='60%'>

### Types of Recommender Systems
<img src='../img/recommend_system.png' width='60%'>

#### Content-Based Filtering
<img src='../img/contents_filiter.png' width='70%' align='left' />
<img src='../img/contents_filiter2.PNG' width='70%' align='left' />

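### 9.1 Content-Based Filtering Hands-On – TMDB 5000 Movie Dataset
> Data: https://www.kaggle.com/tmdb/tmdb-movie-metadata/data#

1. Convert each item's text fields into feature vectors.
2. Compute the pairwise cosine similarity between items.
3. Compute a weighted rating for each item.
4. Among the most similar titles, recommend those with the highest ratings first.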

In [8]:
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

movies = pd.read_csv('./tmdb-5000-movie-dataset/tmdb_5000_movies.csv')
print(movies.shape)
movies.head(1)

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""sp...",en,Avatar,"In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, ...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289}, {""name"": ""Twentieth Century Fox Film Corporatio...","[{""iso_3166_1"": ""US"", ""name"": ""United States of America""}, {""iso_3166_1"": ""GB"", ""name"": ""United ...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [9]:
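# Take an explicit copy so the column assignments below modify movies_df itself
# rather than a slice of movies (avoids pandas' SettingWithCopyWarning).
movies_df = movies[['id','title', 'genres', 'vote_average', 'vote_count',
                 'popularity', 'keywords', 'overview']].copy()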


In [10]:
pd.set_option('max_colwidth', 100)
movies_df[['genres','keywords']][:1]


Unnamed: 0,genres,keywords
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""name"": ""Fantasy""}, {...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"": 2964, ""name"": ""future""}, {""id"": 3386, ""name"": ""sp..."


In [11]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 8 columns):
id              4803 non-null int64
title           4803 non-null object
genres          4803 non-null object
vote_average    4803 non-null float64
vote_count      4803 non-null int64
popularity      4803 non-null float64
keywords        4803 non-null object
overview        4800 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 300.3+ KB


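**First-pass text processing: parse the strings into Python objects, then reduce them to lists of names**
* The `genres` and `keywords` columns currently hold strings that look like a list of several dictionaries.
* We will extract only the `name` value from each dictionary.

> `literal_eval` from the `ast` module parses the plain strings stored in `genres` into actual `list` and `dict` objects.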
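
To make the conversion concrete, here is a minimal sketch on a single hand-written string. The value of `raw` is a shortened stand-in for one cell of the `genres` column, not a row taken from the dataset.

```python
from ast import literal_eval

# A raw cell value shaped like those in the genres column (shortened for illustration).
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)             # str -> list of dicts
names = [d['name'] for d in parsed]    # keep only the genre names

print(type(raw).__name__, '->', type(parsed).__name__)
print(names)
```

Unlike `eval`, `literal_eval` only accepts Python literals (strings, numbers, lists, dicts and so on), so it is a reasonably safe way to parse text like this.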

In [19]:
from ast import literal_eval
print(type(movies_df['genres'].iloc[1]))     # str before parsing
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
print(type(movies_df['genres'].iloc[1]))     # list after parsing
print(type(movies_df['genres'].iloc[1][0]))  # the first element of the list is recognized as a dict
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)

<class 'str'>
<class 'list'>
<class 'dict'>


In [20]:
movies_df['genres'].head(1)

0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {...
Name: genres, dtype: object

In [21]:
movies_df['genres'] = movies_df['genres'].apply(lambda x : [ y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [ y['name'] for y in x])
movies_df[['genres', 'keywords']][:1]

Unnamed: 0,genres,keywords
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa..."


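**Movie recommendation with genre-based content filtering: Count-vectorize the genre strings, then compare movies by cosine similarity**

**Count-based feature vectorization of the genre strings**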

In [23]:
print(('*').join(['test', 'test2']), type(('*').join(['test', 'test2'])))

test*test2 <class 'str'>


In [24]:
from sklearn.feature_extraction.text import CountVectorizer

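# Join each genre list into one space-separated string so CountVectorizer can split it into words.
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : (' ').join(x))
# min_df=1 keeps every token; recent scikit-learn versions validate this parameter and reject the integer 0.
count_vect = CountVectorizer(min_df=1, ngram_range=(1,2))
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)
# The 276 columns are the distinct unigrams and bigrams of genre words
# (e.g. 'action', 'adventure', 'action adventure') found across all movies.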

(4803, 276)


In [27]:
movies_df.head(2)

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres2,genres_literal
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",7.2,11800,150.437577,"[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa...","In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, ...","[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {...",Action Adventure Fantasy Science Fiction
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]",6.9,4500,139.082615,"[ocean, drug abuse, exotic island, east india trading company, love of one's life, traitor, ship...","Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of t...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 28, 'name': 'Action'}]",Adventure Fantasy Action


**Cosine similarity between movies based on genre**

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
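print(genre_sim.shape)  # 4803 x 4803 matrix: one cosine similarity per pair of movies
# Row 0, [1, 0.5962, 0.4472 ...], belongs to the first movie: the leading 1 is its similarity
# to itself, and the remaining 4802 values are its similarities to every other movie.
print(genre_sim[:2])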

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]]
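
The 0.59628479 between the first two movies can be reproduced by hand. The sketch below uses a hand-rolled tokenizer, `ngram_counts`, which is a hypothetical helper written for this illustration. It only imitates what `CountVectorizer(ngram_range=(1,2))` does on these simple space-separated genre strings and is not scikit-learn's implementation.

```python
import math
from collections import Counter

def ngram_counts(text):
    """Count lowercase unigrams and adjacent-word bigrams of a space-separated string."""
    words = text.lower().split()
    bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
    return Counter(words + bigrams)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# genres_literal values of the first two rows (Avatar and Pirates of the Caribbean).
avatar = ngram_counts('Action Adventure Fantasy Science Fiction')
pirates = ngram_counts('Adventure Fantasy Action')

shared = sorted(avatar.keys() & pirates.keys())
sim = cosine(avatar, pirates)

print(len(avatar), len(pirates))  # 9 5
print(shared)                     # ['action', 'adventure', 'adventure fantasy', 'fantasy']
print(round(sim, 8))              # 0.59628479
```

Avatar has 9 tokens and Pirates has 5, and they share 4, so the similarity is 4 / (√9 · √5) ≈ 0.596. Word order matters here because of the bigrams: `action adventure` and `fantasy action` are not shared even though both movies contain all three words.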


In [29]:
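# argsort() returns the indices that would sort each row in ascending order;
# the trailing [:, ::-1] reverses every row so the most similar movie comes first.
# genre_sim_sorted_ind therefore holds, for each movie, the indices of all movies ordered by descending genre similarity.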
genre_sim_sorted_ind = genre_sim.argsort()[:, ::-1]
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]
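
A small sketch of the same `argsort()[:, ::-1]` idiom on a toy matrix may help. The 3x3 values below are made up for illustration and contain no ties within a row.

```python
import numpy as np

# Toy 3x3 similarity matrix (made-up values, no ties within a row).
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.5],
                [0.9, 0.5, 1.0]])

# Ascending argsort per row, then reverse each row to get descending order.
sorted_ind = sim.argsort()[:, ::-1]
print(sorted_ind)
# [[0 2 1]
#  [1 2 0]
#  [2 0 1]]
```

Each row starts with the movie's own index because its self-similarity of 1.0 is the unique maximum. In the real genre matrix many movies share identical genre sets and therefore tie at 1.0, and the order among tied entries carries no meaning, so a movie is not guaranteed to appear first in its own row.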


**A function that returns the movies most similar in genre to a given movie**

In [57]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    
    # Select the row(s) of df whose 'title' column equals title_name.
    title_movie = df[df['title'] == title_name]
    
    # Get the index of the matching row(s) as an ndarray.
    title_index = title_movie.index.values
    # From sorted_ind (here genre_sim_sorted_ind), take the first top_n indices in similarity order.
    similar_indexes = sorted_ind[title_index, :(top_n)]
    
    # Print the selected indices. They form a 2-D array of shape (1, top_n),
    # so flatten them to 1-D before using them as positions in the DataFrame.
    print("IDX :",similar_indexes, '| shape :',similar_indexes.shape)
    similar_indexes = similar_indexes.reshape(-1)
    
    return df.iloc[similar_indexes]


In [58]:
similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average']]

# Mi America is recommended on genre similarity alone despite a rating of 0, while GoodFellas (8.2) is pushed down to 7th.

IDX : [[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]] | shape : (1, 10)


Unnamed: 0,title,vote_average
2731,The Godfather: Part II,8.3
1243,Mean Streets,7.2
3636,Light Sleeper,5.7
1946,The Bad Lieutenant: Port of Call - New Orleans,6.0
2640,Things to Do in Denver When You're Dead,6.7
4065,Mi America,0.0
1847,GoodFellas,8.2
4217,Kids,6.8
883,Catch Me If You Can,7.7
3866,City of God,8.1


**Checking the highest-rated movies**
* Ratings backed by very few votes (`vote_count`) are hard to trust.

In [32]:
movies_df[['title','vote_average','vote_count']].sort_values('vote_average', ascending=False)[:10]

Unnamed: 0,title,vote_average,vote_count
3519,Stiff Upper Lips,10.0,1
4247,Me You and Five Bucks,10.0,2
4045,"Dancer, Texas Pop. 81",10.0,1
4662,Little Big Top,10.0,1
3992,Sardaarji,9.5,2
2386,One Man's Hero,9.3,2
2970,There Goes My Baby,8.5,2
1881,The Shawshank Redemption,8.5,8205
2796,The Prisoner of Zenda,8.4,11
3337,The Godfather,8.4,5893


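**Weighted rating that accounts for the number of votes**

**Weighted Rating = (v/(v+m)) * R + (m/(v+m)) * C**

* v: number of votes cast for the individual movie
* m: minimum number of votes required for a rating to count
* R: average rating of the individual movie
* C: mean rating across all movies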

In [53]:
C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print('C:',round(C,3), 'm:',round(m,3))

C: 6.092 m: 370.2
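
Plugging the rounded values printed above into the formula shows how the weighting behaves. This is a standalone sketch: `weighted_rating` is a hypothetical helper name, and the two inputs are the `vote_count` and `vote_average` of The Shawshank Redemption and of a movie with a single 10.0 vote, as listed in the rating table earlier.

```python
# Values printed by the notebook (rounded), used here as plain constants.
C = 6.092    # mean rating over all movies
m = 370.2    # 60th percentile of vote_count

def weighted_rating(v, R, m=m, C=C):
    """(v/(v+m)) * R + (m/(v+m)) * C"""
    return (v / (v + m)) * R + (m / (v + m)) * C

many_votes = weighted_rating(v=8205, R=8.5)   # The Shawshank Redemption
few_votes = weighted_rating(v=1, R=10.0)      # a 10.0 rating from a single vote

print(round(many_votes, 3))   # 8.396
print(round(few_votes, 3))    # 6.103
```

With thousands of votes the weighted rating stays close to the movie's own average, while a single vote leaves it almost exactly at the global mean C.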


In [54]:
percentile = 0.6
m = movies_df['vote_count'].quantile(percentile)
C = movies_df['vote_average'].mean()

def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    
    return ( (v/(v+m)) * R ) + ( (m/(m+v)) * C )   

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis=1) 


In [55]:
movies_df[['title','vote_average','weighted_vote','vote_count']].sort_values('weighted_vote',
                                                                          ascending=False)[:10]


Unnamed: 0,title,vote_average,weighted_vote,vote_count
1881,The Shawshank Redemption,8.5,8.396052,8205
3337,The Godfather,8.4,8.263591,5893
662,Fight Club,8.3,8.216455,9413
3232,Pulp Fiction,8.3,8.207102,8428
65,The Dark Knight,8.2,8.13693,12002
1818,Schindler's List,8.3,8.126069,4329
3865,Whiplash,8.3,8.123248,4254
809,Forrest Gump,8.2,8.105954,7927
2294,Spirited Away,8.3,8.105867,3840
2731,The Godfather: Part II,8.3,8.079586,3338


In [59]:
def adv_find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values
    
    # Take twice as many candidates (top_n * 2) in order of genre similarity.
    similar_indexes = sorted_ind[title_index, :(top_n*2)]
    similar_indexes = similar_indexes.reshape(-1)
    # Drop the query movie's own index so it is never recommended to itself.
    similar_indexes = similar_indexes[similar_indexes != title_index]
    
    # From that candidate pool, keep the top_n movies ranked by weighted_vote.
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]

similar_movies = adv_find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average', 'weighted_vote']]
# Re-ranking by weighted rating gives a better ordering than genre similarity alone.
# GoodFellas moves up from 7th to 2nd.

Unnamed: 0,title,vote_average,weighted_vote
2731,The Godfather: Part II,8.3,8.079586
1847,GoodFellas,8.2,7.976937
3866,City of God,8.1,7.759693
1663,Once Upon a Time in America,8.2,7.657811
883,Catch Me If You Can,7.7,7.557097
281,American Gangster,7.4,7.141396
4041,This Is England,7.4,6.739664
1149,American Hustle,6.8,6.717525
1243,Mean Streets,7.2,6.626569
2839,Rounders,6.9,6.530427
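
The re-ranking step can be isolated in a few lines of plain Python. The sketch below is only an illustration of "take a similarity-ordered pool, then sort by weighted rating": the five `(title, weighted_vote)` pairs are copied from the notebook outputs and listed in the order they appear in the genre-similarity result, and `top_n = 3` is an arbitrary choice.

```python
# (title, weighted_vote) pairs in genre-similarity order, copied from the notebook outputs.
candidates = [
    ('The Godfather: Part II', 8.079586),
    ('Mean Streets', 6.626569),
    ('GoodFellas', 7.976937),
    ('Catch Me If You Can', 7.557097),
    ('City of God', 7.759693),
]
top_n = 3

# Sort the candidate pool by weighted rating, highest first, and keep top_n.
reranked = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_n]
for title, score in reranked:
    print(f'{title}: {score:.3f}')
```

Mean Streets, second by similarity, falls out of the top three, while GoodFellas and City of God move up.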


In [60]:
# Result without the vote weighting, for comparison
similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average']]

IDX : [[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]] | shape : (1, 10)


Unnamed: 0,title,vote_average
2731,The Godfather: Part II,8.3
1243,Mean Streets,7.2
3636,Light Sleeper,5.7
1946,The Bad Lieutenant: Port of Call - New Orleans,6.0
2640,Things to Do in Denver When You're Dead,6.7
4065,Mi America,0.0
1847,GoodFellas,8.2
4217,Kids,6.8
883,Catch Me If You Can,7.7
3866,City of God,8.1
