### Dataset used
- **TMDB 5000 Movie Dataset (https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv)**

### References
- https://www.kaggle.com/rounakbanik/movie-recommender-systems
- https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system


## Data preprocessing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# # Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# # List the files in the Drive folder
# !ls drive/'My Drive'/'Colab Notebooks'/'추천시스템'/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
'01. 추천 시스템 - 콘텐츠 기반(content based filtering).ipynb'
 keywords.csv
 movies_metadata.csv
 pre_tmdb_5000_movies.csv
 tmdb_5000_movies.csv


In [2]:
# Load the data
# data = pd.read_csv('drive/My Drive/Colab Notebooks/추천시스템/tmdb_5000_movies.csv')
data = pd.read_csv('./dataset/tmdb_5000_movies.csv')

In [3]:
data.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
data.shape

(4803, 20)

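See the dataset link above for a description of the dataset.

The columns used in the experiment are as follows.

- genres : movie genres
- keywords : movie keywords
- original_language : language of the movie (listed here, but not included in the column selection below)
- title : title
- vote_average : average rating
- vote_count : number of votes
- popularity : popularity
- overview : plot overview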

In [5]:
# Select only the columns to be used
# vote_average comes out high for a movie with few votes but high ratings, so movies with many votes can be expected to be at a disadvantage

data = data[['id','genres', 'vote_average', 'vote_count','popularity','title',  'keywords', 'overview']]

IMDb uses a method that addresses this problem.

The details are discussed at (https://www.quora.com/How-does-IMDbs-rating-system-work).

![image.png](https://camo.githubusercontent.com/c518f092cd890ff82aa0187d7b787bd812947247f6134969ec6fc0e767e55071/68747470733a2f2f757365722d696d616765732e67697468756275736572636f6e74656e742e636f6d2f32343633343035342f37313737343437302d64313437306338302d326662322d313165612d386131652d6161303138646436643235612e4a5047)

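- r : average rating of the individual movie
- v : number of votes cast for the individual movie
- m : minimum number of votes required to be listed in the top 250 (in practice this seems to be left to the experimenter's discretion)
- c : mean rating across all movies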
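
As a rough illustration of how these four quantities combine, the sketch below evaluates score = v/(v+m) · R + m/(v+m) · C, which is the expression used by the `weighted_rating` function in this notebook (R and C correspond to r and c in the list above). The helper name `weighted_rating_example` and the low-vote movie are hypothetical. The values of m, C and the Avatar row are copied from the outputs shown in this notebook.

```python
def weighted_rating_example(v, R, m, C):
    """Blend a movie's own rating R with the overall mean C, weighted by its vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Values reported in this notebook's outputs
demo_m = 1838.4               # vote_count cutoff
demo_C = 6.962993762993763    # mean vote_average printed by the notebook

# Avatar row from the notebook: vote_count=11800, vote_average=7.2
avatar_score = weighted_rating_example(v=11800, R=7.2, m=demo_m, C=demo_C)

# Hypothetical movie with a very high rating but only 10 votes
few_votes_score = weighted_rating_example(v=10, R=9.5, m=demo_m, C=demo_C)

print(avatar_score)      # stays close to the movie's own 7.2
print(few_votes_score)   # pulled almost all the way back to demo_C
```

With many votes the score stays near the movie's own rating. With few votes it shrinks toward the overall mean, which is how the weighting offsets the bias described above.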

Here, let us **assume a cutoff at roughly the top 500 movies** when choosing m.

To find which vote_count percentile corresponds to roughly the top 500, we can use quantile.
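
A minimal sketch of the idea on made-up data: `Series.quantile(0.9)` returns the value below which about 90% of the entries fall, so keeping rows at or above it retains roughly the top 10%. The series `toy_votes` is hypothetical and only stands in for `vote_count`.

```python
import pandas as pd

# Hypothetical vote counts: 1, 2, ..., 100
toy_votes = pd.Series(range(1, 101))

# Value at the 0.9 quantile (pandas interpolates linearly by default)
toy_cutoff = toy_votes.quantile(0.9)

# Number of entries at or above the cutoff, i.e. roughly the top 10%
toy_kept = int((toy_votes >= toy_cutoff).sum())

print(toy_cutoff, toy_kept)
```

Applied to the 4803 rows of this dataset, the same 0.9 quantile keeps about a tenth of the movies, which is close to the 500 targeted here.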

In [6]:
tmp_m = data['vote_count'].quantile(0.9)
tmp_m

1838.4000000000015

In [7]:
# At the 0.9 quantile (i.e. the top 10% by vote_count), 481 movies remain, so we proceed with 0.9

tmp_data = data.copy().loc[data['vote_count'] >= tmp_m]
tmp_data.shape

(481, 8)

In [8]:
m = data['vote_count'].quantile(0.9)
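# .copy() makes the filtered frame independent, so adding columns later does not trigger SettingWithCopyWarning
data = data.loc[data['vote_count'] >= m].copy()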
data.head()

Unnamed: 0,id,genres,vote_average,vote_count,popularity,title,keywords,overview
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",7.2,11800,150.437577,Avatar,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di..."
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",6.9,4500,139.082615,Pirates of the Caribbean: At World's End,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha..."
2,206647,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",6.3,4466,107.376788,Spectre,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...
3,49026,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",7.6,9106,112.31295,The Dark Knight Rises,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...
4,49529,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",6.1,2124,43.926995,John Carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca..."


In [9]:
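# Note: data has already been filtered to vote_count >= m,
# so C is the mean over the retained movies only
C = data['vote_average'].mean()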

In [10]:
print(C)
print(m)

6.962993762993763
1838.4000000000015


In [11]:
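def weighted_rating(x, m=m, C=C):
    # x is one row of the DataFrame; m and C default to the values computed above
    v = x['vote_count']
    R = x['vote_average']

    # Weighted average of the movie's own rating R and the overall mean C
    return ( v / (v+m) * R ) + (m / (m + v) * C)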

In [12]:
data['score'] = data.apply(weighted_rating, axis = 1)
data.head()

Unnamed: 0,id,genres,vote_average,vote_count,popularity,title,keywords,overview,score
0,19995,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",7.2,11800,150.437577,Avatar,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di...",7.168053
1,285,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",6.9,4500,139.082615,Pirates of the Caribbean: At World's End,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha...",6.918271
2,206647,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",6.3,4466,107.376788,Spectre,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...,6.493333
3,49026,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",7.6,9106,112.31295,The Dark Knight Rises,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...,7.492998
4,49529,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",6.1,2124,43.926995,John Carter,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca...",6.500396


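In addition, the **['genres', 'keywords'] columns** are structured as a list of dicts.

This is because a movie can belong to more than one genre and can have more than one keyword.

The problem is that these values are currently stored as strings, so they have to be parsed with the **ast module**.

- `literal_eval` from the ast module converts the string into list and dict objects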
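
A small sketch of what `literal_eval` does to one such cell. The string `raw_genres` is a shortened, hand-written stand-in for a real cell value.

```python
from ast import literal_eval

# Stand-in for one cell of the 'genres' column: a string, not a list
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval safely evaluates Python literals (lists, dicts, strings, numbers)
parsed_genres = literal_eval(raw_genres)

# Once parsed, the names can be pulled out with a list comprehension
genre_names = [d["name"] for d in parsed_genres]

print(type(raw_genres).__name__, type(parsed_genres).__name__)
print(genre_names)
```

Unlike `eval`, `literal_eval` only accepts literal structures, so it does not execute arbitrary code embedded in the data.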

In [13]:
data[['genres', 'keywords']].head(2)

Unnamed: 0,genres,keywords
0,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":..."
1,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na..."


In [14]:
data['genres'] = data['genres'].apply(literal_eval)
data['keywords'] = data['keywords'].apply(literal_eval)

data[['genres', 'keywords']].head(3)

Unnamed: 0,genres,keywords
0,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","[{'id': 1463, 'name': 'culture clash'}, {'id':..."
1,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na..."
2,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...","[{'id': 470, 'name': 'spy'}, {'id': 818, 'name..."


The ids are not needed for the experiment, so we **drop the id and keep only the name**.
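
One side effect of joining names with spaces is that a multi-word name such as "Science Fiction" later splits into separate tokens. The sketch below shows this with plain string operations. The compact variant is a hypothetical alternative and is not what this notebook does.

```python
names = ["Action", "Science Fiction"]

# Joining with spaces, as in this notebook
joined = " ".join(names)
tokens = joined.lower().split()   # rough stand-in for a whitespace tokenizer

# Hypothetical alternative: remove inner spaces so each name stays one token
joined_compact = " ".join(name.replace(" ", "") for name in names)
compact_tokens = joined_compact.lower().split()

print(tokens)
print(compact_tokens)
```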

In [15]:
data['genres'] = data['genres'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))
data['keywords'] = data['keywords'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))

data.head(3)

Unnamed: 0,id,genres,vote_average,vote_count,popularity,title,keywords,overview,score
0,19995,Action Adventure Fantasy Science Fiction,7.2,11800,150.437577,Avatar,culture clash future space war space colony so...,"In the 22nd century, a paraplegic Marine is di...",7.168053
1,285,Adventure Fantasy Action,6.9,4500,139.082615,Pirates of the Caribbean: At World's End,ocean drug abuse exotic island east india trad...,"Captain Barbossa, long believed to be dead, ha...",6.918271
2,206647,Action Adventure Crime,6.3,4466,107.376788,Spectre,spy based on novel secret agent sequel mi6 bri...,A cryptic message from Bond’s past sends him o...,6.493333


In [16]:
# Save the data

# data.to_csv('drive/My Drive/Colab Notebooks/추천시스템/pre_tmdb_5000_movies.csv', index=False)
data.to_csv('./dataset/pre_tmdb_5000_movies.csv', index=False)

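## Content-based filtering recommendation

- Content-based filtering means **recommending similar content to the user**.

- Here, a typical basis for "similar content" is 'genres'.

- The genres are currently stored as **strings**, so they must be converted to numbers, i.e. vectorized.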

### Dataset used
- **The movies Dataset (https://www.kaggle.com/rounakbanik/the-movies-dataset)**

- movies_metadata.csv
  - Movie metadata such as title and genres
- keywords.csv
  - Keyword values for each movie id

In [18]:
# movie_data = pd.read_csv('drive/My Drive/Colab Notebooks/추천시스템/movies_metadata.csv')
movie_data = pd.read_csv('./dataset/movies_metadata.csv')
movie_data = movie_data.loc[movie_data['original_language'] == 'en', :]
movie_data = movie_data[['id', 'title', 'original_language', 'genres']]

print(movie_data.shape)
movie_data.head(3)

(32269, 4)




Unnamed: 0,id,title,original_language,genres
0,862,Toy Story,en,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '..."
1,8844,Jumanji,en,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '..."
2,15602,Grumpier Old Men,en,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ..."


In [19]:
# movie_keyword = pd.read_csv('drive/My Drive/Colab Notebooks/추천시스템/keywords.csv')
movie_keyword = pd.read_csv('./dataset/keywords.csv')
print(movie_keyword.shape)
movie_keyword.head()

(46419, 2)


Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


Merge the **two datasets above (movies_metadata, keywords)** on id.
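
The shapes reported in this notebook show 32269 rows before the merge and 32852 after it. With an inner merge, that can only happen when some ids appear more than once. The sketch below reproduces the effect on made-up frames. `toy_left`, `toy_right` and the de-duplication step are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

toy_left = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
# id 2 appears twice on the right-hand side
toy_right = pd.DataFrame({"id": [1, 2, 2, 4], "keywords": ["x", "y", "y", "z"]})

# Count repeated ids before merging
n_dup_ids = int(toy_right["id"].duplicated().sum())

# Default inner merge: every matching pair of rows is kept
toy_merged = pd.merge(toy_left, toy_right, on="id")

# Optional: keep one row per id first, so each movie appears once
toy_merged_unique = pd.merge(toy_left, toy_right.drop_duplicates(subset="id"), on="id")

print(n_dup_ids, len(toy_merged), len(toy_merged_unique))
```

Whether to de-duplicate is a design choice. Repeated titles in the merged frame lead to repeated rows and columns in the similarity matrix built later.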

In [20]:
movie_data.id = movie_data.id.astype(int)
movie_keyword.id = movie_keyword.id.astype(int)
movie_data = pd.merge(movie_data, movie_keyword, on='id')
print(movie_data.shape)
movie_data.head()

(32852, 5)


Unnamed: 0,id,title,original_language,genres,keywords
0,862,Toy Story,en,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,Jumanji,en,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,Grumpier Old Men,en,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,Waiting to Exhale,en,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,Father of the Bride Part II,en,"[{'id': 35, 'name': 'Comedy'}]","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


## Data preprocessing

Preprocess with the **ast module**, exactly as before.
- `literal_eval` from the ast module converts the string into list and dict objects

In [21]:
movie_data['genres'] = movie_data['genres'].apply(literal_eval)
movie_data['genres'] = movie_data['genres'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))

movie_data['keywords'] = movie_data['keywords'].apply(literal_eval)
movie_data['keywords'] = movie_data['keywords'].apply(lambda x : [d['name'] for d in x]).apply(lambda x : " ".join(x))

movie_data.head(3)

Unnamed: 0,id,title,original_language,genres,keywords
0,862,Toy Story,en,Animation Comedy Family,jealousy toy boy friendship friends rivalry bo...
1,8844,Jumanji,en,Adventure Fantasy Family,board game disappearance based on children's b...
2,15602,Grumpier Old Men,en,Romance Comedy,fishing best friend duringcreditsstinger old men


## TF-IDF vectorization

The preprocessed strings must be converted to vectors using **TF-IDF**.
- Concatenate 'genres' and 'keywords' into one string, then build a **tfidf vector** from it.

In [22]:
tfidf_vector = TfidfVectorizer()
#tfidf_vector = TfidfVectorizer(ngram_range=(1,2))
tfidf_matrix = tfidf_vector.fit_transform(movie_data['genres'] + " " + movie_data['keywords']).toarray()
#tfidf_matrix = tfidf_vector.fit_transform(movie_data['genres']).toarray()
# get_feature_names_out() requires scikit-learn >= 1.0
tfidf_matrix_feature = tfidf_vector.get_feature_names_out()



In [23]:
tfidf_matrix.shape

(32852, 11437)

In [24]:
tfidf_matrix = pd.DataFrame(tfidf_matrix, columns=tfidf_matrix_feature, index = movie_data.title)
print(tfidf_matrix.shape)
tfidf_matrix.head(3)

(32852, 11437)


title,077,10,11,13,1500s,15th,16th,17th,1812,18th,...,βάφτηκε,γη,κόκκινο,το,χώμα,миньоны,卧底肥妈,绝地奶霸,自然界大事件,超级妈妈
Toy Story,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Computing similarity

The tfidf vectors built above are compared using **cosine similarity** to obtain similarity values.

This yields an n x n matrix, where n is the number of movies.
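
A small sketch of pairwise cosine similarity on three made-up vectors, followed by a back-of-the-envelope memory estimate for the full matrix. The estimate assumes dense float64 storage.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical item vectors
toy_vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Entry (i, j) is the cosine of the angle between rows i and j
toy_sim = cosine_similarity(toy_vectors)
print(toy_sim.round(3))

# Rough size of a dense float64 n x n matrix for n = 32852 (8 bytes per entry)
n_movies = 32852
approx_gb = n_movies ** 2 * 8 / 1e9
print(round(approx_gb, 1), "GB")
```

The diagonal is always 1 because each item is compared with itself, and vectors that share no non-zero dimension score 0. The memory estimate suggests that the full matrix needs several gigabytes of RAM.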

In [25]:
%%time

cosine_sim = cosine_similarity(tfidf_matrix)

cosine_sim.shape

Wall time: 53.5 s


(32852, 32852)

In [26]:
cosine_sim_df = pd.DataFrame(cosine_sim, index = movie_data.title, columns = movie_data.title)
print(cosine_sim_df.shape)
cosine_sim_df.head()

(32852, 32852)


title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Heat,Sabrina,Tom and Huck,Sudden Death,GoldenEye,...,Deep Hearts,The Morning After,House of Horrors,Shadow of the Blair Witch,The Burkittsville 7,Caged Heat 3000,Robin Hood,Betrayal,Satan Triumphant,Queerama
Toy Story,1.0,0.041569,0.008708,0.006937,0.005595,0.0,0.006456,0.059202,0.0,0.0,...,0.0,0.05111,0.028298,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.041569,1.0,0.0,0.065065,0.0,0.0,0.0,0.165721,0.028302,0.011462,...,0.0,0.0,0.039299,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Grumpier Old Men,0.008708,0.0,1.0,0.035846,0.010906,0.0,0.033363,0.0,0.0,0.0,...,0.0,0.099628,0.0,0.0,0.0,0.0,0.106819,0.0,0.0,0.0
Waiting to Exhale,0.006937,0.065065,0.035846,1.0,0.093741,0.003806,0.063686,0.027484,0.0,0.0,...,0.0,0.135728,0.0,0.0,0.0,0.0,0.121701,0.037622,0.0,0.0
Father of the Bride Part II,0.005595,0.0,0.010906,0.093741,1.0,0.0,0.038016,0.0,0.0,0.0,...,0.0,0.064015,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


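Now we create a function that produces the **content-based recsys** results. It works as follows.

- Given a target title (the movie to get recommendations for), fetch its similarity values from the cosine similarity matrix
- Take the entries with the highest similarity values
  - Take the top k of them
- Output the recommendations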
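
The steps above can be sketched on a tiny, made-up similarity table. This variant drops the target by label and uses `Series.nlargest`, which is an alternative to sorting indices with `argsort`. The helper `top_k_similar` and the table `toy_sim_df` are hypothetical.

```python
import pandas as pd

toy_titles = ["A", "B", "C", "D"]
toy_sim_df = pd.DataFrame(
    [[1.0, 0.2, 0.9, 0.5],
     [0.2, 1.0, 0.1, 0.3],
     [0.9, 0.1, 1.0, 0.4],
     [0.5, 0.3, 0.4, 1.0]],
    index=toy_titles, columns=toy_titles,
)

def top_k_similar(target, sim_df, k=2):
    # Similarity of every item to the target, excluding the target itself
    scores = sim_df[target].drop(labels=target)
    # The k largest similarity values, highest first
    return scores.nlargest(k)

toy_result = top_k_similar("A", toy_sim_df, k=2)
print(toy_result)
```

Dropping the target by label avoids relying on the target being ranked first, which can fail when another item has a similarity of exactly 1.0.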

In [27]:
def genre_recommendations(target_title, matrix, items, k=10):
    # Similarity of every movie to the target
    # (assumes target_title appears only once among the matrix columns)
    sim_scores = matrix.loc[:, target_title].values
    # Sort in descending order and skip position 0, which is normally the target itself
    recom_idx = sim_scores.argsort()[::-1][1:k+1]
    recom_title = items.iloc[recom_idx, :].title.values
    recom_genre = items.iloc[recom_idx, :].genres.values
    # Repeat the target's title and genres so that every output row carries them
    target_genre = items[items.title == target_title].genres.values[0]
    d = {
        'target_title': np.full(k, target_title),
        'target_genre': np.full(k, target_genre),
        'recom_title': recom_title,
        'recom_genre': recom_genre
    }
    return pd.DataFrame(d)

In [28]:
genre_recommendations('The Dark Knight Rises', cosine_sim_df, movie_data)

Unnamed: 0,target_title,target_genre,recom_title,recom_genre
0,The Dark Knight Rises,Action Crime Drama Thriller,The Dark Knight,Drama Action Crime Thriller
1,The Dark Knight Rises,Action Crime Drama Thriller,The Burglar,Crime Drama
2,The Dark Knight Rises,Action Crime Drama Thriller,Batman Begins,Action Crime Drama
3,The Dark Knight Rises,Action Crime Drama Thriller,Batman & Robin,Action Crime Fantasy
4,The Dark Knight Rises,Action Crime Drama Thriller,Batman,Fantasy Action
5,The Dark Knight Rises,Action Crime Drama Thriller,Raffles,Adventure Comedy Crime Drama History Romance T...
6,The Dark Knight Rises,Action Crime Drama Thriller,Hero at Large,Action Comedy Drama
7,The Dark Knight Rises,Action Crime Drama Thriller,DC Showcase: Catwoman,Action Adventure Animation Science Fiction
8,The Dark Knight Rises,Action Crime Drama Thriller,DC Super Hero Girls: Hero of the Year,Animation
9,The Dark Knight Rises,Action Crime Drama Thriller,Batman Returns,Action Fantasy
