# Content based Recommender

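- Recommends items using information about the items themselves
- Computes the similarity between items and recommends similar ones
- For example, when a user has watched a movie of a particular genre, similar movies within that genre are recommended by considering information such as the title, release year, and director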

Contents<br>
1. Content based Recommender approaches
2. How do we recommend movies of a similar genre?
3. How do we recommend movies with similar titles?
4. Applying it to data: MovieLens Data

# 1. Content based Recommender approaches

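<b>Method 1. Using ratings</b><br>
Recommend movies of a similar genre to the ones that user A watched and rated highly<br>
e.g. If the user rated Interstellar highly, recommend Inception, the best-rated movie in the same science-fiction genre<br>
<br>
Pros<br>
: Works even when there are no user reviews of the item<br>
<br>
Cons<br>
: Requires detailed information about the item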

<b>Method 2. Recommending from detailed item information</b><br>
<br>
Recommend similar items based on detailed item information such as a movie's title, summary, tags, and genres

In [1]:
# Load libraries and data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
os.chdir('C:/Users/bki19/Desktop/recommender_system')
movies = pd.read_csv('./data/movies.csv', sep=',', encoding='latin-1', usecols=['title', 'genres'])

In [2]:
#Convert Genres' string data into list

# Break up the big genre string into a string array
movies['genres'] = movies['genres'].str.split('|')
# Convert each genre list back to its string form so it can be passed to TfidfVectorizer
movies['genres'] = movies['genres'].fillna("").astype('str')

In [3]:
movies.head()

Unnamed: 0,title,genres
0,Toy Story,"['Adventure', 'Animation', 'Children', 'Comedy..."
1,Jumanji,"['Adventure', 'Children', 'Fantasy']"
2,Grumpier Old Men,"['Comedy', 'Romance']"
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']"
4,Father of the Bride Part II,['Comedy']


<b>About the data</b>
- Titles and genres are given for 9,742 movies
- A movie can have several genres
- For example, Jumanji belongs to several genres: Adventure, Children, and Fantasy

# 2. How do we recommend movies of a similar genre?

Wouldn't a user who likes horror movies look for more horror movies?<br>
Let's look at how to recommend movies of a similar genre using Tf-idf.

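<b>TF-IDF (Term Frequency-Inverse Document Frequency)</b><br>
: A measure of how important a particular term is to a particular document within a collection of documents<br>
<br>
$Tf(t)=\frac{\text{number of times term } t \text{ appears in the document}}{\text{total number of terms in the document}}$<br>
Increases as the term appears more often in the document<br>
<br>
$Idf(t)=\log_{10} \frac{\text{total number of documents}}{\text{number of documents containing term } t}$<br>
Increases as fewer documents contain the term<br>
<br>
$\text{Tf-Idf}(t)=Tf(t) \times Idf(t)$<br>
<br>
e.g. Suppose there are one million documents<br>
Document 1 contains 100 words in total, and the word 'Odyssey' appears 3 times<br>
$Tf=\frac{3}{100}=0.03$<br>
Of the one million documents, 1,000 contain 'Odyssey'<br>
$Idf=\log_{10}(\frac{1000000}{1000})=3$<br>
$Tf \times Idf=0.09$<br>
So the importance of the word 'Odyssey' in Document 1 is measured as 0.09<br>
<br>
Conversely, if the word 'Simpson' appears in every document, then $Idf=0$, so $\text{Tf-Idf}$ is also 0 and the word is treated as unimportant<br>
<br>
=> A term is important when it appears often in a particular document but is not common across all documents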
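
The arithmetic of the worked example can be checked in a few lines of Python. This is a minimal sketch of the textbook formulas written out above; the function names `term_frequency` and `inverse_document_frequency` are illustrative and do not come from any library.

```python
import math

def term_frequency(term_count, total_terms):
    # Share of the document's words that are the term
    return term_count / total_terms

def inverse_document_frequency(n_documents, n_documents_with_term):
    # Base-10 logarithm of (all documents / documents containing the term)
    return math.log10(n_documents / n_documents_with_term)

# 'Odyssey': 3 occurrences in a 100-word document, found in 1,000 of 1,000,000 documents
odyssey_score = term_frequency(3, 100) * inverse_document_frequency(1_000_000, 1_000)

# 'Simpson': found in every document, so Idf = log10(1) = 0
simpson_score = term_frequency(3, 100) * inverse_document_frequency(1_000_000, 1_000_000)

print(round(odyssey_score, 4), simpson_score)
```

The second score is zero regardless of how often 'Simpson' occurs in the document, which is the point of the Idf factor.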

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=1 keeps every term (an integer min_df must be at least 1 in recent scikit-learn)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['genres'])
tfidf_matrix.shape

(9742, 177)

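- Each genre is treated as a term and each movie as a document, and Tf-Idf is applied
- Unlike words in a document, a genre appears at most once per movie
- For example, Jumanji has 3 genres and each appears only once, so under the Tf definition above they all have the same Tf of one third
- A genre scores higher when it is uncommon across all movies and when the movie lists fewer genres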
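
Note that scikit-learn's `TfidfVectorizer` does not use exactly the formula written above: by default it uses a smoothed natural-log Idf and then L2-normalizes each row. The qualitative behaviour described in the bullets still holds. A minimal sketch on three made-up genre strings (the strings are illustrative, not rows from `movies.csv`):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy genre strings, one per movie (illustrative data)
toy_genres = [
    "Adventure Children Fantasy",
    "Comedy Romance",
    "Comedy",
]

vectorizer = TfidfVectorizer(analyzer='word')
toy_matrix = vectorizer.fit_transform(toy_genres)

vocab = vectorizer.vocabulary_   # maps each lowercased genre to its column index
weights = toy_matrix.toarray()

# First movie: its three genres are equally rare, so they share the same weight
first = weights[0]
print(first[vocab['adventure']], first[vocab['children']], first[vocab['fantasy']])

# Second movie: 'romance' (in 1 movie) outweighs 'comedy' (in 2 movies)
second = weights[1]
print(second[vocab['romance']] > second[vocab['comedy']])
```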

In [60]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim[:4, :4]

array([[1.        , 0.31379419, 0.0611029 , 0.05271111],
       [0.31379419, 1.        , 0.        , 0.        ],
       [0.0611029 , 0.        , 1.        , 0.35172407],
       [0.05271111, 0.        , 0.35172407, 1.        ]])

- Cosine similarity is used to find movies whose Tf-idf vectors are similar
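
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch with NumPy; the helper `cosine` and the three vectors are hypothetical values chosen for illustration, not rows of `tfidf_matrix`:

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical TF-IDF rows over the genres (adventure, comedy, romance)
movie_a = np.array([0.0, 0.6, 0.8])
movie_b = np.array([0.0, 1.0, 0.0])
movie_c = np.array([1.0, 0.0, 0.0])

sim_ab = cosine(movie_a, movie_b)  # the two movies share 'comedy'
sim_ac = cosine(movie_a, movie_c)  # no genre in common

print(round(sim_ab, 3), round(sim_ac, 3))
```

Movies with no genre in common get a similarity of 0, which is why the matrix above contains exact zeros.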

In [61]:
# Build a 1-dimensional array with movie titles
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])
indices[:10]

title
Toy Story                       0
Jumanji                         1
Grumpier Old Men                2
Waiting to Exhale               3
Father of the Bride Part II     4
Heat                            5
Sabrina                         6
Tom and Huck                    7
Sudden Death                    8
GoldenEye                       9
dtype: int64

In [62]:
# Function that gets movie recommendations based on the cosine similarity score of movie genres
def genre_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [63]:
genre_recommendations('Dark Knight ').head(20)

8387                          Need for Speed 
8149      Grandmaster, The (Yi dai zong shi) 
123                                Apollo 13 
8026                              Life of Pi 
8396                                    Noah 
38                           Dead Presidents 
341                              Bad Company 
347             Faster Pussycat! Kill! Kill! 
430                        Menace II Society 
568                          Substitute, The 
665                          Nothing to Lose 
1645                       Untouchables, The 
1696                           Monument Ave. 
2563                              Death Wish 
2574                        Band of the Hand 
3037                              Foxy Brown 
3124    Harley Davidson and the Marlboro Man 
3167                                Scarface 
3217                               Swordfish 
3301                           Above the Law 
Name: title, dtype: object

=> Recommends 20 movies whose genres are similar to those of The Dark Knight
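
One caveat with `sim_scores[1:21]`: it assumes the query movie sorts to position 0. When another movie has exactly the same genres, its similarity ties with the query's own score and it may sort ahead of the query. A minimal sketch of an alternative that uses `np.argsort` and excludes the query index explicitly; `top_k_similar` and the sample row are illustrative, not part of the notebook:

```python
import numpy as np

def top_k_similar(sim_row, idx, k=20):
    # Sort indices by descending similarity (stable, so ties keep index order)
    order = np.argsort(-sim_row, kind='stable')
    # Drop the query movie itself, wherever it landed in the ordering
    order = order[order != idx]
    return order[:k]

# Hypothetical similarity row for movie 0 against four movies
sim_row = np.array([1.0, 0.31, 0.06, 0.05])
print(top_k_similar(sim_row, idx=0, k=2))
```

With the notebook's variables, the result could be passed to `titles.iloc[...]` in the same way as `movie_indices`.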

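# 3. How do we recommend movies with similar titles?

If you watched The Lord of the Rings 1, wouldn't you want to watch The Lord of the Rings 2?<br>
Let's look at how to recommend movies with similar titles using Tf-idf.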

In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=1 keeps every term (an integer min_df must be at least 1 in recent scikit-learn)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(movies['title'])
tfidf_matrix.shape

(9742, 20558)

In [65]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
cosine_sim[:4, :4]

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [66]:
# Build a 1-dimensional array with movie titles
titles = movies['title']
indices = pd.Series(movies.index, index=movies['title'])

In [67]:
# Function that gets movie recommendations based on the cosine similarity score of movie titles
def title_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:21]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

In [68]:
title_recommendations('Dark Knight ').head(20)

7768                     Dark Knight Rises, The 
8032    Batman: The Dark Knight Returns, Part 1 
8080    Batman: The Dark Knight Returns, Part 2 
140                                First Knight 
2417                         Cry in the Dark, A 
5778                          Alone in the Dark 
7375                             Knight and Day 
3576                               Black Knight 
3190                           Knight's Tale, A 
6858                       Alone in the Dark II 
4242                                  Dark Blue 
5060                                  Dark Days 
1305                                  Dark City 
5483                                  Dark Star 
6815                      Batman: Gotham Knight 
5934                                 Dark Water 
4749                        Shot in the Dark, A 
7877                               Dark Shadows 
8766                            The Dark Valley 
6690                      Taxi to the Dark Side 
Name: title, dtype: object

=> Movies with similar titles, containing the word Dark or Knight, are recommended

# 4. Applying it to data: MovieLens Data

Let's look at how to make recommendations that take a movie's plot, cast, ratings, and so on into account.

In [4]:
import os
import pandas as pd
os.chdir('C:/Users/bki19/Desktop/recommender_system')
md =  pd.read_csv('./data/the-movies-dataset/movies_metadata.csv', low_memory=False)

<b> 1. Recommending with the plot overview and tagline </b>

In [5]:
md['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [6]:
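# Convert release date to year; missing or unparseable dates become NaN
# (the original test `x != np.nan` is always True, so pd.notnull is used instead)
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)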

In [7]:
links_small = pd.read_csv('./data/the-movies-dataset/links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')

In [8]:
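# Drop three rows (presumably those whose 'id' is not a valid integer) so the cast below succeeds
md = md.drop([19730, 29503, 35587])
md['id'] = md['id'].astype('int')
# .copy() makes smd an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
smd = md[md['id'].isin(links_small)].copy()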
smd.shape

(9099, 25)

The data is reduced to this smaller subset because of limited memory
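
One likely reason memory becomes the bottleneck is that a dense pairwise similarity matrix grows with the square of the number of movies. A rough back-of-the-envelope sketch, assuming 8-byte `float64` values; the function name and the larger catalogue size of 45,000 are hypothetical:

```python
def dense_similarity_gib(n_items, bytes_per_value=8):
    # An n x n matrix of float64 values needs n * n * 8 bytes
    return n_items * n_items * bytes_per_value / 1024**3

subset_gib = dense_similarity_gib(9099)    # the subset used here
large_gib = dense_similarity_gib(45000)    # hypothetical larger catalogue

print(round(subset_gib, 2), round(large_gib, 2))
```

Under these assumptions the subset needs roughly 0.6 GiB for one similarity matrix, while a catalogue five times larger would need about 15 GiB.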

In [9]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')


In [10]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
# min_df=1 keeps every term (an integer min_df must be at least 1 in recent scikit-learn)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tf.fit_transform(smd['description'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(9099, 268124)

In [11]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

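# Compute the cosine similarity matrix. TF-IDF rows are L2-normalized by default,
# so the plain dot product from linear_kernel equals the cosine similarity.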
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
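
The equivalence between `linear_kernel` and `cosine_similarity` on TF-IDF output can be checked on a small example. A minimal sketch, assuming the default `norm='l2'` of `TfidfVectorizer`; the three descriptions are made up and not taken from `movies_metadata.csv`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy descriptions (illustrative data)
docs = [
    "a masked vigilante protects the city",
    "a vigilante returns to save the city",
    "two toys go on an adventure",
]

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Rows already have unit length, so dot products are cosines
same = np.allclose(linear_kernel(toy_tfidf, toy_tfidf),
                   cosine_similarity(toy_tfidf, toy_tfidf))
print(same)
```

`linear_kernel` skips the normalization step that `cosine_similarity` repeats, which is why it is the cheaper choice here.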

In [12]:
#Construct a reverse map of indices and movie titles
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [13]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations2(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 30 most similar movies
    sim_scores = sim_scores[1:31]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the 30 most similar movies
    return smd['title'].iloc[movie_indices]

In [14]:
get_recommendations2('The Dark Knight Rises').head(10)

132                              Batman Forever
6900                            The Dark Knight
1113                             Batman Returns
2579               Batman: Mask of the Phantasm
524                                      Batman
7565                 Batman: Under the Red Hood
7901                           Batman: Year One
8227    Batman: The Dark Knight Returns, Part 2
6144                              Batman Begins
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

Many of the recommended movies have Batman in the title

<b>2. Recommending with credits, genres, and keywords</b>

In [15]:
credits = pd.read_csv('./data/the-movies-dataset/credits.csv')
keywords = pd.read_csv('./data/the-movies-dataset/keywords.csv')

In [16]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [17]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

- credits: data on who the cast, director, writers, and other crew are
- Here, only the director and the lead actors are extracted
- keywords: words that summarize the movie

In [18]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')
smd = md[md['id'].isin(links_small)].copy()  # copy so later assignments do not trigger SettingWithCopyWarning
smd.shape

(9219, 28)

In [19]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    smd[feature] = smd[feature].apply(literal_eval)
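
To see what `literal_eval` does to a single cell, here is a minimal sketch on one shortened string written in the style of the `genres` column; the sample value is illustrative, not copied from the dataset:

```python
from ast import literal_eval

# A cell as read from the CSV: a Python-style list stored as plain text
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)                   # now a list of dicts, not a str
names = [entry['name'] for entry in parsed]

print(type(raw).__name__, type(parsed).__name__, names)
```

Unlike `eval`, `literal_eval` only accepts Python literals, so it is the safer choice for text read from a file.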

Convert the "stringified" lists into Python objects

In [20]:
# Import Numpy 
import numpy as np
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [21]:
# Returns the first 3 elements of the list, or the entire list if it has fewer.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [22]:
# Define new director, cast, genres and keywords features that are in a suitable form.
smd['director'] = smd['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    smd[feature] = smd[feature].apply(get_list)


In [23]:
# Print the new features of the first 3 films
smd[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

Unnamed: 0,title,cast,director,keywords,genres
0,Toy Story,"[Tom Hanks, Tim Allen, Don Rickles]",John Lasseter,"[jealousy, toy, boy]","[Animation, Comedy, Family]"
1,Jumanji,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]",Joe Johnston,"[board game, disappearance, based on children'...","[Adventure, Fantasy, Family]"
2,Grumpier Old Men,"[Walter Matthau, Jack Lemmon, Ann-Margret]",Howard Deutch,"[fishing, best friend, duringcreditsstinger]","[Romance, Comedy]"


In [24]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [25]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    smd[feature] = smd[feature].apply(clean_data)


In [26]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

Convert everything into a single string so that it can be passed to the vectorizer
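
The reason `clean_data` strips spaces before the soup is built becomes visible when looking at how `CountVectorizer` tokenizes names. A minimal sketch using `build_analyzer()`; the two actor names are only sample input:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer applies the same lowercasing and tokenization that fit_transform uses
analyzer = CountVectorizer(stop_words='english').build_analyzer()

# With spaces kept, first and last names become separate tokens,
# so two different people named Tom would share the token 'tom'
with_spaces = analyzer("Tom Hanks Tom Hardy")

# With spaces stripped and lowercased, each person is exactly one token
without_spaces = analyzer("tomhanks tomhardy")

print(with_spaces)
print(without_spaces)
```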

In [27]:
# Create a new soup feature
smd['soup'] = smd.apply(create_soup, axis=1)


In [28]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [29]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [30]:
# Reset index of your main DataFrame and construct reverse mapping as before
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [31]:
get_recommendations2('The Dark Knight Rises', cosine_sim2).head(10)

6981                  The Dark Knight
6218                    Batman Begins
5436                         Mitchell
467                 Romeo Is Bleeding
6623                     The Prestige
3647                  An Innocent Man
8727     Revenge of the Green Dragons
8927          Kidnapping Mr. Heineken
9121                     İtirazım Var
2051    Beyond the Poseidon Adventure
Name: title, dtype: object

- The recommendations are more varied than when only the plot overview and tagline were used
- Other movies by director Nolan, such as The Prestige, are also recommended

<b>3. Taking popularity and ratings into account</b>

In [57]:
def improved_recommendations(title):
    # Candidates: the 29 most similar movies (position 0 is normally the query movie itself)
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim2[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:30]
    movie_indices = [i[0] for i in sim_scores]

    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()       # mean rating across the candidates
    m = vote_counts.quantile(0.5)  # median vote count, used as the cutoff

    def weighted_rating(x):
        # Blend the movie's own rating R with the candidate mean C, weighted by vote count v
        v = x['vote_count']
        R = x['vote_average']
        return (v/(v+m) * R) + (m/(m+v) * C)

    # Keep candidates with at least m votes; .copy() avoids SettingWithCopyWarning below
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')  # truncates, e.g. 8.3 -> 8
    qualified['Score'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('Score', ascending=False).head(10)
    return qualified

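- Select the 29 movies most similar in credits, genres, and keywords (`sim_scores[1:30]`)
- Remove the movies whose vote count is below the median of these candidates (`quantile(0.5)`), then score each remaining movie as a weighted average of its own rating and the candidates' mean rating, weighted by vote count
- Re-sort by score, highest first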
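
The effect of the weighting is easier to see with concrete numbers. A minimal standalone sketch of the same formula; the values of `m` and `C` below are hypothetical, since the notebook computes them from the candidates of each query:

```python
def weighted_rating(v, R, m, C):
    # v: votes for the movie, R: its average rating
    # m: vote-count cutoff, C: mean rating across the candidates
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical cutoff and candidate mean
m, C = 400, 6.0

# Two movies with the same rating of 8 but different numbers of votes
many_votes = weighted_rating(v=4000, R=8, m=m, C=C)
few_votes = weighted_rating(v=400, R=8, m=m, C=C)

print(round(many_votes, 3), round(few_votes, 3))
```

A movie with few votes is pulled toward the candidate mean `C`, while a movie with many votes keeps a score close to its own rating `R`.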

In [58]:
improved_recommendations('The Dark Knight')


Unnamed: 0,title,vote_count,vote_average,year,Score
6623,The Prestige,4510,8,2006,7.956928
8031,The Dark Knight Rises,9263,7,2012,6.988128
6218,Batman Begins,7511,7,2005,6.985391
7583,Kick-Ass,4747,7,2010,6.977038
4021,The Long Good Friday,87,7,1980,6.362069
8467,Kick-Ass 2,2275,6,2013,5.989839
7380,Bronson,756,6,2008,5.97153
7912,Takers,399,6,2010,5.950617
6996,Street Kings,369,6,2008,5.947368
6551,Chaos,278,6,2005,5.934247


Sources:
- https://towardsdatascience.com/learning-to-make-recommendations-745d13883951
- https://www.datacamp.com/community/tutorials/recommender-systems-python
- https://github.com/rounakbanik/movies/blob/master/movies_recommender.ipynb

Data sources:
- https://nbviewer.jupyter.org/github/BadreeshShetty/Learnings-to-make-Recommedations/tree/master/Content%20Filtering/
- https://www.kaggle.com/rounakbanik/the-movies-dataset/data