## Movie Recommendation (TMDB) Using Cosine Similarity
- data from [Kaggle: TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)
- references:
    - https://medium.com/@developeraritro/building-a-recommendation-system-using-weighted-hybrid-technique-75598b6be8ed 
    - (Korean) https://datainclude.me/posts/%EC%98%81%ED%99%94%EB%8D%B0%EC%9D%B4%ED%84%B0%EB%A1%9C_%ED%95%B4%EB%B3%B4%EB%8A%94_%EC%B6%94%EC%B2%9C_%EC%8B%9C%EC%8A%A4%ED%85%9C/

In [2]:
# !kaggle datasets download -d tmdb/tmdb-movie-metadata

In [3]:
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

from ast import literal_eval   # for parsing strings into Python literals (here, lists of dicts)

from sklearn.feature_extraction.text import CountVectorizer   # for creating token matrix
from sklearn.metrics.pairwise import cosine_similarity

In [4]:
movies = pd.read_csv('./tmdb-movie-metadata/tmdb_5000_movies.csv')
print(movies.shape)
movies.head()

(4803, 20)


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


### Select Columns

In [5]:
movies.columns

Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

In [6]:
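# .copy() makes movies_df an independent DataFrame, so the column assignments below
# do not trigger SettingWithCopyWarning
movies_df = movies[['id', 'title', 'genres', 'vote_average', 'vote_count',
        'popularity', 'keywords', 'overview']].copy()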

movies_df.head()

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",7.2,11800,150.437577,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",6.9,4500,139.082615,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",6.3,4466,107.376788,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",7.6,9106,112.31295,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",Following the death of District Attorney Harve...
4,49529,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",6.1,2124,43.926995,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","John Carter is a war-weary, former military ca..."


### Clean 'genres', 'keywords' columns

Each cell in the 'genres' and 'keywords' columns holds a list of dictionaries stored as a STRING.

We can use __literal_eval__ to parse each string into an actual list of dictionaries.
- [ast.literal_eval documentation](https://docs.python.org/3/library/ast.html)
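
As a minimal illustration of what `literal_eval` does with one such cell (the sample string is shortened from the first 'genres' value in this dataset):

```python
from ast import literal_eval

# A 'genres' cell as it arrives from the CSV: a string, not a list
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval only evaluates Python literals (strings, numbers, lists, dicts, ...),
# so it cannot execute arbitrary code the way eval() can
parsed = literal_eval(raw)

print(type(raw).__name__, '->', type(parsed).__name__)
print(parsed[0]['name'])
```

This works because the JSON-style text in these columns happens to be valid Python literal syntax. JSON values such as `true` or `null` would not parse this way, in which case `json.loads` would be the usual choice.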

In [7]:
print(movies_df['genres'][:1].values)
print()
print(type(movies_df['genres'][1]))

['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]']

<class 'str'>


In [8]:
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)
movies_df.head()

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview
0,19995,Avatar,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",7.2,11800,150.437577,"[{'id': 1463, 'name': 'culture clash'}, {'id':...","In the 22nd century, a paraplegic Marine is di..."
1,285,Pirates of the Caribbean: At World's End,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",6.9,4500,139.082615,"[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...","Captain Barbossa, long believed to be dead, ha..."
2,206647,Spectre,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",6.3,4466,107.376788,"[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...",A cryptic message from Bond’s past sends him o...
3,49026,The Dark Knight Rises,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",7.6,9106,112.31295,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...",Following the death of District Attorney Harve...
4,49529,John Carter,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",6.1,2124,43.926995,"[{'id': 818, 'name': 'based on novel'}, {'id':...","John Carter is a war-weary, former military ca..."


In [9]:
print(movies_df['genres'][:1].values)
print()
print(type(movies_df['genres'][1]))

[list([{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 878, 'name': 'Science Fiction'}])]

<class 'list'>


In [10]:
movies_df['genres'][0]

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [11]:
movies_df['keywords'][0]

[{'id': 1463, 'name': 'culture clash'},
 {'id': 2964, 'name': 'future'},
 {'id': 3386, 'name': 'space war'},
 {'id': 3388, 'name': 'space colony'},
 {'id': 3679, 'name': 'society'},
 {'id': 3801, 'name': 'space travel'},
 {'id': 9685, 'name': 'futuristic'},
 {'id': 9840, 'name': 'romance'},
 {'id': 9882, 'name': 'space'},
 {'id': 9951, 'name': 'alien'},
 {'id': 10148, 'name': 'tribe'},
 {'id': 10158, 'name': 'alien planet'},
 {'id': 10987, 'name': 'cgi'},
 {'id': 11399, 'name': 'marine'},
 {'id': 13065, 'name': 'soldier'},
 {'id': 14643, 'name': 'battle'},
 {'id': 14720, 'name': 'love affair'},
 {'id': 165431, 'name': 'anti war'},
 {'id': 193554, 'name': 'power relations'},
 {'id': 206690, 'name': 'mind and soul'},
 {'id': 209714, 'name': '3d'}]

Notice the unnecessary information: the keys ('id', 'name') and the id values.

Let's drop those and keep only the 'name' values.

In [12]:
movies_df['genres'] = movies_df['genres'].apply(lambda x: [y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x: [y['name'] for y in x])

movies_df[['genres', 'keywords']][:2]

Unnamed: 0,genres,keywords
0,"[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon..."
1,"[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ..."


The 'genres' and 'keywords' columns now contain lists of individual items.

Let's convert each list to a single string in which the items are separated by whitespace. (Only 'genres' is converted and used below.)

__CountVectorizer__ (below) expects one string per document, so we need the data in this form.

In [13]:
movies_df['genres'][0], movies_df['keywords'][0]

(['Action', 'Adventure', 'Fantasy', 'Science Fiction'],
 ['culture clash',
  'future',
  'space war',
  'space colony',
  'society',
  'space travel',
  'futuristic',
  'romance',
  'space',
  'alien',
  'tribe',
  'alien planet',
  'cgi',
  'marine',
  'soldier',
  'battle',
  'love affair',
  'anti war',
  'power relations',
  'mind and soul',
  '3d'])

For example, for the first movie's genres:

In [14]:
' '.join(movies_df['genres'][0])

'Action Adventure Fantasy Science Fiction'

In [15]:
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x: ' '.join(x))
movies_df.head()

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",7.2,11800,150.437577,"[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...",Action Adventure Fantasy Science Fiction
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]",6.9,4500,139.082615,"[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...",Adventure Fantasy Action
2,206647,Spectre,"[Action, Adventure, Crime]",6.3,4466,107.376788,"[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,Action Adventure Crime
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]",7.6,9106,112.31295,"[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,Action Crime Drama Thriller
4,49529,John Carter,"[Action, Adventure, Science Fiction]",6.1,2124,43.926995,"[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...",Action Adventure Science Fiction


### CountVectorize
- Convert a collection of text documents to a matrix of token counts.
- [CountVectorizer Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
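
To see what ends up in the columns of the token matrix, here is a pure-Python sketch that approximates how `CountVectorizer(ngram_range=(1, 2))` breaks one genre string into features. It assumes the default settings (lowercasing, and tokens of two or more word characters). The helper name `ngram_features` is hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

def ngram_features(text, max_n=2):
    """Count word n-grams (n = 1..max_n), roughly as CountVectorizer's defaults would."""
    # lowercase, then keep tokens made of at least two word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

features = ngram_features('Action Adventure Fantasy Science Fiction')
for name in sorted(features):
    print(name, features[name])
```

Joining genres with spaces splits 'Science Fiction' into two tokens and also creates cross-genre bigrams such as 'fantasy science', so the bigram features partly reflect the order in which genres are listed.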

In [16]:
count_vect = CountVectorizer(min_df=0.0, ngram_range=(1,2))
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

(4803, 276)


### Cosine Similarity
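- 0: Orthogonal vectors (nothing in common)
- 1: Vectors pointing in the same direction (most similar)
- -1: Vectors pointing in opposite directions
- Token counts are never negative, so the similarities computed here fall between 0 and 1.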
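
The three cases above can be checked with a small hand-rolled cosine similarity, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. This is an illustrative sketch with made-up vectors; the helper `cosine` is hypothetical, and the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine([1, 2, 0], [2, 4, 0])         # same direction: close to 1
orthogonal = cosine([1, 0, 0], [0, 1, 1])   # no shared nonzero component: 0
opposite = cosine([1, 2, 0], [-1, -2, 0])   # opposite direction: close to -1
print(same, orthogonal, opposite)
```

Only the direction of the vectors matters: `[1, 2, 0]` and `[2, 4, 0]` differ in length but score as maximally similar.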

In [17]:
genre_mat

<4803x276 sparse matrix of type '<class 'numpy.int64'>'
	with 20631 stored elements in Compressed Sparse Row format>

In [18]:
genre_sim = cosine_similarity(genre_mat)
print(genre_sim.shape)
print(genre_sim[:2])

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]]


Sort the index of similarities in descending order for each row of the genre_sim matrix (highest to lowest similarity).

In [19]:
genre_sim_sorted_indices = genre_sim.argsort()[:, ::-1]   # sort in descending order
genre_sim_sorted_indices[0]

array([   0, 3494,  813, ..., 3038, 3037, 2401])

This indicates the following:

- The movie at index 0 has the highest similarity with itself (as expected), so the first index is 0.
- The movie at index 3494 is the next most similar to the movie at index 0.
- The movie at index 813 is the third most similar to the movie at index 0, and so on.

The indices towards the end of the array (3038, 3037, 2401) correspond to movies that have the lowest genre similarity with the movie at index 0. Movies with an identical genre string also score 1.0, and the order among tied scores is arbitrary.
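
The sorting step can be reproduced on a tiny made-up similarity matrix (the values are illustrative, not taken from the dataset):

```python
import numpy as np

sim = np.array([[1.0, 0.2, 0.7],
                [0.2, 1.0, 0.5],
                [0.7, 0.5, 1.0]])

# argsort returns, per row, the column indices in ascending order of similarity;
# reversing each row with [:, ::-1] turns that into descending order
order = sim.argsort()[:, ::-1]
print(order)
```

Each row of `order` starts with the row's own index, because every item has similarity 1.0 with itself and nothing else in this example ties with it.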

### Top 10 movies similar to 'The Godfather'
- Define a function that returns a DataFrame of the top n movies with the most similar genres.

In [24]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]

    title_index = title_movie.index.values
    similar_indices = sorted_ind[title_index, :(top_n)]

    print(f'INDEX OF MOVIES SIMILAR TO: "{title_name}": {similar_indices}')
    similar_indices = similar_indices.reshape(-1)

    return df.iloc[similar_indices]

In [25]:
similar_movies_godfather = find_sim_movie(movies_df, genre_sim_sorted_indices, 'The Godfather', 10)
similar_movies_godfather[['title', 'vote_average']]

INDEX OF MOVIES SIMILAR TO: "The Godfather": [[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]


Unnamed: 0,title,vote_average
2731,The Godfather: Part II,8.3
1243,Mean Streets,7.2
3636,Light Sleeper,5.7
1946,The Bad Lieutenant: Port of Call - New Orleans,6.0
2640,Things to Do in Denver When You're Dead,6.7
4065,Mi America,0.0
1847,GoodFellas,8.2
4217,Kids,6.8
883,Catch Me If You Can,7.7
3866,City of God,8.1


'Mi America' has a vote_average of 0.0, which suggests that the data contains movies that have not been rated.

In [27]:
movies_df[['title', 'vote_average', 'vote_count']].sort_values('vote_average', ascending=True)[:10]

Unnamed: 0,title,vote_average,vote_count
4633,Death Calls,0.0,0
4305,Down & Out With The Dolls,0.0,0
4653,Rust,0.0,0
4293,The Algerian,0.0,0
4118,Hum To Mohabbat Karega,0.0,0
4186,A Beginner's Guide to Snuff,0.0,0
4638,Amidst the Devil's Wings,0.0,0
4307,Certifiably Jonathan,0.0,0
1464,Black Water Transit,0.0,0
4444,Elza,0.0,0


In [28]:
movies_df[['title', 'vote_average', 'vote_count']].sort_values('vote_average', ascending=False)[:10]

Unnamed: 0,title,vote_average,vote_count
3519,Stiff Upper Lips,10.0,1
4247,Me You and Five Bucks,10.0,2
4045,"Dancer, Texas Pop. 81",10.0,1
4662,Little Big Top,10.0,1
3992,Sardaarji,9.5,2
2386,One Man's Hero,9.3,2
2970,There Goes My Baby,8.5,2
1881,The Shawshank Redemption,8.5,8205
2796,The Prisoner of Zenda,8.4,11
3337,The Godfather,8.4,5893


Entries with very few votes are also problematic: a 10.0 average from a single vote says little about a movie.

To address this, let's replace the raw average with a weighted average rating.

### Get Weighted Average

$$
\text{Weighted Average} = \left( \frac{v}{v + m} \right) R + \left( \frac{m}{v + m} \right) C
$$

Where:
- $v$: vote count for an individual movie
- $m$: minimum vote count (a threshold: the larger $m$ is, the more a rating is pulled toward $C$)
- $R$: average rating for an individual movie
- $C$: average rating across all movies

This appears to be the rating method used by IMDb (see: [Quora: How does IMDb's rating system work?](https://www.quora.com/How-does-IMDbs-rating-system-work))

For now, set m to the 60th percentile of vote_count, i.e. the vote count that 60% of the movies do not exceed.
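
As a worked example of the formula, the sketch below plugs in two movies from this dataset. `C` and `m` are hard-coded to the rounded values this notebook prints (6.092 and 370.2), so the results only approximate the `weighted_vote` column; the function name `weighted_rating` is hypothetical.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global average C, weighted by its vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 6.092   # mean vote_average over all movies (rounded)
m = 370.2   # 60th percentile of vote_count

few_votes = weighted_rating(v=1, R=10.0, m=m, C=C)      # 'Stiff Upper Lips': 10.0 from 1 vote
many_votes = weighted_rating(v=8205, R=8.5, m=m, C=C)   # 'The Shawshank Redemption'
print(round(few_votes, 3), round(many_votes, 3))
```

With a single vote the score stays close to C, while thousands of votes let the movie's own average dominate.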


In [29]:
C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print('C:', round(C,3), 'm:', round(m,3))

C: 6.092 m: 370.2


In [30]:
def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']

    return ( (v/(v+m)) * R ) + ( (m/(m+v)) * C )

In [31]:
movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis=1)
movies_df.head()

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal,weighted_vote
0,19995,Avatar,"[Action, Adventure, Fantasy, Science Fiction]",7.2,11800,150.437577,"[culture clash, future, space war, space colon...","In the 22nd century, a paraplegic Marine is di...",Action Adventure Fantasy Science Fiction,7.166301
1,285,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]",6.9,4500,139.082615,"[ocean, drug abuse, exotic island, east india ...","Captain Barbossa, long believed to be dead, ha...",Adventure Fantasy Action,6.838594
2,206647,Spectre,"[Action, Adventure, Crime]",6.3,4466,107.376788,"[spy, based on novel, secret agent, sequel, mi...",A cryptic message from Bond’s past sends him o...,Action Adventure Crime,6.284091
3,49026,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]",7.6,9106,112.31295,"[dc comics, crime fighter, terrorist, secret i...",Following the death of District Attorney Harve...,Action Crime Drama Thriller,7.541095
4,49529,John Carter,"[Action, Adventure, Science Fiction]",6.1,2124,43.926995,"[based on novel, mars, medallion, space travel...","John Carter is a war-weary, former military ca...",Action Adventure Science Fiction,6.098838


In [32]:
movies_df[movies_df['vote_count']<10]

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal,weighted_vote
463,161795,Déjà Vu,"[Romance, Drama]",8.0,1,0.605645,"[love, american, pin, stranger, ruby]",L.A. shop owner Dana and Englishman Sean meet ...,Romance Drama,6.097311
492,293644,Top Cat Begins,"[Comedy, Animation]",5.3,9,0.719996,[3d],Top Cat has arrived to charm his way into your...,Comedy Animation,6.073370
1023,7504,Earth,[Drama],6.6,9,1.246883,"[based on novel, war of independence, period d...",It's 1947 and the borderlines between India an...,Drama,6.104224
1039,113464,Inchon,"[Drama, History, War]",6.5,2,0.146783,[],A noisy and absurd re-telling of the great 195...,Drama History War,6.094363
1453,49478,Warriors of Virtue,"[Fantasy, Family, Action]",4.7,9,0.912395,"[american football, mythology, chinese food, k...","A young man, Ryan, suffering from a disability...",Fantasy Family Action,6.059130
...,...,...,...,...,...,...,...,...,...,...
4795,124606,Bang,[Drama],6.0,1,0.918116,"[gang, audition, police fake, homeless, actress]",A young woman in L.A. is having a bad day: she...,Drama,6.091923
4797,67238,Cavite,"[Foreign, Thriller]",7.5,2,0.022173,[],"Adam, a security guard, travels from Californi...",Foreign Thriller,6.099736
4799,72766,Newlyweds,"[Comedy, Romance]",5.9,5,0.642552,[],A newlywed couple's honeymoon is upended by th...,Comedy Romance,6.089611
4800,231617,"Signed, Sealed, Delivered","[Comedy, Drama, Romance, TV Movie]",7.0,6,1.444476,"[date, love at first sight, narration, investi...","""Signed, Sealed, Delivered"" introduces a dedic...",Comedy Drama Romance TV Movie,6.106650


In [33]:
movies_df[['title', 'vote_average', 'weighted_vote', 'vote_count']].sort_values('weighted_vote', ascending=False)[:10]

Unnamed: 0,title,vote_average,weighted_vote,vote_count
1881,The Shawshank Redemption,8.5,8.396052,8205
3337,The Godfather,8.4,8.263591,5893
662,Fight Club,8.3,8.216455,9413
3232,Pulp Fiction,8.3,8.207102,8428
65,The Dark Knight,8.2,8.13693,12002
1818,Schindler's List,8.3,8.126069,4329
3865,Whiplash,8.3,8.123248,4254
809,Forrest Gump,8.2,8.105954,7927
2294,Spirited Away,8.3,8.105867,3840
2731,The Godfather: Part II,8.3,8.079586,3338


### (RE) Top 10 movies similar to 'The Godfather' (using weighted average ratings)

In [None]:
similar_movies_godfather = find_sim_movie(movies_df, genre_sim_sorted_indices, 'The Godfather', 10)
similar_movies_godfather[['title', 'vote_average', 'weighted_vote', 'vote_count']]

In [36]:
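# Adjusted function: take 2 * top_n genre-similar candidates, drop the query movie,
# then rank the candidates by weighted_vote
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]
    # index label(s) of the query movie; equal to row positions with the default RangeIndex
    title_index = title_movie.index.values

    # take twice as many candidates as needed, because re-ranking by rating reorders them
    similar_indices = sorted_ind[title_index, :(top_n*2)]
    # print(f'INDEX OF MOVIES SIMILAR TO: "{title_name}": {similar_indices}')
    similar_indices = similar_indices.reshape(-1)

    # exclude the query movie itself
    similar_indices = similar_indices[similar_indices != title_index]

    return df.iloc[similar_indices].sort_values('weighted_vote', ascending=False)[:top_n]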

In [37]:
similar_movies_godfather = find_sim_movie(movies_df, genre_sim_sorted_indices, 'The Godfather', 10)
similar_movies_godfather[['title', 'vote_average', 'weighted_vote']]

Unnamed: 0,title,vote_average,weighted_vote
2731,The Godfather: Part II,8.3,8.079586
1847,GoodFellas,8.2,7.976937
3866,City of God,8.1,7.759693
1663,Once Upon a Time in America,8.2,7.657811
883,Catch Me If You Can,7.7,7.557097
281,American Gangster,7.4,7.141396
4041,This Is England,7.4,6.739664
1149,American Hustle,6.8,6.717525
1243,Mean Streets,7.2,6.626569
2839,Rounders,6.9,6.530427
