# Movie Recommender System
- This notebook builds a content-based recommender system from movie metadata: keywords, cast, and crew.
- The goal is to suggest movies that share keywords, cast members, or crew members with a given title.

## Importing Data Processing Libraries

In [None]:
import numpy as np
import pandas as pd

## Reading the Datasets into Python

In [None]:
cred_df = pd.read_csv('/content/drive/MyDrive/Ml_course/recommender_systems/bootcamp/credits.csv')
cred_df.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [None]:
key_df = pd.read_csv('/content/drive/MyDrive/Ml_course/recommender_systems/bootcamp/keywords.csv')
key_df.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


Both dataframes share an id column that identifies the movie, so we can merge them on id.

In [None]:
cred_df.shape

(45476, 3)

In [None]:
key_df.shape

(46419, 2)

In [None]:
key_df['keywords'][0]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

In [None]:
cred_df['cast'][0]

"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4t

In [None]:
key_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


In [None]:
cred_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


## Merging Dataframes

The two dataframes have different row counts, so we check each id column for duplicates.

In [None]:
cred_df['id'].nunique()

45432

In [None]:
key_df['id'].nunique()

45432

In [None]:
key_df.drop_duplicates(subset=['id'], inplace=True)

In [None]:
cred_df.drop_duplicates(subset=['id'], inplace=True)

In [None]:
key_df.shape, cred_df.shape

((45432, 2), (45432, 3))

After dropping duplicates, both dataframes have 45432 rows.

In [None]:
new_df = key_df.merge(cred_df, on='id')
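
Equal row counts do not by themselves prove that every id is unique on both sides. As a minimal sketch on toy data (the frames `left` and `right` are hypothetical stand-ins for key_df and cred_df), the merge can be asked to check this itself through the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for key_df and cred_df (values are illustrative only).
left = pd.DataFrame({"id": [862, 8844, 8844], "keywords": ["a", "b", "b"]})
right = pd.DataFrame({"id": [862, 8844], "cast": ["x", "y"]})

# Drop repeated ids, returning a new frame instead of mutating in place.
left = left.drop_duplicates(subset=["id"])

# validate="one_to_one" raises pandas.errors.MergeError if either side
# still contains a repeated id, so rows cannot silently multiply.
merged = left.merge(right, on="id", how="inner", validate="one_to_one")
print(merged.shape)
```

If the duplicates had not been dropped first, this merge would raise an error rather than return extra rows.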

In [None]:
new_df.head()

Unnamed: 0,id,keywords,cast,crew
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':...","[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de..."


## Processing Data

In [None]:
new_df['keywords'][0]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

In [None]:
new_df.isnull().sum()

id          0
keywords    0
cast        0
crew        0
dtype: int64

The values are stored as strings, so we use literal_eval to parse them into Python lists.

In [None]:
from ast import literal_eval

In [None]:
new_df['keywords'] = new_df['keywords'].apply(literal_eval)

In [None]:
new_df['keywords'][0][0]['name']

'jealousy'
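
literal_eval raises an exception when a cell is not a valid Python literal. The columns here have no missing values, but as a hedged sketch, a small wrapper (the name `parse_list` is our own, not part of any library) could fall back to an empty list for malformed cells:

```python
from ast import literal_eval


def parse_list(raw):
    """Return the parsed list, or [] when raw is not a valid list literal."""
    try:
        value = literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Malformed strings, NaN, and None all end up here.
        return []
    return value if isinstance(value, list) else []


sample = "[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]"
names = [item["name"].lower() for item in parse_list(sample)]
print(names)  # ['jealousy', 'toy']
```

Such a wrapper could be passed to `apply` in place of literal_eval if a future version of the data contains broken rows.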

We keep only the keyword names, lowercased, as a list.

In [None]:
new_df['keywords'] = new_df['keywords'].apply(lambda x: [i['name'].lower() for i in x] if isinstance(x, list) else [])

In [None]:
new_df['keywords']

0        [jealousy, toy, boy, friendship, friends, riva...
1        [board game, disappearance, based on children'...
2        [fishing, best friend, duringcreditsstinger, o...
3        [based on novel, interracial relationship, sin...
4        [baby, midlife crisis, confidence, aging, daug...
                               ...                        
45427                                        [tragic love]
45428                                [artist, play, pinoy]
45429                                                   []
45430                                                   []
45431                                                   []
Name: keywords, Length: 45432, dtype: object

In [None]:
new_df['cast'][0]

"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4t

Since there are many cast members, we keep only the first four, who are usually the top-billed actors.

In [None]:
new_df['cast'] = new_df['cast'].apply(literal_eval)

In [None]:
new_df['cast'] = new_df['cast'].apply(lambda x: [i['name'].lower() for i in x[:4]] if isinstance(x, list) else [])

In [None]:
new_df['cast']

0          [tom hanks, tim allen, don rickles, jim varney]
1        [robin williams, jonathan hyde, kirsten dunst,...
2        [walter matthau, jack lemmon, ann-margret, sop...
3        [whitney houston, angela bassett, loretta devi...
4        [steve martin, diane keaton, martin short, kim...
                               ...                        
45427          [leila hatami, kourosh tahami, elham korda]
45428    [angel aquino, perry dizon, hazel orencio, joe...
45429    [erika eleniak, adam baldwin, julie du page, j...
45430    [iwan mosschuchin, nathalie lissenko, pavel pa...
45431                                                   []
Name: cast, Length: 45432, dtype: object

In [None]:
new_df['crew'][0]

'[{\'credit_id\': \'52fe4284c3a36847f8024f49\', \'department\': \'Directing\', \'gender\': 2, \'id\': 7879, \'job\': \'Director\', \'name\': \'John Lasseter\', \'profile_path\': \'/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f4f\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12891, \'job\': \'Screenplay\', \'name\': \'Joss Whedon\', \'profile_path\': \'/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f55\', \'department\': \'Writing\', \'gender\': 2, \'id\': 7, \'job\': \'Screenplay\', \'name\': \'Andrew Stanton\', \'profile_path\': \'/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f5b\', \'department\': \'Writing\', \'gender\': 2, \'id\': 12892, \'job\': \'Screenplay\', \'name\': \'Joel Cohen\', \'profile_path\': \'/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg\'}, {\'credit_id\': \'52fe4284c3a36847f8024f61\', \'department\': \'Writing\', \'gender\': 0, \'id\': 12893, \'job\': \'Screenplay\', \'name\': \'A

Similarly for crew, we take the first four names, which often include the director and the writers, although the order of the crew list does not guarantee this.

In [None]:
new_df['crew'] = new_df['crew'].apply(literal_eval)

In [None]:
new_df['crew'] = new_df['crew'].apply(lambda x: [i['name'].lower() for i in x[:4]] if isinstance(x, list) else [])
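
Taking the first four crew entries relies on list order. An alternative, shown here only as a sketch and not as what this notebook does, is to select crew by the job field that appears in the raw records. The helper name `crew_by_job` and the default job labels are assumptions; only 'Director' and 'Screenplay' are visible in the sample above.

```python
def crew_by_job(crew, jobs=("Director", "Screenplay", "Writer")):
    """Return lowercased, de-duplicated names of crew whose job is in jobs."""
    names = []
    for member in crew:
        if member.get("job") in jobs:
            name = member["name"].lower()
            if name not in names:  # the same person can hold several jobs
                names.append(name)
    return names


# Abbreviated records; the last entry is made up for illustration.
sample_crew = [
    {"job": "Director", "name": "John Lasseter"},
    {"job": "Screenplay", "name": "Joss Whedon"},
    {"job": "Screenplay", "name": "Joss Whedon"},
    {"job": "Editor", "name": "Example Person"},
]
print(crew_by_job(sample_crew))  # ['john lasseter', 'joss whedon']
```

This also removes repeated names such as the doubled entries that can appear when one person is credited for two jobs.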

In [None]:
new_df.head()

Unnamed: 0,id,keywords,cast,crew
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[tom hanks, tim allen, don rickles, jim varney]","[john lasseter, joss whedon, andrew stanton, j..."
1,8844,"[board game, disappearance, based on children'...","[robin williams, jonathan hyde, kirsten dunst,...","[larry j. franco, jonathan hensleigh, james ho..."
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[walter matthau, jack lemmon, ann-margret, sop...","[howard deutch, mark steven johnson, mark stev..."
3,31357,"[based on novel, interracial relationship, sin...","[whitney houston, angela bassett, loretta devi...","[forest whitaker, ronald bass, ronald bass, ez..."
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[steve martin, diane keaton, martin short, kim...","[alan silvestri, elliot davis, nancy meyers, n..."


### Converting to string

We convert each extracted list into a comma-separated string and then join the three features into a single feature.

In [None]:
','.join(map(str, new_df['keywords'][0]))

'jealousy,toy,boy,friendship,friends,rivalry,boy next door,new toy,toy comes to life'

In [None]:
new_df['new_key'] = new_df['keywords'].apply(lambda x: ','.join(map(str, x)))

In [None]:
new_df.head()

Unnamed: 0,id,keywords,cast,crew,new_key
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[tom hanks, tim allen, don rickles, jim varney]","[john lasseter, joss whedon, andrew stanton, j...","jealousy,toy,boy,friendship,friends,rivalry,bo..."
1,8844,"[board game, disappearance, based on children'...","[robin williams, jonathan hyde, kirsten dunst,...","[larry j. franco, jonathan hensleigh, james ho...","board game,disappearance,based on children's b..."
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[walter matthau, jack lemmon, ann-margret, sop...","[howard deutch, mark steven johnson, mark stev...","fishing,best friend,duringcreditsstinger,old men"
3,31357,"[based on novel, interracial relationship, sin...","[whitney houston, angela bassett, loretta devi...","[forest whitaker, ronald bass, ronald bass, ez...","based on novel,interracial relationship,single..."
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[steve martin, diane keaton, martin short, kim...","[alan silvestri, elliot davis, nancy meyers, n...","baby,midlife crisis,confidence,aging,daughter,..."


In [None]:
new_df['new_cast'] = new_df['cast'].apply(lambda x: ','.join(map(str, x)))

In [None]:
new_df['new_crew'] = new_df['crew'].apply(lambda x: ','.join(map(str, x)))

In [None]:
new_df.head()

Unnamed: 0,id,keywords,cast,crew,new_key,new_cast,new_crew
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[tom hanks, tim allen, don rickles, jim varney]","[john lasseter, joss whedon, andrew stanton, j...","jealousy,toy,boy,friendship,friends,rivalry,bo...","tom hanks,tim allen,don rickles,jim varney","john lasseter,joss whedon,andrew stanton,joel ..."
1,8844,"[board game, disappearance, based on children'...","[robin williams, jonathan hyde, kirsten dunst,...","[larry j. franco, jonathan hensleigh, james ho...","board game,disappearance,based on children's b...","robin williams,jonathan hyde,kirsten dunst,bra...","larry j. franco,jonathan hensleigh,james horne..."
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[walter matthau, jack lemmon, ann-margret, sop...","[howard deutch, mark steven johnson, mark stev...","fishing,best friend,duringcreditsstinger,old men","walter matthau,jack lemmon,ann-margret,sophia ...","howard deutch,mark steven johnson,mark steven ..."
3,31357,"[based on novel, interracial relationship, sin...","[whitney houston, angela bassett, loretta devi...","[forest whitaker, ronald bass, ronald bass, ez...","based on novel,interracial relationship,single...","whitney houston,angela bassett,loretta devine,...","forest whitaker,ronald bass,ronald bass,ezra s..."
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[steve martin, diane keaton, martin short, kim...","[alan silvestri, elliot davis, nancy meyers, n...","baby,midlife crisis,confidence,aging,daughter,...","steve martin,diane keaton,martin short,kimberl...","alan silvestri,elliot davis,nancy meyers,nancy..."


In [None]:
def merge_cols(X):
    a = X['new_key']
    b = X['new_cast']
    c = X['new_crew']
    return f'{a}, {b}, {c}'

In [None]:
new_df['movie_details'] = new_df.apply(merge_cols, axis=1)

In [None]:
new_df.head()

Unnamed: 0,id,keywords,cast,crew,new_key,new_cast,new_crew,movie_details
0,862,"[jealousy, toy, boy, friendship, friends, riva...","[tom hanks, tim allen, don rickles, jim varney]","[john lasseter, joss whedon, andrew stanton, j...","jealousy,toy,boy,friendship,friends,rivalry,bo...","tom hanks,tim allen,don rickles,jim varney","john lasseter,joss whedon,andrew stanton,joel ...","jealousy,toy,boy,friendship,friends,rivalry,bo..."
1,8844,"[board game, disappearance, based on children'...","[robin williams, jonathan hyde, kirsten dunst,...","[larry j. franco, jonathan hensleigh, james ho...","board game,disappearance,based on children's b...","robin williams,jonathan hyde,kirsten dunst,bra...","larry j. franco,jonathan hensleigh,james horne...","board game,disappearance,based on children's b..."
2,15602,"[fishing, best friend, duringcreditsstinger, o...","[walter matthau, jack lemmon, ann-margret, sop...","[howard deutch, mark steven johnson, mark stev...","fishing,best friend,duringcreditsstinger,old men","walter matthau,jack lemmon,ann-margret,sophia ...","howard deutch,mark steven johnson,mark steven ...","fishing,best friend,duringcreditsstinger,old m..."
3,31357,"[based on novel, interracial relationship, sin...","[whitney houston, angela bassett, loretta devi...","[forest whitaker, ronald bass, ronald bass, ez...","based on novel,interracial relationship,single...","whitney houston,angela bassett,loretta devine,...","forest whitaker,ronald bass,ronald bass,ezra s...","based on novel,interracial relationship,single..."
4,11862,"[baby, midlife crisis, confidence, aging, daug...","[steve martin, diane keaton, martin short, kim...","[alan silvestri, elliot davis, nancy meyers, n...","baby,midlife crisis,confidence,aging,daughter,...","steve martin,diane keaton,martin short,kimberl...","alan silvestri,elliot davis,nancy meyers,nancy...","baby,midlife crisis,confidence,aging,daughter,..."


In [None]:
new_df['movie_details'][0]

'jealousy,toy,boy,friendship,friends,rivalry,boy next door,new toy,toy comes to life, tom hanks,tim allen,don rickles,jim varney, john lasseter,joss whedon,andrew stanton,joel cohen'
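
One thing to keep in mind about this string is how it will be tokenized later. TfidfVectorizer's default token pattern splits on spaces and punctuation, so a name like 'tom hanks' becomes two separate tokens. The sketch below reproduces that default pattern with the standard re module and shows one possible variation, collapsing the spaces inside each name so that it stays a single token. The helper names `tokens` and `collapse` are our own.

```python
import re

# Default token_pattern used by scikit-learn's text vectorizers.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens(text):
    """Split text the way the default vectorizer pattern would."""
    return TOKEN_PATTERN.findall(text.lower())


def collapse(names):
    """Join names with commas after removing the spaces inside each name."""
    return ",".join(name.replace(" ", "") for name in names)


cast = ["tom hanks", "tim allen"]
print(tokens(",".join(cast)))  # ['tom', 'hanks', 'tim', 'allen']
print(tokens(collapse(cast)))  # ['tomhanks', 'timallen']
```

With the first form, two different people who share a first name contribute a common token; with the second, only exact name matches count. Which behaviour is preferable depends on the data, and this notebook uses the first form.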

## Reading movie data to get names of movies

In [None]:
movies = pd.read_csv('/content/drive/MyDrive/Ml_course/recommender_systems/bootcamp/movies_metadata.csv')
movies.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [None]:
movies.shape

(45466, 24)

In [None]:
movies.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [None]:
movies['id'].nunique()

45436

In [None]:
movies.drop_duplicates(subset=['id'], inplace=True)

In [None]:
movies.shape

(45436, 24)

In [None]:
movies = movies.iloc[:20000,:].copy()  # copy so the later column assignment does not act on a slice

In [None]:
def to_num(x):
    try:
        return int(x)
    except ValueError:
        return 0

In [None]:
movies['id'] = movies['id'].apply(to_num)

In [None]:
movies.loc[movies['id'] == 0]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",0,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,


**Computing the cosine similarity on the larger dataset repeatedly crashed the process, so we keep only the first 19500 rows.**

In [None]:
movies = movies.iloc[:19500,:]
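
A rough estimate suggests why the full dataset is a problem. Assuming the similarity matrix is stored as a dense square array of 8-byte floats (an assumption about how the pairwise result is held in memory), its size grows with the square of the number of rows:

```python
def dense_matrix_gb(n_rows, bytes_per_value=8):
    """Approximate size in GB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_value / 1e9


full = dense_matrix_gb(45432)    # all movies after dropping duplicates
subset = dense_matrix_gb(19500)  # the subset used in this notebook
print(f"{full:.1f} GB vs {subset:.1f} GB")  # 16.5 GB vs 3.0 GB
```

Under these assumptions the full matrix would need about 16.5 GB, while the 19500-row subset needs about 3 GB.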

In [None]:
movies.loc[movies['id'] == 0]

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count


In [None]:
movies.shape

(19500, 24)

In [None]:
movies['id'].loc[~movies['id'].isin(new_df['id'])]

Series([], Name: id, dtype: int64)

In [None]:
new_df.shape

(45432, 8)

In [None]:
new_df = new_df.iloc[:19500,:].copy()  # copy so that new columns can be added safely

**Adding movie titles to the working dataframe**

In [None]:
movie_names = movies['title'].to_list()

In [None]:
new_df.shape

(19500, 8)

In [None]:
new_df.loc[new_df['movie_details'] == '']

Unnamed: 0,id,keywords,cast,crew,new_key,new_cast,new_crew,movie_details


In [None]:
new_df['title'] = movie_names
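
Assigning the list of titles works only if both dataframes list the same movies in the same row order. A more defensive option, sketched here on toy data (the frames `features` and `meta` are hypothetical stand-ins for new_df and movies, and the id column of the metadata is assumed to be read as text), is to join on id:

```python
import pandas as pd

features = pd.DataFrame(
    {"id": [862, 8844, 15602], "movie_details": ["a", "b", "c"]}
)
# Same movies in a different row order, with id stored as text.
meta = pd.DataFrame(
    {
        "id": ["15602", "862", "8844"],
        "title": ["Grumpier Old Men", "Toy Story", "Jumanji"],
    }
)

# Convert ids to integers; unparseable values become NaN and are dropped.
meta["id"] = pd.to_numeric(meta["id"], errors="coerce")
meta = meta.dropna(subset=["id"]).astype({"id": "int64"})

# Each title is matched to its features by id, not by row position.
with_titles = features.merge(meta[["id", "title"]], on="id", how="inner")
print(with_titles)
```

With this approach a shifted or missing row in either file cannot attach a title to the wrong movie.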

In [None]:
new_df.columns

Index(['id', 'keywords', 'cast', 'crew', 'new_key', 'new_cast', 'new_crew',
       'movie_details', 'title'],
      dtype='object')

In [None]:
new_df.drop(columns=['keywords', 'cast', 'crew', 'new_key', 'new_cast', 'new_crew'], inplace=True)

### Final Data

In [None]:
new_df.head()

Unnamed: 0,id,movie_details,title
0,862,"jealousy,toy,boy,friendship,friends,rivalry,bo...",Toy Story
1,8844,"board game,disappearance,based on children's b...",Jumanji
2,15602,"fishing,best friend,duringcreditsstinger,old m...",Grumpier Old Men
3,31357,"based on novel,interracial relationship,single...",Waiting to Exhale
4,11862,"baby,midlife crisis,confidence,aging,daughter,...",Father of the Bride Part II


## TF-IDF Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
tfidf = TfidfVectorizer(stop_words='english')

In [None]:
new_df.isnull().sum()

id               0
movie_details    0
title            0
dtype: int64

In [None]:
tfidf_matrix = tfidf.fit_transform(new_df['movie_details'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(19500, 44736)

In [None]:
# Import linear_kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

In [None]:
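# Compute the cosine similarity matrix. TfidfVectorizer L2-normalizes each row
# by default, so the plain dot product equals the cosine similarity.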
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
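
The reason a dot product can stand in for cosine similarity is that TfidfVectorizer scales each row to unit length by default (norm='l2'). A small NumPy sketch on random data, not on the movie matrix, illustrates the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((4, 6))  # four toy "documents" with six features each

# Scale every row to unit L2 norm, as the TF-IDF step does by default.
norms = np.linalg.norm(X, axis=1)
X_unit = X / norms[:, None]

# Dot product of the normalized rows.
dot = X_unit @ X_unit.T

# Cosine similarity computed from its definition on the raw rows.
cosine = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot, cosine))  # True
```

If the vectorizer were created with norm=None, the dot product would no longer be a cosine and the rows would need to be normalized first.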

In [None]:
indices = pd.Series(new_df.index, index=new_df['title']).drop_duplicates()

In [None]:
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Lightnin'                      19495
The Tall Man                   19496
Laurence Anyways               19497
The Magic of Belle Isle        19498
The Lie                        19499
Length: 19500, dtype: int64
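
Note that the length is still 19500: drop_duplicates on a Series compares the values (here the row positions, which are all unique), not the index labels. If two movies share a title, looking up that title returns several positions. A hedged sketch on toy titles shows the difference and one way to keep only the first occurrence of each title:

```python
import pandas as pd

titles = pd.Series(["Toy Story", "Hamlet", "Hamlet"], name="title")
indices = pd.Series(titles.index, index=titles)

# drop_duplicates looks at the values 0, 1, 2, so nothing is removed.
same = indices.drop_duplicates()

# Filtering on the index labels keeps the first row for each title.
first_only = indices[~indices.index.duplicated(keep="first")]

print(len(same), len(first_only))  # 3 2
print(first_only["Hamlet"])        # 1
```

Keeping the first occurrence is an arbitrary choice; including the release year in the lookup key would be another way to tell such movies apart.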

In [None]:
# Function that takes a movie title as input and returns recommendations
def content_recommender(title, cosine_sim=cosine_sim, df=new_df, indices=indices):

    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    # and convert them into a list of (index, score) tuples
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies. Skip the first entry,
    # which is normally the movie itself.
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return df['title'].iloc[movie_indices]
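
Skipping the first sorted entry assumes the queried movie always ranks first. That may not hold when another movie has identical details, or when a movie has no usable tokens and its similarity to itself is 0. As a sketch of an alternative (the helper `top_k_similar` is our own name), the query can be excluded by its index, and heapq.nlargest avoids sorting the whole row:

```python
import heapq


def top_k_similar(scores, query_idx, k=10):
    """Return the k highest (index, score) pairs, excluding query_idx."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != query_idx)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Toy similarity row: index 2 is an exact duplicate of the query at index 0.
row = [1.0, 0.2, 1.0, 0.7, 0.0]
print(top_k_similar(row, query_idx=0, k=2))  # [(2, 1.0), (3, 0.7)]
```

The returned indices could then be passed to `df['title'].iloc[...]` in the same way as in the function above.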

In [None]:
content_recommender('The Lion King')

8179                        Shark Tale
965                           Infinity
4808               The Final Countdown
4653     The French Lieutenant's Woman
13169                    Che: Part One
3042                          Magnolia
162         Die Hard: With a Vengeance
15064                        Razorback
7980                   Hometown Legend
15288                         Betrayal
Name: title, dtype: object

In [None]:
new_df['movie_details'].loc[new_df['title'] == 'Shark Tale'].values

array(['fish,hero,mission of murder,threat to death,secret love,animation,shark,woman director, will smith,robert de niro,renée zellweger,jack black, mark a. mangini,hans zimmer,richard l. anderson,michael j. wilson'],
      dtype=object)

In [None]:
new_df['movie_details'].loc[new_df['title'] == 'The Lion King'].values

array(['loss of parents,wild boar,uncle,shaman,redemption,king,scar,hyena,meerkat, jonathan taylor thomas,matthew broderick,james earl jones,jeremy irons, mark a. mangini,hans zimmer,richard l. anderson,john carnochan'],
      dtype=object)

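Here the two movies share three crew names (mark a. mangini, hans zimmer, and richard l. anderson), although their keywords and cast do not overlap. The recommendation is therefore driven by the shared crew, which shows that the system is matching on the metadata as intended.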

***