## MOVIE RECOMMENDATION SYSTEM

## PROJECT BACKGROUND
In the age of digital entertainment, users are often overwhelmed by the sheer volume of movie options available. A movie recommender system helps personalize the user experience by suggesting films tailored to individual preferences.

A content-based movie recommender system addresses this by analyzing the features of movies (such as genre, keywords, plot summaries, or cast) and recommending titles similar to those a user has previously enjoyed. This project focuses on building a personalized recommendation engine using content-based filtering techniques, with the aim of giving users relevant and meaningful suggestions.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## 1. DATA CLEANING AND PREPROCESSING

In [2]:
movies=pd.read_csv('tmdb_5000_movies.csv')
credits=pd.read_csv('tmdb_5000_credits.csv')

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


In [4]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [5]:
movies.shape

(4803, 20)

In [6]:
credits.shape

(4803, 4)

In [7]:
movies['title'].value_counts()

title
The Host                 2
Batman                   2
Out of the Blue          2
Avatar                   1
The Girl on the Train    1
                        ..
Step Up 3D               1
Secondhand Lions         1
The Age of Adaline       1
Drag Me to Hell          1
My Date with Drew        1
Name: count, Length: 4800, dtype: int64

In [8]:
credits['title'].value_counts()

title
The Host                 2
Batman                   2
Out of the Blue          2
Avatar                   1
The Girl on the Train    1
                        ..
Step Up 3D               1
Secondhand Lions         1
The Age of Adaline       1
Drag Me to Hell          1
My Date with Drew        1
Name: count, Length: 4800, dtype: int64

In [9]:
credits.duplicated().sum()

0

In [10]:
movies.isnull().sum()

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   3
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title                      0
vote_average               0
vote_count                 0
dtype: int64

In [11]:
credits.isnull().sum()

movie_id    0
title       0
cast        0
crew        0
dtype: int64

In [12]:
movies.merge(credits,on='title').shape

(4809, 23)
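
The merged frame has 4809 rows even though each input has 4803. This is consistent with the three titles that appear twice in both frames: each of them matches 2 × 2 = 4 times instead of 2, which adds 6 rows in total. The sketch below reproduces the effect on toy data (the frames `toy_movies` and `toy_credits` are invented for illustration) and shows that joining on the id columns avoids it, assuming `id` and `movie_id` hold the same identifier, as the Avatar rows above suggest.

```python
import pandas as pd

# Toy frames (hypothetical data): "Batman" appears twice in each, with different ids.
toy_movies = pd.DataFrame({"id": [1, 2, 3],
                           "title": ["Batman", "Batman", "Avatar"]})
toy_credits = pd.DataFrame({"movie_id": [1, 2, 3],
                            "title": ["Batman", "Batman", "Avatar"],
                            "cast": ["a", "b", "c"]})

# Joining on title pairs every "Batman" row with every other "Batman" row.
by_title = toy_movies.merge(toy_credits, on="title")

# Joining on the identifier keeps exactly one match per movie.
by_id = toy_movies.merge(toy_credits, left_on="id", right_on="movie_id",
                         suffixes=("", "_credits"))

print(len(by_title))  # 5 rows: 4 for "Batman" + 1 for "Avatar"
print(len(by_id))     # 3 rows
```

Whether to switch the notebook's merge key is a design choice; doing so would change the row counts shown in the cells that follow.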

In [13]:
movies_data=movies.merge(credits,how='inner',on='title')

In [14]:
movies_data.shape

(4809, 23)

In [15]:
movies_data.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,movie_id,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [16]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4809 non-null   int64  
 1   genres                4809 non-null   object 
 2   homepage              1713 non-null   object 
 3   id                    4809 non-null   int64  
 4   keywords              4809 non-null   object 
 5   original_language     4809 non-null   object 
 6   original_title        4809 non-null   object 
 7   overview              4806 non-null   object 
 8   popularity            4809 non-null   float64
 9   production_companies  4809 non-null   object 
 10  production_countries  4809 non-null   object 
 11  release_date          4808 non-null   object 
 12  revenue               4809 non-null   int64  
 13  runtime               4807 non-null   float64
 14  spoken_languages      4809 non-null   object 
 15  status               

## 2. CONTENT-BASED RECOMMENDATION SYSTEM

In [17]:
# Columns kept for the content-based model:
#genres
#id
#keywords
#title
#overview
#release_date
#cast
#crew

In [18]:
movies_data=movies_data[['id','title','release_date','genres','overview','keywords','cast','crew']]

In [19]:
movies_data.shape

(4809, 8)

In [20]:
movies_data.head()

Unnamed: 0,id,title,release_date,genres,overview,keywords,cast,crew
0,19995,Avatar,2009-12-10,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","In the 22nd century, a paraplegic Marine is di...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,2007-05-19,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","Captain Barbossa, long believed to be dead, ha...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,2015-10-26,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",A cryptic message from Bond’s past sends him o...,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,2012-07-16,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",Following the death of District Attorney Harve...,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,2012-03-07,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","John Carter is a war-weary, former military ca...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [21]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            4809 non-null   int64 
 1   title         4809 non-null   object
 2   release_date  4808 non-null   object
 3   genres        4809 non-null   object
 4   overview      4806 non-null   object
 5   keywords      4809 non-null   object
 6   cast          4809 non-null   object
 7   crew          4809 non-null   object
dtypes: int64(1), object(7)
memory usage: 300.7+ KB


In [22]:
movies_data.isnull().sum()

id              0
title           0
release_date    1
genres          0
overview        3
keywords        0
cast            0
crew            0
dtype: int64

In [23]:
# Drop the four rows with a missing release_date or overview.
movies_data=movies_data.dropna()

In [24]:
movies_data.shape

(4805, 8)

In [25]:
movies_data.duplicated().sum()

0

In [26]:
movies_data.iloc[0]['genres']
# Each value is a string holding a list of dictionaries; the goal is a plain list of genre names.

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [27]:
import ast
def convert(obj):
    # Parse the stringified list of dicts and keep only each entry's 'name'.
    l=[]
    for i in ast.literal_eval(obj):
        l.append(i['name'])
    return l

In [28]:
convert(movies_data.iloc[0]['genres'])

['Action', 'Adventure', 'Fantasy', 'Science Fiction']
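
The strings shown above are also valid JSON, so `json.loads` is an alternative to `ast.literal_eval`. A minimal sketch, assuming every value in the column is a JSON array of objects with a `name` key (the helper `names_from_json` is a hypothetical name, not part of the notebook):

```python
import json

def names_from_json(text):
    """Return the 'name' field of each object in a JSON array string."""
    return [item["name"] for item in json.loads(text)]

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(names_from_json(sample))  # ['Action', 'Adventure']
```

One practical difference: `json.loads` accepts JSON literals such as `null` and `true`, which `ast.literal_eval` rejects because they are not Python literals.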

In [29]:
movies_data['genres']=movies_data['genres'].apply(convert)

In [30]:
movies_data.iloc[0]['keywords']

'[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "space war"}, {"id": 3388, "name": "space colony"}, {"id": 3679, "name": "society"}, {"id": 3801, "name": "space travel"}, {"id": 9685, "name": "futuristic"}, {"id": 9840, "name": "romance"}, {"id": 9882, "name": "space"}, {"id": 9951, "name": "alien"}, {"id": 10148, "name": "tribe"}, {"id": 10158, "name": "alien planet"}, {"id": 10987, "name": "cgi"}, {"id": 11399, "name": "marine"}, {"id": 13065, "name": "soldier"}, {"id": 14643, "name": "battle"}, {"id": 14720, "name": "love affair"}, {"id": 165431, "name": "anti war"}, {"id": 193554, "name": "power relations"}, {"id": 206690, "name": "mind and soul"}, {"id": 209714, "name": "3d"}]'

In [31]:
movies_data['keywords']=movies_data['keywords'].apply(convert)

In [32]:
movies_data['cast'].iloc[0]

'[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_id": "52fe48009251416c750ac9cb", "gender": 1, "id": 8691, "name": "Zoe Saldana", "order": 1}, {"cast_id": 25, "character": "Dr. Grace Augustine", "credit_id": "52fe48009251416c750aca39", "gender": 1, "id": 10205, "name": "Sigourney Weaver", "order": 2}, {"cast_id": 4, "character": "Col. Quaritch", "credit_id": "52fe48009251416c750ac9cf", "gender": 2, "id": 32747, "name": "Stephen Lang", "order": 3}, {"cast_id": 5, "character": "Trudy Chacon", "credit_id": "52fe48009251416c750ac9d3", "gender": 1, "id": 17647, "name": "Michelle Rodriguez", "order": 4}, {"cast_id": 8, "character": "Selfridge", "credit_id": "52fe48009251416c750ac9e1", "gender": 2, "id": 1771, "name": "Giovanni Ribisi", "order": 5}, {"cast_id": 7, "character": "Norm Spellman", "credit_id": "52fe48009251416c750ac9dd", "gender": 

In [33]:
def convert2(obj):
    # Keep the names of the first three cast members, in the order they are listed.
    l=[]
    for i in ast.literal_eval(obj)[:3]:
        l.append(i['name'])
    return l

In [34]:
movies_data['cast'].apply(convert2)

0        [Sam Worthington, Zoe Saldana, Sigourney Weaver]
1           [Johnny Depp, Orlando Bloom, Keira Knightley]
2            [Daniel Craig, Christoph Waltz, Léa Seydoux]
3            [Christian Bale, Michael Caine, Gary Oldman]
4          [Taylor Kitsch, Lynn Collins, Samantha Morton]
                              ...                        
4804    [Carlos Gallardo, Jaime de Hoyos, Peter Marqua...
4805         [Edward Burns, Kerry Bishé, Marsha Dietlein]
4806           [Eric Mabius, Kristin Booth, Crystal Lowe]
4807            [Daniel Henney, Eliza Coupe, Bill Paxton]
4808    [Drew Barrymore, Brian Herzlinger, Corey Feldman]
Name: cast, Length: 4805, dtype: object
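
Taking the first three entries relies on the cast list already being stored in billing order. Each cast entry also carries an `order` field (visible in the raw output above), so a slightly more defensive variant can sort on it first. This is a sketch under that assumption; `top_cast` is a hypothetical helper and the sample string is invented.

```python
import json

def top_cast(text, n=3):
    """Return the names of the n cast members with the lowest 'order' value."""
    members = sorted(json.loads(text), key=lambda m: m["order"])
    return [m["name"] for m in members[:n]]

# Toy input, deliberately out of order.
sample = ('[{"name": "B", "order": 1}, {"name": "A", "order": 0}, '
          '{"name": "D", "order": 3}, {"name": "C", "order": 2}]')
print(top_cast(sample))  # ['A', 'B', 'C']
```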

In [35]:
movies_data['cast']=movies_data['cast'].apply(convert2)

In [36]:
movies_data['crew'].iloc[0]

'[{"credit_id": "52fe48009251416c750aca23", "department": "Editing", "gender": 0, "id": 1721, "job": "Editor", "name": "Stephen E. Rivkin"}, {"credit_id": "539c47ecc3a36810e3001f87", "department": "Art", "gender": 2, "id": 496, "job": "Production Design", "name": "Rick Carter"}, {"credit_id": "54491c89c3a3680fb4001cf7", "department": "Sound", "gender": 0, "id": 900, "job": "Sound Designer", "name": "Christopher Boyes"}, {"credit_id": "54491cb70e0a267480001bd0", "department": "Sound", "gender": 0, "id": 900, "job": "Supervising Sound Editor", "name": "Christopher Boyes"}, {"credit_id": "539c4a4cc3a36810c9002101", "department": "Production", "gender": 1, "id": 1262, "job": "Casting", "name": "Mali Finn"}, {"credit_id": "5544ee3b925141499f0008fc", "department": "Sound", "gender": 2, "id": 1729, "job": "Original Music Composer", "name": "James Horner"}, {"credit_id": "52fe48009251416c750ac9c3", "department": "Directing", "gender": 2, "id": 2710, "job": "Director", "name": "James Cameron"},

In [37]:
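def convert4(obj):
    # Return the first crew member whose job is 'Director' as a one-item list
    # (an empty list if no director is present).
    l=[]
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            l.append(i['name'])
            break
    return l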

In [38]:
movies_data['crew']=movies_data['crew'].apply(convert4)

In [39]:
movies_data['overview']=movies_data['overview'].apply(lambda x: x.split())

In [40]:
movies_data['release_date']=pd.to_datetime(movies_data['release_date'],errors='coerce')

In [41]:
movies_data

Unnamed: 0,id,title,release_date,genres,overview,keywords,cast,crew
0,19995,Avatar,2009-12-10,"[Action, Adventure, Fantasy, Science Fiction]","[In, the, 22nd, century,, a, paraplegic, Marin...","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]
1,285,Pirates of the Caribbean: At World's End,2007-05-19,"[Adventure, Fantasy, Action]","[Captain, Barbossa,, long, believed, to, be, d...","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",[Gore Verbinski]
2,206647,Spectre,2015-10-26,"[Action, Adventure, Crime]","[A, cryptic, message, from, Bond’s, past, send...","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",[Sam Mendes]
3,49026,The Dark Knight Rises,2012-07-16,"[Action, Crime, Drama, Thriller]","[Following, the, death, of, District, Attorney...","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",[Christopher Nolan]
4,49529,John Carter,2012-03-07,"[Action, Adventure, Science Fiction]","[John, Carter, is, a, war-weary,, former, mili...","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",[Andrew Stanton]
...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,1992-09-04,"[Action, Crime, Thriller]","[El, Mariachi, just, wants, to, play, his, gui...","[united states–mexico barrier, legs, arms, pap...","[Carlos Gallardo, Jaime de Hoyos, Peter Marqua...",[Robert Rodriguez]
4805,72766,Newlyweds,2011-12-26,"[Comedy, Romance]","[A, newlywed, couple's, honeymoon, is, upended...",[],"[Edward Burns, Kerry Bishé, Marsha Dietlein]",[Edward Burns]
4806,231617,"Signed, Sealed, Delivered",2013-10-13,"[Comedy, Drama, Romance, TV Movie]","[""Signed,, Sealed,, Delivered"", introduces, a,...","[date, love at first sight, narration, investi...","[Eric Mabius, Kristin Booth, Crystal Lowe]",[Scott Smith]
4807,126186,Shanghai Calling,2012-05-03,[],"[When, ambitious, New, York, attorney, Sam, is...",[],"[Daniel Henney, Eliza Coupe, Bill Paxton]",[Daniel Hsia]


In [42]:
movies_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4805 entries, 0 to 4808
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            4805 non-null   int64         
 1   title         4805 non-null   object        
 2   release_date  4805 non-null   datetime64[ns]
 3   genres        4805 non-null   object        
 4   overview      4805 non-null   object        
 5   keywords      4805 non-null   object        
 6   cast          4805 non-null   object        
 7   crew          4805 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(6)
memory usage: 337.9+ KB


In [43]:
movies_data['genres']=movies_data['genres'].apply(lambda x:[i.replace(" ","") for i in x ])
movies_data['keywords']=movies_data['keywords'].apply(lambda x:[i.replace(" ","") for i in x ])
movies_data['cast']=movies_data['cast'].apply(lambda x:[i.replace(" ","") for i in x ])
movies_data['crew']=movies_data['crew'].apply(lambda x:[i.replace(" ","") for i in x ])
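
A plausible reason for removing the spaces is that the tags are later split into single-word tokens: without this step, two different people who share a first name would contribute a common token. The sketch below illustrates that effect on two names from the data; the `tokens` helper is hypothetical and only mimics whitespace tokenization.

```python
def tokens(names, collapse):
    """Lower-cased whitespace tokens for a list of names, optionally removing inner spaces first."""
    out = set()
    for name in names:
        if collapse:
            name = name.replace(" ", "")
        out.update(name.lower().split())
    return out

a = ["Sam Worthington"]
b = ["Sam Mendes"]

print(tokens(a, collapse=False) & tokens(b, collapse=False))  # {'sam'}: a spurious overlap
print(tokens(a, collapse=True) & tokens(b, collapse=True))    # set(): no shared token
```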

In [44]:
movies_data

Unnamed: 0,id,title,release_date,genres,overview,keywords,cast,crew
0,19995,Avatar,2009-12-10,"[Action, Adventure, Fantasy, ScienceFiction]","[In, the, 22nd, century,, a, paraplegic, Marin...","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]
1,285,Pirates of the Caribbean: At World's End,2007-05-19,"[Adventure, Fantasy, Action]","[Captain, Barbossa,, long, believed, to, be, d...","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski]
2,206647,Spectre,2015-10-26,"[Action, Adventure, Crime]","[A, cryptic, message, from, Bond’s, past, send...","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes]
3,49026,The Dark Knight Rises,2012-07-16,"[Action, Crime, Drama, Thriller]","[Following, the, death, of, District, Attorney...","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan]
4,49529,John Carter,2012-03-07,"[Action, Adventure, ScienceFiction]","[John, Carter, is, a, war-weary,, former, mili...","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton]
...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,1992-09-04,"[Action, Crime, Thriller]","[El, Mariachi, just, wants, to, play, his, gui...","[unitedstates–mexicobarrier, legs, arms, paper...","[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",[RobertRodriguez]
4805,72766,Newlyweds,2011-12-26,"[Comedy, Romance]","[A, newlywed, couple's, honeymoon, is, upended...",[],"[EdwardBurns, KerryBishé, MarshaDietlein]",[EdwardBurns]
4806,231617,"Signed, Sealed, Delivered",2013-10-13,"[Comedy, Drama, Romance, TVMovie]","[""Signed,, Sealed,, Delivered"", introduces, a,...","[date, loveatfirstsight, narration, investigat...","[EricMabius, KristinBooth, CrystalLowe]",[ScottSmith]
4807,126186,Shanghai Calling,2012-05-03,[],"[When, ambitious, New, York, attorney, Sam, is...",[],"[DanielHenney, ElizaCoupe, BillPaxton]",[DanielHsia]


In [45]:
movies_data['tag']=movies_data['overview']+movies_data['genres']+movies_data['keywords']+movies_data['cast']+movies_data['crew']

In [46]:
movies_data

Unnamed: 0,id,title,release_date,genres,overview,keywords,cast,crew,tag
0,19995,Avatar,2009-12-10,"[Action, Adventure, Fantasy, ScienceFiction]","[In, the, 22nd, century,, a, paraplegic, Marin...","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
1,285,Pirates of the Caribbean: At World's End,2007-05-19,"[Adventure, Fantasy, Action]","[Captain, Barbossa,, long, believed, to, be, d...","[ocean, drugabuse, exoticisland, eastindiatrad...","[JohnnyDepp, OrlandoBloom, KeiraKnightley]",[GoreVerbinski],"[Captain, Barbossa,, long, believed, to, be, d..."
2,206647,Spectre,2015-10-26,"[Action, Adventure, Crime]","[A, cryptic, message, from, Bond’s, past, send...","[spy, basedonnovel, secretagent, sequel, mi6, ...","[DanielCraig, ChristophWaltz, LéaSeydoux]",[SamMendes],"[A, cryptic, message, from, Bond’s, past, send..."
3,49026,The Dark Knight Rises,2012-07-16,"[Action, Crime, Drama, Thriller]","[Following, the, death, of, District, Attorney...","[dccomics, crimefighter, terrorist, secretiden...","[ChristianBale, MichaelCaine, GaryOldman]",[ChristopherNolan],"[Following, the, death, of, District, Attorney..."
4,49529,John Carter,2012-03-07,"[Action, Adventure, ScienceFiction]","[John, Carter, is, a, war-weary,, former, mili...","[basedonnovel, mars, medallion, spacetravel, p...","[TaylorKitsch, LynnCollins, SamanthaMorton]",[AndrewStanton],"[John, Carter, is, a, war-weary,, former, mili..."
...,...,...,...,...,...,...,...,...,...
4804,9367,El Mariachi,1992-09-04,"[Action, Crime, Thriller]","[El, Mariachi, just, wants, to, play, his, gui...","[unitedstates–mexicobarrier, legs, arms, paper...","[CarlosGallardo, JaimedeHoyos, PeterMarquardt]",[RobertRodriguez],"[El, Mariachi, just, wants, to, play, his, gui..."
4805,72766,Newlyweds,2011-12-26,"[Comedy, Romance]","[A, newlywed, couple's, honeymoon, is, upended...",[],"[EdwardBurns, KerryBishé, MarshaDietlein]",[EdwardBurns],"[A, newlywed, couple's, honeymoon, is, upended..."
4806,231617,"Signed, Sealed, Delivered",2013-10-13,"[Comedy, Drama, Romance, TVMovie]","[""Signed,, Sealed,, Delivered"", introduces, a,...","[date, loveatfirstsight, narration, investigat...","[EricMabius, KristinBooth, CrystalLowe]",[ScottSmith],"[""Signed,, Sealed,, Delivered"", introduces, a,..."
4807,126186,Shanghai Calling,2012-05-03,[],"[When, ambitious, New, York, attorney, Sam, is...",[],"[DanielHenney, ElizaCoupe, BillPaxton]",[DanielHsia],"[When, ambitious, New, York, attorney, Sam, is..."


In [47]:
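# .copy() makes new_df independent of movies_data, so later column assignments act on new_df itself.
new_df=movies_data[['id','title','release_date','tag']].copy()
new_df.info()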

<class 'pandas.core.frame.DataFrame'>
Index: 4805 entries, 0 to 4808
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            4805 non-null   int64         
 1   title         4805 non-null   object        
 2   release_date  4805 non-null   datetime64[ns]
 3   tag           4805 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 187.7+ KB


In [48]:
new_df['tag']=new_df['tag'].apply(lambda x:" ".join(x))

In [49]:
new_df['tag'][0]

'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization. Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron'

In [50]:
new_df['tag']=new_df['tag'].apply(lambda x:x.lower())

In [51]:
new_df.head()

Unnamed: 0,id,title,release_date,tag
0,19995,Avatar,2009-12-10,"in the 22nd century, a paraplegic marine is di..."
1,285,Pirates of the Caribbean: At World's End,2007-05-19,"captain barbossa, long believed to be dead, ha..."
2,206647,Spectre,2015-10-26,a cryptic message from bond’s past sends him o...
3,49026,The Dark Knight Rises,2012-07-16,following the death of district attorney harve...
4,49529,John Carter,2012-03-07,"john carter is a war-weary, former military ca..."


In [52]:
import nltk

In [53]:
# Stemming reduces inflected word forms to a common root so they count as the same token.
from nltk.stem.porter import PorterStemmer
ps= PorterStemmer()

In [54]:
def stem(text):
    # Stem each whitespace-separated token, then rejoin into a single string.
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

In [55]:
new_df['tag']=new_df['tag'].apply(stem)
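
To see what stemming does to a tag string, the following small sketch runs the Porter stemmer on three inflected forms of one word. It assumes `nltk` is installed; `stem_text` is a hypothetical stand-in that behaves like the function applied above.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated token and rejoin with single spaces."""
    return " ".join(ps.stem(word) for word in text.split())

# All three tokens reduce to the same stem, so they land in the same vocabulary column.
print(stem_text("loved loving loves"))
```

Because the stemmer works on whitespace tokens, punctuation attached to a word (for example a trailing comma) stays part of the token at this stage.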

In [56]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

In [57]:
vectors=cv.fit_transform(new_df['tag']).toarray()

In [58]:
vectors.shape

(4805, 5000)
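
Each row of this matrix is a bag-of-words count vector: one column per vocabulary term, holding how often that term occurs in the movie's tag. The idea can be sketched with the standard library alone (toy documents, no stop-word removal or vocabulary cap, so this is an illustration rather than a reimplementation of `CountVectorizer`):

```python
from collections import Counter

docs = ["space alien battle", "alien planet alien", "spy secret agent"]

# Vocabulary: every distinct token, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary term.
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)      # ['agent', 'alien', 'battle', 'planet', 'secret', 'space', 'spy']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 0, 1, 0]
# [0, 2, 0, 1, 0, 0, 0]
# [1, 0, 0, 0, 1, 0, 1]
```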

In [59]:
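vectors
# Count matrix: one row per movie, one column per vocabulary term.
# Most entries are zero, but .toarray() stores it as a dense NumPy array, not a sparse matrix.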

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
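
A dense 4805 × 5000 array of `int64` takes 4805 × 5000 × 8 bytes, roughly 192 MB, although almost all entries are zero. `CountVectorizer.fit_transform` returns a SciPy sparse matrix, and `cosine_similarity` accepts sparse input, so the `.toarray()` step is optional. The sketch below checks on toy documents that both routes give the same similarities; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["space alien battle", "alien planet tribe", "spy secret agent"]

cv_demo = CountVectorizer()
sparse_counts = cv_demo.fit_transform(docs)   # SciPy sparse matrix
dense_counts = sparse_counts.toarray()        # equivalent dense NumPy array

sim_sparse = cosine_similarity(sparse_counts)
sim_dense = cosine_similarity(dense_counts)

print(np.allclose(sim_sparse, sim_dense))  # True
```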

In [60]:
vectors[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [61]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

In [62]:
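# Cosine similarity compares the angle between two vectors rather than their length,
# which tends to be more reliable than Euclidean distance in high dimensions.
# The smaller the angular distance between two movies, the higher their similarity.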
from sklearn.metrics.pairwise import cosine_similarity
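
For two count vectors u and v, cosine similarity is the dot product divided by the product of the vector lengths: cos(u, v) = (u · v) / (‖u‖ ‖v‖). A plain-Python sketch of that formula follows; the `cosine` helper is illustrative, and returning 0.0 for an all-zero vector is a convention chosen here for the sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: an empty vector matches nothing
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 1, 0], [2, 2, 0])   # close to 1: same direction, different length
no_overlap = cosine([1, 0, 0], [0, 1, 0])       # 0.0: no shared terms
print(same_direction, no_overlap)
```

The first pair shows why the measure suits count data: a tag that repeats the same words twice as often still points in the same direction.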

In [63]:
similarity=cosine_similarity(vectors)

In [64]:
similarity.shape

(4805, 4805)

In [65]:
similarity

array([[1.        , 0.08346223, 0.0860309 , ..., 0.04499213, 0.        ,
        0.        ],
       [0.08346223, 1.        , 0.06063391, ..., 0.02378257, 0.        ,
        0.02615329],
       [0.0860309 , 0.06063391, 1.        , ..., 0.02451452, 0.        ,
        0.        ],
       ...,
       [0.04499213, 0.02378257, 0.02451452, ..., 1.        , 0.03962144,
        0.04229549],
       [0.        , 0.        , 0.        , ..., 0.03962144, 1.        ,
        0.08714204],
       [0.        , 0.02615329, 0.        , ..., 0.04229549, 0.08714204,
        1.        ]])

In [66]:
similarity[0].shape

(4805,)

In [67]:
# Rank every movie by its similarity to movie 0 (Avatar); the first entry is the movie itself.
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])

[(0, 1.0000000000000002),
 (1214, 0.28676966733820225),
 (2405, 0.26901379342448517),
 (3728, 0.2605130246476754),
 (507, 0.255608593705383),
 (539, 0.2503866978335957),
 (582, 0.24511108480187255),
 (1202, 0.24455799402225925),
 (1192, 0.2367785320221084),
 (61, 0.23179316248638274),
 (778, 0.23174488732966075),
 (4046, 0.2278389747471728),
 (1916, 0.22528177844479153),
 (2782, 0.21853668936906193),
 (172, 0.21239769762143662),
 (972, 0.2108663315950723),
 (322, 0.2105263157894737),
 (2329, 0.20443988269091456),
 (3606, 0.20437977982832192),
 (260, 0.20395079136182276),
 (151, 0.2029530274475215),
 (4190, 0.2029530274475215),
 (1440, 0.20277677641345318),
 (74, 0.20246457179963143),
 (1087, 0.2020475485519274),
 (3671, 0.1979082783981174),
 (973, 0.19767387315371682),
 (577, 0.1976738731537168),
 (47, 0.19672236884115843),
 (2969, 0.19252140716412977),
 (942, 0.19134594929397597),
 (495, 0.19088542889273336),
 (1199, 0.19088542889273336),
 (305, 0.1900748713929803),
 (4, 0.18929940971
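
Sorting the whole row works, but only the first few entries are needed. One alternative, sketched here on a toy score list, is `heapq.nlargest`, which selects the top k pairs without fully sorting the rest. The sketch assumes the query movie has the highest score in its own row, so it comes first and can be dropped.

```python
import heapq

scores = [1.0, 0.2, 0.7, 0.1, 0.5]   # toy similarity row; index 0 is the query movie itself

# Ask for one extra result, then drop the first entry (the movie compared with itself).
top = heapq.nlargest(3, enumerate(scores), key=lambda pair: pair[1])[1:]
print(top)  # [(2, 0.7), (4, 0.5)]
```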

In [68]:
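def recommend(movie):
    # Use the row position, not the index label: rows were dropped earlier,
    # so labels no longer line up with the rows of `similarity`.
    matches=np.flatnonzero((new_df['title']==movie).to_numpy())
    if len(matches)==0:
        print(f"'{movie}' not found in the dataset")
        return
    idx=matches[0]
    distances=similarity[idx]
    # Sort by similarity score, skip the movie itself, keep the next five.
    movies_list=sorted(list(enumerate(distances)),reverse=True,key=lambda x:x[1])[1:6]

    for i in movies_list:
        print(new_df.iloc[i[0]]['title'])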

In [69]:
recommend('Avatar')

Aliens vs Predator: Requiem
Aliens
Falcon Rising
Independence Day
Titan A.E.


In [70]:
recommend('Batman Begins')

The Dark Knight
Batman
Batman
The Dark Knight Rises
10th & Wolf
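
One detail worth checking in any lookup against `similarity`: the matrix is indexed by row position, while `new_df` keeps its original index labels after rows were dropped (its index runs from 0 to 4808 over 4805 rows). For movies located after a dropped row, label and position differ. A small sketch on toy data shows the gap:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data); dropping the row with a missing overview leaves labels 0, 2, 3.
toy = pd.DataFrame({"title": ["A", "B", "C", "D"],
                    "overview": ["x", None, "y", "z"]}).dropna()

label = toy[toy["title"] == "D"].index[0]                               # index label
position = int(np.flatnonzero((toy["title"] == "D").to_numpy())[0])     # row position

print(label, position)               # 3 2
print(toy.iloc[position]["title"])   # D
```

Calling `reset_index(drop=True)` right after dropping rows is another way to make labels and positions coincide.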


## 3. PICKLING

In [211]:
import pickle

In [217]:
with open('movies_dict1.pkl','wb') as f:
    pickle.dump(new_df.to_dict(),f)

In [218]:
with open('similarity.pkl','wb') as f:
    pickle.dump(similarity,f)
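
The saved files can be read back with `pickle.load`. The round trip below is a minimal sketch that uses a temporary file and a toy dictionary (both invented for illustration) rather than the notebook's actual artifacts.

```python
import os
import pickle
import tempfile

# Toy payload shaped like DataFrame.to_dict() output: column -> {index: value}.
payload = {"title": {0: "Avatar", 1: "Spectre"}}

path = os.path.join(tempfile.mkdtemp(), "movies_dict_demo.pkl")

# The context manager closes the file even if dump/load raises.
with open(path, "wb") as f:
    pickle.dump(payload, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == payload)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.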