<a href="https://colab.research.google.com/github/ajaythakur3369/DIGIBHEM/blob/main/Task_2_Movie_Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project - Movie Recommendation System**
## **Developed By - Ajay Thakur (2016kuec2026@iiitkota.ac.in)**
## **Branch Name - Electronics and Communication Engineering**
## **Institute Name - Indian Institute of Information Technology Kota (An Institute of National Importance under an Act of Parliament)**
## **Submitted To - Digital Bhem**
## **Project Link (GitHub) - [Click here](https://github.com/ajaythakur3369/DIGIBHEM/blob/main/Task_2_Movie_Recommendation_System.ipynb)**

## **Project Link (Drive) - [Click here](https://colab.research.google.com/drive/1uubLLg7K2dcyYlzx_Xxx3G3pNQTjnubc)**

# **Problem Statement -**

**To build a system that recommends movies to users based on their past
viewing history and ratings using a variety of algorithms, such as collaborative filtering or content-based filtering algorithms.**

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

**Load Data**

In [3]:
# Access the Drive from Colab, for accessing the file
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [4]:
df_links = pd.read_csv('/content/drive/MyDrive/Folders/links.csv')
df_movies = pd.read_csv('/content/drive/MyDrive/Folders/movies.csv')
df_ratings = pd.read_csv('/content/drive/MyDrive/Folders/ratings.csv')

In [5]:
# This dataset contains the Id's of the movies
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [6]:
df_links.shape

(45843, 3)

**Observation of df_links dataset:- This contains the Id's of more than 60 thousands movies**

In [7]:
# This dataset has title and genres of the movies
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [8]:
df_movies.shape

(27278, 3)

**Observations: df_links dataset and df_movies dataset have same number of rows 62423**

**Merging the df_links dataset and df_movies dataset on movieId**

In [9]:
df = df_movies.merge(df_links, on = 'movieId')
df

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0
...,...,...,...,...,...
27246,131254,Kein Bund für's Leben (2007),Comedy,466713,4436.0
27247,131256,"Feuer, Eis & Dosenbier (2002)",Comedy,277703,9274.0
27248,131258,The Pirates (2014),Adventure,3485166,285213.0
27249,131260,Rentun Ruusu (2001),(no genres listed),249110,32099.0


In [10]:
# We need only 'imdbId' instead of 'tmdbId' so drop tmdbId
# Run once otherwise it will show errors
df.drop('tmdbId', axis = 1, inplace = True)
df.head()

Unnamed: 0,movieId,title,genres,imdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497
2,3,Grumpier Old Men (1995),Comedy|Romance,113228
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885
4,5,Father of the Bride Part II (1995),Comedy,113041


In [11]:
# This dataset has 'rating' of the user on the movie
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [12]:
df_ratings.shape

(26024289, 4)

**Observation of df_ratings dataset:- This datasets has 25 millions ratings of different users on the different movies**

In [13]:
# Droping the 'timestamp' column because we don't need of that feature
df_ratings.drop('timestamp', axis = 1, inplace = True)

**Droping the followings:-**
Drop those users that has given less than 100 ratings on the movies and
Drop those movies that has given less than 100 ratings by the users

In [14]:
df_rating_ge_100 = df_ratings.groupby('userId')['rating'].count()>100
df_rating_ge_100 = df_rating_ge_100.reset_index().rename(columns = {'rating': 'Count>100'})

In [15]:
# Uncomment the following one by one to see the intermediate ste
# df_ratings.groupby('userId')['rating'].count()
# df_ratings.groupby('userId')['rating'].count()>100
# df_rating_ge_100
# df_rating_ge_100['Count>100'].sum()

In [16]:
# Extracting those userId's that has given more than 100 ratings
df_movie_rating_userId = df_rating_ge_100[df_rating_ge_100['Count>100'] == True]['userId'].reset_index().drop('index', axis = 1)
df_movie_rating_userId

Unnamed: 0,userId
0,8
1,11
2,12
3,15
4,16
...,...
63572,270885
63573,270887
63574,270893
63575,270894


In [17]:
# Now, merging these extracted userId's with the df_ratings
df_movie_rating_userId = df_movie_rating_userId.merge(df_ratings, on = 'userId')

In [18]:
df_movie_rating_userId

Unnamed: 0,userId,movieId,rating
0,8,1,4.0
1,8,7,2.0
2,8,44,3.0
3,8,47,4.0
4,8,170,3.0
...,...,...,...
20026646,270896,58559,5.0
20026647,270896,60069,5.0
20026648,270896,63082,4.5
20026649,270896,64957,4.5


In [19]:
# Taking the movieId's that has more than 100 rating counts
df_movie_rating_ge_100 = df_movie_rating_userId.groupby('movieId')['rating'].count()>100

In [20]:
df_movie_rating_ge_100 = df_movie_rating_ge_100.reset_index().rename(columns = {'rating': 'Count>100'})

In [21]:
# Uncomment the following one by one to see the intermediate steps
# df_movie_rating_userId.groupby('movieId')['rating'].count()
# df_movie_rating_ge_100
# df_movie_rating_ge_100.sum()

In [22]:
df_movie_rating_movieId = df_movie_rating_ge_100[df_movie_rating_ge_100['Count>100'] == True]['movieId'].reset_index().drop('index', axis = 1)

In [23]:
df_movie_rating_movieId

Unnamed: 0,movieId
0,1
1,2
2,3
3,4
4,5
...,...
9647,170875
9648,171763
9649,171765
9650,173145


**Note:- Here dataset df_movie_rating_movieId, is the extracted both (by userId and movieId), because the extraction (movie rating count>100) is applied to the extracted dataset (userId rating count>100)**

In [24]:
# Now, merging these extracted movieId's with the extracted userId's  (Here df_movie_rating_movieId having both extractions)
df_movie_rating_userId.merge(df_movie_rating_movieId, on = 'movieId')

Unnamed: 0,userId,movieId,rating
0,8,1,4.0
1,12,1,4.0
2,20,1,4.0
3,24,1,4.0
4,34,1,3.0
...,...,...,...
19576828,259984,139120,5.0
19576829,260843,139120,4.0
19576830,263655,139120,3.5
19576831,265140,139120,4.5


In [25]:
df_rate = df_movie_rating_userId.merge(df_movie_rating_movieId, on = 'movieId')

In [26]:
# Taking average ratings of each movies
df_average_rate = df_rate.groupby('movieId')['rating'].mean().reset_index().rename(columns = {'rating': 'AverageRate'})
df_average_rate

Unnamed: 0,movieId,AverageRate
0,1,3.866701
1,2,3.138467
2,3,3.027735
3,4,2.775529
4,5,2.930767
...,...,...
9647,170875,3.119910
9648,171763,3.986056
9649,171765,3.610656
9650,173145,3.570093


In [27]:
# Now merging the datasets movies (df) with the extracted dataset df_movie_rating_movieId
df.merge(df_movie_rating_movieId, on = 'movieId')

Unnamed: 0,movieId,title,genres,imdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497
2,3,Grumpier Old Men (1995),Comedy|Romance,113228
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885
4,5,Father of the Bride Part II (1995),Comedy,113041
...,...,...,...,...
9286,130578,The Gunman (2015),Action|Thriller,2515034
9287,130840,Spring (2015),Horror|Romance|Sci-Fi,3395184
9288,130970,George Carlin: Life Is Worth Losing (2005),Comedy,484855
9289,131013,Get Hard (2015),Comedy,2561572


In [28]:
df_rm = df.merge(df_movie_rating_movieId, on = 'movieId')

In [29]:
# Now taking out the imdbId
imdbID = df_rm['imdbId'].reset_index().drop('index', axis = 1)

In [30]:
imdbID

Unnamed: 0,imdbId
0,114709
1,113497
2,113228
3,114885
4,113041
...,...
9286,2515034
9287,3395184
9288,484855
9289,2561572


**We need more feature for the recommendation system (e.g. cast, crew, plot, etc.)**

So imdbId have been taken from the dataset, to fetch the features with the help of API

In [31]:
# Uncomment this for downloading the csv file imdbID, so that it fetch the attributes from API (I used google colab to fetch the features)
# compression_opts = dict(method = 'zip', archive_name = 'out_id.csv')
# imdbID.to_csv('out_id.zip', compression = compression_opts)

In [32]:
df_rm[df_rm['imdbId'] == 118114]

Unnamed: 0,movieId,title,genres,imdbId
627,720,Wallace & Gromit: The Best of Aardman Animatio...,Adventure|Animation|Comedy,118114


In [33]:
# Droping because we are not able to fetch the attributes via API
df_rm.drop(622, inplace = True)

In [34]:
# Droping because we are not able to fetch the attributes via API
df_rm.drop(623, inplace = True)

In [35]:
df_rm[df_rm['imdbId'] == 118114]

Unnamed: 0,movieId,title,genres,imdbId
627,720,Wallace & Gromit: The Best of Aardman Animatio...,Adventure|Animation|Comedy,118114


In [36]:
df1 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first.csv')
df3 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first100_200.csv')
df4 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first200_300.csv')
df2 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first50_100.csv')
df5 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first300_400.csv')
df6 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first400_600.csv')
df7 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first600_622.csv')
df8 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first624_900.csv')
df9 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first900_1200.csv')
df10 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first1200_1500.csv')
df11 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first1500_1700.csv')
df12 = pd.read_csv('/content/drive/MyDrive/Folders/dt1700_2000.csv')
df13 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first2000_2300.csv')
df14 = pd.read_csv('/content/drive/MyDrive/Folders/dt_first2300_2500.csv')
df15 = pd.read_csv('/content/drive/MyDrive/Folders/dt2500_2800.csv')
df16 = pd.read_csv('/content/drive/MyDrive/Folders/dt2800_3000.csv')
df17 = pd.read_csv('/content/drive/MyDrive/Folders/dt3000_3300.csv')
df18 = pd.read_csv('/content/drive/MyDrive/Folders/dt3300_3500.csv')
df19 = pd.read_csv('/content/drive/MyDrive/Folders/dt3500_3700.csv')
df20 = pd.read_csv('/content/drive/MyDrive/Folders/dt3700_4000.csv')
df21 = pd.read_csv('/content/drive/MyDrive/Folders/dt4000_4500.csv')
df22 = pd.read_csv('/content/drive/MyDrive/Folders/dt4500_5000.csv')
df23 = pd.read_csv('/content/drive/MyDrive/Folders/dt5000_5500.csv')
df24 = pd.read_csv('/content/drive/MyDrive/Folders/dt5500_6000.csv')
df25 = pd.read_csv('/content/drive/MyDrive/Folders/dt6000_7000.csv')
df26 = pd.read_csv('/content/drive/MyDrive/Folders/dt7000_8000.csv')
df27 = pd.read_csv('/content/drive/MyDrive/Folders/dt8000_8300.csv')
df28 = pd.read_csv('/content/drive/MyDrive/Folders/dt8300_9000.csv')
df29 = pd.read_csv('/content/drive/MyDrive/Folders/dt9000_10068.csv')

**The above datasets, from df1 to df29 have come from the fetching the features using the API.**

**Here there are huge number of csv files, the reason is when I had put whole the data (imdbId) to fetch the attributes then some kinds of error happened due to the API and the environment issues.**

**These csv files can be combined to a single csv file, but I put these as it is, to explain the problems that I had faced on fetching the features.**

In [37]:
# Observing the one of the datasets
df13.head(10)

Unnamed: 0,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,92112,Hard rock idol Sammi Curr burns to death in a ...,"['Marc Price', 'Tony Fields', 'Lisa Orgolini',...","['Devorah S.F. Arlook', 'Page Lynn Campbell', ...",['Charles Martin Smith'],['United States'],['English'],['De Laurentiis Entertainment Group (DEG)']
1,90917,Paul Conway and his mother Jeannie Conway trav...,"['Matthew Labyorteaux', 'Kristy Swanson', 'Mic...","['Ruth Ellen Daniels', 'William H. Faeth', 'Pa...",['Wes Craven'],['United States'],['English'],"['Warner Bros.', 'Pan Arts', 'Layton']"
2,91605,14. century Franciscan monk William of Baskerv...,"['Sean Connery', 'Christian Slater', 'Helmut Q...","['Frans J. Afman', 'Daniele Alexandre-Bidon', ...",['Jean-Jacques Annaud'],"['West Germany', 'Italy', 'France']","['English', 'Italian', 'Latin', 'German']","['Constantin Film', 'Cristaldifilm', 'Les Film..."
3,91306,"Terry works for a bank, and uses computers to ...","['Whoopi Goldberg', 'Stephen Collins', 'John W...","['Anthony Blake Brand', 'Patti Calhoun', 'Jane...",['Penny Marshall'],['United States'],['English'],"['Lawrence Gordon Productions', 'Silver Pictur..."
4,91738,Peggy Sue Bodell (Turner) attends her 25-year ...,"['Kathleen Turner', 'Nicolas Cage', 'Barry Mil...","['Toni Basil', 'Chrissy Bocchino', 'Michael Cr...",['Francis Ford Coppola'],['United States'],['English'],"['TriStar Pictures', 'Rastar Pictures', 'Zoetr..."
5,90555,Intrigued by the near-death experience of the ...,"['Paul Hogan', 'Linda Kozlowski', 'John Meillo...","['Doon Allen', 'Dale Aspin', 'Victor Atkinson'...",['Peter Faiman'],['Australia'],['English'],['Rimfire Films']
6,92493,One short year after the life-changing adventu...,"['Paul Hogan', 'Linda Kozlowski', 'John Meillo...","['Jeffrey Bacon', 'Angus Banks-Stewart', 'Jeff...",['John Cornell'],['Australia'],"['English', 'Spanish', 'Japanese']","['Paramount Pictures', 'Rimfire Films']"
7,92105,Harry and Archie are released from prison read...,"['Burt Lancaster', 'Kirk Douglas', 'Charles Du...","['Brad Dechter', 'Kirk Douglas', 'Debra Ginsbe...",['Jeff Kanew'],['United States'],['English'],"['Touchstone Pictures', 'Silver Screen Partner..."
8,91991,"Sadly, all good things must come to an end, an...","['C. Thomas Howell', 'Rae Dawn Chong', 'Arye G...","['Rocky Alan', 'Lori A. Balton', 'Stan Cohen',...",['Steve Miner'],['United States'],['English'],"['Balcor Film Investors', 'Steve Tisch Company']"
9,90863,Pool hustler Fast Eddie Felson finds the young...,"['Paul Newman', 'Tom Cruise', 'Mary Elizabeth ...","['Kathy Anderson', 'William Arnold', 'Karen Br...",['Martin Scorsese'],['United States'],"['English', 'Spanish']","['Touchstone Pictures', 'Silver Screen Partner..."


In [38]:
# Concatenating these datasets into one dataset
result = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16, df17, df18, df19, df20, df21, df22, df23, df24, df25, df26, df27, df28, df29], axis = 0)

In [39]:
result = result.reset_index().drop('index', axis = 1)
result.head(5)

Unnamed: 0,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,114709,A little boy named Andy loves to be in his roo...,"['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...","['April', 'Ronen Barzel', 'Susan Bradley', 'An...",['John Lasseter'],"['United States', 'Japan']",['English'],"['Walt Disney Pictures', 'Pixar Animation Stud..."
1,113497,"Jumanji, one of the most unique--and dangerous...","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['Mary Arnold', 'Carra Braveman', 'Adam Bryant...",['Joe Johnston'],['United States'],"['English', 'French']","['TriStar Pictures', 'Interscope Communication..."
2,113228,Things don't seem to change much in Wabasha Co...,"['Walter Matthau', 'Jack Lemmon', 'Sophia Lore...","['David Bifano', 'Brooke Brooks', 'Jill Carval...",['Howard Deutch'],['United States'],"['English', 'Italian', 'German']","['Lancaster Gate', 'Warner Bros.']"
3,114885,This story based on the best selling novel by ...,"['Whitney Houston', 'Angela Bassett', 'Loretta...","['R. Elizabeth Aaron', 'Laurie Badami', 'Ryan ...",['Forest Whitaker'],['United States'],['English'],['Twentieth Century Fox']
4,113041,"In this sequel to ""Father of the Bride"", Georg...","['Steve Martin', 'Diane Keaton', 'Martin Short...","['Dave Ange', 'Roshanna Baron', 'Ronald Castro...",['Charles Shyer'],['United States'],['English'],"['Touchstone Pictures', 'Sandollar Productions..."


In [40]:
# Since, all the datasets have been combined to single dataset
# So, downloading the csv file with name movie_attribute.csv
compression_opts = dict(method = 'zip', archive_name = 'movie_attribute.csv')
result.to_csv('movie_attribute.zip', compression = compression_opts)

**Merging the dataset df_rm (title, genres, imdbId) with the dataset result ( imdbId, plot, cast, crew, director, countries, languages, production_companies)**

In [41]:
df_rec = df_rm.merge(result, on = 'imdbId')

In [42]:
df_rec.head()

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,A little boy named Andy loves to be in his roo...,"['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...","['April', 'Ronen Barzel', 'Susan Bradley', 'An...",['John Lasseter'],"['United States', 'Japan']",['English'],"['Walt Disney Pictures', 'Pixar Animation Stud..."
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,"Jumanji, one of the most unique--and dangerous...","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['Mary Arnold', 'Carra Braveman', 'Adam Bryant...",['Joe Johnston'],['United States'],"['English', 'French']","['TriStar Pictures', 'Interscope Communication..."
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,Things don't seem to change much in Wabasha Co...,"['Walter Matthau', 'Jack Lemmon', 'Sophia Lore...","['David Bifano', 'Brooke Brooks', 'Jill Carval...",['Howard Deutch'],['United States'],"['English', 'Italian', 'German']","['Lancaster Gate', 'Warner Bros.']"
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,This story based on the best selling novel by ...,"['Whitney Houston', 'Angela Bassett', 'Loretta...","['R. Elizabeth Aaron', 'Laurie Badami', 'Ryan ...",['Forest Whitaker'],['United States'],['English'],['Twentieth Century Fox']
4,5,Father of the Bride Part II (1995),Comedy,113041,"In this sequel to ""Father of the Bride"", Georg...","['Steve Martin', 'Diane Keaton', 'Martin Short...","['Dave Ange', 'Roshanna Baron', 'Ronald Castro...",['Charles Shyer'],['United States'],['English'],"['Touchstone Pictures', 'Sandollar Productions..."


**Feature engineering ( Feature Transformation)**

Transforming all the columns data into list

In [43]:
df_rec['genres'][0].split("|")

['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']

In [44]:
def convert_list(val):
    return val.split("|")

In [45]:
# Transforming genres
df_rec['genres'] = df_rec['genres'].apply(convert_list)

In [46]:
df_rec.head(2)

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",114709,A little boy named Andy loves to be in his roo...,"['Tom Hanks', 'Tim Allen', 'Don Rickles', 'Jim...","['April', 'Ronen Barzel', 'Susan Bradley', 'An...",['John Lasseter'],"['United States', 'Japan']",['English'],"['Walt Disney Pictures', 'Pixar Animation Stud..."
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",113497,"Jumanji, one of the most unique--and dangerous...","['Robin Williams', 'Jonathan Hyde', 'Kirsten D...","['Mary Arnold', 'Carra Braveman', 'Adam Bryant...",['Joe Johnston'],['United States'],"['English', 'French']","['TriStar Pictures', 'Interscope Communication..."


In [47]:
df_rec['cast'][0].replace("'","").replace(" ","")

'[TomHanks,TimAllen,DonRickles,JimVarney,WallaceShawn]'

In [48]:
def convert_a_word(val):
    val = val.replace("'","").replace(" ","").replace("-","")
    # Converting into list
    return  val.replace("]","").replace("[","").split(",")

In [49]:
# Transforming cast
df_rec['cast'] = df_rec['cast'].apply(convert_a_word)

In [50]:
df_rec.head(2)

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",114709,A little boy named Andy loves to be in his roo...,"[TomHanks, TimAllen, DonRickles, JimVarney, Wa...","['April', 'Ronen Barzel', 'Susan Bradley', 'An...",['John Lasseter'],"['United States', 'Japan']",['English'],"['Walt Disney Pictures', 'Pixar Animation Stud..."
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",113497,"Jumanji, one of the most unique--and dangerous...","[RobinWilliams, JonathanHyde, KirstenDunst, Br...","['Mary Arnold', 'Carra Braveman', 'Adam Bryant...",['Joe Johnston'],['United States'],"['English', 'French']","['TriStar Pictures', 'Interscope Communication..."


In [51]:
# Transforming crew
df_rec['crew'] = df_rec['crew'].apply(convert_a_word)

In [52]:
df_rec.head(2)

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",114709,A little boy named Andy loves to be in his roo...,"[TomHanks, TimAllen, DonRickles, JimVarney, Wa...","[April, RonenBarzel, SusanBradley, AndrewCaldw...",['John Lasseter'],"['United States', 'Japan']",['English'],"['Walt Disney Pictures', 'Pixar Animation Stud..."
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",113497,"Jumanji, one of the most unique--and dangerous...","[RobinWilliams, JonathanHyde, KirstenDunst, Br...","[MaryArnold, CarraBraveman, AdamBryant, KenCos...",['Joe Johnston'],['United States'],"['English', 'French']","['TriStar Pictures', 'Interscope Communication..."


In [53]:
# Transforming director, countries, languages, production_companies

In [54]:
df_rec['director'] = df_rec['director'].apply(convert_a_word)

In [55]:
df_rec['countries'] = df_rec['countries'].apply(convert_a_word)

In [56]:
df_rec['languages'] = df_rec['languages'].apply(convert_a_word)

In [57]:
df_rec['production_companies'] = df_rec['production_companies'].apply(convert_a_word)

In [58]:
df_rec.head(2)

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",114709,A little boy named Andy loves to be in his roo...,"[TomHanks, TimAllen, DonRickles, JimVarney, Wa...","[April, RonenBarzel, SusanBradley, AndrewCaldw...",[JohnLasseter],"[UnitedStates, Japan]",[English],"[WaltDisneyPictures, PixarAnimationStudios]"
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",113497,"Jumanji, one of the most unique--and dangerous...","[RobinWilliams, JonathanHyde, KirstenDunst, Br...","[MaryArnold, CarraBraveman, AdamBryant, KenCos...",[JoeJohnston],[UnitedStates],"[English, French]","[TriStarPictures, InterscopeCommunications, Te..."


In [59]:
df_rec['plot'].apply(lambda x: x.split() if isinstance(x, str) else x)

0       [A, little, boy, named, Andy, loves, to, be, i...
1       [Jumanji,, one, of, the, most, unique--and, da...
2       [Things, don't, seem, to, change, much, in, Wa...
3       [This, story, based, on, the, best, selling, n...
4       [In, this, sequel, to, "Father, of, the, Bride...
                              ...                        
9013    [A, sniper, on, a, mercenary, assassination, t...
9014    [['A, young, man, in, a, personal, tailspin, f...
9015    [['George, Carlin, continues, making, people, ...
9016    [Kevin, Hart, plays, the, role, of, Darnell--a...
9017    [An, alien, similar, to, Ra, appears, out, of,...
Name: plot, Length: 9018, dtype: object

In [60]:
df_rec['plot'] = df_rec['plot'].apply(lambda x:x.split() if isinstance(x, str) else x)

In [61]:
df_rec.head(2)

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",114709,"[A, little, boy, named, Andy, loves, to, be, i...","[TomHanks, TimAllen, DonRickles, JimVarney, Wa...","[April, RonenBarzel, SusanBradley, AndrewCaldw...",[JohnLasseter],"[UnitedStates, Japan]",[English],"[WaltDisneyPictures, PixarAnimationStudios]"
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",113497,"[Jumanji,, one, of, the, most, unique--and, da...","[RobinWilliams, JonathanHyde, KirstenDunst, Br...","[MaryArnold, CarraBraveman, AdamBryant, KenCos...",[JoeJohnston],[UnitedStates],"[English, French]","[TriStarPictures, InterscopeCommunications, Te..."


In [62]:
# Transforming title
df_rec['title'] = df_rec['title'].str.replace(r'\(\d{4}\)', '').str.strip()

In [63]:
def convert_title(val):
  # Converting into list
    return  val.split(" ")

In [64]:
# Making seperate column for building the recommendation system
df_rec['title_tag'] = df_rec['title'].apply(convert_title)

In [65]:
df_rec.head(2)

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies,title_tag
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",114709,"[A, little, boy, named, Andy, loves, to, be, i...","[TomHanks, TimAllen, DonRickles, JimVarney, Wa...","[April, RonenBarzel, SusanBradley, AndrewCaldw...",[JohnLasseter],"[UnitedStates, Japan]",[English],"[WaltDisneyPictures, PixarAnimationStudios]","[Toy, Story]"
1,2,Jumanji,"[Adventure, Children, Fantasy]",113497,"[Jumanji,, one, of, the, most, unique--and, da...","[RobinWilliams, JonathanHyde, KirstenDunst, Br...","[MaryArnold, CarraBraveman, AdamBryant, KenCos...",[JoeJohnston],[UnitedStates],"[English, French]","[TriStarPictures, InterscopeCommunications, Te...",[Jumanji]


**Adding the feature to make the one feature**

That one feature has the information about the movie

In [66]:
df_rec['tags'] = df_rec['title_tag'] + df_rec['genres'] + df_rec['plot'] + df_rec['cast'] + df_rec['crew'] + df_rec['director'] + df_rec['countries'] + df_rec['languages'] + df_rec['production_companies']


In [67]:
df_rec.head(2)

Unnamed: 0,movieId,title,genres,imdbId,plot,cast,crew,director,countries,languages,production_companies,title_tag,tags
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",114709,"[A, little, boy, named, Andy, loves, to, be, i...","[TomHanks, TimAllen, DonRickles, JimVarney, Wa...","[April, RonenBarzel, SusanBradley, AndrewCaldw...",[JohnLasseter],"[UnitedStates, Japan]",[English],"[WaltDisneyPictures, PixarAnimationStudios]","[Toy, Story]","[Toy, Story, Adventure, Animation, Children, C..."
1,2,Jumanji,"[Adventure, Children, Fantasy]",113497,"[Jumanji,, one, of, the, most, unique--and, da...","[RobinWilliams, JonathanHyde, KirstenDunst, Br...","[MaryArnold, CarraBraveman, AdamBryant, KenCos...",[JoeJohnston],[UnitedStates],"[English, French]","[TriStarPictures, InterscopeCommunications, Te...",[Jumanji],"[Jumanji, Adventure, Children, Fantasy, Jumanj..."


In [68]:
# Droping the columns
df_rec.drop(columns = ['plot', 'genres', 'director', 'cast', 'crew', 'countries', 'languages', 'production_companies', 'title_tag'])

Unnamed: 0,movieId,title,imdbId,tags
0,1,Toy Story,114709,"[Toy, Story, Adventure, Animation, Children, C..."
1,2,Jumanji,113497,"[Jumanji, Adventure, Children, Fantasy, Jumanj..."
2,3,Grumpier Old Men,113228,"[Grumpier, Old, Men, Comedy, Romance, Things, ..."
3,4,Waiting to Exhale,114885,"[Waiting, to, Exhale, Comedy, Drama, Romance, ..."
4,5,Father of the Bride Part II,113041,"[Father, of, the, Bride, Part, II, Comedy, In,..."
...,...,...,...,...
9013,130578,The Gunman,2515034,"[The, Gunman, Action, Thriller, A, sniper, on,..."
9014,130840,Spring,3395184,"[Spring, Horror, Romance, Sci-Fi, ['A, young, ..."
9015,130970,George Carlin: Life Is Worth Losing,484855,"[George, Carlin:, Life, Is, Worth, Losing, Com..."
9016,131013,Get Hard,2561572,"[Get, Hard, Comedy, Kevin, Hart, plays, the, r..."


In [69]:
new_df = df_rec.drop(columns = ['plot', 'genres', 'director', 'cast', 'crew', 'countries', 'languages', 'production_companies', 'title_tag'])

In [70]:
new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x) if isinstance(x, list) else x)

In [71]:
new_df.head()

Unnamed: 0,movieId,title,imdbId,tags
0,1,Toy Story,114709,Toy Story Adventure Animation Children Comedy ...
1,2,Jumanji,113497,"Jumanji Adventure Children Fantasy Jumanji, on..."
2,3,Grumpier Old Men,113228,Grumpier Old Men Comedy Romance Things don't s...
3,4,Waiting to Exhale,114885,Waiting to Exhale Comedy Drama Romance This st...
4,5,Father of the Bride Part II,113041,Father of the Bride Part II Comedy In this seq...


In [72]:
# Merging the average rate with the dataset
new_df = new_df.merge(df_average_rate, on = 'movieId')

In [73]:
new_df.head()

Unnamed: 0,movieId,title,imdbId,tags,AverageRate
0,1,Toy Story,114709,Toy Story Adventure Animation Children Comedy ...,3.866701
1,2,Jumanji,113497,"Jumanji Adventure Children Fantasy Jumanji, on...",3.138467
2,3,Grumpier Old Men,113228,Grumpier Old Men Comedy Romance Things don't s...,3.027735
3,4,Waiting to Exhale,114885,Waiting to Exhale Comedy Drama Romance This st...,2.775529
4,5,Father of the Bride Part II,113041,Father of the Bride Part II Comedy In this seq...,2.930767


In [74]:
# This function makes the imdbId to the api_imdbId( ths will be used to fetch the recommended movies from the API)
def get_imdbId(movie_Ids):
    v = str(movie_Ids)
    if len(v) == 6:
        vd = 'tt0'
        v = vd+v
    if len(v) == 7:
        gt = 'tt'
        v = gt+v
    if len(v) == 5:
        jt = 'tt00'
        v = jt+v
    return v

In [75]:
new_df['api_imdbId'] = new_df['imdbId'].apply(get_imdbId)

In [76]:
new_df.head()

Unnamed: 0,movieId,title,imdbId,tags,AverageRate,api_imdbId
0,1,Toy Story,114709,Toy Story Adventure Animation Children Comedy ...,3.866701,tt0114709
1,2,Jumanji,113497,"Jumanji Adventure Children Fantasy Jumanji, on...",3.138467,tt0113497
2,3,Grumpier Old Men,113228,Grumpier Old Men Comedy Romance Things don't s...,3.027735,tt0113228
3,4,Waiting to Exhale,114885,Waiting to Exhale Comedy Drama Romance This st...,2.775529,tt0114885
4,5,Father of the Bride Part II,113041,Father of the Bride Part II Comedy In this seq...,2.930767,tt0113041


In [77]:
new_df.isnull().sum()

movieId        0
title          0
imdbId         0
tags           1
AverageRate    0
api_imdbId     0
dtype: int64

In [78]:
# Dropping the null values from the dataset
new_df.dropna(inplace = True)

In [79]:
new_df.isnull().sum()

movieId        0
title          0
imdbId         0
tags           0
AverageRate    0
api_imdbId     0
dtype: int64

**Converting text data into a sparse matrix**

CountVectorizer is a feature extraction method that converts text data into a sparse matrix where each row corresponds to a document, and each column corresponds to a unique word or token in the entire document collection. The values in the matrix indicate the count (frequency) of each word's occurrence in each document.

In [80]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 7000, stop_words = 'english')

In [81]:
vector = cv.fit_transform(new_df['tags']).toarray()

In [82]:
vector

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

**Evaluating cosine similarity among the each vectors**

In [83]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(vector)

In [84]:
similarity

array([[1.        , 0.05710402, 0.07874163, ..., 0.08939982, 0.07372098,
        0.06059827],
       [0.05710402, 1.        , 0.03064257, ..., 0.06262243, 0.03688556,
        0.08489527],
       [0.07874163, 0.03064257, 1.        , ..., 0.1727021 , 0.13563219,
        0.13657399],
       ...,
       [0.08939982, 0.06262243, 0.1727021 , ..., 1.        , 0.13859179,
        0.11961783],
       [0.07372098, 0.03688556, 0.13563219, ..., 0.13859179, 1.        ,
        0.04697114],
       [0.06059827, 0.08489527, 0.13657399, ..., 0.11961783, 0.04697114,
        1.        ]])

In [85]:
similarity.shape

(9017, 9017)

The similarity matrix gives the similarity between all the movies to a particular movie

**Content-boosted Collaborative Filtering**

In this approach, content-based recommendations are enhanced with collaborative filtering (user-item interaction). Item content is used as an initial recommendation, and then user preferences are factored in to refine the suggestions.

In [86]:
# Combining similarity with rating
for i in range(similarity.shape[0]):
    similarity[i] = similarity[i]*0.9 + new_df['AverageRate']*0.2*0.1
    similarity[i][i] = 1.0000000

**Fetch posters and recommended movies name**

In [87]:
import requests
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from io import BytesIO
%matplotlib inline

In [88]:
def fetch_poster(imdb_id):
    api_key = 'f63a16f0'
    url = f'http://www.omdbapi.com/?i = {imdb_id}&apikey = {api_key}'
    response = requests.get(url)
    data = response.json()
    poster_url = data['Poster']
    poster_response = requests.get(poster_url)
    image = Image.open(BytesIO(poster_response.content))
    return display(image)

In [89]:
def recommend(movie):
    movie_index = new_df[new_df['title'] == movie].index[0]
    distances = similarity[movie_index]
    movies_list = sorted(list(enumerate(distances)), reverse = True, key = lambda x:x[1])[1:6]
    for i in movies_list:
        print(new_df.iloc[i[0]].title)

In [90]:
new_df.iloc[1216].title

'Love and Other Catastrophes'

In [91]:
new_df.head(2)

Unnamed: 0,movieId,title,imdbId,tags,AverageRate,api_imdbId
0,1,Toy Story,114709,Toy Story Adventure Animation Children Comedy ...,3.866701,tt0114709
1,2,Jumanji,113497,"Jumanji Adventure Children Fantasy Jumanji, on...",3.138467,tt0113497


In [92]:
recommend('Avatar')

Day for Night (La Nuit Américaine)
Life and Death of Peter Sellers, The
CQ
Sita Sings the Blues
My Week with Marilyn


In [93]:
recommend('Batman')

Batman: Mask of the Phantasm
Batman Returns
Batman Forever
Dark Knight, The
Batman Begins


**Taking the dataframe new_df and similarity for making website**

In [94]:
import pickle

In [95]:
new_df.head()

Unnamed: 0,movieId,title,imdbId,tags,AverageRate,api_imdbId
0,1,Toy Story,114709,Toy Story Adventure Animation Children Comedy ...,3.866701,tt0114709
1,2,Jumanji,113497,"Jumanji Adventure Children Fantasy Jumanji, on...",3.138467,tt0113497
2,3,Grumpier Old Men,113228,Grumpier Old Men Comedy Romance Things don't s...,3.027735,tt0113228
3,4,Waiting to Exhale,114885,Waiting to Exhale Comedy Drama Romance This st...,2.775529,tt0114885
4,5,Father of the Bride Part II,113041,Father of the Bride Part II Comedy In this seq...,2.930767,tt0113041


In [96]:
# Taking only those columns that are nececessary to fetch the recommended movies
final_df = new_df[['title', 'api_imdbId']]

In [97]:
final_df.head()

Unnamed: 0,title,api_imdbId
0,Toy Story,tt0114709
1,Jumanji,tt0113497
2,Grumpier Old Men,tt0113228
3,Waiting to Exhale,tt0114885
4,Father of the Bride Part II,tt0113041


In [98]:
# Convert the dataframe new_df into dictionary
pickle.dump(final_df.to_dict(), open('/content/drive/MyDrive/Folders/movie_dict.pkl', 'wb'))

In [99]:
pickle.dump(similarity, open('/content/drive/MyDrive/Folders/similarity.pkl', 'wb'))