<a href="https://colab.research.google.com/github/ananyaj247/Movie-Recommendation-System/blob/main/Movie_recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing libraries


In [1]:
import numpy as np
import pandas as pd
import difflib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Loading the CSV file and checking it


In [None]:
movie_dataset = pd.read_csv('/content/drive/MyDrive/colabfiles/moviesdt.csv')
# A notebook cell only displays its last expression, so print the head explicitly
print(movie_dataset.head(10))
movie_dataset.tail()

Checking the dataset, shape and null values

In [None]:
movie_dataset.shape

In [17]:
movie_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 4803 non-null   int64  
 1   budget                4803 non-null   int64  
 2   genres                4803 non-null   object 
 3   homepage              1712 non-null   object 
 4   id                    4803 non-null   int64  
 5   keywords              4803 non-null   object 
 6   original_language     4803 non-null   object 
 7   original_title        4803 non-null   object 
 8   overview              4803 non-null   object 
 9   popularity            4803 non-null   float64
 10  production_companies  4803 non-null   object 
 11  production_countries  4803 non-null   object 
 12  release_date          4802 non-null   object 
 13  revenue               4803 non-null   int64  
 14  runtime               4801 non-null   float64
 15  spoken_languages     
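
The `info()` summary reports non-null counts per column. As a minimal sketch of a more direct null check, the toy frame below (hypothetical data, not the movie dataset) counts missing values per column with `isnull().sum()`.

```python
import pandas as pd

# Toy frame standing in for the movie dataset (values are made up for illustration)
toy_frame = pd.DataFrame({
    "title": ["A", "B", "C"],
    "homepage": [None, "http://example.com", None],
})

# Number of missing values in each column
null_counts = toy_frame.isnull().sum()
print(null_counts)
```

The same call on `movie_dataset` would list the missing values for all 24 columns at once.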

Let's decide the relevant features 

In [8]:
features_selected = ['genres', 'keywords', 'original_language', 'overview', 'tagline', 'cast', 'director' ]
print(features_selected)

['genres', 'keywords', 'original_language', 'overview', 'tagline', 'cast', 'director']


Replacing null values in the selected features with a blank string (a single space)

In [9]:
for feature in features_selected:
  movie_dataset[feature] = movie_dataset[feature].fillna(' ')
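
Why the nulls have to go first: in pandas, concatenating a string with a missing value produces a missing value, so a single null feature would blank out the whole combined string for that row. A minimal sketch on a toy frame (the column values here are made up):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing tagline (hypothetical values)
toy = pd.DataFrame({
    "genres": ["Action", "Comedy"],
    "tagline": ["Heroes rise", np.nan],
})

# Without filling, the second row becomes NaN after concatenation
without_fill = toy["genres"] + " " + toy["tagline"]

# After filling nulls with a space, every row yields a usable string
filled = toy.fillna(" ")
with_fill = filled["genres"] + " " + filled["tagline"]

print(without_fill.isna().tolist())
print(with_fill.tolist())
```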

Combine these features into one string per movie

In [13]:
combined_ft = (movie_dataset['genres'] + ' ' + movie_dataset['keywords'] + ' '
               + movie_dataset['original_language'] + ' ' + movie_dataset['overview'] + ' '
               + movie_dataset['tagline'] + ' ' + movie_dataset['cast'] + ' '
               + movie_dataset['director'])
print(combined_ft)


0       Action Adventure Fantasy Science Fiction cultu...
1       Adventure Fantasy Action ocean drug abuse exot...
2       Action Adventure Crime spy based on novel secr...
3       Action Crime Drama Thriller dc comics crime fi...
4       Action Adventure Science Fiction based on nove...
                              ...                        
4798    Action Crime Thriller united states–mexic...
4799    Comedy Romance   en A newlywed couple's honeym...
4800    Comedy Drama Romance TV Movie date love at fir...
4801        en When ambitious New York attorney Sam is...
4802    Documentary obsession camcorder crush dream gi...
Length: 4803, dtype: object


TF-IDF vectorizing

In [None]:
vectorizer = TfidfVectorizer()
features_as_vectors = vectorizer.fit_transform(combined_ft)
print(features_as_vectors)
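
To make the vectorizing step more concrete, here is a small sketch on three made-up documents. It assumes the default `TfidfVectorizer` settings used above; the toy strings are not from the dataset. Each document becomes one row of a sparse matrix and each distinct word becomes one column.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy "combined feature" strings (hypothetical)
toy_docs = [
    "action adventure space",
    "action spy london",
    "romance comedy london",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct word
print(toy_matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))

# With the default norm='l2', every row has unit length
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```

Words that appear in fewer documents (such as `space`) receive a higher weight than words shared across documents (such as `action`).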

Let's get the cosine similarity between every pair of movies

In [18]:
similarity = cosine_similarity(features_as_vectors)
print(similarity)

[[1.         0.05148617 0.03392811 ... 0.02801039 0.03099337 0.00770394]
 [0.05148617 1.         0.04417805 ... 0.05125681 0.03149969 0.01564774]
 [0.03392811 0.04417805 1.         ... 0.02694306 0.04797411 0.01414021]
 ...
 [0.02801039 0.05125681 0.02694306 ... 1.         0.03518829 0.03579459]
 [0.03099337 0.03149969 0.04797411 ... 0.03518829 1.         0.03130853]
 [0.00770394 0.01564774 0.01414021 ... 0.03579459 0.03130853 1.        ]]


In [19]:
print(similarity.shape)

(4803, 4803)
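
The matrix is square because every movie is compared with every other movie. As a rough sketch of what `cosine_similarity` computes, the toy example below (made-up vectors) reproduces it by dividing each row by its length and taking dot products. It also shows why the diagonal is 1 and the matrix is symmetric.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy feature vectors (hypothetical values)
toy_vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

toy_similarity = cosine_similarity(toy_vectors)

# Manual version: scale each row to unit length, then take pairwise dot products
unit_rows = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
manual_similarity = unit_rows @ unit_rows.T

print(toy_similarity)
print(np.allclose(toy_similarity, manual_similarity))
```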


Input the movie's name 

In [21]:
input_movie = input('Watch more movies like this: ')


Watch more movies like this: iron man


Listing all the movie titles in the dataset and finding the closest match


In [22]:
list_of_all = movie_dataset['title'].tolist()
print(list_of_all)

['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre', 'The Dark Knight Rises', 'John Carter', 'Spider-Man 3', 'Tangled', 'Avengers: Age of Ultron', 'Harry Potter and the Half-Blood Prince', 'Batman v Superman: Dawn of Justice', 'Superman Returns', 'Quantum of Solace', "Pirates of the Caribbean: Dead Man's Chest", 'The Lone Ranger', 'Man of Steel', 'The Chronicles of Narnia: Prince Caspian', 'The Avengers', 'Pirates of the Caribbean: On Stranger Tides', 'Men in Black 3', 'The Hobbit: The Battle of the Five Armies', 'The Amazing Spider-Man', 'Robin Hood', 'The Hobbit: The Desolation of Smaug', 'The Golden Compass', 'King Kong', 'Titanic', 'Captain America: Civil War', 'Battleship', 'Jurassic World', 'Skyfall', 'Spider-Man 2', 'Iron Man 3', 'Alice in Wonderland', 'X-Men: The Last Stand', 'Monsters University', 'Transformers: Revenge of the Fallen', 'Transformers: Age of Extinction', 'Oz: The Great and Powerful', 'The Amazing Spider-Man 2', 'TRON: Legacy', 'Cars 2', 'Green Lant

In [23]:
close_match = difflib.get_close_matches(input_movie, list_of_all)
print(close_match)

['Iron Man', 'Iron Man 3', 'Iron Man 2']


In [24]:
match = close_match[0]
print(match)

Iron Man
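
Note that `close_match[0]` raises an `IndexError` when `difflib` finds nothing similar enough. One possible guard is sketched below. The helper name `find_closest_title` and the case-insensitive comparison are assumptions added for illustration, not part of the notebook above.

```python
import difflib

def find_closest_title(query, titles, cutoff=0.6):
    """Return the best-matching title, or None when nothing clears the cutoff.

    Hypothetical helper: compares lowercased strings so that 'iron man'
    matches 'Iron Man' exactly.
    """
    # Map lowercased titles back to their original spelling (first one wins)
    lowered = {}
    for title in titles:
        lowered.setdefault(title.lower(), title)
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

sample_titles = ["Avatar", "Iron Man 2", "Iron Man"]
print(find_closest_title("iron man", sample_titles))
print(find_closest_title("zzzzqqq", sample_titles))
```

Returning `None` lets the caller print a friendly message instead of crashing on an unknown title.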


Finding the index of the matched movie, then collecting its similarity scores and sorting them in descending order

In [25]:
movie_index = movie_dataset[movie_dataset.title == match]['index'].values[0]
print(movie_index)

68


In [26]:
similarity_score = list(enumerate(similarity[movie_index]))
print(similarity_score)


[(0, 0.05083893409016364), (1, 0.03716124266452726), (2, 0.011666622125149337), (3, 0.007065208444118491), (4, 0.02681067303732928), (5, 0.014937662548823703), (6, 0.033275097510479544), (7, 0.19387577136079415), (8, 0.011150025778084528), (9, 0.03714009111727031), (10, 0.03913959010029089), (11, 0.013666002661827157), (12, 0.011200401141530363), (13, 0.010982692323718322), (14, 0.05885574506267876), (15, 0.028978596494491672), (16, 0.1326279488524102), (17, 0.016309152117409006), (18, 0.021219276416786153), (19, 0.042881868818267325), (20, 0.037167286151230536), (21, 0.025035457114422503), (22, 0.009800047555443829), (23, 0.018281436551421955), (24, 0.01855705525872456), (25, 0.0038761939173283582), (26, 0.1161694271291082), (27, 0.03351049695738133), (28, 0.03784802332489708), (29, 0.010035475453279534), (30, 0.08022661555648156), (31, 0.291038149651374), (32, 0.026176440603042844), (33, 0.07834291234956449), (34, 0.006557033170460639), (35, 0.021662176706434243), (36, 0.025798892491

In [27]:
len(similarity_score)

4803

In [28]:
sorted_order = sorted(similarity_score, key = lambda x:x[1], reverse = True)
print(sorted_order)

[(68, 1.0000000000000002), (79, 0.32963783181961487), (31, 0.291038149651374), (7, 0.19387577136079415), (16, 0.1326279488524102), (4401, 0.11927686124340056), (511, 0.11797878533982765), (26, 0.1161694271291082), (64, 0.11326362388763227), (182, 0.10863636266223915), (3623, 0.10498977034596504), (94, 0.10069283807574862), (46, 0.08869856707974016), (4033, 0.08718643334213297), (783, 0.0869877309280045), (174, 0.08652876117743809), (618, 0.08467939038428837), (85, 0.0835806383758512), (3133, 0.08054168374039268), (101, 0.08051837756387765), (1740, 0.08037545226989536), (30, 0.08022661555648156), (3466, 0.07912166868722419), (33, 0.07834291234956449), (203, 0.07767268316867439), (1177, 0.07721723535766267), (420, 0.07711865054993461), (131, 0.07577573725671409), (1406, 0.07570968036377682), (318, 0.07509831193636846), (788, 0.07435216695109356), (2229, 0.0742949066654058), (122, 0.07389477676326056), (1135, 0.07357295909022865), (2235, 0.07322277804308293), (138, 0.07305964318841898), (
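
The top entry of the sorted list is the query movie itself, with a score of 1. A possible alternative to sorting a list of tuples is sketched below: it ranks with `np.argsort` and drops the query index. The function name `top_k_similar` and the toy scores are hypothetical.

```python
import numpy as np

def top_k_similar(scores, query_index, k):
    """Return the indices of the k highest scores, excluding the query itself."""
    order = np.argsort(scores)[::-1]      # indices from highest to lowest score
    order = order[order != query_index]   # drop the query movie
    return order[:k].tolist()

# Toy similarity row for a query movie at position 1 (made-up scores)
toy_scores = np.array([0.1, 1.0, 0.5, 0.3])
print(top_k_similar(toy_scores, query_index=1, k=2))
```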

In [31]:
print('If you liked', input_movie, 'you can watch: \n')

i = 1

# Only the first 19 entries are printed, so slice instead of looping over all 4803
for i, (index, score) in enumerate(sorted_order[:19], start=1):
  title_from_index = movie_dataset.loc[index, 'title']
  print(i, '.', title_from_index)

If you liked iron man you can watch: 

1 . Iron Man
2 . Iron Man 2
3 . Iron Man 3
4 . Avengers: Age of Ultron
5 . The Avengers
6 . The Helix... Loaded
7 . X-Men
8 . Captain America: Civil War
9 . X-Men: Apocalypse
10 . Ant-Man
11 . Made
12 . Guardians of the Galaxy
13 . X-Men: Days of Future Past
14 . Super
15 . Mortdecai
16 . The Incredible Hulk
17 . Mystery Men
18 . Captain America: The Winter Soldier
19 . The Devil's Tomb


Summarising the code

In [32]:
input_movie = input('Watch more movies like this: ')
list_of_all = movie_dataset['title'].tolist()
close_match = difflib.get_close_matches(input_movie, list_of_all)
match = close_match[0]
movie_index = movie_dataset[movie_dataset.title == match]['index'].values[0]
similarity_score = list(enumerate(similarity[movie_index]))
sorted_order = sorted(similarity_score, key = lambda x:x[1], reverse = True)
print('If you liked', input_movie, 'you can watch: \n')

i = 1

# Only the first 19 entries are printed, so slice instead of looping over all 4803
for i, (index, score) in enumerate(sorted_order[:19], start=1):
  title_from_index = movie_dataset.loc[index, 'title']
  print(i, '.', title_from_index)

Watch more movies like this: Spectre
If you liked Spectre you can watch: 

1 . Spectre
2 . Skyfall
3 . Never Say Never Again
4 . Quantum of Solace
5 . Diamonds Are Forever
6 . From Russia with Love
7 . Die Another Day
8 . Dr. No
9 . Octopussy
10 . Thunderball
11 . The Sentinel
12 . The Prince of Egypt
13 . Restless
14 . The Man with the Golden Gun
15 . Casino Royale
16 . Tinker Tailor Soldier Spy
17 . The Spy Who Loved Me
18 . The Green Hornet
19 . Live and Let Die
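
The summarised steps could also be wrapped in a reusable function. The sketch below is one way to do it under stated assumptions: the name `recommend`, its parameters, and the toy data are all hypothetical, and it skips the query movie itself and returns an empty list when no title matches.

```python
import difflib

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query, titles, similarity, top_n=5):
    """Return up to top_n titles most similar to the closest match for query."""
    matches = difflib.get_close_matches(query, titles, n=1)
    if not matches:
        return []  # nothing close enough to the query
    # First row whose title equals the match (assumes row order matches the matrix)
    movie_index = titles.index(matches[0])
    ranked = sorted(enumerate(similarity[movie_index]),
                    key=lambda pair: pair[1], reverse=True)
    # Skip the query movie itself, then keep the top_n titles
    return [titles[i] for i, _ in ranked if i != movie_index][:top_n]

# Toy data standing in for the movie dataset (made-up text)
toy_movies = pd.DataFrame({
    "title": ["Iron Man", "Iron Man 2", "Spectre", "Skyfall"],
    "text": [
        "superhero armor tony stark",
        "superhero armor tony stark sequel",
        "spy bond mission london",
        "spy bond mission istanbul",
    ],
})
toy_titles = toy_movies["title"].tolist()
toy_similarity = cosine_similarity(TfidfVectorizer().fit_transform(toy_movies["text"]))

print(recommend("iron man", toy_titles, toy_similarity, top_n=1))
```

With the real data, the call would take `movie_dataset['title'].tolist()` and the `similarity` matrix computed earlier.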
