
In [1]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/movies-similarity/movies.csv


# Data Loading and Preprocessing

In [2]:
movies = pd.read_csv("/kaggle/input/movies-similarity/movies.csv")

In [3]:
movies.head(11)

Unnamed: 0,rank,title,genre,wiki_plot,imdb_plot
0,0,The Godfather,"[u' Crime', u' Drama']","On the day of his only daughter's wedding, Vit...","In late summer 1945, guests are gathered for t..."
1,1,The Shawshank Redemption,"[u' Crime', u' Drama']","In 1947, banker Andy Dufresne is convicted of ...","In 1947, Andy Dufresne (Tim Robbins), a banker..."
2,2,Schindler's List,"[u' Biography', u' Drama', u' History']","In 1939, the Germans move Polish Jews into the...",The relocation of Polish Jews from surrounding...
3,3,Raging Bull,"[u' Biography', u' Drama', u' Sport']","In a brief scene in 1964, an aging, overweight...","The film opens in 1964, where an older and fat..."
4,4,Casablanca,"[u' Drama', u' Romance', u' War']",It is early December 1941. American expatriate...,"In the early years of World War II, December 1..."
5,5,One Flew Over the Cuckoo's Nest,[u' Drama'],"In 1963 Oregon, Randle Patrick ""Mac"" McMurphy ...","In 1963 Oregon, Randle Patrick McMurphy (Nicho..."
6,6,Gone with the Wind,"[u' Drama', u' Romance', u' War']",\nPart 1\n \n Part 1 Part 1 \n \n On the...,"The film opens in Tara, a cotton plantation ow..."
7,7,Citizen Kane,"[u' Drama', u' Mystery']",\n\n\n\nOrson Welles as Charles Foster Kane\n\...,"It's 1941, and newspaper tycoon Charles Foster..."
8,8,The Wizard of Oz,"[u' Adventure', u' Family', u' Fantasy', u' Mu...",The film starts in sepia-tinted Kansas in the ...,Dorothy Gale (Judy Garland) is an orphaned tee...
9,9,Titanic,"[u' Drama', u' Romance']","In 1996, treasure hunter Brock Lovett and his ...","In 1996, treasure hunter Brock Lovett and his ..."


In [4]:
movies.shape

(100, 5)

In [5]:
movies.isnull().sum()

rank          0
title         0
genre         0
wiki_plot     0
imdb_plot    10
dtype: int64

In [6]:
movies.duplicated().sum()

0

In [7]:
movies.drop('rank',axis=1,inplace=True)

In [8]:
movies.drop('genre',axis=1,inplace=True)

In [9]:
movies.head()

Unnamed: 0,title,wiki_plot,imdb_plot
0,The Godfather,"On the day of his only daughter's wedding, Vit...","In late summer 1945, guests are gathered for t..."
1,The Shawshank Redemption,"In 1947, banker Andy Dufresne is convicted of ...","In 1947, Andy Dufresne (Tim Robbins), a banker..."
2,Schindler's List,"In 1939, the Germans move Polish Jews into the...",The relocation of Polish Jews from surrounding...
3,Raging Bull,"In a brief scene in 1964, an aging, overweight...","The film opens in 1964, where an older and fat..."
4,Casablanca,It is early December 1941. American expatriate...,"In the early years of World War II, December 1..."


In [10]:
movies["plot"] = movies["wiki_plot"].astype(str) + "\n" + movies["imdb_plot"].astype(str)

In [11]:
movies.head()

Unnamed: 0,title,wiki_plot,imdb_plot,plot
0,The Godfather,"On the day of his only daughter's wedding, Vit...","In late summer 1945, guests are gathered for t...","On the day of his only daughter's wedding, Vit..."
1,The Shawshank Redemption,"In 1947, banker Andy Dufresne is convicted of ...","In 1947, Andy Dufresne (Tim Robbins), a banker...","In 1947, banker Andy Dufresne is convicted of ..."
2,Schindler's List,"In 1939, the Germans move Polish Jews into the...",The relocation of Polish Jews from surrounding...,"In 1939, the Germans move Polish Jews into the..."
3,Raging Bull,"In a brief scene in 1964, an aging, overweight...","The film opens in 1964, where an older and fat...","In a brief scene in 1964, an aging, overweight..."
4,Casablanca,It is early December 1941. American expatriate...,"In the early years of World War II, December 1...",It is early December 1941. American expatriate...


In [12]:
movies.drop(columns=['wiki_plot','imdb_plot'],inplace=True)

**Because the plot is the main concern of this notebook, a new column "plot" has been created by combining the wiki_plot and imdb_plot columns.**
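
One detail to watch when combining the columns: imdb_plot has 10 missing values, and depending on the pandas version astype(str) may turn a missing value into the literal text "nan", which would then be counted as a word. The sketch below shows one possible alternative, assuming missing plots should simply contribute nothing. The `toy` frame and its contents are hypothetical and stand in for the real `movies` data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movies frame; the second
# row has a missing imdb_plot, like 10 rows in the real data.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "wiki_plot": ["A heist goes wrong.", "A ship sinks."],
    "imdb_plot": ["Thieves plan a heist.", float("nan")],
})

# Replace missing plots with empty strings before concatenating,
# then strip the dangling newline left when one side is empty.
toy["plot"] = (
    toy["wiki_plot"].fillna("") + "\n" + toy["imdb_plot"].fillna("")
).str.strip()

print(toy["plot"].tolist())
```

Whether to fill with an empty string or drop such rows is a design choice; filling keeps all 100 movies available for recommendation.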

In [13]:
movies.head()

Unnamed: 0,title,plot
0,The Godfather,"On the day of his only daughter's wedding, Vit..."
1,The Shawshank Redemption,"In 1947, banker Andy Dufresne is convicted of ..."
2,Schindler's List,"In 1939, the Germans move Polish Jews into the..."
3,Raging Bull,"In a brief scene in 1964, an aging, overweight..."
4,Casablanca,It is early December 1941. American expatriate...


*The data has no duplicate rows; the 10 missing imdb_plot values were not dropped but absorbed into the combined "plot" column through astype(str). With the irrelevant features removed, the next step is to transform the text data into numerical vectors.*

# Transforming Text Data

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
cv = CountVectorizer(max_features=5000,stop_words='english')

In [16]:
vector = cv.fit_transform(movies['plot']).toarray()

In [17]:
vector.shape

(100, 5000)

In [18]:
cv.get_feature_names_out()

array(['000', '10', '101st', ..., 'zone', 'zuzu', 'éowyn'], dtype=object)

In [19]:
len(list(cv.get_feature_names_out()))

5000
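
To make the vectorization step above more concrete, here is a minimal pure-Python sketch of the bag-of-words idea: lowercase the text, drop stop words, build a vocabulary, and count each term per document. It is an illustration only, not scikit-learn's implementation; the function names, the tiny stop-word set, and the sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption; the 'english'
# list used by CountVectorizer is much longer).
STOP_WORDS = {"the", "a", "is", "of", "and", "in"}

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def count_vectorize(docs, max_features=None):
    """Return (vocabulary, count vectors) for a list of documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Keep the most frequent terms across the corpus, sorted alphabetically.
    vocab = sorted(term for term, _ in totals.most_common(max_features))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

docs = ["The spy steals the plans", "The spy escapes in a ship", "A ship sinks"]
vocab, vectors = count_vectorize(docs)
print(vocab)
for row in vectors:
    print(row)
```

Each row is one document and each column one vocabulary term, which is the same layout as the (100, 5000) array above.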

# Cosine Similarity Calculation

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
similarity = cosine_similarity(vector)

In [22]:
similarity.shape

(100, 100)

In [23]:
similarity[0]

array([1.        , 0.04001917, 0.05771261, 0.07307696, 0.04888757,
       0.04132722, 0.07509634, 0.0580935 , 0.05614053, 0.04230219,
       0.05457121, 0.70830163, 0.08195636, 0.06086628, 0.04150478,
       0.08704406, 0.06891063, 0.09446582, 0.0482471 , 0.06467636,
       0.25892356, 0.04252807, 0.05140361, 0.07768898, 0.06146464,
       0.20224308, 0.06701862, 0.04324335, 0.05576948, 0.05321243,
       0.05441928, 0.04529924, 0.0627203 , 0.05802484, 0.06623272,
       0.05016551, 0.09608497, 0.06840808, 0.03127585, 0.05377635,
       0.06669834, 0.08290048, 0.11416908, 0.05352984, 0.09782191,
       0.04422474, 0.05444377, 0.08988961, 0.07640729, 0.07930142,
       0.08219254, 0.06435246, 0.04611364, 0.05989681, 0.05856924,
       0.06985556, 0.06098673, 0.04497401, 0.1017692 , 0.12153548,
       0.06492489, 0.10138292, 0.08677346, 0.06930517, 0.10219443,
       0.06647271, 0.1017301 , 0.06384807, 0.06092115, 0.04044447,
       0.07698201, 0.07507301, 0.084338  , 0.05942183, 0.07215

**Here similarity[0] holds the similarity of the first movie (index 0) to every movie in the dataset, including itself (hence the leading 1.0). Cosine similarity is, in simple terms, the cosine of the angle between two vectors.**
* The larger the angle between two vectors, the smaller the cosine and the less similar the two movies are.
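
The relationship between angle and similarity can be checked by hand. The sketch below computes cosine similarity as the dot product divided by the product of the vector lengths, using only the standard library. The helper name `cosine_sim` and the sample vectors are assumptions for illustration; the notebook itself relies on sklearn's cosine_similarity.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_a * norm_b)

a = [1, 2, 0]
b = [2, 4, 0]  # same direction as a, twice as long
c = [0, 0, 3]  # shares no non-zero component with a

same_direction = cosine_sim(a, b)
orthogonal = cosine_sim(a, c)
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of length, and vectors with no terms in common score 0, which is why long and short plots can still be compared.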

In [24]:
movies[movies['title']=='Star Wars']

Unnamed: 0,title,plot
19,Star Wars,"The galaxy is in a civil war, and spies for th..."


# Recommendation Function

In [25]:
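def recommend(movie):
    # Row index of the requested title (raises IndexError if the title is absent)
    index = movies[movies['title']==movie].index[0]
    # Pair each movie position with its similarity score, most similar first
    ranked = sorted(enumerate(similarity[index]),reverse=True,key = lambda x: x[1])
    print(f"Movies similar to {movie} are:")
    # Skip the first entry (expected to be the movie itself) and print the next five
    for j, (i, _) in enumerate(ranked[1:6], start=1):
        print(f"{j}. {movies.iloc[i].title}")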
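
As a possible variation on the function above, the sketch below returns the titles instead of printing them and returns an empty list for an unknown title. It works on plain Python lists so it can be run on its own; the name `top_similar`, its parameters, and the 3x3 toy matrix are hypothetical and not part of the notebook.

```python
def top_similar(title, titles, similarity, k=5):
    """Return up to k titles most similar to `title`, or [] if it is unknown."""
    if title not in titles:
        return []
    index = titles.index(title)
    # Rank every movie position by its similarity to the query, highest first.
    ranked = sorted(enumerate(similarity[index]), key=lambda p: p[1], reverse=True)
    # Exclude the query by position rather than assuming it sorts first.
    return [titles[i] for i, _ in ranked if i != index][:k]

# Hypothetical toy data: three titles and a symmetric similarity matrix.
titles = ["A", "B", "C"]
similarity = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
]

result = top_similar("A", titles, similarity, k=2)
missing = top_similar("Z", titles, similarity)
print(result, missing)
```

Returning a list makes the result easier to reuse, for example in a table or a test, while the printing version is convenient for quick inspection.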

In [26]:
movies.head(-1)

Unnamed: 0,title,plot
0,The Godfather,"On the day of his only daughter's wedding, Vit..."
1,The Shawshank Redemption,"In 1947, banker Andy Dufresne is convicted of ..."
2,Schindler's List,"In 1939, the Germans move Polish Jews into the..."
3,Raging Bull,"In a brief scene in 1964, an aging, overweight..."
4,Casablanca,It is early December 1941. American expatriate...
...,...,...
94,Double Indemnity,\n\n\n\nNeff confesses into a Dictaphone.\n\n ...
95,Rebel Without a Cause,\n\n\n\nJim Stark is in police custody.\n\n \...
96,Rear Window,\n\n\n\nJames Stewart as L.B. Jefferies\n\n \...
97,The Third Man,\n\n\n\nSocial network mapping all major chara...


In [27]:
recommend('Star Wars')

Movies similar to Star Wars are:
1. Stagecoach
2. 2001: A Space Odyssey
3. City Lights
4. Close Encounters of the Third Kind
5. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb


In [28]:
recommend('Titanic')

Movies similar to Titanic are:
1. The African Queen
2. The Apartment
3. The Silence of the Lambs
4. Psycho
5. Pulp Fiction


**This is an initial draft of a movie recommendation system that uses plots as the feature. This notebook uses the following:**
* CountVectorizer : Converts the text data into vectors holding the frequencies of the 5000 features (terms) in the vocabulary it builds.
* Cosine Similarity : A metric that measures how similar two vectors are by the cosine of the angle between them. The smaller the angle, the higher the score and the more similar the vectors.