# Movie Recommendation System

This project involves two datasets: 'movies' and 'credits'. The objective is to develop a recommendation system that suggests the top 5 movies based on the input movie name provided by the user.

In [1]:
import numpy as np
import pandas as pd
import ast
import nltk
from nltk.stem.porter import PorterStemmer
import pickle

import warnings
warnings.filterwarnings('ignore')

The first dataset ('credits') contains the following features:

- movie_id - A unique identifier for each movie.
- title - Title of the movie.
- cast - The names of the lead and supporting actors.
- crew - The names of the director, editor, composer, writer, etc.


The second dataset ('movies') has the following features:

- budget - The budget with which the movie was made.
- genres - The genres of the movie (Action, Comedy, Thriller, etc.).
- homepage - A link to the homepage of the movie.
- id - This is in fact the same identifier as movie_id in the first dataset.
- keywords - The keywords or tags related to the movie.
- original_language - The language in which the movie was made.
- original_title - The title of the movie before translation or adaptation.
- overview - A brief description of the movie.
- popularity - A numeric quantity specifying the movie popularity.
- production_companies - The production house of the movie.
- production_countries - The country in which it was produced.
- release_date - The date on which it was released.
- revenue - The worldwide revenue generated by the movie.
- runtime - The running time of the movie in minutes.
- status - "Released" or "Rumored".
- tagline - Movie's tagline.
- title - Title of the movie.
- vote_average - The average rating the movie received.
- vote_count - The number of votes received.

### Movie Dataset

In [2]:
movies=pd.read_csv('tmdb_5000_movies.csv')

In [3]:
movies.head(1)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800


### Credits dataset

In [4]:
credits=pd.read_csv('tmdb_5000_credits.csv')

In [5]:
credits.head(1)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


We merge the two datasets on their shared 'title' column and keep the result under the name 'movies'. The necessary features are extracted from the merged DataFrame afterwards.

In [6]:
movies=movies.merge(credits,on='title')
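A small sketch of how the choice of join key affects the merge above. The two toy DataFrames and their rows are hypothetical stand-ins, not rows from the TMDB files. When two different movies share a title, joining on 'title' pairs each of them with both credit rows, while joining on the identifier keeps one row per movie.

```python
import pandas as pd

# Hypothetical toy data: two different movies share the title "Beta".
movies_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
})
credits_toy = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Beta", "Beta"],
    "cast": ["a", "b", "c"],
})

# Joining on title: "Alpha" matches once, each "Beta" matches both "Beta" rows.
by_title = movies_toy.merge(credits_toy, on="title")

# Joining on the identifier: exactly one match per movie.
by_id = movies_toy.merge(
    credits_toy, left_on="id", right_on="movie_id", suffixes=("", "_credits")
)

print(len(by_title), len(by_id))
```

Whether this matters for the real data depends on how many titles repeat in it; comparing the row count before and after the merge is a quick way to check.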

We then keep only the features needed for the recommender.

In [7]:
movies=movies[['movie_id','title','overview','genres','keywords','cast','crew']]

## Data Cleaning

### Identifying Null Values 

There are only three null values, all in the 'overview' feature, so we can address them by simply dropping the corresponding rows.

In [8]:
movies.isnull().sum()

movie_id    0
title       0
overview    3
genres      0
keywords    0
cast        0
crew        0
dtype: int64

In [9]:
movies.dropna(inplace=True)

### Identifying Duplicate values

There are no duplicate rows in the dataset.

In [10]:
movies.duplicated().sum()

0

## Data Preprocessing

In [11]:
movies.iloc[0].genres

'[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

ast - Python's ast module parses Python source code into Abstract Syntax Trees (ASTs). Here we use only ast.literal_eval, which evaluates a string containing a Python literal (such as a list of dictionaries) without executing arbitrary code.

The convert function takes one of these JSON-like strings and returns a list containing just the 'name' value of each object in it. This makes the data easier to work with, since we are only interested in the names.

In [12]:
def convert(text):
    # Parse the list-of-dicts string and collect each entry's 'name'.
    L=[]
    for i in ast.literal_eval(text):
        L.append(i['name'])
    return L
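As a quick illustration of the parsing step, the sketch below applies ast.literal_eval to a shortened version of the genres string shown above and collects the names with a list comprehension. The variable names are illustrative only.

```python
import ast

# Shortened version of the 'genres' string printed above.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval turns the string into a real list of dictionaries.
parsed = ast.literal_eval(raw)

# Keep only the 'name' value of each dictionary.
names = [item["name"] for item in parsed]
print(names)  # ['Action', 'Adventure']
```

Because these strings are also valid JSON, json.loads from the standard library would likely work as well, and it additionally understands JSON-only tokens such as null and true that ast.literal_eval rejects.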

In [13]:
movies['genres']=movies['genres'].apply(convert)

In [14]:
movies['keywords']=movies['keywords'].apply(convert)

In [41]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


The convert1 function does the same conversion from a JSON-like string to a list of names, but keeps only the first three names found in the string. We use it to limit the cast to three actors per movie.

In [15]:
def convert1(text):
    # Same as convert, but stop after the first three names.
    L=[]
    counter=0
    for i in ast.literal_eval(text):
        if counter != 3:
            L.append(i['name'])
            counter += 1
        else:
            break
    return L
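The counter logic above can also be expressed with a slice. This is an alternative sketch, not the notebook's own code; first_three_names and the sample string are hypothetical.

```python
import ast

def first_three_names(text):
    """Return at most the first three 'name' values from a list-of-dicts string."""
    # Slicing with [:3] is safe even when the list has fewer than three entries.
    return [item["name"] for item in ast.literal_eval(text)[:3]]

sample = '[{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]'
print(first_three_names(sample))  # ['A', 'B', 'C']
```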

In [16]:
movies['cast']=movies['cast'].apply(convert1)

In [42]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


The fetch_director function retrieves the director's name from the movie's crew data by searching for the first crew member whose job is 'Director'. It returns the name inside a list, or an empty list if no director is found.

In [17]:
def fetch_director(text):
    # Return the first crew member whose job is 'Director', wrapped in a list.
    L=[]
    for i in ast.literal_eval(text):
        if i['job'] == 'Director':
            L.append(i['name'])
            break
    return L
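The "first match or nothing" search can be sketched with next() and a generator expression. The function name first_director and the sample crew string are hypothetical and only illustrate the idea.

```python
import ast

def first_director(text):
    """Return [name] for the first crew member with job 'Director', else []."""
    crew = ast.literal_eval(text)
    # next() stops at the first match; None is the fallback when there is none.
    name = next((m["name"] for m in crew if m.get("job") == "Director"), None)
    return [] if name is None else [name]

sample = (
    '[{"job": "Editor", "name": "E One"}, '
    '{"job": "Director", "name": "D One"}, '
    '{"job": "Director", "name": "D Two"}]'
)
print(first_director(sample))  # ['D One']
```

Returning a list rather than a bare string keeps the column compatible with the list concatenation used later to build the tags.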

In [18]:
movies['crew']=movies['crew'].apply(fetch_director)

In [43]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [19]:
movies['overview']=movies['overview'].apply(lambda x:x.split())

In [20]:
movies['genres']=movies['genres'].apply(lambda x:[i.replace(' ','') for i in x])
movies['keywords']=movies['keywords'].apply(lambda x:[i.replace(' ','') for i in x])
movies['cast']=movies['cast'].apply(lambda x:[i.replace(' ','') for i in x])
movies['crew']=movies['crew'].apply(lambda x:[i.replace(' ','') for i in x])

In [44]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


We concatenate the 'overview', 'genres', 'keywords', 'cast' and 'crew' columns, each of which now holds a list of tokens, into a new column called 'tags'. We then build a new DataFrame from the previous one, keeping only the identifier, the title and the combined 'tags' column.

In [21]:
movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']
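To make the tag construction concrete, here is a plain-Python sketch for a single movie. The values are abbreviated from the Avatar row shown above, and the variable names are illustrative. Removing the spaces inside multi-word names keeps each name as a single token once the lists are joined into one string.

```python
# Abbreviated values for one movie (taken from the Avatar row above).
overview = "A paraplegic Marine".split()
genres = ["Science Fiction"]
cast = ["Sam Worthington"]
crew = ["James Cameron"]

def squash(names):
    """Remove inner spaces so 'Sam Worthington' becomes one token."""
    return [name.replace(" ", "") for name in names]

# List concatenation, mirroring the column-wise '+' used on the DataFrame.
tags = overview + squash(genres) + squash(cast) + squash(crew)

# Join into a single lowercase string, as done in the following cells.
tag_string = " ".join(tags).lower()
print(tag_string)
```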

In [45]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [22]:
new_df=movies[['movie_id','title','tags']].copy()

In [46]:
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a parapleg marin is dispa..."


In [23]:
new_df['tags']=new_df['tags'].apply(lambda x:' '.join(x))

In [24]:
new_df['tags']=new_df['tags'].apply(lambda x:x.lower())

In [25]:
ps=PorterStemmer()

The Porter Stemmer is a linguistic algorithm that reduces words to their root form, called a stem, by removing suffixes.

In [26]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000,stop_words='english')

In [27]:
def stem(text):
    y=[]
    
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

In [28]:
ps.stem('loving')

'love'
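A short sketch of stemming applied to a whole string, assuming nltk is installed. The helper name stem_text and the sample string are illustrative. Note that a stem is not always a dictionary word: 'marine' becomes 'marin', as also seen in the stemmed tags output.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Stem every whitespace-separated word and join the stems back together."""
    return " ".join(stemmer.stem(word) for word in text.split())

stemmed = stem_text("loving marine")
print(stemmed)  # love marin
```

Stemming before vectorizing means that different inflections of a word are counted as the same feature.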

In [29]:
new_df['tags']=new_df['tags'].apply(stem)

In [30]:
vectors=cv.fit_transform(new_df['tags']).toarray()

In [31]:
cv.get_feature_names_out()

array(['000', '007', '10', ..., 'zone', 'zoo', 'zooeydeschanel'],
      dtype=object)

In NLP, cosine similarity measures how similar two documents are by comparing the angle between their word-frequency vectors. In general the value ranges from -1 to 1, but word counts are never negative, so here it always falls between 0 and 1: a value close to 1 indicates high similarity, while a value close to 0 indicates that the documents share few or no terms. It is commonly used for tasks like document similarity, information retrieval, text classification, and recommendation systems.

In [32]:
from sklearn.metrics.pairwise import cosine_similarity
similarity=cosine_similarity(vectors)
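The measure itself is the dot product of two vectors divided by the product of their lengths. The NumPy sketch below computes it by hand on three small hypothetical count vectors; it is meant only to illustrate the formula, not to replace the scikit-learn call above.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1, 1, 0, 2]
b = [2, 2, 0, 4]  # same direction as a (every count doubled)
c = [0, 0, 3, 0]  # shares no terms with a

print(round(cosine(a, b), 6))  # 1.0
print(cosine(a, c))            # 0.0
```

Because only the direction of the vectors matters, a document that repeats the same words twice as often still scores 1 against the original.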

In [35]:
# Use a context manager so the file is closed after writing.
with open('movies.pkl','wb') as f:
    pickle.dump(new_df.to_dict(),f)

### Get the top 5 recommendations to store

In [37]:
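def get_top_recommend(movie_name):
    # .index[0] is an index label, not a row position. The two can differ here
    # because dropna() removed rows without resetting the index.
    movie_index = new_df[new_df["title"] == movie_name].index[0]
    try:
        # Rank every movie by similarity, then skip the first entry
        # (the row compared with itself) and keep the next five.
        rec = sorted(enumerate(similarity[movie_index]), reverse=True, key=lambda x: x[1])[1:6]
        title = [new_df.iloc[i]["title"] for i, _ in rec]
        id_ = [new_df.iloc[i]["movie_id"] for i, _ in rec]
        return title, id_
    except IndexError:
        # similarity is indexed by position, so a label beyond its last row
        # ends up here and the movie name is returned unchanged.
        return movie_name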

In [38]:
get_top_recommend("Signed, Sealed, Delivered")

'Signed, Sealed, Delivered'
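The fallback result above is consistent with a mismatch between index labels and row positions: after rows are dropped, the largest label can exceed the size of the similarity matrix. The sketch below reproduces the effect on a hypothetical four-row frame and shows one possible positional lookup with np.flatnonzero. The names df, sim and top_similar are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Dropping a row leaves a gap in the labels: 0, 2, 3 for positions 0, 1, 2.
df = pd.DataFrame({"title": ["A", "B", "C", "D"]}).drop(index=1)

# Hypothetical 3x3 similarity matrix, one row per remaining movie (by position).
sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 1.0],
])

def top_similar(title, k=2):
    """Return the k most similar titles, looked up by row position."""
    matches = np.flatnonzero(df["title"].to_numpy() == title)
    if matches.size == 0:
        return []
    pos = matches[0]
    order = np.argsort(-sim[pos], kind="stable")  # highest similarity first
    order = order[order != pos][:k]               # drop the movie itself
    return df["title"].iloc[order].tolist()

label = int(df[df["title"] == "D"].index[0])
print(label, sim.shape[0])  # 3 3 -> sim[label] would be out of bounds
print(top_similar("D"))     # ['A', 'C']
```

Calling reset_index(drop=True) right after dropping rows is another way to keep labels and positions aligned.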

In [39]:
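recommendations = {}
for i in new_df['title']:
    result = get_top_recommend(i)
    # get_top_recommend returns the bare title when it could not compute
    # recommendations; store only the successful (titles, ids) results.
    if isinstance(result, tuple):
        title, id_ = result
        recommendations[i] = {"title": title, "id": id_}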

In [40]:
# Use a context manager so the file is closed after writing.
with open("similarity_dict.pkl", "wb") as f:
    pickle.dump(recommendations, f)