### Movie Recommendation System 
This movie recommendation system is a machine learning project built on data obtained from Kaggle. It makes use of two datasets:
- tmdb_5000_credits.csv
- tmdb_5000_movies.csv

The end result is a content-based recommendation system, which suggests movies whose descriptive metadata resembles that of a chosen movie.

#### Importing Python libraries


In [2]:
import numpy as np
import pandas as pd
import ast

### Data Preprocessing
#### Importing datasets


In [3]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

We will merge the two datasets on the title column.


In [4]:
movies = movies.merge(credits, on='title')     # on= names the column used as the merge key
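
Merging on title assumes that titles are unique. If two different movies share a title, the join pairs every matching row on one side with every matching row on the other. The sketch below uses small made-up frames (the column names and values are invented for illustration, not read from the TMDB files) to show the effect, how the validate argument of merge can flag it, and how a join on an identifier column avoids it.

```python
import pandas as pd

# Toy frames, invented for illustration: two different films share a title.
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Alpha", "Remake", "Remake"],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Alpha", "Remake", "Remake"],
    "cast": ["a", "b", "c"],
})

# Joining on a non-unique key pairs each "Remake" row with both credit rows.
by_title = toy_movies.merge(toy_credits, on="title")
print(len(by_title))  # 5 rows: 1 for Alpha + 2 x 2 for Remake

# pandas can check the uniqueness assumption and raise if it does not hold.
try:
    toy_movies.merge(toy_credits, on="title", validate="one_to_one")
    unique_titles = True
except pd.errors.MergeError:
    unique_titles = False
print(unique_titles)  # False

# Joining on the identifier columns keeps one row per film.
by_id = toy_movies.merge(
    toy_credits, left_on="id", right_on="movie_id", suffixes=("", "_credits")
)
print(len(by_id))  # 3
```

Whether this matters for the real files depends on whether any titles repeat, which can be checked with movies['title'].duplicated().sum() before merging.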

We will now only keep the columns that will benefit us in creating the recommendation system.
These columns are: movie_id, title, overview, genres, keywords, cast, and crew

In [5]:
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]

In the end we want only three columns in the dataset: title, movie_id, and tags (which will be created by merging the other columns).

In [6]:
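# Dropping the rows that have null values.
# Reassigning (instead of inplace=True) avoids modifying a column subset of the merged dataframe in place.
movies = movies.dropna()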
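
Before dropping rows it can be useful to see how many are affected. The following is a minimal sketch on a made-up frame (the values are invented for illustration); the same two calls apply to the movies dataframe.

```python
import pandas as pd

# Toy frame, invented for illustration: one row has a missing overview.
toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "overview": ["A story.", None, "Another story."],
})

# Count missing values per column.
null_counts = toy.isnull().sum()
print(null_counts["overview"])  # 1

# Drop every row that contains at least one missing value.
cleaned = toy.dropna()
print(len(toy) - len(cleaned))  # 1 row removed
```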

In [None]:
# We will create a helper function which extracts the genre names from the genres column

def convert(obj):               # obj is a string that represents a list of dictionaries
    L = []
    for i in ast.literal_eval(obj):               # Parsing the string, then going over each dictionary
        L.append(i['name'])                       # Extracting the name of the genre
    return L

In [8]:
movies['genres'] = movies['genres'].apply(convert)

In [9]:
# Similarly applying it on the keywords
movies['keywords'] = movies['keywords'].apply(convert)
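
To see what the helper does to a single cell, here is a minimal sketch on a hand-written string in the same shape as the genres column (the string below is illustrative, not copied from the file). Since the cell text looks like JSON, json.loads would probably work as well; that is an assumption about the files rather than something verified here.

```python
import ast
import json

# One cell of the column: a string, not yet a Python list.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# literal_eval safely turns the string into a list of dictionaries.
parsed = ast.literal_eval(raw)
names = [entry["name"] for entry in parsed]
print(names)  # ['Action', 'Adventure']

# For this particular string, the JSON parser gives the same result.
print(json.loads(raw) == parsed)  # True
```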

In [10]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [11]:
# Creating a function that will help us in extracting the names of the first three cast members

def convert3(obj):               # obj is a string that represents a list of dictionaries
    L = []
    counter = 0
    for i in ast.literal_eval(obj):               # Going over each dictionary
        if counter != 3:                          # Since we require only the first three cast members
            L.append(i['name'])                   # Extracting the name of the cast member
            counter += 1
        else:
            break
    return L

In [12]:
movies['cast'] = movies['cast'].apply(convert3)

In [13]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."


In [14]:
# We need a function that will extract only the name of the director from the crew column
# We are looking for the dictionary in which the job is 'Director'

def fetch_director(obj):               # obj is a string that represents a list of dictionaries
    L = []
    for i in ast.literal_eval(obj):               # Going over each dictionary
        if i['job'] == 'Director':
            L.append(i['name'])                   # Extracting the name of the director
            break                                 # Stopping at the first director found
    return L

In [15]:
movies['crew'] = movies['crew'].apply(fetch_director)
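
The two helpers above can also be written more compactly. The sketch below is an equivalent formulation under hypothetical names (first_three_names and first_director are not part of the notebook), tried on hand-written strings rather than on the real columns.

```python
import ast

def first_three_names(obj):
    """Return the names from the first three dictionaries in the parsed list."""
    return [entry["name"] for entry in ast.literal_eval(obj)[:3]]

def first_director(obj):
    """Return a one-element list with the first director's name, or an empty list."""
    for entry in ast.literal_eval(obj):
        if entry.get("job") == "Director":
            return [entry["name"]]
    return []

# Hand-written sample cells, invented for illustration.
cast_cell = '[{"name": "A"}, {"name": "B"}, {"name": "C"}, {"name": "D"}]'
crew_cell = '[{"job": "Editor", "name": "E"}, {"job": "Director", "name": "F"}]'

print(first_three_names(cast_cell))  # ['A', 'B', 'C']
print(first_director(crew_cell))     # ['F']
print(first_director('[]'))          # []
```

Both versions return lists, which keeps the later concatenation of columns working even when a movie has no listed director.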

In [16]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [17]:
# Converting the overview from a string to a list of words so that it can be concatenated with the other list columns
movies['overview'] = movies['overview'].apply(lambda x: x.split())

In [18]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",[James Cameron]


In [19]:
# We will remove the spaces inside each entry so that a multi-word name becomes a single token
# (for example 'Sam Worthington' becomes 'SamWorthington' and is not mixed up with other people named Sam).
movies['genres'] = movies['genres'].apply(lambda x:[i.replace(" ","") for i in x])
movies['keywords'] = movies['keywords'].apply(lambda x:[i.replace(" ","") for i in x])
movies['cast'] = movies['cast'].apply(lambda x:[i.replace(" ","") for i in x])
movies['crew'] = movies['crew'].apply(lambda x:[i.replace(" ","") for i in x])
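
The reason for removing the spaces can be shown with plain Python. This is a minimal sketch that assumes tokens are later produced by splitting on whitespace; the two names are used only as an example.

```python
names = ["Sam Worthington", "Sam Mendes"]

# With spaces kept, each name splits into two tokens,
# so two different people share the token "sam".
split_tokens = [set(n.lower().split()) for n in names]
shared_with_spaces = split_tokens[0] & split_tokens[1]
print(shared_with_spaces)  # {'sam'}

# With spaces removed, each name is one token and nothing is shared.
joined_tokens = [{n.replace(" ", "").lower()} for n in names]
shared_without_spaces = joined_tokens[0] & joined_tokens[1]
print(shared_without_spaces)  # set()
```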

In [20]:
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron]


In [21]:
# Creating a tags column which will be a concatenation of overview, genres, keywords, cast, and crew
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
movies.head(1)

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin...","[Action, Adventure, Fantasy, ScienceFiction]","[cultureclash, future, spacewar, spacecolony, ...","[SamWorthington, ZoeSaldana, SigourneyWeaver]",[JamesCameron],"[In, the, 22nd, century,, a, paraplegic, Marin..."
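
The + operator works here because every column holds Python lists, and adding two such columns concatenates the lists row by row. A minimal sketch on a made-up frame (values invented for illustration):

```python
import pandas as pd

# Toy frame, invented for illustration: each cell is a list of tokens.
toy = pd.DataFrame({
    "overview": [["a", "heist"], ["space", "war"]],
    "genres": [["Crime"], ["ScienceFiction"]],
})

# Row-wise list concatenation.
toy["tags"] = toy["overview"] + toy["genres"]
print(toy["tags"].tolist())
# [['a', 'heist', 'Crime'], ['space', 'war', 'ScienceFiction']]
```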


In [22]:
# Creating a new dataframe which consists of only the required columns.
# .copy() makes new_df independent of movies, so assigning to its columns does not trigger SettingWithCopyWarning.

new_df = movies[['movie_id', 'title', 'tags']].copy()
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"[In, the, 22nd, century,, a, paraplegic, Marin..."


In [23]:
# Converting the tags column from a list of words into a single string

new_df['tags'] = new_df['tags'].apply(lambda x: " ".join(x))
new_df.head(1)

Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di..."


In [25]:
# Converting the tags column to lower case so that matching does not depend on capitalization

new_df['tags'] = new_df['tags'].apply(lambda x: x.lower())
new_df.head(1)


Unnamed: 0,movie_id,title,tags
0,19995,Avatar,"in the 22nd century, a paraplegic marine is di..."
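
The joining and lower-casing steps can also be expressed with the pandas .str accessor instead of apply. This is an alternative sketch on a made-up one-row frame, not the notebook's own code:

```python
import pandas as pd

# Toy frame, invented for illustration: one cell holding a list of tokens.
toy = pd.DataFrame({"tags": [["In", "the", "22nd", "century,", "SamWorthington"]]})

# .str.join concatenates each list with a separator; .str.lower lower-cases the result.
toy["tags"] = toy["tags"].str.join(" ").str.lower()
print(toy["tags"].iloc[0])  # in the 22nd century, samworthington
```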


### Vectorization

With data preprocessing complete, we will now perform text vectorization.