# Content-Based Movie Recommender System

### Importing Pandas

In [28]:
import pandas as pd

### Loading the Data into a DataFrame

In [29]:
df = pd.read_csv('data/tmdb_5000_movies.csv')

### Previewing the Data

In [30]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


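The preview shows that several columns (for example `homepage`, `budget`, and `revenue`) are unlikely to help with content-based recommendations, and that the `genres` and `keywords` columns are stored as JSON-like strings that need to be parsed before use.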

### Filtering Out Irrelevant Columns

In [31]:
df = df.loc[:, ['genres', 'keywords', 'overview', 'title']]

### Parsing and Extracting Key Info from Certain Columns

In [32]:
import ast

def extract_strings(string_in):
    '''
    Parses a JSON-like string and extracts the 'name' field of each entry.
    input: a string encoding a list of dictionaries, e.g. '[{"id": 28, "name": "Action"}]'
    return: a single space-separated string of names, better suited for vectorization
            (an empty string if the input cannot be parsed).
    '''
    try:
        info = ast.literal_eval(string_in)  # safely evaluates the string as a Python literal (here, a list of dicts)
        return ' '.join(entry['name'] for entry in info)  # keep only each entry's 'name' value
    except (ValueError, SyntaxError, TypeError, KeyError):
        # malformed or missing input: fall back to an empty string
        return ''
    
extract_cols = ['genres', 'keywords']
for col in extract_cols:
    df[col] = df[col].apply(extract_strings)
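
To make the parsing step concrete, here is a minimal sketch of what happens to a single cell. The `sample` string is made up in the style of the `genres` column rather than copied from the dataset, and the `json.loads` comparison is only an illustration that the cell is also valid JSON, not part of the notebook's pipeline.

```python
import ast
import json

# Hypothetical cell value, written in the style of the 'genres' column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# ast.literal_eval turns the string into a list of dictionaries.
parsed = ast.literal_eval(sample)

# Keep only each entry's 'name' and join the names with spaces.
names = ' '.join(entry['name'] for entry in parsed)
print(names)  # Action Adventure

# The same string is valid JSON, so json.loads yields the same structure here.
same_as_json = json.loads(sample) == parsed
print(same_as_json)  # True
```

One consequence of joining with spaces is that a multi-word name such as "Science Fiction" becomes two separate words in the resulting string, so a word-level vectorizer will likely treat them as independent terms.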

### Checking for and Replacing Null or NaN Values

In [33]:
df.isnull().sum()

genres      0
keywords    0
overview    3
title       0
dtype: int64

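The `overview` column has three null values. These can be a problem for vectorization, since the vectorizer expects string input and a missing value would carry over into any column built by concatenating `overview` with other text. We resolve this by replacing them with empty strings.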

In [34]:
df['overview'] = df['overview'].fillna("")

### Previewing the Preprocessed Data

In [35]:
df.head()

Unnamed: 0,genres,keywords,overview,title
0,Action Adventure Fantasy Science Fiction,culture clash future space war space colony so...,"In the 22nd century, a paraplegic Marine is di...",Avatar
1,Adventure Fantasy Action,ocean drug abuse exotic island east india trad...,"Captain Barbossa, long believed to be dead, ha...",Pirates of the Caribbean: At World's End
2,Action Adventure Crime,spy based on novel secret agent sequel mi6 bri...,A cryptic message from Bond’s past sends him o...,Spectre
3,Action Crime Drama Thriller,dc comics crime fighter terrorist secret ident...,Following the death of District Attorney Harve...,The Dark Knight Rises
4,Action Adventure Science Fiction,based on novel mars medallion space travel pri...,"John Carter is a war-weary, former military ca...",John Carter


### TF-IDF Vectorization

We'll combine the fields that carry descriptive text (the movie overview, genres, and keywords) into a single `metadata` column, which serves as the input to the vectorizer.

In [36]:
df['metadata'] = df['overview'] + ' ' + df['genres'] + ' ' + df['keywords'] # create space-separated column with all key info

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_idf = TfidfVectorizer(stop_words = 'english') # ignore common English words such as "the" and "and"
tf_idf_mtx = tf_idf.fit_transform(df['metadata']) # sparse matrix: one row per movie, one column per vocabulary term
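
For intuition about what the weights in this matrix represent, the following is a rough, pure-Python sketch of TF-IDF on three made-up documents. It assumes the smoothed inverse document frequency and L2 normalisation that scikit-learn documents as `TfidfVectorizer` defaults, and it skips stop-word removal and real tokenization, so it is an illustration rather than a reimplementation. The names `docs` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Three tiny made-up "documents" standing in for rows of df['metadata'].
docs = [
    "space war alien",
    "space travel mars",
    "haunted house ghost",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return an L2-normalised {term: weight} dict for one document."""
    counts = Counter(tokens)
    # Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1.
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]

# 'space' occurs in two documents, so it is weighted lower than 'alien'.
print(vectors[0]['space'] < vectors[0]['alien'])  # True
```

The point of the weighting is that terms shared by many movies contribute less to a match than terms that are distinctive for a few movies.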

In [38]:
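def input_process(user_input, tf_idf):
    '''
    Vectorizes the user's input with the already-fitted TF-IDF vectorizer.
    inputs: user input string, fitted TfidfVectorizer
    output: sparse TF-IDF vector of the input (1 x vocabulary size)
    '''
    # `user_input` avoids shadowing Python's built-in input()
    return tf_idf.transform([user_input])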

### Cosine Similarity to Find Similar Movies

In [39]:
from sklearn.metrics.pairwise import cosine_similarity

def movie_rec(user_input, df, tf_idf, tf_idf_mtx, n = 5):
    '''
    Compares the vectorized user input with the TF-IDF matrix of the dataset and prints the n most similar movies by cosine similarity.
    input: user input string, dataframe, fitted TfidfVectorizer, TF-IDF matrix generated from the dataset, number of movies to recommend
    output: None; prints the titles of the top n movies, ordered from least to most similar
    '''
    ip_vec = input_process(user_input, tf_idf)
    cos_sim = cosine_similarity(ip_vec, tf_idf_mtx) # shape: (1, number of movies)
    idx = cos_sim[0].argsort()[-n:] # argsort is ascending, so the last n indices belong to the n most similar movies

    output = df.iloc[idx]['title'].values # titles of the n most similar movies

    print(f"Here are {n} Movies You'd like:\n" + '\n'.join(output))
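
As a small worked example of the ranking step, the sketch below computes cosine similarity by hand on sparse vectors stored as dictionaries and keeps the two best matches. The titles, weights, and the helper name `cosine` are all made up for illustration, and the results here are sorted from most to least similar.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)

# Made-up term weights, for illustration only.
movies = {
    "Movie A": {"space": 0.6, "war": 0.8},
    "Movie B": {"space": 0.5, "mars": 0.5, "travel": 0.7},
    "Movie C": {"ghost": 0.7, "house": 0.7},
}
query = {"space": 1.0, "war": 1.0}

scores = {title: cosine(query, vec) for title, vec in movies.items()}

# Sort titles from most to least similar and keep the top 2.
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(top)  # ['Movie A', 'Movie B']
```

A query that shares no terms with any movie scores 0 against all of them, in which case the "top" results are effectively arbitrary. A check for an all-zero score row before printing recommendations may be worth considering.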

In [40]:
query = input('What type of movie do you like?')
movie_rec(query, df, tf_idf, tf_idf_mtx)

Here are 5 Movies You'd like:
My Big Fat Independent Movie
Scary Movie 2
Grindhouse
The Conjuring
Grave Encounters
