## Recommender Systems:
- __Simple recommenders:__ offer generalized recommendations to every user, based on movie popularity and/or genre. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience. IMDB Top 250 is an example of this system.
- __Content-based recommenders:__ suggest similar items based on a particular item. This system uses item metadata, such as genre, director, description, actors, etc. for movies, to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.
- __Collaborative filtering engines:__ these systems try to predict the rating or preference that a user would give an item-based on past ratings and preferences of other users. Collaborative filters do not require item metadata like its content-based counterparts.

In [252]:
import pandas as pd
data = pd.read_csv('data/movies_metadata.csv', low_memory = False)

In [253]:
data.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


### Simple Recommenders

In [254]:
'''
Weighted Rating(WR) = (v/(v+m))*R + (m/(v+m))*C
v: the number of votes for the movie (vote_count)
m: the minimum votes required to be considered for recommendation
R: the average rating of a movie (vote_average)
C: the mean vote across the whole movies
'''
C = data['vote_average'].mean()
m = data['vote_count'].quantile(.90)

In [255]:
# qualified movies to to considered for recommendations
q_movies = data.copy().loc[data['vote_count'] >= m]
# data.shape
# q_movie.shape

In [256]:
# function that computes the weighted rating of each movie
def weighted_rating(movie, m=m, C=C):
    v = movie['vote_count']
    R = movie['vote_average']
    WR = (v/(v+m)*R) + (m/(m+v)*C)
    return WR

In [257]:
# create a new feature 'score'
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [258]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

q_movies[['title', 'vote_count', 'vote_average', 'score']].head(5)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.445869
834,The Godfather,6024.0,8.5,8.425439
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453
12481,The Dark Knight,12269.0,8.3,8.265477
2843,Fight Club,9678.0,8.3,8.256385


### Content-Based Recommenders

In [228]:
# To avoid memory crash for cosine_simiarity we only use q_movies
#Print plot overviews of the first 5 movies.
q_movies['overview'].head()

314      Framed in the 1940s for the double murder of h...
834      Spanning the years 1945 to 1955, a chronicle o...
10309    Raj is a rich, carefree, happy-go-lucky second...
12481    Batman raises the stakes in his war on crime. ...
2843     A ticking-time-bomb insomniac and a slippery s...
Name: overview, dtype: object

The TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs.

In [235]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Remove all english stop words such as 'the', 'a'
vectorizer = TfidfVectorizer(stop_words = 'english')
# Replace NaN with an empty string
q_movies['overview'] = q_movies['overview'].fillna('')
matrix = vectorizer.fit_transform(q_movies['overview'])
matrix.shape

(4555, 19694)

In [246]:
from sklearn.metrics.pairwise import linear_kernel  # to compute the cosine_similarity between all movies
import numpy as np
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(matrix, matrix)

print(cosine_sim.shape)

''' code to compute similarity in batches!
# Change chunk_size to control resource consumption and speed
# Higher chunk_size means more memory/RAM needed but also faster   
chunk_size = 500 
matrix_len = matrix.shape[0] # Not sparse numpy.ndarray
cosine_sim = np.array([])  # to keep the final result of similarities
def similarity_cosine_by_chunk(start, end):
    if end > matrix_len:
        end = matrix_len
    return cosine_similarity(X=matrix[start:end], Y=matrix) # scikit-learn function

for chunk_start in range(0, matrix_len, chunk_size):
    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start+chunk_size)
    if cosine_sim.shape[0] == 0:
        cosine_sim =  np.copy(cosine_similarity_chunk)
    else:
        cosine_sim =  np.concatenate((cosine_sim,cosine_similarity_chunk), axis=0)

print(cosine_sim.shape)
'''

(4555, 4555)


' code to compute similarity in batches!\n# Change chunk_size to control resource consumption and speed\n# Higher chunk_size means more memory/RAM needed but also faster   \nchunk_size = 500 \nmatrix_len = matrix.shape[0] # Not sparse numpy.ndarray\ncosine_sim = np.array([])  # to keep the final result of similarities\ndef similarity_cosine_by_chunk(start, end):\n    if end > matrix_len:\n        end = matrix_len\n    return cosine_similarity(X=matrix[start:end], Y=matrix) # scikit-learn function\n\nfor chunk_start in range(0, matrix_len, chunk_size):\n    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start+chunk_size)\n    if cosine_sim.shape[0] == 0:\n        cosine_sim =  np.copy(cosine_similarity_chunk)\n    else:\n        cosine_sim =  np.concatenate((cosine_sim,cosine_similarity_chunk), axis=0)\n\nprint(cosine_sim.shape)\n'

In [250]:
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return q_movies['title'].iloc[movie_indices]


#Construct a reverse map of indices and movie titles
indices = pd.Series(q_movies.index, index=q_movies['title']).drop_duplicates()
print(get_recommendations('The Shawshank Redemption'))

20597    The Incredible Burt Wonderstone
23861             Magic in the Moonlight
15930                    The Illusionist
23331                      The Immigrant
17652                       Fright Night
11634                               Next
4603      The Curse of the Jade Scorpion
1003                 Alice in Wonderland
1135                        Delicatessen
42209                           Chocolat
Name: title, dtype: object


### Credits, Genres and Keywords Based Recommender
- It goes without saying that the quality of a recommender would be increased with the usage of better metadata. 
- This is a recommender based on these metadata: the 3 top actors, the director, related genres and the movie plot keywords.

In [None]:
import numpy as np

# Load keywords and credits
credits = pd.read_csv('data/credits.csv')
keywords = pd.read_csv('data/keywords.csv')

# Remove rows with bad IDs.
data = data.drop([19730, 29503, 35587])

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
data['id'] = data['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
data = data.merge(credits, on='id')
data = data.merge(keywords, on='id')

In [None]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    data[feature] = data[feature].apply(literal_eval)

In [None]:
# Get the director's name from the crew feature. If director is not listed, return NaN
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
# Returns the list top 3 elements or entire list; whichever is more.
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [None]:
# Define new director, cast, genres and keywords features that are in a suitable form.
data['director'] = data['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    data[feature] = data[feature].apply(get_list)
# Print the new features of the first 3 films
data[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    data[feature] = data[feature].apply(clean_data)

In [None]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)

Here we use the CountVectorizer() instead of TF-IDF. This is because we do not want to down-weight the presence of an actor/director if he or she has acted or directed in relatively more movies.

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

In [None]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])
get_recommendations('The Dark Knight Rises', cosine_sim2)

### Collaborative Filtering
- __User-based Filtering:__ these systems recommend products to a user that similar users have liked. For example, let's say Alice and Bob have a similar interest in books (that is, they largely like and dislike the same books). Now, let's say a new book has been launched into the market and Alice has read and loved it. It is therefore, highly likely that Bob will like it too and therefore, the system recommends this book to Bob.

- __Item-based Filtering:__ these systems are extremely similar to the content recommendation engine that you built. These systems identify similar items based on how people have rated it in the past. For example, if Alice, Bob and Eve have given 5 stars to The Lord of the Rings and The Hobbit, the system identifies the items as similar. Therefore, if someone buys The Lord of the Rings, the system also recommends The Hobbit to him or her.

### Three Similarity metrics

- __Jaccard Similarity:__ Similarity is based on the number of users which have rated item A and B divided by the number of users who have rated either A or B. It is typically used where we don’t have a numeric rating but just a boolean value like a product being bought or an add being clicked
- __Cosine Similarity:__ Similarity is the cosine of the angle between the 2 vectors of the item vectors of A and B. The closer the vectors, the smaller will be the angle and larger the cosine.
- __Pearson Similarity:__ Similarity is the pearson coefficient between the two vectors.