# Movies Recommender System

<img src='http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg' width=500>

In this notebook we will work with a dataset of movies from IMDB. The data comprise 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. The notebook is adapted from [here](https://github.com/rounakbanik/movies/).

## Content Based Recommender

We are going to build a recommender that computes the similarity between movies based on their descriptions and suggests the movies most similar to one that a user liked.

Our content-based recommender will use movie overviews and taglines.

### Load libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Read and inspect the data

In [None]:
smd = pd.read_csv("movies_data_small.csv")

In [None]:
smd.head(3)

In [None]:
smd.columns

The data contain information about movies collected from IMDB. This information includes the title, an overview and a tagline for each movie. It also includes a rating in the form of votes. 

Let's create a new 'description' column that combines the overview and the tagline.

In [None]:
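# Fill missing text before concatenating: a NaN on either side would make the whole result NaN
smd['tagline'] = smd['tagline'].fillna('')
# Join with a space so the last word of the overview does not fuse with the first word of the tagline
smd['description'] = smd['overview'].fillna('') + ' ' + smd['tagline']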
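
When text columns are concatenated in pandas, a missing value on either side makes the result missing, which is why it can help to fill both columns before combining them. A minimal sketch of this behaviour, using an invented two-row frame (the column names mirror the notebook's, the rows are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; the rows are invented for illustration only.
toy = pd.DataFrame({
    "overview": ["A heist goes wrong.", None],
    "tagline": ["Trust no one.", "Fear the dark."],
})

# Plain concatenation: the missing overview makes the second result missing too.
naive = toy["overview"] + toy["tagline"]

# Filling both columns first keeps whatever text is available.
safe = (toy["overview"].fillna("") + " " + toy["tagline"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```

In the naive version the tagline of the second movie is lost along with the missing overview; the filled version keeps it.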

### Natural language processing

As we will be analysing natural language in the movie descriptions, we need to construct features from plain text. There are a number of steps that can be taken in this process. The TfidfVectorizer class used below converts the description of each movie into a vector of features, with each term as a feature. It does this for every movie in our data, returning a matrix with one row per movie and one column per term found in the descriptions.

The TfidfVectorizer does not just count words, but calculates their Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF reflects how important a word is to a document in a collection: a word scores highly when it is frequent in that document but rare across the other documents. With the settings below, the TfidfVectorizer also removes English stop words and creates features not only for single words but also for pairs of consecutive words (unigrams and bigrams).

In [None]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [None]:
tfidf_matrix.shape

### Cosine Similarity

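We use cosine similarity to calculate a numeric score that denotes how similar two movies are. It is the cosine of the angle between their TF-IDF vectors: close to 1 when the two descriptions share many highly weighted terms and 0 when they share none.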

In [None]:
cosine_sim = cosine_similarity(tfidf_matrix)

In [None]:
cosine_sim[0]
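
As a rough illustration of what each entry of such a similarity matrix represents, the sketch below computes the cosine similarity by hand for three made-up vectors and compares it with scikit-learn's result. The vectors and the helper `cosine` are assumptions introduced for this example only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up feature vectors, for illustration only.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # points in the same direction as a
c = np.array([0.0, 0.0, 3.0])   # shares no non-zero feature with a

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over the product of norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairwise matrix for the three vectors, as in the notebook's cosine_sim.
sims = cosine_similarity(np.vstack([a, b, c]))

print(cosine(a, b), cosine(a, c))
print(sims.round(3))
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors with no feature in common score 0.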

We now have a pairwise cosine similarity matrix for all the movies in our dataset. Let's write a function that returns the most similar movies based on the cosine similarity score.

In [None]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [None]:
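def get_recommendations(title):
    """Return the titles of the 30 movies most similar to `title`."""
    idx = indices[title]  # row of this movie in cosine_sim (assumes the title is unique)
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0, which is normally the movie itself (similarity 1)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]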
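
The lookup above relies on each title appearing exactly once. One possible variant, sketched here on invented toy data rather than the real `titles` and `cosine_sim`, ranks with `np.argsort`, excludes the query movie by index instead of by position, and falls back to the first match when a title is duplicated. The names `toy_titles`, `toy_sim`, `toy_indices` and `top_n_similar` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented toy data standing in for the notebook's titles and similarity matrix.
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

def top_n_similar(title, n=2):
    """Return the n titles most similar to `title`, excluding the movie itself."""
    if title not in toy_indices.index:
        raise KeyError(f"Unknown title: {title!r}")
    idx = toy_indices[title]
    # A duplicated title yields a Series of positions; use the first one.
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    order = np.argsort(-toy_sim[idx])   # positions sorted by similarity, highest first
    order = order[order != idx][:n]     # drop the query movie, keep the top n
    return toy_titles.iloc[order]

print(top_n_similar("A").tolist())
```

Excluding the query by index rather than by position also covers the edge case where another movie ties with it at the top of the ranking.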

We're all set. Let's now get the top recommendations for a few movies and see how good they are.

In [None]:
get_recommendations('Tsotsi').head(10)

In [None]:
get_recommendations('District 9').head(10)

We see that for District 9, our system is able to identify it as a South African film and subsequently recommends other South African films as its top recommendations. This is great if we are only interested in this kind of movie, but the kind of person who likes District 9 would probably also like other movies that are not directly related.