# Recommendation system

This recommendation system takes an __item-to-item__ approach: the input is a movie the user has already watched, and the output is a list of similar movies. It uses the free-text tags attached to movies to enhance the recommendations. It is composed of the following steps:

- tag summarizing:
    - tokenization & stemming
    - TF-IDF algorithm
    - DBSCAN clustering
    - tag labelling
    - pivot label data
- movie data preprocessing:
    - preprocessing movie dataset (genres)
    - pivot genre data & merge tags
- recommendation algorithm:
    - TF-IDF algorithm
    - compute cosine similarity distances
    - sort movies

In [2]:
import pandas as pd
from sklearn.cluster import KMeans, DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer
import collections
from pprint import pprint

## Tag Summarizing

### Tokenization & Stemming

The first step in grouping the tags is to split each tag into words, remove stopwords such as "and" and "the", and reduce each remaining word to its stem, stripping its inflection. At the end of this step we have a vocabulary that TF-IDF uses to estimate the similarity between tags.


In [3]:
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.corpus import stopwords

# Build the stemmer and the stopword set once instead of on every call.
# Requires the NLTK tokenizer and 'stopwords' resources (see nltk.download).
stemmer = SnowballStemmer('english')
sw = set(stopwords.words('english'))

def tag_tokenize(tag):
    # Split the tag into words, drop stopwords and very short tokens, then stem.
    tokens = nltk.word_tokenize(tag)
    stems = [stemmer.stem(t) for t in tokens if t not in sw and len(t) > 2]
    return stems
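
To illustrate why stemming helps group tags, here is a minimal sketch. It assumes a regex tokenizer and a tiny hard-coded stopword set as stand-ins (the names `DEMO_STOPWORDS`, `demo_stemmer` and `demo_tokenize` are hypothetical), so that it runs without downloading any NLTK corpus. It is not the notebook's `tag_tokenize`.

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Hypothetical stand-ins: a tiny stopword set and a regex tokenizer,
# so the sketch needs no NLTK corpus download.
DEMO_STOPWORDS = {"the", "and", "of", "a", "an", "in"}
demo_stemmer = SnowballStemmer("english")

def demo_tokenize(tag):
    """Lowercase, split into words, drop stopwords/short tokens, then stem."""
    tokens = re.findall(r"[a-z0-9']+", tag.lower())
    return [demo_stemmer.stem(t) for t in tokens
            if t not in DEMO_STOPWORDS and len(t) > 2]

# Two spellings of the same idea reduce to the same stems,
# which is what lets TF-IDF treat them as near-identical tags.
print(demo_tokenize("time traveling"))
print(demo_tokenize("The time travel"))
```

Both calls should yield the same list of stems, so the two tags end up with the same TF-IDF vector.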
    

### Tag Similarities & Clustering

In this step we estimate the similarity between tags from the extracted tokens, then group the tags with DBSCAN clustering.

I chose DBSCAN because it bounds the distance between neighbouring points in a cluster (the `eps` parameter) and, more importantly, because the number of clusters is not a parameter. That is useful here, since the number of groups varies wildly depending on the tags present.

In [4]:
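def cluster_tags(texts):
    # Work on a plain list so positional indexing is safe even if the
    # Series index has gaps (e.g. after dropping NaN rows).
    texts = list(texts)
    # min_df=1 (the default) keeps every term, just like 0 would;
    # newer scikit-learn releases may reject an integer min_df of 0.
    text_frequencies = TfidfVectorizer(tokenizer=tag_tokenize,
                                       lowercase=True,
                                       ngram_range=(1, 2), min_df=1).fit_transform(texts)

    # A tiny eps with cosine distance only groups near-identical tags.
    clustering_model = DBSCAN(eps=0.001, min_samples=5, metric='cosine').fit(text_frequencies)
    substitutive_class = {}      # tag text -> cluster id
    labels = {-1: 'indefinite'}  # cluster id -> representative tag (-1 is DBSCAN noise)

    for idx, label in enumerate(clustering_model.labels_):
        if label not in labels:
            labels[label] = texts[idx]
        substitutive_class[texts[idx]] = label

    return substitutive_class, labels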
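
The behaviour described above (no cluster count given, unmatched tags marked as noise) can be seen on a toy example. This is a sketch under assumptions: the tags are made up, the default `TfidfVectorizer` tokenizer replaces `tag_tokenize`, and `min_samples` is lowered to 2 so the tiny dataset can form clusters.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy tags (made up for illustration): two repeated tags and one singleton.
toy_tags = ["funny", "Funny", "funny", "dark comedy", "dark comedy", "time travel"]

# Default TfidfVectorizer tokenization; the notebook plugs in tag_tokenize here.
toy_vectors = TfidfVectorizer(lowercase=True).fit_transform(toy_tags)

# min_samples=2 is lowered for the toy data; the notebook uses 5.
toy_model = DBSCAN(eps=0.001, min_samples=2, metric="cosine").fit(toy_vectors)

# Each tag gets a cluster id; -1 marks tags that joined no cluster.
for tag, label in zip(toy_tags, toy_model.labels_):
    print(f"{label:>2}  {tag}")
```

The repeated tags fall into two clusters that DBSCAN discovers on its own, while the singleton tag gets the noise label `-1`, which `cluster_tags` maps to `'indefinite'`.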

In [5]:
tags = pd.read_csv("ml-latest-small/tags.csv")
tags = tags[tags.tag.notna()]

### Tag Labeling & Pivoting

The labels obtained above feed the recommendation system. For that to work, the data is pivoted so that each movie has an array of tag groups that describe it.

The tag data is very sparse: many movies have no tags to relate them to other movies. To support the recommendation engine, it is therefore necessary to also use the genre information in the movie dataset.

In [6]:
clusters, labels = cluster_tags(tags.tag)
tag_rework = tags.replace({'tag': clusters}).groupby(['movieId', 'tag']).size().reset_index()
tag_pivot = tag_rework.pivot(index='movieId', columns='tag')
tag_pivot.columns = tag_pivot.columns.droplevel()
tag_pivot.rename(columns=labels, inplace=True)
tag_pivot.head()
#xxx = xxx.drop(-1, axis=1)

movieId,indefinite,funny,will ferrell,drugs,Leonardo DiCaprio,Al Pacino,gangster,mafia,holocaust,twist ending,...,sad,fun,heartwarming,touching,existentialism,philosophical,gritty,cerebral,beautiful,poignant
1,,,,,,,,,,,...,,1.0,,,,,,,,
2,3.0,,,,,,,,,,...,,,,,,,,,,
3,2.0,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
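
The count-then-pivot step can be sketched on a small made-up table. The column name `n` and the toy data are assumptions for illustration; the notebook itself pivots the cluster labels produced by `cluster_tags`.

```python
import pandas as pd

# Toy tag assignments (made up): one row per (movie, tag) application.
toy = pd.DataFrame({
    "movieId": [1, 1, 2, 2, 2, 3],
    "tag": ["fun", "pixar", "fun", "fun", "magic", "magic"],
})

# Count how often each tag was applied to each movie ...
counts = toy.groupby(["movieId", "tag"]).size().reset_index(name="n")

# ... then spread the tags into columns, one row per movie.
toy_pivot = counts.pivot(index="movieId", columns="tag", values="n")
print(toy_pivot)
```

Cells for tags a movie never received come out as `NaN`, which is why the real pivot above is mostly empty.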


## Movies data preprocessing

The genre information is stored as a single pipe-separated string. The following code splits out each individual genre and pivots it so the movie data can be merged with the tag data.

In [7]:
movies = pd.read_csv("ml-latest-small/movies.csv")
movies.genres = movies.genres.apply(lambda x: x.split('|'))
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),[Comedy]


In [8]:
newmovies = movies.genres.apply(pd.Series) \
    .merge(movies, right_index = True, left_index = True) \
    .drop(["genres"], axis = 1) \
    .melt(id_vars = ['movieId', 'title'], value_name = "genre") \
    .drop("variable", axis = 1) \
    .dropna()
newmovies['n']=1
newmovies[newmovies['movieId'] == 1].head()

Unnamed: 0,movieId,title,genre,n
0,1,Toy Story (1995),Adventure,1
9742,1,Toy Story (1995),Animation,1
19484,1,Toy Story (1995),Children,1
29226,1,Toy Story (1995),Comedy,1
38968,1,Toy Story (1995),Fantasy,1
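
As an alternative sketch (not what the notebook runs), the same one-row-per-genre layout can probably be reached more directly with `str.split` and `DataFrame.explode`, available in pandas 0.25 and later. The two rows below are copied from the movie table above; the variable names are hypothetical.

```python
import pandas as pd

# Two rows taken from the movies table shown earlier.
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})

# Split the pipe-separated string into a list, then emit one row per genre.
long_genres = (
    toy_movies.assign(genre=toy_movies["genres"].str.split("|"))
    .explode("genre")
    .drop(columns="genres")
    .assign(n=1)
)
print(long_genres)
```

This avoids the wide intermediate frame created by `apply(pd.Series)` followed by `melt`.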


### Pivot Genre Data & Merge Tags

After separating each genre, it is possible to pivot the genre data and merge it with the tag data. The result is the attribute vector that defines each movie, based on its genres and its tag groups.

In [9]:
movies_pivot = newmovies.pivot(index='movieId', columns='genre', values='n')
tabela_final = pd.concat([movies_pivot, tag_pivot], axis=1).fillna(0)
tabela_final.head()

movieId,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,...,sad,fun,heartwarming,touching,existentialism,philosophical,gritty,cerebral,beautiful,poignant
1,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
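
The merge relies on `pd.concat` aligning rows by `movieId`. A minimal sketch with made-up pivots (all names and values here are hypothetical) shows how movies without tags end up with zeros:

```python
import pandas as pd

# Toy pivots (made up): genres cover every movie, tags only some of them.
toy_genres = pd.DataFrame({"Comedy": [1, 1, 0], "Fantasy": [1, 0, 1]},
                          index=pd.Index([1, 2, 3], name="movieId"))
toy_tags = pd.DataFrame({"fun": [2.0], "magic": [1.0]},
                        index=pd.Index([1], name="movieId"))

# Column-wise concat aligns on movieId; movies without tags get NaN,
# which fillna(0) turns into "tag never applied".
toy_features = pd.concat([toy_genres, toy_tags], axis=1).fillna(0)
print(toy_features)
```

Every movie keeps its genre columns, so even untagged movies still have a non-empty feature vector.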


## Recommendation Algorithm

Using the feature vectors assembled above, I weight them with TF-IDF and compute the cosine distance between the movie I want recommendations for and every movie in the dataset.

In [11]:
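from sklearn.feature_extraction.text import TfidfTransformer
from scipy.spatial.distance import cdist
def recommend(title):
    input_id = movies[movies['title']==title].movieId.iloc[0]

    # Re-weight the genre/tag counts with TF-IDF.
    vec = TfidfTransformer()
    model = vec.fit_transform(tabela_final.values).toarray()
    # loc[[input_id]] keeps the query 2-D (one row), as transform and cdist expect.
    inp = vec.transform(tabela_final.loc[[input_id]].values).toarray()

    # Row positions in tabela_final, ordered from most to least similar.
    out = cdist(inp, model, 'cosine').argsort()
    return out[0]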

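def recommend_simple(title):
    input_id = movies[movies['title']==title].movieId.iloc[0]

    # NOTE: `wtf` is not defined in this notebook; it is presumably a
    # movie-by-feature table indexed by movieId, like tabela_final.
    vec = TfidfTransformer()
    model = vec.fit_transform(wtf.fillna(0).values).toarray()
    inp = vec.transform(wtf.loc[[input_id]].fillna(0).values).toarray()

    # Unlike recommend(), return the raw distances without sorting.
    out = cdist(inp, model, 'cosine')
    return out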
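
The ranking idea can be checked on a toy feature matrix. This is a sketch under assumptions: the matrix is made up and the `toy_*` names are hypothetical. The calls themselves (`TfidfTransformer`, `cdist` with `'cosine'`, `argsort`) are the ones `recommend` uses.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfTransformer

# Toy feature matrix (made up): rows are movies, columns are genre/tag counts.
toy_X = np.array([
    [1, 1, 0, 2],   # movie 0 (the query)
    [1, 1, 0, 0],   # movie 1: shares two features with movie 0
    [0, 0, 1, 0],   # movie 2: shares nothing with movie 0
    [1, 0, 0, 1],   # movie 3: shares two features with movie 0
])

# TF-IDF down-weights features that appear in many movies.
toy_weights = TfidfTransformer().fit_transform(toy_X).toarray()

# cdist needs 2-D inputs, hence the [0:1] slice instead of [0].
toy_dist = cdist(toy_weights[0:1], toy_weights, "cosine")[0]

# Smallest distance first: the query itself, then its nearest neighbours.
toy_ranking = toy_dist.argsort()
print(toy_ranking)
```

The query movie always ranks first because its distance to itself is zero, which is why the recommendation tables list the input title in the first row.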

In [15]:
ratings = pd.read_csv("ml-latest-small/ratings.csv")

In [101]:
rating_mean = ratings.groupby(['movieId'], as_index = False, sort = False).mean().rename(columns = {'rating': 'rating_mean'})[['movieId','rating_mean']]
adjusted_ratings = pd.merge(ratings,rating_mean,on = 'movieId', how = 'left', sort = False)
adjusted_ratings['rating_adjusted']=adjusted_ratings['rating']-adjusted_ratings['rating_mean']
# replace adjusted ratings of exactly 0 with 1e-8 to avoid a zero denominator
adjusted_ratings.loc[adjusted_ratings['rating_adjusted'] == 0, 'rating_adjusted'] = 1e-8
adjusted_ratings

Unnamed: 0,userId,movieId,rating,timestamp,rating_mean,rating_adjusted
0,1,1,4.0,964982703,3.920930,7.906977e-02
1,1,3,4.0,964981247,3.259615,7.403846e-01
2,1,6,4.0,964982224,3.946078,5.392157e-02
3,1,47,5.0,964983815,3.975369,1.024631e+00
4,1,50,5.0,964982931,4.237745,7.622549e-01
5,1,70,3.0,964982400,3.509091,-5.090909e-01
6,1,101,5.0,964980868,3.782609,1.217391e+00
7,1,110,4.0,964982176,4.031646,-3.164557e-02
8,1,151,5.0,964984041,3.545455,1.454545e+00
9,1,157,5.0,964984100,2.863636,2.136364e+00


In [100]:
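def predict_ratings(user_id, movie_id):
    # `out` is the full pairwise cosine-distance matrix computed in cell In [23].
    movie_index = movies[movies['movieId'] == movie_id].index[0]
    # Row positions of every movie this user has rated.
    movie_ids = movies[movies['movieId'].isin(ratings[ratings['userId']==user_id].movieId)].index
    # Mean-centred ratings of this user; this assumes ratings and movies are both
    # ordered by movieId so that the ratings line up with movie_ids.
    user_ratings = adjusted_ratings[adjusted_ratings['userId']==user_id].rating_adjusted.values
    # Cosine similarity between the target movie and each rated movie.
    w = 1.0-out[movie_index][movie_ids]
    # Movie mean plus the similarity-weighted average of the mean-centred ratings.
    return list(adjusted_ratings[adjusted_ratings['movieId']==movie_id].rating_mean)[0]+(sum(user_ratings*w))/sum(w)

predict_ratings(1, 91658)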

4.2657040001300608
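
As a reading aid, the computation in `predict_ratings` can be written as the formula below. The symbols are my own notation, not the notebook's: $\bar{r}_i$ is the mean rating of movie $i$, $R_u$ is the set of movies rated by user $u$, and $d_{\cos}(i, j)$ is the cosine distance stored in `out`.

$$
\hat{r}_{u,i} = \bar{r}_i + \frac{\sum_{j \in R_u} w_{ij}\,\bigl(r_{u,j} - \bar{r}_j\bigr)}{\sum_{j \in R_u} w_{ij}},
\qquad w_{ij} = 1 - d_{\cos}(i, j)
$$

In words: start from the movie's average rating and shift it by how far the user tends to rate above or below average on similar movies.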

In [25]:
1.-out[0]

array([ 1.        ,  0.12023611,  0.02464271, ...,  0.        ,
        0.15033203,  0.09554217])

In [23]:
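# Pairwise cosine distances between all movies (rows of tabela_final).
vec = TfidfTransformer()
model = vec.fit_transform(tabela_final.values).toarray()
inp = vec.transform(tabela_final.values).toarray()

# out[i][j] is the cosine distance between movie rows i and j.
out = cdist(inp, model, 'cosine')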

In [204]:
a = recommend('Toy Story (1995)')

movies.loc[a].head(10)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
1757,2355,"Bug's Life, A (1998)","[Adventure, Animation, Children, Comedy]"
2355,3114,Toy Story 2 (1999),"[Adventure, Animation, Children, Comedy, Fantasy]"
8695,122918,Guardians of the Galaxy 2 (2017),"[Action, Adventure, Sci-Fi]"
3000,4016,"Emperor's New Groove, The (2000)","[Adventure, Animation, Children, Comedy, Fantasy]"
2809,3754,"Adventures of Rocky and Bullwinkle, The (2000)","[Adventure, Animation, Children, Comedy, Fantasy]"
1706,2294,Antz (1998),"[Adventure, Animation, Children, Comedy, Fantasy]"
6194,45074,"Wild, The (2006)","[Adventure, Animation, Children, Comedy, Fantasy]"
8927,136016,The Good Dinosaur (2015),"[Adventure, Animation, Children, Comedy, Fantasy]"
3568,4886,"Monsters, Inc. (2001)","[Adventure, Animation, Children, Comedy, Fantasy]"


In [208]:
a = recommend('Jumanji (1995)')

movies.loc[a].head(10)

Unnamed: 0,movieId,title,genres
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]"
6751,59501,"Chronicles of Narnia: Prince Caspian, The (2008)","[Adventure, Children, Fantasy]"
3638,4993,"Lord of the Rings: The Fellowship of the Ring,...","[Adventure, Fantasy]"
4800,7153,"Lord of the Rings: The Return of the King, The...","[Action, Adventure, Drama, Fantasy]"
4137,5952,"Lord of the Rings: The Two Towers, The (2002)","[Adventure, Fantasy]"
8296,106489,"Hobbit: The Desolation of Smaug, The (2013)","[Adventure, Fantasy, IMAX]"
701,919,"Wizard of Oz, The (1939)","[Adventure, Children, Fantasy, Musical]"
6774,60074,Hancock (2008),"[Action, Adventure, Comedy, Crime, Fantasy]"
6905,63992,Twilight (2008),"[Drama, Fantasy, Romance, Thriller]"
6505,53464,Fantastic Four: Rise of the Silver Surfer (2007),"[Action, Adventure, Sci-Fi]"


## Next Steps

I want to compare the performance of this recommendation engine against a similar item-to-item recommendation system.
There is probably a lot of ground to cover on this topic. Synopses are also a good source of text to work with for movies.

It would be nice to combine the tags with the user ratings in a user-to-item recommendation strategy.