# Advanced Algorithms Project

* Nireeksha D Rai 
* PES1UG20CS668


A hands-on approach to building a content-based recommender.
It uses the well-known MovieLens dataset to show how movies can be recommended based on their features.

# Dataset Description
* The dataset contains 1M anonymous ratings of approximately 4,000 movies made by 6,000 MovieLens users, and was released in February 2003.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import combinations
import seaborn as sns

# Read data and display

In [5]:
users = pd.read_csv('/content/users.dat', sep='::',
                        engine='python',encoding='latin-1',
                        names=['userid', 'gender', 'age', 'occupation', 'zip']).set_index('userid')
ratings = pd.read_csv('/content/ratings.dat', engine='python',encoding='latin-1',
                          sep='::', names=['userid', 'movieid', 'rating', 'timestamp'])
movies = pd.read_csv('/content/movies.dat', engine='python',encoding='latin-1',
                         sep='::', names=['movieid', 'title', 'genre']).set_index('movieid')

In [6]:
ratings.shape

(599613, 4)

In [7]:
ratings.sample(5)

,userid,movieid,rating,timestamp
451804,2781,3185,3.0,973019789.0
364631,2124,2617,5.0,974653766.0
431073,2627,1370,2.0,973629294.0
220638,1338,1287,5.0,974777777.0
382651,2236,551,4.0,974596508.0


In [8]:
movies.sample(5) 

movieid,title,genre
2385,Home Fries (1998),Comedy|Romance
1519,Broken English (1996),Drama
3145,"Cradle Will Rock, The (1999)",Drama
3683,Blood Simple (1984),Drama|Film-Noir
1922,Whatever (1998),Drama


In [9]:
users.head()

userid,gender,age,occupation,zip
1,F,1,10,48067
2,M,56,16,70072
3,M,25,15,55117
4,M,45,7,2460
5,M,25,20,55455


# Genres and building the recommender
The genres alone can provide a reasonably good content-based recommendation.
This section builds a fairly simple recommender based on the movie genres. A common approach is to encode them with a tf-idf vectorizer.

In [12]:
genre_popularity = (movies.genre.str.split('|')
                      .explode()
                      .value_counts()
                      .sort_values(ascending=False))
genre_popularity.head(10)

Drama         1603
Comedy        1200
Action         503
Thriller       492
Romance        471
Horror         343
Adventure      283
Sci-Fi         276
Children's     251
Crime          211
Name: genre, dtype: int64

# tf-idf
To obtain the tf-idf vectors, sklearn's `TfidfVectorizer` is used.
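
As a rough illustration of what the vectorizer computes, the sketch below reimplements tf-idf by hand on a toy corpus. It assumes the weighting documented for scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper name `tfidf_rows` and the toy genre lists are made up for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Hand-rolled tf-idf over tokenized documents (lists of tokens).

    Assumes smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2-normalized rows.
    """
    n = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({tok: w / norm for tok, w in weights.items()})
    return rows

# Toy corpus: three "movies", each a list of genres (hypothetical data)
toy_docs = [["Drama"], ["Drama", "Sci-Fi"], ["Comedy", "Drama"]]
toy_rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, toy_rows):
    print(doc, {tok: round(w, 3) for tok, w in row.items()})
```

In this toy corpus "Drama" appears in every document, so it receives the lowest idf and the rarer genres dominate the rows they appear in.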

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
s = "Animation Children's Comedy"
tf_wrong = TfidfVectorizer(analyzer='word', ngram_range=(1,2))
tf_wrong.fit([s])
tf_wrong.get_feature_names_out()

array(['animation', 'animation children', 'children', 'children comedy',
       'comedy'], dtype=object)

In [21]:
[c for i in range(1,2) for c in combinations(s.split(), r=i)]

[('Animation',), ("Children's",), ('Comedy',)]
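
The `tf_wrong` output hints at why the default word analyzer is a poor fit for genre strings. The sketch below mimics it with a regular expression assumed here to match scikit-learn's default `token_pattern` (`(?u)\b\w\w+\b`), applied after lowercasing, and compares it with a plain split on `|`. The sample string is illustrative.

```python
import re

genre_string = "Animation|Children's|Comedy|Sci-Fi"

# Approximation of the default word analyzer: lowercase, then keep runs of
# two or more word characters (assumed default token_pattern)
word_tokens = re.findall(r"(?u)\b\w\w+\b", genre_string.lower())

# Splitting on the dataset's own delimiter keeps each genre intact
genre_tokens = genre_string.split("|")

print(word_tokens)   # genres are lowercased and broken at punctuation
print(genre_tokens)  # one token per genre
```

Word-level tokens break "Sci-Fi" into two pieces and reduce "Children's" to "children", whereas splitting on `|` preserves the genre labels exactly as they appear in the data.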

The recommender instead uses a custom analyzer that splits each genre string on `|` and emits every combination of up to 3 genres (`range(1,4)`).

In [24]:
tf = TfidfVectorizer(analyzer=lambda s: (c for i in range(1,4)
                                             for c in combinations(s.split('|'), r=i)))
tfidf_matrix = tf.fit_transform(movies['genre'])
tfidf_matrix.shape

(3883, 353)
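
The lambda passed as `analyzer` can be hard to read, so here is the same idea as a named function, run on a single genre string. `genre_features` and `max_size` are names chosen for this sketch; `max_size=3` corresponds to `range(1,4)` in the cell above.

```python
from itertools import combinations

def genre_features(genre_string, max_size=3):
    """Return every combination of 1..max_size genres from a '|'-separated string."""
    genres = genre_string.split("|")
    return [combo
            for size in range(1, max_size + 1)
            for combo in combinations(genres, r=size)]

features = genre_features("Drama|Mystery|Sci-Fi|Thriller")
print(len(features))   # 4 singles + 6 pairs + 4 triples
print(features[:6])
```

A movie with four genres therefore contributes 14 features, and no single feature covers all four genres at once.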

In [26]:
pd.DataFrame(tfidf_matrix.todense(), columns=tf.get_feature_names_out(), index=movies.title).sample(5, axis=1).sample(10, axis=0)

title,"(Comedy, Horror, Sci-Fi)","(Children's, Drama, Sci-Fi)","(Children's, Comedy, Sci-Fi)","(Fantasy, Musical)","(Comedy, Crime, Fantasy)"
"Kid, The (1921)",0.0,0.0,0.0,0.0,0.0
"Corrina, Corrina (1994)",0.0,0.0,0.0,0.0,0.0
"Baby, The (1973)",0.0,0.0,0.0,0.0,0.0
Mr. Magoo (1997),0.0,0.0,0.0,0.0,0.0
"Eyes of Tammy Faye, The (2000)",0.0,0.0,0.0,0.0,0.0
Hero (1992),0.0,0.0,0.0,0.0,0.0
Three to Tango (1999),0.0,0.0,0.0,0.0,0.0
28 Days (2000),0.0,0.0,0.0,0.0,0.0
Normal Life (1996),0.0,0.0,0.0,0.0,0.0
Sleepers (1996),0.0,0.0,0.0,0.0,0.0


# Similarity between vectors
To find similar movies, each movie's genres are encoded as a tf-idf vector; what remains is to define a proximity measure between those vectors. A commonly used measure is cosine similarity.

In [27]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix)
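
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms. The sketch below computes it by hand for small made-up vectors; the function name `cosine` and the convention of returning 0.0 for an all-zero vector are choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up feature weights for three movies
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # shares no features with a

print(cosine(a, b))   # close to 1: identical direction
print(cosine(a, c))   # 0: no features in common
```

Because the measure depends only on direction, two movies with exactly the same genre string score 1 regardless of how their weights are scaled.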

In [28]:
cosine_sim_df = pd.DataFrame(cosine_sim, index=movies['title'], columns=movies['title'])
print('Shape:', cosine_sim_df.shape)
cosine_sim_df.sample(5, axis=1).round(2)

Shape: (3883, 3883)


title,Erin Brockovich (2000),"Money Pit, The (1986)",Maximum Risk (1996),"South Park: Bigger, Longer and Uncut (1999)",Butterfly (La Lengua de las Mariposas) (2000)
Toy Story (1995),0.00,0.17,0.00,0.62,0.00
Jumanji (1995),0.00,0.00,0.08,0.00,0.00
Grumpier Old Men (1995),0.00,0.40,0.00,0.11,0.00
Waiting to Exhale (1995),0.39,0.45,0.00,0.13,0.11
Father of the Bride Part II (1995),0.00,1.00,0.00,0.28,0.00
...,...,...,...,...,...
Meet the Parents (2000),0.00,1.00,0.00,0.28,0.00
Requiem for a Dream (2000),1.00,0.00,0.00,0.00,0.28
Tigerland (2000),1.00,0.00,0.00,0.00,0.28
Two Family House (2000),1.00,0.00,0.00,0.00,0.28


In [29]:
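def genre_recommendations(i, M, items, k=10):
    """
    Recommends movies based on a similarity dataframe

    Parameters
    ----------
    i : str
        Movie (index of the similarity dataframe)
    M : pd.DataFrame
        Similarity dataframe, symmetric, with movies as indices and columns
    items : pd.DataFrame
        Contains both the title and some other features used to define similarity
    k : int
        Number of recommendations to return

    """
    # Partially sort so that the last k+1 positions hold the k+1 largest
    # similarities in order; one extra slot is needed because the movie
    # itself may be among them (the original range stopped at k-1)
    ix = M.loc[:,i].to_numpy().argpartition(range(-1,-(k+2),-1))
    closest = M.columns[ix[-1:-(k+2):-1]]
    # Drop the query movie if it made it into the top k+1
    closest = closest.drop(i, errors='ignore')
    # Attach the item features (e.g. genre) and keep at most k rows
    return pd.DataFrame(closest).merge(items).head(k)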
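
To make the selection step concrete, here is a simplified stand-in for the function above that works on a plain dictionary of similarities and uses `heapq.nlargest` rather than `argpartition`. The similarity values and the name `top_k_similar` are invented for illustration.

```python
import heapq

def top_k_similar(title, sim, k=2):
    """Return the k titles most similar to `title`, excluding `title` itself.

    `sim` maps each title to a dict of {other_title: similarity}.
    """
    candidates = ((other, score) for other, score in sim[title].items()
                  if other != title)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Hypothetical similarity scores, symmetric with 1.0 on the diagonal
toy_sim = {
    "A": {"A": 1.0, "B": 0.9, "C": 0.1, "D": 0.5},
    "B": {"A": 0.9, "B": 1.0, "C": 0.2, "D": 0.4},
    "C": {"A": 0.1, "B": 0.2, "C": 1.0, "D": 0.3},
    "D": {"A": 0.5, "B": 0.4, "C": 0.3, "D": 1.0},
}

print(top_k_similar("A", toy_sim))
```

Filtering out the query title before ranking avoids having to request one extra item and drop it afterwards, at the cost of scanning every candidate.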


# Testing the recommender

In [30]:
movies[movies.title.eq('2001: A Space Odyssey (1968)')]

movieid,title,genre
924,2001: A Space Odyssey (1968),Drama|Mystery|Sci-Fi|Thriller


In [31]:
genre_recommendations('2001: A Space Odyssey (1968)', cosine_sim_df, movies[['title', 'genre']])

,title,genre
0,"X-Files: Fight the Future, The (1998)",Mystery|Sci-Fi|Thriller
1,"Client, The (1994)",Drama|Mystery|Thriller
2,"Talented Mr. Ripley, The (1999)",Drama|Mystery|Thriller
3,Communion (1989),Drama|Sci-Fi|Thriller
4,Gattaca (1997),Drama|Sci-Fi|Thriller
5,"Thirteenth Floor, The (1999)",Drama|Sci-Fi|Thriller
6,Event Horizon (1997),Action|Mystery|Sci-Fi|Thriller
7,2010 (1984),Mystery|Sci-Fi
8,Stalker (1979),Mystery|Sci-Fi
9,Deep Impact (1998),Action|Drama|Sci-Fi|Thriller


In [32]:
print(movies[movies.title.eq('Contact (1997)')])

                  title         genre
movieid                              
1584     Contact (1997)  Drama|Sci-Fi


In [33]:
genre_recommendations('Contact (1997)', cosine_sim_df, movies[['title', 'genre']])

,title,genre
0,Nineteen Eighty-Four (1984),Drama|Sci-Fi
1,Twelve Monkeys (1995),Drama|Sci-Fi
2,"Day the Earth Stood Still, The (1951)",Drama|Sci-Fi
3,Solaris (Solyaris) (1972),Drama|Sci-Fi
4,Powder (1995),Drama|Sci-Fi
5,"Goodbye, 20th Century (Zbogum na dvadesetiot v...",Drama|Sci-Fi
6,Until the End of the World (Bis ans Ende der W...,Drama|Sci-Fi
7,Conceiving Ada (1997),Drama|Sci-Fi
8,"Brother from Another Planet, The (1984)",Drama|Sci-Fi
9,Close Encounters of the Third Kind (1977),Drama|Sci-Fi


In [34]:
movies[movies.title.eq('Jungle Book, The (1967)')]

movieid,title,genre
2078,"Jungle Book, The (1967)",Animation|Children's|Comedy|Musical


In [35]:
genre_recommendations('Jungle Book, The (1967)', cosine_sim_df, movies[['title', 'genre']])

,title,genre
0,Steamboat Willie (1940),Animation|Children's|Comedy|Musical
1,Aladdin (1992),Animation|Children's|Comedy|Musical
2,Hercules (1997),Adventure|Animation|Children's|Comedy|Musical
3,"Little Mermaid, The (1989)",Animation|Children's|Comedy|Musical|Romance
4,Lady and the Tramp (1955),Animation|Children's|Comedy|Musical|Romance
5,Alice in Wonderland (1951),Animation|Children's|Musical
6,Cinderella (1950),Animation|Children's|Musical
7,Beauty and the Beast (1991),Animation|Children's|Musical
8,"Lion King, The (1994)",Animation|Children's|Musical
9,Cats Don't Dance (1997),Animation|Children's|Musical


In [36]:
movies[movies.title.eq('Saving Private Ryan (1998)')]

movieid,title,genre
2028,Saving Private Ryan (1998),Action|Drama|War


In [37]:
genre_recommendations('Saving Private Ryan (1998)', cosine_sim_df, movies[['title', 'genre']])

,title,genre
0,"Fighting Seabees, The (1944)",Action|Drama|War
1,Glory (1989),Action|Drama|War
2,"Boat, The (Das Boot) (1981)",Action|Drama|War
3,Full Metal Jacket (1987),Action|Drama|War
4,"Patriot, The (2000)",Action|Drama|War
5,G.I. Jane (1997),Action|Drama|War
6,Heaven & Earth (1993),Action|Drama|War
7,"Thin Red Line, The (1998)",Action|Drama|War
8,Braveheart (1995),Action|Drama|War
9,"Longest Day, The (1962)",Action|Drama|War
