## Week 8
## Exercise 1: Collaborative Filtering

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [36]:
links = pd.read_csv("movielens/links.csv")
movies = pd.read_csv("movielens/movies.csv")
ratings = pd.read_csv("movielens/ratings.csv")
tags = pd.read_csv("movielens/tags.csv")

In [37]:
links.isna().sum()

movieId    0
imdbId     0
tmdbId     8
dtype: int64

### Q1) Read about the movielens dataset and write down a summary of metadata.

## Metadata:

The data was collected by GroupLens Research for MovieLens, their movie recommender system project. It consists of four CSV files: movies.csv, ratings.csv, links.csv and tags.csv. Every file carries the movieId identifier, and ratings.csv and tags.csv also carry userId. The data was collected over a period of about 22 years. Users were selected at random and anonymized, and only users who had rated at least 20 movies are included. Only movies with at least one rating or tag are included.

### Columns

- Identifiers:
    - userId - random, anonymized IDs that identify users. This is the only way to recognize a user, as no demographic information is given. The IDs form a complete sequence from 1 to 610.
    - movieId - IDs that identify movies. Since only movies with at least one rating or tag are included, the IDs do not form a complete sequence.

- movies.csv:
    - title - the title of the movie along with its year of release.
    - genres - a pipe-separated list of genres. Possible values: action, adventure, animation, children's, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, western, (no genres listed).

- ratings.csv:
    - rating - the rating given by the user on a 5-star scale. Maximum value: 5; minimum value: 0.5; step: 0.5 (half-star increments).
    - timestamp - when the movie was rated, in seconds since midnight UTC of January 1, 1970.

- links.csv:
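    - imdbId - identifier of the movie on imdb.com, from which the link to its IMDb page can be built.
    - tmdbId - identifier of the movie on themoviedb.org, from which the link to its TMDB page can be built.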

- tags.csv:
    - tag - a word or short phrase that describes the user's impression of the movie.
    - timestamp - when the tag was applied, in seconds since midnight UTC of January 1, 1970.
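
The rating range, step size and timestamp units listed above can be checked directly on the data. The following is a minimal sketch that uses a small invented `sample` frame in place of ratings.csv; `summarize_ratings` is a hypothetical helper name, not something provided by pandas or the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ratings.csv; the rows are invented for illustration.
sample = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [1, 3, 1, 6, 3],
    "rating": [4.0, 3.5, 0.5, 5.0, 4.5],
    "timestamp": [964982703, 964981247, 1445714835, 1445714841, 1306463578],
})

def summarize_ratings(ratings):
    """Return basic facts about a ratings table (hypothetical helper)."""
    levels = np.sort(ratings["rating"].unique())
    # Smallest gap between observed rating levels; 0.5 on a half-star scale.
    step = float(np.diff(levels).min()) if len(levels) > 1 else None
    return {
        "n_users": int(ratings["userId"].nunique()),
        "n_movies": int(ratings["movieId"].nunique()),
        "min_rating": float(levels[0]),
        "max_rating": float(levels[-1]),
        "step": step,
    }

summary = summarize_ratings(sample)

# Timestamps are seconds since the Unix epoch, so unit="s" converts them to UTC datetimes.
rated_at = pd.to_datetime(sample["timestamp"], unit="s", utc=True)
```

With the notebook's variables, the same check would presumably be `summarize_ratings(ratings)`.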


### Q2) Read the “ratings.csv” file and create a pivot table with index=‘userId’, columns=‘movieId’, values=‘rating’.

In [38]:
df = pd.pivot_table(ratings, index="userId", columns="movieId", values="rating", fill_value=0.0)
df

| userId (rows) / movieId (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 193565 | 193567 | 193571 | 193573 | 193579 | 193581 | 193583 | 193585 | 193587 | 193609 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0 | 0.0 | 4.0 | 0.0 | 0.0 | 4.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 5 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 606 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.5 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 607 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 608 | 2.5 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 4.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
| 609 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 4.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0 |
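
Filling the empty cells with 0.0 makes "not rated" look the same as a very low rating, which affects any distance computed on this table. As a rough illustration, the sketch below builds the pivot on invented toy data without `fill_value`, so unrated pairs stay as NaN, and measures how sparse the table is. The names `toy`, `sparse` and `dense` are placeholders.

```python
import pandas as pd

# Toy ratings in the same long format as ratings.csv (invented values).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating": [4.0, 3.0, 5.0, 2.5, 4.5],
})

# Without fill_value, unrated (user, movie) pairs stay as NaN.
sparse = toy.pivot_table(index="userId", columns="movieId", values="rating")

# Same table with unrated pairs replaced by 0.0, as in the cell above.
dense = sparse.fillna(0.0)

# Fraction of user-movie pairs that have no rating at all.
sparsity = float(sparse.isna().to_numpy().mean())
```

On the toy data 4 of the 9 cells are empty; on a real ratings table the share is typically far higher.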


### Q3) sklearn.metrics.pairwise_distances can be used to compute the distance between all pairs of users. pairwise_distances() takes a metric parameter that selects the distance measure. Use the cosine metric to measure similarity among users (cosine distance is 1 minus cosine similarity, so a smaller distance means more similar users). Use the following packages.

### Q4) from sklearn.metrics import pairwise_distances
### Q5) from scipy.spatial.distance import cosine, correlation

In [39]:
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine, correlation

In [40]:
distance = pairwise_distances(df, metric='cosine')
distance

array([[0.00000000e+00, 9.72717135e-01, 9.40279738e-01, ...,
        7.08902628e-01, 9.06428070e-01, 8.54679193e-01],
       [9.72717135e-01, 3.33066907e-16, 1.00000000e+00, ...,
        9.53789046e-01, 9.72434599e-01, 8.97573246e-01],
       [9.40279738e-01, 1.00000000e+00, 1.11022302e-16, ...,
        9.78871538e-01, 1.00000000e+00, 9.67881252e-01],
       ...,
       [7.08902628e-01, 9.53789046e-01, 9.78871538e-01, ...,
        1.11022302e-16, 8.78007286e-01, 6.77945142e-01],
       [9.06428070e-01, 9.72434599e-01, 1.00000000e+00, ...,
        8.78007286e-01, 1.11022302e-16, 9.46774537e-01],
       [8.54679193e-01, 8.97573246e-01, 9.67881252e-01, ...,
        6.77945142e-01, 9.46774537e-01, 0.00000000e+00]])
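
The matrix above holds cosine distances, not similarities: values near 0 mean very similar users and values near 1 mean users with no overlap. A small sketch on invented data, assuming scikit-learn is installed, shows the relation between the two:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Three toy users rating four movies (0 = not rated); invented values.
X = np.array([
    [4.0, 0.0, 5.0, 0.0],
    [4.0, 0.0, 5.0, 0.0],   # identical to the first user
    [0.0, 3.0, 0.0, 2.0],   # no movies in common with the first user
])

dist = pairwise_distances(X, metric="cosine")
sim = cosine_similarity(X)

# Cosine distance is 1 - cosine similarity (up to floating-point error).
relation_holds = bool(np.allclose(dist, 1.0 - sim))
```

The tiny non-zero values on the diagonal of the notebook output (such as 1.11e-16) are floating-point error of the same kind.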

### Q6) Find the 5 users most similar to the user with userId 25.

In [41]:
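# Row 24 holds user 25 (userIds run from 1 to 610 without gaps).
# Position 0 of the sorted order is user 25 itself (distance ~0), so skip it; +1 maps row positions back to userIds.
distance[24].argsort()[1:6] + 1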

array([189, 209, 407,  30, 515], dtype=int64)
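
The `+ 1` above works only because the userIds are contiguous and start at 1. A more general sketch looks the target up by label and removes it explicitly instead of assuming it sorts first. `top_k_similar` is a hypothetical helper and the toy matrix is invented.

```python
import numpy as np
import pandas as pd

def top_k_similar(dist, labels, target, k=5):
    """Return the k labels closest to `target`, excluding `target` itself.

    `dist` is a square distance matrix whose rows/columns follow `labels`.
    Hypothetical helper, not part of pandas or scikit-learn.
    """
    labels = pd.Index(labels)
    pos = labels.get_loc(target)
    order = np.argsort(dist[pos], kind="stable")
    # Drop the target explicitly instead of assuming it sorts first.
    order = order[order != pos]
    return labels[order[:k]].tolist()

# Toy 4-user distance matrix (invented values); user IDs are not contiguous.
toy_ids = [1, 2, 5, 9]
toy_dist = np.array([
    [0.0, 0.9, 0.2, 0.5],
    [0.9, 0.0, 0.7, 0.1],
    [0.2, 0.7, 0.0, 0.4],
    [0.5, 0.1, 0.4, 0.0],
])

neighbours = top_k_similar(toy_dist, toy_ids, target=5, k=2)
```

With the notebook's variables, the equivalent call would presumably be `top_k_similar(distance, df.index, 25)`.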

### Q7) Use the “movies” dataset to find the names of the movies that user 1 and user 338 have both watched, and how each of them rated those movies.

In [42]:
set1 = set(ratings.loc[ratings.userId==1, 'movieId'])
set2 = set(ratings.loc[ratings.userId==338, 'movieId'])
final = set1.intersection(set2)

for i in final:
    print("Movie ID "+ str(i) + " : " + movies.loc[movies.movieId==i, 'title'].values[0])
    print('Rating by User 1: ', ratings.loc[((ratings.userId==1) & (ratings.movieId==i)), 'rating'].values[0])
    print('Rating by User 338: ', ratings.loc[((ratings.userId==338) & (ratings.movieId==i)), 'rating'].values[0])


Movie ID 296 : Pulp Fiction (1994)
Rating by User 1:  3.0
Rating by User 338:  4.5
Movie ID 2959 : Fight Club (1999)
Rating by User 1:  5.0
Rating by User 338:  4.5
Movie ID 527 : Schindler's List (1993)
Rating by User 1:  5.0
Rating by User 338:  5.0
Movie ID 593 : Silence of the Lambs, The (1991)
Rating by User 1:  4.0
Rating by User 338:  4.0
Movie ID 50 : Usual Suspects, The (1995)
Rating by User 1:  5.0
Rating by User 338:  4.5
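
The same comparison can be expressed as two joins, which returns the result as a table rather than printed lines. This is a sketch on invented toy data; `common_ratings` is a hypothetical helper whose column names follow ratings.csv and movies.csv.

```python
import pandas as pd

def common_ratings(ratings, movies, user_a, user_b):
    """Movies rated by both users, with each user's rating side by side.

    Hypothetical helper; column names follow ratings.csv and movies.csv.
    """
    a = ratings.loc[ratings["userId"] == user_a, ["movieId", "rating"]]
    b = ratings.loc[ratings["userId"] == user_b, ["movieId", "rating"]]
    # Inner join keeps only movies present in both users' histories.
    both = a.merge(b, on="movieId", suffixes=(f"_{user_a}", f"_{user_b}"))
    return both.merge(movies[["movieId", "title"]], on="movieId")

# Toy data (invented) standing in for the real CSV files.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 20, 30],
    "rating": [5.0, 3.0, 4.0, 4.5, 4.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

shared = common_ratings(toy_ratings, toy_movies, 1, 2)
```

With the notebook's variables, the call would presumably be `common_ratings(ratings, movies, 1, 338)`.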


### Q8) Use the movies dataset to find the names of the movies that user 2 and user 338 have in common and that both rated at least 4.0.

In [43]:
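# Movies that each user rated 4.0 or higher
set1 = set(ratings.loc[((ratings.userId==2) & (ratings.rating>=4.0)), 'movieId'])
set2 = set(ratings.loc[((ratings.userId==338) & (ratings.rating>=4.0)), 'movieId'])
# pop() returns a single arbitrary element, so any further common movies would not be shown
set1.intersection(set2).pop()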

6874

In [44]:
print('Movie ID 6874: ', movies.loc[movies.movieId==6874, 'title'].values[0])

Movie ID 6874:  Kill Bill: Vol. 1 (2003)
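
Because `pop()` yields only one element, a variant that returns every shared movie may be safer when the intersection could hold more than one. A minimal sketch on invented toy data, with `liked_by_both` as a hypothetical helper name:

```python
import pandas as pd

def liked_by_both(ratings, movies, user_a, user_b, min_rating=4.0):
    """Titles that both users rated at least `min_rating` (hypothetical helper)."""
    liked = ratings[ratings["rating"] >= min_rating]
    ids_a = set(liked.loc[liked["userId"] == user_a, "movieId"])
    ids_b = set(liked.loc[liked["userId"] == user_b, "movieId"])
    common = sorted(ids_a & ids_b)
    # Look up titles for every shared movie, not just one of them.
    titles = movies.set_index("movieId")["title"]
    return titles.loc[common].to_dict()

# Toy data (invented) standing in for the real CSV files.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 10, 20, 30],
    "rating": [5.0, 3.0, 4.0, 4.0, 4.5, 4.5],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

result = liked_by_both(toy_ratings, toy_movies, 1, 2)
```

With the notebook's variables, the call would presumably be `liked_by_both(ratings, movies, 2, 338)`.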


### Q9) Create a pivot table for representing the similarity among movies using correlation.

In [45]:
# Attach titles to the ratings, then build a title x user matrix (unrated cells become 0)
df = ratings.merge(movies, on='movieId')

df = pd.pivot_table(df, index='title', columns='userId', values='rating', fill_value=0)
df

| title (rows) / userId (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| '71 (2014) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 4.0 |
| 'Hellboy': The Seeds of Creation (2004) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 'Round Midnight (1986) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 'Salem's Lot (2004) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 'Til There Was You (1997) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| eXistenZ (1999) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 5 | 0 | 0.0 | 0.0 | 0 | 4.5 | 0 | 0.0 |
| xXx (2002) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 1 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 3.5 | 0 | 2.0 |
| xXx: State of the Union (2005) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 1.5 |
| ¡Three Amigos! (1986) | 4 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |


### Q10) Find the 5 movies most similar to the movie “Godfather”.

In [46]:
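# Correlation distance (1 - Pearson correlation) between every pair of movies
distance = pairwise_distances(df, metric='correlation')
# List the titles containing 'Godfather' to find the exact index label
df[df.index.str.contains('Godfather')]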


| title (rows) / userId (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 601 | 602 | 603 | 604 | 605 | 606 | 607 | 608 | 609 | 610 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Godfather, The (1972) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 5.0 | 0 | 5 | 0 | 0.0 | 4.0 | 4 | 5.0 | 0 | 5.0 |
| Godfather: Part II, The (1974) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 4.5 | 0 | 5 | 0 | 0.0 | 4.0 | 0 | 4.5 | 0 | 5.0 |
| Godfather: Part III, The (1990) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 3 | 0.0 | ... | 0.0 | 0 | 2 | 0 | 0.0 | 0.0 | 0 | 4.5 | 0 | 0.0 |
| The Godfather Trilogy: 1972-1990 (1992) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 4.5 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| Tokyo Godfathers (2003) | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0.0 | ... | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0.0 | 0 | 0.0 |


In [47]:
indices = list(df.index) 
indices.index('Godfather, The (1972)')

3499

In [48]:
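# 3499 is the row position of 'Godfather, The (1972)' found in the previous cell.
# Position 0 of the sorted order is the movie itself, so take positions 1 to 5.
similar_indices = distance[3499].argsort()[1:6]
for i in similar_indices:
    print(indices[i])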

Godfather: Part II, The (1974)
Goodfellas (1990)
One Flew Over the Cuckoo's Nest (1975)
Reservoir Dogs (1992)
Fargo (1996)
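
The lookup above hard-codes the row position 3499. A sketch that takes the title directly and ranks by Pearson correlation (correlation distance is 1 minus this value) could look like the following. `similar_titles` is a hypothetical helper and the toy matrix is invented. As in the notebook, zero-filled cells count as ratings, so the scores partly reflect which users rated both movies at all.

```python
import numpy as np
import pandas as pd

def similar_titles(item_user, title, n=5):
    """Rank other rows of `item_user` by Pearson correlation with `title`.

    `item_user` is a title x user matrix like the pivot table above.
    Hypothetical helper; correlation distance is 1 minus this correlation.
    """
    corr = np.corrcoef(item_user.to_numpy(dtype=float))
    scores = pd.Series(corr[item_user.index.get_loc(title)], index=item_user.index)
    # Remove the query title itself, then take the highest correlations.
    return scores.drop(title).sort_values(ascending=False).head(n)

# Toy title x user matrix (invented values; 0 = not rated).
toy = pd.DataFrame(
    [[5, 4, 0, 1],
     [4, 5, 0, 2],
     [0, 1, 5, 4],
     [1, 0, 4, 5]],
    index=["Film A", "Film B", "Film C", "Film D"],
    columns=[1, 2, 3, 4],
)

top = similar_titles(toy, "Film A", n=2)
```

With the notebook's variables, the call would presumably be `similar_titles(df, 'Godfather, The (1972)')`.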
