# Similarity-Based Movie Recommendation System


## Part 1: Setting up the Data

This project uses the MovieLens dataset of movie ratings, published by GroupLens:
https://grouplens.org/datasets/movielens/

In [28]:
import gzip
from collections import defaultdict
import scipy
import scipy.optimize
import numpy as np
import random
import pandas as pd
import csv



The data is split across two files. The ratings file contains userId, movieId, rating and timestamp; the movies file contains movieId and title.
We join the two files on movieId so that every rating carries the title of its movie.

In [25]:
path='/Users/dhruvil/Downloads/ml-latest-small/ml-latest-small/ratings.tsv'
path1='/Users/dhruvil/Downloads/ml-latest-small/ml-latest-small/movies.tsv'

# Read the movies file into a movieId -> title lookup
with open(path1, 'rt', encoding="utf8") as f1:
    header1 = f1.readline().strip().split('\t')
    header1[0] = header1[0][1:]  # drop the stray leading character of the header line
    titles = {}
    for line1 in f1:
        d1 = dict(zip(header1, line1.strip().split('\t')))
        titles[d1['movieId']] = d1['title']

# Read the ratings file and attach the title to every rating
dataset = []
with open(path, 'rt', encoding="utf8") as f:
    header = f.readline().strip().split('\t')
    header[0] = header[0][1:]  # same header clean-up as above
    for line in f:
        d = dict(zip(header, line.strip().split('\t')))
        # Note: the title overwrites the 'timestamp' field,
        # and later cells read the title from that key
        if d['movieId'] in titles:
            d['timestamp'] = titles[d['movieId']]
        d['rating'] = float(d['rating'])
        dataset.append(d)
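
As a possible alternative to splitting lines by hand, the same join can be sketched with the standard-library csv module. This is a minimal sketch on made-up in-memory rows, not the notebook's actual loading code; the names ratings_tsv, movies_tsv, toy_dataset and the 'title' key are hypothetical.

```python
import csv
import io

# Tiny in-memory stand-ins for the two TSV files (illustrative rows only).
ratings_tsv = "userId\tmovieId\trating\ttimestamp\n1\t1\t4.0\t100\n1\t3\t4.0\t200\n"
movies_tsv = "movieId\ttitle\n1\tToy Story (1995)\n3\tGrumpier Old Men (1995)\n"

# Build a movieId -> title lookup once, so each rating needs a single dict access.
titles = {row["movieId"]: row["title"]
          for row in csv.DictReader(io.StringIO(movies_tsv), delimiter="\t")}

toy_dataset = []
for row in csv.DictReader(io.StringIO(ratings_tsv), delimiter="\t"):
    row["rating"] = float(row["rating"])
    # Hypothetical 'title' key; the notebook stores the title under 'timestamp'.
    row["title"] = titles.get(row["movieId"])
    toy_dataset.append(row)

print(toy_dataset[0])
```

Keeping the title under its own key leaves the original timestamp intact, at the cost of changing the key that later cells would need to read.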


Let's look at what a typical entry will look like in this dataset.

In [23]:
dataset

[{'userId': '1',
  'movieId': '1',
  'rating': 4.0,
  'timestamp': 'Toy Story (1995)'},
 {'userId': '1',
  'movieId': '3',
  'rating': 4.0,
  'timestamp': 'Grumpier Old Men (1995)'},
 {'userId': '1', 'movieId': '6', 'rating': 4.0, 'timestamp': 'Heat (1995)'},
 {'userId': '1',
  'movieId': '47',
  'rating': 5.0,
  'timestamp': 'Seven (a.k.a. Se7en) (1995)'},
 {'userId': '1',
  'movieId': '50',
  'rating': 5.0,
  'timestamp': 'Usual Suspects, The (1995)'},
 {'userId': '1',
  'movieId': '70',
  'rating': 3.0,
  'timestamp': 'From Dusk Till Dawn (1996)'},
 {'userId': '1',
  'movieId': '101',
  'rating': 5.0,
  'timestamp': 'Bottle Rocket (1996)'},
 {'userId': '1',
  'movieId': '110',
  'rating': 4.0,
  'timestamp': 'Braveheart (1995)'},
 {'userId': '1',
  'movieId': '151',
  'rating': 5.0,
  'timestamp': 'Rob Roy (1995)'},
 {'userId': '1',
  'movieId': '157',
  'rating': 5.0,
  'timestamp': 'Canadian Bacon (1995)'},
 {'userId': '1',
  'movieId': '163',
  'rating': 5.0,
  'timestamp': 'Desp

## Part 2: Finding Similarities

Here we record which users rated each movie and which movies each user rated. The similarity functions below operate on these sets.

In [30]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)

itemNames = {}

for d in dataset:
    user,item = d['userId'], d['movieId']
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    itemNames[item] = d['timestamp']
print(itemsPerUser)

defaultdict(<class 'set'>, {'1': {'3062', '1032', '1226', '919', '1278', '780', '1927', '500', '1258', '1197', '1291', '3033', '673', '1060', '480', '1080', '1222', '1206', '2137', '3176', '940', '1270', '2268', '2947', '1009', '2078', '1445', '2161', '2096', '553', '2116', '216', '3', '1097', '2987', '1275', '3247', '1240', '954', '1732', '2761', '3448', '2944', '2143', '3740', '608', '3702', '2648', '2571', '2115', '163', '1954', '2872', '2048', '2528', '151', '2387', '2916', '3386', '3147', '3809', '943', '1408', '2949', '5060', '3273', '1214', '316', '2366', '457', '1282', '593', '223', '552', '661', '1256', '1090', '1348', '1804', '2141', '1219', '2716', '3744', '2580', '2093', '2018', '2139', '2616', '2700', '1580', '2502', '3489', '6', '1644', '1025', '3729', '2338', '2985', '3441', '596', '1625', '349', '2389', '2406', '804', '2450', '923', '333', '1967', '1049', '1620', '3793', '1213', '2094', '2492', '441', '2291', '1023', '1793', '543', '590', '2858', '3703', '2617', '70', '

### Functions to find Similarities

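We need a Jaccard similarity function and a function that ranks every other movie by its similarity to a query movie. The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. Cosine similarity can be used in place of Jaccard.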

In [31]:
def Jaccard(s1, s2):
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    return numer / denom
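
For comparison, the cosine alternative mentioned above can be sketched for sets by treating each set as a binary vector, which gives the intersection size divided by the square root of the product of the set sizes. The function name cosine_sets and the sample sets are assumptions for illustration, not part of the notebook.

```python
import math

def cosine_sets(s1, s2):
    """Cosine similarity between two sets treated as binary vectors."""
    if not s1 or not s2:
        return 0.0  # assumption: an empty set is similar to nothing
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))

# Made-up user sets for two items.
a = {"u1", "u2", "u3", "u4"}
b = {"u3", "u4", "u5", "u6"}
print(cosine_sets(a, b))  # 2 / sqrt(4 * 4) = 0.5
```

On the same two sets the Jaccard similarity is 2 / 6, so the two measures can produce different scores for the same pair.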

In [32]:
def mostSimilar(iD, n):
    similarities = []
    id_list = []
    users = usersPerItem[iD]
    for i2 in usersPerItem:
        if i2 == iD: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
        
    similarities.sort(reverse=True)
    
    for i in similarities:
        id_list.append(i[1])
        
    print(id_list[:n])
    return similarities[:n]
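
To see the ranking on something small enough to check by hand, here is a minimal self-contained sketch on made-up data. It uses heapq.nlargest to keep only the top n candidates instead of sorting all of them; the names jaccard, most_similar and toy_users_per_item are hypothetical and separate from the notebook's own functions.

```python
import heapq

def jaccard(s1, s2):
    union = len(s1 | s2)
    return len(s1 & s2) / union if union else 0.0

# Made-up data: item id -> set of users who rated it.
toy_users_per_item = {
    "A": {"u1", "u2", "u3"},
    "B": {"u1", "u2"},
    "C": {"u3", "u4"},
    "D": {"u5"},
}

def most_similar(item_id, n, users_per_item):
    users = users_per_item[item_id]
    scored = ((jaccard(users, other), i2)
              for i2, other in users_per_item.items() if i2 != item_id)
    # nlargest returns the n highest (similarity, id) pairs in descending order.
    return heapq.nlargest(n, scored)

top = most_similar("A", 2, toy_users_per_item)
print(top)  # B shares 2 of 3 users with A, C shares 1 of 4
```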

### Getting a recommendation

In this section we pick a movie that a user has rated and retrieve the movies most similar to it as recommendations.

In [33]:
dataset[2]

{'userId': '1', 'movieId': '6', 'rating': 4.0, 'timestamp': 'Heat (1995)'}

In [35]:
query = dataset[10]['movieId']
query

'163'

In [36]:
itemNames[query]

'Desperado (1995)'

In [37]:
mostSimilar(query, 10)

['353', '555', '6', '380', '173', '288', '293', '553', '552', '16']


[(0.3, '353'),
 (0.297029702970297, '555'),
 (0.2923076923076923, '6'),
 (0.26424870466321243, '380'),
 (0.2549019607843137, '173'),
 (0.25396825396825395, '288'),
 (0.25157232704402516, '293'),
 (0.24761904761904763, '553'),
 (0.24509803921568626, '552'),
 (0.24369747899159663, '16')]

In [38]:
### Top 10 recommended movies for the user....

[itemNames[x[1]] for x in mostSimilar(query, 10)]

['353', '555', '6', '380', '173', '288', '293', '553', '552', '16']


['Crow, The (1994)',
 'True Romance (1993)',
 'Heat (1995)',
 'True Lies (1994)',
 'Judge Dredd (1995)',
 'Natural Born Killers (1994)',
 'Léon: The Professional (a.k.a. The Professional) (Léon) (1994)',
 'Tombstone (1993)',
 'Three Musketeers, The (1993)',
 'Casino (1995)']

## Part 3: Collaborative-Filtering-Based Rating Estimation

We can also use the similarity-based recommender we developed above to make predictions about users' ratings.

Specifically, a user's rating for an item is estimated as a weighted average of their previous ratings, where each rating is weighted by how similar the query item is to the previously rated item.
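
Written out, the estimate described above takes roughly the following form. The notation is an assumption introduced here: $I_u$ is the set of items rated by user $u$, $U_i$ is the set of users who rated item $i$, $r_{u,j}$ is the rating user $u$ gave item $j$, and $\bar{r}$ is the mean rating of the dataset.

$$
\hat{r}(u, i) = \frac{\sum_{j \in I_u \setminus \{i\}} r_{u,j} \, \mathrm{Sim}(i, j)}{\sum_{j \in I_u \setminus \{i\}} \mathrm{Sim}(i, j)},
\qquad
\mathrm{Sim}(i, j) = \frac{|U_i \cap U_j|}{|U_i \cup U_j|}
$$

When the denominator is zero, the estimate falls back to $\bar{r}$.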

We start by building a few more utility data structures to keep track of all of the reviews by each user and for each item.

In [39]:
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)

for d in dataset:
    user,item = d['userId'], d['movieId']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)

In [40]:
# Calculate the mean rating over the entire dataset

avg_star_rating = sum(d['rating'] for d in dataset) / len(dataset)
print(avg_star_rating)


3.501556983616962


Now that we have the mean rating of the whole dataset, we implement a function that predicts the rating a user would give an item. The dataset mean serves as the fallback when the user has no rated items that are similar to the query item.

In [41]:
def predictRating(user,item):
    ratings = []
    similarities = []
    for d in reviewsPerUser[user]:
        i2 = d['movieId']
        if i2 == item: continue
        ratings.append(d['rating'])
        similarities.append(Jaccard(usersPerItem[item],usersPerItem[i2]))
    if (sum(similarities) > 0):
        weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
        return sum(weightedRatings) / sum(similarities)
    else:
        # User hasn't rated any similar items
        return avg_star_rating
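
The arithmetic inside the function can be checked on made-up numbers. This is a small sketch, assuming two previous ratings and their similarities to the query item; weighted_prediction is a hypothetical helper that mirrors only the final averaging step.

```python
def weighted_prediction(ratings, similarities, fallback):
    total_sim = sum(similarities)
    if total_sim > 0:
        # Similarity-weighted average of the previous ratings.
        return sum(r * s for r, s in zip(ratings, similarities)) / total_sim
    # No similar items: fall back to a default such as the dataset mean.
    return fallback

# (4.0 * 0.5 + 2.0 * 0.25) / (0.5 + 0.25) = 2.5 / 0.75
print(weighted_prediction([4.0, 2.0], [0.5, 0.25], fallback=3.5))

# All similarities are zero, so the fallback is returned.
print(weighted_prediction([4.0, 2.0], [0.0, 0.0], fallback=3.5))
```

The rating with the larger similarity pulls the estimate toward itself, which is why the first result lands closer to 4.0 than to 2.0.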

In [42]:
dataset[10]

{'userId': '1',
 'movieId': '163',
 'rating': 5.0,
 'timestamp': 'Desperado (1995)'}

In [44]:
#Predicting rating for the user at index [10]

user,item = dataset[10]['userId'], dataset[10]['movieId']
predictRating(user, item)

4.357645946451583

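In this case the prediction (about 4.36) differs from the dataset mean (about 3.50), so the user has rated movies that share raters with the query movie, and the function returns the similarity-weighted average rather than falling back to the dataset mean. The user's actual rating was 5.0. Let's try another entry from the dataset.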

In [50]:
#Predicting rating for the user at index [12]
user,item = dataset[12]['userId'], dataset[12]['movieId']
predictRating(user, item)
#Answer should differ from the above

4.394928680387841

## Part 4: Evaluating Performance

Let's start by defining our usual mean squared error (MSE) function.

In [51]:
def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)
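
As a quick sanity check of the formula, here is a self-contained sketch on made-up values. The lowercase mse and the three lists are illustrative names, not the notebook's variables.

```python
def mse(predictions, labels):
    differences = [(x - y) ** 2 for x, y in zip(predictions, labels)]
    return sum(differences) / len(differences)

labels = [4.0, 5.0]
baseline = [3.5, 3.5]  # constant guess for every entry
closer = [4.5, 4.5]    # predictions nearer to the labels

print(mse(baseline, labels))  # (0.25 + 2.25) / 2 = 1.25
print(mse(closer, labels))    # (0.25 + 0.25) / 2 = 0.25
```

A lower value means the predictions sit closer to the true ratings on average.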

To evaluate the performance of our model, we need two lists of predictions:
1. A baseline that always predicts the dataset mean rating (i.e. ratingMean)
2. The ratings predicted by our predictRating function (i.e. predictedRatinng)

In [52]:
ratingMean = []
predictedRatinng = []
for d in range(10):
    ratingMean.append(avg_star_rating)
    
    user,item = dataset[d]['userId'], dataset[d]['movieId']
    predictedRatinng.append(predictRating(user, item))

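Finally, we compare the two lists above with the actual ratings in our dataset. Both lists cover only the first 10 entries of the dataset, so the MSE values below are computed on those 10 ratings, not on the full dataset.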

In [53]:
labels = [d['rating'] for d in dataset[:10]]  # match the 10 entries predicted above
print(MSE(ratingMean, labels), MSE(predictedRatinng, labels))

1.247199853687452 0.4462077996374793
