# Ratings and Recommendations Using the Jaccard Index

In [3]:
import pandas as pd
import re
# Load user ratings (tab-separated) and movie metadata (pipe-separated), first three columns only
ratings = pd.read_csv('u.data', sep='\t', names=['user_id', 'movie_id', 'rating'], usecols=range(3), encoding='latin1')
movies = pd.read_csv('u.item', sep='|', names=['movie_id', 'title', 'date'], usecols=range(3), encoding='latin1')
# Join on the shared movie_id column
ratings = pd.merge(movies, ratings)
# Convert date to datetime
ratings['date'] = pd.to_datetime(ratings['date'])

# Remove the year from the title; it is already in the date column.
def shorten_title(title):
    return re.sub(r' \([12][0-9][0-9][0-9]\)', '', title)
ratings['title'] = ratings['title'].apply(shorten_title)
ratings.head()

Unnamed: 0,movie_id,title,date,user_id,rating
0,1,Toy Story,1995-01-01,308,4
1,1,Toy Story,1995-01-01,287,5
2,1,Toy Story,1995-01-01,148,4
3,1,Toy Story,1995-01-01,280,4
4,1,Toy Story,1995-01-01,66,3


### User ratings for movies in the database

#### Part 1

In [49]:
# Per-title mean rating and number of ratings
df = {'mean rating': ratings.groupby(['title'])['rating'].mean(), 'count': ratings.groupby(['title']).size()}
g = pd.DataFrame(data=df).reset_index()     # build a DataFrame from the two per-title Series
perfect_rated_movies = g[g['mean rating'] == 5]    # keep movies whose mean rating is exactly 5

In [50]:
perfect_rated_movies

Unnamed: 0,title,count,mean rating
30,Aiqing wansui,1,5.0
461,Entertaining Angels: The Dorothy Day Story,1,5.0
632,"Great Day in Harlem, A",1,5.0
943,Marlene Dietrich: Shadow and Light,1,5.0
1171,Prefontaine,3,5.0
1271,"Saint of Fort Washington, The",2,5.0
1275,Santa with Muscles,2,5.0
1355,Someone Else's America,1,5.0
1383,Star Kid,3,5.0
1467,They Made Me a Criminal,1,5.0


Two movies have a perfect rating from every user who rated them and also tie for the most ratings among such movies: Star Kid and Prefontaine, with three ratings each.
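
The selection described above can be sketched on a small made-up table. This is only an illustration: the `toy` frame and its titles are hypothetical stand-ins for the merged `ratings` data, and the single `agg` call is an alternative to the two separate `groupby` calls in the cell above, not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data standing in for the merged `ratings` frame
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D'],
    'rating': [5, 5, 5, 5, 4, 5, 5, 5, 5],
})

# One groupby pass yields both statistics per title
stats = toy.groupby('title')['rating'].agg(['mean', 'count']).reset_index()

# Keep movies whose mean rating is 5, then those tied for the most ratings
perfect = stats[stats['mean'] == 5]
most_rated = perfect[perfect['count'] == perfect['count'].max()]
top_titles = sorted(most_rated['title'])
print(top_titles)
```

In this toy table, movies A and C are both rated 5 by all three of their raters, so they tie in the same way Star Kid and Prefontaine do.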

#### Part 2

In [48]:
# Per-title mean rating and number of ratings
d = {'mean rating': ratings.groupby(['title'])['rating'].mean(), 'count': ratings.groupby(['title']).size()}
g = pd.DataFrame(data=d).reset_index()
highest_rated_movies = g[g['count'] > 100]     # keep only movies with more than 100 ratings
highest_rated_movies

Unnamed: 0,title,count,mean rating
2,101 Dalmatians,109,2.908257
3,12 Angry Men,125,4.344000
7,2001: A Space Odyssey,259,3.969112
15,Absolute Power,127,3.370079
16,"Abyss, The",151,3.589404
17,Ace Ventura: Pet Detective,103,3.048544
24,"Adventures of Priscilla, Queen of the Desert, The",111,3.594595
27,"African Queen, The",152,4.184211
32,Air Force One,431,3.631090
36,Aladdin,219,3.812785


In [28]:
highest_rated_movies.sort_values(by=['mean rating'], ascending=False).head(5)  # show the 5 movies with the highest mean rating

Unnamed: 0,title,count,mean rating
317,"Close Shave, A",112,4.491071
1278,Schindler's List,298,4.466443
1647,"Wrong Trousers, The",118,4.466102
272,Casablanca,243,4.45679
1313,"Shawshank Redemption, The",283,4.44523


The five most highly rated movies (by mean rating over all users who rated them) among movies rated by more than 100 people are A Close Shave, Schindler's List, The Wrong Trousers, Casablanca, and The Shawshank Redemption.

### Using the Jaccard Index Method

#### Part 1

In [57]:
def jaccard_similarity(x, y):
    """Return the Jaccard index of two collections: intersection size over union size."""
    x, y = set(x), set(y)
    return len(x & y) / float(len(x | y))

gy_title = ratings.groupby('title')    # group the ratings by movie title

def j_close(title, i):
    """Print the i movies whose sets of raters are closest, by Jaccard index, to those of `title`."""
    x = gy_title.get_group(title)['user_id']    # users who rated the chosen movie
    jaccard = []    # built fresh on every call so earlier results do not accumulate
    for key in gy_title.groups.keys():    # iterate over all titles
        y = gy_title.get_group(key)['user_id']
        jaccard.append((key, jaccard_similarity(x, y)))
    # Sort once, after the loop, from highest to lowest Jaccard index
    ranked = sorted(jaccard, key=lambda item: item[1], reverse=True)
    # Skip the first entry, which is the chosen movie itself (index 1.0)
    print(ranked[1:i+1])

In [59]:
j_close('Toy Story',5)

[('Star Wars', 0.5825688073394495), ('Independence Day (ID4)', 0.5537918871252204), ('Return of the Jedi', 0.5492730210016155), ('Rock, The', 0.5341959334565619), ('Fargo', 0.5118110236220472)]


The five closest movies by Jaccard index to Toy Story are Star Wars (0.583), Independence Day (ID4) (0.554), Return of the Jedi (0.549), The Rock (0.534), and Fargo (0.512).
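
For reference, the Jaccard index of two sets A and B is the size of their intersection divided by the size of their union. The following minimal sketch works through that arithmetic on two hypothetical sets of user IDs; the names `jaccard_index`, `raters_movie_1`, and `raters_movie_2` are made up for this illustration, and it assumes Python 3 division.

```python
def jaccard_index(a, b):
    """Jaccard index of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical user IDs of the people who rated each of two movies
raters_movie_1 = {1, 2, 3, 4}
raters_movie_2 = {3, 4, 5, 6}

# 2 shared raters (3 and 4) out of 6 distinct raters overall
similarity = jaccard_index(raters_movie_1, raters_movie_2)
print(similarity)
```

Here the index is 2/6, about 0.33. Identical sets give 1.0 and disjoint sets give 0.0, so a value such as 0.583 for Star Wars means that well over half of the users who rated either movie rated both.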

#### Part 2

In [62]:
def jaccard_similarity(x, y):
    """Return the Jaccard index of two collections: intersection size over union size."""
    x, y = set(x), set(y)
    return len(x & y) / float(len(x | y))

gy_title = ratings.groupby('title')    # group the ratings by movie title

def j_close(title, i):
    """Print the i movies whose sets of raters are closest, by Jaccard index, to those of `title`."""
    x = gy_title.get_group(title)['user_id']    # users who rated the chosen movie
    jaccard = []    # built fresh on every call so earlier results do not accumulate
    for key in gy_title.groups.keys():    # iterate over all titles
        y = gy_title.get_group(key)['user_id']
        jaccard.append((key, jaccard_similarity(x, y)))
    # Sort once, after the loop, from highest to lowest Jaccard index
    ranked = sorted(jaccard, key=lambda item: item[1], reverse=True)
    # Skip the first entry, which is the chosen movie itself (index 1.0)
    print(ranked[1:i+1])

In [63]:
j_close('Casablanca',5)

[('Citizen Kane', 0.4798657718120805), ('Wizard of Oz, The', 0.47289156626506024), ('2001: A Space Odyssey', 0.45507246376811594), ('Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb', 0.45182724252491696), ('Raiders of the Lost Ark', 0.4444444444444444)]


The five closest movies by Jaccard index to Casablanca are Citizen Kane (0.480), The Wizard of Oz (0.473), 2001: A Space Odyssey (0.455), Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (0.452), and Raiders of the Lost Ark (0.444).

#### Some comments about the Jaccard index

The Jaccard index seems like a reasonable similarity metric for recommender systems. However, as computed here it only records which users rated a movie and ignores the ratings themselves, so a one-star review counts exactly as much as a five-star review. We should therefore think critically when looking at recommendations made by the Jaccard index.
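
To illustrate the limitation noted above, here is a minimal sketch with made-up data, assuming Python 3 division. The `(user_id, rating)` pairs are hypothetical: the same three users rate both movies, but they like the first and dislike the second.

```python
# Hypothetical (user_id, rating) pairs for two movies
ratings_movie_1 = [(1, 5), (2, 5), (3, 4)]
ratings_movie_2 = [(1, 1), (2, 2), (3, 1)]

# The user-based Jaccard index only looks at who rated each movie
users_1 = {user for user, _ in ratings_movie_1}
users_2 = {user for user, _ in ratings_movie_2}
similarity = len(users_1 & users_2) / len(users_1 | users_2)

# Same three raters, so the index is 1.0 even though the ratings disagree
print(similarity)
```

The two movies come out as maximally similar even though every rater preferred the first, which is why the index is best read as a measure of shared audience rather than shared opinion.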