In [1]:
'''Import the necessary libraries'''
import pickle
import pandas as pd
from math import sqrt

**User-User Collaborative filtering approach**

User–user CF is a straightforward algorithmic interpretation of the core premise of collaborative filtering: find other users whose past rating behavior is similar to that of the current user and use their ratings on other items to predict what the current user will like. For example, to predict Mary’s preference for an item she has not rated, user–user CF looks for other users who have high agreement with Mary on the items they have both rated. These users’ ratings for the item in question are then weighted by their level of agreement with Mary’s ratings to predict Mary’s preference.
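
The weighting step can be sketched on a single item. This is a minimal illustration, not the notebook's implementation: the neighbour names, ratings, similarity values and the helper `predict_rating` are all made up for the example.

```python
# Hypothetical ratings for one film that Mary has not rated, and hypothetical
# similarity scores between Mary and each neighbour (all values are made up).
neighbour_ratings = {"Ann": 80, "Bob": 60, "Cat": 90}
similarity_to_mary = {"Ann": 0.9, "Bob": 0.3, "Cat": -0.2}


def predict_rating(ratings, similarities):
    """Similarity-weighted average of the neighbours' ratings.

    Neighbours with zero or negative similarity are skipped, so only users
    who agree with Mary contribute to the prediction.
    """
    weighted_total = 0.0
    similarity_total = 0.0
    for name, rating in ratings.items():
        sim = similarities.get(name, 0)
        if sim <= 0:
            continue
        weighted_total += sim * rating
        similarity_total += sim
    if similarity_total == 0:
        return None  # no usable neighbours, so no prediction
    return weighted_total / similarity_total


prediction = predict_rating(neighbour_ratings, similarity_to_mary)
print(prediction)
```

Ann agrees with Mary far more than Bob does, so the prediction lands closer to Ann's 80 than to Bob's 60, while Cat's rating is ignored entirely.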

In [2]:
# The file holds a pickled dict: critic name -> {movie title: score}
with open('dict_scores_for_25_critics.txt', 'rb') as handle:
    dataset = pickle.load(handle)
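
The rest of the notebook indexes `dataset` as `dataset[critic][movie]`, so the structure is assumed to be a dictionary of dictionaries. A made-up miniature of that shape (the critic names, film titles and scores below are placeholders, not values from the real file):

```python
import pickle

# Made-up miniature of the structure the notebook code relies on:
# critic name -> {movie title -> numeric score}.
toy_dataset = {
    "Critic A": {"Film 1": 80, "Film 2": 65},
    "Critic B": {"Film 1": 70, "Film 3": 90},
}

# Round-trip through pickle, mirroring how the real file is read back.
restored = pickle.loads(pickle.dumps(toy_dataset))

# Movies rated by both critics: the basis of every similarity score.
shared = [movie for movie in restored["Critic A"] if movie in restored["Critic B"]]
print(shared)
```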

**Similarity Function**

A critical design decision in implementing user–user CF is the choice of similarity function. We try two functions for estimating the similarity between two critics: a score based on Euclidean distance, and the Pearson correlation coefficient, which tends to give better results here. The weakness of Euclidean distance is that it is really a measure of dissimilarity: two critics who rank movies the same way but use different parts of the rating scale end up far apart, even though their tastes agree. Pearson correlation instead measures how well the two sets of ratings fit a straight line, so a consistent offset between two critics does not count against them.

Whichever function is used, a prediction is normally based not on the single most similar person but on a weighted average of the ratings of several people. The weight given to a person’s ratings is determined by the correlation between that person and the person for whom the prediction is made.
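
A small sketch of how the two measures can disagree. The functions `euclidean_similarity` and `pearson_similarity` are standalone stand-ins written for this example (the notebook's own versions follow below and read from the global `dataset`), and the two rating dictionaries are made up.

```python
from math import sqrt


def euclidean_similarity(a, b):
    """1 / (1 + Euclidean distance) over the items rated in both dicts."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0
    distance = sqrt(sum((a[item] - b[item]) ** 2 for item in shared))
    return 1 / (1 + distance)


def pearson_similarity(a, b):
    """Pearson's r over the items rated in both dicts (0 if undefined)."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n == 0:
        return 0
    mean_a = sum(a[item] for item in shared) / float(n)
    mean_b = sum(b[item] for item in shared) / float(n)
    covariance = sum((a[item] - mean_a) * (b[item] - mean_b) for item in shared)
    spread_a = sum((a[item] - mean_a) ** 2 for item in shared)
    spread_b = sum((b[item] - mean_b) ** 2 for item in shared)
    denominator = sqrt(spread_a * spread_b)
    return covariance / denominator if denominator else 0


# Made-up ratings: "strict" ranks the films exactly as "generous" does,
# but scores every film 20 points lower.
generous = {"Film A": 90, "Film B": 70, "Film C": 80}
strict = {"Film A": 70, "Film B": 50, "Film C": 60}

euclidean = euclidean_similarity(generous, strict)
pearson = pearson_similarity(generous, strict)
print(euclidean)
print(pearson)
```

The Euclidean score comes out close to 0 because of the constant 20-point gap, while the Pearson score is 1 because the two critics order the films identically.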


In [3]:
def similarity_score(critic1,critic2):
    '''Returns a similarity score in (0, 1] derived from the Euclidean distance
    between the ratings of critic1 and critic2 (0 if they share no movies)'''

    # Movies rated by both critic1 and critic2
    common = [movie for movie in dataset[critic1] if movie in dataset[critic2]]

    # No movies in common means there is nothing to compare
    if len(common) == 0:
        return 0

    # Sum of squared rating differences over the common movies
    sum_of_squares = sum(pow(dataset[critic1][movie] - dataset[critic2][movie],2)
                         for movie in common)

    # Identical ratings give a distance of 0 and a score of 1;
    # the score shrinks towards 0 as the distance grows
    return 1/(1+sqrt(sum_of_squares))

**Pearson coefficient**

This method computes the statistical correlation (Pearson’s r) between two users’ common ratings to determine their similarity. In our case, it measures how strongly the ratings of two critics are linearly related. The formula for the Pearson coefficient is given below:
<img src="pcc.png" width="700px">

The table below describes each variable used in the formula:

<img src="pcc_2.png" width="700px">
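
In case the images are unavailable, the computational form that the `pearson_correlation` function below evaluates can be written as follows. The symbols are chosen here for illustration and may differ from those in the images: $x_i$ and $y_i$ are the two critics' ratings of the $i$-th commonly rated movie, and $n$ is the number of such movies.

$$
r = \frac{\sum_i x_i y_i - \dfrac{\sum_i x_i \sum_i y_i}{n}}
{\sqrt{\left(\sum_i x_i^2 - \dfrac{\left(\sum_i x_i\right)^2}{n}\right)\left(\sum_i y_i^2 - \dfrac{\left(\sum_i y_i\right)^2}{n}\right)}}
$$

When the denominator is zero (for example, when one critic gave every common movie the same rating), $r$ is undefined and the function returns 0.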

In [4]:
def pearson_correlation(critic1,critic2):

    # Fetch the commonly rated items
    common = {}
    for item in dataset[critic1]:
        if item in dataset[critic2]:
            common[item] = 1

    total_ratings = len(common)
    
    if total_ratings == 0:
        return 0

    # Add all the ratings given by both the critics for all the movies that are in common between them.
    critic1_pref_sum = sum([dataset[critic1][item] for item in common])
    critic2_pref_sum = sum([dataset[critic2][item] for item in common])

    # Sum up the squares of ratings of each user.
    critic1_ratings_sum = sum([pow(dataset[critic1][item],2) for item in common])
    critic2_ratings_sum = sum([pow(dataset[critic2][item],2) for item in common])

    # Sum up the product value of both preferences for each item
    product_sum_of_both_users = sum([dataset[critic1][item] * dataset[critic2][item] for item in common])

    # Calculate the pearson score
    # (float() avoids integer division under Python 2 when the ratings are integers)
    n = float(total_ratings)
    n1 = product_sum_of_both_users - (critic1_pref_sum*critic2_pref_sum/n)
    d1 = sqrt((critic1_ratings_sum - pow(critic1_pref_sum,2)/n) *
              (critic2_ratings_sum - pow(critic2_pref_sum,2)/n))
    if d1 == 0:
        return 0
    else:
        rating = n1/d1
        return rating

**Reference**

These functions are adapted from the book *Programming Collective Intelligence* by Toby Segaran.

In [5]:
def most_similar_users(critic,n):
    '''Returns the n critics most similar to the given critic as (score, name) tuples,
    ranked by the pearson correlation score between the given critic and every other
    critic in the dataset.'''
    
    scores = [(pearson_correlation(critic,other_critic),other_critic) for other_critic in dataset if other_critic != critic]
    # Highest correlation first
    scores.sort(reverse=True)
    return scores[:n]

In [6]:
def user_reommendations(critic,n):

    '''Gets recommendations for a critic by using a weighted average of every other critic's ratings'''
    totals = {}
    simSums = {}
    for other in dataset:
        if other != critic:
            sim = pearson_correlation(critic,other)
            # Ignore critics with zero or negative correlation
            if sim <= 0:
                continue
            for item in dataset[other]:
                # Only score movies the critic hasn't rated yet
                if item not in dataset[critic] or dataset[critic][item] == 0:
                    # Similarity * score
                    totals.setdefault(item,0)
                    totals[item] += dataset[other][item] * sim
                    # Sum of similarities
                    simSums.setdefault(item,0)
                    simSums[item] += sim

    # Divide each total by the sum of similarities to get the weighted average
    rankings = [(total/simSums[item],item) for item,total in totals.items()]
    rankings.sort(reverse=True)
    # Return the top n recommended items
    recommendations_list = [recommend_item for score,recommend_item in rankings]
    return recommendations_list[:n]

In [7]:
print dataset.keys()
print "Total number of critics in the dataset=%d" % len(dataset.keys())

['Lawrence Toppman', 'Michael Wilmington', 'David Edelstein', 'Ty Burr', 'Desson Thomson', 'Peter Travers', 'Roger Ebert', 'James Berardinelli', 'Stephen Holden', 'Kenneth Turan', 'David Sterritt', 'Wesley Morris', 'Jonathan Rosenbaum', 'Manohla Dargis', 'Lisa Schwarzbaum', 'Michael Phillips', 'Roger Moore', 'Dana Stevens', 'Steven Rea', 'Joe Morgenstern', "Michael O'Sullivan", 'Rene Rodriguez', 'Michael Sragow', 'Todd McCarthy', 'Peter Rainer']
Total number of critics in the dataset=25


In [8]:
'''Total ratings given by the first critic '''
print len(dataset.items()[0][1])

1429


In [9]:
'''Total number of movie ratings given by all critics'''
c=0
for i in dataset.keys():
    c=c+len(dataset[i])
print c

40950


In [10]:
'''Get the top 20 recommended movies for critic Michael Phillips by estimating
    the correlation score between him and all other critics'''

print user_reommendations('Michael Phillips',20)

['Who Framed Roger Rabbit', 'Tootsie', 'The Wizard of Oz (re-release)', 'Raging Bull', 'Patton', 'National Gallery', 'From Here to Eternity (re-release)', 'Frantic [re-release]', 'Beau Travail', 'Z Channel: A Magnificent Obsession', 'You, the Living', 'Where Are You Taking Me?', 'When Marnie Was There', 'What Richard Did', 'Werckmeister Harmonies', 'Waging a Living', 'Umberto D (re-release)', 'Two Women', 'Two Step', 'This Filthy World']


In [11]:
'''Print the most similar critics to critic Todd McCarthy'''
print most_similar_users('Todd McCarthy',3) 

[(0.6142933202957855, 'Ty Burr'), (0.607544288227396, 'Roger Moore'), (0.5851445957264291, 'Kenneth Turan')]


In [12]:
'''Count the ratings, i.e. the total number of movies rated by each critic'''
r={}
for i in dataset.keys():
    r[i] = len(dataset[i])
print r

{'Lawrence Toppman': 1429, 'Michael Wilmington': 1075, 'Desson Thomson': 1604, 'Ty Burr': 1692, 'David Edelstein': 1706, 'Dana Stevens': 1069, 'Roger Ebert': 2738, 'James Berardinelli': 2266, 'Stephen Holden': 1945, 'Kenneth Turan': 1887, 'David Sterritt': 1859, 'Wesley Morris': 1377, 'Jonathan Rosenbaum': 1416, 'Manohla Dargis': 1612, 'Lisa Schwarzbaum': 1781, 'Michael Phillips': 1381, 'Roger Moore': 1203, 'Peter Travers': 2221, 'Steven Rea': 1602, 'Joe Morgenstern': 1841, "Michael O'Sullivan": 1146, 'Rene Rodriguez': 1621, 'Michael Sragow': 1031, 'Todd McCarthy': 1365, 'Peter Rainer': 2083}


In [13]:
'''Calculate average rating given by each critic'''

import scipy
avg={}
for i in dataset.keys():
    # Collect this critic's ratings, then average them once
    # (scipy.mean is an alias for numpy.mean in older SciPy releases)
    samp = list(dataset[i].values())
    avg[i] = scipy.mean(samp)
print avg

{'Lawrence Toppman': 64.809657102869139, 'Michael Wilmington': 72.836279069767443, 'Desson Thomson': 59.748129675810475, 'Ty Burr': 65.086879432624116, 'David Edelstein': 64.167643610785461, 'Dana Stevens': 61.973807296538823, 'Roger Ebert': 68.58436815193572, 'James Berardinelli': 64.287290379523384, 'Stephen Holden': 58.750642673521853, 'Kenneth Turan': 70.752517223105457, 'David Sterritt': 66.463152232382996, 'Wesley Morris': 57.913580246913583, 'Jonathan Rosenbaum': 59.738700564971751, 'Manohla Dargis': 60.899503722084368, 'Lisa Schwarzbaum': 69.112296462661419, 'Michael Phillips': 65.485155684286752, 'Roger Moore': 59.088113050706568, 'Peter Travers': 65.621791985592083, 'Steven Rea': 70.595505617977523, 'Joe Morgenstern': 59.782726778924498, "Michael O'Sullivan": 58.139616055846425, 'Rene Rodriguez': 63.198025909932142, 'Michael Sragow': 64.676042677012603, 'Todd McCarthy': 62.065934065934066, 'Peter Rainer': 66.151704272683631}
