**A Programmer's Guide to Data Mining by Ron Zacharski** <br>
*Chapter 3 - Item Based Collaborative Filtering* 

In the previous chapter, we made recommendations with user-based collaborative filtering: we recommended products rated by similar users. In this chapter, we use item-based collaborative filtering instead. We compute the similarity between items and combine that metric with the user's own ratings to predict ratings for products the user has not yet rated.

In [1]:
from math import sqrt
from operator import itemgetter

In [2]:
users = {"Angelica":{"Blues Traveler":3.5, "Broken Bells":2.0, "Norah Jones":4.5, "Phoenix":5.0, "Slightly Stoopid":1.5,
                    "The Strokes":2.5, "Vampire Weekend":2.0}, 
        "Bill": {"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, 
                 "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
        "Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
                  "Slightly Stoopid": 1.0},
        "Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, 
                "The Strokes": 4.0,"Vampire Weekend": 2.0},
        "Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0,"Vampire Weekend": 1.0},
        "Jordyn": {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5,
                   "The Strokes": 4.0, "Vampire Weekend": 4.0},
        "Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0,
                "The Strokes": 5.0},
        "Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5,
                     "The Strokes": 3.0}}

In [3]:
users2 = {"Amy": {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
          "Ben": {"Taylor Swift": 5, "PSY": 2},
          "Clara": {"PSY": 3.5, "Whitney Houston": 4},
          "Daisy": {"Taylor Swift": 5, "Whitney Houston": 3}}

In [4]:
users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5, "Lorde": 4, "Fall Out Boy": 1},
          "Matt":{"Imagine Dragons": 3, "Daft Punk": 4,"Lorde": 4, "Fall Out Boy": 1},
          "Ben":   {"Kacey Musgraves": 4, "Imagine Dragons": 3,"Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,"Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori":  {"Kacey Musgraves": 5, "Imagine Dragons": 4,"Daft Punk": 5, "Fall Out Boy": 3}}

We will write a function to calculate the adjusted cosine similarity between two bands. The similarity is computed only over the users who have rated both bands. The approach is similar to the cosine similarity function we used in Chapter 2.<br>
The formula is:

$$s(i,j) =\frac{ \sum_{u \in U} (R_{u,i} - \bar{R_{u}})(R_{u,j} - \bar{R_{u}}) } { \sqrt {\sum_{u \in U} (R_{u,i} - \bar{R_{u}})^{2}} \sqrt {\sum_{u \in U} (R_{u,j} - \bar{R_{u}})^{2}} } $$

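Here $U$ is the set of users who have rated both items $i$ and $j$, $R_{u,i}$ is the rating user $u$ gave item $i$, and $\bar{R_{u}}$ is the average of all ratings given by user $u$.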

We subtract each user's average rating from that user's ratings to normalize them. This compensates for rating inflation, where some users rate everything high and others rate everything low, and puts all users' ratings on a comparable scale.
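To make the normalization step concrete, here is a minimal sketch that mean-centers one user's ratings. The helper name `mean_center` is hypothetical and not part of the original notebook; the ratings are David's, copied from `users3`.

```python
# Hypothetical helper for illustration; not part of the original notebook.
def mean_center(ratings):
    """Return a copy of one user's ratings with that user's average subtracted."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: rating - mean for item, rating in ratings.items()}

# David's ratings, copied from users3.
david = {"Imagine Dragons": 3, "Daft Punk": 5, "Lorde": 4, "Fall Out Boy": 1}

centered = mean_center(david)
print(centered)
# David's average is 3.25, so Daft Punk becomes 1.75 and Fall Out Boy becomes -2.25.
```

After centering, positive values mark bands the user likes more than their own average and negative values mark bands they like less, regardless of how generous the user is overall.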

In [5]:
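def computesimilarity(band1, band2, users):
    ### Average rating of each user, taken over every band that user rated
    averages = {}
    for user, ratings in users.items():
        averages[user] = sum(ratings.values()) / len(ratings)
    numerator = 0
    dem1 = 0
    dem2 = 0
    ### Accumulate the three sums over the users who rated both bands
    for user, ratings in users.items():
        if (band1 in ratings) and (band2 in ratings):
            numerator += (ratings[band1] - averages[user]) * (ratings[band2] - averages[user])
            dem1 += (ratings[band1] - averages[user])**2
            dem2 += (ratings[band2] - averages[user])**2
    ### The formula is undefined when a denominator term is zero (for example, no user
    ### rated both bands); returning 0 in that case is an assumption, not part of the formula
    if dem1 == 0 or dem2 == 0:
        return 0
    return numerator / (sqrt(dem1) * sqrt(dem2))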

In [6]:
computesimilarity("The Strokes", "Deadmau5", users)

-0.5671074810201343

In [7]:
computesimilarity("Lorde", "Imagine Dragons", users3)

-0.2525265372291518

Weighted Slope One<BR>

Weighted Slope One is an algorithm we can use to make recommendations for users. It has two steps. In the first step, we calculate the deviation between every pair of items, that is, how much higher or lower one item is rated than the other on average. In the second step, we use these deviations to predict ratings for items a user has not rated.

Step 1: Computing Deviations

The formula for calculating deviations for item pairs is:

$$ dev(i,j) = \sum_{u \in S_{i,j}(X)} \frac {u_i - u_j} {card(S_{i,j}(X))} $$

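where $S_{i,j}(X)$ is the set of users in the full set of ratings $X$ who rated both items i and j, and $u_i$ is the rating user u gave item i. $card(S)$ is the number of elements in $S$, so $card(S_{i,j}(X))$ is the number of users who rated both items i and j.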
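As a quick check of the formula on a single pair, the sketch below computes the deviation directly from the ratings in `users2`. The helper `pair_deviation` is a hypothetical name introduced only for this illustration, and returning `None` for a pair with no common raters is an assumption.

```python
# Ratings copied from users2.
users2 = {"Amy": {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
          "Ben": {"Taylor Swift": 5, "PSY": 2},
          "Clara": {"PSY": 3.5, "Whitney Houston": 4},
          "Daisy": {"Taylor Swift": 5, "Whitney Houston": 3}}

# Hypothetical helper for illustration; not part of the original notebook.
def pair_deviation(item_i, item_j, data):
    """Average of (rating of item_i - rating of item_j) over users who rated both."""
    diffs = [ratings[item_i] - ratings[item_j]
             for ratings in data.values()
             if item_i in ratings and item_j in ratings]
    if not diffs:
        return None  # assumption: no common raters means the deviation is undefined
    return sum(diffs) / len(diffs)

# Amy contributes 3 - 4 and Ben contributes 2 - 5, so the average is -2.0.
print(pair_deviation("PSY", "Taylor Swift", users2))
# Amy contributes 3 - 4 and Clara contributes 3.5 - 4, so the average is -0.75.
print(pair_deviation("PSY", "Whitney Houston", users2))
```

Swapping the two items flips the sign of the result, since every difference in the sum is negated.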

Now, let us make a function to calculate this value for all item pairs.

In [8]:
def deviations(data):
    dev = {}
    freq = {}
    ### Get ratings from the dataset
    for ratings in data.values():
        ### For every item1, calculate the deviation of all items(item2) from item1
        ### For every item1, calculate the frequency when item1 and item2 are rated by the same user
        for item1, rating1 in ratings.items():
            dev.setdefault(item1, {})
            freq.setdefault(item1, {})
            for item2, rating2 in ratings.items():
                if item1 != item2:      
                    dev[item1].setdefault(item2, 0)
                    freq[item1].setdefault(item2, 0.0)
                    dev[item1][item2] += rating1 - rating2
                    freq[item1][item2] += 1
    ### For every item1-item2 pair, divide the deviations by the frequencies to get the final deviation between the items
    for k1, v1 in dev.items():
        for k2 in v1:
            v1[k2] /= freq[k1][k2]
    ### Return two dictionaries. One contains the deviations of all items to item i. 
    ### The second contains the number of users who rated both items i and j.
    return dev, freq

In [9]:
deviations(users2)

({'PSY': {'Taylor Swift': -2.0, 'Whitney Houston': -0.75},
  'Taylor Swift': {'PSY': 2.0, 'Whitney Houston': 1.0},
  'Whitney Houston': {'PSY': 0.75, 'Taylor Swift': -1.0}},
 {'PSY': {'Taylor Swift': 2.0, 'Whitney Houston': 2.0},
  'Taylor Swift': {'PSY': 2.0, 'Whitney Houston': 2.0},
  'Whitney Houston': {'PSY': 2.0, 'Taylor Swift': 2.0}})

The output contains two dictionaries. The first one contains the deviations. For example, on average, PSY is rated 2 points below Taylor Swift and 0.75 points below Whitney Houston. The second dictionary contains the number of users who rated both items. So, two users rated both PSY and Taylor Swift. Similarly, two users rated both PSY and Whitney Houston.

In [10]:
deviations(users3)

({'Daft Punk': {'Fall Out Boy': 3.0,
   'Imagine Dragons': 1.0,
   'Kacey Musgraves': 0.0,
   'Lorde': 0.6666666666666666},
  'Fall Out Boy': {'Daft Punk': -3.0,
   'Imagine Dragons': -2.0,
   'Kacey Musgraves': -2.6666666666666665,
   'Lorde': -2.5},
  'Imagine Dragons': {'Daft Punk': -1.0,
   'Fall Out Boy': 2.0,
   'Kacey Musgraves': -0.6666666666666666,
   'Lorde': -0.25},
  'Kacey Musgraves': {'Daft Punk': 0.0,
   'Fall Out Boy': 2.6666666666666665,
   'Imagine Dragons': 0.6666666666666666,
   'Lorde': 1.0},
  'Lorde': {'Daft Punk': -0.6666666666666666,
   'Fall Out Boy': 2.5,
   'Imagine Dragons': 0.25,
   'Kacey Musgraves': -1.0}},
 {'Daft Punk': {'Fall Out Boy': 4.0,
   'Imagine Dragons': 4.0,
   'Kacey Musgraves': 2.0,
   'Lorde': 3.0},
  'Fall Out Boy': {'Daft Punk': 4.0,
   'Imagine Dragons': 5.0,
   'Kacey Musgraves': 3.0,
   'Lorde': 4.0},
  'Imagine Dragons': {'Daft Punk': 4.0,
   'Fall Out Boy': 5.0,
   'Kacey Musgraves': 3.0,
   'Lorde': 4.0},
  'Kacey Musgraves': {'Daf

In [15]:
deviations(users)

({'Blues Traveler': {'Broken Bells': 1.2,
   'Deadmau5': 0.16666666666666666,
   'Norah Jones': 0.25,
   'Phoenix': -0.4166666666666667,
   'Slightly Stoopid': 0.75,
   'The Strokes': 0.0,
   'Vampire Weekend': 0.5},
  'Broken Bells': {'Blues Traveler': -1.2,
   'Deadmau5': 0.5,
   'Norah Jones': -1.2,
   'Phoenix': -1.3333333333333333,
   'Slightly Stoopid': -0.3333333333333333,
   'The Strokes': -0.6,
   'Vampire Weekend': 1.2},
  'Deadmau5': {'Blues Traveler': -0.16666666666666666,
   'Broken Bells': -0.5,
   'Norah Jones': -2.0,
   'Phoenix': -0.375,
   'Slightly Stoopid': 0.0,
   'The Strokes': -0.8333333333333334,
   'Vampire Weekend': 0.875},
  'Norah Jones': {'Blues Traveler': -0.25,
   'Broken Bells': 1.2,
   'Deadmau5': 2.0,
   'Phoenix': -0.7,
   'Slightly Stoopid': 1.4,
   'The Strokes': 0.6,
   'Vampire Weekend': 2.1666666666666665},
  'Phoenix': {'Blues Traveler': 0.4166666666666667,
   'Broken Bells': 1.3333333333333333,
   'Deadmau5': 0.375,
   'Norah Jones': 0.7,
   'S

Step 2: Predictions using Weighted Slope One

The formula for making predictions is:

$$ p^{wS1}(u)_j = \frac {\sum_{i \in S(u) - \{j\}} (dev_{j,i} + u_i)c_{j,i}} {\sum_{i \in S(u) - \{j\}} c_{j,i}} $$

$p^{wS1}(u)_j$ is the predicted rating of user u for item j. $c_{j,i}$ is $card(S_{j,i}(X))$, the number of users who rated both items j and i. $i \in S(u) - \{j\}$ means every item that user u has rated except item j.

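In the numerator, for each item i the user has rated, we add the deviation of item j from item i to the user's rating for item i, and multiply this sum by the number of users who have rated both items j and i. The denominator is the sum of those counts, so the prediction is a weighted average.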
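As a worked instance of the formula, consider Ben in `users2`, who rated Taylor Swift 5 and PSY 2 but has not rated Whitney Houston. The abbreviations WH, TS and PSY are introduced here only for brevity. From the `deviations(users2)` output, $dev_{WH,TS} = -1.0$ and $dev_{WH,PSY} = 0.75$, and both pairs were rated by 2 users:

$$ p^{wS1}(Ben)_{WH} = \frac {(dev_{WH,TS} + 5) \cdot 2 + (dev_{WH,PSY} + 2) \cdot 2} {2 + 2} = \frac {(-1.0 + 5) \cdot 2 + (0.75 + 2) \cdot 2} {4} = \frac {13.5} {4} = 3.375 $$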

Let's code the function to calculate this.

In [11]:
def slopeone(data, user):
    recommendations = {}
    denominator = {}
    final = {}
    ### Compute the deviations and frequencies once instead of on every loop iteration
    dev, frequency = deviations(data)
    ### for every item and rating in the user's ratings
    for item, rating in user.items():
        ### for every item our user did not rate that shares at least one rater with item
        for DiffItem, DiffRatings in dev.items():
            if (DiffItem not in user) and (item in DiffRatings):
                recommendations.setdefault(DiffItem, 0.0)
                denominator.setdefault(DiffItem, 0)
                recommendations[DiffItem] += (DiffRatings[item] + rating) * frequency[DiffItem][item]
                denominator[DiffItem] += frequency[DiffItem][item]
    ### Divide each weighted sum by its total weight to get the predicted rating
    for k1 in recommendations:
        final[k1] = recommendations[k1] / denominator[k1]
    return sorted(final.items(), key=itemgetter(1), reverse=True)

In [12]:
users2["Ben"].items()

dict_items([('Taylor Swift', 5), ('PSY', 2)])

In [13]:
deviations(users2)[0].items()

dict_items([('Taylor Swift', {'PSY': 2.0, 'Whitney Houston': 1.0}), ('PSY', {'Taylor Swift': -2.0, 'Whitney Houston': -0.75}), ('Whitney Houston', {'Taylor Swift': -1.0, 'PSY': 0.75})])

In [14]:
slopeone(users2, users2["Ben"])

[('Whitney Houston', 3.375)]

In [16]:
slopeone(users, users["Veronica"])

[('Deadmau5', 2.8529411764705883),
 ('Broken Bells', 2.5555555555555554),
 ('Vampire Weekend', 2.3055555555555554)]

In [18]:
slopeone(users3, users3["David"])

[('Kacey Musgraves', 4.2)]