This notebook contains a simple similarity-based recommender and an item-based collaborative filtering (CF) rating predictor, both built on Jaccard similarity. The example code will help you understand the data and prepare for hw4.ipynb.
Run the code below and become familiar with it before starting hw4.ipynb.

## Simple similarity-based recommender with Jaccard

The first recommender system we'll implement is a simple similarity-based recommender that makes recommendations based on the Jaccard similarity between items.

We'll start with several standard imports:

In [None]:
import gzip
from collections import defaultdict
import scipy
import scipy.optimize
import numpy
import random

Next we load the data, converting integer-valued fields as we go. Note that here we use a large "Musical Instruments" dataset for the sake of demonstrating a more scalable system. Download the data from here: https://web.cs.wpi.edu/~kmlee/cs547/amazon_reviews_us_Musical_Instruments_v1_00_small.tsv.gz



In [None]:
# From https://web.cs.wpi.edu/~kmlee/cs547/amazon_reviews_us_Musical_Instruments_v1_00_small.tsv.gz
# Change this path to wherever you saved the downloaded file
path = "C://Users/humanist0810/Documents/hw4-rec/amazon_reviews_us_Musical_Instruments_v1_00_small.tsv.gz"

In [None]:
f = gzip.open(path, 'rt', encoding="utf8")

In [None]:
header = f.readline()
header = header.strip().split('\t')

In [None]:
print(header)

In [None]:
dataset = []

In [None]:
for line in f:
    fields = line.strip().split('\t')
    d = dict(zip(header, fields))
    d['star_rating'] = int(d['star_rating'])
    d['helpful_votes'] = int(d['helpful_votes'])
    d['total_votes'] = int(d['total_votes'])
    dataset.append(d)
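
To try the parsing step without downloading the file, here is a minimal sketch that applies the same header-plus-`dict(zip(...))` approach to a tiny in-memory TSV. The two rows and the reduced column set are made up for illustration; the real file has more columns.

```python
import io

# Hypothetical two-row sample using a subset of the real column names
sample_tsv = (
    "customer_id\tproduct_id\tstar_rating\thelpful_votes\ttotal_votes\n"
    "u1\tp1\t5\t0\t1\n"
    "u2\tp1\t3\t2\t4\n"
)

sample_file = io.StringIO(sample_tsv)
sample_header = sample_file.readline().strip().split('\t')

sample_dataset = []
for line in sample_file:
    fields = line.strip().split('\t')
    d = dict(zip(sample_header, fields))
    # Convert the integer-valued fields, as in the loop above
    for key in ('star_rating', 'helpful_votes', 'total_votes'):
        d[key] = int(d[key])
    sample_dataset.append(d)

print(sample_dataset[0])
```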

In [None]:
len(dataset)

Let's examine one of the entries in this dataset:

In [None]:
dataset[0]

First we'll build a few useful data structures, in this case just to maintain a collection of the items reviewed by each user, and the collection of users who have reviewed each item.

In [None]:
usersPerItem = defaultdict(set)
itemsPerUser = defaultdict(set)

In [None]:
itemNames = {}

In [None]:
for d in dataset:
    user,item = d['customer_id'], d['product_id']
    usersPerItem[item].add(user)
    itemsPerUser[user].add(item)
    itemNames[item] = d['product_title']
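
To see what these two indexes look like, here is a small sketch on made-up review pairs. The names `toy_reviews`, `toy_users_per_item`, and `toy_items_per_user` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical (user, item) review pairs, for illustration only
toy_reviews = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u3", "B"), ("u3", "C")]

toy_users_per_item = defaultdict(set)
toy_items_per_user = defaultdict(set)
for user, item in toy_reviews:
    toy_users_per_item[item].add(user)
    toy_items_per_user[user].add(item)

users_of_a = sorted(toy_users_per_item["A"])   # users who reviewed item A
items_of_u3 = sorted(toy_items_per_user["u3"])  # items reviewed by user u3
print(users_of_a)
print(items_of_u3)
```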

This is a generic implementation of the Jaccard similarity between two sets, which we'll use to build our recommender:

In [None]:
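def Jaccard(s1, s2):
    # Size of the intersection divided by the size of the union
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    if denom == 0:
        return 0.0 # Both sets are empty; define the similarity as 0
    return numer / denom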
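
As a quick worked example, the sketch below applies the same intersection-over-union formula to two small, made-up user sets. The helper name `jaccard_example` and the user IDs are hypothetical.

```python
def jaccard_example(s1, s2):
    # Same formula as Jaccard above: |intersection| / |union|
    return len(s1 & s2) / len(s1 | s2)

users_a = {"u1", "u2", "u3"}
users_b = {"u2", "u3", "u4"}

# Intersection is {u2, u3} (size 2); union is {u1, u2, u3, u4} (size 4)
overlap = jaccard_example(users_a, users_b)
disjoint = jaccard_example(users_a, {"u9"})  # no shared users
print(overlap)   # 0.5
print(disjoint)  # 0.0
```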

Our recommender finds the items (i2) most similar to the query item (i), ranked by Jaccard similarity, i.e., the overlap between the sets of users who reviewed each item. It returns the 5 most similar items.

In [None]:
def mostSimilar(i):
    similarities = []
    users = usersPerItem[i] # Users who reviewed the query item
    for i2 in usersPerItem: # For all items
        if i == i2: continue # other than the query
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:5]

### Generating a recommendation
Let's select an example item from the dataset to use as a query for generating recommendations:

In [None]:
dataset[1]

In [None]:
query = dataset[1]['product_id']
print(query)

Next we'll examine the most similar items compared to this query:

In [None]:
mostSimilar(query)

In [None]:
itemNames[query]

In [None]:
[itemNames[x[1]] for x in mostSimilar(query)]

## Efficient similarity-based recommendation

In [None]:
def mostSimilarFast(i):
    similarities = []
    users = usersPerItem[i]
    candidateItems = set()
    for u in users: # Only items that share a user with the query can have nonzero similarity
        candidateItems.update(itemsPerUser[u])
    for i2 in candidateItems:
        if i2 == i: continue
        sim = Jaccard(users, usersPerItem[i2])
        similarities.append((sim,i2))
    similarities.sort(reverse=True)
    return similarities[:5]

In [None]:
mostSimilarFast(query)
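
The speed-up in `mostSimilarFast` comes from scoring only items that share at least one user with the query; every other item has a Jaccard similarity of zero. The sketch below checks this on made-up data. All names in it (`toy_reviews`, `full_scan`, `candidate_scan`, and so on) are hypothetical, and it is an illustration under those assumptions rather than a benchmark.

```python
from collections import defaultdict

# Hypothetical toy reviews: (user, item)
toy_reviews = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "C"), ("u3", "D")]

toy_users_per_item = defaultdict(set)
toy_items_per_user = defaultdict(set)
for user, item in toy_reviews:
    toy_users_per_item[item].add(user)
    toy_items_per_user[user].add(item)

def toy_jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

def full_scan(i):
    # Score every other item, as mostSimilar does
    users = toy_users_per_item[i]
    return sorted(((toy_jaccard(users, toy_users_per_item[i2]), i2)
                   for i2 in toy_users_per_item if i2 != i), reverse=True)

def candidate_scan(i):
    # Score only items reviewed by users of the query item, as mostSimilarFast does
    users = toy_users_per_item[i]
    candidates = set()
    for u in users:
        candidates.update(toy_items_per_user[u])
    candidates.discard(i)
    return sorted(((toy_jaccard(users, toy_users_per_item[i2]), i2)
                   for i2 in candidates), reverse=True)

full = full_scan("A")
fast = candidate_scan("A")
nonzero_full = [pair for pair in full if pair[0] > 0]
print(full)  # includes item D with similarity 0.0
print(fast)  # item D is never scored
```

The two results agree on every item with nonzero similarity. The top-5 lists can therefore differ only when the full scan pads its output with zero-similarity items.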

## Rating estimation of Item-based Collaborative-filtering with Jaccard
We can also use the similarity-based recommender we developed above to predict users' ratings.

Specifically, a user's rating for an item is estimated as a weighted average of their previous ratings, where each rating is weighted by how similar the query item is to the previously rated item.
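
Written as a formula, this estimate is a weighted sum of previous ratings normalized by the total weight. The symbols are introduced here only for illustration: $I_u$ is the set of items user $u$ has reviewed, $R_{u,j}$ is the rating $u$ gave item $j$, and $\mathrm{Sim}(i,j)$ is the Jaccard similarity between the sets of users who reviewed items $i$ and $j$.

$$
r(u,i) = \frac{\sum_{j \in I_u \setminus \{i\}} \mathrm{Sim}(i,j)\, R_{u,j}}{\sum_{j \in I_u \setminus \{i\}} \mathrm{Sim}(i,j)}
$$

When the denominator is zero, the expression is undefined and a default prediction is needed instead.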

We start by building a few more utility data structures to keep track of all of the reviews by each user and for each item.

In [None]:
reviewsPerUser = defaultdict(list)
reviewsPerItem = defaultdict(list)

In [None]:
for d in dataset:
    user,item = d['customer_id'], d['product_id']
    reviewsPerUser[user].append(d)
    reviewsPerItem[item].append(d)

Next we compute the rating mean. This will be used as a simple baseline, but will also be used as a "default" prediction in the event that the user has rated no previous items with a Jaccard similarity greater than zero (compared to the query).

In [None]:
ratingMean = sum([d['star_rating'] for d in dataset]) / len(dataset)

In [None]:
ratingMean

Our prediction function computes (a) a list of the user's previous ratings (ignoring the query item); and (b) a list of the similarities of these previous items, compared to the query. These similarities are used as weights to construct a weighted average of the ratings from the first list.

In [None]:
def predictRating(user,item):
    ratings = []
    similarities = []
    for d in reviewsPerUser[user]:
        i2 = d['product_id']
        if i2 == item: continue
        ratings.append(d['star_rating'])
        similarities.append(Jaccard(usersPerItem[item],usersPerItem[i2]))
    if (sum(similarities) > 0):
        weightedRatings = [(x*y) for x,y in zip(ratings,similarities)]
        return sum(weightedRatings) / sum(similarities)
    else:
        # User hasn't rated any similar items
        return ratingMean
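
To make the arithmetic concrete, here is a small sketch of the same weighted average on made-up numbers. The ratings, similarities, default value, and the helper name `weighted_average` are all hypothetical.

```python
# Hypothetical inputs: the user's past ratings and each past item's
# Jaccard similarity to the query item
past_ratings = [5, 3, 1]
similarities = [0.5, 0.25, 0.0]

def weighted_average(ratings, weights, default):
    total_weight = sum(weights)
    if total_weight > 0:
        return sum(r * w for r, w in zip(ratings, weights)) / total_weight
    return default  # no similar items: fall back to a default such as the mean

# (5 * 0.5 + 3 * 0.25 + 1 * 0.0) / (0.5 + 0.25 + 0.0) = 3.25 / 0.75
pred = weighted_average(past_ratings, similarities, default=4.0)
fallback = weighted_average(past_ratings, [0.0, 0.0, 0.0], default=4.0)
print(pred)      # about 4.33
print(fallback)  # 4.0
```

The item with zero similarity contributes nothing, so the prediction is pulled toward the rating of the most similar item.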

Let's try a simple example:

In [None]:
dataset[1]

In [None]:
u,i = dataset[1]['customer_id'], dataset[1]['product_id']

In [None]:
predictRating(u, i)

Next, we evaluate the performance of our model using the mean squared error (MSE), comparing it against always predicting the mean rating:

In [None]:
def MSE(predictions, labels):
    differences = [(x-y)**2 for x,y in zip(predictions,labels)]
    return sum(differences) / len(differences)
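
As a quick check of what this metric measures, the sketch below computes it by hand on three made-up labels. The helper name `mse_example` and the values are hypothetical.

```python
def mse_example(predictions, labels):
    # Mean of the squared differences, as in MSE above
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

toy_labels = [5, 4, 1]

perfect = mse_example([5, 4, 1], toy_labels)   # every prediction is exact
constant = mse_example([4, 4, 4], toy_labels)  # (1 + 0 + 9) / 3
print(perfect)   # 0.0
print(constant)  # about 3.33
```

Because the errors are squared, the single large miss (predicting 4 for a 1-star rating) dominates the result.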

In [None]:
alwaysPredictMean = [ratingMean for d in dataset]

In [None]:
len(alwaysPredictMean)

In [None]:
cfPredictions = [predictRating(d['customer_id'], d['product_id']) for d in dataset]

In [None]:
labels = [d['star_rating'] for d in dataset]

In [None]:
MSE(alwaysPredictMean, labels)

In [None]:
MSE(cfPredictions, labels)

In this case, the accuracy of our rating prediction model was actually worse (in terms of the MSE) than just predicting the mean rating. However, note that this is just a heuristic and could be modified to improve its predictions (e.g., by using a similarity function other than the Jaccard similarity or by increasing the size of the dataset).
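
As one example of a different similarity function, here is a minimal sketch of cosine similarity between two user sets treated as binary indicator vectors. It is an illustrative alternative that this notebook does not evaluate, and there is no guarantee that it lowers the MSE on this dataset. The function name `cosine_sets` and the user IDs are hypothetical.

```python
import math

def cosine_sets(s1, s2):
    # Cosine similarity of two sets viewed as binary vectors:
    # |s1 & s2| / sqrt(|s1| * |s2|)
    if not s1 or not s2:
        return 0.0  # assumption: similarity with an empty set is defined as 0
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))

users_a = {"u1", "u2", "u3", "u4"}
users_b = {"u3", "u4", "u5", "u6"}

cos_sim = cosine_sets(users_a, users_b)  # 2 / sqrt(4 * 4)
empty_sim = cosine_sets(set(), users_a)
print(cos_sim)    # 0.5
print(empty_sim)  # 0.0
```

Since it takes two sets and returns a number between 0 and 1, a function like this could be passed the same `usersPerItem` sets that `Jaccard` receives.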

