(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Recommender System for Amazon Electronics

In this assignment, we will be working with the [Amazon dataset](http://cs-people.bu.edu/kzhao/teaching/amazon_reviews_Electronics.tar.gz). You will build a recommender system to make predictions related to reviews of Electronics products on Amazon.

Your grades will be determined by your performance on the predictive tasks as well as a brief written report about the approaches you took.

This assignment should be completed **individually**.

## Files

**train.json** 1,000,000 reviews to be used for training. It is not necessary to use all reviews for training if doing so proves too computationally intensive. The fields in this file are:

* **reviewerID** The ID of the reviewer. This is a hashed user identifier from Amazon.

* **asin** The ID of the item. This is a hashed product identifier from Amazon.

* **overall** The rating the reviewer gave the item.

* **helpful** The helpfulness votes for the review. This has 2 subfields, 'nHelpful' and 'outOf'. The latter is the total number of votes this review received. The former is the number of those that considered the review to be helpful.

* **reviewText** The text of the review.

* **summary** The summary of the review.

* **unixReviewTime** The time of the review in seconds since 1970.

**meta.json** Contains metadata of the items:

* **asin** The ID of the item.

* **categories** The category labels of the item being reviewed.

* **price** The price of the item.

* **brand** The brand of the item.

**pairs_Rating.txt** The pairs (reviewerID and asin) on which you are to predict ratings.

**pairs_Purchase.txt** The pairs on which you are to predict whether a user purchased an item or not.

**pairs_Helpful.txt** The pairs on which you are to predict helpfulness votes. A third column in this file is the total number of votes from which you should predict how many were helpful.

**helpful.json** The review data associated with the helpfulness prediction test set. The 'nHelpful' field has been removed from this data since that is the value you need to predict above. This data will only be of use for the helpfulness prediction task.

**baseline.py** A simple baseline for each task.

## Tasks

**Rating prediction** Predict people's star ratings as accurately as possible for those (reviewerID, asin) pairs in 'pairs_Rating.txt'. Accuracy will be measured in terms of the [root mean-squared error (RMSE)](http://www.kaggle.com/wiki/RootMeanSquaredError).

**Purchase prediction** Given a (reviewerID, asin) pair from 'pairs_Purchase.txt', predict whether the user purchased the item (really, whether it was one of the items they reviewed). Accuracy will be measured in terms of the [categorization accuracy](http://www.kaggle.com/wiki/HammingLoss) (1 minus the Hamming loss).

**Helpfulness prediction** Predict whether a user's review of an item will be considered helpful. The file 'pairs_Helpful.txt' contains (reviewerID, asin) pairs with a third column containing the number of votes the user's review of the item received. You must predict how many of them were helpful. Accuracy will be measured in terms of the total [absolute error](http://www.kaggle.com/wiki/AbsoluteError), i.e. you are penalized according to the difference |nHelpful - prediction|, where 'nHelpful' is the number of helpful votes the review actually received, and 'prediction' is your prediction of this quantity.
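The three evaluation measures named above can be sketched as follows. This is only an illustration of the definitions: the function names and the sample values are made up here, and Kaggle's own scoring code may differ in detail.

```python
import math

def _check(actual, predicted):
    # Both sequences must line up one-to-one and be non-empty.
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

def rmse(actual, predicted):
    """Root mean-squared error (rating prediction)."""
    _check(actual, predicted)
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def accuracy(actual, predicted):
    """Fraction of labels predicted correctly, i.e. 1 minus the Hamming loss."""
    _check(actual, predicted)
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def total_absolute_error(actual, predicted):
    """Sum of |nHelpful - prediction| over all reviews (helpfulness prediction)."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical values, for illustration only
print(rmse([5.0, 3.0, 4.0], [4.0, 3.0, 2.0]))
print(accuracy([1, 0, 1, 1], [1, 1, 1, 0]))
print(total_absolute_error([3, 0, 10], [2.5, 1.0, 10.0]))
```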

We set up competitions on Kaggle to keep track of your results compared to those of other members of the class. The leaderboard will show your results on half of the test data, but your ultimate score will depend on your predictions across the whole dataset.
* Kaggle competition: [rating prediction](https://inclass.kaggle.com/c/cs591-hw3-rating-prediction3) click here to [join](https://kaggle.com/join/datascience16rating)
* Kaggle competition: [purchase prediction](https://inclass.kaggle.com/c/cs591-hw3-purchase-prediction) click here to [join](https://kaggle.com/join/datascience16purchase)
* Kaggle competition: [helpfulness prediction](https://inclass.kaggle.com/c/cs591-hw3-helpful-prediction) click here to [join](https://kaggle.com/join/datascience16helpful)

## Grading and Evaluation

You will be graded on the following aspects.

* Your written report. This should describe the approaches you took to each of the 3 tasks. To obtain good performance, you should not need to invent new approaches (though you are more than welcome to) but rather you will be graded based on your decision to apply reasonable approaches to each of the given tasks. (**10pts** for each task)

* Your ability to obtain a solution which outperforms the baselines on the unseen portion of the test data. Obtaining full marks requires a solution which is substantially better (at least several percent) than baseline performance. (**10pts** for each task)

* Your ranking for each of the three tasks compared to other students in the class. (**5pts** for each task)

* Obtain a solution which outperforms the baselines on the seen portion of the test data (the leaderboard). (**5pts** for each task)

## Baselines

Simple baselines have been provided for each of the 3 tasks. These are included in 'baseline.py' among the files above. These 3 baselines operate as follows:

**Rating prediction** Returns the global average rating, or the user's average if you have seen them before in the training data.

**Purchase prediction** Finds the most popular products that account for 50% of purchases in the training data. Returns '1' whenever such a product is seen at test time, '0' otherwise.

**Helpfulness prediction** Multiplies the number of votes by the global average helpfulness rate, or the user's rate if we saw this user in the training data.

Running 'baseline.py' produces 3 files containing predicted outputs. Your submission files should have the same format.
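To make the purchase baseline concrete, the idea of "the most popular products that account for 50% of purchases" can be sketched as below. This is a minimal sketch, not the contents of 'baseline.py': the function name `popular_items`, the `share` parameter and the toy purchase list are assumptions made for illustration, and the exact stopping rule in the provided script may differ.

```python
from collections import Counter

def popular_items(purchased_items, share=0.5):
    """Return the most popular items that together cover `share` of all purchases.

    `purchased_items` holds one item ID per purchase (hypothetical input format).
    """
    counts = Counter(purchased_items)
    total = sum(counts.values())
    popular = set()
    covered = 0
    # Walk items from most to least purchased until the share is covered.
    for item, n in counts.most_common():
        if covered >= share * total:
            break
        popular.add(item)
        covered += n
    return popular

# Toy data: item 'a' alone accounts for half of the ten purchases.
purchases = ['a'] * 5 + ['b'] * 3 + ['c'] * 2
popular = popular_items(purchases)
print(1 if 'a' in popular else 0)  # predicted label for a pair involving 'a'
print(1 if 'c' in popular else 0)  # predicted label for a pair involving 'c'
```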

## Dataset Citation

**Image-based recommendations on styles and substitutes** J. McAuley, C. Targett, J. Shi, A. van den Hengel *SIGIR*, 2015

**Inferring networks of substitutable and complementary products** J. McAuley, R. Pandey, J. Leskovec *Knowledge Discovery and Data Mining*, 2015

-----------------

In [145]:
from collections import defaultdict
import pandas as pd
import numpy as np
def readJson(f):
    for l in open(f):
        yield eval(l)
        
allRatings = []
userRatings = defaultdict(list)
itemRatings = defaultdict(list)
for l in readJson('train.json'):
  user,item = l['reviewerID'],l['asin']
  allRatings.append(l['overall'])
  userRatings[user].append(l['overall'])
  itemRatings[item].append(l['overall'])

globalAverage = sum(allRatings) / len(allRatings)
'''
# Stochastic Gradient Descent
allRatings = []
userRating = {}
UserIDD = {}
ItemIDD = {}
userid = {}
itemid = {}
uc=0
ic=0

for line in readJson('train.json'):
    user,item,stars= line['reviewerID'],line['asin'],line['overall']
    train =[user,item,stars]
    allRatings.append(train)

review = pd.DataFrame(allRatings, columns = ['UserID','ItemID','Stars'])

# initially taking random values for the biases and P and Q
k=12
userbais = np.random.rand(len(review.UserID.unique()))
itembais = np.random.rand(len(review.ItemID.unique()))
P = np.random.rand(len(review.UserID.unique()),k)
Q = np.random.rand(len(review.ItemID.unique()),k)
Q = Q.T

Lamda = 1 

LRate = 0.05

for userID in review.UserID.unique():
    UserIDD[userID]=uc
    uc+=1
    
for itemID in review.ItemID.unique():
    ItemIDD[itemID]=ic
    ic+=1

review['UserID'] = review['UserID'].apply(lambda x:int(UserIDD[x]))
review['ItemID'] = review['ItemID'].apply(lambda x:int(ItemIDD[x]))

# use distinct loop names so the latent dimension k is not overwritten
for epoch in range(6):
    Global_mu = globalAverage
    for row in range(len(review)):
        UserID = review['UserID'][row]
        ItemID = review['ItemID'][row]
        Error = review['Stars'][k]-(Global_mu+userbais[UserID]+itembais[ItemID]+np.dot(P[UserID,:] , Q[:,ItemID]))
        userbais[UserID] += (LRate*(Error-(Lamda*userbais[UserID])))
        itembais[ItemID] += (LRate*(Error-(Lamda*itembais[ItemID])))
        q = Q[:,ItemID]+(LRate*(Error*P[UserID,:]-Lamda*Q[:,ItemID]))
        p = P[UserID,:]+(LRate*(Error*Q[:,ItemID]-Lamda*P[UserID,:]))
        # to store the old value
        Q[:,ItemID]=q
        P[UserID,:]=p

        
predictions = open("predictions_Rating.txt", 'w')
for l in open("pairs_Rating.txt"):
  if l.startswith("reviewerID"):
    #header
    predictions.write(l)
    continue
  u,i = l.strip().split('-')
  # look the pair up directly instead of scanning every training review
  if u in UserIDD and i in ItemIDD:
    pred = Global_mu+userbais[UserIDD[u]]+itembais[ItemIDD[i]]+np.dot(P[UserIDD[u],:],Q[:,ItemIDD[i]])
  else:
    # unseen user or item: fall back to the global average
    pred = Global_mu
  predictions.write(u + '-' + i + ',' + str(pred) + '\n')

predictions.close()

'''


userAverage = {}
itemAverage = {}
for u in userRatings:
    userAverage[u] = sum(userRatings[u]) / len(userRatings[u])
for v in itemRatings:
    itemAverage[v] = sum(itemRatings[v]) / len(itemRatings[v])

predictions = open("predictions_Rating.txt", 'w')
for l in open("pairs_Rating.txt"):
  if l.startswith("reviewerID"):
    #header
    predictions.write(l)
    continue
  u,i = l.strip().split('-')
  if i in itemAverage and u in userAverage:
    avg = (2*itemAverage[i] + userAverage[u] + globalAverage)/4    
    predictions.write(u + '-' + i + ',' + str(avg) + '\n')
  elif i in itemAverage and u not in userAverage:
    avg = (itemAverage[i] + globalAverage)/2 
    predictions.write(u + '-' + i + ',' + str(avg) + '\n')
  elif u in userAverage and i not in itemAverage:
    avg = (userAverage[u] + globalAverage)/2
    predictions.write(u + '-' + i + ',' + str(avg) + '\n')
  else:
    predictions.write(u + '-' + i + ',' + str(globalAverage) + '\n')
predictions.close()


Task 1 was to predict users' ratings for the pairs in the file provided. Initially I tried to build a recommender system trained with stochastic gradient descent and to predict from it. First I read all the data from the train.json file and created a data frame 'review' which contains every user ID, item ID and the rating given. I took k=12 latent factors and initialized the user bias and item bias vectors randomly, with one entry per unique user and per unique item, along with random P and Q matrices. I took lambda to be 1 and the learning rate to be 0.05. I calculated the global average and used it in place of Mu. I then trained the system for 6 iterations, computing the error and updating the biases and factors each time using the equations taught in class. The prediction also included the dot product of the user and item factor vectors. Then I opened the pairs file, read each pair from there and tried to predict its rating from Mu, the user bias, the item bias and the P and Q values, as in the formula taught in class. Although the program ran, it took hours to compute, and the score on the Kaggle competition for this solution was very bad: it only just beat the baseline. I have included the code for this above in the commented section.
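The model described in this paragraph, a prediction of the form mu + b_u + b_i + p_u · q_i trained by stochastic gradient descent, can be sketched in plain Python as below. This is a toy-scale sketch and not the commented-out code above: the function names, the zero-initialized biases, the small random factor initialization and the values `k=2`, `reg=0.02` and `epochs=200` are all assumptions chosen for the example, and the integer user and item indices are hypothetical.

```python
import random

def train_biased_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
                    epochs=200, seed=0):
    """SGD on (user_index, item_index, rating) triples for
    r_ui ~ mu + b_u + b_i + p_u . q_i with L2 regularization."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = [0.0] * n_users
    b_i = [0.0] * n_items
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    order = list(ratings)
    for _ in range(epochs):
        rng.shuffle(order)  # visit the ratings in a new random order each pass
        for u, i, r in order:
            dot = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            err = r - (mu + b_u[u] + b_i[i] + dot)
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(k):
                pf, qf = P[u][f], Q[i][f]  # keep old values for both updates
                P[u][f] += lr * (err * qf - reg * pf)
                Q[i][f] += lr * (err * pf - reg * qf)
    return mu, b_u, b_i, P, Q

def predict(model, u, i):
    """Predicted rating for user index u and item index i."""
    mu, b_u, b_i, P, Q = model
    return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Toy data: 3 users, 3 items, 6 observed ratings (made up for illustration)
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0), (2, 0, 5.0), (2, 2, 1.0)]
model = train_biased_mf(toy, n_users=3, n_items=3)
print(predict(model, 0, 0))
```

Looking users and items up by index, as `predict` does, keeps prediction at constant cost per pair rather than one pass over the training data per pair.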

Finally, not getting the desired output from the above method, I calculated the global average of all the ratings and assigned predictions as follows:
1)	If both the item and the user are present in our training dataset, I took a weighted average of that user's average rating, that item's average rating and the global average. I gave the item's average double weight, as it makes more sense to get similar ratings for the same item.
2)	If only the item is present in our training dataset and the user is not, I computed the average of the item's average and the global average and wrote that.
3)	If only the user is present in our training dataset and the item is not, I computed the average of the user's average rating and the global average and wrote that.
4)	If neither the user nor the item is present in our training dataset, I just wrote the global average, as we don't have any other information.
Surprisingly, this method gave me a very good score, so I submitted these predictions as my final entry on Kaggle.


In [144]:
from collections import defaultdict
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.cluster import KMeans
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def readJson(f):
  for l in open(f):
    yield eval(l)
'''

countVector = CountVectorizer(min_df=1)
list = []
for line in readJson('meta.json'):
    item,categories = line['asin'],line['categories']
    list.append([item,categories])
    
df = pd.DataFrame(list,columns=['item','category'])
itemsdf = pd.DataFrame(countVector.fit_transform(df.category).toarray(), columns=countVector.get_feature_names())

mis = np.zeros(30)
mis[0] = 0;
for key in range(1,30):
    kmeans = KMeans(init='k-means++', n_clusters=key, n_init=100)
    kmeans.fit_predict(itemsdf)
    mis[key] = kmeans.inertia_

import matplotlib.pyplot as plt  # needed for the elbow plot below
plt.plot(range(1,len(mis)),mis[1:])
plt.xlabel('Number of clusters')
plt.ylabel('Error')

kmeans = KMeans(init='k-means++', n_clusters=20, n_init=100)
kmeans.fit_predict(itemsdf)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
mis = kmeans.inertia_

'''
itemCount = defaultdict(int)
totalPurchases = 0

for l in readJson('train.json'):
  user,item = l['reviewerID'],l['asin']
  itemCount[item] += 1
  totalPurchases += 1

mostPopular = [(itemCount[x], x) for x in itemCount]
mostPopular.sort(reverse=True)

return1 = set()
count = 0
for ic, i in mostPopular[:61110]:
  count += ic
  return1.add(i)
 

predictions = open("predictions_Purchase.txt", 'w')
for l in open("pairs_Purchase.txt"):
  if l.startswith("reviewerID"):
    #header
    predictions.write(l)
    continue
  u,i = l.strip().split('-')
  if i in return1:
    predictions.write(u + '-' + i + ",1\n")
  else:
    predictions.write(u + '-' + i + ",0\n")
predictions.close()


Task 2 was to predict whether a user purchased an item or not. Initially I tried to solve this using clusters. I used k-means++ to form clusters of similar items. To do so, I read all the items and their categories from the meta file into a list, then converted it to a data frame holding each item with its respective categories, based on which I formed 20 clusters. My idea was that if a user bought an item from a cluster, then they would be more likely to buy another, similar item from the same cluster. So for every pair that I read from the pairs file, I would check which cluster the item belonged to and whether the user had bought any item from that cluster: if so I would predict 1, otherwise 0. But this didn't work well: the clustering took hours and the kernel crashed on many of the operations I tried to perform. I have included the code for forming the clusters above.
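The prediction rule described here (predict 1 when the user has already bought something from the item's cluster) can be sketched independently of how the clusters are produced. This is a minimal sketch under assumptions: `item_cluster` is taken to be a plain dict from item ID to cluster label, such as one might build from the k-means labels, and the function names and toy IDs are hypothetical.

```python
from collections import defaultdict

def build_user_clusters(purchases, item_cluster):
    """Map each user to the set of clusters they have purchased from.

    `purchases` is an iterable of (user, item) pairs from the training data;
    `item_cluster` maps item ID -> cluster label (assumed precomputed).
    """
    user_clusters = defaultdict(set)
    for user, item in purchases:
        if item in item_cluster:
            user_clusters[user].add(item_cluster[item])
    return user_clusters

def predict_purchase(user, item, user_clusters, item_cluster):
    """Return 1 if the user has bought any item from this item's cluster, else 0."""
    cluster = item_cluster.get(item)
    if cluster is None:
        return 0  # item has no cluster assignment (e.g. missing metadata)
    return 1 if cluster in user_clusters.get(user, set()) else 0

# Toy example with made-up IDs
item_cluster = {'i1': 0, 'i2': 0, 'i3': 1}
user_clusters = build_user_clusters([('u1', 'i1')], item_cluster)
print(predict_purchase('u1', 'i2', user_clusters, item_cluster))
print(predict_purchase('u1', 'i3', user_clusters, item_cluster))
```

Precomputing the per-user cluster sets once means each test pair costs two dictionary lookups, rather than a scan over the user's purchase history.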

To improve on the baseline, I therefore built on its popularity idea. I created a list of the most popular items, sorted it in descending order of the number of times each item was purchased, and kept the top 61,110 items out of the 170,000 items. I then predicted 1 if the item belonged to this list and 0 otherwise. This approach gave me a very good score on the Kaggle competition, so I left it at that.



In [143]:
def readJson(f):
  for l in open(f):
    yield eval(l)

allHelpful = []
userHelpful = defaultdict(list)
itemHelpful = defaultdict(list)

for l in readJson('train.json'):
  user,item = l['reviewerID'],l['asin']
  allHelpful.append(l['helpful'])
  userHelpful[user].append(l['helpful'])
  itemHelpful[item].append(l['helpful'])  

averageRate = sum([x['nHelpful'] for x in allHelpful]) * 1.0 / sum([x['outOf'] for x in allHelpful])
userRate = {}
itemRate = {}
for u in userHelpful:
  userRate[u] = sum([x['nHelpful'] for x in userHelpful[u]]) * 1.0 / sum([x['outOf'] for x in userHelpful[u]])
for i in itemHelpful:
  itemRate[i] = sum([x['nHelpful'] for x in itemHelpful[i]]) * 1.0 / sum([x['outOf'] for x in itemHelpful[i]])

values = userRate.values()
useravg = sum(values) / len(values)

values = itemRate.values()
itemavg = sum(values) / len(values)

predictions = open("predictions_Helpful.txt", 'w')
for l in open("pairs_Helpful.txt"):
  if l.startswith("reviewerID"):
    #header
    predictions.write(l)
    continue
  u,i,outOf = l.strip().split('-')
  outOf = int(outOf)
  if i in itemRate and u in userRate:
    avg = (2*userRate[u] + itemRate[i] + averageRate)/4
    predictions.write(u + '-' + i + '-' + str(outOf) + ',' + str(outOf*avg) + '\n')  
  elif u not in userRate and i in itemRate:
    avg = (itemRate[i] + averageRate)/2
    predictions.write(u + '-' + i + '-' + str(outOf) + ',' + str(outOf*avg) + '\n')
  elif u in userRate and i not in itemRate:
    avg = (userRate[u] + averageRate)/2
    predictions.write(u + '-' + i + '-' + str(outOf) + ',' + str(outOf*avg) + '\n')
  else:
    avg = (averageRate)
    predictions.write(u + '-' + i + '-' + str(outOf) + ',' + str(outOf*avg) + '\n')
predictions.close()

Task 3 was to predict how many out of the total votes found a particular review helpful. I initially started with the same approach as the first task, except that instead of training on the review stars I passed in the number of helpful votes divided by the total number of votes. I tried to build a recommender system using stochastic gradient descent and predicted from it using the formula. But as that approach took a very long time and its score was not very good in the end, I solved this task using an averaging method similar to the one I used for task 1.
I calculated the global helpfulness rate (all helpful votes divided by all votes across the training reviews), as well as the helpfulness rate of each user and of each item, and assigned predictions as follows:
1)	If both the item and the user are present in our training dataset, I took a weighted average of that user's helpfulness rate, that item's helpfulness rate and the global helpfulness rate. The user's rate gets double weight, as in the `2*userRate[u]` term of the code above.
2)	If only the item is present in our training dataset and the user is not, I computed the average of the item's helpfulness rate and the global helpfulness rate and wrote that.
3)	If only the user is present in our training dataset and the item is not, I computed the average of the user's helpfulness rate and the global helpfulness rate and wrote that.
4)	If neither the user nor the item is present in our training dataset, I just took the global helpfulness rate, as we don't have any other information.
Finally, I multiplied the resulting rate by the total number of votes (`outOf`) given for each review in 'pairs_Helpful.txt'.
Surprisingly, this method gave me a very good score, so I submitted these predictions as my final entry on Kaggle.
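The four cases above can be sketched as small functions. This is an illustrative restatement under assumptions rather than the submitted code: the function names are made up, the weights mirror the code cell above, and returning `None` for a user or item whose reviews received no votes at all is an added safeguard against dividing by zero, which the cell above does not include.

```python
def helpful_rate(votes):
    """Helpfulness rate from a list of (nHelpful, outOf) pairs.

    Returns None when the reviews received no votes, so the caller can
    treat the user or item as unseen instead of dividing by zero.
    """
    total = sum(out_of for _, out_of in votes)
    if total == 0:
        return None
    return sum(n for n, _ in votes) / total

def blend_rate(user_rate, item_rate, global_rate):
    """Combine the available rates following the four cases."""
    if user_rate is not None and item_rate is not None:
        return (2 * user_rate + item_rate + global_rate) / 4
    if item_rate is not None:
        return (item_rate + global_rate) / 2
    if user_rate is not None:
        return (user_rate + global_rate) / 2
    return global_rate

def predict_helpful(out_of, user_rate, item_rate, global_rate):
    """Predicted number of helpful votes out of `out_of` total votes."""
    return out_of * blend_rate(user_rate, item_rate, global_rate)

# Made-up numbers for illustration
user_rate = helpful_rate([(3, 4), (1, 4)])   # user seen, with votes
item_rate = helpful_rate([(0, 0)])           # item seen, but no votes
print(predict_helpful(10, user_rate, item_rate, 0.75))
```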


In [148]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("custom.css", "r").read()
    return HTML(styles)
css_styling()