(In order to load the stylesheet of this notebook, execute the last code cell in this notebook)

# Recommender System for Amazon Electronics

In this assignment, we will be working with the [Amazon dataset](http://cs-people.bu.edu/kzhao/teaching/amazon_reviews_Electronics.tar.gz). You will build a recommender system to make predictions related to reviews of Electronics products on Amazon.

Your grades will be determined by your performance on the predictive tasks as well as a brief written report about the approaches you took.

This assignment should be completed **individually**.

## Files

**train.json** 1,000,000 reviews to be used for training. It is not necessary to use all reviews for training if doing so proves too computationally intensive. The fields in this file are:

* **reviewerID** The ID of the reviewer. This is a hashed user identifier from Amazon.

* **asin** The ID of the item. This is a hashed product identifier from Amazon.

* **overall** The rating the reviewer gave the item.

* **helpful** The helpfulness votes for the review. This has 2 subfields, 'nHelpful' and 'outOf'. The latter is the total number of votes this review received. The former is the number of those that considered the review to be helpful.

* **reviewText** The text of the review.

* **summary** The summary of the review.

* **unixReviewTime** The time of the review in seconds since 1970.
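
As a quick illustration of the fields above, the sketch below parses one review record and derives its helpfulness rate, nHelpful / outOf. The record values are invented, `helpful_rate` is a hypothetical helper, and the sketch assumes each line of train.json can be read as a Python literal.

```python
import ast

# Hypothetical record: field names follow the list above, values are made up.
line = ("{'reviewerID': 'U000000001', 'asin': 'I000000001', 'overall': 4.0, "
        "'helpful': {'nHelpful': 3, 'outOf': 4}, 'reviewText': 'Works well.', "
        "'summary': 'Good', 'unixReviewTime': 1400000000}")

def helpful_rate(review):
    """Return nHelpful / outOf, or None when the review received no votes."""
    votes = review['helpful']
    if votes['outOf'] == 0:
        return None
    return votes['nHelpful'] / float(votes['outOf'])

review = ast.literal_eval(line)
rate = helpful_rate(review)
print(rate)
```

Returning `None` for reviews without votes keeps them from being counted as "0% helpful" when rates are averaged.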

**meta.json** Contains metadata of the items:

* **asin** The ID of the item.

* **categories** The category labels of the item being reviewed.

* **price** The price of the item.

* **brand** The brand of the item.

**pairs_Rating.txt** The pairs (reviewerID and asin) on which you are to predict ratings.

**pairs_Purchase.txt** The pairs on which you are to predict whether a user purchased an item or not.

**pairs_Helpful.txt** The pairs on which you are to predict helpfulness votes. A third column in this file is the total number of votes from which you should predict how many were helpful.

**helpful.json** The review data associated with the helpfulness prediction test set. The 'nHelpful' field has been removed from this data since that is the value you need to predict above. This data will only be of use for the helpfulness prediction task.

**baseline.py** A simple baseline for each task.
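
The code cells later in this notebook read the pairs files as one `reviewerID-asin` key per line, after a header line that starts with `reviewerID`. A minimal reader under that assumption might look like the sketch below; `read_pairs` is a hypothetical helper and the sample lines are invented.

```python
def read_pairs(lines):
    """Return (reviewerID, asin) tuples from lines formatted as 'reviewerID-asin'.

    Assumes a header line starting with 'reviewerID' and ignores anything
    after the first comma (for example a prediction column).
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('reviewerID'):
            continue  # skip blank lines and the header
        key = line.split(',')[0]
        user, item = key.split('-', 1)
        pairs.append((user, item))
    return pairs

# Invented sample standing in for open('pairs_Rating.txt').
sample = ["reviewerID-asin,prediction\n",
          "U000000001-I000000002\n",
          "U000000003-I000000004\n"]
pairs = read_pairs(sample)
print(pairs)
```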

## Tasks

**Rating prediction** Predict people's star ratings as accurately as possible for those (reviewerID, asin) pairs in 'pairs_Rating.txt'. Accuracy will be measured in terms of the [root mean-squared error (RMSE)](http://www.kaggle.com/wiki/RootMeanSquaredError).

**Purchase prediction** Given a (reviewerID, asin) pair from 'pairs_Purchase.txt', predict whether the user purchased the item (really, whether it was one of the items they reviewed). Accuracy will be measured in terms of the [categorization accuracy](http://www.kaggle.com/wiki/HammingLoss) (1 minus the Hamming loss).

**Helpfulness prediction** Predict whether a user's review of an item will be considered helpful. The file 'pairs_Helpful.txt' contains (reviewerID, asin) pairs with a third column containing the number of votes the user's review of the item received. You must predict how many of them were helpful. Accuracy will be measured in terms of the total [absolute error](http://www.kaggle.com/wiki/AbsoluteError), i.e. you are penalized according to the difference |nHelpful - prediction|, where 'nHelpful' is the number of helpful votes the review actually received, and 'prediction' is your prediction of this quantity.
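
For checking predictions locally on a held-out slice of the training data, the three metrics can be sketched as below. The function names and the sample numbers are illustrative; the scores reported by Kaggle remain the authoritative ones.

```python
import math

def rmse(actual, predicted):
    """Root mean-squared error, used for rating prediction."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

def accuracy(actual, predicted):
    """Fraction of correct 0/1 labels, i.e. 1 minus the Hamming loss."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / float(len(actual))

def total_absolute_error(actual, predicted):
    """Sum of |nHelpful - prediction| over all reviews."""
    return sum(abs(a - p) for a, p in zip(actual, predicted))

rating_rmse = rmse([5, 3, 4], [4, 3, 2])
purchase_acc = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
helpful_err = total_absolute_error([3, 0, 10], [2.5, 1, 7])
print(rating_rmse)
print(purchase_acc)
print(helpful_err)
```

Lower is better for RMSE and total absolute error, while higher is better for accuracy.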

We set up competitions on Kaggle to keep track of your results compared to those of other members of the class. The leaderboard will show your results on half of the test data, but your ultimate score will depend on your predictions across the whole dataset.
* Kaggle competition: [rating prediction](https://inclass.kaggle.com/c/cs591-hw3-rating-prediction3) click here to [join](https://kaggle.com/join/datascience16rating)
* Kaggle competition: [purchase prediction](https://inclass.kaggle.com/c/cs591-hw3-purchase-prediction) click here to [join](https://kaggle.com/join/datascience16purchase)
* Kaggle competition: [helpfulness prediction](https://inclass.kaggle.com/c/cs591-hw3-helpful-prediction) click here to [join](https://kaggle.com/join/datascience16helpful)

## Grading and Evaluation

You will be graded on the following aspects.

* Your written report. This should describe the approaches you took to each of the 3 tasks. You should not need to invent new approaches to obtain good performance (though you are more than welcome to); you will be graded on whether you applied reasonable approaches to each of the given tasks. (**10pts** for each task)

* Your ability to obtain a solution which outperforms the baselines on the unseen portion of the test data. Obtaining full marks requires a solution which is substantially better (at least several percent) than baseline performance. (**10pts** for each task)

* Your ranking for each of the three tasks compared to other students in the class. (**5pts** for each task)

* Your ability to obtain a solution which outperforms the baselines on the seen portion of the test data (the leaderboard). (**5pts** for each task)

## Baselines

Simple baselines have been provided for each of the 3 tasks. These are included in 'baseline.py' among the files above. These 3 baselines operate as follows:

**Rating prediction** Returns the global average rating, or the user's average if you have seen them before in the training data.

**Purchase prediction** Finds the most popular products that account for 50% of purchases in the training data. Returns '1' whenever such a product is seen at test time, '0' otherwise.

**Helpfulness prediction** Multiplies the number of votes by the global average helpfulness rate, or the user's rate if we saw this user in the training data.

Running 'baseline.py' produces 3 files containing predicted outputs. Your submission files should have the same format.
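
The three baseline descriptions above can be sketched on toy data as follows. This is not the contents of 'baseline.py': the training tuples are invented, the function names are hypothetical, and two details are assumptions made here (ties in popularity are broken by item ID, and a user with no votes falls back to the global helpfulness rate).

```python
from collections import defaultdict

# Invented toy training data: (user, item, rating, nHelpful, outOf).
train = [
    ('u1', 'i1', 5.0, 2, 4),
    ('u1', 'i2', 3.0, 1, 1),
    ('u2', 'i1', 4.0, 0, 0),
    ('u3', 'i1', 2.0, 1, 5),
    ('u3', 'i3', 1.0, 0, 0),
]

# Rating baseline: the user's mean rating if seen, else the global mean.
global_avg = sum(r for _, _, r, _, _ in train) / float(len(train))
user_ratings = defaultdict(list)
for u, _, r, _, _ in train:
    user_ratings[u].append(r)

def predict_rating(user):
    ratings = user_ratings.get(user)
    return sum(ratings) / len(ratings) if ratings else global_avg

# Purchase baseline: most popular items that cover 50% of training purchases.
item_counts = defaultdict(int)
for _, i, _, _, _ in train:
    item_counts[i] += 1
popular, covered = set(), 0
for item, count in sorted(item_counts.items(), key=lambda kv: (-kv[1], kv[0])):
    popular.add(item)
    covered += count
    if covered >= len(train) / 2.0:
        break

def predict_purchase(item):
    return 1 if item in popular else 0

# Helpfulness baseline: votes times the user's rate, else the global rate.
global_rate = (sum(h for _, _, _, h, _ in train) /
               float(sum(o for _, _, _, _, o in train)))
user_votes = defaultdict(lambda: [0, 0])
for u, _, _, h, o in train:
    user_votes[u][0] += h
    user_votes[u][1] += o

def predict_helpful(user, out_of):
    helpful, votes = user_votes.get(user, (0, 0))
    rate = helpful / float(votes) if votes > 0 else global_rate
    return out_of * rate

print(predict_rating('u1'))
print(predict_purchase('i1'))
print(predict_helpful('u1', 10))
```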

## Dataset Citation

**Image-based recommendations on styles and substitutes** J. McAuley, C. Targett, J. Shi, A. van den Hengel *SIGIR*, 2015

**Inferring networks of substitutable and complementary products** J. McAuley, R. Pandey, J. Leskovec *Knowledge Discovery and Data Mining*, 2015

# Pre-process the data

I trimmed extraneous fields from the raw files, stripped stray characters such as quotation marks, and reshaped the data into one-to-one or one-to-many tables so that they can be aggregated and merged.

In [13]:
import csv
import pandas as pd
import numpy as np
import json

STOPWORDS_FD = 'stopwords'
STOPWORDS = {}
from collections import defaultdict

# From Homework 2
def get_words(filename):
    words = {}
    for line in open(filename):
        word = line.rstrip()
        words[word] = word
    return words

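def readJson(f):
    # Parse each line as a Python literal with ast.literal_eval instead of
    # eval, which would execute arbitrary code found in the file.
    import ast
    with open(f) as fh:
        for l in fh:
            yield ast.literal_eval(l)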

In [42]:
arr = []
for l in readJson('train.json'):
    tmp = [l['reviewerID'],l['asin'],l['overall']]
    arr.append(tmp)

df = pd.DataFrame(arr)
df.to_csv('reviewerID-asin.csv')

In [3]:
df = pd.read_csv('reviewerID-asin.csv')
# Column 0 is the saved index; columns 1 and 2 hold reviewerID and asin.
# unique() replaces the row-by-row loop and the set() de-duplication.
users = df.iloc[:, 1].astype(str).unique()
items = df.iloc[:, 2].astype(str).unique()

df1 = pd.DataFrame(users)
df2 = pd.DataFrame(items)
df1.to_csv('train_users.csv')
df2.to_csv('train_items.csv')

In [68]:
train_users = pd.read_csv('train_users.csv')
train_items = pd.read_csv('train_items.csv')
overall_rating = pd.read_csv('reviewerID-asin.csv')

user_avg_rating = {}
train_items.columns = ['index','item']

overall_rating.columns = ['index','user_id', 'item', 'rating']

group_by_user_avg_rating = overall_rating.groupby(['user_id'])['rating'].mean()

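test = pd.DataFrame(group_by_user_avg_rating)
# The rating and purchase cells below read this file back.
test.to_csv('group_by_user_avg_rating.csv')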

item_total = 0
item_count = 0
user_total = 0
user_count = 0

user_avg_rating = {}  # average rating each user has given
for index, item in test.iterrows():
    user_count = user_count + 1
    _id = str(index)
    rating = item
    user_total = user_total + float(rating)
    user_avg_rating[_id] = float(rating)

group_by_item_avg_rating = overall_rating.groupby(['item'])['rating'].mean()
test2 = pd.DataFrame(group_by_item_avg_rating)
test2.to_csv('item_avg_rating.csv')

item_avg_rating = {}# average rating of each item
for index, item in test2.iterrows():
    item_count = item_count + 1
    item_id = str(index)[4:]
    rating = float(item)
    item_avg_rating[item_id] = rating
    item_total = item_total + rating
    
#all_item_avg_score = 3.789
all_user_avg_score = 3.78
    

In [71]:
test = []
for l in open("pairs_Rating.txt"):
    u,i = l.strip().split('-')
    test.append(i)

test_set = []
for index, item in test2.iterrows():
    item_id = str(index)
    test_set.append(item_id)

test_set = list(set(test_set))
test = list(set(test))

all_existing_items = pd.DataFrame(test_set)
all_existing_items.columns = ['item']

allItems = pd.DataFrame.from_csv("test.csv")
allItems.columns = ['item','types','price']
allItems['item'] = allItems['item'].astype(str)

result = all_existing_items.merge(allItems, left_on='item', right_on='item', how='inner')
result.to_csv('all_existing_items_information.csv')

In [9]:
item_info = pd.read_csv('all_existing_items_information.csv')
item_info.columns = ['index','item','types','price']
cate_price = item_info.groupby(['types'])['price'].mean()
cate_price.to_csv('cate_price.csv')

item_price = item_info.groupby(['item'])['price'].mean()
#item_price[item_price == 0.0] = 76.55
item_price.to_csv('item_price.csv')

item_info = pd.read_csv('all_existing_items_information.csv')
item_info.columns = ['index','item','types','price']
cate_price = pd.read_csv('cate_price.csv')
cate_price.columns = ['types', 'prices']

item_cate_price = item_info.merge(cate_price, left_on='types', right_on='types', how='inner')
item_cate_price.to_csv('item_cate_price.csv')

In [4]:
item_price_non_zeros = pd.read_csv('item_price.csv')
item_price_non_zeros.columns = ['item','avg_price']
df = item_price_non_zeros[(item_price_non_zeros != 0.0).all(1)]
df.to_csv('item_price_non_zeros.csv')
print df['avg_price'].mean()

76.5536098567


# 1. Rating Prediction

In [18]:
import csv
import pandas as pd
import json
import numpy as np
from sklearn import linear_model
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn import svm
from sklearn import tree
from sklearn import preprocessing
from sklearn import ensemble
from sklearn import neighbors
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import SGDClassifier

reviewerID_asin = pd.read_csv('reviewerID-asin.csv')
reviewerID_asin.columns = ['index','user','item','rate']

item_avg_rating = pd.read_csv('item_avg_rating.csv')
item_avg_rating.columns = ['item','avg_rate']

item_price = pd.read_csv('item_price.csv')
item_price.columns = ['item','avg_price']

user_rate = pd.read_csv('group_by_user_avg_rating.csv')
user_rate.columns = ['user','give_rate']

item_cate_price = pd.read_csv('item_cate_price.csv')
#item_cate_price.columns = ['item','cat','cat_price']
item_cate_price.drop(['price', 'index'], axis=1, inplace=True)

item_cate_price.columns = ['index', 'item','cat','cat_price']

tmp1 = reviewerID_asin.merge(item_avg_rating, left_on='item', right_on='item', how='inner')
tmp2 = tmp1.merge(item_price, left_on='item', right_on='item', how='inner')
tmp3 = tmp2.merge(user_rate, left_on='user', right_on='user', how='inner')
tmp4 = tmp3.merge(item_cate_price, left_on='item', right_on='item', how='inner')
tmp4.drop(['index_y', 'index_x'], axis=1, inplace=True)

X = tmp4.copy()
X.drop(['user', 'item','cat','rate','cat_price'], axis=1, inplace=True) #'avg_rate',

y = tmp4.copy()
y.drop(['user', 'item','cat','avg_price','give_rate', 'cat_price', 'avg_rate'], axis=1, inplace=True)


train_X = X.as_matrix()
train_Y = y.as_matrix()

my_x = np.array(train_X)
my_y = np.array(train_Y)
# Train the model using the training sets
X_normalized = preprocessing.normalize(my_x, norm='l2')

clf = linear_model.SGDRegressor()
clf = clf.fit(X_normalized, my_y)

# Gradient boosting with depth-1 trees; its predictions (h2) are written to the submission file below.
est = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='ls').fit(X_normalized, my_y)
#avg_rate , avg_price, give_rate, cat_price

testing = []
for l in open("pairs_Rating.txt"):
    u,i = l.strip().split('-')
    tmp = [u,i]
    testing.append(tmp)
    
user_item = pd.DataFrame(testing)
user_item = user_item[1:]
user_item.columns = ['user', 'item']

tp1 = user_item.merge(item_avg_rating, left_on='item', right_on='item', how='left')
tp2 = tp1.merge(item_price, left_on='item', right_on='item', how='left')
tp3 = tp2.merge(user_rate, left_on='user', right_on='user', how='left')
tp4 = tp3.merge(item_cate_price, left_on='item', right_on='item', how='left')
tp4.drop(['user', 'item','cat','index','cat_price'], axis=1, inplace=True) #,'cat_price' ,'avg_rate'


tp4.fillna(3.78, inplace=True)
testing_x = tp4.as_matrix()
X_normalized_testing = preprocessing.normalize(testing_x, norm='l2')
#h = regr.predict(X_normalized_testing)
#h2 = clf.predict(X_normalized_testing)
h2 = est.predict(X_normalized_testing)
h3 = clf.predict(X_normalized_testing)
print len(h2)

predictions = open("predictions_Rating.txt", 'w')
count = 0
for l in open("pairs_Rating.txt"):
    if l.startswith("reviewerID"):
        #header
        predictions.write(l)
        continue
    u,i = l.strip().split('-')
    predictions.write(u + '-' + i + ',' + str(round(h2[count],3)) + '\n')
    count = count + 1
predictions.close()

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)


100000


In [62]:
count = 0
for l in open("pairs_Purchase.txt"):
    count = count + 1
print count

100001


# 2. Purchase Prediction



In [1]:
import csv
import pandas as pd
import json
import numpy as np
reviewerID_asin = pd.read_csv('reviewerID-asin.csv')
reviewerID_asin.columns = ['index','user', 'item', 'score']
purchase_times_user = reviewerID_asin.groupby(['user'])['user'].count()
purchase_times_user.to_csv('purchase_times_user.csv')

purchase_times_item = reviewerID_asin.groupby(['item'])['item'].count()
purchase_times_item.to_csv('purchase_times_item.csv')


In [5]:
import csv
import pandas as pd
import json
import numpy as np
from sklearn import linear_model
from sklearn import svm

reviewerID_asin = pd.read_csv('reviewerID-asin.csv')
reviewerID_asin.columns = ['index','user', 'item', 'score']

item_avg_rating = pd.read_csv('item_avg_rating.csv')
item_avg_rating.columns = ['item','avg_rate']

item_price = pd.read_csv('item_price.csv')
item_price.columns = ['item','avg_price']

user_rate = pd.read_csv('group_by_user_avg_rating.csv')
user_rate.columns = ['user','give_rate']

purchase_times_item = pd.read_csv('purchase_times_item.csv')
purchase_times_item.columns = ['item', 'item_times']

purchase_times_user = pd.read_csv('purchase_times_user.csv')
purchase_times_user.columns = ['user', 'user_times']

tmp1 = reviewerID_asin.merge(item_avg_rating, left_on='item', right_on='item', how='left')
tmp2 = tmp1.merge(item_price, left_on='item', right_on='item', how='left')
tmp3 = tmp2.merge(user_rate, left_on='user', right_on='user', how='left')
tmp4 = tmp3.merge(purchase_times_user, left_on='user', right_on='user', how='left')
tmp5 = tmp4.merge(purchase_times_item, left_on='item', right_on='item', how='left')
user_basic_info = tmp5.copy()

a = user_basic_info.groupby(['user'])['avg_rate'].mean()
a.to_csv('user-item_avg_rate.txt')

b = user_basic_info.groupby(['user'])['avg_price'].mean()
b.to_csv('user-item_avg_price.txt')

c = user_basic_info.groupby(['user'])['item_times'].mean()
c.to_csv('user-item_times.txt')

d = user_basic_info.groupby(['item'])['avg_rate'].mean()
d.to_csv('item_avg_rate.txt')

e = user_basic_info.groupby(['item'])['item_times'].mean()
e.to_csv('item_avg_times.txt')

In [6]:
item_avg_rate = pd.read_csv('item_avg_rate.txt')
item_avg_times = pd.read_csv('item_avg_times.txt')
item_avg_rate.columns = ['item', 'rate']
item_avg_times.columns = ['item', 'times']
print item_avg_rate['rate'].mean()
print item_avg_times['times'].mean()

t1 = item_avg_rate.merge(item_avg_times, left_on='item', right_on='item', how='left')

user_item_avg_rate = pd.read_csv('user-item_avg_rate.txt')
#user_item_avg_price = pd.read_csv('user-item_avg_price.txt')
user_item_times = pd.read_csv('user-item_times.txt')
user_item_avg_rate.columns = ['user', 'item_avg_rate']
#user_item_avg_price.columns = ['user', 'item_avg_price']
user_item_times.columns = ['user', 'item_times']
print user_item_avg_rate['item_avg_rate'].mean()
print user_item_times['item_times'].mean()

user_preference = {}
item_dict = {}
tmp1 = user_item_avg_rate.merge(user_item_times, left_on='user', right_on='user', how='left')
#tmp2 = tmp1.merge(user_item_times, left_on='user', right_on='user', how='left')

for index, item in t1.iterrows():
    item_dict[str(item[0])] = [round(float(item[1]),3), float(item[2])]

for index, item in tmp1.iterrows():
    user_preference[str(item[0])] = [round(float(item[1]),3), float(item[2])]


3.78936422692
5.84164407889
3.81509771505
74.2812708206


In [47]:
from sklearn import preprocessing
from scipy import spatial
ct = 0
predictions = open("predictions_Purchase.txt", 'w')
for l in open("pairs_Purchase.txt"):
    if l.startswith("reviewerID"):
        predictions.write(l)
        continue
    u,i = l.strip().split('-')
    if u not in user_preference:
        current_user_preference = [3.815, 74.281]
    else:
        current_user_preference = user_preference[u]
    if i not in item_dict:
        current_item_info = [1,1]
    else:
        current_item_info = item_dict[i]
    #print current_user_preference
    #current_user_preference.fillna(74.281, inplace=True)
    #print current_item_info
    #current_user_preference[1] * 0.30
    if current_item_info[0] >= current_user_preference[0] * 0.8 and current_item_info[1] >= current_user_preference[1] * 0.05:
        predictions.write(u + '-' + i + ",1\n")
    else:
        predictions.write(u + '-' + i + ",0\n")
    ct = ct + 1
predictions.close()
print ct

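
The decision rule in the purchase-prediction cell above can be restated as a small function, which makes the two thresholds easier to test in isolation. This is a sketch: `threshold_rule`, its parameter names, and the first sample call are illustrative, while the 0.8 and 0.05 factors and the fallback values are taken from that cell.

```python
# Fallbacks used in the cell above for unseen users and unseen items.
DEFAULT_USER = [3.815, 74.281]  # mean item rating, mean item popularity
DEFAULT_ITEM = [1, 1]

def threshold_rule(user_pref, item_info, rate_factor=0.8, times_factor=0.05):
    """Return 1 when the item clears both user-relative thresholds, else 0.

    user_pref: [mean rating of the items the user reviewed,
                mean review count of those items]
    item_info: [the item's mean rating, the item's review count]
    """
    rating_ok = item_info[0] >= user_pref[0] * rate_factor
    popularity_ok = item_info[1] >= user_pref[1] * times_factor
    return 1 if rating_ok and popularity_ok else 0

# Rating 3.5 clears 80% of 4.0, and 6 reviews clear 5% of 100.
seen_case = threshold_rule([4.0, 100.0], [3.5, 6.0])
# Unseen user paired with an unseen item.
unseen_case = threshold_rule(DEFAULT_USER, DEFAULT_ITEM)
print(seen_case)
print(unseen_case)
```

With the fallback item values `[1, 1]`, the rating test only passes when the user's mean item rating is at most 1.25, so unseen items are almost always predicted as not purchased.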

In [31]:
from scipy import spatial

# Cosine similarity is scale-invariant, so the vectors need no L2 normalization first.
dataSetI = [3.667, 585.0]
dataSetII = [1.0, 10.0]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
print result

0.995641356495
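
Cosine similarity depends only on the angle between two vectors, so rescaling either vector leaves it unchanged. The sketch below checks this in plain Python on the two vectors from the cell above; `cosine_similarity` and `l2_normalize` are helpers defined here for illustration, not library functions.

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v1 = [3.667, 585.0]
v2 = [1.0, 10.0]
raw = cosine_similarity(v1, v2)
unit = cosine_similarity(l2_normalize(v1), l2_normalize(v2))
print(round(raw, 6))
print(round(unit, 6))
```

Here the second component dominates both vectors, so the angle between them is small and the similarity is close to 1 even though the raw values differ widely.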




-----------------

In [None]:
# Code for setting the style of the notebook
from IPython.core.display import HTML
def css_styling():
    styles = open("../theme/custom.css", "r").read()
    return HTML(styles)
css_styling()