# CSE 158 Assignment 2
This is our shared Jupyter notebook.

Data
- Amazon product review dataset (specifically Movies and TV) : userID, productID, rating
- Amazon product review dataset (specifically Movies and TV) : userID, productID, rating, review text, time of purchase, and others.

Predictive task:
- **Purchase prediction**: "Predict whether a user will purchase the product or not"
(purchase(userID, product) = true or false)

##### Notes:
* Note 1: Since reviews exist only for movies that have been purchased, the dataset is unbalanced: it contains only positive examples. A balanced dataset would also cover unpurchased movies in addition to the purchased ones. Therefore, an alternative predictive task may be to predict (1) whether a user likes or dislikes a movie or (2) what rating a user would give a movie.

*  Note 2: Try alternative models other than logistic regression: (1) TF-IDF-based prediction, (2) SVM, (3) k-nearest neighbors. Present the results using histograms.

* Note 3: Data visualization.

* Note 4: Report various error metrics : (1) MSE, (2) Accuracy, (3) BER

* Note 5: The dataset in use only includes users and products that each have at least 5 reviews. Because every user and product has some history, we may see higher accuracy than on the full dataset.
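
The three metrics in Note 4 can be sketched on toy labels as follows. This is a minimal illustration rather than the notebook's evaluation code: the function name `error_metrics` and the toy labels are made up for the example, and it assumes binary 0/1 labels with at least one example of each class.

```python
def error_metrics(predictions, labels):
    """Return (mse, accuracy, ber) for binary 0/1 predictions and labels.

    Hypothetical helper for illustration; assumes both classes occur in labels.
    """
    n = len(labels)
    pairs = list(zip(predictions, labels))

    # Mean squared error between predicted and true labels.
    mse = sum((p - y) ** 2 for p, y in pairs) / n

    # Fraction of predictions that match the label.
    accuracy = sum(1 for p, y in pairs if p == y) / n

    # Balanced error rate: average of false positive and false negative rates.
    false_pos = sum(1 for p, y in pairs if p == 1 and y == 0)
    false_neg = sum(1 for p, y in pairs if p == 0 and y == 1)
    negatives = sum(1 for y in labels if y == 0)
    positives = n - negatives
    ber = 0.5 * (false_pos / negatives + false_neg / positives)

    return mse, accuracy, ber


toy_labels = [1, 1, 1, 1, 0, 0]
toy_predictions = [1, 1, 1, 0, 1, 0]
mse, accuracy, ber = error_metrics(toy_predictions, toy_labels)
print(mse, accuracy, ber)
```

With 0/1 labels, MSE equals one minus accuracy, so BER is the metric that adds information when the two classes are not the same size.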

# Libraries Used
Here's a list of libraries we use in our project. 

In [0]:
import gzip
from collections import defaultdict
import math
import scipy.optimize
import numpy as np 
import pandas as pd 
import string
import random
from sklearn import linear_model
from sklearn import svm

In [0]:
movie_rating = []
with open("./Movies_and_TV.csv") as f:
  for l in f:
    l = l.rstrip().split(',')
    movie_rating.append(l)

movie_rating[:10]

FileNotFoundError: ignored

In [0]:
total = len(movie_rating)

X_train = movie_rating[:round(total*0.6)]
X_validate = movie_rating[round(total*0.6):round(total*0.8)]
X_test = movie_rating[round(total*0.8):]


In [0]:
X_train = X_train[0:50000]
X_validate = X_validate[:20000]

In [0]:
# Count how many reviews each product has
productreview_count = defaultdict(int)
for review in movie_rating:
  productID = review[0]
  productreview_count[productID] += 1

In [0]:
mostFrequent = [(productreview_count[x], x) for x in productreview_count.keys()]
mostFrequent.sort()
mostFrequent.reverse()

mostFrequent1000 = []
for i in mostFrequent[:1000]:
    mostFrequent1000.append(i[1])

# Helper Functions
Here's our collection of helper functions that we use in our project. We have the following functions implemented below:
- ***Jaccard(s1, s2)*** - computes the Jaccard similarity between two sets 
- ***printBERandAccuracy(predictions, y)*** - prints the balanced error rate (BER) and the accuracy of the model's predictions, and returns the BER.
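
As a quick worked example of the Jaccard similarity, the snippet below computes it directly with set operations. The user IDs and variable names are made up for illustration and do not come from the dataset.

```python
# Toy sets of user IDs (made up for illustration).
buyers_of_item_a = {"u1", "u2", "u3"}
buyers_of_item_b = {"u2", "u3", "u4"}

shared = buyers_of_item_a & buyers_of_item_b    # users who bought both items
combined = buyers_of_item_a | buyers_of_item_b  # users who bought either item

# Jaccard similarity = |intersection| / |union| = 2 / 4
similarity = len(shared) / len(combined)
print(similarity)  # 0.5
```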


In [0]:
def Jaccard(s1, s2):
    ''' Computes the Jaccard similarity between two sets
    '''
    numer = len(s1.intersection(s2))
    denom = len(s1.union(s2))
    if denom > 0:
        return numer/denom
    return 0
    
def printBERandAccuracy(predictions, y):
    ''' Calculates and Prints BER and Accuracy 
        for given Predictions and y. 
        Used as helper function for all homework questions. 
    '''
    # Calculating accuracy and balanced error rate 
    index = 0
    correctCount = 0
    falsePositives = 0
    falseNegatives = 0
    labeledPositives = 0
    labeledNegatives = 0
    
    for p in predictions:
        #checking correctCount
        if p==y[index]:
            correctCount = correctCount + 1
        
        #checking false positives 
        if p and not y[index]:
            falsePositives = falsePositives+1
        
        #checking false negatives 
        if not p and y[index]:
            falseNegatives = falseNegatives + 1
        
        #labeled positive
        if y[index]:
            labeledPositives = labeledPositives + 1
        
        #labeled negative
        if not y[index]:
            labeledNegatives = labeledNegatives + 1
    
        index = index + 1
    
    #calculate accuracy 
    accuracy = correctCount/len(predictions)

    #calculate balanced error rate 
    fpr = falsePositives/labeledNegatives
    fnr = falseNegatives/labeledPositives
    ber = 0.5*(fpr+fnr)

    print("Balanced Error Rate: "+ str(ber))
    print("Accuracy: "+ str(accuracy))
    
    return ber

# Useful Information

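- userSet: all user IDs in the dataset
- productSet: all product IDs in the dataset
- purchasedSet: all observed (user, product) pairs
- notPurchasedSet: randomly sampled (user, product) pairs that do not appear in purchasedSet, kept separately for the train, validation, and test splits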
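
The sets listed above support negative sampling: for each purchased pair, a random product is drawn for the same user and kept only if the user has not purchased it. A minimal sketch of that idea on toy data follows. The function name `sample_negatives`, the `seed` parameter, and the toy IDs are assumptions made for this example.

```python
import random


def sample_negatives(positive_pairs, all_products, seed=0):
    """Draw at most one unpurchased product per positive (user, product) pair.

    Hypothetical helper for illustration. Colliding draws are skipped rather
    than retried, so fewer negatives than positives may be returned.
    """
    rng = random.Random(seed)
    purchased = set(positive_pairs)
    products = sorted(all_products)  # fixed order so the seed is reproducible
    negatives = set()
    for user, _ in positive_pairs:
        candidate = rng.choice(products)
        pair = (user, candidate)
        if pair in purchased or pair in negatives:
            continue  # already purchased or already sampled: skip this draw
        negatives.add(pair)
    return negatives


toy_purchases = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u3", "p3")]
toy_products = {"p1", "p2", "p3", "p4", "p5"}
toy_negatives = sample_negatives(toy_purchases, toy_products)
print(len(toy_negatives), "negatives for", len(toy_purchases), "positives")
```

Because collisions are skipped, the number of negatives can be slightly smaller than the number of positives, so the labels should be built from the sampled set rather than assumed to be an exact 50/50 split.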

In [0]:
userSet = set()
productSet = set()
purchasedSet = set()

notPurchasedSet_train = set()
notPurchasedSet_validate = set()
notPurchasedSet_test = set()

for p, u, r in movie_rating:
  productSet.add(p)
  userSet.add(u)
  purchasedSet.add((u, p))

listUserSet = list(userSet)
listProductSet = list(productSet)

# for train
y_train = [] 
for i in range(len(X_train)):
  y_train.append(1)

# for validation
y_validate = [] 
for i in range(len(X_validate)):
  y_validate.append(1)

# for test 
y_test = []
for i in range(len(X_test)):
  y_test.append(1)

#train
for p, u, r in X_train:
  product = random.choice(listProductSet)
  if (u,product) in purchasedSet or (u, product) in notPurchasedSet_train: continue
  notPurchasedSet_train.add((u, product))
  y_train.append(0)

#validate
for p, u, r in X_validate:
  product = random.choice(listProductSet)
  if (u,product) in purchasedSet or (u, product) in notPurchasedSet_validate: continue
  notPurchasedSet_validate.add((u, product))
  y_validate.append(0)
  
#test
for p, u, r in X_test:
  product = random.choice(listProductSet)
  if (u,product) in purchasedSet or (u, product) in notPurchasedSet_test: continue
  notPurchasedSet_test.add((u, product))
  y_test.append(0)



In [0]:
len(y_train)

In [0]:
len(X_train) + len(notPurchasedSet_train)

In [0]:
#basic stats 

print(len(movie_rating))
print(len(productSet))
print(len(userSet))

NameError: ignored

# Model - Logistic Regression (Training)



In [0]:
featureMatrix_train = []

#train processable set 
purchasedTrain = []
for d in X_train:
  p = d[0]
  u = d[1]
  purchasedTrain.append((u, p))


# Building featureMatrix for TRAINING  
ratingsPerUser = defaultdict(list)
ratingsPerItem = defaultdict(list)
for p,u,r in X_train:
    ratingsPerUser[u].append((p,r))
    ratingsPerItem[p].append((u,r))

for (label, sample) in [(1, purchasedTrain), (0, list(notPurchasedSet_train))]:
  for (u, p) in sample:
    
    #user Jaccard 
    maxSimUser = 0
    similar_users = set(ratingsPerItem[p])
    for p2, _ in ratingsPerUser[u]:
      sim = Jaccard(similar_users, set(ratingsPerItem[p2]))
      if sim > maxSimUser:
        maxSimUser = sim

    #product Jaccard 
    maxSimProduct = 0
    productsUserHasPurchased = set(ratingsPerUser[u])
    for u2, _ in ratingsPerItem[p]:
      sim = Jaccard(productsUserHasPurchased, set(ratingsPerUser[u2]))
      if sim > maxSimProduct:
        maxSimProduct = sim 

    feature1 = 1
    feature2 = maxSimUser
    feature3 = maxSimProduct
    feature4 = len(ratingsPerUser[u])
    feature5 = len(ratingsPerItem[p])

    featureVector = []
    featureVector.append(feature1)
    featureVector.append(feature2)
    featureVector.append(feature3)
    featureVector.append(feature4)
    featureVector.append(feature5)

    featureMatrix_train.append(featureVector)


featureMatrix_train = np.array(featureMatrix_train)

model = linear_model.LogisticRegression()
model.fit(featureMatrix_train, y_train)






LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
len(y_validate)

39998

In [0]:
#FEATURE MATRIX FOR VALIDATE
featureMatrix_validate = []

#validation processable set 
purchasedValid = []
for d in X_validate:
  p = d[0]
  u = d[1]
  purchasedValid.append((u, p))

# Building featureMatrix for VALIDATION
ratingsPerUser = defaultdict(list)
ratingsPerItem = defaultdict(list)
for p,u,r in X_validate:
    ratingsPerUser[u].append((p,r))
    ratingsPerItem[p].append((u,r))

for (label, sample) in [(1, purchasedValid), (0, list(notPurchasedSet_validate))]:
  for (u, p) in sample:
    
    #user Jaccard 
    maxSimUser = 0
    similar_users = set(ratingsPerItem[p])
    for p2, _ in ratingsPerUser[u]:
      sim = Jaccard(similar_users, set(ratingsPerItem[p2]))
      if sim > maxSimUser:
        maxSimUser = sim

    #product Jaccard 
    maxSimProduct = 0
    productsUserHasPurchased = set(ratingsPerUser[u])
    for u2, _ in ratingsPerItem[p]:
      sim = Jaccard(productsUserHasPurchased, set(ratingsPerUser[u2]))
      if sim > maxSimProduct:
        maxSimProduct = sim 

    feature1 = 1
    feature2 = maxSimUser
    feature3 = maxSimProduct
    feature4 = len(ratingsPerUser[u])
    feature5 = len(ratingsPerItem[p])

    featureVector = []
    featureVector.append(feature1)
    featureVector.append(feature2)
    featureVector.append(feature3)
    featureVector.append(feature4)
    featureVector.append(feature5)

    featureMatrix_validate.append(featureVector)

featureMatrix_validate = np.array(featureMatrix_validate)

predictions_validate = model.predict(featureMatrix_validate)

In [0]:
printBERandAccuracy(predictions_validate, y_validate)

Balanced Error Rate: 0.014400025002500249
Accuracy: 0.9855992799639982


0.014400025002500249