# Content-based recommendation

# Exercise 1
Based on the TF-IDF vectors obtained in Exercise 2 of Session 4, represent each user in the same vector space. Among other feasible solutions, you can represent a user (user profile) as the weighted mean of the vectors of the items they rated. Compute the cosine similarity between user 'A39WWMBA0299ZF' and every product in the training set that the user has not rated. What are the top-5 recommended items for user 'A39WWMBA0299ZF'? Print the top-5 items and their similarity scores.
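
As a rough illustration of the idea in the exercise statement, the sketch below builds a user profile as a rating-weighted mean of item vectors and ranks unrated items by cosine similarity. Using ratings as weights is an assumption (the exercise leaves the weighting open), and the function names `user_profile` and `top_k_items` as well as the toy vectors are hypothetical, not part of the dataset or the course material.

```python
import numpy as np

def user_profile(item_vectors, ratings):
    """Rating-weighted mean of the vectors of the items a user has rated."""
    item_vectors = np.asarray(item_vectors, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    return np.average(item_vectors, axis=0, weights=ratings)

def top_k_items(profile, candidate_ids, candidate_vectors, k=5):
    """Rank candidate (unrated) items by cosine similarity to the profile."""
    cand = np.asarray(candidate_vectors, dtype=float)
    num = cand @ profile
    denom = np.linalg.norm(cand, axis=1) * np.linalg.norm(profile)
    # All-zero vectors get similarity 0 instead of raising a division warning
    sims = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    order = np.argsort(-sims)[:k]
    return [(candidate_ids[i], float(sims[i])) for i in order]

# Toy data (hypothetical): a 3-term vocabulary and two rated items
rated_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
rated_scores = [5.0, 1.0]
profile = user_profile(rated_vectors, rated_scores)

candidates = {
    "item_a": [1.0, 0.0, 0.0],
    "item_b": [0.0, 1.0, 0.0],
    "item_c": [0.0, 0.0, 1.0],
}
ranking = top_k_items(profile, list(candidates), list(candidates.values()), k=2)
print(ranking)
```

With these toy numbers the profile leans toward the first term, so `item_a` ranks above `item_b`. In the notebook, the rows of the TF-IDF matrix would play the role of the item vectors.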

In [8]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

[nltk_data] Downloading package punkt to /Users/lwk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/lwk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
# load data
import sys
sys.path.append('../')
import pickle
import pandas as pd

# Load TRAIN and TEST sets 
with open("test.pkl", "rb") as f:
    test_data = pickle.load(f)
with open("training.pkl", "rb") as f:
    train_data = pickle.load(f)

# Load the METADATA (ITEMS)
df = pd.read_json('meta_All_Beauty.json', lines=True)

# Discard duplicates
df = df.sort_values(by=['asin'])
clean_dataset_item = df.drop_duplicates(subset=['asin'], keep = 'last').reset_index(drop=True)

# Discard items that weren't rated by our subset of users
item_in_subset = list(test_data.loc[:,'asin'])+list(train_data.loc[:,'asin'])
# print(list(item_in_subset))
clean_dataset_item = clean_dataset_item[clean_dataset_item['asin'].isin(item_in_subset)]



In [27]:
porter_stemmer = PorterStemmer()
# Build the stop-word set once instead of reloading the list for every token.
# Matching is case-sensitive, so capitalized stop words such as "The" are kept.
stop_words = set(stopwords.words("english"))
detokenizer = TreebankWordDetokenizer()
len_words = 0
len_filter_words = 0
title_list = []
for title in clean_dataset_item['title']:
    tokens = word_tokenize(title)
    len_words += len(tokens)
    # Drop stop words, then stem the remaining tokens
    filter_list = [porter_stemmer.stem(word) for word in tokens if word not in stop_words]
    len_filter_words += len(filter_list)
    # Re-join the stemmed tokens: TfidfVectorizer expects one string per title
    title_list.append(detokenizer.detokenize(filter_list))
print(len_words)         # total tokens before filtering
print(len_filter_words)  # tokens left after stop-word removal

1039
1002


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer()

# <YOUR CODE HERE>
X = tfidf_vectorizer.fit_transform(title_list)
# print(tfidf_vectorizer.get_feature_names_out())
print(X.shape)

(84, 449)


In [29]:
import gzip
import os
import json
import pandas as pd
import numpy as np
def parse(path):
  # Yield one review (a dict) per line of the gzipped JSON-lines file
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield json.loads(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('All_Beauty_5.json.gz')

df = df.sort_values(by=['reviewerID', 'asin', 'unixReviewTime'])
cleaned_dataset = df.dropna(subset=['overall']).drop_duplicates(subset=['reviewerID', 'asin'], keep = 'last').reset_index(drop=True)
# print(len(cleaned_dataset))
# cleaned_dataset.head()
cleaned_dataset = cleaned_dataset.sort_values(by=['reviewerID', 'unixReviewTime']).reset_index(drop=True)
# Extract the latest (in time) positively rated item (rating ≥ 4) by each user as test data.
test_data_pre = cleaned_dataset[cleaned_dataset.overall >= 4.0].drop_duplicates(subset=['reviewerID'], keep='last')
# generate training data
training_data = cleaned_dataset.drop(test_data_pre.index)

# Remove users that do not appear in the training set.
user_in_training = test_data_pre['reviewerID'].isin(training_data['reviewerID'])
test_data = test_data_pre[user_in_training]

In [51]:
user_id = 'A39WWMBA0299ZF'
# Boolean mask over the item catalogue: True where the user rated the item in the training set
rated_mask = clean_dataset_item['asin'].isin(training_data[training_data['reviewerID'] == user_id]['asin'])
item_in = clean_dataset_item['asin'][rated_mask]
item_not_in = clean_dataset_item['asin'][~rated_mask]

In [68]:

def pick_item_from_tfidf(tfidf_vectorizer:TfidfVectorizer, item_list:list, data_set:list):
    # Relies on the global TF-IDF matrix X, whose row i corresponds to data_set[i].
    # tfidf_vectorizer is kept in the signature for the existing callers but is not needed here.
    vector_np = X.toarray()
    wanted = set(item_list)  # set membership is O(1) per lookup
    return [vector_np[i, :] for i, item in enumerate(data_set) if item in wanted]

In [69]:
vector_in = pick_item_from_tfidf(tfidf_vectorizer, list(item_in), list(clean_dataset_item['asin']))
vector_not_in = pick_item_from_tfidf(tfidf_vectorizer, list(item_not_in), list(clean_dataset_item['asin']))

In [81]:
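# User profile: unweighted mean of the rated items' vectors (ratings are not used as weights here)
mean_vector = [np.mean(vector_in, axis=0)]
cos_sim = cosine_similarity(vector_not_in, mean_vector)
# Indices of the five unrated items most similar to the profile, best first
top5 = np.argsort(-cos_sim.ravel())[:5]
candidate_ids = list(item_not_in)
for i in top5:
    print(candidate_ids[i])
print(cos_sim[top5])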


B019FWRG3C
B00W259T7G
B00006L9LC
B002GP80EU
B019809F9Y
[[0.35285711]
 [0.16561569]
 [0.14454085]
 [0.12955529]
 [0.08856782]]


# Exercise 2



Compute the system's hit rate based on the top-5, top-10 and top-20 recommendations, averaged over the total number of users. Remember that, since we are evaluating the system, you should compute the hit rate over the test set. How well or badly does this content-based approach perform compared to collaborative filtering?
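
One possible way to compute the hit rate described above is sketched here. It assumes each user has exactly one held-out test item (as in the train/test split used in this notebook) and that recommendations are stored as ranked lists per user. The function name `hit_rate_at_k` and the toy dictionaries are hypothetical.

```python
def hit_rate_at_k(recommendations, held_out, k):
    """Fraction of test users whose held-out item appears in their top-k list.

    recommendations: dict mapping user id -> ranked list of item ids
    held_out:        dict mapping user id -> the single test item of that user
    """
    users = list(held_out)
    if not users:
        return 0.0
    hits = sum(
        1 for user in users
        if held_out[user] in recommendations.get(user, [])[:k]
    )
    return hits / len(users)

# Toy input (hypothetical ids, not taken from the dataset)
recs = {
    "u1": ["i3", "i7", "i1"],
    "u2": ["i2", "i9", "i4"],
    "u3": ["i5", "i6", "i8"],
}
truth = {"u1": "i7", "u2": "i4", "u3": "i0"}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(recs, truth, k))
```

For the exercise, `k` would take the values 5, 10 and 20, and `recs` would hold the cosine-similarity ranking produced for each test user.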

# Exercise 3

Repeat Exercises 1 and 2, this time representing the products and users in a word2vec vector space. You may use the gensim library and download the 300-dimensional embeddings from Google. Source: https://radimrehurek.com/gensim/models/word2vec.html#pretrained-models

In [None]:
import gensim.downloader
word2vec_vectors = gensim.downloader.load('word2vec-google-news-300')
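
A common way to place a product title in the word2vec space is to average the vectors of its tokens; the sketch below shows that step only. It is an assumption about how to approach the exercise, not the official solution. The helper `title_embedding` is hypothetical, and a small dictionary stands in for the loaded vectors so the example runs without the download; the loaded gensim vectors support the same `token in vectors` and `vectors[token]` lookups.

```python
import numpy as np

def title_embedding(tokens, vectors, dim):
    """Mean of the embeddings of the tokens found in `vectors`.

    Tokens missing from the vocabulary are skipped; if none are found,
    a zero vector of length `dim` is returned.
    """
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Hypothetical 2-dimensional stand-in for the 300-dimensional embeddings
toy_vectors = {
    "shampoo": np.array([1.0, 0.0]),
    "conditioner": np.array([0.0, 1.0]),
}

emb = title_embedding(["shampoo", "conditioner", "xyz"], toy_vectors, dim=2)
empty = title_embedding(["xyz"], toy_vectors, dim=2)
print(emb)
print(empty)
```

With the real embeddings, `dim` would be 300. Stemmed tokens may be missing from a pretrained vocabulary, so it may be worth tokenizing the titles without stemming for this exercise.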

