# 4. ContentBasedFiltering - Review

In [34]:
import timeit
import pandas as pd
import numpy as np
import re
from ast import literal_eval
from sklearn.metrics.pairwise import linear_kernel

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics, svm


## 4.1 Feature Engineering

In [2]:
#import dataset
df_re = pd.read_csv('data/df_reT.csv')

In [3]:
df_re.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1863906 entries, 0 to 1863905
Data columns (total 5 columns):
asin          1863906 non-null object
title         1863906 non-null object
reviewerID    1863906 non-null object
overall       1863906 non-null float64
review        1863906 non-null object
dtypes: float64(1), object(4)
memory usage: 71.1+ MB


In [4]:
df_re.head()

Unnamed: 0,asin,title,reviewerID,overall,review
0,B000N6DDJQ,The Scarlet Letter A Romance,A1D2C0WDCSHUWZ,5.0,"When people think of a ""scarlet letter,"" we im..."
1,B000N6DDJQ,The Scarlet Letter A Romance,A2M3NCTFUGI4SR,5.0,Hawthorne's best work is central to understand...
2,B000N6DDJQ,The Scarlet Letter A Romance,A1OXI0N58TMYY9,5.0,This is a review for not the novel itself but ...
3,B0006DG9OM,Tess of the D'Urbervilles: A pure woman (Harpe...,AZ05JR3XQN9IP,5.0,A question that often appears in agony aunt co...
4,B0006DG9OM,Tess of the D'Urbervilles: A pure woman (Harpe...,A1R77GO77FBLMJ,5.0,Truly an excellently written book. From the ve...


### 4.1.1 Text Preprocessing

Because we will use a tf-idf vectorizer to build a term-frequency matrix from the review data, the raw review text must first be converted into a format the vectorizer can use. <br>
This includes removing special characters and converting all text to lower case.<br> We also tokenize each document into a list of words and remove stopwords, which would otherwise reduce accuracy.

For the review text we consider only single words (unigrams), not bigrams or higher-order n-grams.

In [97]:
'''
Takes in text

Outputs text with all lower case and special characters removed
'''
def cleanText(t):
    
    #change to lower case
    t = str(t).lower()
    #remove special characters
    t = re.sub("(\\W)+"," ",t)
    
    return t


#build stopwords
eng_stopwords = set(stopwords.words('english') + ['book','books','author','authors'])


'''
Takes in text

Outputs tokenized text with stopwords removed
'''
def removeStopwords(t,stopwords = eng_stopwords):
    
    #tokenize : only single words, no n-grams
    t = word_tokenize(t,language='english')
    
    filtered = []
    
    for word in t:
        if word not in stopwords:
            filtered.append(word)
    
    
    return filtered
    

**Column 'title'**

Both title and asin uniquely identify a book in our dataset, so keeping both is redundant.  <br> Since asin is smaller and easier to maintain, we will move the unique asin-title pairs to a separate dataframe and drop the title column from the review dataset.

In [6]:
#clean up asin, title, reviewerID
df_re['asin'] = df_re['asin'].apply(lambda x: x.lower())
df_re['title'] = df_re['title'].apply(lambda x: cleanText(x))
df_re['reviewerID'] = df_re['reviewerID'].apply(lambda x: x.lower())

In [7]:
#isolate unique asin-title
asin_title = df_re[['asin','title']].drop_duplicates(keep='first')



In [8]:
#drop redundant title 
df_re.drop(columns='title',inplace=True)

**Column 'review'**

This is the field used for review-based content filtering, so it will be transformed into tf-idf vectors.

We need a set of stopwords to filter out words that are unlikely to be useful during modeling. <br> In addition to the regular English stopwords, we add words that are specific to the book domain, since they would otherwise be given unwarranted importance.  <br> Examples are book, books, author, authors. 

In [9]:
#clean up review
df_re['review'] = df_re['review'].apply(lambda x: cleanText(x))

In [10]:
#build stopwords
eng_stopwords = set(stopwords.words('english') + ['book','books','author','authors'])

In [11]:
#tokenize and remove stopwords
df_re['review'] = df_re['review'].apply(lambda x: removeStopwords(x, eng_stopwords))

In [7]:
#we will lemmatize the tokens to their root word

'''
Takes in word

Outputs the word's POS_tag: v, a, n, r
'''
def get_wordnet_pos(word):

    #get first letter of word tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    
    #create simple dict to associate tag to wordnet attributes   
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV,
                "C": wordnet.NOUN,
                "S": wordnet.ADJ_SAT}

    #we put a conditional here to account for cases where tag is not within expected range
    if tag_dict.get(tag):
        return tag_dict.get(tag)
    else:
        return ''

#initialize lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_words(words):
    
    #lemmatize the list of words provided, looking up each word's POS tag only once
    lemmas = []
    for w in words:
        pos = get_wordnet_pos(w)
        lemmas.append(lemmatizer.lemmatize(w, pos) if pos != '' else lemmatizer.lemmatize(w))
    return lemmas
    
#identity function: passed to CountVectorizer so that it skips preprocessing and tokenizing
def dummy(doc):
    return doc

In [114]:
#lemmatize each word token to its root word, using the pos_tag of the word
df_re['review_lemmatized'] = df_re['review'].apply(lambda x: lemmatize_words(x))

The lemmatization process took about 12 hours to complete. <br> Stemming was the other option for handling variations of the same root word. <br> Stemming reduced words too aggressively, so we went with lemmatization even though it is resource-heavy.
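
Since `get_wordnet_pos` tags each word in isolation, the lemma of a token depends only on the token itself, so repeated tokens could in principle be looked up rather than recomputed. Below is a minimal sketch of that idea, using a toy stand-in for the real lemmatizer (`slow_lemma`, `TOY_LEMMAS` and `lemmatize_words_cached` are hypothetical names, not part of this notebook).

```python
from functools import lru_cache

#toy stand-in for the WordNet lookup; hypothetical, for illustration only
TOY_LEMMAS = {'books': 'book', 'stories': 'story'}
calls = 0

def slow_lemma(word):
    #pretend this is the expensive pos_tag + lemmatize call
    global calls
    calls += 1
    return TOY_LEMMAS.get(word, word)

@lru_cache(maxsize=None)
def cached_lemma(word):
    #each distinct word reaches slow_lemma only once
    return slow_lemma(word)

def lemmatize_words_cached(words):
    return [cached_lemma(w) for w in words]

docs = [['books', 'stories', 'book'], ['stories', 'books']]
lemmatized = [lemmatize_words_cached(d) for d in docs]
print(lemmatized)
print('expensive calls:', calls)
```

How much this would save depends on how often the same tokens repeat across reviews.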


### 4.1.2 CountVectorizer Analysis

In [116]:
#saved for checkpoint
df_re.to_csv('data/reT_tokenized.csv',index=False)

In [2]:
#when reading in our word list again, pandas parses it as string instead of list of words.
#using literal_eval to fix this
df_re = pd.read_csv('data/reT_tokenized.csv',converters={'review_lemmatized': literal_eval})
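
The round trip through CSV can be illustrated without pandas: a list written out as text becomes its string representation, and `literal_eval` parses it back into a list. A small sketch with made-up tokens:

```python
from ast import literal_eval

tokens = ['fourth', 'outing', 'wonderful']

#what ends up in the CSV cell is the string form of the list
as_text = str(tokens)

#literal_eval safely parses that string back into a Python list
restored = literal_eval(as_text)

print(type(as_text).__name__, type(restored).__name__)
```

Only review_lemmatized is converted in the cell above, so the review column remains a plain string after reloading.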

In [12]:
#isolate documents into corpora
corpora = df_re['review_lemmatized']

In [14]:
#instantiate CountVectorizer for pre-tokenized input: no lower casing, preprocessing or tokenizing.
vectorizer = CountVectorizer(lowercase=False,preprocessor=dummy, tokenizer=dummy)

#build vocab by fitting the corpora, then transform to make count vector
count_vector = vectorizer.fit_transform(corpora)


After running the vectorizer, the sparse matrix has one row per document in our corpora, as expected.  <br> We can also see that the vocabulary contains a total of 199638 words.

In [68]:
print('number of vocab: ', len(vectorizer.get_feature_names()))

number of vocab:  199638


In [17]:
full_vocab = vectorizer.get_feature_names()

Since the vocabulary is sorted in alphabetical order, we will slice it to view sections and check whether the tokens are within the expected range. The slices reveal several kinds of unexpected tokens:

-  tokens that start with _
-  purely numeric tokens
-  tokens that start with numbers followed by letters
-  tokens that contain noise

As seen below, up to roughly the 6100th entry we get malformed tokens that start with _, which was not stripped along with the other punctuation. <br> The same range also contains malformed tokens that start with numbers.

In [65]:
print(full_vocab[6069:6077])
print(full_vocab[:10])
print(full_vocab[40:50])
print(full_vocab[:-7:-1])

['_yanked_', '_year', '_yes_', '_yet', '_yet_', '_you', '_you_', '_your']
['0', '00', '000', '0000', '00000', '0000000000', '00000oh', '00000th', '00001', '0001']
['00a', '00am', '00audio', '00beware', '00first', '00pm', '00s', '00story', '00taken', '01']
['zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzugh', 'zzzzzzzzzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzzzzzzorro', 'zzzzzzzzzzzzzzzzzzzzzzz']
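
The underscore and digit tokens are consistent with how `cleanText` works: in Python's `re` module, `\W` matches anything that is not a letter, digit, or underscore, so `_` and digits survive the cleaning step. The sketch below shows this on a made-up sentence; `clean_text_strict` is a hypothetical stricter variant, not something this notebook uses.

```python
import re

def clean_text(t):
    #same rule as cleanText: lower-case, then replace runs of non-word characters
    return re.sub(r"\W+", " ", str(t).lower())

def clean_text_strict(t):
    #hypothetical variant: keep only the letters a-z
    return re.sub(r"[^a-z]+", " ", str(t).lower())

sample = 'It was _yet_ another $5.00am "classic"!'
print(clean_text(sample).split())
print(clean_text_strict(sample).split())
```

Note that the strict variant would also drop accented and other non-ASCII letters, which may or may not be acceptable for this data.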


Another interesting kind of noise is shown below. <br> The term 'Klausner' is most likely part of a name that the tokenizer did not separate from the following word, probably because of incorrect formatting in the original review.

In [67]:
print(full_vocab[100050:100060])

['klausnerexceptional', 'klausnerexciting', 'klausnerexhilarating', 'klausnerfabulous', 'klausnerfanastic', 'klausnerfans', 'klausnerfantastic', 'klausnerfascinating', 'klausnerfine', 'klausnerfor']


#### Vectorizer Parameter Optimization

In [25]:
#instantiate CountVectorizer for pre-tokenized input, now with min_df=5
vectorizer_2 = CountVectorizer(lowercase=False,preprocessor=dummy, tokenizer=dummy, min_df=5)

#build vocab by fitting the corpora, then transform to make count vector
count_vector_2 = vectorizer_2.fit_transform(corpora)


In [70]:
print('number of vocab: ', len(vectorizer_2.get_feature_names()))

number of vocab:  115297


We increased min_df to 5 to drop words that are too rare. <br> The trade-off is that we may lose some keywords that are unique to particular books.  <br> Since each book in the review data has at least 50 reviews, min_df = 5 should not hurt this too much.

We do the same for the max_df parameter.  Unlike min_df, max_df removes words that are repeated too often.  <br> We will say that if a word appears in more than half of the documents in the corpus, it is removed.

In [71]:
#instantiate CountVectorizer for pre-tokenized input, now with min_df=5 and max_df=0.5
vectorizer_3 = CountVectorizer(lowercase=False,preprocessor=dummy, tokenizer=dummy, min_df=5, max_df=0.5)

#build vocab by fitting the corpora, then transform to make count vector
count_vector_3 = vectorizer_3.fit_transform(corpora)


In [34]:
print('number of vocab: ', len(vectorizer_3.get_feature_names()))

number of vocab:  115296


After setting max_df to 0.5, there is only one fewer word in our vocabulary.  <br>This means that of the words that passed the min_df=5 limit, only one appeared in more than half of the documents.

In [193]:
#instantiate CountVectorizer for pre-tokenized input, now also capping the vocab at 50000 features
vectorizer_4 = CountVectorizer(lowercase=False,preprocessor=dummy, tokenizer=dummy, min_df=5, max_df=0.5, max_features=50000)

#build vocab by fitting the corpora, then transform to make count vector
count_vector_4 = vectorizer_4.fit_transform(corpora)


In [194]:
print('number of vocab: ', len(vectorizer_4.get_feature_names()))

number of vocab:  50000
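
The three parameters used above can be read as successive filters on the vocabulary. The sketch below mimics them in plain Python on a toy corpus. `filter_vocab` is a hypothetical helper that only approximates what CountVectorizer does internally (for example, ties under max_features may be broken differently).

```python
from collections import Counter

def filter_vocab(docs, min_df=1, max_df=1.0, max_features=None):
    n_docs = len(docs)
    df = Counter()   #number of documents each term appears in
    tf = Counter()   #total occurrences of each term in the corpus
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))

    #min_df is an absolute count, max_df a proportion of the documents
    kept = [t for t in df if min_df <= df[t] <= max_df * n_docs]

    #max_features keeps the terms with the highest corpus-wide counts
    if max_features is not None:
        kept = sorted(kept, key=lambda t: (-tf[t], t))[:max_features]

    return sorted(kept)

docs = [['read', 'great', 'story', 'story'],
        ['read', 'slow', 'story'],
        ['read', 'great', 'plot'],
        ['read', 'klausnerfine']]

print(filter_vocab(docs, min_df=2))
print(filter_vocab(docs, min_df=2, max_df=0.5))
print(filter_vocab(docs, min_df=2, max_df=0.5, max_features=1))
```

In this toy corpus, min_df removes the one-off tokens, max_df removes 'read' because it appears in every document, and max_features keeps only the most frequent survivor.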


### 4.1.3 Text Processing

Let's take a closer look at this noisy text. <br> 

**Alphanumeric Tokens**

Starting with alphanumeric tokens: these were first thought to be the result of incorrect spacing, but a closer look at the data reveals review spamming.

In [86]:
df_re.loc[df_re.review.str.contains('afficianado')].head()

Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
60465,b000133q20,agg9c66toljzb,5.0,"['fourth', 'outing', 'wonderful', 'precious', ...","[fourth, out, wonderful, precious, ramotswe, 1..."
141320,b0007c963i,aeqfyoi6yj83z,5.0,"['blood', 'meridian', 'traces', 'life', 'namel...","[blood, meridian, trace, life, nameless, kid, ..."
197120,1417616903,a1tjpmb7n776ws,5.0,"['write', 'review', 'three', 'times', 'first',...","[write, review, three, time, first, draft, mak..."
214937,b000n757qc,a15q7abiu9o9yz,5.0,"['years', 'ago', 'maybe', 'innocent', 'time', ...","[year, ago, maybe, innocent, time, minimum, le..."
256408,b000gli9hy,a1twtulvd6f22o,5.0,"['difficult', 'review', 'much', 'lehane', 'lat...","[difficult, review, much, lehane, late, quot, ..."


Above is what we should expect to see for most vocabulary entries: diverse reviews from different reviewers.<br> 
The examples below show what we see for most of the alphanumeric tokens.

In [88]:
x = df_re.loc[df_re.review.str.contains('2012extraordinary')]
print('number of identical reviews: ', x.shape[0])
x.head()

number of identical reviews:  15


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
163089,b0007hut02,a5qynf99oo4zw,5.0,"['extraordinary', 'talent', 'unique', 'powerfu...","[extraordinary, talent, unique, powerful, crea..."
384515,b000mom96m,a5qynf99oo4zw,5.0,"['extraordinary', 'talent', 'unique', 'powerfu...","[extraordinary, talent, unique, powerful, crea..."
425719,b000mnees4,a5qynf99oo4zw,5.0,"['extraordinary', 'talent', 'unique', 'powerfu...","[extraordinary, talent, unique, powerful, crea..."
564627,0786108789,a5qynf99oo4zw,5.0,"['extraordinary', 'talent', 'unique', 'powerfu...","[extraordinary, talent, unique, powerful, crea..."
632028,b000fom3my,a5qynf99oo4zw,5.0,"['extraordinary', 'talent', 'unique', 'powerfu...","[extraordinary, talent, unique, powerful, crea..."


In [89]:
x = df_re.loc[df_re.review.str.contains('451through')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  16


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
140,b000tz19tc,a3fzmqw5eb9s08,4.5,"['metaphors', 'complication', 'similies', 'fah...","[metaphor, complication, similies, fahrenheit,..."
322,b000gl8umi,a3fzmqw5eb9s08,4.5,"['metaphors', 'complications', 'similies', 'fa...","[metaphor, complication, similies, fahrenheit,..."
1764,b000k0g43c,a3fzmqw5eb9s08,4.5,"['metaphors', 'complications', 'similies', 'fa...","[metaphor, complication, similies, fahrenheit,..."
4138,0003300277,a3fzmqw5eb9s08,4.5,"['metaphors', 'complications', 'similies', 'fa...","[metaphor, complication, similies, fahrenheit,..."
8225,b000ovsqcy,a3fzmqw5eb9s08,4.5,"['metaphors', 'complications', 'similies', 'fa...","[metaphor, complication, similies, fahrenheit,..."


In [90]:
x = df_re.loc[df_re.review.str.contains('540somepages')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  55


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
74438,0694520187,a37r9my02w1ekz,1.0,"['man', 'lots', 'reviews', 'pretentious', 'tal...","[man, lot, review, pretentious, talk, sort, hi..."
175542,b000kpf11s,a37r9my02w1ekz,1.0,"['man', 'lots', 'reviews', 'pretentious', 'tal...","[man, lot, review, pretentious, talk, sort, hi..."
185845,b0006al5rg,a37r9my02w1ekz,1.0,"['man', 'lots', 'reviews', 'pretentious', 'tal...","[man, lot, review, pretentious, talk, sort, hi..."
186436,b000j6dlbu,a37r9my02w1ekz,1.0,"['man', 'lots', 'reviews', 'pretentious', 'tal...","[man, lot, review, pretentious, talk, sort, hi..."
192208,b0006awvpg,a37r9my02w1ekz,1.0,"['man', 'lots', 'reviews', 'pretentious', 'tal...","[man, lot, review, pretentious, talk, sort, hi..."


In [91]:
x = df_re.loc[df_re.review.str.contains('0844606863isbn')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  57


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
28443,b000n6ddjq,atyss9yrsc6gx,3.0,"['critique', 'scarlet', 'letter', 'scarlet', '...","[critique, scarlet, letter, scarlet, letter, n..."
167872,1590862899,atyss9yrsc6gx,3.0,"['critique', 'scarlet', 'letter', 'scarlet', '...","[critique, scarlet, letter, scarlet, letter, n..."
189294,b000mzcjum,atyss9yrsc6gx,3.0,"['critique', 'scarlet', 'letter', 'scarlet', '...","[critique, scarlet, letter, scarlet, letter, n..."
223424,076240552x,atyss9yrsc6gx,3.0,"['critique', 'scarlet', 'letter', 'scarlet', '...","[critique, scarlet, letter, scarlet, letter, n..."
237184,1561035025,atyss9yrsc6gx,3.0,"['critique', 'scarlet', 'letter', 'scarlet', '...","[critique, scarlet, letter, scarlet, letter, n..."


Most alphanumeric tokens that were repeated often enough to be picked up by CountVectorizer came from spam reviews.<br> Although valid alphanumeric tokens may exist, none were found during the analysis. <br>Therefore, we will remove all alphanumeric tokens.

**Tokens containing '_'**

Similar to the above, most of the tokens containing the _ character appear to come from spam reviews.

In [98]:
df_re.loc[df_re.review.str.contains('_all')].head()

Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
5333,b000ovug5y,a1o7omw58qwg5k,5.0,"['terminal', 'man', 'get', 'cha', 'harry', 'be...","[terminal, man, get, cha, harry, benson, one, ..."
9403,0786119551,ak81wlvd5kgux,4.0,"['b', 'eware', 'judging', 'standpoint', 'perso...","[b, eware, judging, standpoint, personal, life..."
14429,b000npgd58,a13ofob1394g31,5.0,"['good', 'news', 'elvis', 'still', 'alive', 'f...","[good, news, elvis, still, alive, flip, burger..."
14693,b000pwncgm,a13ofob1394g31,5.0,"['good', 'news', 'elvis', 'still', 'alive', 'f...","[good, news, elvis, still, alive, flip, burger..."
14993,b000jesvhg,a1o7omw58qwg5k,5.0,"['would', 'like', 'respond', 'friendly', 'way'...","[would, like, respond, friendly, way, barbara,..."


In [99]:
df_re.loc[df_re.review.str.contains('_paris')].head()

Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
132142,b000my6vhk,a1aaybkptl97je,2.0,"['simply', 'put', 'sun', 'also', 'rises', 'sto...","[simply, put, sun, also, rise, story, handful,..."
181165,b000pc53zk,a1aaybkptl97je,2.0,"['simply', 'put', 'sun', 'also', 'rises', 'sto...","[simply, put, sun, also, rise, story, handful,..."
214477,b0006dxw74,a1aaybkptl97je,2.0,"['simply', 'put', 'sun', 'also', 'rises', 'sto...","[simply, put, sun, also, rise, story, handful,..."
350898,b00086u8gw,a1aaybkptl97je,2.0,"['simply', 'put', 'sun', 'also', 'rises', 'sto...","[simply, put, sun, also, rise, story, handful,..."
468557,b0007hdo9k,a1aaybkptl97je,2.0,"['simply', 'put', 'sun', 'also', 'rises', 'sto...","[simply, put, sun, also, rise, story, handful,..."


In [100]:
df_re.loc[df_re.review.str.contains('_good')].head()

Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
1520,b000mooajg,ak81wlvd5kgux,2.0,"['atlas', 'shrugged', 'undoubtedly', 'interest...","[atlas, shrug, undoubtedly, interest, thought,..."
4152,b000mwc3fq,ak81wlvd5kgux,2.0,"['quot', 'positive', 'uphold', 'defend', 'atta...","[quot, positive, uphold, defend, attack, oppos..."
4416,b000pgi7qi,ak81wlvd5kgux,2.0,"['quot', 'need', 'destroy', 'objectivism', 're...","[quot, need, destroy, objectivism, reasonable,..."
5031,b000pwmt1g,ak81wlvd5kgux,2.0,"['quot', 'need', 'destroy', 'objectivism', 're...","[quot, need, destroy, objectivism, reasonable,..."
7118,b0000cki8j,ak81wlvd5kgux,2.0,"['quot', 'need', 'destroy', 'objectivism', 're...","[quot, need, destroy, objectivism, reasonable,..."


In [101]:
df_re.loc[df_re.review.str.contains('_weakness_')].head()

Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
92517,b000nowyr0,a2b0xo8btprx7r,5.0,"['love', 'loss', 'sacrifice', 'gain', 'tender'...","[love, loss, sacrifice, gain, tender, night, f..."
301771,b000kaavli,a2b0xo8btprx7r,5.0,"['love', 'loss', 'sacrifice', 'gain', 'tender'...","[love, loss, sacrifice, gain, tender, night, f..."
324066,b0006amffm,a2b0xo8btprx7r,5.0,"['love', 'loss', 'sacrifice', 'gain', 'tender'...","[love, loss, sacrifice, gain, tender, night, f..."
368250,0808514601,a2b0xo8btprx7r,5.0,"['love', 'loss', 'sacrifice', 'gain', 'tender'...","[love, loss, sacrifice, gain, tender, night, f..."
395465,b000pen42c,a2b0xo8btprx7r,5.0,"['love', 'loss', 'sacrifice', 'gain', 'tender'...","[love, loss, sacrifice, gain, tender, night, f..."


**Numeric Tokens**

For numeric tokens, there appears to be a mixture of genuine reviews and spam. <br> Numbers under 100 show a lot of diversity, whereas most larger numbers come from spam.
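
One caveat when reading the counts in this part: the `review` column holds each token list as a single string, so `str.contains('83')` also matches longer tokens such as '1983'. Below is a minimal sketch of the difference, using made-up review strings and a hypothetical `contains_token` helper.

```python
import re

#made-up examples of how a token list looks once stored as a string
review_a = "['published', '1983', 'edition']"
review_b = "['page', '83', 'chapter']"

def contains_substring(text, s):
    #plain substring test, comparable to str.contains with a literal pattern
    return s in text

def contains_token(text, token):
    #match the token only when it is a complete, quoted list element
    return re.search("'" + re.escape(token) + "'", text) is not None

for r in (review_a, review_b):
    print(contains_substring(r, '83'), contains_token(r, '83'))
```

Short numbers therefore pick up many unrelated reviews, which may account for part of the diversity seen for them.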

In [102]:
x = df_re.loc[df_re.review.str.contains('0')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  341909


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
0,b000n6ddjq,a1d2c0wdcshuwz,5.0,"['people', 'think', 'scarlet', 'letter', 'imme...","[people, think, scarlet, letter, immediate, th..."
2,b000n6ddjq,a1oxi0n58tmyy9,5.0,"['review', 'novel', 'various', 'franklin', 'li...","[review, novel, various, franklin, library, ed..."
3,b0006dg9om,az05jr3xqn9ip,5.0,"['question', 'often', 'appears', 'agony', 'aun...","[question, often, appear, agony, aunt, column,..."
6,b0000ckd7e,a21vr7m8o55ef6,5.0,"['following', 'success', 'road', 'kerouac', 'p...","[follow, success, road, kerouac, publisher, in..."
7,b0000ckd7e,a10b4uol0ib274,5.0,"['kerouac', 'may', 'best', 'known', 'road', 'f...","[kerouac, may, best, know, road, far, favorite..."


In [103]:
x = df_re.loc[df_re.review.str.contains('000')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  26295


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
12,b000pbzh5m,a2itemel0dtx8a,4.5,"['foundation', 'consists', 'five', 'stories', ...","[foundation, consists, five, story, separate, ..."
22,b000pbzh5m,a2b2g75aqom0pi,5.0,"['second', 'foundation', 'novel', 'read', 'who...","[second, foundation, novel, read, whole, serie..."
25,b000pbzh5m,ac3c00yw5k55b,5.0,"['second', 'original', 'trilogy', 'fourth', 'o...","[second, original, trilogy, fourth, overall, f..."
29,b000pbzh5m,awlfvct9128jv,3.5,"['first', 'three', 'novels', 'original', 'foun...","[first, three, novel, original, foundation, tr..."
33,b000pbzh5m,ag7r3mmf8qldt,4.0,"['isaac', 'asimov', 'one', 'popular', 'science...","[isaac, asimov, one, popular, science, fiction..."


In [104]:
x = df_re.loc[df_re.review.str.contains('0140278079')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  11


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
211678,b000h01yyy,a3dwxvgoe2xziq,4.0,"['fascinating', 'written', 'lot', 'reviews', '...","[fascinate, write, lot, review, pagels, reader..."
334264,b000fi4ryw,a3dwxvgoe2xziq,4.0,"['fascinating', 'written', 'lot', 'reviews', '...","[fascinate, write, lot, review, pagels, reader..."
618871,b00023o1l4,a3dwxvgoe2xziq,4.0,"['course', 'prequel', 'da', 'vinci', 'code', '...","[course, prequel, da, vinci, code, two, togeth..."
662689,b000pggccy,a3dwxvgoe2xziq,4.0,"['fascinating', 'written', 'lot', 'reviews', '...","[fascinate, write, lot, review, pagels, reader..."
693508,b0000d1bxo,a3dwxvgoe2xziq,4.0,"['course', 'prequel', 'da', 'vinci', 'code', '...","[course, prequel, da, vinci, code, two, togeth..."


In [105]:
x = df_re.loc[df_re.review.str.contains('1660')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  145


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
4345,b00007k45c,a1cqjzsi3lk1vs,4.5,"['enjoyable', 'eloquent', 'thoroughly', 'enjoy...","[enjoyable, eloquent, thoroughly, enjoyable, r..."
12048,b0006iu7c2,a1cqjzsi3lk1vs,4.5,"['enjoyable', 'eloquent', 'thoroughly', 'enjoy...","[enjoyable, eloquent, thoroughly, enjoyable, r..."
19441,0006513204,a1cqjzsi3lk1vs,4.5,"['enjoyable', 'eloquent', 'thoroughly', 'enjoy...","[enjoyable, eloquent, thoroughly, enjoyable, r..."
54373,1565115430,a319kyeiaz3son,5.0,"['vivid', 'simple', 'poetic', 'girl', 'hyacint...","[vivid, simple, poetic, girl, hyacinth, blue, ..."
63968,1417627530,a39abkrs1mkftw,5.0,"['picked', 'year', 'wonders', 'geraldine', 'br...","[picked, year, wonder, geraldine, brook, two, ..."


In [108]:
x = df_re.loc[df_re.review.str.contains('83')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  5948


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
10,b0000ckd7e,auey946m1l939,4.0,"['commemorative', 'edition', 'add', 'much', 'n...","[commemorative, edition, add, much, new, pengu..."
186,0140860096,a2pns90bkgnm16,4.5,"['popular', 'high', 'school', 'college', 'requ...","[popular, high, school, college, require, read..."
398,0192839020,a1lmbm1n4exs5w,4.5,"['title', 'also', 'happens', 'plot', 'outline'...","[title, also, happens, plot, outline, element,..."
724,b000ks34x2,a1lmbm1n4exs5w,4.5,"['title', 'also', 'happens', 'plot', 'outline'...","[title, also, happens, plot, outline, element,..."
834,b0007gqnj4,a1l43kwwr05pcs,4.0,"['second', 'published', 'work', 'writes', 'kne...","[second, publish, work, writes, knew, best, ea..."


In [109]:
x = df_re.loc[df_re.review.str.contains('93')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  28820


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
2,b000n6ddjq,a1oxi0n58tmyy9,5.0,"['review', 'novel', 'various', 'franklin', 'li...","[review, novel, various, franklin, library, ed..."
33,b000pbzh5m,ag7r3mmf8qldt,4.0,"['isaac', 'asimov', 'one', 'popular', 'science...","[isaac, asimov, one, popular, science, fiction..."
81,b000pbzh6q,ag7r3mmf8qldt,4.0,"['second', 'original', 'three', 'foundation', ...","[second, original, three, foundation, series, ..."
135,b000tz19tc,a2njo6ye954dbh,5.0,"['teaching', 'quot', 'fahrenheit', '451', 'quo...","[teach, quot, fahrenheit, 451, quot, example, ..."
154,b000phiim0,a2l7n2u5z316ze,5.0,"['one', 'greatest', 'pieces', 'work', 'heritag...","[one, great, piece, work, heritage, press, arg..."


In [110]:
x = df_re.loc[df_re.review.str.contains('535')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  70


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
1066,b000n5hfqy,a2dktzmmg3jhn4,4.0,"['rating', 'mein', 'kampf', '4', 'stars', 'due...","[rating, mein, kampf, 4, star, due, translatio..."
4059,b000raiqs6,a2dktzmmg3jhn4,4.0,"['rating', 'mein', 'kampf', '4', 'stars', 'due...","[rating, mein, kampf, 4, star, due, translatio..."
4855,b000phn85c,a2dktzmmg3jhn4,4.0,"['rating', 'mein', 'kampf', '4', 'stars', 'due...","[rating, mein, kampf, 4, star, due, translatio..."
7032,b000mkid0w,a2dktzmmg3jhn4,4.0,"['rating', 'mein', 'kampf', '4', 'stars', 'due...","[rating, mein, kampf, 4, star, due, translatio..."
8622,b000p0ms7i,a2dktzmmg3jhn4,4.0,"['rating', 'mein', 'kampf', '4', 'stars', 'due...","[rating, mein, kampf, 4, star, due, translatio..."


In [111]:
x = df_re.loc[df_re.review.str.contains('3724')]
print('number of reviews: ', x.shape[0])
x.head()

number of reviews:  55


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
74196,0694520187,a2fthcgh06o4y5,5.0,"['unless', 'naval', 'historian', 'melville', '...","[unless, naval, historian, melville, scholar, ..."
175708,b000kpf11s,a2fthcgh06o4y5,5.0,"['unless', 'naval', 'historian', 'melville', '...","[unless, naval, historian, melville, scholar, ..."
186007,b0006al5rg,a2fthcgh06o4y5,5.0,"['unless', 'naval', 'historian', 'melville', '...","[unless, naval, historian, melville, scholar, ..."
186330,b000j6dlbu,a2fthcgh06o4y5,5.0,"['unless', 'naval', 'historian', 'melville', '...","[unless, naval, historian, melville, scholar, ..."
192370,b0006awvpg,a2fthcgh06o4y5,5.0,"['unless', 'naval', 'historian', 'melville', '...","[unless, naval, historian, melville, scholar, ..."


**Noisy Text**

Other noisy tokens consist of repeated letters, such as 'aaaaa' or 'zzzzz'. <br>In real words, letters like a, e, z may appear twice in a row but not more.  <br>These entries will be removed as well.
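
As an alternative to listing such tokens by hand, runs of three or more identical characters can be detected with a regular-expression backreference. This is only a sketch: `has_long_run` is a hypothetical helper, and it does not cover every entry in the list that follows (for example 'aachen' or 'aad').

```python
import re

def has_long_run(word):
    #True if any character appears three or more times in a row
    return re.search(r'(.)\1\1', word) is not None

for w in ['zzzzz', 'aaah', 'book', 'aachen', 'bookkeeper']:
    print(w, has_long_run(w))
```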

In [6]:
def hasNumbers(inputString):
    #check if containing number
    return bool(re.search(r'\d', inputString))

#including noisy texts found in next iteration of stopwords
detailed_stopwords = [ 'aa', 'aaa', 'aaaa', 'aaaaaaaa', 'aaaaaaaaa', 'aaaaaaaaaa', 'aaaaaaaaaaa', 
                      'aaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaahhhhhhhhhhhhhhh', 'aaaaaahhh', 'aaaah', 'aaah', 
                      'aaarrrggghhh', 'aaarrrrgghhh', 'aabout', 'aaccidental', 'aachen', 'aacute', 'aad', 
                      'aaf', 'aah', 'aahh', 'aaw','zz', 'zzboring', 'zzz', 'zzzzs', 'zzzzz', 'zzzzzz', 
                      'zzzzzzz', 'zzzzzzzz', 'zzzzzzzzben', 'zzzzzzzzz', 'zzzzzzzzzz', 'zzzzzzzzzzz', 
                      'zzzzzzzzzzzz', 'zzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzz', 
                      'zzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzzz', 
                      'zzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzzzzzzzzzzzz']

'''
Takes in list of words

Outputs processed list of words
'''
def processText(t):
    
    out = []
    
    #loop through list of words
    for w in t:
        
        #only keep words without any numbers or _, not part of noisy text, and between 4 and 19 characters in length
        if (not hasNumbers(w)) and ('_' not in w) and (w not in detailed_stopwords) and (len(w) > 3 and len(w) < 20):
            out.append(w)
        
    return out


        
        

def findSpam(srs):
    
    spam_ind = []
    
    #iterate through all documents
    for index, row in srs.items():
        
        #iterate through all words
        for w in row:

            # if word contains a number higher than 2000 (calendar year), or _, or is part of detailed_stopwords, or is more than 20 char in length
            if (hasNumbers(w) and (int(re.sub(r'\D','',w))>2000)) or ('_' in w) or (w in detailed_stopwords) or (len(w) > 20):
                
                #appending the same index multiple times is fine for now, we take the unique values afterwards
                spam_ind.append(index)
    
    
    return spam_ind
            
    

We've found a number of reviews that are considered spam.  <br> Reviews corresponding to these indices will be removed from both the original dataset df_re and the tf-idf vectors.
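
The flagging rule in `findSpam` can be exercised on a toy corpus to see which reviews it catches. The sketch below restates the rule in a self-contained form; `is_spam_token`, `find_spam` and the small `NOISE` set are hypothetical stand-ins, and each index is returned only once.

```python
import re

#hypothetical stand-in for detailed_stopwords
NOISE = {'aaa', 'zzz'}

def is_spam_token(w):
    #join all digits in the token and compare against 2000, as findSpam does
    digits = re.sub(r'\D', '', w)
    return ((digits != '' and int(digits) > 2000)
            or '_' in w
            or w in NOISE
            or len(w) > 20)

def find_spam(docs):
    #index of every document that contains at least one flagged token
    return [i for i, doc in enumerate(docs) if any(is_spam_token(w) for w in doc)]

docs = [['fahrenheit', '451', 'classic'],
        ['2012extraordinary', 'talent'],
        ['_yes_', 'great'],
        ['publish', '2005', 'edition']]

print(find_spam(docs))
```

Note that under this rule the last toy review is flagged as well, simply because it mentions a year after 2000.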

In [None]:
#find spam index
spam = findSpam(df_re.review_lemmatized)

In [66]:
#make it into index list
spam_ind = list(set(spam))

In [70]:
#remove spam
df_re = df_re.loc[~df_re.index.isin(spam_ind)]

#reset index
df_re.reset_index(inplace=True)

In [None]:
df_re.drop(columns='index',inplace=True)

In [73]:
df_re.head()

Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
0,b000n6ddjq,a1d2c0wdcshuwz,5.0,"['people', 'think', 'scarlet', 'letter', 'imme...","[people, think, scarlet, letter, immediate, th..."
1,b000n6ddjq,a2m3nctfugi4sr,5.0,"['hawthorne', 'best', 'work', 'central', 'unde...","[hawthorne, best, work, central, understand, t..."
2,b0006dg9om,az05jr3xqn9ip,5.0,"['question', 'often', 'appears', 'agony', 'aun...","[question, often, appear, agony, aunt, column,..."
3,b0006dg9om,a1r77go77fblmj,5.0,"['truly', 'excellently', 'written', 'beginning...","[truly, excellently, write, begin, te, come, a..."
4,b0000ckd7e,a10b4uol0ib274,5.0,"['kerouac', 'may', 'best', 'known', 'road', 'f...","[kerouac, may, best, know, road, far, favorite..."


In [74]:
df_re.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1806240 entries, 0 to 1806239
Data columns (total 5 columns):
asin                 object
reviewerID           object
overall              float64
review               object
review_lemmatized    object
dtypes: float64(1), object(4)
memory usage: 68.9+ MB


In [77]:
#process text again
df_re['review_lemmatized'] = df_re['review_lemmatized'].apply(lambda x: processText(x))

In [78]:
df_re.to_csv('data/spam_processed.csv',index=False)

In [2]:
#when reading in our word list again, pandas parses it as string instead of list of words.
#using literal_eval to fix this
df_re = pd.read_csv('data/spam_processed.csv',converters={'review_lemmatized': literal_eval})

### 4.1.4 Tf-Idf Transformer

In [4]:
#isolate review_lemmatized
lemmatized_text = df_re.review_lemmatized

In [8]:
#instantiate CountVectorizer for pre-tokenized input, capping the vocab at 10000 features
vectorizer_5 = CountVectorizer(lowercase=False,preprocessor=dummy, tokenizer=dummy, min_df=5, max_df=0.5, max_features=10000)

#build vocab by fitting the lemmatized text, then transform to make count vector
count_vector_updated = vectorizer_5.fit_transform(lemmatized_text)

In [9]:
len(vectorizer_5.get_feature_names())

10000

In [10]:
new_vocab=vectorizer_5.get_feature_names()

In [13]:
print(new_vocab)



The new vocab list seems to be in a reasonable range.

In [11]:
#instantiate tfidf transformer
tfidf_transformer = TfidfTransformer(smooth_idf=True,use_idf=True)

In [12]:
tfidf_vector = tfidf_transformer.fit_transform(count_vector_updated)

In [13]:
tfidf_vector.shape

(1806240, 10000)
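
For reference, the weighting applied by `TfidfTransformer(smooth_idf=True, use_idf=True)` with its default L2 normalization can be sketched in plain NumPy. The toy `counts` matrix is an assumption made for illustration and is unrelated to the review data.

```python
import numpy as np

# toy count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)          # documents containing each term

# smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# weight raw counts by idf, then scale every row to unit L2 length
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(tfidf.round(3))
```

Terms that appear in fewer documents receive a larger idf, so they contribute more to a review's vector than terms that appear almost everywhere.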

**Data Sampling**

Our dataset is still far too big to process with the current resources. <br> We will use the pandas `sample` method to randomly sample the data.

In [25]:
#sample 10000 rows from our dataset
sampled_data = df_re.sample(n=10000, random_state=0)

In [26]:
sampled_review = sampled_data.review_lemmatized
sampled_output = sampled_data.asin

In [27]:
count_vector_sampled = vectorizer_5.transform(sampled_review)

In [28]:
tfidf_vector_sampled = tfidf_transformer.transform(count_vector_sampled)

In [29]:
x_train, x_test, y_train, y_test = train_test_split(tfidf_vector_sampled, sampled_output, test_size=0.3, random_state=0)

### 4.1.5 Review Dataset Revisited

An attempt to create a classification model from the above tfidf vector revealed an obstacle.<br> The sheer number of reviews (i.e. documents) puts too much burden on the available resources, and training runs without finishing.

Although it would be ideal to use all the available data, a model is not better simply because it has more data than other models. <br> We will bring back the older version of the full review dataset, which has been cleaned but not yet trimmed down. <br> The only filtering done on this dataset is the original minimum of 50 reviews per book.

To reduce the number of documents in this dataset, we will first apply an updated, more rigorous data trimming and a tighter restriction on the number of reviews.  <br> We do this because we do not want to reduce our selection of books too much.  <br>We will sacrifice a little of the validity of the reviews to preserve the selection of books.

In [60]:
#import full dataset with 6 million reviews
df_main = pd.read_csv('data/df_re_dupCleaned.csv')

**SPAM Removal**

We noticed an abundance of spam reviews (single reviewer, many books, same review text).  <br> The prior duplicate check was done on the whole dataframe, which means only fully identical rows were detected. <br> Spam reviews are not row duplicates; they are mostly reviewer-review duplicates.

Since we are now conducting a more rigorous trimming, every duplicated review will be removed, keeping none of the copies. <br> We keep none of them because we do not expect reviews of two different books, or reviews from two different users, to be identical, and because spam reviews do not add to the genuine reviews.
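
The difference between the whole-row check and the review-only check with `keep=False` can be seen on a small example. This is only a sketch; the toy rows are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'asin':       ['b1', 'b2', 'b3', 'b1'],
    'reviewerID': ['u1', 'u1', 'u1', 'u2'],
    'review':     ['Great book!', 'Great book!', 'Great book!',
                   'Slow start, strong ending.'],
})

# whole-row check: nothing is dropped because asin differs on each spam row
row_dedup = toy.drop_duplicates()

# keep='first' on the review text keeps one copy of the spam review
keep_first = toy.drop_duplicates(subset='review', keep='first')

# keep=False removes every row whose review text appears more than once
keep_none = toy.drop_duplicates(subset='review', keep=False)

print(len(row_dedup), len(keep_first), len(keep_none))
```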

In [44]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6006462 entries, 0 to 6006461
Data columns (total 5 columns):
asin          object
title         object
reviewerID    object
overall       float64
review        object
dtypes: float64(1), object(4)
memory usage: 229.1+ MB


In [45]:
#drop duplicates based on column review
df_main.drop_duplicates(subset='review', keep=False, inplace=True)

In [46]:
df_main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1281215 entries, 0 to 6006290
Data columns (total 5 columns):
asin          1281215 non-null object
title         1281162 non-null object
reviewerID    1281215 non-null object
overall       1281215 non-null float64
review        1281215 non-null object
dtypes: float64(1), object(4)
memory usage: 58.6+ MB


We see a big drop in the total number of reviews, from 6 million to about 1.3 million.  This shows how heavily our review dataset was filled with duplicated reviews.

**Lower Bound : 30 reviews**

In [47]:
main_asin = df_main.groupby('asin').asin.count()

print('number of different books: ', len(main_asin))

number of different books:  16393


There are currently 16393 books after dropping duplicates.  However, the dataset now contains books that have only a few reviews left after cleaning. <br> For this review-based content-based filtering recommender, each book needs enough review data for the recommender to have something to recommend from.

<br>We will drop books with fewer than 30 reviews, as opposed to 50 before.

In [50]:
print('number of books with less than 30 reviews: ',len(main_asin[main_asin<30]))

number of books with less than 30 reviews:  3902


In [53]:
#filter 
df_main = df_main.loc[~df_main.asin.isin(main_asin[main_asin<30].index)]

**Upper Bound : 50 reviews**

Next, we will set an upper limit of 50 reviews per book.  <br> This means every book in our processed review dataframe will have between 30 and 50 reviews. <br> This decision reduces the amount of data without losing any books.  <br> It also introduces a bit of fairness toward the less-reviewed books, which is acceptable. <br>We are building a recommender system based on what the reviews say, not on how many reviews a book has received.

However, setting an upper limit is more difficult than removing books with fewer than 30 reviews, because we need to decide which reviews to keep. <br> To make sure the selection is not biased, we will use the dataframe `sample` method again to randomly draw 50 reviews from every book that has more than 50.

In [5]:
def takeSample(df, n=50):
    '''
    Takes in dataframe and sample size

    Outputs sampled dataframe
    '''
    #return sampled dataframe
    return df.sample(n=n, random_state=0)


def removeExcess(df, upper=50):
    '''
    Takes in review dataframe

    Outputs dataframe with at most `upper` reviews per book
    '''
    #get review count per asin
    asin_list = df.groupby('asin').asin.count()

    #get list of books with too many reviews
    excess_reviewed_books = asin_list[asin_list>upper].index

    #start the output with the books that are already within the limit
    pieces = [df.loc[~df.asin.isin(excess_reviewed_books)]]

    #loop through all excessive review books
    for book in excess_reviewed_books:

        #isolate the book at scope and sample `upper` reviews from it
        pieces.append(takeSample(df.loc[df.asin==book], upper))

    #merge everything back together in a single concat
    return pd.concat(pieces)
    
    

In [96]:
start = timeit.default_timer()

df_re = removeExcess(df_main)

stop = timeit.default_timer()

print('Time: ', stop - start)  

9432
Time:  3420.539488839
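
The run above took close to an hour, most likely because the loop scans the full dataframe once for every over-reviewed book. A possible alternative, sketched here under the assumption that any unbiased random subset of 50 reviews per book is acceptable, is to shuffle the rows once and keep the first rows of each group. `cap_reviews_per_book` is a hypothetical helper, and it will not pick the same reviews as `removeExcess`.

```python
import pandas as pd

def cap_reviews_per_book(df, cap=50, seed=0):
    # hypothetical helper: shuffle all rows once, then keep the first `cap` rows of each asin
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby('asin').head(cap)

# toy data: book 'a' has 5 reviews, book 'b' has 2
toy = pd.DataFrame({'asin': list('aaaaabb'),
                    'review': [f'r{i}' for i in range(7)]})
capped = cap_reviews_per_book(toy, cap=3)

print(capped.groupby('asin').size())
```

Books that are already at or below the cap keep all of their reviews, so no separate handling is needed for them.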


In [97]:
df_re.to_csv('data/re_upperbound.csv',index=False)

**Text Preprocessing**

In [120]:
#df_re = pd.read_csv('data/re_upperbound.csv')

In [110]:
df_re.reset_index(inplace=True)

In [121]:
#clean up asin, title, reviewerID
df_re['asin'] = df_re['asin'].apply(lambda x: x.lower())
df_re['title'] = df_re['title'].apply(lambda x: cleanText(x))
df_re['reviewerID'] = df_re['reviewerID'].apply(lambda x: x.lower())


In [122]:
#add title to review text
df_re['review'] = df_re['title'] + ' ' + df_re['review']

In [124]:
#isolate unique asin-title
asin_title = df_re[['asin','title']].drop_duplicates(keep='first')

#drop redundant title 
df_re.drop(columns='title',inplace=True)

In [132]:
#clean up review
df_re['review'] = df_re['review'].apply(lambda x: cleanText(x))

#build stopwords
eng_stopwords = set(stopwords.words('english') + ['book','books','author','authors'])


#tokenize and remove stopwords
df_re['review'] = df_re['review'].apply(lambda x: removeStopwords(x, eng_stopwords))

#lemmatize each word token to its root word, given we know the pos_tag of the word
df_re['review_lemmatized'] = df_re['review'].apply(lambda x: lemmatize_words(x))

In [133]:
#find spam index
spam = findSpam(df_re.review_lemmatized)

#make it into index list
spam_ind = list(set(spam))

#remove spam
df_re = df_re.loc[~df_re.index.isin(spam_ind)]

#reset index
df_re.reset_index(inplace=True)

df_re.drop(columns='index',inplace=True)

df_re.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


Unnamed: 0,asin,reviewerID,overall,review,review_lemmatized
0,b0000630mu,a332u346e9t5pu,4.5,"[html, complete, reference, great, think, good...","[html, complete, reference, great, think, good..."
1,0700608761,a3sofls15m0iq2,5.0,"[stopped, stalingrad, luftwaffe, hitler, defea...","[stop, stalingrad, luftwaffe, hitler, defeat, ..."
2,0743471474,a3fy7ov19hp0q7,2.0,"[shadow, lion, tried, couple, times, read, tim...","[shadow, lion, try, couple, time, read, time, ..."
3,0743471474,a2n8h6udrjk531,5.0,"[shadow, lion, brilliant, mercedes, descriptio...","[shadow, lion, brilliant, mercedes, descriptio..."
4,0060953616,aqanmvbtro0k2,1.5,"[climbing, high, woman, account, surviving, ev...","[climb, high, woman, account, survive, everest..."


In [134]:
#process text again
df_re['review_lemmatized'] = df_re['review_lemmatized'].apply(lambda x: processText(x))

df_re.to_csv('data/upperbound_lemmatized.csv',index=False)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


**Cosine Similarity**

Now that we have the dataset ready, let's try building the review-based recommender system to see how effective it is.

In [164]:
#import titles df so we can match the output asin to title
titles = pd.read_csv('data/titles.csv')

In [2]:
#when reading in our word list again, pandas parses it as string instead of list of words.
#using literal_eval to fix this
df_re = pd.read_csv('data/upperbound_lemmatized.csv',converters={'review_lemmatized': literal_eval})

In [3]:
#isolate the text for easier processing
lemmatized_text = df_re.review_lemmatized


In [9]:
#instantiate CountVectorizer for pre-tokenized input: no lower casing, preprocessing or tokenizing.
vectorizer = CountVectorizer(lowercase=False,preprocessor=dummy, tokenizer=dummy, min_df=5, max_df=0.5, max_features=50000)

#build vocab by fitting the corpus, then transform to make the count vector
count_vector = vectorizer.fit_transform(lemmatized_text)

In [10]:
count_vector.shape

(578449, 50000)

In [11]:
#instantiate tfidf transformer
transformer = TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_vector = transformer.fit_transform(count_vector)

We now have a reasonably tuned tfidf vector, which we will use to recommend similar products.
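
`TfidfTransformer` scales every row to unit L2 length by default, so a plain dot product between two tfidf rows already equals their cosine similarity. A minimal NumPy sketch on two made-up vectors illustrates this:

```python
import numpy as np

# two toy term-weight vectors and their unit-length versions
a = np.array([3.0, 4.0, 0.0])
b = np.array([1.0, 0.0, 1.0])
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# cosine similarity from its definition
cosine = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# plain dot product (what a linear kernel computes) of the normalized rows
dot_of_units = a_unit.dot(b_unit)

print(np.isclose(cosine, dot_of_units))
```

This is why a linear kernel can stand in for a cosine similarity computation on tfidf output.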

In [156]:
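def recommend(book, titles, df, cv, tt, tf, num_rec=5):
    '''
    Takes in book title to search, titles dataframe, review dataframe, fitted count vectorizer,
    fitted tfidf transformer, tfidf of corpus and number of recommendations

    Outputs recommendations 
    '''
    #process the query the same way as the review text
    book_text = cleanText(book)
    book_text = removeStopwords(book_text)
    book_text = lemmatize_words(book_text)
    
    #vectorize the query with the fitted vectorizer and transformer
    book_vector = cv.transform([book_text])
    book_tfidf = tt.transform(book_vector)
    
    #tfidf rows are L2-normalized, so the linear kernel equals cosine similarity
    cosine_matrix = linear_kernel(book_tfidf,tf).flatten()
    
    #indices of the most similar reviews, best first (this slice returns num_rec-1 indices)
    related_docs_indices = cosine_matrix.argsort()[:-num_rec:-1]

    print('recommending for ',book,'\n')
    
    print('recommendations: ')
    
    for i in related_docs_indices:
        
        #look up the title of the book the review belongs to
        print(titles.loc[titles.asin == df.asin.iloc[i]].title)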
        

In [157]:
recommend('HTML Complete Reference', titles, df_re, vectorizer, transformer, tfidf_vector, 10)

recommending for  HTML Complete Reference 

recommendations: 
0    HTML: The Complete Reference
Name: title, dtype: object
0    HTML: The Complete Reference
Name: title, dtype: object
0    HTML: The Complete Reference
Name: title, dtype: object
4833    HTML Goodies
Name: title, dtype: object
0    HTML: The Complete Reference
Name: title, dtype: object
0    HTML: The Complete Reference
Name: title, dtype: object
0    HTML: The Complete Reference
Name: title, dtype: object
0    HTML: The Complete Reference
Name: title, dtype: object
25377    HTML 4 for Dummies, Fourth Edition
Name: title, dtype: object


In [158]:
recommend('Lord of the Rings: Two Towers', titles, df_re, vectorizer, transformer, tfidf_vector, 10)

recommending for  Lord of the Rings: Two Towers 

recommendations: 
22646    The Dark Tide: Book One of the Iron Tower Trilogy
Name: title, dtype: object
35736    The Lord of the Rings Weapons and Warfare
Name: title, dtype: object
23926    The Iron Tower Omnibus (Mithgar)
Name: title, dtype: object
35790    Dark Tower Boxed Set: v. 1-1v
Name: title, dtype: object
23926    The Iron Tower Omnibus (Mithgar)
Name: title, dtype: object
23926    The Iron Tower Omnibus (Mithgar)
Name: title, dtype: object
13907    The Dark Tower, Books 1-3: The Gunslinger, The...
Name: title, dtype: object
31607    The Fifth Ring
Name: title, dtype: object
23926    The Iron Tower Omnibus (Mithgar)
Name: title, dtype: object


In [166]:
recommend('statistics application', titles, df_re, vectorizer, transformer, tfidf_vector, 10)

recommending for  statistics application 

recommendations: 
12273    The Lady Tasting Tea: How Statistics Revolutio...
Name: title, dtype: object
35574    Statistics For Dummies (For Dummies (Math & Sc...
Name: title, dtype: object
33939    Statistics for the Utterly Confused (Schaum's ...
Name: title, dtype: object
35574    Statistics For Dummies (For Dummies (Math & Sc...
Name: title, dtype: object
35574    Statistics For Dummies (For Dummies (Math & Sc...
Name: title, dtype: object
12273    The Lady Tasting Tea: How Statistics Revolutio...
Name: title, dtype: object
33939    Statistics for the Utterly Confused (Schaum's ...
Name: title, dtype: object
35574    Statistics For Dummies (For Dummies (Math & Sc...
Name: title, dtype: object
17863    History: Fiction or Science?
Name: title, dtype: object


Our review-based recommender returns the same recommendation multiple times.  This is because each book has multiple reviews, and each review is ranked as its own document.  In the future, we could extend the recommender to skip any asin that has already been selected. However, we are also seeing the limitation of a review-based content recommender, as its keywords are determined by the user reviews.  Although the current recommender does a good job of recommending similar (for now, mostly the same) books, a lot of manual data trimming was needed to keep the corpus manageable.  This means the corpus itself was shaped by the developer's decisions about which reviews to keep, which introduces bias.  This may not work well in a production situation.

Content-based filtering is usually applied where we analyze text such as titles, descriptions or even transcripts, but not usually someone's opinion, since there is less variation in how people describe their feelings toward a product than in the product's direct description.

As part of building a recommender system for this project, we will take two steps back at this time to create a basic content-based recommender system using 1) the titles of the books and 2) the descriptions and genres of the books.
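
Skipping an asin that has already been recommended could look roughly like the sketch below. `top_unique_asins` is a hypothetical helper, not part of the notebook, and the toy scores stand in for the similarity values computed inside `recommend`.

```python
import numpy as np

def top_unique_asins(scores, asins, num_rec=5):
    # hypothetical helper: walk reviews from most to least similar,
    # keeping only the first hit for each asin
    picked = []
    seen = set()
    for i in np.argsort(scores)[::-1]:
        asin = asins[i]
        if asin not in seen:
            seen.add(asin)
            picked.append(asin)
        if len(picked) == num_rec:
            break
    return picked

# toy similarity scores for six reviews belonging to three books
scores = np.array([0.9, 0.8, 0.1, 0.7, 0.85, 0.3])
asins = ['b1', 'b1', 'b2', 'b3', 'b1', 'b2']

print(top_unique_asins(scores, asins, num_rec=2))
```

Each book is then represented by its single best-matching review, so a heavily reviewed book can no longer fill the whole list.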

We will do this in a separate notebook.