# 5. Product Dataset Content-based Filtering

Continuing the effort to build a content-based filtering recommender system, we now switch the working dataset from the review dataset to the product dataset. <br> In this section, we will build a simple recommender system in the two ways below:

- content-based filtering using title as tokens
- content-based filtering using product descriptions

In [121]:
import pandas as pd

import re
from ast import literal_eval

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics, svm
from sklearn.metrics.pairwise import linear_kernel

## 5.1 Data Review 

In [33]:
df_pr = pd.read_csv('data/pr_books.csv')

In [34]:
df_pr.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930600 entries, 0 to 2930599
Data columns (total 9 columns):
description    2382109 non-null object
title          2929812 non-null object
also_buy       1343756 non-null object
brand          2828674 non-null object
also_view      1204260 non-null object
main_cat       2930552 non-null object
asin           2930600 non-null object
category       2544322 non-null object
details        394979 non-null object
dtypes: object(9)
memory usage: 201.2+ MB


## 5.2 Data Preprocessing

As before, the data needs to be trimmed. Unlike before, however, each entry contributes little to the overall corpus, since we are analyzing short keyword-style descriptions rather than long written text.

For the same reason, we will remove any entries with null features. <br> We also keep also_view and also_buy, as they may be used for collaborative filtering later.

In [35]:
#initial blanket duplicate drop to make sure there aren't any row duplicates
df_pr.drop_duplicates(keep='first',inplace=True)

In [36]:
df_pr.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2930600 entries, 0 to 2930599
Data columns (total 9 columns):
description    2382109 non-null object
title          2929812 non-null object
also_buy       1343756 non-null object
brand          2828674 non-null object
also_view      1204260 non-null object
main_cat       2930552 non-null object
asin           2930600 non-null object
category       2544322 non-null object
details        394979 non-null object
dtypes: object(9)
memory usage: 223.6+ MB


In [37]:
df_pr = df_pr.dropna()

In [38]:
df_pr.info(verbose=True,null_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16263 entries, 2534541 to 2930147
Data columns (total 9 columns):
description    16263 non-null object
title          16263 non-null object
also_buy       16263 non-null object
brand          16263 non-null object
also_view      16263 non-null object
main_cat       16263 non-null object
asin           16263 non-null object
category       16263 non-null object
details        16263 non-null object
dtypes: object(9)
memory usage: 1.2+ MB
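
The drop from 2,930,600 to 16,263 rows comes from requiring every column to be present; `details` alone is non-null in only 394,979 rows, which caps the result. The sketch below uses a small made-up frame (the `toy` data is hypothetical) to show how `dropna()` compares with `dropna(subset=...)`. Restricting the check to the columns that are actually used is one possible alternative, not what this notebook does.

```python
import pandas as pd

# Toy frame (hypothetical data): 'details' is mostly missing, as in df_pr.
toy = pd.DataFrame({
    "title": ["A", "B", "C", None],
    "description": ["a", "b", None, "d"],
    "details": [None, "x", None, None],
})

# Dropping rows with ANY null keeps only rows where every column is present.
all_cols = toy.dropna()

# Restricting the check to selected columns keeps more rows.
used_cols = toy.dropna(subset=["title", "description"])

print(len(all_cols), len(used_cols))
```

Whether the extra rows are worth keeping depends on how much the later steps rely on the sparse columns.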


In [39]:
#drop column that is not needed
df_pr.drop(columns='details',inplace=True)

**Title Preprocessing**

In [40]:
#min_df=1 keeps every term (same effect as 0) and also passes parameter validation in newer scikit-learn
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1, max_df=0.1, stop_words='english')
tfidf_matrix = tf.fit_transform(df_pr['title'])

In [41]:
tfidf_matrix.shape

(16263, 25624)

View the vocabulary list directly to spot any bad data.

In [42]:
print(tf.get_feature_names())



In [67]:
#stopword set used as the default in processText; it must exist before the def below
eng_stopwords = set(stopwords.words('english'))


def cleanText(t):
    '''
    Takes in text

    Outputs text with all lower case and special characters removed
    '''
    #change to lower case
    t = str(t).lower()
    #replace each run of non-word characters with a single space
    #(\W does not match '_', so underscores are filtered later in processText)
    t = re.sub(r"\W+", " ", t)

    return t


def hasNumbers(inputString):
    #check if containing number
    return bool(re.search(r'\d', inputString))


def processText(t, stop_words=eng_stopwords):
    '''
    Takes in text

    Outputs tokenized text with stopwords removed
    '''
    #tokenize : only single words, no n-grams
    t = word_tokenize(t, language='english')

    #drop stopwords, tokens containing digits and tokens containing underscores
    return [word for word in t
            if (word not in stop_words) and (not hasNumbers(word)) and ('_' not in word)]


#identity function: lets the vectorizer accept already-tokenized input
def dummy(doc):
    return doc
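
To make the effect of the cleaning step concrete, here is a minimal standalone sketch of the same lowercase-and-regex idea. The name `clean_text_sketch` is made up for this illustration. It also shows that `\W` leaves underscores in place, which would explain the explicit `'_'` check in `processText`.

```python
import re

def clean_text_sketch(t):
    # Lowercase, then replace each run of non-word characters with one space.
    return re.sub(r"\W+", " ", str(t).lower())

title = "Practical Chinese Reader: Elementary Course, Book 1"
print(clean_text_sketch(title))
# practical chinese reader elementary course book 1

# Underscores count as word characters, so they survive the regex.
print(clean_text_sketch("snake_case & more!"))
```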

In [44]:
df_pr['title_cleansed'] = df_pr['title'].apply(lambda x: cleanText(x))

In [45]:
eng_stopwords = set(stopwords.words('english'))
df_pr['title_cleansed'] = df_pr['title_cleansed'].apply(lambda x: processText(x,eng_stopwords))

In [46]:
df_pr.reset_index(inplace=True)

In [47]:
df_pr.drop(columns='index',inplace=True)

## 5.2 Title-based Recommender 

Now that the titles have been cleaned, we will run them through the vectorizers. <br> Since building a full pairwise cosine similarity matrix for all titles is too resource-intensive, we instead vectorize the input title with the fitted count vectorizer and tf-idf transformer, compute its cosine similarity against every title on the fly, and recommend the most similar titles.
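
As a rough illustration of the on-the-fly approach, the sketch below scores a single query against a tiny made-up corpus (the `titles` list and the query are hypothetical). It relies on the fact that scikit-learn's tf-idf output is L2-normalized by default, so `linear_kernel` (a plain dot product) gives the same values as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini-corpus standing in for the cleaned titles.
titles = [
    "practical chinese reader elementary course",
    "new practical chinese reader textbook",
    "lord of the rings",
]

vec = TfidfVectorizer()            # norm='l2' by default
matrix = vec.fit_transform(titles)

# Vectorize only the query, then compare it with every row (a 1 x n result).
query = vec.transform(["chinese reader"])
scores = linear_kernel(query, matrix).flatten()

# With L2-normalized rows, the dot product is the cosine similarity.
assert np.allclose(scores, cosine_similarity(query, matrix).flatten())

print(scores.round(3))
```

Only one row of similarities is ever held in memory, rather than the full n x n matrix.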

In [48]:
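#min_df=1 keeps every term (same effect as 0) and also passes parameter validation in newer scikit-learn
cv = CountVectorizer(lowercase=False, preprocessor=dummy, tokenizer=dummy, min_df=1, max_df=0.1)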
count= cv.fit_transform(df_pr.title_cleansed)

In [49]:
#instantiate tfidf transformer
tt = TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf = tt.fit_transform(count)

In [50]:
tfidf.shape

(16263, 23076)
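
The `dummy` preprocessor and tokenizer are what let `CountVectorizer` consume token lists instead of raw strings. A minimal sketch of that pattern on made-up token lists follows; `identity`, `docs` and `cv_sketch` are names invented for this example, and `token_pattern=None` is added only to signal that the default regex tokenizer is not used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def identity(doc):
    # Return the token list unchanged.
    return doc

# Hypothetical pre-tokenized documents.
docs = [["practical", "chinese", "reader"], ["lord", "rings"]]

cv_sketch = CountVectorizer(
    lowercase=False,        # tokens are lists, so there is nothing to lowercase
    preprocessor=identity,
    tokenizer=identity,
    token_pattern=None,     # the regex tokenizer is bypassed
)
counts = cv_sketch.fit_transform(docs)
weights = TfidfTransformer().fit_transform(counts)

print(sorted(cv_sketch.vocabulary_))
print(weights.shape)
```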

In [53]:
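#using the first book on the list as test
#tf-idf rows are L2-normalized by default, so the dot product equals cosine similarity
cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
#indices of the 4 highest scores, best first (the slice stops before position -5)
related_docs_indices = cosine_similarities.argsort()[:-5:-1]

cosine_similarities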

array([1.        , 0.        , 0.06581903, ..., 0.        , 0.21330557,
       0.2355951 ])

In [63]:
print('input title: ',df_pr.title[0])
print('recommendations: ')
print(df_pr.title[related_docs_indices])

input title:  Practical Chinese Reader: Elementary Course, Book 1
recommendations: 
0     Practical Chinese Reader: Elementary Course, B...
61                         New Practical Chinese Reader
63    New Practical Chinese Reader Textbook-6 (Chine...
60    New Practical Chinese Reader Textbook 5 (v. 5)...
Name: title, dtype: object


Our base test has worked! 
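
One detail worth noting: the slice `[:-5:-1]` yields four indices rather than five, and the top hit is the query book itself. The sketch below, on a made-up score array, contrasts that slice with one way of taking exactly `k` results; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical similarity scores; position 0 is the query itself.
scores = np.array([1.0, 0.0, 0.07, 0.21, 0.24, 0.5])

# The slice used above walks backwards from the end and stops before
# position -5, so it yields 4 indices.
top_slice = scores.argsort()[:-5:-1]

# An alternative that returns exactly k indices, highest score first.
k = 5
top_k = scores.argsort()[::-1][:k]

print(top_slice)  # [0 5 4 3]
print(top_k)      # [0 5 4 3 2]
```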

In [254]:
def recommend(book, df, cv, tt, tfidf_matrix, num_rec=5):
    '''
    Takes in book title to search, the corpus dataframe, the fitted count vectorizer,
    the fitted tfidf transformer, the tfidf matrix of the corpus and number of recommendations

    Prints recommendations
    '''
    book_text = cleanText(book)
    book_text = processText(book_text)

    book_vector = cv.transform([book_text])
    book_tfidf = tt.transform(book_vector)

    #tf-idf rows are L2-normalized, so the linear kernel gives cosine similarity
    cosine_scores = linear_kernel(book_tfidf, tfidf_matrix).flatten()
    #take the num_rec highest-scoring rows, best first
    related_docs_indices = cosine_scores.argsort()[::-1][:num_rec]

    print('recommending for ',book,'\n')

    print('recommendations: ')

    for i in related_docs_indices:

        print(df[['title','brand']].iloc[i])

In [255]:
recommend('Lord of the Rings', df_pr, cv, tt, tfidf ,num_rec=10)

recommending for  Lord of the Rings 

recommendations: 
title       The Lord of the Rings (4 Volumes)
brand    Visit Amazon's J. R. R. Tolkien Page
Name: 11908, dtype: object
title    Lord Of The Rings - One Volume Edition
brand      Visit Amazon's J. R. R. Tolkien Page
Name: 12091, dtype: object
title    The Lord Of The Rings and the Hobbit 4 Books C...
brand                  Visit Amazon's J. R. R. Tolken Page
Name: 11299, dtype: object
title    Fellowship of the Ring (Lord of the Rings Part 1)
brand                 Visit Amazon's J. R. R. Tolkien Page
Name: 12734, dtype: object
title    Hobbit and Lord of the Rings Trilogy - Boxed S...
brand                 Visit Amazon's J. R. R. Tolkien Page
Name: 11806, dtype: object
title    El Senor De Los Anillos / the Lord of the Ring...
brand                 Visit Amazon's J. R. R. Tolkien Page
Name: 5057, dtype: object
title    The Fellowship of the Ring (Book #1 The Lord o...
brand                             J.R.R. (Author); Tolkien
Name:

The title-based recommender system is working as expected!

## 5.3 Description-based Recommender

In [115]:
df_pr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16263 entries, 0 to 16262
Data columns (total 9 columns):
description       16263 non-null object
title             16263 non-null object
also_buy          16263 non-null object
brand             16263 non-null object
also_view         16263 non-null object
main_cat          16263 non-null object
asin              16263 non-null object
category          16263 non-null object
title_cleansed    16263 non-null object
dtypes: object(9)
memory usage: 1.1+ MB


It looks like each description is a list that has been stored as a string (a stringified list). <br> We will need to process these texts first.
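
The notebook handles this by running `cleanText` over the raw string. As an alternative sketch, a stringified list could be parsed back with `ast.literal_eval` and joined into plain text. This assumes the values look like Python list literals, which is a guess about the data; `list_string_to_text` is a hypothetical helper, not part of the notebook.

```python
from ast import literal_eval

def list_string_to_text(value):
    """Parse a stringified list such as "['a', 'b']" and join its items."""
    try:
        parsed = literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: fall back to the raw text.
        return str(value)
    if isinstance(parsed, (list, tuple)):
        return " ".join(str(item) for item in parsed)
    return str(parsed)

print(list_string_to_text("['First paragraph.', 'Second paragraph.']"))
# First paragraph. Second paragraph.
print(list_string_to_text("plain text here"))
# plain text here
```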

The description-based recommender is an enhanced version of the title-based recommender that draws on the other descriptive features available for each book. <br> These include description, category, main_cat, brand and title.

In [170]:
#we will lemmatize the tokens to their root word

def get_wordnet_pos(word):
    '''
    Takes in word

    Outputs the word's WordNet POS tag (v, a, n, r or s), or '' if the tag is not mapped
    '''
    #get first letter of the word's tag; the word is tagged on its own, without sentence context
    tag = nltk.pos_tag([word])[0][1][0].upper()

    #create simple dict to associate tag to wordnet attributes
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV,
                "C": wordnet.NOUN,
                "S": wordnet.ADJ_SAT}

    #fall back to '' for cases where tag is not within expected range
    return tag_dict.get(tag, '')

#initialize lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_words(words):

    #lemmatize the list of words provided, tagging each word only once
    lemmas = []
    for w in words:
        pos = get_wordnet_pos(w)
        lemmas.append(lemmatizer.lemmatize(w, pos) if pos else lemmatizer.lemmatize(w))
    return lemmas


In [None]:
#apply cleanText to description and category first to take care of list concatenation issue
df_pr['big_des'] = df_pr.title + " " + df_pr.brand + " " + df_pr.main_cat + " " + df_pr.category.apply(lambda x: cleanText(x)) + " " + df_pr.description.apply(lambda x: cleanText(x))

#additional cleansing and tokenizing with lemmatizer
df_pr.big_des = df_pr.big_des.apply(lambda x: cleanText(x))
df_pr.big_des = df_pr.big_des.apply(lambda x: processText(x))
#lemmatize each word token to its root word using the word's pos_tag
df_pr.big_des = df_pr.big_des.apply(lambda x: lemmatize_words(x))


In [None]:
#instantiate count vectorizer
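#min_df=1 keeps every term (same effect as 0) and also passes parameter validation in newer scikit-learn
cv1 = CountVectorizer(lowercase=False, preprocessor=dummy, tokenizer=dummy, min_df=1, max_df=0.1)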
count1= cv1.fit_transform(df_pr.big_des) 

#instantiate tfidf transformer
tt1 = TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf1 = tt1.fit_transform(count1)

In [256]:
recommend('Lord of the Rings',df_pr, cv1, tt1, tfidf1 , num_rec=10)

recommending for  Lord of the Rings 

recommendations: 
title    Good Years, From 1900 to the First Worlad War
brand                  Visit Amazon's Walter Lord Page
Name: 9594, dtype: object
title    Diana Gabaldon Lord John Series Complete Set [...
brand                   Visit Amazon's Diana Gabaldon Page
Name: 15659, dtype: object
title    El Senor De Los Anillos / the Lord of the Ring...
brand                 Visit Amazon's J. R. R. Tolkien Page
Name: 5057, dtype: object
title    LORD OF MISRULE[Lord of Misrule] BY Gordon, Ja...
brand                     Visit Amazon's Jaimy Gordon Page
Name: 13412, dtype: object
title    Tales From the Flat Earth: The Lords of Darkne...
brand                                           Tanith Lee
Name: 8785, dtype: object
title    Lich Lords (Role Aids/Advanced Dungeons and Dr...
brand                                         Lynn Sellers
Name: 15968, dtype: object
title    Lord of the Flies
brand      William Golding
Name: 15151, dtype: object
titl

Including product descriptions and other fields has clearly expanded the range of recommendations. <br> This is a much better version of the recommender, as it can also recommend books in the same category, by the same author, or even with similar descriptions.

In [257]:
df_pr.to_csv('data/pr_lemmatized.csv',index=False)