# Content-Based Recommender

> Author: [Yalim Demirkesen](github.com/demirkeseny) 

> Date: March 2019

As you might recall from the previous notebook, I used a module called *Goodreads* to get detailed information about each book. For this notebook the most useful field is the book description: almost every book comes with a couple of paragraphs of description.

As an example, here is the description of *Harry Potter and the Philosopher's Stone*:

> *Harry Potter's life is miserable. His parents are dead and he's stuck with his heartless relatives, who force him to live in a tiny closet under the stairs. But his fortune changes when he receives a letter that tells him the truth about himself: he's a wizard. A mysterious visitor rescues him from his relatives and takes him to his new home, Hogwarts School of Witchcraft and Wizardry.  After a lifetime of bottling up his magical powers, Harry finally feels like a normal kid. But even within the Wizarding community, he is special. He is the boy who lived: the only person to have ever survived a killing curse inflicted by the evil Lord Voldemort, who launched a brutal takeover of the Wizarding world, only to vanish after failing to kill Harry.  Though Harry's first year at Hogwarts is the best of his life, not everything is perfect. There is a dangerous secret object hidden within the castle walls, and Harry believes it's his responsibility to prevent it from falling into evil hands. But doing so will bring him into contact with forces more terrifying than he ever could have imagined.  Full of sympathetic characters, wildly imaginative situations, and countless exciting details, the first installment in the series assembles an unforgettable magical world and sets the stage for many high-stakes adventures to come.*

These detailed descriptions are essential for building an NLP model that analyzes the word frequencies in each description and then measures how similar the descriptions are to one another. Based on those word counts I was able to suggest books.
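
To make the idea of similarity from word counts concrete, here is a minimal sketch that uses only the standard library. The three one-line descriptions and the helper names `word_counts` and `cosine` are hypothetical and are not part of the notebook, which relies on scikit-learn for the real model.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical one-line descriptions, not taken from the dataset
wizard_a = "a young wizard attends a school of magic"
wizard_b = "the wizard returns to the school of magic"
pirate = "pirates sail in search of buried treasure"

sim_wizards = cosine(word_counts(wizard_a), word_counts(wizard_b))
sim_mixed = cosine(word_counts(wizard_a), word_counts(pirate))
print(sim_wizards > sim_mixed)  # True: the two wizard descriptions share more words
```

Two descriptions that share many words end up with a higher score than two that share almost none, which is the property the recommender relies on.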

In [1]:
# Necessary libraries:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.corpus import wordnet
from ast import literal_eval
import scipy.stats as stats

In [2]:
# Loading the extended book dataset:
books = pd.read_csv('./data/books_extended.csv', encoding='utf-8-sig')

In [3]:
# Taking only the useful columns:
books = books[['book_id','authors','original_title','title_x','language_code', 
                'description']]

To run the NLP model, I keep only the books with English descriptions. The dataset also contains other languages, but I cannot build recommendations for them here because each language would require a different model.

In [4]:
books.language_code.value_counts()

eng      6341
en-US    2070
en-GB     257
ara        64
en-CA      58
fre        25
ind        21
spa        20
ger        13
jpn         7
per         7
por         6
pol         6
en          4
nor         3
dan         3
ita         2
fil         2
swe         1
vie         1
rus         1
nl          1
tur         1
rum         1
mul         1
Name: language_code, dtype: int64

In [5]:
# Listing all the language codes that denote English:
eng_book = ['eng', 'en-US', 'en-GB', 'en-CA', 'en']

In [6]:
# Filtering the English books (.copy() avoids a SettingWithCopyWarning when columns are modified later)
books = books[books['language_code'].isin(eng_book)].copy()
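
The list of English codes above was written by hand from the value counts. As a sketch of an alternative, the hypothetical helper `is_english` below treats `eng`, `en`, and any regional variant such as `en-US` as English. It assumes every code is a string, which may not hold if the column has missing values.

```python
def is_english(code):
    """Return True for 'eng', 'en', or regional variants such as 'en-US'."""
    primary = code.lower().split('-')[0]
    return primary in {'en', 'eng'}

# A few of the codes that appear in the value counts above
codes = ['eng', 'en-US', 'en-GB', 'ara', 'en-CA', 'fre', 'en', 'nl']
english_codes = [c for c in codes if is_english(c)]
print(english_codes)  # ['eng', 'en-US', 'en-GB', 'en-CA', 'en']
```

For this dataset the helper selects exactly the five codes listed in `eng_book`; with pandas it could presumably be applied through `books['language_code'].map(is_english)`.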

In [7]:
books.head()

| | book_id | authors | original_title | title_x | language_code | description |
|---|---|---|---|---|---|---|
| 0 | 2767052 | Suzanne Collins | The Hunger Games | The Hunger Games (The Hunger Games, #1) | eng | Could you survive on your own, in the wild, wi... |
| 1 | 3 | J.K. Rowling, Mary GrandPré | Harry Potter and the Philosopher's Stone | Harry Potter and the Sorcerer's Stone (Harry P... | eng | Harry Potter's life is miserable. His parents ... |
| 2 | 41865 | Stephenie Meyer | Twilight | Twilight (Twilight, #1) | en-US | \<b>About three things I was absolutely positiv... |
| 3 | 2657 | Harper Lee | To Kill a Mockingbird | To Kill a Mockingbird | eng | The unforgettable novel of a childhood in a sl... |
| 4 | 4671 | F. Scott Fitzgerald | The Great Gatsby | The Great Gatsby | eng | Alternate Cover Edition ISBN: 0743273567 (ISBN... |


In [8]:
# Checking the number of rows of the remaining books:
books.shape

(8730, 6)

Below is the description of one book. As we can see, it contains characters that are not part of the description itself but are HTML tags that format the text. We need to delete them, because otherwise the model would treat them like any other word.

In [9]:
# Printing the description of the second book:
print(books['description'].iloc[1])

Harry Potter's life is miserable. His parents are dead and he's stuck with his heartless relatives, who force him to live in a tiny closet under the stairs. But his fortune changes when he receives a letter that tells him the truth about himself: he's a wizard. A mysterious visitor rescues him from his relatives and takes him to his new home, Hogwarts School of Witchcraft and Wizardry.<br /><br />After a lifetime of bottling up his magical powers, Harry finally feels like a normal kid. But even within the Wizarding community, he is special. He is the boy who lived: the only person to have ever survived a killing curse inflicted by the evil Lord Voldemort, who launched a brutal takeover of the Wizarding world, only to vanish after failing to kill Harry.<br /><br />Though Harry's first year at Hogwarts is the best of his life, not everything is perfect. There is a dangerous secret object hidden within the castle walls, and Harry believes it's his responsibility to prevent it from falling

In [11]:
# Replacing the HTML formatting tags with spaces (regex=False treats each tag as a literal string)
html_tags = ['<br />', '<b>', '</b>', '<i>', '</i>', '<p>', '</p>', '<blockquote>']
for tag in html_tags:
    books['description'] = books['description'].str.replace(tag, ' ', regex=False)
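
The cell above removes a fixed list of tags, so any tag outside that list (for example a closing `</blockquote>`) stays in the text. A more general approach, shown here as a sketch and not as the notebook's method, is a single regular expression that matches anything shaped like a tag. The names `TAG_PATTERN` and `strip_tags` are hypothetical, and unlike the cell above this version also collapses repeated whitespace.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br />, <i>, </blockquote>
TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_tags(text):
    """Replace every HTML-like tag with a space and collapse repeated whitespace."""
    without_tags = TAG_PATTERN.sub(' ', text)
    return re.sub(r'\s+', ' ', without_tags).strip()

sample = "he's a wizard.<br /><br />After a <i>lifetime</i> of bottling up"
cleaned = strip_tags(sample)
print(cleaned)  # he's a wizard. After a lifetime of bottling up
```

The pattern is deliberately naive: it would also remove legitimate text that happens to sit between `<` and `>`, so it is worth spot-checking a few descriptions before relying on it.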

After cleaning, we drop the rows with a missing description, since the model has nothing to work with for those books.

In [12]:
books.dropna(subset=['description'], inplace = True)

In [13]:
# Checking the missing values in all of the columns:
books.isnull().sum()

book_id             0
authors             0
original_title    489
title_x             0
language_code       0
description         0
dtype: int64

It is fine that `original_title` has empty cells: for those books, `title_x` already holds the original title.

In [14]:
# Resetting the indexes after eliminating the missing data
books.reset_index(drop=True, inplace=True)

To cover multiple scenarios, I run the TF-IDF vectorizer on n-grams of 1, 2, 3 and 4 words.

In [15]:
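# Creating a TF-IDF vectorizer over the descriptions.
# min_df=1 keeps every term, just as min_df=0 did; newer scikit-learn releases reject an integer min_df of 0.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 4), min_df=1, stop_words='english')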

In [16]:
tfidf_matrix = tf.fit_transform(books['description'])

In [17]:
tfidf_matrix.shape

(8671, 1832045)
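
The matrix has so many columns because every distinct sequence of one to four consecutive words becomes its own feature. The sketch below shows how such n-grams can be enumerated for a short token list. The function `word_ngrams` is a hypothetical illustration, not scikit-learn's internal code, and the tokens are assumed to have had stop words removed already.

```python
def word_ngrams(tokens, min_n=1, max_n=4):
    """Return every run of min_n to max_n consecutive tokens, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

# Tokens left from a fragment of the Harry Potter description once stop words are dropped
tokens = ['boy', 'survived', 'killing', 'curse', 'evil']
grams = word_ngrams(tokens)
print(len(grams))   # 14 = 5 unigrams + 4 bigrams + 3 trigrams + 2 four-grams
print(grams[-2:])   # ['boy survived killing curse', 'survived killing curse evil']
```

Five tokens already yield 14 features, so a few thousand multi-paragraph descriptions can plausibly produce well over a million distinct n-grams.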

In [18]:
# linear_kernel computes dot products; TfidfVectorizer L2-normalizes each row by default,
# so these dot products are the cosine similarities between the descriptions
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
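
Using `linear_kernel` for cosine similarity works because the dot product of two unit-length vectors equals the cosine of the angle between them. The following sketch checks this on two small hand-picked vectors using only the standard library; the vectors and helper names are illustrative assumptions.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))
print(math.isclose(cos_uv, dot_normalized))  # True
```

If the vectorizer were configured without L2 normalization, the plain dot product would no longer be a cosine similarity and `cosine_similarity` would be the safer call.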

In [19]:
# Mapping each description to its row index
book_desc = books['description']
indices = pd.Series(books.index, index=book_desc)

In [20]:
# Returns the topN books whose descriptions are most similar to the first title containing book_title
def recommender(book_title, topN):
    # regex=False matches book_title as a literal substring rather than a regular expression
    matches = books.loc[books['title_x'].str.contains(book_title, regex=False), 'title_x']
    if matches.empty:
        return None
    print(matches.iloc[0])
    # The index was reset above, so the index label is also the row position in cosine_sim
    rowN = matches.index[0]
    # Sorting by similarity and skipping the first entry, which is normally the book itself
    scores = sorted(enumerate(cosine_sim[rowN]), key=lambda x: x[1], reverse=True)[1:topN + 1]
    book_no = [cell[0] for cell in scores]
    recommendations = books.iloc[book_no][['authors', 'title_x', 'description']]
    return recommendations.reset_index(drop=True)
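
The ranking step in `recommender` sorts every similarity score and drops the first result on the assumption that it is the query book. A possible alternative, sketched here with the hypothetical helper `top_n_similar` and a made-up similarity row, excludes the query by its index and uses `heapq.nlargest` so that the full list never has to be sorted.

```python
import heapq

def top_n_similar(similarities, query_index, n):
    """Return (index, score) pairs for the n highest scores, excluding the query itself."""
    candidates = ((i, s) for i, s in enumerate(similarities) if i != query_index)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Made-up similarity row for book 0; its similarity to itself is 1.0
row = [1.0, 0.12, 0.40, 0.05, 0.33]
top = top_n_similar(row, query_index=0, n=3)
print(top)  # [(2, 0.4), (4, 0.33), (1, 0.12)]
```

Excluding by index keeps the query out of the results even when another book has an identical description and therefore also scores 1.0.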

Below you can find the top 10 recommended books for `Treasure Island`:

In [21]:
treasure9 = recommender('Treasure', 10)
treasure9.columns = ['AUTHOR','BOOK TITLE', 'DESCRIPTION']
treasure9

Treasure Island


| | AUTHOR | BOOK TITLE | DESCRIPTION |
|---|---|---|---|
| 0 | Enid Blyton | Five on a Treasure Island (Famous Five, #1) | The very first Famous Five adventure, featurin... |
| 1 | Robert Louis Stevenson | The Black Arrow | From the beloved author of Treasure Island ... |
| 2 | Alexandre Dumas, Robin Buss | The Count of Monte Cristo | In 1815 Edmond Dantès, a young and successful ... |
| 3 | Susan Cooper | The Dark Is Rising Sequence (The Dark Is Risi... | "When the Dark comes rising, six shall turn... |
| 4 | Arthur Conan Doyle | The Complete Sherlock Holmes | A study in scarlet -- The sign of four -- Adve... |
| 5 | Megan Whalen Turner | The Thief (The Queen's Thief, #1) | The king's scholar, the magus, believes he kno... |
| 6 | Anne Rivers Siddons | Low Country | Caroline Venable has everything her Southern h... |
| 7 | Nora Roberts | The Reef | A marine archaeologist and a salvager join fo... |
| 8 | Clive Cussler | Inca Gold (Dirk Pitt, #12) | Nearly five centuries ago, a fleet of boats la... |
| 9 | Dave Barry, Ridley Pearson, Greg Call | Peter and the Starcatchers (Peter and the Star... | Orphan Peter and his mates are dispatched to a... |
