# Book Recommendation with Retrieval Augmented Generation -- Vector Store

In the ever-evolving landscape of book discovery, traditional recommendation systems often fall short. Large language models (LLMs) offer a promising new approach. By leveraging their ability to process vast amounts of text data, LLMs can delve into the intricacies of different genres, writing styles, and reader preferences. This newfound depth holds the potential to revolutionize book recommendations, leading readers not just to familiar tropes, but to truly personalized literary journeys.

One useful technique for LLM-powered book recommendation is Retrieval-Augmented Generation (RAG). RAG pairs the LLM with a retrieval step: a search over a dataset of book information finds the entries most similar to the user's request, and those entries are passed to the LLM as context. With the retrieved descriptions and genres in hand, the LLM can ground its suggestions in actual titles from the dataset and tailor them to the reader's stated preferences, rather than relying only on simple similarity-based matching.

In this post, we demonstrate how to build a simple vector store and retrieve the documents that are semantically relevant to a query.
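
Before turning to FAISS, it may help to see the idea in miniature. The sketch below is a toy stand-in, not the code used in this post: `embed` is a made-up bag-of-words "embedding" (a real model such as BERT produces dense vectors), and `ToyVectorStore` is a hypothetical class that ranks stored documents by cosine similarity to the query.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Stores (vector, document) pairs and returns the k most similar documents."""

    def __init__(self, documents):
        self.entries = [(embed(doc), doc) for doc in documents]

    def similarity_search(self, query, k=4):
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


books = [
    "A young wizard attends a school of magic",
    "A detective investigates a murder in London",
    "A dragon guards a castle in a land of magic",
]
store = ToyVectorStore(books)
top = store.similarity_search("magic school for a wizard", k=2)
print(top)
```

The two magic-themed descriptions rank above the detective story. A real vector store follows the same pattern, with learned embeddings and an index that avoids comparing the query against every document one by one.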

## Data Set

We will use a dataset from Kaggle.

Source: https://www.kaggle.com/datasets/ishikajohari/best-books-10k-multi-genre-data/

## Setup 

Vector store: FAISS

Embedding model: `bert-base-uncased` from Hugging Face ([model card](https://huggingface.co/google-bert/bert-base-uncased))

In [2]:
from dotenv import dotenv_values
import os
import requests
import pandas as pd

from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI,HuggingFaceHub
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma, FAISS
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.prompts import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings

# Load the API token from .env; a token can be created on the Hugging Face website
config = dotenv_values(".env")  
HUGGINGFACEHUB_API_TOKEN = config['HUGGINGFACEHUB_API_TOKEN']
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN


In [3]:
df = pd.read_csv('../data/goodreads_data.csv').drop(['Unnamed: 0'],axis=1)
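# Intended to drop books without genres. Note that len(x['Genres']) is the number of rows
# in the column (not a per-book count), so this filter keeps every row.
df = df.assign( genre_len = lambda x:len(x['Genres']))
df = df[df['genre_len']>0]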
df = df[['Book','Genres','Description']]
df.to_csv('book_genre.csv',index=False)
df.head()

| | Book | Genres | Description |
|---|---|---|---|
| 0 | To Kill a Mockingbird | ['Classics', 'Fiction', 'Historical Fiction', ... | The unforgettable novel of a childhood in a sl... |
| 1 | Harry Potter and the Philosopher’s Stone (Harr... | ['Fantasy', 'Fiction', 'Young Adult', 'Magic',... | Harry Potter thinks he is an ordinary boy - un... |
| 2 | Pride and Prejudice | ['Classics', 'Fiction', 'Romance', 'Historical... | Since its immediate success in 1813, Pride and... |
| 3 | The Diary of a Young Girl | ['Classics', 'Nonfiction', 'History', 'Biograp... | Discovered in the attic in which she spent the... |
| 4 | Animal Farm | ['Classics', 'Fiction', 'Dystopia', 'Fantasy',... | Librarian's note: There is an Alternate Cover ... |


In [4]:
df.shape

(10000, 3)

In [5]:
#data loader
loader = CSVLoader(file_path="book_genre.csv",encoding='utf-8')
data = loader.load()

#data transformers
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [6]:
len(texts)

10013
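
Note that `len(texts)` is 10013 although the CSV has 10,000 rows: a row whose text is longer than `chunk_size` can be split into several chunks. The sketch below is a simplified stand-in for that behavior, not LangChain's implementation. It assumes the splitter cuts on blank lines (`"\n\n"`) and greedily merges pieces up to `chunk_size`; the function name `split_text` and the sample rows are made up for illustration.

```python
def split_text(text, chunk_size=2000, separator="\n\n"):
    """Greedy splitter: cut on `separator`, then merge pieces up to `chunk_size`."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits (or it is the first piece of a new chunk).
            current = candidate
        else:
            # Adding the piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


short_row = "Book: A\nGenres: ['Fiction']\nDescription: short text"
long_row = "Description: " + "x" * 1500 + "\n\n" + "y" * 1500

rows = [short_row, long_row]
chunks = [c for row in rows for c in split_text(row)]
print(len(rows), len(chunks))  # 2 3
```

Two rows become three chunks because only the long row contains a blank line at which it can be cut.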

In [9]:
texts[1]

Document(page_content="Book: Harry Potter and the Philosopher’s Stone (Harry Potter, #1)\nGenres: ['Fantasy', 'Fiction', 'Young Adult', 'Magic', 'Childrens', 'Middle Grade', 'Classics']\nDescription: Harry Potter thinks he is an ordinary boy - until he is rescued by an owl, taken to Hogwarts School of Witchcraft and Wizardry, learns to play Quidditch and does battle in a deadly duel. The Reason ... HARRY POTTER IS A WIZARD!", metadata={'source': 'book_genre.csv', 'row': 1})

In [10]:
embeddings = HuggingFaceEmbeddings(model_name='bert-base-uncased')

No sentence-transformers model found with name bert-base-uncased. Creating a new one with MEAN pooling.
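
The message above says that, because `bert-base-uncased` is not a sentence-transformers model, it is wrapped with mean pooling: BERT produces one vector per token, and these are averaged into a single vector per document. A minimal sketch of that averaging, using made-up 3-dimensional token vectors (the real model's vectors have 768 dimensions, and real implementations also exclude padding tokens from the average):

```python
def mean_pool(token_vectors):
    """Average token vectors element-wise into a single fixed-size vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# One made-up vector per token of a three-token sentence.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```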


The embedding process can take a while, so in this example we embed only the first 1,000 documents.

In [11]:
%%time
docsearch = FAISS.from_documents(texts[0:1000], embeddings)
retriever=docsearch.as_retriever()
# docsearch.save_local("faiss_store1")

CPU times: total: 58min 47s
Wall time: 6min 3s


With the vector store built, we can perform a semantic search on it.

In [12]:
%%time
# `k` sets the number of results to return (4 is the default)
ans = docsearch.similarity_search("I want some fantasy book", k=4)
for detail in ans:
    print(detail.page_content.split('\n')[0])
    print(detail.page_content.split('\n')[1])
    print('\n')

Book: Eleanor & Park
Genres: ['Young Adult', 'Romance', 'Contemporary', 'Fiction', 'Realistic Fiction', 'Audiobook', 'Teen']


Book: Are You There God? It's Me, Margaret
Genres: ['Young Adult', 'Fiction', 'Childrens', 'Classics', 'Middle Grade', 'Coming Of Age', 'Realistic Fiction']


Book: The Paper Bag Princess
Genres: ['Picture Books', 'Childrens', 'Fantasy', 'Fiction', 'Dragons', 'Classics', 'Fairy Tales']


Book: The Velveteen Rabbit
Genres: ['Classics', 'Childrens', 'Fiction', 'Picture Books', 'Fantasy', 'Animals', 'Young Adult']


CPU times: total: 578 ms
Wall time: 69 ms


## Data Preprocessing

In the next section, we apply some Natural Language Processing (NLP) preprocessing to the book descriptions and compare the retrieval results with those from the raw text.

In [13]:
import pandas as pd
import numpy as np
import nltk
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
import re
from bs4 import BeautifulSoup
import csv

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [14]:
def remove_url(review):
    try:
        res = re.sub(r'http\S+', '', review)
        res = BeautifulSoup(res,"html.parser").get_text()
    except:
        return review
    return res

def remove_non_alphabetic(review):
    # keep only letters and whitespace; non-string input (e.g. NaN) is returned unchanged
    try:
        return re.sub(r"[^a-zA-Z\s]+", "", review)
    except TypeError:
        return review

def remove_extra_spaces(review):
    try:
        return ' '.join(review.split())
    except:
        return review

def contractionfunction(s):
    try:
        s = re.sub(r"won\'t", "will not", s)
        s = re.sub(r"can\'t", "can not", s)
        s = re.sub(r"n\'t", " not", s)
        s = re.sub(r"\'re", " are", s)
        s = re.sub(r"\'s", " is", s)
        s = re.sub(r"\'d", " would", s)
        s = re.sub(r"\'ll", " will", s)
        s = re.sub(r"\'t", " not", s)
        s = re.sub(r"\'ve", " have", s)
        s = re.sub(r"\'m", " am", s)
        return s
    except:
        return s

def lemmatize_text(text, tokenizer, lemmatizer):
    try:
        return " ".join([lemmatizer.lemmatize(w) for w in tokenizer.tokenize(text)])
    except:
        return text
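
The order in which these helpers are applied matters: the contraction rules look for apostrophes, so they only match if they run before non-alphabetic characters are stripped. The sketch below illustrates this with simplified, hypothetical versions of two of the helpers (`expand_contractions` covers only a subset of the rules above).

```python
import re


def expand_contractions(s):
    """Expand a few common English contractions (a subset of the rules above)."""
    s = re.sub(r"won't", "will not", s)
    s = re.sub(r"can't", "can not", s)
    s = re.sub(r"n't", " not", s)
    s = re.sub(r"'re", " are", s)
    return s


def strip_non_alphabetic(s):
    """Drop every character that is not a letter or whitespace."""
    return re.sub(r"[^a-zA-Z\s]+", "", s)


text = "they won't stop, and we can't wait"

expand_first = strip_non_alphabetic(expand_contractions(text))
strip_first = expand_contractions(strip_non_alphabetic(text))

print(expand_first)  # they will not stop and we can not wait
print(strip_first)   # they wont stop and we cant wait
```

When the apostrophes are removed first, "won't" becomes "wont" and no contraction rule matches it anymore.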


In [15]:
df['Preprocessed_description'] = df['Description']

In [19]:
# data cleaning
# convert to lower case
df['Preprocessed_description'] = df['Preprocessed_description'].str.lower()

# remove html and urls
df['Preprocessed_description'] = df['Preprocessed_description'].apply(lambda x: remove_url(x))

# remove non-alphabetical characters
df['Preprocessed_description'] = df['Preprocessed_description'].apply(lambda x: remove_non_alphabetic(x))

# remove extra spaces
df['Preprocessed_description'] = df['Preprocessed_description'].apply(lambda x: remove_extra_spaces(x))

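# expand contractions
# (apostrophes were already stripped by remove_non_alphabetic above, so no pattern matches here;
# run this step before remove_non_alphabetic if contractions should be expanded)
df['Preprocessed_description'] = df['Preprocessed_description'].apply(lambda x: contractionfunction(x))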


# print the average length of review before and after the data cleaning
print(f"Average length of review before and after data cleaning")
print(f'{df.Description.str.len().mean()}, {df.Preprocessed_description.str.len().mean()}')

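# preprocessing
# remove stop words
# Note: in pandas 2.0 and later, str.replace treats the pattern as a literal string unless
# regex=True is passed, so as written these calls appear to leave the text unchanged
# (the average length printed below is still 915.54).
pattern = r'\b(?:{})\b'.format('|'.join(stopwords.words('english')))
df['Preprocessed_description'] = df['Preprocessed_description'].str.replace(pattern, "").str.replace(r'\s+', ' ')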

# perform lemmatization
tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

df['lemmatized_description'] = df['Preprocessed_description'].apply(lambda x: lemmatize_text(x, tokenizer, lemmatizer))


print(f"Average length of review before and after preprocessing")
print(f'{df.Preprocessed_description.str.len().mean()}, {df.lemmatized_description.str.len().mean()}')

Average length of review before and after data cleaning
956.1368537740602, 915.5401592260405
Average length of review before and after preprocessing
915.5401592260405, 900.0581477375794
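
To make the stop-word step concrete, here is a small sketch that applies the same word-boundary pattern with Python's `re` module. The stop-word list is a tiny, hypothetical subset chosen for illustration (the notebook uses NLTK's English list), and `remove_stop_words` is a made-up helper name.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = ["the", "of", "a", "in", "and"]

# \b anchors each alternative to word boundaries, so "the" inside "southern" is not touched.
pattern = re.compile(r"\b(?:{})\b".format("|".join(STOP_WORDS)))


def remove_stop_words(text):
    """Delete whole-word stop words, then collapse the leftover whitespace."""
    without = pattern.sub("", text)
    return re.sub(r"\s+", " ", without).strip()


sample = "the unforgettable novel of a childhood in a sleepy southern town"
print(remove_stop_words(sample))
# unforgettable novel childhood sleepy southern town
```

With pandas, the same pattern can be applied to a whole column through `Series.str.replace(pattern, "", regex=True)`.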


In [21]:
df.head()

| | Book | Genres | Description | Preprocessed_description | lemmatized_description |
|---|---|---|---|---|---|
| 0 | To Kill a Mockingbird | ['Classics', 'Fiction', 'Historical Fiction', ... | The unforgettable novel of a childhood in a sl... | the unforgettable novel of a childhood in a sl... | the unforgettable novel of a childhood in a sl... |
| 1 | Harry Potter and the Philosopher’s Stone (Harr... | ['Fantasy', 'Fiction', 'Young Adult', 'Magic',... | Harry Potter thinks he is an ordinary boy - un... | harry potter thinks he is an ordinary boy unti... | harry potter think he is an ordinary boy until... |
| 2 | Pride and Prejudice | ['Classics', 'Fiction', 'Romance', 'Historical... | Since its immediate success in 1813, Pride and... | since its immediate success in pride and preju... | since it immediate success in pride and prejud... |
| 3 | The Diary of a Young Girl | ['Classics', 'Nonfiction', 'History', 'Biograp... | Discovered in the attic in which she spent the... | discovered in the attic in which she spent the... | discovered in the attic in which she spent the... |
| 4 | Animal Farm | ['Classics', 'Fiction', 'Dystopia', 'Fantasy',... | Librarian's note: There is an Alternate Cover ... | librarians note there is an alternate cover ed... | librarian note there is an alternate cover edi... |


In [41]:
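# Keep the cleaned text as 'Description' and the lemmatized text as 'lemmatized_review'
df = df[['Book','Genres','Preprocessed_description','lemmatized_description']].rename(
    columns={'Preprocessed_description':'Description','lemmatized_description':'lemmatized_review'})
df.to_csv('book_genre_lemmatized_review.csv',index=False)
df.head()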

| | Book | Genres | Description | lemmatized_review |
|---|---|---|---|---|
| 0 | To Kill a Mockingbird | ['Classics', 'Fiction', 'Historical Fiction', ... | the unforgettable novel of a childhood in a sl... | the unforgettable novel of a childhood in a sl... |
| 1 | Harry Potter and the Philosopher’s Stone (Harr... | ['Fantasy', 'Fiction', 'Young Adult', 'Magic',... | harry potter thinks he is an ordinary boy unti... | harry potter think he is an ordinary boy until... |
| 2 | Pride and Prejudice | ['Classics', 'Fiction', 'Romance', 'Historical... | since its immediate success in pride and preju... | since it immediate success in pride and prejud... |
| 3 | The Diary of a Young Girl | ['Classics', 'Nonfiction', 'History', 'Biograp... | discovered in the attic in which she spent the... | discovered in the attic in which she spent the... |
| 4 | Animal Farm | ['Classics', 'Fiction', 'Dystopia', 'Fantasy',... | librarians note there is an alternate cover ed... | librarian note there is an alternate cover edi... |


In [42]:
#data loader
loader = CSVLoader('book_genre_lemmatized_review.csv',encoding='utf-8')
data = loader.load()

#data transformers
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
texts = text_splitter.split_documents(data)

In [48]:
texts[0]

Document(page_content="Book: To Kill a Mockingbird\nGenres: ['Classics', 'Fiction', 'Historical Fiction', 'School', 'Literature', 'Young Adult', 'Historical']\nDescription: the unforgettable novel of a childhood in a sleepy southern town and the crisis of conscience that rocked it to kill a mockingbird became both an instant bestseller and a critical success when it was first published in it went on to win the pulitzer prize in and was later made into an academy awardwinning film also a classiccompassionate dramatic and deeply moving to kill a mockingbird takes readers to the roots of human behavior to innocence and experience kindness and cruelty love and hatred humor and pathos now with over million copies in print and translated into forty languages this regional story by a young alabama woman claims universal appeal harper lee always considered her book to be a simple love story today it is regarded as a masterpiece of american literature\nlemmatized_review: the unforgettable novel

In [43]:
%%time
#Fill Vector DB
docsearch2 = FAISS.from_documents(texts[0:200], embeddings)
# docsearch = Chroma.from_documents(texts, embeddings)
retriever2 = docsearch2.as_retriever()

CPU times: total: 14min 6s
Wall time: 1min 26s


In [44]:
%%time
# `k` sets the number of results to return (4 is the default)
ans = docsearch2.similarity_search("I want something other than harry potter", k=4)
print(ans[0].page_content)

Book: Harry Potter and the Goblet of Fire (Harry Potter, #4)
Genres: ['Fantasy', 'Young Adult', 'Fiction', 'Magic', 'Childrens', 'Middle Grade', 'Audiobook']
Description: it is the summer holidays and soon harry potter will be starting his fourth year at hogwarts school of witchcraft and wizardry harry is counting the days there are new spells to be learnt more quidditch to be played and hogwarts castle to continue exploring but harry needs to be careful there are unexpected dangers lurking
lemmatized_review: it is the summer holiday and soon harry potter will be starting his fourth year at hogwarts school of witchcraft and wizardry harry is counting the day there are new spell to be learnt more quidditch to be played and hogwarts castle to continue exploring but harry need to be careful there are unexpected danger lurking
CPU times: total: 438 ms
Wall time: 46 ms


In [52]:
for detail in ans:
    print(detail.page_content.split('\n')[0])

Book: Harry Potter and the Goblet of Fire (Harry Potter, #4)
Book: Harry Potter and the Philosopher’s Stone (Harry Potter, #1)
Book: Harry Potter and the Prisoner of Azkaban (Harry Potter, #3)
Book: Winnie-the-Pooh (Winnie-the-Pooh #1)


In [61]:
def compare_docsearch(q):
    # `k` sets the number of results to return (4 is the default)
    ans = docsearch.similarity_search(q, k=4)
    ans2 = docsearch2.similarity_search(q, k=4)
    
    print(f"original docsearch: ")
    for detail in ans:
        print(detail.page_content.split('\n')[0])
        print(detail.page_content.split('\n')[1])
        print('\n')

    print('----------------------')
    print(f"preprocessed docsearch: ")
    for detail in ans2:
        print(detail.page_content.split('\n')[0])
        print(detail.page_content.split('\n')[1])
        print('\n')
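
Reading two printed lists side by side gets tedious once there are many queries. One possible way to summarize the comparison is the Jaccard overlap between the two sets of retrieved titles. The helper `title_overlap` and the placeholder titles below are hypothetical and not part of the original notebook.

```python
def title_overlap(results_a, results_b):
    """Jaccard overlap between two lists of retrieved titles (1.0 = identical sets)."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder titles standing in for the first line of each retrieved document.
original_hits = ["Book A", "Book B", "Book C", "Book D"]
preprocessed_hits = ["Book B", "Book E", "Book A", "Book F"]

overlap = title_overlap(original_hits, preprocessed_hits)
print(round(overlap, 3))  # 0.333
```

A value near 1 means preprocessing barely changed what was retrieved, while a value near 0 means the two stores returned almost entirely different books. Because the two stores in this post index a different number of documents, any such score should be read with that in mind.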


In [64]:
compare_docsearch('magic fantasy kid ')

original docsearch: 
Book: The Velveteen Rabbit
Genres: ['Classics', 'Childrens', 'Fiction', 'Picture Books', 'Fantasy', 'Animals', 'Young Adult']


Book: Charlie and the Chocolate Factory (Charlie Bucket, #1)
Genres: ['Childrens', 'Fiction', 'Fantasy', 'Classics', 'Young Adult', 'Middle Grade', 'Humor']


Book: The Complete Grimm's Fairy Tales
Genres: ['Classics', 'Fantasy', 'Fiction', 'Fairy Tales', 'Short Stories', 'Childrens', 'Literature']


Book: The Cat in the Hat (The Cat in the Hat, #1)
Genres: ['Childrens', 'Picture Books', 'Classics', 'Fiction', 'Poetry', 'Fantasy', 'Humor']


----------------------
preprocessed docsearch: 
Book: Charlie and the Chocolate Factory (Charlie Bucket, #1)
Genres: ['Childrens', 'Fiction', 'Fantasy', 'Classics', 'Young Adult', 'Middle Grade', 'Humor']


Book: Where the Wild Things Are
Genres: ['Childrens', 'Picture Books', 'Fiction', 'Classics', 'Fantasy', 'Adventure', 'Young Adult']


Book: The Velveteen Rabbit
Genres: ['Classics', 'Childrens', 'F