**Building a Function for Vector Search Recommendations**
---

**Load libraries and dependencies**

In [6]:
import os

import gradio as gr
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from dotenv import load_dotenv
from langchain.chains import LLMChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from transformers import AutoModel, AutoTokenizer, pipeline

**Load Clean CSV**

In [25]:
books = pd.read_csv("cleaned_books.csv", on_bad_lines="skip")

In [27]:
books.sample(3)

Unnamed: 0,isbn13,title,authors,categories,description,published_year,average_rating,num_pages,ratings_count
2003,9780375701962,The Moviegoer,Walker Percy,Fiction,Kate's desperate struggle to maintain her sani...,1998,3.69,242.0,20796.0
2808,9780446699075,Robert Ludlum's (TM) The Arctic Event,Robert Ludlum;James H. Cobb,Fiction,"On a remote island in the Canadian Arctic, res...",2007,4.01,390.0,4299.0
6092,9781594731402,How to be a Perfect Stranger,Stuart M. Matlins;Arthur J. Magida,Reference,A guide to the rituals and celebrations of the...,2006,3.96,403.0,114.0


**Filter out books whose description has fewer than 25 words. Such short descriptions rarely convey enough information to be matched well against a search query.**

In [29]:
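# Count the whitespace-separated words in each description.
books['description_count'] = books['description'].str.split().apply(len)
# Keep descriptions of at least 25 words; .copy() avoids a SettingWithCopyWarning when columns are added later.
books = books[books['description_count'] >= 25].copy()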
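The word-count rule can be sketched in plain Python to show what "fewer than 25 words" means here. This is a minimal illustration, not the notebook's pandas code: the `word_count` helper, the `MIN_WORDS` name, and the sample descriptions are all made up for the example. Treating a missing description as zero words is an assumption of the sketch.

```python
MIN_WORDS = 25  # same threshold as the filter above


def word_count(description) -> int:
    """Count whitespace-separated words; a missing description counts as 0."""
    if not isinstance(description, str):
        return 0
    return len(description.split())


# Made-up sample descriptions keyed by a made-up identifier.
samples = {
    "short": "A guide to rituals and celebrations.",
    "long": " ".join(["word"] * 30),
    "missing": None,
}

# Keep only the identifiers whose description reaches the threshold.
kept = [key for key, text in samples.items() if word_count(text) >= MIN_WORDS]
```

Only the 30-word sample passes; the six-word sample and the missing one are dropped.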

**Create a column that joins a unique identifier (isbn13) and the description. This is important as it allows us to run the vector search and trace each result back to its book using the uid.**

In [31]:
books.loc[:, "code_description"] = books[["isbn13", "description"]].astype(str).agg(" ".join, axis=1)
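To make the "trace it back" idea concrete, here is a small sketch of the round trip: join the identifier and description with a space, then recover the identifier from the first token. The helper names and the sample ISBN and text are hypothetical and only serve the illustration.

```python
def make_code_description(isbn13: int, description: str) -> str:
    """Join the identifier and the description with a single space."""
    return f"{isbn13} {description}"


def split_code_description(code_description: str) -> tuple[int, str]:
    """Recover the identifier (first token) and the remaining description."""
    isbn_str, _, description = code_description.partition(" ")
    return int(isbn_str), description


# Made-up ISBN and description, for illustration only.
joined = make_code_description(9780000000001, "A story about a remote island.")
isbn, description = split_code_description(joined)
```

Because the identifier never contains a space, splitting on the first space is enough to get it back unchanged.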

**To do a vector search we need to take the code_description column from our books table and save it into a text file called code_description.txt**

In [34]:
books["code_description"].to_csv(
    "code_description.txt",
    index=False,
    header=False
)
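One detail worth knowing about this export: `to_csv` applies minimal CSV quoting by default, so a line that contains a comma or a quote character is wrapped in double quotes. The sketch below reproduces that behaviour with the standard `csv` module rather than pandas, using made-up ISBNs and descriptions, to show why the first token of a line may start with a `"` that has to be stripped before converting it to an integer.

```python
import csv
import io

# Made-up rows in the "isbn13 description" format used above.
rows = [
    "9780000000001 A plain description without commas",
    "9780000000002 On a remote island, researchers find a wreck",
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
for row in rows:
    writer.writerow([row])  # one field per line, like a single-column export

written_lines = buffer.getvalue().splitlines()

# The second line is wrapped in double quotes because it contains a comma.
first_tokens = [line.split()[0] for line in written_lines]

# Stripping the quote character recovers a clean identifier.
cleaned_tokens = [token.strip('"') for token in first_tokens]
```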

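**We load the text file code_description.txt into memory as raw text.**

**Then we split that text at newlines (\n), with a target chunk size of 500 characters. The splitter only breaks at the separator, so a line longer than 500 characters is kept whole, which is what the "Created a chunk of size ..." warnings below report. Neighbouring lines that together fit within the limit may end up in the same chunk.**

**This lets us embed and search the text piece by piece instead of all at once.**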

In [36]:
loader = TextLoader("code_description.txt", encoding="utf-8")

raw_documents = loader.load()

text_splitter = CharacterTextSplitter(
    chunk_size=500,             # target chunk length in characters (must be > 0)
    chunk_overlap=0,            # no characters shared between consecutive chunks
    separator="\n"              # break only at line breaks
)

documents = text_splitter.split_documents(raw_documents)


Created a chunk of size 1170, which is longer than the specified 500
Created a chunk of size 1216, which is longer than the specified 500
Created a chunk of size 962, which is longer than the specified 500
Created a chunk of size 845, which is longer than the specified 500
Created a chunk of size 881, which is longer than the specified 500
Created a chunk of size 1191, which is longer than the specified 500
Created a chunk of size 515, which is longer than the specified 500
Created a chunk of size 754, which is longer than the specified 500
Created a chunk of size 730, which is longer than the specified 500
Created a chunk of size 723, which is longer than the specified 500
Created a chunk of size 1267, which is longer than the specified 500
Created a chunk of size 683, which is longer than the specified 500
Created a chunk of size 555, which is longer than the specified 500
Created a chunk of size 523, which is longer than the specified 500
Created a chunk of size 789, which is longer
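The warnings above follow from how separator-based splitting works. The sketch below is a simplified, hypothetical re-implementation of the idea (split on the separator, then greedily pack pieces up to the size limit). It is not LangChain's actual code, and `split_and_merge` is a name invented here. It shows two effects: a line longer than the limit is emitted whole, and short neighbouring lines can be packed into one chunk, in which case only the first identifier sits at the start of the chunk.

```python
def split_and_merge(text: str, chunk_size: int, separator: str = "\n") -> list[str]:
    """Split on the separator, then greedily pack pieces up to chunk_size."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for piece in text.split(separator):
        if not piece:
            continue
        # Cost of appending this piece, including a joining separator.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The current chunk is full: emit it and start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


# Made-up lines in "identifier description" form.
lines = [
    "111 " + "a" * 20,  # 24 characters
    "222 " + "b" * 20,  # 24 characters
    "333 " + "c" * 80,  # 84 characters, longer than the limit used below
]
chunks = split_and_merge("\n".join(lines), chunk_size=60)

# Only the first identifier of each chunk is visible as its first token.
first_tokens = [chunk.split()[0] for chunk in chunks]
```

With a limit of 60, the first two lines share a chunk, so identifier 222 is no longer a first token, while the 84-character line becomes an over-long chunk of its own. If the same packing applies to the real splitter, it may be worth checking that `len(documents)` matches the number of rows in `books`.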

**Insert your OpenAI API key here**

In [28]:
import configy
from openai import OpenAI

client = OpenAI(api_key=configy.OPENAI_AI_KEY)
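The cell above reads the key from a module named `configy`, which we take to be a local settings file. A common alternative is to keep the key out of the notebook and read it from an environment variable. The helper below is a hypothetical sketch of that approach (`get_openai_api_key` is not part of any library). As far as we know, `OpenAIEmbeddings()` also looks for the key in the `OPENAI_API_KEY` environment variable when none is passed explicitly, so setting that variable would cover both uses.

```python
import os


def get_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable, or fail loudly."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return key


# Possible usage (assumption: the key has been exported in the shell or a .env file):
# client = OpenAI(api_key=get_openai_api_key())
```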

**We take all the text chunks (documents) and store them in a Chroma database so we can search them later using embeddings (numerical representations of meaning).**

**We create those embeddings with OpenAIEmbeddings(), which turns each chunk into a vector that captures its content.**

In [40]:
db_books = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings())

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
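To give an intuition for what "searching by embeddings" means, the sketch below ranks three hand-made toy vectors against a query vector using cosine similarity. The vectors and labels are invented for the illustration; real embeddings have many more dimensions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings"; not produced by any model.
toy_vectors = {
    "superhero comic": [0.9, 0.1, 0.0],
    "caped crusader story": [0.8, 0.2, 0.1],
    "gardening manual": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

# Sort the labels from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
```

The two superhero-like vectors point in nearly the same direction as the query and rank first, while the gardening vector ranks last.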


**We define a function called retrieve_recommendations that takes a search query (like “books about jungles”) and returns the top 5 matching books from our books table.**

**Here’s what happens step by step:**

    1) We search our Chroma database for text chunks similar to the query (similarity_search).

    2) For each matching chunk, we extract the ISBN number (assuming it’s the first word in the text).

    3) We turn those ISBNs into a list.

    4) We filter the books table to keep only rows whose ISBN is in that list.

    5) We return up to top_k of those rows as recommendations.

**This way, when we call the function with a search string, it gives us the most relevant books from our dataset.**

In [44]:
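def retrieve_recommendations(
    query: str,
    top_k: int = 5
) -> pd.DataFrame:
    # Fetch the top_k chunks most similar to the query (previously fixed at 5).
    recs = db_books.similarity_search(query, k=top_k)

    books_list = []

    for rec in recs:
        # The ISBN is the first token of each chunk; strip any CSV quoting around it.
        isbn_str = rec.page_content.split()[0].strip("'").strip('"')
        books_list.append(int(isbn_str))

    # Keep the matched books; rows come back in table order, not similarity order.
    return books[books["isbn13"].isin(books_list)].head(top_k)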
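Because the function filters with `isin`, the returned rows follow the order of the books table rather than the order of the search results. If the similarity ranking matters, one option is to look each ISBN up in ranked order. The sketch below shows the idea in plain Python with a list of dicts standing in for the table. The `order_by_rank` helper and the ranking itself are hypothetical, and the three titles are taken from the sample output in this notebook.

```python
def order_by_rank(rows: list[dict], ranked_isbns: list[int], top_k: int) -> list[dict]:
    """Return the rows whose isbn13 appears in ranked_isbns, in ranked order."""
    by_isbn = {row["isbn13"]: row for row in rows}
    seen: set[int] = set()
    ordered: list[dict] = []
    for isbn in ranked_isbns:
        # Skip identifiers that are unknown or already returned.
        if isbn in by_isbn and isbn not in seen:
            seen.add(isbn)
            ordered.append(by_isbn[isbn])
    return ordered[:top_k]


# A small stand-in for the books table, in table order.
rows = [
    {"isbn13": 9780756605926, "title": "The DC Comics Encyclopedia"},
    {"isbn13": 9781401212636, "title": "Superman"},
    {"isbn13": 9781563893698, "title": "JLA"},
]

# Hypothetical search ranking, most similar first, with one duplicate.
ranked_isbns = [9781563893698, 9781401212636, 9781563893698, 9780756605926]

titles = [row["title"] for row in order_by_rank(rows, ranked_isbns, top_k=2)]
```

Here the result starts with "JLA" because it is ranked first, even though it is the last row of the table.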

**Call the function**

In [46]:
retrieve_recommendations("A book about superheroes")

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


Unnamed: 0,isbn13,title,authors,categories,description,published_year,average_rating,num_pages,ratings_count,description_count,code_description
1863,9780345484109,Armageddon's Children,Terry Brooks,Fiction,In a futuristic world in which civilization is...,2007,4.1,404.0,608.0,46,9780345484109 In a futuristic world in which c...
4404,9780756605926,The DC Comics Encyclopedia,Phil Jimenez,Antiques & Collectibles,A richly illustrated reference provides a defi...,2004,4.29,352.0,123.0,41,9780756605926 A richly illustrated reference p...
5505,9781401209414,Challengers of the Unknown,Howard Chaykin;Michelle Madsen,Comics & Graphic Novels,Five extraordinary people with seemingly nothi...,2006,2.86,144.0,43.0,81,9781401209414 Five extraordinary people with s...
5514,9781401212636,Superman,Kurt Busiek;Fabian Nicieza;Pete Woods;Len Wein...,Juvenile Fiction,A story featuring the Man of Steel being pursu...,2007,3.18,144.0,136.0,35,9781401212636 A story featuring the Man of Ste...
5751,9781563893698,JLA,Grant Morrison,Juvenile Fiction,When the Justice League of America sets up hea...,1997,3.97,93.0,3601.0,28,9781563893698 When the Justice League of Ameri...


In [48]:
# Save the filtered table, including the new columns, back over the input file.
books.to_csv("cleaned_books.csv", index=False, encoding="utf-8")