**Building a Function for Vector Search Recommendations**
---

**Load libraries and dependencies**

In [7]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv
from langchain.chains import LLMChain
from langchain_community.vectorstores import Chroma
from transformers import pipeline
import gradio as gr
from transformers import AutoModel, AutoTokenizer
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Load your OpenAI API key securely from configy.py
import configy
from openai import OpenAI

In [8]:
client = OpenAI(api_key=configy.OPENAI_AI_KEY)

**Load Clean CSV**

In [10]:
books = pd.read_csv("cleaned_books.csv", on_bad_lines="skip")

In [11]:
books.sample(3)

Unnamed: 0,isbn13,title,authors,categories,description,published_year,average_rating,num_pages,ratings_count
6120,9781596871328,Roger Zelazny's To Rule in Amber,John Gregory Betancourt,Fiction,As Oberon takes up the reins of leadership to ...,2005,3.75,310.0,730.0
4350,9780751531244,Spencerville,Nelson DeMille,Abused wives,After twenty-five years of working in the shad...,2001,3.66,564.0,49.0
4850,9780807612743,A Persian requiem,Sīmīn Dānishvar,Fiction,"Chronicles the life of Zari, the wife of a feu...",1992,3.89,279.0,48.0


**Filter out books whose description has fewer than 25 words. Descriptions that short rarely convey enough information to match well against prompts.**

In [13]:
books['description_count'] = books['description'].str.split().apply(len)
# .copy() gives an independent DataFrame, so adding columns later does not touch a slice
books = books[books['description_count'] >= 25].copy()
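
As a minimal sketch of the same filter outside pandas, the helper below counts whitespace-separated words and treats missing values as zero words. The names `word_count`, `keep_long_descriptions` and `min_words`, as well as the sample rows, are assumptions made for illustration and are not part of the notebook.

```python
def word_count(text) -> int:
    """Count whitespace-separated words; non-string values (e.g. NaN) count as 0."""
    if not isinstance(text, str):
        return 0
    return len(text.split())


def keep_long_descriptions(rows, min_words=25):
    """Return only the rows whose 'description' has at least min_words words."""
    return [row for row in rows if word_count(row.get("description")) >= min_words]


# Made-up sample rows, for illustration only.
sample_rows = [
    {"isbn13": 9780000000001, "description": "Too short to be useful."},
    {"isbn13": 9780000000002, "description": " ".join(["word"] * 30)},
    {"isbn13": 9780000000003, "description": float("nan")},
]

kept = keep_long_descriptions(sample_rows)
print([row["isbn13"] for row in kept])
```

The guard for non-string values matters because `.apply(len)` in the cell above would raise a `TypeError` if a description were missing; the cleaned CSV presumably contains no such rows.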

**Create a column that joins a unique identifier (isbn13) with the description. This matters because it lets us run the vector search and trace each result back to its book through the identifier.**

In [15]:
books.loc[:, "code_description"] = books[["isbn13", "description"]].astype(str).agg(" ".join, axis=1)

**To run a vector search, we take the code_description column from our books table and save it to a text file called code_description.txt.**

In [17]:
books["code_description"].to_csv(
    "code_description.txt",
    index=False,
    header=False
)
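
One detail worth illustrating: CSV writers using minimal quoting (the pandas default for `to_csv`, as far as I know) wrap a field in double quotes when it contains a comma or a quote character. The sketch below reproduces that behaviour with the standard `csv` module on made-up rows, which is an assumption-laden stand-in for the real export, and shows why stripping quotes from the first token recovers the identifier.

```python
import csv
import io

# Made-up rows, for illustration only: one plain line, one containing a comma.
rows = [
    "9780000000001 A plain description without special characters",
    "9780000000002 A description, with a comma in it",
]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
for row in rows:
    writer.writerow([row])  # one field per line, like a single-column export

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)

# Recover the identifier: take the first whitespace-separated token
# and strip any surrounding double quote.
isbns = [line.split()[0].strip('"') for line in lines]
print(isbns)
```

Only the second line comes out wrapped in double quotes, so any code that reads the identifier back from the file has to tolerate a leading quote.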

**We load the text file code_description.txt into memory as raw text.**

**Then we split that text at newlines (\n) into chunks with a target size of 500 characters. The splitter only breaks at the separator, so a description longer than 500 characters stays in one chunk, which is what the warnings under the cell report.**

**This lets us process and embed the text piece by piece instead of all at once.**

In [19]:
loader = TextLoader("code_description.txt", encoding="utf-8")
raw_documents = loader.load()

text_splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
    separator="\n"
)

documents = text_splitter.split_documents(raw_documents)

Created a chunk of size 1170, which is longer than the specified 500
Created a chunk of size 1216, which is longer than the specified 500
Created a chunk of size 962, which is longer than the specified 500
Created a chunk of size 845, which is longer than the specified 500
Created a chunk of size 881, which is longer than the specified 500
Created a chunk of size 1191, which is longer than the specified 500
Created a chunk of size 515, which is longer than the specified 500
Created a chunk of size 754, which is longer than the specified 500
Created a chunk of size 730, which is longer than the specified 500
Created a chunk of size 723, which is longer than the specified 500
Created a chunk of size 1267, which is longer than the specified 500
Created a chunk of size 683, which is longer than the specified 500
Created a chunk of size 555, which is longer than the specified 500
Created a chunk of size 523, which is longer than the specified 500
Created a chunk of size 789, which is longer
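
Because retrieval later assumes that every chunk begins with an ISBN, it can be worth checking that assumption before embedding. The sketch below is a plain-Python stand-in, not LangChain's implementation: `split_one_per_line` and `starts_with_isbn13` are hypothetical helpers, and the sample text is made up. My understanding is that `CharacterTextSplitter` may also merge short neighbouring lines into one chunk up to `chunk_size`, which is one way a chunk could end up holding more than one book.

```python
import re

# A 13-digit number at the start of the chunk, optionally preceded by a double quote.
ISBN13_PREFIX = re.compile(r'^"?\d{13}\b')


def split_one_per_line(text):
    """Split raw text into one chunk per non-empty line (no merging)."""
    return [line for line in text.splitlines() if line.strip()]


def starts_with_isbn13(chunk):
    """True if the chunk begins with a 13-digit number, optionally quoted."""
    return ISBN13_PREFIX.match(chunk) is not None


# Made-up text standing in for code_description.txt.
raw_text = (
    "9780000000001 A first description\n"
    '"9780000000002 A second description, with a comma"\n'
    "\n"
    "continuation line without an identifier\n"
)

chunks = split_one_per_line(raw_text)
bad_chunks = [c for c in chunks if not starts_with_isbn13(c)]
print(len(chunks), len(bad_chunks))
```

For the notebook's `documents` list, the same check could be applied to each item's `page_content`.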

**We take all the text chunks (documents) and store them in a Chroma database so we can search them later using embeddings (numerical representations of meaning).**

**We create those embeddings with OpenAIEmbeddings(), which turns each chunk into a vector that captures its content.**

In [21]:
db_books = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(api_key=configy.OPENAI_AI_KEY)
)

Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
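
To make "search using embeddings" concrete, here is a minimal sketch of ranking stored vectors by closeness to a query vector. It uses cosine similarity on made-up three-dimensional vectors; the metric, the `top_k` helper and the labels are assumptions for illustration, and the distance function Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=2):
    """Return the k keys whose vectors are most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda key: cosine_similarity(query_vec, stored[key]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional vectors; real embeddings have far more dimensions.
stored_vectors = {
    "superhero comic": [0.9, 0.1, 0.0],
    "cookbook": [0.0, 0.2, 0.9],
    "space opera": [0.6, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

print(top_k(query_vector, stored_vectors))
```

The query vector points in nearly the same direction as the "superhero comic" vector, so that entry ranks first, followed by "space opera".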


**We define a function called retrieve_recommendations that takes a search query (like “books about jungles”) and returns the top 5 matching books from our books table.**

**Here’s what happens step by step:**

    1) We search our Chroma database for text chunks similar to the query (similarity_search).

    2) For each matching chunk, we extract the ISBN number (assuming it’s the first word in the text).

    3) We turn those ISBNs into a list.

    4) We filter the books table to keep only rows whose ISBN is in that list.

    5) We return the first top_k rows as recommendations.

**This way, when we call the function with a search string, it gives us the most relevant books from our dataset.**

In [23]:
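def retrieve_recommendations(
    query: str,
    top_k: int = 5
) -> pd.DataFrame:
    """Return up to top_k books whose descriptions best match the query."""
    # Find the top_k chunks closest to the query in the vector store.
    recs = db_books.similarity_search(query, k=top_k)

    books_list = []
    for rec in recs:
        # Each chunk starts with the ISBN; strip quotes added when the text file was written.
        isbn_str = rec.page_content.split()[0].strip("'").strip('"')
        books_list.append(int(isbn_str))

    # Keep only the matching rows (returned in the table's order, not by similarity).
    return books[books["isbn13"].isin(books_list)].head(top_k)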
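
Filtering with `isin` returns rows in the table's original order, so the closest match is not necessarily listed first. If ranked output is wanted, one possible approach is sketched below in plain Python on made-up data. `extract_isbns` and `rows_in_rank_order` are hypothetical helpers, and the list of dicts only stands in for the books table.

```python
def extract_isbns(chunks):
    """Pull the leading ISBN from each chunk, keeping first-seen order."""
    seen = []
    for chunk in chunks:
        tokens = chunk.split()
        if not tokens:
            continue
        token = tokens[0].strip("'\"")
        # Skip chunks that do not start with a number, and skip duplicates.
        if token.isdigit() and int(token) not in seen:
            seen.append(int(token))
    return seen


def rows_in_rank_order(rows, ranked_isbns):
    """Return rows ordered by their position in ranked_isbns."""
    by_isbn = {row["isbn13"]: row for row in rows}
    return [by_isbn[isbn] for isbn in ranked_isbns if isbn in by_isbn]


# Made-up data, for illustration only; chunks are listed best match first.
rows = [
    {"isbn13": 9780000000001, "title": "Alpha"},
    {"isbn13": 9780000000002, "title": "Beta"},
    {"isbn13": 9780000000003, "title": "Gamma"},
]
chunks = [
    '"9780000000003 Closest match, with a comma"',
    "9780000000001 Second closest match",
    "9780000000003 Duplicate hit for the same book",
]

ranked = rows_in_rank_order(rows, extract_isbns(chunks))
print([row["title"] for row in ranked])
```

Here "Gamma" comes before "Alpha" because its chunk was the closest match, even though it appears later in the table.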


**Call the function**

In [25]:
results = retrieve_recommendations("A book about superheroes")
print(results)

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given


             isbn13                       title  \
1863  9780345484109       Armageddon's Children   
4404  9780756605926  The DC Comics Encyclopedia   
5505  9781401209414  Challengers of the Unknown   
5514  9781401212636                    Superman   
5751  9781563893698                         JLA   

                                                authors  \
1863                                       Terry Brooks   
4404                                       Phil Jimenez   
5505                     Howard Chaykin;Michelle Madsen   
5514  Kurt Busiek;Fabian Nicieza;Pete Woods;Len Wein...   
5751                                     Grant Morrison   

                   categories  \
1863                  Fiction   
4404  Antiques & Collectibles   
5505  Comics & Graphic Novels   
5514         Juvenile Fiction   
5751         Juvenile Fiction   

                                            description  published_year  \
1863  In a futuristic world in which civilization is...         

In [26]:
books.to_csv("cleaned_books.csv", index=False, encoding="utf-8")