# About 

This notebook is a use case for similarity search.

Specifically, I build a vector database to search for books with similar "Subjects".
- The vector database I use is `Chroma`.
- The input file `processed_pg_catalog.csv` was generated by `00-book-database.ipynb`.

# Settings

### Packages

In [1]:
import os
import pandas as pd
from tqdm import tqdm


# llm model
from langchain.llms import Ollama

# vector database
import chromadb
from chromadb.utils import embedding_functions

# embeddings
from langchain.embeddings import HuggingFaceEmbeddings

### Variables

In [2]:
#----------------#
# variables you may need to change
#----------------#
llm_model_id = "llama2"

# collection name
collection_name = 'catalog'

# embeddings model
embeddings_model_id = "all-MiniLM-L6-v2"

### Directories

In [3]:
main_Dir = "../"

#----------------#
# data dir
#----------------#
data_Dir = os.path.join(main_Dir,"data")
raw_data_Dir = os.path.join(data_Dir,"raw")
processed_data_Dir = os.path.join(data_Dir,"processed")

subjects_chroma_Dir = os.path.join(data_Dir, "subjects_chroma")

embedding_Dir = os.path.join(data_Dir, embeddings_model_id)

#----------------#
# make dirs
#----------------#
for f in [data_Dir, raw_data_Dir, processed_data_Dir, subjects_chroma_Dir, embedding_Dir]:
    os.makedirs(f, exist_ok=True)

# Build a Vector Database

### Read data

In [4]:
filename = "processed_pg_catalog.csv"
filepath = os.path.join(processed_data_Dir, filename)
df = pd.read_csv(filepath, low_memory=False)
df.head(2)

Unnamed: 0,ID,Book,Authors,Subjects,Bookshelves
0,1,The Declaration of Independence of the United ...,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",Politics
1,1,The Declaration of Independence of the United ...,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",American Revolutionary War


In [5]:
# collapse the Bookshelves values so that each book occupies a single row
# (note: .sum() concatenates the shelf names with no separator between them)
df_group = df.groupby(['ID', 'Book', 'Authors', 'Subjects'])['Bookshelves'].sum().reset_index()

# convert the ID column to string
df_group['ID'] = df_group['ID'].astype(str)

# have a look
df_group.head(2)

Unnamed: 0,ID,Book,Authors,Subjects,Bookshelves
0,1,The Declaration of Independence of the United ...,"Jefferson, Thomas, 1743-1826","United States -- History -- Revolution, 1775-1...",PoliticsAmerican Revolutionary WarUnited State...
1,2,The United States Bill of Rights\r\nThe Ten Or...,United States,Civil rights -- United States -- Sources; Unit...,PoliticsAmerican Revolutionary WarUnited State...
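
The `Bookshelves` column above runs the shelf names together ("PoliticsAmerican Revolutionary War...") because `.sum()` on strings is plain concatenation. A minimal sketch of one alternative, joining with an explicit separator, is shown below. The toy DataFrame and the `"; "` separator are assumptions made for illustration, and the sketch assumes there are no missing values in the column.

```python
import pandas as pd

# Toy rows mimicking the catalog layout (values are illustrative only)
toy = pd.DataFrame({
    "ID": [1, 1, 2],
    "Book": ["Book A", "Book A", "Book B"],
    "Bookshelves": ["Politics", "American Revolutionary War", "Politics"],
})

# Join the shelf names of each book with a separator instead of summing them
toy_group = (
    toy.groupby(["ID", "Book"])["Bookshelves"]
    .agg("; ".join)
    .reset_index()
)
print(toy_group)
```

The same `.agg("; ".join)` call could replace `.sum()` in the cell above if readable shelf names are wanted in the metadata.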


### Define an embedding function 

In [6]:
# create an embedding function 
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=embeddings_model_id
)

### Define a chroma db client 

In [7]:
# create a ChromaDB client
client = chromadb.PersistentClient(path=subjects_chroma_Dir)

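# delete a collection (delete_collection returns None, so do not reassign `client`)
#client.delete_collection(collection_name)

# reset the client if needed (this wipes all data and must be enabled in the client settings)
#client.reset()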

### Instantiate a ChromaDB collection
A collection is the object that stores your embedded documents along with any associated metadata. 

In [8]:
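# create collection (raises an error if it already exists;
# client.get_or_create_collection avoids this when re-running the notebook)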
collection = client.create_collection(
    name=collection_name,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"}
)
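
The `"hnsw:space": "cosine"` setting asks Chroma to rank neighbours by cosine distance, which is one minus the cosine similarity of two vectors. As a rough illustration of what that number means (not Chroma's internal implementation), here is a pure-Python sketch. The function name `cosine_distance` and the sample vectors are made up for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing in the same direction are at distance ~0,
# orthogonal vectors are at distance ~1.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Smaller values in the `distances` field of a query therefore mean closer matches.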

### Add documents and metadata to collection 

In [10]:
# documents: a list of "Subjects" strings
documents = df_group["Subjects"].to_list()

# ids: a list of unique ids (must be strings)
ids = df_group["ID"].astype(str).to_list()

# metadata: a list of dictionaries containing the other columns
dict_meta = df_group[["Book","Authors","Bookshelves"]].to_dict(orient='records')


In [11]:
# add documents to the collection (embeddings are computed by embedding_func)
collection.add(
    documents=documents, # Subjects
    ids=ids,
    metadatas=dict_meta,
)
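
A single `collection.add` call worked here, but for larger catalogs it may be safer to add the rows in chunks, since a client can limit how many records it accepts per call. The helper below is a sketch of such chunking. `iter_batches`, `example_ids`, and the batch size are hypothetical and not part of the original notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-in for the `ids` list built above
example_ids = [str(i) for i in range(10)]
batches = list(iter_batches(example_ids, 4))
print([len(b) for b in batches])

# With the notebook's variables, the same slicing could drive collection.add:
# for start in range(0, len(ids), batch_size):
#     stop = start + batch_size
#     collection.add(
#         documents=documents[start:stop],
#         ids=ids[start:stop],
#         metadatas=dict_meta[start:stop],
#     )
```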

In [12]:
# count the embeddings added to the collection
collection.count()

11314

# Query the Vector Database


### For example, find books related to "Alaska"

In [13]:
# the subject to search for
selection_book = "Alaska"

# form the question
selection_question = f'Find me books related to {selection_book}'
print(selection_question)

Find me books related to Alaska


In [14]:
# query the collection for the 3 closest "Subjects"
query_results = collection.query(
    query_texts=[selection_question],
    include=["documents", "distances", "metadatas"],
    n_results=3,
)

In [15]:
# extract ids
query_results_ids = query_results["ids"][0]
print(query_results_ids)

['5233', '24392', '6017']
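
The `[0]` index is needed because every field of the query result is a list of lists, with one inner list per query text. As a sketch of how the remaining fields could be read alongside the ids, the snippet below flattens a hand-written dictionary of the same shape into one row per hit. The ids, titles, and distances in `example_results` are placeholders, not real output, and `flatten_first_query` is a hypothetical helper.

```python
# Hand-written dictionary with the same nested shape as a query result;
# every value here is a made-up placeholder.
example_results = {
    "ids": [["id-1", "id-2"]],
    "documents": [["Subject one", "Subject two"]],
    "distances": [[0.25, 0.5]],
    "metadatas": [[{"Book": "Title one"}, {"Book": "Title two"}]],
}


def flatten_first_query(results):
    """Return one dict per hit for the first query text in the result."""
    rows = []
    for doc_id, doc, dist, meta in zip(
        results["ids"][0],
        results["documents"][0],
        results["distances"][0],
        results["metadatas"][0],
    ):
        # Merge the metadata keys into the row next to id, document, distance
        rows.append({"id": doc_id, "document": doc, "distance": dist, **meta})
    return rows


rows = flatten_first_query(example_results)
for row in rows:
    print(row)
```

Applied to `query_results`, this would give the matched Subjects, distances, and book metadata without going back to the DataFrame.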


In [16]:
# filter the previous dataframe (df_group) by `query_results_ids`
# (isin keeps df_group's row order, not the ranking order of the query)
query_results_df = df_group[df_group['ID'].isin(query_results_ids)]
query_results_df

Unnamed: 0,ID,Book,Authors,Subjects,Bookshelves
1447,5233,The Iron Trail,"Beach, Rex, 1877-1949",Western stories; Alaska -- Fiction,Movie Books
1746,6017,The Silver Horde,"Beach, Rex, 1877-1949",Alaska -- Fiction,"Movie BooksBestsellers, American, 1895-1923"
7408,24392,Cat and Mouse,"Williams, Ralph, 1914-1959; Van Dongen, H. R.,...",Science fiction; Alaska -- Fiction; Hunting st...,Science Fiction
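
Note that the table lists the books as 5233, 6017, 24392, while the query returned them as 5233, 24392, 6017. Filtering with `isin` keeps the DataFrame's own row order. If the ranking matters, one option is to index by `ID` and select the ids in query order, as sketched below on a toy DataFrame. The toy data and the name `ranked_ids` are illustrative, and the approach assumes each `ID` appears only once.

```python
import pandas as pd

# Toy frame standing in for df_group (IDs stored as strings, as in the notebook)
toy = pd.DataFrame({"ID": ["10", "20", "30"], "Book": ["A", "B", "C"]})

# Hypothetical ranking returned by a query, best match first
ranked_ids = ["30", "10"]

# .loc with a list of labels returns the rows in the order of that list
ranked_df = toy.set_index("ID").loc[ranked_ids].reset_index()
print(ranked_df)
```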
