# Project Overview

## Objective
This project demonstrates how to use ChromaDB as a vector database for semantic similarity search on the OpenBookQA dataset. The goal is to show how embedding-based retrieval can support question-answering tasks.

## Key Features
- Vector database creation and management using ChromaDB
- Semantic embedding and similarity search with cosine distance
- Integration with HuggingFace datasets (OpenBookQA)
- Performance evaluation of retrieval accuracy

## Technologies Used
- **ChromaDB**: Vector database for storing and querying embeddings
- **HuggingFace Datasets**: For loading the OpenBookQA dataset
- **Python**: Primary programming language
- **Sentence Transformers**: The all-MiniLM-L6-v2 model generates the text embeddings, loaded through ChromaDB's default ONNX embedding function

## Dataset
The OpenBookQA dataset from AllenAI contains science questions paired with supporting common knowledge facts. This project uses the fact1 attribute as the knowledge base for semantic search.

In [1]:
# install
%pip install chromadb numpy datasets


## **Loading a Dataset**
In this project, I use the [**OpenBookQA**](https://arxiv.org/abs/1809.02789) dataset, available from [HuggingFace](https://huggingface.co/datasets/allenai/openbookqa).

Dataset description, from HuggingFace:

> OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of question-answering dataset modeled after open book exams for assessing human understanding of a subject.

In [2]:
# Get the OpenBookQA dataset from HuggingFace
from datasets import load_dataset

dataset = load_dataset("allenai/openbookqa", "additional", split="train")

# Filter the dataset to only include questions with a fact1 (supporting common knowledge)
dataset = dataset.filter(lambda x: x["fact1"] != "")

print("Number of questions with fact1: ", len(dataset))


Number of questions with fact1:  4957


## **Creating a Collection**

Create a collection from the rows of the OpenBookQA dataset. To keep the embedding step short, only the first 1,000 rows are stored in the collection. Each row's `fact1` string is treated as one document, and **cosine distance** is used as the distance function on the embedding space.
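As a rough illustration of what the cosine setting measures, the sketch below computes cosine distance as one minus cosine similarity, which is how Chroma defines its `cosine` space. The helper name `cosine_distance` and the toy vectors are assumptions made for this example; they are not real embeddings and not part of the notebook.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (hypothetical, not real embeddings)
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print("same direction:", same_direction)
print("orthogonal:", orthogonal)
```

Vectors pointing in the same direction give a distance near 0 and orthogonal vectors give a distance of 1, so smaller distances in the query results below mean closer semantic matches.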

In [3]:
# Import Chroma and instantiate a client. The default Chroma client is ephemeral, meaning it will not save to disk.
import chromadb

client = chromadb.Client()

In [4]:
# Create a new Chroma collection to store the supporting common knowledge (fact1).
collection = client.create_collection(
    name="openbookqa_fact1",
    metadata={"hnsw:space": "cosine"}  # use cosine distance in the embedding space
)

In [5]:
# Embed and store the first 1000 "fact1" attributes in the training set. It will take about two minutes.
collection.add(
    ids=dataset["id"][:1000],
    documents=dataset["fact1"][:1000],
    metadatas=[{"type": "knowledge"} for _ in range(1000)],
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 61.2MiB/s]
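If more than 1,000 rows were stored, one option would be to add them in smaller chunks. The following is a minimal sketch of that idea, not the notebook's method: `add_in_batches` and `RecordingCollection` are hypothetical names, and the stand-in class only mimics the keyword arguments passed to `collection.add` above so the example runs without ChromaDB.

```python
def add_in_batches(collection, ids, documents, batch_size=250):
    """Add documents in fixed-size chunks and return the number of batches sent."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    batches = 0
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        collection.add(
            ids=batch_ids,
            documents=documents[start:start + batch_size],
            metadatas=[{"type": "knowledge"} for _ in batch_ids],
        )
        batches += 1
    return batches

class RecordingCollection:
    """Hypothetical stand-in that records batch sizes instead of embedding text."""
    def __init__(self):
        self.batch_sizes = []

    def add(self, ids, documents, metadatas):
        self.batch_sizes.append(len(ids))

fake = RecordingCollection()
n_batches = add_in_batches(
    fake,
    ids=[str(i) for i in range(1000)],
    documents=["placeholder fact"] * 1000,
    batch_size=300,
)
print(n_batches, fake.batch_sizes)
```

With a real collection, the same function could be called with `collection`, `dataset["id"][:1000]`, and `dataset["fact1"][:1000]`.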


## **Querying the Collection**


## Semantic Similarity Search Experiments

### Example 1: Most Similar Document Retrieval

This experiment demonstrates querying the vector database to find the most semantically similar document. Using the `question_stem` attribute from the first query (i.e., `dataset["question_stem"][0]`), we retrieve the document ID whose `fact1` (common knowledge) has the highest cosine similarity to the query.

In [6]:
# 1. Define the query
query_index = 0
query_text = dataset["question_stem"][query_index]

# 2. Query the collection
results = collection.query(
    query_texts=[query_text],
    n_results=10
)

ids = results['ids'][0]
distances = results['distances'][0]

# 3. Create the 'combined' list
combined = list(zip(ids, distances))

# 4. Sort by distance, then alphabetically by ID for consistent ranking
combined.sort(key=lambda x: (x[1], x[0]))

# 5. Get the answer
q1_ans = combined[0][0]
print("Q1 Answer:", q1_ans)

Q1 Answer: 7-822
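The tie-breaking sort used above can be seen in isolation with a small example. The IDs and distances below are made up for illustration; two of them share a distance on purpose so that the ID decides their order.

```python
# Hypothetical (id, distance) pairs; "9-12" and "1-05" share the same distance
pairs = [("9-12", 0.31), ("7-40", 0.25), ("1-05", 0.31), ("3-77", 0.40)]

# Sort by distance first, then by ID string to break ties deterministically
ranked = sorted(pairs, key=lambda p: (p[1], p[0]))
ranked_ids = [doc_id for doc_id, _ in ranked]

print(ranked_ids)
```

Because the key is a tuple, Python compares distances first and falls back to the ID string only when two distances are equal.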


### Example 2: Similarity Distance Calculation

For the query from Example 1, we read the cosine distance between the query and the top-ranked document, which quantifies how semantically close they are. The distance is rounded to four decimal places.

In [19]:
# Access the distance from the first item in the sorted list (index 1 of the tuple)
# combined[0] looks like: ('7-822', 0.12345...)
q2_distance = combined[0][1]

# Round off to four decimal places
e2_ans = round(q2_distance, 4)

print(e2_ans)


0.2935


### Example 3: Top-K Retrieval (3rd Most Similar)

Extending the previous experiment, we read the third most similar document from the same sorted result list. Because the query in Example 1 requested `n_results=10`, no second query is needed.

In [20]:
# Q3: Third most similar ID (Index 2)
e3_ans = combined[2][0]
print(e3_ans)

9-69


### Example 4: Distance for 3rd Ranked Result

We calculate the cosine distance for the third most similar document from Example 3, providing insight into how similarity scores change across the ranking.

In [21]:
# Q4: Distance of the third most similar
e4_ans = round(combined[2][1], 4)
print(e4_ans)

0.3069


### Examples 5-8: Alternative Query Testing

We repeat Examples 1-4 with a different query (`dataset["question_stem"][3]`) to see how the retrieved documents and distances change across inputs.

In [10]:
# 1. Define the query (4th query is index 3)
query_index_4 = 3
query_text_4 = dataset["question_stem"][query_index_4]

# 2. Query the collection
results_4 = collection.query(
    query_texts=[query_text_4],
    n_results=10
)

ids_4 = results_4['ids'][0]
distances_4 = results_4['distances'][0]
combined_4 = list(zip(ids_4, distances_4))

# 3. Apply Tie-breaking sort
combined_4.sort(key=lambda x: (x[1], x[0]))

# Q5 Answer: Most similar ID
print(combined_4[0][0])

9-869


In [11]:
# Q6 Answer: Distance of most similar
print(round(combined_4[0][1], 4))


0.4712


In [12]:
# Q7 Answer: Third most similar ID
print(combined_4[2][0])

1284


In [13]:
# Q8 Answer: Distance of third most similar
print(round(combined_4[2][1], 4))

0.4958


## Performance Evaluation

### Retrieval Accuracy Assessment

To evaluate retrieval accuracy, we run a batch query with the first 100 questions in the dataset (`dataset["question_stem"][:100]`). Each fact was stored under its row's `id`, so a query counts as correct when the top-ranked document's ID equals the ID of the row the question came from. In other words, the metric checks whether a question retrieves its own supporting `fact1` ahead of the other stored facts.
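The scoring rule described above can be sketched as a small function. This is an illustrative helper under stated assumptions, not the notebook's code: the name `hits_at_k` and the sample IDs are hypothetical, and the `k` parameter is an added generalization, where `k=1` corresponds to the top-1 check used in this evaluation.

```python
def hits_at_k(ranked_ids_per_query, target_ids, k=1):
    """Count queries whose target ID appears among the first k ranked IDs."""
    hits = 0
    for ranked_ids, target in zip(ranked_ids_per_query, target_ids):
        if target in ranked_ids[:k]:
            hits += 1
    return hits

# Hypothetical ranked results for three queries, best match first
ranked_ids_per_query = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "i"],
]
target_ids = ["a", "e", "z"]

top1_hits = hits_at_k(ranked_ids_per_query, target_ids, k=1)
top3_hits = hits_at_k(ranked_ids_per_query, target_ids, k=3)
print(top1_hits, top3_hits)
```

In this toy data only the first query finds its target at rank 1, while the second finds it within the top 3 and the third never finds it.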

In [14]:
# Batch query for the first 100 questions
num_queries = 100
queries = dataset["question_stem"][:num_queries]
target_ids = dataset["id"][:num_queries]

# Query for the top 5 results per question; only the top-ranked one is scored below
results_batch = collection.query(
    query_texts=queries,
    n_results=5
)

correct_count = 0
correct_distances = []

for i in range(num_queries):
    # Get results for specific query i
    current_ids = results_batch['ids'][i]
    current_dists = results_batch['distances'][i]

    # Sort by distance, then by ID, so that ties are ranked consistently
    current_combined = list(zip(current_ids, current_dists))
    current_combined.sort(key=lambda x: (x[1], x[0]))

    # Get top 1 result
    top_id = current_combined[0][0]
    top_dist = current_combined[0][1]

    # Check whether the retrieved document ID matches the target ID
    if top_id == target_ids[i]:
        correct_count += 1
        correct_distances.append(top_dist)

print(correct_count)


34


### Error Rate Analysis

Complementing the accuracy assessment, we also count the number of incorrect retrieval results. The combination of correct and incorrect counts provides a complete picture of the system's retrieval performance.

In [15]:
# Number of queries whose top result was not the target document

print(num_queries - correct_count)

66


### Minimum Distance Analysis

Among the correctly retrieved results, we identify the minimum cosine distance to understand the best-case performance. This metric indicates how closely the most similar documents align semantically.

In [16]:
if correct_distances:
    print(round(min(correct_distances), 4))
else:
    print("No correct results found")


0.0155


### Maximum Distance Analysis

Conversely, we identify the maximum cosine distance among correctly retrieved results. This metric reveals the worst-case performance for correct retrievals and helps understand the range of similarity scores.

In [17]:
if correct_distances:
    print(round(max(correct_distances), 4))
else:
    print("No correct results found")

0.5728
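The minimum and maximum can also be reported together with a central value. The sketch below is one possible way to do that, assuming a plain list of distances like `correct_distances`; the function name `summarize_distances` and the sample values are hypothetical and are not results from this notebook.

```python
from statistics import mean, median

def summarize_distances(distances, ndigits=4):
    """Return rounded min, max, mean, and median, or None for an empty list."""
    if not distances:
        return None
    return {
        "min": round(min(distances), ndigits),
        "max": round(max(distances), ndigits),
        "mean": round(mean(distances), ndigits),
        "median": round(median(distances), ndigits),
    }

# Hypothetical distances, not taken from the experiment above
sample_distances = [0.10, 0.20, 0.30, 0.60]
summary = summarize_distances(sample_distances)
print(summary)
```

Calling `summarize_distances(correct_distances)` in the notebook would give the same minimum and maximum as the two cells above, plus the mean and median.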
