<a href="https://colab.research.google.com/github/anudee796/chroma/blob/main/chromadb/experimental/density_relevance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Density based retrieval relevance

An important aspect of using embedding-based retrieval systems like Chroma is knowing whether the existing dataset contains results relevant to a given query. As application developers, we would like to know when the system doesn't have enough information to complete a given query or task: we want to know what we don't know.

This is particularly important in the case of retrieval-augmented generation, since it has [often been observed](https://arxiv.org/abs/2302.00093) that supplying irrelevant context confuses the generative model, leading to degradation of application performance in ways that are difficult to detect.

Unlike a relational database, which returns no results if none match the query, a retrieval system based on vector search will return the $k$ nearest neighbors to any given query, whether they are relevant or not.
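A minimal sketch of this behavior, using brute-force search over random vectors in NumPy rather than Chroma itself. The data, the dimensions, and the helper name `knn` are illustrative assumptions. Even a query far from every stored point still gets exactly $k$ results:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "dataset": 100 embeddings clustered near the origin.
dataset = rng.normal(loc=0.0, scale=1.0, size=(100, 8))

def knn(query, data, k):
    """Brute-force k nearest neighbors by Euclidean distance."""
    distances = np.linalg.norm(data - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return nearest, distances[nearest]

# One query that resembles the data, and one far away from all of it.
near_query = dataset[0] + 0.01
far_query = np.full(8, 100.0)

near_ids, near_dists = knn(near_query, dataset, k=5)
far_ids, far_dists = knn(far_query, dataset, k=5)

# Both queries return k results; only the distances hint at relevance.
print(len(near_ids), near_dists.round(2))
print(len(far_ids), far_dists.round(2))
```

The far query's neighbors are all much further away than the near query's, but nothing in the result set itself marks them as irrelevant.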

One possible approach is to tune a distance threshold and reject any results that fall further from the query than that threshold. This might be suitable for certain kinds of fixed datasets, but in practice such thresholds tend to be very brittle, often excluding many relevant results while failing to exclude irrelevant ones. The threshold also needs to be continuously adapted as the data changes. Moreover, distance thresholds are not comparable across embedding models for a given dataset, nor across datasets for a given embedding model.
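To illustrate the brittleness, here is a small sketch with made-up numbers. The helper `filter_by_threshold` and both distance arrays are hypothetical, chosen only to mimic two embedding models whose distances live on different scales:

```python
import numpy as np

def filter_by_threshold(distances, threshold):
    """Keep only the results whose distance is at most `threshold`."""
    distances = np.asarray(distances)
    return distances[distances <= threshold]

# Hypothetical distances for the same results under two embedding
# models whose distance scales differ by a factor of 10.
distances_model_a = np.array([0.2, 0.4, 0.9, 1.5])
distances_model_b = distances_model_a * 10

threshold = 1.0  # tuned by hand for model A

kept_a = filter_by_threshold(distances_model_a, threshold)
kept_b = filter_by_threshold(distances_model_b, threshold)

print(kept_a)  # three results survive under model A
print(kept_b)  # nothing survives under model B
```

The same threshold that keeps three results for one model rejects everything for the other, so it would have to be re-tuned for every model and dataset combination.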

We would prefer a data-driven approach which:
- produces a uniform and comparable measure of relevance for any dataset
- automatically adapts as the underlying data changes
- is relatively inexpensive to compute

This notebook demonstrates one such approach, which relies on the distribution of distances (pseudo 'density') between points in a given dataset. For a given result, we compute the percentile into which the result's distance to the query falls, with respect to the overall distribution of distances in the dataset. This approach produces a uniform measure of relevance for any dataset, is relatively cheap to compute, and can be computed online as data mutates.
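As a rough sketch of the idea, the snippet below assumes the reference distribution is each point's distance to its nearest neighbor; the notebook's actual choice of distribution may differ. The function names and the random data are illustrative only:

```python
import numpy as np

def nearest_neighbor_distances(data):
    """Distance from each point to its closest other point (brute force)."""
    diffs = data[:, None, :] - data[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # ignore each point's distance to itself
    return dists.min(axis=1)

def distance_percentile(query_distance, sorted_reference):
    """Percentage of reference distances that are <= query_distance."""
    rank = np.searchsorted(sorted_reference, query_distance, side="right")
    return 100.0 * rank / len(sorted_reference)

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 8))  # hypothetical embeddings
reference = np.sort(nearest_neighbor_distances(data))

# A result closer than every reference distance lands at percentile 0;
# a result farther than every reference distance lands at percentile 100.
closest = distance_percentile(0.0, reference)
farthest = distance_percentile(reference[-1] + 1.0, reference)
typical = distance_percentile(np.median(reference), reference)
print(closest, farthest, typical)
```

Because the output is a percentile rather than a raw distance, it lies on the same 0 to 100 scale regardless of the embedding model or dataset. Keeping the reference distances sorted makes each lookup a binary search.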

This approach is still very preliminary, and we welcome contributions and alternative approaches. Some ideas are listed at the end of this notebook.

## Preliminaries

In [5]:
from langchain_community.document_loaders import PyPDFLoader
from google.colab import files

# Upload the PDF file
uploaded = files.upload()

# Get the filename
filename = list(uploaded.keys())[0]

# Load the PDF using PyPDFLoader
loader = PyPDFLoader(filename)
docs = loader.load()

Saving MANUAL_CX49574_125080 Rev 120214.pdf to MANUAL_CX49574_125080 Rev 120214.pdf


In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# `docs` was loaded from the uploaded PDF in the previous cell

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Split the documents into chunks
documents = text_splitter.split_documents(docs)


In [11]:
from huggingface_hub import login
login()


In [1]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF")
model = AutoModelForCausalLM.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


ChunkedEncodingError: ('Connection broken: IncompleteRead(4463688015 bytes read, 200479393 more expected)', IncompleteRead(4463688015 bytes read, 200479393 more expected))

In [2]:
# Clone the GitHub repository
!git clone https://github.com/anudee796/chroma.git

# Change directory to the cloned repository (git clones into `chroma`)
%cd chroma

# Install dependencies
!pip install -r requirements.txt

# Run ChromaDB
!python run_chromadb.py

fatal: could not create work tree dir 'chroma': No space left on device


In [None]:
# Install necessary packages
!pip install langchain langchain-community langchain_ollama chromadb

# Install the Ollama Python client (the `ollama` binary itself must be installed separately)
!pip install ollama

import subprocess
import time

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_ollama import ChatOllama

# Start the Ollama server in the background; it listens on port 11434 by default
ollama_server = subprocess.Popen(["ollama", "serve"])

# Wait for the Ollama server to start
time.sleep(10)

# Function to pull the model (requires the server to be running)
def pull_model(model_name):
  try:
    # Run the shell command to pull the model
    result = subprocess.run(["ollama", "pull", model_name], check=True, capture_output=True, text=True, encoding='utf-8')
    print(f"Model '{model_name}' pulled successfully.")
    print(result.stdout)
  except subprocess.CalledProcessError as e:
    print(f"Failed to pull model '{model_name}'.")
    print(e.stderr)

# Pull the "nvidia/Llama-3.1-Nemotro" model
pull_model("nvidia/Llama-3.1-Nemotro")

# Initialize the local LLM from Ollama
local_model = "nvidia/Llama-3.1-Nemotro"
llm = ChatOllama(model=local_model, base_url="http://localhost:11434")

# Example usage: compose prompt, model, and output parser into a chain
template = "Tell me a joke about {topic}."
prompt_template = ChatPromptTemplate.from_template(template)
chain = prompt_template | llm | StrOutputParser()

# Get a joke about programming
response = chain.invoke({"topic": "programming"})
print(response)

### Dataset

As a demonstration we use the [SciQ dataset](https://arxiv.org/abs/1707.06209), available from [HuggingFace](https://huggingface.co/datasets/sciq).

Dataset description, from HuggingFace:

> The SciQ dataset contains 13,679 crowdsourced science exam questions about Physics, Chemistry and Biology, among others. The questions are in multiple-choice format with 4 answer options each. For the majority of the questions, an additional paragraph with supporting evidence for the correct answer is provided.