# LanceDB

> **Important**: This course uses the local version of LanceDB. The same code also works with the hosted version of LanceDB, as long as you change the connection string to point to your LanceDB instance.

LanceDB is a vector database that makes it easy to build and evaluate RAG applications. In this notebook, we'll explore how to use LanceDB's key features to store, search, and retrieve data effectively.

## Why This Matters
When building RAG systems, choosing the right vector database is crucial. While many options exist, LanceDB stands out by providing:

1. Simple schema definition with Pydantic models
2. Automatic embedding generation and management
3. A unified API for different search types (vector, full-text, hybrid)

## What You'll Learn

Through hands-on examples, you'll discover how to:

1. Set up LanceDB tables with proper schemas
2. Perform different types of searches (full-text, vector, hybrid)
3. Enhance results with reranking

By the end of this notebook, you'll understand how to leverage LanceDB's capabilities to build robust retrieval systems.

## Setting Up LanceDB

We can create our LanceDB instance using the `lancedb` library and the `connect` function.

In [7]:
import lancedb

db = lancedb.connect("./lancedb")
db

LanceDBConnection(uri='/Users/ivanleo/Documents/coding/ttt/systematically-improving-rag/cohort_2/week0/lancedb')

This should in turn create a `lancedb` directory in your current working directory. We can validate that this is the case by running the following command.

In [8]:
import os

os.path.exists("./lancedb")

True

Now let's create our first table. We'll do so by defining a Pydantic schema and then passing it to the `create_table` method. We'll also use the OpenAI Embeddings API to create embeddings for each document that we ingest.

To read more about the embedding models that are available for use, you can check out the LanceDB documentation [here](https://lancedb.github.io/lancedb/embeddings/available_embedding_models/text_embedding_functions/).

First let's define our Table schema.

In [5]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry


func = get_registry().get("openai").create(name="text-embedding-3-small")


# Define a Schema
class Words(LanceModel):
    # This is the source field that will be used as input to the OpenAI Embedding API
    text: str = func.SourceField()

    # This is the vector field that will store the output of the OpenAI Embedding API
    vector: Vector(func.ndims()) = func.VectorField()

Now let's create our table with this schema. By using Pydantic, LanceDB will create the necessary fields for us and we can use the `add` method to ingest our data.

In [9]:
table = db.create_table("words", schema=Words, mode="overwrite")

# Ingest our data
table.add([{"text": "hello world"}, {"text": "goodbye world"}])

[2025-02-11T06:26:03Z WARN  lance::dataset::write::insert] No existing dataset at /Users/ivanleo/Documents/coding/ttt/systematically-improving-rag/cohort_2/week0/lancedb/words.lance, it will be created


We can verify that our data was ingested correctly and that the embeddings were created by converting our table to a pandas dataframe and printing the results.

In [11]:
table.to_pandas()

| | text | vector |
|---|---|---|
| 0 | hello world | [-0.006763331, -0.03919632, 0.034175806, 0.028... |
| 1 | goodbye world | [0.025792664, -0.0054613473, 0.011670824, 0.01... |


And with that we've created our first table in LanceDB! Now we'll walk through how to do full text search with our table. This is a simple method which provides a strong baseline for our retrieval system.

## Full Text Search

> **Important**: Before running this code, make sure you've installed `tantivy==0.20.1` in your local kernel, because LanceDB uses Tantivy under the hood to perform full text search (FTS).

By default, LanceDB uses vector search for every query. This means that when we query our table, it'll use the vector embeddings to find the documents most similar to our query.

To use full text search instead, we need to explicitly set the query type to `fts`.


In [12]:
table.search("hello", query_type="fts").to_list()

RuntimeError: lance error: Invalid user input: Cannot perform full text search unless an INVERTED index has been created on at least one column, /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-0.22.0/src/dataset/scanner.rs:1515:17

But when we run the code above, we get an error saying that no inverted index has been created. This error is expected: to perform full text search, we first need to create an index which maps keywords to the documents that contain them.

### What is an Inverted Index?

An inverted index is a data structure that allows us to quickly look up which documents contain a given keyword. It's a key component of full text search systems and is used to speed up the search process.

It maps words or subwords to the documents that contain them. We need to generate this ahead of time so that we can perform full text search efficiently.

Let's see a simplified example.

In [25]:
# We start with a few documents, keyed by document id
documents = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The fox jumps over dog",
}

We might then do some pre-processing here to break down the contents of each document into words or subwords.

In [26]:
inverted_index = {
    "the": {1, 2, 3},
    "quick": {1},
    "brown": {1, 2},
    "fox": {1, 3},
    "lazy": {2},
    "dog": {2, 3},
    "jumps": {3},
    "over": {3},
}
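The index above was written out by hand. As a minimal sketch of how it could be generated instead, the snippet below builds the same mapping from the toy documents. The function name `build_inverted_index` and the pre-processing (lowercasing and splitting on whitespace) are assumptions made for illustration; real full text search engines apply more involved tokenization.

```python
from collections import defaultdict

# Same toy corpus as in the example above
documents = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The fox jumps over dog",
}


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed pre-processing: lowercase, then split on whitespace
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


built_index = build_inverted_index(documents)
print(built_index["fox"])
```

Storing document ids in a set means that a word repeated within one document is only recorded once, which matches the hand-written index.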

This means that when users make a query like `the dog`, we can quickly look up the documents that contain these words and return the results. 

We use a simplified implementation here: we look up each word in the query and return the documents that contain any of the words. A document that has more matches has a higher score and will be ranked higher in the returned results.

In [31]:
def search(query, inverted_index):
    # Convert query to lowercase and split into words
    query_words = query.lower().split()

    # Count matches for each document
    doc_scores = {}
    for word in query_words:
        if word in inverted_index:
            for doc_id in inverted_index[word]:
                doc_scores[doc_id] = doc_scores.get(doc_id, 0) + 1

    # Sort documents by score in descending order
    sorted_results = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

    # Return a list of dicts, with each score normalized by the number of query words
    return [
        {"doc_id": doc_id, "score": score / len(query_words)}
        for doc_id, score in sorted_results
    ]


query = "the quick"
results = search(query, inverted_index)
print(f"Search results: {results}")
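The search above returns any document that matches at least one query word. A stricter variant, sketched below, keeps only the documents that contain every query word by intersecting the sets stored in the index. The names `search_all_terms` and `toy_index` are hypothetical and used only for this illustration.

```python
def search_all_terms(query, index):
    """Return the sorted ids of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    # Look up the set of documents for each word; an unknown word matches nothing
    postings = [index.get(word, set()) for word in words]
    # Intersect the sets so only documents matching every word remain
    return sorted(set.intersection(*postings))


# Hand-written toy index, a subset of the one shown earlier
toy_index = {
    "the": {1, 2, 3},
    "brown": {1, 2},
    "fox": {1, 3},
    "dog": {2, 3},
}

print(search_all_terms("the fox", toy_index))
print(search_all_terms("brown fox", toy_index))
```

Because each lookup is a dictionary access followed by set operations, the cost depends on how many documents contain the query words rather than on the size of the whole corpus.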

LanceDB automates this process for us and makes it easy to perform full text search with the `create_fts_index` method. A few things are different here: it uses a scoring mechanism called `BM25`, and more pre-processing is done on both your queries and the documents themselves.

In [33]:
table.create_fts_index("text", replace=True)

We can now see that full text search works out of the box. Additionally, with Tantivy we can use boolean queries to combine multiple words, as seen below where we combine `hello` and `goodbye` with `OR` in our query.

In [39]:
for item in table.search("hello", query_type="fts").to_list():
    print(item["text"])

In [38]:
for item in table.search("hello OR goodbye", query_type="fts").to_list():
    print(item["text"])
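To give a rough idea of what the BM25 scoring mentioned earlier does, here is a minimal sketch of a textbook BM25 variant in plain Python. It is an illustration under stated assumptions, not the implementation used by LanceDB or Tantivy: the function name `bm25_scores`, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` (commonly quoted defaults) are all choices made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs

    # Document frequency: the number of documents each term appears in
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))

    scores = {}
    for doc_id, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Terms that appear in fewer documents get a larger weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and longer documents are penalized
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


toy_docs = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The fox jumps over dog",
}
scores = bm25_scores("the quick", toy_docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy corpus the word `the` appears in every document, so it contributes very little to the score, while the rarer word `quick` dominates the ranking. This is the main difference from the simple match counting shown earlier.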

## Understanding Different Retrieval Methods

Let's quickly review the different retrieval methods that we've seen so far.

1. **Vector Search**: Vector search converts text into sequences of numbers (vectors) that capture meaning. When you search, it finds documents whose vectors are closest to your query vector. This works well for finding semantically similar content, even if the exact words don't match. For example, "I'm delighted" and "I'm really happy" would be considered similar.
2. **Full Text Search**: Full text search directly matches the words in your query to words in documents. It uses techniques like BM25 scoring, which ranks documents based on how often your search terms appear and how unique those terms are across all documents. This is great for finding exact matches or specific keywords.
3. **Hybrid Search**: Hybrid search combines both approaches to get the best of both worlds. It can find documents that either contain your exact keywords or express similar meanings. We run the query through both search methods and then merge the two sets of results into a single ranked list.
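One common way to merge two ranked result lists is reciprocal rank fusion (RRF). The sketch below is only an illustration of that idea and is not necessarily how LanceDB combines results in your installed version. The function name `reciprocal_rank_fusion`, the document ids, and the constant `k=60` (a commonly used value) are assumptions for this example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single list, best first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical result lists from a vector search and a full text search
vector_hits = ["doc_a", "doc_b", "doc_c"]
fts_hits = ["doc_b", "doc_d", "doc_a"]

print(reciprocal_rank_fusion([vector_hits, fts_hits]))
```

RRF only looks at rank positions, so it avoids having to put vector distances and keyword scores on a comparable scale. Documents that appear in both lists accumulate two contributions and tend to rise to the top.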

Let's see how we can perform hybrid search in LanceDB. 

In [None]:
for item in table.search("I'm really excited!", query_type="hybrid").to_list():
    print(item["text"])

We can see that the results are semantically similar to our query: "hello world" is much closer in meaning to "I'm really excited!" than "goodbye world" is.
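To make the idea of semantic closeness more concrete, the sketch below ranks two documents by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration and are not real embeddings, and the names are hypothetical. Cosine similarity is one common choice of metric; which metric a given database uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real embeddings
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "doc_a": [0.8, 0.2, 0.1],
    "doc_b": [0.0, 0.3, 0.9],
}

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vecs,
    key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
    reverse=True,
)
print(ranked)
```

Here `doc_a` points in nearly the same direction as the query vector, so it ranks first even though no words are compared at any point.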

### Using Re-Rankers

While basic search methods work well for simple queries, more complex questions often benefit from re-ranking. 

A re-ranker takes an initial set of search results and applies a more sophisticated model to analyze how well each result actually answers your query. 

For example, in our capital punishment example below, hybrid search returns documents that contain the relevant keywords. But the re-ranker can better understand we're asking about location rather than general information, helping it prioritize the Washington D.C. result. This additional analysis takes more computation time but often produces significantly better results for complex queries.

In [48]:
# First let's create some more complex documents
documents = [
    "Carson City is the capital city of the American state of Nevada.",
    "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean. Its capital is Saipan.",
    "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. ",
    "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.",
]
documents = [{"text": doc} for doc in documents]

# Create a new table for our complex documents
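# mode="overwrite" lets this cell be re-run without failing because the table already exists
complex_table = db.create_table(
    "complex_docs", data=documents, schema=Words, mode="overwrite"
)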

In [49]:
# Create an index for full text search
complex_table.create_fts_index(["text"])

In [58]:
# Let's try a search query and see the results before reranking
query = "where did the capital of the United States decide to allow capital punishment?"

results = complex_table.search(query, query_type="hybrid").limit(4).to_list()
for idx, item in enumerate(results):
    print(f"\n{idx + 1}. {item['text']}")

This is a good start, but the results are not exactly what we want. We're asking where the decision on capital punishment was made, not when capital punishment was allowed.

In this case we can solve this by using a `ReRanker`. Re-rankers are more expensive than just doing simple vector search but they often allow us to improve the results of our retrieval system by a large margin.

We can use the `rerank` method in LanceDB to re-rank our results. We'll use the `CohereReranker` to do so. We'll also use the `limit` method to cap the number of results that we return.


In [None]:
from lancedb.rerankers import CohereReranker

# Reuse the same query so we can compare the results after reranking
query = "where did the capital of the United States decide to allow capital punishment?"

# Then define a Cohere reranker
reranker = CohereReranker(model_name="rerank-english-v3.0")

results = (
    complex_table.search(query, query_type="hybrid").rerank(reranker).limit(4).to_list()
)
for idx, item in enumerate(results):
    print(f"\n{idx + 1}. {item['text']}")
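The two-stage pattern used above can also be sketched without any library. In the snippet below, a cheap keyword overlap score selects candidates and a second function re-scores only those candidates. The bigram overlap function is just a stand-in for a real reranking model, and all names here (`keyword_overlap`, `bigram_overlap`, `retrieve_then_rerank`, `rerank_docs`) are hypothetical.

```python
def keyword_overlap(query, doc):
    """Cheap first-stage score: number of distinct query words found in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def bigram_overlap(query, doc):
    """Stand-in second-stage score: number of distinct shared adjacent word pairs."""

    def bigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:]))

    return len(bigrams(query) & bigrams(doc))


def retrieve_then_rerank(query, docs, fetch_k=3, top_k=2):
    # Stage 1: score every document cheaply and keep the best fetch_k candidates
    candidates = sorted(docs, key=lambda d: keyword_overlap(query, d), reverse=True)[:fetch_k]
    # Stage 2: re-score only those candidates with the more expensive function
    return sorted(candidates, key=lambda d: bigram_overlap(query, d), reverse=True)[:top_k]


rerank_docs = [
    "the lazy brown dog",
    "brown the quick fox",
    "the quick brown fox",
    "a cat",
]
print(retrieve_then_rerank("the quick brown fox", rerank_docs))
```

The first stage cannot tell the second and third documents apart because they contain the same words. The second stage looks at word order, so it places the exact phrase first. A real reranker plays the same role with a far more capable model, which is why it is only applied to a small candidate set.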

## Conclusion

In this notebook, we've explored the core features that make LanceDB well-suited for RAG applications. We learned how to:

1. Define clean schemas using Pydantic models
2. Index and search data using different methods
3. Leverage the Re-Ranking API to improve the results of our retrieval system

LanceDB's combination of type safety through Pydantic, automated embedding handling, and unified search API provides a strong foundation for RAG development. These capabilities will be especially valuable in future weeks as we evaluate and improve different aspects of our retrieval system. Understanding these LanceDB basics ensures you can follow along with the rest of the course and make the most out of it.
