# Workshop Setup Instructions

To set up your environment for this workshop, follow these steps:

```
uv venv --python 3.9 
source .venv/bin/activate
uv pip install -r requirements.txt
```

Select this environment as the kernel when running the workshop notebooks.
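
To confirm that a notebook is really running on the environment created above, a quick check along these lines can help. It assumes the environment lives in `.venv`, as created by `uv venv`; the printed path will differ from machine to machine.

```python
import sys

# Path of the interpreter backing the current kernel; it should point inside .venv
print(sys.executable)

# Major and minor version of that interpreter (3.9 if the environment above is in use)
python_version = sys.version_info[:2]
print(python_version)
```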

## Database Setup

To set up our vector database, we will use LanceDB, a vector database that is easy to set up and use.

It is open source and can run locally or in the cloud.

In [9]:
import lancedb

db = lancedb.connect("./lancedb")

This should create a `lancedb` directory in your current working directory. We can verify this with the following check.

In [10]:
import os

os.path.exists("./lancedb")

True

## Creating a Vector Table in LanceDB

Next, we'll create our first vector table by:

1. Defining a Pydantic schema to structure our data
2. Using LanceDB's `Table` class to create and manage the table
3. Integrating with embedding models to vectorize our documents

LanceDB supports multiple embedding providers including OpenAI, Cohere, and HuggingFace. For this workshop, we'll use OpenAI's embedding models, which offer a good balance of quality and performance. This requires an `OPENAI_API_KEY` environment variable.

You can explore all available embedding models in the [LanceDB documentation](https://lancedb.github.io/lancedb/embeddings/available_embedding_models/text_embedding_functions/).

Let's start by defining our table schema:

In [11]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry


func = get_registry().get("openai").create(name="text-embedding-3-small")

# Define a Schema
class Words(LanceModel):
    # This is the source field that will be used as input to the OpenAI Embedding API
    text: str = func.SourceField()

    # This is the vector field that will store the output of the OpenAI Embedding API
    vector: Vector(func.ndims()) = func.VectorField()

Now let's create our table with this schema. Because the schema is a Pydantic model, LanceDB creates the necessary fields for us, and we can use the `add` method to ingest our data.


In [12]:
table = db.create_table("words", schema=Words, mode="overwrite")

# Ingest our data
table.add(
    [
        {"text": "hello world"},
        {"text": "goodbye world"}
    ]
)

[2025-03-14T17:49:18Z WARN lance::dataset::write::insert] No existing dataset at /Users/darrenhinde/Documents/GitHub/synthetic-generation/Workshop/lancedb/words.lance, it will be created


In [14]:
table.to_pandas()

Unnamed: 0,text,vector
0,hello world,"[-0.006701672, -0.03921971, 0.034169834, 0.028..."
1,goodbye world,"[0.02580526, -0.0054752463, 0.011662952, 0.012..."


### Full text search

Now that the table exists, we can use the `search` method to query it, for example `table.search("hello world")`. To request a full text search, we pass `query_type="fts"`:



In [15]:
table.search("hello",query_type="fts").to_list()

RuntimeError: lance error: Invalid user input: Cannot perform full text search unless an INVERTED index has been created on at least one column, /Users/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/lance-0.24.1/src/dataset/scanner.rs:1555:17

When we run the code above, we get an error saying that full text search cannot be performed unless an INVERTED index has been created on at least one column. Full text search depends on this special index, which connects words to the documents that contain them.
 
### What is an Inverted Index?

An inverted index is simply a lookup table that shows which documents contain specific words. Think of it like the index at the back of a book that tells you which pages contain certain topics.

It connects words to the documents where they appear. We must create this index beforehand to make text searches fast and efficient.

Let's look at a simple example to understand this better.

In [16]:
# A small example corpus: document ID -> text
documents = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The fox jumps over dog"
}

In [17]:
inverted_index = {
    'the': {1, 2, 3},
    'quick': {1},
    'brown': {1, 2}, 
    'fox': {1, 3},
    'lazy': {2},
    'dog': {2, 3},
    'jumps': {3},
    'over': {3}
}

This means that when users make a query like `the dog`, we can quickly look up the documents that contain these words and return the results. 
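
The index above was written out by hand. As a minimal sketch, it could also be built directly from `documents`; the helper name `build_inverted_index` and the lowercase, whitespace-based tokenization are assumptions made for this illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


documents = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The fox jumps over dog",
}

built_index = build_inverted_index(documents)
print(sorted(built_index["brown"]))  # [1, 2]
```

Looking up a word is then a single dictionary access, which is why the index has to exist before a text search can run quickly.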

Here we use a simplified implementation: for each word in the query, we look up the documents that contain it and count the matches. Documents with more matching words receive a higher score (the fraction of query words they contain) and are ranked higher in the results.

In [18]:
def search(query, inverted_index):
    # Convert query to lowercase and split into words
    query_words = query.lower().split()
    
    # Count matches for each document
    doc_scores = {}
    for word in query_words:
        if word in inverted_index:
            for doc_id in inverted_index[word]:
                doc_scores[doc_id] = doc_scores.get(doc_id, 0) + 1
    
    # Sort documents by score in descending order
    sorted_results = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Return a list of {"doc_id", "score"} dicts, where the score is the
    # fraction of query words that appear in the document
    return [
        {
            "doc_id": doc_id,
            "score": score / len(query_words)
        }
        for doc_id, score in sorted_results
    ]

query = "the quick"
results = search(query, inverted_index)
print(f"Search results: {results}")


Search results: [{'doc_id': 1, 'score': 1.0}, {'doc_id': 2, 'score': 0.5}, {'doc_id': 3, 'score': 0.5}]


### BM25 search

In [20]:
table.create_fts_index("text", replace=True)

In [21]:
for item in table.search("hello",query_type="fts").to_list():
    print(item["text"])

hello world


In [22]:
for item in table.search("hello OR goodbye",query_type="fts").to_list():
    print(item["text"])

hello world
goodbye world
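
To give a feel for how BM25 ranks matches, here is a minimal sketch of the standard Okapi BM25 formula applied to the toy `documents` from the inverted index example. The function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; this is not LanceDB's internal implementation.

```python
import math


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Rank documents against a query with a basic Okapi BM25 formula."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        for term in query.lower().split():
            tf = tokens.count(term)  # term frequency in this document
            if tf == 0:
                continue
            # Number of documents containing the term
            df = sum(1 for other in tokenized.values() if term in other)
            # Rare terms get a larger inverse document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Length normalization dampens the advantage of long documents
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


documents = {
    1: "The quick brown fox",
    2: "The lazy brown dog",
    3: "The fox jumps over dog",
}

ranked = bm25_scores("the quick", documents)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Compared with the match-counting scorer from earlier, a rare term such as `quick` contributes more than a common one such as `the`, and a shorter document scores slightly higher for the same match.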


## Understanding Different Retrieval Methods

Let's explore the key retrieval methods we've covered and how they complement each other:

1. **Vector Search**: Transforms text into numerical vectors that capture semantic meaning. This approach identifies documents whose vectors are most similar to your query vector, which makes it well suited to finding conceptually related content regardless of exact wording. For instance, "I'm delighted" and "I'm really happy" would be recognized as close in meaning.

2. **Full Text Search**: Matches the words in your query directly against document content. It uses ranking functions such as BM25 to score results based on how often a term appears in a document and how rare it is across the corpus. This method excels at precise keyword matching and finding specific information.

3. **Hybrid Search**: Combines the strengths of vector and full text search. By pairing semantic understanding with exact keyword matching, hybrid search returns more comprehensive results. The system runs both searches independently and then merges the two result lists into a single ranking.
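
As a rough illustration of the "most similar vectors" idea behind vector search, the sketch below ranks two made-up 3-dimensional vectors by cosine similarity to a query vector. Real embeddings have far more dimensions, and cosine similarity is only one common measure; which distance metric a given table uses is a configuration detail, so treat this as an assumption rather than a description of LanceDB's behaviour.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example (not real embeddings)
query_vector = [1.0, 0.0, 1.0]
doc_vectors = {
    "close": [0.9, 0.1, 1.1],
    "far": [-1.0, 1.0, 0.0],
}

# Most similar document first
ranked_names = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked_names)  # ['close', 'far']
```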

Now, let's implement hybrid search in LanceDB to demonstrate its effectiveness.
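
A hybrid query against LanceDB needs an embedding API key and a populated table, so as a self-contained illustration of the merge step, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to combine two ranked lists. The function name, the document IDs, and the constant `k=60` are assumptions for this example; the reranker a particular LanceDB setup uses may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; each hit adds 1 / (k + rank) to its document."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results, best match first, from each retrieval method
vector_hits = ["doc_b", "doc_a", "doc_c"]
fts_hits = ["doc_a", "doc_d"]

fused_results = reciprocal_rank_fusion([vector_hits, fts_hits])
print([doc_id for doc_id, _ in fused_results])
# ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

`doc_a` comes out on top because it appears in both lists, even though the vector search alone ranked it second.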

## Logging Setup

We will use the `arize-phoenix` library to log data and metrics to a central location and view them in a dashboard.

To get started:

1. Go to https://app.phoenix.arize.com/ and create an account.
2. Create an API key for that account.
3. Store the key in your `.env` file so the notebook can use it to log data and metrics to your account.





In [1]:
# Let's set up Phoenix for logging and tracing
# We'll use the API key from our .env file to authenticate with Phoenix

import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Check if Phoenix environment variables are set
phoenix_endpoint = os.environ.get("PHOENIX_COLLECTOR_ENDPOINT")
phoenix_headers = os.environ.get("PHOENIX_CLIENT_HEADERS")

if phoenix_endpoint and phoenix_headers:
    print("Phoenix environment variables are configured:")
    print(f"Endpoint: {phoenix_endpoint}")
    print(f"Headers configured: {'api_key' in phoenix_headers}")
else:
    print("Phoenix environment variables not found. Please check your .env file.")
    # You can set them manually if needed
    # os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=YOUR_API_KEY"
    # os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"


Phoenix environment variables are configured:
Endpoint: https://app.phoenix.arize.com
Headers configured: True


In [3]:
from phoenix.otel import register

# configure the Phoenix tracer
tracer_provider = register(
    project_name="my-llm-app",  # Defaults to "default" if omitted
    auto_instrument=True,  # Auto-instrument based on the installed OpenInference instrumentation packages
)
tracer = tracer_provider.get_tracer(__name__)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: my-llm-app
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.





In [4]:
@tracer.chain
def my_func(input: str) -> str:
    # This is a simple function that will be traced by Phoenix
    processed_result = f"Processed: {input}"
    return processed_result

# Let's test our tracing with a few examples
print("Testing Phoenix tracing...")
test_inputs = ["Hello, world!", "Testing tracing", "Phoenix is awesome"]
for test_input in test_inputs:
    result = my_func(test_input)
    print(f"Input: '{test_input}' → Output: '{result}'")

print("\nCheck your Phoenix dashboard to see the traces!")

Testing Phoenix tracing...
Input: 'Hello, world!' → Output: 'Processed: Hello, world!'
Input: 'Testing tracing' → Output: 'Processed: Testing tracing'
Input: 'Phoenix is awesome' → Output: 'Processed: Phoenix is awesome'

Check your Phoenix dashboard to see the traces!
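
To make the idea of a traced function less opaque, here is a simplified stand-in for what a tracing decorator does: it wraps the function, records its inputs, output, and duration, and passes the result through unchanged. The names `traced` and `recorded_spans` are hypothetical, and this is not how Phoenix or OpenTelemetry are implemented; real tracers export spans to a collector instead of appending to a list.

```python
import functools
import time

recorded_spans = []  # Hypothetical in-memory stand-in for a trace collector


def traced(func):
    """Record name, inputs, output, and duration of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        recorded_spans.append({
            "name": func.__name__,
            "input": args,
            "output": result,
            "duration_s": time.perf_counter() - start,
        })
        return result
    return wrapper


@traced
def shout(text: str) -> str:
    return text.upper()


shout("hello")
print(recorded_spans[0]["name"], recorded_spans[0]["output"])  # shout HELLO
```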


# Done!