# Retrieval Hands-On Lab

## Objectives
By the end of this lab, participants will:

1. Understand how to parse PDFs inside Snowflake
2. Understand how to create vector representations of text data and load it into Snowflake tables
3. Perform similarity search against embeddings in Snowflake
4. Use Snowflake Cortex Search for retrieval and understand the benefits compared to simple similarity search
5. Understand how to evaluate retrieval performance

## Part 1: Setup
In this section, we will:

1. Create some Snowflake objects to store our data in
2. Upload a PDF of Cincinnati Parks' 3-year development plan into a stage
3. Parse the PDF into usable text and load the results into a Snowflake table

### Inspect the PDF

Download the Cincinnati Parks 3-year plan by clicking on the data directory, then clicking the '...' next to the PDF and choosing 'Download'.

In [None]:
USE RETRIEVAL_LAB.PUBLIC;

-- Create a stage to store our PDF
CREATE OR REPLACE STAGE docs ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE') DIRECTORY = ( ENABLE = true );

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.core import Root

session = get_active_session()
pdf_path = "./data/cincinnati-parks-3-year-plan.pdf"

session.file.put(
    pdf_path,
    "@docs",
    auto_compress=False,
    overwrite=True
)

# Force a refresh on the stage so the doc is accessible
session.sql("ALTER STAGE docs REFRESH").collect()

In [None]:
-- Verify PDF was properly uploaded
LIST @docs;

In [None]:
-- This table will store the text from the parsed PDF
CREATE OR REPLACE TABLE PARSED_PDFS ( 
    RELATIVE_PATH VARCHAR,
    SIZE NUMBER(38,0),
    FILE_URL VARCHAR,
    PARSED_DATA VARCHAR);

In [None]:
-- Use Snowflake Cortex's PARSE_DOCUMENT function to extract the text from the PDF and save it to a column
INSERT INTO PARSED_PDFS (relative_path, size, file_url, parsed_data)
SELECT
    relative_path,
    size,
    file_url,
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT('@docs', relative_path, { 'mode': 'OCR' }):content AS parsed_data
FROM directory(@docs);

In [None]:
-- Verify the data was successfully parsed
select * from PARSED_PDFS;

## Part 2: Generate Embeddings

In this section, we will:

1. Explore various strategies for chunking the text data
2. Generate embeddings for our text chunks
3. Load the results into a Snowflake table using the `VECTOR` datatype

### Chunking Strategies

In this section, we'll explore various chunking strategies. The right strategy will ultimately depend on the data and use case at hand. The three strategies explored in this section are:

1. **Snowflake's Recursive Text Splitter** - this method uses an ordered list of separators to split the text iteratively. If splitting on the primary separator leaves chunks larger than the desired size, it takes another pass over the text with the next separator. Snowflake's default separator list is `["\n\n", "\n", " ", ""]`, meaning it first splits on paragraph breaks, then line breaks, then spaces, and finally the empty string. This can be altered as needed.

2. **Semantic Chunking** - this strategy generates chunks based on the semantic meaning of the text within the document. The goal is to create chunks composed of sentences that cover the same theme or topic. Chunk sizes therefore vary more than with a fixed-size strategy that, for example, always targets 500 tokens.

3. **Paragraph Chunking** - this is a simpler strategy that can be used on documents that are well structured into paragraphs. In our case, this strategy makes a lot of sense as it will group each park plan into its own chunk.
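
To make the first strategy more concrete, here is a minimal pure-Python sketch of recursive splitting. It is not Snowflake's implementation: the function name `recursive_split` is hypothetical, sizes are measured in characters, and chunk overlap is left out to keep the example short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (illustrative only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Out of separators (or reached the empty string): cut at fixed offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry it with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is short.\n\nPara two is a bit longer than the limit."
chunks = recursive_split(sample, chunk_size=20)
print(chunks)
```

The first paragraph fits within the limit and stays whole, while the second is too long for a paragraph or line split and ends up being divided on spaces.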

In [None]:
chunking_sql = """
    SELECT f.value::string AS chunk
    FROM PARSED_PDFS,
    LATERAL FLATTEN(
      INPUT => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
      PARSED_DATA, -- column to split
      'none', -- format ('none' or 'markdown')
      1000, -- chunk size
      100 -- overlap
      )
    ) f
"""

recursive_chunks = [row["CHUNK"] for row in session.sql(chunking_sql).collect()]

In [None]:
import re
from typing import List
from langchain.embeddings.base import Embeddings
from langchain_experimental.text_splitter import SemanticChunker
from snowflake.cortex import embed_text_768


# Creating a custom Langchain Embeddings class for Snowflake to use with the SemanticChunker
class SnowflakeCortexEmbeddings(Embeddings):
    def __init__(self):
        self.model = 'e5-base-v2'

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        results = []
        for text in texts:
            embedding = embed_text_768(self.model, text, session)
            results.append(embedding)
        return results

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]


snowflake_embeddings = SnowflakeCortexEmbeddings()
chunker = SemanticChunker(embeddings=snowflake_embeddings)

parsed_data_df = session.table('parsed_pdfs')
parsed_text = parsed_data_df.collect()[0]

# Removing line breaks to pass in a continuous string to the chunker
cleaned_text = re.sub(r'\n(?=\S)', ' ', parsed_text["PARSED_DATA"])

chunks = chunker.create_documents([cleaned_text])
semantic_chunks = [doc.page_content for doc in chunks]
            

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.core import Root

def is_title(line):
    stripped = line.strip()
    return (
        bool(stripped) and 
        stripped == stripped.upper() and 
        any(c.isalpha() for c in stripped)
    )
# Create chunks for each park and its project plan
def chunk_by_project(parsed_text):
    lines = parsed_text["PARSED_DATA"].splitlines()
    chunks = []
    current_title = None
    current_desc_lines = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if is_title(line):
            # Check if the next line is also a title (part of same heading)
            title_lines = [line]
            while i + 1 < len(lines) and is_title(lines[i + 1].strip()):
                i += 1
                title_lines.append(lines[i].strip())
            # If we already have a title and description, save that chunk
            if current_title:
                chunk = f"{current_title}\n{' '.join(current_desc_lines).strip()}"
                chunks.append({
                    "relative_path": parsed_text["RELATIVE_PATH"],
                    "size": parsed_text["SIZE"],
                    "file_url": parsed_text["FILE_URL"],
                    "chunk": chunk
                })
            # Start a new chunk
            current_title = ' '.join(title_lines)
            current_desc_lines = []
        else:
            current_desc_lines.append(line)
        i += 1

    # Add the last chunk
    if current_title and current_desc_lines:
        chunk = f"{current_title}\n{' '.join(current_desc_lines).strip()}"
        chunks.append({
            "relative_path": parsed_text["RELATIVE_PATH"],
            "size": parsed_text["SIZE"],
            "file_url": parsed_text["FILE_URL"],
            "chunk": chunk
        })

    return chunks

paragraph_chunks = chunk_by_project(parsed_text)

In [None]:
import pandas as pd


def count_tokens(text: str) -> int:
    # Tokenize the text in Snowflake and return the token count.
    # The text is passed as a bind parameter so quotes or `$$` inside a chunk cannot break the SQL.
    result = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COUNT_TOKENS('e5-base-v2', ?) AS token_count",
        params=[text],
    ).collect()
    return result[0]["TOKEN_COUNT"]

def get_summary_statistics(chunks, label):
    token_counts = [count_tokens(chunk if isinstance(chunk, str) else chunk["chunk"]) for chunk in chunks]
    return {
        "strategy": label,
        "num_chunks": len(chunks),
        "min_tokens": min(token_counts),
        "max_tokens": max(token_counts),
        "avg_tokens": sum(token_counts) / len(token_counts) if chunks else 0,
    }

all_summaries = [
    get_summary_statistics(recursive_chunks, "recursive"),
    get_summary_statistics(semantic_chunks, "semantic"),
    get_summary_statistics(paragraph_chunks, "paragraph")
]

summary_df = pd.DataFrame(all_summaries)
summary_df

In [None]:
# View paragraph chunks
for idx, chunk in enumerate(paragraph_chunks):
    if idx < 10:
        print(f'chunk {idx}:', chunk['chunk'])

In [None]:
from snowflake.cortex import embed_text_768

# Create embeddings for each chunk
model = 'e5-base-v2'
for chunk in paragraph_chunks:
    chunk['embedding'] = embed_text_768(model, chunk['chunk'], session)
    

In [None]:
from snowflake.snowpark.types import VectorType, DoubleType

# Save off chunks and embeddings into new table
df = session.create_dataframe(paragraph_chunks)
df = df.with_column('embedding', df.col('embedding').cast(VectorType(float, 768)))
df.write.save_as_table("DOCS_CHUNKS_TABLE", mode='overwrite')

In [None]:
-- Validate table
select chunk, embedding 
from DOCS_CHUNKS_TABLE
limit 10;

## Part 3: Test out different search methods

In this section, we will:

1. Perform a standard cosine similarity search
2. Create a Cortex Search Service
3. Perform a search against the Cortex Search Service

### Standard Similarity Search
Snowflake offers a built-in function, `VECTOR_COSINE_SIMILARITY`, for semantic similarity search. You provide the `VECTOR` column to search and the embedding of your search query, and it returns a similarity score (between -1 and 1) for each embedding in the column.

Note that when inserting data into a column of type `VECTOR`, Snowflake does not create an Approximate Nearest Neighbor (ANN) index like you would get with many vector databases. Every query is therefore scored against every row, so retrieval gets slower as the table grows.
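
As a rough illustration of what this kind of search computes, the sketch below implements cosine similarity and an exhaustive top-k scan in plain Python. The helper names `cosine_similarity` and `brute_force_search` and the two-dimensional toy vectors are assumptions made for the example; inside Snowflake the scoring is done by `VECTOR_COSINE_SIMILARITY`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, rows, k=3):
    """Score every row and keep the k best, as a scan without an ANN index has to."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in rows]
    return sorted(scored, reverse=True)[:k]


# Toy "embeddings": one aligned with the query, one orthogonal, one opposite.
rows = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 5.0]),
    ("opposite", [-1.0, 0.0]),
]
top = brute_force_search([1.0, 0.0], rows, k=2)
print(top)
```

Vector length does not affect the score: `[2.0, 0.0]` and `[1.0, 0.0]` point the same way, so their similarity is 1, while the opposite vector scores -1.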

In [None]:
-- Ault Park Trail
SELECT VECTOR_COSINE_SIMILARITY(
            docs_chunks_table.embedding,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'When will the Ault Park trail plan complete?')
       ) as similarity,
       chunk
FROM docs_chunks_table
ORDER BY similarity desc
LIMIT 10
;

### Snowflake Cortex Search Service

The Cortex Search Service provides a simple way to perform search against your data. It handles embedding data, loading it into a table, and provides a low-latency interface for searching. The search method is a hybrid of keyword and vector search and also has a built-in semantic reranker to provide the most relevant chunks of data.

All of this together provides a simple mechanism for high quality, performant search inside Snowflake. Given the speed increase when compared to standard cosine similarity search, Snowflake is likely creating an ANN index for Cortex Search, although it is not explicitly called out in the docs.
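
This lab does not show how Cortex Search combines its keyword and vector results, so the following is not a description of its internals. It is a small sketch of reciprocal rank fusion (RRF), one commonly used way to merge two ranked lists into a single hybrid ranking. The function name, the document IDs, and the constant `k=60` are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1),
    and the per-list scores are summed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["a", "b", "c"]  # hypothetical keyword-search order
vector_ranking = ["b", "c", "a"]   # hypothetical vector-search order
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Document `b` is ranked near the top by both lists, so it ends up first in the fused ranking even though only one list puts it in first place.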

In [None]:
CREATE OR REPLACE CORTEX SEARCH SERVICE parks_search_service
  ON CHUNK
  WAREHOUSE = compute_wh
  TARGET_LAG = '1 day'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-m-v1.5'
  AS (
    SELECT
        CHUNK
    FROM docs_chunks_table
);

In [None]:
# Quick test of the search service
import json
from snowflake.snowpark.context import get_active_session
from snowflake.core import Root

root = Root(session)
parks_search_service = (root
  .databases["RETRIEVAL_LAB"]
  .schemas["PUBLIC"]
  .cortex_search_services["parks_search_service"]
)

resp = parks_search_service.search(
  query="When will the Ault Park trail plan complete?",
  columns=["chunk"],
  limit=3
)

results = json.loads(resp.to_json())['results']

for idx, chunk in enumerate(results):
    print(f'Result: {idx+1}')
    print(chunk['chunk'])

## Part 4: Evals

In this section, we'll:

1. Build out a sample of questions and ground truth results
2. Create a framework for running our samples through both standard search and Cortex search
3. Perform the evaluation and display the results

Typically, a framework such as [ragas](https://docs.ragas.io/en/stable/) would be used to perform the evaluation, but in order to keep the lab setup minimal, we are going to do the eval without the help of an external library.

### Eval Metrics
Our first eval set is going to be searching for a single specific chunk. Below is a set of search queries with the chunk we would expect to get back. Given this focused search effort, we'll use the following metrics for evaluation:

- Hit@1 - determines if our search brought back the expected result as the first item retrieved
- Hit@3 - measures if our expected chunk was present in the top 3 results
- Mean Reciprocal Rank (MRR) - this metric is focused on how quickly the retrieval can find the first relevant result. Ex: if our expected chunk is at rank 1, then the reciprocal rank is 1. If our expected chunk is third, then the reciprocal rank is 1/3.
- Miss rate - the rate at which the retrieval did not return our expected chunk at all.
- P50/90/99 latency - this measures the speed of the retrieval results. For example, P50 is the time within which 50% of the requests were returned.
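
As a small worked example of the rank-based metrics, the sketch below computes them for four made-up queries. The helper `summarize_ranks` and the toy ranks are assumptions for illustration and are separate from the lab's own eval code.

```python
def summarize_ranks(ranks):
    """ranks holds the 1-based position of the expected chunk per query, or None for a miss."""
    total = len(ranks)
    return {
        "hit@1": sum(r == 1 for r in ranks) / total,
        "hit@3": sum(r is not None and r <= 3 for r in ranks) / total,
        "mrr": sum(1.0 / r for r in ranks if r is not None) / total,
        "miss_rate": sum(r is None for r in ranks) / total,
    }


# Four toy queries: found first, found third, not found, found second.
toy_ranks = [1, 3, None, 2]
metrics = summarize_ranks(toy_ranks)
print(metrics)
# hit@1 = 1/4, hit@3 = 3/4, miss_rate = 1/4, mrr = (1 + 1/3 + 0 + 1/2) / 4
```

A miss contributes 0 to the reciprocal-rank sum but still counts in the denominator, which is why MRR drops when results are missing entirely.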

In [None]:
exact_eval_set = [
    {
        "query": "When will the Ault Park trail plan complete?",
        "expected_chunk": "AULT PARK VALLEY TRAIL One of the busiest trails in Cincinnati Parks is experiencing serious erosion issues along the creek also housing important sewer infrastructure. This project shores up the trail to keep hikers safe in advance of a larger MSD sewer project in the coming years to protect the trail in the long term. The project is underway and will close out in early 2025."
    },
    {
        "query": "When was the Sawyer Point Park playground burnt down?",
        "expected_chunk": "SAWYER POINT PLAYGROUND AND PARK PLANNING Work is underway to restore the Sawyer Point Park playground, which was suddenly destroyed by a massive fire in November 2024. This project represents a chance to create an amazing new, uniquely Cincinnati, amenity for the next generation of park users of a wide range of ages and abilities to enjoy. The new playground will be built in the vicinity of the former playground though not in the same location. The goal is to engage with the community to develop something truly fantastic in this iconic regional park serving as a distinctive source of lasting pride for our city. The project also creates an opportunity to comprehensively review the layout of the park to guide a longer-term plan for improvements in the coming years."
    },
    {
        "query": "How many miles is the CROWN network?",
        "expected_chunk": "BRAMBLE PARK TRAIL This project is a partnership with the community and will utilize a State of Ohio Department of Natural Resources grant. This will be the first segment of the Little Duck Creek Trail and run 0.35 miles in length through Bramble Park in Madisonville. The trail will connect to the Murray Trail, part of the CROWN network connecting more than 104 miles of trails in Cincinnati. Planning will take place during 2025."
    },
    {
        "query": "When did Smale park open?",
        "expected_chunk": "SMALE CONCRETE & GRANITE UPGRADES This award-winning, heavily used signature Cincinnati Park opened in 2012. Sections of concrete and specialized granite need repair in order to maintain this regional asset."
    },
    {
        "query": "Which park was added to the National Register of Historic Places?",
        "expected_chunk": "GIBSON HOUSE ROOF & FAÇADE This architecturally important structure was built in the middle 19th century and added to the National Register of Historic Places in 1976. Critical repairs are needed to preserve this treasure, which is now used for offices and a rental venue. Construction is planned to begin early 2026."
    },
    {
        "query": "Who is the Park Board partnering with for the Smale River's Edge project?",
        "expected_chunk": "SMALE RIVER’S EDGE The U.S. Army Corps of Engineers and the Cincinnati Park Board are partnering on a study to improve and revitalize the Cincinnati Ohio River’s edge along the western edge of Smale Riverfront Park. The overall vision is to make the Cincinnati Riverfront a welcoming, safe, sustainable park, serving as a gateway to connect people to their heritage, community, and the natural environment for generations to come. The project will provide opportunities for ecosystem restoration and recreation, while protecting Cincinnati’s Riverfront from erosion. Initial design selection of this multi-million project will be complete in mid-2025 with construction planned to start in 2027."
    },
    {
        "query": "How long will the California Woods Hydrological plan take?",
        "expected_chunk": "CALIFORNIA WOODS HYDROLOGICAL PLAN DESIGN This amazing preserve is experiencing significant erosion issues that threaten long-term public access to the park. A specialized firm has been selected to develop a plan for sustainable interventions to the stream flow to mitigate the on-going erosion issues in the most environmentally sustainable manner. Investigation and design work begins in the second quarter of 2025 and will take about a year to finalize."
    },
    {
        "query": "Which park plan is partnering with the Cincinnati Off Road Alliance?",
        "expected_chunk": "MT. AIRY BIKE SKILLS COURSE This partnership with the Cincinnati Parks Foundation and the Cincinnati Off Road Alliance (CORA) will nearly double the existing mileage of mountain biking trails within Mt. Airy Forest. It will be the first beginner natural surface trail experience within the city. With input from the community, the project has been funded and a contractor selected. The project is anticipated to be complete in early 2026."
    },
    {
        "query": "How many acres is the Cincinnati Park system comprised of?",
        "expected_chunk": "CINCINNAT PARKS PARK IMPROVEMENT PROJECTS 3-YEAR PLAN Cincinnati Parks' 5,000 acres consist of 8 regional parks, 70 neighborhood parks, 34 preserves and natural areas, 5 parkways, 65 miles of hiking trails, 80,000 street trees on 1,000 miles of City streets, 6 nature centers, 18 scenic overlooks, 52 playgrounds, 500 landscaped gardens, and over 100 picnic areas. With all of this to care for, there are constant needs of all shapes and sizes. Whether it be a bad sidewalk, an aging playground, a leaking roof, or a park that could use a complete facelift, there’s plenty to do to keep our parks looking great and best serving our residents and users. This is why the Board of Park Commissioners approved a work plan, generated by Parks staff, outlining projects underway and planned over the next 3 years. This plan represents a roadmap of what Cincinnati Parks will be prioritizing in the coming years and creates transparency into improvement projects. This was developed after a careful evaluation based on a number of factors including safety, equity, efficiencies, long-term maintenance, available funding, and more. This plan represents current priorities, capacity, and needs, and is a living document that will be updated as circumstances evolve and schedules are adjusted."
    },
    {
        "query": "Which communities does the Burnet Woods dog park serve?",
        "expected_chunk": "BURNET WOODS DOG PARK This new community dog park will serve Clifton, Corryville, CUF, and the surrounding areas, further contributing to the attractiveness and quality of life. The project represents a partnership with a number of community supporters, partners, and donors, including the Cincinnati Parks Foundation and Clifton Pop-up-Pup-Party (PUPP). The new amenity is expected to be under construction in May 2025 and take about 3 months to complete."
    }
    
]

In [None]:
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
import json
import re

# The query must be embedded with the same model used for the chunks stored in DOCS_CHUNKS_TABLE
model = 'e5-base-v2'

def escape_sql_string(s):
    return s.replace("'", "''")

def vector_search(query, k=3):
    safe_query = escape_sql_string(query)
    return session.sql(f"""
        SELECT chunk, VECTOR_COSINE_SIMILARITY(
            docs_chunks_table.embedding,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('{model}', '{safe_query}')
        ) AS similarity
        FROM docs_chunks_table
        ORDER BY similarity DESC
        LIMIT {k}
    """).collect()

def cortex_search(query, limit=3):
    parks_search_service = (root.databases["RETRIEVAL_LAB"]
                                   .schemas["PUBLIC"]
                                   .cortex_search_services["parks_search_service"])
    resp = parks_search_service.search(
        query=query,
        columns=["chunk"],
        limit=limit
    )
    results = json.loads(resp.to_json())['results']
    return results


def normalize(text):
    # Lowercase, remove extra whitespace, and normalize newlines
    return re.sub(r'\s+', ' ', text.strip().lower())


In [None]:
import time

def run_exact_eval():
    results = []

    for item in exact_eval_set:
        query = item['query']
        expected = item['expected_chunk']

        # Vector search with timing
        start_time = time.time()
        vector_results = vector_search(query, k=3)
        vector_time = time.time() - start_time
        vector_chunks = [r['CHUNK'] for r in vector_results]
        vector_match_rank = next(
            (i + 1 for i, chunk in enumerate(vector_chunks)
             if normalize(chunk) == normalize(expected)),
            None
        )

        # Cortex search with timing
        start_time = time.time()
        cortex_results = cortex_search(query, limit=3)
        cortex_time = time.time() - start_time
        cortex_chunks = [r['chunk'] for r in cortex_results]
        cortex_match_rank = next(
            (i + 1 for i, chunk in enumerate(cortex_chunks)
             if normalize(chunk) == normalize(expected)),
            None
        )

        results.append({
            "query": query,
            "expected_chunk": expected,
            "vector_hit_rank": vector_match_rank or "Miss",
            "vector_chunks": vector_chunks,
            "vector_time": vector_time,
            "cortex_hit_rank": cortex_match_rank or "Miss",
            "cortex_chunks": cortex_chunks,
            "cortex_time": cortex_time
        })

    return results


import numpy as np

def compute_metrics(results, method):
    total = len(results)
    hit_at_1 = 0
    hit_at_3 = 0
    reciprocal_ranks = []
    ranks = []
    misses = 0
    times = []

    for r in results:
        hit_rank = r[f"{method}_hit_rank"]
        response_time = r[f"{method}_time"]
        times.append(response_time)

        if isinstance(hit_rank, int):
            if hit_rank == 1:
                hit_at_1 += 1
            if hit_rank <= 3:
                hit_at_3 += 1
            reciprocal_ranks.append(1.0 / hit_rank)
            ranks.append(hit_rank)
        else:
            misses += 1
            reciprocal_ranks.append(0.0)

    accuracy = hit_at_1 / total
    hit3 = hit_at_3 / total
    mrr = sum(reciprocal_ranks) / total
    mean_rank = sum(ranks) / len(ranks) if ranks else None
    miss_rate = misses / total
    avg_time = np.mean(times)

    p50 = np.percentile(times, 50)
    p90 = np.percentile(times, 90)
    p99 = np.percentile(times, 99)

    return {
        "Accuracy (Hit@1)": round(accuracy, 3),
        "Hit@3": round(hit3, 3),
        "MRR": round(mrr, 3),
        "Miss Rate": round(miss_rate, 3),
        "Avg Retrieval Time (ms)": round(avg_time * 1000, 3),
        "P50 Latency (ms)": round(p50 * 1000, 3),
        "P90 Latency (ms)": round(p90 * 1000, 3),
        "P99 Latency (ms)": round(p99 * 1000, 3)
    }


In [None]:
results = run_exact_eval()

vector_metrics = compute_metrics(results, method="vector")
cortex_metrics = compute_metrics(results, method="cortex")

df = pd.DataFrame({
    "Metric": list(vector_metrics.keys()),
    "Vector Search": list(vector_metrics.values()),
    "Cortex Search": list(cortex_metrics.values())
})

df

### Another round of evals
The first set of eval questions looked for exact matches. While that is one type of question we can expect, what about more open-ended questions that rely on multiple chunks? The following sample queries were chosen to minimize exact keyword matching and to evaluate how many relevant chunks were returned in the top 3 results.

For these questions, we will evaluate:
- Precision@k - the ratio of relevant chunks retrieved to the total number of chunks retrieved
- Recall@k - the ratio of relevant chunks retrieved to the total number of relevant chunks


Determining the value of `k` is an important consideration when evaluating retrieval performance. In our question set below, the first question has 7 relevant chunks, while the other two have only 3 each.

If we use `k=3`, then the maximum possible `recall@3` for the first question is limited to 3/7, even if the system retrieves the best results. On the other hand, if we use `k=7`, then the maximum possible `precision@7` for the second and third questions is only 3/7, since there are only 3 relevant chunks to be found.

To balance the tradeoff between precision and recall across questions with different numbers of relevant chunks, we’ll evaluate performance at multiple values of `k`: 3, 5, and 7.
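
The ceilings described above can be checked with a short sketch. The helper `precision_recall_at_k` and the placeholder chunk IDs are assumptions for illustration. The "perfect" ranking simply lists relevant chunks first.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for a ranked list and a set of relevant items."""
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k, hits / len(relevant)


ranking = [f"r{i}" for i in range(7)]         # placeholder chunk IDs in ranked order
seven_relevant = {f"r{i}" for i in range(7)}  # a question with 7 relevant chunks
three_relevant = {"r0", "r1", "r2"}           # a question with 3 relevant chunks

# Even a perfect top 3 can recover only 3 of the 7 relevant chunks.
p_small_k, r_small_k = precision_recall_at_k(ranking, seven_relevant, k=3)

# With k=7 and only 3 relevant chunks, at most 3 of the 7 retrieved can be relevant.
p_large_k, r_large_k = precision_recall_at_k(ranking, three_relevant, k=7)

print(p_small_k, r_small_k, p_large_k, r_large_k)
```

In both cases the retrieval is as good as it can be, yet one of the two metrics is capped at 3/7, which is the reason for reporting results at several values of `k`.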


In [None]:
open_eval_set = [
    {
        "query": "Which parks will have new slides within the next year?",
        "expected_chunks": [
            "AULT PARK PLAYGROUND Parks is working with the Ault Park Advisory Council (APAC), who are fundraising to supplement State of Ohio grant funding on this new playground. With the help of the community, the project has been designed. It is anticipated work will begin in the fall of 2025 with completion before the end of the year.",
            "SAWYER POINT PLAYGROUND AND PARK PLANNING Work is underway to restore the Sawyer Point Park playground, which was suddenly destroyed by a massive fire in November 2024. This project represents a chance to create an amazing new, uniquely Cincinnati, amenity for the next generation of park users of a wide range of ages and abilities to enjoy. The new playground will be built in the vicinity of the former playground though not in the same location. The goal is to engage with the community to develop something truly fantastic in this iconic regional park serving as a distinctive source of lasting pride for our city. The project also creates an opportunity to comprehensively review the layout of the park to guide a longer-term plan for improvements in the coming years.",
            "GLENWAY PARK RENOVATIONS & PLAYGROUND Parks is working with the Cincinnati Parks Foundation and the surrounding community to renovate Glenway Park in East Price Hill. Fundraising and conceptual planning are in development and include a new playground, new lighting, regrading, and walking path improvements to increase accessibility and visibility. Construction is slated to begin mid-2026.",
            "MCEVOY PARK IMPROVEMENTS This park renovation project looks to implement traffic calming measures, a new playground, pavilion renovations and some trail improvements. Planning will kick off late 2025 and is anticipated to last a year.",
            "MT. AIRY MCFARLAN PLAYGROUND Parks is working on developing a new signature adventure playground in Mt. Airy Forest off of McFarlan. The playground project would correspond with retirement of the Area 3 playground near the disc golf course that is at the end of its lifecycle. Planning for this project is planned to kick off in 2026.",
            # The trailing comma below keeps the last two chunks from being concatenated into one string
            "HOFFNER PARK PLAYGROUND This playground is due for replacement. Work will begin in mid-2026.",
            "MLK PARK RENOVATION Parks will work with the community to design a renovated park best serving the surrounding neighborhoods. Planning with the community for this project is expected to be completed in 2027."
        ]
    },
        {
        "query": "Which bicycling paths have updates planned?",
        "expected_chunks": [
            "TED BERRY INTERNATIONAL FRIENDSHIP PARK RESTORATION This park has been in a state of disrepair due to an emergency fix to a major water main running through the park. This project returns the original state of the park and is a joint venture between the Cincinnati Park Board and Greater Cincinnati Water Works. Key project elements include restoring bike paths, walkways, seating walls, pavers, new trees, sod, shrubs, electrical, lighting, irrigation, and more. This magnificent park is slated to be fully restored in May 2025.",
            "MT. AIRY BIKE SKILLS COURSE This partnership with the Cincinnati Parks Foundation and the Cincinnati Off Road Alliance (CORA) will nearly double the existing mileage of mountain biking trails within Mt. Airy Forest. It will be the first beginner natural surface trail experience within the city. With input from the community, the project has been funded and a contractor selected. The project is anticipated to be complete in early 2026.",
            "OHIO RIVER WAY BIKE TRAIL EXTENSION THROUGH SAWYER POINT & YEATMAN’S COVE This partnership with Great Parks, the City of Cincinnati, and Metro will design and construct a 4.75-mile shared use trail. It will go from Sawyer Point on the Cincinnati riverfront east to the Lunken Trail and the 78-mile Little Miami Scenic Trail. Planning is underway in 2025."
        ]
        
    },
    {
        "query": "What updates are planned for Smale Park?",
        "expected_chunks": [
            "SMALE LOT 23 ENHANCEMENTS This project is a funding partnership with the Cincinnati Parks Foundation and has been in the works for several years. It includes adding large garden planters, granite promenade seat walls, a sandstone and granite donor wall and cladding walls. Construction will begin in late 2025 with completion in 2026.",
            "SMALE RIVER’S EDGE The U.S. Army Corps of Engineers and the Cincinnati Park Board are partnering on a study to improve and revitalize the Cincinnati Ohio River’s edge along the western edge of Smale Riverfront Park. The overall vision is to make the Cincinnati Riverfront a welcoming, safe, sustainable park, serving as a gateway to connect people to their heritage, community, and the natural environment for generations to come. The project will provide opportunities for ecosystem restoration and recreation, while protecting Cincinnati’s Riverfront from erosion. Initial design selection of this multi-million project will be complete in mid-2025 with construction planned to start in 2027.",
            "SMALE CONCRETE & GRANITE UPGRADES This award-winning, heavily used signature Cincinnati Park opened in 2012. Sections of concrete and specialized granite need repair in order to maintain this regional asset.",
        ]
    }

]



In [None]:
def precision_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    hits = sum(normalize(r) in [normalize(chunk) for chunk in relevant] for r in retrieved_k)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    hits = sum(normalize(r) in [normalize(chunk) for chunk in relevant] for r in retrieved_k)
    return hits / len(relevant) if relevant else 0


def run_open_eval(k):
    results = []

    for item in open_eval_set:
        query = item['query']
        relevant_chunks = item['expected_chunks']

        # Vector search
        start_time = time.time()
        vector_results = vector_search(query, k=k)
        vector_time = time.time() - start_time
        vector_chunks = [r['CHUNK'] for r in vector_results]

        # Cortex search
        start_time = time.time()
        cortex_results = cortex_search(query, limit=k)
        cortex_time = time.time() - start_time
        cortex_chunks = [r['chunk'] for r in cortex_results]

        results.append({
            "query": query,
            "relevant_chunks": relevant_chunks,
            "vector_chunks": vector_chunks,
            "vector_time": vector_time,
            "cortex_chunks": cortex_chunks,
            "cortex_time": cortex_time
        })

    return results


In [None]:
def compute_open_metrics(results, method, k=3):
    precisions = []
    recalls = []
    times = []

    for r in results:
        relevant = r["relevant_chunks"]
        retrieved = r[f"{method}_chunks"]
        response_time = r[f"{method}_time"]
        times.append(response_time)

        p = precision_at_k(retrieved, relevant, k)
        rcl = recall_at_k(retrieved, relevant, k)

        precisions.append(p)
        recalls.append(rcl)

    avg_p = np.mean(precisions)
    avg_r = np.mean(recalls)
    avg_time = np.mean(times)

    p50 = np.percentile(times, 50)
    p90 = np.percentile(times, 90)
    p99 = np.percentile(times, 99)

    return {
        f"Avg Precision@{k}": round(avg_p, 3),
        f"Avg Recall@{k}": round(avg_r, 3),
        "Avg Retrieval Time (ms)": round(avg_time * 1000, 3),
        "P50 Latency (ms)": round(p50 * 1000, 3),
        "P90 Latency (ms)": round(p90 * 1000, 3),
        "P99 Latency (ms)": round(p99 * 1000, 3)
    }


In [None]:
import pandas as pd

ks = [3, 5, 7]
methods = ["vector", "cortex"]

rows = []

for k in ks:
    open_results_k = run_open_eval(k=k)  # Run retrieval at this k
    for method in methods:
        metrics = compute_open_metrics(open_results_k, method=method, k=k)
        for metric_name, value in metrics.items():
            rows.append({
                "k": k,
                "Method": method.capitalize(),
                "Metric": metric_name,
                "Value": value
            })

# Create the DataFrame
df_open = pd.DataFrame(rows)

# Optional: Pivot for better readability
df_pivot = df_open.pivot(index=["Metric", "k"], columns="Method", values="Value").reset_index()

df_pivot
