# Retrieval Hands-On Lab

## Objectives
By the end of this lab, participants will:

1. Understand how to parse PDFs inside Snowflake
2. Understand how to create vector representations of text data and load them into Snowflake tables
3. Perform similarity search against embeddings in Snowflake
4. Use Snowflake Cortex Search for retrieval and understand the benefits compared to simple similarity search

## Part 1 - Setup
In this section, we will:

1. Create some Snowflake objects to store our data in
2. Upload a PDF of Cincinnati Parks' 3-year development plan into a stage
3. Parse the PDF into usable text and load the results into a Snowflake table

In [None]:
CREATE OR REPLACE DATABASE RETRIEVAL_LAB;
CREATE OR REPLACE SCHEMA DATA;
USE RETRIEVAL_LAB.DATA;
CREATE OR REPLACE STAGE docs ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE') DIRECTORY = ( ENABLE = true );

Next, we will upload the PDF. First, navigate to the newly created stage:

1. Click on the Database icon on the left nav bar
2. Go to your 'RETRIEVAL_LAB' database
3. Expand the 'DATA' schema
4. Click on 'Stages'
5. Click on the 'DOCS' stage

Once you're at the stage, download the PDF and then upload it to the stage.

In [None]:
-- Open the URL generated below (valid for 3600 seconds)
SELECT GET_PRESIGNED_URL(@docs, 'cincinnati-parks-3-year-plan.pdf', 3600);

In [None]:
-- This table will store the text from the parsed PDF
CREATE OR REPLACE TABLE PARSED_PDFS ( 
    RELATIVE_PATH VARCHAR,
    SIZE NUMBER(38,0),
    FILE_URL VARCHAR,
    PARSED_DATA VARCHAR);

In [None]:
-- Use Snowflake Cortex's PARSE_DOCUMENT function to extract the text from each PDF in the stage and save it to a column
INSERT INTO PARSED_PDFS (relative_path, size, file_url, parsed_data)
SELECT
    relative_path,
    size,
    file_url,
    -- PARSE_DOCUMENT returns an object; keep only its "content" field as text
    SNOWFLAKE.CORTEX.PARSE_DOCUMENT('@docs', relative_path, { 'mode': 'OCR' }):content::VARCHAR AS parsed_data
FROM DIRECTORY(@docs);

In [None]:
-- Verify the data was successfully parsed
select * from PARSED_PDFS;

## Part 2 - Generate Embeddings

In this section, we will:

1. Explore various strategies for chunking the text data
2. Generate embeddings for our text chunks
3. Load the results into a Snowflake table using the `VECTOR` datatype

### Chunking Strategies

In this section, we'll explore three chunking strategies. The right strategy ultimately depends on the data and the use case. In our example, the PDF is cleanly delineated into titled paragraphs, so a simple regex-based chunker works well.

1. Snowflake Recursive Text Splitter (`SPLIT_TEXT_RECURSIVE_CHARACTER`)
2. Semantic Chunking (sentences grouped into chunks, using NLTK)
3. Simple Chunking (regex-based, one chunk per project title)

In [None]:
SELECT
  f.value::string AS chunk
FROM
  PARSED_PDFS,
  LATERAL FLATTEN(
    INPUT => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
      PARSED_DATA,
      'none',
      1000,
      100
    )
  ) f;
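
The `1000` and `100` arguments in the query above are the chunk size and the overlap, both in characters. The sketch below illustrates only the overlap idea with a plain sliding window; it is not Snowflake's recursive algorithm, which also tries to split on separators first. The function name `chunk_with_overlap` and the sample string are made up for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text, not taken from the PDF.
sample = "abcdefghij" * 3  # 30 characters
pieces = chunk_with_overlap(sample, chunk_size=12, overlap=4)
for piece in pieces:
    print(len(piece), repr(piece))
```

Each window repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.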

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.core import Root

session = get_active_session()

parsed_data_df = session.table('parsed_pdfs')
# Take the first row; this lab uploads a single PDF, so there is only one
parsed_text = parsed_data_df.collect()[0]


In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # Run once; recent NLTK releases look for 'punkt_tab' instead

# Extract text from your Snowflake row
text = parsed_text['PARSED_DATA']

# Split into sentences
sentences = sent_tokenize(text)

# Optional: greedily group sentences into chunks of up to ~500 characters
chunks = []
chunk = ""
for sentence in sentences:
    if len(chunk) + len(sentence) < 500:
        chunk += " " + sentence
    else:
        # Avoid appending an empty chunk when the very first sentence exceeds the limit
        if chunk.strip():
            chunks.append(chunk.strip())
        chunk = sentence
if chunk:
    chunks.append(chunk.strip())

# Print the chunks
for i, c in enumerate(chunks):
    print(f"Chunk {i+1}:\n{c}\n")

In [None]:
import re

def chunk_by_project(parsed_text):
    # Use a regex to find titles and the paragraphs that follow them.
    # A project title is a line in all caps, followed by a paragraph (possibly multiline).
    pattern = r'([A-Z0-9 ,&\-()]+)\n(.*?)(?=(?:\n[A-Z0-9 ,&\-()]+\n)|\Z)'  # \Z means end of string
    matches = re.findall(pattern, parsed_text['PARSED_DATA'], re.DOTALL)

    chunk_records = []
    for title, description in matches:
        clean_title = title.strip()
        clean_description = description.strip().replace('\n', ' ')
        text_chunk = f"{clean_title}\n{clean_description}"
        chunk_records.append({
            "relative_path": parsed_text["RELATIVE_PATH"],
            "size": parsed_text["SIZE"],
            "file_url": parsed_text["FILE_URL"],
            "chunk": text_chunk
        })
    return chunk_records
    

chunks = chunk_by_project(parsed_text)


In [None]:
for chunk in chunks:
    print(chunk)
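
To see what the pattern in `chunk_by_project` captures, it can help to run it on a tiny input. The sketch below reuses the same regex on a hypothetical two-project sample; the project titles appear elsewhere in this lab, but the descriptions are invented for illustration and the names `TITLE_AND_BODY` and `sample_chunks` are placeholders.

```python
import re

# Same pattern as chunk_by_project: an all-caps title line, then everything up
# to the next all-caps line or the end of the string.
TITLE_AND_BODY = r'([A-Z0-9 ,&\-()]+)\n(.*?)(?=(?:\n[A-Z0-9 ,&\-()]+\n)|\Z)'

# Hypothetical sample; the descriptions are invented.
sample = (
    "AULT PARK VALLEY TRAIL\n"
    "Restore the trail surface.\n"
    "Add signage.\n"
    "SAWYER POINT PLAYGROUND\n"
    "Replace play equipment."
)

matches = re.findall(TITLE_AND_BODY, sample, re.DOTALL)

sample_chunks = []
for title, body in matches:
    # Collapse the multi-line description into a single line, as the lab does.
    one_line_body = body.strip().replace("\n", " ")
    sample_chunks.append(f"{title.strip()}\n{one_line_body}")

for c in sample_chunks:
    print(repr(c))
```

One caveat of this approach: any line made up only of capitals, digits, spaces, and the listed punctuation (for example, a year on its own line) is treated as a title, so the parsed text is worth spot-checking.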

In [None]:
from snowflake.cortex import embed_text_768

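model = 'e5-base-v2'  # produces 768-dimensional embeddings
# Attach an embedding to each chunk record (one embedding call per chunk)
for chunk in chunks:
    chunk['embedding'] = embed_text_768(model, chunk['chunk'], session)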
    

In [None]:
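from snowflake.snowpark.types import VectorType

df = session.create_dataframe(chunks)
# Cast the list of floats to Snowflake's VECTOR(FLOAT, 768) type
df = df.with_column('embedding', df.col('embedding').cast(VectorType(float, 768)))
# Overwrite so the cell can be re-run without a "table already exists" error
df.write.mode("overwrite").save_as_table("DOCS_CHUNKS_TABLE")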

In [None]:
select * from DOCS_CHUNKS_TABLE;

In [None]:
select * from docs_chunks_table where contains(chunk, 'SAWYER POINT PLAYGROUND');

In [None]:
SELECT VECTOR_COSINE_SIMILARITY(
            docs_chunks_table.embedding,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', 'When will the Ault Park trail plan complete?')
       ) as similarity,
       chunk
FROM docs_chunks_table
ORDER BY similarity desc
LIMIT 3
;
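
The query above ranks chunks by the cosine similarity between each stored embedding and the embedding of the question. As a rough sketch of what that score means, the pure-Python version below computes cosine similarity on toy 3-dimensional vectors and keeps the top 2. The vectors and labels are invented; real `e5-base-v2` embeddings have 768 dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" keyed by a made-up chunk label.
toy_chunks = {
    "trail": [1.0, 0.0, 0.0],
    "playground": [0.0, 1.0, 0.0],
    "trail and playground": [1.0, 1.0, 0.0],
}
query = [2.0, 0.0, 0.0]

# Equivalent of ORDER BY similarity DESC LIMIT 2
ranked = sorted(
    toy_chunks,
    key=lambda name: cosine_similarity(query, toy_chunks[name]),
    reverse=True,
)
top_2 = ranked[:2]
print(top_2)
```

Because the score depends only on direction, the query `[2.0, 0.0, 0.0]` matches `[1.0, 0.0, 0.0]` perfectly even though the two vectors differ in length.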

## Part 3 - Create a Cortex Search Service for Advanced Hybrid Search

In [None]:
select * from docs_chunks_table;

In [None]:
CREATE OR REPLACE CORTEX SEARCH SERVICE parks_search_service
  ON CHUNK
  WAREHOUSE = compute_wh
  TARGET_LAG = '1 day'
  EMBEDDING_MODEL = 'snowflake-arctic-embed-m-v1.5'
  AS (
    SELECT
        CHUNK
    FROM docs_chunks_table
);

In [None]:
from snowflake.snowpark.context import get_active_session
from snowflake.core import Root

session = get_active_session()

root = Root(session)
parks_search_service = (root
  .databases["RETRIEVAL_LAB"]
  .schemas["DATA"]
  .cortex_search_services["parks_search_service"]
)

# Ask the same question as the cosine-similarity query and return the top 3 chunks
resp = parks_search_service.search(
  query="When will the Ault Park trail plan complete?",
  columns=["chunk"],
  limit=3
)
print(resp.to_json())


In [None]:
SELECT
  *
FROM
  TABLE (
    CORTEX_SEARCH_DATA_SCAN (
      SERVICE_NAME => 'parks_search_service'
    )
  );

In [None]:
select * from docs_chunks_table where contains(chunk, 'AULT PARK VALLEY TRAIL');
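
Hybrid search combines keyword-style matching, like the `contains` filter above, with vector similarity. This lab does not describe how Cortex Search merges the two signals, so the sketch below swaps in reciprocal rank fusion (RRF), a common general-purpose technique, purely as an illustration. The function name, the constant `k = 60`, and the chunk identifiers are all assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by name so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings for one query; the chunk names are placeholders.
keyword_ranking = ["chunk_a", "chunk_b", "chunk_c"]
vector_ranking = ["chunk_b", "chunk_d", "chunk_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Chunks that rank well in both lists rise to the top, which is the intuition behind preferring hybrid retrieval over similarity search alone: an exact title match and a semantically close paragraph can both contribute.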