## Download data and install libraries

In [1]:
import io
import zipfile
import requests
import frontmatter
 


In [None]:
# we download the repository as a zip file. GitHub provides a convenient URL format for this:
url = 'https://codeload.github.com/DataTalksClub/faq/zip/refs/heads/main'
resp = requests.get(url)
# NB!! This code downloads the file to memory, not to disk.
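
To make the "in memory, not on disk" point concrete, here is a minimal sketch that builds a tiny zip archive in a `BytesIO` buffer and reads it back the same way the notebook reads `resp.content`. The file names and contents are made up for illustration; no network access is involved.

```python
import io
import zipfile

# Build a small zip archive in memory (a stand-in for resp.content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf_out:
    zf_out.writestr("faq-main/README.md", "# Hello")
    zf_out.writestr("faq-main/notes.txt", "not markdown")
zip_bytes = buffer.getvalue()

# Read it back without writing anything to disk
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf_in:
    names = [info.filename for info in zf_in.infolist()]
    readme = zf_in.read("faq-main/README.md").decode("utf-8")

print(names)
print(readme)
```

Using `ZipFile` as a context manager closes the archive even if reading fails, which is a small convenience over calling `close()` by hand.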

In [None]:
repository_data = []

# Create a ZipFile object from the downloaded content
zf = zipfile.ZipFile(io.BytesIO(resp.content))

for file_info in zf.infolist():
    filename = file_info.filename.lower()

    # Only process markdown files
    if not filename.endswith('.md'):
        continue

    # Read and parse each file
    with zf.open(file_info) as f_in:
        content = f_in.read()
        post = frontmatter.loads(content)
        data = post.to_dict()
        data['filename'] = filename
        repository_data.append(data)

zf.close()


In [None]:
print(repository_data[1])
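
To see roughly what the frontmatter step produces, here is a simplified, standard-library-only sketch. It is not the `python-frontmatter` implementation: the helper `split_frontmatter` is hypothetical, handles only flat `key: value` lines, and keeps every value as a string, whereas a real YAML parser would convert types.

```python
def split_frontmatter(text):
    """Rough sketch: split a '---' delimited header from the body."""
    metadata = {}
    content = text
    if text.startswith("---\n"):
        header, sep, body = text[4:].partition("\n---\n")
        if sep:  # a closing '---' line was found
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon:
                    metadata[key.strip()] = value.strip()
            content = body
    # Same shape as the records above: metadata keys plus 'content'
    return {**metadata, "content": content.strip()}


sample = "---\ntitle: Example\nsort_order: 1\n---\n\nBody text here."
record = split_frontmatter(sample)
print(record)
```

The result has the same shape as the dictionaries in `repository_data`: one key per metadata field plus a `content` key holding the markdown body.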

In [2]:
# Complete Implementation
# Let's now put everything together into a reusable function:

import io
import zipfile
import requests
import frontmatter

def read_repo_data(repo_owner, repo_name):
    """
    Download and parse all markdown files from a GitHub repository.
    
    Args:
        repo_owner: GitHub username or organization
        repo_name: Repository name
    
    Returns:
        List of dictionaries containing file content and metadata
    """
    prefix = 'https://codeload.github.com' 
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    repository_data = []
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    
    for file_info in zf.infolist():
        filename = file_info.filename
        filename_lower = filename.lower()

        if not (filename_lower.endswith('.md') 
            or filename_lower.endswith('.mdx')):
            continue
    
        try:
            with zf.open(file_info) as f_in:
                content = f_in.read().decode('utf-8', errors='ignore')
                post = frontmatter.loads(content)
                data = post.to_dict()
                data['filename'] = filename
                repository_data.append(data)
        except Exception as e:
            print(f"Error processing {filename}: {e}")
            continue
    
    zf.close()
    return repository_data


In [3]:
# We can now use this function for different repositories:
    
dtc_faq = read_repo_data('DataTalksClub', 'faq')
evidently_docs = read_repo_data('evidentlyai', 'docs')

print(f"FAQ documents: {len(dtc_faq)}")
print(f"Evidently documents: {len(evidently_docs)}")


FAQ documents: 1219
Evidently documents: 95


In [None]:
dtc_faq[2]

In [None]:
len(evidently_docs[45]['content']) # 21712 
# This is too long - we need to apply chunking, say into pieces of 2k characters...

## Observations during data processing

Data Processing Considerations

For FAQ, the data is ready to use. These are small records that we can index (put into a search engine) as is.
For Evidently docs, the documents are very large. We need extra processing called "chunking" - breaking large documents into smaller, manageable pieces. This is important because:

Search relevance: Smaller chunks are more specific and relevant to user queries

Performance: AI models work better with shorter text segments

Memory limits: Large documents might exceed token limits of language models

We will cover chunking techniques below.
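
As a quick way to decide which documents need chunking at all, the sketch below separates records by content length. The helper name `split_by_length`, the 2000-character threshold and the sample records are assumptions for illustration only.

```python
def split_by_length(docs, max_chars=2000):
    """Separate documents that fit in one chunk from those that do not."""
    small, large = [], []
    for doc in docs:
        target = large if len(doc.get("content", "")) > max_chars else small
        target.append(doc)
    return small, large


docs = [
    {"title": "short note", "content": "x" * 300},
    {"title": "long tutorial", "content": "x" * 21712},
]
small_docs, large_docs = split_by_length(docs)
print(len(small_docs), len(large_docs))
```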

TODO

-- Create a new uv project in the project directory

-- Select a GitHub repo with documentation (preferably with .md files) - I can look for network security repos for example 

-- Download the data from there using the techniques we've learned


In [4]:
# This is what the document at index 45 looks like:

# {'title': 'LLM regression testing',
#  'description': 'How to run regression testing for LLM outputs.',
#  'content': 'In this tutorial, you will learn...'
# }

# The content field is 21,712 characters long. The simplest thing we can do is cut it into pieces of equal length. 
# For example, for size of 2000 characters, we will have:

# Chunk 1: 0..2000
# Chunk 2: 2000..4000
# Chunk 3: 4000..6000

# And so on.

# However, this approach has disadvantages:

# Context loss: Important information might be split in the middle
# Incomplete sentences: Chunks might end mid-sentence
# Missing connections: Related information might end up in different chunks

# That's why, in practice, we usually make sure there's overlap between chunks. For size 2000 and overlap 1000, we will have:

# Chunk 1: 0..2000
# Chunk 2: 1000..3000
# Chunk 3: 2000..4000
# ...

# This is better for AI because:

# Continuity: Important information isn't lost at chunk boundaries
# Context preservation: Related sentences stay together in at least one chunk
# Better search: Queries can match information even if it spans chunk boundaries

# This approach is known as the "sliding window" method. This is how we implement it in Python:

def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        if i + size >= n:
            break

    return result

evidently_overlapping_chunks_45 = sliding_window(evidently_docs[45]['content'], 2000, 1000)
evidently_overlapping_chunks_45[:2] # please note chunks are indeed OVERLAPPING below

[{'start': 0,
  'chunk': "In this tutorial, you will learn how to perform regression testing for LLM outputs.\n\nYou can compare new and old responses after changing a prompt, model, or anything else in your system. By re-running the same inputs with new parameters, you can spot any significant changes. This helps you push updates with confidence or identify issues to fix.\n\n<Info>\n  **This example uses Evidently Cloud.** You'll run evals in Python and upload them. You can also skip the upload and view Reports locally. For self-hosted, replace `CloudWorkspace` with `Workspace`.\n</Info>\n\n# Tutorial scope\n\nHere's what we'll do:\n\n* **Create a toy dataset**. Build a small Q&A dataset with answers and reference responses.\n\n* **Get new answers**. Imitate generating new answers to the same question.\n\n* **Create and run a Report with Tests**. Compare the answers using LLM-as-a-judge to evaluate length, correctness and style consistency.\n\n* **Build a monitoring Dashboard**. Get p
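
The overlap is easier to see on a tiny input. The sketch below repeats the `sliding_window` function so that it runs on its own, and adds a hypothetical helper, `expected_chunks`, that predicts how many chunks the loop produces.

```python
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    n = len(seq)
    result = []
    for i in range(0, n, step):
        result.append({'start': i, 'chunk': seq[i:i + size]})
        if i + size >= n:
            break
    return result


def expected_chunks(n, size, step):
    """Number of chunks sliding_window returns for a sequence of length n."""
    if n == 0:
        return 0
    if n <= size:
        return 1
    until_covered = -(-(n - size) // step) + 1  # ceil((n - size) / step) + 1
    all_starts = -(-n // step)                  # ceil(n / step)
    return min(until_covered, all_starts)


demo = sliding_window("abcdefghij", size=4, step=2)
print([c['chunk'] for c in demo])          # ['abcd', 'cdef', 'efgh', 'ghij']
print(expected_chunks(21712, 2000, 1000))  # 21
```

Each chunk shares its last two characters with the start of the next one. For the 21,712-character document, size 2000 and step 1000 give 21 chunks.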

In [None]:
# Let's process all the documents in Evidently text dump:

evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)

In [None]:
# Note that we use copy() and pop() operations:

# doc.copy() creates a shallow copy of the document dictionary
# doc_copy.pop('content') removes the 'content' key from the copy and returns its value
# The original document stays untouched, and the remaining keys (title, description, filename)
# are attached to every chunk with chunk.update(doc_copy).

# With these settings, we obtain 575 chunks from 95 documents.

# We can play with the parameters by including more or less content. 2000 characters is usually good enough for RAG applications.
len(evidently_chunks)
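
The `copy()` / `pop()` / `update()` pattern can be checked on a single made-up record. This is only an illustration; the record below is not taken from the dataset.

```python
doc = {"title": "LLM regression testing", "content": "abcdefgh"}

doc_copy = doc.copy()              # shallow copy: a new dict with the same values
content = doc_copy.pop("content")  # removes the key from the copy only

chunk = {"start": 0, "chunk": content[:4]}
chunk.update(doc_copy)             # attach the remaining metadata to the chunk

print(doc)    # still has 'content'
print(chunk)  # has 'start', 'chunk' and 'title'
```

One thing to keep in mind: `update()` overwrites existing keys, so a document whose metadata already contains a `start` or `chunk` field would clobber the chunk's own values.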

In [None]:
#evidently_chunks[:4]

# There are some alternative approaches:

# Token-based chunking: You first tokenize the content (turn it into a sequence of tokens, roughly words or word pieces) and then do a sliding window over tokens
# Advantages: More precise control over LLM input size
# Disadvantages: Doesn't work well for documents with code
# Paragraph splitting: Split by paragraphs
# Section splitting: Split by sections
# AI-powered splitting: Let AI split the text intelligently

# We won't cover token-based chunking here, as we're working with documents that contain code. But it's easy to implement - ask ChatGPT for help if you need it for text-only content.

# We will implement the others.
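
For reference, a token-based variant could look like the sketch below. It uses whitespace splitting as a stand-in tokenizer, which is an assumption: real LLM tokenizers split text into sub-word pieces, so the counts will differ. The function name `token_sliding_window` is hypothetical.

```python
def token_sliding_window(text, size, step):
    """Sliding window over whitespace-separated tokens instead of characters."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), step):
        window = tokens[i:i + size]
        chunks.append({"start_token": i, "chunk": " ".join(window)})
        if i + size >= len(tokens):
            break
    return chunks


token_chunks = token_sliding_window(
    "one two three four five six seven", size=4, step=2
)
for c in token_chunks:
    print(c)
```

Joining tokens back with single spaces discards the original newlines and indentation, which is one reason this approach is a poor fit for documents that contain code.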


## Chunks continued - Splitting by Paragraphs and Sections

In [None]:
# Splitting by paragraphs is relatively easy:

import re
text = evidently_docs[45]['content']
paragraphs = re.split(r"\n\s*\n", text.strip())

# We use \n\s*\n regex pattern for splitting:

# \n matches a newline
# \s* matches zero or more whitespace characters
# \n matches another newline
# So \n\s*\n matches two newlines with optional whitespace between them

# This works well for literature, but less well for technical documentation, where most paragraphs are very short.
paragraphs[:2]
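
Because paragraphs in technical docs tend to be short, one option is to group consecutive paragraphs until a size limit is reached and carry the last paragraph over into the next chunk. The sketch below is one possible way to do that, not a reference solution; `chunk_paragraphs` and its parameters are hypothetical names.

```python
import re


def chunk_paragraphs(text, max_chars=2000, overlap=1):
    """Group paragraphs into chunks of up to max_chars, overlapping by paragraphs."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    chunks = []
    current = []
    for paragraph in paragraphs:
        candidate = current + [paragraph]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the last `overlap` paragraphs into the next chunk
            carried = current[-overlap:] if overlap > 0 else []
            current = carried + [paragraph]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "aaaa\n\nbbbb\n \ncccc\n\ndddd"
para_chunks = chunk_paragraphs(sample, max_chars=10, overlap=1)
print(para_chunks)
```

A single paragraph longer than `max_chars` is kept whole here, and a carried-over paragraph can push a chunk slightly over the limit. Applying the character-based sliding window to oversized chunks would be a natural follow-up.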

In [None]:
# You can combine sliding window and paragraph splitting for more intelligent processing. We won't do it here, but it's a good exercise to try.

# Let's now look at section splitting. Here, we take advantage of the documents' structure. Markdown documents have this structure:

# # Heading 1
# ## Heading 2  
# ### Heading 3

# What we can do is split by headers.

# For that we will use regex too:

import re

def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    # Split and keep the headers
    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        # We step by 3 because regex.split() with
        # capturing groups returns:
        # [before_match, group1, group2, after_match, ...]
        # here group1 is "## ", group2 is the header text
        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()

        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections

# Note: This code may not work perfectly if we want to split by level 1 headings and have Python code with # comments. 
# But in general, this is not a big problem for documentation.

# If we want to split by second-level headers, that's what we do:

# sections = split_markdown_by_level(text, level=2)
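
The "step by 3" logic is easier to follow when you look at what `pattern.split()` returns for a small document. The sample text below is made up for illustration.

```python
import re

text = "Intro\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it."

pattern = re.compile(r'^(#{2} )(.+)$', re.MULTILINE)
parts = pattern.split(text)
print(parts)

# parts[0] is the text before the first header; after that the list
# repeats in triples: "## ", header text, content that follows
headers = [parts[i] + parts[i + 1] for i in range(1, len(parts), 3)]
bodies = [parts[i + 2].strip() for i in range(1, len(parts), 3)]
print(headers)
print(bodies)
```

Note that `parts[0]` (here `"Intro"`) sits before the first header. Since the loop in `split_markdown_by_level` starts at index 1, that leading text does not end up in any section, so documents with a long introduction may need it handled separately.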

In [None]:
# Now we iterate over all the docs to create the final result:

evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

# Like previously, copy() creates a copy of the document metadata. pop('content') removes and returns the content. 
# This way, each section gets the same metadata (title, description) as the original document.

# This was more intelligent processing, but we can go even further and use LLMs for that.
len(evidently_chunks)

In [None]:
evidently_chunks[:2]

## Intelligent Chunking with LLM

In [None]:
# In some cases, we want to be more intelligent with chunking. Instead of doing simple splits, we delegate this work to AI.

# This makes sense when:

# Complex structure: Documents have complex, non-standard structure
# Semantic coherence: You want chunks that are semantically meaningful
# Custom logic: You need domain-specific splitting rules
# Quality over cost: You prioritize quality over processing cost

# This costs money. In most cases, we don't need intelligent chunking.

# Simple approaches are sufficient. Use intelligent chunking only when

# You already evaluated simpler methods and you can confirm that they produce poor results
# You have complex, unstructured documents
# Quality is more important than cost
# You have the budget for LLM processing

# Note: You can use any alternative LLM provider. One option is Groq, which is free with rate limits. You can replace the OpenAI library with the Groq library and it should work.

# To continue, you need to get the API key from https://platform.openai.com/api-keys (assuming you have an account).

Let's stop Jupyter and create an environment variable with your key:

```
export OPENAI_API_KEY='your-api-key'
```

Install the OpenAI SDK:

```
uv add openai
```

Then run jupyter notebook:

```
uv run jupyter notebook
```

It's cumbersome to set environment variables every time. I recommend using direnv, which works for Linux, Mac and Windows.

Note: if you use direnv, don't forget to add .envrc to .gitignore.

Warning: Never commit your API keys to git! Others can use your API key and you'll pay for it.

In [None]:
# we have to enter OpenAI key here and then we can run all cells freely...
import os
from getpass import getpass
from openai import OpenAI

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

In [None]:
# Now we're ready to use OpenAI:

from openai import OpenAI

openai_client = OpenAI()


def llm(prompt, model='gpt-4o-mini'):
    messages = [
        {"role": "user", "content": prompt}
    ]

    response = openai_client.responses.create(
        model=model,
        input=messages
    )

    return response.output_text

# This code invokes an LLM (gpt-4o-mini) with the provided prompt and returns the results. 
# We will explain in more detail what this code does in the next lessons.

In [None]:
# Let's create a prompt:

prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()

# The prompt asks the LLM to:

# Split the document logically (not just by length)
# Make sections self-contained
# Use a specific output format that's easy to parse


In [None]:
# Let's create a function for intelligent chunking:

def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
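
The parsing step can be tried without calling the API. The sketch below uses a made-up response and a hypothetical helper, `parse_sections`, that splits only on lines consisting of `---`. This is a possible hardening, not part of the original function: a plain `split('---')` also cuts inside markdown table separators such as `|------|`.

```python
import re


def parse_sections(response):
    """Split on lines that contain only '---' (plus optional whitespace)."""
    parts = re.split(r'^\s*---\s*$', response, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]


fake_response = """## Install

Run `uv add openai`.

---

## Options

| flag | meaning |
|------|---------|
| -v   | verbose |

---
"""

sections = parse_sections(fake_response)
naive = [s.strip() for s in fake_response.split('---') if s.strip()]

print(len(sections))  # 2
print(len(naive))     # more than 2, because the table row contains '---'
```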

In [None]:
# Now we apply this to every document:

from tqdm.auto import tqdm

evidently_chunks = []

for doc in tqdm(evidently_docs):
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)

# tqdm is a library that shows progress bars. It helps you track progress when processing a large number of documents.

# Note: This process requires time and incurs costs. As mentioned before, use this only when really necessary. 
# For most applications, you don't need intelligent chunking.
# For reference, this particular run took about 30 minutes and cost roughly 5 US cents.

In [None]:
# evidently_chunks[:2]

In [None]:
# How to Choose a Chunking Approach
# You may wonder - which chunking should I use? The answer: start with the simplest one and gradually increase complexity. 
# Start with simple chunking with overlaps. We will later talk about evaluations. 
# You can use evaluations to make informed decisions about chunking strategies.

# Our data is ready. Now we can index it – insert it into a search engine and make it available for our (future) agent to use.

# TODO:
# For the project you selected, apply chunking
# Experiment with simple chunking, paragraph chunking + sliding window, and section chunking
# Which approach makes sense for your application? 
# Manually inspect the results and analyze what works best


## Text search

In [5]:
# The simplest type of search is a text search. Suppose we build a Q&A system for courses (using the FAQ dataset). 
# We want to find the answer to this question:
# "What should be in a test dataset for AI evaluation?"
# Text search works by finding all documents that contain at least one word from the query. 
# The more words from the query that appear in a document, the more relevant that document is.
# This is how modern search systems like Apache Solr or Elasticsearch work. They use indexes to efficiently search through millions of documents without having to scan each one individually.
# In this lesson, we'll start with a simple in-memory text search. The engine we will use is called minsearch.
# Note: This search engine was implemented as part of a workshop I held some time ago. You can find details here if you want to know how it works.

# Let's install it:
# uv add minsearch

# We will use it for chunked Evidently docs.
# To remind you, this is how we prepared the docs:
evidently_docs = read_repo_data('evidentlyai', 'docs')

evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)
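
To illustrate the idea of text search described above, here is a toy inverted index that scores documents by how many query words they contain. It is a deliberately simplified sketch and not how minsearch or Elasticsearch are implemented; real engines typically add weighting schemes such as TF-IDF or BM25. All names and sample documents are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, field):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for word in tokenize(doc[field]):
            index[word].add(doc_id)
    return index


def search(index, docs, query, num_results=3):
    """Rank documents by the number of distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [docs[doc_id] for doc_id, _ in ranked[:num_results]]


toy_docs = [
    {"chunk": "How to build a test dataset for evaluation"},
    {"chunk": "Installing the library with pip"},
    {"chunk": "Evaluation metrics for a dataset"},
]
toy_index = build_index(toy_docs, "chunk")
hits = search(toy_index, toy_docs, "test dataset for AI evaluation")
print(hits)
```

The index lets the search touch only documents that share at least one word with the query, which is why this approach scales without scanning every document. Common words like "for" count as much as rare ones here, which is exactly what weighting schemes are meant to correct.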

In [6]:
evidently_chunks[:2]

[{'start': 0,
  'chunk': '<Note>\n  If you\'re not looking to build API reference documentation, you can delete\n  this section by removing the api-reference folder.\n</Note>\n\n## Welcome\n\nThere are two ways to build API documentation: [OpenAPI](https://mintlify.com/docs/api-playground/openapi/setup) and [MDX components](https://mintlify.com/docs/api-playground/mdx/configuration). For the starter kit, we are using the following OpenAPI specification.\n\n<Card\n  title="Plant Store Endpoints"\n  icon="leaf"\n  href="https://github.com/mintlify/starter/blob/main/api-reference/openapi.json"\n>\n  View the OpenAPI specification file\n</Card>\n\n## Authentication\n\nAll API endpoints are authenticated using Bearer tokens and picked up from the specification file.\n\n```json\n"security": [\n  {\n    "bearerAuth": []\n  }\n]\n```',
  'title': 'Introduction',
  'description': 'Example section for showcasing API endpoints',
  'filename': 'docs-main/api-reference/introduction.mdx'},
 {'start'

In [7]:
# Let's now index this data with minsearch:
from minsearch import Index

index = Index(
    text_fields=["chunk", "title", "description", "filename"],
    keyword_fields=[]
)

index.fit(evidently_chunks)

<minsearch.minsearch.Index at 0x237b44aeba0>

In [8]:
# Here we create an index that will search through four text fields: chunk content, title, description, and filename. The keyword_fields parameter is for exact matches (we don't need it for now).
# We can now use it for search:
query = 'What should be in a test dataset for AI evaluation?'
results = index.search(query)
results

# This is text search, also known as "lexical search". We look for exact matches between our query and the documents.

[{'start': 0,
  'chunk': 'Retrieval-Augmented Generation (RAG) systems rely on retrieving answers from a knowledge base before generating responses. To evaluate them effectively, you need a test dataset that reflects what the system *should* know.\n\nInstead of manually creating test cases, you can generate them directly from your knowledge source, ensuring accurate and relevant ground truth data.\n\n## Create a RAG test dataset\n\nYou can generate ground truth RAG dataset from your data source.\n\n### 1. Create a Project\n\nIn the Evidently UI, start a new Project or open an existing one.\n\n* Navigate to “Datasets” in the left menu.\n* Click “Generate” and select the “RAG” option.\n\n![](/images/synthetic/synthetic_data_select_method.png)\n\n### 2. Upload your knowledge base\n\nSelect a file containing the information your AI system retrieves from. Supported formats: Markdown (.md), CSV, TXT, PDFs. Choose how many inputs to generate.\n\n![](/images/synthetic/synthetic_data_inputs_exa

## Vector search

In [None]:
# Text search has limitations. Consider these two queries:
# "I just discovered the program, can I still enroll?"
# "I just found out about the course, can I still join?"
# These ask the same question but share no common words (among important ones). Text search would fail to find relevant matches.
# This is where embeddings help. Embeddings are numerical representations of text that capture semantic meaning. 
# Words and phrases with similar meanings have similar embeddings, even if they use different words.
# Vector search uses these embeddings to identify semantically similar documents, rather than just exact word matches.
# For vector search, we need to turn our documents into vectors (embeddings).
# We will use the sentence-transformers library for this purpose.

# Install it:
# uv add sentence-transformers
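
Before bringing in a real model, the core idea of vector search can be sketched with hand-made vectors and cosine similarity. The three-dimensional "embeddings" below are invented for illustration. Real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors: similar meanings are given similar directions
doc_vectors = {
    "I just found out about the course, can I still join?": [0.8, 0.2, 0.3],
    "How do I install Docker?": [0.1, 0.9, 0.1],
}
# Pretend embedding of "I just discovered the program, can I still enroll?"
query_vector = [0.9, 0.1, 0.2]

ranked = sorted(
    doc_vectors,
    key=lambda text: cosine_similarity(query_vector, doc_vectors[text]),
    reverse=True,
)
print(ranked[0])
```

The query and the top result share almost no words, yet their vectors point in nearly the same direction, which is the property vector search relies on.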

