First of all, let's talk about the RAG (Retrieval-augmented generation) workflow.

Maybe you see some videos or blogs explaining it, balabala. **In my words, RAG is a way that reducing the length of your input to LLM, it likes a updated search engines(if you want)**.

For example, you ask the LLM Model `What is FPGA?`. 

If the developer train model by something about FPGA or MCU, maybe it can answer something. But if not, the model will say something randomly. So today LLM, like ChatGPT, use internet or self-dataset to search what they don't "remember".

Here also is problem, they don't know which is right, or they misunderstand by the volumious data. For a precise and reliable answer, RAG is all your need. 

You just need to prepare the data you think reliable, RAG will ask LLM with related-data (not all your data). And LLM also just answer your question by your data. 

So to develop a RAG, we need to do :
1. Prepare your data,
2. Embedding your data,
3. Finding related data,
4. Ask LLM with related data.

I will step-by-step develop a RAG system, and you will know the meanings and details of the above four steps.

We will use [Ollama](https://ollama.com/), [GLM-OCR](https://ollama.com/library/glm-ocr), [embeddinggemma](https://ollama.com/library/embeddinggemma) and [GPT-OSS 20B](https://ollama.com/library/gpt-oss:20b). 
>If your device can't use gpt-oss 20b, just change it to which LLM you can run. I recommand [Gemma3 4B](https://ollama.com/library/gemma3:4b), but it maybe not work well in multi-language data.

## Prepare data
I prepare some PDFs in `./pdfs`. They are my blogs and notes in Chinese and English. And they are about photography, mathematic, programming and Ubuntu. You can run the below cell to see:

In [None]:
!ls ./pdfs

/home/zhonguncle


## Embedding data
What is embedding?
In short, embedding is convert any data to a tensor (or vector). 

Why to do this? 
Now we can use mathematical methods to it, like calculating cosine distance of 2 tensor can find the similar data of specfied data. It is the core of Machine Learning: convert something to what we can do some computing on it. 

But the convert is not randomly, we need to make similar data closer in space. 
>If you want to know more about it, please watch [Stanford CS224N: NLP with Deep Learning | Spring 2024 | Lecture 1 - Intro and Word Vectors](https://www.youtube.com/watch?v=DzpHeXVSC5I). It show how words to a vector. Yes, you can convert image, even video to a vector. But this is not the content today.

You can train a embedding model from zero, but the size of data and gpu is huge, it is not for individual. So we use the [embeddinggemma](https://ollama.com/library/embeddinggemma) to do it.

This step can be devided to 5 samll steps:
1. Get text from PDF file (you can update it to many format after learning, like EPUB);
2. Split content to chunk,
3. Embedding chunk,
4. Save it (Optional),
5. Use it.


### Get text from PDF file
You can extract sentences from PDF file directly, but in actual, many PDF files can't be extracted sentences directly, such as pure images. 

In [2]:
import os
import re
import time
import gc
import base64
import ollama
from io import BytesIO
from pdf2image import convert_from_path

# ====================== OCR Model Configuration ======================
OCR_MODEL = "glm-ocr"
OCR_DPI = 100
OCR_PROMPT = "Text recognition. Accurately extract all text, LaTeX formulas, and symbols from the image. Output only the original content."
OLLAMA_HOST = "http://localhost:11434"


# ====================== Image to Base64 ======================
def image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="PNG", quality=80, optimize=True)
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

# ====================== OCR single page ======================
def ocr_single_page(page_img, prompt):
    img_base64 = image_to_base64(page_img)
    response = ollama.chat(
        model=OCR_MODEL,
        messages=[{'role': 'user', 'content': prompt, 'images': [img_base64]}],
    )
    return response["message"]["content"].strip()

In [3]:
# Step 1: Define the path of the PDF to process
pdf_path = "./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf"

# Start of core logic (all code in the original function is expanded flat)
full_text = ""  # Initialize the full text variable
try:
    # Get PDF file name and TXT save name
    pdf_file_name = os.path.basename(pdf_path)
    pdf_name = os.path.splitext(pdf_file_name)[0]
    txt_file_name = pdf_name + ".txt"
    
    # Print start prompt
    print(f"\nüöÄ GLM-OCR processing: {pdf_file_name}")
    
    # Convert PDF to images
    pages = convert_from_path(
        pdf_path, dpi=OCR_DPI, thread_count=1, use_pdftocairo=True, grayscale=True
    )
    total_pages = len(pages)

    # Iterate through each page to process OCR
    for idx, page_img in enumerate(pages):
        # Print progress (every 5 pages / last page)
        if (idx+1)%5 == 0 or (idx+1) == len(pages):
            print(f"   Recognizing: Page {idx+1}/{total_pages}")
        
        page_text = ""
        # Retry mechanism (max 2 attempts)
        for retry in range(2):
            try:
                # Call OCR and write to TXT
                with open(txt_file_name, "a", encoding="utf-8") as f:
                    page_text = ocr_single_page(page_img, OCR_PROMPT)
                    f.write(page_text)
                break  # Exit retry if successful
            except TimeoutError:
                print(f"   ‚ö†Ô∏è Page {idx+1} OCR timed out, retrying {retry+1}...")
                time.sleep(2)
            except Exception as e:
                print(f"   ‚ö†Ô∏è Page {idx+1} OCR failed: {e}")
                time.sleep(2)
        
        # Fill placeholder text when OCR fails
        if not page_text:
            page_text = f"[Recognition failed for Page {idx+1}]"
        
        # Concatenate text
        full_text += page_text + "\n\n"
        

    # Clean up redundant whitespace characters
    full_text = re.sub(r'\s+', ' ', full_text).strip()
    print(f"‚úÖ OCR completed! Text length: {len(full_text)} characters")

# Global exception handling
except Exception as e:
    print(f"‚ùå OCR failed: {str(e)}")
    full_text = ""


üöÄ GLM-OCR processing: Guide of developing Tang Nano FPGA on Mac.pdf
‚ùå OCR failed: Unable to get page count.
I/O Error: Couldn't open file './pdfs/Guide of developing Tang Nano FPGA on Mac.pdf': No such file or directory.



### Get the category and tags
After OCR, we can use the first 1000 words to get the category and tags of file. These will help better to find the related data.

Sending first 1000 words to LLM to get the category and tags:

In [4]:
# Specify the LLM model name for generating category lists (custom model name for ollama)
LLM_MODEL = "gpt-oss:20b"

# Construct prompt to extract categories from PDF text:
# - Require output to be ONLY a dictionary with "categories" key (list value)
# - No extra text/formatting outside the dictionary
# - Use first 1000 chars of OCR-extracted full text as input content
prompt = f"""
Based on the following content, output only Numpy Dict with no other content:
{{"categories":["..."]}}

Content:
{full_text[:1000]}
"""

# Call ollama chat API to generate the category dictionary using the specified LLM model
result = ollama.chat(model=LLM_MODEL, messages=[{"role":"user","content":prompt}])

Now we get the categories via Numpy Dict format string:

In [5]:
import ast

cat_tags = ast.literal_eval(result["message"]["content"])
print(cat_tags)

{'categories': []}


Then we create a categories_list:
>Why we not directly ask LLM generate like this. Because it is not stable and not easy to process.

In [6]:
categories_list = {}
for category in cat_tags["categories"]:
    if category not in categories_list:
        categories_list[category] = []
    categories_list[category].append(pdf_path)
    
categories_list

{}

### Spliting
Before embedding, we need sperate content to chunks.
> Chunk means it maybe large than one sentence.

Why split the content to chunk?

Back to the question `What is Log curve?`. When you ask this question, RAG will calculate the similarity between question `What is Log curve?` and some content (we will use the category and tags to help). If not spliting, we will ask LLM with whole related PDF files, not just related text. It may be over the limit of context of LLM and embeddinggemma, so we need to split the text to chunk and try our best to give more content.
> The limit of content of embeddinggemma is 2K, but we can't ask it with 2,000 words. Because the limit contain the answer and we may pass some chunks (not just 1 chunk). So give LLM space to answer.

In below function, we will split text about 500 words size by `["\n\n", "\n", "„ÄÇ", "ÔºÅ", "Ôºü", "Ôºõ", "Ôºå", "„ÄÅ", ".", "!", "?", ";", ","]`. In case split text in middle of sentence or word.

In [7]:
def split_text_chunks(text, chunk_size=500, chunk_overlap=50):
    """
    Split long text into semantically coherent chunks with natural language boundary handling
    (avoids mid-sentence splits by prioritizing common separators)
    
    Args:
        text (str): Input text to split
        chunk_size (int): Max character length per chunk (default: 500)
        chunk_overlap (int): Overlapping chars between chunks (default: 50)
    
    Returns:
        list[str]: Non-empty, stripped text chunks
    """
    chunks = []
    start = 0
    text_length = len(text)
    # Priority separators (semantic order: paragraph > sentence > punctuation)
    separators = ["\n\n", "\n", "„ÄÇ", "ÔºÅ", "Ôºü", "Ôºõ", "Ôºå", "„ÄÅ", ".", "!", "?", ";", ","]

    while start < text_length:
        end = start + chunk_size
        # Add remaining text as last chunk if end exceeds text length
        if end >= text_length:
            chunks.append(text[start:].strip())
            break
        
        temp_chunk = text[start:end]
        split_pos = -1
        # Find last valid separator (after overlap threshold)
        for sep in separators:
            pos = temp_chunk.rfind(sep)
            if pos != -1 and pos > (chunk_size - chunk_overlap):
                split_pos = pos
                break
        
        # Split at natural separator if found
        if split_pos != -1:
            chunk_end = start + split_pos + 1
            chunks.append(text[start:chunk_end].strip())
            start = chunk_end - chunk_overlap
        # Fallback: split at chunk size with overlap
        else:
            chunks.append(temp_chunk.strip())
            start = end - chunk_overlap

    # Filter out empty chunks
    return [c for c in chunks if c]

What is `chunk_overlap=50` meaning? 

`chunk_overlap` refers to the overlap length of text chunks, and its core purpose is to avoid semantic breaks caused by the abrupt truncation of text by preserving a section of overlapping content between adjacent text chunks, thus ensuring the coherence of context.

For example, after spliting, the length and content of first 2 chunks like below. You can see the ending of the first chunk `suitable for learning and putting into production,` is same as the beginning of the second chunk. 

In [8]:
chunks = split_text_chunks(full_text)
print(f'First chunk length: {len(chunks[0])}')
print(f'First chunk content:\n{chunks[0]}\n')

print(f'Second chunk length: {len(chunks[1])}')
print(f'Second chunk content:\n{chunks[1]}')

IndexError: list index out of range

### Embedding
Now we can embedding the content. New version Ollama has function to do it, we just need to set the model name, like below:

In [None]:
EMBEDDING_MODEL = "embeddinggemma"

embedding_result = ollama.embed(model=EMBEDDING_MODEL, input=full_text[0])

After embedding, checking the vector. The default length of generated vector is 768:

In [None]:
print(f'Size of embedded vector: {len(embedding_result["embeddings"][0])}')
print(embedding_result["embeddings"][0][:100])  # Just show first 100 elements

Size of embedded vector: 768
[-0.1795711, -0.014000532, 0.014086971, 0.020606868, 0.068530254, 0.042396046, -0.020073084, 0.021680696, 0.025249122, -0.039366826, 0.010949659, -0.05269665, 0.030286899, -0.01697931, 0.10812019, 0.014858677, 0.011771816, -0.026990928, -0.056859653, 0.009036296, 0.036543313, -0.0494702, 2.5608497e-06, 0.0112010455, 0.0136307785, 0.03635816, 0.02216107, -0.011967776, 0.020245204, -0.031006882, 0.029326187, -0.007984784, 0.027734365, -0.038165923, 0.0009162924, 0.051496238, 0.013571829, -0.06736813, 0.058040578, -0.011162806, -0.03662894, 0.060834225, -0.017807618, 0.010662863, -0.010802845, -0.024050152, -0.018274004, -0.056624085, -0.0406374, 0.02730967, 0.020485438, 0.03649262, -0.062459107, -0.0069091204, -0.01730355, -0.016423075, -0.05392794, -0.02272239, -0.014979191, 0.03398299, -0.03778724, -0.0021451856, 0.018650385, 0.011395624, 0.046265505, -0.02664927, -0.005651879, 0.02514516, 0.016128816, 0.28290936, -0.015962197, -0.040483765, -0.02541215, -0

Convert all chunks to vector parallelly:

In [None]:
# Import concurrent execution modules for parallel embedding generation
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import numpy as np

# Configuration for embedding generation
EMBEDDING_MODEL = "embeddinggemma"  # Ollama embedding model name
MAX_WORKERS = 16  # Max parallel workers (adjust based on CPU/GPU resources)

def embed_batch(chunk, model_name):
    """
    Generate embedding for a single text chunk using Ollama
    Args:
        chunk (str): Text chunk to embed
        model_name (str): Name of Ollama embedding model
    Returns:
        np.array: Embedding vector (empty array if failed)
    """
    try:
        # Call Ollama API to get embedding for the chunk
        res = ollama.embed(model=model_name, input=chunk)
        # Convert embedding to float32 numpy array for efficiency
        return np.array(res["embeddings"], dtype=np.float32)
    except:
        # Return empty array if embedding generation fails
        return np.array([])

def parallel_get_embeddings(chunks, model_name):
    """
    Generate embeddings for multiple text chunks in parallel
    Args:
        chunks (list[str]): List of text chunks to embed
        model_name (str): Name of Ollama embedding model
    Returns:
        np.array: 2D array of embeddings (empty array if all failed)
    """
    all_emb = []
    # Use process pool for parallel embedding (faster for CPU/GPU-bound tasks)
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submit embedding tasks for all chunks
        futs = [executor.submit(embed_batch, b, model_name) for b in chunks]
        # Collect valid embedding results
        for f in futs:
            be = f.result()
            if len(be) > 0:
                all_emb.append(be[0])
    
    # Stack valid embeddings into 2D array and print shape
    if all_emb:
        final = np.vstack(all_emb)
        print(f"‚úÖ Finished Embedding, Dims: {final.shape}")
        return final
    # Return empty array if no valid embeddings
    return np.array([])

# Generate embeddings for all text chunks in parallel
embeddings = parallel_get_embeddings(chunks, EMBEDDING_MODEL)
# Print embedding array to verify output
print(embeddings)

‚úÖ Finished Embedding, Dims: (20, 768)
[[ 0.00150642 -0.0393247  -0.03107799 ... -0.0295878  -0.00446127
  -0.02752576]
 [-0.06393804 -0.01268001 -0.02310357 ... -0.01874642  0.04281209
   0.03216418]
 [-0.0704116  -0.04516455  0.02668154 ...  0.00263846  0.03863657
  -0.00353938]
 ...
 [-0.04021888 -0.03375006  0.05549837 ... -0.0137254  -0.06254524
   0.01480906]
 [-0.08117593 -0.04744734  0.01369969 ... -0.03297434  0.01833711
  -0.03613842]
 [-0.09259374  0.04772294 -0.02131703 ...  0.03189305  0.02579138
   0.01827864]]


I recommend you to add file path together. It will helps you to use it.

In [None]:
full_embeddings = {}
full_embeddings[pdf_path]=embeddings
print(full_embeddings[pdf_path])

[[ 0.00150642 -0.0393247  -0.03107799 ... -0.0295878  -0.00446127
  -0.02752576]
 [-0.06393804 -0.01268001 -0.02310357 ... -0.01874642  0.04281209
   0.03216418]
 [-0.0704116  -0.04516455  0.02668154 ...  0.00263846  0.03863657
  -0.00353938]
 ...
 [-0.04021888 -0.03375006  0.05549837 ... -0.0137254  -0.06254524
   0.01480906]
 [-0.08117593 -0.04744734  0.01369969 ... -0.03297434  0.01833711
  -0.03613842]
 [-0.09259374  0.04772294 -0.02131703 ...  0.03189305  0.02579138
   0.01827864]]


### Save
Maybe you have question: Why return np.array?

Because we will use `.npy` file store these vectors and reload in future. 
> You also can use `.npz` to reduce file size, or use JSON, Parquet, FAISS/Chroma, etc. Suit yourself.
> 
> But `.npy` has one downsize: when you add new vectors to `.npy` file, it needs to load old vectors, concatenate together, and save. It can't add new vectors directly.

We need to store many data: embedded vectors, categories and tags, respective paths.
> If you are in development, I recommend you save chunks. Spliting spends much time.

In [None]:
print(full_embeddings)

{'./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf': array([[-0.09427399, -0.09588927,  0.06010244, ..., -0.00119149,
         0.03862596,  0.01399344],
       [-0.08799237, -0.01759241,  0.07082588, ..., -0.03582003,
         0.02450423, -0.0444714 ],
       [ 0.00150642, -0.0393247 , -0.03107799, ..., -0.0295878 ,
        -0.00446127, -0.02752576],
       ...,
       [-0.04021888, -0.03375006,  0.05549837, ..., -0.0137254 ,
        -0.06254524,  0.01480906],
       [-0.08117593, -0.04744734,  0.01369969, ..., -0.03297434,
         0.01833711, -0.03613842],
       [-0.09259374,  0.04772294, -0.02131703, ...,  0.03189305,
         0.02579138,  0.01827864]], shape=(22, 768), dtype=float32)}


In [None]:
# we will save data in `save` directory
SAVE_DIR = "save"
EMBEDDING_SAVE_PATH = os.path.join(SAVE_DIR, "embeddings.npy")

# save embedded vectors
np.save(EMBEDDING_SAVE_PATH, full_embeddings)

We also need to save categories:

In [None]:
# we will save data in `save` directory
SAVE_DIR = "save"
CATEGORIES_LIST_SAVE_PATH = os.path.join(SAVE_DIR, "categories_list.npy")

np.save(CATEGORIES_LIST_SAVE_PATH, np.array(categories_list, dtype=object))

### Load
Now we try to load `whole_info.npy` and `cat_tags.npy` file:
> load `cat_tags.npy` is for find the closest keyword.

In [None]:
SAVE_DIR = "save"
CATEGORIES_SAVE_PATH = os.path.join(SAVE_DIR, "categories_list.npy")
EMBEDDING_SAVE_PATH = os.path.join(SAVE_DIR, "embeddings.npy")

categories_list_loaded = np.load(CATEGORIES_SAVE_PATH, allow_pickle=True).item()
embeddings_loaded = np.load(EMBEDDING_SAVE_PATH, allow_pickle=True).item() # .item() convert array to dict

In [None]:
print(f'{cat_tags_loaded}\n')
print(embeddings_loaded)

{'FPGA': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Tang Nano': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Mac': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'GOWIN IDE': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Text Editor': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Clang/LLVM': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Hardware Development': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Embedded Systems': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Development Workflow': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf'], 'Open Source': ['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf']}

{'./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf': array([[-0.09427399, -0.09588927,  0.06010244, ..., -0.00119149,
         0.03862596,  0.01399344],
       [-0.08799237, -0.01759241,  0.07082588, ..., -0.03582003,
         0.02450423, -0.0444714 ],
       [

## Finding related data
Ok, we arrive at the core of RAG: Finding related data.

This step can be devided to 5 samll steps:
1. Try to find categories, tags (This step can narrow down the search scope), 
2. Embed your question,
3. Find the related chunks by similarity.

Maybe you noticed, it works like a search engine. So you can modify it to a local search engine. Without exact match, you can find where it appear and in which file. 

### Try to find categories, tags
Back to the question `What is FPGA?`, finding the closest categories. 

#### First Method
It counts the number of full occurrences of each preset keyword in the lowercase text chunk as the matching score for that keyword, and returns the keyword with the highest score as the classification result (or "General Document" if no keywords match).

In [None]:
def classify_text_chunk(chunk):    
    """
    Classify text chunk by keyword count (case-insensitive)
    Args:
        chunk (str): Text chunk to classify
    Returns:
        str: Top matching category or "General Document" (no matches)
    """
    # Normalize chunk to lowercase for consistent matching
    chunk_lower = chunk.lower()
    # Initialize score tracker for all predefined categories
    category_scores = {cat: 0 for cat in categories_list_loaded}

    # Count category keyword occurrences in chunk
    for category in categories_list_loaded:
        category_scores[category] = chunk_lower.count(category.lower())
    
    # Get highest score and return corresponding category (or default)
    max_score = max(category_scores.values())
    return max(category_scores, key=category_scores.get) if max_score > 0 else "General Document"
    
def classify_chunks(text):
    """
    Batch classify list of text chunks
    Args:
        text (list[str]): List of text chunks
    Returns:
        list[str]: Categories for each chunk
    """
    # Apply single-chunk classification to all chunks
    return [classify_text_chunk(c) for c in text]

Try it:

In [None]:
question="What is FPGA?"
keyword = classify_text_chunk(question)
print(keyword)

FPGA


This algorithm is very simple: we just get 1 keyword. It is for showing. In actual, I recommend you to get some keywords, it will works better.

Now we can get this keyword embeddings. 

In [None]:
cat_tags_loaded[keyword]

['./pdfs/Guide of developing Tang Nano FPGA on Mac.pdf']

#### Second Method
Or you can generate embedding vectors to calculate similarity.

To calculate similarity, we need to calculate cosine similarity of two tensors or vectors. Cosine similarity means the angle between two vectors by cosine. 

$$
\text{cosine similarity} := \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Why we just calculate the distance of two vectors? Because the range of cosine similarity is 0 to 1, satisfies the axioms of probability.

In [None]:
def cosine_similarity(vec1, vec2):
    """
    Calculate cosine similarity between two vectors (normalized dot product)
    Args:
        vec1 (np.array): First embedding vector
        vec2 (np.array): Second embedding vector
    Returns:
        float: Cosine similarity score (0.0 if either vector is zero)
    """
    # Compute L2 norm of each vector (magnitude)
    n1 = np.linalg.norm(vec1)
    n2 = np.linalg.norm(vec2)
    
    # Return 0 if either vector has zero norm (avoid division by zero)
    if n1 == 0 or n2 == 0:
        return 0.0
    
    # Calculate normalized dot product (cosine similarity)
    return np.dot(vec1/n1, vec2/n2)

# Initialize variables to track top matching category
max_similarity = -1.0  # Track highest similarity score
best_category = ""     # Track category with highest similarity
    
# Generate embedding for user's question
question_embedding = ollama.embed(model=EMBEDDING_MODEL, input=question)["embeddings"][0]

# Iterate through all categories to find most similar one
for category in categories_list_loaded:
    # Generate embedding for current category and calculate similarity to question
    similarity = cosine_similarity(question_embedding, ollama.embed(model=EMBEDDING_MODEL, input=category)["embeddings"][0])
    
    # Print category (left-aligned) and similarity (12 decimal places)
    print(f"{category:<{20}} {similarity:.{12}f}")
    
    # Update top category if current similarity is higher
    if similarity > max_similarity:
        max_similarity = similarity
        best_category = category

# Print the final best-matching category
print(f'\nBest Category: {best_category}')

FPGA                 0.722461246440
Tang Nano            0.416567403635
Mac                  0.486751758428
GOWIN IDE            0.425685658280
Text Editor          0.422178165151
Clang/LLVM           0.492924708220
Hardware Development 0.623216937434
Embedded Systems     0.624374652789
Development Workflow 0.494293051459
Open Source          0.465430764758

Best Category: FPGA


### Embed question
This step is easy, we meet it many times before:

In [None]:
question_embedding = ollama.embed(model=EMBEDDING_MODEL, input=question)["embeddings"][0]
print(question_embedding[:100])

[-0.15408693, -0.026008384, -0.020546477, 0.0059912973, 0.019676102, -0.011539707, -0.0008004644, 0.012970376, 0.030865138, 0.048791505, -0.01950369, -0.044546984, 0.022569103, -0.0025020633, 0.055862352, 0.019775027, 0.07262218, -0.013940342, -0.032102257, -0.034715135, 0.0053905854, -0.013554472, -0.012396905, 0.053742614, 0.0040722643, 0.0032516464, 0.0068892143, -0.06406595, -0.029328734, 0.009482054, 0.0413703, 0.0016903707, 0.044082124, -0.010227873, -0.0008136854, 0.040183045, -0.011929971, -0.047931533, 0.010604784, -0.004251502, -0.009085997, 0.08101945, -0.0062629674, -0.0031834245, -0.0057065976, -0.06732319, -0.044221506, -0.06720558, 0.009620878, -0.012304886, -0.0037627423, -0.015941259, -0.013720847, 0.034674466, -0.0066491696, 0.023288716, -0.053317003, 0.0061896513, 0.02162611, 0.05568433, -0.046298698, -0.0016431058, 0.010757171, -0.027773952, 0.024997184, 0.04797287, -0.02842763, 0.005134202, -0.0047964654, 0.18524532, 0.013712985, -0.01291761, -0.060923137, -0.03661

### Find the related chunks by similarity
Ok, finally, we arrive at the core of RAG. I have shown all knowlegde needed before, so write a loop to find closest chunks.

In [None]:
results = []
for file in embeddings_loaded:
    index = 0
    for idx, embedding in enumerate(embeddings_loaded[file]):
        similarity = cosine_similarity(question_embedding, embedding)
        results.append((idx, similarity, chunks[idx]))
        print(similarity)

0.2854506568758853
0.4253609639088747
0.20651224128702755
0.3126381882554131
0.3249318263069938
0.38123328018590286
0.34579035236468514
0.17157891123196545
0.363234404828212
0.41283551366441573
0.23944850963462253
0.38228374480729826
0.42458171746203693
0.44611595463853976
0.27074020687240574
0.33667093007448123
0.22012099728769222
0.3128852069444872
0.4832363620397254
0.41481237150705463
0.2927572640138574
0.26775386458528444


In [None]:
# Sort the results
results.sort(key=lambda x: x[1], reverse=True)
# Show top 5 chunks idx and similarity
results[:5]

[(18,
  np.float64(0.4832363620397254),
  '| Sipeed Tang Nano: | GW1N-LV1QN48C6/I5 | tangnano | | Sipeed Tang Nano 1K | GW1NZ-LV1QN48C6/I5 | tangnano1k | | Sipeed Tang Nano 4K | GW1NSR-LV4CQN48PC7/I6 | tangnano4k | | Sipeed Tang Nano 9K | GW1NR-LV9QN88PC6/I5 | tangnano9k | | Seeed RUNBER | GW1N-UV4LQ144C6/I5 | runber | | @Disasm honeycomb | GW1NS-UX2CQN48C5/I4 | honeycomb | Next, if you have a Tang nano 9K like me, then use the following command (the device cannot write the long one in table): ```bash $ gowin_pack -d GW1N-9C -o top.'),
 (13,
  np.float64(0.44611595463853976),
  's @(posedge clk) begin clockCounter <= clockCounter + 1; if (clockCounter == WAIT_TIME) begin clockCounter <= 0; cur_state <= cur_state << 1; if (cur_state == 6\'b000000) begin ``` end end assign led[5:0] = cur_state[5:0]; endmodule // tangnano9k.cst IO_LOC "clk" 52; IO_PORT "clk" PULL_MODE=UP; IO_LOC "led[0]" 10; IO_LOC "led[1]" 11; IO_LOC "led[2]" 13; IO_LOC "led[3]" 14; IO_LOC "led[4]" 15; IO_LOC "led[5]" 16

## Ask LLM

We arrive at the end of the RAG. Just ask LLM with above selected chunks:

In [None]:
def generate_answer(query, relevant_chunks):
    if not relevant_chunks:
        return "No relevant content found"
    # concentrate Top 5 chunks together 
    ctx = "\n\n".join([f"{item[1][2]}" for item in enumerate(relevant_chunks)])
    prompt = f"""Please answer truthfully based on the context, prioritize extracting specific numbers and details, say 'I don't know' if you don't know, and do not fabricate content.
Context:
{ctx}
Question: {query}"""
    try:
        resp = ollama.chat(model=LLM_MODEL, messages=[{"role":"user","content":prompt}])
        return resp["message"]["content"]
    except:
        return "Failed to generate answer"

Let's try the question:

In [None]:
question="What is FPGA?"
relevant_chunks = results[:5]
generate_answer(question, relevant_chunks)

'**FPGA** stands for **Field‚ÄëProgrammable Gate Array**.  \nIt is a type of integrated circuit that contains an array of configurable logic blocks (CLBs), input/output (I/O) cells, programmable routing resources, and sometimes embedded memory or DSP blocks. Unlike fixed‚Äëlogic ASICs, an FPGA can be programmed after manufacture by loading a configuration file (often called a *bitstream*) that tells the device how to interconnect its internal blocks to implement a desired digital circuit.\n\nKey points from the context:\n\n| Device name | Part number | Board name | FPGA family |\n|-------------|-------------|------------|-------------|\n| Sipeed Tang Nano 9K | **GW1NR‚ÄëLV9QN88PC6/I5** | tangnano9k | Gowin GW1N‚Äë9C family |\n| Sipeed Tang Nano 4K | GW1NSR‚ÄëLV4CQN48PC7/I6 | tangnano4k | Gowin GW1N‚Äë4C family |\n| Sipeed Tang Nano 1K | GW1NZ‚ÄëLV1QN48C6/I5 | tangnano1k | Gowin GW1N‚Äë1C family |\n| Sipeed Tang Nano | GW1N‚ÄëLV1QN48C6/I5 | tangnano | Gowin GW1N‚Äë1C family |\n| Seeed R

Ask a question which isn't about this PDF:

In [None]:
question="What is Log curve?"
relevant_chunks = results[:5]
generate_answer(question, relevant_chunks)

"I don't know."

The LLM will not answer with lies.

Next, we will make it auto loads all PDF files and processes.