# Building a RAG System Locally with Ollama, LlamaIndex, and Chroma DB

## Exercise 0 - Install Workshop Dependencies

Before starting the workshop, ensure all necessary dependencies are installed in your Python environment. Use the following steps to set up your environment.

### Step 1: Create a Virtual Environment

Create and activate a virtual environment to isolate the workshop dependencies. For this workshop, we use **Python 3.11**. Use either **venv** or **conda**; if you have Mamba installed, you can use `mamba create` in place of `conda create` for faster dependency solving.

##### Using `venv`

On Linux/Mac:
  ```bash
  python3.11 -m venv local-rag
  source local-rag/bin/activate
  ```
On Windows:
  ```bash
  python3.11 -m venv local-rag
  local-rag\Scripts\activate
  ```

##### Using `conda`

   ```bash
   conda create -n local-rag python=3.11
   conda activate local-rag
   ```

### Step 2: Install Required Packages

Install all the required dependencies:

```bash
pip install -r requirements.txt
```

### Step 3: Verify Installation

Check that the key packages are installed correctly by importing them in Python:

In [None]:
import chromadb
import llama_index
import ollama

print("Dependencies installed successfully!")
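
If one of the imports fails, it can help to see which versions are actually installed. The sketch below uses only the standard library's `importlib.metadata`. The distribution names (`chromadb`, `llama-index`, `ollama`) are assumptions, since `requirements.txt` is not shown here, and `installed_versions` is a hypothetical helper; adjust both as needed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Assumed distribution names; adjust to match your requirements.txt.
for name, ver in installed_versions(["chromadb", "llama-index", "ollama"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```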

## Exercise 1 - Setting up Ollama

### Install Ollama

First, download and install Ollama from the official website: [https://ollama.com/download/](https://ollama.com/download/).

### Pull Required Models

Open a terminal and run the following commands to download the necessary models:

1. Pull the `llama3` model:
   ```bash
   ollama pull llama3
   ```

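2. Optionally, pull the Nomic embedding model. The exercises below embed with a Hugging Face model, so this is only needed if you want to generate embeddings through Ollama:
   ```bash
   ollama pull nomic-embed-text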
   ```

### Run the Model

Once the models are installed, you can run the `llama3` model and test it by writing some prompts. Use the following command:

```bash
ollama run llama3
```

Type a prompt and observe the output to ensure everything is working correctly.
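
The Python client used in the rest of the workshop talks to a local Ollama server over HTTP, so it is worth checking that the server is reachable before moving on. The sketch below assumes the default local address `http://localhost:11434` and queries the `/api/tags` endpoint, which lists locally available models. `ollama_is_running` is a hypothetical helper name, not part of the `ollama` package.

```python
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the server answers GET /api/tags with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False


print("Ollama reachable:", ollama_is_running())
```

If this prints `False`, start the Ollama application (or run `ollama serve` in a terminal) and try again.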

### Interact with Ollama in Python

You can also call the model from Python through the `ollama` package. The cell below streams the answer and prints each chunk as it arrives.



In [None]:
import ollama

response = ollama.generate(model="llama3", prompt="What is EPFL?", stream=True)

for r in response:
    print(r["response"], end="")

## Exercise 2 - Getting Started with LlamaIndex and ChromaDB

**LlamaIndex** ([official site](https://llamaindex.ai)) is a framework for connecting LLMs with data sources, enabling efficient retrieval and interaction with structured or unstructured data.

**Chroma** ([official site](https://www.trychroma.com)) is a vector database designed for managing embeddings and serving as a retrieval layer for LLM applications.

In this exercise, we’ll explore how to set up and use LlamaIndex to index and retrieve data in a **Chroma** database.

### Step 0: Let's download a PDF

You can start by adding documents to the `./docs` folder. If you don't know what to use, we suggest downloading the PDF at the following link:

https://observationofalostsoul.wordpress.com/wp-content/uploads/2011/05/the-gospel-of-the-flying-spaghetti-monster.pdf

### Step 1: Set Up Chroma as the Storage Backend

Initialize the Chroma database and configure it for use with LlamaIndex. Here, we create an **Ephemeral Client** and collection, which stores data temporarily in memory without persisting it. This is ideal for testing and experimentation.

In [None]:
import chromadb

chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.get_or_create_collection("mydocs")

You can also create a **Persistent Client** that will preserve your database across sessions with:

```python
chroma_client = chromadb.PersistentClient(path="/path/to/save/to")
```

### Step 2: Set Up LlamaIndex connectors

Configure LlamaIndex to connect with Chroma as the vector store and set up a storage context. A **storage context** is an abstraction that manages how data is stored and retrieved, enabling seamless integration with different storage backends like Chroma.

In [None]:
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

### Step 3: Load and explore documents

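We can use LlamaIndex's `SimpleDirectoryReader` to **ingest documents from a directory**. It reads every supported file under the given directory (including subdirectories when `recursive=True`) and returns a list of `Document` objects carrying the text and file metadata. For PDFs this typically yields one `Document` per page; splitting into smaller chunks happens later, at indexing time.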

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./docs", recursive=True).load_data()

Let's explore the content of the documents further with a dataframe.

In [None]:
from typing import List

import pandas as pd
from llama_index.core.schema import TextNode


def data_to_df(nodes: List[TextNode]):
    """Convert a list of TextNode objects to a pandas DataFrame."""
    return pd.DataFrame([node.dict() for node in nodes])

In [None]:
document_df = data_to_df(documents)

document_df.head()

The DataFrame has several columns, including `metadata`, `text`, and `text_template`. Let's focus on these three:

- **`metadata`**: This attribute contains additional information about the document, such as its source, creation date, or tags that can be used for filtering or retrieval purposes.
- **`text`**: The main content of the document, representing the raw textual data that will be indexed and queried.
- **`text_template`**: A format string that controls how the metadata and the text are combined when the document is rendered for the embedding model or the LLM.

These attributes play distinct roles in organizing and interacting with your data. Feel free to explore the different attributes at this point.

### Step 4: Index the documents

To ingest documents into an index, we need an embedding model that converts each piece of text into a vector. Texts with similar meaning map to nearby vectors, which is what makes similarity search and retrieval possible.

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

In LlamaIndex, we can create an index with the `VectorStoreIndex` class. It splits the documents into nodes, embeds each node with the given embedding model, and stores the embeddings in the configured backend. Here we reuse the Chroma collection defined earlier, passed in through `storage_context`.

In [None]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True,
)

### Step 5: Query the Index for Retrieval

Once the documents are indexed, we can perform retrieval on them. This allows us to ask questions or search for relevant content based on the embeddings stored in the index.

In [None]:
retriever = index.as_retriever(
    similarity_top_k=3,
)

nodes_with_score = retriever.retrieve("What is the Flying Spaghetti Monster?")
nodes = [n.node for n in nodes_with_score]
data_to_df(nodes)

Congrats! You've retrieved your first data!
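
To build intuition for what `similarity_top_k=3` asks for, here is a toy version of top-k retrieval in plain Python. It is only a sketch: the three-dimensional vectors are made up, and a real vector store such as Chroma uses an optimized index and may use a different distance metric than the cosine similarity assumed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_vectors = {
    "pasta": [0.9, 0.1, 0.0],
    "pirates": [0.1, 0.9, 0.1],
    "physics": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.0]

for doc_id, score in top_k(toy_query, toy_vectors, k=2):
    print(f"{doc_id}: {score:.3f}")
```

The retriever follows the same pattern: embed the query, score it against the stored vectors, and return the best matches together with their scores.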

## Exercise 3 - Your First RAG!

For a Retrieval-Augmented Generation (RAG) system, you need a Large Language Model (LLM) that generates answers to your queries by combining the retrieved passages with its own reasoning capabilities. This is where Ollama comes in: it serves the LLM that powers your RAG system. Here we set it up for use with LlamaIndex.

In [None]:
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", request_timeout=120.0)

Everything is ready for querying your data. You can define a query engine and start asking it questions. Congrats, you have a working RAG!

In [None]:
query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=3,
    streaming=True,
)

response = query_engine.query("What is the Flying Spaghetti Monster?")

In [None]:
response.print_response_stream()

## Going further...

### Prompt template

LlamaIndex offers an easy way to shape the generated answer: you can prompt the LLM with a custom template into which the retrieved context (`{context_str}`) and the user's question (`{query_str}`) are inserted.

In [None]:
from llama_index.core import PromptTemplate

template = (
    "As a devoted Pastafarian scholar touched by His Noodly Appendage, you shall defend our pasta-based teachings.\n\n"
    "Sacred commandments:\n"
    "- Keep answers CONCISE (2-3 paragraphs max)\n"
    "- Cite the Sacred Texts below\n"
    "- Use pasta metaphors liberally\n"
    "- Defend with noodly passion\n\n"
    "Sacred texts:\n"
    "-----------------------------------------\n"
    "{context_str}\n"
    "-----------------------------------------\n\n"
    "Question from seeker:\n"
    "{query_str}\n\n"
    "Answer with fervor, citing texts. Be passionate but brief. R'amen."
)
qa_template = PromptTemplate(template)


query_engine = index.as_query_engine(
    llm=llm,
    similarity_top_k=3,
    streaming=True,
    text_qa_template=qa_template,
)

response = query_engine.query("What is the Flying Spaghetti Monster?")

In [None]:
response.print_response_stream()
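
To make the role of the two placeholders concrete, the sketch below fills a stripped-down template with plain `str.format`. It only mimics the substitution step under simplifying assumptions: `toy_template`, `build_prompt`, and the sample chunks are hypothetical, and the query engine's real prompt assembly may differ in detail.

```python
# A stripped-down template using the same two placeholders as above.
toy_template = (
    "Context:\n"
    "{context_str}\n\n"
    "Question: {query_str}\n"
    "Answer:"
)


def build_prompt(template, chunks, question):
    """Join retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return template.format(context_str=context, query_str=question)


# Hypothetical retrieved chunks, standing in for the top-k nodes.
toy_chunks = ["First retrieved passage.", "Second retrieved passage."]
print(build_prompt(toy_template, toy_chunks, "What is the Flying Spaghetti Monster?"))
```

The resulting string is what the LLM receives, which is why instructions placed in the template apply to every answer.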

### Better embeddings

Under the hood, the query engine uses a basic vector retriever. Let's look at the nodes it retrieved and at how their data is stored.

In [None]:
nodes_with_score = response.source_nodes
nodes = [n.node for n in nodes_with_score]
data_to_df(nodes)

In [None]:
from llama_index.core.schema import MetadataMode

node = nodes[0]

In [None]:
print(node.text)

But what do the models see exactly? Let's have a look.

In [None]:
print(
    "The Embedding model sees this: \n",
    node.get_content(metadata_mode=MetadataMode.EMBED),
)

In [None]:
print(
    "The LLM sees this: \n",
    node.get_content(metadata_mode=MetadataMode.LLM),
)
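
The two views can differ because each node carries separate lists of metadata keys to hide from the embedding model and from the LLM (`excluded_embed_metadata_keys` and `excluded_llm_metadata_keys`). The toy function below imitates that rendering step. It is a sketch, not the library's code: the `key: value` layout, the sample metadata, and the exclusion lists are all assumptions for illustration.

```python
def render(text, metadata, excluded_keys=()):
    """Prepend 'key: value' lines for non-excluded metadata to the text."""
    lines = [f"{key}: {value}" for key, value in metadata.items() if key not in excluded_keys]
    metadata_str = "\n".join(lines)
    # Assumed template shape: metadata block, blank line, then content.
    return f"{metadata_str}\n\n{text}" if metadata_str else text


# Made-up metadata and text, standing in for a real node.
toy_metadata = {"file_name": "example.pdf", "page_label": "12", "file_size": 123456}
toy_text = "Some passage from the page."

# Hypothetical exclusion lists: hide file_size from both, page_label from the embedder only.
print("EMBED view:\n" + render(toy_text, toy_metadata, excluded_keys={"file_size", "page_label"}))
print("\nLLM view:\n" + render(toy_text, toy_metadata, excluded_keys={"file_size"}))
```

Keeping noisy fields such as file sizes out of the embedded text is one simple way to make the embeddings reflect the content rather than the bookkeeping.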

We might want to change what gets embedded. For example, we can split the documents into smaller chunks so that each embedding covers a narrower piece of text.

In [None]:
# Reset the index data
index.vector_store.clear()

In [None]:
from llama_index.core.node_parser import SentenceSplitter

# Use smaller chunks than the default; the overlap is set explicitly so that
# it stays well below the chunk size.
sentence_splitter = SentenceSplitter(chunk_size=200, chunk_overlap=20)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True,
    transformations=[sentence_splitter],
)
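
As a rough picture of what sentence-based chunking does, the sketch below greedily packs whole sentences into chunks under a size budget. It is a simplification under stated assumptions: it counts words rather than tokens, ignores overlap, and uses a naive regular expression to find sentence boundaries, so it is not how `SentenceSplitter` is implemented. `chunk_sentences` and the sample text are hypothetical.

```python
import re


def chunk_sentences(text, max_words=12):
    """Greedily pack whole sentences into chunks, aiming for at most max_words
    words per chunk (a single longer sentence is kept whole)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


# Made-up sample text.
toy_text = (
    "The first sentence is short. The second one is a little bit longer. "
    "A third follows. The last sentence ends the sample."
)
for i, chunk in enumerate(chunk_sentences(toy_text, max_words=12)):
    print(i, chunk)
```

Smaller chunks give more focused embeddings but less context per retrieved node, so the chunk size is worth tuning against your own questions.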

There are many more ways to improve the RAG system. Explore them in the official LlamaIndex documentation!