This notebook gives a basic introduction to using the LlamaIndex framework to build a RAG pipeline on top of locally hosted LLMs served by Ollama.

In [1]:
# Install the following to begin with
# !pip install llama-index
# !pip install chromadb
# !pip install llama-index-llms-ollama
# !pip install llama-index-embeddings-ollama
# !pip install llama-index-vector-stores-chroma

In [2]:
# Imports
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb

In this notebook, both the base LLM and the embedding model are pulled from and served by a local Ollama instance.

In [3]:
def configure_ollama_environment():
    """Configure Ollama models for both LLM and embeddings"""
    Settings.llm = Ollama(
        model="llama3.2:1b",
        base_url="http://localhost:11434",
        temperature=0.3,
        request_timeout=600.0
    )
    
    Settings.embed_model = OllamaEmbedding(
        model_name="snowflake-arctic-embed2:latest",
        base_url="http://localhost:11434",
        ollama_additional_kwargs={"mirostat": 0}
    )
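Both models must already be pulled into the local Ollama instance before the configuration above can be used. The following is a minimal sketch of a pre-flight check, assuming the server exposes its installed models at the `/api/tags` endpoint and returns JSON with a `models` list. The helper names `missing_models` and `fetch_tags` and the sample response are hypothetical and only serve the illustration.

```python
import json
from urllib.request import urlopen


def missing_models(tags_response, required):
    """Return the required model names that are absent from an /api/tags response."""
    available = {m.get("name", "") for m in tags_response.get("models", [])}
    return [name for name in required if name not in available]


def fetch_tags(base_url="http://localhost:11434"):
    """Ask the local Ollama server which models are installed (needs a running server)."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Hypothetical response, shaped like the JSON that /api/tags returns
sample = {"models": [{"name": "llama3.2:1b"}, {"name": "mistral:latest"}]}
required = ["llama3.2:1b", "snowflake-arctic-embed2:latest"]
print(missing_models(sample, required))  # → ['snowflake-arctic-embed2:latest']
```

With a running server, `missing_models(fetch_tags(), required)` would list whatever still needs an `ollama pull`.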

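Initializing the document processing pipeline: the function below loads the documents, sets up a persistent ChromaDB vector store, and builds the vector index.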

In [4]:
def create_ollama_ingestion_pipeline(data_dir: str = "data"):
    """End-to-end document processing with Ollama embeddings"""
    configure_ollama_environment()
    
    # Load and chunk documents
    documents = SimpleDirectoryReader(
        input_dir=data_dir,
        required_exts=[".pdf", ".txt"],
        recursive=True
    ).load_data()

    # Initialize ChromaDB vector store
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    vector_store = ChromaVectorStore(
        chroma_collection=chroma_client.get_or_create_collection("ollama_rag")
    )
    
    # Create vector index
    return VectorStoreIndex.from_documents(
        documents=documents,
        storage_context=StorageContext.from_defaults(vector_store=vector_store),
        show_progress=True
    )
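Before embedding, `VectorStoreIndex.from_documents` splits each loaded document into smaller nodes, which is why the number of embeddings can exceed the number of loaded documents. The sketch below illustrates the idea with a deliberately simplified word-based splitter. The function `chunk_words` and its tiny `chunk_size` and `overlap` values are hypothetical; the splitter LlamaIndex actually applies is token-based and sentence-aware, so treat this only as an illustration of overlapping chunks.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping windows of words (simplified stand-in for token-based splitting)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Two hypothetical "pages" standing in for loaded documents
pages = [
    "one two three four five six seven eight nine ten eleven twelve",
    "a short page",
]
nodes = [chunk for page in pages for chunk in chunk_words(page)]
print(len(pages), "pages ->", len(nodes), "chunks")  # → 2 pages -> 3 chunks
```

The overlap means neighbouring chunks share a few words, so a sentence that straddles a boundary is still retrievable from at least one chunk.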

Creating a basic query engine for searching the vector index

In [5]:
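def create_ollama_query_engine(index, similarity_top_k: int = 5):
    """Build a query engine that retrieves the top-k most similar chunks"""
    return index.as_query_engine(
        similarity_top_k=similarity_top_k,
        # Hybrid mode and alpha only take effect if the vector store supports hybrid search
        vector_store_query_mode="hybrid",
        alpha=0.5,
        response_mode="compact"
    )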
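The `similarity_top_k` parameter controls how many of the best-matching chunks are handed to the LLM as context. A minimal sketch of that retrieval step follows, assuming cosine similarity as the ranking metric; the metric a vector store really uses depends on how its collection is configured. The function names and the two-dimensional vectors are hypothetical and stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=5):
    """Return the ids of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy embeddings keyed by a hypothetical chunk id
chunk_vecs = {
    "intro": [1.0, 0.0],
    "methods": [0.7, 0.7],
    "appendix": [0.0, 1.0],
}
print(top_k([1.0, 0.1], chunk_vecs, k=2))  # → ['intro', 'methods']
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt and more room for irrelevant chunks.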

Validating the pipeline with a sample question

In [6]:
def test_ollama_rag_system():
    index = create_ollama_ingestion_pipeline()
    query_engine = create_ollama_query_engine(index)
    query = "What is the main topic in this paper?"
    response = query_engine.query(query)
    print(f"Query: {query}\nResponse: {response}")

In [7]:
test_ollama_rag_system()

Parsing nodes:   0%|          | 0/11 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/27 [00:00<?, ?it/s]

Query: What is the main topic in this paper?
Response: The main topic of this paper is the development, deployment, and potential applications of Large Language Models (LLMs) in healthcare settings.
