<a href="https://colab.research.google.com/github/dansarmiento/analytics_portfolio/blob/main/LLM_RAG_Building_in_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Purpose of Local LLMs and Retrieval-Augmented Generation

---

### Benefits of a Local LLM

Running a Large Language Model on your own machine (or a managed environment like Colab) instead of relying on a third-party API service offers several key advantages:

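* **Data Privacy & Security**: When the model runs on your own hardware, your data, prompts, and documents are never sent to a third-party LLM API. This is critical for handling sensitive, confidential, or proprietary information. (In Colab, the data still resides on Google's hosted runtime.)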
* **Cost-Effectiveness**: After the initial setup effort, running a local model avoids per-request API fees, which can become expensive with heavy usage. Many open-weight models are free to download and use under their license terms.
* **Offline Capability**: Once the models are downloaded, the entire system can run without an internet connection, ensuring continuous availability.
* **Customization & Control**: You have full control over the model, its configuration, and the entire application stack, allowing for deeper integration and fine-tuning.

---

### The Power of Retrieval-Augmented Generation (RAG)

RAG is a powerful technique that enhances the capabilities of LLMs by connecting them to external knowledge bases.

* **Grounding in Facts**: RAG grounds the LLM in specific, verifiable information (from your PDFs, in this case). This reduces the model's tendency to "hallucinate" or invent incorrect information.
* **Using Up-to-Date & Custom Data**: It allows the LLM to answer questions about information it was never trained on, such as recent documents or private, domain-specific knowledge.
* **Transparency & Trust**: Because the model's answers are based on retrieved context, you can often trace back the source of the information, increasing trust in the output.
* **Efficiency**: It's more efficient than retraining an entire LLM on new data. You simply update the knowledge base (the vector database) with new documents.

---

### What the Colab Notebook Accomplishes

The "PDF Question-Answering" notebook provides a complete, end-to-end implementation of a local RAG system. Here’s a summary of its key achievements:

1.  **Environment Setup**: It installs all necessary libraries and sets up the Ollama service to run open models (`llama3.2` for generation and `nomic-embed-text` for embeddings) directly within the Colab environment.
2.  **Dynamic Data Ingestion**: It automatically processes any PDF files you upload to the `/content/` folder, making it a flexible tool for your own documents.
3.  **Knowledge Base Creation**: It reads the PDFs, splits them into manageable chunks, and creates a searchable vector database using `ChromaDB`. This database serves as an external knowledge store that the LLM draws on at query time.
4.  **Intelligent RAG Chain Construction**: It builds a question-answering pipeline using LangChain. The chain retrieves the most relevant text from your documents and instructs the LLM to answer using *only* that context.
5.  **Interactive Q&A**: It provides a simple interface to ask questions about your documents and receive fact-based, context-aware answers generated by the local LLM.
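
To make the retrieve-then-generate flow in steps 3 to 5 concrete, here is a toy sketch in plain Python. It is not the notebook's code: the word-overlap scoring stands in for a real embedding search, the sample chunks are made up, and the helper names (`tokenize`, `retrieve`, `build_prompt`) are hypothetical. Only the prompt wording is taken from the notebook.

```python
# Toy illustration of the retrieve-then-generate flow; not the notebook's code.
import re

PROMPT_TEMPLATE = (
    "Answer the question based ONLY on the following context:\n"
    "{context}\n"
    "Question: {question}\n"
)

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, chunks, k=2):
    """Rank chunks by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question, chunks, k=2):
    """Join the top-k chunks into a context block and fill the template."""
    context = "\n\n".join(retrieve(question, chunks, k))
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Made-up chunks standing in for text extracted from PDFs.
chunks = [
    "Chroma stores embeddings for similarity search.",
    "Ollama serves open models on the local machine.",
    "Colab notebooks run in a hosted environment.",
]
prompt = build_prompt("Which tool stores embeddings?", chunks, k=1)
print(prompt)
```

In the real pipeline the filled-in prompt would be sent to the LLM; here it is only printed so the flow can be inspected.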

## 1. Install Dependencies

 This cell installs all the necessary Python libraries for our RAG pipeline.
 We're installing:
 - ollama: To run and interact with the local LLM.
 - langchain & associated libraries: The core framework for building our RAG chain.
   - langchain-community: For community-contributed components like document loaders and vector stores.
   - langchain-ollama: For specific integrations with the Ollama service.
 - unstructured & its dependencies: A powerful library for parsing various file
   types, including PDFs. It requires several other libraries for full
   functionality.
 - chromadb: The vector store we will use to save and retrieve document embeddings.
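
Once the installation has run, a quick import check can confirm that the packages are visible to the kernel. The following is a minimal sketch using only the standard library; the module names in `REQUIRED_MODULES` are assumed to be the import names matching the pip packages listed above, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the pip packages listed above.
REQUIRED_MODULES = [
    "ollama", "langchain", "langchain_community",
    "langchain_ollama", "unstructured", "chromadb",
]

def missing_modules(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing:", missing if missing else "none")
```

If anything is reported missing, re-running the install cell (and restarting the runtime if Colab asks for it) is usually the first thing to try.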


In [1]:
!pip install ollama langchain langchain-community langchain-ollama "unstructured[all-docs]" chromadb --quiet


## 2. Setup and Run Ollama
 This cell downloads, installs, and starts the Ollama service in the background
 of our Colab environment. It then pulls the models we'll need:
 - llama3.2: The primary language model for understanding and generation.
 - nomic-embed-text: A specialized model for creating numerical representations
   (embeddings) of our text, which is crucial for similarity search.

In [2]:
import os
import asyncio

# Start Ollama as a background process
!curl -fsSL https://ollama.com/install.sh | sh
os.environ['OLLAMA_HOST'] = '0.0.0.0'
!nohup ollama serve > ollama.log 2>&1 &

# Wait a bit for the server to start
await asyncio.sleep(5)

# Pull the required models
!ollama pull llama3.2
!ollama pull nomic-embed-text

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
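
The fixed five-second sleep in the cell above may be too short on a slow runtime. One possible alternative, sketched here with the standard library only, is to poll the server until it responds. The default URL uses the address reported in the install log; `wait_for_server` and its parameters are hypothetical, and the sketch assumes the server answers a plain GET on its root URL with HTTP 200.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://127.0.0.1:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```

In the notebook, a call such as `wait_for_server()` could take the place of `await asyncio.sleep(5)`, with the model pulls running only if it returns `True`.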

## 3. Prepare Your Documents

**IMPORTANT:** Please upload your PDF files to the Colab environment.

1. Click the "Files" icon in the left-hand sidebar.
2. Drag and drop your PDF files from your computer into the `/content/`
   directory that appears in the sidebar.

This code will automatically find and process any file ending with '.pdf'
in the `/content/` directory.

In [3]:
print("Please upload your PDF files to the /content/ folder.")

Please upload your PDF files to the /content/ folder.
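
Before building the pipeline, it can help to confirm that the upload worked. This is a small sketch that lists the PDFs under `/content/` with `pathlib`; `find_pdfs` is a hypothetical helper, and the recursive search mirrors the `**/*.pdf` pattern used later in the notebook.

```python
from pathlib import Path

def find_pdfs(directory):
    """Return sorted paths of all files ending in .pdf under `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return []  # directory missing, e.g. when run outside Colab
    return sorted(p for p in root.rglob("*.pdf") if p.is_file())

pdfs = find_pdfs("/content/")
print(f"Found {len(pdfs)} PDF file(s):")
for path in pdfs:
    print(" -", path)
```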


## 4. Build the RAG Pipeline

This is the core of our application. We'll define all the components and
functions needed to ingest the PDFs and build the question-answering chain.


In [4]:
import os
import logging
from langchain_community.document_loaders import UnstructuredPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_ollama import ChatOllama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

# Configure logging to see the progress
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Configuration Constants
CONTENT_DIR = "/content/"
PERSIST_DIRECTORY = "./chroma_db"
MODEL_NAME = "llama3.2"
EMBEDDING_MODEL = "nomic-embed-text"



### Function load_documents
Loads all PDF documents from a specified directory.

How it works:
- We use DirectoryLoader to look for all files ending with '.pdf'.
- For each PDF, it uses UnstructuredPDFLoader to extract the text content.
- UnstructuredPDFLoader is designed to cope with complex PDFs, including ones with tables and images, although extraction quality varies by document.

In [5]:
def load_documents(directory_path):
    """Load every PDF under directory_path; return None if none are found."""
    logging.info(f"Loading PDF documents from: {directory_path}")
    # Recursively match *.pdf and parse each file with UnstructuredPDFLoader.
    loader = DirectoryLoader(directory_path, glob="**/*.pdf", loader_cls=UnstructuredPDFLoader, show_progress=True)
    documents = loader.load()
    if not documents:
        logging.warning(f"No PDF documents were found. Please upload files to {directory_path}.")
        return None
    logging.info(f"Successfully loaded {len(documents)} document(s).")
    return documents

### Function split_documents
Splits the loaded documents into smaller, more manageable chunks.

Why we do this:
- LLMs have a limited context window (the amount of text they can see at once).
- Splitting documents into smaller chunks ensures that we can pass relevant,
  focused information to the model without exceeding its limit.
- It also improves the accuracy of the vector search.

In [6]:
def split_documents(documents):
    """Split documents into chunks of about 1000 characters with a 200-character overlap."""
    logging.info("Splitting documents into smaller chunks...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = text_splitter.split_documents(documents)
    logging.info(f"Created {len(chunks)} text chunks.")
    return chunks
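
To see what `chunk_size` and `chunk_overlap` mean in practice, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraphs, lines, and spaces); `sliding_window_chunks` is a hypothetical helper that only illustrates the overlap idea.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each window repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.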

### Function create_vector_db
Creates a Chroma vector database from the document chunks.

How it works:
1.  It initializes OllamaEmbeddings, which uses the 'nomic-embed-text' model
    to convert each text chunk into a numerical vector (an embedding).
2.  These vectors capture the semantic meaning of the text.
3.  The Chroma vector store indexes these vectors, allowing for efficient
    "similarity searches" to find chunks that are most relevant to a user's question.

In [7]:
def create_vector_db(chunks):
    """Embed the chunks and store them in a Chroma database; return None if there are no chunks."""
    if not chunks:
        logging.error("No chunks to process. Cannot create vector database.")
        return None
    logging.info("Creating vector database...")
    embedding_function = OllamaEmbeddings(model=EMBEDDING_MODEL)
    vector_db = Chroma.from_documents(
        documents=chunks,
        embedding=embedding_function,
        persist_directory=PERSIST_DIRECTORY
    )
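    # With persist_directory set, Chroma 0.4+ writes to disk automatically, so this
    # explicit call is deprecated in recent langchain-community releases and only
    # triggers a warning there; it is kept for compatibility with older versions.
    vector_db.persist()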
    logging.info("Vector database created and persisted successfully.")
    return vector_db
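
The "similarity search" idea can be illustrated without any vector database. The sketch below ranks a few made-up three-dimensional vectors by cosine similarity to a query vector. The vectors, texts, and helper names are invented for illustration, real embeddings have many more dimensions, and Chroma's actual index and distance metric may differ from this brute-force version.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up "embeddings" for three chunks of text.
indexed = [
    ("dosage instructions", [0.9, 0.1, 0.0]),
    ("billing codes", [0.0, 0.2, 0.9]),
    ("medication orders", [0.8, 0.3, 0.1]),
]
for text, score in top_k([1.0, 0.2, 0.0], indexed):
    print(f"{score:.3f}  {text}")
```

A vector store does essentially this at scale, using an index so that it does not have to compare the query against every stored vector.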

### Function create_rag_chain
Builds and returns the complete RAG (Retrieval-Augmented Generation) chain.

In [8]:
def create_rag_chain(vector_db):
    """Build the retriever, prompt, and LLM into a single question-answering chain."""
    logging.info("Creating the RAG chain...")

    # Initialize the LLM
    llm = ChatOllama(model=MODEL_NAME)

    # 1. Create the Retriever
    # The retriever's job is to fetch relevant documents from the vector store.
    # We use a MultiQueryRetriever which rephrases the user's question from
    # multiple perspectives to improve the search results.
    retriever_from_llm = MultiQueryRetriever.from_llm(
        retriever=vector_db.as_retriever(), llm=llm
    )

    # 2. Define the Prompt Template
    # This template structures the final prompt sent to the LLM. It instructs
    # the model to answer the question based *only* on the context provided
    # by the retriever, which discourages (but cannot fully prevent) hallucination
    # and reliance on outside knowledge.
    template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""
    prompt = ChatPromptTemplate.from_template(template)

    # 3. Assemble the Chain
    # This is where we define the flow of data using LangChain Expression Language (LCEL).
    # - The first step takes the user's question and passes it to the retriever to get context.
    # - The context and the original question are then fed into the prompt template.
    # - The formatted prompt is sent to the LLM.
    # - The LLM's response is cleaned up by the StrOutputParser.
    chain = (
        {"context": retriever_from_llm, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    logging.info("RAG chain created successfully.")
    return chain
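
The multi-query idea used by the retriever can be sketched in a few lines: run the search once per rephrased question and keep the unique union of the results. In the sketch below the query variants are hard-coded rather than generated by an LLM, and `keyword_search` with its tiny corpus is a hypothetical stand-in for the vector-store lookup, so this illustrates the merging step only.

```python
def multi_query_retrieve(query_variants, search, k=2):
    """Run `search` for each variant and return the de-duplicated union."""
    seen = set()
    merged = []
    for query in query_variants:
        for chunk in search(query, k):
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

# Hypothetical stand-in for a vector-store lookup: keyword match over a tiny corpus.
CORPUS = [
    "orders are listed on the MAR",
    "the MAR records administrations",
    "reports use data models",
]

def keyword_search(query, k):
    """Return up to k corpus entries sharing at least one word with the query."""
    words = query.lower().split()
    hits = [c for c in CORPUS if any(w in c.lower().split() for w in words)]
    return hits[:k]

variants = ["medication orders", "MAR orders", "administrations"]
print(multi_query_retrieve(variants, keyword_search))
```

Differently worded variants tend to surface chunks that a single phrasing would miss, at the cost of one extra search per variant.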

### Main execution block

In [9]:
def setup_pipeline():
    """Runs the full data ingestion and pipeline setup process."""
    docs = load_documents(CONTENT_DIR)
    if not docs:
        return None

    chunks = split_documents(docs)
    if not chunks:
        return None

    db = create_vector_db(chunks)
    if not db:
        return None

    rag_chain = create_rag_chain(db)
    return rag_chain

# Let's build the pipeline!
rag_chain = setup_pipeline()

100%|██████████| 1/1 [01:31<00:00, 91.42s/it]
  vector_db.persist()


## 5. Ask a Question

Now you can ask questions about the content of your PDFs.

Type your question into the `question` field in the form below
and run the cell. The RAG chain will find the most relevant information
in your documents and generate a concise answer.
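
To ask several questions in one go, a small helper can loop over any object that exposes an `invoke` method, such as the `rag_chain` built in section 4. This is a sketch: `ask_all` is a hypothetical helper, and `EchoChain` is a stand-in so the example runs without a model.

```python
def ask_all(chain, questions):
    """Invoke the chain once per question and return (question, answer) pairs."""
    return [(q, chain.invoke(q)) for q in questions]

class EchoChain:
    """Hypothetical stand-in so the sketch runs without a model."""
    def invoke(self, question):
        return f"(answer to: {question})"

questions = ["What is the MAR?", "Where are orders stored?"]
for question, answer in ask_all(EchoChain(), questions):
    print(f"-> {question}\n{answer}\n")
```

With the real pipeline, `ask_all(rag_chain, questions)` would be the equivalent call, provided `rag_chain` was set up successfully.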

In [10]:
if rag_chain:
  question = "tell me about how to find medication orders" # @param {type:"string"}
  print(f"-> Your Question: {question}\n")

  print("-> Assistant's Answer:")
  # The invoke method runs the entire chain and returns the final answer.
  response = rag_chain.invoke(question)
  print(response)
else:
  print("The RAG pipeline could not be set up. Please check the logs above, ensure you have uploaded PDFs, and try running the cells again.")



-> Your Question: tell me about how to find medication orders

-> Assistant's Answer:
According to the provided document, here is what you need to know about finding medication orders:

1. Medication orders are related to the MedicationOrderFact data model.
2. Inpatient-mode medication orders are documented on the Medication Administration Record (MAR).
3. Note that inpatient-mode medication orders are not limited to inpatient/hospital settings, but also include home care visits and outpatient oncology clinics.
4. The MAR related group stores concepts such as who performed the administration, patient ID, and encounter information.
5. You can find more general reporting data models for medication orders, including IP Pharmacy Medication Orders based on MedicationOrderFact.

To find medication orders in Epic, you would typically use the following fields:

* MedicationOrderFact (data model)
* ORDER_MED_ID (field) to retrieve a specific medication order
* ORD records and their correspondin