## **Libraries**

In [1]:
# ✅ LangChain: Framework to build modular, chainable LLM pipelines (RAG, QA, chatbots, etc.)
!pip install langchain -q

# ✅ langchain-community: Extra tools, connectors, and model wrappers not in core LangChain
!pip install langchain-community -q

# ✅ pypdf: To read and extract text from PDF documents
!pip install pypdf -q

# ✅ docarray: A lightweight vector storage library for storing and retrieving document embeddings
!pip install docarray -q

# ✅ sentence-transformers: Provides pretrained models like MiniLM to convert sentences into dense vectors (embeddings)
!pip install sentence-transformers -q

# ✅ huggingface_hub: Interface to download models/files from Hugging Face model hub (e.g., GGUF models)
!pip install huggingface_hub -q

# ✅ llama-cpp-python: Python bindings to run local GGUF-format LLaMA models using C++ backend (llama.cpp)
!pip install llama-cpp-python -q

# ✅ apt-get update: Updates package index
# ✅ git: Required to clone the llama.cpp GitHub repository
# ✅ cmake + build-essential: Required to build the C++ code for llama.cpp
!apt-get update && apt-get install -y git cmake build-essential -q



## **Clone and compile llama.cpp**

In [2]:
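# 📥 Clone the llama.cpp repository from GitHub, skipping the clone if the directory already exists
# (llama-cpp-python bundles its own copy of llama.cpp, so this standalone build is only needed for the command-line tools)
![ -d llama.cpp ] || git clone https://github.com/ggerganov/llama.cpp

# 📁 Move into the llama.cpp directory
%cd llama.cpp

# 🛠️ Configure and compile with CMake using all available processor cores
# (the Makefile build is deprecated; the recorded output below comes from the earlier `make -j$(nproc)` run)
!cmake -B build
!cmake --build build --config Release -j$(nproc)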

# 🔙 Move back to the root project directory
%cd ..

fatal: destination path 'llama.cpp' already exists and is not an empty directory.
/content/llama.cpp
Makefile:2: *** The Makefile build is deprecated. Use the CMake build instead. For more details, see https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md.  Stop.
/content


In [3]:
# ✅ For downloading the LLaMA GGUF model from Hugging Face
from huggingface_hub import hf_hub_download

# 🧠 Download a quantized (Q4_K_M) 3B-parameter instruction-tuned model
model_path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## **Uploading the PDF**

In [4]:
# ✅ For uploading files (PDF) from your local system in Google Colab
from google.colab import files

# 🧾 User uploads the file → returned as a dict {filename: file_data}
uploaded = files.upload()

# 🛣️ Get the uploaded PDF file path
pdf_path = list(uploaded.keys())[0]

Saving blood test.pdf to blood test.pdf
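
`files.upload()` accepts any file type, so it may be worth confirming that the upload is a PDF before handing it to the loader. The following is a minimal sketch, assuming a hypothetical helper `looks_like_pdf` (not part of Colab or LangChain) that checks for the `%PDF-` signature at the start of the file:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF signature (hypothetical helper)."""
    # PDF files begin with "%PDF-" followed by the version, e.g. "%PDF-1.7".
    return data.startswith(b"%PDF-")
```

In this notebook the raw bytes are available as `uploaded[pdf_path]`, so the check could be run as `looks_like_pdf(uploaded[pdf_path])` before loading.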


## **Initializing LLaMA model**

In [11]:
# ✅ For checking if GPU is available (used in LLaMA config)
import torch

# 🧠 LangChain wrapper around llama.cpp: loads and queries GGUF models locally
from langchain_community.llms import LlamaCpp

# 🧠 Initialize the LLaMA model with context size, GPU settings, temperature, etc.
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=5000,  # Max context window size (prompt + response)
    n_gpu_layers=33 if torch.cuda.is_available() else 0,  # Layers to offload to the GPU; 0 keeps everything on the CPU
    temperature=0.7,  # 0 = deterministic; 0.7 = more creative/random
    verbose=False  # Less logging noise
)

llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_ctx_per_seq (5000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
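
Because `n_ctx` covers the prompt and the response together, the retrieved context, the question, and the generated tokens all share the 5000-token window. The sketch below estimates whether a prompt leaves room for the answer. It assumes a rough heuristic of about 4 characters per token (real counts depend on the tokenizer) and an assumed response budget of 256 tokens; `estimate_tokens` and `fits_in_context` are hypothetical helpers.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 characters per token is an assumed heuristic."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(prompt: str, n_ctx: int = 5000, max_new_tokens: int = 256) -> bool:
    """Check that the prompt leaves room for the response inside the window."""
    return estimate_tokens(prompt) + max_new_tokens <= n_ctx
```

If the check fails, retrieving fewer or smaller chunks, or raising `n_ctx`, are the usual options.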


## **Loading the PDF and extracting pages**

In [6]:
# ✅ LangChain PDF loader: Splits and converts PDF pages into documents
from langchain_community.document_loaders import PyPDFLoader

# ✅ LangChain vector storage: Stores document embeddings in memory (no DB required)
from langchain_community.vectorstores import DocArrayInMemorySearch

# ✅ Sentence embedding wrapper using Hugging Face (MiniLM, BERT, etc.)
from langchain_community.embeddings import HuggingFaceEmbeddings

# ✅ Used to define the prompt format used by the LLM
from langchain_core.prompts import PromptTemplate

# ✅ Pass-through node for dynamic chain input (like a question)
from langchain_core.runnables import RunnablePassthrough

# ✅ Parses the LLM output and returns as a plain string
from langchain_core.output_parsers import StrOutputParser

# 📄 Load the PDF (one document per page), then apply the default text splitter so long pages are broken into smaller chunks
loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()
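
To illustrate what a splitting step does, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not LangChain's splitter, which prefers to break at separators such as paragraph boundaries; `chunk_text` and its default sizes are assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some text twice.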

## **Embeddings, VectorStore, and Retriever**

In [7]:
# 🔤 Embed each document chunk with all-MiniLM-L6-v2 (a small, fast model that produces 384-dimensional vectors)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 🗃️ Store vectorized documents in memory (no external DB needed)
vectorstore = DocArrayInMemorySearch.from_documents(pages, embedding=embeddings)

# 🔎 Convert the vector store into a retriever
retriever = vectorstore.as_retriever()

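
Conceptually, a retriever embeds the question and returns the stored vectors closest to it. The sketch below shows that idea with cosine similarity, a common choice of metric, using toy 3-dimensional vectors in place of real MiniLM embeddings. `cosine` and `top_k` are hypothetical helpers, not the vector store's API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k stored vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the first two vectors point in nearly the same direction as the query.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
best = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
```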


## 📋 **Define the prompt template for LLM**

In [8]:
# This tells the LLM: "Use this context to answer this question"
template = """Use the following context to answer the question:
{context}

Question : {question}

Answer: """

# Create a LangChain PromptTemplate object
prompt = PromptTemplate.from_template(template)
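
Filling the template works much like Python's `str.format`. The sketch below uses plain `str.format` with placeholder values to show the final text the model would receive; `demo_template` repeats the template above, and the context and question strings are illustrative only.

```python
demo_template = """Use the following context to answer the question:
{context}

Question : {question}

Answer: """

filled = demo_template.format(
    context="(retrieved page text goes here)",  # placeholder, not real PDF content
    question="What is the main topic of the PDF?",
)
print(filled)
```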

## 🔗 **Build a RAG chain**
Steps:
1. Retrieve the documents most relevant to the question
2. Fill the prompt template with the retrieved context and the question
3. Run the prompt through the LLaMA model
4. Parse the output into a plain string

In [12]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()  # Return the model output as a plain string
)
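
In the chain above, the retriever's list of `Document` objects is converted to a string as-is when it fills `{context}`, so the prompt also carries metadata and object formatting. A common alternative is to join only the page text first. The sketch below shows such a helper; `format_docs` is a hypothetical name, and `Doc` is a stand-in class so the example runs without LangChain.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the field used here)."""
    page_content: str


def format_docs(docs) -> str:
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([Doc("page one text"), Doc("page two text")])
```

In the notebook this could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`.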

## 💬 Try a simple test question (control prompt)

In [13]:
question = "how to make pizza"

# 🚀 Run the RAG pipeline
response = chain.invoke(question) # Runs the full Retrieval + Generation pipeline

# 📢 Print the answer
print(f"Answer: {response}")

Answer: 1. Make the dough: In a large mixing bowl, combine 2 cups of warm water, 2 teaspoons of active dry yeast, and 3 tablespoons of olive oil. Mix until the yeast is dissolved, then add 3 cups of all-purpose flour and continue to mix until the dough becomes smooth and elastic. Knead the dough for 5-10 minutes, until it becomes smooth and elastic.

2. Prepare the sauce: In a blender or food processor, combine 1 cup of crushed tomatoes, 1 tablespoon of olive oil, 2 cloves of minced garlic, and 1 teaspoon of dried oregano. Blend until smooth.

3. Add cheese to the pizza: Grate 8 ounces of mozzarella cheese and sprinkle it evenly over the sauce.

4. Add toppings to your pizza: Choose your favorite toppings, such as pepperoni, sausage, mushrooms, onions, bell peppers, olives, etc. Sprinkle these toppings evenly over the cheese.

5. Bake the pizza in the oven: Preheat your oven to 425 degrees Fahrenheit (220 degrees Celsius). Place the pizza on a baking sheet or pizza stone and bake for 1
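
The pizza answer comes from the model's general knowledge rather than the PDF, because the template does not tell the model to decline when the context is irrelevant. One possible tightening is sketched below. The wording of `strict_template` is an assumption, and a small model may not always follow it.

```python
strict_template = """Answer the question using only the context below.
If the context does not contain the answer, reply exactly: "I don't know based on the document."

Context:
{context}

Question: {question}

Answer: """

# Preview the filled prompt with placeholder context.
preview = strict_template.format(context="(retrieved text)", question="how to make pizza")
```

To try it, build the prompt with `PromptTemplate.from_template(strict_template)` and rebuild the chain.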

In [14]:
# ⏱️ Performance Benchmark: Ask a deeper question & measure response time
import time

st = time.time()

# 🧠 Real semantic question based on PDF content
question = "What is the main topic of the PDF? Can you tell me the top 2 important points from the document?"

# 🚀 Invoke the same RAG chain
response = chain.invoke(question)

# 🧾 Output the answer
print(f"Answer: {response}")

ed = time.time()
print(f"Inference took: {(ed - st):.3f} seconds")

Answer:  The main topic of the PDF is a "Complete Blood Count (CBC)" report.  Based on this, I can tell you that the top 2 important points from the document are:

1.  **Test Results**:  The blood test results show the following values:
    *   Hemoglobin: 12.00 - 15.00 g/dL
    *   Packed Cell Volume (PCV): 36.00 - 46.00%
2.  **Treatment Goals**:  According to the Lipid Association of India 2020 guidelines, the treatment goals for patients with high cholesterol levels are:
    *   LDL Cholesterol: <100 mg/dL
    *   Non-HDL Cholesterol: <130 mg/dL
Inference took: 846.309 seconds
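
For interval measurements, `time.perf_counter()` is generally preferable to `time.time()` because it is monotonic and unaffected by system clock adjustments. A small sketch of a reusable timing helper follows; `timed` is a hypothetical name.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Example with a cheap built-in call in place of the RAG chain.
result, seconds = timed(sum, range(1000))
```

In the notebook the same helper could wrap the chain: `response, seconds = timed(chain.invoke, question)`.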


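## **Adding Guardrails**

* **Guardrails** here are post-processing checks on the model's answer: they replace empty responses with a fallback message, block responses that contain banned keywords, and truncate overly long ones.
* The **diagram** at the end of this section gives a high-level overview of the complete data flow in the app.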


In [15]:
def apply_guardrails(response: str, keywords=None, max_length=1000):
    """
    Guardrails for basic safety and quality control.
    - Return a fallback message if the response is empty or None
    - Block responses containing restricted keywords (toxic/offensive)
    - Truncate responses longer than max_length characters
    """
    # Empty or null response
    if not response or not response.strip():
        return "⚠️ The model did not generate a response. Please try a different question."

    # Keyword filtering (basic safety)
    if keywords:
        for word in keywords:
            if word.lower() in response.lower():
                return f"⚠️ Unsafe content detected: '{word}'. Response blocked."

    # Response too long (length control)
    if len(response) > max_length:
        return response[:max_length] + "\n\n⚠️ Truncated due to length."

    # If passed all guardrails
    return response
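
Plain substring matching also flags harmless words that happen to contain a banned term; for example, "skill" contains "kill". The sketch below matches whole words only, using regular-expression word boundaries. `find_banned_word` is a hypothetical helper, not part of the function above.

```python
import re


def find_banned_word(text: str, keywords):
    """Return the first banned keyword that appears as a whole word, else None."""
    for word in keywords:
        # \b anchors the match to word boundaries; re.escape guards special characters.
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            return word
    return None
```

Whole-word matching reduces false positives but will miss inflected forms such as "killing", so the keyword list may need to include those explicitly.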

In [16]:
# Define your unsafe or banned words (basic example)
banned_keywords = ["kill", "hate", "bomb", "suicide", "terrorist"]

# Run RAG pipeline
question = "What is the main topic of the PDF? Can you tell me the top 2 important points?"
raw_response = chain.invoke(question)

# Apply guardrails before showing the response
safe_response = apply_guardrails(raw_response, keywords=banned_keywords)

print(f"Answer: {safe_response}")

Answer:  The main topic of the PDF is "Blood Test Report".

The top 2 important points are:

1. **Cholesterol Levels**: The report provides the results for Cholesterol, Total, which falls within the normal range (115.00 mg/dL).

2. **Electrolyte Balance**: The report also provides the results for Electrolytes: Phosphorus, Sodium, Potassium, and Chloride, all of which fall within the normal ranges (2.40 - 5.10 mg/dL for Phosphorus, 136.00 - 145.00 mEq/L for Sodium, 3.50 - 5.10 mEq/L for Potassium, and 98.00 - 107.00 mEq/L for Chloride).
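
The checks above look only at the answer text. For a report full of numeric values, one further check, not part of the original function, is to confirm that every number in the answer also appears in the retrieved context. The sketch below is deliberately crude: `ungrounded_numbers` is a hypothetical helper, and it will also flag list markers such as "1." when that number is absent from the context.

```python
import re

# Matches integers and decimals such as "115" or "115.00".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def ungrounded_numbers(answer: str, context: str) -> list[str]:
    """Return numbers in the answer that never appear in the context."""
    known = set(NUMBER.findall(context))
    return [n for n in NUMBER.findall(answer) if n not in known]
```

A non-empty result does not prove the answer is wrong, but it is a cheap signal that a value should be checked against the source document.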


### 📊 **Diagram: LangChain RAG with LLaMA + PDF + Guardrails**

```text
                ┌──────────────────────┐
                │  📁  Uploaded PDF     │
                └────────┬─────────────┘
                         ↓
              ┌───────────────────────────┐
              │ Load & Split Pages (Loader)│
              └────────┬──────────────────┘
                       ↓
         ┌───────────────────────────────┐
         │ Convert to Embeddings (MiniLM)│
         └────────┬──────────────────────┘
                  ↓
       ┌──────────────────────────────────┐
       │ In-Memory Vector Store (DocArray)│
       └────────┬─────────────────────────┘
                ↓
 ┌────────────────────────────────────────────┐
 │     QUESTION from USER                     │
 └────────────────────────────────────────────┘
                ↓
     ┌──────────────────────────────┐
     │  Retrieve Relevant Chunks    │  ← (from vector store)
     └────────┬─────────────────────┘
              ↓
     ┌──────────────────────────────┐
     │  Format Prompt (LangChain)   │
     └────────┬─────────────────────┘
              ↓
     ┌──────────────────────────────┐
     │ Local LLaMA Model (llama.cpp)│
     └────────┬─────────────────────┘
              ↓
     ┌──────────────────────────────┐
     │ Extract LLM Response (Parser)│
     └────────┬─────────────────────┘
              ↓
     ┌──────────────────────────────┐
     │ 🛡️ Guardrails (Validation)     │
     └────────┬─────────────────────┘
              ↓
     ┌──────────────────────────────┐
     │  ✅ Safe Answer to User       │
     └──────────────────────────────┘
```
