# **LLM + RAG + Agents with Semantic Kernel Memory**
This step-by-step exercise guides you through **Large Language Models (LLMs)**, **Retrieval-Augmented Generation (RAG)**, and **Agents**, using a **local Llama 2 model** and **Semantic Kernel (SK) Memory** in Python.

In [9]:
%pip install llama-cpp-python huggingface_hub >nul 2>&1
%pip install semantic-kernel >nul 2>&1
%pip install ollama >nul 2>&1


Note: you may need to restart the kernel to use updated packages.




## **Exercise 1: Loading Llama 2 and Running a Basic Prompt**
### **What You Will Learn**
- How to download and load Llama 2 locally
- How to send a basic query and receive a response

In [32]:
import os
import sys
from contextlib import redirect_stdout, redirect_stderr
from llama_cpp import Llama
from huggingface_hub import hf_hub_download

# Ensure all logging is disabled
os.environ["LLAMA_CPP_LOG_LEVEL"] = "ERROR"

MODEL_NAME = "TheBloke/Llama-2-7B-Chat-GGUF"
MODEL_FILE = "llama-2-7b-chat.Q4_K_M.gguf"

# Suppress download logs (the devnull handle is closed when the block exits)
with open(os.devnull, "w") as devnull, redirect_stdout(devnull), redirect_stderr(devnull):
    model_path = hf_hub_download(repo_id=MODEL_NAME, filename=MODEL_FILE)

# Load the model; verbose=False suppresses the llama.cpp loading logs
llm = Llama(model_path=model_path, n_ctx=4096, verbose=False)

# Run the query and print only the needed output
print("\nRunning LLM query...\n")

response = llm("Explain Large Language Models (LLMs) in simple terms. Use 100 Characters.", max_tokens=128)

# Extract the generated text from the first completion choice
full_output = response['choices'][0]['text']

print("\n### Model Output ###\n")
print(full_output)



Running LLM query...


### Model Output ###


Large Language Models (LLMs) are AI systems that generate human-like text by learning from vast amounts of data. They can be trained to perform specific tasks, like writing articles or chatting, and can improve over time with more data and training.


**Note:** If you are running this in a closed environment without internet access, ask your instructor for the model file path.
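
For the closed-environment case in the note above, one possible approach is to look for the model file on disk first and fall back to `hf_hub_download` only when it is missing. The sketch below is a minimal illustration: the helper name `find_local_model` and the `llm_resources/models` directory are assumptions, not part of the original exercise.

```python
from pathlib import Path
from typing import Optional

MODEL_FILE = "llama-2-7b-chat.Q4_K_M.gguf"


def find_local_model(model_dir: str, filename: str = MODEL_FILE) -> Optional[Path]:
    """Return the path to a local model file, or None if it does not exist."""
    candidate = Path(model_dir).expanduser() / filename
    return candidate if candidate.is_file() else None


# "llm_resources/models" is a hypothetical location; use the path your instructor gives you.
local_model = find_local_model("llm_resources/models")
if local_model is not None:
    # Pass str(local_model) as model_path when constructing Llama(...).
    print(f"Found local model: {local_model}")
else:
    print("No local model found; download it with hf_hub_download instead.")
```

If a path is returned, it can be passed as `model_path` to `Llama(...)` in place of the value returned by `hf_hub_download`.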

Remove the `verbose=False` argument from the `Llama(...)` call to see the additional logs.

## **Exercise 2: Implementing Manual RAG Retrieval**
### **What You Will Learn**
- How to use FAISS for document retrieval
- How to create and store vector embeddings
- How to retrieve the best-matching fact

In [33]:
%pip install torchvision torchaudio >nul 2>&1
%pip install faiss-cpu numpy sentence-transformers >nul 2>&1

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Load embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Sample knowledge base with unique facts
docs = [
    "RAG improves LLM responses by retrieving relevant external documents.",
    "Semantic Kernel is an AI orchestration framework.",
    "LLama2 is a local, open-source language model.",
    "The Moon has a diameter of 3,474 km.",
    "The heaviest recorded blue whale weighed approximately 190,000 kg.",
    "The Eiffel Tower can be 15 cm taller in the summer due to thermal expansion."
]

# Generate embeddings (encode accepts a list and returns a 2-D NumPy array)
doc_embeddings = embedder.encode(docs)

# Create FAISS index
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)

# Save FAISS index
faiss.write_index(index, "vector_store.index")

# Test retrieval
query = "What is the weight of the heaviest blue whale?"
query_embedding = embedder.encode(query).reshape(1, -1)
D, I = index.search(query_embedding, k=1)
retrieved_doc = docs[I[0][0]]

print(f'🔍 Best match: {retrieved_doc}')


🔍 Best match: The heaviest recorded blue whale weighed approximately 190,000 kg.
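
To make the retrieval step less of a black box: `IndexFlatL2` performs an exact, brute-force search that ranks every stored vector by its squared L2 distance to the query. The sketch below reproduces that idea in plain Python. The function name `l2_search` and the toy 2-D vectors are made up for illustration; this is not FAISS code.

```python
from typing import List, Sequence, Tuple


def l2_search(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 1
) -> Tuple[List[float], List[int]]:
    """Brute-force nearest-neighbour search using squared L2 distance."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]


# Toy 2-D "embeddings" (assumed values; all-MiniLM-L6-v2 vectors have 384 dimensions).
toy_vectors = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
distances, indices = l2_search([1.0, 0.0], toy_vectors, k=2)
print(indices)    # positions of the two closest vectors, nearest first
print(distances)  # their squared L2 distances
```

The returned pair plays the same role as `D, I` in the cell above: `I` holds positions into `docs`, and `D` holds the matching distances, where smaller means more similar.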


## **Exercise 3: Enhancing LLM Responses with Retrieved Data**
### **What You Will Learn**
- How to retrieve relevant knowledge before generating a response
- How to use the retrieved data as context in a prompt

In [34]:
# Query for a specific fact
query = "How much does the heaviest blue whale weigh?"
query_embedding = embedder.encode(query).reshape(1, -1)

# Load FAISS index
index = faiss.read_index("vector_store.index")

# Perform search
D, I = index.search(query_embedding, k=1)
retrieved_doc = docs[I[0][0]]

# Display retrieved knowledge
print(f"\n🔍 Best match: {retrieved_doc}\n")

# Use retrieved document in LLM prompt
context = retrieved_doc
prompt = f"""[INST] <<SYS>>
You are an AI assistant that enhances responses using retrieved knowledge.
Below is relevant information from a knowledge base:
{context}
<</SYS>>
Answer the question: {query}[/INST]"""

# Generate response
response = llm(prompt, max_tokens=160)
    
print("\n### Model Output ###\n")
print(response['choices'][0]['text'])



🔍 Best match: The heaviest recorded blue whale weighed approximately 190,000 kg.


### Model Output ###

  Based on the information retrieved from the knowledge base, the heaviest blue whale weighs approximately 190,000 kg.
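
The prompt above places a single retrieved fact in the system block. When retrieving more than one document (for example `k=3`), the same template can be filled from a list. The helper below is a sketch under that assumption; the name `build_rag_prompt` and the bullet formatting of the context are illustrative choices, not part of the original exercise.

```python
from typing import Sequence


def build_rag_prompt(context_docs: Sequence[str], question: str) -> str:
    """Format retrieved documents and a question as a Llama 2 chat prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "[INST] <<SYS>>\n"
        "You are an AI assistant that enhances responses using retrieved knowledge.\n"
        "Below is relevant information from a knowledge base:\n"
        f"{context}\n"
        "<</SYS>>\n"
        f"Answer the question: {question}[/INST]"
    )


prompt = build_rag_prompt(
    ["The heaviest recorded blue whale weighed approximately 190,000 kg."],
    "How much does the heaviest blue whale weigh?",
)
print(prompt)
```

The resulting string can be passed to `llm(prompt, max_tokens=160)` exactly as in the cell above.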


## **Exercise 4: Using Semantic Kernel Memory for Retrieval**
### **What You Will Learn**
- How to store and retrieve information using Semantic Kernel Memory
- How to replace manual FAISS retrieval with SK Memory
- How to integrate memory-based retrieval into LLM prompts
- How to abstract the model for both the chat completion and the memory (RAG) with Semantic Kernel

In [35]:
import asyncio
import torch
from transformers import AutoTokenizer, AutoModel
from semantic_kernel import Kernel
from semantic_kernel.memory import VolatileMemoryStore
from semantic_kernel.memory.semantic_text_memory import SemanticTextMemory

class LocalLlamaTextCompletion:
    def __init__(self, llama_instance, service_id: str):
        self.llama = llama_instance
        self.service_id = service_id

    async def complete(self, prompt: str, max_tokens: int = 200) -> str:
        response = await asyncio.to_thread(self.llama, prompt, max_tokens=max_tokens)
        return response['choices'][0]['text']


# Initialize our local completion service.
local_llama_completion = LocalLlamaTextCompletion(llm, "local_llama")


# ---------------------------
# Set up the local embedding service for the memory store.
# ---------------------------
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(embed_model_name)
embed_model = AutoModel.from_pretrained(embed_model_name)

def generate_embedding(text: str):
    # Tokenize the input text.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        model_output = embed_model(**inputs)
    # Mean pooling to get a fixed-size embedding vector.
    embeddings = model_output.last_hidden_state.mean(dim=1)
    return embeddings[0].cpu().numpy()

class LocalEmbeddingService:
    async def generate_embeddings(self, texts, **kwargs):
        embeddings = []
        for t in texts:
            emb = await asyncio.to_thread(generate_embedding, t)
            embeddings.append(emb)
        return embeddings

local_embedding_service = LocalEmbeddingService()

# ---------------------------
# Initialize Semantic Kernel memory (SK Memory) with a volatile store.
# ---------------------------
memory_store = VolatileMemoryStore()
sk_memory = SemanticTextMemory(memory_store, local_embedding_service)

# ---------------------------
# Set up the Semantic Kernel and register the local completion service.
# ---------------------------
kernel = Kernel()
# Register the local Llama completion service with the kernel.
# It is looked up later by its service id via kernel.services["local_llama"].
kernel.add_service(local_llama_completion)

# ---------------------------
# Populate your knowledge base.
# ---------------------------
knowledge_base = {
    "fact_rag": "RAG improves LLM responses by retrieving relevant external documents.",
    "fact_sk": "Semantic Kernel is an AI orchestration framework.",
    "fact_llama2": "LLama2 is a local, open-source language model.",
    "fact_moon": "The Moon has a diameter of 3,474 km.",
    "fact_bluewhale": "The heaviest recorded blue whale weighed approximately 190,000 kg.",
    "fact_eiffel": "The Eiffel Tower can be 15 cm taller in the summer due to thermal expansion."
}

async def save_knowledge():
    for fact_id, fact_text in knowledge_base.items():
        await sk_memory.save_information(collection="knowledge_base", text=fact_text, id=fact_id)

# Save the facts to memory.
await save_knowledge()

# Define your query.
query = "How much does the heaviest blue whale weigh?"

# Perform a semantic search using the local embedding service.
results = await sk_memory.search(collection="knowledge_base", query=query, limit=1, min_relevance_score=0.0)
if not results:
    # Stop the cell here; exit() would shut down the notebook kernel.
    raise RuntimeError("No results found.")

retrieved_info = results[0]
print(f"\n🧠 Retrieved from SK Memory: {retrieved_info.text}\n")

# Define the prompt template. Notice that we include placeholders for the retrieved info and the query.
prompt_template = (
    "[INST] <<SYS>>\n"
    "You are an AI assistant using retrieved knowledge.\n"
    "Relevant Info: {retrieved_info}\n"
    "<</SYS>>\n"
    "Answer the question: {query} [/INST]"
)
formatted_prompt = prompt_template.format(
    retrieved_info=retrieved_info.text,
    query=query
)
result = await kernel.services["local_llama"].complete(formatted_prompt, max_tokens=200)
print("\n### Model Output ###\n")
print(result)






🧠 Retrieved from SK Memory: The heaviest recorded blue whale weighed approximately 190,000 kg.


### Model Output ###

  Ah, an excellent question! According to my retrieved knowledge, the heaviest blue whale ever recorded weighed approximately 190,000 kilograms (or 190 tons). That's a whopping amount of weight! Can I help you with anything else?
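
Conceptually, a semantic memory search ranks stored embeddings by their similarity to the query embedding, drops anything below `min_relevance_score`, and returns the top `limit` records. The sketch below illustrates that flow, assuming cosine similarity as the relevance score. The `TinyMemory` class and the toy 2-D vectors are hypothetical stand-ins, not the Semantic Kernel API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyMemory:
    """Hypothetical in-memory store for illustration; not the Semantic Kernel API."""

    def __init__(self) -> None:
        self._records: Dict[str, Tuple[str, List[float]]] = {}

    def save(self, record_id: str, text: str, embedding: Sequence[float]) -> None:
        self._records[record_id] = (text, list(embedding))

    def search(
        self, query_embedding: Sequence[float], limit: int = 1, min_relevance_score: float = 0.0
    ) -> List[Tuple[float, str]]:
        scored = [
            (cosine_similarity(query_embedding, emb), text)
            for text, emb in self._records.values()
        ]
        relevant = [item for item in scored if item[0] >= min_relevance_score]
        relevant.sort(key=lambda item: item[0], reverse=True)
        return relevant[:limit]


memory = TinyMemory()
# The 2-D embeddings are made-up values standing in for real model output.
memory.save("fact_moon", "The Moon has a diameter of 3,474 km.", [1.0, 0.0])
memory.save(
    "fact_bluewhale",
    "The heaviest recorded blue whale weighed approximately 190,000 kg.",
    [0.0, 1.0],
)
results = memory.search([0.1, 0.9], limit=1, min_relevance_score=0.5)
print(results[0][1])
```

Raising `min_relevance_score` above `0.0`, as in this sketch, is one way to avoid passing an unrelated fact to the LLM when nothing in memory matches the query well.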


## **Exercise 5: Using Semantic Kernel Functions to Provide Information**
Semantic Kernel can call functions through instruct models, so this exercise uses a local Ollama server.

### **What You Will Learn**
- How to run and experiment with a local Ollama server
- How to load a model and chat with it
- How to write code that:
    - Adds memory access through a Semantic Kernel plugin function
    - Adds functions that let the kernel perform calculations


### Using a Local Ollama Server

```
docker run -d --name ollama -p 11434:11434 ollama/ollama:latest
```
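#### Start the server

The `ollama/ollama` container normally launches `ollama serve` on startup, so run this command only if the server is not already running.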

```
docker exec -it ollama ollama serve
```

#### Chat with a model

```
docker exec -it ollama ollama run llama3.2
```
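
Before running the notebook cell below, it can help to confirm that the server answers on the published port. The following is a minimal sketch using only the standard library. The helper name `is_ollama_up` is made up, and the check only assumes that the server responds to a plain HTTP GET on its root URL.

```python
import urllib.error
import urllib.request


def is_ollama_up(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `host` with status 200."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False


if is_ollama_up():
    print("Ollama server is reachable.")
else:
    print("Ollama server is not reachable; check that the container is running.")
```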


In [None]:
import asyncio
import torch
from transformers import AutoTokenizer, AutoModel
from semantic_kernel import Kernel
from semantic_kernel.memory import VolatileMemoryStore
from semantic_kernel.memory.semantic_text_memory import SemanticTextMemory
from semantic_kernel.core_plugins.math_plugin import MathPlugin
from semantic_kernel.core_plugins.time_plugin import TimePlugin
from semantic_kernel.connectors.ai.ollama import OllamaChatCompletion
from semantic_kernel.connectors.ai.ollama.ollama_prompt_execution_settings import OllamaChatPromptExecutionSettings
from semantic_kernel.functions.kernel_function_decorator import kernel_function
from semantic_kernel.connectors.ai.function_choice_behavior import FunctionChoiceBehavior
from semantic_kernel.contents import ChatHistory
from semantic_kernel.functions import KernelArguments

# Initialize the kernel
kernel = Kernel()

# Initialize memory store and semantic text memory
memory_store = VolatileMemoryStore()

# Set up the local embedding service for the memory store
embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(embed_model_name)
embed_model = AutoModel.from_pretrained(embed_model_name)

def generate_embedding(text: str):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        model_output = embed_model(**inputs)
    # Mean pooling to get a fixed-size embedding vector
    embeddings = model_output.last_hidden_state.mean(dim=1)
    return embeddings[0].cpu().numpy()

class LocalEmbeddingService:
    async def generate_embeddings(self, texts, **kwargs):
        embeddings = []
        for t in texts:
            emb = await asyncio.to_thread(generate_embedding, t)
            embeddings.append(emb)
        return embeddings

local_embedding_service = LocalEmbeddingService()
sk_memory = SemanticTextMemory(memory_store, local_embedding_service)

# Add the built-in math and time plugins to the kernel
kernel.add_plugin(MathPlugin(), plugin_name="math")
kernel.add_plugin(TimePlugin(), plugin_name="time")

# --- Populate Knowledge Base ---
knowledge_base = {
    "fact_rag": "RAG improves LLM responses by retrieving relevant external documents.",
    "fact_sk": "Semantic Kernel is an AI orchestration framework.",
    "fact_llama2": "LLama2 is a local, open-source language model.",
    "fact_moon": "The Moon has a diameter of 3,474 km.",
    "fact_bluewhale": "The heaviest recorded blue whale weighed approximately 190,000 kg.",
    "fact_eiffel": "The Eiffel Tower can be 15 cm taller in the summer due to thermal expansion."
}

async def save_knowledge():
    for fact_id, fact_text in knowledge_base.items():
        await sk_memory.save_information(collection="knowledge_base", text=fact_text, id=fact_id)

# Save the facts to memory
await save_knowledge()

class MemoryPlugin:
    def __init__(self, memory):
        self.memory = memory

    @kernel_function(
        name="retrieve_fact",
        description="Retrieve a fact from memory based on the user's query using RAG. "
                    "The fact returned will contain a weight in kilograms embedded in text."
    )
    async def retrieve_fact(self, query: str) -> str:
        """Retrieve a fact from SK Memory using RAG."""
        results = await self.memory.search(collection="knowledge_base", query=query, limit=1, min_relevance_score=0.0)
        return results[0].text if results else "No relevant fact found."

kernel.add_plugin(MemoryPlugin(sk_memory), "memory_plugin")

# Configure the Ollama chat completion service
model_name = "llama3.2"  # Ensure this model is pulled and available
ollama_endpoint = "http://localhost:11434"
chat_completion_service = OllamaChatCompletion(ai_model_id=model_name, host=ollama_endpoint)

# Create request settings for Ollama
request_settings = OllamaChatPromptExecutionSettings()
request_settings.function_choice_behavior = FunctionChoiceBehavior.Auto(filters={"excluded_plugins": ["ChatBot"]})

# Register the Ollama service with the kernel
kernel.add_service(chat_completion_service)

# User query
user_input = "How much does the heaviest blue whale weigh in pounds?"

# Initialize chat history
history = ChatHistory()
history.add_user_message(user_input)

# Update arguments with user input and chat history
arguments = KernelArguments(settings=request_settings)
arguments["user_input"] = user_input
arguments["chat_history"] = history

chat_function = kernel.add_function(
    prompt="{{$chat_history}}{{$user_input}}",
    plugin_name="ChatBot",
    function_name="Chat")
    
# Invoke the chat function
result = await kernel.invoke(chat_function, arguments=arguments)

# Process the result
if result:
    response = result.value[0]
    print(f"Chatbot:> {response}")


Chatbot:> To convert this to pounds, we can multiply by 2.20462 (since there are 2.20462 pounds in a kilogram):

189,000 kg * 2.20462 lbs/kg ≈ 416,000 pounds.

So, the heaviest recorded blue whale weighed approximately 416,000 pounds.
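
The answer above works from 189,000 kg, while the fact stored in the knowledge base says 190,000 kg, so it is worth checking model arithmetic against the source data. The snippet below is a small sanity check, not part of the original exercise. It reuses the conversion factor quoted in the chatbot's answer, and the function name `kg_to_lb` is illustrative.

```python
KG_TO_LB = 2.20462  # conversion factor quoted in the chatbot's answer


def kg_to_lb(kilograms: float) -> float:
    """Convert kilograms to pounds."""
    return kilograms * KG_TO_LB


stored_fact_kg = 190_000  # weight stated in the knowledge base
model_used_kg = 189_000   # weight that appears in the model's answer

print(f"Stored fact:    {kg_to_lb(stored_fact_kg):,.0f} lb")
print(f"Model's figure: {kg_to_lb(model_used_kg):,.0f} lb")
```

With the stored fact, the result is about 418,878 lb rather than the roughly 416,000 lb reported. Delegating the multiplication to a function, as the math plugin is meant to allow, avoids this kind of drift.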


## **Final Exercise**

**Exercise 6: Building a Disk Traversal and Summarization Chatbot**

**Objective:**
Develop a Python application that traverses a specified directory, identifies text-based documents, generates concise summaries for each, stores these summaries in memory, and enables interactive querying of this information through a chatbot interface.

**Key Learning Outcomes:**
- Implementing directory traversal to locate text documents.
- Applying text summarization techniques to distill essential information.
- Storing and managing summaries using Semantic Kernel Memory.
- Creating an interactive chatbot capable of responding to user queries based on the stored summaries.

**Instructions:**

1. **Directory Traversal:**
   - Utilize Python's `os` or `pathlib` modules to recursively traverse a user-specified directory.
   - Identify and collect the paths of text-based documents (e.g., `.txt`, `.md`, `.docx`, `.pdf`).
   - For simplicity, target `.txt` and `.md` files first.
   - Optionally, use additional Python libraries to handle `.docx` and `.pdf` files.

2. **Summarizing and Storing Summaries in Semantic Kernel Memory:**
   - Generate a concise summary of each document with the LLM.
   - Use Semantic Kernel (SK) Memory to store the generated summaries.

3. **Developing the Chatbot Interface:**
   - Create an interactive chatbot that can handle user queries related to the summarized documents.
   - Upon receiving a query, the chatbot should:
     - Retrieve relevant summaries from SK Memory.
     - Provide coherent and contextually relevant responses based on the stored information.
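
As a starting point for the directory traversal step above, the sketch below collects `.txt` and `.md` files recursively with `pathlib`. The function name `find_text_documents` and the demo file names are made up for illustration. Summarization and storage are left for you to implement.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

TEXT_SUFFIXES = {".txt", ".md"}


def find_text_documents(root: str, suffixes: Iterable[str] = TEXT_SUFFIXES) -> List[Path]:
    """Recursively collect files under `root` whose extension is in `suffixes`."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


# Small demonstration in a temporary directory (file names are made up).
with tempfile.TemporaryDirectory() as demo_dir:
    demo = Path(demo_dir)
    (demo / "notes.txt").write_text("Meeting notes", encoding="utf-8")
    (demo / "docs").mkdir()
    (demo / "docs" / "readme.md").write_text("# Readme", encoding="utf-8")
    (demo / "image.png").write_bytes(b"")
    found = [p.relative_to(demo).as_posix() for p in find_text_documents(demo_dir)]
    print(found)
```

Each returned path can then be read with `Path.read_text`, summarized by the LLM, and saved to SK Memory with the file path as the record id.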

This exercise aims to consolidate your understanding of integrating file system operations, text processing, memory management, and interactive user interfaces within a cohesive application. 

## **Setting up offline environment**

To set up your environment for running the provided Jupyter Notebook in a disconnected setting, follow these steps:

1. **Download and Prepare Dependencies**: Use the following script to download all necessary models and Docker images. This script should be executed in an environment with internet access.

   ```bash
   #!/bin/bash

   # Create a directory to store all resources
   mkdir -p llm_resources
   cd llm_resources

   # Download the Llama 2 model repository (this clones every quantization, so it is large)
   git lfs install
   git clone https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
   mv Llama-2-7B-Chat-GGUF models

   # Pull the Ollama Docker image
   docker pull ollama/ollama:latest
   docker save ollama/ollama:latest -o ollama_latest.tar

   # Create a requirements file for Python dependencies
   cat <<EOF > requirements.txt
   torch
   torchvision
   torchaudio
   faiss-cpu
   numpy
   sentence-transformers
   semantic-kernel
   huggingface_hub
   llama-cpp-python
   ollama
   EOF

   # Download Python packages
   pip download -r requirements.txt -d python_packages

   echo "All resources have been downloaded and saved in the 'llm_resources' directory."
   ```


   **Instructions**:

   - Run the above script on a machine with internet access.
   - Transfer the `llm_resources` directory to your target offline environment.

2. **Set Up in the Disconnected Environment**:

   - **Install Docker**: Ensure Docker is installed on your offline machine. If not, download the Docker installation package appropriate for your system and transfer it to the machine for installation.

   - **Load the Ollama Docker Image**: Navigate to the `llm_resources` directory and load the Docker image:

     ```bash
     docker load -i ollama_latest.tar
     ```

   - **Install Python Dependencies**: Use the pre-downloaded Python packages to set up your environment:

     ```bash
     pip install --no-index --find-links=python_packages -r requirements.txt
     ```

   - **Set Up Models**: The notebook fetches the model with `hf_hub_download`, which needs internet access. Offline, set `model_path` to the local file instead, for example `llm_resources/models/llama-2-7b-chat.Q4_K_M.gguf`.

3. **Running the Jupyter Notebook**:

   - **Start the Ollama Server**: Run the Ollama server using Docker:

     ```bash
     docker run -d --name ollama -p 11434:11434 ollama/ollama:latest
     ```

   - **Launch Jupyter Notebook**: Navigate to your project directory and start Jupyter Notebook:

     ```bash
     jupyter notebook
     ```

   - **Access the Notebook**: Open your web browser and navigate to the Jupyter Notebook interface to open and run your notebook.

By following these steps, you can set up and run your Jupyter Notebook in an environment without internet access. 
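
Before moving the `llm_resources` directory to the offline machine, a quick check that everything the script was meant to produce is actually there can save a round trip. The sketch below assumes the layout created by the download script above. The helper name `missing_resources` is illustrative.

```python
from pathlib import Path
from typing import List

# Paths the download script is expected to create, relative to llm_resources.
EXPECTED_RESOURCES = [
    "models/llama-2-7b-chat.Q4_K_M.gguf",
    "ollama_latest.tar",
    "requirements.txt",
    "python_packages",
]


def missing_resources(root: str) -> List[str]:
    """Return the expected resources that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_RESOURCES if not (base / rel).exists()]


missing = missing_resources("llm_resources")
if missing:
    print("Missing resources:", ", ".join(missing))
else:
    print("All expected resources are present.")
```

Note that this list covers only what the script downloads. The Ollama model (`llama3.2`) and the `all-MiniLM-L6-v2` embedding model used in the notebook are not part of it and would need to be transferred separately.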