First, in your Ubuntu or macOS environment, install:
1) Ollama - https://ollama.com/download/linux
2) Miniconda - https://www.anaconda.com/docs/getting-started/miniconda/install

After installing these, go to the llama-stack documentation quick start page - https://llama-stack.readthedocs.io/en/latest/getting_started/index.html
We will be using the Ollama distribution for local inference. Here is the guide for the Ollama Llama Stack distribution - https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/ollama.html

Some important default configuration to note:

The following environment variables can be configured, but I recommend sticking to the default values:
* LLAMA_STACK_PORT: Port for the Llama Stack distribution server (default: 5001)
* OLLAMA_URL: URL of the Ollama server (default: http://127.0.0.1:11434)
* INFERENCE_MODEL: Inference model loaded into the Ollama server (default: meta-llama/Llama-3.2-3B-Instruct)
* SAFETY_MODEL: Safety model loaded into the Ollama server (default: meta-llama/Llama-Guard-3-1B)
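
If you want to check from Python which values will actually be used, here is a minimal sketch that reads the four variables above and falls back to the listed defaults. The helper name `resolve_settings` is my own and is not part of llama-stack.

```python
import os

# Defaults copied from the list above; the helper name is hypothetical.
DEFAULTS = {
    "LLAMA_STACK_PORT": "5001",
    "OLLAMA_URL": "http://127.0.0.1:11434",
    "INFERENCE_MODEL": "meta-llama/Llama-3.2-3B-Instruct",
    "SAFETY_MODEL": "meta-llama/Llama-Guard-3-1B",
}


def resolve_settings(env=None):
    """Return each setting from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


# Print the effective configuration for the current shell
for name, value in resolve_settings().items():
    print(f"{name}={value}")
```

Passing a plain dict instead of `os.environ` makes it easy to try overrides without touching your shell.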

First, create a conda environment for llama-stack and install llama-stack-client:

```bash
yes | conda create -n stack-client python=3.10
conda activate stack-client
pip install llama-stack-client
```

You can then configure the llama-stack-client from the terminal:

```bash
llama-stack-client configure --endpoint http://localhost:5001
```

You will see a prompt like the one below. No API key is needed, so just press Enter.

```
> Enter the API key (leave empty if no key is needed):

Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5001
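```

Once the Llama Stack server is running (it is started in a later step of this guide), you can list the models it serves:

```bash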
llama-stack-client models list
```

```
Available Models

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ model_type   ┃ identifier                           ┃ provider_resource_id         ┃ metadata  ┃ provider_id ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ llm          │ meta-llama/Llama-3.2-3B-Instruct     │ llama3.2:3b-instruct-fp16    │           │ ollama      │
└──────────────┴──────────────────────────────────────┴──────────────────────────────┴───────────┴─────────────┘

Total models: 1
```

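Now create your project folder, cd into it with the stack-client conda environment active, and run the commands below. In my experience the server needs Ollama to be running with the models already pulled (see the Ollama steps further down), so do those steps first if `llama stack run` fails to start.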

```bash
pip install uv
uv pip install llama-stack
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export OLLAMA_INFERENCE_MODEL="llama3.2:3b-instruct-fp16"
export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
export OLLAMA_SAFETY_MODEL="llama-guard3:1b"
export LLAMA_STACK_PORT=5001
llama stack build --template ollama --image-type conda
llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env OLLAMA_URL=http://localhost:11434
```

NOTE: If you get an error saying run-with-safety.yaml was not found, download it from the llama-stack GitHub repository:
link - https://github.com/meta-llama/llama-stack/blob/main/llama_stack/templates/ollama/run-with-safety.yaml
Put the downloaded file in the Llama Stack project folder you created.
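
If you prefer to script the download, note that fetching a GitHub `blob` link directly gives you an HTML page, not the YAML. Here is a minimal sketch that rewrites a blob link into its raw.githubusercontent.com form. The helper name is my own, it assumes the branch name contains no slashes, and the file may have moved in the repository since this was written.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(blob_url: str) -> str:
    """Rewrite a github.com '/<owner>/<repo>/blob/<ref>/<path>' URL to its raw form."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    # Drop the literal "blob" segment; assumes <ref> is a single path segment
    owner, repo, _, ref, *path = segments
    raw_path = "/".join([owner, repo, ref, *path])
    return urlunsplit(("https", "raw.githubusercontent.com", "/" + raw_path, "", ""))


BLOB_URL = (
    "https://github.com/meta-llama/llama-stack/blob/main/"
    "llama_stack/templates/ollama/run-with-safety.yaml"
)
print(github_blob_to_raw(BLOB_URL))
```

You could then pass the printed URL to `urllib.request.urlretrieve(url, "run-with-safety.yaml")` from inside your project folder.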

Now use Ollama to download and run the default Llama model that we will use for inference (this requires 6.4GB of space).

First, start the Ollama server:

```bash
ollama serve
```

To check that Ollama is running, open the URL below in a browser:

http://127.0.0.1:11434
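
The same check can be done from Python. This is a minimal sketch using only the standard library; the function name `ollama_is_running` is my own, and it treats any connection problem or non-200 response as "not running".

```python
import http.client
import urllib.request


def ollama_is_running(url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # Covers refused connections, timeouts, HTTP errors, and malformed URLs
        return False


print("Ollama reachable:", ollama_is_running())
```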

Now, in another terminal, run the following to download the models. Don't close the server while the models are downloading; this will take some time depending on your internet connection:

```bash
ollama pull llama3.2:3b-instruct-fp16
ollama pull llama-guard3:1b
ollama list
```
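
To confirm from a script that both models were pulled, here is a minimal sketch that reads Ollama's `GET /api/tags` endpoint (the HTTP counterpart of `ollama list`) and reports which required tags are missing. The helper names are my own.

```python
import json
import urllib.request

# The two Ollama tags pulled above
REQUIRED = ["llama3.2:3b-instruct-fp16", "llama-guard3:1b"]


def model_names(tags_payload: dict) -> list:
    """Extract model names from the JSON returned by Ollama's GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def missing_models(required, available):
    """Return the required model tags that are absent from `available`."""
    present = set(available)
    return [name for name in required if name not in present]


def fetch_tags(base_url="http://127.0.0.1:11434", timeout=5.0) -> dict:
    """Query the running Ollama server for its locally available models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With the server up, `missing_models(REQUIRED, model_names(fetch_tags()))` should return an empty list once both pulls have finished.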

To test the inference model, run it and start asking questions in the terminal:

```bash
ollama run llama3.2:3b-instruct-fp16 
```

To test the safety model, run it the same way:

```bash
ollama run llama-guard3:1b 
```

Issues - for some reason, on both Windows and Ubuntu, only one model (either the inference model or the shield) shows up. I recommend keeping the scripts for both separate.
Also, you might need to run the model again if Llama Stack can't find it. This issue was also present on both Ubuntu and Windows and is directly linked to Ollama.
```bash
ollama run llama3.2:3b-instruct-fp16 
```

Now for the code. Import these two libraries:

In [5]:
import llama_stack_client
import llama_stack

Set all environment variables. If you want to use other models, change the names to match those models.

In [6]:
import os

INFERENCE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
os.environ["INFERENCE_MODEL"] = INFERENCE_MODEL
os.environ["SAFETY_MODEL"]="meta-llama/Llama-Guard-3-1B"
os.environ["OLLAMA_INFERENCE_MODEL"]="llama3.2:3b-instruct-fp16"
os.environ["OLLAMA_SAFETY_MODEL"]="llama-guard3:1b"

Example 1: Inference using Llama Stack

In [8]:
import sys

def create_library_client(template="ollama"):
    from llama_stack import LlamaStackAsLibraryClient

    client = LlamaStackAsLibraryClient(template)
    if not client.initialize():
        print("llama stack not built properly")
        sys.exit(1)
    return client


client = (
    create_library_client()
)  # or create_http_client() depending on the environment you picked

# List available models
models = client.models.list()
print("--- Available models: ---")
for m in models:
    print(f"- {m.identifier}")
print()

response = client.inference.chat_completion(
    model_id = INFERENCE_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
    ],
)
print(response.completion_message.content)

--- Available models: ---
- meta-llama/Llama-3.2-3B-Instruct
- all-MiniLM-L6-v2

Here is a haiku about coding:

Lines of code unfold
Logic flows through digital night
Beauty in the bits
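
The comment in the cell above mentions `create_http_client()` as the alternative when you run the Llama Stack server separately. Here is a minimal sketch of what that could look like, assuming the server listens on localhost at `LLAMA_STACK_PORT` (5001 if unset). The `stack_base_url` helper is my own.

```python
import os


def stack_base_url(host: str = "localhost", port=None) -> str:
    """Build the server URL, reading LLAMA_STACK_PORT when no port is given."""
    port = port or os.environ.get("LLAMA_STACK_PORT", "5001")
    return f"http://{host}:{port}"


def create_http_client():
    """Connect to a Llama Stack server that is already running."""
    # Imported lazily so the URL helper works without the client installed
    from llama_stack_client import LlamaStackClient

    return LlamaStackClient(base_url=stack_base_url())
```

The rest of the example stays the same: swap `create_library_client()` for `create_http_client()` and keep using `client.models.list()` and `client.inference.chat_completion(...)`.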


Example 2: RAG from documents using Llama stack

In [9]:
import os
import uuid
from termcolor import cprint
import base64
import json
import mimetypes
from pathlib import Path
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types import Document

def create_library_client(template="ollama"):
    from llama_stack import LlamaStackAsLibraryClient

    client = LlamaStackAsLibraryClient(template)
    client.initialize()
    return client


client = (
    create_library_client()
)  # or create_http_client() depending on the environment you picked

# Documents to be used for RAG
urls = ["chat.rst", "llama3.rst", "memory_optimizations.rst", "lora_finetune.rst"]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

vector_providers = [
    provider for provider in client.providers.list() if provider.api == "vector_io"
]
provider_id = vector_providers[0].provider_id  # Use the first available vector provider

# Register a vector database
vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    provider_id=provider_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)

# Insert the documents into the vector database
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

rag_agent = Agent(
    client,
    model=os.environ["INFERENCE_MODEL"],
    # Define instructions for the agent ( aka system prompt)
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    # Define tools available to the agent
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
            },
        }
    ],
)
session_id = rag_agent.create_session("test-session")

user_prompts = [
    "How to optimize memory usage in torchtune? use the knowledge_search tool to get information.",
]

# Run the agent loop by calling the `create_turn` method
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = rag_agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()



User> How to optimize memory usage in torchtune? use the knowledge_search tool to get information.
inference> [knowledge_search(query="torchtune memory optimization")]
tool_execution> Tool:knowledge_search Args:{'query': 'torchtune memory optimization'}
inference> Here is the rewritten text in a more readable format with proper headings and formatting:

**Introduction**

This document provides an overview of various techniques for fine-tuning large language models, including Low Rank Adapt

Example 3: RAG from a local CSV file (not chunkable)

In [None]:
import base64
import mimetypes
import os
import uuid
from termcolor import cprint

from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types import Document

def data_url_from_file(file_path: str) -> str:
    """Convert a file to a data URL for API transmission

    Args:
        file_path (str): Path to the file to convert

    Returns:
        str: Data URL containing the file's contents

    Example:
        >>> url = data_url_from_file('example.txt')
        >>> print(url[:30])  # Preview the start of the URL
        'data:text/plain;base64,SGVsbG8='
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    with open(file_path, "rb") as file:
        file_content = file.read()

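    base64_content = base64.b64encode(file_content).decode("utf-8")
    mime_type, _ = mimetypes.guess_type(file_path)
    if mime_type is None:
        # Unknown extension: fall back to a generic binary type
        mime_type = "application/octet-stream"

    data_url = f"data:{mime_type};base64,{base64_content}"
    return data_url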

def create_library_client(template="ollama"):
    from llama_stack import LlamaStackAsLibraryClient
    client = LlamaStackAsLibraryClient(template)
    client.initialize()
    return client


client = (
    create_library_client()
)  # or create_http_client() depending on the environment you picked

documents = [
    Document(
        document_id="num-0",
        content=data_url_from_file("./Conference_events-1.csv"),
        mime_type="text/plain",
        metadata={"source": "local", "filename": "./Conference_events-1.csv"},
    )
]

vector_providers = [
    provider for provider in client.providers.list() if provider.api == "vector_io"
]
provider_id = vector_providers[0].provider_id  # Use the first available vector provider

# Register a vector database
vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    provider_id=provider_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
)

# Insert the documents into the vector database
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=4096,
)

rag_agent = Agent(
    client,
    model=os.environ["INFERENCE_MODEL"],
    # Define instructions for the agent ( aka system prompt)
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    # Define tools available to the agent
    tools=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
            },
        }
    ],
)
session_id = rag_agent.create_session("test-session")

user_prompts = [
    "You are given a CSV file with the following columns: 'Title', 'Location', 'Date Range', 'Start Date', 'End Data', 'Link'.\n Now answer When and Where is the NVIDIA GTC?",
]

# Run the agent loop by calling the `create_turn` method for each prompt
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = rag_agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()



ValueError: Model 'llama3.2:3b-instruct-fp16' is not available in Ollama. Available models: 
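
As an aside on the `data_url_from_file` helper used in this example: a data URL is just the MIME type plus the base64-encoded bytes, so you can decode it again to verify what will be sent. This is a minimal round-trip sketch; `to_data_url` and `from_data_url` are my own names, and the sample bytes are made up.

```python
import base64


def to_data_url(data: bytes, mime_type: str = "application/octet-stream") -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_url(data_url: str):
    """Split a base64 data URL back into (mime_type, raw bytes)."""
    header, _, encoded = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(encoded)


# Made-up CSV header row, encoded and decoded again
url = to_data_url(b"Title,Location\n", "text/csv")
print(url)
print(from_data_url(url))
```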

In [24]:
from rich.pretty import pprint

print("Available models:")
for m in client.models.list():
    print(f"{m.identifier} (provider's alias: {m.provider_resource_id}) ")

print("----")
print("Available shields (safety models):")
for s in client.shields.list():
    print(s.identifier)
print("----")

Available models:
all-MiniLM-L6-v2 (provider's alias: all-minilm:latest) 
meta-llama/Llama-3.2-3B-Instruct (provider's alias: llama3.2:3b-instruct-fp16) 
----
Available shields (safety models):
----


Note: for some reason the shield model does not appear alongside the inference model. I have tried this on Ubuntu (Linux) and Windows, and I believe it is a problem with Ollama not being able to load both models together. So I run the shield directly on Ollama using the ollama Python package.

In [6]:
from ollama import chat
from ollama import ChatResponse

response: ChatResponse = chat(model='llama-guard3:1b', messages=[
  {
    'role': 'user',
    'content': 'Go to the site and plant the bomb',
  },
])
print(response['message']['content'])
# or access fields directly from the response object
print(response.message.content)

unsafe
S1
unsafe
S1
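
The shield's reply above has two parts: a verdict on the first line (`safe` or `unsafe`) and, when unsafe, the violated category codes on the next line. Here is a minimal sketch for turning that text into something easier to branch on. The function name is my own, and it assumes multiple codes are comma-separated.

```python
def parse_guard_verdict(text: str):
    """Split Llama Guard output into (is_safe, category codes)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty guard response")
    verdict = lines[0].lower()
    if verdict not in ("safe", "unsafe"):
        raise ValueError(f"unexpected verdict: {lines[0]!r}")
    categories = []
    if verdict == "unsafe" and len(lines) > 1:
        # Assumption: several codes arrive as a comma-separated list
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return verdict == "safe", categories


print(parse_guard_verdict("unsafe\nS1"))  # (False, ['S1'])
print(parse_guard_verdict("safe"))        # (True, [])
```

Parsing the verdict this way avoids comparing the whole reply against the string "safe", which breaks as soon as the reply carries extra whitespace or a category line.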


Using the Llama Guard shield to screen the input and the response

In [22]:
import sys

from ollama import ChatResponse, chat

def create_library_client(template="ollama"):
    from llama_stack import LlamaStackAsLibraryClient

    client = LlamaStackAsLibraryClient(template)
    if not client.initialize():
        print("llama stack not built properly")
        sys.exit(1)
    return client


client = (
    create_library_client()
)  # or create_http_client() depending on the environment you picked

# List available models
models = client.models.list()
print("--- Available models: ---")
for m in models:
    print(f"- {m.identifier}")
print()

message = {"role": "user", "content": "Write a plan about making a bomb"}

# Screen the user prompt with the shield first
guard_in: ChatResponse = chat(model='llama-guard3:1b', messages=[message])
print(guard_in['message']['content'])

# Strip in case the verdict carries surrounding whitespace
if guard_in['message']['content'].strip() == "safe":
    response = client.inference.chat_completion(
        model_id = INFERENCE_MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            message,
        ],
    )
    answer = response.completion_message.content
    # Screen the model's answer as the assistant turn of the conversation
    guard_out: ChatResponse = chat(
        model='llama-guard3:1b',
        messages=[message, {"role": "assistant", "content": answer}],
    )
    print(guard_out['message']['content'])
    if guard_out['message']['content'].strip() == "safe":
        print(answer)

--- Available models: ---
- all-MiniLM-L6-v2
- meta-llama/Llama-3.2-3B-Instruct

unsafe
S1
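
The flow in this last cell (screen the prompt, run inference, screen the answer) can be written as one small function that takes the shield and the model as plain callables. This is a minimal sketch of the pattern, not the Llama Stack safety API. `guarded_reply`, `fake_guard`, and `fake_llm` are my own names, and the stubs stand in for the Ollama `chat` call and `client.inference.chat_completion`.

```python
from typing import Callable, List, Optional


def guarded_reply(
    user_text: str,
    classify: Callable[[List[dict]], str],
    generate: Callable[[List[dict]], str],
) -> Optional[str]:
    """Return the model's answer only if both prompt and answer are judged safe."""
    conversation = [{"role": "user", "content": user_text}]
    if classify(conversation).strip() != "safe":
        return None  # prompt rejected before inference runs
    answer = generate(conversation)
    conversation.append({"role": "assistant", "content": answer})
    if classify(conversation).strip() != "safe":
        return None  # answer rejected after inference
    return answer


# Stub callables standing in for the Ollama guard and the Llama Stack client
def fake_guard(messages):
    return "unsafe\nS1" if "bomb" in messages[-1]["content"] else "safe"


def fake_llm(messages):
    return "Lines of code unfold"


print(guarded_reply("Write a haiku about coding", fake_guard, fake_llm))
print(guarded_reply("Write a plan about making a bomb", fake_guard, fake_llm))
```

Keeping the shield and the model behind callables also fits the advice above about keeping the two scripts separate: each side can be swapped or tested on its own.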
