# Ollama Setup and Testing

This notebook will help you set up and test Ollama with LangChain connectors before starting the main RAG assignment.

## Prerequisites

1. **Install Ollama** from https://ollama.ai
   - On Linux: `curl -fsSL https://ollama.ai/install.sh | sh` (the `-L` flag lets curl follow redirects)
   - On macOS or Windows: Download and run the installer

2. **Verify Installation**
   - Run `ollama -v` in your terminal
   - The output should show version 0.11.10 or later

3. **Pull Required Models**
   ```bash
   # For chat/inference
   ollama pull gpt-oss:20b
   
   # For embeddings
   ollama pull embeddinggemma:latest
   ```
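The version check in step 2 can also be scripted. The following is a minimal sketch, not part of the original setup: the helper names `parse_version` and `meets_minimum` are hypothetical, the sample strings are assumed examples of what `ollama -v` prints, and the code assumes the output contains a dotted `major.minor.patch` number.

```python
import re

MINIMUM_VERSION = (0, 11, 10)  # minimum stated in the prerequisites


def parse_version(text):
    """Return the first major.minor.patch number found in text as a tuple, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def meets_minimum(text, minimum=MINIMUM_VERSION):
    """True if the version found in text is at least the minimum."""
    version = parse_version(text)
    return version is not None and version >= minimum


# Assumed sample strings; paste the real output of `ollama -v` instead.
print(meets_minimum("ollama version is 0.11.10"))  # → True
print(meets_minimum("ollama version is 0.9.6"))    # → False
```

Comparing integer tuples rather than strings matters here: as plain text, `"0.9.6"` sorts after `"0.11.10"`, which would wrongly accept the older release.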


## Step 1: Test Ollama Connection

First, let's verify that Ollama is running and accessible:


In [1]:
import requests

# Test if Ollama is running by listing the locally available models
try:
    # A timeout keeps the cell from hanging if the server is unresponsive
    response = requests.get('http://localhost:11434/api/tags', timeout=5)
    if response.status_code == 200:
        models = response.json()
        print("✅ Ollama is running!")
        print("\nAvailable models:")
        for model in models.get('models', []):
            print(f"  - {model['name']}")
    else:
        print(f"❌ Ollama is not responding properly (HTTP {response.status_code})")
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
    print("❌ Cannot connect to Ollama. Make sure it's running!")
    print("Start Ollama by running 'ollama serve' in a terminal")


✅ Ollama is running!

Available models:
  - qwen3:8b
  - gpt-oss:20b
  - embeddinggemma:latest
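Beyond listing the models, it can help to confirm that every model from the prerequisites is actually present. A minimal sketch, assuming the `/api/tags` payload has the `models` / `name` shape used in the cell above; `missing_models` and `REQUIRED_MODELS` are hypothetical names introduced for illustration:

```python
REQUIRED_MODELS = ["gpt-oss:20b", "embeddinggemma:latest"]  # from the prerequisites


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required model names that are absent from an /api/tags payload."""
    available = {model["name"] for model in tags_payload.get("models", [])}
    return [name for name in required if name not in available]


# Sample payload mirroring the output above; in the notebook, pass response.json().
sample = {
    "models": [
        {"name": "qwen3:8b"},
        {"name": "gpt-oss:20b"},
        {"name": "embeddinggemma:latest"},
    ]
}
print(missing_models(sample))  # → []
```

Any name in the returned list can be fetched with `ollama pull <model-name>`.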


## Step 2: Test Embeddings with Ollama

Now let's test creating embeddings using the LangChain Ollama connector:


In [3]:
from langchain_ollama import OllamaEmbeddings

# Initialize the embedding model
embedding_model = OllamaEmbeddings(
    model="embeddinggemma:latest",
    base_url="http://localhost:11434"  # Default Ollama URL
)


print("✅ Embedding model initialized")

✅ Embedding model initialized


In [6]:
# Test embedding a single query
test_query = "What is the meaning of life?"

print(f"Embedding query: '{test_query}'")
embedding = embedding_model.embed_query(test_query)

print(f"\n✅ Successfully created embedding!")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")


Embedding query: 'What is the meaning of life?'

✅ Successfully created embedding!
Embedding dimension: 768
First 10 values: [-0.14624307, 0.029132523, 0.037615955, -0.02487349, -0.02655731, 0.016060056, -0.027486855, 0.027256342, 0.01139732, -2.40085e-05]


In [9]:
# Test embedding multiple documents
test_documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science."
]

print("Embedding multiple documents...")
embeddings = embedding_model.embed_documents(test_documents)

print(f"\n✅ Successfully created {len(embeddings)} embeddings!")
for i, doc in enumerate(test_documents):
    print(f"\nDocument {i+1}: '{doc[:50]}...'")
    print(f"  Embedding dimension: {len(embeddings[i])}")
    print(f"  First 5 values: {embeddings[i][:5]}")


Embedding multiple documents...

✅ Successfully created 3 embeddings!

Document 1: 'The quick brown fox jumps over the lazy dog....'
  Embedding dimension: 768
  First 5 values: [-0.14781645, 0.002890088, 0.052144155, -0.029089162, -0.036695607]

Document 2: 'Machine learning is a subset of artificial intelli...'
  Embedding dimension: 768
  First 5 values: [-0.1240763, -0.0027511106, -0.00032809668, 0.010076388, 0.001677464]

Document 3: 'Python is a popular programming language for data ...'
  Embedding dimension: 768
  First 5 values: [-0.16269195, -0.010295422, 0.025464684, 0.000692403, -0.01887165]
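Printing raw values confirms the shape of the embeddings but not whether they are meaningful. One common sanity check is cosine similarity between vectors, which is also the kind of comparison a RAG retriever typically relies on. The sketch below is an illustration, not part of the original notebook; `cosine_similarity` is a hypothetical helper, and the toy vectors stand in for real embeddings so the block runs without Ollama.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)


# Toy vectors: identical directions score 1.0, orthogonal ones 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

In the notebook, the same function could be applied to `embedding` and each entry of `embeddings`; a query about machine learning would be expected to score highest against the second test document, though the exact values depend on the model.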


## Step 3: Test Model Inference with Ollama

Now let's test text generation (inference) with Ollama through the LangChain connector:


In [18]:
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

# Initialize the chat model
chat_model = ChatOllama(
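    model="qwen3:8b",  # not listed in the prerequisites; run `ollama pull qwen3:8b` first
    # model="gpt-oss:20b",  # alternative: the chat model pulled in the prerequisites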
    temperature=0.6,
    base_url="http://localhost:11434",
    verbose=True
)

print("✅ Chat model initialized")


✅ Chat model initialized


Let's add a helper function to measure the model's inference performance:

In [19]:
def detailed_performance_metrics(response_metadata):
    """
    Calculate comprehensive performance metrics from Ollama response metadata
    """
    # Extract all timing data (in nanoseconds)
    total_duration = response_metadata.get('total_duration', 0)
    load_duration = response_metadata.get('load_duration', 0)
    prompt_eval_duration = response_metadata.get('prompt_eval_duration', 0)
    eval_duration = response_metadata.get('eval_duration', 0)
    
    # Extract token counts
    prompt_eval_count = response_metadata.get('prompt_eval_count', 0)
    eval_count = response_metadata.get('eval_count', 0)
    
    # Convert to seconds
    total_seconds = total_duration / 1_000_000_000
    load_seconds = load_duration / 1_000_000_000
    prompt_eval_seconds = prompt_eval_duration / 1_000_000_000
    eval_seconds = eval_duration / 1_000_000_000
    
    # Calculate metrics, guarding against division by zero when a duration is missing
    metrics = {
        'generation_tokens_per_second': eval_count / eval_seconds if eval_seconds > 0 else 0,
        'prompt_tokens_per_second': prompt_eval_count / prompt_eval_seconds if prompt_eval_seconds > 0 else 0,
        'total_tokens': prompt_eval_count + eval_count,
        'total_time_seconds': total_seconds,
        'load_time_seconds': load_seconds,
        'generation_time_seconds': eval_seconds,
        'prompt_processing_time_seconds': prompt_eval_seconds,
    }

    print(f"Total tokens: {metrics['total_tokens']}")
    print(f"Total time seconds: {metrics['total_time_seconds']}")
    print(f"Load time seconds: {metrics['load_time_seconds']}")
    print(f"Generation time seconds: {metrics['generation_time_seconds']}")
    print(f"Prompt processing time seconds: {metrics['prompt_processing_time_seconds']}")
    print(f"Generation tokens per second: {metrics['generation_tokens_per_second']}")
    print(f"Prompt tokens per second: {metrics['prompt_tokens_per_second']}")

    return metrics

In [20]:
# Test simple inference
prompt = "Explain quantum computing in one sentence."

print(f"Prompt: {prompt}")
print("\nGenerating response...")

response = chat_model.invoke(prompt)

print(f"\n✅ Response generated!")
print(f"\nModel output: {response.content}")

# print the chat model name that produced the response
print(f"Chat model: {chat_model.model}")


Prompt: Explain quantum computing in one sentence.

Generating response...

✅ Response generated!

Model output: <think>
Okay, the user wants me to explain quantum computing in one sentence. Let me start by recalling what I know. Quantum computing uses quantum bits or qubits, which can be in superpositions of states. Unlike classical bits that are either 0 or 1, qubits can be both at the same time. This allows for parallel processing and potentially solving certain problems much faster. I should mention the key concepts like superposition and entanglement. Also, it's important to note that it's a different model of computation compared to classical computers. Maybe I can structure it as: Quantum computing is a computational paradigm that leverages quantum phenomena like superposition and entanglement to perform operations on qubits, enabling potentially exponential speedups for specific problems. Wait, does that cover all the essentials? Let me check. It mentions the paradigm, the phen
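The output above begins with the model's reasoning inside a `<think>` tag, which is usually unwanted in the final answer. The sketch below shows one way to remove it. It assumes the reasoning is closed by a matching `</think>` tag, which is not visible in the truncated output above; `strip_thinking` is a hypothetical helper and the sample string is made up for illustration.

```python
import re

# Matches a complete <think>...</think> block, or an unclosed one running to the end
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", re.DOTALL)


def strip_thinking(text):
    """Remove <think> blocks and surrounding whitespace from model output."""
    return THINK_BLOCK.sub("", text).strip()


# Made-up sample; in the notebook, pass response.content instead.
sample = "<think>\nThe user wants one sentence.\n</think>\n\nQuantum computers use qubits."
print(strip_thinking(sample))  # → Quantum computers use qubits.
```

If generation stops before the closing tag, the pattern treats everything after `<think>` as reasoning and returns an empty string, which makes a cut-off response easy to detect.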

In [25]:
_ = detailed_performance_metrics(response.response_metadata)

Total tokens: 401
Total time seconds: 25.818671291
Load time seconds: 4.799074625
Generation time seconds: 20.787416583
Prompt processing time seconds: 0.231635959
Generation tokens per second: 18.520819961574926
Prompt tokens per second: 69.07390402195715
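The rates above follow directly from the token counts and the nanosecond durations. As a worked check, the sketch below recomputes them. The durations are taken from the printed output; the token counts 385 and 16 are not printed and are inferred here from the rates and the total of 401, so treat them as assumptions. `tokens_per_second` is a hypothetical helper.

```python
NS_PER_SECOND = 1_000_000_000


def tokens_per_second(token_count, duration_ns):
    """Throughput in tokens/s from a token count and a duration in nanoseconds."""
    if duration_ns <= 0:
        return 0.0
    return token_count / (duration_ns / NS_PER_SECOND)


# Durations from the output above; the counts 385 and 16 are inferred, not printed.
generation_rate = tokens_per_second(385, 20_787_416_583)
prompt_rate = tokens_per_second(16, 231_635_959)
print(round(generation_rate, 2), round(prompt_rate, 2))  # → 18.52 69.07
```

Note that the load time (about 4.8 s here) counts toward the total time but not toward either rate, so the first call after a model loads looks slower end to end without the throughput changing.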


In [26]:
# Test with system message and human message
messages = [
    SystemMessage(content="You are a helpful AI assistant that explains complex topics simply."),
    HumanMessage(content="What is machine learning?")
]

print("Sending messages to model...")
response = chat_model.invoke(messages)

print(f"\n✅ Response generated!")
print(f"\nModel output: {response.content}")


Sending messages to model...

✅ Response generated!

Model output: <think>
Okay, the user is asking, "What is machine learning?" I need to explain this in a simple way. Let me start by recalling the basics. Machine learning is a subset of artificial intelligence, right? So maybe I should mention that first. But I need to avoid jargon. Let me think of an analogy. Like teaching a child to recognize shapes. The child sees examples and learns patterns. That's similar to how machines learn.

Wait, but maybe I should break it down into parts. First, define machine learning. Then explain how it works, maybe with an example. Also, mention the types, like supervised and unsupervised learning. But keep it simple. The user might not know the difference between them, so maybe just mention they're different methods. Also, applications are important. Examples like recommendation systems or self-driving cars. But I need to make sure it's clear without being too technical. Let me check if I'm missing 

In [27]:
_ = detailed_performance_metrics(response.response_metadata)

Total tokens: 649
Total time seconds: 33.934649625
Load time seconds: 0.056636167
Generation time seconds: 33.636827333
Prompt processing time seconds: 0.240287708
Generation tokens per second: 18.402449014349198
Prompt tokens per second: 124.85033150343254


## Step 4: Test Streaming Response

Ollama supports streaming responses, which is useful for real-time applications:


In [12]:
# Test streaming
prompt = "Write a haiku about artificial intelligence."

print(f"Prompt: {prompt}")
print("\nStreaming response:")
print("-" * 40)

for chunk in chat_model.stream(prompt):
    print(chunk.content, end="", flush=True)

print("\n" + "-" * 40)
print("\n✅ Streaming completed!")


Prompt: Write a haiku about artificial intelligence.

Streaming response:
----------------------------------------
Silent code hums deep  
Minds born of silicon dreams  
Stars in data glow
----------------------------------------

✅ Streaming completed!
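When streaming, it is often useful to keep the full text and to know how long the first chunk took to arrive. A minimal sketch, assuming each chunk exposes a `.content` string as in the loop above; `collect_stream` is a hypothetical helper, and a fake stream built from `SimpleNamespace` objects stands in for `chat_model.stream(prompt)` so the block runs without Ollama.

```python
import time
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed chunk contents and report the seconds until the first chunk."""
    start = time.perf_counter()
    first_chunk_seconds = None
    parts = []
    for chunk in chunks:
        if first_chunk_seconds is None:
            first_chunk_seconds = time.perf_counter() - start
        parts.append(chunk.content)
    return "".join(parts), first_chunk_seconds


# Stand-in for chat_model.stream(prompt)
fake_stream = (SimpleNamespace(content=text) for text in ["Silent ", "code ", "hums deep"])
full_text, first_chunk_seconds = collect_stream(fake_stream)
print(full_text)  # → Silent code hums deep
```

In the notebook, `collect_stream(chat_model.stream(prompt))` would return the complete response plus the time to the first chunk, which tends to matter more for perceived latency than the total generation time.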


## Summary

If all the tests above passed, you're ready to use Ollama with LangChain! Here's what we tested:

✅ **Embeddings**: 
- Created embeddings for single queries
- Created embeddings for multiple documents
- Verified embedding dimensions

✅ **Model Inference**:
- Simple text generation
- Chat with system and human messages
- Streaming responses
- Inference performance metrics from the response metadata

## Troubleshooting

If you encounter issues:

1. **Model Not Found**: Pull the required models (`ollama pull <model-name>`)
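2. **Slow Performance**: Ollama uses a supported GPU automatically when it detects one and otherwise falls back to the CPU. For better performance:
   - Use smaller models for testing
   - Run `ollama ps` to check whether the loaded model is running on the GPU or the CPU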
3. **Memory Issues**: Large models require significant RAM. Try smaller variants if needed.

## Next Steps

Now you're ready to proceed with the main RAG assignment using Ollama!
