# Ollama Setup and Testing

This notebook will help you set up and test Ollama with LangChain connectors before starting the main RAG assignment.

## Prerequisites

1. **Install Ollama** from https://ollama.ai
   - On Linux: `curl https://ollama.ai/install.sh | sh`
   - On macOS and Windows: Download and run the installer

2. **Verify Installation**
   - Run `ollama -v` in your terminal
   - The output should show version 0.11.10 or later

3. **Pull Required Models**
   ```bash
   # For chat/inference
   ollama pull gpt-oss:20b
   
   # For embeddings
   ollama pull embeddinggemma:latest
   ```
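
The version check in step 2 can also be done programmatically. The sketch below is a minimal example, not part of Ollama: `parse_version` and `meets_minimum` are hypothetical helper names, and the sample string only assumes that the `ollama -v` output contains a dotted version number somewhere in it.

```python
import re

MIN_VERSION = (0, 11, 10)  # minimum version named in the prerequisites


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MIN_VERSION):
    """Compare versions numerically, component by component."""
    return parse_version(text) >= minimum


# Assumed sample output; paste the real output of `ollama -v` here.
sample_output = "ollama version is 0.11.10"
print(meets_minimum(sample_output))              # True
print(meets_minimum("ollama version is 0.9.6"))  # False
```

Comparing tuples of integers matters here: a plain string comparison would rank `"0.9.6"` above `"0.11.10"`.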


## Step 1: Test Ollama Connection

First, let's verify that Ollama is running and accessible:


In [9]:
import requests

# Ask the local Ollama server for its model list to confirm it is running
try:
    response = requests.get('http://localhost:11434/api/tags', timeout=5)
    if response.status_code == 200:
        models = response.json()
        print("✅ Ollama is running!")
        print("\nAvailable models:")
        for model in models.get('models', []):
            print(f"  - {model['name']}")
    else:
        print(f"❌ Ollama is not responding properly (HTTP {response.status_code})")
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
    print("❌ Cannot connect to Ollama. Make sure it's running!")
    print("Start Ollama by running 'ollama serve' in a terminal")


✅ Ollama is running!

Available models:
  - gpt-oss:20b
  - embeddinggemma:latest
  - llama2:13b
  - llama2:latest
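
Before moving on, it can help to confirm that every model the notebook uses appears in that list. The following is a minimal sketch: `missing_models` is a hypothetical helper, and `sample_payload` is a hand-written stand-in that mirrors the output above and keeps only the `name` field.

```python
REQUIRED_MODELS = ["gpt-oss:20b", "embeddinggemma:latest"]


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required model names that are absent from an /api/tags payload."""
    installed = {model["name"] for model in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


# Stand-in for the parsed JSON of the /api/tags response shown above.
sample_payload = {
    "models": [
        {"name": "gpt-oss:20b"},
        {"name": "embeddinggemma:latest"},
        {"name": "llama2:13b"},
        {"name": "llama2:latest"},
    ]
}

print(missing_models(sample_payload))                   # []
print(missing_models(sample_payload, ["llama3.1:8b"]))  # ['llama3.1:8b']
```

Any name returned by the helper needs an `ollama pull <model-name>` before it can be used. With the sample list, that applies to the `llama3.1:8b` tag used in Step 3.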


## Step 2: Test Embeddings with Ollama

Now let's test creating embeddings using the LangChain Ollama connector:


In [2]:
from langchain_ollama import OllamaEmbeddings

# Initialize the embedding model
embedding_model = OllamaEmbeddings(
    model="embeddinggemma:latest",
    base_url="http://localhost:11434"  # Default Ollama URL
)

print("✅ Embedding model initialized")

✅ Embedding model initialized


In [3]:
# Test embedding a single query
test_query = "What is love? Baby don't hurt me, don't hurt me, no more."

print(f"Embedding query: '{test_query}'")
embedding = embedding_model.embed_query(test_query)

print(f"\n✅ Successfully created embedding!")
print(f"Embedding dimension: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")


Embedding query: 'What is love? Baby don't hurt me, don't hurt me, no more.'

✅ Successfully created embedding!
Embedding dimension: 768
First 10 values: [-0.122944966, 0.014476633, 0.016764784, -0.013229229, -0.04050047, -0.00504672, -0.01957866, 0.048353076, 0.003079206, -0.009721521]


In [4]:
# Test embedding multiple documents
test_documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is a subset of artificial intelligence.",
    "Python is a popular programming language for data science."
]

print("Embedding multiple documents...")
embeddings = embedding_model.embed_documents(test_documents)

print(f"\n✅ Successfully created {len(embeddings)} embeddings!")
for i, doc in enumerate(test_documents):
    print(f"\nDocument {i+1}: '{doc[:50]}...'")
    print(f"  Embedding dimension: {len(embeddings[i])}")
    print(f"  First 5 values: {embeddings[i][:5]}")


Embedding multiple documents...

✅ Successfully created 3 embeddings!

Document 1: 'The quick brown fox jumps over the lazy dog....'
  Embedding dimension: 768
  First 5 values: [-0.14782572, 0.002899217, 0.05214841, -0.029081974, -0.036695857]

Document 2: 'Machine learning is a subset of artificial intelli...'
  Embedding dimension: 768
  First 5 values: [-0.124072984, -0.0027437524, -0.0003292989, 0.010069011, 0.0016800934]

Document 3: 'Python is a popular programming language for data ...'
  Embedding dimension: 768
  First 5 values: [-0.1626911, -0.010286673, 0.025470829, 0.0006976369, -0.01888673]
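
Embeddings become useful once they are compared with each other. The sketch below shows one common way to do that, cosine similarity, using small made-up 3-dimensional vectors in place of the 768-dimensional ones above. `cosine_similarity` and `rank_documents` are illustrative helpers, not LangChain functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors; in the notebook these would come from embed_query / embed_documents.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(rank_documents(query, docs))  # [1, 2, 0]
```

With the real model, `query` would be the result of `embedding_model.embed_query(...)` and `docs` the result of `embedding_model.embed_documents(...)`.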


## Step 3: Test Model Inference with Ollama

Now let's test text generation (inference) with Ollama through the LangChain connector:


In [18]:
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, SystemMessage

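# Initialize the chat model. The model tag must already be pulled locally
# (for example with `ollama pull llama3.1:8b`); otherwise calls to the model
# fail with a model-not-found error.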
chat_model = ChatOllama(
    model="llama3.1:8b",
    temperature=0.6,
    base_url="http://localhost:11434",
    verbose=True
)

print("✅ Chat model initialized")


✅ Chat model initialized


Let's add a helper function to measure the model's inference performance:

In [19]:
def detailed_performance_metrics(response_metadata):
    """
    Calculate comprehensive performance metrics from Ollama response metadata
    """
    # Extract all timing data (in nanoseconds)
    total_duration = response_metadata.get('total_duration', 0)
    load_duration = response_metadata.get('load_duration', 0)
    prompt_eval_duration = response_metadata.get('prompt_eval_duration', 0)
    eval_duration = response_metadata.get('eval_duration', 0)
    
    # Extract token counts
    prompt_eval_count = response_metadata.get('prompt_eval_count', 0)
    eval_count = response_metadata.get('eval_count', 0)
    
    # Convert to seconds
    total_seconds = total_duration / 1_000_000_000
    load_seconds = load_duration / 1_000_000_000
    prompt_eval_seconds = prompt_eval_duration / 1_000_000_000
    eval_seconds = eval_duration / 1_000_000_000
    
    # Calculate metrics, guarding against zero durations to avoid division by zero
    metrics = {
        'generation_tokens_per_second': eval_count / eval_seconds if eval_seconds > 0 else 0,
        'prompt_tokens_per_second': prompt_eval_count / prompt_eval_seconds if prompt_eval_seconds > 0 else 0,
        'total_tokens': prompt_eval_count + eval_count,
        'total_time_seconds': total_seconds,
        'load_time_seconds': load_seconds,
        'generation_time_seconds': eval_seconds,
        'prompt_processing_time_seconds': prompt_eval_seconds,
    }

    print(f"Total tokens: {metrics['total_tokens']}")
    print(f"Total time seconds: {metrics['total_time_seconds']}")
    print(f"Load time seconds: {metrics['load_time_seconds']}")
    print(f"Generation time seconds: {metrics['generation_time_seconds']}")
    print(f"Prompt processing time seconds: {metrics['prompt_processing_time_seconds']}")
    print(f"Generation tokens per second: {metrics['generation_tokens_per_second']}")
    print(f"Prompt tokens per second: {metrics['prompt_tokens_per_second']}")

    return metrics

In [20]:
# Test simple inference
prompt = "Explain quantum computing in one sentence."

print(f"Prompt: {prompt}")
print("\nGenerating response...")

response = chat_model.invoke(prompt)

print(f"\n✅ Response generated!")
print(f"\nModel output: {response.content}")


Prompt: Explain quantum computing in one sentence.

Generating response...

✅ Response generated!

Model output: Quantum computing uses the principles of quantum mechanics to perform calculations and operations on data at a scale that's exponentially faster than classical computers, allowing for complex problems to be solved that are currently unsolvable or require an unfeasible amount of time to solve with traditional computing methods.


In [21]:
_ = detailed_performance_metrics(response.response_metadata)

Total tokens: 74
Total time seconds: 5.55217
Load time seconds: 2.903604292
Generation time seconds: 2.126368167
Prompt processing time seconds: 0.519085333
Generation tokens per second: 26.335984929179954
Prompt tokens per second: 34.676379500015656
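
The throughput figures above can be reproduced by hand. In the sketch below, the durations are taken from the printed output and converted back to nanoseconds. The token counts (56 generated, 18 prompt) are inferred from the printed rates and the total of 74, so treat `sample_metadata` as a reconstruction rather than the actual response metadata.

```python
NS_PER_SECOND = 1_000_000_000

# Reconstructed values: durations from the printed output, counts inferred.
sample_metadata = {
    "eval_count": 56,
    "eval_duration": 2_126_368_167,       # nanoseconds
    "prompt_eval_count": 18,
    "prompt_eval_duration": 519_085_333,  # nanoseconds
}


def tokens_per_second(count, duration_ns):
    """Tokens divided by elapsed seconds; 0.0 when no time was recorded."""
    if duration_ns <= 0:
        return 0.0
    return count / (duration_ns / NS_PER_SECOND)


gen_tps = tokens_per_second(
    sample_metadata["eval_count"], sample_metadata["eval_duration"]
)
prompt_tps = tokens_per_second(
    sample_metadata["prompt_eval_count"], sample_metadata["prompt_eval_duration"]
)
total_tokens = sample_metadata["eval_count"] + sample_metadata["prompt_eval_count"]

print(round(gen_tps, 2))     # 26.34
print(round(prompt_tps, 2))  # 34.68
print(total_tokens)          # 74
```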


In [25]:
# Test with system message and human message
messages = [
    SystemMessage(content="You are a helpful AI assistant that explains complex topics simply."),
    HumanMessage(content="What is machine learning?")
]

print("Sending messages to model...")
response = chat_model.invoke(messages)

print(f"\n✅ Response generated!")
print(f"\nModel output: {response.content}")


Sending messages to model...

✅ Response generated!

Model output: Machine learning (ML) is a type of artificial intelligence (AI) that enables computers to learn from data without being explicitly programmed.

Think of it like this: Imagine you're trying to teach a child how to recognize different animals. You wouldn't simply tell them "this is a dog, this is a cat," and expect them to remember every detail. Instead, you'd show them many pictures of dogs and cats, and they would learn to recognize the patterns and characteristics that distinguish one from the other.

Machine learning works in a similar way. It uses algorithms (step-by-step procedures) to analyze large datasets and identify patterns, relationships, or predictions. The more data it's exposed to, the better it becomes at making accurate decisions or predictions.

There are three main types of machine learning:

1. **Supervised Learning**: This is like teaching a child to recognize animals by showing them labeled pictures

In [23]:
_ = detailed_performance_metrics(response.response_metadata)

Total tokens: 444
Total time seconds: 16.49599575
Load time seconds: 0.078168959
Generation time seconds: 16.154724541
Prompt processing time seconds: 0.262441375
Generation tokens per second: 25.50337512437068
Prompt tokens per second: 121.93199338328418


## Step 4: Test Streaming Response

Ollama supports streaming responses, which is useful for real-time applications:


In [24]:
# Test streaming
prompt = "Write a haiku about artificial intelligence."

print(f"Prompt: {prompt}")
print("\nStreaming response:")
print("-" * 40)

for chunk in chat_model.stream(prompt):
    print(chunk.content, end="", flush=True)

print("\n" + "-" * 40)
print("\n✅ Streaming completed!")


Prompt: Write a haiku about artificial intelligence.

Streaming response:
----------------------------------------
Metal mind awakes
Thinking, learning, growing fast
Future's subtle form
----------------------------------------

✅ Streaming completed!
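
When streaming, you often want the complete text as well as a sense of how quickly the first output arrived. The sketch below shows one way to collect both. `collect_stream` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the chunks yielded by `chat_model.stream(prompt)`, assuming only that each chunk has a `content` attribute, as in the loop above.

```python
import time
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed chunk contents and time the first non-empty chunk.

    Returns (full_text, seconds_to_first_chunk); the second value is None
    if no non-empty chunk arrived.
    """
    start = time.perf_counter()
    first_chunk_seconds = None
    parts = []
    for chunk in chunks:
        if chunk.content and first_chunk_seconds is None:
            first_chunk_seconds = time.perf_counter() - start
        parts.append(chunk.content)
    return "".join(parts), first_chunk_seconds


# Stand-in chunks; with a running model, pass chat_model.stream(prompt) instead.
fake_chunks = [SimpleNamespace(content=text) for text in ["Metal ", "mind ", "awakes"]]
text, first_chunk_seconds = collect_stream(fake_chunks)
print(text)  # Metal mind awakes
```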


## Summary

If all the tests above passed, you're ready to use Ollama with LangChain! Here's what we tested:

✅ **Embeddings**: 
- Created embeddings for single queries
- Created embeddings for multiple documents
- Verified embedding dimensions

✅ **Model Inference**:
- Simple text generation
- Chat with system and human messages
- Streaming responses
- Inference performance metrics (tokens per second)

## Troubleshooting

If you encounter issues:

1. **Model Not Found**: Pull the required models (`ollama pull <model-name>`)
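2. **Slow Performance**: Ollama uses a supported GPU automatically when it detects one and otherwise falls back to the CPU. For better performance:
   - Use smaller models for testing
   - Check that your GPU is actually being used (`ollama ps` shows whether a loaded model runs on CPU or GPU)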
3. **Memory Issues**: Large models require significant RAM. Try smaller variants if needed.
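
For the memory question, a rough back-of-the-envelope estimate can help when choosing a model size. The sketch below counts only the model weights, under the assumption that each parameter is stored at a fixed number of bits. The bit widths are illustrative, and real usage is higher because of the context cache and runtime overhead.

```python
def estimate_weight_memory_gb(n_parameters, bits_per_weight):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_parameters * bits_per_weight / 8 / 1e9


# An 8-billion-parameter model at two assumed precisions.
print(estimate_weight_memory_gb(8e9, 4))   # 4.0
print(estimate_weight_memory_gb(8e9, 16))  # 16.0
```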

## Next Steps

Now you're ready to proceed with the main RAG assignment using Ollama!
