# Notebook 10: Ollama Integration with HuggingFace

**Learning Objectives:**
- Use Ollama to run local LLMs
- Integrate TinyLlama with HuggingFace tools
- Understand local vs cloud model deployment
- Combine Ollama with HuggingFace tokenizers and tools

## Prerequisites

### Hardware Requirements

| Model Option | Model Name | Size | Min RAM | Recommended Setup | Notes |
|--------------|------------|------|---------|-------------------|-------|
| **CPU/GPU** | TinyLlama (via Ollama) | 637MB | 4GB | 8GB RAM | Fast local inference |

### Software Requirements
- Python 3.8+
- **Ollama installed** ([ollama.com](https://ollama.com))
- Libraries: `transformers`, `ollama`

### Installation

1. **Install Ollama**:
   - Visit [ollama.com](https://ollama.com) and download for your OS
   - Or use: `curl -fsSL https://ollama.com/install.sh | sh` (Linux/Mac)

2. **Pull TinyLlama**:
   ```bash
   ollama pull tinyllama
   ```

3. **Install Python packages** (this notebook imports both `ollama` and `transformers`):
   ```bash
   pip install ollama transformers
   ```

## Overview

**Ollama** is a tool for running large language models locally.

**Benefits:**
- Privacy: Data stays on your machine
- No API costs
- Works offline
- Easy model management

**TinyLlama:**
- 1.1B parameter model
- Trained on 3 trillion tokens
- Fast inference on CPU
- Good for educational purposes

## Expected Behaviors

### Prerequisites
- **Ollama must be installed** from [ollama.com](https://ollama.com)
- **TinyLlama must be pulled**: `ollama pull tinyllama`
- **Ollama daemon must be running** in the background (launch the desktop app or run `ollama serve`)

### First Time Running
- **No model download in notebook** (handled by Ollama)
- TinyLlama: ~637MB downloaded via Ollama CLI
- HuggingFace tokenizer: ~500KB

### Setup Cell Output
```
Libraries imported successfully!
```

### Checking Ollama Installation
```
# If successful:
=== AVAILABLE OLLAMA MODELS ===
  - tinyllama:latest (637 MB)

# If Ollama not running:
Error: ... Connection refused
Make sure Ollama is installed and running.
```
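
A stopped daemon can also be detected before calling the Python client by probing the server's HTTP endpoint. The sketch below assumes Ollama's default address, `http://localhost:11434`. The helper name `is_ollama_running` is hypothetical and is not part of the `ollama` package.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url (assumed Ollama default)."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as reply:
            return reply.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat all as "not running"
        return False


if is_ollama_running():
    print("Ollama server is reachable.")
else:
    print("Ollama server is not reachable. Start it, then rerun this cell.")
```

This only confirms that something answers on that port. It does not confirm that TinyLlama has been pulled.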

### Text Generation Output
- Returns generated text as string
- Quality similar to other small LLMs
- Good for simple tasks, not as powerful as GPT-3.5/4

### Chat Interface
- Supports system prompts and message history
- **Streaming**: Text appears token-by-token
- **Non-streaming**: Returns complete response

### Performance
- **Short prompt** (10-20 words):
  - Response time: 2-5 seconds
  - Varies by system specs
- **Longer prompt** (100+ words):
  - Response time: 5-15 seconds

### Response Quality
- **TinyLlama (1.1B parameters)**:
  - Good for simple questions
  - May struggle with complex reasoning
  - Sometimes repetitive
  - Can make factual errors

### Temperature Effects
- **0.3**: Focused, more predictable responses (still sampled, so not fully deterministic)
- **0.7**: Balanced (recommended)
- **1.2**: Creative, more varied, sometimes incoherent
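
To see why these settings behave this way, it can help to look at how temperature rescales a probability distribution. The sketch below is a conceptual illustration with made-up scores for three candidate tokens, not Ollama's actual sampling code. The function name `softmax_with_temperature` is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in (0.3, 0.7, 1.2):
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, while higher temperatures flatten the distribution so that less likely tokens are sampled more often.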

### HuggingFace Integration
- Can use HF tokenizers to count tokens
- Analyze prompts before sending to Ollama
- Combine with other HF tools (e.g., summarize then query)

### Common Issues
- **"Connection refused"**: Ollama not running
  - Solution: Start the Ollama daemon (launch the desktop app or run `ollama serve`)
- **"Model not found"**: TinyLlama not pulled
  - Solution: `ollama pull tinyllama`
- **Slow responses**: Normal for CPU inference
  - TinyLlama is small enough to run on a CPU, but generation still takes several seconds
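
The issues above can be turned into a small troubleshooting helper. This is a minimal sketch: `suggest_fix` is a hypothetical function, and the substrings it matches are taken from the error messages quoted in this notebook, so the exact wording may differ across Ollama versions.

```python
def suggest_fix(error):
    """Map an error (exception or message string) to a troubleshooting hint."""
    text = str(error).lower()
    if "connection refused" in text:
        return "Ollama is not running. Start the daemon and retry."
    if "not found" in text:
        return "Model is missing. Run: ollama pull tinyllama"
    return "Unrecognized error. Check the Ollama logs for details."


# Example with a simulated connection failure
print(suggest_fix(ConnectionError("[Errno 111] Connection refused")))
```

A helper like this could be called from the `except` branch of any cell that talks to Ollama.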

### Ollama vs HuggingFace
- **Ollama**: Easier setup, optimized for local inference, fewer models
- **HuggingFace**: More models, more control, steeper learning curve

### Multi-turn Conversations
- Maintains conversation context
- Each turn adds to message history
- Context window: ~2048 tokens for TinyLlama
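
Because every turn is appended to the history, a long conversation can eventually exceed the context window. One common approach is to drop the oldest turns while keeping the system prompt. The sketch below assumes that strategy. `trim_history` is a hypothetical helper, and its default token counter is a crude whitespace split, so a real tokenizer should be passed as `count_tokens` for accurate counts.

```python
def trim_history(messages, max_tokens=2048, count_tokens=None):
    """Drop the oldest non-system turns until the history fits the budget."""
    if count_tokens is None:
        # Crude stand-in: whitespace word count, not a real tokenizer
        count_tokens = lambda text: len(text.split())

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Remove from the front (oldest first), always keeping the latest turn
    while len(turns) > 1 and total(system + turns) > max_tokens:
        turns.pop(0)
    return system + turns


history = [
    {"role": "system", "content": "You are a coding tutor."},
    {"role": "user", "content": "What is a function in Python?"},
    {"role": "assistant", "content": "A function is a reusable block of code."},
    {"role": "user", "content": "Can you show me an example?"},
]
trimmed = trim_history(history, max_tokens=20)
print([m["role"] for m in trimmed])
```

The sketch ignores the tokens added by the chat template and the room needed for the reply, so in practice the budget should sit well below the model's full window.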

### Expected Behavior Examples
- Simple questions: Usually correct
- Math: May make errors
- Creative writing: Decent quality
- Code generation: Basic, often needs editing

In [None]:
# Import libraries
import random
import ollama
from transformers import AutoTokenizer
import warnings
warnings.filterwarnings('ignore')

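# Seed Python's RNG (this does not affect Ollama's sampling;
# pass a "seed" value in the request options for repeatable generations)
random.seed(1103)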

print("Libraries imported successfully!")

## Check Ollama Installation

In [None]:
# Check if Ollama is running and list available models
try:
    models = ollama.list()
    print("=== AVAILABLE OLLAMA MODELS ===")
    if models.get('models'):
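        for model in models['models']:
            # Newer ollama clients expose the tag as 'model'; older ones used 'name'
            name = model.get('model') or model.get('name', 'unknown')
            # 'size' is reported in bytes, so convert before labeling it MB
            size = model.get('size')
            size_text = f"{size / 1e6:.0f} MB" if size else "unknown size"
            print(f"  - {name} ({size_text})")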
    else:
        print("No models found. Run: ollama pull tinyllama")
except Exception as e:
    print(f"Error: {e}")
    print("\nMake sure Ollama is installed and running.")
    print("Visit: https://ollama.com for installation instructions")

## Basic Ollama Usage

In [None]:
# Simple text generation
def generate_with_ollama(prompt, model="tinyllama"):
    """
    Generate text using Ollama.
    """
    response = ollama.generate(
        model=model,
        prompt=prompt
    )
    return response['response']

# Test
prompt = "Explain what machine learning is in simple terms."
print(f"Prompt: {prompt}\n")

response = generate_with_ollama(prompt)
print(f"Response:\n{response}")

## Chat Interface

In [None]:
# Chat with streaming
def chat_with_ollama(messages, model="tinyllama", stream=True):
    """
    Chat with Ollama using message history.
    """
    response = ollama.chat(
        model=model,
        messages=messages,
        stream=stream
    )
    
    if stream:
        full_response = ""
        for chunk in response:
            content = chunk['message']['content']
            print(content, end='', flush=True)
            full_response += content
        print()  # New line
        return full_response
    else:
        return response['message']['content']

# Test chat
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "What are the benefits of using local LLMs?"}
]

print("Assistant: ", end='')
response = chat_with_ollama(messages, stream=True)

## Combining Ollama with HuggingFace Tokenizers

In [None]:
# Load HuggingFace tokenizer for TinyLlama
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

print("HuggingFace tokenizer loaded!")

In [None]:
# Analyze token count before sending to Ollama
def analyze_and_generate(prompt, max_tokens=100):
    """
    Analyze prompt with HF tokenizer, then generate with Ollama.
    """
    # Tokenize with HuggingFace
    tokens = tokenizer.encode(prompt)
    token_count = len(tokens)
    
    print(f"=== TOKEN ANALYSIS ===")
    print(f"Prompt: {prompt}")
    print(f"Token count: {token_count}")
    print(f"Tokens: {tokens[:20]}..." if token_count > 20 else f"Tokens: {tokens}")
    
    # Generate with Ollama
    if token_count < max_tokens:
        print(f"\n=== GENERATION ===")
        response = generate_with_ollama(prompt)
        print(response)
        
        # Analyze response
        response_tokens = tokenizer.encode(response)
        print(f"\nResponse token count: {len(response_tokens)}")
    else:
        print(f"\nPrompt too long! ({token_count} >= {max_tokens})")

# Test
analyze_and_generate("What is the capital of France?", max_tokens=500)

## Practical Applications

In [None]:
# Example 1: Multi-turn conversation
conversation = [
    {"role": "system", "content": "You are a coding tutor."},
]

questions = [
    "What is a function in Python?",
    "Can you show me an example?",
    "How do I add parameters?"
]

print("=== MULTI-TURN CONVERSATION ===")
for question in questions:
    print(f"\nUser: {question}")
    conversation.append({"role": "user", "content": question})
    
    response = ollama.chat(model="tinyllama", messages=conversation)
    assistant_message = response['message']['content']
    
    print(f"\nAssistant: {assistant_message}")
    conversation.append({"role": "assistant", "content": assistant_message})

In [None]:
# Example 2: Batch processing
prompts = [
    "Summarize: Machine learning is a subset of AI.",
    "Translate to simple terms: Neural networks process data through layers.",
    "Complete: The three main types of machine learning are"
]

print("=== BATCH PROCESSING ===")
for i, prompt in enumerate(prompts, 1):
    print(f"\n{i}. Prompt: {prompt}")
    response = generate_with_ollama(prompt)
    print(f"   Response: {response[:100]}...")  # First 100 chars

In [None]:
# Example 3: Parameter control
def generate_with_params(prompt, temperature=0.7, top_p=0.9, top_k=40):
    """
    Generate with custom parameters.
    """
    response = ollama.generate(
        model="tinyllama",
        prompt=prompt,
        options={
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k
        }
    )
    return response['response']

# Compare temperatures
prompt = "Write a creative opening for a sci-fi story."
temps = [0.3, 0.7, 1.2]

print("=== TEMPERATURE COMPARISON ===")
for temp in temps:
    print(f"\nTemperature {temp}:")
    response = generate_with_params(prompt, temperature=temp)
    print(response[:150])  # First 150 chars
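
The `top_k` and `top_p` options restrict which tokens are eligible for sampling. The sketch below illustrates the idea on a made-up distribution. It is a conceptual simplification, not Ollama's implementation, which operates on model logits and may order the steps differently. The function name `filter_top_k_top_p` is hypothetical.

```python
def filter_top_k_top_p(probs, top_k=40, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    ranked = ranked[:top_k]

    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


toy_probs = {"the": 0.5, "a": 0.3, "ship": 0.15, "zebra": 0.05}  # made-up values
print(filter_top_k_top_p(toy_probs, top_k=3, top_p=0.9))
```

Smaller values of either parameter shrink the candidate pool, which makes output more conservative. Larger values leave more unlikely tokens available.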

## Performance Benchmarking

In [None]:
import time

prompt = "Explain quantum computing in one sentence."

# Benchmark (the first call may include model load time, so run twice for a fairer number)
start_time = time.perf_counter()
response = generate_with_ollama(prompt)
end_time = time.perf_counter()

print(f"=== PERFORMANCE ===")
print(f"Response: {response}")
print(f"\nTime: {end_time - start_time:.2f} seconds")
print("Model: TinyLlama (via Ollama)")
print("Running: Locally on your machine")
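
Wall-clock time mixes model loading, prompt processing, and generation. Ollama's responses also carry timing metadata that can give a generation-only throughput figure. The sketch below assumes the response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so verify these field names against your installed version. `tokens_per_second` is a hypothetical helper, and the sample values are made up.

```python
def tokens_per_second(response):
    """Compute generation speed from Ollama's timing metadata, if present."""
    count = response.get("eval_count")
    duration_ns = response.get("eval_duration")  # assumed to be nanoseconds
    if not count or not duration_ns:
        return None
    return count / (duration_ns / 1e9)


# Made-up metadata for illustration: 50 tokens generated in 2 seconds
sample = {"response": "example text", "eval_count": 50, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")
```

With a live server, the full object returned by `ollama.generate(...)` could be passed in place of `sample`.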

## Ollama vs HuggingFace Comparison

In [None]:
print("""
=== OLLAMA VS HUGGINGFACE TRANSFORMERS ===

OLLAMA:
+ Easy model management (pull, run, delete)
+ Optimized for local inference
+ Built-in model quantization
+ Simple API
+ No code for basic use
- Fewer model options
- Less fine-tuning control

HUGGINGFACE TRANSFORMERS:
+ Huge model library (500k+ models)
+ Full control over architecture
+ Advanced fine-tuning capabilities
+ Research-oriented features
+ Custom model architectures
- More complex setup
- Requires more code
- Manual optimization needed

BEST PRACTICE:
Use Ollama for: Quick prototyping, demos, local chatbots
Use HuggingFace for: Research, custom models, fine-tuning, production
""")

## Exercises

1. **Model Comparison**: Pull other Ollama models (llama2, mistral) and compare
2. **Context Window**: Test with very long prompts to find context limits
3. **System Prompts**: Experiment with different system prompts for various tasks
4. **Integration**: Combine Ollama with earlier notebooks (e.g., summarize image captions)
5. **Custom Tool**: Build a simple CLI tool using Ollama

In [None]:
# Your code here for exercises


## Key Takeaways

✅ **Ollama** simplifies running LLMs locally

✅ **TinyLlama** is great for learning and prototyping

✅ Can combine with **HuggingFace tools** for enhanced functionality

✅ **Local inference** provides privacy and offline capability

✅ Easy to **manage multiple models** with Ollama

## Congratulations!

You've completed all 10 HuggingFace tutorial notebooks! You now know how to:
- Generate and classify text (NLP)
- Classify and detect objects in images (Computer Vision)
- Transcribe and generate speech (Audio)
- Caption images (Multimodal)
- Run local LLMs with Ollama

## Next Steps

- Explore [HuggingFace Datasets](https://huggingface.co/datasets)
- Learn about [fine-tuning models](https://huggingface.co/docs/transformers/training)
- Join the [HuggingFace Forums](https://discuss.huggingface.co/)
- Build your own AI application!

## Resources

- [Ollama Documentation](https://github.com/ollama/ollama)
- [TinyLlama on HuggingFace](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- [Ollama Model Library](https://ollama.com/library)