# Querying Ollama with `requests`

In [1]:
# Make sure every model you want to use has been downloaded
!ollama list

NAME                  ID              SIZE      MODIFIED    
deepseek-r1:latest    6995872bfe4c    5.2 GB    2 hours ago    
mistral:latest        6577803aa9a0    4.4 GB    2 hours ago    
mixtral:latest        a3b6bef0f836    26 GB     5 weeks ago    
mixtral:8x7b          a3b6bef0f836    26 GB     5 weeks ago    
llama3:latest         365c0bd3c000    4.7 GB    5 weeks ago    


## Streaming with text only, elementary

In [2]:
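import json
import requests

url = "http://localhost:11434/api/generate"
data = {
    "model": "mistral",
    "prompt": "Explain self-attention.",
    "stream": True,
}
# stream=True lets requests hand over the body line by line as it arrives
r = requests.post(url, json=data, stream=True)
r.raise_for_status()
for line in r.iter_lines():
    if line:  # skip empty keep-alive lines
        msg = json.loads(line)  # one JSON object per line
        print(msg.get("response", ""), end="")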

 Self-attention, also known as scaled dot-product attention or multi-head attention, is a key component of the Transformer model in natural language processing (NLP). It allows the model to focus on different parts of an input sequence when producing an output for each position. Here's a simplified explanation:

1. Input Embeddings: Given an input sequence, each word or token is first converted into a dense vector representation through an embedding layer. These vectors are then processed further using positional encodings to maintain information about the position of tokens in the sequence.

2. Splitting Inputs: After processing the initial embeddings, the input is split into three parts – query (Q), key (K), and value (V). The splitting allows for the calculation of attention scores across the input sequence.

3. Attention Scores Calculation: To calculate attention scores between the query and keys, a dot product operation is performed. However, the raw dot products can lead to very 
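The loop in cell `In [2]` prints fragments as they arrive but does not keep the text or check for the end of the stream. A minimal sketch of a reusable collector follows. The helper name `collect_stream` and the sample lines are made up for illustration, and treating an `"error"` key as a failure is an assumption about how the server reports problems.

```python
import json

def collect_stream(lines):
    """Join the "response" fragments of a line-delimited JSON stream into one string."""
    parts = []
    for raw in lines:
        if not raw:  # iter_lines() can yield empty keep-alive lines
            continue
        msg = json.loads(raw)  # json.loads accepts bytes as well as str
        if "error" in msg:  # assumption: failures arrive as {"error": "..."}
            raise RuntimeError(msg["error"])
        parts.append(msg.get("response", ""))
        if msg.get("done"):  # the final object carries "done": true
            break
    return "".join(parts)

# Hand-written sample lines standing in for r.iter_lines()
sample = [
    b'{"response": "Self-", "done": false}',
    b'',
    b'{"response": "attention", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample))
```

With a running server, `collect_stream(r.iter_lines())` would take the place of the `for` loop when the full text is needed afterwards.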

In [3]:
import requests

def query_ollama(prompt, model="llama3"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    answer = requests.post(url, json=data)
    return answer.json()["response"]

# Example usage:
print(query_ollama("What is the capital of Germany?"))

The capital of Germany is Berlin.


# Streaming response with Markdown

In [4]:
import requests
import json
import re
from IPython.display import display, Markdown, clear_output

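def format_markdown(text):
    # Double every newline so single line breaks render as paragraph breaks
    text = text.replace("\n", "\n\n")
    # Heuristic: bold runs of capitalized words; this also hits sentence-initial words
    text = re.sub(r'\b([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b', r'**\1**', text)
    # Bold acronyms of three or more capital letters
    text = re.sub(r'\b([A-Z]{3,})\b', r'**\1**', text)
    return text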

def stream_ollama_markdown(prompt, model="llama3"):
    url = "http://localhost:11434/api/generate"
    data = {"model": model, "prompt": prompt, "stream": True}

    response = requests.post(url, json=data, stream=True)
    buffer = ""
    output_display = display(Markdown(""), display_id=True)

    for chunk in response.iter_lines():
        if chunk:
            try:
                # Use a separate name so the request payload `data` is not overwritten
                msg = json.loads(chunk.decode("utf-8"))
                text = msg.get("response", "")
                if text:
                    buffer += text
                    clear_output(wait=True)
                    # Re-render the whole buffer in place on every chunk
                    output_display.update(Markdown(format_markdown(buffer)))
            except json.JSONDecodeError:
                # Skip lines that are not valid JSON
                pass

    # Final display (in case the last chunk isn't shown)
    clear_output(wait=True)
    output_display.update(Markdown(format_markdown(buffer)))


# Try it
stream_ollama_markdown("Explain the concept of self-attention in Transformers.")


**Self**-attention! A fundamental component of the **Transformer** architecture, revolutionizing the field of **Natural Language Processing** (**NLP**). **In** this explanation, I'll break down the concept of self-attention and its role within the **Transformer** model.



****What** is self-attention?**



**In** traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), sequential relationships between input elements are modeled using recurrence or convolutional filters. **However**, these approaches have limitations when dealing with long-range dependencies, which are crucial in **NLP** tasks like language translation, question answering, and text summarization.



**Self**-attention addresses this limitation by allowing the model to attend to different parts of the input sequence simultaneously, rather than relying on sequential processing. **In** essence, self-attention enables the model to weigh the importance of each input element relative to others, based on their relevance to the current task.



****How** does self-attention work?**



**The** self-attention mechanism is applied in parallel across all tokens (words or subwords) in a sequence. **The** process involves three main components:



1. ****Query** (Q)**: **Each** token is associated with a query vector, which represents the token's context and its relationship to other tokens.

2. ****Key** (K)**: **Each** token is also associated with a key vector, which captures the token's content and relevance to other tokens.

3. ****Value** (V)**: **Each** token has a value vector, which contains the token's representation.



**The** self-attention mechanism computes the attention weights by taking the dot product of each query and key vector, then applying a softmax function:



`**Attention**(Q, K) = **Softmax**(Q * K / sqrt(d))`



where `d` is the dimensionality of the vectors. **The** attention weights are computed for all tokens simultaneously, allowing the model to consider long-range dependencies.



****What** happens next?**



**The** output of self-attention is a weighted sum of the value vectors, where the weights are determined by the attention scores:



`**Output** = **Concat**(head1, ..., headn) * WO`



where `head1`, ..., `headn` are the outputs from multiple self-attention heads (more on this later). **The** output is then passed through a feed-forward neural network (**FFNN**) for further transformation.



****Why** is self-attention important?**



**Self**-attention enables **Transformers** to:



* **Capture** long-range dependencies and contextual relationships between tokens

* **Attend** to relevant parts of the input sequence, even if they are far apart in the original order

* **Handle** complex **NLP** tasks that require understanding context and relationships



**In** summary, self-attention is a critical component of the **Transformer** architecture, allowing it to attend to different parts of the input sequence simultaneously and capture long-range dependencies. **This** mechanism has had a profound impact on the field of **NLP**, enabling state-of-the-art results in many areas!
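The nested markers in the output above (for example `****What** is self-attention?**`) appear because `format_markdown` wraps capitalized words in `**` even where the model has already emitted bold Markdown, and the doubled newlines leave runs of blank lines. A gentler alternative, assuming the model already returns valid Markdown, is to leave the text alone and only collapse excess blank lines. The helper name `tidy_markdown` is hypothetical.

```python
import re

def tidy_markdown(text):
    """Collapse runs of three or more newlines into one blank line; change nothing else."""
    return re.sub(r"\n{3,}", "\n\n", text).strip()

raw = "**What is self-attention?**\n\n\n\nIt lets each token attend to all others.\n"
print(tidy_markdown(raw))
```

Passing `tidy_markdown(buffer)` instead of `format_markdown(buffer)` to `Markdown(...)` would keep the model's own emphasis intact.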

# Preparing Python-based Ollama access

In [5]:
import os

#os.environ["http_proxy"]  = "http://ofsquid.dwd.de:8080"
#os.environ["https_proxy"] = "http://ofsquid.dwd.de:8080"
#os.environ["no_proxy"]    = "localhost,127.0.0.1"

In [6]:
#!pip install ollama

## First example: querying Ollama without streaming

In [7]:
import ollama

response = ollama.chat(model='llama3', messages=[{"role": "user", "content": "What is the capital of France?"}])
print(response['message']['content'])

The capital of France is Paris.
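The same call can also stream: with `stream=True`, `ollama.chat` returns an iterator of partial messages instead of one response. The sketch below keeps the consuming logic in a small helper so it can be tried without a server. The name `collect_chat_stream` and the fake chunks are made up for illustration.

```python
def collect_chat_stream(chunks, echo=True):
    """Print each streamed fragment as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        piece = chunk["message"]["content"]  # same keys as the non-streaming response
        if echo:
            print(piece, end="", flush=True)
        parts.append(piece)
    return "".join(parts)

# With a running Ollama server (not executed here):
#   import ollama
#   stream = ollama.chat(model="llama3",
#                        messages=[{"role": "user", "content": "What is the capital of France?"}],
#                        stream=True)
#   text = collect_chat_stream(stream)

# Hand-written chunks standing in for the server's stream
fake = [{"message": {"content": "Par"}}, {"message": {"content": "is."}}]
text = collect_chat_stream(fake, echo=False)
print(text)
```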


## Token Speed

In [8]:
import time

# Reuses query_ollama() from cell In [3]

models = [
    "mistral",
    "llama3",
    "deepseek-r1",
]

prompt = "Explain self-attention in two paragraphs."
page_token_count = 333  # ≈ one page

results = []    
for model in models:
    print(f"⏳ Testing model: {model}")
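    start = time.perf_counter()
    response = query_ollama(prompt, model)
    end = time.perf_counter()

    # Whitespace-separated words, a rough stand-in for the true token count
    tokens = len(response.split())
    duration = end - start  # wall-clock time, including any model load
    tokens_per_sec = tokens / duration
    seconds_per_page = page_token_count / tokens_per_sec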

    results.append({
        "model": model,
        "tokens": tokens,
        "duration_sec": duration,
        "tokens/sec": tokens_per_sec,
        "sec/page": seconds_per_page
    })


⏳ Testing model: mistral
⏳ Testing model: llama3
⏳ Testing model: deepseek-r1


In [9]:
# Display results nicely
import pandas as pd
df = pd.DataFrame(results)
df = df.round(2)
display(df)

         model  tokens  duration_sec  tokens/sec  sec/page
0      mistral     180          5.06       35.54      9.37
1       llama3     241          6.81       35.41      9.41
2  deepseek-r1     240         16.26       14.76     22.57
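The `tokens` column above counts whitespace-separated words, and `duration_sec` is wall-clock time for the whole HTTP call. The non-streaming `/api/generate` response also carries the server's own counters, `eval_count` (generated tokens) and `eval_duration` (nanoseconds), which give a more direct rate. A minimal sketch follows. It assumes the full `answer.json()` dictionary is available, whereas `query_ollama` returns only the `"response"` text. The helper names and sample numbers are made up.

```python
def tokens_per_second(payload):
    """Generation speed from the server's counters; eval_duration is in nanoseconds."""
    return payload["eval_count"] / (payload["eval_duration"] / 1e9)

def seconds_per_page(payload, page_token_count=333):
    """Time to generate roughly one page at the measured rate."""
    return page_token_count / tokens_per_second(payload)

# Made-up numbers for illustration, not a measurement
sample = {"eval_count": 290, "eval_duration": 4_000_000_000}
print(round(tokens_per_second(sample), 2))
print(round(seconds_per_page(sample), 2))
```

Because `eval_duration` covers generation only, rates computed this way are not directly comparable with the wall-clock figures in the table.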
