# Token Rate Benchmark of DeepSeek-R1:7B with Ollama

Here we use [Ollama](https://ollama.com/) to download and serve [DeepSeek-R1:7B](https://ollama.com/library/deepseek-r1:7b). Then we use the [Ollama Python library](https://github.com/ollama/ollama-python) to call the [Ollama REST API](https://github.com/ollama/ollama/blob/main/docs/api.md) and calculate the token rate of this model.

## Background

### About Ollama

Ollama is a lightweight, extensible framework for building and running Language Models (LM) on the local machine. It provides a simple API for creating, running, and managing models, as well as a library of pre-built models that can be easily used in a variety of applications.

### About DeepSeek-R1

DeepSeek-R1 builds on DeepSeek-R1-Zero, a model trained directly via Reinforcement Learning (RL) without Supervised Fine-tuning (SFT). DeepSeek-R1 improves on DeepSeek-R1-Zero by using multi-stage training and cold-start data before RL.

For more information about DeepSeek-R1, check the [original paper](https://arxiv.org/abs/2501.12948).

In [1]:
!ollama list

NAME              ID              SIZE      MODIFIED    
deepseek-r1:7b    0a8c26691023    4.7 GB    3 hours ago    


In [2]:
!ollama show deepseek-r1:7b

  Model
    architecture        qwen2     
    parameters          7.6B      
    context length      131072    
    embedding length    3584      
    quantization        Q4_K_M    

  Parameters
    stop    "<｜begin▁of▁sentence｜>"    
    stop    "<｜end▁of▁sentence｜>"      
    stop    "<｜User｜>"                 
    stop    "<｜Assistant｜>"            

  License
    MIT License                    
    Copyright (c) 2023 DeepSeek    



We can see that DeepSeek-R1:7B is a distilled version using the [Qwen2 LLM architecture](https://qwenlm.github.io/blog/qwen2/). It is quantized with Q4_K_M, nominally 4 bits per weight parameter. With 7.6B parameters, the weights alone would take approximately 7.6 * 10^9 * 4 / 8 / 10^9 = 3.8 GB. The model actually occupies 4.7 GB on disk. The gap likely comes from Q4_K_M being a mixed-precision scheme (some tensors are kept at higher precision, and per-block scales are stored alongside the weights), plus the tokenizer and metadata bundled with the model. In addition, the model needs to hold the KV cache and other intermediate data during execution, so the memory consumption at runtime will be even higher.
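The back-of-the-envelope estimate above can be written as a short Python sketch. The helper name `estimate_weight_size_gb` is hypothetical, and the calculation assumes decimal gigabytes and counts only the weights. The second calculation works backwards from the 4.7 GB reported by `ollama list` to the average number of bits per weight that size would imply, which is only a rough figure because the file also contains non-weight data.

```python
def estimate_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone, in decimal gigabytes (hypothetical helper)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal 4-bit estimate for 7.6B parameters.
nominal_gb = estimate_weight_size_gb(7.6e9, 4.0)
print(f"nominal size: {nominal_gb:.2f} GB")

# Average bits per weight implied by the 4.7 GB file size (upper bound,
# since the file also holds tokenizer data and metadata).
implied_bits = 4.7e9 * 8 / 7.6e9
print(f"implied average: {implied_bits:.2f} bits/weight")
```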

## Prerequisites

1. Python libraries: `ollama`
2. Ollama CLI
3. DeepSeek-R1:7B installed via Ollama CLI

## Running via Ollama CLI

Run this command: `ollama run deepseek-r1:7b --verbose "<YOUR PROMPT>"`

Below is a sample output from a CLI console.

```bash
(edge-ai) jetty@ubuntu:~$ ollama run deepseek-r1:7b --verbose "What is Theory of Mind (ToM)? Give a simple example about it. Also, How do you think an LLM can somehow achieve ToM?"
```
<think>
Okay, so I need to figure out what Theory of Mind (ToM) is and come up with a simple example. Then, I also have to think about how an LLM might achieve this ability.

First, I remember from psychology that ToM has something to do with understanding other people's thoughts and emotions. But I'm not entirely sure about the details. Maybe it's like when we think about what others are thinking or
feeling? So, if my friend is upset because they lost their keys, I might try to figure out why they're upset instead of just focusing on losing the keys themselves.

Wait, the user gave an example involving an LLM and a child who misunderstands sarcasm. That's a good simple example. Maybe I can use that or come up with another one. Let me think: perhaps something like when someone is trying to
explain their decision to a friend, but the friend doesn't get it because they're using different assumptions.

Now, about how an LLM achieves ToM. Well, I know that LLMs process information based on patterns and data they've been trained on. So maybe they can infer mental states by analyzing the context of conversations or texts. But does
this mean they truly understand others' thoughts or is it just surface-level analysis?

I'm not sure if an LLM actually has a deeper understanding, but it can simulate ToM by interpreting the intent behind messages and responding appropriately. For example, in sarcasm detection, which relates to ToM.

Putting it all together: ToM is about understanding others' mental states, like thoughts and feelings. An LLM achieves this through analyzing context and inferring intent rather than having actual consciousness of other minds.
</think>

**Theory of Mind (ToM):** It is the ability to comprehend the mental states—such as thoughts, beliefs, intentions, emotions, etc.—of oneself and others. Essentially, it's the capacity to understand that other people have their own
perspectives, motivations, and mental processes.

**Simple Example:** Imagine a child receiving a message from an adult saying, "I told you I was busy." The child might interpret this as the adult being uninterested because they were indeed busy. This interpretation reflects a
basic understanding of sarcasm, which is a form of ToM.

**How LLMs Achieve ToM:**
Language models like ChatGPT achieve ToM by analyzing context and inferring intent from interactions. They process input text to understand the user's intentions, emotions, or background knowledge implicitly. While they don't
possess consciousness or true understanding, their responses are guided by patterns learned during training, effectively simulating ToM through contextual analysis.

In summary, ToM involves interpreting others' mental states, and LLMs achieve this by analyzing context and inferring intent without actual consciousness of other minds.

```bash
total duration:       1m28.165662824s
load duration:        48.590339ms
prompt eval count:    34 token(s)
prompt eval duration: 185.009901ms
prompt eval rate:     183.77 tokens/s
eval count:           575 token(s)
eval duration:        1m27.930198598s
eval rate:            6.54 tokens/s
```

The output is streamed to the console. We can see that reasoning LLMs like DeepSeek-R1 first perform a thinking phase within the `<think></think>` block, then provide the final answer.

From the `--verbose` flag, Ollama CLI directly provides us with the token rate of the current inference. We can see that the token rate at the prefill phase is measured in `prompt eval rate` = 183 tokens/sec, while the token rate at the decoding phase is measured in `eval rate` = 6.5 tokens/sec.
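As a sanity check, the reported rates can be reproduced from the token counts and durations in the same output. The sketch below assumes the durations are printed as Go-style strings such as `1m27.930198598s` or `185.009901ms`, as in the sample above; `parse_duration` is a hypothetical helper, not part of Ollama.

```python
import re

# Seconds per unit; the units follow the duration strings seen in the CLI output.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}


def parse_duration(text: str) -> float:
    """Convert a duration string like '1m27.93s' or '185.0ms' into seconds."""
    # Multi-character units come first so that 'ms' is not read as 'm'.
    parts = re.findall(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|h|m|s)", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


# Values copied from the sample CLI output.
prefill_rate = 34 / parse_duration("185.009901ms")
decode_rate = 575 / parse_duration("1m27.930198598s")
print(f"prompt eval rate: {prefill_rate:.2f} tokens/s")
print(f"eval rate:        {decode_rate:.2f} tokens/s")
```

Both values agree with the `prompt eval rate` and `eval rate` lines printed by the CLI, which confirms that each rate is simply the token count divided by the corresponding duration.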

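By running `sudo tegrastats` at the same time on the Jetson Orin Nano Super, we can see that RAM usage during inference is around 6.6 GB out of 7.6 GB. On Jetson devices the CPU and GPU share the same physical memory, so this figure covers the model as well as the rest of the system.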

```bash
03-23-2025 17:58:38 RAM 6595/7620MB (lfb 4x1MB) SWAP 107/3810MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[305] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@52.343C soc2@51.5C soc0@50.875C gpu@52.531C tj@52.531C soc1@52.25C VDD_IN 4794mW/9018mW VDD_CPU_GPU_CV 466mW/2699mW VDD_SOC 1478mW/2177mW
03-23-2025 17:58:39 RAM 6596/7620MB (lfb 4x1MB) SWAP 107/3810MB (cached 0MB) CPU [0%@729,1%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[305] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@52.593C soc2@51.406C soc0@50.843C gpu@52.5C tj@52.593C soc1@52.156C VDD_IN 4794mW/8989mW VDD_CPU_GPU_CV 466mW/2684mW VDD_SOC 1478mW/2172mW
03-23-2025 17:58:40 RAM 6596/7620MB (lfb 4x1MB) SWAP 107/3810MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0%@2133 GR3D_FREQ 0%@[305] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@52.312C soc2@51.406C soc0@50.625C gpu@52.375C tj@52.687C soc1@52.125C VDD_IN 4755mW/8961mW VDD_CPU_GPU_CV 466mW/2669mW VDD_SOC 1478mW/2167mW
```
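To track memory usage over a whole run rather than reading it off by eye, the `RAM used/totalMB` field can be extracted from each `tegrastats` line. This is a minimal sketch that assumes the line format shown above; `parse_tegrastats_ram` is a hypothetical helper, and the sample string is a shortened copy of the first log line.

```python
import re


def parse_tegrastats_ram(line: str) -> tuple[int, int]:
    """Return (used_mb, total_mb) from the 'RAM used/totalMB' field of a tegrastats line."""
    match = re.search(r"RAM (\d+)/(\d+)MB", line)
    if match is None:
        raise ValueError("no RAM field found in line")
    return int(match.group(1)), int(match.group(2))


# Shortened copy of the first tegrastats line shown above.
sample = "03-23-2025 17:58:38 RAM 6595/7620MB (lfb 4x1MB) SWAP 107/3810MB (cached 0MB)"
used_mb, total_mb = parse_tegrastats_ram(sample)
print(f"RAM: {used_mb} / {total_mb} MB ({used_mb / total_mb:.1%} used)")
```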

## Running from Python

First, make sure `http_proxy` is unset for the current session, as suggested in the [official FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-use-ollama-behind-a-proxy).

> Avoid setting HTTP_PROXY. Ollama does not use HTTP for model pulls, only HTTPS. Setting HTTP_PROXY may interrupt client connections to the server.

In [3]:
import os
os.environ.pop('http_proxy', None)  # `!export` would only affect a throwaway subshell, not this kernel

In [4]:
default_prompt = 'What is Theory of Mind (ToM)? Give a simple example about it. Also, How do you think an LLM can somehow achieve ToM?'

In [5]:
from ollama import chat

In [6]:
stream = chat(
    model='deepseek-r1:7b',
    messages=[{
        'role': 'user',
        'content': default_prompt,
    }],
    stream=True,
)

metrics = None

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
    if chunk['done']:
        metrics = chunk.model_dump()

<think>
Okay, so I need to figure out what Theory of Mind (ToM) is and come up with a simple example. Then, I also have to think about how an LLM might achieve ToM. Hmm, let me start by understanding what ToM really means.

From what I remember, Theory of Mind refers to the ability to understand that other people have their own thoughts, beliefs, intentions, goals, and perspectives different from one's own. It's like being able to put yourself in someone else's shoes. So, for example, if my friend is upset because they can't play outside, I might think about why they're upset—maybe they don't want to get wet or maybe something happened that made them sad.

Now, an LLM has to achieve this ability somehow. I'm a bit fuzzy on how exactly. Maybe it's through training data? Like if the model is exposed to a lot of conversations where people explain their thoughts and feelings, the model might learn patterns associated with ToM. Or perhaps it's more about generating responses that are cohere

According to the [official documentation](https://github.com/ollama/ollama/blob/main/docs/api.md#response-10), the last response will contain some evaluation metrics, so we save the final metrics for analysis.

In [7]:
metrics

{'model': 'deepseek-r1:7b',
 'created_at': '2025-03-23T11:47:43.742901832Z',
 'done': True,
 'done_reason': 'stop',
 'total_duration': 96964881225,
 'load_duration': 46586728,
 'prompt_eval_count': 34,
 'prompt_eval_duration': 20936091,
 'eval_count': 734,
 'eval_duration': 96895594971,
 'message': {'role': 'assistant',
  'content': '',
  'images': None,
  'tool_calls': None}}

According to the [official documentation](https://github.com/ollama/ollama/blob/main/docs/api.md#durations):
> All durations are returned in nanoseconds.

So we need to convert the data into a format similar to the one shown in the CLI console.

In [8]:
print('total duration:\t\t%.4f sec' % (metrics['total_duration'] / 1e9))
print('load duration:\t\t%.4f ms' % (metrics['load_duration'] / 1e6))
print('prompt eval count:\t%d token(s)' % (metrics['prompt_eval_count']))
print('prompt eval duration:\t%.4f ms' % (metrics['prompt_eval_duration'] / 1e6))
print('prompt eval rate:\t%.2f tokens/s' % (metrics['prompt_eval_count'] / metrics['prompt_eval_duration'] * 1e9))
print('eval count:\t\t%d token(s)' % (metrics['eval_count']))
print('eval duration:\t\t%.4f sec' % (metrics['eval_duration'] / 1e9))
print('eval rate:\t\t%.2f tokens/s' % (metrics['eval_count'] / metrics['eval_duration'] * 1e9))

total duration:		96.9649 sec
load duration:		46.5867 ms
prompt eval count:	34 token(s)
prompt eval duration:	20.9361 ms
prompt eval rate:	1623.99 tokens/s
eval count:		734 token(s)
eval duration:		96.8956 sec
eval rate:		7.58 tokens/s
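For repeated benchmark runs, the same conversion can be wrapped in a small function that returns the two rates instead of printing them. This is a minimal sketch: `token_rates` is a hypothetical helper, and it assumes the metrics dictionary has the same keys and nanosecond durations as the final streamed response shown above. The sample values are copied from that response.

```python
NANOSECONDS_PER_SECOND = 1e9


def token_rates(metrics: dict) -> dict:
    """Compute prefill and decode token rates (tokens/s) from Ollama response metrics."""
    prefill_seconds = metrics["prompt_eval_duration"] / NANOSECONDS_PER_SECOND
    decode_seconds = metrics["eval_duration"] / NANOSECONDS_PER_SECOND
    return {
        "prefill_tokens_per_s": metrics["prompt_eval_count"] / prefill_seconds,
        "decode_tokens_per_s": metrics["eval_count"] / decode_seconds,
    }


# Values copied from the final response of the run above.
sample_metrics = {
    "prompt_eval_count": 34,
    "prompt_eval_duration": 20936091,
    "eval_count": 734,
    "eval_duration": 96895594971,
}
rates = token_rates(sample_metrics)
print(f"prefill: {rates['prefill_tokens_per_s']:.2f} tokens/s")
print(f"decode:  {rates['decode_tokens_per_s']:.2f} tokens/s")
```

Returning a dictionary makes it straightforward to collect the rates from several prompts in a list and average them afterwards.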


## TODO

1. [Open WebUI](https://www.jetson-ai-lab.com/tutorial_openwebui.html): [Open WebUI](https://github.com/open-webui/open-webui) is an AI platform designed to operate entirely offline. More specifically, it is a versatile, **browser-based** interface for running and managing LLMs locally.