# Introduction
In this notebook I will estimate minimal number of tokens that our llama model will consume per second.<br/>
The following calculations are made under optimistic assumption that we will parse only content from text fields. <br/>
Note: you will need llama tokenizer for running this notebook, which may require access request on HF.

In [1]:
import pandas as pd
from pathlib import Path
from transformers import LlamaTokenizerFast
from datetime import datetime

from transformers.models.vits.modeling_vits import VitsTextEncoder

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
!python ../../tools/data_load.py coupons_1

Traceback (most recent call last):
  File "/home/szymon/murmuras/ZPP_murmuras/research/speed_reuirements_research/../../tools/data_load.py", line 8, in <module>
    from googleapiclient.discovery import build, Resource
ModuleNotFoundError: No module named 'googleapiclient'


In [3]:
frames = []
DS_PATH = Path("../..") / "datasets" / "coupons_1"
PATHS = [
    DS_PATH / "lidl" / "Kopia test_data_2024_11_25_lidl_plus_content_generic_2024-12-05T07_39_49.726955559+01_00.csv",
    DS_PATH / "dm" / "Kopia test_data_2024_03_07_dm_content_generic_2024-12-05T10_09_32.502568365+01_00.csv",
    DS_PATH / "rewe" / "Kopia test_data_2024_03_07_rewe_content_generic_2024-12-05T10_30_59.948177782+01_00.csv",
    DS_PATH / "rossmann" / "Kopia test_data_2024_03_07_rossmann_content_generic_2024-12-05T10_24_07.981399375+01_00.csv"
]

for path in PATHS:
    frames.append(pd.read_csv(path))

In [4]:
tokenizer = LlamaTokenizerFast.from_pretrained("meta-llama/Llama-3.2-1B")
TEXT_COL_NAME = "text"
TIMESTAMP_COL_NAME = "seen_timestamp"
DEPTH_COL_NAME = "view_depth"
VIEW_ID_COL_NAME = "view_id"

def count_tokens(text):
    return len(tokenizer.tokenize(text))

tokens_cum = 0
seconds_cum = 0
timestamps_cum = 0

for frame in frames:

    texts = frame[TEXT_COL_NAME]
    texts = texts[texts.notnull()]
    total_tokens = texts.apply(count_tokens).sum()
    times = frame[TIMESTAMP_COL_NAME]
    times = times[times > 0]
    time_start = datetime.fromtimestamp(times.min() // 1000)
    time_end = datetime.fromtimestamp(times.max() // 1000)
    total_seconds = (time_end - time_start).total_seconds()
    timestamps_cum += len(frame[TIMESTAMP_COL_NAME].unique())

    print(total_tokens, total_seconds)
    tokens_cum += total_tokens
    seconds_cum += total_seconds

print(f"required min processing speed: {float(tokens_cum / seconds_cum)} tokens per second")
print(f"tokens per timestamp: {tokens_cum / timestamps_cum}")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'. 
The class this function is called from is 'LlamaTokenizerFast'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


11739 169.0
15026 56.0
23708 80.0
14678 79.0
required min processing speed: 169.6640625 tokens per second
tokens per timestamp: 180.975


## Estimation for JSON encoding
In the following section I will estimate the number of tokens consumed by LLama if we decide to preserve tree structure in form of JSON. <br/>
To encode XML tree I will use following syntax:
```json
{
  "text": "text field content",
  "children": {
    "child1_view_id": ...,
    "child2_view_id": ...,
    ...
  }
}
```
Additionally, two tree simplification operations are performed:<br/>
* if a node has no children and no text it is removed
* if a node has single child and no text it is collapsed - its child is transferred to node's parent under name `node_view_id.child_name` and node does not exist on its own
* if a node has no children "children" dict keyt is removed

In [20]:
from typing import Tuple, Optional


def collapse_tree(tree: dict) -> Tuple[Optional[dict], str]:
    """removes nodes that have only one child and no text"""
    if len(tree['children']) < 2 and tree['text'] is None:
        if len(tree['children']) == 1:
            child_name, child = list(tree['children'].items())[0]
            collapsed, name = collapse_tree(child)
            if collapsed is not None:
                name = f"{child_name}.{name}"
            return collapsed, name
        return None, ""
    new_children = {}
    for child_name, child in tree['children'].items():
        collapsed, suffix = collapse_tree(child)
        if collapsed is not None:
            if suffix is not None:
                new_children[f"{child_name}.{suffix}"] = collapsed
            else:
                new_children[child_name] = collapsed
    tree['children'] = new_children
    if len(tree['children']) == 0:
        del tree['children']
    return tree, ""

def timestamp_batch_to_json(batch: pd.DataFrame):
    """takes batch representing single screen content and converts it to JSON representing XML structure"""
    tree_path = []
    res = {"text": None, "children": {}}

    def _insert_at_path(key, val):
        t = res
        for k, d in tree_path:
            t = t["children"][k]
        t["children"][key] = val

    for row in batch.iterrows():
        text_field = row[1][TEXT_COL_NAME]
        name = row[1][VIEW_ID_COL_NAME]
        if isinstance(name, str):
            name = name.rsplit('/')[-1]
        if not isinstance(text_field, str):
            text_field = None
        depth = row[1][DEPTH_COL_NAME]
        while len(tree_path) > 0 and tree_path[-1][1] >= depth:
            tree_path.pop(-1)
        _insert_at_path(name, {"text": text_field, "children": {}})
        tree_path.append((name, depth))

    return res

In [21]:
from json import dumps


seconds_cum = 0
tokens_cum = 0
total_timestamps = 0
for frame in frames:
    times = frame[]
    times = times[times > 0]
    time_start = datetime.fromtimestamp(times.min() // 1000)
    time_end = datetime.fromtimestamp(times.max() // 1000)
    seconds_cum += (time_end - time_start).total_seconds()
    for _, subframe in frame.groupby(TIMESTAMP_COL_NAME):
        total_timestamps += 1
        tree = timestamp_batch_to_json(subframe)
        tree = collapse_tree(tree)[0]
        tree_str = dumps(tree)
        tokens_cum += len(tokenizer.tokenize(tree_str))
print(f"{tokens_cum=}\n{seconds_cum=}\n{total_timestamps=}")
print(f"incoming tokens per second: {tokens_cum / seconds_cum}")
print(f"tokens per timestamp_seen (screen): {tokens_cum / total_timestamps}")

tokens_cum=104188
seconds_cum=384.0
total_timestamps=360
incoming tokens per second: 271.3229166666667
tokens per timestamp_seen (screen): 289.4111111111111


# Results
| metric                   | plain text format | json-encoded content |
|--------------------------|-------------------|----------------------|
| incoming tokens/s        | 169.66            | 271.32               |
| total tokens             | 104188            | 104188               |
| measurement duration [s] | 384               | 384                  |
| tokens per timestamp     | 180.98            | 289.41               |
