# 0. Introduction
In this notebook, I go through inference examples with models that Llama.cpp lists as supported, and with one that is technically not listed but that I managed to run anyway. Additionally, I present a way to enforce a specific JSON format on the model output.

Throughout this notebook, I will assume that the reader has a basic understanding of the contents of the 'introduction' folder that contains example workflows of Llama.cpp.

# 1. Enforcing a specific JSON format
The Llama.cpp "llama-cli" binary supports constraining the output to a specific JSON format using [GBNF](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) grammars or [JSON schemas](https://json-schema.org/). When I used JSON schemas directly, even the ones provided as examples, I ran into issues. Luckily, Llama.cpp provides a [script](https://github.com/ggerganov/llama.cpp/blob/master/examples/json_schema_to_grammar.py) that converts JSON schemas to GBNF grammars. This is the approach that I used.

For this example, I use the [Llama 3.2 1B model](https://huggingface.co/meta-llama/Llama-3.2-1B) with Q5_K quantization (more on models and quantizations later in the notebook). I also use a "prompt.txt" file with a very small example prompt and a "schema.json" file with the schema that I want to enforce on the model output. Both files can be found in the "miscellaneous" directory.
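The actual "schema.json" lives in the "miscellaneous" directory and is not reproduced here. Purely as an illustration of what such a schema can look like, the sketch below writes a small product-list schema to disk. The field names (`products`, `name`, `price`), the helper `write_schema`, and the output filename are all assumptions, not the schema used in this notebook.

```python
import json
from pathlib import Path

# Hypothetical schema: an object holding a list of products.
# The field names are illustrative only, not taken from schema.json.
example_schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["name", "price"],
            },
        }
    },
    "required": ["products"],
}


def write_schema(path: str, schema: dict) -> Path:
    """Serialize a schema dict to a JSON file and return its path."""
    target = Path(path)
    target.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return target


# Hypothetical filename, chosen so the real schema.json is not overwritten.
schema_path = write_schema("example_schema.json", example_schema)
```

A file written this way could then be passed to the conversion script in place of "schema.json".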

To run the inference with the enforced schema, I ran the following command:
```bash
./build/bin/llama-cli -m ~/llama3215.gguf --file prompt.txt --grammar "$( python examples/json_schema_to_grammar.py schema.json )"
```

The result of this command is the following: \
![Example result](images/prompt_example.png)

A few things to note here:
1. The output matches the provided schema, but its formatting may differ between runs.
2. It is very accurate for small prompts; for bigger prompts (see the example in the "prompt_large.txt" file), the output omits some of the data.
3. The schema does not force the model to stop. Older models and models with stronger quantizations may never generate the "[end of text]" token and will keep generating data indefinitely (unless we prevent them from doing so).
4. The prompt has a specific format. Llama models are, by default, trained for text completion, so simply asking the model which products are in the file may lead it to talk about something related without producing the desired result. Because of this, prompts need to be written so that the model "completes" them rather than answers them.
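Since the captured output typically contains more than the JSON itself (the completed prompt text and possibly an "[end of text]" marker), a downstream check has to locate the JSON inside that text. Below is a minimal sketch, assuming the first `{` in the captured text starts the generated object; the helper name and the sample string are hypothetical.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Parse the first JSON object found in `text`, ignoring what follows it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses exactly one JSON value beginning at `start` and
    # stops there, so trailing text such as "[end of text]" is not an error.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Hypothetical captured output: completion-style prompt, JSON, end marker.
sample = 'The products are: {"products": [{"name": "apple", "price": 1.5}]} [end of text]'
parsed = extract_first_json_object(sample)
```

If generation was cut off before the object was closed, `raw_decode` raises `json.JSONDecodeError` (a subclass of `ValueError`), which makes truncated outputs easy to detect.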


# 2. Running different Llama models
Llama.cpp has a list of models that are explicitly [supported](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf_update.py). This proved to be an issue with BERT models, because every model listed there has its own architecture, and each architecture needs dedicated code in order to run. The supported BERT models have a different underlying type than the ones we use, but my hypothesis was that if a model had the same underlying type as a supported one, it would be possible to run it.

I am happy to report that this hypothesis held. I ran both explicitly supported models, [Llama 2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf) and [Llama 3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), but I also managed to run the [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) model with different quantizations. To achieve that, I followed steps similar to the ones I described for the BERT model in the introduction folder, with one notable change: I added the following line to the "convert_hf_to_gguf_update.py" file:
```python
{"name": "llama-bpe",      "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/meta-llama/Llama-3.2-1B", },
```

This way, by running this script, I made Llama.cpp "aware" of this model, and because it has the same underlying type as the older models, there were no issues with running it.

As part of my experiments, I also managed to run the [Hermes-3-Llama-3.2-3B](https://huggingface.co/NousResearch/Hermes-3-Llama-3.2-3B-GGUF/tree/main) model (another "unsupported" model), which has 3B parameters. I found it while looking for alternatives to the base Meta models. You can find its comparison with other models later in this notebook.

# 3. Issues and limitations
As mentioned before, older models and/or models with stronger quantizations may not generate the "[end of text]" token and will either generate data indefinitely or repeat it a few times before finishing. To reproduce this, I tried both Llama-2 and Llama-3.2 with the strongest quantizations (Llama-2 is the oldest model, Llama-3.2 is the smallest one), but the issue was very difficult to reproduce. Instead, I observed that those variants tend to produce a lot of non-existent data, e.g. Llama-2: \
![Llama-2](images/llama2_ghost.png)

This is problematic. If the models only produced repetitive data, we could use the "--repeat-penalty" and "--repeat-last-n" flags to penalize the model for generating similar or identical tokens and sequences, and so push it toward generating the "[end of text]" token. In this case, it seems that the only option is to set a limit on the number of generated tokens, but for that we would need to know the size of the result (TODO: JSON schema).
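The two mitigations can be combined in a single invocation. The sketch below only assembles the argument list; it assumes the binary path from the earlier command and uses `-n`, which to my knowledge is llama-cli's option for capping the number of generated tokens. The helper name and all numeric values are placeholders, not tuned recommendations.

```python
def build_llama_cli_args(model_path, prompt_file, grammar,
                         max_tokens=512, repeat_penalty=1.1, repeat_last_n=64):
    """Return an argv list for llama-cli with a token cap and repetition penalties.

    The default values are illustrative assumptions, not measured settings.
    """
    return [
        "./build/bin/llama-cli",
        "-m", model_path,
        "--file", prompt_file,
        "--grammar", grammar,
        # Hard cap on generated tokens, in case "[end of text]" never appears.
        "-n", str(max_tokens),
        # Penalize tokens that occurred within the last `repeat_last_n` tokens.
        "--repeat-penalty", str(repeat_penalty),
        "--repeat-last-n", str(repeat_last_n),
    ]
```

The resulting list can be passed to `subprocess.run`. Keeping the command as a list avoids shell quoting problems with the multi-line grammar string.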

Still, this issue appears mostly with the more strongly quantized models, and using something like Q5/Q6 quantization should be enough to prevent it.

# 4. Performance
I did not prepare detailed performance tests, but I used Llama.cpp's built-in benchmarks to showcase how different models behave with the Q5_K_M quantization:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 1B Q5_K - Medium         | 861.81 MiB |     1.24 B | CPU        |       6 |         pp512 |        112.95 ± 8.73 |
| llama 1B Q5_K - Medium         | 861.81 MiB |     1.24 B | CPU        |       6 |         tg128 |         34.98 ± 2.50 |
| llama 8B Q5_K - Medium         |   5.33 GiB |     8.03 B | CPU        |       6 |         pp512 |         14.69 ± 0.22 |
| llama 8B Q5_K - Medium         |   5.33 GiB |     8.03 B | CPU        |       6 |         tg128 |          6.23 ± 0.23 |
| llama 7B Q5_K - Medium         |   4.45 GiB |     6.74 B | CPU        |       6 |         pp512 |         15.42 ± 0.22 |
| llama 7B Q5_K - Medium         |   4.45 GiB |     6.74 B | CPU        |       6 |         tg128 |          7.10 ± 0.11 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | CPU        |       6 |         pp512 |         35.35 ± 0.47 |
| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | CPU        |       6 |         tg128 |         13.88 ± 0.44 |
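
To put the throughput numbers in perspective, the short sketch below converts the tg128 rows into the time needed to generate 128 tokens and into a relative speedup. It uses only the mean values from the table and ignores the reported ± spread; the helper names are hypothetical.

```python
# Token-generation (tg128) throughput in tokens/s, copied from the table above.
tg128_tps = {
    "llama 1B": 34.98,
    "llama 3B": 13.88,
    "llama 7B": 7.10,
    "llama 8B": 6.23,
}


def seconds_for_tokens(tokens_per_second: float, n_tokens: int) -> float:
    """Time in seconds to generate `n_tokens` at a constant rate."""
    return n_tokens / tokens_per_second


def speedup(faster: str, slower: str) -> float:
    """Ratio of generation throughput between two table entries."""
    return tg128_tps[faster] / tg128_tps[slower]


for name, tps in tg128_tps.items():
    print(f"{name}: {seconds_for_tokens(tps, 128):.1f} s for 128 tokens")
print(f"1B vs 8B generation speedup: {speedup('llama 1B', 'llama 8B'):.1f}x")
```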

