# 0. Introduction
In this notebook, I go through inference examples with models that Llama.cpp lists as supported, plus one model that is not officially listed but that I managed to run anyway. I also present a way to enforce a specific JSON format on the model output.

Throughout this notebook, I assume that the reader has a basic understanding of the contents of the 'introduction' folder, which contains example Llama.cpp workflows.

# 1. Enforcing a specific JSON format
The Llama.cpp "llama-cli" binary supports constraining the output to a given JSON format using [GBNF](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md) grammars or [JSON schemas](https://json-schema.org/). When passing JSON schemas directly, even the ones provided as examples, I ran into issues. Luckily, Llama.cpp provides a [script](https://github.com/ggerganov/llama.cpp/blob/master/examples/json_schema_to_grammar.py) that converts JSON schemas to GBNF grammars. This is the approach that I used.

For this example, I use the [Llama 3.2 1B model](https://huggingface.co/meta-llama/Llama-3.2-1B) with Q5_K quantization (more on models and quantizations later in the notebook). I also use a "prompt.txt" file with a very small example prompt and a "schema.json" file with the schema that I want to enforce on the model output. Both files can be found in the "miscellaneous" directory.
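
To give an idea of what such a schema can look like, here is a minimal sketch in Python. It is not the actual content of "schema.json": the `products`, `name`, and `price` fields are hypothetical and only illustrate the general shape of a JSON schema for a list of products.

```python
import json

# Hypothetical schema (an assumption, not the real "schema.json"):
# an object with a "products" array whose items have a name and a price.
product_schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["name", "price"],
            },
        }
    },
    "required": ["products"],
}

# Serialize the schema; this text is what would be saved as a schema file.
schema_text = json.dumps(product_schema, indent=2)
print(schema_text)
```

Listing the fields under `required` matters here: a field that is only declared under `properties` is optional in JSON Schema.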

To run the inference with the enforced schema, I ran the following command:
```bash
./build/bin/llama-cli -m ~/llama3215.gguf --file prompt.txt --grammar "$( python examples/json_schema_to_grammar.py schema.json )"
```
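
Inside a notebook it can be more convenient to drive the same two steps from Python. The following is a rough sketch, not part of Llama.cpp: the helper names `schema_to_grammar` and `build_llama_cli_command` are my own, and the default paths are simply the ones from the command above. It assumes the converter script prints the grammar to standard output, which is what the shell substitution above relies on.

```python
import subprocess
import sys
from pathlib import Path


def schema_to_grammar(schema_path, converter="examples/json_schema_to_grammar.py"):
    """Run the converter script and return the GBNF grammar it prints."""
    result = subprocess.run(
        [sys.executable, str(converter), str(schema_path)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout


def build_llama_cli_command(model_path, prompt_path, grammar,
                            binary="./build/bin/llama-cli"):
    """Build the argument list for llama-cli without running it."""
    return [
        binary,
        "-m", str(Path(model_path).expanduser()),  # expand "~" ourselves
        "--file", str(prompt_path),
        "--grammar", grammar,
    ]
```

With these helpers, `subprocess.run(build_llama_cli_command("~/llama3215.gguf", "prompt.txt", schema_to_grammar("schema.json")), check=True)` would start the inference. Passing the arguments as a list avoids shell quoting problems, since the grammar text contains quotes and newlines.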

The result of this command is the following: \
![Example result](images/prompt_example.png)

A few things to note here:
1. The output matches the provided schema, but its formatting may differ between runs.
2. The output is very accurate, but for bigger prompts (see "prompt_large.txt" for an example) it omits some of the data.
3. The schema does not force the model to stop. Older models, or models with more aggressive quantization, may never generate the "[end of text]" token and will keep generating output indefinitely (unless we prevent them from doing so).
4. The prompt has a specific format. Llama models are, by default, trained for sentence completion, so simply asking the model which products are in the file may cause it to talk about something related without producing the desired result. Because of this, prompts need to be written so that the model "completes" them rather than answers them.
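
The notes above can be partly handled in Python around the CLI call. The sketch below is only an illustration under a few assumptions: the helper names and the prompt wording are hypothetical, the token limit of 256 is arbitrary, and the prompt itself is assumed to contain no `{` or `[` characters (llama-cli echoes the prompt before the generated text). The `-n` (`--n-predict`) option is the llama-cli option that caps the number of generated tokens.

```python
import json

END_MARKER = "[end of text]"

# Extra llama-cli arguments that cap generation (note 3); 256 is arbitrary.
N_PREDICT_ARGS = ["-n", "256"]


def build_completion_prompt(data_text):
    """Phrase the task as text to continue, not a question (note 4)."""
    return f"{data_text.strip()}\n\nThe products listed above, as JSON:\n"


def parse_model_json(raw_output):
    """Extract and parse the first JSON value in the raw CLI output."""
    text = raw_output.replace(END_MARKER, "")
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON value found in the model output")
    # raw_decode parses one JSON value and ignores any trailing text.
    value, _ = json.JSONDecoder().raw_decode(text[min(starts):])
    return value


def normalize_json(raw_output):
    """Re-serialize the output so formatting is stable across runs (note 1)."""
    return json.dumps(parse_model_json(raw_output), indent=2, sort_keys=True)
```

If the token cap cuts the output off in the middle of the JSON, `parse_model_json` raises `json.JSONDecodeError` (a subclass of `ValueError`), which makes truncated runs easy to detect.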
