This project has been created as part of the 42 curriculum by artcolom.
A function calling system that translates natural language prompts into structured JSON function calls using constrained decoding with Qwen3-0.6B (0.6B parameters).
The core idea: instead of hoping a small LLM outputs valid JSON, we force it by masking invalid tokens at every generation step. The model can only pick tokens that produce a valid prefix of the expected JSON schema. This guarantees 100% parseable output regardless of model size.
"What is the product of 3 and 5?"
|
v
[Qwen3-0.6B] <-- only valid JSON tokens allowed at each step
|
v
{"name": "fn_multiply_numbers", "parameters": {"a": 3.0, "b": 5.0}}
For each function definition, we build a JSON template -- an ordered list of segments that alternate between fixed text and variable slots:
FIXED: {"name": "fn_multiply_numbers", "parameters": {"a":
VARIABLE: <number>
FIXED: , "b":
VARIABLE: <number>
FIXED: }}
At each generation step:
- The LLM produces logits (probability scores) for every token in its ~150k vocabulary
- We determine which tokens are valid -- appending them to the current buffer must still be a valid prefix of at least one template
- All invalid tokens get their logits set to -inf
- We pick the token with the highest remaining logit (greedy decoding)
- Append it to the buffer and repeat until the JSON is complete (
}})
The function is_prefix_valid(buffer, template) walks through the template segments and checks that the buffer matches:
- Fixed segments: character-by-character exact match
- Variable segments: type-specific validation
number: digits,.,-,+,e,Einteger: digits and-onlyboolean: must be a prefix oftrueorfalsestring: any character, with\skipping the next char (escape handling)
The boundary between a variable segment and the next fixed segment is detected by checking if the current buffer position starts matching the next fixed text.
Testing all ~150k tokens at each step would be slow. We use two optimizations:
- First-character filtering: we pre-compute a mapping from each ASCII character to all tokens starting with that character. At each step, we first test which characters are valid (~95 checks), then only test the tokens starting with those characters
- Buffer caching: if the same buffer state has been seen before, we reuse the cached valid token list
- Template system over grammar: instead of a full JSON grammar, we use function-specific templates. This is simpler, faster, and sufficient since the output schema is known in advance.
- Greedy decoding: we always pick the highest-logit valid token (argmax). No sampling, no beam search. This is deterministic and fast -- the constrained masking already ensures valid output.
- Peek logic for embedded quotes: when the model generates
"inside a string variable, we can't tell if it's a closing quote or part of the text (e.g.,Say "hello"). We peek at the next token: if the model would follow with}or,, it's a closing quote. Otherwise, we write\"(escaped quote) into the buffer. - Post-processing state machine: the raw buffer may contain invalid JSON escapes (e.g.,
\Ufrom Windows paths,\'from apostrophes). A state machine walks the JSON string values and fixes these: doubling backslashes for invalid escapes, stripping unnecessary\'. - Type coercion: JSON parsing gives Python types that may not match the function definition (e.g.,
intinstead offloat). A post-processing step coerces values to their declared types. - Custom BPE tokenizer: built from the model's
vocab.jsonandmerges.txtfiles, implementing pre-tokenization (space ->Ġ), iterative BPE merging, and ID lookup.
- Accuracy: 11/11 (100%) on private moulinette tests, covering numbers, strings, integers, booleans, multi-parameter functions, Windows paths with backslashes, embedded quotes, and apostrophes
- JSON validity: 100% -- constrained decoding guarantees every output is parseable
- Speed: all 11 prompts processed in under 2 minutes on a MacBook (M-series). The first prompt is slowest due to model loading; subsequent prompts benefit from token caching
- Robustness: graceful error handling for missing files, invalid JSON input, and model loading failures
- Type support: the initial implementation only handled
numberandstring. Addingintegerandbooleanrequired extending both the template builder and the prefix validator with type-specific logic. - Garbage tokens: some vocabulary tokens contain non-ASCII characters (e.g.,
âĢĿ) that leaked into string values. Fixed by filtering tokens with a_is_clean_token()check during initialization. - Embedded quotes: prompts like
Say "hello" to {name}caused the decoder to prematurely close the string. The peek logic (look at what the model wants to generate after") solved this. - Backslash handling: Windows paths like
C:\Users\john\config.iniproduce invalid JSON escapes (\U,\j,\c). A regex-based fix wasn't enough because\ufollowed by non-hex digits was treated as a valid Unicode escape prefix. Replacing the regex with a state machine that processes each backslash in context solved all edge cases. - Apostrophe escaping: the model generates
\'for apostrophes, but JSON doesn't recognize this escape. The state machine converts\'to plain'.
- Moulinette grading: used the provided moulinette to run the full test suite (11 private tests) and validate output format, function selection, and parameter extraction
- Verbose mode: the
--verboseflag displays each generation step (chosen token, score, top-5 alternatives, valid token count) for manual inspection of the decoding process - Incremental testing: each new feature (type support, escape handling, peek logic) was tested individually against the moulinette to ensure no regressions
- Edge cases covered: float vs int types, boolean values, multi-parameter functions, special characters in strings (quotes, apostrophes, backslashes), long numbers with decimals
make installmake runWith custom paths:
uv run python -m src \
--functions_definition data/input/functions_definition.json \
--input data/input/function_calling_tests.json \
--output data/output/function_calling_results.jsonuv run python -m src --verbosemake lintmake debugGiven this function definition:
{
"name": "fn_multiply_numbers",
"description": "Multiply two numbers together.",
"parameters": { "a": {"type": "number"}, "b": {"type": "number"} },
"returns": {"type": "number"}
}And this prompt:
{"prompt": "What is the product of 3 and 5?"}The program outputs:
{
"prompt": "What is the product of 3 and 5?",
"name": "fn_multiply_numbers",
"parameters": {"a": 3.0, "b": 5.0}
}- Constrained Decoding for LLMs -- Guided Generation of Large Language Models (original paper)
- BPE Tokenization -- Hugging Face NLP course on Byte-Pair Encoding
- JSON specification -- Official JSON format reference
- Pydantic documentation -- Data validation library used throughout
Claude (Anthropic) was used as a research and debugging assistant during development:
- Research: understanding BPE tokenization internals, JSON escape sequence edge cases, constrained decoding strategies, and how logit masking works in practice
- Debugging: extensive iterative debugging sessions to diagnose template matching issues, escape handling edge cases, type coercion problems, and embedded quote detection across many test runs
- No blind generation: all architectural decisions (template system, peek logic, state machine for escapes) were discussed, understood, and validated through testing. The core design and implementation choices are my own.