Regression regarding some tokenizers #213

@saattrupdan

Describe the issue as clearly as possible:

#90 fixed an error affecting the Salamandra and OpenCoder tokenizers. The same error has now returned, and the fix from that PR no longer appears anywhere in the code base. Was it replaced by something else?

Steps/code to reproduce the bug:

from outlines_core.fsm.regex import reduced_vocabulary
from outlines.models.vllm import adapt_tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BSC-LT/salamandra-2b")
tokenizer = adapt_tokenizer(tokenizer)
vocabulary = reduced_vocabulary(tokenizer)

Expected result:

No error message.

Error message:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/alex-admin/euroeval/.venv/bin/euroeval", line 8, in <module>
[rank0]:     sys.exit(benchmark())
[rank0]:              ^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/click/core.py", line 1161, in __call__
[rank0]:     return self.main(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/click/core.py", line 1082, in main
[rank0]:     rv = self.invoke(ctx)
[rank0]:          ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/click/core.py", line 1443, in invoke
[rank0]:     return ctx.invoke(self.callback, **ctx.params)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/click/core.py", line 788, in invoke
[rank0]:     return __callback(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/euroeval/cli.py", line 277, in benchmark
[rank0]:     benchmarker.benchmark(model=models)
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/euroeval/benchmarker.py", line 461, in benchmark
[rank0]:     benchmark_output_or_err = self._benchmark_single(
[rank0]:                               ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/euroeval/benchmarker.py", line 768, in _benchmark_single
[rank0]:     scores = generate(
[rank0]:              ^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/euroeval/generation.py", line 84, in generate
[rank0]:     test_scores = generate_single_iteration(
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/euroeval/generation.py", line 163, in generate_single_iteration
[rank0]:     model_output = model.generate(inputs=batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/euroeval/benchmark_modules/vllm.py", line 361, in generate
[rank0]:     logits_processor = JSONLogitsProcessor(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines/processors/structured.py", line 187, in __init__
[rank0]:     super().__init__(regex_string=regex_string, tokenizer=tokenizer)
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines/processors/structured.py", line 151, in __init__
[rank0]:     guide = RegexGuide.from_regex(regex_string, tokenizer)
[rank0]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines/fsm/guide.py", line 92, in from_regex
[rank0]:     return super().from_regex(
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines_core/fsm/guide.py", line 212, in from_regex
[rank0]:     ) = _create_states_mapping(
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines/fsm/guide.py", line 76, in cached_create_states_mapping
[rank0]:     return uncached_create_states_mapping(regex_string, tokenizer, *args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines_core/fsm/guide.py", line 141, in create_states_mapping
[rank0]:     return create_states_mapping_from_fsm(regex_fsm, tokenizer, frozen_tokens)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines_core/fsm/guide.py", line 178, in create_states_mapping_from_fsm
[rank0]:     states_to_token_maps, empty_token_ids = create_fsm_index_tokenizer(
[rank0]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines_core/fsm/regex.py", line 473, in create_fsm_index_tokenizer
[rank0]:     tokens_to_token_ids, empty_token_ids = reduced_vocabulary(tokenizer)
[rank0]:                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/alex-admin/euroeval/.venv/lib/python3.12/site-packages/outlines_core/fsm/regex.py", line 426, in reduced_vocabulary
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Cannot convert token `?` (217017) to bytes: �?
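The error message points at a token whose string form contains the Unicode replacement character (U+FFFD). As a rough diagnostic, and not the library's own logic, the sketch below lists every vocabulary entry whose decoded string contains that character. The helper name `find_suspect_tokens` and the toy vocabulary are made up for illustration; not every token it flags will necessarily trigger the error.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, shown as "�" in the error message


def find_suspect_tokens(vocab, to_string=lambda token: token):
    """Return (token, token_id, decoded) triples whose decoded form contains U+FFFD.

    `vocab` maps token strings to ids; `to_string` turns a raw token into its
    decoded string (identity by default). Hypothetical helper, for diagnosis only.
    """
    suspects = []
    # Sort by token id so the output order is stable and easy to compare.
    for token, token_id in sorted(vocab.items(), key=lambda item: item[1]):
        decoded = to_string(token)
        if REPLACEMENT_CHAR in decoded:
            suspects.append((token, token_id, decoded))
    return suspects


# Toy vocabulary standing in for a real tokenizer vocabulary (an assumption).
toy_vocab = {"hello": 0, "\ufffd?": 1, "world": 2}
suspects = find_suspect_tokens(toy_vocab)
print(suspects)
```

With the tokenizer from the reproduction steps, before it is passed to adapt_tokenizer, this could be called as `find_suspect_tokens(tokenizer.get_vocab(), lambda t: tokenizer.convert_tokens_to_string([t]))` to see which token ids are candidates for the failure.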

Outlines/Python version information:

Outlines version: 0.2.3

Python version: 3.12.3

Packages installed:

absl-py==2.2.2 accelerate==1.4.0 aiofiles==23.2.1 aiohappyeyeballs==2.4.8 aiohttp==3.11.13 aiosignal==1.3.2 airportsdata==20250224 annotated-types==0.7.0 anyio==4.8.0 astor==0.8.1 attrs==25.1.0 bert-score==0.3.13 bitsandbytes==0.45.3 blake3==1.0.4 cachetools==5.5.2 certifi==2025.1.31 charset-normalizer==3.4.1 chex==0.1.89 click==8.1.8 cloudpickle==3.1.1 compressed-tensors==0.9.3 contourpy==1.3.1 cupy-cuda12x==13.4.1 cycler==0.12.1 datasets==3.5.0 demjson3==3.0.6 Deprecated==1.2.18 depyf==0.18.0 dill==0.3.8 diskcache==5.6.3 distro==1.9.0 dnspython==2.7.0 einops==0.8.1 email_validator==2.2.0 etils==1.12.2 EuroEval @ git+https://github.com/EuroEval/EuroEval@6db11af4ee14cb832065f312d78266b5c1b46a26 evaluate==0.4.3 fastapi==0.115.11 fastapi-cli==0.0.7 fastrlock==0.8.3 fbgemm_gpu==1.1.0 ffmpy==0.5.0 filelock==3.17.0 flash_attn==2.7.4.post1 flax==0.10.4 fonttools==4.56.0 frozenlist==1.5.0 fsspec==2024.12.0 genson==1.3.0 gguf==0.16.0 googleapis-common-protos==1.70.0 gradio==5.20.0 gradio_client==1.7.2 groovy==0.1.2 grpcio==1.71.0 h11==0.14.0 hf-xet==1.0.2 httpcore==1.0.7 httptools==0.6.4 httpx==0.28.1 huggingface-hub==0.30.1 humanize==4.12.2 idna==3.10 importlib_metadata==8.0.0 importlib_resources==6.5.2 interegular==0.3.3 iso3166==2.1.1 jax==0.5.3 jaxlib==0.5.3 Jinja2==3.1.5 jiter==0.8.2 joblib==1.4.2 jsonschema==4.23.0 jsonschema-specifications==2024.10.1 kiwisolver==1.4.8 lark==1.2.2 Levenshtein==0.27.1 litellm==1.65.1 llguidance==0.7.11 llvmlite==0.44.0 lm-format-enforcer==0.10.11 markdown-it-py==3.0.0 MarkupSafe==2.1.5 matplotlib==3.10.1 mdurl==0.1.2 mistral_common==1.5.4 ml_dtypes==0.5.1 more-itertools==10.6.0 mpmath==1.3.0 msgpack==1.1.0 msgspec==0.19.0 multidict==6.1.0 multiprocess==0.70.16 nanobind==2.6.1 nest-asyncio==1.6.0 networkx==3.4.2 ninja==1.11.1.4 nltk==3.9.1 numba==0.61.2 numpy==1.26.4 nvidia-cublas-cu12==12.4.5.8 nvidia-cuda-cupti-cu12==12.4.127 nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.2.1.3 nvidia-curand-cu12==10.3.5.147 nvidia-cusolver-cu12==11.6.1.9 nvidia-cusparse-cu12==12.3.1.170 nvidia-cusparselt-cu12==0.6.2 nvidia-ml-py==12.570.86 nvidia-nccl-cu12==2.21.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.4.127 ollama==0.4.7 openai==1.70.0 opencv-python-headless==4.11.0.86 opentelemetry-api==1.26.0 opentelemetry-exporter-otlp==1.26.0 opentelemetry-exporter-otlp-proto-common==1.26.0 opentelemetry-exporter-otlp-proto-grpc==1.26.0 opentelemetry-exporter-otlp-proto-http==1.26.0 opentelemetry-proto==1.26.0 opentelemetry-sdk==1.26.0 opentelemetry-semantic-conventions==0.47b0 opentelemetry-semantic-conventions-ai==0.4.3 opt_einsum==3.4.0 optax==0.2.4 orbax-checkpoint==0.11.10 orjson==3.10.15 outlines==0.2.3 outlines_core==0.1.26 packaging==24.2 pandas==2.2.3 partial-json-parser==0.2.1.1.post5 peft==0.15.0 pillow==11.1.0 prometheus-fastapi-instrumentator==7.0.2 prometheus_client==0.21.1 propcache==0.3.0 protobuf==3.20.3 psutil==7.0.0 py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==19.0.1 pycountry==24.6.1 pydantic==2.10.6 pydantic_core==2.27.2 pydub==0.25.1 Pygments==2.19.1 pyinfer==0.0.3 pyparsing==3.2.1 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-json-logger==3.3.0 python-multipart==0.0.20 pytz==2025.1 PyYAML==6.0.2 pyzmq==26.2.1 RapidFuzz==3.12.2 ray==2.43.0 referencing==0.36.2 regex==2024.11.6 requests==2.32.3 rich==13.9.4 rich-toolkit==0.13.2 rouge_score==0.1.2 rpds-py==0.23.1 ruff==0.9.9 sacremoses==0.1.1 safehttpx==0.1.6 safetensors==0.5.3 ScandEval==14.0.0 scikit-learn==1.5.2 scipy==1.15.2 semantic-version==2.10.0 sentencepiece==0.2.0 seqeval==1.2.2 setuptools==75.8.2 shellingham==1.5.4 simplejson==3.20.1 six==1.17.0 sniffio==1.3.1 starlette==0.46.0 sympy==1.13.1 tabulate==0.9.0 tenacity==9.0.0 tensorstore==0.1.72 termcolor==2.5.0 threadpoolctl==3.5.0 tiktoken==0.9.0 tokenizers==0.21.1 tomlkit==0.13.2 toolz==1.0.0 torch==2.6.0 torchaudio==2.6.0
torchvision==0.21.0 tqdm==4.67.1 transformers==4.51.3 treescope==0.1.9 triton==3.2.0 typer==0.15.2 typing_extensions==4.12.2 tzdata==2025.1 urllib3==2.3.0 uvicorn==0.34.0 uvloop==0.21.0 vllm==0.8.5.post1 watchfiles==1.0.4 websockets==15.0 wheel==0.45.1 wrapt==1.17.2 xformers==0.0.29.post2 xgrammar==0.1.18 xxhash==3.5.0 yarl==1.18.3 zipp==3.21.0

Context for the issue:

No response

Labels: bug
