A method to evaluate the responses of lightweight LLMs to TRUE-FALSE questions
Source Code for the Data Visualization: https://github.com/csisc/BoolV-Analysis.
To Cite the Work: Turki, H., Dossou, B. F. P., Nebli, A., & Valdelli, I. (2025). Evaluating the Behavior of Small Language Models in Answering Binary Questions. In 3rd International Workshop on Generalizing from Limited Resources in the Open World (GLOW@IJCAI 2025).
Evaluated models (with their parameter counts):

Model | Parameters |
---|---|
llama-3.2-1b-instruct-q8_0 | 1.24 B |
llama-3.2-3b-instruct-q8_0 | 3.21 B |
Phi-3.5-mini-instruct.Q8_0 | 3.82 B |
Mistral-7B-Instruct-v0.3.Q8_0 | 7.25 B |
llama-3.2-8b-instruct-q8_0 | 8.03 B |
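As an illustration of how one of the quantized GGUF models above can be queried on a binary question with llama-cpp-python, here is a minimal sketch. The model path and the prompt template are assumptions for illustration, not necessarily the exact setup used in the paper:

```python
try:
    from llama_cpp import Llama  # pip install llama-cpp-python
except ImportError:
    Llama = None  # lets the sketch be read without the library installed

# Hypothetical local path to one of the quantized models listed above.
MODEL_PATH = "models/llama-3.2-1b-instruct-q8_0.gguf"

def ask_true_false(llm, question, passage):
    """Pose a binary question; the prompt wording is an illustrative assumption."""
    prompt = (
        f"Passage: {passage}\n"
        f"Question: {question}\n"
        "Answer with TRUE or FALSE only.\n"
        "Answer:"
    )
    # temperature=0.0 gives greedy decoding, so the verdict is reproducible.
    out = llm(prompt, max_tokens=4, temperature=0.0)
    return out["choices"][0]["text"].strip().upper()

# Usage (requires a downloaded GGUF file):
# llm = Llama(model_path=MODEL_PATH, n_ctx=2048, verbose=False)
# print(ask_true_false(llm, "is the sky blue", "The sky appears blue in daylight."))
```

Restricting the completion to a few tokens and parsing only the leading TRUE/FALSE keeps the comparison across models uniform.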
Dataset: BoolQ (Boolean Questions), https://github.com/google-research-datasets/boolean-questions
- Train set: 9,427 labeled examples.
- Dev set: 3,270 labeled examples.
Python dependencies (pathlib and math ship with the standard library):
- llama-cpp-python
- pathlib
- pandas
- math
- jsonlines
This research was carried out using the computing resources of Wikimedia Switzerland.