# Tutorial 2: Verbalized Distribution

There is growing evidence that asking an LLM to verbalize a full distribution over the answer options can perform well on multiple-choice questions. QSTN supports this option out of the box. We demonstrate it with a simple example: asking a model to predict the 2024 US election.

## Setting up the Prompt

In [1]:
from qstn.prompt_builder import LLMPrompt
from qstn.utilities import placeholder

import pandas as pd

system_prompt = "You are an expert political analyst."

# We can add any state election we want to predict here.
elections_to_predict = [
    "2024 US Presidential Election",
    "2024 United States presidential election in Illinois",
]

# The placeholders automatically define at which point of the prompt the questions are asked.
formatted_tasks = [
    f"Please predict the outcome of the {election}. {placeholder.PROMPT_OPTIONS} {placeholder.PROMPT_AUTOMATIC_OUTPUT_INSTRUCTIONS} {placeholder.PROMPT_QUESTIONS}"
    for election in elections_to_predict
]

# If we want to ask multiple questions, we can define them here or load them from a CSV file.
questionnaire = pd.DataFrame(
    [{"questionnaire_item_id": 1, "question_content": "Percentage of each Candidate"}]
)

interviews: list[LLMPrompt] = []

# Each LLMPrompt gets the system prompt plus a task instruction that is kept separate from the system prompt.
# We also set a seed for reproducibility.
for task, election in zip(formatted_tasks, elections_to_predict):
    interviews.append(
        LLMPrompt(
            questionnaire_source=questionnaire,
            questionnaire_name=election,
            system_prompt=system_prompt,
            prompt=task,
            seed=42,
        )
    )
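
The comment in the cell above mentions keeping the questions in a CSV file. A minimal sketch of that idea, assuming the file uses the same two columns as the inline DataFrame (`questionnaire_item_id`, `question_content`). The second question and the in-memory CSV text are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = """questionnaire_item_id,question_content
1,Percentage of each Candidate
2,Percentage of each Candidate among first-time voters
"""

# Read the CSV into the same column layout as the inline DataFrame.
questionnaire_from_csv = pd.read_csv(io.StringIO(csv_text))

print(questionnaire_from_csv)
```

The resulting DataFrame could then be passed as `questionnaire_source` in place of the inline one.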


## Using Verbalized Distribution

To get valid verbalized-distribution output from our model, we need to do two things:

1. Define the Response Generation Method.


In [2]:
from qstn.inference.response_generation import JSONVerbalizedDistribution

# We can also adjust the automatic template to our liking.
# If we don't want an automatic template, we can simply leave its placeholder out of the prompt.
response_generation_method = JSONVerbalizedDistribution(
    output_template="Respond only in JSON format, where the keys are the names of the candidates and the values are the percentage of votes the candidate achieves.",
    output_index_only=False, # If we want to save tokens we can output only the index of our answer
)

2. Define the options the LLM can choose from when responding. For now we pick five candidates who still had a realistic chance as of Llama's pretraining cutoff.

In [3]:
from qstn.prompt_builder import generate_likert_options

# Our five most likely candidates and how they are presented to the model
options = generate_likert_options(
    n=5,
    answer_texts=["Biden", "Trump", "Harris", "DeSantis", "Kennedy"],
    response_generation_method=response_generation_method,
    list_prompt_template="The candidates are {options}.", # Our automatic Option Prompt
)

Finally we have to prepare the prompt with all the options that we defined:

In [4]:
for interview in interviews:
    interview.prepare_prompt(
        question_stem=f"Please predict the {placeholder.QUESTION_CONTENT} now. The percentage of each candidate should add up to 100%.",
        answer_options=options,
        randomized_item_order=True, # We can easily randomize the options
    )

And look at the whole prompt:

In [5]:
system_prompt, prompt = interviews[0].get_prompt_for_questionnaire_type()

print(f"System Prompt: {system_prompt}")
print(f"Prompt: {prompt}")

System Prompt: You are an expert political analyst.
Prompt: Please predict the outcome of the 2024 US Presidential Election. The candidates are 1: Biden, 2: Trump, 3: Harris, 4: DeSantis, 5: Kennedy. Respond only in JSON format, where the keys are the names of the candidates and the values are the percentage of votes the candidate achieves. Please predict the Percentage of each Candidate now. The percentage of each candidate should add up to 100%.
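
Before running inference, it can help to see what a well-formed reply would look like and how one might check it. The sketch below uses only the standard library. The JSON string is an invented example reply, not model output, and `check_distribution` is a hypothetical helper, not part of QSTN:

```python
import json
import math

CANDIDATES = ["Biden", "Trump", "Harris", "DeSantis", "Kennedy"]


def check_distribution(raw_reply: str, expected_keys: list[str]) -> dict[str, float]:
    """Parse a JSON reply and verify it is a distribution over expected_keys."""
    parsed = json.loads(raw_reply)

    # Every expected candidate must appear as a key.
    missing = set(expected_keys) - set(parsed)
    if missing:
        raise ValueError(f"Missing candidates: {sorted(missing)}")

    values = {key: float(parsed[key]) for key in expected_keys}
    if any(value < 0 for value in values.values()):
        raise ValueError("Percentages must be non-negative")

    # Allow a small tolerance for rounding in the model's reply.
    total = sum(values.values())
    if not math.isclose(total, 100.0, abs_tol=0.5):
        raise ValueError(f"Percentages sum to {total}, expected 100")
    return values


# Invented reply, for illustration only.
example_reply = '{"Biden": 30, "Trump": 40, "Harris": 20, "DeSantis": 5, "Kennedy": 5}'
distribution = check_distribution(example_reply, CANDIDATES)
print(distribution)
```

The tolerance of 0.5 percentage points is an arbitrary choice. Small models often return values that do not add up exactly to 100.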


And we can run inference:

In [6]:
from vllm import LLM

# First we create the model
model = LLM("meta-llama/Llama-3.2-3B-Instruct", max_model_len=1000)

INFO 11-28 19:46:35 [utils.py:253] non-default args: {'max_model_len': 1000, 'disable_log_stats': True, 'model': 'meta-llama/Llama-3.2-3B-Instruct'}


INFO 11-28 19:46:36 [model.py:631] Resolved architecture: LlamaForCausalLM
INFO 11-28 19:46:36 [model.py:1745] Using max model len 1000


2025-11-28 19:46:36,280	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 11-28 19:46:36 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:37 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='meta-llama/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidd

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.57it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.42it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.23it/s]

(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:40 [default_loader.py:314] Loading weights took 0.92 seconds
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:40 [gpu_model_runner.py:3338] Model loading took 6.0160 GiB memory and 1.654885 seconds
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:43 [backends.py:631] Using cache directory: /home/maxi/.cache/vllm/torch_compile_cache/4bdcebe47f/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:43 [backends.py:647] Dynamo bytecode transform time: 3.18 s
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:45 [backends.py:210] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.134 s
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:46 [monitor.py:34] torch.compile takes 4.31 s in total
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:47 [gpu_worker.py:359] Available KV cache memory: 6.87 GiB
(EngineCore_

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:02<00:00, 25.25it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:01<00:00, 31.20it/s]

(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:51 [gpu_model_runner.py:4244] Graph capturing finished in 4 secs, took 0.50 GiB
(EngineCore_DP0 pid=3066459) INFO 11-28 19:46:51 [core.py:250] init engine (profile, create kv cache, warmup model) took 11.15 seconds
INFO 11-28 19:46:53 [llm.py:352] Supported tasks: ['generate']


In [12]:
from qstn.survey_manager import conduct_survey_single_item
# Second we run inference
results = conduct_survey_single_item(
    model,
    llm_prompts=interviews,
    max_tokens=500,
    seed=42,
)

Adding requests: 100%|██████████| 2/2 [00:00<00:00, 1083.38it/s]
Processed prompts: 100%|██████████| 2/2 [00:01<00:00,  1.77it/s, est. speed input: 238.77 toks/s, output: 97.63 toks/s]
Processing questionnaires: 100%|██████████| 1/1 [00:01<00:00,  1.14s/it]


## Parsing Output

We can easily parse the output now, as it is in JSON format.

In [13]:
from qstn import parser

parsed_response = parser.parse_json(results)

We get one DataFrame for each of our interviews.

In [14]:
df = parsed_response[interviews[0]]
df2 = parsed_response[interviews[1]]

df

| | questionnaire_item_id | question | 1: Biden | 2: Trump | 3: Harris | 4: DeSantis | 5: Kennedy |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Please predict the Percentage of each Candidat... | 25 | 40 | 0 | 30 | 5 |


We can also combine the answers from both interviews into a single DataFrame.

In [16]:
from qstn.utilities import create_one_dataframe

df_complete = create_one_dataframe(parsed_response)
df_complete

| | questionnaire_name | questionnaire_item_id | question | 1: Biden | 2: Trump | 3: Harris | 4: DeSantis | 5: Kennedy |
|---|---|---|---|---|---|---|---|---|
| 0 | 2024 US Presidential Election | 1 | Please predict the Percentage of each Candidat... | 25.0 | 40.0 | 0.0 | 30.0 | 5.0 |
| 1 | 2024 United States presidential election in Il... | 1 | Please predict the Percentage of each Candidat... | 42.5 | 35.8 | 12.5 | 8.9 | 0.3 |
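
Since the prompt asks for percentages that add up to 100, a quick sanity check on the combined result is often worthwhile. A minimal sketch, assuming the candidate columns are named as in the output above. The DataFrame is rebuilt by hand from those values so that the snippet runs on its own, and the 0.5 tolerance is an arbitrary choice:

```python
import pandas as pd

candidate_columns = ["1: Biden", "2: Trump", "3: Harris", "4: DeSantis", "5: Kennedy"]

# Values copied from the combined DataFrame shown above.
df_check = pd.DataFrame(
    [[25.0, 40.0, 0.0, 30.0, 5.0], [42.5, 35.8, 12.5, 8.9, 0.3]],
    columns=candidate_columns,
)

# Sum per row; a valid verbalized distribution should be close to 100.
row_sums = df_check[candidate_columns].sum(axis=1)
is_valid = (row_sums - 100).abs() < 0.5

# Rescale each row to sum to 100 in case of small deviations.
normalized = df_check[candidate_columns].div(row_sums, axis=0) * 100

# Candidate with the highest predicted share in each row.
top_candidate = df_check[candidate_columns].idxmax(axis=1)

print(row_sums.tolist(), is_valid.tolist(), top_candidate.tolist())
```

With real results, the same steps could be applied directly to `df_complete` instead of the hand-built `df_check`.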
