# Starter Notebook for vLLM inference

This notebook outlines a very basic inference setup using vLLM. It includes no performance optimizations for faster or more memory-efficient inference, no prompt engineering techniques (e.g. chain of thought), and no model tuning. The goal is a starting point for inference that runs successfully from start to finish.

Our target problem will be solving math problems from the Math QSA dataset.

In [1]:
import glob
import re
from IPython.display import display, Markdown
import pandas as pd
from vllm import LLM, SamplingParams

We'll set some initial parameters for our inference pipeline:
- `n_shots`: the number of solved example problems included in each prompt for few-shot prompting
- `n_samples`: since our goal is merely to see the pipeline run, we'll only run inference on a subset of `n_samples` problems from our available data
- `model_id`: the Hugging Face model we're using for inference

In [2]:
n_shots = 2
n_samples = 100
model_id = "deepseek-ai/deepseek-math-7b-rl"

## Load Math Dataset 

We'll be using [awsaf49/math-qsa-dataset](https://www.kaggle.com/datasets/awsaf49/math-qsa-dataset) as the dataset to run inference on.

In [3]:
# load math-qsa-dataset
# use glob for future scenarios where multiple files are loaded using pattern matching
# in this scenario, it's unnecessary as we're only loading one file
paths = glob.glob("../external-data/math-qsa-dataset/train.csv")  
math_qsa_df = pd.concat([pd.read_csv(path) for path in paths]).reset_index(drop=True)

In [4]:
math_qsa_df.head()

Unnamed: 0,problem,level,type,solution,answer
0,The United States Postal Service charges an ex...,Level 3,Prealgebra,We calculate the desired ratio for each envelo...,3
1,How many integers between 1000 and 2000 have a...,Level 4,Prealgebra,"A number with 15, 20 and 25 as factors must be...",3
2,"Given that $n$ is an integer and $0 < 4n <30$,...",Level 2,Prealgebra,"Dividing by $4$, we have $0<n<7\frac{1}{2}$. T...",28
3,How many integers between $100$ and $150$ have...,Level 4,Prealgebra,We will break up the problem into cases based ...,18
4,Regular pentagon $ABCDE$ and regular hexagon $...,Level 4,Prealgebra,We know that the sum of the degree measures of...,132


In [5]:
math_qsa_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7498 entries, 0 to 7497
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   problem   7498 non-null   object
 1   level     7498 non-null   object
 2   type      7498 non-null   object
 3   solution  7498 non-null   object
 4   answer    7496 non-null   object
dtypes: object(5)
memory usage: 293.0+ KB


## Data Cleaning

We need to do some very basic data cleaning on the math-qsa dataset to remove rows that are missing a parsed answer and trim any excess whitespace from the parsed answers.

In [6]:
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # drop rows with a missing answer; copy so the assignment below
    # modifies a new DataFrame rather than a slice of the original
    df = df.dropna(subset=["answer"]).copy()
    df["answer"] = df["answer"].str.strip()  # trim whitespace
    return df

math_qsa_df = process_data(math_qsa_df)


We also set aside the first `n_shots` problems from the dataset to use as examples for few-shot prompting (in-context learning). Note that we haven't run any experiments to validate the use of few-shot prompting or to optimize the value of `n_shots`.

In [7]:
n_shot_examples = math_qsa_df[:n_shots].copy()
n_shot_examples

Unnamed: 0,problem,level,type,solution,answer
0,The United States Postal Service charges an ex...,Level 3,Prealgebra,We calculate the desired ratio for each envelo...,3
1,How many integers between 1000 and 2000 have a...,Level 4,Prealgebra,"A number with 15, 20 and 25 as factors must be...",3


We'll take a random subset of `n_samples` problems from the remaining rows to test our model inference setup.

In [8]:
inference_df = math_qsa_df[n_shots:].sample(n_samples)
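
The split above can be sketched without pandas: hold out the first `n_shots` rows as few-shot examples and draw the evaluation subset only from the remaining rows, so an example never appears as its own test problem. The `split_indices` helper and the fixed seed are hypothetical additions for illustration. The notebook cell samples without a seed, so its subset changes between runs; in pandas, `DataFrame.sample` accepts a `random_state` argument for the same purpose.

```python
import random


def split_indices(n_rows, n_shots, n_samples, seed=0):
    """Return (few-shot indices, evaluation indices) with no overlap."""
    shot_idx = list(range(n_shots))
    remaining = list(range(n_shots, n_rows))
    rng = random.Random(seed)  # fixed seed (an assumption) makes the draw repeatable
    eval_idx = rng.sample(remaining, n_samples)
    return shot_idx, eval_idx


# 7496 rows remain after dropping the two rows with a missing answer
shot_idx, eval_idx = split_indices(n_rows=7496, n_shots=2, n_samples=100)
print(len(shot_idx), len(eval_idx), set(shot_idx).isdisjoint(eval_idx))
```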

## Prompt Template

Now let's set up our prompt template.

In [9]:
# prompt templating setup
# starting template taken from https://www.kaggle.com/code/awsaf49/aimo-kerasnlp-starter/notebook
# with some minor adjustments e.g. adding few-shot prompting

# TODO: clean up the implementation by using a templating engine like Jinja
# a proper templating engine allows the templating logic to be embedded in the template string
# the current implementation splits the logic between the template strings and the generation functions,
# meaning they need to be updated in tandem with each other

ROLE_TEMPLATE = """Role:
You are an advanced AI system with exceptional mathematical reasoning and problem-solving capabilities, specifically designed to solve tricky math problems (whose answer is a non-negative integer) written in LaTeX format from the AI Mathematical Olympiad (AIMO) competition. Your task is to accurately analyze and solve intricate mathematical problems, demonstrating a deep understanding of mathematical concepts and a strong ability to apply logical reasoning strategies.

Instruction:
1. Carefully read and comprehend the problem statement provided in the "Problem" section.
2. In the "Solution" section, provide a solution of the problem with detailed explanation of your logical reasoning process.
3. Mark your final answer by writing it within the \\boxed{} LaTeX operator.

"""

QUESTION_SOLUTION_TEMPLATE = """Problem:
{problem}

Solution:
{solution}

"""

N_SHOT_EXAMPLES = "".join([
    QUESTION_SOLUTION_TEMPLATE.format(**example.to_dict())
    for _, example in n_shot_examples.iterrows()
])

QUESTION_TEMPLATE = """Problem:
{problem}

Solution:
"""


def generate_ground_truth(instance: pd.Series) -> str:
    instance_str = QUESTION_SOLUTION_TEMPLATE.format(**instance.to_dict())
    return ROLE_TEMPLATE + N_SHOT_EXAMPLES + instance_str

def generate_prompt(instance: pd.Series) -> str:
    instance_str = QUESTION_TEMPLATE.format(**instance.to_dict())
    return ROLE_TEMPLATE + N_SHOT_EXAMPLES + instance_str
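
To make the assembly order concrete, here is a minimal sketch of the same pattern using plain dictionaries instead of DataFrame rows. The names `ROLE`, `SOLVED`, `UNSOLVED`, and `build_prompt`, along with the toy problems, are hypothetical stand-ins for the templates above, not part of the notebook.

```python
# Shortened stand-ins for ROLE_TEMPLATE, QUESTION_SOLUTION_TEMPLATE and QUESTION_TEMPLATE
ROLE = "Role:\nYou are a careful math solver.\n\n"
SOLVED = "Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
UNSOLVED = "Problem:\n{problem}\n\nSolution:\n"


def build_prompt(shots, target):
    """Concatenate role text, solved examples, and the open target problem."""
    shot_text = "".join(SOLVED.format(**shot) for shot in shots)
    return ROLE + shot_text + UNSOLVED.format(**target)


# Toy data (hypothetical); braces inside the values are safe because
# only the template string is parsed by str.format
shots = [
    {"problem": "What is $1+1$?", "solution": "Adding gives $\\boxed{2}$."},
]
# Extra keys such as "answer" are ignored by str.format
target = {"problem": "What is $2+3$?", "answer": "5"}

prompt = build_prompt(shots, target)
print(prompt)
```

The role text is concatenated rather than passed through `str.format`. This matters for a template that contains a literal `\boxed{}`: formatting it would require doubling the braces.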

### Example of rendering prompt and ground truth output

In [10]:
example_ground_truth = generate_ground_truth(instance=inference_df.iloc[0])
display(Markdown(example_ground_truth))

Role:
You are an advanced AI system with exceptional mathematical reasoning and problem-solving capabilities, specifically designed to solve tricky math problems (whose answer is a non-negative integer) written in LaTeX format from the AI Mathematical Olympiad (AIMO) competition. Your task is to accurately analyze and solve intricate mathematical problems, demonstrating a deep understanding of mathematical concepts and a strong ability to apply logical reasoning strategies.

Instruction:
1. Carefully read and comprehend the problem statement provided in the "Problem" section.
2. In the "Solution" section, provide a solution of the problem with detailed explanation of your logical reasoning process.
3. Mark your final answer by writing it within the \boxed{} LaTeX operator.

Problem:
The United States Postal Service charges an extra $\$0.11$ in postage if the length of an envelope, in inches, divided by its height, in inches, is less than $1.3$ or greater than $2.5.$ For how many of these four envelopes must the extra $\$0.11$ in postage be paid? \begin{tabular}[t]{ccc}
Envelope & Length in inches & Height in inches\\\hline
A &6 &4\\
B &9 &3\\
C &6 &6\\
D &11 &4
\end{tabular}

Solution:
We calculate the desired ratio for each envelope: \begin{align*}
\text{A} &= \frac{6}{4} = 1.5 \\
\text{B} &= \frac{9}{3} = 3 \\
\text{C} &= \frac{6}{6} = 1 \\
\text{D} &= \frac{11}{4} = 2.75
\end{align*} $\text B,$ $\text C,$ and $\text D$ are out of range, so the answer is $\boxed{3}.$

Problem:
How many integers between 1000 and 2000 have all three of the numbers 15, 20 and 25 as factors?

Solution:
A number with 15, 20 and 25 as factors must be divisible by their least common multiple (LCM).  Because $15 = 3
\times 5$, $20 = 2^2 \times 5$, and $25 = 5^2$, the LCM of 15, 20 and 25 is $2^2 \times 3 \times 5^2 = 300$. There are $\boxed{3}$ multiples of 300 between 1000 and 2000: 1200, 1500 and 1800.

Problem:
For what digit $d$ is the five-digit number $2345d$ a multiple of 9?

Solution:
In order for a number to be a multiple of 9, the sum of its digits must be divisible by 9. Since $2+3+4+5=14$, the only single digit that will make the sum a multiple of 9 is $4$. The sum of the digits would be $18$, which is $9\cdot 2$, so $d=\boxed{4}$.



### Example of prompt created for inference

In [11]:
example_prompt = generate_prompt(instance=inference_df.iloc[0])
display(Markdown(example_prompt))

Role:
You are an advanced AI system with exceptional mathematical reasoning and problem-solving capabilities, specifically designed to solve tricky math problems (whose answer is a non-negative integer) written in LaTeX format from the AI Mathematical Olympiad (AIMO) competition. Your task is to accurately analyze and solve intricate mathematical problems, demonstrating a deep understanding of mathematical concepts and a strong ability to apply logical reasoning strategies.

Instruction:
1. Carefully read and comprehend the problem statement provided in the "Problem" section.
2. In the "Solution" section, provide a solution of the problem with detailed explanation of your logical reasoning process.
3. Mark your final answer by writing it within the \boxed{} LaTeX operator.

Problem:
The United States Postal Service charges an extra $\$0.11$ in postage if the length of an envelope, in inches, divided by its height, in inches, is less than $1.3$ or greater than $2.5.$ For how many of these four envelopes must the extra $\$0.11$ in postage be paid? \begin{tabular}[t]{ccc}
Envelope & Length in inches & Height in inches\\\hline
A &6 &4\\
B &9 &3\\
C &6 &6\\
D &11 &4
\end{tabular}

Solution:
We calculate the desired ratio for each envelope: \begin{align*}
\text{A} &= \frac{6}{4} = 1.5 \\
\text{B} &= \frac{9}{3} = 3 \\
\text{C} &= \frac{6}{6} = 1 \\
\text{D} &= \frac{11}{4} = 2.75
\end{align*} $\text B,$ $\text C,$ and $\text D$ are out of range, so the answer is $\boxed{3}.$

Problem:
How many integers between 1000 and 2000 have all three of the numbers 15, 20 and 25 as factors?

Solution:
A number with 15, 20 and 25 as factors must be divisible by their least common multiple (LCM).  Because $15 = 3
\times 5$, $20 = 2^2 \times 5$, and $25 = 5^2$, the LCM of 15, 20 and 25 is $2^2 \times 3 \times 5^2 = 300$. There are $\boxed{3}$ multiples of 300 between 1000 and 2000: 1200, 1500 and 1800.

Problem:
For what digit $d$ is the five-digit number $2345d$ a multiple of 9?

Solution:


## Run inference

We'll load the pre-trained model, run inference on our prompts and store the generated results.

In [12]:
# Generate batch of prompts
inference_df["prompt"] = inference_df.apply(generate_prompt, axis=1)
prompts = inference_df["prompt"].to_list()

In [13]:
# load pre-trained model
llm = LLM(model=model_id, trust_remote_code=True)

INFO 11-12 10:24:14 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='deepseek-ai/deepseek-math-7b-rl', speculative_config=None, tokenizer='deepseek-ai/deepseek-math-7b-rl', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=deepseek-ai/deepseek-math-7b-rl, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_o

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 11-12 10:24:17 model_runner.py:1067] Loading model weights took 12.8725 GB
INFO 11-12 10:24:18 gpu_executor.py:122] # GPU blocks: 980, # CPU blocks: 546
INFO 11-12 10:24:18 gpu_executor.py:126] Maximum concurrency for 4096 tokens per request: 3.83x
INFO 11-12 10:24:20 model_runner.py:1395] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 11-12 10:24:20 model_runner.py:1399] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 11-12 10:24:30 model_runner.py:1523] Graph capturing finished in 10 secs.


In [14]:
# run inference
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=500)
outputs = llm.generate(prompts, sampling_params)

Processed prompts: 100%|█| 100/100 [01:11<00:00,  1.40it/s, est. speed input: 971.56 toks/s, output: 385.89 


In [15]:
# store generated solutions
generated_text = [output.outputs[0].text for output in outputs]
inference_df["generated_solution"] = generated_text

## Extract answers from generated solution text

Our ground truth solutions and few-shot examples give the final answer inside the `\boxed{<enclosed>}` LaTeX command. We expect our generated text to follow the same format, so we need to parse each generated solution to extract the final answer.

In [16]:
# extract answers from generated solutions
from typing import Optional

def extract_answer(text: str) -> Optional[str]:
    """Given an input text, matches the first \\boxed{<enclosed>} and returns <enclosed>.

    Returns None when the text contains no \\boxed{ command.
    """
    marker = "\\boxed{"
    start = text.find(marker)
    if start < 0:
        return None

    answer = ""
    open_brace_count = 1  # the brace that opens \boxed{ is already open
    for c in text[start + len(marker):]:
        if c == "{":
            open_brace_count += 1
        elif c == "}":
            open_brace_count -= 1
            if open_brace_count == 0:
                break
        answer += c
    return answer.strip()
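
The brace counting matters for answers that contain nested LaTeX groups such as `\frac{1}{2}`. The sketch below is a self-contained variant, under the hypothetical name `extract_boxed`, with a few spot checks. It makes one deliberate change from the cell above: it returns `None` when the text ends before the closing brace (for example, when generation is cut off), instead of returning the partial contents.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the first \\boxed{...}, or None if absent or unclosed."""
    marker = "\\boxed{"
    start = text.find(marker)
    if start < 0:
        return None

    depth = 1  # the brace that opens \boxed{ is already open
    chars = []
    for c in text[start + len(marker):]:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(c)
    return None  # ran out of text before the closing brace


print(extract_boxed("so $d=\\boxed{4}$."))                    # simple answer
print(extract_boxed("the ratio is $\\boxed{\\frac{1}{2}}$"))  # nested braces
print(extract_boxed("no final answer here"))                  # no \boxed command
print(extract_boxed("cut off at $\\boxed{\\frac{1}{2"))       # unclosed brace
```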

Let's test our answer extraction logic against the ground truth solutions to confirm that it works.

In [17]:
# test extraction logic on training dataset
extracted_answers = inference_df["solution"].apply(extract_answer)
matched_answers = extracted_answers == inference_df["answer"]

print(f"Found {sum(matched_answers)} matches and {sum(~matched_answers)} mismatches")
print("Mismatches:")
inference_df[~matched_answers]

Found 100 matches and 0 mismatches
Mismatches:


Unnamed: 0,problem,level,type,solution,answer,prompt,generated_solution


Now let's parse our generated solutions and calculate our accuracy against the ground truth answers.

In [18]:
inference_df["generated_answer"] = inference_df["generated_solution"].apply(extract_answer)

In [19]:
# calculate accuracy: number of correct generated answers / number of problems with a ground truth answer
accuracy = inference_df.pipe(lambda x: (x["answer"] == x["generated_answer"]).sum() / x["answer"].count())
print(f"{accuracy=}")

accuracy=0.25


For what proportion of our prompts were we unable to extract an answer?

In [20]:
missing_answer = inference_df.pipe(lambda x: 1 - x["generated_answer"].count() / x["answer"].count())
print(f"{missing_answer=}")

missing_answer=0.62


For 62% of our prompts we could not extract an answer: the generated solution either does not provide a final answer or fails to use the ``\boxed`` command. This seems like low-hanging fruit for future improvement.
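
As a quick worked calculation, the two reported figures can be combined to separate extraction failures from wrong answers. This is a sketch using only the numbers printed above (`accuracy=0.25`, `missing_answer=0.62`, 100 sampled problems). It relies on the fact that a missing generated answer never compares equal to the ground truth, so every correct answer is also an extracted one.

```python
n_total = 100          # n_samples
accuracy = 0.25        # reported above
missing_answer = 0.62  # reported above

n_correct = round(accuracy * n_total)
n_missing = round(missing_answer * n_total)
n_extracted = n_total - n_missing
n_wrong = n_extracted - n_correct  # extracted, but not matching the ground truth

# accuracy restricted to prompts where an answer could be extracted
accuracy_when_extracted = n_correct / n_extracted
print(n_correct, n_wrong, n_missing, round(accuracy_when_extracted, 3))
# prints: 25 13 62 0.658
```

In this run, roughly two thirds of the extracted answers are correct, so the extraction failures account for more of the lost accuracy than wrong answers do.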