This is the code for the paper *Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs*. It contains all files needed to reproduce the figures in the main paper, as well as raw results CSVs in `results/`. Please direct any questions and/or correspondence to aaditya.singh.21@ucl.ac.uk.
Overall, the codebase has a few key components:
- All sampling is seeded through JAX for maximum reproducibility.
- The code is written in a functional style -- the idea is that the functions in `main.py` can be used to construct various experiments. Each experiment then gets its own `run_*.py` file, which imports everything from `main`. We made this choice so that runs are easily reproducible, with no risk of forgotten configs.
- `sample_problem_with_lengths` supports sampling addition problems of a given length signature (e.g., `lengths = [7,7,8]` means a 7-digit + 7-digit -> 8-digit problem).
- We build samplers on top of this that sample problems with various constraints in cyclic order:
  - `make_all_digit_length_controlled_sample_problem` cycles through a list of length constraints of dimension 3 (so both addends and the answer are constrained). This was used for our later experiments once we realized answer length matters a lot.
  - `make_digit_controlled_sample_problem` accepts a `min_dig` count and `max_dig` count for each of the 2 addends. It then cycles through all combinations (e.g., if `min_dig=7, max_dig=9`, this would be 9 possible combinations) and generates problems.
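As a rough sketch of the cycling described above (the helper name here is illustrative, not the repo's function), the addend-length combinations can be enumerated with `itertools.product`:

```python
from itertools import product

def digit_length_combinations(min_dig, max_dig):
    """Hypothetical sketch: all (len1, len2) digit-length pairs
    for the two addends, cycled through in order."""
    lengths = range(min_dig, max_dig + 1)
    return list(product(lengths, repeat=2))

combos = digit_length_combinations(7, 9)
# 3 choices per addend -> 9 combinations, matching the example above
```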
- We build corresponding samplers to sample shots for each of these problem samplers. There are a few variants, for the different experiment settings:
  - `make_all_digit_length_same_sample_shots` samples shots with the exact same digit lengths as a given problem. Addends and answer are constrained (analogous to `make_all_digit_length_controlled_sample_problem`).
  - `make_digit_same_sample_shots` is similar to `make_all_digit_length_same_sample_shots`, but only uses input addend lengths. Thus, it corresponds to `make_digit_controlled_sample_problem`.
- Next, we have various functions that format numbers and addition problems:
  - `chunk_with_separator` breaks up a number either L2R or R2L using a separator. For example, `comma_r2l` can be achieved with `chunk_with_separator(str(num), 3, 'r2l', ',')`.
  - `make_format_number` builds a format function (maps number -> str) which wraps around `chunk_with_separator` for the conditions we care about. See `SEPARATOR_NAMES` for the separators that are supported by `make_format_number`.
  - `make_format_addition_problem` builds a problem formatter (maps string1, string2 -> problem string) and takes into account the relevant problem formatting options.
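To make the chunking behavior concrete, here is a minimal sketch of what an R2L/L2R chunker could look like (the repo's actual `chunk_with_separator` implementation may differ in details):

```python
def chunk_with_separator(s, chunk_size, direction, sep):
    """Sketch: break the digit string s into chunks of chunk_size,
    grouping from the right ('r2l') or the left ('l2r')."""
    if direction == 'r2l':
        # Walk from the right so any leftover (short) chunk ends up first
        chunks = [s[max(0, i - chunk_size):i] for i in range(len(s), 0, -chunk_size)]
        chunks.reverse()
    else:  # 'l2r'
        chunks = [s[i:i + chunk_size] for i in range(0, len(s), chunk_size)]
    return sep.join(chunks)

chunk_with_separator(str(1234567), 3, 'r2l', ',')  # '1,234,567' (standard comma grouping)
chunk_with_separator(str(1234567), 3, 'l2r', ',')  # '123,456,7'
```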
- Finally, a few helpers for querying models and recording results:
  - `make_safe_api_call` builds a safe API call function and wraps the retry logic/parameters.
  - `dummy_api_call_fn` is a function with the same signature (input and output) as safe API call fns, except it doesn't actually query a model. Useful for confirming experiment setups without wasting time/money.
  - `stringify_shot`/`destringify_shot` are used for storing shot information in CSVs -- never used for actual model querying.
  - `jsonify_logprobs` similarly stores a string version of the response logprobs in CSVs.
  - `run_addition_experiment` is a wrapper function that accepts various first-order functions and parameters and actually conducts an experiment. See its docstring for more information.
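The retry-wrapping and dry-run pattern might look roughly like this (parameter names and the canned response shape are illustrative assumptions, not the repo's exact signatures):

```python
import random
import time

def make_safe_api_call(api_call_fn, max_retries=3, base_delay=1.0):
    """Sketch: wrap an API call with retries and exponential backoff."""
    def safe_call(**kwargs):
        for attempt in range(max_retries):
            try:
                return api_call_fn(**kwargs)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                # Back off (with jitter) before retrying
                time.sleep(base_delay * (2 ** attempt) * random.random())
    return safe_call

def dummy_api_call_fn(**kwargs):
    # Same call shape as a real API fn, but returns a canned response --
    # handy for dry-running an experiment setup without spending money
    return {"response": "0", "logprobs": None}

call = make_safe_api_call(dummy_api_call_fn)
call(prompt="1+1=")  # returns the canned dict without querying a model
```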
These files were used to generate the individual CSVs. Each run results in a dated CSV file with all the necessary info placed in `results/{date}_{name}.csv`. Some files were used to create multiple CSVs, since we parallelized over models (for example). Some files are also a bit older, so they haven't been updated to support the new OpenAI API -- the `*_new_models.py` files have the correct syntax for this updated API.
Below, we detail all the columns that'll exist in these data frames as they're saved:
- `problem_index`: Identifier for the specific problem -- useful for `groupby` if we want to compare different formats/conditions/shots etc. on the same problem.
- `flip_problem`: Whether or not the problem was flipped (e.g. if it's written as a+b, was it fed in as a+b or b+a). Note, if `flip_problem` is True, the saved `num1, num2` will be flipped given the index as well.
- `flip_shots`: same as `flip_problem`, but for shots.
- `num1, num2`: the addends for the problem.
- `problem_msg`: The actual query string used (will take into account delimiter, thinking tokens, etc.).
- `problem_tokens`: Length of `problem_msg` in tokens.
- `response_repeat`: Should basically always be 0 for our current experiments. This tells us which index of the response it is, if we use `n > 1` when sampling from the API (which I don't think we ever ended up doing).
- `response`: The actual raw model response string.
- `logprobs_json`: A jsonified version of the top 5 tokens + logprobs for each output token. Created from the OpenAI API response logprobs using `main.py:jsonify_logprobs`.
- `prompt_tokens, completion_tokens`: The total number of tokens in the prompt + completion, directly from the response message; used for calculating costs post hoc if needed, checking whether max length (which I think we default to 10) was hit, etc.
- `shots`: A comma-separated string with all relevant shot information (output from `main.py:stringify_shot` run on each shot). `main.py:destringify_shot` can be used to recover shots.
- `model`: Model that was run on, OpenAI identifier.
- `system_prompt`: The actual system prompt used. Most notebooks convert this to `['custom', 'original']` in the field `system_prompt_type` -- see below.
- `problem_number_format`: Format string used for the numbers in the problem. Can be passed to `main.py:make_format_number` to get a format function.
- `answer_number_format`: Format string. Can be passed to `main.py:make_format_number` to get a format function.
- `repeat_style`: What format things were repeated in. Defaults to None (since most experiments don't involve repeats).
- `num_spaces`: How many spaces in between operators. This was used for `nosep+1, nosep+2` conditions for thinking token control.
- `num_trailing_spaces`: How many spaces after the `=` sign. This was used for `nosep+2` to add another token.
- `num_shots`: How many shots were used.
- `problem_rng_key`: Original RNG seed used to run this experiment. In combination with `problem_index`, this should exactly reproduce experiments.
- `shots_rng_key`: Original RNG seed used to get shots for this experiment. In combination with the exact problems, shot function, and `problem_index`, should reproduce the shots used.
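As a small illustration of the `groupby` use mentioned above (the data here is made up purely for the example), one can compare conditions on the same problems once the CSVs are loaded into pandas:

```python
import pandas as pd

# Hypothetical mini-results frame with a few of the columns described above
df = pd.DataFrame({
    "problem_index": [0, 0, 1, 1],
    "problem_number_format": ["comma_r2l", "nosep", "comma_r2l", "nosep"],
    "answer_match": [True, False, True, True],
})

# Average accuracy per format, over the same set of problems
acc_by_format = df.groupby("problem_number_format")["answer_match"].mean()
```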
`paper_figures.ipynb` can be used to reproduce all figures in our paper. `utils.py` contains all our error analysis code and is heavily used by `paper_figures.ipynb`. Our analysis pipeline consists of a post-processing step (relying on `utils.post_process_df`, which relies on `utils.post_process_row`) and then an error analysis step (`utils.add_error_analysis`, which relies on `analyze_row_error`). The error analysis step assumes a tokenization direction, so care should be taken when passing dataframes containing both L2R and R2L tokenized rows. Examples can be found in `paper_figures.ipynb`.
Post-processing, columns added:
- Facts about the problem:
  - `true_answer`: Correct answer, `num1+num2`.
  - `formatted_true_answer`: answer format function applied to `true_answer`, useful for checking exact match of format and answer.
  - `len1, len2`: Digit lengths of `num1, num2`.
  - `len_true_ans`: Digit length of `true_answer`.
  - `len_match_both`: True if both `len1, len2` equal `len_true_ans`.
  - `len_match`: True if either of `len1, len2` equals `len_true_ans`.
  - `num_carries`: How many total carries there were in the problem.
  - `max_carry_chain`: The length of the longest chain of carries.
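The carry statistics above can be computed by walking the addition right-to-left; this is a sketch (the function name is illustrative, not the repo's):

```python
def carry_stats(num1, num2):
    """Sketch: count total carries and the longest consecutive
    chain of carries when computing num1 + num2 by hand."""
    d1, d2 = str(num1)[::-1], str(num2)[::-1]  # digits, least significant first
    carry, num_carries, chain, max_chain = 0, 0, 0, 0
    for i in range(max(len(d1), len(d2))):
        a = int(d1[i]) if i < len(d1) else 0
        b = int(d2[i]) if i < len(d2) else 0
        carry = (a + b + carry) // 10
        if carry:
            num_carries += 1
            chain += 1
            max_chain = max(max_chain, chain)
        else:
            chain = 0  # chain of consecutive carries is broken
    return num_carries, max_chain

carry_stats(58, 67)  # (2, 2): both columns carry, forming one chain
carry_stats(19, 1)   # (1, 1): a single carry from the units column
```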
- Repeat related:
  - `repeat_problem_number_format`: if `repeat_style` is in the df, this will be set to the correct format (this step was needed to convert when `repeat_style = 'same'`).
  - `repeat_valid`: Check if the response contains a repeat with two addends.
  - `repeat_detailed_info`: A list of two dictionaries with raw processing of repeats. These individual variables do get distilled into the rest of the dataset entries; they are just included here for clarity. Contains the following fields for each repeat:
    - `padded`: True if the model added spaces next to this number.
    - `extracted_formatted`: String repeat including formatting.
    - `extraneous`: True if the model added other text to the number (beyond the expected formatting).
    - `number`: An int with the actual repeated number.
    - `constructed_formatted`: The properly formatted number given the extraction.
    - `correct_copy`: True if the number was correct (independent of format).
    - `correct_format`: True if the correct formatting was used (independent of number being correct!).
  - `repeat_padded`: True if either repeat contained padding.
  - `repeat_extraneous`: True if either repeat added other text to the number (beyond the expected formatting).
  - `repeat_num1, repeat_num2`: The extracted repeated numbers.
  - `repeat_copy_num1_match, repeat_copy_num2_match`: Whether or not the extracted repeated numbers matched the correct numbers.
  - `repeat_copy_match`: True if both numbers matched (divorced from format).
  - `repeat_format_match`: True if both repeats used the correct format (again, divorced from number correctness).
- Answer related:
  - `padded_answer`: True if the answer contained extraneous padding.
  - `extracted_formatted_answer`: String of the model response, including formatting. None if not valid.
  - `valid_answer`: True if we were able to extract an answer. Typically False when the model starts using text.
  - `answer`: int of the model response, extracted by removing all formatting.
  - `constructed_formatted_answer`: model answer, reformatted to use the expected formatting (based on `answer_number_format`).
  - `extraneous_response`: True if there was extraneous text in the model response.
  - `format_match`: True if the model answered in the right format (even if it got the wrong answer).
  - `answer_match`: True if the model answered correctly (even if in the wrong format).
  - `formatted_answer_match`: True if both `format_match` and `answer_match`.
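A minimal sketch of the value-vs-format separation above, assuming comma/space formatting (`analyze_answer` and `fmt` are illustrative stand-ins, not the repo's functions):

```python
def analyze_answer(response, true_answer, fmt):
    """Sketch: extract an int by stripping formatting, then check
    the value and the formatting independently.
    fmt maps int -> formatted string (stands in for the answer
    format function)."""
    stripped = response.strip().replace(",", "").replace(" ", "")
    if not stripped.isdigit():
        return {"valid_answer": False}
    answer = int(stripped)
    return {
        "valid_answer": True,
        "answer": answer,
        "answer_match": answer == true_answer,            # right value, any format
        "format_match": response.strip() == fmt(answer),  # right format, any value
    }

fmt = lambda n: f"{n:,}"  # comma-separated formatting, as an example
analyze_answer("1,240", 1240, fmt)  # both answer_match and format_match are True
```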
Error analysis, columns added:
- `len_ans`: The length of the model answer.
- `correct_ans_len`: Checks if the model answered with a number of the correct length. Useful since a lot of the error analysis won't make sense if this is false.
- `abs_error_mag`: abs(true_answer - answer). Useful for checking off-by-one errors quickly (since just looking at digit error magnitudes can be misleading due to carries!).
- `only_off_by_one`: True if `abs_error_mag` is a power of 10.
- `only_off_by_one_digit_num`: If `only_off_by_one` is true, this is the number of the digit, counting from the left, that the off-by-one error is at.
- `digit_{1-L}_error_mag`: Error magnitude at a given digit, left-to-right. For example, the error between 1239 and 1240 would be `[0, 0, 1, 9]`.
- `digit_{1-L}_error`: Boolean indicating whether there's an error at this digit.
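The digit-level columns above could be computed along these lines (`digit_error_mags` and `only_off_by_one` are illustrative names, not the repo's functions); note how a carry makes the digit errors look large even for an off-by-one answer:

```python
def digit_error_mags(true_answer, answer):
    """Sketch: per-digit absolute error, left-to-right,
    assuming equal-length answers."""
    return [abs(int(x) - int(y)) for x, y in zip(str(true_answer), str(answer))]

def only_off_by_one(true_answer, answer):
    """Sketch: True iff abs(true_answer - answer) is a power of 10."""
    mag = abs(true_answer - answer)
    return mag > 0 and str(mag) == "1" + "0" * (len(str(mag)) - 1)

digit_error_mags(1239, 1240)  # [0, 0, 1, 9] -- large digit errors due to the carry
only_off_by_one(1239, 1240)   # True -- the absolute error is just 1
```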
- `digit_{i}_within_{t}token_error_mag`: Same as the earlier error mag, but distinguished into tokens. Note here that `tokenize_dir` defaults to `l2r`.
- `digit_{i}_within_{t}token_error`: Same as the earlier error boolean, but distinguished into tokens. Note here that `tokenize_dir` defaults to `l2r`.
- `token_{i}_error`: True if there's an error in the given token.
- Token frequency-related error checking:
  - `token_{i}_error_more_frequent`: Can be True, False or NaN/None. If there's no error in this token, will be NaN/None. If the error made makes it such that the model response is a more frequent token (per `cl100k_base` merge ranks), then True. Otherwise False.
  - `token_{i}_match_most_frequent_off_by_one_last_digit`: Whether or not the most frequent token out of this token and this token +/- 1 was chosen by the model (so this should on average be 1/3). This is independent of whether or not the model got the problem right, but it can be combined with accuracy to get some cool stats.
  - `token_{i}_match_most_frequent_off_first_digit`: Same as above, except it compares all possible digit substitutions (so this should on average be 1/10) for the first digit of the token.
  - `token_{i}_picked_rank_off_by_one_last_digit`: The rank (either 0, 1, or 2) of the model response out of the true answer token, and the true answer token +/- 1. If the model response is not in any of these, returns None.
  - `token_{i}_picked_rank_off_first_digit`: The rank (in 0..9 inclusive) of the model response out of the true answer token with any first digit. If the model response is not in any of these, returns None.
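To show how digit errors get grouped into tokens, here is a sketch assuming 3-digit number tokens and the `l2r`/`r2l` chunking directions mentioned above (`token_errors` is an illustrative name, not the repo's function):

```python
def token_errors(true_answer, answer, tokenize_dir="l2r", chunk=3):
    """Sketch: split equal-length answers into (up to) 3-digit tokens
    and flag which tokens differ."""
    def tokens(s):
        if tokenize_dir == "l2r":
            return [s[i:i + chunk] for i in range(0, len(s), chunk)]
        # r2l: any leftover short token goes first
        ts = [s[max(0, i - chunk):i] for i in range(len(s), 0, -chunk)]
        return ts[::-1]
    t, a = tokens(str(true_answer)), tokens(str(answer))
    return [x != y for x, y in zip(t, a)]

token_errors(1234567, 1234067)  # l2r tokens 123|456|7 vs 123|406|7 -> [False, True, False]
```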
- Logprobs related metrics:
  - `valid_logprobs_metrics`: True if we were able to successfully extract logprobs for all tokens, else False.
  - `token_{i}_logprob`: Logprob of the model response for the given token.
  - `token_{i}_logprob_top_minus_second`: Gap between the model's most likely output token and second most likely token.
  - `token_{i}_true_answer_logprob`: If not None, the logprob of the correct token for this position.
  - `token_{i}_true_answer_rank`: If not None, the rank of the correct token for this position (1 to 5, inclusive).
  - `token_{i}_lower_entropy`: Lower bound on entropy for the given token. Calculated by using the fact that all remaining tokens have probability less than that of the 5th most likely token. Specifically, $$- \left(\sum_{i=1}^5 p_i \log\left(p_i\right) + \left(1-\sum_{i=1}^5 p_i\right) \cdot \log\left(p_5\right)\right).$$
  - `token_{i}_top5_entropy`: Entropy over the top 5 tokens (after renormalizing to have total probability 1).
  - `answer_logprob`: Total logprob of the generated answer (only counting outputted number tokens).
  - `increasing_response_logprobs` / `decreasing_response_logprobs`: Whether or not the logprob of the response is monotonically increasing/decreasing over output number tokens.
  - `increasing_response_top5_entropy` / `decreasing_response_top5_entropy`: Whether or not `top5_entropy` of the response is monotonically increasing/decreasing over output number tokens.
  - `increasing_response_lower_entropy` / `decreasing_response_lower_entropy`: Whether or not `lower_entropy` of the response is monotonically increasing/decreasing over output number tokens.
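The entropy lower bound formula above translates directly into code; this sketch takes the top-5 logprobs as input (the function name is illustrative):

```python
import math

def lower_entropy(top5_logprobs):
    """Sketch: lower bound on the full-distribution entropy from the
    top-5 logprobs, using the fact that every remaining token has
    probability <= that of the 5th most likely token."""
    ps = [math.exp(lp) for lp in top5_logprobs]
    p5 = ps[-1]                # least likely of the top 5
    rest = 1.0 - sum(ps)       # probability mass outside the top 5
    return -(sum(p * math.log(p) for p in ps) + rest * math.log(p5))

# If the top 5 tokens uniformly cover all the mass, the bound is exact: ln 5
lower_entropy([math.log(0.2)] * 5)  # ~1.609
```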
- `num_carry_in_errors`: Number of errors when carrying in a 1.
- `num_carry_out_errors`: Number of errors when carrying out a 1.
- `dropped_digit_error`: True if the model got the right answer as if it dropped a digit (calculated by an exhaustive check over all digit drops).
- Left alignment error checking, mostly used for single digit experiments:
  - `left_alignment_wrong_answer`: Answer as if the model added inputs from the left; used in `left_alignment_error` and stored to allow for comparative coloring in error analysis notebooks.
  - `left_alignment_error`: True if the model seems to have added the inputs from the left, i.e. zero-padding the shorter number up to the length of the longer number and then adding them produces the model's answer. If inputs are the same length, this is automatically False.
  - `{left, right}_chunked_error`: Checks if the model seems to have treated the inputs as lists separated by their separators, and then added them element-wise. If both inputs have the same number of chunks, left vs right doesn't matter. If one number has fewer chunks, then we can choose to align the lists from the left or the right. `left_chunked_error` checks for alignment from the left, and `right_chunked_error` from the right.
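The left-alignment check above amounts to right-padding the shorter addend with zeros before adding; a minimal sketch (illustrative function name):

```python
def left_alignment_wrong_answer(num1, num2):
    """Sketch: the answer a model would give if it aligned the addends
    from the left, i.e. zero-padded the shorter number on the right up
    to the length of the longer one before adding."""
    s1, s2 = str(num1), str(num2)
    width = max(len(s1), len(s2))
    return int(s1.ljust(width, "0")) + int(s2.ljust(width, "0"))

left_alignment_wrong_answer(123, 45)  # 123 + 450 = 573, not the true 168
```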
`gpt_token_paper_figures.ipynb` + `claude_token_paper_figures.ipynb` were used to create the figures about GPT-3, GPT-3.5, and Claude's tokenizers in the paper. We separated these from `paper_figures.ipynb` as they do not rely on any of the experiments we ran and simply rely on the tokenizers provided by the respective creators of these models. The files also contain some additional investigation of these tokenizers (which we left in as we thought it may be interesting), so they're a bit less clean.
`utils.pretty_print` provides a nice tool for visualizing errors and carries for individual problems. We found this very useful when looking for error patterns.
We also include a fun Python command line game (`tokenization_game.py`) we made where you can test your ability to rank tokens in the GPT-3.5 tokenizer. Enjoy :)