---
title: "LLM API Overview"
format: pdf
code-overflow: wrap
messages: false
outputs: false
warnings: false
errors: false
---

Define a test string and specify a model to use for the test.

In [2]:
# Define a test string and specify a model to use for the test.
s_string = "Finn writes code"
s_model = 'gpt2'

In [3]:
from transformers import GPT2Tokenizer, AutoModelForCausalLM
import numpy as np

# Load the tokenizer and model named in s_model (defined in the previous cell)
tokenizer = GPT2Tokenizer.from_pretrained(s_model)
model = AutoModelForCausalLM.from_pretrained(s_model)
tokenizer.pad_token_id = tokenizer.eos_token_id
inputs = tokenizer([s_string], return_tensors="pt")

# Example 1: Print the scores for each token generated with greedy search
# (do_sample=False picks the most likely token at every step)
outputs = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,
)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
# input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
# encoder-decoder models, like BART or T5.
input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
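generated_tokens = outputs.sequences[:, input_length:]

# Print one row per generated token: | token id | token string | log probability | probability
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")
# Summing the log probabilities and exponentiating gives the product of the per-token probabilities
print(f"Is this the joint probability across the whole vocab? {np.exp(transition_scores[0].numpy().sum()):.4%}")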

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


|   284 |  to      | -1.336 | 26.28%
|   787 |  make    | -3.183 | 4.15%
|   340 |  it      | -1.970 | 13.95%
|  4577 |  easier  | -2.133 | 11.84%
|   284 |  to      | -0.470 | 62.47%
Is this the joint probability across the whole vocab? 0.0112%
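
The figure in the last line appears to be the product of the five per-token probabilities, i.e. the joint probability of this particular continuation given the prompt, rather than a quantity summed over the whole vocabulary. A minimal sketch that reproduces it from the six-decimal log probabilities reported for this run (the `trans_scores` values); the variable names are illustrative and not taken from the original notebook:

```python
import math

# Per-token log probabilities for " to", " make", " it", " easier", " to",
# copied from the trans_scores values reported for this run.
log_probs = [-1.336478, -3.182834, -1.969614, -2.133453, -0.470468]

# The joint probability of the sequence is the product of the per-token
# probabilities, which equals exp of the summed log probabilities.
joint_log_prob = sum(log_probs)
joint_prob = math.exp(joint_log_prob)

# Same quantity computed as an explicit product, as a cross-check.
product = math.prod(math.exp(lp) for lp in log_probs)

print(f"joint log probability: {joint_log_prob:.4f}")
print(f"joint probability:     {joint_prob:.4%}")
```

Working in log space is the safer choice for longer sequences, since multiplying many small probabilities directly can underflow.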


In [29]:
import pandas as pd
import numpy as np

# Collect the generated tokens and their transition scores (log probabilities) in a DataFrame
df = pd.DataFrame({'token': generated_tokens[0].numpy(), 'trans_scores': transition_scores[0].numpy()})
df['token_str'] = df['token'].apply(lambda x: tokenizer.decode(x))
# Exponentiate the log probabilities to get per-token probabilities
df['trans_prob'] = np.exp(df['trans_scores'])
df = df[['token', 'token_str', 'trans_scores', 'trans_prob']]

In [36]:
#| output: 'asis'
#| echo: false
print(f"""$$
{df.to_latex()}$$""")

$$
\begin{tabular}{lrlrr}
\toprule
 & token & token_str & trans_scores & trans_prob \\
\midrule
0 & 284 &  to & -1.336478 & 0.262770 \\
1 & 787 &  make & -3.182834 & 0.041468 \\
2 & 340 &  it & -1.969614 & 0.139511 \\
3 & 4577 &  easier & -2.133453 & 0.118428 \\
4 & 284 &  to & -0.470468 & 0.624710 \\
\bottomrule
\end{tabular}
$$


# Notes
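- Whitespace matters: e.g. "\n\n" and " \n\n" are different strings and tokenize differently.
- To get the log probabilities from the OpenAI API, we have to use the legacy API (risky?).
- With the `seed` parameter, the same settings (almost) always give the same output; perhaps even closer to always with `temperature=0.0`?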
- ["We’re also launching a feature to return the log probabilities for the most likely output tokens generated by GPT-4 Turbo and GPT-3.5 Turbo in the next few weeks, which will be useful for building features such as autocomplete in a search experience."](https://openai.com/blog/new-models-and-developer-products-announced-at-devday)
- Also save the `system_fingerprint` to keep track of the state of the model. If the model itself gets updated, the same seed might yield different results.
- We can set the seed, but not the fingerprint.
- Can we run Llama 2 locally, at least the smallest version?
- With the OpenAI API, we cannot access the first input layer directly; we have to go through prompts.
- "[inputs_embeds](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2LMHeadModel) (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix."
- For GPT-2, we can make the output deterministic.
- We can pass an input embedding instead of a tokenized phrase: `model(inputs_embeds=embeds)`.
- [Thread on logit scores and their different variants](https://discuss.huggingface.co/t/announcement-generation-get-probabilities-for-generated-output/30075/13)
  - Transition scores: "transition_scores contains scores for the tokens that were selected at generation time. You can set normalize_logits=True to ensure they are normalized at a token level (i.e. to ensure the sum of probabilities for all vocabulary at a given generation step is 1)."
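
What `normalize_logits=True` amounts to can be illustrated without a model: raw logits are turned into log probabilities with a log-softmax, so that the probabilities over the vocabulary sum to 1 at each generation step. The sketch below uses a made-up four-token vocabulary and made-up logits (both are assumptions for illustration); it is not the library's implementation.

```python
import math

def log_softmax(logits):
    """Convert raw logits into log probabilities (numerically stable)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]

# Hypothetical logits for one generation step over a toy 4-token vocabulary.
step_logits = [2.0, 0.5, -1.0, 0.0]
log_probs = log_softmax(step_logits)
probs = [math.exp(lp) for lp in log_probs]

# After normalization the probabilities over the vocabulary sum to 1.
total = sum(probs)

# Greedy search selects the highest-scoring token; its normalized score is
# the analogue of the transition score for this step.
selected = max(range(len(step_logits)), key=lambda i: step_logits[i])
transition_score = log_probs[selected]

print(f"sum of probabilities: {total:.6f}")
print(f"selected token index: {selected}, log prob: {transition_score:.3f}")
```

Without normalization, the score of the selected token would be the raw logit (2.0 here), which cannot be read as a log probability.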