# Alpaca

This notebook explains how to make use of the capabilities of large language models through some of the interfaces they offer.

The language model we will use in this notebook is [Alpaca-LoRA](https://github.com/tloen/alpaca-lora/). We chose it because it is a very small model, able to run on a free-tier Colab (as long as you get access to a GPU). It has also been finetuned for use as an assistant, which makes our job easier, as we will not need to spend much time on prompt engineering.

In [None]:
#@title Some bibs and bobs to install
!pip install bitsandbytes
!pip install -q sentencepiece
!pip install -q git+https://github.com/huggingface/transformers@v4.30.2
!pip install -q git+https://github.com/huggingface/peft.git


In [None]:
#@title Python imports

from peft import PeftModel
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

import functools
import textwrap
import numpy as np
from scipy.special import softmax
import matplotlib.pyplot as plt

from google.colab.output import eval_js

In [None]:
#@title Load the model. This can be slow (3 minutes), but should run fine on public colabs.

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    load_in_8bit=True,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "tloen/alpaca-lora-7b")

eval_js('google.colab.output.setIframeHeight("250")')

# A Minimalist example for querying an Alpaca model.

Let's make a simple question answerer. You could test it with the question: `"What is the first name of Einstein?"` But feel free to be creative.

If the cell below prints some text as an answer, it means we can correctly run the Alpaca model (even though the text that is actually sampled might not make much sense for now).

In [None]:
generation_config = GenerationConfig()
prompt = input("Enter a question here: ")
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=32,
)
answer = tokenizer.decode(generation_output[0])
print("Answer:", answer)


This output is disappointingly bad, but it is nothing we cannot fix. 🤞 At this point, we are happy as long as the cell does not throw an error.

# On the Alpaca model.

The Alpaca model used here has been trained on [the Stanford Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca). We will need to have a look at the data on which the Alpaca model has been trained, in order to make sure our question is sufficiently in-domain.

Like all machine learning models, natural language models are trained to deal with data coming from a specific distribution.

Have a look at the link, and see if you can find a good way of formatting our question into something the model has seen during training.

In [None]:
generation_config = GenerationConfig()
question = input("Enter a question here: ")
prompt = f"""<FIND OUT WHAT TO PUT HERE>{question}"""
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=32,
)
answer = tokenizer.decode(generation_output[0])
print("Answer:", answer)


If all is good, you should now start to see a decent output hidden in the answer.

# On tokens
Large language models these days do not operate on the level of characters, but on the level of so-called tokens. Each token typically represents several characters, and sometimes a whole word.

In the case of Alpaca, the vocabulary has 32000 entries, so tokens are represented as numbers between 0 and 31999.
Alpaca has been trained to work with these tokens, not with individual characters.

Therefore, the model uses a so-called tokenizer to turn text into a list of tokens, and vice versa, to turn tokens back into text.
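
To make this round trip concrete, here is a toy sketch of a tokenizer that uses greedy longest-match over a tiny made-up vocabulary. The vocabulary, the ids and the function names are all hypothetical. The real Alpaca tokenizer has 32000 entries and works differently in its details, so treat this only as an illustration of the text-to-ids-to-text idea.

```python
# A toy vocabulary (hypothetical; the real one has 32000 entries).
TOY_VOCAB = {"<unk>": 0, "Ein": 1, "stein": 2, "Al": 3, "bert": 4, " ": 5}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def toy_encode(text):
    """Greedy longest-match: repeatedly take the longest vocabulary piece that fits."""
    ids, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB and piece != "<unk>":
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            # No piece matches: emit the "unknown" token and skip one character.
            ids.append(TOY_VOCAB["<unk>"])
            pos += 1
    return ids


def toy_decode(ids):
    """Turn a list of token ids back into text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


toy_ids = toy_encode("Albert Einstein")
print(toy_ids)
print(toy_decode(toy_ids))
```

Characters that are not covered by the toy vocabulary all collapse onto the same "unknown" id, so they cannot be recovered when decoding.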

In [None]:
# Print the individual tokens of the previous output.
print(generation_output[0])
print()
print("_".join([tokenizer.decode(token) for token in generation_output[0]]))

In [None]:
print(tokenizer.decode(2694))
print(tokenizer.decode(5465))
print(tokenizer.decode(3337))
print(tokenizer.decode(31999))
print(tokenizer("Einstein Ейнштейн").input_ids)

It is important to understand here that there are three special tokens, which do not really map onto text:

*   The "unknown" token, used to encode characters which are not in the vocabulary. It is used during training to replace all characters not known to the tokenizer, something which can happen a lot with languages that don't use a Latin script.
*   The "BOS" token, or Beginning Of Sentence. This token indicates that a new sentence has begun. In many models, this also manipulates the attention of the transformer, making sure that tokens which come after this token cannot attend to tokens that came before it.
*   The "EOS" token, or End Of Sentence. In many models, this token signals the sampler to stop sampling. That is convenient, because it makes sure the answer is returned faster and the model does not sample too many useless tokens.

In Alpaca, these are the tokens 0, 1 and 2 respectively.
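
As a small illustration of how these ids can be used, the sketch below cuts a list of token ids at the first EOS token. The helper name `truncate_at_eos` and the example ids are made up for this illustration; only the ids 0, 1 and 2 come from the text above. It operates on plain Python lists, so a tensor of generated ids would first need to be converted, for example with `.tolist()`.

```python
UNK_ID, BOS_ID, EOS_ID = 0, 1, 2  # the ids stated above for Alpaca


def truncate_at_eos(token_ids, eos_id=EOS_ID):
    """Return the ids up to (not including) the first EOS token."""
    token_ids = list(token_ids)
    if eos_id in token_ids:
        return token_ids[: token_ids.index(eos_id)]
    return token_ids


# Made-up ids: BOS, two ordinary tokens, EOS, then leftovers after the end.
example_ids = [BOS_ID, 15043, 3186, EOS_ID, 0, 0]
truncated_ids = truncate_at_eos(example_ids)
print(truncated_ids)
```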



In [None]:
print(tokenizer.decode(0))
print(tokenizer.decode(1))
print(tokenizer.decode(2))

Now with this understanding, can we clean up our answer to only contain the answer, and no longer the text of our original prompt or other additional nonsense?

In [None]:
generation_config = GenerationConfig()
question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=64,
)
answer = tokenizer.decode(generation_output[0]["<How to index here?>"])
print("Answer:", answer)


# Greedy sampling and the role of the different sampling methods
Let's improve the output of our model a bit.

You might have noticed that the above already works great for simple questions like, `What is Einstein's first name?`. But if you ask it questions like `Can you write a paragraph about the role of sampling methods in large language models?`, you would notice a problem.

In [None]:
generation_config = GenerationConfig()
question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=64,
)
answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(answer))
print("Answer:", paragraphed)


That is quite a repetitive answer. 😞
What is going on here?

The reason is that by default, we are sampling with greedy sampling. Every time we sample a new token from our auto-regressive model, we only take the most probable token.
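
A toy sketch can show how always taking the most probable token leads to loops. The table of next-token probabilities below is entirely made up, and it only conditions on the previous token, whereas a real language model conditions on the whole history. The mechanism is the same in spirit though: once the most probable path returns to a token it has seen before, it repeats forever.

```python
import numpy as np

# Hypothetical next-token probabilities: row i is the distribution after token i.
tokens = ["the", "cat", "sat", "."]
next_token_probs = np.array([
    [0.05, 0.55, 0.30, 0.10],  # after "the"
    [0.10, 0.05, 0.60, 0.25],  # after "cat"
    [0.55, 0.10, 0.05, 0.30],  # after "sat"
    [0.70, 0.10, 0.10, 0.10],  # after "."
])


def greedy_decode(start, steps):
    """Always append the single most probable next token."""
    out = [start]
    for _ in range(steps):
        out.append(int(np.argmax(next_token_probs[out[-1]])))
    return out


greedy_ids = greedy_decode(start=0, steps=8)
print(" ".join(tokens[i] for i in greedy_ids))
```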

Intuitively, picking the most probable token makes a lot of sense. After all, wouldn't the best token to sample be the most probable one?

Yes! The most probable token usually is a really good token to sample. However, an issue arises when you _only_ take the most probable continuation. While that continuation indeed has a high likelihood, it is not a typical sample from the model.

And while at first this might seem counterintuitive, it is a very important notion to keep in mind. When you sample in a high-dimensional space, all the samples you get within the first few bazillion draws from your distribution will actually have a relatively low probability of being sampled, at least compared to the sequence with the highest probability. These 'true' samples are called _typical_ samples.

Usually, in high dimensions, *the most probable sample is not typical*.
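
A small worked example, using coin flips as a stand-in for tokens: take 100 flips of a biased coin that lands heads with probability 0.6 (numbers chosen only for illustration). The single most probable sequence is "all heads", yet you will essentially never sample it. The sequences you do sample contain about 60 heads, and each of them individually is far less probable than the all-heads sequence.

```python
import numpy as np

p_heads, n_flips = 0.6, 100
rng = np.random.default_rng(0)

# Each row is one sampled sequence of flips; True means heads.
flips = rng.random((10_000, n_flips)) < p_heads
heads_per_sequence = flips.sum(axis=1)


def sequence_log_prob(n_heads):
    """Log-probability of one specific sequence containing n_heads heads."""
    return n_heads * np.log(p_heads) + (n_flips - n_heads) * np.log(1 - p_heads)


all_heads_count = int((heads_per_sequence == n_flips).sum())
print("log-prob of the all-heads sequence:", sequence_log_prob(n_flips))
print("log-prob of one sequence with 60 heads:", sequence_log_prob(60))
print("mean number of heads in the samples:", heads_per_sequence.mean())
print("sampled sequences that are all heads:", all_heads_count)
```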

In [None]:
# We sample 999, 1000-dimensional vectors from a Gaussian with mean=0, stddev=1
x_typ = np.random.normal(size=(999, 1000))
plt.hist(np.linalg.norm(x_typ, axis=1), density=True, color='cyan', label="Typical sample")
# The most probable point to sample from this 1000-dimensional Gaussian, is:
x_maxprob = np.zeros(shape=(1, 1000))
plt.hist(np.linalg.norm(x_maxprob, axis=1), density=True, color='red', label="Most probable sample")
plt.xlabel("Norm of each sample")
plt.ylabel("pdf")
plt.legend()
# As you can see, in high dimensions,
# the most probable sample suddenly looks very different from a typical sample!

It is better to think of a high-dimensional Gaussian as a soap bubble: nearly all samples lie in a thin shell at some distance from the center.

In [None]:
norm = functools.partial(np.linalg.norm, axis=-1, keepdims=True)
# Let's project our typical samples on a 2D plane, but maintain their norm.
x = x_typ[:, :2] / norm(x_typ[:, :2]) * norm(x_typ)
plt.scatter(*x.T, marker=',', s=1)
plt.title("High dimensional typical samples, rotated onto the 2D plane.")

If you want further reading on the typicality-problem, I can recommend these two resources:


* Ferenc Huszár's [*Gaussian Distributions are Soap Bubbles*](https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/)
* Sander Dieleman's [*Musings on typicality*](https://sander.ai/2020/09/01/typicality.html)



## The solution: true categorical sampling

In [None]:
generation_config = GenerationConfig(
    do_sample=True  # This enables categorical sampling in our Alpaca model
)
question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=256,
)
answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(answer))
print()
print("Answer:", paragraphed)

That is much better! The repetition is gone.

However, you might notice that the model does not really stick to the topic very well. It tends to "drift" away, as later tokens are increasingly influenced by tokens that are sampled by the model, rather than tokens that were in the prompt.

In general, we want to have typical samples, but there was something "nice" about the greediness too. There are various common approaches to make the samples stick closer to the most likely prediction of the model, while also keeping them typical. We will discuss some of these in the next sections.

## A first intermediate approach: changing the temperature

You can think of changing the temperature of a categorical distribution as similar to what happens when changing the temperature in statistical thermodynamics.

If you are not familiar with this framework, you can play with the slider below.

In short, if your original distribution is $p(x)$, changing the temperature to $t$ creates a distribution $p_t(x)=\frac{p(x)^{1/t}}{\sum_{x'} p(x')^{1/t}}$. So you raise each probability of the original distribution to the power of one over the temperature, and then you renormalize.

In [None]:
#@title Play with the temperature of a distribution { run: "auto" }

temp = 1.35 #@param {type:"slider", min:0.01, max:3, step:0.01}

np.random.seed(317070)
x = softmax(np.random.rand(100,))
plt.subplot(1,2,1)
plt.bar(range(1, 101), x, width=1.)
plt.xlabel("Original Categories")
plt.ylabel("Probability")

x_temp = np.power(x, 1./(temp + 1e-2))
x_temp = x_temp / np.sum(x_temp)
plt.subplot(1,2,2, sharey=plt.gca())
plt.bar(range(1, 101), x_temp, width=1.)
plt.gca().get_yaxis().set_visible(False)
plt.xlabel(f"Categories with {temp=}")
plt.ylabel("Probability")
print()

As you can see, the original distribution has a temperature of 1. This was the true sampling case.

When you heat up, the distribution gets more and more uniform. In the limit, when $t=\infty$, you get the uniform distribution. That is quite pointless for our purpose.

When you cool off, less likely samples get less likely, while more likely samples get more likely. In the limit of $t=0$, only the most likely sample remains. This is greedy sampling.

So by adjusting the temperature, we get to interpolate between our original $t=0$ greedy sampling, and $t=1$ true sampling.
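
The two limits can be checked numerically with a few lines of NumPy. This is a minimal sketch on made-up logits; the helper names `softmax_np` and `apply_temperature` are hypothetical. It also illustrates an equivalent view that is common in practice: raising the probabilities to the power $1/t$ and renormalizing gives the same result as dividing the logits by $t$ before the softmax.

```python
import numpy as np


def softmax_np(logits):
    """Numerically stable softmax over a 1D array."""
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()


def apply_temperature(probs, t):
    """Raise probabilities to the power 1/t and renormalize."""
    scaled = np.power(probs, 1.0 / t)
    return scaled / scaled.sum()


logits = np.array([2.0, 1.0, 0.5, -1.0])  # made-up logits
probs = softmax_np(logits)

for t in (0.1, 1.0, 10.0):
    print(f"t={t}:", np.round(apply_temperature(probs, t), 3))

# Equivalent view: divide the logits by t before the softmax.
same = np.allclose(apply_temperature(probs, 0.5), softmax_np(logits / 0.5))
print("Matches softmax(logits / t):", same)
```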

Depending on your application, it can be a good idea to set the temperature to 0.8 to get succinct, accurate answers from your large language model. This will remove some of the "creativity" and "wildness" though. In general, I have seen values ranging from 0.1 to 1.0 being used.

In [None]:
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,  # This sets the temperature of our model
)
question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=256,
)
answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(answer))
print()
print("Answer:", paragraphed)

## Other approaches: avoiding the tails, and beam sampling.

A problem which is hard to illustrate in a colab, is that once in a blue moon you might still get very unlucky and sample a weird token. In order to avoid that problem, there are various methods that only consider the top tokens during sampling. E.g.

* top_k sampling: only look at the k most likely tokens
* top_p sampling: only look at the most likely tokens up to probability mass p

Both these methods discard the long tail of tokens. Like lowering the temperature, this takes away a bit of the weirdest wackiness of the answers. Unlike lowering the temperature, it keeps most of the creativity intact.
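
Both filters are easy to sketch on a single probability vector. The code below is a minimal NumPy illustration on made-up probabilities, not the implementation used inside `transformers`. The function names are hypothetical.

```python
import numpy as np


def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero out the rest, renormalize."""
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed to reach mass p (always at least one).
    n_keep = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:n_keep]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


probs = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])  # made-up distribution
top_k_probs = top_k_filter(probs, k=2)
top_p_probs = top_p_filter(probs, p=0.8)
print(top_k_probs)
print(top_p_probs)
```

Note the difference in behaviour: top_k always keeps the same number of tokens, while top_p keeps few tokens when the model is confident and many when the distribution is flat.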



In [None]:
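generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
    top_k=40,  # Only consider the 40 most likely tokens (illustrative value)
    top_p=0.9,  # Of those, keep the top 90% of probability mass (illustrative value)
)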
question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=256,
)
answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(answer))
print()
print("Answer:", paragraphed)

Beam sampling actually allows us to go the other way, and to be more greedy than greedy sampling!

With greedy sampling, we only consider the most likely token at every point in time. However, the most likely token at every point in time is not necessarily going to give us the most likely sequence within the distribution of sequences. We might sample too greedily early on, and so we might miss out on some high-probability token later on.

In order to mitigate this somewhat, we can keep track of a number of beams through our probability space. At every point during our sampling, we will try to maintain the $N$ most likely token sequences so far, the so called "beams".
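
The sketch below shows the idea on a hypothetical two-token "model" whose probabilities are made up so that the greedy choice is a trap. For simplicity it implements plain deterministic beam search, without the sampling step that `do_sample=True` adds, and the names `TOY_MODEL` and `beam_search` are illustrative only.

```python
import numpy as np

# Hypothetical model over a 2-token vocabulary; keys are the tokens so far.
TOY_MODEL = {
    (): np.array([0.6, 0.4]),
    (0,): np.array([0.55, 0.45]),
    (1,): np.array([0.9, 0.1]),
}


def beam_search(num_beams, length=2):
    """Return the most likely (sequence, log-probability) found with num_beams beams."""
    beams = [((), 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, logp in beams:
            for token, p in enumerate(TOY_MODEL[prefix]):
                candidates.append((prefix + (token,), logp + float(np.log(p))))
        # Keep only the num_beams most likely prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


greedy_seq, greedy_logp = beam_search(num_beams=1)  # one beam is greedy decoding
beam_seq, beam_logp = beam_search(num_beams=2)
print("greedy:", greedy_seq, np.exp(greedy_logp))
print("beams: ", beam_seq, np.exp(beam_logp))
```

With a single beam the search commits to token 0 first and ends up with a sequence of probability 0.33. With two beams it keeps the initially less likely token 1 around and finds the sequence with probability 0.36.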



In [None]:
generation_config = GenerationConfig(
    do_sample=True,
    num_beams=4,
)
question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=64,
)
answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(answer))
print()
print("Answer:", paragraphed)

For more explanation of how these sampling methods work, I would refer to [this excellent blogpost](https://huggingface.co/blog/how-to-generate).

In general, finding a good trade-off between all these parameters is a bit of an art, and can strongly depend on your application. How creative do you need the answers to be? How close to the training data does the model need to stay? How important is it to not have a Chinese token slip in in weird corner cases?

# Chain of Thought reasoning

An important technique for improving the quality of answers to questions is to give the language model a little bit of space to reason before it has to give a final answer. You could think about it as giving the model a bit of working memory to use before answering the question. In general, it is observed that this improves the quality of answers to reasoning questions dramatically.

Keep in mind that models sample these tokens auto-regressively. If they answer first, they will have to come up with a reason to make that answer plausible. If they start with the first step, that might be easier to infer directly from the question.

For example: ask the model `Piotr and Monika have 2 cans. Each can has 5 pierogi. How many pierogi does Monika have?`
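
The two-stage idea can be sketched independently of any particular model. In the sketch below, `alpaca_prompt`, `answer_with_reasoning` and `fake_generate` are hypothetical helper names, and `generate` stands for any function that maps a prompt string to a completion string. The stand-in used here does not call a model at all; it only exists so that the sketch runs on its own.

```python
def alpaca_prompt(instruction):
    """Wrap an instruction in the Alpaca prompt template used in this notebook."""
    return (
        "Below is an instruction that describes a task. Write a response "
        "that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )


def answer_with_reasoning(question, generate):
    """Two-stage chain of thought: first ask for reasoning, then for an answer."""
    reasoning = generate(alpaca_prompt(
        "Write a paragraph to answer the following question. "
        f"Think step by step. {question}"
    ))
    final_answer = generate(alpaca_prompt(
        f"Answer the following question '{question}' given these steps. {reasoning}"
    ))
    return reasoning, final_answer


def fake_generate(prompt):
    """A stand-in for the real model, only so that the sketch runs on its own."""
    return f"[completion for a prompt of {len(prompt)} characters]"


reasoning, final_answer = answer_with_reasoning(
    "How many pierogi does Monika have?", fake_generate
)
print(reasoning)
print(final_answer)
```

The important design choice is that the reasoning produced in the first call is pasted into the prompt of the second call, so the final answer is conditioned on it.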

In [None]:
#@title Without chain of thought reasoning
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,
)
question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=32,
)
answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(answer))
print()
print("Answer:", paragraphed)

In [None]:
#@title With chain of thought reasoning

# First, we generate a reasoning.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=1.0,
)

prompt = (
    f"Below is an instruction that describes a task. Write a response "
    f"that appropriately completes the request.\n\n"
    f"### Instruction:\nWrite a paragraph to answer the following question. "
    f"Think step by step. {question}\n\n"
    f"### Response:\n"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=256,
)
answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(answer))
print()
print("Reasoning:", paragraphed)

# Then, using this reasoning, we ask the model for a final answer.
prompt = (
    f"Below is an instruction that describes a task. Write a response "
    f"that appropriately completes the request.\n\n"
    f"### Instruction:\nAnswer the following question '{question}' given these steps. {answer}"
    f"\n\n### Response:"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
generation_output = model.generate(
    input_ids=input_ids,
    generation_config=generation_config,
    return_dict_in_generate=False,
    max_new_tokens=32,
)
final_answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
paragraphed = '\n'.join(textwrap.wrap(final_answer))
print()
print("Answer:", paragraphed)

# <Insert here a step where you ask the model to verify its answer>

Exercise: Can you add a step asking if the model thinks the answer is correct?

I do want to note that the Alpaca model used for this colab is not great at Chain of Thought reasoning. It is probably too small.

# Using large language models as an interface for algorithms

Above, we were using large language models to generate some text for us. However, most of the time we don't actually want the model to give us an answer that is understandable to humans. We want the model to give us an answer that can be used by the rest of our computer program. We only need the model to process text as input, and we don't actually want to use text as output.

For instance, we might use a language model to answer questions like:

* Is this datapoint an outlier compared to these other datapoints?
* Is this piece of text saying the same as that piece of text?
* From this pre-considered list of categories, which category would you say this datapoint is?
* Some other model has generated me this piece of text. Do you think it is any good? (See the last exercise in Chain of Thought reasoning.)



In [None]:
#@title We could sample a yes/no answer
generation_config = GenerationConfig(
    do_sample=True,
)
question = "Is looking at the logits of answers a good way to extract the knowledge of large language models?"
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:\n"""
)
for i in range(10):
  inputs = tokenizer(prompt, return_tensors="pt")
  input_ids = inputs["input_ids"].cuda()
  generation_output = model.generate(
      input_ids=input_ids,
      generation_config=generation_config,
      return_dict_in_generate=False,
      max_new_tokens=16,
  )
  answer = tokenizer.decode(generation_output[0][input_ids.shape[1]:])
  paragraphed = '\n'.join(textwrap.wrap(answer))
  print()
  print("Answer:", paragraphed)
  if "Yes" in answer:
    print("True")
  elif "No" in answer:
    print("False")
  else:
    print("?")

As you can see, the model is still sampling stochastically. It has not completely made up its mind as to the correct answer of this question. It is also not always using the exact right words for us to do the conversion from a text answer to True/False in the end.
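
The text-to-boolean conversion itself can be made a little less brittle. The sketch below is one possible approach, with a hypothetical helper `parse_yes_no`: it matches whole words regardless of case, and returns `None` when the answer contains neither or both words. It still cannot handle answers phrased without a literal yes or no.

```python
import re


def parse_yes_no(answer):
    """Return True for yes, False for no, None when the text is ambiguous."""
    has_yes = re.search(r"\byes\b", answer, flags=re.IGNORECASE) is not None
    has_no = re.search(r"\bno\b", answer, flags=re.IGNORECASE) is not None
    if has_yes == has_no:
        # Neither word, or both words, appear: we cannot decide.
        return None
    return has_yes


for text in ["Yes, it is.", "no.", "Not necessarily.", "Yes and no."]:
    print(repr(text), "->", parse_yes_no(text))
```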

A solution to this problem is to look at the logits of the model directly, instead of looking at its samples.


In [None]:
question = "Is looking at the logits of answers a good way to extract the knowledge of large language models?"
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:\n"""
)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()

res = model(input_ids=input_ids)
print("The shape of the inputs is:", input_ids.shape)
print("The shape of the logits is:", res.logits.shape)
# This result of the forward pass has three dimensions.
# * A batch dimension
# * A sequence dimension (the input had this many tokens)
# * A vocab dimension (the model can choose between this many tokens)
# Take the logits at the last position as a float32 NumPy array on the CPU.
next_token_logits = res.logits[0, -1, :].detach().float().cpu().numpy()
next_token_prob = softmax(next_token_logits, axis=-1)
plt.plot(range(len(next_token_prob)), next_token_prob)
plt.xlabel("Token index")
plt.ylabel("Probability")
plt.show()


As we can see, there are 2 really probable tokens, but there are many more tokens that appear to have some probability mass.

In [None]:
#@title Print the highest probability tokens
# Take the 50 most probable tokens which come after the question we asked,
# sorted from most to least probable.
highest_prob_tokens = np.argsort(next_token_prob, axis=0)[::-1][:50]
# Print those tokens
print([(tokenizer.decode(i), i) for i in highest_prob_tokens])

Here is part of the annoyance: there are many variations of yes and no available as tokens. We should account for all of these alternative spellings.
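
One way to hunt for such variants is to scan the vocabulary for entries that match a word once case and the word-boundary marker `▁` (used by SentencePiece-style vocabularies) are ignored. The sketch below runs on a made-up toy vocabulary, and `find_token_ids` is a hypothetical helper. With the real tokenizer, the dictionary returned by `tokenizer.get_vocab()` could presumably be passed in instead, but that is an assumption that has not been checked against this model.

```python
def find_token_ids(vocab, word):
    """Ids of all vocabulary pieces equal to `word`, ignoring case and the
    word-boundary marker."""
    target = word.lower()
    return sorted(
        token_id
        for piece, token_id in vocab.items()
        if piece.replace("\u2581", "").lower() == target
    )


# A made-up vocabulary mapping pieces to ids.
toy_vocab = {
    "Yes": 10, "\u2581yes": 11, "\u2581YES": 12,
    "No": 20, "\u2581no": 21, "\u2581Not": 22, "eyes": 23,
}
toy_yes_ids = find_token_ids(toy_vocab, "yes")
toy_no_ids = find_token_ids(toy_vocab, "no")
print(toy_yes_ids)
print(toy_no_ids)
```

Exact matching avoids false positives such as "eyes" or "Not", but it also misses synonyms like "True", so the resulting list still deserves a manual check.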

In [None]:
yes_tokens = [<Find the various tokens which mean yes>]
no_tokens = [<Find the various tokens which mean no>]

print([(i, tokenizer.decode(i)) for i in yes_tokens])
print([(i, tokenizer.decode(i)) for i in no_tokens])

yes_probability = np.sum([next_token_prob[i] for i in yes_tokens])
no_probability = np.sum([next_token_prob[i] for i in no_tokens])

print(f"Yes probability: {yes_probability}")
print(f"No probability: {no_probability}")

In [None]:
#@title The final solution

question = input("Enter a question here: ")
prompt = (
    f"Below is an instruction that describes a task. Write a response"
    f" that appropriately completes the request.\n\n"
    f"### Instruction:\n{question}\n\n### Response:\n"""
)

inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()
res = model(input_ids=input_ids)
# Take the logits at the last position as a float32 NumPy array on the CPU.
next_token_logits = res.logits[0, -1, :].detach().float().cpu().numpy()
next_token_prob = softmax(next_token_logits, axis=-1)
# Each token id is listed once, so that no probability is counted twice.
yes_tokens = [8241, 3582, 21143, 5574, 3009, 29979, 3869, 20652]
no_tokens = [3782, 1217, 6632, 8824, 4541, 29940]
yes_probability = np.sum([next_token_prob[i] for i in yes_tokens])
no_probability = np.sum([next_token_prob[i] for i in no_tokens])
if yes_probability > no_probability:
  print("True")
else:
  print("False")


# Conclusion

I hope I could convince you that Large Language models are fairly flexible things, and depending on your task or application, might be used as an alternative to training a model yourself.

Note that the tiny Alpaca model used in this Colab is not of great quality. In general, I would recommend using a larger language model, but I do hope I have convinced you that even this tiny language model is a great tool to have under your belt. It can be an enormous help with tasks like data cleaning.

## The future

Finally, it is my hope that models like these will be able to generate their own datasets on which they can finetune and get better. I expect that people will find out how to use Chain of Thought reasoning to generate a new dataset to finetune the model on, and that by finetuning on this data the model will improve. This process can then be repeated, which one day might bring us a language model with superhuman intelligence. That model would be the closest humanity can hope to get to an oracle.

How to exactly do that is still an open question, but I reckon it is one we are close to answering.