# Decoding and Search Strategies

In recent years, there has been a surge of interest in open-ended language generation due to the development of large language models (LLMs) like GPT-2, XLNet, OpenAI-GPT, CTRL, TransfoXL, XLM, BART, T5, GPT-3, and BLOOM. These models have shown promising results in various generation tasks such as open-ended dialogue, summarization, and story generation. Improved decoding methods have played a significant role in the success of these models.

Auto-regressive language generation assumes that the text being generated can be broken down into a sequence of subparts. Each part depends on the previous parts, allowing an auto-regressive decoder to generate text one token at a time based on its predecessors.

The probability of generating a word sequence $w_{1:T}$ given an initial context word sequence $W_0$ can be expressed as:

$$ P(w_{1:T} | W_0 ) = \prod_{t=1}^T P(w_{t} | w_{1: t-1}, W_0) \text{ ,with }  w_{1: 0} = \emptyset, $$

Here, $W_0$ is the initial context word sequence. The length $T$ of the word sequence is determined on-the-fly, which means it is decided as the sequence is generated. The sequence generation typically stops when an End-Of-Sequence (EOS) token is generated from the probability distribution $P(w_t | w_{1:t-1}, W_0)$.
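The factorization above can be sketched in a few lines of Python. This is only an illustration: the table `COND_PROBS` and its probabilities are invented stand-ins for a real model's conditional distributions, and the function name `sequence_log_prob` is hypothetical.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_{1:t-1}, W_0), keyed by the
# words generated so far. The context W_0 is left implicit.
COND_PROBS = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"time": 0.5, "<eos>": 0.5},
    ("a", "time"): {"<eos>": 1.0},
}

def sequence_log_prob(tokens, cond_probs):
    """Chain rule in log space: sum of log P(w_t | w_{1:t-1}, W_0)."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        prefix = tuple(tokens[:t])  # everything generated before step t
        log_p += math.log(cond_probs[prefix][token])
    return log_p

# Probability of generating "a time <eos>" under the toy table.
seq_prob = math.exp(sequence_log_prob(["a", "time", "<eos>"], COND_PROBS))
print(round(seq_prob, 6))
```

Summing log-probabilities rather than multiplying raw probabilities is a common choice because products of many small numbers underflow quickly for long sequences.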

This auto-regressive approach allows LLMs to generate coherent and contextually relevant text based on the initial context $W_0$. Different decoding strategies, such as greedy search, beam search, and sampling methods like top-k sampling, have been used to improve the generation quality and diversity.

For example, consider a language model generating a story based on the context "Once upon a time, in a small village":
1. Greedy search would select the word with the highest probability at each step, potentially leading to repetitive and less diverse text.
2. Beam search would maintain a fixed number of partial sequences (the beam width) and extend them, selecting the most probable overall sequence. This usually finds higher-probability sequences than greedy search but may still suffer from repetitiveness.
3. Top-k sampling would sample the next word from the top-k most probable words, increasing diversity in the generated text.

These strategies help LLMs generate meaningful and diverse text for various language generation tasks.


In [None]:
# If you run this notebook in Colab, set Hardware accelerator to GPU.
# Then, install transformers
%pip install -U transformers tensorflow

In [14]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


### Greedy Search

```{image} ../figs/aiart/entelecheia_greedy_search.png
:alt: Greedy Search
:class: bg-primary mb-1
:width: 50%
:align: center
```

Greedy search is a decoding strategy used in language generation tasks. It works by selecting the word with the highest probability as the next word in the generated sequence, given the previous words.

In each step of the generation process, the model computes the probabilities of all possible words, given the context of the previously generated words. Greedy search then chooses the word with the highest probability and appends it to the output sequence. This process is repeated until a predefined stopping condition is met, such as reaching the maximum output length or generating an end-of-sequence (EOS) token.

While greedy search is computationally efficient and straightforward to implement, it has some drawbacks. The main limitation is that it can generate suboptimal output sequences since it doesn't explore other possible word combinations. It always chooses the locally optimal word without considering the global context, which might lead to less coherent or less diverse generated text. Other search strategies, like beam search or nucleus sampling, can help overcome these limitations by exploring a larger space of possible output sequences.

```{image} ../figs/deep_nlp/zero/deepnlp_2_greedy_search.png
:alt: Greedy Search Algorithm
:class: bg-primary mb-1
:width: 70%
:align: center
```

The next word is chosen using the formula $w_t = \operatorname{argmax}_{w}P(w | w_{1:t-1})$ at each timestep $t$, where $w_t$ is the next word and $w_{1:t-1}$ are the previous words in the sequence.

For example, starting with the word "The", the algorithm evaluates the probabilities of all possible next words and greedily selects the one with the highest probability, such as "nice". The process is repeated to generate the subsequent words in the sequence. In this case, the final generated word sequence is ("The", "nice", "woman"). The overall probability of this sequence is calculated by multiplying the probabilities of each chosen word, which is $0.5 \times 0.4 = 0.2$ in this example.
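The toy example can be reproduced with a minimal sketch. Only the probabilities 0.5 for "nice" and 0.4 for "woman" come from the text; every other entry in `TOY_PROBS` is an invented placeholder, and `greedy_search` is a hypothetical helper, not a library function.

```python
# Hypothetical next-word table. Only 0.5 ("nice") and 0.4 ("woman") are taken
# from the text; the remaining entries are made up for illustration.
TOY_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def greedy_search(start, probs, steps):
    """Pick the argmax next word at every step and track the sequence probability."""
    sequence = tuple(start)
    total_prob = 1.0
    for _ in range(steps):
        candidates = probs.get(sequence)
        if not candidates:
            break  # no known continuation for this prefix
        word = max(candidates, key=candidates.get)
        total_prob *= candidates[word]
        sequence += (word,)
    return sequence, total_prob

greedy_seq, greedy_prob = greedy_search(("The",), TOY_PROBS, steps=2)
print(greedy_seq, greedy_prob)
```

Note that the function never revisits an earlier choice, which is exactly why a strong continuation behind a slightly weaker first word goes unseen.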



In [15]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(
    "I enjoy studying deep learning for natural language processing",
    return_tensors="tf",
)

# generate text until the output length (which includes the context length) reaches 100
greedy_output = model.generate(input_ids, max_length=100)

print("Output:\n" + 100 * "-")
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, but I'm not sure how to apply it to real-world applications.

I'm not sure how to apply it to real-world applications. I'm not sure how to apply it to real-world applications. I'm not sure how to apply it to real-world applications. I'm not sure how to apply it to real-world applications. I'm not sure how to apply it to real-world applications. I'm not


-   Generating word sequences with GPT-2: To generate word sequences using GPT-2, you provide a context, such as ("I", "enjoy", "studying", "deep", "learning", "for", "natural", "language", "processing"), and the model predicts the most likely words to follow the given context.
    
-   Repetitive output: A common issue in language generation, especially when using greedy or beam search, is that the model often starts repeating itself. This occurs because these search strategies tend to get stuck in a loop of selecting locally optimal words without considering the broader context.
    
-   Drawbacks of greedy search:
    
    -   Misses high probability words: Greedy search can miss high probability words that are hidden behind a low probability word. Since the algorithm always chooses the word with the highest probability at each step, it doesn't explore other word combinations that could lead to better overall sequences.
    -   Lack of diversity: Greedy search may generate less diverse and less coherent text, as it only focuses on the locally optimal choice. Other search strategies, like beam search or nucleus sampling, can help mitigate these issues by exploring a larger space of possible output sequences and considering global context.

### Beam search

```{image} ../figs/aiart/entelecheia_beam_search.png
:alt: Beam Search
:class: bg-primary mb-1
:width: 50%
:align: center
```

Beam search is a decoding strategy used in language generation tasks to find output sequences with a higher overall probability than greedy search. It is a pruned form of breadth-first search that maintains a fixed number of candidate sequences, called "beams," at each step of the generation process.

The main idea behind beam search is to explore multiple word options at each step, rather than choosing only the word with the highest probability, as in greedy search. The algorithm starts by selecting the top k words with the highest probabilities, where k is the beam size. It then extends each of these words with the top k words, given the context. This results in k * k new candidate sequences. The algorithm keeps only the top k sequences with the highest overall probabilities and discards the rest.

The process is repeated at each timestep until a predefined stopping condition is met, such as reaching the maximum output length or generating an end-of-sequence (EOS) token. The candidate sequence with the highest overall probability is selected as the final output.

Beam search offers a balance between computational complexity and the quality of generated text. A larger beam size increases the chances of finding better output sequences but also increases the computational cost. On the other hand, a smaller beam size is more computationally efficient but is more likely to miss high-probability sequences.

In summary, beam search is a decoding strategy that explores multiple word combinations during text generation, which can lead to higher-probability and often more coherent output compared to greedy search. However, it comes at the cost of increased computational complexity, depending on the beam size.

```{image} ../figs/deep_nlp/zero/deepnlp_2_greedy_search.png
:alt: Beam Search Algorithm
:class: bg-primary mb-1
:width: 60%
:align: center
```


- Beam search with `num_beams=2`: Beam search is a decoding strategy that considers multiple hypotheses during text generation. In this example, we set the beam size to 2, meaning the algorithm will keep track of the two most likely word sequences at each timestep.
- At timestep 1: Instead of only considering the most likely hypothesis ("The", "nice"), as in greedy search, beam search also maintains the second most likely hypothesis ("The", "dog").
- At timestep 2: Beam search evaluates the probabilities of extending both hypotheses with the top two words. It finds that the word sequence ("The", "dog", "has"), with a probability of 0.36, is more likely than ("The", "nice", "woman"), which has a probability of 0.2.
- Optimal solution found: In this toy example, beam search successfully discovers the most likely word sequence, which was missed by the greedy search.
- Comparison to greedy search: Beam search generally finds output sequences with higher probabilities than greedy search. However, it's not guaranteed to always find the most likely output, especially for large search spaces or small beam sizes. The quality of the generated text depends on the beam size, with larger beam sizes typically producing better results at the cost of increased computational complexity.
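The walkthrough above can be sketched as a small beam search over a toy table. The probabilities 0.5, 0.4 (for "woman"), and the 0.36 total for ("The", "dog", "has") follow the text; the other entries in `TOY_PROBS` are invented, and `beam_search` is a hypothetical helper written for this illustration.

```python
# Hypothetical next-word table, chosen so that the numbers in the text hold:
# 0.5 * 0.4 = 0.2 for "The nice woman" and 0.4 * 0.9 = 0.36 for "The dog has".
TOY_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def beam_search(start, probs, steps, num_beams):
    """Keep the `num_beams` most probable hypotheses after every step."""
    beams = [(tuple(start), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, prob in beams:
            next_words = probs.get(sequence)
            if not next_words:
                # No known continuation: carry the hypothesis over unchanged.
                candidates.append((sequence, prob))
                continue
            for word, p in next_words.items():
                candidates.append((sequence + (word,), prob * p))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]  # prune to the beam width
    return beams

best_seq, best_prob = beam_search(("The",), TOY_PROBS, steps=2, num_beams=2)[0]
print(best_seq, round(best_prob, 2))
```

With `num_beams=1` the same function degenerates to greedy search, which makes the relationship between the two strategies easy to check.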


- Set `num_beams > 1` and `early_stopping=True` so that generation is finished when all beam hypotheses have reached the EOS token.

In [16]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    early_stopping=True,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, and I'm excited to see how it can be applied to real-world applications.

What is Deep Learning?

Deep learning is a type of artificial intelligence (AI) that can be applied to real-world applications. Deep learning is a type of artificial intelligence (AI) that can be applied to real-world applications.

Deep learning is a type of artificial intelligence (AI) that can be applied to real-world applications


- While the result is arguably more fluent, the output still includes repetitions of the same word sequences.
- A simple remedy is to introduce *n-gram* penalties as introduced by Paulus et al. (2017) and Klein et al. (2017).
- The most common *n-gram* penalty makes sure that no *n-gram* appears twice by manually setting the probability of next words that could create an already seen *n-gram* to 0.
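The idea behind such a penalty can be sketched in plain Python. This is an illustration of the principle only, not the library's implementation; the helper names `banned_next_tokens` and `apply_ngram_penalty` are hypothetical.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would complete an n-gram already in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens are the prefix of the n-gram about to be formed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

def apply_ngram_penalty(next_probs, generated, ngram_size):
    """Set the probability of every banned continuation to 0."""
    banned = banned_next_tokens(generated, ngram_size)
    return {w: (0.0 if w in banned else p) for w, p in next_probs.items()}

history = ["I", "am", "not", "sure", "I", "am"]
next_probs = {"not": 0.7, "very": 0.2, "quite": 0.1}
penalized = apply_ngram_penalty(next_probs, history, ngram_size=2)
print(penalized)
```

Here the 2-gram ("am", "not") has already occurred, so "not" is blocked after the second "am" and the decoder must choose a different continuation.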



- Setting `no_repeat_ngram_size=2` to prevent 2-gram repetitions: By including this parameter in the generate function, we ensure that no 2-gram appears twice in the generated text.

In [18]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, and I'm excited to see how it can be applied to real-world applications.

In this post, I'll show you how you can use Deep Learning to build a neural network that can learn to read and write a sentence. In this article, you'll learn how to use the Deep Neural Network (DNN) to learn a language. You'll also learn about how the DNN works and what you need to do to get started


- Improved output: The generated text no longer contains repeated 2-grams, which results in a more coherent output. However, it's important to use n-gram penalties with caution, as they can prevent the repetition of important phrases. For example, a text about "New York" should not have a 2-gram penalty, because the city's name could then appear only once in the whole text.
- Comparing and selecting the best beam: To compare the top beams after generation and choose the best one, set the `num_return_sequences` parameter to the desired number of highest-scoring beams to return. Ensure that `num_return_sequences <= num_beams`.


In [19]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    max_length=100,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=3,
    early_stopping=True,
)

# now we have 3 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(
        i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy studying deep learning for natural language processing, and I'm excited to see how it can be applied to real-world applications.

In this post, I'll show you how you can use Deep Learning to build a neural network that can learn to read and write a sentence. In this article, you'll learn how to use the Deep Neural Network (DNN) to learn a language. You'll also learn about how the DNN works and what you need to do to get started
1: I enjoy studying deep learning for natural language processing, and I'm excited to see how it can be applied to real-world applications.

In this post, I'll show you how you can use Deep Learning to build a neural network that can learn to read and write a sentence. In this article, you'll learn how to use the Deep Neural Network (DNN) to learn a language. You'll also learn about how the DNN works and what you need to know about it to
2: I e

- Analyzing the results: The output shows three different generated sequences, each being a top-scoring beam. The differences between these beams might be marginal, especially when using a small number of beams (e.g., 5). By examining these alternatives, you can select the one that best fits your requirements.

In open-ended text generation tasks, there are several reasons why beam search might not be the most suitable option:

-   Predictable length vs. varying length: Beam search tends to perform well in tasks where the desired length of the generated output is more or less predictable, such as machine translation or summarization. However, in open-ended generation tasks like dialog and story generation, the desired output length can vary significantly, making beam search less optimal.
    
-   Repetitive generation: Beam search often results in repetitive text generation. This issue can be particularly challenging to control in tasks like story generation, where applying n-gram or other penalties to prevent repetition may require extensive fine-tuning to strike the right balance between avoiding repetitive phrases and forcing unnatural "no-repetition" constraints.
    
-   Human-like language generation: High-quality human language typically contains a mix of both predictable and surprising elements. In other words, as humans, we appreciate generated text that is not too predictable or boring. According to a study by [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751), beam search tends to favor high-probability words, which may lead to less engaging and less human-like text generation.
    

In summary, while beam search can be effective for certain text generation tasks, its limitations in handling varying output lengths, repetitiveness, and the need for more surprising and engaging text make it less suitable for open-ended generation tasks like dialog and story generation.

```{image} ../figs/deep_nlp/zero/deepnlp_2_beam_vs_human.png
:alt: Beam Search vs Human
:class: bg-primary mb-1
:width: 60%
:align: center
```


### Sampling

Sampling means randomly picking the next word $w_t$ according to its conditional probability distribution:

$$ w_t \sim P(w|w_{1:t-1}) $$

The following graphic visualizes language generation when sampling.

![sampling](../figs/deep_nlp/zero/deepnlp_2_sampling_search.png)

Language generation using sampling is not *deterministic* anymore.
The word ("car") is sampled from the conditional probability distribution $P(w | \text{"The"})$, followed by sampling ("drives") from $P(w | \text{"The"}, \text{"car"})$.
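A single sampling step can be sketched with the standard library. The distribution `NEXT_AFTER_THE` is an invented stand-in for $P(w | \text{"The"})$, and `sample_next_word` is a hypothetical helper.

```python
import random

def sample_next_word(next_probs, rng):
    """Draw one word at random, weighted by its conditional probability."""
    words = list(next_probs)
    weights = [next_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical distribution standing in for P(w | "The").
NEXT_AFTER_THE = {"nice": 0.5, "dog": 0.4, "car": 0.1}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [sample_next_word(NEXT_AFTER_THE, rng) for _ in range(1000)]
counts = {w: draws.count(w) for w in NEXT_AFTER_THE}
print(counts)
```

Over many draws the counts approach the underlying probabilities, but any single draw can land on a low-probability word such as "car", which is what makes sampled text less predictable.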

- Set `do_sample=True` and deactivate *Top-K* sampling via `top_k=0`.

In [20]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=0,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------


- The text seems alright - but when taking a closer look, it is not very coherent. 
- Some words don't sound like they were written by a human. 
- That is the big problem when sampling word sequences: The models often generate incoherent gibberish.

- A trick is to make the distribution $P(w|w_{1:t-1})$ sharper (increasing the likelihood of high probability words and decreasing the likelihood of low probability words) by lowering the so-called `temperature` of the [softmax](https://en.wikipedia.org/wiki/Softmax_function#Smooth_arg_max).

![temperature](../figs/deep_nlp/zero/deepnlp_2_sampling_search_with_temp.png)

- The conditional next word distribution of step t=1 becomes much sharper leaving almost no chance for word ("car") to be selected.

- Cool down the distribution in the library by setting `temperature=0.7`.

In [21]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=0,
    temperature=0.7,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, but this is a simplistic example of how the language can be applied to a complex problem.

Using more complex architectures

The most common approach is to look at a whole set of ML designs that are quite complex and require a lot of work to deploy. For example, the first ML project I'm learning is the human language learning project. Let's take a look at the human-language learning project.

The current human-language


- There are fewer weird n-grams and the output is a bit more coherent now.
- While applying temperature can make a distribution less random, in its limit, when setting `temperature` $\to 0$, temperature scaled sampling becomes equal to greedy decoding and will suffer from the same problems as before.
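The effect of temperature can be sketched as a softmax over logits divided by the temperature. The logits below are invented, and `softmax_with_temperature` is a hypothetical helper rather than a library function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words

p_default = softmax_with_temperature(logits, 1.0)
p_cool = softmax_with_temperature(logits, 0.7)
p_near_zero = softmax_with_temperature(logits, 0.01)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_cool])
print([round(p, 3) for p in p_near_zero])
```

Lowering the temperature moves probability mass toward the top-scoring word; as the temperature approaches 0, nearly all mass sits on the argmax, which is the greedy choice.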



### Top-K Sampling


![top_k](../figs/deep_nlp/zero/deepnlp_2_top_k_sampling.png)

In *Top-K* sampling, the *K* most likely next words are filtered and the probability mass is redistributed among only those *K* next words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

- Having set $K = 6$, in both sampling steps we limit our sampling pool to 6 words. 
- While the 6 most likely words, defined as $V_{\text{top-K}}$, encompass only about two-thirds of the whole probability mass in the first step, they include almost all of the probability mass in the second step.
- Nevertheless, we see that it successfully eliminates the rather weird candidates ("not", "the", "small", "told") in the second sampling step.
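The filtering step itself is short enough to sketch directly. The distribution `NEXT_PROBS` is invented for illustration, and `top_k_filter` is a hypothetical helper that operates on probabilities rather than on the model's logits.

```python
def top_k_filter(next_probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    ranked = sorted(next_probs.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:k]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}

# Hypothetical next-word distribution with a tail of unlikely candidates.
NEXT_PROBS = {
    "nice": 0.35, "dog": 0.25, "car": 0.20,
    "house": 0.10, "told": 0.06, "not": 0.04,
}

filtered = top_k_filter(NEXT_PROBS, k=3)
print({w: round(p, 4) for w, p in filtered.items()})
```

Sampling then proceeds from `filtered` instead of the full distribution, so the unlikely tail can no longer be drawn.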

Let's see how *Top-K* can be used in the library by setting `top_k=50`:

In [22]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=50,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing, as well as in terms of learning from experience. As well as having my students who are learning an intermediate language and learn how to use it in a meaningful way or with a minimal effort, I can also focus on learning programming languages like C++ (though I wouldn't go so far as to call that "advanced"), Python (for those that aren't familiar with Python) or Java (I think I actually want to see what I can learn


- The text is arguably the most *human-sounding* text so far. 
- One concern with *Top-K* sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution $P(w|w_{1:t-1})$. 
- This can be problematic as some words might be sampled from a very sharp distribution (distribution on the right in the graph above), whereas others are sampled from a much flatter distribution (distribution on the left in the graph above).


- In step $t=1$, Top-K eliminates the possibility to sample ("people","big","house","cat"), which seem like reasonable candidates. 
- On the other hand, in step $t=2$ the method includes the arguably ill-fitted words ("down","a") in the sample pool of words. 
- Thus, limiting the sample pool to a fixed size $K$ could lead the model to produce gibberish for sharp distributions and limit the model's creativity for flat distributions.
- This intuition led Ari Holtzman et al. (2019) to create ***Top-p***- or ***nucleus***-sampling.

### Top-p (nucleus) sampling

![top_p](../figs/deep_nlp/zero/deepnlp_2_top_p_sampling.png)

- Instead of sampling only from the most likely *K* words, *Top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*.
- The probability mass is then redistributed among this set of words. 
- This way, the size of the set of words (*a.k.a* the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. 


- Having set $p=0.92$, *Top-p* sampling picks the *minimum* number of words whose combined probability exceeds $92\%$ of the probability mass; this set is denoted $V_{\text{top-p}}$.
- In the first example, this included the 9 most likely words, whereas it only has to pick the top 3 words in the second example to exceed 92%. 
- It can be seen that it keeps a wide range of words where the next word is arguably less predictable, *e.g.* $P(w | \text{"The"})$, and only a few words when the next word seems more predictable, *e.g.* $P(w | \text{"The"}, \text{"car"})$.
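The dynamic set size is easy to see in a small sketch. Both distributions below are invented (they are not the ones in the figure), and `top_p_filter` is a hypothetical helper that treats reaching `p` exactly as sufficient.

```python
def top_p_filter(next_probs, p):
    """Keep the smallest set of top-ranked words whose cumulative mass reaches p."""
    ranked = sorted(next_probs.items(), key=lambda item: item[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Redistribute the probability mass among the kept words.
    return {w: q / cumulative for w, q in nucleus}

# Hypothetical flat and sharp next-word distributions.
FLAT = {"nice": 0.30, "dog": 0.25, "car": 0.20, "house": 0.15, "cat": 0.10}
SHARP = {"drives": 0.80, "is": 0.15, "turns": 0.03, "a": 0.02}

flat_nucleus = top_p_filter(FLAT, p=0.92)
sharp_nucleus = top_p_filter(SHARP, p=0.92)
print(len(flat_nucleus), len(sharp_nucleus))
```

With the same threshold, the flat distribution keeps all five of its words while the sharp one keeps only two, which is the adaptive behaviour a fixed *K* cannot provide.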


Activate *Top-p* sampling by setting `0 < top_p < 1`:

In [23]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_p=0.92,
    top_k=0,
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy studying deep learning for natural language processing. Learn more at lecture.lightlink.com.

Ciao — Social conscious "synchronization" between conscious and nonconscious participants at the Cognition Experience. Learn more at lecture.lightlink.com.

Everett's Processio

A video discussion

The Evolution of Mindfulness in Today's Economy

The Recent Advances in the Study of Mindfulness and Success in Society. The results from St Louis University


Great, that sounds like it could have been written by a human. Well, maybe not quite yet.

While in theory, *Top-p* seems more elegant than *Top-K*, both methods work well in practice. 
*Top-p* can also be used in combination with *Top-K*, which can avoid very low ranked words while allowing for some
dynamic selection.
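One plausible way to combine the two filters is to apply the *Top-K* cut first and the *Top-p* cut on the renormalized remainder. This ordering is an assumption made for the sketch, not a statement about how the library composes them; the distribution and the helper `sample_top_k_top_p` are hypothetical.

```python
import random

def sample_top_k_top_p(next_probs, k, p, rng):
    """Apply a top-k cut, then a top-p cut, then sample from what is left."""
    ranked = sorted(next_probs.items(), key=lambda item: item[1], reverse=True)
    ranked = ranked[:k]  # top-k: drop very low ranked words
    total = sum(prob for _, prob in ranked)
    kept, cumulative = [], 0.0
    for word, prob in ranked:  # top-p on the renormalized probabilities
        kept.append((word, prob / total))
        cumulative += prob / total
        if cumulative >= p:
            break
    words = [w for w, _ in kept]
    weights = [q for _, q in kept]
    # random.choices accepts unnormalized weights, so no second renormalization.
    return rng.choices(words, weights=weights, k=1)[0], words

# Hypothetical sharp next-word distribution.
SHARP = {"drives": 0.80, "is": 0.15, "turns": 0.03, "a": 0.02}

rng = random.Random(0)
word, kept_words = sample_top_k_top_p(SHARP, k=3, p=0.95, rng=rng)
print(word, kept_words)
```

In this example *Top-K* removes the lowest ranked word, and *Top-p* then shrinks the pool further because two words already cover more than 95% of the remaining mass.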

Finally, to get multiple independently sampled outputs, we can *again* set the parameter `num_return_sequences > 1`:

In [24]:
# set seed to reproduce results. Feel free to change the seed though to get different results
tf.random.set_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=100,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(
        i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy studying deep learning for natural language processing, as I've learned it quickly enough for me to understand it. The idea of modeling learning comes from both of these worlds. The first is that of natural language processing and the second is a mathematical theory that allows you to draw meaningful conclusions. To understand what is in the world of Natural Language Processing, go to the book by Mike Biederman of the University of Toronto or check out his website at www.layers.com. You are also
1: I enjoy studying deep learning for natural language processing. My favourite part, even though it is just about my main job, is learning to play a game. That is exactly what I am doing on my website. All this is part of my job as a developer and I love playing games.

My job with this company was to create a website where my students could explore the world as they would en

## Summary of decoding / search strategies

As *ad-hoc* decoding methods, *top-p* and *top-K* sampling seem to produce more fluent text than traditional *greedy* and *beam* search on open-ended language generation. Recently, though, there has been more evidence that the apparent flaws of *greedy* and *beam* search (mainly generating repetitive word sequences) are caused by the model, especially the way the model is trained, rather than the decoding method, *cf.* [Welleck et al. (2019)](https://arxiv.org/pdf/1908.04319.pdf). Also, as demonstrated in [Welleck et al. (2020)](https://arxiv.org/abs/2002.02492), it looks as if *top-K* and *top-p* sampling also suffer from generating repetitive word sequences.

- Greedy Search 
  - simply chooses the next word at each timestep t+1 that has the highest predicted probability of following the word at t. 
  - One of the main issues here is that greedy search will miss words with a high probability at t+1 if they are preceded by a word with a low probability at t.


- Beam Search 
  - keeps track of the n (`num_beams`) most likely word sequences and outputs the most likely sequence.
  - Sounds great, but this method breaks down when the output length can be highly variable — as in the case of open-ended text generation. 
  - Both greedy and beam search also produce outputs whose distribution does not align very well with the way humans might perform the same task (i.e. both are liable to produce fairly repetitive, boring text).


- Sampling With Top-k + Top-p
  - a combination of three methods. 
  - By sampling, we mean that the next word is chosen randomly based on its conditional probability distribution (von Platen, 2020). 
  - In Top-k, we choose the k most likely words, and then redistribute the probability mass amongst them before the next draw. 
  - Top-p adds an additional constraint to top-k, in that we’re choosing from the smallest set of words whose cumulative probability exceeds p.
  


## Prompt Engineering: The Career of Future

```{image} ../figs/deep_nlp/zero/deepnlp_2_prompt.png
:alt: Prompt Engineering
:class: bg-primary mb-1
:width: 50%
:align: center
```

(source: https://twitter.com/karpathy/status/1273788774422441984/photo/1)

> With the No-Code revolution around the corner, and the coming of new-age technologies like GPT-3 we may see a stark difference between the career of today and the careers of tomorrow…

As a rule of thumb, while designing the training prompt you should aim for a zero-shot response from the model. If that isn’t possible, move forward with a few examples rather than providing it with an entire corpus. The standard flow for training prompt design should look like: Zero-Shot → Few-Shot → Corpus-based Priming.

- Step 1: Define the problem you are trying to solve and bucket it into one of the possible natural language tasks: classification, Q & A, text generation, creative writing, etc.
- Step 2: Ask yourself if there is a way to get a solution with zero-shot (i.e. without priming the GPT-3 model with any external training examples)
- Step 3: If you think that you need external examples to prime the model for your use case, go back to step-2 and think really hard.
- Step 4: Now think of how you might encounter the problem in a textual fashion given the “text-in, text-out” interface of GPT-3. Think about all the possible scenarios to represent your problem in textual form.
- Step 5: If you end up using the external examples, use as few as possible and try to include variety in your examples without essentially overfitting the model or skewing the predictions.
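The zero-shot to few-shot progression can be sketched as plain string construction. The `Input:`/`Output:` layout, the sentiment task, and the helper `build_prompt` are all assumptions chosen for illustration; they are not a format prescribed by GPT-3 or by the text above.

```python
def build_prompt(task_description, query, examples=()):
    """Build a text-in, text-out prompt; with no examples it is zero-shot."""
    lines = [task_description, ""]
    for text, label in examples:  # optional few-shot demonstrations
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

TASK = "Classify the sentiment of the input as positive or negative."

# Step 2: try zero-shot first.
zero_shot = build_prompt(TASK, "I loved this movie.")

# Step 5: if examples are needed, use as few and as varied as possible.
few_shot = build_prompt(
    TASK,
    "I loved this movie.",
    examples=[("The food was awful.", "negative"), ("What a great day!", "positive")],
)

print(zero_shot)
print("-" * 40)
print(few_shot)
```

Keeping the example list as an optional argument makes it easy to start with the zero-shot prompt and add demonstrations only when the zero-shot response is not good enough.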