To install dependencies run the following commands:
```bash
conda create -n env python=3.10
conda activate env 
pip install transformers[torch] pandas jupyterlab scipy google-generativeai
jupyter lab
```

## Text Generation and Prompting

Moshe Barboy

Information Theory and Language Models, Spring 2024

School of Computer Science

Reichman University


# Chain of thought prompting

![chainofthough](chainofthought.png)

## abstract

In this presentation we will see how to generate text with language models and how reformulating the model's input can improve the output (provided the model is big enough or was trained in a specific way). I hope that by the end of the presentation everything related to the Chain-of-Thought article will be clear.

## Agenda 
- Main sources
- Language model architecture (specifically GPT1/2/3)
- Tokenization and context - you will be amazed by the amount of memory today's models have.
- Generation tactics (temperature, beam search, repetition penalty) - AKA don't ever trust other people's code
- Prompting taxonomy and X-shot learning, $X \in \{\text{Zero}, \text{One}, \text{Few}\}$
- Chain of Thought (CoT) prompting - that's where I review the relevant article and the picture above
- Experiment - last letter concatenation (using PaLM 2)

## Main sources 



[https://www.promptingguide.ai/introduction](https://www.promptingguide.ai/introduction) 

The blog reference used to construct this presentation. It contains a general overview and some deeper explanations of prompt engineering.  

[Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165) <- GPT3 article. Here prompt engineering was introduced for the GPT3 model. The article shows how big a model needs to be for prompt engineering to work in the first place.

[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/pdf/2201.11903) <- The article this whole presentation is basically about

[Automatic Chain of Thought Prompting in Large Language Models](https://arxiv.org/pdf/2210.03493) <- some more advanced material I didn't get to dive into.

## LM architecture 

A very general overview of the models we will talk about 

### [Attention is all you need](https://arxiv.org/pdf/1706.03762)
GPT is based on the transformer architecture. It uses only the decoder part, without the cross-attention mechanism. Later versions improved the model, but the general architecture remained similar.
![transformer](transformer.png)

### [GPT](https://gwern.net/doc/www/s3-us-west-2.amazonaws.com/d73fdc5ffa8627bce44dcda2fc012da638ffb158.pdf) 

A variant of the transformer architecture (only the decoder is used):

![GPT](gpt.png)



### [GPT2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

Same as GPT, except that layer normalization was moved from the output to the input of each sub-block, and an additional layer normalization was added after the final self-attention block.

### [GPT3](https://arxiv.org/pdf/2005.14165)

Same as GPT2, but with "alternating dense and locally banded sparse attention patterns in the layers of the transformer" (see the sparse transformers link below).

[sparse transformers](https://openai.com/index/sparse-transformer/)


In [8]:
! pip3 install --upgrade pip

Collecting pip
  Downloading pip-24.1-py3-none-any.whl.metadata (3.6 kB)
Downloading pip-24.1-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.3.2
    Uninstalling pip-23.3.2:
      Successfully uninstalled pip-23.3.2
Successfully installed pip-24.1


In [17]:
!pip3 install torch



In [18]:
# all the imports that were used in this notebook
import transformers
import pandas as pd
import numpy as np
import torch
import scipy
import zlib
import json 
import tqdm
import pprint
#import google.generativeai as palm
import itertools 
import pytest

As a first attempt at prompt engineering, I load the GPT2 model. It is too small to be a good model for this task, but its speed and low resource requirements make it easier to develop with.

Together with the model we load its tokenizer.

In [19]:
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")

ImportError: 
AutoModelForCausalLM requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.


### GPT2 architecture - code view

Here we can see how the GPT2 architecture is implemented in the Hugging Face transformers package with PyTorch.
Printing one of the blocks of the transformer shows the components inside it. To see how they are connected, one also needs to print the content of the model's "forward" method (specific to PyTorch implementations of models).

In [None]:
model.transformer.h[0]

GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D()
    (c_proj): Conv1D()
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
  )
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (mlp): GPT2MLP(
    (c_fc): Conv1D()
    (c_proj): Conv1D()
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.1, inplace=False)
  )
)

The following excerpt was printed using `model.transformer.h[0].forward??`:
```python
residual = hidden_states
hidden_states = self.ln_1(hidden_states)
attn_outputs = self.attn(
    hidden_states,
    layer_past=layer_past,
    attention_mask=attention_mask,
    head_mask=head_mask,
    use_cache=use_cache,
    output_attentions=output_attentions,
)
attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
outputs = attn_outputs[1:]
# residual connection
hidden_states = attn_output + residual

residual = hidden_states
hidden_states = self.ln_2(hidden_states)
feed_forward_hidden_states = self.mlp(hidden_states)
# residual connection
hidden_states = residual + feed_forward_hidden_states
```

## Tokenizer (BPE Tokenization) and context 
In this part I want to give some intuition about how much memory large language models (LLMs) have. We will check out the tokenizer of GPT2 and the size of its context. Then we will compare the memory of different OpenAI LLMs.  

This section is not about diving into the tokenizer itself. You can do that [here](https://arxiv.org/pdf/1508.07909) if you want. In general, the token dictionary is created by starting from all the available characters and repeatedly adding longer character combinations, based on the most common pairs in some training corpus.
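To make the "most common pairs" idea concrete, here is a minimal sketch of how BPE merges could be learned on a toy corpus. It is an illustration only: the function name `learn_bpe_merges` and the word list are made up for this example, and the real GPT2 tokenizer works on bytes rather than characters and was trained on a far larger corpus.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE: start from single characters and repeatedly merge the most common adjacent pair."""
    # Every word is a tuple of symbols; initially each symbol is one character.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Rewrite the corpus with the best pair fused into a single symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus, chosen only for illustration.
words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, corpus = learn_bpe_merges(words, num_merges=3)
print(merges)        # the learned merge rules, in order
print(list(corpus))  # how each word is segmented after the merges
```

Each merge adds one new entry to the token dictionary, so the number of merges controls the vocabulary size.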


In [None]:
[(t,tokenizer.decode(t)) for t in tokenizer.encode("a bb c")]

[(64, 'a'), (275, ' b'), (65, 'b'), (269, ' c')]

An important thing to notice is that spaces are incorporated in the tokens and some strings have a separate token for appearing with or without a space. 

Here are the first and last tokens in the GPT2 tokenizer:

In [None]:
[(t,tokenizer.decode(t)) for t in range(0,100)]

[(0, '!'),
 (1, '"'),
 (2, '#'),
 (3, '$'),
 (4, '%'),
 (5, '&'),
 (6, "'"),
 (7, '('),
 (8, ')'),
 (9, '*'),
 (10, '+'),
 (11, ','),
 (12, '-'),
 (13, '.'),
 (14, '/'),
 (15, '0'),
 (16, '1'),
 (17, '2'),
 (18, '3'),
 (19, '4'),
 (20, '5'),
 (21, '6'),
 (22, '7'),
 (23, '8'),
 (24, '9'),
 (25, ':'),
 (26, ';'),
 (27, '<'),
 (28, '='),
 (29, '>'),
 (30, '?'),
 (31, '@'),
 (32, 'A'),
 (33, 'B'),
 (34, 'C'),
 (35, 'D'),
 (36, 'E'),
 (37, 'F'),
 (38, 'G'),
 (39, 'H'),
 (40, 'I'),
 (41, 'J'),
 (42, 'K'),
 (43, 'L'),
 (44, 'M'),
 (45, 'N'),
 (46, 'O'),
 (47, 'P'),
 (48, 'Q'),
 (49, 'R'),
 (50, 'S'),
 (51, 'T'),
 (52, 'U'),
 (53, 'V'),
 (54, 'W'),
 (55, 'X'),
 (56, 'Y'),
 (57, 'Z'),
 (58, '['),
 (59, '\\'),
 (60, ']'),
 (61, '^'),
 (62, '_'),
 (63, '`'),
 (64, 'a'),
 (65, 'b'),
 (66, 'c'),
 (67, 'd'),
 (68, 'e'),
 (69, 'f'),
 (70, 'g'),
 (71, 'h'),
 (72, 'i'),
 (73, 'j'),
 (74, 'k'),
 (75, 'l'),
 (76, 'm'),
 (77, 'n'),
 (78, 'o'),
 (79, 'p'),
 (80, 'q'),
 (81, 'r'),
 (82, 's'),
 (83, 't'),
 

In [None]:
[(t,tokenizer.decode(t)) for t in range(50250,50260)]

[(50250, ' amplification'),
 (50251, 'ominated'),
 (50252, ' regress'),
 (50253, ' Collider'),
 (50254, ' informants'),
 (50255, ' gazed'),
 (50256, '<|endoftext|>'),
 (50257, ''),
 (50258, ''),
 (50259, '')]

Let's check how many tokens we need to encode a relatively long-looking text:

In [None]:
text="""
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.

The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.
A: The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A: 

The answer is True.
"""

len(text.split())

120

In [None]:
len(
    tokenizer.encode(text)
)

192

The example above needs 192 tokens for 120 words. Given that the model has a context of 1024 tokens, its memory is very impressive even for GPT2, which is already considered ancient. Below is a table showing how much the GPT models and their contexts grew over the last years. For comparison, older LSTM models (before the attention mechanism was introduced) were generally good at remembering about 30 elements back, and for n-grams the maximal context is n, which is usually even smaller than that.

In [None]:
pd.DataFrame({
    "ctx":[512,1024,2048,8192,128000],
    "parameters($10^9$)":[0.117,1.5,175,"~1760","?"],
    "year":[2018,2019,2020,2023,2024],
},
index = ["GPT","GPT-2","GPT-3","GPT-4","GPT-4o"]
)

| | ctx | parameters($10^9$) | year |
|---|---|---|---|
| GPT | 512 | 0.117 | 2018 |
| GPT-2 | 1024 | 1.5 | 2019 |
| GPT-3 | 2048 | 175 | 2020 |
| GPT-4 | 8192 | ~1760 | 2023 |
| GPT-4o | 128000 | ? | 2024 |
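As a back-of-the-envelope reading of these context sizes, the tokens-per-word ratio measured above (192 tokens for 120 words) can be turned into an approximate word budget. This is only a rough sketch: the ratio depends heavily on the text (the sample was full of numbers), so the results are order-of-magnitude estimates.

```python
# Ratio measured on the sample text above: 192 tokens for 120 words.
tokens_per_word = 192 / 120

# Context sizes taken from the table; the word counts are rough estimates only.
approx_words = {}
for name, ctx in [("GPT-2", 1024), ("GPT-4o", 128000)]:
    approx_words[name] = round(ctx / tokens_per_word)
    print(f"{name}: about {approx_words[name]} words fit in the context")
```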


Let's make sure that the GPT2 model works for a vector of 1024 zeros and crashes for 1025:

In [None]:
with torch.no_grad():
    output = model(torch.zeros(1,1024).int())

In [None]:
output.logits.shape

torch.Size([1, 1024, 50257])

In [None]:
with pytest.raises(IndexError): # this block fails unless the code inside it raises an IndexError exception.
    with torch.no_grad():
        output = model(torch.zeros(1,1025).int())

## Generation tactics

In this section we will go over generation tactics. How do we generate text from a prompt using a language model?

In [None]:
prompt = "How can one build a house?" # the example prompt we will use throughout this notebook (feel free to change it)
tokens = tokenizer.encode(prompt) # converting the prompt to tokens for the model 
tokens # printing the tokens for our specific prompt

[2437, 460, 530, 1382, 257, 2156, 30]

The model predicts the next token: given a list of tokens, it returns a score (logit) for each token in the vocabulary that expresses how likely it is to be the next token. To generate a string from a prompt we generate token by token. On each iteration we choose the token with the highest score and append it to the list, until we get the special `<|endoftext|>` token or the number of generated tokens reaches 50 (a limit chosen for faster execution).

In [None]:
answer = []
log_prob = 0
for _ in range(50): # generate a new token at most 50 times
    # run the model and take the logits of the last token 
    with torch.no_grad():
        tokens_tensor = torch.Tensor(tokens + answer).unsqueeze(0).int()
        output = model(tokens_tensor)
    logits = output.logits.numpy()[0,-1]    
    # choose the token with the highest score and append it to the tokens
    next_token = logits.argmax()
    if next_token==50256: # 50256='<|endoftext|>'. Stop if stopping token is reached
        break
    answer.append(next_token)
    # update the log probability of the answer
    probabilities = scipy.special.softmax(logits)
    log_prob += np.log(probabilities[next_token])

The length of the generated answer is: 

In [None]:
len(answer)

50

Let's decode and see how the model completed the prompt:

In [None]:
print(tokenizer.decode(answer))



The answer is simple: build a house.

The first step is to build a house.

The second step is to build a house.

The third step is to build a house.

The fourth step is


## [Temperature](https://arxiv.org/pdf/1503.02531) 

If we want to make the model's output random, we can apply a softmax to the logits provided by the model and sample tokens according to the resulting probabilities. With a temperature T we can tweak the probabilities to be more or less uniform. In the limit T→0 we get the behavior above (the model generates the token with the highest logit); for T→∞ we get a uniform distribution and the model generates tokens completely at random.

$$
\mathrm{Prob}(j;T)=\mathrm{softmax}(\mathrm{logits}/T)_j=\frac{\exp(\mathrm{logits}_j/T)}{\sum_i\exp(\mathrm{logits}_i/T)}
$$

The code above was modified to include sampling with temperature (temperature 0 is handled as a plain argmax). It was also wrapped in a function for easier use.

In [None]:
def infer(prompt,temperature=0,seed=5,max_length=50):
    np.random.seed(seed)
    tokens = tokenizer.encode(prompt)
    
    answer = []
    log_prob = 0
    for _ in range(max_length): 
        with torch.no_grad():
            tokens_tensor = torch.Tensor(tokens + answer).unsqueeze(0).int()
            output = model(tokens_tensor)
        logits = output.logits.numpy()[0,-1]    
        if temperature==0: # <- adding the temperature. 0 is the same as argmax. >0 means we sample the next token based on the probabilities
            next_token = logits.argmax()
        else:
            probabilities = scipy.special.softmax(logits/temperature)
            next_token = np.random.choice(range(len(probabilities)),size=None,p=probabilities)
        if next_token==50256: # 50256='<|endoftext|>'
            break
        answer.append(next_token)
        probabilities = scipy.special.softmax(logits)
        log_prob += np.log(probabilities[next_token])
    answer_str = tokenizer.decode(answer)
    return answer_str, log_prob

Here are some examples of how higher temperatures give a more random output:

In [None]:
temperature=0
res,prob=infer("How can one build a house?",temperature=temperature)
print(temperature,prob)
print(res)

0 -36.7519825608324


The answer is simple: build a house.

The first step is to build a house.

The second step is to build a house.

The third step is to build a house.

The fourth step is


In [None]:
temperature=0.5
res,prob=infer("How can one build a house?",temperature=temperature)
print(temperature,prob)
print(res)

0.5 -44.870093021811044


The answer is simple: build a house.

The answer to that question is simple: build a house.

Why build a house?

The answer to that question is simple: build a house.

Why build


In [None]:
temperature=1
res,prob=infer("How can one build a house?",temperature=temperature)
print(temperature,prob)
print(res)

1 -240.43827770277858


In Making Up To Nine years ago, the struggling owner of Malva Kuggedelha (the main building in the Kardandelpla area of Mumbai) built. He built 144 units (171 condo) and 311 buildings (106


In [None]:
temperature=10
res,prob=infer("How can one build a house?",temperature=temperature)
print(temperature,prob)
print(res)

10 -634.5288157463074
adersMorningabsExcellentaliation dens spectral 501statusta property Operator Cannon punish stuffingaha Buddha Strong Founder fiery USSR Mun disagree]; Mes offensive hits flattering unsu trail lastrating setback eaterits RallyYmedium Viktor Sneak Own hr appellate Statenk mechanicoller overhaul idolsrendered


In [None]:
temperature=100
res,prob=infer("How can one build a house?",temperature=temperature)
print(temperature,prob)
print(res)

100 -639.5775985717773
 Points pellets yeah Xander arsenal npmizontal regimes friendshipistance corporradioDamageraction undraftedOME Odd Bear Qin classmates Sang rotation safer Victor facialdiv surgery sucker inciner efficiencyired weigh Weiner030We Wrongoopus 1915 mishand Sheriff HavanaIRO renovated�aundering Kusariusorahestation


In our specific case, for temperatures above 1 the model just returns random tokens.
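The flattening effect of the temperature can also be seen on a tiny example. The sketch below applies the softmax-with-temperature formula to four made-up logits (they are not GPT2 outputs) and reports the entropy of the resulting distribution; the maximum possible entropy for four outcomes is 2 bits, which corresponds to a uniform distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T), computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means closer to uniform."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

toy_logits = [5.0, 3.0, 1.0, 0.0]  # hypothetical logits, for illustration only
entropies = []
for T in [0.5, 1, 10, 100]:
    probs = softmax_with_temperature(toy_logits, T)
    entropies.append(entropy_bits(probs))
    print(T, [round(p, 3) for p in probs], round(entropies[-1], 3))
```

With a vocabulary of about 50,000 tokens instead of 4, a near-uniform distribution means that almost every sampled token is unrelated to the prompt.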

### Beam Search  

Sometimes the model chooses a wrong combination of words just because it is more common in the language. For example, when translating "I like soccer" to Spanish, the right translation should start with "Me gusta". However, in Spanish "Me gustan" is generally more common, which results in the model giving it a higher logit. On later tokens the mistake shows up: the continuation gets a lower overall probability than the right translation would. To enable the model to correct its decision we can use Beam Search: at each step we keep several candidate generation paths (for example 3), and at the end we choose the path that is most likely overall.

Example: translating "I like soccer" to Spanish

![beam search](beamsearch.png)

Figure 16-6 from Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron
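The same effect can be reproduced with a toy "language model" given as a lookup table. All probabilities below are invented for illustration; they are not taken from the book or from any real model. Unlike the GPT2 implementation in this notebook, this sketch expands every possible next token of every beam, which is feasible only because the toy vocabulary is tiny.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
TOY_MODEL = {
    (): {"me": 1.0},
    ("me",): {"gustan": 0.6, "gusta": 0.4},
    ("me", "gustan"): {"los": 0.4, "el": 0.3, "fútbol": 0.3},
    ("me", "gusta"): {"el": 0.9, "los": 0.1},
}

def beam_search(model, beam_width, steps):
    """Keep the beam_width most probable prefixes after every step."""
    beams = [((), 0.0)]  # pairs of (prefix, log probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # the most probable path overall

greedy = beam_search(TOY_MODEL, beam_width=1, steps=3)  # beam width 1 is greedy decoding
beam = beam_search(TOY_MODEL, beam_width=2, steps=3)
print("greedy:", " ".join(greedy[0]), round(math.exp(greedy[1]), 2))
print("beam:  ", " ".join(beam[0]), round(math.exp(beam[1]), 2))
```

Greedy decoding commits to "gustan" because it looks best locally, while a beam of width 2 keeps "gusta" alive long enough to discover that its continuation is more probable overall.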

The code below is a modified version of the infer method that also supports Beam Search - on each step we consider several tokens instead of just one. 

In [None]:
def infer(prompt,temperature=0,seed=5,beam_num=1,max_length=50,verbose=False):
    """
    Generates text using GPT2. 
    Note: On top of the previous method, we are now keeping track of the <beam_num> most probable answers at each step. 

    input:
        prompt - input text 
        temperature=0 - how much to smooth the probability distribution
        seed=5 - seed for the sampling 
        beam_num=1 - beam size
        max_length=50 - when to stop generating 
        verbose=False - whether to print the answer after each step
    
    returns: 
        answer_str - the resulting answer
        best_prob - the log probability the model assigned to the resulting answer 
    """
    np.random.seed(seed)
    tokens = tokenizer.encode(prompt)
    answers =[[]]
    answer_probs= [0]
    for step in range(max_length): 
        with torch.no_grad():
            tokens_tensor = torch.Tensor([tokens + answer for answer in answers]).int()
            output = model(tokens_tensor)
        logits = output.logits.numpy()[:,-1,:]    
        
        if temperature==0:
            next_tokens = np.argsort(logits)[:,-beam_num:]
        else:
            probabilities = scipy.special.softmax(logits/temperature,axis=-1)
            next_tokens = [np.random.choice(range(len(p_row)),size=beam_num,replace=False,p=p_row) for p_row in probabilities]
            next_tokens = np.r_[next_tokens]
        next_probs = scipy.special.softmax(logits,axis=-1)

        ## generating all new answers and choosing the best <beam_num> answers
        new_answers = []
        new_answer_probs = []
        for answer,answer_prob,next_tokens_row,next_prob in zip(answers,answer_probs,next_tokens,next_probs):
            for next_token_inst in next_tokens_row:
                new_answers.append(answer+[next_token_inst]) 
                new_answer_probs.append(answer_prob + np.log(next_prob[next_token_inst]))
        best_answer_ids = np.argsort(new_answer_probs)[-beam_num:]
        answers = [new_answers[best_answer_id] for best_answer_id in best_answer_ids]
        answer_probs = [new_answer_probs[best_answer_id] for best_answer_id in best_answer_ids]

        if verbose: # printing result after each step if verbose=True
            print(f"step {step}")
            print(*list(map(tokenizer.decode,answers)),sep="\n")
            print(answer_probs)
            print("-------------------")
            
    # choosing the most probable answer
    best_answer = answers[np.argmax(answer_probs)]
    best_prob = answer_probs[np.argmax(answer_probs)]
    
    answer_str = tokenizer.decode(best_answer)
    return answer_str, best_prob

In [None]:
print(*infer("How can one build a house?",temperature=2,beam_num=3))

 Your simple units may spend money (hands-down new terracotta stuff). Long lives depend fully on finishing a complete idea you invented sometime back (written story card is a go though!). Your usual household knows about cheap babysitting polygons and cheap -294.5887782369973


In [None]:
print(*infer("How can one build a house?",temperature=0,beam_num=3))

 How can one build a house? How can one build a house? How can one build a house? How can one build a house? How can one build a house? How can one build a house? How can one build a house? How -14.82656359457178


In [None]:
print(*infer("How can one build a house?",temperature=1))



In Making Up To Nine years ago, the struggling owner of Malva Kuggedelha (the main building in the Kardandelpla area of Mumbai) built. He built 144 units (171 condo) and 311 buildings (106 -240.43827770277858


In [None]:
print(*infer("How can one build a house?",temperature=0))



The answer is simple: build a house.

The first step is to build a house.

The second step is to build a house.

The third step is to build a house.

The fourth step is -36.7519825608324


In [None]:
print(*infer("How can one build a house?",temperature=2,beam_num=2,verbose=True,max_length=5))

step 0
 shredded
 Ter
[-21.747697830200195, -9.572603225708008]
-------------------
step 1
 Ter farmers
 TerWAR
[-25.038633346557617, -24.506505012512207]
-------------------
step 2
 Ter farmers hoping
 TerWAR issues
[-35.27052307128906, -34.76681995391846]
-------------------
step 3
 TerWAR issues win
 Ter farmers hoping for
[-46.107449531555176, -37.76639771461487]
-------------------
step 4
 Ter farmers hoping for downt
 Ter farmers hoping for freedom
[-53.32565426826477, -45.25116801261902]
-------------------
 Ter farmers hoping for freedom -45.25116801261902


## A way to deal with repetition

An interesting way to deal with repetition in generation is to compress the resulting answer. If the compression ratio is too high, the answer probably contains repetitions, and we can raise the temperature and generate an answer again. Here is an example of this tactic:

In [None]:
def compression_ratio(text):
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))

compression_threshold = 1.5 

for temperature in [0,0.2,0.5,0.8,1,3,5,10]:
    answer, prob = infer("How can one build a house?",temperature=temperature)
    if compression_ratio(answer)<compression_threshold:
        break
print(answer)
print(compression_ratio(answer),prob,temperature)



The Hindu Lord says that by working in a manner that is conducive to his own welfare, his life can be brought to a liking by the one who works hard, for, as he indicates, who wants to be good stuff, there is
1.4217687074829932 -156.10825242474675 0.8
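To get a feel for the 1.5 threshold, the same `compression_ratio` function can be applied to a repetitive string and to a non-repetitive one. The two strings below are made up for this illustration (the first one imitates the greedy GPT2 output shown earlier); exact ratios depend on the text and on the zlib version.

```python
import zlib

def compression_ratio(text):
    """Original size divided by zlib-compressed size; repetition gives a high ratio."""
    text_bytes = text.encode("utf-8")
    return len(text_bytes) / len(zlib.compress(text_bytes))

# Illustrative strings, not model outputs.
repetitive = "The first step is to build a house.\n" * 5
varied = "Pick a site, pour the foundation, frame the walls, then add the roof."

print("repetitive:", compression_ratio(repetitive))
print("varied:    ", compression_ratio(varied))
```

Note that very short texts barely compress at all because of the fixed zlib overhead, so the threshold is only meaningful for answers of similar length.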


## Never implement yourself

It turns out that Hugging Face has already implemented multiple generation tactics in the library, so implementing them on your own is not a must.

Here is how all of the above, and much more, can be done with Hugging Face:

In [None]:
def infer_hf(input_text,**kwargs):
    input_ids = tokenizer.encode(input_text,return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        generation_config=transformers.generation.configuration_utils.GenerationConfig(
            **kwargs
        )
    )
    output_text = tokenizer.decode(output_ids.numpy()[0])
    answer = output_text[len(input_text):]
    return answer


infer_hf(
    "How one builds a house?",
    max_new_tokens=100,
    do_sample=True,
    num_beams=3,
    temperature=2.,
    top_k=50,
    repetition_penalty=1 # not sure if that is exactly equivalent to what I did. 
)

' What the heck does it mean?\n\nIf you want a house, there are several steps you can take.\n\nBuild Your House\n\nYou want to build your home by finding the right materials to make the home. These will give you some idea of the level of care and care required to maintain a home. These may include materials such as plywood and concrete, but this is usually also where you look for the "right materials".\n\nBuilding Materials\n\nIf your home is'

# Prompting

Now that we know how the LLM works and how it can be used to answer a prompt, let's talk about the prompt itself and how it can be constructed.

## examples of prompts 

Prompt:
```
### Instruction ###
Translate the text below to Spanish:

Text: "hello!"
```
Output:
```
¡Hola!
```

Prompt:
```
Q: <Question>?
A: <Answer>

Q: <Question>?
A: <Answer>

Q: <Question>?
A: <Answer>

Q: <Question>?
A:
```


## Prompt elements 


A prompt may contain any of the following elements: instruction, context, input data, and output indicator.

Instruction - a specific task or instruction you want the model to perform

Context - external information or additional context that can steer the model to better responses

Input Data - the input or question that we are interested to find a response for

Output Indicator - the type or format of the output.


Example:

Prompt:
```
Classify the text into neutral, negative, or positive <- instruction
Text: I think the food was okay. <- data
Sentiment:
```
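The four elements can be assembled programmatically. The helper below is a hypothetical sketch (the name `build_prompt` and the `Text:` prefix are assumptions made for this example); it reproduces the sentiment prompt above, where no context element is used.

```python
def build_prompt(instruction, input_data, output_indicator, context=None):
    """Join the prompt elements, one per line; the context is optional."""
    parts = [instruction]
    if context:
        parts.append(context)
    parts.append(f"Text: {input_data}")
    parts.append(output_indicator)
    return "\n".join(parts)

prompt = build_prompt(
    instruction="Classify the text into neutral, negative, or positive",
    input_data="I think the food was okay.",
    output_indicator="Sentiment:",
)
print(prompt)
```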

## zero/one/few shots 

The GPT3 article introduced the idea of "training" the model with new examples directly in the prompt. The intuition is that a task description together with some examples of the task helps the model fetch the relevant knowledge it has seen during training. The authors distinguish between different X-shot settings, where X is the number of examples shown to the model. This is explained in the following figure from the article:

![prompts](promptshots.png)

Example of few-shot learning:

In [None]:
infer("""
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.

The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.
A: The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A:
""",max_length=10)

('\nThe answer is True.\n\nThe odd', -1.5613370465289336)
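The only difference between zero-, one- and few-shot prompts is how many solved examples precede the query. A small sketch of this, using the `Q:`/`A:` template shown earlier, could look as follows; the function name `build_x_shot_prompt` and the translation examples are illustrative assumptions.

```python
def build_x_shot_prompt(task_description, examples, query, shots):
    """Build a prompt with `shots` solved examples: 0 = zero-shot, 1 = one-shot, more = few-shot."""
    lines = [task_description]
    for question, answer in examples[:shots]:
        lines.append(f"Q: {question}\nA: {answer}\n")
    # The query is left unanswered so that the model completes it.
    lines.append(f"Q: {query}\nA:")
    return "\n".join(lines)

# Illustrative examples only.
examples = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
task = "Translate English to French."

zero_shot = build_x_shot_prompt(task, examples, "cheese", shots=0)
few_shot = build_x_shot_prompt(task, examples, "cheese", shots=2)
print(few_shot)
```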

Another point the article makes is that for prompting to work, one must use a large enough model. The authors trained models with different numbers of parameters and evaluated them on some simple linguistic tasks, such as word unscrambling. The results are presented in the next figure:

![bigmodelsneeded](bigmodelsneeded.png)

From this figure it can be deduced that GPT2, whose largest variant has only about 1.5B parameters (and the `gpt2` checkpoint loaded above is the smallest variant, at roughly 0.1B), is not a good choice for prompting. We will address this issue later in the presentation.

# [Chain of thought prompting](https://arxiv.org/pdf/2201.11903) 

We have reached the main topic of this presentation. Now that you understand how large modern LLMs are, how to use them for generation, and what prompting actually is, we can talk about nuances of the prompt itself.  
The paper shows that for middle-school-level math problems, few-shot prompting with fully explained answers improves the model's accuracy:  

![chainofthough](chainofthought.png)

We can try to measure the effect with GPT2, though the results are not expected to be good:

In [None]:
infer("""
Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to
make lunch and bought 6 more, how many apples
do they have?
""")

('A: The answer is 3.\nQ: Roger has a lot of money. He has a lot of\nsugar. He has a lot of money. He has a lot of money.\nQ: Roger has a lot of money.',
 -51.93085535179125)

In [None]:
infer("""
Q: Roger has 5 tennis balls. He buys 2 more cans of
tennis balls. Each can has 3 tennis balls. How many
tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls
each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to
make lunch and bought 6 more, how many apples
do they have?
""")

('A: The cafeteria had 23 apples. If they used 20 tomake lunch and bought 6 more, how many apples do they have now?\nQ: Roger has a lot of money. He has a lot of money. He has a lot of',
 -35.67198682320304)

### GSM8K dataset

One of the datasets used in the article. Here are some examples from the dataset:

In [None]:
with open("gsm8k/train.jsonl") as f:
    gsm=pd.DataFrame([json.loads(l) for l in f])
with open("gsm8k/test.jsonl") as f:
    gsm_test=pd.DataFrame([json.loads(l) for l in f])

In [None]:
gsm

| | question | answer |
|---|---|---|
| 0 | Natalia sold clips to 48 of her friends in Apr... | Natalia sold 48/2 = <<48/2=24>>24 clips in May... |
| 1 | Weng earns $12 an hour for babysitting. Yester... | Weng earns 12/60 = $<<12/60=0.2>>0.2 per minut... |
| 2 | Betty is saving money for a new wallet which c... | In the beginning, Betty has only 100 / 2 = $<<... |
| 3 | Julie is reading a 120-page book. Yesterday, s... | Maila read 12 x 2 = <<12*2=24>>24 pages today.... |
| 4 | James writes a 3-page letter to 2 different fr... | He writes each friend 3*2=<<3*2=6>>6 pages a w... |
| ... | ... | ... |
| 7468 | Very early this morning, Elise left home in a ... | For the distance she traveled, Elise paid 23 -... |
| 7469 | Josh is saving up for a box of cookies. To rai... | He makes $.5 profit on each bracelet because 1... |
| 7470 | Colin can skip at six times the speed that Bra... | Tony can skip at twice the speed that Bruce ca... |
| 7471 | Janet, a third grade teacher, is picking up th... | Janet needs 35 lunches for the kids + 5 for th... |


In [None]:
gsm.iloc[5].question

'Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?'

In [None]:
gsm.iloc[5].answer

"There are 80/100 * 10 = <<80/100*10=8>>8 more purple flowers than yellow flowers.\nSo in Mark's garden, there are 10 + 8 = <<10+8=18>>18 purple flowers.\nPurple and yellow flowers sum up to 10 + 18 = <<10+18=28>>28 flowers.\nThat means in Mark's garden there are 25/100 * 28 = <<25/100*28=7>>7 green flowers.\nSo in total Mark has 28 + 7 = <<28+7=35>>35 plants in his garden.\n#### 35"
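As the example shows, a GSM8K answer consists of reasoning steps with calculator annotations in `<<...>>`, followed by the final answer after `####`. A minimal sketch of separating the two parts could look like this; the helper name `parse_gsm8k_answer` is an assumption, and the sample string is an excerpt of the answer above.

```python
import re

def parse_gsm8k_answer(answer):
    """Split a GSM8K answer into (reasoning without annotations, final answer)."""
    # Everything after the last '####' is the final answer.
    # If '####' is missing, the whole text ends up in `final`.
    reasoning, _, final = answer.rpartition("####")
    # Remove calculator annotations such as <<10+18=28>>.
    reasoning = re.sub(r"<<.*?>>", "", reasoning).strip()
    return reasoning, final.strip()

# Excerpt of the answer printed above.
sample = (
    "Purple and yellow flowers sum up to 10 + 18 = <<10+18=28>>28 flowers.\n"
    "So in total Mark has 28 + 7 = <<28+7=35>>35 plants in his garden.\n"
    "#### 35"
)
reasoning, final = parse_gsm8k_answer(sample)
print(reasoning)
print("final answer:", final)
```

The cleaned reasoning is what a chain-of-thought example in a prompt would contain, while the final answer is what a model's output is compared against.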

The article presents the performance of several LLMs with and without fully explained answers. Giving fully explained answers in the prompt is called "Chain-of-Thought prompting", or CoT:

![cotresults](cotresults.png)

## Experiment - Last letter concatenation

To experiment with all of the above, I tried to recreate the last letter concatenation experiment described in the article: take the most popular first names and surnames in the US, combine them into full names, and ask the model to return the concatenation of the last letters of the words in each name.

I tried to use GPT2 for this task, but it didn't give any coherent results. To make the experiment more interesting, I used PaLM 2 via the online API provided by Google, which gave more interesting results. However, to run the cells that use the API, you need to add a Google API key. Also, the API may have changed since then, so the results might not be reproducible.

To create the API key, go to https://pypi.org/project/google-generativeai/

In [None]:
surenames = pd.read_csv("popular_surenames.csv").Name.str.lower()
firstnames = pd.read_csv("popular_firstnames.csv").Name.str.lower()

In [None]:
surenames.head()

0       smith
1     johnson
2    williams
3       brown
4       jones
Name: Name, dtype: object

In [None]:
names = list(itertools.product(firstnames,surenames))

In [None]:
dataset = pd.DataFrame(np.array(names),columns=["first","last"])

In [None]:
dataset = dataset.assign(
    prompt = dataset["first"]+" "+dataset["last"],
    reg_sol= ":" + dataset["first"].str[-1]+dataset["last"].str[-1] + " \n" 
)

In [None]:
dataset = dataset.assign(
    cot_sol= " -> First name ends with " + dataset["first"].str[-1] + 
    ", Last name ends with " + dataset["last"].str[-1] + ",so the solution is " + dataset["reg_sol"] + " \n" 
)

In [None]:
dataset = dataset.sample(frac=0.1)
test = dataset.iloc[:10]
train = dataset.iloc[-3:]

In [None]:
test.head()

Unnamed: 0,first,last,prompt,reg_sol,cot_sol
311150,ruby,freeman,ruby freeman,:yn \n,"-> First name ends with y, Last name ends wit..."
32142,logan,mcdonald,logan mcdonald,:nd \n,"-> First name ends with n, Last name ends wit..."
120849,enzo,benjamin,enzo benjamin,:on \n,"-> First name ends with o, Last name ends wit..."
471582,londyn,mora,londyn mora,:na \n,"-> First name ends with n, Last name ends wit..."
455213,annie,hudson,annie hudson,:en \n,"-> First name ends with e, Last name ends wit..."


In [None]:
test.iloc[0].cot_sol

' -> First name ends with y, Last name ends with n,so the solution is :yn \n \n'

These are the few-shot examples we will use to build the CoT and the regular prompts: 

In [None]:
few_shot_cot = "".join(train.prompt + train.cot_sol)
print(few_shot_cot)
few_shot_reg = "".join(train.prompt + train.reg_sol)
print(few_shot_reg)

ximena medina -> First name ends with a, Last name ends with a,so the solution is :aa 
 
eli tapia -> First name ends with i, Last name ends with a,so the solution is :ia 
 
lilah collier -> First name ends with h, Last name ends with r,so the solution is :hr 
 

ximena medina:aa 
eli tapia:ia 
lilah collier:hr 



In [None]:
tqdm.tqdm.pandas()
results_reg=test.prompt.progress_apply(lambda x:infer(few_shot_reg+x,max_length=3)[0])

100%|██████████| 10/10 [00:03<00:00,  2.81it/s]


In [None]:
tqdm.tqdm.pandas()
results_cot=test.prompt.progress_apply(lambda x:infer(few_shot_cot+x)[0])

100%|██████████| 10/10 [02:38<00:00, 15.87s/it]


In [None]:
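# reg_sol looks like ":yn \n", so characters 1-2 are the two-letter label (1:4 would keep a trailing space)
gpt2_results = test.assign(label=test.reg_sol.str[1:3])[["first","last","label"]].join(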
    results_cot.str[-2:].rename("cot")
).join(
    results_reg.str.strip().str[-2:].rename("regular")    
)

In [None]:
# gpt2_results.to_csv("gpt2_results_cache.csv")
# gpt2_results = pd.read_csv("gpt2_results_cache.csv")

In [None]:
gpt2_results

Unnamed: 0,first,last,label,cot,regular
311150,ruby,freeman,yn,an,hr
32142,logan,mcdonald,nd,\n\n,hr
120849,enzo,benjamin,on,\nl,hr
471582,londyn,mora,na,\nl,ia
455213,annie,hudson,en,\nl,hr
329536,hailey,marsh,yh,sh,ia
36895,grayson,espinosa,na,\nl,ia
438128,alexandra,herrera,aa,\nl,hr
366536,eden,marsh,nh,\nl,ia
79000,easton,smith,nh,\nl,hr


In [None]:
pd.read_csv("gpt2_results_cache.csv")

Unnamed: 0.1,Unnamed: 0,label,cot,regular
0,295039,is,he,ae
1,408372,nn,me,ae
2,468311,hr,he,ae
3,14044,kl,j,ae
4,192042,on,me,ae
5,282918,es,me,eo
6,169280,lr,he,eo
7,376213,en,me,ae
8,954,my,ny,ae
9,448651,yl,th,ae


In [None]:
results_cot.iloc[0]

' -> First name ends with s, Last name ends with t,so the solution is :hr \n\nruby freeman -> Last name ends with s, Last name ends with t,so the solution is :hr \n\nruby freeman'

In [None]:
results_reg.iloc[0]

':hr '
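
Note that slicing the last two characters of a completion does not necessarily return the model's answer: when the model keeps generating after the answer, the slice picks up whatever comes last. As a hedged sketch, the answer could instead be read from the first `the solution is :` marker in the completion. The helper `extract_cot_answer` and the regular expression are assumptions for illustration, and they rely on the completion following the few-shot format used in this notebook.

```python
import re

# Matches the letters that follow the first "the solution is :" marker.
ANSWER_RE = re.compile(r"the solution is\s*:\s*([a-z]+)")


def extract_cot_answer(completion):
    """Return the answer after the first 'the solution is :' marker, or None if absent."""
    match = ANSWER_RE.search(completion)
    return match.group(1) if match else None


# Completion copied from the GPT-2 CoT output shown in this notebook.
completion = (
    " -> First name ends with s, Last name ends with t,so the solution is :hr \n\n"
    "ruby freeman -> Last name ends with s, Last name ends with t,so the solution is :hr \n\n"
    "ruby freeman"
)

print("last two characters:", repr(completion[-2:]))
print("extracted answer:   ", repr(extract_cot_answer(completion)))
```

Either way the GPT-2 answer is wrong for `ruby freeman` (the label is `yn`), but the extracted value is at least what the model actually proposed.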

## Trying the experiment with PaLM 2 

In [None]:
with open("api_key.txt") as f:
    # strip() removes the trailing newline that text editors usually add
    palm.configure(api_key=f.read().strip())

In [None]:
models = [m for m in palm.list_models() if 'generateText' in m.supported_generation_methods]
model_name = models[0].name
print(model_name)


models/text-bison-001


In [None]:
models

[Model(name='models/text-bison-001',
       base_model_id='',
       version='001',
       display_name='PaLM 2 (Legacy)',
       description='A legacy model that understands text and generates text as an output',
       input_token_limit=8196,
       output_token_limit=1024,
       supported_generation_methods=['generateText', 'countTextTokens', 'createTunedTextModel'],
       temperature=0.7,
       top_p=0.95,
       top_k=40)]

In [None]:
completion = palm.generate_text(
    model=model_name,
    prompt="How does one build a house?",
    temperature=0,
    # The maximum length of the response
    max_output_tokens=800,
)

print(completion.result)


1. **Choose a plot of land.** The first step is to find a suitable plot of land for your home. This will depend on your budget, the size of the house you want, and the location you prefer.
2. **Get the necessary permits and approvals.** Before you can start building your house, you will need to get the necessary permits and approvals from your local government. This process can take some time, so it's important to factor it into your timeline.
3. **Design your house.** Once you have a plot of land, you can start designing your house. You will need to work with an architect or designer to create a plan that meets your needs and budget.
4. **Build your foundation.** The foundation is the base of your house and it is essential to get it right. This is where the weight of the house will be supported, so it needs to be strong and stable.
5. **Frame your house.** The frame of your house is made up of the walls, roof, and floor. This is the structure that will hold up the rest of the house.
6

### Evidence that the model fails with regular few-shot prompting

In [None]:
palm.generate_text(
    model=model_name,
    prompt="Do a last letter abriviation: Avi Hirsch-> ih \n Yoni Mendel -> il \n Johny Boy -> ",
    temperature=0,
    # The maximum length of the response
    max_output_tokens=800,
).result

'jb'

The expected answer for "Johny Boy" is `yy`. The model returned the first letters of each word instead of the last ones.

In [None]:
def palm_gen(prompt):
    """Return the PaLM 2 completion for `prompt` at temperature 0."""
    return palm.generate_text(
        model=model_name,
        prompt=prompt,
        temperature=0,
        # The maximum length of the response
        max_output_tokens=800,
    ).result

In [None]:
tqdm.tqdm.pandas()
results_reg_palm=test.prompt.progress_apply(lambda x:palm_gen(few_shot_reg + x))

100%|██████████| 10/10 [00:06<00:00,  1.50it/s]


In [None]:
tqdm.tqdm.pandas()
results_cot_palm=test.prompt.progress_apply(lambda x:palm_gen(few_shot_cot+x))

100%|██████████| 10/10 [00:05<00:00,  1.76it/s]


In [None]:
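# reg_sol looks like ":yn \n", so characters 1-2 are the two-letter label (1:4 would keep a trailing space)
palm_results = test.assign(label=test.reg_sol.str[1:3])[["label","prompt"]].join(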
    results_cot_palm.str[-2:].rename("cot")
).join(
    results_reg_palm.str[-2:].rename("regular")    
)

In [None]:
# palm_results.to_csv("palm_results_cache2.csv")

In [None]:
palm_results

Unnamed: 0,label,prompt,cot,regular
311150,yn,ruby freeman,yn,rr
32142,nd,logan mcdonald,nd,rr
120849,on,enzo benjamin,on,
471582,na,londyn mora,na,hr
455213,en,annie hudson,in,
329536,yh,hailey marsh,yh,ia
36895,na,grayson espinosa,na,
438128,aa,alexandra herrera,aa,ia
366536,nh,eden marsh,,ia
79000,nh,easton smith,nh,rr
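
To summarize the table above as a single number per prompting style, one option is exact-match accuracy. The sketch below copies the ten rows by hand and treats the empty cells as empty strings, which is an assumption about what the model actually returned there. `exact_match_accuracy` is a hypothetical helper, not part of the notebook.

```python
# (label, cot, regular) rows copied from the palm_results table above.
rows = [
    ("yn", "yn", "rr"),
    ("nd", "nd", "rr"),
    ("on", "on", ""),
    ("na", "na", "hr"),
    ("en", "in", ""),
    ("yh", "yh", "ia"),
    ("na", "na", ""),
    ("aa", "aa", "ia"),
    ("nh", "", "ia"),
    ("nh", "nh", "rr"),
]


def exact_match_accuracy(labels, predictions):
    """Fraction of predictions that equal their label after stripping whitespace."""
    pairs = list(zip(labels, predictions))
    return sum(label.strip() == pred.strip() for label, pred in pairs) / len(pairs)


labels = [label for label, _, _ in rows]
cot_accuracy = exact_match_accuracy(labels, [cot for _, cot, _ in rows])
regular_accuracy = exact_match_accuracy(labels, [reg for _, _, reg in rows])

print(f"CoT accuracy:     {cot_accuracy:.1f}")
print(f"Regular accuracy: {regular_accuracy:.1f}")
```

With only ten test names this is an illustration rather than a measurement, but on these rows CoT prompting matches 8 of 10 labels while regular prompting matches none.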


In [None]:
pd.read_csv("palm_results_cache.csv",index_col=0)

Unnamed: 0,label,cot,regular
130676,nd,rd,ey
102356,tn,tn,ey
419932,ne,ne,ey
22860,nf,nn,ey
71813,er,er,ey
418635,ha,ha,ey
320855,rr,rs,ey
155106,rn,rn,m
315459,nz,nz,ey
105672,es,et,ey


# Article results for reference

The following graph shows the performance of the models in the paper. It compares standard prompting with CoT prompting on the last letter concatenation and coin flip experiments. Results are shown both in and out of the training domain, that is, whether the test examples matched the few-shot examples in the number of coin flips or the number of words in each name. You can also see the results for models of different sizes (different versions of PaLM).  

![articleres](articleres.png)
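
For readers who want to try the coin flip task as well, here is a rough sketch of how in-domain and out-of-domain examples could be generated. The question wording only approximates the template in the paper, and the function name, the names used, and the choice of two people for in-domain versus four for out-of-domain are assumptions based on my reading of the article.

```python
import random


def make_coin_flip_example(names, rng):
    """Build one coin flip question and its yes/no answer for the given people."""
    flips = [rng.random() < 0.5 for _ in names]
    sentences = [
        f"{name} flips the coin." if flipped else f"{name} does not flip the coin."
        for name, flipped in zip(names, flips)
    ]
    question = "A coin is heads up. " + " ".join(sentences) + " Is the coin still heads up?"
    # The coin is still heads up exactly when it was flipped an even number of times.
    answer = "yes" if sum(flips) % 2 == 0 else "no"
    return question, answer


rng = random.Random(0)  # fixed seed so the examples are repeatable
in_domain = make_coin_flip_example(["ruby", "logan"], rng)
out_of_domain = make_coin_flip_example(["ruby", "logan", "enzo", "annie"], rng)

for question, answer in (in_domain, out_of_domain):
    print(question, "->", answer)
```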

# What next? 

There is much more to learn about prompt engineering, and this presentation only covers the tip of the iceberg. To get a better understanding of the research conducted on the topic over the last two years, see the following survey article: 

[A Systematic Survey of Prompt Engineering in Large Language Models:
Techniques and Applications](https://arxiv.org/pdf/2402.07927)

![pes](promptengineeringsurvey.png)