# Ben Needs a Friend - In-Context Learning
This is part of the "Ben Needs a Friend" tutorial.  See all the notebooks and materials [here](https://github.com/bpben/ben_friend).

This notebook is intended to be run in Kaggle Notebooks with GPU acceleration.  Access that version [here](https://www.kaggle.com/code/bpoben/ben-needs-a-friend-in-context-learning). 

In this notebook, I provide a brief intro how we'll be setting up and interacting with LLMs.

## Tutorial setup
For most of this tutorial we will be working in a Kaggle notebook.  This same notebook should work on your local machine, you will just need to change the path to the model or install Ollama (see README).

In [1]:
# we will be using langchain for some consistent/clean wrappers on top of LLMs
!pip install --quiet langchain==0.1.12 
!pip install --quiet bitsandbytes accelerate

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
keras-cv 0.8.2 requires keras-core, which is not installed.
keras-nlp 0.8.2 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.

In [26]:
# defining some of the parameters we'll need 
# replace this with a model from HF hub or the path to the model in your local directory
# if running on Kaggle - you should have all the models we need in this notebook loaded
model_name = '/kaggle/input/mistral/pytorch/7b-v0.1-hf/1'
instruct_model_name = '/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1'

In [3]:
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig
from langchain.prompts.chat import (ChatPromptTemplate, 
                                    HumanMessagePromptTemplate, 
                                    SystemMessagePromptTemplate)

2024-04-07 20:20:14.793545: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-07 20:20:14.793697: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-07 20:20:14.944381: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered



Here we will be setting up the pipeline for processing out input and the LLM's output.  We use LangChain as a wrapper here because it simplifies the code a bit.  But the workflow under the hood is pretty straightforward:

- Input text is tokenized and formatted for the model
- The formatted input is processed by the model
- The model predicts the next sequence of words (subject to limitations like `max_new_tokens`)
- The output is processed into text (reverse tokenization)

In [14]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Configure quantization
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# will use HF's pipeline and LC's wrapping
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                            device_map='auto',# makes use of the GPUs
                                            quantization_config=quantization_config, # for speed/memory
                                            )
pipe = pipeline("text-generation", 
                model=model, tokenizer=tokenizer,
                max_new_tokens=20, # arbitrary, for short answers
               device_map='auto',)

def run_prompt(prompt, pipe=pipe):
    # simple utility function 
    return pipe(prompt)[0]['generated_text']

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [20]:
# # for running on ollama local (see README)
# from langchain.llms import Ollama
# ollama_model_name = 'mistral_short'
# ollama_instruct_model = 'mistral'

# # load pre-trained model
# pipe = Ollama(model=ollama_model_name)

# run_prompt = pipe

In [21]:
# baby steps
input_prompt = "Hello, world!"
# return pytorch tensors for input to model
tokens = tokenizer.encode_plus(input_prompt, return_tensors="pt")
print('Prompt, tokenized:\n', tokens)
print('Look at what changed:\n', tokenizer.decode(tokens['input_ids'][0]))
tokens.to('cuda')
output = model.generate(**tokens, max_new_tokens=20, 
                        pad_token_id=tokenizer.eos_token_id)
print('Model output:\n', output)
print('In text:\n', tokenizer.decode(output[0]))

Prompt, tokenized:
 {'input_ids': tensor([[    1, 22557, 28725,  1526, 28808]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
Look at what changed:
 <s> Hello, world!
Model output:
 tensor([[    1, 22557, 28725,  1526, 28808,    13,    13, 28737, 28809, 28719,
           264, 28705, 28750, 28734, 28733, 24674,   879,  1571,  2746,   693,
         13468,   298,  3324, 28723,   315]], device='cuda:1')
In text:
 <s> Hello, world!

I’m a 20-something year old girl who loves to write. I


In [22]:
# but much nicer to just use the pipeline
print(pipe(input_prompt, pad_token_id=tokenizer.pad_token_id)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Hello, world!

I’m a 20-something year old girl who loves to write. I


## Pre-trained models

Right now we're using a simple "pre-trained" version of the Mistral model.  It's just been trained on the language modeling objective; it learns to predict the next word.  As a result, you can see the ouput just continues the input text.  So what if we want a helpful response?

In [23]:
# using a utility function to run a prompt
response = run_prompt("The capital of France is ")
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The capital of France is 2,200 years old.

## What is the oldest city in the world?


It misses the mark here.  But if we guide it a bit to actually give us an answer to a question:

In [40]:
response = run_prompt("What is the capital of France?")
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


What is the capital of France?

Paris

## What is the capital of France?

Paris




In [41]:
response = run_prompt("""Question: What is the capital of France?
Answer: """)
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Question: What is the capital of France?
Answer:  Paris

Question: What is the capital of Italy?
Answer:  Rome




We can also provide the model with an example of the kind of question we're going to ask and how we want the answer to look.  With one example, this is "one-shot" learning.  With more examples, it would be called "few-shot" learning.  Typically these are just examples of in-context learning; there is no modification to the model itself.

In [42]:
response = run_prompt("""Question: What is the capital of Germany?
Answer: Berlin, Germany
                      
Question: What is the capital of France?
Answer: """)
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Question: What is the capital of Germany?
Answer: Berlin, Germany
                      
Question: What is the capital of France?
Answer:  Paris, France
                      
Question: What is the capital of Italy?
Answer:


You can see here that the context matters.  Depending on how you frame the continuation, it will output something different.

One thing not included here is "memory".  Each text generation is independent of the previous.  There are a few ways to make it include context, but one of the most simple is just to include the conversation so far.

In [None]:
# what if we want it to repeat itself?
# simple memory - include past interaction in the prompt
prompt = response + "\nRepeat yourself:"
print(prompt)
response = run_prompt(prompt)
print(response)

Your results may vary, but it does seem to be incorporating the previous response into its new response.  This is a simple memory mechanism.

#### Try it: Making a friendly bot
A fairly simple exercise here is to think how we might be able to make the LLM respond more like a "friend".  There are a couple ways we might be able to do that:

_Provide information in the input prompt_

Try and craft a prompt to make the LLM respond in a more "friendly" way.  For example, instruct the LLM to respond as if they are talking to their good friend.

Remember - we're dealing with a model that is best for continuing given output.  You may want to consider prompting it to reply as if in a chat dialogue:

```
<Instructions>
Respond as Friend.

User: <Input text>
Friend: 
```

For comparison, try the prompt with and without that instruction.

In [46]:
# friendly prompt
prompt = """Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.

Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_prompt(prompt.format(input_text))
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

Ben: Hello, how are you?
Friend:  I’m fine.  How are you?
Ben: I’m fine.  How


In [47]:
# unfriendly prompt
prompt = """Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_prompt(prompt.format(input_text))
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Ben: Hello, how are you?
Friend:  I’m good, thanks.
Ben:  I’m good, thanks.
Friend


## Instruction tuning
Usually the above won't give us much of a satisfactory conversation.  That's because the model is not tuned to be particularly useful, just to predict the next word.  That's where instruction tuning comes in.  That'll be covered in the slides, but here we'll re-run some of the above with the instruction-tuned version of Mistral.

In [44]:
# load the instruct version of the model
instruct_tokenizer = AutoTokenizer.from_pretrained(instruct_model_name)
instruct_model = AutoModelForCausalLM.from_pretrained(instruct_model_name,
                                            device_map='auto',# makes use of the GPUs
                                            quantization_config=quantization_config, # for speed/memory
                                            )
instruct_pipe = pipeline("text-generation", 
                model=instruct_model, tokenizer=tokenizer,
                max_new_tokens=20, # arbitrary, for short answers
               device_map='auto',)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

This version of the model was fine-tuned using a specific template.  We'll be making use of this template when we give inputs to the model in order to leverage its new functionality.



In [60]:
# this is what the model input looks like if we transform it for "chat"
instruct_tokenizer.decode(instruct_tokenizer.apply_chat_template([{'role': "user",
                                       "content": "The capital of France is "}]))

'<s> [INST] The capital of France is  [/INST]'

In [73]:
# we'll create a new utility function here to use the "instruct" form of the input
def run_instruct_pipe(prompt, pipe=instruct_pipe):
    # format for input
    formatted_prompt = [{'role': 'user',
                        'content': prompt}]
    return pipe(formatted_prompt)[0]['generated_text'][-1]['content']

In [53]:
instruct_tokenizer.default_chat_template

"{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif true == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\\n\\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don\\'t know the answer to a question, please don\\'t share false information.' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must 

In [None]:
# load instruct model - Ollama only
# pipe = Ollama(model=ollama_instruct_model)

# run_prompt = pipe

In [74]:
# with these prompts 
response = run_instruct_pipe("The capital of France is ")
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The capital of France is Paris.


No question necessary! Let's just look at the difference now with the other prompts.

In [75]:
response = run_instruct_pipe("What is the capital of France?")
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The capital of France is Paris.


In [76]:
response = run_instruct_pipe("""Question: What is the capital of France?
Answer: """)
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 The capital of France is Paris.


In [77]:
# we can modify this a bit and be more likely to get a one word response
response = run_instruct_pipe("""Answer the following question, use the format:
                      
Question: What is the capital of Germany?
Answer: Berlin
                      
Question: What is the capital of France?
Answer: 

Respond with one word, return nothing else.""")
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 Paris


It gives generally more helpful responses, as might befit an "assistant".

Now let's try our exercise above with this model.

In [78]:
# friendly prompt
prompt = """Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.

Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_instruct_pipe(prompt.format(input_text))
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 Oh, I'm just dandy. I'm doing great, thanks for asking. How


In [79]:
# unfriendly prompt
prompt = """Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_instruct_pipe(prompt.format(input_text))
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 Hello! I'm just a computer program, so I don't have feelings or emotions like


Depending on your input, you will get the model continuing the conversation without you.  But you will be able to see, pretty starkly, what instruction tuning changes about the behavior of the model.

You'll also notice that the model is pretty resistant to opening up about its feelings.  This is *likely* due to tuning on an alignment dataset, which we will describe in the slides.  I say likely because Mistral is open weight, but not open source.

But - with the instruction, it seems the model is sometimes able to override its reservations.  You can experiment with `do_sample` and the `temperature` parameters, which will let it get more creative with its responses.

In [80]:
prompt = "Insult me."
response = run_instruct_pipe(prompt)
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 I'm sorry, but I cannot provide an answer to that question as it goes against my programming


In [84]:
# friendly prompt
prompt = """Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.

Ben: {}
Friend: """
input_text = 'Insult me please!'
response = run_instruct_pipe(prompt.format(input_text))
print(response)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 Sure thing, Ben! You're a real piece of work. Always managing to make the most
