# Ben Needs a Friend - In-Context Learning
This is part of the "Ben Needs a Friend" tutorial.  See all the notebooks and materials [here](https://github.com/bpben/ben_friend).

In this notebook, I give a brief intro to how we'll be setting up and interacting with LLMs.

## Tutorial setup
For most of this tutorial we will be working in a Kaggle notebook.  This same notebook should work on your local machine; you will just need to change `model_name` to the path of the model on your local machine or to a model name on HuggingFace.

In [None]:
# we will be using langchain for some consistent/clean wrappers on top of LLMs
!pip install --quiet langchain==0.1.12 
!pip install --quiet bitsandbytes accelerate

In [None]:
# defining some of the parameters we'll need 
# replace this with a model from HF hub or the path to the model in your local directory
# if running on Kaggle - you should have all the models we need in this notebook loaded
model_name = '/kaggle/input/mistral/pytorch/7b-v0.1-hf/1'
instruct_model = '/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1'

In [None]:
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers import BitsAndBytesConfig


Here we will be setting up the pipeline for processing our input and the LLM's output.  We use LangChain as a wrapper here because it simplifies the code a bit.  But the workflow under the hood is pretty straightforward:

- Input text is tokenized and formatted for the model
- The formatted input is processed by the model
- The model predicts the next tokens, one at a time (subject to limitations like `max_new_tokens`)
- The output is processed into text (reverse tokenization)
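To make these steps concrete, here is a toy sketch of the same loop in plain Python.  Everything in it is made up for illustration: the tiny word-level `vocab`, the `predict_next` lookup that stands in for the model, and the helper names.  A real tokenizer and LLM are far more involved, but the shape of the workflow is the same.

```python
# Toy stand-ins (assumptions for illustration): a word-level vocabulary
# and a "model" that is just a rule over token ids.
vocab = ["<s>", "Hello", ",", "world", "!", "How", "are", "you", "?"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def tokenize(text):
    """Step 1: split on spaces and prepend the beginning-of-sequence token."""
    return [token_to_id["<s>"]] + [token_to_id[w] for w in text.split()]


def predict_next(ids):
    """Step 2: pretend model that always predicts the next id in vocabulary order."""
    return (ids[-1] + 1) % len(vocab)


def generate(ids, max_new_tokens):
    """Step 3: append one predicted token at a time, up to max_new_tokens."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        ids.append(predict_next(ids))
    return ids


def decode(ids):
    """Step 4: map token ids back to text (reverse tokenization)."""
    return " ".join(vocab[i] for i in ids)


input_ids = tokenize("Hello , world !")
output_ids = generate(input_ids, max_new_tokens=3)
print(decode(output_ids))
```

Note that the decoded text contains both the prompt and the newly generated tokens, and that it starts with the `<s>` token added during tokenization.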

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Configure quantization
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# will use HF's pipeline and LC's wrapping
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                            device_map='auto',# makes use of the GPUs
                                            quantization_config=quantization_config, # for speed/memory
                                            )
pipe = pipeline("text-generation",
                model=model, tokenizer=tokenizer,
                max_new_tokens=20, # arbitrary, for short answers
                ) # no device_map needed here, the model was already placed when loaded

def run_prompt(prompt, pipe=pipe):
    # simple utility function 
    return pipe(prompt)[0]['generated_text']

In [None]:
# baby steps
input_prompt = "Hello, world!"
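# return pytorch tensors for input to model
tokens = tokenizer(input_prompt, return_tensors="pt")
print('Prompt, tokenized:\n', tokens)
# decoding shows the beginning-of-sequence token the tokenizer added
print('Look at what changed:\n', tokenizer.decode(tokens['input_ids'][0]))
# move the inputs to the same device as the model
tokens = tokens.to(model.device)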
output = model.generate(**tokens, max_new_tokens=20, 
                        pad_token_id=tokenizer.eos_token_id)
print('Model output:\n', output)
print('In text:\n', tokenizer.decode(output[0]))

In [None]:
# but much nicer to just use the pipeline
# reuse EOS for padding, as in the cell above
print(pipe(input_prompt, pad_token_id=tokenizer.eos_token_id)[0]['generated_text'])

## Pre-trained models

Right now we're using a simple "pre-trained" version of the Mistral model.  It's just been trained on the language modeling objective; it learns to predict the next word.  As a result, you can see the output just continues the input text.  So what if we want a helpful response?

In [None]:
# using a utility function to run a prompt
response = run_prompt("The capital of France is ")
print(response)

It misses the mark here.  What if we guide it a bit by actually asking a question?

In [None]:
response = run_prompt("What is the capital of France?")
print(response)

In [None]:
response = run_prompt("""Question: What is the capital of France?
Answer: """)
print(response)

We can also provide the model with an example of the kind of question we're going to ask and how we want the answer to look.  With one example, this is "one-shot" learning.  With more examples, it would be called "few-shot" learning.  Both are forms of in-context learning: the examples live only in the prompt, and there is no modification to the model itself.

In [None]:
response = run_prompt("""Question: What is the capital of Germany?
Answer: Berlin, Germany

Question: What is the capital of France?
Answer: """)
print(response)
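The one-shot prompt above generalizes easily.  Here is a minimal sketch of a helper that builds a zero-, one-, or few-shot prompt from a list of examples.  `build_few_shot_prompt` is a hypothetical helper of my own, not part of LangChain or transformers, and the Italy example is added purely for illustration.

```python
def build_few_shot_prompt(examples, question):
    """Format (question, answer) pairs, then the unanswered question.

    An empty `examples` list gives a zero-shot prompt, one pair gives a
    one-shot prompt, and several pairs give a few-shot prompt.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block leaves the answer open for the model to continue
    blocks.append(f"Question: {question}\nAnswer: ")
    return "\n\n".join(blocks)


examples = [
    ("What is the capital of Germany?", "Berlin, Germany"),
    ("What is the capital of Italy?", "Rome, Italy"),
]
few_shot_prompt = build_few_shot_prompt(examples, "What is the capital of France?")
print(few_shot_prompt)
```

The resulting string can be passed to `run_prompt` like any other prompt; nothing about the model changes, only the text it is asked to continue.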

You can see here that the context matters.  Depending on how you frame the continuation, it will output something different.

One thing not included here is "memory".  Each text generation is independent of the previous one.  There are a few ways to make the model include earlier context, but one of the simplest is just to include the conversation so far in the prompt.

In [None]:
# what if we want it to repeat itself?
# simple memory - include past interaction in the prompt
prompt = response + "\nRepeat yourself:"
print(prompt)
response = run_prompt(prompt)
print(response)

Your results may vary, but it does seem to be incorporating the previous response into its new response.  This is a simple memory mechanism.
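If you want to keep this going for more than one turn, the same trick can be wrapped in a small class.  This is only a sketch: `Conversation` and `fake_generate` are hypothetical names, and `fake_generate` stands in for `run_prompt` so the example runs without a model.  It assumes the generation function returns the prompt followed by the continuation, which is what `run_prompt` does above.

```python
class Conversation:
    """Keeps the transcript so far and prepends it to every new prompt."""

    def __init__(self, generate):
        # `generate` is any callable mapping prompt text to the full
        # generated text (prompt + continuation), e.g. run_prompt
        self.generate = generate
        self.transcript = ""

    def say(self, text):
        # The new prompt is everything said so far plus the new input
        prompt = self.transcript + text
        self.transcript = self.generate(prompt)
        return self.transcript


def fake_generate(prompt):
    # Stand-in for run_prompt: echoes the prompt plus a canned continuation
    return prompt + " [reply]"


chat = Conversation(fake_generate)
chat.say("Question: What is the capital of France?\nAnswer:")
print(chat.say("\nRepeat yourself:"))
```

With the real model you would create it as `Conversation(run_prompt)`.  The transcript grows with every turn, so a long conversation will eventually exceed what the model can take as input.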

#### Try it: Making a friendly bot
A fairly simple exercise here is to think about how we might make the LLM respond more like a "friend".  There are a couple of ways we might be able to do that:

_Provide information in the input prompt_

Try and craft a prompt to make the LLM respond in a more "friendly" way.  For example, instruct the LLM to respond as if they are talking to their good friend.

Remember - we're dealing with a model that is best at continuing the text it is given.  You may want to consider prompting it to reply as if in a chat dialogue:

```
<Instructions>
Respond as Friend.

User: <Input text>
Friend: 
```

For comparison, try the prompt with and without that instruction.
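To make that comparison easy to repeat, here is a small sketch that fills in the template above with or without the instruction block.  `build_chat_prompt` and its parameter names are my own, hypothetical choices; only the layout comes from the template.

```python
def build_chat_prompt(input_text, instructions=None, user="User", bot="Friend"):
    """Fill in the chat template, optionally preceded by instructions."""
    header = ""
    if instructions:
        # Instruction block, then the "Respond as ..." line and a blank line
        header = f"{instructions}\nRespond as {bot}.\n\n"
    # The prompt ends on the bot's label so the model continues as the bot
    return f"{header}{user}: {input_text}\n{bot}: "


with_instructions = build_chat_prompt(
    "Hello, how are you?",
    instructions="You are talking to your good friend.",
)
without_instructions = build_chat_prompt("Hello, how are you?")
print(with_instructions)
print("---")
print(without_instructions)
```

Both strings can be passed to `run_prompt`, which makes it simple to see what the instruction changes.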

In [None]:
# friendly prompt
prompt = """Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.

Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_prompt(prompt.format(input_text))
print(response)

In [None]:
# unfriendly prompt
prompt = """Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_prompt(prompt.format(input_text))
print(response)

## Instruction tuning
Usually the above won't give us much of a satisfactory conversation.  That's because the model is not tuned to be particularly useful, just to predict the next word.  That's where instruction tuning comes in.  That'll be covered in the slides, but here we'll re-run some of the above with the instruction-tuned version of Mistral.
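The cells below assume that `run_prompt` now points at the instruction-tuned model rather than the base model.  Here is a minimal sketch of how that switch might look, assuming the same 4-bit loading settings used for the base model.  The helper names and the `max_new_tokens=256` default are my assumptions, not settings taken from the original run.

```python
def load_text_generation_pipeline(model_path, max_new_tokens=256):
    """Load a 4-bit quantized causal LM and wrap it in a text-generation pipeline."""
    # Imported lazily so that defining this helper does not load the GPU stack
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, pipeline)

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",  # makes use of the GPUs
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    )
    return pipeline("text-generation", model=model, tokenizer=tokenizer,
                    max_new_tokens=max_new_tokens)


def make_run_prompt(pipe):
    """Build a run_prompt-style helper bound to the given pipeline."""
    def run_prompt(prompt):
        return pipe(prompt)[0]["generated_text"]
    return run_prompt


# In the notebook (needs the model files and a GPU):
# instruct_pipe = load_text_generation_pipeline(instruct_model)
# run_prompt = make_run_prompt(instruct_pipe)
```

Rebinding `run_prompt` this way means the cells that follow can stay exactly as they are written.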

In [4]:
# using a utility function to run a prompt
response = run_prompt("The capital of France is ")
print(response)


 The capital city of France is Paris. Paris is one of the most famous cities in the world, known for its iconic landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral, among other attractions. It is also home to many important cultural institutions and universities, making it a major center for art, fashion, gastronomy, science, and business. Paris has a population of over 10 million people in its metropolitan area, making it the most populous city in France.


Whoa! No question necessary!

In [None]:
response = run_prompt("What is the capital of France?")
print(response)

In [None]:
response = run_prompt("""Question: What is the capital of France?
Answer: """)
print(response)

In [None]:
# we can modify this a bit and be more likely to get a one-word response
response = run_prompt("""Answer the following question, use the format:

Question: What is the capital of Germany?
Answer: Berlin

Question: What is the capital of France?
Answer: 

Respond with one word, return nothing else.""")
print(response)

'Nuff said - it gives generally more helpful responses, as might befit an "assistant".

Now let's try our exercise above with this model.

In [None]:
# friendly prompt
prompt = """Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.

Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_prompt(prompt.format(input_text))
print(response)

In [None]:
# unfriendly prompt
prompt = """Ben: {}
Friend: """
input_text = 'Hello, how are you?'
response = run_prompt(prompt.format(input_text))
print(response)

Depending on your input, you may see the model continue the conversation without you.  But you will be able to see, pretty starkly, what instruction tuning changes about the behavior of the model.
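If the model does keep the conversation going on its own, one simple workaround is to cut the output at the next speaker label.  The sketch below is one way to do that in plain Python; `first_reply` and its `stop_markers` default are hypothetical, chosen to match the `Ben:` / `Friend:` labels used in the prompts here.

```python
def first_reply(generated_text, prompt, stop_markers=("\nBen:", "\nFriend:")):
    """Return only the first reply that follows the prompt."""
    # The pipeline output normally starts with the prompt itself, so drop it
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Cut at the earliest speaker label that starts a new turn
    cut = len(continuation)
    for marker in stop_markers:
        idx = continuation.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return continuation[:cut].strip()


example_prompt = "Ben: Hello, how are you?\nFriend: "
example_output = example_prompt + "Doing great!\nBen: Glad to hear it.\nFriend: Thanks!"
print(first_reply(example_output, example_prompt))
```

In the notebook this could be used as `first_reply(run_prompt(p), p)` for any prompt `p` built from the templates above.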

You'll also notice that the model is pretty resistant to opening up about its feelings.  This is *likely* due to tuning on an alignment dataset, which we will describe in the slides.  I say likely because Mistral is open weight, but not open source.

But - with the instruction, it seems the model is sometimes able to override its reservations.

In [None]:
prompt = "Insult me."
run_prompt(prompt)

In [None]:
# friendly prompt
prompt = """Your name is Friend.  You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.

Ben: {}
Friend: """
input_text = 'Insult me.'
response = run_prompt(prompt.format(input_text))
print(response)