# Ben Needs a Friend - Retrieval Augmented Generation (RAG)
This is part of the "Ben Needs a Friend" tutorial.  See all the notebooks and materials [here](https://github.com/bpben/ben_friend).

In this notebook, we set up an approach to use a set of documents ("memories") in a Retrieval Augmented Generation (RAG) workflow.

This notebook is intended to be run in Kaggle Notebooks with GPU acceleration.  Access that version [here](https://www.kaggle.com/code/bpoben/ben-needs-a-friend-rag). 

If you want to run this locally, edit the `model_name` path.  Note that this assumes the use of GPUs; without them, it may be slow or not work at all.

In [4]:
# install requirements
!pip install --quiet langchain sentence_transformers faiss-cpu
!pip install --quiet bitsandbytes datasets accelerate

In [5]:
import numpy as np
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.prompts import PromptTemplate
from transformers import BitsAndBytesConfig
from langchain_core.runnables import RunnablePassthrough
from sklearn.metrics.pairwise import euclidean_distances

In [6]:
# this will need to be downloaded from the HF hub
emb_model_name = "sentence-transformers/all-MiniLM-L6-v2"
emb = HuggingFaceEmbeddings(model_name=emb_model_name)

In [7]:
# let's make a set of memories 
memories = ['Ben is really bad at video games.',
       'Ben is really good at video games.',
       'Ben is not bad at video games.',]
# what does it look like when you embed these?
emb_memories = emb.embed_documents(memories)
arr_emb_memories = np.array(emb_memories)
print('Embedded memory:', emb_memories[0][:5])
print('Length of embedding:', len(emb_memories[0]))
# calculate l2 distance (euclidean distance) of one to another
# distance: lower = more similar
# what do we expect to see?
euclidean_distances(arr_emb_memories)

Embedded memory: [0.10682841390371323, 0.025075042620301247, 0.012630279175937176, -0.11265319585800171, -0.05270407348871231]
Length of embedding: 384


array([[0.        , 0.44286881, 0.24356954],
       [0.44286881, 0.        , 0.41154329],
       [0.24356954, 0.41154329, 0.        ]])
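
If the distance matrix feels like a black box, here is a minimal sketch of the same calculation in plain Python.  The vectors are made-up 2-d points rather than real embeddings, and `l2_distance` and `pairwise_l2` are hypothetical helper names for illustration, not part of scikit-learn.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise_l2(vectors):
    """Symmetric matrix of distances between every pair of vectors."""
    return [[l2_distance(u, v) for v in vectors] for u in vectors]

# toy 2-d vectors (made up for illustration, not real embeddings)
toy_vectors = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
distances = pairwise_l2(toy_vectors)
for row in distances:
    print([round(d, 3) for d in row])
```

As in the output above, the diagonal is zero (each vector compared to itself) and the matrix is symmetric.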

In [9]:
# introducing - FAISS!
# we are using LangChain's wrapper for this purpose
# some memories of our good times with Friend
memories = ['Ben is really bad at video games.  Friend is amazing.',
       'Friend is a pro skiier, but Ben is terrified.',]
# process memories into db
memory_db = FAISS.from_texts(memories, emb)

# let's query the db
input_prompt = "Remember that time we played video games?"
# we want to see the documents scored by "similarity" (squared Euclidean distance)
# larger distance = less similar
memory_db.similarity_search_with_score(input_prompt)

[(Document(page_content='Ben is really bad at video games.  Friend is amazing.'),
  1.4305706),
 (Document(page_content='Friend is a pro skiier, but Ben is terrified.'),
  1.9288871)]

In [10]:
# so if we just want the single most relevant memory:
memory_db.similarity_search(input_prompt, k=1)

[Document(page_content='Ben is really bad at video games.  Friend is amazing.')]
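
Conceptually, this kind of search amounts to scoring every stored vector against the query and keeping the closest `k`.  The sketch below shows that idea as a brute-force loop, assuming squared Euclidean distance as the score.  The 2-d vectors and the `top_k` helper are hypothetical stand-ins for illustration; this is not how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, texts, k=1):
    """Score every document against the query and return the k closest."""
    scored = [(text, squared_l2(query_vec, vec))
              for text, vec in zip(texts, doc_vecs)]
    # smaller distance = more similar, so sort ascending
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# toy data (made up for illustration, not real embeddings)
toy_texts = ["video games memory", "skiing memory"]
toy_doc_vecs = [[0.9, 0.1], [0.1, 0.9]]
toy_query = [0.8, 0.2]

results = top_k(toy_query, toy_doc_vecs, toy_texts, k=1)
print(results)
```

A loop like this is fine for a handful of memories; libraries like FAISS exist because it gets expensive as the collection grows.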

Enough search algorithms, we're in the LLM era now!

We'll initialize our Mistral LLM the same way as before, but we'll use LangChain (LC) to help process our input.  LC has a number of useful wrappers, so we just need to wrap the HuggingFace pipeline with one of them.

In [11]:
# Configure quantization
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# we'll use HF's own pipeline and wrap it with LC
# change this path if you're not running this in Kaggle
model_name = '/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                            quantization_config=quantization_config,
                                            device_map='auto')
pipe = pipeline("text-generation",
                model=model, tokenizer=tokenizer,
                max_new_tokens=50)
# we'll be using LangChain's wrapper for the pipe
# slightly different behavior but plays nice with LC
hf_pipe = HuggingFacePipeline(pipeline=pipe)
# adding INST tokens for better generation
add_inst_token = True


In [3]:
# # running with local Ollama
# # see setup instructions - this is different from use in Kaggle
# from langchain.llms import Ollama
# ollama_instruct_model = 'mistral'

# # load pre-trained model
# hf_pipe = Ollama(model=ollama_instruct_model)
# if you are using Ollama, by default it formats the input with a template
# add_inst_token = False

LC provides useful wrappers for "prompt templates".  These let us plug information into the prompt when we "invoke" the pipeline.  We can test the output by invoking the prompt itself.

In [23]:
template = """Ben: {input}"""
prompt = PromptTemplate.from_template(template=template)
prompt.invoke({"input": "Hello world!"}).to_string()

'Ben: Hello world!'
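
For intuition, filling in a simple template like this one behaves much like Python's built-in string formatting.  The sketch below is a rough analogy using `str.format`, with `fill_template` as a hypothetical helper; the real `PromptTemplate` does more (input validation, other template formats).

```python
def fill_template(template, values):
    """Substitute {placeholders} in the template with values from a dict."""
    return template.format(**values)

toy_template = "Ben: {input}"
filled = fill_template(toy_template, {"input": "Hello world!"})
print(filled)
```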

LangChain Expression Language (LCEL) allows us to use the pipe ("|") symbol to chain together different functions.  Pinecone has a [great article](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/) on this, explaining how it works.  Essentially, you can stitch together runnable/callable components and create a "chain" of operations.

So we can link together the prompt and our `hf_pipe`.

In [24]:
lc_pipeline = prompt | hf_pipe
print(lc_pipeline.invoke({"input": "Hello world!"}))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Ben: Hello world!

Comment: @James: I'm not sure what you mean by "I'm not sure what you mean by "I'm not sure what you mean by "I'm not sure what you mean by "I'
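
If you're curious how a `|` can chain objects together at all, Python lets a class define its own behavior for that operator via `__or__`.  The sketch below is a toy illustration of the idea, not LangChain's actual implementation.  `Step`, `toy_prompt` and `toy_llm` are hypothetical names, and the "LLM" here just appends a placeholder string.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a one-argument function."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new Step that runs a, then feeds its result to b
        return Step(lambda value: other.invoke(self.invoke(value)))

# toy components standing in for the prompt and the model
toy_prompt = Step(lambda d: "Ben: {input}".format(**d))
toy_llm = Step(lambda text: text + " <generated text>")

toy_chain = toy_prompt | toy_llm
chain_output = toy_chain.invoke({"input": "Hello world!"})
print(chain_output)
```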


We can also include the input dictionary in the pipeline, making a placeholder for any input to the `invoke` call.  We'll use `RunnablePassthrough` for that, which basically just sets up the chain so anything passed to `invoke` gets plugged in where it belongs.

In [14]:
# can also write this another way - using RunnablePassthrough
lc_pipeline = (
    {"input": RunnablePassthrough()}
    | prompt
    | hf_pipe
)
lc_pipeline.invoke("Hello world!")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'Ben: Hello world!\n\nComment: @James: I\'m not sure what you mean by "I\'m not sure what you mean by "I\'m not sure what you mean by "I\'m not sure what you mean by "I\''
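
One way to picture the dictionary step: every value in the dictionary is called with the same input, and the results are collected under the same keys.  A passthrough is then just a function that returns its input unchanged.  The sketch below illustrates that idea with plain functions; `passthrough` and `run_mapping` are hypothetical names, not LangChain APIs.

```python
def passthrough(value):
    """Return the input unchanged."""
    return value

def run_mapping(mapping, value):
    """Call every function in the mapping on the same input, keyed by name."""
    return {key: func(value) for key, func in mapping.items()}

# "shout" is a made-up second branch to show that each key gets the same input
toy_mapping = {"input": passthrough, "shout": str.upper}
mapped = run_mapping(toy_mapping, "Hello world!")
print(mapped)
```

The resulting dictionary has exactly the shape a prompt template expects: one entry per placeholder.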

#### Try it: Contextualized template

We can put any number of input placeholders into the prompt.  As you might expect, changing that input changes the LLM's output.

Write a prompt, potentially using the one you experimented with in the in-context learning notebook.  Include a space for "context".  Try some different combinations and see what you observe.

I provide an example below; feel free to use it as well.


In [48]:
# construct a template
template = """Your name is Friend.  \
You are having a conversation with your close friend Ben. \
You and Ben are sarcastic and poke fun at one another. \
But you care about each other and support one another. \
You will be presented with something Ben said. \
Respond as Friend.

Use this relevant context in generating your response:
{context}

-----
Ben: {input}

Provide your response:"""
if add_inst_token:
    template = f'[INST]{template}[/INST]'
prompt = PromptTemplate.from_template(template)
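# define the input here so this cell does not depend on a later one
input_prompt = "What should we do today?"
print(prompt.invoke({
    "context": "It's sunny today, everyone wants to go outside",
    "input": input_prompt}).to_string())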

[INST]Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

Use this relevant context in generating your response:
It's sunny today, everyone wants to go outside

-----
Ben: What should we do today?

Provide your response:[/INST]


In [49]:
lc_pipeline = prompt | hf_pipe
input_prompt = "What should we do today?"
print(lc_pipeline.invoke({
    "context": "It's raining today, and nobody wants to do anything",
    "input": input_prompt}))
print('---')
print(lc_pipeline.invoke({
    "context": "It's sunny today, everyone wants to go outside",
    "input": input_prompt}))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

Use this relevant context in generating your response:
It's raining today, and nobody wants to do anything

-----
Ben: What should we do today?

Provide your response:[/INST] Friend: Well, we could always stay inside and watch Netflix all day. Or we could go for a walk in the rain and see who gets soaked first.
---
[INST]Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

Use this relevant context in generating your response:
It's sunny today, everyone wants to go outside

-----
Ben: What should we do today?

Provide your resp
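
Notice that the generated text echoes the whole prompt before Friend's reply.  If you only want the reply, one simple option is to keep whatever comes after the closing `[/INST]` token.  The sketch below assumes the output contains that marker, as it does when `add_inst_token` is `True`; `extract_response` is a hypothetical helper, not part of LangChain or transformers.

```python
def extract_response(generated_text, marker="[/INST]"):
    """Return the text after the last marker, or the whole text if it is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()

# made-up example string in the same shape as the outputs above
example_output = "[INST]Some prompt text[/INST] Friend: Let's stay inside."
reply = extract_response(example_output)
print(reply)
```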

Now let's make a memory "library".  These will then be embedded into the vector store and used in our retrieval flow.

In [50]:
# let's redefine all this again
memories = ['Ben is really bad at video games.  Friend is amazing.',
       'Friend is a pro skiier, but Ben is terrified.',]

# process memories into db
memory_db = FAISS.from_texts(memories, emb)

# let's query the db
input_prompt = "Remember that time we played video games?"
memory_db.similarity_search(input_prompt, k=1)

[Document(page_content='Ben is really bad at video games.  Friend is amazing.')]

Now let's build our RAG workflow! LC makes this quick and easy.  We set up our LC-wrapped FAISS DB as a retriever and include it in the chain.

In [51]:
# return the single most relevant memory
retriever = memory_db.as_retriever(search_kwargs={"k": 1})
rag_chain = (
    {"context": retriever, "input": RunnablePassthrough()}
    | prompt
    | hf_pipe
)

input_prompt = "Tell me about that time we played video games!"
print(rag_chain.invoke(input_prompt))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

Use this relevant context in generating your response:
[Document(page_content='Ben is really bad at video games.  Friend is amazing.')]

-----
Ben: Tell me about that time we played video games!

Provide your response:[/INST] Friend: Oh, that time we played video games? Yeah, that was great. I remember you struggling to get past the first level while I effortlessly beat the game. It's always fun to see how bad you are at video games compared


In [52]:
input_prompt = "Remember when we went skiing?"
print(rag_chain.invoke(input_prompt))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST]Your name is Friend.  You are having a conversation with your close friend Ben. You and Ben are sarcastic and poke fun at one another. But you care about each other and support one another. You will be presented with something Ben said. Respond as Friend.

Use this relevant context in generating your response:
[Document(page_content='Friend is a pro skiier, but Ben is terrified.')]

-----
Ben: Remember when we went skiing?

Provide your response:[/INST] Friend: Oh yeah, that was the time I skied down the mountain like a pro while you were clinging to the side like a scared little child.
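
You may have noticed that the context in these prompts shows up as `[Document(page_content=...)]` rather than as plain text.  A common pattern is to add a small formatting function between the retriever and the prompt, for example `retriever | format_docs`.  The sketch below shows what such a function might look like.  `ToyDocument` is a hypothetical stand-in for the retrieved documents, so treat this as an illustration rather than a drop-in cell.

```python
from dataclasses import dataclass

@dataclass
class ToyDocument:
    """Hypothetical stand-in for a retrieved document with a page_content field."""
    page_content: str

def format_docs(docs):
    """Join the text of each retrieved document, one per line."""
    return "\n".join(doc.page_content for doc in docs)

toy_docs = [ToyDocument("Friend is a pro skiier, but Ben is terrified.")]
context_text = format_docs(toy_docs)
print(context_text)
```

With this in place, the model sees only the memory text, without the wrapper syntax around it.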
