<a href="https://colab.research.google.com/github/ad71/ragbot/blob/master/rag_langchain_mistral.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with Mistral-7B and LangChain

Fine-tuning is an option, but it comes with its own risks and challenges:

1. Model drift: As the model is continuously fine-tuned with new data, it may drift from its original performance and behaviour over time, which could lead to unexpected and undesirable results.

2. Costly and complex: Fine-tuning is technically demanding and expensive. Fine-tuning our model on a weekly basis would require a considerable investment in computational resources and expert time.

## What is a RAG?
Retrieval Augmented Generation (RAG) helps LLMs by giving them access to external data, so that they can generate a response with additional context. A RAG pipeline works in four steps:

1. Load a vector database with encoded documents
2. Encode the query into a vector using a sentence transformer
3. Based on the query input, retrieve relevant context from the vector database
4. Leverage context along with the query to prompt the LLM
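
The four steps above can be sketched end to end in plain Python. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a sentence transformer, the list `index` stands in for a vector database, and the documents are made-up sentences. The helper names are hypothetical and not part of any library.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for a sentence transformer: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Load a "vector database" with encoded documents (made-up examples).
documents = [
    "The striker scored twice in the last match.",
    "Heavy rain is expected at the stadium on Sunday.",
    "The goalkeeper is injured and will miss two games.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    query_vector = embed(query)  # 2. Encode the query into a vector.
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]  # 3. Retrieve the most similar documents.

def build_prompt(query, k=1):
    # 4. Combine the retrieved context with the query to prompt the LLM.
    context = "\n".join(retrieve(query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("Is the goalkeeper injured?")
print(prompt)
```

A real pipeline swaps the toy pieces for a learned embedding model and an approximate nearest-neighbour index, but the data flow stays the same.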

In [None]:
%pip install -q accelerate==0.21.0 \
                peft==0.4.0 \
                bitsandbytes==0.40.2 \
                trl==0.4.7 \
                langchain
%pip install git+https://github.com/huggingface/transformers.git
%pip install playwright
%pip install html2text
%pip install faiss-gpu
%pip install sentence-transformers


In [None]:
!playwright install
!playwright install-deps


# Mistral 7B

## Flash and Furious: Attention Drift
Mistral 7B uses a sliding window attention (SWA) mechanism [Longformer: the long document transformer], in which each layer attends to the previous 4096 hidden states. The main improvement, and the reason this was initially investigated, is a linear compute cost of O(sliding_window × seq_len). In practice, changes made to [Flash Attention: Fast and Memory-Efficient Exact Attention with IO Awareness] and [xFormers] yield a 2x speed improvement for a sequence length of 16k with a window of 4k.

Sliding window attention exploits the stacked layers of a transformer to attend to the past beyond the window size. A token i at layer k attends to tokens [i - sliding_window, i] at layer k-1. Those tokens in turn attended to tokens [i - 2*sliding_window, i]. Higher layers therefore have access to information further in the past than the attention pattern seems to entail.
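
To make the stacking argument concrete, here is a minimal sketch in plain Python. It assumes the inclusive range [i - sliding_window, i] used in the text, and the helper names (`sliding_window_mask`, `reachable_after`) are hypothetical. The first function builds the per-layer attention pattern; the second composes it across layers to show how far back information can flow.

```python
def sliding_window_mask(seq_len, window):
    # mask[i][j] is True when position i may attend to position j:
    # causal (j <= i) and at most `window` positions back.
    return [[(j <= i) and (i - j <= window) for j in range(seq_len)]
            for i in range(seq_len)]

def reachable_after(mask, layers):
    # reach[i][j] is True when information from position j can influence
    # position i after passing through `layers` stacked attention layers.
    n = len(mask)
    reach = [[i == j for j in range(n)] for i in range(n)]
    for _ in range(layers):
        reach = [[any(mask[i][m] and reach[m][j] for m in range(n))
                  for j in range(n)]
                 for i in range(n)]
    return reach

mask = sliding_window_mask(seq_len=10, window=2)
one_layer = [j for j, ok in enumerate(mask[9]) if ok]
two_layers = [j for j, ok in enumerate(reachable_after(mask, 2)[9]) if ok]
print(one_layer)   # positions visible to token 9 in a single layer
print(two_layers)  # positions that can influence token 9 after two layers
```

With a window of 2, a single layer lets token 9 see back to position 7, while two stacked layers extend that reach to position 5, which is i - 2*sliding_window.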

Finally, a fixed attention span means we can limit our cache to a size of sliding_window tokens, using rotating buffers. This saves half of the cache memory for inference on a sequence length of 8192, without impacting model quality.
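
A rotating buffer can be sketched as a fixed-size list in which position `p` is written to slot `p % window`, so new entries overwrite the oldest ones. The class below is a hypothetical illustration of that idea, not Mistral's actual cache implementation; it stores placeholder strings where a real cache would hold key/value tensors.

```python
class RollingKVCache:
    """Fixed-size cache that keeps only the most recent `window` entries."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.count = 0  # total number of entries ever appended

    def append(self, item):
        # Position p lands in slot p % window, overwriting the oldest entry.
        self.slots[self.count % self.window] = item
        self.count += 1

    def contents(self):
        # Return the cached entries, oldest first.
        n = min(self.count, self.window)
        start = self.count - n
        return [self.slots[p % self.window] for p in range(start, self.count)]

cache = RollingKVCache(window=4)
for position in range(8):
    cache.append(f"kv{position}")
print(cache.contents())
```

Here 8 positions are processed but only 4 slots are ever allocated, mirroring the half-memory saving described above for a window of 4096 and a sequence length of 8192.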

In [None]:
import os
import torch
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

from langchain.text_splitter import CharacterTextSplitter
from langchain.document_transformers import Html2TextTransformer
from langchain.document_loaders import AsyncChromiumLoader

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

from langchain.prompts import PromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain

import nest_asyncio

In [None]:
model_name = 'mistralai/Mistral-7B-Instruct-v0.1'

model_config = transformers.AutoConfig.from_pretrained(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'




In [None]:
# bitsandbytes parameters
use_4bit = True # activate 4-bit precision base model loading
bnb_4bit_compute_dtype = 'float16' # compute dtype for 4-bit base models
bnb_4bit_quant_type = 'nf4' # quantization type fp4 / nf4
use_nested_quant = False # activate nested quantization for 4-bit base models (double quantization)

In [None]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant
)

# check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print('=' * 80)
        print('Your GPU supports bfloat16: accelerate training with bf16=True')
        print('=' * 80)

In [None]:
# load the pre-trained model with 4-bit quantization applied
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.



You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


In [None]:
inputs_not_chat = tokenizer('[INST] Tell me about fantasy football? [/INST]', return_tensors='pt').to('cuda')
# pass the attention mask and pad token id explicitly so generate() does not have to guess them
generated_ids = model.generate(**inputs_not_chat, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)


In [None]:
# Quick tangent: here's an interesting function to show exactly how many trainable parameters you have access to using this quantization

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()

    return f'trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%'

print(print_number_of_trainable_model_parameters(model))

trainable model parameters: 262410240
all model parameters: 3752071168
percentage of trainable model parameters: 6.99%


In [None]:
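# Note: these numbers are an artifact of 4-bit loading, not a count of weights being updated. bitsandbytes packs two 4-bit values into each stored byte, so numel() reports only half of the quantized weights, and the ~7% flagged as trainable are the tensors left unquantized (the embeddings, lm_head, and normalization weights). Nothing is fine-tuned in this notebook; the takeaway is how much quantization shrinks the model's memory footprint.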
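
The counts printed above can be reproduced by hand, which helps explain where they come from. The sketch below assumes the published Mistral-7B architecture values (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of size 128, vocabulary of 32000), and assumes that 4-bit loading packs two weights per stored element while leaving the embeddings, `lm_head`, and normalization weights unquantized. None of these values are read from the checkpoint here.

```python
# Assumed Mistral-7B architecture values (not read from the checkpoint).
hidden, intermediate, layers = 4096, 14336, 32
kv_dim = 8 * 128  # 8 key/value heads x head size 128
vocab = 32000

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * hidden * intermediate                    # gate_proj, up_proj, down_proj
linear_params = layers * (attn + mlp)              # weights that get quantized

embeddings = 2 * vocab * hidden                    # embed_tokens and lm_head
norms = (2 * layers + 1) * hidden                  # two norms per layer plus the final norm
unquantized = embeddings + norms                   # still requires_grad=True

packed = linear_params // 2                        # two 4-bit weights per stored element
reported_total = unquantized + packed
true_total = unquantized + linear_params

print(unquantized, reported_total, true_total)
print(f"{100 * unquantized / reported_total:.2f}%")
```

Under these assumptions the unquantized tensors account for exactly the "trainable" figure, and the reported total is the packed size rather than the model's real parameter count of roughly 7.2 billion.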

## FAISS: Facebook AI Similarity Search

Create a vector database

In [None]:
nest_asyncio.apply()

articles = ["https://www.fantasypros.com/2023/11/rival-fantasy-nfl-week-10/",
            "https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-fantasy-lineup-week-10/",
            "https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-predictions-2023/",
            "https://www.fantasypros.com/2023/11/nfl-dfs-week-10-stacking-advice-picks-2023-fantasy-football/",
            "https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-advice-2023-fantasy-football/"]

loader = AsyncChromiumLoader(articles)
docs = loader.load()

In [None]:
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
chunked_documents = text_splitter.split_documents(docs_transformed)

# load chunked documents into the FAISS index
db = FAISS.from_documents(chunked_documents, HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2'))

# connect query to FAISS index using a retriever
# retriever = db.as_retriever(search_type='similarity', search_kwargs={'k': 4})
retriever = db.as_retriever()


In [None]:
query = 'What did Alvin say?'
docs = db.similarity_search(query)
print(docs[0].page_content)

Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to
"https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-
advice-2023-fantasy-football/", waiting until "load"


In [None]:
docs

[Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-\nadvice-2023-fantasy-football/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-advice-2023-fantasy-football/'}),
 Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-predictions-2023/'}),
 Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-dfs-week-10-stacking-advice-\npicks-2023-fantasy-football/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/nfl-dfs-week-10-stacking-advice-picks-2023-fantasy-footba

In [None]:
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)

prompt_template = '''
### [INST]
Instruction: Answer the question based on your fantasy football knowledge. Here is context to help:

{context}

### QUESTION:
{question}

[/INST]
'''

mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# create prompt from prompt template
prompt = PromptTemplate(
    input_variables=['context', 'question'],
    template=prompt_template
)

# create llm chain
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)



In [None]:
llm_chain.invoke({'context': '', 'question': 'Should I pick up Alvin Kamara for my fantasy team?'})



{'context': '',
 'question': 'Should I pick up Alvin Kamara for my fantasy team?',
 'text': '\n### [INST]\nInstruction: Answer the question based on your fantasy football knowledge. Here is context to help:\n\n\n\n### QUESTION:\nShould I pick up Alvin Kamara for my fantasy team?\n\n[/INST]\n\nBased on your fantasy football knowledge, it depends on what specific league and position you are playing in, as well as the current roster of your team. However, in general, Alvin Kamara is a highly skilled running back who has been performing well in the NFL this season. He has been a top-10 running back in PPR leagues and could be a valuable addition to any team that needs a reliable and dynamic running back. If you have an open spot on your roster and are looking for a solid player to add, Kamara could be worth considering.'}

In [None]:
# Create RAG chain

query = 'Should I pick up Alvin Kamara for my fantasy team?'

retriever = db.as_retriever()

rag_chain = (
    {'context': retriever, 'question': RunnablePassthrough()} | llm_chain
)

rag_chain.invoke(query)



{'context': [Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-predictions-2023/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-\nfantasy-lineup-week-10/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-fantasy-lineup-week-10/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/rival-fantasy-nfl-week-10/", waiting\nuntil "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/rival-fantasy-nfl-week-10/'}),
  Document(page_content='Error: Page.goto: Timeout 30000m

In [None]:
query = 'I have Josh Jacobs, should I trade him for Kareem Hunt'

rag_chain.invoke(query)



{'context': [Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-\nadvice-2023-fantasy-football/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-advice-2023-fantasy-football/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-predictions-2023/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-\nfantasy-lineup-week-10/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-fantasy-

In [None]:
rag_chain.invoke("Should I start Gibbs next week for fantasy?")



{'context': [Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-\nfantasy-lineup-week-10/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-fantasy-lineup-week-10/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/rival-fantasy-nfl-week-10/", waiting\nuntil "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/rival-fantasy-nfl-week-10/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-predictions-2023/'}),
  Document(page_content='Error: Page.goto: Timeout 30000m

1. PromptTemplate creation: We started by creating a PromptTemplate that takes two inputs, a context and a question. The context provides background information relevant to the question, and the question is what we want our LLM to answer.
2. Chain creation: Next, we created a chain, a sequence of operations that we can invoke with a query.
3. RunnablePassthrough usage: The query is passed along unchanged using RunnablePassthrough(), a part of LangChain's API that forwards its input to the next step in the chain.
4. Retriever invocation: The query is also passed into the retriever. The retriever queries our FAISS index, which is designed for efficient similarity search and clustering of dense vectors, and retrieves the relevant context.
5. Context integration: The retrieved context is then inserted into our prompt. This step is crucial, as it provides the background information that helps the LLM generate a more accurate and context-aware response.
6. LLM invocation: Finally, the enriched prompt is passed into the LLM. In this demonstration, we used a quantized Mistral-7B model.
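
The data flow in the steps above can be mimicked with ordinary functions. This is a simplified sketch, not LangChain's implementation: `parallel`, `pipe`, `fake_retriever`, `fill_prompt`, and `fake_llm` are hypothetical stand-ins for the dictionary of runnables, the `|` operator, the FAISS retriever, the prompt template, and the model.

```python
def fake_retriever(query):
    # Stand-in for the FAISS retriever: returns context for a query.
    return [f"<document relevant to: {query}>"]

def passthrough(query):
    # Stand-in for RunnablePassthrough(): forwards the input unchanged.
    return query

def parallel(steps):
    # Feed the same input to every step and collect the results by key.
    return lambda value: {key: step(value) for key, step in steps.items()}

def fill_prompt(inputs):
    # Stand-in for the prompt template: insert context and question.
    return (f"[INST]\nContext: {inputs['context']}\n\n"
            f"Question: {inputs['question']}\n[/INST]")

def fake_llm(prompt):
    # Stand-in for the model: echoes the prompt and appends a placeholder answer.
    return prompt + "\n<model answer>"

def pipe(*stages):
    # Run the stages left to right, feeding each output into the next stage.
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run

rag_sketch = pipe(
    parallel({"context": fake_retriever, "question": passthrough}),
    fill_prompt,
    fake_llm,
)
result = rag_sketch("Should I start my backup goalkeeper?")
print(result)
```

The key point is that the single input query fans out to both the retriever and the passthrough, and the two results are merged into one dictionary before the prompt is filled in.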

In [None]:
# First question in the chat
rag_chain.invoke('How is Mahomes doing?')



{'context': [Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/rival-fantasy-nfl-week-10/", waiting\nuntil "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/rival-fantasy-nfl-week-10/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-\nfantasy-lineup-week-10/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-fantasy-lineup-week-10/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-predictions-2023/'}),
  Document(page_content='Error: Page.goto: Timeout 30000m

In [None]:
rag_chain = (
    {'context': retriever, 'question': RunnablePassthrough()} | llm_chain
)

rag_chain.invoke('Who are some good alternatives to him?')



{'context': [Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-predictions-2023/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-\nadvice-2023-fantasy-football/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-advice-2023-fantasy-football/'}),
  Document(page_content='Error: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-\nfantasy-lineup-week-10/", waiting until "load"', metadata={'source': 'https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-fantasy-

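The follow-up question above refers to "him", but the chain has no record of the previous exchange. To support follow-up questions, we need to work out two things:

- How to store the conversation history in memory and include it within our prompt
- How to transform the input question so that it retrieves the relevant information from our vector database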


RetrievalQA and RAGs with Agents are examples of abstracted classes that can save us a lot of work. However, a solid understanding of what is happening under the hood is crucial, which is why this tutorial uses low-level LangChain components. Working at this level helps you understand why you may or may not be getting the results you expect, and ultimately gives you more control over your RAG application.

In the chain above, we pass in the query as is, but what we really need to pass is a transformed version that can appropriately query our vector database.

```
What are good alternatives to him? -> What are good alternatives to Mahomes? -> Sentence Encoder -> Vector Database
```
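
A rough way to see why the rewrite matters is to compare how much each version of the question has in common with a stored document. The sketch below uses Jaccard word overlap as a crude, hypothetical stand-in for embedding similarity, and the document snippet is made up for illustration.

```python
import re

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap(query, document):
    # Jaccard similarity between the word sets: shared words / all words.
    q, d = tokens(query), tokens(document)
    return len(q & d) / len(q | d)

document = "Quarterback notes: Mahomes and possible replacements."  # made-up snippet
raw_question = "What are good alternatives to him?"
standalone_question = "What are good alternatives to Mahomes?"

raw_score = overlap(raw_question, document)
standalone_score = overlap(standalone_question, document)
print(raw_score, standalone_score)
```

The raw follow-up shares nothing with the document because the pronoun carries no retrievable signal, while the standalone version matches on the player's name. A sentence encoder is far more forgiving than word overlap, but it has the same blind spot for unresolved pronouns.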

Save conversation history to memory and leverage it to generate a standalone question.

We add a second LLM which will be responsible for generating a standalone question that can appropriately query the vector database.

In [None]:
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

use_4bit = True
bnb_4bit_compute_dtype = 'float16'
bnb_4bit_quant_type = 'nf4'
use_nested_quant = False

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant
)

if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print('=' * 80)
        print('Your GPU supports bfloat16: accelerate training with bf16=True')
        print('=' * 80)

mistral_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)


`low_cpu_mem_usage` was None, now set to True since model is quantized.



You are calling `save_pretrained` to a 4-bit converted model, but your `bitsandbytes` version doesn't support it. If you want to save 4-bit models, make sure to have `bitsandbytes>=0.41.3` installed.


In [None]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(mistral_model))

trainable model parameters: 262410240
all model parameters: 3752071168
percentage of trainable model parameters: 6.99%


In [None]:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

standalone_query_generation_pipeline = pipeline(
    model=mistral_model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.0,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)
standalone_query_generation_llm = HuggingFacePipeline(pipeline=standalone_query_generation_pipeline)

response_generation_pipeline = pipeline(
    model=mistral_model,
    tokenizer=tokenizer,
    task='text-generation',
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)
response_generation_llm = HuggingFacePipeline(pipeline=response_generation_pipeline)

`standalone_query_generation_pipeline` uses a temperature of 0.0, whereas `response_generation_pipeline` uses 0.2. We do this to keep the chance of hallucination as low as possible when generating the standalone query, since that query determines the application's ability to retrieve relevant context.

> Temperature is a close second to prompt engineering when it comes to controlling the output of a generative model. It determines how creative the model should be.

> A temperature of 0 makes the model deterministic: it always picks the token with the highest probability, so you can run it over and over and get the same output. As you increase the temperature, that limit softens, allowing the model to use tokens with lower and lower probabilities.
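
The effect of temperature can be illustrated with a small softmax sketch. It assumes the common formulation in which logits are divided by the temperature before the softmax, and it treats a temperature of 0 as the greedy limit; the function name and the example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    if temperature == 0:
        # Limit case: all probability goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
greedy = softmax_with_temperature(logits, 0)
sharp = softmax_with_temperature(logits, 0.2)
flat = softmax_with_temperature(logits, 1.5)
print(greedy, sharp, flat)
```

At a low temperature nearly all of the probability mass sits on the top token, and at a higher temperature the lower-ranked tokens receive a noticeably larger share.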

## Standalone Questions Generation Chain
We take a few-shot prompt engineering approach. 7B models are capable, but they are not perfect, so providing a handful of examples in the prompt is a good idea. Take a look at how we do this.

In [None]:
%pip install langchain==0.1.17



In [None]:
from langchain.prompts import PromptTemplate
from langchain_core.prompts.chat import ChatPromptTemplate

_template = """
[INST]
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. This query will be used to retrieve documents with additional context.

Let me share a couple examples.

If you do not see any chat history, you MUST return the "Follow Up Input" as is:
```
Chat History:
Follow Up Input: How is Lawrence doing?
Standalone Question:
How is Lawrence doing?
```

If this is the second question onwards, you should properly rephrase the question like this:
```
Chat History:
Human: How is Lawrence doing?
AI:
Lawrence is injured and out for the season.
Follow Up Input: What was his injury?
Standalone Question:
What was Lawrence's injury?
```

Now, with those examples, here is the actual chat history and input question.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone Question:
[your response here]
[/INST]
"""

STANDALONE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

_template = """
[INST]
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language, that can be used to query a FAISS index. This query will be used to retrieve documents with additional context.

Let me share a couple examples that will be important.

If you do not see any chat history, you MUST return the "Follow Up Input" as is:

```
Chat History:

Follow Up Input: How is Lawrence doing?
Standalone Question:
How is Lawrence doing?
```

If this is the second question onwards, you should properly rephrase the question like this:

```
Chat History:
Human: How is Lawrence doing?
AI:
Lawrence is injured and out for the season.

Follow Up Input: What was his injury?
Standalone Question:
What was Lawrence's injury?
```

Now, with those examples, here is the actual chat history and input question.

Chat History:
{chat_history}

Follow Up Input: {question}
Standalone question:
[your response here]
[/INST]
"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

In [None]:
from langchain.memory.buffer import ConversationBufferMemory
from operator import itemgetter

In [None]:
from langchain.schema import format_document
from langchain_core.messages import AIMessage
from langchain_core.messages import HumanMessage
from langchain_core.messages import get_buffer_string
from langchain_core.runnables import RunnableLambda
from langchain_core.runnables import RunnableParallel
from langchain_core.runnables import RunnablePassthrough

# instantiate ConversationBufferMemory
memory = ConversationBufferMemory(
    return_messages=True, output_key='answer', input_key='question'
)

# first, load the memory to access chat history
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter('history'),
)

# define the standalone_question step to process the question and chat history
standalone_question = {
    'standalone_question': {
        'question': lambda x: x['question'],
        'chat_history': lambda x: get_buffer_string(x['chat_history']),
    } | STANDALONE_QUESTION_PROMPT
}

# finally, output the populated STANDALONE_QUESTION_PROMPT
output_prompt = {
    'standalone_question_prompt_result': itemgetter('standalone_question'),
}

# combine the steps into a final chain
standalone_query_generation_prompt = loaded_memory | standalone_question | output_prompt

1. The `ConversationBufferMemory` class is instantiated with parameters to return messages, specifying `answer` as the output key and `question` as the input key, which sets up a memory buffer to manage and track the conversation's questions and answers.

2. The `loaded_memory` variable uses `RunnablePassthrough` and `RunnableLambda` to load the chat history from memory, retrieving the `history` key, which holds the conversation's past interactions for reference and context management.
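
The behaviour described in these two points can be approximated with a few lines of plain Python. The class below is a hypothetical, simplified stand-in for a conversation buffer, not LangChain's `ConversationBufferMemory`; it only mirrors the `save_context` / `load_memory_variables` call pattern and the `Human:` / `AI:` formatting seen in the populated prompt.

```python
class BufferMemory:
    """Simplified conversation buffer keyed by input and output names."""

    def __init__(self, input_key="question", output_key="answer"):
        self.input_key = input_key
        self.output_key = output_key
        self.messages = []  # list of (role, text) pairs, oldest first

    def save_context(self, inputs, outputs):
        # Record one exchange: the human question and the AI answer.
        self.messages.append(("Human", inputs[self.input_key]))
        self.messages.append(("AI", outputs[self.output_key]))

    def load_memory_variables(self, _inputs=None):
        # Expose the stored messages under the "history" key.
        return {"history": list(self.messages)}

def buffer_string(messages):
    # Flatten the messages into the text block inserted into the prompt.
    return "\n".join(f"{role}: {text}" for role, text in messages)

memory_sketch = BufferMemory()
memory_sketch.save_context(
    {"question": "how is the striker doing?"},
    {"answer": "not great, bench him"},
)
history = memory_sketch.load_memory_variables({})["history"]
print(buffer_string(history))
```

Each call to `save_context` appends one question and one answer, so the flattened history grows by two lines per turn.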

In [None]:
# example
inputs = {'question': 'how is mahomes doing?'}
memory.save_context(inputs, {'answer': 'mahomes is not looking great! bench him!'})

In [None]:
inputs = {'question': 'who should I replace him with?'}
standalone_query_generation_prompt.invoke(inputs)['standalone_question_prompt_result']

StringPromptValue(text='\n[INST]\nGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. This query will be used to retrieve documents with additional context.\n\nLet me share a couple examples.\n\nIf you do not see any chat history, you MUST return the "Follow Up Input" as is:\n```\nChat History:\nFollow Up Input: How is Lawrence doing?\nStandalone Question:\nHow is Lawrence doing?\n```\n\nIf this is the second question onwards, you should properly rephrase the question like this:\n```\nChat History:\nHuman: How is Lawrence doing?\nAI:\nLawrence is injured and out for the season.\nFollow Up Input: What was his injury?\nStandalone Question:\nWhat was Lawrence\'s injury?\n```\n\nNow, with those examples, here is the actual chat history and input question.\nChat History:\nHuman: how is mahomes doing?\nAI: mahomes is not looking great! bench him!\nFollow Up Input: who should I replace him with?\nSta

Great, the prompt is populated with our conversation history. Now we just need to extend the `standalone_question` chain with the `standalone_query_generation_llm` model, and it should generate an updated question.

In [None]:
standalone_query_generation_chain = (
    loaded_memory | {
        'question': lambda x: x['question'],
        'chat_history': lambda x: get_buffer_string(x['chat_history']),
    } | STANDALONE_QUESTION_PROMPT | standalone_query_generation_llm
)

inputs = {'question': 'who should I replace him with?'}
print(standalone_query_generation_chain.invoke(inputs))




[INST]
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language. This query will be used to retrieve documents with additional context.

Let me share a couple examples.

If you do not see any chat history, you MUST return the "Follow Up Input" as is:
```
Chat History:
Follow Up Input: How is Lawrence doing?
Standalone Question:
How is Lawrence doing?
```

If this is the second question onwards, you should properly rephrase the question like this:
```
Chat History:
Human: How is Lawrence doing?
AI:
Lawrence is injured and out for the season.
Follow Up Input: What was his injury?
Standalone Question:
What was Lawrence's injury?
```

Now, with those examples, here is the actual chat history and input question.
Chat History:
Human: how is mahomes doing?
AI: mahomes is not looking great! bench him!
Follow Up Input: who should I replace him with?
Standalone Question:
[your response here]
[/INST]
Who shou
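
The printed output above echoes the entire prompt before the generated text. Since this string is later used as the retrieval query, it can help to keep only the part the model wrote. Below is a minimal sketch, assuming the completion always follows the final `[/INST]` marker; `extract_completion` is a hypothetical helper and the sample string is made up for illustration.

```python
def extract_completion(generated_text, end_marker="[/INST]"):
    """Return only the text after the last end marker, stripped of whitespace."""
    # rpartition splits on the LAST occurrence of the marker. If the marker
    # is missing, the third element is the whole input string.
    _, _, completion = generated_text.rpartition(end_marker)
    return completion.strip()


# Made-up example of a prompt echoed back together with a completion.
echoed = "[INST]\nprompt text\n[/INST]\n  example completion  "
print(extract_completion(echoed))
print(extract_completion("no marker here"))
```

If the model is wrapped in a Hugging Face text-generation pipeline, passing `return_full_text=False` when building the pipeline may achieve the same result; whether that applies depends on setup not shown in this section.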

In [None]:
template = """
[INST]
Answer the question based only on the following context:
{context}

Question: {question}
[/INST]
"""
ANSWER_PROMPT = ChatPromptTemplate.from_template(template)

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")

# This time we combine the retrieved documents into a single string that can be passed into our prompt.
# This isn't strictly necessary, but it is good practice. It allows us to:
# 1. Include additional cleanup of the input documents.
# 2. Process and summarize retrieved documents, which helps avoid overly verbose prompt strings.
def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)

# First we add a step to load memory
# This adds a "chat_history" key to the input object
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)

# Now we calculate the standalone question
standalone_question = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | STANDALONE_QUESTION_PROMPT
    | standalone_query_generation_llm,
}

# now we retrieve the documents
# this takes the standalone question we generated and queries the vector database. This is familiar, essentially the same chain we built in the original article
retrieved_documents = {
    "docs": itemgetter("standalone_question") | retriever,
    "question": lambda x: x["standalone_question"],
}

# now we construct the inputs for the final prompt
final_inputs = {
    "context": lambda x: _combine_documents(x["docs"]),
    "question": itemgetter("question"),
}

# and finally, we do the part that returns the answers
# now we have a string version of the retrieved documents and a standalone question ready for the response generation LLM to provide a final response to the user
answer = {
    # Here is the chain that generates the final response
    "answer": final_inputs | ANSWER_PROMPT | response_generation_llm,
    "question": itemgetter("question"),
    "context": final_inputs["context"]
}

# and now we put it all together
# To polish this up a bit more, we also include the standalone question and context into the output dictionary. This is useful information to have in a RAG application.
final_chain = loaded_memory | standalone_question | retrieved_documents | answer
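
The comments in the cell above mention that combining documents is a natural place for additional cleanup. As one possible illustration, here is a minimal sketch that drops empty texts and texts that look like page-load failures before joining them. `combine_clean_documents` and the `error_prefix` heuristic are assumptions made for this example; it works on plain strings, so in the notebook you would pass in each document's `page_content`.

```python
def combine_clean_documents(page_contents, separator="\n\n", error_prefix="Error:"):
    """Join document texts, skipping empty ones and ones that look like load errors."""
    kept = []
    for text in page_contents:
        stripped = text.strip()
        # Skip blank documents and ones that start with the error prefix.
        if not stripped or stripped.startswith(error_prefix):
            continue
        kept.append(stripped)
    return separator.join(kept)


# Made-up sample texts for illustration only.
sample_contents = [
    "Error: Page.goto: Timeout 30000ms exceeded.",
    "  First useful paragraph.  ",
    "",
    "Second useful paragraph.",
]
context = combine_clean_documents(sample_contents)
print(context)
```

A prefix check is a crude filter; it would be more robust to drop failed pages at load time, before they are embedded and stored in the vector database.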

In [None]:
def call_conversational_rag(question, chain, memory):
    '''
    Calls a conversational RAG (Retrieval-Augmented Generation) chain to generate an answer to a given question.

    This function sends a question to the RAG model, retrieves the answer, and stores the question-answer pair in memory
    for context in future interactions.

    Parameters:
    question (str): The question to be answered by the RAG model.
    chain (LangChain object): An instance of LangChain which encapsulates the RAG model and its functionality.
    memory (Memory object): An object used for storing the context of the conversation.

    Returns:
    dict: A dictionary containing the generated answer from the RAG model.
    '''

    # Prepare the input for the RAG model
    inputs = {'question': question}

    # Invoke the RAG model to get an answer
    result = chain.invoke(inputs)

    # save the current question-answer pair in memory for future context
    memory.save_context(inputs, {'answer': result['answer']})

    # return the result
    return result
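
The helper above only relies on the chain exposing `invoke` and the memory exposing `save_context`, so its turn-by-turn flow can be checked without loading a model. Below is a minimal sketch using hypothetical stubs (`StubChain`, `StubMemory`) and a local copy of the same three steps named `ask`; none of these names come from LangChain.

```python
class StubChain:
    """Hypothetical chain: returns a canned answer and counts calls."""

    def __init__(self):
        self.calls = 0

    def invoke(self, inputs):
        self.calls += 1
        return {"answer": f"stub answer {self.calls}", "question": inputs["question"]}


class StubMemory:
    """Hypothetical memory: records (question, answer) pairs in order."""

    def __init__(self):
        self.turns = []

    def save_context(self, inputs, outputs):
        self.turns.append((inputs["question"], outputs["answer"]))


def ask(question, chain, memory):
    # Same three steps as the helper above: build inputs, invoke, save.
    inputs = {"question": question}
    result = chain.invoke(inputs)
    memory.save_context(inputs, {"answer": result["answer"]})
    return result


stub_chain, stub_memory = StubChain(), StubMemory()
first = ask("how is mahomes doing?", stub_chain, stub_memory)
second = ask("who should I replace him with?", stub_chain, stub_memory)
print(stub_memory.turns)
```

Each call appends exactly one question-answer pair, so the next turn's standalone-question step sees the full history up to that point.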

In [None]:
question = 'How is Mahomes doing?'
call_conversational_rag(question, final_chain, memory)



{'answer': 'Human: \n[INST] \nAnswer the question based only on the following context:\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"\n\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-\nfantasy-lineup-week-10/", waiting until "load"\n\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-dfs-week-10-stacking-advice-\npicks-2023-fantasy-football/", waiting until "load"\n\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-\nadvice-2023-fantasy-football/", waiting until "load"\n\nQuestion: \n[INST] \nGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question

In [None]:
# save previous question and answer to memory
question = 'Who are good alternatives to him right now?'
call_conversational_rag(question, final_chain, memory)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


{'answer': 'Human: \n[INST] \nAnswer the question based only on the following context:\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-week-10-sleeper-picks-player-\npredictions-2023/", waiting until "load"\n\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/5-stats-to-know-before-setting-your-\nfantasy-lineup-week-10/", waiting until "load"\n\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/nfl-dfs-week-10-stacking-advice-\npicks-2023-fantasy-football/", waiting until "load"\n\nError: Page.goto: Timeout 30000ms exceeded. Call log: navigating to\n"https://www.fantasypros.com/2023/11/players-to-buy-low-sell-high-trade-\nadvice-2023-fantasy-football/", waiting until "load"\n\nQuestion: \n[INST] \nGiven the following conversation and a follow up question, rephrase the follow up question to be a standalone question

In [None]:
question = "How many PPG are both averaging?"
call_conversational_rag(question, final_chain, memory)

# Summary
- We walked through the practicalities of building a conversational RAG pipeline.
- Maintaining conversation history and rewriting each follow-up into a standalone query is what lets retrieval from the vector database find relevant context.
- The pipeline uses two distinct LLM steps, one to generate standalone queries and one to generate responses, which is intended to improve the contextual understanding and relevance of the responses.