In this Colab notebook, we perform retrieval-augmented generation (RAG) with LangChain and open-source models. Instead of PDFs, we work with data scraped directly from a website. The code is based on this Medium article: https://medium.com/@thakermadhav/build-your-own-rag-with-mistral-7b-and-langchain-97d0c92fa146

In [1]:
!pip install -q torch
!pip install -q -U git+https://github.com/huggingface/transformers.git  # Installing from the source because of a known error with the Mistral model
!pip install -q accelerate \
                peft \
                bitsandbytes \
                langchain \
                sentence_transformers \
                faiss-gpu \
                trl \
                playwright \
                html2text

# Install the Playwright browsers needed for Chromium-based scraping
!playwright install


Import appropriate libraries

In [2]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

Select appropriate model here

In [3]:
model_name = "mistralai/Mistral-7B-Instruct-v0.1"

model_config = transformers.AutoConfig.from_pretrained(model_name)

# Get the appropriate tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Select appropriate quantization parameters

In [4]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
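
To see why 4-bit loading matters on a Colab GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: `NUM_PARAMS` is an assumed round figure for a "7B" model, and the helper `weight_memory_gb` is a hypothetical name. The estimate ignores quantization constants, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4 bits per weight

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the float16 footprint, which is what makes the model fit on a smaller GPU.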

Set up the model

In [5]:
# Quantization config
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check whether the GPU supports bfloat16 (compute capability 8.0 or higher)
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load the pre-trained model with the quantization config applied
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)


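Test the LLM without RAG. The answer is coherent but contains a lot of inaccurate information: for example, it describes FtsZ as a flagellar protein, whereas FtsZ is a bacterial cell-division protein.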

In [7]:
inputs_not_chat = tokenizer.encode_plus("[INST] Tell me about FtsZ [/INST]", return_tensors="pt")['input_ids'].to('cuda')

generated_ids = model.generate(inputs_not_chat,
                               max_new_tokens=1000,
                               do_sample=True)

decoded = tokenizer.batch_decode(generated_ids)

print(decoded)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['<s> [INST] Tell me about FtsZ [/INST] FtsZ (Flagellar filament zone protein) is a protein found in archaea and some bacteria that play a critical role in the assembly and organization of the bacterial flagell, a whip-like appendage that helps to propel the cell forward. FtsZ is a member of the peptidoglycan polymerase (PGP) enzyme family, which is responsible for the synthesis of the peptidoglycan layer that surrounds the bacterial cell and the bacterial flagellar filament.\n\nFtsZ is a helical protein that aligns with the axis of rotation of the flagellar motor, a complex protein machinery that powers the rotation of the flagell. FtsZ is thought to provide the basis for the polymerization of peptidoglycan filaments that make up the flagellar filament zone. Together with other proteins, FtsZ forms a ring that encircles the flagellar motor and helps to anchor the motor to the cell membrane. The assembly and disassembly of FtsZ rings are thought to be regulated by a series of highly co

# Setting up vector database

Here, we use FAISS as the vector database to store embeddings of the scraped web page. The embeddings are generated with a sentence-transformers model.

In [30]:
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.vectorstores import FAISS
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import nest_asyncio

nest_asyncio.apply()

# Note: Add multiple websites if you want to build a database
articles = ["https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439403/"]

# Scrape the articles listed above
loader = AsyncChromiumLoader(articles)
docs = loader.load()

# Converts HTML to plain text
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

# Chunk text
text_splitter = CharacterTextSplitter(chunk_size=250,
                                      chunk_overlap=0)
chunked_documents = text_splitter.split_documents(docs_transformed)

# Load chunked documents into the FAISS index
# We use the all-mpnet-base-v2 model to generate the embeddings
db = FAISS.from_documents(chunked_documents, HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2'))
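
Conceptually, the index stores one embedding vector per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The following is a minimal pure-Python sketch of that idea, not FAISS's actual implementation. It assumes Euclidean (L2) distance, and the two-dimensional vectors, chunk labels, and helper names (`l2_distance`, `top_k`) are made up for illustration. Real all-mpnet-base-v2 embeddings have 768 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=3):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy "embeddings" standing in for real chunk vectors
toy_index = [
    ("chunk about Z rings", [0.9, 0.1]),
    ("chunk about references", [0.1, 0.9]),
    ("chunk about FtsZ structure", [0.8, 0.3]),
]

results = top_k([1.0, 0.2], toy_index, k=2)
for text, dist in results:
    print(f"{dist:.3f}  {text}")
```

A brute-force scan like this is fine for a single article. A dedicated index becomes worthwhile as the number of chunks grows.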





Set up the LLM chain and add custom instructions

In [31]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_core.runnables import RunnablePassthrough

text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)

prompt_template = """
### [INST]
Instruction: You are a scientist. Answer the question based on your knowledge of biomedical science. Here is context to help:

{context}

### QUESTION:
{question}

[/INST]
 """

mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

# Create prompt from prompt template
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

# Create llm chain
llm_chain = LLMChain(llm=mistral_llm, prompt=prompt)
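
When a retriever's output is passed straight into `{context}`, the prompt most likely receives the printed form of a list of `Document` objects, metadata included. Below is a minimal sketch of one alternative, assuming you would rather pass only the chunk text. `Doc` is a hypothetical stand-in for a LangChain `Document`, `format_docs` is a hypothetical helper, and the template and chunk text are abbreviated toy versions of the ones above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Abbreviated version of the prompt template used in this notebook
template = (
    "### [INST]\n"
    "Instruction: Answer the question using the context below.\n\n"
    "{context}\n\n"
    "### QUESTION:\n"
    "{question}\n\n"
    "[/INST]"
)

docs = [Doc("FtsZ forms the Z ring."), Doc("The Z ring marks the division site.")]
filled = template.format(context=format_docs(docs), question="What does FtsZ do?")
print(filled)
```

Keeping the metadata out of the prompt leaves more of the context window for the chunk text itself.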

# Run LLM query against the database

Note that the output lists the top k retrieved 'context' chunks (here k = 3) before the answer generated by the LLM.

In [35]:
%%time
query = "What is the structure of FtsZ in Staphylococcuss aureus?"

# Connect query to FAISS index using a retriever
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 3}
)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

rag_chain.invoke(query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


CPU times: user 12.5 s, sys: 153 ms, total: 12.6 s
Wall time: 12.7 s


{'context': [Document(page_content='11\\.  Löwe J, Amos LA (1998) Crystal structure of the bacterial cell-division\nprotein FtsZ. Nature 391: 203–206. [PubMed] [Google Scholar]', metadata={'source': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439403/'}),
  Document(page_content='11\\.  Löwe J, Amos LA (1998) Crystal structure of the bacterial cell-division\nprotein FtsZ. Nature 391: 203–206. [PubMed] [Google Scholar] [Ref list]', metadata={'source': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439403/'}),
  Document(page_content='### 3D-SIM of Z Rings in _S. aureus_ Cells Also Reveals a Heterogeneous\nDistribution of FtsZ', metadata={'source': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439403/'})],
 'question': 'What is the structure of FtsZ in Staphylococcuss aureus?',
 'text': '\nAccording to the provided documents, the crystal structure of FtsZ in Staphylococcus aureus has been determined through X-ray crystallography and 3D-SIM imaging. The crystal structure reveals that F

# Optional:

Here is the code if you want to query a PDF with LangChain

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("my_pdf_file.pdf")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = loader.load_and_split(text_splitter)

# Pass the chunks and the embedding model to FAISS
db = FAISS.from_documents(texts, HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2'))
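
The `chunk_size` and `chunk_overlap` settings can be pictured as a sliding window over the text. The sketch below is a deliberately simplified, fixed-window version: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, which this toy function (`chunk_with_overlap`, a hypothetical name) does not attempt.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.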
