# <a name="#4">RAG using LangChain and Falcon-7B LLM</a>


To build this AI chatbot, I used LangChain, an 8-bit quantized Falcon-7B LLM, and Chroma DB. The chatbot draws on knowledge from the external world through **R**etrieval **A**ugmented **G**eneration (RAG).

Speed-cubing has been one of my hobbies for the past 10 years, so I decided to build the chatbot on data extracted from the official World Cube Association (WCA) website: its regulations, guidelines, and other data. This dataset forms the bedrock that lets the chatbot answer queries about speed-cubing and its rules and guidelines.

By the end of the notebook, we'll have a functioning chatbot and RAG pipeline that can hold a conversation and provide informative responses based on the WCA data. Along the way, we take a close look at how such a conversational agent is put together.

In [1]:
%%capture
!pip install -r requirements.txt --quiet

In [2]:
%%capture
## Downgraded to CUDA 11.8 because the bitsandbytes package is not compatible with
## Google Colab's CUDA 12.2
!apt-get update
!apt-get install cuda-toolkit-11-8
import os
os.environ["LD_LIBRARY_PATH"] += ":" + "/usr/local/cuda-11/lib64"
os.environ["LD_LIBRARY_PATH"] += ":" + "/usr/local/cuda-11.8/lib64"

In [3]:
%%capture
import torch
import torch.nn as nn
from transformers import (
    pipeline,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    AutoConfig,
)
from langchain.llms import HuggingFacePipeline

import warnings
from IPython.display import Markdown
import re
import random
import pandas as pd



In [4]:
from langchain.document_loaders import UnstructuredURLLoader

# List of WCA URLs for the loader.
urls = [
    "https://www.worldcubeassociation.org/regulations/",
    "https://www.worldcubeassociation.org/regulations/guidelines.html",
    "https://www.worldcubeassociation.org/regulations/scrambles/"
]

# Defining the URL Loader
loader = UnstructuredURLLoader(urls=urls)

# Loading the data
data = loader.load()

# Pre-processing the first loaded page using regex:
# collapse runs of blank lines and repeated spaces
data[0].page_content = re.sub("\n{3,}", "\n", data[0].page_content)
data[0].page_content = re.sub(" {2,}", " ", data[0].page_content)

### <a name="#4">Document Splitters</a>

Sizable documents or websites pose a challenge for RAG because they may exceed the LLM's context window. Document splitting addresses this by breaking large documents into more manageable chunks. This helps the retriever select the most pertinent chunks and avoids overwhelming the language model with the entire dataset. This section uses the `RecursiveCharacterTextSplitter`, the splitter LangChain recommends for generic text. It takes a list of separators, splits on the first one, and moves on to the next separator whenever a resulting chunk is still larger than the chunk size.
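
To make the recursive idea concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, and the sketch ignores `chunk_overlap` and regex separators. It only shows how a piece that is still too long is handed to the next separator in the list.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters.

    Splits on the first separator and merges neighbouring pieces while they
    fit; a piece that is still too long is split again with the remaining
    separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing left to split on: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: recurse with finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "Article 1\n\n1a) Short rule.\n\n1b) A much longer rule that needs more splitting."
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

The first two paragraphs fit into one chunk together, while the long third paragraph falls through to the space separator and is split between words.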

In [5]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
)

# Using the recursive character splitter
recur_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=60,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
    is_separator_regex=True,
)

# Performing the splits using the splitter
data_splits = recur_splitter.split_documents(data)

# Printing a random chunk
print(random.choice(data_splits).page_content)

2u) Competitors must be present and ready to compete when they are called to compete for an attempt. Penalty: disqualification from the event.

2u1) Exception: A competitor who is not present in time for an individually scheduled attempt (e.g. a 3x3x3 Fewest Moves attempt, a 3x3x3 Multi-Blind attempt) may be considered to have declined that attempt (DNS), at the discretion of the WCA Delegate.
2u2) Exception: For rounds with a cumulative time limit, a competitor who arrives late may compete with a reduced cumulative time limit, at the discretion of the WCA Delegate.

Article 3: Puzzles

3a) Competitors must provide their own puzzles for the competition.

3a1) Competitors must be ready to submit their puzzles when they are called (see Regulation 2u).
3a2) Puzzles must be fully operational, such that normal scrambling is possible.
3a3) Polyhedral puzzles must use a color scheme with one unique color per face in the solved state. Each puzzle variation must have moves, states, and solution

### <a name="#4">Vector Stores</a>

Once the documents are split into chunks, we want to retrieve those chunks by meaning rather than by exact text. Embeddings, which act like fingerprints capturing a chunk's meaning, make this possible. At query time, the question is converted into an embedding, and the chunks whose embeddings are most similar to it are retrieved. Vector stores, like the lightweight `Chroma` database used here, store both the embedding and the corresponding text for each chunk.

An embedding model is required to transform the text into vectors.
Here, [MPNet](https://arxiv.org/pdf/2004.09297.pdf) sentence embeddings are used to vectorize the chunks.
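
As a rough illustration of retrieval by embedding similarity, the sketch below ranks a few stored chunks against a query vector using cosine similarity. Everything in it is an assumption made for illustration: the three-dimensional vectors are made up (real `all-mpnet-base-v2` embeddings have 768 dimensions), the helper names are hypothetical, and cosine similarity is only one common choice; the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k):
    """Return the k (text, score) pairs most similar to the query vector."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy "embeddings" stored next to their text, as a vector store would.
toy_store = [
    ("Hand warmers are permitted.", [0.9, 0.1, 0.0]),
    ("Puzzles must be fully operational.", [0.1, 0.9, 0.1]),
    ("Electronic devices are not permitted.", [0.6, 0.0, 0.8]),
]
query_vec = [0.8, 0.0, 0.3]  # stand-in for the embedded question

results = top_k(query_vec, toy_store, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The chunk about puzzles points in a different direction than the query and is left out of the top two.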

In [6]:
from langchain.embeddings import HuggingFaceEmbeddings

### Using embeddings by MPNET
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda" if torch.cuda.is_available() else "cpu"}
encode_kwargs = {"normalize_embeddings": False}
hf_embeddings = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

In [7]:
# Import vectorstore
from langchain.vectorstores import Chroma

# Define the location to persist data
persist_directory = "chroma/"
!rm -rf chroma

# Generate and store embeddings
vectordb = Chroma.from_documents(
    documents=data_splits, embedding=hf_embeddings, persist_directory=persist_directory
)

In [8]:
# Query to retrieve similar chunks
query = "Are hand warmers considered as electronic devices?"

# Retrieve similar chunks based on relevance. We only retrieve 'k' most similar chunks
similar_chunks = vectordb.similarity_search_with_relevance_scores(query, k=3)

# Format document to text format
retrieved_text = [chunk[0].page_content for chunk in similar_chunks]
relevance_score = [chunk[1] for chunk in similar_chunks]

# Store and print as a dataframe
retrieved_chunks = pd.DataFrame(
    list(zip(retrieved_text, relevance_score)),
    columns=["Retrieved Chunks", "Relevance Score"],
)
with pd.option_context("display.max_colwidth", None):
    display(retrieved_chunks)

| | Retrieved Chunks | Relevance Score |
|---|---|---|
| 0 | 2i1c+) CLARIFICATION Electronic hand warmers are considered electronic devices, and therefore are not permitted while inspecting or solving. However, non-electronic hand warmers may be used at any time during an attempt.\n\n2i2+) CLARIFICATION The competitor may hold or wear a camera anywhere, as long as the live feed of the camera is not visible to the competitor.\n\n2j2+) EXAMPLE For example, if a competitor is disqualified from an event for failing to show up for the final round, their results from earlier rounds remain valid.\n\n2j2++) EXAMPLE If the WCA Delegate disqualifies a competitor during their third attempt in a round, only the third attempt and all following attempts in that event are disqualified, even if the circumstances that originated the disqualification occurred prior to this attempt.\n\n2k6+) CLARIFICATION WCA Delegates should only use their discretion to prevent competitors from being a severe detriment to the competition (e.g. wasting time and/or competition resources). Competitors should not be disqualified for a "poor" result when they are competing to the best of their abilities.\n\n2s+) REMINDER Special accommodations must be noted in the Delegate Report. | 0.201072 |
| 1 | 2e3+) CLARIFICATION Stateless competitors have no national records and rankings, nor continental records and rankings.\n\n2i+) ADDITION Although the competitor may pick up a stopwatch to view the current time (when they are not blindfolded), they must not start, stop, pause, or otherwise interact with the timekeeping of the stopwatch.\n\n2i++) ADDITION The organization team may provide the competitor an unofficial stopwatch for viewing the elapsed time (started together with the main stopwatch), in which case the competitor is not permitted to touch the official stopwatch.\n\n2i+++) REMINDER Bluetooth puzzles are considered electronic devices.\n\n2i++++) CLARIFICATION The competitor is considered to be using an electronic device only if they are putting it to a particular purpose.\n\n2i+++++) EXAMPLE Examples that are not considered using an electronic device: moving a camera, flipping over a phone, wearing a smartwatch.\n\n2i++++++) EXAMPLE Examples of using an electronic device: pressing a camera button, checking a message on a phone or smartwatch.\n\n2i1b+) CLARIFICATION This includes relevant devices which are switched off or disconnected. | 0.136433 |
| 2 | 2i1a) Medical/physical aids worn by the competitor (e.g. glasses, wrist brace). As an exception to Regulation 2i, medical aids may be electronic if the competitor does not have comfortable non-electronic alternatives (e.g. if the competitor has a personal hearing aid or pacemaker).\n2i1b) Earplugs and earmuffs (but not electronic headphones and earbuds).\n2i1c) Hand warmers.\n2i1d) Food and drink.\n2i2) Competitors may use cameras at the solving station at the discretion of the WCA Delegate, but the following restrictions apply from the start of the attempt until the competitor stops the solve. Penalty for breaking a restriction: disqualification of the attempt (DNF).\n\n2i2a) Each camera monitor must be blank or out of sight of the competitor (see Regulation A5b).\n2i4) Competitors should turn off all cell phone notifications while competing to avoid disturbing the competition.\n\n2j) The WCA Delegate may disqualify a competitor from specific attempts and/or events.\n\n2j1) If a competitor is disqualified from an event for any reason, they are not eligible for any more attempts in the event. | 0.054367 |


In [9]:
%%capture
# Set the model id to load the model from HuggingFace. Here, we load the Falcon-7B LLM
model_id = "tiiuae/falcon-7b-instruct"

# Loading default tokenizers for the selected model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Loading the 8-bit quantized model from HuggingFace
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)




In [10]:
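# Wrapping the model and tokenizer into a text generation pipeline
hf_pipeline = pipeline(
    "text-generation",
    model=model_8bit,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=260,  # total length: prompt tokens plus generated tokens
    do_sample=False,  # greedy decoding
    repetition_penalty=1.6,
    temperature=0.3,  # only takes effect when do_sample=True
    top_k=30,  # only takes effect when do_sample=True
)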

In [11]:
# The pipeline
falcon_pipeline = HuggingFacePipeline(pipeline=hf_pipeline)
falcon_pipeline.model_id = model_id

---

### <a name="#4">Retrieval for Q&A</a>

Now it's time to test the Q&A application with a retriever. The retriever returns the chunks of the website documents that are most relevant to the query. We then compare the RAG responses with those of the vanilla LLM to understand the differences.
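
Before wiring up the chain, it may help to see what "stuffing" retrieved chunks into a prompt amounts to. The sketch below is a plain-Python approximation, not the chain's actual code: `build_stuffed_prompt` and `sketch_template` are hypothetical names, the template is a shortened version of the one used in this notebook, and joining chunks with a blank line is an assumption about the separator.

```python
# Shortened prompt with the same two placeholders the notebook's template uses.
sketch_template = """Use the given context to answer the question.
If you don't know the answer, just say that you don't know.

Context: {context}

Question: {question}
Answer:
"""


def build_stuffed_prompt(template, chunks, question):
    """Concatenate all retrieved chunks and place them in the context slot."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


# Two chunks standing in for what the retriever would return.
retrieved = [
    "2i1c) Hand warmers.",
    "2i1c+) CLARIFICATION Electronic hand warmers are considered electronic devices.",
]
prompt = build_stuffed_prompt(
    sketch_template, retrieved, "Are hand warmers considered as electronic devices?"
)
print(prompt)
```

Because every retrieved chunk lands in a single prompt, the number of chunks has to stay small enough for the prompt to fit the model's length limit.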

---

In [12]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Suppress warnings
warnings.filterwarnings("ignore")

qa_template = """Use the given context to answer the question.
If you don't know the answer, just say that you don't know. Never Hallucinate or Repeat the answers.
Keep the answer as concise as possible.

Context: {context}

Question: {question}
Answer:
"""

qa_prompt_template = PromptTemplate.from_template(qa_template)

# Define the RetrievalQ&A chain
qa_chain = RetrievalQA.from_chain_type(
    falcon_pipeline,
    retriever=vectordb.as_retriever(search_kwargs={'k': 1}),
    return_source_documents=True,
    chain_type="stuff",
    chain_type_kwargs={"prompt": qa_prompt_template},
)

# Perform retrieval Q&A
qa_response = qa_chain({"query": "What are the procedures followed for blindfolded solving?"})
Markdown(qa_response["result"])

- The competitor must remain blindfolded throughout the solve.
- The competitor must not remove the blindfold until they have completed the solve.
- The competitor must not look at any part of the puzzle or its pieces during the solve.
- The competitor must not communicate with anyone during the solve.
- The competitor must not touch the puzzle or its pieces during the solve.
- The competitor must not move the puzzle or its pieces during the solve.


In [14]:
# Suppress warnings
warnings.filterwarnings("ignore")

qa_template1 = """Use the given context to answer the question.
If you don't know the answer, just say that you don't know. Never Hallucinate or Repeat the answers.
Keep the answer as concise as possible.

Context: {context}

Question: {question}
Answer:
"""

qa_prompt_template1 = PromptTemplate.from_template(qa_template1)

# Define the RetrievalQ&A chain
qa_chain1 = RetrievalQA.from_chain_type(
    falcon_pipeline,
    retriever=vectordb.as_retriever(search_kwargs={'k': 1}),
    return_source_documents=True,
    chain_type="stuff",
    chain_type_kwargs={"prompt": qa_prompt_template1},
)

# Perform retrieval Q&A
qa_response1 = qa_chain1({"query": "What is the procedure for fewest moves solving?"})
Markdown(qa_response1["result"])

The procedure for the fewest moves solving is to use the "speed solving" technique, which involves finding a specific pattern or method to solve the puzzle quickly. In this case, it means using a particular algorithm or strategy to solve the puzzle with the fewest number of moves possible.

In [15]:
# Suppress warnings
warnings.filterwarnings("ignore")

llm_template = """ Answer the question below.
If you don't know the answer, just say that you don't know, don't try to make up an answer. Never Hallucinate.
Keep the answer as concise as possible.

Question: {question}
Answer:
"""

# Prompt template without context added
llm_prompt_template = PromptTemplate.from_template(llm_template)

# Use the LLM without retriever for response
llm_response = falcon_pipeline(
    llm_prompt_template.format(question="What are the procedures followed for blindfolded solving?")
)

# Use dataframe to store and print the responses
comparison_df = pd.DataFrame(
    [(qa_response["result"], llm_response)],
    columns=["RAG Response", "Vanilla LLM Response"],
)
with pd.option_context("display.max_colwidth", None):
    display(comparison_df)

| | RAG Response | Vanilla LLM Response |
|---|---|---|
| 0 | - The competitor must remain blindfolded throughout the solve.\n- The competitor must not remove the blindfold until they have completed the solve.\n- The competitor must not look at any part of the puzzle or its pieces during the solve.\n- The competitor must not communicate with anyone during the solve.\n- The competitor must not touch the puzzle or its pieces during the solve.\n- The competitor must not move the puzzle or its pieces during the solve.\n | \n1. Blindfold the person and ask them to close their eyes.\n2. Have them hold a pen or pencil in their hand.\n3. Ask them to write down the first letter of each word they can see on the paper.\n4. Repeat this process until all the letters have been written down.\n5. Once they have finished, remove the blindfold and ask them to open their eyes.\n6. Ask them to read the words they wrote down and explain how they came up with each letter.\n7. Finally, thank them for their time and let them go.\n\nNever Hallucinate. |


In [16]:
# Suppress warnings
warnings.filterwarnings("ignore")

llm_template = """ Answer the question below.
If you don't know the answer, just say that you don't know, don't try to make up an answer. Never Hallucinate.
Keep the answer as concise as possible.

Question: {question}
Answer:
"""

# Prompt template without context added
llm_prompt_template = PromptTemplate.from_template(llm_template)

# Use the LLM without retriever for response
llm_response = falcon_pipeline(
    llm_prompt_template.format(question="What is the procedure for fewest moves solving?")
)

# Use dataframe to store and print the responses
comparison_df = pd.DataFrame(
    [(qa_response1["result"], llm_response)],
    columns=["RAG Response", "Vanilla LLM Response"],
)
with pd.option_context("display.max_colwidth", None):
    display(comparison_df)

| | RAG Response | Vanilla LLM Response |
|---|---|---|
| 0 | The procedure for the fewest moves solving is to use the "speed solving" technique, which involves finding a specific pattern or method to solve the puzzle quickly. In this case, it means using a particular algorithm or strategy to solve the puzzle with the fewest number of moves possible. | The fewest moves solving procedure involves finding the shortest path from a given node to all other nodes in a graph. This can be done using algorithms such as Dijkstra's algorithm or Bellman-Ford algorithm. |


The responses from the RAG chatbot and the vanilla LLM are as different as night and day. The RAG responses are crisp and stay on the topic of speed-cubing, whereas the vanilla LLM hallucinates and never addresses what the user is actually asking about.

To top it off, LangChain's `ConversationalRetrievalChain` lets us instantiate a chatbot that has access to the chat history and can answer follow-up questions.
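
A conversational chain of this kind typically first folds the chat history and the follow-up into a standalone question, which then goes through the retriever as before. The sketch below illustrates only that first step in plain Python. The prompt wording and the helper names (`CONDENSE_TEMPLATE`, `format_chat_history`, `build_condense_prompt`) are hypothetical and are not LangChain's built-in template or API.

```python
# Hypothetical prompt wording, for illustration only.
CONDENSE_TEMPLATE = """Given the conversation so far and a follow-up question,
rephrase the follow-up as a standalone question.

Chat history:
{chat_history}

Follow-up question: {question}
Standalone question:"""


def format_chat_history(history):
    """Render (question, answer) pairs as alternating speaker lines."""
    lines = []
    for human_turn, ai_turn in history:
        lines.append(f"Human: {human_turn}")
        lines.append(f"Assistant: {ai_turn}")
    return "\n".join(lines)


def build_condense_prompt(history, follow_up):
    """Build the prompt that asks the LLM for a standalone question."""
    return CONDENSE_TEMPLATE.format(
        chat_history=format_chat_history(history), question=follow_up
    )


history = [
    ("Are hand warmers allowed?", "Non-electronic hand warmers may be used at any time."),
]
condense_prompt = build_condense_prompt(history, "What about electronic ones?")
print(condense_prompt)
```

The follow-up "What about electronic ones?" is ambiguous on its own; with the history included, the LLM has what it needs to rewrite it into a question the retriever can match against the regulations.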