<a href="https://colab.research.google.com/github/abdalrahmenyousifMohamed/ML/blob/main/BGE_Embeddings%2C_LangChain_and_Chroma_and_Llama_v2_for_Retrieval_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div class="markdown-google-sans">
  <h2>BGE Embeddings,LangChain and Chroma and Llama v2 for Retrieval QA
</h2>
</div>

Embeddings play a pivotal role in atural language modeling, particularly in the context of semantic search and retrieval augmented generation (RAG).  

To see the performance of various embedding models, it is common for practitioners to consult leaderboards.
Massive Text Embedding Benchmark (MTEB) Leaderboard from HuggingFace provides well-rounded benchmarks for commonly used embedding models in English and Chinese languages. (also check the Open LLM Leaderboard)  

We have been using embeddings from NLP Group of The University of Hong Kong (instructor-xl) for building applications and OpenAI (text-embedding-ada-002) for building quick prototypes.  

We recently switched to BGE embeddings (large and base) which are now top-rated on the MTEB leaderboard! What's really impressive is how efficient they are. For example, the bigger version of the BGE model is only 1.34GB, which is much smaller than the 'instructor-xl' model at 4.96GB, but it works even better.  

In this tutorial, you will learn how to   
- Download papers from Arxiv,   
- Create and store embeddings in ChromaDB for RAG,   
- Use Llama-2–13B to answer questions and give credit to the sources.  

Before we go through all of this awesomeness, please follow us on <a href="https://medium.com/@datadrifters">Medium</a> to never miss a beat.

Let's dive in!

# Getting Started

Run the following commands in your terminal

In [None]:
# !apt install python3.10-venv

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  python3-pip-whl python3-setuptools-whl
The following NEW packages will be installed:
  python3-pip-whl python3-setuptools-whl python3.10-venv
0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded.
Need to get 2,473 kB of archives.
After this operation, 2,884 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-pip-whl all 22.0.2+dfsg-1ubuntu0.4 [1,680 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3-setuptools-whl all 59.6.0-1.2ubuntu0.22.04.1 [788 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 python3.10-venv amd64 3.10.12-1~22.04.3 [5,716 B]
Fetched 2,473 kB in 1s (3,261 kB/s)
Selecting previously unselected package python3-pip-whl.
(Reading database ... 121752 files and directories currently installed.)
Pr

In [None]:
# # Create project folder and virtual environment, then install required libraries
# !mkdir bge-langchain-chroma && cd bge-langchain-chroma
# !python3 -m venv bge-langchain-chroma-env
# !source bge-langchain-chroma-env/bin/activate

install required libraries

In [None]:
!pip3 install langchain tiktoken chromadb python-dotenv ipykernel jupyter arxiv pymupdf
!pip3 install sentence_transformers pypdf unstructured
!pip3 install auto_gptq

Then open the IDE of your choice, we are using VSCode

In [None]:
# code .

We are ready to start, let's import required libraries, we have added notes for you to understand what each library does

# Importing required libraries

In [None]:
# Imports
from chromadb.config import Settings
from urllib.error import HTTPError
from dataclasses import replace
from dotenv import load_dotenv
from tqdm import tqdm
import numpy as np
import tiktoken # OpenAI's open-source tokenizer
import chromadb
import logging
import random # to sample multiple elements from a list
import arxiv
import time
import os # operating system dependent functionality, to walk through directories and files

from langchain.text_splitter import RecursiveCharacterTextSplitter # recursively tries to split by different characters to find one that works
from langchain.document_loaders import PyPDFDirectoryLoader # loads pdfs from a given directory
from langchain.chains import ConversationalRetrievalChain # looks up relevant documents from the retriever per history and question.
from langchain.text_splitter import CharacterTextSplitter # splits the content
from langchain.embeddings import HuggingFaceBgeEmbeddings # wrapper for HuggingFaceBgeEmbeddings models
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
from langchain.document_loaders import ArxivLoader # loads paper for a given id from Arxiv
from langchain.document_loaders import PyPDFLoader # loads a given pdf
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader # loads a given text
from langchain.retrievers import ArxivRetriever # loads relevant papers for a given paper id from Arxiv
from chromadb.utils import embedding_functions # loads Chroma's embedding functions from OpenAI, HuggingFace, SentenceTransformer and others
from langchain.chat_models import ChatOpenAI # wrapper around OpenAI LLMs
from langchain.vectorstores import Chroma # wrapper around ChromaDB embeddings platform
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain import HuggingFaceHub # wrapper around HuggingFaceHub models

from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

load_dotenv() # loads env variables
import logging

logging.basicConfig(level=logging.INFO) # to inspect network behavior and API logic of Arxiv and Chroma


load_dotenv() helps you to load environment variables from '.env' file in your root directory. Here's where you typically put your API Keys such as OpenAI, Supabase, Pinecone or other cloud services.

## Downloading data: Arxiv paper for a given search term

We will be working with 85-page long Arxiv paper named "A Survey of Large Language Models".
Here's a little snippet to download it

In [26]:
!mkdir arxiv_papers
dirpath = "arxiv_papers"

search = arxiv.Search(
  query = "2303.18223" # ID of the paper A Survey of Large Language Models
)

for result in tqdm(search.results()):
    result.download_pdf(dirpath=dirpath)
    print(f"-> Paper id {result.get_short_id()} with title '{result.title}' is downloaded.")

mkdir: cannot create directory ‘arxiv_papers’: File exists


  for result in tqdm(search.results()):
1it [00:04,  4.34s/it]

-> Paper id 2303.18223v13 with title 'A Survey of Large Language Models' is downloaded.





We created a directory called "arxiv_papers" in the current working directory and download the paper there.
You can now load all the papers in that directory

In [27]:
papers = []
loader = DirectoryLoader('./arxiv_papers/', glob="./*.pdf", loader_cls=PyPDFLoader)
papers = loader.load()

In [28]:
print("Total number of pages loaded: ", len(papers))

Total number of pages loaded:  124


Before we split the text into smaller chunks, let us explain two important arguments: chunk_size and chunk_overlap

When you're trying to embed a document, you have to think about the granularity of the information that you are trying to capture. Sometimes you need a fine-grained view (e.g., spell-checks, keyword analysis), and other times, you need to take a step back to see the greater context (e.g., summarization, question-answering).

So, depending on what you're trying to understand from a text, you'll need to adjust how much you read at one time, which is controlled by chunk_size, and chunk_overlap is the number of characters to overlap between two chunks for preserving the semantic context in subsequent chunks.

In addition, chunk_size also has an effect on the inference performance since it determines the average number of tokens that will be submitted to LLM to generate the response.

There are different chunking strategies and <a href="https://www.pinecone.io/learn/chunking-strategies/">here's a nice article</a> that explains several options.

Let's load and chunk the paper using chunk size of 500 and overlap of 50, you should definitely experiment with these to find what works best in your case.

In [29]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 50
)

paper_chunks = text_splitter.split_documents(papers)

In [30]:
len(paper_chunks)

1668

You can manually inspect some of the chunks.

In [31]:
paper_chunks[5]

Document(page_content='attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI\ncommunity, which would revolutionize the way how we develop and use AI algorithms. Considering this rapid technical progress, in this\nsurvey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular,', metadata={'source': 'arxiv_papers/2303.18223v13.A_Survey_of_Large_Language_Models.pdf', 'page': 0})

You can also verify the average length of the chunks

In [32]:
chunk_lengths = [len(paper_chunk.page_content) for paper_chunk in paper_chunks]
np.average(chunk_lengths)

452.8890887290168

Looks good! Let's continue with embeddings.

If you are not familiar with the topic, you can think of using embeddings as creating a smart filing system where files are semantically related. Similar files will be closer to each other than dissimilar ones. It helps you to quickly find and use relevant information as per user prompt or query.

Technically, embeddings enables the dynamic augmentation of the model input at execution time, in addition to your prompt, you also provide relevant context for model to generate high quality responses.

Let's see how it's done in practice!

# Downloading HuggingFace BGE Embeddings

In [33]:
model_name = "BAAI/bge-base-en"
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

embedding_function = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs={'device': 'cuda'},
    encode_kwargs=encode_kwargs
)


# Working with ChromaDB to store embeddings

In this section, you will use the OpenAI embedding model to generate embeddings for your documents and store them in ChromaDB for easy retrieval later.

In [34]:
persist_directory="./chromadb/"

vectordb = Chroma.from_documents(
    documents=paper_chunks, # text data that you want to embed and store
    embedding=embedding_function, # used to convert the documents into embeddings
    persist_directory=persist_directory, # this tells Chroma where to store its data
    collection_name="arxiv_papers" #  gives a name to the collection of embeddings, which will be helpful for retrieving specific groups of embeddings later.
)

vectordb.persist() # will make the database save any changes to the disk

# Retrieval QA with LangChain and Chroma

In case you run this code block second time after ChromaDB is created, you can use below line to create vectordb from ChromaDB. This will save time.

In [35]:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding_function)

In [None]:
# !pip install auto-gptq



In [19]:
# !rm -r "/content/TheBloke/Me"

First, we need to download Llama-2-13B-chat-GPTQ model, but you can also use 7B or 30B models.

In [21]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
model_basename = "model"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
use_safetensors=True,
trust_remote_code=True,
device="cuda:0",
use_triton=use_triton,
quantize_config=None)


INFO - `checkpoint_format` is missing from the quantization configuration and is automatically inferred to gptq.
INFO - The layer lm_head is not quantized.


In [None]:
# !pip3 install transformers>=4.32.0 optimum>=1.12.0
# !pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7




In [None]:
# !pip3 uninstall -y auto-gptq
# !git clone https://github.com/PanQiWei/AutoGPTQ
# !cd AutoGPTQ
# !pip3 install .


Found existing installation: auto_gptq 0.7.1
Uninstalling auto_gptq-0.7.1:
  Successfully uninstalled auto_gptq-0.7.1
Cloning into 'AutoGPTQ'...
remote: Enumerating objects: 4872, done.[K
remote: Counting objects: 100% (1471/1471), done.[K
remote: Compressing objects: 100% (284/284), done.[K
remote: Total 4872 (delta 1335), reused 1193 (delta 1187), pack-reused 3401[K
Receiving objects: 100% (4872/4872), 8.11 MiB | 12.39 MiB/s, done.
Resolving deltas: 100% (3225/3225), done.
[31mERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.[0m[31m
[0m

In [None]:
# %cd AutoGPTQ
# !ls
# !pip3 install .

In [None]:
# !pip install optimum

In [None]:
# !pip install auto-gptq



creating the HuggingFacePipeline

In [36]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15)
print()

llm = HuggingFacePipeline(pipeline=pipe)

The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCa




creating the QA chain with retriever to answer the questions

In [37]:
RetrievalQA.from_chain_type.__doc__

'Load chain from chain type.'

In [38]:
retriever = vectordb.as_retriever(search_kwargs={"k": 5})

retrieval_qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                  chain_type="stuff",
                                  retriever=retriever,
                                  return_source_documents=True)

We are ready to ask questions!

In [39]:
query = "What are the recent advancements in LLMs?"
llm_response = retrieval_qa_chain(query)

  warn_deprecated(


In [40]:
llm_response['result'].split('\n')

["Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.",
 '',
 '',
 '',
 'Question: What are the recent advancements in LLMs?',
 'Helpful Answer: Recent advancements in LLMs include the development of transformer-based models, such as BERT and RoBERTa, which have achieved state-of-the-art results on a wide range of NLP tasks. These models use self-supervised learning techniques to learn high-level semantic representations of language, which can be fine-tuned for specific downstream tasks like sentiment analysis or machine translation. Additionally, there has been growing interest in multimodal LLMs that can process and integrate information from multiple sources, such as text, images, and audio. Finally, there is also research on Explainable AI (XAI) techniques to understand how LLMs make decisions and generate text.']

We can also see the source and pages to generate the answer

In [42]:
[source.metadata for source in llm_response["source_documents"]]

[]

You can also use retrieval QA chain with prompt templates, here's how you would do it for the same example as above:

In [43]:
template = """
{summaries}
{question}
"""

retrieval_qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={
        "prompt": PromptTemplate(
            template=template,
            input_variables=["summaries", "question"],
        ),
    },
)

In [44]:
query = "What are the recent advancements in LLMs?"
llm_response = retrieval_qa_chain(query)



In [45]:
llm_response

{'question': 'What are the recent advancements in LLMs?',
 'answer': '\n\nWhat are the recent advancements in LLMs?\n---------------------------------------\n\nLLMs have been rapidly evolving over the past few years, with several recent advancements that have improved their performance and applicability. Some of these advancements include:\n\n1. **Attention Mechanisms**: Attention mechanisms were introduced to improve the ability of LLMs to focus on specific parts of the input data, allowing them to better capture long-range dependencies and handle input sequences of varying lengths.\n2. **Pre-trained Language Models**: Pre-trained language models like BERT, RoBERTa, and XLNet have achieved state-of-the-art results on a wide range of NLP tasks, including question answering, sentiment analysis, named entity recognition, and text classification. These models use a multi-layer bidirectional transformer encoder to learn high-level semantic and syntactic features of language.\n3. **Transfor

Nice, hope this helps! We'll be around if you have any questions.

See you somewhere in the matrix!