# Project: Building a QA RAG System with LangChain on Wikipedia Data
* Notebook by Adam Lang
* Date: 10/25/2024

# Overview & Objectives
* The objective is to build a question-answering (QA) retrieval-augmented generation (RAG) system. To do that, we will:
1. Load and chunk documents.
2. Create document chunk embeddings.
3. Index the embeddings in a vector DB.
4. Create a retriever.
5. Connect the RAG chain.

# Install Dependencies
* Note: running this notebook on a GPU speeds up processing, mainly for the locally run reranker model used in the retrieval pipeline.

In [1]:
!pip install langchain==0.2.0
!pip install langchain-openai==0.1.7
!pip install langchain-community==0.2.0
!pip install sentence-transformers==2.7.0

Collecting langchain==0.2.0
  Downloading langchain-0.2.0-py3-none-any.whl.metadata (13 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.2.0)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting langchain-core<0.3.0,>=0.2.0 (from langchain==0.2.0)
  Downloading langchain_core-0.2.41-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain==0.2.0)
  Downloading langchain_text_splitters-0.2.4-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain==0.2.0)
  Downloading langsmith-0.1.137-py3-none-any.whl.metadata (13 kB)
Collecting tenacity<9.0.0,>=8.1.0 (from langchain==0.2.0)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain==0.2.0)
  Downloading marshmallow-3.23.0-py3-none-any.whl.metadata (7.6 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->lang

# Install Chroma Vector DB and LangChain Wrapper
* We will use Chroma DB, a free, open-source vector database, to store the embeddings.

In [2]:
!pip install langchain-chroma

Collecting langchain-chroma
  Downloading langchain_chroma-0.1.4-py3-none-any.whl.metadata (1.6 kB)
Collecting chromadb!=0.5.4,!=0.5.5,<0.6.0,>=0.4.0 (from langchain-chroma)
  Downloading chromadb-0.5.15-py3-none-any.whl.metadata (6.8 kB)
Collecting fastapi<1,>=0.95.2 (from langchain-chroma)
  Downloading fastapi-0.115.3-py3-none-any.whl.metadata (27 kB)
Collecting build>=1.0.3 (from chromadb!=0.5.4,!=0.5.5,<0.6.0,>=0.4.0->langchain-chroma)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb!=0.5.4,!=0.5.5,<0.6.0,>=0.4.0->langchain-chroma)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb!=0.5.4,!=0.5.5,<0.6.0,>=0.4.0->langchain-chroma)
  Downloading uvicorn-0.32.0-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromadb!=0.5.4,!=0.5.5,<0.6.0,>=0.4.0->langchain-chroma)
  Do

# Enter Open AI API key

In [3]:
from getpass import getpass

OPENAI_KEY = getpass('Enter Open AI API key: ')

Enter Open AI API key: ··········


# Setup Environment Variables

In [4]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY

# OpenAI Embedding Models
* Using LangChain we can access various OpenAI embedding models.
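
To make concrete how embeddings are compared later in the vector DB, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration (real embedding vectors have far more dimensions), and the function name `cosine_similarity` is our own, not part of LangChain or OpenAI.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values, for illustration only)
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, different length
doc_far = [0.0, 1.0, 0.0]    # orthogonal to the query

sim_close = cosine_similarity(query_vec, doc_close)
sim_far = cosine_similarity(query_vec, doc_far)
print(sim_close, sim_far)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score 0. This is why cosine is a common choice for comparing text embeddings.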

In [5]:
from langchain_openai import OpenAIEmbeddings

## setup openai embeddings
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

# Vector Database (Chroma DB)
* We will use the open-source Chroma DB to store our embeddings.



## Get the wikipedia dataset

In [6]:
## manually download
!gdown 1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW

Downloading...
From (original): https://drive.google.com/uc?id=1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW
From (redirected): https://drive.google.com/uc?id=1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW&confirm=t&uuid=cdfc454e-7fc8-415e-a645-0c134bf60db1
To: /content/simplewiki-2020-11-01.jsonl.gz
100% 50.2M/50.2M [00:00<00:00, 54.2MB/s]


In [8]:
## import the data once downloaded
import gzip ##unzip data
import json

## filepath
wikipedia_filepath = '/content/simplewiki-2020-11-01.jsonl.gz'

# store docs in list
docs = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
  for line in fIn:
    data = json.loads(line.strip())

    ## add all paragraphs
    ## passages.extend(data['paragraphs'])

    # Only add the first 3 paragraphs of each article
    docs.append({
                  'metadata': {
                                'title': data.get('title'),
                                'article_id': data.get('id')
                  },
                  'data': ' '.join(data.get('paragraphs')[0:3]) # restrict data to first 3 paragraphs
    })


print("Passages:", len(docs))

Passages: 169597


Summary:
* We can see that we have 169,597 documents (Wikipedia articles).

In [9]:
## keep only a subset of the Wikipedia documents so everything runs faster
## topics: linguistics, india, cheetah (a doc matching several keywords is added once per match)
docs = [doc for doc in docs for x in ['linguistics', 'india', 'cheetah']
              if x in doc['data'].lower().split()]

In [10]:
## length of docs now
len(docs)

1364
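
One detail of the nested comprehension used for subsetting: it emits a document once for every keyword it matches, so an article mentioning both "india" and "linguistics" appears twice. Also, `split()` only matches whole whitespace-separated tokens, so `india.` with trailing punctuation does not count. The sketch below contrasts that form with a variant that keeps each document at most once. The toy documents and the helper name `matches_any_keyword` are hypothetical, chosen only to illustrate the difference.

```python
KEYWORDS = {'linguistics', 'india', 'cheetah'}

def matches_any_keyword(text, keywords=KEYWORDS):
    """True if any keyword appears as a whitespace-separated token."""
    tokens = set(text.lower().split())
    return not tokens.isdisjoint(keywords)

# Made-up documents in the same shape as the notebook's dicts
toy_docs = [
    {'data': 'Linguistics is studied in India and elsewhere'},
    {'data': 'The cheetah is a fast cat'},
    {'data': 'Basil is a herb'},
]

# Nested-loop form: one copy per matching keyword
nested = [doc for doc in toy_docs for x in KEYWORDS
          if x in doc['data'].lower().split()]

# Single-pass form: at most one copy per document
unique = [doc for doc in toy_docs if matches_any_keyword(doc['data'])]

print(len(nested), len(unique))
```

Whether duplicates matter depends on the use case. In a vector DB they would show up as repeated retrieval results for the same article.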

In [11]:
## print first 3 docs
docs[:3]

[{'metadata': {'title': 'Kurgan hypothesis', 'article_id': '72554'},
  'data': 'The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'},
 {'metadata': {'title': 'Marija Gimbutas', 'article_id': '72558'},
  'data': 'Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'},
 {'metadata': {'title': 'Basil', 'article_id': '73985'},
  'data': 'Basil ("Ocimum basilic

# Create LangChain Documents

In [15]:
from langchain.docstore.document import Document

## create docs
docs = [Document(page_content=doc['data'], #page content
                 metadata=doc['metadata']) for doc in docs] #metadata

In [16]:
## print first 3 docs
docs[:3]

[Document(metadata={'title': 'Kurgan hypothesis', 'article_id': '72554'}, page_content='The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'),
 Document(metadata={'title': 'Marija Gimbutas', 'article_id': '72558'}, page_content='Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'),
 Document(metadata={'title': 'Basil', 'article_id': '73985'}, page_content

In [17]:
## get len of docs
len(docs)

1364

## Split larger documents into Smaller Chunks
* We can use a text splitter from the LangChain API here.
* A common default is the RecursiveCharacterTextSplitter, but a semantic splitter or another specialized splitter may work better depending on your data and use case.
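
To illustrate what chunk size and chunk overlap mean, here is a simplified fixed-width sliding window in plain Python. This is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraph and line breaks). The function name `sliding_window_chunks` is hypothetical, and the sketch only shows how consecutive chunks share an overlapping region.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character chunks that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary is still seen whole in at least one chunk when the overlap is large enough.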

In [18]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

## create splitter
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=300)
chunked_docs = splitter.split_documents(docs)

In [19]:
## view first 3 chunks
chunked_docs[:3]

[Document(metadata={'title': 'Kurgan hypothesis', 'article_id': '72554'}, page_content='The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'),
 Document(metadata={'title': 'Marija Gimbutas', 'article_id': '72558'}, page_content='Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'),
 Document(metadata={'title': 'Basil', 'article_id': '73985'}, page_content

In [20]:
## get len of chunked_docs
len(chunked_docs)

1388

`chunked_docs` has 1,388 entries versus 1,364 original documents. Only documents longer than the 2,000-character `chunk_size` are split, so the count grows only slightly.

# Create a Vector DB to persist on disk
* Here we initialize a Chroma DB vector database and populate it with the chunk embeddings.
* To save it on disk, we pass the directory where the data should be stored (`persist_directory`).

In [22]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings --> takes some time
chroma_db = Chroma.from_documents(documents=chunked_docs,
                                  collection_name='rag_wikipedia_db',
                                  embedding=openai_embed_model,
                                  collection_metadata={'hnsw:space': 'cosine'}, #defaults to euclidean so need to set to metric of choice
                                  persist_directory="./wikipedia_db")


## Load Vector DB from disk
* Once the vector DB has been created, you can simply load it from disk and open a connection whenever you need it, rather than embedding the documents and building the index every time.

In [23]:
## load from disk
chroma_db = Chroma(persist_directory="./wikipedia_db",
                   collection_name='rag_wikipedia_db',
                   embedding_function=openai_embed_model)

In [24]:
## print db
chroma_db

<langchain_chroma.vectorstores.Chroma at 0x7f725983c3d0>

# Load Connection to LLM
* Here you load the LLM you want to use. In this case we use `ChatOpenAI` with the gpt-3.5-turbo model.

In [25]:
from langchain_openai import ChatOpenAI

## setup llm connection
chatgpt = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

# Chained Retrieval Pipeline
* This strategy uses a chain of multiple retrievers sequentially to get the most relevant documents.
* The flow is as follows:
  * Similarity Retrieval -->
  * Compression Filter -->
  * Reranker Model Retrieval


* Reranker model: https://huggingface.co/BAAI/bge-reranker-large
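
The three-stage flow above can be sketched in plain Python. All names here (`chained_retrieve`, `word_overlap`, `shares_a_word`, `overlap_ratio`) are hypothetical stand-ins: word overlap plays the role of embedding similarity, a simple predicate plays the role of the LLM filter, and an overlap ratio plays the role of the cross-encoder reranker. It is only meant to show the order of operations, not the real scoring.

```python
def top_k_by_score(candidates, score_fn, k):
    """Return the k highest-scoring candidates (ties keep input order)."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

def chained_retrieve(query, corpus, similarity_fn, keep_fn, rerank_fn, k=5, top_n=3):
    # Stage 1: similarity retrieval, keep the top k
    stage1 = top_k_by_score(corpus, lambda d: similarity_fn(query, d), k)
    # Stage 2: filter out candidates judged irrelevant
    stage2 = [d for d in stage1 if keep_fn(query, d)]
    # Stage 3: rerank the survivors and keep the top_n
    return top_k_by_score(stage2, lambda d: rerank_fn(query, d), top_n)

def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

def shares_a_word(query, doc):
    return word_overlap(query, doc) > 0

def overlap_ratio(query, doc):
    return word_overlap(query, doc) / len(doc.split())

corpus = [
    "the cheetah is the fastest land animal",
    "kolkata was the capital of india",
    "basil is a herb",
    "the fastest animal in water is the sailfish",
]
results = chained_retrieve("fastest animal", corpus,
                           word_overlap, shares_a_word, overlap_ratio,
                           k=3, top_n=2)
print(results)
```

The cheap first stage narrows the corpus, so the more expensive filter and reranker only run on a handful of candidates.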

In [26]:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder #cross encoder
from langchain.retrievers.document_compressors import CrossEncoderReranker #reranker
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain.retrievers import ContextualCompressionRetriever #compressor

# Retriever 1 - simple similarity search over the vector DB (cosine distance)
similarity_retriever = chroma_db.as_retriever(search_type='similarity',
                                              search_kwargs={'k': 5}) # retrieve top 5

# decides which of the initially retrieved documents to filter out and which ones to return
_filter = LLMChainFilter.from_llm(llm=chatgpt)

# Retriever 2 - retrieves the documents similar to query and then applies the filter
compressor_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=similarity_retriever
)

# download an open source reranker model - BAAI/bge-reranker-large
reranker = HuggingFaceCrossEncoder(model_name='BAAI/bge-reranker-large')
reranker_compressor = CrossEncoderReranker(model=reranker, top_n=3)

# Retriever 3 - Uses a Reranker model to rerank retrieval results from previous retriever
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor,
    base_retriever=compressor_retriever
)


In [27]:
query = "what is the old capital of India?"
docs = final_retriever.invoke(query)
docs

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content="Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capital of India during the British Raj until 1911. Kolkata was once the center of industry and education. However, it has witnessed political violence and economic problems since 1954. Since 2000, Kolkata has grown due to economic growth. Like other metropolitan cities in India, Kolkata struggles with poverty, pollution and traffic congestion. The discovery of the nearby Chandraketugarh, an archaeological site has proved that people have lived there for over two millennia. The history of Kolkata b

In [28]:
query = "what is the fastest animal?"
docs = final_retriever.invoke(query)
docs

[Document(metadata={'article_id': '9800', 'title': 'Cheetah'}, page_content='A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time. Most cheetahs live in the savannas of Africa. There are a few in Asia. Cheetahs are active during the day, and hunt in the early morning or late evening. The cheetah compared to other big cats is light and slimly built. Its long thin legs and long spotted tail are necessary for fast running. Its lightly built, thin form is in sharp contrast with the robust build of other big cats. The head-and-body length ranges from . The cheetah stands 70 to 90\xa0cm at the shoulder, and weighs . The slightly curved claws are only weakly retractable (semi-retractable). This is a major point of difference between the cheetah and the other big cats, which have fully retractable claws.'),
 Document(metadata={'article_id': '528308', 'title': 'South African cheetah'}

# Build a QA RAG Chain
* To build a RAG chain we need a prompt template that instructs the LLM not to answer questions beyond the scope of the retrieved context documents. Various such prompts are available, but we will build one ourselves.
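
As a rough illustration of what a prompt template does, the sketch below fills a template's `{question}` and `{context}` placeholders using plain `str.format`. The shortened template text and the helper name `build_prompt` are assumptions for this example. The notebook itself uses `ChatPromptTemplate`, which additionally wraps the result as a chat message.

```python
TEMPLATE = """Use the retrieved context to answer the question.
If no context is present, say that you don't know.

Question:
{question}

Context:
{context}

Answer:"""

def build_prompt(question, chunks):
    """Join retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(question=question, context=context)

filled = build_prompt(
    "What is the fastest land animal?",
    ["A cheetah is a medium large cat which lives in Africa.",
     "It is the fastest land animal."],
)
print(filled)
```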

In [29]:
from langchain_core.prompts import ChatPromptTemplate

## custom prompt
prompt = """You are an assistant for question-answering tasks.
            Use the following pieces of retrieved context to answer the question.
            If no context is present or if you don't know the answer, just say that you don't know.
            Do not make up the answer unless it is there in the provided context.
            Give a detailed answer with regard to the question.

            Question:
            {question}

            Context:
            {context}

            Answer:
        """

## create prompt_template
prompt_template = ChatPromptTemplate.from_template(prompt)

# LCEL Syntax for QA RAG Chain - Recommended
* Here we show how to create the RAG chain using LCEL (LangChain Expression Language), the syntax LangChain recommends for composing chains.
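
To give an intuition for the `|` syntax, here is a toy model of pipe composition in plain Python. It is not LangChain's implementation: the `Step` class and `parallel` helper are hypothetical. `parallel` mimics how a dict inside a chain runs each branch on the same input, and `passthrough` mimics the role of `RunnablePassthrough`.

```python
class Step:
    """Tiny stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b means: run a, then feed its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

def parallel(**branches):
    """Run every branch on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value)
                               for name, step in branches.items()})

# Made-up steps standing in for the retriever, formatter, and prompt
retrieve = Step(lambda q: ["doc about " + q])
format_docs = Step(lambda docs: "\n\n".join(docs))
passthrough = Step(lambda q: q)
fill_prompt = Step(lambda d: f"Q: {d['question']}\nC: {d['context']}")

chain = parallel(context=retrieve | format_docs, question=passthrough) | fill_prompt
answer = chain.invoke("cheetahs")
print(answer)
```

The same input string flows into both branches: one branch retrieves and formats context, the other passes the question through unchanged, and the combined dict feeds the prompt step.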

In [30]:
from langchain_core.runnables import RunnablePassthrough

## format retrieved docs
def format_docs(docs):
  return "\n\n".join(doc.page_content for doc in docs)

# create qa rag chain
qa_rag_chain = (
    {
        "context": (final_retriever # retrieve final docs
                      |
                    format_docs), # format final docs
        "question": RunnablePassthrough() # pass the question through unchanged at runtime
    }
      |
    prompt_template #feed to contextual prompt
      |
    chatgpt
)

In [31]:
from IPython.display import Markdown, display

## query
query = "What is the financial capital of India?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The financial capital of India is Mumbai. It is the largest city in India and one of the world's most populous cities. Mumbai generates more than 6% of India's GDP and accounts for 25% of industrial output, 40% of sea trade, and 70% of capital to India's economy. The city is home to the Reserve Bank of India, the Bombay Stock Exchange, the National Stock Exchange of India, and many Indian companies and multinational corporations. Additionally, the headquarters of the State Bank of India, the largest bank in India, are located in Mumbai. Therefore, Mumbai is considered the financial capital of India due to its significant contribution to the country's economy and its role as a hub for financial institutions and markets.

In [32]:
query = "What is the old capital of India?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The old capital of India was Kolkata (formerly known as Calcutta) during the British Raj until 1911. Kolkata served as the capital of British India in 1772. However, during the summer months every year starting from 1864, the capital shifted to the hilly town of Shimla. After 1911, the capital of India was shifted to Delhi, which is the current capital city of India.

In [33]:
query = "Tell me what is the slowest animal on land?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The slowest animal on land is the sloth. Sloths are known for their extremely slow movement and are considered the slowest mammals on land. They move at a speed of about 0.24 kilometers per hour, making them incredibly slow compared to other animals. Sloths are known for their slow metabolism and spend most of their time hanging upside down in trees, moving very slowly to conserve energy.

In [34]:
query = "Explain linguistics in simple terms"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Linguistics is the study of language. Linguists study various aspects of language, such as sounds (phonology), parts of words (morphology), word order and sentence structure (syntax), the meaning of words (semantics), and the unspoken meaning of speech (pragmatics). There are different branches of linguistics, including theoretical linguistics which focuses on understanding the theory and ideas behind language, historical linguistics which studies the history and changes in language, and sociolinguistics which examines how different groups of people use language differently. Applied linguistics uses linguistic knowledge to solve real-world problems, such as forensic linguistics in crime investigations and computational linguistics in developing speech recognition technology. Overall, linguistics helps us understand how language works and how it is used in different contexts.

In [35]:
query = "Who won the champions league in 2021"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.

Summary:
* The last question is unrelated to the data stored in our vector database, so no relevant context is retrieved and the LLM, as instructed by the prompt, answers that it does not know.
* In another notebook we will show how to handle this with an agent and a tool that the LLM can use when the retrieved context does not contain the answer.