# RAG From Scratch: Overview [Open in Colab](https://colab.research.google.com/github/yonanicodes/rag/blob/main/rag_1.ipynb)

These notebooks walk through the process of building RAG app(s) from scratch.

They will build towards a broader understanding of the RAG landscape.

## Environment

`(1) Packages`

In [2]:
print("[INFO] Running in Google Colab, installing requirements.")
!pip install langchain langchain_community langchain-openai langchainhub chromadb tiktoken
!pip install sentence-transformers torchvision # for embedding models
!pip install PyMuPDF # for reading PDFs with Python
!pip install tqdm # for progress bars
# !pip install accelerate # for quantization model loading
# !pip install bitsandbytes # for quantizing models (less storage space)
# !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

[INFO] Running in Google Colab, installing requirements.

`(2) LangSmith`

https://docs.smith.langchain.com/

In [4]:
import os
os.environ['LANGSMITH_TRACING'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = '<your-langsmith-api-key>'  # never commit a real key to a shared notebook

`(3) API Keys`

In [5]:
# os.environ['OPENAI_API_KEY'] = <your-api-key>
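
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of reading them from the environment instead is shown here; the helper name `require_env` and the variable `DEMO_API_KEY` are hypothetical and not part of LangChain or LangSmith.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    `require_env` is a hypothetical helper for this notebook, not a library API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or store it in Colab secrets."
        )
    return value


# Example with a placeholder value so the snippet runs on its own.
os.environ.setdefault("DEMO_API_KEY", "placeholder-not-a-real-key")
print(len(require_env("DEMO_API_KEY")) > 0)
```

Failing early with a named variable tends to be easier to debug than an authentication error raised deep inside a chain.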

## Load the PDF data

In [6]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number + 1,  # enumerate() is 0-based, so add 1 to get 1-based page numbers
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts
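
The per-page statistics in `open_and_read_pdf` are simple string heuristics. The sketch below applies the same expressions to a made-up sample string (not taken from the PDF) so the numbers are easy to verify by hand; `page_stats` is a hypothetical helper name.

```python
def page_stats(text: str) -> dict:
    """Compute the same rough statistics the loader stores for each page."""
    return {
        "page_char_count": len(text),
        "page_word_count": len(text.split(" ")),
        "page_sentence_count_raw": len(text.split(". ")),
        "page_token_count": len(text) / 4,  # heuristic: 1 token is roughly 4 characters
    }


# A made-up sample string, not taken from the PDF.
sample = "State and religion are separate. There shall be no state religion."
stats = page_stats(sample)
print(stats)
```

Note that `str.split` always returns at least one element, so a page with no `". "` in it reports a raw sentence count of 1, however long it is.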



In [7]:
pdf_path="./drive/MyDrive/Ethiopia_Constitution.pdf"
eng_pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
eng_pages_and_texts[:2]


[{'page_number': 1,
  'page_char_count': 1733,
  'page_word_count': 295,
  'page_sentence_count_raw': 1,
  'page_token_count': 433.25,
  'text': 'Constitution  of  The Federal Democratic Republic of Ethiopia    PREAMBLE    We, the Nations, Nationalities and Peoples of Ethiopia:   Strongly committed, in full and free exercise of our right to self-determination, to  building a political community founded on the rule of law and capable of ensuring  a lasting peace, guaranteeing a democratic order, and advancing our economic  and social development;   Firmly convinced that the fulfillment of this objective requires full respect of  individual and people’s fundamental freedoms and rights, to live together on the  basis of equality and without any sexual, religious or cultural discrimination;   Further convinced that by continuing to live with our rich and proud cultural  legacies in territories we have long inhabited, have, through continuous  interaction on various levels and forms of life

In [8]:
import random

random.sample(eng_pages_and_texts, k=3)

[{'page_number': 45,
  'page_char_count': 2393,
  'page_word_count': 431,
  'page_sentence_count_raw': 17,
  'page_token_count': 598.25,
  'text': "(a) If declared when the House of Peoples’ Representatives is in session, the decree shall  be submitted to the House within forty-eight hours of its declaration. The decree, if not  approved by a two-thirds majority vote of members of the House of Peoples'  Representatives, shall be repealed forthwith.   (b) Subject to the required vote of approval set out in (a) of this sub-Article, the decree  declaring a state of emergency when the House of Peoples’ Representatives is not in  session shall be submitted to it within fifteen days of its adoption.   3. A state of emergency decreed by the Council of Ministers, if approved by the House of  Peoples’ Representatives, can remain in effect up to six months. The House of Peoples’  Representatives may, by a two-thirds majority vote, allow the state of emergency  proclamation to be renewed every fo

In [9]:
import pandas as pd

df = pd.DataFrame(eng_pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 1 | 1733 | 295 | 1 | 433.25 | Constitution of The Federal Democratic Repub... |
| 1 | 2 | 1563 | 282 | 17 | 390.75 | CHAPTER ONE GENERAL PROVISIONS Article 1 No... |
| 2 | 3 | 1658 | 304 | 24 | 414.5 | Article 6 Nationality 1. Any person of eithe... |
| 3 | 4 | 1520 | 289 | 20 | 380.0 | Article 11 Separation of State and Religion ... |
| 4 | 5 | 2036 | 385 | 23 | 509.0 | Article 16 The Right of the Security of Perso... |


In [10]:
df.describe()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 25.5 | 2094.12 | 378.34 | 22.4 | 523.53 |
| std | 14.57738 | 460.86802 | 81.131943 | 6.770283 | 115.217005 |
| min | 1.0 | 674.0 | 120.0 | 1.0 | 168.5 |
| 25% | 13.25 | 1799.75 | 329.25 | 18.0 | 449.9375 |
| 50% | 25.5 | 2044.5 | 370.0 | 23.0 | 511.125 |
| 75% | 37.75 | 2454.75 | 438.0 | 26.0 | 613.6875 |
| max | 50.0 | 2907.0 | 518.0 | 39.0 | 726.75 |


In [11]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")
for item in tqdm(eng_pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])


In [12]:
# Inspect an example
random.sample(eng_pages_and_texts, k=1)

[{'page_number': 43,
  'page_char_count': 2019,
  'page_word_count': 343,
  'page_sentence_count_raw': 25,
  'page_token_count': 504.75,
  'text': '5. The armed forces shall carry out their functions free of any partisanship to any political  organization(s).   Article 88  Political Objectives  1. Guided by democratic principles, Government shall promote and support the People’s  self-rule at all levels.   2. Government shall respect the identity of Nations, Nationalities and Peoples.  Accordingly Government shall have the duty to strengthen ties of equality, unity and  fraternity among them.   Article 89  Economic Objectives  1. Government shall have the duty to formulate policies which ensure that all Ethiopians  can benefit from the country’s legacy of intellectual and material resources.   2. Government has the duty to ensure that all Ethiopians get equal opportunity to improve  their economic condition and to promote equitable distribution of wealth among them.   3. Government sha

In [13]:
df = pd.DataFrame(eng_pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 25.5 | 2094.12 | 378.34 | 22.4 | 523.53 | 22.12 |
| std | 14.58 | 460.87 | 81.13 | 6.77 | 115.22 | 6.81 |
| min | 1.0 | 674.0 | 120.0 | 1.0 | 168.5 | 1.0 |
| 25% | 13.25 | 1799.75 | 329.25 | 18.0 | 449.94 | 17.25 |
| 50% | 25.5 | 2044.5 | 370.0 | 23.0 | 511.12 | 22.5 |
| 75% | 37.75 | 2454.75 | 438.0 | 26.0 | 613.69 | 26.0 |
| max | 50.0 | 2907.0 | 518.0 | 39.0 | 726.75 | 38.0 |


In [14]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 11

# Create a function that splits a list into sublists of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, with slice_size=11, a list of 17 sentences would be split into
    two lists of 11 and 6 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(eng_pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])


In [15]:
# Sample an example from the group (note: a page yields a single chunk only if it has <= 11 sentences)
random.sample(eng_pages_and_texts, k=1)

[{'page_number': 13,
  'page_char_count': 2511,
  'page_word_count': 450,
  'page_sentence_count_raw': 17,
  'page_token_count': 627.75,
  'text': '3. Elections to positions of responsibility with any of the organizations referred to  under sub-Article 2 of this Article shall be conducted in a free and democratic  manner.   4. The provisions of sub-Articles 2 and 3 of this Article shall apply to civic  organizations which significantly affect the public interest.   Article 39  Rights of Nations, Nationalities, and Peoples  1. Every Nation, Nationality and People in Ethiopia has an unconditional right to  self-determination, including the right to secession.   2. Every Nation, Nationality and People in Ethiopia has the right to speak, to write  and to develop its own language; to express, to develop and to promote its culture;  and to preserve its history.   3. Every Nation, Nationality and People in Ethiopia has the right to a full measure of  self-government which includes the right t

In [16]:
# Create a DataFrame to get stats
df = pd.DataFrame(eng_pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 25.5 | 2094.12 | 378.34 | 22.4 | 523.53 | 22.12 | 2.5 |
| std | 14.58 | 460.87 | 81.13 | 6.77 | 115.22 | 6.81 | 0.65 |
| min | 1.0 | 674.0 | 120.0 | 1.0 | 168.5 | 1.0 | 1.0 |
| 25% | 13.25 | 1799.75 | 329.25 | 18.0 | 449.94 | 17.25 | 2.0 |
| 50% | 25.5 | 2044.5 | 370.0 | 23.0 | 511.12 | 22.5 | 2.5 |
| 75% | 37.75 | 2454.75 | 438.0 | 26.0 | 613.69 | 26.0 | 3.0 |
| max | 50.0 | 2907.0 | 518.0 | 39.0 | 726.75 | 38.0 | 4.0 |
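
The slicing used by `split_list` can be checked on a toy list. This is a small sketch with made-up placeholder sentences; only the chunk size of 11 comes from the notebook.

```python
def split_list(input_list: list, slice_size: int) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


# Toy stand-ins for sentences: 17 items with a chunk size of 11.
toy_sentences = [f"sentence {n}" for n in range(1, 18)]
toy_chunks = split_list(toy_sentences, slice_size=11)
print([len(chunk) for chunk in toy_chunks])  # [11, 6]
```

By the same arithmetic, a page with 38 sentences (the spaCy maximum in the statistics) gives ceil(38 / 11) = 4 chunks, which matches the `num_chunks` maximum of 4.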


In [17]:
import re

# Split each chunk into its own item
eng_pages_and_chunks = []
for item in tqdm(eng_pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        eng_pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(eng_pages_and_chunks)


125

In [18]:
# View a random sample
random.sample(eng_pages_and_chunks, k=1)

[{'page_number': 4,
  'sentence_chunk': 'Article 11 Separation of State and Religion 1. State and religion are separate. 2. There shall be no state religion. 3. The state shall not interfere in religious matters and religion shall not interfere in state affairs.  Article 12 Conduct and Accountability of Government 1. The conduct of affairs of government shall be transparent. 2. Any public official or an elected representative is accountable for any failure in official duties. 3.',
  'chunk_char_count': 442,
  'chunk_word_count': 72,
  'chunk_token_count': 110.5}]
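
The join-then-repair step in the loop above can be tried in isolation. A minimal sketch, assuming sentence strings that carry no trailing whitespace; `join_sentences` is a hypothetical helper name and the sample sentences are made up.

```python
import re


def join_sentences(sentences: list[str]) -> str:
    """Join sentence strings and restore the space after each full stop."""
    joined = "".join(sentences).replace("  ", " ").strip()
    # ".A" -> ". A": insert a space wherever a full stop is directly followed by a capital
    return re.sub(r"\.([A-Z])", r". \1", joined)


toy = ["State and religion are separate.", "There shall be no state religion."]
print(join_sentences(toy))
```

One caveat of this pattern is that it also rewrites abbreviations: `"U.S.A"` becomes `"U. S. A"`. For a legal text with few abbreviations that may be acceptable, but it is worth checking on other documents.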

In [19]:
# Get stats about our chunks
df = pd.DataFrame(eng_pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 125.0 | 125.0 | 125.0 | 125.0 |
| mean | 25.14 | 819.18 | 133.46 | 204.79 |
| std | 13.86 | 370.08 | 60.76 | 92.52 |
| min | 1.0 | 33.0 | 6.0 | 8.25 |
| 25% | 14.0 | 618.0 | 99.0 | 154.5 |
| 50% | 25.0 | 830.0 | 133.0 | 207.5 |
| 75% | 37.0 | 1052.0 | 172.0 | 263.0 |
| max | 50.0 | 1787.0 | 299.0 | 446.75 |
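
The longest chunk is 1787 characters, or about 446.75 estimated tokens, while the embedding model loaded later in this notebook reports `max_seq_length` of 384, and sentence-transformers models truncate input beyond that length. The sketch below flags chunks that may be truncated. It relies on the 4-characters-per-token heuristic rather than the model's own tokenizer, so treat it as a rough screen; both function names are hypothetical.

```python
def estimate_tokens(text: str) -> float:
    """Rough token estimate: about 4 characters per token."""
    return len(text) / 4


def chunks_over_limit(chunks: list[str], max_tokens: int) -> list[int]:
    """Return the indices of chunks whose estimated token count exceeds max_tokens."""
    return [i for i, chunk in enumerate(chunks) if estimate_tokens(chunk) > max_tokens]


# Toy chunks: one short, one as long as the largest chunk in the statistics.
toy_chunks = ["short chunk", "x" * 1787]
print(chunks_over_limit(toy_chunks, max_tokens=384))  # [1]
```

If many chunks are flagged, a smaller `num_sentence_chunk_size` is one option to consider.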


In [20]:
# Show random chunks of at most 30 estimated tokens to judge whether they are worth keeping
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 26.5 | Text: 3. In all its decisions, the Council of Ministers is responsible to the House of Peoples’ Representatives.
Chunk token count: 22.25 | Text: He exercises overall supervision over the implementation of the country’s foreign policy.
Chunk token count: 18.5 | Text: 2. Human and democratic rights of citizens and peoples shall be respected.
Chunk token count: 8.25 | Text: On appearing before a court, they
Chunk token count: 18.25 | Text: 4. The armed forces shall at all times obey and respect the Constitution.
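
The notebook inspects short chunks but does not drop them. Some are complete provisions, while others (such as "On appearing before a court, they") are fragments. If filtering were wanted, one possible sketch on plain dictionaries looks like this; the records are toy data shaped like the chunk dictionaries built earlier.

```python
min_token_length = 30  # same threshold as in the cell above

# Toy records shaped like the chunk dictionaries built earlier.
toy_records = [
    {"sentence_chunk": "On appearing before a court, they", "chunk_token_count": 8.25},
    {"sentence_chunk": "A longer chunk. " * 20, "chunk_token_count": 80.0},
]

# Keep only chunks above the threshold.
kept = [rec for rec in toy_records if rec["chunk_token_count"] > min_token_length]
print(len(kept))  # 1
```

Whether to filter at all is a judgment call: a short chunk can still hold a complete, retrievable clause.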


## Extract chunks

In [21]:
chunks =[doc["sentence_chunk"] for doc in eng_pages_and_chunks]
chunks[0]

'Constitution of The Federal Democratic Republic of Ethiopia  PREAMBLE  We, the Nations, Nationalities and Peoples of Ethiopia:  Strongly committed, in full and free exercise of our right to self-determination, to building a political community founded on the rule of law and capable of ensuring a lasting peace, guaranteeing a democratic order, and advancing our economic and social development;  Firmly convinced that the fulfillment of this objective requires full respect of individual and people’s fundamental freedoms and rights, to live together on the basis of equality and without any sexual, religious or cultural discrimination;  Further convinced that by continuing to live with our rich and proud cultural legacies in territories we have long inhabited, have, through continuous interaction on various levels and forms of life, built up common interest and have also contributed to the emergence of a common outlook;  Fully cognizant that our common destiny can best be served by rectify

## Define the generative AI model

In [22]:

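import os

import google.generativeai as genai
from langchain_core.runnables import Runnable

# Read the key from the environment rather than hard-coding it in the notebook.
# GOOGLE_API_KEY is an assumed variable name; set it before running this cell.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_model = genai.GenerativeModel("gemini-2.0-flash")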

class GeminiLLM(Runnable):
    """Minimal Runnable wrapper so the Gemini model can sit in a LangChain chain."""

    def invoke(self, input, config=None, **kwargs):
        # A prompt template passes a PromptValue; to_string() renders its messages as text.
        if hasattr(input, "to_string"):
            prompt_str = input.to_string()
        elif isinstance(input, dict) and "messages" in input:
            # Extract and join message contents
            prompt_str = "\n".join(m.content for m in input["messages"])
        else:
            prompt_str = str(input)

        response = gemini_model.generate_content(prompt_str)
        return response.text


llm = GeminiLLM()


In [1]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Choose the model (can be any sentence-transformers model)
model_name = "sentence-transformers/all-mpnet-base-v2"

# Initialize the embedding model
embedding_model = HuggingFaceEmbeddings(model_name=model_name)


In [3]:
embedding_model

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)
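
The printed model shows three stages: a transformer, mean pooling over token vectors (`pooling_mode_mean_tokens: True`), and a `Normalize()` layer. The sketch below illustrates the last two stages in plain Python on made-up 3-dimensional vectors. It is a simplification: it ignores padding masks, and the function names are hypothetical rather than sentence-transformers APIs.

```python
import math


def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average token vectors element-wise into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 3-dimensional "token embeddings"; the real model uses 768 dimensions.
sentence_a = l2_normalize(mean_pool([[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]))
sentence_b = l2_normalize(mean_pool([[2.0, 0.0, 1.0]]))
print(round(dot(sentence_a, sentence_b), 6))
```

Because both vectors have unit length after normalization, their dot product equals their cosine similarity, which is why the two toy sentences (same pooled direction) score approximately 1.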

## Part 1: Overview

[RAG quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart)

In [28]:
from langchain_core.documents import Document

# Convert chunks (strings) to Document objects
documents = [Document(page_content=chunk, metadata={'source': 'FRDE constitution'}) for chunk in chunks]

documents[1]


Document(metadata={'source': 'FRDE constitution'}, page_content='CHAPTER ONE GENERAL PROVISIONS Article 1 Nomenclature of the State This Constitution establishes a Federal and Democratic State structure. Accordingly, the Ethiopian state shall be known as the Federal Democratic Republic of Ethiopia. Article 2 Ethiopian Territorial Jurisdiction The territorial jurisdiction of Ethiopia shall comprise the territory of the members of the Federation and its boundaries shall be as determined by international agreements. Article 3 The Ethiopian Flag   1. The Ethiopian flag shall consist of green at the top, yellow in the middle and red at the bottom, and shall have a national emblem at the center. The three colors shall be set horizontally in equal dimension. 2. The national emblem on the flag shall reflect the hope of the Nations, Nationalities, Peoples as well as religious communities of Ethiopia to live together in equality and unity. 3. Members of the Federation may have their respective f

In [29]:
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

#### INDEXING ####

# Embed
vectorstore = Chroma.from_documents(documents=documents,
                                    embedding=embedding_model)


retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

#### RETRIEVAL and GENERATION ####

# Prompt
prompt = hub.pull("rlm/rag-prompt")

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
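
In the chain above, the leading dict is run as a parallel step: the same question is sent to `retriever | format_docs` (to build the context) and to `RunnablePassthrough()` (to pass the question through unchanged). The sketch below mirrors that data flow with plain functions. Everything prefixed `toy_` is a hypothetical stand-in: the retriever ranks by shared words rather than by embedding similarity, and the model only reports the prompt length.

```python
def toy_retriever(question: str, k: int = 2) -> list[str]:
    """Hypothetical stand-in for the vector store retriever: rank by shared words."""
    corpus = [
        "State and religion are separate.",
        "Every person has the right to life.",
        "The Ethiopian flag shall consist of green, yellow and red.",
    ]
    q_words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(q_words & set(doc.lower().split())))
    return ranked[:k]


def format_docs(docs: list[str]) -> str:
    """Join retrieved texts into one context string, as in the notebook."""
    return "\n\n".join(docs)


def toy_prompt(inputs: dict) -> str:
    """Fill a minimal prompt template with context and question."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def toy_llm(prompt: str) -> str:
    """Placeholder model: reports the prompt length instead of generating text."""
    return f"(model received {len(prompt)} characters)"


def toy_rag_chain(question: str) -> str:
    # The dict step: the same question feeds both branches.
    inputs = {"context": format_docs(toy_retriever(question)), "question": question}
    return toy_llm(toy_prompt(inputs))


print(toy_rag_chain("what right does every person have?"))
```

With `k=2`, only the two best-ranked texts reach the prompt, which mirrors `search_kwargs={"k": 2}` in the retriever above.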



In [31]:
# Retrieve the top-k chunks for a test question
docs = retriever.invoke("what is human right?")
docs[0].page_content, len(docs)

('Everyone has the right to respect for his human dignity, reputation and honour. 2. Everyone has the right to the free development of his personality in a manner compatible with the rights of other citizens. 3. Everyone has the right to recognition every where as a person. Article 25 Right to Equality All persons are equal before the law and are entitled without any discrimination to the equal protection of the law. In this respect, the law shall guarantee to all persons equal and effective protection without discrimination on grounds of race, nation, nationality, or other social origin, colour, sex, language, religion, political or other opinion, property, birth or other status.',
 2)

In [35]:
rag_chain.invoke("what are the human rights of human according to ethiopia?")

'According to the Ethiopian Constitution, every person has the inviolable and inalienable right to life, the security of person, and liberty. No one can be deprived of life except as punishment for a serious criminal offense determined by law. The fundamental rights and freedoms are interpreted in accordance with the Universal Declaration of Human Rights, International Covenants on Human Rights, and international instruments adopted by Ethiopia.\n'

In [36]:
rag_chain.invoke("what are the democratic rights of human according to ethiopia?")

'According to the provided text, Ethiopians have the right to free and democratic elections for positions of responsibility within organizations affecting the public interest. Additionally, every Nation, Nationality, and People in Ethiopia has the right to self-determination, to develop their language and culture, and to a full measure of self-government. Nationals also have the right to participate in national development and be consulted on policies affecting their community.\n'