# RAG From Scratch: Overview [Open in Colab](https://colab.research.google.com/github/yonanicodes/rag/blob/main/rag_1.ipynb)

These notebooks walk through the process of building RAG app(s) from scratch.

They build toward a broader understanding of the RAG landscape.

## Environment

`(1) Packages`

In [30]:
print("[INFO] Running in Google Colab, installing requirements.")
!pip install langchain langchain_community langchain-openai langchainhub chromadb tiktoken
!pip install sentence-transformers torchvision # for embedding sentences into vectors
!pip install PyMuPDF # for reading PDFs with Python
!pip install tqdm # for progress bars
# !pip install accelerate # for quantization model loading
# !pip install bitsandbytes # for quantizing models (less storage space)
# !pip install flash-attn --no-build-isolation # for faster attention mechanism = faster LLM inference

[INFO] Running in Google Colab, installing requirements.


In [1]:

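import os

from langchain_core.runnables import Runnable
import google.generativeai as genai

# Read the key from the environment (for example a Colab secret); never hard-code it in the notebook.
# The variable name GOOGLE_API_KEY is an assumption; use whichever name you stored the key under.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini_model = genai.GenerativeModel("gemini-2.0-flash")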

class GeminiLLM(Runnable):
    def invoke(self, input, config=None):
        # In an LCEL chain the prompt step emits a PromptValue;
        # to_string() renders it as plain text for Gemini
        if hasattr(input, "to_string"):
            prompt_str = input.to_string()
        elif isinstance(input, dict) and "messages" in input:
            # Extract and join message contents
            prompt_str = "\n".join(m.content for m in input["messages"])
        else:
            # Fall back to the string form of whatever was passed in
            prompt_str = str(input)

        response = gemini_model.generate_content(prompt_str)
        return response.text


llm = GeminiLLM()


`(2) LangSmith`

https://docs.smith.langchain.com/

In [2]:
import os
os.environ['LANGSMITH_TRACING'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = '<your-langsmith-api-key>'  # placeholder; never commit a real key

`(3) API Keys`

In [None]:
# os.environ['OPENAI_API_KEY'] = <your-api-key>
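
Keys like these are safer when read from the environment than when typed into a cell. The following is a minimal sketch of that pattern; the helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and not part of the original notebook.

```python
import os
from getpass import getpass


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, prompting once if it is unset."""
    value = os.environ.get(name)
    if not value:
        # getpass hides the typed value so it never lands in the notebook output
        value = getpass(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Hypothetical variable name, pre-set here so the cell runs without prompting
os.environ.setdefault("DEMO_API_KEY", "not-a-real-key")
demo_key = require_env("DEMO_API_KEY")
print(f"Loaded a key of length {len(demo_key)}")
```

In Colab the same idea works with the built-in secrets panel; the point is only that the key's value never appears in the saved notebook.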

In [118]:
import bs4
from langchain import hub

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings



# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def rag_pipeline(
    embedding_model,
    documents=None,
    embedding_model_name="default_model",
    prompt=None,
    k=2,
    persist_base_dir="./drive/MyDrive/vectorstores"
):
    # Pull the default RAG prompt at call time, not when the function is defined
    if prompt is None:
        prompt = hub.pull("rlm/rag-prompt")

    # Set a unique directory for each embedding model
    persist_directory = os.path.join(persist_base_dir, embedding_model_name.replace("/", "_"))

    # Check if vectorstore exists
    if os.path.exists(persist_directory) and documents is None:
        # Load existing vectorstore
        vectorstore = Chroma(
            persist_directory=persist_directory,
            embedding_function=embedding_model
        )
        print(f"[INFO] Loaded existing vectorstore from: {persist_directory}")
    else:
        if documents is None:
            raise ValueError("You must provide documents if no persisted vectorstore is found.")
        # Create and store vectorstore
        vectorstore = Chroma.from_documents(
            documents=documents,
            embedding=embedding_model,
            persist_directory=persist_directory
        )
        vectorstore.persist()
        print(f"[INFO] Stored new vectorstore at: {persist_directory}")

    retriever = vectorstore.as_retriever(search_kwargs={"k": k})

    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    ), retriever
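
To make the data flow of the returned chain easier to follow, here is a rough sketch that replaces every LangChain piece with a plain function. All `toy_*` names are stand-ins invented for illustration; the real chain uses the retriever, the hub prompt and the `llm` defined above.

```python
def toy_retriever(question: str) -> list[str]:
    """Stand-in for `retriever`: returns fixed passages regardless of the question."""
    return ["Passage one.", "Passage two."]


def toy_format_docs(passages: list[str]) -> str:
    """Mirrors `format_docs`, but on plain strings instead of Document objects."""
    return "\n\n".join(passages)


def toy_prompt(inputs: dict) -> str:
    """Stand-in for the hub prompt: fills a template with context and question."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\nAnswer:"


def toy_llm(prompt_text: str) -> str:
    """Stand-in for the LLM: reports the prompt size instead of generating text."""
    return f"(model saw {len(prompt_text)} characters)"


def toy_chain(question: str) -> str:
    # Step 1: the dict fans the same question out to two branches
    inputs = {
        "context": toy_format_docs(toy_retriever(question)),  # retriever | format_docs
        "question": question,  # RunnablePassthrough()
    }
    # Steps 2 and 3: fill the prompt, then call the model
    return toy_llm(toy_prompt(inputs))


print(toy_prompt({"context": "Passage one.", "question": "What is RAG?"}))
print(toy_chain("What is RAG?"))
```

The `|` operator in the real chain does the same thing: each step's output becomes the next step's input.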


## Load the PDF data

In [5]:
# Requires !pip install PyMuPDF, see: https://github.com/pymupdf/pymupdf
import fitz # (pymupdf, found this is better than pypdf for our use case, note: licence is AGPL-3.0, keep that in mind if you want to use any code commercially)
from tqdm.auto import tqdm # for progress bars, requires !pip install tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number + 1,  # enumerate starts at 0, so add 1 for human-readable page numbers
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts
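
The per-page statistics can be checked on a plain string without opening a PDF. This is a small sketch under that assumption; `page_stats` is a hypothetical helper that mirrors the dictionary built above, minus the page number and text.

```python
def page_stats(text: str) -> dict:
    """Compute the same rough statistics as open_and_read_pdf for one page of text."""
    cleaned = text.replace("\n", " ").strip()
    return {
        "page_char_count": len(cleaned),
        "page_word_count": len(cleaned.split(" ")),
        "page_sentence_count_raw": len(cleaned.split(". ")),
        "page_token_count": len(cleaned) / 4,  # rule of thumb: 1 token is about 4 characters
    }


sample = "Article 1\nThis is a test. It has two sentences."
stats = page_stats(sample)
print(stats)

# Splitting on a single space counts empty strings between double spaces as words
print(len("a  b".split(" ")), len("a  b".split()))
```

Because the extracted text contains runs of double spaces, `split(" ")` tends to overcount words; `split()` with no argument is one way to avoid that if exact counts matter.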



In [6]:
pdf_path="./drive/MyDrive/Ethiopia_Constitution.pdf"
eng_pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
eng_pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': 1,
  'page_char_count': 1733,
  'page_word_count': 295,
  'page_sentence_count_raw': 1,
  'page_token_count': 433.25,
  'text': 'Constitution  of  The Federal Democratic Republic of Ethiopia    PREAMBLE    We, the Nations, Nationalities and Peoples of Ethiopia:   Strongly committed, in full and free exercise of our right to self-determination, to  building a political community founded on the rule of law and capable of ensuring  a lasting peace, guaranteeing a democratic order, and advancing our economic  and social development;   Firmly convinced that the fulfillment of this objective requires full respect of  individual and people’s fundamental freedoms and rights, to live together on the  basis of equality and without any sexual, religious or cultural discrimination;   Further convinced that by continuing to live with our rich and proud cultural  legacies in territories we have long inhabited, have, through continuous  interaction on various levels and forms of life

In [7]:
import random

random.sample(eng_pages_and_texts, k=3)

[{'page_number': 42,
  'page_char_count': 1781,
  'page_word_count': 313,
  'page_sentence_count_raw': 24,
  'page_token_count': 445.25,
  'text': 'CHAPTER TEN  NATIONAL POLICY PRINCIPLES AND OBJECTIVES  Article 85  Objectives  1. Any organ of Government shall, in the implementation of the Constitution, other laws  and public policies, be guided by the principles and objectives specified under this  Chapter.   2. The term "Government" in this Chapter shall mean a Federal or State government as  the case may be.   Article 86  Principles for External Relations  1. To promote policies of foreign relations based on the protection of national interests  and respect for the sovereignty of the country.   2. To promote mutual respect for national sovereignty and equality of states and non- interference in the internal affairs of other states.   3. To ensure that the foreign relation policies of the country are based on mutual interests  and equality of states as well as that international agre

In [8]:
import pandas as pd

df = pd.DataFrame(eng_pages_and_texts)
df.head()

|   | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 1 | 1733 | 295 | 1 | 433.25 | Constitution of The Federal Democratic Repub... |
| 1 | 2 | 1563 | 282 | 17 | 390.75 | CHAPTER ONE GENERAL PROVISIONS Article 1 No... |
| 2 | 3 | 1658 | 304 | 24 | 414.5 | Article 6 Nationality 1. Any person of eithe... |
| 3 | 4 | 1520 | 289 | 20 | 380.0 | Article 11 Separation of State and Religion ... |
| 4 | 5 | 2036 | 385 | 23 | 509.0 | Article 16 The Right of the Security of Perso... |


In [9]:
df.describe()

|   | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 25.5 | 2094.12 | 378.34 | 22.4 | 523.53 |
| std | 14.57738 | 460.86802 | 81.131943 | 6.770283 | 115.217005 |
| min | 1.0 | 674.0 | 120.0 | 1.0 | 168.5 |
| 25% | 13.25 | 1799.75 | 329.25 | 18.0 | 449.9375 |
| 50% | 25.5 | 2044.5 | 370.0 | 23.0 | 511.125 |
| 75% | 37.75 | 2454.75 | 438.0 | 26.0 | 613.6875 |
| max | 50.0 | 2907.0 | 518.0 | 39.0 | 726.75 |


In [10]:
from spacy.lang.en import English # see https://spacy.io/usage for install instructions

nlp = English()

# Add a sentencizer pipeline, see https://spacy.io/api/sentencizer/
nlp.add_pipe("sentencizer")
for item in tqdm(eng_pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/50 [00:00<?, ?it/s]
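
Splitting on `". "` alone misses sentences that end in `?` or `!`, which is one reason the raw count and a sentencizer count can differ. The sketch below is a standard-library approximation, not spaCy's algorithm; `simple_sentences` is a hypothetical helper.

```python
import re

# Split on whitespace that follows ., ! or ? so the punctuation stays with its sentence
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def simple_sentences(text: str) -> list[str]:
    """Rule-based sentence split; a rough substitute when spaCy is not available."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


sample = "All persons are equal before the law. Is this guaranteed? Yes!"
print(len(sample.split(". ")))   # naive count
print(simple_sentences(sample))  # regex-based split
```

Like the naive split, this also breaks after list markers such as `1.`, so legal text with numbered clauses will still be oversegmented.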

In [11]:
# Inspect an example
random.sample(eng_pages_and_texts, k=1)

[{'page_number': 23,
  'page_char_count': 1358,
  'page_word_count': 250,
  'page_sentence_count_raw': 15,
  'page_token_count': 339.5,
  'text': '3. All international agreements and relations concluded, established or conducted by  the State shall protect and ensure Ethiopia’s right to sustainable development.   4. The basic aim of development activities shall be to enhance the capacity of  citizens for development and to meet their basic needs.   Article 44  Environmental Rights  1. All persons have the right to a clean and healthy environment.   2. All persons who have been displaced or whose livelihoods have been adversely  affected as a result of State programmes have the right to commensurate monetary  or alternative means of compensation, including relocation with adequate State  assistance.   CHAPTER FOUR  STATE STRUCTURE  Article 45  Form of Government  The Federal Democratic Republic of Ethiopia shall have a parliamentarian form  of government.   Article 46  States of the Fed

In [12]:
df = pd.DataFrame(eng_pages_and_texts)
df.describe().round(2)

|   | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 25.5 | 2094.12 | 378.34 | 22.4 | 523.53 | 22.12 |
| std | 14.58 | 460.87 | 81.13 | 6.77 | 115.22 | 6.81 |
| min | 1.0 | 674.0 | 120.0 | 1.0 | 168.5 | 1.0 |
| 25% | 13.25 | 1799.75 | 329.25 | 18.0 | 449.94 | 17.25 |
| 50% | 25.5 | 2044.5 | 370.0 | 23.0 | 511.12 | 22.5 |
| 75% | 37.75 | 2454.75 | 438.0 | 26.0 | 613.69 | 26.0 |
| max | 50.0 | 2907.0 | 518.0 | 39.0 | 726.75 | 38.0 |


In [13]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 11

# Create a function that splits a list into consecutive slices of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, with slice_size=10, a list of 17 sentences would be split into two lists of sizes [10, 7].
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(eng_pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/50 [00:00<?, ?it/s]
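
As a quick worked example of the slicing, a page with 24 sentences and a chunk size of 11 gives three chunks, the last one holding only the leftover sentences. The sentence list here is made up for illustration.

```python
import math


def split_list(input_list: list, slice_size: int) -> list[list[str]]:
    """Same slicing as above: consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


toy_sentences = [f"sentence {i}" for i in range(24)]  # placeholder sentences
toy_chunks = split_list(toy_sentences, slice_size=11)
chunk_sizes = [len(chunk) for chunk in toy_chunks]

print(chunk_sizes)
# The number of chunks is always ceil(n / slice_size)
print(len(toy_chunks) == math.ceil(len(toy_sentences) / 11))
```

The short tail chunk is why some chunks later turn out to have very few tokens.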

In [14]:
# Sample an example from the group (note: pages with <= 11 sentences produce only 1 chunk)
random.sample(eng_pages_and_texts, k=1)

[{'page_number': 42,
  'page_char_count': 1781,
  'page_word_count': 313,
  'page_sentence_count_raw': 24,
  'page_token_count': 445.25,
  'text': 'CHAPTER TEN  NATIONAL POLICY PRINCIPLES AND OBJECTIVES  Article 85  Objectives  1. Any organ of Government shall, in the implementation of the Constitution, other laws  and public policies, be guided by the principles and objectives specified under this  Chapter.   2. The term "Government" in this Chapter shall mean a Federal or State government as  the case may be.   Article 86  Principles for External Relations  1. To promote policies of foreign relations based on the protection of national interests  and respect for the sovereignty of the country.   2. To promote mutual respect for national sovereignty and equality of states and non- interference in the internal affairs of other states.   3. To ensure that the foreign relation policies of the country are based on mutual interests  and equality of states as well as that international agre

In [15]:
# Create a DataFrame to get stats
df = pd.DataFrame(eng_pages_and_texts)
df.describe().round(2)

|   | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 25.5 | 2094.12 | 378.34 | 22.4 | 523.53 | 22.12 | 2.5 |
| std | 14.58 | 460.87 | 81.13 | 6.77 | 115.22 | 6.81 | 0.65 |
| min | 1.0 | 674.0 | 120.0 | 1.0 | 168.5 | 1.0 | 1.0 |
| 25% | 13.25 | 1799.75 | 329.25 | 18.0 | 449.94 | 17.25 | 2.0 |
| 50% | 25.5 | 2044.5 | 370.0 | 23.0 | 511.12 | 22.5 | 2.5 |
| 75% | 37.75 | 2454.75 | 438.0 | 26.0 | 613.69 | 26.0 | 3.0 |
| max | 50.0 | 2907.0 | 518.0 | 39.0 | 726.75 | 38.0 | 4.0 |


In [16]:
import re

# Split each chunk into its own item
eng_pages_and_chunks = []
for item in tqdm(eng_pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters

        eng_pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(eng_pages_and_chunks)

  0%|          | 0/50 [00:00<?, ?it/s]

125
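
The join-then-regex step is easier to see on two short sentences. This sketch wraps the same two lines in a hypothetical helper, `join_chunk`, and also shows a case the regex does not repair.

```python
import re


def join_chunk(sentences: list[str]) -> str:
    """Join sentences with no separator, then re-insert a space after '.' before a capital."""
    joined = "".join(sentences).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", joined)


print(join_chunk(["The law applies.", "Article 2 follows."]))
# Sentences that start with a digit stay glued, because the pattern only looks for capitals
print(join_chunk(["It ends here.", "2. Next clause."]))
# Dotted abbreviations are split as a side effect
print(join_chunk(["U.S.A"]))
```

Joining with `" ".join(...)` instead would avoid both edge cases, at the cost of occasional double spaces that the `replace` call already handles.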

In [17]:
# View a random sample
random.sample(eng_pages_and_chunks, k=1)

[{'page_number': 48,
  'sentence_chunk': '3. They shall jointly levy and collect taxes on incomes derived from large-scale mining and all petroleum and gas operations, and royalties on such operations.   Article 99 Undesignated Powers of Taxation The House of the Federation and the House of Peoples’ Representatives shall, in a joint session, determine by a two-thirds majority vote on the exercise of powers of taxation which have not been specifically provided for in the Constitution. Article 100 Directives on Taxation 1. In exercising their taxing powers, Sates and the Federal Government shall ensure that any tax is related to the source of revenue taxed and that it is determined following proper considerations. 2. They shall ensure that the tax does not adversely affect their relationship and that the rate and amount of taxes shall be commensurate with services the taxes help deliver. 3. Neither States nor the Federal Government shall levy and collect taxes on each other’s property un

In [18]:
# Get stats about our chunks
df = pd.DataFrame(eng_pages_and_chunks)
df.describe().round(2)

|   | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 125.0 | 125.0 | 125.0 | 125.0 |
| mean | 25.14 | 819.18 | 133.46 | 204.79 |
| std | 13.86 | 370.08 | 60.76 | 92.52 |
| min | 1.0 | 33.0 | 6.0 | 8.25 |
| 25% | 14.0 | 618.0 | 99.0 | 154.5 |
| 50% | 25.0 | 830.0 | 133.0 | 207.5 |
| 75% | 37.0 | 1052.0 | 172.0 | 263.0 |
| max | 50.0 | 1787.0 | 299.0 | 446.75 |


In [19]:
# Show random chunks of 30 tokens or fewer to check whether they are worth keeping
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 8.25 | Text: On appearing before a court, they
Chunk token count: 18.5 | Text: 2. Human and democratic rights of citizens and peoples shall be respected.
Chunk token count: 22.25 | Text: He exercises overall supervision over the implementation of the country’s foreign policy.
Chunk token count: 26.5 | Text: 3. In all its decisions, the Council of Ministers is responsible to the House of Peoples’ Representatives.
Chunk token count: 18.25 | Text: 4. The armed forces shall at all times obey and respect the Constitution.
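
This notebook keeps every chunk, including the short ones. If you would rather drop them before embedding, one possible filter looks like the sketch below; `drop_short_chunks` is a hypothetical helper and the two toy chunks are illustrative.

```python
def drop_short_chunks(chunks: list[dict], min_token_length: float = 30) -> list[dict]:
    """Keep only chunks whose estimated token count is above the threshold."""
    return [c for c in chunks if c["chunk_token_count"] > min_token_length]


toy_chunks = [
    {"sentence_chunk": "On appearing before a court, they", "chunk_token_count": 8.25},
    {"sentence_chunk": "x" * 400, "chunk_token_count": 100.0},  # placeholder long chunk
]
kept = drop_short_chunks(toy_chunks)
print(len(kept), "of", len(toy_chunks), "chunks kept")
```

Several of the short chunks printed above are complete legal clauses, so filtering purely on length may discard real content; inspect before dropping.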


## Extract chunks

In [20]:
chunks =[doc["sentence_chunk"] for doc in eng_pages_and_chunks]
chunks[0]

'Constitution of The Federal Democratic Republic of Ethiopia  PREAMBLE  We, the Nations, Nationalities and Peoples of Ethiopia:  Strongly committed, in full and free exercise of our right to self-determination, to building a political community founded on the rule of law and capable of ensuring a lasting peace, guaranteeing a democratic order, and advancing our economic and social development;  Firmly convinced that the fulfillment of this objective requires full respect of individual and people’s fundamental freedoms and rights, to live together on the basis of equality and without any sexual, religious or cultural discrimination;  Further convinced that by continuing to live with our rich and proud cultural legacies in territories we have long inhabited, have, through continuous interaction on various levels and forms of life, built up common interest and have also contributed to the emergence of a common outlook;  Fully cognizant that our common destiny can best be served by rectify

In [21]:
from langchain_core.documents import Document

# Convert chunks (strings) to Document objects
documents = [Document(page_content=chunk ,metadata={'source': 'FRDE constitution'}) for chunk in chunks]

documents[1]


Document(metadata={'source': 'FRDE constitution'}, page_content='CHAPTER ONE GENERAL PROVISIONS Article 1 Nomenclature of the State This Constitution establishes a Federal and Democratic State structure. Accordingly, the Ethiopian state shall be known as the Federal Democratic Republic of Ethiopia. Article 2 Ethiopian Territorial Jurisdiction The territorial jurisdiction of Ethiopia shall comprise the territory of the members of the Federation and its boundaries shall be as determined by international agreements. Article 3 The Ethiopian Flag   1. The Ethiopian flag shall consist of green at the top, yellow in the middle and red at the bottom, and shall have a national emblem at the center. The three colors shall be set horizontally in equal dimension. 2. The national emblem on the flag shall reflect the hope of the Nations, Nationalities, Peoples as well as religious communities of Ethiopia to live together in equality and unity. 3. Members of the Federation may have their respective f
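
The documents above carry only a `source` field. If you also want to cite the page an answer came from, the page number already stored with each chunk can be copied into the metadata. This is a sketch using plain dictionaries; `chunk_to_document_fields` is a hypothetical helper and the toy chunk is illustrative.

```python
def chunk_to_document_fields(chunk: dict, source: str) -> dict:
    """Build the keyword arguments for a Document, keeping the page number as metadata."""
    return {
        "page_content": chunk["sentence_chunk"],
        "metadata": {"source": source, "page_number": chunk["page_number"]},
    }


toy_chunk = {"page_number": 48, "sentence_chunk": "Article 99 Undesignated Powers of Taxation"}
fields = chunk_to_document_fields(toy_chunk, source="FRDE constitution")
print(fields["metadata"])
# With LangChain installed, Document(**fields) would create the document
```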

In [22]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Choose the model (can be any sentence-transformers model)
model_name = "sentence-transformers/all-mpnet-base-v2"
# Initialize the embedding model
all_mpnet_base_v2 = HuggingFaceEmbeddings(model_name=model_name)



In [23]:
eng_all_mpnet_base_v2_chain, eng_all_mpnet_base_v2_retriever=rag_pipeline(all_mpnet_base_v2,documents)

In [24]:
# Question
docs = eng_all_mpnet_base_v2_retriever.invoke("what is human right?")
docs[0].page_content,len(docs)

('Everyone has the right to respect for his human dignity, reputation and honour. 2. Everyone has the right to the free development of his personality in a manner compatible with the rights of other citizens. 3. Everyone has the right to recognition every where as a person. Article 25 Right to Equality All persons are equal before the law and are entitled without any discrimination to the equal protection of the law. In this respect, the law shall guarantee to all persons equal and effective protection without discrimination on grounds of race, nation, nationality, or other social origin, colour, sex, language, religion, political or other opinion, property, birth or other status.',
 2)
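
The retriever returns the `k` chunks whose embeddings are closest to the question's embedding. The toy sketch below shows that ranking with cosine similarity on made-up 2-dimensional vectors; the metric the vector store actually uses depends on its configuration, so treat this as an illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # made-up embeddings
nearest = top_k([1.0, 0.1], toy_doc_vecs, k=2)
print(nearest)
```

Real embeddings from `all-mpnet-base-v2` have 768 dimensions, but the ranking step works the same way.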

In [25]:
eng_all_mpnet_base_v2_chain.invoke("what are the human rights of human according to ethiopia?")

'According to the Ethiopian constitution, every person has the inviolable and inalienable right to life, the security of person, and liberty. No person may be deprived of his life except as a punishment for a serious criminal offense determined by law. The fundamental rights and freedoms shall be interpreted in a manner conforming to the principles of the Universal Declaration of Human Rights.\n'

In [26]:
eng_all_mpnet_base_v2_chain.invoke("what are the democratic rights of human according to ethiopia?")

'According to the provided context, Ethiopian citizens have the right to participate in national development and to be consulted on policies and projects affecting their community. Additionally, every nation, nationality, and people in Ethiopia has the right to self-determination, to develop their language and culture, and to a full measure of self-government. Elections to positions of responsibility must be conducted in a free and democratic manner.\n'

In [27]:
eng_all_mpnet_base_v2_chain.invoke("what does tax collection looks like in ethiopia?")

'In Ethiopia, both the Federal Government and individual states are responsible for tax collection. States levy and collect profit, sales, excise, and personal income taxes on various businesses and properties within their territory. The Federal Government levies and collects income tax on employees, enterprises, and transport services, as well as taxes on lotteries and properties it owns.\n'

In [28]:
eng_all_mpnet_base_v2_chain.invoke("what does is nutrition?")

"I'm sorry, but the provided context does not define nutrition. The text discusses the government's role in promoting the health and welfare of the working population. It also mentions access to public health, education, clean water, housing, food, and social security.\n"

## Whoa! 🎉🎉🎉🎉 It works 🎉🎉🎉


In [30]:
def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() # note: this might be different for each doc (best to experiment)

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """
    Opens a PDF file, reads its text content page by page, and collects statistics.

    Parameters:
        pdf_path (str): The file path to the PDF document to be opened and read.

    Returns:
        list[dict]: A list of dictionaries, each containing the page number
        (adjusted), character count, word count, sentence count, token count, and the extracted text
        for each page.
    """
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number + 1,  # enumerate starts at 0, so add 1 for human-readable page numbers
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split("፡፡")),  # "፡፡" ends a sentence in the Amharic text
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts



In [32]:
pdf_path="./drive/MyDrive/constitution_amh.pdf"
amh_pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
amh_pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': 1,
  'page_char_count': 1660,
  'page_word_count': 307,
  'page_sentence_count_raw': 11,
  'page_token_count': 415.0,
  'text': 'መ ግ ቢ ያ  እኛ የኢትዮጵያ ብሔሮች፣ ብሔረሰቦች፣ ሕዝቦች፡በሀገራችን ኢትዮጵያ ውስጥ ዘላቂ ሰላም፣ ዋስትና ያለው ዴሞክራሲ እንዲሰፍን፣ኢኮኖሚያዊና  ማኅበራዊ እድገታችን እንዲፋጠን፣ የራሳችንን ዕድል በራሳችን የመወሰን መብታችንን ተጠቅመን፣ በነጻ ፍላጐታችን፣ በሕግ የበላይነት እና በራሳችን  ፈቃድ ላይ የተመሰረተ አንድ የፖለቲካ ማኅበረሰብ በጋራ ለመገንባት ቆርጠን በመነሳት፤ይህን ዓላማ ከግብ ለማድረስ፣ የግለሰብና የብሔር/ብሔረሰብ  መሰረታዊ መብቶች መከበራቸው፣ የጾታ እኩልነት መረጋገጡ፣ ባሕሎችና ሃይማኖቶች ካለአንዳች ልዩነት እንዲራመዱ የማድረጉ አስፈላጊነት ጽኑ  እምነታችን በመሆኑ፤ኢትዮጵያ ሀገራችን የየራሳችን አኩሪ ባሕል ያለን፣ የየራሳችን መልክዓ ምድር አሰፋፈር የነበረንና ያለን፣ ብሔር ብሔረሰቦችና  ሕዝቦች በተለያዩ መስኮችና የግንኙነት ደረጃዎችተሳስረንአብረን የኖርንባትና የምንኖርባት ሀገር በመሆንዋ፤ ያፈራነው የጋራ ጥቅምና አመለካከት  አለን ብለን ስለምናምን፤መጪው የጋራ ዕድላችን መመስረት ያለበት ከታሪካችን የወረስነውን የተዛባ ግንኙነት በማረምና የጋራ ጥቅማችንን በማሳደግ  ላይ መሆኑን በመቀበል፤ ጥቅማችንን፣ መብታችንና ነጻነታችንን በጋራ እና በተደጋጋፊነት ለማሳደግ አንድ የኢኮኖሚ ማኅበረሰብ የመገንባቱን  አስፈላጊነት በማመን፤ በትግላችንና በከፈልነው መስዋዕትነት የተገኘውን ዴሞክራሲና ሰላም ዘላቂነቱንለማረጋገጥ፤ይህ ሕገ መንግሥት ከዚህ በላይ  ለገለጽናቸው ዓላማዎችና እምነቶች ማሰሪያ እንዲሆነንእንዲወክሉን መርጠን በ

In [33]:
import random
random.sample(amh_pages_and_texts, k=3)

[{'page_number': 13,
  'page_char_count': 1482,
  'page_word_count': 287,
  'page_sentence_count_raw': 19,
  'page_token_count': 370.5,
  'text': '2. ለዚህ አንቀጽ ዓላማ «የግል ንብረት» ማለት ማንኛውም ኢትዮጵያዊ ዜጋ ወይም ሕጋዊ ሰውነት በሕግ የተሰጣቸው ኢትዮጵያዊ ማኅበራት ወይም  አግባብ በአላቸው ሁኔታዎች በሕግ በተለየ በጋራ የንብረት ባለቤት እንዲሆኑ የተፈቀደላቸው  ማኅበረሰቦች በጉልበታቸው፣ በመፍጠር ችሎታቸው ወይም በካፒታላቸው ያፈሩት ተጨባጭ የሆነና የተጨባጭነት ጠባይ ሳይኖረው ዋጋ ያለው ውጤት  ነው፡፡  3. የገጠርም ሆነ የከተማ መሬትና የተፈጥሮ ሀብት ባለቤትነት መብት የመንግሥትና የሕዝብ ብቻ ነው፡፡ መሬት የማይሸጥ የማይለወጥ የኢትዮጵያ  ብሔሮች፣ብሔረሰቦችና ሕዝቦች የጋራ ንብረት ነው፡፡  4. የኢትዮጵያ አርሶ አደሮች መሬት በነጻ የማግኘትና ከመሬታቸው ያለመነቀል መብታቸው የተከበረ ነው፡፡ አፈጻጸሙን በተመለከተ ዝርዝር ሕግ  ይወጣል፡፡  5. የኢትዮጵያ ዘላኖች ለግጦሽም ሆነ ለእርሻ የሚጠቀሙበት መሬት በነጻ የማግኘት፣የመጠቀምና ከመሬታቸው ያለመፈናቀል መብት አላቸው፡፡ ዝርዝር  አፈጻጸሙ በሕግ ይወሰናል፡፡  6. የመሬት ባለቤትነት የኢትዮጵያ ብሔሮች፣ ብሔረሰቦችና ሕዝቦች መሆኑ እንደተጠበቀ ሆኖ መንግሥት ለግል ባለሀብቶች በሕግ በሚወሰን ክፍያ  በመሬት የመጠቀም መብታቸውን ያስከብርላቸዋል፡፡ ዝርዝሩ በሕግ ይወሰናል፡፡  7. ማንም ኢትዮጵያዊ በጉልበቱ፣ ወይም በገንዘቡ በመሬት ላይ ለሚገነባው ቋሚ ንብረት ወይም ለሚያደርገው ቋሚ መሻሻል ሙሉ መብት አለው፡፡  ይህ መብት የመሸጥ፣ የመለወጥ፣ የማውረስ፣ የመሬት ተጠቃሚነቱ ሲቋረጥ ንብረቱን የማንሳት፣ ባለቤትነቱን

In [34]:
import pandas as pd

df = pd.DataFrame(amh_pages_and_texts)
df.head()

|   | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | 1 | 1660 | 307 | 11 | 415.0 | መ ግ ቢ ያ እኛ የኢትዮጵያ ብሔሮች፣ ብሔረሰቦች፣ ሕዝቦች፡በሀገራችን ኢ... |
| 1 | 2 | 969 | 192 | 15 | 242.25 | አንቀጽ 5:ስለ ቋንቋ 1. ማናቸውም የኢትዮጵያ ቋንቋዎች በእኩልነት የመ... |
| 2 | 3 | 991 | 199 | 15 | 247.75 | 4. ኢትዮጵያ ያጸደቀቻቸው ዓለም አቀፍ ስምምነቶች የሀገሪቱ ሕግ አካል ና... |
| 3 | 4 | 1241 | 252 | 13 | 310.25 | አንቀጽ 15 የሕይወት መብት ማንኛውም ሰው በሕይወት የመኖር መብት አለው... |
| 4 | 5 | 1666 | 343 | 19 | 416.5 | 3. የተያዙ ሰዎች በአርባ ስምንት ሰዓታት ውስጥ ፍርድ ቤት የመቅረብ መብ... |


In [35]:
df.describe()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,38.0,38.0,38.0,38.0,38.0
mean,19.5,1320.131579,267.736842,14.947368,330.032895
std,11.113055,262.770085,50.73337,3.578822,65.692521
min,1.0,513.0,111.0,4.0,128.25
25%,10.25,1190.5,245.25,13.0,297.625
50%,19.5,1363.5,276.0,15.0,340.875
75%,28.75,1457.5,296.75,17.0,364.375
max,38.0,1826.0,379.0,22.0,456.5


In [36]:
for item in tqdm(amh_pages_and_texts):
    # Split the page text on the Amharic full stop "፡፡".
    # Note: str.split() discards the delimiter, so the sentences no longer end with "፡፡".
    item["sentences"] = item["text"].split("፡፡")

    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    # Count the sentences
    item["sentence_split_count"] = len(item["sentences"])


  0%|          | 0/38 [00:00<?, ?it/s]
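
The split above drops every "፡፡" and can leave an empty trailing fragment when a page ends with a full stop. A minimal sketch of an alternative, assuming we want to keep the terminator on each sentence and skip empty fragments (`split_amharic_sentences` is a hypothetical helper, not part of this notebook):

```python
AMHARIC_FULL_STOP = "፡፡"  # same delimiter the notebook splits on

def split_amharic_sentences(text: str) -> list[str]:
    """Split on the Amharic full stop, keeping it attached to each sentence."""
    fragments = text.split(AMHARIC_FULL_STOP)
    sentences = []
    for index, fragment in enumerate(fragments):
        fragment = fragment.strip()
        if not fragment:
            continue  # skip empty pieces, e.g. after a trailing full stop
        # Only fragments that were followed by a delimiter get it back;
        # the last fragment may be a sentence cut off by the page break.
        is_last = index == len(fragments) - 1
        sentences.append(fragment if is_last else fragment + AMHARIC_FULL_STOP)
    return sentences

sample = "መሬት የጋራ ንብረት ነው፡፡  ዝርዝሩ በሕግ ይወሰናል፡፡"
print(split_amharic_sentences(sample))
```

With this variant the sentence counts would differ from `page_sentence_count_raw` whenever a page ends with a full stop, because the empty trailing fragment is no longer counted.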

In [37]:
df = pd.DataFrame(amh_pages_and_texts)
df[['page_sentence_count_raw','sentence_split_count']]

Unnamed: 0,page_sentence_count_raw,sentence_split_count
0,11,11
1,15,15
2,15,15
3,13,13
4,19,19
5,15,15
6,13,13
7,16,16
8,19,19
9,15,15


In [38]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,sentence_split_count
count,38.0,38.0,38.0,38.0,38.0,38.0
mean,19.5,1320.13,267.74,14.95,330.03,14.95
std,11.11,262.77,50.73,3.58,65.69,3.58
min,1.0,513.0,111.0,4.0,128.25,4.0
25%,10.25,1190.5,245.25,13.0,297.62,13.0
50%,19.5,1363.5,276.0,15.0,340.88,15.0
75%,28.75,1457.5,296.75,17.0,364.38,17.0
max,38.0,1826.0,379.0,22.0,456.5,22.0


In [40]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 13

# Create a function that splits a list into consecutive slices of the desired size
def split_list(input_list: list,
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (the last sublist may be shorter).

    For example, with slice_size=13, a list of 17 sentences is split into two sublists of 13 and 4 sentences.
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(amh_pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/38 [00:00<?, ?it/s]
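
To make the slicing behaviour concrete, here is a small self-contained check. It restates the same list-comprehension slicing on placeholder sentences (the `sentence {i}` strings are made up for illustration) and compares two slice sizes:

```python
import math

def split_list(input_list: list, slice_size: int) -> list[list]:
    """Same slicing approach as the notebook's split_list."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

sentences = [f"sentence {i}" for i in range(17)]  # placeholder data

print([len(chunk) for chunk in split_list(sentences, 10)])  # → [10, 7]
print([len(chunk) for chunk in split_list(sentences, 13)])  # → [13, 4]

# The number of chunks per page is ceil(sentence_count / slice_size)
for sentence_count in (4, 13, 22):
    chunks = split_list(list(range(sentence_count)), 13)
    print(sentence_count, len(chunks), math.ceil(sentence_count / 13))
```

Since the pages here hold between 4 and 22 split sentences, a slice size of 13 gives one or two chunks per page.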

In [41]:
import random
random.sample(amh_pages_and_texts, k=3)

[{'page_number': 11,
  'page_char_count': 1419,
  'page_word_count': 287,
  'page_sentence_count_raw': 12,
  'page_token_count': 354.75,
  'text': '2. ሕጻናትን የሚመለከቱ እርምጃዎች በሚወሰዱበት ጊዜ በመንግሥታዊ ወይም በግል የበጐ አድራጐት ተቋሞች፣ በፍርድ ቤቶች፣ በአስተዳደር  ባለሥልጣኖች ወይም በሕግ አውጪ አካላት የሕጻናት ደህንነት በቀደምትነት መታሰብ አለበት፡፡  3. ወጣት አጥፊዎች፣ በማረሚያ ወይም በመቋቋሚያ ተቋሞች የሚገኙ፣ በመንግሥት እርዳታ የሚያድጉ ወጣቶች፣ በመንግሥት ወይም በግል እጓለ  ማውታን ተቋሞች ውስጥ የሚገኙ መጣቶች ከአወቂዎች ተለይተው መያዝ አለባቸው፡፡  4. ከጋብቻ ውጭ የተወለዱ ሕጻናት በጋብቻ ከተወለዱ ሕጻናት ጋር እኩል መብት አላቸው፡፡  5. መንግሥት ለእጓለ ማውታን ልዩ ጥበቃ ያደርግላቸዋል፡፡ በጉዲፈቻ የሚያድጉበትን ሥርዓት የሚያመቻቹና የሚያስፋፉ እንዲሁም ደህንነታቸውን  ትምህርታቸውን የሚያራምዱ ተቋሞች እንዲመሰረቱ ያበረታታል፡፡  አንቀጽ 37 ፍትሕ የማግኘት መብት  1. ማንኛውም ሰው በፍርድ ሊወሰን የሚገባውን ጉዳይ ለፍርድ ቤት ወይም ለሌላ በሕግ የዳኝነት ሥልጣን ለተሰጠው አካል የማቅረብና ውሳኔ ወይም  ፍርድ የማግኘት መብት አለው፡፡  2. በዚህ አንቀጽ ንዑስ አንቀጽ 1 የተመለከተውን ውሳኔ ወይም ፍርድ፤  ሀ/ ማንኛውም ማኅበር የአባላቱን የጋራ ወይም የግል ጥቅም በመወከል፣  ለ/ ማንኛውም ቡድን ወይም ተመሳሳይ ጥቅም ያላቸውን ሰዎች የሚወክል ግለሰብ ወይም የቡድን አባል የመጠየቅና የማግኘት መብት አለው፡፡  አንቀጽ 38 የመምረጥና የመመረጥ መብት  1. ማንኛውም ኢትዮጵያዊ ዜጋ በቀለም፣ በዘር፣ በብሔር፣ በብሔረሰብ፣ በጾታ፣በቋንቋ፣

In [42]:
# Create a DataFrame to get stats
df = pd.DataFrame(amh_pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,sentence_split_count,num_chunks
count,38.0,38.0,38.0,38.0,38.0,38.0,38.0
mean,19.5,1320.13,267.74,14.95,330.03,14.95,1.68
std,11.11,262.77,50.73,3.58,65.69,3.58,0.47
min,1.0,513.0,111.0,4.0,128.25,4.0,1.0
25%,10.25,1190.5,245.25,13.0,297.62,13.0,1.0
50%,19.5,1363.5,276.0,15.0,340.88,15.0,2.0
75%,28.75,1457.5,296.75,17.0,364.38,17.0,2.0
max,38.0,1826.0,379.0,22.0,456.5,22.0,2.0


In [43]:
# Split each chunk into its own item
amh_pages_and_chunks = []
for item in tqdm(amh_pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string).
        # Note: the "፡፡" terminators were discarded by split(), so they are not restored here.
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        # Rough estimate: 1 token = ~4 characters (a rule of thumb for English text; it may not hold for Amharic)
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4

        amh_pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(amh_pages_and_chunks)

  0%|          | 0/38 [00:00<?, ?it/s]

64
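
One detail of the stats above: `"".split(" ")` returns `['']`, so an empty chunk is reported as having one word. A minimal sketch of a stats helper that avoids this, assuming the same 4-characters-per-token estimate (`chunk_stats` is a hypothetical name, not part of this notebook):

```python
def chunk_stats(chunk: str) -> dict:
    """Character, word, and rough token counts for one chunk."""
    return {
        "chunk_char_count": len(chunk),
        # split() with no argument splits on any whitespace run and
        # returns [] for an empty string, so empty chunks get 0 words.
        "chunk_word_count": len(chunk.split()),
        # Same rough estimate as the notebook: ~4 characters per token.
        "chunk_token_count": len(chunk) / 4,
    }

print(len("".split(" ")))  # → 1
print(len("".split()))     # → 0
print(chunk_stats("አንቀጽ 83 ሕገ መንግሥቱን ስለመተርጐም"))
```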

In [44]:
random.sample(amh_pages_and_chunks, k=3)

[{'page_number': 27,
  'sentence_chunk': '11. ስለ ሀገሪቱ ሁኔታ፣ በመንግሥት ስለተከናወኑ ተግባራትና ስለወደፊት እቅዶች ለሕዝብ ተወካዮች ምክር ቤት በየወቅቱ ሪፖረት ያቀርባል 12. በዚህ ሕገ መንግሥትና በሌሎች ሕጐች የተሰጡትን ሌሎች ተግባሮች ያከናውናል 13. ሕገ መንግሥቱን ያከብራል፤ የስከበራል አንቀጽ 75 ስለ ምክትል ጠቅላይ ሚኒስትር 1. ምክትል ጠቅላይሚኒስትሩ፣ ሀ/ በጠቅላይ ሚኒስትሩ ተለይተው የሚሰጡትን ተግባሮች ያከናውናል፤ ለ/ ጠቅላይ ሚኒስትሩ በማይኖርበት ጊዜ ተክቶት ይሰራል 2. ምክትል ጠቅላይ ሚኒስትሩ ተጠሪነቱ ለጠቅላይ ሚኒስትሩ ይሆናል አንቀጽ 76 የሚኒስትሮች ምክር ቤት 1. የሚኒስትሮች ምክር ቤት፤ ጠቅላይ ሚኒስትር፣ ምክትል ጠቅላይ ሚኒስተር፣ ሚኒስትሮችና በሕግ በሚወሰን መሰረት ሌሎች አባሎች የሚገኙበት ምክር ቤት ነው 2. የሚኒስትሮች ምክር ቤት ተጠሪነቱ ለጠቅላይ ሚኒሰትሩ ነው 3. የሚኒስትሮች ምክር ቤት ለሚወስነው ውሳኔ ለሕዝብ ተወካዮች ምክር ቤት ተጠሪ ነው አንቀጽ 77 የሚኒስትሮች ምክር ቤት ሥልጣንና ተግባር 1. የሚኒስትሮች ምክር ቤት በሕዝብ ተወካዮች ምክር ቤት የወጡ ሕጐችና የተሰጡ ውሳኔዎች በሥራ መተርጐማቸውን ያረጋግጣል፣ መመሪያዎችን ይሰጣል 2. የሚኒስትሮችንና በቀጥታ ለሚኒስትሮች ምክር ቤት ተጠሪ የሆኑ ሌሎች የመንግሥት አካላትን አደረጃጀት ይወስናል፣ ሥራቸውን ያስተባብራል፣ ይመራል 3. የፌዴራሉን መንግሥት ዓመታዊ በጀት ያዘጋጃል፣ ለሕዝብ ተወካዮች ምክር ቤት ያቀርባል፣ ሲጸድቅም ተግባራዊነቱን ያረጋግጣል 4. የገንዘብና የፋይናንስ ፖሊሲን ተግባራዊነት ያረጋግጣል፣ ብሔራዊ ባንክን ያስተዳድራል፣ ገንዘብ ያትማል፣ ከሀገር ውስጥና ከውጭ ይበደራል፣ የውጭ ምንዛሪና የገንዘብ ልውውጥን ይቆጣጠራል 5.

In [45]:
# Get stats about our chunks
df = pd.DataFrame(amh_pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,64.0,64.0,64.0,64.0
mean,19.88,753.53,145.64,188.38
std,10.91,496.73,95.99,124.18
min,1.0,0.0,1.0,0.0
25%,10.0,229.0,44.5,57.25
50%,21.0,857.0,165.5,214.25
75%,29.0,1146.75,231.0,286.69
max,38.0,1778.0,355.0,444.5


In [46]:
# Show random chunks with at most 30 tokens to check whether they are worth keeping
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 7.5 | Text: የሕዝብንም የልማት እንቅስቃሴዎች መደገፍ አለበት
Chunk token count: 20.0 | Text: የፖለቲካ ድርጅቶቹ አዲስ መንግሥት ለመፍር ወይም የነበረውን ጣምራነት ለመቀጠል ካልቻሉ ምክር ቤቱ ተበትኖ አዲስ ምርጫ ይደረጋል
Chunk token count: 16.0 | Text: 3. በዚህ ሕገ መንግሥት ከተደነገገው ውጭ በማናቸውም አኳኊን የመንግሥት ሥልጣን መያዝ የተከለከለ ነው
Chunk token count: 13.5 | Text: 8. በሕዝብ ተወካዮች ምክር ቤት ሕግ ሊወጣላቸው የሚገቡ የፍትሐብሔር ጉዳዮችን ይለያል
Chunk token count: 6.25 | Text: አንቀጽ 83 ሕገ መንግሥቱን ስለመተርጐም


In [47]:
df[df["chunk_token_count"] <= min_token_length]

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
2,2,3. በዚህ ሕገ መንግሥት ከተደነገገው ውጭ በማናቸውም አኳኊን የመንግሥት ...,64,13,16.0
9,6,2. ማንኛውም ሰው በግል የሚጽፋቸውንና የሚጻጻፋቸው፣ በፖስታ የሚልካቸው ...,117,17,29.25
34,22,የፖለቲካ ድርጅቶቹ አዲስ መንግሥት ለመፍር ወይም የነበረውን ጣምራነት ለመ...,80,16,20.0
36,23,8. በሕዝብ ተወካዮች ምክር ቤት ሕግ ሊወጣላቸው የሚገቡ የፍትሐብሔር ጉዳ...,54,11,13.5
38,24,2. የፌዴሬሽኑ ምክር ቤት የሥራ ዘመን አምስት ዓመት ይሆናል አንቀጽ 68...,80,18,20.0
44,27,,0,1,0.0
50,30,አንቀጽ 83 ሕገ መንግሥቱን ስለመተርጐም,25,5,6.25
54,32,የሕዝብንም የልማት እንቅስቃሴዎች መደገፍ አለበት,30,5,7.5
60,36,አንቀጽ 101 ዋናው ኦዲተር 1. ዋናው ኦዲተር በጠቅላይ ሚኒስትሩ አቅርቢ...,71,15,17.75


In [49]:
# Show all chunks with 0 tokens
for _, row in df[df["chunk_token_count"] == 0].iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 0.0 | Text: 


In [50]:
amh_pages_and_chunks = df[df["chunk_token_count"] >0].to_dict(orient="records")
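
The filter above removes only the empty chunk. If short fragments also turn out to hurt retrieval, a threshold filter is one option. A minimal sketch on plain dictionaries, assuming the same `chunk_token_count` key (`filter_short_chunks` and the toy records are illustrative, not part of this notebook):

```python
def filter_short_chunks(chunks: list[dict], min_tokens: float) -> tuple[list[dict], list[dict]]:
    """Return (kept, dropped): chunks above the token threshold and the rest."""
    kept = [c for c in chunks if c["chunk_token_count"] > min_tokens]
    dropped = [c for c in chunks if c["chunk_token_count"] <= min_tokens]
    return kept, dropped

# Toy records shaped like the notebook's chunk dictionaries
toy_chunks = [
    {"page_number": 2, "chunk_token_count": 16.0},
    {"page_number": 27, "chunk_token_count": 0.0},
    {"page_number": 21, "chunk_token_count": 214.25},
]

kept, dropped = filter_short_chunks(toy_chunks, min_tokens=30)
print(len(kept), len(dropped))  # → 1 2
```

Whether to drop short chunks is a judgment call: several of the short chunks printed earlier are complete constitutional clauses or article headings, so a high threshold would discard real content.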

In [51]:
amh_pages_and_chunks

[{'page_number': 1,
  'sentence_chunk': 'መ ግ ቢ ያ እኛ የኢትዮጵያ ብሔሮች፣ ብሔረሰቦች፣ ሕዝቦች፡በሀገራችን ኢትዮጵያ ውስጥ ዘላቂ ሰላም፣ ዋስትና ያለው ዴሞክራሲ እንዲሰፍን፣ኢኮኖሚያዊና ማኅበራዊ እድገታችን እንዲፋጠን፣ የራሳችንን ዕድል በራሳችን የመወሰን መብታችንን ተጠቅመን፣ በነጻ ፍላጐታችን፣ በሕግ የበላይነት እና በራሳችን ፈቃድ ላይ የተመሰረተ አንድ የፖለቲካ ማኅበረሰብ በጋራ ለመገንባት ቆርጠን በመነሳት፤ይህን ዓላማ ከግብ ለማድረስ፣ የግለሰብና የብሔር/ብሔረሰብ መሰረታዊ መብቶች መከበራቸው፣ የጾታ እኩልነት መረጋገጡ፣ ባሕሎችና ሃይማኖቶች ካለአንዳች ልዩነት እንዲራመዱ የማድረጉ አስፈላጊነት ጽኑ እምነታችን በመሆኑ፤ኢትዮጵያ ሀገራችን የየራሳችን አኩሪ ባሕል ያለን፣ የየራሳችን መልክዓ ምድር አሰፋፈር የነበረንና ያለን፣ ብሔር ብሔረሰቦችና ሕዝቦች በተለያዩ መስኮችና የግንኙነት ደረጃዎችተሳስረንአብረን የኖርንባትና የምንኖርባት ሀገር በመሆንዋ፤ ያፈራነው የጋራ ጥቅምና አመለካከት አለን ብለን ስለምናምን፤መጪው የጋራ ዕድላችን መመስረት ያለበት ከታሪካችን የወረስነውን የተዛባ ግንኙነት በማረምና የጋራ ጥቅማችንን በማሳደግ ላይ መሆኑን በመቀበል፤ ጥቅማችንን፣ መብታችንና ነጻነታችንን በጋራ እና በተደጋጋፊነት ለማሳደግ አንድ የኢኮኖሚ ማኅበረሰብ የመገንባቱን አስፈላጊነት በማመን፤ በትግላችንና በከፈልነው መስዋዕትነት የተገኘውን ዴሞክራሲና ሰላም ዘላቂነቱንለማረጋገጥ፤ይህ ሕገ መንግሥት ከዚህ በላይ ለገለጽናቸው ዓላማዎችና እምነቶች ማሰሪያ እንዲሆነንእንዲወክሉን መርጠን በላክናቸው ተወካዮቻቸን አማካይነት በሕገ መንግሥት ጉባኤ ዛሬ ኅዳር 29 ቀን 1987 አጽድቀነዋል ምዕራፍ አንድ : ጠቅላላ ድንጋጌዎች አንቀጽ 1: የኢትዮጵያ መንግሥት ስያሜ ይህ ሕገ 

In [121]:
from langchain_core.documents import Document

# Convert chunk dictionaries to Document objects, keeping the source file and page number as metadata
documents = [
    Document(
        page_content=doc["sentence_chunk"],
        metadata={"source": pdf_path, "page": doc["page_number"]},
    )
    for doc in amh_pages_and_chunks
]

len(documents)

63

In [112]:
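from langchain_core.prompts import PromptTemplate

# Amharic prompt. In English: "Using the information below, answer the question asked.
# Explanation: {context}  Question: {question}  Answer:"
prompt = PromptTemplate.from_template(
    "ከታች ያለው መረጃን በመጠቀም፣ የተጠየቀውን ጥያቄ መልስ።\n\nማብራሪያ:\n{context}\n\nጥያቄ: {question}\nመልስ:"
)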

In [101]:
from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/paraphrase-xlm-r-multilingual-v1"
xlm_r_multilingual_v1 = HuggingFaceEmbeddings(model_name=model_name)


In [122]:
from langchain_community.embeddings import HuggingFaceEmbeddings

multilingual_e5_large = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")



In [123]:
from langchain_community.embeddings import HuggingFaceEmbeddings

gte_large = HuggingFaceEmbeddings(model_name="thenlper/gte-large")



In [125]:
amh_all_mpnet_base_v2_chain, amh_all_mpnet_base_v2_retriever = rag_pipeline(
    embedding_model=all_mpnet_base_v2,
    documents=documents,
    embedding_model_name="amh_all_mpnet_base_v2",
    prompt=prompt,
)

[INFO] Stored new vectorstore at: ./drive/MyDrive/vectorstores/amh_all_mpnet_base_v2



In [128]:
xlm_r_multilingual_v1_chain, xlm_r_multilingual_v1_retriever = rag_pipeline(
    embedding_model=xlm_r_multilingual_v1,
    documents=documents,
    prompt=prompt,
    embedding_model_name="xlm_r_multilingual_v1",
)

[INFO] Stored new vectorstore at: ./drive/MyDrive/vectorstores/xlm_r_multilingual_v1


In [160]:
multilingual_e5_large_chain, multilingual_e5_large_retriever = rag_pipeline(
    embedding_model=multilingual_e5_large,
    documents=documents,
    prompt=prompt,
    embedding_model_name="multilingual_e5_large",
)

[INFO] Stored new vectorstore at: ./drive/MyDrive/vectorstores/multilingual_e5_large


In [162]:
gte_large_chain, gte_large_retriever = rag_pipeline(
    embedding_model=gte_large,
    documents=documents,
    prompt=prompt,
    embedding_model_name="gte_large",
)

[INFO] Stored new vectorstore at: ./drive/MyDrive/vectorstores/gte_large


In [178]:
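# Amharic test questions (English translations in the comments)
amh_questions = [
    "ሰባዊ መብት  ምንድነ ነው?",  # "What is a human right?"
    " የሰባዊ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?",  # "I would like you to list the human rights for me."
    "የዲሞክራሲ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?",  # "I would like you to list the democratic rights for me."
    "ስለክልል ከፍተኛ ፍርድ ቤቶች የዳኝነት ስልጣን እና ልዩ ፍርድ ቤቶች አወቃቀር",  # "On the jurisdiction of regional high courts and the structure of special courts"
]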

In [180]:
# Retrieve the documents most relevant to the last test question
docs = amh_all_mpnet_base_v2_retriever.get_relevant_documents("ስለክልል ከፍተኛ ፍርድ ቤቶች የዳኝነት ስልጣን እና ልዩ ፍርድ ቤቶች አወቃቀር")
docs

[Document(metadata={'source': './drive/MyDrive/constitution_amh.pdf', 'page': 29}, page_content='ዝርዝሩ በሕግ ይወሰናል 4. የክልል ከፍተኛ ፍርድ ቤት በክልሉ ከሚኖረው የዳኝነት ሥልጣን በተጨማሪየፌዴራል የመጀመሪያ ደረጃ ፍርድ ቤት የዳኝነት ሥልጣን ይኖረዋል 5. የክልል ከፍተኛ ፍርድ ቤት በፌዴራል የመጀመሪያ ደረጃ ፍርድ ቤት የዳኝነት ሥልጣኑ መሰረት በሚሰጠው ውሳኔ ላይ የሚቀርበው ይግባኝ በክልል ጠቅላይ ፍርድ ቤት ይታያል'),
 Document(metadata={'page': 28, 'source': './drive/MyDrive/constitution_amh.pdf'}, page_content='ዝርዝሩ በሕግ ይዋሰናል 4. የዳኝነት ሥልጣንን ከመደበኛ ፍርድ ቤቶች ወይም በሕግ የመዳኘት ሥልጣን ከተሰጠው ተቋም ውጭ የሚያደርግ፣ በሕግ የተደነገገን የዳኝነት ሥርዓት የማይከተል ልዩ ፍርድ ቤት ወይም ጊዜያዊ ፍርድ ቤት አይቋቋምምመሰረት የሃይማኖትና የባሕል ፍርድ ቤቶችን ሊያቋቁሙ ወይም እውቅና ሊሰጡ ይችላላላለላሉ ይሀ ሕግ መንግሥት ከመጽደቁ በፊት በመንግሥት እውቅና አግኝተው ሲሰራባቸው የነበሩ ሃይማኖቶችና የባሕል ፍርድ ቤቶች በዚህ ሕገ መንግሥት መሰረት እውቅና አግኝተው ይደራጃሉ አንቀጽ 79 የዳኝነት ሥልጣን 1. በፌዴራልም ሆነ በክልል የዳኝነት ሥልጣን የፍርድ ቤቶች ብቻ ነው 2. በየትኛውም ደረጃ የሚገኝ የዳኝነት አካል ከማንኛውም የመንግሥት አካል፣ከማንኛውም ባለሥልጣን ሆነ ከማንኛውም ሌላ ተጽዕኖ ነጻ ነው')]
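
The retriever returned two documents for this query. How `rag_pipeline` configures the search is not shown in this section, so the following is only a general illustration of embedding-based retrieval: rank stored vectors by cosine similarity to the query vector and keep the top `k`. The toy vectors and the helper names are assumptions made for the example.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), index) for index, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]

# Toy 2-dimensional "embeddings"; real models produce hundreds of dimensions
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vectors, k=2))  # → [0, 2]
```

Because ranking depends entirely on the embedding space, two models can return very different documents for the same Amharic query, which is what the comparison in the following cells probes.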

In [181]:
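def answer_question(rag_chain, retriever, embedding_model, questions=amh_questions):
    """Run every question through rag_chain and print the answers.

    retriever is accepted so the call sites stay uniform, but it is not used here.
    """
    print(f"[INFO] Answering {len(questions)} questions using model {embedding_model}")
    for question in tqdm(questions):
        print("Question", question)
        print(f"Answer: {rag_chain.invoke(question)}\n")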
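
Printing answers one model at a time makes them hard to compare. A minimal sketch that collects the answers per question instead, assuming each chain exposes an `invoke(question)` method as the chains above do (`collect_answers` and the `EchoChain` stand-in are hypothetical, used only so the example runs on its own):

```python
def collect_answers(chains: dict, questions: list[str]) -> dict:
    """Map each question to a {model_name: answer} dictionary."""
    results = {}
    for question in questions:
        results[question] = {
            name: chain.invoke(question) for name, chain in chains.items()
        }
    return results

class EchoChain:
    """Stand-in for a RAG chain; a real chain would retrieve and generate."""
    def __init__(self, label: str):
        self.label = label

    def invoke(self, question: str) -> str:
        return f"[{self.label}] {question}"

toy_chains = {"model_a": EchoChain("a"), "model_b": EchoChain("b")}
answers = collect_answers(toy_chains, ["q1", "q2"])
print(answers["q1"])
```

With the real chains, the dictionary could be passed to `pd.DataFrame(answers).T` to get one row per question and one column per embedding model.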

In [182]:
answer_question(amh_all_mpnet_base_v2_chain,amh_all_mpnet_base_v2_retriever,"all-mpnet-base-v2")

[INFO] Answering 4 questions using model all-mpnet-base-v2


  0%|          | 0/4 [00:00<?, ?it/s]

Question ሰባዊ መብት  ምንድነ ነው?
Answer: ይቅርታ፣ ሰባዊ መብት ምን እንደሆነ ከዚህ ጽሑፍ መረጃ ማግኘት አልቻልኩም። ጽሑፉ ስለ ዳኝነት ሥልጣን፣ የመግለጽ ነፃነት እና የመሰብሰብ ነፃነትን ነው የሚያወሳው።


Question  የሰባዊ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?
Answer: በተጠቀሰው መረጃ መሰረት አንዳንድ የሰብዓዊ መብቶች የሚከተሉት ናቸው፡

*   **የዳኝነት ነፃነት:** ፍርድ ቤቶች ከማንኛውም የመንግስት አካል፣ ባለስልጣን ወይም ሌላ ተጽዕኖ ነፃ መሆን አለባቸው። (አንቀጽ 79)
*   **የሃሳብን በነፃነት የመግለጽ ነፃነት:** ማንኛውም ሰው ያለማንም ጣልቃ ገብነት ሐሳቡን የመግለጽ ነጻነት አለው። ይህም መረጃን የመሰብሰብ፣ የመቀበልና የማሰራጨት ነጻነትን ያካትታል። (አንቀጽ 29)
*   **የፕሬስ ነፃነት:** ፕሬስ የቅድሚያ ምርመራ ሳይደረግ መረጃ የማግኘት እና የማሰራጨት ነፃነት አለው። (አንቀጽ 29)
*   **የመሰብሰብ፣ ሰላማዊ ሰልፍ የማድረግና አቤቱታ የማቅረብ መብት:** ማንኛውም ሰው ከሌሎች ጋር በመሆን መሳሪያ ሳይዝ በሰላም የመሰብሰብ፣ ሰላማዊ ሰልፍ የማድረግና አቤቱታ የማቅረብ መብት አለው። (አንቀጽ 30)

Question የዲሞክራሲ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?
Answer: መልስ: ከላይ በተጠቀሰው መረጃ መሰረት የዲሞክራሲ መብቶች የሚከተሉት ናቸው:

*   **የሐሳብን የመግለጽ ነጻነት:** ማንኛውም ሰው ያለማንም ጣልቃ ገብነት ሐሳቡን የመግለጽ ነጻነት አለው። በሀገር ውስጥም ሆነ ከሀገር ውጭ ወሰን ሳይደረግበት በቃልም ሆነ በጽሑፍ ወይም በሕትመት፣ በሥነ ጥበብ መልክ ወይም በመረጠው በማንኛውም የማሰራጫ ዘዴ፣ ማንኛውንም ዓይነት መረጃና ሐሳብ የመሰብሰብ፣ የመቀበልና የማሰራጨት ነጻነትን ያካትታል።
*   **የመሰ

In [183]:
answer_question(xlm_r_multilingual_v1_chain, xlm_r_multilingual_v1_retriever,"xlm_r_multilingual_v1")

[INFO] Answering 4 questions using model xlm_r_multilingual_v1


  0%|          | 0/4 [00:00<?, ?it/s]

Question ሰባዊ መብት  ምንድነ ነው?
Answer: ይቅርታ፣ ሰብዓዊ መብት ምን እንደሆነ ከላይ በተጠቀሰው መረጃ ላይ ማግኘት አልቻልኩም።

Question  የሰባዊ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?
Answer: በተሰጠው መረጃ መሰረት ጥቂት የሰብዓዊ መብቶች እነሆ፡-

*   **የዳኝነት ነፃነት፡** ፍርድ ቤቶች ከማንኛውም የመንግስት አካል፣ ባለስልጣን ወይም ተጽዕኖ ነፃ ሆነው የመዳኘት ስልጣን አላቸው። (አንቀጽ 79)
*   **የሃሳብን በነፃነት የመግለፅ ነፃነት፡** ማንኛውም ሰው ያለማንም ጣልቃ ገብነት ሀሳቡን በነፃነት የመግለፅ መብት አለው። ይህም መረጃን የመሰብሰብ፣ የመቀበልና የማሰራጨት ነፃነትን ያካትታል። (አንቀጽ 29)
*   **የፕሬስ ነፃነት፡** ፕሬስ የቅድሚያ ምርመራ ሳይደረግ መረጃን የማግኘት እና የማሰራጨት ነፃነት አለው። (አንቀጽ 29)
*   **የመሰብሰብ፣ ሰላማዊ ሰልፍ የማድረግ እና አቤቱታ የማቅረብ መብት፡** ማንኛውም ሰው ከሌሎች ጋር በመሆን በሰላም የመሰብሰብ፣ ሰላማዊ ሰልፍ የማድረግ እና አቤቱታ የማቅረብ መብት አለው። (አንቀጽ 30)

Question የዲሞክራሲ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?
Answer: የዲሞክራሲ መብቶች የሚከተሉትን ያካትታሉ፡

*   **የሐሳብን በነጻነት የመግለጽ ነጻነት:** ማንኛውም ሰው ያለማንም ጣልቃ ገብነት ሐሳቡን የመግለጽ ነጻነት አለው። ይህ ነጻነት በሀገር ውስጥም ሆነ ከሀገር ውጭ ወሰን ሳይደረግበት በቃልም ሆነ በጽሑፍ ወይም በሕትመት፣ በሥነ ጥበብ መልክ ወይም በመረጠው በማንኛውም የማሰራጫ ዘዴ ማንኛውንም ዓይነት መረጃና ሐሳብ የመሰብሰብ፣ የመቀበልና የማሰራጨት ነጻነቶችን ያካትታል። የፕሬስና የሌሎች መገናኛ ብዙኃን ነጻነትም ተረጋግጧል።

*   **የመሰብሰብ፣ ሰላማዊ ሰልፍ የማድረግ ነጻነ

In [184]:
answer_question(multilingual_e5_large_chain,multilingual_e5_large_retriever,"amh_multilingual_e5_large_chain")

[INFO] Answering 4 questions using model amh_multilingual_e5_large_chain


  0%|          | 0/4 [00:00<?, ?it/s]

Question ሰባዊ መብት  ምንድነ ነው?
Answer: ሰባዊ መብት ማለት ማንኛውም ሰው ሰብዓዊ በመሆኑ የማይደፈርና የማይገሰስ በሕይወት የመኖር፣ የአካል ደህንነትና የነጻነት መብት አለው።


Question  የሰባዊ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?
Answer: መልስ:

ከቀረበው ጽሑፍ በመነሳት የሚከተሉትን የሰባዊ መብቶች መጥቀስ ይቻላል፡

*   ዜጎች በብሔራዊ ልማት የመሳተፍ እና በተለይም አባል የሆኑበትን ማኅበረሰብ የሚመለከቱ ፖሊሲዎችና ኘሮጀክቶች ላይ ሐሳባቸውን እንዲሰጡ የመጠየቅ መብት አላቸው።
*   መንግሥት በዓለም አቀፍ ደረጃ የሚገባቸው ስምምነቶችም ሆኑ የሚያደርጋቸው ግንኙነቶች የኢትዮጵያን የማያቋርጥ እድገት መብት የሚያስከብሩ መሆን አለባቸው።
*   በፍርድ ሂደት ባሉበት ጊዜ በተከሰሱበት ወንጀል እንደ ጥፋተኛ ያለመቆጠር መብት አላቸው።
*   በምስክርነት እንዲቀርቡም ያለመገደድ መብት አላቸው።
*   የቀረበባቸውን ማናቸውንም ማስረጃ የመመልከት መብት አላቸው።
*   የቀረቡባቸውን ምስክሮች የመጠየቅ መብት አላቸው።
*   ለመከላከል የሚያስችላቸውን ማስረጃ የማቅረብ ወይም የማስቀረብ መብት አላቸው።
*   ምስክሮቻቸው ቀርበው እንዲሰሙላቸው የመጠየቅ መብት አላቸው።
*   በመረጡት የሕግ ጠበቃ የመወከል መብት አላቸው።
*   ጠበቃ ለማቆም አቅም በማጣታቸው ፍትሕ ሊጓደል የሚችልበት ሁኔታ ሲያጋጥም ከመንግሥት ጠበቃ የማግኘት መብት አላቸው።
*   ክርክሩ በሚታይበት ፍርድ ቤት በተሰጠባቸው ትእዛዝ ወይም ፍርድ ላይ ሥልጣን ላለው ፍርድ ቤት ይግባኝ የማቅረብ መብት አላቸው።
*   የፍርዱ ሂደት በማይገባቸው ቋንቋ በሚካሄድበት ሁኔታ በመንግሥት ወጪ ክርክሩ እንዲተረጐምላቸው የመጠየቅ መብት አላቸው።

Question የዲሞክራሲ መብቶች እንድጠቅሽልኝ እፈል

In [185]:
answer_question(gte_large_chain,gte_large_retriever,"gte_large_chain")

[INFO] Answering 4 questions using model gte_large_chain


  0%|          | 0/4 [00:00<?, ?it/s]

Question ሰባዊ መብት  ምንድነ ነው?
Answer: ጥያቄው ከቀረበው መረጃ ጋር የሚገናኝ አይደለም። የቀረበው መረጃ ስለ ዋና ኦዲተር አሿሿም እና የልማት እንቅስቃሴዎች ድጋፍ እንጂ ሰባዊ መብትን በተመለከተ ምንም የሚገልጽ ነገር የለም።

ስለዚህ፣ በዚህ መረጃ መሰረት "ሰባዊ መብት ምንድን ነው?" የሚለውን ጥያቄ መመለስ አይቻልም።


Question  የሰባዊ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?
Answer: ይቅርታ፣ ከሰጠኸኝ መረጃ የሰባዊ መብቶችን በተመለከተ ምንም ማግኘት አልቻልኩም።


Question የዲሞክራሲ መብቶች እንድጠቅሽልኝ እፈልጋለሁ?
Answer: ጥያቄው ከቀረበው መረጃ ጋር የሚገናኝ አይደለም። የቀረበው መረጃ የኦዲተርን ሹመት እና የልማት እንቅስቃሴዎችን ስለ መደገፍ ነው የሚያወራው። ስለዚህ ከዚህ መረጃ ላይ የዲሞክራሲ መብቶችን መጥቀስ አይቻልም።


Question ስለክልል ከፍተኛ ፍርድ ቤቶች የዳኝነት ስልጣን እና ልዩ ፍርድ ቤቶች አወቃቀር
Answer: ጥያቄው ከቀረበው ማብራሪያ ጋር የሚገናኝ አይደለም። ስለ ክልል ከፍተኛ ፍርድ ቤቶች የዳኝነት ስልጣን እና ልዩ ፍርድ ቤቶች አወቃቀር የሚገልጽ መረጃ ከላይ ባለው ማብራሪያ ውስጥ አልተጠቀሰም። ስለዚህ ከላይ ባለው መረጃ መሰረት መልስ መስጠት አይቻልም።


