<a href="https://colab.research.google.com/github/ericmaniraguha/QueryPdfLangchain/blob/dev/PdfQueryLangchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Langchain PDF Intelligence: Extracting, Analyzing, and Summarizing Insights from Rwanda Development Board Documents


This project uses LangChain, a framework for building applications on top of large language models, to extract, analyze, and summarize information from PDF documents published by the Rwanda Development Board (RDB). The workflow downloads PDFs from the RDB website, extracts their text, splits the text into chunks, and then applies LangChain components for embeddings, document search, question answering, and summarization. Users pose queries, and the system answers them and summarizes documents based on the processed text. It is aimed at anyone who wants to extract insights efficiently from a corpus of RDB-related PDF documents.

In [None]:
# langchain provides the building blocks used below: text splitters, vector stores, and chains.
# See its documentation for usage and examples.
!pip install langchain -q

# The openai library provides access to OpenAI's powerful models.
# Ensure you have the necessary API key and follow OpenAI's guidelines.
!pip install openai -q

# PyPDF2 is a library for working with PDF files in Python.
# Useful for tasks involving PDF manipulation and extraction.
!pip install PyPDF2 -q

# faiss-cpu is a library for efficient similarity search and clustering of dense vectors.
# Great for tasks related to large-scale similarity search.
!pip install faiss-cpu -q

# tiktoken is a handy tool for counting the number of tokens in a text string without making an API call.
# Useful for monitoring and managing token usage, especially with OpenAI models.
!pip install tiktoken -q


In [None]:
# Import the installed libraries

# PyPDF2 library for working with PDF files
from PyPDF2 import PdfReader

# langchain library components
# OpenAIEmbeddings for using OpenAI language models for embeddings
from langchain.embeddings.openai import OpenAIEmbeddings

# CharacterTextSplitter for splitting text into chunks on a separator character
from langchain.text_splitter import CharacterTextSplitter

# FAISS for working with efficient similarity search and clustering of dense vectors
from langchain.vectorstores import FAISS


In [None]:
import os
os.environ["OPENAI_API_KEY"] = ""
# os.environ["SERPAPI_API_KEY"] = ""
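
The cell above assigns the API key directly in the notebook source. As a minimal sketch of an alternative, the helper below reads the key from the environment and prompts for it only when it is missing, so the key is not saved with the notebook. The function name `get_api_key` and its `prompt_fn` parameter are hypothetical and not part of the original project.

```python
import os
from getpass import getpass


def get_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing.

    `get_api_key` and `prompt_fn` are illustrative names (assumptions),
    not part of the original notebook.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed value, so it does not appear in cell output
        key = prompt_fn(f"Enter {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} is required")
        os.environ[env_var] = key
    return key
```

Calling `get_api_key()` once at the top of the notebook would populate `os.environ["OPENAI_API_KEY"]` for the LangChain components that read it.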

## Downloading PDF data from websites

In [None]:
!pip install BeautifulSoup4 --quiet
!pip install python-docx --quiet
!pip install urllib3==1.26.6  --quiet
!pip install requests -q
!pip install pytesseract -q


In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import os

# List of URLs or URL patterns to avoid
avoid_urls = [
    'https://rdb.rw/page/',
    'https://webmail'
]

def download_pdfs_from_website(base_url, url, download_path):
    try:
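        # verify=False disables TLS certificate verification; use it only if the site's certificate cannot be validated
        response = requests.get(url, verify=False)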
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Failed to get content of URL: {url}. Error: {e}")
        return

    print(f"Downloading PDFs from: {url}")

    soup = BeautifulSoup(response.text, 'html.parser')

    # Download PDFs
    for link in soup.find_all('a', href=True):
        # Resolve relative links first so the avoid list is checked against the full URL
        pdf_url = urljoin(base_url, link.get('href'))
        if pdf_url.lower().endswith('.pdf') and not any(avoid_url in pdf_url for avoid_url in avoid_urls):
            download_pdf(pdf_url, download_path)

def download_pdf(url, download_path):
    try:
        response = requests.get(url, stream=True)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Failed to download PDF from {url}. Error: {e}")
        return

    # Create a downloads directory if it doesn't exist
    os.makedirs(download_path, exist_ok=True)

    # Extract the filename from the URL
    filename = os.path.join(download_path, url.split("/")[-1])

    # Save the PDF
    with open(filename, 'wb') as pdf_file:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                pdf_file.write(chunk)

    print(f"Downloaded: {filename}")

# Specify the base URL of the website
base_url = 'https://rdb.rw/'

# Specify the directory to save the downloaded PDFs
download_path = 'all_pdfs_downloaded'

# Download PDFs from the website
download_pdfs_from_website(base_url, base_url, download_path)




Downloading PDFs from: https://rdb.rw/
Downloaded: all_pdfs_downloaded/Investment-code-2021.pdf
Downloaded: all_pdfs_downloaded/SMEs-toolkit-to-grow-business..V2_compressed.pdf
Downloaded: all_pdfs_downloaded/Service-Charter-EN.pdf
Downloaded: all_pdfs_downloaded/Service-Charter_FR.pdf
Downloaded: all_pdfs_downloaded/Service-Charter_KIN.pdf
Downloaded: all_pdfs_downloaded/RDB-Quality-Policy-Statement.pdf
Downloaded: all_pdfs_downloaded/client-service-charter.pdf
Downloaded: all_pdfs_downloaded/Notice-of-Company-Restoration-of-MMC_13.11.2023.pdf
Downloaded: all_pdfs_downloaded/M.K.C-TRADING-NOTICE.pdf
Downloaded: all_pdfs_downloaded/Public-Notice-09-11-2023.pdf
Downloaded: all_pdfs_downloaded/Itangazo-Rusange-09-11-2023.pdf
Downloaded: all_pdfs_downloaded/NOTICE-ALLIANCE-SHOP-LTD-1.pdf
Downloaded: all_pdfs_downloaded/NOTICE-OF-RESTORATIONOF-A-COMPANY-HONORE-RWANDA-LTD_2.10.2023.pdf
Downloaded: all_pdfs_downloaded/PRIMETRACK-DOCUMENTS.pdf
Downloaded: all_pdfs_downloaded/MY-ROOM-TRADING-M
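
The link handling in the downloader can be isolated into a pure function, which makes it easier to check without touching the network. The sketch below is one possible way to do that, not the project's own implementation: `collect_pdf_links` and `filename_from_url` are hypothetical helpers, and the `example.org` URLs are placeholders. It resolves relative links, ignores query strings when checking the `.pdf` extension, applies the avoid list, and removes duplicates.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def collect_pdf_links(base_url, hrefs, avoid_patterns=()):
    """Resolve hrefs against base_url and return unique PDF links.

    Links containing any string in avoid_patterns are skipped.
    (Hypothetical helper, shown for illustration only.)
    """
    seen, result = set(), []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # Check the extension on the path only, so "file.pdf?v=2" still counts
        if not urlparse(absolute).path.lower().endswith(".pdf"):
            continue
        if any(pattern in absolute for pattern in avoid_patterns):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def filename_from_url(url):
    """Return the last path segment of a URL, without any query string."""
    return posixpath.basename(urlparse(url).path)


links = collect_pdf_links(
    "https://example.org/",
    ["docs/a.pdf", "/docs/a.pdf", "page.html", "https://webmail.example.org/x.pdf"],
    avoid_patterns=["https://webmail"],
)
print(links)
```

With `BeautifulSoup`, the `hrefs` argument could be built as `[a["href"] for a in soup.find_all("a", href=True)]`.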

In [None]:
# Load the text of every PDF file in a directory.
# pdfreader = PdfReader('budget_speech.pdf')

from PyPDF2 import PdfReader

def load_pdfs_from_directory(directory):
    pdf_texts = []

    # List all files in the specified directory
    pdf_files = [file for file in os.listdir(directory) if file.endswith('.pdf')]

    for pdf_file in pdf_files:
        pdf_path = os.path.join(directory, pdf_file)
        pdf_texts.append(read_pdf_text(pdf_path))

    return pdf_texts

def read_pdf_text(pdf_path):
    try:
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PdfReader(pdf_file)
            text = ""
            for page in pdf_reader.pages:
                # extract_text() can yield nothing for pages without a text layer
                text += page.extract_text() or ""
        return text
    except Exception as e:
        print(f"Error reading PDF {pdf_path}: {e}")
        return ""


## Concatenating the text content of a list of PDF pages

In [None]:

def concatenate_pages(pdf_pages):
    raw_text = ''
    for page in pdf_pages:
        content = page.extract_text()
        if content:
            raw_text += content
    return raw_text

# Provide the path to the directory containing the PDF files
pdf_directory = '/content/all_pdfs_downloaded'

# Load PDFs from the directory
pdf_texts = load_pdfs_from_directory(pdf_directory)

# Example: Print the text content of each PDF
for i, pdf_text in enumerate(pdf_texts, start=1):
    print(f"PDF {i} Text:\n{pdf_text}\n{'=' * 50}\n")

[0, IndirectObject(154, 0, 135515418897056)]
[0, IndirectObject(149, 0, 135515418897056)]
[0, IndirectObject(144, 0, 135515418897056)]
[0, IndirectObject(139, 0, 135515418897056)]
[0, IndirectObject(134, 0, 135515418897056)]
[0, IndirectObject(129, 0, 135515418897056)]
[0, IndirectObject(124, 0, 135515418897056)]
[0, IndirectObject(119, 0, 135515418897056)]


PDF 1 Text:


PDF 2 Text:
One Stop 
Center
Service Charter2
Content
Scope of One Stop Center Services 4
Rwanda Development Board (RDB)  5
Capital Market Authority (CMA) 6
City of Kigali & Rwanda Housing Authority (RHA) 7
Directorate General of Immigration and 
Emigration (DGIE) 9
Higher Education Council (HEC) 10
Ministry of Health (MOH) 11
Ministry of Trade and Industry (MINICOM) 12
National Agricultural Export Development Board (NAEB) 13
National Bank of Rwanda (BNR) 14
National Lands Authority (NLA) 15
Rwanda Civil Aviation Authority (RCAA) 16
Rwanda Finance Limited (RFL) 18
Rwanda Food and Drugs Authority (Rwanda FDA) 19
Rwanda Forestry Authority (RFA) 21
Rwanda Inspectorate, Competition and Consumer Protection Authority (RICA) 22
Rwanda Mines, Petroleum and Gas Board (RMB) 25
Rwanda National Police (RNP) 26
Rwanda Revenue Authority (RRA) 27
Rwanda Standards Board (RSB) 29
Rwanda Utilities Regulatory Authority (RURA) 30
Rwanda Water Board (RWB) 33
The RDB One Stop Center is ISO cer
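
In the output above, the first PDF yields no text at all. One plausible cause (an assumption, not verified here) is that the file contains scanned images rather than a text layer, in which case an OCR step would be needed. The sketch below shows a simple way to flag such files so they can be handled separately; `find_empty_extractions` and the file names are hypothetical, and it expects a mapping from file name to extracted text rather than the plain list returned by `load_pdfs_from_directory`.

```python
def find_empty_extractions(named_texts, min_chars=20):
    """Return the names of documents whose extracted text is (nearly) empty.

    named_texts maps a file name to its extracted text (or None).
    min_chars is an arbitrary threshold chosen for illustration.
    """
    return [
        name
        for name, text in named_texts.items()
        if len((text or "").strip()) < min_chars
    ]


# Hypothetical example data
extracted = {
    "scanned-notice.pdf": "",
    "service-charter.pdf": "One Stop Center Service Charter",
}
print(find_empty_extractions(extracted))
```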

In [None]:
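# Load PDFs from the directory
pdf_texts = load_pdfs_from_directory(pdf_directory)

# Concatenate the PDF texts into a single string
all_pdf_text = "\n".join(pdf_texts)

# The splitter must be defined before use. The values below are assumptions
# (the original settings are not shown); adjust them as needed.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)

# Split the text into overlapping chunks
texts = text_splitter.split_text(all_pdf_text)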

# Example: Print the split texts
for i, text_chunk in enumerate(texts, start=1):
    print(f"Text Chunk {i}:\n{text_chunk}\n{'=' * 50}")


[0, IndirectObject(154, 0, 135515343341456)]
[0, IndirectObject(149, 0, 135515343341456)]
[0, IndirectObject(144, 0, 135515343341456)]
[0, IndirectObject(139, 0, 135515343341456)]
[0, IndirectObject(134, 0, 135515343341456)]
[0, IndirectObject(129, 0, 135515343341456)]
[0, IndirectObject(124, 0, 135515343341456)]
[0, IndirectObject(119, 0, 135515343341456)]


Text Chunk 1:
One Stop 
Center
Service Charter2
Content
Scope of One Stop Center Services 4
Rwanda Development Board (RDB)  5
Capital Market Authority (CMA) 6
City of Kigali & Rwanda Housing Authority (RHA) 7
Directorate General of Immigration and 
Emigration (DGIE) 9
Higher Education Council (HEC) 10
Ministry of Health (MOH) 11
Ministry of Trade and Industry (MINICOM) 12
National Agricultural Export Development Board (NAEB) 13
National Bank of Rwanda (BNR) 14
National Lands Authority (NLA) 15
Rwanda Civil Aviation Authority (RCAA) 16
Rwanda Finance Limited (RFL) 18
Rwanda Food and Drugs Authority (Rwanda FDA) 19
Rwanda Forestry Authority (RFA) 21
Rwanda Inspectorate, Competition and Consumer Protection Authority (RICA) 22
Rwanda Mines, Petroleum and Gas Board (RMB) 25
Rwanda National Police (RNP) 26
Text Chunk 2:
Rwanda Forestry Authority (RFA) 21
Rwanda Inspectorate, Competition and Consumer Protection Authority (RICA) 22
Rwanda Mines, Petroleum and Gas Board (RMB) 25
Rwanda National
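
The output above shows consecutive chunks sharing some lines: the end of chunk 1 reappears at the start of chunk 2. That overlap keeps context that would otherwise be cut at a chunk boundary. The sketch below illustrates the idea with fixed-width character windows. It is a simplification and not how `CharacterTextSplitter` works internally (that class splits on a separator and then merges pieces); `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap preserves more context across boundaries but produces more chunks to embed.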

In [None]:
# text_chunk still holds the last chunk from the loop above; this is its length in characters
len(text_chunk)

341

In [None]:
# Create the OpenAI embeddings client (reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

In [None]:
# from_texts expects the list of chunks, not a single chunk string
document_search = FAISS.from_texts(texts, embeddings)

In [None]:
document_search


<langchain.vectorstores.faiss.FAISS at 0x7b4024360a90>
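
Conceptually, the vector store holds one embedding vector per chunk and, for a query, returns the chunks whose vectors are closest to the query vector. The sketch below shows that idea as a brute-force search with cosine similarity over tiny hand-made vectors. It is an illustration only: FAISS uses optimized index structures and its own distance computation, and real embeddings have many more dimensions. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k document vectors most similar to query_vec."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-dimensional "embeddings" standing in for real ones
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
# → [0, 2]
```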

## Question-answering chain: given a question, it returns an answer from the retrieved documents

In [None]:
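# These imports are needed here; without them the cell raises NameError
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

chain = load_qa_chain(OpenAI(), chain_type="stuff")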

In [None]:
query = "How can i start a business in Rwanda, as someone who is from abroad."
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' To start a business in Rwanda as someone from abroad you will need to register your business with the Rwanda Development Board (RDB). You will need to provide all the relevant documents and a business plan, and the RDB will help you with the process.'

In [None]:
query = "Which business i can start in Rwanda, and can generate profit especially in Agriculture?"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' You could consider starting a business in Rwanda in the agriculture sector, such as a farm, an agricultural product processing and packaging business, or a livestock business.'

In [None]:
query = "What is the Private Education Facility Licensing?"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Private Education Facility Licensing is a system used to regulate and monitor private educational institutions, such as schools, colleges, universities, and other learning facilities. It is designed to ensure that these institutions meet certain standards of quality and safety for their students.'
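
The `stuff` chain type used above places all retrieved documents into a single prompt together with the question. The sketch below shows that assembly step in plain Python. The prompt wording and the function name `build_stuff_prompt` are assumptions made for illustration; they are not LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Join all retrieved chunks into one context block and append the question.

    The instruction text is illustrative, not LangChain's own prompt.
    """
    context = "\n\n".join(chunk.strip() for chunk in chunks if chunk.strip())
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Investment Certificate", "Customs Tax Exemption approval"],
    "Which services are listed?",
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, and the answer quality depends directly on what the similarity search returns.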

### Formatting answers as bullet points with Python's string formatting

In [None]:
query = "What are services provided by Africa Continental Free Trade Area (AfCFTA) Desk?"
docs = document_search.similarity_search(query)
result = chain.run(input_documents=docs, question=query)

# Assuming result is a string containing the answer

# Split the answer into sentences
sentences = result.split('.')

# Format the sentences into a bullet-point list
formatted_answer = "\n".join([f"- {sentence.strip()}" for sentence in sentences if sentence.strip()])

print(formatted_answer)

- The AfCFTA Desk provides resources and services for businesses and entrepreneurs to help them understand and take advantage of the benefits of the Africa Continental Free Trade Area
- These services include training, access to market information, and assistance with navigating the relevant regulations


In [None]:
query = "What is OSC Desk: Agriculture (Export) Licensing?"
docs = document_search.similarity_search(query)
result = chain.run(input_documents=docs, question=query)

# Assuming result is a string containing the answer

# Split the answer into sentences
sentences = result.split('.')

# Format the sentences into a bullet-point list
formatted_answer = "\n".join([f"- {sentence.strip()}" for sentence in sentences if sentence.strip()])

print(formatted_answer)

- OSC Desk: Agriculture (Export) Licensing is a service provided by the US Department of Agriculture that assists in the process of obtaining the necessary permits, licenses, and certifications to export agricultural products from the United States
- The service provides guidance on how to apply for each type of permit, license, or certification, as well as information on any fees associated with the process
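
The two cells above repeat the same formatting code, and splitting on every `.` would also break decimals such as `2.5` and drop the final punctuation. As a minimal sketch, the helper below wraps the formatting in a function and splits only where sentence punctuation is followed by whitespace. The name `to_bullets` is hypothetical, and the approach will still split after abbreviations such as `Ltd.` when they are followed by a space.

```python
import re


def to_bullets(text):
    """Format text as one bullet per sentence.

    Sentences are split where '.', '!' or '?' is followed by whitespace,
    so decimals like 2.5 stay intact.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(f"- {s.strip()}" for s in sentences if s.strip())


print(to_bullets("The fee is 2.5 percent. Apply online!"))
```

With the chain above, this could be used as `print(to_bullets(result))`.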


## The AI-native open-source embedding database - Chromadb


In [None]:
# Install necessary packages with quiet mode (-q) to suppress output
!pip install chromadb -q
!pip install pdf2image -q
!pip install pdfminer.six -q
!pip install pytesseract -q
!pip install unstructured -q

In [None]:
!pip install unstructured-inference -q


In [None]:
# Import required modules from langchain
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.document_loaders import OnlinePDFLoader
from langchain.indexes import VectorstoreIndexCreator  # Import VectorstoreIndexCreator

In [None]:
# Initialize an OnlinePDFLoader with the path to the PDF file
loader = OnlinePDFLoader("/content/all_pdfs_downloaded/client-service-charter.pdf")

# Load data from the PDF using the OnlinePDFLoader
data = loader.load()

# Display the loaded data
data

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[Document(page_content='CLIENT SERVICE CHARTER\n\nSERVICES PROVIDED\n\nONE-STOP CENTER\n\nInvestment Certificate\n\nCustoms Tax Exemption approval\n\nImmigration •\t Issuance of Initial\n\nWork Permit •\t Renewal of the Work Permit\n\nEnvironmental Impact Assessment •\t Environment\n\nImpact Assessment certificate\n\nREQUIREMENTS\n\nAn application can be done online http://osc.rdb.rw/en/\n\nAttach/Upload the following documents;\n\n1. Application Letter\n\naddressed to CEO, RDB\n\n2. A business plan to which the investment is to be made 3. Proof payment of a non-\n\nrefundable fee\n\n4. A license granted by the business sector in which you intend to operate, (where applicable)\n\nKey licensed sectors •\t Mining quarry •\t Health •\t Education •\t Gambling & Gaming\n\nactivities\n\nContact Person: vianney.mugabo@rdb.rw Tel: +250788559257\n\nApplication is done through the Rwanda Electronic Single Window System\n\nThe checklist is available in OSC at RDB immigration desk or at the websit

In [None]:
# Create an index using VectorstoreIndexCreator, incorporating data from the OnlinePDFLoader
index = VectorstoreIndexCreator().from_loaders([loader])
print(index)

vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x7b3f04964640>


In [None]:
# Define a query text
query = "What is the Rwanda Development Board (RDB) One Stop Center?"

# Use the index to perform a query based on the defined text
index.query(query)


' The Rwanda Development Board (RDB) One Stop Center is a service that provides investors with a range of services, including investment certificates, customs tax exemptions, immigration services, and environmental impact assessments. It also provides a business plan and proof of payment of a non-refundable fee. The application process is done online and requires the submission of an application letter, a business plan, proof of payment, and a license from the business sector in which the investor intends to operate. The checklist is available at the RDB immigration desk or on the website.'

In [None]:
# Define a query text
query = "What is The Rwanda Development Board (RDB) One Stop Center?"

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# Retrieve an answer from the index first, so that the model summarizes
# document content rather than the question itself
answer = index.query(query)

messages = [
    SystemMessage(content='You are an expert in summarizing documents.'),
    HumanMessage(content=f"Please provide a short summary of the following text in bullet points:\n{answer}"),
]
llm = ChatOpenAI(temperature=0, model_name='gpt-4')

summary = llm(messages)
print(summary)


content='- The Rwanda Development Board (RDB) One Stop Center is a government service that aims to simplify and streamline the process of setting up a business in Rwanda.\n- It provides all necessary services for starting a business under one roof, making it easier for entrepreneurs and investors.\n- The center offers services such as business registration, tax advice, and immigration services.\n- The goal of the RDB One Stop Center is to promote investment and economic growth in Rwanda.'


In [None]:
# Display results in bullet points
print("Summary Results:")
for point in summary.content.split('\n'):
    # The model's lines already start with "- ", so strip it before adding the bullet
    point = point.lstrip('- ').strip()
    if point:
        print(f"- {point}")


Summary Results:
- The Rwanda Development Board (RDB) One Stop Center is a government service that aims to simplify and streamline the process of setting up a business in Rwanda.
- It provides all necessary services for starting a business under one roof, making it easier for entrepreneurs and investors.
- The center offers services such as business registration, tax advice, and immigration services.
- The goal of the RDB One Stop Center is to promote investment and economic growth in Rwanda.
