# High-Performance PDF Processing and Question Answering System

# 1. Introduction



This project processes large PDF files and answers user-supplied questions by locating relevant passages in the extracted text. It relies on a small set of Python libraries: text is extracted from the point where a marker keyword first appears (intended to skip front matter such as the table of contents), and the question is then searched for in that text.


# 2. Requirements

Python Libraries: PyPDF2, psutil, asyncio, os

Key Features:

Extract text from a PDF, starting at the first page that contains a specified keyword (e.g., "chapter").

Answer questions from the extracted PDF text.

Process multiple PDF files concurrently using asynchronous programming.



In [1]:
!pip install PyPDF2 psutil




In [2]:
import asyncio  # For asynchronous processing
import PyPDF2  # For reading PDF files
import psutil  # For monitoring memory usage
import os      # For interacting with the operating system 

# 3. Code Implementation

# Extracting Text from a PDF

In [3]:
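def extract_text_from_pdf(pdf_file, start_after="chapter"):
    """Return lowercased text from the first page containing `start_after` onward."""
    keyword = start_after.lower()
    collected = []
    start_extracting = False
    with open(pdf_file, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # Fall back to '' when a page yields no text (e.g. image-only pages)
            page_text = page.extract_text() or ''
            if not start_extracting and keyword in page_text.lower():
                start_extracting = True
            if start_extracting:
                collected.append(page_text)
    # Lowercase once so that later searches are case-insensitive
    return ''.join(collected).lower()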


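This function reads a PDF file and extracts text starting from the first page that contains the specified keyword (e.g., "chapter").

It checks each page in order; once the keyword appears, that page and every page after it are collected. Because the match is on the first occurrence, a table of contents that mentions the keyword is included in the output.
The result is returned in lowercase so that later searches are case-insensitive.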
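
The page-gating rule can be tried without a PDF. The sketch below applies the same logic to a plain list of page strings; `collect_from_keyword` and the sample pages are hypothetical and exist only for illustration.

```python
def collect_from_keyword(pages, start_after="chapter"):
    """Join page texts from the first page containing the keyword onward."""
    keyword = start_after.lower()
    collected = []
    started = False
    for page_text in pages:
        # Flip the flag on the first page that mentions the keyword
        if not started and keyword in page_text.lower():
            started = True
        if started:
            collected.append(page_text)
    return "".join(collected).lower()


# Made-up pages: a preface, a contents page, and the first chapter
sample_pages = [
    "Preface\n",
    "Contents: Chapter 1 ... 1-21\n",
    "Chapter-1 Fundamentals\n",
]
extracted = collect_from_keyword(sample_pages)
print(extracted)
```

The preface is dropped, but the contents page is kept because it is the first page that mentions "chapter". A more specific keyword, such as "chapter-1", would be one way to start at the chapter itself.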

# Searching for an Answer in the PDF Text

In [4]:
def find_answer_in_text(pdf_text, question, context_size=500):
    # pdf_text is already lowercase, so lowercase the question to match
    question_lower = question.lower()
    start_idx = pdf_text.find(question_lower)
    if start_idx == -1:
        return "No relevant information found."
    # Return up to context_size characters on each side of the match start
    start = max(0, start_idx - context_size)
    end = min(len(pdf_text), start_idx + context_size)
    return f"Answer found:\n{pdf_text[start:end]}"


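This function searches the extracted PDF text for the question as a literal, case-insensitive substring. It does not interpret the question, so a short keyword or phrase works best.

If the question is found, it returns the surrounding text: up to context_size characters before and after the start of the first match.

If the question is not found, it returns a message saying no relevant information was found.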
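
To make the window arithmetic concrete, here is a minimal sketch of the same slicing on a short string. `context_window` is a hypothetical helper, and the sample text and the small `context_size` are chosen only to keep the result readable.

```python
def context_window(text, phrase, context_size=500):
    """Return up to context_size characters on each side of the first match start."""
    idx = text.find(phrase.lower())
    if idx == -1:
        return None
    start = max(0, idx - context_size)
    end = min(len(text), idx + context_size)
    return text[start:end]


sample_text = "0123456789 fundamentals of computers 0123456789"
window = context_window(sample_text, "Fundamentals", context_size=5)
print(repr(window))  # '6789 funda'
```

The window is measured from the start of the match, so when context_size is smaller than the phrase, the phrase itself is cut off. With the default of 500 this rarely matters.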

# Process Single PDF

In [5]:
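async def process_single_pdf(pdf_file, question):
    # Run the blocking extraction in a worker thread (Python 3.9+)
    # so other PDFs can be processed while this one is being read
    pdf_text = await asyncio.to_thread(extract_text_from_pdf, pdf_file)
    if not pdf_text:
        return "No relevant information found."

    # Search for the answer in the PDF text
    return find_answer_in_text(pdf_text, question)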

# Process PDFs in Parallel

In [6]:
async def process_pdfs_in_parallel(pdf_files, question):
    tasks = [process_single_pdf(pdf_file, question) for pdf_file in pdf_files]
    results = await asyncio.gather(*tasks)

    # Display the answers 
    for i, result in enumerate(results):
        print(f"Answer from PDF {i + 1}: {result}")


This asynchronous function processes multiple PDF files concurrently. It creates one coroutine per file, each calling process_single_pdf(), and runs them together with asyncio.gather(), which returns the results in the same order as the input list. It then prints each answer, numbered by the PDF's position in that list.
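
One point worth illustrating is that asyncio.gather() only overlaps work that yields to the event loop; a coroutine that calls a blocking function directly runs to completion before the next one starts. The sketch below shows one possible approach, assuming Python 3.9+ for asyncio.to_thread. `slow_extract` is a hypothetical stand-in for a blocking PDF read, and the file names are placeholders.

```python
import asyncio
import time


def slow_extract(name, delay=0.2):
    """Stand-in for a blocking extraction call."""
    time.sleep(delay)
    return f"text from {name}"


async def gather_in_threads(names):
    # Each blocking call runs in a worker thread; gather keeps input order
    tasks = [asyncio.to_thread(slow_extract, name) for name in names]
    return await asyncio.gather(*tasks)


def run_demo():
    started = time.perf_counter()
    results = asyncio.run(gather_in_threads(["a.pdf", "b.pdf", "c.pdf"]))
    elapsed = time.perf_counter() - started
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_demo()
    print(results)
    print(f"Elapsed: {elapsed:.2f} s")
```

Inside a notebook, where an event loop is already running, `await gather_in_threads([...])` would replace the asyncio.run() call. Threads overlap the waiting here; CPU-heavy parsing is still limited by the interpreter lock, so the speed-up on real PDFs may be smaller.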

# Monitor Memory Usage

In [7]:
def print_memory_usage():
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    print(f"Memory usage: {mem_info.rss / (1024 * 1024):.2f} MB")  # Memory in MB


# Main Function

In [8]:
async def main():
    question = input("Enter your question: ")  
    pdf_files = ['/Users/ashinirf/Downloads/book.pdf'] 
    
    print("Memory usage before processing:")
    print_memory_usage()

    # Calling processing function 
    await process_pdfs_in_parallel(pdf_files, question)

    print("Memory usage after processing:")
    print_memory_usage()


This asynchronous main() function prompts the user to enter a question, then processes a list of PDF files (in this case, just one). It prints the process's resident memory (RSS) before and after processing, and calls process_pdfs_in_parallel() to search each PDF for the question and display the results.
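
The before/after RSS readings cover the whole process. As a complementary view, the sketch below uses the standard-library tracemalloc module, which is a different technique from the psutil approach in this notebook: it tracks only allocations made by Python code, and reports their peak. `measure_peak_mb` and `build_text` are hypothetical names used for illustration.

```python
import tracemalloc


def measure_peak_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced Python allocation in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def build_text(n_pages):
    # Stand-in for text extraction: builds a large string page by page
    return "".join("page text " * 100 for _ in range(n_pages))


text, peak_mb = measure_peak_mb(build_text, 200)
print(f"Characters: {len(text)}, peak traced memory: {peak_mb:.2f} MB")
```

Because tracemalloc counts only Python-level allocations, its figures will generally not match the RSS values printed by print_memory_usage().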

In [9]:
await main()


Enter your question: Introduction
Memory usage before processing:
Memory usage: 76.28 MB
Answer from PDF 1: Answer found:
contents 
 
chapter 
no. name of the chapter 
 
 page no. 
 syllabus  - 
1. fundamentals of computers  1-21 
2. input/output devices and other hardware   22-40 
3. windows operating system 41-52 4. introduction to microsoft word 53-83 
5. advanced features of microsoft word    84-106 
 
 
 chapter-1 
fundamentals of computers 
objectives of this lesson 
this lesson is aimed to make you understand the following about a computer system: 
 historical evolution of computers and computer generations 
 working model of a digital computer and functions of a digital computer 
 basic characteristics of digital computer  
 types of computers                            
 classification of digita
Memory usage after processing:
Memory usage: 139.09 MB


# Conclusion

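This code reads PDFs, searches the extracted text for your question, and reports the process's memory use before and after. It is structured to handle several PDFs in a single run using asyncio. In the sample run above, memory usage rose from 76.28 MB to 139.09 MB while processing one book.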