**PROGRAM OVERVIEW**

Use this program to perform the following steps in sequence:

1. **Download** a PDF from the web.
2. **Extract text** from the downloaded PDF and **save** the content to a text file.

If you are working with a manually downloaded PDF, skip the download step and run only the last code cell to extract the PDF's content and save it to a text file.

**LIBRARIES AND PACKAGES**

In [None]:
import os
import requests # type: ignore
from PyPDF2 import PdfReader # type: ignore
from datetime import datetime

Set the **PDF URL** to the document you want to download

In [None]:
# PDF URL to download (edit as needed); the download cell below requires pdf_url to be defined
pdf_url = "https://www.gao.gov/assets/gao-24-106732.pdf"

Capture the current system date and time

In [None]:
# Get the current date and time for timestamping
current_datetime = datetime.now().strftime('%Y%m%d_%H%M%S')

Create a new directory to hold the downloaded PDF and the consolidated text file

In [None]:
# Create a new folder with a timestamp
prefix = "GAO_DOWNLOAD"  # change the prefix to match your application domain
folder_name = f"{prefix}_{current_datetime}"
os.makedirs(folder_name, exist_ok=True)

Define the file name for the downloaded PDF

In [None]:
# Set the PDF file name with timestamp
file_prefix = "GAO_PDF"  # change the prefix as per your requirement
pdf_file_name = f"{file_prefix}_{current_datetime}.pdf"
pdf_file_path = os.path.join(folder_name, pdf_file_name)
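
The timestamp, folder, and file-name steps above can also be gathered into one helper so they always share the same timestamp. This is a minimal sketch, not part of the original notebook: the function name `build_paths` and its parameters `base_dir` and `now` are hypothetical and exist only for illustration.

```python
import os
from datetime import datetime


def build_paths(folder_prefix, file_prefix, base_dir=".", now=None):
    """Create a timestamped folder and return (folder, pdf_path, stamp).

    `now` is a hypothetical parameter that lets a caller pass a fixed
    datetime; when omitted, the current system time is used.
    """
    stamp = (now or datetime.now()).strftime('%Y%m%d_%H%M%S')
    folder = os.path.join(base_dir, f"{folder_prefix}_{stamp}")
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    pdf_path = os.path.join(folder, f"{file_prefix}_{stamp}.pdf")
    return folder, pdf_path, stamp
```

For example, `folder_name, pdf_file_path, current_datetime = build_paths("GAO_DOWNLOAD", "GAO_PDF")` would produce the same names as the cells above.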

Now that the download folder and file names are set, download the PDF file

In [None]:
# Download the PDF from the provided URL
try:
    # A timeout keeps the cell from hanging if the server never responds
    pdf_response = requests.get(pdf_url, timeout=30)
    if pdf_response.status_code == 200:
        # Save the PDF to the folder
        with open(pdf_file_path, 'wb') as pdf_file:
            pdf_file.write(pdf_response.content)
        print(f"Downloaded PDF saved as: {pdf_file_path}")
    else:
        print(f"Failed to download PDF. Status Code: {pdf_response.status_code}")
        exit(1)
except Exception as e:
    print(f"Error downloading PDF: {e}")
    exit(1)

Failed to download PDF. Status Code: 403
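
A 403 status means the server refused the request as it was sent. One common cause is that a site rejects the default client identification of scripted requests, so a possible workaround is to send a browser-like `User-Agent` header. Whether this particular server accepts such a request is an assumption and has not been verified here. The sketch below uses a hypothetical helper `download_pdf`; its `get` parameter is also hypothetical and exists only so the function can be tried with a stub instead of a live network call.

```python
def download_pdf(url, dest_path, get=None, timeout=30):
    """Download `url` to `dest_path`; return True on HTTP 200, else False."""
    if get is None:
        import requests  # imported lazily so a stub can replace it offline
        get = requests.get
    # Assumption: a browser-like User-Agent may avoid a 403 on some servers
    headers = {"User-Agent": "Mozilla/5.0 (compatible; pdf-download-example)"}
    response = get(url, headers=headers, timeout=timeout)
    if response.status_code != 200:
        print(f"Failed to download PDF. Status Code: {response.status_code}")
        return False
    with open(dest_path, "wb") as pdf_file:
        pdf_file.write(response.content)
    return True
```

Returning a boolean rather than calling `exit(1)` lets later cells decide whether to continue, for example by skipping extraction when the download fails.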


Now that the PDF is downloaded, extract the text and write it to a text file

In [None]:
# Extract text from the downloaded PDF
pref_txt_file = "GAO_Consolidated"  # change as per your requirement
text_file_name = f"{pref_txt_file}_{current_datetime}.txt"
text_file_path = os.path.join(folder_name, text_file_name)

try:
    with open(pdf_file_path, 'rb') as file:
        reader = PdfReader(file)
        # Explicit UTF-8 avoids encoding errors on platforms with a non-UTF-8 default
        with open(text_file_path, 'w', encoding='utf-8') as text_file:
            for page_num, page in enumerate(reader.pages):
                text = page.extract_text()
                if text:  # Only write if text was extracted
                    text_file.write(f"\n\n[Extract from {pdf_file_name} - Page {page_num + 1}]\n")
                    text_file.write(text)
    print(f"Text extracted from PDF and saved as: {text_file_path}")
except Exception as e:
    print(f"Error extracting text from PDF: {e}")

Error extracting text from PDF: [Errno 2] No such file or directory: 'Farm_Bill_DOWNLOAD_20241107_202042/Farm_Bill_PDF_20241107_202042.pdf'


The Farm Bill report was downloaded manually from https://crsreports.congress.gov/product/pdf/R/R48167 and used as the input to the following cell

In [None]:
# Uncomment the two lines below and edit the paths to point to your own files
#pdf_file_path = "/Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Farm_Bill_DOWNLOAD_20241107_202042/R48167.pdf"
#text_file_path = "/Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Farm_Bill_DOWNLOAD_20241107_202042/Farm_Bill_R48167.txt"

try:
    with open(pdf_file_path, 'rb') as file:
        reader = PdfReader(file)
        # Explicit UTF-8 avoids encoding errors on platforms with a non-UTF-8 default
        with open(text_file_path, 'w', encoding='utf-8') as text_file:
            for page_num, page in enumerate(reader.pages):
                text = page.extract_text()
                if text:  # Only write if text was extracted
                    # Label pages with the actual input file, not the timestamped download name
                    text_file.write(f"\n\n[Extract from {os.path.basename(pdf_file_path)} - Page {page_num + 1}]\n")
                    text_file.write(text)
    print(f"Text extracted from PDF and saved as: {text_file_path}")
except Exception as e:
    print(f"Error extracting text from PDF: {e}")

Text extracted from PDF and saved as: /Users/arnabraychaudhari/Documents/6317/Project_LLM_and_RAG_2024_GWU/Farm_Bill_DOWNLOAD_20241107_202042/Farm_Bill_R48167.txt
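
The two extraction cells above repeat the same page-writing loop. As a rough sketch, that loop could be factored into one function that takes any iterable of page texts. The name `write_page_texts` and its parameters are hypothetical, introduced here only for illustration.

```python
def write_page_texts(page_texts, source_name, text_file_path):
    """Write each non-empty page text under a page header.

    Returns the number of pages actually written, so the caller can
    detect a PDF that yielded no extractable text (e.g. a scanned PDF).
    """
    written = 0
    with open(text_file_path, "w", encoding="utf-8") as text_file:
        for page_num, text in enumerate(page_texts, start=1):
            if text:  # skip pages where no text was extracted
                text_file.write(f"\n\n[Extract from {source_name} - Page {page_num}]\n")
                text_file.write(text)
                written += 1
    return written
```

With PyPDF2 it could be called as `write_page_texts((page.extract_text() for page in PdfReader(pdf_file_path).pages), os.path.basename(pdf_file_path), text_file_path)`. A return value of 0 would suggest the PDF contains no extractable text layer.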
