In [1]:
pip install PyPDF2 openai langchain python-dotenv transformers torch

Note: you may need to restart the kernel to use updated packages.


In [4]:
import PyPDF2
import openai
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv('/users/shephalidubey/Documents/keys.env')

# Set your OpenAI API key
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
# print(f"Loaded OpenAI API Key: {OPENAI_API_KEY}")  # Debug line

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() or ""  # Handle potential None values
    return text

def summarize_text(text):
    max_input_length = 4096  # Adjust if necessary
    text_chunks = [text[i:i + max_input_length] for i in range(0, len(text), max_input_length)]

    summaries = []
    chat_model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo", temperature=0)

    for chunk in text_chunks:
        # Create a message object
        message = HumanMessage(content=f"Summarize the following text:\n\n{chunk}")
        
        response = chat_model([message])
        summary = response.content  # Correct access to summary content
        summaries.append(summary)

    return "\n".join(summaries)  # Join summaries with newlines for readability

def main(pdf_path):
    text = extract_text_from_pdf(pdf_path)
    if not text:
        print("No text found in PDF.")
        return
    print(f"Text length: {len(text)}")
    print(f"Text snippet: {text[:500]}")  # Print the first 500 characters

    summary = summarize_text(text)   
    print("Summary:")
    print(summary)

# Example usage
if __name__ == "__main__":
    pdf_path = "/users/shephalidubey/Documents/AQI-DETECTIVES.pdf"  # Replace with your PDF file path
    main(pdf_path)


Text length: 11602
Text snippet: AQI DETECTIVES
Air Quality: A Critical Global Concern
Collaborators: Shephali Dubey/Girish Hosalli/Christine Chung/Chearine 
Pringle/Xiwu Dai
Project Overview & Goals
Air quality is a vital aspect of our environment and human health. The 
quality of the air we breathe directly impacts our well-being and the 
sustainability of our planet. Our project aims to discover trends in air 
pollution in New York City and Long Island and do a deep dive into the air 
qu
Summary:
The text discusses a project focused on analyzing air quality trends in New York City and Long Island over the past 12 years using Machine Learning models to predict the Air Quality Index (AQI) for the next year. The project aims to help individuals, especially sensitive groups, make informed decisions about outdoor activities and assis

Explanation
extract_text_from_pdf: This function reads the PDF and extracts text from each page.
summarize_text: This function splits the text into 4,096-character chunks, sends each chunk to OpenAI's gpt-3.5-turbo through LangChain's ChatOpenAI, and joins the per-chunk summaries with newlines. Adjust max_input_length to change the chunk size.
main: This function ties everything together. It extracts text from the PDF, summarizes it, and prints the summary.
Running the Code
Save the code to a file, e.g., pdf_summarizer.py.
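Replace the pdf_path value at the bottom of the script with the path to your PDF file, and make sure OPENAI_API_KEY is defined in the .env file passed to load_dotenv.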
Run the script: python pdf_summarizer.py
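To avoid editing the script for every document, the path could instead be read from the command line. The following is a minimal sketch of that idea; the parse_args helper and the pdf_path argument name are illustrative assumptions, not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Summarize a PDF file.")
    parser.add_argument("pdf_path", help="path to the PDF to summarize")
    return parser.parse_args(argv)


# An explicit list is passed here so the example runs anywhere,
# including inside a notebook. "report.pdf" is a placeholder name.
args = parse_args(["report.pdf"])
print(f"Would summarize: {args.pdf_path}")
```

In a script, calling parse_args() with no argument and passing args.pdf_path to main would allow running it as: python pdf_summarizer.py path/to/file.pdf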

For summarizing very large texts, especially when working with PDFs, consider the following models and pipelines that are known for handling long documents effectively:

1. Longformer
Model: allenai/longformer-large-4096
Description: Longformer is designed to handle long documents by using a sparse attention mechanism, which lets it process longer sequences (up to 4,096 tokens for this checkpoint) than standard transformers. This checkpoint is encoder-only, so it does not generate summaries by itself; for summarization, use its encoder-decoder variant, LED (see item 3).
Simplified Pipeline: When a model is loaded through the Hugging Face summarization pipeline, tokenization is handled internally, so you do not need to manage a tokenizer separately.
2. BigBird
Model: google/bigbird-roberta-base
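Description: Similar to Longformer, BigBird handles long sequences efficiently. It combines global, local (sliding-window), and random attention, making it suitable for long texts. This checkpoint is also encoder-only; summarization checkpoints such as google/bigbird-pegasus-large-arxiv use the same attention scheme in an encoder-decoder model.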
3. LED (Longformer Encoder-Decoder)
Model: allenai/led-base-16384 (or allenai/led-large-16384)
Description: This model is built on Longformer and is designed for sequence-to-sequence tasks, like summarization, while supporting inputs of up to 16,384 tokens. Note that facebook/bart-large-LED is not a valid model identifier on https://huggingface.co/models; trying to load it raises an OSError.
4. GPT-3 or GPT-4 (if accessible)
Description: If you have access to OpenAI’s API for GPT models, they can generate summaries for large texts effectively. You might need to break your text into smaller parts, but they handle context well.
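As an illustration of the long-input models listed above, here is a minimal sketch of summarizing with LED through the transformers library. It assumes the allenai/led-base-16384 checkpoint from the Hugging Face Hub; that base checkpoint is intended for fine-tuning, so a checkpoint fine-tuned for summarization will likely give better output. The function name summarize_long_text and its default values are hypothetical.

```python
def summarize_long_text(text, model_name="allenai/led-base-16384",
                        max_input_tokens=16384, max_summary_tokens=256):
    """Summarize `text` with an LED checkpoint (downloads the model on first use)."""
    # Imported lazily so that defining the function does not load torch.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate to the model's input limit instead of failing on longer texts.
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       max_length=max_input_tokens)

    # LED uses local attention everywhere except on tokens flagged as global;
    # flagging the first token is the usual choice for summarization.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1

    with torch.no_grad():
        summary_ids = model.generate(
            **inputs,
            global_attention_mask=global_attention_mask,
            max_new_tokens=max_summary_tokens,
            num_beams=2,
        )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling summarize_long_text(text) with the output of extract_text_from_pdf would summarize the document in one pass, provided it fits within the token limit; longer documents would still need chunking.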

Additional Considerations
Chunking: If the text is extremely long, you may still need to chunk it, as even these models have maximum input sizes.
Preprocessing: Ensure the text extracted from the PDF is clean and properly formatted before summarization to improve the quality of the output.
Performance: Keep an eye on performance and response time, especially with larger models, as they may require more resources.
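
The chunking and preprocessing points above can be sketched as two small helpers. This is an illustration under stated assumptions rather than a definitive implementation: clean_pdf_text and chunk_text are hypothetical names, the /g34B-style pattern targets glyph-name residue that some PDF extractions leave behind, and the 200-character overlap is an arbitrary default.

```python
import re


def clean_pdf_text(text):
    """Remove glyph-name residue such as '/g34B' and normalize whitespace."""
    text = re.sub(r"/g\d[0-9A-Fa-f]*\b", " ", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)  # collapse blank lines and padding
    return text.strip()


def chunk_text(text, max_chars=4096, overlap=200):
    """Split text into chunks of at most max_chars characters, breaking on
    whitespace where possible and repeating `overlap` characters between
    neighbouring chunks so context is not lost at the boundaries."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last whitespace so words are not cut in half.
            split = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if split > start + overlap:
                end = split
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

For example, chunk_text(clean_pdf_text(text)) could replace the fixed-width slicing inside summarize_text, so that each request receives cleaner input that ends on a word boundary.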