PDF Extraction Tool

A tool for extracting structured information from PDF documents using Google's Gemini AI API. The tool processes multiple PDFs and answers predefined research questions about each document.

Quick Start

Note: Tested with Python 3.12.0.

  1. Install dependencies:

    python -m pip install -r requirements.txt

  2. Set up configuration:

    cp config/config.template.yaml config/config.yaml

    Edit config/config.yaml and add your Gemini API key (get one from Google AI Studio at https://aistudio.google.com/app/apikey).

  3. Add your PDFs: Place PDF files in the pdfs directory (or whichever directory pdf_directory points to).

  4. Run extraction:

    • Option A - Python script: python run_extraction.py
    • Option B - Jupyter notebook: Open run_extraction.ipynb

Configuration

Main Configuration File

The tool uses config/config.yaml for all settings. Create this file by copying config/config.template.yaml:

# Gemini AI Configuration
gemini:
  api_key: "your_api_key_here"  # Required: Get from https://aistudio.google.com/app/apikey
  model: "gemini-2.0-flash-lite"  # Model to use

# File Paths
paths:
  pdf_directory: "pdfs"                # Directory containing PDF files
  questions_file: "config/questions.yaml"    # File containing questions to ask
  output_file: "output/md/extraction_{timestamp}.md"     # Output markdown file with timestamp
  xlsx_output: "output/xlsx/extraction_{timestamp}.xlsx" # Excel output file with timestamp
  log_file: "logs/extraction_{timestamp}.log"            # Log file with timestamp

# Processing Options
options:
  log_level: "INFO"                   # Logging level
  confirm_before_processing: true     # Ask for confirmation before processing
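
As an illustration of how the {timestamp} placeholder in the paths above could be filled in, here is a minimal sketch. The function name expand_timestamp is hypothetical and not taken from the project; the timestamp format is assumed from the output filenames described in this README (YYYY-MM-DD_HH-MM-SS).

```python
from datetime import datetime


def expand_timestamp(paths: dict, now: datetime) -> dict:
    """Return a copy of `paths` with every {timestamp} placeholder filled in."""
    # Assumed format, matching filenames such as extraction_YYYY-MM-DD_HH-MM-SS.md
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return {key: value.replace("{timestamp}", stamp) for key, value in paths.items()}


# A subset of the `paths` section shown above, written as a plain dict.
paths = {
    "pdf_directory": "pdfs",
    "output_file": "output/md/extraction_{timestamp}.md",
    "log_file": "logs/extraction_{timestamp}.log",
}
resolved = expand_timestamp(paths, datetime(2025, 1, 19, 11, 10, 32))
print(resolved["output_file"])  # output/md/extraction_2025-01-19_11-10-32.md
```

Paths without a placeholder, such as pdf_directory, pass through unchanged.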

Questions Configuration

Edit config/questions.yaml to customize the questions asked about each PDF:

questions:
  - "What is the main research question?"
  - "What methodology was used?"
  - "What were the key findings?"

additional_instructions: "Please provide concise answers with quotes and page references."

API Key Setup

You can provide your Gemini API key in two ways:

  1. In config file: Set api_key in config/config.yaml
  2. Environment variable: Set GEMINI_API_KEY
    # Windows PowerShell
    $env:GEMINI_API_KEY = "your_api_key_here"
    
    # Linux/Mac
    export GEMINI_API_KEY="your_api_key_here"
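
The two lookup locations can be combined in a few lines of Python. This is only a sketch: resolve_api_key is a hypothetical helper, and the order used here (config file first, then environment variable) is an assumption, since this README does not say which source takes precedence.

```python
import os

PLACEHOLDER = "your_api_key_here"  # value shipped in the config template


def resolve_api_key(config: dict, env=os.environ) -> str:
    """Pick an API key from the config, falling back to GEMINI_API_KEY."""
    key = (config.get("gemini") or {}).get("api_key") or ""
    if key and key != PLACEHOLDER:
        return key
    # Assumed fallback order: the environment variable is consulted second.
    key = env.get("GEMINI_API_KEY", "")
    if key:
        return key
    raise ValueError("No Gemini API key found in config/config.yaml or GEMINI_API_KEY")
```

Treating the template placeholder as "not set" avoids sending an obviously invalid key to the API.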

Running the Extraction

Method 1: Python Script

Use run_extraction.py for command-line execution:

python run_extraction.py

The script will:

  • Load configuration from config/config.yaml
  • Show a summary of settings
  • Ask for confirmation (if enabled)
  • Process all PDFs in the configured directory
  • Save results to the configured output file
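
The PDF discovery and confirmation steps above could look roughly like the sketch below. It is not the code in run_extraction.py; list_pdfs and confirm are hypothetical names, and the prompt wording is invented for illustration.

```python
from pathlib import Path


def list_pdfs(pdf_directory: str) -> list[Path]:
    """Return the PDF files directly inside `pdf_directory`, sorted by name."""
    directory = Path(pdf_directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"PDF directory not found: {directory}")
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def confirm(pdf_count: int, ask=input) -> bool:
    """Ask the user whether to continue; only 'y' or 'yes' proceeds."""
    reply = ask(f"Process {pdf_count} PDF file(s)? [y/N] ")
    return reply.strip().lower() in {"y", "yes"}
```

Passing the prompt function as the ask parameter keeps the confirmation step easy to skip or to test without a terminal.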

Method 2: Jupyter Notebook

Use run_extraction.ipynb for interactive processing:

  1. Open the notebook: Launch Jupyter and open run_extraction.ipynb
  2. Run cells sequentially: Each cell handles a different step:
    • Cells 1-6: Setup and configuration
    • Cell 7: Standard processing with detailed console output
    • Cell 8: Processing with a visual progress bar and quiet logging (recommended)

Output

Results File

Processed results are saved to a markdown file with the following structure:

# PDF Analysis Results

Generated on: 2025-01-19 11:10:32
Model used: gemini-2.0-flash-lite
Files to process: 2

---

## Document Title 1

*Processed on: 2025-01-19 11:15:45*

**Question 1**

> Answer with quotes and page references

**Question 2**

> Another detailed answer

## Document Title 2

...

Log Files

Detailed logs are saved to the logs directory with timestamps:

  • Processing status for each PDF
  • Question-by-question progress
  • Error messages and timing information
  • Final summary statistics
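
For readers who want comparable logs in their own scripts, a timestamped file logger can be set up with the standard logging module. This is a sketch under assumptions: build_logger, the logger name, and the line format are all hypothetical, and the project's actual logging setup may differ.

```python
import logging
from pathlib import Path


def build_logger(log_file: str, level: str = "INFO") -> logging.Logger:
    """Create a logger that writes timestamped lines to `log_file`."""
    path = Path(log_file)
    path.parent.mkdir(parents=True, exist_ok=True)  # create logs/ if missing
    logger = logging.getLogger("extraction_sketch")  # hypothetical logger name
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()  # avoid duplicate handlers when re-run in a notebook
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The level argument corresponds to the log_level option in config/config.yaml, so "INFO" suppresses debug messages while keeping status and error lines.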

Project Structure

├── config/
│   ├── config.yaml          # Main configuration (create from template)
│   ├── config.template.yaml # Template configuration file
│   └── questions.yaml       # Questions to ask about PDFs
├── modules/
│   └── llm_extractor.py     # Main extraction logic
├── pdfs/                    # Place your PDF files here
├── output/                  # Generated output files
│   ├── md/                  # Markdown files with complete structured results
│   └── xlsx/                # Excel files with short answers only
├── logs/                    # Processing logs (auto-created)
├── run_extraction.py        # Command-line script
├── run_extraction.ipynb     # Jupyter notebook version
└── requirements.txt         # Python dependencies

Output Formats

The tool generates two types of timestamped output files, each in its own folder:

Markdown Output (output/md/)

  • Filename: extraction_YYYY-MM-DD_HH-MM-SS.md
  • Content: Complete structured results for each PDF including:
    • Q1, Q2, etc.: Numbered questions with full context
    • Short Answer: Brief 1-2 sentence responses
    • Long Answer: Detailed explanations with statistical data, formatted with proper bullet points and task names
    • Quote: Direct quotes from the paper with page/section references

Excel Output (output/xlsx/)

  • Filename: extraction_YYYY-MM-DD_HH-MM-SS.xlsx
  • Content: Tabular format with only short answers for quick analysis
    • One row per PDF
    • One column per question (automatically generated column names)
    • Ideal for quantitative analysis and comparison across studies
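
The "one row per PDF, one column per question" layout can be sketched with plain Python lists before handing the data to any spreadsheet writer. The names below are hypothetical, and the Q1, Q2, ... column labels are an assumption, since the README only says column names are generated automatically.

```python
def build_rows(short_answers: dict[str, list[str]]) -> tuple[list[str], list[list[str]]]:
    """Turn {pdf_name: [short answer per question]} into a header and rows."""
    question_count = max((len(a) for a in short_answers.values()), default=0)
    # Assumed column naming: "file" followed by Q1, Q2, ...
    header = ["file"] + [f"Q{i}" for i in range(1, question_count + 1)]
    rows = []
    for name, answers in short_answers.items():
        # Pad with blanks so every row has the same number of columns.
        padded = answers + [""] * (question_count - len(answers))
        rows.append([name] + padded)
    return header, rows


header, rows = build_rows({
    "study_a.pdf": ["RCT", "n = 40"],
    "study_b.pdf": ["Survey"],
})
print(header)  # ['file', 'Q1', 'Q2']
```

Padding short rows keeps the table rectangular when a question could not be answered for a given PDF.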

Example Output Structure

Markdown format:

**Q12:** For each task, specify the task name and mean and sd of the main performance measure at pre AND post training.

> **Short Answer:**
> Performance metrics for inhibition, working memory, cognitive flexibility, and fluid intelligence tasks were collected across three training groups and control group at pretest, posttest1, and posttest2.

> **Long Answer:**
> Performance metrics are as follows:
>
> *   **Response Inhibition** (Task: Stop-signal reaction time - SSRT):
>     *   **Response inhibition EG**: Pretest: M = 522.66, SD = 102.97; Posttest1: M = 478.63, SD = 94.53
>     *   **Control Group**: Pretest: M = 476.22, SD = 160.98; Posttest1: M = 484.32, SD = 139.45
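
A block in this shape can be produced by prefixing each answer line with a blockquote marker. The sketch below is illustrative only; format_answer is a hypothetical helper and not the project's formatter.

```python
def format_answer(number: int, question: str, short: str, long: str) -> str:
    """Render one question as a bold heading followed by blockquoted answers."""

    def quote(text: str) -> str:
        # Prefix every line with '>' so multi-line answers stay inside the blockquote.
        return "\n".join(f"> {line}" if line else ">" for line in text.splitlines())

    return "\n".join([
        f"**Q{number}:** {question}",
        "",
        "> **Short Answer:**",
        quote(short),
        "",
        "> **Long Answer:**",
        quote(long),
    ])


print(format_answer(1, "What methodology was used?", "A randomized trial.",
                    "Details:\n\n*   **Design**: randomized controlled trial"))
```

Empty lines inside an answer become a bare ">" so that bullet lists in the long answer remain part of the same blockquote.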

Features

  • Batch processing: Handle multiple PDFs automatically
  • Organized output: Timestamped files in structured folders (md/ and xlsx/)
  • Dual format output: Complete markdown reports + concise Excel summaries
  • Numbered questions: Q1, Q2, etc. with proper formatting and statistical data presentation
  • Configurable questions: Customize research questions via YAML
  • Progress tracking: Visual progress bars in notebook mode
  • Detailed logging: Comprehensive logs with timing information
  • Error handling: Graceful handling of API errors and file issues
  • Resume capability: Can reprocess individual files by updating sections
  • Structured formatting: Blockquote formatting with bold headers and properly formatted statistical content

Dependencies

See requirements.txt for the complete list:

  • google-generativeai - Gemini AI API client
  • PyYAML - Configuration file parsing
  • tqdm - Progress bars (notebook only)

Troubleshooting

API Key Issues:

  • Ensure your API key is valid and has sufficient quota
  • Check both config/config.yaml and environment variables

File Not Found:

  • Verify PDF files are in the configured directory
  • Check that config/questions.yaml exists

Processing Errors:

  • Check the log files in the logs directory for detailed error messages
  • Some PDFs may fail due to format issues or API timeouts

Recent Updates

  • Improved Output Organization: Files now organized in output/md/ and output/xlsx/ folders with timestamps
  • Enhanced Formatting: Questions numbered (Q1, Q2, etc.) with blockquote formatting and bold headers
  • Statistical Data Formatting: LLM automatically formats statistical content with proper spacing and bullet points
  • Dual Output Formats: Complete markdown reports + Excel summaries for different use cases

License

This project is licensed under the GNU License.
