PDF Extraction Tool

A tool for extracting structured information from PDF documents using Google's Gemini AI API. The tool processes multiple PDFs and answers predefined research questions about each document.

Quick Start

NB: Tested with Python 3.12.0.

Install dependencies:
```
python -m pip install -r requirements.txt
```
Set up configuration:
```
cp config/config.template.yaml config/config.yaml
```
Edit config/config.yaml and add your Gemini API key (available from Google AI Studio at https://aistudio.google.com/app/apikey).
Add your PDFs: place the PDF files in the pdfs directory.
Run the extraction:
- Option A - Python script: python run_extraction.py
- Option B - Jupyter notebook: open run_extraction.ipynb

Configuration

Main Configuration File

The tool reads all settings from config/config.yaml. Create this file by copying config/config.template.yaml, then edit the values:

# Gemini AI Configuration
gemini:
  api_key: "your_api_key_here"  # Required: Get from https://aistudio.google.com/app/apikey
  model: "gemini-2.0-flash-lite"  # Model to use

# File Paths
paths:
  pdf_directory: "pdfs"                # Directory containing PDF files
  questions_file: "config/questions.yaml"    # File containing questions to ask
  output_file: "output/md/extraction_{timestamp}.md"     # Output markdown file with timestamp
  xlsx_output: "output/xlsx/extraction_{timestamp}.xlsx" # Excel output file with timestamp
  log_file: "logs/extraction_{timestamp}.log"            # Log file with timestamp

# Processing Options
options:
  log_level: "INFO"                   # Logging level
  confirm_before_processing: true     # Ask for confirmation before processing
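
The `{timestamp}` placeholders in the paths section could be expanded along the following lines. This is a minimal sketch, not the tool's actual code: the helper name `resolve_paths` is hypothetical, the timestamp format is assumed to follow the `YYYY-MM-DD_HH-MM-SS` pattern of the output filenames, and the dict stands in for what `yaml.safe_load` would return for config/config.yaml.

```python
from datetime import datetime
from typing import Optional


def resolve_paths(paths: dict, now: Optional[datetime] = None) -> dict:
    """Return a copy of the `paths` section with {timestamp} filled in.

    Hypothetical helper; the timestamp format is an assumption.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    # str.replace leaves entries without the placeholder untouched.
    return {key: value.replace("{timestamp}", stamp) for key, value in paths.items()}


# Stand-in for the parsed `paths` section of config/config.yaml.
paths = {
    "pdf_directory": "pdfs",
    "output_file": "output/md/extraction_{timestamp}.md",
    "log_file": "logs/extraction_{timestamp}.log",
}
resolved = resolve_paths(paths, now=datetime(2025, 1, 19, 11, 10, 32))
print(resolved["output_file"])  # output/md/extraction_2025-01-19_11-10-32.md
```

Computing the timestamp once and reusing it for every path keeps the Markdown, Excel, and log files of a single run under the same name stem.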

Questions Configuration

Edit config/questions.yaml to customize the questions asked about each PDF:

questions:
  - "What is the main research question?"
  - "What methodology was used?"
  - "What were the key findings?"

additional_instructions: "Please provide concise answers with quotes and page references."
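
As an illustration of how these two keys might be combined into a single prompt, here is a small sketch. The function `build_prompt` and the prompt layout are assumptions made for this example; the wording the tool actually sends to Gemini is not documented here. The dict stands in for the parsed contents of config/questions.yaml.

```python
def build_prompt(questions_config: dict) -> str:
    """Combine the questions and optional instructions into one prompt string.

    Hypothetical helper: the layout below is an assumption, not the tool's
    real prompt.
    """
    lines = [
        f"Q{number}: {question}"
        for number, question in enumerate(questions_config["questions"], start=1)
    ]
    extra = questions_config.get("additional_instructions", "")
    if extra:
        lines.append("")  # blank line between the questions and the instructions
        lines.append(extra)
    return "\n".join(lines)


# Stand-in for the parsed contents of config/questions.yaml.
questions_config = {
    "questions": [
        "What is the main research question?",
        "What methodology was used?",
    ],
    "additional_instructions": "Please provide concise answers with quotes and page references.",
}
prompt = build_prompt(questions_config)
print(prompt)
```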

API Key Setup

You can provide your Gemini API key in two ways:

In config file: Set api_key in config/config.yaml

Environment variable: Set GEMINI_API_KEY

# Windows PowerShell
$env:GEMINI_API_KEY = "your_api_key_here"

# Linux/Mac
export GEMINI_API_KEY="your_api_key_here"
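
The lookup across the two sources could work roughly as sketched below. Which source wins when both are set is not stated in this README, so the precedence here (environment variable first) is an assumption, as is the helper name `resolve_api_key`. The sketch also treats the template placeholder as "not set".

```python
import os

PLACEHOLDER = "your_api_key_here"  # value shipped in the config template


def resolve_api_key(config: dict, environ=os.environ) -> str:
    """Return the Gemini API key from the environment or the config dict.

    Assumption: GEMINI_API_KEY wins over the config file when both are set.
    """
    env_key = environ.get("GEMINI_API_KEY", "").strip()
    if env_key:
        return env_key
    file_key = (config.get("gemini", {}).get("api_key") or "").strip()
    if file_key and file_key != PLACEHOLDER:
        return file_key
    raise ValueError("No Gemini API key found in GEMINI_API_KEY or config/config.yaml")


# Both sources set: the environment value is returned under this assumption.
key = resolve_api_key(
    {"gemini": {"api_key": "from-file"}},
    environ={"GEMINI_API_KEY": "from-env"},
)
print(key)  # from-env
```

Passing `environ` as a parameter keeps the function easy to test without touching the real environment.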

Running the Extraction

Method 1: Python Script

Use run_extraction.py for command-line execution:

python run_extraction.py

The script will:

Load configuration from config/config.yaml
Show a summary of settings
Ask for confirmation (if enabled)
Process all PDFs in the configured directory
Save results to the configured output files (Markdown and Excel)

Method 2: Jupyter Notebook

Use run_extraction.ipynb for interactive processing:

Open the notebook: launch Jupyter and open run_extraction.ipynb
Run the cells in order. Each group handles a different step:
- Cells 1-6: Setup and configuration
- Cell 7: Standard processing with detailed console output
- Cell 8: Processing with a visual progress bar and quiet logging (recommended)

Cells 7 and 8 are the two processing options; run one or the other.

Output

Results File

Processed results are saved to a Markdown file with the following structure:

# PDF Analysis Results

Generated on: 2025-01-19 11:10:32
Model used: gemini-2.0-flash-lite
Files to process: 2

---

## Document Title 1

*Processed on: 2025-01-19 11:15:45*

• **Question 1**

> Answer with quotes and page references

• **Question 2**

> Another detailed answer

## Document Title 2

...

Log Files

Detailed logs are saved to timestamped files in the logs directory and include:

Processing status for each PDF
Question-by-question progress
Error messages and timing information
Final summary statistics
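
A file logger matching this description could be set up with the standard `logging` module, as in the sketch below. The function name `setup_logging`, the logger name, and the record format are assumptions for illustration; only the `log_file` and `log_level` settings come from the configuration file.

```python
import logging
import os
import tempfile
from pathlib import Path


def setup_logging(log_file: str, log_level: str = "INFO") -> logging.Logger:
    """Create a logger that writes timestamped records to `log_file`."""
    path = Path(log_file)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the logs directory if missing
    logger = logging.getLogger("pdf_extraction")  # hypothetical logger name
    logger.setLevel(getattr(logging, log_level.upper()))
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# A temporary directory is used here so the example does not touch a real logs folder.
log_path = os.path.join(tempfile.mkdtemp(), "logs", "extraction_2025-01-19_11-10-32.log")
logger = setup_logging(log_path, "INFO")
logger.info("Processing %s (%d questions)", "paper1.pdf", 3)
logger.debug("This line is dropped because the level is INFO")
```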

Project Structure

├── config/
│   ├── config.yaml          # Main configuration (create from template)
│   ├── config.template.yaml # Template configuration file
│   └── questions.yaml       # Questions to ask about PDFs
├── modules/
│   └── llm_extractor.py     # Main extraction logic
├── pdfs/                    # Place your PDF files here
├── output/                  # Generated output files
│   ├── md/                  # Markdown files with complete structured results
│   └── xlsx/                # Excel files with short answers only
├── logs/                    # Processing logs (auto-created)
├── run_extraction.py        # Command-line script
├── run_extraction.ipynb     # Jupyter notebook version
└── requirements.txt         # Python dependencies

Output Formats

Each run generates two timestamped output files, one in each of the following folders:

Markdown Output (`output/md/`)

Filename: extraction_YYYY-MM-DD_HH-MM-SS.md
Content: Complete structured results for each PDF including:
- Q1, Q2, etc.: Numbered questions with full context
- Short Answer: Brief 1-2 sentence responses
- Long Answer: Detailed explanations with statistical data, formatted with proper bullet points and task names
- Quote: Direct quotes from the paper with page/section references

Excel Output (`output/xlsx/`)

Filename: extraction_YYYY-MM-DD_HH-MM-SS.xlsx
Content: Tabular format with only short answers for quick analysis
- One row per PDF
- One column per question (automatically generated column names)
- Ideal for quantitative analysis and comparison across studies
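
The "one row per PDF, one column per question" layout can be sketched as follows. The tool itself writes an .xlsx file; this example uses the standard `csv` module only to show the table shape. The helper `build_rows`, the column names (`file`, `Q1`, `Q2`, ...), and the placeholder answers are all assumptions made for illustration.

```python
import csv
import io


def build_rows(short_answers: dict) -> tuple:
    """Turn {pdf_name: [short answer per question]} into (header, rows)."""
    n_questions = max((len(answers) for answers in short_answers.values()), default=0)
    header = ["file"] + [f"Q{i}" for i in range(1, n_questions + 1)]
    rows = []
    for pdf_name, answers in short_answers.items():
        # Pad with empty cells when a PDF has fewer answers than questions.
        padded = list(answers) + [""] * (n_questions - len(answers))
        rows.append([pdf_name] + padded)
    return header, rows


# Placeholder data, not real extraction results.
header, rows = build_rows({
    "paper1.pdf": ["short answer 1", "short answer 2"],
    "paper2.pdf": ["short answer 1"],
})

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```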

Example Output Structure

Markdown format:

**Q12:** For each task, specify the task name and mean and sd of the main performance measure at pre AND post training.

> **Short Answer:**
> Performance metrics for inhibition, working memory, cognitive flexibility, and fluid intelligence tasks were collected across three training groups and control group at pretest, posttest1, and posttest2.

> **Long Answer:**
> Performance metrics are as follows:
>
> *   **Response Inhibition** (Task: Stop-signal reaction time - SSRT):
>     *   **Response inhibition EG**: Pretest: M = 522.66, SD = 102.97; Posttest1: M = 478.63, SD = 94.53
>     *   **Control Group**: Pretest: M = 476.22, SD = 160.98; Posttest1: M = 484.32, SD = 139.45

Features

Batch processing: Handle multiple PDFs automatically
Organized output: Timestamped files in structured folders (md/ and xlsx/)
Dual format output: Complete markdown reports + concise Excel summaries
Numbered questions: Q1, Q2, etc. with proper formatting and statistical data presentation
Configurable questions: Customize research questions via YAML
Progress tracking: Visual progress bars in notebook mode
Detailed logging: Comprehensive logs with timing information
Error handling: Graceful handling of API errors and file issues
Resume capability: Individual files can be reprocessed, updating their sections in the output
Structured formatting: Blockquote formatting with bold headers and properly formatted statistical content

Dependencies

The main dependencies are listed below; see requirements.txt for the complete list:

google-generativeai - Gemini AI API client
PyYAML - Configuration file parsing
tqdm - Progress bars (notebook only)

Troubleshooting

API Key Issues:

Ensure your API key is valid and has sufficient quota
Check both config/config.yaml and environment variables

File Not Found:

Verify PDF files are in the configured directory
Check that config/questions.yaml exists

Processing Errors:

Check the log files in the logs directory for detailed error messages
Some PDFs may fail due to format issues or API timeouts
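
Several of the checks above can be automated before a run. The following is a sketch of such a pre-flight check, not part of the tool: the function name `preflight` is hypothetical, and it only verifies that files and a key are present (it cannot tell whether the key is valid or has quota left).

```python
import os
from pathlib import Path


def preflight(pdf_directory: str, questions_file: str, api_key: str = "") -> list:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    pdf_dir = Path(pdf_directory)
    if not pdf_dir.is_dir():
        problems.append(f"PDF directory not found: {pdf_dir}")
    elif not any(p.suffix.lower() == ".pdf" for p in pdf_dir.iterdir()):
        problems.append(f"No PDF files in: {pdf_dir}")
    if not Path(questions_file).is_file():
        problems.append(f"Questions file not found: {questions_file}")
    if not (api_key or os.environ.get("GEMINI_API_KEY")):
        problems.append("No API key in config/config.yaml or GEMINI_API_KEY")
    return problems


# Uses the default locations from the configuration template.
for problem in preflight("pdfs", "config/questions.yaml"):
    print("WARNING:", problem)
```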

Recent Updates

✅ Improved Output Organization: Files now organized in output/md/ and output/xlsx/ folders with timestamps
✅ Enhanced Formatting: Questions numbered (Q1, Q2, etc.) with blockquote formatting and bold headers
✅ Statistical Data Formatting: LLM automatically formats statistical content with proper spacing and bullet points
✅ Dual Output Formats: Complete markdown reports + Excel summaries for different use cases

License

This project is licensed under a GNU license; see the LICENSE file for the exact terms.
