A tool for extracting structured information from PDF documents using Google's Gemini AI API. The tool processes multiple PDFs and answers predefined research questions about each document.
NB: Tested with Python 3.12.0.
- Install dependencies:
  python -m pip install -r requirements.txt
- Set up configuration:
  cp config/config.template.yaml config/config.yaml
  Edit config/config.yaml and add your Gemini API key (get one at Google AI Studio).
- Add your PDFs: Place PDF files in the pdfs directory.
- Run extraction:
  - Option A - Python script: python run_extraction.py
  - Option B - Jupyter notebook: Open run_extraction.ipynb
The tool uses config/config.yaml for all settings. Create this file by copying config/config.template.yaml:
# Gemini AI Configuration
gemini:
  api_key: "your_api_key_here"  # Required: Get from https://aistudio.google.com/app/apikey
  model: "gemini-2.0-flash-lite"  # Model to use

# File Paths
paths:
  pdf_directory: "pdfs"  # Directory containing PDF files
  questions_file: "config/questions.yaml"  # File containing questions to ask
  output_file: "output/md/extraction_{timestamp}.md"  # Output markdown file with timestamp
  xlsx_output: "output/xlsx/extraction_{timestamp}.xlsx"  # Excel output file with timestamp
  log_file: "logs/extraction_{timestamp}.log"  # Log file with timestamp

# Processing Options
options:
  log_level: "INFO"  # Logging level
  confirm_before_processing: true  # Ask for confirmation before processing
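The `{timestamp}` placeholders in the paths above have to be filled in at run time. A minimal sketch of how that substitution could work, assuming the timestamp layout matches the `extraction_YYYY-MM-DD_HH-MM-SS` filenames described later in this README. The function name `resolve_paths` is hypothetical and is not taken from the tool's source:

```python
from datetime import datetime

# Assumption: layout inferred from the documented output filenames.
TIMESTAMP_FORMAT = "%Y-%m-%d_%H-%M-%S"


def resolve_paths(paths: dict[str, str], now: datetime) -> dict[str, str]:
    """Return a copy of `paths` with every {timestamp} placeholder filled in."""
    stamp = now.strftime(TIMESTAMP_FORMAT)
    return {key: value.replace("{timestamp}", stamp) for key, value in paths.items()}


# Example using the values from the template configuration.
paths = {
    "pdf_directory": "pdfs",
    "output_file": "output/md/extraction_{timestamp}.md",
    "xlsx_output": "output/xlsx/extraction_{timestamp}.xlsx",
    "log_file": "logs/extraction_{timestamp}.log",
}
resolved = resolve_paths(paths, datetime(2025, 1, 19, 11, 10, 32))
print(resolved["output_file"])  # output/md/extraction_2025-01-19_11-10-32.md
```

Using one `now` value for all paths keeps the markdown, Excel, and log files of a single run under the same timestamp.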
Edit config/questions.yaml to customize the questions asked about each PDF:
questions:
- "What is the main research question?"
- "What methodology was used?"
- "What were the key findings?"
additional_instructions: "Please provide concise answers with quotes and page references."
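To illustrate how the questions and `additional_instructions` might be combined into a single prompt, here is a rough sketch. The actual prompt template lives in `modules/llm_extractor.py` and is not shown in this README, so `build_prompt` and its layout are assumptions for illustration only:

```python
def build_prompt(questions: list[str], additional_instructions: str = "") -> str:
    """Combine numbered questions and optional instructions into one prompt."""
    # Number the questions Q1, Q2, ... to mirror the numbering used in the output.
    lines = [f"Q{i}: {question}" for i, question in enumerate(questions, start=1)]
    if additional_instructions:
        lines.append("")  # blank line between the questions and the instructions
        lines.append(additional_instructions)
    return "\n".join(lines)


prompt = build_prompt(
    ["What is the main research question?", "What methodology was used?"],
    "Please provide concise answers with quotes and page references.",
)
print(prompt)
```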
You can provide your Gemini API key in two ways:
- In config file: Set api_key in config/config.yaml
- Environment variable: Set GEMINI_API_KEY
  # Windows PowerShell
  $env:GEMINI_API_KEY = "your_api_key_here"
  # Linux/Mac
  export GEMINI_API_KEY="your_api_key_here"
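The lookup across the two sources could be sketched as follows. This README does not say which source wins when both are set, so the precedence here (environment variable first) is an assumption, as is the helper name `resolve_api_key`:

```python
import os
from collections.abc import Mapping

PLACEHOLDER = "your_api_key_here"  # value shipped in config.template.yaml


def resolve_api_key(config: Mapping, env: Mapping = os.environ) -> str:
    """Return the first usable key from the environment or the config."""
    # Assumption: the environment variable takes precedence over the config file.
    candidates = [
        env.get("GEMINI_API_KEY", ""),
        (config.get("gemini") or {}).get("api_key", ""),
    ]
    for key in candidates:
        # Skip empty values and the untouched template placeholder.
        if key and key != PLACEHOLDER:
            return key
    raise ValueError("No Gemini API key found in GEMINI_API_KEY or config/config.yaml")
```

Rejecting the template placeholder early gives a clearer error than letting the API call fail with an authentication message.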
Use run_extraction.py for command-line execution:
python run_extraction.py
The script will:
- Load configuration from config/config.yaml
- Show a summary of settings
- Ask for confirmation (if enabled)
- Process all PDFs in the configured directory
- Save results to the configured output file
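The steps above can be pictured roughly as the following control flow. This is a sketch, not the contents of `run_extraction.py`: the names `find_pdfs`, `should_proceed`, and `run` are hypothetical, and the per-file work is passed in as a `process_pdf` callable rather than calling the Gemini API:

```python
from pathlib import Path


def find_pdfs(pdf_directory: str) -> list[Path]:
    """Return the PDF files in `pdf_directory`, sorted by path."""
    return sorted(
        p for p in Path(pdf_directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def should_proceed(confirm: bool, ask=input) -> bool:
    """Ask for confirmation unless it is disabled in the options."""
    if not confirm:
        return True
    return ask("Proceed? [y/N] ").strip().lower() in {"y", "yes"}


def run(config: dict, process_pdf, ask=input) -> list:
    """Show a short summary, confirm, then process each PDF in turn."""
    pdfs = find_pdfs(config["paths"]["pdf_directory"])
    print(f"Model: {config['gemini']['model']}")
    print(f"Files to process: {len(pdfs)}")
    if not should_proceed(config["options"]["confirm_before_processing"], ask):
        return []
    return [process_pdf(pdf) for pdf in pdfs]
```

Passing `ask` as a parameter makes the confirmation step easy to bypass in automated runs or tests.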
Use run_extraction.ipynb for interactive processing:
- Open the notebook: Launch Jupyter and open run_extraction.ipynb
- Run cells sequentially: Each cell handles a different step:
  - Cells 1-6: Setup and configuration
  - Cell 7: Basic processing with console output
  - Cell 8: Processing with progress bar (recommended)
The notebook offers two processing options:
- Cell 7: Standard processing with detailed console output
- Cell 8: Processing with a visual progress bar and quiet logging
Processed results are saved to a markdown file with the following structure:
# PDF Analysis Results
Generated on: 2025-01-19 11:10:32
Model used: gemini-2.0-flash-lite
Files to process: 2
---
## Document Title 1
*Processed on: 2025-01-19 11:15:45*
• **Question 1**
> Answer with quotes and page references
• **Question 2**
> Another detailed answer
## Document Title 2
...
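For readers who want to post-process or reproduce this layout, the structure can be generated with a few lines of Python. This is a sketch that mirrors the example above; `render_report` and its input shape are assumptions, not the tool's own writer:

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the dates shown in the example


def render_report(documents, model: str, generated_on: datetime) -> str:
    """Render (title, processed_on, [(question, answer), ...]) tuples as markdown."""
    lines = [
        "# PDF Analysis Results",
        f"Generated on: {generated_on.strftime(DATE_FORMAT)}",
        f"Model used: {model}",
        f"Files to process: {len(documents)}",
        "---",
    ]
    for title, processed_on, answers in documents:
        lines.append(f"## {title}")
        lines.append(f"*Processed on: {processed_on.strftime(DATE_FORMAT)}*")
        for question, answer in answers:
            lines.append(f"• **{question}**")
            lines.append(f"> {answer}")
    return "\n".join(lines) + "\n"
```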
Detailed logs are saved to the logs directory with timestamps:
- Processing status for each PDF
- Question-by-question progress
- Error messages and timing information
- Final summary statistics
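A logger of this kind can be set up with the standard `logging` module. The sketch below assumes a single file handler and a simple record format; the logger name `extraction` and the helper `setup_logger` are hypothetical, and the tool's real log format may differ:

```python
import logging
from pathlib import Path


def setup_logger(log_file: str, log_level: str = "INFO") -> logging.Logger:
    """Create a logger that writes timestamped records to `log_file`."""
    # Create the log directory on demand, as the logs folder is auto-created.
    Path(log_file).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("extraction")
    logger.setLevel(log_level.upper())
    # Remove handlers left over from a previous call to avoid duplicate lines.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()
    handler = logging.FileHandler(log_file, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

The `log_level` argument corresponds to `options.log_level` in the configuration, so setting it to `"DEBUG"` there would surface more detail in the log file.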
├── config/
│   ├── config.yaml           # Main configuration (create from template)
│   ├── config.template.yaml  # Template configuration file
│   └── questions.yaml        # Questions to ask about PDFs
├── modules/
│   └── llm_extractor.py      # Main extraction logic
├── pdfs/                     # Place your PDF files here
├── output/                   # Generated output files
│   ├── md/                   # Markdown files with complete structured results
│   └── xlsx/                 # Excel files with short answers only
├── logs/                     # Processing logs (auto-created)
├── run_extraction.py         # Command-line script
├── run_extraction.ipynb      # Jupyter notebook version
└── requirements.txt          # Python dependencies
The tool generates two types of output files with timestamps in organized folders:
- Filename: extraction_YYYY-MM-DD_HH-MM-SS.md
- Content: Complete structured results for each PDF including:
  - Q1, Q2, etc.: Numbered questions with full context
  - Short Answer: Brief 1-2 sentence responses
  - Long Answer: Detailed explanations with statistical data, formatted with proper bullet points and task names
  - Quote: Direct quotes from the paper with page/section references
- Filename: extraction_YYYY-MM-DD_HH-MM-SS.xlsx
- Content: Tabular format with only short answers for quick analysis
  - One row per PDF
  - One column per question (automatically generated column names)
  - Ideal for quantitative analysis and comparison across studies
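The one-row-per-PDF, one-column-per-question layout can be sketched with plain lists. The column names `file`, `Q1`, `Q2`, ... are an assumption, since this README only says the names are generated automatically, and `build_table` is a hypothetical helper. Writing the rows to an .xlsx file would need a spreadsheet library, which this sketch leaves out:

```python
def build_table(short_answers: dict[str, list[str]]) -> list[list[str]]:
    """Return a header row followed by one row of short answers per PDF."""
    n_questions = max((len(a) for a in short_answers.values()), default=0)
    # Assumed naming scheme for the generated columns.
    header = ["file"] + [f"Q{i}" for i in range(1, n_questions + 1)]
    rows = [header]
    for filename, answers in short_answers.items():
        # Pad missing answers so every row has the same number of columns.
        padded = answers + [""] * (n_questions - len(answers))
        rows.append([filename] + padded)
    return rows


table = build_table({"study_a.pdf": ["RCT", "n = 120"], "study_b.pdf": ["Survey"]})
for row in table:
    print(row)
```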
Markdown format:
**Q12:** For each task, specify the task name and mean and sd of the main performance measure at pre AND post training.
> **Short Answer:**
> Performance metrics for inhibition, working memory, cognitive flexibility, and fluid intelligence tasks were collected across three training groups and control group at pretest, posttest1, and posttest2.
> **Long Answer:**
> Performance metrics are as follows:
>
> * **Response Inhibition** (Task: Stop-signal reaction time - SSRT):
>     * **Response inhibition EG**: Pretest: M = 522.66, SD = 102.97; Posttest1: M = 478.63, SD = 94.53
>     * **Control Group**: Pretest: M = 476.22, SD = 160.98; Posttest1: M = 484.32, SD = 139.45
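Because the answers follow a regular blockquote layout, the short answer can be pulled back out of a markdown report with a small parser. This is a reader-side sketch based only on the example above; `extract_short_answer` is a hypothetical helper and assumes each section starts with a bold `**Name:**` header line:

```python
SHORT_MARKER = "**Short Answer:**"


def extract_short_answer(block: str) -> str:
    """Return the text between the Short Answer header and the next bold header."""
    collected = []
    capturing = False
    for line in block.splitlines():
        text = line.lstrip("> ").strip()  # drop the blockquote prefix
        if text.startswith(SHORT_MARKER):
            capturing = True
            rest = text[len(SHORT_MARKER):].strip()
            if rest:
                collected.append(rest)
        elif capturing and text.startswith("**") and text.endswith(":**"):
            break  # reached the next section, e.g. **Long Answer:**
        elif capturing and text:
            collected.append(text)
    return " ".join(collected)
```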
- Batch processing: Handle multiple PDFs automatically
- Organized output: Timestamped files in structured folders (md/ and xlsx/)
- Dual format output: Complete markdown reports + concise Excel summaries
- Numbered questions: Q1, Q2, etc. with proper formatting and statistical data presentation
- Configurable questions: Customize research questions via YAML
- Progress tracking: Visual progress bars in notebook mode
- Detailed logging: Comprehensive logs with timing information
- Error handling: Graceful handling of API errors and file issues
- Resume capability: Can reprocess individual files by updating sections
- Structured formatting: Blockquote formatting with bold headers and properly formatted statistical content
See requirements.txt for the complete list:
- google-generativeai - Gemini AI API client
- PyYAML - Configuration file parsing
- tqdm - Progress bars (notebook only)
API Key Issues:
- Ensure your API key is valid and has sufficient quota
- Check both config/config.yaml and environment variables
File Not Found:
- Verify PDF files are in the configured directory
- Check that config/questions.yaml exists
Processing Errors:
- Check the log files in logs for detailed error messages
- Some PDFs may fail due to format issues or API timeouts
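Several of the checks above can be automated before a run. The following is a minimal sketch of such a preflight check; `preflight` is a hypothetical helper and is not part of the tool:

```python
from pathlib import Path


def preflight(pdf_directory: str, questions_file: str) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    pdf_dir = Path(pdf_directory)
    if not pdf_dir.is_dir():
        problems.append(f"PDF directory not found: {pdf_directory}")
    elif not any(p.suffix.lower() == ".pdf" for p in pdf_dir.iterdir()):
        problems.append(f"No PDF files in: {pdf_directory}")
    if not Path(questions_file).is_file():
        problems.append(f"Questions file not found: {questions_file}")
    return problems


for problem in preflight("pdfs", "config/questions.yaml"):
    print(problem)
```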
- ✅ Improved Output Organization: Files now organized in output/md/ and output/xlsx/ folders with timestamps
- ✅ Enhanced Formatting: Questions numbered (Q1, Q2, etc.) with blockquote formatting and bold headers
- ✅ Statistical Data Formatting: LLM automatically formats statistical content with proper spacing and bullet points
- ✅ Dual Output Formats: Complete markdown reports + Excel summaries for different use cases
This project is licensed under the GNU License.