PDF Data Extractor

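This project extracts text from a PDF file, uses OpenAI's API to pull structured data out of that text (email addresses, dates, and phone numbers in the bundled sample), and maps the result to a predefined JSON schema.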

Project Structure

pdf_extractor/
│
├── config/
│   └── config.py
│
├── data/
│   └── schema.json
│
├── src/
│   ├── __init__.py
│   ├── pdf_utils.py
│   ├── openai_utils.py
│   └── mapper.py
│
├── generate_sample_pdf.py
├── main.py
├── Pipfile
├── Pipfile.lock
├── ruff.toml
└── README.md

Setup

Prerequisites

Python 3.9 or higher.
Pipenv, used to manage the virtual environment and dependencies.

Installation

Clone the repository:

git clone https://github.com/yourusername/pdf_extractor.git
cd pdf_extractor

Install dependencies using pipenv:

pipenv install
pipenv install reportlab  # For generating the sample PDF
pipenv install --dev ruff  # For linting

Set up the OpenAI API key:
- Create a .env file in the repository root and add your OpenAI API key. The variable name must match the one that config/config.py reads:
```
OPEN_API_KEY=your-openai-api-key
```
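
For orientation, here is a minimal sketch of how a key stored this way could be read at startup. It is not the project's actual config/config.py: the helper names read_env_file and get_api_key are hypothetical, and the sketch uses only the standard library, whereas the real code may rely on a package such as python-dotenv. The variable name OPEN_API_KEY is taken from the snippet above.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=value lines; blank lines and '#' comments are skipped.

    Hypothetical helper, not part of the repository.
    """
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path: str = ".env", name: str = "OPEN_API_KEY") -> str:
    """Prefer a variable already set in the environment, then fall back to the file."""
    key = os.environ.get(name) or read_env_file(path).get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it to {path} or export it")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than an authentication error from the API later in the run.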

Generate Sample PDF

Generate a sample PDF with test data to use for extraction:

pipenv run python generate_sample_pdf.py

Running the Code

To extract data from the PDF and map it to the JSON schema:

pipenv run python main.py

Linting with Ruff

To check your code for linting errors with ruff, run:

pipenv run ruff check .

To automatically fix linting errors with ruff, run:

pipenv run ruff check --fix .

How It Works

Configuration: The config/config.py file loads configuration settings and the OpenAI API key from environment variables.
PDF Generation: The generate_sample_pdf.py script generates a sample PDF with email addresses, dates, and phone numbers.
PDF Text Extraction: The src/pdf_utils.py file contains the extract_text_from_pdf function, which extracts text from the PDF.
Data Extraction Using OpenAI: The src/openai_utils.py file contains the extract_data_with_openai function, which uses OpenAI's API to extract data from the extracted text based on predefined prompts.
Mapping Data to JSON Schema: The src/mapper.py file contains the load_json_schema and map_to_json_schema functions, which load the JSON schema and map the extracted data to the schema.
Main Script: The main.py script orchestrates the entire process: it loads the JSON schema, extracts text from the PDF, uses OpenAI's API to extract data, maps the data to the JSON schema, and prints the mapped data as JSON.
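
The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: run_pipeline and map_to_schema are hypothetical names, the two extraction steps are passed in as callables because the real signatures of extract_text_from_pdf and extract_data_with_openai are not shown here, and the schema is assumed to be a flat template of field names with default values, which may differ from the actual data/schema.json.

```python
import json
from typing import Any, Callable, Dict


def map_to_schema(extracted: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the keys the schema declares; absent keys get the schema's default."""
    return {key: extracted.get(key, default) for key, default in schema.items()}


def run_pipeline(
    pdf_path: str,
    schema: Dict[str, Any],
    extract_text: Callable[[str], str],
    extract_data: Callable[[str], Dict[str, Any]],
) -> str:
    """Extract text, extract structured data, map it to the schema, return JSON."""
    text = extract_text(pdf_path)
    extracted = extract_data(text)
    mapped = map_to_schema(extracted, schema)
    return json.dumps(mapped, indent=4)


if __name__ == "__main__":
    # Stand-ins for the real PDF and OpenAI steps, so the sketch runs offline.
    schema = {"email": [], "date": [], "phone": []}

    def fake_text(path: str) -> str:
        return "Contact example1@example.com on 01/01/2023"

    def fake_extract(text: str) -> Dict[str, Any]:
        return {"email": ["example1@example.com"], "date": ["01/01/2023"], "extra": "dropped"}

    print(run_pipeline("sample.pdf", schema, fake_text, fake_extract))
```

Passing the extraction steps in as arguments keeps the mapping logic testable without a PDF on disk or a network call.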

Example Output

After running main.py, the output should be a JSON object containing the extracted email addresses, dates, and phone numbers from the sample PDF:

{
    "email": [
        "example1@example.com",
        "example2@example.com"
    ],
    "date": [
        "01/01/2023",
        "02/02/2023"
    ],
    "phone": [
        "(123) 456-7890",
        "(987) 654-3210"
    ]
}
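
Because the values come from a language model, it can be worth sanity-checking them before using them downstream. The sketch below is one possible check, not part of the project: find_malformed and PATTERNS are hypothetical names, and the patterns only cover the formats shown in the example output above (slash-separated dates and US-style phone numbers), so they would need adjusting for other inputs.

```python
import re
from typing import Dict, List

# Patterns matching the formats in the example output; these are assumptions.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "phone": re.compile(r"\(\d{3}\) \d{3}-\d{4}"),
}


def find_malformed(output: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return, per field, the values that do not fully match the expected pattern."""
    problems: Dict[str, List[str]] = {}
    for field, pattern in PATTERNS.items():
        bad = [value for value in output.get(field, []) if not pattern.fullmatch(value)]
        if bad:
            problems[field] = bad
    return problems
```

An empty result means every value matched its pattern; it does not confirm that the values actually appear in the PDF.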
