This project extracts data from a PDF file using OpenAI's API and maps it to a predefined JSON schema.
pdf_extractor/
│
├── config/
│   └── config.py
│
├── data/
│   └── schema.json
│
├── src/
│   ├── __init__.py
│   ├── pdf_utils.py
│   ├── openai_utils.py
│   └── mapper.py
│
├── generate_sample_pdf.py
├── main.py
├── Pipfile
├── Pipfile.lock
├── ruff.toml
└── README.md
- Python 3.9 or higher: Ensure you have Python 3.9+ installed.
- Pipenv: Ensure you have pipenv installed.
- Clone the repository:
    git clone https://github.com/yourusername/pdf_extractor.git
    cd pdf_extractor
- Install dependencies using pipenv:
    pipenv install
    pipenv install reportlab  # For generating the sample PDF
    pipenv install --dev ruff  # For linting
- Set up the OpenAI API key:
  - Create a .env file in the root directory and add your OpenAI API key:
      OPEN_API_KEY=your-openai-api-key
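
As a rough illustration of how the key could be read at startup, here is a minimal sketch. It is not the contents of config/config.py, which this README does not show: `load_api_key` and `API_KEY_VAR` are hypothetical names, and the variable name is simply copied from the .env example. It relies on the fact that `pipenv run` loads a .env file from the project root into the environment, so plain `os.environ` is enough here.

```python
import os
from typing import Mapping, Optional

# Variable name copied from the .env example; adjust it if config.py reads a different one.
API_KEY_VAR = "OPEN_API_KEY"


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or fail with a clear message.

    Hypothetical helper for illustration only.
    """
    # Default to the real process environment; a mapping can be passed in for testing.
    env = os.environ if env is None else env
    key = env.get(API_KEY_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{API_KEY_VAR} is not set. Add it to .env and start the script with 'pipenv run'."
        )
    return key
```

Failing early with an explicit message tends to be easier to debug than an authentication error raised later by the API client.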
Generate a sample PDF with test data to use for extraction:
    pipenv run python generate_sample_pdf.py

To extract data from the PDF and map it to the JSON schema:
    pipenv run python main.py

To check your code for linting errors with ruff, run:
    pipenv run ruff check .

To automatically fix linting errors with ruff, run:
    pipenv run ruff check --fix .
- Configuration: The config/config.py file loads configuration settings and the OpenAI API key from environment variables.
- PDF Generation: The generate_sample_pdf.py script generates a sample PDF with email addresses, dates, and phone numbers.
- PDF Text Extraction: The src/pdf_utils.py file contains the extract_text_from_pdf function, which extracts text from the PDF.
- Data Extraction Using OpenAI: The src/openai_utils.py file contains the extract_data_with_openai function, which uses OpenAI's API to extract data from the extracted text based on predefined prompts.
- Mapping Data to JSON Schema: The src/mapper.py file contains the load_json_schema and map_to_json_schema functions, which load the JSON schema and map the extracted data to the schema.
- Main Script: The main.py script orchestrates the entire process: it loads the JSON schema, extracts text from the PDF, uses OpenAI's API to extract data, maps the data to the JSON schema, and prints the mapped data as JSON.
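
The order of operations described for main.py can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `run_pipeline` and the `fake_*` functions are hypothetical, and the three steps are passed in as plain callables so the sketch runs without a PDF library or an API key. In the real project those roles belong to extract_text_from_pdf, extract_data_with_openai, and map_to_json_schema, whose exact signatures this README does not show.

```python
import json
from typing import Callable, Dict, List

Extracted = Dict[str, List[str]]


def run_pipeline(
    pdf_path: str,
    schema: dict,
    extract_text: Callable[[str], str],
    extract_data: Callable[[str], Extracted],
    map_data: Callable[[Extracted, dict], dict],
) -> str:
    """Run the three steps in order and return the mapped data as a JSON string."""
    text = extract_text(pdf_path)    # step 1: PDF -> raw text
    data = extract_data(text)        # step 2: raw text -> extracted fields
    mapped = map_data(data, schema)  # step 3: extracted fields -> schema shape
    return json.dumps(mapped, indent=2)


# Stand-ins so the sketch is runnable; they do not read a PDF or call any API.
def fake_extract_text(pdf_path: str) -> str:
    return "Contact: example1@example.com"


def fake_extract_data(text: str) -> Extracted:
    return {"email": ["example1@example.com"]}


def fake_map(data: Extracted, schema: dict) -> dict:
    # Keep only the keys the schema names; missing keys become empty lists.
    return {key: data.get(key, []) for key in schema}


if __name__ == "__main__":
    print(
        run_pipeline(
            "sample.pdf",
            {"email": [], "date": []},
            fake_extract_text,
            fake_extract_data,
            fake_map,
        )
    )
```

Passing the steps in as arguments is only a convenience for this sketch; it also makes it possible to exercise the flow in tests without network access.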
After running main.py, the output should be a JSON object containing the extracted email addresses, dates, and phone numbers from the sample PDF:
{
    "email": [
        "example1@example.com",
        "example2@example.com"
    ],
    "date": [
        "01/01/2023",
        "02/02/2023"
    ],
    "phone": [
        "(123) 456-7890",
        "(987) 654-3210"
    ]
}
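
One way the mapping step could produce an object of this shape is sketched below. The contents of data/schema.json are not shown in this README, so the schema here (three keys, each holding a list) is an assumption, and `parse_schema` and `map_fields` are hypothetical names rather than the functions in src/mapper.py.

```python
import json
from typing import Dict, List

# Assumed schema shape; the real data/schema.json may differ.
SCHEMA_TEXT = '{"email": [], "date": [], "phone": []}'


def parse_schema(schema_text: str) -> dict:
    """Parse a schema from JSON text (hypothetical helper)."""
    return json.loads(schema_text)


def map_fields(extracted: Dict[str, List[str]], schema: dict) -> Dict[str, List[str]]:
    """Keep only the keys named in the schema, in schema order.

    Keys missing from the extracted data become empty lists, keys the schema
    does not name are dropped, and duplicate values are removed while
    preserving their original order.
    """
    mapped: Dict[str, List[str]] = {}
    for key in schema:
        values = extracted.get(key, [])
        mapped[key] = list(dict.fromkeys(values))
    return mapped


if __name__ == "__main__":
    extracted = {
        "email": ["example1@example.com", "example1@example.com"],
        "phone": ["(123) 456-7890"],
        "notes": ["not part of the schema"],
    }
    print(json.dumps(map_fields(extracted, parse_schema(SCHEMA_TEXT)), indent=4))
```

Driving the output from the schema's keys keeps the result stable even when the extraction step returns extra or missing fields.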