A full-stack application for extracting structured data from PDF invoices using OpenAI's language models.
DataExtractor-workspace/
├── backend/                  # FastAPI Python backend
│   ├── backend/
│   │   ├── main.py           # FastAPI application
│   │   ├── extractor.py      # PDF and LLM extraction logic
│   │   └── model.py          # Pydantic data models
│   ├── tests/                # Unit and integration tests
│   ├── requirements.txt      # Python dependencies
│   └── README.md
│
└── frontend/                 # React TypeScript frontend
    ├── src/
    │   ├── components/       # React components
    │   ├── styles/           # CSS stylesheets
    │   └── App.tsx           # Main application
    ├── package.json
    ├── index.html
    └── README.md
cd backend
python -m venv myenv
source myenv/bin/activate # On Windows: myenv\Scripts\activate
pip install -r requirements.txt
python -m uvicorn backend.main:app --reload

The API will be available at http://localhost:8000
cd frontend
npm install
npm run dev

The frontend will be available at http://localhost:5173
Extract invoice data from a PDF file.
Request:
- Content-Type: multipart/form-data
- Field: file (PDF file)
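
As a rough illustration of the request described above, the following standard-library sketch encodes a PDF as multipart/form-data under the field name file and sends it. The helper names `build_multipart` and `upload_invoice` are hypothetical, the use of POST is an assumption, and the route path is not stated here, so the full endpoint URL is left as a parameter.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_invoice(endpoint_url: str, pdf_path: str) -> dict:
    """Send the PDF to the extraction endpoint and return the decoded JSON reply.

    Assumption: the endpoint accepts POST; the URL must be supplied by the caller.
    """
    with open(pdf_path, "rb") as fh:
        body, header = build_multipart("file", os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        endpoint_url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the backend running on port 8000, `upload_invoice` would be called with the full URL of the extraction route and the path to a local PDF. A third-party HTTP client would make this shorter; the standard library is used here only to keep the sketch dependency-free.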
Response:
{
"vendor_name": "string",
"invoice_number": "string",
"invoice_date": "string",
"invoice_total": "string",
"vendor_address": "string",
"line_items": [
{
"description": "string",
"quantity": number,
"unit_price": number,
"line_total": number
}
]
}

- Framework: FastAPI
- Language: Python 3.13
- LLM: OpenAI GPT-4o-mini
- PDF Processing: pdfplumber
- Data Validation: Pydantic
- Framework: React 18
- Language: TypeScript
- Build Tool: Vite
- Styling: CSS3
Create a .env file in the backend directory:
OPENAI_API_KEY=your_api_key_here
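
How the backend actually loads this file is not shown here. As a minimal sketch, assuming simple KEY=VALUE lines, a startup check could look like the following; `read_env_file` and `require_api_key` are hypothetical helpers, not part of the project.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_api_key(path: str = ".env") -> str:
    """Return OPENAI_API_KEY from the environment or the .env file, or raise."""
    key = os.environ.get("OPENAI_API_KEY") or read_env_file(path).get("OPENAI_API_KEY")
    if not key or key == "your_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is not set; add it to the .env file")
    return key
```

Failing fast on a missing or placeholder key gives a clearer error than a rejected request to the LLM later on.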
cd backend
pytest -v

Run tests with coverage:
pytest --cov=backend tests/

- Backend runs on port 8000
- Frontend runs on port 5173 (with proxy to backend at /api)
- CORS is enabled for local development
- All extracted data is validated against the InvoiceData schema
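
To illustrate what validating against the InvoiceData schema involves, here is a dependency-free sketch that mirrors the response fields listed earlier. The project itself uses Pydantic models in model.py; this version swaps in standard-library dataclasses, treats every field as required, and uses a hypothetical `parse_invoice` helper, so it is an approximation rather than the project's actual model.

```python
from dataclasses import dataclass

STRING_FIELDS = ("vendor_name", "invoice_number", "invoice_date",
                 "invoice_total", "vendor_address")
NUMBER_FIELDS = ("quantity", "unit_price", "line_total")


@dataclass
class LineItem:
    description: str
    quantity: float
    unit_price: float
    line_total: float


@dataclass
class InvoiceData:
    vendor_name: str
    invoice_number: str
    invoice_date: str
    invoice_total: str
    vendor_address: str
    line_items: list[LineItem]


def parse_invoice(payload: dict) -> InvoiceData:
    """Build an InvoiceData from a response dict; raise TypeError on bad fields."""
    header = {}
    for name in STRING_FIELDS:
        if not isinstance(payload.get(name), str):
            raise TypeError(f"{name} must be a string")
        header[name] = payload[name]
    items = []
    for raw in payload.get("line_items", []):
        if not isinstance(raw.get("description"), str):
            raise TypeError("line item description must be a string")
        for name in NUMBER_FIELDS:
            value = raw.get(name)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"line item {name} must be a number")
        items.append(LineItem(**raw))
    return InvoiceData(line_items=items, **header)
```

A Pydantic model would express the same constraints declaratively and could also coerce compatible values; the explicit checks here only make the rules visible.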
MIT