DataExtractor Workspace

A full-stack application for extracting structured data from PDF invoices using OpenAI's language models.

Project Structure

DataExtractor-workspace/
├── backend/              # FastAPI Python backend
│   ├── backend/
│   │   ├── main.py      # FastAPI application
│   │   ├── extractor.py # PDF and LLM extraction logic
│   │   └── model.py     # Pydantic data models
│   ├── tests/           # Unit and integration tests
│   ├── requirements.txt  # Python dependencies
│   └── README.md
│
└── frontend/            # React TypeScript frontend
    ├── src/
    │   ├── components/  # React components
    │   ├── styles/      # CSS stylesheets
    │   └── App.tsx      # Main application
    ├── package.json
    ├── index.html
    └── README.md

Getting Started

Backend Setup

cd backend
python -m venv myenv
source myenv/bin/activate  # On Windows: myenv\Scripts\activate
pip install -r requirements.txt
python -m uvicorn backend.main:app --reload

The API will be available at http://localhost:8000

Frontend Setup

cd frontend
npm install
npm run dev

The frontend will be available at http://localhost:5173

API Endpoints

POST /extract

Extract invoice data from a PDF file.

Request:

Content-Type: multipart/form-data
Field: file (PDF file)

Response:

{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "string",
  "invoice_total": "string",
  "vendor_address": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": number,
      "unit_price": number,
      "line_total": number
    }
  ]
}
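
As a rough illustration of the request format above, the sketch below posts a PDF to /extract using only the Python standard library. The helper names build_multipart and extract_invoice are hypothetical and not part of this repository; the URL assumes the backend is running locally on port 8000 as described in Getting Started.

```python
import json
import os
import urllib.request
import uuid

# Assumption: backend started locally as described in Getting Started.
API_URL = "http://localhost:8000/extract"


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode a single file as a multipart/form-data body.

    Returns (body, value for the Content-Type request header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_invoice(pdf_path: str, url: str = API_URL) -> dict:
    """POST a PDF in the 'file' field and return the decoded JSON response."""
    with open(pdf_path, "rb") as fh:
        body, header = build_multipart("file", os.path.basename(pdf_path), fh.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the backend running, extract_invoice("invoice.pdf") (the file name is a placeholder) should return a dictionary with the keys listed in the response above.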

Technology Stack

Backend

Framework: FastAPI
Language: Python 3.13
LLM: OpenAI GPT-4o-mini
PDF Processing: pdfplumber
Data Validation: Pydantic

Frontend

Framework: React 18
Language: TypeScript
Build Tool: Vite
Styling: CSS3

Environment Variables

Backend

Create a .env file in the backend directory:

OPENAI_API_KEY=your_api_key_here
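
How the backend actually reads this file is not shown in this README. As a minimal sketch, assuming simple KEY=VALUE lines, the key could be resolved like this with the standard library alone; load_env_file and require_api_key are hypothetical helpers, not functions from the project.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def require_api_key(path: str = ".env") -> str:
    """Return OPENAI_API_KEY from the process environment, else from the file."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to backend/.env")
    return key
```

Failing fast at startup when the key is missing tends to give a clearer error than a rejected request to the OpenAI API later on.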

Testing

Backend Tests

cd backend
pytest -v

Run tests with coverage (requires the pytest-cov plugin):

pytest --cov=backend tests/

Development Notes

Backend runs on port 8000
Frontend runs on port 5173 (requests under /api are proxied to the backend)
CORS is enabled for local development
All extracted data is validated against the InvoiceData schema
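
The InvoiceData model itself lives in model.py and is not reproduced in this README. To illustrate what checking a payload against the documented response shape involves, here is a standard-library sketch; it is not the project's Pydantic model, and find_shape_problems is a hypothetical helper. It assumes every documented field is required and non-null, which the real schema may relax.

```python
from numbers import Real

# Field names and types copied from the POST /extract response in this README.
TOP_LEVEL_FIELDS = {
    "vendor_name": str,
    "invoice_number": str,
    "invoice_date": str,
    "invoice_total": str,
    "vendor_address": str,
}
LINE_ITEM_FIELDS = {
    "description": str,
    "quantity": Real,
    "unit_price": Real,
    "line_total": Real,
}


def _check(obj: dict, spec: dict, prefix: str) -> list[str]:
    """Report missing or wrongly typed fields of obj, in spec order."""
    problems = []
    for name, expected in spec.items():
        if name not in obj:
            problems.append(f"{prefix}{name}: missing")
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        elif isinstance(obj[name], bool) or not isinstance(obj[name], expected):
            problems.append(f"{prefix}{name}: expected {expected.__name__}")
    return problems


def find_shape_problems(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape matches."""
    problems = _check(payload, TOP_LEVEL_FIELDS, "")
    items = payload.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items: expected list")
        return problems
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"line_items[{index}]: expected object")
            continue
        problems.extend(_check(item, LINE_ITEM_FIELDS, f"line_items[{index}]."))
    return problems
```

A check like this can be handy in client code or tests; on the server side, Pydantic performs the equivalent validation declaratively.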

License

MIT
