aramjung/DataExtractor

DataExtractor Workspace

A full-stack application for extracting structured data from PDF invoices using OpenAI's language models.

Project Structure

DataExtractor-workspace/
├── backend/              # FastAPI Python backend
│   ├── backend/
│   │   ├── main.py      # FastAPI application
│   │   ├── extractor.py # PDF and LLM extraction logic
│   │   └── model.py     # Pydantic data models
│   ├── tests/           # Unit and integration tests
│   ├── requirements.txt  # Python dependencies
│   └── README.md
│
└── frontend/            # React TypeScript frontend
    ├── src/
    │   ├── components/  # React components
    │   ├── styles/      # CSS stylesheets
    │   └── App.tsx      # Main application
    ├── package.json
    ├── index.html
    └── README.md

Getting Started

Backend Setup

cd backend
python -m venv myenv
source myenv/bin/activate  # On Windows: myenv\Scripts\activate
pip install -r requirements.txt
python -m uvicorn backend.main:app --reload

The API will be available at http://localhost:8000

Frontend Setup

cd frontend
npm install
npm run dev

The frontend will be available at http://localhost:5173

API Endpoints

POST /extract

Extract invoice data from a PDF file.

Request:

  • Content-Type: multipart/form-data
  • Field: file (PDF file)
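
As an illustration of the request format above, here is a minimal client sketch that uses only the Python standard library. The helper names `build_multipart` and `extract_invoice` are hypothetical and not part of this repository; the sketch assumes the backend is running locally on port 8000 and accepts the upload under the form field `file`.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_invoice(pdf_path: str, base_url: str = "http://localhost:8000") -> dict:
    """POST a PDF to /extract and return the decoded JSON response."""
    path = Path(pdf_path)
    body, header = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the backend running, `extract_invoice("invoice.pdf")` should return a dictionary shaped like the response documented in this section. A third-party HTTP client would shorten this considerably; the standard library is used here only to keep the sketch dependency-free.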

Response (JSON; each placeholder shows the field's type):

{
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "string",
  "invoice_total": "string",
  "vendor_address": "string",
  "line_items": [
    {
      "description": "string",
      "quantity": number,
      "unit_price": number,
      "line_total": number
    }
  ]
}
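
A caller may want to turn this response into typed objects. The sketch below mirrors the documented fields with standard-library dataclasses. It is a client-side illustration only: the backend's own Pydantic models live in model.py, whose contents are not shown here and may differ (for example, by allowing optional fields). The `LineItem` class and `parse_invoice` helper are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class LineItem:
    description: str
    quantity: float
    unit_price: float
    line_total: float


@dataclass
class InvoiceData:
    vendor_name: str
    invoice_number: str
    invoice_date: str
    invoice_total: str  # documented as a string, so it is not converted here
    vendor_address: str
    line_items: list[LineItem] = field(default_factory=list)


def parse_invoice(payload: dict) -> InvoiceData:
    """Build an InvoiceData from a decoded /extract response.

    Raises KeyError if a documented field is missing.
    """
    items = [
        LineItem(
            description=str(item["description"]),
            quantity=float(item["quantity"]),
            unit_price=float(item["unit_price"]),
            line_total=float(item["line_total"]),
        )
        for item in payload.get("line_items", [])
    ]
    return InvoiceData(
        vendor_name=payload["vendor_name"],
        invoice_number=payload["invoice_number"],
        invoice_date=payload["invoice_date"],
        invoice_total=payload["invoice_total"],
        vendor_address=payload["vendor_address"],
        line_items=items,
    )
```

Treating a missing field as an error is a choice made for this sketch; a real invoice may lack some fields, in which case defaults or optional types would be more forgiving.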

Technology Stack

Backend

  • Framework: FastAPI
  • Language: Python 3.13
  • LLM: OpenAI GPT-4o-mini
  • PDF Processing: pdfplumber
  • Data Validation: Pydantic

Frontend

  • Framework: React 18
  • Language: TypeScript
  • Build Tool: Vite
  • Styling: CSS3

Environment Variables

Backend

Create a .env file in the backend directory:

OPENAI_API_KEY=your_api_key_here
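
How the backend reads this file is not documented here; it may well rely on a dedicated library. As a rough sketch of the idea, the following standard-library code parses simple KEY=value lines and fails early when the key is missing or still set to the placeholder. The `load_env_file` and `require_api_key` helpers are hypothetical and not taken from the repository.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments.

    Parsed values are copied into os.environ unless the variable is already set.
    """
    loaded: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in loaded.items():
        os.environ.setdefault(key, value)  # existing variables are not overridden
    return loaded


def require_api_key() -> str:
    """Return OPENAI_API_KEY or raise if it is unset or still the placeholder."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is not set; add it to backend/.env")
    return key
```

Calling `load_env_file()` from the backend directory and then `require_api_key()` at startup would surface a missing key before the first extraction request rather than during it.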

Testing

Backend Tests

cd backend
pytest -v

Run tests with coverage (the --cov option is provided by the pytest-cov plugin):

pytest --cov=backend tests/

Development Notes

  • Backend runs on port 8000
  • Frontend runs on port 5173 (with proxy to backend at /api)
  • CORS is enabled for local development
  • All extracted data is validated against the InvoiceData schema

License

MIT
