A Streamlit app for extracting structured data (fields and tables) from PDFs (including scanned PDFs), images, DOCX, and TXT files.
Uses Tesseract OCR for scanned and image-based documents.
- Upload PDF, image, DOCX, or TXT
- Preview uploaded file (image or text)
- Extract structured fields (`key: value`) or table data
- Automatic OCR for scanned documents
- Preview extracted data as single-row CSV
- Download the extracted CSV
- Inline editing before download
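The field extraction and single-row CSV preview listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the app's actual code: the names `FIELD_RE`, `extract_fields`, and `fields_to_csv`, the one-pair-per-line assumption, and the sample text are all hypothetical.

```python
import csv
import io
import re

# Assumption: each field sits on its own line as "key: value".
# The key is everything before the first colon; the value may contain colons.
FIELD_RE = re.compile(r"^\s*([^:\n]+?)\s*:\s*(.+?)\s*$")


def extract_fields(text):
    """Collect key/value pairs from text, keeping the first value seen per key."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            key, value = match.groups()
            fields.setdefault(key, value)
    return fields


def fields_to_csv(fields):
    """Render the fields as a CSV with one header row and one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    writer.writerow(fields)
    return buffer.getvalue()


# Hypothetical sample text, standing in for text pulled from a document.
sample = (
    "Invoice No: 1042\n"
    "Date: 2024-01-15\n"
    "This line has no separator\n"
    "Total: 99.50\n"
)
fields = extract_fields(sample)
csv_text = fields_to_csv(fields)
print(csv_text)
```

Lines without a colon are skipped, so free-form text around the fields does not end up in the CSV. Real documents may need looser matching, for example for keys and values split across lines.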
- Python 3.8+
- Tesseract OCR installed and added to PATH
  - Ubuntu: `sudo apt-get install tesseract-ocr`
  - macOS: `brew install tesseract`
  - Windows: Installer here
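One way to confirm that Tesseract is reachable on PATH before starting the app is a quick check from Python. This is only a sketch: the helper name `find_tesseract` is hypothetical, and it assumes the binary is called `tesseract`, as it is for the package-manager installs above.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_tesseract()
if path is None:
    print("Tesseract not found on PATH; OCR for scanned documents will not work.")
else:
    print(f"Tesseract found at {path}")
```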
Install dependencies:
`pip install -r requirements.txt`