ajmal624/Data-extraction

Data Extraction App

A Streamlit-based app for extracting structured data (fields and tables) from PDFs, scanned PDFs, images, DOCX files, and TXT files.
Scanned and image-based documents are read with OCR (Tesseract).


Features

  • 📂 Upload PDF, image, DOCX, or TXT
  • 👀 Preview uploaded file (image or text)
  • 📝 Extract structured fields (key: value) or table data
  • 🔍 Automatic OCR for scanned documents
  • 📊 Preview extracted data as single-row CSV
  • 💾 Download the extracted CSV
  • ✏️ Inline editing before download

Requirements

System

  • Python 3.8+
  • Tesseract OCR installed and added to PATH
    • Ubuntu: sudo apt-get install tesseract-ocr
    • macOS: brew install tesseract
    • Windows: Installer here
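
Before launching the app, it can help to confirm that Tesseract really is on PATH. The following is a small sketch using the standard library's `shutil.which`; the helper name `tesseract_available` is hypothetical, and it assumes the executable is named `tesseract`, as in the install commands above.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


if tesseract_available():
    print("Tesseract found at", shutil.which("tesseract"))
else:
    print("Tesseract not found on PATH; OCR for scanned documents will not work.")
```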

Python

Install dependencies:

pip install -r requirements.txt
