Data Extraction App

A Streamlit-based app for extracting structured data (fields and tables) from PDFs, scanned PDFs, images, DOCX files, and TXT files.
Scanned and image-based documents are read with OCR (Tesseract).
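
One way to picture the OCR support is as a fallback: read the document's embedded text first, and switch to OCR only when that text is missing or too sparse. The sketch below is an illustration under stated assumptions, not the app's actual logic. The names `needs_ocr` and `extract_text` and the 20-character threshold are hypothetical, and the two extractors are passed in as plain callables so the example runs without Tesseract installed.

```python
from typing import Callable


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Return True when the embedded text looks empty or too short to be real."""
    # Count only non-whitespace characters so blank pages do not pass the check.
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


def extract_text(
    native_extract: Callable[[], str],
    ocr_extract: Callable[[], str],
    min_chars: int = 20,  # assumed threshold, tune for your documents
) -> str:
    """Try the embedded text first and fall back to OCR if it is too sparse."""
    text = native_extract()
    if needs_ocr(text, min_chars):
        # In a real app this callable would wrap a Tesseract call.
        return ocr_extract()
    return text


if __name__ == "__main__":
    # Stand-in extractors: a scanned page has no text layer, so OCR is used.
    print(extract_text(lambda: "  \n", lambda: "Invoice No: 1042"))
```

Passing the extractors as callables keeps the decision logic separate from any particular PDF or OCR library.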

Features

📂 Upload PDF, image, DOCX, or TXT
👀 Preview uploaded file (image or text)
📝 Extract structured fields (key: value) or table data
🔍 Automatic OCR for scanned documents
📊 Preview extracted data as single-row CSV
💾 Download the extracted CSV
✏️ Inline editing before download
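
To illustrate the "key: value" extraction and single-row CSV features listed above, here is a minimal sketch using only the Python standard library. It is an assumption about how such a step could work, not the code in this repository. The pattern `FIELD_RE` and the functions `parse_fields` and `to_single_row_csv` are hypothetical names introduced for this example.

```python
import csv
import io
import re

# Hypothetical pattern: a label, a colon, then a non-empty value on the same line.
FIELD_RE = re.compile(r"^\s*([^:\s][^:\n]*?)\s*:\s*(\S.*?)\s*$")


def parse_fields(text: str) -> dict:
    """Collect 'key: value' pairs from plain text, one pair per line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            key, value = match.groups()
            # Keep the first occurrence of a repeated key.
            fields.setdefault(key, value)
    return fields


def to_single_row_csv(fields: dict) -> str:
    """Render the fields as a header line plus one data row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields), lineterminator="\n")
    writer.writeheader()
    writer.writerow(fields)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "Invoice No: 1042\nDate: 2024-01-15\nTotal: 99.50"
    print(to_single_row_csv(parse_fields(sample)))
```

Lines without a colon are skipped, and only the first colon on a line splits the key from the value, so a value such as `10:30` stays intact.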

Requirements

System

Python 3.8+
Tesseract OCR installed and added to PATH
- Ubuntu: sudo apt-get install tesseract-ocr
- macOS: brew install tesseract
- Windows: Installer here
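
To confirm that Tesseract is actually reachable on PATH before launching the app, a quick check with the standard library's `shutil.which` can help. This is a small sketch; the helper name `find_tesseract` is introduced here for illustration and is not part of the app.

```python
import shutil


def find_tesseract(executable: str = "tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it before running the app")
    else:
        print(f"tesseract found at {path}")
```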

Python

Install dependencies:

pip install -r requirements.txt

ajmal624/Data-extraction

Folders and files

Latest commit

History

Repository files navigation

Data Extraction App

Features

Requirements

System

Python

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages