A web application that extracts text from scanned PDF documents using OCR (Optical Character Recognition) technology. The application consists of a Python Flask backend and a frontend interface.
- PDF file upload functionality
- OCR text extraction from scanned PDFs
- Real-time text extraction processing
- Cross-Origin Resource Sharing (CORS) enabled
- Clean and simple API endpoint
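As a rough illustration of the OCR step, the sketch below renders each PDF page to an image with `pdf2image` and runs `pytesseract` on it. The function names `extract_text_from_pdf` and `join_pages` are hypothetical and are not taken from backend/app.py, which may be organized differently.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page OCR output into one string, separated by blank lines."""
    return "\n\n".join(text.strip() for text in page_texts)


def extract_text_from_pdf(pdf_bytes: bytes,
                          poppler_path: Optional[str] = None,
                          lang: str = "eng") -> str:
    """Hypothetical helper: OCR every page of a scanned PDF.

    The third-party imports are deferred so this module can be imported
    even when the OCR stack is not installed yet.
    """
    from pdf2image import convert_from_bytes  # needs Poppler
    import pytesseract                        # needs Tesseract

    # Render each PDF page to a PIL image.
    pages = convert_from_bytes(pdf_bytes, poppler_path=poppler_path)
    # Run OCR page by page and merge the results.
    return join_pages(pytesseract.image_to_string(page, lang=lang) for page in pages)
```

Keeping the page-joining logic separate from the OCR calls makes it possible to check the text assembly without Tesseract or Poppler present.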
Before running this application, make sure you have Python, Tesseract OCR, and Poppler installed.
- Clone the repository:
git clone https://github.com/cozyCodr/python-ocr-extractor.git
cd python-ocr-extractor
- Install backend dependencies:
cd backend
pip install -r requirements.txt
- Configure Tesseract and Poppler paths:
- Open backend/app.py
- Update the following paths according to your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
POPPLER_PATH = r"C:\poppler-24.08.0\Library\bin"
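Before starting the server, it can help to confirm that the configured paths actually point at the tools. The following is a minimal sketch using only the standard library; `resolve_tool` and `missing_poppler_tools` are hypothetical helpers, not part of this repository, and the check assumes Poppler's `pdftoppm` and `pdfinfo` are the binaries needed for PDF rendering.

```python
import os
import shutil
from typing import List, Optional, Sequence


def resolve_tool(configured_path: Optional[str], fallback_name: str) -> Optional[str]:
    """Return the configured executable if it exists, else search PATH by name."""
    if configured_path and os.path.isfile(configured_path):
        return configured_path
    return shutil.which(fallback_name)


def missing_poppler_tools(poppler_dir: Optional[str],
                          tools: Sequence[str] = ("pdftoppm", "pdfinfo")) -> List[str]:
    """List the Poppler binaries that cannot be found in poppler_dir.

    When poppler_dir is None, the system PATH is searched instead.
    """
    return [tool for tool in tools if shutil.which(tool, path=poppler_dir) is None]


if __name__ == "__main__":
    # Example values copied from the configuration above; adjust for your system.
    tesseract = resolve_tool(r"C:\Program Files\Tesseract-OCR\tesseract.exe", "tesseract")
    print("Tesseract:", tesseract or "not found")
    print("Missing Poppler tools:", missing_poppler_tools(r"C:\poppler-24.08.0\Library\bin"))
```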
python-ocr-extractor/
├── backend/
│ ├── app.py # Flask application
│ └── requirements.txt # Python dependencies
├── frontend-app/ # Frontend application
└── .gitignore
POST /extract_text
Extracts text from an uploaded PDF file.
Request:
Method: POST
Content-Type: multipart/form-data
Body: pdf_file (PDF file)
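The request described above could be sent from Python roughly as follows. This is a sketch using only the standard library. The helper names `build_multipart` and `extract_text` are hypothetical, and because the response format is not documented here, the raw response body is returned as text.

```python
import os
import urllib.request
import uuid
from typing import Tuple


def build_multipart(field_name: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_text(pdf_path: str,
                 url: str = "http://localhost:5000/extract_text") -> str:
    """POST a PDF under the pdf_file field and return the raw response body."""
    with open(pdf_path, "rb") as f:
        body, header = build_multipart("pdf_file", os.path.basename(pdf_path), f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": header}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the backend running, `print(extract_text("scan.pdf"))` would upload a local file named `scan.pdf` (a placeholder name) and print whatever the server returns.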
- Start the backend server:
cd backend
python app.py
- The server will start running on http://localhost:5000