
Document Processing Pipeline

This project demonstrates how to build a streaming document processing pipeline. The base pipeline provided here makes it easy to add features by plugging additional machine learning models into the pipeline as workers.

Pipeline Process

We chose invoice processing as the base pipeline, and this release serves as the initial codebase for it. The base pipeline includes the following components:

  • Frontend: The frontend component provides an interface for users to easily set up the pipeline, configure settings, and view processing results, with step-by-step guidance and visualization of the pipeline.
  • API: Provides an interface for customers to upload documents of the supported types (JPEG, PNG, TIFF).
  • Image Processing: Preprocesses the invoice image to enhance quality, remove noise, and improve readability where necessary. Techniques such as cropping, rotation, and resizing are applied to isolate and align the invoice content (a minimal sketch of this step and the OCR step follows this list).
  • Field Invoice Detection: Utilizes techniques like object detection or contour analysis to identify and extract the invoice region within the processed image. This step aims to isolate the invoice from any surrounding elements or backgrounds.
  • OCR (Optical Character Recognition): Applies OCR to recognize and extract text from the invoice image, or from the specific invoice region identified in the previous step, using a machine-learning-based OCR model.
  • Information Extraction and Parsing: Performs calculations or data processing on the extracted information where necessary, for example computing totals and taxes or applying business-specific rules.
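
The following sketch illustrates the preprocessing and OCR steps in isolation. It assumes OpenCV for the image work and pytesseract as a stand-in for the project's own OCR model; the file name, denoising strength, and deskew heuristic are illustrative choices, not taken from this repository.

import cv2
import numpy as np
import pytesseract

def preprocess(path: str) -> np.ndarray:
    """Load an invoice scan, then denoise and deskew it."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    gray = cv2.fastNlMeansDenoising(gray, h=30)  # soften scanner noise
    # Classic min-area-rect deskew: estimate the page angle from the
    # bounding rectangle around all dark (text) pixels.
    binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = gray.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(gray, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

def extract_text(path: str) -> str:
    """OCR the cleaned page; pytesseract stands in for the ML OCR worker."""
    return pytesseract.image_to_string(preprocess(path))

if __name__ == "__main__":
    print(extract_text("sample_invoice.png"))  # hypothetical sample file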

Technologies Used

The following technologies are used in this project:

  1. ReactJS: Website for displaying the dashboard, pipeline, and user configuration.
  2. Golang with the gorilla/mux router for building the API and handling public endpoints.
  3. S3: Used for storage and handling of image and JSON files.
  4. PostgreSQL: A database used for storing user information, document processing details, and other relevant data.
  5. Kafka: A messaging service used for asynchronous communication between different components of the pipeline.
  6. Redis: A caching service used to improve performance and efficiency.
  7. Faust: A Python stream processing framework that helps in building efficient and scalable pipeline components (see the worker sketch after this list).
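
To make the streaming shape concrete, here is a minimal sketch of one pipeline worker, assuming Faust on top of Kafka; the broker address, topic names, and record fields are illustrative, not taken from this repository.

import faust

class InvoiceEvent(faust.Record):
    """One uploaded document; the real payload fields may differ."""
    document_id: str
    s3_key: str  # location of the uploaded image in S3

app = faust.App("doc-extractor", broker="kafka://localhost:9092")
uploaded = app.topic("invoice-uploaded", value_type=InvoiceEvent)
processed = app.topic("invoice-processed")

@app.agent(uploaded)
async def handle_upload(stream):
    # Each pipeline stage is an agent like this one: consume a topic, do its
    # step (preprocessing, detection, OCR, parsing), publish to the next topic.
    async for event in stream:
        result = {"document_id": event.document_id, "status": "processed"}
        await processed.send(value=result)

if __name__ == "__main__":
    app.main()

Started with python worker.py worker -l info, each stage can be scaled independently by running more worker processes against the same topic.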

Getting Started

To get started with the document processing pipeline, follow these steps:

  1. Install Docker and Docker Compose by following the official Docker documentation.

  2. Clone the repository with the sample code:

git clone https://github.com/dnguyenngoc/doc-extractor.git
  3. Download the required machine learning model from model_link and place it in the appropriate directory.

  4. Start the application using Docker Compose:

docker-compose up
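
Once the containers are up, you can exercise the upload API. The snippet below is a minimal sketch in Python assuming a multipart upload endpoint; the port, path, and form field name are illustrative, not taken from this repository.

import requests

# Hypothetical endpoint; check the API service for the actual route and port.
url = "http://localhost:8080/api/v1/documents"

with open("sample_invoice.png", "rb") as f:  # hypothetical sample file
    response = requests.post(url, files={"file": ("sample_invoice.png", f, "image/png")})

response.raise_for_status()
print(response.json())  # the API would return the processing job details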

Contact

For any inquiries or assistance, please contact our team at duynguyenngoc@hotmail.com or visit our website at www.example.com.

We appreciate your interest in our document processing pipeline!
