Invoice Text Processing with LayoutLM and ONNX

This project develops a Python application that uses machine learning to extract key information from invoices. It handles varied invoice formats in English, Dutch, and French, and it relies on the context around each field rather than on hardcoded labels to locate the information.

Introduction

This project processes invoices with LayoutLM, a transformer-based model that combines text with layout (bounding-box) information, and uses ONNX to optimize the trained model. The solution involves training the model, optimizing it for deployment, and running it on a client desktop. The steps are data collection, preprocessing, model training, optimization, and deployment.

Requirements

Ensure you have Python 3.8 or later installed. The following libraries are required:

numpy
pandas
scikit-learn
torch
transformers
onnx
onnxruntime
pytorch-lightning
pdf2image
pytesseract
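
A quick way to confirm the environment is ready is to check the Python version and whether each library can be imported. The snippet below is a minimal sketch, not part of the project: the helper names `missing_packages` and `python_ok` are hypothetical. Note that some pip names differ from their import names (`scikit-learn` imports as `sklearn`, `pytorch-lightning` as `pytorch_lightning`). In addition, `pdf2image` needs Poppler and `pytesseract` needs the Tesseract OCR engine installed on the system, which this check does not cover.

```python
import sys
from importlib.util import find_spec

# pip name -> import name (the two differ for some packages).
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "torch": "torch",
    "transformers": "transformers",
    "onnx": "onnx",
    "onnxruntime": "onnxruntime",
    "pytorch-lightning": "pytorch_lightning",
    "pdf2image": "pdf2image",
    "pytesseract": "pytesseract",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


def python_ok(minimum=(3, 8)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing packages:", missing_packages())
```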

Usage

Data Collection: Collect a diverse dataset of invoices in PDF format from the internet. Ensure the dataset includes invoices in English, Dutch, and French. Use OCR to convert the PDFs to text.
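
The OCR step can be sketched with `pdf2image` and `pytesseract` from the requirements list. This is an illustrative sketch under stated assumptions: the function names, the 300 dpi setting, and the confidence threshold are hypothetical, and the Tesseract language packs `eng`, `nld`, and `fra` are assumed to be installed. Word bounding boxes are kept alongside the text because LayoutLM uses them as input.

```python
def ocr_dict_to_words(data, min_conf=0.0):
    """Turn a pytesseract `image_to_data` dict into a list of words with boxes.

    Empty tokens and entries below `min_conf` are dropped (Tesseract reports
    a confidence of -1 for non-word layout blocks).
    """
    words = []
    for text, left, top, width, height, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        token = text.strip()
        if not token or float(conf) < min_conf:
            continue
        # Store the box as (x0, y0, x1, y1) in pixel coordinates.
        words.append({"text": token, "box": (left, top, left + width, top + height)})
    return words


def ocr_pdf(pdf_path, lang="eng+nld+fra", dpi=300):
    """Render each PDF page to an image and OCR it (hypothetical helper)."""
    # Imported lazily so the pure helper above works without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        data = pytesseract.image_to_data(
            image, lang=lang, output_type=pytesseract.Output.DICT
        )
        # image.size is (width, height) in pixels.
        pages.append({"size": image.size, "words": ocr_dict_to_words(data)})
    return pages
```

Calling `ocr_pdf("some_invoice.pdf")` (a placeholder path) would return one entry per page, each with the page size and the recognized words.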
Data Preprocessing: Clean and preprocess the extracted text. Annotate the data to identify key information (e.g., sender, receiver, VAT number, amounts) without relying on hardcoded labels.
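
Two preprocessing operations that typically come up with LayoutLM are cleaning OCR tokens and rescaling bounding boxes to the 0 to 1000 coordinate range the model expects. The sketch below shows one possible way to do both; the helper names `clean_token` and `normalize_box` are hypothetical, and the project's actual cleaning rules may differ.

```python
import re
import unicodedata


def clean_token(text):
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize_box(box, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range used by LayoutLM."""
    x0, y0, x1, y1 = box
    scaled = [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]
    # Clamp in case OCR reports coordinates slightly outside the page.
    return [min(max(value, 0), 1000) for value in scaled]
```

For example, a word at pixels (100, 50, 300, 100) on a 1000 x 500 page maps to `[100, 100, 300, 200]`.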
Model Training: Fine-tune a pre-trained LayoutLM model on your annotated dataset. Ensure the model uses context to extract information from various formats and languages. Evaluate the model's performance and adjust training parameters as necessary.
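
Fine-tuning can be framed as token classification, where every token receives a label. The sketch below is one possible setup, not the project's confirmed method. The BIO label set is a hypothetical example built from the fields named above, the checkpoint `microsoft/layoutlm-base-uncased` is an assumption, and special tokens are given a zero box as a simplification. `align_to_subwords` expects the `word_ids()` list produced by a fast Hugging Face tokenizer and labels only the first subword of each word, using -100 so the loss ignores the rest.

```python
# Hypothetical label set in BIO format; adjust to the annotated fields.
LABELS = [
    "O",
    "B-SENDER", "I-SENDER",
    "B-RECEIVER", "I-RECEIVER",
    "B-VAT_NUMBER", "I-VAT_NUMBER",
    "B-AMOUNT", "I-AMOUNT",
]
LABEL2ID = {label: index for index, label in enumerate(LABELS)}
IGNORE_INDEX = -100  # Default ignore index of PyTorch's cross-entropy loss.


def align_to_subwords(word_ids, word_labels, word_boxes):
    """Expand word-level labels and boxes to subword tokens."""
    labels, boxes = [], []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens: no label, zero box (a simplification).
            labels.append(IGNORE_INDEX)
            boxes.append([0, 0, 0, 0])
        else:
            boxes.append(word_boxes[word_id])
            # Only the first subword of a word carries the label.
            labels.append(word_labels[word_id] if word_id != previous else IGNORE_INDEX)
        previous = word_id
    return labels, boxes


def load_model(checkpoint="microsoft/layoutlm-base-uncased"):
    """Load LayoutLM with a token-classification head (checkpoint is an assumption)."""
    from transformers import LayoutLMForTokenClassification

    return LayoutLMForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(LABELS)
    )


def train_step(model, optimizer, batch):
    """Run one optimization step.

    `batch` is a dict of tensors with the keys input_ids, bbox,
    attention_mask, and labels.
    """
    model.train()
    outputs = model(**batch)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```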
Model Optimization: Export the trained model to ONNX format. Use techniques like quantization to reduce model size and improve performance.
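
The export and quantization step might look like the sketch below, which uses `torch.onnx.export` followed by ONNX Runtime's dynamic quantization, and then runs the result with an `InferenceSession` as it could be done on a client desktop. The file names, the sequence length of 128, the opset version, and the helper names are all assumptions for illustration; dynamic int8 quantization is one option among several.

```python
INPUT_NAMES = ["input_ids", "bbox", "attention_mask"]


def dynamic_axes(input_names=INPUT_NAMES, output_name="logits"):
    """Mark batch and sequence dimensions as variable-length for the export."""
    axes = {name: {0: "batch", 1: "sequence"} for name in input_names}
    axes[output_name] = {0: "batch", 1: "sequence"}
    return axes


def export_and_quantize(model, onnx_path="layoutlm.onnx",
                        quantized_path="layoutlm.int8.onnx", seq_len=128):
    """Export a fine-tuned LayoutLM model to ONNX, then quantize weights to int8."""
    import torch
    from onnxruntime.quantization import QuantType, quantize_dynamic

    model.eval()
    model.config.return_dict = False  # Return plain tuples so tracing sees tensors.
    dummy_inputs = (
        torch.zeros((1, seq_len), dtype=torch.long),     # input_ids
        torch.zeros((1, seq_len, 4), dtype=torch.long),  # bbox
        torch.ones((1, seq_len), dtype=torch.long),      # attention_mask
    )
    torch.onnx.export(
        model, dummy_inputs, onnx_path,
        input_names=INPUT_NAMES, output_names=["logits"],
        dynamic_axes=dynamic_axes(), opset_version=14,
    )
    quantize_dynamic(onnx_path, quantized_path, weight_type=QuantType.QInt8)
    return quantized_path


def predict(onnx_path, input_ids, bbox, attention_mask):
    """Run the ONNX model on CPU and return the predicted label id per token."""
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    feeds = {
        "input_ids": np.asarray(input_ids, dtype=np.int64),
        "bbox": np.asarray(bbox, dtype=np.int64),
        "attention_mask": np.asarray(attention_mask, dtype=np.int64),
    }
    logits = session.run(["logits"], feeds)[0]
    return logits.argmax(axis=-1)
```

Comparing the accuracy of the quantized model against the original on a held-out set is advisable, since quantization can change predictions.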

Documentation

For detailed documentation on the project approach, model architecture, training process, evaluation metrics, optimization techniques, deployment steps, and performance, refer to invoice_data_extraction_documentation.docx.

Evaluation Criteria

Data Handling: Ability to collect, preprocess, and annotate data, including handling multiple languages.
Model Training: Effectiveness in training and fine-tuning the model to handle diverse formats.
Optimization: Success in optimizing the model for deployment.
Deployment: Ability to set up the client environment and run the model.
Documentation: Clarity and completeness of the documentation and code repository.

