Although we stand on the brink of an AI-driven digital revolution, Germany still seems lost in the analogue era. For many companies, digitization is a costly endeavor because it demands significant human effort. This project simplifies the process by automatically integrating scanned receipts directly into your SAP system: it implements a fully automated pipeline that takes a batch of PDF files and assigns them to positions in the SAP system.
Our first approaches to this problem mostly involved LLMs. We tried to
- Fine-tune the final layer of a vision embedding model
- Use different small LLMs (below 12 GB) to extract information from the PDFs or to make the assignment decisions

Towards the end of Saturday, however, we had to accept that one approach stole the show from all of these LLMs: regex. Not only is it clearly faster than every LLM-fueled approach, it also achieves much higher accuracy than the small LLMs we considered. As engineers, it is sometimes about being pragmatic, so we hand in this approach. It is also the most convenient one to use, since it runs on any laptop and does not require a GPU.
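To make the regex idea concrete, here is a minimal sketch of the kind of pattern-based field extraction we mean. The patterns and field names below are illustrative examples for German receipts, not the exact rules used in the pipeline.

```python
import re

# Illustrative patterns for fields commonly found on German receipts
# (hypothetical examples, not the tuned rules from the pipeline).
PATTERNS = {
    "date": re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b"),  # e.g. 31.12.2023
    "total": re.compile(r"(?:Summe|Gesamt|Betrag)\D{0,10}(\d+[.,]\d{2})"),
    "invoice_no": re.compile(
        r"(?:Rechnungs?[-\s]?(?:Nr|nummer)\.?\s*:?\s*)(\S+)", re.IGNORECASE
    ),
}

def extract_fields(text: str) -> dict:
    """Return the first match for each field, or None if absent."""
    return {
        name: (m.group(1) if (m := rx.search(text)) else None)
        for name, rx in PATTERNS.items()
    }

sample = "Rechnung Nr: 2023-0042\nDatum: 05.06.2023\nGesamtbetrag 129,90 EUR"
print(extract_fields(sample))
```

Because every rule is a readable pattern, failures are easy to diagnose, which is a large part of why this beat the small LLMs in practice.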
We also considered ColPali, a vision-language model specialized in documents. Although its zero-shot ability to predict whether two consecutive pages of a PDF batch belong together was not that good, its retrieval (RAG) abilities are quite strong. Hence, we also implemented it in combination with string matching.
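The string-matching half can be sketched with the standard library alone. The SAP entries and the `supplier` key below are hypothetical; the real `SAP_data.json` schema may differ.

```python
from difflib import SequenceMatcher

def best_sap_match(query: str, sap_positions: list, key: str = "supplier") -> dict:
    """Return the SAP position whose field is most similar to the query string."""
    return max(
        sap_positions,
        key=lambda pos: SequenceMatcher(
            None, query.lower(), str(pos.get(key, "")).lower()
        ).ratio(),
    )

# Hypothetical SAP entries for illustration only.
positions = [
    {"MBLNR": "5000000001", "MJAHR": "2023", "supplier": "Mustermann GmbH"},
    {"MBLNR": "5000000002", "MJAHR": "2023", "supplier": "Beispiel AG"},
]
# Fuzzy matching tolerates OCR noise such as a dropped letter.
print(best_sap_match("Musterman GmbH", positions)["MBLNR"])
```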
Evaluation
With only seven batch PDFs available initially, we needed to guard against overfitting. To address this, we split the original batches into their ~70 individual PDFs and then randomly recombined them into new batches for more varied testing. We also curated a corpus of 400 German receipt PDFs (located in the `data` folder) to ensure robust evaluation and fine-tuning.
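The recombination step can be sketched as follows; function and variable names are illustrative, not taken from the repository.

```python
import random

def recombine(pdfs_by_batch: dict, n_batches: int, seed: int = 0) -> list:
    """Pool all individual PDFs, shuffle them, and deal them into
    n_batches new synthetic batches of roughly equal size."""
    rng = random.Random(seed)  # fixed seed keeps test batches reproducible
    pdfs = [p for batch in pdfs_by_batch.values() for p in batch]
    rng.shuffle(pdfs)
    # Round-robin deal so batch sizes stay balanced.
    return [pdfs[i::n_batches] for i in range(n_batches)]

original = {"batch_1": ["a.pdf", "b.pdf", "c.pdf"], "batch_2": ["d.pdf", "e.pdf"]}
print(recombine(original, n_batches=2, seed=1))
```

Every shuffled split preserves the full set of receipts, so metrics stay comparable across runs while the page-boundary patterns a model could memorize are broken up.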
Tuning the Regex Feature Weights

We ran our evaluation loop (see `evaluation.py` and `example_predict`) to measure performance across various feature combinations and identify the top candidates. From this shortlist of predictors, we adjusted each feature's weight based on precise performance criteria and human insight. This step not only improved accuracy but also highlights the transparency and simplicity of our regex-based approach.
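The weighted-feature decision described above can be sketched like this. The feature names, weights, and threshold are hypothetical placeholders, not the tuned values.

```python
# Hypothetical feature weights for deciding whether a page starts a new
# invoice; the real features and tuned values live in the evaluation loop.
WEIGHTS = {
    "has_invoice_number": 2.0,
    "has_date_header": 1.0,
    "page_number_is_one": 1.5,
}
THRESHOLD = 2.5

def is_new_invoice(features: dict) -> bool:
    """Sum the weights of all regex features that fired on this page
    and compare against the decision threshold."""
    score = sum(w for name, w in WEIGHTS.items() if features.get(name))
    return score >= THRESHOLD

print(is_new_invoice({"has_invoice_number": True, "has_date_header": True}))
```

Because the decision is a transparent weighted sum, a human can inspect exactly why a page was (or was not) classified as an invoice start, which is what made manual weight adjustment feasible.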
1. Clone the repository

```shell
git clone https://github.com/anpoc/ScienceHack.git
cd ScienceHack
```

2. Setup the Environment
First move into the `solution` folder and then install the packages specified in `requirements.txt` (the one within the `solution` folder).
```shell
cd solution
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

3. Prepare your Data
Place your batch PDF(s) and the corresponding SAP JSON file into solution/data/. For example:
```
data/tests/
├── batch_6_2023_1.pdf
└── SAP_data.json
```

4. Run the pipeline
As mentioned in the presentation, we have three different methods to generate the requested JSON file. Below we show how to run each of them:
From within the ScienceHack directory, execute:
```shell
python ./src/n_method/main.py ./data/tests/batch_6_2023_1.pdf ./data/BECONEX_challenge_materials_samples/SAP_data.json
```

The script runs in seconds and writes one JSON file per detected invoice to `solution/output/`. Each file includes:
- `page`: the starting page number
- `confidence`: confidence score for the detection
- `MBLNR` and `MJAHR`: SAP identifiers

For an example output, see `solution/output/batch_6_2023_1.json`.
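A record with the fields listed above might look like this; the values shown are invented for illustration and the real confidence scale and identifiers will differ.

```python
import json

# Hypothetical record mirroring the fields the pipeline writes;
# values are placeholders, not real pipeline output.
record = {"page": 1, "confidence": 0.93, "MBLNR": "5000000001", "MJAHR": "2023"}
print(json.dumps(record, indent=2))
```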
