Although we stand on the brink of an AI-driven digital revolution, Germany still seems lost in the analogue era. For many companies, digitization is a costly endeavor because it demands significant human effort. This project simplifies the process by automatically integrating scanned receipts directly into your SAP system: it implements a fully automated pipeline that takes a batch of PDF files and assigns them to positions in the SAP system.
Our first approaches to this problem mostly involved LLMs. We tried to
- Fine-tune the final layer of a vision embedding model
- Use different small LLMs (below 12 GB) to extract information from the PDFs or to make the assignment decisions

Towards the end of Saturday, however, we had to accept that one approach stole the show from all of these LLMs: regex. Not only is it clearly faster than every LLM-fueled approach, it also achieves much higher accuracy than the small LLMs we considered. As engineers, it is sometimes about being pragmatic, so we hand in this approach. It is also the most convenient one to use, since it runs on any laptop and does not require a GPU.
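To make the regex idea concrete, here is a minimal sketch of the kind of pattern-based field extraction we mean. The patterns and field names below are illustrative examples for German receipts, not the exact rules used in the pipeline.

```python
import re

# Illustrative patterns for fields commonly found on German receipts
# (hypothetical examples, not the tuned rules from the pipeline).
PATTERNS = {
    "date": re.compile(r"\b(\d{2}\.\d{2}\.\d{4})\b"),  # e.g. 31.12.2023
    "total": re.compile(r"(?:Summe|Gesamt|Betrag)\D{0,10}(\d+[.,]\d{2})"),
    "invoice_no": re.compile(
        r"(?:Rechnungs?[-\s]?(?:Nr|nummer)\.?\s*:?\s*)(\S+)", re.IGNORECASE
    ),
}

def extract_fields(text: str) -> dict:
    """Return the first match for each field, or None if absent."""
    return {
        name: (m.group(1) if (m := rx.search(text)) else None)
        for name, rx in PATTERNS.items()
    }

sample = "Rechnung Nr: 2023-0042\nDatum: 05.06.2023\nGesamtbetrag 129,90 EUR"
print(extract_fields(sample))
```

Because every rule is a readable pattern, failures are easy to diagnose, which is a large part of why this beat the small LLMs in practice.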
We also considered ColPali, a vision-language model specialized in documents. Although its zero-shot ability to predict whether two consecutive pages of a PDF batch belong together was not that good, its retrieval (RAG) abilities are quite strong. Hence, we also implemented it in combination with string matching.
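The string-matching half can be sketched with the standard library alone. The SAP entries and the `supplier` key below are hypothetical; the real `SAP_data.json` schema may differ.

```python
from difflib import SequenceMatcher

def best_sap_match(query: str, sap_positions: list, key: str = "supplier") -> dict:
    """Return the SAP position whose field is most similar to the query string."""
    return max(
        sap_positions,
        key=lambda pos: SequenceMatcher(
            None, query.lower(), str(pos.get(key, "")).lower()
        ).ratio(),
    )

# Hypothetical SAP entries for illustration only.
positions = [
    {"MBLNR": "5000000001", "MJAHR": "2023", "supplier": "Mustermann GmbH"},
    {"MBLNR": "5000000002", "MJAHR": "2023", "supplier": "Beispiel AG"},
]
# Fuzzy matching tolerates OCR noise such as a dropped letter.
print(best_sap_match("Musterman GmbH", positions)["MBLNR"])
```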
Evaluation
With only seven batch PDFs available initially, we needed to guard against overfitting. To address this, we split the original batches into their ~70 individual PDFs and then randomly recombined them into new batches for more varied testing. We also curated a corpus of 400 German receipt PDFs (located in the `data` folder) to ensure robust evaluation and fine-tuning.
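The recombination step can be sketched as follows; function and variable names are illustrative, not taken from the repository.

```python
import random

def recombine(pdfs_by_batch: dict, n_batches: int, seed: int = 0) -> list:
    """Pool all individual PDFs, shuffle them, and deal them into
    n_batches new synthetic batches of roughly equal size."""
    rng = random.Random(seed)  # fixed seed keeps test batches reproducible
    pdfs = [p for batch in pdfs_by_batch.values() for p in batch]
    rng.shuffle(pdfs)
    # Round-robin deal so batch sizes stay balanced.
    return [pdfs[i::n_batches] for i in range(n_batches)]

original = {"batch_1": ["a.pdf", "b.pdf", "c.pdf"], "batch_2": ["d.pdf", "e.pdf"]}
print(recombine(original, n_batches=2, seed=1))
```

Every shuffled split preserves the full set of receipts, so metrics stay comparable across runs while the page-boundary patterns a model could memorize are broken up.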
Tuning the Regex Feature Weights

We ran our evaluation loop (see `evaluation.py` and `example_predict`) to measure performance across various feature combinations and identify the top candidates. From this shortlist of predictors, we adjusted each feature's weight based on precise performance criteria and human insight. This step not only improved accuracy but also highlights the transparency and simplicity of our regex-based approach.
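The weighted-feature decision described above can be sketched like this. The feature names, weights, and threshold are hypothetical placeholders, not the tuned values.

```python
# Hypothetical feature weights for deciding whether a page starts a new
# invoice; the real features and tuned values live in the evaluation loop.
WEIGHTS = {
    "has_invoice_number": 2.0,
    "has_date_header": 1.0,
    "page_number_is_one": 1.5,
}
THRESHOLD = 2.5

def is_new_invoice(features: dict) -> bool:
    """Sum the weights of all regex features that fired on this page
    and compare against the decision threshold."""
    score = sum(w for name, w in WEIGHTS.items() if features.get(name))
    return score >= THRESHOLD

print(is_new_invoice({"has_invoice_number": True, "has_date_header": True}))
```

Because the decision is a transparent weighted sum, a human can inspect exactly why a page was (or was not) classified as an invoice start, which is what made manual weight adjustment feasible.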
1. Clone the repository

```shell
git clone https://github.com/anpoc/ScienceHack.git
cd ScienceHack
```

2. Setup the Environment
First move into the `solution` folder and then install the packages specified in `requirements.txt` (the one within the `solution` folder).
```shell
cd solution
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

3. Prepare your Data
Place your batch PDF(s) and the corresponding SAP JSON file into solution/data/. For example:
```
data/tests/
├── batch_6_2023_1.pdf
└── SAP_data.json
```

4. Run the pipeline
As mentioned in the presentation, we have three different methods to generate the requested JSON file. Below we show how to run each of them:
From within the ScienceHack directory, execute:
```shell
python ./src/n_method/main.py ./data/tests/batch_6_2023_1.pdf ./data/BECONEX_challenge_materials_samples/SAP_data.json
```

The script runs in seconds and writes one JSON file per detected invoice to `solution/output/`. Each file includes:
- `page`: the starting page number
- `confidence`: confidence score for the detection
- `MBLNR` and `MJAHR`: SAP identifiers

For an example output, see `solution/output/batch_6_2023_1.json`.
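A record with the fields listed above might look like this; the values shown are invented for illustration and the real confidence scale and identifiers will differ.

```python
import json

# Hypothetical record mirroring the fields the pipeline writes;
# values are placeholders, not real pipeline output.
record = {"page": 1, "confidence": 0.93, "MBLNR": "5000000001", "MJAHR": "2023"}
print(json.dumps(record, indent=2))
```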
