anpoc/ScienceHack

Beconex Challenge

TUM Science Hackathon

✨ Inspiration

Although we stand on the brink of an AI-driven digital revolution, Germany still seems lost in the analogue era. For many companies, digitization is a costly endeavor because it demands significant human effort. This project simplifies the process by automatically integrating scanned receipts into your SAP system: it implements a fully automated pipeline that takes a batch of PDF files and assigns them to positions in the SAP system.

📋 Project Overview

Our first approaches to this problem relied mostly on LLMs. We tried to

  • Fine-tune the final layer of a vision embedding model
  • Use different small LLMs (below 12 GB) to extract information from the PDF or to make the assignment decision

Still, towards the end of Saturday, we accepted that one approach had stolen the show from all of these LLMs: regex. Not only is it clearly faster than all LLM-based approaches, it also gives much higher accuracy than the small LLMs we considered using. As engineers, it is sometimes about being pragmatic, so we hand in this approach. It is also the most convenient one to use, as it runs on any laptop and does not require a GPU.
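To give a feel for what regex-based extraction can look like, here is a minimal sketch. The three patterns (German dates, German-formatted amounts, and a 10-digit document number) and the function name extract_features are assumptions chosen for illustration; the feature set actually used in the repository may differ.

```python
import re

# Hypothetical patterns for illustration only; not taken from the repository.
PATTERNS = {
    # German date such as 14.03.2023
    "date": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),
    # German-formatted amount such as 1.234,56
    "amount": re.compile(r"\b\d{1,3}(?:\.\d{3})*,\d{2}\b"),
    # Assumed 10-digit document number
    "doc_number": re.compile(r"\b\d{10}\b"),
}


def extract_features(page_text: str) -> dict[str, list[str]]:
    """Return every match of each pattern found in the text of one page."""
    return {
        name: [m.group(0) for m in pattern.finditer(page_text)]
        for name, pattern in PATTERNS.items()
    }


sample = "Lieferschein Nr. 4900001234 vom 14.03.2023, Betrag 1.234,56 EUR"
features = extract_features(sample)
print(features)
```

Extracted values like these can then be compared against the fields of each SAP record, which needs no GPU and takes negligible time per page.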

We also considered ColPali, a vision language model (VLM) specialized in documents. Its zero-shot ability to predict whether two consecutive pages of a PDF batch belong together was not that good, but its retrieval (RAG) abilities are quite nice. Hence, we also implemented ColPali-based retrieval combined with string matching.

🚧 Challenges

  1. Evaluation
    With only seven batch PDFs available initially, we needed to guard against overfitting. To address this, we split the original batches into their ~70 individual PDFs and then randomly recombined them into new batches for more varied testing. We also curated a corpus of 400 German receipt PDFs (located in the data folder) to ensure robust evaluation and fine-tuning.

  2. Tuning the Regex Feature Weights
    We ran our evaluation loop (see evaluation.py and example_predict) to measure performance across various feature combinations and identify the top candidates. From this shortlist of predictors, we adjusted each feature’s weight based on precise performance criteria and human insight. This step not only improved accuracy but also highlights the transparency and simplicity of our regex-based approach.
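The two challenges above can be sketched in a few lines. This is a simplified illustration, not the code in evaluation.py: documents are represented only as (doc_id, n_pages) pairs rather than real PDFs, and the feature names, weights, and the functions recombine, score, and best_match are all hypothetical.

```python
import random

# Hypothetical integer weights per regex feature; the tuned values are not stated here.
WEIGHTS = {"doc_number": 60, "date": 25, "amount": 15}


def recombine(documents, batch_size, seed=0):
    """Shuffle (doc_id, n_pages) pairs into new batches.

    Returns a list of batches; each batch lists (doc_id, start_page), which
    serves as ground truth for where each document begins in the batch.
    """
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)
    batches = []
    for i in range(0, len(shuffled), batch_size):
        page, batch = 1, []
        for doc_id, n_pages in shuffled[i:i + batch_size]:
            batch.append((doc_id, page))
            page += n_pages
        batches.append(batch)
    return batches


def score(page_features, record):
    """Weighted share of features whose SAP value appears on the page (0.0 to 1.0)."""
    hit = sum(
        weight
        for name, weight in WEIGHTS.items()
        if record.get(name) in page_features.get(name, set())
    )
    return hit / sum(WEIGHTS.values())


def best_match(page_features, records):
    """Return (score, record) for the highest-scoring SAP record."""
    return max(
        ((score(page_features, r), r) for r in records),
        key=lambda pair: pair[0],
    )


# About 70 single documents of 1 to 3 pages, regrouped into batches of 10.
docs = [(f"doc_{i}", 1 + i % 3) for i in range(70)]
batches = recombine(docs, batch_size=10, seed=42)

# Made-up page features and SAP records for illustration.
page_features = {"doc_number": {"4900001234"}, "date": {"14.03.2023"}}
records = [
    {"id": "A", "doc_number": "4900001234", "date": "01.01.2023", "amount": "10,00"},
    {"id": "B", "doc_number": "4900009999", "date": "14.03.2023", "amount": "1.234,56"},
]
top_score, top_record = best_match(page_features, records)
print(top_record["id"], top_score)
```

In this toy setup, record A wins because the document number carries the largest weight. Keeping the weights in one small dictionary is what makes this kind of approach easy to inspect and adjust by hand.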

🤖 Start using our Solution

1. Clone the repository

git clone https://github.com/anpoc/ScienceHack.git
cd ScienceHack

2. Set up the Environment

Move into the solution folder, create a virtual environment, and install the packages listed in requirements.txt (the one inside the solution folder).

cd solution
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

3. Prepare your Data

Place your batch PDF(s) and the corresponding SAP JSON file into solution/data/. For example:

data/tests/
  ├── batch_6_2023_1.pdf
  └── SAP_data.json

4. Run the pipeline

As mentioned in the presentation, we have three different methods to generate the requested JSON file. Below we show how to run them:

From within the ScienceHack directory, execute:

python ./src/n_method/main.py ./data/tests/batch_6_2023_1.pdf ./data/BECONEX_challenge_materials_samples/SAP_data.json

The script runs in seconds and writes one JSON file per detected invoice to solution/output/. Each file includes:

  • page: the starting page number
  • confidence: confidence score for the detection
  • MBLNR and MJAHR: SAP identifiers

For an example output, see solution/output/batch_6_2023_1.json.
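As a sketch of how the output could be consumed downstream, the snippet below flags low-confidence detections for manual review. Only the four field names come from this README; the list layout, the sample values, the threshold, and the helper low_confidence are assumptions, so check the example output file for the real structure.

```python
import json

# Made-up payload for illustration; only the field names are from the README.
raw = """[
  {"page": 1, "confidence": 0.93, "MBLNR": "4900001234", "MJAHR": "2023"},
  {"page": 4, "confidence": 0.41, "MBLNR": "4900005678", "MJAHR": "2023"}
]"""


def low_confidence(records, threshold=0.5):
    """Return the detections whose confidence is below the threshold."""
    return [r for r in records if r["confidence"] < threshold]


records = json.loads(raw)
flagged = low_confidence(records)
for r in flagged:
    print(f"Review page {r['page']}: {r['MBLNR']}/{r['MJAHR']} ({r['confidence']:.2f})")
```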
