

@praveenmidde praveenmidde commented Mar 19, 2025

Work in progress on the FinTabNet dataset builder and the Azure provider.
Opening this as a draft, as requested, for initial review only.

Changes include:

  • New implementation for the S3 source
  • Save the intermediate files (the predictions, plus the DoclingDocuments for both prediction and ground truth) for debugging
  • Implement the TEDS metric run for a (limited) FinTabNet dataset from COS. If API predictions are already available, they are used instead of calling the API
  • Pin the library versions of pydantic and of the url and s3 libraries
  • Update shard_id while writing each shard, so that shards don't overwrite each other
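
To illustrate the last bullet, here is a minimal sketch of shard writing with an advancing shard id. It is not the code from this branch: `write_shards`, `chunked`, and the `shard_{id}.jsonl` naming are hypothetical, and JSON Lines stands in for the parquet files so the example needs only the standard library.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def chunked(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` records."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_shards(
    records: Iterable[dict],
    out_dir: Path,
    shard_size: int,
    start_shard_id: int = 0,
) -> int:
    """Write records in shards and return the next unused shard id.

    Hypothetical helper: the file naming and the JSON Lines format are
    assumptions made for illustration only.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_id = start_shard_id
    for batch in chunked(records, shard_size):
        path = out_dir / f"shard_{shard_id:06d}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for record in batch:
                fh.write(json.dumps(record) + "\n")
        # Advance the id so the next shard gets its own file name
        # instead of overwriting the one just written.
        shard_id += 1
    return shard_id
```

Returning the next unused id lets a caller pass it back in as `start_shard_id` on a later call, so repeated writes into the same directory keep producing new files.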

@praveenmidde praveenmidde self-assigned this Mar 19, 2025
@cau-git cau-git changed the base branch from main to cau/new-class-design March 19, 2025 13:06
Praveen Kumar Midde added 3 commits March 19, 2025 18:37
- Save the hyperscaler API JSON responses
- Save the DoclingDocument outputs
- Temporary files are stored as follows:
   - `intermediate_files` - parquet files
   - `microsoft` - root folder that contains the API output files
      - `docling_document` - DoclingDocument conversions of the MS outputs
   - `visualizations` - output of the table visualizations
- Convert the prediction table output to a DoclingDocument (has a known bug in the spans)
- Save the GT DoclingDocument as well for debugging
- Some cleanup
- New implementation for the S3 source
- Save the intermediate files (the predictions, plus the DoclingDocuments for both prediction and ground truth) for debugging
- Implement the TEDS metric run for a (limited) FinTabNet dataset from COS. If API predictions are already available, they are used instead of calling the API
- Pin the library versions of pydantic and of the url and s3 libraries
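
The "use saved predictions instead of calling the API" behavior described in the commits above can be sketched roughly as follows. This is an assumption-laden illustration, not the branch's code: `load_or_predict` and `call_api` are hypothetical names, `call_api` stands in for the real Azure request, and the one-JSON-file-per-document layout is assumed.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def load_or_predict(
    doc_id: str,
    cache_dir: Path,
    call_api: Callable[[str], Dict],
) -> Dict:
    """Return the saved prediction for doc_id, calling the API only on a miss.

    Hypothetical helper: `call_api` is a stand-in for the provider call,
    and doc_id is assumed to be safe to use as a file name.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{doc_id}.json"
    if cache_file.exists():
        # A prediction was saved by an earlier run; reuse it.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    prediction = call_api(doc_id)
    # Persist the raw response so later runs (and debugging) can skip the API.
    cache_file.write_text(json.dumps(prediction), encoding="utf-8")
    return prediction
```

With `cache_dir` pointed at a folder such as the `microsoft` directory listed above, a second evaluation run over the same documents would read the saved JSON files rather than issuing new requests.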

cau-git commented Apr 7, 2025

This PR will be superseded by #47, which ports the key features from this branch to the final docling-eval API.

praveenmidde commented

Closing this PR, as it has been addressed in the following PRs:
#50
#65
#62 (misc minor fixes for Azure)

Note: the FTN COS builder, if needed, will be implemented separately.
