Structured Data Extraction w/ Pre-defined Schema - Support for Excel/Spreadsheets #2216

@Kking112

Requested feature

Docling recently added support for structured data extraction using a predefined schema (see here), but it currently supports only PDF and image files. Adding support for Excel/spreadsheet files would be extremely useful for me and (I imagine) many others. I understand this feature is still in beta, but I am willing to try implementing Excel/spreadsheet support myself and open a PR if the Docling team supports this.

Alternatives

There are several other libraries that do this, most of them direct competitors of Docling (LlamaIndex, Unstructured, etc.). However, most of those competitors offer only limited open-source/free options; their most effective solutions typically require a paid API.

Conclusion

I am a firm believer in open-source software, and I believe adding this feature to Docling would benefit the project tremendously and encourage many users to choose Docling over closed-source competitors. As mentioned above, I am willing to try implementing this myself and open a PR if the Docling team supports the idea.

If anyone has suggestions on how to go about implementing this, or feedback, questions, etc., please let me know.

Thank you!

Labels: enhancement (New feature or request)
