Update Document id generation algorithm #7450

Closed
vblagoje opened this issue Apr 2, 2024 · 1 comment
Labels: 2.x (Related to Haystack v2.0), P1 (High priority, add to the next sprint)

Comments

vblagoje (Member) commented Apr 2, 2024

While implementing the Azure converter issue, one of the unit tests that worked on the 1.x branch kept failing. The unit test is test_azure_converter_with_multicolumn_header_table and it is available on this PR

Upon further investigation, @sjrl and I traced the failure to how we calculate the Document id. A possible solution might be to "allow duplicate column names. Looking at the example in the PDF, it is very common to have tables in things like financial reports to have multi column headers. And I think the best way to represent that in a dataframe is to have duplicate column names. I think it would be better to update the call to the to_json() method to work with dataframes that have duplicate column names."
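
To make the duplicate-column problem concrete, here is a minimal standard-library sketch. It is not Haystack's id code and does not use pandas; the table data and the helpers `column_keyed`, `split_layout`, and `table_id` are hypothetical names chosen for illustration. It shows why a serialization keyed by column name loses data when headers repeat, and how a layout that stores headers and rows as separate lists keeps them, so a hash-based id can still be computed.

```python
import hashlib
import json

# Hypothetical multi-column-header table: "Amount" appears twice.
columns = ["Year", "Amount", "Amount"]
rows = [["2022", "10", "12"], ["2023", "11", "15"]]


def column_keyed(columns, rows):
    """Serialize as {column: [values]}; a repeated name overwrites the earlier one."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def split_layout(columns, rows):
    """Keep headers and rows as separate lists so duplicate names survive."""
    return {"columns": list(columns), "data": [list(row) for row in rows]}


def table_id(columns, rows):
    """Hash the split-layout JSON to get a stable identifier (illustrative only)."""
    payload = json.dumps(split_layout(columns, rows), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(column_keyed(columns, rows))  # only two keys remain
    print(split_layout(columns, rows))  # all three columns preserved
    print(table_id(columns, rows))
```

In the column-keyed form the second "Amount" column silently replaces the first, which is the kind of ambiguity a serializer has to either reject or work around. The split layout sidesteps it because column names are never used as keys.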

I'm not sure how far-reaching the consequences of this change would be, but a PR resolving the Azure converter issue is blocked by this issue.

vblagoje (Member, Author) commented Apr 2, 2024

vblagoje added the P1 (High priority, add to the next sprint) and 2.x (Related to Haystack v2.0) labels on Apr 2, 2024
masci self-assigned this on Apr 7, 2024
masci closed this as completed on Apr 8, 2024