fix: `ParsrConverter` fails on pages without text #3605

anakin87 · 2022-11-18T15:59:00Z

Related Issues

fixes ParsrConverter invalid if statement #3593

Proposed Changes:

Since #2932, the form feed character ("\f") is used to separate pages.
In the `ParsrConverter`, each page goes through this check:

haystack/haystack/nodes/file_converter/parsr.py

Lines 189 to 190 in dc26e6d

    
    if text[-1] != "\f":
        text += "\f"

This check raises an `IndexError` if `len(text) == 0`, which can happen for empty pages or pages that contain only a table.
I corrected this bug.
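A minimal sketch of one way to guard the check, not necessarily the exact patch merged here. The helper name `append_page_separator` is hypothetical and only serves the illustration; the point is that `str.endswith` is safe on an empty string, while `text[-1]` is not.

```python
def append_page_separator(text: str) -> str:
    """Return `text` with a trailing form feed, adding one if it is missing.

    Hypothetical helper for illustration; it is not part of ParsrConverter.
    """
    # str.endswith works on an empty string, whereas text[-1] raises IndexError.
    if not text.endswith("\f"):
        text += "\f"
    return text


# An empty page (for example, a page that contains only a table) no longer fails.
pages = ["First page", "", "Third page\f"]
joined = "".join(append_page_separator(page) for page in pages)
```

With this guard, an empty page still contributes its own "\f", so the separators keep their one-per-page meaning.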

How did you test it?

Manual test using this PDF.

Notes for the reviewer

I also have doubts about

haystack/haystack/nodes/file_converter/parsr.py

Line 210 in dc26e6d

    
    docs = tables + [Document(content=text.strip(), meta=meta, id_hash_keys=id_hash_keys)]

If the first page is blank, its leading "\f" is removed by `text.strip()`, so the page numbers derived from the form feeds may become incorrect.
@bogdankostic WDYT?
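To make the concern concrete, here is a small self-contained sketch. It assumes page numbers are derived by counting the form feeds that precede a piece of text; the `page_of` helper is hypothetical and not taken from Haystack.

```python
def page_of(content: str, needle: str) -> int:
    """Return the 1-based page of `needle`, counting form feeds before it.

    Hypothetical helper: it assumes one "\f" closes each page.
    """
    return content[: content.index(needle)].count("\f") + 1


# The first page is blank, so the joined text starts with a form feed.
pages = ["", "Text on page 2", "Text on page 3"]
text = "".join(page + "\f" for page in pages)

# str.strip() treats "\f" as whitespace and removes the leading separator.
stripped = text.strip()

page_before_strip = page_of(text, "Text on page 2")
page_after_strip = page_of(stripped, "Text on page 2")
```

Under these assumptions the same text is attributed to page 2 before stripping and to page 1 afterwards.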

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

julian-risch

The change looks good to me and the PR is ready to be merged.
Sidenote: When I reviewed the PR, I also came across this line of code

haystack/haystack/nodes/file_converter/parsr.py

Line 155 in eb9d3fc

    
    while status_response.status_code == 200 and status_response.status_code != 201:

This could be simplified by having just

    while status_response.status_code == 200:
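A quick sketch of why the two conditions are equivalent: a status code equal to 200 can never be 201, so the second clause adds nothing. The function names are hypothetical and exist only for this comparison.

```python
def original_condition(status_code: int) -> bool:
    # Condition as currently written in the polling loop.
    return status_code == 200 and status_code != 201


def simplified_condition(status_code: int) -> bool:
    # Proposed simplification.
    return status_code == 200


# Both conditions agree for every status code in the usual HTTP range.
conditions_agree = all(
    original_condition(code) == simplified_condition(code)
    for code in range(100, 600)
)
```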

Maybe that's something for you @bogdankostic to check when you also have a look at

haystack/haystack/nodes/file_converter/parsr.py

Line 210 in dc26e6d

    
    docs = tables + [Document(content=text.strip(), meta=meta, id_hash_keys=id_hash_keys)]

and at the related questions of empty documents and wrong page numbers?


anakin87 added 2 commits November 18, 2022 16:43

try to fix bug

770899c

remove print

0eb4fe7

anakin87 requested a review from a team as a code owner November 18, 2022 15:59

anakin87 requested review from julian-risch and removed request for a team November 18, 2022 15:59

leftover

eb9d3fc

julian-risch added the topic:file_converter label Nov 21, 2022

julian-risch approved these changes Nov 21, 2022


julian-risch changed the title ~~fix: ParsrConverter little bug~~ fix: ParsrConverter fails on pages without text Nov 21, 2022

julian-risch merged commit 5f62494 into deepset-ai:main Nov 21, 2022
