feat: Add headline extraction to `ParsrConverter` #3488

Merged

bogdankostic merged 6 commits into main from parsr_headlines

Oct 31, 2022

Contributor

bogdankostic commented Oct 27, 2022

Related Issues

fixes Make use of Parsr's heading detection #3057

Proposed Changes:

This PR adds the option to extract headlines from PDF files using `ParsrConverter`. It follows the structure for headlines as defined in #3445.
During development, I noticed that Parsr's built-in headline extraction does not work well on all PDFs, so use this feature with caution.

How did you test it?

I added a unit test.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

bogdankostic marked this pull request as ready for review

October 28, 2022 09:36

bogdankostic requested a review from a team as a code owner

October 28, 2022 09:36

bogdankostic requested review from masci and removed request for a team

October 28, 2022 09:36

Contributor

masci commented Oct 28, 2022

Can you rebase on top of main?

bogdankostic requested a review from a team as a code owner

October 28, 2022 15:26

bogdankostic added 4 commits

October 28, 2022 17:35


          Add headline extraction to ParsrConverter

b447b3f


          Add sample PDF file

6408cf8


          Add test

3ce14f1


          Use extract_headlines if set in convert method

e07a8b3

bogdankostic force-pushed the parsr_headlines branch from b3fc25d to e07a8b3

October 28, 2022 15:37

Contributor Author

bogdankostic commented Oct 28, 2022

> Can you rebase on top of main?

Done ✔️

bogdankostic removed the request for review from a team

October 31, 2022 08:19

masci suggested changes


Contributor

masci left a comment

Overall looks good

haystack/nodes/file_converter/parsr.py Outdated

-                          id_hash_keys = self.id_hash_keys
+                      valid_languages = valid_languages if valid_languages is not None else self.valid_languages
+                      id_hash_keys = id_hash_keys if id_hash_keys is not None else self.id_hash_keys
+                      extract_headlines = extract_headlines if extract_headlines is not None else self.extract_headlines

Contributor

masci Oct 31, 2022

nit: I find the other form more concise and readable; these one-liners run to around the 100th column.
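
For illustration, the two forms being compared might look like the sketch below. It assumes that "the other form" is the `if ... is None:` block the diff replaced, and it uses a hypothetical `Converter` class with a single default rather than the actual `ParsrConverter` code.

```python
from typing import List, Optional


class Converter:
    """Hypothetical stand-in for a converter with constructor-level defaults."""

    def __init__(self, valid_languages: Optional[List[str]] = None):
        self.valid_languages = valid_languages

    def resolve_one_liner(self, valid_languages: Optional[List[str]] = None) -> Optional[List[str]]:
        # Conditional-expression form, as in the diff: compact but long.
        valid_languages = valid_languages if valid_languages is not None else self.valid_languages
        return valid_languages

    def resolve_block(self, valid_languages: Optional[List[str]] = None) -> Optional[List[str]]:
        # Block form: same behaviour, shorter lines.
        if valid_languages is None:
            valid_languages = self.valid_languages
        return valid_languages
```

Both methods fall back to the instance default only when the argument is `None`, so an explicitly passed empty list is kept in either form.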

haystack/nodes/file_converter/parsr.py Outdated

                                   f"been decoded in the correct text format."
                               )
+                      if self.extract_headlines:

Contributor

masci Oct 31, 2022

Should this be `if extract_headlines`?

Contributor Author

bogdankostic Oct 31, 2022

Good catch
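
A minimal sketch of why the distinction matters, using a hypothetical `Converter` class (not the real `ParsrConverter`): once a per-call argument has been resolved against the instance default, checking `self.extract_headlines` ignores whatever the caller passed.

```python
from typing import Optional


class Converter:
    """Hypothetical class with an instance default and a per-call override."""

    def __init__(self, extract_headlines: bool = False):
        self.extract_headlines = extract_headlines

    def will_extract_buggy(self, extract_headlines: Optional[bool] = None) -> bool:
        if extract_headlines is None:
            extract_headlines = self.extract_headlines
        # Bug: consults the instance default, so the per-call value is ignored.
        return bool(self.extract_headlines)

    def will_extract_fixed(self, extract_headlines: Optional[bool] = None) -> bool:
        if extract_headlines is None:
            extract_headlines = self.extract_headlines
        # Fix: consult the resolved local variable.
        return bool(extract_headlines)
```

With `Converter(extract_headlines=False)`, a call that passes `extract_headlines=True` is honoured only by the fixed variant.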

haystack/nodes/file_converter/parsr.py Outdated

                               )
+                      if self.extract_headlines:
+                          meta = meta if meta else {}

Contributor

masci Oct 31, 2022

I would check `meta is None` here. You could also move this "initialization" of the parameter to the beginning of the method, along with the others.
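
The difference between the truthiness check and an explicit `None` check can be sketched as follows. The helper names are hypothetical, and the review does not say whether this difference affects `ParsrConverter` in practice; the example only shows how the two checks treat an empty dict.

```python
from typing import Any, Dict, Optional


def init_truthy(meta: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # Truthiness check: replaces None AND an empty dict with a new dict.
    return meta if meta else {}


def init_none_check(meta: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # Explicit check: only None is replaced; an empty dict is kept as-is.
    if meta is None:
        meta = {}
    return meta


caller_meta: Dict[str, Any] = {}
from_truthy = init_truthy(caller_meta)          # a fresh dict, not caller_meta
from_none_check = init_none_check(caller_meta)  # the caller's own dict
```

With the explicit check, the object returned for an empty `meta` is the caller's own dict, so later writes to it are visible to the caller; with the truthiness check they go into a separate dict.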

haystack/nodes/file_converter/parsr.py Outdated

+                      if extract_headlines:
+                          relevant_headlines = []
+                          cur_lowest_headline_level = 1000

Contributor

masci Oct 31, 2022

nit: if you use `sys.maxsize` instead of 1000, your intent is more obvious (for example, at first I wondered whether 1000 was somehow special in this context).
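
A rough sketch of how the sentinel could be used. The diff shows only fragments of the loop, so the backwards walk, the helper name `enclosing_headlines`, and the `"headline"` key are assumptions made for illustration; `"level"` and `"start_idx"` appear in the diff.

```python
import sys
from copy import copy
from typing import Any, Dict, List


def enclosing_headlines(headlines: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: walk backwards and keep each headline whose level
    is lower than every level kept so far."""
    relevant_headlines: List[Dict[str, Any]] = []
    # Sentinel: any realistic headline level compares lower than sys.maxsize.
    cur_lowest_headline_level = sys.maxsize
    for headline in reversed(headlines):
        if headline["level"] < cur_lowest_headline_level:
            headline_copy = copy(headline)  # shallow copy; the input stays untouched
            headline_copy["start_idx"] = None
            relevant_headlines.append(headline_copy)
            cur_lowest_headline_level = headline_copy["level"]
    # Restore document order.
    return relevant_headlines[::-1]
```

Starting from `sys.maxsize` signals "no level seen yet" without suggesting that a particular number such as 1000 has meaning.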

haystack/nodes/file_converter/parsr.py Outdated

+                                  headline_copy["start_idx"] = None
+                                  relevant_headlines.append(headline_copy)
+                                  cur_lowest_headline_level = headline_copy["level"]
+                          relevant_headlines = list(reversed(relevant_headlines))

Contributor

masci Oct 31, 2022

nit: the "alien face" operator would be faster here: `relevant_headlines = relevant_headlines[::-1]`. I expect the list to be small enough for this not to matter; leaving this here just in case.
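
For reference, a small sketch showing that the two spellings produce the same reversed copy. The sample data is made up, and no timing claim is made here.

```python
# Made-up sample data standing in for the collected headlines.
relevant_headlines = [{"level": 2}, {"level": 1}, {"level": 0}]

via_reversed = list(reversed(relevant_headlines))  # form used in the diff
via_slice = relevant_headlines[::-1]               # slice form from the review

# Both build a new list in reverse order; the original list is unchanged.
same_result = via_reversed == via_slice
```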

test/nodes/test_file_converter.py

+                  converter = ParsrConverter()
+                  docs = converter.convert(file_path=str((SAMPLES_PATH / "pdf" / "sample_pdf_4.pdf").absolute()))
+                  for doc, expectation in zip(docs, expected_headlines):

Contributor

masci Oct 31, 2022

Should we assert that the number of docs is exactly 2 before getting here? The second item in `expected_headlines` wouldn't be tested if some day `docs` contained only one item.
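
The concern can be sketched as follows: `zip` stops at the shorter input, so a missing document silently skips its expectation. The function names and sample values are hypothetical and stand in for the real test.

```python
from typing import List, Sequence


def compared_pairs(docs: Sequence[str], expected: Sequence[List[str]]) -> int:
    """Count how many (doc, expectation) pairs a zip-based loop would visit."""
    return sum(1 for _ in zip(docs, expected))


def compared_pairs_guarded(docs: Sequence[str], expected: Sequence[List[str]]) -> int:
    # Guard first, so a missing document fails loudly instead of being skipped.
    assert len(docs) == len(expected), f"expected {len(expected)} docs, got {len(docs)}"
    return compared_pairs(docs, expected)
```

With one document and two expectations, the unguarded loop visits a single pair and would pass, while the guarded version raises an `AssertionError` before any comparison runs.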

bogdankostic added 2 commits

October 31, 2022 15:44


          Integrate PR feedback

9aa84d6


          Merge remote-tracking branch 'origin/main' into parsr_headlines

c070fcf

bogdankostic requested a review from masci

October 31, 2022 17:07

masci approved these changes


Contributor

masci left a comment

🚀

bogdankostic merged commit 6022441 into main

bogdankostic deleted the parsr_headlines branch

October 31, 2022 18:00
