feat: Add headline extraction to ParsrConverter #3488
Can you rebase on top of
b3fc25d to e07a8b3
Done ✔️
id_hash_keys = self.id_hash_keys
valid_languages = valid_languages if valid_languages is not None else self.valid_languages
id_hash_keys = id_hash_keys if id_hash_keys is not None else self.id_hash_keys
extract_headlines = extract_headlines if extract_headlines is not None else self.extract_headlines
nit: I find the other form more concise and readable, these one-liners go around the 100th column
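For reference, a minimal sketch of the two forms being compared. The class `DefaultsSketch` and its method names are hypothetical stand-ins, not the converter's actual code; both methods should resolve a per-call argument against the instance default in the same way.

```python
from typing import Optional


class DefaultsSketch:
    """Hypothetical stand-in for a converter holding an instance-level default."""

    def __init__(self, extract_headlines: bool = False):
        self.extract_headlines = extract_headlines

    def resolve_one_liner(self, extract_headlines: Optional[bool] = None) -> bool:
        # Conditional-expression form, as in the diff above.
        extract_headlines = extract_headlines if extract_headlines is not None else self.extract_headlines
        return extract_headlines

    def resolve_if_block(self, extract_headlines: Optional[bool] = None) -> bool:
        # Block form: shorter lines, same behavior.
        if extract_headlines is None:
            extract_headlines = self.extract_headlines
        return extract_headlines
```

Both forms treat an explicit `False` as a real override, because they test for `None` rather than truthiness.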
f"been decoded in the correct text format."
)

if self.extract_headlines:
should this be if extract_headlines?
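A small sketch of why this matters, assuming the method first resolves the argument against the instance default. `OverrideSketch` and its methods are hypothetical and only illustrate the pattern: checking the attribute silently ignores a per-call override.

```python
from typing import Optional


class OverrideSketch:
    """Hypothetical converter whose default is overridable per call."""

    def __init__(self, extract_headlines: bool = False):
        self.extract_headlines = extract_headlines

    def convert_checking_attribute(self, extract_headlines: Optional[bool] = None) -> bool:
        if extract_headlines is None:
            extract_headlines = self.extract_headlines
        # Checks the instance attribute, so the resolved local value is never used.
        return bool(self.extract_headlines)

    def convert_checking_local(self, extract_headlines: Optional[bool] = None) -> bool:
        if extract_headlines is None:
            extract_headlines = self.extract_headlines
        # Checks the resolved local value, so the per-call override is honored.
        return bool(extract_headlines)


converter = OverrideSketch(extract_headlines=False)
ignored = converter.convert_checking_attribute(extract_headlines=True)
honored = converter.convert_checking_local(extract_headlines=True)
```

Here `ignored` ends up `False` even though the caller asked for headlines, while `honored` is `True`.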
)

if self.extract_headlines:
    meta = meta if meta else {}
I would check if meta is None here, you could also move this "initialization" of the param at the beginning of the method, along with the others
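To illustrate the difference, here is a sketch with two hypothetical helper functions (neither is part of the converter). With the truthiness check, an empty dict passed by the caller is swapped for a fresh one; with the `None` check, the caller's dict is kept.

```python
from typing import Any, Dict, Optional


def attach_headlines_truthy(meta: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # An empty dict is falsy, so a caller-supplied {} is replaced here.
    meta = meta if meta else {}
    meta["headlines"] = []
    return meta


def attach_headlines_none_check(meta: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Only a missing argument triggers the default.
    if meta is None:
        meta = {}
    meta["headlines"] = []
    return meta


caller_meta_a: Dict[str, Any] = {}
truthy_keeps_dict = attach_headlines_truthy(caller_meta_a) is caller_meta_a

caller_meta_b: Dict[str, Any] = {}
none_check_keeps_dict = attach_headlines_none_check(caller_meta_b) is caller_meta_b
```

Whether writing into the caller's dict is desirable is a separate question; the sketch only shows that the two checks behave differently for an empty `meta`.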

if extract_headlines:
    relevant_headlines = []
    cur_lowest_headline_level = 1000
nit: if you use sys.maxsize instead of 1000, your intent is more obvious (for example, at first I was wondering if 1000 was somehow special in this context)
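A rough sketch of how the sentinel might read in context. The surrounding loop is not visible in this excerpt, so the function name `relevant_headlines_before`, the `cutoff_idx` parameter, and the filter condition are all assumptions; only the variable names and the `start_idx`/`level` keys come from the diff.

```python
import sys
from copy import copy
from typing import Any, Dict, List


def relevant_headlines_before(headlines: List[Dict[str, Any]], cutoff_idx: int) -> List[Dict[str, Any]]:
    """Walk backwards and keep each headline that is higher in the hierarchy
    (numerically lower level) than everything kept so far. Assumed logic."""
    relevant_headlines = []
    # sys.maxsize signals "no headline seen yet" without a magic number.
    cur_lowest_headline_level = sys.maxsize
    for headline in reversed(headlines):
        if headline["start_idx"] < cutoff_idx and headline["level"] < cur_lowest_headline_level:
            headline_copy = copy(headline)  # shallow copy, the input stays untouched
            headline_copy["start_idx"] = None
            relevant_headlines.append(headline_copy)
            cur_lowest_headline_level = headline_copy["level"]
    return relevant_headlines[::-1]


sample_headlines = [
    {"headline": "Title", "start_idx": 0, "level": 0},
    {"headline": "Section A", "start_idx": 10, "level": 1},
    {"headline": "Sub A.1", "start_idx": 20, "level": 2},
    {"headline": "Section B", "start_idx": 30, "level": 1},
]
kept = [h["headline"] for h in relevant_headlines_before(sample_headlines, cutoff_idx=40)]
```

With this sample input, `kept` contains "Title" and "Section B", in document order.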
headline_copy["start_idx"] = None
relevant_headlines.append(headline_copy)
cur_lowest_headline_level = headline_copy["level"]
relevant_headlines = list(reversed(relevant_headlines))
nit: the "alien face" operator would be faster here: relevant_headlines = relevant_headlines[::-1]. I expect the list to be small enough for this not to matter, leaving this here just in case.
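A quick sketch showing that the two spellings produce the same list. The sample data is made up for illustration, and no timing is measured here.

```python
relevant_headlines = [
    {"headline": "Section B", "level": 1},
    {"headline": "Title", "level": 0},
]

# Form used in the diff: build a list from a reverse iterator.
via_reversed = list(reversed(relevant_headlines))

# Slice form suggested in the comment: a reversed shallow copy.
via_slice = relevant_headlines[::-1]
```

Both forms return a new list and leave `relevant_headlines` itself unchanged.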
converter = ParsrConverter()

docs = converter.convert(file_path=str((SAMPLES_PATH / "pdf" / "sample_pdf_4.pdf").absolute()))
for doc, expectation in zip(docs, expected_headlines):
should we assert the number of docs is exactly 2 before getting here? The second item in expected_headlines wouldn't be tested if some day docs contained only one item.
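One possible shape for that guard, sketched with plain dicts in place of real documents. The helper `assert_headlines_match` and the sample data are hypothetical. The point is that `zip` stops at the shorter input, so a length check has to come first.

```python
from typing import Any, Dict, List


def assert_headlines_match(docs: List[Dict[str, Any]], expected_headlines: List[List[str]]) -> None:
    # Without this check, zip() would silently skip unmatched expectations.
    assert len(docs) == len(expected_headlines), (
        f"expected {len(expected_headlines)} documents, got {len(docs)}"
    )
    for doc, expectation in zip(docs, expected_headlines):
        assert doc["meta"]["headlines"] == expectation


expected = [["Intro"], ["Methods"]]
two_docs = [
    {"meta": {"headlines": ["Intro"]}},
    {"meta": {"headlines": ["Methods"]}},
]
assert_headlines_match(two_docs, expected)
```

On Python 3.10 and later, `zip(docs, expected_headlines, strict=True)` is an alternative that raises `ValueError` when the lengths differ.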
Related Issues
Proposed Changes:
This PR adds the possibility to extract headlines out of PDF files using ParsrConverter. It follows the structure for headlines as defined in #3445.
During development, I noticed that Parsr's built-in headline extraction does not work well on all PDFs, so use this with caution.
How did you test it?
I added a unit test.
Checklist