Skip to content

feat: Add headline extraction to ParsrConverter#3488

Merged
bogdankostic merged 6 commits intomainfrom
parsr_headlines
Oct 31, 2022
Merged

feat: Add headline extraction to ParsrConverter#3488
bogdankostic merged 6 commits intomainfrom
parsr_headlines

Conversation

@bogdankostic
Copy link
Contributor

Related Issues

Proposed Changes:

This PR adds the possibility to extract headlines out of PDF files using ParsrConverter. It follows the structure for headlines as defined in #3445.
During development, I noticed that Parsr's built-in headline extraction does not work well on all PDFs, so use this with caution.

How did you test it?

I added a unit test.

Checklist

@bogdankostic bogdankostic marked this pull request as ready for review October 28, 2022 09:36
@bogdankostic bogdankostic requested a review from a team as a code owner October 28, 2022 09:36
@bogdankostic bogdankostic requested review from masci and removed request for a team October 28, 2022 09:36
@masci
Copy link
Contributor

masci commented Oct 28, 2022

Can you rebase on top of main?

@bogdankostic bogdankostic requested a review from a team as a code owner October 28, 2022 15:26
@bogdankostic
Copy link
Contributor Author

Can you rebase on top of main?

Done ✔️

@bogdankostic bogdankostic removed the request for review from a team October 31, 2022 08:19
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good

id_hash_keys = self.id_hash_keys
valid_languages = valid_languages if valid_languages is not None else self.valid_languages
id_hash_keys = id_hash_keys if id_hash_keys is not None else self.id_hash_keys
extract_headlines = extract_headlines if extract_headlines is not None else self.extract_headlines
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I find the other form more concise and readable, these one-liners go around the 100th column

f"been decoded in the correct text format."
)

if self.extract_headlines:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be if extract_headlines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch

)

if self.extract_headlines:
meta = meta if meta else {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check if meta is None here, you could also move this "initialization" of the param at the beginning of the method, along with the others


if extract_headlines:
relevant_headlines = []
cur_lowest_headline_level = 1000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if you use sys.maxsize instead of 1000, your intent is more obvious (for example, at first I was wondering if 1000 was somehow special in this context)

headline_copy["start_idx"] = None
relevant_headlines.append(headline_copy)
cur_lowest_headline_level = headline_copy["level"]
relevant_headlines = list(reversed(relevant_headlines))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the "alien face" operator would be faster here: relevant_headlines = relevant_headlines[::-1]. I expect the list to be small enough for this not not matter, leaving here just in case.

converter = ParsrConverter()

docs = converter.convert(file_path=str((SAMPLES_PATH / "pdf" / "sample_pdf_4.pdf").absolute()))
for doc, expectation in zip(docs, expected_headlines):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we assert the number of docs is exactly 2 before getting here? The second item in expected_headlines wouldn't be tested if some day docs contained only one item.

@bogdankostic bogdankostic requested a review from masci October 31, 2022 17:07
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@bogdankostic bogdankostic merged commit 6022441 into main Oct 31, 2022
@bogdankostic bogdankostic deleted the parsr_headlines branch October 31, 2022 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make use of Parsr's heading detection

2 participants