How to read all the text in a page excluding page header/footer. #404

Julio-German-Gutierrez · 2022-01-03T11:53:56Z

Julio-German-Gutierrez
Jan 3, 2022

Hey, I just discovered your library and it is great!
I am working on a small project that produces simple summaries of PDF text documents. So far I have dealt with page headers and footers using regular expressions. As you can imagine, this is far from ideal, because the expressions have to be tailored to every new document. I am currently using the "text" property of a page (myPage.text) to access the text content. The problem is that this also returns the header and footer.

Is there a way to get only the text content, without the header and footer?

Thanks a lot!

Answered by BobLd

Jan 3, 2022


BobLd · 2022-01-03T11:59:40Z

BobLd
Jan 3, 2022
Maintainer

Hi @Julio-German-Gutierrez, a first step would be to identify the headers and footers (as you are already trying to do with regex).
Have a look at the Wiki page on the Decoration Text Block Classifier for another method of finding headers and footers. It is far from perfect, but it is a start.

You will then have to remove them from your text. I strongly advise reading the whole Document Layout Analysis wiki page for a better understanding.
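The two steps above (identify the decorations, then remove them) can be illustrated with a small library-agnostic sketch in Python. It is a simplified stand-in, not the library's Decoration Text Block Classifier: it assumes each page has already been split into text blocks, and it treats any block whose digit-normalized text recurs on most pages as a header or footer. The function names, the input format, and the `min_share` threshold are all hypothetical.

```python
import re
from collections import Counter


def normalize(block_text):
    """Collapse digits and whitespace so 'Page 3' and 'Page 12' compare equal."""
    text = re.sub(r"\d+", "#", block_text)
    return re.sub(r"\s+", " ", text).strip().lower()


def find_decorations(pages, min_share=0.6):
    """Return normalized block texts that occur on at least min_share of pages.

    pages: list of pages, each a list of text-block strings (assumed input;
    in practice the blocks would come from a page segmenter).
    """
    if len(pages) < 2:
        # Repetition cannot be judged from a single page.
        return set()
    counts = Counter()
    for blocks in pages:
        # Count each candidate once per page, however often it repeats there.
        counts.update({normalize(b) for b in blocks})
    threshold = min_share * len(pages)
    return {text for text, n in counts.items() if n >= threshold}


def strip_decorations(pages, min_share=0.6):
    """Return one string per page with the repeated blocks removed."""
    decorations = find_decorations(pages, min_share)
    return [
        "\n".join(b for b in blocks if normalize(b) not in decorations)
        for blocks in pages
    ]


pages = [
    ["ACME Report 2021", "Introduction to the topic.", "Page 1 of 3"],
    ["ACME Report 2021", "Methods and data.", "Page 2 of 3"],
    ["ACME Report 2021", "Results and discussion.", "Page 3 of 3"],
]
body = strip_decorations(pages)
```

Unlike a hand-written regex per document, this only relies on repetition across pages, so it needs no tuning for a new layout. It will also drop any body block that genuinely repeats on most pages, which is one reason such heuristics stay imperfect.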

3 replies

Julio-German-Gutierrez Jan 3, 2022
Author

Thanks a lot!

BobLd Jan 3, 2022
Maintainer

Have a look at the page segmenters.
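For readers unfamiliar with the term, a page segmenter groups the text on a page into blocks, so that a header or footer can be kept or dropped as a whole unit. The Python sketch below shows the idea in its simplest form, assuming text lines with known top and bottom coordinates (y growing upward, as in PDF) and starting a new block whenever the vertical gap exceeds a threshold. It is not one of the library's segmenters; the tuple format and the `max_gap` value are hypothetical.

```python
def segment_lines(lines, max_gap=6.0):
    """Group text lines into blocks separated by large vertical gaps.

    lines: list of (top, bottom, text) tuples, with y growing upward
    (assumed format for illustration).
    """
    blocks = []
    previous_bottom = None
    # Read from the top of the page down: larger y values come first.
    for top, bottom, text in sorted(lines, key=lambda line: -line[0]):
        if previous_bottom is None or previous_bottom - top > max_gap:
            blocks.append([])  # gap too large: start a new block
        blocks[-1].append(text)
        previous_bottom = bottom
    return ["\n".join(block) for block in blocks]


page_lines = [
    (780, 770, "ACME Report 2021"),
    (700, 690, "First line of the body"),
    (688, 678, "second line of the body."),
    (40, 30, "Page 1 of 3"),
]
blocks = segment_lines(page_lines)
```

Here the header and the footer each end up in a block of their own, separate from the two-line body block, because they are far from the body text vertically.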

Julio-German-Gutierrez Jan 3, 2022
Author

It works great, thanks!
