
How to read all the text in a page excluding page header/footer. #404

Hi @Julio-German-Gutierrez, a first step would be to identify the headers and footers (as you are trying to do with regex).
Have a look at the Decoration Text Block Classifier page in the Wiki for another way to find headers and footers. It is far from perfect, but it is a start.

You'll then have to remove them from your text. I strongly advise reading the whole Document Layout Analysis wiki page for a better understanding.
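
To make the two steps concrete (find the decorations, then remove them), here is a minimal, library-independent sketch in Python. It is not the Decoration Text Block Classifier from the Wiki and it does not use any PDF library. It assumes you have already extracted each page as a list of text lines in reading order, and it uses a simpler heuristic: a line counts as a header or footer if it sits near the top or bottom of the page and recurs on most pages once digits are masked. The names `normalize`, `find_decorations`, `strip_decorations`, `edge_lines` and `min_share` are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter


def normalize(line):
    # Mask digit runs so "Page 1" and "Page 2" compare as the same text.
    return re.sub(r"\d+", "#", line.strip().lower())


def find_decorations(pages, edge_lines=2, min_share=0.6):
    """Return normalized lines that recur at the top or bottom of pages.

    pages: list of pages, each a list of text lines in reading order
           (assumed input, produced by whatever extractor you use).
    """
    counts = Counter()
    for lines in pages:
        # Only the first and last few lines of a page are candidates.
        edges = {normalize(l) for l in lines[:edge_lines] + lines[-edge_lines:]}
        counts.update(edges)
    threshold = min_share * len(pages)
    return {text for text, n in counts.items() if text and n >= threshold}


def strip_decorations(pages, edge_lines=2, min_share=0.6):
    """Drop edge lines that were flagged as headers or footers."""
    decorations = find_decorations(pages, edge_lines, min_share)
    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge_lines or i >= len(lines) - edge_lines
            if at_edge and normalize(line) in decorations:
                continue  # skip a repeated header/footer line
            kept.append(line)
        cleaned.append(kept)
    return cleaned


pages = [
    ["ACME Report", "Intro text", "More body", "Page 1"],
    ["ACME Report", "Second page body", "Page 2"],
    ["ACME Report", "Third page body", "Closing", "Page 3"],
]
body = strip_decorations(pages)
```

With the sample input, `body` keeps only the body lines of each page. Like the classifier in the Wiki, this kind of heuristic is far from perfect: it needs several pages to compare, and a body line that happens to repeat near a page edge would be dropped too, so treat `edge_lines` and `min_share` as values to tune for your documents.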

Answer selected by BobLd