How to read all the text in a page excluding page header/footer. #404
Hey, I just discovered your library and it is great! Is there a way to get only text-content without header/footer? Thanks a lot!
Hi @Julio-German-Gutierrez, a first step would be to identify the headers and footers (as you are trying to do with regex).
Have a look at the Decoration Text Block Classifier page in the Wiki for another way to find headers and footers. It is far from perfect, but it is a start...
You'll then have to remove them from your text. I strongly advise reading the whole Document Layout Analysis wiki page for a better understanding.
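To make the two steps (identify, then remove) concrete, here is a minimal, library-independent sketch in Python. It is not the library's API and not the Decoration Text Block Classifier itself. It assumes you have already extracted each page as a list of text lines, and it treats a line as a header/footer when it sits near the top or bottom of the page and repeats on most pages. The function names (`find_decorations`, `strip_decorations`) and the parameters (`edge`, `min_share`) are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digit runs so "Page 3" and "Page 12" compare as equal.
    return re.sub(r"\d+", "#", line.strip().lower())


def find_decorations(pages, edge=2, min_share=0.6):
    """Return normalized lines that recur at page edges (assumed headers/footers).

    pages: list of pages, each page a list of text lines (top to bottom).
    edge: how many lines at the top and bottom of each page to inspect.
    min_share: fraction of pages a line must appear on to count as decoration.
    """
    counts = Counter()
    for lines in pages:
        # A set, so each page votes at most once per normalized line.
        candidates = {
            _normalize(line)
            for i, line in enumerate(lines)
            if line.strip() and (i < edge or i >= len(lines) - edge)
        }
        counts.update(candidates)
    # Require at least two pages so a single page never defines a header.
    threshold = max(2, min_share * len(pages))
    return {key for key, n in counts.items() if n >= threshold}


def strip_decorations(pages, edge=2, min_share=0.6):
    """Remove the detected header/footer lines from the edges of every page."""
    decorations = find_decorations(pages, edge, min_share)
    cleaned = []
    for lines in pages:
        kept = [
            line
            for i, line in enumerate(lines)
            if not (
                (i < edge or i >= len(lines) - edge)
                and _normalize(line) in decorations
            )
        ]
        cleaned.append(kept)
    return cleaned


pages = [
    ["ACME Report", "Intro text", "More body", "Page 1"],
    ["ACME Report", "Second body", "Other", "Page 2"],
    ["ACME Report", "Third", "Final", "Page 3"],
]
body_only = strip_decorations(pages)
```

Like the wiki method, this is only a start: it will miss headers that change from page to page (chapter titles, for example) and can wrongly drop body text that happens to repeat at a page edge, so tune `edge` and `min_share` for your documents.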