How to find the page number for the Table Of Contents (TOC) in a PDF file

Zarharan/PDF-TOC-Page-No



In this technical assignment, I developed a heuristic, rule-based model that finds and returns the number of the PDF page containing the Table Of Contents (TOC) of an annual report (e.g., page 3 in "abcam_2013.pdf").
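To make the idea of a heuristic, rule-based TOC finder concrete, here is a minimal sketch. It is not the repository's actual implementation: the keyword list, the entry-line pattern, the weights, and the names `score_page` and `find_toc_page` are all assumptions chosen for illustration. It takes the text of each page as a plain string, so text extraction is out of scope here.

```python
import re
from typing import List, Optional

# Hypothetical heading keywords (English plus simplified/traditional Chinese).
TOC_KEYWORDS = ("table of contents", "contents", "目录", "目錄")

# Assumed shape of a TOC entry: some title text, then spaces or dot leaders,
# then a 1-3 digit page number at the end of the line.
ENTRY_LINE = re.compile(r"^\s*\S.*?[\s.·…]+\d{1,3}\s*$")


def score_page(text: str) -> float:
    """Score how TOC-like a page is: heading keyword hit plus share of entry lines."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    # Rule 1: a TOC keyword appears in one of the first three non-empty lines.
    heading_hit = any(kw in ln.lower() for ln in lines[:3] for kw in TOC_KEYWORDS)
    # Rule 2: fraction of lines that look like "title ... page number".
    entry_ratio = sum(1 for ln in lines if ENTRY_LINE.match(ln)) / len(lines)
    return (1.0 if heading_hit else 0.0) + entry_ratio


def find_toc_page(pages: List[str], max_pages: int = 15,
                  min_score: float = 0.5) -> Optional[int]:
    """Return the 1-based number of the best-scoring page, or None if none qualifies."""
    best_page, best_score = None, min_score
    # Assumption: the TOC sits near the front, so only the first pages are scanned.
    for number, text in enumerate(pages[:max_pages], start=1):
        score = score_page(text)
        if score > best_score:  # strict comparison keeps the earliest page on ties
            best_page, best_score = number, score
    return best_page


# Made-up sample pages for illustration only (not taken from any real report).
sample_pages = [
    "Example plc\nAnnual Report and Accounts 2013",
    "Highlights\nRevenue grew during the year.\nWe continued to invest.",
    "Contents\nOverview .... 1\nStrategic report 4\nGovernance 28\nFinancial statements 52",
    "Chairman's statement\nThis year the group made progress.",
]
print(find_toc_page(sample_pages))  # prints 3
```

In practice the page strings would come from a PDF text extractor, for example pypdf's `PdfReader(path).pages` with `page.extract_text()` on each page. The thresholds and weights above are arbitrary starting points and would need tuning on real reports.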

Although the proposed model relies on hand-crafted features to find the TOC page number, the task could also be reframed as a clustering problem using BERTopic or LDA. The PDFs are written in different languages, such as English and Chinese, which makes text extraction difficult and the overall task challenging.
