How to find the page number for the Table Of Contents (TOC) in a PDF file

For this technical assignment, I developed a heuristic, rule-based model that finds and returns the number of the PDF page containing the Table Of Contents (TOC) of an annual report (e.g. page 3 in "abcam_2013.pdf").
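To make the idea of a rule-based TOC detector concrete, here is a minimal sketch. It is not the notebook's actual feature set: the keyword list, the entry pattern, the weights, and the function names `score_page` and `find_toc_page` are all assumptions chosen for illustration. It also assumes the text of each page has already been extracted into a list of strings, one per page.

```python
import re
from typing import List, Optional

# Hypothetical keyword list (English and Chinese); not taken from the notebook.
TOC_KEYWORDS = ("table of contents", "contents", "目录", "目錄")

# A TOC entry often looks like "Governance ..... 30":
# some text, then spaces or dot leaders, then a short trailing page number.
ENTRY_PATTERN = re.compile(r"^\s*\S.*?[\s.·…]+\d{1,3}\s*$")


def score_page(text: str) -> float:
    """Return a heuristic score for how much a page looks like a TOC."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    lowered = text.lower()
    keyword_hit = any(keyword in lowered for keyword in TOC_KEYWORDS)
    entry_lines = sum(1 for line in lines if ENTRY_PATTERN.match(line))
    entry_ratio = entry_lines / len(lines)
    # The weights 2.0 and 3.0 are arbitrary illustrative choices.
    return (2.0 if keyword_hit else 0.0) + 3.0 * entry_ratio


def find_toc_page(pages: List[str], min_score: float = 2.5) -> Optional[int]:
    """Return the 1-based number of the best page scoring above min_score."""
    best_number, best_score = None, min_score
    for number, text in enumerate(pages, start=1):
        score = score_page(text)
        if score > best_score:
            best_number, best_score = number, score
    return best_number


if __name__ == "__main__":
    pages = [
        "Example plc\nAnnual Report 2013",
        "Highlights\nRevenue grew during the year.",
        "Contents\nStrategic report ..... 4\nGovernance ..... 30",
        "Strategic report\nOur business model is described here.",
    ]
    print(find_toc_page(pages))
```

With the default threshold, a keyword match alone is not enough: a page must also contain lines that look like TOC entries, which keeps pages that merely mention "contents" in running prose from being selected.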

The proposed model relies on hand-crafted features to find the TOC page number, but the problem could also be reframed as a clustering problem using BERTopic and LDA. The PDFs are written in different languages, such as English and Chinese, which makes text extraction difficult and the task challenging.

pdfs
0978-22 - Majid Zarharan.ipynb
README.md

Zarharan/PDF-TOC-Page-No
