In this technical assignment, I developed a heuristic, rule-based model that returns the number of the PDF page containing the Table of Contents (TOC) of an annual report (e.g., page 3 in "abcam_2013.pdf").
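To make the idea of a rule-based TOC detector concrete, here is a minimal sketch. It is not the exact model from this assignment: the keywords, the entry pattern, the scoring weights, the threshold, and the function names `score_page` and `find_toc_page` are all illustrative assumptions. It also assumes the text of each page has already been extracted into a list of strings, one per page.

```python
import re
from typing import List, Optional

# Hypothetical heading keywords. "contents" also covers "Table of Contents";
# the Chinese entries are the simplified and traditional forms of "contents".
TOC_KEYWORDS = ("contents", "目录", "目錄")

# Assumed shape of a TOC entry: some text, then spaces or dot leaders,
# then a page number of at most three digits at the end of the line.
ENTRY_PATTERN = re.compile(r"^\s*\S.*?[\s.]+\d{1,3}\s*$")


def score_page(text: str) -> float:
    """Score how TOC-like a page is: keyword hit (0 or 1) plus entry ratio (0..1)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    # Only look for the heading keyword near the top of the page.
    head = " ".join(lines[:5]).lower()
    keyword_hit = any(kw in head for kw in TOC_KEYWORDS)
    # Fraction of non-empty lines that look like "title .... 12".
    entry_ratio = sum(bool(ENTRY_PATTERN.match(ln)) for ln in lines) / len(lines)
    return (1.0 if keyword_hit else 0.0) + entry_ratio


def find_toc_page(
    pages: List[str], max_pages: int = 15, threshold: float = 1.2
) -> Optional[int]:
    """Return the 1-based number of the best-scoring page, or None if no
    page among the first `max_pages` scores above `threshold`."""
    best_page, best_score = None, threshold
    for number, text in enumerate(pages[:max_pages], start=1):
        score = score_page(text)
        if score > best_score:
            best_page, best_score = number, score
    return best_page


if __name__ == "__main__":
    # Made-up page texts, for illustration only.
    sample_pages = [
        "Annual Report 2013\nAbcam plc",
        "Dear shareholders,\nThis was a strong year.",
        "Contents\nChairman's statement .... 4\nStrategic report .... 10\n"
        "Governance 32\nFinancial statements 60",
    ]
    print(find_toc_page(sample_pages))
```

Restricting the search to the first few pages and requiring both a heading keyword and a reasonable share of entry-like lines is one plausible way to avoid false positives on pages that merely mention the word "contents".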
Although I used hand-crafted features to find the TOC page number in the proposed model, the problem could also be reframed as a clustering problem using topic models such as BERTopic or LDA. The PDFs are written in different languages, such as English and Chinese, which makes text extraction difficult and the overall task challenging.