alphydan/igcse-paper-parser

igcse-paper-parser

Parses PDFs from past papers and organizes them by topic

Intro

This script reads the PDFs of past iGCSE physics exams and classifies each problem by topic.
For example, the 2007 exam may be split into 10 separate PDFs, each saved to the directory for its topic: Forces, Waves, Radioactivity, etc.

Strategy:

  1. Read a paper and find the page numbers for each problem (cf. paperparse.py)
  2. Import the keywords for each category (cf. igcse_categories.csv)
  3. Give each problem a score for each category (cf. problem_classifier.py). For example, a problem may score 3 for Waves but only 1 for Forces, so it most likely belongs to Waves.
  4. Name each problem's PDF according to the chosen category and save it in the relevant directory (cf. paper_classifier.py)
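
Step 3 can be sketched as a simple keyword count. This is only an illustration of the idea, not the code in problem_classifier.py: the keyword table, the function names and the sample text below are all hypothetical, and the real keywords live in igcse_categories.csv.

```python
import re
from collections import Counter

# Hypothetical keyword table; the real one is loaded from igcse_categories.csv.
CATEGORY_KEYWORDS = {
    "Waves": ["wavelength", "frequency", "refraction"],
    "Forces": ["force", "acceleration", "momentum"],
}


def score_problem(text, category_keywords):
    """Count how many times each category's keywords appear in the text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in category_keywords.items()
    }


def best_category(scores):
    """Return the highest-scoring category (ties go to the first one listed)."""
    return max(scores, key=scores.get)


# Made-up problem text, used only to exercise the functions.
sample = (
    "A wave of frequency 2 Hz and wavelength 3 m undergoes refraction. "
    "The force on the boat is small."
)
scores = score_problem(sample, CATEGORY_KEYWORDS)
print(scores, best_category(scores))
```

Counting whole words keeps the sketch short; it would miss plurals such as "forces" unless they are listed as keywords too.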

We use a few simplifying assumptions:

- A new question always starts on a new page.
- A problem is between 1 and 3 pages long.
- A blank section is at most 1 page long.
- No problem ever appears on page zero.
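
Under these assumptions, a list of problem start pages is enough to work out which pages belong to each problem. The sketch below is one possible reading, not the logic in paperparse.py: the function name and its parameters are hypothetical, pages are assumed to be counted from zero, and capping a problem at 3 pages is just one way to apply the second assumption.

```python
def problem_page_ranges(start_pages, total_pages, max_length=3):
    """Turn problem start pages into inclusive (first, last) page ranges.

    Pages are zero-indexed and total_pages is the page count of the paper.
    """
    starts = sorted(set(start_pages))
    if starts and starts[0] == 0:
        raise ValueError("no problem may start on page zero")

    ranges = []
    for index, first in enumerate(starts):
        # A problem ends just before the next one starts,
        # or at the end of the paper for the last problem.
        if index + 1 < len(starts):
            next_start = starts[index + 1]
        else:
            next_start = total_pages
        # Apply the assumption that a problem is at most max_length pages long.
        last = min(next_start - 1, first + max_length - 1)
        ranges.append((first, last))
    return ranges


# Problems starting on pages 1, 3 and 4 of an 8-page paper.
print(problem_page_ranges([1, 3, 4], total_pages=8))
```

Any pages left over after the cap (here, page 7) are treated as not belonging to a problem.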

Requirements

You will need Python and pyPDF, which you can install with:

    pip install pyPDF

This script also uses the Linux command-line utility pdftotext, which can also be installed on macOS and Windows.
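
As a rough illustration of how pdftotext can be called from Python, the sketch below extracts the text of a single page. The function names are hypothetical and this is not necessarily how the script invokes the tool. It relies on the pdftotext options -f and -l (first and last page, counted from 1), -layout, and "-" as the output file to write to standard output.

```python
import subprocess


def build_pdftotext_command(pdf_path, page):
    """Build a pdftotext command that prints one page (1-based) to stdout."""
    return ["pdftotext", "-f", str(page), "-l", str(page), "-layout", pdf_path, "-"]


def extract_page_text(pdf_path, page):
    """Run pdftotext and return the text of a single page.

    Requires pdftotext to be installed and available on the PATH.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, page),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Note that pdftotext numbers pages from 1, so a zero-indexed page number needs 1 added before it is passed in.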
