The AP Exam Corpus Project is a Python application for generating corpora for AP exams.
Currently, the application generates a TEI-compliant corpus for the Free Response section of the AP US History exam.
The input to the application is the path of an exam in PDF format. The default output is formatted XML to stdout.
- Improve parsing robustness in order to handle other AP exam formats.
- Adopt the Corpus Encoding Standard for the corpus.
- Expand to more AP subjects (arts, English, math, and computer science) and paper formats (multiple choice, short response questions).
Clone the repository. Use the package manager pip to install required packages.
pip install -r requirements.txt
main.py -h
usage: main.py [-h] -f FILE [-o {xml,text,sections}] [-c CONFIG]
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE Path of PDF file
-o {xml,text,sections}, --output {xml,text,sections}
Output format
-c CONFIG, --config CONFIG
Path of yaml config file
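For readers who want to see how the options above fit together, the following is a minimal sketch of an argparse parser that accepts the same flags. It is not the project's actual main.py: the function name build_parser, the example file name exam.pdf, and the choice of xml as the default for -o are assumptions made for illustration (the default is inferred from the statement that the default output is formatted XML).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the usage text; not the project's own code."""
    parser = argparse.ArgumentParser(prog="main.py")
    # -f is shown without brackets in the usage line, so it is treated as required.
    parser.add_argument("-f", "--file", required=True, help="Path of PDF file")
    # Assumption: xml is the default output format.
    parser.add_argument(
        "-o",
        "--output",
        choices=["xml", "text", "sections"],
        default="xml",
        help="Output format",
    )
    parser.add_argument("-c", "--config", help="Path of yaml config file")
    return parser


# Parse a sample command line; exam.pdf is a placeholder path.
args = build_parser().parse_args(["-f", "exam.pdf", "-o", "sections"])
print(args.file, args.output, args.config)
```

Passing a value outside xml, text, or sections to -o makes argparse exit with a usage error, which matches the choice list printed in the help text.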
The default output is a TEI corpus document. Structural elements are identified with a combination of the id attribute and the span element. An abbreviated example is below:
<TEI xmlns:ns="http://www.tei-c.org/ns/1.0">
<teiHeader>
...
</teiHeader>
<text>
<body>
<list>
<item xml:id="s0">
<l>
<s> <!-- sentence -->
<w pos="CD">2017</w> <!-- word -->
...
</list>
<spanGrp type="structure">
<span from="s0">non_pedagogical</span> <!-- the from attribute links to the xml:id attribute of the item above -->
...
</spanGrp>
</body>
</text>
</TEI>
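The link between span and item can be followed with Python's standard xml.etree.ElementTree module. The sketch below is illustrative only: SAMPLE is a small, well-formed stand-in for the abbreviated output above (the second word and its pos value are made up), and label_items is a hypothetical helper, not part of the application. It assumes elements carry no namespace prefix, as in the example, and that from holds the bare id without a leading #.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the predefined XML namespace, so ElementTree exposes it under this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Made-up sample shaped like the abbreviated example; not real application output.
SAMPLE = """<TEI xmlns:ns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <list>
        <item xml:id="s0">
          <l>
            <s>
              <w pos="CD">2017</w>
              <w pos="NNP">AP</w>
            </s>
          </l>
        </item>
      </list>
      <spanGrp type="structure">
        <span from="s0">non_pedagogical</span>
      </spanGrp>
    </body>
  </text>
</TEI>"""


def label_items(xml_text):
    """Return (structure label, words) pairs by resolving each span to its item."""
    root = ET.fromstring(xml_text)
    # Index every item by its xml:id so spans can be resolved by lookup.
    items = {item.get(XML_ID): item for item in root.iter("item")}
    labelled = []
    for span in root.iter("span"):
        item = items.get(span.get("from"))
        if item is None:
            continue  # skip spans that point at an unknown id
        words = [w.text for w in item.iter("w")]
        labelled.append((span.text, words))
    return labelled


print(label_items(SAMPLE))
```

Indexing items by id first keeps the lookup a single pass over the document, which matters little for one exam but keeps the helper simple to reason about.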
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please update test cases as appropriate.