george-zip/ap_exam_to_corpus

AP Exam Corpus Project

The AP Exam Corpus Project is a Python application for generating corpora for AP exams.

Currently, the application generates a TEI-compliant corpus for the Free Response section of the AP US History exam.

The application takes the path to an exam in PDF format as input. By default, it writes formatted XML to stdout.

Future Work

  1. Improve parsing robustness in order to handle other AP exam formats.
  2. Adopt the Corpus Encoding Standard for the corpus.
  3. Expand to more AP subjects (arts, English, math, and computer science) and paper formats (multiple choice, short response questions).

Installation

Clone the repository. Use the package manager pip to install required packages.

pip install -r requirements.txt

Usage

main.py -h 
usage: main.py [-h] -f FILE [-o {xml,text,sections}] [-c CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path of PDF file
  -o {xml,text,sections}, --output {xml,text,sections}
                        Output format
  -c CONFIG, --config CONFIG
                        Path of yaml config file
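
For readers who want to see how an interface like this maps onto code, here is a minimal sketch of an argparse parser that would accept the same options as the help text above. It is not taken from main.py. The function name build_parser is hypothetical, and the default of xml for --output is an assumption based on the statement that the default output is XML.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help text (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # -f is shown without brackets in the usage line, so it is treated as required
    parser.add_argument("-f", "--file", required=True, help="Path of PDF file")
    # Assumption: xml is the default, since the README says the default output is XML
    parser.add_argument(
        "-o",
        "--output",
        choices=["xml", "text", "sections"],
        default="xml",
        help="Output format",
    )
    parser.add_argument("-c", "--config", help="Path of yaml config file")
    return parser


# Example: parse a command line that only supplies the required PDF path
args = build_parser().parse_args(["-f", "exam.pdf"])
print(args.file, args.output, args.config)
```

With only -f supplied, the sketch falls back to the assumed xml output format and leaves the config path unset.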

Output

The default output is a TEI corpus document. Structural elements are identified by pairing each item's xml:id attribute with a span element that references it. An abbreviated example follows:

<TEI xmlns:ns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		...
	</teiHeader>
	<text>
		<body>
			<list>
				<item xml:id="s0">
					<l>
						<s> <!-- sentence -->
							<w pos="CD">2017</w> <!-- word -->
					...
			</list>
			<spanGrp type="structure">
				<span from="s0">non_pedagogical</span> <!-- the from attribute links to the xml:id attribute of the item above -->
				...
			</spanGrp>
		</body>
	</text>
</TEI>
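
The link between span and item can be followed programmatically. Below is a rough sketch using Python's standard xml.etree.ElementTree. The SAMPLE document is a hypothetical, completed version of the abbreviated example above (the elided parts are left out, not guessed), and the function name structure_labels is an assumption for illustration. It also assumes the elements carry no namespace, as in the example. If the real output declares a default namespace, the tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the reserved XML namespace, so ElementTree exposes it under this key
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical minimal document modeled on the abbreviated example
SAMPLE = """<TEI xmlns:ns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <list>
        <item xml:id="s0">
          <l>
            <s>
              <w pos="CD">2017</w>
            </s>
          </l>
        </item>
      </list>
      <spanGrp type="structure">
        <span from="s0">non_pedagogical</span>
      </spanGrp>
    </body>
  </text>
</TEI>"""


def structure_labels(xml_text):
    """Map each item id to its structural label and the words it contains."""
    root = ET.fromstring(xml_text)
    # Index every item by its xml:id so spans can be resolved by lookup
    items = {item.get(XML_ID): item for item in root.iter("item")}
    labels = {}
    for span in root.iter("span"):
        target = span.get("from")
        if target in items:
            words = [w.text for w in items[target].iter("w")]
            labels[target] = (span.text, words)
    return labels


print(structure_labels(SAMPLE))
```

Indexing the items first keeps the lookup simple when a document contains many spans.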

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please update test cases as appropriate.

License

MIT
