AP Exam Corpus Project

The AP Exam Corpus Project is a Python application for generating corpora for AP exams.

Currently, the application generates a TEI-compliant corpus for the free-response section of the AP US History exam.

The application takes the path of an exam in PDF format as input. By default, it writes formatted XML to stdout.

Future Work

Improve parsing robustness in order to handle other AP exam formats.
Adopt the Corpus Encoding Standard for the corpus.
Expand to more AP subjects (arts, English, math, and computer science) and paper formats (multiple choice, short response questions).

Installation

Clone the repository. Use the package manager pip to install required packages.

pip install -r requirements.txt

Usage

main.py -h 
usage: main.py [-h] -f FILE [-o {xml,text,sections}] [-c CONFIG]

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path of PDF file
  -o {xml,text,sections}, --output {xml,text,sections}
                        Output format
  -c CONFIG, --config CONFIG
                        Path of yaml config file
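
The help text above is consistent with an argparse setup along the following lines. This is a reconstruction for illustration only, not the project's actual main.py: build_parser is a hypothetical name, and the "xml" default for -o is inferred from the statement that XML is the default output.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the help text shown above (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # -f appears without brackets in the usage line, so it is required.
    parser.add_argument("-f", "--file", required=True, help="Path of PDF file")
    # -o accepts one of three formats; the "xml" default is an assumption.
    parser.add_argument(
        "-o",
        "--output",
        choices=["xml", "text", "sections"],
        default="xml",
        help="Output format",
    )
    # -c is optional and points at a yaml config file.
    parser.add_argument("-c", "--config", help="Path of yaml config file")
    return parser


# Parse a sample command line instead of sys.argv; exam.pdf is a placeholder path.
args = build_parser().parse_args(["-f", "exam.pdf", "-o", "sections"])
print(args.file, args.output, args.config)
```

Because argparse marks -f as required, omitting it makes the parser exit with a usage error even though the help text lists it under "optional arguments".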

Output

The default output is a TEI corpus document. Each structural unit is an item element that carries an xml:id attribute, and a span element whose from attribute references that id assigns the unit its label. An abbreviated example is below:

<TEI xmlns:ns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		...
	</teiHeader>
	<text>
		<body>
			<list>
				<item xml:id="s0">
					<l>
						<s> <!-- sentence -->
							<w pos="CD">2017</w> <!-- word -->
					...
			</list>
			<spanGrp type="structure">
				<span from="s0">non_pedagogical</span> <!-- the from attribute references the xml:id of the item above -->
				...
			</spanGrp>
		</body>
	</text>
</TEI>
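
A consumer of this output could resolve each span back to its item with the standard library alone. The sketch below is one possible way to do that, not part of the project: words_by_label and SAMPLE are hypothetical names, SAMPLE is a completed version of the abbreviated example above, and the code assumes element names are unprefixed as shown there.

```python
import xml.etree.ElementTree as ET

# xml:id belongs to the reserved XML namespace, so ElementTree exposes it under this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, completed version of the abbreviated example (header omitted).
SAMPLE = """<TEI xmlns:ns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <list>
        <item xml:id="s0">
          <l>
            <s>
              <w pos="CD">2017</w>
            </s>
          </l>
        </item>
      </list>
      <spanGrp type="structure">
        <span from="s0">non_pedagogical</span>
      </spanGrp>
    </body>
  </text>
</TEI>"""


def words_by_label(xml_text):
    """Map each structure label to the words of the items it points at."""
    root = ET.fromstring(xml_text)
    # Index items by their xml:id so spans can be resolved in one lookup.
    items = {item.get(XML_ID): item for item in root.iter("item")}
    result = {}
    for span in root.iter("span"):
        # Tolerate a leading "#" in case the reference is written as a URI fragment.
        target = (span.get("from") or "").lstrip("#")
        item = items.get(target)
        if item is None:
            continue  # span points at an id that is not present
        words = [w.text for w in item.iter("w")]
        result.setdefault(span.text, []).extend(words)
    return result


print(words_by_label(SAMPLE))  # {'non_pedagogical': ['2017']}
```

If the real output places elements in a default namespace, the tag names passed to iter would need to be namespace-qualified.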

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please update test cases as appropriate.

License

MIT

