subcategorization-frames-and-learner-English-data
Resources for subcategorization frames (SCFs) and learner English data.
ef1000:
This directory contains gold-standard annotations of dependencies, SCFs, and learner errors for 1000 sentences sampled from the EF-Cambridge Open Language Database (EFCAMDAT), a large-scale learner English corpus. For details of the annotations, please refer to Huang et al. (2018) and Huang (2018).
scf_identifier:
This directory contains an SCF identifier, which identifies the SCFs of individual verb occurrences in native or learner English text. For the linguistic and technical details of the identifier, please refer to Huang (2018). For instructions on how to use the identifier, please refer to /scf_identifier/README.
scf_native:
This directory contains SCF resources and tools developed for native English. Please refer to /scf_native/README for an explanation of the files. These resources and tools were produced by the PANACEA project (Quochi et al., 2014) and the Lexical Acquisition for the Biomedical Domain project (Lippincott et al., 2013) led by Prof. Anna Korhonen.
References:
Huang Y. (2018). Automatic syntactic analysis of learner English. PhD thesis. Language Technology Lab, Faculty of Modern and Medieval Languages, University of Cambridge.
Huang Y., Murakami A., Alexopoulou T. & Korhonen A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28-54.
Quochi V., Frontini F., Bartolini R., Hamon O., Poch M., Padró M., Bel N., Thurmair G., Toral A. & Kamram A. (2014). Third evaluation report. Evaluation of PANACEA v3 and produced resources.
Lippincott T., Rimell L., Johnson H., Verspoor K. & Korhonen A. (2013). Acquisition and evaluation of verb subcategorization resources for biomedicine. Journal of Biomedical Informatics, 46(2), 228-237.