How to convert processed `Doc` to CoNLL-U format #5188

jyori112 · 2020-03-23T19:27:23Z

I would like to take a raw text file, apply dependency parsing, and output a text file in CoNLL-U format (the downstream program only accepts CoNLL-U). Since I need to process large texts, I would like to use spaCy, but I am not sure how to do this. In particular, I could not figure out how to obtain FEATS (the list of morphological features; see https://universaldependencies.org/format.html).

Your Environment

Operating System: Ubuntu 16.04 LTS
Python Version Used: Python 3.7.5
spaCy Version Used: spacy==2.2.3
Environment Information:

Info about spaCy

spaCy version: 2.2.3
Platform: Linux-4.4.0-21-generic-x86_64-with-debian-stretch-sid
Python version: 3.7.5

adrianeboyd · 2020-03-23T20:49:42Z

I'd currently recommend the textacy `export.doc_to_conll()` function for this. Support for morphological features is still under development in spaCy v2, so morphological features aren't consistently available for all languages or all tokens.
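For readers who want to see what such an export involves, here is a minimal sketch of writing the ten CoNLL-U columns by hand. It is not textacy's implementation. The function name `sentence_to_conllu` is hypothetical, and it assumes it receives one sentence whose tokens expose the spaCy attributes `i`, `text`, `lemma_`, `pos_`, `tag_`, `dep_` and `head`. FEATS, DEPS and MISC are left as `_`, and relabelling the root's relation as `root` is a choice made here, not something the thread prescribes.

```python
from typing import Iterable, List


def sentence_to_conllu(tokens: Iterable) -> str:
    """Format one sentence as tab-separated CoNLL-U rows.

    `tokens` is assumed to be a spaCy sentence `Span` (one item of
    `doc.sents`) or any sequence of objects with the same attributes.
    """
    tokens = list(tokens)
    start = tokens[0].i  # document-level index of the sentence's first token
    rows: List[str] = []
    for token in tokens:
        token_id = token.i - start + 1  # CoNLL-U IDs are 1-based per sentence
        # In spaCy the root token is its own head; CoNLL-U uses HEAD = 0.
        is_root = token.head.i == token.i
        head_id = 0 if is_root else token.head.i - start + 1
        deprel = "root" if is_root else token.dep_
        columns = [
            str(token_id),        # ID
            token.text,           # FORM
            token.lemma_ or "_",  # LEMMA
            token.pos_ or "_",    # UPOS
            token.tag_ or "_",    # XPOS
            "_",                  # FEATS (placeholder, not filled in here)
            str(head_id),         # HEAD
            deprel or "_",        # DEPREL
            "_",                  # DEPS
            "_",                  # MISC
        ]
        rows.append("\t".join(columns))
    return "\n".join(rows)
```

With a parsed `doc`, calling this once per item of `doc.sents` and separating the results with a blank line would give one CoNLL-U block per sentence.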

If a particular language has very fine-grained tags (some include the UD FEATS as part of the tag, like `ADJ__Gender=Masc|Number=Sing`), you might be able to use the tag and/or the tag map to create relatively accurate FEATS. You could extend the textacy exporter to handle this fairly easily, but only for some of the languages supported by spaCy. (For the provided models it looks like it could work for Spanish, French, Italian, Norwegian, and Dutch. You'd want to double-check that the FEATS accuracy is high enough for your task.)
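As an illustration of the tag-based approach, the sketch below splits a fine-grained tag of the form `ADJ__Gender=Masc|Number=Sing` into a part-of-speech prefix and a FEATS string. The helper name `split_tag` is hypothetical, and the `__` separator is assumed from the example tag above; tags in a given model may follow a different scheme, so check them before relying on this.

```python
from typing import Tuple


def split_tag(tag: str) -> Tuple[str, str]:
    """Split a tag like 'ADJ__Gender=Masc|Number=Sing' into (pos, feats).

    Returns '_' for FEATS when the tag carries no 'Name=Value' pairs.
    """
    pos, sep, feats = tag.partition("__")
    if not sep:
        return tag, "_"  # no '__' separator: nothing to extract
    # Keep only well-formed 'Name=Value' pairs.
    pairs = [f for f in feats.split("|") if "=" in f]
    if not pairs:
        return pos, "_"
    # CoNLL-U expects features sorted by attribute name, ignoring case.
    pairs.sort(key=lambda f: f.split("=", 1)[0].lower())
    return pos, "|".join(pairs)
```

The returned FEATS string could then fill the sixth column of each CoNLL-U row, with `token.tag_` as the input.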

(There will be a statistical model for morphological features in spaCy v3.)

lock · 2020-05-05T21:35:31Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

svlandeg added the labels feat / pipeline (Feature: Processing pipeline and components) and feat / serialize (Feature: Serialization, saving and loading) Mar 23, 2020

adrianeboyd closed this as completed Apr 1, 2020

lock bot locked as resolved and limited conversation to collaborators May 5, 2020
