How to convert processed Doc to CoNLL-U format #5188

Closed
jyori112 opened this issue Mar 23, 2020 · 2 comments
Labels
feat / pipeline Feature: Processing pipeline and components feat / serialize Feature: Serialization, saving and loading

Comments

@jyori112
I would like to take a raw text file, apply dependency parsing, and output a text file in CoNLL-U format (the downstream program only accepts CoNLL-U). Since I need to process large texts, I would like to use spaCy, but I am not sure how to do this. In particular, I could not figure out how to obtain FEATS (the list of morphological features; see https://universaldependencies.org/format.html).

Your Environment

  • Operating System: Ubuntu 16.04 LTS
  • Python Version Used: Python 3.7.5
  • spaCy Version Used: spacy==2.2.3
  • Environment Information:

Info about spaCy

  • spaCy version: 2.2.3
  • Platform: Linux-4.4.0-21-generic-x86_64-with-debian-stretch-sid
  • Python version: 3.7.5
@adrianeboyd
Contributor

I'd currently recommend the textacy export.doc_to_conll() function for this. Support for morphological features is still under development in spaCy v2, so morphological features aren't consistently available for all languages or all tokens.
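For orientation, here is a rough sketch of the kind of conversion such an exporter performs, written against spaCy's documented token attributes (`i`, `text`, `lemma_`, `pos_`, `tag_`, `head`, `dep_`, `whitespace_`). The helper name `sent_to_conllu` and the `feats_fn` parameter are hypothetical and not part of spaCy or textacy. It assumes the model's dependency labels are usable as DEPREL (the English models use a non-UD label set) and it does not handle whitespace-only tokens.

```python
def sent_to_conllu(sent, feats_fn=None):
    """Render one sentence (an iterable of spaCy-like tokens) as CoNLL-U lines.

    feats_fn is an optional, hypothetical hook that maps a token to a FEATS
    string; without it the FEATS column is left as '_'.
    """
    tokens = list(sent)
    if not tokens:
        return ""
    # token.i is a document-wide index, while CoNLL-U IDs restart at 1 per sentence
    offset = tokens[0].i
    rows = []
    for tok in tokens:
        is_root = tok.head.i == tok.i  # spaCy marks the root as its own head
        head = 0 if is_root else tok.head.i - offset + 1
        deprel = "root" if is_root else tok.dep_
        feats = feats_fn(tok) if feats_fn else "_"
        misc = "_" if tok.whitespace_ else "SpaceAfter=No"
        rows.append("\t".join([
            str(tok.i - offset + 1),  # ID
            tok.text,                 # FORM
            tok.lemma_ or "_",        # LEMMA
            tok.pos_ or "_",          # UPOS
            tok.tag_ or "_",          # XPOS
            feats or "_",             # FEATS
            str(head),                # HEAD
            deprel or "_",            # DEPREL
            "_",                      # DEPS (not produced by spaCy)
            misc,                     # MISC
        ]))
    return "\n".join(rows) + "\n"
```

With a parsed `doc`, writing `sent_to_conllu(sent)` followed by a blank line for each `sent` in `doc.sents` should give one CoNLL-U sentence block per sentence. For large inputs, feeding the texts through `nlp.pipe(...)` avoids processing them one call at a time.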

If a particular language has very fine-grained tags (some include the UD FEATS as part of the tag, like ADJ__Gender=Masc|Number=Sing), you might be able to use the tag and/or the tag map to create relatively accurate FEATS. You could extend the textacy exporter to handle this pretty easily, but only for some of the languages supported by spaCy. (For the provided models, it looks like it could work for Spanish, French, Italian, Norwegian, and Dutch. You'd want to double-check that the FEATS accuracy is high enough for your task.)
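As a minimal sketch of the tag-splitting idea, assuming the fine-grained tag (`token.tag_`) follows the `POS__Feature=Value|Feature=Value` pattern shown above: the helper name `feats_from_tag` is hypothetical, and tags that don't follow this pattern simply yield `_`.

```python
def feats_from_tag(tag):
    """Split a fine-grained tag such as 'ADJ__Gender=Masc|Number=Sing'.

    Returns (coarse_part, feats); feats is '_' when the tag carries no features.
    """
    coarse, sep, feats = tag.partition("__")
    if not sep:
        return coarse, "_"
    # Keep only well-formed Name=Value pairs
    pairs = [p for p in feats.split("|") if "=" in p]
    # CoNLL-U expects features sorted by name, ignoring case
    pairs.sort(key=lambda p: p.split("=", 1)[0].lower())
    return coarse, "|".join(pairs) if pairs else "_"
```

The second element could be written to the FEATS column for each token, for example via `feats_from_tag(token.tag_)[1]`. Whether the result is accurate enough still depends on the model's tagging quality for that language.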

(There will be a statistical model for morphological features in spacy v3.)

@svlandeg svlandeg added feat / pipeline Feature: Processing pipeline and components feat / serialize Feature: Serialization, saving and loading labels Mar 23, 2020