v1->v2 conversion script? #376

Closed
nschneid opened this Issue Dec 15, 2016 · 17 comments

9 participants
@nschneid
Contributor

nschneid commented Dec 15, 2016

I'm involved in an annotation project that has produced some v1 data for English (POS + basic dependencies). We'd obviously like to automate as much of the conversion to v2 as possible. Does anyone have a converter tool? I didn't see any listed on http://universaldependencies.org/tools.html.

Also, we'd appreciate guidance on which aspects of the conversion CANNOT be automated. My impression from the summary of changes is that the new treatment of ellipsis cannot be fully automated, but most of the other changes are deterministic.

Member

sebschu commented Dec 15, 2016

I don't think we have anything like that so far but I'll definitely write some scripts to convert the English Treebank to UDv2 and I can share them with you once I have them ready (probably at some point in early/mid January).

And yes, most of the changes can be made automatically. As you say, elliptical constructions cannot be converted automatically, but they should be quite rare, and you can find them by looking for sentences with a remnant relation.

The second thing that cannot be fully automated is the split of nmod into nmod and obl. In most cases, whenever a PP depends on a predicate, the relation will be obl. In copular constructions with a nominal head, however, this is ambiguous: it depends on whether the PP modifies the nominal or the entire clause.

For example:

The talk is in the Greenberg room at noon.

obl(room, noon)

But:

This is the key to the apartment.

nmod(key, apartment)

I think everything else can be done automatically (at least in English).
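
A minimal sketch of how a converter could apply these two points: relabel nmod as obl under a verbal head, and flag remnant relations and nmod under a copular nominal head for manual inspection. This is not the script discussed in this thread. The function names, the `make_block` helper, and the choice of `VERB` as the only "predicate" tag are assumptions made for illustration, and subtyped relations such as nmod:poss are left untouched.

```python
PREDICATE_UPOS = {"VERB"}  # assumption: extend per language (e.g. ADJ)


def parse_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges such as "1-2"
        tokens.append({
            "id": int(cols[0]), "form": cols[1], "upos": cols[3],
            "head": int(cols[6]), "deprel": cols[7],
        })
    return tokens


def split_nmod(tokens):
    """Relabel nmod as obl under verbal heads; return cases needing review."""
    by_id = {t["id"]: t for t in tokens}
    copular_heads = {t["head"] for t in tokens if t["deprel"] == "cop"}
    flagged = []
    for t in tokens:
        if t["deprel"] == "remnant":
            flagged.append((t["id"], "remnant"))  # ellipsis: manual work
        if t["deprel"] != "nmod":
            continue
        head = by_id.get(t["head"])
        if head is None:
            continue  # attached to the root; nothing to decide
        if head["upos"] in PREDICATE_UPOS:
            t["deprel"] = "obl"
        elif head["id"] in copular_heads:
            flagged.append((t["id"], "nmod-or-obl"))  # ambiguous case
    return flagged


def make_block(rows):
    """Hypothetical helper: CoNLL-U text from (form, upos, head, deprel)."""
    return "\n".join(
        "\t".join([str(i), form, "_", upos, "_", "_", str(head), deprel, "_", "_"])
        for i, (form, upos, head, deprel) in enumerate(rows, 1)
    )


key = parse_sentence(make_block([
    ("This", "PRON", 4, "nsubj"), ("is", "VERB", 4, "cop"),
    ("the", "DET", 4, "det"), ("key", "NOUN", 0, "root"),
    ("to", "ADP", 7, "case"), ("the", "DET", 7, "det"),
    ("apartment", "NOUN", 4, "nmod"), (".", "PUNCT", 4, "punct"),
]))
print(split_nmod(key))  # [(7, 'nmod-or-obl')]
```

The copular case is only flagged, never relabeled, because the choice between nmod and obl there depends on the meaning of the sentence.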

Contributor

jnivre commented Dec 15, 2016

@sebschu Thanks for sharing your efforts. Everyone will have to do this, so to prevent a huge duplication of effort and give support to teams who are less familiar with the annotation scheme, it would be great if we could provide a script for the whole community that does automatic things directly and flags other cases for manual inspection. We have not promised such a script, because we didn't know if and when we could make it available. Do you think any of this is likely to happen soon?

Member

martinpopel commented Dec 15, 2016

I am also working on a script (using Udapi). My plan is to mark unclear cases with a comment in MISC. I will let you know once I have something.

Another issue is adding enhanced dependencies for propagation of conjuncts. It cannot be determined automatically whether a dependent of the first conjunct is shared by all conjuncts or private to the first one. Based on my survey of a few languages, a reasonable heuristic is that a dependent is shared if it comes before the first conjunct or after the last conjunct. Note that in most Prague-style treebanks this distinction is already annotated (so e.g. for Czech, we will rather update the PDT-to-CoNLLU script).
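
The position heuristic can be sketched in a few lines. This is an illustration only, not Udapi code: the function name and its arguments are hypothetical, and "before" and "after" are judged by the token ids of the conjunct heads.

```python
def shared_dependents(first_conjunct, other_conjuncts, dependents):
    """Guess which dependents of the first conjunct are shared.

    first_conjunct: token id of the first conjunct (the coordination head)
    other_conjuncts: token ids of the remaining conjuncts
    dependents: token ids of the first conjunct's non-conj dependents
    """
    span = [first_conjunct] + list(other_conjuncts)
    first, last = min(span), max(span)
    # Shared if outside the span covered by the conjunct heads.
    return [d for d in dependents if d < first or d > last]


# "She bought books and sold them today"
#   1    2      3    4    5    6    7
# bought (2) is the first conjunct, sold (5) the second; assume that
# She (1), books (3) and today (7) are attached to bought.
print(shared_dependents(2, [5], [1, 3, 7]))  # [1, 7]
```

Here "books" sits between the two conjuncts and is therefore treated as private to "bought", while "She" and "today" are propagated.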

Member

sebschu commented Dec 15, 2016

@jnivre I plan to get started with this next week and I'll try to have a first version of this before the holidays.

I agree that it would be useful to prevent duplicate work as much as possible. I am just a bit worried that what I have in mind might not produce correct results for all languages. For example, in English, neg will always be replaced with det or advmod but I'm not sure if this is true for all languages (e.g., is this even true for French?). If we provide the script to everyone, it should therefore come with a big disclaimer and individual treebank maintainers would have to check carefully if the output makes sense.

Also, a lot of the changes to the features won't apply to English, so I won't do anything about them.

But I'll try to make it as "universal" as possible and I'll try to make clear which parts are most likely language-specific and might have to be adapted for other languages and which parts should produce the correct results for all languages.
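
As an illustration of a language-specific step, here is a minimal sketch of the neg replacement. It assumes that the UPOS tag is enough to tell the two cases apart (DET becomes det, everything else becomes advmod), which may hold for English but has not been checked for other languages. The function name is hypothetical.

```python
def convert_neg(tokens):
    """Replace the v1 relation neg with det or advmod, in place.

    tokens: list of dicts with at least "upos" and "deprel" keys.
    Assumption: negative determiners are tagged DET; all other
    negation words are treated as adverbial modifiers.
    """
    for t in tokens:
        if t["deprel"] == "neg":
            t["deprel"] = "det" if t["upos"] == "DET" else "advmod"
    return tokens


sample = [
    {"form": "no", "upos": "DET", "deprel": "neg"},    # "no reason"
    {"form": "not", "upos": "PART", "deprel": "neg"},  # "did not go"
    {"form": "go", "upos": "VERB", "deprel": "root"},
]
print([t["deprel"] for t in convert_neg(sample)])  # ['det', 'advmod', 'root']
```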

Contributor

jnivre commented Dec 15, 2016

@sebschu @martinpopel Thanks for doing this. I completely understand that you cannot take responsibility for producing something that will work for all languages, so disclaimers will be necessary. However, if you try to "think universally" whenever possible, I am sure it will be easier for people to adapt the script to new contexts.

Member

fginter commented Dec 15, 2016

I'll try to breathe some new life into a tool we have developed here in Turku for treebank conversions. Over the years it was used on numerous occasions, but also transformed itself into one huge hack. It reads a config file with rules that match arbitrary structures in the source and produce a single dependency in the target.

If I succeed in reviving the monster, I will post it here. :)

Contributor

amir-zeldes commented Dec 15, 2016

I was planning on using a DepEdit job to update the Coptic data, see: https://corpling.uis.georgetown.edu/depedit/

It's true not everything will be 100% automatable, and many things will be language specific, but it might be a good start for some people who don't want to delve into programming too much (it's just a 3-column configuration file specifying tokens to find, their subgraph relations, and what to do to them).

I'm happy to post a link to the job file here once I get around to doing this.

Member

spyysalo commented Dec 16, 2016

Piling on: I'm also interested in developing or contributing to the development of an automated conversion. I wouldn't mind working with @fginter's monster, but would be happy to use any other reasonable framework.

fcbr commented Dec 16, 2016

Still in early stages, but we are building a Common Lisp library to manipulate CoNLL-U files (https://github.com/own-pt/cl-conllu) and using it to do the automated parts of our conversion (e.g.: https://github.com/own-pt/bosque-UD/blob/master/scripts/fix-issue-108-nao-VERB.lisp)

Member

fginter commented Dec 16, 2016

Seems we have lots of options here. :) My archaeological excavation site is here: https://github.com/TurkuNLP/dep2dep/ and an example config file is here: https://github.com/TurkuNLP/dep2dep/blob/master/dtreebank/dep2dep/example_rules.lp2lp. I guess this is what Turku will use for our v1->v2 conversion. The primary advantage of this tool is that it can handle non-trees on input, i.e. it can also convert the existing extended layer and make use of our PropBank annotation. The PropBank annotation will help e.g. with the obl vs nmod distinction.

Contributor

jnivre commented Dec 20, 2016

I started a table at http://universaldependencies.org/v1_to_v2.html for specifying the desired behavior of v1-to-v2 converters and (eventually) changes needed to the validation script. Please feel free to contribute. :)

Contributor

arademaker commented Dec 27, 2016

It is good to know about so many tools under development: Udapi, depedit and dep2dep.

As @fcbr mentioned above, we are working on a Common Lisp library for working with CoNLL-U files. I hope to add a rule processing engine for tree transformations. So far, we are trying to define how expressive that rule language needs to be.

To define our rules language we are trying to formalize the modifications that our linguists suggest. For example UniversalDependencies/UD_Portuguese-Bosque#131 (comment)

Unfortunately, in the Portuguese corpus, we are not only dealing with the V1->V2 upgrade, but we are also still correcting wrong analyses.

In the definition of our language for rules, we are stealing ideas from SPARQL, corte e costura, the bioNLP query language, INESS etc.

Each group will probably prefer one particular programming language for the implementation and some specific architecture option, but maybe we can still share ideas about, for instance, a declarative language of rewriting rules.

Member

fginter commented Jan 3, 2017

In case this would be relevant to anyone, I got our dep2dep thing up and running, and even made it rehang punct and cc from head to the following conjunct. So I suppose it does something, and Turku will use that. I will try to keep the config file somewhat documented and modular (general UD vs Finnish specific parts). https://github.com/TurkuNLP/dep2dep/blob/master/example_rules.lp2lp
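
For readers unfamiliar with the cc/punct change: in v1 both are attached to the first conjunct, while in v2 they attach to the conjunct that follows them. A minimal sketch of that rehanging, written in plain Python rather than as dep2dep rules; the function name and the token representation are assumptions, and the "token comes after its head" check is a rough guard against touching punctuation that has nothing to do with the coordination.

```python
def rehang_cc_punct(tokens):
    """Move cc and punct from the first conjunct to the next conjunct.

    tokens: list of dicts with "id", "form", "head" and "deprel" keys,
    annotated v1-style. Modified in place.
    """
    for t in tokens:
        if t["deprel"] not in ("cc", "punct"):
            continue
        if t["id"] < t["head"]:
            continue  # precedes its head, so not inside the coordination
        # Conjuncts of the same head that come after this token.
        following = [
            c["id"] for c in tokens
            if c["deprel"] == "conj" and c["head"] == t["head"] and c["id"] > t["id"]
        ]
        if following:
            t["head"] = min(following)  # the immediately following conjunct
    return tokens


# "apples , pears and oranges" in v1 style
coord = [
    {"id": 1, "form": "apples", "head": 0, "deprel": "root"},
    {"id": 2, "form": ",", "head": 1, "deprel": "punct"},
    {"id": 3, "form": "pears", "head": 1, "deprel": "conj"},
    {"id": 4, "form": "and", "head": 1, "deprel": "cc"},
    {"id": 5, "form": "oranges", "head": 1, "deprel": "conj"},
]
rehang_cc_punct(coord)
print([(t["form"], t["head"]) for t in coord])
# [('apples', 0), (',', 3), ('pears', 1), ('and', 5), ('oranges', 1)]
```

Sentence-final punctuation is left alone because no conjunct follows it.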

Member

sebschu commented Jan 13, 2017

Sorry that this took longer than expected but I just pushed a first version of my conversion script to

https://github.com/UniversalDependencies/tools/tree/master/v2-conversion

(I put it in the tools repo, so that other people can also make edits/contribute. I hope that's okay.)

If you intend to use the script, make sure to read the README on the limitations. Also, I haven't run it on any other language than English, so make sure that you do a thorough spot-checking if you run it on other treebanks.

Let me know if you have any questions or problems getting the script to run.

Contributor

jnivre commented Jan 13, 2017

Thanks! This is really useful. Why don't you also send a message to the UD list? I think many people are waiting for something like this.

Member

martinpopel commented Jan 17, 2017

Another implementation (based on @sebschu's, thanks) is here:
https://github.com/udapi/udapi-python/tree/master/udapi/block/ud

I know it's too late, but it may still be useful for someone.
It is implemented using the Udapi framework (I hope the code is more readable, maintainable and powerful this way), and it also supports some edits of FEATS.
I plan to add enhanced dependencies and orphan/remnant.
Contributions and questions are welcome.

Member

martinpopel commented Feb 14, 2017

I think we can close this issue. There are several converters in Lisp, Prolog and Python.
Udapi's ud.Convert1to2 has been used for converting several treebanks.
Udapi also has tools for adding SpaceAfter=No according to the raw text or according to heuristic rules, and ud.MarkBugs for syntax validation.
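
To show what "adding SpaceAfter=No according to the raw text" amounts to, here is a standalone sketch. It does not use Udapi and is not its implementation; both function names are hypothetical. It assumes every token form occurs verbatim in the raw text, in order.

```python
def no_space_after(text, forms):
    """For each token form, True if no whitespace follows it in text."""
    flags = []
    pos = 0
    for form in forms:
        start = text.index(form, pos)  # ValueError if the form is missing
        pos = start + len(form)
        # A token at the very end of the text gets no flag here.
        flags.append(pos < len(text) and not text[pos].isspace())
    return flags


def add_space_after_no(misc):
    """Add SpaceAfter=No to a CoNLL-U MISC value ("_" means empty)."""
    if misc == "_":
        return "SpaceAfter=No"
    if "SpaceAfter=No" in misc.split("|"):
        return misc
    return misc + "|SpaceAfter=No"


print(no_space_after("Hello, world!", ["Hello", ",", "world", "!"]))
# [True, False, True, False]
```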
