saving /home/johannes/GitHub/UD_Turkish-PUD/tr_pud-ud-test.conllu sen… #1

Open
wants to merge 1 commit into
base: dev
from

Conversation

jheinecke

three minor validation errors corrected

@@ -12166,7 +12166,7 @@
# sent_id = w01072079
# text = İmparator Caracalla 3. yüzyılda kısa bir süre geçerli olan yeni bir bölünme uygulamıştır.
# text_en = By the 3rd century the emperor Caracalla made a new division which lasted only a short time.
-1 İmparator İmparator NOUN NN Number=Sing 13 nsubj _ _
+1 İmparator İmparator NOUN NN Number=Sing 13 nsubj:outer _ _
Member

Why is this labeled as an outer subject? Isn't the actual error elsewhere? Is the clause kısa bir süre geçerli olan yeni bir bölünme really a subject? Isn't it rather an object clause?

Author

Revisiting it, I think you're right: "İmparator" should be just nsubj, and "bölünme" is definitely not csubj (obj, or rather ccomp?).

@@ -3857,7 +3857,7 @@
# text = Seçim bölgesi, seçmenlerin yüzde 62'si AB'den ayrılmayı destekleyen Kuzey Kesteven konsey bölgesindedir.
# text_en = The constituency is in the council area of North Kesteven, where 62% of voters backed leaving the EU.
1 Seçim seçim NOUN NN Number=Sing 2 nmod:poss _ _
-2 bölgesi bölge NOUN NN Case=Nom|Number=Sing|Number[psor]=Sing|Person[psor]=3 13 nsubj _ SpaceAfter=No
+2 bölgesi bölge NOUN NN Case=Nom|Number=Sing|Number[psor]=Sing|Person[psor]=3 13 nsubj:outer _ SpaceAfter=No
Member

Why is this labeled as an outer subject? Shouldn't the other subject, seçmenlerin yüzde 62'si, rather be attached to a nested clause?

Author

Hm, I took the definition word for word: "nsubj:outer specifies a nominal subject of a copular clause whose predicate is itself a clause", and concluded that Seçim bölgesi is the subject (but not nominal, in fact). Do you see it the other way round (62'si is nsubj:outer)?

Member

Here 62'si is the subject of destekleyen, which is a participle.

On a general note: the annotations in this treebank are really in bad shape. There were some fixes from the BOUN people a few years ago, but as far as I know they were not applied. I think we should re-activate that effort and apply their changes before diverging much from the base.
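
The configuration discussed in this thread, two plain subjects attached to the same head, can be searched for mechanically. The following is a minimal Python sketch, not the official UD validator: it assumes tab-separated CoNLL-U token lines and assumes that `nsubj:outer` and `csubj:outer` are the subtypes that exempt a subject from the check. The function name and the toy sentence are hypothetical.

```python
from collections import defaultdict


def heads_with_multiple_subjects(sentence_lines):
    """Return {head_id: [subject_ids]} for heads with more than one plain subject.

    Subjects carrying the ':outer' subtype are ignored (assumption, see lead-in).
    """
    subjects = defaultdict(list)
    for line in sentence_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        token_id, head, deprel = cols[0], cols[6], cols[7]
        if deprel in ("nsubj", "csubj"):
            subjects[head].append(token_id)
    return {head: ids for head, ids in subjects.items() if len(ids) > 1}


# Toy sentence (hypothetical, not taken from the treebank).
toy = [
    "1\tA\ta\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "2\tB\tb\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tC\tc\tVERB\t_\t_\t0\troot\t_\t_",
]
print(heads_with_multiple_subjects(toy))
```

A hit only shows that two subjects share a head. As the comments above point out, the right repair may be to re-attach one of them to a nested clause rather than to relabel either as `nsubj:outer`.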

Author

I see. I only wanted to correct the things detected by the automatic validator, to be sure that the PUD treebanks won't be excluded from future versions, because I think the PUDs are really useful from a typological point of view. But I agree that a new effort may be necessary to revisit things.

Member

I agree. It would be a pity if we dropped the PUD treebanks. I'd be happy to have a look at the Turkish one, at least to get it to validate by the next data freeze, but the problems go beyond validation issues. I hope we can also improve the actual annotations. I will start contacting the BOUN people. Maybe we can factor their changes in before working on the validation issues.

-4 gerçekleşen gerçekleş VERB VB Aspect=Perf|Mood=Ind|Number=Sing|Tense=Pres|VerbForm=Part 6 nmod:poss _ _
-5 lerin _ X GW Case=Gen|Number=Plur|Person=3 4 goeswith _ _
+4 gerçekleşen gerçekleş VERB VB Aspect=Perf|Case=Gen|Mood=Ind|Number=Plur|Tense=Pres|Typo=Yes|VerbForm=Part 6 nmod:poss _ _
+5 lerin _ X GW _ 4 goeswith _ _
Member

This fix is uncontroversial, but it is actually odd for a PUD treebank to have Typo=Yes and/or goeswith at all. This is not a naturally occurring text. It has been translated into Turkish specifically for the purpose of UD annotation, so there shouldn't be any errors in the translation.

Member

This is a mistake introduced by automatic segmentation during the annotation process. I do not think the original text would have this word split in two. The correct solution would be to merge these tokens. I'd annotate it as:

gerçekleşenlerin gerçekleş VERB    _ Case=Gen|Number=Plur|Tense=Pres|VerbForm=Part   5       acl     _       _

But it seems that in PUD participles and verbal nouns are all split. (Also, nmod(:poss) is the default for clausal modifiers that behave like nouns, like the one above.)
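
The merge proposed above also requires renumbering every later token ID and every HEAD that points past the removed token. Below is a minimal Python sketch of that bookkeeping. It assumes tab-separated CoNLL-U lines with plain integer IDs (no multiword-token ranges or empty nodes) and integer HEAD values; `merge_goeswith` and the toy sentence are hypothetical. FEATS, DEPREL and MISC are left untouched and would still need to be edited by hand.

```python
def merge_goeswith(sentence_lines, first_id):
    """Merge token first_id + 1 into token first_id and renumber the sentence.

    Comment lines are passed through unchanged. IDs and HEADs at or after the
    removed token are shifted down by one, so a HEAD that pointed at the
    removed token ends up pointing at the merged token.
    """
    second_id = first_id + 1

    def shift(n):
        # HEAD 0 (root) and everything before the removed token stay as they are.
        return n - 1 if n >= second_id else n

    rows = [line.rstrip("\n") for line in sentence_lines if line.strip()]
    second_form = next(
        row.split("\t")[1]
        for row in rows
        if not row.startswith("#") and int(row.split("\t")[0]) == second_id
    )

    merged = []
    for row in rows:
        if row.startswith("#"):
            merged.append(row)
            continue
        cols = row.split("\t")
        token_id = int(cols[0])
        if token_id == second_id:
            continue  # drop the fragment; its form is appended to the first token
        if token_id == first_id:
            cols[1] = cols[1] + second_form
        cols[0] = str(shift(token_id))
        cols[6] = str(shift(int(cols[6])))
        merged.append("\t".join(cols))
    return merged


# Toy sentence (hypothetical; token 3 is a placeholder head).
toy = [
    "# toy sentence, not from the treebank",
    "1\tgerçekleşen\tgerçekleş\tVERB\t_\t_\t3\tnmod:poss\t_\t_",
    "2\tlerin\t_\tX\t_\t_\t1\tgoeswith\t_\t_",
    "3\thead\thead\tNOUN\t_\t_\t0\troot\t_\t_",
]
for line in merge_goeswith(toy, 1):
    print(line)
```

Applied to tokens 4 and 5 of the hunk above, the same shift would turn the head 6 into 5, which matches the head given in the suggested annotation.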

Author

I didn't realise that PUD translations are "error free". In that case I agree that we should retokenise it into gerçekleşenlerin. Which deprel do you prefer, acl or rather nmod?

Member

> I didn't realise that PUD translations are "error free". In that case I agree that we should retokenise it into gerçekleşenlerin. Which deprel do you prefer, acl or rather nmod?

From the global UD perspective I would say that if it is correctly tagged as VERB, then it heads a clause, meaning that it can be acl but not nmod or nmod:poss. But I haven't checked what's usually done in the Turkish treebanks.

Member

This is one of the tricky cases, and it is annotated inconsistently across treebanks. These clauses behave like nouns (as here, participating in a typical genitive-possessive construction). However, UD does not have a dependency type for 'a nominal clause modifier of a noun'. Some treebanks use acl, and others use nmod(:poss). I am more inclined towards acl (the type of modification can be inferred from the morphological features), but I do understand the others wanting to treat these clauses like nouns.
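
To see how often a treebank takes the nmod(:poss) option described in this thread, one can list tokens tagged VERB that are attached with an nmod relation. This is a minimal Python sketch assuming tab-separated CoNLL-U lines; the function name and the toy lines are hypothetical, and a hit is a candidate for review rather than an error.

```python
def verbs_attached_as_nmod(sentence_lines):
    """Return (id, form, deprel) for tokens with UPOS VERB and an nmod relation."""
    hits = []
    for line in sentence_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        upos, deprel = cols[3], cols[7]
        # Compare the universal part of the relation, so nmod:poss also matches.
        if upos == "VERB" and deprel.split(":")[0] == "nmod":
            hits.append((cols[0], cols[1], deprel))
    return hits


# Toy lines (hypothetical, not taken from the treebank).
toy = [
    "1\tgerçekleşenlerin\tgerçekleş\tVERB\t_\t_\t2\tnmod:poss\t_\t_",
    "2\thead\thead\tNOUN\t_\t_\t0\troot\t_\t_",
]
print(verbs_attached_as_nmod(toy))
```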

coltekin added a commit that referenced this pull request Apr 6, 2023
        This is corrected version of PR #1, fixing only two validation
        errors.
3 participants