Clitics with discourse function #709

Closed
ftyers opened this issue May 15, 2020 · 13 comments

@ftyers
Contributor

ftyers commented May 15, 2020

I'm working with a spoken corpus of Chukchi where there are clitics with a discourse function, "particles", as Dunn describes:
[Image: screenshot, 2020-05-15 23-10-17]

Example 018:
[Image: screenshot, 2020-05-15 23-10-30]

I have been annotating them with the discourse relation and attaching them to the word they are cliticised to. However, the guidelines for discourse state that they should be attached "to the head of the most relevant nearby clause".

This is inconvenient when a single clause contains many of them; as Dunn notes, they can attach to pretty much anything. For example, in the following tree:

[Image: screenshot, 2020-05-15 23-17-57]

# sent_id = Chapaev:17
# text = Нэмыӄэйъымэ ӄынвэтъымэ тъэче гииӈитыкъымэ ӄынвэтэ амноӈэты нэмыӄэй киноратыӈӈоаʼт.
# text[phon] = neməqejʔəme qənwetʔəme tʔese ɣiiŋitəkʔəme qənwete amnoŋetə neməqej kinoratəŋŋoʔat
# text[rus] = Как-то через несколько лет тоже наконец стали возить кино по бригадам.
# text[eng] = Sometimes a few years later, they also finally began to carry films around the camps.
# labels = complete-dep anno-karina
1-3	Нэмыӄэйъымэ	_	_	_	_	_	_	_	Gloss=тоже-=EMPH-=PTCL
1	Нэмыӄэй	_	ADV	_	_	15	advmod	15:advmod	Gloss=тоже
2	ъым	ъм	PART	_	_	1	discourse	1:discourse	Gloss=EMPH
3	э	а	PART	_	_	1	discourse	1:discourse	Gloss=PTCL
4-6	ӄынвэтъымэ	_	_	_	_	_	_	_	Gloss=наконец-=EMPH-=PTCL
4	ӄынвэт	_	ADV	_	_	15	advmod	15:advmod	Gloss=наконец
5	ъым	ъм	PART	_	_	4	discourse	4:discourse	Gloss=EMPH
6	э	а	PART	_	_	4	discourse	4:discourse	Gloss=PTCL
7	тъэче	_	ADV	_	_	8	advmod	8:advmod	Gloss=сколько-MPL
8-10	гииӈитыкъымэ	_	_	_	_	_	_	_	Gloss=год-TIME-LOC-=EMPH-=PTCL
8	гииӈитык	_	NOUN	_	Case=Loc	15	obl	15:obl	Gloss=год-TIME-LOC
9	ъым	ъм	PART	_	_	8	discourse	8:discourse	Gloss=EMPH
10	э	а	PART	_	_	8	discourse	8:discourse	Gloss=PTCL
11-12	ӄынвэтэ	_	_	_	_	_	_	_	Gloss=наконец-=PTCL
11	ӄынвэт	_	ADV	_	_	15	advmod	15:advmod	Gloss=наконец
12	э	а	PART	_	_	11	discourse	11:discourse	Gloss=PTCL
13	амноӈэты	эмнуӈ	NOUN	_	Case=Dat	15	obl	15:obl	Gloss=тундра-DAT
14	нэмыӄэй	нэмыӄэй	ADV	_	_	15	advmod	15:advmod	Gloss=тоже
15	киноратыӈӈоаът	рэтык	VERB	_	Incorporated[obj]=Yes	0	root	0:root	2/3.S/A-фильм-приносить-INCH-TH-PL
15.1	кино	кино	NOUN	_	Incorporated=Yes	_	_	15:obj	Gloss=фильм
16	.	.	PUNCT	_	_	15	punct	15:punct	_

Applying the rule to this sentence, we would end up with a tree like,
[Image: screenshot, 2020-05-15 23-18-30]

That isn't a problem in this example, but one can imagine the rule introducing a lot of non-projectivity in other sentences.
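The non-projectivity concern can be checked mechanically. The following is a minimal sketch, assuming a sentence is reduced to a plain list of head indices; the function name `nonprojective_arcs` and the four-token toy tree are illustrative and not part of any UD tool. The two head lists for Chapaev:17 are read off the tree above: one with each particle attached to its host, and one with every particle reattached to the verb (token 15).

```python
def nonprojective_arcs(heads):
    """heads[i] is the head of token i+1 (ids are 1-based, 0 is the root).

    Returns the sorted list of (head, dependent) arcs that cross another arc.
    """
    arcs = [(head, dep) for dep, head in enumerate(heads, start=1)]
    crossing = set()
    for h1, d1 in arcs:
        lo1, hi1 = sorted((h1, d1))
        for h2, d2 in arcs:
            lo2, hi2 = sorted((h2, d2))
            # Two arcs cross when the second starts strictly inside the
            # first and ends strictly outside it.
            if lo1 < lo2 < hi1 < hi2:
                crossing.add((h1, d1))
                crossing.add((h2, d2))
    return sorted(crossing)


# Chapaev:17 with particles attached to the word they cliticise to.
local = [15, 1, 1, 15, 4, 4, 8, 15, 8, 8, 15, 11, 15, 15, 0, 15]
# Same sentence with every particle attached to the verb (token 15).
clause_level = [15, 15, 15, 15, 15, 15, 8, 15, 15, 15, 15, 15, 15, 15, 0, 15]
# Hypothetical toy tree in which arc 1->3 crosses arc 2->4.
toy = [0, 1, 1, 2]

print(nonprojective_arcs(local))
print(nonprojective_arcs(clause_level))
print(nonprojective_arcs(toy))
```

Both versions of Chapaev:17 come out projective, which matches the observation that this particular sentence is unaffected; only the toy tree reports crossing arcs.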

The validator does not complain about most cases, but it does complain about cc having dependents,

[Line 862 Sent Boots:8 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:discourse)
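To find such cases before running the validator, something like the following sketch could be used. It is not the validator's implementation: the set `LEAF_RELS` is an assumption based only on the relations reported as problematic in this thread, the function name `leaf_violations` is made up, and any exceptions the real validator may allow are ignored. The sample rows are an abridged version of Boots:8.

```python
# Assumption: relations whose nodes are expected to be leaves, as named in
# this thread. The real validator's rules may differ.
LEAF_RELS = {"cc", "aux", "cop", "case"}


def leaf_violations(conllu_sentence):
    """Return (head_id, head_form, head_deprel, dep_id, dep_form, dep_deprel)
    for every token whose head carries a must-be-leaf relation."""
    rows = {}
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword ranges (1-2) and empty nodes (5.1).
        if not cols[0].isdigit():
            continue
        rows[int(cols[0])] = (cols[1], int(cols[6]), cols[7])

    found = []
    for tid, (form, head, deprel) in sorted(rows.items()):
        parent = rows.get(head)
        if parent is None:  # root (head 0) or dangling head
            continue
        p_form, _, p_rel = parent
        # Compare on the universal part of the relation (before any colon).
        if p_rel.split(":")[0] in LEAF_RELS:
            found.append((head, p_form, p_rel, tid, form, deprel))
    return found


# Abridged Boots:8, with the particle attached to the conjunction.
sample = "\n".join("\t".join(row) for row in [
    ("1-2", "Ынкъамэ", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("1", "Ынкъам", "_", "CCONJ", "_", "_", "3", "cc", "_", "_"),
    ("2", "э", "_", "PART", "_", "_", "1", "discourse", "_", "_"),
    ("3", "триԓгытэквъэ", "_", "VERB", "_", "_", "0", "root", "_", "_"),
])

for violation in leaf_violations(sample):
    print(violation)
```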

So I'd like to ask whether anyone has suggestions. Does this happen in other languages, and what are the potential solutions? I've seen that Czech has advmod:emph; would that do the job?

@sylvainkahane
Contributor

We have such particles in Naija, an English-based creole spoken in Nigeria. When they attach to individual words, as sha does, we attached them with advmod:emph: examples sorted by POS of the previous word

@dan-zeman By the way, we had a problem with the validator when they attach on a CCONJ, because advmod:emph was refused.

When they are more discursive, such as o, we used discourse.

@ftyers
Contributor Author

ftyers commented May 16, 2020

> @dan-zeman By the way, we had a problem with the validator when they attach on a CCONJ, because advmod:emph was refused.

Yes, this is the same problem I'm having, with cc, cop, case and aux. Perhaps the validator could be relaxed to allow for discourse or advmod to be attached.

At least for case, it might even make sense in English in sentences like:

  1. "it was in exactly that house", vs.
  2. "it was exactly in that house"

@amir-zeldes
Contributor

Out of curiosity, do they modify the words they are attached to semantically, or are they really clause level evidentials, emphatics, etc. that just happen to piggy-back on adjacent phonological units? Are they selective in any way with regard to the host word?

@Stormur
Contributor

Stormur commented May 21, 2020

If these particles are to be attached to the single words, it seems that advmod:emph does the trick! Using the official validation script, I do not get errors when a PART depends on a node with this deprel. It looks like PARTs are allowed to be much more flexible with respect to relations.

By the way, what is the other particle э which always accompanies the ъым in your sample?

@ftyers
Contributor Author

ftyers commented May 21, 2020

@Stormur so if I have a CCONJ with cc label that has a PART dependent, e.g. advmod:emph(CCONJ, PART), the validator doesn't complain?

PTCL =(ŋ)e Самая частотная частица

The glossing guidelines say that this is "the most frequent particle" but give no other information, and neither Dunn's grammar nor the sketch by Muravyova mentions it. I've put this question, as well as @amir-zeldes's, to the original annotators; let's see what they say.

@Stormur
Contributor

Stormur commented May 21, 2020

> @Stormur so if I have a CCONJ with cc label that has a PART dependent, e.g. advmod:emph(CCONJ, PART), the validator doesn't complain?

I tried it and apparently it didn't. In Latin too we are going to use some particles with advmod:emph, by the way, and I noticed that they are also allowed to bear features like PronType=Emp or similar. I hope it is not just a quirk of the version/system I tested, but anyhow, it makes sense.

@ftyers
Contributor Author

ftyers commented May 21, 2020

I tried it, and it doesn't seem to work; I get the same errors:

Fri May 22 00:21:13 CEST 2020
tools/check_files.pl UD_Chukchi-HSE
Unknown language Chukchi.
*** FAILED ***
./validate.sh --lang ckt --max-err 0 UD_Chukchi-HSE/ckt_hse-ud-test.conllu
[Line 901 Sent Being_a_child:30 Node 5]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (5:итык:cop --> 1:Ӄунэчеӈ:advmod)
[Line 908 Sent Being_a_child:30 Node 5]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (5:итык:cop --> 6:э:advmod)
[Line 1187 Sent Boots:8 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:advmod)
[Line 4837 Sent Having_bear_ears:33 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:advmod)
[Line 5575 Sent Ice_Age:6 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:advmod)
[Line 5682 Sent Ice_Age:12 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:advmod)
[Line 5745 Sent Ice_Age:15 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Къам:cc --> 2:а:advmod)
[Line 5802 Sent Incident:1 Node 4]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (4:нытваӄэн:cop --> 5:э:advmod)
[Line 10177 Sent Shaman:3 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Къам:cc --> 2:а:advmod)
[Line 10372 Sent Shaman:15 Node 6]: [L3 Syntax leaf-aux-cop] 'aux' not expected to have children (6:нъэԓгъи:aux --> 4:этаны:advmod)
[Line 11665 Sent The_race_to_death:2 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:advmod)
[Line 11676 Sent The_race_to_death:2 Node 10]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (10:нытваӄэн:cop --> 11:ъым:advmod)
[Line 11677 Sent The_race_to_death:2 Node 10]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (10:нытваӄэн:cop --> 12:э:advmod)
[Line 11961 Sent Transhumance:10 Node 2]: [L3 Syntax leaf-cc] 'cc' not expected to have children (2:ӈэвэӄ:cc --> 3:э:advmod)
Syntax errors: 14
*** FAILED *** with 14 errors
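For a log of this size it can help to tally the errors by test ID. The sketch below assumes the line format seen in the output above; the regular expression and the function name `count_by_test` are illustrative and are not part of the validation tools. The sample contains three lines copied from the log.

```python
import re
from collections import Counter

# Matches lines such as:
# [Line 901 Sent Being_a_child:30 Node 5]: [L3 Syntax leaf-aux-cop] ...
LOG_LINE = re.compile(
    r"^\[Line (\d+) Sent (\S+) Node (\S+)\]: \[L(\d) (\w+) ([\w-]+)\]"
)


def count_by_test(log_text):
    """Count validator messages per test ID (e.g. 'leaf-cc')."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(6)] += 1
    return counts


sample_log = """\
[Line 901 Sent Being_a_child:30 Node 5]: [L3 Syntax leaf-aux-cop] 'cop' not expected to have children (5:итык:cop --> 1:Ӄунэчеӈ:advmod)
[Line 1187 Sent Boots:8 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:advmod)
[Line 4837 Sent Having_bear_ears:33 Node 1]: [L3 Syntax leaf-cc] 'cc' not expected to have children (1:Ынкъам:cc --> 2:э:advmod)
Syntax errors: 14
"""

for test_id, n in count_by_test(sample_log).most_common():
    print(test_id, n)
```

Run over the full log above, the same tally gives 8 leaf-cc and 6 leaf-aux-cop messages, which adds up to the 14 reported syntax errors.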

Here is an example that breaks,

# sent_id = Boots:8
# text = Ынкъамэ триԓгытэквъэыʼм травэръэпыгъа трасапокгыпгъа.
# text[phon] = ənkʔame triɬɣətekwʔeʔəm trawerʔepəɣʔa trasapokɣəpɣʔa
# text[rus] = DMP: — Ещё помоюсь, оденусь, надену сапоги.
# text[eng] = DMP: — And I will wash myself, get dressed, put on my boots.
# labels = complete-dep anno-fran
1-2     Ынкъамэ _       _       _       _       _       _       _       Gloss=и-=PTCL
1       Ынкъам  _       CCONJ   _       _       3       cc      3:cc    Gloss=и
2       э       _       PART    _       _       1       advmod:emph     1:advmod        Gloss=PTCL
3-4     триԓгытэквъэыʼм _       _       _       _       _       _       _       Gloss=1SG.S/A-FUT-мыться-TH-=EMPH
3       триԓгытэквъэ    _       VERB    _       _       0       root    0:root  Gloss=1SG.S/A-FUT-мыться-TH
4       ыʼм     _       PART    _       _       3       advmod:emph     3:advmod        Gloss=EMPH
5       травэръэпыгъа   _       VERB    _       _       3       parataxis       3:parataxis     Gloss=1SG.S/A-FUT-одежда.из.шкур-надевать-TH
5.1     авэръ   _       NOUN    _       _       _       _       5:obj   Gloss=одежда.из.шкур
6       трасапокгыпгъа  _       VERB    _       _       3       parataxis       3:parataxis     Gloss=1SG.S/A-FUT-сапог-надевать-TH|SpaceAfter=No
6.1     сапок   _       NOUN    _       _       _       _       6:obj   Gloss=сапог
7       .       _       PUNCT   _       _       3       punct   3:punct _

@Stormur
Contributor

Stormur commented May 22, 2020

Sorry, my bad. Yes, the validator treats any dependency of a cc or aux as an error.

What does work is that PART is allowed to have the deprel advmod:emph, as this does not trigger the "advmod but not ADV" error (and neither does a PronType feature).

But still, if the same particle appears over and over in the same sentence, isn't it really just a discourse tied to the root? And are we really going to have non-projectivities?

@ftyers
Contributor Author

ftyers commented May 22, 2020

@Stormur I'm waiting for a reply from the corpus developers, I'll let you know!

I had another thought too: in UD_Finnish these clitics are represented as features, e.g. search for Clitic= here (gyzp5-839, for example). Although I must admit I don't particularly like that solution either.

@Stormur
Contributor

Stormur commented May 22, 2020

Yes, I was aware of the Finnish solution, but it does not convince me. I think that if the clitic deserves to be annotated like that, it is not just a variant of that word and has its own syntactic role. So I would stick just to Clitic=Yes and find the right head and dependency for the clitic.

Just some thoughts about advmod:emph and ccs: if we think that ccs represent connectors, the clitic might be seen as effectively modifying the co-ordinated word. A sensible solution might then be to "transfer" the relation advmod:emph to the head of the cc, and the same would go for auxiliaries. In a similar way, I can think of an emphasizer attached to a SCONJ which is effectively modifying the whole proposition.

@ftyers
Contributor Author

ftyers commented May 22, 2020

@amir-zeldes @Stormur here is the answer I got back:

They serve both as emphasizers and as placeholders. Giving them a coordinating-conjunction dependency would be going too far, since that is an interpretation that comes through the translations. The informants themselves would not necessarily mean a conjunction. The use of these clitics is also highly idiolectal: some speakers use one after every word, while others say it only twice in five minutes. As Dunn indicates, they attach to words of any class.

So, advmod:emph is an appropriate type. And the relation should be as local as possible.

dan-zeman added this to the v2.7 milestone May 23, 2020
@dan-zeman
Member

I support the view that these particles should not be discourse and that advmod:emph seems appropriate from the perspective of how we used it in Czech. I'm not convinced that their parent node must necessarily be the word they are attached to phonologically/orthographically (if it were so, then I wouldn't see any point in making them separate nodes). So going from "must-be-leaf" nodes to the nearest eligible ancestor sounds like a reasonably UD-ish solution to me.
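The "nearest eligible ancestor" idea can be sketched as a small tree rewrite. This is an illustration under stated assumptions, not the procedure actually applied to the treebank: `LEAF_RELS` is taken from the relations named in this thread, `lift_from_leaves` is a made-up name, every dependent of a must-be-leaf node is lifted without exception, and the enhanced-dependency column is not updated.

```python
# Assumption: relations whose nodes should end up as leaves.
LEAF_RELS = {"cc", "aux", "cop", "case"}


def lift_from_leaves(tokens):
    """tokens maps id -> {"head": int, "deprel": str} for one sentence.

    Returns a copy in which no token is headed by a must-be-leaf node:
    each such token is reattached to the nearest ancestor that is eligible.
    """
    fixed = {tid: dict(tok) for tid, tok in tokens.items()}
    for tok in fixed.values():
        head = tok["head"]
        # Climb while the current head is a must-be-leaf function word.
        # The root token has deprel "root", so the climb stops there.
        while head != 0 and fixed[head]["deprel"].split(":")[0] in LEAF_RELS:
            head = fixed[head]["head"]
        tok["head"] = head
    return fixed


# Boots:8, tokens 1-4 only, as annotated in the failing example.
boots = {
    1: {"head": 3, "deprel": "cc"},
    2: {"head": 1, "deprel": "advmod:emph"},
    3: {"head": 0, "deprel": "root"},
    4: {"head": 3, "deprel": "advmod:emph"},
}

lifted = lift_from_leaves(boots)
print(lifted[2])  # the particle now depends on the verb (token 3)
```

Working on a copy keeps the original annotation available for comparison, which is useful when reviewing how many attachments the rewrite changes.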

@Stormur : I think there are not many (if any) restrictions the validator places on morphological features at present. But I hope we will be able to add tests of that sort in the future :-)

@ftyers
Contributor Author

ftyers commented May 24, 2020

@dan-zeman that sounds reasonable. Now the file passes validation :)
