Document policy for foreign expressions and code-switching #1001
Comments
Thank you @nschneid for this simplified version. It looks like it will get much more complex in the long run. In narrative number three, the vocative "Monsieur" is likely to be recognized as a title, and therefore it would be tagged as NOUN. In a similar way, "c'est la vie" would be recognized as a kind of formulative, i.e., we would get an INTERJ-type phrase head tied together with flat deps, so it might seem that the relation would be discourse. |
No. It is not just about morphology. Code switching has implications for syntax, too. |
The But the feature would not apply when the whole corpus is declared as code-switching, i.e., none of the (typically two) languages is considered domestic. (And assuming that the word in question belongs to one of the code-switching languages and not to a third one.) |
I would add that the borrowed analysis is preferable (if not the only possible) when the borrowed word has acquired morphology of the host language, different from the morphology of the source language. For example,
Here, exit is borrowed from English (pure Czech would be k výjezdu 36) but it has a form that does not exist in English and it should receive the Czech features Similarly, domesticated spelling is a signal of borrowing. For example, in Czech you can encounter
where ánunk comes from German and its original spelling is Ahnung "idea". A gray area arises when the original language uses a different writing system. The word or phrase will probably appear transcribed in the host text but this does not necessarily make it a borrowing. On the other hand, it does not follow the original spelling, which makes it difficult to use the code-switching analysis and please the validator. For example,
Here, the Russian phrase is transcribed from Все будет в порядке. It is certainly not a borrowing. But if we want the code-switching analysis, we must acknowledge that búdět is Finally, I would say that modification of the foreign word by a non-foreign word is also a sign of borrowing:
|
Is this done in the metadata somewhere? Can a language code be provided at the level of a document or sentence, or does |
Yes. Such treebanks are assigned to an artificial "language" which in fact represents two languages (which may or may not exist in UD separately), it has a private-area ISO code and its "family" is "Code switching". At present we have 5 such languages in the system:
For example, Turkish-German uses the Besides the five code switching languages above, some other treebanks may contain a significant amount of code switching even though they are assigned to one language. Sometimes it means that code switching has become part of the language because its speakers live under heavy influence of a majority language. I believe this is the case of Komi Zyrian IKDP (code switching with Russian). In this case the metadata will not directly reveal it (except that you can search for the
As far as the validator is concerned, it must be on each token individually. It is much easier to process (not just for the validator but for any tool that is interested in |
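To illustrate why a per-token marking is easy to process, here is a minimal sketch, assuming the language code lives in a `Lang=` attribute in the MISC column and the `Foreign=Yes` feature in FEATS. The function name `token_languages`, the fallback parameter `default_lang`, and the sample rows are hypothetical and not taken from any treebank or from the validator's code.

```python
# Sketch only: reads CoNLL-U text and reports, per token, the language code
# (assumed to be a Lang= attribute in MISC) and whether Foreign=Yes is set.

def _parse_pairs(column):
    """Turn 'A=B|C=D' into {'A': 'B', 'C': 'D'}; '_' means empty."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|") if "=" in item)


def token_languages(conllu_text, default_lang):
    """Yield (form, lang, is_foreign) for each regular token."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        feats = _parse_pairs(cols[5])
        misc = _parse_pairs(cols[9])
        yield cols[1], misc.get("Lang", default_lang), feats.get("Foreign") == "Yes"


# Hypothetical rows; underscores stand in for annotation not shown here.
_ROWS = [
    ("1", "Oh", "_", "_", "_", "_", "_", "_", "_", "_"),
    ("2", "c'", "_", "_", "_", "Foreign=Yes", "_", "_", "_", "Lang=fr"),
    ("3", "est", "_", "_", "_", "Foreign=Yes", "_", "_", "_", "Lang=fr"),
    ("4", "la", "_", "_", "_", "Foreign=Yes", "_", "_", "_", "Lang=fr"),
    ("5", "vie", "_", "_", "_", "Foreign=Yes", "_", "_", "_", "Lang=fr"),
]
SAMPLE = "# text = Oh c'est la vie\n" + "\n".join("\t".join(r) for r in _ROWS) + "\n"

if __name__ == "__main__":
    for form, lang, foreign in token_languages(SAMPLE, default_lang="en"):
        print(form, lang, foreign)
```

Because each token carries its own marking, a tool never has to track sentence-level or document-level state to know which language's feature inventory applies.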
OK thanks. Here is a second draft:
|
It seems like it would make sense to have a feature indicating an alternate orthography/script. This would be useful for a text containing transliterations or phonetic transcriptions, and for languages where multiple orthographies are used (e.g. Arabic + Arabizi). Perhaps spelling variation as well. Have any treebanks been using such a feature? |
We used to have additional values of But it is a wide area and variation can occur at different levels. The above example pertains to one phrase and it is most likely to occur as a citation within another language. Sometimes you have the whole corpus in a specific spelling (for example, Serbian uses Cyrillic or Latin script, the single Serbian treebank in UD uses Latin; Sanskrit has been written in several different scripts depending on time and location, in UD we have one treebank in Devanagari and another in Latin-based transcription). And sometimes there are competing orthography standards within one language, and they may be mixed in one treebank. I guess you have this problem with British/American/whatever English, but I suspect it occurs to a much higher level in minority languages such as Nahuatl or Low Saxon. |
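As a rough illustration of how script mixing could be spotted automatically, here is a small sketch that guesses the script of each letter from the first word of its Unicode character name. This is a heuristic of my own for illustration, not an existing UD feature or validator check, and the helper name `scripts_in` is hypothetical.

```python
import unicodedata


def scripts_in(token):
    """Return the set of script names guessed from Unicode character names.

    Heuristic: the first word of a letter's Unicode name is usually its script
    (e.g. 'LATIN SMALL LETTER U WITH ACUTE' -> 'LATIN'). Non-letters are ignored.
    """
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


if __name__ == "__main__":
    for word in ["búdět", "будет", "exitu"]:
        print(word, sorted(scripts_in(word)))
```

A token whose script set differs from the treebank's main script is a candidate for review, but, as noted above, transcription alone does not settle whether the word is a borrowing or code-switching.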
Implemented at https://universaldependencies.org/foreign.html. Do the trees look OK? @dan-zeman, feel free to add features to the Czech borrowing. |
Looks good to me.
Joakim
|
Looks good to me too. And I have completed the annotation of the Czech example. |
I have some comments about the revised definition.
This passage is not clear to me and I fear it is misleading for (new) annotators. In general, the annotation endeavour of UD never simulates a speaker: the addition of linguistic features is actually very much top-down and the typological tags often do not correspond with the "intuition" or the "traditional grammars" of even linguistically aware speakers. So, this is the first reason I would like to see this passage rephrased. Let's not link the annotation to some "intuition-based", "inherent knowledge" framework. The second reason is that the morphosyntactic knowledge of the speaker is rather irrelevant from the point of view of annotation when dealing with cross-lingual content: I think we just have to distinguish between intact and adapted (see below) material. Then if the material is intact, it simply needs the features of its original language. This is useful to retrieve interesting information such as how foreign words are used in a given language, and to detect trends. For example:
More generally, in the documentation I would actually like to see stressed that as far as possible this cross-lingual annotation has to be the favoured one, at least in the long term. It simply is the most informative and meaningful one and well, what goes more towards "universality"? Criteria can be defined with a good level of precision. And if a foreign word is not adapted, its belonging to that other language's system is always active, even if quiescent (e.g. cultured Italian speakers occasionally using Länder as a plural of ger. Land, where the singular form would actually be the prescriptive one).
This passage does not make sense to me and I would suggest removing it. For one thing, it promotes the presupposed exceptionality of proper names while there is hardly any evidence for it: names of any kind have been kept intact or adapted between any two languages in the world, and so it is desirable to annotate this fact for pairs such as fr. Hortense / it. Ortensia. It simply is a factual distinction. Then, isolation is not a criterion in annotation of cross-lingual content, and might actually represent the prototypical case.
Good to finally know the difference between Symmetrically as in the previous point, I would like to see stressed that this analysis only makes sense for clear cases like the mentioned Czech k exitu, while if the material is left intact as in coup d'état (even keeping the original orthography!) the cross-lingual analysis should be favoured. Another case that comes to my mind is eng. capish/capeesh/capiche, from Italian capisci /kaˈpiːʃi/ '(do) you understand'. This is totally adapted, as shown even in the orthography, and has become something else from the paradigm-belonging Italian form.
I do not get how this line helps, so I would suggest to simply remove it. It reads tautological at least, in the sense that if a word has been borrowed, then of course it can be modified. Conversely, modification can readily happen for non-adapted words, too.
I think this can also be misleading. Previously in the documentation page, it is said that "wide latitude" is given to treebanks on how to treat foreign words, then this comes, but I do not see how phrasal idioms etc. can be different from isolated material (as it seems implied now by the first two points). Again, I would like to see stressed that this 3rd option is simply (at least in the long run) an ad interim solution in absence of a meaningful cross-lingual analysis, and this is independent from the length of the foreign passage. |
I would tend to agree with that, maybe this is not a good formulation.
I think proper names ARE exceptional, in that if someone's last name is Takahashi, then their name is not suddenly transformed into "Highbridge" in English (the literal meaning) - it stays the same. In that respect, Hortense is not the same as Ortensia (despite etymology), and I think if my name were Hortense I would say my name in Italian is also Hortense, not Ortensia. I think this is also how Italy would issue me a visa or passport if that was my birth name.
Not necessarily. The phrase "coup d'état" happens to be nominal in both French and English, but if I say something "has a certain je ne sais quoi", many English speakers would use that in speech as an unanalyzable nominal. Technically the multilingual analysis would regard this as a clause and might be tempted to give it such a deprel, but the modifier "certain" is a good indication that this is not the status of this borrowed item in English. I think the best annotation there is to treat "certain" as |
To my knowledge, UD doesn't take a position on exactly whose linguistic knowledge is being modeled with trees: the speaker's? hearer's? some average over a speech community? There may be specific treebanks that do seek to model the knowledge of specific individuals (learners, for instance). But I can rephrase the "simulates a speaker" part to clarify that this is just an analogy, not a theoretical claim. It sounds like you're advocating for treebanks to adopt the code-switching analysis. That may not be practical for all treebanks, though: it may be hard to find annotators familiar with the quoted languages, let alone prepared to apply the annotation guidelines for those languages (which may involve language-specific subtypes etc.). We don't want to encourage low-quality annotation of foreign language material by those who lack the qualifications, polluting the collected UD data in that language. I think the neutral position (that it's up to treebanks to decide) is the right one. Regarding morphological adaptation: Are you arguing that there should be features indicating a loan word was plural in the source language but singular in the target language? If so that may motivate Regarding phrasal idioms: This is simply to suggest that e.g. "C'est la vie" has no internal syntax as an idiom of English. I'm not sure it would make sense to pretend it consists of several VERBs, for example. |
Yes, the fashion now is to leave the name as it is, so keep it as a possible foreign word: Takahashi is a Japanese ( The exceptionality might be at (social) levels of iconicity, saliency, extravagance... but not morphosyntax.
It is no different from the exact equivalent in, say, Italian: ha un certo non so che: non so che 'I do not know what' is a PART-VERB-PRON phrase, it has a predicate, but this does not prevent it from being used as an argument itself, and the meaningful analysis is to make it depend as This is quite different from other phenomena like the Hungarian muszáj 'must', which comes from ger. muss sein 'it has to be', but has been completely morphosyntactically incorporated into the language. |
Updated the page. There were inconsistent signals regarding titles. Personally I wouldn't mind saying that "Le festin de Babette" is borrowed as a PROPN. But I guess others in the discussion think of titles as more compositional than typical names. |
If, as a speaker of English, I call my Italian friend Marco instead of translating his name to Mark, am I speaking Italian? In some narrow sense, yes. But it doesn't entail that I have any morphosyntactic knowledge of Italian—how names may or may not inflect for case and so on. As a practical matter, a speaker or annotator may not know the name's language of origin or even how to draw a sharp line between "English" and "non-English" names. Also true of place names: do we want to say that "Massachusetts" has a language code for the Massachusett language? I don't think this is remotely practical to implement at scale, so it is simpler to treat such names as borrowings, but if treebank developers have the resources to conduct etymological inquiries, they are welcome to add |
Well, just the fact that it strives towards a typological approach removes UD's point of view from that of a speaker or a hearer, in general from a spontaneous use of a natural language. Indirect proof of this are all the discussions taking place in these issues...
Yes I do, but at the same time I formulated it as in the long term: I think that we should by all means favour this kind of annotation, of course when it can be done in a sensible way, presenting it as the one to aim at. Then, if, for many practical and good reasons this is not (easily) possible, we still contemplate the "agnostic", "
No. If it is a foreign non-adapted word, only the original language's morphology matters (but see below). To assign a I was wondering, though, about possible cases like I like pizzes: here we would observe at the same time an Italian inflection (pl. pizze vs sg. pizza) and an English one (the pl. -s). To handle such cases, I would propose to stick to the original language annotation, all the while adding a layered feature, e.g.
OK, possibly, but I see no reason not to encourage a similar annotation (which in my opinion is the more meaningful one, though admittedly more difficult to achieve). However isolated these idioms might be, they do possess their own (foreign) internal syntax. |
You would be using an Italian name, and this might indeed be interesting to annotate. Nothing prevents anyone (and we in fact do observe this happening daily) from calling your friend Mark, or maybe latinately Marcus, and from code-switching with regard to his name, be it for style, joke, conviviality... Again, this is interesting to annotate, if it can be done. No claims about speaking any one language or being aware of its workings.
We are focusing on person and place names here, but they, morphosyntactically, really are not different from any other In this specific case I would agree on |
We have some documentation of the Foreign feature, a mention of foreign words in the X tag, and foreign expressions as an example of flat. But I can't find an overarching discussion of how to deal with foreign expressions.
Would the morphology overview be a good place for this?
Here is a crack at some text, including clarifications that were decided by the core group: