Foreign names analyzed as compound, should be flat #81

Open
nschneid opened this issue Oct 20, 2019 · 5 comments
Comments

@nschneid
Contributor

  • Sao Paulo
  • Rio de Janeiro
  • Porte de Vanves
  • al - Qaeda
  • El Paso
  • El Paradero
  • La Croce
  • La Hacienda

etc.
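To make the requested change concrete, here is a rough Python sketch of re-attaching a name span as a head-initial flat structure (in UD, the first token of a flat span is the head and every later token attaches to it). The toy sentence, its tags and attachments, and the helper name `flatten_span` are all assumptions made up for this illustration; none of it is copied from EWT.

```python
def flatten_span(tokens, start, end):
    """Return a copy of `tokens` with ids start..end re-attached as a flat span.

    `tokens` is a list of dicts with keys id, form, head, deprel
    (a simplified, hypothetical stand-in for CoNLL-U rows).
    """
    span_ids = set(range(start, end + 1))
    by_id = {t["id"]: t for t in tokens}
    # The span's current root is the token whose head lies outside the span.
    roots = [by_id[i] for i in sorted(span_ids) if by_id[i]["head"] not in span_ids]
    if len(roots) != 1:
        raise ValueError("span is not a single subtree")
    ext_head, ext_deprel = roots[0]["head"], roots[0]["deprel"]

    out = []
    for t in tokens:
        t = dict(t)  # do not mutate the input
        if t["id"] == start:
            # The first token inherits the span's external attachment.
            t["head"], t["deprel"] = ext_head, ext_deprel
        elif t["id"] in span_ids:
            # Every later token attaches to the first one as flat.
            t["head"], t["deprel"] = start, "flat"
        elif t["head"] in span_ids:
            # Outside dependents of the old head move to the new head.
            t["head"] = start
        out.append(t)
    return out


# Invented toy analysis of "I visited Sao Paulo" with a compound name.
sentence = [
    {"id": 1, "form": "I", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "visited", "head": 0, "deprel": "root"},
    {"id": 3, "form": "Sao", "head": 4, "deprel": "compound"},
    {"id": 4, "form": "Paulo", "head": 2, "deprel": "obj"},
]

fixed = flatten_span(sentence, 3, 4)
for tok in fixed:
    print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

The sketch relabels every non-initial token in the span as flat, so internal punctuation (the hyphen in "al - Qaeda") would need a separate decision.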

@nschneid
Contributor Author

nschneid commented Dec 7, 2022

or we could introduce flat:foreign

@amir-zeldes
Contributor

I'd prefer foreignness to be in the Foreign feat, since otherwise you only see this for multiword foreign expressions. For example, if someone says:

¡Ole!

in an English corpus, I'd like it to be Foreign=Yes, but it wouldn't have a flat relation of any kind. So I think these should be flat and foreign, but one is a deprel and the other is a foreign-language identification (ideally coupled with what language it is, which we now have in GUM as well).

@nschneid
Contributor Author

nschneid commented Dec 7, 2022

Oops I spoke too soon. These are names, so should be flat:name I think (if we were to use the subtype in EWT, which we don't yet). The foreign part is relevant insofar as it is probably why heuristics used to preprocess EWT missed them.

flat:foreign is already in use in EWT for borrowed expressions like "c'est la vie".

Both flat:name and flat:foreign are universally "recommended". But the guidelines need clarification: I opened UniversalDependencies/docs#914.

I see the subtypes as a way of explaining why a flat structure is needed. A Foreign=Yes feature sounds good independently to capture foreignness as a lexical status. I don't know if it should also include names that incorporate foreign language function words/syntax, like "El Paso". So many place names are borrowed that it may not be a good idea to treat them all as foreign.

@amir-zeldes
Contributor

Sure, nothing stops people from using subtypes. I'm just not tempted to add these to any corpus I maintain, and would probably steer new developers away from them if it were up to me, because we already have PROPN for names and Foreign for foreign, so this just adds a layer where we could have conflicting analyses (and it misses one-word instances, as mentioned, so it's not really useful for retrieval).

I think names are really an entity level property, and foreignness is a text-span property, but I'm happy enough with PROPN and Foreign as practical operationalizations, especially given that most UD corpora don't have NER or codeswitching/Lang annotations. Documenting the reason why something is flat seems beyond the scope of what deprels should be responsible for to me - stating that something is foreign or a name seems interesting, by contrast, but is additional information to the syntax tree itself.
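The retrieval point can be illustrated with a small Python sketch that searches for foreign material two ways: by the Foreign=Yes feature and by the flat:foreign deprel. The rows below are invented for the example (the tokenization, the INTJ and X tags, and the attachments are assumptions, not copied from EWT or GUM), and the helper names are hypothetical.

```python
# Simplified rows: (sent_id, token_id, form, upos, feats, head, deprel)
ROWS = [
    (1, 1, "Ole", "INTJ", "Foreign=Yes", 0, "root"),
    (2, 1, "c'est", "X", "Foreign=Yes", 0, "root"),
    (2, 2, "la", "X", "Foreign=Yes", 1, "flat:foreign"),
    (2, 3, "vie", "X", "Foreign=Yes", 1, "flat:foreign"),
]


def parse_feats(feats):
    """Turn a CoNLL-U style FEATS string such as 'A=x|B=y' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def foreign_by_feature(rows):
    """Forms of all tokens marked Foreign=Yes."""
    return [r[2] for r in rows if parse_feats(r[4]).get("Foreign") == "Yes"]


def foreign_by_deprel(rows):
    """Forms of all tokens attached with the flat:foreign relation."""
    return [r[2] for r in rows if r[6] == "flat:foreign"]


print(foreign_by_feature(ROWS))  # ['Ole', "c'est", 'la', 'vie']
print(foreign_by_deprel(ROWS))   # ['la', 'vie']
```

On this toy data the deprel query never sees the one-word "Ole", and it only reaches the head "c'est" if you additionally follow head pointers, whereas the feature query finds every token directly.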

@nschneid
Contributor Author

nschneid commented Jul 6, 2024
