Weak Arabic Name handling #133

Open
us88 opened this issue Feb 6, 2022 · 1 comment

us88 commented Feb 6, 2022

The library does not handle Arabic names well, not even the most common patterns. I'm no expert on the topic, but I'm Arab and know the common patterns.

Compound Names
My first name is "Mohamad Ali", but the library identifies "Ali" as my middle name. In Arabic full names of the form "Mohamad X Surname", "Mohamad X" is almost always the first name. The exceptions are when X is "El" or "Al", in which case X is the first word of a compound surname, and when X is "Bin" (the library already handles this case correctly). Examples: Mohamad Khalil, Mohamad Amin, Mohamad Ali, Mohamad El Amin, Mohamad Bin Salman, etc.

Well-known Surname Suffixes
Some names like "Mohamad Zeineddine" can also be written as "Mohamad Zein El Dine". Here the first name is "Mohamad" and the surname is "Zein El Dine", which is equivalent to "Zeineddine". "El Dine"/"eddine" is an extremely common suffix in Arabic surnames (e.g. Zeineddine, Alameddine, Charafeddine, Safieddine, Saifeddine). Other suffixes like "-allah"/"-ullah"/"-ollah" are extremely common as well (e.g. Nasrallah). In other words, "El Dine" and "Allah" are almost always the second part of a surname: at least one more word is needed on their left to complete it.
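To make the rule above concrete, here is a minimal sketch in plain Python. It is not the library's code: `SURNAME_TAILS` and `split_surname` are hypothetical names, and the tail list only covers the two patterns mentioned above.

```python
# Hypothetical helper, not part of the library: word sequences that
# normally need one more word on their left to form a complete surname.
SURNAME_TAILS = [("el", "dine"), ("allah",)]


def split_surname(words):
    """Return (given_words, surname_words) for a list of name words."""
    lowered = [w.lower() for w in words]
    for tail in SURNAME_TAILS:
        n = len(tail)
        # The tail must end the name, and there must be room for both the
        # surname stem and at least one given name before it.
        if len(words) >= n + 2 and tuple(lowered[-n:]) == tail:
            return words[:-(n + 1)], words[-(n + 1):]
    # Fallback: treat only the last word as the surname.
    return words[:-1], words[-1:]


given, surname = split_surname("Mohamad Zein El Dine".split())
print(given, surname)
```

A single-word surname such as "Zeineddine" falls through to the fallback and is returned unchanged.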

Middle names hardly exist
An Arabic-looking name is a good hint that there is no middle name. Arabic cultures chain names instead of using middle names: the first name, followed by the father's name, followed by the paternal grandfather's name, and so on, and then the surname.

Edit: Honestly, the Wikipedia page discusses this really well https://en.wikipedia.org/wiki/Arabic_name

Owner

derek73 commented Feb 6, 2022

Thanks for all the info. I studied Arabic for many years so I'm familiar with some of this, but I hadn't gotten all that detail.

For "Mohamad Zein El Dine", you could get the parser to handle this correctly by adding "al" and "el" to the conjunctions. Conjunctions join to the words before and after them. This would break the correct parsing of the first name "Al", as in "Al Gore", but if you are dealing mostly with Arab names then you're not going to have many Al's anyway.
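As a rough illustration of that trade-off, here is a standalone sketch of the joining behaviour described above. It is not the parser's actual implementation; `join_conjunctions` is a hypothetical function, and it assumes a conjunction at the start of a name simply joins to the word after it.

```python
def join_conjunctions(words, conjunctions):
    """Merge each conjunction with the word before and the word after it."""
    pieces = []
    i = 0
    while i < len(words):
        word = words[i]
        if word.lower() in conjunctions:
            joined = [word]
            if pieces:
                # Pull the previous piece back in front of the conjunction.
                joined.insert(0, pieces.pop())
            if i + 1 < len(words):
                # Consume the following word as well.
                joined.append(words[i + 1])
                i += 1
            pieces.append(" ".join(joined))
        else:
            pieces.append(word)
        i += 1
    return pieces


arabic_conjunctions = {"al", "el"}
print(join_conjunctions("Mohamad Zein El Dine".split(), arabic_conjunctions))
print(join_conjunctions("Al Gore".split(), arabic_conjunctions))
```

In this sketch "Zein El Dine" ends up as a single piece that can serve as the last name, while "Al Gore" also collapses into one piece, so the first/last split is lost for that name.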

Interesting that you consider "Mohamad Ali" your first name. This potentially delves into the difference between Arab and Muslim naming, but some googling indicates that there are people named "Mohamad Ali" who consider "Mohamad" their first name and "Ali" their last name, so we probably wouldn't want to make assumptions about that name by default.

In the parser's terminology, prefixes are words that join to the word after them but not before. That sounds kinda like what you want to do with "Mohamad" and variations, but currently this check is only performed on last names so that things like "Al" can join to last names but still parse "Al Gore" correctly. To make "Mohamad" join to the first name, we could add a separate set of words that can join to first names. This would at least give someone like you the option of sticking "Mohamad" in there if that's what you needed.
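A sketch of what such an opt-in set could look like, in plain Python. Everything here is hypothetical: the parser has no such option today, `FIRST_NAME_PREFIXES`, `SURNAME_STARTERS`, and `split_first_name` are made-up names, and "Hassan" is only a placeholder surname for the example.

```python
# Hypothetical, user-supplied set of words that join to the word after
# them when they appear at the start of the name.
FIRST_NAME_PREFIXES = {"mohamad"}
# Words that start a surname instead of continuing the first name.
SURNAME_STARTERS = {"el", "al", "bin"}


def split_first_name(words):
    """Return (first_name, remaining_words) for a non-empty word list."""
    if (
        len(words) >= 3  # leave at least one word for the surname
        and words[0].lower() in FIRST_NAME_PREFIXES
        and words[1].lower() not in SURNAME_STARTERS
    ):
        return " ".join(words[:2]), words[2:]
    return words[0], words[1:]


print(split_first_name("Mohamad Ali Hassan".split()))
print(split_first_name("Mohamad El Amin".split()))
```

Because the join only happens when at least three words are present, a two-word name like "Mohamad Ali" still parses as first name plus last name, which keeps the default behaviour for people who use it that way.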

The parser does not currently check substrings; it only matches whole words (split on spaces). So if we wanted to add that, it would be a new feature. Other languages also have specific suffixes on specific name parts. For example, per #85, Russian and Ukrainian apparently have suffixes that help indicate middle names:

> middle (patronymic) always ends with suffix ovna/evna/ovich/evich/ich/ichna/inichna.

Substring matching like that is probably what we'd need in order to parse name suffixes that are attached to a word, such as "-eddine" and "-allah". (Not to be confused with the parser's suffixes set, which has the same name but is a totally different thing.)
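For illustration, a substring check of that kind could be sketched as below. This is not existing parser behaviour; `ENDING_HINTS` and `classify_by_ending` are hypothetical names, the endings are the ones quoted in this thread, and "Ivanovich" is just an example patronymic.

```python
# Hypothetical table mapping a name part to word endings that hint at it.
ENDING_HINTS = {
    "last": ("eddine", "allah", "ullah", "ollah"),
    "middle": ("ovna", "evna", "ovich", "evich", "ich", "ichna", "inichna"),
}


def classify_by_ending(word):
    """Return "last", "middle", or None based on the word's ending."""
    lowered = word.lower()
    for part, endings in ENDING_HINTS.items():
        for ending in endings:
            # Require a stem before the ending, so a bare "Allah" is not
            # treated as a suffixed surname.
            if len(lowered) > len(ending) and lowered.endswith(ending):
                return part
    return None


for name in ("Zeineddine", "Nasrallah", "Ivanovich", "Ali"):
    print(name, classify_by_ending(name))
```

The result is only a hint. A real implementation would still need to combine it with the word's position in the name.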

Re: middle names, it seems like it would be pretty easy to implement a switch to turn off middle names. The parser does a much worse job with Chinese names than with Arabic ones, but I believe Chinese names also lack middle names and instead sometimes have longer family names, so such a switch could be useful for other languages as well.
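A minimal sketch of such a switch, assuming the words that would otherwise become middle names are folded into the last name. The function `assign_parts` and its `allow_middle` flag are hypothetical and do not exist in the parser.

```python
def assign_parts(words, allow_middle=True):
    """Assign a non-empty list of name words to first/middle/last."""
    first, *rest = words
    if not rest:
        return {"first": first, "middle": "", "last": ""}
    if allow_middle:
        # Default: everything between the first and last word is "middle".
        return {
            "first": first,
            "middle": " ".join(rest[:-1]),
            "last": rest[-1],
        }
    # Switch off: no middle name, so the remaining words form the last name.
    return {"first": first, "middle": "", "last": " ".join(rest)}


print(assign_parts("Mohamad Zein El Dine".split()))
print(assign_parts("Mohamad Zein El Dine".split(), allow_middle=False))
```

Whether the extra words should go to the last name or the first name is a design choice; for compound first names like "Mohamad Ali" the opposite assignment might be preferable.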
