Weird email and username in Chinese locale package #1105

Closed
shtse8 opened this issue Jun 24, 2022 · 18 comments · Fixed by #1554
Labels
c: locale Permutes locale definitions has workaround Workaround provided or linked m: internet Something is referring to the internet module p: 1-normal Nothing urgent s: accepted Accepted feature / Confirmed bug

Comments

@shtse8

shtse8 commented Jun 24, 2022

Describe the bug

Email and username should not use Chinese characters, even in the Chinese locale package.
No one uses Chinese in an email address or username, even among Chinese users.

Reproduction

code

// import { faker } from '@faker-js/faker';
import { faker } from '@faker-js/faker/locale/zh_CN'

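// Shape of the generated records (assumed; the original report did not include it).
interface User {
  userId: string
  username: string
  email: string
  avatar: string
  password: string
  birthdate: Date
  registeredAt: Date
}

export const USERS: User[] = []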

export function createRandomUser(): User {
  return {
    userId: faker.datatype.uuid(),
    username: faker.internet.userName(),
    email: faker.internet.email(),
    avatar: faker.image.avatar(),
    password: faker.internet.password(),
    birthdate: faker.date.birthdate(),
    registeredAt: faker.date.past(),
  }
}

Array.from({ length: 1 }).forEach(() => {
  USERS.push(createRandomUser())
})

console.log(USERS)

output

[
  {
    userId: '88d30bb6-c783-4e56-8ffc-6778ec6e1c0a',
    username: '钰轩.侯68',
    email: '明杰_彭@gmail.com',
    avatar: 'https://cloudflare-ipfs.com/ipfs/Qmd3W5DuhgHirLHGVixi6V76LhCkZUz6pnFt5AJBiyvHye/avatar/765.jpg',
    password: 'UdVxsDkMWFajEId',
    birthdate: 1964-10-12T19:43:31.378Z,
    registeredAt: 2022-04-27T11:56:33.741Z
  }
]

Additional Info

No response

@shtse8 shtse8 added the s: pending triage Pending Triage label Jun 24, 2022
@shtse8 shtse8 changed the title Weird email in chinese locale package Weird email and username in Chinese locale package Jun 24, 2022
@Shinigami92
Member

@ST-DDT
Member

ST-DDT commented Jun 24, 2022

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

@xDivisionByZerox xDivisionByZerox added s: awaiting more info Additional information are requested s: needs decision Needs team/maintainer decision c: locale Permutes locale definitions and removed s: pending triage Pending Triage labels Jun 24, 2022
@shtse8
Author

shtse8 commented Jun 25, 2022

https://en.wikipedia.org/wiki/International_email#Email_addresses 🤔

That is not the case. In theory, and per the standard, it is possible to support Chinese in domains, usernames, and emails, but it is not practical. Chinese is much harder to input than other languages.

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

Because most sites do not allow Chinese in emails and usernames: it is hard to handle technically, and parsing Chinese is relatively difficult. It is also much easier to type English, which can be entered directly from the keyboard, one character at a time.

In the Chinese-speaking world there are many ways to transcribe a Chinese name into English. In Hong Kong, we use our English name or the Cantonese phonetic spelling of our name on our ID card. For example, one common surname is romanized as Chan, the surname 張 as Cheung, and the first name 恒 as Hang. So someone called 張恒 might use "Cheung Hang" as his English name, "cheunghang" as his username, and "cheunghang@gmail.com" as his email.

Many of us also have a real English name that we chose ourselves, like Peter or Simon. So if 張恒 takes Peter as his English name, he might use Peter Cheung as his English display name, and use it for his username and email as well.

In Mainland China, Taiwan, and other Mandarin-speaking places like Singapore and Malaysia, people use Pinyin (the Mandarin phonetic system). For example, one surname (written differently in Traditional and Simplified Chinese) is Chen, the surname "張" is Zhang, and the first name 恒 is Heng. So someone called "張恒" might use "Zhang Heng" as his English name, "zhangheng" as his username, and "zhangheng@qq.com" as his email.
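The convention described above (surname followed by given name, lowercased and joined) can be sketched as follows. This is only an illustration: the `RomanizedName` type and both helper functions are hypothetical names, not part of faker, and the romanized spellings are the ones given in this comment.

```typescript
// Hypothetical shape for a name that has already been romanized
// (Cantonese or Pinyin spelling); not a faker type.
interface RomanizedName {
  surname: string;
  given: string;
}

// Surname first, then given name, lowercased, keeping only a-z and 0-9.
function toUserName(name: RomanizedName): string {
  return (name.surname + name.given).toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Builds an address from the username and a caller-supplied domain.
function toEmail(name: RomanizedName, domain: string): string {
  return `${toUserName(name)}@${domain}`;
}

const cantonese: RomanizedName = { surname: "Cheung", given: "Hang" };
const pinyin: RomanizedName = { surname: "Zhang", given: "Heng" };

console.log(toEmail(cantonese, "gmail.com")); // cheunghang@gmail.com
console.log(toEmail(pinyin, "qq.com")); // zhangheng@qq.com
```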

Let's take a look on DouYin (抖音) (Chinese version TikTok)
https://www.douyin.com/user/MS4wLjABAAAAvOpuhpSOPCAvoa6Slgg54m1DtiTBR4ac003SlM86yoxlmMF3AnnF2c8LzHEocAMj

image

`抖音号` is the username on the platform. This user picked `Sariel_740399`. I guess `Sariel` is their English name and `740399` is something meaningful to them, like a birthday?

https://www.douyin.com/user/MS4wLjABAAAApDszKVp0whQtJRUaaDmKnrshCmZ5gwZwcXXnvYsAUFE
image
This user picked wobushixumengjie, while her displayed Chinese name is 洁梦徐. In Chinese the surname comes first, so her real Chinese name should be 徐梦洁; she just entered her name in reverse. The Pinyin of 徐梦洁 is Xu Meng Jie, which is part of her username. wobushi is the Pinyin (Wo Bu Shi) of 我不是, meaning "I am not".

Hope this helps make faker's fake data more realistic.

@Shinigami92
Member

Just my opinion and idea:

I feel like this is out of scope for faker itself. It currently uses a simple algorithm where a first name and last name are just inserted into the email.
Faker is not a converter library that specifically converts Chinese names to English ones.

So my proposal (and we can freely discuss that) would be:

Create/use a package to convert Chinese names to their English counterparts and pass them into the email function of faker.
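A rough sketch of that proposal, under loud assumptions: the tiny `PINYIN_LOOKUP` table stands in for a real conversion package, `romanize` and `emailForChineseName` are hypothetical helpers, and `EmailFn` only assumes that the email function accepts a first and last name, as the comment suggests. A stub is used in place of faker so the example runs on its own.

```typescript
// Assumed shape of an email generator that accepts a first and last name.
type EmailFn = (firstName?: string, lastName?: string) => string;

// Toy lookup standing in for a real romanization package (assumption).
// Real conversion is harder: some characters have several readings.
const PINYIN_LOOKUP: Record<string, string> = { "張": "Zhang", "恒": "Heng" };

// Romanize character by character; fail loudly on unknown characters.
function romanize(text: string): string {
  return Array.from(text)
    .map((ch) => {
      const latin = PINYIN_LOOKUP[ch];
      if (latin === undefined) {
        throw new Error(`No romanization known for "${ch}"`);
      }
      return latin;
    })
    .join("");
}

// Convert first, then hand the Latin-script names to the email function.
function emailForChineseName(surname: string, given: string, email: EmailFn): string {
  return email(romanize(given), romanize(surname));
}

// Stand-in for the real generator, so this sketch has no dependencies.
const stubEmail: EmailFn = (first = "", last = "") =>
  `${first}.${last}@example.com`.toLowerCase();

console.log(emailForChineseName("張", "恒", stubEmail)); // heng.zhang@example.com
```

In real use, the stub would be replaced by the library's own email function and the lookup table by a proper romanization package.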

@ST-DDT
Member

ST-DDT commented Jun 25, 2022

IMO we could probably add a locale like en_CN that contains some Chinese-sounding (first?/)last names, so it is possible to generate Peter Cheung as the "English" version of the Chinese name, which will then be used to generate the email.

However, the user would have to explicitly select this locale, because technically it is not Chinese anymore, and phonetically converting the text probably takes more than 50 lines of code. Some users might also explicitly want Chinese usernames and email addresses, because they have to verify that their system works with those as well. (In Germany, it is possible to use the umlauts äöüß in email addresses. Yes, it is rare, but some people prefer it over the ASCII-converted variants (ae, oe, ue, sz).)

export function createRandomUser(): User {
  return {
    userId: fakerZH.datatype.uuid(),
    username: fakerEN_CN.internet.userName(),
    email: fakerEN_CN.internet.email(),
    avatar: fakerZH.image.avatar(),
    password: fakerZH.internet.password(),
    birthdate: fakerZH.date.birthdate(),
    registeredAt: fakerZH.date.past(),
  }
}

If we add some kind of internal workaround, to delegate to the English Faker ourselves, then we won't be able to split faker into individual locale modules anymore.

@shtse8 What do you think about the en_CN locale approach?

@import-brain
Member

import-brain commented Jun 25, 2022

@shtse8 Is there a "trivial" way to translate/transcribe Chinese words to English/Latin letters?

Are usernames also in Latin letters? I think I have seen mostly Chinese usernames (display names) in Chinese forums (In the few I have ever visited).

There is a romanization system for Chinese characters called "pinyin" as @shtse8 said, but I'm not sure if there's an easy way to transliterate characters into it. I'll look into it.

Edit: Problem is, some Chinese characters have multiple ways to pronounce them based on context :/

@Shinigami92
Member

And just one Google search away, typing in pinyin npm, the first result is:
https://www.npmjs.com/package/pinyin

and there are even alternative packages

so I think this is the best workaround for now


According to this answer on Stack Overflow: https://stackoverflow.com/a/760151/6897682
we might want to think about an option to allow/disallow non-English letters and switch strategy based on that.
I wouldn't like to have a special case just for Chinese in our code base.

@import-brain import-brain added the has workaround Workaround provided or linked label Jun 26, 2022
@ST-DDT
Member

ST-DDT commented Jun 27, 2022

Today another "affected" method and locale showed up: internet.domainWord()
https://discord.com/channels/929487054990110771/929544565348777984/990970477138833428

We might have to add an option onlyAscii or similar to some of the internet methods.
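To make the idea concrete, here is a minimal sketch of what such an option could look like. Everything in it is hypothetical: `WordOptions`, `isAscii`, and `candidateWords` are illustrative names, and nothing here reflects an existing faker API.

```typescript
// Hypothetical options object; `onlyAscii` is the flag floated above.
interface WordOptions {
  onlyAscii?: boolean;
}

// True when every code unit is in the 7-bit ASCII range.
function isAscii(text: string): boolean {
  return /^[\x00-\x7F]*$/.test(text);
}

// Narrow the candidate pool before a random pick would be made.
function candidateWords(words: readonly string[], options: WordOptions = {}): string[] {
  return options.onlyAscii ? words.filter(isAscii) : [...words];
}

const pool = ["例子", "example", "müller"];
console.log(candidateWords(pool)); // all three words
console.log(candidateWords(pool, { onlyAscii: true })); // only "example"
```

Filtering the pool up front keeps the random selection itself unchanged, which is one possible way to avoid locale-specific special cases.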

@schw4rzlicht
Contributor

Especially with internet.domainWord() (or internet.domain() for that matter) it's kind of annoying b/c it leads to our CI failing over and over again (as we validate domain inputs) and always b/c of the word jalapeño which is randomly appearing.

From what I understand, not all TLDs are even accepting internationalized domain names (wiki), so I think it is out of scope for faker to determine which are and keep track of that. Imo, domain words should just not include non-ASCII chars to keep it simple.
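For illustration, the kind of validation that trips over such words might look like the sketch below. It assumes the common "letters, digits, hyphen" rule for a single domain label (1 to 63 characters, no leading or trailing hyphen); `isAsciiDomainLabel` is a hypothetical helper, not the validator used in that CI setup.

```typescript
// Checks one domain label against the ASCII letters/digits/hyphen rule:
// 1-63 characters, must start and end with a letter or digit.
function isAsciiDomainLabel(label: string): boolean {
  return /^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/i.test(label);
}

console.log(isAsciiDomainLabel("jalapeno")); // true
console.log(isAsciiDomainLabel("jalapeño")); // false
```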

@xDivisionByZerox xDivisionByZerox added the m: internet Something is referring to the internet module label Jul 29, 2022
@matthewmayer
Contributor

Perhaps locales that don't use an ASCII script should optionally be able to provide an alternative set of ASCII first names and last names, to be used in contexts that require ASCII, like email addresses? For example zh_CN, ar, el.
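One way to picture this suggestion is a locale definition with optional ASCII fallbacks. The sketch below is an assumption throughout: the `NameDefinitions` shape and its `asciiFirstNames` / `asciiLastNames` fields are invented for illustration and do not describe faker's actual locale format.

```typescript
// Hypothetical locale name data with optional ASCII alternatives.
interface NameDefinitions {
  firstNames: string[];
  lastNames: string[];
  asciiFirstNames?: string[];
  asciiLastNames?: string[];
}

// For ASCII-only contexts (such as emails), prefer the ASCII lists
// and fall back to the regular ones when a locale provides none.
function namesForAsciiContext(def: NameDefinitions): { firstNames: string[]; lastNames: string[] } {
  return {
    firstNames: def.asciiFirstNames ?? def.firstNames,
    lastNames: def.asciiLastNames ?? def.lastNames,
  };
}

// Sample data only, using names mentioned earlier in this thread.
const zhLike: NameDefinitions = {
  firstNames: ["恒"],
  lastNames: ["張"],
  asciiFirstNames: ["Heng"],
  asciiLastNames: ["Zhang"],
};
const enLike: NameDefinitions = { firstNames: ["Peter"], lastNames: ["Cheung"] };

console.log(namesForAsciiContext(zhLike).lastNames); // [ 'Zhang' ]
console.log(namesForAsciiContext(enLike).firstNames); // [ 'Peter' ]
```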

@ST-DDT ST-DDT mentioned this issue Nov 7, 2022
10 tasks
@matthewmayer
Contributor

matthewmayer commented Nov 12, 2022

Sample output for

    Object.keys(faker.locales).forEach(locale=>{faker.setLocale(locale); console.log(`${locale}: ${faker.internet.email()}`)})
af_ZA: Harvey_Ferreira60@gmail.com
ar: .@yahoo.com
az: Kellie_Hansen@yahoo.com
cz: Krytof9@atlas.cz
de: Lisann_Tsamonikian@yahoo.com
de_AT: Lenja2@gmail.com
de_CH: Marlies29@hotmail.com
el: .@gmail.com
en: Isobel40@yahoo.com
en_AU: Eliza_Edwards@yahoo.com
en_AU_ocker: Oliver46@gmail.com
en_BORK: Vita.Buckridge78@yahoo.com
en_CA: Fausto18@gmail.com
en_GB: Adrienne.Konopelski@yahoo.com
en_GH: person.female_first_name.Kusi@hotmail.com
en_IE: Oswaldo.Dietrich@hotmail.com
en_IN: Baalaaditya15@yahoo.co.in
en_NG: Titi.Christian94@yahoo.com
en_US: Erik83@hotmail.com
en_ZA: Amelia_Connelly33@yahoo.com
es: Esteban93@gmail.com
es_MX: Mayte.Ruiz@nearbpo.com
fa: 72@yahoo.com
fi: Oskari.Hmlinen@hotmail.com
fr: Flavie_Nguyen@hotmail.fr
fr_BE: Freda7@advalvas.be
fr_CA: Daija_Osinski@yahoo.ca
fr_CH: Arion13@hotmail.com
ge: _@posta.ge
he: 14@gmail.com
hr: David.Zdelar48@gmail.com
hu: Dina63@outlook.com
hy: .@gmail.com
id_ID: Paul_OKeefe@yahoo.co.id
it: Igor24@libero.it
ja: 太一.中村73@gmail.com
ko: 71@yahoo.co.kr
lv: Grover_Kshlerin@apollo.lv
mk: 44@hotmail.com
nb_NO: Herman_Strand@yahoo.com
ne: Raju26@gmail.com
nl: Nick.Janssen33@gmail.com
nl_BE: Amy44@gmail.com
pl: Gerald.Urbanowicz44@yahoo.com
pt_BR: Lvia98@yahoo.com
pt_PT: Edgar60@mail.pt
ro: Trenton46@hotmail.com
ru: Seamus.Carter@yahoo.com
sk: Bethany.Parisian@zoznam.sk
sv: Monica.Axelsson@gmail.com
tr: Brbars1@yahoo.com
uk: Hellen_Price34@ukr.net
ur: .@gmail.com
vi: VinhDiu.Mai@yahoo.com
zh_CN: 鑫鹏_宋@gmail.com
zh_TW: 樂駒76@hotmail.com
zu_ZA: Maphikelela.Mabhida@hotmail.com

I note there are two groups of locales with slightly different problems
zh_CN, zh_TW and ja contain unstripped non-ASCII characters

ar, el, fa, ge, he, hy, ko, mk, ur are stripped down and generally only contain _.01234567890, often giving an invalid address like .@gmail.com
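The two groups can be told apart mechanically. The following is a small sketch, with `classifyEmail` and its labels being hypothetical names chosen for this example; it only inspects the part before the `@` and is not a full email validator.

```typescript
type EmailProblem = "non-ascii" | "name-stripped" | "ok";

// Classifies the local part (before the last "@") of an address.
function classifyEmail(email: string): EmailProblem {
  const at = email.lastIndexOf("@");
  const local = at === -1 ? email : email.slice(0, at);
  // Group 1: non-ASCII characters survived in the local part.
  if (/[^\x00-\x7F]/.test(local)) return "non-ascii";
  // Group 2: no letters left, only separators and/or digits.
  if (!/[A-Za-z]/.test(local)) return "name-stripped";
  return "ok";
}

console.log(classifyEmail("鑫鹏_宋@gmail.com")); // non-ascii
console.log(classifyEmail(".@yahoo.com")); // name-stripped
console.log(classifyEmail("72@yahoo.com")); // name-stripped
console.log(classifyEmail("Isobel40@yahoo.com")); // ok
```

Note that this check does not flag addresses where only some letters were dropped, so it is a rough grouping rather than a correctness test.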

@matthewmayer
Contributor

matthewmayer commented Nov 12, 2022

The difference seems to come down to the fact that faker.helpers.slugify has some exceptions for Japanese and Chinese characters

https://github.com/faker-js/faker/blame/next/src/modules/helpers/index.ts#L37

slugify(string: string = ''): string {
    return string
      .replace(/ /g, '-')
      .replace(/[^\一-龠\ぁ-ゔ\ァ-ヴー\w\.\-]+/g, '');
  }

Note the Chinese and Japanese characters here are not stripped but Cyrillic, Arabic, Korean are:

faker.helpers.slugify("ABCD123 靖琪 結衣 용환.예 Саве.Панговски زینہ81") //'ABCD123-靖琪-結衣-.-.-81'
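To see the effect of that exception in isolation, here is a standalone sketch comparing two variants. `slugifyKeepingCjk` uses the same character class as the snippet quoted above (with the redundant escapes dropped), while `slugifyAsciiOnly` is a hypothetical variant without the exception; both function names are made up for this comparison.

```typescript
// Same character class as the quoted helper: CJK ideographs, hiragana,
// katakana, word characters, dots and hyphens are kept.
function slugifyKeepingCjk(text: string = ""): string {
  return text.replace(/ /g, "-").replace(/[^一-龠ぁ-ゔァ-ヴー\w.\-]+/g, "");
}

// Hypothetical variant without the CJK exception: ASCII word
// characters, dots and hyphens only.
function slugifyAsciiOnly(text: string = ""): string {
  return text.replace(/ /g, "-").replace(/[^\w.\-]+/g, "");
}

const sample = "靖琪 Саве.Панговски";
console.log(slugifyKeepingCjk(sample)); // 靖琪-.
console.log(slugifyAsciiOnly(sample)); // -.
```

With the exception removed, Chinese and Japanese input is stripped the same way as Cyrillic input, which is consistent but still leaves an unusable result for name-based emails.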

@matthewmayer
Contributor

... and that was originally introduced here:
0d3809d

It seems to have caused more problems than it solved, so perhaps that could be reverted, and a more general solution found for all the non-ascii-ish locales.

@ST-DDT
Member

ST-DDT commented Nov 12, 2022

I don't think that @example.com is any more useful than <InsertChineseCharactersHere>@example.com.

@matthewmayer
Contributor

as a simple solution, in non-ascii locales you could just make a purely random localPart for email addresses like two letters, followed by 5-8 numbers, e.g.

mj1234415@example.com

... at least it would be a valid email address.
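A minimal sketch of that idea, assuming lowercase ASCII letters and decimal digits. `randomLocalPart` is a hypothetical helper written for this example, and the injectable `random` parameter is only there to make the output reproducible; it is not a claim about how any eventual fix works.

```typescript
// Two random lowercase letters followed by 5 to 8 random digits.
// `random` must return a number in [0, 1), like Math.random.
function randomLocalPart(random: () => number = Math.random): string {
  const letters = "abcdefghijklmnopqrstuvwxyz";
  const digits = "0123456789";
  const pick = (chars: string): string =>
    chars.charAt(Math.floor(random() * chars.length));

  const digitCount = 5 + Math.floor(random() * 4); // 5, 6, 7 or 8
  let result = pick(letters) + pick(letters);
  for (let i = 0; i < digitCount; i++) {
    result += pick(digits);
  }
  return result;
}

console.log(`${randomLocalPart()}@example.com`);
```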

@matthewmayer
Contributor

I created #1554 as a tentative solution for this. I'm not sure it would be the best long-term solution, but at least it means that all locales return valid, ASCII email addresses.

@kz-d
Contributor

kz-d commented Nov 20, 2022

Email and username should not use Chinese characters, even in the Chinese locale package.
No one uses Chinese in an email address or username, even among Chinese users.

At least as far as email addresses are concerned, the same goes for Japan.
(If you enter a Japanese email address, it will be rejected by validation, even on most systems used in Japan.)

as a simple solution, in non-ascii locales you could just make a purely random localPart for email addresses like two letters, followed by 5-8 numbers

I think this fix will help!

@matthewmayer
Contributor

Thanks @kz-d good to get a Japanese opinion too :) I guess the #1554 PR will help with #1437 also

@ST-DDT ST-DDT linked a pull request Nov 20, 2022 that will close this issue
@ST-DDT ST-DDT added p: 1-normal Nothing urgent s: accepted Accepted feature / Confirmed bug and removed s: awaiting more info Additional information are requested s: needs decision Needs team/maintainer decision labels Nov 21, 2022
Status: Done
8 participants