Enhance dataset with Spanish nicknames #71
Sources: [1] http://deron.meranda.us/data/nicknames.txt (Only pairs with likelihood >= 0.6) [2] https://github.com/onyxrev/common_nickname_csv
Added a lot of names and nicknames from other public DB
removed submit
merged abigail variants
fixed line 2 issue
Adding `seb` to match `sebastian`.
Update names.csv
Update names.csv
Updated the list of names, that is it.
Add randi for miranda. Add randy for randall, randolf, bertrand, and andrew. Supporting evidence: https://en.wikipedia.org/wiki/Randi https://en.wikipedia.org/wiki/Randy
Add randi/randy nicknames
We were getting failures in CI because we were auto-upgrading to a newer version. Really, it would be sweet if we used PDM or some other package manager that supported lockfiles. But this should work for our very minimal deps.
See previous commit
This version doesn't appear to exist.
This is a great addition @Ahosseinzadeh723! I've left a few comments to resolve before we can merge it in.
names.csv
Outdated
@@ -1049,7 +1049,7 @@ tasha,tash,tashie
 ted,teddy
 temperance,tempy
 terence,terry
-teresa,terry,tess,tessa,tessie
+teresa,terry,tess,tessa,tessie,tere,theresa
I could be wrong but I don't think "Theresa" would be a nickname or diminutive name for "Teresa".
Oh but Theresa would be the English version of Teresa?
names.csv
Outdated
bernardo,berna,bernard
candelario,candel
carlos,charles
cristian,cris,christian
Christian and Cristian fall under the same case as Theresa and Teresa.
names.csv
Outdated
candelario,candel
carlos,charles
cristian,cris,christian
cristobal,cris,christopher
Is Christopher a nickname for Cristobal?
"Cristóbal" is the Spanish equivalent of "Christopher." Spanish-speaking people who move to the U.S. often adopt a more English-friendly name, and this is one of those cases.
names.csv
Outdated
esteban,steven
federico,quiquo,kiko,federick
fernando,nando,ferdinand
francisco,paco,pancho,francis
I have a dog named "Paco". I didn't realize that was a nickname for Francisco. I'll have to start calling him that. :)
names.csv
Outdated
ignacio,nacho,ignatitus
jaime,james
jesus,chuy,chucho
jorge,koque,george
I've noticed a few of these where the Spanish name has an equivalent English name. I'm on the fence if we want to include these or not. I think it makes sense if a common convention would be for a Jorge to go by George in a more English setting.
names.csv
Outdated
@@ -1144,3 +1144,96 @@ zack,zach,zak
 zebedee,zeb
 zedediah,dyer,zed,diah
 zephaniah,zeph
+adam,adan
We'll need these additions included in the list alphabetically.
@Ahosseinzadeh723 Thanks for the contribution! I can see why this would be useful.
@carltonnorthern I am hesitant to merge this in without thinking about it a bit more. Semantically, this changes the meaning of this package a lot. Before, it was just nicknames within English. Now we are blurring the line with Spanish translations (even if these names are very common in English-speaking countries). I can see the immense usefulness of including this data, but I also worry that this will break other existing users, who might not want these translations included.
What if we took this as an opportunity to restructure the representation? Currently we use a "wide" format of "canonical,nickname1,nickname2,...,nicknameN". What if we switched to a normalized "long" format of "canonical,nickname"? E.g. "charles,charlie,carlos" would become the two rows "charles,charlie" and "charles,carlos".
This would have the benefit of
This format change also opens the door for us to add per-edge attributes. For example, we could label these cross-language nicknames by adding a 3rd column:
and then we could support filtering by these languages. We could also add more attributes in the future. This would be breaking for the raw consumers of the CSV, but we could keep the Python API stable. I think this would be worth it for the long-term life of this project. What do you think @carltonnorthern @Ahosseinzadeh723?
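A minimal sketch of how the proposed restructuring might look, assuming a third column holding a language code. The sample rows, the `CROSS_LANGUAGE` table, the `"en"`/`"es"` labels, and the function names `wide_to_long` and `build_lookup` are all hypothetical and are not part of the package's actual API or data; the sketch only illustrates converting "wide" rows to "long" rows and filtering by a per-edge attribute.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the current "wide" format:
# canonical,nickname1,nickname2,...,nicknameN
WIDE_CSV = "charles,charlie,carlos\nteresa,terry,tess\n"

# Hypothetical per-edge labels for cross-language pairs.
# The language codes are illustrative assumptions only.
CROSS_LANGUAGE = {("charles", "carlos"): "es"}


def wide_to_long(wide_text, default_lang="en"):
    """Yield one (canonical, nickname, language) row per nickname."""
    for row in csv.reader(io.StringIO(wide_text)):
        if not row:
            continue  # skip blank lines
        canonical, *nicknames = (cell.strip() for cell in row)
        for nickname in nicknames:
            if nickname:
                lang = CROSS_LANGUAGE.get((canonical, nickname), default_lang)
                yield canonical, nickname, lang


def build_lookup(long_rows, languages=None):
    """Map canonical name -> set of nicknames, optionally filtered by language."""
    lookup = defaultdict(set)
    for canonical, nickname, lang in long_rows:
        if languages is None or lang in languages:
            lookup[canonical].add(nickname)
    return dict(lookup)


long_rows = list(wide_to_long(WIDE_CSV))
english_only = build_lookup(long_rows, languages={"en"})
everything = build_lookup(long_rows)
```

With a lookup built this way, existing callers could keep receiving the same canonical-to-nicknames mapping, while callers who want the cross-language pairs would opt in through the `languages` filter.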
@NickCrews I think this is an excellent idea. I had similar concerns but couldn't think of a way to handle these issues. FYI, sorry for the slow response. I'm in the middle of orchestrating a move cross country.
71f2f08 to c5dae80
c5dae80 to 0d70221
I accidentally nuked the git history while getting auto-fix working in CI. It's back now, but that seems to have closed this PR. Once/if I make this structural change, we should re-open this PR with that new format.
Changes include: