Enhance dataset with Spanish nicknames #71

Closed
wants to merge 192 commits into from

Ahosseinzadeh723

Changes include:

  • Addition of Spanish nicknames for male and female names.
  • Addition of English versions of Spanish names.

carlton.northern and others added 30 commits August 21, 2010 20:38
Added a lot of names and nicknames from other public DB
merged abigail variants
Adding `seb` to match `sebastian`.
added a comma between john and johnny line 549
@carltonnorthern
Owner

This is a great addition @Ahosseinzadeh723! I've left a few comments to resolve before we can merge it in.

names.csv Outdated
@@ -1049,7 +1049,7 @@ tasha,tash,tashie
ted,teddy
temperance,tempy
terence,terry
-teresa,terry,tess,tessa,tessie
+teresa,terry,tess,tessa,tessie,tere,theresa
Owner

I could be wrong but I don't think "Theresa" would be a nickname or diminutive name for "Teresa".

Owner

Oh but Theresa would be the English version of Teresa?

names.csv Outdated
bernardo,berna,bernard
candelario,candel
carlos,charles
cristian,cris,christian
Owner

Christian and Cristian fall under the same case as Theresa and Teresa.

names.csv Outdated
candelario,candel
carlos,charles
cristian,cris,christian
cristobal,cris,christopher
Owner

Is Christopher a nickname for Cristobal?

Author

"Cristóbal" is the Spanish equivalent of "Christopher." Spanish-speaking people who move to the U.S. often adopt a more English-friendly name, and this is one of those cases.

names.csv Outdated
esteban,steven
federico,quiquo,kiko,federick
fernando,nando,ferdinand
francisco,paco,pancho,francis
Owner

I have a dog named "Paco". I didn't realize that was a nickname for Francisco. I'll have to start calling him that. :)

names.csv Outdated
ignacio,nacho,ignatitus
jaime,james
jesus,chuy,chucho
jorge,koque,george
Owner

I've noticed a few of these where the Spanish name has an equivalent English name. I'm on the fence about whether we want to include these. I think it makes sense if a common convention would be for a Jorge to go by George in a more English setting.

names.csv Outdated
@@ -1144,3 +1144,96 @@ zack,zach,zak
zebedee,zeb
zedediah,dyer,zed,diah
zephaniah,zeph
adam,adan
Owner

We'll need these additions included in the list alphabetically.

@NickCrews
Collaborator

NickCrews commented Jun 18, 2024

@Ahosseinzadeh723 Thanks for the contribution! I can see why this would be useful.

@carltonnorthern I am hesitant to merge this in without thinking about it a bit more. Semantically, this changes the meaning of this package a lot. Before, it was just nicknames within English. Now we are blurring the line with Spanish translations (even if these names are very common in English-speaking countries). I can see the immense usefulness of including this data, but I also worry that this will break existing users, who might not want these translations included.

What if we took this as an opportunity to restructure the representation:

Currently we use a "wide" format of "canonical,nickname1,nickname2,...,nicknameN". What if we switched to a normalized "long" format of "canonical,nickname"? E.g. "charles,charlie,carlos" would be

charles,charlie
charles,carlos
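
The conversion from the wide format to the long format described here can be sketched in a few lines. This is only an illustration, not code from the project: the function name `wide_to_long` is hypothetical, and it assumes the first cell of each row is the canonical name and every later cell is a nickname.

```python
import csv
import io


def wide_to_long(wide_text: str) -> list[tuple[str, str]]:
    """Turn 'canonical,nick1,nick2,...' rows into sorted (canonical, nickname) pairs.

    Hypothetical helper: assumes the first cell is the canonical name.
    """
    pairs = set()
    for row in csv.reader(io.StringIO(wide_text)):
        # Strip stray whitespace and drop empty cells left by doubled commas.
        cells = [cell.strip() for cell in row if cell.strip()]
        if len(cells) < 2:
            continue  # a line with no nicknames contributes no pairs
        canonical, *nicknames = cells
        for nickname in nicknames:
            pairs.add((canonical, nickname))
    # Sorting the pairs keeps the whole long-format file in a stable order.
    return sorted(pairs)


wide = "charles,charlie,carlos\nted,teddy\n"
for canonical, nickname in wide_to_long(wide):
    print(f"{canonical},{nickname}")
```

Because each pair is its own line, adding or removing one nickname touches exactly one line of the file.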

This would have the benefit of

  1. Cleaner diffs when editing: entire lines would get added or deleted.
  2. It would be trivial to keep the whole file sorted, which is nice.
  3. Easier for consuming libraries to ingest the data, e.g. we could add a pairs(self) -> Iterable[NamedTuple] method to NickNamer. Then we could pass the data right into a pandas dataframe, or a normalized SQL table. I already do this in a downstream library, but I have to access the private attributes of the NickNamer to do it, and it's sorta gross.
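
A rough sketch of what the third point might look like, written as a standalone function rather than a method, since the thread does not show the internals of NickNamer. The `NamePair` type, the `pairs` function, the `lookup` dictionary, and the table name are all assumptions made for illustration; the SQL side uses the standard-library `sqlite3` module.

```python
import sqlite3
from typing import Iterable, Iterator, NamedTuple


class NamePair(NamedTuple):
    """Hypothetical record type for one canonical-to-nickname edge."""
    canonical: str
    nickname: str


def pairs(lookup: dict[str, set[str]]) -> Iterator[NamePair]:
    """Yield one NamePair per edge, in sorted order (assumed data layout)."""
    for canonical in sorted(lookup):
        for nickname in sorted(lookup[canonical]):
            yield NamePair(canonical, nickname)


def load_into_sqlite(rows: Iterable[NamePair]) -> sqlite3.Connection:
    """Load the pairs into a normalized two-column table in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nicknames (canonical TEXT, nickname TEXT)")
    # NamedTuple instances are plain tuples, so executemany accepts them.
    conn.executemany("INSERT INTO nicknames VALUES (?, ?)", rows)
    conn.commit()
    return conn


lookup = {"charles": {"charlie", "carlos"}, "ted": {"teddy"}}
conn = load_into_sqlite(pairs(lookup))
rows = conn.execute(
    "SELECT nickname FROM nicknames WHERE canonical = ? ORDER BY nickname",
    ("charles",),
).fetchall()
print(rows)
```

A public iterator of this kind would let downstream code build tables without reaching into private attributes.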

This format change also opens the door for us to add per-edge attributes. For example, we could label these cross-language nicknames by adding a 3rd column:

canonical,nickname,languages
charles,charlie,EN:EN
charles,carlos,EN:SP

and then we could support filtering by these languages. And we could add more attributes in the future, for example

This would be breaking for the raw consumers of the CSV, but we could keep the python API stable. I think this would be worth it for the long-term life of this project. What do you think @carltonnorthern @Ahosseinzadeh723 ?
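
The language-filtering idea above could work roughly as follows. This is a sketch under assumptions: the function `nicknames_for` and its `allowed` parameter are hypothetical, and the three-column layout and the `EN:EN` / `EN:SP` tags are taken directly from the example in this comment.

```python
import csv
import io

# Sample data copied from the proposed three-column layout.
LONG_CSV = """canonical,nickname,languages
charles,charlie,EN:EN
charles,carlos,EN:SP
"""


def nicknames_for(name, text, allowed=None):
    """Return the nicknames of `name`, optionally restricted to some language tags.

    Hypothetical helper: `allowed` is a set of tags such as {"EN:EN"};
    None means no filtering.
    """
    found = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["canonical"] != name:
            continue
        if allowed is not None and row["languages"] not in allowed:
            continue  # skip edges whose language pair was not requested
        found.append(row["nickname"])
    return sorted(found)


print(nicknames_for("charles", LONG_CSV, allowed={"EN:EN"}))  # English-only edges
print(nicknames_for("charles", LONG_CSV))  # every edge, including cross-language
```

Users who only want English nicknames would keep today's behavior by passing the English-only tag, while others could opt in to the cross-language edges.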

@carltonnorthern
Owner

@NickCrews I think this is an excellent idea. I had similar concerns but couldn't think of a way to handle these issues.

FYI, sorry for the slow response. I'm in the middle of orchestrating a move cross country.

@NickCrews
Collaborator

I accidentally nuked the git history while getting auto-fix working in CI. It's back now, but that seems to have closed this PR. If and when I make this structural change, we should re-open this PR with the new format.
