Multi-person skin tones #204

Closed
cvzi opened this issue Feb 19, 2022 · 7 comments · Fixed by #259

Comments

@cvzi
Contributor

cvzi commented Feb 19, 2022

Multi-Person Skin Tones on unicode.org

Edit: here's a tool to create these: https://codepen.io/cvzi/full/RwQNJBK

These are currently not RGI (Recommended for General Interchange) according to Unicode, which means emojize() should not generate them.
However, they work on some phones and browsers. For example, a family of four people with four different skin tones: 👨🏽‍👩🏿‍👧🏻‍👦🏾
This emoji consists of:

  • 👨🏽 :man_medium_skin_tone:
  • \u200d ZWJ
  • 👩🏿 :woman_dark_skin_tone:
  • \u200d ZWJ
  • 👧🏻 :girl_light_skin_tone:
  • \u200d ZWJ
  • 👦🏾 :boy_medium-dark_skin_tone:
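The composition listed above can be checked with plain Python string operations. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the emoji package.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Each person is a base emoji followed by a skin tone modifier,
# in the same order as the list above.
man_medium = "\U0001F468\U0001F3FD"
woman_dark = "\U0001F469\U0001F3FF"
girl_light = "\U0001F467\U0001F3FB"
boy_medium_dark = "\U0001F466\U0001F3FE"

# Joining the four people with ZWJ gives the family sequence.
family = ZWJ.join([man_medium, woman_dark, girl_light, boy_medium_dark])

# Splitting on the joiner recovers the four people again.
parts = family.split(ZWJ)

# Print every code point of the sequence with its Unicode name.
for char in family:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")
```

The sequence is 11 code points long: four base emoji, four modifiers and three joiners.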

demojize() currently converts that emoji to:
'👨:medium_skin_tone:\u200d👩:dark_skin_tone:\u200d:girl_light_skin_tone:\u200d:boy_medium-dark_skin_tone:'

Possible solutions:

Convert the man and woman as well (minimal solution):

  • :man::medium_skin_tone:\u200d:woman::dark_skin_tone:\u200d:girl_light_skin_tone:\u200d:boy_medium-dark_skin_tone:

or combine the skin tones into man and woman as well:

  • :man_medium_skin_tone:\u200d:woman_dark_skin_tone:\u200d:girl_light_skin_tone:\u200d:boy_medium-dark_skin_tone:

or remove the skin tones:

  • :family_man_woman_girl_boy:

or keep the skin tones in a single combined name:

  • :family_man_woman_girl_boy_medium_dark_light_skin_medium-dark_tone:

Edit:

Probably the easiest one is this:
:man_medium_skin_tone:\u200d:woman_dark_skin_tone:\u200d:girl_light_skin_tone:\u200d:boy_medium-dark_skin_tone:

We have to decide whether to remove the \u200d or not.
If we keep the \u200d, emojize() can revert the string correctly, i.e. emojize(demojize(str)) == str.
I don't know what effect keeping them has, though; :\u200d: might be displayed strangely.
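To make the round-trip argument concrete, here is a toy model of the two functions. It is a sketch under stated assumptions: `toy_demojize`, `toy_emojize` and the four-entry table are hypothetical stand-ins written for this example, not the package's implementation or data.

```python
import re

ZWJ = "\u200d"

# Toy lookup table with only the four people from the example (assumption:
# the real package uses a much larger table and a different search strategy).
NAMES = {
    "\U0001F468\U0001F3FD": ":man_medium_skin_tone:",
    "\U0001F469\U0001F3FF": ":woman_dark_skin_tone:",
    "\U0001F467\U0001F3FB": ":girl_light_skin_tone:",
    "\U0001F466\U0001F3FE": ":boy_medium-dark_skin_tone:",
}
EMOJI = {name: char for char, name in NAMES.items()}


def toy_demojize(text, keep_zwj=True):
    """Replace each known emoji with its name; optionally drop the joiners."""
    for char, name in NAMES.items():
        text = text.replace(char, name)
    return text if keep_zwj else text.replace(ZWJ, "")


def toy_emojize(text):
    """Replace each known :name: with its emoji; leave unknown names alone."""
    return re.sub(r":[\w-]+:", lambda m: EMOJI.get(m.group(0), m.group(0)), text)


family = ZWJ.join(NAMES)  # the four people joined by ZWJ

kept = toy_emojize(toy_demojize(family, keep_zwj=True))
dropped = toy_emojize(toy_demojize(family, keep_zwj=False))

print(kept == family)     # joiners kept: the round trip is lossless
print(dropped == family)  # joiners dropped: four separate emoji come back
```

With the joiners kept, the intermediate string is exactly the "easiest" form proposed above; without them, the information that the four people formed one emoji is lost.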

@cvzi cvzi mentioned this issue Feb 19, 2022
11 tasks
@cvzi
Contributor Author

cvzi commented Dec 21, 2022

Could you reopen this? It's actually still a bug.

Not sure why I referenced it in v2.0.0; that was wrong.

@TahirJalilov TahirJalilov reopened this Dec 21, 2022
@cvzi
Contributor Author

cvzi commented Mar 23, 2023

@TahirJalilov
So I have this solved and working so far, but we need to decide how to handle the \u200d in these emoji.

The problem:

For example a family emoji can look like this 👨‍👩🏿‍👧🏻‍👦🏾

{man}\u200d{woman_dark_skin_tone}\u200d{girl_light_skin_tone}\u200d{boy_medium-dark_skin_tone}

The \u200d (ZWJ, zero width joiner) joins the emoji so that they are displayed as one picture, if the platform supports it. If it is not supported, the individual emoji are shown and the \u200d is invisible. The family emoji are supported on Chrome/Windows and Firefox/Windows.

That means when we do demojize() on such an emoji, we could convert it to:

:man:\u200d:woman_dark_skin_tone:\u200d:girl_light_skin_tone:\u200d:boy_medium-dark_skin_tone:

Or remove the \u200d

:man::woman_dark_skin_tone::girl_light_skin_tone::boy_medium-dark_skin_tone:

If we keep the \u200d, then emojize() can recreate the joined family emoji from the output of demojize(). If we remove them, emojize() will recreate the individual emoji instead of the joined emoji.

Likewise, emoji.replace_emoji() can either remove the \u200d or keep them:

emoji.replace_emoji(family_emoji, 'X') == 'XXXX'
#OR
emoji.replace_emoji(family_emoji, 'X') == 'X\u200dX\u200dX\u200dX'

Note that X\u200dX\u200dX\u200dX is displayed as 'XXXX'. So visually it is not a problem, but people may not expect the invisible \u200d to be there after replacing all emoji.
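The difference between the two results is invisible on screen but easy to see in code. The following sketch uses a simplified regular expression that only recognises the four person emoji from this example; `toy_replace` is a hypothetical helper, not the behavior or signature of emoji.replace_emoji().

```python
import re

ZWJ = "\u200d"

# Simplified assumption: a "person" is one of boy/girl/man/woman
# (U+1F466..U+1F469), optionally followed by a skin tone modifier.
PERSON = "[\U0001F466-\U0001F469][\U0001F3FB-\U0001F3FF]?"
ZWJ_SEQUENCE = re.compile(f"{PERSON}(?:{ZWJ}{PERSON})*")


def toy_replace(text, replacement, keep_zwj):
    """Replace every person in a ZWJ sequence, keeping or dropping joiners."""
    joiner = ZWJ if keep_zwj else ""

    def substitute(match):
        people = match.group(0).split(ZWJ)
        return joiner.join([replacement] * len(people))

    return ZWJ_SEQUENCE.sub(substitute, text)


# The family from this comment: the man has no skin tone modifier.
family_emoji = ZWJ.join(
    ["\U0001F468", "\U0001F469\U0001F3FF", "\U0001F467\U0001F3FB", "\U0001F466\U0001F3FE"]
)

without_zwj = toy_replace(family_emoji, "X", keep_zwj=False)
with_zwj = toy_replace(family_emoji, "X", keep_zwj=True)

print(repr(without_zwj), len(without_zwj))
print(repr(with_zwj), len(with_zwj))
```

Both strings render as XXXX, but the second is three characters longer, which matters for length checks, slicing and equality comparisons. Only joiners inside a matched emoji sequence are touched, so a \u200d used elsewhere in the text is left alone.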

My current solution could support both behaviors, so I would suggest having a switch/parameter to control it:

emoji.demojize(family_emoji, keep_zwj=True)
emoji.replace_emoji(family_emoji, keep_zwj=False)

# Or a global switch
import emoji
emoji.config.demojize.keep_zwj = True
emoji.config.replace_emoji.keep_zwj = False

I think the global config is better, because it is not something that you want to change more than once.
Also I would suggest these values as default:

  • demojize() should keep the \u200d by default to be able to emojize() the result again.
  • replace_emoji() should remove them by default.
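A module-wide switch with these defaults could be modelled roughly as follows. This is only a sketch of the idea: the `config` class and the two helper functions are hypothetical and written for this example; they mirror the attribute names suggested above but are not the package's code.

```python
ZWJ = "\u200d"


class config:
    """Hypothetical module-wide settings; never instantiated, only read."""

    class demojize:
        keep_zwj = True  # keep joiners so the output can be emojized again

    class replace_emoji:
        keep_zwj = False  # drop joiners so no invisible characters remain


def toy_demojize_names(names):
    """Join already-converted names, reading the switch at call time."""
    joiner = ZWJ if config.demojize.keep_zwj else ""
    return joiner.join(names)


def toy_replace_people(count, replacement):
    """Build the replacement for `count` joined people, reading the switch."""
    joiner = ZWJ if config.replace_emoji.keep_zwj else ""
    return joiner.join([replacement] * count)


default_names = toy_demojize_names([":man:", ":woman_dark_skin_tone:"])
default_replaced = toy_replace_people(4, "X")

# Flipping the global switch changes every later call in the program.
config.replace_emoji.keep_zwj = True
changed_replaced = toy_replace_people(4, "X")
```

The trade-off of a global switch is that it affects every caller in the process, which fits a setting that is chosen once but is awkward if two parts of a program need different behavior.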

@lsmith77

Let me preface this by saying that what I am doing is replacing emoji with alternative ones.

For example, replacing a woman with a person, or a man with a woman. I am also replacing skin tones, offering either a version without a skin tone or alternative skin tones.

Generally I think it makes sense to be able to do something like demojize(emojize(demojize(some_emoji))) so I would either keep \u200d or return a list in demojize() which could then also be handled by emojize().

For my use case a list could be convenient since I would want to iterate over all of the "sub-emoji" to create permutations of modified emojis.
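As an illustration of that use case, here is a rough sketch of generating skin tone permutations from a list of sub-emoji. The function names are made up for this example, and it assumes every sub-emoji is a single base character that accepts a skin tone modifier directly after it, which is not true for all emoji.

```python
from itertools import product

ZWJ = "\u200d"
# The five skin tone modifiers, U+1F3FB (light) to U+1F3FF (dark).
SKIN_TONES = [chr(code_point) for code_point in range(0x1F3FB, 0x1F400)]


def split_sub_emoji(emoji_sequence):
    """Return the list of sub-emoji of a ZWJ sequence."""
    return emoji_sequence.split(ZWJ)


def strip_skin_tone(sub_emoji):
    """Remove any skin tone modifier from a sub-emoji."""
    return "".join(char for char in sub_emoji if char not in SKIN_TONES)


def skin_tone_variants(emoji_sequence):
    """All combinations of 'no tone' or one of five tones per sub-emoji."""
    options = []
    for sub_emoji in split_sub_emoji(emoji_sequence):
        base = strip_skin_tone(sub_emoji)
        options.append([base] + [base + tone for tone in SKIN_TONES])
    return [ZWJ.join(combination) for combination in product(*options)]


# Man (no tone) joined with woman (dark skin tone).
couple = "\U0001F468" + ZWJ + "\U0001F469\U0001F3FF"
variants = skin_tone_variants(couple)
print(len(variants))  # 6 options per person, 2 people
```

Most of the generated sequences will not be RGI, so whether they render as a single picture depends on the platform.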

@cvzi
Contributor Author

cvzi commented May 12, 2023

Thanks for your input!

I agree a list of the "sub-emoji" might be nice. Possibly a list could be available in the callback of replace_emoji() or returned by emoji_list()

@lsmith77

In my case I am processing emoji one at a time using https://github.com/explosion/spacymoji

Note that they currently create separate tokens for \u200d-delimited emoji.

@cvzi
Contributor Author

cvzi commented May 21, 2023

I will add a new function to the module (probably call it emoji.analyze(string)) that can "tokenize" the string into a list of emoji and non-emoji chars.
If a found emoji is a ZWJ emoji, it would offer some way to detect/split the "sub-emoji".
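The kind of tokenizing described here could look roughly like the sketch below. It is not the planned emoji.analyze(): `toy_analyze`, the `Token` type and the simplified person-only pattern are assumptions made for illustration.

```python
import re
from typing import List, NamedTuple

ZWJ = "\u200d"

# Simplified assumption: only boy/girl/man/woman (U+1F466..U+1F469),
# each optionally followed by a skin tone modifier, count as emoji.
PERSON = "[\U0001F466-\U0001F469][\U0001F3FB-\U0001F3FF]?"
ZWJ_SEQUENCE = re.compile(f"{PERSON}(?:{ZWJ}{PERSON})*")


class Token(NamedTuple):
    text: str             # the emoji sequence or a single non-emoji character
    is_emoji: bool
    sub_emoji: List[str]  # parts of a ZWJ emoji; empty for non-emoji


def toy_analyze(string):
    """Split a string into emoji tokens and single non-emoji characters."""
    tokens = []
    position = 0
    for match in ZWJ_SEQUENCE.finditer(string):
        # Everything between two emoji becomes one token per character.
        tokens.extend(Token(char, False, []) for char in string[position:match.start()])
        tokens.append(Token(match.group(0), True, match.group(0).split(ZWJ)))
        position = match.end()
    tokens.extend(Token(char, False, []) for char in string[position:])
    return tokens


family = ZWJ.join(
    ["\U0001F468", "\U0001F469\U0001F3FF", "\U0001F467\U0001F3FB", "\U0001F466\U0001F3FE"]
)
text = "Hi " + family + "!"
tokens = toy_analyze(text)

for token in tokens:
    print(repr(token.text), token.is_emoji, len(token.sub_emoji))
```

Here the whole family is one token whose sub-emoji list has four entries, so a caller can treat it as a unit or iterate over the people individually.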

My progress so far:
https://github.com/cvzi/emoji/tree/main

@cvzi
Contributor Author

cvzi commented Jun 1, 2023

FYI these changes will remove support for Python 2.7 and probably 3.5.

Ref #243

cvzi added a commit to cvzi/emoji that referenced this issue Jun 6, 2023
The logic from demojize() is moved to two separate functions, tokenize and filter_tokens, in a new file emoji/tokenizer.py.
The logic for the search tree is also moved to that file.

A new public function analyze() is available that supports the multi-person skin tones.

The handling of the multi-person skintones can be controlled by the new `emoji.config` class, which is a static class that works as a module-wide configuration.