Turkish Language #231

AliNajafi1998 · 2022-09-22T13:00:55Z

I am trying to demojize emojis for the Turkish language. Following your docs, I added Turkish and I am using it; however, some emojis are missing.
For example,
u'\U0001F3F4\U0000200D\U00002620\U0000FE0F' -> 🏴‍☠️
there is no Turkish equivalent text for that.
But there is tts for u'\U0001F3F4\U0000200D\U00002620' -> 🏴‍☠

I am actually working with Twitter data so I need to demojize the tweets.
I want to know what I can do about this problem. I should mention that I also scraped the emojis from the Emojiterra website and merged them into the DB. That resolved some cases, but some emojis are still missing.

All I want is to demojize the Twitter-supported emojis.

Best!

cvzi · 2022-09-22T21:02:06Z

I will check it out. For these cases like 🏴‍☠️ - the same emoji with different codepoints - we can probably use the translation of the other emoji easily.

Edit: Regarding the emoji that only differ in the suffix \U0000FE0F:
https://github.com/cvzi/emoji/tree/missing_translations
cvzi@797f1c2
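
For the emoji that differ only in \U0000FE0F, the idea of reusing the other sequence's translation can be sketched roughly as below. This is only an illustration and not the code in the linked branch: `find_translation` is a hypothetical helper, the small `data` dict is a stand-in that mimics the shape of EMOJI_DATA, and the Turkish name in it is a placeholder, not real translation data.

```python
VS16 = "\ufe0f"  # variation selector-16, requests emoji presentation


def find_translation(emj, data, lang):
    """Return the `lang` name for `emj`, ignoring U+FE0F differences.

    `data` maps emoji strings to dicts of names keyed by language code.
    Returns None when no spelling of the emoji has a translation.
    """
    entry = data.get(emj, {})
    if lang in entry:
        return entry[lang]

    # Compare sequences with every U+FE0F removed, so the pirate flag
    # with and without the selector counts as the same emoji.
    bare = emj.replace(VS16, "")
    for other, other_entry in data.items():
        if lang in other_entry and other.replace(VS16, "") == bare:
            return other_entry[lang]
    return None


# Stand-in data (assumption): only the sequence without FE0F has a "tr" name,
# and that name is a placeholder.
data = {
    "\U0001F3F4\u200D\u2620\uFE0F": {"en": ":pirate_flag:"},
    "\U0001F3F4\u200D\u2620": {"en": ":pirate_flag:", "tr": ":placeholder_tr_name:"},
}

print(find_translation("\U0001F3F4\u200D\u2620\uFE0F", data, "tr"))
```

The linear scan keeps the sketch short. For a whole corpus of tweets it would probably be better to build the stripped-sequence index once up front.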

In general the Turkish Unicode translation data may just be incomplete (I have not checked). I noticed that the Unicode translations lag behind the emoji releases for many languages.

Currently the untranslated emoji are skipped:

emoji/emoji/core.py

Lines 200 to 201 in b27cf78

    
           # The emoji exists, but it is not translated, so we keep the emoji 
        
           replace_str = code_points

But that could be changed. Alternatively, replace_emoji() could be used instead of demojize(), so that you can fall back to the English name (or to a custom translation) when there is no translation in EMOJI_DATA.
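
A minimal sketch of such a fallback callback, assuming the two-argument form that replace_emoji() passes to a callable (the emoji characters and their EMOJI_DATA entry). The function name `name_with_fallback` and the `custom` dict are hypothetical, chosen here only for illustration.

```python
def name_with_fallback(chars, data_dict, lang="tr", custom=None):
    """Pick a name for one emoji: translation, then custom, then English.

    chars     -- the emoji characters found in the text
    data_dict -- that emoji's entry, e.g. {"en": ":name:", "tr": ":ad:"}
    custom    -- optional dict mapping emoji characters to your own names
    """
    if lang in data_dict:
        return data_dict[lang]
    if custom and chars in custom:
        return custom[chars]
    # Every entry has an English name, so this is the last resort.
    return data_dict["en"]
```

It would be passed as `emoji.replace_emoji(text, replace=name_with_fallback)`. To supply custom translations, `functools.partial(name_with_fallback, custom=my_names)` is one option.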

cvzi · 2022-09-24T21:40:34Z

@AliNajafi1998 Could you upload the scraped data from Emojiterra (or the merged DB)? I would like to see what is still missing.
I saw that Emojiterra, for example, has all the flags/countries in Turkish, but the Unicode repository doesn't have any in Turkish. I wonder what else Emojiterra has.

cvzi · 2022-09-24T22:29:57Z

A lot of the emoji that are not translated are the ones with components, that is skin color and hair color. For example: :man_climbing_dark_skin_tone: is not translated, but :man_climbing: is translated.
With replace_emoji() such an emoji could fall back to the translation of :man_climbing::

import emoji

# make a list of all the components like 'dark_skin_tone' or 'red_hair'
all_components = []
for emj in emoji.EMOJI_DATA:
    if emoji.EMOJI_DATA[emj]["status"] == emoji.STATUS["component"]:
        all_components.append(emoji.EMOJI_DATA[emj]["en"][1:-1])
all_components = sorted(all_components, key=len, reverse=True)

def repl_fct(emj, emj_data):
    if "tr" in emj_data:
        return emj_data["tr"]

    # remove the components from the name
    # e.g. :person_medium-light_skin_tone_red_hair: -> :person:
    name = emj_data["en"][1:-1]
    for component in all_components:
        name = name.replace(component, "")
    name = f":{name.strip('_')}:"

    # both names include the surrounding colons at this point
    if name != emj_data["en"]:
        # Check if the name without components has a translation
        for other in emoji.EMOJI_DATA.values():
            if other["en"] == name and "tr" in other:
                return other["tr"]

    # Return English name as last resort
    return emj_data["en"]


text = """
Dark skin climber :man_climbing_dark_skin_tone:
Ginger person :person_medium-light_skin_tone_red_hair:
"""

print(text)

text = emoji.emojize(text)

print(text)

decoded = emoji.replace_emoji(text, repl_fct)

print(decoded)

Dark skin climber :man_climbing_dark_skin_tone:
Ginger person :person_medium-light_skin_tone_red_hair:


Dark skin climber 🧗🏿‍♂️
Ginger person 🧑🏼‍🦰


Dark skin climber :dağcı_erkek:
Ginger person :yetişkin:

AliNajafi1998 · 2022-09-26T07:21:24Z

@AliNajafi1998 Could you upload the scraped data from Emojiterra (or the merged DB)? I would like to see what is still missing. I saw that Emojiterra for example has all the flags/countries in Turkish but Unicode repository doesn't have any in Turkish. I wonder what else Emojiterra has.

@cvzi
This is the code for scraping.
For merging, I just checked whether each emoji was in EMOJI_DATA and whether it already had a "tr" entry.

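import requests as req
from bs4 import BeautifulSoup

# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(req.get("https://emojiterra.com/copypaste/tr/").text, "html.parser")
emojis = {}

# keep only the <li> items that do not contain a link
data = soup.find_all('li')
data = [i for i in data if 'href' not in str(i)]

# map each emoji (the text copied to the clipboard) to its title on the page
for i in data:
    code = i['data-clipboard-text']
    emojis[code] = i['title'].strip()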

You can download the scraped data from here: emojitera.json
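
The merge step described above might look roughly like this. It is a sketch under assumptions, not the exact code that was used: `merge_translations` is a hypothetical name, the scraped titles are assumed to be plain text that has to be turned into the `:name_with_underscores:` form used in EMOJI_DATA, and the sample dicts are made-up stand-ins.

```python
def merge_translations(emoji_data, scraped, lang="tr"):
    """Add scraped names to entries that lack a `lang` translation.

    emoji_data -- dict shaped like EMOJI_DATA: emoji -> {"en": ..., ...}
    scraped    -- dict from the scraper: emoji -> title text
    Returns the list of emoji that received a new translation.
    """
    added = []
    for emj, title in scraped.items():
        entry = emoji_data.get(emj)
        # Skip emoji the library does not know, and never overwrite
        # a translation that is already present.
        if entry is None or lang in entry:
            continue
        entry[lang] = ":" + "_".join(title.split()) + ":"
        added.append(emj)
    return added


# Made-up stand-in data, only to show the behaviour.
sample_data = {
    "A": {"en": ":first:"},
    "B": {"en": ":second:", "tr": ":existing:"},
}
sample_scraped = {"A": "sample title", "B": "other title", "C": "unknown emoji"}

print(merge_translations(sample_data, sample_scraped))
print(sample_data["A"]["tr"])
```

The titles are deliberately not lowercased, because Python's default `str.lower()` does not follow Turkish casing rules for I and İ.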

AliNajafi1998 · 2022-09-26T07:32:41Z

A lot of the emoji that are not translated are the ones with components, that is skin color and hair color. For example: :man_climbing_dark_skin_tone: is not translated, but :man_climbing: is translated. With replace_emoji() such an emoji could fallback to the translation of :man_climbing::

Good idea, but for the ginger person the meaning changes.
Thanks Again!

This was referenced Sep 28, 2022

Missing emojis. #234

Closed

Unicode 15.0 #237

Merged

TahirJalilov closed this as completed Dec 11, 2022
