Turkish Language #231

Closed
AliNajafi1998 opened this issue Sep 22, 2022 · 5 comments


AliNajafi1998 commented Sep 22, 2022

Hi @cvzi,

I am trying to demojize emoji into Turkish. Following your docs, I added Turkish as a language and I am using it; however, some emoji are missing.
For example,
u'\U0001F3F4\U0000200D\U00002620\U0000FE0F' -> 🏴‍☠️
there is no Turkish equivalent text for that.
But there is tts for u'\U0001F3F4\U0000200D\U00002620' -> 🏴‍☠

I am working with Twitter data, so I need to demojize the tweets.
What can I do about this problem? I should mention that I also scraped the emoji from the Emojiterra website and merged them into the DB. That resolved some cases, but some emoji are still missing.

All I want is to demojize the Twitter-supported emoji.

Best!


cvzi commented Sep 22, 2022

I will check it out. For cases like 🏴‍☠️, where the same emoji exists with different code point sequences, we can probably just reuse the translation of the other variant.

Edit: Regarding the emoji that only differ in the suffix \U0000FE0F:
https://github.com/cvzi/emoji/tree/missing_translations
cvzi@797f1c2
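
As a rough illustration of the idea (not the code in the linked branch), a lookup can retry without the variation selector U+FE0F when the exact sequence has no translation. The table `TURKISH_NAMES`, the helper `lookup_ignoring_fe0f` and the Turkish name in it are hypothetical placeholders, not data from the library:

```python
# U+FE0F (variation selector-16) only requests emoji-style presentation.
VARIATION_SELECTOR_16 = "\ufe0f"

# Hypothetical stand-in for a translation table. Only the sequence
# without U+FE0F has a name here; the name itself is a placeholder.
TURKISH_NAMES = {
    "\U0001F3F4\u200d\u2620": ":korsan_bayrağı:",
}


def lookup_ignoring_fe0f(emj, names):
    """Return the name for emj, retrying without U+FE0F if there is none."""
    if emj in names:
        return names[emj]
    stripped = emj.replace(VARIATION_SELECTOR_16, "")
    return names.get(stripped)


# The fully-qualified pirate flag from the report above
qualified = "\U0001F3F4\u200d\u2620\ufe0f"
print(lookup_ignoring_fe0f(qualified, TURKISH_NAMES))
```

The lookup returns None when neither form is known, so the caller can still decide to keep the emoji as it is.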


In general the Turkish Unicode translation data may just be incomplete (I have not checked). I noticed that the Unicode translations lag behind the emoji releases for many languages.


Currently the untranslated emoji are skipped:

emoji/emoji/core.py

Lines 200 to 201 in b27cf78

# The emoji exists, but it is not translated, so we keep the emoji
replace_str = code_points

But that could be changed. Alternatively, you could use replace_emoji() instead of demojize() and fall back to the English name (or to a custom translation) whenever EMOJI_DATA has no translation.
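
A minimal sketch of such a fallback chain, assuming a callback that receives the emoji and its data dict and returns the replacement string. To keep it runnable without the library, `sample` is a hand-written stand-in shaped like EMOJI_DATA entries; `CUSTOM_TR`, `turkish_or_fallback` and the Turkish names are hypothetical placeholders:

```python
# Hypothetical custom translations, e.g. collected from another source.
CUSTOM_TR = {"\U0001F1F9\U0001F1F7": ":Türkiye_bayrağı:"}


def turkish_or_fallback(emj, emj_data):
    """Prefer "tr", then a custom translation, then the English name."""
    if "tr" in emj_data:
        return emj_data["tr"]
    if emj in CUSTOM_TR:
        return CUSTOM_TR[emj]
    return emj_data["en"]


# Stand-in entries shaped like EMOJI_DATA values (placeholder names)
sample = {
    "\U0001F600": {"en": ":grinning_face:", "tr": ":sırıtan_yüz:"},
    "\U0001F1F9\U0001F1F7": {"en": ":Turkey:"},
    "\U0001FAE0": {"en": ":melting_face:"},
}

for emj, data in sample.items():
    print(turkish_or_fallback(emj, data))
```

With the real library, the same function would presumably be passed as the second argument of replace_emoji().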


cvzi commented Sep 24, 2022

@AliNajafi1998 Could you upload the scraped data from Emojiterra (or the merged DB)? I would like to see what is still missing.
I saw that Emojiterra, for example, has all the flags/countries in Turkish, but the Unicode repository doesn't have any in Turkish. I wonder what else Emojiterra has.


cvzi commented Sep 24, 2022

Many of the untranslated emoji are the ones with components, that is, skin tone and hair color. For example, :man_climbing_dark_skin_tone: is not translated, but :man_climbing: is.
With replace_emoji(), such an emoji could fall back to the translation of :man_climbing::

import emoji

# make a list of all the components like 'dark_skin_tone' or 'red_hair'
all_components = []
for emj in emoji.EMOJI_DATA:
    if emoji.EMOJI_DATA[emj]["status"] == emoji.STATUS["component"]:
        all_components.append(emoji.EMOJI_DATA[emj]["en"][1:-1])
all_components = sorted(all_components, key=len, reverse=True)

def repl_fct(emj, emj_data):
    # Use the Turkish translation if there is one
    if "tr" in emj_data:
        return emj_data["tr"]

    # remove the components from the name
    # e.g. :person_medium-light_skin_tone_red_hair: -> :person:
    name = emj_data["en"][1:-1]
    for component in all_components:
        name = name.replace(component, "")
    name = f":{name.strip('_')}:"

    # Only search if removing the components actually changed the name
    if name != emj_data["en"]:
        # Check if the name without components has a translation
        for other_data in emoji.EMOJI_DATA.values():
            if other_data["en"] == name and "tr" in other_data:
                return other_data["tr"]

    # Return English name as last resort
    return emj_data["en"]


text = """
Dark skin climber :man_climbing_dark_skin_tone:
Ginger person :person_medium-light_skin_tone_red_hair:
"""

print(text)

text = emoji.emojize(text)

print(text)

decoded = emoji.replace_emoji(text, repl_fct)

print(decoded)

Output:

Dark skin climber :man_climbing_dark_skin_tone:
Ginger person :person_medium-light_skin_tone_red_hair:


Dark skin climber 🧗🏿‍♂️
Ginger person 🧑🏼‍🦰


Dark skin climber :dağcı_erkek:
Ginger person :yetişkin:


AliNajafi1998 commented Sep 26, 2022

@cvzi
This is the code for scraping.
For merging, I just checked whether each emoji was in EMOJI_DATA and whether it already had a "tr" entry.

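import requests as req
from bs4 import BeautifulSoup

# Fetch the Turkish copy-and-paste page and parse it with the built-in parser
response = req.get("https://emojiterra.com/copypaste/tr/")
soup = BeautifulSoup(response.text, "html.parser")
emojis = {}

# Keep only the <li> elements that are emoji entries, not links
data = soup.find_all('li')
data = [i for i in data if 'href' not in str(i)]

# Map each emoji (the clipboard text) to its Turkish title
for i in data:
    code = i['data-clipboard-text']
    emojis[code] = i['title'].strip()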

You can download the scraped data from here: emojitera.json
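
The merge step is not shown above, so here is only a sketch of how the described check could look. `merge_translations` is a hypothetical helper, the two dicts are small stand-ins for EMOJI_DATA and the scraped file, and wrapping the title in colons with underscores is an assumption made to match the style of the existing names:

```python
def merge_translations(emoji_data, scraped, lang="tr"):
    """Add scraped names for known emoji that lack `lang`; return those emoji."""
    added = []
    for emj, title in scraped.items():
        entry = emoji_data.get(emj)
        # Skip emoji the table does not know and emoji already translated
        if entry is None or lang in entry:
            continue
        # Assumed name format: words joined by underscores, wrapped in colons
        entry[lang] = ":" + "_".join(title.split()) + ":"
        added.append(emj)
    return added


# Stand-in data with placeholder names, not real library or scraped content
emoji_data = {
    "\U0001F600": {"en": ":grinning_face:", "tr": ":sırıtan_yüz:"},
    "\U0001F1F9\U0001F1F7": {"en": ":Turkey:"},
}
scraped = {
    "\U0001F600": "Sırıtan Yüz",
    "\U0001F1F9\U0001F1F7": "Türkiye Bayrağı",
    "\U0001F9FF": "Nazar Boncuğu",
}

added = merge_translations(emoji_data, scraped)
print(added)
print(emoji_data["\U0001F1F9\U0001F1F7"])
```

Existing translations are left untouched, so the scraped names only fill gaps.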

@AliNajafi1998

Good idea, but for the ginger person the meaning changes.
Thanks again!
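
One possible way to avoid losing the meaning, sketched under assumptions: translate the base name and append the components instead of dropping them. The tables below are hand-written stand-ins; the two base names are taken from the output earlier in this thread, while the component translations and the helper `translate_keeping_components` are hypothetical placeholders:

```python
# Component names as they appear inside the English emoji names
COMPONENTS = ["medium-light_skin_tone", "dark_skin_tone", "red_hair"]

# Placeholder component translations (assumed, not library data)
COMPONENT_TR = {
    "medium-light_skin_tone": "orta_açık_cilt_tonu",
    "dark_skin_tone": "koyu_cilt_tonu",
    "red_hair": "kızıl_saç",
}

# Base translations as printed earlier in this thread
BASE_TR = {"person": "yetişkin", "man_climbing": "dağcı_erkek"}


def translate_keeping_components(en_name):
    """Translate the base name and append the components it contained."""
    original = en_name.strip(":")
    base = original
    found = []
    # Longest first, so a longer component is removed before a shorter
    # one that it contains
    for component in sorted(COMPONENTS, key=len, reverse=True):
        if component in base:
            found.append(component)
            base = base.replace(component, "")
    base = base.strip("_")
    if base not in BASE_TR:
        # No translated base name: keep the English name unchanged
        return en_name
    # Restore the order in which the components appeared
    found.sort(key=original.index)
    parts = [BASE_TR[base]] + [COMPONENT_TR.get(c, c) for c in found]
    return ":" + "_".join(parts) + ":"


print(translate_keeping_components(":person_medium-light_skin_tone_red_hair:"))
print(translate_keeping_components(":man_climbing_dark_skin_tone:"))
```

Whether such composed names read naturally in Turkish is a separate question; the sketch only shows that the component information does not have to be thrown away.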
