## Parsing Emoji


Parse emoji into six categories:
1. Person
2. Place
3. Thing
4. Time
5. Activity
6. Mood


Libraries:
- emoji  https://pypi.org/project/emoji/
- Hug http://www.hug.rest/

#### Task 1:
Given the set of emojis in emoji package.
1. Parse each into one of the above six categories
2. Show this work using python3 in this notebook
3. Please document

#### Task 2:
Use the short python code snippet in app.py.
Add code to:
1. take in a string
2. find if there are any emojis
3. return a dict mapping any found emoji to their category type
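
One possible shape for the function Task 2 asks for is sketched below. It is not the contents of `app.py`: the function name `find_emoji_categories` and the `toy_lookup` table are hypothetical, and a real lookup table would come from whatever categorization Task 1 produces.

```python
import re

def find_emoji_categories(text, emoji_to_category):
    """Return {emoji: category} for every known emoji found in text."""
    if not emoji_to_category:
        return {}
    # Try longer sequences first so that a skin-toned emoji such as '👍🏾'
    # is matched whole rather than as its base emoji '👍'.
    keys = sorted(emoji_to_category, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return {m: emoji_to_category[m] for m in pattern.findall(text)}

# Hypothetical lookup table, for illustration only.
toy_lookup = {'👍': 'thing', '👍🏾': 'thing', '⚽': 'activity'}
found = find_emoji_categories('Nice goal ⚽ 👍🏾!', toy_lookup)
print(found)
```

Sorting the keys by length matters because many emojis are multi-codepoint sequences whose prefix is itself a valid emoji.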

#### Final Notes:
- This should be done using Python3
- Consider this a deadline project. I just want a useable solution. It won't be perfect.
- Please explain your process.
- Try to limit yourself to 4 hours.


### First Step: Exploring the emoji data

In [24]:
import emoji
import numpy as np

In [2]:
uni = set(emoji.EMOJI_UNICODE.keys())
uni_ali = set(emoji.EMOJI_ALIAS_UNICODE.keys())

In [3]:
len(uni_ali - uni)

803

`EMOJI_ALIAS_UNICODE` appears to contain every `EMOJI_UNICODE` entry plus about 800 additional aliases, so only the alias dictionary is used from here on.
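
The set difference above only counts names that are unique to the alias dictionary; confirming that it is a true superset also requires the reverse difference to be empty. A minimal sketch of that check, using two small made-up dictionaries as stand-ins for the package data:

```python
# Toy stand-ins for emoji.EMOJI_UNICODE and emoji.EMOJI_ALIAS_UNICODE (assumed data).
unicode_names = {':thumbs_up:': '👍', ':red_heart:': '❤'}
alias_names = {':thumbs_up:': '👍', ':red_heart:': '❤', ':+1:': '👍', ':heart:': '❤'}

uni = set(unicode_names)
uni_ali = set(alias_names)

only_in_alias = uni_ali - uni   # names the alias dict adds
only_in_plain = uni - uni_ali   # must be empty for the alias dict to be a superset
is_superset = uni <= uni_ali

print(len(only_in_alias), len(only_in_plain), is_superset)
```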

In [4]:
arr_keys = list(emoji.EMOJI_ALIAS_UNICODE.keys())
max_len = max([len(k) for k in arr_keys])

In [5]:
count = 0
for i in emoji.EMOJI_ALIAS_UNICODE:
    count +=1
    if count%50 == 0:
        w_s = ' '*(max_len - len(i) + 5)
        print(i,w_s , emoji.EMOJI_ALIAS_UNICODE[i])

:Cambodia:                                                      🇰🇭
:France:                                                        🇫🇷
:Japanese_open_for_business_button:                             🈺
:Mrs._Claus:                                                    🤶
:Russia:                                                        🇷🇺
:TOP_arrow:                                                     🔝
:airplane:                                                      ✈
:backhand_index_pointing_left_medium_skin_tone:                 👈🏽
:bellhop_bell:                                                  🛎
:bowling:                                                       🎳
:candle:                                                        🕯
:clapping_hands_medium_skin_tone:                               👏🏽
:cow:                                                           🐮
:detective:                                                     🕵
:eight-spoked_asterisk:                                         ✳
:fall

## Brainstorming

My first thought: find patterns in the description text of each emoji to determine a label.  For instance, if an emoji's description includes the word "face", it is more likely to be a *mood* than any other category. I would therefore loop over every emoji's description, run a regex search for certain patterns, then go through a chain of if conditions to determine its category.

Why I instantly rejected this: Categorizing emojis based on fixed heuristics that I create is wrong because emojis can mean different things depending on the person and circumstance.

My second thought: what if I build on top of a pretrained model?  [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) comes to mind.  This technique is a popular way to learn word embeddings. Word2Vec converts words into vectors in a space that captures context as well as syntactic and semantic relations with other words. For instance, vectorizing a set of synonyms with Word2Vec's pretrained model will yield vectors with high cosine similarities (i.e. these vectors point to roughly the same region of the vector space).


How exactly can I leverage this:
* Embed the words *place, person, mood, activity, thing, time* to yield six vectors.  These vectors live in a 300-dimensional space, each with a magnitude and direction that encode rich information about that word's form, meaning, and context.
* Vectorize each emoji with the [Emoji2Vec](https://github.com/uclmr/emoji2vec) pretrained model 
* Given each *emoji vector*, calculate the cosine similarity between it and each of the six categories. This will output a sequence of six scalars.  Choose the category with the highest similarity score.
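
The selection step in the last bullet can be sketched without loading any model. The snippet below uses made-up 3-dimensional vectors in place of the real 300-dimensional embeddings, and the helper names `cosine` and `best_category` are illustrative rather than part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_category(emoji_vec, category_vecs):
    """Return the category whose vector has the highest cosine similarity."""
    return max(category_vecs, key=lambda name: cosine(emoji_vec, category_vecs[name]))

# Made-up 3-dimensional vectors; the real embeddings are 300-dimensional.
toy_categories = {
    'person': [1.0, 0.0, 0.0],
    'place':  [0.0, 1.0, 0.0],
    'time':   [0.0, 0.0, 1.0],
}
toy_emoji = [0.9, 0.1, 0.2]
best = best_category(toy_emoji, toy_categories)
print(best)
```

Cosine similarity ignores vector length, so only the direction of the emoji vector relative to each category vector decides the label.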

Why this approach is appealing to me:  emoji categorization is not the result of my own perception of how an emoji ought to be labelled.  Instead, it is the result of capturing its representation in a rich vector space that has been trained on millions of sentences.



In [7]:
from gensim.models import KeyedVectors
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

In [8]:
categories = ['person', 'place', 'thing', 'time', 'activity', 'mood']

In [10]:
import gensim.models as gsm
e2v = gsm.KeyedVectors.load_word2vec_format('emoji2vec/pre-trained/emoji2vec.bin', binary=True)


In [11]:
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
import random
from copy import deepcopy

In [13]:
ems = list(emoji.EMOJI_ALIAS_UNICODE.values())

In [14]:
ems_vec = []
ems_not_working = []
for idx, em in enumerate(ems):
    try:
        v = e2v[em].reshape(1, -1)
        ems_vec.append((idx,v))
    except KeyError:
        # Emoji is not in emoji2vec's vocabulary
        ems_not_working.append(em)

In [15]:
len(ems_not_working)

1260

There is a problem with the e2v model: roughly a third of the emojis (1,260 of 3,503) are not in its vocabulary.

In [16]:
len(ems_vec)

2243

In [17]:
def get_max_similarity(v, cats):
    """
    Given emoji vector v and category embeddings cats, return the category
    with the highest cosine similarity score.

    Input: v -- type: <numpy.ndarray> with shape (1, 300);
           cats -- type: <dict>. keys are type <str>, values are type <numpy.ndarray>
                                 with shape (1, 300).
    Output: max_cat -- type: <str>
    """
    max_sim = -1
    max_cat = ''
    for cat, vec in cats.items():
        # cosine_similarity returns a (1, 1) array; take the scalar out of it
        sim = cosine_similarity(v, vec)[0, 0]
        if sim > max_sim:
            max_sim = sim
            max_cat = cat
    return max_cat

In [18]:
person= model.word_vec('person').reshape(1,-1)
place = model.word_vec('place').reshape(1,-1)
thing = model.word_vec('thing').reshape(1,-1)
activity = model.word_vec('activity').reshape(1,-1)
time = model.word_vec('time').reshape(1,-1)
mood = model.word_vec('mood').reshape(1,-1)
categories = {'person':person, 'place':place, 'thing':thing, 'activity':activity, 'time':time, 'mood':mood}

    
    

In [19]:
categories = {'person':person, 'place':place, 'thing':thing, 'activity':activity, 'time':time, 'mood':mood}
results = []

for idx, vector in ems_vec:
    best_cat = get_max_similarity(vector, categories)
    results.append((idx, best_cat))

In [20]:
new_dict = {}

for idx, cat in results:
    new_dict[ems[idx]]=cat

In [21]:
count =0
for key in new_dict:
    count+=1
    if count%15==0:
        print(key, new_dict[key])

🇦🇲 time
🇧🇲 place
🇰🇭 time
🇨🇵 time
🇩🇬 person
🇫🇴 thing
🇬🇱 place
🇮🇳 person
🎎 thing
🇰🇪 time
🇱🇺 time
🇫🇲 time
🇳🇱 thing
👌🏾 thing
🇵🇳 time
🎅🏿 thing
🇸🇧 thing
🇨🇭 time
🇹🇦 time
🆚 mood
⚗ place
🎨 mood
👶🏻 person
👈🏼 mood
🏸 activity
🏖 activity
⚫ time
🔵 time
👦🏼 mood
🚅 time
🍬 mood
🧀 person
🗜 activity
📫 place
😖 thing
🔄 time
🤞 thing
🍡 thing
😵 mood
💧 thing
🔌 person
😷 person
👨‍👨‍👧‍👧 person
⛴ mood
⛳ time
🙏🏻 person
🐸 person
👧🏾 person
💚 thing
💂🏼 mood
💘 person
🍯 thing
⌛ thing
🔠 thing
3️⃣ time
😙 mood
🗨 time
🍭 thing
👨🏻 person
📣 person
📲 person
🚠 time
💅🏼 thing
🚯 thing
🐙 person
👵🏾 person
📖 thing
🐂 thing
🍑 thing
🙇 thing
🙅🏿 thing
💇🏼 thing
🚵🏾 person
🙋🏽 person
🏄🏿 person
🛀🏼 mood
👳🏿 person
🛐 place
🚰 thing
🖨 person
✊ time
🙌🏾 person
💞 time
🌹 thing
🙈 thing
🚿 thing
🔸 place
⚽ activity
🗓 activity
⏱ time
🚟 time
🎾 time
👍🏾 thing
🚊 person
👬 person
📼 thing
🚾 place
♿ person
😉 thing
👢 thing
🇦🇽 place
🚆 time


Approach \#1: Using the pre-trained word2vec and emoji2vec models to compare each emoji with each category; the category with the highest cosine similarity is chosen as the category for that emoji.

Observable Problems

* Mixed labelling with flags
* Mixed labelling with people
* Mixed labelling with animals
* Rejects 1/3 of emojis


emoji2vec doesn't seem to provide a solid baseline. Its emoji embeddings do not generalize well.

Therefore, I attempt Approach #2: emoji vectors are created by vectorizing their text descriptions.  Each description carries rich information about the emoji's meaning; for instance, `:clapping_hands:` contains words that define exactly what that emoji is. I can use word2vec to generate a sequence of word embeddings for each emoji, then average those embeddings to get a single vector, which serves as the emoji vector.

In [25]:
smilies = emoji.EMOJI_ALIAS_UNICODE

In [26]:
len(smilies)

3503

In [27]:
words = list(smilies.keys())

In [28]:
words[1500:2000:100]

[':mermaid_medium-light_skin_tone:',
 ':oil_drum:',
 ':person_bowing_light_skin_tone:',
 ':person_playing_handball_medium_skin_tone:',
 ':pool_8_ball:']

The aliases need cleaning before they can be fed into word2vec.  The cells below show how I cleaned the text.

In [29]:
clean_words = deepcopy(words)

In [30]:
len(words)

3503

In [31]:
len(clean_words)

3503

In [32]:
clean_words = [i.strip(':') for i in clean_words]

In [33]:
clean_words = [i.rstrip('0123456789.+()') for i in clean_words]

In [34]:
for idx, word in enumerate(clean_words):
    # Drop the skin-tone suffix, e.g. 'mermaid_medium-light_skin_tone' -> 'mermaid_'
    if 'skin' in word:
        for tone in ('medium', 'dark', 'light'):
            if tone in word:
                word = word[:word.find(tone)]
                break
    word = word.replace('-', ' ')
    word = word.replace('facepalming', 'face palming')
    # Split on underscores and spaces, discarding empty tokens.
    # (A separate name is used so the `words` list of aliases is not overwritten.)
    tokens = [t for t in word.replace('_', ' ').split(' ') if t]
    clean_words[idx] = tokens

In [35]:
len(clean_words)

3503

In [36]:
clean_words[::100]

[['1st', 'place', 'medal'],
 ['French', 'Guiana'],
 ['Mrs.', 'Claus'],
 ['Taiwan'],
 ['backhand', 'index', 'pointing', 'right'],
 ['boxing', 'glove'],
 ['classical', 'building'],
 ['detective'],
 ['family'],
 ['genie'],
 ['hourglass', 'done'],
 ['lobster'],
 ['man', 'elf'],
 ['man', 'in', 'suit', 'levitating'],
 ['man', 'running'],
 ['mermaid'],
 ['oil', 'drum'],
 ['person', 'bowing'],
 ['person', 'playing', 'handball'],
 ['pool', '8', 'ball'],
 ['rice', 'cracker'],
 ['smirking', 'face'],
 ['thumbs', 'down'],
 ['waving', 'hand'],
 ['woman', 'detective'],
 ['woman', 'judge'],
 ['woman', 'shrugging'],
 ['airplane', 'arriving'],
 ['clock'],
 ['flag', 'for', 'Antigua', '&', 'Barbuda'],
 ['flag', 'for', 'Indonesia'],
 ['flag', 'for', 'St.', 'Barthélemy'],
 [],
 ['ok', 'hand'],
 ['spiral', 'note', 'pad'],
 ['worried']]

In [40]:
words_to_vec = deepcopy(clean_words)

In [41]:
failed_emojies = []
for idx, seq in enumerate(words_to_vec):
    vectors = []
    for word in seq:
        try:
            vectors.append(model.word_vec(word))
        except KeyError:
            # Word is not in word2vec's vocabulary: record it and skip the word
            failed_emojies.append((idx, seq))
            print(f'FAILED: {seq}')
    # Keep whatever vectors were found (possibly none)
    words_to_vec[idx] = vectors

FAILED: ['AB', 'button', '(blood', 'type']
FAILED: ['A', 'button', '(blood', 'type']
FAILED: ['B', 'button', '(blood', 'type']
FAILED: ['Cocos', '(Keeling)', 'Islands']
FAILED: ['Côte', 'd’Ivoire']
FAILED: ['Isle', 'of', 'Man']
FAILED: ['Japanese', 'free', 'of', 'charge', 'button']
FAILED: ['Japanese', 'not', 'free', 'of', 'charge', 'button']
FAILED: ['Myanmar', '(Burma']
FAILED: ['ON!', 'arrow']
FAILED: ['O', 'button', '(blood', 'type']
FAILED: ['Statue', 'of', 'Liberty']
FAILED: ['UP!', 'button']
FAILED: ['bow', 'and', 'arrow']
FAILED: ['cat', 'face', 'with', 'tears', 'of', 'joy']
FAILED: ['chequered', 'flag']
FAILED: ['cloud', 'with', 'lightning', 'and', 'rain']
FAILED: ['couch', 'and', 'lamp']
FAILED: ['cut', 'of', 'meat']
FAILED: ['diamond', 'with', 'a', 'dot']
FAILED: ['doughnut']
FAILED: ['ear', 'of', 'corn']
FAILED: ['eight', 'o’clock']
FAILED: ['eleven', 'o’clock']
FAILED: ['face', 'blowing', 'a', 'kiss']
FAILED: ['face', 'with', 'tears', 'of', 'joy']
FAILED: ['five', 'o’clock

In [42]:
len(failed_emojies)

165

In [44]:
failed_emojies[::10]

[(3, ['AB', 'button', '(blood', 'type']),
 (230, ['O', 'button', '(blood', 'type']),
 (721, ['doughnut']),
 (909, ['glass', 'of', 'milk']),
 (1041, ['keycap']),
 (1495, ['men’s', 'room']),
 (1879, ['pile', 'of', 'poo']),
 (2016, ['roll', 'of', 'paper']),
 (2060, ['shinto', 'shrine']),
 (2147, ['star', 'of', 'David']),
 (2675, ['woman’s', 'sandal']),
 (2839, ['dove', 'of', 'peace']),
 (2890, ['facepunch']),
 (3206, ['capital', 'abcd']),
 (3227, ['keycap', 'digit', 'three']),
 (3329, ['pisces']),
 (3442, ['thumbsup'])]
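
Many of the failed descriptions above contain stray punctuation (`'(blood'`, `'ON!'`) or short function words (`'of'`, `'and'`, `'a'`). One option, which is not what this notebook does, would be to strip punctuation and drop such words before the word2vec lookup. The sketch below assumes a hand-picked `STOP_WORDS` set and a hypothetical helper name `tokenize_alias`; whether it actually reduces the failure count has not been measured here.

```python
import re

# Assumed stop-word list, taken from words that recur in the failures above.
STOP_WORDS = {'of', 'and', 'a'}

def tokenize_alias(alias):
    """Turn an alias such as ':AB_button_(blood_type):' into lookup-ready tokens."""
    text = alias.strip(':').replace('_', ' ').replace('-', ' ')
    # Keep runs of letters and digits (Unicode-aware); punctuation is dropped.
    tokens = re.findall(r'\w+', text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(tokenize_alias(':AB_button_(blood_type):'))
print(tokenize_alias(':cat_face_with_tears_of_joy:'))
```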

Because most emoji descriptions contain more than one word, each emoji now maps to a collection of word vectors.  I ultimately want a single vector per emoji, so I take their average, as coded below.

In [45]:
for idx, vecs in enumerate(words_to_vec):
    if len(vecs) > 0:
        avg_vec = np.mean(vecs, axis=0, dtype=np.float64)
    else:
        # No word was found in word2vec: mark with NaN so this emoji is labelled FAIL later
        avg_vec = np.full(300, np.nan)
    words_to_vec[idx] = avg_vec
print(len(words_to_vec))


3503


In [46]:
for i in words_to_vec:
    print(len(i))
    break

300


In [47]:
len(words_to_vec)

3503
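
The averaging step, including the case where none of a description's words has a vector, can be illustrated on toy data. This is a dependency-free sketch with made-up 2-dimensional embeddings; `average_vector` is an illustrative name, and returning `None` for the empty case is one possible convention rather than the one used in this notebook.

```python
def average_vector(tokens, embeddings):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # explicit marker instead of a 0/0 division
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Made-up 2-dimensional embeddings standing in for word2vec.
toy_embeddings = {'clapping': [1.0, 3.0], 'hands': [3.0, 1.0]}

print(average_vector(['clapping', 'hands'], toy_embeddings))
print(average_vector(['clapping', 'keycap'], toy_embeddings))
print(average_vector(['keycap'], toy_embeddings))
```

Unknown words are simply left out of the mean, so a description with one known word is represented by that word alone.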

In [48]:
categories = {'person':person, 'place':place, 'thing':thing, 'activity':activity, 'time':time, 'mood':mood}


In [49]:
results = []
for idx, avg_vec in enumerate(words_to_vec):
    avg_vec = avg_vec.reshape(1,-1)
    try:
        best_cat = get_max_similarity(avg_vec, categories)
        results.append((idx, best_cat))
    except ValueError:
        # cosine_similarity rejects NaN vectors (emojis with no known words)
        results.append((idx, 'FAIL'))

In [50]:
results[:10]

[(0, 'place'),
 (1, 'place'),
 (2, 'place'),
 (3, 'thing'),
 (4, 'place'),
 (5, 'thing'),
 (6, 'place'),
 (7, 'activity'),
 (8, 'activity'),
 (9, 'time')]

In [51]:
words = list(smilies.keys())

In [52]:
words[:10]

[':1st_place_medal:',
 ':2nd_place_medal:',
 ':3rd_place_medal:',
 ':AB_button_(blood_type):',
 ':ATM_sign:',
 ':A_button_(blood_type):',
 ':Afghanistan:',
 ':Albania:',
 ':Algeria:',
 ':American_Samoa:']

In [53]:
import pandas as pd

In [54]:
print(len(words))
print(len(clean_words))

3503
3503


In [55]:
df = pd.DataFrame(data=words)
df_2 = pd.DataFrame(data=[clean_words])
df_2 = df_2.T

In [56]:
df_2 = df_2.rename(columns={0:'seq_clean_text'})

In [57]:
df = df.rename(columns={0:'EMOJI_ALIAS_UNICODE'})

In [58]:
df['emoji'] = pd.Series(list(smilies.values()), index=df.index)


In [59]:
df['VEC_TO_CAT'] = pd.Series([i[1] for i in results], index=df.index)


In [60]:
df['SEQ_TO_VEC'] =pd.Series(words_to_vec, index=df.index)

In [61]:
df = pd.concat([df, df_2], axis=1)

In [63]:
df_2 = df.copy()

In [64]:
df_2 = df_2.drop(columns=['SEQ_TO_VEC'])

In [65]:
df_2.head()

| Unnamed: 0 | EMOJI_ALIAS_UNICODE | emoji | VEC_TO_CAT | seq_clean_text |
|---|---|---|---|---|
| 0 | :1st_place_medal: | 🥇 | place | [1st, place, medal] |
| 1 | :2nd_place_medal: | 🥈 | place | [2nd, place, medal] |
| 2 | :3rd_place_medal: | 🥉 | place | [3rd, place, medal] |
| 3 | :AB_button_(blood_type): | 🆎 | thing | [AB, button, (blood, type] |
| 4 | :ATM_sign: | 🏧 | place | [ATM, sign] |


In [66]:
df[df.VEC_TO_CAT == 'FAIL'].count()

EMOJI_ALIAS_UNICODE    54
emoji                  54
VEC_TO_CAT             54
SEQ_TO_VEC             54
seq_clean_text         54
dtype: int64

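I decided to repair the "FAIL" categories manually. Note that the substring rules below overwrite the label of every matching row, not only rows marked FAIL (which is why `:1st_place_medal:` later shows up as `activity`). Anything still marked FAIL after these rules defaults to `thing`.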

In [67]:
df.loc[df.EMOJI_ALIAS_UNICODE.str.contains('merperson'), 'VEC_TO_CAT'] = 'person'

df.loc[df.EMOJI_ALIAS_UNICODE.str.contains('selfie'), 'VEC_TO_CAT'] = 'person'

df.loc[df.EMOJI_ALIAS_UNICODE.str.contains('keycap'), 'VEC_TO_CAT'] = 'thing'

df.loc[df.EMOJI_ALIAS_UNICODE.str.contains('thumbs'), 'VEC_TO_CAT'] = 'person'

df.loc[df.EMOJI_ALIAS_UNICODE.str.contains('heart'), 'VEC_TO_CAT'] = 'mood'

df.loc[df.EMOJI_ALIAS_UNICODE.str.contains('facepunch'), 'VEC_TO_CAT'] = 'activity'

df.loc[df.EMOJI_ALIAS_UNICODE.str.contains('1'), 'VEC_TO_CAT'] = 'activity'

df.loc[df.VEC_TO_CAT=='FAIL', 'VEC_TO_CAT']='thing'

df.loc[df.VEC_TO_CAT=='FAIL'].count()

EMOJI_ALIAS_UNICODE    0
emoji                  0
VEC_TO_CAT             0
SEQ_TO_VEC             0
seq_clean_text         0
dtype: int64

In [94]:
#save for later use
# df.to_csv('emoji_dict.csv')

In [71]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | EMOJI_ALIAS_UNICODE | emoji | VEC_TO_CAT | seq_clean_text |
|---|---|---|---|---|---|
| 0 | 0 | :1st_place_medal: | 🥇 | activity | ['1st', 'place', 'medal'] |
| 1 | 1 | :2nd_place_medal: | 🥈 | place | ['2nd', 'place', 'medal'] |
| 2 | 2 | :3rd_place_medal: | 🥉 | place | ['3rd', 'place', 'medal'] |
| 3 | 3 | :AB_button_(blood_type): | 🆎 | thing | ['AB', 'button', '(blood', 'type'] |
| 4 | 4 | :ATM_sign: | 🏧 | place | ['ATM', 'sign'] |


Some Observations:

* 54 samples with FAIL labels: none of the words in their description is found in word2vec's vocabulary, so no vector could be built.
* mixed labelling with animals
* mixed labelling with flags

To get a clear sense of the quality of my model, I perform a series of six tests.

I created six files, one per category, each containing 100 emojis that would likely be labelled as that category.  I set a baseline threshold of 60%, meaning 6 of 10 emojis should be labelled correctly.  I chose a low threshold because, as mentioned above, emoji meaning is really subjective: one person may interpret the flexing emoji as a person, while another would interpret it as a mood or an activity.

In [94]:
failed_emojis = []
files = ['person.txt', 'thing.txt', 'place.txt', 'time.txt', 'activity.txt', 'mood.txt']
df = pd.read_csv('emoji_dict.csv')


def extract_emojis(string):
    """Map a single emoji to {category: [emoji]}; return {} if it cannot be resolved."""
    result = dict()
    try:
        descr = emoji.UNICODE_EMOJI_ALIAS[string]
        cat = df[df.EMOJI_ALIAS_UNICODE==descr].VEC_TO_CAT.item()
        result[cat] = result.get(cat, []) + [string]
    except (KeyError, ValueError):
        # Unknown emoji, or no single matching row in the DataFrame
        failed_emojis.append(string)

    return result


test_results = []

for file in files:
    with open(file, 'r') as fp:
        distribution = dict()
        for line in fp:
            res = extract_emojis(line.strip('\n'))
            for key in res:
                distribution[key] = distribution.get(key, 0) + 1
        test_results.append((file, distribution))


NEEDED_THRESHOLD = .6

for file, res in test_results:
    total = sum(res.values())
    cat = file[:-4]
    # .get avoids a KeyError when a category was never predicted for its own file
    perc = res.get(cat, 0)/total
    print(cat, '\n', NEEDED_THRESHOLD <= perc, '\n', 'accuracy: ', perc*100)



person 
 True 
 accuracy:  97.82608695652173
thing 
 False 
 accuracy:  28.57142857142857
place 
 False 
 accuracy:  19.54022988505747
time 
 False 
 accuracy:  47.5
activity 
 False 
 accuracy:  7.792207792207792
mood 
 False 
 accuracy:  45.348837209302324


In [93]:
from pprint import pprint as pp
for a, b in test_results:
    pp((a, b))

('person.txt', {'person': 45, 'place': 1})
('thing.txt',
 {'activity': 7, 'mood': 10, 'person': 37, 'place': 7, 'thing': 28, 'time': 9})
('place.txt',
 {'activity': 6,
  'mood': 18,
  'person': 20,
  'place': 17,
  'thing': 14,
  'time': 12})
('time.txt', {'activity': 11, 'mood': 4, 'person': 12, 'thing': 15, 'time': 38})
('activity.txt',
 {'activity': 6, 'mood': 12, 'person': 19, 'place': 13, 'thing': 19, 'time': 8})
('mood.txt', {'mood': 39, 'person': 9, 'thing': 33, 'time': 5})


The statistics above show that my model is very good at detecting an emoji that represents a person, but very poor at detecting one that represents an activity or a place. It makes sense that person scores so highly, because a description like ":woman_raising_hand:" or ":man_teacher:" will almost certainly be close to 'person' in the vector space. On the other hand, the word "France" ends up semantically closest to 'mood'.
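
The per-file counts printed above can also be combined into a single overall figure. The sketch below copies those counts into a nested dictionary (true category, then predicted category) and derives the overall accuracy and how often each label was predicted; the variable names are illustrative.

```python
# Counts copied from the test output above: true category -> predicted category -> count.
confusion = {
    'person':   {'person': 45, 'place': 1},
    'thing':    {'activity': 7, 'mood': 10, 'person': 37, 'place': 7, 'thing': 28, 'time': 9},
    'place':    {'activity': 6, 'mood': 18, 'person': 20, 'place': 17, 'thing': 14, 'time': 12},
    'time':     {'activity': 11, 'mood': 4, 'person': 12, 'thing': 15, 'time': 38},
    'activity': {'activity': 6, 'mood': 12, 'person': 19, 'place': 13, 'thing': 19, 'time': 8},
    'mood':     {'mood': 39, 'person': 9, 'thing': 33, 'time': 5},
}

correct = sum(row.get(cat, 0) for cat, row in confusion.items())
total = sum(sum(row.values()) for row in confusion.values())
overall_accuracy = correct / total

# How often each label was predicted, regardless of the true category.
predicted = {}
for row in confusion.values():
    for cat, n in row.items():
        predicted[cat] = predicted.get(cat, 0) + n

print(correct, total, round(overall_accuracy, 3))
print(predicted)
```

On these counts, 173 of 474 test emojis (about 36.5%) are labelled correctly, and `person` is the most frequently predicted label (142 times), which may partly explain why it scores so well on its own file.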


If I had more time, I would tune my model by adding EmojiNet's keywords to each emoji's description words before embedding the emojis.
