Please check out the original [TensorFlow implementation](https://github.com/uclmr/emoji2vec) of Emoji2Vec by its authors, as well as the [paper](https://arxiv.org/pdf/1609.08359.pdf), for more details. <br>
This notebook is intended to build intuition for Emoji2Vec and give an idea of its features rather than serve as an in-depth analysis of the framework.

In [1]:
# External dependencies
import os
import numpy as np
import pickle as pk
import gensim.models as gsm
import torch
from tabulate import tabulate

# Internal dependencies
from model import Emoji2Vec, ModelParams
from phrase2vec import Phrase2Vec
from utils import build_kb, get_examples_from_kb, generate_embeddings, get_metrics, generate_predictions

### What is Emoji2Vec?
Emoji2Vec representations are distributed representations of Unicode emojis, trained directly from natural language descriptions of those emojis. <br>
This notebook presents a high-level walkthrough of the process of training Emoji2Vec representations and demonstrates some experimental results. <br><br>
First, let's define the logical steps in training Emoji2Vec representations.
1. For each emoji in the data set, a number of natural language descriptions are collected.
2. Each description is encoded as a fixed-size vector in a high-dimensional space.
    - Although any encoding method could be used, this implementation follows the approach presented in the [paper](https://arxiv.org/pdf/1609.08359.pdf), using 300-dimensional [Google News word2vec embeddings](https://code.google.com/archive/p/word2vec/) together with a simple phrase encoder, `phrase2vec.Phrase2Vec`.
3. A neural network model is trained to classify emojis from their descriptions.
    - Inside the model, each unique emoji has its own vector of parameters, which is updated whenever that emoji is being classified. Through these repeated updates during training, the parameter vectors become the emoji representations.
4. The neural network's parameters are extracted and used as distributed emoji representations, embedded in the same space as the underlying word embeddings.
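
Steps 2 and 3 can be written down compactly. The notation below follows the paper as we read it and is not taken from this repository's code: $w_1, \dots, w_N$ are the word vectors of a description, $v_j$ is the resulting description vector, $x_i$ is the trainable vector of emoji $i$, and $y_{ij} \in \{0, 1\}$ says whether description $j$ really belongs to emoji $i$.

$$v_j = \sum_{k=1}^{N} w_k, \qquad P(y_{ij} = 1) = \sigma\left(x_i^\top v_j\right)$$

$$\mathcal{L}(i, j, y_{ij}) = -\log \sigma\left(y_{ij}\, x_i^\top v_j - (1 - y_{ij})\, x_i^\top v_j\right)$$

Since $1 - \sigma(z) = \sigma(-z)$, this is the usual logistic (binary cross-entropy) loss on the dot product of an emoji vector and a description vector.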

### How to train Emoji2Vec representations?
First, we need a couple of things to train Emoji2Vec representations. Most importantly, we need a way to encode natural language descriptions of emojis as fixed-size, high-dimensional vectors. For this we are going to use [Google News word2vec embeddings](https://code.google.com/archive/p/word2vec/), so in order to continue with this notebook make sure you've downloaded those embeddings and placed the `.bin.gz` file in `data/word2vec`. <br><br>
Of course, we also need some emojis and their natural language descriptions. Fortunately, the authors of the paper collected a data set that is part of this repository. <br>
Our training data set is in `data/training`. Let's add this directory to our parameters.

In [2]:
params = {"data_folder": "data/training"}

To train Emoji2Vec representations we need to put the emojis in some sort of organised structure, so that after training we know which vector of parameters in the model is associated with which emoji. The same needs to be done for each phrase (natural language description of an emoji) in the data set, so that we know which description describes which emoji.
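
To make the idea of this bookkeeping concrete, here is a minimal sketch on toy data. It is not how `build_kb` is implemented (its internals are not shown in this notebook); `build_index` and all the `toy_` names are hypothetical and exist only for illustration.

```python
# Toy (emoji, description) pairs; the real data set lives in data/training.
pairs = [
    ("🚓", "police car"),
    ("🚓", "cop car"),
    ("🐢", "turtle"),
]


def build_index(items):
    """Give each distinct item an integer index, in order of first appearance."""
    item2ind = {}
    for item in items:
        item2ind.setdefault(item, len(item2ind))
    ind2item = {ind: item for item, ind in item2ind.items()}
    return item2ind, ind2item


emoji2ind, toy_ind2emoji = build_index(e for e, _ in pairs)
phr2ind, toy_ind2phr = build_index(p for _, p in pairs)

# Each training example links a phrase index to an emoji index.
toy_examples = [(phr2ind[p], emoji2ind[e]) for e, p in pairs]

print(toy_ind2emoji)
print(toy_ind2phr)
print(toy_examples)
```

With index maps like these, row `i` of the model's parameter matrix can always be traced back to the emoji it belongs to.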

In [3]:
# Build knowledge base
print('reading training data from: ' + params["data_folder"])
kb, ind2phr, ind2emoji = build_kb(params["data_folder"])


reading training data from: data/training


We'll need this emoji mapping later, so for now let's save it to a file.

In [4]:
params.update({"mapping_file": "emoji_mapping_file.pkl"})

In [5]:
# Save the mapping from index to emoji
# The context manager makes sure the file is closed after writing
with open(params["mapping_file"], 'wb') as f:
    pk.dump(ind2emoji, f)

Time to encode the descriptions of emojis as fixed-size vectors. As mentioned before, we're using the 300-dimensional Google News embeddings located in `data/word2vec`. As we will need them later, we're going to save the generated description embeddings to a file, `phrase_embeddings.pkl`. <br>Let's add those paths to our parameters.

In [6]:
params.update({"word2vec_embeddings_file": "data/word2vec/GoogleNews-vectors-negative300.bin.gz",
               "phrase_embeddings_file": "phrase_embeddings.pkl"})

To reiterate: phrase embeddings are the embeddings of each emoji description and serve as input to the neural network, whereas the Google News embeddings are what we use to generate the phrase embeddings. <br>
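
The paper builds a description vector by summing the word vectors of its words. The sketch below shows that idea on hypothetical 4-dimensional toy vectors; `encode_phrase` is an illustrative helper, and its tokenisation and handling of unknown words are assumptions that may differ from `phrase2vec.Phrase2Vec`.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors standing in for the
# 300-dimensional Google News embeddings.
toy_word_vectors = {
    "police": np.array([1.0, 0.0, 0.5, 0.0]),
    "car": np.array([0.0, 1.0, 0.5, 0.0]),
}


def encode_phrase(phrase, word_vectors, dim=4):
    """Sum the vectors of all known words; unknown words are skipped."""
    vec = np.zeros(dim)
    for token in phrase.lower().split():
        if token in word_vectors:
            vec += word_vectors[token]
    return vec


print(encode_phrase("Police car", toy_word_vectors))
```

Whatever the length of the description, the result always has the dimensionality of the word embeddings, which is what lets it serve as a fixed-size network input.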

In [7]:
# Generate embeddings for each phrase in the training set
embeddings_array = generate_embeddings(ind2phr=ind2phr, kb=kb, embeddings_file=params["phrase_embeddings_file"],
                                       word2vec_file=params["word2vec_embeddings_file"])

loading embeddings...


So we have the training input (phrase embeddings), now let's load our training and development data sets.

In [8]:
train_set = get_examples_from_kb(kb=kb, example_type='train')
dev_set = get_examples_from_kb(kb=kb, example_type='dev')

And define training parameters.

In [9]:
model_params = ModelParams(in_dim=300, 
                           out_dim=300, 
                           max_epochs=60, 
                           pos_ex=4, 
                           neg_ratio=1, 
                           learning_rate=0.001,
                           dropout=0.0, 
                           class_threshold=0.5)
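
The data set only contains matching (description, emoji) pairs, so mismatched pairs have to be sampled for the classifier to learn from. The sketch below assumes that `pos_ex` is the number of matching pairs per batch and `neg_ratio` the number of sampled mismatches per matching pair; that reading, and the helper `sample_negatives`, are assumptions rather than something this notebook confirms.

```python
import random


def sample_negatives(positives, num_emojis, neg_ratio, rng):
    """For each positive (phrase, emoji) pair, draw neg_ratio mismatched emojis."""
    positive_set = set(positives)
    negatives = []
    for phrase_idx, _ in positives:
        drawn = 0
        while drawn < neg_ratio:
            candidate = rng.randrange(num_emojis)
            # Keep the candidate only if it is not a true match.
            if (phrase_idx, candidate) not in positive_set:
                negatives.append((phrase_idx, candidate))
                drawn += 1
    return negatives


rng = random.Random(0)
batch_pos = [(0, 0), (1, 0), (2, 1), (3, 2)]  # 4 matching pairs, as with pos_ex=4
batch_neg = sample_negatives(batch_pos, num_emojis=5, neg_ratio=1, rng=rng)

# Label matching pairs 1 and sampled mismatches 0.
batch = [(p, e, 1) for p, e in batch_pos] + [(p, e, 0) for p, e in batch_neg]
print(batch)
```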

Next, we define the model. It needs to know the training parameters, the number of emojis (the size of the parameter matrix depends on it) and the matrix of phrase embeddings (the embeddings of the emoji descriptions).
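
To see why the parameter matrix has one row per emoji, and how those rows turn into embeddings, here is a minimal NumPy sketch of logistic training on toy data. It is not the `model.Emoji2Vec` implementation (which is a PyTorch model whose internals are not shown here); the toy vectors, the examples and the plain gradient-descent update are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


dim, num_emojis = 4, 3

# One trainable row per emoji: this matrix is what becomes the emoji embeddings.
emoji_matrix = np.zeros((num_emojis, dim))

# Fixed toy phrase embeddings (one per row) and (phrase, emoji, label) examples.
phrase_vecs = np.eye(3, dim)
examples = [(0, 0, 1), (1, 1, 1), (2, 2, 1), (0, 1, 0), (1, 2, 0), (2, 0, 0)]


def mean_loss():
    """Average binary cross-entropy over all toy examples."""
    losses = []
    for p, e, y in examples:
        prob = sigmoid(phrase_vecs[p] @ emoji_matrix[e])
        losses.append(-(y * np.log(prob) + (1 - y) * np.log(1 - prob)))
    return float(np.mean(losses))


loss_before = mean_loss()
learning_rate = 0.1
for _ in range(200):
    for p, e, y in examples:
        prob = sigmoid(phrase_vecs[p] @ emoji_matrix[e])
        # Gradient of the log loss w.r.t. the emoji row; only that row changes.
        emoji_matrix[e] -= learning_rate * (prob - y) * phrase_vecs[p]
loss_after = mean_loss()

print(loss_before, loss_after)
```

Each update touches only the row of the emoji in the current example, which is why a row ends up encoding the descriptions of exactly one emoji.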

In [10]:
model = Emoji2Vec(model_params=model_params, num_emojis=kb.dim_size(0), embeddings_array=embeddings_array)

Now that the model has been defined, let's train it! 

In [11]:
model.train(kb=kb, epochs=model_params.max_epochs, learning_rate=model_params.learning_rate)



Epoch: 1 
 Training loss: 0.68 
 Training acc: 0.57 
 Training f1: 0.57 
Epoch: 2 
 Training loss: 0.58 
 Training acc: 0.73 
 Training f1: 0.76 
Epoch: 3 
 Training loss: 0.51 
 Training acc: 0.81 
 Training f1: 0.83 
Epoch: 4 
 Training loss: 0.45 
 Training acc: 0.84 
 Training f1: 0.86 
Epoch: 5 
 Training loss: 0.41 
 Training acc: 0.87 
 Training f1: 0.89 
Epoch: 6 
 Training loss: 0.38 
 Training acc: 0.89 
 Training f1: 0.9 
Epoch: 7 
 Training loss: 0.34 
 Training acc: 0.91 
 Training f1: 0.92 
Epoch: 8 
 Training loss: 0.32 
 Training acc: 0.92 
 Training f1: 0.93 
Epoch: 9 
 Training loss: 0.3 
 Training acc: 0.93 
 Training f1: 0.94 
Epoch: 10 
 Training loss: 0.27 
 Training acc: 0.94 
 Training f1: 0.94 
Epoch: 11 
 Training loss: 0.26 
 Training acc: 0.94 
 Training f1: 0.94 
Epoch: 12 
 Training loss: 0.24 
 Training acc: 0.95 
 Training f1: 0.95 
Epoch: 13 
 Training loss: 0.23 
 Training acc: 0.95 
 Training f1: 0.95 
Epoch: 14 
 Training loss: 0.21 
 Training acc: 0

Let's save the trained PyTorch model to a file so that we can load it later if we need it.

In [12]:
model_folder = 'example_emoji2vec'
# Create the output folder; this is a no-op if it already exists
os.makedirs(model_folder, exist_ok=True)

torch.save(model.nn, os.path.join(model_folder, 'model.pt'))

Now that the model is trained and saved, we need to extract the emoji representations from the neural network's parameters. <br>
The `create_gensim_files` method of `model.Emoji2Vec` does this: it saves the distributed representations of emojis in a format compatible with `gensim.models`, which lets us add Emoji2Vec representations to a gensim model and use them like any other word embeddings.
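
For reference, the word2vec text format that gensim can read is simple: a header line with the number of vectors and their dimensionality, followed by one line per token with its values. The sketch below writes toy vectors in that layout; `write_word2vec_text` is a hypothetical helper, and whether `create_gensim_files` produces its `.txt` file in exactly this way is an assumption.

```python
import io

import numpy as np


def write_word2vec_text(stream, tokens, vectors):
    """Write a "count dim" header, then one "token v1 v2 ..." line per vector."""
    count, dim = vectors.shape
    stream.write(f"{count} {dim}\n")
    for token, vec in zip(tokens, vectors):
        values = " ".join(f"{x:.6f}" for x in vec)
        stream.write(f"{token} {values}\n")


toy_tokens = ["🚓", "🐢"]
toy_vectors = np.array([[0.1, -0.2, 0.3], [0.0, 0.5, -0.5]])

# An in-memory buffer keeps the example free of side effects on disk.
buffer = io.StringIO()
write_word2vec_text(buffer, toy_tokens, toy_vectors)
print(buffer.getvalue())
```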


In [13]:
e2v = model.create_gensim_files(model_folder=model_folder, ind2emoj=ind2emoji, out_dim=model_params.out_dim)

#### Done.
That's it, you've generated your own Emoji2Vec representations, which now sit in `example_emoji2vec/` as `emoji2vec.txt` and `emoji2vec.bin`. Good job. <br>
Now it's time for the fun part. For example, we can create a gensim model with our newly generated emoji embeddings.

In [14]:
e2v = gsm.KeyedVectors.load_word2vec_format("example_emoji2vec/emoji2vec.bin", binary=True)

Let's take a look at some of the emojis in the data set.

In [15]:
vocabulary = e2v.vocab.keys()

In [16]:
# Sample 10 random emojis from the data set.
example_emojis = np.random.choice(list(vocabulary), 10)
print(example_emojis)

['🇨🇺' '🙌🏿' '🌮' '☣' '👱🏿' '🇨🇫' '🇲🇵' '🕓' '🔪' '🐢']


Each of the above emojis has its own vector representation.

In [17]:
# KeyedVectors can be indexed directly; going through `.wv` is deprecated
e2v[example_emojis]


array([[-0.31287813, -0.30965185,  0.27617866, ..., -0.11076863,
         0.27228284,  0.20612997],
       [ 0.1279807 , -0.05454838, -0.07805232, ...,  0.02052569,
        -0.00635082, -0.24651076],
       [-0.18673006, -0.3734869 , -0.13771021, ...,  0.13268538,
        -0.01627036, -0.09945607],
       ...,
       [-0.084155  , -0.26755714,  0.41112074, ..., -0.06871715,
        -0.16944052, -0.19901495],
       [-0.5802002 , -0.11823418,  0.39910623, ...,  0.04358187,
        -0.22510768,  0.2459275 ],
       [ 0.1919789 ,  0.32465485, -0.7188034 , ...,  0.45272884,
        -0.01710549, -0.09367993]], dtype=float32)

Thanks to the fact that the learnt emoji representations are compatible with `gensim.models`, we can do a lot of cool things like finding similar emojis.
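
Under the hood, gensim's `most_similar` ranks candidates by cosine similarity. The following minimal sketch reproduces that ranking on hypothetical 3-dimensional toy vectors; the function and the values in `toy` are made up for illustration and are not taken from the trained model.

```python
import numpy as np


def most_similar(query, vectors, top_n=2):
    """Rank all other keys by cosine similarity to the query key."""
    keys = [k for k in vectors if k != query]
    matrix = np.stack([vectors[k] for k in keys])
    # Normalise rows and the query so that dot products are cosine similarities.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = vectors[query] / np.linalg.norm(vectors[query])
    scores = matrix @ q
    order = np.argsort(-scores)[:top_n]
    return [(keys[i], float(scores[i])) for i in order]


toy = {
    "🚔": np.array([1.0, 0.1, 0.0]),
    "🚓": np.array([0.9, 0.2, 0.0]),
    "🚗": np.array([0.5, 0.5, 0.1]),
    "🐢": np.array([0.0, 0.1, 1.0]),
}

print(most_similar("🚔", toy))
```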

In [18]:
e2v.most_similar('🚔')



[('🚓', 0.8722398281097412),
 ('🚑', 0.6165059804916382),
 ('🚘', 0.5822903513908386),
 ('👮🏿', 0.577760636806488),
 ('👮', 0.5697044730186462),
 ('🚗', 0.5328611135482788),
 ('🚃', 0.5255826711654663),
 ('🚙', 0.5193851590156555),
 ('🚕', 0.5161677002906799),
 ('🚐', 0.5124407410621643)]

In [None]:
e2v.most_similar('🏊')

[('🦋', 0.6746906638145447),
 ('🏊🏿', 0.5993638038635254),
 ('🛀', 0.5933505892753601),
 ('🛁', 0.583660364151001),
 ('⛲', 0.5823800563812256),
 ('🚰', 0.5734569430351257),
 ('🏊🏾', 0.5688455700874329),
 ('🏊🏻', 0.556538462638855),
 ('🌊', 0.5471222400665283),
 ('💦', 0.5453152656555176)]

In [None]:
e2v.most_similar('🚄')

[('🚅', 0.8191389441490173),
 ('🚆', 0.7354292869567871),
 ('🚋', 0.7118366956710815),
 ('🚉', 0.6971631646156311),
 ('🚂', 0.6815627217292786),
 ('🚃', 0.6649359464645386),
 ('🚟', 0.6503240466117859),
 ('🚝', 0.5870567560195923),
 ('🚞', 0.5861678123474121),
 ('🚈', 0.5822986364364624)]

In [None]:
e2v.most_similar('🌩')

[('🌧', 0.6393843293190002),
 ('☁️', 0.6163010001182556),
 ('🌦', 0.5562928318977356),
 ('🌨', 0.5485323667526245),
 ('⛈', 0.5457589626312256),
 ('⛅', 0.5421726703643799),
 ('☔', 0.5362968444824219),
 ('🌤', 0.526214063167572),
 ('🌪', 0.5242847204208374),
 ('🌥', 0.5231043100357056)]

However the real fun starts when we combine the power of emoji embeddings with word embeddings. <br>
Let's create a model that combines the two, this will allow us to measure similarity between emojis and phrases. 

In [None]:
# Use the emoji embeddings generated above, which were saved in `model_folder`
phraseVecModel = Phrase2Vec.from_word2vec_paths(300,
                                                "data/word2vec/GoogleNews-vectors-negative300.bin.gz",
                                                "example_emoji2vec/emoji2vec.bin")

In [None]:
# mapping from id to emoji
with open(params["mapping_file"], 'rb') as f:
    mapping = pk.load(f)
# mapping from emoji to id
inverse_mapping = {v: k for k, v in mapping.items()}

In [None]:
def find_analogous_emoji(emoji_1,
                         emoji_2,
                         emoji_3,
                         top_n,
                         mapping=mapping,
                         inverse_mapping=inverse_mapping,
                         e2v=e2v,
                         model=model):
    """Return the expression `emoji_1 - emoji_2 + emoji_3` and its top_n closest emojis."""
    # Analogy vector, normalised to unit length
    vector = e2v[emoji_1] - e2v[emoji_2] + e2v[emoji_3]
    vector = vector / np.linalg.norm(vector)
    query = torch.Tensor(vector.reshape(1, -1))

    # Score the analogy vector against every emoji known to the trained model
    similarities = []
    for idx in range(len(mapping)):
        emoji_idx_similarity = model.nn.forward(query, idx).detach().numpy()
        similarities.append(emoji_idx_similarity)

    # Indices of the top_n highest scores, best first
    similarities = np.array(similarities)
    n_most_similar_idxs = similarities.argsort(axis=0)[-top_n:][::-1].reshape(-1)
    n_most_similar_emojis = [mapping[emoji_idx] for emoji_idx in n_most_similar_idxs]

    str_expression = ' '.join([emoji_1, "-", emoji_2, "+", emoji_3])

    return str_expression, n_most_similar_emojis
        

In [None]:
table = [["Expression", "Closest emojis"],
         [*find_analogous_emoji("🤴", "🚹", "🚺", 3)],
         [*find_analogous_emoji("👑", "🚹", "🚺", 3)],
         [*find_analogous_emoji("👦", "🚹", "🚺", 3)],
         [*find_analogous_emoji("💵", "🇺🇸", "🇬🇧", 3)],
         [*find_analogous_emoji("💵", "🇺🇸", "🇪🇺", 3)],
         [*find_analogous_emoji("💷", "🇬🇧", "🇪🇺", 3)]]

In [None]:
print(tabulate(table, headers="firstrow", tablefmt="rst"))

The results are far from perfect, but, as in the [published](https://arxiv.org/pdf/1609.08359.pdf) results, the emoji one would consider "right" is usually among the top 3 candidates selected by the model.
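
The vector arithmetic behind such analogies is easiest to see on hand-made vectors. The sketch below uses hypothetical 2-dimensional vectors and ranks candidates by plain cosine similarity, whereas `find_analogous_emoji` scores candidates with the trained network, so treat it as an illustration of the idea rather than a reimplementation.

```python
import numpy as np


def analogy(a, b, c, vectors, top_n=1):
    """Return the keys closest (by cosine) to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    target = target / np.linalg.norm(target)
    scored = []
    for key, vec in vectors.items():
        if key in (a, b, c):
            continue  # skip the query terms themselves
        scored.append((float(vec @ target / np.linalg.norm(vec)), key))
    scored.sort(reverse=True)
    return [key for _, key in scored[:top_n]]


# Hypothetical vectors: axis 0 stands for "royalty", axis 1 for "gender".
toy = {
    "🤴": np.array([1.0, 1.0]),
    "👸": np.array([1.0, -1.0]),
    "🚹": np.array([0.0, 1.0]),
    "🚺": np.array([0.0, -1.0]),
    "👑": np.array([1.0, 0.0]),
}

print(analogy("🤴", "🚹", "🚺", toy))
```

Subtracting 🚹 and adding 🚺 flips the "gender" axis while leaving the "royalty" axis untouched, which is why 👸 ends up closest to the result.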

Combining Emoji2Vec representations with the word2vec embeddings they were trained on also lets us inspect the relationships between word embeddings and emoji embeddings. For example, with the `phrase2emoji` function below we can see the emojis most similar to an arbitrary phrase.

In [None]:
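def phrase2emoji(phrase,
                 top_n,
                 phraseVecModel=phraseVecModel,
                 mapping=mapping,
                 model=model):
    """Return the phrase and the top_n emojis the trained model scores highest for it."""
    # Encode the phrase in the shared word/emoji embedding space
    phrase_vec = phraseVecModel[phrase]
    query = torch.Tensor(phrase_vec.reshape(1, -1))

    # Score the phrase vector against every emoji known to the trained model
    similarities = []
    for idx in range(len(mapping)):
        emoji_idx_similarity = model.nn.forward(query, idx).detach().numpy()
        similarities.append(emoji_idx_similarity)

    # Indices of the top_n highest scores, best first
    similarities = np.array(similarities)
    n_most_similar_idxs = similarities.argsort(axis=0)[-top_n:][::-1].reshape(-1)
    n_most_similar_emojis = [mapping[emoji_idx] for emoji_idx in n_most_similar_idxs]

    return phrase, n_most_similar_emojis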
    

In [None]:
table = [["Phrase", "Closest emojis"],
         [*phrase2emoji("funny", 3)],
         [*phrase2emoji("scary", 3)],
         [*phrase2emoji("okay", 3)],
         [*phrase2emoji("crazy", 3)],
         [*phrase2emoji("wild", 3)],
         [*phrase2emoji("afraid", 3)]]

In [None]:
print(tabulate(table, headers="firstrow", tablefmt="rst"))

There seems to be a strong relationship between simple adjectives and emojis. That's great, but is this the whole point of Emoji2Vec? <br>
Of course not. In fact, the Emoji2Vec authors evaluated their method on a downstream sentiment classification task, using a dataset by Kralj Novak et al. (2015) which consists of over 67k English tweets labelled manually for positive, neutral, or negative sentiment. Using Emoji2Vec alongside word embeddings, as opposed to using just word embeddings, yielded an improvement in classification accuracy across all studied datasets. In the same task Emoji2Vec also outperformed an alternative method for emoji representation. <br>
For more details check out the [paper](https://arxiv.org/pdf/1609.08359.pdf). You can also inspect the results of the sentiment classification task in [this notebook](https://github.com/uclmr/emoji2vec/blob/master/TwitterClassification.ipynb).
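
To give a feel for how emoji embeddings can enter such a classifier, here is a minimal sketch of building one feature vector per tweet. It assumes a tweet is represented by the sum of the vectors of its tokens, which is our reading of the paper's setup; the 2-dimensional toy vectors and the `tweet_features` helper are hypothetical.

```python
import numpy as np

# Hypothetical 2-dimensional lookup tables for words and emojis.
toy_word_vecs = {"great": np.array([0.8, 0.1]), "day": np.array([0.1, 0.3])}
toy_emoji_vecs = {"😀": np.array([0.9, 0.0])}


def tweet_features(tokens, word_vecs, emoji_vecs=None, dim=2):
    """Sum the vectors of every token found in either lookup table."""
    features = np.zeros(dim)
    for token in tokens:
        if token in word_vecs:
            features += word_vecs[token]
        elif emoji_vecs is not None and token in emoji_vecs:
            features += emoji_vecs[token]
    return features


tokens = ["great", "day", "😀"]
words_only = tweet_features(tokens, toy_word_vecs)
with_emoji = tweet_features(tokens, toy_word_vecs, toy_emoji_vecs)

print(words_only)
print(with_emoji)
```

Without the emoji table the 😀 token contributes nothing to the features, so any sentiment signal it carries is lost to the classifier.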