# Predicting Emojis from Twitter Data

#### Nicholas Farn, Ekaterina Koplenko, Maithili Bhide

Emoji have become increasingly prominent in social media. They first appeared in Japan in the 1990s, and by 2015 they were used by over ninety-two percent of the online population [1]. Given this trend, many NLP applications stand to benefit from the ability to interpret emoji.

In this project, we implement and train the following models: a bag-of-words baseline, a CNN, and a bidirectional LSTM. Our objective is to predict the emoji associated with a given tweet, choosing among the 5, 10, or 20 most frequently used emojis.

## 1. Data Set

We acquired a pre-processed dataset containing 584,600 tweets posted in the US between October 2015 and May 2016 [2]. The dataset consists of three sets, containing tweets labeled with the 5, 10, and 20 most common emojis respectively. Each set is split into training, validation, and test sets; the training sets contain 200,000 to 500,000 tweets, and the validation and test sets contain on the order of tens of thousands each.

Preprocessing consisted of replacing user mentions with the symbol "@user" and replacing words that occur fewer than 5 times with the symbol "< unk >". Punctuation marks such as commas and quotation marks are separated from words by a space and are treated as words themselves. A brief sample is shown below.

In [1]:
import pandas as pd
pd.read_csv('./data/5_test', delimiter='\t', names=['Tweet', 'Emoji']).head(10)

Unnamed: 0,Tweet,Emoji
0,funny how you change when certain people are a...,eoji1f602
1,@user hahahah yeeahhh right,eoji1f602
2,@user i like the last one,eoji1f602
3,good lawrd your blankets smell amazing,eoji1f60d
4,@user lol my nigga you mad,eoji1f602
5,i love having beautiful friends ! happy holida...,eoji2764
6,"guy at gas station : "" did you know , ford bac...",eoji1f602
7,@user : @user bih that's gone be my step chil...,eoji1f602
8,welcome to the nba i told yall its another jus...,eoji1f525
9,oh my girls were so beautiful tonight ! #lizzi...,eoji1f602
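
The preprocessing steps described above can be sketched roughly as follows. This is only an illustration, not the code used to build the dataset [2]: the function names, the lowercasing step, the exact set of punctuation marks, and the spelling of the unknown token are all assumptions.

```python
import re
from collections import Counter

MENTION = re.compile(r"@\w+")
# Assumed set of punctuation marks to split off; the source does not list them.
PUNCT = re.compile(r'([,"!?.:;])')


def normalize(tweet):
    """Mask user mentions and split punctuation into separate tokens."""
    tweet = MENTION.sub("@user", tweet.lower())  # lowercasing is an assumption
    tweet = PUNCT.sub(r" \1 ", tweet)
    return tweet.split()


def replace_rare(tokenized_tweets, min_count=5, unk="<unk>"):
    """Replace every token seen fewer than `min_count` times with `unk`."""
    counts = Counter(tok for toks in tokenized_tweets for tok in toks)
    return [
        [tok if counts[tok] >= min_count else unk for tok in toks]
        for toks in tokenized_tweets
    ]


tokens = [normalize(t) for t in ['@Alice hahaha, "so true"!', "@bob hahaha"]]
# A threshold of 2 is used here only because the toy corpus is tiny.
cleaned = replace_rare(tokens, min_count=2)
```

On a real corpus the threshold would be the 5 occurrences mentioned above, with counts taken over the training set.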


## 2. Baseline

For the baseline we implemented a bag-of-words classifier, as this approach is known to work well for classification tasks such as sentiment analysis and topic modeling. The emoji is removed from the sequence of tokens and used as the label for both training and testing. Each tweet is represented as a vector. The most informative tokens (including punctuation marks) are selected using term frequency-inverse document frequency (TF-IDF). L2-regularized logistic regression is used to make the predictions.
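
A baseline of this kind could be put together with scikit-learn along the following lines. This is a minimal sketch under our own assumptions rather than the exact project code: the toy tweets and labels are made up, the hyperparameters are library defaults, and the sketch only weights tokens by TF-IDF without reproducing any token-selection step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_baseline():
    # Tweets are already tokenized with spaces, so every run of
    # non-whitespace characters (punctuation included) is one token.
    vectorizer = TfidfVectorizer(token_pattern=r"\S+", lowercase=False)
    # LogisticRegression applies an L2 penalty by default.
    classifier = LogisticRegression(C=1.0, max_iter=1000)
    return make_pipeline(vectorizer, classifier)


# Toy data with hypothetical labels, for illustration only.
train_tweets = [
    "@user hahaha that is so funny",
    "lol i can not stop laughing",
    "i love my beautiful friends !",
    "love you so much , happy holidays",
]
train_labels = ["joy", "joy", "heart", "heart"]

model = build_baseline().fit(train_tweets, train_labels)
predictions = model.predict(["@user hahaha lol"])
```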

<table style="width:100%">
<tr>
    <td> <img src="images/Fig_5.png" alt="Drawing" style="width: 500px;"/> </td>
    <td> <img src="images/Fig_10.png" alt="Drawing" style="width: 500px;"/> </td>
</tr>
</table>

Based on the results for the 5, 10, and 20 most frequent emoji in our dataset, we can see that the baseline model has a strong preference for predicting the first (face with tears of joy) and second (red heart) most frequent emoji. In the confusion matrix for the 5 most frequent emojis above, the third most frequent emoji (smiling face with heart-eyes) is misclassified as the most frequent one more than half the time. A similar bias towards predicting the first or second most frequent emoji can be seen in the confusion matrices for the top 10 and top 20 emoji as well.

The baseline classifier performs reasonably well on our dataset. The performance metrics for this model are consistent with the results published in the paper "Are Emojis Predictable?" [2].
With our subsequent models, we tried to reduce the prediction bias towards the most frequent class and to improve classification accuracy on the less frequent emojis. This was done by modeling a CNN and an LSTM, as explained in the following sections.

## 3. Convolutional Neural Network

Another model noted to do well is a convolutional neural network [2][3]. The model passes 64 filters of widths 3, 4, and 5 over a sequence of word embeddings (of dimension 50), and max pooling is applied to the result to produce a fixed-size output. This output is then fed directly into a fully connected softmax layer that predicts the emoji class. During training, the fully connected layer is subject to dropout. Embeddings were initialized using GloVe embeddings pre-trained on Twitter data. Words without a matching GloVe embedding were initialized from a uniform distribution over [-1, 1].

<img src='images/cnn_model.png' />
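
The forward pass of the basic CNN can be illustrated in plain NumPy. This is a sketch under several assumptions that are not stated in the text: 64 filters for each of the three widths, a ReLU after the convolution, an input of 35 tokens, and no dropout (as at inference time). All function and variable names are hypothetical.

```python
import numpy as np


def conv_max_pool(embedded, filters, bias):
    """Convolve over time, then max-pool over time.

    embedded: (seq_len, emb_dim), filters: (n_filters, width, emb_dim),
    bias: (n_filters,). Requires seq_len >= width.
    """
    width = filters.shape[1]
    seq_len = embedded.shape[0]
    # Every window of `width` consecutive words: (n_windows, width, emb_dim).
    windows = np.stack(
        [embedded[i:i + width] for i in range(seq_len - width + 1)]
    )
    # Dot each window with each filter: (n_windows, n_filters).
    feature_maps = np.tensordot(windows, filters, axes=([1, 2], [1, 2])) + bias
    feature_maps = np.maximum(feature_maps, 0.0)  # ReLU (assumed)
    return feature_maps.max(axis=0)  # fixed-size output: (n_filters,)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def cnn_forward(embedded, conv_params, W_out, b_out):
    """Concatenate the pooled features of every filter width, then classify."""
    pooled = np.concatenate(
        [conv_max_pool(embedded, f, b) for f, b in conv_params]
    )
    return softmax(W_out @ pooled + b_out)


rng = np.random.default_rng(0)
embedded = rng.normal(size=(35, 50))  # one tweet: 35 tokens, 50-d embeddings
conv_params = [
    (rng.normal(scale=0.1, size=(64, w, 50)), np.zeros(64)) for w in (3, 4, 5)
]
W_out = rng.normal(scale=0.1, size=(5, 3 * 64))  # 5 emoji classes
b_out = np.zeros(5)
probs = cnn_forward(embedded, conv_params, W_out, b_out)
```

Because the maximum is taken over all window positions, the pooled vector has the same size regardless of tweet length, which is what allows it to feed a fixed-size softmax layer.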

In addition to the basic CNN, we experimented with additional fully connected layers and with a highway network inserted between the convolutional layer output and the fully connected layer. Deep highway networks are noted to train faster than plain deep neural networks, and to produce similar outputs for semantically similar words and phrases even when the inputs differ greatly [3]. A highway layer is defined by the equation below, where $\circ$ denotes element-wise multiplication.

$$y = \text{relu}(W_H x + b_H) \circ \sigma(W_T x + b_T) + (1 - \sigma(W_T x + b_T)) \circ x$$

Since the output of a highway layer has the same dimension as its input, $W_H$ and $W_T$ are square matrices.
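
The highway equation above translates almost directly into NumPy. The sketch below is illustrative only; the function names, the small dimension, and the random weight scale are our own choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def highway(x, W_H, b_H, W_T, b_T):
    """One highway layer: y = relu(W_H x + b_H) * t + (1 - t) * x."""
    h = np.maximum(W_H @ x + b_H, 0.0)  # candidate transform
    t = sigmoid(W_T @ x + b_T)          # transform gate, values in (0, 1)
    return h * t + (1.0 - t) * x        # carry the rest of the input through


rng = np.random.default_rng(0)
d = 4  # W_H and W_T are square (d x d) so that y has the shape of x
x = rng.normal(size=d)
W_H = rng.normal(scale=0.1, size=(d, d))
W_T = rng.normal(scale=0.1, size=(d, d))
b_H = np.zeros(d)
b_T = rng.uniform(-4.0, -2.0, size=d)
y = highway(x, W_H, b_H, W_T, b_T)
```

When the gate `t` is close to 0 the layer returns its input almost unchanged, and when it is close to 1 the layer behaves like an ordinary ReLU layer.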

Dropout is applied between every layer from the convolutional output up to, but not including, the softmax layer in order to regularize the model. Weights are initialized using the Glorot uniform distribution. Biases are initialized to 0 except for $b_T$, which is initialized from a uniform distribution over [-4, -2]. The negative bias keeps the transform gate mostly closed at first, so that a highway layer initially passes its input through nearly unchanged. All models were trained using the Adam optimizer.
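
To make the effect of the $b_T$ initialization concrete, consider the gate at the start of training under the simplifying assumption that $W_T x \approx 0$, so that the gate is roughly $\sigma(b_T)$. At the two ends of the initialization range:

$$\sigma(-4) = \frac{1}{1 + e^{4}} \approx 0.018, \qquad \sigma(-2) = \frac{1}{1 + e^{2}} \approx 0.119$$

Under this assumption the layer output is approximately

$$y \approx \sigma(b_T) \circ \text{relu}(W_H x + b_H) + (1 - \sigma(b_T)) \circ x$$

with $1 - \sigma(b_T)$ between about 0.88 and 0.98, so roughly 88 to 98 percent of each input component is carried through unchanged.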

## 4. Long Short-Term Memory

We explore both unidirectional and bidirectional LSTM models, using GloVe word embeddings, to solve the sequential emoji classification problem. LSTM networks remain an active area of research and can provide state-of-the-art performance.

Two models were constructed and analyzed. The first consists of a single unidirectional LSTM hidden layer; the second adds a bidirectional LSTM layer before the unidirectional LSTM layer. Both networks accept tweets that have been tokenized, mapped to integer indices, and padded or truncated to a uniform length of 35. All unknown words are mapped to an `<unk>` token, which occupies position zero in the vocabulary. The first layer maps each word in the training set to its GloVe word embedding; the embedding weights are learned along with the rest of the model. The embedding dimension is set to 100. In the final stage, the output is passed through a softmax function to determine the most probable emoji.
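
The input encoding described above (index lookup, `<unk>` at position zero, and a fixed length of 35) might look like the following sketch. The separate `<pad>` token at index 1 and the choice to pad on the right are assumptions; the text does not specify how padding is represented.

```python
UNK, PAD = "<unk>", "<pad>"  # <pad> is a hypothetical addition


def build_vocab(tokenized_tweets):
    """Assign an integer index to every training token; <unk> gets index 0."""
    vocab = {UNK: 0, PAD: 1}
    for tokens in tokenized_tweets:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len=35):
    """Map tokens to indices, then truncate or right-pad to `max_len`."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab([["@user", "i", "like", "it"], ["i", "love", "it", "!"]])
encoded = encode(["i", "adore", "it"], vocab)  # "adore" is out of vocabulary
```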

Adam optimization and dropout were used to improve the performance of both models. During training, we noticed that in some experiments training accuracy increased while validation accuracy dropped, which indicates overfitting. To compensate, two intermediate dropout layers were introduced. Additionally, we used Adam, a stochastic optimization algorithm popular in NLP. Adam adapts the step size for each parameter individually, unlike plain SGD, which applies a single global learning rate to all parameters.
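
For reference, a single Adam update can be written in a few lines of NumPy. This is a sketch of the standard update rule, not our training code; the hyperparameter values are the commonly used defaults and are an assumption, since the values used in the experiments are not recorded here.

```python
import numpy as np


def adam_step(param, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (param, m, v).

    m and v are running averages of the gradient and squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # correct the bias towards zero
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


param = np.array([1.0, 1.0])
grad = np.array([10.0, -0.1])  # gradients differing by a factor of 100
new_param, m, v = adam_step(param, grad, np.zeros(2), np.zeros(2), t=1)
```

On this first step both parameters move by about `lr` in magnitude despite the hundredfold difference in gradient size, which illustrates the per-parameter scaling. Plain SGD with the same learning rate would move them by 0.01 and 0.0001 respectively.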

## 5. Results

We tested our models using a weighted F1 score as an indicator of performance.
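
For clarity, the weighted F1 score is the per-class F1 averaged with weights proportional to each class's share of the true labels. A small pure-Python sketch (the function name and toy labels are illustrative):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true frequency."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_predicted = sum(1 for p in y_pred if p == label)
        precision = tp / n_predicted if n_predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = weighted_f1(y_true, y_pred)  # 0.75 * 0.8 + 0.25 * (2 / 3)
```

This should agree with scikit-learn's `f1_score(y_true, y_pred, average="weighted")`. Since frequent classes receive larger weights, a model biased towards the most common emoji can still score fairly well on this metric.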

<center><b>F1 Scores by Model per Top N Emojis</b></center>

| Top N | Baseline | CNN | Resampled CNN | Highway CNN | LSTM | Bi-LSTM |
|--|----------|----------|----------|----------|----------|----------|
|5 | 0.592061 | 0.549705 | <b>0.595257</b> | 0.564256 | 0.564256 | 0.593087 |
|10| 0.441736 | <b>0.447219</b> | 0.390082 | 0.423835 | 0.423835 | 0.443731 |
|20| <b>0.347743</b> | 0.208166 | 0.292820 | 0.284491 | 0.284491 | 0.342432 |

Unfortunately, none of our models performed significantly better than the baseline model.

### 5.1 Baseline

<table style="width:100%">
<tr>
    <td> <img src="images/Fig_5.png" alt="Drawing" style="width: 300px;"/> </td>
    <td> <img src="images/Fig_10.png" alt="Drawing" style="width: 300px;"/> </td>
    <td> <img src="images/Fig_20.png" alt="Drawing" style="width: 300px;"/> </td>
</tr>
</table>

### 5.2 CNN

<table style="width:100%">
    <tr>
        <th><center>CNN</center></th>
        <th><center>Resampled</center></th>
        <th><center>Highway</center></th>
    </tr>
    <tr>
        <td><img src='images/5_confusion.png' /></td>
        <td><img src='images/5_resample.png' /></td>
        <td><img src='images/5_highway.png' /></td>
    </tr>
</table>

<b>Figure 1.</b> Confusion matrices of the top 5 emojis for the various CNNs. The most common emoji is denoted class 0, and the least common is denoted class 4.

Here we compare the results of the various CNNs. As one would expect, the resampled CNN does a better job of predicting the less common classes; however, this comes at the cost of confusing the first and third most common emojis with each other. Both of these models nonetheless perform better than the CNN with a single highway layer, which almost entirely fails to distinguish the third emoji from the first.

### 5.3 LSTM

Testing showed that the Bi-LSTM followed by an LSTM performs better than the plain LSTM network on both the 5- and 10-emoji sets. However, in further testing on the 20-emoji set, both models showed almost identical results. During development, we aimed to implement a model similar to the one examined in the paper "Are Emojis Predictable?" [2]. Our Bi-LSTM model outperforms the one described in the paper; we therefore consider our model a success.

## 6. Conclusion

## References

[1]: http://emogi.com/documents/Emoji_Report_2015.pdf "Emoji Report 2015", emogi.com, 2015

[2]: https://arxiv.org/pdf/1702.07285.pdf F. Barbieri, M. Ballesteros, H. Saggion, "Are Emojis Predictable?", 2017

[3]: https://web.stanford.edu/class/cs224n/reports/2762064.pdf L. Zhao, C. Zeng, "Using Neural Networks to Predict Emoji Usage from Twitter Data"

[4]: https://arxiv.org/abs/1408.5882 Y. Kim, "Convolutional Neural Networks for Sentence Classification", 2014

[5]: https://arxiv.org/abs/1505.00387 R. Srivastava, K. Greff, J. Schmidhuber, "Highway Networks", 2015 

## Code

https://github.com/neonrights/emoji_predictor