# Emojify!

To build an Emojifier, we are going to use word vector representations.

Have you ever wanted to make your text messages more expressive? Your emojifier app will help you do that. Rather than writing "Congratulations on the promotion! Let's get coffee and talk. Love you!", the emojifier can automatically turn this into "Congratulations on the promotion! 👍 Let's get coffee and talk. ☕️ Love you! ❤️"

The model takes a sentence as input (such as "Let's go see the baseball game tonight!") and finds the most appropriate emoji for it (⚾️). In many emoji interfaces, you need to remember that ❤️ is the "heart" symbol rather than the "love" symbol. With word vectors, even if your training set explicitly relates only a few words to a particular emoji, your algorithm can generalize and associate words in the test set with the same emoji, even if those words never appear in the training set. This lets you build an accurate sentence-to-emoji classifier from a small training set.
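
The generalization claim rests on related words having nearby vectors. A minimal sketch of that idea, assuming made-up 3-dimensional vectors (real GloVe vectors have 50 or more dimensions, and the values below are purely illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from GloVe.
toy_vectors = {
    "love":     [0.9, 0.1, 0.0],
    "adore":    [0.8, 0.2, 0.1],
    "baseball": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["love"], toy_vectors["adore"])
sim_far = cosine_similarity(toy_vectors["love"], toy_vectors["baseball"])
print(sim_close > sim_far)
```

If "adore" never appears in the training set but sits close to "love" in the embedding space, a classifier that operates on the vectors is likely to treat the two words alike.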

In [0]:
#@title Install packages and download files

# Install packages
!pip install torch torchvision -q
!pip install emoji -q
# Download the GloVe embeddings and keep only the 50-dimensional file
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
!rm *.*.*00d.txt
!rm *.zip

In [1]:
%matplotlib inline

import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
import numpy as np
import csv
import emoji
import matplotlib.pyplot as plt

torch.__version__

'1.1.0'

In [0]:
# Read a GloVe embedding file: each line holds a word followed by its vector components
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map


In [0]:
emoji_dictionary = {"0": "\u2764\uFE0F",    # explicit code points: the :heart: alias may print a black heart instead of a red one, depending on the font
                    "1": ":baseball:",
                    "2": ":smile:",
                    "3": ":disappointed:",
                    "4": ":fork_and_knife:"}

def label_to_emoji(label):
    """
    Converts a label (int or string) into the corresponding emoji code (string) ready to be printed
    """
    return emoji.emojize(emoji_dictionary[str(label)], use_aliases=True)

In [0]:
def read_csv(filename):
    phrases = []
    labels = []    # named `labels` to avoid shadowing the imported `emoji` module

    with open(filename) as csvDataFile:
        csvReader = csv.reader(csvDataFile)

        for row in csvReader:
            phrases.append(row[0].strip())
            labels.append(row[1].strip())

    X = np.asarray(phrases)
    Y = np.asarray(labels, dtype=int)

    return X, Y

Load the dataset using the code below. The dataset is split between training (132 examples) and testing (56 examples).

In [0]:
X_train, Y_train = read_csv('train_emoji.csv')
X_test, Y_test = read_csv('tesss.csv')

In [10]:
print('Training set size:', len(X_train))
print('Test set size:', len(X_test))

Training set size: 132
Test set size: 56


In [11]:
# First 10 rows
for i in range(10):
    print(X_train[i], label_to_emoji(Y_train[i]))

never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
I love you mum ❤️
Stop saying bullshit 😞
congratulations on your acceptance 😄
The assignment is too long 😞
I want to go play ⚾


In [12]:
maxLen = max(len(sentence.split()) for sentence in X_train)   # length of the longest sentence, counted in words
maxLen

10

In [0]:
# read the GloVe embedding
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

* `word_to_index`: dictionary mapping from words to their indices in the vocabulary (400,000 words, with valid indices ranging from 1 to 400,000; index 0 is left free for padding)

* `index_to_word`: dictionary mapping from indices to their corresponding words in the vocabulary

* `word_to_vec_map`: dictionary mapping words to their GloVe vector representation.
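
In `read_glove_vecs`, the index counter starts at 1 and follows the sorted order of the words. A minimal sketch of how the three mappings relate, assuming two made-up lines in GloVe's text format (the words and numbers are illustrative only):

```python
import io

# Two made-up lines in GloVe's text format: a word followed by its vector components.
fake_glove = io.StringIO("banana 0.1 0.2 0.3\napple 0.4 0.5 0.6\n")

toy_word_to_vec = {}
for line in fake_glove:
    parts = line.strip().split()
    toy_word_to_vec[parts[0]] = [float(v) for v in parts[1:]]

# Indices follow alphabetical order and start at 1, so 0 is never assigned to a word.
toy_word_to_index = {w: i for i, w in enumerate(sorted(toy_word_to_vec), start=1)}
toy_index_to_word = {i: w for w, i in toy_word_to_index.items()}

print(toy_word_to_index)   # {'apple': 1, 'banana': 2}
```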

In [14]:
# an example
word = "india"
index = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(index) + "th word in the vocabulary is", index_to_word[index])

the index of india in the vocabulary is 189129
the 289846th word in the vocabulary is potatos


Convert all your training sentences into lists of indices, and then zero-pad these lists so that each has the length of the longest sentence.

The function below converts X (an array of sentences as strings) into an array of indices corresponding to the words in the sentences.

In [0]:
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()`. 
    
    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary mapping each word to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this. 
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    
    m = X.shape[0]                                   # number of training examples
    
    # Initialize X_indices as a numpy matrix of zeros with shape (m, max_len)
    X_indices = np.zeros((m, max_len))
    
    for i in range(m):                               # loop over training examples
        
        # Convert the ith training sentence to lower case and split it into a list of words
        sentence_words = X[i].lower().split()
        
        # Initialize j to 0; it counts back from the end of the row
        j = 0
        
        # Left zero padding: fill the row from the right, starting with the last word
        for w in reversed(sentence_words): 
            # Set entry (i, j-1) of X_indices to the index of the current word
            X_indices[i, j-1] = word_to_index[w]
            # Decrement j by 1
            j = j - 1
            
        # Right zero padding (alternative): fill the row from the left
        # for w in sentence_words: 
        #    # Set entry (i, j) of X_indices to the index of the current word
        #    X_indices[i, j] = word_to_index[w]
        #    # Increment j by 1
        #    j = j + 1
    
    return X_indices

In [16]:
# example
X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"])
X1_indices = sentences_to_indices(X1,word_to_index, max_len=5)
print("X1 =", X1)
print("\nX1_indices =", X1_indices)

X1 = ['funny lol' 'lets play baseball' 'food is ready for you']

X1_indices = [[     0.      0.      0. 155345. 225122.]
 [     0.      0. 220930. 286375.  69714.]
 [151204. 192973. 302254. 151349. 394475.]]
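
The output above shows left padding: zeros fill the front of each row, so the final column always holds a real word. That matters for a model that reads only the last time step of the LSTM output. A small sketch contrasting the two padding schemes, assuming a made-up three-word vocabulary (`pad_indices` and `toy_vocab` are hypothetical names, not part of the notebook):

```python
def pad_indices(sentence, vocab, max_len, side="left"):
    """Map words to indices and zero-pad on the chosen side."""
    ids = [vocab[w] for w in sentence.lower().split()]
    padding = [0] * (max_len - len(ids))
    return padding + ids if side == "left" else ids + padding

toy_vocab = {"lets": 1, "play": 2, "baseball": 3}  # hypothetical indices

left = pad_indices("lets play baseball", toy_vocab, max_len=5, side="left")
right = pad_indices("lets play baseball", toy_vocab, max_len=5, side="right")
print(left)   # [0, 0, 1, 2, 3]
print(right)  # [1, 2, 3, 0, 0]
```

With right padding, the last position of a short sentence is a padding index, so the final LSTM output would be computed after reading several zero vectors.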


In [17]:
# train and test shape
X_train.shape, X_test.shape

((132,), (56,))

In [0]:
X_tr = sentences_to_indices(X_train, word_to_index, maxLen)
X_te = sentences_to_indices(X_test, word_to_index, maxLen)

In [0]:
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(X_tr), torch.from_numpy(Y_train))
test_data = TensorDataset(torch.from_numpy(X_te), torch.from_numpy(Y_test))

# dataloaders
batch_size = 4

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size)

In [20]:
# input shape in the network
x, y = next(iter(train_loader))
x.shape # shape = (batch_size, maxLen)

torch.Size([4, 10])

In [21]:
# just checking
for i in range(x.size(0)):
    print(' '.join([index_to_word[int(xx)] for xx in x[i] if xx != 0]), label_to_emoji(int(y[i])))

he likes baseball ⚾
i cooked meat 🍴
she is so cute ❤️
where is the stadium ⚾


In [22]:
# First checking if GPU is available

train_on_gpu = torch.cuda.is_available()

print('Training on GPU.') if train_on_gpu else print('No GPU available, training on CPU.')

Training on GPU.


In [0]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a PyTorch nn.Embedding layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,000 words, indices starting at 1)

    Returns:
    embedding_layer -- nn.Embedding instance holding the pretrained (frozen) weights
    """
    
    vocab_len = len(word_to_index) + 1                  # adding 1 because index 0 is reserved for padding
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]    
    
    weight = torch.FloatTensor(emb_matrix)
    # from_pretrained freezes the weights by default; the layer moves to the GPU together with the rest of the model
    embedding_layer = nn.Embedding.from_pretrained(weight)
    
    return embedding_layer
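
The matrix construction can be checked on a tiny case. A sketch assuming a made-up two-word vocabulary with 3-dimensional vectors (all names prefixed with `toy_` are hypothetical); row 0 stays all zeros, so padded positions map to a zero vector:

```python
import numpy as np

toy_word_to_index = {"apple": 1, "banana": 2}  # hypothetical vocabulary
toy_word_to_vec = {
    "apple": np.array([0.4, 0.5, 0.6]),
    "banana": np.array([0.1, 0.2, 0.3]),
}

emb_dim = 3
toy_matrix = np.zeros((len(toy_word_to_index) + 1, emb_dim))
for word, index in toy_word_to_index.items():
    toy_matrix[index, :] = toy_word_to_vec[word]

# Looking up a left-padded sentence is plain row indexing.
padded_sentence = np.array([0, 0, 2, 1])
looked_up = toy_matrix[padded_sentence]
print(looked_up.shape)  # (4, 3)
```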

In [0]:
class Emojify(nn.Module):
    """
    The RNN model that will be used to find correct emoji.
    """

    def __init__(self, output_size, hidden_dim, n_layers, word_to_vec_map, word_to_index, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(Emojify, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # define all layers

        self.embedding = pretrained_embedding_layer(word_to_vec_map, word_to_index)
        embedding_dim = self.embedding.embedding_dim # embedding dimension
    
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=n_layers, 
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        
        # linear and logsoftmax layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)
        

    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        x = x.long()   # embedding lookup needs integer indices; x stays on its current device

        # embeddings and lstm_out
        
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden) # lstm_out.shape = (batch_size, maxLen, hidden_dim)
        
        # dropout and fully-connected layer, applied to the last time step only
        out = self.dropout(lstm_out[:, -1, :])
        out = self.fc(out)
        
        # logsoftmax function
        logsoftmax_out = self.logsoftmax(out)
        
        # reshape to (batch_size, output_size), using the actual batch size rather than the global variable
        logsoftmax_out = logsoftmax_out.view(x.size(0), -1)
        
        # return last logsoftmax output and hidden state
        return logsoftmax_out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM,
        # on the same device as the LSTM weights
        device = next(self.lstm.parameters()).device
        hidden = (torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device),
                  torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device))
        
        return hidden

In [37]:
output_size = 5
hidden_dim = 128
n_layers = 2

net = Emojify(output_size, hidden_dim, n_layers, word_to_vec_map, word_to_index, drop_prob=0.5)

print(net)

Emojify(
  (embedding): Embedding(400001, 50)
  (lstm): LSTM(50, 128, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=128, out_features=5, bias=True)
  (logsoftmax): LogSoftmax()
)


In [0]:
lr=0.004

criterion = nn.NLLLoss() # expects log-probabilities; nn.CrossEntropyLoss() would instead take raw scores, without the LogSoftmax layer
optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.95)
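
`nn.NLLLoss` expects log-probabilities, which is why the model ends in `LogSoftmax`. A worked sketch of the arithmetic for a single example, assuming made-up scores for the five emoji classes (the helper names are hypothetical, and this is plain Python rather than the PyTorch implementation):

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll(log_probs, target):
    """Negative log-likelihood of the target class for one example."""
    return -log_probs[target]

toy_scores = [2.0, 0.5, 0.1, -1.0, 0.3]  # hypothetical raw outputs of the linear layer
log_probs = log_softmax(toy_scores)
loss = nll(log_probs, target=0)

# The exponentiated log-probabilities form a probability distribution.
total_prob = sum(math.exp(lp) for lp in log_probs)
print(round(total_prob, 6))  # 1.0
```

The loss shrinks toward zero as the score of the target class grows relative to the others, and equals log(5) when all five scores are equal.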

In [39]:
epochs = 100

print_every = 5
clip=5 # gradient clipping

# move model to GPU, if available
if train_on_gpu:
    net.cuda()

net.train()

# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Detach the hidden state from its history, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.detach() for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        labels = labels.long()   # NLLLoss needs integer class labels; labels stay on their current device
        loss = criterion(output, labels)
        loss.backward()
        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()
    
    if e%print_every == print_every-1:
        print("Epoch: {}/{}...".format(e+1, epochs),
              "Loss: {:.6f}...".format(loss.item()))

Epoch: 5/100... Loss: 1.566547...
Epoch: 10/100... Loss: 1.686284...
Epoch: 15/100... Loss: 1.186079...
Epoch: 20/100... Loss: 0.678496...
Epoch: 25/100... Loss: 1.100852...
Epoch: 30/100... Loss: 0.401501...
Epoch: 35/100... Loss: 0.145676...
Epoch: 40/100... Loss: 0.060932...
Epoch: 45/100... Loss: 0.026466...
Epoch: 50/100... Loss: 0.145202...
Epoch: 55/100... Loss: 0.712100...
Epoch: 60/100... Loss: 0.582608...
Epoch: 65/100... Loss: 0.260361...
Epoch: 70/100... Loss: 0.036270...
Epoch: 75/100... Loss: 0.009214...
Epoch: 80/100... Loss: 0.003649...
Epoch: 85/100... Loss: 0.001581...
Epoch: 90/100... Loss: 0.261765...
Epoch: 95/100... Loss: 0.011587...
Epoch: 100/100... Loss: 0.009255...
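
The training loop clips gradients by their global norm. A minimal pure-Python sketch of that rescaling, assuming gradients are given as lists of floats (`clip_by_global_norm` is a hypothetical helper; the real `clip_grad_norm_` works on tensors in place, and this only illustrates the arithmetic):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        return [list(grad) for grad in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in grads], total_norm

toy_grads = [[3.0, 4.0], [12.0]]  # hypothetical gradients with global norm 13
clipped, norm_before = clip_by_global_norm(toy_grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before)  # 13.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.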


In [0]:
def evaluate_and_prediction(net, loader=test_loader, batch_size=batch_size):
    net.eval()
    accuracy = 0
    loss = 0
    classes = []
    val_h = net.init_hidden(batch_size)
    for inputs, labels in loader:

        # Detach the hidden state carried over from the previous batch
        val_h = tuple([each.detach() for each in val_h])

        if train_on_gpu:
            inputs, labels = inputs.cuda(), labels.cuda()
        labels = labels.long()

        # no gradients are needed for evaluation
        with torch.no_grad():
            log_ps, val_h = net(inputs, val_h)
        ps = torch.exp(log_ps)
        top_p, top_class = ps.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        
        accuracy += torch.mean(equals.float())
        loss += criterion(log_ps, labels).item()

        # view(-1) keeps a 1-D tensor even for a batch of one
        for i in top_class.view(-1):
            classes.append(int(i))

    net.train()
    loss, accuracy = float(loss/len(loader)), float(accuracy/len(loader))
    
    return loss, accuracy, classes
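
`evaluate_and_prediction` averages per-batch accuracies, which matches the overall accuracy only when every batch has the same size (true here, since 132 and 56 are both multiples of the batch size 4). A sketch of the difference, assuming made-up predictions and hypothetical helper names:

```python
def batch_mean_accuracy(batches):
    """Average of per-batch accuracies; each batch is a list of (predicted, actual) pairs."""
    per_batch = [sum(p == a for p, a in b) / len(b) for b in batches]
    return sum(per_batch) / len(per_batch)

def overall_accuracy(batches):
    """Fraction of correct predictions over all examples."""
    pairs = [pair for b in batches for pair in b]
    return sum(p == a for p, a in pairs) / len(pairs)

# Hypothetical predictions: a full batch of 4 and a final batch of 1.
uneven = [[(0, 0), (1, 1), (2, 2), (3, 3)], [(4, 0)]]
print(batch_mean_accuracy(uneven))  # 0.5
print(overall_accuracy(uneven))     # 0.8
```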

In [40]:
_, train_accuracy, _ = evaluate_and_prediction(net, train_loader, batch_size)

print('Train Accuracy: {:.3f}'.format(train_accuracy))

Train Accuracy: 1.000


In [41]:
test_loss, test_accuracy, predicted = evaluate_and_prediction(net, test_loader, batch_size)

print('Test Loss: {:.3f}'.format(test_loss),
     '\nTest Accuracy: {:.3f}'.format(test_accuracy))

Test Loss: 0.619 
Test Accuracy: 0.821


In [42]:
# check
for i in range(10):
    print(X_test[i], '\nActual:', label_to_emoji(Y_test[i]), \
          'Predicted:', label_to_emoji(predicted[i]), end='\n\n')

I want to eat 
Actual: 🍴 Predicted: 🍴

he did not answer 
Actual: 😞 Predicted: 😞

he got a very nice raise 
Actual: 😄 Predicted: ❤️

she got me a nice present 
Actual: 😄 Predicted: ❤️

ha ha ha it was so funny 
Actual: 😄 Predicted: 😄

he is a good friend 
Actual: 😄 Predicted: ❤️

I am upset 
Actual: 😞 Predicted: 😞

We had such a lovely dinner tonight 
Actual: 😄 Predicted: 😄

where is the food 
Actual: 🍴 Predicted: 🍴

Stop making this joke ha ha ha 
Actual: 😄 Predicted: 😄

