In [2]:
%pip install emoji

Collecting emoji
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.1


emoji is a library that helps handle emojis in text — useful for decoding, encoding, or filtering emoji content.

import numpy as np: Imports NumPy, a numerical computing library used for arrays, matrices, and mathematical functions.

import pandas as pd: Imports pandas, which is used for working with tabular data like CSVs or DataFrames.

import emoji: Brings in the emoji package you just installed.

Why this is needed:

numpy helps in handling numerical operations during model processing.

pandas loads the emoji dataset (emoji_data.csv) into a DataFrame so it can be inspected and split into inputs and labels.

emoji helps manage emoji characters — either for cleaning, identifying, or encoding them.



In [3]:
import numpy as np
import pandas as pd
import emoji

Sequential: A linear stack of layers from Keras; used to build the neural network step by step.

Dense: A fully connected layer, where every input neuron is connected to every output neuron.

LSTM: Long Short-Term Memory, a special kind of RNN for sequence prediction problems (great for text).

SimpleRNN: A basic Recurrent Neural Network layer (not as powerful as LSTM, but simpler).

Embedding: Turns integer-encoded words into dense vectors of fixed size — useful for inputting text into models.

Then, from tensorflow.keras.preprocessing and tensorflow.keras.utils:

Tokenizer: Converts text into a sequence of integers.

pad_sequences: Ensures all input sequences are of the same length by padding them (important for batching).

to_categorical: Converts class labels (like 0,1,2) into one-hot vectors — useful for classification tasks.

Why this is needed:
This cell sets up all the tools needed for:

Preprocessing text,

Building a neural network that understands sequences,

Predicting emoji classes from text.



In [6]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, SimpleRNN, Embedding

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [8]:
data = pd.read_csv("emoji_data.csv",header = None)

In [9]:
data.head()

                             0  1
0  French macaroon is so tasty  4
1             work is horrible  3
2                   I am upset  3
3               throw the ball  1
4                    Good joke  2


In [10]:
emoji_dict = {
    0:":red_heart:",
    1:":baseball:",
    2:":grinning_face_with_big_eyes:",
    3:":disappointed_face:",
    4:":fork_and_knife:"
}


def label_to_emoji(label):
  return emoji.emojize(emoji_dict[label])

emoji_dict: A dictionary mapping numeric class labels (0 to 4) to corresponding emoji names in colon format.

These names are based on the Unicode CLDR short names for emojis.

Example: :red_heart: gets converted to ❤️.

label_to_emoji(label):

A helper function that takes a label like 0 and returns the actual emoji using emoji.emojize().

emoji.emojize(':red_heart:') will return ❤️.

Why this is needed:

Models work with numeric labels (e.g., 0, 1, 2), but we want to show the actual emoji in results.

This function makes it easy to translate predicted labels back to human-readable emojis.
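As a minimal sketch of this lookup step, the snippet below maps labels to shortcodes and falls back to a placeholder when a label is outside 0 to 4. The names `EMOJI_NAMES` and `label_to_name` and the `"<unknown>"` fallback are hypothetical and not part of the notebook; the returned shortcode is what you would hand to emoji.emojize() to get the glyph.

```python
# Shortcodes copied from the notebook's emoji_dict, each wrapped in colons.
EMOJI_NAMES = {
    0: ":red_heart:",
    1: ":baseball:",
    2: ":grinning_face_with_big_eyes:",
    3: ":disappointed_face:",
    4: ":fork_and_knife:",
}


def label_to_name(label, fallback="<unknown>"):
    """Return the shortcode for a class label, or a fallback for unseen labels."""
    # int() also accepts NumPy integer scalars such as the output of np.argmax.
    return EMOJI_NAMES.get(int(label), fallback)


names = [label_to_name(label) for label in (0, 3, 7)]
print(names)
```

Using .get() instead of indexing means an unexpected label produces a visible placeholder rather than a KeyError.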



In [11]:
X = data[0].values
Y = data[1].values

In [12]:
X

array(['French macaroon is so tasty', 'work is horrible', 'I am upset',
       'throw the ball', 'Good joke',
       'what is your favorite baseball game', 'I cooked meat',
       'stop messing around', 'I want chinese food',
       'Let us go play baseball', 'you are failing this exercise',
       'yesterday we lost again', 'Good job', 'ha ha ha it was so funny',
       'I will have a cheese cake', 'Why are you feeling bad',
       'I want to joke', 'I never said yes for this',
       'the party is cancelled', 'where is the ball', 'I am frustrated',
       'ha ha ha lol', 'she said yes', 'he got a raise',
       'family is all I have', 'he can pitch really well',
       'I love to the stars and back', 'do you like pizza ',
       'You totally deserve this prize', 'I miss you so much',
       'I like your jacket ', 'she got me a present',
       'will you be my valentine', 'you failed the midterm',
       'Who is down for a restaurant', 'valentine day is near',
       'Great so awesome

In [13]:
Y

array(['4', '3', '3 ', '1 ', '2', '1', '4', '3', '4', '1', '3', '3 ', '2',
       '2', '4', '3', '2', '3 ', '3 ', '1', '3 ', '2', '2', '2', '0', '1',
       '0', '4 ', '2', '0v2', '2', '0', '0', '3 ', '4', '0', '2', '1',
       '3', '1', '0', '4', '0 ', '3', '0 ', '4', '2', '3 ', '4', '2 ',
       '2', '3', '0', '2', '2', '3 ', '2', '3', '2', '2', '3 ', '3', '0 ',
       '2', '3', '0', '2', '0', '0 ', '2', '3', '2', '4', '1', '3', '3',
       '0', '0', '3', '2', '0', '3', '0', '2', '2', '4', '2', '2', '0',
       '0', '2', '3', '0', '4', '2', '1', '2', '3', '3', '2', '3', '0',
       '3', '0', '2', '0', '2', '3', '4', '3', '1', '3', '4', '3', '2',
       '3', '3', '3', '1', '4', '4', '2', '2', '1', '1', '2', '3', '2',
       '3', '4', '2', '3', '0', '2', '0', '0', '4', '3', '4', '2', '3',
       '2', '3', '4', '2', '1', '2', '4', '3', '1', '3', '2', '3', '2',
       '2', '3', '3', '2', '4', '0', '0', '0', '3', '0', '0', '1', '1',
       '2', '2', '2', '0', '3', '2', '3', '3', '1', '2',

data[0]: Refers to the first column of the dataset — this column contains the input sentences (text).

data[1]: Refers to the second column — this contains the emoji labels (as numbers 0–4).

.values: Converts the pandas Series into a NumPy array for easier processing.
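The label array printed above contains entries with trailing spaces (such as '3 ') and one malformed entry ('0v2'). One possible way to screen labels before converting them to integers is sketched below; `split_valid_labels` is a hypothetical helper, not something the notebook defines, and the sample list is a made-up excerpt.

```python
def split_valid_labels(raw_labels, num_classes=5):
    """Return (kept_indices, int_labels) for entries that parse as a class id."""
    kept, labels = [], []
    for i, raw in enumerate(raw_labels):
        text = str(raw).strip()  # '3 ' becomes '3'
        # Keep only plain decimal strings that fall inside the class range.
        if text.isdecimal() and int(text) < num_classes:
            kept.append(i)
            labels.append(int(text))
    return kept, labels


sample = ['4', '3 ', '0v2', '1 ', '2']
kept, labels = split_valid_labels(sample)
print(kept, labels)
```

The kept indices can then be used to select the matching sentences, so inputs and labels stay aligned.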

## Embedding

In [14]:
!wget --no-check-certificate http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2025-06-21 12:45:13--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2025-06-21 12:45:13--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-06-21 12:45:14--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
  Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zi

In [15]:

# Read every line of the 100-dimensional GloVe file into memory
with open('glove.6B.100d.txt', 'r', encoding='utf8') as file:
  content = file.readlines()

GloVe (Global Vectors for Word Representation) provides pre-trained word vectors that can be used to convert text into numerical form. These are much richer than simple one-hot or integer encoding.

In [16]:
embeddings = {}


for line in content:
  line = line.split()
  embeddings[line[0]] = np.array(line[1:], dtype = float)

This cell parses each GloVe line into a dictionary that maps a word to its 100-dimensional vector, so that in the next steps you can:

Look up embeddings for words in your training data,

Use them in your model's embedding layer.
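The parsing step can be sketched on a tiny in-memory stand-in for the GloVe file. This assumes the same "word v1 v2 ..." line layout that the notebook's loop relies on; `load_vectors` is a hypothetical name, and the three-dimensional sample values are made up.

```python
import io

import numpy as np


def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors


# Made-up stand-in for glove.6B.100d.txt (3 dimensions instead of 100).
fake_file = io.StringIO("the 0.1 0.2 0.3\nball -0.5 0.0 0.25\n")
vectors = load_vectors(fake_file)
print(vectors["ball"])
```

Because the function iterates over any iterable of lines, an open file handle could be passed directly, which avoids holding the raw text and the parsed vectors in memory at the same time.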

In [20]:
def get_maxlen(data):
  # Return the length of the longest token sequence
  maxlen = 0
  for sent in data:
    maxlen = max(maxlen, len(sent))
  return maxlen

# Tokenize the input data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
Xtokens = tokenizer.texts_to_sequences(X)
word2index = tokenizer.word_index


maxlen = get_maxlen(Xtokens)
print(maxlen)

10


You're preparing your text data for deep learning by:

Converting text into sequences of integers.

Figuring out how much padding is needed.

Getting the vocabulary-to-index mapping for embedding lookup.
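To make the three steps concrete, here is a simplified pure-Python approximation of what the tokenizer does. It is a sketch, not the Keras implementation: it only lowercases and splits on whitespace, and the function names `fit_word_index` and `texts_to_sequences` are chosen here for illustration.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign 1-based indices to words, most frequent first (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


texts = ["throw the ball", "where is the ball", "Good joke"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
maxlen = max(len(seq) for seq in sequences)
print(word_index)
print(sequences, maxlen)
```

Since indices depend on word frequencies, fitting on a different set of sentences can assign different indices to the same word.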

In [22]:
Xtrain = pad_sequences(Xtokens,maxlen = maxlen, padding = 'post', truncating = 'post')

In [24]:
# Find the index of the row with '0v2' in the original data
invalid_index = data[data[1] == '0v2'].index[0]

# Remove the row from both X and Y
X = np.delete(X, invalid_index)
Y = np.delete(Y, invalid_index)

# Convert Y to integer type
Y = Y.astype(int)

Y_train = to_categorical(Y)

This block:

Makes your input sequences uniform.

Removes the row whose label is malformed ('0v2').

Converts labels into one-hot format so that they’re ready for training in a classification model.
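The padding and one-hot steps can be approximated with plain NumPy, as sketched below. `pad_post` and `one_hot` are hypothetical helpers that mimic post-padding with post-truncation and one-hot encoding; they are not the Keras functions themselves.

```python
import numpy as np


def pad_post(sequences, maxlen):
    """Right-pad (or right-truncate) integer sequences to a fixed length with zeros."""
    padded = np.zeros((len(sequences), maxlen), dtype=int)
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded


def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows."""
    return np.eye(num_classes)[np.asarray(labels)]


padded = pad_post([[3, 1, 2], [4, 5, 1, 2, 9]], maxlen=4)
targets = one_hot([4, 0], num_classes=5)
print(padded)
print(targets)
```

The first sequence gains a trailing zero and the second loses its last token, so both rows end up with length 4.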

## Model

In [27]:
embed_size = 100
embedding_matrix = np.zeros((len(word2index)+1, embed_size))



for word, i in word2index.items():
  embed_vector = embeddings[word]
  embedding_matrix[i] = embed_vector

This matrix is later passed to a keras.layers.Embedding(...) layer through weights=[embedding_matrix], so your model starts from pre-trained GloVe vectors instead of learning word representations from scratch. Row 0 stays all zeros because index 0 is reserved for padding.
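The loop above indexes embeddings[word] directly, which raises a KeyError if a training word is absent from GloVe. A possible variant that tolerates missing words is sketched here; `build_embedding_matrix` is a hypothetical helper, and the toy vectors are made up for illustration.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 stays zero for padding."""
    matrix = np.zeros((len(word_index) + 1, dim))
    missing = []
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is None:
            missing.append(word)  # out-of-vocabulary: keep the zero row
        else:
            matrix[i] = vector
    return matrix, missing


# Made-up 3-dimensional vectors standing in for the GloVe dictionary.
toy_vectors = {"the": np.array([0.1, 0.2, 0.3]), "ball": np.array([-0.5, 0.0, 0.25])}
toy_index = {"the": 1, "ball": 2, "macaroon": 3}
matrix, missing = build_embedding_matrix(toy_index, toy_vectors, dim=3)
print(matrix.shape, missing)
```

Returning the list of missing words makes it easy to check how much of the vocabulary the pre-trained vectors actually cover.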

In [28]:
embedding_matrix

array([[ 0.      ,  0.      ,  0.      , ...,  0.      ,  0.      ,
         0.      ],
       [-0.046539,  0.61966 ,  0.56647 , ..., -0.37616 , -0.032502,
         0.8062  ],
       [-0.49886 ,  0.76602 ,  0.89751 , ..., -0.41179 ,  0.40539 ,
         0.78504 ],
       ...,
       [-0.46263 ,  0.069864,  0.69095 , ..., -0.29174 ,  0.32041 ,
         0.21202 ],
       [ 0.073242,  0.11134 ,  0.62281 , ...,  0.53417 , -0.1646  ,
        -0.27516 ],
       [ 0.29019 ,  0.80497 ,  0.31187 , ..., -0.33603 ,  0.45998 ,
        -0.11278 ]])

In [31]:
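model = Sequential([
    # Frozen embedding layer initialised with the GloVe matrix
    Embedding(input_dim=len(word2index) + 1,
              output_dim=embed_size,
              input_length=maxlen,
              weights=[embedding_matrix],
              trainable=False),

    LSTM(units=16, return_sequences=True),  # emits one vector per time step
    LSTM(units=4),                          # emits only the final output
    Dense(5, activation='softmax')          # probabilities over the 5 classes
])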

model.compile(optimizer ='adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

This model:

Converts sentences into word vectors using the frozen pre-trained GloVe embeddings.

Passes them through two stacked LSTM layers to capture sequential context.

Outputs probabilities across the 5 emoji categories through a softmax layer.
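To get a feel for how small this network is, the trainable parameter counts can be worked out by hand. The sketch below assumes the default LSTM and Dense configurations (biases enabled); the helper names are hypothetical. The frozen embedding layer contributes no trainable parameters.

```python
def lstm_params(input_dim, units):
    """Trainable weights in one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


counts = {
    "lstm_16": lstm_params(input_dim=100, units=16),  # fed by 100-d embeddings
    "lstm_4": lstm_params(input_dim=16, units=4),     # fed by the first LSTM
    "dense_5": dense_params(input_dim=4, units=5),    # softmax over 5 classes
}
print(counts, sum(counts.values()))
```

Under these assumptions the total is 7,849 trainable parameters, most of them in the first LSTM layer.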

In [34]:
# Re-tokenize and re-pad X after removing the invalid row
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
Xtokens = tokenizer.texts_to_sequences(X)
word2index = tokenizer.word_index

maxlen = get_maxlen(Xtokens)
Xtrain = pad_sequences(Xtokens,maxlen = maxlen, padding = 'post', truncating = 'post')

print(f"Shape of Xtrain: {Xtrain.shape}")
print(f"Shape of Y_train: {Y_train.shape}")

Shape of Xtrain: (182, 10)
Shape of Y_train: (182, 5)


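This cell refits the tokenizer and re-pads the sequences after the malformed row is deleted, so that Xtrain and Y_train have the same number of rows (182). Refitting can change the word indices and the vocabulary size, so if embedding_matrix and the model were built from the earlier word2index, they should be rebuilt after this cell to keep the embedding rows aligned with the new indices.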

In [35]:
model.fit(Xtrain,Y_train,epochs = 100)

Epoch 1/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m8s[0m 20ms/step - accuracy: 0.2943 - loss: 1.5861
Epoch 2/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - accuracy: 0.3285 - loss: 1.5514
Epoch 3/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - accuracy: 0.2595 - loss: 1.5548
Epoch 4/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step - accuracy: 0.3101 - loss: 1.5136
Epoch 5/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 18ms/step - accuracy: 0.2962 - loss: 1.5068
Epoch 6/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 27ms/step - accuracy: 0.3500 - loss: 1.4837
Epoch 7/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 18ms/step - accuracy: 0.3358 - loss: 1.4727
Epoch 8/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - accuracy: 0.3477 - loss: 1.4528
Epoch 9/100
[1m6/6[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[3

<keras.src.callbacks.history.History at 0x7abf2706e010>

In [42]:
test = ['Im good','i feel cold ','lets catch up for coffee']


test_seq = tokenizer.texts_to_sequences(test)
Xtest = pad_sequences(test_seq, maxlen = maxlen, padding = 'post', truncating = 'post')


y_pred = model.predict(Xtest)
y_pred = np.argmax(y_pred,axis =1)


for i in range(len(test)):
  print(test[i], label_to_emoji(y_pred[i]))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 154ms/step
Im good 😃
i feel cold  😞
lets catch up for coffee 🍴


This block shows a full inference pipeline:

Tokenize and pad the new input with the tokenizer fitted on the training data.

Predict emoji classes using the trained model.

Convert numeric class predictions to actual emojis.

Display results.
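The step from softmax outputs to class labels is sketched below with made-up probabilities; the real values come from model.predict(Xtest). The rows are invented so that the winning classes (2, 3, 4) match the emojis shown in the output above.

```python
import numpy as np

# Made-up softmax outputs for three sentences over the 5 emoji classes.
probabilities = np.array([
    [0.05, 0.05, 0.70, 0.15, 0.05],
    [0.10, 0.05, 0.10, 0.70, 0.05],
    [0.10, 0.10, 0.10, 0.10, 0.60],
])

predicted = np.argmax(probabilities, axis=1)  # most likely class per row
confidence = probabilities.max(axis=1)        # probability of that class

for label, p in zip(predicted, confidence):
    print(int(label), round(float(p), 2))
```

Printing the winning probability next to the label is a cheap way to spot predictions the model is unsure about.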