# Emoji Prediction Using Bi-directional LSTM with GloVe Embeddings

This project aims to predict appropriate emojis based on the content of text messages or social media posts. Leveraging the power of natural language processing (NLP), we train a machine learning model using a dataset of text paired with emojis. The model uses Bi-directional Long Short-Term Memory (LSTM) networks with pre-trained GloVe word embeddings to capture the contextual meaning of the text and predict the most relevant emoji. By utilizing these advanced deep learning techniques, the model can understand subtle nuances in language and suggest emojis that match the sentiment or context of the input text, making it an exciting application for enhancing social media interactions, messaging apps, and more.

The project focuses on:

Text preprocessing and feature extraction
Implementing Bi-directional LSTM with GloVe embeddings
Training a model to predict emoji labels
Evaluation using accuracy and other metrics
The final model can be used in various applications where text-based emoji prediction can enhance user experience.

# Import Libraries

In [138]:
import pandas as pd
import numpy as np
import re
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.initializers import Constant

# Load and Explore Dataset

In [130]:
df1 = pd.read_csv(r'C:\Users\user\Desktop\CognoRise Infotech ML Intern\data\emoji_prediction_training_data.csv')
df1

Unnamed: 0.1,Unnamed: 0,TEXT,Label
0,0,Vacation wasted ! #vacation2017 #photobomb #ti...,0
1,1,"Oh Wynwood, you’re so funny! : @user #Wynwood ...",1
2,2,Been friends since 7th grade. Look at us now w...,2
3,3,This is what it looks like when someone loves ...,3
4,4,RT @user this white family was invited to a Bl...,3
...,...,...,...
69995,69995,"Yes, I call Galina ""my Bubie"" Go follow my bea...",3
69996,69996,"I SEA you, Seattle @ Ballard Seafood Festival\n",16
69997,69997,If one of my daughters is wearing this and ask...,2
69998,69998,Guess who whoop people on THEIR homecoming?! #...,3


In [132]:
df2 = pd.read_csv(r'C:\Users\user\Desktop\CognoRise Infotech ML Intern\data\emoji_prediction_mapping_data.csv')
df2.head()

Unnamed: 0.1,Unnamed: 0,emoticons,number
0,0,😜,0
1,1,📸,1
2,2,😍,2
3,3,😂,3
4,4,😉,4


# Merge Dataset

In [133]:
df = pd.merge(df1, df2, left_on='Label', right_on='number')
df.head()

Unnamed: 0,Unnamed: 0_x,TEXT,Label,Unnamed: 0_y,emoticons,number
0,0,Vacation wasted ! #vacation2017 #photobomb #ti...,0,0,😜,0
1,1,"Oh Wynwood, you’re so funny! : @user #Wynwood ...",1,1,📸,1
2,2,Been friends since 7th grade. Look at us now w...,2,2,😍,2
3,3,This is what it looks like when someone loves ...,3,3,😂,3
4,4,RT @user this white family was invited to a Bl...,3,3,😂,3


In [136]:
df = df.drop(['Unnamed: 0_x', 'Unnamed: 0_y', 'number'], axis=1)
df.head()

Unnamed: 0,TEXT,Label,emoticons
0,Vacation wasted ! #vacation2017 #photobomb #ti...,0,😜
1,"Oh Wynwood, you’re so funny! : @user #Wynwood ...",1,📸
2,Been friends since 7th grade. Look at us now w...,2,😍
3,This is what it looks like when someone loves ...,3,😂
4,RT @user this white family was invited to a Bl...,3,😂


In [137]:
print("Dataset size:", len(df))

Dataset size: 70000


# Text cleaning

In [139]:
def clean_text(text):
    text = text.lower()  # lowercase
    text = re.sub(r'http\S+|www.\S+', '', text)  # remove URLs
    text = re.sub(r'@\w+|#\w+', '', text)  # remove mentions and hashtags
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    return text

In [140]:
df['cleaned_text'] = df['TEXT'].apply(clean_text)

# Preparing text for model

In [141]:
texts = df['cleaned_text'].values
labels = df['Label'].values

# Tokenizing

In [142]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

In [143]:
maxlen = 100  
X_train = pad_sequences(sequences, maxlen=maxlen)

# Load GloVe embeddings

In [147]:
embedding_dim = 300
embedding_index = {}
with open(r'C:\Users\user\Downloads\glove.42B.300d.txt', encoding='utf-8') as f:  # replace with actual path to GloVe file
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

# Prepare embedding matrix

In [148]:
num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# Model setup

In [150]:
model = Sequential([
    Embedding(num_words, embedding_dim, embeddings_initializer=Constant(embedding_matrix),
              trainable=False),  
    Bidirectional(LSTM(128, return_sequences=True)),
    Dropout(0.2),
    Bidirectional(LSTM(128)),
    Dropout(0.2),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(len(np.unique(labels)), activation='softmax')
])

In [151]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train model

In [152]:
model.fit(X_train, labels, epochs=5, batch_size=64, verbose=1)

Epoch 1/5
[1m1094/1094[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1696s[0m 2s/step - accuracy: 0.2589 - loss: 2.5704
Epoch 2/5
[1m1094/1094[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m908s[0m 830ms/step - accuracy: 0.3198 - loss: 2.3330
Epoch 3/5
[1m1094/1094[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m904s[0m 826ms/step - accuracy: 0.3376 - loss: 2.2497
Epoch 4/5
[1m1094/1094[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m833s[0m 761ms/step - accuracy: 0.3567 - loss: 2.1727
Epoch 5/5
[1m1094/1094[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m840s[0m 767ms/step - accuracy: 0.3789 - loss: 2.0760


<keras.src.callbacks.history.History at 0x194dc5c9ac0>

In [154]:
# create a dictionary to map labels (number) to emojis
emoji_labels = dict(zip(df['Label'], df['emoticons']))

# Predictions

In [155]:
test_df = pd.read_csv(r'C:\Users\user\Desktop\CognoRise Infotech ML Intern\data\emoji_prediction_test_data.csv')  # Replace with your actual file path
test_texts = test_df['TEXT'].values  # Assuming the text column is named 'TEXT'

In [156]:
# tokenize the test texts
test_sequences = tokenizer.texts_to_sequences(test_texts)

In [157]:
# pad the sequences to ensure uniform input length
maxlen = 100  
X_test = pad_sequences(test_sequences, maxlen=maxlen)

In [158]:
# make predictions using the trained model
predictions = model.predict(X_test)

[1m812/812[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m178s[0m 217ms/step


In [159]:
# convert predictions to emoji labels (argmax to get the class index)
predicted_emojis = [np.argmax(pred) for pred in predictions]

In [160]:
# map predicted emoji indices to emoji labels using the `emoji_labels` dictionary
predicted_emoji_labels = [emoji_labels[pred] for pred in predicted_emojis]

In [161]:
# output the predictions
for text, emoji in zip(test_texts, predicted_emoji_labels):
    print(f"Text: {text}\nPredicted Emoji: {emoji}\n")

Text: Thought this was cool...#Repost (get_repost)・・・Colorview. by shay_images…

Predicted Emoji: 📸

Text: Happy 4th! Corte madera parade. #everytownusa #merica @ Perry's on…

Predicted Emoji: 🇺🇸

Text: Luv. Or at least something close to it. @ Union Hill, Richmond, Virginia

Predicted Emoji: ❤

Text: There's a slice of pie under that whipped cream. #HouseofPies @ House of Pies

Predicted Emoji: 😍

Text: #thankyou for your thank you We adore you both + plan on moreeeee! Hosting your #wedding was…

Predicted Emoji: ❤

Text: the SPECIAL4U Lyric video will be posted on my youtube channel today at 6PM EST ! #Z…

Predicted Emoji: 🔥

Text: Momma Tanya's In town ! Awesome dinner @user with friends! @ Perch

Predicted Emoji: ❤

Text: Thing 1 and Thing 2 @ Huron, Ohio

Predicted Emoji: ❤

Text: Bday girl and some random @ Sheraton New York Times Square

Predicted Emoji: ❤

Text: Always fun with my forever wedding date Congrats @user &amp; @user

Predicted Emoji: ❤

Text: La La Land @ Griffith P