# Twitter Sentiment Analysis Prediction

In this notebook, we classify the sentiment of Twitter messages into three classes: Positive, Negative, and Neutral. The dataset's fourth class, Irrelevant, is treated as Neutral. The data is available on Kaggle: <a href="https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis">https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis</a>

In [25]:
import tensorflow as tf
from tensorflow import keras

import numpy as np
import pandas as pd
import re

## Load Training and Test Data

In [2]:
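# note: the CSV has no header row, so read_csv uses the first record as column names
# (the same applies to twitter_validation.csv below)
df = pd.read_csv("twitter_training.csv")
df.head()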

Unnamed: 0,2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"
0,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
1,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
2,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
3,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
4,2401,Borderlands,Positive,im getting into borderlands and i can murder y...
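
The header row of `df.head()` above is actually the first record of the file. As a minimal sketch of one way to keep that record as data, the snippet below reads a small in-memory CSV with `header=None`. The rows and the column names (`tweet_id`, `entity`, `sentiment`, `text`) are made up for illustration and are not part of the dataset or the notebook.

```python
import io
import pandas as pd

# Two made-up rows in the same four-column layout as the training file.
sample_csv = io.StringIO(
    '2401,Borderlands,Positive,"first made-up tweet"\n'
    '2402,Borderlands,Negative,"second made-up tweet"\n'
)

# Hypothetical column names; the real file does not define any.
columns = ["tweet_id", "entity", "sentiment", "text"]

# header=None tells pandas that the first line is data, not column names.
df_sample = pd.read_csv(sample_csv, header=None, names=columns)
print(df_sample.shape)
print(df_sample["sentiment"].tolist())
```

With named columns, the later `iloc[:,3]` and `iloc[:,2]` selections could also be written as `df["text"]` and `df["sentiment"]`. Note that the row counts shown in this notebook would each grow by one if the files were read this way.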


In [3]:
df_test = pd.read_csv("twitter_validation.csv")
df_test.head()

Unnamed: 0,3364,Facebook,Irrelevant,"I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣"
0,352,Amazon,Neutral,BBC News - Amazon boss Jeff Bezos rejects clai...
1,8312,Microsoft,Negative,@Microsoft Why do I pay for WORD when it funct...
2,4371,CS-GO,Negative,"CSGO matchmaking is so full of closet hacking,..."
3,4433,Google,Neutral,Now the President is slapping Americans in the...
4,6273,FIFA,Negative,Hi @EAHelp I’ve had Madeleine McCann in my cel...


## Preprocessing of Text Data

In [4]:
# training data
X = df.iloc[:,3]
X

0        I am coming to the borders and I will kill you...
1        im getting on borderlands and i will kill you ...
2        im coming on borderlands and i will murder you...
3        im getting on borderlands 2 and i will murder ...
4        im getting into borderlands and i can murder y...
                               ...                        
74676    Just realized that the Windows partition of my...
74677    Just realized that my Mac window partition is ...
74678    Just realized the windows partition of my Mac ...
74679    Just realized between the windows partition of...
74680    Just like the windows partition of my Mac is l...
Name: im getting on borderlands and i will murder you all ,, Length: 74681, dtype: object

In [5]:
# test data
X_test = df_test.iloc[:,3]
X_test

0      BBC News - Amazon boss Jeff Bezos rejects clai...
1      @Microsoft Why do I pay for WORD when it funct...
2      CSGO matchmaking is so full of closet hacking,...
3      Now the President is slapping Americans in the...
4      Hi @EAHelp I’ve had Madeleine McCann in my cel...
                             ...                        
994    ⭐️ Toronto is the arts and culture capital of ...
995    tHIS IS ACTUALLY A GOOD MOVE TOT BRING MORE VI...
996    Today sucked so it’s time to drink wine n play...
997    Bought a fraction of Microsoft today. Small wins.
998    Johnson & Johnson to stop selling talc baby po...
Name: I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣, Length: 999, dtype: object

In [6]:
# clean text data
def clean_text(text):
    # make all characters lowercase
    text = text.lower()

    # eliminate hashtags
    hashtags = r'#\S+'
    text = re.sub(hashtags, '', text)

    # eliminate mentions
    mentions = r'@\S+'
    text = re.sub(mentions, '', text)

    # eliminate urls
    url = r'https?://[A-Za-z0-9_%/\-\.]+[A-Za-z0-9_\.\-\?&=%]+'
    text = re.sub(url, '', text)

    # eliminate punctuation
    punctuation = r'[^\w\s]'
    text = re.sub(punctuation, '', text)

    return text
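
To see what the cleaning steps do, here is a small self-contained check on a made-up tweet. The function body is a compact copy of the steps above; the sample string and the final whitespace-collapsing step are illustrative additions and are not part of the notebook.

```python
import re

def clean_text(text):
    text = text.lower()                      # lowercase
    text = re.sub(r'#\S+', '', text)         # drop hashtags
    text = re.sub(r'@\S+', '', text)         # drop mentions
    text = re.sub(r'https?://[A-Za-z0-9_%/\-\.]+[A-Za-z0-9_\.\-\?&=%]+', '', text)  # drop urls
    text = re.sub(r'[^\w\s]', '', text)      # drop punctuation
    return text

# Made-up example tweet
sample = "@EAHelp Loving the new #FIFA update!!! Details: https://example.com/news?id=1"
cleaned = clean_text(sample)
print(repr(cleaned))

# Optional extra step (an assumption, not in the notebook): collapse repeated spaces
normalized = " ".join(cleaned.split())
print(repr(normalized))
```

Removing tokens leaves gaps of two or more spaces, which is why the cleaned outputs further down contain double spaces. This is harmless for a tokenizer that splits on whitespace.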

In [7]:
# cast every entry to str so that missing values (read as float NaN) do not break clean_text
X = X.astype(str)
X_test = X_test.astype(str)

# apply clean_text to the training and test text
X = X.map(clean_text)
X_test = X_test.map(clean_text)

In [8]:
X

0        i am coming to the borders and i will kill you...
1        im getting on borderlands and i will kill you all
2        im coming on borderlands and i will murder you...
3        im getting on borderlands 2 and i will murder ...
4        im getting into borderlands and i can murder y...
                               ...                        
74676    just realized that the windows partition of my...
74677    just realized that my mac window partition is ...
74678    just realized the windows partition of my mac ...
74679    just realized between the windows partition of...
74680    just like the windows partition of my mac is l...
Name: im getting on borderlands and i will murder you all ,, Length: 74681, dtype: object

In [9]:
X_test

0      bbc news  amazon boss jeff bezos rejects claim...
1       why do i pay for word when it functions so po...
2      csgo matchmaking is so full of closet hacking ...
3      now the president is slapping americans in the...
4      hi  ive had madeleine mccann in my cellar for ...
                             ...                        
994     toronto is the arts and culture capital of ca...
995    this is actually a good move tot bring more vi...
996    today sucked so its time to drink wine n play ...
997      bought a fraction of microsoft today small wins
998    johnson  johnson to stop selling talc baby pow...
Name: I mentioned on Facebook that I was struggling for motivation to go for a run the other day, which has been translated by Tom’s great auntie as ‘Hayley can’t get out of bed’ and told to his grandma, who now thinks I’m a lazy, terrible person 🤣, Length: 999, dtype: object

## Map Labels to Integer Values


In [10]:
# y train
y = df.iloc[:,2]
y

0        Positive
1        Positive
2        Positive
3        Positive
4        Positive
           ...   
74676    Positive
74677    Positive
74678    Positive
74679    Positive
74680    Positive
Name: Positive, Length: 74681, dtype: object

In [11]:
# y test
y_test = df_test.iloc[:,2]
y_test

0         Neutral
1        Negative
2        Negative
3         Neutral
4        Negative
          ...    
994    Irrelevant
995    Irrelevant
996      Positive
997      Positive
998       Neutral
Name: Irrelevant, Length: 999, dtype: object

In [12]:
print(set(y.values))
print(set(y_test.values))

{'Negative', 'Irrelevant', 'Positive', 'Neutral'}
{'Irrelevant', 'Positive', 'Negative', 'Neutral'}


In [13]:
map_pos_neg_neutr = {'Positive': 0, 'Neutral': 1, 'Irrelevant': 1, 'Negative': 2}

y = y.map(map_pos_neg_neutr)
y_test = y_test.map(map_pos_neg_neutr)

In [14]:
print("y values:", set(y.values))
print("y_test: ", set(y_test.values))

y values: {0, 1, 2}
y_test:  {0, 1, 2}
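
`Series.map` with a dictionary yields NaN for any label that is missing from the dictionary, so it can be worth checking for unexpected labels before mapping. The following is a rough sketch in plain Python; the label list is made up, and the lowercase `"positive"` entry is included only to show what an unmapped value looks like.

```python
from collections import Counter

label_map = {'Positive': 0, 'Neutral': 1, 'Irrelevant': 1, 'Negative': 2}

# Made-up labels, including one that the mapping does not cover
raw_labels = ["Positive", "Neutral", "Irrelevant", "Negative", "Irrelevant", "positive"]

# Labels that would turn into NaN under Series.map
unknown = sorted(set(raw_labels) - set(label_map))
print("unmapped labels:", unknown)

# Encode the known labels and count how many fall into each class
encoded = [label_map[label] for label in raw_labels if label in label_map]
counts = Counter(encoded)
print("class counts:", dict(counts))
```

The counts also show how merging Irrelevant into Neutral enlarges class 1 relative to the other two classes.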


## Split the Training Data into Training and Validation Set 

In [15]:
# split the training data into a training and a validation set
from sklearn.model_selection import train_test_split

# hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
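
The sizes of the two resulting sets can be worked out by hand. This sketch assumes the held-out share is rounded up to a whole number of rows, which agrees with the shapes printed later in this notebook.

```python
import math

n_samples = 74681   # number of rows in the training data (Series length shown above)
test_size = 0.2     # share held out for validation

# Assumption: the held-out share is rounded up, the rest goes to training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print("training rows:  ", n_train)
print("validation rows:", n_val)
```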

## Convert Labels to One Hot Encoded Values

In [16]:
# convert label vectors to one-hot encoded vectors
from tensorflow.keras.utils import to_categorical

num_classes = 3
y_train_onehot = to_categorical(y_train, num_classes=num_classes)
y_val_onehot = to_categorical(y_val, num_classes=num_classes)
y_test_onehot = to_categorical(y_test, num_classes=num_classes)

In [17]:
y_train_onehot

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.]], dtype=float32)

In [18]:
y_val_onehot

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       ...,
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]], dtype=float32)

In [19]:
y_test_onehot

array([[0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]], dtype=float32)
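
As an illustration of what one-hot encoding does, the same transformation can be sketched with NumPy alone by indexing into an identity matrix. The label values below are made up; this is not how `to_categorical` is implemented internally, only an equivalent result for integer labels in the range 0 to `num_classes - 1`.

```python
import numpy as np

labels = np.array([1, 0, 2, 1])   # made-up integer class labels
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot)

# argmax along the class axis recovers the integer labels
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Keras also provides the `sparse_categorical_crossentropy` loss, which accepts integer labels directly, so the one-hot step is a design choice rather than a requirement.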

## Convert Text Data to Sequences

In [20]:
print("X_train: ", X_train.shape)
print("X_val: ", X_val.shape)
print("X_test: ", X_test.shape)
print("")
print("y_train one-hot: ", y_train_onehot.shape)
print("y_val one-hot: ", y_val_onehot.shape)
print("y_test one-hot: ", y_test_onehot.shape)

X_train:  (59744,)
X_val:  (14937,)
X_test:  (999,)

y_train one-hot:  (59744, 3)
y_val one-hot:  (14937, 3)
y_test one-hot:  (999, 3)


In [21]:
# transform the text data to numerical data with Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# create tokenizer
tokenizer = Tokenizer()

# fit tokenizer on training data
tokenizer.fit_on_texts(X_train)

# convert text data to sequences of integer IDs
sequences_train = tokenizer.texts_to_sequences(X_train)

# calculate lengths of the sequences
seq_len = [len(seq) for seq in sequences_train]

# set maxlen for the padded sequences to the 95th percentile of the sequence lengths
maxlen = int(np.percentile(seq_len, 95))

# pad with zeros at the end; longer sequences are cut at the front (default truncating='pre')
X_train_padded = pad_sequences(sequences_train, maxlen=maxlen, padding='post')

# transform X_test
sequences_test = tokenizer.texts_to_sequences(X_test)
X_test_padded = pad_sequences(sequences_test, maxlen=maxlen, padding='post')

# transform X_val
sequences_val = tokenizer.texts_to_sequences(X_val)
X_val_padded = pad_sequences(sequences_val, maxlen=maxlen, padding='post')

print("max len: ", maxlen)
print(sequences_train[:5])
print(X_train_padded[:5])


max len:  46
[[14, 2470, 19318, 28, 18, 599, 27, 19, 8, 303, 32, 6, 255, 4089, 5, 7871, 6, 245, 3704, 4, 766, 3704], [2, 540, 199, 15, 8, 147, 791, 6, 121, 2, 21, 27, 650, 11, 5286, 19, 27, 596, 3965, 576, 6, 5005], [449, 11985, 66, 25205, 66, 449, 4431, 216, 1545, 5893, 4, 4579, 11985, 6768, 9751, 27, 170, 216, 1830, 4580], [71, 123, 93, 11, 127, 7, 39, 2, 233, 475, 1, 8542, 11986, 3598, 13, 754, 165, 25, 19319], [27, 1, 221, 61, 651, 1, 408, 15, 93, 24, 514, 63, 252, 462, 64, 696, 584, 1, 917, 452, 131, 6, 86, 44, 23, 21, 297, 3, 300, 486, 4, 44, 23, 21, 297, 3, 300, 264, 222, 131, 6, 86, 44, 23, 21, 297, 3, 300, 264]]
[[   14  2470 19318    28    18   599    27    19     8   303    32     6
    255  4089     5  7871     6   245  3704     4   766  3704     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0]
 [    2   540   199    15     8   147   791     6   121     2    21    27
    650    1
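
The tokenizing and padding steps can be mimicked in plain Python to make the mechanics visible. This is a simplified sketch, not the Keras implementation: the helper names `fit_word_index`, `texts_to_sequences` and `pad_post` and the example sentences are made up. It assumes words are ranked by frequency starting at ID 1, that unknown words are dropped, that ID 0 is reserved for padding, and that overlong sequences lose their first tokens.

```python
from collections import Counter

def fit_word_index(texts):
    # Rank words by frequency; the most frequent word gets ID 1 (0 is kept for padding)
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    # Words that were not seen during fitting are silently dropped
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_post(seq, maxlen):
    # Keep the last maxlen tokens, then fill the end with zeros
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

train_texts = ["the game is great", "the update is bad", "the game crashed again today"]
word_index = fit_word_index(train_texts)

seqs = texts_to_sequences(["the game is new", "the game crashed again today"], word_index)
padded = [pad_post(s, maxlen=4) for s in seqs]
print(seqs)
print(padded)
```

The word "new" never appears in the training texts, so it vanishes from the first sequence, and the second sequence loses its first token because it is longer than `maxlen`.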

In [22]:
print("X_train_padded: ", X_train_padded.shape)
print("X_val_padded: ", X_val_padded.shape)
print("X_test_padded: ", X_test_padded.shape)
print("")
print("y_train_onehot: ", y_train_onehot.shape)
print("y_val_onehot: ", y_val_onehot.shape)
print("y_test_onehot: ", y_test_onehot.shape)

X_train_padded:  (59744, 46)
X_val_padded:  (14937, 46)
X_test_padded:  (999, 46)

y_train_onehot:  (59744, 3)
y_val_onehot:  (14937, 3)
y_test_onehot:  (999, 3)
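
The second dimension of 46 comes from taking the 95th percentile of the training sequence lengths. The sketch below repeats that calculation on a handful of made-up lengths and also reports the share of sequences that end up truncated.

```python
import numpy as np

# Made-up sequence lengths with one very long outlier
seq_len = np.array([3, 5, 8, 12, 15, 20, 22, 30, 46, 120])

# Same rule as in the notebook: 95th percentile, rounded down to an integer
maxlen = int(np.percentile(seq_len, 95))

# Share of sequences longer than maxlen, which will lose tokens when padded
share_truncated = float((seq_len > maxlen).mean())

print("maxlen:", maxlen)
print("share truncated:", share_truncated)
```

Choosing a percentile instead of the maximum keeps a few very long tweets from inflating the input width, and with it the size of the dense layer that follows the flattened embeddings.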


## Model Training

In [23]:
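# import model and layer classes
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# define the embedding layer
# vocab_size must be larger than the highest token ID, i.e. at least len(tokenizer.word_index) + 1
vocab_size = 40000
embedding_dim = 100
max_length = maxlen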
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))

# flatten the embeddings
model.add(Flatten())

# add additional layers for classification
model.add(Dense(128, activation='relu'))
model.add(Dense(3, activation='softmax'))

# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 46, 100)           4000000   
                                                                 
 flatten (Flatten)           (None, 4600)              0         
                                                                 
 dense (Dense)               (None, 128)               588928    
                                                                 
 dense_1 (Dense)             (None, 3)                 387       
                                                                 
Total params: 4,589,315
Trainable params: 4,589,315
Non-trainable params: 0
_________________________________________________________________
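
The parameter counts in the summary can be reproduced by hand: an embedding layer stores one vector per vocabulary entry, and a dense layer stores one weight per input-output pair plus one bias per output. The arithmetic below uses the values from this notebook.

```python
vocab_size, embedding_dim, max_length = 40000, 100, 46

# Embedding: one embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Flatten: no parameters, only reshapes (46, 100) into a single vector
flatten_units = max_length * embedding_dim

# Dense(128): weights plus biases
dense_params = flatten_units * 128 + 128

# Dense(3): weights plus biases
output_params = 128 * 3 + 3

total_params = embedding_params + dense_params + output_params
print(embedding_params, flatten_units, dense_params, output_params, total_params)
```

Most of the parameters sit in the embedding matrix, so the vocabulary size has a much larger effect on model size than the width of the dense layers.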


In [24]:
# Train the model
batch_size = 64
epochs = 10

model.fit(X_train_padded, y_train_onehot, validation_data=(X_val_padded, y_val_onehot), epochs=epochs, batch_size=batch_size)

# Evaluate the model
loss, accuracy = model.evaluate(X_test_padded, y_test_onehot, batch_size=batch_size)
print("Test loss: ", loss)
print("Test accuracy: ", accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss:  0.11652030050754547
Test accuracy:  0.9799799919128418
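
To turn the model's softmax output back into sentiment names, the integer mapping used earlier can be inverted. The sketch below works on a made-up probability array standing in for the result of `model.predict`; the `id_to_label` dictionary is an illustrative helper, with class 1 covering both Neutral and Irrelevant tweets.

```python
import numpy as np

# Inverse of the label mapping used above (class 1 also contains Irrelevant)
id_to_label = {0: "Positive", 1: "Neutral", 2: "Negative"}

# Made-up softmax outputs for three tweets, one row per tweet
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# The predicted class is the column with the highest probability
pred_ids = probs.argmax(axis=1)
pred_labels = [id_to_label[int(i)] for i in pred_ids]
print(pred_labels)

# Accuracy against made-up true labels, computed the same way Keras reports it
true_ids = np.array([0, 2, 2])
accuracy = float((pred_ids == true_ids).mean())
print("accuracy:", accuracy)
```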
