# Twitter Sentiment Analysis with Simple RNN
Welcome to Twitter sentiment analysis using a simple RNN! We are now going to apply the concepts we have learnt.
Download the dataset from: https://www.kaggle.com/datasets/daniel09817/twitter-sentiment-analysis/data
Can you think of why sentiment analysis may be important to a company or government?

# Install the necessary libraries.

In [33]:
!pip install tensorflow



## Import the necessary libraries.

In [1]:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer

In [2]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

In [4]:
!pwd

/content


## Extract the data

In [9]:
#unzip the data
!unzip "/content/archive (4).zip" -d /content/data


Archive:  /content/archive (4).zip
  inflating: /content/data/twitter sentiment analysis.csv  


In [10]:
!ls /content/data

'twitter sentiment analysis.csv'


In [11]:
df = pd.read_csv(r"/content/data/twitter sentiment analysis.csv")


## Data Exploration and Preprocessing

In [12]:
df.head(10)

| | Text | Label |
|---|---|---|
| 0 | rwanda is set to host the headquarters of unit... | positive |
| 1 | it sucks for me since im focused on the nature... | negative |
| 2 | shawntarloff itsmieu you can also relate this ... | neutral |
| 3 | social security constant political crises dist... | negative |
| 4 | filmthepolicela a broken rib can puncture a lu... | negative |
| 5 | jacobringenwald akeithwatts countdankulatv i a... | negative |
| 6 | nzhksu telebusiness my question was rhetorical... | negative |
| 7 | wimbledon nick kyrgios admits spitting towards... | positive |
| 8 | is booktwt a thing if so thats her and she spe... | positive |
| 9 | roipaee joe formulagame redbullracing silverst... | negative |


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691248 entries, 0 to 691247
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Text    691244 non-null  object
 1   Label   691248 non-null  object
dtypes: object(2)
memory usage: 10.5+ MB


### Clearly, there are two columns: Text and Label.


*   Text contains the actual content of the tweet.
*   Label contains one of three labels: positive, negative or neutral.

Can you now identify what type of classification problem this is?



In [14]:
#view the columns in the dataframe
df.columns

Index(['Text', 'Label'], dtype='object')

In [15]:
#check for nulls.
df.isnull().sum()

| Column | Null count |
|---|---|
| Text | 4 |
| Label | 0 |


In [16]:
#remove the nulls.
df.dropna(inplace = True)

In [17]:
#check for nulls again.
df.isnull().sum()

| Column | Null count |
|---|---|
| Text | 0 |
| Label | 0 |


In [18]:
#dimensions of dataframe.
df.shape

(691244, 2)

In [19]:
#what are the possible labels?
df['Label'].unique()

array(['positive', 'negative', 'neutral'], dtype=object)

## Train-test-split
Next, we separate the data into the input text (X) and the label (y). We then use the `train_test_split` function from sklearn to split the data into a training set (80%) and a testing set (20%).

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

#we need to convert positive, negative and neutral to numeric values.
le = LabelEncoder()

X = df['Text'].astype('str')
y = le.fit_transform(df['Label'])

print(le.classes_)
print(y[0])
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=42)

['negative' 'neutral' 'positive']
2
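
LabelEncoder sorts the class names alphabetically and numbers them from 0, which is why the first tweet (positive) is printed as 2. The idea can be sketched in plain Python. This is only an illustration, not sklearn's implementation, and the names `classes` and `label_to_id` are our own:

```python
# Illustrative re-implementation of label encoding (not sklearn's code).
labels = ["positive", "negative", "neutral", "negative"]

# Unique class names in alphabetical order, like le.classes_
classes = sorted(set(labels))

# Each class name is mapped to its position in the sorted list.
label_to_id = {name: i for i, name in enumerate(classes)}
encoded = [label_to_id[name] for name in labels]

print(classes)  # ['negative', 'neutral', 'positive']
print(encoded)  # [2, 0, 1, 0]
```
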


# Tokenisation and Padding - two very important topics.

Computers don't understand raw text ("love", "hate", "airline"); they work with numbers. Tokenisation maps words to integers. Each word gets a unique index based on its frequency, with more frequent words receiving smaller indices.

* The Tokenizer creates a dictionary, mapping each word to an index. For example, "I love this movie" -> [2,5,7,9]

* num_words limits the vocabulary size (example: keep only the 1000 most frequent words)

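* oov_token is the placeholder for out-of-vocabulary words: any word that was not seen when the tokenizer was fitted, or that falls outside the num_words most frequent words, is replaced with the index of the special `<OOV>` token (index 1).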

### fit_on_texts(texts):
* Learns the word index dictionary based on the input texts.
* Counts word frequencies and assigns integer indices to words.

### texts_to_sequences(texts):
* Converts each text into a list of integer indices based on the tokenizer's word index.
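
To make the two steps concrete, here is a simplified pure-Python sketch of what they do. It is an assumption-laden stand-in for the Keras Tokenizer (it only lowercases and splits on spaces, with no punctuation filtering), and the helper names `build_word_index` and `to_sequences` are hypothetical:

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rough equivalent of fit_on_texts: count words, then rank by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # Index 1 is reserved for the OOV token; index 0 is left free for padding.
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Rough equivalent of texts_to_sequences with a vocabulary limit."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            # Unknown words and words ranked beyond num_words become OOV.
            seq.append(idx if idx is not None and idx < num_words else oov)
        sequences.append(seq)
    return sequences

train_texts = ["i love this movie", "i hate this airline", "i love flying"]
word_index = build_word_index(train_texts)
sequences = to_sequences(["i love this pilot"], word_index, num_words=5)
print(sequences)  # [[2, 3, 4, 1]]
```

"pilot" never appeared in the training texts, so it is mapped to 1, the same index that appears several times in the real `train_seq[0]` output.
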


In [21]:
#define the vocabulary size
vocab_size = 10000
embed_size = 100 #each word is represented as a vector of this length
hidden_size = 128 #number of units in the SimpleRNN layer
max_len = 30 #shorter sequences are padded to this length, longer ones are truncated.
output_size = 3 #since we have 3 labels.

In [22]:


tokenizer = Tokenizer(num_words = vocab_size, oov_token ="<OOV>")
tokenizer.fit_on_texts(X_train)


In [23]:
train_seq = tokenizer.texts_to_sequences(X_train)
test_seq = tokenizer.texts_to_sequences(X_test)
train_seq[0]

[149,
 4,
 172,
 308,
 18,
 1,
 166,
 6,
 144,
 79,
 3,
 1432,
 13,
 6,
 20,
 1911,
 4,
 1,
 45,
 8878,
 1553,
 1,
 1,
 1,
 1]

## Padding
* Padding ensures that all the sequences have the same length (max_len).
* We pad with 0s.
* Pre-padding adds the 0s before the sequence, whereas post-padding adds them after it.
* Sequences longer than max_len are truncated; truncating='post' drops the tokens at the end.
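
The difference between the options is easiest to see on a tiny example. The function below is a simplified sketch for a single sequence, not the Keras `pad_sequences` implementation; the name `pad` is hypothetical:

```python
def pad(seq, maxlen, padding="post", truncating="post", value=0):
    """Pad or truncate one list of token indices to exactly maxlen items."""
    if len(seq) > maxlen:
        # 'post' keeps the first maxlen tokens, 'pre' keeps the last maxlen.
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    fill = [value] * (maxlen - len(seq))
    # 'post' appends the zeros, 'pre' puts them in front.
    return seq + fill if padding == "post" else fill + seq

print(pad([5, 3, 9], maxlen=5))                    # [5, 3, 9, 0, 0]
print(pad([5, 3, 9], maxlen=5, padding="pre"))     # [0, 0, 5, 3, 9]
print(pad([1, 2, 3, 4, 5, 6], maxlen=4))           # [1, 2, 3, 4]
```
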

In [24]:
X_train_pad = pad_sequences(train_seq, maxlen = max_len, padding='post',truncating='post')
X_test_pad = pad_sequences(test_seq, maxlen= max_len, padding = 'post',truncating='post')

In [25]:
X_train_pad[0]

array([ 149,    4,  172,  308,   18,    1,  166,    6,  144,   79,    3,
       1432,   13,    6,   20, 1911,    4,    1,   45, 8878, 1553,    1,
          1,    1,    1,    0,    0,    0,    0,    0], dtype=int32)

# Build RNN Model

The model consists of an Embedding layer, then a SimpleRNN layer and finally a Dense layer.

* Embedding layer: each word index in a sequence is converted to a dense vector of length embed_size. Why? So that we capture semantic meaning. Example: cat and kitten are closer in the vector space than cat and dog.

* SimpleRNN: the recurrent layer whose structure we studied earlier. It reads the word vectors one at a time and keeps a hidden state.

* Dense layer with the softmax activation function: acts as the output layer and produces one probability per label.
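
To see how the three layers fit together, here is a minimal NumPy sketch of a forward pass for one padded sequence. It is an illustration under stated assumptions: the weights are random rather than trained, the sizes are much smaller than in our model, and all names (`E`, `W_x`, `W_h`, `W_o`, `forward`) are our own rather than Keras internals. Padding positions are simply skipped, which mimics the effect of masking zeros.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size, hidden_size, output_size = 50, 8, 16, 3

# Randomly initialised parameters stand in for trained weights.
E = rng.normal(scale=0.1, size=(vocab_size, embed_size))      # embedding table
W_x = rng.normal(scale=0.1, size=(embed_size, hidden_size))   # input -> hidden
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)
W_o = rng.normal(scale=0.1, size=(hidden_size, output_size))  # hidden -> output
b_o = np.zeros(output_size)

def forward(token_ids):
    """Embedding lookup -> simple RNN over time -> softmax over 3 classes."""
    h = np.zeros(hidden_size)             # initial hidden state
    for idx in token_ids:
        if idx == 0:
            continue                      # padding: leave the hidden state unchanged
        x = E[idx]                        # embedding lookup for this word
        h = np.tanh(x @ W_x + h @ W_h + b_h)
    logits = h @ W_o + b_o
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

probs = forward([4, 17, 9, 0, 0])
print(probs, probs.sum())
```

The output is a vector of three probabilities that sum to 1, one for each label.
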

In [27]:
model = Sequential([
    # mask_zero=True means zeros are not treated as actual words, since they are only padding.
    Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=max_len, mask_zero=True),
    SimpleRNN(hidden_size),
    Dense(output_size, activation='softmax')
])

# Train the Model

In [28]:
model.compile(
    optimizer= 'adam',
    loss = 'sparse_categorical_crossentropy',
    metrics=['accuracy']
)

In [29]:
model.fit(X_train_pad, y_train, epochs =5, batch_size = 64, validation_data= (X_test_pad,y_test))

Epoch 1/5
8641/8641 ━━━━━━━━━━━━━━━━━━━━ 50s 5ms/step - accuracy: 0.8754 - loss: 0.3150 - val_accuracy: 0.8575 - val_loss: 0.3493
Epoch 2/5
8641/8641 ━━━━━━━━━━━━━━━━━━━━ 47s 5ms/step - accuracy: 0.8949 - loss: 0.2968 - val_accuracy: 0.9015 - val_loss: 0.2479
Epoch 3/5
8641/8641 ━━━━━━━━━━━━━━━━━━━━ 43s 5ms/step - accuracy: 0.9052 - loss: 0.2412 - val_accuracy: 0.9028 - val_loss: 0.3072
Epoch 4/5
8641/8641 ━━━━━━━━━━━━━━━━━━━━ 44s 5ms/step - accuracy: 0.9009 - loss: 0.2650 - val_accuracy: 0.8915 - val_loss: 0.3000
Epoch 5/5
8641/8641 ━━━━━━━━━━━━━━━━━━━━ 48s 5ms/step - accuracy: 0.8976 - loss: 0.3048 - val_accuracy: 0.9048 - val_loss: 0.2673


<keras.src.callbacks.history.History at 0x79e59e39fad0>

In [30]:
model.summary()
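
A quick sanity check on the training log: the 8641 steps per epoch follow from the dataset size, the 80/20 split and the batch size. The arithmetic below is a sketch that assumes the test set size is rounded up and the last, smaller batch still counts as a step:

```python
import math

n_samples = 691244   # rows left after dropping nulls (df.shape)
test_size = 0.2
batch_size = 64

n_test = math.ceil(n_samples * test_size)   # assumption: test size is rounded up
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_test, steps_per_epoch)  # 552995 138249 8641
```
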

# Test the Model

In [33]:
import numpy as np

In [34]:
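def predict_sentiment(text):
    seq = tokenizer.texts_to_sequences([text])  # tokenize
    # pad with 0s and truncate in the same way as the training data
    padded = pad_sequences(seq, maxlen=max_len, padding='post', truncating='post')
    pred = model.predict(padded, verbose=0)  # class probabilities, one row per text
    sentiment = np.argmax(pred, axis=1)[0]  # index of the most likely class
    print("Prediction", pred)
    print("Sentiment", sentiment)
    label = le.inverse_transform([sentiment])  # map the index back to its label name

    return label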


In [37]:
predict_sentiment("How can I improve myself?")


Prediction [[0.00537229 0.00698105 0.9876467 ]]
Sentiment 2


array(['positive'], dtype=object)

In [36]:
predict_sentiment("You have to start learning somewhere")

Prediction [[0.01635886 0.9670601  0.01658105]]
Sentiment 1


array(['neutral'], dtype=object)

In [38]:
predict_sentiment("I am bad at Maths")

Prediction [[0.9892859  0.00748694 0.00322721]]
Sentiment 0


array(['negative'], dtype=object)

In [42]:
print(le.classes_)

['negative' 'neutral' 'positive']
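
The order of `le.classes_` is what links each probability to a label: position 0 is negative, 1 is neutral and 2 is positive. A small sketch of the decoding step, using the probabilities printed above for "How can I improve myself?" (the variable names here are our own):

```python
import numpy as np

classes = np.array(["negative", "neutral", "positive"])   # same order as le.classes_
pred = np.array([[0.00537229, 0.00698105, 0.9876467]])    # prediction printed above

class_id = int(np.argmax(pred, axis=1)[0])   # position of the largest probability
print(class_id, classes[class_id])           # 2 positive
```
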


## Note:
Don't forget to disconnect the runtime after use! Otherwise, you may use up your GPU limit.