# **NLP (Sentiment Analysis)**

**References**


*   https://keras.io/api/layers/
*   https://www.kaggle.com/code/ngyptr/lstm-sentiment-analysis-keras
*   https://www.kaggle.com/datasets/crowdflower/first-gop-debate-twitter-sentiment
*   https://www.kaggle.com/code/ruchi798/sentiment-analysis-the-simpsons


### **Importing Libraries**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import re

### **Downloading our Data**

In [None]:
! gdown --id 1VxNqohZW753ZA_b0lXh_bSZRvbkRhnQL

Downloading...
From: https://drive.google.com/uc?id=1VxNqohZW753ZA_b0lXh_bSZRvbkRhnQL
To: /content/Sentiment.csv
100% 3.97M/3.97M [00:00<00:00, 55.9MB/s]


In [None]:
data = pd.read_csv('/content/Sentiment.csv')
print(data.columns)

Index(['id', 'candidate', 'candidate_confidence', 'relevant_yn',
       'relevant_yn_confidence', 'sentiment', 'sentiment_confidence',
       'subject_matter', 'subject_matter_confidence', 'candidate_gold', 'name',
       'relevant_yn_gold', 'retweet_count', 'sentiment_gold',
       'subject_matter_gold', 'text', 'tweet_coord', 'tweet_created',
       'tweet_id', 'tweet_location', 'user_timezone'],
      dtype='object')


### **Cleaning up the Data**

In [None]:
# keeping only the necessary columns
data = data[['text','sentiment']]
print(data.head)

# drops any neutral sentiments
data = data[data.sentiment != "Neutral"]

# making all words lowercase
data['text'] = data['text'].apply(lambda x: x.lower())

# removing special characters, keeping only letters, digits and whitespace
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# .size counts cells (rows x 2 columns), so each number printed here is twice the tweet count
# positive sentiments
print(data[ data['sentiment'] == 'Positive'].size)

# negative sentiments
print(data[ data['sentiment'] == 'Negative'].size)

# since we're working with twitter data, replace the retweet marker "rt" with a space, as it is not a real word
# \b matches only the standalone token, so words such as "party" are left intact
data['text'] = data['text'].str.replace(r'\brt\b', ' ', regex=True)

<bound method NDFrame.head of                                                     text sentiment
0      RT @NancyLeeGrahn: How did everyone feel about...   Neutral
1      RT @ScottWalker: Didn't catch the full #GOPdeb...  Positive
2      RT @TJMShow: No mention of Tamir Rice and the ...   Neutral
3      RT @RobGeorge: That Carly Fiorina is trending ...  Positive
4      RT @DanScavino: #GOPDebate w/ @realDonaldTrump...  Positive
...                                                  ...       ...
13866  RT @cappy_yarbrough: Love to see men who will ...  Negative
13867  RT @georgehenryw: Who thought Huckabee exceede...  Positive
13868  RT @Lrihendry: #TedCruz As President, I will a...  Positive
13869  RT @JRehling: #GOPDebate Donald Trump says tha...  Negative
13870  RT @Lrihendry: #TedCruz headed into the Presid...  Positive

[13871 rows x 2 columns]>


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['text'] = data['text'].apply(lambda x: x.lower())


4472
16986
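
The cleaning steps above can be tried on a single string without pandas. This is only a sketch: `clean_tweet` is a hypothetical helper, not part of the notebook, and the whitespace collapsing at the end is an extra step assumed here for readability.

```python
import re

def clean_tweet(text: str) -> str:
    """Lowercase, strip special characters, and drop the standalone 'rt' marker."""
    text = text.lower()
    # Keep only letters, digits and whitespace. The range must be A-Z:
    # the range A-z would also let through [ \ ] ^ _ and the backtick.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # \b restricts the match to the whole token, so "party" keeps its "rt".
    text = re.sub(r"\brt\b", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())

print(clean_tweet("RT @Bob_Smith: Great party tonight!! #GOPDebate"))
```

The same three substitutions can be applied to a whole column with `data['text'].apply(clean_tweet)`.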


### **Preprocessing (Tokenizing)**

In [None]:
# max number of words to keep, this is our model's "vocabulary"
# only the most frequent words (word indices below 2000) are kept
max_features = 2000

# tokenizer object is instantiated, we split the words using ' ' (space) as our delimiter
tokenizer = Tokenizer(num_words=max_features, split=' ')

# update the tokenizer's vocabulary using text from the data
tokenizer.fit_on_texts(data['text'].values)

# turn each tweet into a list of word indices
X = tokenizer.texts_to_sequences(data['text'].values)

# standardize the lengths of our sequences by left-padding with zeros
X = pad_sequences(X)

print("This is our text, tokenized and padded.")
print(X)

This is our text, tokenized and padded.
[[   0    0    0 ... 1310 1394  735]
 [   0    0    0 ...  232  715   18]
 [   0    0    0 ...  205  367  680]
 ...
 [   0    0    0 ...   72   66    4]
 [   0    0    0 ... 1009 1406   74]
 [   0    0    0 ...  195    4  714]]
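
To make the tokenizing and padding steps less of a black box, here is a rough pure-Python sketch of the idea. It assumes the behaviour described above (words ranked by frequency, only indices below `num_words` kept, zeros added on the left); `fit_word_index`, `texts_to_sequences` and `pad_pre` are hypothetical stand-ins, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split(" ") if word)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping unknown words and indices >= num_words."""
    return [
        [word_index[w] for w in text.split(" ") if word_index.get(w, num_words) < num_words]
        for text in texts
    ]

def pad_pre(sequences, maxlen=None):
    """Left-pad with zeros (index 0 is reserved for padding) to a common length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            seq = seq[len(seq) - maxlen:]  # keep only the last maxlen indices
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

texts = ["the debate was great", "the debate was bad", "the crowd"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=5)
print(pad_pre(sequences))
print(texts_to_sequences(["the crowd was loud"], word_index, num_words=5))
```

In the last line, "crowd" is too rare to make the cut and "loud" was never seen, so both are silently dropped. The same thing happens to any unseen word in a new tweet.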


### **(For Fun) Accessing a word from a Token**

In [None]:
# create mapping from index to word
reverse_word_index = {value: key for key, value in tokenizer.word_index.items()}

# look up the word corresponding to index 9
word_for_index_9 = reverse_word_index.get(9, "Index not found")

print("Word for index 9:", word_for_index_9)

Word for index 9: and


### **Creating the LSTM Neural Network**

**LSTM (Long Short-Term Memory) - a special kind of RNN**

LSTMs are especially well suited to sentiment analysis for these reasons:

*   Remembering - The cell state carries context across a long piece of text, which we need to decide whether a sentiment is positive or negative
*   Forgetting - The forget gate lets the network discard details that are unimportant to the overall sentiment
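
The remembering and forgetting described above can be seen in a toy version of one LSTM time step. This is a sketch with scalar inputs and made-up weights (the dictionary `p` and its keys are hypothetical); a real layer uses matrices and learns the weights during training.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; p holds hypothetical weights."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate memory
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c

keys = ("wf", "uf", "bf", "wi", "ui", "bi", "wg", "ug", "bg", "wo", "uo", "bo")
base = {k: 0.0 for k in keys}
remember = dict(base, bf=10.0, bi=-10.0)   # forget gate open: old memory survives
forget = dict(base, bf=-10.0, bi=-10.0)    # forget gate closed: old memory is erased

_, c_keep = lstm_step(1.0, 0.0, 0.8, remember)
_, c_drop = lstm_step(1.0, 0.0, 0.8, forget)
print(round(c_keep, 3), round(c_drop, 3))
```

With the forget gate pushed open the previous cell state of 0.8 passes through almost unchanged, and with it pushed shut the state drops to roughly zero.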



In [None]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D

In [None]:
# embed_dim = size of the vector that represents each word
# lstm_out = output space (number of units) of the LSTM
embed_dim = 128
lstm_out = 196

# keras.Sequential builds the model as a linear stack of layers, added one at a time
model = Sequential()

**Embedding Layer**

In [None]:
# max_features = 2000
# transform each word index into a vector of size embed_dim (128)
# input_length = length of each padded sequence in `X` (28 here)
model.add(Embedding(max_features, embed_dim, input_length=X.shape[1]))
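
Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below assumes exactly that and nothing more: `make_embedding` and `embed` are hypothetical helpers, and the random starting values stand in for weights that the real layer would learn.

```python
import random

def make_embedding(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of small random numbers."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sequence, table):
    """Swap every word index for its row in the table."""
    return [table[idx] for idx in sequence]

table = make_embedding(vocab_size=2000, dim=128)
vectors = embed([0, 0, 10, 378], table)

print(len(vectors), len(vectors[0]))   # time steps, numbers per time step
print(len(table) * len(table[0]))      # number of trainable values in the table
```

The second number printed is 2000 x 128 = 256000, the size of the table that the layer has to learn.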

**Dropout Layer**

In [None]:
# drop 40% of the embedding channels (whole feature maps, not single elements) to reduce overfitting
model.add(SpatialDropout1D(0.4))
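
The difference between spatial dropout and ordinary dropout is easier to see on a tiny example. The sketch below is a simplified pure-Python imitation under the assumption that one keep/drop decision is made per channel and shared by every time step; `spatial_dropout_1d` is a hypothetical helper, not the Keras code.

```python
import random

def spatial_dropout_1d(sequence, rate, rng):
    """Zero whole channels for every time step and rescale the survivors."""
    dim = len(sequence[0])
    keep = [rng.random() >= rate for _ in range(dim)]   # one decision per channel
    scale = 1.0 / (1.0 - rate)                          # keeps the expected sum unchanged
    return [
        [value * scale if keep[ch] else 0.0 for ch, value in enumerate(step)]
        for step in sequence
    ]

rng = random.Random(0)
seq = [[1.0] * 8 for _ in range(5)]   # 5 time steps, 8 channels
dropped = spatial_dropout_1d(seq, rate=0.4, rng=rng)

for step in dropped:
    print([round(v, 2) for v in step])
```

Every printed row has its zeros in the same columns, because a dropped channel is dropped for the whole sequence. Like any dropout, this only applies while training.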

**LSTM Layer**

In [None]:
# dropout = 20% of the layer's inputs are dropped, recurrent_dropout = 20% of the recurrent state is dropped
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))

**Dense (Final) Layer**

In [None]:
# 2 neurons in this layer, one per class (Negative, Positive); softmax turns their outputs into probabilities
model.add(Dense(2,activation='softmax'))
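
For reference, this is roughly what the softmax activation computes, together with the categorical cross-entropy that measures how far the result is from a one-hot label. Both functions are plain-Python sketches written for this explanation, with made-up input scores.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)                        # subtracting the max avoids overflow
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y)

probs = softmax([2.0, 0.5])
print([round(p, 4) for p in probs])
print(round(categorical_crossentropy([1, 0], probs), 4))
```

The larger score gets the larger probability, and the loss shrinks towards 0 as the probability of the correct class approaches 1.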


**Compiling our Model**

In [None]:
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])

print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 28, 128)           256000    
                                                                 
 spatial_dropout1d (Spatial  (None, 28, 128)           0         
 Dropout1D)                                                      
                                                                 
 lstm (LSTM)                 (None, 196)               254800    
                                                                 
 dense (Dense)               (None, 2)                 394       
                                                                 
Total params: 511194 (1.95 MB)
Trainable params: 511194 (1.95 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
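
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these three layer types (the helper names are made up for this example) with the sizes chosen earlier: a vocabulary of 2000, 128 embedding dimensions, 196 LSTM units and 2 output classes.

```python
def embedding_params(vocab_size, embed_dim):
    # one vector of embed_dim numbers per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # one weight per input/output pair plus one bias per output
    return input_dim * units + units

counts = [embedding_params(2000, 128), lstm_params(128, 196), dense_params(196, 2)]
print(counts, sum(counts))
```

The three numbers match the `Param #` column, and their sum matches `Total params`.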


### **Training our Model and Testing**

**Importing scikit-learn**

In [None]:
from sklearn.model_selection import train_test_split

**Preparing Training/Testing Data**

In [None]:
# `get_dummies` converts categorical variables ('Negative', 'Positive') into dummy/indicator variables
# columns are sorted alphabetically: column 0 = Negative, column 1 = Positive
Y = pd.get_dummies(data['sentiment']).values

# print(data['sentiment'].values)
print("X : ", X)
print("Y : ", Y)

# remember, X is our tokenized data

# test_size = 33% of data is reserved for test set
# random_state = seed for the shuffle, so the split is reproducible (kinda like a minecraft seed)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)

# checking size of training/testing sets
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

X :  [[   0    0    0 ... 1310 1394  735]
 [   0    0    0 ...  232  715   18]
 [   0    0    0 ...  205  367  680]
 ...
 [   0    0    0 ...   72   66    4]
 [   0    0    0 ... 1009 1406   74]
 [   0    0    0 ...  195    4  714]]
Y :  [[False  True]
 [False  True]
 [False  True]
 ...
 [False  True]
 [ True False]
 [False  True]]
(7188, 28) (7188, 2)
(3541, 28) (3541, 2)
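
Here is a small stand-alone sketch of the two ideas in this cell: one-hot encoding the labels and making a seeded, reproducible split. `one_hot` and `split` are hypothetical helpers; they will not shuffle in the same order as scikit-learn, but the set sizes are assumed to follow the same rule (test size rounded up, the rest goes to training).

```python
import math
import random

def one_hot(labels):
    """Return the sorted class names and one indicator row per label."""
    classes = sorted(set(labels))
    return classes, [[label == c for c in classes] for label in labels]

def split(items, test_size, seed):
    """Shuffle with a fixed seed, then cut off the test portion."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    n_test = math.ceil(test_size * len(items))
    test = [items[i] for i in order[:n_test]]
    train = [items[i] for i in order[n_test:]]
    return train, test

classes, encoded = one_hot(["Positive", "Negative", "Positive"])
print(classes, encoded)

train, test = split(list(range(10729)), test_size=0.33, seed=42)
print(len(train), len(test))
```

With 10729 tweets left after dropping the neutral ones, the sizes come out as 7188 and 3541, the same as the shapes printed above. Running `split` again with the same seed gives the same split.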


**Training Our Model**

In [None]:
batch_size = 32

# epochs = 7, iterate over the entire training set 7 times
# batch_size = 32 samples are processed at a time
# verbose = 2 prints one summary line per epoch
model.fit(X_train, Y_train, epochs = 7, batch_size=batch_size, verbose = 2)

Epoch 1/7
225/225 - 46s - loss: 0.4384 - accuracy: 0.8146 - 46s/epoch - 203ms/step
Epoch 2/7
225/225 - 41s - loss: 0.3189 - accuracy: 0.8689 - 41s/epoch - 182ms/step
Epoch 3/7
225/225 - 43s - loss: 0.2776 - accuracy: 0.8841 - 43s/epoch - 190ms/step
Epoch 4/7
225/225 - 42s - loss: 0.2519 - accuracy: 0.8975 - 42s/epoch - 186ms/step
Epoch 5/7
225/225 - 42s - loss: 0.2187 - accuracy: 0.9126 - 42s/epoch - 187ms/step
Epoch 6/7
225/225 - 41s - loss: 0.2020 - accuracy: 0.9204 - 41s/epoch - 181ms/step
Epoch 7/7
225/225 - 49s - loss: 0.1750 - accuracy: 0.9279 - 49s/epoch - 218ms/step


<keras.src.callbacks.History at 0x7b3ff7e1fe80>
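
The `225/225` in the log is the number of batches per epoch. A quick check of the arithmetic (the helper name is made up for this example):

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

n_train, batch = 7188, 32
steps = steps_per_epoch(n_train, batch)
last_batch = n_train - (steps - 1) * batch   # the final batch is usually smaller

print(steps, last_batch)
```

So each epoch runs 224 full batches of 32 tweets and one final batch of 20.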

**Performance Evaluation and Validation Setup**

In [None]:
# size of our validation set
validation_size = 1500

# get last 1500 samples from X,Y test data
X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]

# exclude validation samples, so the first 2041 (3541 - 1500) remain
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]

# evaluate model against the testing data
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)

# score (loss value)
print("score: %.2f" % (score))

# accuracy
print("acc: %.2f" % (acc))

64/64 - 2s - loss: 0.4144 - accuracy: 0.8427 - 2s/epoch - 29ms/step
score: 0.41
acc: 0.84
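
The negative slices used to carve out the validation set can be checked on a plain list. `hold_out_tail` is a hypothetical helper written only for this illustration.

```python
def hold_out_tail(items, validation_size):
    """Split off the last validation_size items; return (rest, held_out)."""
    if validation_size <= 0:
        # items[:-0] would be empty and items[-0:] would be everything,
        # so the "nothing held out" case is handled separately
        return items, items[:0]
    return items[:-validation_size], items[-validation_size:]

rest, held_out = hold_out_tail(list(range(3541)), 1500)
print(len(rest), len(held_out))
print(rest[-1], held_out[0])
```

The two parts do not overlap: the remaining test set ends at position 2040 and the validation set starts at position 2041.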


**Checking Accuracy of Detecting Positive/Negatives**

In [None]:
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0

# iterates through the validation set and counts how often positive/negative tweets are predicted correctly
for x in range(len(X_validate)):

    result = model.predict(X_validate[x].reshape(1,X_validate.shape[1]),batch_size=1,verbose = 2)[0]

    if np.argmax(result) == np.argmax(Y_validate[x]):
        if np.argmax(Y_validate[x]) == 0:
            neg_correct += 1
        else:
            pos_correct += 1

    if np.argmax(Y_validate[x]) == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1



print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")

1/1 - 0s - 374ms/epoch - 374ms/step
1/1 - 0s - 48ms/epoch - 48ms/step
1/1 - 0s - 37ms/epoch - 37ms/step
1/1 - 0s - 39ms/epoch - 39ms/step
...
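
Calling `predict` once per tweet is what produces the long log above. As an alternative sketch, the per-class accuracy can be computed from a single batch of predictions, for example the result of one `model.predict(X_validate, verbose=0)` call. `per_class_accuracy` is a hypothetical helper and the probabilities below are made-up numbers, not model output.

```python
def per_class_accuracy(probabilities, one_hot_labels):
    """Return {class_index: fraction of that class predicted correctly}."""
    correct, total = {}, {}
    for probs, label in zip(probabilities, one_hot_labels):
        truth = max(range(len(label)), key=lambda k: label[k])   # argmax of the one-hot row
        guess = max(range(len(probs)), key=lambda k: probs[k])   # argmax of the prediction
        total[truth] = total.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (guess == truth)
    return {cls: correct[cls] / total[cls] for cls in total}

# made-up predictions: column 0 = Negative, column 1 = Positive
probs = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.7, 0.3], [0.3, 0.7]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]

accuracy = per_class_accuracy(probs, labels)
print("neg_acc", accuracy[0] * 100, "%")
print("pos_acc", accuracy[1] * 100, "%")
```

Looking at the two classes separately matters here because the dataset has far more negative tweets than positive ones, so overall accuracy alone can hide a weak positive class.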

**Fun Stuff**

In [None]:
twt = ['i hate squidward']
print(twt)

# vectorizing the tweet by the pre-fitted tokenizer instance
# words outside the tokenizer's vocabulary are dropped
twt = tokenizer.texts_to_sequences(twt)
print(twt)

# create mapping from index to word
reverse_word_index = {value: key for key, value in tokenizer.word_index.items()}

# look up the word corresponding to index 344
word = reverse_word_index.get(344, "Index not found")

print("Word:", word)

# padding the tweet to the sequence length the model was trained on (28)
twt = pad_sequences(twt, maxlen=28, dtype='int32', value=0)
# print(twt)


sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
if(np.argmax(sentiment) == 0):
    print("The tweet is negative.")
elif (np.argmax(sentiment) == 1):
    print("The tweet is positive.")

['i hate squidward']
[[10, 378]]
Word: bad
1/1 - 0s - 39ms/epoch - 39ms/step
The tweet is negative.
