# TWEETS SENTIMENT CLASSIFICATION USING LSTM

Sentiment analysis is the task of predicting the sentiment (for example happy, sad, or neutral) expressed in a piece of text. In this project, I perform sentiment analysis on a large real-world Twitter dataset, applying NLP techniques to make a binary classification (Positive or Negative).

In [3]:
# IMPORTING NECESSARY LIBRARIES

import pandas as pd
import numpy as np
import re
import nltk
import textblob
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from textblob import Word
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Loading Dataset and Data Exploration for Target Variable

In [4]:
#Reading from csv file 

data = pd.read_csv("data.csv")
data.head()

Unnamed: 0,text,sentiment
0,RT @ScottWalker: Didn't catch the full #GOPdeb...,Positive
1,RT @RobGeorge: That Carly Fiorina is trending ...,Positive
2,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Positive
3,"RT @GregAbbott_TX: @TedCruz: ""On my first day ...",Positive
4,RT @warriorwoman91: I liked her and was happy ...,Negative


In [5]:
#Checking target values we have

data['sentiment'].unique()

array(['Positive', 'Negative'], dtype=object)

In [6]:
# label encoding for the sentiment column. 
# Positive takes the value 1, negative takes the value 0

data['sentiment'] = data['sentiment'].replace('Positive',1)
data['sentiment'] = data['sentiment'].replace('Negative',0)

In [7]:
# Check out the number counts of our unique classes

data['sentiment'].value_counts()

0    8493
1    2236
Name: sentiment, dtype: int64

# Text Cleaning and Preprocessing

There is a lot of noise in the raw text scraped from the tweets. Punctuation, symbols, mentions, and links contribute little to the model, so they are stripped out.

Stop words also need to be removed, and this is the critical part of text cleaning for sentiment analysis. Stop words are common connecting words such as 'the', 'and', and 'was', which carry little meaning on their own and do not help the analysis.

In [8]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

In [9]:
# Defining a function that cleans a single tweet

def clean(tweet):
    tweet = tweet.lower()  # Lowercase everything first
    # Remove mentions; the pattern stops at "_", so "@GregAbbott_TX" leaves "_tx" behind
    tweet = re.sub("@[A-Za-z0-9]+", "", tweet)
    tweet = re.sub(r"(?:\@|https?\://|www)\S+", "", tweet)  # Remove links
    # Replace every non-letter (digits, punctuation, "#", "_", ":") with a space;
    # this drops the hashtag sign but keeps the hashtag text
    tweet = re.sub("[^A-Za-z]", " ", tweet)
    # Remove the retweet marker; note that this also strips "rt" inside longer words
    tweet = tweet.replace("rt", "")
    # Remove stop words; split() also collapses repeated whitespace
    words = [w for w in tweet.split() if w not in stop_words]
    return " ".join(words)
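
The `clean()` function above removes every occurrence of "rt", so a word such as "smart" becomes "sma", and its mention pattern stops at underscores. A minimal sketch of a stricter variant is shown here for comparison. It is not the function used for the results in this notebook, and `clean_tweet` and the tiny `STOP_WORDS` set are hypothetical names introduced only so the example runs without NLTK data.

```python
import re

# Hypothetical, tiny stop-word list so the sketch runs without NLTK data.
STOP_WORDS = {"the", "and", "was", "is", "a", "to", "of"}

def clean_tweet(tweet, stop_words=STOP_WORDS):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # links
    tweet = re.sub(r"@\w+", " ", tweet)    # whole mentions, underscores included
    tweet = re.sub(r"[^a-z]", " ", tweet)  # keep letters only (hashtag text survives)
    tweet = re.sub(r"\brt\b", " ", tweet)  # retweet marker only as a standalone word
    return " ".join(w for w in tweet.split() if w not in stop_words)

print(clean_tweet("RT @GregAbbott_TX: Smart start to the #GOPDebate http://t.co/xyz"))
# prints: smart start gopdebate
```

Using the word boundary `\b` keeps "smart" and "start" intact while still dropping the leading "RT".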

In [10]:
# Cleaning tweets by calling the clean() function

data['text'] = data['text'].map(clean)

In [11]:
data.head(10)

Unnamed: 0,text,sentiment
0,catch full gopdebate last night scott best lin...,1
1,carly fiorina trending hours debate men comple...,1
2,gopdebate w delivered highest ratings history ...,1
3,tx first day rescind every illegal executive a...,1
4,liked happy heard going moderator anymore gopd...,0
5,deer headlights ben carson may brain surgeon p...,0
6,last night debate proved gopdebate batsask tbats,0
7,fairness billclinton owns phrase gopdebate,0
8,woke tweet gopdebate best line night via,1
9,reading family comments great gopdebate,0


# TOKENIZATION
Tokenization splits a sentence into a list of tokens (here, words), which can then be indexed or vectorized.

In [12]:
data['tokenized_tweets'] = data.apply(lambda row : nltk.word_tokenize(str(row['text'])),axis = 1)

In [13]:
data.head()

Unnamed: 0,text,sentiment,tokenized_tweets
0,catch full gopdebate last night scott best lin...,1,"[catch, full, gopdebate, last, night, scott, b..."
1,carly fiorina trending hours debate men comple...,1,"[carly, fiorina, trending, hours, debate, men,..."
2,gopdebate w delivered highest ratings history ...,1,"[gopdebate, w, delivered, highest, ratings, hi..."
3,tx first day rescind every illegal executive a...,1,"[tx, first, day, rescind, every, illegal, exec..."
4,liked happy heard going moderator anymore gopd...,0,"[liked, happy, heard, going, moderator, anymor..."


# Converting Tokenized Tweets to Vectors

Keras has a preprocessing module for text that provides the `tf.keras.preprocessing.text.Tokenizer` class. 
Passing a list of texts to its `fit_on_texts()` method updates the tokenizer's internal vocabulary, which maps each word to an integer index.

In [14]:
tokenizer = Tokenizer() 
tokenizer.fit_on_texts(data.tokenized_tweets.values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 9967 unique tokens.
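
To make the vocabulary step concrete, here is a minimal pure-Python sketch of how I understand the Keras `Tokenizer` to build its word index: words are ranked by frequency, indices start at 1 (0 is left free for padding), and words not seen during fitting are dropped when no out-of-vocabulary token is configured. The helper names `build_word_index` and `texts_to_sequences` and the toy documents are illustrative assumptions, not the Keras implementation.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Map each word to an integer index, most frequent word first, starting at 1."""
    counts = Counter()
    for tokens in tokenized_texts:
        counts.update(tokens)
    # most_common() sorts by count (descending); ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(tokenized_texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[t] for t in tokens if t in word_index]
            for tokens in tokenized_texts]

docs = [["gopdebate", "best", "line"], ["gopdebate", "last", "night"]]
word_index = build_word_index(docs)
print(word_index)
print(texts_to_sequences([["best", "gopdebate", "unseen"]], word_index))
```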


I will apply a sequence model to this data, which requires inputs of the same length. To achieve this, I use the `pad_sequences()` function: it pads (or truncates) every sequence to a constant length, which is passed as the `maxlen` parameter. I have set the sequence length to 30 in this case.

In [25]:
# Defining vocabulary size (+1 because index 0 is reserved for padding)
MAX_NB_WORDS = len(word_index) + 1

# Max number of tokens kept per tweet
MAX_SEQUENCE_LENGTH = 30

# Defining the embedding dimension (a fixed hyperparameter)
EMBEDDING_DIM = 100

X = tokenizer.texts_to_sequences(data.tokenized_tweets)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
Y = data.sentiment
print('Shape of data tensor:', X.shape)
print('Shape of label tensor:', Y.shape)

Shape of data tensor: (10729, 30)
Shape of label tensor: (10729,)
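
The effect of padding with `padding='post'` can be illustrated in plain Python. This is a sketch, not the Keras code: `pad_post` is a hypothetical helper, and it assumes the Keras default `truncating='pre'`, under which a sequence longer than `maxlen` keeps its last `maxlen` entries.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end; cut overly long ones from the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # too long: keep the last maxlen ids
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[5, 2], [1, 2, 3, 4, 5]], maxlen=4))
# prints: [[5, 2, 0, 0], [2, 3, 4, 5]]
```

Every row ends up with exactly `maxlen` entries, which is why `X` above has shape `(10729, 30)`.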


# Splitting Training and Testing Sets

Before training the model, I split the data into training (70%) and test (30%) sets.

In [17]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 42)

print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(7510, 30) (7510,)
(3219, 30) (3219,)


# MODELLING

My architecture consists of three main parts. It starts with an embedding layer, which takes the padded integer sequences as input and outputs word embeddings. These embeddings are then passed to an LSTM layer, which condenses each sequence into a single small feature vector. Finally, a Dense (fully connected) layer with a sigmoid activation produces the output, the probability that the tweet is positive.


In [18]:
from keras import models, layers
from keras.layers import Activation, Dense

In [19]:
# Building Model
lstm_units = 32  # Number of hidden units in the LSTM layer
model = models.Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(units=lstm_units, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 30, 100)           996800    
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                17024     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 1,013,857
Trainable params: 1,013,857
Non-trainable params: 0
_________________________________________________________________
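
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding layer, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer. The variable names are mine; the numbers come from the cells above.

```python
vocab_size = 9968     # MAX_NB_WORDS: 9967 unique tokens + 1 for the padding index
embedding_dim = 100   # EMBEDDING_DIM
lstm_units = 32       # hidden units in the LSTM layer

embedding_params = vocab_size * embedding_dim
# 4 gates, each with (input weights + recurrent weights + bias) per unit
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units
dense_params = lstm_units * 1 + 1  # weights + bias for a single output unit

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 996800 17024 33 1013857
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.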


In [20]:
# Fitting the model to the training data

model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=15, batch_size=128) 

Train on 7510 samples, validate on 3219 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.callbacks.History at 0x2118a1f05c8>

In [21]:
# Evaluating model accuracy on the test set

scores = model.evaluate(X_test, Y_test, batch_size=128)

print("Accuracy: %.2f%%" % (scores[1]*100))
print("loss: {}".format(scores[0]))

Accuracy: 84.25%
loss: 0.5381908492411135
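
Because the classes are skewed, it helps to compare this accuracy with what a trivial classifier would achieve. The sketch below computes the majority-class baseline from the `value_counts()` output earlier in the notebook; the class mix of the test split alone may differ slightly. The `accuracy` and `recall` helpers and the toy labels are hypothetical and only illustrate why recall on the positive class is worth checking alongside accuracy.

```python
negatives, positives = 8493, 2236  # counts from data['sentiment'].value_counts()
baseline_accuracy = max(negatives, positives) / (negatives + positives)
print(f"Always predicting the majority class: {baseline_accuracy:.2%}")

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of truly positive samples that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)

# Toy labels (made up): a model that never predicts the positive class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(accuracy(y_true, y_pred), recall(y_true, y_pred))
# prints: 0.8 0.0
```

The baseline is about 79%, so the reported 84.25% is a smaller gain than it first appears.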


In [27]:
# Saving Model
# Note: Keras also provides model.save(), which is generally the more portable way to store a model.

import pickle

with open('sentiment_analysis_of_tweets.pkl', 'wb') as pickle_file:
    pickle.dump(model, pickle_file)

# PREDICTION FOR NEW TWEETS

In [28]:
test_sample_1 = "You are good"
test_sample_2 = "You are bad"
test_samples = [test_sample_1, test_sample_2]

test_sample_tokens = tokenizer.texts_to_sequences(test_samples)

# Padding the testing sequences
test_samples_tokens_pad = pad_sequences(test_sample_tokens, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

scores = model.predict(x=test_samples_tokens_pad)

def predict_tweet_sentiment(score):
    print("Score: ", score)
    return "Positive" if score > 0.5 else "Negative"

model_predictions = [predict_tweet_sentiment(score) for score in scores]

print(model_predictions)

Score:  [0.83211845]
Score:  [0.01057871]
['Positive', 'Negative']


# CONCLUSIONS, INSIGHTS AND RECOMMENDATIONS

1) Class balance is an important criterion in classification problems: when the classes are heavily skewed, the model tends to be biased toward the majority class. 
Our dataset is quite imbalanced. There are about 3.8 times as many negative tweets as positive ones (8,493 vs. 2,236). The accuracy of 84.25% looks high, but always predicting "Negative" would already reach about 79% on data with this class mix, so the model is biased and can fail to recognize positive tweets. For instance, "You are wonderful" in the example further down receives a score of only about 0.013.

For future improvements, sampling techniques can be applied to solve the imbalance problem.
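
One simple sampling technique is random oversampling: rows of the minority class are duplicated at random until both classes are equally frequent. The sketch below is a plain-Python illustration under that assumption. `oversample_minority` is a hypothetical helper, the toy data is made up, and in practice it should be applied to the training split only so that duplicated rows do not leak into the test set.

```python
import random

def oversample_minority(features, labels, seed=42):
    """Duplicate random rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())

    out_x, out_y = [], []
    for y, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)

    # Shuffle so that the duplicated rows are not grouped together.
    order = list(range(len(out_x)))
    rng.shuffle(order)
    return [out_x[i] for i in order], [out_y[i] for i in order]

toy_x = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
toy_y = [0, 0, 0, 0, 1]
balanced_x, balanced_y = oversample_minority(toy_x, toy_y)
print(balanced_y.count(0), balanced_y.count(1))
# prints: 4 4
```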

In [23]:
test_sample_1 = "You are wonderful"
test_sample_2 = "You are bad"
test_samples = [test_sample_1, test_sample_2]

test_sample_tokens = tokenizer.texts_to_sequences(test_samples)

# Padding the testing sequences
test_samples_tokens_pad = pad_sequences(test_sample_tokens, maxlen=MAX_SEQUENCE_LENGTH, padding='post')

model.predict(x=test_samples_tokens_pad)

array([[0.0129921 ],
       [0.01057871]], dtype=float32)

2) Stemming or lemmatization could be applied to reduce words to a common base form, which may improve predictions.