### Sentiment Analysis using Deep Learning - LSTM/BiLSTM

In this task we will develop a system to detect irony in text. We will use the data from the SemEval-2018 task on irony detection. You should use the file `SemEval2018-T3-train-taskA.txt` from Blackboard it consists of examples as follows:

```csv
Tweet index     Label   Tweet text
1       1       Sweet United Nations video. Just in time for Christmas. #imagine #NoReligion  http://t.co/fej2v3OUBR
2       1       @mrdahl87 We are rumored to have talked to Erv's agent... and the Angels asked about Ed Escobar... that's hardly nothing    ;)
3       1       Hey there! Nice to see you Minnesota/ND Winter Weather 
4       0       3 episodes left I'm dying over here
```


In [281]:
'''

This section has been written to read the tweeter file into colab use after wards

'''
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "" with length 9000 bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving SemEval2018-T3-train-taskA.txt to SemEval2018-T3-train-taskA (1).txt
User uploaded file "" with length 9000 bytes


In [0]:
'''
In This section wer are reading the text file containing the tweet details

'''
import pandas as pd
tweets = pd.read_csv('SemEval2018-T3-train-taskA.txt',sep='\t')

In [428]:
tweets.head()

Unnamed: 0,Tweet index,Label,Tweet text
0,1,1,Sweet United Nations video. Just in time for C...
1,2,1,@mrdahl87 We are rumored to have talked to Erv...
2,3,1,Hey there! Nice to see you Minnesota/ND Winter...
3,4,0,3 episodes left I'm dying over here
4,5,1,I can't breathe! was chosen as the most notabl...


In [429]:
'''
Importing NLTK library to use necessary packages

'''
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [430]:
'''
As part of the question here we are reading the dataset and calculating the size of vocabulary and number of positive 
and negative examples

count_words() : This funtion takes the dataframe as as input, make the words in lower case and calculates the total
number of words in the whole data set.

count_labels() : This function takes the dataframe as input and calculates the number of positive and negative tweets 
from the Label column

'''


def count_words(dframe):
    texts = dframe['Tweet text'].str.lower()
    all_txt = ' '.join(texts)
    return len(set(nltk.word_tokenize(all_txt)))

def count_labels(dframe):
    positive_count=0
    negative_count=0
    for item in dframe['Label']:
        if item == 1:
            positive_count +=1
        else:
            negative_count +=1
    return (positive_count,negative_count)

size_of_dataset = count_words(tweets)
pos_w_count,neg_count = count_labels(tweets)

print('size of the dataset :',size_of_dataset)
print('Number of piositive tweets :',pos_w_count)
print('Number of piositive tweets :',neg_count)


size of the dataset : 13460
Number of piositive tweets : 1901
Number of piositive tweets : 1916


In [446]:
'''
In This part we are preprocessing the tweets in order to remove unnecessary item which will not 
contribute much to our model.

'''

import re
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

def text_preprocess_1(text):
    text = text.lower()
    text = re.sub('@[^\s]+','',text)
    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', text, flags=re.MULTILINE)
    text = re.sub(r'\W',' ',text)
    text = re.sub(r'\s+',' ',text)
    text = " ".join([i for i in text.split() if i not in STOPWORDS ])
    return text
tweets['Tweet text'] = tweets['Tweet text'].apply(lambda x: text_preprocess_1(x))
size_of_dataset = count_words(tweets)
size_of_dataset


11697

In [447]:
'''
In this part we have divided our data set in to two parts using Scikit Learn library train_test_split. He re we have seggregated
the 80% of our data into traning set as we need more data to train our model and rest 20% to test our model.

So in train_test_split we have passed the parameter test_size=0.20 which will do the job.
Along with that we have converted our word data into numerical data by using CountVectorizer. This will basically
converts the string into bag of words according the frequency of the word.

For this Assignment we have also checd with Tf-Idf vecotorizer but CountVectorizer is performing a little better for our model.

'''

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
y = tweets['Label'].values
X_train,X_test,y_train,y_test = train_test_split(tweets['Tweet text'],y,test_size=0.20,random_state=42)
vectorizer = CountVectorizer()
vecorized_x_train = vectorizer.fit_transform(X_train).toarray()
vecorized_x_test = vectorizer.transform(X_test).toarray()
vecorized_x_test.shape

(764, 10035)

In [0]:
'''
As asked in the question here we have implemented a function which will take predicted label and Actual label as input
and give us the Accuracy,Precision,Recall and F1 score for us.

score() : This is the function that calculates Accuracy,Precision,Recall and F1 score

predict() : This function basically calculates the probability for our log-linear model and according to the 
probability it gives us the prediction.

'''

from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
def predict(X, y,weights_vector):
    pred_prob = sigmoid(dot_pro(X,weights_vector))
    y_pred = np.where(pred_prob>.5,1,0)
    return y_pred
def score(y_pred,y):
    score = sum(y_pred == y) / len(y)
    prcision = precision_score(y_pred,y)
    recall = recall_score(y_pred,y)
    fscore = f1_score(y_pred,y)
    print('accuracy is:',score )
    print('prcision is:',prcision )
    print('recall is:',recall )
    print('f1_score is:',fscore )

In [547]:
'''
Here we have implemented our Log-Linear model to classify the tweets being ironic or not.
We have have taken the Sigmoid function to calculate the probability and used Cross-Entropy as our loss function.

'''
import numpy as np
epochs = 200
learning_rate = 0.001
weights_vector = np.random.random(vecorized_x_train.shape[1])
def dot_pro(x,weights):
    return np.dot(x,weights)
def sigmoid(x):
    return 1.0/(1+np.exp(-x))
def cost_function(y_pred,Y):
    return -Y*np.log(y_pred) - (1-Y)*np.log(1-y_pred)
for ep in range(epochs+1):
    cost = 0
    for i in range(len(vecorized_x_train)):
        X = vecorized_x_train[i]
        y  = y_train[i]
        y_pred = sigmoid(np.dot(X,weights_vector))
        cost = cost_function(y_pred,y)
        weights_vector = weights_vector - learning_rate*(y_pred - y)* X
    if ep%100==0:
        print ("Epoch {} has finished. Cost is {}".format(ep,cost))

  # Remove the CWD from sys.path while we load stuff.


Epoch 0 has finished. Cost is 2.71961310375693
Epoch 100 has finished. Cost is 0.7753001096970733
Epoch 200 has finished. Cost is 0.5468308949584724


In [548]:
'''
After we trained our Log-Linear model using the weights we are predicting the values of our unseen
Tweets, which we have kept aside for our testing.

After we get the predicted labels for test data we are using score() function to get the Accuracy,precision,Recall
and f1 score.

'''

prediction = predict(vecorized_x_test,y_test,weights_vector)
score(prediction,y_test)

accuracy is: 0.5301047120418848
prcision is: 0.5169082125603864
recall is: 0.5737265415549598
f1_score is: 0.5438373570520966


## Implementation of Deeplearning methods using LSTM

In [457]:
'''
In this section we have implemented an Acceptor using Keras for classifying the tweets as Ironic or non ironic.
We have followed the below steps t build our RNN.
1. We have created a word dictionary of sequence using frequent 5000 words in the tweets ignoring some special chars.
2.Then we have padded each sentences to a length of 33 which is giving us the best results. This we can change accoding to our 
model performance.
3.Then we have transformed our Label to one hot vector
4.Again we have done the split for training and Test data
5.Lastly we have build our simple RNN model.
'''
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
from keras.layers import Dropout
from keras.preprocessing.text import Tokenizer

MAX_NB_WORDS = 5000
MAX_SEQUENCE_LENGTH = 33
EMBEDDING_DIM = 100
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(tweets['Tweet text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 11699 unique tokens.


In [459]:

X = tokenizer.texts_to_sequences(tweets['Tweet text'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH,padding='post')
print('Shape of data tensor:', X.shape)
X

Shape of data tensor: (3817, 33)


array([[ 488,  829, 1611, ...,    0,    0,    0],
       [3769, 3770, 1612, ...,    0,    0,    0],
       [ 275,   71,   12, ...,    0,    0,    0],
       ...,
       [  35, 1997, 3596, ...,    0,    0,    0],
       [1399, 3767, 1241, ...,    0,    0,    0],
       [  63,  713, 1186, ...,    0,    0,    0]], dtype=int32)

In [460]:

Y = pd.get_dummies(tweets['Label']).values
Y

array([[0, 1],
       [0, 1],
       [0, 1],
       ...,
       [1, 0],
       [1, 0],
       [1, 0]], dtype=uint8)

In [506]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(2671, 33) (2671, 2)
(1146, 33) (1146, 2)


In [544]:
'''
#-----------RNN MODEL-----------#
'''
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential_86"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_85 (Embedding)     (None, 33, 100)           500000    
_________________________________________________________________
lstm_99 (LSTM)               (None, 100)               80400     
_________________________________________________________________
dropout_20 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_86 (Dense)             (None, 2)                 202       
Total params: 580,602
Trainable params: 580,602
Non-trainable params: 0
_________________________________________________________________
None


In [539]:
'''
After building the RNN model we are fitting the model with the help of Training data.
We are fitting the model in the batch of 64 with 20 epochs.
'''
epochs = 20
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs,
                    validation_split=0.1,verbose=1)

Train on 2403 samples, validate on 268 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [540]:
'''
After the model being trained, we pass our Test data to predict the Label for the same. Keras has Predict_classes method
to calculate the same.

After we get the prediction we are passing the same to our pre-built score() function to calculate the Accuracy,precision,Recall
and F1 score.

'''

y_pred_test =  model.predict_classes(X_test, batch_size=batch_size, verbose=0)
score(y_pred_test,np.argmax(Y_test,axis=1))

accuracy is: 0.512216404886562
prcision is: 0.17017828200972449
recall is: 0.6907894736842105
f1_score is: 0.27308192457737324


## Implementation of BiDirectional LSTM with Stacked LSTM


#### Improvements

In this section we have tried to implement the concept of Birectional LSTM with stacked LSTM.

#### Model Design:
The new Model is designed with help of the concept(As being studied in the lectures) of Bidirectional LSTM and stacked LSTM.
The basic idea of the BiRNN is to read the sequence of words both from the beggining and from the end,as both the sequence may be useful for any prediction.

In BiRNN the model maintains two states,**S<sub>i</sub><sup>f</sup>(Forward State)** when we feed the sequence from the beginning  and **S<sub>i</sub><sup>b</sup>(Backward state)**, when we feed he sequence from the end. the output at a particular position is 
the concatenation of the two output vectors.

Then  I have used the concept of Multilayer RNN(Stacked RNN) where we use the output of the one LSTM layer as the input of the next LSTM layer.These type of architecture is called Deep RNNs. 
Though there has been no solid theoritcal explanations for the better performance of the Deep RNNs, but practically it has been tried and tested which yielded better performances.

In our case also when we used Deep RNNs, BiRNN+Multilayer LSTM it has given us better performance than our initial model.

#### Process Run:
As seen in the above our previous RNN model does not perform very well with some low Precision and F1. Then we used the same input to test our new model.

We have used the same preprocessing steps as our previous models as to compare the results on a same ground. So we have taken most frquent 5000 words and converted each tweet into sequence.
Then we have done the padding. After that we have split the tweets into Train(70%) and Test(30%) data using Scikit Learn.

Then we have done Label into one hot ecoding to our Labels and used **categorical_crossentropy** as our loss function.Instead of dropout we have used Spatial Dropout which generally drops the entire entire 1D feature maps instead of individual elements.IN the end we have used Softmax activation function to get the probability of the particular input.

Then we have fed the input to our Bidirectional LSTM. The output of the BiRNN then again fed to another LSTM layer. In our experiment we have found that our model has yielded better results than previous.

#### Evaluation :

After we built the model we passed our Testing data to for desired prediction.
We have fit the model with Training data with batch size of and then passed Test data to get the  prediction values. Then the prediction values we fed the same to our pre built score() function to get all the required evaluation metrices.

We have clearly see the our model gave us a 60% overall accuracy with 60% overall prcision,Recall and F1 score.

We have completed our model for Tweet classification the manner. PFB the Model design and resul outputs.

In [541]:
from keras.layers import Dense,Dropout,Embedding,LSTM,Flatten,GRU,SpatialDropout1D,Bidirectional
model4 = Sequential()

model4.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
model4.add(SpatialDropout1D(0.25))
model4.add(Bidirectional(LSTM(128,return_sequences=True)))
model4.add(LSTM(128))
#model4.add(Dropout(0.2))
model4.add(Dense(2, activation='softmax'))
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model4.summary()

Model: "sequential_85"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_84 (Embedding)     (None, 33, 100)           500000    
_________________________________________________________________
spatial_dropout1d_37 (Spatia (None, 33, 100)           0         
_________________________________________________________________
bidirectional_34 (Bidirectio (None, 33, 256)           234496    
_________________________________________________________________
lstm_98 (LSTM)               (None, 128)               197120    
_________________________________________________________________
dense_85 (Dense)             (None, 2)                 258       
Total params: 931,874
Trainable params: 931,874
Non-trainable params: 0
_________________________________________________________________


In [0]:
model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [543]:
history4=model4.fit(X_train, Y_train, validation_split=0.1,epochs=20, batch_size=64, verbose=1)

Train on 2403 samples, validate on 268 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [545]:
y_pred_test =  model4.predict_classes(X_test, batch_size=batch_size, verbose=0)
score(y_pred_test,np.argmax(Y_test,axis=1))

accuracy is: 0.6047120418848168
prcision is: 0.5980551053484603
recall is: 0.6428571428571429
f1_score is: 0.619647355163728


#### References:

To complete the above task we have taken ideas from following references

[1] Neural Network Methods for Natural Language Processing by Yoav Goldberg <Chapter 14>

[2] https://www.kaggle.com/nafisur/keras-models-lstm-cnn-gru-bidirectional-glove

[3] https://keras.rstudio.com/reference/layer_spatial_dropout_1d.html

[4] https://github.com/susanli2016/NLP-with-Python/blob/master/Multi-Class%20Text%20Classification%20LSTM%20Consumer%20complaints.ipynb