# Recurrent neural networks for Airline Sentiment Analysis of Twitter Data

This work applies a recurrent neural network with Long Short-Term Memory (LSTM) cells to classify airline tweets as positive or negative. My approach is tested on U.S. airline tweets. You can download the dataset from Kaggle (https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

# Dependencies

In [1]:
import os
import re
import csv
import collections
import numpy as np
import tensorflow as tf
import pandas as pd
from nltk.corpus import stopwords
from tensorflow.contrib import learn

# Processing the Data

In [2]:
path=r'C:\Users\Abdullahfadel\Desktop\Last semster\Machine Learning+deep Learning\assginments\data\Tweets.csv'
Tweet = pd.read_csv(path)

# Here is a snapshot of the dataset
Tweet

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,570300767074181121,negative,1.0000,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)
6,570300616901320704,positive,0.6745,,0.0000,Virgin America,,cjmcginnis,,0,"@VirginAmerica yes, nearly every time I fly VX...",,2015-02-24 11:13:57 -0800,San Francisco CA,Pacific Time (US & Canada)
7,570300248553349120,neutral,0.6340,,,Virgin America,,pilot,,0,@VirginAmerica Really missed a prime opportuni...,,2015-02-24 11:12:29 -0800,Los Angeles,Pacific Time (US & Canada)
8,570299953286942721,positive,0.6559,,,Virgin America,,dhepburn,,0,"@virginamerica Well, I didn't…but NOW I DO! :-D",,2015-02-24 11:11:19 -0800,San Diego,Pacific Time (US & Canada)
9,570295459631263746,positive,1.0000,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada)


In [3]:
# Clean a tweet: keep only letters, lowercase the text, and drop stop words,
# @mentions and words starting with "flight"
stops_ = set(stopwords.words("english"))  # build the stop-word set once, not on every call

def convert_tweets(tweet):
    letters = re.sub("[^a-zA-Z@]", " ", tweet)  # replace everything except letters and '@' with spaces
    words = letters.lower().split()
    significative_words = [w for w in words
                           if w not in stops_
                           and not w.startswith("@")
                           and not w.startswith("flight")]
    return " ".join(significative_words)
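
As a quick illustration of what the cleaning step above does, here is a minimal sketch that applies the same logic to a made-up tweet. `TOY_STOPS` is a tiny hand-picked stop-word set standing in for NLTK's English list (an assumption, so the example runs without downloading the corpus), and `clean_tweet` and the example tweet are hypothetical.

```python
import re

# Assumption: a tiny stand-in for stopwords.words("english")
TOY_STOPS = {"a", "i", "is", "it", "my", "s", "the", "to", "was"}

def clean_tweet(tweet, stops=TOY_STOPS):
    # Replace everything except letters and '@' with spaces
    letters = re.sub("[^a-zA-Z@]", " ", tweet)
    words = letters.lower().split()
    # Drop stop words, @mentions and words starting with "flight"
    kept = [w for w in words
            if w not in stops
            and not w.startswith("@")
            and not w.startswith("flight")]
    return " ".join(kept)

example = "@united my Flight was delayed 3 hours, thanks a lot!!"
cleaned = clean_tweet(example)
print(cleaned)
```

The mention, the digits, the punctuation and the word "flight" all disappear, leaving only the words that are likely to carry sentiment.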

In [22]:
# Pre-process the tweets and store the result in a separate column
Tweet['clean_tweet'] = Tweet['text'].apply(convert_tweets)
# Convert sentiment to binary: negative -> 0, neutral or positive -> 1
Tweet['sentiment'] = Tweet['airline_sentiment'].apply(lambda x: 0 if x == 'negative' else 1)

# Join all the cleaned tweets to build a corpus
all_text = ' '.join(Tweet['clean_tweet'])
words = all_text.split()

In [23]:
# Convert words to integers, most frequent word first (index 0 is left free for padding)
from collections import Counter
counts_words = Counter(words)
vocabularies = sorted(counts_words, key=counts_words.get, reverse=True)
vocabularies_to_integer= {word: mm for mm, word in enumerate(vocabularies, 1)}

tweet_ints = []
for each in Tweet['clean_tweet']:
    tweet_ints.append([vocabularies_to_integer[word] for word in each.split()])
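
To make the word-to-integer mapping concrete, the following sketch runs the same steps on three toy tweets. The tweets and the names `toy_tweets`, `word_to_int` and `encoded` are hypothetical; only the technique (frequency-sorted vocabulary, indices starting at 1) mirrors the cell above.

```python
from collections import Counter

toy_tweets = ["late late bag", "bag lost", "late crew"]  # hypothetical cleaned tweets
toy_words = " ".join(toy_tweets).split()

# Most frequent words get the smallest indices; numbering starts at 1
counts = Counter(toy_words)
vocab = sorted(counts, key=counts.get, reverse=True)
word_to_int = {word: idx for idx, word in enumerate(vocab, 1)}

# Encode every tweet as a list of integers
encoded = [[word_to_int[w] for w in t.split()] for t in toy_tweets]
print(word_to_int)
print(encoded)
```

Because numbering starts at 1, the value 0 never refers to a real word and can safely be used as padding later.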

In [24]:
# Create the array of labels: negative -> 0, neutral or positive -> 1
labels = np.array([0 if tweet == 'negative' else 1 for tweet in Tweet['airline_sentiment']])

# Count how many tweets have each length after pre-processing (including zero-length tweets)
tweet_length = Counter([len(x) for x in tweet_ints])

In [10]:
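# Left-pad every tweet with zeros up to the length of the longest tweet
seq_len = max(tweet_length)
features = np.zeros((len(tweet_ints), seq_len), dtype=int)
for i, row in enumerate(tweet_ints):
    if row:  # an empty tweet stays all zeros; features[i, -0:] would address the whole row
        features[i, -len(row):] = np.array(row)[:seq_len]
features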

array([[    0,     0,     0, ...,     0,     0,   123],
       [    0,     0,     0, ...,  2351,   105,  8591],
       [    0,     0,     0, ...,    73,    67,   100],
       ..., 
       [    0,     0,     0, ...,   327,   146, 10873],
       [    0,     0,     0, ...,  1280,    47,  2273],
       [    0,     0,     0, ...,   475,    62,    92]])
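
The left-padding used to build `features` can be seen on a small example. This is a minimal sketch with made-up integer sequences (`toy_ints` is hypothetical); it also shows why an empty sequence needs a guard, since a slice starting at `-0` addresses the whole row.

```python
import numpy as np

toy_ints = [[5, 2], [7], [], [3, 1, 4]]  # hypothetical encoded tweets
seq_len = max(len(row) for row in toy_ints)

padded = np.zeros((len(toy_ints), seq_len), dtype=int)
for i, row in enumerate(toy_ints):
    if row:  # skip empty tweets: padded[i, -0:] would address the whole row
        padded[i, -len(row):] = row
print(padded)
```

Each row ends with the words of the tweet, and the zeros in front are padding, which matches the layout of the array printed above.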

The dataset is split into an 80% training set (11684 tweets) and a 10% test set (1461 tweets). The remaining 10% is set aside and not used.

In [11]:
split_fraction = 0.8
split_idx = int(len(features)*split_fraction)
train_x, val_x = features[:split_idx], features[split_idx:]
train_y, val_y = labels[:split_idx], labels[split_idx:]

# Keep the second half of the remaining 20% as the test set
test_idx = int(len(val_x)*0.5)
test_x = val_x[test_idx:]
test_y = val_y[test_idx:]

print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape), 
      "\nTest set: \t\t{}".format(test_x.shape))

print("Labels Train set: \t\t{}".format(train_y.shape), 
      "\nLabels Test set: \t\t{}".format(test_y.shape))

			Feature Shapes:
Train set: 		(11684, 22) 
Test set: 		(1461, 22)
Labels Train set: 		(11684,) 
Labels Test set: 		(1461,)
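
The arithmetic behind the split can be checked in isolation. The sketch below is only an illustration: `split_sizes` is a hypothetical helper, and the dataset size of 1000 is a round number chosen for readability, not the size of the tweet dataset.

```python
def split_sizes(n, split_fraction=0.8):
    # First cut: training set versus everything else
    split_idx = int(n * split_fraction)
    remainder = n - split_idx
    # Second cut: the last half of the remainder becomes the test set
    test_idx = int(remainder * 0.5)
    unused = test_idx
    test = remainder - test_idx
    return split_idx, unused, test

print(split_sizes(1000))
```

The middle value is the part of the data that falls between the training set and the test set and is never used by the cell above.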


# Hyperparameters

On the Twitter dataset, my one-layer LSTM performed best with word-vector and hidden dimensions of 128. I also tested dimensions ranging from 200 to 300; overall, the LSTM gives good performance, with high accuracy on the training set. The best performance was obtained after 10-20 training epochs, and a learning rate of 0.001 performed very well.

In [12]:
lstm_size = 128
lstm_layers = 1
batch_size = 100
learning_rate = 0.001
epochs = 20
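
These settings determine how many training iterations the run performs. The following sketch works it out from the printed training-set size, assuming (as the batching function in this notebook does) that an incomplete final batch is dropped. The helper `epoch_of` is hypothetical.

```python
train_size = 11684   # number of training tweets, from the printed shapes
batch_size = 100
epochs = 20

batches_per_epoch = train_size // batch_size             # full batches only
dropped = train_size - batches_per_epoch * batch_size    # tweets left out each epoch
total_iterations = batches_per_epoch * epochs

def epoch_of(iteration):
    # Zero-based epoch index for a one-based iteration counter
    return (iteration - 1) // batches_per_epoch

print(batches_per_epoch, dropped, total_iterations)
print(epoch_of(600), epoch_of(1190), epoch_of(1765))
```

The epoch indices computed for iterations 600, 1190 and 1765 agree with the epochs shown in the training log.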

# Model

TensorFlow provides many functions that make deep learning work much easier. I used the LSTMCell class, which implements the LSTM cell. To reduce overfitting, I wrapped it in the DropoutWrapper class, an operator that adds dropout to the inputs and outputs of the given cell. To stack multiple LSTM layers, I used the MultiRNNCell class.
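
To illustrate what dropout does to a cell's outputs, here is a minimal NumPy sketch of the idea. It is not the DropoutWrapper implementation: the function `dropout`, the toy array `h` and the fixed seed are assumptions made for the example. Each value is kept with probability `keep_prob` and the kept values are scaled by `1 / keep_prob`, so the expected output stays the same.

```python
import numpy as np

def dropout(x, keep_prob, rng):
    # Keep each element with probability keep_prob and rescale the survivors
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
h = np.ones((2, 4))                 # hypothetical LSTM outputs

train_out = dropout(h, 0.5, rng)    # training: about half the values are zeroed
eval_out = dropout(h, 1.0, rng)     # evaluation: nothing is dropped
print(train_out)
print(eval_out)
```

This is why the keep probability is fed as 0.5 during training and as 1 when measuring accuracy.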

In [15]:
# Create the input placeholders
num_words = len(vocabularies_to_integer)+1  # +1 because index 0 is the padding value
# Create the graph object
graph = tf.Graph()
# Add nodes to the graph
with graph.as_default():
    inputs_ = tf.placeholder(tf.int32, [None, None], name='inputs')
    labels_ = tf.placeholder(tf.int32, [None, None], name='labels')
    keep_probably = tf.placeholder(tf.float32, name='keep_probably')

    # Size of the embedding vectors (number of units in the embedding layer)
    embeding_size = 300

    embedding = tf.Variable(tf.random_uniform((num_words, embeding_size), -1, 1))
    embedding_lookup = tf.nn.embedding_lookup(embedding, inputs_)

    # Build the LSTM cells: a basic LSTM cell wrapped in dropout to reduce overfitting.
    # A separate cell object is created for every layer.
    def build_cell():
        lstm = tf.contrib.rnn.LSTMCell(lstm_size)
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_probably)

    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(lstm_layers)])

    # Getting an initial state of all zeros
    initial_state = cell.zero_state(batch_size, tf.float32)

    # RNN forward pass
    outputs, final_state = tf.nn.dynamic_rnn(cell, embedding_lookup,
                                             initial_state=initial_state)

    # A sigmoid unit on the last time step gives the predicted sentiment
    predictions = tf.contrib.layers.fully_connected(outputs[:, -1], 1, activation_fn=tf.sigmoid)
    cost = tf.losses.mean_squared_error(labels_, predictions)

    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

    # A prediction is correct when the rounded sigmoid output equals the label
    correct_predictions = tf.equal(tf.cast(tf.round(predictions), tf.int32), labels_)
    accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
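
The cost and accuracy defined at the end of the graph can be reproduced by hand. The sketch below uses NumPy and made-up sigmoid outputs (`probs` and `labels` are hypothetical values) to show the mean squared error and the round-then-compare accuracy on four examples.

```python
import numpy as np

probs = np.array([[0.9], [0.2], [0.6], [0.4]])   # hypothetical sigmoid outputs
labels = np.array([[1], [0], [0], [1]])          # hypothetical true labels

# Mean squared error between labels and predicted probabilities
mse = np.mean((labels - probs) ** 2)

# Round the probabilities to 0/1 and compare with the labels
predicted = np.round(probs).astype(int)
acc = np.mean(predicted == labels)
print(mse, acc)
```

The first two examples are classified correctly and the last two are not, so the accuracy is one half even though none of the probabilities is far from 0.5 in the wrong cases.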

# Batching

In [16]:
# Yield batches of batch_size examples; leftover examples that do not fill a batch are dropped
def batches(x, y, batch_size=100):
    num_batches = len(x)//batch_size
    x, y = x[:num_batches*batch_size], y[:num_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]
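
A short usage example shows the effect of dropping the incomplete final batch. The function is repeated so the snippet runs on its own; `xs` and `ys` are placeholder lists with the same length as the test set (1461), not real data.

```python
def batches(x, y, batch_size=100):
    num_batches = len(x) // batch_size
    x, y = x[:num_batches*batch_size], y[:num_batches*batch_size]
    for ii in range(0, len(x), batch_size):
        yield x[ii:ii+batch_size], y[ii:ii+batch_size]

xs = list(range(1461))   # stand-in for the 1461 test tweets
ys = [0] * 1461          # stand-in labels

sizes = [len(bx) for bx, _ in batches(xs, ys, batch_size=100)]
print(len(sizes), sum(sizes))
```

With a batch size of 100, only 1400 of the 1461 test tweets are visited; the last 61 never reach the model.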

# Save your work

In [17]:
checkpoint_path = 'ckpt'

# Create a fresh checkpoint directory, deleting any existing one
if tf.gfile.Exists(checkpoint_path):
    tf.gfile.DeleteRecursively(checkpoint_path)
tf.gfile.MakeDirs(checkpoint_path)

# Training

Once the graph is built, we are ready to train the model on our data with the hyperparameters above.

In [18]:
with graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    iteration = 1
    for i in range(epochs):
        state = sess.run(initial_state)
        
        for j, (x, y) in enumerate(batches(train_x, train_y, batch_size), 1):
            feed = {inputs_: x,
                    labels_: y[:, None],
                    keep_probably: 0.5,
                    initial_state: state}
            loss, state, _ = sess.run([cost, final_state, optimizer], feed_dict=feed)
            
            if iteration%5==0:
                print("at epoch: {}/{}".format(i, epochs),
                      "number of Iteration: {}".format(iteration),
                      "the Train loss is : {:.3f}".format(loss))

            # Every 25 iterations, measure accuracy on the test batches with dropout disabled
            if iteration%25==0:
                value_accuracy = []
                val_state = sess.run(initial_state)
                for x, y in batches(test_x, test_y, batch_size):
                    feed = {inputs_: x,
                            labels_: y[:, None],
                            keep_probably: 1,
                            initial_state: val_state}
                    batch_accuracy, val_state = sess.run([accuracy, final_state], feed_dict=feed)
                    value_accuracy.append(batch_accuracy)
                print("accuracy : {:.3f}".format(np.mean(value_accuracy)))
            iteration +=1
    saver.save(sess, checkpoint_path + '/model')

at epoch: 0/20 number of Iteration: 5 the Train loss is : 0.178
at epoch: 0/20 number of Iteration: 10 the Train loss is : 0.266
at epoch: 0/20 number of Iteration: 15 the Train loss is : 0.189
at epoch: 0/20 number of Iteration: 20 the Train loss is : 0.157
at epoch: 0/20 number of Iteration: 25 the Train loss is : 0.181
accuracy : 0.790
at epoch: 0/20 number of Iteration: 30 the Train loss is : 0.172
at epoch: 0/20 number of Iteration: 35 the Train loss is : 0.186
at epoch: 0/20 number of Iteration: 40 the Train loss is : 0.182
at epoch: 0/20 number of Iteration: 45 the Train loss is : 0.229
at epoch: 0/20 number of Iteration: 50 the Train loss is : 0.217
accuracy : 0.773
at epoch: 0/20 number of Iteration: 55 the Train loss is : 0.230
at epoch: 0/20 number of Iteration: 60 the Train loss is : 0.184
at epoch: 0/20 number of Iteration: 65 the Train loss is : 0.212
at epoch: 0/20 number of Iteration: 70 the Train loss is : 0.152
at epoch: 0/20 number of Iteration: 75 the Train loss is 

at epoch: 5/20 number of Iteration: 600 the Train loss is : 0.038
accuracy : 0.815
at epoch: 5/20 number of Iteration: 605 the Train loss is : 0.040
at epoch: 5/20 number of Iteration: 610 the Train loss is : 0.043
at epoch: 5/20 number of Iteration: 615 the Train loss is : 0.039
at epoch: 5/20 number of Iteration: 620 the Train loss is : 0.044
at epoch: 5/20 number of Iteration: 625 the Train loss is : 0.053
accuracy : 0.843
at epoch: 5/20 number of Iteration: 630 the Train loss is : 0.065
at epoch: 5/20 number of Iteration: 635 the Train loss is : 0.038
at epoch: 5/20 number of Iteration: 640 the Train loss is : 0.037
at epoch: 5/20 number of Iteration: 645 the Train loss is : 0.071
at epoch: 5/20 number of Iteration: 650 the Train loss is : 0.042
accuracy : 0.796
at epoch: 5/20 number of Iteration: 655 the Train loss is : 0.024
at epoch: 5/20 number of Iteration: 660 the Train loss is : 0.027
at epoch: 5/20 number of Iteration: 665 the Train loss is : 0.058
at epoch: 5/20 number of 

at epoch: 10/20 number of Iteration: 1190 the Train loss is : 0.015
at epoch: 10/20 number of Iteration: 1195 the Train loss is : 0.013
at epoch: 10/20 number of Iteration: 1200 the Train loss is : 0.027
accuracy : 0.809
at epoch: 10/20 number of Iteration: 1205 the Train loss is : 0.019
at epoch: 10/20 number of Iteration: 1210 the Train loss is : 0.042
at epoch: 10/20 number of Iteration: 1215 the Train loss is : 0.003
at epoch: 10/20 number of Iteration: 1220 the Train loss is : 0.015
at epoch: 10/20 number of Iteration: 1225 the Train loss is : 0.022
accuracy : 0.822
at epoch: 10/20 number of Iteration: 1230 the Train loss is : 0.012
at epoch: 10/20 number of Iteration: 1235 the Train loss is : 0.004
at epoch: 10/20 number of Iteration: 1240 the Train loss is : 0.003
at epoch: 10/20 number of Iteration: 1245 the Train loss is : 0.010
at epoch: 10/20 number of Iteration: 1250 the Train loss is : 0.021
accuracy : 0.774
at epoch: 10/20 number of Iteration: 1255 the Train loss is : 0.0

at epoch: 15/20 number of Iteration: 1765 the Train loss is : 0.001
at epoch: 15/20 number of Iteration: 1770 the Train loss is : 0.003
at epoch: 15/20 number of Iteration: 1775 the Train loss is : 0.011
accuracy : 0.835
at epoch: 15/20 number of Iteration: 1780 the Train loss is : 0.021
at epoch: 15/20 number of Iteration: 1785 the Train loss is : 0.016
at epoch: 15/20 number of Iteration: 1790 the Train loss is : 0.021
at epoch: 15/20 number of Iteration: 1795 the Train loss is : 0.001
at epoch: 15/20 number of Iteration: 1800 the Train loss is : 0.010
accuracy : 0.824
at epoch: 15/20 number of Iteration: 1805 the Train loss is : 0.014
at epoch: 15/20 number of Iteration: 1810 the Train loss is : 0.010
at epoch: 15/20 number of Iteration: 1815 the Train loss is : 0.005
at epoch: 15/20 number of Iteration: 1820 the Train loss is : 0.001
at epoch: 15/20 number of Iteration: 1825 the Train loss is : 0.001
accuracy : 0.799
at epoch: 15/20 number of Iteration: 1830 the Train loss is : 0.0

# Predicting on the test tweets

In [21]:
test_accuracy = []
test_predict = []
with tf.Session(graph=graph) as sess:
    # latest_checkpoint expects the checkpoint directory, not the file prefix
    saver.restore(sess, tf.train.latest_checkpoint(checkpoint_path))
    test_state = sess.run(initial_state)
    for ee, (x, y) in enumerate(batches(test_x, test_y, batch_size), 1):
        feed = {inputs_: x,
                labels_: y[:, None],
                keep_probably: 1,
                initial_state: test_state}
        # Fetch accuracy, state and raw predictions in a single run
        batch_accuracy, test_state, batch_predictions = sess.run(
            [accuracy, final_state, predictions], feed_dict=feed)
        test_accuracy.append(batch_accuracy)
        # Round the sigmoid outputs to 0/1 class labels
        test_predict.append(np.round(batch_predictions).astype(int))
    print("Test accuracy is {:.3f}".format(np.mean(test_accuracy)))

Test accuracy is 0.846
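
The per-batch predictions collected in `test_predict` can be flattened and compared with the labels directly. The sketch below is an illustration with made-up values: the two small arrays stand in for real batch outputs of shape (batch, 1), and `flat`, `overall_acc` and `negatives_found` are hypothetical names.

```python
import numpy as np

# Hypothetical per-batch outputs shaped (batch, 1), as collected in test_predict
test_predict = [np.array([[1], [0]]), np.array([[0], [0]])]
test_y = np.array([1, 0, 1, 0])   # hypothetical true labels

# Flatten the batches into one vector of predictions
flat = np.concatenate(test_predict).ravel()
# Only the tweets that were actually batched have a prediction
seen_y = test_y[:len(flat)]

overall_acc = np.mean(flat == seen_y)
negatives_found = int(np.sum((flat == 0) & (seen_y == 0)))
print(overall_acc, negatives_found)
```

Slicing the labels to `len(flat)` matters because the batching step drops the incomplete final batch, so there are fewer predictions than test tweets.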
