# Airline Tweets Sentiment Analysis

This analysis classifies airline Tweets as having a positive or negative sentiment, based on the words used within the Tweets. A deep learning model is trained on the pre-processed Tweet text in order to predict sentiment.

In [1]:
import pandas as pd

# import CSV file from GitHub
data_url = 'https://raw.githubusercontent.com/msda665/MSDA665/main/Tweets.csv'

# save data as dataframe
dat = pd.read_csv(data_url)

# display data
display(dat)

,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)


In [2]:
# verify column names
print(dat.columns)

# verify the number of unique sentiments for Airline Tweets
print('\n', dat['airline_sentiment'].unique())

Index(['tweet_id', 'airline_sentiment', 'airline_sentiment_confidence',
       'negativereason', 'negativereason_confidence', 'airline',
       'airline_sentiment_gold', 'name', 'negativereason_gold',
       'retweet_count', 'text', 'tweet_coord', 'tweet_created',
       'tweet_location', 'user_timezone'],
      dtype='object')

 ['neutral' 'positive' 'negative']


## Initial processing

The data set contains 3 sentiments: neutral, positive, and negative. Since the objective is to classify positive or negative sentiments, any Tweets with a 'neutral' sentiment will be removed from the data set.

In [3]:
# boolean mask: True for Tweets whose sentiment is NOT 'neutral'
not_neutral = dat['airline_sentiment'] != 'neutral'
print(not_neutral)

0        False
1         True
2        False
3         True
4         True
         ...  
14635     True
14636     True
14637    False
14638     True
14639    False
Name: airline_sentiment, Length: 14640, dtype: bool


In [4]:
# create new dataframe with only 'positive' and 'negative' sentiments
dat2 = dat[not_neutral]

# verify the unique sentiments in the new dataframe
dat2['airline_sentiment'].unique()

array(['positive', 'negative'], dtype=object)

## Further processing

Since the input variable is 'text', and the target variable is 'airline_sentiment', only these 2 columns will be retained in the data set.

Additionally, the sentiment column will be transformed, such that positive sentiments are represented by a 1, and negative sentiments are represented by a 0.

In [5]:
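# create new dataframe with only the sentiment and text columns
# copy() makes df independent of dat2, so later assignments do not trigger SettingWithCopyWarning
df = dat2.loc[:, ['airline_sentiment', 'text']].copy()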

# verify the columns
df.columns

Index(['airline_sentiment', 'text'], dtype='object')

In [6]:
# store only the sentiment column
senti = df['airline_sentiment']

# return positive sentiments
print(senti == 'positive')

1         True
3        False
4        False
5        False
6         True
         ...  
14633    False
14634    False
14635     True
14636    False
14638    False
Name: airline_sentiment, Length: 11541, dtype: bool


In [8]:
# replace all positive sentiments with 1 and all negative sentiments with 0
# map() returns a new Series, which avoids chained assignment on a slice of df
senti = senti.map({'positive': 1, 'negative': 0})

# assign numeric sentiments back to df
df['airline_sentiment'] = senti

# verify the data
display(df)

,airline_sentiment,text
1,1,@VirginAmerica plus you've added commercials t...
3,0,@VirginAmerica it's really aggressive to blast...
4,0,@VirginAmerica and it's a really big bad thing...
5,0,@VirginAmerica seriously would pay $30 a fligh...
6,1,"@VirginAmerica yes, nearly every time I fly VX..."
...,...,...
14633,0,@AmericanAir my flight was Cancelled Flightled...
14634,0,@AmericanAir right on cue with the delays👌
14635,1,@AmericanAir thank you we got on a different f...
14636,0,@AmericanAir leaving over 20 minutes Late Flig...


## Partition data

The data set is partitioned into a training set and a test set: roughly 2/3 of the observations (the first 7,698 rows) are used for training, and the remaining 1/3 (3,843 rows) are used for testing the trained model. The split is sequential, so the rows are not shuffled beforehand.

In [9]:
# store sentiments as a list
labels = list(df['airline_sentiment'])

# store the number of labels
m = len(labels)
print(m)

11541


In [10]:
# store Tweet text as a list
tweets = list(df['text'])

# partition data into training and test sets
# 2/3 of observations in training set
# 1/3 of observations in test set
train_labels = labels[:7698]
test_labels = labels[7698:]
train_tweets = tweets[:7698]
test_tweets = tweets[7698:]

# verify number of observations
print('Training data set length: ', len(train_labels))
print('Test data set length: ', len(test_labels))

Training data set length:  7698
Test data set length:  3843
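The split index 7698 above is hard-coded. As a minimal sketch, the split point could instead be derived from the desired training fraction. The helper `split_train_test` and its `shuffle` and `seed` parameters are hypothetical names introduced here for illustration and are not part of the original analysis. For 11,541 observations, an exact 2/3 split gives 7,694 training rows, slightly fewer than the 7,698 used above.

```python
import random

def split_train_test(items, train_fraction=2 / 3, shuffle=False, seed=0):
    """Split a list into (train, test) at round(len(items) * train_fraction)."""
    items = list(items)
    if shuffle:
        # shuffle a copy so the caller's list is left untouched
        random.Random(seed).shuffle(items)
    split_index = round(len(items) * train_fraction)
    return items[:split_index], items[split_index:]

# toy (label, tweet) pairs standing in for the real data
pairs = [(i % 2, f"tweet {i}") for i in range(9)]
train_pairs, test_pairs = split_train_test(pairs, shuffle=True, seed=42)
print(len(train_pairs), len(test_pairs))
```

Keeping each label and its Tweet together in one tuple means a shuffle cannot misalign them. Whether shuffling is appropriate here is a judgment call, since the original analysis keeps the file order.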


In [11]:
import numpy as np

# convert labels into NumPy arrays for neural network processing in TensorFlow
train_labels_final = np.array(train_labels)
test_labels_final = np.array(test_labels)

## Deep learning model

The TensorFlow package will be used to train a neural network model. Since neural networks require numeric input, each Tweet is converted from a string to a vector of integers by building a word index from the training Tweets. Each word in a Tweet is replaced by an integer giving that word's position in the word index. Every vector contains exactly 120 numbers: vectors longer than 120 are truncated by dropping the values past position 120, and vectors with fewer than 120 numbers are padded with zeros at the beginning (before the word index values).
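The padding and truncation rule described above can be sketched in plain Python. This is only an illustration of the behaviour (zeros added at the front, excess values dropped from the end); the helper name `pad_or_truncate` is hypothetical, and the notebook itself relies on Keras `pad_sequences` for this step.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly maxlen items: pre-padded or post-truncated."""
    if len(seq) >= maxlen:
        # 'post' truncation: keep the first maxlen entries, drop the rest
        return list(seq[:maxlen])
    # 'pre' padding: zeros go in front of the word index values
    return [pad_value] * (maxlen - len(seq)) + list(seq)

print(pad_or_truncate([4, 16, 3], maxlen=5))                  # [0, 0, 4, 16, 3]
print(pad_or_truncate([4, 16, 3, 743, 70, 107], maxlen=5))    # [4, 16, 3, 743, 70]
```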

In [12]:
# declare deep learning hyperparameters

# vocabulary of 10,000 words
vocab_size = 10000

# Tweet/vector lengths of 120 words
max_length = 120

# truncate words at the end of the Tweet if the Tweet has more than 120 words
trunc_type = 'post'

# specify the Out of Vocab token value
oov_tok = '<OOV>'

In [13]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# initialize tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)

# fit tokenizer to the training Tweets
tokenizer.fit_on_texts(train_tweets)

# create word index from fitted tokenizer
word_index = tokenizer.word_index

# convert training Tweet strings into vectors
# tokens are represented as integers, based on token's index value in word_index
sequences = tokenizer.texts_to_sequences(train_tweets)

# pad each vector with zeros (0), so that all vector lengths = 120
# zeros are pre-padded by default
train_padded = pad_sequences(
    sequences,
    maxlen=max_length,
    truncating=trunc_type
)

# convert test Tweet strings into vectors
test_sequences = tokenizer.texts_to_sequences(test_tweets)

# pad the test vectors
test_padded = pad_sequences(
    test_sequences,
    maxlen=max_length,
    truncating=trunc_type
)

In [16]:
# create word index with the index as key, word as value
# needed for converting vectors back into strings
reverse_word_index = {index: word for word, index in word_index.items()}

# verify the original and reversed word_indexes
print('Word index: ', list(word_index.items())[:4])
print('Reverse word index: ', list(reverse_word_index.items())[:4])

Word index:  [('<OOV>', 1), ('to', 2), ('the', 3), ('united', 4)]
Reverse word index:  [(1, '<OOV>'), (2, 'to'), (3, 'the'), (4, 'united')]
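To illustrate how a word index like the one above comes about, here is a simplified sketch that ranks words by frequency and reserves index 1 for the out-of-vocabulary token. It is not the Keras implementation: it only lowercases and splits on whitespace, with no punctuation filtering, and the names `build_word_index` and `to_sequence` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map each word to an integer rank; the most frequent word gets index 2."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    index = {oov_token: 1}  # index 1 is reserved for unseen words
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def to_sequence(text, index, oov_token="<OOV>"):
    """Replace each word with its index, falling back to the OOV index."""
    oov_index = index[oov_token]
    return [index.get(word, oov_index) for word in text.lower().split()]

toy_texts = ["the flight was late", "the crew was great", "late again"]
toy_index = build_word_index(toy_texts)
print(toy_index)
print(to_sequence("the pilot was late", toy_index))  # 'pilot' is unseen, so it maps to 1
```

Index 0 is deliberately left unused so that it can serve as the padding value.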


In [18]:
# create function which converts vectors into strings
def decode_tweet(seq):

    # index 0 (padding) is not in the word index, so it is shown as a question mark
    return ' '.join([reverse_word_index.get(i, '?') for i in seq])

# print a sample Tweet and its padded version
print('Padded vector: ', train_padded[1234])
print('Decoded Tweet: ', decode_tweet(train_padded[1234]))

Padded vector:  [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    4
   16    3  743   70  107 1968  313   77  120 2814  284   16  118 3630
  910 1518    8 1796    9   80   32  138]
Decoded Tweet:  ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? united in the future when delay causes 15 hour wait slept night in airport ensuring seating choice for replacement flight would be good


## Neural network configuration

The neural network is built using the TensorFlow package. The following layers are used within the neural network:
1. Embedding layer, which maps each word index to a 16-dimensional vector learned during training, so that words used in similar contexts end up numerically closer
2. Flattening layer, to reshape the 120 x 16 embedding output of each Tweet into a single vector of 1,920 values
3. Hidden layer, with 6 nodes and ReLU as the activation function (output is any non-negative number)
4. Output layer, with 1 node to represent the probability that the Tweet has a positive sentiment. The sigmoid activation function is chosen so that the output ranges from 0 to 1
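To make the shapes flowing through these four layers concrete, the following NumPy sketch runs a forward pass with randomly initialised weights. It is an illustration only: the weights are not trained, the variable names are assumptions made for this example, and the actual model is built with Keras in the next cell.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim, max_length, hidden_units = 10000, 16, 120, 6

# random weights stand in for the values that training would learn
embedding = rng.normal(size=(vocab_size, embedding_dim))
w_hidden = rng.normal(size=(max_length * embedding_dim, hidden_units)) * 0.01
b_hidden = np.zeros(hidden_units)
w_out = rng.normal(size=(hidden_units, 1))
b_out = np.zeros(1)

def forward(padded_batch):
    embedded = embedding[padded_batch]                     # (batch, 120, 16)
    flat = embedded.reshape(len(padded_batch), -1)         # (batch, 1920)
    hidden = np.maximum(0.0, flat @ w_hidden + b_hidden)   # ReLU, (batch, 6)
    logits = hidden @ w_out + b_out                        # (batch, 1)
    return 1.0 / (1.0 + np.exp(-logits))                   # sigmoid, values in (0, 1)

# two fake padded Tweets made of random word indices
batch = rng.integers(0, vocab_size, size=(2, max_length))
probs = forward(batch)
print(probs.shape)
```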

In [20]:
import tensorflow as tf

# define embedding dimension
# each word index is mapped to a learned vector of this length
embedding_dimension = 16

# initialize neural network
neural_net = tf.keras.Sequential([
      
      # embedding layer as first layer
      tf.keras.layers.Embedding(
          vocab_size,
          embedding_dimension,
          input_length=max_length
      ),

      # flatten layer to flatten each vector into 1 dimension
      tf.keras.layers.Flatten(),

      # hidden layer with 6 nodes and ReLU as activation function
      # output is any non-negative real number
      tf.keras.layers.Dense(6, activation='relu'),

      # output layer with 1 node to represent the probability of the Tweet being positive sentiment
      # Sigmoid activation function so probability ranges from 0 to 1
      tf.keras.layers.Dense(1, activation='sigmoid')
])

# declare optimizer, loss function, and validation metric
neural_net.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# summarize the neural network
neural_net.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1920)              0         
_________________________________________________________________
dense (Dense)                (None, 6)                 11526     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
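The parameter counts in the summary above can be reproduced by hand from the hyperparameters. The short calculation below is a sketch of that arithmetic; the variable names are chosen for this example.

```python
vocab_size, embedding_dim, max_length, hidden_units = 10000, 16, 120, 6

# embedding: one 16-value vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# flatten has no weights; it only reshapes 120 x 16 into 1920 values
flatten_width = max_length * embedding_dim

# dense layers: one weight per input-output pair, plus one bias per node
hidden_params = flatten_width * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + hidden_params + output_params
print(embedding_params, hidden_params, output_params, total_params)
# 160000 11526 7 171533
```

Most of the parameters sit in the embedding layer, which is typical when the vocabulary is large relative to the rest of the network.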


## Model training

The neural network is trained for 10 epochs. Classification accuracy is the metric used to monitor model performance, and the test set also serves as the validation set during training.
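For reference, the two quantities tracked during training, the binary cross-entropy loss and the accuracy, can be sketched in plain Python as follows. The 0.5 decision threshold and the small clipping constant `eps` are assumptions made for this example, and the toy labels and probabilities are invented for illustration.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all observations."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of observations whose thresholded prediction matches the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)

toy_labels = [1, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.4, 0.1]
print('loss:', round(binary_crossentropy(toy_labels, toy_probs), 4))
print('accuracy:', accuracy(toy_labels, toy_probs))
```

The third toy observation is a positive Tweet predicted at 0.4, so it counts as a miss for accuracy and contributes the largest share of the loss.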

In [21]:
# train the neural network for 10 epochs
# 1 epoch = 1 full pass over the training data (forward pass and backpropagation for every batch)
num_epochs = 10

# train the model with padded training Tweets
neural_net.fit(
    train_padded,
    train_labels_final,
    epochs=num_epochs,

    # validate the model using the test Tweets
    validation_data=(test_padded, test_labels_final)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f6cd405ed10>

## Model evaluation

After 10 epochs of training, the model classified approximately 99% of the training observations correctly, compared with about 90% of the test observations. The gap between the two figures suggests that the model overfits the training data to some degree.

The next step is to predict the sentiment of new Tweets using the trained model.

In [26]:
# test neural network using new Tweets
my_tweets = [
             '@AmericanAirlines I greatly enjoyed the flight. The beverage selection was excellent.', 
             '@AmericanAirlines This flight was a waste of money. It arrived late at the destination.',
]

# tokenize, convert to vector, then pad the Tweets
my_sequences = tokenizer.texts_to_sequences(my_tweets)

my_padded = pad_sequences(
    my_sequences,
    maxlen=max_length,
    truncating=trunc_type
)

# predict the sentiment of the new Tweets using the trained model
predictions = neural_net.predict(my_padded)

print('Probability of positive sentiment for first Tweet: ', predictions[0] * 100, '%')
print('Probability of positive sentiment for second Tweet: ', predictions[1] * 100, '%')

Probability of positive sentiment for first Tweet:  [99.90274] %
Probability of positive sentiment for second Tweet:  [0.00546293] %
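The predicted probabilities can be turned into class labels by applying a decision threshold. The sketch below assumes a threshold of 0.5, which is a common default rather than something stated in the analysis, and the helper name `to_label` is hypothetical. The two probabilities are the values printed above, written as fractions instead of percentages.

```python
def to_label(prob_positive, threshold=0.5):
    """Return 'positive' if the probability exceeds the threshold, else 'negative'."""
    return 'positive' if prob_positive > threshold else 'negative'

# probabilities from the printed output above, as fractions
for prob in (0.9990274, 0.0000546293):
    print(prob, '->', to_label(prob))
```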
