# Sentiment Analysis

This notebook performs sentiment analysis on a dataset of tweets, using different neural networks built with TensorFlow / Keras.

In [1]:
#imports
import pandas as pd
from collections import defaultdict
import collections
from sklearn.model_selection import KFold

In [2]:
#tensorflow imports
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow.keras.layers as layers

# Data

Load the data in with pandas. The important columns are:
- `text` - the tweet body
- `airline_sentiment` - the label. Can be `positive`, `negative`, or `neutral`.
- `airline_sentiment_confidence` - value 0-1 giving the confidence of the label.

In [3]:
def load_full_data(filename="Tweets.csv"):
    df = pd.read_csv(filename)
    df = df[['text', 'airline_sentiment', 'airline_sentiment_confidence']]
    return df

def clean_text(text):
    # named `text` rather than `input` to avoid shadowing the built-in input()
    words = text.split()
    words = [w for w in words if not w.startswith('@')]  # removes any @mentions
    return ' '.join(words)

def transform_sentiment(s):
    smap = {'neutral':0, 'positive':1, 'negative':2}
    if s in smap:
        return smap[s]
    # unknown labels are reported and fall back to neutral (0)
    print(f"Unknown sentiment {s}")
    return 0


In [4]:
df = load_full_data()
df.text = df.text.apply(clean_text)
df.airline_sentiment = df.airline_sentiment.apply(transform_sentiment)

Let's check the distribution across the different labels.

In [5]:
df.groupby(df['airline_sentiment']).count()

airline_sentiment   text   airline_sentiment_confidence
0                   3099   3099
1                   2363   2363
2                   9178   9178


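`airline_sentiment==2` (i.e. negative) is represented far more often than `0` (neutral) or `1` (positive).

Let's balance the classes by keeping the same number of tweets for each label: the size of the smallest class (`2,363`), taking the tweets with the highest label confidence first.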

In [6]:
num_per_label = min(df.groupby(['airline_sentiment']).count().text)

neu = df[df['airline_sentiment']==0].sort_values(by=['airline_sentiment_confidence'], ascending=False).head(num_per_label)
pos = df[df['airline_sentiment']==1].sort_values(by=['airline_sentiment_confidence'], ascending=False).head(num_per_label)
neg = df[df['airline_sentiment']==2].sort_values(by=['airline_sentiment_confidence'], ascending=False).head(num_per_label)

df = pd.concat([pos, neu, neg])
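
To make the balancing step concrete, here is a minimal pure-Python sketch of the same idea on toy data. The helper name `balance_by_confidence` and the toy rows are made up for illustration; the notebook itself does this with pandas.

```python
from collections import defaultdict

def balance_by_confidence(rows):
    """Keep the same number of rows per label, preferring high confidence.

    `rows` is a list of (text, label, confidence) tuples (hypothetical layout).
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    # The smallest class decides how many rows every class keeps.
    keep = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        # Highest-confidence rows first, then cut to the common size.
        ranked = sorted(by_label[label], key=lambda r: r[2], reverse=True)
        balanced.extend(ranked[:keep])
    return balanced

# Toy data: three negative (2), one positive (1), two neutral (0) rows.
toy_rows = [
    ("late again", 2, 1.0),
    ("lost my bag", 2, 0.7),
    ("still waiting", 2, 0.9),
    ("great crew", 1, 0.8),
    ("what time is boarding", 0, 0.6),
    ("flight 12 to JFK", 0, 1.0),
]
balanced = balance_by_confidence(toy_rows)
for row in balanced:
    print(row)
```

One trade-off worth keeping in mind: picking only the most confident rows may bias the larger classes toward "easy" examples, whereas a random sample per class would not.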

# Word Embedding

Our input is a string of words, but the model needs numeric input, so we need to create a word embedding of some sort.

Here we'll use Keras' built-in Tokenizer to map each word to an int, and then an Embedding layer to create the vectors.

First, let's see how big we should make our Tokenizer dictionary.

In [7]:
words = defaultdict(int)
for t in df.text:
    for w in t.split():
        words[w]+=1
print(f"Num words: {len(words)}")
print(f"Num words that appear more than once: {len([w for w in words if words[w]>1])}")

Num words: 17646
Num words that appear more than once: 5874


There are only `5,874` words that appear more than once. Let's use a bit more than that as our dictionary size.

In [8]:
dict_size = 6000

Now we can create our Tokenizer.

In [9]:
tk = Tokenizer(num_words=dict_size)
tk.fit_on_texts(df.text)
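
As a rough illustration of what the Tokenizer does with `num_words`, here is a simplified pure-Python sketch: words are ranked by frequency (rank 1 is the most common, 0 is left free for padding), and only words whose rank is below `num_words` survive when texts are converted to sequences. The function names are hypothetical, and the real Tokenizer also strips punctuation, which this sketch does not.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to its frequency rank, starting at 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: rank for rank, (w, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their rank, dropping words ranked num_words or higher."""
    return [
        [word_index[w] for w in t.lower().split()
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]

toy_texts = ["the flight was late", "the flight was fine", "the crew"]
word_index = build_word_index(toy_texts)
sequences = to_sequences(toy_texts, word_index, num_words=4)
print(word_index)
print(sequences)
```

With `num_words=4`, only the three most frequent words are kept, so the rare words simply disappear from the sequences.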

You can see below that many of the most common words are not very informative (e.g. 'to', 'the', 'a'). These are called "stop words," and using a stop word dictionary (e.g. from nltk) you can remove these. I didn't here, but you definitely can.

In [10]:
collections.Counter(tk.word_counts).most_common(10)

[('to', 3858),
 ('the', 2691),
 ('i', 2380),
 ('you', 2029),
 ('a', 1934),
 ('for', 1907),
 ('on', 1696),
 ('flight', 1681),
 ('and', 1548),
 ('my', 1390)]
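
If you did want to drop stop words, a minimal sketch could look like the following. The `STOP_WORDS` set here is hand-picked from the list above purely for illustration; a real list (e.g. from nltk) would be much longer.

```python
# Hand-picked stop words for illustration only (assumption, not a standard list).
STOP_WORDS = {"to", "the", "i", "you", "a", "for", "on", "and", "my"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated word found in stop_words (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

cleaned = remove_stop_words("Thanks for the upgrade on my flight")
print(cleaned)
```

For sentiment in particular, it may be worth checking a stop word list before using it, since some lists include negations such as "not" that can flip the meaning of a tweet.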

# Input Preparation

Now let's prepare our inputs for going into the model.

For text, we need to use our Tokenizer to turn the words into sequences, then pad the sequences to the same length.

In [11]:
X = tk.texts_to_sequences(df.text)
X = pad_sequences(X)
INPUT_LEN = len(X[0])
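
To show what padding does, here is a small pure-Python sketch of the default behavior of `pad_sequences`: zeros are added at the front, and sequences that are too long are cut from the front. The helper name `pad_left` is hypothetical.

```python
def pad_left(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            s = s[len(s) - maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(s)) + s)
    return padded

toy_sequences = [[5, 2], [7], [3, 1, 4]]
padded = pad_left(toy_sequences)
padded_short = pad_left(toy_sequences, maxlen=2)
print(padded)
print(padded_short)
```

Because every row ends up the same length, the length of the first row (as in `INPUT_LEN` above) is the length of all of them.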

For the labels, we need to convert from an int (0-2) to a one-hot encoding (`[1,0,0]` or `[0,1,0]` or `[0,0,1]`).

In [12]:
Y = tf.keras.utils.to_categorical(df.airline_sentiment)
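
The one-hot conversion can be sketched in a few lines of plain Python. The helper `one_hot` is hypothetical and only mirrors the idea; it is not the Keras implementation.

```python
def one_hot(labels, num_classes=None):
    """Turn integer labels into one-hot rows, e.g. 1 -> [0.0, 1.0, 0.0]."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in labels]

encoded = one_hot([0, 1, 2, 1])
# Taking the position of the 1.0 recovers the original label.
decoded = [row.index(1.0) for row in encoded]
print(encoded)
print(decoded)
```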

# Model

Here we actually define the model.

The input shape to the Embedding layer is `[None, 33]`, where `None` stands for the batch size (i.e. how many inputs you give at once) and `33` is `INPUT_LEN`, the padded sequence length. Each input value is an int between `0` and `tk.num_words-1`. The Embedding layer translates each int into a size 64 vector.

Then the word vectors are fed, one per time step, into an LSTM layer with 64 units.

The final output of the LSTM layer is input into a densely connected layer of output size 32, which goes to another Dense layer of output size 3. Neither Dense layer specifies an activation, so both are linear.

The 3 outputs are run through a softmax layer to get the probability of each answer (neutral vs positive vs negative).
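
As a small worked example of the softmax step, the sketch below turns three made-up scores into probabilities. Subtracting the maximum first is a common trick for numerical stability and does not change the result.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for (neutral, positive, negative).
probs = softmax([1.0, 2.0, 0.5])
predicted = probs.index(max(probs))
print(probs, predicted)
```

The largest score always gets the largest probability, so the predicted class is the same before and after softmax; what softmax adds is a normalized confidence.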

Since this is a multi-class classification problem, I use categorical cross-entropy. Adam optimizer is pretty standard, and categorical accuracy is a way to test how often predictions match the one-hot labels.
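
To make the loss and metric concrete, here is a pure-Python sketch of categorical cross-entropy for a single example and of categorical accuracy over a batch. The probabilities are made up, and the `eps` clamp is an assumption added to avoid `log(0)`.

```python
import math

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def categorical_accuracy(Y_true, Y_pred):
    """Fraction of rows where the predicted class matches the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(Y_true, Y_pred))
    return hits / len(Y_true)

Y_true = [[0, 1, 0], [0, 0, 1]]
Y_pred = [[0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]
losses = [categorical_crossentropy(t, p) for t, p in zip(Y_true, Y_pred)]
accuracy = categorical_accuracy(Y_true, Y_pred)
print(losses, accuracy)
```

The first prediction is right and fairly confident, so its loss is small; the second puts only 0.2 on the true class, so its loss is much larger.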

In [18]:
def gen_model():
    model = Sequential()

    model.add(layers.Embedding(input_dim=tk.num_words, output_dim=64, input_length=INPUT_LEN))
    model.add(layers.LSTM(64))
    model.add(layers.Dense(32))
    model.add(layers.Dense(3))
    model.add(layers.Softmax())

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["categorical_accuracy"])

    return model

In [19]:
gen_model().summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 33, 64)            384000    
_________________________________________________________________
lstm_6 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense_12 (Dense)             (None, 32)                2080      
_________________________________________________________________
dense_13 (Dense)             (None, 3)                 99        
_________________________________________________________________
softmax_6 (Softmax)          (None, 3)                 0         
Total params: 419,203
Trainable params: 419,203
Non-trainable params: 0
_________________________________________________________________
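
The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias.

```python
vocab_size = 6000   # dict_size, i.e. tk.num_words
embed_dim = 64      # Embedding output_dim
lstm_units = 64

# Embedding: one 64-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights plus a bias per unit.
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus one bias per output.
dense_32_params = lstm_units * 32 + 32
dense_3_params = 32 * 3 + 3

total_params = embedding_params + lstm_params + dense_32_params + dense_3_params
print(embedding_params, lstm_params, dense_32_params, dense_3_params, total_params)
```

Most of the parameters sit in the Embedding layer, which is one reason a small dataset may struggle to learn good word vectors from scratch.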


In [21]:
all_results = []
models = []
kf = KFold(n_splits=5, shuffle=True) #make sure to shuffle, since I ordered based on label earlier
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    Y_train, Y_test = Y[train_index], Y[test_index]
    model = gen_model()
    # stop training as soon as the validation loss stops improving
    early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)
    model.fit(x=X_train, y=Y_train, batch_size=128, validation_split=0.1, epochs=25, callbacks=[early_stop])
    print("Test:")
    results = model.evaluate(x=X_test, y=Y_test, batch_size=128)
    print("------------------------------")
    all_results.append(results)
    models.append(model)
print("Average categorical accuracy: %.4f" % (sum([r[1] for r in all_results])/len(all_results)))

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Test:
------------------------------
Epoch 1/25
Epoch 2/25
Epoch 3/25
Test:
------------------------------
Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Test:
------------------------------
Epoch 1/25
Epoch 2/25
Epoch 3/25
Test:
------------------------------
Epoch 1/25
Epoch 2/25
Test:
------------------------------
Average categorical accuracy: 0.7677
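
For intuition about the cross-validation loop, here is a simplified pure-Python sketch of shuffled k-fold splitting. It is not scikit-learn's implementation (the fold assignment differs), and `kfold_indices` is a hypothetical name; the point is only that every sample lands in exactly one test fold.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    # Shuffle first: the data above is ordered by label.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(kfold_indices(n_samples=10, n_splits=5))
for train, test in splits:
    print(sorted(test), len(train))
```

Without the shuffle, a label-ordered dataset would produce test folds dominated by a single class, which is why `shuffle=True` matters in the cell above.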


# Poking Around

Let's take the first model from above and see some examples that it got right and wrong. Note that the loop below runs over all of `X`, so it includes tweets this model was trained on.

In [22]:
model = models[0]
print("Test loss: %.4f - Test Accuracy: %.4f" % (all_results[0][0], all_results[0][1]))

Test loss: 0.6691 - Test Accuracy: 0.7757


In [23]:
printed = [0]*9
def itos(i):
    arr = ['neutral', 'positive', 'negative']
    return arr[i]

for i in range(0, len(X), 128):
    probs = model.predict(x=X[i:i+128])
    for j in range(len(probs)):
        pred = probs[j].argmax()
        answer = Y[i+j].argmax()
        index = pred*3 + answer
        if printed[index]==0:
            printed[index]=1
            if pred!=answer:
                print(f"Thought it was {itos(pred)}, but it was {itos(answer)}.")
            else:
                print(f"Got it right! {itos(answer)}")
            print(tk.sequences_to_texts([X[i+j],])) 
            print()

Got it right! positive
['thank you so much for stepping up your game and making my day after night of elevator music much appreciated']

Thought it was neutral, but it was positive.
["“ jetblue our fleet's on fleek http t co 3kvkd8yrxa” lol wow"]

Thought it was negative, but it was positive.
['thanks i actually made it my connection flight was delayed guess all delays are not a bad thing http t co xggcntco8m']

Got it right! neutral
['what said']

Thought it was positive, but it was neutral.
["i usually do but i didn't make the flight booking problems this time teach me yea i have that going for me at least haha"]

Thought it was negative, but it was neutral.
['why why how many people even know what that means lol']

Thought it was neutral, but it was negative.
["it's really aggressive to entertainment in your faces amp they have little"]

Got it right! negative
['this is the worst customer service i have ever had rebooked to tues but seat available on mon wtf contact me']

Thought it

Some of these are quite short, and so very difficult to classify. But you would think that we could correctly classify `u guys suck` as negative... There is probably not enough data to learn very good word embeddings. Using a pre-trained word embedding would probably be a lot more powerful.
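
As a rough sketch of how pre-trained vectors could be lined up with the Tokenizer's indices, the helper below builds an embedding matrix in plain Python. Everything here is illustrative: `build_embedding_matrix` is a hypothetical name, and the two-dimensional `pretrained` vectors are invented toy values, not real embeddings.

```python
def build_embedding_matrix(word_index, pretrained, num_words, dim):
    """Row i holds the pre-trained vector for the word with index i.

    Words without a pre-trained vector, and row 0 (padding), stay all zeros.
    """
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, idx in word_index.items():
        if idx < num_words and word in pretrained:
            matrix[idx] = list(pretrained[word])
    return matrix

# Toy stand-ins (assumptions): a tiny word index and made-up 2-d vectors.
toy_word_index = {"u": 1, "guys": 2, "suck": 3, "great": 4}
pretrained = {"suck": [-0.9, 0.1], "great": [0.8, 0.2]}

matrix = build_embedding_matrix(toy_word_index, pretrained, num_words=4, dim=2)
for row in matrix:
    print(row)
```

A matrix like this could then be used as the initial weights of the Embedding layer, for example through a constant initializer, optionally with the layer frozen so the small dataset does not overwrite the pre-trained vectors.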