# Twitter Airline Sentiment Analysis, Exploratory Data Analysis and Hugging Face Transformers

Dataset description: a record of tweets about US airlines. Along with other information, it contains the tweet ID, the sentiment of the tweet (neutral, negative, or positive), the reason for a negative tweet, the airline name, and the tweet text. Here it is posed as a binary classification problem by merging neutral and positive into one category.

In [5]:
#Call libraries
import numpy as np
import pandas as pd
#Import TensorFlow/Keras modules and the scikit-learn split helper
import tensorflow as tf
from sklearn.model_selection import train_test_split
#API to manipulate sequences of words
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import plot_model
#Layer types available for the model:
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Flatten

#Misc
import matplotlib.pyplot as plt
import time
import io


In [6]:
#Display multiple commands output from a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [35]:
#Define some constants:
max_vocabulary = 20000        # words
max_len_tweet = 500          # words

In [36]:
#reading the dataframe
data=pd.read_csv('E:/Work & Study/MBA/T5/FA/twitter data sentiment analysis/Tweets.csv')
data.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [37]:
#Shape:
data.shape

(14640, 15)

### Exploration

In [38]:
# separate the tweet text (features) from the sentiment labels
X = data['text']
y = data['airline_sentiment']

X.shape

(14640,)

In [39]:
from collections import Counter

Counter(y)

Counter({'neutral': 3099, 'positive': 2363, 'negative': 9178})

### Data pre-processing

The first step when building a neural network model is getting the data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

Here are the processing steps we'll take:

- Get rid of periods and extraneous punctuation.
- Remove web addresses, Twitter handles, and tokens made up only of digits.

First, let's remove all punctuation. Then we split the text back into individual tweets and into individual words.

In [40]:
punctuation = '!"#$%&\'()*+,-./:;<=>?[\\]^_`{|}~'

# get rid of punctuation
all_reviews = 'separator'.join(X)
all_reviews = all_reviews.lower()
all_text = ''.join([c for c in all_reviews if c not in punctuation])

# split by new lines and spaces
reviews_split = all_text.split('separator')
all_text = ' '.join(reviews_split)

# create a list of words
words = all_text.split()

Then, we remove web addresses, Twitter handles, and digit-only tokens.

In [41]:
# get rid of web address, twitter id, and digit
new_reviews = []
for review in reviews_split:
    review = review.split()
    new_text = []
    for word in review:
        if word[0] != '@' and 'http' not in word and not word.isdigit():
            new_text.append(word)
    new_reviews.append(new_text)
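
The two cleaning cells above can also be expressed as one function applied to a single tweet. The following is a minimal sketch, not the notebook's own code: `clean_tweet` and `PUNCT` are hypothetical names, and the punctuation set is assumed to be `string.punctuation` with `@` left in so that handles can still be detected.

```python
import string

# Keep '@' out of the strip set so handles can still be recognised afterwards,
# mirroring the punctuation string used in the notebook.
PUNCT = string.punctuation.replace('@', '')

def clean_tweet(text):
    """Lower-case, strip punctuation, then drop handles, links and pure numbers."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in PUNCT)
    return [w for w in text.split()
            if not w.startswith('@') and 'http' not in w and not w.isdigit()]

print(clean_tweet("@VirginAmerica What @dhepburn said."))   # ['what', 'said']
```

Working tweet by tweet avoids joining everything on the literal string `'separator'`, which would split a tweet that happens to contain that word.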

### Encoding the Tweets
The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then, we can convert each of our reviews into integers so they can be passed into the network.

In [42]:
#Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}

#use the dict to tokenize each cleaned review in new_reviews
#store the tokenized reviews in reviews_ints
reviews_ints = []
for review in new_reviews:
    reviews_ints.append([vocab_to_int[word] for word in review])

Let's print out the number of unique words in the vocabulary and the contents of the first, tokenized review.

In [43]:
# stats about vocabulary
print('Unique words: ', len((vocab_to_int)))

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  17213
Tokenized review: 
 [[57, 217]]
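
To make the frequency-ranked mapping concrete, here is a small sketch on a hypothetical three-tweet corpus (the `toy_*` names and the corpus are illustrative, not part of the dataset). It follows the same recipe as above: count words, sort by frequency, and number from 1 so that 0 stays free for padding.

```python
from collections import Counter

# Hypothetical three-tweet corpus, already cleaned and split into words.
toy_reviews = [['flight', 'was', 'late'],
               ['flight', 'was', 'great'],
               ['late', 'flight']]

toy_counts = Counter(word for review in toy_reviews for word in review)
# Most frequent word first; index 0 is left free for padding.
toy_vocab = sorted(toy_counts, key=toy_counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(toy_vocab, 1)}

toy_ints = [[toy_vocab_to_int[w] for w in review] for review in toy_reviews]
print(toy_vocab_to_int)   # {'flight': 1, 'was': 2, 'late': 3, 'great': 4}
print(toy_ints)           # [[1, 2, 3], [1, 2, 4], [3, 1]]
```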


In [44]:
X[8224]
reviews_ints[8224]

'@JetBlue yes thankfully! Catering just got here and now they are loading, but very frustrated. I was supposed to be there by 10-10:30'

[167,
 2530,
 1165,
 41,
 92,
 141,
 10,
 39,
 54,
 35,
 2580,
 31,
 151,
 486,
 3,
 23,
 390,
 1,
 32,
 71,
 102]

### Encoding the labels
As mentioned before, our goal is to identify whether a tweet is negative or non-negative (positive or neutral). Our labels are "positive", "negative", or "neutral". To use these labels in our network, we need to convert them to 0 (negative) and 1 (non-negative).

In [45]:
# 1=positive, 1=neutral, 0=negative label conversion
encoded_labels = []
for label in y:
    if label == 'neutral':
        encoded_labels.append(1)
    elif label == 'negative':
        encoded_labels.append(0)
    else:
        encoded_labels.append(1)

encoded_labels = np.asarray(encoded_labels)

In [46]:
encoded_labels

array([1, 1, 1, ..., 1, 0, 1])
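
The same conversion can be written as a dictionary lookup, which also makes it easy to see the class balance after merging. This is a sketch only: `label_map` and `binary_counts` are hypothetical names, and the counts are copied from the `Counter(y)` output shown earlier.

```python
from collections import Counter

# Class counts as reported by Counter(y) earlier in the notebook.
label_counts = Counter({'neutral': 3099, 'positive': 2363, 'negative': 9178})

# Explicit mapping: negative -> 0, everything else -> 1.
label_map = {'negative': 0, 'neutral': 1, 'positive': 1}

# Labels of the first five rows, as shown by data.head().
first_five = ['neutral', 'positive', 'neutral', 'negative', 'negative']
print([label_map[label] for label in first_five])   # [1, 1, 1, 0, 0]

binary_counts = Counter()
for label, n in label_counts.items():
    binary_counts[label_map[label]] += n
print(dict(binary_counts))                           # {1: 5462, 0: 9178}
```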

In [47]:
#Check max and min length of reviews
maxLen = 0         # Start with a low number
minLen = 200       # Start with a high number
for i in range(len(reviews_ints)):
    if len(reviews_ints[i]) > maxLen:
        maxLen = len(reviews_ints[i])
    if len(reviews_ints[i]) < minLen :
        minLen = len(reviews_ints[i])

maxLen

32

### Process data
We want to pad all sequences to a fixed length of 30 tokens (the `maxlen` argument below). Tweets longer than that are truncated, and shorter ones are padded with zeros at the front.

In [48]:
#Pad X sequences
#And also make each inner list as one row:

feature = sequence.pad_sequences(
                                 reviews_ints,   # An array of lists where each inner
                                            # list is a sequence, Or,
                                            # A list of lists with each
                                            #  list being a sequence
                                 maxlen = 30,   # Default is None (pad to the longest sequence)
                                 padding = 'pre'   # option: 'post'
                                 )
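
The effect of `padding = 'pre'` together with the default front truncation can be illustrated without Keras. The sketch below uses a hypothetical pure-Python helper, `pad_pre`, that imitates this behaviour on two toy sequences; it is not the library implementation.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones."""
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]                        # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy = [[57, 217], [1, 2, 3, 4, 5, 6]]
print(pad_pre(toy, maxlen=4))   # [[0, 0, 57, 217], [3, 4, 5, 6]]
```

Since the longest cleaned tweet has 32 tokens and `maxlen` is 30, the longest tweets lose their first two tokens under this scheme.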

In [49]:
#Look at first twenty rows
#and first ten columns:

feature[:20,:10]

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0, 430],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,  29,  14, 557,   4],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,  44

### Training, validation, and test
With our data in nice shape, we'll split it into training and test sets. The validation set is carved out of the training data later, through `validation_split` in `model.fit()`.

In [50]:
X_train, X_test, y_train, y_test = train_test_split(
    feature, encoded_labels, test_size=0.2, random_state=0)
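
As a rough guide to how many tweets end up in each subset, the arithmetic can be sketched as below. The rounding rules are assumptions (test share rounded up, validation split index rounded down), and the variable names are illustrative.

```python
import math

n_total = 14640          # rows in Tweets.csv (data.shape[0])
test_pct = 20            # test_size = 0.2
val_pct = 20             # validation_split = 0.2 passed to model.fit()

# Assumption: the test share is rounded up.
n_test = math.ceil(n_total * test_pct / 100)
n_train_full = n_total - n_test

# Assumption: the split index is floored and the trailing samples
# of the training data are used for validation.
n_fit = (n_train_full * (100 - val_pct)) // 100
n_val = n_train_full - n_fit

print(n_test, n_fit, n_val)
```

Under these assumptions roughly 64% of the tweets are used for weight updates, 16% for validation, and 20% for the final test.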

In [51]:
X_train[0:5]
print("\n\n------------\n\n")
y_train[:4]  

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,  167, 2530,
        1165,   41,   92,  141,   10,   39,   54,   35, 2580,   31,  151,
         486,    3,   23,  390,    1,   32,   71,  102],
       [   0,    0,    0,    0,    0,    0,    0,    0,  167,    3,   22,
          62,  644, 5074,   58,   20,   76, 8032,   89,  122,  569,    9,
          94,  101,   10,   11,  150,   47,   25,  350],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,  455,   10,    3,   22,  181,  146, 2272,   10,   50,
         231,   15,  547,  305,  593,   25,   28,  142],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,   69,   43,
           3,   72,   24,   22,   11,  142,   54,  568, 8594,  255,  644,
          72,   27,    2, 8595,   56,   96,  240, 1508],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,  174,  261, 



------------




array([1, 0, 0, 0])

### Design model

In [52]:
#Delete any earlier model 
if 'model' in locals():
  del model

#Start with a blank template:
model = Sequential() 

#Add an embedding layer:
model.add(Embedding(
                    max_vocabulary,            # Decides number of input neurons
                    32,                        # Decides number of neurons in hidden layer
                    input_length= 30) # (optional) Decides how many groups of OHEs
                                                  # are input at a time (or in sequence).
                                                  # It also decides how many times
                                                  #  RNN should loop around
                                                  #    If omitted, decided automatically
                                                  #     during 'model.fit()' by considering
                                                  #       x_train.shape[1]
                  
          )

In [53]:
# It is instructive to see number of parameters
#  in the summary. This tells us about the Embedding
#   layer as being two layered network with no of neurons
#    as max_vocabulary and output (hidden) layer with 32 neurons
#     Note: Hidden layer has no activation function
#            and no bias parameter:

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 30, 32)            640000    
                                                                 
Total params: 640,000
Trainable params: 640,000
Non-trainable params: 0
_________________________________________________________________


In [54]:
# Conceptually there is one RNN cell per timestep, i.e. as many cells
#     as the sequence length (30 here, the 'maxlen' used for padding).
#     In practice we add a single layer that loops internally. The
#     weights, and hence the SimpleRNN parameters, stay the same from one
#     'timestep' to the next. You can verify this by changing the
#     sequence length and seeing that the number of parameters in the
#     model summary after adding the following layer does not change.

model.add(
           SimpleRNN
                    (
                      32,                      # Neurons at the output
                      return_sequences = False # Return only the last timestep's output;
                                               # set True to return every timestep
                    )
          )   # Output
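
The weight sharing described in the comments can be sketched in NumPy. This is an illustrative forward pass with random weights and a `tanh` activation, not the Keras implementation; `simple_rnn_last_state` and the weight names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, units = 32, 32

# One set of weights, reused at every timestep.
W_x = rng.normal(size=(embed_dim, units))   # input-to-hidden
W_h = rng.normal(size=(units, units))       # hidden-to-hidden
b = np.zeros(units)

def simple_rnn_last_state(x_seq):
    """x_seq has shape (timesteps, embed_dim); returns the final hidden state."""
    h = np.zeros(units)
    for x_t in x_seq:                        # loop over timesteps
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

n_params = W_x.size + W_h.size + b.size
for timesteps in (10, 30):
    h_last = simple_rnn_last_state(rng.normal(size=(timesteps, embed_dim)))
    print(timesteps, h_last.shape, n_params)
```

The parameter count stays the same for both sequence lengths, because the loop reuses `W_x`, `W_h`, and `b` at every step.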


In [55]:

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 30, 32)            640000    
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 32)                2080      
                                                                 
Total params: 642,080
Trainable params: 642,080
Non-trainable params: 0
_________________________________________________________________


In [56]:
model.add(Flatten())

In [57]:
#Add classification layer:
model.add(Dense(1, activation = 'sigmoid'))
model.summary()


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 30, 32)            640000    
                                                                 
 simple_rnn_1 (SimpleRNN)    (None, 32)                2080      
                                                                 
 flatten_1 (Flatten)         (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 33        
                                                                 
Total params: 642,113
Trainable params: 642,113
Non-trainable params: 0
_________________________________________________________________
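
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual layouts: a lookup table without bias for the embedding, input plus recurrent weights plus bias for the RNN, and weights plus bias for the dense layer.

```python
vocab_size, embed_dim, rnn_units = 20000, 32, 32

embedding_params = vocab_size * embed_dim                 # lookup table, no bias
rnn_params = rnn_units * (embed_dim + rnn_units + 1)      # input, recurrent, bias
dense_params = rnn_units * 1 + 1                          # weights + bias for one output

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 640000 2080 33 642113
```

Almost all of the parameters sit in the embedding table, which is sized by `max_vocabulary` rather than by the 17,213 words actually observed.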


In [58]:
#Plot model
tf.keras.utils.plot_model(
                          model,
                          show_shapes=True,
                          show_layer_names=True
                          )

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


In [59]:
#Compile model
model.compile(
               loss = 'binary_crossentropy',
               optimizer = 'rmsprop',
               metrics = ['acc']
              )
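
For reference, the loss named in `compile` can be written out directly. This is a plain-Python sketch of mean binary cross-entropy; the clipping constant `eps` is an assumption, chosen only to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the batch."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)        # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A model that always outputs 0.5 scores log(2), about 0.693.
print(binary_crossentropy([1, 0], [0.5, 0.5]))
```

A loss below about 0.693 therefore means the model does better than predicting 0.5 for every tweet.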

In [60]:
#Tensorboard callback
#       We will use TensorBoard to visualize metrics 
#       including loss, accuracy etc. 
#       Create a tf.keras.callbacks.TensorBoard

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [61]:
epochs = 5
start = time.time()
history = model.fit(X_train,
                    y_train,
                    batch_size = 32,             # Number of samples per gradient update
                    validation_split = 0.2,      # Fraction of training data to be used as validation data
                    epochs = epochs,
                    shuffle = True,              # Shuffle training data before each epoch
                    callbacks=[tensorboard_callback],
                    verbose =1
                    )
end = time.time()
(end-start)/60

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


0.22717999219894408

In [62]:
#X_test comes from the already padded 'feature' array, so this call leaves it unchanged
X_test = sequence.pad_sequences(
                                 X_test,   # A list of lists where each inner
                                            # list is a sequence, Or,
                                            # An array of lists with each
                                            #  list being a sequence
                                 maxlen = 30,
                                 padding = 'pre'
                                 )

In [63]:
#Predict now
out = model.predict(X_test)
out[out > 0.5]  = 1
out[out <= 0.5] = 0
out



array([[0.],
       [1.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]], dtype=float32)

In [64]:
model.evaluate(X_test,y_test)
model.metrics_names



[0.5768351554870605, 0.7749316692352295]

['loss', 'acc']
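
To put the test accuracy in context, it helps to compare it with a trivial baseline that always predicts the majority class. The sketch below uses the class counts from the whole dataset, so the baseline for the actual test split may differ slightly.

```python
n_negative, n_total = 9178, 14640      # from Counter(y) and data.shape
test_accuracy = 0.7749316692352295     # second value returned by model.evaluate

majority_baseline = n_negative / n_total    # always predict "negative"
improvement = test_accuracy - majority_baseline

print(round(majority_baseline, 3), round(improvement, 3))
```

The SimpleRNN is about 15 percentage points above this baseline, a more informative figure than the raw accuracy alone.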

## Hugging Face Transformers

In [65]:
#!pip install transformers datasets

Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-win_amd64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [71]:
# Call libraries:
# Hugging Face related:
from transformers import pipeline
from datasets import Dataset

### Classification
Create an object to perform sentiment analysis

In [72]:
#     Instantiate 'pipeline' for sentiment-analysis
#     Once instantiated, 'classifier' object
#     can be used for sentiment analysis:

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceCla

In [73]:
#reading the dataframe
data1=pd.read_csv('E:/Work & Study/MBA/T5/FA/twitter data sentiment analysis/Tweets.csv')
data1.head()
#Transform pandas dataframe to hugging face dataset:

dataset = Dataset.from_pandas(data1)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [76]:
X1 = data1['text']
y1 = data1['airline_sentiment']

In [88]:
punctuation = '!"#$%&\'()*+,-./:;<=>?[\\]^_`{|}~'

# get rid of punctuation
all_reviews1 = 'separator'.join(X1)
all_reviews1 = all_reviews1.lower()
all_text1 = ''.join([c for c in all_reviews1 if c not in punctuation])

# split by new lines and spaces
reviews_split1 = all_text1.split('separator')
all_text1 = ' '.join(reviews_split1)

# create a list of words
words1 = all_text1.split()

# get rid of web address, twitter id, and digit
new_reviews1 = []
for review in reviews_split1:
    review = review.split()
    new_text1 = []
    for word in review:
        if word[0] != '@' and 'http' not in word and not word.isdigit():
            new_text1.append(word)
    new_reviews1.append(new_text1)
final_review=[]   
for review in new_reviews1:
    joined=""
    for word in review:
        joined=joined+" "+word
    final_review.append(joined)


In [104]:
#Look at first few rows:

final_review[:3]

[' what said',
 ' plus youve added commercials to the experience tacky',
 ' i didnt today must mean i need to take another trip']

In [90]:
#Take a sample of dataset
#     select(range(1000)) will select top 1000 rows.
#     Hence shuffle is a must to take a sample:

sample = dataset.shuffle(seed=42).select(range(1000))
sample.shape  # (1000, 15)

(1000, 15)

In [91]:
#Classify five of the reviews:

classifier(final_review[:5])

[{'label': 'POSITIVE', 'score': 0.9899877905845642},
 {'label': 'NEGATIVE', 'score': 0.9438981413841248},
 {'label': 'NEGATIVE', 'score': 0.9987396597862244},
 {'label': 'NEGATIVE', 'score': 0.9974498152732849},
 {'label': 'NEGATIVE', 'score': 0.9996324777603149}]
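
The pipeline's labels can be mapped to the same 0/1 encoding used for the RNN and compared with the dataset labels. The sketch below does this for the five tweets above, with the predictions and labels copied by hand from the outputs shown in this notebook.

```python
# Pipeline output for the first five tweets, copied from the cell above.
predictions = ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'NEGATIVE', 'NEGATIVE']
# Dataset labels for the same rows, from data1.head().
labels = ['neutral', 'positive', 'neutral', 'negative', 'negative']

pred_binary = [1 if p == 'POSITIVE' else 0 for p in predictions]
true_binary = [0 if label == 'negative' else 1 for label in labels]

agreement = sum(p == t for p, t in zip(pred_binary, true_binary)) / len(labels)
print(pred_binary, true_binary, agreement)   # [1, 0, 0, 0, 0] [1, 1, 1, 0, 0] 0.6
```

The default model was fine-tuned on SST-2, which has no neutral class, so neutral tweets are forced into one of the two labels. Five tweets are far too few to judge the model; the same comparison would need to run over a larger sample.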

## Question Answering

In [93]:
#Instantiate question-answer object:

question_answerer = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Some layers from the model checkpoint at distilbert-base-cased-distilled-squad were not used when initializing TFDistilBertForQuestionAnswering: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased-distilled-squad and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [105]:
#Let the object study context and answer question:

question_answerer(
    question="flights leaving dallas to which country",
    context= final_review[44]
)

{'score': 0.7554116249084473, 'start': 32, 'end': 39, 'answer': 'seattle'}
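
The `start` and `end` fields are character offsets into the context string, so the answer can be recovered by slicing. The sketch below uses a hypothetical context and result dictionary shaped like the output above, since the text of `final_review[44]` is not shown here.

```python
# Hypothetical context and result, shaped like the pipeline output above.
context = "my flight from dallas to seattle was delayed"
answer = "seattle"
start = context.find(answer)
result = {'start': start, 'end': start + len(answer), 'answer': answer}

# 'start' and 'end' index characters of the context, end exclusive.
recovered = context[result['start']:result['end']]
assert recovered == result['answer']
print(recovered)
```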