<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Loading-and-Preprocessing" data-toc-modified-id="Data-Loading-and-Preprocessing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Loading and Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Preprocess-the-text" data-toc-modified-id="Preprocess-the-text-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Preprocess the text</a></span></li></ul></li><li><span><a href="#Try-Different-NLP-Models-and-Compare-them" data-toc-modified-id="Try-Different-NLP-Models-and-Compare-them-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Try Different NLP Models and Compare them</a></span><ul class="toc-item"><li><span><a href="#LSTM" data-toc-modified-id="LSTM-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>LSTM</a></span></li><li><span><a href="#1-D-CNN" data-toc-modified-id="1-D-CNN-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>1-D CNN</a></span></li><li><span><a href="#Pre--Trained-BERT-Model" data-toc-modified-id="Pre--Trained-BERT-Model-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Pre- Trained BERT Model</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Due-to-unavailability-of-GPU,-I-didn't-run-for-more-epochs-but-if-we-run-the-pre-trained-BERT-model-we'll-get-the-validation-accuracy-around-~88-%" data-toc-modified-id="Due-to-unavailability-of-GPU,-I-didn't-run-for-more-epochs-but-if-we-run-the-pre-trained-BERT-model-we'll-get-the-validation-accuracy-around-~88-%-2.3.0.1"><span class="toc-item-num">2.3.0.1&nbsp;&nbsp;</span>Due to unavailability of GPU, I didn't run for more epochs but if we run the pre-trained BERT model we'll get the validation accuracy around ~88 %</a></span></li></ul></li></ul></li></ul></li></ul></div>

In [66]:
# Import required libraries
import re
import warnings

import nltk
import numpy as np
import pandas as pd
import tensorflow as tf
import transformers
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import (Conv1D, Dense, Dropout, Embedding,
                                     GlobalMaxPooling1D, LSTM)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

warnings.filterwarnings('ignore')

# Data Loading and Preprocessing

In [2]:
df=pd.read_csv('226482609976817_File.csv')

In [3]:
df.head(5)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [4]:
df.airline_sentiment.value_counts()

negative    9178
neutral     3099
positive    2363
Name: airline_sentiment, dtype: int64

In [5]:
# Let's look at a few tweets
for i,j in enumerate(df.text[:10]):
    print(i,j)

0 @VirginAmerica What @dhepburn said.
1 @VirginAmerica plus you've added commercials to the experience... tacky.
2 @VirginAmerica I didn't today... Must mean I need to take another trip!
3 @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse
4 @VirginAmerica and it's a really big bad thing about it
5 @VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.
it's really the only bad thing about flying VA
6 @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)
7 @VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP
8 @virginamerica Well, I didn't…but NOW I DO! :-D
9 @VirginAmerica it was amazing, and arrived an hour early. You're too good to me.


##### The tweets contain many special characters, URLs, punctuation marks, and unnecessary words. We need to clean the text before modelling.

## Preprocess the text

The following steps prepare the text:

Lowercasing: Convert all the text to lowercase to ensure consistency and reduce the vocabulary size.  

Removing Twitter handles: Remove Twitter handles (e.g., @VirginAmerica) as they do not contribute to sentiment analysis and can be considered noise.  

Removing URLs: Remove URLs as they don't add value to sentiment analysis and can be considered noise.  

Removing punctuation and special characters: Remove all punctuation marks and special characters, as they don't add much value to sentiment analysis and can be considered noise.  

Tokenization: Tokenize the text into individual words or subwords.  

Stop word removal: Remove stop words such as "the," "is," "and," etc. as they don't contribute much to sentiment analysis.  

Stemming or Lemmatization: Perform stemming or lemmatization to reduce words to their base form and to capture their meaning.
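
To make the steps concrete, here is a minimal standard-library sketch that traces one tweet from the dataset through the first six of them. The small `STOP_WORDS` set and the whitespace split are stand-ins chosen for illustration (assumptions, not the NLTK stop-word list or tokenizer), and stemming/lemmatization is left out because it needs NLTK data.

```python
import re

# Illustrative stop-word subset (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"the", "is", "and", "to", "a"}

def clean_tweet(text):
    """Lowercase, strip handles/URLs/punctuation, split, and drop stop words."""
    text = text.lower()
    text = re.sub(r"@\w+", "", text)      # Twitter handles
    text = re.sub(r"http\S+", "", text)   # URLs
    text = re.sub(r"[^\w\s]", "", text)   # punctuation and special characters
    tokens = text.split()                 # whitespace split stands in for a tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

sample = "@VirginAmerica plus you've added commercials to the experience... tacky."
print(clean_tweet(sample))
```

One thing this makes visible: stripping punctuation before stop-word removal turns "you've" into "youve", which no longer matches apostrophe-containing stop-word entries such as "you've".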

In [7]:
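# Download the NLTK resources used below: stop words, WordNet, and the
# 'punkt' tokenizer models required by nltk.word_tokenize
# (recent NLTK releases name this resource 'punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')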

# Instantiate the WordNet lemmatizer and load the stop-word list once
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define the preprocessing function
def preprocess_text(text):
    """Clean a raw tweet and return a list of lemmatized tokens."""
    # Convert text to lowercase
    text = text.lower()

    # Remove Twitter handles
    text = re.sub(r'@\w+', '', text)

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the remaining tokens (lemmatization is used here rather than stemming)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Return the preprocessed tokens
    return lemmatized_tokens

[nltk_data] Downloading package stopwords to /home/mlcare/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/mlcare/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
trn=df.text.apply(preprocess_text)

In [9]:
# Encode the sentiment classes as integers
label_map = {'positive': 0, 'negative': 1, 'neutral': 2}
df['airline_sentiment'] = df['airline_sentiment'].map(label_map)
df['airline_sentiment'].value_counts()

1    9178
2    3099
0    2363
Name: airline_sentiment, dtype: int64

In [10]:
tst=df['airline_sentiment']

In [11]:
# Split the data into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(trn, tst, test_size=0.2, random_state=42)

# Try Different NLP Models and Compare them  

1. LSTM  
2. 1D-CNN  
3. Pre-Trained BERT
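
When comparing the models, a useful reference point is the accuracy of always predicting the most frequent class. The sketch below computes it from the class counts printed earlier; it uses the whole dataset, so the figure for the held-out 20% split may differ slightly.

```python
# Class counts copied from the value_counts() output earlier in the notebook.
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.3f}")
```

Any model below should clear this baseline of roughly 63% by a clear margin to be worth its complexity.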

## LSTM

In [45]:
# Create a Tokenizer object
tokenizer = Tokenizer(lower=False)

# Fit the tokenizer on the preprocessed text
tokenizer.fit_on_texts(X_train)

# encode train and test data
xtrn=tokenizer.texts_to_sequences(X_train)
xtst=tokenizer.texts_to_sequences(X_test)

# Calculate the vocabulary size (+1 because index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1

# Use the longest training sequence as the padded length
maxlen = max(len(seq) for seq in X_train)

# Pad (or truncate) every sequence to maxlen; pad_sequences already returns a NumPy array
xtrn = pad_sequences(xtrn, maxlen=maxlen, padding='post', truncating='post')
xtst = pad_sequences(xtst, maxlen=maxlen, padding='post', truncating='post')

y_train=to_categorical(y_train, num_classes=3)
y_test=to_categorical(y_test, num_classes=3)
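
As a rough illustration of what the last four lines of the cell do, the following hand-rolled sketch mimics post-padding/post-truncation and one-hot encoding in plain Python. `pad_post` and `one_hot` are hypothetical helpers written for this example, not Keras functions; the real `pad_sequences` and `to_categorical` operate on whole arrays.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate from the end, then right-pad with `value` to exactly maxlen items."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))

def one_hot(label, num_classes):
    """Return a list with 1.0 at index `label` and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

print(pad_post([7, 2, 9], maxlen=5))           # short sequence gets trailing zeros
print(pad_post([7, 2, 9, 4, 1, 3], maxlen=5))  # long sequence loses its tail
print(one_hot(1, num_classes=3))               # class 1 ("negative") as a vector
```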

In [46]:
# Define the LSTM model
model = Sequential()

# Add an embedding layer
model.add(Embedding(input_dim=vocab_size, output_dim=50, input_length=maxlen))

# Add an LSTM layer with 64 units and 20% dropout on its inputs
model.add(LSTM(units=64, dropout=0.2))

# Add a fully connected layer with 3 units and a softmax activation
model.add(Dense(units=3, activation='softmax'))

# Compile the model with categorical crossentropy loss and adam optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model summary
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 23, 50)            599600    
                                                                 
 lstm_2 (LSTM)               (None, 64)                29440     
                                                                 
 dense_2 (Dense)             (None, 3)                 195       
                                                                 
Total params: 629,235
Trainable params: 629,235
Non-trainable params: 0
_________________________________________________________________
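
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas (an embedding has `vocab_size * dim` weights, an LSTM has four gates that each see the input plus the previous hidden state plus a bias, a dense layer has weights plus biases) and recovers the vocabulary size from the embedding count.

```python
embedding_dim = 50
lstm_units = 64
num_classes = 3
embedding_params = 599_600  # taken from the model summary

# Embedding: one row of `embedding_dim` weights per vocabulary entry.
vocab_size = embedding_params // embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output unit.
dense_params = lstm_units * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total)
```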


In [47]:
# Train the LSTM model
model.fit(xtrn, y_train, batch_size=64, epochs=10, validation_data=(xtst, y_test))

# Evaluate the LSTM model
loss, accuracy = model.evaluate(xtst, y_test)
print('Test accuracy:', accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.7721994519233704


## 1-D CNN

In [50]:
# Define the model architecture
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=50, input_length=maxlen))
model.add(Conv1D(filters=32, kernel_size=3, activation="relu"))
model.add(GlobalMaxPooling1D())
model.add(Dense(units=64, activation="relu"))
model.add(Dropout(0.5))
model.add(Dense(units=3, activation="softmax"))

# Compile the model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Print the model summary
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 23, 50)            599600    
                                                                 
 conv1d_1 (Conv1D)           (None, 21, 32)            4832      
                                                                 
 global_max_pooling1d_1 (Glo  (None, 32)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_4 (Dense)             (None, 64)                2112      
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense_5 (Dense)             (None, 3)                 195       
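
The shapes and parameter counts in this summary follow from the layer settings. A short check, assuming the Keras defaults for `Conv1D` (`padding='valid'`, stride 1):

```python
seq_len, embedding_dim = 23, 50   # padded length and embedding size from the summary
filters, kernel_size = 32, 3      # Conv1D settings from the model definition
hidden_units, num_classes = 64, 3

# With 'valid' padding and stride 1 the kernel fits seq_len - kernel_size + 1 times.
conv_out_len = seq_len - kernel_size + 1

# Each filter has kernel_size * embedding_dim weights plus one bias.
conv_params = kernel_size * embedding_dim * filters + filters

# Global max pooling leaves one value per filter, which feeds the dense layers.
dense_hidden_params = filters * hidden_units + hidden_units
dense_out_params = hidden_units * num_classes + num_classes

print(conv_out_len, conv_params, dense_hidden_params, dense_out_params)
```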
                                                      

In [51]:
# Train the 1-D CNN model
model.fit(xtrn, y_train, batch_size=64, epochs=10, validation_data=(xtst, y_test))

# Evaluate the 1-D CNN model
loss, accuracy = model.evaluate(xtst, y_test)
print('Test accuracy:', accuracy)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.7643442749977112


## Pre-Trained BERT Model

In [53]:
bert_model = transformers.TFBertModel.from_pretrained('bert-base-uncased')

# Define a function to tokenize and pad text for BERT
def preprocess_text_for_bert(texts, tokenizer, max_len):
    input_ids = []
    attention_masks = []

    for text in texts:
        encoded = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_attention_mask=True,
            return_token_type_ids=False,
            truncation=True
        )

        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])

    return np.array(input_ids), np.array(attention_masks)


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [59]:
texts = df['text'].values
labels = df['airline_sentiment'].values

In [60]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')


input_ids, attention_masks = preprocess_text_for_bert(texts, tokenizer, max_len=23)
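
For intuition about what `preprocess_text_for_bert` returns for each tweet, here is a hand-rolled sketch of the layout: `[CLS]` first, `[SEP]` last, padding after that, and an attention mask that is 1 over real tokens and 0 over padding. `build_bert_inputs` is a hypothetical helper, not part of `transformers`, and the word-piece ids in the example call are made up; only the three special-token ids are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def build_bert_inputs(wordpiece_ids, max_len):
    """Add [CLS]/[SEP], truncate to max_len, pad, and build the attention mask."""
    body = list(wordpiece_ids)[: max_len - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                      # 1 = real token
    pad = max_len - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad  # 0 = padding, ignored by attention

# The three word-piece ids below are made up for illustration.
ids, mask = build_bert_inputs([2054, 2056, 2009], max_len=8)
print(ids)
print(mask)
```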

In [63]:
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
labels = encoder.fit_transform(labels)

In [64]:
# Split ids, masks, and labels in a single call so the rows stay aligned
X_train, X_val, train_masks, val_masks, y_train, y_val = train_test_split(
    input_ids, attention_masks, labels, test_size=0.2, random_state=42)

In [68]:
# Keras input placeholders for token ids and attention masks (sequence length 23)
input_ids = tf.keras.layers.Input(shape=(23,), dtype=tf.int32, name='input_ids')
attention_masks = tf.keras.layers.Input(shape=(23,), dtype=tf.int32, name='attention_masks')

# Run BERT and take the final hidden state of every token
bert_output = bert_model(input_ids, attention_mask=attention_masks)
last_hidden_state = bert_output.last_hidden_state

# Classify from the [CLS] token representation (position 0)
output = tf.keras.layers.Dense(units=3, activation='softmax')(last_hidden_state[:, 0, :])

model = tf.keras.models.Model(inputs=[input_ids, attention_masks], outputs=output)

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method


In [69]:
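# Use a small learning rate for fine-tuning; Adam's default (1e-3) is usually too large for BERT
model.compile(loss='categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              metrics=['accuracy'])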

history = model.fit(
    [X_train, train_masks], y_train, 
    validation_data=([X_val, val_masks], y_val),
    epochs=5,
    batch_size=32
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


#### No GPU was available, so I did not train for more epochs. With longer fine-tuning, the pre-trained BERT model is expected to reach a validation accuracy of roughly 88%.