## Sentimaster

A sentiment analysis project for a competition-based hiring process. 

Here I investigate the application of a BERT-based approach for the tweets sentiment classification task. The model choice was based on the reports of  	
<http://nlpprogress.com/english/sentiment_analysis.html> and specifically <https://doi.org/10.48550/arXiv.1905.05583>.

Nonetheless, a baseline approach using TF-IDF preprocessing with a random forest classifier was also implemented to compare the adequacy of the more sophisticated BERT-based strategy.



In [2]:
import pandas as pd

random_state = 42
train_file = "/home/colombelli/temp/applications/ey/data-set/train_complete.csv"
df = pd.read_csv(train_file)
df.head()

Unnamed: 0,tweet,label
0,"""QT @user In the original draft of the 7th boo...",2
1,"""Ben Smith / Smith (concussion) remains out of...",1
2,Sorry bout the stream last night I crashed out...,1
3,Chase Headley's RBI double in the 8th inning o...,1
4,@user Alciato: Bee will invest 150 million in ...,2


## Investigation of the basic dataset properties

In [11]:
print("Number of samples: ", len(df))
print("Labels:")
print(df['label'].value_counts())
print("\nTweet NaN values: ", df['tweet'].isna().sum())

Number of samples:  47615
Labels:
1    21542
2    18668
0     7405
Name: label, dtype: int64

Tweet NaN values:  0


## Data preprocessing 

I will use a baseline approach and compare to a state-of-the-art approach for sentiment analysis. 
The data preprocessing for the TF-IDF approach used with the baseline algorithm is heavier than the preprocessing performed in the data used by BERT.

In [32]:
import re
import string
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer 
from nltk.stem import PorterStemmer


tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                            reduce_len=True)
stemmer = PorterStemmer() 

# There are lots of these functions available on the internet
def text_preprocessing_tfidf(text):

    # Remove @mentions
    text = re.sub(r'(@.*?)[\s]', ' ', text)
    # Remove old style retweet text "RT"
    text = re.sub(r'^RT[\s]+', '', text)
    # Remove hyperlinks
    text = re.sub(r'https?://[^\s\n\r]+', '', text)
    # Remove the hash # sign from hashtags
    text = re.sub(r'#', '', text)

    text_clean = []
    for word in tokenizer.tokenize(text):
        if (word not in stopwords.words('english') and  # Remove stopwords
            word not in string.punctuation):  # Remove punctuation

            stem_word = stemmer.stem(word) # happy, happiness, etc -> happi            
            text_clean.append(stem_word)

    return " ".join(text_clean)


# This process can take some time and could be improved
def get_tfidf_preprocessed_dataset(df):
    df['tweet'] = df['tweet'].map(text_preprocessing_tfidf)
    return df

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/colombelli/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [33]:
tfidf_df = get_tfidf_preprocessed_dataset(df)
tfidf_df.head()

Unnamed: 0,tweet,label
0,qt origin draft 7th book remu lupin surviv bat...,2
1,ben smith smith concuss remain lineup thursday...,1
2,sorri bout stream last night crash tonight sur...,1
3,chase headley' rbi doubl 8th inning david pric...,1
4,alciato bee invest 150 million januari anoth 2...,2


In [3]:
def text_preprocessing_bert(text):

    # Remove @mentions
    text = re.sub(r'(@.*?)[\s]', ' ', text)
    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)
    # Remove trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def get_bert_preprocessed_dataset(df):
    df['tweet'] = df['tweet'].map(text_preprocessing_bert)
    return df

In [4]:
bert_df = get_bert_preprocessed_dataset(df)
bert_df.head()

Unnamed: 0,tweet,label
0,"""QT In the original draft of the 7th book, Rem...",2
1,"""Ben Smith / Smith (concussion) remains out of...",1
2,Sorry bout the stream last night I crashed out...,1
3,Chase Headley's RBI double in the 8th inning o...,1
4,Alciato: Bee will invest 150 million in Januar...,2


## Baseline evaluation

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
import numpy as np

label_encoder = LabelBinarizer()
label_encoder.fit(df['label'].values)
label_encoder.classes_

array([0, 1, 2])

In [40]:
from sklearn.feature_extraction.text import TfidfVectorizer

X = tfidf_df['tweet'].values
y = tfidf_df['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, 
                                                    random_state=random_state)

tf_idf = TfidfVectorizer(ngram_range=(1, 3),
                         binary=True)

X_train_tfidf = tf_idf.fit_transform(X_train)
X_test_tfidf = tf_idf.transform(X_test)

In [41]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

clf = RandomForestClassifier(max_depth=2, random_state=random_state)
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

f1 = precision_recall_fscore_support(y_test, y_pred, average='macro', beta=1)[2]
print("Model performance (F1):", f1)
print("Accuracy:", accuracy_score(y_test, y_pred))  

Model performance (F1): 0.20529961730368648
Accuracy: 0.44498110037799243


  _warn_prf(average, modifier, msg_start, len(result))


## BERT state-of-the-art evaluation

In [42]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization 
tf.get_logger().setLevel('INFO')

X = bert_df['tweet'].values
y = label_encoder.transform(bert_df['label'].values)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, 
                                                    random_state=random_state)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, 
                                                    test_size=0.2,
                                                    random_state=random_state)

train_tf_df = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(32)
val_tf_df = tf.data.Dataset.from_tensor_slices((X_val, y_val)).batch(32)
test_tf_df = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(32)

In [43]:
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

bert_model = hub.KerasLayer(tfhub_handle_encoder)
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)

print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')

BERT model selected           : https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1
Preprocess model auto-selected: https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3


In [44]:
import tensorflow_addons as tfa

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(3, activation='softmax', name='classifier')(net)
  return tf.keras.Model(text_input, net)


classifier_model = build_classifier_model()
loss = tf.keras.losses.CategoricalCrossentropy()
metrics = [tfa.metrics.F1Score(num_classes=3, average='macro'),
            tf.keras.metrics.CategoricalAccuracy()]

In [45]:
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_tf_df).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')


classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)

In [46]:
history = classifier_model.fit(x=train_tf_df,
                               validation_data=val_tf_df,
                               epochs=epochs)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [47]:
classifier_model.save("models/x_train.h5")

In [48]:
evaluation = classifier_model.evaluate(test_tf_df)



In [51]:
print("Loss: ", evaluation[0])
print("F1-Score: ", evaluation[1])
print("Accuracy: ", evaluation[2])

Loss:  0.7888618111610413
F1-Score:  0.6001226305961609
Accuracy:  0.6417471766471863


### As expected, the BERT results were far better than the baseline's. Note that this result could be easily improved if I had enough computational resources to train larger BERT models for longer.

## Making the challange predictions with the fine-tuned BERT

In [None]:
import re
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as hub
from official.nlp import optimization 
from sklearn.preprocessing import LabelBinarizer



def text_preprocessing_bert(text):

    # Remove @mentions
    text = re.sub(r'(@.*?)[\s]', ' ', text)
    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)
    # Remove trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text


train_file = "/home/colombelli/temp/applications/ey/data-set/train_complete.csv"
df = pd.read_csv(train_file)
label_encoder = LabelBinarizer()
label_encoder.fit(df['label'].values)

with open('../data-set/test_text.txt') as f: 
    challenge_test_tweets = np.array([
            text_preprocessing_bert(line.rstrip()) for line in f
        ])

model = tf.keras.models.load_model(
            "models/x_train.h5", 
            custom_objects={'KerasLayer':hub.KerasLayer,
                            'AdamWeightDecay': optimization.AdamWeightDecay})

In [23]:
# Unfortunatelly, the following command crashes on my PC due to memory limits
# An analogous model was trained using all the provided data through Google Colab
# The execution details can be found in the colab_runtime.ipynb notebook
predictions = model(tf.constant(challenge_test_tweets))