This notebook must be run in Google Colab. If you are viewing the file on GitHub, click the button below to open it in Colab and execute the code.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dilipis/CourseProject/blob/main/Sentiment_Analysis_with_BERT.ipynb)

This code needs to run on a GPU. In Colab, navigate to **Edit > Notebook settings** and select **GPU** from the 'Hardware accelerator' dropdown.

The data for training and testing the model is stored in a GitHub repository. The code below copies it to the Colab workspace for easier processing.

In [None]:
import os

source_folder = './data'
output_folder = './output'
github_source = 'https://raw.githubusercontent.com/dilipis/CourseProject/main/data/'

# Folder for storing the source files in Colab
os.makedirs(source_folder, exist_ok=True)

import urllib.request

# Copy the source files from GitHub into the workspace.
# github_source already ends with '/', so the file names are appended without a leading slash.
urllib.request.urlretrieve(github_source + 'test.jsonl', source_folder + '/test.jsonl')
urllib.request.urlretrieve(github_source + 'train.jsonl', source_folder + '/train.jsonl')

The Twitter data needs to be read from the jsonl files and preprocessed. This block of code does the following:

1. Reads the training and testing data from the jsonl files and converts them into csv files
2. Cleans the input data by removing the `@USER` and `<URL>` tags from the tweets
3. Splits the training dataset into training and validation sets, both of which are used while training the model
4. Extracts only the required columns for further processing. Only the label field and the response field are currently being used.
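
The label conversion and tag cleaning in steps 1 and 2 can be illustrated on a single tweet. This is a minimal sketch rather than the notebook's own code: `clean_tweet` and `LABEL_TO_ID` are hypothetical names, the sample tweet is made up, and the whitespace collapsing is an extra step that the notebook itself does not perform.

```python
# Hypothetical mapping that mirrors the 0/1 conversion used in this notebook
LABEL_TO_ID = {"SARCASM": 0, "NOT_SARCASM": 1}


def clean_tweet(text):
    """Remove the @USER and <URL> placeholder tags from a tweet."""
    for tag in ("@USER", "<URL>"):
        text = text.replace(tag, "")
    # Assumption of this sketch: also collapse the whitespace left behind
    return " ".join(text.split())


sample = {"label": "SARCASM", "response": "@USER @USER so true <URL>"}
row = [LABEL_TO_ID[sample["label"]], clean_tweet(sample["response"])]
print(row)
```

Keeping the cleaning in one small function would make it possible to apply exactly the same rules to the training and testing files.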



In [None]:
# Provides the BERT model and tokenizer classes used below
!pip install transformers

# For converting the jsonl file into the required format
!pip install jsonlines

# Fraction of train.jsonl that is held out for validation. It is passed to
# train_test_split as test_size, so 0.6 means 60% validation and 40% training.
train_valid_ratio = 0.6

# Import dependencies
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

import tensorflow as tf
import jsonlines
import csv
import pandas as pd
import random
import json

from sklearn.model_selection import train_test_split

input_data = []

# Read the training data so it can be written to a csv. SARCASM is mapped to 0 and NOT_SARCASM to 1
with open(source_folder + '/train.jsonl', "r", encoding='utf-8') as input_file:
    for tweet in jsonlines.Reader(input_file):
        if tweet['label'] == 'SARCASM':
            tweet['label'] = 0
        else:
            tweet['label'] = 1

        input_data.append(tweet)

# Create the folder for storing transformed data if it does not exist 
file_path = output_folder + "/twitter.csv"
os.makedirs(os.path.dirname(file_path), exist_ok=True)

# The tweets in the training data need to be preprocessed and cleaned. The following items are removed - USER tags and URL tags. The data is then stored temporarily in a csv
with open(file_path, 'w+', newline='', encoding='utf-8') as input_file:
    csv_writer = csv.writer(input_file)
    csv_head = ['label','response', 'context', 'ID']
    csv_writer.writerow(csv_head)

    for i in range(len(input_data)):
        data_row = [ input_data[i]['label'],  (input_data[i]['response']).replace('@USER', '').replace('<URL>',''),
                    (''.join(input_data[i]['context'])).replace('@USER', '').replace('<URL>',''), '']
        csv_writer.writerow(data_row)

# Read the data into a Pandas dataframe for easier processing
tweets = pd.read_csv(output_folder + '/twitter.csv')

# Split the input training data into two different dataframes for training and validation 
tweets_train, tweets_val = train_test_split(tweets, test_size=train_valid_ratio, random_state=42)

tweets_train.to_csv(output_folder + '/train.csv', index=False)
tweets_val.to_csv(output_folder + '/valid.csv', index=False)

# Helper function to read data from a jsonl file, one JSON object per line
def parse_json(file):
    with open(file, 'r', encoding='utf-8') as json_file:
        for line in json_file:
            yield json.loads(line)

file_path = source_folder + "/test.jsonl"
input_data = list(parse_json(file_path))    

# Preprocess test dataset by removing USER and URL tags 
with open(output_folder + '/test.csv', 'w+', newline='', encoding='utf-8') as input_file:
    csv_writer = csv.writer(input_file)
    csv_head = ['label','response', 'context', 'ID']
    csv_writer.writerow(csv_head)

    for i in range(len(input_data)):
        data_row = ['3', (input_data[i]['response']).replace('@USER', '').replace('<URL>',''),
                    (''.join(input_data[i]['context'])).replace('@USER', '').replace('<URL>',''),input_data[i]['id']]
        csv_writer.writerow(data_row)

tweets_test = pd.read_csv(output_folder + '/test.csv')

# Extract only the required columns from the input data. The context field could be used in the future to achieve better accuracy 
tweets_train = pd.DataFrame({'response': tweets_train['response'],
    'label': tweets_train['label']})

tweets_val = pd.DataFrame({'response': tweets_val['response'],
    'label': tweets_val['label']})

The following code defines the helper functions that convert the data into the format that BERT requires for processing.

In [None]:
# Helper function to convert the data in the format that BERT expects. 
# The GUID and text_b are set to None as we are not using it in this implementation
def convert_data_to_examples(train, test, response, label): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[response], 
                                                          text_b = None,
                                                          label = x[label]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[response], 
                                                          text_b = None,
                                                          label = x[label]), axis = 1)
  
  return train_InputExamples, validation_InputExamples

# Helper function that tokenizes the InputExample objects and creates a dataset that can be loaded into the model
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length', # replaces the deprecated pad_to_max_length=True; pads to the right by default
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


response = 'response'
label = 'label'
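
To make the three fields produced by the tokenizer more concrete, here is a toy sketch of how one sentence is truncated, wrapped in special tokens and padded. It does not call the `transformers` library: `toy_encode` is a hypothetical helper, the word ids are arbitrary placeholders, and the special-token ids 101, 102 and 0 are assumed to be the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`.

```python
def toy_encode(token_ids, max_length=8, cls_id=101, sep_id=102, pad_id=0):
    """Mimic the layout of a single-sentence BERT encoding (toy sketch)."""
    # Leave room for the [CLS] and [SEP] special tokens, truncating if needed
    body = list(token_ids[: max_length - 2])
    input_ids = [cls_id] + body + [sep_id]
    # Real tokens get attention mask 1
    attention_mask = [1] * len(input_ids)
    # Pad on the right up to max_length; padded positions get attention mask 0
    padding = max_length - len(input_ids)
    input_ids += [pad_id] * padding
    attention_mask += [0] * padding
    # Single-sentence input: every position belongs to segment 0
    token_type_ids = [0] * max_length
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


encoded = toy_encode([2023, 2003, 2307])  # placeholder word ids
for name, values in encoded.items():
    print(name, values)
```

Every example ends up with the same length, which is what allows the dataset to be batched later on.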

This is where the main action happens! This block is expected to take 5-10 minutes to execute. In this block, we:

1. Create the BERT model and tokenizer
2. Convert the training and validation data into the BERT format using the helper functions defined above
3. Use model.compile to set the optimizer, loss function and metrics used to train the model
4. Call model.fit to train the model on the training data and evaluate it on the validation data
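
The loss and metric chosen in step 3 can be illustrated without TensorFlow. The sketch below is a plain-Python approximation of what sparse categorical cross-entropy on logits and sparse categorical accuracy compute for integer labels; the function names and the sample logits are made up for illustration, and Keras additionally averages the loss over each batch.

```python
import math


def sparse_crossentropy_from_logits(logits, label):
    """Negative log-probability of the true class, computed from raw logits."""
    # log-sum-exp with the maximum subtracted for numerical stability
    highest = max(logits)
    log_sum_exp = highest + math.log(sum(math.exp(z - highest) for z in logits))
    return log_sum_exp - logits[label]


def sparse_accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = sum(
        1 for logits, y in zip(batch_logits, labels)
        if logits.index(max(logits)) == y
    )
    return correct / len(labels)


# Made-up logits for three tweets; 0 = SARCASM, 1 = NOT_SARCASM
batch_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]]
labels = [0, 1, 1]

losses = [sparse_crossentropy_from_logits(l, y) for l, y in zip(batch_logits, labels)]
print(sum(losses) / len(losses), sparse_accuracy(batch_logits, labels))
```

Because the labels are plain integers rather than one-hot vectors, the "sparse" variants are the ones that match the 0/1 labels prepared earlier.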



In [None]:
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Convert the train and validation data into the format that BERT can process. This uses the helper functions that were defined above.
tweets_train_InputExamples, tweets_val_InputExamples = convert_data_to_examples(tweets_train, tweets_val, response, label)

tweets_train_data = convert_examples_to_tf_dataset(list(tweets_train_InputExamples), tokenizer)
tweets_train_data = tweets_train_data.shuffle(100).batch(32).repeat(2)

tweets_val_data = convert_examples_to_tf_dataset(list(tweets_val_InputExamples), tokenizer)
tweets_val_data = tweets_val_data.batch(32)

# Setting up callbacks for TensorBoard
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

# Set the optimizer, loss function and the metrics to track
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

# Train the model: fit adjusts the model parameters so that the inputs map to the expected labels.
# The batch size is already set on the datasets above, so it is not passed again here.
model.fit(tweets_train_data, 
          epochs=2, 
          validation_data=tweets_val_data,
          callbacks=[tensorboard_callback])

Now it is time to make predictions with the trained model. A prediction is obtained for each row in the testing dataset. Finally, the results are written to answer.txt in the 'output' folder in the workspace.

In [None]:

test_batch = tokenizer(tweets_test['response'].astype(str).tolist(), max_length=64, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(test_batch)

# Convert the model's logits into class probabilities
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)

# Pick the most likely class for each tweet: 0 (SARCASM) or 1 (NOT_SARCASM)
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()

# Will be used in the last step to convert 0/1 to SARCASM/NOT_SARCASM
labels = ['SARCASM','NOT_SARCASM']

file_path = output_folder + "/answer.txt"

# Loop through the predictions and write the results to a text file. This works because tweets_test and label have the same length and order
with open(file_path, 'w+', encoding='utf-8') as results_file:
    for i in range(len(tweets_test)):
      results_file.write(tweets_test['ID'][i] + "," + labels[label[i]] + '\n')
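
The prediction step above can be traced by hand on a couple of rows. The following is a small plain-Python sketch, not the notebook's TensorFlow code: the tweet ids and logits are made up, and `predict_label` is a hypothetical helper.

```python
import math

LABELS = ["SARCASM", "NOT_SARCASM"]  # index 0 and index 1, as in the notebook


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the label name of the most probable class."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))]


# Made-up (id, logits) pairs standing in for the model output
rows = [("twitter_1", [1.2, -0.4]), ("twitter_2", [-0.7, 2.1])]
lines = [f"{tweet_id},{predict_label(logits)}" for tweet_id, logits in rows]
print("\n".join(lines))
```

Softmax preserves the ordering of the scores, so taking the argmax of the raw logits would select the same class; the probabilities are mainly useful when a confidence value is needed.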