* https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671
* https://arxiv.org/pdf/1810.04805.pdf
* https://huggingface.co/bert-base-uncased?text=Paris+is+the+%5BMASK%5D+of+France.
* https://huggingface.co/docs/transformers/main_classes/tokenizer
* https://www.tensorflow.org/api_docs/python/tf/keras/Model
* https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
* https://www.tensorflow.org/text/tutorials/classify_text_with_bert
* https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/classify_text_with_bert.ipynb#scrollTo=VBWzH6exlCPS
* https://www.tensorflow.org/api_docs/python/tf/keras/optimizers
* https://huggingface.co/docs/transformers/index
* https://www.kaggle.com/datasets/yasserh/imdb-movie-ratings-sentiment-analysis?select=movie.csv

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from datetime import datetime
from google.colab import files

## Install Transformers library

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstallin

## Load the BERT Classifier and Tokenizer along with Input modules

In [4]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures
from transformers import BertConfig, BertModel

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


## Download Kaggle dataset

In [6]:
! pip install kaggle
! mkdir -p ~/.kaggle
# ! cp kaggle.json ~/.kaggle/
! cp /content/gdrive/MyDrive/Colab\ Notebooks/Thesis/kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
# ! kaggle datasets download -d yasserh/twitter-tweets-sentiment-dataset
# ! unzip twitter-tweets-sentiment-dataset.zip

In [8]:
! kaggle datasets download -d yasserh/imdb-movie-ratings-sentiment-analysis
! unzip imdb-movie-ratings-sentiment-analysis.zip

Downloading imdb-movie-ratings-sentiment-analysis.zip to /content
 44% 9.00M/20.6M [00:00<00:00, 54.5MB/s]
100% 20.6M/20.6M [00:00<00:00, 99.4MB/s]
Archive:  imdb-movie-ratings-sentiment-analysis.zip
  inflating: movie.csv               


In [9]:
path = '/content/movie.csv'
dataset_file = pd.read_csv(path)

In [10]:
dataset_file.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


In [11]:

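# Rename the two CSV columns (text, label) to the generic names used in the rest of the notebook.
dataset_file.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
# Replace missing reviews with a placeholder string so the tokenizer always receives text.
dataset_file['DATA_COLUMN'] = dataset_file['DATA_COLUMN'].replace(np.nan, 'Empty')
# The IMDB labels are already the integers 0 and 1, so the three string replacements
# below do not change anything here; they are left over from the Twitter dataset variant.
# The prediction cell at the end of the notebook reads 0 as negative and 1 as positive.
dataset_file['LABEL_COLUMN'] = dataset_file['LABEL_COLUMN'].replace('positive', 0)
dataset_file['LABEL_COLUMN'] = dataset_file['LABEL_COLUMN'].replace('negative', 1)
dataset_file['LABEL_COLUMN'] = dataset_file['LABEL_COLUMN'].replace('neutral', 2)
dataset_file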

Unnamed: 0,DATA_COLUMN,LABEL_COLUMN
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1
...,...,...
39995,"""Western Union"" is something of a forgotten cl...",1
39996,This movie is an incredible piece of work. It ...,1
39997,My wife and I watched this movie because we pl...,0
39998,"When I first watched Flatliners, I was amazed....",1


In [12]:
# del dataset_file['textID']
# del dataset_file['selected_text']
# dataset_file['text'] = dataset_file['text'].replace(np.nan, 'Empty')
# dataset_file['sentiment'] = dataset_file['sentiment'].replace('positive', 0)
# dataset_file['sentiment'] = dataset_file['sentiment'].replace('negative', 1)
# dataset_file['sentiment'] = dataset_file['sentiment'].replace('neutral', 2)
# dataset_file.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
# dataset_file

## Split into train, test, and validation sets

In [13]:
# Hold out 20% of the data, then split that hold-out evenly into test and validation sets.
train, test_and_validation = train_test_split(dataset_file, test_size=0.2, random_state=77)
# train_1, train_rest = train_test_split(train, test_size=0.5, random_state=77)
# train_2, train_3 = train_test_split(train_rest, test_size=0.5, random_state=77)
test, validation = train_test_split(test_and_validation, test_size=0.5, random_state=77)
print(len(train),len(test),len(validation))

32000 4000 4000
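
The sizes printed above follow from applying the two splits in sequence. The sketch below reproduces only the arithmetic; `split_sizes` is a hypothetical helper (not a scikit-learn function), and it assumes the held-out share is rounded up when only a fractional `test_size` is given.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up and the kept part gets the rest.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

# First split: 80% train, 20% held out.
n_train, n_holdout = split_sizes(40000, 0.2)
# Second split: the held-out part is divided evenly into test and validation.
n_test, n_validation = split_sizes(n_holdout, 0.5)
print(n_train, n_test, n_validation)
```

The result is an 80/10/10 split of the 40,000 reviews.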


## Save test and validation datasets

In [14]:
model_save_name = 'BERTModel-IMDB'
path = f"/content/gdrive/MyDrive/Colab_Notebooks/savedModels/{model_save_name}"
testSetPath = f"/content/gdrive/MyDrive/Colab_Notebooks/savedModels/{model_save_name}-test.csv"
validationSetPath = f"/content/gdrive/MyDrive/Colab_Notebooks/savedModels/{model_save_name}-validation.csv"

In [15]:
# test.to_csv('{model_save_name}-test.csv', encoding = 'utf-8-sig') 
# files.download('{model_save_name}-test.csv')

# validation.to_csv('{model_save_name}-validation.csv', encoding = 'utf-8-sig') 
# files.download('{model_save_name}-validation.csv')


with open(testSetPath, 'w', encoding = 'utf-8-sig') as f:
  test.to_csv(f)

with open(validationSetPath, 'w', encoding = 'utf-8-sig') as f:
  validation.to_csv(f)

## Create input sequences

In [16]:
def convert_data_to_examples_single(inputDataset, DATA_COLUMN, LABEL_COLUMN):
  # Wrap each DataFrame row in an InputExample (single-sentence task, so text_b is None).
  input_examples = inputDataset.apply(
      lambda x: InputExample(guid=None,  # globally unique ID for bookkeeping, unused in this case
                             text_a=x[DATA_COLUMN],
                             text_b=None,
                             label=x[LABEL_COLUMN]),
      axis=1)
  return input_examples


def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Tokenize one text; the tokenizer documentation describes every option used here.
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length,
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", # pads on the right up to max_length (replaces the deprecated pad_to_max_length=True)
            truncation=True # truncates inputs longer than max_length
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )
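
To make the three feature arrays concrete, here is a minimal pure-Python sketch of what padding and truncating to a fixed length produces for a single-sentence input. It is an illustration only, not the Hugging Face tokenizer: `build_features` is a hypothetical helper, and the token IDs passed to it are made-up placeholders. The special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token IDs in bert-base-uncased

def build_features(token_ids, max_length):
    """Hypothetical helper: add special tokens, truncate, and right-pad one sequence."""
    # Reserve two positions for [CLS] and [SEP], then truncate the remaining tokens.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens get attention mask 1.
    attention_mask = [1] * len(input_ids)
    # Right-pad to the fixed length; padding positions get attention mask 0.
    n_pad = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    # Single-segment input: every position belongs to segment 0.
    token_type_ids = [0] * max_length
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

short = build_features([7, 8, 9], max_length=8)          # shorter than max_length: padded
long = build_features(list(range(1000, 1020)), max_length=8)  # longer: truncated
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```

Every example ends up with the same length, which is what allows the dataset to be batched without further padding.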

In [17]:
DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'

In [18]:
test_inputExamples = convert_data_to_examples_single(test, DATA_COLUMN, LABEL_COLUMN)
test_data = convert_examples_to_tf_dataset(list(test_inputExamples), tokenizer)
test_data = test_data.batch(32)

validation_InputExamples = convert_data_to_examples_single(validation, DATA_COLUMN, LABEL_COLUMN)
validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)



## Configure the Loaded BERT model and Train for Fine-tuning

In [19]:
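train_InputExamples  = convert_data_to_examples_single(train, DATA_COLUMN, LABEL_COLUMN)
train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
# shuffle(100) only shuffles within a 100-example buffer.
# repeat(2) combined with epochs=2 in model.fit means four passes over the training set.
train_data = train_data.shuffle(100).batch(32).repeat(2)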
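
As a rough check on how much training the pipeline above implies, the sketch below counts optimizer steps and passes over the data. `steps_and_passes` is a hypothetical helper written for this illustration; it assumes that each Keras epoch consumes the whole (repeated) dataset, which is the case when `steps_per_epoch` is not set.

```python
import math

def steps_and_passes(n_examples, batch_size, repeat, epochs):
    """Return (steps per epoch, total steps, passes over the data)."""
    batches_per_pass = math.ceil(n_examples / batch_size)
    steps_per_epoch = batches_per_pass * repeat
    total_steps = steps_per_epoch * epochs
    passes = repeat * epochs
    return steps_per_epoch, total_steps, passes

# 32,000 training reviews, batch size 32, repeat(2), epochs=2
print(steps_and_passes(32000, 32, repeat=2, epochs=2))
```

With these settings each reported epoch covers the training set twice, so the model sees every review four times in total.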



In [21]:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=3e-5, epsilon=1e-08), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f511f2244d0>

In [22]:
model.evaluate(test_data)



[0.4587947130203247, 0.8867499828338623]

In [23]:
model.save_pretrained(path)

## Load model

In [24]:
loaded_model = TFBertForSequenceClassification.from_pretrained(path, local_files_only=True)
loaded_model.summary()

Some layers from the model checkpoint at /content/gdrive/MyDrive/Colab_Notebooks/savedModels/BERTModel-IMDB were not used when initializing TFBertForSequenceClassification: ['dropout_37']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at /content/gdrive/MyDrive/Colab_Notebooks/savedModels/BERTModel-IMDB.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions wit

Model: "tf_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_75 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


## Make Predictions with the Fine-tuned Model

In [25]:
pred_sentences = ['Terrible movie',
                  'I wasted 2 hours of my life',
                  'I will watch it a million times',
                  'I Absolutely loved it']

In [26]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = loaded_model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])

Terrible movie : 
 Negative
I wasted 2 hours of my life : 
 Negative
I will watch it a million times : 
 Positive
I Absolutely loved it : 
 Positive
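
The prediction cell prints only the most likely label. If a confidence value is also wanted, the softmax probability of the chosen class can be reported next to it. The sketch below shows that step in pure Python on made-up logits; `softmax` and `label_with_confidence` are helpers written for this illustration and are not part of the notebook.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, labels):
    """Return the most likely label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

labels = ["Negative", "Positive"]
example_logits = [[2.0, -1.0], [-0.5, 1.5]]  # made-up values, one row per sentence
results = [label_with_confidence(row, labels) for row in example_logits]
for name, confidence in results:
    print(f"{name}: {confidence:.3f}")
```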
