<a href="https://colab.research.google.com/github/CodeMonkey01/DataMiningI/blob/main/ANN/Option_A/ANN_with_BERT_add_BERT_seperator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ANN with BERT
In this notebook we tackle the humor classification problem with an ANN built on top of pretrained BERT layers.

This notebook shows the preprocessing and hyperparameter selection process. 

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd '/content/drive/MyDrive/'

    gpu_info = !nvidia-smi
    gpu_info = '\n'.join(gpu_info)
    if gpu_info.find('failed') >= 0:
      print('Not connected to a GPU')
    else:
      print(gpu_info)
except ImportError:
    # Not running on Colab: skip the Drive mount and the GPU check
    pass

Mounted at /content/drive/
/content/drive/MyDrive
Thu May 26 20:28:36 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-------------------------------------

In [None]:
!pip install tensorflow_text
!pip install tensorflow_hub
!pip install transformers

In [3]:
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the TF ops needed by the BERT preprocessing layer
from sklearn.preprocessing import LabelEncoder

In [4]:
df_raw = pd.read_csv('/content/drive/MyDrive/Data Mining/dataset.csv')
df_raw.describe()

Unnamed: 0,text,humor
count,200000,200000
unique,200000,2
top,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
freq,1,100000


In [5]:
# Take a random sample of 20,000 rows to speed up the grid search; df keeps the full data set
df_sampled = df_raw.sample(20_000)
df = df_raw

# Check for imbalance
The full data set is perfectly balanced (100,000 examples per class) and the 20k sample is close to balanced, so no rebalancing is needed.

In [6]:
print(df_sampled["humor"].value_counts())
print(df["humor"].value_counts())

True     10053
False     9947
Name: humor, dtype: int64
False    100000
True     100000
Name: humor, dtype: int64


# Preprocessing

As described in the paper, we tried different preprocessing approaches for BERT:

1.   Stop word removal
2.   Stemming
3.   Separator / special tokens

We tried these preprocessing steps both independently and in combination. Option 3 ("Separator / special tokens") worked best for BERT on this binary classification problem. The code for the other preprocessing methods is kept below for reference.
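
To make Option 3 concrete, here is a minimal sketch of what adding separator and special tokens amounts to. The helper `wrap_with_special_tokens` is hypothetical and not part of the notebook, and plain whitespace splitting stands in for BERT's WordPiece tokenizer, so the tokens will differ from the real output.

```python
MAX_LEN = 128  # same limit as used in the notebook


def wrap_with_special_tokens(tokens, max_len=MAX_LEN):
    """Hypothetical helper: add [CLS]/[SEP] and truncate to at most max_len tokens.

    Assumes max_len >= 2 so that both special tokens fit.
    """
    body = tokens[: max_len - 2]  # reserve two positions for [CLS] and [SEP]
    return ["[CLS]"] + body + ["[SEP]"]


# Whitespace splitting is only a stand-in for the real WordPiece tokenizer.
tokens = "what do you call a turtle without its shell ?".split()
print(" ".join(wrap_with_special_tokens(tokens)))
```

The truncation budget includes the two special tokens, so no wrapped sentence is longer than `MAX_LEN`.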


In [7]:
# Transform class from Boolean to integer value
df_sampled['class']=df_sampled['humor'].apply(lambda x: 1 if x==True else 0)
df['class']=df['humor'].apply(lambda x: 1 if x==True else 0)

# Option 1 
Stop word removal

In [None]:
# Remove stop words
from gensim.parsing.preprocessing import remove_stopwords

df['stop_word']=df['text'].apply(lambda x: remove_stopwords(x))

# Option 2
Stemming

In [None]:
# Stemming
import re

import nltk
from nltk.stem import PorterStemmer

token_pattern = re.compile(r"(?u)\b\w\w+\b")

ps = PorterStemmer()

nltk.download('punkt')
nltk.download('stopwords')

df['stemmed']=df['text'].apply(lambda x: ' '.join([ps.stem(y) for y in token_pattern.findall(x)]))

# Option 3 (WE USED THIS)
Separator and special tokens

In [8]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

MAX_LEN = 128

def to_bert_tokens(sentence: str) -> str:
    # Encode with [CLS]/[SEP], truncate to MAX_LEN, and join the tokens back into one string
    ids = tokenizer.encode(sentence, add_special_tokens=True, max_length=MAX_LEN, truncation=True)
    return " ".join(tokenizer.convert_ids_to_tokens(ids))

df_sampled['bert_preprocessed'] = df_sampled['text'].apply(to_bert_tokens)
df['bert_preprocessed'] = df['text'].apply(to_bert_tokens)

# Option 4
Stemming + Stop word removal

In [None]:
# Stemming
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re

token_pattern = re.compile(r"(?u)\b\w\w+\b")

ps = PorterStemmer()

nltk.download('punkt')
nltk.download('stopwords')

my_stopwords = set(stopwords.words('english'))

df['stemmed_stop_removed']=df['text'].apply(lambda x: ' '.join([ps.stem(y) for y in token_pattern.findall(x) if y not in my_stopwords]))

# Info
Because preprocessing the full data set takes a long time, the code for the other methods (stop word removal, stemming) is included only to document what we tried. Their results are NOT used: in our iterative process, separator and special tokens (Option 3) worked best by far.

In [9]:
df.head()

Unnamed: 0,text,humor,class,bert_preprocessed
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False,0,[CLS] Joe bid ##en rules out 2020 bid : ' guys...
1,Watch: darvish gave hitter whiplash with slow ...,False,0,[CLS] Watch : da ##r ##vis ##h gave hitter whi...
2,What do you call a turtle without its shell? d...,True,1,[CLS] What do you call a turtle without its sh...
3,5 reasons the 2016 election feels so personal,False,0,[CLS] 5 reasons the 2016 election feels so per...
4,"Pasco police shot mexican migrant from behind,...",False,0,[CLS] Pa ##sco police shot me ##xi ##can migra...


In [10]:
from sklearn.model_selection import train_test_split

# Create train test split for training

X_train, X_test, y_train, y_test = train_test_split(df['bert_preprocessed'], df['class'], test_size=0.4)
X_train_sampled, X_test_sampled, y_train_sampled, y_test_sampled = train_test_split(df_sampled['bert_preprocessed'], df_sampled['class'], test_size=0.4)

# BERT

In [11]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

In [12]:
def get_sentence_embeding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

## Test embedding
Test the sentence embedding (pooled output) of the pretrained BERT model with a real sentence from the data set.

In [13]:
test_sentence = df["bert_preprocessed"][1]
print("Test sentence:")
print(test_sentence)
print("Test sentence (word embedding):")
print(get_sentence_embeding([test_sentence]))

Test sentence:
[CLS] Watch : da ##r ##vis ##h gave hitter whip ##lash with slow pitch [SEP]
Test sentence (word embedding):
tf.Tensor(
[[-5.35017192e-01  4.20762599e-01  9.99616385e-01 -9.87604976e-01
   9.07577336e-01  9.38914835e-01  9.45565760e-01 -9.93435681e-01
  -9.48297083e-01 -5.15298069e-01  9.62185860e-01  9.95873511e-01
  -9.98733640e-01 -9.99418557e-01  8.06529641e-01 -9.32940423e-01
   9.81226981e-01 -5.10653913e-01 -9.99887168e-01 -6.48698688e-01
  -7.84378171e-01 -9.99633610e-01  1.55686572e-01  9.82200027e-01
   9.40321684e-01 -8.46317858e-02  9.75835562e-01  9.99889851e-01
   6.13145173e-01 -1.26248568e-01  2.51276761e-01 -9.81639445e-01
   8.09462667e-01 -9.98223543e-01  1.50366932e-01  3.93522352e-01
   7.17029512e-01 -1.49403870e-01  7.88604319e-01 -9.39887047e-01
  -4.89151835e-01 -6.31983817e-01  5.66573143e-01 -4.94452208e-01
   8.81204724e-01  6.84452876e-02 -9.58826672e-03 -9.67185646e-02
   1.65023580e-02  9.99337792e-01 -9.00641918e-01  9.42173600e-01
  -9.95

# Build model

```
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 text (InputLayer)              [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_mask': (Non  0           ['text[0][0]']                   
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128),                                                      
                                 'input_word_ids':                                                
                                (None, 128)}                                                      
                                                                                                  
 keras_layer_1 (KerasLayer)     {'encoder_outputs':  108310273   ['keras_layer[1][0]',            
                                 [(None, 128, 768),               'keras_layer[1][1]',            
                                 (None, 128, 768),                'keras_layer[1][2]']            
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768),                                                
                                 (None, 128, 768)],                                               
                                 'sequence_output':                                               
                                 (None, 128, 768),                                                
                                 'pooled_output': (                                               
                                None, 768),                                                       
                                 'default': (None,                                                
                                768)}                                                             
                                                                                                  
 dropout (Dropout)              (None, 768)          0           ['keras_layer_1[1][13]']         
                                                                                                  
 output (Dense)                 (None, 1)            769         ['dropout[0][0]']                
                                                                                                  
==================================================================================================
Total params: 108,311,042
Trainable params: 769
Non-trainable params: 108,310,273
__________________________________________________________________________________________________
```
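
The parameter counts in the summary above can be checked by hand. Since the BERT encoder is listed as non-trainable, only the final dense layer contributes trainable parameters: one weight per input feature plus one bias. A short sketch of the arithmetic, using the numbers from the summary:

```python
hidden_size = 768          # width of BERT's pooled_output
units = 1                  # single sigmoid output unit
bert_params = 108_310_273  # encoder parameters, taken from the summary

dense_params = hidden_size * units + units  # weights + bias
total_params = bert_params + dense_params

print(f"Trainable params: {dense_params:,}")
print(f"Total params: {total_params:,}")
```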

In [14]:
def build_model() -> tf.keras.Model:
    # Bert layers
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)

    # Neural network layers
    l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output']) # dropout rate of 0.1 works the best
    l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l) # sigmoid is required for a single output unit: softmax over one unit always returns 1.0, which reduces the accuracy by a lot

    # Use inputs and outputs to construct a final model
    model = tf.keras.Model(inputs=[text_input], outputs = [l])

    #model.summary()

    return model
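
The comment on the output layer can be illustrated without TensorFlow. The sketch below uses plain Python stand-ins for the two activations (these are not the Keras implementations): softmax normalizes over the units of a layer, so with a single unit it returns 1.0 for every input, whereas sigmoid maps the logit to a usable probability.

```python
import math


def sigmoid(z):
    """Logistic function: maps a single logit to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


for z in (-3.0, 0.0, 2.5):
    # softmax over one unit is exp(0) / exp(0) = 1.0, whatever the logit is
    print(f"logit={z:+.1f}  sigmoid={sigmoid(z):.4f}  softmax={softmax([z])}")
```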

# Grid Search

In [None]:
import numpy as np
import itertools
EPOCHS = 5

METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall'),
]


lr_values = np.arange(1e-3, 1e-2, 0.001)
epsilon_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
batch_values = [32]

values = list(itertools.product(lr_values, epsilon_values, batch_values))

print(f"Combinations: {len(values)}")
for lr, ep, batch in values:
  model: tf.keras.Model = build_model()

  print(f"Try adam learning rate of: {lr} and e: {ep} and batch size: {batch}")

  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr, epsilon=ep),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=METRICS)
    
  model.fit(X_train_sampled, y_train_sampled, epochs=EPOCHS, batch_size=batch)
  model.evaluate(X_test_sampled, y_test_sampled)
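
The size of this search space can be sketched without training anything. The snippet below rebuilds the same grid in plain Python. As an assumption that differs from the cell above, it generates the learning rates from integers and rounds them, which avoids floating-point noise such as `0.009000000000000001` in the printed values.

```python
import itertools

# Learning rates 0.001 ... 0.009, built from integers to avoid float noise
lr_values = [round(k * 0.001, 3) for k in range(1, 10)]
epsilon_values = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
batch_values = [32]

grid = list(itertools.product(lr_values, epsilon_values, batch_values))
print(f"Combinations: {len(grid)}")
print("First:", grid[0], "Last:", grid[-1])
```

With 5 epochs per combination, every extra value in one of the lists multiplies the total training time, which is why the search runs on the 20k sample.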

# Evaluation of learning rate and epsilon
After running the grid search for different learning rate values with 20k samples for 5 epochs, we get the following results. 




```
    Try adam learning rate of: 0.001 and e: 1e-06 and batch size: 32
    accuracy: 0.8656 - precision: 0.8964 - recall: 0.8237
    Try adam learning rate of: 0.001 and e: 1e-05 and batch size: 32
    accuracy: 0.8634 - precision: 0.9016 - recall: 0.8126
    Try adam learning rate of: 0.001 and e: 0.0001 and batch size: 32
    accuracy: 0.8676 - precision: 0.9010 - recall: 0.8229
    Try adam learning rate of: 0.001 and e: 0.001 and batch size: 32
    accuracy: 0.8744 - precision: 0.8774 - recall: 0.8674
    Try adam learning rate of: 0.001 and e: 0.01 and batch size: 32
    accuracy: 0.8439 - precision: 0.9033 - recall: 0.7666
    Try adam learning rate of: 0.001 and e: 0.1 and batch size: 32
    accuracy: 0.8291 - precision: 0.8127 - recall: 0.8507
    Try adam learning rate of: 0.002 and e: 1e-06 and batch size: 32
    accuracy: 0.8875 - precision: 0.8883 - recall: 0.8838
    Try adam learning rate of: 0.002 and e: 1e-05 and batch size: 32
    accuracy: 0.8826 - precision: 0.8503 - recall: 0.9257
    Try adam learning rate of: 0.002 and e: 0.0001 and batch size: 32
    accuracy: 0.8857 - precision: 0.8647 - recall: 0.9118
    Try adam learning rate of: 0.002 and e: 0.001 and batch size: 32
    accuracy: 0.8838 - precision: 0.8753 - recall: 0.8921
    Try adam learning rate of: 0.002 and e: 0.01 and batch size: 32
    accuracy: 0.8677 - precision: 0.8276 - recall: 0.9255
    Try adam learning rate of: 0.002 and e: 0.1 and batch size: 32
    accuracy: 0.8446 - precision: 0.8258 - recall: 0.8694
    Try adam learning rate of: 0.003 and e: 1e-06 and batch size: 32
    accuracy: 0.8919 - precision: 0.8762 - recall: 0.9101
    Try adam learning rate of: 0.003 and e: 1e-05 and batch size: 32
    accuracy: 0.8917 - precision: 0.9028 - recall: 0.8755
    Try adam learning rate of: 0.003 and e: 0.0001 and batch size: 32
    accuracy: 0.8842 - precision: 0.8444 - recall: 0.9391
    Try adam learning rate of: 0.003 and e: 0.001 and batch size: 32
    accuracy: 0.8903 - precision: 0.9013 - recall: 0.8740
    Try adam learning rate of: 0.003 and e: 0.01 and batch size: 32
    accuracy: 0.8841 - precision: 0.8852 - recall: 0.8800
    Try adam learning rate of: 0.003 and e: 0.1 and batch size: 32
    accuracy: 0.8531 - precision: 0.8754 - recall: 0.8199
    Try adam learning rate of: 0.004 and e: 1e-06 and batch size: 32
    accuracy: 0.8866 - precision: 0.9311 - recall: 0.8325
    Try adam learning rate of: 0.004 and e: 1e-05 and batch size: 32
    accuracy: 0.9041 - precision: 0.9013 - recall: 0.9477
    Try adam learning rate of: 0.004 and e: 0.0001 and batch size: 32
    accuracy: 0.8971 - precision: 0.9052 - recall: 0.8848
    Try adam learning rate of: 0.004 and e: 0.001 and batch size: 32
    accuracy: 0.8953 - precision: 0.8807 - recall: 0.9118
    Try adam learning rate of: 0.004 and e: 0.01 and batch size: 32
    accuracy: 0.8864 - precision: 0.8882 - recall: 0.8813
    Try adam learning rate of: 0.004 and e: 0.1 and batch size: 32
    accuracy: 0.8602 - precision: 0.8298 - recall: 0.9028
    Try adam learning rate of: 0.005 and e: 1e-06 and batch size: 32
    accuracy: 0.8754 - precision: 0.8184 - recall: 0.9616
    Try adam learning rate of: 0.005 and e: 1e-05 and batch size: 32
    accuracy: 0.8775 - precision: 0.9474 - recall: 0.7967
    Try adam learning rate of: 0.005 and e: 0.0001 and batch size: 32
    accuracy: 0.8891 - precision: 0.9298 - recall: 0.8394
    Try adam learning rate of: 0.005 and e: 0.001 and batch size: 32
    accuracy: 0.8972 - precision: 0.8934 - recall: 0.8997
    Try adam learning rate of: 0.005 and e: 0.01 and batch size: 32
    accuracy: 0.8873 - precision: 0.8688 - recall: 0.9096
    Try adam learning rate of: 0.005 and e: 0.1 and batch size: 32
    accuracy: 0.8644 - precision: 0.8453 - recall: 0.8886
    Try adam learning rate of: 0.006 and e: 1e-06 and batch size: 32
    accuracy: 0.8219 - precision: 0.7411 - recall: 0.9838
    Try adam learning rate of: 0.006 and e: 1e-05 and batch size: 32
    accuracy: 0.8846 - precision: 0.9486 - recall: 0.8108
    Try adam learning rate of: 0.006 and e: 0.0001 and batch size: 32
    accuracy: 0.9013 - precision: 0.9047 - recall: 0.8947
    Try adam learning rate of: 0.006 and e: 0.001 and batch size: 32
    accuracy: 0.7896 - precision: 0.9814 - recall: 0.5860
    Try adam learning rate of: 0.006 and e: 0.01 and batch size: 32
    accuracy: 0.8873 - precision: 0.8529 - recall: 0.9331
    Try adam learning rate of: 0.006 and e: 0.1 and batch size: 32
    accuracy: 0.8700 - precision: 0.8587 - recall: 0.8825
    Try adam learning rate of: 0.007 and e: 1e-06 and batch size: 32
    accuracy: 0.9038 - precision: 0.8971 - recall: 0.9098
    Try adam learning rate of: 0.007 and e: 1e-05 and batch size: 32
    accuracy: 0.8845 - precision: 0.8837 - recall: 0.8870
    accuracy: 0.9019 - precision: 0.8795 - recall: 0.9290
    Try adam learning rate of: 0.007 and e: 0.0001 and batch size: 32
    accuracy: 0.8942 - precision: 0.9342 - recall: 0.8459
    Try adam learning rate of: 0.007 and e: 0.001 and batch size: 32
    accuracy: 0.8972 - precision: 0.8709 - recall: 0.9303
    Try adam learning rate of: 0.007 and e: 0.01 and batch size: 32
    accuracy: 0.8785 - precision: 0.9361 - recall: 0.8098
    Try adam learning rate of: 0.007 and e: 0.1 and batch size: 32
    accuracy: 0.8725 - precision: 0.8619 - recall: 0.8841
    Try adam learning rate of: 0.008 and e: 1e-06 and batch size: 32
    accuracy: 0.8990 - precision: 0.9207 - recall: 0.8709
    Try adam learning rate of: 0.008 and e: 1e-05 and batch size: 32
    accuracy: 0.9013 - precision: 0.9177 - recall: 0.8793
    Try adam learning rate of: 0.008 and e: 0.0001 and batch size: 32
    accuracy: 0.8982 - precision: 0.9217 - recall: 0.8681
    Try adam learning rate of: 0.008 and e: 0.001 and batch size: 32
    accuracy: 0.8899 - precision: 0.9370 - recall: 0.8335
    Try adam learning rate of: 0.008 and e: 0.01 and batch size: 32
    accuracy: 0.8873 - precision: 0.9214 - recall: 0.8442
    Try adam learning rate of: 0.008 and e: 0.1 and batch size: 32
    accuracy: 0.8616 - precision: 0.8165 - recall: 0.9293
    Try adam learning rate of: 0.009000000000000001 and e: 1e-06 and batch size: 32
    accuracy: 0.8947 - precision: 0.9306 - recall: 0.8507
    Try adam learning rate of: 0.009000000000000001 and e: 1e-05 and batch size: 32
    accuracy: 0.8964 - precision: 0.9372 - recall: 0.8474
    Try adam learning rate of: 0.009000000000000001 and e: 0.0001 and batch size: 32
    accuracy: 0.8854 - precision: 0.8317 - recall: 0.9634
    Try adam learning rate of: 0.009000000000000001 and e: 0.001 and batch size: 32
    accuracy: 0.8859 - precision: 0.9482 - recall: 0.8138
    Try adam learning rate of: 0.009000000000000001 and e: 0.01 and batch size: 32
    accuracy: 0.8776 - precision: 0.9427 - recall: 0.8015
    Try adam learning rate of: 0.009000000000000001 and e: 0.1 and batch size: 32
    accuracy: 0.8689 - precision: 0.9006 - recall: 0.8262
```

> The best accuracy (0.9041) is reached with a learning rate of 0.004 and an epsilon of 1e-05.
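
Picking the best run from such a log can be automated. The following is a minimal sketch, not the procedure used in the notebook: the function `parse_grid_log` and both regular expressions are illustrative, and `LOG` holds only three entries copied from the results above.

```python
import re

# Short excerpt copied from the log above; the full log has the same shape.
LOG = """\
Try adam learning rate of: 0.003 and e: 1e-06 and batch size: 32
accuracy: 0.8919 - precision: 0.8762 - recall: 0.9101
Try adam learning rate of: 0.004 and e: 1e-05 and batch size: 32
accuracy: 0.9041 - precision: 0.9013 - recall: 0.9477
Try adam learning rate of: 0.006 and e: 0.0001 and batch size: 32
accuracy: 0.9013 - precision: 0.9047 - recall: 0.8947
"""

HEADER = re.compile(r"learning rate of: (\S+) and e: (\S+) and batch size: (\d+)")
METRICS = re.compile(r"accuracy: (\S+) - precision: (\S+) - recall: (\S+)")


def parse_grid_log(log):
    """Return one dict per metrics line, attached to the most recent header line."""
    runs, current = [], None
    for line in log.splitlines():
        header = HEADER.search(line)
        metrics = METRICS.search(line)
        if header:
            current = {
                "lr": float(header.group(1)),
                "epsilon": float(header.group(2)),
                "batch": int(header.group(3)),
            }
        elif metrics and current is not None:
            acc, prec, rec = (float(v) for v in metrics.groups())
            runs.append({**current, "accuracy": acc, "precision": prec, "recall": rec})
    return runs


best = max(parse_grid_log(LOG), key=lambda run: run["accuracy"])
print(best)
```

Ranking by accuracy is reasonable here because the classes are balanced. On an imbalanced data set another key, such as recall, might be the better choice.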

# Evaluate Batch size

Because the best batch size depends on the data set size, we ran this search on the full 200k data set.

We used the optimized learning rate and epsilon value from above.
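
One way to see why the batch size interacts with the data set size is to count the gradient updates per epoch. The sketch below assumes the 60/40 split used earlier on the 200k rows and takes the batch sizes from the results reported in this notebook. The helper `steps_per_epoch` is illustrative.

```python
import math

N_ROWS = 200_000   # size of the full data set
TEST_PERCENT = 40  # test_size=0.4 in train_test_split
n_train = N_ROWS - N_ROWS * TEST_PERCENT // 100


def steps_per_epoch(n_samples, batch_size):
    """Number of gradient updates per epoch, counting a final partial batch."""
    return math.ceil(n_samples / batch_size)


for batch in [32, 48, 64, 128, 256, 512]:
    print(f"batch size {batch:>3}: {steps_per_epoch(n_train, batch)} steps per epoch")
```

Larger batches mean fewer updates per epoch at a fixed learning rate, so a setting tuned on the 20k sample does not necessarily transfer to the full data set.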

In [None]:
import numpy as np
import itertools
EPOCHS = 5

ADAM_LEARNING_RATE = 0.004
ADAM_EPSILON = 1e-05


METRICS = [
      tf.keras.metrics.BinaryAccuracy(name='accuracy'),
      tf.keras.metrics.Precision(name='precision'),
      tf.keras.metrics.Recall(name='recall'),
]

batch_values = [32, 48, 64, 128, 256, 512]

for batch in batch_values:
  model: tf.keras.Model = build_model()

  print(f"Try batch size: {batch}")

  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=ADAM_LEARNING_RATE, epsilon=ADAM_EPSILON),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=METRICS)
    
  model.fit(X_train, y_train, epochs=EPOCHS, batch_size=batch)
  model.evaluate(X_test, y_test)

# Evaluate

After training and evaluating the model for different batch sizes, we get the following results.


```
    Try batch size: 32
    accuracy: 0.9117 - precision: 0.9247 - recall: 0.8959
    Try batch size: 48
    accuracy: 0.9221 - precision: 0.9399 - recall: 0.9011
    Try batch size: 64
    accuracy: 0.9227 - precision: 0.9168 - recall: 0.9295
    Try batch size: 128
    accuracy: 0.8965 - precision: 0.8673 - recall: 0.9370
    Try batch size: 256
    accuracy: 0.9189 - precision: 0.9068 - recall: 0.9333
    Try batch size: 512
    accuracy: 0.8771 - precision: 0.8654 - recall: 0.8940
```

The highest accuracy (0.9227) is reached with batch size 64, so we use this batch size for the final training.
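
Batch sizes 48 and 64 are very close in accuracy, so it can help to look at a metric that combines precision and recall. The sketch below computes the F1 score from the values listed above. F1 is not reported in the notebook itself, so this is an additional check rather than part of the original evaluation.

```python
# (precision, recall) per batch size, copied from the results above
results = {
    32: (0.9247, 0.8959),
    48: (0.9399, 0.9011),
    64: (0.9168, 0.9295),
    128: (0.8673, 0.9370),
    256: (0.9068, 0.9333),
    512: (0.8654, 0.8940),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_by_batch = {batch: f1_score(p, r) for batch, (p, r) in results.items()}
best_batch = max(f1_by_batch, key=f1_by_batch.get)

for batch, f1 in f1_by_batch.items():
    print(f"batch size {batch:>3}: F1 = {f1:.4f}")
print("best by F1:", best_batch)
```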