<a href="https://colab.research.google.com/github/adimyth/datascience_stuff/blob/master/nlp/BertTextClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading Data

In [1]:
import pandas as pd
from tensorflow.keras.utils import to_categorical

In [None]:
!wget --no-check-certificate "https://drive.google.com/uc?export=download&id=1fYV-PPmnJMkW5m9T1WoinDfI7x4P3xDV" -O Fake.zip

In [None]:
!unzip Fake.zip

In [None]:
!wget --no-check-certificate "https://drive.google.com/uc?export=download&id=1VcIG3ZwM1Ab6v9_yYISMPvsWxufL4g2I" -O True.zip

In [None]:
!unzip True.zip

Target labels:

* `1`: true news
* `0`: fake news

In [2]:
true = pd.read_csv("True.csv")
true["target"] = [1]*true.shape[0]

In [3]:
fake = pd.read_csv("Fake.csv")
fake["target"] = [0]*fake.shape[0]

In [4]:
df = pd.concat([true, fake])

In [5]:
df = df.sample(frac=1).reset_index(drop=True)

In [6]:
df.head()

| | title | text | subject | date | target |
|---|---|---|---|---|---|
| 0 | Thousands of Romanians mourn former king Michael | BUCHAREST (Reuters) - Thousands of Romanians l... | worldnews | December 16, 2017 | 1 |
| 1 | What Twitter Just Did To Right-Wingers Makes ... | Right-wing pundits on Fox and Friends this wee... | News | January 2, 2016 | 0 |
| 2 | Falling Apart: West’s Media-Driven Deception i... | 21st Century Wire says The wagon wheels are al... | Middle-east | August 16, 2016 | 0 |
| 3 | Nigeria flies migrants home from Libya after s... | LAGOS (Reuters) - Nigeria s president said on ... | worldnews | November 29, 2017 | 1 |
| 4 | Trump escalates attacks on judge in Trump Univ... | (Reuters) - Republican presidential candidate ... | politicsNews | June 3, 2016 | 1 |


In [7]:
df['target'].value_counts()

0    23481
1    21417
Name: target, dtype: int64

# Fake News Classification

* [Fake and real news dataset | Kaggle](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset)
* [Michael Kazachok's Jigsaw Toxic Classification Kernel](https://www.kaggle.com/miklgr500/jigsaw-tpu-bert-with-huggingface-and-keras)


In [None]:
!pip install transformers

In [8]:
import os
import warnings

import numpy as np
import pickle
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras import backend as K
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint

import transformers
from transformers import BertTokenizerFast, TFAutoModel
from tokenizers import BertWordPieceTokenizer
import traitlets

from tqdm.notebook import tqdm
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.model_selection import train_test_split

warnings.simplefilter("ignore")

In [9]:
RANDOM_SEED = 42

The Tokenizers library by Hugging Face (implemented in Rust) provides a significantly faster BERT WordPiece tokenizer than the pure-Python one in the Transformers library.

In [10]:
def fast_encode(texts, tokenizer, chunk_size=256, maxlen=512):
    # truncates sentences longer than maxlen
    tokenizer.enable_truncation(max_length=maxlen)
    # right side padding till maxlen
    tokenizer.enable_padding(length=maxlen)
    all_ids = []
    
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    
    return np.array(all_ids)
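
To make the effect of `enable_truncation` and `enable_padding` concrete, here is a minimal pure-Python sketch of the same idea. The helper `pad_or_truncate` is hypothetical (it is not part of the Tokenizers library). The IDs 101, 102 and 0 for `[CLS]`, `[SEP]` and `[PAD]` are the ones that appear in this notebook's outputs. The sketch assumes the length limit counts the two special tokens.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids seen in this notebook's outputs

def pad_or_truncate(token_ids, maxlen):
    """Hypothetical helper: wrap ids in [CLS]/[SEP], then fix the length."""
    # leave room for the two special tokens
    body = token_ids[: maxlen - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    # right-side padding up to maxlen
    return ids + [PAD_ID] * (maxlen - len(ids))

# a short input gets padded, a long input gets truncated
short = pad_or_truncate([6251, 2028, 1012, 6251, 2048], maxlen=10)
long = pad_or_truncate(list(range(1000, 1020)), maxlen=10)
print(short)
print(long)
```

Either way, every row ends up with exactly `maxlen` entries, which is why `fast_encode` can stack the results into one 2D array.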

BERT takes three inputs: input IDs, attention masks, and token type IDs `(ids, masks, and tokens)`. The Hugging Face implementation lets us pass only `ids` to the model and it still works. The `mask` and `token_type_ids` inputs are mainly useful for tasks with composite input, such as sentence pairs.

## Using Tokenizers Library

In [18]:
sentences = ["This is a test sentence",
             "Random sentences are difficult to write",
             "Maybe should have copied from Wikipedia",
             "Sentence one. Sentence two"
             ]

In [None]:
temp_encoding = fast_encode(pd.Series(sentences), fast_tokenizer)

The `fast_tokenizer` object (and the `vocab` table used for decoding) is created further below, in the Fast Tokenizer and Encode sections, so run those cells first.

In [None]:
temp_encoding.shape

(4, 512)

In [None]:
temp_encoding[3][:30]

array([ 101, 6251, 2028, 1012, 6251, 2048,  102,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0])

In [None]:
print(" ".join(vocab['Word'].iloc[x] for x in temp_encoding[3][:30]), end="\n\n")    

[CLS] sentence one . sentence two [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]



## Using Transformer Library

* **Input IDs**: Token indices, the numerical representations of the tokens that make up the input sequence.
* **Attention Mask**: A binary tensor marking which positions hold real tokens (1) and which hold padding (0), so that the model does not attend to the padding.
* **Token Type IDs**: Some tasks require two different sequences to be encoded in the same input IDs. The sequences are separated with the special `[SEP]` token, and the token type IDs mark which sequence each token belongs to.
* **Segment IDs**: Another name for token type IDs. BERT needs to know where one sequence ends and another begins, so this binary mask identifies the different sequences in the input.
* **Position IDs**: Used by the model to identify which token is at which position.
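
As a rough illustration of how the attention mask and token type IDs relate to the input IDs, the sketch below derives both from a padded ID list. The helper names are hypothetical, the IDs (`[CLS]` = 101, `[SEP]` = 102, `[PAD]` = 0) come from the outputs above, and the sentence-pair input is made up for the example.

```python
PAD_ID, SEP_ID = 0, 102  # ids seen in this notebook's outputs

def attention_mask(input_ids):
    # 1 for real tokens, 0 for padding
    return [0 if i == PAD_ID else 1 for i in input_ids]

def token_type_ids(input_ids):
    # 0 up to and including the first [SEP], 1 afterwards; padding stays 0
    types, segment = [], 0
    for i in input_ids:
        types.append(0 if i == PAD_ID else segment)
        if i == SEP_ID:
            segment = 1
    return types

# single sentence: "[CLS] sentence one . sentence two [SEP]" plus padding
single = [101, 6251, 2028, 1012, 6251, 2048, 102, 0, 0, 0]
# made-up pair: "[CLS] sentence one [SEP] sentence two [SEP]" plus padding
pair = [101, 6251, 2028, 102, 6251, 2048, 102, 0, 0]

print(attention_mask(single))
print(token_type_ids(single))
print(token_type_ids(pair))
```

For single-sequence inputs like the news articles here, the token type IDs are all zeros, which is one reason passing only the input IDs is workable.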

In [11]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

In [19]:
print(tokenizer.encode_plus(sentences[0]))

{'input_ids': [101, 1188, 1110, 170, 2774, 5650, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}


In [20]:
tokenizer.decode([101, 1188, 1110, 170, 2774, 5650, 102])

'[CLS] This is a test sentence [SEP]'

## Model

In [12]:
def build_model(transformer, loss='binary_crossentropy', max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    # last hidden state : (batch_size, sequence_length, hidden_size)
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    x = Dropout(0.35)(cls_token)
    out = Dense(2, activation='softmax', name="classifier_layer")(x)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(learning_rate=3e-5), loss=loss, metrics=[tf.keras.metrics.AUC()])
    
    return model

For a classification task we are only interested in BERT's output for the `[CLS]` token.
The BERT output is a 3D tensor of shape `(num_sentences, num_tokens, num_hidden_units)`, and `[CLS]` is always the first token.

Since we want the `[CLS]` vector of every sentence, that is `[all_rows, [CLS], all_hidden_units]`, we slice with `[ :, 0, : ]`.


![BERT](https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png)
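
The same selection can be illustrated without TensorFlow. The sketch below uses plain nested lists as a stand-in for the `(num_sentences, num_tokens, num_hidden_units)` tensor. The tiny shape (2, 3, 4) and the values are made up for illustration.

```python
# toy "BERT output": 2 sentences, 3 tokens each, 4 hidden units per token
num_sentences, num_tokens, num_hidden = 2, 3, 4
sequence_output = [
    [[s * 100 + t * 10 + h for h in range(num_hidden)] for t in range(num_tokens)]
    for s in range(num_sentences)
]

# equivalent of sequence_output[:, 0, :] on a tensor:
# for every sentence, keep only the vector of token 0 ([CLS])
cls_vectors = [sentence[0] for sentence in sequence_output]

print(cls_vectors)
```

The result has one vector of length `num_hidden` per sentence, matching the `(None, 768)` shape that feeds the dropout and dense layers in `build_model`.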

## TPU Configs

In [None]:
# Create strategy from tpu
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

In [14]:
AUTO = tf.data.experimental.AUTOTUNE
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

In [15]:
BATCH_SIZE

128

## Fast Tokenizer

In [16]:
# First load the real tokenizer
tokenizer = transformers.BertTokenizerFast.from_pretrained('bert-base-uncased')

# Save the loaded tokenizer locally
save_path = 'distilbert_base_uncased/'
if not os.path.exists(save_path):
    os.makedirs(save_path)
tokenizer.save_pretrained(save_path)

# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('distilbert_base_uncased/vocab.txt', lowercase=True)

**vocab.txt** - Stores one token per line; the line number is the token's index

**special_tokens_map.json** - Stores special tokens and their meaning, such as:
* `[UNK]` - Unknown token
* `[SEP]` - Separator token
* `[CLS]` - Classifier token used in BERT
* `[MASK]` - Token used for masking

**tokenizer_config.json** - Stores configuration options such as lowercasing and maximum length

## Train Test Split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], 
                                                    test_size=0.2, shuffle=True,
                                                    random_state=RANDOM_SEED)

In [18]:
X_train.shape, y_train.shape

((35918,), (35918,))

In [19]:
X_test.shape, y_test.shape

((8980,), (8980,))

## Encode

In [20]:
X_train_tokenized = fast_encode(X_train.astype(str), fast_tokenizer, maxlen=512)

In [21]:
X_train.shape, X_train_tokenized.shape

((35918,), (35918, 512))

In [22]:
# read the vocabulary, one token per line, and close the file afterwards
with open("distilbert_base_uncased/vocab.txt", "r") as f:
    out = f.read().splitlines()
vocab = pd.DataFrame({"Word": out})

In [23]:
vocab.head()

| | Word |
|---|---|
| 0 | [PAD] |
| 1 | [unused0] |
| 2 | [unused1] |
| 3 | [unused2] |
| 4 | [unused3] |


In [24]:
X_train_tokenized[10][:50]

array([  101,  1999,  2010,  1056, 28394,  2102,  1010,  2343,  8398,
        3855,  2008,  2002,  3764,  2148,  4420,  1055,  2343,  4231,
        1998,  2008,  2002,  6727,  1996,  2343,  2008,  2146,  3806,
        3210,  2024,  5716,  1010,  8131,  2008,  1996, 17147,  2024,
        2551,  1012,  1045,  3764,  2007,  2343,  4231,  1997,  2148,
        4420,  2197,  2305,  1012,  2356])

In [25]:
def print_texts(idx):
    print("="*30+"ORIGINAL TEXT"+"="*30)
    print(X_train.iloc[idx][:100], end="\n\n")
    print("="*30+"TOKENIZED FORM"+"="*30)
    print(X_train_tokenized[idx][:12], end="\n\n")
    print("="*30+"RECONSTRUCTED "+"="*30)
    print(" ".join(vocab['Word'].iloc[x] for x in X_train_tokenized[idx][:30]), end="\n\n")    

In [26]:
print_texts(0)

MANCHESTER, England (Reuters) - British Prime Minister Theresa May said on Wednesday she understood 

[  101  5087  1010  2563  1006 26665  1007  1011  2329  3539  2704 14781]

[CLS] manchester , england ( reuters ) - british prime minister theresa may said on wednesday she understood that some are finding the br ##ex ##it talks frustrating but that



In [27]:
X_test_tokenized = fast_encode(X_test.astype(str), fast_tokenizer, maxlen=512)

## Tensorflow Datasets

In [28]:
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((X_train_tokenized, to_categorical(y_train)))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

In [29]:
test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(X_test_tokenized)
    .batch(BATCH_SIZE)
)

## Focal Loss

In [30]:
def focal_loss(gamma=2., alpha=.2):
    def focal_loss_fixed(y_true, y_pred):
        pt_1 = tf.where(tf.equal(y_true, 1), y_pred, tf.ones_like(y_pred))
        pt_0 = tf.where(tf.equal(y_true, 0), y_pred, tf.zeros_like(y_pred))
        return -K.mean(alpha * K.pow(1. - pt_1, gamma) * K.log(pt_1)) - K.mean((1 - alpha) * K.pow(pt_0, gamma) * K.log(1. - pt_0))
    return focal_loss_fixed

The positive and negative classes are fairly balanced here, so plain binary cross-entropy could be used instead.
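
To see what the focal weighting does, the sketch below evaluates the per-element term of `focal_loss_fixed` next to plain cross-entropy for a few predicted probabilities. It is a scalar illustration only: the real loss averages these terms over all elements with `K.mean`, and the helper names here are hypothetical. Probabilities must lie strictly between 0 and 1 for the logarithms to be finite.

```python
import math

def focal_term(y_true, p, gamma=2.0, alpha=0.2):
    """Per-element focal loss, mirroring the two branches of focal_loss_fixed."""
    if y_true == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)

def cross_entropy_term(y_true, p):
    """Per-element binary cross-entropy for comparison."""
    return -math.log(p) if y_true == 1 else -math.log(1.0 - p)

# positive example predicted with increasing confidence
for p in (0.6, 0.9, 0.99):
    print(p, round(cross_entropy_term(1, p), 4), round(focal_term(1, p), 6))
```

For a positive example the focal term is the cross-entropy scaled by `alpha * (1 - p) ** gamma`, so confident, correct predictions contribute very little and training focuses on the harder examples.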

In [31]:
%%time
with strategy.scope():
    transformer_layer = transformers.TFBertModel.from_pretrained('bert-base-uncased')
    model = build_model(transformer_layer, loss=focal_loss(gamma=1.5), max_len=512)
model.summary()

- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_word_ids (InputLayer)  [(None, 512)]             0         
_________________________________________________________________
tf_bert_model (TFBertModel)  ((None, 512, 768), (None, 109482240 
_________________________________________________________________
tf_op_layer_strided_slice (T [(None, 768)]             0         
_________________________________________________________________
dropout_37 (Dropout)         (None, 768)               0         
_________________________________________________________________
classifier_layer (Dense)     (None, 2)                 1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
CPU times: user 13.5 s, sys: 8.02 s, total: 21.5 s
Wall time: 45.6 s


## LrScheduler

***Adam*** adapts the per-parameter step size on its own; however, an additional learning rate schedule (warm-up followed by decay) can still be helpful.

In [32]:
def build_lrfn(lr_start=0.000001, lr_max=0.000002, 
               lr_min=0.0000001, lr_rampup_epochs=7, 
               lr_sustain_epochs=0, lr_exp_decay=.87):
    lr_max = lr_max * strategy.num_replicas_in_sync

    def lrfn(epoch):
        if epoch < lr_rampup_epochs:
            lr = (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        elif epoch < lr_rampup_epochs + lr_sustain_epochs:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) * lr_exp_decay**(epoch - lr_rampup_epochs - lr_sustain_epochs) + lr_min
        return lr
    
    return lrfn
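
To get a feel for the schedule, the sketch below repeats the same logic with the replica count passed in explicitly, so it runs without a TPU. The `num_replicas` parameter is an addition for this sketch. The value 8 is inferred from `BATCH_SIZE = 16 * strategy.num_replicas_in_sync` being 128 above.

```python
def build_lrfn(num_replicas, lr_start=0.000001, lr_max=0.000002,
               lr_min=0.0000001, lr_rampup_epochs=7,
               lr_sustain_epochs=0, lr_exp_decay=.87):
    # same logic as above, with the replica count as an explicit argument
    lr_max = lr_max * num_replicas

    def lrfn(epoch):
        if epoch < lr_rampup_epochs:
            # linear warm-up from lr_start to lr_max
            lr = (lr_max - lr_start) / lr_rampup_epochs * epoch + lr_start
        elif epoch < lr_rampup_epochs + lr_sustain_epochs:
            # optional plateau at lr_max
            lr = lr_max
        else:
            # exponential decay towards lr_min
            lr = (lr_max - lr_min) * lr_exp_decay**(epoch - lr_rampup_epochs - lr_sustain_epochs) + lr_min
        return lr

    return lrfn

lrfn = build_lrfn(num_replicas=8)  # assumption: 8 TPU cores
for epoch in range(10):
    print(epoch, lrfn(epoch))
```

With the defaults, the rate climbs linearly for 7 epochs and only then starts to decay, so a 3-epoch run stays entirely inside the warm-up phase.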

## Model Training

In [33]:
NUM_STEPS = X_train.shape[0] // BATCH_SIZE
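
Because `train_dataset` uses `.repeat()`, it never runs out of data, so Keras needs `steps_per_epoch` to know where an epoch ends. A quick check of the arithmetic, using the training set size and batch size reported above:

```python
num_train = 35918      # X_train.shape[0] reported above
batch_size = 128       # BATCH_SIZE reported above

steps_per_epoch = num_train // batch_size
# samples that do not fill a whole batch within one pass
leftover = num_train - steps_per_epoch * batch_size
print(steps_per_epoch, leftover)  # 280 78
```

Integer division means each "epoch" covers 280 full batches; the remaining 78 samples simply roll over into the next epoch of the repeated dataset.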

In [34]:
lrfn = build_lrfn()
lr_schedule = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=1)

train_history = model.fit(
    train_dataset,
    steps_per_epoch=NUM_STEPS,
    callbacks=[lr_schedule],
    epochs=3
)


Epoch 00001: LearningRateScheduler reducing learning rate to 1e-06.
Epoch 1/3

Epoch 00002: LearningRateScheduler reducing learning rate to 3.142857142857143e-06.
Epoch 2/3

Epoch 00003: LearningRateScheduler reducing learning rate to 5.285714285714285e-06.
Epoch 3/3


In [35]:
y_pred = model.predict(test_dataset)

In [36]:
y_pred = np.argmax(y_pred, axis=1)

In [37]:
report_dict = classification_report(y_test, y_pred, output_dict=True)

In [38]:
pd.DataFrame(report_dict)

| | 0 | 1 | accuracy | macro avg | weighted avg |
|---|---|---|---|---|---|
| precision | 0.999572 | 0.999304 | 0.999443 | 0.999438 | 0.999443 |
| recall | 0.999358 | 0.999536 | 0.999443 | 0.999447 | 0.999443 |
| f1-score | 0.999465 | 0.99942 | 0.999443 | 0.999442 | 0.999443 |
| support | 4670.0 | 4310.0 | 0.999443 | 8980.0 | 8980.0 |


In [39]:
print(f"ROC AUC Score: {roc_auc_score(y_test, y_pred)}")

ROC AUC Score: 0.9994467822950461
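
Note that `y_pred` at this point holds the hard labels produced by `np.argmax`, so the score above reflects a single decision threshold rather than the full ranking of predictions. ROC AUC is usually computed from scores, for example the class-1 probabilities. The sketch below shows the difference on made-up data, using a small hand-written pairwise AUC (the `roc_auc` helper is hypothetical, not scikit-learn's implementation).

```python
def roc_auc(y_true, scores):
    """Pairwise ROC AUC: P(score_pos > score_neg), ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]                 # made-up labels
probs = [0.1, 0.6, 0.4, 0.9]          # made-up class-1 probabilities
hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholded labels

print(roc_auc(y_true, probs))  # 0.75
print(roc_auc(y_true, hard))   # 0.5
```

In this notebook the analogous change would be to keep the raw output of `model.predict` and pass its second column to `roc_auc_score`, assuming column 1 is the probability of the true-news class.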


## Evaluation

Test passage about Iran issuing an arrest warrant for Trump, taken from a Hindustan Times page.

In [40]:
test_sentence = """Iran has issued an arrest warrant and asked Interpol for help in detaining President Donald Trump and dozens of others it believes carried out the drone strike that killed a top Iranian general in Baghdad, a local prosecutor reportedly said Monday.
While Trump faces no danger of arrest, the charges underscore the heightened tensions between Iran and the United States since Trump unilaterally withdrew America from Tehran’s nuclear deal with world powers.
Tehran prosecutor Ali Alqasimehr said Trump and more than 30 others whom Iran accuses of involvement in the Jan. 3 strike that killed Gen. Qassem Soleimani in Baghdad face “murder and terrorism charges,” the semiofficial ISNA news agency reported.
"""

In [41]:
test_encoded = fast_encode(pd.Series([test_sentence]).astype(str), fast_tokenizer)

In [42]:
test_encoded.shape

(1, 512)

In [43]:
model.predict(test_encoded)

array([[0.36518136, 0.6348187 ]], dtype=float32)

The model classifies it as **TRUE** news, with a probability of about 0.63 for class 1.

US Election news taken from [NBC News](https://www.nbcnews.com/politics/2020-election/top-gop-senator-fears-trump-soft-independents-urges-shift-strategy-n1232024)

In [44]:
test_sentence = """
WASHINGTON — Senate Majority Whip John Thune sounded the alarm Wednesday that President Donald Trump's support among independent voters was "soft," urging a change in strategy in light of a New York Times survey that showed the president trailing Democrat Joe Biden by double digits.
The Times/Siena national poll found Biden winning 50 percent of registered voters, with Trump winning 36 percent. Among independents, Biden led by a substantial 18 points.
"Right now, obviously, Trump has a problem with the middle of the electorate, with independents, and they're the people who are undecided in national elections," Thune, R-S.D., told reporters in the Capitol. "I think he can win those back, but it'll probably require not only a message that deals with substance and policy but, I think, a message that conveys, perhaps, a different tone."
Trump won independent voters by 6 points in 2016, according to exit polls compiled by NBC News.
Asked whether the latest numbers were a wake-up call for the Trump campaign, Thune said, "It's a message that there needs to be a — certainly a change in probably strategy as far as the White House's messaging is concerned."
Thune said Trump could boost Republican Senate candidates if he could "perform better in terms of his own standing with the voters." He said the president is "in a bit of a low point right now, but as we all know in politics, in a short amount of time things can change."
"""

In [45]:
test_encoded = fast_encode(pd.Series([test_sentence]).astype(str), fast_tokenizer)

In [46]:
model.predict(test_encoded)

array([[0.0814704, 0.9185296]], dtype=float32)

The model classifies it as true news with high probability (about 0.92), which is correct!
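
Reading the two softmax outputs by eye works for single examples. A small helper like the hypothetical `interpret` below makes the mapping explicit, using the label encoding from the start of the notebook (0 = fake, 1 = true) and the two prediction vectors printed above.

```python
LABELS = {0: "fake", 1: "true"}  # label encoding used in this notebook

def interpret(probs):
    # index of the largest probability, as np.argmax would return
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[idx], probs[idx]

print(interpret([0.36518136, 0.6348187]))  # Iran / Interpol passage
print(interpret([0.0814704, 0.9185296]))   # NBC News passage
```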

## Saving Model

[Xhlulu's XLM Roberta Kernel | Kaggle](https://www.kaggle.com/xhlulu/jigsaw-tpu-xlm-roberta/comments)

Utility function to create a directory

In [47]:
def touch_dir(dirname):
    if not os.path.exists(dirname):
        os.makedirs(dirname)
        print(f"Created directory {dirname}.")
    else:
        print(f"Directory {dirname} already exists.")


**Transformer Model**

Use the `save_pretrained` method to save a model and its configuration file to a directory. It can be reloaded with the `from_pretrained` class method (here via `TFAutoModel.from_pretrained`).

**Custom Layers**

Then we save the weights of the custom classifier layer (`Dense`) separately.

In [48]:
def save_model(model, transformer_dir='transformer'):
    """
    Special function to save a keras model that uses a transformer layer
    """
    # layers[0] is the input layer; layers[1] is the transformer
    transformer = model.layers[1]
    touch_dir(transformer_dir)
    transformer.save_pretrained(transformer_dir)
    # weights (kernel and bias) of the softmax classifier head
    classifier_weights = model.get_layer('classifier_layer').get_weights()
    with open('classifier_layer.pkl', 'wb') as f:
        pickle.dump(classifier_weights, f)

In [49]:
save_model(model)

Directory transformer already exists.


## Reloading

Reload the transformer with the `from_pretrained` method and set the weights of the last layer from the saved pickle.

In [50]:
def load_model(transformer_dir='transformer', max_len=512):
    """
    Special function to load a keras model that uses a transformer layer
    """
    transformer = TFAutoModel.from_pretrained(transformer_dir)
    model = build_model(transformer, max_len=max_len)
    # restore the classifier head weights saved by save_model
    with open('classifier_layer.pkl', 'rb') as f:
        classifier_weights = pickle.load(f)
    model.get_layer('classifier_layer').set_weights(classifier_weights)

    return model
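
The classifier head is saved and restored with a plain pickle round trip. The sketch below shows that pattern in isolation, with nested lists standing in for the `[kernel, bias]` arrays that `get_weights()` returns. The values and the temporary directory are made up for illustration.

```python
import os
import pickle
import tempfile

# stand-in for layer.get_weights(): [kernel, bias] as plain lists
weights = [[[0.1, -0.2], [0.3, 0.4]], [0.0, 0.5]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier_layer.pkl")
    # save: context manager closes the file even if dump fails
    with open(path, "wb") as f:
        pickle.dump(weights, f)
    # load: what set_weights() would receive
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == weights)  # True
```

The pickle only carries the weights, not the layer itself, so the model has to be rebuilt with the same architecture (as `load_model` does through `build_model`) before `set_weights` is called.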

In [None]:
model = load_model()

Using the previous example

In [52]:
model.predict(test_encoded)

array([[0.08111261, 0.91888744]], dtype=float32)