# Problem statement
- Text Classification using TF-BERT (distilled-bert)
- Not performing text pre-processing
- Using reduced dataset as it needs high computation

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import re

from sklearn.utils import shuffle
from sklearn.metrics import f1_score, roc_auc_score, precision_score, recall_score

from datasets import load_dataset, load_metric

In [2]:
warnings.filterwarnings("ignore")

In [3]:
import pandas as pd
from datasets import load_dataset
import tensorflow as tf
import numpy as np
from transformers import TFAutoModelForTokenClassification, AutoTokenizer
from transformers import DataCollatorForTokenClassification
from tensorflow.keras import layers, losses, optimizers
from tensorflow.keras.utils import plot_model
import mlflow
from tqdm import tqdm

# Logging
- MLFlow logging used to perform model monitoring
- `CLASSIFICATION_TF_NEWS` used for experiment name


In [4]:
mlflow.get_tracking_uri()
mlflow.set_experiment("CLASSIFICATION_TF_NEWS")
experiment_id = mlflow.get_experiment_by_name("CLASSIFICATION_TF_NEWS").experiment_id
mlflow.start_run(experiment_id=experiment_id)

  cmdline: git version
  stderr: 'xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun'


<ActiveRun: >

In [5]:
true_df = pd.read_csv("./True.csv")
fake_df = pd.read_csv("./Fake.csv")

In [6]:
true_df["label"] = 1
fake_df["label"] = 0

# Dataset creation
- Shuffled dataset and reduced to 10000
- Split dataset 80:20 Train:Test
- Model can be trained using `Title` or `Text` (I am going to use `Title` attribute due to computational complexity)

In [7]:
df = shuffle(pd.concat([true_df, fake_df], axis=0)[["title","label"]], random_state=42)

In [8]:
df["title_word_count"] = df.title.str.lower().apply(lambda x: len(x.split()))
# df["text_word_count"] = df.text.str.lower().apply(lambda x: len(x.split()))

- Perform basic descriptive stats
- removing titles having word length < 8

In [10]:
df.title_word_count.describe(percentiles=[0.01,.1,.25,.50,.75,.90,.99])

count    44898.000000
mean        12.453472
std          4.111476
min          1.000000
1%           6.000000
10%          8.000000
25%         10.000000
50%         11.000000
75%         14.000000
90%         18.000000
99%         26.000000
max         42.000000
Name: title_word_count, dtype: float64

In [13]:
# df = df[(df.text_word_count>80)&(df.text_word_count<1580)].iloc[:2000]
df = df[df.title_word_count>8].iloc[:2000]
train_index = int(df.shape[0]*0.8)

train_df = df.iloc[:train_index]
test_df = df.iloc[train_index:]

test_df.shape, train_df.shape

((400, 3), (1600, 3))

In [14]:
df

Unnamed: 0,title,label,title_word_count
799,BREAKING: GOP Chairman Grassley Has Had Enoug...,0,11
6500,Failed GOP Candidates Remembered In Hilarious...,0,9
3590,Mike Pence’s New DC Neighbors Are HILARIOUSLY...,0,14
1377,California AG pledges to defend birth control ...,1,9
11059,AZ RANCHERS Living On US-Mexico Border Destroy...,0,19
...,...,...,...
20464,Trump on Hurricane Irma: 'This is some big mon...,1,9
1165,Trump does not support Alexander-Murray health...,1,9
14501,Lebanon FM says Hariri crisis an attempt to cr...,1,11
9142,WATCH: SENATOR RAND PAUL Calls For Investigati...,0,22


In [15]:
test_df.to_csv("./dataset/test.csv", index=False)
train_df.to_csv("./dataset/train.csv", index=False)

In [16]:
raw_datasets = load_dataset('csv', data_files={
    'train': './dataset/train.csv',
    'test': './dataset/test.csv'
})


Downloading and preparing dataset csv/default to /Users/dhruv/.cache/huggingface/datasets/csv/default-2b1708d65409e326/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /Users/dhruv/.cache/huggingface/datasets/csv/default-2b1708d65409e326/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

# Model Loading
- Using `distilbert-base-uncased` tokenizer for training text classification
- `tokenizer` used sub-word tokenization trained with `distilbert-base-uncased` 

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [21]:
def preprocess_function(examples):
    return tokenizer(examples["title"], truncation=True)

In [22]:
tokenized_data = raw_datasets.map(preprocess_function, batched=True)

Map:   0%|          | 0/1600 [00:00<?, ? examples/s]

Map:   0%|          | 0/400 [00:00<?, ? examples/s]

In [23]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

In [24]:
id2label = {0: "FAKE", 1: "TRUE"}
label2id = {"FAKE": 0, "TRUE": 1}

In [36]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

- Creating `train` & `test` dataset

In [37]:
batch_size = 32
tf_train_set = model.prepare_tf_dataset(
    tokenized_data["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
    tokenized_data["test"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

# Model Summary
- It is the combined model having bert model with classification head

In [38]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
 dropout_39 (Dropout)        multiple                  0         
                                                                 
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
_________________________________________________________________


In [39]:
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(tf.nn.softmax(predictions), axis=1)
#     print(labels.numpy())
    f1_ = f1_score(labels, predictions, zero_division=0)
    roc_auc_ = roc_auc_score(labels.numpy(), predictions)
    precision_ = precision_score(labels.numpy(), predictions, zero_division=0)
    recall_ = recall_score(labels.numpy(), predictions, zero_division=0)
    return {
        "precision": precision_,
        "recall": recall_,
        "f1": f1_,
        "roc_auc": roc_auc_,
    }

In [40]:
loss_fn = losses.CategoricalCrossentropy()
optimizer = optimizers.AdamW(learning_rate=0.0001)



# Training
- epochs 10
- Logging all the training and validation parameters
- I have stopped model at 5th epoch

In [41]:
epochs = 10

for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    train_loss_list = []
    train_precision_list = []
    train_recall_list = []
    train_f1_list = []
    train_accuracy_list = []
    
    test_loss_list = []
    test_precision_list = []
    test_recall_list = []
    test_f1_list = []
    test_accuracy_list = []
    for step, (x_batch, y_batch) in tqdm(enumerate(tf_train_set)):
        with tf.GradientTape() as tape:
            # Forward pass
            y_pred = model(x_batch, training=True)[0]
            # Compute loss
            loss_value = loss_fn(tf.one_hot(y_batch, 2), y_pred)
            metric = compute_metrics((y_pred, y_batch))
            
        # Compute gradients
        gradients = tape.gradient(loss_value, model.trainable_variables)
        # Update weights using the optimizer
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        # Print the loss for monitoring
        
        train_precision_list.append(metric["precision"])
        train_recall_list.append(metric["recall"])
        train_f1_list.append(metric["f1"])
        train_accuracy_list.append(metric["roc_auc"])
        train_loss_list.append(loss_value)
        train_loss_list.append(loss_value.numpy())
    
    print(f"Train-Step {epoch + 1}, Loss: {np.mean(train_loss_list):.4f}, Prec: {np.mean(train_precision_list):.4f} Recall: {np.mean(train_recall_list):.4f}, F1: {np.mean(train_f1_list):.4f},  Acc: {np.mean(train_accuracy_list):.4f}" )
    mlflow.log_metric("Train Loss",np.mean(train_loss_list),step=epoch)
    mlflow.log_metric("Train Precision",np.mean(train_precision_list),step=epoch)
    mlflow.log_metric("Train Recall",np.mean(train_recall_list),step=epoch)
    mlflow.log_metric("Train F1 Score",np.mean(train_f1_list),step=epoch)
    mlflow.log_metric("Train ROC_AUC",np.mean(train_accuracy_list),step=epoch)
    
    for step, (x_batch, y_batch) in enumerate(tf_test_set):
        y_pred = model(x_batch, training=False)[0]
            # Compute loss
        loss_value = loss_fn(tf.one_hot(y_batch, 2), y_pred)
        metric = compute_metrics((y_pred, y_batch))
        test_precision_list.append(metric["precision"])
        test_recall_list.append(metric["recall"])
        test_f1_list.append(metric["f1"])
        test_accuracy_list.append(metric["roc_auc"])
        test_loss_list.append(loss_value)
        
    print(f"Test-Step {epoch + 1}, Loss: {np.mean(test_loss_list):.4f}, Prec: {np.mean(test_precision_list):.4f} Recall: {np.mean(test_recall_list):.4f}, F1: {np.mean(test_f1_list):.4f},  Acc: {np.mean(test_accuracy_list):.4f}" )
    mlflow.log_metric("Test Loss",np.mean(test_loss_list),step=epoch)
    mlflow.log_metric("Test Precision",np.mean(test_recall_list),step=epoch)
    mlflow.log_metric("Test Recall",np.mean(test_recall_list),step=epoch)
    mlflow.log_metric("Test F1 Score",np.mean(test_f1_list),step=epoch)
    mlflow.log_metric("Test ROC_AUC",np.mean(test_accuracy_list),step=epoch)
    


Epoch 1/10


50it [00:30,  1.65it/s]


Train-Step 1, Loss: 1.2283, Prec: 0.7719 Recall: 0.7154, F1: 0.7042,  Acc: 0.7899
Test-Step 1, Loss: 0.2919, Prec: 0.8574 Recall: 0.9686, F1: 0.9070,  Acc: 0.9172
Epoch 2/10


50it [00:30,  1.62it/s]


Train-Step 2, Loss: 0.2339, Prec: 0.9180 Recall: 0.9716, F1: 0.9415,  Acc: 0.9539
Test-Step 2, Loss: 0.1284, Prec: 0.9680 Recall: 0.9440, F1: 0.9543,  Acc: 0.9594
Epoch 3/10


50it [00:29,  1.68it/s]


Train-Step 3, Loss: 0.0797, Prec: 0.9733 Recall: 0.9842, F1: 0.9779,  Acc: 0.9818
Test-Step 3, Loss: 0.2604, Prec: 0.9525 Recall: 0.9724, F1: 0.9613,  Acc: 0.9670
Epoch 4/10


50it [00:29,  1.68it/s]


Train-Step 4, Loss: 0.0398, Prec: 0.9969 Recall: 0.9919, F1: 0.9942,  Acc: 0.9938
Test-Step 4, Loss: 0.2645, Prec: 0.9454 Recall: 0.9820, F1: 0.9622,  Acc: 0.9664
Epoch 5/10


50it [00:31,  1.61it/s]


Train-Step 5, Loss: 0.0253, Prec: 0.9951 Recall: 0.9940, F1: 0.9942,  Acc: 0.9956
Test-Step 5, Loss: 0.2046, Prec: 0.9688 Recall: 0.9454, F1: 0.9549,  Acc: 0.9601
Epoch 6/10


5it [00:03,  1.38it/s]


KeyboardInterrupt: 

# Model Saving
- Save model to model directory
- Now, you can load model in production

In [43]:
model.save_pretrained("./model/tf_model")

# Inference
- Dataset size is quite small and not much diverse
- If we increase dataset then model can start converging 
- Model trained using custom model training method to control training more precisely
- Model is showing high accuracy due to small dataset size and less diversity
- EDA can be seen in `Text Classification | NB Classifier` notebook
- All the training has been logged in MLFlow logger

# Future Work
- We can try different methods of model training like using `huggingface Trainer API`, `Tensorflow Sequential API`
- By increasing the dataset, it can perform better
- We can try training model with text description 
