# **Jigsaw Multilingual Toxic Comment Classification**

Use TPUs to identify toxic comments across multiple languages.

> [**Kaggle Dataset**](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/data)

In [None]:
# Install Kaggle.
!pip install --upgrade --force-reinstall --no-deps kaggle

In [None]:
# Upload kaggle.json (the Kaggle API token file).
from google.colab import files

files.upload()

In [None]:
# Create the Kaggle config folder (-p: no error if it already exists).
!mkdir -p ~/.kaggle

# Copy kaggle.json into the folder just created.
!cp kaggle.json ~/.kaggle/

# Restrict the file's permissions so that only the owner can read and write it.
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Dataset Download.
!kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification

In [None]:
# Unzip Dataset.
!unzip jigsaw-multilingual-toxic-comment-classification.zip

## **Multilingual Toxic Comment Classification using BERT.**

> [**Medium Blog**](https://medium.com/@sarahal.jodaiby/multilingual-toxic-comment-classification-3f25fb407c1c)

> [**A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution**](https://www.mdpi.com/2078-2489/12/5/205/htm)

> [**Multilingual Toxic Comment Classification**](https://cs230.stanford.edu/projects_spring_2020/reports/38964941.pdf)

In [None]:
!pip install transformers
!pip install bert-tensorflow

In [None]:
# Import Library.
import pandas as pd
import numpy as np
from tqdm import tqdm
import transformers, bert.tokenization
from tokenizers import BertWordPieceTokenizer
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau
import warnings, re

warnings.filterwarnings("ignore")

In [None]:
# TPU Detection - Check if TPU is running or not.
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print("Running on TPU", tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)  # Distribute training across the TPU cores.
else:
    strategy = tf.distribute.get_strategy()

print("REPLICAS:", strategy.num_replicas_in_sync)

Running on TPU grpc://10.27.162.146:8470
INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: grpc://10.27.162.146:8470
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 0, 0)
REPLICAS: 8


In [None]:
# Load Dataset.
train_1 = pd.read_csv("/content/jigsaw-toxic-comment-train.csv")
train_2 = pd.read_csv("/content/jigsaw-unintended-bias-train.csv")
train_2.toxic = train_2.toxic.round().astype(int)
validation = pd.read_csv("/content/validation.csv")
test = pd.read_csv("/content/test.csv")
sub = pd.read_csv("/content/sample_submission.csv")

# Combine train_1 with a subset of train_2.
train = pd.concat(
    [
        train_1[["comment_text", "toxic"]],
        train_2[["comment_text", "toxic"]].query("toxic==1"),
        train_2[["comment_text", "toxic"]]
        .query("toxic==0")
        .sample(n=150000, random_state=42),
    ]
)

train.head()

| | comment_text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [None]:
# Shape of the Dataset.
print("Dataset Shape is", train.shape)

# Class Frequency.
print(train["toxic"].value_counts())

Dataset Shape is (485775, 2)
0    352165
1    133610
Name: toxic, dtype: int64
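
The counts above imply that a little over a quarter of the combined training set is toxic. The sketch below works that out and also derives inverse-frequency class weights. The weights are an illustration only: the notebook trains without class weights, and the `class_weight` dictionary is an assumption about how one might compensate for the imbalance.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 352165, 1: 133610}
total = sum(counts.values())

# Fraction of toxic comments in the combined training set.
toxic_fraction = counts[1] / total

# "Balanced" inverse-frequency weights: total / (n_classes * count).
# Illustration only; the notebook itself does not use class weights.
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"toxic fraction: {toxic_fraction:.3f}")
print({label: round(w, 3) for label, w in class_weight.items()})
```

With these weights, each class contributes the same total weight, so the minority (toxic) class is weighted more heavily per example.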


## **Text Preprocessing & Cleaning.**

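A function that cleans the comment text: it lowercases the text and strips HTML tags, bracketed text, punctuation, and tokens containing digits. It is applied to the training and test sets only.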

In [None]:
# Text Preprocessing & Cleaning.
def clean_review_text(text):
    text = text.lower()  # Convert text to lowercase.
    text = re.sub(r"<.*?>", "", text).strip()  # Remove HTML tags.
    text = re.sub(
        r"\[.*?\]|\(.*?\)", "", text
    ).strip()  # Remove text in square brackets and parentheses.
    # Replace non-word chars (punctuation, symbols) with spaces.
    # \W is Unicode-aware, so non-ASCII letters are kept.
    text = re.sub(r"\W", " ", text).strip()
    text = re.sub(r"\S*\d\S*\s*", "", text).strip()  # Remove words containing numbers.
    return text.strip()


# Apply Text Preprocessing.
train.comment_text = train.comment_text.astype(str)
train.comment_text = train.comment_text.apply(clean_review_text)

test.content = test.content.astype(str)
test.content = test.content.apply(clean_review_text)
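
To make the effect of each cleaning step concrete, here is a minimal standalone sketch of the steps described in the comments above, run on a made-up sample string. The name `clean_text_sketch` and the sample are hypothetical. The patterns follow the stated intent (remove bracketed and parenthesised spans) and are not guaranteed to match the notebook's function character for character.

```python
import re


def clean_text_sketch(text):
    """Hypothetical stand-in that mirrors the cleaning steps described above."""
    text = text.lower()
    text = re.sub(r"<.*?>", "", text)            # drop HTML tags
    text = re.sub(r"\[.*?\]|\(.*?\)", "", text)  # drop [..] and (..) spans
    text = re.sub(r"\W", " ", text)              # punctuation/symbols -> space
    text = re.sub(r"\S*\d\S*\s*", "", text)      # drop tokens containing digits
    return text.strip()


# Made-up sample with HTML, brackets, digits, and non-ASCII letters.
sample = "<b>Hello</b> World (see note) [edit] call me at 555-1234, ¡qué día!"
print(clean_text_sketch(sample).split())
```

Because `\W` is Unicode-aware for Python 3 strings, accented letters such as `é` survive this step, which matters for multilingual text; only punctuation and symbols are replaced.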

## **Multilingual Toxic Comment Detector Classifier.**

Detect whether the multilingual input text is toxic or not.

In [None]:
# Encoder to encode text into a sequence of integers for BERT input.
def fast_encode(texts, tokenizer, batch_size=256, MAX_LEN=512):
    tokenizer.enable_truncation(max_length=MAX_LEN)
    tokenizer.enable_padding(length=MAX_LEN)
    all_ids = []
    for i in tqdm(range(0, len(texts), batch_size)):
        text_chunk = texts[i : i + batch_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])
    return np.array(all_ids)


# Model Configuration.
AUTO = tf.data.experimental.AUTOTUNE
EPOCHS = 5
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 512

# Model Architecture/Pipeline.
def create_model(transformer, maxlen):
    # Pretrained BERT Model.
    input_word_ids = Input(shape=(maxlen,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]  # Last hidden states: (batch, maxlen, 768).
    cls_token = sequence_output[:, 0, :]  # Representation of the first ([CLS]) token.
    # Classification head on top of the [CLS] representation.
    out = Dense(300, activation="relu")(cls_token)
    out = Dropout(0.4)(out)
    out = Dense(128, activation="relu")(out)
    out = Dropout(0.4)(out)
    out = Dense(128, activation="relu")(out)
    out = Dropout(0.4)(out)
    out = Dense(1, activation="sigmoid")(out)
    # Final Model Construction.
    model = Model(inputs=input_word_ids, outputs=out)
    # Compile the Model.
    model.compile(Adam(learning_rate=1e-5), loss="binary_crossentropy", metrics=["AUC"])
    return model


# Load DistilBERT with the Tokenizer.
tokenizer = transformers.DistilBertTokenizer.from_pretrained(
    "distilbert-base-multilingual-cased"
)
# Save the loaded tokenizer locally.
tokenizer.save_pretrained(".")
# Reload it with the HuggingFace tokenizers library.
fast_tokenizer = BertWordPieceTokenizer("vocab.txt", lowercase=False)
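
The `enable_truncation` and `enable_padding` calls in `fast_encode` make every encoded comment exactly `MAX_LEN` ids long. The pure-Python sketch below illustrates that idea only. `pad_or_truncate` is a hypothetical helper, `pad_id=0` is an assumption, and the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`, which this sketch ignores.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Return exactly max_len ids: cut long sequences, right-pad short ones."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


short = [101, 7592, 102]      # made-up token ids
long = list(range(1, 11))     # ten made-up token ids

print(pad_or_truncate(short, 6))
print(pad_or_truncate(long, 6))
```

Fixed-length rows are what allow `np.array(all_ids)` to form a rectangular integer matrix and what the `Input(shape=(maxlen,))` layer expects.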

In [None]:
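# Prepare Dataset for BERT Model.
# Pass both settings by keyword: a third positional argument would bind to
# batch_size, not MAX_LEN. batch_size=512 matches the progress bars below.
X_train = fast_encode(
    train.comment_text.astype(str), fast_tokenizer, batch_size=512, MAX_LEN=MAX_LEN
)
y_train = train.toxic.values

X_valid = fast_encode(
    validation.comment_text.astype(str), fast_tokenizer, batch_size=512, MAX_LEN=MAX_LEN
)
y_valid = validation.toxic.values

X_test = fast_encode(
    test.content.astype(str), fast_tokenizer, batch_size=512, MAX_LEN=MAX_LEN
)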

# Build Dataset Objects.
train_dataset = (
    tf.data.Dataset.from_tensor_slices((X_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = tf.data.Dataset.from_tensor_slices(X_test).batch(BATCH_SIZE)

100%|██████████| 949/949 [01:27<00:00, 10.80it/s]
100%|██████████| 16/16 [00:01<00:00,  9.38it/s]
100%|██████████| 125/125 [00:13<00:00,  9.02it/s]
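
The progress bars count encoding chunks, not rows. As a quick check, the sketch below computes how many chunks the 485,775 training rows (from the dataset shape printed earlier) would produce for two candidate chunk sizes. The 949 chunks in the first bar are consistent with a chunk size of 512 rather than the function's default of 256.

```python
import math

n_train = 485775  # rows in the combined training set (from train.shape above)

# Number of chunks tqdm would report for two candidate chunk sizes.
chunks = {size: math.ceil(n_train / size) for size in (256, 512)}
print(chunks)
```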


In [None]:
# Load Model into the TPU.
with strategy.scope():
    transformer_layer = transformers.TFDistilBertModel.from_pretrained(
        "distilbert-base-multilingual-cased"
    )
    model = create_model(transformer_layer, MAX_LEN)

# Model Summary.
model.summary()

Some layers from the model checkpoint at distilbert-base-multilingual-cased were not used when initializing TFDistilBertModel: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFDistilBertModel were initialized from the model checkpoint at distilbert-base-multilingual-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_word_ids (InputLayer)  [(None, 512)]            0         
                                                                 
 tf_distil_bert_model (TFDis  TFBaseModelOutput(last_h  134734080
 tilBertModel)               idden_state=(None, 512,             
                             768),                               
                              hidden_states=None, att            
                             entions=None)                       
                                                                 
 tf.__operators__.getitem (S  (None, 768)              0         
 licingOpLambda)                                                 
                                                                 
 dense (Dense)               (None, 300)               230700    
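
The 230,700 parameters reported for the first dense layer follow from its shape: 768 inputs times 300 units, plus 300 biases. As a small worked example, the same formula can be applied to the remaining head layers defined in `create_model`. The summary output above is cut off, so the later counts are derived from the layer widths in the code, not read from the summary.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Head widths from create_model: 768 (hidden size) -> 300 -> 128 -> 128 -> 1.
widths = [768, 300, 128, 128, 1]
head = [dense_params(a, b) for a, b in zip(widths, widths[1:])]

print(head, sum(head))
```

The head is tiny next to the 134,734,080 parameters of the transformer itself, so almost all of the training cost comes from fine-tuning DistilBERT.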
                                                             

## **Train the BERT Model.**

In [None]:
# First, train on the training subset built above, which is entirely in English.
callbacks = [
    EarlyStopping(patience=2, verbose=1),
    # Note: min_lr equals the initial learning rate (1e-5), so as configured
    # this callback cannot lower the learning rate.
    ReduceLROnPlateau(factor=0.1, patience=2, min_lr=0.00001, verbose=1),
    ModelCheckpoint("eng_bert.h5", verbose=1, save_best_only=True),
]

n_steps = X_train.shape[0] // BATCH_SIZE

# Fit the Model.
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS,
    callbacks=callbacks,
)

Epoch 1/5
Epoch 1: val_loss improved from inf to 0.43728, saving model to eng_bert.h5
Epoch 2/5
Epoch 2: val_loss improved from 0.43728 to 0.38778, saving model to eng_bert.h5
Epoch 3/5
Epoch 3: val_loss did not improve from 0.38778
Epoch 4/5
Epoch 4: val_loss did not improve from 0.38778
Epoch 4: early stopping
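
For reference, the number of steps per epoch follows directly from figures printed earlier in the notebook. The sketch below redoes that arithmetic with the values hard-coded (8 replicas from the TPU log, 485,775 rows from the dataset shape). The variable names are illustrative.

```python
n_replicas = 8          # "REPLICAS: 8" in the TPU log
per_replica_batch = 16  # the constant in BATCH_SIZE = 16 * num_replicas_in_sync
n_train = 485775        # rows in the combined training set

global_batch = per_replica_batch * n_replicas
steps_per_epoch, leftover = divmod(n_train, global_batch)

print(global_batch, steps_per_epoch, leftover)
```

Because the training dataset is built with `.repeat()`, the few leftover examples are not lost; they simply roll over into the batches of the next epoch.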


In [None]:
# Validation loss stopped improving on the English-only data, so we train the
# model for a few more epochs on the validation set, which is much smaller but
# contains a mixture of languages.

n_steps = X_valid.shape[0] // BATCH_SIZE

# Fit the Model.
train_history_2 = model.fit(
    valid_dataset.repeat(), steps_per_epoch=n_steps, epochs=EPOCHS * 2
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## **Predict Test Data.**

In [None]:
# Toxic Comment Prediction.
# np.round_ was removed in NumPy 2.0; ravel() flattens the (N, 1) predictions.
sub["toxic"] = np.round(model.predict(test_dataset, verbose=1).ravel(), decimals=2)
# Save Predictions.
sub.to_csv("submission.csv", index=False)
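
Since the model is tracked with AUC, it is worth knowing that rounding predictions can change a ranking-based score by creating ties. The sketch below uses a small pairwise AUC function and made-up labels and scores to show the effect. `roc_auc` is a hypothetical helper written for illustration, not a library call, and the numbers are not results from this notebook.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: one negative (0.204) scores slightly above one positive (0.203).
labels = [0, 0, 1, 1]
scores = [0.101, 0.204, 0.203, 0.900]
rounded = [round(s, 2) for s in scores]

print(roc_auc(labels, scores), roc_auc(labels, rounded))
```

In this toy case rounding turns a misordered pair into a tie, so the score moves; on real data the shift can go in either direction, which is one reason to consider submitting unrounded probabilities.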

