
# Title: AIDI 1002 Final Term Project Report



# Introduction

#### Problem Description:

Bidirectional Encoder Representations from Transformers (BERT) is a widely used pre-training technique for natural language processing (NLP) that has produced strong results on many benchmarks. Despite this success, BERT has several limitations that can reduce its performance on some tasks: a fixed maximum input length, attention that does not separate different aspects of the input, no explicit modelling of relationships between input modalities, and a large number of parameters that make it computationally expensive to train and deploy in resource-constrained environments.


#### Context of the Problem:

The problem is motivated by the demand for NLP models that can carry out a range of language tasks, such as sentiment analysis, question answering, and language translation, both accurately and efficiently. BERT was developed as a pre-training technique for such tasks and performs well on numerous benchmarks, but the limitations described above can reduce its effectiveness in some settings.

#### Limitations of Other Approaches:

The following succinctly describes the BERT approach's shortcomings in natural language processing:

-> Limited attention span: BERT processes input sequences of at most 512 tokens, so it can be difficult to capture dependencies that span longer texts.

-> Lack of specificity: BERT's attention mechanism does not explicitly distinguish between various aspects of the input, such as the subject and object in a sentence or the sentiment and content of a text, which can restrict its capacity to capture fine-grained information that is crucial for some tasks.

-> Inability to model relationships between different modalities: BERT lacks the capacity to explicitly represent relationships between various input modalities, such as text and images, which makes it difficult for the model to tackle tasks that call for a knowledge of links between various modalities.

-> Computationally expensive: BERT is a large model with many parameters, which makes it costly to train and deploy, especially in resource-constrained environments.

-> Limited transferability: Although BERT has performed well in many NLP tasks, it may not translate well to tasks that are unrelated to its pre-training goals, necessitating further fine-tuning or retraining on task-specific data.

#### Solution:

DeBERTa (Decoding-enhanced BERT with disentangled attention) is a natural language processing model released by Microsoft Research in 2020. It builds on the BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa models and improves on them with two main techniques: disentangled attention and an enhanced mask decoder.

With disentangled attention, each token is represented by two vectors, one encoding its content and one encoding its relative position. Attention weights between tokens are computed from separate (disentangled) matrices over contents and relative positions, so the model can account for both what a token is and where it sits relative to other tokens.

The enhanced mask decoder adds absolute position information just before the output layer that predicts masked tokens during pre-training. This complements the relative positions used in the attention layers, since the absolute position of a word can matter for its syntactic role.

DeBERTa has been reported to outperform the original BERT model on a variety of NLP tasks, including text classification, question answering, and natural language inference.

# Background

The table below summarises the related work.

| Reference | Explanation | Dataset/Input | Weakness |
| --- | --- | --- | --- |
| Devlin et al. [1] | Introduces the BERT model, which pre-trains a deep bidirectional transformer on a large quantity of text to improve performance on downstream natural language processing tasks. | BooksCorpus and English Wikipedia | The datasets consist mostly of written material, so they may not be representative of all linguistic contexts and modes. |
| Yang et al. [2] | Introduces XLNet, a pre-trained language model that uses an autoregressive method with a permutation-based training objective to capture long-range dependencies in text. | Wikipedia and BooksCorpus | The BooksCorpus dataset contains only English-language books, which limits how well the model applies to other languages. |


# Methodology

DeBERTa improves on BERT by replacing the standard self-attention mechanism with disentangled attention. In BERT, each input token is represented by a single vector, the sum of its word (content) embedding and its position embedding. DeBERTa instead represents each token with two vectors, one for content and one for relative position.

The attention score between two tokens is then the sum of three terms: content-to-content, content-to-position, and position-to-content. Each term uses its own projection matrices, and relative distances are clipped to a maximum value so that a fixed-size table of relative position embeddings can be shared across the sequence.

Separating content from position in this way lets the model attend more explicitly to how a word's meaning depends on its neighbours. DeBERTa also uses an enhanced mask decoder, which injects absolute position information before the layer that predicts masked tokens during pre-training, and the original paper additionally proposes a virtual adversarial training method for fine-tuning.
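The disentangled attention computation can be sketched in a few lines of NumPy. This is a minimal single-head illustration based on the description in the DeBERTa paper, not the library implementation: the function names, the weight matrices, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def relative_position_index(seq_len, k):
    """delta[i, j]: relative distance from token i to token j, clipped into [0, 2k)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def disentangled_attention(H, P, Wq_c, Wk_c, Wv_c, Wq_r, Wk_r, k):
    """H: [n, d] content vectors; P: [2k, d] relative position embeddings."""
    n, d = H.shape
    Qc, Kc, Vc = H @ Wq_c, H @ Wk_c, H @ Wv_c   # content projections
    Qr, Kr = P @ Wq_r, P @ Wk_r                 # relative position projections
    delta = relative_position_index(n, k)

    c2c = Qc @ Kc.T                                        # content-to-content
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)     # content-to-position
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T   # position-to-content

    scores = (c2c + c2p + p2c) / np.sqrt(3 * d)            # three terms, so scale by sqrt(3d)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ Vc, weights

# Toy example with random inputs (all sizes are illustrative)
rng = np.random.default_rng(0)
n, d, k = 6, 8, 3
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
out, weights = disentangled_attention(H, P, *W, k=k)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to one, as in standard attention; the only difference is that the scores combine content and relative position terms.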

On benchmark datasets such as GLUE, SuperGLUE, and SQuAD, DeBERTa has reported state-of-the-art results on many tasks.

We implemented K-fold cross-validation with the DeBERTa model to improve its results and to obtain a more reliable estimate of its performance.
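As a rough illustration of K-fold cross-validation, the sketch below splits sample indices into folds so that every sample is used for validation exactly once. It is a plain, unstratified version written for this explanation; the notebook itself uses `MultilabelStratifiedKFold`, and the helper name `kfold_indices` is hypothetical.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Return a list of (train_idx, val_idx) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every n_folds-th shuffled index, starting at position f
    folds = [indices[f::n_folds] for f in range(n_folds)]
    splits = []
    for f in range(n_folds):
        val_idx = folds[f]
        train_idx = [i for g in range(n_folds) if g != f for i in folds[g]]
        splits.append((train_idx, val_idx))
    return splits

splits = kfold_indices(n_samples=10, n_folds=4)
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```

One model is trained per fold, and the validation scores of the folds are averaged to summarise performance.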

# Implementation

This section presents the code and its explanation: first a BERT baseline in PyTorch, then a DeBERTa model trained with K-fold cross-validation in TensorFlow.

In [None]:
!pip install transformers


In [None]:
# imports
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.optim import Adam, lr_scheduler
from tqdm.notebook import tqdm
from transformers import BertTokenizer, BertModel

# Tokenizer matching the pre-trained BERT checkpoint used below
tokenizer = BertTokenizer.from_pretrained('bert-large-cased')


In [None]:
#loading train and test dataset
train = pd.read_csv("train.csv")
print(f'Train_Shape: {train.shape}')
test=pd.read_csv("test.csv")
print(f'Test_Shape: {test.shape}')
train.head()

Train_Shape: (3911, 8)
Test_Shape: (3, 2)


Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5


In [None]:
sample = pd.read_csv('sample_submission.csv')
print(f'Sample_Shape: {sample.shape}')
sample.head()

Sample_Shape: (3, 7)


Unnamed: 0,text_id,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0000C359D63E,3.0,3.0,3.0,3.0,3.0,3.0
1,000BAD50D026,3.0,3.0,3.0,3.0,3.0,3.0
2,00367BB2546B,3.0,3.0,3.0,3.0,3.0,3.0


In [None]:
# train_test_split
np.random.seed(42)
df_train, df_val, df_test = np.split(train.sample(frac=1, random_state=42), [int(.9*len(train)), int(.95*len(train))])

print(f'Train_Shape: {len(df_train)}')
print(f'Val_Shape: {len(df_val)}')
print(f'Test_Shape: {len(df_test)}')

Train_Shape: 3519
Val_Shape: 196
Test_Shape: 196
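The split sizes printed above follow directly from the two cut points passed to `np.split`. A small sketch of the arithmetic, using a hypothetical helper `split_sizes`:

```python
def split_sizes(n_rows, cut_fractions=(0.9, 0.95)):
    """Sizes of the pieces produced by cutting n_rows at the given fractions."""
    cuts = [int(f * n_rows) for f in cut_fractions]
    bounds = [0] + cuts + [n_rows]
    return [hi - lo for lo, hi in zip(bounds[:-1], bounds[1:])]

print(split_sizes(3911))  # [3519, 196, 196]
```

With 3911 rows, the cuts fall at rows 3519 and 3715, which gives roughly a 90/5/5 split into training, validation, and test sets.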


In [None]:
# Clean the text: replace newlines with spaces, strip surrounding whitespace, and lowercase
def clean_text(data):
    data = data.copy()
    data['full_text'] = data['full_text'].apply(lambda x : x.replace('\n', ' '))
    data['full_text'] = data['full_text'].apply(lambda x: x.strip())
    data['full_text'] = data['full_text'].apply(lambda x: x.lower())
    return data
#applying the text
train_df = clean_text(train)
train_df.head()

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,i think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
1,0022683E9EA5,when a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"dear, principal if u change the school policy...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,the best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5


In [None]:
# tokenizer = BertTokenizer.from_pretrained('../input/huggingface-bert-variants/bert-large-cased/bert-large-cased')

class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.labels = df[["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]].reset_index()
        self.texts = df[["full_text"]].reset_index()

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch a batch of labels
        return np.array(self.labels.loc[idx].values[1:]).astype(float)

    def get_batch_texts(self, idx):
        # Fetch a batch of inputs
        return tokenizer(self.texts.loc[idx].values[1],
                        padding='max_length', max_length = 512, truncation=True,
                        return_tensors="pt")

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y
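The tokenizer call above uses `padding='max_length'`, `max_length = 512` and `truncation=True`, so every essay becomes a sequence of exactly 512 token ids plus an attention mask. The sketch below shows the idea on plain lists. It is a simplification: the real tokenizer also keeps the special `[CLS]` and `[SEP]` tokens when truncating, and the helper name and the padding id of 0 are assumptions for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop tokens beyond max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)        # padding needed to reach max_length
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

print(pad_or_truncate([5, 6, 7], max_length=5))
print(pad_or_truncate(list(range(10)), max_length=4))
```

The attention mask lets the model ignore padded positions, which is why it is passed to the model alongside the input ids.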

In [None]:
#Test Data

In [None]:
class TestDataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.texts = df[["full_text"]].reset_index()

    def __len__(self):

        return len(self.texts)

    def get_batch_texts(self, idx):
        # Fetch a batch of inputs
        return tokenizer(self.texts.loc[idx].values[1], padding = 'max_length', max_length = 512, truncation = True, return_tensors = 'pt')

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)

        return batch_texts

In [None]:
class FeedbackModel(nn.Module):

    def __init__(self, dropout = 0.1):

        super(FeedbackModel, self).__init__()
        self.bert = BertModel.from_pretrained("bert-large-cased")

#         self.bert = BertModel.from_pretrained('../input/huggingface-bert-variants/bert-large-cased/bert-large-cased')
        self.dropout = nn.Dropout(dropout) # nn.Dropout(dropout,0)
        self.linear = nn.Linear(1024, 256)
        self.relu = nn.ReLU() # nn.LeakyReLU(0.1)
        self.out = nn.Linear(256,6)

    def forward(self, input_id, mask):

        _, x = self.bert(input_ids = input_id, attention_mask = mask, return_dict = False)
        x = self.dropout(x)
        x = self.linear(x)
        x = self.relu(x)
        final_layer = self.out(x)
        return final_layer

In [None]:
# Train the model

In [None]:
def train(model, train_data, val_data, epochs):

    train, val = Dataset(train_data), Dataset(val_data)

    # Select the GPU when available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    criterion = nn.MSELoss()
    optimizer = Adam(model.parameters(), lr = 1e-5)
    # Cosine schedule that restarts every 500 optimizer steps
    scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0 = 500, eta_min = 1e-6)

    train_dataloader = torch.utils.data.DataLoader(train, batch_size = 2, shuffle = True)
    val_dataloader = torch.utils.data.DataLoader(val, batch_size = 2)

    for epoch_num in range(epochs):

        model.train()  # enable dropout during training
        total_loss_train = 0

        for train_input, train_labels in tqdm(train_dataloader):

            train_labels = train_labels.to(device).float()
            # The tokenizer returns tensors of shape [1, 512]; drop the extra dimension
            mask = train_input['attention_mask'].squeeze(1).to(device)
            input_id = train_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)

            batch_loss = criterion(output, train_labels)
            total_loss_train += batch_loss.item()

            optimizer.zero_grad()
            batch_loss.backward()
            optimizer.step()
            scheduler.step()

        model.eval()  # disable dropout during validation
        total_loss_val = 0

        with torch.no_grad():

            for val_input, val_label in val_dataloader:

                val_label = val_label.to(device).float()
                mask = val_input['attention_mask'].squeeze(1).to(device)
                input_id = val_input['input_ids'].squeeze(1).to(device)

                output = model(input_id, mask)

                batch_loss = criterion(output, val_label)
                total_loss_val += batch_loss.item()

        # Each batch loss is already a mean, so average over the number of batches
        print(f'Epoch: {epoch_num + 1} | Train Loss: {total_loss_train / len(train_dataloader): .3f} | Val Loss: {total_loss_val / len(val_dataloader): .3f}')
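The training function uses `CosineAnnealingWarmRestarts` with `T_0 = 500` and `eta_min = 1e-6`, stepping the scheduler once per batch. The following sketch reproduces the documented formula for the default case where every cycle has the same length; the function name is hypothetical and the defaults mirror the values used above.

```python
import math

def cosine_warm_restart_lr(step, base_lr=1e-5, eta_min=1e-6, t_0=500):
    """Learning rate after `step` scheduler steps, with a restart every t_0 steps."""
    t_cur = step % t_0  # position inside the current cycle
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_0))

for step in (0, 250, 499, 500):
    print(step, cosine_warm_restart_lr(step))
```

The rate starts at `base_lr`, decays towards `eta_min` over 500 batches, and then jumps back to `base_lr` at the start of the next cycle.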

In [None]:
#predict

In [None]:
def predict(model, test_data):

    test =  TestDataset(test_data)
    test_dataloader = torch.utils.data.DataLoader(test,batch_size = 1)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()  # disable dropout for inference

    out = []
    with torch.no_grad():
        for test_input in tqdm(test_dataloader):
            mask = test_input["attention_mask"].squeeze(1).to(device)
            input_id = test_input["input_ids"].squeeze(1).to(device)
            output =  model(input_id, mask)
            out.append(output.tolist())

    return out

In [None]:
#evaluate

In [None]:
def evaluate(model, test_data):

    test = Dataset(test_data)

    test_dataloader = torch.utils.data.DataLoader(test, batch_size = 2)
    criterion = nn.MSELoss()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    model.eval()  # disable dropout for evaluation

    total_loss_test = 0
    with torch.no_grad():

        for test_input, test_labels in tqdm(test_dataloader):

            test_labels = test_labels.to(device).float()
            mask = test_input['attention_mask'].squeeze(1).to(device)
            input_id = test_input['input_ids'].squeeze(1).to(device)

            output = model(input_id, mask)

            loss = criterion(output, test_labels)
            total_loss_test += loss.item()

    # Each batch loss is already a mean, so average over the number of batches
    print(f'Test Loss: {total_loss_test / len(test_dataloader): .3f}')
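The BERT baseline reports mean squared error (`nn.MSELoss`), while the DeBERTa section later tracks root mean squared error. The two are related by a square root, as the small sketch below shows on made-up scores (the numbers and function names are illustrative only).

```python
import math

def mse(predictions, targets):
    """Mean squared error over every score of every essay."""
    squared = [(p - t) ** 2
               for pred_row, target_row in zip(predictions, targets)
               for p, t in zip(pred_row, target_row)]
    return sum(squared) / len(squared)

def rmse(predictions, targets):
    """Root mean squared error, in the same units as the scores."""
    return math.sqrt(mse(predictions, targets))

preds = [[3.0, 3.5], [2.0, 4.0]]    # two essays, two scores each (toy data)
truth = [[3.5, 3.5], [2.0, 3.0]]
print(mse(preds, truth), rmse(preds, truth))
```

Because RMSE is in the same units as the scores, it is easier to read as a typical prediction error than MSE.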

In [None]:
model = FeedbackModel()

EPOCHS = 30
# Train on the 90% split only, so the validation and test rows stay unseen
train(model, clean_text(df_train), clean_text(df_val), EPOCHS)

In [None]:
# Apply the same cleaning that was used for the training text
evaluate(model, clean_text(df_test))

In [None]:
torch.save(model.cpu().state_dict(), "BERT_epoch_30.bin")

In [None]:
# Apply the same cleaning that was used for the training text
prediction = predict(model, clean_text(test))

In [None]:
# Implementing DeBERTa

In [None]:
!pip install -q transformers==4.20
!pip install -q tensorflow-addons
!pip install -q iterative-stratification
!pip install -q imbalanced-learn
import transformers
from transformers import DebertaTokenizer
print(f"transformers version {transformers.__version__}")
# MultilabelStratifiedKFold is provided by the iterative-stratification package
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
# The tokenizer and the model are created in the cells below

In [None]:
import math
import os
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_addons as tfa
import transformers
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn.model_selection import KFold, train_test_split
from tensorflow import keras
from tensorflow.keras import layers
from tqdm import tqdm
from transformers import AutoConfig, AutoTokenizer, TFDebertaV2Model

In [None]:
def seed_everything(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

seed_everything(1234)

In [None]:
import tensorflow as tf

# Define and initialize the distribution strategy
strategy = tf.distribute.MirroredStrategy()

In [None]:
BATCH_SIZE = 8*strategy.num_replicas_in_sync
BUFFER_SIZE = 3200
AUTO = tf.data.AUTOTUNE
SEQ_LEN = 512
MODEL_NAME = "microsoft/deberta-v3-base"
FOLD_NUM = 4
# The tokenizer must match the model checkpoint (requires the sentencepiece package)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
TARGET_COLS = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]

In [None]:
def preprocess(df, tokenizer):
    inputs, labels = np.array(df["full_text"]), np.array(df[["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]])
    input_ids = []
    attention_mask = []
    for x in tqdm(inputs):
        tokens = tokenizer(x, padding="max_length", truncation=True, max_length=SEQ_LEN, return_tensors="np")
        ids = tokens["input_ids"]
        mask = tokens["attention_mask"]
        input_ids.append(ids)
        attention_mask.append(mask)
    input_ids = np.array(input_ids).squeeze()
    attention_mask = np.array(attention_mask).squeeze()
    return input_ids, attention_mask, labels

In [None]:
df = pd.read_csv(r"/content/train.csv")
kf = MultilabelStratifiedKFold(n_splits=FOLD_NUM, shuffle=True, random_state=123)

for fold, (train_idx, test_idx) in enumerate(kf.split(df, df[TARGET_COLS])):
    df.loc[test_idx, "fold"] = int(fold)
df["fold"] = df["fold"].astype(int)
# df.to_csv("train_fold.csv", index=False)
df

In [None]:
def get_train_dataset(ids, mask, y):
    x = tf.data.Dataset.from_tensor_slices({
        "input_ids": tf.constant(ids, dtype="int32"),
        "attention_mask": tf.constant(mask, dtype="int32")
    })
    y = tf.data.Dataset.from_tensor_slices(y)
    data = tf.data.Dataset.zip((x, y))
    data = data.repeat()
    data = data.shuffle(BUFFER_SIZE)
    data = data.batch(BATCH_SIZE)
    data = data.prefetch(AUTO)
    return data

def get_val_dataset(ids, mask, y):
    x = tf.data.Dataset.from_tensor_slices({
        "input_ids": tf.constant(ids, dtype="int32"),
        "attention_mask": tf.constant(mask, dtype="int32")
    })
    y = tf.data.Dataset.from_tensor_slices(y)
    data = tf.data.Dataset.zip((x, y))
    data = data.repeat()
    data = data.batch(BATCH_SIZE)
    data = data.prefetch(AUTO)
    return data

In [None]:
class AttentionPool(layers.Layer):
    def __init__(self, num_layers, hidden_size, hiddendim_fc):
        super().__init__()
        self.num_hidden_layers = num_layers
        self.hidden_size = hidden_size
        self.hiddendim_fc = hiddendim_fc
        self.dropout = layers.Dropout(0.0)

        self.q = tf.Variable(initial_value=keras.initializers.GlorotNormal()(shape=(1, self.hidden_size), dtype=tf.float32), name="attention_pool_q")
        self.w_h = tf.Variable(initial_value=keras.initializers.GlorotNormal()(shape=(self.hidden_size, self.hiddendim_fc), dtype=tf.float32), name="attention_pool_wh")

    def call(self, all_hidden_states):
        # Collect the CLS token of every layer: [batch, num_layers, hidden_size]
        hidden_states = tf.stack([all_hidden_states[i][:, 0] for i in range(1, self.num_hidden_layers+1)], axis=1)
        out = self.attention(hidden_states)
        out = self.dropout(out)
        return out

    def attention(self, h):
        v = tf.squeeze(tf.matmul(self.q, tf.transpose(h, perm=(0, 2, 1))), axis=1)
        v = tf.nn.softmax(v, axis=-1)
        v_temp = tf.transpose(tf.matmul(tf.expand_dims(v, axis=1), h), perm=(0, 2, 1))
        v = tf.squeeze(tf.matmul(tf.transpose(self.w_h, perm=(1, 0)), v_temp), axis=2)
        return v
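The attention pooling layer scores the CLS vector of every encoder layer against a learned query, turns the scores into weights with a softmax over layers, and projects the weighted sum. A NumPy sketch of the same arithmetic is shown below; the function name is hypothetical and the random inputs are for illustration only.

```python
import numpy as np

def attention_pool(h, q, w_h):
    """h: [batch, layers, hidden]; q: [1, hidden]; w_h: [hidden, out]."""
    scores = np.einsum("blh,h->bl", h, q[0])                # one score per layer
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    pooled = np.einsum("bl,blh->bh", weights, h)            # weighted sum over layers
    return pooled @ w_h, weights

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 12, 16))      # 2 samples, 12 layers, hidden size 16 (toy sizes)
q = rng.normal(size=(1, 16))
w_h = rng.normal(size=(16, 4))
pooled, layer_weights = attention_pool(h, q, w_h)
print(pooled.shape, layer_weights.shape)
```

With a zero query all layers receive equal weight, so the layer reduces to a plain average of the per-layer CLS vectors followed by the projection.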

In [None]:
class MeanPool(keras.layers.Layer):
    def call(self, x, mask=None):
        broad_mask = tf.cast(tf.expand_dims(mask, -1), "float32")
        # [batch, maxlen, hidden_state]
        x = tf.math.reduce_sum( x * broad_mask, axis=1)
        x = x / tf.math.maximum(tf.reduce_sum(broad_mask, axis=1), tf.constant([1e-9]))
        return x
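Mean pooling averages the token vectors of each sequence while ignoring padded positions. The NumPy sketch below mirrors the layer above on a tiny hand-made input; the function name is hypothetical.

```python
import numpy as np

def masked_mean_pool(x, mask):
    """x: [batch, seq_len, hidden]; mask: [batch, seq_len] with 1 for real tokens."""
    m = mask[..., None].astype(x.dtype)          # broadcast the mask over the hidden axis
    summed = (x * m).sum(axis=1)                 # sum of the real token vectors
    counts = np.maximum(m.sum(axis=1), 1e-9)     # avoid division by zero
    return summed / counts

x = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])   # the last token is padding
mask = np.array([[1, 1, 0]])
print(masked_mean_pool(x, mask))
```

The padded vector `[100, 100]` does not affect the result, because its mask value is zero.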

In [None]:
from transformers.tf_utils import shape_list

def take_along_axis(x, indices, gather_axis):
    # Only a valid port of np.take_along_axis when the gather axis is -1

    # TPU + gathers and reshapes don't go along well -- see https://github.com/huggingface/transformers/issues/18239
    if isinstance(tf.distribute.get_strategy(), tf.distribute.TPUStrategy):
        # [B, S, P] -> [B, S, P, D]
        one_hot_indices = tf.one_hot(indices, depth=x.shape[-1], dtype=x.dtype)

        # if we ignore the first two dims, this is equivalent to multiplying a matrix (one hot) by a vector (x)
        # grossly abusing notation: [B, S, P, D] . [B, S, D] = [B, S, P]
        gathered = tf.einsum("ijkl,ijl->ijk", one_hot_indices, x)

    # GPUs, on the other hand, prefer gathers instead of large one-hot+matmuls
    else:
        gathered = tf.gather(x, indices, batch_dims=2)

    return gathered


transformers.models.deberta_v2.modeling_tf_deberta_v2.take_along_axis = take_along_axis

class TFDebertaV2StableDropout(tf.keras.layers.Layer):
    """
    Optimized dropout module for stabilizing the training
    Args:
        drop_prob (float): the dropout probabilities
    """

    def __init__(self, drop_prob, **kwargs):
        super().__init__(**kwargs)
        self.drop_prob = drop_prob

    @tf.custom_gradient
    def xdropout(self, inputs):
        """
        Applies dropout to the inputs, as vanilla dropout, but also scales the remaining elements up by 1/(1 - drop_prob).
        """
        mask = tf.cast(
            1
            - tf.compat.v1.distributions.Bernoulli(probs=1.0 - self.drop_prob).sample(sample_shape=shape_list(inputs)),
            tf.bool,
        )
        scale = tf.convert_to_tensor(1.0 / (1 - self.drop_prob), dtype=tf.float32)
        if self.drop_prob > 0:
            inputs = tf.where(mask, 0.0, inputs) * scale

        def grad(upstream):
            if self.drop_prob > 0:
                return tf.where(mask, 0.0, upstream) * scale
            else:
                return upstream

        return inputs, grad

    def call(self, inputs: tf.Tensor, training: tf.Tensor = False):
        if training:
            return self.xdropout(inputs)
        return inputs

transformers.models.deberta_v2.modeling_tf_deberta_v2.TFDebertaV2StableDropout = TFDebertaV2StableDropout

In [None]:
def build_model(trainable=True):
    input1 = keras.Input(shape=(None,), batch_size=BATCH_SIZE, dtype="int32", name="input_ids")
    input2 = keras.Input(shape=(None,), batch_size=BATCH_SIZE, dtype="int32", name="attention_mask")

    config = AutoConfig.from_pretrained(MODEL_NAME)
    config.attention_probs_dropout_prob = 0.0
    config.hidden_dropout_prob = 0.0
    config.update({"output_hidden_states": True})

    base_model = TFDebertaV2Model.from_pretrained(
        MODEL_NAME,
        config=config,
    )
    # Re-initialize last layer
    REINIT_LAYER = 1
    for layer in base_model.deberta.encoder.layer[-REINIT_LAYER:]:
        for module in layer.submodules:
            if isinstance(module, layers.Dense):
                module.kernel.assign(keras.initializers.GlorotUniform()(shape=module.kernel.shape, dtype=module.kernel.dtype))
            elif isinstance(module, layers.LayerNormalization):
                module.beta.assign(keras.initializers.Zeros()(shape=module.beta.shape, dtype=module.beta.dtype))
                module.gamma.assign(keras.initializers.Ones()(shape=module.gamma.shape, dtype=module.gamma.dtype))

    base_model.trainable = trainable
    base_outputs = base_model.deberta({"input_ids": input1,
                              "attention_mask": input2})
    all_hidden_states = tf.stack(base_outputs[1])
    hiddendim_fc = 512
    pooler = AttentionPool(config.num_hidden_layers, config.hidden_size, hiddendim_fc)
    attention_pooling_embeddings = pooler(all_hidden_states)
    outputs = layers.Dense(6)(attention_pooling_embeddings)
    model = keras.Model(inputs={"input_ids": input1,"attention_mask": input2}, outputs=outputs)
    return model

In [None]:
# Multi-optimizers
def get_optimizers(model, base_lr, head_lr, train_num):

    layer_list = [model.get_layer("deberta").embeddings] + model.get_layer("deberta").encoder.layer
    layer_list.reverse()
    base_schedule = [keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=base_lr*0.9**i,
                                                        decay_steps=train_num//BATCH_SIZE,
                                                        decay_rate=0.9) for i in range(len(layer_list))]
    base_optimizers = [keras.optimizers.Adam(learning_rate=lr_schedule) for lr_schedule in base_schedule]

    head_schedule = keras.optimizers.schedules.ExponentialDecay(initial_learning_rate=head_lr,
                                                                 decay_steps=train_num//BATCH_SIZE,
                                                                 decay_rate=0.9)
    head_optimizers = keras.optimizers.Adam(learning_rate=head_schedule)
    # Get head layers
    idx = 3
    for i, layer in enumerate(model.layers):
        if layer.name == "deberta":
            idx = i
    optimizers = tfa.optimizers.MultiOptimizer([(head_optimizers, model.layers[idx+1:])] + list(zip(base_optimizers, layer_list)))
    return optimizers
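`get_optimizers` gives each encoder layer its own learning rate: the top layer starts at `base_lr`, and each layer below it starts 10% lower, down to the embeddings. Every rate then decays exponentially during training. The sketch below shows both calculations in plain Python. The function names are hypothetical, and 13 groups are assumed here (12 encoder layers plus the embeddings of a base-sized model).

```python
def layerwise_learning_rates(base_lr, num_groups, decay=0.9):
    """Initial learning rate per group, ordered from the top layer down to the embeddings."""
    return [base_lr * decay ** i for i in range(num_groups)]

def exponential_decay(initial_lr, step, decay_steps, decay_rate=0.9):
    """Continuous decay: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

rates = layerwise_learning_rates(base_lr=1e-5, num_groups=13)
print(rates[0], rates[-1])                                  # top layer vs. embeddings
print(exponential_decay(1e-3, step=100, decay_steps=100))   # head rate after one decay period
```

Lower layers hold more general features, so giving them smaller learning rates is a common way to limit how far they drift during fine-tuning.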

In [None]:
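# Optional TPU setup. The address below is specific to the original runtime.
# Note that the training loop below uses `strategy` (MirroredStrategy) defined earlier.
tpu_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://10.0.0.2:8470')
tf.config.experimental_connect_to_cluster(tpu_resolver)
tf.tpu.experimental.initialize_tpu_system(tpu_resolver)
tpu_strategy = tf.distribute.TPUStrategy(tpu_resolver)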

In [None]:
historys = []

for fold in range(FOLD_NUM):
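    # Re-initialize the TPU only when training under a TPU strategy
    if isinstance(strategy, tf.distribute.TPUStrategy):
        tf.tpu.experimental.initialize_tpu_system(tpu_resolver)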
    print(f"{'#'*25}\nFold {fold}")
    train_df = df[df["fold"]!=fold].reset_index(drop=True)
    val_df = df[df["fold"]==fold].reset_index(drop=True)
    train_ids, train_mask, train_y = preprocess(train_df, tokenizer)
    val_ids, val_mask, val_y = preprocess(val_df, tokenizer)
    train_dataset = get_train_dataset(train_ids, train_mask, train_y)
    val_dataset = get_val_dataset(val_ids, val_mask, val_y)
    TRAIN_NUM = int(len(train_ids))
    VAL_NUM = int(len(val_ids))

    tf.keras.backend.clear_session()
    with strategy.scope():
        model = build_model(trainable=True)
        model.compile(loss=keras.losses.MeanSquaredError(),
                      optimizer=get_optimizers(model, base_lr=1e-5, head_lr=1e-3, train_num=TRAIN_NUM),
                     metrics=[keras.metrics.RootMeanSquaredError()])

    # callbacks
    save_locally = tf.train.CheckpointOptions(experimental_io_device="/job:localhost")

    CHECKPOINT_PATH = f"./deberta_model_fold{fold}.h5"
    model_checkpoint_callback = keras.callbacks.ModelCheckpoint(CHECKPOINT_PATH,
                                                                monitor='val_loss',
                                                                options=save_locally,
                                                                save_best_only=True,
                                                                save_weights_only=True,
                                                                verbose=1)

    earlystop_callback = keras.callbacks.EarlyStopping(monitor='val_loss',
                                                      patience=6,
                                                      verbose=1)

    callbacks = [
        model_checkpoint_callback,
        earlystop_callback,
    ]

    history = model.fit(train_dataset,
                            validation_data=val_dataset,
                            steps_per_epoch=TRAIN_NUM // BATCH_SIZE,
                            validation_steps=VAL_NUM // BATCH_SIZE,
                            callbacks=callbacks,
                            epochs=30,
                            verbose=1,
                           )
    historys.append(history)
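The `EarlyStopping` callback above stops a fold once `val_loss` has not improved for 6 consecutive epochs. The logic can be sketched as follows; the helper name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Index of the epoch at which training stops (the last epoch if never triggered)."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses) - 1

print(early_stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=2))  # 3
```

Combined with `save_best_only=True` in the checkpoint callback, the weights kept for each fold are those of the best validation epoch rather than the last one.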

In [None]:
def plot_history(historys):
    for id, history in enumerate(historys):
        loss = history.history['root_mean_squared_error']
        min_loss = round(np.min(loss), 6)
        epoch = range(len(loss))
        plt.plot(epoch, loss, label=f"fold {id}: {min_loss}")
        plt.legend()
        plt.title("Train rmse")
    plt.figure()
    for id, history in enumerate(historys):
        val_rmse = history.history['val_root_mean_squared_error']
        min_val_rmse = round(np.min(val_rmse), 6)
        epoch = range(len(val_rmse))
        plt.plot(epoch, val_rmse, label=f"fold {id}: {min_val_rmse}")
        plt.legend()
        plt.title("Val rmse")


plot_history(historys)

def get_scores(historys):
    scores = []
    for id, history in enumerate(historys):
        val_rmse = history.history['val_root_mean_squared_error']
        min_val_rmse = np.min(val_rmse)
        scores.append(min_val_rmse)
    return scores

scores = get_scores(historys)
print(f"\nAll scores of K fold validation: {scores}\n")
print(f"Mean score: {np.mean(scores)}\n")


# Conclusion and Future Direction

DeBERTa's future development is likely to focus on improving its performance across a wider range of NLP tasks, including those that require deeper contextual knowledge and deductive reasoning. Some potential directions are:

-> Multimodal Language Understanding: To better comprehend the context of natural language, DeBERTa can be extended to handle multimodal inputs, such as text and visuals. This has a number of uses, including captioning pictures and visual question answering.

-> Few-Shot Learning: DeBERTa can be further enhanced to perform better on tasks requiring very sparse samples for the model to learn from. This can be accomplished by utilising strategies like transfer learning, meta-learning, and more.

-> Continual Learning: DeBERTa can be extended to learn from a stream of data in a continual learning setting, where the model must adapt to new tasks and concepts without forgetting those it has already learned. This is useful in real-world situations where the data changes over time.

-> Privacy-Preserving Learning: DeBERTa can be combined with privacy-preserving methods such as federated learning and differential privacy to protect user privacy and secure sensitive data.

Overall, DeBERTa's future development is probably going to be centred on enhancing its performance on a variety of NLP tasks, as well as expanding its capacity to handle multimodal inputs, few-shot learning, continuous learning, and privacy-preserving learning.

In this project, using K-fold cross-validation improved the results of the DeBERTa model.

# References:

[1]:  Official DeBERTa Github repository: https://github.com/microsoft/DeBERTa.

[2]:  Microsoft Research Asia. "Microsoft Research Asia achieves state-of-the-art performance on natural language processing benchmarks with DeBERTa." Microsoft News Center, 2020.

[3]:  Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." arXiv preprint arXiv:2006.03654, 2020.