
Custom loss and optimizers for BERT models through the ktrain Transformers call? #228

Closed
mldatabunch opened this issue Aug 12, 2020 · 7 comments
Labels: user question (Further information is requested)

Comments

@mldatabunch

Hi, I looked through the FAQ and closed issues as much as possible, so apologies if this has been answered already.

Can I add different loss functions and optimizers for pre-trained BERT models? To give you my use case, I have an imbalanced dataset so I'm looking to use focal loss. Thanks for the very useful library!

@amaiya
Owner

amaiya commented Aug 12, 2020

Hi,

The answer is yes, as ktrain is just a lightweight wrapper around tf.Keras. This isn't in the FAQ, but probably should be added.

The models returned by text_classifier and Transformer.get_classifier and other model-creation methods/functions are just tf.Keras models. Thus, you can execute the compile method on them to change the loss function or optimizer just as you would normally in Keras. For instance, the learner.set_weight_decay method simply switches out the Adam optimizer with the AdamWeightDecay optimizer by invoking model.compile.

Also, as an aside, all *fit* methods in ktrain include a class_weight parameter (although you may prefer to use focal loss).
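For example, here is a rough sketch of that recompile step on a DistilBERT classifier. The dataset, checkpoint name, and hyperparameters below are just placeholders, and the loss shown is ordinary categorical cross-entropy; any tf.Keras-compatible loss can be dropped in the same way:

# Rough sketch of recompiling a ktrain transformer classifier with a
# different loss and optimizer. Dataset, checkpoint, and hyperparameters
# are placeholders, not part of the original answer.
import tensorflow as tf
import ktrain
from ktrain import text
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'comp.graphics']
train_b = fetch_20newsgroups(subset='train', categories=categories)
test_b = fetch_20newsgroups(subset='test', categories=categories)

t = text.Transformer('distilbert-base-uncased', maxlen=128,
                     class_names=train_b.target_names)
trn = t.preprocess_train(train_b.data, train_b.target)
val = t.preprocess_test(test_b.data, test_b.target)

model = t.get_classifier()  # an ordinary tf.keras.Model

# Swap in whatever loss/optimizer you like, exactly as in plain Keras.
# (Transformer models output logits, hence from_logits=True; the labels
# produced by preprocess_train are assumed to be one-hot encoded here.)
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
              metrics=['accuracy'])

learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 1)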

@mldatabunch
Author

mldatabunch commented Aug 12, 2020

Great, thank you. I will check out the class weights as well. I'm trying to incorporate focal loss with ktrain, and I'm hitting the following error (the focal loss implementation was taken from the above link):

ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int64: <tf.Tensor 'IteratorGetNext:3' shape=(None, None) dtype=int64>

I'm debugging it, but I'd appreciate it if you could take a look when you have some time, since you know how the downstream LR-tuning methods work.

@mldatabunch
Author

mldatabunch commented Aug 12, 2020

Sorted out that issue, but my loss is now NaN. Here is the updated focal loss code:

import tensorflow as tf

def focal_loss(gamma=2., alpha=4.):

    gamma = float(gamma)
    alpha = float(alpha)

    def focal_loss_fixed(y_true, y_pred):
        """Focal loss for multi-classification
        FL(p_t)=-alpha(1-p_t)^{gamma}ln(p_t)
        Notice: y_pred is probability after softmax
        gradient is d(Fl)/d(p_t) not d(Fl)/d(x) as described in paper
        d(Fl)/d(p_t) * [p_t(1-p_t)] = d(Fl)/d(x)
        Focal Loss for Dense Object Detection
        https://arxiv.org/abs/1708.02002

        Arguments:
            y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
            y_pred {tensor} -- model's output, shape of [batch_size, num_cls]

        Keyword Arguments:
            gamma {float} -- (default: {2.0})
            alpha {float} -- (default: {4.0})

        Returns:
            [tensor] -- loss.
        """
        epsilon = 1.e-9
        y_true = tf.cast(y_true, dtype=tf.float32)
        y_pred = tf.cast(y_pred, dtype=tf.float32)

        model_out = tf.add(y_pred, epsilon)
        ce = tf.multiply(y_true, -tf.math.log(model_out))
        weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
        fl = tf.multiply(alpha, tf.multiply(weight, ce))
        reduced_fl = tf.reduce_max(fl, axis=1)
        return tf.reduce_mean(reduced_fl)
    return focal_loss_fixed

@amaiya
Owner

amaiya commented Aug 12, 2020

Have you verified that this works outside of ktrain with a simple baseline tf.keras model? If so, send me a link to a Google Colab demo using the 20 Newsgroups dataset.

@amaiya
Owner

amaiya commented Aug 12, 2020

Unlike other models in ktrain, transformers-based models return raw output (logits) prior to softmax. However, as the docstring for focal_loss above shows, y_pred is assumed to be a probability after softmax. Thus, you must run y_pred through softmax yourself, after which you can use it with models like DistilBert (see the softmax conversion below when from_logits=True is supplied):

import tensorflow as tf
from tensorflow.keras import activations
def focal_loss(gamma=2., alpha=4., from_logits=False):

    gamma = float(gamma)
    alpha = float(alpha)

    def focal_loss_fixed(y_true, y_pred):
        """Focal loss for multi-classification
        FL(p_t)=-alpha(1-p_t)^{gamma}ln(p_t)
        Notice: y_pred is the model output BEFORE softmax if from_logits is True
        gradient is d(Fl)/d(p_t) not d(Fl)/d(x) as described in paper
        d(Fl)/d(p_t) * [p_t(1-p_t)] = d(Fl)/d(x)
        Focal Loss for Dense Object Detection
        https://arxiv.org/abs/1708.02002

        Arguments:
            y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
            y_pred {tensor} -- model's output, shape of [batch_size, num_cls]

        Keyword Arguments:
            gamma {float} -- (default: {2.0})
            alpha {float} -- (default: {4.0})

        Returns:
            [tensor] -- loss.
        """
        epsilon = 1.e-9
        y_true = tf.cast(y_true, dtype=tf.float32)
        y_pred = tf.cast(y_pred, dtype=tf.float32)
        if from_logits:  y_pred = activations.softmax(y_pred) 

        model_out = tf.add(y_pred, epsilon)
        ce = tf.multiply(y_true, -tf.math.log(model_out))
        weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
        fl = tf.multiply(alpha, tf.multiply(weight, ce))
        reduced_fl = tf.reduce_max(fl, axis=1)
        return tf.reduce_mean(reduced_fl)
    return focal_loss_fixed
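Wiring this into ktrain is then just another compile call. A minimal usage sketch follows; model, trn, and val are assumed to come from Transformer.get_classifier / preprocess_train / preprocess_test as usual, and the hyperparameters are placeholders:

import ktrain

# model is assumed to be the logits-producing classifier from
# Transformer.get_classifier; trn/val are the preprocessed datasets.
model.compile(loss=focal_loss(gamma=2., alpha=4., from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])

learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.lr_find(show_plot=True)   # the LR range test runs with the new loss
learner.fit_onecycle(2e-5, 4)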

amaiya closed this as completed Aug 12, 2020
@mldatabunch
Author

mldatabunch commented Aug 12, 2020

Thank you. I was actually just in the process of sending over the Colab notebook. This works on my end, and I'm sure it will be helpful for anyone trying to work with imbalanced datasets in the future. Class weights and focal loss are two of the better ways to handle imbalance without the need for synthetic sampling.

One question: can we get class weights to work with learner.lr_find() as well? Currently they work with the fit methods as you described, e.g., learner.fit_onecycle(2e-6, 4, class_weight=class_weight_dict).
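In case it helps anyone else, here is roughly how I build that dictionary; the scikit-learn helper is just my choice, and y_train is assumed to hold integer class labels:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train is assumed to be an array of integer class labels
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight_dict = {int(c): float(w) for c, w in zip(classes, weights)}

learner.fit_onecycle(2e-6, 4, class_weight=class_weight_dict)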

@amaiya
Owner

amaiya commented Aug 12, 2020

Thanks - class_weight will be added to lr_find in the next release. Also, the FAQ has been updated based on this post.

amaiya added the user question label Sep 22, 2020