
Custom loss and optimizers for BERT models through the ktrain Transformers call? #228

Closed
mldatabunch opened this issue Aug 12, 2020 · 7 comments
Labels: user question (Further information is requested)

Comments

@mldatabunch

Hi, I looked through the FAQ and closed issues as much as possible, so apologies if this has been answered already.

Can I add different loss functions and optimizers for pre-trained BERT models? To give you my use case, I have an imbalanced dataset so I'm looking to use focal loss. Thanks for the very useful library!

@amaiya
Owner

amaiya commented Aug 12, 2020

Hi,

The answer is yes, as ktrain is just a lightweight wrapper around tf.Keras. This isn't in the FAQ, but probably should be added.

The models returned by text_classifier and Transformer.get_classifier and other model-creation methods/functions are just tf.Keras models. Thus, you can execute the compile method on them to change the loss function or optimizer just as you would normally in Keras. For instance, the learner.set_weight_decay method simply switches out the Adam optimizer with the AdamWeightDecay optimizer by invoking model.compile.

Also, as an aside, all *fit* methods in ktrain include a class_weight parameter (although you may prefer to use focal loss).
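For example, here is a rough sketch of that recompile step on a DistilBERT classifier. The dataset, checkpoint name, and hyperparameters below are just placeholders, and the loss shown is ordinary categorical cross-entropy; any tf.Keras-compatible loss can be dropped in the same way:

# Rough sketch of recompiling a ktrain transformer classifier with a
# different loss and optimizer. Dataset, checkpoint, and hyperparameters
# are placeholders, not part of the original answer.
import tensorflow as tf
import ktrain
from ktrain import text
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'comp.graphics']
train_b = fetch_20newsgroups(subset='train', categories=categories)
test_b = fetch_20newsgroups(subset='test', categories=categories)

t = text.Transformer('distilbert-base-uncased', maxlen=128,
                     class_names=train_b.target_names)
trn = t.preprocess_train(train_b.data, train_b.target)
val = t.preprocess_test(test_b.data, test_b.target)

model = t.get_classifier()  # an ordinary tf.keras.Model

# Swap in whatever loss/optimizer you like, exactly as in plain Keras.
# (Transformer models output logits, hence from_logits=True; the labels
# produced by preprocess_train are assumed to be one-hot encoded here.)
model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
              metrics=['accuracy'])

learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(5e-5, 1)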

@mldatabunch
Author

mldatabunch commented Aug 12, 2020

Great, thank you. I will check out the class weights as well. I'm trying to incorporate focal loss with ktrain, and I'm hitting the following error (the focal loss implementation was taken from the above link):

ValueError: Tensor conversion requested dtype float32 for Tensor with dtype int64: <tf.Tensor 'IteratorGetNext:3' shape=(None, None) dtype=int64>

I'm debugging it, but I'd appreciate it if you could take a look when you have some time, since you know how the downstream LR-tuning methods work.

@mldatabunch
Author

mldatabunch commented Aug 12, 2020

Sorted out that issue, but my loss is now NaN. Here is the updated focal loss code:

import tensorflow as tf

def focal_loss(gamma=2., alpha=4.):

    gamma = float(gamma)
    alpha = float(alpha)

    def focal_loss_fixed(y_true, y_pred):
        """Focal loss for multi-classification
        FL(p_t)=-alpha(1-p_t)^{gamma}ln(p_t)
        Notice: y_pred is probability after softmax
        gradient is d(Fl)/d(p_t) not d(Fl)/d(x) as described in paper
        d(Fl)/d(p_t) * [p_t(1-p_t)] = d(Fl)/d(x)
        Focal Loss for Dense Object Detection
        https://arxiv.org/abs/1708.02002

        Arguments:
            y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
            y_pred {tensor} -- model's output, shape of [batch_size, num_cls]

        Keyword Arguments:
            gamma {float} -- (default: {2.0})
            alpha {float} -- (default: {4.0})

        Returns:
            [tensor] -- loss.
        """
        epsilon = 1.e-9
        y_true = tf.cast(y_true, dtype=tf.float32)
        y_pred = tf.cast(y_pred, dtype=tf.float32)

        model_out = tf.add(y_pred, epsilon)
        ce = tf.multiply(y_true, -tf.math.log(model_out))
        weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
        fl = tf.multiply(alpha, tf.multiply(weight, ce))
        reduced_fl = tf.reduce_max(fl, axis=1)
        return tf.reduce_mean(reduced_fl)
    return focal_loss_fixed

@amaiya
Owner

amaiya commented Aug 12, 2020

Have you verified that this works outside of ktrain with a simple baseline tf.keras model? If so, send me a link to a Google Colab demo using the 20 Newsgroups dataset.

@amaiya
Owner

amaiya commented Aug 12, 2020

Unlike other models in ktrain, transformers-based models return raw output (logits) prior to softmax. However, as the docstring for focal_loss above shows, y_pred is assumed to be a probability after softmax. Thus, you must run y_pred through softmax yourself, after which you can use it with models like DistilBert (see the softmax conversion below when from_logits=True is supplied):

import tensorflow as tf
from tensorflow.keras import activations
def focal_loss(gamma=2., alpha=4., from_logits=False):

    gamma = float(gamma)
    alpha = float(alpha)

    def focal_loss_fixed(y_true, y_pred):
        """Focal loss for multi-classification
        FL(p_t)=-alpha(1-p_t)^{gamma}ln(p_t)
        Notice: y_pred is the model output BEFORE softmax if from_logits is True
        gradient is d(Fl)/d(p_t) not d(Fl)/d(x) as described in paper
        d(Fl)/d(p_t) * [p_t(1-p_t)] = d(Fl)/d(x)
        Focal Loss for Dense Object Detection
        https://arxiv.org/abs/1708.02002

        Arguments:
            y_true {tensor} -- ground truth labels, shape of [batch_size, num_cls]
            y_pred {tensor} -- model's output, shape of [batch_size, num_cls]

        Keyword Arguments:
            gamma {float} -- (default: {2.0})
            alpha {float} -- (default: {4.0})

        Returns:
            [tensor] -- loss.
        """
        epsilon = 1.e-9
        y_true = tf.cast(y_true, dtype=tf.float32)
        y_pred = tf.cast(y_pred, dtype=tf.float32)
        if from_logits:  y_pred = activations.softmax(y_pred) 

        model_out = tf.add(y_pred, epsilon)
        ce = tf.multiply(y_true, -tf.math.log(model_out))
        weight = tf.multiply(y_true, tf.pow(tf.subtract(1., model_out), gamma))
        fl = tf.multiply(alpha, tf.multiply(weight, ce))
        reduced_fl = tf.reduce_max(fl, axis=1)
        return tf.reduce_mean(reduced_fl)
    return focal_loss_fixed
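Wiring this into ktrain is then just another compile call. A minimal usage sketch follows; model, trn, and val are assumed to come from Transformer.get_classifier / preprocess_train / preprocess_test as usual, and the hyperparameters are placeholders:

import ktrain

# model is assumed to be the logits-producing classifier from
# Transformer.get_classifier; trn/val are the preprocessed datasets.
model.compile(loss=focal_loss(gamma=2., alpha=4., from_logits=True),
              optimizer='adam',
              metrics=['accuracy'])

learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.lr_find(show_plot=True)   # the LR range test runs with the new loss
learner.fit_onecycle(2e-5, 4)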

amaiya closed this as completed Aug 12, 2020
@mldatabunch
Author

mldatabunch commented Aug 12, 2020

Thank you. I was actually just in the process of sending over the Colab notebook. This works on my end, and I'm sure it will be helpful for anyone trying to work with imbalanced datasets in the future. Class weights and focal loss are two of the better ways to handle imbalance without the need for synthetic sampling.

One question: can we get class weights to work with learner.lr_find() as well? Currently they work with the fit methods as you described, e.g., learner.fit_onecycle(2e-6, 4, class_weight=class_weight_dict).
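In case it helps anyone else, here is roughly how I build that dictionary; the scikit-learn helper is just my choice, and y_train is assumed to hold integer class labels:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# y_train is assumed to be an array of integer class labels
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight_dict = {int(c): float(w) for c, w in zip(classes, weights)}

learner.fit_onecycle(2e-6, 4, class_weight=class_weight_dict)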

@amaiya
Owner

amaiya commented Aug 12, 2020

Thanks - class_weight will be added to lr_find in the next release. Also, the FAQ has been updated based on this post.

amaiya added the user question label Sep 22, 2020