I reproduced the data preprocessing and then trained the model with the electra-large-discriminator PLM and the msde local_and_nonlocal strategy.
I found that it takes around 50 minutes per epoch on a Tesla V100 32G with the same hyper-parameters as in the paper.
Besides, I made some modifications to use DDP with 4 GPUs, but the time per epoch only dropped to about 40 minutes.
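For reference, my DDP change looks roughly like the sketch below (a minimal, self-contained example launched with `torchrun --nproc_per_node=4`; the tiny linear model and random tensors are just placeholders standing in for the LGESQL model and the Spider data, not the actual repo code):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # toy stand-ins for the LGESQL model and training set
    model = nn.Linear(128, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset, shuffle=True)
    # per-GPU batch size = global batch size / world size, to keep the
    # effective batch size the same as the single-GPU setting
    loader = DataLoader(dataset, batch_size=5, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)          # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()               # gradients all-reduced across the 4 GPUs
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```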
Is your training time about the same?
I want to run some experiments with the LGESQL base model, but the time cost is ..... [SAD]