Ok, I'm scoping this out right now, taking the approach of adding a few targeted callbacks to the original trainer and simplifying what's there / making it more modular. Here's a list of the specific changes I'm thinking of making; I'm putting this here to get feedback before starting. My process for deciding what to change was to (1) see which parameters to __init__ I can remove or push into other, configurable dependencies, and (2) see which other places might want some customization, and propose specific callbacks for them.
Change the name of TrainerBase to Trainer, and Trainer to GradientDescentTrainer. This is just a cosmetic change that makes the trainer code consistent with other naming conventions in the library, and makes it so we can use Trainer annotations in places where it's appropriate, instead of the more awkward-looking TrainerBase. I could maybe be persuaded to just make it one class, even, that's still Registrable and can be subclassed (like Embedding is).
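For concreteness, here's a minimal sketch of what the renaming could look like, reusing the library's existing Registrable machinery (the registered name "gradient_descent" is my guess, not a decided detail):

```python
from allennlp.common import Registrable


class Trainer(Registrable):
    """Base class (today's TrainerBase); concrete trainers subclass this and register themselves."""

    default_implementation = "gradient_descent"

    def train(self):
        raise NotImplementedError


@Trainer.register("gradient_descent")
class GradientDescentTrainer(Trainer):
    """Today's Trainer, renamed; the existing training loop would live here."""

    def train(self):
        pass  # existing implementation, unchanged
```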
Make TensorboardWriter an argument to __init__, and to from_partial_objects (with a Lazy annotation), so I can remove several of the __init__ parameters that are specific to the tensorboard writer. Specifically, these parameters: allennlp/allennlp/training/trainer.py, lines 59 to 62 at dd4ee62.

Move the tensorboard logging logic into the TensorboardWriter class, so it's more easily configurable with your own TensorboardWriter: allennlp/allennlp/training/trainer.py, lines 445 to 464 at dd4ee62. This also could let us remove the parameter at line 63 of the same file.
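As a rough sketch of the shape this could take (the signatures, the simplified stand-in writer, and the Lazy usage here are assumptions, not the final API), the trainer would take a single writer object, and from_partial_objects would construct it lazily so it can be handed the serialization directory:

```python
from typing import Optional


class TensorboardWriter:
    """Simplified stand-in for the real TensorboardWriter; it would own all
    tensorboard configuration (summary interval, etc.) and the logging logic."""

    def __init__(self, serialization_dir: str, summary_interval: int = 100) -> None:
        self.serialization_dir = serialization_dir
        self.summary_interval = summary_interval


class GradientDescentTrainer:
    def __init__(self, model, optimizer,
                 tensorboard_writer: Optional[TensorboardWriter] = None) -> None:
        # All tensorboard-specific configuration now lives on the writer object;
        # the trainer just calls into it (if one was given).
        self._tensorboard = tensorboard_writer

    @classmethod
    def from_partial_objects(cls, model, optimizer, serialization_dir: str,
                             tensorboard_writer=None):
        # With a Lazy[TensorboardWriter] annotation, construction is deferred
        # until we can supply serialization_dir here.
        if tensorboard_writer is not None:
            writer = tensorboard_writer.construct(serialization_dir=serialization_dir)
        else:
            writer = TensorboardWriter(serialization_dir)
        return cls(model, optimizer, tensorboard_writer=writer)
```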
Question here: I could generalize TensorboardWriter to handle other kinds of logging, and just add a bunch of methods / calls to it, which might simplify some things. If the only reason we want callbacks is to provide better logging, we could accomplish that with a specific, overridable TrainLogger class, or something, which is a generalization of the TensorboardWriter. Doing this would probably also let me completely remove the serialization_dir parameter to the Trainer. It's tempting to also rip out tqdm and roll our own thing somehow (or figure out a way to push it into the TrainLogger, or whatever it gets called), as that would simplify the trainer and fix some bugs. Decision on the question: add a couple of simple callbacks, one for the end of each batch and one for the end of each epoch. See the callbacks item at the end of this list.
Remove these two parameters, as they are redundant with the Checkpointer (and were already removed from from_partial_objects): allennlp/allennlp/training/trainer.py, lines 50 to 51 at dd4ee62.
I'll make sure that the default Checkpointer behavior is reasonable if none is passed. Also, move this parameter into the Checkpointer and remove it from the Trainer: allennlp/allennlp/training/trainer.py, line 53 at dd4ee62.
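A small sketch of the "reasonable default if none is passed" behavior (assuming the existing Checkpointer can be built from just a serialization directory; the real class takes more options):

```python
from typing import Optional

from allennlp.training.checkpointer import Checkpointer


class GradientDescentTrainer:
    def __init__(self, model, optimizer, serialization_dir: str,
                 checkpointer: Optional[Checkpointer] = None) -> None:
        # Fall back to a default Checkpointer pointed at the serialization
        # directory when the user doesn't configure one explicitly.
        self._checkpointer = checkpointer or Checkpointer(serialization_dir)
```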
On grad_norm and grad_clipping: I could try to remove these, but I'm not sure it's worth it. This would mean adding some kind of callback, except it's a weird callback, because clipping requires setting hooks on the model before starting training. I'm leaning towards just leaving these two alone; they don't add much complexity to the Trainer, and we don't get issues asking to make them more configurable than they already are. (Nothing to do, crossing it off the list.)
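For context on why this doesn't fit the callback model: clipping works by registering backward hooks on the model's parameters up front, rather than reacting to a per-batch or per-epoch event. A plain PyTorch sketch of that hook-based approach (not the exact library code):

```python
import torch


def enable_gradient_clipping(model: torch.nn.Module, grad_clipping: float) -> None:
    for parameter in model.parameters():
        if parameter.requires_grad:
            # The hook runs during backward() and replaces each gradient with a
            # clamped copy; it has to be registered once, before training starts.
            parameter.register_hook(
                lambda grad, clip=grad_clipping: grad.clamp(-clip, clip)
            )
```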
I don't think num_gradient_accumulation_steps can be simplified or put into a callback. There are three parameters for distributed training, which I could maybe simplify to one object, but that doesn't really seem worth it. All of the rest of the parameters are either core training parameters or configurable objects that don't seem like they need to be further separated out. (Nothing to do, crossing it off the list)
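The reason gradient accumulation resists being factored out is that it changes the structure of the inner loop itself, i.e. how often optimizer.step() fires, rather than hooking into it. Schematically (a simplified sketch, not the trainer's actual code):

```python
def train_epoch(model, optimizer, batches, num_gradient_accumulation_steps: int = 1) -> None:
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        # AllenNLP models return a dict containing "loss"; scale it so the
        # accumulated gradient matches a single large-batch update.
        loss = model(**batch)["loss"] / num_gradient_accumulation_steps
        loss.backward()  # gradients accumulate across the group of batches
        if (i + 1) % num_gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```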
Add two very simple callbacks, one that gets called at the end of each batch (getting passed the batch inputs and outputs, and the trainer object), and one that gets called at the end of each epoch (getting metrics and the trainer object). This allows for an incredible amount of flexibility, for logging, saving predictions, or whatever you want.
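Concretely, I'm imagining something with this shape (names and exact signatures are my guess at the proposal, not a final API):

```python
from typing import Any, Callable, Dict, List

# Called at the end of every batch: sees the batch inputs, the model outputs,
# and the trainer itself.
BatchCallback = Callable[["GradientDescentTrainer", List[Dict[str, Any]], Dict[str, Any]], None]

# Called at the end of every epoch: sees the epoch metrics and the trainer itself.
EpochCallback = Callable[["GradientDescentTrainer", Dict[str, Any]], None]


def save_predictions(trainer, batch_inputs, batch_outputs) -> None:
    # e.g. write batch_outputs somewhere, push numbers to an external logger, ...
    pass


def log_metrics(trainer, metrics) -> None:
    print(metrics)
```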
(The list above is quoted from #3519 (comment).)