
Consider removing CallbackTrainer #3519

Closed
DeNeutoy opened this issue Dec 13, 2019 · 44 comments

DeNeutoy (Contributor) commented Dec 13, 2019

The callback trainer hasn't really worked, because callbacks that we've tried to add have required setting state on the CallbackTrainer itself, which makes them hard to add. Given this, we are just maintaining 2 Trainers unnecessarily, which slows us down.

Happy to hear reasons why we should keep the callback trainer/ if people have found it particularly useful!

matt-gardner (Contributor) commented Dec 13, 2019

I would have voted the other way, that we should remove the original trainer.

DeNeutoy (Contributor, Author) commented Dec 13, 2019

I'm opposed, because:

schmmd added this to the 1.0.0 milestone Dec 13, 2019

matt-gardner (Contributor) commented Dec 13, 2019

To summarize some in-person discussion: because training GANs and typical training are so different, we explicitly don't want to have one class that can do both, because it means that it's too general. What we want instead is to have re-usable components that make it easy to do normal training, GAN-style training, and customization in your training loop. We're not certain yet exactly what form that will take, but we don't think either the original trainer or the callback trainer really give us that.

epwalsh (Member) commented Dec 16, 2019

FWIW I love the callback trainer - even if its implementation / API isn't perfect - because it makes it really easy to customize logging and add some other useful stuff to the training loop.

mojesty (Contributor) commented Dec 18, 2019

I like CallbackTrainer too, because it makes extending the training process far easier than in the original Trainer. Admittedly, setting state on the CallbackTrainer is a bit painful, but it's still better than rewriting the whole thing.

epwalsh (Member) commented Dec 18, 2019

If anyone has thoughts on how to improve it though I wouldn't mind putting in some work there.

matt-gardner (Contributor) commented Dec 18, 2019

What if instead of passing the trainer itself around, and putting it in the constructor of the callback handler, we instead have a trainer_state dictionary, which we pass when we fire the callbacks? There could be a method (even a callback itself) that constructs (or just updates) this for each batch. This would also let callbacks modify that state, adding things that are needed later.

The hard-coded batch updates were something that you pushed for in the first place, @DeNeutoy, if I'm remembering correctly; the original version had the batch update as a callback. I don't have a strong opinion about this, but I thought the more flexible version, with the batch update as a callback, was probably better.

mikerossgithub (Contributor) commented Dec 18, 2019

FWIW, I think the modularization of CallbackTrainer is a vast improvement over the default Trainer. I used it to implement a callback which wrote out predictions for my test and validation set during training, which in turn saved me from having to serialize every model.

For me, the biggest gripe is that it's awkward to define callback order using the priority-integer values. Having a separate trainer_state dictionary as @matt-gardner suggests would be an improvement, and could also be used to make the ordering less awkward without defining explicit dependencies. For example, a callback could check trainer_state for a needed variable and be put at the back of the current queue if the variable isn't ready yet.

epwalsh (Member) commented Dec 18, 2019

@mikerossgithub another option to replace the priority thing - maybe similar to what you're saying - would be to have a depends_on attribute or something like that for each callback. So callbacks could specify other callbacks that they depend on, and the ones they depend on would have to be executed first.
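
A rough sketch of how a depends_on attribute could be resolved into an execution order; the Callback class and its fields here are hypothetical illustrations, not the existing AllenNLP API:

    from typing import Dict, List, Set


    class Callback:
        """Hypothetical base class: each callback names itself and the callbacks it depends on."""

        name: str = "base"
        depends_on: Set[str] = set()

        def __call__(self, state: Dict) -> None:
            pass


    def order_callbacks(callbacks: List[Callback]) -> List[Callback]:
        """Return the callbacks ordered so that each one runs after everything it depends on."""
        by_name = {cb.name: cb for cb in callbacks}
        ordered: List[Callback] = []
        visiting: Set[str] = set()
        done: Set[str] = set()

        def visit(name: str) -> None:
            if name in done:
                return
            if name in visiting:
                raise ValueError(f"Circular callback dependency involving '{name}'")
            if name not in by_name:
                raise ValueError(f"Callback '{name}' is required but was never registered")
            visiting.add(name)
            for dependency in by_name[name].depends_on:
                visit(dependency)
            visiting.remove(name)
            done.add(name)
            ordered.append(by_name[name])

        for callback in callbacks:
            visit(callback.name)
        return ordered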

matt-gardner (Contributor) commented Dec 18, 2019

Seems like either of those options would be more confusing to me than specifying a priority. @mikerossgithub's option needs logic for putting a callback back in a queue, while @epwalsh's option needs logic for checking depends_on. There are all kinds of complexities when you have either of those things (e.g., what happens when your requirements are never met?). Specifying a priority seems a lot easier to reason about.

mikerossgithub (Contributor) commented Dec 19, 2019

Agree with @matt-gardner that priority integers are the simplest to implement. But I'm not sure they are the simplest for end users. The issue is that when you write a new callback, you need to find a priority integer that falls between those of all the consumers and producers of any variables you modify in trainer_state, and the only way to do that is to check the source code of all the other callbacks. It's awkward.

@epwalsh's proposal and mine are just ways to automate finding the correct priority integer. That said, the benefits of something easier to use aren't necessarily worth the implementation complexity. Maybe it would be enough to just print the priority ordering for each callback point in the log? That would ease debugging.

epwalsh (Member) commented Dec 19, 2019

@mikerossgithub I agree, logging the priority would be a good easy improvement, at least for the short term

matt-gardner (Contributor) commented Dec 19, 2019

Ok, so, we have a few problems, each with their own solution:

  1. We want to avoid having to set state on the trainer, because it's hard to maintain, extend and reason about. Getting rid of the need for this state was one of the main reasons we added the CallbackTrainer in the first place, but it apparently didn't work well enough. A potential solution is to have a separate dictionary which gets passed to callbacks (instead of self), and the callbacks modify and/or add to it.

  2. The priorities are hard to reason about / discover their ordering. A quick fix is to log the priorities (probably upon instantiation, so it only happens once). I'm maybe coming around to being able to specify which other callbacks need to run first, though that's a much more complex change.

  3. The batch update is hard-coded, instead of in a callback. This could easily be moved into a callback, especially once the state has moved off of self.

@DeNeutoy, does this resolve your concerns?

One annoying issue with moving to a state dictionary is that we lose type checking in the callbacks. We basically need to declare state as Dict[str, Any]. I'm pretty sure this is why Joel implemented things the way he did in the first place, because he really wanted a way to maintain type checking within the callbacks. But it really hurts extensibility. An extensible framework, where unknown callbacks can add arbitrary things to the state, is fundamentally incompatible with python's typing library. We have to choose one or the other. And in this case, I think we should choose extensibility. We can at least do things like loss: torch.Tensor = state['loss'] # type: ignore as we read from the state dictionary.
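
A minimal sketch of the state-dictionary idea; the event wiring and callback signature here are illustrative assumptions, not the existing API:

    from typing import Any, Callable, Dict, List

    import torch

    # Each callback just takes the mutable state dict; it can read from it and add to it.
    Callback = Callable[[Dict[str, Any]], None]


    def fire_callbacks(callbacks: List[Callback], state: Dict[str, Any]) -> None:
        for callback in callbacks:
            callback(state)


    def log_loss(state: Dict[str, Any]) -> None:
        # Reading from the untyped state is where we give up type checking.
        loss: torch.Tensor = state["loss"]  # type: ignore
        print(f"batch {state['batches_completed']}: loss = {loss.item():.4f}")


    # Somewhere inside a hypothetical training loop:
    state: Dict[str, Any] = {"batches_completed": 1, "loss": torch.tensor(0.25)}
    fire_callbacks([log_loss], state)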

mikerossgithub (Contributor) commented Dec 19, 2019

One question: what does the state-dictionary really buy you over just using the trainer object as shared storage? I get the conceptual tidiness, but it narrows the potential scope of callbacks.

As an example of the opposite design philosophy, I've enjoyed working with mxnet, which passes the entire environment (via locals()) to give maximum flexibility to any callback.

matt-gardner (Contributor) commented Dec 19, 2019

All three methods (using the trainer object itself, using a state dictionary, and using locals()) accomplish the same thing. The difference is in how easy it is to reason about what's going on, and what you have access to at any given point.

If you want to use the trainer object as shared storage, what you really mean is that you're going to be adding things to self on the trainer. This means that I'm going to be calling things like trainer.my_special_state, which is really confusing if I don't see trainer.my_special_state defined in Trainer.__init__. Where did it come from? This also really confuses automatic type checkers like mypy, so you'd need to have all kinds of # type: ignore statements everywhere.

If you want to use locals(), you expose way more than you really need to, and it's easy to have name conflicts and other confusing things. It's not obvious to someone who is pretty familiar with python exactly what things are included in that dictionary, and knowing what to access and how is kind of a mess.

Using a state dictionary is more transparent. We can document exactly what things the trainer puts into that state, and anything else that you want to access had better be put in by the callback itself. It's easier to reason about than using the trainer object as shared state (in addition to what I said above, it's cleaner in this case to separate the thing that has state from the thing that produces the state).

epwalsh (Member) commented Dec 19, 2019

What if, instead of just using a raw dictionary for the state, we used an extendable dataclass-like thing? This could be registrable, so users could extend it while still having the benefits of type checking.

matt-gardner (Contributor) commented Dec 19, 2019

The trouble with that is when you need several callbacks that add different things. And how does the trainer construct the specific version that you want? You really need a single object, constructed by the trainer, that gets modified as you go. Pretty hard to see how that works with something like a dataclass.

epwalsh (Member) commented Dec 19, 2019

Well the trainer would construct the specific version by calling .by_name() since it would be registrable, right? Meh.. but then we wouldn't have type-checking on any custom fields anyway.

What about a combination of both those ideas: Have an actual class for the state that contains fields for everything that must always be in the training state, but also have a field in that state that is just a dictionary so callbacks could really add anything they want.

Another idea would be to allow callbacks to access the state of other callbacks. Then we don't need to have any dict state object in the trainer itself.

matt-gardner (Contributor) commented Dec 20, 2019

I think having a TrainerState object that gets constructed by the trainer, has fields for standard parts, and a mutable dictionary for callback-specific stuff, seems like a reasonable middle ground. If we decide that something is generally useful enough for many callbacks, it can be graduated to a dedicated field, but that's not required to add your own state inside a callback.

I'm not really sure what you mean by having callbacks access the state of other callbacks. That sounds really messy.

epwalsh (Member) commented Dec 20, 2019

I'm not really sure what you mean by having callbacks access the state of other callbacks.

I'm not really sure either 🙃 Just using this TrainerState object is probably sufficient and definitely simpler.
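
A sketch of the TrainerState middle ground discussed above, assuming hypothetical field names:

    from dataclasses import dataclass, field
    from typing import Any, Dict, Optional

    import torch


    @dataclass
    class TrainerState:
        # Typed fields for the things every training loop has.
        epoch: int = 0
        batches_completed: int = 0
        loss: Optional[torch.Tensor] = None
        # Untyped bag for callback-specific values; anything that turns out to be
        # generally useful could later be "graduated" to a dedicated typed field.
        extras: Dict[str, Any] = field(default_factory=dict)


    state = TrainerState(epoch=3, loss=torch.tensor(0.1))
    state.extras["predictions_path"] = "/tmp/val_predictions.jsonl"  # a callback adding its own state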

dirkgr (Member) commented Dec 20, 2019

I'm surprised I don't see a suggestion of subclassing the trainer in specific cases, i.e., class MyCustomTrainer(Trainer): .... The trainer can then call some extension methods if necessary that would act a little like callbacks. State is stored inside the self object, but it's all properly initialized in __init__() methods, and the type checker has a chance to check the types.
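
A toy illustration of that idea (a stand-in trainer, not the real allennlp Trainer; it assumes the model returns a dict with a "loss" key):

    import torch


    class SimpleTrainer:
        """Toy stand-in for a trainer with overridable extension points."""

        def __init__(self, model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> None:
            self.model = model
            self.optimizer = optimizer

        def on_batch_end(self, batch_index: int, loss: float) -> None:
            """Extension point; the base class does nothing."""

        def train_one_epoch(self, batches) -> None:
            for i, batch in enumerate(batches):
                loss = self.model(**batch)["loss"]  # assumes the model returns a dict with a "loss" key
                loss.backward()
                self.optimizer.step()
                self.optimizer.zero_grad()
                self.on_batch_end(i, loss.item())


    class MyCustomTrainer(SimpleTrainer):
        def __init__(self, model, optimizer, log_every: int = 10) -> None:
            super().__init__(model, optimizer)
            self.log_every = log_every  # state lives on self, but is initialized in __init__ and visible to mypy

        def on_batch_end(self, batch_index: int, loss: float) -> None:
            if batch_index % self.log_every == 0:
                print(f"batch {batch_index}: loss = {loss:.4f}")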

DeNeutoy (Contributor, Author) commented Dec 21, 2019

Sorry, I've been traveling; some thoughts below:

First, @matt-gardner's comments:

  1. We want to avoid having to set state on the trainer - this is not quite an accurate representation of what I meant. Setting state on the trainer is not problematic in and of itself - it's just that in the CallbackTrainer, things which previously were stateless became stateful, which was a red flag for me. It seems risky, because assigning self.batch = batch might have unforeseen consequences with stuff like distributed training etc. Additionally, literally no other trainer you will see would set this state.

  2. Again, this problem points to the fact that callbacks are hard to reason about (everyone on this thread who has used the CallbackTrainer can reasonably be considered an AllenNLP power user). From the reactions to @epwalsh's post above, it seems like a common reason to like the callback trainer is that it makes custom logging easier. I would be in favour of, for example, adding a way to make custom logging in the default trainer easier, rather than using this as a sign that the callback trainer is the correct way forward.

  3. Having the batch update as a callback seems tough. What if you forget to include it? If we want to check for its presence, how will we do that whilst also making it easy to override?

I'm ambivalent about a state dictionary vs passing self. However, I think we should stay away from @epwalsh's idea of registrable dataclasses - anything which relies on constructing the trainer from params should be avoided.

I like @dirkgr's comment - I think the Callbacks are too flexible, but having some methods on the trainer, such as after_grad_update etc, which can be overridden (or passed as functions to the default trainer constructor, but limited in the scope of where they get called, and their signatures) seems good.

Overall, this discussion is an example of why pytorch doesn't have a maintained Trainer. Even with the callbacks, we're forcing people into a certain way of thinking about training NLP models, which is bad for research.

However, I'm basically in favour of having one trainer which people think is the best.

Approaches that I favour, in order:

  1. Adding a couple of callbacks directly to the default Trainer, such as on_epoch_end and on_grad_step, which are specifically designed to make it easy to do custom logging. Removing the callback trainer. Refactor current trainer to be more modular like the callback trainer, but without callbacks.

  2. The same as 1, but with the on_epoch_end and on_grad_step callbacks being methods on the default trainer, which you could override.

  3. Remove default trainer, keep callback trainer. Refactor to log callback priorities and use some kind of state dictionary, which is untyped and not registrable.

The Trainer and CallbackTrainer are two extremes - surely there is some halfway house between them which 1) doesn't set unnecessary state, 2) is easily subclassed/composed and 3) removes the duplication we currently have. How exactly this happens I'm not too bothered about - my preference is above, but I'm happy for other people with stronger opinions to pick a solution, so long as it results in one of the two trainers being removed.
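
A sketch of the kind of halfway house option 1 describes - a couple of narrowly-scoped callbacks passed to the default trainer's constructor; the class, hook names, and signatures are illustrative assumptions only:

    from typing import Any, Callable, Dict, Optional

    # Narrow, fixed signatures instead of fully general callbacks:
    OnEpochEnd = Callable[[int, Dict[str, Any]], None]  # (epoch, metrics)
    OnGradStep = Callable[[int, float], None]           # (batch_index, loss)


    class DefaultTrainer:
        """Toy stand-in; the real trainer takes many more arguments."""

        def __init__(
            self,
            on_epoch_end: Optional[OnEpochEnd] = None,
            on_grad_step: Optional[OnGradStep] = None,
        ) -> None:
            self._on_epoch_end = on_epoch_end
            self._on_grad_step = on_grad_step

        def _end_of_epoch(self, epoch: int, metrics: Dict[str, Any]) -> None:
            # The trainer decides exactly where and with what arguments the hook fires.
            if self._on_epoch_end is not None:
                self._on_epoch_end(epoch, metrics)


    trainer = DefaultTrainer(
        on_epoch_end=lambda epoch, metrics: print(f"epoch {epoch}: {metrics.get('loss')}"),
    )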

epwalsh (Member) commented Dec 23, 2019

FWIW I am totally okay with option 1 above, since at least that fits my personal use case, and I think it would just be simpler to maintain a single trainer. Besides, if someone is willing they could split off the callback trainer into a separate repo and maintain it from there.

schmmd (Member) commented Jan 2, 2020

Related to #3380

schmmd (Member) commented Jan 6, 2020

Related: #3420

matt-gardner (Contributor) commented Feb 19, 2020

Ok, I'm scoping this out right now, taking the approach of adding a few targeted callbacks to the original trainer, and simplifying what's there / making it more modular. Here's a list of specific items that I'm thinking of doing; I'm putting this here to get feedback before starting on this. My process for deciding on these things to change was (1) see what parameters to __init__ I can remove / push to other, configurable dependencies, (2) see what other places might want some customization, and propose specific callbacks for them.

  1. Change the name of TrainerBase to Trainer, and Trainer to GradientDescentTrainer. This is just a cosmetic change that makes the trainer code consistent with other naming conventions in the library, and makes it so we can use Trainer annotations in places where it's appropriate, instead of the more awkward-looking TrainerBase. I could maybe be persuaded to just make it one class, even, that's still Registrable and can be subclassed (like Embedding is).

  2. Make TensorboardWriter an argument to __init__, and to from_partial_objects (with a Lazy annotation), so I can remove several of the __init__ parameters that are specific to the tensorboard writer. Specifically, that's these parameters:

    summary_interval: int = 100,
    histogram_interval: int = None,
    should_log_parameter_statistics: bool = True,
    should_log_learning_rate: bool = False,

  3. Move the tensorboard logging logic into the TensorboardWriter class, so it's more easily configurable with your own TensorboardWriter:

    if self._tensorboard.should_log_this_batch() and self._master:
        self._tensorboard.log_parameter_and_gradient_statistics(self.model, batch_grad_norm)
        self._tensorboard.log_learning_rates(self.model, self.optimizer)
        self._tensorboard.add_train_scalar("loss/loss_train", metrics["loss"])
        self._tensorboard.log_metrics({"epoch_metrics/" + k: v for k, v in metrics.items()})

    if self._tensorboard.should_log_histograms_this_batch() and self._master:
        self._tensorboard.log_histograms(self.model, histogram_parameters)

    if self._log_batch_size_period:
        batch_group_size = sum(training_util.get_batch_size(batch) for batch in batch_group)
        cumulative_batch_group_size += batch_group_size
        if (batches_this_epoch - 1) % self._log_batch_size_period == 0:
            average = cumulative_batch_group_size / batches_this_epoch
            logger.info(
                f"current batch size: {batch_group_size} mean batch size: {average}"
            )
            self._tensorboard.add_train_scalar("current_batch_size", batch_group_size)
            self._tensorboard.add_train_scalar("mean_batch_size", average)

    This could also let us remove this parameter:

    log_batch_size_period: Optional[int] = None,

    Question here: I could generalize TensorboardWriter to handle other kinds of logging, and just add a bunch of methods / calls to it, which might simplify some things. If the only reason we want callbacks is to provide better logging, we could accomplish that with a specific, overridable TrainLogger class, or something, which is a generalization of the TensorboardWriter. Doing this would probably also let me completely remove the serialization_dir parameter to the Trainer. It's tempting to also rip out tqdm and roll our own thing somehow (or figure out a way to push it into the TrainLogger or whatever it gets called), as that would simplify the trainer and fix some bugs.

  4. Remove these two parameters, as they are redundant with the Checkpointer (and were already removed from from_partial_objects):

    num_serialized_models_to_keep: int = 20,
    keep_serialized_model_every_num_seconds: int = None,

    I'll make sure that the default Checkpointer behavior is reasonable if none is passed. Also, move this parameter into the Checkpointer and remove it from the Trainer:

    model_save_interval: float = None,

  5. On grad_norm and grad_clipping: I could try to remove these, but I'm not sure it's worth it. This would mean adding some kind of callback, except it's a weird callback, because clipping requires setting hooks on the model before starting training. I'm leaning towards just leaving these two alone; they don't add much complexity to the Trainer, and we don't get issues asking to configure this more than what we already allow.

  6. I don't think num_gradient_accumulation_steps can be simplified or put into a callback. There are three parameters for distributed training, which I could maybe simplify to one object, but that doesn't really seem worth it. All of the rest of the parameters are either core training parameters or configurable objects that don't seem like they need to be further separated out.

I'm tempted to just start with this, without adding any callbacks, and see what we think. This amount of customization might itself already be enough, and the objects that are used should be re-usable if you want to write a separate training loop that does different things.
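
A rough sketch of the kind of overridable logger interface item 3 hints at; TrainLogger and its methods are hypothetical names from this comment, not an existing class:

    from typing import Any, Dict


    class TrainLogger:
        """Hypothetical generalization of TensorboardWriter: the trainer calls these
        hooks, and a subclass decides what (if anything) to record."""

        def log_batch(self, loss: float, metrics: Dict[str, float], batch_grad_norm: float) -> None:
            pass

        def log_epoch(self, epoch: int, metrics: Dict[str, float]) -> None:
            pass

        def log_validation_outputs(self, outputs: Dict[str, Any]) -> None:
            """Would receive model outputs, making things like prediction-saving possible."""


    class TensorboardTrainLogger(TrainLogger):
        def __init__(self, summary_writer) -> None:
            # e.g. a torch.utils.tensorboard.SummaryWriter
            self._writer = summary_writer

        def log_batch(self, loss: float, metrics: Dict[str, float], batch_grad_norm: float) -> None:
            self._writer.add_scalar("loss/loss_train", loss)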

dirkgr (Member) commented Feb 19, 2020

I could generalize TensorboardWriter to handle other kinds of logging ...

The TensorboardWriter changes sound like good ideas, but are somewhat orthogonal to this, and could be done separately. Two small PRs are better than one big one.

Just hypothetically speaking, how would we train GANs in this new world?

matt-gardner (Contributor) commented Feb 19, 2020

Training GANs is different, and should use a separate script / trainer. It is explicitly not a goal of this to figure out how to handle GANs / other kinds of training in a single abstraction.

mikerossgithub (Contributor) commented Feb 19, 2020

So just to inject some feedback from a user: our team has been making heavy use of CallbackTrainer with our own callbacks. But we are now looking at switching over to PyTorch Lightning and completely avoiding any of the AllenNLP trainers (we still want to use the AllenNLP model infrastructure).

PyTorch Lightning seems to be getting a lot of traction at creating flexible and generalizable training-loop infrastructure, with transparent access to distributed training (single-machine multi-GPU and clusters with multi-node/multi-GPU), 16-bit training, GAN training, etc.

You already closed #3221, but it feels to me like PyTorch Lightning may already be mature enough that AllenNLP could benefit from getting out of the business of training-loop management and leaving that to a dedicated project - sort of like you did with transformers and huggingface. We haven't completed our analysis yet, but so far I don't see any trainer capabilities in AllenNLP that can't be handled by Lightning.

Anyway, just a thought to consider. Maybe it's worth taking a second look?

matt-gardner (Contributor) commented Feb 19, 2020

@mikerossgithub, thanks for the feedback. What do your custom callbacks do, and when are they run?

I think for now we definitely are going to keep our own training loop implementation. But if you run into things that are hard in using our data / model code with another trainer, please let us know, and we'll see what we can do to make that easier.

mikerossgithub (Contributor) commented Feb 20, 2020

All of our customization is related to logging/predictions. We have a predict_and_validate callback which subclasses the validate callback so that it will also store the output predictions to disk. We do this to save space over storing the model, while still having flexibility to do post-analysis on the validation set (such as running new metrics). We do something similar to store training predictions. We also allow the specification of multiple test/validation sets instead of just one. And we subclass log_to_tensorboard to sort the metrics alphabetically since we have several dozen metrics (per-label metrics).

Any of these could be done by simply adding callbacks to the default trainer, without the need to actually implement the entire training logic as a set of callbacks. The tricky thing is to figure out how to allow someone to, for example, write a callback that stores predictions on the validation or training sets without having to traverse those sets a second time (since the traversal is already hardcoded in the default trainer).

We'll start an issue if we run into any problems using an external trainer.

matt-gardner (Contributor) commented Feb 20, 2020

Yeah, those are all things I want this refactoring to be able to handle. I was imagining having the Logger object get a method that gets the outputs from each training and validation batch, and by default does nothing (but you can configure it to save predictions, or something). It's good to know in practice what kinds of extensions you wanted, because that's exactly what I'm targeting here.

But, it's also an explicit goal of our 1.0 release to make it much easier for people to use whatever pieces of the library that they want to in their own scripts, so it's ok to us if our default trainer doesn't meet your needs and you want to use a different one. If you do have a really good experience merging allennlp code with pytorch-lightning code, let us know about it; I think we're open to having things change in the future.

bryant1410 (Contributor) commented Feb 21, 2020

  1. Change the name of TrainerBase to Trainer, and Trainer to DefaultTrainer.

What about GradientDescentTrainer? It's a minor thing, but it helps to give context about what the trainer is actually doing (e.g., != GAN training, != NoOpTrainer, != a Trainer for other forms of training such as scikit-learn's fit+predict). So maybe things such as num_gradient_accumulation_steps make sense to be kept then, and not as callbacks.

matt-gardner (Contributor) commented Feb 21, 2020

@bryant1410, yes, I like that suggestion.

matt-gardner (Contributor) commented Feb 21, 2020

@mikerossgithub, I just looked over the quick start to pytorch lightning, and I think it would be pretty easy for us to write a simple train-with-lightning script that instantiates a generic LightningModule that wraps our model and data classes. I have no complaints about having that option live alongside our existing training code, as it would really be quite simple and not much work for us to maintain. If you're interested in contributing to this, let me know, otherwise, I'll look into doing this myself, but it might be a little while (I'm traveling the first week of March, and don't have a lot of development time in between now and then).
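
A rough sketch of the kind of wrapper that script might contain, assuming an AllenNLP-style model whose forward() returns a dict with a "loss" key; the exact pytorch-lightning call signatures vary by version:

    import pytorch_lightning as pl
    import torch


    class LightningWrapper(pl.LightningModule):
        """Wraps a model whose forward() returns a dict containing a "loss" tensor."""

        def __init__(self, model: torch.nn.Module, lr: float = 1e-3) -> None:
            super().__init__()
            self.model = model
            self.lr = lr

        def training_step(self, batch, batch_idx):
            output = self.model(**batch)
            return output["loss"]

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)


    # Hypothetical usage, given a DataLoader that yields batches of tensor dicts:
    # pl.Trainer(max_epochs=5).fit(LightningWrapper(my_model), my_train_loader)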

matt-gardner (Contributor) commented Feb 21, 2020

This also needs to wait for #3700 to get finished and merged, as that transitions our data code to use pytorch data loaders, which is a prerequisite for using pytorch lightning.

mikerossgithub (Contributor) commented Feb 21, 2020

@matt-gardner Yes, that's basically our plan too. I would be happy to share our solution here as a PR, but we also may be a few weeks off from doing that. Hopefully sooner though...

schmmd (Member) commented Feb 21, 2020

Just as an FYI, we did an internal study and other users mentioned an interest in pytorch-lightning too.

viking-sudo-rm (Contributor) commented Feb 22, 2020

(Commenting my use case for callbacks as suggested in #3828)

I'm using callbacks to implement magnitude weight pruning at various points during training. I do this by registering custom callbacks at the end of epochs and the end of training to recompute masks for the weights.

FWIW I suggest continuing to support EPOCH_END and TRAINING_END, as they are important for this use case, and probably also others.
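
A sketch of the kind of epoch-end magnitude pruning this refers to, written as standalone PyTorch rather than against either trainer's callback API:

    from typing import Dict

    import torch


    def recompute_pruning_masks(model: torch.nn.Module, sparsity: float = 0.5) -> Dict[str, torch.Tensor]:
        """Zero out the smallest-magnitude weights of each >=2-d parameter and return the masks."""
        masks: Dict[str, torch.Tensor] = {}
        with torch.no_grad():
            for name, param in model.named_parameters():
                if param.dim() < 2:  # skip biases, layer norms, etc.
                    continue
                k = int(param.numel() * sparsity)
                if k == 0:
                    continue
                threshold = param.abs().flatten().kthvalue(k).values
                mask = (param.abs() > threshold).to(param.dtype)
                param.mul_(mask)  # apply the mask in place, e.g. at EPOCH_END / TRAINING_END
                masks[name] = mask
        return masks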

mateuszpieniak commented Feb 25, 2020

Well, my use case for using callbacks was the integration of AllenNLP with Polyaxon, so I could still use AllenNLP's configuration files and manage experiments, but in the external tool (for example track metrics, compare hyperparameters, make plots and so on).

matt-gardner (Contributor) commented Feb 27, 2020

@mikerossgithub, this is still really rough, but now that #3700 is merged, I made an initial pass at a pytorch-lightning trainer here: #3860. It was really easy to show a proof of concept, though making it clean will be a lot more work.

dirkgr (Member) commented Mar 4, 2020

With CallbackTrainer gone, can we close this issue?

matt-gardner (Contributor) commented Mar 4, 2020

No, we want to do the things listed here: #3519 (comment)

schmmd (Member) commented Mar 6, 2020

@matt-gardner I cut a new issue (#3913) with #3519 (comment) as the description. This ticket is getting a bit hard to follow...
