
Design decision query: recommended pattern for auxiliary loss terms that should be ignored during evaluation? #276

Closed
chiragraman opened this issue Dec 19, 2019 · 14 comments
Labels
discussion (Discussions about designs & implementation) · topic: executor (Issue about the executor and related utilities)

Comments

@chiragraman

chiragraman commented Dec 19, 2019

Hello!

I came across texar-pytorch while in the process of writing my own version of the Executor, and am really happy that someone's already done the work for it, so firstly, thanks for the excellent repository.

Broad Query
One of the main questions I have pertains to the design requirement of including the loss as an item in the dictionary returned by the forward method of the model. Essentially, I'm wondering what the recommended pattern is for including terms in the loss (e.g., regularization terms) that ought not to be part of the validation loss. Conceptually, I think of the forward pass as being complete once the model has made its predictions, which is what I believe the predict() method is for. However, making the computation of the loss a responsibility of the forward pass could lead to certain problems.

Context
Ideally, the training loss ought to be computed in a separate forward pass after the training epoch is completed. I'm aware that most people use an average over the training batches as an approximation of the training loss. However, this becomes an issue when comparing against a validation loss curve, where the difference between the training and validation curve indicates generalization error. This is for two reasons in the typical case:

  1. The model changes at the end of each batch when the optimizer takes a step, so it's an unfair comparison against the evaluation setting. The model might also overfit on a single batch.
  2. The model might have dropout and batch norm turned on during training which behave differently during evaluation.

As far as I can tell, the training loss is computed within the training loop in the executor, as opposed to an additional forward pass over the training set with model.eval(). Is my understanding correct?

On regularization
Typically, any regularization terms over the model parameters are added to the loss before the call to backward. With the given interface, it seems like the right place to do this would be in the forward pass. However, I find it a little weird to make the model responsible for regularizing itself. I usually have a separate nn.Module subclass responsible for regularizing a model and dependency-inject the model into this class. That way I can swap out different regularizers without changing the model, in compliance with the Open-Closed principle from SOLID.
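
For concreteness, here's a minimal sketch of the pattern I mean (the L2Regularizer name and weighting below are purely illustrative):

import torch
from torch import nn

class L2Regularizer(nn.Module):
    """Illustrative only: a regularizer that receives the model as a dependency."""
    def __init__(self, weight: float = 1e-4):
        super().__init__()
        self.weight = weight

    def forward(self, model: nn.Module) -> torch.Tensor:
        # Sum of squared parameters, scaled by the regularization weight.
        penalty = sum(p.pow(2).sum() for p in model.parameters())
        return self.weight * penalty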

Could you please explain how to achieve these two things (loss computation over the training set after the train loop, and computing auxiliary loss terms outside the model) with the current setup of the executor? It seems like this is a direct consequence of requiring the model to compute the loss, which strikes me as a little problematic. These are currently the two main problems precluding my use of this otherwise awesome repo, so I'd appreciate any insight.

Thanks!

@huzecong
Collaborator

Hi Chirag, thank you for your interest in Texar-PyTorch! I'm the main developer behind the Executor module, and I can address your questions.


For your first question, I'm not sure I completely understood, but I guess that your intention was to perform evaluation (under model.eval() mode) on the training set after each epoch. While the Executor doesn't do this by default, you can add a custom action using Executor.on like this:

executor = Executor(train_data=datasets["train"], ...)

@executor.on(cond.epoch(1))
def eval_on_train(executor):
    executor.test(datasets["train"])
    # obtain results from the status dictionary
    metrics = executor.status["eval_metric"]
    # compare it with your stored validation results ...

However, please pay attention to your validation/testing settings as test() uses testing metrics while status["eval_metric"] refers to the validation metrics. This is not a problem if your validation and testing metrics are the same (the default if you leave test_metric as None), but it could be a source of hard-to-find bugs.


For your second question, the design principle of the Executor is to leave all model-related computations to the model, so that the Executor itself can be as general as possible. Our recommended approach is to create a "wrapper" model class that takes care of everything, including re-structuring of inputs/outputs, and the regularization as you mentioned. This is also what we did in the examples (e.g., XLNet).

This might seem like a bit of boilerplate, and I agree, but I currently don't see any way of avoiding it. Most built-in (PyTorch or Texar-PyTorch) or third-party modules do not take a batch as input, nor do they return a dictionary, so it's probably necessary to have this wrapper to at least re-structure inputs and outputs.
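
For reference, such a wrapper can stay fairly thin. A rough sketch (the names and the classification-style loss here are made up for illustration, not taken from our examples):

import torch
from torch import nn
import torch.nn.functional as F

class ModelWrapper(nn.Module):
    """Sketch: unpack the batch, call the actual model, repack outputs as a dict."""
    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, batch):
        logits = self.model(batch["input_ids"])          # re-structure inputs
        loss = F.cross_entropy(logits, batch["labels"])  # loss lives in the wrapper
        return {"loss": loss, "preds": logits.argmax(dim=-1)}  # re-structure outputs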


I would say that Executor is still in its early stages of development, and is far from complete. I've encountered many rough spots while using the module myself, and I think it can be further improved. If you have any suggestions or are willing to contribute, please feel free to ask any questions and create pull requests!

@huzecong added the "question" and "topic: executor" labels on Dec 19, 2019
@chiragraman
Author

Zecong! Small world! I think I ran into you at the LTI just before I left haha, wonderful to know you are developing the Executor :) Thanks for getting back to me so quickly!


I think the two problems are related; even if I wrap the model, the loss would still need to be computed by the model, which is the larger issue. I'd argue that this is just an incorrect class design. I agree that the Executor shouldn't be responsible for this, but neither should the model. I think a better abstraction would be to have another object responsible for computing the loss; the base behaviour would be to do this using the predictions and labels, but one could extend this behaviour to accept the model itself too, and compute the regularization terms, for instance. The executor's behaviour is then to call backward() on whatever this criterion-like object returns. This way the model is only responsible for performing the forward pass, which is a good division of responsibility.

As an example use case, say you are doing something related to domain generalization and need to induce orthogonality in the representation space, and you choose to do this by adding some cosine similarity terms to the loss. As it stands, this needs to be the responsibility of the model wrapper, but conceptually, it isn't really a wrapper, and there ought to be some point at which the end user can extend the behaviour without opening up the model and changing its behaviour.

Which brings me to the XLNet example. While you can indeed implement this by using inheritance, it seems like an incorrect usage. A far better example of a model wrapper is in the BERT example, where the wrapper accepts the model as a dependency. I think the docs might benefit from using a similar XLNet wrapper that doesn't need to call super(), but can still perform its computations by wrapping the loss into a dictionary. Thoughts?


On Return Types
Prior to Python 3.8, I've been using NamedTuples in my own code to maintain some sanity in terms of what my methods return, since having dictionaries flying all around can quickly lead to highly uncivilised code. However, with the introduction of PEP 589, it seems like TypedDict might be a better return type to get from the forward() method. I think it would be more prudent to wait for pytorch to support Python 3.8 before this happens, but I was curious about your thoughts on this.


It's great to know you are willing to accept pull requests, I can modify the parts of the executor to fix this design assumption and issue a pull request. We can figure out the final design in this thread if it's the best way to keep track of things.

@huzecong
Collaborator

Wow, I didn't realize that! It is a small world indeed :D


On the topic of whether the model should compute the loss, I guess it's more of a personal preference. I've seen quite a lot of code (albeit research code) that computes the loss within the model, so I would say habit-wise this is acceptable to most people. Your example of adding additional terms outside the model is a very reasonable use case, but there are also cases that favor having the model compute the loss, such as losses that consist of several parts.

Indeed, the XLNet example is not a good example in terms of class hierarchy, because as you said, it's weird to inherit the wrapper from the model. I think the better term to describe what we want here is an "adapter" that somehow connects the model and the Executor. Since the user will probably need to write such an adapter anyway, I think it would be better to keep the current design, instead of creating a new general adapter module. What do you think?


I've used the TypedDict type in a couple of cases in the Texar codebase, but IIRC it has a number of limitations: 1) you can't add keys that are not defined in the type signature; 2) you have to index with a string literal rather than a str variable; otherwise it won't typecheck. These go against the intended usage of the returned dictionary (the user should be able to add arbitrary fields), so I don't think it's necessary to use TypedDict as the return type.
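
To illustrate what I mean (a hypothetical snippet, just to show the kind of mypy errors these limitations produce):

from mypy_extensions import TypedDict
import torch

class ReturnDict(TypedDict):
    loss: torch.Tensor

# 1) Extra keys are rejected:
d: ReturnDict = {"loss": torch.zeros(1), "extra": torch.zeros(1)}  # error: extra key "extra"

# 2) Indexing must use a string literal:
key = "loss"    # inferred as str, not a literal
value = d[key]  # error: TypedDict key must be a string literal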


I'm happy to know that you'd be willing to contribute! Let's keep the discussion within this issue. In the meantime, you're more than welcome to take a look at other outstanding issues as well 😄

@chiragraman
Author

chiragraman commented Dec 20, 2019

Ah I should probably have explained it better; I am not against the model computing the loss if the programmer so intends, I am mostly against the framework being prescriptive about requiring the model to compute it, and offering no other obvious opportunities to inject the logic.

The current way of doing this would indeed require the user to write a wrapper for the model. For a running example, say the loss consists of three terms, followed by a regularization term, and we want to compare L1/L2/L1+L2 regularization. Let the model compute the first three terms. In the current design I could write one model class and three regularization wrappers, each of which accepts the model, computes the forward pass, adds its regularization term to the loss, and returns the dictionary with the composite loss. The only wrinkle is that each of these modules is masquerading as the model itself, when it isn't.

So in essence, what I am proposing is something like a Criterion module that assumes the sole responsibility of returning the loss, parts of which could indeed come from the model itself, but it's no longer the responsibility of the model. So the pseudocode algorithm is as follows:

outputs = model(batch, **model_kwargs)
# Currently subsumed by the previous step. Here, one could even follow 
# the pred_name, label_name setup for metrics and pass in only the relevant 
# items, allowing for a Metric object to serve as a criterion, but the following
# allows the user to pass in other items to the criterion by forwarding the outputs
# and the batch. This seems like the safest abstraction for now. 
loss = criterion(outputs, batch) 
if regularizer:
    loss += regularizer(model)
loss.backward()

Indeed, this essentially performs the steps that an adapter would, like you said, without needing the user to write an adapter in the most common use case, instead just passing in some default criterion and regularizer. I'm advocating for the criterion to be a first-class entity in the pipeline, since it conceptually matches the training process in my opinion. As for the XLNet example, my comment was mostly about updating the docs with a non-inheritance example, so as to not encourage people to adopt that pattern when the examples already demonstrate a better one. Thoughts?

[Edit] Note that this division of responsibility would solve the problem of not having the regularization term added to the validation loss because the regularization shouldn't be called during the validation / test loop. Currently I don't think there is a way to achieve this at all in any simple way. I'd say this is important for correctly comparing the training curves.

Also, to make things a little clearer using the GPT-2 example, the criterion would encapsulate the following:

outputs = model(inputs=input_ids, decoding_strategy='train_greedy')
# Following encapsulated by the criterion object
loss = tx.losses.sequence_sparse_softmax_cross_entropy(
                labels=batch['text_ids'][:, 1:],
                logits=outputs.logits[:, :-1, :],
                sequence_length=batch['length'] - 1,
                average_across_timesteps=True,
                sum_over_timesteps=False)

Notice how the model does not output the loss but only the logits. The abstraction here would be that the criterion object accepts the batch and outputs, gets the appropriate items from these, and returns the dictionary with the loss. Porting the above code to the executor in the current state would require this loss computation to be a part of the model.
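
Concretely, the criterion in this example could be as thin as the following sketch (the class name here is mine, purely for illustration):

import texar.torch as tx

class GPT2Criterion:
    """Sketch: a callable mapping (outputs, batch) to a dictionary containing the loss."""
    def __call__(self, outputs, batch):
        loss = tx.losses.sequence_sparse_softmax_cross_entropy(
            labels=batch['text_ids'][:, 1:],
            logits=outputs.logits[:, :-1, :],
            sequence_length=batch['length'] - 1,
            average_across_timesteps=True,
            sum_over_timesteps=False)
        return {"loss": loss}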


Ah yes, I agree that the inability to have extra keys is extremely restrictive, but I found that the expected behaviour is supported for function arguments. For convenience, from that link:

from mypy_extensions import TypedDict

A = TypedDict('A', {'x': int})
B = TypedDict('B', {'x': int, 'y': str})

def f(x: A) -> None: ...

b: B = {'x': 1, 'y': 'foo'}
f(b)  # Ok

The restriction on extra keys applies to creating the dict, but I imagine the way to use it in the executor loop would be similar to the above example. There's probably a few other considerations I am missing, but this one seems manageable in terms of the extra keys?

@huzecong
Collaborator

I understand your concern, but in my opinion this is more of a naming problem instead of a design problem. If you think the name "model" is confusing, I'm willing to change it to "module" or something else.

The problem with the criterion approach that you're proposing is that the user would probably have to write their own "criterion adapter" as well. I know PyTorch provides module-style losses, but in order to use them you'd need to pass in the label tensor, not the batch object. The metric approach could work for simple losses, but not for stuff like the sequence_sparse_softmax_cross_entropy which requires text_ids and length, or for more complex ones such as the GAN loss.

My point is, we want to make this simple for the user, and having one custom class is probably better than having two. For switching between training and validation modes, the model adapter can use the nn.Module.training flag to determine the current mode, and apply dropout/regularization accordingly. If you wish to separately control training mode and regularization, you can always add a new argument to the adapter (and personally I believe this is better than built-in support because IMHO this is not a very commonly used feature).
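
For example, the adapter's forward could look roughly like this (a sketch; the L2 term is just a stand-in for whatever regularization you use):

import torch
from torch import nn

class ModelAdapter(nn.Module):
    def __init__(self, model: nn.Module, l2_weight: float = 1e-4):
        super().__init__()
        self.model = model
        self.l2_weight = l2_weight

    def forward(self, batch):
        loss = self.model(batch)["loss"]
        if self.training:  # only add regularization during training
            reg = sum(p.pow(2).sum() for p in self.model.parameters())
            loss = loss + self.l2_weight * reg
        return {"loss": loss}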

For the XLNet example, I agree with you that using non-inheritance adapter-style is better, and will create a PR to change that.


The support for function arguments does not completely solve the problem. Take this snippet as an example:

from mypy_extensions import TypedDict
import torch
from torch import nn

class ReturnDict(TypedDict, total=False):
    loss: torch.Tensor

class ModelAdapter(nn.Module):
    def forward(self, input: torch.Tensor) -> ReturnDict:
        return {
            "loss": torch.sum(input),
            "something_else": input,
        }

mypy would complain on the return line that there are extra keys in the TypedDict. I think it's better to just use Dict[str, Any] until there's official support for extra keys for TypedDict.

@chiragraman
Author

chiragraman commented Dec 22, 2019

Hey Zecong, thanks for getting back to me.


On side-effects of the model adapter

So I'm assuming we have the overarching goal that simple things should be simple, but complex things should still be doable in preferably not complicated ways. Ideally, in the simplest use-cases, the user shouldn't have to write any adapter at all. This is covered by having the model's forward method return the loss. So far so good.

Now if you're writing an adapter, my concern is not so much to do with the name, but with the responsibility. As far as possible, I think it's better to have a class have a single well-defined responsibility.

So what is the responsibility of a model adapter? Currently, it is to wrap any outputs into a dictionary that has a loss item. It's well-defined. If you start using it as a catch-all to add branching logic about training and validation, its purpose is going to quickly start bloating. Also, if you recommend that the adapter can change model state, it's harder to have any guarantees about the side-effects of the adapter down the road. Will dependent methods be able to work with the guarantee that the model adapter does not alter the model state as a side effect?

Besides, it seems appropriate that the executor should encapsulate any branching logic of that nature because it has to do with execution anyway. As of now, the executor handles the model mode via calls to self.model.train(self.test_mode == 'train') during the evaluation methods. I don't think that's a responsibility that needs to be split between the adapter and the executor when it's already appropriately taken care of.

For these reasons, I am of the opinion that the adapter shouldn't be responsible for more complex logic than simple transforming of output data from the forward pass into an appropriate data structure. If you believe it's alright for the adapter to alter model state, I'd request you to consider how these side effects might affect the building of future behaviour using these interfaces.


On regularization

I'm a little confused about what you're referring to as a not commonly used feature; is it turning off regularization during computation of the validation loss? If that's what you're referring to, I think how commonly it is used depends on whether people want to compare the validation and training losses. If only the training loss is being looked at, sure, it's not as important. But the difference between the curves is something that's taught in intro ML courses, so it ought to be supported without having to write custom logic to turn off regularization. To me this is a similar responsibility to calling model.train() or model.eval(), and therefore falls under the purview of the executor.


On losses as a separate module

I agree that the user shouldn't have to write a criterion wrapper in the simple cases, and now that I think of it more, they need not be modules as such; what I was proposing is that the executor accept any callable object. This could be a function itself or an nn.Module. In the simple cases, any metric can be passed in.

As for the arguments being labels or the batch, this is easily solved for the simple cases by having a default decorator that extracts the labels from the batch object and passes them in (see the sketch below). But having a general interface for passing in the batch as well as the outputs allows for easier extension of custom behaviour that is not currently possible, since you might need more information than the labels and preds when computing losses; you can think of variational inference applications as an example, where you need to compute KL terms between things that aren't just the labels or preds.
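
The default decorator could be as simple as something like this (the helper name is made up):

def adapt_labels(loss_fn, pred_name: str = "preds", label_name: str = "labels"):
    """Sketch: wrap a (preds, labels) loss so it can be called with (outputs, batch)."""
    def criterion(outputs, batch):
        return loss_fn(outputs[pred_name], batch[label_name])
    return criterion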


Yes of course, that's what I meant by the problem only arising during creation of the dictionary. The reason I said this isn't a problem at the point of use is something like this:

# ... imports ...

## LIBRARY CODE ----
# Somewhere in the Executor
# Not quite sure why you have `total=False`, I believe we don't want to omit the loss key
DictWithLoss = TypedDict("DictWithLoss", {"loss": torch.Tensor}, total=True) 
...
def some_executor_method(d: DictWithLoss):
    loss = d["loss"] # should be okay according to the example below
    loss.backward()
...
## -----

## USER CODE ----
UserReturnDict = TypedDict("UserReturnDict", {"loss":torch.Tensor, "something_else":torch.Tensor})

class ModelAdapter(nn.Module):
    def forward(self, input: torch.Tensor) -> UserReturnDict:
        return {
            "loss": torch.sum(input),
            "something_else": input,
        } # ---> should be okay
## ----

So I think in terms of usage, this should be fine. I don't have a strong opinion on TypedDict being used, but coming from a C++ background I liked the typing constraints and having users think about what they put in their dictionaries to avoid bloat. But I'll admit this is me probably reacting to looking at code updates where people start sticking more things in dictionaries just because they can.

@huzecong
Collaborator

Sorry for getting back to you late; been busy during the holidays.


I think the word "statefulness" better describes the case than "having side-effects". I don't think the adapter being stateful is a bad thing, since all PyTorch modules have states built in them (e.g., the train/eval flag). However, I do agree that the example I showcased is not the best design, as it requires users to manually switch states outside of the executor, and they might forget to do so.

The purpose of the example was to provide an easy solution for the goal of "computing training loss under eval mode but keep regularization terms", without modifying any of the executor internals. After reading your replies again, I realize that I have probably misunderstood your requirements. Let me try to rephrase the question you were asking:

  1. You would like to add to the training loss some regularization terms (e.g., L1/L2), which should not be computed during validation or testing.
  2. You would like to compare the curves of training and validation losses over epochs, and to do this properly it is necessary to compute the loss on the training set under evaluation mode.
  3. You would like to separate the actual model (that computes logits/predictions), the loss computation, and regularization into different modules.

For 1., I guess the better way to do it would be to check for self.training in the forward function of the adapter and add the regularization term only during training. This is similar to how dropout is implemented and requires no modifications to the current framework.

For 2., my argument that this is not a very commonly used feature is mainly limited to deep learning applications, where the training set is often too large for this to be practical. I'm less familiar with applications outside of the NLP domain, but I can see how this would be useful on smaller datasets or computationally cheaper models. But, as you said, this can be done in the current framework by manually evaluating on the training set in a custom validation action. What could be improved here is to allow the custom action to return metric values so the training losses (w/o regularization) can also be registered as a validation metric.

For 3., I stand with my previous opinion that coupling the loss/regularization with the model makes more sense. My major points are:

  • Having separate model and loss modules would mean introducing a large number of custom loss modules, which introduces more boilerplate code for the user.
  • Having regularization outside the model would only work for regularization methods that consider only model parameters, such as L1/L2. Not all models use this kind of regularization. Other methods such as dropout or batch/layer normalization are embedded into the model architecture, and don't work this way.
  • In terms of switching between different losses/regularizations, both designs would require equal efforts for the user: to switch modules (with appropriate constructor arguments) in the executor config, or to switch function (with appropriate call arguments) in the adapter.

Of course, it is also possible, for instance, to allow the user to add regularization in the executor arguments. However, this gives the user less control over the training process. For example, the user might want to also return the loss value without the regularization terms in a separate dictionary entry, but it's not possible (or less straightforward) to do so with regularizers handled by the executor. Similarly, it would be less straightforward to observe individual loss components in a complex loss module, or to enable/disable loss computation for evaluation.


I see your point with the TypedDict, and I think they can be integrated into the current framework. My concern is that mypy might throw errors where we access the return dictionaries using metric pred_names, if the type of the return dictionary is restricted to DictWithLoss.

I'll look into this and submit a separate issue, but you're also more than welcome to propose changes!

@chiragraman
Author

No worries about the delay in response, enjoy your holidays!


I think we are ultimately saying the same thing on this subpoint, but for clarity:
  • It's statefulness if the adapter is a subclass of the model, because the state of the model is the state of the adapter. In this case it seems wasteful to have the model internally branch in every forward pass depending on state.
  • It's a side-effect if the adapter is any class that wraps a model; the adapter in this case need not even be of type nn.Module, and could call forward on the model, compute and embed the loss in a dictionary, and return it. The reason it's an nn.Module subclass is so that the executor can call forward on it. Changing model state in this forward pass is a side effect because the model goes in in some state and may or may not remain in the same state once the adapter is done with its operation. Function interfaces are like contracts, and if the contract's semantics are to just return the loss in a dictionary, altering the model state qualifies as a side-effect in this case.

Before I try and distill the essence of why I think the former is the poorer choice, let me restate the principle of the executor from your comment above:

the design principle of the Executor is to leave all model-related computations to the model, so that the Executor itself can be as general as possible.

With this in mind, I think the essence of the design flaw as I see it this:

It is sort of weird for the model to have to care about anything to do with training itself. A model performs some task given some inputs. In most cases, this involves making some predictions. It shouldn't have to have knowledge of the concept of a loss, or existence or absence of ground truth labels to compute a loss, or how well it is performing.

The exception to this is that the model has some gradients sure, but it shouldn't need to know how those are computed. That is the job of the training apparatus, so at the very least it should happen outside the model.

If you can concede this, then it makes sense that the measures required for making a model better at its task, the loss, regularization, metrics, etc., live outside the model, because while they relate to the model, they aren't inherently a part of the model's task, and so not a part of "model-related computations". This is more than a naming problem; if you think of the adapter as a model subclass, it's a behavioural semantic. I agree these computations shouldn't be within the executor, and that it should be possible to modify the executor's behaviour by injecting such logic. So option 2 above, with the adapter wrapping the model, is a better option, and still in line with the design principle.


For ease of consumption, I'll respond to your specific proposed solutions in a follow-up comment, but I wanted to leave the core argument here, and use it as context next.

@chiragraman
Author

Right, so hopefully we agree that the computations pertaining to training the model should be outside the core model, and outside the executor itself, with the adapter being the candidate to hold this logic. With that,

For 1., I guess the better way to do it would be to check for self.training in the forward function of the adapter and add the regularization term only during training. This is similar to how dropout is implemented and requires no modifications to the current framework.

I agree. If this is done within the adapter though, it needs to branch on the model state, which isn't too bad, but I don't think it's the most elegant, since the check needs to be performed in every forward pass.

For 2., ... What could be improved here is to allow the custom action to return metric values so the training losses (w/o regularization) can also be registered as a validation metric.

That's an interesting idea, registering the loss w/o regularization as a validation metric. The way I had envisioned it, the regularization bit comes from a separate callable, so the same core loss function could be reused across the train and val loops.

For 3., I stand with my previous opinion that coupling the loss/regularization with the model makes more sense.

I can partially agree if you mean the coupling occurs with the adapter and not the core model, for reasons covered in my previous comment. However, even with the adapter, this solution condemns the user to writing an adapter in almost all cases, which to me is a symptom of a design flaw. In the simplest case, the user ought to be able to write the core model, specify a core loss, an optional regularizer, and an optimization procedure, and let the executor handle the rest. No adapters needed, since the criterion returns the loss anyway. The need for an adapter is itself, in some sense, boilerplate code to me.

Having separate model and loss modules would mean introducing a large number of custom loss modules, which introduces more boilerplate code for the user.

I should clarify that I'm not advocating for module losses. In my pseudocode above I assumed any callable object that accepts the labels and outputs and computes the loss. This could be a function, an object, or an nn.Module subclass, left up to the user. In this case, there are specific loss functions already provided, so this falls under the same category. These functions could just be passed in as they are, and the user can write custom ones if so desired.

Having regularization outside the model would only work for regularization methods that consider only model parameters, such as L1/L2.

True. Although, there are some papers that argue that dropout doesn't necessarily perform the regularization it is thought to perform. However, this is irrelevant to the spirit of your argument, which holds. I was indeed referring to reg terms that are computed from model parameters and used in the optimization.

In terms of switching between different losses/regularizations, both designs would require equal efforts for the user: to switch modules (with appropriate constructor arguments) in the executor config, or to switch function (with appropriate call arguments) in the adapter.

I think I have to disagree here. In the most common case, with what I am proposing the user doesn't have to write any adapter at all, or at most a thin adapter that wraps the loss in a dictionary. The branching logic, etc. is handled by the executor. So the user only has to write a custom regularizer if there is a need, such as for maintaining orthogonality of RNN weights, and pass that in. Even with a similar amount of work, I'd argue configuring the executor is a more elegant solution than branching logic in an adapter that the user has to write in most situations.

Of course, it is also possible, for instance, to allow the user to add regularization in the executor arguments. However, this gives the user less control over the training process.

This statement seems a little self-contradictory, perhaps? An optional argument to inject logic ought to afford the user more flexibility without being prescriptive. The other use-cases you describe are still supported by the library in my proposed alteration, and the user can choose to completely ignore the optional regularizer and compute all those terms in an adapter if they choose to write one. It's just not required in most cases by the relaxation of the need for the model wrapper to return a loss.


Sorry for the long comments, I thought it best to err on the side of being more specific to avoid confusion. In any case, thanks for engaging in this discussion!

@huzecong
Collaborator

Thanks for the reply. I appreciate you putting so much thought into the executor design!


I think your core argument is that loss computation and regularization are not part of the model's responsibility, and should not be done in the model class. However, I believe this is more of a personal preference than a rule to follow. True, in machine learning terms the model is separate from the loss, but in practice they're often coupled together. Forcing them to be separate modules or functions seems, to me, a bit like abstraction for abstraction's sake.

I can give examples where changing loss functions would require modifying the model's internal computations. Consider word-level language models with a large vocabulary. At each time step, the model predicts the next word given the previous context. The most straightforward way to do this is to compute the full logits as a [B x V] tensor, where B is the batch size and V is the vocab size, and then apply the cross-entropy loss. However, in practice this is too slow, so people use a variety of techniques:

  • Sampling-based losses, including importance sampling, noise-contrastive estimation, and negative sampling (not exactly for LMs, but often for word2vec-style stuff). These allow the model to only compute logits for a sampled subset of words.
  • Hierarchical losses, including hierarchical softmax and adaptive softmax. Both methods group vocabularies into a tree structure, where leaves are words. The model then only computes logits for each non-leaf node along the path from the root to the leaf (word).

As you can see, each different technique requires the model to compute logits differently. These losses can be used interchangeably in the same model, but would require non-trivial modifications to the model itself. If you're interested, check out this article for more information on these techniques.
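
As a concrete illustration of this coupling (using PyTorch's built-in adaptive softmax; nothing Executor-specific): the loss module consumes hidden states rather than a full [B x V] logits tensor, so the model's output layer and the loss can't be swapped independently.

import torch
from torch import nn

hidden_size, vocab_size = 512, 100000
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size, n_classes=vocab_size, cutoffs=[1000, 10000])

hidden = torch.randn(32, hidden_size)          # final hidden states from the model
targets = torch.randint(0, vocab_size, (32,))  # next-word indices
result = adaptive(hidden, targets)
loss = result.loss                             # scalar loss; full logits are never materialized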


And let me address your refutations to my claims in the previous reply:

The adapter is boilerplate code.

This is true, and is something I'm trying to get rid of. The problem is mainly with the return values -- most existing modules would just return the output as a tensor or a tuple of tensors, but not as a dictionary. However, the adapter is not for the sole purpose of computing the loss, so separating model and loss would not get rid of the adapter.

On losses module.

My argument is not limited to module-like losses. My point was that having the loss computed outside of the model would require the user to write additional code to adapt between the model and the loss, be it in the form of a class or a function.

Branching logic.

I'm not sure what you mean by "the branching logic is handled by the executor". I would assume that if the loss or regularizer is set in the executor arguments, it would be something like this:

executor = Executor(
    ...,
    loss=tx.losses.sequence_sparse_softmax_cross_entropy,
    regularizer=tx.core.L1L2(l1=0.5),
    ...)

Then if the user were to choose between multiple losses, they would still have to write something like:

if args.loss_type == "mle":
    loss_func = tx.losses.sequence_sparse_softmax_cross_entropy
elif args.loss_type == "adversarial":
    loss_func = some_adversarial_loss_function
# or, something like:
# loss_func = tx.utils.get_function(args.loss_func_name, tx.losses)
executor = Executor(loss=loss_func, ...)

I don't see how this would be too different from doing things in the model class. And this is assuming the losses can be used as is, without adapting to your model or setting additional arguments.

Optional argument for regularizer.

What I was talking about is if we only allowed specifying the regularizer through executor arguments. I agree that if we allow both choices, it would be more flexible for the user. But in my opinion it might also become a source of confusion to the user -- the user might apply regularization in both places and not realize so.

@chiragraman
Author

chiragraman commented Jan 1, 2020

Thanks for the link! I am aware of some of these, and the thrust of the argument seems to be that the model computations might have to change depending on the loss computation. This is true, and in these cases the model ought to change anyway, so it's okay to write multiple model classes. However, this doesn't seem to me to be a reason to couple the responsibilities, because it makes the simpler cases significantly more convoluted than needed. I also don't think I explained my point very clearly.

For having a common reference, here's a complete-ish example, with suitable adaptation of the ConditionalGPT2Model example for illustration. Say I would like to train the model with two losses (sequence_sparse_softmax_cross_entropy, sequence_softmax_cross_entropy) and orthogonal regularization (an arbitrary choice that might not make the most sense, but works for illustration):

# ... imports ...

## ------------ Ideal minimum user required code ---------------

class ConditionalGPT2Model(nn.Module):
  """An encoder-decoder model with GPT-2 as the decoder."""
  def __init__(self, vocab_size):
    super().__init__()
    # Use hyperparameter dict for model configuration
    self.embedder = tx.modules.WordEmbedder(vocab_size, hparams=...)
    self.encoder = tx.modules.TransformerEncoder(hparams=...)
    self.decoder = tx.modules.GPT2Decoder("gpt2-small")  # With pre-trained weights

  def _get_decoder_output(self, batch, train=True):
    """Perform model inference, i.e., decoding."""
    enc_states = self.encoder(inputs=self.embedder(batch['source_text_ids']),
                              sequence_length=batch['source_length'])
    if train:  # Teacher-forcing decoding at training time
      return self.decoder(
          inputs=batch['target_text_ids'], sequence_length=batch['target_length'] - 1,
          memory=enc_states, memory_sequence_length=batch['source_length'])
    else:      # Beam search decoding at prediction time
      start_tokens = torch.full_like(batch['source_text_ids'][:, 0], BOS)
      return self.decoder(
          beam_width=5, start_tokens=start_tokens,
          memory=enc_states, memory_sequence_length=batch['source_length'])

  def forward(self, batch): ## modified to return only outputs
    """Perform the forward pass, returning only the outputs."""
    outputs = self._get_decoder_output(batch)
    return {"outputs": outputs.logits}

  def predict(self, batch):
    """Compute model predictions."""
    sequence, _ = self._get_decoder_output(batch, train=False)
    return {"gen_text_ids": sequence}

# Adapted from github.com/kevinzakka/pytorch-goodies into a module for illustration
class OrthogonalRegularizer(nn.Module):
    def __init__(self, reg=1e-6):
        super().__init__()
        self.reg = reg

    def forward(self, model):
        orth_loss = torch.zeros(1)
        for name, param in model.named_parameters():
            if 'bias' not in name:
                param_flat = param.view(param.shape[0], -1)
                sym = torch.mm(param_flat, torch.t(param_flat))
                sym -= torch.eye(param_flat.shape[0])
                orth_loss = orth_loss + (self.reg * sym.abs().sum())
        return orth_loss

# Losses
loss_func = tx.losses.sequence_softmax_cross_entropy
loss_func_sparse = tx.losses.sequence_sparse_softmax_cross_entropy

# Instantiate
model = ConditionalGPT2Model(...)
regularizer = OrthogonalRegularizer()

Ideal Scenario
With the previous user specification, it would be great if the user can just leave the rest to the executor using an interface like so:

executor = tx.run.Executor(
    model=model,
    loss=loss_func,
    regularizer=regularizer,
    keys={"pred_name": "outputs", "labels": "target_text_ids"},  # something like this
    ....
)

executor_sparse = tx.run.Executor(
    model=model,
    loss=loss_func_sparse,
    regularizer=regularizer,
    keys={"pred_name": "outputs", "labels": "target_text_ids"},  # something like this
    ....
)

I'm using keys as some dictionary that specifies how the executor can unpack the data returned by the model, and the batch, and pass those into the loss in the simplest case for provided losses. Of course, the user can write a custom loss and pass that in too. Alternatively, the functionality can be extended via helpers provided by the Executor, without needing to modify the Executor class (Open-Close principle).

Current reality
In the current design, to make this work, the user is forced to write the following additional code unless I am mistaken:

class GPT2WithSoftmaxCrossEntropy(nn.Module):
    def __init__(self, gpt2, regularizer=None):
        super().__init__()
        self.gpt2 = gpt2
        self.regularizer = regularizer

    def forward(self, batch):
        logits = self.gpt2(batch)["outputs"]
        loss = tx.losses.sequence_softmax_cross_entropy(
            labels=batch['target_text_ids'][:, 1:], logits=logits,
            sequence_length=batch['target_length'] - 1)
        if self.training:  ## Branching
            if self.regularizer:
                loss += self.regularizer(self.gpt2)
        return {"loss": loss}

    def predict(...):
        ....

class GPT2WithSparseSoftmaxCrossEntropy(nn.Module):
    def __init__(self, gpt2, regularizer=None):
        super().__init__()
        self.gpt2 = gpt2
        self.regularizer = regularizer

    def forward(self, batch):
        logits = self.gpt2(batch)["outputs"]
        loss = tx.losses.sequence_sparse_softmax_cross_entropy(
            labels=batch['target_text_ids'][:, 1:], logits=logits,
            sequence_length=batch['target_length'] - 1)
        if self.training:  ## Branching
            if self.regularizer:
                loss += self.regularizer(self.gpt2)
        return {"loss": loss}

    def predict(...):
        ....

Issues
The above code that the user is forced to write is problematic for several reasons:

  1. Additional code:
    The executor is truly useful if the user can specify the experiment parameters, similar to the minimum specification above, and let the executor run the rest. Currently the above two classes are required (or one, if you branch on loss type; see the next point), whereas this could easily be handled by auxiliary helper classes within the executor.

  2. Avoidable code duplication:
    One might argue that instead of two classes, the user could write a single adapter that accepts the loss function as an argument and branches on it. That's really bad for testing on the user side. I understand people in academia don't usually write tests, but if you start writing these, you have to add validation tests for the different behaviours, increasing development time and producing less maintainable code.
    Worse, people might start passing the loss function and regularizers into the original GPT2Model class instead of the adapter, which is even messier.
    This is a case of the library violating the Single Responsibility Principle, and in turn forcing the user to write unnecessary code, or worse, to violate the Dependency Inversion and Open-Closed principles themselves.

  3. Responsibility leaking:
    The logic in both of the adapters is execution logic. Even in this straightforward case the user has to handle some of the execution branching logic that should be the responsibility of the executor. The simple scenarios ought to work out of the box while still allowing the user to inject more advanced logic via helpers, etc.

  4. Branching:
    Note that my comment about branching logic wasn't about the type of loss, but about the model state. Also note that this branching check will be performed on every iteration of the data iterator.


From a user's perspective, the overhead of adding a dependency to a project ought to be overshadowed by the benefits. Currently, if I were to advocate adding this library to the repositories I am responsible for, I would have to invest in a more robust testing suite for the user code, owing to easily avoidable issues in an otherwise excellent library interface. Hopefully I have done a slightly better job of explaining the bigger picture, the considerations that go into adopting a new library, and the potential issues of dropping texar-pytorch into the workflows I am responsible for.

@chiragraman
Author

Technically, you have answered my original questions within the current interface design, so feel free to close this issue. However, we have wandered into the territory of proposed changes, and there might be valuable discussion here, so perhaps we could continue it here, if you find it useful, by changing the labels? I might test a few interface changes in my fork and am happy to issue a PR if you find that useful too.

@huzecong added the "discussion" label and removed the "question" label on Jan 13, 2020
@huzecong
Collaborator

Hey, so sorry for the late reply. Been caught up by work piled up during the holidays.

I'm fine with continuing the discussion here, and I've changed the label from "question" to "discussion".


Thank you for explaining your point in detail with code examples. However, I'm afraid I can't agree with you on all these points.

  1. Additional code. This one I agree with. Your design can reduce a part of the boilerplate code.
  2. Responsibility leaking and branching. I kinda feel that these are talking about the same thing. It is true that with the current design, the user must take on the responsibility of checking the model state, and it might be sub-optimal. If this is really a big issue, I think the better option might be to adopt what is already in the framework, i.e. regularizer classes like nn.Dropout that handle the state check internally.
  3. Avoidable code duplication. I think this point is more about testing. I don't think there's too much difference here, and I'll show that in an example below.

In your example you assumed that the keys dictionary is the same for both loss functions. However, when they're different, in the current design, you can write:

class GPT2WithSoftmaxCrossEntropy(nn.Module):
    def __init__(self, gpt2, loss_func: str):
        super().__init__()
        self.gpt2 = gpt2
        self.loss_func = loss_func

    def forward(self, batch):
        logits = self.gpt2(batch)["outputs"]
        if self.loss_func == "sparse":
            loss = tx.losses.sequence_sparse_softmax_cross_entropy(
                labels=batch['target_text_ids'][:, 1:], logits=logits,
                sequence_length=batch['target_length'] - 1,
                label_smoothing=0.1)
        else:
            loss = tx.losses.sequence_softmax_cross_entropy(
                labels=batch['target'][:, 1:], logits=logits,
                sequence_length=batch['target_length'] - 1)
        return {"loss": loss}

In your alternative design, you'd have to write (or put these in a config):

if args.loss_func == "sparse":
    loss_func = tx.losses.sequence_sparse_softmax_cross_entropy
    keys = {"logits": "logits", "labels": "target_text_ids",
            "sequence_length": "target_length",
            "label_smoothing": 0.1}  # additional kwargs too
else:
    loss_func = tx.losses.sequence_softmax_cross_entropy
    keys = {"logits": "logits", "labels": "target",
            "sequence_length": "target_length"}
executor = tx.run.Executor(
    model=model,
    loss=loss_func,
    regularizer=regularizer,
    keys=keys,
    ....
)

which I think is more or less the same. It doesn't really enable you to test less code -- you'd still need to cover both branches, and if you're loading everything from a config, you'll need to test that the config is written correctly. Either way, even a well-tested library doesn't give you that benefit; it's the user code you'd still have to test.

Also, the code you showcased was not an exactly fair comparison. The raw model outputs and batch fields cannot be fed directly to the loss; they need to be sliced or offset first. So, in order to specify the loss in the executor arguments, you'd need to return a bunch more from the model adapter:

def forward(self, batch):
    outputs = self.gpt2(batch)
    return {"logits": outputs["outputs"],
            "target_text_ids": batch["target_text_ids"][:, 1:],
            "target_length": batch["target_length"] - 1}

which also makes it longer, or at least not much shorter compared to the current version.


To summarize, as I said before, I think your design is reasonable and would make the code shorter and perhaps more readable in certain cases, and those are probably the cases that you often work with. However, the design would make other cases harder to implement, and does not really serve the purpose it's supposed to serve. Since we want Texar-PyTorch to be a general-purpose library, we probably have to choose the current design over yours.

But still, thank you for bringing up the topic and engaging in the discussion. It made me think about the current design and gave me ideas for improvement.

@huzecong
Collaborator

huzecong commented Feb 4, 2020

Closing due to lack of activity.

@huzecong huzecong closed this as completed Feb 4, 2020