
Fix off-by-one in torch DeepAR #2618

Merged Mar 9, 2023 (6 commits)

Conversation

@lostella (Contributor) commented Feb 6, 2023

Issue #, if available: Fixes #2616

Description of changes: The RNN in torch DeepAR was getting dynamic features shifted by one as input.

  • Split prepare_rnn_input out of unroll_lagged_rnn
  • Add test to verify alignment of RNN inputs (target values vs dynamic features, test is failing prior to fix)
  • Change is breaking in that previously serialized models will not work as before (network changes)
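
The alignment that the new test verifies can be illustrated with a standalone toy (this is not the actual GluonTS code; `context_length` and the feature values are made up). Using the integer time index as the only dynamic feature makes any off-by-one in the slicing immediately visible:

```python
import torch

# Toy illustration of the feature/target alignment: with data up to time t,
# the RNN step that predicts time t+1 should see the time features of t+1.
context_length = 4
prediction_length = 2

# Time features are just the integer time index here, for readability.
past_time_feat = torch.arange(0, 6, dtype=torch.float).reshape(1, 6, 1)    # t = 0..5
future_time_feat = torch.arange(6, 8, dtype=torch.float).reshape(1, 2, 1)  # t = 6..7

# Features for the steps being predicted: the last context_length - 1 past
# steps plus all future steps (the slicing pattern from the fix).
time_feat = torch.cat(
    (past_time_feat[:, -context_length + 1 :, :], future_time_feat),
    dim=1,
)
print(time_feat.squeeze())  # tensor([3., 4., 5., 6., 7.])
assert time_feat.shape[1] == context_length - 1 + prediction_length
```

One feature vector per predicted step, for times t = 3..7, each predicted from data up to t - 1; shifting the slice by one would pair each prediction with the wrong time step.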

Todo:

  • Take the opportunity to improve the code
  • Test effect of bug/fix (e.g. simple indicator that zeroes-out a series, model should not get it right prior to fix)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Please tag this pr with at least one of these labels to make our release process faster: BREAKING, new feature, bug fix, other change, dev setup

@lostella added the BREAKING, bug fix, and torch labels on Feb 6, 2023.
    time_feat = torch.cat(
        (
            past_time_feat[..., -self.context_length + 1 :, :],
            future_time_feat,
Contributor:

Is there a need to append this when there is no future_target?

@lostella (Contributor, Author) replied Feb 7, 2023:

Yes: future_time_feat is no longer an Optional argument (it probably never should have been). The assumption is that at least one "future output" will be produced, for which the corresponding dynamic features are needed. In fact, it is assumed that there is one less future_target entry than there are future_time_feat entries, if any; see lines 211-215 above. Let me improve this a bit.
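
A minimal sketch of that invariant, with illustrative names and shapes (this is not the actual GluonTS signature):

```python
import torch

# Hedged sketch: future_time_feat must cover every step to be predicted,
# while future_target, when present, covers one step fewer, since at least
# one "future output" is produced. Shapes assumed: future_time_feat is
# (batch, T, num_features), future_target is (batch, T - 1).
def check_future_inputs(future_time_feat, future_target=None):
    if future_target is not None:
        assert future_time_feat.shape[1] == future_target.shape[1] + 1

check_future_inputs(torch.zeros(2, 5, 3), torch.zeros(2, 4))  # ok: 5 == 4 + 1
```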

@kashif (Contributor) commented Feb 8, 2023:

BTW @lostella do you think the same issue also affects the transformer model? https://github.com/awslabs/gluonts/blob/dev/src/gluonts/mx/model/transformer/_network.py#L172-L178

@lostella lostella marked this pull request as ready for review February 15, 2023 12:50
@lostella (Contributor, Author) replied:

> BTW @lostella do you think the same issue also affects the transformer model? https://github.com/awslabs/gluonts/blob/dev/src/gluonts/mx/model/transformer/_network.py#L172-L178

It's hard to say. The easiest way to verify it, I think, is to set up a small unit test checking what the input to the relevant layer will be: in this case the tensor is split into encoder and decoder inputs, so one could have a method that prepares those two directly, and then test that lagged target values, dynamic features, and so on are aligned as expected.
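
That testing pattern can be sketched with a toy (all names and the split logic here are illustrative, not the actual transformer network): feed a recognizable series into the input-preparation step and assert the alignment directly.

```python
import torch

# Illustrative input-preparation method: the encoder sees the context,
# the decoder sees the targets shifted by one (teacher forcing).
def prepare_inputs(target, context_length):
    enc_input = target[:, :context_length]
    dec_input = target[:, context_length - 1 : -1]  # one step behind predictions
    return enc_input, dec_input

# A ramp makes misalignment obvious: each value equals its time index.
target = torch.arange(10, dtype=torch.float).reshape(1, 10)
enc, dec = prepare_inputs(target, context_length=6)

predicted = target[:, 6:]            # the steps the decoder predicts
assert torch.equal(dec + 1, predicted)  # each input is exactly one step behind
```

An off-by-one in `prepare_inputs` would break the final assertion, which is the kind of cheap check being suggested for the transformer code as well.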

Comment on lines 180 to 181:

    .unsqueeze(0)
    .unsqueeze(-1),
Contributor:
I think I prefer reshape(1, history_length, 1) instead of unsqueeze; it makes the output shape clearer, in my view.

Contributor:

or even .view()
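
For the case discussed, the three options are interchangeable; a quick check (history_length is illustrative):

```python
import torch

history_length = 7
x = torch.arange(history_length, dtype=torch.float)

# Three equivalent ways to add leading and trailing singleton dimensions.
a = x.unsqueeze(0).unsqueeze(-1)
b = x.reshape(1, history_length, 1)
c = x.view(1, history_length, 1)

assert a.shape == b.shape == c.shape == (1, history_length, 1)
assert torch.equal(a, b) and torch.equal(b, c)
```

The trade-off: .view() never copies but requires contiguous memory, while .reshape() behaves like view when possible and silently copies otherwise; spelling out the shape, as both reviewers note, documents the intent either way.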

@lostella lostella force-pushed the fix-deepar-off-by-one branch 5 times, most recently from 53bedf4 to b110bb2 Compare February 24, 2023 10:27
Comment on lines 70 to 77
# lambda dataset: DeepNPTSEstimator(
# freq=dataset.metadata.freq,
# prediction_length=dataset.metadata.prediction_length,
# context_length=2 * dataset.metadata.prediction_length,
# batch_size=4,
# num_batches_per_epoch=3,
# epochs=2,
# ),
@lostella (Contributor, Author) commented:

For some reason this test started failing again with

E       RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

../.pyenv/versions/3.8.13/envs/gluonts/lib/python3.8/site-packages/torch/autograd/__init__.py:197: RuntimeError

I think we need to understand this better and fix it separately.

@lostella (Contributor, Author) commented:

This kind of error kept #2496 on hold for a while, then it disappeared on its own. But something must be off.

@lostella lostella force-pushed the fix-deepar-off-by-one branch 2 times, most recently from 058a9eb to 550af57 Compare February 27, 2023 14:29
@lostella lostella added this to the v0.13 milestone Feb 27, 2023
@@ -132,6 +132,7 @@ def __init__(
else [min(50, (cat + 1) // 2) for cat in cardinality]
)
self.lags_seq = lags_seq or get_lags_for_frequency(freq_str=freq)
self.lags_seq = [l - 1 for l in self.lags_seq]
Contributor:
Maybe I'm completely missing the issue here, but if we do l - 1 for the lags, don't we feed in the current time stamp during training when l = 1, thereby introducing a data leak? Maybe I'm wrong, though.

@lostella (Contributor, Author) replied:

It depends on what time index the lag is applied to: the way the RNN input preparation is written, the lag index gets subtracted from t when we have data up to time t and want a prediction for time t+1. The test_rnn_input test function aims at verifying precisely this. Let me see whether the test can be made clearer.
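
The "no leak" claim above can be checked with simple index arithmetic: with observations available up to index t, predicting index t+1 with lag l reads the value at index t + 1 - l, which for l >= 1 never reaches beyond t.

```python
# Toy check of the indexing convention described above (values illustrative).
t = 9                  # observations available at indices 0..t
predict_at = t + 1     # the step being predicted

for lag in [1, 2, 7]:  # any lag >= 1
    used_index = predict_at - lag
    # the smallest lag (l = 1) reads exactly index t, never t + 1
    assert used_index <= t, "reading past t would be a data leak"
```

Under this convention, subtracting the lag from the index being predicted (rather than from the last observed index) is what makes the l - 1 shift in lags_seq safe.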

@kashif (Contributor) commented Mar 6, 2023:

@lostella one issue I found with the lagged_sequence_values helper is that with multivariate inputs it doesn't return the lags flattened into the feature dimension, e.g.

    # prior_input.shape: torch.Size([1, 167, 137])
    # input.shape: torch.Size([1, 24, 137])

    lags = lagged_sequence_values([0, 167], prior_input, input, dim=1)
    # lags.shape: torch.Size([1, 24, 137, 2])

where I would have expected torch.Size([1, 24, 137*2]).
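
If the helper does return the per-lag axis stacked at the end, the expected shape is one reshape away; a hedged sketch mirroring the shapes above (this does not call the actual GluonTS helper):

```python
import torch

# Shapes from the example: batch, predicted steps, channels, number of lags.
batch, steps, channels, num_lags = 1, 24, 137, 2

lags = torch.zeros(batch, steps, channels, num_lags)    # what the helper returns
flat = lags.reshape(batch, steps, channels * num_lags)  # the expected layout

assert flat.shape == (1, 24, 137 * 2)
```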

@lostella lostella enabled auto-merge (squash) March 9, 2023 14:15
@lostella lostella disabled auto-merge March 9, 2023 14:22
@lostella lostella merged commit cd3ddae into awslabs:dev Mar 9, 2023
@lostella lostella deleted the fix-deepar-off-by-one branch March 9, 2023 14:22
Labels: BREAKING (This is a breaking change), bug fix, torch (This concerns the PyTorch side of GluonTS)
Successfully merging this pull request may close these issues:
Off-by-one bug in PyTorch DeepAR implementation
4 participants