
Changing offsets and sequence lengths leads to reasonable predictions but very odd training logs #152

Closed
SimonTopp opened this issue Dec 29, 2021 · 13 comments

Comments

@SimonTopp
Contributor

Not exactly an issue, but something to be aware of, and something I'm curious to get other people's opinions on. While hypertuning sequence length and offset for the RGCN, I found that although various sequence length/offset combinations lead to reasonable predictions, the training logs for any combination other than a sequence length of 365 and an offset of 1 are very wonky, almost as if the model begins overfitting on the training data right from the start. For all the model runs below, the only thing changing between runs is the sequence length and offset of the train, test, and val partitions, and I didn't see any obvious errors in the data prep or training pipelines that would account for the weird training logs shown in the final figure here.

First, each cell in the heatmaps below represents a different model run with a unique sequence length and offset combination. While some combinations appear to outperform others, there are no super suspect numbers.
[image: performance heatmaps, one cell per sequence length/offset combination]

Similarly, when we just plot out the validation predictions, they seem pretty reasonable.
[image: validation predictions]

But! When we look at our training logs, we see that our validation loss throughout training is very erratic for all combinations except sequence length 365/offset 1 and sequence length 180/offset 1 (note that here I was using an early stopping patience of 20).
[image: training logs for each sequence length/offset combination]

One thing to consider is that the offsets and shorter sequence lengths lead to a lot more samples for each epoch, so there are correspondingly more batches/more updates in each epoch. Maybe this is just allowing the model to converge much more quickly? Or maybe there's something about the RGCN that depends on sequences starting on the same day of the year, but if that were the case, I wouldn't expect the high test/val metrics or reasonable predictions. I'm curious to get your thoughts @janetrbarclay, @jdiaz4302, @jsadler2.
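
For a rough sense of scale, here's a minimal sketch (not the repo's actual data-prep code) of how much the sample count grows, assuming the offset is applied as a fraction of the sequence length:

```python
# Hypothetical helper: count sliding windows of length seq_len whose start
# dates are spaced seq_len * offset timesteps apart, for a single reach.
def count_sequences(n_timesteps, seq_len, offset):
    stride = max(int(seq_len * offset), 1)
    return len(range(0, n_timesteps - seq_len + 1, stride))

n_days = 365 * 10  # e.g., ten years of daily data for one reach
for seq_len in (365, 180, 60):
    for offset in (1, 0.5, 0.25):
        print(seq_len, offset, count_sequences(n_days, seq_len, offset))
```

With ten years of daily data, a 365/1 run yields 10 sequences per reach while a 60/0.25 run yields roughly 240, so an epoch sees on the order of 20x more gradient updates at the same batch size.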

@SimonTopp
Contributor Author

Quick update: when we restrict the training to only 4 years (meaning fewer batches/updates each epoch), we start to see training logs more like we'd expect with regard to validation loss. The model still appears to begin overfitting on the training data fairly quickly, which is probably worth keeping in mind for future applications of the RGCN.

[image: training logs with training restricted to 4 years]

@jsadler2
Collaborator

jsadler2 commented Dec 30, 2021

[image: validation predictions vs. observations, missing the summer peaks]

It is interesting that it's totally missing the summers. Would it be difficult to plot the same thing for the longer sequence lengths?

@jdiaz4302
Collaborator

jdiaz4302 commented Dec 30, 2021

Some thoughts for looking into this -

  • Regarding the 4-year training logs and the whole-data-set training logs, I would like to see the plots with the same range of values on the y-axis, because when I try to mentally adjust the axes to match, I feel like they don't look as different - the top row of the second figure may actually be worse (e.g., use y = 0.4 as a reference)
  • It seems that the default dropout values for the repo are to set regular and recurrent dropout to 0. I'd be interested in seeing what these results look like if you set both dropout values to 0.5 (or maybe even 0.25). In either case, you may want to be less forgiving with the early stopping patience or use restore_best_weights=True, because I feel like these results show that the current approach can still allow a lot of overfitting (a rough sketch of both settings is at the end of this comment)
  • Which training log does the 60-day validation time series belong to? I agree that it isn't the worst; that's about what I'd expect from a first-pass, non-tuned run. I'd definitely wager that those predictions look better than the predictions from the randomly initialized weights (you can verify), which the training log suggests have the same or better RMSE, so that's really strange.

(The last two points kind of fight with each other, because restore_best_weights may say that the best validation-set RMSE was at epoch 0, but the forecasts could look better despite the training log painting a poor picture?)
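
For concreteness, both suggestions map onto standard tf.keras settings along these lines (illustrative only; the repo's own layer and config names may differ):

```python
import tensorflow as tf

# Heavier regularization on the recurrent layer: "regular" dropout acts on the
# layer inputs, recurrent dropout on the hidden-state connections.
rnn = tf.keras.layers.LSTM(
    units=20,
    return_sequences=True,
    dropout=0.5,
    recurrent_dropout=0.5,
)

# Stricter early stopping: smaller patience, and roll back to the weights from
# the best validation epoch rather than keeping the final-epoch weights.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)
```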

@SimonTopp
Contributor Author

Thanks for chiming in @jdiaz4302! It sounds like overall you're suggesting re-running the grid search on the full training data but with more measures in place to avoid overfitting? In response to:

It seems that the default dropout values for the repo are to set regular and recurrent dropout to 0.

and

you may want to be less forgiving with the early stopping patience or use restore_best_weights=True

I'm using 0.3 for regular dropout and 0 for recurrent, but I'll redo a run with both set to 0.5 to see what happens. Also, these results use the best_val_weights (not the final weights) and an early stopping patience of 20.

  • In response to your first point: it may not be across the board, but with the same y-axis, it still looks to me like the logs from the 4-year training period show a better relationship between train loss and val loss.

[images: training logs for the full and 4-year training periods, plotted with a shared y-axis]

  • Which training log does the 60-day validation time series belong to?

These are predictions using the best validation weights from the 60_1 run, so the model had 6 or 7 epochs of training, which, from the log, we can see performs better than the initialization weights. I haven't looked at what the predictions from the final weights look like, but I'm guessing they'd be much worse. It would definitely be interesting to look into.

@jdiaz4302
Collaborator

Thanks for those plots and the clarifications on your run specs.

I find the training logs with the full y range a lot more comforting. I didn't have any range in mind for the first plots, so they looked pretty wild and chaotic as a first impression, but given the full y range, the fact that these are essentially first passes under new conditions, and the "very odd training logs" framing of the discussion, I feel like even the worst among them are pretty reasonable - showing normal-to-moderate levels of imperfection (with the moderate levels mostly confined to the offset = 0.25 and, to a lesser extent, the offset = 0.5 columns with the full training period).

It sounds like overall you're suggesting re-running the grid search on the full training data but with more measures in place to avoid overfitting?

Yeah, it sounded like you suspected overfitting, so setting regular and recurrent dropout to 0.5 should be a very strong measure to combat that (not necessarily the best for performance but it would be difficult to overfit with those conditions). This may drop the moderate levels of imperfections to normal for those worse training logs.

@aappling-usgs
Member

Why do you suppose the val RMSEs are almost always lower than the train RMSEs, even for 0 epochs?

@SimonTopp
Contributor Author

Why do you suppose the val RMSEs are almost always lower than the train RMSEs, even for 0 epochs?

@aappling-usgs, I believe the epoch loss is defined as the average of the batch losses over the epoch. Therefore, in epoch 0 (epoch 1 in reality), the training RMSEs will include some really high values from the first few batches with the randomly initialized weights, which pull the average up. The val loss, on the other hand, is only calculated after the first full epoch of training, so it won't have those super high initial values artificially inflating it.
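
A toy illustration of the arithmetic with made-up numbers (not from these runs):

```python
# The logged train loss for "epoch 0" averages batch losses computed while the
# weights were still improving from their random initialization, so the early,
# very high batch losses are baked in. Validation loss is computed once, after
# the epoch, using the end-of-epoch weights.
batch_rmses_epoch_0 = [8.0, 6.0, 4.0, 3.0, 2.5, 2.3]  # improves within the epoch
train_rmse_logged = sum(batch_rmses_epoch_0) / len(batch_rmses_epoch_0)  # ~4.3
val_rmse_after_epoch_0 = 2.4  # evaluated with the end-of-epoch weights
print(train_rmse_logged, val_rmse_after_epoch_0)  # train looks worse than val
```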

@jsadler2
Collaborator

@SimonTopp - not sure if you saw my comment above (@jdiaz4302 and I posted at the same time). I'm wondering if for other sequence lengths the model predictions are similarly poor in the summers.

@SimonTopp
Contributor Author

SimonTopp commented Dec 30, 2021

@jsadler2 Definitely looks like with the shorter sequences we do worse during the summer months. While they never really get there, they do get closer to matching the obs as you move from 60 to 365. Also, in plotting these up I realized that our keep_frac for the predictions is acting a little wonky for some of the combinations. This should only impact the final predictions and not those in the training log (in other words, the predictions missing in the last figure aren't missing during train/val, just in the output from predict() after training is finished). I'll look into it.

[images: predictions vs. observations for increasing sequence lengths, from 60 to 365 days]

And the bad keep_frac example:
[image: example of the keep_frac issue in the final predictions]

@SimonTopp
Contributor Author

It's also worth noting that the poor summer predictions vary from reach to reach; below are two additional examples.
[images: summer predictions for two additional example reaches]

@jsadler2
Collaborator

Thanks for making and posting those, @SimonTopp. It's very interesting that the longer sequence lengths do better in the summer. ... I can't think of why that might be 🤔

@janetrbarclay
Collaborator

It seems reasonable to me that having a full year of data could help with the summer temp patterns since the model is then seeing the full annual temporal pattern rather than just a snippet.

It also looks like the summer observations for some of those reaches are a little unexpected, either plateauing or decreasing where I'd expect increases.

@SimonTopp
Contributor Author

So I believe I've gotten to the bottom of this. The baselines above were all done with dropout set to 0.3 and recurrent dropout set to 0. I did some simple tuning of the dropout space using 180/0.5 (sequence/offset) runs as a test. Below, rows are dropout and columns are recurrent dropout. We can see that while recurrent dropout improves the train/val relationship, regular dropout significantly reduces validation performance, leading to the training logs shared earlier.

[image: training logs across the dropout grid; rows are dropout, columns are recurrent dropout]

When we then redo the hypertuning of the sequence length and offset with recurrent dropout set to 0.3 and regular dropout set to zero, we see significant improvement in all of our metrics, with our best validation RMSE going from 2.13 to 1.68. We also see that our best run moves from 365/1 (sequence/offset) to 60/0.25.

[image: performance heatmaps for the re-run sequence length/offset search with recurrent dropout 0.3 and regular dropout 0]

Finally, with the new dropout setting, we see overall much better agreement with the summer months.

[image: predictions with the new dropout settings, showing better agreement in the summer months]

There are still a few questions. Specifically:

  • Why does dropout have such a detrimental effect while recurrent dropout appears to help/perform as expected? (A simplified sketch of where each mask acts is below.)
  • Is the 60/0.25 sequence length/offset combo the best performing because the model doesn't gain useful temporal information beyond 60 days, or because it leads to more samples in our training data for the model to learn from?
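
On that first question, the distinction is just where the dropout mask is applied. A simplified, single-timestep sketch in plain NumPy (not the repo's RGCN cell, which adds graph-convolution terms on top of the recurrent update):

```python
import numpy as np

def recurrent_step(x_t, h_prev, W_x, W_h, drop, rec_drop, rng):
    """One simplified recurrent update with (inverted) dropout masks."""
    # "Regular" dropout zeroes elements of the input features at this timestep...
    x_t = x_t * (rng.random(x_t.shape) >= drop) / (1.0 - drop)
    # ...while recurrent dropout zeroes elements of the hidden state carried
    # between timesteps, i.e., it perturbs the model's memory of past conditions.
    h_prev = h_prev * (rng.random(h_prev.shape) >= rec_drop) / (1.0 - rec_drop)
    return np.tanh(x_t @ W_x + h_prev @ W_h)
```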

I think this issue is largely solved, but I'll leave it open until the end of the day in case anyone has some follow up thoughts.
