Changing offsets and sequence lengths leads to reasonable predictions but very odd training logs #152
Quick update: when we restrict the training to only 4 years (meaning fewer batches/updates each epoch), we start to see training logs more like we'd expect with regard to validation loss. The model still appears to begin overfitting on the training data fairly quickly, which is probably worth keeping in mind for future applications of the RGCN.
Some thoughts for looking into this -
(The last two points somewhat work against each other: restore_best_weights may report that the best validation-set RMSE was at epoch 0, yet the forecasts could still look better than the training log suggests.)
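(For reference, a minimal sketch of the kind of early-stopping setup being discussed, assuming a Keras-style training loop; the monitored metric and patience below are placeholders, not necessarily this repo's actual settings.)

```python
import tensorflow as tf

# restore_best_weights=True rolls the model back to the epoch with the best
# monitored value, even if that turns out to be the very first epoch -- which
# is how the saved weights and the training log can tell different stories.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=20,               # placeholder; match whatever the run actually uses
    restore_best_weights=True,
)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])
```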
Thanks for chiming in @jdiaz4302! It sounds like overall you're suggesting re-running the grid search on the full training data but with more measures in place to avoid overfitting? In response to:
and
I'm using 0.3 for regular dropout and 0 for recurrent, but I'll redo a run with both set to 0.5 to see what happens. Also, these results are using the
These are predictions using the best validation weights from the 60_1 run, so the model had 6 or 7 epochs of training, which from the log we can see performs better than the initialization weights. I haven't looked at what the predictions from the final weights are, but I'm guessing they'd be much worse. Would definitely be interesting to look into.
Thanks for those plots and the clarifications on your run specs. I find the training logs with the full y range a lot more comforting. I didn't have any range in mind for the first plots, so they looked pretty wild and chaotic at first glance; but given the full y range, the fact that these are essentially first passes under new conditions, and the "very odd training logs" discussion prompt, I feel like even the worst among them are pretty reasonable, showing normal-to-moderate levels of imperfection (with the moderate levels mostly confined to the offset = 0.25 and, to a lesser extent, offset = 0.5 columns with the full training period).
Yeah, it sounded like you suspected overfitting, so setting both regular and recurrent dropout to 0.5 should be a very strong measure against that (not necessarily the best for performance, but it would be difficult to overfit under those conditions). That may drop the moderate levels of imperfection down to normal for the worse training logs.
Why do you suppose the val RMSEs are almost always lower than the train RMSEs, even at 0 epochs?
@aappling-usgs, I believe that the epoch loss is defined as the average of the batch losses over the epoch. Therefore in epoch 0 (epoch 1 in reality), the training RMSEs will include some really high values from the first few batches run with the randomly initialized weights, which pull the average up. The val loss, on the other hand, is only calculated after the first full epoch of training, so it won't have those super high initial values artificially inflating it.
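A toy illustration of that bookkeeping (the numbers are made up, and this assumes the usual Keras convention of reporting the epoch's mean of batch losses):

```python
import numpy as np

# Hypothetical per-batch training RMSEs during "epoch 0": the first few
# batches, run with randomly initialized weights, are very high.
batch_rmse = np.array([9.0, 7.5, 5.0, 3.2, 2.6, 2.3, 2.2, 2.1])

# What the log reports as the epoch-0 training RMSE: the mean over all
# batches in the epoch, dragged upward by those early batches.
train_rmse_epoch0 = batch_rmse.mean()      # ~4.2

# Validation is evaluated once, with the weights as they stand at the *end*
# of the epoch, so it looks more like the final batches.
val_rmse_epoch0 = 2.2                      # hypothetical

print(train_rmse_epoch0, val_rmse_epoch0)  # val < train at "epoch 0"
```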
@SimonTopp - not sure if you saw my comment above (@jdiaz4302 and I posted at the same time). I'm wondering if for other sequence lengths the model predictions are similarly poor in the summers. |
@jsadler2 Definitely looks like with the shorter sequences we do worse during the summer months. While they never really get there, they do get closer to matching the obs as you move from 60 to 365. Also, in plotting these up I realized that our |
Thanks for making and posting those, @SimonTopp. It's very interesting that the longer sequence lengths do better in the summer. ... I can't think of why that might be 🤔
It seems reasonable to me that having a full year of data could help with the summer temp patterns since the model is then seeing the full annual temporal pattern rather than just a snippet. It also looks like the summer observations for some of those reaches are a little unexpected, either plateauing or decreasing where I'd expect increases. |
So I believe I've gotten to the bottom of this. The baselines above were all done with dropout set to 0.3 and recurrent dropout set to 0. I did some simple tuning of the dropout space using 180/0.5 (sequence/offset) runs as a test. Below, rows are dropout and columns are recurrent dropout. We can see that while recurrent dropout improves the train/val relationship, regular dropout significantly reduces validation performance, leading to the training logs shared earlier.

When we then redo the hypertuning of sequence length and offset with recurrent dropout set to 0.3 and regular dropout set to zero, we see significant improvement in all of our metrics, with our best validation RMSE going from 2.13 to 1.68. We also see our best run move from 365/1 (sequence/offset) to 60/0.25. Finally, with the new dropout settings, we see overall much better agreement with the summer months. There are still a few questions. Specifically:
I think this issue is largely solved, but I'll leave it open until the end of the day in case anyone has some follow-up thoughts.
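(For anyone reproducing the dropout comparison above: in Keras terms, `dropout` masks the layer inputs at each timestep while `recurrent_dropout` masks the hidden-to-hidden connections. The RGCN implementation in this repo may wire these in its own way, so the snippet below is only a sketch of the two knobs, using the combination that performed best above.)

```python
import tensorflow as tf

# Sketch only -- not the repo's actual RGCN layer. Shows the two dropout
# knobs compared above, with the combination that gave the best validation
# RMSE (regular dropout off, recurrent dropout at 0.3).
recurrent_layer = tf.keras.layers.LSTM(
    units=20,               # hypothetical hidden-state size
    return_sequences=True,
    dropout=0.0,            # "regular" dropout, applied to the inputs
    recurrent_dropout=0.3,  # dropout on the recurrent (hidden-to-hidden) state
)
```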
Not exactly an issue, but something to be aware of and that I'm curious to get other people's opinions on. While hypertuning sequence length and offset for the RGCN, I found that while various sequence length/offset combinations lead to reasonable predictions, the training logs for any combination other than a sequence length of 365 and an offset of 1 are very wonky, almost as if the model begins overfitting on the training data right from the start. For all the model runs below, the only thing changing between runs is the sequence length and offset of our train, test, and val partitions, and I didn't see any obvious errors in the data prep or training pipelines that would account for the weird training logs shown in the final figure here.
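(For context, here's roughly how I think about the sequence length/offset windowing being tuned; the function below is illustrative, not the repo's actual data-prep code. The offset is a fraction of the sequence length by which consecutive windows are shifted, so offset = 1 gives back-to-back sequences and offset = 0.25 gives heavily overlapping ones.)

```python
import numpy as np

def make_sequences(ts: np.ndarray, seq_len: int, offset: float) -> np.ndarray:
    """Slice a [time, features] array into (possibly overlapping) windows.

    offset is a fraction of seq_len: 1.0 -> back-to-back windows,
    0.25 -> each window starts a quarter-sequence after the previous one.
    """
    stride = max(1, int(seq_len * offset))
    starts = range(0, len(ts) - seq_len + 1, stride)
    return np.stack([ts[s:s + seq_len] for s in starts])

# Two years of daily data with one feature, purely for illustration.
ts = np.random.rand(730, 1)
print(make_sequences(ts, 365, 1.0).shape)   # (2, 365, 1)
print(make_sequences(ts, 60, 0.25).shape)   # (45, 60, 1)
```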
First, each cell in the heatmaps below represents a different model run with a unique sequence length and offset combination. While some combinations appear to outperform others, there are no super suspect numbers.
Similarly, when we just plot out the validation predictions, they seem pretty reasonable.
But! When we look at our training logs, we see that the validation loss throughout training is very erratic for all combinations except sequence length 365/offset 1 and sequence length 180/offset 1 (note that here I was using an early stopping value of 20).
One thing to consider is that the offsets and shorter sequence lengths lead to a lot more samples for each epoch, so there are correspondingly more batches and more weight updates per epoch. Maybe this is just allowing the model to converge much more quickly? Or maybe there's something about the RGCN that depends on sequences starting on the same day of the year, but if that were the case, I wouldn't expect the high test/val metrics or the reasonable predictions. I'm curious to get your thoughts @janetrbarclay, @jdiaz4302, @jsadler2.
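To put rough numbers on the "more batches/updates per epoch" point, here's a quick back-of-the-envelope count per reach, assuming ~10 years of daily data and windowing like the sketch above (illustrative only, not the actual pipeline counts):

```python
# Rough per-reach sequence counts for two of the combinations above,
# assuming ~10 years of daily data and the illustrative windowing sketch.
n_days = 3650

def n_windows(seq_len, offset, n_days=n_days):
    stride = max(1, int(seq_len * offset))
    return (n_days - seq_len) // stride + 1

print(n_windows(365, 1.00))   # 10 sequences  -> relatively few updates per epoch
print(n_windows(60, 0.25))    # 240 sequences -> ~24x more updates per epoch
```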