
Pad train/val/test data #218

Merged: jds485 merged 13 commits into main from 217-pad-data on May 11, 2023

Conversation

@jds485 (Member) commented Apr 24, 2023

This PR adds function arguments that allow a user to pad the training, validation, and/or testing datasets so that data are not trimmed. For the inland salinity dataset, this resulted in 5,000-20,000 more observations being used.

I tested this for cases where the train/val/test partition is a continuous time period. I think a discontinuous partition (e.g., training is 2000-01-01 to 2005-01-01 and 2010-01-01 to 2015-01-01) would require a different coding approach: for example, pad each of the continuous periods within the discontinuous blocks so that batches are defined for each continuous period. That might also address #127. Curious to hear your thoughts on how best to approach padding for discontinuous time periods.

Closes: #217

    :return: [numpy array] batched data with dims [nbatches, nseg, seq_len
        (batch_size), nfeat]
    """
    # offset > 1 is an absolute step size; offset <= 1 is a fraction of seq_len
    if offset > 1:
        period = int(offset)
    else:
        period = int(offset * seq_len)
    # integer division drops trailing timesteps that don't fill a full batch
    num_batches = data_array.shape[1] // period

    ndays = data_array.shape[1]
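
For concreteness, a quick illustration of how offset maps to a batching period and how trailing data get trimmed without padding (the values here are hypothetical):

```python
seq_len = 365

# fractional offset: a new batch starts every half sequence (overlapping batches)
period_frac = int(0.5 * seq_len)   # 182
# absolute offset: a new batch starts every `offset` timesteps
period_abs = int(200)              # 200

# with 545 timesteps and period = 365: 545 // 365 = 1 full batch,
# so the trailing 180 timesteps would be dropped without padding
print(545 // 365, 545 % 365)       # prints: 1 180
```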
Collaborator:

Would it be worth calling this nsteps instead of ndays? Or maybe the convention of daily timesteps is hardcoded elsewhere such that it doesn't matter?

Member Author:

Yes, that's a good catch. I was trying to be generic with this code, but missed a few spots. I think I hard-coded 'D' for daily timesteps somewhere; I can make that user-specified.
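
For example, a minimal sketch of exposing the timestep frequency as an argument instead of hard-coding 'D' (the function and argument names here are hypothetical, not repo code):

```python
import numpy as np

def fill_dates(start, nsteps, timestep="D"):
    """Generate nsteps consecutive timestamps at the given numpy frequency."""
    step = np.timedelta64(1, timestep)
    return np.arange(start, start + nsteps * step, step)

# daily (the current hard-coded behavior) vs. hourly
fill_dates(np.datetime64("2020-01-01"), 3)                   # Jan 1, 2, 3
fill_dates(np.datetime64("2020-01-01T00"), 3, timestep="h")  # 00:00, 01:00, 02:00
```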

:param fill_time: [bool] When True, filled in data are time indices that
follow in sequence from the previous timesteps. When False, filled in data
are replicates of the previous timesteps.
When False, filled in data are replicates of the previous timesteps.
Collaborator:

Looks like this line is accidentally duplicated? ("When False . . . previous timesteps.")

Member Author:

Yes, thanks! I'll fix that.

river_dl/evaluate.py — review thread (outdated, resolved)
@jds485 removed the request for review from SimonTopp on April 25, 2023.
                                     fill_dates_array),
                                    axis=1)
    else:
        data_array = np.concatenate((data_array,
Collaborator:

How important is it to maintain the temporal structure of the original dataset in the padded values? (It seems important, but I can also convince myself that it isn't.) With this approach to padding the data array, the temporal structure isn't continuous. For example, if the data_array has 545 steps and a period length of 365 (so 1.5 years of data with 1-year batches), the number of repeated steps is 185 and the repeated steps would be indices 360-545. If the original dataset starts in Jan, then it would run through the following June and we'd need to pad for July-Dec, but the repeated steps are Jan-June, so the last batch would be winter-spring-winter-spring (instead of winter-spring-summer-fall). With annual periods you could pull the needed data from the last timesteps of the prior batch, but that breaks down if the sequence length is other than annual. Thoughts?
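
A quick sketch of the arithmetic in that example (the variable names are illustrative):

```python
nsteps, period = 545, 365
num_rep_steps = period - (nsteps % period)        # 365 - 180 = 185
# the padding repeats indices 360-544 (Jan-June) rather than continuing
# the calendar into July-Dec, so the final batch reads
# winter-spring-winter-spring instead of winter-spring-summer-fall
pad_inds = list(range(nsteps - num_rep_steps, nsteps))  # 360..544
```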

Member Author:

> How important is it to maintain the temporal structure of the original dataset in the padded values?

If I'm thinking about this correctly, the padding only affects the last batch. All of the observations for the filled-in timesteps in that batch are set to np.nan, so I think it does not matter which data are used to pad the timeseries.
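
A minimal sketch of why the padded values should be inert, assuming a NaN-masked loss like the one implied here (the shapes are illustrative):

```python
import numpy as np

nseg, nsteps, num_rep_steps = 10, 360, 185
y_obs = np.random.rand(nseg, nsteps, 1)
y_pad = np.full((nseg, num_rep_steps, 1), np.nan)   # padded obs are all NaN
y = np.concatenate((y_obs, y_pad), axis=1)

y_pred = np.random.rand(*y.shape)
# np.nanmean ignores the padded timesteps entirely, so the x data used
# to pad the series cannot contribute any training signal
rmse = np.sqrt(np.nanmean((y_pred - y) ** 2))
```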

Collaborator:

It definitely only affects the last batch (or at least on the run I did). I compared the input data for the test partition with and without padding and they were identical through the full shape of the un-padded array.

And the models shouldn't be using any future data. Is it worth checking that? (doing a run or 2 pulling different data to pad the series? could even test the extremes or something)

Member Author:

> And the models shouldn't be using any future data. Is it worth checking that?

I checked the shapes of the observation data in the training, validation, and testing partitions for the spatial and temporal methods in the repo. The shapes matched the x data array (and the ids and times arrays), and the observations were nan for the padded times. Is that the kind of check you were thinking of?

Member Author:

> doing a run or 2 pulling different data to pad the series?

Like programming a random sample to pad the timeseries? I can try that.

Collaborator:

> The shapes matched the x data array (and ids and times arrays), and the observations were nan for the padded times. Is that the kind of check you were thinking of?

I was thinking of something more like testing the assumption that it doesn't really matter which x data are used for the padding. So from a temperature perspective, trying the padding with warmer air temps or colder air temps and checking to make sure the simulated temps are essentially the same. (and if not, then we need to think more carefully about what we use for the x data for padding)
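
A hypothetical version of that check, perturbing only the air-temperature feature in the padded block (the array shape and air_temp_idx are made up):

```python
import numpy as np

data_array = np.random.rand(10, 545, 4)   # (nseg, ntimes, nfeat) dummy data
num_rep_steps, air_temp_idx = 185, 0

pad = data_array[:, -num_rep_steps:, :].copy()
pad[:, :, air_temp_idx] += 10.0           # or -= 10.0 for a cold extreme
data_array_warm = np.concatenate((data_array, pad), axis=1)
# predictions over the unpadded span should match the baseline run;
# if they don't, the choice of padding data matters
```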

Member Author:

Okay, let me try with the random sample. I have set the random seeds and verified that results are the same when those are specified. So using a random sample should produce the same results as I obtained from a previous run.

Member Author:

I got identical results using a random sample and filling in by replicating the last timesteps.

Code for random sampling of timesteps:

    import random
    import numpy as np

    # pad the end of the series with a random sample of earlier timesteps
    sample_inds = random.sample(range(nsteps), num_rep_steps)
    data_array = np.concatenate((data_array,
                                 data_array[:, sample_inds, :]),
                                axis=1)

Results for replicating timesteps (top), random sampling (middle), and their difference (bottom):
[image]

@jds485 (Member Author) commented Apr 26, 2023

How should we pad discontinuous partitions?

  1. We could make discontinuous partitions continuous when creating batches by filling in the gap with observed x data and np.nan y data. This assumes that we would always have the observed x data to fill in the gaps, which I think is true. @janetrbarclay notes that with this approach we could end up with batches that have no non-nan obs, so a follow-up step could be to remove batches whose obs are all nan.
  2. Alternately, we could treat each section of the discontinuous partitions separately for the purposes of padding (see the sketch after this list). Then batches would always start with observed data.
  • Same question as above about how hidden and cell states are handled.
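
A minimal sketch of the first step of option 2, assuming daily timesteps (split_at_gaps is a hypothetical helper, not code from this PR):

```python
import numpy as np

def split_at_gaps(times, step=np.timedelta64(1, "D")):
    """Return one index array per continuous run of timesteps."""
    breaks = np.where(np.diff(times) != step)[0]
    return np.split(np.arange(len(times)), breaks + 1)

times = np.concatenate((
    np.arange("2000-01-01", "2005-01-01", dtype="datetime64[D]"),
    np.arange("2010-01-01", "2015-01-01", dtype="datetime64[D]"),
))
chunks = split_at_gaps(times)  # two runs; each would be padded and batched separately
```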

@jdiaz4302 (Collaborator):

> how are cell and hidden states handled from batch to batch in this implementation? It seems like they are passed from the previous batch to the current batch, but I want to double check

I believe init_states was included because Simon was adding pytorch compatibility for GWN and I shared the pytorch RGCN version that I coded up for DRB forecasting, where providing h/c is needed for data assimilation. I don't think anyone uses it in this repo; pretty sure it's the default init_states=None being used. Others should definitely feel free to correct me if this is wrong.

@jdiaz4302 (Collaborator):

I'm super rusty on this repo as a project workflow (i.e., beyond its general modeling methods). Do I understand the original problem correctly:

  • Currently, the last batches (which are maybe shorter than the requested sequence length) are being dropped? And this is a problem because those final, shorter sequences have a nontrivial amount of data?

If I understand it well enough, I think option 2 sounds better because then you're not potentially connecting mid-winter and mid-summer together; instead you'll just have two nan-heavier sequences, which sounds more appropriate and realistic for some hypothetical, operational usage.

@jds485 (Member Author) commented Apr 26, 2023

> do I understand the original problem correctly
>
> • Currently, the last batches (which are maybe shorter than the requested sequence length) are being dropped? And this is a problem because those final, shorter sequences have a nontrivial amount of data?

Yes, that's correct.

> I think option 2 sounds better because then you're not potentially connecting mid-winter and mid-summer together

I think option 1 would also achieve this, after dropping any sequences that only have nan data. For a partition with [data1, gap, data2], we would fill in the gap with observed x data and set y to nan in that gap. The difference is that with option 1 the nans in data2 could appear at the start of the first and the end of the last sequence, while with option 2 they would appear only at the end of the last sequence. So I think there could be fewer overall nans using option 2. If cell and hidden states are not carried over from batch to batch, that's also a good case for option 2.

@jdiaz4302 (Collaborator):

Ah, okay. It sounds fairly inconsequential then from my outsider point of view.

I think nan only at the end is slightly preferable. It probably has no impact on the current workflow, but I could see it being beneficial to autoregressive models that might get nan drivers for the lagged target during the "gap", and it would be an easier assumption to make when handling the data.

@jds485 (Member Author) commented Apr 27, 2023

Thanks for your perspectives, @jdiaz4302! I can implement option 2 for this PR.


    # get the start date indices. These are used to split the
    # dataset before creating batches
    continuous_start_inds.append(i + 1)
Collaborator:

This might run counter to our previous discussion re: starting batches after gaps, but would we ever want to fill short gaps and not restart the batch? (i.e., partitioning our data so we test on July and train on Aug through June, with 365-day batches) In that case we might want to fill the gap in July but keep going and not restart the batch. That's not the best example, and maybe since that's not a current use we should just note it as an issue and keep this as you have it?

Member Author:

> would we ever want to fill short gaps and not restart the batch?

Maybe. The check for gap_length < seq_len could be used to identify which indices to fill instead of restarting the batch after the gap (see the sketch below). seq_len could be changed to a function argument that specifies the maximum gap length that should be filled. I think that approach would also require an edit to the array to fill in the missing dates.

I initially thought the example you gave could be solved by specifying the start dates of training and testing as Aug 1 and July 1, but that will not work after leap years, so filling in small gaps would be preferable.

I don't currently need that functionality, so I'd vote to convert to an issue.
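
A hypothetical sketch of that variant, with max_fill_gap as the new argument (the names and logic are illustrative, not this PR's code):

```python
import numpy as np

def split_or_fill(times, max_fill_gap, step=np.timedelta64(1, "D")):
    """Restart batching only after gaps longer than max_fill_gap timesteps."""
    continuous_start_inds = [0]
    fill_after_inds = []
    gaps = np.diff(times) / step - 1   # missing steps after each index
    for i, gap in enumerate(gaps):
        if gap == 0:
            continue
        elif gap <= max_fill_gap:
            # small gap: fill dates and x data, set y to NaN, keep the batch going
            fill_after_inds.append(i)
        else:
            # large gap: restart batching at the next observed timestep
            continuous_start_inds.append(i + 1)
    return continuous_start_inds, fill_after_inds
```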

@janetrbarclay (Collaborator) commented May 5, 2023 via email

@janetrbarclay (Collaborator):

I ran the updates for padding the discontinuous partitions; they ran fine and the padding looks good. When I looked at the predicted temps, however, I saw that there were predictions for the added days. Looking at the updated predict function, I see there's an option to specify the last days, but I hadn't noticed it (and therefore hadn't updated my snakemake file to account for it), and as written it doesn't remove padded days in the middle of the partition (as with discontinuous times).

What would you think of addressing both of these issues (the "whoops, I forgot to specify the padding dates in multiple places" issue and the "latest_time not working with discontinuous partitions" issue) by adding a bit to the preprocessing that tracks whether the data are padded or not? (an additional array in prepped.npz, like "times_trn" or "ids_trn", holding a boolean that tracks padded or not) Since prepped.npz (or whatever name is used for the prepped data file) is passed to predict_from_io_data, the padding flag would always be available (and its absence could be treated as no padding).
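
A minimal sketch of that idea (the padded_tst name and shapes are hypothetical):

```python
import numpy as np

# one boolean per timestep row, True where the data were padded
padded_tst = np.zeros(730, dtype=bool)
padded_tst[-185:] = True

np.savez_compressed("prepped.npz", padded_tst=padded_tst)

# predict_from_io_data already receives the prepped file, so the flag travels
# with the data; a missing key can be treated as "no padding"
data = np.load("prepped.npz")
if "padded_tst" in data.files:
    mask = data["padded_tst"]
else:
    mask = np.zeros(730, dtype=bool)
kept_rows = np.flatnonzero(~mask)  # drop predictions for padded timesteps
```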

@jds485 (Member Author) commented May 8, 2023

> What would you think of addressing both of these issues by adding a bit to the preprocessing that tracks whether the data are padded or not?

That's a great suggestion! I can look into adding that.

@jds485 (Member Author) commented May 10, 2023

@janetrbarclay This is ready to be tested. I tried it with several discontinuous partitions and it worked for me. This is a plot of one of the prediction timeseries; the horizontal line is where there are no data. I did this with latest_time = None (the default). I decided to leave that as a function argument in case someone wants to use it.

[image]

And this is the corresponding indicator of padded data (1 = padded). The x-axis is the index in the array, not the date, so the full length of the data gap is not included in this figure.
[image]

@jds485 requested a review from janetrbarclay on May 10, 2023.
@janetrbarclay (Collaborator) left a comment:

Looks good! Thanks for sticking with this through many slow reviews!

@jds485 merged commit f545bdf into main on May 11, 2023.
@jds485 deleted the 217-pad-data branch on May 11, 2023.