
Pad train/val/test data #218

Merged: jds485 merged 13 commits into main from 217-pad-data on May 11, 2023

Conversation

@jds485 (Member) commented Apr 24, 2023

This PR adds function arguments that allow a user to pad the training, validation, and/or testing datasets so that data are not trimmed. For the inland salinity dataset, this resulted in 5,000-20,000 more observations being used.

I tested this for cases where the train/val/test partition is a continuous time period. I think a discontinuous partition (e.g., training is 2000-01-01 to 2005-01-01 and 2010-01-01 to 2015-01-01) would require a different coding approach: for example, pad each of the continuous periods within the discontinuous blocks so that batches are defined for each continuous period. That might also address #127. Curious to hear your thoughts on how best to approach padding for discontinuous time periods.

Closes: #217

    :return: [numpy array] batched data with dims [nbatches, nseg, seq_len
        (batch_size), nfeat]
    """
    # offset > 1 is an absolute step size; offset <= 1 is a fraction of seq_len
    if offset > 1:
        period = int(offset)
    else:
        period = int(offset * seq_len)
    # integer division drops trailing timesteps that don't fill a full batch
    num_batches = data_array.shape[1] // period

    ndays = data_array.shape[1]
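
For concreteness, a quick illustration of how offset maps to a batching period and how trailing data get trimmed without padding (the values here are hypothetical):

```python
seq_len = 365

# fractional offset: a new batch starts every half sequence (overlapping batches)
period_frac = int(0.5 * seq_len)   # 182
# absolute offset: a new batch starts every `offset` timesteps
period_abs = int(200)              # 200

# with 545 timesteps and period = 365: 545 // 365 = 1 full batch,
# so the trailing 180 timesteps would be dropped without padding
print(545 // 365, 545 % 365)       # prints: 1 180
```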
Collaborator:

Would it be worth calling this nsteps instead of ndays? Or maybe the convention of daily timesteps is hardcoded elsewhere such that it doesn't matter?

Member Author:

Yes, that's a good catch. I was trying to be generic with this code, but missed a few spots. I think I hard-coded 'D' for daily timesteps somewhere; I can make that user-specified.
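
For example, a minimal sketch of exposing the timestep frequency as an argument instead of hard-coding 'D' (the function and argument names here are hypothetical, not repo code):

```python
import numpy as np

def fill_dates(start, nsteps, timestep="D"):
    """Generate nsteps consecutive timestamps at the given numpy frequency."""
    step = np.timedelta64(1, timestep)
    return np.arange(start, start + nsteps * step, step)

# daily (the current hard-coded behavior) vs. hourly
fill_dates(np.datetime64("2020-01-01"), 3)                   # Jan 1, 2, 3
fill_dates(np.datetime64("2020-01-01T00"), 3, timestep="h")  # 00:00, 01:00, 02:00
```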

:param fill_time: [bool] When True, filled in data are time indices that
follow in sequence from the previous timesteps. When False, filled in data
are replicates of the previous timesteps.
When False, filled in data are replicates of the previous timesteps.
Collaborator:

Looks like this line is accidentally duplicated? ("When False . . . previous timesteps.")

Member Author:

Yes, thanks! I'll fix that.

river_dl/evaluate.py — review thread (outdated, resolved)
@jds485 removed the request for review from SimonTopp on April 25, 2023.
                                     fill_dates_array),
                                    axis=1)
    else:
        data_array = np.concatenate((data_array,
Collaborator:

How important is it to maintain the temporal structure of the original dataset in the padded values? (It seems important, but I can also convince myself that it isn't.) With this approach to padding the data array, the temporal structure isn't continuous. For example, if the data_array has 545 steps and a period length of 365 (so 1.5 years of data with 1-year batches), the number of repeated steps is 185 and the repeated steps would be indices 360-545. If the original dataset starts in Jan, then it would run through the following June and we'd need to pad for July-Dec, but the repeated steps are Jan-June, so the last batch would be winter-spring-winter-spring (instead of winter-spring-summer-fall). With annual periods you could pull the needed data from the last timesteps of the prior batch, but that breaks down if the sequence length is other than annual. Thoughts?
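
A quick sketch of the arithmetic in that example (the variable names are illustrative):

```python
nsteps, period = 545, 365
num_rep_steps = period - (nsteps % period)        # 365 - 180 = 185
# the padding repeats indices 360-544 (Jan-June) rather than continuing
# the calendar into July-Dec, so the final batch reads
# winter-spring-winter-spring instead of winter-spring-summer-fall
pad_inds = list(range(nsteps - num_rep_steps, nsteps))  # 360..544
```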

Member Author:

> How important is it to maintain the temporal structure of the original dataset in the padded values?

If I'm thinking about this correctly, the padding only affects the last batch. All of the observations for the filled-in timesteps in that batch are set to np.nan, so I think it does not matter which data are used to pad the timeseries.
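
A minimal sketch of why the padded values should be inert, assuming a NaN-masked loss like the one implied here (the shapes are illustrative):

```python
import numpy as np

nseg, nsteps, num_rep_steps = 10, 360, 185
y_obs = np.random.rand(nseg, nsteps, 1)
y_pad = np.full((nseg, num_rep_steps, 1), np.nan)   # padded obs are all NaN
y = np.concatenate((y_obs, y_pad), axis=1)

y_pred = np.random.rand(*y.shape)
# np.nanmean ignores the padded timesteps entirely, so the x data used
# to pad the series cannot contribute any training signal
rmse = np.sqrt(np.nanmean((y_pred - y) ** 2))
```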

Collaborator:

It definitely only affects the last batch (or at least on the run I did). I compared the input data for the test partition with and without padding and they were identical through the full shape of the un-padded array.

And the models shouldn't be using any future data. Is it worth checking that? (doing a run or 2 pulling different data to pad the series? could even test the extremes or something)

Member Author:

> And the models shouldn't be using any future data. Is it worth checking that?

I checked the shapes of the observation data in the training, validation, and testing partitions for the spatial and temporal methods in the repo. The shapes matched the x data array (and the ids and times arrays), and the observations were nan for the padded times. Is that the kind of check you were thinking of?

Member Author:

> doing a run or 2 pulling different data to pad the series?

Like programming a random sample to pad the timeseries? I can try that.

Collaborator:

> The shapes matched the x data array (and ids and times arrays), and the observations were nan for the padded times. Is that the kind of check you were thinking of?

I was thinking of something more like testing the assumption that it doesn't really matter which x data are used for the padding. So from a temperature perspective, trying the padding with warmer air temps or colder air temps and checking to make sure the simulated temps are essentially the same. (and if not, then we need to think more carefully about what we use for the x data for padding)
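
A hypothetical version of that check, perturbing only the air-temperature feature in the padded block (the array shape and air_temp_idx are made up):

```python
import numpy as np

data_array = np.random.rand(10, 545, 4)   # (nseg, ntimes, nfeat) dummy data
num_rep_steps, air_temp_idx = 185, 0

pad = data_array[:, -num_rep_steps:, :].copy()
pad[:, :, air_temp_idx] += 10.0           # or -= 10.0 for a cold extreme
data_array_warm = np.concatenate((data_array, pad), axis=1)
# predictions over the unpadded span should match the baseline run;
# if they don't, the choice of padding data matters
```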

Member Author:

Okay, let me try with the random sample. I have set the random seeds and verified that results are the same when those are specified. So using a random sample should produce the same results as I obtained from a previous run.

Member Author:

I got identical results using a random sample and filling in by replicating the last timesteps.

Code for random sampling of timesteps:

    import random
    import numpy as np

    # pad the end of the series with a random sample of earlier timesteps
    sample_inds = random.sample(range(nsteps), num_rep_steps)
    data_array = np.concatenate((data_array,
                                 data_array[:, sample_inds, :]),
                                axis=1)

Results for replicating timesteps (top), random sampling (middle), and their difference (bottom):
[image]

@jds485 (Member Author) commented Apr 26, 2023

How should we pad discontinuous partitions?

  1. We could make discontinuous partitions continuous when creating batches by filling in the gap with observed x data and np.nan y data. This assumes that we would always have the observed x data to fill in the gaps, which I think is true. @janetrbarclay notes that with this approach we could end up with batches that have no non-nan obs, so a follow-up step could be to remove batches whose obs are all nan.
  2. Alternately, we could treat each section of the discontinuous partitions separately for the purposes of padding (see the sketch after this list). Then batches would always start with observed data.
  • Same question as above about how hidden and cell states are handled.
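
A minimal sketch of the first step of option 2, assuming daily timesteps (split_at_gaps is a hypothetical helper, not code from this PR):

```python
import numpy as np

def split_at_gaps(times, step=np.timedelta64(1, "D")):
    """Return one index array per continuous run of timesteps."""
    breaks = np.where(np.diff(times) != step)[0]
    return np.split(np.arange(len(times)), breaks + 1)

times = np.concatenate((
    np.arange("2000-01-01", "2005-01-01", dtype="datetime64[D]"),
    np.arange("2010-01-01", "2015-01-01", dtype="datetime64[D]"),
))
chunks = split_at_gaps(times)  # two runs; each would be padded and batched separately
```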

@jdiaz4302 (Collaborator):

> how are cell and hidden states handled from batch to batch in this implementation? It seems like they are passed from the previous batch to the current batch, but I want to double check

I believe init_states was included because Simon was adding pytorch compatibility for GWN and I shared the pytorch RGCN version that I coded up for DRB forecasting, where providing h/c is needed for data assimilation. I don't think anyone uses it in this repo; pretty sure it's the default init_states=None being used. Others should definitely feel free to correct me if this is wrong.

@jdiaz4302 (Collaborator):

I'm super rusty on this repo as a project workflow (i.e., beyond its general modeling methods). Do I understand the original problem correctly:

  • Currently, the last batches (which are maybe shorter than the requested sequence length) are being dropped? And this is a problem because those final, shorter sequences have a nontrivial amount of data?

If I understand it well enough, I think option 2 sounds better because then you're not potentially connecting mid-winter and mid-summer together; instead you'll just have two nan-heavier sequences, which sounds more appropriate and realistic for some hypothetical, operational usage.

@jds485 (Member Author) commented Apr 26, 2023

> do I understand the original problem correctly
>
> • Currently, the last batches (which are maybe shorter than the requested sequence length) are being dropped? And this is a problem because those final, shorter sequences have a nontrivial amount of data?

Yes, that's correct.

> I think option 2 sounds better because then you're not potentially connecting mid-winter and mid-summer together

I think option 1 would also achieve this, after dropping any sequences that only have nan data. For a partition with [data1, gap, data2], we would fill in the gap with observed x data and set y to nan in that gap. The difference is that with option 1 the nans in data2 could appear at the start of the first and the end of the last sequence, while with option 2 they would appear only at the end of the last sequence. So I think there could be fewer overall nans using option 2. If cell and hidden states are not carried over from batch to batch, that's also a good case for option 2.

@jdiaz4302 (Collaborator):

Ah, okay. It sounds fairly inconsequential then from my outsider point of view.

I think nan only at the end is slightly preferable. It probably has no impact on the current workflow, but I could see it being beneficial to autoregressive models that might get nan drivers for the lagged target during the "gap", and it would be an easier assumption to make when handling the data.

@jds485 (Member Author) commented Apr 27, 2023

Thanks for your perspectives, @jdiaz4302! I can implement option 2 for this PR.


    # get the start date indices. These are used to split the
    # dataset before creating batches
    continuous_start_inds.append(i + 1)
Collaborator:

This might run counter to our previous discussion re: starting batches after gaps, but would we ever want to fill short gaps and not restart the batch? (i.e., partitioning our data so we test on July and train on Aug through June, with 365-day batches) In that case we might want to fill the gap in July but keep going and not restart the batch. That's not the best example, and maybe since that's not a current use we should just note it as an issue and keep this as you have it?

Member Author:

> would we ever want to fill short gaps and not restart the batch?

Maybe. The check for gap_length < seq_len could be used to identify which indices to fill instead of restarting the batch after the gap (see the sketch below). seq_len could be changed to a function argument that specifies the maximum gap length that should be filled. I think that approach would also require an edit to the array to fill in the missing dates.

I initially thought the example you gave could be solved by specifying the start dates of training and testing as Aug 1 and July 1, but that will not work after leap years, so filling in small gaps would be preferable.

I don't currently need that functionality, so I'd vote to convert to an issue.
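
A hypothetical sketch of that variant, with max_fill_gap as the new argument (the names and logic are illustrative, not this PR's code):

```python
import numpy as np

def split_or_fill(times, max_fill_gap, step=np.timedelta64(1, "D")):
    """Restart batching only after gaps longer than max_fill_gap timesteps."""
    continuous_start_inds = [0]
    fill_after_inds = []
    gaps = np.diff(times) / step - 1   # missing steps after each index
    for i, gap in enumerate(gaps):
        if gap == 0:
            continue
        elif gap <= max_fill_gap:
            # small gap: fill dates and x data, set y to NaN, keep the batch going
            fill_after_inds.append(i)
        else:
            # large gap: restart batching at the next observed timestep
            continuous_start_inds.append(i + 1)
    return continuous_start_inds, fill_after_inds
```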

@janetrbarclay (Collaborator) commented May 5, 2023 via email

@janetrbarclay (Collaborator):

I ran the updates for padding the discontinuous partitions; they ran fine and the padding looks good. When I looked at the predicted temps, however, I saw that there were predictions for the added days. Looking at the updated predict function, I see there's an option to specify the last days, but I hadn't noticed it (and therefore hadn't updated my snakemake file to account for it), and as written it doesn't remove padded days in the middle of the partition (as with discontinuous times).

What would you think of addressing both of these issues (the "whoops, I forgot to specify the padding dates in multiple places" issue and the "latest_time not working with discontinuous partitions" issue) by adding a bit to the preprocessing that tracks whether the data are padded or not? (an additional array in prepped.npz, like "times_trn" or "ids_trn", holding a boolean that tracks padded or not) Since prepped.npz (or whatever name is used for the prepped data file) is passed to predict_from_io_data, the padding flag would always be available (and its absence could be treated as no padding).
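
A minimal sketch of that idea (the padded_tst name and shapes are hypothetical):

```python
import numpy as np

# one boolean per timestep row, True where the data were padded
padded_tst = np.zeros(730, dtype=bool)
padded_tst[-185:] = True

np.savez_compressed("prepped.npz", padded_tst=padded_tst)

# predict_from_io_data already receives the prepped file, so the flag travels
# with the data; a missing key can be treated as "no padding"
data = np.load("prepped.npz")
if "padded_tst" in data.files:
    mask = data["padded_tst"]
else:
    mask = np.zeros(730, dtype=bool)
kept_rows = np.flatnonzero(~mask)  # drop predictions for padded timesteps
```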

@jds485 (Member Author) commented May 8, 2023

> What would you think of addressing both of these issues by adding a bit to the preprocessing that tracks whether the data are padded or not?

That's a great suggestion! I can look into adding that.

@jds485 (Member Author) commented May 10, 2023

@janetrbarclay This is ready to be tested. I tried it with several discontinuous partitions and it worked for me. This is a plot of one of the prediction timeseries; the horizontal line is where there are no data. I did this with latest_time = None (the default). I decided to leave that as a function argument in case someone wants to use it.

[image]

And this is the corresponding indicator of padded data (1 = padded). The x-axis is the index in the array, not the date, so the full length of the data gap is not included in this figure.
[image]

@jds485 requested a review from janetrbarclay on May 10, 2023.
@janetrbarclay (Collaborator) left a comment:

Looks good! Thanks for sticking with this through many slow reviews!

@jds485 merged commit f545bdf into main on May 11, 2023.
@jds485 deleted the 217-pad-data branch on May 11, 2023.