
Preprocessing overlapping data not implemented yet #54

Closed
alasdairhunter opened this issue Nov 12, 2018 · 36 comments
Assignees
Labels
preprocessor Related to the preprocessor

Comments

@alasdairhunter
Contributor

Hi,
I am having an issue with the preprocessing when trying to run a diagnostic which uses two variables from a single model:
https://github.com/ESMValGroup/ESMValTool/blob/MAGIC_BSC/esmvaltool/recipes/recipe_diurnal_temperature_index_wp7.yml

An unexpected problem prevented concatenation.
Expected only a single cube, found 2.

I guess this is just a minor error in our .yml file. Do you have an example of any other diagnostic which loads two (or more) variables from a single model for use as input for a single R (or Python) script?
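For context, a multi-variable diagnostic block in an ESMValTool recipe would look roughly like this. This is a sketch from memory, not copied from a working recipe; the diagnostic name and script path are made up:

```yaml
diagnostics:
  diurnal_index:           # hypothetical diagnostic name
    variables:
      tasmax:
        mip: day
      tasmin:
        mip: day
    scripts:
      main:
        script: r/my_diagnostic.R   # hypothetical script path
```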

@bjoernbroetz
Contributor

I was able to reproduce this error. In my case there was confusion with the input data:
IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/v20110915/tasmax/ contains
overlapping times for:

tasmax_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-19991231.nc
tasmax_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-20051231.nc

Removing one of these files seems to solve the problem for me...

@mattiarighi
Contributor

I thought we implemented a bug fix for this kind of problem (overlapping time spans in the same path).
@valeriupredoi do you remember?

@bouweandela
Member

No this is an open issue (actually, no issue has been opened for it yet): ESMValGroup/ESMValTool#538 (comment)

@bouweandela bouweandela changed the title Preprocessing crashing for multiple variables from a single model Preprocessing overlapping data not implemented yet Nov 12, 2018
@bouweandela
Member

I've changed the title of this issue to reflect the actual problem. To solve this issue, we need to make a decision on what data to select. Do we always select the longest data? Or is there some other way to automatically determine which file we need?

@bouweandela
Member

I actually found the issue related to this: ESMValGroup/ESMValTool#314
But if the user has no control over the data repository, just raising an exception is no solution; esmvaltool should be able to select the right file.

@bjoernbroetz
Contributor

Sorry, I was too optimistic. It is still failing with the corrected input

@valeriupredoi
Contributor

so @alasdairhunter's recipe is a bit of a different case than the one @bjoernbroetz replicated: in his recipe the datasets are inherently different because they use different experiments, so concatenation should not be attempted. I ran into the problem Bjoern's case spits out several times (I opened an issue about it many moons ago), and I recollect it was sorted out by a simple nicer error message to the user (along the lines of 'get the times in your files sorted out, dude!'). And yes, as @bouweandela says, we should let the tool figure out the file with the longest time range and not attempt concatenation

@valeriupredoi
Contributor

ah #314 it was indeed, thanks @bouweandela

@bjoernbroetz
Contributor

I was too pessimistic. After fixing the same overlapping data problem for tasmin this works fine.

@valeriupredoi

in his recipe the models are inherently different because they use different experiments and concatenation should not be attempted

There is a valid use case for mixing experiments, especially historical and scenarios. Right, @mattiarighi?

@valeriupredoi
Contributor

aha, I remembered about it only after I blurted out my comment, sorry. But in that case we should probably include an option like mixed-experiments: [datasets] in the recipe so we can perform custom concatenation?

@bjoernbroetz
Contributor

bjoernbroetz commented Nov 12, 2018

@alasdairhunter
When I need to fix a broken ESGF-tree (files with overlapping times) on a system where I don't have write access on the data, I create a mirror of the ESGF-tree with real directories and links to the original files. This takes only a few GB. Here is a tool for that:
https://gitlab.dkrz.de/b309070/linkbaum

And then I can delete single bad files... (links)
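The mirror-with-links trick can be sketched in plain shell with GNU cp, which is essentially what the tool above automates. The directory and file names here are made up for illustration:

```shell
# Throwaway source tree standing in for the read-only ESGF tree.
src=$(mktemp -d)
mirror=$(mktemp -d)
mkdir -p "$src/tasmax"
touch "$src/tasmax/tasmax_ok.nc" "$src/tasmax/tasmax_overlapping.nc"

# Mirror the tree: real directories, symlinks to the original files.
cp -as "$src/." "$mirror/"

# The bad file (just a link) can now be removed without write access
# to the original data.
rm "$mirror/tasmax/tasmax_overlapping.nc"
```

Pointing the esmvaltool rootpath at the mirror then hides the bad files while the originals stay untouched.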

@mattiarighi
Contributor

in his recipe the models are inherently different because they use different experiments and
concatenation should not be attempted

There is a valid use case for mixing experiments, especially historical and scenarios. Right, @mattiarighi?

This is a different issue, see #244

@alasdairhunter
Contributor Author

Hi,
Thanks for all your comments. And to clarify, during testing we have used several experiments in a single namelist before without a problem, so I think that is not the issue. I am trying to rerun it now based on the above feedback, but I still feel the issue could be related to loading two variables (tasmin and tasmax) from a single model. I will let you know.

@bettina-gier
Contributor

@bjoernbroetz cp -as should do the same thing as your code. On the DKRZ the resulting directory was about 17 GB.

I've run into a related problem at #693, where on the DKRZ duplicate data for EC-EARTH exists because all data is present in both .nc AND .nc4 files, with the .nc4 files being newer. This could be solved by giving priority to newer data on overlap, so specific formats would not need to be treated differently.

@bouweandela
Member

Is there any way to tell which file is newer from the netcdf global attributes? Because then we could automate this somewhat reliably in esmvaltool, e.g. in the data_finder or cube loading function.

Using the actual file creation date from the file system is not very reliable if someone just copied the files around on their local machine without keeping the original dates.

@mattiarighi
Contributor

There is a global attribute creation_date which could serve this purpose.

@bouweandela
Member

Unfortunately that is identical for the files referred to by @bettina-gier:

/badc/cmip5/data/cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r1i1p1/latest/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc
/badc/cmip5/data/cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r1i1p1/latest/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc4

The tracking_id attribute is also identical.

@bjoernbroetz
Contributor

The .nc and .nc4 files are different "dialects" of the NetCDF format (netCDF classic and netCDF-4). Check with ncdump -k or file. The data inside is supposed to be identical.
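The two dialects can also be told apart without ncdump, by looking at the file's magic bytes: netCDF classic files begin with "CDF", while netCDF-4 files are HDF5 containers beginning with the HDF5 signature. A quick sketch (netcdf_kind is a made-up helper name):

```python
def netcdf_kind(path):
    """Guess the netCDF dialect of a file from its first bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    if magic[:3] == b"CDF":
        return "classic"          # netCDF classic or 64-bit offset
    if magic == b"\x89HDF":
        return "netCDF-4/HDF5"    # first 4 bytes of the HDF5 signature
    return "unknown"
```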

@mattiarighi
Contributor

Then we could just ignore the .nc4 files.

@bouweandela
Member

While the data is probably the same for the files mentioned above, there are other differences. E.g. the lat_bnds variable is renamed to lat_bounds in the .nc4 version, the bnds dimension is renamed to nv, etc. I don't know if this is an issue?

Ignoring the .nc4 files would be really easy, just update the filename globs in config_developer.yml.
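As a sketch of what that glob change could look like, with the caveat that the key names below are quoted from memory and may not match the actual config_developer.yml schema:

```yaml
CMIP5:
  # Restrict the pattern to plain .nc so sibling .nc4 copies are never found:
  input_file: '[short_name]_[mip]_[dataset]_[exp]_[ensemble]_*.nc'
```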

@mattiarighi
Contributor

I don't think this is an issue: I've run many tests with version 2 since the very beginning, always using a drs where such .nc4 files are not present, and never encountered any problem.

Looking at the history of the EC-EARTH files mentioned above, it seems the .nc4 version is just the result of a cdo command:

cdo -f nc4 -z zip copy r1i1p1/v20131120/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc r1i1p1/v20131120/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc4

This confirms that the .nc version is the "original" file and we should use that.

@valeriupredoi
Contributor

ahaa, now I know more about .nc4 files - netCDF-4 output equivalent to netCDF (.nc) - cf-python does that too. So no need to worry about the .nc4's. Is this issue still active, or can we nuke it?

@bouweandela
Member

Encountered yet another example of this issue:

$ for f in /badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/*; do echo $f; ncdump -h $f | grep creation_date; done
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_18500101-18991231.nc
		:creation_date = "2011-01-07T23:42:44Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_18500101-19491231.nc
		:creation_date = "2011-01-17T22:32:08Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_19000101-19491231.nc
		:creation_date = "2011-01-07T23:41:43Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-19991231.nc
		:creation_date = "2011-01-07T23:45:27Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-20051231.nc
		:creation_date = "2011-01-17T22:37:08Z" ;

@valeriupredoi
Contributor

email from Ruth Petrie (one of the peeps in charge of CMIP6 data and replication):

Hi V,

Just following up on “over-lapping time-series data in CMIP5”. I just wanted to confirm with you that there will be some form of test for this within ESMValTool so it doesn’t fail completely but exits nicely. The reason I say that is that I have found at least one example where there is a timeseries overlap in the CMIP5 data but we have the latest available version according to ESGF and so this will never be fixed, even after I have performed my update scan of CMIP5. I expect that this will be the case for all sources of affected data, unless the ESGF record hasn’t been updated in which case there isn’t much I can do. I am in discussion with as to whether we should perform a test for this in CMIP6 data as currently this isn’t tested for in any of the preparations that I am aware of and there is nothing in ESGF that prohibits publication with this error. It would be good to have this tested for in CMIP6 at least as at an exclusion level if we don’t have permission from data providers to fix.

Anyway just confirming that you are aware that CMIP5 will likely have a number of datasets with this error and we won’t be able to fix them all.

Cheers,
Ruth

They will not do anything about CMIP5 data, and hopefully they will have this fixed in CMIP6. I say we close this.

@mattiarighi
Contributor

@schlunma had this problem today with the data at DKRZ. Our solution so far was to create a local replica (with links) of all data, to manually get rid of overlapping files. Not very nice, but it works.

It depends on how much effort it is, but it would be nice to have a better solution for this problem.

@bouweandela
Member

It shouldn't be too much work to fix: if data_finder.py finds overlapping files, it should ignore the ones with an older creation_date attribute.
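A minimal sketch of that rule, assuming the creation_date strings have already been read from the candidate files (reading the attribute itself, e.g. with the netCDF4 library, is not shown). pick_newest is a hypothetical helper, not an existing esmvaltool function:

```python
def pick_newest(candidates):
    """Given (filename, creation_date) pairs for overlapping files,
    return the filename with the newest creation date.

    ISO 8601 timestamps like "2011-01-17T22:37:08Z" sort correctly
    as plain strings, so no date parsing is needed.
    """
    return max(candidates, key=lambda pair: pair[1])[0]
```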

@mattiarighi
Contributor

I'm not sure that's the only selection criterion.
For example, one may also want to select the file with the longer time coverage.

@schlunma can you post here the problem you had yesterday?

@schlunma
Contributor

Will do tomorrow when Mistral is back online 👍

@schlunma
Contributor

On mistral, there are sometimes files with overlapping time ranges. An example are the historical Lmon variables of IPSL-CM5B-LR:

...
├── c3PftFrac
│   ├── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc
├── c4PftFrac
│   ├── c4PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── c4PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc
├── cLeaf
│   ├── cLeaf_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── cLeaf_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc
... 

The tool cannot process those files because it cannot concatenate cubes with overlapping time ranges. This case can be handled relatively easily by ignoring the file with the shorter time range.

I once had a case (but I don't remember the variable...) where this is not trivial: there were two files which overlapped by just one month, e.g.

..._185001-200012.nc
..._200011-200512.nc

What we definitely need in both cases is a function which checks for overlapping time ranges by just looking at the filenames, without loading the cubes.
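Such a check could look roughly like this, assuming CMIP5-style filenames ending in _YYYYMM[DD]-YYYYMM[DD].nc; the function names are made up for illustration:

```python
import re

# Trailing time range: 6-digit (monthly) or 8-digit (daily) start and end.
_TIME_RANGE = re.compile(r"_(\d{6,8})-(\d{6,8})\.nc4?$")

def time_range(filename):
    """Extract (start, end) from the filename, or None if absent.

    Monthly ranges (YYYYMM) are padded to day resolution so that
    monthly and daily files compare consistently.
    """
    match = _TIME_RANGE.search(filename)
    if match is None:
        return None
    start, end = match.groups()
    return start.ljust(8, "0"), end.ljust(8, "9")

def overlaps(file_a, file_b):
    """True if the two filenames declare overlapping time ranges."""
    range_a, range_b = time_range(file_a), time_range(file_b)
    if range_a is None or range_b is None:
        return False
    return range_a[0] <= range_b[1] and range_b[0] <= range_a[1]
```

This also catches the one-month-overlap case above, since the comparison works on the padded date strings rather than on whole files.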

@bouweandela
Member

I'm not sure we will be able to handle all possible errors in the file system organization automatically. Maybe we could by default always use the newest (i.e. latest creation date) files, but also give the user the option to blacklist certain files in config-user.yml?

For the files mentioned above, it looks like the shorter version was created just 5 seconds after the longer version.

By the way, this problem appears to be particularly bad for the IPSL models, I have not seen it in any other model so far.
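The blacklist idea could look something like this in config-user.yml. This is entirely hypothetical; the ignore_files key is invented and no such option exists at the time of writing:

```yaml
# Hypothetical: glob patterns of files the data finder should skip.
ignore_files:
  - '*_IPSL-CM5B-LR_historical_*_197901-200512.nc'
```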

@earnone
Contributor

earnone commented Mar 20, 2019

Running into the same problem, both with historical data going beyond 200512 and with one-month overlaps and the like. For CMIP5 we know how each run was performed; can't we preprocess and re-archive these data once and for all? I.e. historical ending at 200512 and rcp starting at 200601.

BTW, if two datasets of the exact same experiment are present covering two time spans, why would one want to use the longer time series if the shorter one is long enough and can be processed faster? See the above example, where a subset was likely extracted to save processing time.

├── c3PftFrac
│   ├── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc

@npgillett
Contributor

Hi all, I'm getting this same problem with running ESMValTool on CMIP6 data on mistral. Some CMIP6 data is on the main archive, but some is in a separate buffer directory /work/bd0854/CMIP6_buffer . I want to run esmvaltool using data from both sources, because some data is in one place and some is in the other. Including both directories in the CMIP6 data path works, except that in some cases there are copies of the same file in both directories and I get this error:

2019-04-12 00:08:33,506 UTC [38035] ERROR   Can not concatenate cubes: failed to concatenate into a single cube.
  An unexpected problem prevented concatenation.
  Expected only a single cube, found 2.
2019-04-12 00:08:33,507 UTC [38035] ERROR   Cubes:
2019-04-12 00:08:33,508 UTC [38035] ERROR   air_temperature / (K)               (time: 1980; latitude: 128; longitude: 256)

I don't have write permission for the buffer directory, so I can't delete duplicates. Other than by making a copy of one of the directories with symlinks and deleting the duplicates, is there a way to make esmvaltool not fail in this case?
Thanks, Nathan

@bouweandela
Member

Not at the moment, unfortunately. This is still an open issue.

@mattiarighi
Contributor

#280 seems to have solved this issue as well.

I tried reading the following dataset (see also @bjoernbroetz's comment above):

  - {dataset: IPSL-CM5A-LR, project: CMIP5, exp: historical, mip: day, ensemble: r1i1p1, start_year: 1998, end_year: 2003}

which I would expect to give problems, since the input directory contains overlapping files:

/mnt/lustre02/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/v20110915/tasmin/tasmin_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-19991231.nc
/mnt/lustre02/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/v20110915/tasmin/tasmin_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-20051231.nc

but it went through without errors.

However, from the log I cannot understand which of the two was actually used.

@valeriupredoi ?

@valeriupredoi
Contributor

did you read the debug log? It's all there @mattiarighi (like in the Da Vinci Code 😁 )

@valeriupredoi
Contributor

yay! 🎉
