
Preprocessing overlapping data not implemented yet #54

Closed
alasdairhunter opened this issue Nov 12, 2018 · 36 comments
Assignees
Labels
preprocessor Related to the preprocessor

Comments

@alasdairhunter
Contributor

Hi,
I am having an issue with the preprocessing when trying to run a diagnostic which uses two variables from a single model:
https://github.com/ESMValGroup/ESMValTool/blob/MAGIC_BSC/esmvaltool/recipes/recipe_diurnal_temperature_index_wp7.yml

An unexpected problem prevented concatenation.
Expected only a single cube, found 2.

I guess this is just a minor error in our .yml file. Do you have an example of any other diagnostic which loads two (or more) variables from a single model for use as input for a single R (or Python) script?
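For context, a multi-variable diagnostic block in an ESMValTool recipe would look roughly like this. This is a sketch from memory, not copied from a working recipe; the diagnostic name and script path are made up:

```yaml
diagnostics:
  diurnal_index:           # hypothetical diagnostic name
    variables:
      tasmax:
        mip: day
      tasmin:
        mip: day
    scripts:
      main:
        script: r/my_diagnostic.R   # hypothetical script path
```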

@bjoernbroetz
Contributor

I was able to reproduce this error. In my case there was confusion with the input data:
IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/v20110915/tasmax/ contains
overlapping times for:

tasmax_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-19991231.nc
tasmax_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-20051231.nc

Removing one of these files seems to solve the problem for me...

@mattiarighi
Contributor

I thought we implemented a bug fix for this kind of problem (overlapping time spans in the same path).
@valeriupredoi do you remember?

@bouweandela
Member

No this is an open issue (actually, no issue has been opened for it yet): ESMValGroup/ESMValTool#538 (comment)

@bouweandela bouweandela changed the title Preprocessing crashing for multiple variables from a single model Preprocessing overlapping data not implemented yet Nov 12, 2018
@bouweandela
Member

I've changed the title of this issue to reflect the actual problem. To solve this issue, we need to make a decision on what data to select. Do we always select the longest data? Or is there some other way to automatically determine which file we need?

@bouweandela
Member

I actually found the issue related to this: ESMValGroup/ESMValTool#314
But if the user has no control over the data repository, just raising an exception is no solution; esmvaltool should be able to select the right file.

@bjoernbroetz
Contributor

Sorry, I was too optimistic. It is still failing with the corrected input

@valeriupredoi
Contributor

so @alasdairhunter's recipe is a bit of a different case than the one @bjoernbroetz replicated: in his recipe the datasets are inherently different because they use different experiments, so concatenation should not be attempted. I ran into the problem Bjoern's case spits out several times (I opened an issue about it many moons ago), and I recollect it was sorted out by a simple nicer error message to the user (along the lines of 'get the times in your files sorted out, dude!'). And yes, as @bouweandela says, we should let the tool figure out the file with the longest time range and not attempt concatenation

@valeriupredoi
Contributor

ah #314 it was indeed, thanks @bouweandela

@bjoernbroetz
Contributor

I was too pessimistic. After fixing the same overlapping data problem for tasmin this works fine.

@valeriupredoi

in his recipe the models are inherently different because they use different experiments and concatenation should not be attempted

There is a valid use case for mixing experiments, especially historical and scenarios. Right, @mattiarighi?

@valeriupredoi
Contributor

aha, I remembered about it only after I blurted out my comment, sorry. But in that case we should probably include an option like mixed-experiments: [datasets] in the recipe so we can perform custom concatenation?

@bjoernbroetz
Contributor

bjoernbroetz commented Nov 12, 2018

@alasdairhunter
When I need to fix a broken ESGF-tree (files with overlapping times) on a system where I don't have write access on the data, I create a mirror of the ESGF-tree with real directories and links to the original files. This takes only a few GB. Here is a tool for that:
https://gitlab.dkrz.de/b309070/linkbaum

And then I can delete single bad files... (links)
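The mirror-with-links trick can be sketched in plain shell with GNU cp, which is essentially what the tool above automates. The directory and file names here are made up for illustration:

```shell
# Throwaway source tree standing in for the read-only ESGF tree.
src=$(mktemp -d)
mirror=$(mktemp -d)
mkdir -p "$src/tasmax"
touch "$src/tasmax/tasmax_ok.nc" "$src/tasmax/tasmax_overlapping.nc"

# Mirror the tree: real directories, symlinks to the original files.
cp -as "$src/." "$mirror/"

# The bad file (just a link) can now be removed without write access
# to the original data.
rm "$mirror/tasmax/tasmax_overlapping.nc"
```

Pointing the esmvaltool rootpath at the mirror then hides the bad files while the originals stay untouched.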

@mattiarighi
Contributor

in his recipe the models are inherently different because they use different experiments and
concatenation should not be attempted

There is a valid use case for mixing experiments, especially historical and scenarios. Right, @mattiarighi?

This is a different issue, see #244

@alasdairhunter
Contributor Author

Hi,
Thanks for all your comments. And to clarify, during testing we have used several experiments in a single namelist before without a problem, so I think that is not the issue. I am trying to rerun it now based on the above feedback, but I still feel the issue could be related to loading two variables (tasmin and tasmax) from a single model. I will let you know.

@bettina-gier
Contributor

@bjoernbroetz cp -as should do the same thing as your code. On the DKRZ the resulting directory was about 17 GB.

I've run into a related problem at #693, where on the DKRZ duplicate data for EC-EARTH exists because all data is present in both .nc AND .nc4 files, with the .nc4 files being newer. This could be solved by giving priority to newer data on overlap, so specific formats would not need to be treated differently.

@bouweandela
Member

Is there any way to tell which file is newer from the netcdf global attributes? Because then we could automate this somewhat reliably in esmvaltool, e.g. in the data_finder or cube loading function.

Using the actual file creation date from the file system is not very reliable if someone just copied the files around on their local machine without keeping the original dates.

@mattiarighi
Contributor

There is a global attribute creation_date which could serve this purpose.

@bouweandela
Member

Unfortunately that is identical for the files referred to by @bettina-gier:

/badc/cmip5/data/cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r1i1p1/latest/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc
/badc/cmip5/data/cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r1i1p1/latest/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc4

The tracking_id attribute is also identical.

@bjoernbroetz
Contributor

The .nc and .nc4 files are different "dialects" of the NetCDF format (netCDF classic and netCDF-4). Check with ncdump -k or file. The data inside is supposed to be identical.
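The two dialects can also be told apart without ncdump, by looking at the file's magic bytes: netCDF classic files begin with "CDF", while netCDF-4 files are HDF5 containers beginning with the HDF5 signature. A quick sketch (netcdf_kind is a made-up helper name):

```python
def netcdf_kind(path):
    """Guess the netCDF dialect of a file from its first bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    if magic[:3] == b"CDF":
        return "classic"          # netCDF classic or 64-bit offset
    if magic == b"\x89HDF":
        return "netCDF-4/HDF5"    # first 4 bytes of the HDF5 signature
    return "unknown"
```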

@mattiarighi
Contributor

Then we could just ignore the .nc4 files.

@bouweandela
Member

While the data is probably the same for the files mentioned above, there are other differences. E.g. the lat_bnds variable is renamed to lat_bounds in the .nc4 version, the bnds dimension is renamed to nv, etc. I don't know if this is an issue?

Ignoring the .nc4 files would be really easy, just update the filename globs in config_developer.yml.
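As a sketch of what that glob change could look like, with the caveat that the key names below are quoted from memory and may not match the actual config_developer.yml schema:

```yaml
CMIP5:
  # Restrict the pattern to plain .nc so sibling .nc4 copies are never found:
  input_file: '[short_name]_[mip]_[dataset]_[exp]_[ensemble]_*.nc'
```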

@mattiarighi
Contributor

I don't think this is an issue: I've run many tests with version 2 since the very beginning, always using a drs where such .nc4 files are not present, and never encountered any problem.

Looking at the history of the EC-EARTH files mentioned above, it seems the .nc4 version is just the result of a cdo command:

cdo -f nc4 -z zip copy r1i1p1/v20131120/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc r1i1p1/v20131120/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc4

This confirms that the .nc version is the "original" file and we should use that.

@valeriupredoi
Contributor

ahaa, now I know more about .nc4 files - netCDF-4 output equivalent to netCDF (.nc) - cf-python does that too. So no need to worry about the .nc4's. Is this issue still active, or can we nuke it?

@bouweandela
Member

Encountered yet another example of this issue:

$ for f in /badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/*; do echo $f; ncdump -h $f | grep creation_date; done
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_18500101-18991231.nc
		:creation_date = "2011-01-07T23:42:44Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_18500101-19491231.nc
		:creation_date = "2011-01-17T22:32:08Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_19000101-19491231.nc
		:creation_date = "2011-01-07T23:41:43Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-19991231.nc
		:creation_date = "2011-01-07T23:45:27Z" ;
/badc/cmip5/data/cmip5/output1/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/latest/pr/pr_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-20051231.nc
		:creation_date = "2011-01-17T22:37:08Z" ;

@valeriupredoi
Contributor

email from Ruth Petrie (one of the peeps in charge of CMIP6 data and replication):

Hi V,

Just following up on “over-lapping time-series data in CMIP5”. I just wanted to confirm with you that there will be some form of test for this within ESMValTool so it doesn’t fail completely but exits nicely. The reason I say that is that I have found at least one example where there is a timeseries overlap in the CMIP5 data but we have the latest available version according to ESGF and so this will never be fixed, even after I have performed my update scan of CMIP5. I expect that this will be the case for all sources of affected data, unless the ESGF record hasn’t been updated in which case there isn’t much I can do. I am in discussion with as to whether we should perform a test for this in CMIP6 data as currently this isn’t tested for in any of the preparations that I am aware of and there is nothing in ESGF that prohibits publication with this error. It would be good to have this tested for in CMIP6 at least as at an exclusion level if we don’t have permission from data providers to fix.

Anyway just confirming that you are aware that CMIP5 will likely have a number of datasets with this error and we won’t be able to fix them all.

Cheers,
Ruth

They will not do anything about CMIP5 data, and hopefully they will have this fixed in CMIP6. I say we close this.

@mattiarighi
Contributor

@schlunma had this problem today with the data at DKRZ. Our solution so far was to create a local replica (with links) of all data, to manually get rid of overlapping files. Not very nice, but it works.

It depends on how much effort it is, but it would be nice to have a better solution for this problem.

@bouweandela
Member

It shouldn't be too much work to fix: if data_finder.py finds overlapping files, it should ignore the ones with an older creation_date attribute.
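A minimal sketch of that rule, assuming the creation_date strings have already been read from the candidate files (reading the attribute itself, e.g. with the netCDF4 library, is not shown). pick_newest is a hypothetical helper, not an existing esmvaltool function:

```python
def pick_newest(candidates):
    """Given (filename, creation_date) pairs for overlapping files,
    return the filename with the newest creation date.

    ISO 8601 timestamps like "2011-01-17T22:37:08Z" sort correctly
    as plain strings, so no date parsing is needed.
    """
    return max(candidates, key=lambda pair: pair[1])[0]
```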

@mattiarighi
Contributor

I'm not sure that's the only selection criterion.
For example, one may also want to select the file with the longer time coverage.

@schlunma can you post here the problem you had yesterday?

@schlunma
Contributor

Will do tomorrow when Mistral is back online 👍

@schlunma
Contributor

On mistral, there are sometimes files with overlapping time ranges. An example are the historical Lmon variables of IPSL-CM5B-LR:

...
├── c3PftFrac
│   ├── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc
├── c4PftFrac
│   ├── c4PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── c4PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc
├── cLeaf
│   ├── cLeaf_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── cLeaf_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc
... 

The tool cannot process those files because it cannot concatenate cubes with overlapping time ranges. This case can be handled relatively easily by ignoring the file with the shorter time range.

I once had a case (but I don't remember the variable...) where this is not trivial: there were two files which overlapped by just one month, e.g.

..._185001-200012.nc
..._200011-200512.nc

What we definitely need in both cases is a function which checks for overlapping time ranges by just looking at the filenames, without loading the cubes.
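Such a check could look roughly like this, assuming CMIP5-style filenames ending in _YYYYMM[DD]-YYYYMM[DD].nc; the function names are made up for illustration:

```python
import re

# Trailing time range: 6-digit (monthly) or 8-digit (daily) start and end.
_TIME_RANGE = re.compile(r"_(\d{6,8})-(\d{6,8})\.nc4?$")

def time_range(filename):
    """Extract (start, end) from the filename, or None if absent.

    Monthly ranges (YYYYMM) are padded to day resolution so that
    monthly and daily files compare consistently.
    """
    match = _TIME_RANGE.search(filename)
    if match is None:
        return None
    start, end = match.groups()
    return start.ljust(8, "0"), end.ljust(8, "9")

def overlaps(file_a, file_b):
    """True if the two filenames declare overlapping time ranges."""
    range_a, range_b = time_range(file_a), time_range(file_b)
    if range_a is None or range_b is None:
        return False
    return range_a[0] <= range_b[1] and range_b[0] <= range_a[1]
```

This also catches the one-month-overlap case above, since the comparison works on the padded date strings rather than on whole files.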

@bouweandela
Member

I'm not sure we will be able to handle all possible errors in the file system organization automatically. Maybe we could by default always use the newest (i.e. latest creation date) files, but also give the user the option to blacklist certain files in config-user.yml?

For the files mentioned above, it looks like the shorter version was created just 5 seconds after the longer version.

By the way, this problem appears to be particularly bad for the IPSL models, I have not seen it in any other model so far.
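The blacklist idea could look something like this in config-user.yml. This is entirely hypothetical; the ignore_files key is invented and no such option exists at the time of writing:

```yaml
# Hypothetical: glob patterns of files the data finder should skip.
ignore_files:
  - '*_IPSL-CM5B-LR_historical_*_197901-200512.nc'
```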

@earnone
Contributor

earnone commented Mar 20, 2019

Running into the same problem, both with historical data going beyond 200512 and with one-month overlaps and the like. For CMIP5 we know how each run was performed; can't we preprocess and re-archive these data once and for all? I.e. historical ending at 200512 and rcp starting at 200601.

BTW, if two datasets of the exact same experiment are present covering two time spans, why would one want to use the longer time series if the shorter one is long enough and can be processed faster? See the above example, where a subset was likely extracted to save processing time.

├── c3PftFrac
│   ├── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_185001-200512.nc
│   └── c3PftFrac_Lmon_IPSL-CM5B-LR_historical_r1i1p1_197901-200512.nc

@npgillett
Contributor

Hi all, I'm getting this same problem with running ESMValTool on CMIP6 data on mistral. Some CMIP6 data is on the main archive, but some is in a separate buffer directory /work/bd0854/CMIP6_buffer . I want to run esmvaltool using data from both sources, because some data is in one place and some is in the other. Including both directories in the CMIP6 data path works, except that in some cases there are copies of the same file in both directories and I get this error:

2019-04-12 00:08:33,506 UTC [38035] ERROR   Can not concatenate cubes: failed to concatenate into a single cube.
  An unexpected problem prevented concatenation.
  Expected only a single cube, found 2.
2019-04-12 00:08:33,507 UTC [38035] ERROR   Cubes:
2019-04-12 00:08:33,508 UTC [38035] ERROR   air_temperature / (K)               (time: 1980; latitude: 128; longitude: 256)

I don't have write permission for the buffer directory, so I can't delete duplicates. Other than by making a copy of one of the directories with symlinks and deleting the duplicates, is there a way to make esmvaltool not fail in this case?
Thanks, Nathan

@bouweandela
Member

Not at the moment, unfortunately. This is still an open issue.

@mattiarighi
Contributor

#280 seems to have solved this issue as well.

I tried reading the following dataset (see also @bjoernbroetz's comment above):

  - {dataset: IPSL-CM5A-LR, project: CMIP5, exp: historical, mip: day, ensemble: r1i1p1, start_year: 1998, end_year: 2003}

which I would expect to give problems, since the input directory contains overlapping files:

/mnt/lustre02/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/v20110915/tasmin/tasmin_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-19991231.nc
/mnt/lustre02/work/bd0854/DATA/ESMValTool2/CMIP5_DKRZ/IPSL/IPSL-CM5A-LR/historical/day/atmos/day/r1i1p1/v20110915/tasmin/tasmin_day_IPSL-CM5A-LR_historical_r1i1p1_19500101-20051231.nc

but it went through without errors.

However, from the log I cannot understand which of the two was actually used.

@valeriupredoi ?

@valeriupredoi
Contributor

did you read the debug log? It's all there @mattiarighi (like in the Da Vinci Code 😁 )

@valeriupredoi
Contributor

yay! 🎉
