Preprocessing overlapping data not implemented yet #54
I was able to reproduce this error. In my case there was confusion with the input data:
Removing one of these files seems to solve the problem for me...
I thought we implemented a bug fix for this kind of problem (overlapping time spans in the same path).
No, this is an open issue (actually, no issue has been opened for it yet): ESMValGroup/ESMValTool#538 (comment)
I've changed the title of this issue to reflect the actual problem. To solve it, we need to decide what data to select. Do we always select the longest dataset? Or is there some other way to automatically determine which file we need?
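For illustration, a minimal sketch of the "always select the longest data" policy, assuming the period can be parsed from the standard CMIP filename suffix (`_YYYYMM-YYYYMM`); the helpers here are hypothetical, not existing ESMValTool code:

```python
import re
from pathlib import Path

# CMIP-style filenames end in _<start>-<end>.nc, e.g. ..._200901-200911.nc
_PERIOD = re.compile(r"_(\d{6})-(\d{6})\.nc4?$")

def _months(yyyymm):
    """Turn a YYYYMM integer into a running month count for arithmetic."""
    return (yyyymm // 100) * 12 + yyyymm % 100

def span_in_months(path):
    """Length of the period encoded in the filename, in months."""
    match = _PERIOD.search(Path(path).name)
    if match is None:
        raise ValueError(f"No time range in filename: {path}")
    return _months(int(match.group(2))) - _months(int(match.group(1)))

def select_longest(paths):
    """Of several candidate files, keep the one covering the longest period."""
    return max(paths, key=span_in_months)

files = [
    "ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc",
    "ta_Amon_EC-EARTH_historical_r1i1p1_195001-200512.nc",
]
print(select_longest(files))  # the 1950-2005 file wins
```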
I actually found the issue related to this: ESMValGroup/ESMValTool#314
Sorry, I was too optimistic. It is still failing with the corrected input.
So @alasdairhunter's recipe is a bit of a different case than the one that @bjoernbroetz replicated: in his recipe the models are inherently different because they use different experiments, and concatenation should not be attempted. I ran into the problem that Bjoern's case spits out several times myself (I opened an issue about it many moons ago), and I recollect it was sorted out by a simple, nicer error message to the user (along the lines of 'get the times of your files sorted out, dude!'). And yes, as @bouweandela says, we should let the tool figure out the file with the longest time and not attempt concatenation.
Ah, #314 it was indeed, thanks @bouweandela
I was too pessimistic. After fixing the same overlapping data problem for
There is a valid use case for mixing experiments, especially
Aha, I remembered about it only after I blurted out my comment, sorry, but in that case we should probably include an option like
@alasdairhunter And then I can delete single bad files... (links)
This is a different issue, see #244.
Hi,
@bjoernbroetz I've run into a related problem at #693, where on DKRZ duplicate data for EC-EARTH exists because all data is present in both .nc and .nc4 files, with the .nc4 files being newer. This could be solved by giving priority to the newer data on overlap, so that specific formats would not need to be treated differently.
Is there any way to tell which file is newer from the netCDF global attributes? Then we could automate this somewhat reliably in esmvaltool, e.g. in the data_finder or the cube loading function. Using the actual creation date from the file system is not very reliable if someone just copied the files around on their local machine without keeping the original dates.
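To see what is available, one can dump the global attributes of the candidate files; a quick sketch using the netCDF4 package (the path is just a placeholder):

```python
import netCDF4

def dump_global_attrs(path):
    """Print all global attributes so date-like candidates can be spotted."""
    with netCDF4.Dataset(path) as ds:
        for name in ds.ncattrs():
            print(f"{name} = {ds.getncattr(name)}")

dump_global_attrs("ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc")
```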
There is a global attribute
Unfortunately that is identical for the files referred to by @bettina-gier:

/badc/cmip5/data/cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r1i1p1/latest/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc
/badc/cmip5/data/cmip5/output1/ICHEC/EC-EARTH/historical/mon/atmos/Amon/r1i1p1/latest/ta/ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc4

The tracking_id attribute is also identical.
The
Then we could just ignore the .nc4 files?
While the data is probably the same for the files mentioned above, there are other differences, e.g. the
Ignoring the .nc4 files would be really easy: just update the filename globs in config_developer.yml.
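A tiny illustration of why tightening the glob works, assuming the current pattern is permissive enough to match both extensions (e.g. something like `*.nc*`):

```python
from fnmatch import fnmatch

files = [
    "ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc",
    "ta_Amon_EC-EARTH_historical_r1i1p1_200901-200911.nc4",
]

# A permissive pattern picks up both copies of the data ...
print([f for f in files if fnmatch(f, "*.nc*")])  # both files match
# ... while anchoring the suffix silently drops the .nc4 duplicates.
print([f for f in files if fnmatch(f, "*.nc")])   # only the .nc file
```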
I don't think this is an issue: I've run many tests with version 2 since the very beginning, always using a drs where such
Looking at the history of the
This confirms that the
Ahaa, now I know more about
Encountered yet another example of this issue:
Email from Ruth Petrie (one of the peeps in charge of CMIP6 data and replication):
They will not do anything about CMIP5 data, and hopefully they will have this fixed in CMIP6. I say we close this.
@schlunma had this problem today with the data at DKRZ. Our solution so far was to create a local replica (with links) of all the data and manually get rid of the overlapping files. Not very nice, but it works. It depends on how much effort it is, but it would be nice to have a better solution for this problem.
It shouldn't be too much work to fix: if data_finder.py finds overlapping files, it should ignore the ones with an older 'creation_date' attribute.
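A rough sketch of that idea; it assumes the files carry a creation_date global attribute holding an ISO-like date string (which sorts correctly as text), and it is not existing data_finder code:

```python
import netCDF4

def creation_date(path):
    """Read the creation_date global attribute (assumed ISO-formatted)."""
    with netCDF4.Dataset(path) as ds:
        return str(getattr(ds, "creation_date", ""))

def keep_newest(overlapping_files):
    """Of a group of files known to overlap in time, keep only the newest.

    ISO-like date strings sort lexicographically, so max() picks the file
    with the most recent creation_date.
    """
    return max(overlapping_files, key=creation_date)
```

Real code would also need a fallback for files where the attribute is missing or, as noted above, identical between the duplicates.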
I'm not sure that's the only selection criterion. @schlunma, can you post here the problem you had yesterday?
Will do tomorrow when Mistral is back online 👍
On mistral, there are sometimes files with overlapping time ranges. An example is the historical
The tool cannot process those files because it cannot concatenate cubes with overlapping time ranges. This case can be handled relatively easily by ignoring the file with the shorter time range. I once had a case (but I don't remember the variable...) where this is not trivial: there were two files which overlapped by just one month, e.g.
What we definitely need in both cases is a function which checks for overlapping time ranges by just looking at the filenames, without loading the cubes.
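A possible shape for such a check, again parsing the `_YYYYMM-YYYYMM` suffix (an assumption; other filename conventions would need their own patterns):

```python
import re
from pathlib import Path

_PERIOD = re.compile(r"_(\d{6})-(\d{6})\.nc4?$")

def filename_period(path):
    """Extract (start, end) as YYYYMM integers from a CMIP-style filename."""
    match = _PERIOD.search(Path(path).name)
    if match is None:
        raise ValueError(f"No time range found in {path}")
    return int(match.group(1)), int(match.group(2))

def periods_overlap(path_a, path_b):
    """True if the time ranges encoded in two filenames overlap."""
    a_start, a_end = filename_period(path_a)
    b_start, b_end = filename_period(path_b)
    return a_start <= b_end and b_start <= a_end

# The one-month overlap mentioned above is caught without loading any cube:
print(periods_overlap("ta_x_185001-190012.nc", "ta_x_190012-200512.nc"))  # True
```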
I'm not sure we will be able to handle all possible errors in the file system organization automatically. Maybe we could by default always use the newest files (i.e. those with the latest creation date), but also give the user the option to blacklist certain files in config-user.yml? For the files mentioned above, it looks like the shorter version was created just 5 seconds after the longer one.

By the way, this problem appears to be particularly bad for the IPSL models; I have not seen it in any other model so far.
Running into the same problem, both with historical runs extending beyond 200512 and with files overlapping by just one month, and the like. For CMIP5 we know how each run was performed, so can't we preprocess and re-archive these data once and for all? I.e. historical ending at 200512 and rcp starting at 200601. BTW, if two datasets of the exact same experiment are present, covering two time spans, why would one want to use the longer time series if the shorter one is long enough and can be processed faster? See the above example, where a subset was likely extracted to save processing time.
Hi all, I'm getting this same problem when running ESMValTool on CMIP6 data on mistral. Some CMIP6 data is in the main archive, but some is in a separate buffer directory
I don't have write permission for the buffer directory, so I can't delete duplicates. Other than making a copy of one of the directories with symlinks and deleting the duplicates, is there a way to make esmvaltool not fail in this case?
Not at the moment, unfortunately. This is still an open issue.
#280 seems to have solved this issue as well. I tried reading the following dataset (see also @bjoernbroetz's comment above):
which I would expect to give problems, since the input directory contains overlapping files:
but it went through without errors. However, from the log I cannot understand which of the two was actually used.
Did you read the debug log? It's all there, @mattiarighi (like in the Da Vinci Code 😁)
yay! 🎉
Hi,
I am having an issue with the preprocessing when trying to run a diagnostic which uses two variables from a single model:
https://github.com/ESMValGroup/ESMValTool/blob/MAGIC_BSC/esmvaltool/recipes/recipe_diurnal_temperature_index_wp7.yml
```
An unexpected problem prevented concatenation.
Expected only a single cube, found 2.
```
I guess this is just a minor error in our .yml file. Do you have an example of any other diagnostic which loads two (or more) variables from a single model for use as input for a single R script (or Python script)?