Describe test ...

In [1]:
import intake

In [2]:
cat = intake.cat.nci
cmip6 = cat.esgf.cmip6

In [3]:
# I changed the original constraints a bit to get fewer results
subset = cmip6.search(require_all_on=['source_id','member_id'], 
                      experiment_id=['hist-GHG','hist-aer'], 
                      variable_id=['thetao','tasmin'])

In [4]:
runs = subset.df.groupby(['source_id','member_id', 'experiment_id'])
len(runs)

98

In the filter we first sort the date_range values so we can easily pick the first and last date_range. Then we get the starting year (first 4 digits of the first date_range) and the last year (split the last date_range on "-" and take the first 4 digits of the second date).<br>
The filter returns False if the run does not cover the years we set as start and end: 1850/2020.

In [5]:
def check_range(group, ystart=1850, yend=2020):
    """Check that the group's date ranges cover the start/end years.

    We could just check that the start and end years are the same, but we
    also want to include runs that start earlier or end later.
    """
    # sort so the first/last entries hold the earliest/latest date_range
    drange = group.date_range.sort_values()
    dstart = int(drange.iloc[0][:4])
    dend = int(drange.iloc[-1].split("-")[1][:4])
    check = dstart <= ystart and dend >= yend
    # print the years of the groups that fail, to see what gets excluded
    if not check:
        print(dstart, dend)
    return check
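
To see the string handling in isolation, here is a minimal sketch on made-up values. It assumes date_range strings look like "YYYYMM-YYYYMM"; the actual catalogue values are not shown in this notebook, and `year_span` is a hypothetical helper name.

```python
def year_span(date_ranges):
    """Return (first_year, last_year) covered by a list of date_range strings."""
    # zero-padded, fixed-width strings sort chronologically as plain text
    ordered = sorted(date_ranges)
    first_year = int(ordered[0][:4])                 # start of the earliest range
    last_year = int(ordered[-1].split("-")[1][:4])   # end of the latest range
    return first_year, last_year

# hypothetical values, not taken from the catalogue
ranges = ["195001-201412", "185001-194912", "201501-202012"]
span = year_span(ranges)
print(span)
```

Note that this only looks at the first and last file, so a gap in the middle of a run would go unnoticed.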

In [6]:
dfnew = subset.df.groupby(['source_id','member_id', 'experiment_id']).filter(lambda x: check_range(x) )

1850 2015
1850 2015
1850 2014
1850 2015


In [7]:
dfnew.reset_index(drop=True, inplace=True)
dfnew.groupby(['source_id','member_id', 'experiment_id']).all()

source_id,member_id,experiment_id,project,activity_id,institution_id,table_id,variable_id,grid_label,date_range,path,version
ACCESS-CM2,r1i1p1f1,hist-GHG,True,True,True,True,True,True,True,True,True
ACCESS-CM2,r1i1p1f1,hist-aer,True,True,True,True,True,True,True,True,True
ACCESS-CM2,r2i1p1f1,hist-GHG,True,True,True,True,True,True,True,True,True
ACCESS-CM2,r2i1p1f1,hist-aer,True,True,True,True,True,True,True,True,True
ACCESS-CM2,r3i1p1f1,hist-GHG,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...
NorESM2-LM,r1i1p1f1,hist-aer,True,True,True,True,True,True,True,True,True
NorESM2-LM,r2i1p1f1,hist-GHG,True,True,True,True,True,True,True,True,True
NorESM2-LM,r2i1p1f1,hist-aer,True,True,True,True,True,True,True,True,True
NorESM2-LM,r3i1p1f1,hist-GHG,True,True,True,True,True,True,True,True,True


Let's see how this would work with your smaller example. This model has 3 members, and one of them starts in 1930 instead of 1850. We introduced the print statement to test that the filter detects this correctly.

In [35]:
subset = cmip6.search(variable_id=['thetao'], table_id=['Omon'], grid_label=['gn'], 
                      source_id=['BCC-CSM2-MR'], experiment_id=['hist-aer'])

In [36]:
dfnew = subset.df.groupby('member_id').filter(lambda x: check_range(x) )

1930 2020


The filter seems to work: grouping again by member_id, we can see that only the 2 complete members are left.

In [37]:
dfnew.reset_index(drop=True, inplace=True) # reset index
dfnew.groupby('member_id').all()

member_id,project,activity_id,institution_id,source_id,experiment_id,table_id,variable_id,grid_label,date_range,path,version
r1i1p1f1,True,True,True,True,True,True,True,True,True,True,True
r2i1p1f1,True,True,True,True,True,True,True,True,True,True,True


The problem is that this is now a pandas DataFrame, so we have lost the link to the intake catalogue. It is not yet possible to run such a query in intake. However, once we have identified which members to remove using the filter, we can build their intake keys and remove them from the dataset dictionary.<br>
To do this more easily we will modify the function to return the dataset key instead of True or False.

In [38]:
cmip6.aggregation_info.groupby_attrs

['project',
 'activity_id',
 'institution_id',
 'source_id',
 'experiment_id',
 'member_id',
 'table_id',
 'variable_id',
 'grid_label',
 'version']

Based on the aggregation_info we can build a dataset key using the groupby columns:
    'project', 'activity_id', 'institution_id', 'source_id', 'experiment_id', 'member_id', 'table_id', 'variable_id', 'grid_label', 'version'
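
As a rough illustration of the key construction, the sketch below joins the attribute values of a single row with ".". The row is written out by hand and `build_key` is a hypothetical helper, not part of intake-esm.

```python
# groupby attributes as listed by cmip6.aggregation_info.groupby_attrs above
attrs = ['project', 'activity_id', 'institution_id', 'source_id',
         'experiment_id', 'member_id', 'table_id', 'variable_id',
         'grid_label', 'version']

def build_key(record, attrs=attrs):
    """Join the attribute values of one catalogue row into a dataset key."""
    return ".".join(record[name] for name in attrs)

# hand-written row for illustration only
row = {
    'project': 'CMIP6', 'activity_id': 'DAMIP', 'institution_id': 'BCC',
    'source_id': 'BCC-CSM2-MR', 'experiment_id': 'hist-aer',
    'member_id': 'r3i1p1f1', 'table_id': 'Omon', 'variable_id': 'thetao',
    'grid_label': 'gn', 'version': 'v20190516',
}
key = build_key(row)
print(key)
```

The order of the attributes matters: the key only matches a dataset_dict key if the values are joined in the same order as groupby_attrs.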

In [39]:
def check_range(group, ystart=1850, yend=2020):
    """Return the dataset key if the group does not cover ystart-yend, else None."""
    attrs = cmip6.aggregation_info.groupby_attrs
    drange = group.date_range.sort_values()
    dstart = drange.iloc[0][:4]
    dend = drange.iloc[-1].split("-")[1][:4]
    # build back the key from the groupby attributes present in the group
    ds_key = ".".join([group[x].iloc[0] for x in attrs if x in group.columns])
    if int(dstart) > ystart or int(dend) < yend:
        return ds_key
    return None

Let's test this on our small example

In [40]:
to_remove = [check_range(x[1]) for x in subset.df.groupby('member_id')]

We need to remove the None values returned when the member range was fine.

In [41]:
to_remove = list(filter(None, to_remove))
to_remove

['CMIP6.DAMIP.BCC.BCC-CSM2-MR.hist-aer.r3i1p1f1.Omon.thetao.gn.v20190516']
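
A small aside on `filter(None, ...)`: it drops every falsy item, not only None. That is fine here because the keys are non-empty strings, but a sketch with placeholder keys shows the difference and a more explicit alternative.

```python
# placeholder results: two keys and two None values
results = ["key.a", None, "key.b", None]

# filter(None, ...) keeps only truthy items, so None (and "" or 0) are dropped
kept_truthy = list(filter(None, results))

# explicit alternative that drops None only
kept_not_none = [r for r in results if r is not None]

print(kept_truthy, kept_not_none)
```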

Finally, we can create a dataset_dict and then remove this key

In [42]:
ds_dict = subset.to_dataset_dict()


--> The keys in the returned dictionary of datasets are constructed as follows:
	'project.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'


In [43]:
for k in to_remove:
    ds_dict.pop(k) 
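
If a key in to_remove is not in the dictionary (for example when the cell is run twice), `pop(k)` raises a KeyError. A minimal sketch with placeholder keys and values, assuming it is acceptable to skip missing keys silently:

```python
# placeholder dictionary standing in for the dataset dictionary
ds_dict = {"run.r1": "dataset 1", "run.r2": "dataset 2", "run.r3": "dataset 3"}
to_remove = ["run.r3", "run.r9"]   # "run.r9" is deliberately not in ds_dict

for k in to_remove:
    # the second argument is returned instead of raising KeyError
    ds_dict.pop(k, None)

remaining = sorted(ds_dict)
print(remaining)
```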

Let's check that the dataset r3i1p1f1 has been removed

Can I now rerun the subset search, passing only this table instead of the whole CMIP6 catalogue table?

In [44]:
ds_dict.keys()

dict_keys(['CMIP6.DAMIP.BCC.BCC-CSM2-MR.hist-aer.r2i1p1f1.Omon.thetao.gn.v20190513', 'CMIP6.DAMIP.BCC.BCC-CSM2-MR.hist-aer.r1i1p1f1.Omon.thetao.gn.v20190516'])

Finally, let's see if this works on a bigger subset: we repeat the first query, which returned 98 runs when grouped by source_id, member_id and experiment_id.

In [45]:
subset = cmip6.search(require_all_on=['source_id','member_id'], experiment_id=['hist-GHG','hist-aer'], variable_id=['thetao','tasmin'])

In [46]:
to_remove = [check_range(x[1]) for x in subset.df.groupby(['source_id','member_id', 'experiment_id', 'variable_id'])]

In [47]:
to_remove = list(filter(None, to_remove))
print(len(to_remove))
len(subset.df.groupby(['source_id','member_id', 'experiment_id', 'variable_id']))

9


196

There are 9 simulations out of 196 to remove; let's check a few to be sure.

In [None]:
ds_dict = subset.to_dataset_dict()


--> The keys in the returned dictionary of datasets are constructed as follows:
	'project.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version'




In [None]:
# look up one of the flagged datasets by its key (dict.keys() takes no arguments)
ds_dict[to_remove[4]]

So far this is sub-optimal, as you still have to aggregate the datasets before you can remove the `faulty` ones (hence I used constraints that would return fewer simulations). What I still need to work out is how to efficiently filter the `subset` before this step. According to the intake-esm docs there's a way to pass a filter excluding a particular model; it is not clear to me how to exclude a combination, but this is what I will try next.<br>
In fact, I didn't even get to fully run the last test :-)
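
On the pandas side, one possible approach is to drop the flagged rows from the table before any aggregation. The sketch below uses a made-up table with only three key columns and a hypothetical key layout; it does not show how to hand the filtered table back to intake-esm, which is still an open question here.

```python
import pandas as pd

# hypothetical stand-in for subset.df, reduced to three key columns
df = pd.DataFrame({
    "source_id": ["BCC-CSM2-MR"] * 4,
    "member_id": ["r1i1p1f1", "r2i1p1f1", "r3i1p1f1", "r3i1p1f1"],
    "variable_id": ["thetao"] * 4,
    "date_range": ["185001-202012", "185001-202012",
                   "193001-197912", "198001-202012"],
})
key_cols = ["source_id", "member_id", "variable_id"]
to_remove = ["BCC-CSM2-MR.r3i1p1f1.thetao"]

# build one key per row by joining the key columns with "."
row_keys = df[key_cols].apply(lambda row: ".".join(row), axis=1)

# keep only the rows whose key was not flagged
df_clean = df[~row_keys.isin(to_remove)].reset_index(drop=True)
members_left = sorted(df_clean["member_id"].unique())
print(members_left)
```

In the real table the key columns would be the full groupby_attrs list, so that the keys match the ones returned by check_range.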

The current intake-esm release has a bug that breaks our catalogues. We notified the developers and it has been resolved, but they haven't yet produced a new release, so I will try to get the `fixed` version into our conda environment anyway, as then you can do things like opening the files in parallel, which greatly reduces the running time.

STILL WORKING ON THE STUFF BELOW

In [14]:
import intake_esm

In [33]:
# I changed the original constraints a bit to get fewer results
newsubset = intake_esm.search.search(df=dfnew, require_all_on=['source_id','member_id'], experiment_id=['hist-GHG','hist-aer'], variable_id=['thetao','tasmin'])

In [27]:
dir(intake_esm.search.search)

['__annotations__',
 '__call__',
 '__class__',
 '__closure__',
 '__code__',
 '__defaults__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__get__',
 '__getattribute__',
 '__globals__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__kwdefaults__',
 '__le__',
 '__lt__',
 '__module__',
 '__name__',
 '__ne__',
 '__new__',
 '__qualname__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__']