# Do we need a modeling realm?

In [1]:
import pandas as pd

import data_request as dr

pd.set_option("display.max_rows", None)

In [2]:
table_ids = pd.read_csv(
    "https://docs.google.com/spreadsheets/d/1Hmu9fE9GdXUZoTl90vDv-qiI803L2feMxe6dAR0oErI/export?format=csv&gid=110419801"
)
table_ids

Unnamed: 0,table_id,description
0,3hr,atmosphere sampled every 3 hours
1,6hrLev,6-hourly data on atmospheric model levels
2,6hrPlev,6-hourly atmospheric data on pressure levels (...
3,6hrPlevPt,6-hourly atmospheric data on pressure levels (...
4,AERday,Daily atmospheric chemistry and aerosol data
5,AERfx,Fixed atmospheric chemistry and aerosol data
6,AERhr,Hourly atmospheric chemistry and aerosol data
7,AERmon,Monthly atmospheric chemistry and aerosol data
8,AERmonZ,Monthly atmospheric chemistry and aerosol data
9,Amon,Monthly atmospheric data


### What variables have the same `out_name`, `frequency` and `modeling_realm` in CMIP6?

This combination would cause trouble if we split tables only by frequency: the resulting table would contain duplicated `variable_entry` keys, which is not allowed. The `Omon` table already has examples of this problem, where the `variable_entry` key differs from `out_name`. The duplicated entries are mostly distinguished by their dimensions, e.g., presumably different axis names for different MIPs, or different vertical coordinates.
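As a minimal sketch of the check used here, the snippet below applies `DataFrame.duplicated` with `keep=False` to a small toy frame. The two `acabf` rows are taken from the notebook output; the `tas` row and the name `toy` are only illustrative and are not part of `data_request`.

```python
import pandas as pd

# Toy frame (hypothetical): two rows from the notebook output plus one
# illustrative non-clashing row.
toy = pd.DataFrame(
    {
        "out_name": ["acabf", "acabf", "tas"],
        "frequency": ["mon", "mon", "mon"],
        "modeling_realm": ["landIce", "landIce", "atmos"],
        "cmip6_table": ["ImonAnt", "ImonGre", "Amon"],
    }
)

key = ["out_name", "frequency", "modeling_realm"]

# keep=False marks every member of a duplicate group, not only the repeats,
# so both clashing rows show up side by side.
clashes = toy[toy.duplicated(key, keep=False)]
print(clashes)
```

With the default `keep="first"`, only the second `acabf` row would be flagged, which makes it harder to see which tables clash.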

In [3]:
df = dr.retrieve_cmip6_mip_tables()

df[df.duplicated(["out_name", "frequency", "modeling_realm"], keep=False)][
    [
        "out_name",
        "frequency",
        "modeling_realm",
        "cmip6_table",
        "dimensions",
        "long_name",
    ]
].join(
    table_ids.rename(columns={"table_id": "cmip6_table"}).set_index("cmip6_table"),
    on="cmip6_table",
).set_index(
    ["out_name", "frequency", "modeling_realm"]
).sort_index()

out_name,frequency,modeling_realm,cmip6_table,dimensions,long_name,description
acabf,mon,landIce,ImonAnt,xant yant time,Surface Mass Balance Flux,Monthly fields on the Antarctic ice sheet
acabf,mon,landIce,ImonGre,xgre ygre time,Surface Mass Balance Flux,Monthly fields on the Greenland ice sheet
acabf,yr,landIce,IyrAnt,xant yant time,Surface Mass Balance Flux,Annual fields on the Antarctic ice sheet
acabf,yr,landIce,IyrGre,xgre ygre time,Surface Mass Balance Flux,Annual fields on the Greenland ice sheet
areacellg,fx,land,IfxAnt,longitude latitude,Grid-Cell Area for Ice Sheet Variables,Fixed fields on the Antarctic ice sheet
areacellg,fx,land,IfxGre,longitude latitude,Grid-Cell Area for Ice Sheet Variables,Fixed fields on the Greenland ice sheet
ch4,mon,aerosol,AERmon,longitude latitude alevel time,Mole Fraction of CH4,Monthly atmospheric chemistry and aerosol data
ch4,mon,aerosol,AERmonZ,latitude plev39 time,Mole Fraction of CH4,Monthly atmospheric chemistry and aerosol data
clt,day,atmos,Eday,longitude latitude time,Total Cloud Cover Percentage,"Daily (time mean, extension)"
clt,day,atmos,day,longitude latitude time,Total Cloud Cover Percentage,Daily Data (extension - contains both atmosphe...


If we also add `dimensions` to the key, we get far fewer entries. These seem to be distinguished by other factors, e.g., physics or additional aggregations:

In [4]:
df = dr.retrieve_cmip6_mip_tables()

df[
    df.duplicated(["out_name", "frequency", "modeling_realm", "dimensions"], keep=False)
][
    [
        "out_name",
        "frequency",
        "modeling_realm",
        "cmip6_table",
        "dimensions",
        "long_name",
    ]
].join(
    table_ids.rename(columns={"table_id": "cmip6_table"}).set_index("cmip6_table"),
    on="cmip6_table",
).set_index(
    ["out_name", "frequency", "modeling_realm"]
).sort_index()

out_name,frequency,modeling_realm,cmip6_table,dimensions,long_name,description
areacellg,fx,land,IfxAnt,longitude latitude,Grid-Cell Area for Ice Sheet Variables,Fixed fields on the Antarctic ice sheet
areacellg,fx,land,IfxGre,longitude latitude,Grid-Cell Area for Ice Sheet Variables,Fixed fields on the Greenland ice sheet
clt,day,atmos,Eday,longitude latitude time,Total Cloud Cover Percentage,"Daily (time mean, extension)"
clt,day,atmos,day,longitude latitude time,Total Cloud Cover Percentage,Daily Data (extension - contains both atmosphe...
hfls,day,atmos,Eday,longitude latitude time,Surface Upward Latent Heat Flux,"Daily (time mean, extension)"
hfls,day,atmos,day,longitude latitude time,Surface Upward Latent Heat Flux,Daily Data (extension - contains both atmosphe...
hfss,day,atmos,Eday,longitude latitude time,Surface Upward Sensible Heat Flux,"Daily (time mean, extension)"
hfss,day,atmos,day,longitude latitude time,Surface Upward Sensible Heat Flux,Daily Data (extension - contains both atmosphe...
iareafl,yr,landIce,IyrAnt,time,Area Covered by Floating Ice Shelves,Annual fields on the Antarctic ice sheet
iareafl,yr,landIce,IyrGre,time,Area Covered by Floating Ice Shelves,Annual fields on the Greenland ice sheet
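To compare candidate keys side by side, one could count how many rows each key leaves in a clash. This is only a sketch on four rows copied from the outputs above; the names `toy`, `candidate_keys`, and `clash_counts` are hypothetical.

```python
import pandas as pd

# Four rows copied from the notebook outputs (not the full CMIP6 tables).
toy = pd.DataFrame(
    [
        ("acabf", "mon", "landIce", "ImonAnt", "xant yant time"),
        ("acabf", "mon", "landIce", "ImonGre", "xgre ygre time"),
        ("clt", "day", "atmos", "Eday", "longitude latitude time"),
        ("clt", "day", "atmos", "day", "longitude latitude time"),
    ],
    columns=["out_name", "frequency", "modeling_realm", "cmip6_table", "dimensions"],
)

base = ["out_name", "frequency", "modeling_realm"]
candidate_keys = {
    "base": base,
    "base+dimensions": base + ["dimensions"],
    "base+dimensions+table": base + ["dimensions", "cmip6_table"],
}

# Number of rows that share their key with at least one other row.
clash_counts = {
    name: int(toy.duplicated(cols, keep=False).sum())
    for name, cols in candidate_keys.items()
}
print(clash_counts)
```

On this toy data the count drops as columns are added to the key, mirroring the pattern seen in the full tables.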


There are no duplicates if we also include the table ID (`cmip6_table`), so a table entry is uniquely defined by `["out_name", "frequency", "modeling_realm", "dimensions", "cmip6_table"]`:

In [5]:
df = dr.retrieve_cmip6_mip_tables()

df[
    df.duplicated(
        ["out_name", "frequency", "modeling_realm", "dimensions", "cmip6_table"],
        keep=False,
    )
][
    [
        "out_name",
        "frequency",
        "modeling_realm",
        "cmip6_table",
        "dimensions",
        "long_name",
    ]
].join(
    table_ids.rename(columns={"table_id": "cmip6_table"}).set_index("cmip6_table"),
    on="cmip6_table",
).set_index(
    ["out_name", "frequency", "modeling_realm"]
).sort_index()

out_name,frequency,modeling_realm,cmip6_table,dimensions,long_name,description
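An empty result like the one above could also be enforced as a hard check. The sketch below assumes that `set_index(..., verify_integrity=True)` is an acceptable way to do this; it raises a `ValueError` when the chosen key is not unique. The toy rows are copied from the earlier `clt` output, and `has_unique_key` is a hypothetical helper, not part of `data_request`.

```python
import pandas as pd


def has_unique_key(frame: pd.DataFrame, columns: list) -> bool:
    """Return True if `columns` identify every row of `frame` uniquely."""
    try:
        # verify_integrity=True makes pandas raise ValueError on duplicate keys.
        frame.set_index(columns, verify_integrity=True)
    except ValueError:
        return False
    return True


# Two rows copied from the notebook output; they differ only in cmip6_table.
toy = pd.DataFrame(
    {
        "out_name": ["clt", "clt"],
        "frequency": ["day", "day"],
        "modeling_realm": ["atmos", "atmos"],
        "dimensions": ["longitude latitude time", "longitude latitude time"],
        "cmip6_table": ["Eday", "day"],
    }
)

four_cols = ["out_name", "frequency", "modeling_realm", "dimensions"]
without_table = has_unique_key(toy, four_cols)
with_table = has_unique_key(toy, four_cols + ["cmip6_table"])
print(without_table, with_table)
```

Applied to the full frame from `dr.retrieve_cmip6_mip_tables()`, the same helper would be expected to return `True` for the five-column key, given the empty result above.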


Check for duplicates within the `Amon` and `Omon` tables, this time using only `out_name` and `modeling_realm` as the key:

In [6]:
df = dr.retrieve_cmip6_mip_tables()

df = df[df.cmip6_table.isin(["Amon", "Omon"])]

df[
    df.duplicated(
        ["out_name", "modeling_realm"],
        keep=False,
    )
][
    [
        "out_name",
        "frequency",
        "modeling_realm",
        "cmip6_table",
        "dimensions",
        "long_name",
    ]
].join(
    table_ids.rename(columns={"table_id": "cmip6_table"}).set_index("cmip6_table"),
    on="cmip6_table",
).set_index(
    ["out_name", "modeling_realm"]
).sort_index()

out_name,modeling_realm,frequency,cmip6_table,dimensions,long_name,description
ch4,atmos atmosChem,mon,Amon,longitude latitude plev19 time,Mole Fraction of CH4,Monthly atmospheric data
ch4,atmos atmosChem,monC,Amon,longitude latitude plev19 time2,Mole Fraction of CH4,Monthly atmospheric data
ch4global,atmos atmosChem,mon,Amon,time,Global Mean Mole Fraction of CH4,Monthly atmospheric data
ch4global,atmos atmosChem,monC,Amon,time2,Global Mean Mole Fraction of CH4,Monthly atmospheric data
co2,atmos,mon,Amon,longitude latitude plev19 time,Mole Fraction of CO2,Monthly atmospheric data
co2,atmos,monC,Amon,longitude latitude plev19 time2,Mole Fraction of CO2,Monthly atmospheric data
co2mass,atmos,mon,Amon,time,Total Atmospheric Mass of CO2,Monthly atmospheric data
co2mass,atmos,monC,Amon,time2,Total Atmospheric Mass of CO2,Monthly atmospheric data
ficeberg,ocean,mon,Omon,longitude latitude olevel time,Water Flux into Sea Water from Icebergs,Monthly ocean data
ficeberg,ocean,mon,Omon,longitude latitude time,Water Flux into Sea Water from Icebergs,Monthly ocean data
