# Compute Local Ensemble Mean of CMIP6 Projections

This notebook demonstrates how to create an ensemble mean of the CMIP6 ScenarioMIP models available in Google Cloud. It is licensed for free and open consumption under the [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license. Dr. T. Brikowski, U. Texas-Dallas, Mar. 2025

CVS: $Id: CMIP6ensembleMean.ipynb,v 25.7 2025/04/01 16:16:17 brikowi Exp $

# Load needed packages

Be sure they have been installed using "conda install -c conda-forge packagename".

In [None]:
import glob
import matplotlib.pyplot as plt
import urllib.request
import numpy as np
import xarray as xr
import zarr

import pandas as pd
from difflib import get_close_matches
import gcsfs
import datetime                                   # Python standard-library date/time routines
import cftime                                     # NetCDF time manipulation routines
import os

# Plots open within notebook
%matplotlib inline

# Get list of desired models for processing
Get the file list from Google via the URL below, using Pandas (pd) to read the CSV file into a *dataframe* "df". A Pandas dataframe is a spreadsheet-like table for Python with named columns. A slightly different approach to downloading climate model results can be explored at PanGeo [intake-esm](https://intake-esm.readthedocs.io/en/stable/tutorials/loading-cmip6-data.html). That method requires zarr version <3.0, and so is not used for this class.

In [None]:
%%time
df = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

## Get list of all available SSP5 8.5 models with *tasmax*
Find rows in the file-list dataframe "df" that meet the given criteria. The most important are that the variable is *tasmax* and that the experiment is *ssp585* (the worst-case emissions/warming scenario). The query also restricts results to the monthly atmospheric table *Amon* and the ensemble member *r1i1p1f1*.

In [None]:
df_ssp585 = df.query("activity_id=='ScenarioMIP' & table_id == 'Amon' & " +\
    "variable_id == 'tasmax' & experiment_id == 'ssp585' & member_id == 'r1i1p1f1'")
print(len(df_ssp585), 'forecast files match the search criteria, the first 3 are:')
df_ssp585.head(3)

## Get list of all historical models with *tasmax*

In [None]:
df_historical = df.query("activity_id == 'CMIP' & table_id == 'Amon' & " +\
    "variable_id == 'tasmax' & experiment_id == 'historical' & member_id == 'r1i1p1f1'")
print(len(df_historical), 'historical files match the search criteria, the first 3 are:')
df_historical.head(3)
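
The two queries above differ only in the activity and experiment, so they could be wrapped in a small helper. The sketch below illustrates that idea on a made-up three-row catalog. `find_stores` and the toy rows are hypothetical and are not part of this notebook or of any library.

```python
import pandas as pd

def find_stores(catalog, activity, experiment, variable="tasmax",
                table="Amon", member="r1i1p1f1"):
    """Return catalog rows matching the given CMIP6 facets (hypothetical helper)."""
    # '@name' inside a query string refers to a local Python variable
    return catalog.query(
        "activity_id == @activity & table_id == @table & "
        "variable_id == @variable & experiment_id == @experiment & "
        "member_id == @member"
    )

# Toy stand-in for the Google Cloud catalog (made-up rows, same column names)
toy = pd.DataFrame({
    "activity_id":   ["CMIP", "ScenarioMIP", "ScenarioMIP"],
    "table_id":      ["Amon", "Amon", "Amon"],
    "variable_id":   ["tasmax", "tasmax", "pr"],
    "experiment_id": ["historical", "ssp585", "ssp585"],
    "member_id":     ["r1i1p1f1", "r1i1p1f1", "r1i1p1f1"],
    "source_id":     ["MIROC6", "MIROC6", "MIROC6"],
})

toy_ssp585 = find_stores(toy, "ScenarioMIP", "ssp585")
toy_hist = find_stores(toy, "CMIP", "historical")
print(len(toy_ssp585), "projection row(s),", len(toy_hist), "historical row(s)")
```

With the real catalog, the same call pattern would take `df` in place of `toy`.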

## Make list of models with BOTH historical and future tasmax and pr
See [StackOverflow](https://stackoverflow.com/questions/55898796/how-to-match-keys-of-2-data-frames-and-create-new-df-with-matching-keys) for suggestions. The current code requires starting with the shorter dataframe (!).

In [None]:
seq = [r for r in df_ssp585["source_id"] if get_close_matches(r, df_historical["source_id"], n=1, cutoff = .85)]
bothSources = np.unique(np.array(seq))
print(len(bothSources), "models have both historical and SSP5-8.5 results, e.g. ", bothSources[3])

gcs = gcsfs.GCSFileSystem(token='anon')
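
Fuzzy matching with a cutoff of 0.85 also accepts names that are merely similar, which may matter because the later catalog lookups compare `source_id` exactly. The sketch below contrasts fuzzy and exact matching on two short, illustrative name lists. The lists are assumptions for demonstration, not taken from the catalog.

```python
from difflib import get_close_matches

# Illustrative name lists (assumed for this example, not read from the catalog)
ssp_models = ["EC-Earth3-Veg-LR", "MIROC6"]
hist_models = ["EC-Earth3-Veg", "MIROC6"]

# Fuzzy matching as in the cell above: similar names also pass the cutoff
fuzzy = [m for m in ssp_models
         if get_close_matches(m, hist_models, n=1, cutoff=0.85)]

# Exact matching: keep only names present in both lists
exact = sorted(set(ssp_models) & set(hist_models))

print("fuzzy:", fuzzy)
print("exact:", exact)
```

For the dataframes used here, an exact alternative might be `np.intersect1d(df_ssp585["source_id"], df_historical["source_id"])`, which returns the sorted names common to both columns.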


In [None]:
for i in range(len(bothSources)):
    # model = 'GFDL-CM4'
    print("Model name: ", bothSources[i])


## Define functions to get desired variable from desired model

In [None]:
def getVar(modelList, modelName):
    # Custom function to get variable for model 'modelName' from list of possible models 'modelList'
    # Use the Zarr store URL of the first catalog row matching this model name exactly
    zstore = modelList.query(f"source_id == '{modelName}'").zstore.values[0]
    ds = xr.open_zarr(zstore, consolidated = True)
    ds.load()                       # Read the whole dataset into memory
    return ds

In [None]:
# Define lat-long box, initially the DFW Metroplex
# For best results be sure this is at least 2x2 degrees, to allow for coarse models
left= 360-97.5
right = 360-95.5
bottom = 32.0
top = 34.0
locName = 'DFW'                     # Name of location (for labeling plots)
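
The box edges are written as `360 - x` on the assumption that the models store longitude as degrees east from 0 to 360, so western longitudes must be converted. A minimal sketch of that conversion follows. `to_east_longitude` is a hypothetical helper, not a library function.

```python
def to_east_longitude(lon_deg):
    """Convert a longitude in degrees (west negative) to the 0-360 convention."""
    return lon_deg % 360.0

# DFW box edges given as conventional west-negative longitudes
left = to_east_longitude(-97.5)
right = to_east_longitude(-95.5)
print(left, right)
```

Eastern longitudes pass through unchanged, so the same helper can be applied to any location.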

# Get historical model statistics

Accumulate ensemble mean and confidence bands.

Author's note: consider filtering for NaN results if lat-long box is narrower than some model grid spacing

In [None]:
%%time
filename = "historical_tasmax.csv"
for i in range(len(bothSources)):
    model = bothSources[i]
    print("Processing model: %s" % (model), end='...')
    # Call my custom function
    ds_hist = getVar(df_historical, model)
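    # Subset to the lat-long box and average; errors='ignore' tolerates models lacking a 'height' coordinate
    localHist = ds_hist.isel(lat = (ds_hist.lat>=bottom) & (ds_hist.lat<=top),
                         lon = (ds_hist.lon>=left) & (ds_hist.lon<=right)).mean(['lat','lon']).drop_vars(['height'], errors='ignore')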
    hist4export = (localHist.groupby('time.year').mean().tasmax-273.15)
    tempDF = hist4export.to_dataframe()
    # Append results to dataframe
    if i==0:
        histTASmax = tempDF
        histTASmax.columns = [model]
    else:
        tempDF2 = pd.concat([histTASmax, tempDF["tasmax"]], axis=1).reindex(histTASmax.index)
        tempDF2.rename(columns={tempDF2.columns[i]:model}, inplace=True)
        histTASmax = tempDF2
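
A box narrower than a model's grid spacing can contain no grid-cell centers, in which case the box mean is NaN. One way to detect this before averaging is to count the cell centers inside the box, as sketched below. The two grids are synthetic and `cells_in_box` is a hypothetical helper.

```python
import numpy as np

def cells_in_box(lat, lon, bottom, top, left, right):
    """Count grid-cell centers inside the lat-long box (edges inclusive)."""
    n_lat = int(np.count_nonzero((lat >= bottom) & (lat <= top)))
    n_lon = int(np.count_nonzero((lon >= left) & (lon <= right)))
    return n_lat * n_lon

# The DFW box used in this notebook
bottom, top = 32.0, 34.0
left, right = 360 - 97.5, 360 - 95.5

# Synthetic grids: a coarse 5-degree grid and a fine 1-degree grid
coarse_lat, coarse_lon = np.arange(-90.0, 91.0, 5.0), np.arange(0.0, 360.0, 5.0)
fine_lat, fine_lon = np.arange(-89.5, 90.0, 1.0), np.arange(0.5, 360.0, 1.0)

n_coarse = cells_in_box(coarse_lat, coarse_lon, bottom, top, left, right)
n_fine = cells_in_box(fine_lat, fine_lon, bottom, top, left, right)
print("coarse grid cells in box:", n_coarse)
print("fine grid cells in box:", n_fine)
```

A model whose count is zero could be skipped or reported before its column is added to the ensemble table.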


# Repeat for projected TASmax

In [None]:
%%time
filename = "ssp585_tasmax.csv"
for i in range(len(bothSources)):
    model = bothSources[i]
    print("Processing model: %s" % (model), end='...')
    ds_proj = getVar(df_ssp585, model)
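    # Subset to the lat-long box and average; errors='ignore' tolerates models lacking a 'height' coordinate
    localProj = ds_proj.isel(lat = (ds_proj.lat>=bottom) & (ds_proj.lat<=top),
                         lon = (ds_proj.lon>=left) & (ds_proj.lon<=right)).mean(['lat','lon']).drop_vars(['height'], errors='ignore')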
    proj4export = (localProj.groupby('time.year').mean().tasmax-273.15)
    tempDF = proj4export.to_dataframe()
    # Append results to dataframe
    if i==0:
        projTASmax = tempDF
        projTASmax.columns = [model]
    else:
        tempDF2 = pd.concat([projTASmax, tempDF["tasmax"]], axis=1).reindex(projTASmax.index)
        tempDF2.rename(columns={tempDF2.columns[i]:model}, inplace=True)
        projTASmax = tempDF2
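
The append-and-rename pattern in the loops above could also be written by collecting one series per model in a dictionary and concatenating once. The sketch below shows this on synthetic data. The model names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic annual series standing in for each model's box-mean tasmax (deg C)
years_a = pd.Index(range(2015, 2020), name="year")
years_b = pd.Index(range(2015, 2023), name="year")   # a longer run
per_model = {
    "modelA": pd.Series(np.linspace(25.0, 26.0, len(years_a)), index=years_a),
    "modelB": pd.Series(np.linspace(24.0, 27.5, len(years_b)), index=years_b),
}

# One concat call: dict keys become column names, rows are aligned by year
combined = pd.concat(per_model, axis=1)

# Restrict to the first model's years, as the reindex() in the loop does
combined = combined.reindex(years_a)
print(combined)
```

In the loops above, the dictionary entry for each model would be the `tasmax` column of `tempDF`.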


# Generate and Plot Local All-Available-Model Ensemble Mean-Median-Confidence Intervals

Again note that models coarser than the specified local lat-long box will yield 'NaN' (not a number) values.  These are ignored by matplotlib, but may confuse other plotting tools like Excel.

In [None]:
projAvg = projTASmax.mean(axis=1)
histAvg = histTASmax.mean(axis=1)

## Plot all models and mean

In [None]:
colors = ['#89FAB4', '#FAE4A0', '#FA837D', '#B049E3', '#E3BA5F', '#E35E54', '#6591EA', '#EB83C6', '#EB1551', '#1802F4']
styles = ['', '', '-+', '-o', '', '', '', '', '', '']

ax = projTASmax.plot(style=styles, color=colors, figsize=(16, 9))  # plot the dataframe and set Time as x
projAvg.plot(label='Mean', color='black', linewidth=4)
ax.legend(bbox_to_anchor=(1, 1.01), loc='upper left') # move the legend
fig = ax.get_figure()  # extract the figure object
fig.tight_layout(pad=3)
fig.suptitle(f'TASmax From All Available CMIP6 Models {locName} Area', fontsize=22, y=1.02, color='black')


In [None]:
# Compute quantiles from the model columns only, append to dataframe
histModels = histTASmax[list(bothSources)]                 # Excludes the statistics columns added below
histTASmax["Mean"] = histAvg
histTASmax["Q05"] = histModels.quantile(.05, axis=1)
histTASmax["Q25"] = histModels.quantile(.25, axis=1)
histTASmax["Q50"] = histModels.quantile(.50, axis=1)       # Median not mean !
histTASmax["Q75"] = histModels.quantile(.75, axis=1)
histTASmax["Q95"] = histModels.quantile(.95, axis=1)

# Save final results to file (alert Python programmers could just plot the dataframe histTASmax..)
filename = f'allCMIP6historicalTASmax_{locName}.csv'
histTASmax.to_csv(filename)
print("\nSaved local historic TASmax to file: ", filename)

# Compute quantiles from the model columns only, append to dataframe
projModels = projTASmax[list(bothSources)]                 # Excludes the statistics columns added below
projTASmax["Mean"] = projAvg
projTASmax["Q05"] = projModels.quantile(.05, axis=1)
projTASmax["Q25"] = projModels.quantile(.25, axis=1)
projTASmax["Q50"] = projModels.quantile(.50, axis=1)
projTASmax["Q75"] = projModels.quantile(.75, axis=1)
projTASmax["Q95"] = projModels.quantile(.95, axis=1)

# Save final results to file (alert Python programmers could just plot the dataframe projTASmax..)
filename = f'allCMIP6projTASmax_{locName}.csv'
projTASmax.to_csv(filename)
print("\nSaved local projected TASmax to file: ", filename)
projTASmax.head(5)
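
Ensemble statistics are easiest to keep independent of one another when they are all computed from the model columns in a single call and joined to the table afterwards. The following sketch shows one way to do that on a small synthetic table. The values and column names `m1` to `m5` are made up.

```python
import pandas as pd

# Synthetic table: rows are years, columns are models (made-up values)
models = pd.DataFrame(
    {"m1": [1.0, 10.0], "m2": [2.0, 20.0], "m3": [3.0, 30.0],
     "m4": [4.0, 40.0], "m5": [5.0, 50.0]},
    index=pd.Index([2000, 2001], name="year"),
)

# All quantiles in one call; transpose so that rows are years again
stats = models.quantile([0.05, 0.25, 0.50, 0.75, 0.95], axis=1).T
stats.columns = ["Q05", "Q25", "Q50", "Q75", "Q95"]
stats["Mean"] = models.mean(axis=1)

# Join the statistics to the model columns only after they are all computed
result = pd.concat([models, stats], axis=1)
print(result.round(2))
```

Because `stats` never reads from `result`, adding further statistics later cannot change the ones already computed.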

## Plot all-model mean-median & confidence intervals

# Explore Mahoney-Wang-2022 Ensemble Local Mean

[Mahoney, et al., 2022](https://dx.doi.org/10.1002/joc.7566) define 13-model and 8-model ensembles drawn from the 45 CMIP6 models available in 2020. These are claimed to be superior for North America, especially for specialized downscaling approaches.

In [None]:
# Define list of model names in the two ensembles
mahoneyWang13 = ['ACCESS-ESM1.5', 'BCC-CSM2', 'CanESM5', 'CNRM-ESM2-1', 'EC-Earth3', 'GFDL-ESM4', 'GISS-E2-1', 'INM-CM5-0', 
                 'IPSL-CM6A-LR', 'MIROC6', 'MPI-ESM1-2-HR', 'MRI-ESM2-0', 'UKESM1']
mahoneyWang8 = ['ACCESS-ESM1-5', 'CNRM-ESM2-1', 'EC-Earth3', 'GFDL-ESM4', 'GISS-E2-1-H', 'MIROC6', 'MPI-ESM1-2-HR', 'MRI-ESM2-0']
# Limit to Mahoney-Wang ensemble members that have both historical and SSP585 projection results at Google
mahoneyBoth = []
for i in range(len(mahoneyWang13)):
    # Keep the catalog spelling of the match, because getVar() looks up source_id exactly
    match = get_close_matches(mahoneyWang13[i], bothSources, n=1, cutoff = .85)
    if match:
        mahoneyBoth.append(str(match[0]))
    else:
        print(f'\n\tModel {mahoneyWang13[i]}: unable to find historical and SSP585 results in Google CMIP6 list.')

print(f'{len(mahoneyBoth)} out of {len(mahoneyWang13)} historical+SSP585 models available for ensemble calculations')
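
Whether a shortened model name passes the 0.85 cutoff depends on how much longer the catalog name is. The sketch below prints the similarity score that `get_close_matches` compares with its cutoff for three of the short names above. The longer spellings are assumed here to be the corresponding catalog `source_id` values, and `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(short_name, catalog_name):
    """Similarity score that get_close_matches() compares with its cutoff."""
    return SequenceMatcher(None, catalog_name, short_name).ratio()

# Short names from the list above paired with assumed catalog spellings
pairs = [("BCC-CSM2", "BCC-CSM2-MR"),
         ("UKESM1", "UKESM1-0-LL"),
         ("GISS-E2-1", "GISS-E2-1-G")]

for short_name, catalog_name in pairs:
    score = similarity(short_name, catalog_name)
    hit = get_close_matches(short_name, [catalog_name], n=1, cutoff=0.85)
    print(f"{short_name:10s} vs {catalog_name:12s} score={score:.3f} match={hit}")
```

Names that fall below the cutoff are reported as unavailable even when a longer spelling exists, so writing the full catalog names in the ensemble lists may be the more reliable option.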

In [None]:
%%time
colNum = 0
for i in range(len(mahoneyBoth)):
    model = mahoneyBoth[i]
    print(f'Processing model {model}', end='...')
    mh_hist = getVar(df_historical, model)
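    # Subset to the lat-long box and average; errors='ignore' tolerates models lacking a 'height' coordinate
    localHist = mh_hist.isel(lat = (mh_hist.lat>=bottom) & (mh_hist.lat<=top), lon = (mh_hist.lon>=left) 
                            & (mh_hist.lon<=right)).mean(['lat','lon']).drop_vars(['height'], errors='ignore')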
    hist4export = (localHist.groupby('time.year').mean().tasmax-273.15)
    tempDF = hist4export.to_dataframe()
    # Append results to dataframe
    if i==0:
        mhhistTASmax = tempDF
        mhhistTASmax.columns = [model]
    else:
        tempDF2 = pd.concat([mhhistTASmax, tempDF["tasmax"]], axis=1).reindex(mhhistTASmax.index)
        tempDF2.rename(columns={tempDF2.columns[colNum]:model}, inplace=True)
        mhhistTASmax = tempDF2
    colNum += 1

mhhistTASmax

In [None]:
%%time

# Repeat for future
maxTime = cftime.DatetimeNoLeap(2099, 12, 31, 23, 59, 59, 0, has_year_zero=True)   # Clip CanESM5 model to 21st century
colNum = 0
for i in range(len(mahoneyBoth)):
    model = mahoneyBoth[i]
    print(f'Processing model {model}', end='...')
    mh_proj = getVar(df_ssp585, model)
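    # Subset to the lat-long box and average; errors='ignore' tolerates models lacking a 'height' coordinate
    localProj = mh_proj.isel(lat = (mh_proj.lat>=bottom) & (mh_proj.lat<=top), lon = (mh_proj.lon>=left) 
                            & (mh_proj.lon<=right)).mean(['lat','lon']).drop_vars(['height'], errors='ignore')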
    proj4export = (localProj.groupby('time.year').mean().tasmax-273.15)
    tempDF = proj4export.to_dataframe()
    # Append results to dataframe
    if colNum==0:
        mhprojTASmax = tempDF
        mhprojTASmax.columns = [model]
    else:
        tempDF2 = pd.concat([mhprojTASmax, tempDF["tasmax"]], axis=1).reindex(mhprojTASmax.index)
        tempDF2.rename(columns={tempDF2.columns[colNum]:model}, inplace=True)
        mhprojTASmax = tempDF2
    colNum += 1        

mhprojMeanTASmax = mhprojTASmax.loc[:2100]                # Clip table to 21st century by year label (fix error in CanESM5...)
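
Slicing a dataframe with plain brackets and integer bounds selects rows by position, whereas `.loc` selects by index label. With a year index the two give different results. The sketch below illustrates the difference on a synthetic table that runs past 2100. The data are made up.

```python
import pandas as pd

# Synthetic annual table running from 2015 to 2300 (made-up values)
years = pd.Index(range(2015, 2301), name="year")
table = pd.DataFrame({"modelA": range(len(years))}, index=years)

by_position = table[:2100]      # first 2100 rows, which here is every row
by_label = table.loc[:2100]     # rows whose year label is <= 2100

print(len(table), len(by_position), len(by_label))
print("last year kept by label:", by_label.index.max())
```

Label-based slices made with `.loc` include the end label, so the year 2100 itself is kept.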


## Export ensemble means

Calculate and include the 50% and 90% confidence intervals (Q25 to Q75 and Q05 to Q95). Save to CSV file.

In [None]:
filename = f'CMIP6historicalEnsembleMeanTASmax_{locName}.csv'
mhhistModels = mhhistTASmax[mahoneyBoth]                  # Model columns only, so statistics do not feed into each other
mhhistTASmax["Q05"] = mhhistModels.quantile(.05, axis=1)
mhhistTASmax["Q25"] = mhhistModels.quantile(.25, axis=1)
mhhistTASmax["Q50"] = mhhistModels.quantile(.50, axis=1)
mhhistTASmax["Q75"] = mhhistModels.quantile(.75, axis=1)
mhhistTASmax["Q95"] = mhhistModels.quantile(.95, axis=1)
mhhistTASmax["Mean"] = mhhistModels.mean(axis=1)
mhhistTASmax.to_csv(filename)
print("\nSaved local historical TASmax to file: ", filename)

filename = f'ssp585ensembleMeanTASmax_{locName}.csv'
mhprojModels = mhprojTASmax[mahoneyBoth]                  # Model columns only, so statistics do not feed into each other
mhprojTASmax["Q05"] = mhprojModels.quantile(.05, axis=1)
mhprojTASmax["Q25"] = mhprojModels.quantile(.25, axis=1)
mhprojTASmax["Q50"] = mhprojModels.quantile(.50, axis=1)
mhprojTASmax["Q75"] = mhprojModels.quantile(.75, axis=1)
mhprojTASmax["Q95"] = mhprojModels.quantile(.95, axis=1)
mhprojTASmax["Mean"] = mhprojModels.mean(axis=1)
mhprojTASmax.to_csv(filename)
mhprojTASmax.head(5)

# Plot Ensemble Means & Confidence Interval
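
One possible way to draw such a plot is to shade the Q05 to Q95 and Q25 to Q75 ranges and overlay the mean and median. This is a sketch only, and it uses a synthetic table with the same statistic column names as the CSV files saved above. The values are made up and the styling choices are assumptions.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for a saved ensemble table (made-up values)
years = np.arange(2015, 2101)
center = 26.0 + 0.05 * (years - 2015)
table = pd.DataFrame({
    "Mean": center, "Q50": center - 0.1,
    "Q05": center - 2.0, "Q25": center - 1.0,
    "Q75": center + 1.0, "Q95": center + 2.0,
}, index=pd.Index(years, name="year"))

fig, ax = plt.subplots(figsize=(16, 9))
# Outer (90%) and inner (50%) bands, then the central estimates on top
ax.fill_between(table.index, table["Q05"], table["Q95"], alpha=0.2, label="Q05-Q95")
ax.fill_between(table.index, table["Q25"], table["Q75"], alpha=0.4, label="Q25-Q75")
ax.plot(table.index, table["Mean"], color="black", linewidth=3, label="Mean")
ax.plot(table.index, table["Q50"], color="black", linestyle="--", label="Median")
ax.set_xlabel("Year")
ax.set_ylabel("TASmax (deg C)")
ax.legend(loc="upper left")
```

With the real results, `table` could be replaced by one of the ensemble dataframes, or by a saved file read back with `pd.read_csv(filename, index_col=0)`.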