# Access Coupled Model Intercomparison Project (CMIP) Data
This notebook uses Google Cloud Storage to access outputs from the [Coupled Model Intercomparison Project 6](https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6). Data can also be accessed through this [portal](https://esgf-node.llnl.gov/projects/cmip6/). This is the most involved part of the data collection for this project. We will compare the climate projections to historical temperature variability from weather stations in New Orleans, to see how gumbo weather has changed over the past 60 years and how it will change over the next 60.  

Climate models are run in ensembles. Each model, operated by different groups around the world, has slightly different physics. These models and their internal variability are compared to one another and averaged to produce a projection.  

## Choosing a Scenario
While each model contains different mathematical representations of the physics of Earth's climate system, they all depend on similar input "forcings": stimuli to which the climate system responds. For Earth's surface temperature, this depends almost entirely on the CO$_2$ concentration in the atmosphere, which in turn depends on the choices humanity makes about future CO$_2$ emissions. The [Intergovernmental Panel on Climate Change](https://www.ipcc.ch) (IPCC) has outlined a few **scenarios** representing paths humanity might follow in our choices to curb fossil fuel emissions.  
![](Images/rcps.jpg)  
Source: [Wikipedia: Representative Concentration Pathway](https://en.wikipedia.org/wiki/Representative_Concentration_Pathway)  
<br>
The above plot shows a few ways atmospheric CO$_2$ concentration might evolve over the coming century. These scenarios used to be called [Representative Concentration Pathways](https://en.wikipedia.org/wiki/Representative_Concentration_Pathway) (RCPs), but they are now integrated into a larger framework of models that include population change, economic growth, education, urbanisation and the rate of technological development, called [Shared Socioeconomic Pathways](https://www.carbonbrief.org/explainer-how-shared-socioeconomic-pathways-explore-future-climate-change) (SSPs). The numbers in these scenarios (2.6, 4.5, 6.0, 8.5) represent the [radiative forcing](https://en.wikipedia.org/wiki/Radiative_forcing) reached by 2100 (watts/meter$^2$), which includes the effect of greenhouse gases, and today is estimated at 1.6 watts/meter$^2$. RCP 8.5/SSP 5 represents the most severe scenario, with no efforts to curb emissions. RCP 2.6/SSP 1 shows a path in which humanity takes a "very stringent" approach to reducing emissions, and is the path required to keep warming below the IPCC goal of 2 °C.
#### We choose to compare historical temperature variability to a scenario with intermediate severity, **RCP 4.5/SSP 2**  
<br>
In this scenario, CO$_2$ emissions peak in 2045 and decline afterwards. Atmospheric CO$_2$ concentrations stabilize by 2060. Total global warming in this scenario is estimated at 2.5-3 °C.    
<br><br>
RCP 4.5/SSP 2 is an appropriate choice because it is close to the path humanity is currently on.

## Code
Below is the code required to pull down the outputs from climate models. In short, it uses Google Cloud Storage to access the data, xarray and dask to hold and structure it, xagg to standardize the coordinates, NumPy to pick the pixel nearest New Orleans, and dask's `.compute()` to load the result into memory. 
### Import Packages

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import cftime
from tqdm.notebook import tqdm
import re
from operator import itemgetter
import intake
import gcsfs
import os
import warnings 
import xagg as xa
import dask

### Select Experiments, Variables, and Times
We are going to pull down data for SSP 2/RCP 4.5 (`ssp245`) and for the `historical` experiment, in which the models "hindcast" historical climate variability. The historical variability simulated by a model is an important benchmark for the warming projected by that same model. For each simulation, we take the difference between the projected temperature and the model's own historical value, which gives a temperature change. This is preferred to interpreting temperatures from the model output literally.  
<br>
Our variable is "tas", the near-surface air temperature. For the historical experiment we pull down 1985-2014, and for the SSP simulations we pull down 2020-2080, matching the `subset_params` set in the cell below.
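
The baseline idea can be sketched in a few lines of plain Python. The numbers are made up for illustration and the names `model_a` and `model_b` are hypothetical, not real CMIP6 models. The point is only that each model's future mean is compared against that same model's historical mean before anything is averaged across models.

```python
from statistics import mean

# Hypothetical mean "tas" values in kelvin; these are NOT real CMIP6 output.
historical_mean = {"model_a": 293.0, "model_b": 294.5}
future_mean = {"model_a": 295.0, "model_b": 296.0}

# Change relative to each model's own historical baseline
anomaly = {m: future_mean[m] - historical_mean[m] for m in historical_mean}

# Average the per-model changes to get an ensemble-mean change
ensemble_change = mean(list(anomaly.values()))

print(anomaly)
print(ensemble_change)
```

Working with differences this way means a model that runs consistently warm or cold still contributes a usable estimate of the change.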

In [4]:
data_params_all = [{'experiment_id':'historical','table_id':'day','variable_id':'tas','member_id':'r1i1p1f1'},
                   {'experiment_id':'ssp245','table_id':'day','variable_id':'tas','member_id':'r1i1p1f1'}]
subset_params = {'lat':[28,32],'lon':[-92,-88],
                  'time':{'historical':['1985-01-01','2014-12-31'],
                          'ssp245':['2020-01-01','2080-12-31']}}

### Set up Google Cloud Storage access
This code sets up anonymous, read-only access to the files and loads a table with metadata for the experiments. 

In [5]:
# Access google cloud storage links
fs = gcsfs.GCSFileSystem(token='anon', access='read_only')
# Get info about CMIP6 datasets
cmip6_datasets = pd.read_csv('https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv')

In [6]:
data_params = data_params_all[0]
cmip6_sub = cmip6_datasets.query(' and '.join([k+" == '"+data_params[k]+"'" 
                                               for k in data_params.keys() 
                                               if k != 'other']))
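
To make the filtering step easier to follow, here is a small sketch of the query string that the `join` expression builds for the historical parameters. It uses the same keys and values as `data_params_all[0]` and needs no network access. The string is what gets passed to `DataFrame.query`.

```python
# Same parameters as the 'historical' entry of data_params_all
data_params = {"experiment_id": "historical", "table_id": "day",
               "variable_id": "tas", "member_id": "r1i1p1f1"}

# One "column == 'value'" clause per key, joined with "and"
query = " and ".join(f"{k} == '{v}'" for k, v in data_params.items() if k != "other")

print(query)
```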

In [7]:
dask.config.set({"array.slicing.split_large_chunks": False});

### Use GCSFS to Access Data
Credit to [Kevin Schwarzwald](https://iri.columbia.edu/contact/staff-directory/kevin-schwarzwald/). I adapted this code to run from his [GitHub repository on climate model downloads](https://github.com/ks905383/climate-downloads).<br>
<br>
**Warning**: This block takes a couple minutes to run. There will be a lot of stuff printed in the notebook and some warnings thrown, but don't worry about them. Two progress bars will appear, one for SSP245 and another for the historical models.
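
Two string-handling steps in the loop below are easier to see in isolation. This is a minimal sketch: the store URL is an illustrative example of the bucket layout (an assumption, not copied from the metadata table), and the date is the historical end date from `subset_params`.

```python
import re

# Illustrative store URL; the exact path is an assumption about the bucket layout.
url = "gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/day/tas/gr1/v20180701/"

# Splitting on "/" puts the model name (source_id) at index 6
mod = url.split("/")[6]

# 360-day calendars have no 31st of any month, so the end date moves to the 30th
end_360 = re.sub("-31", "-30", "2014-12-31")

print(mod, end_360)
```

Without the date adjustment, selecting up to December 31 on a 360-day calendar would ask for a date that does not exist in that calendar.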

In [8]:
remove_leaps = True
dss_out = dict()
for data_params in data_params_all:
    dss_out[data_params['experiment_id']] = dict()

    cmip6_sub = cmip6_datasets.query(' and '.join([k+" == '"+data_params[k]+"'" 
                                                   for k in data_params.keys() 
                                                   if k != 'other']))
        
    for url in tqdm(cmip6_sub.zstore.values):
        # The model name is the seventh '/'-separated piece of the store URL
        mod = url.split('/')[6]
        print('processing '+mod+'!')
        
        
        # Open dataset
        ds = xr.open_zarr(fs.get_mapper(url),consolidated=True)

        # Coerce all possible geospatial attribute names to simply 'lat' and 'lon'
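        try:
            ds = ds.rename({'longitude':'lon','latitude':'lat'})
        except (KeyError, ValueError):
            # These names are not in this dataset; nothing to rename
            pass
        try:
            ds = ds.rename({'nav_lon':'lon','nav_lat':'lat'})
        except (KeyError, ValueError):
            # These names are not in this dataset; nothing to rename
            pass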

        # Sort by time, if not sorted 
        if 'time' in subset_params:
            if (ds.time.values != np.sort(ds.time)).any():
                warnings.warn('Model '+ds.source_id+' has an unsorted time dimension.')
                ds = ds.sortby('time')
            
        # Now, save by the subsets desired in subset_params above
        ds_tmp = xa.fix_ds(ds)
        # Subset by time as set in subset_params
        if 'time' in subset_params:
            if (ds.time.max().dt.day==30) | (type(ds.time.values[0]) == cftime._cftime.Datetime360Day): 
                ds_tmp = (ds_tmp.sel(time=slice(subset_params['time'][data_params['experiment_id']][0],
                                        re.sub('-31','-30',subset_params['time'][data_params['experiment_id']][1]))))
            else:
                ds_tmp = (ds_tmp.sel(time=slice(*subset_params['time'][data_params['experiment_id']])))

        # Subset by space as set in subset_params
        if 'lat' in subset_params.keys():
            if not 'lat' in ds[data_params['variable_id']].dims:
                ds_tmp = ds_tmp.where((ds_tmp.lat >= subset_params['lat'][0]) & (ds_tmp.lat <= subset_params['lat'][1]) &
                 (ds_tmp.lon >= subset_params['lon'][0]) & (ds_tmp.lon <= subset_params['lon'][1]),drop=True)
            else:
                ds_tmp = (ds_tmp.sel(lat=slice(*subset_params['lat']),
                                     lon=slice(*subset_params['lon'])))

        # Output
        dss_out[data_params['experiment_id']][mod] = ds_tmp

        # Status update
        print(mod+' processed!')
        
        del ds, ds_tmp
        

  0%|          | 0/42 [00:00<?, ?it/s]

processing GFDL-CM4!
GFDL-CM4 processed!
processing GFDL-CM4!
GFDL-CM4 processed!
processing BCC-CSM2-MR!
BCC-CSM2-MR processed!
processing AWI-CM-1-1-MR!
AWI-CM-1-1-MR processed!
processing BCC-ESM1!
BCC-ESM1 processed!
processing CESM2-WACCM!
CESM2-WACCM processed!
processing CESM2!
CESM2 processed!
processing SAM0-UNICON!




SAM0-UNICON processed!
processing CanESM5!
CanESM5 processed!
processing INM-CM4-8!
INM-CM4-8 processed!
processing MRI-ESM2-0!
MRI-ESM2-0 processed!
processing INM-CM5-0!
INM-CM5-0 processed!
processing IPSL-CM6A-LR!
IPSL-CM6A-LR processed!
processing MPI-ESM-1-2-HAM!
MPI-ESM-1-2-HAM processed!
processing MPI-ESM1-2-LR!
MPI-ESM1-2-LR processed!
processing MPI-ESM1-2-HR!
MPI-ESM1-2-HR processed!
processing GFDL-ESM4!
GFDL-ESM4 processed!
processing NESM3!
NESM3 processed!
processing NorESM2-LM!
NorESM2-LM processed!
processing FGOALS-g3!
FGOALS-g3 processed!
processing MIROC6!
MIROC6 processed!
processing FGOALS-f3-L!
FGOALS-f3-L processed!
processing ACCESS-CM2!
ACCESS-CM2 processed!
processing NorESM2-MM!
NorESM2-MM processed!
processing ACCESS-ESM1-5!
ACCESS-ESM1-5 processed!
processing CESM2-WACCM-FV2!




CESM2-WACCM-FV2 processed!
processing CESM2-FV2!
CESM2-FV2 processed!
processing KIOST-ESM!
KIOST-ESM processed!
processing IITM-ESM!
IITM-ESM processed!
processing AWI-ESM-1-1-LR!
AWI-ESM-1-1-LR processed!
processing EC-Earth3-Veg-LR!
EC-Earth3-Veg-LR processed!
processing EC-Earth3-Veg!
EC-Earth3-Veg processed!
processing EC-Earth3!
EC-Earth3 processed!
processing KACE-1-0-G!
KACE-1-0-G processed!
processing CMCC-CM2-SR5!
CMCC-CM2-SR5 processed!
processing EC-Earth3-AerChem!
EC-Earth3-AerChem processed!
processing TaiESM1!
TaiESM1 processed!
processing NorCPM1!
NorCPM1 processed!
processing IPSL-CM5A2-INCA!
IPSL-CM5A2-INCA processed!
processing CMCC-CM2-HR4!
CMCC-CM2-HR4 processed!
processing EC-Earth3-CC!
EC-Earth3-CC processed!
processing CMCC-ESM2!
CMCC-ESM2 processed!


  0%|          | 0/29 [00:00<?, ?it/s]

processing GFDL-CM4!
GFDL-CM4 processed!
processing GFDL-CM4!
GFDL-CM4 processed!
processing GFDL-ESM4!
GFDL-ESM4 processed!
processing BCC-CSM2-MR!
BCC-CSM2-MR processed!
processing CanESM5!
CanESM5 processed!
processing AWI-CM-1-1-MR!
AWI-CM-1-1-MR processed!
processing MRI-ESM2-0!
MRI-ESM2-0 processed!
processing INM-CM4-8!
INM-CM4-8 processed!
processing IPSL-CM6A-LR!
IPSL-CM6A-LR processed!
processing INM-CM5-0!
INM-CM5-0 processed!
processing MPI-ESM1-2-LR!
MPI-ESM1-2-LR processed!
processing MPI-ESM1-2-HR!
MPI-ESM1-2-HR processed!
processing NESM3!
NESM3 processed!
processing CESM2-WACCM!
CESM2-WACCM processed!
processing FGOALS-g3!
FGOALS-g3 processed!
processing MIROC6!
MIROC6 processed!
processing NorESM2-LM!
NorESM2-LM processed!
processing ACCESS-CM2!
ACCESS-CM2 processed!
processing NorESM2-MM!
NorESM2-MM processed!
processing KIOST-ESM!
KIOST-ESM processed!
processing EC-Earth3-Veg!
EC-Earth3-Veg processed!
processing EC-Earth3!
EC-Earth3 processed!
processing KACE-1-0-G!

### Subset to Closest Pixel to New Orleans
This block does some geospatial slicing to get the pixel nearest to New Orleans. The lat-lon of New Orleans is roughly 30 N, 90 W (a longitude of -90). The code subtracts 30 from each latitude and -90 from each longitude, takes the absolute value, and finds the index of the minimum. The lat and lon coordinates in each dataset represent the south and west corner of a box, which represents a pixel in the model space. Each model is gridded slightly differently, so the pixels are different sizes for different models. They average about 130 x 130 km (80 x 80 miles), but can be as big as 250 x 250 km (150 x 150 miles).

In [9]:
for exp in dss_out:
    for mod in dss_out[exp]:
        dss_out[exp][mod] = dss_out[exp][mod].isel(lat=np.abs(dss_out[exp][mod].lat-30).argmin(),
                                                    lon=np.abs(dss_out[exp][mod].lon-(-90)).argmin())
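
The nearest-pixel logic can be illustrated without any model data. This is a pure-Python stand-in for the `np.abs(...).argmin()` calls above, using hypothetical grid coordinates. It also wraps longitudes from the 0 to 360 convention to -180 to 180, on the assumption that this has to happen before comparing against -90 (in this notebook that step is left to the `xa.fix_ds` call in the download loop).

```python
# Hypothetical grid coordinates for one model, in degrees; real grids differ by model.
lats = [28.0, 29.5, 31.0, 32.5]
lons_0_360 = [267.5, 269.0, 270.5, 272.0]

# Wrap 0..360 longitudes into the -180..180 range
lons = [((lon + 180) % 360) - 180 for lon in lons_0_360]

def nearest_index(values, target):
    """Index of the entry with the smallest absolute distance to target."""
    return min(range(len(values)), key=lambda i: abs(values[i] - target))

i_lat = nearest_index(lats, 30)
i_lon = nearest_index(lons, -90)

print(lons)
print(i_lat, i_lon)
```

If two pixels are equally close, both this helper and `argmin` return the first one.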

### Load Simulations
At the moment, everything has been loaded lazily using dask. That basically means that the computer knows where to look to get the data and what to do with it, but hasn't done it yet. Running the .compute() operation on a dask object will load the data and get some real numbers. But because the volume of data is so large, this will take a few hours. 

#### SSP2-4.5
SSP245 is the middle-of-the-road scenario, where we invest a minimum effort to curb emissions. We are basically on this path at the moment. We drop KACE-1-0-G because it returned an empty dataset.

In [10]:
keys=list(dss_out['ssp245'].keys())
keys.remove('KACE-1-0-G')

In [12]:
ssp245={}
for key in keys:
    ssp245[key]=dss_out['ssp245'][key]['tas'].compute()
    print(key)

GFDL-CM4
GFDL-ESM4
BCC-CSM2-MR
CanESM5
AWI-CM-1-1-MR
MRI-ESM2-0
INM-CM4-8
IPSL-CM6A-LR
INM-CM5-0
MPI-ESM1-2-LR
MPI-ESM1-2-HR
NESM3
CESM2-WACCM
FGOALS-g3
MIROC6
NorESM2-LM
ACCESS-CM2
NorESM2-MM
KIOST-ESM
EC-Earth3-Veg
EC-Earth3
CMCC-CM2-SR5
IITM-ESM
EC-Earth3-Veg-LR
EC-Earth3-CC
CMCC-ESM2
TaiESM1


#### Historical
It's necessary to pull down historical data for our projections. For each simulation, we take the temperature difference from the model's own historical value, which gives a temperature change. This is preferred to interpreting temperatures from the model output literally. 

In [13]:
keys=list(dss_out['historical'].keys())
keys.remove('KACE-1-0-G')

In [14]:
historical={}
for key in keys:
    historical[key]=dss_out['historical'][key]['tas'].compute()
    print(key)

GFDL-CM4
BCC-CSM2-MR
AWI-CM-1-1-MR
BCC-ESM1
CESM2-WACCM
CESM2
SAM0-UNICON
CanESM5
INM-CM4-8
MRI-ESM2-0
INM-CM5-0
IPSL-CM6A-LR
MPI-ESM-1-2-HAM
MPI-ESM1-2-LR
MPI-ESM1-2-HR
GFDL-ESM4
NESM3
NorESM2-LM
FGOALS-g3
MIROC6
FGOALS-f3-L
ACCESS-CM2
NorESM2-MM
ACCESS-ESM1-5
CESM2-WACCM-FV2
CESM2-FV2
KIOST-ESM
IITM-ESM
AWI-ESM-1-1-LR
EC-Earth3-Veg-LR
EC-Earth3-Veg
EC-Earth3
CMCC-CM2-SR5
EC-Earth3-AerChem
TaiESM1
NorCPM1
IPSL-CM5A2-INCA
CMCC-CM2-HR4
EC-Earth3-CC
CMCC-ESM2


## Structure and Export
This code restructures these xarray objects into a simple table, with one time series column per model, and exports one CSV file per scenario. 

In [15]:
dataframes={}
scenarios_keys=['historical','ssp245']
scenarios_data={'historical':historical,
                'ssp245':ssp245}

In [19]:
out_path='/Users/danielbabin/GitHub/Gumbo_Weather/Data/'

In [20]:
for scen in scenarios_keys:
    experiments=list(scenarios_data[scen].keys())
    # Start the table from the first model; its time axis defines the index
    dataframes[scen]=scenarios_data[scen][experiments[0]].to_dataframe()
    # The data column is named 'tas'; rename returns a copy, so assign it back
    dataframes[scen]=dataframes[scen].rename(columns={'tas':experiments[0]})
    for exp in experiments:
        values=scenarios_data[scen][exp].values
        # Only add models whose series has the same length as the first model's
        if len(values)==len(dataframes[scen]):
            dataframes[scen][exp]=values
    # Convert the cftime index to a pandas DatetimeIndex
    dataframes[scen]['date']=dataframes[scen].index.to_datetimeindex()
    dataframes[scen]=dataframes[scen].set_index('date')
    dataframes[scen].to_csv(out_path+scen+'.csv')
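
One consequence of the length check in the loop above is worth spelling out: a model is added to the table only when its series has as many time steps as the first model in the dictionary. Models on a different calendar will generally have a different number of days over the same period, so they may be left out without any message. The sketch below counts the days in 1985-2014 under three calendar conventions and applies the same check to hypothetical model names.

```python
from datetime import date

# Days in 1985-2014 under three calendar conventions
n_standard = (date(2014, 12, 31) - date(1985, 1, 1)).days + 1  # with leap days
n_noleap = 30 * 365                                            # every year 365 days
n_360 = 30 * 360                                               # twelve 30-day months

# Hypothetical series lengths per model; the names are placeholders
lengths = {"model_a": n_noleap, "model_b": n_standard, "model_c": n_noleap}

# Same rule as the export loop: keep models matching the first model's length
first = next(iter(lengths))
kept = [m for m, n in lengths.items() if n == lengths[first]]

print(n_standard, n_noleap, n_360)
print(kept)
```

Printing the names that fail the check, or reindexing every model onto a common calendar first, would make any dropped models visible.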

