# Calculate Average Profiles

This notebook creates the average profiles required to reproduce Figure 3 of *Heuristic Methods for Determining the Number of Classes in Unsupervised Classification of Climate Models*, E. Boland et al. 2022 (DOI to follow). It requires cluster_utils.py and input data files accessed via the googleapi CMIP6 store (see cluster_utils.py for more information).

There are two options:
- calculate from already trained models (uses the saved models in model/)
- recreate from scratch (trains new models and saves them to model/)

Outputs are stored in data/\[ensemble\]/\[nclasses\]/avg.obj
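Which option applies depends on whether the pickled models are already on disk. The sketch below checks for the files that the Option 1 cell opens (`model/<member>/pca.obj` and `model/<member>/<nclasses>/gmm.obj`). The helper name `missing_model_files` is hypothetical and is not part of cluster_utils.py.

```python
from pathlib import Path


def missing_model_files(ids, classes, root="model"):
    """Return the pickled model files that Option 1 would need but cannot find."""
    root = Path(root)
    expected = []
    for m_id in ids:
        # One PCA model per ensemble member
        expected.append(root / m_id / "pca.obj")
        for n_classes in classes:
            # One GMM per ensemble member and number of classes
            expected.append(root / m_id / str(n_classes) / "gmm.obj")
    return [path for path in expected if not path.is_file()]


missing = missing_model_files(["r1i1p1f2", "r2i1p1f2"], [7, 8])
if missing:
    print("Trained models are incomplete; Option 2 is needed. Missing files:")
    for path in missing:
        print("  ", path)
else:
    print("All trained models found; Option 1 can be used.")
```

An empty list means every model needed for the chosen members and class counts is present.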

Please attribute any plots or code from this notebook using the Zenodo DOI (to come).

Updated Nov 2022
E Atkinson & E Boland [emmomp@bas.ac.uk](mailto:emmomp@bas.ac.uk)

In [1]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:41339")
client

Client: direct connection to tcp://127.0.0.1:41339
Dashboard: http://127.0.0.1:8787/status
Cluster: 4 workers, 4 threads in total (1 per worker), 14.65 GiB total memory (3.66 GiB per worker)


In [1]:
import numpy as np
import xarray as xr

import os
import pickle

import cluster_utils as flt

### User options
Leave these as they are to reproduce the results in the paper.

In [7]:
# Number of classes
classes = [7, 8]
# Time range
tslice = slice('1965-01', '1994-12')
# Ensemble member IDs
ids = ['r1i1p1f2', 'r2i1p1f2']
npca = 3  # number of PCA components
ntrain = 7000  # number of profiles per month to use in the training dataset
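As a rough guide to the size of the training set these settings imply, the sketch below counts the months in the time range and multiplies by the number of profiles per month. It assumes monthly data and that every month supplies the full `ntrain` profiles, so the result is an upper bound rather than a figure taken from cluster_utils.py. The helper `months_in_range` is hypothetical.

```python
def months_in_range(start, end):
    """Count calendar months from 'YYYY-MM' start to 'YYYY-MM' end, inclusive."""
    y0, m0 = (int(part) for part in start.split("-"))
    y1, m1 = (int(part) for part in end.split("-"))
    return (y1 - y0) * 12 + (m1 - m0) + 1


n_months = months_in_range("1965-01", "1994-12")
ntrain = 7000  # profiles per month, as set in the options above
max_training_profiles = n_months * ntrain  # assumes every month supplies ntrain profiles

print(n_months, max_training_profiles)  # prints: 360 2520000
```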

Uncomment the following two lines if you need to generate mask.npy:

In [None]:
#data = flt.retrieve_profiles(timeRange = slice('1995-01', '1995-02'))
#np.save('data/mask', data['n'])
mask = np.load('data/mask.npy', allow_pickle=True)

### Option 1: Generate average profiles for the chosen ensemble members and classes from already trained models

In [None]:
for m_id in ids:

    print('Starting {}'.format(m_id))
    path_id = 'model/{}'.format(m_id)
    # Load the saved PCA model for this ensemble member
    with open('{}/pca.obj'.format(path_id), 'rb') as file:
        pca = pickle.load(file)
    # Retrieve all Southern Ocean data
    options = {'memberId': m_id}
    data_full = flt.generate_fullset(timeRange=tslice, mask=mask, options=options, n_components=npca, pca=pca)
    print('Finished setup for {}'.format(m_id))

    for n_classes in classes:
        path_n = 'model/{}/{}'.format(m_id, n_classes)
        path_data = 'data/{}/{}'.format(m_id, n_classes)
        # Create the output directory if it does not exist yet
        os.makedirs(path_data, exist_ok=True)
        # Open the GMM generated from the training set
        with open('{}/gmm.obj'.format(path_n), 'rb') as file:
            gmm = pickle.load(file)
        # Classify the full dataset
        data_classes = flt.gmm_classify(data_full, gmm).compute()
        # Calculate average profiles for each class
        avg_prof = flt.avg_profiles(data_full, data_classes, n_classes)
        # Save the average profiles
        with open('{}/avg.obj'.format(path_data), 'wb') as file:
            pickle.dump(avg_prof, file)

print('Done!')
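The saved average profiles can be read back with `pickle`, for example when plotting. The sketch below follows the `data/<member>/<nclasses>/avg.obj` layout used by the cell above. The helper `load_avg_profiles` is hypothetical, and the demonstration writes a placeholder dictionary to a temporary directory because the type of the object returned by `flt.avg_profiles` is not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def load_avg_profiles(m_id, n_classes, root="data"):
    """Unpickle the average profiles saved for one ensemble member and class count."""
    path = Path(root) / m_id / str(n_classes) / "avg.obj"
    with open(path, "rb") as file:
        return pickle.load(file)


# Demonstration with a placeholder object in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "r1i1p1f2" / "7"
    out_dir.mkdir(parents=True)
    placeholder = {"note": "stand-in for the object returned by flt.avg_profiles"}
    with open(out_dir / "avg.obj", "wb") as file:
        pickle.dump(placeholder, file)
    restored = load_avg_profiles("r1i1p1f2", 7, root=tmp)

print(restored == placeholder)  # prints: True
```

With real outputs, `load_avg_profiles('r1i1p1f2', 7)` would read `data/r1i1p1f2/7/avg.obj`. Unpickling needs the same libraries that were available when the file was written, and pickle files should only be loaded from trusted sources.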

### Option 2: Generate average profiles for the chosen ensemble members and classes, training the models from scratch

In [None]:
for m_id in ids:

    print('Starting {}'.format(m_id))
    path_id = 'model/{}'.format(m_id)
    os.makedirs(path_id, exist_ok=True)
    # Select this ensemble member (previously undefined inside this loop)
    options = {'memberId': m_id}
    # Generate the training set and PCA model
    data_train, pca = flt.generate_trainingset(timeRange=tslice, mask=mask, options=options, n_components=npca, N=ntrain)
    with open('{}/pca.obj'.format(path_id), 'wb') as file:
        pickle.dump(pca, file)
    # Load the full Southern Ocean data to classify
    data_full = flt.generate_fullset(timeRange=tslice, mask=mask, options=options, n_components=npca, pca=pca)
    print('Finished setup for {}'.format(m_id))

    for n_classes in classes:
        path_n = 'model/{}/{}'.format(m_id, n_classes)
        path_data = 'data/{}/{}'.format(m_id, n_classes)
        # Create the model and output directories if they do not exist yet
        os.makedirs(path_n, exist_ok=True)
        os.makedirs(path_data, exist_ok=True)
        # Train the GMM on the training set and save it
        gmm = flt.train_gmm(data_train, n_classes)
        with open('{}/gmm.obj'.format(path_n), 'wb') as file:
            pickle.dump(gmm, file)
        # Classify the full dataset
        data_classes = flt.gmm_classify(data_full, gmm).compute()
        # Calculate average profiles for each class
        avg_prof = flt.avg_profiles(data_full, data_classes, n_classes)
        # Save the average profiles
        with open('{}/avg.obj'.format(path_data), 'wb') as file:
            pickle.dump(avg_prof, file)

print('Done!')