# Step 1: Train Models

This notebook will:
- Train 2- to 20-class GMM models for three UK-ESM historical ensemble members and calculate the BIC, AIC and SIL (silhouette) scores for each. This is required to reproduce Figures 4 & 5 from *A Novel Heuristic Method for Detecting Overfit in Unsupervised Classification of Climate Models*, E. Boland et al. 2023 (doi to follow).
- Train 7 to 9 class GMM models for all ten UK-ESM historical ensemble members. This is required to reproduce Figures 2, 3, 6, and 7 from *A Novel Heuristic Method for Detecting Overfit in Unsupervised Classification of Climate Models*, E. Boland et al. 2023 (doi to follow). 

This requires cluster_utils.py and the input data files, accessed either via the Google API CMIP6 store or via the CEDA archives on JASMIN (see cluster_utils.py for more info).

Outputs are stored in \[model\]/\[ensemble\]/\[nclasses\].
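
As an illustration of this layout, the sketch below builds the path to a saved object and loads it with `pickle`. The helper name `load_saved` is hypothetical (it is not part of cluster_utils.py). The file names `pca.obj`, `gmm.obj` and `bic.obj` are the ones written later in this notebook, with `pca.obj` stored one level up, in \[model\]/\[ensemble\].

```python
import os
import pickle


def load_saved(model_folder, member_id, name, n_classes=None):
    """Load a pickled object written by this notebook (hypothetical helper).

    With n_classes=None the file is read from [model]/[ensemble]/ (e.g. pca.obj);
    otherwise from [model]/[ensemble]/[nclasses]/ (e.g. gmm.obj, bic.obj).
    """
    parts = [model_folder, member_id]
    if n_classes is not None:
        parts.append(str(n_classes))
    path = os.path.join(*parts, name)
    with open(path, 'rb') as file:
        return pickle.load(file)
```

For example, `load_saved('model', 'r1i1p1f2', 'gmm.obj', n_classes=8)` would return the 8-class model for the first ensemble member, provided the training cells below have been run.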

Please attribute any plots or code from this notebook using the DOI from Zenodo: to come

Updated Jun 2023
E Atkinson & E Boland [emmomp@bas.ac.uk](mailto:emmomp@bas.ac.uk)

In [1]:
from dask.distributed import Client

client = Client("tcp://127.0.0.1:32937")
client

Connection method: Direct
Dashboard: http://127.0.0.1:8787/status
Comm: tcp://127.0.0.1:32937
Workers: 5
Total threads: 5
Total memory: 40.00 GiB
Started: 13 minutes ago

In [3]:
import numpy as np
import os
import pickle
import cluster_utils as flt
from sklearn import metrics

### User options
Leave as is to recreate the paper

In [4]:
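# Root folder for all saved models and scores
model_folder = 'model'
# Largest number of classes to fit (models are fitted for 2..max_classes)
max_classes = 20
# Time range
tslice = slice('2001-01', '2017-12')
# Depth range
levSel = slice(5, 2000)
# Ensemble members to fit
ids = ['r1i1p1f2', 'r2i1p1f2', 'r3i1p1f2']
# Number of profiles per month to use in training dataset
ntrain = 3000
# Number of PCA components (passed as n_components)
npca = 3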

Uncomment the following three lines if you need to generate mask.npy:

In [5]:
#data = flt.retrieve_profiles(timeRange = slice('1995-01', '1995-02'),levSel=levSel)
#np.save('data/mask', data['n'])
#mask=data['n']
mask = np.load('data/mask.npy', allow_pickle=True)

### Fit 2-20 class models for each ensemble member
Saves each individual PCA model, GMM model and BIC/AIC/SIL score to \[model_folder\]

Saves all BICs/AICs/SILs to \[model_folder\]/\[BICs/AICs/SILs\]2-20.obj
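
Each entry in these dictionaries is an array of length `max_classes - 1`, in which index 0 corresponds to 2 classes (the training loop enumerates `range(2, max_classes + 1)`). A minimal sketch of mapping a stored score array back to class counts follows. The helper name `scores_by_class` is hypothetical and the numbers are made up for illustration.

```python
def scores_by_class(scores, min_classes=2):
    """Map a score array (index 0 -> min_classes) to {n_classes: score}."""
    return {min_classes + i: float(s) for i, s in enumerate(scores)}


# Made-up BIC values for 2..6 classes, for illustration only
example_bic = [105.0, 98.5, 97.2, 99.1, 101.3]
by_class = scores_by_class(example_bic)
# Number of classes with the lowest BIC in this made-up example
best_n = min(by_class, key=by_class.get)
print(best_n)  # 4
```

The same helper could be applied to an array unpickled from `BICs2-20.obj`, e.g. `scores_by_class(BICs['r1i1p1f2'])`. Taking the minimum is only one common way of reading a BIC curve and is not necessarily the selection method used in the paper.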

In [7]:
BICs = {}
AICs = {}
SILs = {}
for m_id in ids:
    path_id = '{}/{}'.format(model_folder, m_id)
    if not os.path.isdir(path_id):
        os.makedirs(path_id)
    print('Starting {}'.format(m_id))
    options = {'memberId' : m_id}
    
    # Load training set
    [data,pca] = flt.generate_trainingset(timeRange = tslice, mask=mask, options=options,N=ntrain,n_components=npca,levSel=levSel)
    
    bic = np.zeros(max_classes-1)
    aic = bic.copy()
    sil = bic.copy()
    
    with open('{}/pca.obj'.format(path_id), 'wb') as file:
        pickle.dump(pca, file)
        
    print('Finished setup for {}'.format(m_id))
    
    for iin,n_classes in enumerate(range(2, max_classes+1)):
        
        path_n = '{}/{}/{}'.format(model_folder, m_id, n_classes)
        
        if not os.path.isdir(path_n):
            os.makedirs(path_n)
            
        gmm = flt.train_gmm(data, n_classes)
        with open('{}/gmm.obj'.format(path_n), 'wb') as file:
            pickle.dump(gmm, file)
        
        bic[iin] = gmm.bic(data)
        with open('{}/bic.obj'.format(path_n), 'wb') as file:
            pickle.dump(bic[iin],file)       

        aic[iin] = gmm.aic(data)
        with open('{}/aic.obj'.format(path_n), 'wb') as file:
            pickle.dump(aic[iin],file)     
            
        # Calculate silhouette score for 10000 point sample        
        inds=np.random.randint(0,data.shape[0],10000)
        labels=flt.gmm_classify(data[inds,:],gmm)
        sil[iin]=metrics.silhouette_score(data[inds,:],labels,n_jobs=-1)
        sample_silhouette_values = metrics.silhouette_samples(data[inds,:],labels,n_jobs=-1)
        with open('{}/sil.obj'.format(path_n), 'wb') as file:
            pickle.dump(sil[iin],file)
        with open('{}/sil_vals.obj'.format(path_n), 'wb') as file:
            pickle.dump(sample_silhouette_values,file)
        with open('{}/sil_labels.obj'.format(path_n), 'wb') as file:
            pickle.dump(labels,file)                
        
        print('Finished {} with {} classes'.format(m_id, n_classes))
        
    BICs[m_id] = bic
    AICs[m_id] = aic
    SILs[m_id] = sil
    
with open('{}/BICs2-20.obj'.format(model_folder), 'wb') as file:
    pickle.dump(BICs, file)
with open('{}/AICs2-20.obj'.format(model_folder), 'wb') as file:
    pickle.dump(AICs, file)
with open('{}/SILs2-20.obj'.format(model_folder), 'wb') as file:
    pickle.dump(SILs, file)

print('Done!')

Starting r1i1p1f2
Finished setup for r1i1p1f2
Finished r1i1p1f2 with 2 classes
Finished r1i1p1f2 with 3 classes
Finished r1i1p1f2 with 4 classes
Finished r1i1p1f2 with 5 classes
Finished r1i1p1f2 with 6 classes
Finished r1i1p1f2 with 7 classes
Finished r1i1p1f2 with 8 classes
Finished r1i1p1f2 with 9 classes
Finished r1i1p1f2 with 10 classes
Finished r1i1p1f2 with 11 classes
Finished r1i1p1f2 with 12 classes
Finished r1i1p1f2 with 13 classes
Finished r1i1p1f2 with 14 classes
Finished r1i1p1f2 with 15 classes
Finished r1i1p1f2 with 16 classes
Finished r1i1p1f2 with 17 classes
Finished r1i1p1f2 with 18 classes
Finished r1i1p1f2 with 19 classes
Finished r1i1p1f2 with 20 classes
Starting r2i1p1f2
Finished setup for r2i1p1f2
Finished r2i1p1f2 with 2 classes
Finished r2i1p1f2 with 3 classes
Finished r2i1p1f2 with 4 classes
Finished r2i1p1f2 with 5 classes
Finished r2i1p1f2 with 6 classes
Finished r2i1p1f2 with 7 classes
Finished r2i1p1f2 with 8 classes
Finished r2i1p1f2 with 9 classes
Finish
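
For reference, BIC and AIC both combine the model log-likelihood with a penalty on the number of free parameters `k`: `BIC = k ln(n) - 2 ln(L)` and `AIC = 2k - 2 ln(L)`, where `n` is the number of samples. The sketch below counts the free parameters of a GMM assuming full covariance matrices (the covariance type used by `flt.train_gmm` is not shown in this notebook, so this is an assumption) and evaluates both criteria for a made-up log-likelihood and sample size. The function names are hypothetical.

```python
import math


def gmm_n_parameters(n_classes, n_features):
    """Free parameters of a GMM, assuming full covariance matrices."""
    means = n_classes * n_features
    covariances = n_classes * n_features * (n_features + 1) // 2
    weights = n_classes - 1  # weights sum to 1, so one is not free
    return means + covariances + weights


def bic_aic(log_likelihood, n_params, n_samples):
    """Return (BIC, AIC) from a total log-likelihood."""
    bic = n_params * math.log(n_samples) - 2.0 * log_likelihood
    aic = 2.0 * n_params - 2.0 * log_likelihood
    return bic, aic


# With npca = 3 features, each extra class adds 10 free parameters
k8 = gmm_n_parameters(8, 3)  # 79
# Made-up log-likelihood and sample size, for illustration only
bic, aic = bic_aic(log_likelihood=-5000.0, n_params=k8, n_samples=1000)
```

Because the BIC penalty grows with `ln(n)` while the AIC penalty does not, BIC penalises additional classes more strongly than AIC once `n` exceeds about 8 samples (`ln(n) > 2`).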

### Fit 7-9 class models for additional ensemble members

In [9]:
ids = ['r4i1p1f2', 'r8i1p1f2', 'r16i1p1f2', 'r19i1p1f2'] 
classes=[7,8,9]

for m_id in ids:
    path_id = '{}/{}'.format(model_folder, m_id)
    if not os.path.isdir(path_id):
        os.makedirs(path_id)
    print('Starting {}'.format(m_id))
    options = {'memberId' : m_id}
    
    # Load training set
    [data,pca] = flt.generate_trainingset(timeRange = tslice, mask=mask, options=options,N=ntrain,n_components=npca,levSel=levSel)

    with open('{}/pca.obj'.format(path_id), 'wb') as file:
        pickle.dump(pca, file)
        
    print('Finished setup for {}'.format(m_id))
    
    for n_classes in classes:
        
        path_n = '{}/{}/{}'.format(model_folder, m_id, n_classes)
        
        if not os.path.isdir(path_n):
            os.makedirs(path_n)
            
        gmm = flt.train_gmm(data, n_classes)
        with open('{}/gmm.obj'.format(path_n), 'wb') as file:
            pickle.dump(gmm, file)
                     
        print('Finished {} with {} classes'.format(m_id, n_classes))

print('Done!')

Starting r4i1p1f2
Finished setup for r4i1p1f2
Finished r4i1p1f2 with 7 classes
Finished r4i1p1f2 with 8 classes
Finished r4i1p1f2 with 9 classes
Starting r8i1p1f2
Finished setup for r8i1p1f2
Finished r8i1p1f2 with 7 classes
Finished r8i1p1f2 with 8 classes
Finished r8i1p1f2 with 9 classes
Starting r16i1p1f2
Finished setup for r16i1p1f2
Finished r16i1p1f2 with 7 classes
Finished r16i1p1f2 with 8 classes
Finished r16i1p1f2 with 9 classes
Starting r19i1p1f2
Finished setup for r19i1p1f2
Finished r19i1p1f2 with 7 classes
Finished r19i1p1f2 with 8 classes
Finished r19i1p1f2 with 9 classes
Done!


2023-06-09 11:53:30,119 - distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
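
Before moving on to later steps, it can be useful to confirm that every expected model file was written. The sketch below lists any missing `gmm.obj` files for a set of ensemble members and class counts. The helper name `missing_models` is hypothetical, and the path layout is the \[model_folder\]/\[ensemble\]/\[nclasses\] one used above.

```python
import os


def missing_models(model_folder, member_ids, classes, name='gmm.obj'):
    """Return paths of expected model files that do not exist (hypothetical helper)."""
    missing = []
    for m_id in member_ids:
        for n_classes in classes:
            path = os.path.join(model_folder, m_id, str(n_classes), name)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

For example, `missing_models('model', ['r4i1p1f2', 'r8i1p1f2', 'r16i1p1f2', 'r19i1p1f2'], [7, 8, 9])` would return an empty list if all twelve models from the cell above were saved.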
