### CMIP6 Data Issues 
1. Modelling centers report issues with their published data on the ES-DOC errata pages:
      [ES-DOC ERRATA]( https://errata.es-doc.org )
2. A separate list of exceptions is kept as we process the data (concatenating the NetCDF files and saving them as Zarr stores):
      [ESGF to GCS Issues]( https://docs.google.com/spreadsheets/d/e/2PACX-1vRxKgz1xCH7zhUoDnl_llgEvbj2ssxoJiTUdbkHkkfWiCKU8EfZtPerar3ELjoIzAda5giR06QvbWGE/pubhtml?gid=128595157&single=true )
3. Issues with the existing Google Cloud collection are crowdsourced here:
      [GCS Issues]( https://tinyurl.com/y5cw76at )

### This notebook updates the list of processing exceptions in Issue 2.
1. This does not work very well; there has to be a better way.
2. For now, just keep the local `csv/exceptions.csv` file up to date and ignore the cloud version.

In [1]:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import pandas as pd
from datetime import datetime
from ast import literal_eval
import os
import numpy as np
gspread.__version__

'3.1.0'

In [2]:
json_keyfile = '/home/naomi/cmip6-zarr/json/Pangeo Hackathon-e48a41b13c91.json'
scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name(json_keyfile, scope)
gc = gspread.authorize(credentials)

In [3]:
sheet_name = "CMIP6_DataExceptions (Responses)"
sh = gc.open(sheet_name)
print(sh.worksheets())

[<Worksheet 'Form Responses 1' id:252604281>, <Worksheet 'NH_additions' id:128595157>]


In [4]:
wks = sh.worksheet("NH_additions")

data = wks.get_all_values()
headers = data.pop(0)

df_cloud = pd.DataFrame(data, columns=headers)
#df_cloud['name'] = ['/'.join(s[:-3]) for s in df_cloud.values]
df_local = pd.read_csv('csv/exceptions.csv', na_filter= False)
#df_local['name'] = ['/'.join(s[:-3]) for s in df_local.values]

# append rows that exist in df_local but are missing from the cloud sheet
for item,row in enumerate(df_local.values):
    name = row[-1]  # the last column is the unique 'name' key
    #print(item,name)
    df_match = df_cloud[df_cloud.name==name]
    if len(df_match)==0:
        print(item,name,' was not in df_cloud')
        wks.append_row(list(row))

In [5]:
df_cloud

Unnamed: 0,source_id,experiment_id,member_id,table_id,variable_id,grid_label,reason_code,reason_txt,status,name
0,ACCESS-ESM1-5,esm-hist,r1i1p1f1,Omon,fgco2,gn,local,NetCDF: Access failure,2020-03-06,ACCESS-ESM1-5/esm-hist/r1i1p1f1/Omon/fgco2/gn/...
1,ACCESS-ESM1-5,piControl,r1i1p1f1,Omon,umo,gn,noUse,HDF error,2020-02-16,ACCESS-ESM1-5/piControl/r1i1p1f1/Omon/umo/gn/n...
2,BCC-CSM2-MR,esm-hist,r1i1p1f1,SImon,siconc,gn,local,NetCDF: Access failure,2020-03-05,BCC-CSM2-MR/esm-hist/r1i1p1f1/SImon/siconc/gn/...
3,BCC-CSM2-MR,historical,r1i1p1f1,6hrLev,hus,gn,noUse,all data after 2012 has zeros in the dimensions,,BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/hus/gn/...
4,BCC-CSM2-MR,historical,r1i1p1f1,SImon,siconc,gn,local,NetCDF: Access failure,2020-03-05,BCC-CSM2-MR/historical/r1i1p1f1/SImon/siconc/g...
...,...,...,...,...,...,...,...,...,...,...
370,GFDL-ESM4,historical,r1i1p1f1,Oyr,parag,gr,local,NetCDF: Access failure,2020-03-08,GFDL-ESM4/historical/r1i1p1f1/Oyr/parag/gr/local
371,UKESM1-0-LL,piControl,r1i1p1f2,CFmon,clisccp,gn,local,NetCDF: Access failure,2020-03-08,UKESM1-0-LL/piControl/r1i1p1f2/CFmon/clisccp/g...
372,GFDL-ESM4,historical,r1i1p1f1,Oyr,nh4,gr,local,NetCDF: Access failure,2020-03-08,GFDL-ESM4/historical/r1i1p1f1/Oyr/nh4/gr/local
373,MPI-ESM1-2-HR,piControl,r1i1p1f1,Amon,rlutcs,gn,local,I/O failure,2020-03-09,MPI-ESM1-2-HR/piControl/r1i1p1f1/Amon/rlutcs/g...


In [6]:
wks = sh.worksheet("NH_additions")

data = wks.get_all_values()
headers = data.pop(0)

df_cloud = pd.DataFrame(data, columns=headers)
#df_cloud['name'] = ['/'.join(s[:-3]) for s in df_cloud.values]
df_local = pd.read_csv('csv/exceptions.csv', na_filter= False)
#df_local['name'] = ['/'.join(s[:-3]) for s in df_local.values]

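# delete rows from the cloud sheet that are no longer in df_local
n_deleted = 0  # rows below a deleted row shift up by one, so track the offset
for item,row in enumerate(df_cloud.values):
    name = row[-1]
    #print(item,name)
    df_match = df_local[df_local.name==name]
    if len(df_match)==0:
        print(item,name,' is not in df_local')
        # sheet row = data index + 2 (header row, 1-based rows), minus rows already deleted
        wks.delete_row(item+2-n_deleted)
        n_deleted += 1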

0 ACCESS-ESM1-5/esm-hist/r1i1p1f1/Omon/fgco2/gn/local  is not in df_local
1 ACCESS-ESM1-5/piControl/r1i1p1f1/Omon/umo/gn/noUse  is not in df_local
138 EC-Earth3/historical/r2i1p1f1/Amon/all/gr/noUse  is not in df_local
158 EC-Earth3-LR/piControl/r1i1p1f1/Amon/clt/gr/noUse  is not in df_local
159 EC-Earth3-LR/piControl/r1i1p1f1/Amon/evspsbl/gr/noUse  is not in df_local
160 EC-Earth3-LR/piControl/r1i1p1f1/Amon/hfls/gr/noUse  is not in df_local
161 EC-Earth3-LR/piControl/r1i1p1f1/Amon/hfss/gr/noUse  is not in df_local
162 EC-Earth3-LR/piControl/r1i1p1f1/Amon/pr/gr/noUse  is not in df_local
163 EC-Earth3-LR/piControl/r1i1p1f1/Amon/prsn/gr/noUse  is not in df_local
164 EC-Earth3-LR/piControl/r1i1p1f1/Amon/psl/gr/noUse  is not in df_local
165 EC-Earth3-LR/piControl/r1i1p1f1/Amon/rlds/gr/noUse  is not in df_local
166 EC-Earth3-LR/piControl/r1i1p1f1/Amon/rlus/gr/noUse  is not in df_local
167 EC-Earth3-LR/piControl/r1i1p1f1/Amon/rlut/gr/noUse  is not in df_local
168 EC-Earth3-LR/piControl/r1i1p
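
Deleting sheet rows by position is sensitive to index shifts: once a row is removed, every row below it moves up by one. The following is a minimal sketch, not part of the original notebook, that uses a plain Python list as a hypothetical stand-in for the worksheet (no gspread calls are made). It compares deleting at precomputed positions from the top down with deleting from the bottom up.

```python
def delete_top_down_naive(rows, stale):
    """Delete at positions computed once, top to bottom (shows the index-shift problem)."""
    rows = list(rows)
    positions = [i for i, r in enumerate(rows) if r in stale]
    for i in positions:
        if i < len(rows):  # a shifted index can even run past the end
            del rows[i]
    return rows


def delete_bottom_up(rows, stale):
    """Delete the same positions in reverse order, so earlier indices stay valid."""
    rows = list(rows)
    positions = [i for i, r in enumerate(rows) if r in stale]
    for i in reversed(positions):
        del rows[i]
    return rows


# Hypothetical data: five sheet rows, three of which should be removed.
rows = ["a", "b", "c", "d", "e"]
stale = {"a", "b", "d"}

print(delete_top_down_naive(rows, stale))  # wrong rows survive
print(delete_bottom_up(rows, stale))       # only the non-stale rows survive
```

Walking the sheet from the last row to the first, or subtracting the number of rows already deleted from each target row number, are two ways to keep the row numbers aligned with the DataFrame index.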

In [7]:
assert False  # deliberately stop a "Run All" here; the cells below are meant to be run by hand

AssertionError: 

In [None]:
# add a new local entry
# `datetime` is the class imported in the first cell (from datetime import datetime)
date = datetime.now().strftime("%Y-%m-%d")

store = 'IPSL-CM6A-LR/historical/r1i1p1f1/Oyr/expfe/gn/'
#code = 'noUse'
code = 'local'
#reason = 'missing time chunks'
#reason = 'conflicting values of geolat'
reason = 'NetCDF: Access failure'
#reason = 'I/O failure'
#reason = 'depth coord name change deptht to olevel'
row = store.split('/')[:-1] + [code,reason,date,store+code]
print(row)
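
The row construction above can be wrapped in a small helper so that a malformed store path is caught before it reaches the CSV. This is a sketch only: the function name `make_exception_row` and the six-facet check are assumptions for illustration, based on the column layout shown in the `df_cloud` output (six path facets, then `reason_code`, `reason_txt`, `status`, `name`).

```python
import datetime as dt


def make_exception_row(store, code, reason, today=None):
    """Build one exceptions row from a store path like 'src/exp/member/table/var/grid/'.

    Hypothetical helper; mirrors the manual construction in the cell above.
    """
    if today is None:
        today = dt.date.today().strftime("%Y-%m-%d")
    facets = store.rstrip("/").split("/")
    if len(facets) != 6:
        raise ValueError(f"expected 6 path facets, got {len(facets)}: {store!r}")
    name = "/".join(facets) + "/" + code  # unique key used to match local and cloud rows
    return facets + [code, reason, today, name]


row = make_exception_row(
    "IPSL-CM6A-LR/historical/r1i1p1f1/Oyr/expfe/gn/",
    "local",
    "NetCDF: Access failure",
    today="2020-03-09",  # fixed date so the example is reproducible
)
print(row)
```

Passing `today` explicitly is optional; by default the helper stamps the row with the current date, as the cell above does.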

In [None]:
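# if the row looks good, append it to df_local and check the result
# (DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions)
df_local = pd.concat([df_local, pd.DataFrame([row], columns=df_local.columns)], ignore_index=True)
df_local.tail()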

In [None]:
df_local.to_csv('csv/exceptions.csv', index=False)

In [None]:
assert False  # deliberately stop here; the cells below are scratch work

In [None]:
#wks.update_acell('B1', 'Bingo!')
#wks.append_row(['junk','more_junk'])
#wks.add_rows(2)
#for row in df_local.values[:100]:
#    wks.append_row(list(row))
#wks.delete_row(7)
#wks.row_values(8)

In [None]:
common_cols = df_cloud.columns.tolist()                           # column names, assumed identical in both tables
df12 = pd.merge(df_cloud, df_local, on=common_cols, how='inner')  # rows that are identical in both tables
df2 = df_local[~df_local['name'].isin(df12['name'])]              # local rows with no identical cloud row
df1 = df_cloud[~df_cloud['name'].isin(df12['name'])]              # cloud rows with no identical local row
len(df_cloud),len(df_local)
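
One possible alternative to the row-by-row comparisons is a single outer merge on `name` with pandas' `indicator=True` option, which labels each row as present on the left, the right, or both. The sketch below uses small made-up tables (`df_cloud_demo`, `df_local_demo` are hypothetical stand-ins, not the real sheet or CSV) and only computes the two difference lists; it does not touch the worksheet.

```python
import pandas as pd

# Hypothetical stand-ins for df_cloud and df_local; only the 'name' key matters here.
df_cloud_demo = pd.DataFrame({"name": ["A/local", "B/noUse", "C/local"]})
df_local_demo = pd.DataFrame({"name": ["B/noUse", "C/local", "D/local"]})

# The outer merge keeps every name; the '_merge' column says where each one was found.
merged = pd.merge(df_cloud_demo, df_local_demo, on="name", how="outer", indicator=True)

only_cloud = merged.loc[merged["_merge"] == "left_only", "name"].tolist()   # candidates to delete from the sheet
only_local = merged.loc[merged["_merge"] == "right_only", "name"].tolist()  # candidates to append to the sheet

print("to delete from cloud:", only_cloud)
print("to append to cloud:", only_local)
```

Computing both lists up front makes it possible to review the planned changes before any row is appended or deleted.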