# Uploading `nc` files to Google Cloud Storage

Due to the limited resources of the machine the TRACMIP dataset is stored on, we cannot perform the `zarr` conversion on this machine. Instead, we will upload the original `nc` files to the GCS bucket, where they can be accessed from a different machine at Lamont that can perform the conversion.

With the Google Cloud SDK installed, we first obtain our user credentials:

In [None]:
!gcloud auth login

Having done this, we can now organize and upload each dataset. On this machine the `nc` files are scattered across several nested directories, but in the bucket it is more useful to have them grouped by output frequency, experiment, model, and version. We can do this with glob patterns, which we use to gather all the datasets that need to be uploaded:

In [None]:
from glob import glob

paths = []

times       = ['Amon', 'Aday', 'A3hr']
models      = ['GISS-ModelE2', 'MetUM-CTL', 'CAM5Nor', 'CAM3', 'CNRM-AM5', 'AM21', 'ECHAM61',
               'MetUM-ENT', 'MPAS', 'LMDZ5A', 'ECHAM63', 'CALTECH', 'MIROC5', 'CAM4']
experiments = ['aquaControl', 'aqua4xCO2', 'aquaAbs20', 'aquaAbs07', 'land4xCO2', 'landAbs20',
               'landAbs15', 'landOrbit', 'aquaAbs15', 'landControl', 'landAbs07']
versions    = ['v20180423', 'v20181025', 'v20190129', 'v20181024', 'v20190131', 'v20190305',
               'v20190116', 'v20190507', 'v20190408', 'v20190114', 'v20190409']

# Try every combination; keep only the patterns that match at least one file
for time in times:
    for mod in models:
        for exp in experiments:
            for ver in versions:
                path = "/lsdf/kit/imk-tro/projects/MOD/Gruppe_Voigt/TRACMIP_ESGFCOPY/*/%s/%s*/*/*/%s/*/*/%s/*" % (mod, exp, time, ver)
                if glob(path):
                    paths.append(path)

len(paths)
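
To make the pattern matching concrete, here is a minimal sketch that rebuilds the same wildcard structure inside a temporary directory. The directory names `inst`, `x`, `y`, `a`, `b`, the `_suffix` on the experiment directory, and the file `ta.nc` are hypothetical placeholders, not the real layout of the archive. It shows that the trailing `*` after the experiment name tolerates a suffix on that directory, and that a combination with no files yields an empty (falsy) list, which is what the `if glob(path)` filter relies on.

```python
import os
import tempfile
from glob import glob

# Build a tiny, hypothetical directory tree with the same depth as the pattern
root = tempfile.mkdtemp()
leaf = os.path.join(root, "inst", "CAM4", "aquaControl_suffix",
                    "x", "y", "Amon", "a", "b", "v20180423")
os.makedirs(leaf)
open(os.path.join(leaf, "ta.nc"), "w").close()

def make_pattern(mod, exp, time, ver):
    # Same wildcard structure as the notebook's pattern, rooted at `root`
    return os.path.join(root, "*", mod, exp + "*", "*", "*", time, "*", "*", ver, "*")

# The experiment directory carries a suffix, but `exp + "*"` still matches it
matches = glob(make_pattern("CAM4", "aquaControl", "Amon", "v20180423"))

# A combination with no files on disk gives an empty list, which is falsy
no_matches = glob(make_pattern("CAM4", "aquaControl", "Aday", "v20180423"))
```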

There are 206 datasets which must be uploaded, making up some 8.8 terabytes! To manage the upload and account for potential failures, we will save our dataset glob paths to a JSON file, where we can keep track of whether each one has been uploaded:

In [None]:
import json

with open("uploaded.json", "w") as f:
    json.dump({path : False for path in paths}, f)
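
Because the status file is a plain mapping from glob pattern to a boolean, progress can be summarized at any time with a few lines. The following is a minimal sketch; the helper name `summarize` is hypothetical and not part of the original workflow.

```python
import json

def summarize(status_file):
    """Return (number uploaded, list of patterns still pending)."""
    with open(status_file) as f:
        status = json.load(f)
    pending = [pattern for pattern, done in status.items() if not done]
    return len(status) - len(pending), pending
```

Calling `summarize("uploaded.json")` from a second session gives a quick view of how far the upload has progressed without touching the running job.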

With this done, we can start a long-running script to upload all of the data to the GCS bucket, checking on it periodically to see if it has stalled:

In [None]:
with open("uploaded.json", "r") as f:
    d = json.load(f)

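for path in d:
    # Skip datasets already marked as uploaded in a previous run
    if not d[path]:
        # Recover the components from their fixed positions in the glob pattern
        time = path.split("/")[13]
        exp = path.split("/")[10].rstrip("*")
        model = path.split("/")[9]
        version = path.split("/")[-2]

        print(time, exp, model, version)

        !gsutil -m cp -r {path} gs://pangeo-data/tracmip_temp/{time}/{exp}/{model}/{version}/

        # Note: the path is marked as uploaded without checking gsutil's exit status
        d[path] = True
        # Write back to the file we read from, so an interrupted run can resume
        with open("uploaded.json", "w") as f:
            json.dump(d, f)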
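
The loop above records a dataset as uploaded whether or not the copy succeeded. One possible variant, sketched here under stated assumptions, calls `gsutil` through `subprocess.run` instead of the `!` shell escape and only updates the status file when the command exits with code 0. The names `upload_pending`, `dest_for`, and `run` are hypothetical; `dest_for` stands in for the logic that builds the bucket destination from a pattern, and the glob is expanded in Python because no shell is involved to do it.

```python
import glob
import json
import subprocess

def upload_pending(status_file, dest_for, run=subprocess.run):
    """Upload every pattern still marked False; mark it True only on success."""
    with open(status_file) as f:
        status = json.load(f)

    for pattern, done in status.items():
        if done:
            continue
        # Expand the glob ourselves, since subprocess does not go through a shell
        matches = sorted(glob.glob(pattern))
        if not matches:
            continue  # nothing found; leave the entry pending
        result = run(["gsutil", "-m", "cp", "-r", *matches, dest_for(pattern)])
        if result.returncode == 0:
            status[pattern] = True
            # Persist after each success so an interrupted run can resume
            with open(status_file, "w") as f:
                json.dump(status, f)
    return status
```

Entries whose copy failed stay `False`, so rerunning the function retries only those datasets.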