# Read States from Google Cloud Storage

Interacting with Google Cloud Storage (GCS) requires authenticating, creating a client connected to a project, and referencing a bucket within that project. This bucket "pointer" is the object you will work with most directly when retrieving data from GCS.

## The `gcloud` module
The `gcloud` module is a thin wrapper around the `google.cloud.storage` module that implements solutions to a few common problems we will face while interacting with the model data on GCS. `list_*` functions take the bucket pointer as the first argument and return a list of named tuples. `*FromBlob` functions take a blob as the first argument and return a single named tuple. The definitions of these tuples are provided below for reference. `*ToFile` functions take the object to save as the first argument (a blob for `BlobToFile`, a named tuple for `ModelInfoToFile` and `ModelStateToFile`, as in the worked example) and, by default, save the file locally, mirroring the directory structure on GCS. The `name` field of each named tuple holds a copy of `blob.name`, which preserves information about the GCS directory structure.

```python
from collections import namedtuple

# Run configuration; "name" holds a copy of blob.name.
ModelInfo = namedtuple("ModelInfo", [
    "wandb_id", "lesion_start_epoch", "lesion_type", "model_type", "run_name",
    "train_data", "test_data", "mask_value", "lstm_units", "learning_rate",
    "batch_size", "frequency_scale_k", "epochs", "seed", "orth_features",
    "phon_features", "phon_max_length", "name", "bucket_name"])


# Model states for one epoch; "name" holds a copy of blob.name.
BaseModelState = namedtuple("ModelState", [
    "encoder_cell_state", "encoder_hidden_state", "decoder_cell_state",
    "decoder_hidden_state", "output", "name", "bucket_name"])


class ModelState(BaseModelState):
    __slots__ = ()  # keep instances as light as a plain tuple

    def nitems(self):
        # First dimension of the encoder cell state.
        return self.encoder_cell_state.shape[0]

    def nunits(self):
        # Last dimension of the encoder cell state.
        return self.encoder_cell_state.shape[-1]

    def phon_max_length(self):
        # Second dimension of the output.
        return self.output.shape[1]
```
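The shape helpers can be illustrated with a minimal sketch. It uses a hypothetical, cut-down `TinyState` tuple and stand-in objects that expose only a `.shape` attribute in place of real arrays. The shapes are invented for illustration and say nothing about the actual model data.

```python
from collections import namedtuple
from types import SimpleNamespace

# Hypothetical cut-down version of ModelState with two of its fields.
BaseTinyState = namedtuple("TinyState", ["encoder_cell_state", "output"])


class TinyState(BaseTinyState):
    __slots__ = ()  # no per-instance __dict__, so instances stay tuple-sized

    def nitems(self):
        return self.encoder_cell_state.shape[0]

    def nunits(self):
        return self.encoder_cell_state.shape[-1]

    def phon_max_length(self):
        return self.output.shape[1]


# Stand-ins for arrays; the shapes are made up.
state = TinyState(
    encoder_cell_state=SimpleNamespace(shape=(100, 64)),
    output=SimpleNamespace(shape=(100, 10, 30)),
)
print(state.nitems(), state.nunits(), state.phon_max_length())
```

Because the helpers read only `.shape`, any array-like object with that attribute (a NumPy array, for example) would work in the same way.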

## Mass-downloads
If you have sufficient local storage, you may wish to download some or all of the GCS dataset. This may be necessary if the cost of repeatedly retrieving large datasets from GCS becomes untenable. The `download_many_blobs_with_transfer_manager` and `download_bucket_with_transfer_manager` functions are taken from Google's documentation (with a minor revision so that they assume a bucket reference already exists) and are meant to facilitate this. As with the `*ToFile` functions, the default behavior is to let `blob.name` dictate the filename. The `destination_directory` argument to these functions lets you root the mirrored GCS directory structure somewhere other than the current working directory. The default value, `""`, implies the current working directory.
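The following is a rough sketch of the pattern these functions wrap, not the `gcloud` module's actual code. It assumes the `transfer_manager.download_many_to_path` call from `google-cloud-storage`. The names `local_path` and `download_blobs` are hypothetical, and `local_path` only approximates how the destination is derived from `blob.name`.

```python
import os


def local_path(blob_name, destination_directory=""):
    """Approximate local destination when blob.name dictates the filename."""
    return os.path.join(destination_directory, blob_name)


def download_blobs(bucket, blob_names, destination_directory="", workers=8):
    """Download the named blobs concurrently; return the names that failed."""
    # Imported lazily so local_path stays usable without the GCS client installed.
    from google.cloud.storage import transfer_manager

    results = transfer_manager.download_many_to_path(
        bucket,
        blob_names,
        destination_directory=destination_directory,
        max_workers=workers,
    )
    # One result per blob, in order: None on success, or the exception raised.
    return [name for name, result in zip(blob_names, results)
            if isinstance(result, Exception)]


# With the default "", the blob name is used as a path relative to the cwd.
print(local_path("run/abcd1234/config.json"))
```

Returning the failed names rather than raising lets a long mass-download finish and be retried only for the blobs that did not arrive.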

## Worked example
First, import packages and set up the environment.

```python
import os
from gtools import gcloud
from google.cloud.storage import client

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/chriscox/.config/gcp_read_only.json"
project_name = "time-varying-reader-1"
bucket_name = "time-varying-reader-runs"
gcs = client.Client(project=project_name)
B = gcs.bucket(bucket_name=bucket_name)
```

Then, pick a run that you want to pull data from.

```python
run_name = "s200_intact_freeze_phon"
```

Pull the model info to guide the selection of a model and epoch. N.B., `list_model_info` returns `ModelInfo` tuples, which means the data is downloaded immediately. `list_epoch_blobs` returns only references to blobs plus a little metadata; the epoch state data must be retrieved later.

```python
model_info = gcloud.list_model_info(B, run_name)
wandb_id = model_info[-1].wandb_id
epochs = gcloud.list_epoch_blobs(B, run_name, wandb_id)
```
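Rather than taking the last entry, the tuple fields can drive the choice of model. The sketch below uses a hypothetical cut-down `Info` tuple and invented values in place of real `ModelInfo` data; `select` is an illustrative helper, not part of `gcloud`.

```python
from collections import namedtuple

# Cut-down stand-in for gcloud.ModelInfo with only the fields used here.
Info = namedtuple("Info", ["wandb_id", "seed", "learning_rate", "name"])

# Invented entries standing in for the list returned by list_model_info.
model_info = [
    Info("aaaa1111", 1, 0.001, "run/aaaa1111/config.json"),
    Info("bbbb2222", 2, 0.001, "run/bbbb2222/config.json"),
    Info("cccc3333", 2, 0.01, "run/cccc3333/config.json"),
]


def select(infos, **criteria):
    """Return the tuples whose fields equal every keyword given."""
    return [m for m in infos
            if all(getattr(m, field) == value
                   for field, value in criteria.items())]


matches = select(model_info, seed=2, learning_rate=0.001)
wandb_id = matches[0].wandb_id
print(wandb_id)
```

The resulting `wandb_id` would then be passed to `list_epoch_blobs` exactly as in the cell above.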

### Download the state data
N.B., this will be a large download (~500 MB).

```python
m = gcloud.ModelStateFromBlob(epochs[-1])
```

### Write downloaded data to file

```python
gcloud.ModelInfoToFile(model_info[-1], bucket_name)
gcloud.ModelStateToFile(m, bucket_name)
```

### Write directly to file (picking a new model and epoch)
This will involve a large download of epoch state data (~500 MB).

```python
model_info_blobs = gcloud.list_model_info_blobs(B, run_name)
config_from_blob = gcloud.ModelInfoFromBlob(model_info_blobs[0])
wandb_id = config_from_blob.wandb_id
epochs = gcloud.list_epoch_blobs(B, run_name, wandb_id)
gcloud.BlobToFile(model_info_blobs[0])
# N.B., this will be a large download (~500 MB)
gcloud.BlobToFile(epochs[-1])
```

### Read from file

```python
config_from_file = gcloud.ModelInfoFromFile(
    "buckets/time-varying-reader-runs/s200_intact_freeze_phon/1nexxgef/config.json",
    "time-varying-reader-runs")
print(config_from_blob)
print(config_from_file)
```
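Instead of comparing the two printed tuples by eye, the fields can be compared programmatically. This is a minimal sketch with a hypothetical cut-down `Info` tuple and invented values; `differing_fields` is an illustrative helper, not part of `gcloud`.

```python
from collections import namedtuple

# Cut-down stand-in for gcloud.ModelInfo; the values below are invented.
Info = namedtuple("Info", ["wandb_id", "seed", "name", "bucket_name"])


def differing_fields(a, b):
    """Map each field whose values differ to the pair (a's value, b's value)."""
    return {field: (getattr(a, field), getattr(b, field))
            for field in a._fields
            if getattr(a, field) != getattr(b, field)}


from_blob = Info("abcd1234", 1, "run/abcd1234/config.json", "example-bucket")
from_file = Info("abcd1234", 1, "run/abcd1234/config.json", "example-bucket")

# An empty dict means the round trip through the local file preserved every field.
print(differing_fields(from_blob, from_file))

# A changed field shows up with both values.
print(differing_fields(from_blob, from_file._replace(seed=2)))
```

Named tuples also support `==` directly, so `from_blob == from_file` is enough when only a yes/no answer is needed.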