In this notebook, we will download the [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset from [TensorFlow Datasets (TFDS)](https://www.tensorflow.org/datasets). The dataset is already prepared in TFRecord format.

We will push the downloaded dataset to a GCS bucket while keeping the directory structure shown below.
- gs://bucket-name/span-1/train/cifar10-train.tfrecord
- gs://bucket-name/span-1/test/cifar10-test.tfrecord

To proceed with the rest of the notebook, you need a billing-enabled GCP account.

## Prerequisites
- Grant the following IAM roles to your account
  - Storage Object Admin
  - Storage Object Creator

## Setup

To access Google Cloud Platform from the Colab environment, we need to log in to a GCP account with the `gcloud init` command.

In [None]:
!gcloud init

## Download the original dataset and copy over to a GCS Bucket

### 1. Create Directories

In this step, we create directories to hold the TFRecord dataset that we are about to download. In the initial phase, the training and testing datasets are stored in the `span-1/train` and `span-1/test` directories, respectively.

When more data with the same distribution arrives, we can update the currently stored dataset in place. In this case, you should turn on GCS's object versioning feature.
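
Versioning can also be switched on from Python. The following is a minimal sketch, assuming the `google-cloud-storage` client library is installed and the environment is already authenticated. The helper name `enable_versioning` and the bucket name `bucket-name` are placeholders and not part of this notebook.

```python
def enable_versioning(bucket):
    """Turn on object versioning for a google.cloud.storage Bucket-like object."""
    bucket.versioning_enabled = True  # staged locally on the bucket object
    bucket.patch()                    # sends the staged change to GCS
    return bucket.versioning_enabled


def main():
    # Imported lazily so the helper above does not require the library by itself.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("bucket-name")  # placeholder bucket name
    print("versioning enabled:", enable_versioning(bucket))

# Call main() from an authenticated environment to apply the change.
```

With versioning enabled, overwriting an object keeps the previous generation instead of discarding it, which is what makes in-place updates within a span recoverable.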

When more data with a different distribution arrives, we will create new directories, `span-2/train` and `span-2/test`, to address data drift. This way, we can keep the data separate for easier maintenance while handling versioning separately for each `SPAN`.
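
The span layout described above can be captured in a small helper so that new spans follow the same naming. This is only an illustrative sketch: `span_dirs` is a hypothetical helper, and the root `cifar10` simply mirrors the directory name used in this notebook.

```python
from pathlib import PurePosixPath


def span_dirs(root, span):
    """Return the train/test directory paths for a given span number."""
    base = PurePosixPath(root) / f"span-{span}"
    return {"train": str(base / "train"), "test": str(base / "test")}


# Paths for the initial span and for a later span holding drifted data.
for span in (1, 2):
    print(span_dirs("cifar10", span))
```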

In [11]:
TARGET_ROOT_DIR = "cifar10"
TARGET_TRAIN_DIR = TARGET_ROOT_DIR + "/span-1/train"
TARGET_TEST_DIR = TARGET_ROOT_DIR + "/span-1/test"

!mkdir -p {TARGET_TRAIN_DIR}
!mkdir -p {TARGET_TEST_DIR}

### 2. Download CIFAR10 Dataset with TFDS

In [12]:
import tensorflow_datasets as tfds

# Generate TFRecords with TFDS
builder = tfds.builder("cifar10")
builder.download_and_prepare()

### 3. Copy the Downloaded Dataset to the Directories We Created

In [13]:
!cp {builder.data_dir}/cifar10-train.tfrecord-00000-of-00001 {TARGET_TRAIN_DIR}/cifar10-train.tfrecord
!cp {builder.data_dir}/cifar10-test.tfrecord-00000-of-00001 {TARGET_TEST_DIR}/cifar10-test.tfrecord

In [14]:
!ls -R {TARGET_ROOT_DIR}

cifar10:
span-1

cifar10/span-1:
test  train

cifar10/span-1/test:
cifar10-test.tfrecord

cifar10/span-1/train:
cifar10-train.tfrecord
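
The `cp` commands in step 3 hard-code the shard suffix `-00000-of-00001`, which may differ between TFDS versions. A more defensive variant could look up the shard by pattern instead. The sketch below is an assumption-laden alternative, not what this notebook ran: `copy_single_shard` is a hypothetical helper that expects exactly one shard per split and a file name of the form `<dataset>-<split>.tfrecord-*`.

```python
import shutil
from pathlib import Path


def copy_single_shard(data_dir, split, dest_dir, dataset="cifar10"):
    """Copy the only TFRecord shard of `split` into dest_dir under a stable name."""
    shards = sorted(Path(data_dir).glob(f"{dataset}-{split}.tfrecord-*"))
    if len(shards) != 1:
        # Fail loudly rather than silently copying a partial dataset.
        raise ValueError(f"expected exactly one {split} shard, found {len(shards)}")
    dest = Path(dest_dir) / f"{dataset}-{split}.tfrecord"
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(shards[0], dest)
    return dest


# Intended usage in this notebook (names defined in the cells above):
# copy_single_shard(builder.data_dir, "train", TARGET_TRAIN_DIR)
# copy_single_shard(builder.data_dir, "test", TARGET_TEST_DIR)
```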


### 4. Copy Local Files to the GCS Bucket

In [15]:
#@title GCS
#@markdown Change these values to suit your project. The bucket name must be globally unique. The copy operation can take ~5 minutes.
BUCKET_PATH = "gs://cifar10-csp-public" #@param {type:"string"}
REGION = "us-central1" #@param {type:"string"}

!gsutil mb -l {REGION} {BUCKET_PATH}
!gsutil -m cp -r {TARGET_ROOT_DIR}/* {BUCKET_PATH}

Creating gs://cifar10-csp-public/...
ServiceException: 409 A Cloud Storage bucket named 'cifar10-csp-public' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
Copying file://cifar10/span-1/train/cifar10-train.tfrecord [Content-Type=application/octet-stream]...
Copying file://cifar10/span-1/test/cifar10-test.tfrecord [Content-Type=application/octet-stream]...
\ [2/2 files][133.3 MiB/133.3 MiB] 100% Done                                    
Operation completed over 2 objects/133.3 MiB.                                    


Verify that the files were copied over.

In [8]:
!gsutil ls -R {BUCKET_PATH}/

gs://cifar10-csp-public/span-01/:

gs://cifar10-csp-public/span-01/test/:
gs://cifar10-csp-public/span-01/test/cifar10-test.tfrecord

gs://cifar10-csp-public/span-01/train/:
gs://cifar10-csp-public/span-01/train/cifar10-train.tfrecord
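
Instead of checking the listing by eye, the expected object paths can be compared against it programmatically. The following is a rough sketch: `missing_objects` is a hypothetical helper, the sample listing is made up for illustration, and `bucket-name` is a placeholder.

```python
def missing_objects(listing, expected):
    """Return the expected object URLs that do not appear in `gsutil ls -R` output."""
    listed = set()
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and directory headers, which end with a colon.
        if line and not line.endswith(":"):
            listed.add(line)
    return [url for url in expected if url not in listed]


# Illustrative listing text in the same shape that `gsutil ls -R` prints.
sample_listing = """gs://bucket-name/span-1/:

gs://bucket-name/span-1/test/:
gs://bucket-name/span-1/test/cifar10-test.tfrecord
"""

expected = [
    "gs://bucket-name/span-1/train/cifar10-train.tfrecord",
    "gs://bucket-name/span-1/test/cifar10-test.tfrecord",
]
print(missing_objects(sample_listing, expected))
```

A check like this also catches directory-name mismatches, for example a bucket populated with `span-01` while the pipeline expects `span-1`.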


## Test with TFX's built-in function

TFX provides the [`calculate_splits_fingerprint_span_and_version`](https://github.com/tensorflow/tfx/blob/00571387b7b006e2ebb0c1277380e5a47d8f0ffa/tfx/components/example_gen/utils.py#L648) function, which resolves the split patterns against the input directory and returns a fingerprint together with the current `SPAN` and `VERSION`.

> Please note that this section only works within a GCP Vertex Notebook environment because of an authentication issue. If you know how to set up GCS access privileges for TFX, please let me know.

In [None]:
!pip install tfx==1.2.0

In [3]:
from tfx import v1 as tfx
from tfx.components.example_gen import utils

In [7]:
from tfx.proto import example_gen_pb2

_DATA_PATH = 'gs://cifar10-csp-public'

splits = [
  example_gen_pb2.Input.Split(name='train',
                              pattern='span-{SPAN}/train/*'),
  example_gen_pb2.Input.Split(name='val',
                              pattern='span-{SPAN}/test/*')
]

_, span, version = utils.calculate_splits_fingerprint_span_and_version(_DATA_PATH, splits)

PermissionDeniedError: ignored
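
Since the call above fails without GCS credentials, the idea behind `{SPAN}` resolution can still be illustrated locally. The following is a simplified sketch and not TFX's implementation: `latest_span` is a hypothetical helper that works on a plain list of relative paths, handles a single split pattern, and ignores `{VERSION}`, date specs, and fingerprinting.

```python
import re


def latest_span(paths, pattern):
    """Return the largest integer matched by {SPAN} in `pattern`, or None."""
    # Translate the glob-like pattern into a regex piece by piece.
    pieces = []
    for part in re.split(r"(\{SPAN\}|\*)", pattern):
        if part == "{SPAN}":
            pieces.append(r"(?P<span>\d+)")  # the span number
        elif part == "*":
            pieces.append(r"[^/]*")          # any file name within one directory
        else:
            pieces.append(re.escape(part))   # literal text
    regex = re.compile("".join(pieces))

    spans = []
    for path in paths:
        match = regex.fullmatch(path)
        if match:
            spans.append(int(match.group("span")))
    return max(spans) if spans else None


# Hypothetical bucket contents after a second span has been added.
paths = [
    "span-1/train/cifar10-train.tfrecord",
    "span-1/test/cifar10-test.tfrecord",
    "span-2/train/cifar10-train.tfrecord",
    "span-2/test/cifar10-test.tfrecord",
]
print(latest_span(paths, "span-{SPAN}/train/*"))  # prints 2
```

The point of the sketch is that the newest span is picked by its number in the directory name, so adding a `span-2` directory is enough for a pattern-based input to move on to the new data.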