# Overall Architecture

<img src="https://i.ibb.co/FsVN6My/Screen-Shot-2022-01-14-at-12-34-03-PM.png" width="1000"/>

# Prepare Datasets and Cloud Function

- Sample two sub-datasets from CIFAR10, one for SPAN-1 and one for SPAN-2
- Deploy a Cloud Function that monitors a designated bucket for changes and triggers the pipeline. The pipeline itself should be set up beforehand via [this notebook](https://colab.research.google.com/drive/13tDGF7rl0aqiwE25QLeIaC9IQuT0gP-U?usp=sharing)
- Upload the sampled datasets to GCS so that the Cloud Function is triggered

> **NOTICE**: Replace the GCS bucket names and the GCP project ID with your own.

# Make sure that you have the appropriate IAM permissions

This screenshot gives a sense of which IAM permissions are required, but you may want to control access in a more fine-grained manner.

![](https://i.ibb.co/d53nbvc/Screen-Shot-2022-01-14-at-12-56-40-AM.png)

# Download CIFAR10 Datasets (JPG)

In [None]:
!rm -rf CIFAR-10-images
!git clone https://github.com/YoongiKim/CIFAR-10-images.git

!rm -rf sampled

Cloning into 'CIFAR-10-images'...
remote: Enumerating objects: 60027, done.
remote: Total 60027 (delta 0), reused 0 (delta 0), pack-reused 60027
Receiving objects: 100% (60027/60027), 19.94 MiB | 24.40 MiB/s, done.
Resolving deltas: 100% (59990/59990), done.
Checking out files: 100% (60001/60001), done.


# Getting two sub-datasets

## Imports

In [None]:
import pandas as pd
import os
from os import listdir
from os.path import isfile, join
from random import sample
from shutil import copyfile, move

## Labels and creating corresponding directories

In [None]:
labels = ['airplane', 'automobile', 'bird', 'cat', 'deer']

In [None]:
span = 'span-1'

for label in labels:
  os.makedirs(f'sampled/{span}/{label}')

## Sampling

1. Collect all downloaded image files for each label in `labels`.
2. Randomly sample `num_samples` files per label.
3. Move the chosen files into separate per-label directories (for compression later).
4. Create a DataFrame for each label and append it to the `tmp_dfs` list.
   - The `filename` column holds the location where each file will be stored in GCS.
   - The `filename` format is `gs://bucket-name/span/label-name/filename`.
   - The DataFrame also has a `label` column.

In [None]:
tmp_dfs = []
num_samples = 200
bucket = 'my-cifar10-dataset-1012'

for label in labels:
  result = {}

  # List every training image for this label as its future GCS URI.
  path = 'CIFAR-10-images/train/' + label + '/'
  imagefiles = [f'gs://{bucket}/{span}/{label}/{f}' for f in listdir(path) if isfile(join(path, f))]

  # Randomly pick `num_samples` URIs without replacement.
  sampled = sample(imagefiles, num_samples)
  result['filename'] = sampled
  result['label'] = label

  tmp_dfs.append(pd.DataFrame.from_dict(result))

  # Move the chosen files out of the source tree so a later span cannot pick them again.
  for imagefile in sampled:
    imagefile = imagefile.split('/')[-1]
    move(f'{path}{imagefile}', f'sampled/{span}/{label}/{imagefile}')
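Because the same sampling steps are repeated later for SPAN-2, they could be factored into a single function. The following is a minimal sketch, not the notebook's own code: `sample_span` and its parameter names are hypothetical, and the `seed` argument is an added assumption to make a run reproducible. It returns plain `(filename, label)` rows rather than a DataFrame.

```python
import os
import random
import shutil


def sample_span(src_root, dst_root, span, labels, bucket, num_samples, seed=None):
    """Move `num_samples` random images per label into dst_root/span/label.

    Returns a list of (gcs_uri, label) tuples describing the moved files.
    """
    rng = random.Random(seed)  # `seed` is an assumption, not part of the original notebook
    rows = []
    for label in labels:
        src_dir = os.path.join(src_root, label)
        dst_dir = os.path.join(dst_root, span, label)
        os.makedirs(dst_dir, exist_ok=True)

        # Sort first so that a fixed seed selects the same files on every run.
        names = sorted(
            n for n in os.listdir(src_dir) if os.path.isfile(os.path.join(src_dir, n))
        )

        # Moving (rather than copying) keeps a later span from re-sampling these files.
        for name in rng.sample(names, num_samples):
            shutil.move(os.path.join(src_dir, name), os.path.join(dst_dir, name))
            rows.append((f"gs://{bucket}/{span}/{label}/{name}", label))
    return rows
```

Under these assumptions, `pd.DataFrame(sample_span('CIFAR-10-images/train', 'sampled', 'span-1', labels, bucket, num_samples), columns=['filename', 'label'])` would produce a frame with the same two columns as the per-label DataFrames collected in `tmp_dfs`.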

## Create a merged DataFrame

Create a merged DataFrame by concatenating the per-label DataFrames.

In [None]:
merge_df = pd.concat(tmp_dfs)
merge_df.head()

| | filename | label |
|---|---|---|
| 0 | gs://my-cifar10-dataset-1012/span-1/airplane/1... | airplane |
| 1 | gs://my-cifar10-dataset-1012/span-1/airplane/0... | airplane |
| 2 | gs://my-cifar10-dataset-1012/span-1/airplane/4... | airplane |
| 3 | gs://my-cifar10-dataset-1012/span-1/airplane/4... | airplane |
| 4 | gs://my-cifar10-dataset-1012/span-1/airplane/3... | airplane |


In [None]:
merge_df.describe()

| | filename | label |
|---|---|---|
| count | 1000 | 1000 |
| unique | 1000 | 5 |
| top | gs://my-cifar10-dataset-1012/span-1/airplane/1... | automobile |
| freq | 1 | 200 |


## Export the merged DataFrame to CSV file

Export the merged DataFrame to a CSV file, excluding the index and the header row.


In [None]:
!mkdir -p annotations/{span}
merge_df.to_csv(f'annotations/{span}/annotations.csv', header=False, index=False)
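Since the CSV is written without a header, it can be worth reading it back before uploading to confirm that every row has the `filename,label` shape and that each label has the expected number of samples. The following is a minimal sketch using only the standard library; `check_annotations` is a hypothetical helper name, and the checks it performs are assumptions about what a valid file looks like based on the steps above.

```python
import csv
from collections import Counter


def check_annotations(csv_path, expected_labels, expected_per_label):
    """Validate a headerless `filename,label` CSV and return the per-label row counts."""
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if len(row) != 2:
                raise ValueError(f"expected 2 columns, got {len(row)}: {row}")
            filename, label = row
            if not filename.startswith("gs://"):
                raise ValueError(f"not a GCS URI: {filename}")
            if label not in expected_labels:
                raise ValueError(f"unexpected label: {label}")
            counts[label] += 1

    # Report labels that are missing entirely as well as over/under-sampled ones.
    unbalanced = {
        label: counts[label]
        for label in expected_labels
        if counts[label] != expected_per_label
    }
    if unbalanced:
        raise ValueError(f"labels with unexpected row counts: {unbalanced}")
    return dict(counts)
```

In this notebook the call would look like `check_annotations(f'annotations/{span}/annotations.csv', labels, num_samples)`.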

## Zip the corresponding image files

In [None]:
!zip -r cifar10-sampled-{span}.zip ./sampled/{span}
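If the `zip` binary is not available, a similar archive can be produced from Python. The following is a sketch using `shutil.make_archive` from the standard library; `zip_span` is a hypothetical helper name, and it assumes the `sampled/<span>` layout created above.

```python
import os
import shutil


def zip_span(span, root_dir="."):
    """Archive root_dir/sampled/<span> into root_dir/cifar10-sampled-<span>.zip.

    Entries inside the archive keep the `sampled/<span>/...` prefix.
    Returns the path of the archive that was written.
    """
    return shutil.make_archive(
        base_name=os.path.join(root_dir, f"cifar10-sampled-{span}"),
        format="zip",
        root_dir=root_dir,                        # paths in the archive are relative to this
        base_dir=os.path.join("sampled", span),   # only this subtree is archived
    )
```

Calling `zip_span(span)` from the notebook's working directory would write `cifar10-sampled-{span}.zip` next to the `sampled` directory.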

## Second dataset (SPAN-2)

In [None]:
span = 'span-2'

for label in labels:
  os.makedirs(f'sampled/{span}/{label}')

tmp_dfs = []
bucket = 'my-cifar10-dataset-1012'

for label in labels:
  result = {}
  
  path = 'CIFAR-10-images/train/' + label + '/'
  imagefiles = [f'gs://{bucket}/{span}/{label}/{f}' for f in listdir(path) if isfile(join(path, f))]

  sampled = sample(imagefiles, num_samples)
  result['filename'] = sampled
  result['label'] = label

  tmp_dfs.append(pd.DataFrame.from_dict(result))

  for imagefile in sampled:
    imagefile = imagefile.split('/')[-1]
    move(f'{path}{imagefile}', f'sampled/{span}/{label}/{imagefile}')

merge_df = pd.concat(tmp_dfs)    

In [None]:
merge_df.describe()

| | filename | label |
|---|---|---|
| count | 1000 | 1000 |
| unique | 1000 | 5 |
| top | gs://my-cifar10-dataset-1012/span-2/bird/3219.jpg | automobile |
| freq | 1 | 200 |


In [None]:
!mkdir -p annotations/{span}
merge_df.to_csv(f'annotations/{span}/annotations.csv', header=False, index=False)

In [None]:
!zip -r cifar10-sampled-{span}.zip ./sampled/{span}

# Copy to GCS

In [None]:
!gcloud init

Welcome! This command will take you through the configuration of gcloud.

Settings from your current configuration [default] are:
component_manager:
  disable_update_check: 'True'
compute:
  gce_metadata_read_timeout_sec: '0'

Pick configuration to use:
 [1] Re-initialize this configuration [default] with new settings 
 [2] Create a new configuration
Please enter your numeric choice:  2

Enter configuration name. Names start with a lower case letter and contain only 
lower case letters a-z, digits 0-9, and hyphens '-':  handson
Your current configuration has been set to: [handson]

You can skip diagnostics next time by using the following flag:
  gcloud init --skip-diagnostics

Network diagnostic detects and fixes local network connection issues.
Reachability Check passed.
Network diagnostic passed (1/1 checks passed).

You must log in to continue. Would you like to log in (Y/n)?  Y

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?response_type=

In [None]:
#@title GCS
BUCKET_PATH = "gs://my-cifar10-dataset-1012" #@param {type:"string"}
ANNOTATION_BUCKET_PATH = "gs://my-cifar10-dataset-annotations-1012" #@param {type:"string"}
REGION = "us-central1" #@param {type:"string"}

!gsutil mb -l {REGION} {BUCKET_PATH}
!gsutil mb -l {REGION} {ANNOTATION_BUCKET_PATH}

Creating gs://my-cifar10-dataset-1012/...
ServiceException: 409 A Cloud Storage bucket named 'my-cifar10-dataset-1012' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
Creating gs://my-cifar10-dataset-annotations-1012/...
ServiceException: 409 A Cloud Storage bucket named 'my-cifar10-dataset-annotations-1012' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
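The 409 responses above mean the buckets already existed when the cell was re-run. One way to make this step repeatable is to look a bucket up before creating it. The following is a minimal sketch, assuming a `google.cloud.storage.Client` is passed in as `client`; `ensure_bucket` is a hypothetical helper name. Note that bucket names are passed without the `gs://` prefix, and that a name owned by another project may surface as a permission error rather than as `None`.

```python
def ensure_bucket(client, name, location):
    """Return the bucket called `name`, creating it in `location` only if it is missing.

    `client` is expected to be a google.cloud.storage.Client (or an object
    exposing the same lookup_bucket / create_bucket methods).
    """
    bucket = client.lookup_bucket(name)  # returns None when the bucket does not exist
    if bucket is None:
        bucket = client.create_bucket(name, location=location)
    return bucket


# Example usage (needs the google-cloud-storage package and valid credentials):
#
#   from google.cloud import storage
#   client = storage.Client(project="your-project-id")
#   ensure_bucket(client, "my-cifar10-dataset-1012", "us-central1")
#   ensure_bucket(client, "my-cifar10-dataset-annotations-1012", "us-central1")
```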


## Cloud Function to trigger pipeline

In [None]:
%cd /content

/content


In [None]:
!mkdir -p cloud_function
!touch cloud_function/__init__.py

In [None]:
_cloud_function_dep = "cloud_function/requirements.txt"

In [None]:
%%writefile {_cloud_function_dep}
kfp==1.6.2
google-cloud-aiplatform
google-cloud-storage

Writing cloud_function/requirements.txt


In [None]:
_cloud_function_file = "cloud_function/main.py"

In [None]:
%%writefile {_cloud_function_file}

import os
from google.cloud import aiplatform

def trigger_pipeline(event, context):
  """Submit the AutoML pipeline when an annotations.csv object is finalized."""
  # The function fires for every new object in the bucket; ignore anything else.
  if 'annotations.csv' in event['name']:
    # Deployment-specific settings are injected as environment variables.
    project = os.getenv("PROJECT")
    region = os.getenv("REGION")
    gcs_pipeline_file_location = os.getenv("GCS_PIPELINE_FILE_LOCATION")
    dataset_name = os.getenv("DATASET_NAME")
    model_name = os.getenv("MODEL_NAME")

    # GCS location of the uploaded annotations file that triggered this call.
    bucket = event["bucket"]
    filepath = event['name']

    print('before init')
    aiplatform.init(project=project, location=region)
    print('after init')

    print('before job')
    print(f'gs://{bucket}/{filepath}')
    job = aiplatform.PipelineJob(
        display_name='automl-cifar10-pipeline',
        template_path=f'gs://{gcs_pipeline_file_location}',
        pipeline_root='gs://my-pipeline-1012',
        parameter_values={
            'project_id': project,
            'location': region,
            'dataset_name': dataset_name,
            'dataset_path': f'gs://{bucket}/{filepath}',
            'base_model_name': model_name,
        }
    )
    print('after job')

    # submit() returns without waiting for the pipeline run to finish.
    job.submit()

Writing cloud_function/main.py
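The function is invoked for every object finalized in the monitored bucket, but it only acts when the object name contains `annotations.csv`. That decision can be separated from the Vertex AI call so it can be checked locally without deploying. The following is a small sketch under that assumption; `dataset_path_from_event` is a hypothetical helper, and the fake event contains only the two keys (`bucket` and `name`) that the function above reads.

```python
def dataset_path_from_event(event):
    """Return the gs:// URI of an uploaded annotations file, or None to skip the event."""
    name = event["name"]
    if "annotations.csv" not in name:
        return None
    return f"gs://{event['bucket']}/{name}"


# A hand-built event mimicking the upload of the SPAN-1 annotations.
fake_event = {
    "bucket": "my-cifar10-dataset-annotations-1012",
    "name": "span-1/annotations.csv",
}
print(dataset_path_from_event(fake_event))
# gs://my-cifar10-dataset-annotations-1012/span-1/annotations.csv
```

Any other object written to the bucket yields `None`, which corresponds to the function returning without submitting a job.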


In [None]:
#@title GCP Project
GOOGLE_CLOUD_PROJECT = "celtic-iridium-338202" #@param {type:"string"}
GOOGLE_CLOUD_REGION = "us-central1" #@param {type:"string"}
PIPELINE_LOCATION = "my-pipeline-1015/cifar10_classification_pipeline.json" #@param {type:"string"}
PIPELINE_NAME = "cifar10-pipeline-automl" #@param {type:"string"}

ENTRY_POINT = "trigger_pipeline" #@param {type:"string"}

DATASET_NAME = "my-cifar10-dataset-1015" #@param {type:"string"}
MODEL_NAME = "cifar10-model" #@param {type:"string"}

CLOUD_FUNCTION_NAME = f"trigger-{PIPELINE_NAME}-fn"
BUCKET_TO_MONITOR = ANNOTATION_BUCKET_PATH

In [None]:
ENV_VARS=f"""\
PROJECT={GOOGLE_CLOUD_PROJECT},\
REGION={GOOGLE_CLOUD_REGION},\
GCS_PIPELINE_FILE_LOCATION={PIPELINE_LOCATION},\
DATASET_NAME={DATASET_NAME},\
MODEL_NAME={MODEL_NAME}
"""

In [None]:
!gcloud functions deploy {CLOUD_FUNCTION_NAME} \
--project={GOOGLE_CLOUD_PROJECT} \
--region={GOOGLE_CLOUD_REGION} \
--runtime=python38 \
--source=cloud_function \
--entry-point={ENTRY_POINT} \
--trigger-resource={BUCKET_TO_MONITOR} \
--trigger-event=google.storage.object.finalize \
--update-env-vars={ENV_VARS}


For Cloud Build Logs, visit: https://console.cloud.google.com/cloud-build/builds;region=us-central1/dbf53e52-4a88-4674-8b3e-ab4f10e21f32?project=161071819378
availableMemoryMb: 256
buildId: c3e8e62a-f47b-4d13-93c3-836f4c72d37c
buildName: projects/161071819378/locations/us-central1/builds/c3e8e62a-f47b-4d13-93c3-836f4c72d37c
entryPoint: trigger_pipeline
environmentVariables:
  DATASET_NAME: my-cifar10-dataset-1012
  GCS_PIPELINE_FILE_LOCATION: my-pipeline-1012/cifar10_classification_pipeline.json
  MODEL_NAME: cifar10-model
  PROJECT: phonic-agility-338223
  REGION: us-central1
eventTrigger:
  eventType: google.storage.object.finalize
  failurePolicy: {}
  resource: projects/_/buckets/my-cifar10-dataset-annotations-1012
  service: storage.googleapis.com
ingressSettings: ALLOW_ALL
labels:
  deployment-tool: cli-gcloud
name: projects/phonic-agility-338223/locations/us-central1/functions/trigger-cifar10-pipeline-automl-fn
runtime: python38
serviceAccountEmail: phonic-agility-338223@appspo
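The `ENV_VARS` string passed to `--update-env-vars` is a comma-separated list of `KEY=VALUE` pairs, and the triple-quoted form used above ends with a newline. An alternative sketch builds the same string from a dict; `format_env_vars` is a hypothetical helper, and rejecting values that contain commas is an added safeguard, since a comma would otherwise be read as a separator.

```python
def format_env_vars(env):
    """Join a dict into the KEY=VALUE,KEY=VALUE form expected by --update-env-vars."""
    for key, value in env.items():
        if "," in str(value):
            # A comma inside a value would be parsed as the start of another pair.
            raise ValueError(f"value for {key} contains a comma: {value!r}")
    return ",".join(f"{key}={value}" for key, value in env.items())


# Placeholder values; substitute the notebook's own variables here.
example = format_env_vars({
    "PROJECT": "your-project-id",
    "REGION": "us-central1",
    "MODEL_NAME": "cifar10-model",
})
print(example)
# PROJECT=your-project-id,REGION=us-central1,MODEL_NAME=cifar10-model
```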

## Copy SPAN-1 to GCS

In [None]:
#@title GCS Span Name
DATA_SRC_DIR = "sampled" #@param {type:"string"}
ANNOTATION_SRC_DIR = "annotations" #@param {type:"string"}
FIRST = "span-1" #@param {type:"string"}
SECOND = "span-2" #@param {type:"string"}

In [None]:
!gsutil -m cp -r {DATA_SRC_DIR}/{FIRST} {BUCKET_PATH}/

Copying file://sampled/span-1/cat/4624.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-1/cat/1931.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-1/cat/4387.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-1/cat/3044.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-1/cat/1595.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-1/cat/3066.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-1/cat/3186.jpg [Content-Type=image/jpeg]...

In [None]:
!gsutil -m cp -r {ANNOTATION_SRC_DIR}/{FIRST} {BUCKET_TO_MONITOR}/

Copying file://annotations/span-1/annotations.csv [Content-Type=text/csv]...
/ [1/1 files][ 57.2 KiB/ 57.2 KiB] 100% Done                                    
Operation completed over 1 objects/57.2 KiB.                                     


## Copy SPAN-2 to GCS

In [None]:
!gsutil -m cp -r {DATA_SRC_DIR}/{SECOND} {BUCKET_PATH}/

Copying file://sampled/span-2/cat/2415.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-2/cat/0212.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-2/cat/1842.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-2/cat/1168.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-2/cat/2819.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-2/cat/2100.jpg [Content-Type=image/jpeg]...
Copying file://sampled/span-2/cat/2704.jpg [Content-Type=image/jpeg]...

In [None]:
!gsutil -m cp -r {ANNOTATION_SRC_DIR}/{SECOND} {BUCKET_TO_MONITOR}/

Copying file://annotations/span-2/annotations.csv [Content-Type=text/csv]...
/ [1/1 files][ 57.2 KiB/ 57.2 KiB] 100% Done                                    
Operation completed over 1 objects/57.2 KiB.                                     
