<a name="about"></a>
About this notebook
======

This notebook assumes you have run the local Iris classification notebook and have not deleted the LOCAL_ROOT folder. Here, we use BigQuery to preprocess the data files.

<a name="setup"></a>
Setting things up
=====

In [1]:
import datalab_structured_data as sd

Let's look at the versions of datalab_structured_data (SD) and TensorFlow (TF) we have. Make sure both TF and SD are at version 1.0.0.

In [2]:
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io

import datalab.ml as ml

print('tf ' + str(tf.__version__))
print('sd ' + str(sd.__version__))

tf 1.0.0
sd 1.0.0
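
If you would rather fail early than compare the printed versions by eye, a small check along the following lines could work. This is only a sketch: `check_version` is a hypothetical helper, not part of datalab_structured_data or TensorFlow, and it assumes an exact string match on the version is what you want.

```python
def check_version(name, actual, expected='1.0.0'):
    """Raise if a package version differs from the one this notebook expects."""
    if str(actual) != expected:
        raise RuntimeError(
            '%s is %s, but this notebook expects %s' % (name, actual, expected))
    return True

# In the notebook you would pass tf.__version__ and sd.__version__.
# Literal strings are used here so the sketch runs on its own.
check_version('tf', '1.0.0')
check_version('sd', '1.0.0')
```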


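This notebook writes files during preprocessing. Specify the local and cloud root folders to use. The cell below also creates the GCS bucket; if you already created it in an earlier run, `gsutil mb` reports a 409 "already exists" error and the rest of the notebook proceeds as usual.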

In [3]:
LOCAL_ROOT = './iris_notebook_workspace' # This should be the same as what was used in the local iris notebook
CLOUD_ROOT = 'gs://' + datalab_project_id() + '-iris-classification-datalab' # Feel free to change this line.

# No need to edit anything else in this cell.
LOCAL_PREPROCESSING_DIR = os.path.join(LOCAL_ROOT, 'preprocessing')
CLOUD_PREPROCESSING_DIR = os.path.join(CLOUD_ROOT, 'cloud_preprocessing') 

LOCAL_TRAIN_FILE = os.path.join(LOCAL_ROOT, 'train.csv')
CLOUD_TRAIN_FILE = os.path.join(CLOUD_ROOT, 'train.csv')


LOCAL_SCHEMA_FILE = os.path.join(LOCAL_ROOT, 'schema.json')
CLOUD_SCHEMA_FILE = os.path.join(CLOUD_ROOT, 'schema.json')

if not file_io.file_exists(LOCAL_ROOT):
  raise ValueError('LOCAL_ROOT not found. Did you run the local notebook?')
  
!gsutil mb {CLOUD_ROOT}

Creating gs://cloud-ml-dev-iris-classification-datalab/...
ServiceException: 409 Bucket cloud-ml-dev-iris-classification-datalab already exists.
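
The cell above builds GCS paths with `os.path.join`, which produces `/`-separated paths on Linux hosts. If you ever run similar code where the OS separator differs, joining with `posixpath` avoids backslashes in `gs://` URIs. A minimal sketch, assuming that concern applies to you; `gcs_join` and the bucket name are hypothetical and not part of datalab.

```python
import posixpath

def gcs_join(root, *parts):
    """Join GCS path components with '/' regardless of the host OS."""
    return posixpath.join(root, *parts)

# Hypothetical bucket name, used only for illustration.
example_root = 'gs://my-project-iris-classification-datalab'
print(gcs_join(example_root, 'cloud_preprocessing'))
print(gcs_join(example_root, 'train.csv'))
```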


First, let us copy the training CSV file and the schema file to GCS.

In [4]:
!gsutil cp {LOCAL_TRAIN_FILE} {CLOUD_TRAIN_FILE}
!gsutil cp {LOCAL_SCHEMA_FILE} {CLOUD_SCHEMA_FILE}

Copying file://./iris_notebook_workspace/train.csv [Content-Type=text/csv]...
/ [1 files][  3.8 KiB/  3.8 KiB]                                                
Operation completed over 1 objects/3.8 KiB.                                      


<a name="local_preprocessing"></a>
Preprocessing with BigQuery starting from CSV files on GCS
=====

In [5]:
!gsutil rm -fr {CLOUD_PREPROCESSING_DIR}

Removing gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/numerical_analysis.json#1488319962597890...
Removing gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/schema.json#1488319966184631...
Removing gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/vocab_flower.csv#1488319965355094...
/ [3 objects]                                                                   
Operation completed over 3 objects.                                              


In [6]:
train_csv = ml.CsvDataSet(
  file_pattern=CLOUD_TRAIN_FILE,
  schema_file=CLOUD_SCHEMA_FILE)

In [7]:
sd.cloud_preprocess(
  dataset=train_csv,
  output_dir=CLOUD_PREPROCESSING_DIR,
)

Starting cloud preprocessing.
Track BigQuery status at
https://bigquery.cloud.google.com/queries/cloud-ml-dev
Running numerical analysis...done.
Running categorical analysis...done.
Cloud preprocessing done.


The output of preprocessing is a numerical_analysis.json file that contains statistics (max, mean, min) for each numerical column, a vocab file for each categorical column, and a schema.json file. Training consumes these files, so you should not need to work with them directly. Just for fun, let's look at them.

In [8]:
!gsutil ls  {CLOUD_PREPROCESSING_DIR}

gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/numerical_analysis.json
gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/schema.json
gs://cloud-ml-dev-iris-classification-datalab/cloud_preprocessing/vocab_flower.csv


In [9]:
!gsutil cat  {CLOUD_PREPROCESSING_DIR}/vocab_flower.csv

Iris-setosa
Iris-versicolor
Iris-virginica
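
How training consumes the vocab file is not shown in this notebook. One common use of such a file is to map each label to an integer id by line order, which can be sketched as follows. The helper `load_vocab` is hypothetical, and the vocab text is pasted inline from the output above so the sketch runs without GCS access.

```python
import io

def load_vocab(lines):
    """Map each non-empty line to its zero-based position in the file."""
    vocab = [line.strip() for line in lines if line.strip()]
    return {label: index for index, label in enumerate(vocab)}

# Contents of vocab_flower.csv, copied from the cell output above.
vocab_text = u"Iris-setosa\nIris-versicolor\nIris-virginica\n"
label_to_id = load_vocab(io.StringIO(vocab_text))
print(label_to_id)
```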


In [10]:
!gsutil cat  {CLOUD_PREPROCESSING_DIR}/numerical_analysis.json

{
  "sepal_width": {
    "max": 4.4000000000000004,
    "mean": 3.050833333333332,
    "min": 2.0
  },
  "petal_width": {
    "max": 2.5,
    "mean": 1.2324999999999995,
    "min": 0.10000000000000001
  },
  "sepal_length": {
    "max": 7.9000000000000004,
    "mean": 5.8675000000000024,
    "min": 4.2999999999999998
  },
  "key": {
    "max": 150.0,
    "mean": 76.733333333333334,
    "min": 1.0
  },
  "petal_length": {
    "max": 6.9000000000000004,
    "mean": 3.8308333333333349,
    "min": 1.1000000000000001
  }
}
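
Statistics like these are typically used to rescale numerical features before training. The exact transform applied by datalab_structured_data is not shown here, so the following is only a sketch of one plausible choice: linear scaling of a column to the range [-1, 1] using its min and max. The function `scale_to_unit_range` is hypothetical, and the statistics are rounded values copied from the output above.

```python
import json

# Rounded petal_length statistics copied from the output above.
analysis_json = '''
{
  "petal_length": {"max": 6.9, "mean": 3.8308333333333349, "min": 1.1}
}
'''

def scale_to_unit_range(value, stats):
    """Linearly map value from [stats['min'], stats['max']] to [-1, 1]."""
    span = stats['max'] - stats['min']
    if span == 0:
        # A constant column carries no information; map it to the midpoint.
        return 0.0
    return 2.0 * (value - stats['min']) / span - 1.0

stats = json.loads(analysis_json)['petal_length']
print(scale_to_unit_range(stats['min'], stats))
print(scale_to_unit_range(stats['mean'], stats))
print(scale_to_unit_range(stats['max'], stats))
```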

Cleaning things up
=====

If you want to delete the files you made on GCS, uncomment and run the next cell.

In [11]:
#!gsutil rm -fr {CLOUD_ROOT}