# 3_prepare_data_babyweight

**Learning Objectives**

1. Set up the environment
1. Preprocess natality dataset
1. Augment natality dataset
1. Create the train, eval, and test tables in BigQuery
1. Export data from BigQuery to GCS in CSV format


## Introduction 
In this notebook, we will prepare the babyweight dataset for model development and training to predict the weight of a baby before it is born. We will use BigQuery to preprocess and augment the data, which will then be used by AutoML Tables, BigQuery ML, and Keras models trained on Cloud AI Platform.

In this lab, we will set up the environment, create the project dataset, preprocess and augment the natality dataset, create the train, eval, and test tables in BigQuery, and export the data from BigQuery to GCS in CSV format.



## Set up environment variables and load necessary libraries

Import necessary libraries.

In [1]:
import os
from google.cloud import bigquery

In [2]:
PROJECT = "predict-babyweight-10142021"
BUCKET = PROJECT
REGION = "us-central1"

os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET 
os.environ["REGION"] = REGION

## Create a BigQuery Dataset

A BigQuery dataset is a container for tables, views, and models built with BigQuery ML. Let's create one called __babyweight__.

In [3]:
%%bash

# Create a BigQuery dataset for babyweight if it doesn't exist
datasetexists=$(bq ls -d | grep -w babyweight)

if [ -n "$datasetexists" ]; then
    echo -e "BigQuery dataset already exists, let's not recreate it."

else
    echo "Creating BigQuery dataset titled: babyweight"
    
    bq --location=US mk --dataset \
        --description "Babyweight" \
        $PROJECT:babyweight
    echo "Here are the current datasets:"
    bq ls
fi

BigQuery dataset already exists, let's not recreate it.


## Create the training and evaluation data tables

First, we create a subset of the data, limiting the columns to `weight_pounds`, `is_male`, `mother_age`, `plurality`, `gestation_weeks`, `cigarette_use`, and `alcohol_use`, applying some simple filtering, and adding a column to hash on for repeatable splitting.

* Note: The dataset referenced in the `CREATE TABLE` statements below is the one created previously, i.e. `babyweight`.

### Preprocess and filter dataset

We have some preprocessing and filtering we would like to do to get our data in the right format for training.

Preprocessing:
* Cast `is_male` from `BOOL` to `STRING`
* Cast `plurality` from `INTEGER` to `STRING` where `[1, 2, 3, 4, 5]` becomes `["Single(1)", "Twins(2)", "Triplets(3)", "Quadruplets(4)", "Quintuplets(5)"]`
* Cast `cigarette_use` from `BOOL` to `STRING` where `NULL` becomes `Unknown`
* Cast `alcohol_use` from `BOOL` to `STRING` where `NULL` becomes `Unknown`
* Add `hash_values`, hashing on `year`, `month`, `COALESCE(wday, day, 0)`, `IFNULL(state, "Unknown")`, and `IFNULL(mother_birth_state, "Unknown")`

Filtering:
* Only want data for the year `2003` and later
* Only want baby weights greater than `0`
* Only want mothers whose age is greater than `0`
* Only want plurality to be greater than `0`
* Only want the number of weeks of gestation to be greater than `0`
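
To make the rules above concrete, here is a minimal Python sketch that applies the same casts and filters to a single in-memory record. The names `PLURALITY_LABELS`, `bool_to_string`, and `preprocess_row` are hypothetical helpers introduced only for illustration; the notebook itself does this work in BigQuery SQL. The sketch assumes that a `NULL` (`None`) value fails a filter and that a plurality with no listed label becomes `None`.

```python
from typing import Optional

# Labels for the plurality column, mirroring the list above.
PLURALITY_LABELS = {
    1: "Single(1)",
    2: "Twins(2)",
    3: "Triplets(3)",
    4: "Quadruplets(4)",
    5: "Quintuplets(5)",
}


def bool_to_string(value: Optional[bool],
                   null_label: Optional[str] = None) -> Optional[str]:
    """Mimic casting a BOOL to STRING, optionally replacing NULL."""
    if value is None:
        return null_label
    return "true" if value else "false"


def preprocess_row(row: dict) -> Optional[dict]:
    """Return the preprocessed row, or None if the row is filtered out."""
    # Filtering: a missing (None) value fails the comparison (assumption).
    if row.get("year") is None or row["year"] < 2003:
        return None
    positive = ("weight_pounds", "mother_age", "plurality", "gestation_weeks")
    if any(row.get(name) is None or row[name] <= 0 for name in positive):
        return None

    return {
        "weight_pounds": row["weight_pounds"],
        "is_male": bool_to_string(row["is_male"]),
        "mother_age": row["mother_age"],
        # A plurality with no listed label maps to None (assumption).
        "plurality": PLURALITY_LABELS.get(row["plurality"]),
        "gestation_weeks": row["gestation_weeks"],
        "cigarette_use": bool_to_string(row["cigarette_use"], "Unknown"),
        "alcohol_use": bool_to_string(row["alcohol_use"], "Unknown"),
    }


example = {
    "year": 2005, "weight_pounds": 7.5, "is_male": True, "mother_age": 30,
    "plurality": 2, "gestation_weeks": 39,
    "cigarette_use": None, "alcohol_use": False,
}
print(preprocess_row(example))
```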

In [4]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_2003 AS
SELECT
    weight_pounds,
    CAST(is_male AS STRING) AS is_male,
    mother_age,
    CASE
        WHEN plurality = 1 THEN "Single(1)"
        WHEN plurality = 2 THEN "Twins(2)"
        WHEN plurality = 3 THEN "Triplets(3)"
        WHEN plurality = 4 THEN "Quadruplets(4)"
        WHEN plurality = 5 THEN "Quintuplets(5)"
    END AS plurality,
    gestation_weeks,
    IFNULL(CAST(cigarette_use AS STRING), "Unknown") AS cigarette_use,
    IFNULL(CAST(alcohol_use AS STRING), "Unknown") AS alcohol_use,
    ABS(FARM_FINGERPRINT(
        CONCAT(
            CAST(year AS STRING),
            CAST(month AS STRING),
            CAST(COALESCE(wday, day, 0)  AS STRING),
            CAST(IFNULL(state, "Unknown") AS STRING),
            CAST(IFNULL(mother_birth_state, "Unknown")  AS STRING)
        )
    )) AS hash_values
FROM
    publicdata.samples.natality
WHERE
    year > 2002
    AND weight_pounds > 0
    AND mother_age > 0
    AND plurality > 0
    AND gestation_weeks > 0

### Augment dataset to simulate missing data

Now we augment the dataset to simulate missing data: for every row, we add a copy with the gender information set to `Unknown` and the plurality of all non-single births set to `Multiple(2+)`.

In [6]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_2003_augmented AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    cigarette_use,
    alcohol_use,
    hash_values
FROM
    babyweight.babyweight_2003
UNION ALL
SELECT
    weight_pounds,
    "Unknown" AS is_male,
    mother_age,
    CASE
        WHEN plurality = "Single(1)" THEN plurality
        ELSE "Multiple(2+)"
    END AS plurality,
    gestation_weeks,
    cigarette_use,
    alcohol_use,
    hash_values
FROM
    babyweight.babyweight_2003
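
As a rough illustration of what the `UNION ALL` above produces, the following Python sketch performs the same augmentation on a small list of dictionaries. The function name `augment` and the sample rows are made up for this example. Each masked copy keeps the `hash_values` of its source row, so an original row and its masked copy end up in the same split.

```python
def augment(rows):
    """Return the original rows followed by a masked copy of each row."""
    masked = []
    for row in rows:
        copy = dict(row)  # leave the original row untouched
        copy["is_male"] = "Unknown"
        if copy["plurality"] != "Single(1)":
            copy["plurality"] = "Multiple(2+)"
        masked.append(copy)
    # Like UNION ALL, nothing is deduplicated: the row count doubles.
    return list(rows) + masked


# Made-up rows for illustration only.
rows = [
    {"is_male": "true", "plurality": "Twins(2)", "hash_values": 7},
    {"is_male": "false", "plurality": "Single(1)", "hash_values": 9},
]
for row in augment(rows):
    print(row)
```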

### Split augmented dataset into train, eval, and test sets

Using `hash_values`, apply a modulo to get approximately an 80/15/5 train/eval/test split.
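
The bucketing logic can be sketched in plain Python as follows. `FARM_FINGERPRINT` is not available in the Python standard library, so this sketch swaps in SHA-256 as a stand-in. The bucket a given key lands in will therefore differ from BigQuery's, and only the modulo thresholds are meant to match. The names `hash_value` and `assign_split` are hypothetical.

```python
import hashlib
from collections import Counter


def hash_value(key: str) -> int:
    """Stand-in for the BigQuery hash; NOT the same hash function."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep 63 bits so the result is always non-negative.
    return int.from_bytes(digest[:8], "big") >> 1


def assign_split(value: int) -> str:
    """Map a hash value to a split using the 80/15/5 thresholds."""
    bucket = value % 100
    if bucket < 80:
        return "train"
    if bucket < 95:
        return "eval"
    return "test"


# Made-up keys built like the hashed columns: year, month, day, states.
keys = [f"{year}{month}{day}CATX"
        for year in range(2003, 2009)
        for month in range(1, 13)
        for day in range(1, 8)]
print(Counter(assign_split(hash_value(key)) for key in keys))
```

The split is only approximate because every row sharing the same key falls into the same bucket, and keys do not cover equal numbers of rows. The benefit is repeatability: the same row always lands in the same split.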

#### Split augmented dataset into train dataset

In [7]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_2003_train AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    cigarette_use,
    alcohol_use,
FROM
    babyweight.babyweight_2003_augmented
WHERE
    MOD(hash_values, 100) < 80

#### Split augmented dataset into eval dataset

In [9]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_2003_eval AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    cigarette_use,
    alcohol_use,
FROM
    babyweight.babyweight_2003_augmented
WHERE
    MOD(hash_values, 100) >= 80
    AND MOD(hash_values, 100) < 95
    

#### Split augmented dataset into test dataset

In [10]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_2003_test AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    cigarette_use,
    alcohol_use,
FROM
    babyweight.babyweight_2003_augmented
WHERE
    MOD(hash_values, 100) >= 95

## Verify table creation

Verify that the training and evaluation tables were created.

In [11]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_2003_train
LIMIT 0

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,cigarette_use,alcohol_use


In [12]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_2003_eval
LIMIT 0

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks,cigarette_use,alcohol_use


## Export from BigQuery to CSVs in GCS

Use the BigQuery Python API to export our train, eval, and test tables to Google Cloud Storage in CSV format, to be used later for TensorFlow/Keras training.

We will use the dataset created above and repeat the export for the training, evaluation, and test data.

In [14]:
# Construct a BigQuery client object.
client = bigquery.Client()

dataset_name = "babyweight"

# Create dataset reference object
dataset_ref = bigquery.DatasetReference(client.project, dataset_name)

# Export the train, eval, and test tables
for step in ["train", "eval", "test"]:
    # Build the URI with an f-string; os.path.join would use "\" on Windows.
    destination_uri = f"gs://{BUCKET}/{dataset_name}/data/{step}*.csv"
    table_name = f"babyweight_2003_{step}"
    table_ref = dataset_ref.table(table_name)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        location="US", # Location must match that of the source table.
    )  # API request
    extract_job.result()  # Waits for job to complete.

    print(f"Exported {client.project}:{dataset_name}.{table_name} \n to {destination_uri}")

Exported predict-babyweight-10142021:babyweight.babyweight_2003_train 
 to gs://predict-babyweight-10142021/babyweight/data/train*.csv
Exported predict-babyweight-10142021:babyweight.babyweight_2003_eval 
 to gs://predict-babyweight-10142021/babyweight/data/eval*.csv
Exported predict-babyweight-10142021:babyweight.babyweight_2003_test 
 to gs://predict-babyweight-10142021/babyweight/data/test*.csv
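
The `*` in each destination URI is a wildcard that lets BigQuery write the export as one or more files, replacing the wildcard with a 12-digit, zero-padded shard number that starts at 0. The sketch below shows how the resulting file names are formed; `destination_uri` and `shard_uri` are hypothetical helpers for illustration, not part of the BigQuery client library.

```python
def destination_uri(bucket: str, dataset: str, step: str) -> str:
    """Build the wildcard URI passed to the extract job."""
    return f"gs://{bucket}/{dataset}/data/{step}*.csv"


def shard_uri(wildcard_uri: str, shard_index: int) -> str:
    """Replace the wildcard with a 12-digit, zero-padded shard number."""
    return wildcard_uri.replace("*", f"{shard_index:012d}", 1)


uri = destination_uri("predict-babyweight-10142021", "babyweight", "train")
for index in range(3):
    print(shard_uri(uri, index))
```

How many shards each table produces depends on its size, so downstream code should match files with a pattern such as `train*.csv` rather than rely on a fixed count.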


## Verify CSV creation

Verify that we correctly created the CSV files in our bucket.

In [15]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*.csv

gs://predict-babyweight-10142021/babyweight/data/eval000000000000.csv
gs://predict-babyweight-10142021/babyweight/data/test000000000000.csv
gs://predict-babyweight-10142021/babyweight/data/train000000000000.csv
gs://predict-babyweight-10142021/babyweight/data/train000000000001.csv
gs://predict-babyweight-10142021/babyweight/data/train000000000002.csv
gs://predict-babyweight-10142021/babyweight/data/train000000000003.csv
gs://predict-babyweight-10142021/babyweight/data/train000000000004.csv


In [16]:
%%bash
gsutil cat gs://${BUCKET}/babyweight/data/test000000000000.csv | head -5

weight_pounds,is_male,mother_age,plurality,gestation_weeks,cigarette_use,alcohol_use
1.43741394824,false,15,Single(1),22,false,false
2.12525620568,false,42,Single(1),30,Unknown,Unknown
2.18698563904,Unknown,42,Single(1),31,false,false
6.3382900325,Unknown,43,Multiple(2+),45,false,false


In [17]:
%%bash
gsutil cat gs://${BUCKET}/babyweight/data/eval000000000000.csv | head -5

weight_pounds,is_male,mother_age,plurality,gestation_weeks,cigarette_use,alcohol_use
5.56226287026,true,15,Single(1),31,false,false
4.629707502,true,46,Twins(2),28,false,false
2.0502990366,true,46,Twins(2),26,Unknown,Unknown
1.4991433816,true,43,Single(1),18,Unknown,Unknown
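
As a quick sanity check on the exported format, the following sketch parses CSV text with this header into typed feature dictionaries and a label, using only the Python standard library. The sample text reuses two rows from the eval preview above. The function name `parse_examples` and the choice of `weight_pounds` as the label are assumptions for illustration; the actual input pipeline used for training may differ.

```python
import csv
import io

# Header and first rows of the eval preview, used as in-memory sample data.
SAMPLE_CSV = (
    "weight_pounds,is_male,mother_age,plurality,"
    "gestation_weeks,cigarette_use,alcohol_use\n"
    "5.56226287026,true,15,Single(1),31,false,false\n"
    "4.629707502,true,46,Twins(2),28,false,false\n"
)


def parse_examples(csv_text):
    """Parse exported CSV text into (features, label) pairs."""
    examples = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        label = float(record["weight_pounds"])
        features = {
            # Categorical columns stay as strings.
            "is_male": record["is_male"],
            "plurality": record["plurality"],
            "cigarette_use": record["cigarette_use"],
            "alcohol_use": record["alcohol_use"],
            # Numeric columns are converted to integers.
            "mother_age": int(record["mother_age"]),
            "gestation_weeks": int(record["gestation_weeks"]),
        }
        examples.append((features, label))
    return examples


for features, label in parse_examples(SAMPLE_CSV):
    print(label, features)
```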


## Summary
In this notebook, we set up our environment, created a BigQuery dataset, preprocessed and augmented the natality dataset, created train, eval, and test tables in BigQuery, and exported the data from BigQuery to GCS in CSV format.