# Breast cancer predictor (binary classification) using DNN

## Step 1: Prepare training and evaluation dataset, create FastEstimator Pipeline

Pipeline accepts data either in memory or on disk. In this example, we use in-memory data loaded with sklearn.datasets.load_breast_cancer.
The dataset description can be printed as follows:

In [1]:
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
print(breast_cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

Next, split the data into train and eval sets:

In [2]:
from sklearn.model_selection import train_test_split

# return_X_y must be passed by keyword in recent scikit-learn releases
X, y = load_breast_cancer(return_X_y=True)
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size=0.2)

Next, we scale the inputs to the neural network with a StandardScaler. The scaler is fit on the training set only, and the same transformation is then applied to the eval set.

In [3]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_eval = scaler.transform(x_eval)
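
To make the scaling step concrete, the sketch below reproduces with plain NumPy what this cell computes: per-feature mean and standard deviation estimated from the training split only, then applied to both splits. The toy arrays and the helper name `standardize` are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays with statistics estimated from `train` only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses
    return (train - mean) / std, (other - mean) / std

# Toy stand-ins for x_train / x_eval (hypothetical values)
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_eval = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaled_train, scaled_eval = standardize(toy_train, toy_eval)
print(scaled_train.mean(axis=0))  # close to 0 for every feature
print(scaled_train.std(axis=0))   # close to 1 for every feature
```

The eval split is scaled with the training statistics, so its mean and standard deviation will generally be near, but not exactly, 0 and 1.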

For in-memory data, Pipeline expects a nested dictionary of the form {"mode1": {"feature1": numpy_array, "feature2": numpy_array, ...}, ...}. Each mode is either train or eval; in our case, we provide both. Each feature key is a feature name; in our case, we have x and y. For each sample, the network prediction is a rank-1 array, so we add a trailing dimension to the ground truth to match it.

In [4]:
import numpy as np

train_data = {"x": x_train, "y": np.expand_dims(y_train, -1)}
eval_data = {"x": x_eval, "y": np.expand_dims(y_eval, -1)}
data = {"train": train_data, "eval": eval_data}
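
The reason for the extra label dimension can be seen in a small NumPy sketch. The arrays below are made-up stand-ins for a label vector and a batch of single-unit predictions; they are assumptions for illustration only.

```python
import numpy as np

labels = np.array([1, 0, 1])             # shape (3,), like y_train
preds = np.array([[0.9], [0.2], [0.7]])  # shape (3, 1), like a single-unit output layer

expanded = np.expand_dims(labels, -1)    # shape (3, 1)

# Without the extra axis, broadcasting silently builds a (3, 3) matrix
mismatched = labels - preds
# With the extra axis, the arithmetic stays element-wise
matched = expanded - preds

print(labels.shape, expanded.shape)      # (3,) (3, 1)
print(mismatched.shape, matched.shape)   # (3, 3) (3, 1)
```

A shape mismatch like this does not raise an error, which is why aligning the label shape up front is the safer habit.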

In [6]:
#Parameters
epochs = 50
batch_size = 32
steps_per_epoch = None
validation_steps = None

Now we are ready to define Pipeline:

In [7]:
import fastestimator as fe

pipeline = fe.Pipeline(batch_size=batch_size, data=data)

## Step 2: Prepare model, create FastEstimator Network

First, define the network architecture as a tf.keras.Model or tf.keras.Sequential. Then pass the architecture definition together with a model name, an optimizer, and a loss name (default 'loss') to fe.build.

In [8]:
import tensorflow as tf
from tensorflow.keras import layers

def create_dnn():
    model = tf.keras.Sequential()

    model.add(layers.Dense(32, activation="relu", input_shape=(30, )))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(16, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(8, activation="relu"))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation="sigmoid"))

    return model

model = fe.build(model_def=create_dnn, model_name="dnn", optimizer="adam", loss_name="loss")
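
As a quick sanity check on the size of this architecture, the trainable parameter count can be worked out by hand. The sketch below uses only the layer widths from create_dnn; the helper name `dense_params` is an illustrative assumption. Dropout layers add no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the width of each Dense layer in create_dnn
layer_sizes = [30, 32, 16, 8, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [992, 528, 136, 9] 1665
```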

Now we are ready to define the Network: given a batch of data with keys x and y, we work our way to the loss through a series of operators. ModelOp is an operator that wraps a model.

In [9]:
from fastestimator.op.tensorop import ModelOp, BinaryCrossentropy

network = fe.Network(ops=[
    ModelOp(inputs="x", model=model, outputs="y_pred"),
    BinaryCrossentropy(inputs=("y", "y_pred"), outputs="loss"),
])
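
For intuition about the loss operator, here is a minimal NumPy sketch of the standard binary cross-entropy formula. It is not FastEstimator's implementation; the function name, the clipping constant `eps`, and the sample values are assumptions made for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

y_true = np.array([[1.0], [0.0]])
confident = binary_crossentropy(y_true, np.array([[0.9], [0.1]]))
undecided = binary_crossentropy(y_true, np.array([[0.5], [0.5]]))

print(confident)  # about 0.105, i.e. -ln(0.9)
print(undecided)  # about 0.693, i.e. ln(2)
```

A network that outputs probabilities near 0.5 yields a loss near ln(2) ≈ 0.69, which is consistent with the step-0 training loss of about 0.70 reported in the training log.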

## Step 3: Configure training, create Estimator

During the training loop, we want to 1) track the lowest loss on the eval data and 2) save the model with the lowest validation loss. The Trace class handles anything related to the training loop; here we import the ModelSaver trace.

In [10]:
import tempfile
from fastestimator.trace import ModelSaver

model_dir = tempfile.mkdtemp()
traces = [ModelSaver(model_name="dnn", save_dir=model_dir, save_best=True)]
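
The idea behind saving only the best model can be sketched in a few lines of plain Python. This is an illustration of the bookkeeping, not ModelSaver's actual code; the variable names are assumptions, and the loss values are the first five evaluation losses from the training log in this notebook.

```python
# Evaluation losses for epochs 0-4, copied from the training log
eval_losses = [0.5658682, 0.533607, 0.4919445, 0.4704986, 0.4535487]

min_loss = float("inf")
since_best = 0
saved_epochs = []  # epochs at which a "best" checkpoint would be written

for epoch, loss in enumerate(eval_losses):
    if loss < min_loss:
        # New best: remember it and (conceptually) overwrite the saved model
        min_loss = loss
        since_best = 0
        saved_epochs.append(epoch)
    else:
        since_best += 1

print(saved_epochs, min_loss, since_best)  # [0, 1, 2, 3, 4] 0.4535487 0
```

Because the loss improves at every one of these epochs, a checkpoint is written each time, which matches the repeated "Saving model" lines in the log.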

Now we can define the Estimator and specify the training configuration:

In [11]:
estimator = fe.Estimator(network=network, 
                         pipeline=pipeline, 
                         epochs=epochs, 
                         steps_per_epoch=steps_per_epoch,
                         validation_steps=validation_steps,
                         traces=traces)

## Start Training

In [12]:
estimator.fit()

    ______           __  ______     __  _                 __            
   / ____/___ ______/ /_/ ____/____/ /_(_)___ ___  ____ _/ /_____  _____
  / /_  / __ `/ ___/ __/ __/ / ___/ __/ / __ `__ \/ __ `/ __/ __ \/ ___/
 / __/ / /_/ (__  ) /_/ /___(__  ) /_/ / / / / / / /_/ / /_/ /_/ / /    
/_/    \__,_/____/\__/_____/____/\__/_/_/ /_/ /_/\__,_/\__/\____/_/     
                                                                        



W0107 16:08:06.647137 140186565609280 base_layer.py:1814] Layer dense is casting an input tensor from dtype float64 to the layer's dtype of float32, which is new behavior in TensorFlow 2.  The layer has dtype float32 because it's dtype defaults to floatx.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.



FastEstimator-Start: step: 0; total_train_steps: 700; dnn_lr: 0.001; 
FastEstimator-Train: step: 0; loss: 0.7021929; 
FastEstimator-ModelSaver: Saving model to /tmp/tmppwyeiwu2/dnn_best_loss.h5
FastEstimator-Eval: step: 14; epoch: 0; loss: 0.5658682; min_loss: 0.56586826; since_best_loss: 0; 
FastEstimator-ModelSaver: Saving model to /tmp/tmppwyeiwu2/dnn_best_loss.h5
FastEstimator-Eval: step: 28; epoch: 1; loss: 0.533607; min_loss: 0.53360695; since_best_loss: 0; 
FastEstimator-ModelSaver: Saving model to /tmp/tmppwyeiwu2/dnn_best_loss.h5
FastEstimator-Eval: step: 42; epoch: 2; loss: 0.4919445; min_loss: 0.49194452; since_best_loss: 0; 
FastEstimator-ModelSaver: Saving model to /tmp/tmppwyeiwu2/dnn_best_loss.h5
FastEstimator-Eval: step: 56; epoch: 3; loss: 0.4704986; min_loss: 0.47049853; since_best_loss: 0; 
FastEstimator-ModelSaver: Saving model to /tmp/tmppwyeiwu2/dnn_best_loss.h5
FastEstimator-Eval: step: 70; epoch: 4; loss: 0.4535487; min_loss: 0.4535487; since_best_loss: 0; 
Fast
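
The step counts in the log above can be reproduced with a little arithmetic. The sketch below assumes that the pipeline drops the final incomplete batch of each epoch; that assumption is inferred from the logged numbers (evaluation at steps 14 and 28, 700 total steps) and is not taken from the FastEstimator documentation.

```python
import math

n_samples = 569   # from the dataset description
test_size = 0.2
batch_size = 32
epochs = 50

# train_test_split rounds the test share up and gives the rest to training
n_eval = math.ceil(test_size * n_samples)     # 114
n_train = n_samples - n_eval                  # 455

# Assumption: the incomplete last batch of each epoch is dropped
steps_per_epoch = n_train // batch_size       # 14
total_train_steps = steps_per_epoch * epochs  # 700

print(n_train, n_eval, steps_per_epoch, total_train_steps)  # 455 114 14 700
```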

## Inferencing

After training, the best model is saved in a temporary folder. We can load the model from file and run inference on a sample.

In [13]:
import os

model_path = os.path.join(model_dir, 'dnn_best_loss.h5')
trained_model = tf.keras.models.load_model(model_path, compile=False)

Randomly pick one sample from the validation set and compare the ground truth with the model's prediction:

In [14]:
import numpy as np

# Draw an index that is valid for any eval set size (upper bound is exclusive)
selected_idx = np.random.randint(0, high=len(x_eval))
print("test sample idx {}, ground truth: {}".format(selected_idx, y_eval[selected_idx]))

test_sample = np.expand_dims(x_eval[selected_idx], axis=0)

predicted_value = trained_model.predict(test_sample)
print("model predicted value is {}".format(predicted_value))

test sample idx 24, ground truth: 1
model predicted value is [[0.99993205]]
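
The sigmoid output is a probability for class 1, so turning it into a class label requires a cut-off. The sketch below assumes a threshold of 0.5, which is a common default rather than something the notebook specifies, and uses the class order of the scikit-learn dataset (0 is malignant, 1 is benign). The prediction value is copied from the output above.

```python
import numpy as np

target_names = ["malignant", "benign"]      # order of load_breast_cancer().target_names
predicted_value = np.array([[0.99993205]])  # model output shown above

threshold = 0.5  # assumed cut-off, not specified in the notebook
predicted_class = int(predicted_value[0, 0] >= threshold)

print(predicted_class, target_names[predicted_class])  # 1 benign
```

In a screening setting, a different threshold may be preferable depending on the relative cost of false positives and false negatives.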
