# Classification
Linear regression is used to predict numeric values, like the chance of survival on the Titanic.
Classification means separating data points into classes.

We have a dataset containing four different features of three different types of flowers.

In [49]:
from __future__ import absolute_import, division, print_function, unicode_literals

from IPython.display import clear_output

import tensorflow as tf
import pandas as pd

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

train = pd.read_csv('data/iris_training.csv', names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv('data/iris_test.csv', names=CSV_COLUMN_NAMES, header=0)

# pop the species column off and use that as our label
train_y = train.pop('Species')
test_y = test.pop('Species')


# Input function: converts the inputs to a Dataset; shuffles and repeats if in training mode
def input_fn(features, labels, training=True, batch_size=256):
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    if training:
        dataset = dataset.shuffle(1000).repeat()

    return dataset.batch(batch_size)

# Feature columns describe how to use the input.
my_feature_columns = []
for key in train.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
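
To make the shuffle, repeat and batch behaviour of the input function more tangible, here is a minimal pure-Python sketch of the same idea. It does not use `tf.data`; the helper name `batches` and the `seed` parameter are made up for illustration and only mimic what the input function above does.

```python
import itertools
import random


def batches(rows, batch_size, training=True, seed=0):
    """Yield lists of rows, mimicking shuffle -> repeat -> batch."""
    rows = list(rows)
    if training:
        rng = random.Random(seed)

        def stream():
            while True:            # like repeat(): never runs out of data
                rng.shuffle(rows)  # like shuffle(): new order on every pass
                yield from rows

        source = stream()
    else:
        source = iter(rows)        # evaluation: one pass, original order

    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return
        yield batch


# Evaluation mode ends after one pass; the last batch may be smaller.
eval_batches = list(batches(range(10), batch_size=4, training=False))

# Training mode is endless, so we only take the first five batches.
train_batches = list(itertools.islice(batches(range(10), batch_size=4), 5))
```

In training mode the stream never ends, which is why training has to be limited by a number of steps.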

## Building the Model
TensorFlow offers two built-in models for this:
* deep neural network classifier ```DNNClassifier```
* linear classifier ```LinearClassifier```

The DNN is the better choice here because the relationship in the data may not be linear.

Changing models is easy. Most of the work is loading and preprocessing data.

We build a DNN with 2 hidden layers, which have 30 and 10 hidden nodes respectively.
These numbers are given by the tutorial and seem arbitrary; the best choice is found by experimenting and testing.

Since we have 3 different types of flowers, the model must choose between 3 classes.
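
To get a feeling for how small this network is, the number of trainable parameters can be counted by hand. The sketch below assumes plain fully connected layers with one bias per node (4 input features, hidden layers of 30 and 10 nodes, 3 output classes). The helper `dense_parameter_count` is made up for illustration and is not part of TensorFlow.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per node in the next layer
    return total


# 4 features -> 30 hidden -> 10 hidden -> 3 classes
layer_sizes = [4, 30, 10, 3]
n_params = dense_parameter_count(layer_sizes)
print(n_params)  # 150 + 310 + 33 = 493
```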

In [50]:
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[30, 10],
    n_classes=3)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/nz/48n490d10_dggw04m5r5z0d80000gp/T/tmpgz37qip_', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa913e4a850>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


## Training the model
The input function is passed as a lambda because, unlike before, we do not provide a make_input_function that returns an inner function.
It runs in training mode (the default) with our training dataset and the Species as labels.
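
The two styles are interchangeable, because in both cases the estimator receives a function it can call without arguments. The sketch below shows this with a stand-in `input_fn` that just returns its arguments. The factory `make_input_fn` and the toy data are assumptions for illustration, not the notebook's real code.

```python
def input_fn(features, labels, training=True):
    # Stand-in for the notebook's input_fn; returns a plain tuple
    # instead of a tf.data.Dataset so the sketch runs without TensorFlow.
    return features, labels, training


def make_input_fn(features, labels, training=True):
    # Factory style: returns an inner function that takes no arguments.
    def inner():
        return input_fn(features, labels, training)
    return inner


features, labels = {"PetalLength": [1.7, 4.2]}, [0, 1]

via_factory = make_input_fn(features, labels)
via_lambda = lambda: input_fn(features, labels)  # same effect, less code

print(via_factory() == via_lambda())
```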

The number of steps is similar to the number of epochs, but it counts batches rather than passes over the data: training stops after the classifier has processed that many batches.
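
As a rough conversion between steps and epochs: the batch size of 256 and the 5000 steps come from this notebook, but the number of training rows (120) is an assumption about `iris_training.csv` that I have not checked here.

```python
steps = 5000          # from classifier.train(..., steps=5000)
batch_size = 256      # default in input_fn
n_train_rows = 120    # ASSUMPTION: size of the training CSV, not verified

examples_seen = steps * batch_size      # rows processed in total
epochs = examples_seen / n_train_rows   # passes over the training data

print(examples_seen, round(epochs))
```

Under that assumption the model sees every training row more than ten thousand times.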

The output reports the current step and the loss. The lower the loss, the better.

For example, `INFO:tensorflow:Loss for final step: 0.36024907.` is pretty high, so pretty bad.
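
One way to read such a loss value, assuming it is the average softmax cross-entropy with natural logarithm (I did not verify this for the estimator): a loss of `L` corresponds to the model giving a typical (geometric mean) probability of `exp(-L)` to the correct class. Pure guessing among 3 classes gives a loss of `ln(3)`.

```python
import math

n_classes = 3
chance_loss = math.log(n_classes)   # loss of a model that always says 1/3

final_loss = 0.36024907             # value from the log line quoted above
typical_probability = math.exp(-final_loss)

print(round(chance_loss, 4))
print(round(typical_probability, 2))
```

The first logged loss (1.2353456 at step 0) is close to the chance level, which fits an untrained model.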

The output becomes more important to watch with bigger models trained on terabytes of data.

In [51]:
classifier.train(
    input_fn=lambda: input_fn(train, train_y),
    steps=5000)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/nz/48n490d10_dggw04m5r5z0d80000gp/T/tmpgz37qip_/model.ckpt.
INFO:tensorflow:loss = 1.2353456, step = 0
INFO:tensorflow:global_step/sec: 106.97
INFO:tensorflow:loss = 0.928645, step = 100 (0.936 sec)
INFO:tensorflow:global_step/sec: 160.85
INFO:tensorflow:loss = 0.87341493, step = 200 (0.622 sec)
INFO:tensorflow:global_step/sec: 156.679
INFO:tensorflow:loss = 0.8380406, step = 300 (0.637 sec)
INFO:ten

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifierV2 at 0x7fa913e4a350>

## Evaluate
We use the test dataset and the test labels this time, and set training to False.
The result was an accuracy of 95% at 5000 steps. This seems pretty good for not knowing what I am doing. The run logged below reports an accuracy of 0.8666667 instead; the random seed is not fixed, so results differ between runs.

More is not always better, so I tried 100,000 steps. The loss for the final step was much lower at 0.13823777, but the accuracy changed to 93%, which is actually worse.
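
Accuracy is simply the share of test rows where the predicted class equals the true class. The sketch below uses made-up label lists; the helper `accuracy` is mine and not part of the estimator API.

```python
def accuracy(predicted, actual):
    """Fraction of positions where prediction and label agree."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up class ids (0 = Setosa, 1 = Versicolor, 2 = Virginica)
actual = [0, 1, 2, 1, 0, 2]
predicted = [0, 1, 1, 1, 0, 2]    # one mistake out of six

score = accuracy(predicted, actual)
print(score)

# ASSUMPTION: if the test set has 30 rows, the logged 0.8666667
# would correspond to 26 correct predictions.
logged_guess = 26 / 30
```

With a test set this small, a single flower more or less changes the accuracy by several percentage points, so differences like 95% versus 93% should not be over-interpreted.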

In [52]:
eval_result = classifier.evaluate(input_fn=lambda: input_fn(test, test_y, training=False))

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-07-14T16:36:13Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/nz/48n490d10_dggw04m5r5z0d80000gp/T/tmpgz37qip_/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2020-07-14-16:36:14
INFO:tensorflow:Saving dict for global step 5000: accuracy = 0.8666667, average_loss = 0.46237597, global_step = 5000, loss = 0.46237597
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 5000: /var/folders/nz/48n490d10_dggw04m5r5z0d80000gp/T/tmpgz37qip_

## Prediction
Predictions are made by passing in a dictionary of feature-value pairs.
The function below prints the final prediction result, which is the most likely flower, together with its probability.
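
How a "most likely flower" comes out of the network can be sketched without TensorFlow: the raw outputs (logits) are turned into probabilities with softmax, and the class with the highest probability is chosen. The logits below are made up, and the assumption that the classifier works this way internally is mine.

```python
import math

SPECIES = ['Setosa', 'Versicolor', 'Virginica']


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]                 # made-up network outputs
probabilities = softmax(logits)

# Index of the largest probability, analogous to pred_dict['class_ids'][0]
class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
label = SPECIES[class_id]

print('Prediction is "{}" ({:.1f}%)'.format(label, 100 * probabilities[class_id]))
```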

In [63]:
# Convert the inputs to a Dataset without labels.
def predict_input_fn(features, batch_size=256):
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

def print_predictions(predictions):
    for pred_dict in predictions:
        class_id = pred_dict['class_ids'][0]
        probability = pred_dict['probabilities'][class_id]

        print('Prediction is "{}" ({:.1f}%)'.format(
            SPECIES[class_id], 100 * probability))

# Three samples whose expected species are 'Setosa', 'Versicolor', 'Virginica'
predict = {
    'SepalLength': [5.1, 5.9, 6.9],
    'SepalWidth': [3.3, 3.0, 3.1],
    'PetalLength': [1.7, 4.2, 5.4],
    'PetalWidth': [0.5, 1.5, 2.1]
}

predictions = classifier.predict(input_fn=lambda: predict_input_fn(predict))
print_predictions(predictions)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/nz/48n490d10_dggw04m5r5z0d80000gp/T/tmpgz37qip_/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Prediction is "Setosa" (83.6%)
Prediction is "Versicolor" (47.9%)
Prediction is "Virginica" (63.3%)
