# SVM Kernel Approximation

1. Environment Setup: import required libraries
2. Loading Dataset and Preprocessing Data
3. Build Input Fn
4. Training Model
5. Evaluating Training Data


## 1. Environment Setup: import required libraries

We import the libraries used in the following sections. **time**, **numpy**, and **tensorflow** are common libraries in machine learning. **fio** ("file input/output"), which loads the data, and **config** ("configuration file"), which sets the paths of the dataset files, are modules I wrote myself. Modify them as needed.

**patch/metrics.py**: provides the `tf.metrics.true_negatives` method, which is missing in TensorFlow r1.4.

In [1]:
import fio
import preprocessing as pc
from config import *
import utility

# python std library
import gc
import time
import logging
import functools
import collections
from multiprocessing.pool import ThreadPool

# install library
import numpy as np
import tensorflow as tf

# patch
import patch.metrics



## 2. Loading Dataset and Preprocessing Data

Load the training, validation, and test sets defined in the **config.py** file. Sample files containing indices into the data are used to undersample it.

There are two parts of the data that must be processed:

- the input data represented by the prefix "**X**" on variables
- the target data represented by the prefix "**Y**" on variables

Check out the file **preprocessing.py** (imported as `pc`) for more details.
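As a rough illustration of the standardization step, the `load` function below computes feature statistics on the training inputs only and reuses them for the test inputs. The sketch here shows that pattern with NumPy. The helper bodies are my assumptions (per-feature mean and standard deviation over the last axis), not the actual code in the preprocessing module, and the demo arrays are made up.

```python
import numpy as np

def get_feat_stat(arrays):
    """Per-feature mean and std over all arrays (hypothetical helper)."""
    # Flatten every array to (n_points, n_features) and stack them.
    flat = np.concatenate([a.reshape(-1, a.shape[-1]) for a in arrays])
    return flat.mean(axis=0), flat.std(axis=0)

def standardize(arrays, stat, eps=1e-8):
    """Shift and scale each feature using precomputed statistics."""
    mean, std = stat
    return [(a - mean) / (std + eps) for a in arrays]

rng = np.random.RandomState(0)
X_train_demo = [rng.rand(5, 3, 6) * 10.0, rng.rand(4, 2, 6) * 10.0]
X_test_demo = [rng.rand(4, 3, 6) * 10.0]

stat = get_feat_stat(X_train_demo)           # statistics from training data only
X_train_std = standardize(X_train_demo, stat)
X_test_std = standardize(X_test_demo, stat)  # reuse the training statistics

train_flat = np.concatenate([a.reshape(-1, 6) for a in X_train_std])
print(train_flat.mean(axis=0).round(6))  # close to 0 for every feature
```

Reusing the training statistics for the test set keeps information about the test distribution out of the preprocessing.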

In [2]:
def load(window_size):
    X_train = fio.load_file(X_train_dataset)
    Y_train = fio.load_file(Y_train_dataset)
    X_test = fio.load_file(X_test_dataset)
    Y_test = fio.load_file(Y_test_dataset)
    train_sample = fio.load_sample_file(train_sample_dataset)
    valid_sample = fio.load_sample_file(valid_sample_dataset)

    stat = pc.get_feat_stat(X_train)
    
    X_train = pc.standardize(X_train, stat)
    X_test = pc.standardize(X_test, stat)
    
    X_train = pc.expand(X_train, window_size)
    X_test = pc.expand(X_test, window_size)
    
    Y_train = pc.classify(Y_train)
    Y_test = pc.classify(Y_test)
    
    testing_sample = [np.indices((x.shape[0], x.shape[1])).reshape((2,-1)).T for x in X_test]
    
    return X_train, Y_train, X_test, Y_test, train_sample, valid_sample, testing_sample

In [3]:
if __name__ == "__main__":
    window_size = 23
    
    X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, train_sample, valid_sample, testing_sample = load(window_size)
    print("the length of training set:", len(X_train_orig))
    print("the length of testing set:", len(X_test_orig))
    print("the first row training data:", np.sum(X_train_orig[0][0][0]))
    print("the first row target value:", np.sum(Y_train_orig[0][0][0]))
    
    print(X_train_orig[0].shape)
    print(Y_train_orig[0].shape)
    print(train_sample[0])
    print(valid_sample[0])

the length of training set: 3
the length of testing set: 2
the first row training data: 0.60705996
the first row target value: 1
(5, 3, 23, 23, 6)
(5, 3, 1)
[[0 0]
 [3 0]
 [4 0]
 [0 1]
 [1 1]
 [2 1]
 [3 1]
 [4 1]
 [0 2]
 [2 2]
 [3 2]
 [4 2]]
[[1 0]
 [2 0]
 [1 2]]
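Each row of the sample arrays printed above is an index pair into the first two axes of one dataset. For the test set, `load` does not read a sample file; it builds the complete index grid with `np.indices`. A minimal sketch of that expression on a tiny grid (the function name `full_index_grid` is only for illustration):

```python
import numpy as np

def full_index_grid(n_rows, n_cols):
    """Return every (row, col) pair of an n_rows x n_cols grid, row-major."""
    return np.indices((n_rows, n_cols)).reshape((2, -1)).T

grid = full_index_grid(2, 3)
print(grid.tolist())
# [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]]

# An undersampled set of indices is simply a subset of these rows.
subset = grid[[0, 2, 5]]
print(subset.tolist())
# [[0, 0], [0, 2], [1, 2]]
```

For the two test arrays with leading shapes (4, 3) and (20, 10), the full grids hold 12 + 200 = 212 index pairs, which matches the "input_fn total size 212" reported in Section 5.3.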


## 3. Build Input Fn

- [Tensorflow Doc: dataset.from_generator](https://github.com/tensorflow/docs/blob/r1.4/site/en/api_docs/api_docs/python/tf/data/Dataset.md#from_generator)
- [Tensorflow Doc: dataset.batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch)
    - batch_size: A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of this dataset to combine in a single batch.
    - drop_remainder: (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped in the case it has fewer than batch_size elements; the default behavior is not to drop the smaller batch.
- [Tensorflow Doc: dataset.padded_batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch)
- [How to use dataset in tensorflow](https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428)
- [StackOverflow: Train Tensorflow model with estimator (from_generator)](https://stackoverflow.com/questions/49673602/train-tensorflow-model-with-estimator-from-generator?rq=1)
- [StackOverflow: Is Tensorflow Dataset API slower than Queues?](https://stackoverflow.com/questions/47403407/is-tensorflow-dataset-api-slower-than-queues)
- [Github: How can I use Dataset to shuffle a large whole dataset?](https://github.com/tensorflow/tensorflow/issues/14857)
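To make the `batch_size` and `drop_remainder` descriptions above concrete, here is a plain-Python sketch that mimics the batching semantics without TensorFlow. The helper `batch_elements` is hypothetical and only illustrates the behavior; it is not part of the `tf.data` API.

```python
def batch_elements(elements, batch_size, drop_remainder=False):
    """Group consecutive elements into lists of length batch_size."""
    batch = []
    for element in elements:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The last batch may be smaller; keep it unless drop_remainder is set.
    if batch and not drop_remainder:
        yield batch

print(list(batch_elements(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
print(list(batch_elements(range(7), 3, drop_remainder=True)))
# [[0, 1, 2], [3, 4, 5]]
```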

**Got the warning: Out of range StopIteration**

```shell
W tensorflow/core/framework/op_kernel.cc:1192] Out of range: StopIteration: Iteration finished
```

> I also meeting this problem same for you,but it is not a bug.
>
> you can see the doc in https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator about train()
> 
> steps: Number of steps for which to train model. If None, train forever or train until input_fn generates the OutOfRange error or StopIteration exception. 'steps' works incrementally. If you call two times train(steps=10) then training occurs in total 20 steps. If OutOfRange or StopIteration occurs in the middle, training stops before 20 steps. If you don't want to have incremental behavior please set max_steps instead. If set, max_steps must be None.
>
> -- libulin

From [Github Comment](https://github.com/tensorflow/tensorflow/issues/12414#issuecomment-345131765)

With the fix in [301a6c4](https://github.com/tensorflow/tensorflow/commit/301a6c41cbb111fae89657a49775920aa70525fd) (and a related fix for the StopIteration logging in [c154d47](https://github.com/tensorflow/tensorflow/commit/c154d4719eea88e694f4c06bcb1249dbac0f7877)), the logs should be much quieter when using tf.data.

Simple fix:

```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # ERROR
import tensorflow as tf
```

In [4]:
def np_input_fn(X, Y, samples, shuffle=False, window_size=1, batch=None, epoch=None):
    if batch is None:
        raise ValueError('batch must not be None')
    
    # each sample is a window centered on one point, so the size must be odd
    if window_size & 0x1 == 0:
        raise ValueError('window_size must be odd')
    dim_in = window_size * window_size * 6
    
    if not isinstance(X, list):
        X = [X]
    if not isinstance(Y, list):
        Y = [Y]
    if not isinstance(samples, list):
        samples = [samples]
    
    samples = [np.pad(x, ((0,0),(1,0)), 'constant', constant_values=i) for i, x in enumerate(samples)]
    samples = np.concatenate(samples)
    
    print("input_fn total size", len(samples))
    
    def generator():
        if shuffle:
            # shuffle the sample order in place
            np.random.shuffle(samples)
        
        for s in samples:
            x = X[s[0]][s[1], s[2]].reshape((dim_in))
            y = Y[s[0]][s[1], s[2]].reshape((1))
            yield x, y
    
    def _input_fn():
        dataset = tf.data.Dataset.from_generator(generator,
                                                   output_types= (tf.float32, tf.int32), 
                                                   output_shapes=(tf.TensorShape([dim_in]), tf.TensorShape([1])))
        dataset = dataset.batch(batch_size=batch)
        dataset = dataset.repeat(epoch)
        dataset = dataset.prefetch(1)

        iterator = dataset.make_one_shot_iterator()
        features_tensors, labels = iterator.get_next()
        print(features_tensors)
        print(labels)
        features = {'data': features_tensors }
        return features, labels
    
    return _input_fn


In [5]:
if __name__ == "__main__":
    print(np_input_fn(X_train_orig, Y_train_orig, train_sample, window_size=23, batch=128)())

input_fn total size 16092
Tensor("IteratorGetNext:0", shape=(?, 3174), dtype=float32)
Tensor("IteratorGetNext:1", shape=(?, 1), dtype=int32)
({'data': <tf.Tensor 'IteratorGetNext:0' shape=(?, 3174) dtype=float32>}, <tf.Tensor 'IteratorGetNext:1' shape=(?, 1) dtype=int32>)
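Inside `np_input_fn`, the `np.pad` call prepends the dataset index to every index pair, so that one concatenated array can address points across several datasets. A small sketch with made-up sample arrays:

```python
import numpy as np

# Two hypothetical sample arrays, one per dataset, holding index pairs.
samples = [np.array([[0, 0], [3, 0]]), np.array([[1, 2]])]

# Pad one column on the left of axis 1, filled with the dataset index i.
tagged = [np.pad(s, ((0, 0), (1, 0)), 'constant', constant_values=i)
          for i, s in enumerate(samples)]
merged = np.concatenate(tagged)
print(merged.tolist())
# [[0, 0, 0], [0, 3, 0], [1, 1, 2]]
```

Each merged row `s` is then read by the generator as `X[s[0]][s[1], s[2]]`. The flattened feature length 23 * 23 * 6 = 3174 is the second dimension of the tensor shapes printed above.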


## 4. Training Model

The model follows the article [Improving Linear Models Using Explicit Kernel Methods](https://github.com/Debian/tensorflow/blob/master/tensorflow/contrib/kernel_methods/g3doc/tutorial.md).

[TensorFlow Estimators: Managing Simplicity vs. Flexibility in High-Level Machine Learning Frameworks](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/18d86099a350df93f2bd88587c0ec6d118cc98e7.pdf)

Optimizer

- [Ftrl Optimizer](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/FtrlOptimizer)
- [Adam Optimizer](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/AdamOptimizer)

Build the following models:

1. Build Linear Classifier Model
2. Build Random Fourier Feature Mapper Model and Linear Classifier Model

### 4.1. Training Linear Classifier Model

In [6]:
def create_linear_model(learning_rate, dim_in, config=None):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    image_column = tf.contrib.layers.real_valued_column('data', dimension=dim_in)
    
    estimator = tf.contrib.learn.LinearClassifier(
        feature_columns=[image_column],
        n_classes=2, 
        config=config,
        optimizer=optimizer)

    return estimator

In [7]:
if __name__ == "__main__":
    batch = 128
    epoch = 2
    train_input_fn = np_input_fn(X_train_orig, Y_train_orig, train_sample, shuffle=True, window_size=23, batch=batch, epoch=epoch)
    
    learning_rate = 0.001       # Adam Optimizer
    input_dim = 23 * 23 * 6     # Data size
    
    estimator = create_linear_model(learning_rate, input_dim)

    start = time.time()
    estimator.fit(input_fn=train_input_fn) # Train.
    end = time.time()
    print('Elapsed time: {} seconds'.format(end - start))
    

input_fn total size 16092
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x827ed7390>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/var/folders/40/1jhkx0kj6ld3fv5st6wpccsr0000gp/T/tmpks701peg'}
Tensor("IteratorGetNext:0", shape=(?, 3174), dtype=float32)
Tensor("IteratorGetNext:1", shape=(?, 1), dtype=int32)
Instructions for updating:
Please switch to tf.train.get_global_step
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving che

### 4.2. Training Random Fourier Feature Mapper Model
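For intuition, here is a NumPy sketch of the standard random Fourier feature construction, which approximates an RBF (Gaussian) kernel with an explicit feature map. As far as I understand, this is the idea behind `RandomFourierFeatureMapper`, but the code below is my own illustration with small made-up dimensions, not the TensorFlow implementation; `rff_map` and the demo vectors are hypothetical.

```python
import numpy as np

def rff_map(X, omega, bias):
    """Map rows of X to features sqrt(2/D) * cos(X @ omega + bias)."""
    dim_out = omega.shape[1]
    return np.sqrt(2.0 / dim_out) * np.cos(X @ omega + bias)

rng = np.random.RandomState(0)
dim_in, dim_out, stddev = 8, 5000, 1.0

# Frequencies ~ N(0, 1/stddev^2) and phases ~ U[0, 2*pi), drawn once and fixed.
omega = rng.normal(0.0, 1.0 / stddev, size=(dim_in, dim_out))
bias = rng.uniform(0.0, 2.0 * np.pi, size=dim_out)

x = rng.normal(size=(1, dim_in)) * 0.3
y = rng.normal(size=(1, dim_in)) * 0.3

# Inner product in feature space vs. the exact RBF kernel value.
approx = (rff_map(x, omega, bias) @ rff_map(y, omega, bias).T).item()
exact = float(np.exp(-np.sum((x - y) ** 2) / (2.0 * stddev ** 2)))
print(approx, exact)  # the two values should be close for large dim_out
```

A linear classifier trained on the mapped features then behaves roughly like a kernel classifier, which is why the model below stacks the mapper in front of a linear model. A larger output dimension gives a better approximation at a higher compute and memory cost.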

In [8]:
def create_rffm_model(learning_rate, dim_in, dim_out, stddev, config=None):
    
    kernel_mapper = tf.contrib.kernel_methods.RandomFourierFeatureMapper(dim_in, dim_out, stddev, name='rffm')
    
    optimizer = tf.train.AdamOptimizer(learning_rate)
    image_column = tf.contrib.layers.real_valued_column('data', dimension=dim_in)

    estimator = tf.contrib.kernel_methods.KernelLinearClassifier(
        feature_columns=[image_column], 
        n_classes=2, 
        config=config,
        optimizer=optimizer, 
        kernel_mappers={image_column: [kernel_mapper]})
    
    return estimator

In [9]:
if __name__ == "__main__":
    
    batch = 128
    epoch = 1
    train_input_fn = np_input_fn(X_train_orig, Y_train_orig, train_sample, shuffle=True, window_size=23, batch=batch, epoch=epoch)
    
    learning_rate = 0.001  # Adam Optimizer

    # RFFM
    input_dim = 23 * 23 * 6
    output_dim = 23 * 23 * 6 * 10
    stddev = 1.0

    estimator = create_rffm_model(learning_rate, input_dim, output_dim, stddev)
    
    start = time.time()
    estimator.fit(input_fn=train_input_fn) # Train.
    end = time.time()
    print('Elapsed time: {} seconds'.format(end - start))

input_fn total size 16092
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x828faa128>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/var/folders/40/1jhkx0kj6ld3fv5st6wpccsr0000gp/T/tmp15r9t5tx'}
Tensor("IteratorGetNext:0", shape=(?, 3174), dtype=float32)
Tensor("IteratorGetNext:1", shape=(?, 1), dtype=int32)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/40/1jhkx0kj6ld3fv5st6wpccsr0000gp/T/
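The number of training steps follows from the input function: it batches before repeating, so each epoch ends with one smaller batch. A quick check of the arithmetic, assuming the final partial batch is kept (the default for `dataset.batch`); the helper name is only for illustration:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(16092, 128))
# 126
```

With `epoch = 1` this gives 126 steps, which is consistent with the `model.ckpt-126` checkpoint restored in the evaluation logs of Section 5.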

## 5. Evaluating Training Data

1. Evaluating Training Data
2. Evaluating Validation Data
3. Evaluating Testing Data

**Confusion Matrix**

- [Classification: True vs. False and Positive vs. Negative](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative)
- [How to tell a good machine learning model from a bad one? Understand the Confusion Matrix in seconds](https://www.ycc.idv.tw/confusion-matrix.html)
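The four confusion-matrix counts are enough to recompute the headline metrics. A small sketch in plain Python (the function `confusion_metrics` is hypothetical), checked against the counts from the training-set evaluation log in Section 5.1:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Counts taken from the training-set evaluation log in Section 5.1.
m = confusion_metrics(tp=5621, tn=5355, fp=2632, fn=2484)
print({k: round(v, 4) for k, v in m.items()})
# {'accuracy': 0.6821, 'precision': 0.6811, 'recall': 0.6935}
```

These values agree with the `accuracy`, `precision/positive_threshold_0.500000_mean`, and `recall/positive_threshold_0.500000_mean` entries reported by `estimator.evaluate` in that log.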

**estimator.evaluate**

- [Tensorflow Doc: estimator evaluate metrics](https://tensorflow.google.cn/versions/r1.15/api_docs/python/tf/keras/metrics)
- [AI Stack Exchange: Meaning of evaluation metrics in Tensorflow](https://ai.stackexchange.com/questions/6383/meaning-of-evaluation-metrics-in-tensorflow)

```python
x, y = {'data': X}, Y
input_fn = tf.estimator.inputs.numpy_input_fn(x, y, batch_size=batch, shuffle=False, num_epochs=1)
metric = estimator.evaluate(input_fn=input_fn)
```

**estimator.predict_classes**

```python
x, y = {'data': X_train.astype(np.float32) }, Y_train
batch = 128

input_fn = tf.estimator.inputs.numpy_input_fn(x, batch_size=batch, shuffle=False, num_epochs=1)
predictions = estimator.predict_classes(input_fn=input_fn)

# print each predicted class next to its true label
for i, p in enumerate(predictions):
    print(p, y[i][0])
```

**Metrics**

- [Python tensorflow.contrib.learn.MetricSpec() Examples](https://www.programcreek.com/python/example/96156/tensorflow.contrib.learn.MetricSpec)
- [Tensorflow Doc: Available Metrics](https://github.com/tensorflow/docs/tree/r1.4/site/en/api_docs/api_docs/python/tf/metrics)

```python
metrics = { "accuracy": tf.contrib.learn.MetricSpec(metric_fn=tf.metrics.accuracy, prediction_key="classes") }
metric = estimator.evaluate(input_fn=eval_input_fn, metrics=metrics)
```


In [10]:
def evaluate_model(estimator, X, Y, samples, window_size=1, batch=2048, epoch=1):

    eval_input_fn = np_input_fn(X, Y, samples, shuffle=False, window_size=window_size, batch=batch, epoch=epoch)
    
    metrics = {
        "tp": tf.contrib.learn.MetricSpec(metric_fn=tf.metrics.true_positives, prediction_key="classes"),
        "tn": tf.contrib.learn.MetricSpec(metric_fn=patch.metrics.true_negatives, prediction_key="classes"),
        "fp": tf.contrib.learn.MetricSpec(metric_fn=tf.metrics.false_positives, prediction_key="classes"),
        "fn": tf.contrib.learn.MetricSpec(metric_fn=tf.metrics.false_negatives, prediction_key="classes"),
    }
    
    start = time.time()
    metric = estimator.evaluate(input_fn=eval_input_fn, metrics=metrics)
    end = time.time()
    print('Elapsed time: {} seconds'.format(end - start))
    
    return metric


### 5.1 Evaluating Training Data

In [11]:
if __name__ == "__main__":
    start = time.time()
    metrics = evaluate_model(estimator, X_train_orig, Y_train_orig, train_sample, batch=1, window_size=23)
    end = time.time()
    print('Elapsed time: {} seconds'.format(end - start))
    print("training metrics:", metrics)


input_fn total size 16092
Tensor("IteratorGetNext:0", shape=(?, 3174), dtype=float32)
Tensor("IteratorGetNext:1", shape=(?, 1), dtype=int32)
INFO:tensorflow:Starting evaluation at 2020-03-24-20:46:54
INFO:tensorflow:Restoring parameters from /var/folders/40/1jhkx0kj6ld3fv5st6wpccsr0000gp/T/tmp15r9t5tx/model.ckpt-126
INFO:tensorflow:Finished evaluation at 2020-03-24-20:55:26
INFO:tensorflow:Saving dict for global step 126: accuracy = 0.68207806, accuracy/baseline_label_mean = 0.5036664, accuracy/threshold_0.500000_mean = 0.68207806, auc = 0.75058556, auc_precision_recall = 0.74787956, fn = 2484.0, fp = 2632.0, global_step = 126, labels/actual_label_mean = 0.5036664, labels/prediction_mean = 0.5025797, loss = 0.61646944, precision/positive_threshold_0.500000_mean = 0.68108565, recall/positive_threshold_0.500000_mean = 0.6935225, tn = 5355.0, tp = 5621.0
Elapsed time: 528.6947410106659 seconds
Elapsed time: 528.7258548736572 seconds
training metrics: {'loss': 0.61646944, 'accuracy': 0.682

### 5.2 Evaluating Validation Data

In [12]:
if __name__ == "__main__":
    start = time.time()
    metrics = evaluate_model(estimator, X_train_orig, Y_train_orig, valid_sample, batch=1, window_size=23)
    end = time.time()
    print('Elapsed time: {} seconds'.format(end - start))
    print("validation metrics:", metrics)


input_fn total size 3973
Tensor("IteratorGetNext:0", shape=(?, 3174), dtype=float32)
Tensor("IteratorGetNext:1", shape=(?, 1), dtype=int32)
INFO:tensorflow:Starting evaluation at 2020-03-24-20:55:41
INFO:tensorflow:Restoring parameters from /var/folders/40/1jhkx0kj6ld3fv5st6wpccsr0000gp/T/tmp15r9t5tx/model.ckpt-126
INFO:tensorflow:Finished evaluation at 2020-03-24-20:58:01
INFO:tensorflow:Saving dict for global step 126: accuracy = 0.49811226, accuracy/baseline_label_mean = 0.49836394, accuracy/threshold_0.500000_mean = 0.49811226, auc = 0.49631155, auc_precision_recall = 0.48840523, fn = 982.0, fp = 1012.0, global_step = 126, labels/actual_label_mean = 0.49836394, labels/prediction_mean = 0.5020263, loss = 0.72332865, precision/positive_threshold_0.500000_mean = 0.49651742, recall/positive_threshold_0.500000_mean = 0.5040404, tn = 981.0, tp = 998.0
Elapsed time: 145.8856339454651 seconds
Elapsed time: 145.89155507087708 seconds
validation metrics: {'loss': 0.72332865, 'accuracy': 0.49

### 5.3 Evaluating Testing Data

In [13]:
if __name__ == "__main__":
    print(len(X_test_orig))
    print(X_test_orig[0].shape)
    print(X_test_orig[1].shape)
    start = time.time()
    metrics = evaluate_model(estimator, X_test_orig, Y_test_orig, testing_sample, batch=1, window_size=23)
    end = time.time()
    print('Elapsed time: {} seconds'.format(end - start))
    print("testing metrics:", metrics)


2
(4, 3, 23, 23, 6)
(20, 10, 23, 23, 6)
input_fn total size 212
Tensor("IteratorGetNext:0", shape=(?, 3174), dtype=float32)
Tensor("IteratorGetNext:1", shape=(?, 1), dtype=int32)
INFO:tensorflow:Starting evaluation at 2020-03-24-20:58:07
INFO:tensorflow:Restoring parameters from /var/folders/40/1jhkx0kj6ld3fv5st6wpccsr0000gp/T/tmp15r9t5tx/model.ckpt-126
INFO:tensorflow:Finished evaluation at 2020-03-24-20:58:29
INFO:tensorflow:Saving dict for global step 126: accuracy = 0.49056605, accuracy/baseline_label_mean = 0.509434, accuracy/threshold_0.500000_mean = 0.49056605, auc = 0.5028935, auc_precision_recall = 0.5129694, fn = 51.0, fp = 57.0, global_step = 126, labels/actual_label_mean = 0.509434, labels/prediction_mean = 0.5030061, loss = 0.701244, precision/positive_threshold_0.500000_mean = 0.5, recall/positive_threshold_0.500000_mean = 0.5277778, tn = 47.0, tp = 57.0
Elapsed time: 28.010125875473022 seconds
Elapsed time: 28.011068105697632 seconds
testing metrics: {'loss': 0.701244, '