
# Reference
[TensorFlow: Premade Estimators](https://www.tensorflow.org/tutorials/estimator/premade?hl=ko)

# Workflow
1. EDA
2. Build an input function with Dataset and feature columns
3. Use an Estimator


# 1. EDA

Data: the Iris dataset.
Each flower is labeled as one of three species according to the size of its sepals and petals.

Each data row consists of:
- Sepal length
- Sepal width
- Petal length
- Petal width
- Species



In [None]:
import tensorflow as tf
import pandas as pd

In [None]:
train_path = tf.keras.utils.get_file(
    "iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_path = tf.keras.utils.get_file(
    "iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")

CSV_COLUMN_NAMES = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
SPECIES = ['Setosa', 'Versicolor', 'Virginica']

train = pd.read_csv(train_path, names=CSV_COLUMN_NAMES, header=0)
test = pd.read_csv(test_path, names=CSV_COLUMN_NAMES, header=0)

In [None]:
train.head()

Check that the petal and sepal measurements differ by species.

In [None]:
val_by_spec = train.groupby("Species").mean()

In [None]:
val_by_spec.plot(kind='bar')
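
The `Species` column holds integer labels, so the grouped means are indexed by 0, 1, and 2. As a small sketch of the same `groupby` step with readable labels, the example below uses a hypothetical toy frame (the numbers are made up for illustration and are not taken from the Iris CSV files) and swaps the integer index for the names in `SPECIES`.

```python
import pandas as pd

SPECIES = ['Setosa', 'Versicolor', 'Virginica']

# Hypothetical toy rows, NOT values from the Iris CSV files.
toy = pd.DataFrame({
    'PetalLength': [1.0, 2.0, 4.0, 5.0, 6.0, 7.0],
    'Species': [0, 0, 1, 1, 2, 2],
})

# Mean of every numeric column per species label.
means = toy.groupby('Species').mean()

# Replace the integer labels 0/1/2 with species names for easier reading.
means.index = [SPECIES[i] for i in means.index]
```

Calling `means.plot(kind='bar')` on a frame prepared this way would label the bars with species names instead of integers.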

Split the data into inputs (features) and outputs (labels).

In [None]:
train_y = train.pop('Species')
test_y = test.pop('Species')

# 2. Input function


### tf.data.Dataset
An input function can be built with `tf.data.Dataset`. `tf.data.Dataset` handles many kinds of data and includes features such as batching.

`tf.data.Dataset` lets you build the input pipeline for model training.

In [None]:
def input_fn(features, labels, training=True, batch_size=256):
    # Convert the pandas data to a tf.data.Dataset.
    # Passing features as a dict keeps the column names, so feature columns can look them up by key.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle and repeat only in training mode.
    if training:
        dataset = dataset.shuffle(1000).repeat()

    # Return the data in batches of batch_size.
    return dataset.batch(batch_size)


In [None]:
input_fn(train.head(), train_y.head())
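
As a rough illustration of what the `shuffle`, `repeat`, and `batch` calls in `input_fn` do to the stream of rows, the pure-Python sketch below mimics the same ordering logic. It is not how `tf.data` is implemented: `batch_stream` is a hypothetical helper, and it reshuffles each full pass over the data instead of using a fixed-size shuffle buffer.

```python
import itertools
import random


def batch_stream(rows, batch_size, training=True, seed=0):
    """Yield lists of rows, loosely mirroring shuffle -> repeat -> batch."""
    rng = random.Random(seed)

    def element_stream():
        while True:
            epoch = list(rows)
            if not epoch:
                return                # nothing to yield for empty input
            if training:
                rng.shuffle(epoch)    # reshuffle on every pass over the data
            yield from epoch
            if not training:
                return                # evaluation: a single pass, then stop

    stream = element_stream()
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        yield batch


# Evaluation mode: one epoch, and the last batch may be smaller.
eval_batches = list(batch_stream(range(5), batch_size=2, training=False))

# Training mode: the stream never ends, so take only the first three batches.
train_batches = list(itertools.islice(batch_stream(range(5), batch_size=2), 3))
```

In training mode the batches run across epoch boundaries, which is why a training call needs an explicit stopping condition.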

Without the `dict` conversion, the column names of the input features are not recognized, as the following shows:

In [None]:
def input_fn2(features, labels, training=True, batch_size=256):
    # Convert the pandas data to a tf.data.Dataset.
    # Here features is passed as-is, without dict(), so the column names are dropped.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))

    # Shuffle and repeat only in training mode.
    if training:
        dataset = dataset.shuffle(1000).repeat()

    # Return the data in batches of batch_size.
    return dataset.batch(batch_size)

In [None]:
input_fn2(train.head(), train_y.head())
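
The difference between the two input functions comes down to what pandas hands over. The sketch below uses only pandas and a hypothetical two-row frame to show it: `dict(frame)` keeps one named entry per column, while the bare frame is, as far as array conversion goes, a single unnamed 2-D block of values.

```python
import pandas as pd

# Hypothetical two-row frame with Iris-style column names.
frame = pd.DataFrame({'SepalLength': [5.1, 5.9], 'PetalLength': [1.7, 4.2]})

# dict() iterates over the column labels: {column name: Series}.
by_column = dict(frame)
column_names = list(by_column.keys())

# Converting the frame itself gives one 2-D array; the names are gone.
as_matrix = frame.to_numpy()
```

Feature columns look up their inputs by key, which is why the dictionary form is the one that works with them.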

### Defining features with feature columns

All Iris features are numerical, so `tf.feature_column.numeric_column` is used for each one. Categorical data can also be described with feature columns.

In [None]:
my_feature_columns = []
for key in train.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))

In [None]:
my_feature_columns

Feature columns can be far more sophisticated than those we're showing here.  You can read more about Feature Columns in [this guide](https://www.tensorflow.org/guide/feature_columns).

Now that you have the description of how you want the model to represent the raw
features, you can build the estimator.

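# 3. Estimator

In addition to networks you build yourself, TensorFlow provides predefined networks (premade Estimators).

They can be used as conveniently as the `estimator` objects provided by scikit-learn. The premade classifier Estimators include: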


* `tf.estimator.DNNClassifier` for deep models that perform multi-class
  classification.
* `tf.estimator.DNNLinearCombinedClassifier` for wide & deep models.
* `tf.estimator.LinearClassifier` for classifiers based on linear models.



In [None]:
# Build a DNN with 2 hidden layers with 30 and 10 hidden nodes each.
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 30 and 10 nodes respectively.
    hidden_units=[30, 10],
    # The model must choose between 3 classes.
    n_classes=3)
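
To get a feel for the size of this network, the sketch below counts its trainable parameters. It assumes plain fully connected layers with one bias per unit and no other trainable variables; `dense_parameter_count` is a hypothetical helper, not part of TensorFlow.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out   # weight matrix + bias vector
    return total


# 4 input features -> 30 and 10 hidden units -> 3 class outputs
iris_layers = [4, 30, 10, 3]
iris_parameter_count = dense_parameter_count(iris_layers)
print(iris_parameter_count)  # 493 = 150 + 310 + 33
```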

# 4. Training, evaluation, and prediction

Now that you have an Estimator object, you can call methods to do the following:

* Train the model.
* Evaluate the trained model.
* Use the trained model to make predictions.

### Training the model
Train the model by calling the Estimator's `train` method as follows:

In [None]:
# Train the Model.
classifier.train(
    input_fn=lambda: input_fn(train, train_y, training=True),
    steps=5000)

Note that you wrap up your `input_fn` call in a
[`lambda`](https://docs.python.org/3/tutorial/controlflow.html)
to capture the arguments while providing an input function that takes no
arguments, as expected by the Estimator. The `steps` argument tells the method
to stop training after a number of training steps.
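
The wrapping pattern itself is plain Python. As a minimal sketch with hypothetical stand-ins (`make_batches` plays the role of an input function that needs arguments, `run` plays the role of a caller that invokes it with none), both a `lambda` and `functools.partial` produce a zero-argument callable:

```python
from functools import partial


def make_batches(values, batch_size=2):
    """Hypothetical stand-in for an input function that needs arguments."""
    return [values[i:i + batch_size] for i in range(0, len(values), batch_size)]


def run(input_fn):
    """Hypothetical stand-in for a caller that invokes input_fn with no arguments."""
    return input_fn()


data = [1, 2, 3, 4, 5]

# The lambda captures `data` and `batch_size`; run() only ever calls input_fn().
via_lambda = run(lambda: make_batches(data, batch_size=2))

# functools.partial binds the same arguments ahead of time.
via_partial = run(partial(make_batches, data, batch_size=2))
```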


### Evaluating the model
Now that the model has been trained, you can get some statistics on its
performance. The following code block evaluates the accuracy of the trained
model on the test data:


In [None]:
eval_result = classifier.evaluate(
    input_fn=lambda: input_fn(test, test_y, training=False))

print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))

Unlike the call to the `train` method, you did not pass the `steps`
argument to evaluate. The `input_fn` for eval only yields a single
[epoch](https://developers.google.com/machine-learning/glossary/#epoch) of data.


The `eval_result` dictionary also contains the `average_loss` (mean loss per sample), the `loss` (mean loss per mini-batch) and the value of the estimator's `global_step` (the number of training iterations it underwent).
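
As a sketch of what accuracy and a per-sample mean loss measure, the example below computes both by hand. It assumes the loss is the usual cross-entropy on predicted class probabilities, and the probabilities and labels are hypothetical values chosen for easy arithmetic, not output from the trained model.

```python
import math

# Hypothetical predicted probabilities for four samples and their true labels.
probabilities = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
    [0.50, 0.25, 0.25],
]
labels = [0, 1, 2, 1]

# Predicted class = index of the largest probability in each row.
predicted = [row.index(max(row)) for row in probabilities]

# Accuracy: fraction of samples whose predicted class matches the label.
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)

# Mean cross-entropy per sample: -log of the probability given to the true class.
average_loss = sum(-math.log(row[y]) for row, y in zip(probabilities, labels)) / len(labels)
```

The last sample is misclassified, so it lowers the accuracy and also contributes the largest individual loss term.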


### Predicting on new data
You now have a trained model that produces good evaluation results.
You can now use the trained model to predict the species of an Iris flower
based on some unlabeled measurements. As with training and evaluation, you make
predictions using a single function call:

In [None]:
# Generate predictions from the model
expected = ['Setosa', 'Versicolor', 'Virginica']
predict_x = {
    'SepalLength': [5.1, 5.9, 6.9],
    'SepalWidth': [3.3, 3.0, 3.1],
    'PetalLength': [1.7, 4.2, 5.4],
    'PetalWidth': [0.5, 1.5, 2.1],
}

def predict_input_fn(features, batch_size=256):
    """An input function for prediction."""
    # Named separately so the training/evaluation input_fn defined earlier is not overwritten.
    # Convert the inputs to a Dataset without labels.
    return tf.data.Dataset.from_tensor_slices(dict(features)).batch(batch_size)

predictions = classifier.predict(
    input_fn=lambda: predict_input_fn(predict_x))

The `predict` method returns a Python iterable, yielding a dictionary of
prediction results for each example. The following code prints a few
predictions and their probabilities:

In [None]:
for pred_dict, expec in zip(predictions, expected):
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]

    print('Prediction is "{}" ({:.1f}%), expected "{}"'.format(
        SPECIES[class_id], 100 * probability, expec))
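
For a multi-class classifier, the probabilities are typically a softmax over the network's raw output scores (logits), and the predicted class is the index of the largest one. The sketch below shows that relationship in plain Python; the logits are hypothetical values, not output from the trained model.

```python
import math

SPECIES = ['Setosa', 'Versicolor', 'Virginica']


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one flower, NOT produced by the classifier above.
logits = [0.2, 2.5, 1.1]
probabilities = softmax(logits)

# The predicted class is the index with the highest probability.
class_id = max(range(len(probabilities)), key=probabilities.__getitem__)

print('Prediction is "{}" ({:.1f}%)'.format(
    SPECIES[class_id], 100 * probabilities[class_id]))
```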