"""
In earlier examples we divided the data into a training set, a validation set, and a test set. We do not evaluate a model on the data
it was trained on because of overfitting: after some point in training, performance on the training data keeps improving while
performance on data the model has never seen stalls or gets worse. The goal of machine learning is a generalized model that works well
on unseen data, so overfitting is a major obstacle. Because we can only control what we can observe, a trustworthy way to measure a
model's generalization performance is essential. The next section covers strategies for reducing overfitting and maximizing
generalization. Here we focus on how to measure a model's generalization performance.

The key to evaluating a model is to divide the data into training, validation, and test sets. We train the model on the training set
and evaluate it on the validation set. Once the model is ready for use, we test it one last time on the test set. We could use only two
sets, a training set and a test set, but we do not, because we always tune the model's settings while developing it. For example, we
choose the number of layers or the number of units per layer (these are called hyperparameters to distinguish them from the network's
weights). We tune them based on the model's performance on the validation set. In effect, this tuning is itself a form of training: a
search for a good configuration in some parameter space. As a result, if we tune the settings based on validation performance, the
model can overfit to the validation set even though it is never trained on it directly. The key to this phenomenon is information
leakage. Every time we tune a hyperparameter based on validation performance, some information about the validation data leaks into the
model. Only a little information leaks if we tune once. However, as the number of tuning rounds on the same validation set grows, more
and more information leaks into the model. In the end we obtain a model that works well on the validation set, because that is what we
optimized for. But what we care about is performance on new data, so we need a completely different dataset to evaluate the model.
This is the test set. The model must not receive any information about the test set by any route, not even indirectly.
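
The leak described above can be made concrete with a small simulation. This is only a sketch under stated assumptions, not a real
tuning loop: the labels are coin flips, every "configuration" is a hypothetical model that guesses at random, and the names
tune_on_validation and accuracy are invented for this illustration. It uses only the Python standard library. Because no
configuration is truly better than another, any gap between the best validation score and the test score comes from selection alone.

```python
import random


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def tune_on_validation(n_trials, val_labels, test_labels, rng):
    # Each trial stands in for one hyperparameter configuration. Every
    # "model" here guesses at random, so none is truly better than another.
    best_val, test_of_best = -1.0, None
    for _ in range(n_trials):
        val_preds = [rng.randint(0, 1) for _ in val_labels]
        test_preds = [rng.randint(0, 1) for _ in test_labels]
        val_acc = accuracy(val_preds, val_labels)
        if val_acc > best_val:  # keep whatever looks best on the validation set
            best_val = val_acc
            test_of_best = accuracy(test_preds, test_labels)
    return best_val, test_of_best


rng = random.Random(0)
val_labels = [rng.randint(0, 1) for _ in range(50)]
test_labels = [rng.randint(0, 1) for _ in range(1000)]

results = {}
for n_trials in (1, 10, 100, 1000):
    results[n_trials] = tune_on_validation(n_trials, val_labels, test_labels, rng)
    best_val, test_acc = results[n_trials]
    print(f"trials={n_trials:5d}  best validation acc={best_val:.2f}  test acc={test_acc:.2f}")
```

The best validation accuracy tends to climb as the number of trials grows, while the test accuracy of the selected configuration
stays near chance. That gap is the information that leaked from the validation set.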

Dividing data into training, validation, and test sets is straightforward, but when little data is available a few more advanced
methods help. We will look at three main validation methods: hold-out validation, K-fold cross-validation, and iterated K-fold
cross-validation with shuffling.

Hold-out validation: we set aside part of the data as a validation set and part as a test set. We then train the model on the remaining
data, tune it using the validation set, and test the final model on the test set.
"""

"""
import numpy as np

# Pseudocode: get_model(), model.train(), and model.evaluate() stand in for your own code.
# data is assumed to be a NumPy array; test_data is assumed to have been set aside beforehand.
num_validation_samples = 10000

np.random.shuffle(data)  # shuffle in place before splitting

validation_data = data[:num_validation_samples]
training_data = data[num_validation_samples:]

model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# Tune the model here, then retrain, re-evaluate, tune again, and so on.

# Once the hyperparameters are fixed, train a fresh model on all data except the test data.
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)
"""

"""
This method has one weakness. When little data is available, the validation and test sets may be too small to be statistically
representative of the data as a whole. This is easy to detect: if different random splits of the data lead to very different model
performance, the sets are too small. K-fold cross-validation and iterated K-fold cross-validation address this problem.
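
The split-to-split variation described above can be observed directly. The following is a minimal sketch, assuming a made-up 1-D
toy dataset and a toy threshold "model"; make_data, fit_threshold, evaluate, and holdout_scores are hypothetical helpers written for
this illustration, and only the Python standard library is used. It repeats hold-out validation with many different random splits
and compares how much the score moves for a small and a large validation set.

```python
import random
import statistics


def make_data(n, rng):
    # Hypothetical 1-D toy problem: class 1 is centred at +1, class 0 at -1.
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(1.0 if label == 1 else -1.0, 1.5)
        data.append((x, label))
    return data


def fit_threshold(train):
    # Toy "model": the midpoint between the two class means.
    mean0 = statistics.mean([x for x, y in train if y == 0])
    mean1 = statistics.mean([x for x, y in train if y == 1])
    return (mean0 + mean1) / 2


def evaluate(threshold, samples):
    # Accuracy of the rule "predict class 1 when x is above the threshold".
    return sum((x > threshold) == (y == 1) for x, y in samples) / len(samples)


def holdout_scores(data, n_val, n_repeats, rng):
    # Repeat hold-out validation with a different random split each time.
    scores = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        val, train = shuffled[:n_val], shuffled[n_val:]
        scores.append(evaluate(fit_threshold(train), val))
    return scores


rng = random.Random(0)
data = make_data(200, rng)
small = holdout_scores(data, n_val=10, n_repeats=200, rng=rng)
large = holdout_scores(data, n_val=100, n_repeats=200, rng=rng)
print("n_val=10   std of validation accuracy:", round(statistics.pstdev(small), 3))
print("n_val=100  std of validation accuracy:", round(statistics.pstdev(large), 3))
```

A larger standard deviation for the small validation set means that a single hold-out score from it says little about the model.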

K-fold cross-validation: we divide the data into K folds of equal size. For each fold i, we train the model on the remaining K - 1
folds and evaluate it on fold i. The final score is the mean of the K scores obtained this way. This method helps when the model's
performance varies a lot depending on how the data is split. As with hold-out validation, we still tune the model against validation
data that is kept distinct from the test set.
"""

"""
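import numpy as np

# Pseudocode: get_model(), model.train(), and model.evaluate() stand in for your own code.
# data is assumed to be a NumPy array; test_data is assumed to have been set aside beforehand.
k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    # Fold number `fold` is the validation set; all other samples form the training set.
    validation_data = data[num_validation_samples * fold : num_validation_samples * (fold + 1)]
    training_data = np.concatenate([data[: num_validation_samples * fold],
                                    data[num_validation_samples * (fold + 1) :]])

    model = get_model()  # a fresh, untrained model for every fold
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

# The validation score is the mean over the k folds.
validation_score = np.average(validation_scores)

# Train the final model on all non-test data.
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)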
"""

"""
Iterated K-fold cross-validation with shuffling: this method is for situations where relatively little data is available and the model
must be evaluated as precisely as possible. It applies K-fold cross-validation multiple times, shuffling the data each time before
dividing it into K folds. The final score is the mean of the scores from all K-fold runs. Because it trains and evaluates P * K models
(P is the number of iterations), it is expensive.
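
The procedure above can be sketched in a few lines. This is one possible illustration using only the Python standard library, not
a reference implementation: k_fold_indices, iterated_k_fold, and majority_class_score are hypothetical names, the last fold is
assumed to absorb any leftover samples, and the toy scorer merely stands in for building, training, and evaluating a real model.

```python
import random


def k_fold_indices(n_samples, k):
    # Yield (train_idx, val_idx) pairs; the last fold absorbs any remainder.
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_samples if fold == k - 1 else start + fold_size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx


def iterated_k_fold(data, k, p, train_and_score, seed=0):
    # Run K-fold cross-validation p times, reshuffling before every pass.
    rng = random.Random(seed)
    scores = []
    for _ in range(p):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for train_idx, val_idx in k_fold_indices(len(shuffled), k):
            train = [shuffled[i] for i in train_idx]
            val = [shuffled[i] for i in val_idx]
            scores.append(train_and_score(train, val))
    # The final score is the mean over all p * k trained models.
    return sum(scores) / len(scores), scores


def majority_class_score(train, val):
    # Hypothetical stand-in for "build, train, and evaluate a model":
    # predict the most common training label, report validation accuracy.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(y == majority for _, y in val) / len(val)


data = [(i, 1 if i % 3 else 0) for i in range(30)]  # toy (feature, label) pairs
mean_score, all_scores = iterated_k_fold(data, k=4, p=5, train_and_score=majority_class_score)
print("models trained:", len(all_scores))
print("mean validation score:", round(mean_score, 3))
```

With k=4 and p=5 the scorer is called 20 times, which shows why the cost grows as P * K.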

https://tensorflow.blog/2017/12/27/%EB%B0%98%EB%B3%B5-%EA%B5%90%EC%B0%A8-%EA%B2%80%EC%A6%9D/

Keep the following in mind when choosing a validation method:
1. Representative data: both the training set and the test set should be representative of the data at hand. For example, in a digit
image classification problem, suppose the sample array is sorted by class (000...111...222...). If we take the first 80% of this data
as the training set and the rest as the test set, the training set contains only the digits 0 to 7 and the test set only 8 and 9.
For this reason we normally shuffle the data randomly before splitting it into training and test sets. If some class is very rare,
random shuffling alone is not enough; we need to ensure that each class appears in the same proportion in the training and test sets.
This is called stratified sampling. Scikit-learn's train_test_split() function performs stratified sampling when the target labels are
passed via its stratify parameter.
2. Direction of time: if we are trying to predict the future from the past, we must not shuffle the data before splitting it. Shuffling
leaks information about the future into the model, which in effect ends up being trained on future data. In this kind of problem, all
data in the test set must come after all data in the training set.
3. Duplicate data: if a data point appears more than once in the dataset, then after shuffling and splitting, the same data point can end
up in both the training set and the test set. The model is then partly tested on its own training data. We need to make sure that the
training set and the test set have no data points in common.
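
The three points above each correspond to a small preprocessing step. The sketch below shows one way to write them with only the
Python standard library. The helper names stratified_split, chronological_split, and drop_duplicates are hypothetical, samples are
assumed to be hashable, and records are assumed to be (timestamp, value) pairs. In practice scikit-learn's train_test_split() with
its stratify parameter would usually replace the first helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    # Split each class separately so both sets keep roughly the same class ratio.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def chronological_split(records, test_fraction):
    # No shuffling: sort by timestamp and keep the most recent part for testing.
    ordered = sorted(records, key=lambda record: record[0])
    cut = len(ordered) - round(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


def drop_duplicates(samples):
    # Keep the first occurrence of each sample, preserving order.
    seen, unique = set(), []
    for sample in samples:
        if sample not in seen:
            seen.add(sample)
            unique.append(sample)
    return unique


# 1. Representative data: class 1 is rare and the labels are sorted by class.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
rare_in_test = sum(labels[i] == 1 for i in test_idx)
print("test size:", len(test_idx), "rare-class samples in test:", rare_in_test)

# 2. Direction of time: every test record is later than every training record.
records = [(t, t * t) for t in (5, 3, 9, 1, 7, 2, 8, 4, 6, 0)]
past, future = chronological_split(records, test_fraction=0.2)
print("latest training time:", past[-1][0], "earliest test time:", future[0][0])

# 3. Duplicate data: remove repeats before splitting.
deduplicated = drop_duplicates([(1, 2), (3, 4), (1, 2)])
print("after removing duplicates:", deduplicated)
```

Deduplicating before the split, rather than after it, is the design choice that guarantees no sample lands on both sides.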
"""