In [None]:
"""
Besides model validation, there is one more question to answer before moving on to model development: how should we prepare the input
data and the targets before feeding them into a neural network? Many data preprocessing and feature engineering techniques are specific
to a domain (for example, text data or image data). We will cover those in the next section. Here, we look at the basics that apply to
all kinds of data.

Data preprocessing for neural networks
The purpose of data preprocessing is to make the original data easier for a neural network to work with. This includes vectorization,
normalization, handling missing values, and feature extraction.

Vectorization: In a neural network, all inputs and targets must be tensors of floating-point data (or, in some cases, tensors of
integers). Whatever you want to process (sound, images, or text), you must first turn it into tensors. This step is called data
vectorization. For example, in the text classification example, we converted the text into lists of integers (representing sequences
of words). Then, we used one-hot encoding to turn them into a tensor of dtype float32.
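
As a minimal sketch of this step, the snippet below turns lists of word indices into a float32 tensor. The function name
vectorize_sequences and the toy data are assumptions made for illustration. Because each row marks every word that occurs in a
sample, the result is strictly a multi-hot encoding.

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    # One row per sample, one column per possible word index.
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        # Set to 1.0 every column whose word index occurs in this sample.
        results[i, sequence] = 1.0
    return results

# Hypothetical toy data: each list holds the word indices of one document.
sequences = [[0, 2, 3], [1, 3]]
x = vectorize_sequences(sequences, dimension=5)
print(x.dtype, x.shape)
```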

Normalization: In the digit image classification example, the image data was encoded as integers in the range 0 to 255, which is a
grayscale encoding. Before feeding this data into the network, we converted it to floating point and divided it by 255 so that the
values fell between 0 and 1. When we predicted house prices, the features had very different ranges. Some features had small
floating-point values and others had fairly large integer values. Before feeding this data into the network, we normalized each feature
independently so that it had a mean of 0 and a standard deviation of 1.
In general, it is not safe to feed a neural network large values (for example, multi-digit integers that are much larger than the
initial values of the network's weights) or heterogeneous data (for example, one feature in the range 0 to 1 and another in the range
100 to 200). Doing so can produce large gradient updates that prevent the network from converging. To make training easier, the data
should have the following characteristics:
- Take small values. Typically, most values should be in the range 0 to 1.
- Be homogeneous. All features should take values in roughly the same range.
Additionally, the following stricter normalization practice is not always necessary, but it is common and can help:
- Normalize each feature independently so that its mean is 0.
- Normalize each feature independently so that its standard deviation is 1.
This is easy to do with NumPy arrays, assuming x is a 2D floating-point array of shape (samples, features):
x -= x.mean(axis=0)
x /= x.std(axis=0)
As before, the mean and standard deviation computed on the training data must also be used to preprocess the validation and test data.
For this reason, preprocessing should not happen once before cross-validation, but inside the cross-validation loop, using only the
training part of each fold. The Pipeline class from scikit-learn makes it easy to include preprocessing inside the cross-validation loop.
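
The idea can be sketched with plain NumPy. The example below is a hand-written K-fold loop, not scikit-learn's Pipeline; the
helper name standardize_with_train_stats and the random toy data are assumptions made for illustration. In every fold, the mean
and standard deviation are computed on the training part only and then reused for the validation part.

```python
import numpy as np

def standardize_with_train_stats(x_train, x_val):
    # Statistics come from the training part only.
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    return (x_train - mean) / std, (x_val - mean) / std

rng = np.random.default_rng(0)
# Hypothetical data: 12 samples, 2 features with very different ranges.
x = np.column_stack([rng.uniform(0, 1, 12), rng.uniform(100, 200, 12)])

k = 3
folds = np.array_split(np.arange(len(x)), k)
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    x_train_n, x_val_n = standardize_with_train_stats(x[train_idx], x[val_idx])
    # A model would be trained on x_train_n and evaluated on x_val_n here.
```

Note that the validation part is not guaranteed to have a mean of exactly 0 after this step, because its statistics were never used.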

Handling missing values: Sometimes the data contains missing values. For example, in the house price prediction problem, the first
feature of the data is the crime rate. What happens if this feature is not available for some samples? There will be missing values in
the training or test data. With neural networks, it is generally safe to use 0 as a substitute for a missing value, as long as 0 is not
already a meaningful value in the data. Once the network learns that 0 stands for a missing value, it will start to ignore the value.
If you decide to substitute missing values with the mean or the median instead, compute it on the training set and keep it. If the test
set has missing values in that feature, fill them with the mean or median computed on the training set. In general, it is hard to know
in advance which substitute works best: 0, the mean, or the median. It is a good idea to compare them all with cross-validation.
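
A minimal sketch of median substitution follows. It assumes that missing values are stored as NaN, and the arrays and the helper
name impute are made up for illustration. The median of each feature is computed on the training set and reused for the test set.

```python
import numpy as np

# Hypothetical data in which NaN marks a missing value.
x_train = np.array([[0.2, 10.0],
                    [np.nan, 12.0],
                    [0.6, np.nan],
                    [0.4, 14.0]])
x_test = np.array([[np.nan, 11.0],
                   [0.5, np.nan]])

# Per-feature median of the training set, ignoring the missing entries.
train_median = np.nanmedian(x_train, axis=0)

def impute(x, fill_values):
    # Replace each NaN with the fill value of its column.
    return np.where(np.isnan(x), fill_values, x)

x_train_filled = impute(x_train, train_median)
x_test_filled = impute(x_test, train_median)  # reuses the training statistics
```

Swapping np.nanmedian for np.nanmean, or passing 0.0 as fill_values, gives the other two options discussed above.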

Suppose that missing values may appear in the test data, but the network was trained on data without any missing values. Such a network
will not have learned to ignore missing values. In this case, you have to create training samples with missing values on purpose. If
only a small number of samples have missing values, you can instead remove those samples before setting aside the test data. Also, if
we believe that the feature with missing values is not important, we can simply remove that feature from all of the data.
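
One possible way to create such samples is sketched below. It assumes that 0 is the placeholder for a missing value and that
copying a fraction of the training samples is acceptable; the function add_missing_copies and the random data are hypothetical.

```python
import numpy as np

def add_missing_copies(x, y, feature, fraction, rng, missing_value=0.0):
    # Pick a random subset of samples to duplicate.
    n_copies = int(len(x) * fraction)
    idx = rng.choice(len(x), size=n_copies, replace=False)
    x_copies = x[idx]  # fancy indexing already returns a copy
    # Blank out the feature that may be missing at test time.
    x_copies[:, feature] = missing_value
    # The copies keep the targets of the samples they came from.
    return np.concatenate([x, x_copies]), np.concatenate([y, y[idx]])

rng = np.random.default_rng(42)
# Hypothetical training data without missing values (and without zeros).
x_train = rng.uniform(1.0, 2.0, size=(100, 3))
y_train = rng.uniform(10.0, 50.0, size=100)

x_aug, y_aug = add_missing_copies(x_train, y_train, feature=0, fraction=0.2, rng=rng)
```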

Feature engineering
Feature engineering is the step where we use our knowledge of the data and of the machine learning algorithm (here, a neural network).
Before feeding the data into the model, we apply hardcoded transformations to it so that the algorithm works better. In many cases we
cannot expect a machine learning model to learn from completely arbitrary data. The data needs to be presented in a way that makes the
model's job easier. Suppose that we are developing a model that takes an image of a clock as input and outputs the time. This is a hard
machine learning problem if we use the raw pixels as input. It requires a convolutional neural network and a lot of computing
resources. If we understand the problem at a higher level, we can build much better input features for the machine learning algorithm.
For example, we can write a Python script that follows the black pixels of the image and outputs the coordinates of the tip of each
clock hand. A simple model can then easily learn to associate these coordinates with the corresponding time. We can build an even
better feature by expressing the coordinates computed above as polar coordinates. At that point the problem becomes so easy that no
machine learning is needed at all. A simple rounding operation and a dictionary lookup are enough to recover the time. This is the
essence of feature engineering: we express the problem through simpler features to make it easier. This usually requires understanding
the problem in depth.
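
The last step of the clock example can be sketched as follows. The sketch assumes that the hand tips have already been extracted
as (x, y) coordinates relative to the clock center, with y pointing up; the function names are made up for illustration, and
plain rounding stands in for the dictionary lookup.

```python
import math

def angle_from_twelve(x, y):
    # Polar angle of a hand tip, measured clockwise from the 12 o'clock direction.
    return math.atan2(x, y) % (2 * math.pi)

def read_time(hour_tip, minute_tip):
    # A full turn of the minute hand is 60 minutes.
    minute = round(angle_from_twelve(*minute_tip) / (2 * math.pi) * 60) % 60
    # A full turn of the hour hand is 12 hours; subtract the part of the
    # angle that is explained by the minutes before rounding.
    hour = round(angle_from_twelve(*hour_tip) / (2 * math.pi) * 12 - minute / 60) % 12
    return hour, minute  # hour 0 stands for 12 o'clock

# Hour hand pointing right, minute hand pointing up: three o'clock.
print(read_time(hour_tip=(1.0, 0.0), minute_tip=(0.0, 1.0)))  # (3, 0)
```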

Before deep learning, feature engineering was very important. Traditional shallow learning algorithms do not have a hypothesis space
rich enough to learn useful features by themselves. Success depended on how the data was presented to the algorithm. For example, before
convolutional neural networks solved the MNIST digit classification problem, the typical approach was to use hardcoded features such
as the number of loops in a digit image, the height of the digit in the image, and a histogram of pixel values.
Recent deep learning methods mostly do not require feature engineering, because neural networks can extract useful features from the
raw data automatically. However, feature engineering is still worth considering when using neural networks, because:
- Good features let us solve the problem with fewer resources. For example, a convolutional neural network is a poor fit for the
clock-reading problem.
- Good features let us solve the problem with less data. A deep learning model can learn features automatically only when a lot of
training data is available. With a small number of samples, the information carried by the features becomes important.

Overfitting and underfitting
In the earlier examples, the performance of every model peaked after a few epochs and then started to decrease. In other words, the
model started to overfit the training data after some epochs. Overfitting happens in every machine learning problem, so learning how
to handle it is essential to mastering machine learning. The central issue in machine learning is the balance between optimization
and generalization. Optimization is the process of adjusting the model so that it achieves the best possible performance on the
training data. Generalization, on the other hand, is how well the model performs on data it has never seen before. The goal is to get
the best generalization performance. However, we cannot control generalization directly. We can only adjust the model based on the
training data.
"""