# Data Processing and Preliminaries

Before you can apply a machine learning model to data, some prerequisite data processing needs to take place. In fact, you may want to first take a look at a **portion** of the data, analyze it and truly understand it. Keep in mind that once you have looked at a portion of the data, that portion can no longer be treated as 'unseen', so you need to be very careful about what statements you make regarding that data and how you use it. <br><br>
Data is usually not easy to come by, but for illustrative purposes let's assume that we have plenty of it. And what do we mean when we say plenty of it?<br>
- The [one in ten rule of statistics](https://en.wikipedia.org/wiki/One_in_ten_rule) suggests that you need at least 10 samples for every feature
<br>

So let's assume that you have a data set with 20 features and one label. Specifically, imagine that 15 of the features are real valued, three are binary and two are categorical with three categories each. This means that you would have the following number of features in your data (the categorical features increase your feature set due to one hot encoding): <br>
<br>
$$ |features| = 15 + 3 + 2*3 = 24$$
<br>
In this case you would need at least 240 samples to draw any type of statistical conclusion. Many practitioners prefer to have 100 times as many samples as there are features. Let's further assume that we have 10K samples for this data. In this case this is how one would proceed with data processing and other preliminaries.
<br>
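
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the example in the text; the variable names are illustrative.

```python
# Feature counts taken from the example in the text.
n_real_valued = 15
n_binary = 3
n_categorical = 2
categories_per_feature = 3

# One hot encoding turns each categorical feature into one column per category.
n_features = n_real_valued + n_binary + n_categorical * categories_per_feature

min_samples_one_in_ten = 10 * n_features   # one in ten rule
min_samples_100x = 100 * n_features        # the more conservative 100x preference

print(n_features, min_samples_one_in_ten, min_samples_100x)  # 24 240 2400
```

With 10K samples, both thresholds are comfortably met.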

### Data Splitting/Partitioning
<br>

To properly apply machine learning to this data you would want to partition it into four different groups as mentioned below.
- **Pre-training/Analysis set (1,000 samples)**
  - This data is used only for analysis. Once you are done with it you will **not** use it for training or testing. Among other things, it will help you answer the questions below and, most importantly, guide your data pre-processing
    - Is a particular feature useful in the learning exercise, i.e. can you remove it? 
    - Does a particular feature need to be one-hot encoded?
    - Does a particular feature need to be standardized?
- **Training Set (60% of what is left)**
  - Following the one in ten rule, or the 100x preference, you want as much data here as possible
  - This is the set the model learns from
- **Cross Validation Set (20% of what is left)**
  - The model learns from the training set and ends up with a set of fitted parameters. The cross validation set is used to pick the best model/parameters
  - For example, imagine that you used the training set to come up with a linear model and a polynomial model of degree 2. At this point you have what should be the best linear model and the best 2nd order polynomial model. But which one should you pick? To answer that question you can **no** longer use the training set. You need a completely unseen/unused data set to make that assessment. This is where you use the cross validation set.
- **Testing Set (20% of what is left)**
  - Once you have trained the model with the training set and picked the best parameters using the cross validation set then you test your results using the testing set
<br>
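
A minimal sketch of the four-way partition described above, using only the Python standard library. The function name `split_indices`, the fixed seed and the choice to shuffle before splitting are assumptions made for illustration; the sizes (1,000 analysis samples, then 60/20/20 of what is left) come from the text.

```python
import random

def split_indices(n_samples, n_analysis=1000, fractions=(0.6, 0.2), seed=0):
    """Shuffle sample indices and cut them into four disjoint groups.

    fractions holds the training and cross validation shares of what is
    left after the analysis set is removed; the testing set gets the rest.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: samples can be shuffled freely

    analysis = indices[:n_analysis]
    rest = indices[n_analysis:]

    n_train = round(fractions[0] * len(rest))
    n_cv = round(fractions[1] * len(rest))
    train = rest[:n_train]
    cv = rest[n_train:n_train + n_cv]
    test = rest[n_train + n_cv:]  # whatever remains
    return analysis, train, cv, test

analysis, train, cv, test = split_indices(10_000)
print(len(analysis), len(train), len(cv), len(test))  # 1000 5400 1800 1800
```

Working with indices rather than the rows themselves keeps the sketch independent of how the data is stored.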

If data is not plentiful you will see practitioners skip the pre-training set and the cross validation set. In many cases K-fold cross validation is used within the training set, which avoids the need for a separate, mutually exclusive cross validation set. In this case this is how the data is partitioned:
- **Training Set**
  - Used to analyze the data
  - Used to train the model using K-fold cross validation (more on this below)
- **Testing Set**
  - Used to test the model
<br>

### K-fold Cross Validation ###
Roughly speaking, the procedure for K-fold cross validation is given below. It allows you to use the training set for both training and cross validating. Assume that the model you are training has a parameter, $\lambda$, for which you want to find the optimal value. By optimal we mean the value of $\lambda$ that will produce the lowest cross validation error.
<br>
* For a range of values for $\lambda$ repeat the loop below
  * Split the training data into K equally sized partitions
    * Repeat for K iterations
        * One of the partitions will be considered the validation set
        * The other K-1 partitions will be considered the training set
        * Train the model using the training data
        * Compute the cross validation error on the validation set
          * At this point you have a cross validation error for a particular value of $\lambda$ for a particular iteration
    * Average the K cross validation errors so that you have a single cross validation error for this $\lambda$
  * Carry on with the next value of $\lambda$
* Pick the value of $\lambda$ with the lowest average cross validation error
<br>

It is important to note that you need to pick a good range of $\lambda$ values to search on. This requires an understanding of the data and the model you are using. Also, in practice K = 5 or K = 10 is commonly used. The idea here is that averaging over more cross validation iterations reduces the variation in the cross validation error, so that the estimate converges.
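
The K-fold procedure can be sketched in plain Python. Everything specific here is an assumption for illustration: the model is a one-dimensional ridge regression without intercept (chosen only because it has a closed-form fit with a $\lambda$ parameter), the data is synthetic, and the helper names are hypothetical. The folds are built by interleaving indices, which assumes the samples are already in random order.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge fit without intercept: w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w*x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_cv_error(xs, ys, lam, k=5):
    """Average validation error over k folds for one value of lam."""
    n = len(xs)
    folds = [list(range(i, n, k)) for i in range(k)]  # k near-equal partitions
    errors = []
    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        w = fit_ridge_1d([xs[i] for i in train_idx], [ys[i] for i in train_idx], lam)
        errors.append(mse(w, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(errors) / k

# Synthetic training data (assumption): y = 2x plus a little noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 0.1) for x in xs]

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]  # the range of values to search
cv_errors = {lam: k_fold_cv_error(xs, ys, lam) for lam in lambdas}
best_lambda = min(cv_errors, key=cv_errors.get)
print(best_lambda, cv_errors[best_lambda])
```

Once `best_lambda` is chosen, the model would normally be refit on the whole training set with that value before being evaluated on the testing set.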
 

### Data pre-processing

Data pre-processing is applied to the data set after you have analyzed the data, i.e. after you have looked at the pre-training set or the training set. It is necessary for the learning model to do its job and to get good results. Data pre-processing entails the following:
<br>
- **Data removal**
  - **Missing Data** - if you have rows of data missing a lot of features or features that are missing in many of the samples you may want to remove the data or feature. 
    - There are various methods to estimate values but for now I will not cover them
  - **Irrelevant Data** - say that your data had database reference ids that were not at all related to the underlying data. In this case this feature can be removed.
  - **Duplicates**
  - **High Cardinality Features**
    - In a house sale price data set it may or may not be worthwhile to keep track of the listing agent. Here you could have a list of agents in the 100s or 1000s.
  - **Redundant Data**
    - In a house sale price data set, for example, it is a judgement call whether you should keep the separate features 'State', 'County', 'City', and 'Neighborhood'. Perhaps you could make do with only keeping 'Neighborhood'
- **One Hot Encoding**
  - Categorical data generally needs to be one hot encoded for machine learning algorithms to work. For example, if you had a feature called 'vacation home' and the possible values were 'yes', 'no' and 'unknown' then you would:
    - remove the feature 'vacation home' 
    - add three new features: 'vacation home yes', 'vacation home no', 'vacation home unknown' 
    - so if the 'vacation home' feature for a particular sample was 'yes' then it would become
      - 'vacation home yes'=1, 'vacation home no'=0, 'vacation home unknown'=0
- **[Standardization/Scaling](https://en.wikipedia.org/wiki/Feature_scaling)**
  - This depends on the learning algorithm being used, but basically you want all of the features to be weighted equally in your learning. If one feature ranges in the 1e6 range and another ranges in the 1e-3 range, then the former feature will eclipse the latter.
- **Feature Extraction**
  - This can sometimes be extremely important in shaping the result of your learning. In fact, some competitions have been won by clever feature extraction rather than by the learning algorithm used. Not only does feature extraction derive further value from the data prior to learning, but it may also help to reduce the number of features used.
- **Feature Selection**
  - Your learning algorithm may do better with a subset of the features, or a much reduced feature set may let you achieve similar results with a fraction of the work.
    - You can select a subset of features by using a K-best algorithm.
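
Two of the steps above, one hot encoding and standardization, are simple enough to sketch with the standard library. The helper names `one_hot` and `standardize` and the sample values are made up for illustration, and z-scoring is only one of several scaling schemes; a constant feature (zero standard deviation) is not handled here.

```python
from statistics import mean, pstdev

def one_hot(values, categories):
    """Return one row of 0/1 indicators per value, one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Rescale to zero mean and unit standard deviation (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: sigma > 0, i.e. the feature is not constant
    return [(v - mu) / sigma for v in values]

# 'vacation home' becomes three columns: yes, no, unknown.
vacation_home = ["yes", "no", "unknown", "yes"]
encoded = one_hot(vacation_home, ["yes", "no", "unknown"])
print(encoded[0])  # [1, 0, 0]

# Two hypothetical features on very different scales (about 1e6 versus 1e-3).
sale_price = [250_000.0, 1_200_000.0, 480_000.0, 730_000.0]
lot_fraction = [0.001, 0.004, 0.002, 0.003]
price_z = standardize(sale_price)
lot_z = standardize(lot_fraction)
print(price_z, lot_z)
```

After standardization both features have comparable magnitudes, so neither eclipses the other. To keep the testing set unseen, the mean and standard deviation would typically be computed on the training set and then reused for the other sets.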

# Generalization Error

Once you have trained your model using the best practices mentioned above you will then test it using the test set. Now you can compare your results with what is known as the 'generalization error.' Based on the [Hoeffding Inequality](https://en.wikipedia.org/wiki/Hoeffding%27s_inequality), the generalization error equations give you an upper bound on the test error given the training or cross validation error.<br>
* $E_{test} \le E_{train} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$
  * N is the number of samples in the training set and M is the number of hypotheses in the hypothesis set H. The bound holds with probability at least $1-\delta$
* If you used the training set to train the model, picked the best hypothesis g and then used the cross validation set to estimate $E_{test}$, then
  * $E_{test}(h_{g}) \le E_{cv}(h_{g}) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ where M = 1 and N is the number of samples in the cross validation set
    * Because you did not use the cross validation set to arrive at the best model g, M=1 for the generalization error calculation
* $E_{test}(h_{g}) \le E_{train}(h_{g}) + \sqrt{\frac{8}{N}\ln\frac{4m_H(2N)}{\delta}}$
  * N is the number of samples in the training set and $m_H$ is the growth function for the hypothesis set H

It is usually the case that the first generalization error bound is tighter than the third bound.
<br>
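
To get a feel for the size of the Hoeffding term $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$, here is a small sketch. The function name `hoeffding_margin` and the choice $\delta = 0.05$ are assumptions for illustration; N = 1,800 matches a cross validation set that is 20% of 9,000 samples, as in the partition described earlier.

```python
import math

def hoeffding_margin(n_samples, n_hypotheses=1, delta=0.05):
    """Return sqrt( 1/(2N) * ln(2M/delta) ), the term added to the observed error."""
    return math.sqrt(math.log(2 * n_hypotheses / delta) / (2 * n_samples))

# Cross validation set of 1,800 samples, a single chosen hypothesis (M = 1).
margin_cv = hoeffding_margin(1800, n_hypotheses=1, delta=0.05)
print(round(margin_cv, 3))  # 0.032

# The margin grows (slowly) with the number of hypotheses M.
margin_many = hoeffding_margin(1800, n_hypotheses=1000, delta=0.05)
print(margin_many > margin_cv)  # True
```

Under these assumptions, the test error is at most about 3.2 percentage points above the cross validation error, with probability at least 95%.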
# Other Topics
* Bias-Variance/learning curve
* Convexity
* generative p(y,x) vs discriminative p(y|x) models

<br>

# Take Away
Data pre-processing, including feature extraction and selection, is just as important as applying the correct machine learning algorithm.
