# Tour of Data Preparation Techniques for Machine Learning

by Jason Brownlee - [Reference](https://machinelearningmastery.com/data-preparation-techniques-for-machine-learning/) - June 19, 2020 in [Data Preparation](https://machinelearningmastery.com/category/data-preparation/)


## Tutorial Overview
This tutorial is divided into six parts; they are:

- Common Data Preparation Tasks
- Data Cleaning
- Feature Selection
- Data Transforms
- Feature Engineering
- Dimensionality Reduction

### Common Data Preparation Tasks
We can define data preparation as the transformation of raw data into a form that is more suitable for modeling.

The process of applied machine learning consists of a sequence of steps.

- Step 1: Define Problem.
- Step 2: `Prepare Data`.
- Step 3: Evaluate Models.
- Step 4: Finalize Model.

We are concerned with the data preparation step (step 2). As you work through multiple predictive modeling projects, you see and require the same types of data preparation tasks again and again.

These tasks include:

1. `Data Cleaning`: Identifying and correcting mistakes or errors in the data.
2. `Feature Selection`: Identifying those input variables that are most relevant to the task.
3. `Data Transforms`: Changing the scale or distribution of variables.
4. `Feature Engineering`: Deriving new variables from available data.
5. `Dimensionality Reduction`: Creating compact projections of the data.

#### 1. `Data Cleaning`
Data cleaning involves fixing systematic problems or errors in “messy” data. 

There are many reasons data may have incorrect values, such as being mistyped, corrupted, duplicated, and so on.

Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed. This might involve `removing a row` or a `column`. Alternately, it might involve `replacing observations` with new values.

There are general data cleaning operations that can be performed, such as:

- Using statistics to `define normal data` and identify `outliers` (or `anomalous data`).
- Identifying columns that have the `same value` or `no variance` and removing them.
- Identifying `duplicate rows` of data and removing them.
- `Marking empty values` as missing.
- `Imputing missing values` using `statistics` or a `learned model`.

Data cleaning is an operation that is typically performed first, prior to other data preparation operations.
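The cleaning operations listed above can be sketched with pandas and NumPy. This is a minimal illustration on made-up data, not the tutorial's own code: the column names, the choice to treat `0` as a missing-value marker, the median as the imputation statistic, and the three-standard-deviation outlier rule are all assumptions.

```python
import numpy as np
import pandas as pd

# Made-up data: a constant column, one duplicate row, and zeros
# that (by assumption) stand in for values that were never recorded.
df = pd.DataFrame({
    "a": [1.0, 2.0, 2.0, 3.0, 0.0, 4.0],
    "b": [5.0, 6.0, 6.0, 0.0, 8.0, 9.0],
    "const": [1, 1, 1, 1, 1, 1],
})

# Remove columns that hold a single value (no variance).
df = df.loc[:, df.nunique() > 1]

# Remove duplicate rows.
df = df.drop_duplicates()

# Mark empty values as missing (assumption: 0 means "missing" here).
df = df.replace(0.0, np.nan)

# Impute missing values with a statistic, here each column's median.
cleaned = df.fillna(df.median())
print(cleaned)

# Outliers: define "normal" as within 3 standard deviations of the mean.
rng = np.random.default_rng(1)
values = np.append(rng.normal(50, 5, size=200), [120.0])
mean, std = values.mean(), values.std()
inliers = values[np.abs(values - mean) <= 3 * std]
print(len(values), "->", len(inliers))
```

The standard-deviation rule assumes roughly Gaussian data; for other distributions a different definition of "normal" may be more appropriate.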

Data cleaning tutorials:

1. [How to Perform Data Cleaning for Machine Learning with Python](https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning/)
2. How to Delete Duplicate Rows and Useless Features
3. [How to Identify and Delete Outliers](https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/)
4. [How to Impute Missing Values](https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/)


#### 2. `Feature Selection`
Feature selection refers to techniques for selecting a subset of input features that are most relevant to the target variable that is being predicted.

This is important as irrelevant and redundant input variables can distract or mislead learning algorithms, possibly resulting in lower predictive performance. Additionally, it is desirable to develop models using only the data that is required to make a prediction, e.g. to favor the simplest possible well-performing model.

Statistical methods, such as correlation, are popular for scoring input features. The features can then be ranked by their scores, and a subset with the largest scores used as input to a model. The choice of statistical measure depends on the data types of the input variables.
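As a rough sketch of score-then-rank selection, the example below uses scikit-learn's `SelectKBest` with the ANOVA F-statistic (`f_classif`), which suits numerical inputs and a classification target. The synthetic dataset and the choice of `k=3` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 10 numerical inputs, 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)

# Score every feature against the target and keep the 3 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)       # one score per input feature
print(selector.get_support()) # boolean mask of the features that were kept
print(X_selected.shape)       # (200, 3)
```

For other combinations of input and output types, a different `score_func` (for example `chi2` or `f_regression`) would be swapped in.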

For an overview of how to select statistical feature selection methods based on data type, see the tutorial:

1. [How to Choose a Feature Selection Method For Machine Learning](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)

Additionally, there are different common feature selection use cases we may encounter in a predictive modeling project, such as:

- [`Categorical` inputs for a `classification` target variable](https://machinelearningmastery.com/feature-selection-with-categorical-data/).
- `Numerical` inputs for a `classification` target variable.
- `Numerical` inputs for a `regression` target variable.

For feature importance and related feature selection methods, see the tutorials:

1. [How to Calculate Feature Importance With Python](https://machinelearningmastery.com/calculate-feature-importance-with-python/)
2. [Recursive Feature Elimination (RFE) for Feature Selection in Python](https://machinelearningmastery.com/rfe-feature-selection-in-python/)
3. [How to Use Feature Selection for Regression](https://machinelearningmastery.com/feature-selection-for-regression-data/)
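Two of the ideas named in these tutorials, model-based feature importance and recursive feature elimination, might look like the following in scikit-learn. The decision tree estimator, the synthetic data, and the target of three features are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)

# Feature importance: fit a model and read the importance it assigns
# to each input (for trees, the scores sum to 1).
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y)
importances = model.feature_importances_
print(importances)

# RFE: repeatedly fit the model and drop the weakest feature
# until only the requested number of features remains.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1),
          n_features_to_select=3)
X_reduced = rfe.fit_transform(X, y)
print(rfe.support_)     # mask of the selected features
print(X_reduced.shape)  # (200, 3)
```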


#### 3. `Data Transforms`
Data transforms are used to change the type or distribution of data variables.

Recall that data may have one of a few types, such as numeric or categorical, with subtypes for each, such as integer and real-valued for numeric, and nominal, ordinal, and boolean for categorical.

- `Numeric Data Type`: Number values.
    - `Integer`: Integers with no fractional part.
    - `Real`: Floating point values.
- `Categorical Data Type`: Label values.
    - `Ordinal`: Labels with a rank ordering.
    - `Nominal`: Labels with no rank ordering.
    - `Boolean`: Values True and False.

We may wish to convert a *`numeric variable`* to an *`ordinal variable`* in a process called **discretization**. Alternatively, we may *`encode a categorical variable`* as *`integers or boolean`* variables, required on most **classification tasks**.

- **Discretization Transform**: Encode a numeric variable as an ordinal variable.
- **Ordinal Transform**: Encode a categorical variable into an integer variable.
- **One-Hot Transform**: Encode a categorical variable into binary variables.
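The three encodings above can be sketched with scikit-learn's preprocessing classes. The toy `ages` and `colors` arrays, the number of bins, and the uniform binning strategy are assumptions chosen to keep the output easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

# Discretization: numeric -> ordinal, using 3 equal-width bins over [3, 90].
ages = np.array([[3.0], [17.0], [25.0], [42.0], [68.0], [90.0]])
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = disc.fit_transform(ages)
print(age_bins.ravel())  # [0. 0. 0. 1. 2. 2.]

# Ordinal transform: each label becomes an integer (labels sorted alphabetically).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
color_codes = OrdinalEncoder().fit_transform(colors)
print(color_codes.ravel())  # [2. 1. 0. 1.]

# One-hot transform: each label becomes its own binary column.
one_hot = OneHotEncoder().fit_transform(colors).toarray()
print(one_hot)  # 4 rows, 3 columns (blue, green, red)
```

Note that `OrdinalEncoder` imposes an arbitrary (alphabetical) order unless categories are given explicitly, which is why one-hot encoding is usually preferred for nominal variables.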

For real-valued numeric variables, the way they are represented in a computer means there is dramatically more resolution in the range 0-1 than in the broader range of the data type. As such, it may be desirable to scale variables to this range, called normalization. If the data has a Gaussian probability distribution, it may be more useful to shift the data to a standard Gaussian with a mean of zero and a standard deviation of one.

- **Normalization Transform**: Scale a variable to the range 0 to 1.
- **Standardization Transform**: Scale a variable to a standard Gaussian.
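A minimal sketch of both scaling transforms, assuming scikit-learn's `MinMaxScaler` and `StandardScaler` and a small made-up matrix whose two columns sit on very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Normalization: per column, (x - min) / (max - min), giving values in [0, 1].
X_norm = MinMaxScaler().fit_transform(X)
print(X_norm)

# Standardization: per column, (x - mean) / std, giving mean 0 and std 1.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

In a real project the scaler would typically be fit on the training data only and then applied to the test data, to avoid leaking information.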

If the distribution is nearly Gaussian, but is skewed or shifted, it can be made more Gaussian using a power transform. Alternatively, quantile transforms can be used to force a probability distribution, such as a uniform or Gaussian, on a variable with an unusual natural distribution.

- **Power Transform**: Change the distribution of a variable to be more Gaussian.
- **Quantile Transform**: Impose a probability distribution such as uniform or Gaussian.
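The sketch below applies both transforms to a right-skewed sample. The exponential data, the Yeo-Johnson method, and the `n_quantiles=100` setting are assumptions made for the example; the classes are scikit-learn's `PowerTransformer` and `QuantileTransformer`.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# A right-skewed variable (exponential), shaped as a single column.
rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=(500, 1))

# Power transform: learns an exponent that makes the data more Gaussian,
# then standardizes the result (standardize=True is the default).
pt = PowerTransformer(method="yeo-johnson")
X_power = pt.fit_transform(X)
print(pt.lambdas_)  # the fitted exponent for each column

# Quantile transform: maps the data onto a chosen distribution via its ranks.
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform",
                         random_state=1)
X_uniform = qt.fit_transform(X)
print(X_uniform.min(), X_uniform.max())
```

Passing `output_distribution="normal"` instead would force a Gaussian shape rather than a uniform one.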

An important consideration with data transforms is that the operations are generally performed separately for each variable. As such, we may want to perform different operations on different variable types.

- Data Transforms
    - Numerical Type
        - Change Scale
            - Normalize
            - Standardize
            - Robust
        - Change Distribution
            - Power
            - Quantile
            - Discretize
        - Engineer
            - Polynomial
    - Categorical Type
        - Nominal Type
            - One Hot Encode
            - Dummy Encode
        - Ordinal Type
            - Label Encode

1. [4 Common Machine Learning Data Transforms for Time Series Forecasting](https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/).
2. [How to use Normalization and Standardization](https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/)
3. [How to use Ordinal and One Hot Encoding](https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/)
4. [How to use Power Transforms](https://machinelearningmastery.com/power-transforms-with-scikit-learn/)

#### 4. `Feature Engineering`
Feature engineering refers to the process of creating new input variables from the available data.

Engineering new features is highly specific to your data and data types. As such, it often requires the collaboration of a subject matter expert to help identify new features.

This specialization makes it a challenging topic to generalize. Nevertheless, there are some techniques that can be reused, such as:

- Adding a boolean flag variable for some state.
- Adding a group or global summary statistic, such as a mean.
- Adding new variables for each component of a compound variable, such as a date-time.
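These three reusable techniques can be sketched in pandas. The transaction data, the column names, and the threshold of 25 for the flag are all hypothetical and chosen only to make the example concrete.

```python
import pandas as pd

# Hypothetical transaction data.
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 30.0, 5.0, 15.0, 70.0],
    "timestamp": pd.to_datetime([
        "2020-06-19 08:30", "2020-06-20 22:15", "2020-06-21 12:00",
        "2020-07-01 09:45", "2020-07-04 18:05",
    ]),
})

# Boolean flag for some state (assumed threshold: amount above 25).
df["is_large"] = df["amount"] > 25.0

# Group summary statistic: mean amount per customer, repeated on each row.
df["customer_mean"] = df.groupby("customer")["amount"].transform("mean")

# Components of a compound variable: split the date-time into parts.
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday is 0
df["hour"] = df["timestamp"].dt.hour

print(df)
```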

A popular approach drawn from statistics is to create copies of numerical input variables that have been changed with a simple mathematical operation, such as raising them to a power or multiplying them with other input variables. These are referred to as polynomial features.

- `Polynomial Transform`: Create copies of numerical input variables that are raised to a power [here](https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/).
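As a small worked example, scikit-learn's `PolynomialFeatures` with `degree=2` expands two inputs into their squares and their product. The input values and the choice to drop the bias column are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# Degree-2 expansion without the constant (bias) column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Output columns are: x0, x1, x0^2, x0*x1, x1^2
print(X_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 4.  5. 16. 20. 25.]]
```

The number of generated columns grows quickly with the degree and the number of inputs, so small degrees are the usual starting point.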

1. [How to Use Polynomial Feature Transforms for Machine Learning](https://machinelearningmastery.com/polynomial-features-transforms-for-machine-learning/)

#### 5. `Dimensionality Reduction`
The number of input features for a dataset may be considered the dimensionality of the data.

The problem is, the more dimensions this space has (e.g. the more input variables), the more likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that space. This is referred to as the curse of dimensionality.

This motivates feature selection, although an alternative to feature selection is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data.

This is referred to generally as dimensionality reduction. Unlike feature selection, the variables in the projected data are not directly related to the original input variables, making the projection difficult to interpret.

The most common approach to dimensionality reduction is to use a matrix factorization technique:

- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- `Factor Analysis`, sometimes used together with PCA for dimensionality reduction
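A minimal sketch of the first two techniques, assuming scikit-learn's `PCA` and `TruncatedSVD` and a synthetic dataset in which half of the 20 inputs are redundant. The choice of five components is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, TruncatedSVD

# Synthetic data: 20 inputs, 10 of which are linear combinations of others.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=10, random_state=1)

# PCA: project onto the 5 directions of greatest variance.
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X)
print(X_pca.shape)                           # (200, 5)
print(pca.explained_variance_ratio_.sum())   # share of variance retained

# Truncated SVD: a similar projection that does not center the data first,
# which also makes it usable on sparse matrices.
svd = TruncatedSVD(n_components=5, random_state=1)
X_svd = svd.fit_transform(X)
print(X_svd.shape)                           # (200, 5)
```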

Other approaches exist that discover a lower-dimensional representation of the data. We might refer to these as model-based methods, such as LDA and perhaps autoencoders.

- Linear Discriminant Analysis (LDA)
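Unlike PCA, LDA uses the class labels when learning the projection, and it can produce at most one fewer component than there are classes. The sketch below assumes scikit-learn's `LinearDiscriminantAnalysis` and a synthetic three-class dataset.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic 3-class problem with 10 numerical inputs.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=1)

# With 3 classes, LDA can project onto at most 3 - 1 = 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # supervised: the labels y are required
print(X_lda.shape)  # (300, 2)
```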

Sometimes manifold learning algorithms can also be used, such as `Kohonen self-organizing maps` and `t-SNE`.
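As a rough illustration of manifold learning, the sketch below embeds synthetic data into two dimensions with scikit-learn's `TSNE`. The dataset and the `perplexity` value are assumptions; self-organizing maps are not shown because scikit-learn does not provide them.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=0, random_state=1)

# Embed into 2 dimensions; perplexity must be smaller than the sample count.
tsne = TSNE(n_components=2, perplexity=20.0, random_state=1)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)  # (100, 2)
```

scikit-learn's `TSNE` offers `fit_transform` but no separate `transform` for new data, so it is more often used for visualization than as a step in a predictive modeling pipeline.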

- Dimensionality Reduction
    - Manifold Learning
        - SOM
        - tSNE
    - Model Based
        - LDA
    - Matrix Factorization
        - PCA
        - SVD

For dimensionality reduction, see the tutorials:

1. [Introduction to Dimensionality Reduction for Machine Learning](https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/).
2. [How to use PCA for Dimensionality Reduction](https://machinelearningmastery.com/principal-components-analysis-for-dimensionality-reduction-in-python/)
3. [How to use LDA for Dimensionality Reduction](https://machinelearningmastery.com/linear-discriminant-analysis-for-dimensionality-reduction-in-python/)

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials
- [How to Prepare Data For Machine Learning](https://machinelearningmastery.com/how-to-prepare-data-for-machine-learning/)
- [Applied Machine Learning Process](https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/)
- [How to Perform Data Cleaning for Machine Learning with Python](https://machinelearningmastery.com/basic-data-cleaning-for-machine-learning/)
- [How to Choose a Feature Selection Method For Machine Learning](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)
- [Introduction to Dimensionality Reduction for Machine Learning](https://machinelearningmastery.com/dimensionality-reduction-for-machine-learning/)

Books
- Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019. [Here](https://bookdown.org/max/FES/).
- Applied Predictive Modeling, 2013. [Here](https://amzn.to/2VMhnat).
- Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016. [Here](https://amzn.to/2Kk6tn0).