## Data Preparation

### Basics
- The philosophy of data preparation is to discover how to best expose the unknown underlying structure of the problem to the learning algorithms.
- This often requires an iterative path of experimentation through a suite of different data preparation techniques in order to discover what works well or best.
- Most machine learning algorithms are well understood and are available as routines in fully featured open-source libraries such as scikit-learn in Python. **What differs from project to project is the data.**
- **The challenge of data preparation** is that each dataset is unique. Datasets differ in the number of variables (tens, hundreds, thousands, or more), the types of the variables (numeric, nominal, ordinal, boolean), the scale of the variables, the drift in the values over time, and more.

#### Applied Machine Learning Process
- You may be the first person (ever!) to work on the specific predictive modeling problem. That does not mean that others have not worked on similar prediction tasks or perhaps even the same high-level task, but you may be the first to use the specific data that you have collected.
- No one can tell you what the best results are or might be, or what algorithms to use to achieve them.
- You must establish a baseline in performance as a point of reference to compare all of your models and you must discover what algorithm works best for your specific dataset.
- This process is sometimes referred to as the **applied machine learning process**, the **data science process**, or by the older name **knowledge discovery in databases** (KDD).
- The process can be summarized in a few high-level steps:

    - Step 1: Define Problem
    - Step 2: Data Collection
    - Step 3: Data Preparation
    - Step 4: Model Building    
    - Step 5: Model Evaluation
    - Step 6: Finalize the model
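
The baseline mentioned above can be as simple as always predicting the most frequent class. The sketch below is a minimal illustration using only the Python standard library. The labels and the helper names (`majority_class_baseline`, `accuracy`) are hypothetical and invented for this example. In practice, scikit-learn's `DummyClassifier` serves a similar purpose.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda rows: [most_common_label for _ in rows]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical toy data, invented purely for illustration.
train_labels = ["no", "no", "no", "yes", "no", "yes"]
test_rows = [[0.1], [0.7], [0.3], [0.9]]
test_labels = ["no", "yes", "no", "no"]

predict = majority_class_baseline(train_labels)
baseline_accuracy = accuracy(test_labels, predict(test_rows))
print(baseline_accuracy)
```

Any model that cannot beat this score on the same test data is arguably not learning anything useful from the inputs.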

#### Define the Problem
- Defining the problem is the step that comes before data preparation. It may involve many sub-tasks:

    - Gather data from the problem domain.
    - Discuss the project with subject matter experts.
    - Select those variables to be used as inputs and outputs for a predictive model. 
    - Review the data that has been collected.
    - Summarize the collected data using statistical methods.
    - Visualize the collected data using plots and charts.

#### Data Preparation
- Raw data typically cannot be used directly, for several reasons:
    - Machine learning algorithms typically require data to be numeric.
    - Some machine learning algorithms impose requirements on the data.
    - Statistical noise and errors in the data may need to be corrected.
    - Complex nonlinear relationships may be teased out of the data.
- There are common or standard tasks that you may use or explore during data preparation:
    - **Data Cleaning**: Identifying and correcting mistakes or errors in the data.
    - **Feature Selection**: Identifying those input variables that are most relevant to the task.
    - **Data Transforms**: Changing the scale or distribution of variables.
    - **Feature Engineering**: Deriving new variables from available data.
    - **Dimensionality Reduction**: Creating compact projections of the data.

### What Is Data in Machine Learning
- Predictive modeling projects involve learning from data. Data refers to examples or cases from the domain that characterize the problem you want to solve. 
- In supervised learning, data is composed of examples where each example has an input element that will be provided to a model and an output or target element that the model is expected to predict.
- **Classification** is an example of a supervised learning problem where the target is a label, and **regression** is an example of a supervised learning problem where the target is a number.

- The most common type of input data is referred to as tabular data or structured data:
    - **Row**: A single example from the domain, often called an instance, example, or sample in machine learning.
    - **Column**: A single property recorded for each example, often called a variable, **predictor**, or feature in machine learning.
    - **Input Variables**: Columns in the dataset provided to a model in order to make a prediction.
    - **Output Variable**: Column in the dataset to be predicted by a model.
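
As a minimal sketch of these terms, the snippet below represents a small table as a list of rows and splits it into input variables and an output variable. The column names and values are hypothetical, and the names `X` and `y` follow a common convention rather than anything the text prescribes.

```python
# Hypothetical toy table: each dict is one row (one example from the domain).
rows = [
    {"age": 34, "income": 52000.0, "owns_home": True, "defaulted": "no"},
    {"age": 51, "income": 87000.0, "owns_home": True, "defaulted": "no"},
    {"age": 23, "income": 19000.0, "owns_home": False, "defaulted": "yes"},
]

input_columns = ["age", "income", "owns_home"]  # input variables (features)
output_column = "defaulted"                     # output variable (target)

# X holds one list of input values per row; y holds the matching targets.
X = [[row[name] for name in input_columns] for row in rows]
y = [row[output_column] for row in rows]

print(len(X), "rows,", len(X[0]), "input columns")
print(y)
```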

- When you collect your data, you may have to transform it so it forms one large table. For example, if you have your data in a relational database, it is common to represent entities in separate tables in what is referred to as a normal form so that redundancy is minimized.
- In order to create one large table with one row per subject or entity that you want to model, you may need to reverse this process and introduce redundancy in the data in a process referred to as denormalization.
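
Denormalization can be sketched as a join that copies fields from one table onto the rows of another. The example below assumes two hypothetical tables (`customers` and `orders`) and treats one order as the entity to model. The function name `denormalize` is likewise invented for illustration.

```python
# Hypothetical normalized tables, as they might come out of a relational database.
customers = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Grace", "city": "New York"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 1, "amount": 40.0},
    {"order_id": 102, "customer_id": 2, "amount": 15.0},
]

def denormalize(orders, customers):
    """Join each order with its customer so that every row is self-contained."""
    flat_rows = []
    for order in orders:
        customer = customers[order["customer_id"]]
        # Customer fields are repeated on every order row: redundancy is
        # introduced on purpose so that one row describes one example.
        flat_rows.append({**order, **customer})
    return flat_rows

flat_table = denormalize(orders, customers)
for row in flat_table:
    print(row)
```

Here the customer's name and city appear on two rows for customer 1, which is exactly the redundancy that the normal form was designed to avoid.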

- **Raw data**: Data in the form provided from the domain.
- A **feature** is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline.
- **Feature engineering** is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model.

- Data preparation can make or break a model’s predictive ability. Different models have different sensitivities to the type of predictors in the model; how the predictors enter the model is also important.
- **For example**:
    - Some linear machine learning models assume, or perform better when, each numeric input variable has a Gaussian probability distribution.
    - This means that if you have input variables that are not Gaussian or nearly Gaussian, you might need to transform them so that they are Gaussian or more Gaussian.
    - Some algorithms are known to perform worse if there are input variables that are irrelevant or redundant to the target variable. There are also algorithms that are negatively impacted if two or more input variables are highly correlated.
    - There are also algorithms that have very few requirements about the probability distribution of input variables or the presence of redundancies, but in turn, may require many more examples (rows) in order to learn how to make good predictions.
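
One simple way to look for the highly correlated inputs mentioned above is to compute the Pearson correlation between pairs of columns. The sketch below uses hypothetical columns, and the 0.95 threshold is an assumption chosen for illustration, not a recommended value.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical columns: height in centimetres and the same height in inches
# carry the same information, so they are almost perfectly correlated.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
shoe_size = [38.0, 44.0, 37.0, 45.0, 40.0]

THRESHOLD = 0.95  # assumed cut-off; the right value is problem dependent
r_redundant = pearson_correlation(height_cm, height_in)
r_other = pearson_correlation(height_cm, shoe_size)
print(round(r_redundant, 3), abs(r_redundant) > THRESHOLD)
print(round(r_other, 3), abs(r_other) > THRESHOLD)
```

A pair that exceeds the threshold is a candidate for dropping one of the two columns, although whether that helps depends on the algorithm being used.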

- The idea that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of **feature engineering**.
- The performance of a machine learning algorithm is only as good as the data used to train it. This is often summarized as **garbage in, garbage out**.

- A dataset may be a weak representation of the problem we are trying to solve for many reasons.
    - **Complex Data**: Raw data contains compressed complex nonlinear relationships that may need to be exposed.
    - **Messy Data**: Raw data contains statistical noise, errors, missing values, and conflicting examples.

- We can think about getting the most out of our predictive modeling project in two ways:
    - **Focus on the model**:
        - We could minimally prepare the raw data and begin modeling. This puts the full onus on the model to tease out the relationships in the data and learn the mapping function from inputs to outputs as best it can.
        - This may require a large dataset and a flexible and powerful machine learning algorithm with few expectations, such as random forest or gradient boosting.
    - **Focus on the data**:
        - Alternatively, we could push the onus back onto the data and the data preparation process. This requires that each row of data best expresses the information content of the data for modeling.

- **Although the algorithms are well understood operationally, most don't have satisfying theories about why they work or how to map algorithms to problems. This is why each predictive modeling project is empirical rather than theoretical, requiring a process of systematic experimentation of algorithms on data.**

#### Data Cleaning
- Data cleaning involves fixing systematic problems or errors in messy data.
- The most useful data cleaning involves deep domain expertise and could involve identifying and addressing specific observations that may be incorrect.
- There are many reasons data may have incorrect values, such as being mistyped, corrupted, duplicated, and so on.

- **Techniques**:
    - **Basics**
        - Redundant Samples
        - Redundant Features
    - **Outliers**
        - Extreme Values
    - **Missing**
        - Mark
        - Impute
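
The techniques listed above can be sketched with the Python standard library. This is a minimal illustration on hypothetical `(age, income)` rows: it removes redundant samples, imputes missing values with the median, and flags extreme values using the interquartile range. The function names, the choice of the median, and the `k=1.5` multiplier are assumptions made for this example.

```python
import statistics

# Hypothetical raw rows (age, income); None marks a missing value.
raw_rows = [
    (25, 30000.0),
    (25, 30000.0),   # exact duplicate of the previous row
    (40, None),      # missing income
    (31, 42000.0),
    (38, 45000.0),
    (45, 38000.0),
    (52, 50000.0),
    (29, 900000.0),  # suspiciously extreme income
]

def drop_duplicate_rows(rows):
    """Remove redundant samples while keeping the first occurrence."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def impute_with_median(rows, column):
    """Replace missing (None) values in one column with the column median."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = statistics.median(observed)
    imputed = []
    for row in rows:
        values = list(row)
        if values[column] is None:
            values[column] = fill
        imputed.append(tuple(values))
    return imputed

def flag_outliers_iqr(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is a common convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in values]

deduped = drop_duplicate_rows(raw_rows)
imputed = impute_with_median(deduped, column=1)
outlier_flags = flag_outliers_iqr([income for _, income in imputed])
print(imputed)
print(outlier_flags)
```

Flagging an extreme value is not the same as deciding it is wrong. Whether to correct, remove, or keep it is a judgment that usually needs domain knowledge.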

#### Feature Selection
- ![Feature selection](./images/feature.jpeg)

#### Data Transforms
- **Data Types**
    - **Numeric Data Type**
        - Integer
        - Float
    - **Categorical Data Type**
        - **Nominal**: Labels with no rank ordering. 
        - **Ordinal**: Labels with a rank ordering.
        - **Boolean**: Values True and False.

- **Discretization Transform**: Encode a numeric variable as an ordinal variable.
- **Ordinal Transform**: Encode a categorical variable into an integer variable.
- **One Hot Transform**: Encode a categorical variable into binary variables.
- **Normalization Transform**: Scale a variable to the range 0 to 1.
- **Standardization Transform**: Rescale a variable to have a mean of 0 and a standard deviation of 1 (a standard Gaussian, if the variable is Gaussian to begin with).
- **Power Transform**: Change the distribution of a variable to be more Gaussian.
- **Quantile Transform**: Impose a probability distribution such as uniform or Gaussian.
- **Polynomial Transform**: Create copies of numerical input variables that are raised to a power.
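
Several of the transforms above are simple enough to sketch in plain Python. The functions below are minimal illustrations on hypothetical data, not library implementations. In practice, scikit-learn provides equivalents such as `MinMaxScaler`, `StandardScaler`, `KBinsDiscretizer`, `OrdinalEncoder`, and `OneHotEncoder`. The sketch assumes each numeric column contains at least two distinct values.

```python
import statistics

def normalize(values):
    """Min-max scaling: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def discretize(values, n_bins):
    """Equal-width binning: encode each numeric value as an ordinal bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def ordinal_encode(labels, order):
    """Map each label to its integer position in an explicitly given order."""
    index = {label: i for i, label in enumerate(order)}
    return [index[label] for label in labels]

def one_hot_encode(labels):
    """Map each label to a binary vector with a single 1 (categories sorted)."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

# Hypothetical toy columns, invented for illustration.
ages = [20.0, 30.0, 40.0, 60.0]
sizes = ["small", "large", "medium", "small"]
colors = ["red", "green", "red", "blue"]

print(normalize(ages))
print(standardize(ages))
print(discretize(ages, n_bins=2))
print(ordinal_encode(sizes, order=["small", "medium", "large"]))
print(one_hot_encode(colors))
```

The ordinal encoder takes the category order explicitly because the integers imply a ranking. For nominal labels with no ranking, the one-hot encoding avoids suggesting an order that does not exist.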

- ![Data Transforms](./images/data-transform.jpeg)

#### Dimensionality Reduction
- The number of input features for a dataset may be considered the dimensionality of the data.
- For example, two input variables together can define a two-dimensional area where each row of data defines a point in that space. This idea can then be scaled to any number of input variables to create large multi-dimensional hyper-volumes.
- The problem is that the more dimensions this space has (i.e., the more input variables), the more likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that space. This is a common issue in machine learning known as the **curse of dimensionality**.
- **Dimensionality reduction** is the process of reducing the number of input variables in a dataset.
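
The sparsity behind the curse of dimensionality can be illustrated with a small simulation. The sketch below scatters a fixed number of uniformly random points in a unit hypercube, divides each axis into 10 bins, and measures what fraction of the resulting grid cells contain at least one point. The function name, the bin count, and the sample size are assumptions chosen for illustration.

```python
import random

def occupied_fraction(n_points, n_dims, bins_per_dim=10, seed=0):
    """Fraction of grid cells in the unit hypercube hit by at least one random point."""
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Identify the grid cell that a uniformly random point falls into.
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins_per_dim ** n_dims

coverage = {d: occupied_fraction(n_points=1000, n_dims=d) for d in (1, 2, 3, 5)}
for d, fraction in coverage.items():
    print(f"{d} dimensions: {fraction:.4f} of cells contain data")
```

With 1000 points and 10 bins per axis, five dimensions already give 100,000 cells, so at most 1% of the space can contain any data at all. The same number of rows covers the space less and less as input variables are added.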

- ![Dimensionality Reduction](./images/Dimenson_reduction.jpeg)


