## Feature engineering 101 



### What is a variable?

A variable is any differentiating factor, number, or amount that can be measured or quantified. It is termed a "variable" because the value it takes can, and generally does, change.



#### Variable types

Variables come in different shapes and forms, from the common and intuitive, such as age, sex or eye colour, to complicated measures that require domain knowledge, such as systolic blood pressure.
It is useful to group variables into four major types:

- Numerical: e.g. binary, integer, float or complex
- Categorical: e.g. nominal and ordinal
- Datetime: i.e. any time-related unit
- Free text: i.e. any observation containing unstructured textual data

```{note}
In practice there is a fifth, mixed type, which combines categories and numbers. However, this is a very domain-specific use case and can be treated as Categorical, with or without a defined order.
```
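
As a rough illustration of the four types, the sketch below builds a small table with one column per type and makes each type explicit. It assumes pandas is available; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one column per variable type.
df = pd.DataFrame(
    {
        "age": [34, 51, 27],                                              # numerical
        "eye_colour": ["brown", "blue", "green"],                         # categorical (nominal)
        "visit_date": ["2021-03-01", "2021-03-15", "2021-04-02"],         # datetime (still text here)
        "notes": ["no complaints", "mild headache", "follow-up needed"],  # free text
    }
)

# Make the intended type of each column explicit.
df["eye_colour"] = df["eye_colour"].astype("category")
df["visit_date"] = pd.to_datetime(df["visit_date"])

print(df.dtypes)
```

Declaring types up front means later steps (summaries, encoding, date arithmetic) operate on the type you intended rather than on whatever the file loader guessed.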



### What is a Feature?

A feature is just a sequence of variables that almost always share the same type. 
The building blocks of datasets are features.
The features of a dataset that you use for machine learning greatly influence the quality of the insights you will be able to derive.
Moreover, different scientific challenges within a given discipline may not always require the same characteristics, which is why understanding the specific goals of any data science project is paramount.

The feature selection and feature engineering processes, which are notoriously difficult and time-consuming, can help you improve the quality of your dataset's features. If these strategies are applied correctly, the resulting dataset will contain the attributes most relevant to your specific business challenge, leading to better model outputs and more valuable insights.

Working with features is one of the most time-consuming aspects of conventional data science.
Any project begins with identifying the data type of each feature (i.e. numerical, categorical, datetime, or free text).
To identify potential problems, perform basic statistical analysis on each feature (mean, median, standard deviation, etc.).
It is also often valuable to produce exploratory visualisations such as histograms, frequent-value charts, and count-of-occurrence tables for each feature, so that you can rapidly understand your data and the insights it may yield.
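
A first pass over the features might look like the sketch below. It assumes pandas; the data and column names are hypothetical, and the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Hypothetical toy data with one numerical and one categorical feature.
df = pd.DataFrame(
    {
        "age": [34, 51, 27, 45, 51],
        "sex": ["female", "male", "female", "female", "male"],
    }
)

# Basic statistics for a numerical feature.
age_summary = df["age"].agg(["mean", "median", "std"])

# Count-of-occurrence table for a categorical feature.
sex_counts = df["sex"].value_counts()

print(age_summary)
print(sex_counts)

# A histogram is one call away when matplotlib is installed:
# df["age"].plot.hist(bins=5)
```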




### What are Feature Characteristics?
Any scientific endeavour needs to be able to describe features using summary measures.

- The most common characteristic, regardless of type, is missingness: the percentage of missing data in a specific feature.
- Beyond that, each of the four types of data is described in its own way.
- For example, numeric features will often be described using:
  - Statistical moments: e.g. mean, variance, skewness and kurtosis
  - Magnitude, range and quantiles
  - Outliers: i.e. the existence of extreme values that are significantly different from the remaining data
- In contrast, categorical features will be described using:
  - Cardinality: i.e. the number of unique labels the feature contains
  - Label frequency: i.e. the distribution of labels across a feature
- Datetime features will be described using:
  - Minimum and maximum date
  - Observation frequency
  - More complex measures such as trend and stationarity
- Finally, free text features can be described using: 
  - Dictionary: i.e. the unique set of words that can be formed from the feature 
  - Minimum and maximum sentence length in words
  - Affective tendencies, if relevant
  - Richness of vocabulary
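
Several of the characteristics listed above can be computed in a few lines. The following is a minimal sketch assuming pandas, with hypothetical data and column names; datetime measures and affective tendencies are left out for brevity.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing observation.
df = pd.DataFrame(
    {
        "income": [32_000, 41_000, None, 38_000, 250_000],
        "city": ["Leeds", "York", "Leeds", None, "Leeds"],
        "comment": ["great service", "ok", "would not recommend at all", "ok", "fine"],
    }
)

# Missingness: percentage of missing values per feature (any type).
missing_pct = df.isna().mean() * 100

# Numerical feature: moments and range.
income = df["income"]
numeric_summary = {
    "mean": income.mean(),
    "variance": income.var(),
    "skewness": income.skew(),
    "min": income.min(),
    "max": income.max(),
}

# Categorical feature: cardinality and label frequency.
cardinality = df["city"].nunique()
label_frequency = df["city"].value_counts(normalize=True)

# Free-text feature: dictionary and sentence length in words.
words_per_comment = df["comment"].str.split().str.len()
dictionary = set(" ".join(df["comment"]).split())

print(missing_pct)
print(numeric_summary)
print(cardinality, len(dictionary), words_per_comment.min(), words_per_comment.max())
```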


## What are quality datasets?

Quality datasets have the following properties:

1. **Data-Type Constraints**: all column values are assigned the correct data type, e.g. binary, categorical, ordinal, numeric, date.
1. **Mandatory Columns**: some columns cannot be empty.
1. **Range Constraints**: numeric or date columns should fall within a certain range.
1. **Categorical Consistency**: categorical columns take values from a fixed set. For example, a person’s sex may be male or female.
1. **Cross-Column Consistency**: some relationships between columns need to make sense (e.g. age in years should be consistent with birth date).
1. **Uniformity Constraints**: the data is specified using the same unit of measure throughout.
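
The properties above can be expressed as simple, automatable checks. The sketch below assumes pandas; the table, the permitted labels, the age range, the extraction date and the height threshold are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical patient table with a few deliberate problems.
df = pd.DataFrame(
    {
        "patient_id": [1, 2, 3, 4],
        "sex": ["female", "male", "F", "male"],
        "age_years": [34, 51, 27, 140],
        "birth_date": pd.to_datetime(["1990-05-01", "1973-02-11", "1997-08-30", "1984-01-15"]),
        "height": [1.68, 1.80, 172.0, 1.75],  # metres expected; one value is in cm
    }
)
reference_date = pd.Timestamp("2024-12-31")  # assumed date of data extraction

# Data-type constraints: columns hold the types we expect.
types_ok = pd.api.types.is_numeric_dtype(df["age_years"]) and pd.api.types.is_datetime64_any_dtype(
    df["birth_date"]
)

checks = {
    # Mandatory column: patient_id must never be empty.
    "mandatory": df["patient_id"].notna(),
    # Range constraint: a plausible (assumed) age range.
    "range": df["age_years"].between(0, 120),
    # Categorical consistency: only the permitted labels.
    "categorical": df["sex"].isin(["female", "male"]),
    # Cross-column consistency: age must agree with birth year (within a year).
    "cross_column": (reference_date.year - df["birth_date"].dt.year - df["age_years"]).abs() <= 1,
    # Uniformity: heights recorded in metres should be below 3.
    "uniformity": df["height"] < 3,
}

report = pd.DataFrame(checks)
passes_all = report.all(axis=1)  # True only for rows passing every check

print(types_ok)
print(report)
print(passes_all)
```

Keeping one boolean column per rule makes it easy to see which constraint a row violates, rather than only that it is invalid.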

## What is Data Preprocessing?
Data preprocessing converts raw data into a format that can be analysed and used. Real-world data tends to be incomplete, inconsistent, and/or lacking in certain behaviours and trends, and it often contains many errors. Data cleaning addresses these problems.
Furthermore, when you have access to a large dataset with several potential interactions, you may wish to isolate a question of interest. Doing so from the beginning helps you focus on what is most important at that time, and you can always go back and incorporate the features or domains you excluded earlier.

## Why use Data Preprocessing?
In the real world, data is often incomplete, noisy, and inconsistent.
- Data is incomplete when it lacks attribute values, lacks desired attributes, or is only available in aggregated form.
- Data is noisy when it contains errors or outliers.
- Data is inconsistent when it contains names or codes that do not match the structure of interest.

## What can I do when observations are incomplete?
- If the dataset is large enough, consider excluding all observations with missing values
- For rare datasets, consider imputing the missing values instead (even if the dataset is large)
- Decide on exclusion criteria in advance (before you examine the dataset)
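
The first two options can be sketched as follows, assuming pandas and hypothetical data. Median imputation is only one simple choice among many.

```python
import pandas as pd

# Hypothetical data; None marks a missing value.
df = pd.DataFrame(
    {
        "age": [34, None, 27, 45],
        "blood_pressure": [120, 135, None, 128],
    }
)

# Option 1: exclude every observation with a missing value.
complete_cases = df.dropna()

# Option 2: impute, here with each column's median.
imputed = df.fillna(df.median())

print(complete_cases)
print(imputed)
```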
 
## What can I do when there are errors or outliers in the data?
- The first task is to identify the errors and outliers
  - Errors are values that were logged incorrectly
  - Outliers are values that can't be justified conceptually (e.g. by a domain expert)
- Once you find them, you need to decide on a strategy
  - They can be filtered out of the data
  - They can be transformed or imputed
  - They can be investigated independently
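
The strategies above can be sketched as follows. Note the assumption: the text defines outliers conceptually, whereas this example substitutes a purely statistical rule (1.5 times the interquartile range) to flag candidates, which a domain expert would still need to review. It assumes pandas, and the data is hypothetical.

```python
import pandas as pd

# Hypothetical heart-rate readings; 300 is clearly suspicious.
heart_rate = pd.Series([62, 70, 68, 75, 66, 300, 71], name="heart_rate")

# Flag candidates with the 1.5 * IQR rule (an assumed, commonly used heuristic).
q1, q3 = heart_rate.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_extreme = ~heart_rate.between(lower, upper)

filtered = heart_rate[~is_extreme]                          # filter them out
capped = heart_rate.clip(lower, upper)                      # transform (cap to the bounds)
imputed = heart_rate.mask(is_extreme, heart_rate.median())  # impute with the median
flagged = heart_rate[is_extreme]                            # set aside for investigation

print(flagged)
```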

## What can I do when my dataset is inconsistent?
It depends on the level of inconsistency:
- Sometimes simple fixes are enough
- Sometimes you need manual procedures, in which it is crucial to:
  - Identify the problems
  - Understand them
  - Handle them
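
As an illustration of both levels, the sketch below first applies a simple automatic fix and then a manually reviewed mapping. It assumes pandas; the labels and the mapping are hypothetical.

```python
import pandas as pd

# Hypothetical column where the same category is written in several ways.
sex = pd.Series(["Male", " female", "F", "M", "FEMALE", "male "], name="sex")

# Simple fix: trim whitespace and lower-case everything.
cleaned = sex.str.strip().str.lower()

# Manual procedure: an explicit mapping, written after inspecting the values.
manual_mapping = {"m": "male", "f": "female"}
cleaned = cleaned.replace(manual_mapping)

# Identify anything that is still not a permitted label.
unexpected = cleaned[~cleaned.isin(["male", "female"])]

print(cleaned.value_counts())
print(unexpected)
```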


### What is Feature engineering?
Feature engineering is the process of **selecting**, **manipulating**, and **transforming** raw data into features that can be used in supervised and unsupervised learning.

#### What is feature manipulation? 
- Capping
- Expansion
- Imputation
- Encoding


#### What is feature transforming? 
- Scaling
- Mathematical
- Normal
- Discretisation
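
The sketch below illustrates these transformations on a single skewed feature, assuming pandas and NumPy. The data and bin labels are hypothetical, and the log transform is used here on the assumption that "Normal" refers to bringing a distribution closer to normal.

```python
import numpy as np
import pandas as pd

# Hypothetical, right-skewed feature.
income = pd.Series([20_000, 35_000, 50_000, 80_000, 400_000], name="income")

# Scaling: min-max to [0, 1] and standardisation (z-score).
min_max = (income - income.min()) / (income.max() - income.min())
z_score = (income - income.mean()) / income.std()

# Mathematical: log transform, which compresses the long right tail
# and so reduces skewness.
log_income = np.log1p(income)

# Discretisation: three equal-frequency bins with hypothetical labels.
income_band = pd.qcut(income, q=3, labels=["low", "middle", "high"])

print(min_max.round(3).tolist())
print(round(income.skew(), 2), round(log_income.skew(), 2))
print(income_band.tolist())
```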




#### What is feature selection? 
The process of extracting the most consistent, non-redundant, and relevant features for use in model creation is known as feature selection. As the number and variety of datasets expand, it is critical to reduce their size methodically. The primary purpose of feature selection is to increase predictive model performance while lowering modelling computational costs.

#### Feature Selection Methods
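- Feature selection algorithms are categorised as either supervised, which can be used for labelled data, or unsupervised, which can be used for unlabelled data.
- Independently of that split, methods are commonly grouped by how they interact with the model:
  - Filter
  - Wrapper
  - Embedded
  - Hybrid
- Wrapper methods search over subsets of features, using strategies such as:
  - Directional selection (e.g. forward selection)
  - Directional elimination (e.g. backward elimination)
  - Random selection
  - Brute force (exhaustive search)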
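
A minimal sketch of a filter-style selection that needs no labels is shown below. It assumes pandas and NumPy; the helper `filter_features`, its thresholds, and the synthetic data are hypothetical and chosen only to make the idea concrete.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is redundant with "a", "constant" carries no information.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
X = pd.DataFrame(
    {
        "a": a,
        "a_copy": a * 2.0 + rng.normal(scale=0.01, size=n),
        "b": rng.normal(size=n),
        "constant": np.ones(n),
    }
)


def filter_features(X, min_variance=1e-8, max_abs_corr=0.95):
    """Drop near-constant features, then one of each highly correlated pair."""
    kept = [c for c in X.columns if X[c].var() > min_variance]
    corr = X[kept].corr().abs()
    selected = []
    for col in kept:
        # Keep a feature only if it is not too correlated with one already kept.
        if all(corr.loc[col, s] <= max_abs_corr for s in selected):
            selected.append(col)
    return selected


selected = filter_features(X)
print(selected)
```

Because the filter looks only at the features themselves, it is cheap to run before any model is trained; the price is that it cannot tell which of the remaining features actually help predict a target.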