# Data Preparation for Machine Learning

***These are my notes on the 7-day Mini-Course (created by Jason Brownlee) <br>*** *They may include some classmates' insights*

This crash course is broken down into seven lessons.


Below is a list of the seven lessons that will get you started and productive with data <br> preparation in Python:

* Lesson 01: [Importance of Data Preparation](#01)
* Lesson 02: [Fill Missing Values With Imputation](#02)
* Lesson 03: [Select Features With Recursive Feature Elimination (RFE)](#03)
* Lesson 04: [Scale Data With Normalization](#04) 
* Lesson 05: [Transform Categories With One-Hot Encoding](#05)
* Lesson 06: [Transform Numbers to Categories With kBins](#06)
* Lesson 07: [Dimensionality Reduction with PCA](#07)

<a name='01'></a>
## Importance of Data Preparation ##
* **Predictive modeling** projects involve learning from data
* **Data** refers to examples or cases from the domain that characterize the problem you want to solve
* On a predictive modeling project, such as classification or regression, **raw data** typically cannot be used directly. 

There are four main reasons why this is the case:

* **Data Types**: ML algorithms require data to be numbers.
* **Data Requirements**: Some ML algorithms impose requirements on the data.
* **Data Errors**: Statistical noise and errors in the data may need to be corrected.
* **Data Complexity**: Complex nonlinear relationships may be teased out of the data.

**Data Preparation**: for these reasons, the **raw data** must be pre-processed before it is used to fit and evaluate an ML model.

Some data preparation methods:


* **Standardization** rescales numeric data using the mean and standard deviation of the column.
* **Normalization** scales numerical variables to the range between zero and one. The term can also mean dividing a vector by its norm. *E.g., divide each pixel value of an image by 255.*
* **Filtering** the data if we are interested in phenomena at a particular time or space scale.
* **Replacing NaN values** with some default value (mean, mode,...).
* **Numerical data discretization**: transform numeric data into categorical data. This might be useful when ranges are more informative than exact values. *E.g., high/medium/low temperatures might be more interesting than the actual temperature.*
* **Outlier detection**: outliers can act as noise when looking for patterns in a dataset. A boxplot can be used to identify them.
* **Principal Component Analysis (PCA)** reduces the dimensionality of the data by creating new features. The aim is to make the data easier to interpret while minimising information loss.
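
As a rough illustration of two of the methods listed above, the sketch below standardizes a small sample and flags outliers with the 1.5 × IQR rule that boxplots conventionally use. The values in `x` are made up for illustration, and the variable names are assumptions rather than part of the course material.

```python
import numpy as np

# hypothetical sample: five ordinary readings and one suspicious value
x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 50.0])

# standardization: subtract the column mean, divide by the standard deviation
x_standardized = (x - x.mean()) / x.std()
print("mean after standardization:", round(float(x_standardized.mean()), 6))
print("std after standardization:", round(float(x_standardized.std()), 6))

# boxplot-style outlier rule: anything beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print("outliers:", outliers)
```

Note that a single extreme value also inflates the mean and standard deviation, which is one reason outliers are often handled before scaling.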

<a name='02'></a>
## Fill Missing Values With Imputation ##

* **Data imputation**: filling in missing values with data. 
<br> Normally this data is a statistic (mean, median, most frequent value,...) of the column.

The cells below apply a statistical imputation transform to the horse colic dataset; a full description of the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Horse+Colic).

In [49]:
from pandas import read_csv
from numpy import isnan 
from sklearn.impute import SimpleImputer

# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv"
dataframe = read_csv(url, header=None, na_values="?")

In [50]:
# Split into input and output elements.
#   A full description of the horse colic dataset is given in the link above.
#   Following that description, column 23 is treated as the outcome
#   (lived, died or was euthanized) and all other columns as attributes.
#   Recall that: dataframe.shape => (300, 28)
data = dataframe.values
outcome_col, cols = 23, data.shape[1]
ix_cols = [c for c in range(cols) if c != outcome_col]
X, y = data[:, ix_cols], data[:, outcome_col]
# total missing
print("Missing before imputation: %d" %sum(isnan(X).flatten()))
# define imputer
imputer = SimpleImputer(strategy="mean")
# fit on the dataset 
imputer.fit(X)
# transform the dataset
Xtrans = imputer.transform(X)
print("Missing after imputation: %d" % sum(isnan(Xtrans).flatten()))

Missing before imputation: 1605
Missing after imputation: 0


In [51]:
# another technique to do data imputation
print("before data imputation ", sum(dataframe.isnull().sum()))
dataframe = dataframe.apply(lambda x: x.fillna(x.mean()),axis=0)
print("after data imputation ", sum(dataframe.isnull().sum()))

before data imputation  1605
after data imputation  0
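
The same idea can be tried offline on a tiny array. This is a minimal sketch using the median instead of the mean; the array `X_small` is made up for illustration and is not part of the horse colic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical 4x2 array with one missing value per column
X_small = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 60.0],
])

# the median is less sensitive to extreme values than the mean
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_small)

# statistics_ holds the value learned for each column
print("fill value per column:", imputer.statistics_)
print(X_filled)
```

Each missing entry is replaced by the median of the observed values in its own column, so the two columns receive different fill values.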


<a name='03'></a>
## Select Features With Recursive Feature Elimination (RFE) ##

Feature ranking with recursive feature elimination: [RFE](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), <br>
the goal of recursive feature elimination (RFE) is to select features by recursively considering <br> 
smaller and smaller sets of features. First, the estimator is trained on the initial set of features <br> 
and the importance of each feature is obtained either through any specific attribute or callable. <br> 
**Then, the least important features are pruned from current set of features.** That procedure is recursively <br>
repeated on the pruned set until the desired number of features to select is eventually reached. <br>

* **Basically, RFE works by recursively removing attributes and building a model on the attributes that remain**

* **Choose those features that are statistically meaningful to your model.**

* RFE is a transform. To use it, first, the class is configured with the chosen algorithm specified <br> via the “estimator” argument and the number of features to select via the “n_features_to_select” argument. <br>

* The example below defines a synthetic classification dataset with five redundant input features. <br>
RFE is then used to select five features using the decision tree algorithm.

In [52]:
# report which features were selected by RFE
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# define dataset
xColumns = 10
xRows = 1000
X, y = make_classification(n_samples=xRows, n_features=xColumns, n_informative=5, n_redundant=5, random_state=1)
# X.shape => (1000, 10)

# define RFE
# n_features_to_select -> number of features to keep (not to exclude)
# .support_ -> boolean mask; the number of True values equals n_features_to_select,
# so every feature marked False is removed
# .ranking_ -> rank 1 marks the kept features; higher ranks (2 to 6 here) were eliminated earlier
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# fit RFE
rfe.fit(X, y)
# summarize all features

def bool_(v):
	return 'Yes' if v else 'No ' 

for i in range(X.shape[1]):
	print('Column: %d, Remains? %s => Rank: %d' % (i, bool_(rfe.support_[i]), rfe.ranking_[i]))

Column: 0, Remains? No  => Rank: 5
Column: 1, Remains? No  => Rank: 4
Column: 2, Remains? Yes => Rank: 1
Column: 3, Remains? Yes => Rank: 1
Column: 4, Remains? Yes => Rank: 1
Column: 5, Remains? No  => Rank: 6
Column: 6, Remains? Yes => Rank: 1
Column: 7, Remains? No  => Rank: 2
Column: 8, Remains? Yes => Rank: 1
Column: 9, Remains? No  => Rank: 3
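
To make the elimination loop described in this lesson more concrete, here is a hand-written sketch of the idea. It is a simplification under stated assumptions, not scikit-learn's actual implementation: it drops exactly one feature per round, uses the tree's `feature_importances_` as the importance measure, and fixes `random_state` so the run is repeatable. The features it keeps may differ from the `RFE` output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)

n_to_keep = 5
remaining = list(range(X.shape[1]))  # indices of columns still in play

while len(remaining) > n_to_keep:
    # fit the estimator on the current subset of features
    model = DecisionTreeClassifier(random_state=1).fit(X[:, remaining], y)
    # find the least important feature within the subset and drop it
    weakest = int(np.argmin(model.feature_importances_))
    dropped = remaining.pop(weakest)
    print("dropped column %d, %d left" % (dropped, len(remaining)))

print("kept columns:", remaining)
```

The loop stops as soon as the requested number of features is reached, which mirrors the role of `n_features_to_select`.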


<a name='04'></a>
## Scale Data With Normalization ##



**Normalization is required when the value ranges of the features are (too) disproportionate; <br>
otherwise the feature with the largest range of values (let's say X_l) would overshadow the others in terms of its parameters.** <br>

Therefore, if the parameter of X_l stands out from the others for this reason, we get a biased model that might disregard other weights that <br> could potentially be good. <br>

*E.g., the age feature is supposed to range (in years) from 0 to 120, but salary can range from 1k to 1,000,000k (and beyond) ...* <br> Of course, these features (or this dataset) need to be normalized.

* **Normalization returns the data with all values between 0 and 1**

Given the following data: <br> 

$X =[1, 2, 3, 4, 5]$ <br>

Calculating $X$ normalized: <br>
$X_{std} =  \frac{X - X_{min}}{X_{max} - X_{min}} \rightarrow X_{std} = \frac{ [1, 2, 3, 4, 5] - 1}{5 - 1} = \frac{[0,1,2,3,4]}{4}= \begin{bmatrix}
0 \\
0.25 \\
0.5 \\ 
0.75 \\
1 \\
\end{bmatrix}$
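
The worked formula above can be checked with plain NumPy. This is a small sketch; the second array `X2` is a made-up example added only to show that, for tabular data, the minimum and maximum are taken per column.

```python
import numpy as np

# the same vector as in the worked example
X = np.array([1, 2, 3, 4, 5], dtype=float)
X_std = (X - X.min()) / (X.max() - X.min())
print(X_std)

# hypothetical two-column data (age, salary): scale each column on its own
X2 = np.array([[20.0, 1000.0],
               [40.0, 3000.0],
               [60.0, 5000.0]])
X2_std = (X2 - X2.min(axis=0)) / (X2.max(axis=0) - X2.min(axis=0))
print(X2_std)
```

After scaling, both columns of `X2_std` span the same range even though the raw salary values are far larger than the ages.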



In [53]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# define dataset
X      = np.array([1,2,3,4,5])
# define scaler
scaler = MinMaxScaler()
# summarize data after the transform
print(scaler.fit_transform(X.reshape(-1,1)))

[[0.  ]
 [0.25]
 [0.5 ]
 [0.75]
 [1.  ]]


In [54]:
# another example of normalizing input data
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)
# summarize data before the transform
print(X.shape)
print(X[:3, :])
# define the scaler
trans = MinMaxScaler()
# transform the data
X_norm = trans.fit_transform(X)
# summarize data after the transform
print("\nnormalized version of X:\n",X_norm[:3, :])


(1000, 5)
[[ 2.39324489 -5.77732048 -0.59062319 -2.08095322  1.04707034]
 [-0.45820294  1.94683482 -2.46471441  2.36590955 -0.73666725]
 [ 2.35162422 -1.00061698 -0.5946091   1.12531096 -0.65267587]]

normalized version of X:
 [[0.77608466 0.0239289  0.48251588 0.18352101 0.59830036]
 [0.40400165 0.79590304 0.27369632 0.6331332  0.42104156]
 [0.77065362 0.50132629 0.48207176 0.5076991  0.4293882 ]]


<a name='05'></a>
## Transform Categories With One-Hot Encoding ##

First (remember your statistics class), data can have the following types:
* Numerical (quantitative) data
    * Numerical values: **discrete** (countable) or **continuous** (measurable: temp., weight, length...)
* Categorical (qualitative) data
    * Categorical values: **nominal** (gender, colour...) or **ordinal** (if you can order or rank it: always, usually, rarely...)


One-hot encoding is a technique to encode categorical input variables (**qualitative variables**) as numbers (**continuous/discrete variables**). <br>
Remember that machine learning models require all input and output variables to be numeric. This means that if your data <br> 
contains categorical data, **you must encode it to numbers before you can fit and evaluate a model.** <br>

* **Ordinal encoding:** each label (class) of a categorical variable is mapped to a unique integer
* **One-hot encoding:** each label gets its own binary column; it is the most popular approach to this transformation

For example, imagine we have a "color" variable with three categories ('red', 'green', and 'blue'):

| Red  | Green  | Blue |   
|------|--------|-------|
|  1   |   0    |  0    |   
|  0   |   1    |  0    |   
|  0   |   0    |  1    |  
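
The table above can be reproduced with scikit-learn on a toy "color" column. This is a minimal sketch; the `colors` array is made up, and `.toarray()` is used to get a dense result so that the sketch does not depend on the name of the encoder's sparse-output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# one categorical column with the three colors from the table
colors = np.array([["red"], ["green"], ["blue"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()

# the encoder orders categories alphabetically: blue, green, red
print(encoder.categories_[0])
print(encoded)
```

Because the categories are sorted alphabetically, the columns come out as blue, green, red rather than in the order used in the table, but each row still contains exactly one 1.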

In [55]:
# one-hot encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import OneHotEncoder as OH

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
# It is a good idea to transform inputs and outputs separately,
# so you can invert the transform later separately for predictions.
X = data[:, :-1].astype(str)
y = data[:,  -1].astype(str)
# summarize the raw data
print("\n", X[:1, :])
# define the one hot encoding transform
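# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use OH(sparse=False)
encoder = OH(sparse_output=False)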
# fit and apply the transform to the input data
X_oe = encoder.fit_transform(X)
# summarize the transformed data
print("\nThe 9 input columns became 43 columns after one-hot encoding:\n", X_oe[:1, :])


 [["'40-49'" "'premeno'" "'15-19'" "'0-2'" "'yes'" "'3'" "'right'"
  "'left_up'" "'no'"]]

The 9 input columns became 43 columns after one-hot encoding:
 [[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]]


In [56]:
# here, instead of one-hot encoding, the ordinal encoder does the work
# unlike one-hot encoding, the ordinal encoder does not use only 1s and 0s to represent the categories
# e.g., "70-79" is represented as 4: whenever there is a 4 in that position (COLUMN), it stands for "70-79"
from sklearn.preprocessing import OrdinalEncoder as OE
enc = OE()
X = [['40-49', 'premeno', '15-19', '0-2', 'yes', '3', 'right', 'left_up', 'no'], 
    ['50-59', 'ge40', '15-19', '0-2', 'no', '1', 'right', 'centra', 'yes'], 
    ['30-39', 'premeno', '30-34', '6-8', 'yes', '2', 'right', 'right_up', 'no'], 
    ['60-69', 'lt40', '10-14', '0-2', 'no', '1', 'left', 'right_up', 'no'], 
    ['70-79', 'ge40', '15-19', '9-11', 'nan', '1', 'left', 'left_low', 'yes']]
enc.fit(X)

print(enc.transform([["40-49", "premeno", '15-19', '0-2', 'yes', '3', 'right', 'left_up', 'yes']]))
print(enc.transform([["50-59", "ge40", '15-19', '0-2', 'yes', '3', 'right', 'left_up', 'no']]))
print(enc.transform([["30-39", "ge40", '15-19', '0-2', 'yes', '3', 'right', 'left_up', 'no']]))
print(enc.transform([["60-69", "lt40", '15-19', '0-2', 'yes', '3', 'right', 'left_up', 'no']]))
print(enc.transform([["70-79", "lt40", '15-19', '0-2', 'yes', '3', 'right', 'left_up', 'no']]))

[[1. 2. 1. 0. 2. 2. 1. 2. 1.]]
[[2. 0. 1. 0. 2. 2. 1. 2. 0.]]
[[0. 0. 1. 0. 2. 2. 1. 2. 0.]]
[[3. 1. 1. 0. 2. 2. 1. 2. 0.]]
[[4. 1. 1. 0. 2. 2. 1. 2. 0.]]


Therefore, data encoded with one-hot encoding may have more columns than the raw data, because each category gets its own binary column. <br>
With ordinal encoding, however, 9 columns of raw data yield 9 transformed columns, since it does not use only <br>
0 and 1 but 0, 1, 2, ... as many integers as needed.
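
The difference in output width can be seen on a small example. This is a sketch on a made-up two-column array (`X_cat` is hypothetical and not taken from the breast cancer dataset); it also shows that the ordinal transform can be inverted.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# hypothetical data: 3 distinct age ranges, 2 distinct sides
X_cat = np.array([
    ["30-39", "left"],
    ["40-49", "right"],
    ["50-59", "left"],
    ["40-49", "left"],
])

# ordinal encoding keeps one output column per input column
ordinal = OrdinalEncoder()
X_ord = ordinal.fit_transform(X_cat)
print("ordinal shape:", X_ord.shape)

# one-hot encoding creates one output column per distinct category
onehot = OneHotEncoder()
X_hot = onehot.fit_transform(X_cat).toarray()
print("one-hot shape:", X_hot.shape)

# the integer codes can be mapped back to the original labels
X_back = ordinal.inverse_transform(X_ord)
print(X_back)
```

Here the one-hot result has 3 + 2 = 5 columns, one for each distinct category across the two inputs, while the ordinal result keeps 2.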

<a name='06'></a>
## Transform Numbers to Categories With kBins ##

In this lesson, you will discover how to **transform numerical variables into categorical variables**.
* Some machine learning algorithms may prefer or require categorical or ordinal input variables, <br> such as some decision tree and rule-based algorithms.
* One approach is to transform the numerical variable so that it has a discrete probability <br> distribution, where each numerical value is assigned a label and the labels have an ordered (ordinal) relationship.
* This is called a **discretization transform** and can improve the performance of some ML models <br> for datasets by making the probability distribution of numerical input variables discrete.

The discretization transform is available in the scikit-learn Python machine learning library via the [KBinsDiscretizer class](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html)

In [57]:
# discretize numeric input variables
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)
# summarize data before the transform
print(X[:3, :])
# define the transform
trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
# transform the data
X_discrete = trans.fit_transform(X)
# summarize data after the transform
print(X_discrete[:3, :])

# n_bins=10 with strategy='uniform' splits each column into 10 equal-width bins labelled 0 to 9.
# In the three rows shown below, the raw values range from -5.77 to 2.39,
# and those two values land in bins 0 and 7 of their respective columns.

[[ 2.39324489 -5.77732048 -0.59062319 -2.08095322  1.04707034]
 [-0.45820294  1.94683482 -2.46471441  2.36590955 -0.73666725]
 [ 2.35162422 -1.00061698 -0.5946091   1.12531096 -0.65267587]]
[[7. 0. 4. 1. 5.]
 [4. 7. 2. 6. 4.]
 [7. 5. 4. 5. 4.]]
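
To see what the `uniform` strategy does, it can help to look at the bin edges on a tiny column. This is a minimal sketch with made-up values; the manual lookup with `np.searchsorted` is an illustration of the idea, not a claim about how the library computes it internally.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# hypothetical single column spanning 0 to 10
x = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])

trans = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
x_binned = trans.fit_transform(x)

# equal-width edges learned for the (only) column
edges = trans.bin_edges_[0]
print("edges:", edges)
print("bins: ", x_binned.ravel())

# the same assignment done by hand using the interior edges
manual = np.searchsorted(edges[1:-1], x.ravel(), side="right")
print("manual:", manual)
```

With 5 bins over the range 0 to 10, every bin is 2 units wide, and each value is labelled with the index of the interval it falls into.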


<a name='07'></a>
## Dimensionality Reduction with PCA ##

The number of input variables or features for a dataset is referred to as its dimensionality 

* **#features** = dimensionality
* **Dimensionality reduction techniques** = reduce #features in a dataset
* **More input features** = often a more challenging prediction task for the model
* **Curse of dimensionality** = the more general name for this difficulty
* **Principal Component Analysis (PCA)** is the most popular technique for dimensionality reduction

The example below creates a synthetic binary classification dataset with 10 input variables then uses PCA to reduce the dimensionality of the dataset to the three most important components.

In [58]:
# example of PCA for dimensionality reduction
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA 
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=7, random_state=1)
# summarize data before the transform (.size = rows * columns of the 3-row preview)
print("( the original size = ", X[:3, :].size, ")\n", X[:3, :])
# define the transform
transf = PCA(n_components=3)
# transform the data
X_dim = transf.fit_transform(X)
# summarize data after the transform 
print("\n( the new size = ", X_dim[:3, :].size, ")\n", X_dim[:3, :])

( the original size =  30 )
 [[-0.53448246  0.93837451  0.38969914  0.0926655   1.70876508  1.14351305
  -1.47034214  0.11857673 -2.72241741  0.2953565 ]
 [-2.42280473 -1.02658758 -2.34792156 -0.82422408  0.59933419 -2.44832253
   0.39750207  2.0265065   1.83374105  0.72430365]
 [-1.83391794 -1.1946668  -0.73806871  1.50947233  1.78047734  0.58779205
  -2.78506977 -0.04163788 -1.25227833  0.99373587]]

( the new size =  9 )
 [[-1.64710578 -2.11683302  1.98256096]
 [ 0.92840209  4.8294997   0.22727043]
 [-3.83677757  0.32300714  0.11512801]]
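
One way to judge how much information the three components retain is the fitted model's `explained_variance_ratio_` attribute. The sketch below repeats the setup from the cell above. Since the redundant features in `make_classification` are generated as linear combinations of the informative ones, the retained share may turn out to be close to 1 here, but the code prints it rather than assuming it.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# same synthetic dataset as above: 3 informative + 7 redundant features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=7, random_state=1)

pca = PCA(n_components=3)
X_dim = pca.fit_transform(X)

# share of the total variance captured by each kept component
ratios = pca.explained_variance_ratio_
print("per component:", ratios)
print("total retained:", ratios.sum())
```

The components are ordered by decreasing variance, so the first entry is always the largest.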
