# Data Preparation for Machine Learning
***This 7-day Mini-Course was created by Jason Brownlee***

This crash course is broken down into seven lessons.


Below is a list of the seven lessons that will get you started and productive with data preparation in Python:

* Lesson 01: [Importance of Data Preparation](#01)
* Lesson 02: [Fill Missing Values With Imputation](#02)
* Lesson 03: [Select Features With Recursive Feature Elimination (RFE)](#03)
* Lesson 04: [Scale Data With Normalization](#04) 
* Lesson 05: [Transform Categories With One-Hot Encoding](#05)
* Lesson 06: [Transform Numbers to Categories With kBins](#06)
* Lesson 07: [Dimensionality Reduction with PCA](#07)

<a name='01'></a>
## Importance of Data Preparation ##
* **Predictive modeling** projects involve learning from data
* **Data** refers to examples or cases from the domain that characterize the problem you want to solve
* On a predictive modeling project, such as classification or regression, **raw data** typically cannot be used directly. 

There are four main reasons why this is the case:

* **Data Types**: ML algorithms require data to be numbers.
* **Data Requirements**: Some ML algorithms impose requirements on the data.
* **Data Errors**: Statistical noise and errors in the data may need to be corrected.
* **Data Complexity**: Complex nonlinear relationships may be teased out of the data.

For these reasons, the **raw data** must be pre-processed before it is used to fit and evaluate an ML model. This step is called **data preparation**.

Some data preparation methods:


* **Standardization**: rescales numeric data using the mean and standard deviation of the column, so that each column has zero mean and unit standard deviation.
* **Normalization**: scales numerical variables to the range between zero and one. The term can also mean dividing a vector by its norm. *E.g., divide each pixel value of an image by 255.*
* **Filtering**: restricts the data to the time or space scale of the phenomenon of interest.
* **Replacing NaN values** with a default value (mean, mode, ...).
* **Numerical data discretization**: transforms numeric data into categorical data. This can be useful when ranges are more informative than exact values. *E.g., high/medium/low temperatures might be more interesting than the actual temperature.*
* **Outlier detection**: outliers can act as noise when looking for patterns in a dataset. A boxplot can be used to identify them.
* **Principal Component Analysis (PCA)**: reduces the dimensionality of data by creating new features that are linear combinations of the original ones. This can make the data easier to interpret while minimising information loss.
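
Three of the methods listed above (standardization, normalization and discretization) can be illustrated on a single column. The following is a minimal sketch using NumPy only; the values, the bin edges and the variable names are hypothetical and chosen purely for illustration.

```python
import numpy as np

# Hypothetical toy column of temperatures (illustrative values only)
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: subtract the column mean, divide by the column standard deviation
standardized = (values - values.mean()) / values.std()

# Normalization (min-max): rescale the column to the range [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())

# Discretization: map each value to a bin index (0 = low, 1 = medium, 2 = high);
# the bin edges 25 and 45 are arbitrary assumptions
bins = np.digitize(values, bins=[25.0, 45.0])

print(np.round(standardized, 3).tolist())
print(normalized.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(bins.tolist())         # [0, 0, 1, 1, 2]
```

After standardization the column has mean 0 and standard deviation 1, whereas normalization only guarantees that the smallest value becomes 0 and the largest becomes 1.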

<a name='02'></a>
## Fill Missing Values With Imputation ##

* **Data imputation**: filling in missing values with substitute data. Normally the substitute is a statistic (mean, median, most frequent value, ...) of the column that contains the missing value.
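
To make the choice of statistic concrete, the sketch below fits scikit-learn's `SimpleImputer` with three different strategies on a small hand-made array and prints the fill value learned for each column. The array `X_toy` and the dictionary `fill_values` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two columns, each with one missing value
X_toy = np.array([
    [1.0, 10.0],
    [1.0, np.nan],
    [np.nan, 10.0],
    [4.0, 40.0],
    [6.0, 20.0],
])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    X_filled = imputer.fit_transform(X_toy)
    # statistics_ holds the per-column fill value learned during fit
    fill_values[strategy] = imputer.statistics_
    print(strategy, imputer.statistics_, "missing left:", int(np.isnan(X_filled).sum()))
```

For the first column (observed values 1, 1, 4, 6) the mean is 3, the median is 2.5 and the most frequent value is 1, so the three strategies fill the same gap with three different numbers.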

The cells below apply a statistical imputation transform to the horse colic dataset. A full description of the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Horse+Colic).

In [63]:
from pandas import read_csv
from numpy import isnan 
from sklearn.impute import SimpleImputer

# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv"
dataframe = read_csv(url, header=None, na_values="?")

In [64]:
# Split into input and output elements.
#   The dataset description (linked above) lists attributes 1 to 28;
#   number 23 is the outcome (lived, died or was euthanized).
#   The code below uses column index 23 as the output and the rest as inputs.
#   Recall that: dataframe.shape => (300, 28)
data = dataframe.values
outcome_col, cols = 23, data.shape[1]
ix_cols = [c for c in range(cols) if c != outcome_col]
X, y = data[:, ix_cols], data[:, outcome_col]
# total missing
print("Missing before imputation: %d" %sum(isnan(X).flatten()))
# define imputer
imputer = SimpleImputer(strategy="mean")
# fit on the dataset 
imputer.fit(X)
# transform the dataset
Xtrans = imputer.transform(X)
print("Missing after imputation: %d" % sum(isnan(Xtrans).flatten()))

Missing before imputation: 1605
Missing after imputation: 0


In [65]:
# another technique to do data imputation
print("before data imputation ", sum(dataframe.isnull().sum()))
dataframe = dataframe.apply(lambda x: x.fillna(x.mean()),axis=0)
print("after data imputation ", sum(dataframe.isnull().sum()))

before data imputation  1605
after data imputation  0


<a name='03'></a>
## Select Features With Recursive Feature Elimination (RFE) ##

Feature ranking with recursive feature elimination: [RFE](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained, either through a specific attribute or through a callable. **Then, the least important features are pruned from the current set of features.** That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
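
The procedure described above can be sketched as a plain loop. This is a simplified illustration, not scikit-learn's implementation: it assumes a decision tree's `feature_importances_` as the importance measure, removes exactly one feature per round, and uses hypothetical names (`X_demo`, `remaining`, `eliminated`) and a small synthetic dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (sizes are assumptions for the sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, n_redundant=5,
                                     random_state=1)

remaining = list(range(X_demo.shape[1]))  # indices of features still in play
eliminated = []                           # indices in order of removal
n_to_keep = 5

while len(remaining) > n_to_keep:
    # 1. train the estimator on the current set of features
    model = DecisionTreeClassifier(random_state=1)
    model.fit(X_demo[:, remaining], y_demo)
    # 2. find the position of the least important feature
    weakest = int(np.argmin(model.feature_importances_))
    # 3. prune it and repeat on the smaller set
    eliminated.append(remaining.pop(weakest))

print("kept:", remaining)
print("eliminated (first removed first):", eliminated)
```

The loop stops once `n_to_keep` features are left, which mirrors the role of the number of features to select in the description above.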

* **Basically, RFE works by recursively removing attributes and building a model on those attributes that remain.**

* RFE is a transform. To use it, the class is first configured with the chosen algorithm, specified via the “estimator” argument, and with the number of features to select, specified via the “n_features_to_select” argument.

* The example below defines a synthetic classification dataset with ten input features, five informative and five redundant. RFE is then used to select five features using the decision tree algorithm.

In [74]:
# report which features were selected by RFE
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# define dataset
xColumns = 10
xRows = 1000
X, y = make_classification(n_samples=xRows, n_features=xColumns, n_informative=5, n_redundant=5, random_state=1)
# X.shape => (1000, 10)

# define RFE
#   n_features_to_select -> number of features to keep (not to remove)
#   .support_ -> boolean mask; the number of True values equals
#                n_features_to_select, and every False column is removed
#   .ranking_ -> rank 1 marks the kept features; higher ranks were
#                eliminated earlier (here rank 6 was removed first)
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# fit RFE
rfe.fit(X, y)
# summarize all features

def bool_(v):
	return 'Yes' if v else 'No ' 

for i in range(X.shape[1]):
	print('Column: %d, Remains? %s => Rank: %d' % (i, bool_(rfe.support_[i]), rfe.ranking_[i]))

Column: 0, Remains? No  => Rank: 5
Column: 1, Remains? No  => Rank: 4
Column: 2, Remains? Yes => Rank: 1
Column: 3, Remains? Yes => Rank: 1
Column: 4, Remains? Yes => Rank: 1
Column: 5, Remains? No  => Rank: 6
Column: 6, Remains? Yes => Rank: 1
Column: 7, Remains? No  => Rank: 3
Column: 8, Remains? Yes => Rank: 1
Column: 9, Remains? No  => Rank: 2
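
Because RFE is a transform, a fitted instance can also be used to reduce the dataset to the selected columns. The sketch below repeats the setup from the cell above so that it runs on its own; fixing `random_state=1` on the decision tree is an assumption added here for repeatability, so the selected columns may differ from the listing above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Same synthetic dataset as in the cell above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)

# random_state on the tree is an assumption, added only for repeatability
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1), n_features_to_select=5)
rfe.fit(X, y)

# transform() keeps only the columns where support_ is True
X_selected = rfe.transform(X)
print(X.shape, "->", X_selected.shape)

# get_support(indices=True) returns the column indices that were kept
kept = rfe.get_support(indices=True)
print("kept columns:", kept.tolist())
```

The reduced array `X_selected` can then be passed to any downstream model in place of `X`.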
