<a href="https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Feature_Engineering_Change_Numerical_Data_Distributions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Change Numerical Data Distributions**



Numerical input variables may have a highly skewed or non-standard distribution. This could be
caused by outliers in the data, multi-modal distributions, highly exponential distributions, and
more. Many machine learning algorithms prefer or perform better when numerical input variables
and even output variables in the case of regression have a standard probability distribution,
such as a Gaussian (normal) or a Uniform distribution.

The quantile transform provides an automatic way to transform a numeric input variable to
have a different data distribution, which in turn, can be used as input to a predictive model.

In this tutorial, you will learn:

* Many machine learning algorithms prefer or perform better when numerical variables have
a Gaussian or standard probability distribution.
* Quantile transforms are techniques for transforming numerical input or output variables
to have a Gaussian or Uniform probability distribution.
* How to use the QuantileTransformer to change the probability distribution of numeric
variables to improve the performance of predictive models.

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

This activity was prepared for the Practical AI/ML for Computational Biology and Chemistry Workshop (June 13-17, 2022, UD) https://github.com/udel-cbcb/al_ml_workshop

Code explanation has been enriched using [CS50.ai](https://cs50.ai/)

##Quantile Transforms
A quantile transform is a method that transforms the data to follow a uniform or a normal distribution. This is done by mapping the original data to its quantile values. In lay terms, it's like adjusting the data so that it fits into a specific shape or pattern, either uniform (where all outcomes are equally likely) or normal (where outcomes are most likely to be in the middle, less likely at the extremes).

In machine learning, this can be useful for making the data more suitable for certain algorithms. Some algorithms assume that the input data follows a normal distribution, and they may not perform well if this assumption is violated. By applying a quantile transform, we can make the data more compatible with these algorithms.

The transformation
can be applied to each numeric input variable in the training dataset and then provided as
input to a machine learning model to learn a predictive modeling task. This quantile transform
is available in the scikit-learn Python machine learning library via the **QuantileTransformer**
class.
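
To make "mapping the original data to its quantile values" more concrete, here is a minimal sketch of the mapping done by hand for a single variable. It is an illustration under simplifying assumptions (rank-based quantiles, no interpolation between landmarks) and is not how `QuantileTransformer` is implemented internally. The helper name `manual_normal_quantile_transform` is hypothetical.

```python
import numpy as np
from scipy.stats import norm, rankdata

def manual_normal_quantile_transform(values):
    """Map a 1D sample to an approximately standard normal shape via its ranks."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # rank each value (1..n, ties share the average rank), then convert the rank
    # to a quantile strictly between 0 and 1 so the inverse CDF stays finite
    quantiles = (rankdata(values) - 0.5) / n
    # the inverse CDF of the standard normal turns quantiles into z-values
    return norm.ppf(quantiles)

rng = np.random.default_rng(1)
skewed = np.exp(rng.standard_normal(1000))  # log-normal sample, right-skewed
transformed = manual_normal_quantile_transform(skewed)
print('mean: %.3f, std: %.3f' % (transformed.mean(), transformed.std()))
```

Because the mapping depends only on the rank of each value, the order of the observations is preserved while the shape of the distribution changes.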

To show this, we first create a sample of 1,000 random Gaussian values and add a
skew to the dataset. A histogram is created from the skewed dataset clearly showing the
distribution pushed to the far left.



In [None]:
# demonstration of the quantile transform
from numpy import exp
from numpy.random import randn
from sklearn.preprocessing import QuantileTransformer
from matplotlib import pyplot

# generate gaussian data sample
data = randn(1000)
# add a skew to the data distribution by transforming the data from a normal distribution to a log-normal distribution
data = exp(data)
# histogram of the raw data with a skew
pyplot.hist(data, bins=25)
pyplot.show()


Then a QuantileTransformer is used to map the data to a Gaussian distribution and standardize the result, centering the values on a mean of 0 with a standard deviation of
1.0. A histogram of the transformed data is created, showing a Gaussian-shaped data distribution.

In [None]:
# this reshapes the data into a 2D array with as many rows as there are elements in data and 1 column.
# This is necessary because many functions in sklearn expect data in this format.
data = data.reshape((len(data),1))
print(data.shape)
# perform a normal quantile transform of the raw data
# 'output_distribution' is the marginal distribution for the transformed data. The choices are
# 'uniform' (default) or 'normal'.
quantile = QuantileTransformer(output_distribution='normal')
data_trans = quantile.fit_transform(data)
# histogram of the transformed data
pyplot.hist(data_trans, bins=25)
pyplot.show()
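
The histograms give a visual impression. As a complementary check, the following sketch compares a few summary statistics before and after the transform. It assumes SciPy is available for the `skew` function, and it regenerates its own sample with a fixed seed, so the variable names here (`raw`, `trans_demo`) are illustrative rather than taken from the cells above.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
# log-normal sample as a single-column 2D array, as sklearn expects
raw = np.exp(rng.standard_normal(1000)).reshape(-1, 1)

qt = QuantileTransformer(n_quantiles=1000, output_distribution='normal')
trans_demo = qt.fit_transform(raw)

# sample skewness: near 0 for a symmetric distribution, large for a long right tail
raw_skew = skew(raw.ravel())
trans_skew = skew(trans_demo.ravel())
print('skewness before: %.3f, after: %.3f' % (raw_skew, trans_skew))
print('mean after: %.3f, std after: %.3f' % (trans_demo.mean(), trans_demo.std()))
```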

#Now a real example
##Diabetes Dataset
We used this dataset in the data preparation module. It classifies patient data as
either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem.

You can learn more about the dataset here:

* Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
* Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

###Download Diabetes data files

In [None]:
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv" -O pima-indians-diabetes.csv
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names" -O pima-indians-diabetes.names
!head pima-indians-diabetes.csv

###Summarizing the variables from the pima-indians-diabetes dataset

In [None]:
# load and summarize the dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# summarize the shape of the dataset
print(dataset.shape)
# summarize each variable
print(dataset.describe())


This confirms the 8
input variables, one output variable, and 768 rows of data.

Finally, a histogram is created for each input variable. If we ignore the clutter of the plots and
focus on the histograms themselves, we can see that many variables have a skewed distribution.
The dataset is a good candidate for using a quantile transform to make the variables
more Gaussian.

In [None]:
# histograms of the variables
fig = dataset.hist(xlabelsize=4, ylabelsize=4)
[x.title.set_size(4) for x in fig.ravel()]
# show the plot
pyplot.show()

Next, let's first fit and evaluate a machine learning model on the raw dataset. We will use
a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated
stratified k-fold cross-validation.

In [None]:
# evaluate KNN classifier on the raw dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
# KFold
#   is a cross-validator that divides the dataset into k folds.
# Stratified
#   is to ensure that each fold of dataset has the same proportion of observations with a given label.
# Repeated
#   provides a way to improve the estimated performance of a machine learning model.
# This involves simply repeating the cross-validation procedure multiple times and reporting the mean
# result across all folds from all runs. This mean result is expected to be a more accurate estimate
# of the true unknown underlying mean performance of the model on the dataset, as calculated using the standard error.
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
# load dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model: a classifier implementing the k-nearest neighbors vote
model = KNeighborsClassifier()
# evaluate the model using RepeatedStratifiedKFold cross validator,
# that repeats Stratified K-Fold n times with different randomization in each
# repetition.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report model performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In this case we can see that the model achieved a mean classification accuracy of about 71.7
percent.
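
To see what repeated stratified k-fold cross-validation does to the data, the sketch below runs the splitter on a small set of made-up labels. The label counts (65 and 35) and the names `X_demo`, `y_demo`, and `cv_demo` are assumptions chosen for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# hypothetical imbalanced labels: 65 examples of class 0 and 35 of class 1
y_demo = np.array([0] * 65 + [1] * 35)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the splitting

cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
test_fractions = []
for train_idx, test_idx in cv_demo.split(X_demo, y_demo):
    # fraction of class 1 in this test fold
    test_fractions.append(y_demo[test_idx].mean())

print('number of train/test splits:', len(test_fractions))
print('class 1 fraction per test fold: min %.2f, max %.2f'
      % (min(test_fractions), max(test_fractions)))
```

Each repeat reshuffles the rows before splitting, so the model is scored on `n_splits * n_repeats` different test folds, each with roughly the same class balance as the full dataset.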

##Normal Quantile Transform
It is often desirable to transform an input variable to have a normal probability distribution to improve the modeling performance. We can apply the quantile transform using the
QuantileTransformer class and set the `output_distribution` argument to `'normal'`. We
must also set the `n_quantiles` argument to a value less than the number of observations in the
training dataset, in this case, 100. Once defined, we can call the `fit_transform()` function and
pass it our dataset to create a quantile-transformed version of the dataset.

In [None]:
# visualize a normal quantile transform of the dataset
from pandas import read_csv
from pandas import DataFrame
from sklearn.preprocessing import QuantileTransformer
from matplotlib import pyplot
# load dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# retrieve just the numeric input values
data = dataset.values[:, :-1]
# perform a normal quantile transform of the dataset
# 'n_quantiles' is the number of quantiles to be computed. It corresponds to the number
# of landmarks used to discretize the cumulative distribution function.
# 'output_distribution' is the marginal distribution for the transformed data. The choices are
# 'uniform' (default) or 'normal'.
trans = QuantileTransformer(n_quantiles=100, output_distribution='normal')
data = trans.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# histograms of the variables
fig = dataset.hist(xlabelsize=4, ylabelsize=4)
[x.title.set_size(4) for x in fig.ravel()]
# show the plot
pyplot.show()

We can see that the shape of the histograms for each variable looks very Gaussian as compared
to the raw data.
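
The requirement that `n_quantiles` stay below the number of observations can be explored with a small sketch. It assumes a recent scikit-learn release, where the fitted transformer exposes the number of landmarks it actually used as `n_quantiles_` and warns when the requested value is too large. The 50-row sample is made up for illustration.

```python
import warnings
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
small = np.exp(rng.standard_normal((50, 1)))  # hypothetical sample with only 50 rows

# request more landmarks than there are rows and record any warnings
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    qt_big = QuantileTransformer(n_quantiles=1000, output_distribution='normal').fit(small)
print('requested 1000 landmarks, used:', qt_big.n_quantiles_)
print('warnings recorded:', len(caught))

# request fewer landmarks than rows
qt_ok = QuantileTransformer(n_quantiles=25, output_distribution='normal').fit(small)
print('requested 25 landmarks, used:', qt_ok.n_quantiles_)
print('shape of the learned landmarks:', qt_ok.quantiles_.shape)
```

This matters during cross-validation, because the transform is fit on the training folds, which contain fewer rows than the full dataset.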

Next, let's evaluate the same KNN model as in the previous section, but in this case on a
normal quantile transform of the dataset.

In [None]:
# Evaluate KNN with normal quantile transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import QuantileTransformer
from sklearn.pipeline import Pipeline
# load dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the pipeline
trans = QuantileTransformer(n_quantiles=100, output_distribution='normal')
# Classifier implementing the k-nearest neighbors vote.
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[('t', trans), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report pipeline performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In this case, we can see that the normal quantile transform results in a lift in
performance from 71.7 percent accuracy without the transform to about 73.4 percent with the
transform.
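
The evaluation wraps the transform and the model in a `Pipeline`. As a rough sketch of what that chaining amounts to on a single train/test split, the example below fits the transform on the training rows only, applies it to the test rows, and compares the predictions with those of an equivalent pipeline. The synthetic data and the names `X_demo` and `y_demo` are assumptions for illustration and do not come from the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
# hypothetical skewed data: two log-normal features, label depends on the first
X_demo = np.exp(rng.standard_normal((300, 2)))
y_demo = (X_demo[:, 0] > np.median(X_demo[:, 0])).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1)

# manual version: learn the quantile landmarks from the training rows only
trans = QuantileTransformer(n_quantiles=100, output_distribution='normal')
model = KNeighborsClassifier()
model.fit(trans.fit_transform(X_train), y_train)
manual_pred = model.predict(trans.transform(X_test))

# pipeline version: the same two steps chained together
pipeline = Pipeline(steps=[
    ('t', QuantileTransformer(n_quantiles=100, output_distribution='normal')),
    ('m', KNeighborsClassifier())])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

print('predictions identical:', np.array_equal(manual_pred, pipeline_pred))
```

When a pipeline is passed to `cross_val_score`, this fit-on-train, apply-to-test sequence is repeated for every fold, so the test rows never influence the learned quantiles.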

##Uniform Quantile Transform
Sometimes it can be beneficial to transform a highly exponential or multi-modal distribution to
have a uniform distribution. This is especially useful for data with a large and sparse range of
values, e.g. outliers that are common rather than rare. We can apply the transform by creating
a QuantileTransformer instance and setting the `output_distribution` argument to `'uniform'`
(the default).

In [None]:
# visualize a uniform quantile transform of the dataset
from pandas import read_csv
from pandas import DataFrame
from sklearn.preprocessing import QuantileTransformer
from matplotlib import pyplot
# load dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# retrieve just the numeric input values
data = dataset.values[:, :-1]
# perform a uniform quantile transform of the dataset
trans = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
data = trans.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# histograms of the variables
fig = dataset.hist(xlabelsize=4, ylabelsize=4)
[x.title.set_size(4) for x in fig.ravel()]
# show the plot
pyplot.show()

We can see that the shape of the histograms for each variable looks very uniform compared to
the raw data.
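
A quick numeric check of "looks very uniform" is to count how many values fall in each equal-width bin. The sketch below does this on a synthetic log-normal sample rather than the diabetes data, so the names `raw` and `uni` and the sample itself are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
raw = np.exp(rng.standard_normal((1000, 1)))  # right-skewed, single column

qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
uni = qt.fit_transform(raw)

# count observations in 10 equal-width bins before and after the transform
raw_counts, _ = np.histogram(raw, bins=10)
uni_counts, _ = np.histogram(uni, bins=10, range=(0.0, 1.0))
print('transformed range: %.3f to %.3f' % (uni.min(), uni.max()))
print('raw counts per bin:        ', raw_counts)
print('transformed counts per bin:', uni_counts)
```

For the raw sample most observations pile up in the first bin, while after the transform the counts are close to equal across bins.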

Next, let's evaluate the same KNN model as the previous section, but in this case on a
uniform quantile transform of the raw dataset.

In [None]:
# evaluate KNN classifier on the dataset with uniform quantile transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import QuantileTransformer
from sklearn.pipeline import Pipeline
# load dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the pipeline
trans = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[('t', trans), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report pipeline performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

In this case, we can see that the uniform quantile transform results in a lift in performance
from 71.7 percent accuracy without the transform to about 74.1 percent with it, which is also slightly better than the 73.4 percent achieved with the normal quantile transform.