Discretization, or binning, is the process of transforming continuous variables into discrete
variables by creating a set of contiguous intervals, also called bins, that span the range of
the variable values. Discretization is used to change the distribution of skewed variables
and to minimize the influence of outliers, which can improve the performance of some
machine learning models.

### 1. Dividing the variable into intervals of equal width
- Equal-width discretization splits the range of the variable into bins of the same width, for example 0-10, 10-20, 20-30, and so on.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from feature_engine.discretisation import EqualWidthDiscretiser

# scikit-learn: returns a NumPy array of ordinal bin indices
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
disc.fit_transform(df[['LSTAT', 'DIS', 'RM']])

# Feature-engine: returns a pandas DataFrame
disc = EqualWidthDiscretiser(bins=10, variables=['LSTAT', 'DIS', 'RM'])
disc.fit_transform(df[['LSTAT', 'DIS', 'RM']])
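
To make the equal-width idea concrete, here is a minimal NumPy-only sketch on a small made-up list of values. The helper name `equal_width_bins` and the toy data are assumptions for illustration; they are not part of scikit-learn or Feature-engine.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Hypothetical helper: return (edges, labels) for equal-width binning."""
    values = np.asarray(values, dtype=float)
    # n_bins intervals of identical width between the min and the max
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Compare against the interior edges only, so labels run from 0 to n_bins - 1
    labels = np.digitize(values, edges[1:-1])
    return edges, labels

# Toy data with two large values that stretch the range
values = [1, 2, 3, 10, 20, 50, 100]
edges, labels = equal_width_bins(values, n_bins=3)

print(edges)   # bin boundaries, all 33 units apart
print(labels)  # most observations end up in the first bin
```

Because the width is fixed by the range, a few large values can leave most observations in a single bin, which is one reason to consider the equal-frequency approach for skewed variables.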



### 2. Sorting the variable values into intervals of equal frequency
- Equal-frequency discretization divides the values of the variable into intervals that carry the same proportion of observations. The interval width is determined by quantiles, and therefore different intervals may have different widths.


In [None]:
from sklearn.preprocessing import KBinsDiscretizer
from feature_engine.discretisation import EqualFrequencyDiscretiser

# scikit-learn: returns a NumPy array of ordinal bin indices
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
disc.fit_transform(df[['LSTAT', 'DIS', 'RM']])

# Feature-engine: returns a pandas DataFrame
disc = EqualFrequencyDiscretiser(q=10, variables=['LSTAT', 'DIS', 'RM'])
disc.fit_transform(df[['LSTAT', 'DIS', 'RM']])
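
The claim that equal-frequency bins hold the same share of observations but differ in width can be checked with a short sketch. It uses `pandas.qcut` on synthetic, right-skewed data; the data and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed variable (assumed data, for illustration only)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(scale=2.0, size=1000))

# Split into 4 quantile-based bins; retbins=True also returns the bin edges
binned, edges = pd.qcut(skewed, q=4, labels=False, retbins=True)

counts = binned.value_counts().sort_index()  # observations per bin
widths = np.diff(edges)                      # width of each interval

print(counts)
print(widths)
```

On skewed data like this, the counts per bin match while the last interval is much wider than the first, because it has to reach into the long tail.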

 

### 3. Performing discretization with k-means clustering


In [None]:
from sklearn.preprocessing import KBinsDiscretizer

# Bin edges are placed midway between neighbouring 1-D k-means centroids
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='kmeans')
disc.fit_transform(X_train[['LSTAT', 'DIS', 'RM']])  # returns a NumPy array
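
As a rough illustration of what k-means discretization does, the sketch below runs a plain one-dimensional Lloyd's iteration in NumPy and places bin edges midway between neighbouring centroids. The helper `kmeans_bin_edges` and the toy data are hypothetical, and this is a simplified stand-in, not scikit-learn's actual implementation.

```python
import numpy as np

def kmeans_bin_edges(values, n_bins, n_iter=100):
    """Hypothetical helper: bin edges from a simple 1-D k-means."""
    values = np.asarray(values, dtype=float)
    # Start the centroids at the midpoints of equal-width bins
    uniform_edges = np.linspace(values.min(), values.max(), n_bins + 1)
    centers = (uniform_edges[:-1] + uniform_edges[1:]) / 2
    for _ in range(n_iter):
        # Assign every value to its nearest centroid
        nearest = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its values (keep it if it has none)
        new_centers = np.array([
            values[nearest == k].mean() if np.any(nearest == k) else centers[k]
            for k in range(n_bins)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    centers = np.sort(centers)
    inner = (centers[:-1] + centers[1:]) / 2  # midpoints between centroids
    return np.concatenate([[values.min()], inner, [values.max()]])

# Toy data with three obvious groups
values = np.array([1, 2, 3, 10, 11, 12, 20, 21, 22])
edges = kmeans_bin_edges(values, n_bins=3)
labels = np.digitize(values, edges[1:-1])

print(edges)
print(labels)
```

Unlike equal-width or equal-frequency binning, the boundaries here follow the natural groupings in the data, so each cluster of values gets its own bin.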




### 4. Using decision trees for discretization

In [None]:
# Each value is replaced by the prediction of the tree leaf it falls into,
# so the resulting "bins" are decimal values rather than integer indices
from sklearn.tree import DecisionTreeRegressor
from feature_engine.discretisation import DecisionTreeDiscretiser

# scikit-learn: fit a shallow tree on a single variable and use its predictions as bins
tree_model = DecisionTreeRegressor(max_depth=3, random_state=0)
tree_model.fit(X_train['LSTAT'].to_frame(), y_train)
X_train['lstat_tree'] = tree_model.predict(X_train['LSTAT'].to_frame())

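# Feature-engine: fits one tree per variable and picks max_depth by 10-fold cross-validation
treeDisc = DecisionTreeDiscretiser(cv=10, scoring='neg_mean_squared_error',
                                   variables=['LSTAT', 'RM', 'DIS'],
                                   regression=True, param_grid={'max_depth': [1, 2, 3, 4]})

# The discretiser is supervised, so the target must be passed as well
train_t = treeDisc.fit_transform(X_train, y_train)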
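
The following sketch shows why a decision tree acts as a discretiser: a tree of depth `d` has at most `2**d` leaves, so its predictions take at most that many distinct values, and each value is the mean target of the observations in that leaf. The synthetic data and variable names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed): the target is a noisy step function of x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.where(x < 3, 1.0, np.where(x < 7, 5.0, 9.0)) + rng.normal(0, 0.1, size=200)

# A depth-2 tree can create at most 2**2 = 4 leaves, that is, 4 bins
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(x.reshape(-1, 1), y)

# Each observation is replaced by the prediction of its leaf
x_binned = tree.predict(x.reshape(-1, 1))
n_bins = np.unique(x_binned).size

# Internal nodes have a feature index >= 0; their thresholds are the bin boundaries
thresholds = np.sort(tree.tree_.threshold[tree.tree_.feature >= 0])

print(n_bins)
print(thresholds)
```

The thresholds play the role of bin edges, and they are chosen to separate the target well, which is what distinguishes this approach from the unsupervised methods in the previous sections.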
