# Part 1.5: Binning (Discretization)

Binning, or discretization, converts a continuous numerical variable into a small number of discrete 'bins' or categories. It can help simple models, linear ones in particular, capture non-linear relationships, and it can make a model less sensitive to noise and outliers, at the cost of discarding the variation within each bin.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

data = {'age': [22, 25, 31, 45, 52, 60, 70, 85, 28, 35, 41, 58]}
df = pd.DataFrame(data)
print("Original DataFrame:")
df

Original DataFrame:


| | age |
|---|---|
| 0 | 22 |
| 1 | 25 |
| 2 | 31 |
| 3 | 45 |
| 4 | 52 |
| 5 | 60 |
| 6 | 70 |
| 7 | 85 |
| 8 | 28 |
| 9 | 35 |
| 10 | 41 |
| 11 | 58 |


### Using Scikit-learn's KBinsDiscretizer
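`KBinsDiscretizer` implements several binning strategies, selected with its `strategy` parameter: `'uniform'`, `'quantile'`, and `'kmeans'`. `n_bins` sets the number of bins, and `encode='ordinal'`, used in the examples below, returns each bin as an index from 0 to `n_bins - 1`.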

#### Equal-Width Binning (`strategy='uniform'`)
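Each bin spans the same range of values: (max - min) / `n_bins`. For the ages here that is (85 - 22) / 4 = 15.75, so the edges fall at 22, 37.75, 53.5, 69.25, and 85.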

In [2]:
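# Four equal-width bins; encode='ordinal' returns the bin index (0-3) for each value.
binner_uniform = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='uniform')
df_binned = df.copy()
# fit_transform expects 2D input and returns a 2D array; ravel() flattens it into one column.
df_binned['age_uniform_bins'] = binner_uniform.fit_transform(df[['age']]).ravel()

# bin_edges_ holds one array of edges per input feature.
print("Bin Edges (Uniform):")
print(binner_uniform.bin_edges_)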

print("\nDataFrame with Uniform Bins:")
df_binned

Bin Edges (Uniform):
[array([22.  , 37.75, 53.5 , 69.25, 85.  ])]

DataFrame with Uniform Bins:


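| | age | age_uniform_bins |
|---|---|---|
| 0 | 22 | 0.0 |
| 1 | 25 | 0.0 |
| 2 | 31 | 0.0 |
| 3 | 45 | 1.0 |
| 4 | 52 | 1.0 |
| 5 | 60 | 2.0 |
| 6 | 70 | 3.0 |
| 7 | 85 | 3.0 |
| 8 | 28 | 0.0 |
| 9 | 35 | 0.0 |
| 10 | 41 | 1.0 |
| 11 | 58 | 2.0 |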
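
The uniform edges printed above can be reproduced by hand, which makes the strategy easier to reason about. The following is a minimal NumPy sketch, not the library's internal code; the variable names (`ages`, `edges`, `bins`) are illustrative.

```python
import numpy as np

ages = np.array([22, 25, 31, 45, 52, 60, 70, 85, 28, 35, 41, 58])
n_bins = 4

# n_bins equal-width bins need n_bins + 1 evenly spaced edges.
edges = np.linspace(ages.min(), ages.max(), n_bins + 1)

# Digitize against the interior edges only, so the minimum lands in bin 0
# and the maximum (85) lands in the last bin, n_bins - 1.
bins = np.digitize(ages, edges[1:-1], right=False)

print(edges)
print(bins)
```

Because the edges depend only on the minimum and maximum, a single extreme value stretches every bin, which is why equal-width bins can end up unevenly populated.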


#### Equal-Frequency Binning (`strategy='quantile'`)
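Bin edges are placed at quantiles of the data, so each bin holds approximately the same number of data points. With `n_bins=4` the interior edges are the 25th, 50th, and 75th percentiles, and each bin receives 3 of the 12 ages.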

In [3]:
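# Four bins whose edges are quantiles of 'age', so the bins are equally populated.
binner_quantile = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
# ravel() flattens the 2D result into a single column.
df_binned['age_quantile_bins'] = binner_quantile.fit_transform(df[['age']]).ravel()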

print("Bin Edges (Quantile):")
print(binner_quantile.bin_edges_)

print("\nDataFrame with Quantile Bins:")
df_binned

Bin Edges (Quantile):
[array([22.  , 30.25, 43.  , 58.5 , 85.  ])]

DataFrame with Quantile Bins:




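| | age | age_uniform_bins | age_quantile_bins |
|---|---|---|---|
| 0 | 22 | 0.0 | 0.0 |
| 1 | 25 | 0.0 | 0.0 |
| 2 | 31 | 0.0 | 1.0 |
| 3 | 45 | 1.0 | 2.0 |
| 4 | 52 | 1.0 | 2.0 |
| 5 | 60 | 2.0 | 3.0 |
| 6 | 70 | 3.0 | 3.0 |
| 7 | 85 | 3.0 | 3.0 |
| 8 | 28 | 0.0 | 0.0 |
| 9 | 35 | 0.0 | 1.0 |
| 10 | 41 | 1.0 | 1.0 |
| 11 | 58 | 2.0 | 2.0 |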
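
As a cross-check, the quantile edges printed above can be recomputed directly with NumPy. This sketch assumes NumPy's default linear interpolation between data points, which reproduces those edges; the interpolation rule scikit-learn uses internally may differ between versions, so treat this as an illustration rather than the library's exact procedure.

```python
import numpy as np

ages = np.array([22, 25, 31, 45, 52, 60, 70, 85, 28, 35, 41, 58])
n_bins = 4

# Evenly spaced probabilities: 0, 0.25, 0.5, 0.75, 1.
probs = np.linspace(0, 1, n_bins + 1)

# Quantiles of the data at those probabilities (default: linear interpolation).
edges = np.quantile(ages, probs)

# Count the values per bin; np.histogram closes the last bin on the right,
# so the maximum is included.
counts, _ = np.histogram(ages, bins=edges)

print(edges)
print(counts)
```

Unlike the equal-width case, the bin widths here vary (8.25, 12.75, 15.5, and 26.5) while the counts stay equal.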


#### K-Means Binning (`strategy='kmeans'`)
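Values are clustered with one-dimensional K-Means, and each interior bin edge is placed midway between two adjacent cluster centers, so every value falls into the bin of its nearest center.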

In [4]:
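# Four bins built from 1D K-Means clusters of 'age'.
binner_kmeans = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='kmeans')
# ravel() flattens the 2D result into a single column.
df_binned['age_kmeans_bins'] = binner_kmeans.fit_transform(df[['age']]).ravel()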

print("Bin Edges (K-Means):")
print(binner_kmeans.bin_edges_)

print("\nDataFrame with K-Means Bins:")
df_binned

Bin Edges (K-Means):
[array([22.  , 37.1 , 52.5 , 68.25, 85.  ])]

DataFrame with K-Means Bins:


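| | age | age_uniform_bins | age_quantile_bins | age_kmeans_bins |
|---|---|---|---|---|
| 0 | 22 | 0.0 | 0.0 | 0.0 |
| 1 | 25 | 0.0 | 0.0 | 0.0 |
| 2 | 31 | 0.0 | 1.0 | 0.0 |
| 3 | 45 | 1.0 | 2.0 | 1.0 |
| 4 | 52 | 1.0 | 2.0 | 1.0 |
| 5 | 60 | 2.0 | 3.0 | 2.0 |
| 6 | 70 | 3.0 | 3.0 | 3.0 |
| 7 | 85 | 3.0 | 3.0 | 3.0 |
| 8 | 28 | 0.0 | 0.0 | 0.0 |
| 9 | 35 | 0.0 | 1.0 | 0.0 |
| 10 | 41 | 1.0 | 1.0 | 1.0 |
| 11 | 58 | 2.0 | 2.0 | 2.0 |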
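
To see where the K-Means edges above come from, here is a minimal sketch of 1D K-Means (Lloyd's algorithm) in plain NumPy. It is an illustration, not scikit-learn's implementation: it assumes the centers start at the midpoints of the equal-width bins, and it assumes no cluster ever becomes empty, which holds for this data.

```python
import numpy as np

ages = np.array([22, 25, 31, 45, 52, 60, 70, 85, 28, 35, 41, 58], dtype=float)
n_bins = 4

# Assumed initialization: the centers of the equal-width bins.
uniform_edges = np.linspace(ages.min(), ages.max(), n_bins + 1)
centers = (uniform_edges[:-1] + uniform_edges[1:]) / 2

for _ in range(100):
    # Assignment step: each value goes to its nearest center.
    labels = np.argmin(np.abs(ages[:, None] - centers[None, :]), axis=1)
    # Update step: each center moves to the mean of its assigned values
    # (this would fail for an empty cluster; none occurs here).
    new_centers = np.array([ages[labels == k].mean() for k in range(n_bins)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

# Interior edges sit midway between adjacent (sorted) centers;
# the outer edges are the data minimum and maximum.
centers = np.sort(centers)
midpoints = (centers[:-1] + centers[1:]) / 2
edges = np.concatenate(([ages.min()], midpoints, [ages.max()]))

print(centers)
print(edges)
```

On this data the centers settle at 28.2, 46, 59, and 77.5, and the midpoints between them give the interior edges 37.1, 52.5, and 68.25.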
