# Binning

When the true relationship between a predictor and the target is non-monotonic or piecewise constant, a strictly linear model can fail badly. Converting the feature into categorical "bins" lets the same linear algorithm approximate the step function without switching to a more complex model.
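
In symbols, a sketch of the two model forms (the notation $\beta$, $B_k$ and $K$ is introduced here for illustration and is not used elsewhere in the notebook). With the raw feature, logistic regression fits

$$
\operatorname{logit} P(y=1 \mid x) = \beta_0 + \beta_1 x ,
$$

which is monotonic in $x$ and so cannot rise and then fall. With $K$ bins $B_1, \dots, B_K$ and $B_1$ as the reference bin, the same algorithm fits

$$
\operatorname{logit} P(y=1 \mid x) = \beta_0 + \sum_{k=2}^{K} \beta_k \, \mathbf{1}[x \in B_k] ,
$$

so each bin gets its own constant log-odds: $\beta_0$ for $B_1$ and $\beta_0 + \beta_k$ for $B_k$.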

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, classification_report

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Synthetic data: a step-wise relationship
# x is uniform on [0, 100]; P(y=1) follows a "bucket" profile

np.random.seed(42)  # fixed seed so the run is repeatable (the value is arbitrary)
n = 12000

x = np.random.uniform(0, 100, n)

# Latent step-wise probabilities: np.select uses the first condition that matches
p = np.select(
    [x < 20, x < 40, x < 60, x < 80, x <= 100],
    [0.10,   0.30,   0.70,   0.30,   0.10])
y = np.random.binomial(1, p)

df = pd.DataFrame({'x': x, 'y': y})

In [None]:
df

In [None]:
# Jitter y slightly so overlapping points at 0 and 1 remain visible
plt.scatter(x,
            y + np.random.normal(0, .01, n),
            alpha=0.01);

In [None]:
# Example of doing the train/test split explicitly (80/20)

idx = np.random.permutation(n)

train, test = idx[:int(.8*n)], idx[int(.8*n):]

# .copy() so that adding columns later does not write to a slice of df
x_train, y_train = df.loc[train, ['x']].copy(), df.loc[train, 'y']
x_test,  y_test  = df.loc[test,  ['x']].copy(), df.loc[test,  'y']

In [None]:
# Baseline: plain LogisticRegression on raw numeric x

pipe_raw = Pipeline([
    ('clf', LogisticRegression(max_iter=1000))
]).fit(x_train, y_train)

auc_raw = roc_auc_score(y_test, pipe_raw.predict_proba(x_test)[:,1])

print(f"Plain linear model, ROC AUC = {auc_raw:.3f}")

print('Classification report: ')
print(classification_report(y_test, pipe_raw.predict(x_test)))

In [None]:
# Binned: cut x into 5 pre-defined buckets, then one-hot encode + LR
# These cut points match the data-generating process; in practice, determine them via EDA

bins = [0, 20, 40, 60, 80, 100]

x_train['x_bin'] = pd.cut(x_train['x'], bins=bins, include_lowest=True)
x_test['x_bin'] = pd.cut(x_test['x'], bins=bins, include_lowest=True)

ohe = ColumnTransformer(
        [('cat', OneHotEncoder(drop='first'), ['x_bin'])],
        remainder='drop')

pipe_bin = Pipeline([
    ('ohe', ohe),
    ('clf', LogisticRegression(max_iter=1000))
]).fit(x_train, y_train)

auc_bin = roc_auc_score(y_test, pipe_bin.predict_proba(x_test)[:,1])
print(f"Binned feature model, ROC AUC = {auc_bin:.3f}")

print('Classification report: ')
print(classification_report(y_test, pipe_bin.predict(x_test)))

In [None]:
# Visualize the results

# Predictions across a range of x
xx = np.linspace(0, 100, 500)

# Build a small DataFrame so both pipelines receive the column names they were fitted on
tmp = pd.DataFrame({'x': xx})
tmp['x_bin'] = pd.cut(tmp['x'], bins=bins, include_lowest=True)

pp_raw = pipe_raw.predict_proba(tmp[['x']])[:, 1]
pp_bin = pipe_bin.predict_proba(tmp[['x_bin']])[:, 1]

# Plot
plt.figure(figsize=(6,3))

plt.plot(xx, pp_raw, label='Raw LR (linear)')
plt.plot(xx, pp_bin, label='Binned LR')
plt.scatter(x, y + np.random.normal(0, .01, n), alpha=.03)
plt.legend()  # the labels above only appear once a legend is drawn

plt.ylabel('Predicted P(y=1)')
plt.xlabel('x')
plt.ylim(-.05,1.05)
plt.title('How binning lets a linear model mimic a step function')
plt.show()

## Results:

| Model     | Feature representation                           | Can mimic step-wise target?               | Test ROC-AUC |
| --------- | ------------------------------------------------ | ----------------------------------------- | ------------ |
| Plain LR  | 1 numeric coefficient                            | No: forces a single sigmoid trend         | 0.50         |
| Binned LR | 4 dummy coefficients (5 buckets minus reference) | Yes: assigns its own log-odds to each bin | 0.76         |


The linear algorithm itself did not change.

By discretising x into bins that align with real shifts in the target rate, the model gains enough flexibility to capture the non-linear structure while keeping its interpretability and training speed.
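
One way to check the interpretability claim is to convert the fitted coefficients back into one probability per bin. The block below is a minimal, self-contained sketch that assumes the same synthetic step profile and cut points as the notebook. The string bin labels, the seeded generator `rng`, and the names `bin_probs` and `empirical` are illustrative choices, not part of the original cells.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Same synthetic step profile as in the notebook
n = 12000
x = rng.uniform(0, 100, n)
p = np.select([x < 20, x < 40, x < 60, x < 80],
              [0.10, 0.30, 0.70, 0.30], default=0.10)
y = rng.binomial(1, p)

# Bin with readable string labels (illustrative; the notebook uses intervals)
bins = [0, 20, 40, 60, 80, 100]
labels = ["0-20", "20-40", "40-60", "60-80", "80-100"]
X = pd.DataFrame({
    "x_bin": pd.Series(pd.cut(x, bins=bins, labels=labels,
                              include_lowest=True)).astype(str)
})

# One-hot encode with the first category ("0-20") dropped as the reference
enc = OneHotEncoder(drop="first")
Z = enc.fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(Z, y)

# Reference bin log-odds = intercept; every other bin = intercept + its coefficient
log_odds = np.concatenate([[clf.intercept_[0]],
                           clf.intercept_[0] + clf.coef_[0]])
bin_probs = pd.Series(1.0 / (1.0 + np.exp(-log_odds)),
                      index=enc.categories_[0])

# Observed positive rate per bin, for comparison
empirical = pd.Series(y).groupby(X["x_bin"]).mean()

print(pd.DataFrame({"model": bin_probs, "empirical": empirical}).round(3))
```

With this many rows per bin, the default L2 penalty barely shrinks the coefficients, so the model's per-bin probabilities should sit very close to the observed rates. That is what makes a statement such as "this band has roughly a 70% positive rate" readable straight off the model.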

* Binning:
  * Helps with piecewise or plateau effects (marketing spend thresholds, credit-score bands, age brackets).
  * Is robust to outliers (extreme values map to a common "tail" bin).
  * Prepares features for monotonic constraints in scorecards and for WOE (weight-of-evidence) encoding in credit-risk models.
  * Aids explainability: "Applicants earning 40,000-60,000 have 3x the default risk" is easier to tell a stakeholder than "slope = 0.08 log-odds per 1,000".
  * Caveat: tree-based models already split features internally, so careless binning there can harm resolution. Always profile the feature and the algorithm before deciding.
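
The outlier and WOE points above can be sketched together. The block below is a hedged illustration, not the notebook's method: it assumes open-ended outer edges (`-np.inf`, `np.inf`) so that extreme values fall into a tail bin, and it assumes the convention WOE = ln(share of events / share of non-events). Some credit-risk texts use the opposite sign, so check the convention in use. The names `edges`, `counts`, and `outlier_bin` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Same synthetic step profile as in the notebook
n = 12000
x = rng.uniform(0, 100, n)
p = np.select([x < 20, x < 40, x < 60, x < 80],
              [0.10, 0.30, 0.70, 0.30], default=0.10)
y = rng.binomial(1, p)
df = pd.DataFrame({"x": x, "y": y})

# Open-ended outer edges: values below 20 or above 80 go to a tail bin
edges = [-np.inf, 20, 40, 60, 80, np.inf]
df["x_bin"] = pd.cut(df["x"], bins=edges)

# An extreme value is assigned to the last bin instead of becoming NaN
outlier_bin = pd.cut(pd.Series([1e6]), bins=edges).iloc[0]
print("1e6 falls in bin:", outlier_bin)

# Event and non-event counts per bin
counts = df.groupby("x_bin", observed=True)["y"].agg(events="sum", total="count")
counts["non_events"] = counts["total"] - counts["events"]

# WOE under the assumed convention: ln(event share / non-event share)
event_share = counts["events"] / counts["events"].sum()
non_event_share = counts["non_events"] / counts["non_events"].sum()
counts["woe"] = np.log(event_share / non_event_share)

print(counts)
```

A bin with zero events or zero non-events would produce an infinite WOE. Real scorecard code usually merges such bins or applies a small smoothing constant, which this sketch leaves out.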