# The problem with continuous variables
Continuous variables (like height, weight, or salary) can take infinitely many possible values, so it does not
make sense to count the frequency of each individual value.

The solution is to group them into classes (intervals):

| Height (cm) | f |
|---|---|
| 150-159 | 4 |
| 160-169 | 3 |
| 170-179 | 10 |
| 180-189 | 5 |
...

## How to choose the number of classes (bins)?
When we transform a continuous variable into classes (or bins), we are **discretizing** the data.

If we create too few classes, we may lose important details.

If we create too many classes, the result becomes noisy and hard to interpret.

So, we need **balance**.

The goal is to choose enough classes to represent the **spread, symmetry, and shape** of the data.
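
The trade-off can be seen with a minimal sketch. The sample below is hypothetical (200 simulated heights with an arbitrary seed), and the three values of k are picked only to show the two extremes and a middle ground:

```python
import numpy as np

# Hypothetical sample: 200 heights (cm) from a normal distribution.
rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
heights = rng.normal(170, 10, 200)

counts_by_k = {}
for k in (2, 8, 100):
    # np.histogram splits [min, max] into k equal-width classes
    counts, edges = np.histogram(heights, bins=k)
    counts_by_k[k] = counts
    empty = int((counts == 0).sum())
    print(f"k={k:>3}: largest class holds {counts.max()} values, "
          f"{empty} empty classes")
```

With very few classes almost everything lands in one or two intervals, while with very many classes most intervals hold only a handful of values or none at all.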

## Classical methods for choosing the number of classes (k)
In **descriptive statistics**, there are a few empirical rules that help estimate the “ideal” number of classes, denoted by **k**.

### Sturges' Rule
The simplest and most traditional rule.
It’s suitable when the sample size n is not very large.
Since $3.322 \approx 1/\log_{10}(2)$, the formula is equivalent to $k = 1 + \log_2(n)$.
$$
k = 1 + 3.322 \log_{10}(n)
$$
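
As a quick illustration of how slowly k grows with n under this rule, here is a minimal sketch. The function name `sturges_k` is hypothetical, and it returns the raw value so that the rounding convention (truncating or rounding up) stays an explicit choice:

```python
import math

def sturges_k(n: int) -> float:
    """Raw (unrounded) number of classes suggested by Sturges' rule."""
    return 1 + 3.322 * math.log10(n)

# Each tenfold increase in n adds about 3.3 classes.
for n in (30, 100, 1000, 10000):
    print(n, round(sturges_k(n), 2))
```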

### Freedman–Diaconis Rule
A more robust rule that takes data dispersion into account through the interquartile range (IQR).
It’s well suited to skewed data or data with outliers.
This rule sets the bin width **based on the actual variability in your data**, so the number of classes adapts to the sample instead of depending on n alone.

$h$ = bin width (the size of each class)

$IQR = Q_3 - Q_1$ (the difference between the 75th and 25th percentiles)

$$
h = 2 \times \frac{IQR}{n^{1/3}}, \quad
k = \frac{\max(x) - \min(x)}{h}
$$
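
The two formulas can be sketched as a small function. The name `freedman_diaconis` is hypothetical, the quartiles use NumPy's default (linear interpolation) percentile method, and the sketch assumes the IQR is not zero:

```python
import numpy as np

def freedman_diaconis(x):
    """Return (h, k): bin width and raw number of classes."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1                      # assumed to be non-zero
    h = 2 * iqr / len(x) ** (1 / 3)
    k = (x.max() - x.min()) / h
    return h, k

# Hypothetical data: the integers 1..64, so n ** (1/3) is 4.
h, k = freedman_diaconis(np.arange(1, 65))
print(h, k)
```

Here Q1 = 16.75 and Q3 = 48.25, so IQR = 31.5, h = 2 × 31.5 / 4 = 15.75, and k = 63 / 15.75 = 4.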

### Scott’s Rule
Very similar to Freedman–Diaconis, but it uses the **standard deviation** instead of the IQR:
$$
h = 3.5 \times \frac{\text{standard deviation}}{n^{1/3}}, \quad
k = \frac{\max(x) - \min(x)}{h}
$$
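
To make the difference between the two rules concrete, the sketch below computes both bin widths on a toy sample with and without a single extreme value. The function names and the data are hypothetical, and `ddof=1` is assumed so that the standard deviation matches the pandas default:

```python
import numpy as np

def fd_width(x):
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3)."""
    q1, q3 = np.percentile(x, [25, 75])
    return 2 * (q3 - q1) / len(x) ** (1 / 3)

def scott_width(x):
    """Scott bin width: 3.5 * s / n^(1/3), with the sample standard deviation."""
    return 3.5 * np.std(x, ddof=1) / len(x) ** (1 / 3)

clean = np.arange(1, 101, dtype=float)   # 1, 2, ..., 100
dirty = np.append(clean, 10_000.0)       # same data plus one extreme value

for name, data in (("clean", clean), ("with outlier", dirty)):
    print(f"{name:>12}: h_fd={fd_width(data):.1f}  "
          f"h_scott={scott_width(data):.1f}")
```

On the clean data the two widths are close. The single outlier barely moves the IQR-based width but inflates the standard deviation, so Scott's width grows many times over.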




In [1]:
import pandas as pd
import numpy as np

Monthly income data example:

In [2]:
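n = 256  # sample size
# Simulated monthly incomes: normal with mean 5000 and standard deviation 1500 (unseeded, so values change on each run)
df = pd.DataFrame({'income': np.random.normal(5000, 1500, n)})
df['income'] = df['income'].clip(lower=0)  # replace negatives with 0
x_min, x_max = df['income'].min(), df['income'].max()
IQR = df['income'].quantile(0.75) - df['income'].quantile(0.25)  # Q3 - Q1
std = df['income'].std()  # sample standard deviation (ddof=1)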

In [3]:
k_sturges = int(1 + 3.322 * np.log10(n))  # int() truncates the decimal part
k_sturges

9

In [4]:
h_fd = 2 * (IQR / (n ** (1/3)))  # Freedman–Diaconis bin width
k_fd = int( (x_max - x_min) / h_fd )  # range / width, truncated
k_fd

11

In [5]:
h_sct = 3.5 * (std / (n ** (1/3)))  # Scott bin width
k_sct = int( (x_max - x_min) / h_sct )  # range / width, truncated
k_sct

9
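
As a cross-check, NumPy ships these three rules as named estimators in `np.histogram_bin_edges`. The sketch below uses its own hypothetical sample with an arbitrary seed, so its counts are not expected to match the outputs above. NumPy also rounds the number of classes up rather than truncating, and its Scott estimator uses a constant close to 3.49 with the population standard deviation, so small differences from a hand computation are normal:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
# Hypothetical income-like sample, with negatives replaced by 0
x = np.clip(rng.normal(5000, 1500, 256), 0, None)

k_by_rule = {}
for rule in ("sturges", "fd", "scott"):
    edges = np.histogram_bin_edges(x, bins=rule)
    k_by_rule[rule] = len(edges) - 1  # k classes need k + 1 edges
    print(f"{rule:>8}: {k_by_rule[rule]} classes "
          f"of width {edges[1] - edges[0]:.1f}")
```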

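**Now we use pandas to compute the bin edges based on the number of classes (k)**

Given an integer `bins`, `pd.cut()` splits the range of the data into that many equal-width intervals, closed on the right by default. The lowest edge is moved slightly below the minimum so that the smallest value falls inside the first class.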
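
To see what `pd.cut()` does with an integer `bins` argument, here is a minimal sketch on hypothetical toy data. Passing `retbins=True` also returns the computed edges:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6])  # hypothetical toy data

# Three equal-width classes over the range [1, 6]
classes, edges = pd.cut(values, bins=3, retbins=True)

print(edges)                                 # 4 edges delimit 3 classes
print(classes.value_counts().sort_index())   # frequency of each class
```

The first edge comes out slightly below 1, which is why the first interval in such tables starts a little under the minimum of the data.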

In [6]:
df_sturges = df.copy(deep=True)
df_sturges['Class'] = pd.cut(df['income'], bins=k_sturges)
abs_freq = df_sturges['Class'].value_counts().sort_index()
rel_freq = df_sturges['Class'].value_counts(normalize=True).sort_index()
print(abs_freq)
print(rel_freq)

Class
(738.132, 1607.265]      3
(1607.265, 2468.646]    10
(2468.646, 3330.026]    25
(3330.026, 4191.407]    46
(4191.407, 5052.787]    50
(5052.787, 5914.168]    57
(5914.168, 6775.548]    33
(6775.548, 7636.929]    23
(7636.929, 8498.309]     9
Name: count, dtype: int64
Class
(738.132, 1607.265]     0.011719
(1607.265, 2468.646]    0.039062
(2468.646, 3330.026]    0.097656
(3330.026, 4191.407]    0.179688
(4191.407, 5052.787]    0.195312
(5052.787, 5914.168]    0.222656
(5914.168, 6775.548]    0.128906
(6775.548, 7636.929]    0.089844
(7636.929, 8498.309]    0.035156
Name: proportion, dtype: float64


In [7]:
sturges = pd.DataFrame({
    'Absolute frequency': abs_freq,
    'Relative frequency': rel_freq
})
sturges

| Class | Absolute frequency | Relative frequency |
|---|---|---|
| (738.132, 1607.265] | 3 | 0.011719 |
| (1607.265, 2468.646] | 10 | 0.039062 |
| (2468.646, 3330.026] | 25 | 0.097656 |
| (3330.026, 4191.407] | 46 | 0.179688 |
| (4191.407, 5052.787] | 50 | 0.195312 |
| (5052.787, 5914.168] | 57 | 0.222656 |
| (5914.168, 6775.548] | 33 | 0.128906 |
| (6775.548, 7636.929] | 23 | 0.089844 |
| (7636.929, 8498.309] | 9 | 0.035156 |


In [8]:
sturges.reset_index(inplace=True)
sturges

|  | Class | Absolute frequency | Relative frequency |
|---|---|---|---|
| 0 | (738.132, 1607.265] | 3 | 0.011719 |
| 1 | (1607.265, 2468.646] | 10 | 0.039062 |
| 2 | (2468.646, 3330.026] | 25 | 0.097656 |
| 3 | (3330.026, 4191.407] | 46 | 0.179688 |
| 4 | (4191.407, 5052.787] | 50 | 0.195312 |
| 5 | (5052.787, 5914.168] | 57 | 0.222656 |
| 6 | (5914.168, 6775.548] | 33 | 0.128906 |
| 7 | (6775.548, 7636.929] | 23 | 0.089844 |
| 8 | (7636.929, 8498.309] | 9 | 0.035156 |


In [9]:
df_fd = df.copy(deep=True)
df_fd['Class'] = pd.cut(df['income'], bins=k_fd)
abs_freq = df_fd['Class'].value_counts().sort_index()
rel_freq = df_fd['Class'].value_counts(normalize=True).sort_index()

In [10]:
fd = pd.DataFrame({
    'Absolute frequency': abs_freq,
    'Relative frequency': rel_freq
})
fd.reset_index(inplace=True)
fd

|  | Class | Absolute frequency | Relative frequency |
|---|---|---|---|
| 0 | (738.132, 1450.65] | 3 | 0.011719 |
| 1 | (1450.65, 2155.416] | 6 | 0.023438 |
| 2 | (2155.416, 2860.182] | 15 | 0.058594 |
| 3 | (2860.182, 3564.948] | 25 | 0.097656 |
| 4 | (3564.948, 4269.714] | 38 | 0.148438 |
| 5 | (4269.714, 4974.48] | 42 | 0.164062 |
| 6 | (4974.48, 5679.246] | 45 | 0.175781 |
| 7 | (5679.246, 6384.012] | 31 | 0.121094 |
| 8 | (6384.012, 7088.777] | 31 | 0.121094 |
| 9 | (7088.777, 7793.543] | 11 | 0.042969 |
| 10 | (7793.543, 8498.309] | 9 | 0.035156 |


In [11]:
df_scott = df.copy(deep=True)
df_scott['Class'] = pd.cut(df['income'], bins=k_sct)
abs_freq = df_scott['Class'].value_counts().sort_index()
rel_freq = df_scott['Class'].value_counts(normalize=True).sort_index()

In [13]:
scott = pd.DataFrame({
    'Absolute frequency': abs_freq,
    'Relative frequency': rel_freq
})
scott.reset_index(inplace=True)
scott

|  | Class | Absolute frequency | Relative frequency |
|---|---|---|---|
| 0 | (738.132, 1607.265] | 3 | 0.011719 |
| 1 | (1607.265, 2468.646] | 10 | 0.039062 |
| 2 | (2468.646, 3330.026] | 25 | 0.097656 |
| 3 | (3330.026, 4191.407] | 46 | 0.179688 |
| 4 | (4191.407, 5052.787] | 50 | 0.195312 |
| 5 | (5052.787, 5914.168] | 57 | 0.222656 |
| 6 | (5914.168, 6775.548] | 33 | 0.128906 |
| 7 | (6775.548, 7636.929] | 23 | 0.089844 |
| 8 | (7636.929, 8498.309] | 9 | 0.035156 |
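
The three frequency tables are built with the same steps, so they could be produced by one small helper. This is only a sketch: the name `frequency_table` is hypothetical, and the sample is a seeded stand-in for the income data, so its numbers will not match the tables above:

```python
import numpy as np
import pandas as pd

def frequency_table(values: pd.Series, k: int) -> pd.DataFrame:
    """Equal-width frequency table with k classes."""
    classes = pd.cut(values, bins=k)
    abs_freq = classes.value_counts().sort_index()
    return pd.DataFrame({
        'Class': abs_freq.index.astype(str),
        'Absolute frequency': abs_freq.to_numpy(),
        # Same result as value_counts(normalize=True)
        'Relative frequency': abs_freq.to_numpy() / abs_freq.sum(),
    })

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
income = pd.Series(np.clip(rng.normal(5000, 1500, 256), 0, None))

table = frequency_table(income, 9)
print(table)
```

Calling `frequency_table(df['income'], k_fd)` or `frequency_table(df['income'], k_sct)` would then give the other two tables without repeating the cell.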
