In [49]:
import numpy as np
import pandas as pd
from random import sample, choice

In [35]:
df = pd.DataFrame(
    data={
        'ages': sample(range(0, 100), 25), 
        'sex': [choice(['M', 'F']) for _ in range(25)],
    }
)
df

   ages sex
0    66   F
1    19   F
2    71   F
3    36   F
4    24   M
5    92   F
6    45   M
7    23   F
8    25   M
9    67   F


In [36]:
df.ages.describe()

count    25.000000
mean     55.760000
std      28.172504
min      10.000000
25%      30.000000
50%      62.000000
75%      77.000000
max      99.000000
Name: ages, dtype: float64

In [41]:
# equal-width bins
pd.cut(df.ages, bins=3, precision=0, include_lowest=False)

0     (40.0, 69.0]
1     (10.0, 40.0]
2     (69.0, 99.0]
3     (10.0, 40.0]
4     (10.0, 40.0]
5     (69.0, 99.0]
6     (40.0, 69.0]
7     (10.0, 40.0]
8     (10.0, 40.0]
9     (40.0, 69.0]
10    (40.0, 69.0]
11    (69.0, 99.0]
12    (40.0, 69.0]
13    (69.0, 99.0]
14    (10.0, 40.0]
15    (69.0, 99.0]
16    (40.0, 69.0]
17    (69.0, 99.0]
18    (10.0, 40.0]
19    (10.0, 40.0]
20    (69.0, 99.0]
21    (69.0, 99.0]
22    (40.0, 69.0]
23    (10.0, 40.0]
24    (69.0, 99.0]
Name: ages, dtype: category
Categories (3, interval[float64, right]): [(10.0, 40.0] < (40.0, 69.0] < (69.0, 99.0]]

In [42]:
# equal-frequency bins, i.e., bins holding approximately equal numbers of observations
pd.qcut(df.ages, q=4, precision=2, retbins=False)

0     (62.0, 77.0]
1     (9.99, 30.0]
2     (62.0, 77.0]
3     (30.0, 62.0]
4     (9.99, 30.0]
5     (77.0, 99.0]
6     (30.0, 62.0]
7     (9.99, 30.0]
8     (9.99, 30.0]
9     (62.0, 77.0]
10    (62.0, 77.0]
11    (77.0, 99.0]
12    (30.0, 62.0]
13    (77.0, 99.0]
14    (9.99, 30.0]
15    (77.0, 99.0]
16    (30.0, 62.0]
17    (62.0, 77.0]
18    (9.99, 30.0]
19    (30.0, 62.0]
20    (77.0, 99.0]
21    (62.0, 77.0]
22    (30.0, 62.0]
23    (9.99, 30.0]
24    (77.0, 99.0]
Name: ages, dtype: category
Categories (4, interval[float64, right]): [(9.99, 30.0] < (30.0, 62.0] < (62.0, 77.0] < (77.0, 99.0]]
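
The difference between the two strategies is easiest to see in the per-bin counts. The following is a minimal sketch on a small, hypothetical set of skewed ages (not the random sample above, which changes on every run); the names `width_counts` and `freq_counts` are illustrative only.

```python
import pandas as pd

# Hypothetical, deliberately skewed ages so the result is reproducible.
ages = pd.Series([12, 15, 18, 21, 22, 25, 27, 30, 31, 35, 58, 95], name="ages")

# Equal-width: three intervals of the same length; counts can be very uneven.
width_counts = pd.cut(ages, bins=3).value_counts().sort_index()

# Equal-frequency: edges placed at quantiles; counts come out (nearly) equal.
freq_counts = pd.qcut(ages, q=3).value_counts().sort_index()

print(width_counts)
print(freq_counts)
```

With skewed data, `pd.cut` tends to leave most observations in one bin, while `pd.qcut` trades equal interval lengths for balanced counts.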

The Freedman–Diaconis rule chooses the bin width h (not the number of bins directly) for a sample x, based on the interquartile range (IQR) and the number of observations n:

h = 2 * IQR(x) / n^(1/3)

The number of bins k then follows by dividing the range of the data by h and rounding up: k = ceil((max(x) - min(x)) / h).

In [47]:
# Freedman–Diaconis bin width: 2 * IQR / n^(1/3)
iqr = df.ages.quantile(0.75) - df.ages.quantile(0.25)
2 * iqr / (df.ages.count() ** (1/3))

np.float64(32.1475477975219)
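
The quantity 2 * IQR / n^(1/3) is conventionally read as a bin width, so a bin count still has to be derived from it by dividing the data range by that width. A minimal sketch, using only the summary statistics from the `describe()` output above (the variable names are illustrative):

```python
import math

# Summary statistics copied from the describe() output above.
n, q25, q75 = 25, 30.0, 77.0
lo, hi = 10.0, 99.0

h = 2 * (q75 - q25) / n ** (1 / 3)  # Freedman–Diaconis bin width
k = math.ceil((hi - lo) / h)        # number of bins covering [lo, hi]

print(round(h, 2), k)
```

For this sample the width of about 32.15 over a range of 89 gives 3 bins. NumPy implements this rule as `np.histogram_bin_edges(x, bins="fd")`, which can be used on the raw data instead of computing the width by hand.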

Sturges' rule estimates the number of bins k for approximately Gaussian data from the number of observations n alone:

k = 1 + 3.322 * log10(n)

This is equivalent to k = 1 + log2(n), and the result is usually rounded up to a whole number.

In [50]:
# Sturges' rule uses the base-10 logarithm; np.log is the natural log and overestimates k
round(1 + 3.322 * np.log10(df.ages.count()), 2)

np.float64(5.64)
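
Sturges' rule is commonly stated with a base-2 logarithm, k = 1 + log2(n); the constant 3.322 is approximately 1 / log10(2), so the base-10 form gives the same value. The sketch below checks that equivalence for n = 25 and compares the rounded result with NumPy's built-in `"sturges"` estimator. The evenly spaced array `x` is hypothetical stand-in data with the same n, minimum, and maximum as the sample above.

```python
import math
import numpy as np

n = 25  # number of observations, as in the describe() output above

k_log2 = 1 + math.log2(n)            # base-2 form
k_log10 = 1 + 3.322 * math.log10(n)  # base-10 form; 3.322 ≈ 1 / log10(2)
k = math.ceil(k_log2)                # round up to a whole number of bins

# Cross-check with NumPy's "sturges" estimator on hypothetical data.
x = np.linspace(10.0, 99.0, n)
edges = np.histogram_bin_edges(x, bins="sturges")

print(k, len(edges) - 1)
```

Using the natural logarithm in place of log10 roughly doubles the estimate, which is why the base of the logarithm matters here.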

Partitioning people's ages into bins as described in the preceding section can be problematic. Suppose, for example, that the bins are decade-wide (20 to 29, 30 to 39, and so on) and that persons A, B, and C are 29, 30, and 39 years old, respectively. A and B are probably much more similar to each other than B and C are, yet the binning places B in the same bin as C and in a different bin from A. Because binning discards this information, it can increase both Type I errors (false positives) and Type II errors (false negatives).
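
The boundary effect can be reproduced directly with `pd.cut`. This is a small sketch that assumes decade-wide bins [20, 30) and [30, 40); the bin edges and the variable names are chosen for illustration.

```python
import pandas as pd

people = pd.Series({"A": 29, "B": 30, "C": 39}, name="age")

# Assumed decade bins; right=False makes each interval closed on the left.
decade = pd.cut(people, bins=[20, 30, 40], right=False)

same_bin_ab = decade["A"] == decade["B"]  # 29 and 30 fall in different bins
same_bin_bc = decade["B"] == decade["C"]  # 30 and 39 share a bin

gap_ab = abs(people["A"] - people["B"])   # actual age difference A vs. B
gap_bc = abs(people["B"] - people["C"])   # actual age difference B vs. C

print(decade)
print(same_bin_ab, same_bin_bc, gap_ab, gap_bc)
```

A and B differ by one year but land in different bins, while B and C differ by nine years and share one, so any analysis that sees only the bin labels treats B as more like C.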