# Data discretization

Data discretization is the process of converting continuous numerical data into a finite set of intervals or categories. It is a data preprocessing technique that can reduce data complexity, smooth out noise, and, for some models, improve accuracy. It is commonly used in machine learning, data mining, and other data analysis applications.

This section covers three discretization methods: equal width binning, equal frequency binning, and entropy-based discretization.

## Equal width binning

Equal width binning is a data discretization method that divides a continuous variable into equal-width intervals, or bins. The width of each bin is determined by dividing the range of the variable by the number of desired bins.

For example, suppose we have a continuous variable with a range of 0 to 100 and we want to divide it into 5 bins. Each bin would have a width of (100-0)/5 = 20. The first bin would cover 0 to 20, the second 20 to 40, and so on up to 80 to 100. A value that lies exactly on an edge, such as 20, has to be assigned to one of the two adjacent bins by convention.
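The arithmetic above can be checked with a short sketch. The sample `values` are made up for illustration, and `np.linspace` is just one convenient way to build the edges.

```python
import numpy as np
import pandas as pd

# Range 0..100 split into 5 equal-width bins -> 6 edges, each 20 apart
edges = np.linspace(0, 100, 6)
print(edges.tolist())  # [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]

# Hypothetical sample values, including points that sit exactly on an edge
values = [0, 19.9, 20, 55, 100]

# labels=False returns the integer bin index instead of an interval label.
# pandas intervals are right-closed by default, so 20 lands in bin 0;
# include_lowest=True keeps the value 0 inside bin 0 as well.
codes = pd.cut(values, bins=edges, labels=False, include_lowest=True)
print(codes.tolist())  # [0, 0, 0, 2, 4]
```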

Equal width binning is a simple and easy-to-understand method of discretization, but it may not always be the most effective. In cases where the distribution of the data is not uniform, some bins may have very few observations, while others may have a large number. This can result in loss of information and reduced accuracy in analysis. Other methods of discretization, such as equal frequency binning and entropy-based binning, may be more appropriate in such cases.

In [1]:
import pandas as pd
import numpy as np

# Create a sample dataset
df = pd.DataFrame({'A': np.random.randint(0, 100, size=100)})

# Define the number of bins and bin width
num_bins = 5
bin_width = (df['A'].max() - df['A'].min()) / num_bins

# Create the bins
bins = []
for i in range(num_bins):
    bins.append(df['A'].min() + i * bin_width)
bins.append(df['A'].max())

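# Assign each value to a bin; include_lowest=True keeps the minimum value,
# which would otherwise fall outside the first right-closed interval and become NaN
df['bin'] = pd.cut(df['A'], bins=bins, include_lowest=True)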

# Print the resulting dataframe
print(df.head())

    A           bin
0  60  (59.4, 79.2]
1  64  (59.4, 79.2]
2  54  (39.6, 59.4]
3  81  (79.2, 99.0]
4  59  (39.6, 59.4]


This code generates a random dataset with a single column `'A'`. It then calculates the bin width based on the range of values in `'A'` and the desired number of bins. It creates a list of bin edges, and uses the `cut` function from `pandas` to assign each value in `'A'` to a bin. The resulting dataframe includes a new column `'bin'` with the assigned bin for each value.
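As a shorter alternative, `pd.cut` also accepts an integer for `bins` and then computes the equal-width edges itself. The sketch below is only an illustration: the seeded generator is an addition made here so that the run is repeatable, and the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 100, size=100)})

# Passing an integer asks pandas to build that many equal-width bins;
# retbins=True also returns the edges it computed
df['bin'], edges = pd.cut(df['A'], bins=5, retbins=True)

print(edges)
print(df['bin'].value_counts(sort=False))
```

When `bins` is an integer, pandas extends the range slightly so that the minimum value is included, which is why the first edge sits just below the smallest observation.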

## Equal frequency binning

Equal frequency binning, also known as quantile-based or equal-depth binning, is a method of discretization that divides the data into intervals containing approximately the same number of observations. The counts can differ slightly when the number of observations is not divisible by the number of bins or when values are tied. This method can be particularly useful when the data distribution is skewed, because no bin is left nearly empty.

The steps involved in equal frequency binning are as follows:

- Sort the data in ascending order.
- Divide the sorted data into n parts of equal (or nearly equal) size, where n is the desired number of bins.
- Assign each observation to the corresponding bin.
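The steps above can be sketched by hand as follows. This is a minimal illustration with made-up values; using `np.argsort` and `np.array_split` is one possible way to do it, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted sample
values = pd.Series([7, 1, 10, 3, 5, 2, 9, 4, 8, 6])
n_bins = 4

# Step 1: positions of the observations in ascending order of value
order = np.argsort(values.to_numpy(), kind="stable")

# Step 2: cut the sorted positions into n_bins nearly equal chunks
chunks = np.array_split(order, n_bins)

# Step 3: give every observation the index of the chunk it falls into
bin_ids = np.empty(len(values), dtype=int)
for bin_id, positions in enumerate(chunks):
    bin_ids[positions] = bin_id

binned = pd.Series(bin_ids, index=values.index)
print(binned.tolist())  # [2, 0, 3, 0, 1, 0, 3, 1, 2, 1]
```

With 10 observations and 4 bins, the chunks hold 3, 3, 2, and 2 values. A quantile-based function such as `pd.qcut` places its edges at interpolated quantiles, so its bin sizes can differ slightly from this rank-based split.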

In Python, we can perform equal frequency binning using the `qcut()` function from the `pandas` library. This function divides the data into quantiles, or equal frequency intervals, and assigns a label to each observation based on the bin it falls into.

In [2]:
import pandas as pd

# create a dataset
data = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# divide the data into 4 equal frequency bins
data['A_binned'] = pd.qcut(data['A'], q=4, labels=False)
data['B_binned'] = pd.qcut(data['B'], q=4, labels=False)

# print the binned data
print(data)

    A    B  A_binned  B_binned
0   1   10         0         0
1   2   20         0         0
2   3   30         0         0
3   4   40         1         1
4   5   50         1         1
5   6   60         2         2
6   7   70         2         2
7   8   80         3         3
8   9   90         3         3
9  10  100         3         3
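The bins above hold 3, 2, 2, and 3 values rather than exactly the same number each. One way to see why is to inspect the edges `qcut` computed, as in this small sketch (the variable names are illustrative):

```python
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# retbins=True returns the quantile edges alongside the bin assignments
binned, edges = pd.qcut(a, q=4, labels=False, retbins=True)

print(edges.tolist())                               # [1.0, 3.25, 5.5, 7.75, 10.0]
print(binned.value_counts().sort_index().tolist())  # [3, 2, 2, 3]
```

The edges are interpolated quartiles, and 10 observations cannot be divided evenly into 4 bins. If heavily tied data produces duplicate edges, `qcut` raises an error unless `duplicates='drop'` is passed.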


## Entropy-based discretization

Entropy-based discretization is a supervised method: it uses the class labels of the data to choose the thresholds that split a continuous variable into intervals. Each threshold is chosen to maximize the information gain, which is equivalent to minimizing the weighted entropy of the class labels within the resulting intervals.

Entropy is a measure of the impurity of a set of data points with respect to their class labels. In other words, it measures how much uncertainty there is in predicting the class of a data point drawn from the set. The entropy is calculated as:

$Entropy = -\sum_{i=1}^{k} p_i\log_2(p_i)$

where $k$ is the number of classes and $p_i$ is the proportion of data points that belong to class $i$. Classes with $p_i = 0$ contribute nothing to the sum.
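As a quick check of the formula, the following sketch computes the entropy of a list of class labels. The function name `entropy` and the toy labels are illustrative choices.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Only classes that actually occur are counted, so every p is > 0
    return float(-np.sum(p * np.log2(p)))

print(entropy(['a', 'a', 'b', 'b']))            # 1.0
print(round(entropy(['a', 'a', 'a', 'b']), 3))  # 0.811
```

A 50/50 mix of two classes gives the maximum of 1 bit, and the entropy drops toward 0 as one class dominates.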

The information gain of a split is the difference between the entropy of the original data and the weighted average of the entropies of the subsets produced by splitting the data at a particular threshold, where each subset is weighted by its share of the data points. The information gain is used as a criterion to evaluate the quality of a split.
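A minimal sketch of this definition for a single threshold is shown below. The helper names and the toy data are assumptions made for illustration; the split convention (`<=` goes left) is also a choice made here.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(values, labels, threshold):
    """Entropy of all labels minus the size-weighted entropy of the two sides."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    left = labels[values <= threshold]
    right = labels[values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # the threshold did not split anything
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

# Hypothetical toy data: the class changes between 3 and 4
values = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]

print(information_gain(values, labels, threshold=3))            # 1.0
print(round(information_gain(values, labels, threshold=1), 3))  # 0.191
```

The threshold at 3 separates the two classes perfectly and recovers the full 1 bit of entropy, while the threshold at 1 leaves most of the uncertainty in place.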

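There are several algorithms for supervised discretization. Fayyad and Irani's algorithm is entropy-based: it recursively picks the threshold with the highest information gain and uses a minimum description length (MDL) criterion to decide when to stop splitting. ChiMerge is a related method that merges adjacent intervals based on a chi-square test rather than on entropy.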

In [3]:
import pandas as pd
import numpy as np

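def entropy_based_discretization(series, bins):
    """Split `series` in two at the candidate edge that gives the most balanced partition.

    Simplified illustration: the entropy is computed from the share of values
    on each side of the split, not from class labels.
    """
    # Entropy of the two-way partition for each candidate edge (bins[0] is skipped)
    entropy = []
    for i in range(1, len(bins)):
        split = np.where(series < bins[i], 0, 1)
        p0 = np.count_nonzero(split == 0) / len(split)
        p1 = 1 - p0
        e0 = 0 if p0 == 0 else -p0 * np.log2(p0)
        e1 = 0 if p1 == 0 else -p1 * np.log2(p1)
        entropy.append(e0 + e1)

    # entropy[idx] corresponds to bins[idx + 1]; pick the edge with maximum entropy
    idx = np.argmax(entropy)
    split = np.where(series < bins[idx + 1], 0, 1)

    # Return the discretized series (0 = below the chosen edge, 1 = at or above it)
    return pd.Series(split, index=series.index)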


# Create a dummy dataset
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [20, 30, 40, 50, 60, 70, 80, 90, 100, 200]
})

# Discretize column 'B' using entropy-based method
bins = [0, 50, 100, 200]
df['B_discretized'] = entropy_based_discretization(df['B'], bins)

print(df)

    A    B  B_discretized
0   1   20              0
1   2   30              0
2   3   40              0
3   4   50              1
4   5   60              1
5   6   70              1
6   7   80              1
7   8   90              1
8   9  100              1
9  10  200              1


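In this example, we create a dummy dataset and discretize the `'B'` column by calling the `entropy_based_discretization()` function. The function is a simplified illustration: it evaluates each candidate edge in `bins`, computes the entropy of the resulting two-way partition from the proportion of values on each side, and keeps the edge with the highest entropy, which here is 50. Because it does not use class labels, it does not compute the information gain described earlier. The resulting discretized column is added to the original dataframe as `'B_discretized'`.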
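For comparison, a label-aware version might look like the sketch below. It searches for the single threshold with the highest information gain, which corresponds to one splitting step of an entropy-based algorithm without any stopping criterion. The `label` column is hypothetical and not part of the original dataset, and `best_threshold` is an illustrative helper, not a library function.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_threshold(values, labels):
    """Return (threshold, gain) for the single cut with the highest information gain."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    distinct = np.unique(values)
    # Candidate cuts: midpoints between consecutive distinct values
    candidates = (distinct[:-1] + distinct[1:]) / 2
    parent = entropy(labels)
    best = (None, 0.0)
    for t in candidates:
        left, right = labels[values <= t], labels[values > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:
            best = (float(t), gain)
    return best

df = pd.DataFrame({
    'B': [20, 30, 40, 50, 60, 70, 80, 90, 100, 200],
    # Hypothetical class labels, added only for this illustration
    'label': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})

threshold, gain = best_threshold(df['B'], df['label'])
df['B_discretized'] = (df['B'] > threshold).astype(int)

print(threshold)  # 55.0
print(df)
```

With these labels the best cut falls midway between 50 and 60, where the two classes separate cleanly. A full algorithm would apply the same search recursively to each side and stop according to some criterion.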