## Discretization: equal-width vs equal-frequency

If in a hurry, should **equal-width** be the first choice?

> So it seems, according to this test on four popular scikit-learn datasets.

The test judges each method by the accuracy of a special classifier. On two of the datasets (iris and digits), equal-width markedly outperforms equal-frequency. On the other two, the differences are much narrower and can be considered a tie. These observations hold when the number of bins used to discretize the attribute values is varied.

This seems counter-intuitive: equal-frequency should have an advantage because it is less sensitive to outliers.
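To make the outlier argument concrete, here is a minimal NumPy sketch of the two binning schemes. The helper names `equal_width_edges` and `equal_freq_edges` and the toy data are assumptions made for illustration; they are not taken from the deodel code, whose thresholds may be computed differently.

```python
import numpy as np

def equal_width_edges(values, bins):
    """Interior thresholds that split [min, max] into `bins` equal-length intervals."""
    lo, hi = float(np.min(values)), float(np.max(values))
    return np.linspace(lo, hi, bins + 1)[1:-1]

def equal_freq_edges(values, bins):
    """Interior thresholds at evenly spaced quantiles, so bins hold similar counts."""
    quantiles = np.linspace(0.0, 1.0, bins + 1)[1:-1]
    return np.quantile(values, quantiles)

# Toy attribute with a single large outlier (hypothetical data).
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

# np.digitize maps each value to the index of the bin it falls into.
width_bins = np.digitize(data, equal_width_edges(data, 3))
freq_bins = np.digitize(data, equal_freq_edges(data, 3))

print("equal-width:", width_bins)  # [0 0 0 0 0 0 0 0 2]
print("equal-freq: ", freq_bins)   # [0 0 0 1 1 1 2 2 2]
```

With equal-width, the outlier stretches the range so that every ordinary value lands in the first bin. Equal-frequency still separates them into three groups.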

The classifier used, "deodel", discretizes continuous attributes using one of the two methods. After discretization, it behaves like a nearest-neighbor classifier based on Hamming distance.
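The Hamming-distance idea can be sketched in a few lines. This is only an illustration of the general technique on already discretized rows, not deodel's actual implementation. The function name `hamming_nn_predict`, the toy data, and the tie handling (the first closest row wins) are assumptions.

```python
import numpy as np

def hamming_nn_predict(train_X, train_y, query):
    """Return the label of the training row that differs from `query`
    in the fewest attribute positions (Hamming distance)."""
    train_X = np.asarray(train_X)
    # Count mismatching attributes per training row.
    distances = np.sum(train_X != np.asarray(query), axis=1)
    # Assumption: ties are resolved by taking the first closest row.
    return train_y[int(np.argmin(distances))]

# Rows hold bin indices produced by a prior discretization step (hypothetical data).
train_X = [[0, 1, 2], [2, 1, 0], [1, 1, 1]]
train_y = ["a", "b", "c"]

prediction = hamming_nn_predict(train_X, train_y, [2, 0, 0])
print(prediction)  # b
```

Because only equality of bin indices matters, the placement of the bin thresholds directly determines which training rows count as close.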

Equal-width and equal-frequency are unsupervised methods. Supervised methods use the training output to establish the bin thresholds. In the selected output, the decision tree classifier serves as a proxy for such methods. Although the deodel classifier can be seen as a collapsed decision tree, the algorithms differ, so the comparison is not straightforward.
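As a rough illustration of what "supervised" means here, the sketch below picks a single threshold for one attribute by minimizing the weighted Gini impurity of the two resulting groups. This is similar in spirit to what a decision tree does at one node. The helper names and the toy data are assumptions, and scikit-learn's `DecisionTreeClassifier` is considerably more elaborate.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_supervised_threshold(values, labels):
    """Threshold between adjacent sorted values that minimizes weighted Gini impurity."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_score, best_thr = np.inf, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # no threshold fits between equal values
        left, right = labels[:i], labels[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best_score:
            best_score = score
            best_thr = (values[i] + values[i - 1]) / 2.0
    return best_thr

# Hypothetical attribute values and class labels.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

threshold = best_supervised_threshold(values, labels)
print(threshold)  # 7.5
```

For the same data, a two-bin equal-width split would fall at 6.5 and a two-bin equal-frequency split at the median, 4.5. Only the supervised threshold is placed using the labels.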

You can modify the code and test it with other datasets. If you do, please share your findings. The code is available at:

https://github.com/c4pub/deodel


In [1]:
"""
    Compare discretization/binning methods
"""

# >- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
print("*** Get remote files")

import urllib.request  # urlopen lives in the urllib.request submodule
import shutil

remote_list = [
                {'file': 'deodel.py', 'url': "https://raw.githubusercontent.com/c4pub/deodel/main/deodel.py"},
                {'file': 'usap_common.py', 'url': "https://raw.githubusercontent.com/c4pub/deodel/main/usap_common.py"},
                {'file': 'usap_cmp_binning.py', 'url': "https://raw.githubusercontent.com/c4pub/deodel/main/usap_cmp_binning.py"},
            ]

for remote_entry in remote_list:
    file_name = remote_entry['file']
    url = remote_entry['url']
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

# >- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
print("*** Run locally")

import deodel
import usap_common
import usap_cmp_binning

# >- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
print("*** End")


*** Get remote files
*** Run locally

- - - - - - - - - 

- - - - - - - - - 
- - - - - - - - - 
- - - - average accuracy test

- - - - iter_no: 50
- - - - random_seed: 42


- - - - dataset: .. _iris_d

avg accuracy        classifier 
--------------------------------------------------------------------------------
0.9436000000000001  DecisionTreeClassifier() 
0.5444000000000001  DeodataDelangaClassifier({'split_no': 2, 'split_mode': 'eq_freq'}) 
0.7199999999999999  DeodataDelangaClassifier({'split_no': 2, 'split_mode': 'eq_width'}) 
0.7880000000000001  DeodataDelangaClassifier({'split_no': 3, 'split_mode': 'eq_freq'}) 
0.9440000000000002  DeodataDelangaClassifier({'split_no': 3, 'split_mode': 'eq_width'}) 
0.7799999999999998  DeodataDelangaClassifier({'split_no': 5, 'split_mode': 'eq_freq'}) 
0.9308000000000001  DeodataDelangaClassifier({'split_no': 5, 'split_mode': 'eq_width'}) 
0.8556              DeodataDelangaClassifier({'split_no': 10, 'split_mode': 'eq_freq'}) 
0.936              