# Discretization of pre-processed data using Decision Tree discretization
## Dataset: Blood Transfusion
By: Sam
Update: 23/02/2023
Replicated using Malina's script

### About Dataset
Attribute Information:
For each attribute, the variable name, the measurement unit and a brief description are given. The "Blood Transfusion Service Center" dataset is a classification problem.
The order of this listing corresponds to the order of the columns in the database.
- R (Recency - months since last donation),
- F (Frequency - total number of donations),
- M (Monetary - total blood donated in c.c.),
- T (Time - months since first donation),

LABEL: a binary variable representing whether the donor gave blood in March 2007
- 1 stands for donating blood
- 0 stands for not donating blood

# 1. Preparing data

In [1]:
# Import library
import pandas as pd
import numpy as np
from collections import Counter  # used below to count the entries in each bin

In [2]:
# Read clean dataset for discretization
data0 = pd.read_csv('clean_tranfusion.csv')
#tranfusion dataset
tranfusion = data0

In [3]:
tranfusion

     recency  frequency  monetary  time  label
0          2         50     12500    98      1
1          0         13      3250    28      1
2          1         16      4000    35      1
3          2         20      5000    45      1
4          1         24      6000    77      0
..       ...        ...       ...   ...    ...
743       23          2       500    38      0
744       21          2       500    52      0
745       23          3       750    62      0
746       39          1       250    39      0


In [4]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in column 'label'.
tranfusion['label'] = label_encoder.fit_transform(tranfusion['label'])
  
tranfusion['label'].unique()

array([1, 0])
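
Because the `label` column is already coded as 0/1, the encoding step should leave it unchanged. The following is a minimal sketch on made-up labels (the array `labels` is an assumption for illustration, not the notebook's data) showing that `LabelEncoder` maps the sorted distinct values to consecutive integers, so 0 stays 0 and 1 stays 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels that are already coded as 0/1, like the 'label' column above.
labels = np.array([1, 1, 0, 1, 0])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)  # sorted distinct labels seen during fit
print(encoded)           # integer code assigned to each label
```

The step only changes the data when the labels are strings or non-consecutive integers.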

In [5]:
# List of continuous features to discretize
num_list = tranfusion.columns.to_list()


In [6]:
num_list.remove('label')

In [7]:
y_list = pd.DataFrame(tranfusion['label'])

In [8]:
num_list

['recency', 'frequency', 'monetary', 'time']

In [9]:
num_list
y_list

     label
0        1
1        1
2        1
3        1
4        0
..     ...
743      0
744      0
745      0
746      0


# 2. Decision Tree discretization

In [12]:
# !pip install feature_engine

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import DecisionTreeDiscretiser

In [14]:
# Load dataset
data = tranfusion
data

     recency  frequency  monetary  time  label
0          2         50     12500    98      1
1          0         13      3250    28      1
2          1         16      4000    35      1
3          2         20      5000    45      1
4          1         24      6000    77      0
..       ...        ...       ...   ...    ...
743       23          2       500    38      0
744       21          2       500    52      0
745       23          3       750    62      0
746       39          1       250    39      0


In [15]:
# Separate into train and test sets
X_train, X_test, y_train, y_test =  train_test_split(
            data,
            data['label'], test_size=0.3, random_state=0)

# DT scripts

In [16]:
# Load data
data = tranfusion
# Separate into train and test sets (X keeps the 'label' column, so it is
# carried through unchanged to the discretized output)
X_train, X_test, y_train, y_test =  train_test_split(
            data,
            data['label'], test_size=0.3, random_state=0)

print("X_train :", X_train.shape)
print("X_test :", X_test.shape)

X_train : (523, 5)
X_test : (225, 5)
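
The shapes printed above follow from `test_size=0.3` applied to 748 rows. As a rough check, the sketch below repeats the split on a synthetic frame with the same number of rows and columns (the frame `toy` and its random values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset: 748 rows, 5 columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 10, size=(748, 5)),
    columns=['recency', 'frequency', 'monetary', 'time', 'label'],
)
toy['label'] = rng.integers(0, 2, size=748)

# Same call pattern as in the notebook: the whole frame is passed as X.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy['label'], test_size=0.3, random_state=0)

print(X_tr.shape, X_te.shape)
```

Since the whole frame is passed as X, the `label` column stays inside both halves, which is why it reappears in the discretized output later on.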


## 2.1 DT with small max_depth
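
Each subsection relies on the fact that a binary tree of depth d has at most 2^d leaves, and therefore at most 2^d intervals per variable. The sketch below checks that bound with scikit-learn's `DecisionTreeClassifier` on random single-feature data (the arrays `x` and `y` are synthetic assumptions, not the transfusion data).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic single-feature data with random binary labels.
rng = np.random.default_rng(29)
x = rng.uniform(0, 100, size=(500, 1))
y = rng.integers(0, 2, size=500)

leaf_counts = {}
for depth in [2, 3, 4, 5]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=29).fit(x, y)
    leaf_counts[depth] = tree.get_n_leaves()
    # Number of leaves actually grown versus the 2^depth ceiling.
    print(depth, leaf_counts[depth], 2 ** depth)
```

The ceiling is only an upper bound: a tree stops splitting a node early when the node is pure or no split improves the criterion.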

In [17]:
# Build the DT discretizer
# 'max_depth': [2] => at most 2^2 = 4 intervals per variable.
import time
start = time.time() # For measuring time execution
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [2]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# transform the data
train_t= treeDisc.transform(X_train)
test_t= treeDisc.transform(X_test)

# Recombine the transformed train and test sets
disc = pd.concat([train_t, test_t], axis=0)
print(disc)
#categorical = categorical.drop('label', axis=1)

print('DT discreizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # Total time execution for this sample

      recency  frequency  monetary      time  label
242  0.363636   0.148000  0.148000  0.231608      1
533  0.363636   0.224490  0.224490  0.231608      0
315  0.363636   0.148000  0.148000  0.231608      0
12   0.363636   0.301802  0.301802  0.231608      1
161  0.363636   0.148000  0.148000  0.231608      0
..        ...        ...       ...       ...    ...
267  0.104348   0.301802  0.301802  0.405405      0
362  0.363636   0.148000  0.148000  0.125000      0
501  0.104348   0.301802  0.301802  0.231608      1
310  0.104348   0.301802  0.301802  0.231608      0
200  0.363636   0.301802  0.301802  0.125000      0

[748 rows x 5 columns]
DT discreizer binner dict:
{'recency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [2]}, scoring='accuracy'), 'frequency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [2]}, scoring='accuracy'), 'monetary': GridSearchCV(cv=3,
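
The fractional values in the transformed columns are consistent with each observation being replaced by its tree leaf's predicted probability of class 1. That reading is an assumption about the discretiser's output, not something the notebook states. The sketch below reproduces the idea with a plain scikit-learn tree on made-up data (`x` and `y` are hypothetical) and checks that every bin value equals the share of positive labels among the training rows in that bin.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical single-feature training data.
x = pd.DataFrame({'recency': [0, 1, 2, 3, 10, 11, 12, 13]})
y = np.array([1, 1, 0, 1, 0, 0, 0, 1])

# One shallow tree per variable; depth 1 gives at most two bins.
tree = DecisionTreeClassifier(max_depth=1, random_state=29).fit(x, y)

# Replace each value by the predicted probability of class 1 in its leaf.
binned = tree.predict_proba(x)[:, 1]

for value in np.unique(binned):
    in_bin = binned == value
    # Share of positives among the training rows that fall into this bin.
    print(value, in_bin.sum(), y[in_bin].mean())
```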

In [18]:
# Show the number of bins for each variable
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # Show the number of entries in each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')

No of bins: recency
4
Entries per interval for recency
Counter({0.36363636363636365: 357, 0.10434782608695652: 327, 0.02702702702702703: 59, 0.0: 5})
 
No of bins: frequency
4
Entries per interval for frequency
Counter({0.148: 357, 0.30180180180180183: 327, 0.22448979591836735: 62, 1.0: 2})
 
No of bins: monetary
4
Entries per interval for monetary
Counter({0.148: 357, 0.30180180180180183: 327, 0.22448979591836735: 62, 1.0: 2})
 
No of bins: time
4
Entries per interval for time
Counter({0.23160762942779292: 516, 0.125: 166, 0.40540540540540543: 56, 0.42857142857142855: 10})
 
No of bins: label
2
Entries per interval for label
Counter({0: 570, 1: 178})
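
The counts for `frequency` and `monetary` are identical. One plausible explanation is that `monetary` is a fixed multiple of `frequency`, in which case a tree would split both variables at equivalent points. The sketch below checks this on the nine rows displayed in the table near the top of the notebook only; whether the relation holds for every row is an assumption that would have to be checked on the full file.

```python
import pandas as pd

# The nine rows displayed in the dataset preview (not the full dataset).
rows = pd.DataFrame({
    'frequency': [50, 13, 16, 20, 24, 2, 2, 3, 1],
    'monetary':  [12500, 3250, 4000, 5000, 6000, 500, 500, 750, 250],
})

# Does monetary equal 250 c.c. per donation in every displayed row?
same_scale = (rows['monetary'] == 250 * rows['frequency']).all()
print(same_scale)
```

If the relation holds throughout, the two columns carry the same information and one of them is redundant for the discretization.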
 


In [19]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode each column's distinct values as integers 0..k-1 (ascending order)
result = encoder.fit_transform(disc)
# Rebuild a DataFrame with the original column names
disc_ord = pd.DataFrame(result, columns=tranfusion.columns).astype(int)
print(disc_ord)
# Check for missing values (not displayed, since it is not the last expression)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_small_discretized_tranfusion.csv', index=False)

      recency  frequency  monetary      time  label
242  0.363636   0.148000  0.148000  0.231608      1
533  0.363636   0.224490  0.224490  0.231608      0
315  0.363636   0.148000  0.148000  0.231608      0
12   0.363636   0.301802  0.301802  0.231608      1
161  0.363636   0.148000  0.148000  0.231608      0
..        ...        ...       ...       ...    ...
267  0.104348   0.301802  0.301802  0.405405      0
362  0.363636   0.148000  0.148000  0.125000      0
501  0.104348   0.301802  0.301802  0.231608      1
310  0.104348   0.301802  0.301802  0.231608      0
200  0.363636   0.301802  0.301802  0.125000      0

[748 rows x 5 columns]
     recency  frequency  monetary  time  label
0          3          0         0     1      1
1          3          1         1     1      0
2          3          0         0     1      0
3          3          2         2     1      1
4          3          0         0     1      0
..       ...        ...       ...   ...    ...
743        2          2

In [20]:
disc_ord.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   recency    748 non-null    int64
 1   frequency  748 non-null    int64
 2   monetary   748 non-null    int64
 3   time       748 non-null    int64
 4   label      748 non-null    int64
dtypes: int64(5)
memory usage: 29.3 KB
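
`OrdinalEncoder` assigns integer codes by sorting each column's distinct values in ascending order. The codes in `disc_ord` therefore rank the bins by their discretized value, not by their position along the original variable's axis. A minimal sketch, using four values close to those reported for `recency` at `max_depth` 2 (the frame `toy` is an illustration, not the notebook's data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Bin values similar to those printed for 'recency' with max_depth=2.
toy = pd.DataFrame({'recency': [0.363636, 0.104348, 0.027027, 0.0, 0.363636]})

enc = OrdinalEncoder()
codes = enc.fit_transform(toy).astype(int).ravel()

print(enc.categories_[0])  # distinct values in ascending order
print(codes)               # code = rank of the value among the distinct values
```

This matches the encoded output above, where the rows with `recency` 0.363636 (the largest of the four values) receive code 3.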


## 2.2 DT with medium max_depth

In [21]:
# Build the DT discretizer
# 'max_depth': [3] => at most 2^3 = 8 intervals per variable.
import time
start = time.time() # For measuring time execution
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [3]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# transform the data
train_t= treeDisc.transform(X_train)
test_t= treeDisc.transform(X_test)

# Recombine the transformed train and test sets
disc = pd.concat([train_t, test_t], axis=0)
print(disc)
#categorical = categorical.drop('label', axis=1)

# Show the fitted estimator for each variable
print('DT discreizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # Total time execution for this sample

      recency  frequency  monetary      time  label
242  0.367347   0.163121  0.163121  0.241279      1
533  0.367347   0.224490  0.224490  0.241279      0
315  0.367347   0.163121  0.163121  0.241279      0
12   0.367347   0.289340  0.289340  0.241279      1
161  0.367347   0.128440  0.128440  0.241279      0
..        ...        ...       ...       ...    ...
267  0.100437   0.289340  0.289340  0.388889      0
362  0.367347   0.163121  0.163121  0.132075      0
501  0.100437   0.400000  0.400000  0.241279      1
310  0.100437   0.289340  0.289340  0.241279      0
200  0.367347   0.289340  0.289340  0.132075      0

[748 rows x 5 columns]
DT discreizer binner dict:
{'recency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [3]}, scoring='accuracy'), 'frequency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [3]}, scoring='accuracy'), 'monetary': GridSearchCV(cv=3,

In [22]:
# Show the number of bins for each variable
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # Show the number of entries in each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')

No of bins: recency
6
Entries per interval for recency
Counter({0.3673469387755102: 348, 0.10043668122270742: 326, 0.0: 55, 0.2: 9, 0.25: 9, 1.0: 1})
 
No of bins: frequency
6
Entries per interval for frequency
Counter({0.2893401015228426: 290, 0.16312056737588654: 199, 0.12844036697247707: 158, 0.22448979591836735: 62, 0.4: 37, 1.0: 2})
 
No of bins: monetary
6
Entries per interval for monetary
Counter({0.2893401015228426: 290, 0.16312056737588654: 199, 0.12844036697247707: 158, 0.22448979591836735: 62, 0.4: 37, 1.0: 2})
 
No of bins: time
7
Entries per interval for time
Counter({0.24127906976744187: 483, 0.1320754716981132: 160, 0.3888888888888889: 54, 0.08695652173913043: 33, 0.42857142857142855: 10, 0.0: 6, 1.0: 2})
 
No of bins: label
2
Entries per interval for label
Counter({0: 570, 1: 178})
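
The bin counts above stay below the ceiling of 2^3 = 8. Besides the tree stopping early, there is a second possible reason: bins are counted here with `nunique()` on the discretized values, so two leaves that happen to share the same value are counted once. The sketch below illustrates that effect with a plain scikit-learn tree on toy data (`x` and `y` are assumptions chosen so that all leaves are pure).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Alternating labels force one leaf per point once the tree is deep enough.
x = np.array([[0], [1], [2], [3]])
y = np.array([0, 1, 0, 1])

tree = DecisionTreeClassifier(max_depth=3, random_state=29).fit(x, y)

n_leaves = tree.get_n_leaves()
# Distinct predicted probabilities of class 1 across the leaves.
n_values = len(np.unique(tree.predict_proba(x)[:, 1]))

print(n_leaves, n_values)
```

Here the tree has more leaves than distinct output values, so counting distinct values can understate the number of intervals.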
 


In [23]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode each column's distinct values as integers 0..k-1 (ascending order)
result = encoder.fit_transform(disc)
# Rebuild a DataFrame with the original column names
disc_ord = pd.DataFrame(result, columns=tranfusion.columns).astype(int)
print(disc_ord)
# Check for missing values (not displayed, since it is not the last expression)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_medium_discretized_tranfusion.csv', index=False)

      recency  frequency  monetary      time  label
242  0.367347   0.163121  0.163121  0.241279      1
533  0.367347   0.224490  0.224490  0.241279      0
315  0.367347   0.163121  0.163121  0.241279      0
12   0.367347   0.289340  0.289340  0.241279      1
161  0.367347   0.128440  0.128440  0.241279      0
..        ...        ...       ...       ...    ...
267  0.100437   0.289340  0.289340  0.388889      0
362  0.367347   0.163121  0.163121  0.132075      0
501  0.100437   0.400000  0.400000  0.241279      1
310  0.100437   0.289340  0.289340  0.241279      0
200  0.367347   0.289340  0.289340  0.132075      0

[748 rows x 5 columns]
     recency  frequency  monetary  time  label
0          4          1         1     3      1
1          4          2         2     3      0
2          4          1         1     3      0
3          4          3         3     3      1
4          4          0         0     3      0
..       ...        ...       ...   ...    ...
743        1          3

## 2.3 DT with large max_depth

In [24]:
# Build the DT discretizer
# 'max_depth': [4] => at most 2^4 = 16 intervals per variable.
import time
start = time.time() # For measuring time execution
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [4]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# transform the data
train_t= treeDisc.transform(X_train)
test_t= treeDisc.transform(X_test)

# Recombine the transformed train and test sets
disc = pd.concat([train_t, test_t], axis=0)
print(disc)
#categorical = categorical.drop('label', axis=1)

# Show the fitted estimator for each variable
print('DT discreizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # Total time execution for this sample

      recency  frequency  monetary      time  label
242  0.368201   0.156250  0.156250  0.229391      1
533  0.368201   0.224490  0.224490  0.229391      0
315  0.368201   0.168831  0.168831  0.229391      0
12   0.368201   0.312102  0.312102  0.229391      1
161  0.368201   0.128440  0.128440  0.292308      0
..        ...        ...       ...       ...    ...
267  0.107692   0.312102  0.312102  0.363636      0
362  0.368201   0.156250  0.156250  0.126214      0
501  0.107692   0.142857  0.142857  0.229391      1
310  0.107692   0.312102  0.312102  0.229391      0
200  0.368201   0.312102  0.312102  0.126214      0

[748 rows x 5 columns]
DT discreizer binner dict:
{'recency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [4]}, scoring='accuracy'), 'frequency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [4]}, scoring='accuracy'), 'monetary': GridSearchCV(cv=3,
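
The printed `binner_dict_` shows only the estimator settings, not where the intervals start and end. With a fitted scikit-learn tree, the cut points can be read from the `tree_` attribute. The sketch below does this on toy data (`x`, `y` and the resulting cut points are illustrative assumptions); for the objects printed above, the fitted tree would presumably be reached through each `GridSearchCV`'s `best_estimator_`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy single-feature data with one positive region in the middle.
x = np.array([[0], [1], [2], [3], [10], [11]])
y = np.array([0, 0, 1, 1, 0, 0])

clf = DecisionTreeClassifier(max_depth=2, random_state=29).fit(x, y)

# Internal nodes have a feature index >= 0; leaves are marked with -2.
internal = clf.tree_.feature >= 0
cuts = np.sort(clf.tree_.threshold[internal])

print(cuts)  # interval boundaries on the original scale
```

Sorting the thresholds gives the interval edges, which makes it possible to report each bin as a range of the original variable.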

In [25]:
# Show the number of bins for each variable
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # Show the number of entries in each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')

No of bins: recency
7
Entries per interval for recency
Counter({0.3682008368200837: 342, 0.1076923076923077: 282, 0.0: 62, 0.058823529411764705: 44, 0.25: 9, 0.3333333333333333: 6, 1.0: 3})
 
No of bins: frequency
9
Entries per interval for frequency
Counter({0.31210191082802546: 226, 0.12844036697247707: 158, 0.16883116883116883: 112, 0.15625: 87, 0.2: 64, 0.22448979591836735: 62, 0.5: 24, 0.14285714285714285: 13, 1.0: 2})
 
No of bins: monetary
9
Entries per interval for monetary
Counter({0.31210191082802546: 226, 0.12844036697247707: 158, 0.16883116883116883: 112, 0.15625: 87, 0.2: 64, 0.22448979591836735: 62, 0.5: 24, 0.14285714285714285: 13, 1.0: 2})
 
No of bins: time
10
Entries per interval for time
Counter({0.22939068100358423: 389, 0.1262135922330097: 155, 0.2923076923076923: 94, 0.36363636363636365: 50, 0.11764705882352941: 26, 0.0: 13, 0.42857142857142855: 10, 0.3333333333333333: 5, 0.6666666666666666: 4, 1.0: 2})
 
No of bins: label
2
Entries per interval for label
Counter(

In [26]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode each column's distinct values as integers 0..k-1 (ascending order)
result = encoder.fit_transform(disc)
# Rebuild a DataFrame with the original column names
disc_ord = pd.DataFrame(result, columns=tranfusion.columns).astype(int)
print(disc_ord)
# Check for missing values (not displayed, since it is not the last expression)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_large_discretized_tranfusion.csv', index=False)

      recency  frequency  monetary      time  label
242  0.368201   0.156250  0.156250  0.229391      1
533  0.368201   0.224490  0.224490  0.229391      0
315  0.368201   0.168831  0.168831  0.229391      0
12   0.368201   0.312102  0.312102  0.229391      1
161  0.368201   0.128440  0.128440  0.292308      0
..        ...        ...       ...       ...    ...
267  0.107692   0.312102  0.312102  0.363636      0
362  0.368201   0.156250  0.156250  0.126214      0
501  0.107692   0.142857  0.142857  0.229391      1
310  0.107692   0.312102  0.312102  0.229391      0
200  0.368201   0.312102  0.312102  0.126214      0

[748 rows x 5 columns]
     recency  frequency  monetary  time  label
0          5          2         2     3      1
1          5          5         5     3      0
2          5          3         3     3      0
3          5          6         6     3      1
4          5          0         0     4      0
..       ...        ...       ...   ...    ...
743        2          6

## 2.4 DT with extra large max_depth

In [27]:
# Build the DT discretizer
# 'max_depth': [5] => at most 2^5 = 32 intervals per variable.
import time
start = time.time() # For measuring time execution
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [5]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# transform the data
train_t= treeDisc.transform(X_train)
test_t= treeDisc.transform(X_test)

# Recombine the transformed train and test sets
disc = pd.concat([train_t, test_t], axis=0)
print(disc)
#categorical = categorical.drop('label', axis=1)

# Show the fitted estimator for each variable
print('DT discreizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # Total time execution for this sample

      recency  frequency  monetary      time  label
242  0.371901   0.156250  0.156250  0.266667      1
533  0.364407   0.224490  0.224490  0.201258      0
315  0.371901   0.168831  0.168831  0.266667      0
12   0.364407   0.833333  0.833333  0.266667      1
161  0.364407   0.128440  0.128440  0.227273      0
..        ...        ...       ...       ...    ...
267  0.098958   0.291391  0.291391  0.392857      0
362  0.371901   0.156250  0.156250  0.137931      0
501  0.098958   0.000000  0.000000  0.266667      1
310  0.098958   0.291391  0.291391  0.266667      0
200  0.371901   0.291391  0.291391  0.062500      0

[748 rows x 5 columns]
DT discreizer binner dict:
{'recency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [5]}, scoring='accuracy'), 'frequency': GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=29),
             param_grid={'max_depth': [5]}, scoring='accuracy'), 'monetary': GridSearchCV(cv=3,

In [28]:
# Show the number of bins for each variable
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # Show the number of entries in each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')

No of bins: recency
9
Entries per interval for recency
Counter({0.09895833333333333: 278, 0.3644067796610169: 173, 0.371900826446281: 169, 0.0: 62, 0.058823529411764705: 44, 0.25: 9, 0.3333333333333333: 6, 0.6666666666666666: 4, 1.0: 3})
 
No of bins: frequency
12
Entries per interval for frequency
Counter({0.2913907284768212: 212, 0.12844036697247707: 158, 0.16883116883116883: 112, 0.15625: 87, 0.22448979591836735: 62, 0.16: 36, 0.26666666666666666: 28, 0.35714285714285715: 20, 0.8333333333333334: 14, 0.0: 7, 1.0: 6, 0.3333333333333333: 6})
 
No of bins: monetary
12
Entries per interval for monetary
Counter({0.2913907284768212: 212, 0.12844036697247707: 158, 0.16883116883116883: 112, 0.15625: 87, 0.22448979591836735: 62, 0.16: 36, 0.26666666666666666: 28, 0.35714285714285715: 20, 0.8333333333333334: 14, 0.0: 7, 1.0: 6, 0.3333333333333333: 6})
 
No of bins: time
15
Entries per interval for time
Counter({0.20125786163522014: 222, 0.26666666666666666: 167, 0.13793103448275862: 131, 0.325

In [29]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode each column's distinct values as integers 0..k-1 (ascending order)
result = encoder.fit_transform(disc)
# Rebuild a DataFrame with the original column names
disc_ord = pd.DataFrame(result, columns=tranfusion.columns).astype(int)
print(disc_ord)
# Check for missing values (not displayed, since it is not the last expression)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_verylarge_discretized_tranfusion.csv', index=False)

      recency  frequency  monetary      time  label
242  0.371901   0.156250  0.156250  0.266667      1
533  0.364407   0.224490  0.224490  0.201258      0
315  0.371901   0.168831  0.168831  0.266667      0
12   0.364407   0.833333  0.833333  0.266667      1
161  0.364407   0.128440  0.128440  0.227273      0
..        ...        ...       ...       ...    ...
267  0.098958   0.291391  0.291391  0.392857      0
362  0.371901   0.156250  0.156250  0.137931      0
501  0.098958   0.000000  0.000000  0.266667      1
310  0.098958   0.291391  0.291391  0.266667      0
200  0.371901   0.291391  0.291391  0.062500      0

[748 rows x 5 columns]
     recency  frequency  monetary  time  label
0          6          2         2     8      1
1          5          5         5     6      0
2          6          4         4     8      0
3          5         10        10     8      1
4          5          1         1     7      0
..       ...        ...       ...   ...    ...
743        2          7