# Discretization of pre-processed data using Decision Tree discretization
## Dataset: adult
By: Sam
Update: 15/03/2023

### About Dataset
Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
- 48842 instances, mix of continuous and discrete (train=32561, test=16281)
- 45222 instances if those with unknown values are removed (train=30162, test=15060)
- Duplicate or conflicting instances: 6
- Class probabilities for the adult.all file:
  - Probability for the label '>50K': 23.93% / 24.78% (without unknowns)
  - Probability for the label '<=50K': 76.07% / 75.22% (without unknowns)
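
The instance counts listed above can be cross-checked with a few lines of arithmetic. This is only a minimal sketch: the variable names are illustrative and the numbers are copied from the list, not recomputed from the data.

```python
# Instance counts copied from the dataset description above.
total, train, test = 48842, 32561, 16281
total_known, train_known, test_known = 45222, 30162, 15060

# The train and test parts should add up to the full dataset.
assert train + test == total
assert train_known + test_known == total_known

# Share of instances in the training part (expected to be close to 2/3).
train_share = train / total
print(f"train share: {train_share:.4f}")

# Number of instances that contain at least one unknown value.
with_unknowns = total - total_known
print(f"instances with unknown values: {with_unknowns}")
```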

Attributes:
*Continuous attributes*
- age: continuous.
- fnlwgt: continuous.
- education-num: continuous.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.

*Categorical attributes*
- workclass
- education
- marital-status
- occupation
- relationship
- race
- sex
- native-country

# 1. Preparing data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from collections import Counter # used below to count the entries per bin

In [2]:
# Read clean dataset for discretization
data0 = pd.read_csv('clean_adult.csv')
#adult dataset
adult = data0

In [3]:
adult

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [4]:
adult['class'].unique()


array([' <=50K', ' >50K'], dtype=object)
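
The labels in the output above carry a leading space (`' <=50K'`), and the same applies to the other string columns. If that is not wanted, it could be removed either after loading or while parsing. The sketch below uses a tiny in-memory CSV as a stand-in for `clean_adult.csv`; the sample rows are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the CSV file; note the space after each comma.
raw = "age,class\n39, <=50K\n52, >50K\n"

# Default parsing keeps the leading space inside the string values.
as_is = pd.read_csv(io.StringIO(raw))
print(as_is["class"].unique())

# Option 1: strip the whitespace after loading.
stripped = as_is["class"].str.strip()

# Option 2: let the parser skip spaces that follow a delimiter.
skipped = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(skipped["class"].unique())
```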

In [5]:
# Import label encoder
from sklearn import preprocessing
  
# LabelEncoder maps each distinct label to an integer code.
label_encoder = preprocessing.LabelEncoder()
  
# Encode the labels in column 'class'.
adult['class'] = label_encoder.fit_transform(adult['class'])
  
adult['class'].unique()

array([0, 1])
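
To see which label received which code, the fitted encoder can be inspected. A minimal sketch on three hand-written labels (not the full column); the `mapping` dictionary is an illustrative helper, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Same two labels as in the output above, including the leading space.
labels = [" <=50K", " >50K", " <=50K"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted; the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes.
restored = encoder.inverse_transform(codes)
```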

In [6]:
# List of continuous features to discretize
num_list = ['age', 'fnlwgt', 'education-num', 'capital-gain', 
            'capital-loss', 'hours-per-week']

In [7]:
y_list = pd.DataFrame(adult['class'])

In [8]:
num_list

['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [9]:
# Only the last expression of a cell is displayed, so show y_list on its own
y_list

Unnamed: 0,class
0,0
1,0
2,0
3,0
4,0
...,...
48837,0
48838,0
48839,0
48840,0


# 2. Decision Tree discretization

In [10]:
# !pip install feature_engine

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import DecisionTreeDiscretiser

In [13]:
# Load dataset
data = adult
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,0
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,0
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,0
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,0


In [14]:
# Separate into train and test sets
X_train, X_test, y_train, y_test =  train_test_split(
            data,
            data['class'], test_size=0.3, random_state=0)

# DT scripts

In [15]:
# Load data
data = adult
# Separate into train and test sets. The full frame (including 'class')
# is passed as X so that the discretized output keeps every column.
X_train, X_test, y_train, y_test = train_test_split(
            data,
            data['class'], test_size=0.3, random_state=0)

print("X_train :", X_train.shape)
print("X_test :", X_test.shape)

X_train : (34189, 15)
X_test : (14653, 15)
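
The row counts above follow from `test_size=0.3` applied to 48842 rows. The sketch below reproduces them under the assumption that the test part is rounded up and the remainder goes to training, which matches the printed shapes. The 15 columns are the 14 attributes plus `class`, with `Unnamed: 0` counted among them.

```python
import math

n_samples = 48842   # rows in the dataset, from the output above
test_size = 0.3     # same fraction as in the split above

# Assumption: the test part is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```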


## 2.1 DT with small max_depth

In [16]:
# Build the decision tree discretizer
# 'max_depth': [2] => at most 2^2 = 4 intervals per variable
import time
start = time.time() # start of the timed section
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [2]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# Transform the data
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)

# Recombine the transformed train and test sets (train rows first)
disc = pd.concat([train_t, test_t], axis=0)
print(disc)

print('DT discretizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # elapsed seconds for fit, transform and printing

            age          workclass    fnlwgt      education  education-num  \
3833   0.345230            Private  0.232358   Some-college       0.182177   
34743  0.345230        Federal-gov  0.215873        HS-grad       0.182177   
2022   0.236174            Private  0.256482   Some-college       0.182177   
1580   0.024036            Private  0.232358        HS-grad       0.182177   
4612   0.345230            Private  0.256482    Prof-school       0.621669   
...         ...                ...       ...            ...            ...   
48826  0.345230          Local-gov  0.256482        Masters       0.621669   
44230  0.024036   Self-emp-not-inc  0.215873           11th       0.056861   
27824  0.345230            Private  0.232358        HS-grad       0.182177   
13582  0.024036            Private  0.256482        HS-grad       0.182177   
14557  0.024036            Private  0.232358   Some-college       0.182177   

            marital-status        occupation     relationship  
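
The transformed values printed above (for example 0.345230 for `age`) look like class probabilities, which is consistent with each continuous variable being replaced by the prediction of a small tree fitted on that variable alone. The following is a minimal sketch of that idea using scikit-learn directly on synthetic data. It is not feature_engine's implementation, and the data-generating rule is an assumption made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the adult dataset):
# the positive class becomes more likely as the feature grows.
rng = np.random.default_rng(0)
x = rng.uniform(17, 90, size=2000).reshape(-1, 1)
y = (rng.uniform(size=2000) < (x.ravel() - 17) / 100).astype(int)

# A depth-2 tree on a single feature has at most 2**2 = 4 leaves.
tree = DecisionTreeClassifier(max_depth=2, random_state=29)
tree.fit(x, y)

# Each leaf predicts one probability for class 1; that value labels the bin.
binned = tree.predict_proba(x)[:, 1]
print(np.unique(binned))
```

Every row that falls into the same leaf gets the same value, so the number of distinct values in the transformed column is bounded by the number of leaves.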

In [17]:
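# Show the number of distinct values (bins) and the entries per bin for each column.
# Categorical columns are included too; for them the count is the number of categories.
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # count how many rows fall into each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')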

No of bins: age
4
Entries per interval for age
Counter({0.3452304048234281: 26496, 0.02403592181722134: 10780, 0.23617448439496258: 7831, 0.13466042154566746: 3735})
 
No of bins: workclass
9
Entries per interval for workclass
Counter({' Private': 33906, ' Self-emp-not-inc': 3862, ' Local-gov': 3136, ' ?': 2799, ' State-gov': 1981, ' Self-emp-inc': 1695, ' Federal-gov': 1432, ' Without-pay': 21, ' Never-worked': 10})
 
No of bins: fnlwgt
4
Entries per interval for fnlwgt
Counter({0.2564816071670965: 20943, 0.23235812477969686: 20361, 0.21587301587301588: 7209, 0.30869565217391304: 329})
 
No of bins: education
16
Entries per interval for education
Counter({' HS-grad': 15784, ' Some-college': 10878, ' Bachelors': 8025, ' Masters': 2657, ' Assoc-voc': 2061, ' 11th': 1812, ' Assoc-acdm': 1601, ' 10th': 1389, ' 7th-8th': 955, ' Prof-school': 834, ' 9th': 756, ' 12th': 657, ' Doctorate': 594, ' 5th-6th': 509, ' 1st-4th': 247, ' Preschool': 83})
 
No of bins: education-num
4
Entries per inte
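
The loop above also reports the untouched categorical columns. One possible variation, sketched here on a small hypothetical frame (`disc_demo` and `discretised_cols` are illustrative names), restricts the report to the discretized columns and uses `value_counts` so the bins come out sorted by value.

```python
import pandas as pd

# Hypothetical stand-in for `disc`: one discretised and one categorical column.
disc_demo = pd.DataFrame({
    "age": [0.345, 0.024, 0.345, 0.236, 0.345],
    "workclass": ["Private", "Private", "Local-gov", "Private", "State-gov"],
})
discretised_cols = ["age"]  # in the notebook this would be num_list

for col in discretised_cols:
    # value_counts gives rows per bin; sort_index orders the bins by value.
    counts = disc_demo[col].value_counts().sort_index()
    print(f"{col}: {len(counts)} bins")
    print(counts)
```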

In [18]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode every column as integer codes; the result gets a fresh RangeIndex
result = pd.DataFrame(encoder.fit_transform(disc))
disc_ord = result.astype(int)
disc_ord.columns = disc.columns # restore the column names
print(disc_ord)
# Count missing values (displayed only when this is the last line of the cell)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_small_discretized_adult.csv',index=False)

            age          workclass    fnlwgt      education  education-num  \
3833   0.345230            Private  0.232358   Some-college       0.182177   
34743  0.345230        Federal-gov  0.215873        HS-grad       0.182177   
2022   0.236174            Private  0.256482   Some-college       0.182177   
1580   0.024036            Private  0.232358        HS-grad       0.182177   
4612   0.345230            Private  0.256482    Prof-school       0.621669   
...         ...                ...       ...            ...            ...   
48826  0.345230          Local-gov  0.256482        Masters       0.621669   
44230  0.024036   Self-emp-not-inc  0.215873           11th       0.056861   
27824  0.345230            Private  0.232358        HS-grad       0.182177   
13582  0.024036            Private  0.256482        HS-grad       0.182177   
14557  0.024036            Private  0.232358   Some-college       0.182177   

            marital-status        occupation     relationship  
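
Note that `disc` keeps the shuffled index labels from the train/test split (3833, 34743, ...), while the encoded frame is rebuilt with a new `RangeIndex`, so the exported rows are in train-then-test order rather than the order of the source file. If the original order matters, sorting by index before encoding is one option. A minimal sketch on two hypothetical four-row parts:

```python
import pandas as pd

# Hypothetical train and test parts whose index labels are shuffled,
# as they are after train_test_split.
train_part = pd.DataFrame({"age": [0.35, 0.02]}, index=[3, 0])
test_part = pd.DataFrame({"age": [0.24, 0.35]}, index=[2, 1])

combined = pd.concat([train_part, test_part], axis=0)
print(list(combined.index))

# sort_index puts the rows back in the order of the source file.
restored = combined.sort_index()
print(list(restored.index))
```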

In [19]:
disc_ord.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             48842 non-null  int32
 1   workclass       48842 non-null  int32
 2   fnlwgt          48842 non-null  int32
 3   education       48842 non-null  int32
 4   education-num   48842 non-null  int32
 5   marital-status  48842 non-null  int32
 6   occupation      48842 non-null  int32
 7   relationship    48842 non-null  int32
 8   race            48842 non-null  int32
 9   sex             48842 non-null  int32
 10  capital-gain    48842 non-null  int32
 11  capital-loss    48842 non-null  int32
 12  hours-per-week  48842 non-null  int32
 13  native-country  48842 non-null  int32
 14  class           48842 non-null  int32
dtypes: int32(15)
memory usage: 2.8 MB
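
The integer codes produced by `OrdinalEncoder` are the ranks of the sorted distinct values in each column. For a discretized column this means bins are numbered by their predicted probability, not by where the interval lies on the original scale. A small sketch using the four `age` bin values from the `max_depth=2` run, rounded to six decimals:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Bin values for age at max_depth=2, copied (rounded) from the output above.
age_bins = np.array([[0.345230], [0.024036], [0.236174], [0.134660]])

encoder = OrdinalEncoder()
codes = encoder.fit_transform(age_bins).astype(int).ravel()

# categories_ holds the sorted unique values; the code is the rank.
print(encoder.categories_[0])
print(codes)
```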


## 2.2 DT with medium max_depth

In [20]:
# Build the decision tree discretizer
# 'max_depth': [3] => at most 2^3 = 8 intervals per variable
import time
start = time.time() # start of the timed section
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [3]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# Transform the data
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)

# Recombine the transformed train and test sets (train rows first)
disc = pd.concat([train_t, test_t], axis=0)
print(disc)

# Show the fitted binner for each variable
print('DT discretizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # elapsed seconds for fit, transform and printing

            age          workclass    fnlwgt      education  education-num  \
3833   0.361314            Private  0.234367   Some-college       0.170719   
34743  0.361314        Federal-gov  0.217610        HS-grad       0.170719   
2022   0.249727            Private  0.256330   Some-college       0.170719   
1580   0.057177            Private  0.234367        HS-grad       0.170719   
4612   0.361314            Private  0.256330    Prof-school       0.736630   
...         ...                ...       ...            ...            ...   
48826  0.361314          Local-gov  0.256330        Masters       0.560451   
44230  0.007691   Self-emp-not-inc  0.217610           11th       0.058477   
27824  0.361314            Private  0.234367        HS-grad       0.170719   
13582  0.057177            Private  0.256330        HS-grad       0.170719   
14557  0.007691            Private  0.234367   Some-college       0.170719   

            marital-status        occupation     relationship  
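
The same start/stop timing code is repeated in every run. A small context manager could wrap it instead. This is a sketch only: `timed` is a hypothetical helper, and it uses `time.perf_counter`, which is intended for measuring elapsed intervals, in place of `time.time`.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.3f} s")

# Usage with a placeholder workload instead of the discretiser.
with timed("placeholder workload"):
    total = sum(i * i for i in range(100_000))
```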

In [21]:
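# Show the number of distinct values (bins) and the entries per bin for each column.
# Categorical columns are included too; for them the count is the number of categories.
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # count how many rows fall into each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')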

No of bins: age
8
Entries per interval for age
Counter({0.3613143135820532: 23340, 0.007690790771051075: 7226, 0.24972677595628415: 5228, 0.057177129148340666: 3554, 0.22375690607734808: 3156, 0.20890599230346343: 2603, 0.1537122969837587: 2503, 0.0954653937947494: 1232})
 
No of bins: workclass
9
Entries per interval for workclass
Counter({' Private': 33906, ' Self-emp-not-inc': 3862, ' Local-gov': 3136, ' ?': 2799, ' State-gov': 1981, ' Self-emp-inc': 1695, ' Federal-gov': 1432, ' Without-pay': 21, ' Never-worked': 10})
 
No of bins: fnlwgt
8
Entries per interval for fnlwgt
Counter({0.2563301880388297: 20940, 0.234366944096634: 19484, 0.21761031634092282: 7098, 0.1875: 877, 0.29464285714285715: 322, 0.1038961038961039: 111, 0.8333333333333334: 7, 1.0: 3})
 
No of bins: education
16
Entries per interval for education
Counter({' HS-grad': 15784, ' Some-college': 10878, ' Bachelors': 8025, ' Masters': 2657, ' Assoc-voc': 2061, ' 11th': 1812, ' Assoc-acdm': 1601, ' 10th': 1389, ' 7th-8th

In [22]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode every column as integer codes; the result gets a fresh RangeIndex
result = pd.DataFrame(encoder.fit_transform(disc))
disc_ord = result.astype(int)
disc_ord.columns = disc.columns # restore the column names
print(disc_ord)
# Count missing values (displayed only when this is the last line of the cell)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_medium_discretized_adult.csv',index=False)

            age          workclass    fnlwgt      education  education-num  \
3833   0.361314            Private  0.234367   Some-college       0.170719   
34743  0.361314        Federal-gov  0.217610        HS-grad       0.170719   
2022   0.249727            Private  0.256330   Some-college       0.170719   
1580   0.057177            Private  0.234367        HS-grad       0.170719   
4612   0.361314            Private  0.256330    Prof-school       0.736630   
...         ...                ...       ...            ...            ...   
48826  0.361314          Local-gov  0.256330        Masters       0.560451   
44230  0.007691   Self-emp-not-inc  0.217610           11th       0.058477   
27824  0.361314            Private  0.234367        HS-grad       0.170719   
13582  0.057177            Private  0.256330        HS-grad       0.170719   
14557  0.007691            Private  0.234367   Some-college       0.170719   

            marital-status        occupation     relationship  

## 2.3 DT with large max_depth

In [23]:
# Build the decision tree discretizer
# 'max_depth': [4] => at most 2^4 = 16 intervals per variable
import time
start = time.time() # start of the timed section
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [4]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# Transform the data
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)

# Recombine the transformed train and test sets (train rows first)
disc = pd.concat([train_t, test_t], axis=0)
print(disc)

# Show the fitted binner for each variable
print('DT discretizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # elapsed seconds for fit, transform and printing

            age          workclass    fnlwgt      education  education-num  \
3833   0.339460            Private  0.234041   Some-college       0.190202   
34743  0.339460        Federal-gov  0.216926        HS-grad       0.157246   
2022   0.236872            Private  0.255982   Some-college       0.190202   
1580   0.065947            Private  0.234041        HS-grad       0.157246   
4612   0.339460            Private  0.255982    Prof-school       0.740103   
...         ...                ...       ...            ...            ...   
48826  0.377003          Local-gov  0.255982        Masters       0.560451   
44230  0.002726   Self-emp-not-inc  0.216926           11th       0.052725   
27824  0.339460            Private  0.234041        HS-grad       0.157246   
13582  0.065947            Private  0.255982        HS-grad       0.157246   
14557  0.016959            Private  0.234041   Some-college       0.190202   

            marital-status        occupation     relationship  
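
Each run in this notebook passes a single value in `param_grid`, so the 3-fold cross-validation has nothing to choose between. Passing several depths at once would let cross-validation pick one per variable. The sketch below illustrates that selection step with scikit-learn's `GridSearchCV` on synthetic single-feature data. The data are an assumption, and the discretiser's own internals may differ.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic single-feature data (an assumption, not the adult dataset).
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=1500).reshape(-1, 1)
y = (rng.uniform(size=1500) < x.ravel() / 120).astype(int)

# Search several depths at once with 3-fold CV and accuracy scoring.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=29),
    param_grid={"max_depth": [2, 3, 4, 5]},
    cv=3,
    scoring="accuracy",
)
search.fit(x, y)
print(search.best_params_)
```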

In [24]:
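# Show the number of distinct values (bins) and the entries per bin for each column.
# Categorical columns are included too; for them the count is the number of categories.
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # count how many rows fall into each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')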

No of bins: age
15
Entries per interval for age
Counter({0.3770028275212064: 13551, 0.33946024799416485: 9789, 0.0027256208358570565: 4719, 0.2620320855614973: 2640, 0.23687150837988827: 2588, 0.016958733747880157: 2507, 0.24053724053724054: 2406, 0.06594724220623502: 2348, 0.21202185792349726: 1325, 0.13485714285714287: 1280, 0.20575221238938052: 1278, 0.0954653937947494: 1232, 0.17314487632508835: 1223, 0.03961584633853541: 1206, 0.17228464419475656: 750})
 
No of bins: workclass
9
Entries per interval for workclass
Counter({' Private': 33906, ' Self-emp-not-inc': 3862, ' Local-gov': 3136, ' ?': 2799, ' State-gov': 1981, ' Self-emp-inc': 1695, ' Federal-gov': 1432, ' Without-pay': 21, ' Never-worked': 10})
 
No of bins: fnlwgt
14
Entries per interval for fnlwgt
Counter({0.2559815116911365: 20915, 0.23404098481497862: 19465, 0.2169258735608968: 7080, 0.2273972602739726: 532, 0.12757201646090535: 345, 0.3160621761658031: 277, 0.16666666666666666: 62, 0.02857142857142857: 49, 0.16129032

In [25]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode every column as integer codes; the result gets a fresh RangeIndex
result = pd.DataFrame(encoder.fit_transform(disc))
disc_ord = result.astype(int)
disc_ord.columns = disc.columns # restore the column names
print(disc_ord)
# Count missing values (displayed only when this is the last line of the cell)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_large_discretized_adult.csv',index=False)

            age          workclass    fnlwgt      education  education-num  \
3833   0.339460            Private  0.234041   Some-college       0.190202   
34743  0.339460        Federal-gov  0.216926        HS-grad       0.157246   
2022   0.236872            Private  0.255982   Some-college       0.190202   
1580   0.065947            Private  0.234041        HS-grad       0.157246   
4612   0.339460            Private  0.255982    Prof-school       0.740103   
...         ...                ...       ...            ...            ...   
48826  0.377003          Local-gov  0.255982        Masters       0.560451   
44230  0.002726   Self-emp-not-inc  0.216926           11th       0.052725   
27824  0.339460            Private  0.234041        HS-grad       0.157246   
13582  0.065947            Private  0.255982        HS-grad       0.157246   
14557  0.016959            Private  0.234041   Some-college       0.190202   

            marital-status        occupation     relationship  

## 2.4 DT with extra large max_depth

In [26]:
# Build the decision tree discretizer
# 'max_depth': [5] => at most 2^5 = 32 intervals per variable
import time
start = time.time() # start of the timed section
treeDisc = DecisionTreeDiscretiser(cv=3,
                                   scoring='accuracy',
                                   variables=num_list,
                                   regression=False,
                                   param_grid={'max_depth': [5]},
                                   random_state=29,
                                   )

treeDisc.fit(X_train, y_train)

# Transform the data
train_t = treeDisc.transform(X_train)
test_t = treeDisc.transform(X_test)

# Recombine the transformed train and test sets (train rows first)
disc = pd.concat([train_t, test_t], axis=0)
print(disc)

# Show the fitted binner for each variable
print('DT discretizer binner dict:')
print(treeDisc.binner_dict_)
print(' ')
print('Computation time: ')
end = time.time()
print(end - start) # elapsed seconds for fit, transform and printing

            age          workclass    fnlwgt      education  education-num  \
3833   0.345307            Private  0.228316   Some-college       0.190202   
34743  0.345307        Federal-gov  0.217629        HS-grad       0.157246   
2022   0.245154            Private  0.253143   Some-college       0.190202   
1580   0.068427            Private  0.228316        HS-grad       0.157246   
4612   0.345307            Private  0.253143    Prof-school       0.740103   
...         ...                ...       ...            ...            ...   
48826  0.392074          Local-gov  0.253143        Masters       0.560451   
44230  0.001190   Self-emp-not-inc  0.217629           11th       0.046191   
27824  0.345307            Private  0.228316        HS-grad       0.157246   
13582  0.068427            Private  0.253143        HS-grad       0.157246   
14557  0.014303            Private  0.252932   Some-college       0.190202   

            marital-status        occupation     relationship  

In [27]:
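# Show the number of distinct values (bins) and the entries per bin for each column.
# Categorical columns are included too; for them the count is the number of categories.
for i in disc:
    print('No of bins: ' + i)
    print(disc[i].nunique())
    # count how many rows fall into each bin
    print('Entries per interval for ' + i)
    print(Counter(disc[i]))
    print(' ')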

No of bins: age
24
Entries per interval for age
Counter({0.39207432138191317: 9838, 0.3453066757031515: 8441, 0.3379699248120301: 3713, 0.0011895321173671688: 3623, 0.25268817204301075: 1353, 0.30325288562434416: 1348, 0.2531779661016949: 1337, 0.2289156626506024: 1335, 0.01935483870967742: 1329, 0.21202185792349726: 1325, 0.2710583153347732: 1303, 0.13485714285714287: 1280, 0.20575221238938052: 1278, 0.2451539338654504: 1253, 0.0954653937947494: 1232, 0.17314487632508835: 1223, 0.03961584633853541: 1206, 0.06347305389221557: 1195, 0.014302741358760428: 1178, 0.06842737094837935: 1153, 0.007692307692307693: 1096, 0.2245762711864407: 1053, 0.1646586345381526: 695, 0.2777777777777778: 55})
 
No of bins: workclass
9
Entries per interval for workclass
Counter({' Private': 33906, ' Self-emp-not-inc': 3862, ' Local-gov': 3136, ' ?': 2799, ' State-gov': 1981, ' Self-emp-inc': 1695, ' Federal-gov': 1432, ' Without-pay': 21, ' Never-worked': 10})
 
No of bins: fnlwgt
19
Entries per interval for
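
Across the four runs, the number of bins actually produced stays at or below the 2^max_depth limit stated in the code comments, and falls short of it at depths 4 and 5. The sketch below collects the `age` and `fnlwgt` counts reported in the outputs of sections 2.1 to 2.4 and checks them against that limit. The `observed` dictionary is just a hand-copied summary.

```python
# Observed number of distinct bin values, copied from the outputs above.
observed = {
    2: {"age": 4, "fnlwgt": 4},
    3: {"age": 8, "fnlwgt": 8},
    4: {"age": 15, "fnlwgt": 14},
    5: {"age": 24, "fnlwgt": 19},
}

for depth, bins in observed.items():
    limit = 2 ** depth  # a binary tree of this depth has at most this many leaves
    for name, count in bins.items():
        assert count <= limit
        print(f"max_depth={depth} {name}: {count} of at most {limit} bins")
```

A count below the limit is expected whenever a branch stops splitting before reaching the maximum depth, or when two leaves happen to predict the same value.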

In [28]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
print(disc)
# Define the ordinal encoder
encoder = OrdinalEncoder()
# Encode every column as integer codes; the result gets a fresh RangeIndex
result = pd.DataFrame(encoder.fit_transform(disc))
disc_ord = result.astype(int)
disc_ord.columns = disc.columns # restore the column names
print(disc_ord)
# Count missing values (displayed only when this is the last line of the cell)
disc_ord.isna().sum()
# Export the discretized dataset
disc_ord.to_csv('DT_verylarge_discretized_adult.csv',index=False)

            age          workclass    fnlwgt      education  education-num  \
3833   0.345307            Private  0.228316   Some-college       0.190202   
34743  0.345307        Federal-gov  0.217629        HS-grad       0.157246   
2022   0.245154            Private  0.253143   Some-college       0.190202   
1580   0.068427            Private  0.228316        HS-grad       0.157246   
4612   0.345307            Private  0.253143    Prof-school       0.740103   
...         ...                ...       ...            ...            ...   
48826  0.392074          Local-gov  0.253143        Masters       0.560451   
44230  0.001190   Self-emp-not-inc  0.217629           11th       0.046191   
27824  0.345307            Private  0.228316        HS-grad       0.157246   
13582  0.068427            Private  0.253143        HS-grad       0.157246   
14557  0.014303            Private  0.252932   Some-college       0.190202   

            marital-status        occupation     relationship  