# Discretization of pre-processed data
## Dataset: adult

By: Sam <br>
Update: 15/03/2023 <br>
Implementing ChiMerge discretization using the library https://pypi.org/project/scorecardbundle/


### About Dataset
Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
- 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
- 45222 if instances with unknown values are removed (train=30162, test=15060)
- Duplicate or conflicting instances : 6
- Class probabilities for the adult.all file:
    - Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
    - Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)

ATTRIBUTE:
*Continuous attributes*
- age: continuous.
- fnlwgt: continuous.
- education-num: continuous.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.

*Categorical attributes*
- workclass
- education
- marital-status
- occupation
- relationship
- race
- sex
- native-country

# 1. Preparing data

In [1]:
# Import library
import pandas as pd
import numpy as np
from collections import Counter #for Chi Merge

In [2]:
# Read clean dataset for discretization
data0 = pd.read_csv('clean_adult.csv')
# adult dataset
adult = data0

In [3]:
adult

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [4]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in column 'class'.
adult['class']= label_encoder.fit_transform(adult['class'])
  
adult['class'].unique()

array([0, 1])

In [5]:
# List of continuous features to discretize
num_list = ['age', 'fnlwgt', 'education-num', 'capital-gain', 
            'capital-loss', 'hours-per-week']

In [6]:
y_list = pd.DataFrame(adult['class'])

In [7]:
y_list

Unnamed: 0,class
0,0
1,0
2,0
3,0
4,0
...,...
48837,0
48838,0
48839,0
48840,0


In [8]:
num_list

['age',
 'fnlwgt',
 'education-num',
 'capital-gain',
 'capital-loss',
 'hours-per-week']

In [9]:
adult[num_list]

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
0,39,77516,13,2174,0,40
1,50,83311,13,0,0,13
2,38,215646,9,0,0,40
3,53,234721,7,0,0,40
4,28,338409,13,0,0,40
...,...,...,...,...,...,...
48837,39,215419,13,0,0,36
48838,64,321403,9,0,0,40
48839,38,374983,13,0,0,50
48840,44,83891,13,5455,0,40


# 2. ChiMerge discretization implementation
Reference:
- https://scorecard-bundle.bubu.blue/English/2.usage.html#feature-discretization-chimerge
- Parameters
    - m: integer, optional(default=2)
    The number of adjacent intervals to compare during the chi-squared test.

    - confidence_level: float, optional(default=0.9)
    The confidence level that sets the chi-squared threshold above which adjacent intervals are considered different.

    - max_intervals: int, optional(default=None)
    The maximum number of intervals the discretized array will have. Fewer intervals are sometimes preferred (for example, when training a scorecard model). Set it to None if this option is not needed.

    - min_intervals: int, optional(default=2)
    The minimum number of intervals the discretized array will have. Set it to 2 if this option is not needed.

    - initial_intervals: int, optional(default=100)
    The original ChiMerge algorithm starts by putting each unique value in its own interval and then merges intervals in a loop, which can be time-consuming when the sample size is large. Setting initial_intervals to a value other than None (such as 10 or 100) makes the algorithm start from that number of intervals, generated using quantiles, which can greatly shorten the run time. Set it to None if this option is not needed.

    - delimiter: string, optional(default='\~')
    The returned array is an array of intervals. Each interval is represented by a string (e.g. '1~2') of the form lower+delimiter+upper. This parameter controls the symbol that connects the lower and upper boundaries.

    - decimal: int, optional(default=None)
    The number of decimal places kept for the boundaries.

    - output_dataframe: boolean, optional(default=False)
    Whether to output a pd.DataFrame instead of an np.array.
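
To make `m` and `confidence_level` more concrete, here is a minimal sketch of the test that ChiMerge applies to a pair of adjacent intervals (m = 2) in a two-class problem. The function name `chi2_adjacent` and the toy counts are assumptions made for illustration. This is the textbook Pearson chi-square statistic, not the library's internal code, which may handle edge cases such as empty cells differently.

```python
# Pearson chi-square statistic for two adjacent intervals (illustrative sketch).
def chi2_adjacent(counts_a, counts_b):
    """counts_a, counts_b: per-class sample counts of two adjacent intervals."""
    total = sum(counts_a) + sum(counts_b)
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    stat = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:  # skip empty cells to avoid division by zero
                stat += (observed - expected) ** 2 / expected
    return stat

# Critical value of the chi-square distribution with 1 degree of freedom
# (2 intervals x 2 classes) at confidence_level = 0.95.
THRESHOLD_95 = 3.841

similar = chi2_adjacent([30, 10], [28, 12])    # similar class mix
different = chi2_adjacent([35, 5], [10, 30])   # very different class mix

print(similar < THRESHOLD_95)     # True: candidates for merging
print(different > THRESHOLD_95)   # True: kept as separate intervals
```

A pair whose statistic falls below the threshold has class distributions that are not significantly different, so the pair is a candidate for merging. A higher `confidence_level` raises the threshold and therefore tends to produce fewer, wider intervals.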

In [12]:
# !pip install --upgrade scorecardbundle

In [13]:
from scorecardbundle.feature_discretization import ChiMerge as cm
from scorecardbundle.feature_discretization import FeatureIntervalAdjustment as fia
from scorecardbundle.feature_encoding import WOE as woe
from scorecardbundle.feature_selection import FeatureSelection as fs
from scorecardbundle.model_training import LogisticRegressionScoreCard as lrsc
from scorecardbundle.model_evaluation import ModelEvaluation as me
from scorecardbundle.model_interpretation import ScorecardExplainer as mise

## 2.1 SC_ChiMerge with 6 intervals
Scorecard ChiMerge discretization with 6 intervals

In [14]:
# ScorecardBundle - ChiMerge with max 6 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = adult['class']
trans_cm_6 = cm.ChiMerge(max_intervals=6, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_6 = trans_cm_6.fit_transform(adult[num_list], y) 
trans_cm_6.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_6 = pd.DataFrame({
    'feature':num_list
})
feature_doc_6['num_intervals'] = feature_doc_6['feature'].map(result_cm_6.nunique().to_dict())
feature_doc_6['min_interval_size'] = [fia.feature_stat(result_cm_6[col].values,y.values)['sample_size'].min() for col in feature_doc_6['feature']]
print('Summary discretization result')
print(feature_doc_6)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_6.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_6 = result_cm_6.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_6.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
          feature  num_intervals  min_interval_size
0             age              6               3156
1          fnlwgt              5               1953
2   education-num              6               1428
3    capital-gain              6                617
4    capital-loss              6                 31
5  hours-per-week              6               1672
Before encoding
         age                      fnlwgt education-num capital-gain  \
0  36.0~61.0               -inf~95894.52     12.0~13.0   0.0~2829.0   
1  36.0~61.0               -inf~95894.52     12.0~13.0     -inf~0.0   
2  36.0~61.0  214242.0~319612.0099999999      8.0~10.0     -inf~0.0   
3  36.0~61.0  214242.0~319612.0099999999      -inf~8.0     -inf~0.0   
4  27.0~33.0  319612.0099999999~397877.0     12.0~13.0     -inf~0.0   

  capital-loss hours-per-week  
0  -inf~1876.0      39.0~41.0  
1  -inf~1876.0      -inf~34.0  
2  -inf~1876.0      39.0~41.0  
3  -inf~1876.0      39.0~41.0  
4  
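
One point to keep in mind about the encoding step in the cell above: `LabelEncoder` sorts the interval strings lexicographically, so the integer codes do not necessarily follow the numeric order of the intervals. If an ordinal encoding is wanted, a possible alternative is to sort the intervals by their lower bound. The sketch below assumes the default `'~'` delimiter; the helper `ordinal_codes` and the toy interval list are hypothetical and are not part of scorecardbundle.

```python
def ordinal_codes(intervals, delimiter='~'):
    """Map interval strings to integer codes ordered by their lower bound."""
    def lower_bound(interval):
        # 'lower~upper' -> float(lower); float('-inf') is parsed natively
        return float(interval.split(delimiter)[0])
    ordered = sorted(set(intervals), key=lower_bound)
    return {interval: code for code, interval in enumerate(ordered)}

# Toy interval labels (assumption, for illustration only)
toy = ['12.0~13.0', '-inf~8.0', '8.0~10.0', '10.0~12.0', '13.0~inf']

lexicographic = sorted(toy)   # the order a plain string sort produces
codes = ordinal_codes(toy)

print(lexicographic.index('8.0~10.0'))  # 4: placed last by the string sort
print(codes['8.0~10.0'])                # 1: second interval by lower bound
```

For a discretized column, the mapping could be applied before the `LabelEncoder` step with something like `result_cm_6[col].map(ordinal_codes(result_cm_6[col]))`.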

In [15]:
# Export data
tmp = pd.concat([result_cm_6,y_list], axis=1)
tmp
# Export the discretized dataset (encoded features + class)
# Convert to a CSV file
tmp.to_csv('sc_cm_adult_6int.csv',index=False)

## 2.2 Chi merge with 8 intervals

In [16]:
# ScorecardBundle - ChiMerge with max 8 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = adult['class']
trans_cm_8 = cm.ChiMerge(max_intervals=8, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_8 = trans_cm_8.fit_transform(adult[num_list], y) 
trans_cm_8.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_8 = pd.DataFrame({
    'feature':num_list
})
feature_doc_8['num_intervals'] = feature_doc_8['feature'].map(result_cm_8.nunique().to_dict())
feature_doc_8['min_interval_size'] = [fia.feature_stat(result_cm_8[col].values,y.values)['sample_size'].min() for col in feature_doc_8['feature']]
print('Summary discretization result')
print(feature_doc_8)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_8.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_8 = result_cm_8.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_8.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
          feature  num_intervals  min_interval_size
0             age              8               2503
1          fnlwgt              8                489
2   education-num              8                330
3    capital-gain              7                391
4    capital-loss              8                 31
5  hours-per-week              8                326
Before encoding
         age                      fnlwgt education-num capital-gain  \
0  36.0~61.0               -inf~95894.52     12.0~13.0   0.0~2829.0   
1  36.0~61.0               -inf~95894.52     12.0~13.0     -inf~0.0   
2  36.0~61.0  214242.0~319612.0099999999       8.0~9.0     -inf~0.0   
3  36.0~61.0  214242.0~319612.0099999999       2.0~8.0     -inf~0.0   
4  27.0~29.0  319612.0099999999~397877.0     12.0~13.0     -inf~0.0   

  capital-loss hours-per-week  
0  -inf~1564.0      39.0~41.0  
1  -inf~1564.0      -inf~34.0  
2  -inf~1564.0      39.0~41.0  
3  -inf~1564.0      39.0~41.0  
4  

In [17]:
# Export data
tmp = pd.concat([result_cm_8,y_list], axis=1)
tmp
# Export the discretized dataset (encoded features + class)
# Convert to a CSV file
tmp.to_csv('sc_cm_adult_8int.csv',index=False)

## 2.3 Chi merge with 10 intervals

In [18]:
# ScorecardBundle - ChiMerge with max 10 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = adult['class']
trans_cm_10 = cm.ChiMerge(max_intervals=10, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_10 = trans_cm_10.fit_transform(adult[num_list], y) 
trans_cm_10.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_10 = pd.DataFrame({
    'feature':num_list
})
feature_doc_10['num_intervals'] = feature_doc_10['feature'].map(result_cm_10.nunique().to_dict())
feature_doc_10['min_interval_size'] = [fia.feature_stat(result_cm_10[col].values,y.values)['sample_size'].min() for col in feature_doc_10['feature']]
print('Summary discretization result')
print(feature_doc_10)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_10.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_10 = result_cm_10.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_10.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
          feature  num_intervals  min_interval_size
0             age             10               2503
1          fnlwgt             10                489
2   education-num              8                330
3    capital-gain              7                391
4    capital-loss             10                 31
5  hours-per-week             10                326
Before encoding
         age                      fnlwgt education-num capital-gain  \
0  36.0~41.0               -inf~95894.52     12.0~13.0   0.0~2829.0   
1  41.0~54.0               -inf~95894.52     12.0~13.0     -inf~0.0   
2  36.0~41.0  214242.0~319612.0099999999       8.0~9.0     -inf~0.0   
3  41.0~54.0  214242.0~319612.0099999999       2.0~8.0     -inf~0.0   
4  27.0~29.0  319612.0099999999~397877.0     12.0~13.0     -inf~0.0   

  capital-loss hours-per-week  
0  -inf~1539.0      39.0~41.0  
1  -inf~1539.0      10.0~34.0  
2  -inf~1539.0      39.0~41.0  
3  -inf~1539.0      39.0~41.0  
4  

In [19]:
# Export data
tmp = pd.concat([result_cm_10,y_list], axis=1)
tmp
# Export the discretized dataset (encoded features + class)
# Convert to a CSV file
tmp.to_csv('sc_cm_adult_10int.csv',index=False)

## 2.4 Chi merge with 15 intervals

In [20]:
# ScorecardBundle - ChiMerge with max 15 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = adult['class']
trans_cm_15 = cm.ChiMerge(max_intervals=15, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_15 = trans_cm_15.fit_transform(adult[num_list], y) 
trans_cm_15.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_15 = pd.DataFrame({
    'feature':num_list
})
feature_doc_15['num_intervals'] = feature_doc_15['feature'].map(result_cm_15.nunique().to_dict())
feature_doc_15['min_interval_size'] = [fia.feature_stat(result_cm_15[col].values,y.values)['sample_size'].min() for col in feature_doc_15['feature']]
print('Summary discretization result')
print(feature_doc_15)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_15.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_15 = result_cm_15.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_15.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
          feature  num_intervals  min_interval_size
0             age             15                899
1          fnlwgt             14                488
2   education-num              8                330
3    capital-gain              7                391
4    capital-loss             15                 30
5  hours-per-week             15                 80
Before encoding
         age                      fnlwgt education-num capital-gain  \
0  36.0~41.0            65738.2~95894.52     12.0~13.0   0.0~2829.0   
1  41.0~54.0            65738.2~95894.52     12.0~13.0     -inf~0.0   
2  36.0~41.0  214242.0~319612.0099999999       8.0~9.0     -inf~0.0   
3  41.0~54.0  214242.0~319612.0099999999       2.0~8.0     -inf~0.0   
4  27.0~29.0  319612.0099999999~397877.0     12.0~13.0     -inf~0.0   

  capital-loss hours-per-week  
0   -inf~155.0      39.0~41.0  
1   -inf~155.0      10.0~23.0  
2   -inf~155.0      39.0~41.0  
3   -inf~155.0      39.0~41.0  
4  

In [21]:
# Export data
tmp = pd.concat([result_cm_15,y_list], axis=1)
tmp
# Export the discretized dataset (encoded features + class)
# Convert to a CSV file
tmp.to_csv('sc_cm_adult_15int.csv',index=False)