# Discretization of pre-processed data
## Dataset: pageblock

By: Sam <br>
Update: 23/02/2023 <br>
Implementing ChiMerge discretization using the library https://pypi.org/project/scorecardbundle/


### About Dataset
Number of Instances: 5473.

Number of Attributes: 10 numeric attributes
   - height: Height of the block.
   - length: Length of the block.
   - area: Area of the block (height * length);
   - eccen: Eccentricity of the block (length / height);
   - p_black: Percentage of black pixels within the block (blackpix / area);
   - p_and: Percentage of black pixels after the application of the Run Length Smoothing Algorithm (RLSA) (blackand / area);
   - mean_tr: Mean number of white-black transitions (blackpix / wb_trans);
   - blackpix: Total number of black pixels in the original bitmap of the block (column 'blacpix' in the CSV).
   - blackand: Total number of black pixels in the bitmap of the block after the RLSA.
   - wb_trans: Number of white-black transitions in the original bitmap of the block.

Missing Attribute Values: No missing values.

# 1. Preparing data

In [1]:
# Import library
import pandas as pd
import numpy as np
from collections import Counter #for Chi Merge

In [2]:
# Read the cleaned dataset to be discretized
data0 = pd.read_csv('clean_pageblock.csv')
# pageblock dataset (an alias of data0, not a copy)
pageblock = data0

In [3]:
pageblock

Unnamed: 0,height,length,area,eccen,p_black,p_and,mean_tr,blacpix,blackand,wb_trans,class
0,5,7,35,1.400,0.400,0.657,2.33,14,23,6,1
1,6,7,42,1.167,0.429,0.881,3.60,18,37,5,1
2,6,18,108,3.000,0.287,0.741,4.43,31,80,7,1
3,5,7,35,1.400,0.371,0.743,4.33,13,26,3,1
4,6,3,18,0.500,0.500,0.944,2.25,9,17,4,1
...,...,...,...,...,...,...,...,...,...,...,...
5468,4,524,2096,131.000,0.542,0.603,40.57,1136,1264,28,2
5469,7,4,28,0.571,0.714,0.929,10.00,20,26,2,1
5470,6,95,570,15.833,0.300,0.911,1.64,171,519,104,1
5471,7,41,287,5.857,0.213,0.801,1.36,61,230,45,1


In [4]:
# Import label encoder
from sklearn import preprocessing
  
# LabelEncoder maps each distinct class label to an integer code.
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in column 'class'.
pageblock['class'] = label_encoder.fit_transform(pageblock['class'])
  
pageblock['class'].unique()

array([0, 1, 3, 4, 2])
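
To interpret the encoded class column in the exported files, it helps to keep the mapping from original labels to codes. LabelEncoder assigns codes in the sorted order of the distinct labels, so the sketch below reproduces that rule in plain Python. The helper name `fit_label_codes` and the sample labels are illustrative assumptions, not values read from the dataset.

```python
def fit_label_codes(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

# Illustrative labels only (assumed integer classes 1 to 5)
raw = [1, 1, 2, 4, 5, 3, 1]
codes = fit_label_codes(raw)
encoded = [codes[value] for value in raw]

print(codes)    # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}
print(encoded)  # [0, 0, 1, 3, 4, 2, 0]
```

With the fitted encoder from the cell above, the same information is available from `label_encoder.classes_`, which holds the original labels in code order.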

In [5]:
# List of continuous features to discretize
num_list = pageblock.columns.to_list()
num_list.remove('class')

In [6]:
y_list = pd.DataFrame(pageblock['class'])

In [7]:
y_list

Unnamed: 0,class
0,0
1,0
2,0
3,0
4,0
...,...
5468,1
5469,0
5470,0
5471,0


In [8]:
num_list

['height',
 'length',
 'area',
 'eccen',
 'p_black',
 'p_and',
 'mean_tr',
 'blacpix',
 'blackand',
 'wb_trans']

In [9]:
pageblock[num_list]

Unnamed: 0,height,length,area,eccen,p_black,p_and,mean_tr,blacpix,blackand,wb_trans
0,5,7,35,1.400,0.400,0.657,2.33,14,23,6
1,6,7,42,1.167,0.429,0.881,3.60,18,37,5
2,6,18,108,3.000,0.287,0.741,4.43,31,80,7
3,5,7,35,1.400,0.371,0.743,4.33,13,26,3
4,6,3,18,0.500,0.500,0.944,2.25,9,17,4
...,...,...,...,...,...,...,...,...,...,...
5468,4,524,2096,131.000,0.542,0.603,40.57,1136,1264,28
5469,7,4,28,0.571,0.714,0.929,10.00,20,26,2
5470,6,95,570,15.833,0.300,0.911,1.64,171,519,104
5471,7,41,287,5.857,0.213,0.801,1.36,61,230,45


# 2. ChiMerge discretization implementation
Reference:
- https://scorecard-bundle.bubu.blue/English/2.usage.html#feature-discretization-chimerge
- Parameters
    - m: integer, optional(default=2)
    The number of adjacent intervals to compare during the chi-squared test.

    - confidence_level: float, optional(default=0.9)
    The confidence level that sets the chi-squared threshold above which adjacent intervals are considered different.

    - max_intervals: int, optional(default=None)
    The maximum number of intervals the discretized array will have. Fewer intervals are sometimes preferred (for example, when training a scorecard model). Set it to None if this option is not needed.

    - min_intervals: int, optional(default=2)
    The minimum number of intervals the discretized array will have. Set it to 2 if this option is not needed.

    - initial_intervals: int, optional(default=100)
    The original ChiMerge algorithm starts by putting each unique value in its own interval and then merges intervals in a loop, which can be time-consuming when the sample size is large. Setting initial_intervals to a value other than None (such as 10 or 100) makes the algorithm start from that number of intervals, generated using quantiles. This can greatly shorten the run time. Set it to None if this option is not needed.
 
    - delimiter: string, optional(default='\~')
    The returned array is an array of intervals. Each interval is represented by a string (e.g. '1~2') of the form lower+delimiter+upper. This parameter controls the symbol that connects the lower and upper boundaries.

    - decimal: int,  optional(default=None) 
    Controls the number of decimals of the boundaries. Default is None.

    - output_dataframe: boolean, optional(default=False) 
    Whether to output a pd.DataFrame (True) or an np.array (False).
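
For intuition about what the parameters above control, here is a minimal pure-Python sketch of the textbook ChiMerge loop with m=2. It is not the scorecardbundle implementation, which may differ in details such as the quantile-based initial intervals. The function names `chi2_adjacent` and `chimerge` and the toy data are assumptions made for illustration. The `threshold` argument plays the role of the chi-squared critical value implied by `confidence_level`; for two classes (1 degree of freedom) and a confidence level of 0.95 it is about 3.841.

```python
from collections import Counter

def chi2_adjacent(a, b):
    """Chi-squared statistic of two adjacent intervals given per-class Counters."""
    n_a, n_b = sum(a.values()), sum(b.values())
    total = n_a + n_b
    stat = 0.0
    for cls in set(a) | set(b):
        col = a[cls] + b[cls]  # samples of this class in both intervals
        for observed, row in ((a[cls], n_a), (b[cls], n_b)):
            expected = row * col / total
            stat += (observed - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold, max_intervals=None, min_intervals=2):
    """Return the lower bounds of the intervals left after merging."""
    # Start with one interval per distinct value, ordered by value.
    counts = {}
    for value, label in zip(values, labels):
        counts.setdefault(value, Counter())[label] += 1
    lowers = sorted(counts)
    intervals = [counts[value] for value in lowers]

    while len(intervals) > min_intervals:
        stats = [chi2_adjacent(intervals[i], intervals[i + 1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)  # most similar pair
        too_many = max_intervals is not None and len(intervals) > max_intervals
        # Stop once every adjacent pair differs significantly,
        # unless the interval count still exceeds max_intervals.
        if stats[i] >= threshold and not too_many:
            break
        intervals[i] = intervals[i] + intervals[i + 1]  # merge class counts
        del intervals[i + 1]
        del lowers[i + 1]
    return lowers

# Toy data: the class changes between the values 3 and 4.
print(chimerge([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1], threshold=3.841))  # [1, 4]
```

On this toy column the loop merges down to two intervals with lower bounds 1 and 4, so the split falls exactly where the class changes.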

In [10]:
# !pip install --upgrade scorecardbundle

In [11]:
from scorecardbundle.feature_discretization import ChiMerge as cm
from scorecardbundle.feature_discretization import FeatureIntervalAdjustment as fia
from scorecardbundle.feature_encoding import WOE as woe
from scorecardbundle.feature_selection import FeatureSelection as fs
from scorecardbundle.model_training import LogisticRegressionScoreCard as lrsc
from scorecardbundle.model_evaluation import ModelEvaluation as me
from scorecardbundle.model_interpretation import ScorecardExplainer as mise

## 2.1 ChiMerge with 6 intervals
Scorecard Bundle ChiMerge discretization with at most 6 intervals (max_intervals=6)

In [12]:
# ScorecardBundle - ChiMerge with max 6 intervals
# Complete Pipeline

import time
start = time.time() # Start timer to measure execution time

# Perform discretization
y = pageblock['class']
trans_cm_6 = cm.ChiMerge(max_intervals=6, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_6 = trans_cm_6.fit_transform(pageblock[num_list], y) 
trans_cm_6.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_6 = pd.DataFrame({
    'feature':num_list
})
feature_doc_6['num_intervals'] = feature_doc_6['feature'].map(result_cm_6.nunique().to_dict())
feature_doc_6['min_interval_size'] = [fia.feature_stat(result_cm_6[col].values,y.values)['sample_size'].min() for col in feature_doc_6['feature']]
print('Summary discretization result')
print(feature_doc_6)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_6.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_6 = result_cm_6.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_6.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total execution time of this cell, in seconds

Summary discretization result
    feature  num_intervals  min_interval_size
0    height              6                 58
1    length              6                 36
2      area              6                 55
3     eccen              6                117
4   p_black              6                 57
5     p_and              6                 55
6   mean_tr              6                219
7   blacpix              6                 55
8  blackand              6                110
9  wb_trans              6                 55
Before encoding
    height    length         area      eccen     p_black         p_and  \
0  3.0~6.0  5.0~30.0    18.0~45.0  1.182~6.0  0.163~0.45  0.5622~0.695   
1  3.0~6.0  5.0~30.0    18.0~45.0  0.4~1.182  0.163~0.45   0.695~0.996   
2  3.0~6.0  5.0~30.0  45.0~1020.0  1.182~6.0  0.163~0.45   0.695~0.996   
3  3.0~6.0  5.0~30.0    18.0~45.0  1.182~6.0  0.163~0.45   0.695~0.996   
4  3.0~6.0   1.0~5.0    -inf~18.0  0.4~1.182  0.45~0.604   0.695~0.996   
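
One caveat about the encoding step: LabelEncoder sorts the interval strings lexicographically, so the integer codes may not follow the numeric order of the intervals (for example, a string starting with '1020.0' sorts before one starting with '18.0'). If ordered codes matter downstream, a sketch like the following sorts by the numeric lower bound instead. The helper names `lower_bound` and `ordinal_codes` are hypothetical, and the interval strings are illustrative values in the library's lower~upper format.

```python
def lower_bound(interval, delimiter="~"):
    """Numeric lower boundary of an interval string such as '18.0~45.0'."""
    return float(interval.split(delimiter, 1)[0])

def ordinal_codes(intervals, delimiter="~"):
    """Map each distinct interval to a code that follows numeric order."""
    ordered = sorted(set(intervals), key=lambda s: lower_bound(s, delimiter))
    return {interval: code for code, interval in enumerate(ordered)}

# Illustrative interval labels for one feature
column = ["18.0~45.0", "45.0~1020.0", "-inf~18.0", "1020.0~inf"]

# What a plain string sort (as LabelEncoder does) would assign
lexicographic = {s: i for i, s in enumerate(sorted(set(column)))}
numeric = ordinal_codes(column)

print(lexicographic["1020.0~inf"])  # 1
print(numeric["1020.0~inf"])        # 3
```

The resulting dictionary can be applied to a DataFrame column with `Series.map`, one feature at a time.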

   

In [13]:
# Combine the encoded features with the class column
tmp = pd.concat([result_cm_6,y_list], axis=1)
# Export the discretized dataset as a CSV file
tmp.to_csv('sc_cm_pageblock_6int.csv',index=False)

## 2.2 ChiMerge with 8 intervals

In [14]:
# ScorecardBundle - ChiMerge with max 8 intervals
# Complete Pipeline

import time
start = time.time() # Start timer to measure execution time

# Perform discretization
y = pageblock['class']
trans_cm_8 = cm.ChiMerge(max_intervals=8, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_8 = trans_cm_8.fit_transform(pageblock[num_list], y) 
trans_cm_8.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_8 = pd.DataFrame({
    'feature':num_list
})
feature_doc_8['num_intervals'] = feature_doc_8['feature'].map(result_cm_8.nunique().to_dict())
feature_doc_8['min_interval_size'] = [fia.feature_stat(result_cm_8[col].values,y.values)['sample_size'].min() for col in feature_doc_8['feature']]
print('Summary discretization result')
print(feature_doc_8)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_8.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_8 = result_cm_8.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_8.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total execution time of this cell, in seconds

Summary discretization result
    feature  num_intervals  min_interval_size
0    height              8                 55
1    length              8                 36
2      area              8                 55
3     eccen              8                 55
4   p_black              8                 57
5     p_and              8                 55
6   mean_tr              8                 59
7   blacpix              8                 55
8  blackand              8                 55
9  wb_trans              8                 55
Before encoding
    height    length        area        eccen                   p_black  \
0  4.0~6.0  5.0~30.0   18.0~45.0  1.182~3.818                0.237~0.45   
1  4.0~6.0  5.0~30.0   18.0~45.0    0.4~1.182                0.237~0.45   
2  4.0~6.0  5.0~30.0  45.0~297.0  1.182~3.818                0.237~0.45   
3  4.0~6.0  5.0~30.0   18.0~45.0  1.182~3.818                0.237~0.45   
4  4.0~6.0   1.0~5.0   -inf~18.0    0.4~1.182  0.4931999999999998~0.604  
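
The two summary columns can be read as follows, assuming that 'sample_size' in the output of `fia.feature_stat` is the number of rows falling in each interval: num_intervals is the count of distinct intervals of a feature, and min_interval_size is the row count of its least populated interval. A minimal sketch of the same quantities with `collections.Counter` is shown below; the helper name `interval_summary` and the sample column are illustrative assumptions.

```python
from collections import Counter

def interval_summary(column):
    """Count distinct intervals and the size of the smallest one."""
    sizes = Counter(column)  # interval label -> number of rows
    return {
        "num_intervals": len(sizes),
        "min_interval_size": min(sizes.values()),
    }

# Illustrative discretized column, not taken from the dataset
height = ["3.0~6.0"] * 4 + ["6.0~9.0"] * 2 + ["-inf~3.0"]
print(interval_summary(height))  # {'num_intervals': 3, 'min_interval_size': 1}
```

A very small min_interval_size suggests an interval that holds too few rows to be reliable, which is one reason to lower max_intervals.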

In [15]:
# Combine the encoded features with the class column
tmp = pd.concat([result_cm_8,y_list], axis=1)
# Export the discretized dataset as a CSV file
tmp.to_csv('sc_cm_pageblock_8int.csv',index=False)

## 2.3 ChiMerge with 10 intervals

In [16]:
# ScorecardBundle - ChiMerge with max 10 intervals
# Complete Pipeline

import time
start = time.time() # Start timer to measure execution time

# Perform discretization
y = pageblock['class']
trans_cm_10 = cm.ChiMerge(max_intervals=10, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_10 = trans_cm_10.fit_transform(pageblock[num_list], y) 
trans_cm_10.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_10 = pd.DataFrame({
    'feature':num_list
})
feature_doc_10['num_intervals'] = feature_doc_10['feature'].map(result_cm_10.nunique().to_dict())
feature_doc_10['min_interval_size'] = [fia.feature_stat(result_cm_10[col].values,y.values)['sample_size'].min() for col in feature_doc_10['feature']]
print('Summary discretization result')
print(feature_doc_10)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_10.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_10 = result_cm_10.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_10.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total execution time of this cell, in seconds

Summary discretization result
    feature  num_intervals  min_interval_size
0    height              9                 55
1    length             10                 36
2      area             10                 55
3     eccen             10                 55
4   p_black             10                 57
5     p_and              9                 55
6   mean_tr             10                 59
7   blacpix             10                 55
8  blackand             10                 55
9  wb_trans              9                 55
Before encoding
    height    length        area        eccen                   p_black  \
0  4.0~6.0  5.0~30.0   18.0~45.0  1.182~3.818                0.337~0.45   
1  4.0~6.0  5.0~30.0   18.0~45.0    1.0~1.182                0.337~0.45   
2  4.0~6.0  5.0~30.0  45.0~297.0  1.182~3.818               0.237~0.337   
3  4.0~6.0  5.0~30.0   18.0~45.0  1.182~3.818                0.337~0.45   
4  4.0~6.0   1.0~3.0   -inf~18.0      0.4~1.0  0.4931999999999998~0.604  

In [17]:
# Combine the encoded features with the class column
tmp = pd.concat([result_cm_10,y_list], axis=1)
# Export the discretized dataset as a CSV file
tmp.to_csv('sc_cm_pageblock_10int.csv',index=False)

## 2.4 ChiMerge with 15 intervals

In [18]:
# ScorecardBundle - ChiMerge with max 15 intervals
# Complete Pipeline

import time
start = time.time() # Start timer to measure execution time

# Perform discretization
y = pageblock['class']
trans_cm_15 = cm.ChiMerge(max_intervals=15, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_15 = trans_cm_15.fit_transform(pageblock[num_list], y) 
trans_cm_15.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_15 = pd.DataFrame({
    'feature':num_list
})
feature_doc_15['num_intervals'] = feature_doc_15['feature'].map(result_cm_15.nunique().to_dict())
feature_doc_15['min_interval_size'] = [fia.feature_stat(result_cm_15[col].values,y.values)['sample_size'].min() for col in feature_doc_15['feature']]
print('Summary discretization result')
print(feature_doc_15)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_15.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_15 = result_cm_15.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_15.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total execution time of this cell, in seconds

Summary discretization result
    feature  num_intervals  min_interval_size
0    height              9                 55
1    length             15                 36
2      area             14                 55
3     eccen             15                 52
4   p_black             11                 57
5     p_and              9                 55
6   mean_tr             12                 54
7   blacpix             11                 55
8  blackand             15                 54
9  wb_trans             11                 54
Before encoding
    height    length        area                     eccen  \
0  4.0~6.0  5.0~30.0   18.0~45.0  1.182~2.2944000000000004   
1  4.0~6.0  5.0~30.0   18.0~45.0                 1.0~1.182   
2  4.0~6.0  5.0~30.0  70.0~297.0  2.2944000000000004~3.818   
3  4.0~6.0  5.0~30.0   18.0~45.0  1.182~2.2944000000000004   
4  4.0~6.0   1.0~3.0   10.0~18.0                   0.4~1.0   

                    p_black        p_and                 mean_tr    blacpix

In [19]:
# Combine the encoded features with the class column
tmp = pd.concat([result_cm_15,y_list], axis=1)
# Export the discretized dataset as a CSV file
tmp.to_csv('sc_cm_pageblock_15int.csv',index=False)