# Discretization of pre-processed data
## Dataset: transfusion

By: Sam
Update: 23/02/2023
Implementing ChiMerge discretization using the library https://pypi.org/project/scorecardbundle/


### About Dataset
Attribute information:
Each attribute is listed with its variable name, variable type, measurement unit, and a brief description. The "Blood Transfusion Service Center" dataset is a binary classification problem.
The order of this listing corresponds to the order of the columns in the dataset.
- R (Recency - months since last donation)
- F (Frequency - total number of donations)
- M (Monetary - total blood donated in c.c.)
- T (Time - months since first donation)

LABEL: a binary variable representing whether the donor gave blood in March 2007
- 1 stands for donating blood
- 0 stands for not donating blood

# 1. Preparing data

In [1]:
# Import library
import pandas as pd
import numpy as np
from collections import Counter #for Chi Merge

In [2]:
# Read the clean dataset for discretization
data0 = pd.read_csv('clean_tranfusion.csv')
# Transfusion dataset (same object as data0, not a copy)
tranfusion = data0

In [3]:
tranfusion

Unnamed: 0,recency,frequency,monetary,time,label
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0
...,...,...,...,...,...
743,23,2,500,38,0
744,21,2,500,52,0
745,23,3,750,62,0
746,39,1,250,39,0


In [4]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder maps class labels to integer codes.
label_encoder = preprocessing.LabelEncoder()

# Encode labels in column 'label'.
tranfusion['label'] = label_encoder.fit_transform(tranfusion['label'])
  
tranfusion['label'].unique()

array([1, 0])

In [5]:
# List of continuous features to discretize
num_list = tranfusion.columns.to_list()
num_list.remove('label')

In [6]:
y_list = pd.DataFrame(tranfusion['label'])

In [7]:
num_list
y_list

Unnamed: 0,label
0,1
1,1
2,1
3,1
4,0
...,...
743,0
744,0
745,0
746,0


In [8]:
num_list

['recency', 'frequency', 'monetary', 'time']

In [9]:
tranfusion[num_list]

Unnamed: 0,recency,frequency,monetary,time
0,2,50,12500,98
1,0,13,3250,28
2,1,16,4000,35
3,2,20,5000,45
4,1,24,6000,77
...,...,...,...,...
743,23,2,500,38
744,21,2,500,52
745,23,3,750,62
746,39,1,250,39


# 2. ChiMerge discretization implementation
Reference:
- https://scorecard-bundle.bubu.blue/English/2.usage.html#feature-discretization-chimerge
- Parameters
    - m: integer, optional (default=2)
    The number of adjacent intervals to compare during the chi-squared test.

    - confidence_level: float, optional (default=0.9)
    The confidence level that determines the threshold above which adjacent intervals are considered different in the chi-squared test.

    - max_intervals: int, optional (default=None)
    The maximum number of intervals the discretized array will have. Sometimes (for example, when training a scorecard model) fewer intervals are preferred. If this option is not needed, set it to None.

    - min_intervals: int, optional (default=2)
    The minimum number of intervals the discretized array will have. If this option is not needed, set it to 2.

    - initial_intervals: int, optional (default=100)
    The original ChiMerge algorithm starts by putting each unique value in its own interval and then merges intervals in a loop. This can be time-consuming when the sample size is large. Setting initial_intervals to a value other than None (like 10 or 100) makes the algorithm start from the specified number of intervals (the initial intervals are generated using quantiles). This can greatly shorten the run time. If this option is not needed, set it to None.

    - delimiter: string, optional (default='\~')
    The returned array is an array of intervals. Each interval is represented by a string (e.g. '1~2') of the form lower+delimiter+upper. This parameter controls the symbol that connects the lower and upper boundaries.

    - decimal: int, optional (default=None)
    Controls the number of decimals of the boundaries. Default is None.

    - output_dataframe: boolean, optional (default=False)
    Whether to output a pd.DataFrame instead of an np.array.
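
To make `m` and `confidence_level` more concrete, here is a minimal sketch of the test that ChiMerge applies to a pair of adjacent intervals. It assumes a binary label and `m=2`, which gives one degree of freedom. The function names `chi2_adjacent` and `chi2_threshold_df1` are illustrative only and are not part of scorecardbundle; the library's internal implementation may differ.

```python
from statistics import NormalDist


def chi2_adjacent(counts_a, counts_b):
    """Pearson chi-square statistic for two adjacent intervals.

    counts_a and counts_b hold the per-class sample counts of each
    interval, e.g. [n_label0, n_label1].
    """
    rows = [counts_a, counts_b]
    n_classes = len(counts_a)
    row_totals = [sum(r) for r in rows]
    col_totals = [counts_a[j] + counts_b[j] for j in range(n_classes)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(n_classes):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (rows[i][j] - expected) ** 2 / expected
    return stat


def chi2_threshold_df1(confidence_level):
    """Critical value of the chi-square distribution with 1 degree of freedom.

    A chi-square variable with 1 degree of freedom is the square of a
    standard normal variable, so the standard library is sufficient here.
    """
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    return z * z


# Two intervals with the same class mix look alike and are merge candidates.
print(chi2_adjacent([10, 10], [10, 10]))
# Two intervals with opposite class mixes are clearly different.
print(chi2_adjacent([20, 0], [0, 20]))
# Threshold used to decide "different" at confidence_level = 0.95.
print(chi2_threshold_df1(0.95))
```

A pair whose statistic falls below the threshold is treated as not significantly different and is a candidate for merging. A higher `confidence_level` raises the threshold, which tends to produce fewer, wider intervals.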

In [10]:
# !pip install --upgrade scorecardbundle

In [11]:
from scorecardbundle.feature_discretization import ChiMerge as cm
from scorecardbundle.feature_discretization import FeatureIntervalAdjustment as fia
from scorecardbundle.feature_encoding import WOE as woe
from scorecardbundle.feature_selection import FeatureSelection as fs
from scorecardbundle.model_training import LogisticRegressionScoreCard as lrsc
from scorecardbundle.model_evaluation import ModelEvaluation as me
from scorecardbundle.model_interpretation import ScorecardExplainer as mise

## 2.1 ChiMerge with at most 6 intervals
ScorecardBundle ChiMerge discretization with `max_intervals=6`.

In [12]:
# ScorecardBundle - ChiMerge with max 6 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = tranfusion['label']
trans_cm_6 = cm.ChiMerge(max_intervals=6, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_6 = trans_cm_6.fit_transform(tranfusion[num_list], y) 
trans_cm_6.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_6 = pd.DataFrame({
    'feature':num_list
})
feature_doc_6['num_intervals'] = feature_doc_6['feature'].map(result_cm_6.nunique().to_dict())
feature_doc_6['min_interval_size'] = [fia.feature_stat(result_cm_6[col].values,y.values)['sample_size'].min() for col in feature_doc_6['feature']]
print('Summary discretization result')
print(feature_doc_6)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_6.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_6 = result_cm_6.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_6.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
     feature  num_intervals  min_interval_size
0    recency              3                 59
1  frequency              5                  4
2   monetary              5                  4
3       time              5                  4
Before encoding
    recency  frequency       monetary       time
0  -inf~6.0   24.0~inf     6000.0~inf   95.0~inf
1  -inf~6.0   4.0~19.0  1000.0~4750.0  -inf~49.0
2  -inf~6.0   4.0~19.0  1000.0~4750.0  -inf~49.0
3  -inf~6.0  19.0~22.0  4750.0~5500.0  -inf~49.0
4  -inf~6.0  22.0~24.0  5500.0~6000.0  69.0~95.0
After encoding
   recency  frequency  monetary  time
0        0          3         4     4
1        0          4         1     0
2        0          4         1     0
3        0          1         2     0
4        0          2         3     3
Execution time:  1.0499858856201172
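
Note that in the "After encoding" output, `LabelEncoder` assigns codes by sorting the interval strings lexicographically, so for `frequency` the interval `4.0~19.0` receives code 4 while `24.0~inf` receives code 3. If a downstream method relies on the order of the bins, an order-preserving encoding may be preferable. The following is a minimal sketch of one way to do this, assuming the `'~'` delimiter used above; `interval_lower` and `ordinal_codes` are hypothetical helper names and are not part of scorecardbundle or scikit-learn.

```python
def interval_lower(label, delimiter="~"):
    """Return the numeric lower bound of an interval label such as '4.0~19.0'."""
    lower, _upper = label.split(delimiter, 1)
    return float(lower)  # float('-inf') is parsed correctly


def ordinal_codes(labels, delimiter="~"):
    """Map interval labels to integer codes ordered by their lower bound."""
    ordered = sorted(set(labels), key=lambda s: interval_lower(s, delimiter))
    mapping = {label: code for code, label in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping


# Interval labels for 'frequency', taken from the output above
# (plus the lowest interval, which does not appear in the first five rows).
sample = ["24.0~inf", "4.0~19.0", "4.0~19.0", "19.0~22.0", "22.0~24.0", "-inf~4.0"]
codes, mapping = ordinal_codes(sample)
print(codes)
print(mapping)
```

With pandas, the same mapping could be applied to a column of interval strings via `Series.map(mapping)` before any label encoding takes place.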


In [13]:
# Combine the encoded features with the label column
tmp = pd.concat([result_cm_6, y_list], axis=1)
# Export the discretized dataset to a CSV file
tmp.to_csv('sc_cm_tranfusion_6int.csv', index=False)

## 2.2 ChiMerge with at most 8 intervals

In [14]:
# ScorecardBundle - ChiMerge with max 8 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = tranfusion['label']
trans_cm_8 = cm.ChiMerge(max_intervals=8, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_8 = trans_cm_8.fit_transform(tranfusion[num_list], y) 
trans_cm_8.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_8 = pd.DataFrame({
    'feature':num_list
})
feature_doc_8['num_intervals'] = feature_doc_8['feature'].map(result_cm_8.nunique().to_dict())
feature_doc_8['min_interval_size'] = [fia.feature_stat(result_cm_8[col].values,y.values)['sample_size'].min() for col in feature_doc_8['feature']]
print('Summary discretization result')
print(feature_doc_8)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_8.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_8 = result_cm_8.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_8.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
     feature  num_intervals  min_interval_size
0    recency              3                 59
1  frequency              5                  4
2   monetary              5                  4
3       time              7                  3
Before encoding
    recency  frequency       monetary       time
0  -inf~6.0   24.0~inf     6000.0~inf   95.0~inf
1  -inf~6.0   4.0~19.0  1000.0~4750.0  -inf~31.0
2  -inf~6.0   4.0~19.0  1000.0~4750.0  32.0~49.0
3  -inf~6.0  19.0~22.0  4750.0~5500.0  32.0~49.0
4  -inf~6.0  22.0~24.0  5500.0~6000.0  69.0~95.0
After encoding
   recency  frequency  monetary  time
0        0          3         4     6
1        0          4         1     0
2        0          4         1     2
3        0          1         2     2
4        0          2         3     5
Execution time:  0.701528787612915


In [15]:
# Combine the encoded features with the label column
tmp = pd.concat([result_cm_8, y_list], axis=1)
# Export the discretized dataset to a CSV file
tmp.to_csv('sc_cm_tranfusion_8int.csv', index=False)

## 2.3 ChiMerge with at most 10 intervals

In [16]:
# ScorecardBundle - ChiMerge with max 10 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = tranfusion['label']
trans_cm_10 = cm.ChiMerge(max_intervals=10, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_10 = trans_cm_10.fit_transform(tranfusion[num_list], y) 
trans_cm_10.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_10 = pd.DataFrame({
    'feature':num_list
})
feature_doc_10['num_intervals'] = feature_doc_10['feature'].map(result_cm_10.nunique().to_dict())
feature_doc_10['min_interval_size'] = [fia.feature_stat(result_cm_10[col].values,y.values)['sample_size'].min() for col in feature_doc_10['feature']]
print('Summary discretization result')
print(feature_doc_10)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_10.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_10 = result_cm_10.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_10.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
     feature  num_intervals  min_interval_size
0    recency              3                 59
1  frequency              5                  4
2   monetary              5                  4
3       time              7                  3
Before encoding
    recency  frequency       monetary       time
0  -inf~6.0   24.0~inf     6000.0~inf   95.0~inf
1  -inf~6.0   4.0~19.0  1000.0~4750.0  -inf~31.0
2  -inf~6.0   4.0~19.0  1000.0~4750.0  32.0~49.0
3  -inf~6.0  19.0~22.0  4750.0~5500.0  32.0~49.0
4  -inf~6.0  22.0~24.0  5500.0~6000.0  69.0~95.0
After encoding
   recency  frequency  monetary  time
0        0          3         4     6
1        0          4         1     0
2        0          4         1     2
3        0          1         2     2
4        0          2         3     5
Execution time:  0.7155258655548096


In [17]:
# Combine the encoded features with the label column
tmp = pd.concat([result_cm_10, y_list], axis=1)
# Export the discretized dataset to a CSV file
tmp.to_csv('sc_cm_tranfusion_10int.csv', index=False)

## 2.4 ChiMerge with at most 15 intervals

In [18]:
# ScorecardBundle - ChiMerge with max 15 intervals
# Complete Pipeline

import time
start = time.time() # For measuring time execution

# Perform discretization
y = tranfusion['label']
trans_cm_15 = cm.ChiMerge(max_intervals=15, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_15 = trans_cm_15.fit_transform(tranfusion[num_list], y) 
trans_cm_15.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_15 = pd.DataFrame({
    'feature':num_list
})
feature_doc_15['num_intervals'] = feature_doc_15['feature'].map(result_cm_15.nunique().to_dict())
feature_doc_15['min_interval_size'] = [fia.feature_stat(result_cm_15[col].values,y.values)['sample_size'].min() for col in feature_doc_15['feature']]
print('Summary discretization result')
print(feature_doc_15)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_15.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_15 = result_cm_15.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_15.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start) # Total time execution for this sample

Summary discretization result
     feature  num_intervals  min_interval_size
0    recency              3                 59
1  frequency              5                  4
2   monetary              5                  4
3       time              7                  3
Before encoding
    recency  frequency       monetary       time
0  -inf~6.0   24.0~inf     6000.0~inf   95.0~inf
1  -inf~6.0   4.0~19.0  1000.0~4750.0  -inf~31.0
2  -inf~6.0   4.0~19.0  1000.0~4750.0  32.0~49.0
3  -inf~6.0  19.0~22.0  4750.0~5500.0  32.0~49.0
4  -inf~6.0  22.0~24.0  5500.0~6000.0  69.0~95.0
After encoding
   recency  frequency  monetary  time
0        0          3         4     6
1        0          4         1     0
2        0          4         1     2
3        0          1         2     2
4        0          2         3     5
Execution time:  0.7028007507324219
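
The summaries for 8, 10, and 15 maximum intervals are identical, with `time` settling at 7 intervals in each case. A likely explanation is that `max_intervals` is only an upper bound: merging continues until every adjacent pair differs significantly, which can happen well before the cap matters. The sketch below illustrates this with a simplified bottom-up merge loop on toy class counts. It assumes the stopping rule of the classic ChiMerge description and a binary label; `chi2_pair` and `chimerge_counts` are hypothetical names, and this is not scorecardbundle's actual code.

```python
from statistics import NormalDist


def chi2_pair(a, b):
    """Pearson chi-square for two adjacent intervals given per-class counts."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j in range(len(a)):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:
                stat += (row[j] - expected) ** 2 / expected
    return stat


def chimerge_counts(intervals, confidence_level=0.95,
                    max_intervals=None, min_intervals=2):
    """Merge adjacent intervals, each given as [n_label0, n_label1]."""
    # Critical value for 1 degree of freedom (binary label, pairs of intervals).
    threshold = NormalDist().inv_cdf((1 + confidence_level) / 2) ** 2
    intervals = [list(c) for c in intervals]
    while len(intervals) > min_intervals:
        stats = [chi2_pair(intervals[i], intervals[i + 1])
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        too_many = max_intervals is not None and len(intervals) > max_intervals
        # Stop once the most similar pair is significantly different
        # and the interval count is within the cap.
        if stats[i] >= threshold and not too_many:
            break
        intervals[i] = [x + y for x, y in zip(intervals[i], intervals[i + 1])]
        del intervals[i + 1]
    return intervals


# Toy data: six initial intervals with [n_label0, n_label1] counts.
toy = [[9, 1], [8, 2], [5, 5], [4, 6], [1, 9], [0, 10]]
for cap in (8, 15, 2):
    print(cap, chimerge_counts(toy, max_intervals=cap))
```

On the toy data, caps of 8 and 15 give the same result because the significance test stops the merging first; only a cap below that natural stopping point changes the outcome.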


In [19]:
# Combine the encoded features with the label column
tmp = pd.concat([result_cm_15, y_list], axis=1)
# Export the discretized dataset to a CSV file
tmp.to_csv('sc_cm_tranfusion_15int.csv', index=False)