# Discretization of pre-processed data
## Dataset: phoneme

By: Sam
Update: 23/02/2023
Implementing ChiMerge discretization using the library https://pypi.org/project/scorecardbundle/


### About Dataset
The raw dataset is in ARFF format and must be converted to CSV (using the tool at https://pulipulichen.github.io/jieba-js/weka/arff2csv/).
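
As an offline alternative to the web tool, the conversion can be sketched in a few lines of standard-library Python. This is a minimal sketch that assumes a dense ARFF file with unquoted values and attribute names without spaces; the function name `arff_to_csv` and the two-feature toy sample are illustrative only and are not part of the original workflow. For real files, a dedicated reader such as `scipy.io.arff.loadarff` is likely the safer choice.

```python
import csv
import io


def arff_to_csv(arff_text):
    """Convert a dense ARFF document to CSV text (header row plus data rows)."""
    columns, rows, in_data = [], [], False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lowered = line.lower()
        if in_data:
            rows.append([cell.strip() for cell in line.split(",")])
        elif lowered.startswith("@attribute"):
            columns.append(line.split()[1])  # second token is the attribute name
        elif lowered.startswith("@data"):
            in_data = True  # everything after @data is a data row
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return out.getvalue()


# Toy sample (hypothetical, truncated to two features for brevity).
sample = """@relation phoneme
@attribute V1 numeric
@attribute V2 numeric
@attribute Class {1,2}
@data
0.489927,-0.451528,1
-0.044375,-0.010512,2
"""
csv_text = arff_to_csv(sample)
print(csv_text)
```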

Five different attributes were chosen to characterize each vowel: the amplitudes of the first five harmonics AHi, normalised by the total energy Ene (integrated over all frequencies): AHi/Ene. The phonemes are transcribed as follows: sh as in she, dcl as in dark, iy as the vowel in she, aa as the vowel in dark, and ao as the first vowel in water.
=> All attributes are numeric.

The aim of the present database is to distinguish between nasal and oral vowels. There are thus two different classes:
- Class 0 : Nasals
- Class 1 : Orals

# 1. Preparing data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from collections import Counter #for Chi Merge

In [2]:
# Read the cleaned dataset for discretization
data0 = pd.read_csv('clean_phoneme.csv')
# phoneme dataset
phoneme = data0

In [3]:
phoneme.drop(['id'], axis=1, inplace = True)

In [4]:
phoneme

Unnamed: 0,V1,V2,V3,V4,V5,Class
0,0.489927,-0.451528,-1.047990,-0.598693,-0.020418,1
1,-0.641265,0.109245,0.292130,-0.916804,0.240223,1
2,0.870593,-0.459862,0.578159,0.806634,0.835248,1
3,-0.628439,-0.316284,1.934295,-1.427099,-0.136583,1
4,-0.596399,0.015938,2.043206,-1.688448,-0.948127,1
...,...,...,...,...,...,...
5399,-0.658318,1.331760,-0.081621,1.794253,-1.082181,1
5400,-0.044375,-0.010512,0.030989,-0.019379,1.281061,2
5401,0.246882,-0.793228,1.190101,1.423194,-1.303036,2
5402,-0.778907,-0.383111,1.727029,-1.432389,-1.208085,1


In [5]:
phoneme.rename(columns={'Class':'class'}, inplace=True)

In [6]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
  
# Encode labels in column 'class'.
phoneme['class'] = label_encoder.fit_transform(phoneme['class'])
  
phoneme['class'].unique()

array([0, 1])

In [7]:
# List of continuous features to discretize
num_list = phoneme.columns.to_list()
num_list.remove('class')

In [8]:
y_list = pd.DataFrame(phoneme['class'])

In [9]:
num_list
y_list

Unnamed: 0,class
0,0
1,0
2,0
3,0
4,0
...,...
5399,0
5400,1
5401,1
5402,0


In [10]:
num_list

['V1', 'V2', 'V3', 'V4', 'V5']

In [11]:
phoneme[num_list]

Unnamed: 0,V1,V2,V3,V4,V5
0,0.489927,-0.451528,-1.047990,-0.598693,-0.020418
1,-0.641265,0.109245,0.292130,-0.916804,0.240223
2,0.870593,-0.459862,0.578159,0.806634,0.835248
3,-0.628439,-0.316284,1.934295,-1.427099,-0.136583
4,-0.596399,0.015938,2.043206,-1.688448,-0.948127
...,...,...,...,...,...
5399,-0.658318,1.331760,-0.081621,1.794253,-1.082181
5400,-0.044375,-0.010512,0.030989,-0.019379,1.281061
5401,0.246882,-0.793228,1.190101,1.423194,-1.303036
5402,-0.778907,-0.383111,1.727029,-1.432389,-1.208085


# 2. ChiMerge discretization implementation
Reference:
- https://scorecard-bundle.bubu.blue/English/2.usage.html#feature-discretization-chimerge
- Parameters
    - m: integer, optional (default=2)
    The number of adjacent intervals to compare during the chi-squared test.

    - confidence_level: float, optional (default=0.9)
    The confidence level that determines the threshold above which adjacent intervals are considered different in the chi-squared test.

    - max_intervals: int, optional (default=None)
    The maximum number of intervals the discretized array may have. Fewer intervals are sometimes preferred (for example, when training a scorecard model). Set it to None if this option is not needed.

    - min_intervals: int, optional (default=2)
    The minimum number of intervals the discretized array will have. Set it to 2 if this option is not needed.

    - initial_intervals: int, optional (default=100)
    The original ChiMerge algorithm starts by putting each unique value in its own interval and then merges intervals in a loop, which can be time-consuming when the sample size is large. Setting initial_intervals to a value other than None (such as 10 or 100) makes the algorithm start from that number of intervals, generated using quantiles. This can greatly shorten the run time. Set it to None if this option is not needed.

    - delimiter: string, optional (default='~')
    The returned array is an array of intervals. Each interval is represented by a string (e.g. '1~2') of the form lower+delimiter+upper. This parameter controls the symbol that connects the lower and upper boundaries.

    - decimal: int, optional (default=None)
    Controls the number of decimals of the boundaries. Default is None.

    - output_dataframe: boolean, optional (default=False)
    Whether to output an np.array or a pd.DataFrame.
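
To make the roles of `confidence_level`, `max_intervals` and `min_intervals` concrete, here is a minimal pure-Python sketch of the textbook ChiMerge merge loop for two classes and `m=2`. It is not the scorecardbundle implementation, which may differ in details; the names `chi2_adjacent`, `merge_intervals` and `CHI2_CRITICAL_DF1` are hypothetical. With two classes and two intervals the test has one degree of freedom, so the threshold at a 0.95 confidence level is 3.841.

```python
# Chi-square critical values for 1 degree of freedom (two classes, m=2),
# keyed by confidence level. Other levels are not covered by this sketch.
CHI2_CRITICAL_DF1 = {0.90: 2.706, 0.95: 3.841, 0.99: 6.635}


def chi2_adjacent(counts_a, counts_b):
    """Pearson chi-square statistic for the class counts of two adjacent,
    non-empty intervals."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    total = n_a + n_b
    stat = 0.0
    for obs_a, obs_b in zip(counts_a, counts_b):
        class_total = obs_a + obs_b
        if class_total == 0:
            continue  # class absent from both intervals: contributes nothing
        exp_a = n_a * class_total / total  # expected count under independence
        exp_b = n_b * class_total / total
        stat += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return stat


def merge_intervals(counts, confidence_level=0.95, max_intervals=None, min_intervals=2):
    """Merge the most similar adjacent pair until every pair differs
    significantly and the interval count respects max_intervals.

    `counts` holds per-interval class counts, ordered by feature value.
    """
    threshold = CHI2_CRITICAL_DF1[confidence_level]
    counts = [list(c) for c in counts]
    while len(counts) > min_intervals:
        stats = [chi2_adjacent(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)  # most similar pair
        too_many = max_intervals is not None and len(counts) > max_intervals
        if stats[i] >= threshold and not too_many:
            break  # all pairs are significantly different and the cap is met
        counts[i:i + 2] = [[a + b for a, b in zip(counts[i], counts[i + 1])]]
    return counts


# Four initial intervals with [class 0, class 1] counts.
merged = merge_intervals([[10, 0], [9, 1], [1, 9], [0, 10]])
print(merged)
```

In this sketch `max_intervals` is only an upper bound: merging also continues while the most similar pair falls below the threshold, so the final count can be lower than the cap. That is consistent with V4 ending up with 13 intervals in the run that allows 15.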

In [12]:
# !pip install --upgrade scorecardbundle

In [13]:
from scorecardbundle.feature_discretization import ChiMerge as cm
from scorecardbundle.feature_discretization import FeatureIntervalAdjustment as fia
from scorecardbundle.feature_encoding import WOE as woe
from scorecardbundle.feature_selection import FeatureSelection as fs
from scorecardbundle.model_training import LogisticRegressionScoreCard as lrsc
from scorecardbundle.model_evaluation import ModelEvaluation as me
from scorecardbundle.model_interpretation import ScorecardExplainer as mise

In [14]:
y = phoneme['class']

In [15]:
trans_cm = cm.ChiMerge(max_intervals=6, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm = trans_cm.fit_transform(phoneme[num_list], y) 
trans_cm.boundaries_ # see the interval boundaries for each feature

{'V1': array([-0.8481224 , -0.690833  ,  0.23532028,  0.74940586,  1.50849149,
                inf]),
 'V2': array([-1.04744376, -0.90718692, -0.4762088 ,  0.03819899,  0.52780789,
                inf]),
 'V3': array([-1.6331841 , -1.35842668, -0.599138  , -0.27078184,  1.56899126,
                inf]),
 'V4': array([-0.881187  , -0.40999874, -0.0740622 ,  0.21153328,  0.40498152,
                inf]),
 'V5': array([-0.73667892, -0.37457236, -0.2127475 , -0.136583  ,  0.4312365 ,
                inf])}
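
Each array in `boundaries_` lists the upper edges of the intervals, with `inf` closing the last one. The sketch below shows how a new value could be assigned to an interval from such an array using the standard-library `bisect` module. It assumes the intervals are right-inclusive, which is what the notebook output suggests (the V5 value -0.136583 in row 3 is placed in the interval ending at -0.136583); the helper names `interval_index` and `interval_label` are hypothetical.

```python
from bisect import bisect_left
from math import inf


def interval_index(value, upper_bounds):
    """Index of the interval (lower, upper] that contains `value`."""
    # bisect_left returns the first position whose bound is >= value,
    # which makes the upper edge inclusive.
    return bisect_left(upper_bounds, value)


def interval_label(value, upper_bounds, delimiter="~"):
    """Build a 'lower~upper' label in the style of the discretized output."""
    i = interval_index(value, upper_bounds)
    lower = -inf if i == 0 else upper_bounds[i - 1]
    return f"{lower}{delimiter}{upper_bounds[i]}"


# V5 boundaries as printed above.
v5_bounds = [-0.73667892, -0.37457236, -0.2127475, -0.136583, 0.4312365, inf]

for v in (-0.948127, -0.136583, -0.020418, 1.281061):
    print(v, interval_index(v, v5_bounds), interval_label(v, v5_bounds))
```

The printed labels use the rounded boundaries shown above, so they can differ in the last digits from the strings that `fit_transform` returns.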

In [16]:
result_cm

Unnamed: 0,V1,V2,V3,V4,V5
0,0.23532028~0.74940586,-0.47620879999999993~0.03819898999999998,-1.35842668~-0.599138,-0.8811869999999999~-0.40999874000000003,-0.136583~0.4312365000000001
1,-0.690833~0.23532028,0.03819898999999998~0.5278078899999998,-0.27078184~1.56899126,-inf~-0.8811869999999999,-0.136583~0.4312365000000001
2,0.74940586~1.5084914900000004,-0.47620879999999993~0.03819898999999998,-0.27078184~1.56899126,0.40498152000000065~inf,0.4312365000000001~inf
3,-0.690833~0.23532028,-0.47620879999999993~0.03819898999999998,1.56899126~inf,-inf~-0.8811869999999999,-0.21274749999999998~-0.136583
4,-0.690833~0.23532028,-0.47620879999999993~0.03819898999999998,1.56899126~inf,-inf~-0.8811869999999999,-inf~-0.7366789199999997
...,...,...,...,...,...
5399,-0.690833~0.23532028,0.5278078899999998~inf,-0.27078184~1.56899126,0.40498152000000065~inf,-inf~-0.7366789199999997
5400,-0.690833~0.23532028,-0.47620879999999993~0.03819898999999998,-0.27078184~1.56899126,-0.0740622~0.21153328000000007,0.4312365000000001~inf
5401,0.23532028~0.74940586,-0.90718692~-0.47620879999999993,-0.27078184~1.56899126,0.40498152000000065~inf,-inf~-0.7366789199999997
5402,-0.8481223999999999~-0.690833,-0.47620879999999993~0.03819898999999998,1.56899126~inf,-inf~-0.8811869999999999,-inf~-0.7366789199999997


In [17]:
# Examine result
feature_doc = pd.DataFrame({
    'feature':num_list
})
feature_doc['num_intervals'] = feature_doc['feature'].map(result_cm.nunique().to_dict())
feature_doc['min_interval_size'] = [fia.feature_stat(result_cm[col].values,y.values)['sample_size'].min() for col in feature_doc['feature']]
feature_doc

Unnamed: 0,feature,num_intervals,min_interval_size
0,V1,6,271
1,V2,6,378
2,V3,6,55
3,V4,6,325
4,V5,6,634
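
The `min_interval_size` column reports the smallest number of samples that falls into any interval of a feature. As a rough illustration of what such a per-interval summary involves, here is a sketch using `collections.Counter`. It does not reproduce the actual output of `fia.feature_stat`; only the `sample_size` name is taken from the cell above, while `interval_stats`, `class1_rate` and the toy data are hypothetical.

```python
from collections import Counter


def interval_stats(intervals, labels):
    """Per-interval sample size and share of class-1 samples."""
    sizes = Counter(intervals)
    class1 = Counter(iv for iv, y in zip(intervals, labels) if y == 1)
    return {
        iv: {"sample_size": n, "class1_rate": class1[iv] / n}
        for iv, n in sizes.items()
    }


# Toy data: one discretized feature and its binary class labels.
toy_intervals = ["-inf~0.0", "-inf~0.0", "0.0~inf", "0.0~inf", "0.0~inf"]
toy_labels = [0, 1, 1, 1, 0]

stats = interval_stats(toy_intervals, toy_labels)
min_interval_size = min(s["sample_size"] for s in stats.values())
print(stats)
print("min_interval_size:", min_interval_size)
```

A very small minimum, such as the 55 samples for V3 out of 5404 rows, can indicate an interval whose class distribution is estimated from little data.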


In [18]:
result_cm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5404 entries, 0 to 5403
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   V1      5404 non-null   object
 1   V2      5404 non-null   object
 2   V3      5404 non-null   object
 3   V4      5404 non-null   object
 4   V5      5404 non-null   object
dtypes: object(5)
memory usage: 211.2+ KB


In [19]:
# Import label encoder
from sklearn import preprocessing
  
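# Create a LabelEncoder to turn interval labels into integer codes.
label_encoder = preprocessing.LabelEncoder()

# Encode the interval labels of every feature column.
# LabelEncoder expects 1-D input, so it is applied column by column.
# Note: codes follow the sorted order of the label strings, not the numeric order of the intervals.
result_cm = result_cm.apply(label_encoder.fit_transform)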

In [20]:
result_cm

Unnamed: 0,V1,V2,V3,V4,V5
0,3,0,2,2,0
1,0,4,0,3,0
2,4,0,0,5,5
3,0,0,5,3,1
4,0,0,5,3,4
...,...,...,...,...,...
5399,0,5,0,5,4
5400,0,0,0,0,5
5401,3,1,0,5,4
5402,1,0,5,3,4
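
The codes in this table follow the sort order of the interval strings, so they are not monotonic in the feature value: for V1, the interval `-0.690833~0.23532028` gets code 0 while `0.23532028~0.74940586` gets code 3. If an order-preserving encoding is wanted, one option is to sort the labels by their numeric lower bound. The following is a sketch of that alternative, not the method used in this notebook; `ordinal_codes` is a hypothetical helper, and the first and last V1 labels are reconstructed from the printed boundaries.

```python
def ordinal_codes(labels, delimiter="~"):
    """Map interval strings to integer codes ordered by their lower bound."""
    def lower_bound(label):
        # Split on the first delimiter only; float() understands '-inf'.
        return float(label.split(delimiter, 1)[0])

    ordered = sorted(set(labels), key=lower_bound)
    mapping = {label: code for code, label in enumerate(ordered)}
    return [mapping[label] for label in labels], mapping


# V1 interval labels in the format of the notebook output (max_intervals=6).
v1_labels = [
    "-inf~-0.8481223999999999",
    "-0.8481223999999999~-0.690833",
    "-0.690833~0.23532028",
    "0.23532028~0.74940586",
    "0.74940586~1.5084914900000004",
    "1.5084914900000004~inf",
]

# Codes obtained from plain string sorting, for comparison.
lexicographic = {label: code for code, label in enumerate(sorted(v1_labels))}
codes, numeric = ordinal_codes(v1_labels)

for label in v1_labels:
    print(f"{label:32s} string order: {lexicographic[label]}  numeric order: {numeric[label]}")
```

Whether the ordering matters depends on the downstream model: tree-based learners that treat the codes as ordered values are affected, while methods that treat each code as an unordered category are not.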


In [21]:
tmp = pd.concat([result_cm,y_list], axis=1)
tmp

Unnamed: 0,V1,V2,V3,V4,V5,class
0,3,0,2,2,0,0
1,0,4,0,3,0,0
2,4,0,0,5,5,0
3,0,0,5,3,1,0
4,0,0,5,3,4,0
...,...,...,...,...,...,...
5399,0,5,0,5,4,0
5400,0,0,0,0,5,1
5401,3,1,0,5,4,1
5402,1,0,5,3,4,0


## 2.1 SC_ChiMerge with 6 intervals
Scorecard ChiMerge discretization with 6 intervals

In [22]:
# ScorecardBundle - ChiMerge with max 6 intervals
# Complete Pipeline

import time
start = time.time()  # start time, used to measure execution time

# Perform discretization
y = phoneme['class']
trans_cm_6 = cm.ChiMerge(max_intervals=6, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_6 = trans_cm_6.fit_transform(phoneme[num_list], y) 
trans_cm_6.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_6 = pd.DataFrame({
    'feature':num_list
})
feature_doc_6['num_intervals'] = feature_doc_6['feature'].map(result_cm_6.nunique().to_dict())
feature_doc_6['min_interval_size'] = [fia.feature_stat(result_cm_6[col].values,y.values)['sample_size'].min() for col in feature_doc_6['feature']]
print('Summary discretization result')
print(feature_doc_6)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_6.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_6 = result_cm_6.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_6.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start)  # total execution time of this cell, in seconds

Summary discretization result
  feature  num_intervals  min_interval_size
0      V1              6                271
1      V2              6                378
2      V3              6                 55
3      V4              6                325
4      V5              6                634
Before encoding
                              V1                                        V2  \
0          0.23532028~0.74940586  -0.47620879999999993~0.03819898999999998   
1           -0.690833~0.23532028    0.03819898999999998~0.5278078899999998   
2  0.74940586~1.5084914900000004  -0.47620879999999993~0.03819898999999998   
3           -0.690833~0.23532028  -0.47620879999999993~0.03819898999999998   
4           -0.690833~0.23532028  -0.47620879999999993~0.03819898999999998   

                       V3                                        V4  \
0   -1.35842668~-0.599138  -0.8811869999999999~-0.40999874000000003   
1  -0.27078184~1.56899126                  -inf~-0.8811869999999999   
2  -0.27

In [23]:
# Export data: combine the encoded features with the class column
tmp = pd.concat([result_cm_6, y_list], axis=1)
# Save the discretized dataset as a CSV file
tmp.to_csv('sc_cm_phoneme_6int.csv', index=False)

## 2.2 ChiMerge with 8 intervals

In [24]:
# ScorecardBundle - ChiMerge with max 8 intervals
# Complete Pipeline

import time
start = time.time()  # start time, used to measure execution time

# Perform discretization
y = phoneme['class']
trans_cm_8 = cm.ChiMerge(max_intervals=8, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_8 = trans_cm_8.fit_transform(phoneme[num_list], y) 
trans_cm_8.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_8 = pd.DataFrame({
    'feature':num_list
})
feature_doc_8['num_intervals'] = feature_doc_8['feature'].map(result_cm_8.nunique().to_dict())
feature_doc_8['min_interval_size'] = [fia.feature_stat(result_cm_8[col].values,y.values)['sample_size'].min() for col in feature_doc_8['feature']]
print('Summary discretization result')
print(feature_doc_8)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_8.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_8 = result_cm_8.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_8.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start)  # total execution time of this cell, in seconds

Summary discretization result
  feature  num_intervals  min_interval_size
0      V1              8                271
1      V2              8                217
2      V3              8                 55
3      V4              8                 54
4      V5              8                324
Before encoding
                                 V1  \
0             0.23532028~0.74940586   
1             -0.690833~-0.63812504   
2     0.74940586~1.5084914900000004   
3  -0.63812504~-0.12732183000000002   
4  -0.63812504~-0.12732183000000002   

                                          V2  \
0  -0.47620879999999993~-0.31337043999999964   
1     0.03819898999999998~0.5278078899999998   
2  -0.47620879999999993~-0.31337043999999964   
3  -0.47620879999999993~-0.31337043999999964   
4   -0.31337043999999964~0.03819898999999998   

                                         V3                        V4  \
0                     -1.35842668~-0.599138    -0.7583255~-0.58313932   
1  -0.02052678999999

In [25]:
# Export data: combine the encoded features with the class column
tmp = pd.concat([result_cm_8, y_list], axis=1)
# Save the discretized dataset as a CSV file
tmp.to_csv('sc_cm_phoneme_8int.csv', index=False)

## 2.3 ChiMerge with 10 intervals

In [26]:
# ScorecardBundle - ChiMerge with max 10 intervals
# Complete Pipeline

import time
start = time.time()  # start time, used to measure execution time

# Perform discretization
y = phoneme['class']
trans_cm_10 = cm.ChiMerge(max_intervals=10, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_10 = trans_cm_10.fit_transform(phoneme[num_list], y) 
trans_cm_10.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_10 = pd.DataFrame({
    'feature':num_list
})
feature_doc_10['num_intervals'] = feature_doc_10['feature'].map(result_cm_10.nunique().to_dict())
feature_doc_10['min_interval_size'] = [fia.feature_stat(result_cm_10[col].values,y.values)['sample_size'].min() for col in feature_doc_10['feature']]
print('Summary discretization result')
print(feature_doc_10)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_10.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_10 = result_cm_10.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_10.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start)  # total execution time of this cell, in seconds

Summary discretization result
  feature  num_intervals  min_interval_size
0      V1             10                108
1      V2             10                108
2      V3             10                 55
3      V4             10                 54
4      V5             10                216
Before encoding
                                 V1  \
0             0.23532028~0.74940586   
1             -0.690833~-0.63812504   
2     0.74940586~1.5084914900000004   
3  -0.63812504~-0.12732183000000002   
4  -0.63812504~-0.12732183000000002   

                                          V2  \
0  -0.47620879999999993~-0.31337043999999964   
1     0.03819898999999998~0.5278078899999998   
2  -0.47620879999999993~-0.31337043999999964   
3  -0.47620879999999993~-0.31337043999999964   
4   -0.31337043999999964~0.03819898999999998   

                                      V3  \
0                  -1.22064355~-0.599138   
1  0.2007423999999995~0.7380863399999998   
2  0.2007423999999995~0.7380863399

In [27]:
# Export data: combine the encoded features with the class column
tmp = pd.concat([result_cm_10, y_list], axis=1)
# Save the discretized dataset as a CSV file
tmp.to_csv('sc_cm_phoneme_10int.csv', index=False)

## 2.4 ChiMerge with 15 intervals

In [28]:
# ScorecardBundle - ChiMerge with max 15 intervals
# Complete Pipeline

import time
start = time.time()  # start time, used to measure execution time

# Perform discretization
y = phoneme['class']
trans_cm_15 = cm.ChiMerge(max_intervals=15, 
                       min_intervals=2,
                       confidence_level = 0.95, # alpha = 0.05
                       decimal=None, 
                       output_dataframe=True)
result_cm_15 = trans_cm_15.fit_transform(phoneme[num_list], y) 
trans_cm_15.boundaries_ # see the interval boundaries for each feature

# Summarise result
feature_doc_15 = pd.DataFrame({
    'feature':num_list
})
feature_doc_15['num_intervals'] = feature_doc_15['feature'].map(result_cm_15.nunique().to_dict())
feature_doc_15['min_interval_size'] = [fia.feature_stat(result_cm_15[col].values,y.values)['sample_size'].min() for col in feature_doc_15['feature']]
print('Summary discretization result')
print(feature_doc_15)
print('='*20)

# Encoding data after discretization (mapping interval)
print('Before encoding')
print(result_cm_15.head())

## Import label encoder
from sklearn import preprocessing 
label_encoder = preprocessing.LabelEncoder()
  
## Encode labels for all feature columns after discretization.
result_cm_15 = result_cm_15.apply(label_encoder.fit_transform)
print('='*20)
print('After encoding')
print(result_cm_15.head())

end = time.time()
print('='*20)
print('Execution time: ', end - start)  # total execution time of this cell, in seconds

Summary discretization result
  feature  num_intervals  min_interval_size
0      V1             15                 54
1      V2             15                 54
2      V3             15                 54
3      V4             13                 54
4      V5             15                 54
Before encoding
                                V1                                         V2  \
0            0.23532028~0.74940586  -0.47620879999999993~-0.31337043999999964   
1          -0.64486956~-0.63812504     0.03819898999999998~0.5278078899999998   
2    0.74940586~1.5084914900000004  -0.47620879999999993~-0.31337043999999964   
3  -0.63812504~-0.5630595000000002  -0.47620879999999993~-0.31337043999999964   
4  -0.63812504~-0.5630595000000002   -0.31337043999999964~0.03819898999999998   

                                      V3  \
0                  -1.22064355~-0.599138   
1  0.2007423999999995~0.7380863399999998   
2  0.2007423999999995~0.7380863399999998   
3                         1

In [29]:
# Export data: combine the encoded features with the class column
tmp = pd.concat([result_cm_15, y_list], axis=1)
# Save the discretized dataset as a CSV file
tmp.to_csv('sc_cm_phoneme_15int.csv', index=False)
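
The four pipelines in sections 2.1 to 2.4 differ only in `max_intervals` and in the output file name, so they could be driven by one loop. The sketch below shows one possible shape for that loop. The function `run_variants` and its two callable arguments are hypothetical: in the notebook, `discretize_and_encode` would wrap the `cm.ChiMerge(...).fit_transform(...)` and `LabelEncoder` steps, and `export` would wrap `pd.concat` plus `to_csv`. Stubs stand in for both here so that the example runs without the dataset. It uses `time.perf_counter()`, a monotonic clock intended for measuring durations, in place of `time.time()`.

```python
import time


def run_variants(discretize_and_encode, export, max_intervals_list=(6, 8, 10, 15)):
    """Run one discretization per setting, export it, and record elapsed seconds."""
    timings = {}
    for k in max_intervals_list:
        start = time.perf_counter()
        encoded = discretize_and_encode(k)  # e.g. ChiMerge + label encoding
        export(encoded, f"sc_cm_phoneme_{k}int.csv")  # same file names as above
        timings[k] = time.perf_counter() - start
    return timings


# Stubs standing in for the real discretization and export steps.
exported_paths = []


def fake_discretize(k):
    return {"max_intervals": k}


def fake_export(data, path):
    exported_paths.append(path)


timings = run_variants(fake_discretize, fake_export)
for k, seconds in timings.items():
    print(f"max_intervals={k}: {seconds:.6f} s")
```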