# Classification models - supervised discretization

Dataset: pendigits (REDO) <br>
By: Sam <br>
Updated: 27/04/2023 <br>

====

Summary:<br>
- Import supervised discretized datasets (categorical attributes already encoded)
- Split dataset: stratified random split with `train_test_split` (test_size=0.25, random_state=30, stratify=Y)

- Classification models: Categorical Naive Bayes (CNB), ID3 and KNN-Hamming, each trained on data from two supervised discretizers: ChiMerge (4 settings) and Decision Tree (4 settings)
**For Categorical Naive Bayes: pass the number of categories of each feature in the parameter min_categories to avoid an index-out-of-bounds error**
- Evaluation on testing data: classification report (accuracy, precision, recall, f1-score) + G-mean
- Export models after training: CNB models - joblib; ID3 & KNN-Hamming - skops
- Write model performance to file: 'pendigits_supervised_disc_models.txt'.
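
The `min_categories` note above can be illustrated with a minimal sketch. The toy arrays below are invented for illustration and are not the pendigits data; the point is only that a test row may contain a bin code that never occurred in the training split, and that declaring the number of categories per feature up front reserves a slot for it.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data (hypothetical): feature 0 only takes codes {0, 1} in training,
# feature 1 takes codes {0, 1}.
X_train = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y_train = np.array([0, 1, 0, 1])

# The test row uses code 2 for feature 0, which training never saw.
X_test = np.array([[2, 1]])

# Declare how many categories each feature can take (codes 0..n-1).
min_categories = [3, 2]

clf = CategoricalNB(min_categories=min_categories)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(clf.n_categories_)  # number of categories the model allocated per feature
print(pred)
```

In the notebook, `n_categories` is computed with `nunique()` on the full dataframe. That equals the required size only when the codes of every feature are contiguous and start at 0; if that is not guaranteed, `df[features].max() + 1` is a safer assumption.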

### About Dataset
NUMBER OF ATTRIBUTES: 16 (A1-A16), plus the class column `label`.

NUMBER OF EXAMPLES: 10992

	- training set     8244 (75%, stratified)
	- test set         2748 (25%, stratified)
    
ATTRIBUTES: The attributes are integer codes produced by the discretizer (ChiMerge or Decision Tree).
CLASS: 
	There are 10 decision classes: the digits 0 to 9, with between 1055 and 1144 examples each.

In [1]:
import pandas as pd
from pandas import read_csv
from pandas import set_option
import numpy as np
from numpy import arange
## EDA
from collections import Counter

In [2]:
# Pre-processing
from sklearn.preprocessing import OrdinalEncoder
# Train/test split and cross-validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score # 1 metric
from sklearn.model_selection import cross_validate # more than 1 metric
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [3]:
# For Naive Bayes
from sklearn.naive_bayes import CategoricalNB # Categorical Naive Bayes
from sklearn.naive_bayes import MultinomialNB # Multinomial Naive Bayes (suitable for NLP)
from mixed_naive_bayes import MixedNB # Mixed Naive Bayes for combination of both discrete & continuous feature

In [4]:
# For decision tree ID3 
# https://stackoverflow.com/questions/61867945/python-import-error-cannot-import-name-six-from-sklearn-externals
import six
import sys
sys.modules['sklearn.externals.six'] = six
import mlrose
from id3 import Id3Estimator # ID3 Decision Tree (https://pypi.org/project/decision-tree-id3/)
from id3 import export_graphviz

In [5]:
# Knn-VDM 3
from vdm3 import ValueDifferenceMetric
from sklearn.neighbors import KNeighborsClassifier

In [6]:
# For model evaluation
from sklearn.metrics import classification_report
from sklearn import metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import confusion_matrix
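
The G-mean reported alongside the classification report is, for a multiclass problem, the geometric mean of the per-class recalls. The sketch below recomputes it with scikit-learn and NumPy only; the helper name `g_mean` and the toy labels are hypothetical, and it is assumed to agree with the default multiclass setting of `imblearn.metrics.geometric_mean_score` used later in the notebook.

```python
import numpy as np
from sklearn.metrics import recall_score


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recall values."""
    recalls = recall_score(y_true, y_pred, average=None)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))


# Toy labels (hypothetical): class 1 has one of its two examples misclassified.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]

score = g_mean(y_true, y_pred)
print(score)
```

Because the recalls are multiplied, a single class with recall 0 drives the whole score to 0, which is why G-mean is a stricter summary than accuracy when some digits are harder than others.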

In [7]:
import seaborn as sns
import matplotlib.pyplot as plt

# 1. ChiMerge data

## 1.1 ChiMerge, max intervals = 6

### Data prep

In [8]:
# Complete code for data preparation
# Read data
df_cm1 = pd.read_csv('cm_pendigits_6int.csv')
df_cm1.rename(columns={'class':'label'}, inplace=True)
disc = 'CM'
k = 6

df_cm1.info()
data = df_cm1.values
data.shape

features = df_cm1.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]
#X = df_cm1[features]
#Y = df_cm1['label']

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_cm1[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 1: 1143, 0: 1143, 7: 1142, 6: 1056, 8: 1055, 5: 1055, 9: 1055, 3: 1055})
Class representation - training 
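
The effect of `stratify=Y` in the split above can be checked on a small synthetic example. The arrays below are invented for illustration; only the `train_test_split` arguments mirror the notebook. With an 80/20 class ratio and `test_size=0.25`, both splits should keep the same ratio.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 100 rows, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=30, stratify=y
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print('train:', train_counts)
print('test: ', test_counts)
```

Note that `random_state` fixes which rows land in each split, so reusing the same value across the differently discretized files keeps the comparison on the same row partition as long as the row order of the files is identical.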

In [9]:
# from imblearn.combine import SMOTETomek
# smt_tomek = SMOTETomek(random_state=42)
# x_resample, y_resample = smt_tomek.fit_resample(x_train, y_train)
# # Check labels in training dataset after SMOTE
# pd.Series(y_resample) \
# .value_counts() \
# .plot(kind='bar', title='Class distribution after applying SMOTE Tomek', xlabel='Class')

### Models - ChiMerge, max intervals = 6

In [10]:
# KNN-Hamming: complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'CM'
disc_param = 'k = 6'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, k = {k} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{k}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_CM_6.skops
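
For reference, the `metric='hamming'` setting used in the cell above measures the fraction of attributes on which two rows disagree. A minimal sketch with two made-up rows of bin codes:

```python
import numpy as np
from scipy.spatial.distance import hamming

# Two hypothetical rows of discretized attributes (bin codes).
a = np.array([0, 3, 1, 2])
b = np.array([0, 1, 1, 5])

# Fraction of positions where the codes differ: 2 of 4.
manual = float(np.mean(a != b))

# SciPy's hamming distance uses the same normalized definition.
lib = float(hamming(a, b))

print(manual, lib)
```

The distance only asks whether two codes are equal, so bin 2 versus bin 5 counts the same as bin 2 versus bin 3. The ordering of the intervals produced by the discretizer is therefore ignored by this classifier.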


## 1.2 ChiMerge, max intervals = 8

### Data prep

In [11]:
# Complete code for data preparation
# Read data
df_cm2 = pd.read_csv('cm_pendigits_8int.csv')
df_cm2.rename(columns={'class':'label'}, inplace=True)
disc = 'CM'
k = 8

df_cm2.info()
data = df_cm2.values
data.shape

features = df_cm2.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_cm2[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 1: 1143, 0: 1143, 7: 1142, 6: 1056, 8: 1055, 5: 1055, 9: 1055, 3: 1055})
Class representation - training 

In [12]:
# from imblearn.combine import SMOTETomek
# smt_tomek = SMOTETomek(random_state=42)
# x_resample, y_resample = smt_tomek.fit_resample(x_train, y_train)
# # Check labels in training dataset after SMOTE
# pd.Series(y_resample) \
# .value_counts() \
# .plot(kind='bar', title='Class distribution after applying SMOTE Tomek', xlabel='Class')

### Models - ChiMerge, max intervals = 8

In [13]:
# KNN-Hamming: complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'CM'
disc_param = 'k = 8'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, k = {k} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{k}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_CM_8.skops


## 1.3 ChiMerge, max intervals = 10

### Data prep

In [14]:
# Complete code for data preparation
# Read data
df_cm3 = pd.read_csv('cm_pendigits_10int.csv')
df_cm3.rename(columns={'class':'label'}, inplace=True)
disc = 'CM'
k = 10

df_cm3.info()
data = df_cm3.values
data.shape

features = df_cm3.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_cm3[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 1: 1143, 0: 1143, 7: 1142, 6: 1056, 8: 1055, 5: 1055, 9: 1055, 3: 1055})
Class representation - training 

### Models - ChiMerge, max intervals = 10

In [15]:
# KNN-Hamming: complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'CM'
disc_param = 'k = 10'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, k = {k} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{k}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_CM_10.skops


## 1.4 ChiMerge, max intervals = 15

### Data prep

In [16]:
# Complete code for data preparation
# Read data
df_cm4 = pd.read_csv('cm_pendigits_15int.csv')
df_cm4.rename(columns={'class':'label'}, inplace=True)
disc = 'CM'
k = 15

df_cm4.info()
data = df_cm4.values
data.shape

features = df_cm4.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_cm4[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 1: 1143, 0: 1143, 7: 1142, 6: 1056, 8: 1055, 5: 1055, 9: 1055, 3: 1055})
Class representation - training 

### Models - ChiMerge, max intervals = 15

In [17]:
# KNN-Hamming: complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'CM'
disc_param = 'k = 15'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, k = {k} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{k}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_CM_15.skops


# 2. Decision Tree Discretizer

## 2.1 Decision Tree, max_depth = 2

### Data prep

In [18]:
# Complete code for data preparation
# Read data
df_dt1 = pd.read_csv('DT_small_discretized_pendigits.csv')
df_dt1.rename(columns={'class':'label'}, inplace=True)
disc = 'DT'
max_depth = 2

df_dt1.info()
data = df_dt1.values
data.shape

features = df_dt1.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_dt1[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 0: 1143, 1: 1143, 7: 1142, 6: 1056, 9: 1055, 5: 1055, 3: 1055, 8: 1055})
Class representation - training 

### Models - DT, max_depth = 2

In [19]:
# Knn-Hamming complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'DT'
disc_param = 'max_depth = 2'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, max_depth = {max_depth} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{max_depth}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_DT_2.skops


## 2.2 Decision Tree, max_depth = 3

### Data prep

In [20]:
# Complete code for data preparation
# Read data
df_dt2 = pd.read_csv('DT_medium_discretized_pendigits.csv')
df_dt2.rename(columns={'class':'label'}, inplace=True)
disc = 'DT'
max_depth = 3

df_dt2.info()
data = df_dt2.values
data.shape

features = df_dt2.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_dt2[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 0: 1143, 1: 1143, 7: 1142, 6: 1056, 9: 1055, 5: 1055, 3: 1055, 8: 1055})
Class representation - training 

### Models - DT, max_depth = 3

In [21]:
# Knn-Hamming complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'DT'
disc_param = 'max_depth = 3'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, max_depth = {max_depth} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{max_depth}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_DT_3.skops


## 2.3 Decision Tree, max_depth = 4

### Data prep

In [22]:
# Complete code for data preparation
# Read data
df_dt3 = pd.read_csv('DT_large_discretized_pendigits.csv')
df_dt3.rename(columns={'class':'label'}, inplace=True)
disc = 'DT'
max_depth = 4

df_dt3.info()
data = df_dt3.values
data.shape

features = df_dt3.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_dt3[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 0: 1143, 1: 1143, 7: 1142, 6: 1056, 9: 1055, 5: 1055, 3: 1055, 8: 1055})
Class representation - training 

### Models - DT, max_depth = 4

In [23]:
# Knn-Hamming complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'DT'
disc_param = 'max_depth = 4'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, max_depth = {max_depth} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{max_depth}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_DT_4.skops


## 2.4 Decision Tree, max_depth = 5

### Data prep

In [24]:
# Complete code for data preparation
# Read data
df_dt4 = pd.read_csv('DT_verylarge_discretized_pendigits.csv')
df_dt4.rename(columns={'class':'label'}, inplace=True)
disc = 'DT'
max_depth = 5

df_dt4.info()
data = df_dt4.values
data.shape

features = df_dt4.drop('label', axis = 1).columns

# separate the data into X and y
X = data[:, : len(features)]
Y = data[:,-1]

print(X.shape, Y.shape)

# Split train test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state = 30, stratify=Y)

# Check representation of class
print('Class representation - original: ', Counter(Y)) 
print('Class representation - training data: ', Counter(y_train)) 
print('Class representation - testing data: ', Counter(y_test)) 

# Check number of categories for features
n_categories = df_dt4[features].nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10992 entries, 0 to 10991
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A1      10992 non-null  int64
 1   A2      10992 non-null  int64
 2   A3      10992 non-null  int64
 3   A4      10992 non-null  int64
 4   A5      10992 non-null  int64
 5   A6      10992 non-null  int64
 6   A7      10992 non-null  int64
 7   A8      10992 non-null  int64
 8   A9      10992 non-null  int64
 9   A10     10992 non-null  int64
 10  A11     10992 non-null  int64
 11  A12     10992 non-null  int64
 12  A13     10992 non-null  int64
 13  A14     10992 non-null  int64
 14  A15     10992 non-null  int64
 15  A16     10992 non-null  int64
 16  label   10992 non-null  int64
dtypes: int64(17)
memory usage: 1.4 MB
(10992, 16) (10992,)
Class representation - original:  Counter({2: 1144, 4: 1144, 0: 1143, 1: 1143, 7: 1142, 6: 1056, 9: 1055, 5: 1055, 3: 1055, 8: 1055})
Class representation - training 

### Models - DT, max_depth = 5

In [25]:
# Knn-Hamming complete code

model = 'KNN-Hamming'
dataset = 'pendigits'
discretizer = 'DT'
disc_param = 'max_depth = 5'

f = open("pendigits_supervised_disc_models.txt", "a")
import time
start = time.time() # For measuring time execution

# Knn-Hamming complete code
knn_hamming = KNeighborsClassifier(n_neighbors=3, metric='hamming', algorithm='auto')
knn_hamming.fit(x_train, y_train)

# Testing
y_pred_knn = knn_hamming.predict(x_test)
knn_hamming.classes_
print(f'Models results: model {model}, dataset {dataset}, discretization {discretizer} with parameter {disc_param}', 
      file = f)
print('Classification report', file = f)
print(classification_report(y_test, y_pred_knn), file = f)

from imblearn.metrics import geometric_mean_score as gmean
print('G-mean:', gmean(y_test, y_pred_knn),file = f)

end = time.time()
print(f'Time for training model {model}- default, {disc}, max_depth = {max_depth} is: {end - start}.', file = f) # Total time execution
print('=='*20, file = f)
f.close()

# Save models
import skops.io as sio
model_name = f"{dataset}_{model}_{discretizer}_{max_depth}.skops"
print(model_name)
obj = sio.dump(knn_hamming, model_name)

pendigits_KNN-Hamming_DT_5.skops
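
The eight model cells above differ only in the input file and the discretizer setting, so the fit, evaluate and log steps could be wrapped in one function. The sketch below is one possible refactoring, not the code that produced the results: the function name `fit_and_log`, the toy arrays and the temporary log file are hypothetical, the G-mean is recomputed from per-class recall instead of calling imblearn, and model export with skops is left out.

```python
import os
import tempfile
import time

import numpy as np
from sklearn.metrics import classification_report, recall_score
from sklearn.neighbors import KNeighborsClassifier


def fit_and_log(x_train, y_train, x_test, y_test, header, log_path):
    """Fit KNN-Hamming, evaluate on the test split and append results to a log."""
    start = time.perf_counter()
    clf = KNeighborsClassifier(n_neighbors=3, metric='hamming')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    elapsed = time.perf_counter() - start  # covers fitting and prediction

    # G-mean as the geometric mean of per-class recall.
    recalls = recall_score(y_test, y_pred, average=None)
    g_mean = float(np.prod(recalls) ** (1.0 / len(recalls)))

    # The context manager closes the file even if a print call fails.
    with open(log_path, 'a') as f:
        print(header, file=f)
        print('Classification report', file=f)
        print(classification_report(y_test, y_pred), file=f)
        print('G-mean:', g_mean, file=f)
        print(f'Time for fit + predict: {elapsed:.3f} s', file=f)
        print('==' * 20, file=f)
    return clf, g_mean


# Tiny demonstration on made-up bin codes, logged to a temporary file.
x_tr = np.array([[0, 0], [0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
x_te = np.array([[0, 0], [1, 1]])
y_te = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, 'demo_models.txt')
    demo_clf, demo_g_mean = fit_and_log(
        x_tr, y_tr, x_te, y_te, 'demo: KNN-Hamming on toy data', demo_log
    )
    with open(demo_log) as f:
        log_text = f.read()

print(demo_g_mean)
```

In the notebook this would be called once per discretized file, passing the header string built from `model`, `dataset`, `discretizer` and `disc_param`, with `log_path` set to the results file already used above.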
