<h1> Tabular Data Classification with Differentiable Decision Tree.ipynb </h1>
<h1> Name : Patrick Sutanto </h1>

# Description

In this notebook, we will conduct classification tasks using four distinct datasets: the Iris dataset, the Pima Indians Diabetes Database, the Breast Cancer Wisconsin (Diagnostic) dataset, and the Adult Dataset. Each dataset represents unique challenges in classification, and our goal is to apply deep learning-based methods to explore and solve these tasks.

- Iris Dataset: This classic dataset consists of measurements from three different Iris species. We'll classify samples into one of these three species based on four features: sepal length, sepal width, petal length, and petal width.

- Pima Indians Diabetes Database: This dataset contains diagnostic data collected from Pima Indian women, with the objective of predicting the onset of diabetes based on various health measurements, such as blood glucose level, body mass index, age, and more.

- Breast Cancer Wisconsin (Diagnostic) Dataset: Here, the task is to classify tumor samples as benign or malignant. Note that the CSV file loaded in this notebook contains the nine cytological attributes of the Breast Cancer Wisconsin (Original) data, such as clump thickness, uniformity of cell size and shape, and bare nuclei, rather than the image-derived texture, area, and smoothness features of the Diagnostic variant.

- Adult Dataset: This dataset, commonly used in income prediction tasks, includes features such as age, work class, education, occupation, and more. Our goal is to predict whether a person's income exceeds $50,000 per year, based on these socioeconomic factors.

For each dataset, we will use a primary classification method called the Differentiable Decision Tree. This deep learning technique combines the interpretability of traditional decision trees with the flexibility of gradient-based learning, making it well-suited for diverse data types and complex patterns.
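
As a rough illustration of the idea (a minimal sketch, not the `Node` class defined later in this notebook), a single soft decision node can blend two leaves with a sigmoid gate so that the split parameters receive gradients. The names `soft_node`, `left_leaf`, and `right_leaf` are illustrative assumptions, and the scalar `theta` is a simplification.

```python
import torch

def soft_node(x, beta, theta, left_leaf, right_leaf):
    """One differentiable split: blend two leaf outputs with a sigmoid gate.

    x: (B, D) features; beta: (D,) feature weights; theta: scalar offset;
    left_leaf / right_leaf: (C,) class logits stored at the two leaves.
    """
    gate = torch.sigmoid(x @ beta - theta)   # (B,), soft probability of going left
    gate = gate.unsqueeze(-1)                # (B, 1), broadcasts over the classes
    return gate * left_leaf + (1 - gate) * right_leaf   # (B, C)

torch.manual_seed(0)
x = torch.randn(5, 4)
beta = torch.randn(4, requires_grad=True)
theta = torch.zeros((), requires_grad=True)
left_leaf = torch.tensor([2.0, 0.0, 0.0])
right_leaf = torch.tensor([0.0, 0.0, 2.0])

logits = soft_node(x, beta, theta, left_leaf, right_leaf)
logits[:, 0].sum().backward()   # gradients flow back into the split parameters
```

Because the gate is smooth, the whole tree can be trained with a standard optimizer; a hard rule can be read off afterwards.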

In addition to preprocessing each dataset for model compatibility, we will apply Mixup data augmentation and regularization terms, such as an L2 penalty, to improve model robustness and reduce overfitting. Mixup creates additional training examples by interpolating between pairs of samples, which can help the model generalize to unseen examples.
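
The Mixup step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the helper name `mixup_batch` is hypothetical, and the mixing coefficient is drawn uniformly from [0, 1] (the Beta(1, 1) special case of the distribution used in the Mixup paper).

```python
import numpy as np

def mixup_batch(x, y_onehot, rng):
    """Mix the first half of a batch with the second half.

    x: (2N, D) features, y_onehot: (2N, C) one-hot labels.
    Returns N mixed samples together with their soft labels.
    """
    half = len(x) // 2
    lam = rng.uniform(0.0, 1.0, size=(half, 1))   # one coefficient per pair
    x_mix = lam * x[:half] + (1 - lam) * x[half:2 * half]
    y_mix = lam * y_onehot[:half] + (1 - lam) * y_onehot[half:2 * half]
    return x_mix, y_mix

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]   # one-hot labels for 3 classes
x_mix, y_mix = mixup_batch(x, y, rng)
```

Each mixed label is a convex combination of two one-hot vectors, so its entries still sum to one and it can be used directly in a cross-entropy loss.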

We will preprocess each dataset to ensure compatibility with the Differentiable Decision Tree model, explore underlying data distributions, and adjust hyperparameters to maximize performance. The evaluation process will involve accuracy, macro-F1 score, and a confusion matrix to thoroughly assess model performance across classes. Additionally, we will apply k-fold cross-validation to ensure the model generalizes well across different subsets of the data, promoting robust, reliable results.
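
For reference, the three evaluation metrics can be computed with scikit-learn as shown in this small sketch. The label arrays are made-up toy values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels for a 3-class problem (illustrative values only).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

acc = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean of per-class F1
cm = confusion_matrix(y_true, y_pred)                  # rows: true class, columns: predicted class
```

Macro-F1 gives every class the same weight regardless of its size, which makes it more informative than accuracy on the imbalanced datasets used here.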






# Model

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score
from copy import deepcopy

In [None]:
class Node(nn.Module) :
  """One node of a soft binary decision tree; a node with num_depth == 0 is a leaf."""
  def __init__(self, num_feature, num_output, num_depth) :
    super(Node, self).__init__()
    self.num_depth = num_depth
    init_std = 0.1

    # Split parameters: per-feature weights (beta), per-feature offsets (theta),
    # and a scalar that scales the gate input (weight_a).
    self.beta = nn.Parameter(torch.randn(1, num_feature)*init_std)
    self.theta = nn.Parameter(torch.randn(1, num_feature)*init_std)
    self.weight_a = nn.Parameter(torch.randn(1)*init_std)

    if num_depth > 0 :
      # Internal node: build both subtrees recursively.
      self.child_left = Node(num_feature, num_output, num_depth - 1)
      self.child_right = Node(num_feature, num_output, num_depth - 1)
    else :
      # Leaf: learnable class logits of shape 1 x num_output.
      self.label_weight = nn.Parameter(torch.randn(1, num_output)*init_std)

  def forward(self, x) :
    # Soft routing: x is B x D, the output is B x num_output.
    if self.num_depth == 0 :
      return self.label_weight
    else :
      # Gate in (0, 1): the weight given to the left subtree for each sample.
      gate_value = torch.sum(x*self.beta - self.theta,dim=-1, keepdim = True)*self.weight_a
      gate_value = torch.sigmoid(gate_value)

      left_value = self.child_left(x)
      right_value = self.child_right(x)
      value = gate_value*left_value + (1 - gate_value)*right_value
      return value

  def forward_discrete_estimate(self,x) :
    if self.num_depth == 0 :
      return self.label_weight
    else :
      max_index = self.beta.argmax(-1).squeeze()
      to_compare = self.theta.squeeze()[max_index]/self.beta.squeeze()[max_index]
      is_left =  (x[:, max_index:max_index+1]>= to_compare).float()

      value = is_left*self.child_left.forward_discrete_estimate(x) + (1 - is_left)*self.child_right.forward_discrete_estimate(x)


      return value

  def forward_discrete(self,x) :
    # Hard routing that returns the class index stored at the reached leaf.
    if self.num_depth == 0 :
      return self.label_weight.squeeze().argmax()
    else :
      max_index = self.beta.argmax(-1).squeeze()
      to_compare = self.theta.squeeze()[max_index]/self.beta.squeeze()[max_index]
      is_left =  (x[:, max_index] >= to_compare).float()

      # The soft gate is not needed here: each sample follows exactly one branch.
      value = is_left*self.child_left.forward_discrete(x) + (1 - is_left)*self.child_right.forward_discrete(x)

      return value

  def create_rules(self) :
    self.create_rules_reccur(self.num_depth, True)

  def create_rules_reccur(self, max_depth, is_from_left) :
    if self.num_depth == 0 :
      label_now = self.label_weight.squeeze().argmax()
      label_now = label_now.detach().cpu().numpy()
      num_space = " ".join([""]*4*(max_depth - self.num_depth))
      if self.num_depth != max_depth :
        if is_from_left :
          print(num_space, "LEFT:")
        else :
          print(num_space, "RIGHT:")
      print(num_space, "RETURN", label_now)
    else :
      max_index = self.beta.argmax(-1).squeeze()
      to_compare = self.theta.squeeze()[max_index]/self.beta.squeeze()[max_index]
      # is_left =  (x[:, max_index] >= to_compare).float()

      max_index = max_index.detach().cpu().numpy()
      to_compare =  to_compare.detach().cpu().numpy()
      num_space = " ".join([""]*4*(max_depth - self.num_depth))

      if self.num_depth != max_depth :
        if is_from_left :
          print(num_space, "LEFT:")
        else :
          print(num_space, "RIGHT:")
      print(num_space, "WHEN feature", max_index, "is bigger than", to_compare, "then go to left, else go to right")
      self.child_left.create_rules_reccur(max_depth, True)
      self.child_right.create_rules_reccur(max_depth, False)
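
The rule extraction used by `forward_discrete_estimate` and `create_rules_reccur` can be illustrated on its own. The sketch below mirrors that logic with NumPy: it picks the feature with the largest `beta` and uses `theta / beta` of that feature as the threshold. The helper name `extract_rule` and the parameter values are illustrative assumptions.

```python
import numpy as np

def extract_rule(beta, theta):
    """Collapse a soft split into the hard test 'feature j >= threshold'."""
    j = int(np.argmax(beta))                 # feature with the largest weight
    return j, float(theta[j] / beta[j])      # threshold where beta_j * x_j == theta_j

beta = np.array([0.1, 0.8, -0.3])
theta = np.array([0.0, 0.4, 0.2])
j, threshold = extract_rule(beta, theta)

x = np.array([[0.0, 0.9, 1.0],
              [0.0, 0.1, 1.0]])
goes_left = x[:, j] >= threshold             # one boolean per sample
```

Note that `beta_j * x_j - theta_j >= 0` is equivalent to `x_j >= theta_j / beta_j` only when the selected `beta_j` is positive, so the extracted rule is an approximation of the soft gate rather than an exact copy of it.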


In [None]:
class LinearTree(Node) :
  def __init__(self, num_input, num_feature, num_output, num_depth) :
    super(LinearTree, self).__init__(num_feature, num_output, num_depth)
    self.linear1 = nn.Linear(num_input, num_feature*4)
    self.linear2 = nn.Linear(num_feature*4, num_feature*4)
    self.linear3 = nn.Linear(num_feature*4, num_feature*4)
    self.linear4 = nn.Linear(num_feature*4, num_feature)

  def basic_forward(self, x) :
    yhat = F.relu(self.linear1(x))
    yhat = F.relu(self.linear2(yhat))
    yhat = F.relu(self.linear3(yhat))
    yhat = F.relu(self.linear4(yhat))

    return yhat

  def forward(self, x) :
    # assume x is B x D
    yhat = self.basic_forward(x)
    yhat = super().forward(yhat)

    return yhat

  def forward_discrete_estimate(self,x) :
    yhat = self.basic_forward(x)
    yhat = super().forward_discrete_estimate(yhat)

    return yhat

  def forward_discrete(self,x) :
    yhat = self.basic_forward(x)
    yhat = super().forward_discrete(yhat)

    return yhat



In [None]:
def train_model(model, x_train, y_train,
                num_iteration = 2000,
                batch_size = 512,
                lr =  6e-4,
                l2_weight = 0,
                ent_weight = 0,
                focus_on_cont = False,
                apply_label_weight = True,
                label_weight_power = 1,
                apply_mixup_aug = True) :
  """
  Input :
  model = tree model to train
  x_train = training features, array of shape (N, D)
  y_train = training labels, integer class ids
  num_iteration = number of optimizer steps
  batch_size = mini-batch size; the full training set is used when batch_size (after doubling for Mixup) is at least the training-set size
  lr = learning rate for the Adam optimizer
  l2_weight = weight of the L2 penalty on the offsets (theta) of the tree nodes
  ent_weight = weight of the entropy penalty on softmax(beta), the feature weights of the tree nodes
  focus_on_cont = if True, the forward value is the soft output that aggregates all leaves; if False, it is the discrete output that follows a single path to one leaf
  apply_label_weight = weight each sample's loss by inverse class frequency to counter class imbalance
  label_weight_power = exponent applied to the label weights; values above 1 strengthen the reweighting
  apply_mixup_aug = whether to use Mixup data augmentation (the sampled batch is doubled so that batch_size mixed samples remain)
  """


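  optim = torch.optim.Adam(model.parameters(), lr)
  # Flatten labels to shape (N,): a column vector of shape (N, 1) would broadcast
  # against the (B, C) predictions and distort the loss below.
  y_train = np.asarray(y_train).reshape(-1).astype(np.int64)
  num_label = int(y_train.max()) + 1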
  all_class_label_weight = []
  for label_now_idx in range(num_label) :
    all_class_label_weight.append(np.sum(y_train == label_now_idx))
  all_class_label_weight = np.stack(all_class_label_weight)
  print("Num Label:", all_class_label_weight)
  all_class_label_weight = 1/all_class_label_weight
  all_class_label_weight = all_class_label_weight/all_class_label_weight.min()
  all_class_label_weight = all_class_label_weight**(label_weight_power)

  # all_class_label_weight = np.array([1, 10])
  print("Label Weight:", all_class_label_weight)
  all_class_label_weight = torch.as_tensor(all_class_label_weight).float()

  if apply_mixup_aug :
    batch_size = batch_size*2

  for itr in range(num_iteration) :
    if batch_size >= len(x_train) :
      x_now = torch.as_tensor(x_train.astype(float)).float()
      y_now = torch.LongTensor(y_train)
    else :
      rand_idx = np.random.randint(0, len(x_train), batch_size)
      x_now = torch.as_tensor(x_train[rand_idx].astype(float)).float()
      y_now = torch.LongTensor(y_train[rand_idx])


    loss_weight = all_class_label_weight[y_now].reshape(len(y_now), 1)
    y_now = F.one_hot(y_now, num_label)
    if apply_mixup_aug :
      # Pair sample i with sample i + half and draw one mixing coefficient per pair.
      # Slicing up to 2*half keeps both halves the same length for odd batch sizes.
      half = len(x_now)//2
      rand_unif = np.random.rand(half).reshape(half, 1)
      rand_unif = torch.as_tensor(rand_unif).float()

      x_now = rand_unif*x_now[:half] + (1-rand_unif)*x_now[half:2*half]
      y_now = rand_unif*y_now[:half] + (1-rand_unif)*y_now[half:2*half]
      loss_weight = rand_unif*loss_weight[:half] + (1-rand_unif)*loss_weight[half:2*half]

    yhat = model(x_now)
    discrete_yhat = model.forward_discrete_estimate(x_now)

    if not focus_on_cont :
      yhat = torch.softmax(discrete_yhat + yhat - yhat.detach(), -1)
    else :
      yhat = torch.softmax(yhat + discrete_yhat - discrete_yhat.detach(), -1)
    # yhat = torch.softmax(discrete_yhat , -1)


    l2_total = 0
    ent_total = 0
    counter = 0
    for module_now in model.modules() :
      # isinstance also matches subclasses, so the root of a LinearTree is regularized too.
      if isinstance(module_now, Node) :
        counter = counter + 1

        beta_prob = torch.softmax(module_now.beta,-1) # *x_now - module_now.theta
        ent_now = -torch.sum(beta_prob*torch.log(beta_prob),-1)
        ent_now = torch.mean(ent_now)

        ent_total = ent_total + ent_now



        l2_now = torch.mean(module_now.theta**2) # torch.mean(module_now.beta**2) +

        l2_total = l2_total + l2_now
    l2_total = l2_total/counter
    ent_total = ent_total/counter

    if apply_label_weight :
      loss = loss_weight* torch.sum(y_now*torch.log(yhat),-1)
    else :
      loss = torch.sum(y_now*torch.log(yhat),-1)
    loss = -torch.mean(loss) + l2_weight*l2_total  + ent_weight*ent_total

    optim.zero_grad()
    loss.backward()
    optim.step()

    if itr % 10 == 0 :
      print(itr, loss)
  return model
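
The expressions `discrete_yhat + yhat - yhat.detach()` and `yhat + discrete_yhat - discrete_yhat.detach()` in `train_model` follow the straight-through pattern. The sketch below shows the pattern in isolation with made-up tensors: the forward value comes from one term, while the gradient is taken from the term that is added and then subtracted in detached form.

```python
import torch

soft = torch.tensor([[0.2, 0.8]], requires_grad=True)   # differentiable output
hard = torch.tensor([[0.0, 1.0]])                       # discretized output, no gradient path here

# Forward value equals `hard`; the backward pass behaves as if the output were `soft`.
out = hard + soft - soft.detach()
out.sum().backward()
```

After the backward call, `soft.grad` is a tensor of ones, exactly as if `soft` itself had been summed. In `train_model` the discrete output is not detached, so the leaf logits additionally receive gradients through the hard path.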


In [None]:
def k_fold(x_all, y_all, num_k = 10, use_mlp = False, normalize = True, **train_config) :
  """
  Input :
  x_all = all dataset containing all the feature
  y_all = all label
  num_k = number of the k in the k-fold
  use_mlp = whether to extract features with an MLP before the tree
  normalize = whether to z-normalize the features using training-fold statistics
  train_config = the model training configuration
  """

  all_model = []
  all_results = {
      "confusion_matrix":[],
      "accuracy":[],
      "f1_score":[],
  }

  num_incr = len(x_all)//num_k
  for k in range(num_k) :
    x_train = np.concatenate((x_all[:num_incr*k], x_all[num_incr*(k+1):] ), 0)
    x_test = x_all[num_incr*k:num_incr*(k+1)]
    y_train = np.concatenate((y_all[:num_incr*k], y_all[num_incr*(k+1):] ), 0)
    y_test = y_all[num_incr*k:num_incr*(k+1)]

    if normalize :
      x_mu = x_train.mean(0, keepdims = True)
      x_std = np.var(x_train,0, keepdims=True)**(1/2)

      x_train = (x_train - x_mu)/x_std
      x_test = (x_test - x_mu)/x_std

    if use_mlp :
      model = LinearTree(x_train.shape[1], 16 , int(y_all.max() + 1), 3)
    else :
      model = Node(x_train.shape[1], int(y_all.max() + 1), 3)

    print('=============================================================')
    print(x_train.shape, y_train.shape)
    print(x_test.shape, y_test.shape)
    print(model)
    print('=============================================================')
    print(train_config)
    model = train_model(model, x_train, y_train, **train_config)
    all_model.append(deepcopy(model))

    x_now = torch.as_tensor(x_test.astype(float)).float()
    # Flatten labels to shape (N,) so the comparison with yhat below is element-wise.
    y_now = np.asarray(y_test).reshape(-1).astype(np.int64)
    with torch.no_grad() :
      yhat = model.forward_discrete_estimate(x_now).cpu().numpy()
    yhat = yhat.argmax(-1)

    all_results['accuracy'].append(np.mean((yhat == y_now)))
    all_results['f1_score'].append(f1_score(y_now, yhat, average='macro'))
    all_results['confusion_matrix'].append(confusion_matrix(y_now, yhat))

    print("RULES:")
    model.create_rules()

    print("accuracy:\n", all_results['accuracy'][-1])
    print("f1_score:\n", all_results['f1_score'][-1])
    print("confusion_matrix:\n", all_results['confusion_matrix'][-1])
  return all_results, all_model
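
The fold construction in `k_fold` can be checked in isolation. The sketch below reproduces the contiguous-block split with index arrays only; the helper name `contiguous_folds` is hypothetical, and 767 is the number of Pima samples implied by the shapes printed later in this notebook.

```python
import numpy as np

def contiguous_folds(n_samples, num_k):
    """Yield (train_idx, test_idx) using equal contiguous blocks, as in `k_fold`.

    Samples beyond num_k * (n_samples // num_k) never appear in a test fold.
    """
    size = n_samples // num_k
    idx = np.arange(n_samples)
    for k in range(num_k):
        test_idx = idx[size * k: size * (k + 1)]
        train_idx = np.concatenate((idx[:size * k], idx[size * (k + 1):]))
        yield train_idx, test_idx

folds = list(contiguous_folds(767, 10))
train_idx, test_idx = folds[0]
n_tested = sum(len(test) for _, test in folds)
```

With 767 samples and 10 folds, each test block has 76 samples and each training set has 691, so 7 samples are only ever used for training. The data must be shuffled beforehand, as the notebook does, because the blocks are contiguous.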


# Iris Dataset

In [None]:
!wget https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv

--2024-10-26 12:03:48--  https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3975 (3.9K) [text/plain]
Saving to: ‘iris.csv.7’


2024-10-26 12:03:48 (43.0 MB/s) - ‘iris.csv.7’ saved [3975/3975]



In [None]:
iris_dataset = pd.read_csv("/content/iris.csv")
iris_dataset = iris_dataset.to_numpy()
np.random.shuffle(iris_dataset)

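# Columns 0-3 hold the four measurements; column 4 holds the species name.
x_all = iris_dataset[:,0:4].astype(float)
y_all = iris_dataset[:,4:]
# Map species names to integer class ids (0: Setosa, 1: Virginica, 2: Versicolor).
y_all = (0*(y_all == "Setosa") + 1*(y_all == "Virginica") + 2*(y_all == "Versicolor")).squeeze()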

In [None]:
all_results, all_model = k_fold(x_all, y_all, 10, False, True, batch_size = 64, l2_weight=0.1)

(135, 4) (135,)
(15, 4) (15,)
Node(
  (child_left): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
  (child_right): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
)
{'batch_size': 64, 'l2_weight': 0.1}
Num Label: [45 46 44]
Label Weight: [1.02222222 1.         1.04545455]
0 tensor(1.1090, grad_fn=<AddBackward0>)
10 tensor(1.1030, grad_fn=<AddBackward0>)
20 tensor(1.0929, grad_fn=<AddBackward0>)
30 tensor(1.0934, grad_fn=<AddBackward0>)
40 tensor(1.0854, grad_fn=<AddBackward0>)
50 tensor(1.0998, grad_fn=<AddBackward0>)
60 tensor(1.0795, grad_fn=<AddBackward0>)
70 tensor(1.0901, grad_fn=<AddBackward0>)
80 tensor(1.0837, grad_fn=<AddBackward0>)
90 tensor(1.0844, grad_fn=<AddBackward0>)
100 tensor(1.0698, grad_fn=<AddBackward0

In [None]:
print("accuracy:\n", np.array(all_results['accuracy']).mean())
print("f1_score:\n", np.array(all_results['f1_score']).mean())
print("confusion_matrix:\n", np.array(all_results['confusion_matrix']).mean(0))

accuracy:
 0.6
f1_score:
 0.5746547896547897
confusion_matrix:
 [[3.9 0.6 0.5]
 [0.1 2.2 2.7]
 [0.3 1.8 2.9]]


# Pima Indians Diabetes Database

In [None]:
!wget https://raw.githubusercontent.com/jbrownlee/Datasets/refs/heads/master/pima-indians-diabetes.csv

--2024-10-26 12:08:43--  https://raw.githubusercontent.com/jbrownlee/Datasets/refs/heads/master/pima-indians-diabetes.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23278 (23K) [text/plain]
Saving to: ‘pima-indians-diabetes.csv.3’


2024-10-26 12:08:44 (3.49 MB/s) - ‘pima-indians-diabetes.csv.3’ saved [23278/23278]



In [None]:
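def load_pima_diabetes_data(filepath):
  # The CSV has no header row, so pass header=None; otherwise the first record
  # is consumed as column names and silently dropped from the data.
  data = pd.read_csv(filepath, header=None, names=[
      "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
      "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"
  ])

  return data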

In [None]:
df = load_pima_diabetes_data("/content/pima-indians-diabetes.csv").to_numpy()

In [None]:
pima_dataset = load_pima_diabetes_data("/content/pima-indians-diabetes.csv")
pima_dataset = pima_dataset.to_numpy()
np.random.shuffle(pima_dataset)

x_all = pima_dataset[:,0:-1]
y_all = pima_dataset[:,-1:]

In [None]:
all_results, all_model = k_fold(x_all, y_all, num_k = 10, use_mlp = False, normalize = True, batch_size = 256,
                                l2_weight=0, label_weight_power = 1.5, focus_on_cont=True)

(691, 8) (691, 1)
(76, 8) (76, 1)
Node(
  (child_left): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
  (child_right): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
)
{'batch_size': 256, 'l2_weight': 0, 'label_weight_power': 1.5, 'focus_on_cont': True}
Num Label: [450 241]
Label Weight: [1.        2.5514861]
0 tensor(1.0698, grad_fn=<AddBackward0>)
10 tensor(1.0499, grad_fn=<AddBackward0>)
20 tensor(0.9922, grad_fn=<AddBackward0>)
30 tensor(1.0673, grad_fn=<AddBackward0>)
40 tensor(1.0447, grad_fn=<AddBackward0>)
50 tensor(1.0306, grad_fn=<AddBackward0>)
60 tensor(1.1007, grad_fn=<AddBackward0>)
70 tensor(1.0619, grad_fn=<AddBackward0>)
80 tensor(1.0759, grad_fn=<AddBackward0>)
90 tensor(1.0599, grad_fn=<AddBackward0>)
1

In [None]:
print("accuracy:\n", np.array(all_results['accuracy']).mean())
print("f1_score:\n", np.array(all_results['f1_score']).mean())
print("confusion_matrix:\n", np.array(all_results['confusion_matrix']).mean(0))

accuracy:
 0.6151662049861496
f1_score:
 0.44562724205801035
confusion_matrix:
 [[44.4  5.1]
 [23.2  3.3]]
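
The averaged confusion matrix above implies an accuracy of (44.4 + 3.3) / 76, about 0.628, which differs from the printed 0.615. One plausible explanation for the recorded run (an assumption, not verified against it) is a shape mismatch: labels sliced as `[:, -1:]` have shape (N, 1), and comparing them with predictions of shape (N,) broadcasts to an N x N matrix. The toy arrays below illustrate the effect.

```python
import numpy as np

y_true_col = np.array([[0], [1], [1], [0]])   # shape (4, 1), like labels sliced with [:, -1:]
y_pred = np.array([0, 1, 0, 0])               # shape (4,), like the result of argmax(-1)

pairwise = y_pred == y_true_col               # broadcasts to shape (4, 4)
wrong = np.mean(pairwise)                     # averages over all 16 pairs
right = np.mean(y_pred == y_true_col.reshape(-1))   # element-wise comparison, shape (4,)
```

Here the pairwise mean is 0.5 while the element-wise accuracy is 0.75, so flattening the labels before comparing is the safer choice.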


# The Breast Cancer Wisconsin (Diagnostic) dataset

In [None]:
!wget https://raw.githubusercontent.com/jbrownlee/Datasets/refs/heads/master/breast-cancer-wisconsin.csv

--2024-10-26 12:15:17--  https://raw.githubusercontent.com/jbrownlee/Datasets/refs/heads/master/breast-cancer-wisconsin.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14486 (14K) [text/plain]
Saving to: ‘breast-cancer-wisconsin.csv.2’


2024-10-26 12:15:17 (4.56 MB/s) - ‘breast-cancer-wisconsin.csv.2’ saved [14486/14486]



In [None]:
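def load_breast_cancer_data(filepath):
  # The CSV has no header row, so pass header=None; otherwise the first record
  # is consumed as column names and silently dropped from the data.
  data = pd.read_csv(filepath, header=None, names=[
      "ClumpThickness", "UniformityOfCellSize",
      "UniformityOfCellShape", "MarginalAdhesion", "SingleEpithelialCellSize",
      "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "Class"
  ])

  # Class codes in the file: 2 = benign, 4 = malignant.
  data['Class'] = data['Class'].replace({2: 'Benign', 4: 'Malignant'})

  return data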

In [None]:
breast_cancer_dataset_begin = load_breast_cancer_data("/content/breast-cancer-wisconsin.csv")
breast_cancer_dataset_begin = breast_cancer_dataset_begin.to_numpy()

# Drop rows that contain a missing value, which this file marks with '?'.
has_missing = (breast_cancer_dataset_begin == '?').any(axis=1)
breast_cancer_dataset = breast_cancer_dataset_begin[~has_missing]

np.random.shuffle(breast_cancer_dataset)

x_all = breast_cancer_dataset[:,0:-1].astype(float)
y_all = (breast_cancer_dataset[:,-1:] == "Malignant").astype(float)

In [None]:
all_results, all_model = k_fold(x_all, y_all, num_k = 10, use_mlp = False, normalize = True, batch_size = 256,
                                focus_on_cont=True)

(614, 9) (614, 1)
(68, 9) (68, 1)
Node(
  (child_left): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
  (child_right): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
)
{'batch_size': 256, 'focus_on_cont': True}
Num Label: [403 211]
Label Weight: [1.         1.90995261]
0 tensor(0.8787, grad_fn=<AddBackward0>)
10 tensor(0.9006, grad_fn=<AddBackward0>)
20 tensor(0.9255, grad_fn=<AddBackward0>)
30 tensor(0.8817, grad_fn=<AddBackward0>)
40 tensor(0.9029, grad_fn=<AddBackward0>)
50 tensor(0.9063, grad_fn=<AddBackward0>)
60 tensor(0.8936, grad_fn=<AddBackward0>)
70 tensor(0.9115, grad_fn=<AddBackward0>)
80 tensor(0.9162, grad_fn=<AddBackward0>)
90 tensor(0.9043, grad_fn=<AddBackward0>)
100 tensor(0.9285, grad_fn=<AddBackward0>)

In [None]:
print("accuracy:\n", np.array(all_results['accuracy']).mean())
print("f1_score:\n", np.array(all_results['f1_score']).mean())
print("confusion_matrix:\n", np.array(all_results['confusion_matrix']).mean(0))

accuracy:
 0.6381487889273356
f1_score:
 0.46970545890051263
confusion_matrix:
 [[44.2  0. ]
 [21.6  2.2]]
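
The `Label Weight` values printed during training follow directly from the class counts. The sketch below reproduces the computation from `train_model` for the counts of the fold shown above; the helper name `class_weights` is hypothetical.

```python
import numpy as np

def class_weights(counts, power=1.0):
    """Inverse-frequency weights, scaled so the most frequent class gets weight 1."""
    inv = 1.0 / np.asarray(counts, dtype=float)
    return (inv / inv.min()) ** power

w = class_weights([403, 211])               # counts printed as "Num Label" above
w_strong = class_weights([403, 211], 1.5)   # a larger power up-weights the minority class more
```

With `power=1` the minority class gets weight 403 / 211, about 1.91, matching the printed value. Since the averaged confusion matrix shows most minority-class samples predicted as the majority class, increasing `label_weight_power` is one option to experiment with.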


# Adult Dataset

In [None]:
!wget https://raw.githubusercontent.com/jbrownlee/Datasets/refs/heads/master/adult-all.csv


--2024-10-26 12:21:46--  https://raw.githubusercontent.com/jbrownlee/Datasets/refs/heads/master/adult-all.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5277365 (5.0M) [text/plain]
Saving to: ‘adult-all.csv.2’


2024-10-26 12:21:47 (19.7 MB/s) - ‘adult-all.csv.2’ saved [5277365/5277365]



In [None]:
def preproc_col(str_column) :
  all_str = []
  for str_now in str_column :
    if not (str_now in all_str) :
      all_str.append(str_now)
  all_str = np.array(all_str)

  new_feature = []
  for str_now in str_column :
    now_feat = (all_str == str_now).astype(float)
    new_feature.append(now_feat)
  new_feature = np.stack(new_feature)

  return new_feature
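
An equivalent one-hot encoding can be written without Python loops. This is a sketch of an alternative, not a drop-in replacement: `np.unique` sorts the categories, whereas `preproc_col` keeps them in order of first appearance, so the column order can differ. The helper name `one_hot_column` and the sample values are illustrative.

```python
import numpy as np

def one_hot_column(values):
    """One-hot encode a 1-D array of strings; returns (categories, matrix)."""
    categories, inverse = np.unique(values, return_inverse=True)
    return categories, np.eye(len(categories))[inverse]

col = np.array(["Private", "State-gov", "Private", "Self-emp"])
categories, encoded = one_hot_column(col)
```

Each row of `encoded` contains a single 1 in the column of its category, so the result has one column per distinct value.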

In [None]:
def load_adult_data(filepath):

    data = pd.read_csv(filepath,
                       names=[
                           "age", "workclass", "fnlwgt", "education", "education-num",
                           "marital-status", "occupation", "relationship", "race", "sex",
                           "capital-gain", "capital-loss", "hours-per-week", "native-country",
                           "income"
                       ],
                       na_values=["?"])  # Treat "?" as missing values

    return data

In [None]:
filepath = '/content/adult-all.csv'  # Update with your actual file path
adult_data = load_adult_data(filepath)
# adult_data = adult_data.dropna()
adult_data = adult_data.to_numpy()
np.random.shuffle(adult_data)

In [None]:
all_adult_feat = adult_data[:,0:1]
for col_idx in range(1, adult_data.shape[1] - 1) :
  if type(adult_data[0][col_idx]) == str :
    feat_now = preproc_col(adult_data[:,col_idx])
    all_adult_feat = np.concatenate((all_adult_feat, feat_now),-1)
    print(all_adult_feat.shape)

print("----")

for col_idx in range(1, adult_data.shape[1] - 1) :
  if not (type(adult_data[0][col_idx]) == str) :
    all_adult_feat = np.concatenate((all_adult_feat, adult_data[:,col_idx:col_idx+1 ]),-1)
print(all_adult_feat.shape)

(48842, 10)
(48842, 26)
(48842, 33)
(48842, 48)
(48842, 54)
(48842, 59)
(48842, 61)
(48842, 103)
----
(48842, 108)


In [None]:
x_all = all_adult_feat
y_all = (adult_data[:,-1:] == ">50K").astype(float)

In [None]:
all_results, all_model = k_fold(x_all, y_all, num_k = 10, use_mlp = True, normalize = False, batch_size = 512,
                                focus_on_cont=True)

(43958, 108) (43958, 1)
(4884, 108) (4884, 1)
LinearTree(
  (child_left): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
  (child_right): Node(
    (child_left): Node(
      (child_left): Node()
      (child_right): Node()
    )
    (child_right): Node(
      (child_left): Node()
      (child_right): Node()
    )
  )
  (linear1): Linear(in_features=108, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=64, bias=True)
  (linear3): Linear(in_features=64, out_features=64, bias=True)
  (linear4): Linear(in_features=64, out_features=16, bias=True)
)
{'batch_size': 512, 'focus_on_cont': True}
Num Label: [33437 10521]
Label Weight: [1.         3.17811995]
0 tensor(1.0622, grad_fn=<AddBackward0>)
10 tensor(1.0000, grad_fn=<AddBackward0>)
20 tensor(1.0033, grad_fn=<AddBackward0>)
30 tensor(0.9674, grad_fn=<AddBackward0>)
40 tensor(1.026

In [None]:
print("accuracy:\n", np.array(all_results['accuracy']).mean())
print("f1_score:\n", np.array(all_results['f1_score']).mean())
print("confusion_matrix:\n", np.array(all_results['confusion_matrix']).mean(0))

accuracy:
 0.7607084357084357
f1_score:
 0.43203432772852823
confusion_matrix:
 [[3715.3    0. ]
 [1168.7    0. ]]
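
The confusion matrix above has an empty second column, which means that every sample was predicted as class 0. The sketch below recomputes accuracy and macro-F1 from the averaged matrix to show why the accuracy looks acceptable while the macro-F1 is low. The helper name `macro_f1_from_cm` is hypothetical, and the result only approximates the printed value because the notebook averages per-fold scores.

```python
import numpy as np

cm = np.array([[3715.3, 0.0],
               [1168.7, 0.0]])   # rows: true class, columns: predicted class

def macro_f1_from_cm(cm):
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how often each class occurs
    # F1 = 2*TP / (predicted + actual); defined as 0 when the denominator is 0.
    denom = predicted + actual
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1e-12), 0.0)
    return f1.mean(), f1

macro_f1, per_class = macro_f1_from_cm(cm)
accuracy = np.trace(cm) / cm.sum()
```

The accuracy of about 0.76 equals the share of the majority class, while the F1 of the never-predicted class is 0, which halves the macro average to about 0.43.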


# Reference
- Differentiable Decision Tree, https://proceedings.mlr.press/v108/silva20a/silva20a.pdf
- Mixup Augmentation, https://arxiv.org/pdf/1710.09412
- Straight-Through Estimator, https://arxiv.org/abs/1308.3432
- Adam Optimization, https://arxiv.org/abs/1412.6980