# Classification Tree Optimization Project

## Course Information

This project was developed for the "Optimization" class, taught by Professor Francesca Maggioni, as part of the Master's Degree in Computer Engineering.

## Assignment

The task for this course is to **implement one of the classification trees** presented in the paper by Bertsimas and Dunn and apply it to a real-life dataset discussed in the article.

### Paper Reference

- **Title**: Optimal Classification Trees
- **Authors**: Dimitris Bertsimas and Julia Dunn
- **Journal**: Machine Learning
- **Volume**: 106
- **Pages**: 1039–1082
- **Year**: 2017
- **DOI**: [10.1007/s10994-017-5633-9](https://doi.org/10.1007/s10994-017-5633-9)

### Objective

The objective of this project is to:

1. Implement a classification tree based on the methodology described in the paper.
2. Apply the implemented model to a real-world dataset to evaluate its performance and effectiveness.

### Overview of the Implementation

In this notebook, we will:

1. **Load and Prepare Data**: Import the dataset and preprocess it for use in the classification tree model.
2. **Normalize the Data**: Rescale every feature to the [0, 1] range with min-max normalization.
3. **Split the Dataset**: Divide the data into training and test sets for model evaluation.
4. **Configure the Model**: Set up the parameters and constants for the classification tree model.
5. **Train and Tune the Model**: Train the model using the training data and fine-tune it for optimal performance.
6. **Compare Different Solvers**: Evaluate the performance of various solvers in solving the classification tree model.
7. **Evaluate and Report Results**: Assess the model’s performance on both training and test data and present the results.

By following these steps, we aim to understand the application of optimization techniques in classification problems and gain insights into the performance of classification trees on real-world data.



## Setup Instructions

To ensure a smooth development process, it is recommended to use a virtual environment. This helps manage dependencies and avoid conflicts. You can choose between `venv` or `conda` for creating the virtual environment.

### Using `venv`

1. **Create a virtual environment**:
    ```bash
    python -m venv myenv
    ```

2. **Activate the virtual environment**:
    - On Windows:
      ```bash
      myenv\Scripts\activate
      ```
    - On macOS and Linux:
      ```bash
      source myenv/bin/activate
      ```

3. **Install dependencies**:
    ```bash
    pip install -r requirements.txt
    ```

### Using `conda`

1. **Create a new conda environment** with Python 3.12:
    ```bash
    conda create -n myenv python=3.12
    ```

2. **Activate the conda environment**:
    ```bash
    conda activate myenv
    ```

3. **Install dependencies**:
    ```bash
    pip install -r requirements.txt
    ```

### Python Version

Ensure you are using **Python 3.12** for compatibility with the dependencies and the project code.
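A quick way to confirm the interpreter version before installing anything is a small check like the one below. This is only a sketch; the helper name `is_supported_python` is hypothetical and not part of the project.

```python
import sys


def is_supported_python(version=None, required=(3, 12)):
    """Return True if `version` (major, minor) matches the required release.

    `is_supported_python` is a hypothetical helper, not part of this project.
    """
    if version is None:
        version = sys.version_info[:2]
    return tuple(version) == tuple(required)


if __name__ == "__main__":
    current = ".".join(map(str, sys.version_info[:3]))
    status = "OK" if is_supported_python() else "expected Python 3.12"
    print(f"Running Python {current}: {status}")
```

Running it inside the activated environment shows at a glance whether `venv` or `conda` picked up the intended interpreter.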

### Setting constants and model parameters

In this section, we define key constants and parameters that will be used throughout the model training and evaluation process.
These include the random seed for reproducibility, dataset split ratios, and model-specific parameters.

In [1]:
# For reproducibility
SEED = 26

# Split parameters
TRAIN_SIZE = 0.6
TEST_SIZE = 1 - TRAIN_SIZE

# Model parameters
ALPHA = 0.01 # complexity
MIN_SAMPLES_PER_LEAF = 0 # minimum number of samples per leaf
MAX_DEPTH = 3 # max depth of the tree
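
For orientation, these three constants correspond to the hyperparameters of the OCT formulation in Bertsimas and Dunn (2017). Assuming `MIOTree` follows the paper's notation (its source is not shown here), `ALPHA` plays the role of the complexity parameter $\alpha$, `MIN_SAMPLES_PER_LEAF` of $N_{\min}$, and `MAX_DEPTH` of $D$. In simplified form, the objective trades misclassification against the number of splits:

```latex
\min \; \frac{1}{\hat{L}} \sum_{t \in \mathcal{T}_L} L_t \;+\; \alpha \sum_{t \in \mathcal{T}_B} d_t
\qquad \text{subject to} \qquad
N_t \ge N_{\min}\, l_t \quad \forall t \in \mathcal{T}_L
```

Here $L_t$ is the number of misclassified points in leaf $t$, $\hat{L}$ is a baseline used to normalize the error (the paper derives it from predicting the most frequent class for every point), $d_t \in \{0, 1\}$ indicates whether branch node $t$ applies a split, $N_t$ is the number of points in leaf $t$, and $l_t \in \{0, 1\}$ indicates whether leaf $t$ contains any points. A tree of maximum depth $D$ has $2^{D+1} - 1$ nodes, split into branch nodes $\mathcal{T}_B$ and leaf nodes $\mathcal{T}_L$. The remaining tree-structure constraints are omitted here. With `MIN_SAMPLES_PER_LEAF = 0`, the leaf-size constraint is always satisfied.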


### Installing required packages and importing libraries

In this section, we install the necessary packages and import the required libraries for our project. Some packages are pinned to specific versions for reproducibility and compatibility.

In [2]:
# numpy is pinned below 2.0 until pyomo supports it; update the pin once it does
%pip install numpy==1.26.4 \
    scipy \
    matplotlib \
    scikit-learn \
    ucimlrepo \
    pandas \
    pyomo==6.7.3 --quiet

import numpy as np
import pandas as pd
import time
import pyomo.environ as pyo
import importlib
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from src import MIOTree
from sklearn import tree

Note: you may need to restart the kernel to use updated packages.


### Utility functions

This section defines two utility functions: one for printing the confusion matrix and another for extracting the class predicted by each leaf of a solved model.

In [3]:
def print_confusion_matrix(classes, confusion_matrix):
    """Print the confusion matrix."""
    print('Confusion Matrix:')
    print('Act/Pred\t' + '\t'.join(np.char.mod('(%d)', classes)))
    for index in range(len(confusion_matrix)):
        print(f'({classes[index]})\t\t', end='')
        for val in confusion_matrix[index]:
            print(val, end='\t')
        print()

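def extract_leaf_predictions(model):
    """Return the class label assigned to each leaf, ordered by leaf node index."""
    classes = np.unique(model.y_train)
    pyomo_model = model.pyomo_model

    num_leaf_nodes = len(pyomo_model.leaf_nodes)
    leaf_predictions = [None] * num_leaf_nodes

    # Map the model's class indices back to the original labels
    class_index_to_label = {i: classes[i] for i in pyomo_model.classes_indices}

    for i in pyomo_model.classes_indices:
        for j in pyomo_model.leaf_nodes:
            value = pyomo_model.c[i, j].value
            # Solvers can return binaries as 0.999...; avoid an exact == 1 test
            if value is not None and value > 0.5:
                # Leaf nodes are numbered after the branch nodes; shift to 0-based
                leaf_index = j - num_leaf_nodes
                if 0 <= leaf_index < num_leaf_nodes:
                    leaf_predictions[leaf_index] = class_index_to_label[i]

    return leaf_predictions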
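
The `j - num_leaf_nodes` offset in `extract_leaf_predictions` appears to rely on the heap-style numbering visible in the printed trees further down: the root is node 1, the children of node `t` are `2t` and `2t + 1`, and a tree of depth `D` has its leaves at `2^D` through `2^(D+1) - 1`. A small sketch of that numbering, using hypothetical helper names that are not part of `src.MIOTree`:

```python
def branch_nodes(depth):
    """Indices of branch (internal) nodes of a complete tree: 1 .. 2^depth - 1."""
    return list(range(1, 2 ** depth))


def leaf_nodes(depth):
    """Indices of leaf nodes: 2^depth .. 2^(depth + 1) - 1."""
    return list(range(2 ** depth, 2 ** (depth + 1)))


def children(node):
    """Left and right child of a node under heap-style numbering."""
    return 2 * node, 2 * node + 1


def leaf_position(node, depth):
    """Zero-based position of a leaf, i.e. the `j - num_leaf_nodes` offset."""
    num_leaves = 2 ** depth
    if not num_leaves <= node < 2 * num_leaves:
        raise ValueError(f"node {node} is not a leaf of a depth-{depth} tree")
    return node - num_leaves


print(branch_nodes(2))       # [1, 2, 3]
print(leaf_nodes(2))         # [4, 5, 6, 7]
print(children(3))           # (6, 7)
print(leaf_position(6, 2))   # 2
```

For a depth-2 tree this reproduces the node labels `(1)` to `(7)` that `print_tree` emits.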


### Fetching and processing the dataset

In this section, we fetch the Iris dataset (id 53) from the UCI Machine Learning Repository, convert it to NumPy arrays, and print its metadata and variable information.

In [4]:
# fetch dataset 
iris = fetch_ucirepo(id=53) 
  
# data (as pandas dataframes) 
X = iris.data.features
y = iris.data.targets

# convert to numpy
X = X.to_numpy()
y = y.to_numpy()
  
# metadata 
print(iris.metadata) 
  
# variable information 
print(iris.variables) 

{'uci_id': 53, 'name': 'Iris', 'repository_url': 'https://archive.ics.uci.edu/dataset/53/iris', 'data_url': 'https://archive.ics.uci.edu/static/public/53/data.csv', 'abstract': 'A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.\n', 'area': 'Biology', 'tasks': ['Classification'], 'characteristics': ['Tabular'], 'num_instances': 150, 'num_features': 4, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1936, 'last_updated': 'Tue Sep 12 2023', 'dataset_doi': '10.24432/C56C76', 'creators': ['R. A. Fisher'], 'intro_paper': {'title': 'The Iris data set: In search of the source of virginica', 'authors': 'A. Unwin, K. Kleinman', 'published_in': 'Significance, 2021', 'year': 2021, 'url': 'https://www.semanticscholar.org/paper/4599862ea877863669a6a8e63a3c707a787d5d7e', 'doi': '1740-9713.01589'}, 'add
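
`fetch_ucirepo` needs network access. When that is unavailable, scikit-learn ships a copy of the same dataset. The sketch below is an alternative to the cell above, not what this notebook runs. Note that scikit-learn's class names lack the `Iris-` prefix, and its bundled copy may differ slightly from the UCI file.

```python
from sklearn.datasets import load_iris

# Offline alternative (assumption: scikit-learn's bundled Iris copy is acceptable)
iris_local = load_iris()

X_local = iris_local.data        # shape (150, 4), float features
y_local = iris_local.target      # shape (150,), integer labels 0, 1, 2

print(X_local.shape, y_local.shape)
print(iris_local.target_names.tolist())
```

Because the labels are already integers here, the `LabelEncoder` step would not be needed with this variant.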

### Data normalization and train-test split

In this section, we rescale every feature to the [0, 1] range with min-max normalization, encode the class labels as integers, and split the data into training and test sets.

In [5]:
from sklearn.preprocessing import MinMaxScaler

# Normalize the data
scaler = MinMaxScaler(feature_range=(0, 1))
X_std = scaler.fit_transform(X)

# Convert the target to integers; ravel() flattens the (n, 1) column
# into the 1-D array that LabelEncoder expects
le = LabelEncoder()
y_converted = le.fit_transform(y.ravel())

# Print mapping
print('Class to integer mapping:')
for i, item in enumerate(le.classes_):
    print(item, '-->', i)


X_train, X_test, y_train, y_test = train_test_split(
    X_std, 
    y_converted, 
    train_size=TRAIN_SIZE,
    test_size=TEST_SIZE,
    random_state=SEED)

Class to integer mapping:
Iris-setosa --> 0
Iris-versicolor --> 1
Iris-virginica --> 2
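
The cell above fits `MinMaxScaler` on the full dataset before splitting, so the test rows influence the scaling range. A common alternative is to compute the minimum and maximum on the training rows only and reuse them for the test rows. The sketch below spells out the min-max formula with plain NumPy on a small made-up array. `minmax_fit` and `minmax_apply` are hypothetical helpers, and with this approach test values may fall slightly outside [0, 1].

```python
import numpy as np


def minmax_fit(X_train):
    """Return per-feature minimum and range computed on the training rows only."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    # Guard against constant features to avoid division by zero
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range


def minmax_apply(X, col_min, col_range):
    """Apply x' = (x - min) / (max - min) with previously fitted statistics."""
    return (X - col_min) / col_range


# Made-up data for illustration only
X_tr = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
X_te = np.array([[2.0, 35.0]])

col_min, col_range = minmax_fit(X_tr)
print(minmax_apply(X_tr, col_min, col_range))
print(minmax_apply(X_te, col_min, col_range))
```

The same effect can be had with scikit-learn by calling `fit_transform` on the training split and `transform` on the test split.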




### Model Comparison and Evaluation

In this section, we will compare the performance of the `MIOTree` model with different hyperparameter settings. We will vary the `max_depth` and `alpha` parameters to evaluate their impact on model performance.

In [6]:
# Comparisons
for max_depth in [2, 3]:
    for alpha in [1, 0.1]:
        model = MIOTree.MIOTree(
            alpha=alpha, 
            min_samples_per_leaf=MIN_SAMPLES_PER_LEAF, 
            max_depth=max_depth,
            X_train=X_train, 
            y_train=y_train)
        
        init_time = time.time()
        model.solve('gurobi')
        end_time = time.time()

        accuracy_train = model.calculate_accuracy()
        confusion_matrix = model.calculate_confusion_matrix()
        accuracy_test = model.calculate_accuracy(X_test, y_test)

        # Reuse the accuracies computed above instead of recomputing them
        model.print_log(
            duration=end_time - init_time,
            accuracy_train=accuracy_train,
            accuracy_test=accuracy_test)
        print_confusion_matrix(np.unique(model.y_train), confusion_matrix)

        leaf_predictions = extract_leaf_predictions(model)
        model.tree.print_tree(leaf_predictions, a=model.pyomo_model.a, b=model.pyomo_model.b)
        print()

MIO	Depth: 2	Alpha: 1	Duration: 4.17625880241394
	Accuracy train: 0.9111111111111111	Accuracy test: 0.9166666666666666
Confusion Matrix:
Act/Pred	(0)	(1)	(2)
(0)		29	0	0	
(1)		4	25	1	
(2)		0	3	28	
Root: (1) val: None b: 0.37500000000000444
    L--- (2) val: 0 b: 0.0
        L--- (4) val: 2 b: None
        R--- (5) val: None b: None
    R--- (3) val: 1 b: 0.7083333333333334
        L--- (6) val: None b: None
        R--- (7) val: None b: None

MIO	Depth: 2	Alpha: 0.1	Duration: 2.2820680141448975
	Accuracy train: 0.9555555555555556	Accuracy test: 0.9666666666666667
Confusion Matrix:
Act/Pred	(0)	(1)	(2)
(0)		29	0	0	
(1)		0	29	1	
(2)		0	3	28	
Root: (1) val: None b: 0.1694915254237288
    L--- (2) val: 0 b: 0.0
        L--- (4) val: 2 b: None
        R--- (5) val: None b: None
    R--- (3) val: 1 b: 0.7083333333333334
        L--- (6) val: None b: None
        R--- (7) val: None b: None

MIO	Depth: 3	Alpha: 1	Duration: 157.6781919002533
	Accuracy train: 0.9888888888888889	Accuracy test: 0.

The next cell solves the same models, this time warm-starting the solver with a CART solution.
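
`cart_model` and `apply_cart_to_mio` live in `src.MIOTree` and are not shown here. As a rough sketch of the idea, assuming the warm start copies the split feature and threshold of each CART node into the corresponding heap-numbered branch node, a scikit-learn tree can be traversed like this. The function name `cart_splits_by_heap_index` is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def cart_splits_by_heap_index(fitted_tree):
    """Map heap-style node index -> (feature, threshold) for every CART split.

    Root is 1; children of node t are 2t (left) and 2t + 1 (right).
    """
    t = fitted_tree.tree_
    splits = {}
    stack = [(0, 1)]  # (sklearn node id, heap index)
    while stack:
        node_id, heap_idx = stack.pop()
        left, right = t.children_left[node_id], t.children_right[node_id]
        if left == right:  # sklearn marks leaves with -1 for both children
            continue
        splits[heap_idx] = (int(t.feature[node_id]), float(t.threshold[node_id]))
        stack.append((int(left), 2 * heap_idx))
        stack.append((int(right), 2 * heap_idx + 1))
    return splits


X_demo, y_demo = load_iris(return_X_y=True)
cart = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

for heap_idx, (feature, threshold) in sorted(cart_splits_by_heap_index(cart).items()):
    print(f"node {heap_idx}: x[{feature}] <= {threshold:.3f}")
```

scikit-learn sends samples with `x[feature] <= threshold` to the left, whereas the paper's formulation uses a strict inequality for the left branch, so an actual warm start may need to adjust the thresholds accordingly.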

In [7]:
# Comparisons
for max_depth in [2, 3]:
    for alpha in [1, 0.1]:
        model = MIOTree.MIOTree(
            alpha=alpha, 
            min_samples_per_leaf=MIN_SAMPLES_PER_LEAF, 
            max_depth=max_depth,
            X_train=X_train, 
            y_train=y_train)
        
        cart_model, duration = model.cart_model(X_train, y_train, max_depth, min_samples_per_leaf=MIN_SAMPLES_PER_LEAF, alpha=alpha)
        model.apply_cart_to_mio(cart_model.tree_)
        init_time = time.time()
        model.solve('gurobi', warmstart=True)
        end_time = time.time()

        accuracy_train = model.calculate_accuracy()
        confusion_matrix = model.calculate_confusion_matrix()
        accuracy_test = model.calculate_accuracy(X_test, y_test)

        # Reuse the accuracies computed above instead of recomputing them
        model.print_log(
            duration=end_time - init_time,
            accuracy_train=accuracy_train,
            accuracy_test=accuracy_test)
        print_confusion_matrix(np.unique(model.y_train), confusion_matrix)

        leaf_predictions = extract_leaf_predictions(model)
        model.tree.print_tree(leaf_predictions, a=model.pyomo_model.a, b=model.pyomo_model.b)
        print()

MIO	Depth: 2	Alpha: 1	Duration: 6.048969984054565
	Accuracy train: 0.8555555555555555	Accuracy test: 0.8333333333333334
Confusion Matrix:
Act/Pred	(0)	(1)	(2)
(0)		29	0	0	
(1)		4	26	0	
(2)		0	9	22	
Root: (1) val: None b: 0.3750000000000029
    L--- (2) val: 0 b: 0.0
        L--- (4) val: 2 b: None
        R--- (5) val: None b: None
    R--- (3) val: 1 b: 0.708333333333347
        L--- (6) val: None b: None
        R--- (7) val: None b: None

MIO	Depth: 2	Alpha: 0.1	Duration: 1.9832262992858887
	Accuracy train: 0.9555555555555556	Accuracy test: 0.95
Confusion Matrix:
Act/Pred	(0)	(1)	(2)
(0)		29	0	0	
(1)		0	29	1	
(2)		0	3	28	
Root: (1) val: 0 b: 0.2083333333333332
    L--- (2) val: 0 b: 0.0
        L--- (4) val: 2 b: None
        R--- (5) val: None b: None
    R--- (3) val: 1 b: 0.7083333333333333
        L--- (6) val: None b: None
        R--- (7) val: None b: None

MIO	Depth: 3	Alpha: 1	Duration: 162.17447304725647
	Accuracy train: 0.9888888888888889	Accuracy test: 0.9833333333333333
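
The reported training accuracy can be cross-checked against each confusion matrix: it is the sum of the diagonal divided by the total number of samples. For the warm-started run with depth 2 and alpha 0.1 above, that gives (29 + 29 + 28) / 90 ≈ 0.9556. A minimal sketch with a hypothetical helper `accuracy_from_confusion`:

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified samples (diagonal) / all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrix printed for the warm-started run with depth 2 and alpha 0.1
matrix = [
    [29, 0, 0],
    [0, 29, 1],
    [0, 3, 28],
]

print(round(accuracy_from_confusion(matrix), 4))  # 0.9556
```

The same check on the depth 2, alpha 1 warm-started matrix gives (29 + 26 + 22) / 90 ≈ 0.8556, matching the logged value.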


### Creating and tuning the model

In this section, we create an instance of the `MIOTree` model using the defined parameters and then tune the model using the test data.


In [8]:
model = MIOTree.MIOTree(
    alpha=ALPHA, 
    min_samples_per_leaf=MIN_SAMPLES_PER_LEAF, 
    max_depth=MAX_DEPTH,
    X_train=X_train, 
    y_train=y_train)

tuned_model = model.tune(X_test, y_test)

accuracy = tuned_model.calculate_accuracy()
print(f'Train Accuracy: {accuracy * 100}%')

confusion_matrix = tuned_model.calculate_confusion_matrix()
print_confusion_matrix(np.unique(model.y_train), confusion_matrix)

accuracy = tuned_model.calculate_accuracy(X_test, y_test)
print(f'Test Accuracy: {accuracy * 100}%')

leaf_predictions = extract_leaf_predictions(tuned_model)
tuned_model.tree.print_tree(leaf_predictions, a=tuned_model.pyomo_model.a, b=tuned_model.pyomo_model.b)

CART	Depth: 2	Alpha: 1	Duration: 0.0005700588226318359
	Accuracy train: 0.6666666666666666	Accuracy test: -
Best warm-start: {'model': <src.MIOTree.MIOTree object at 0x17191bef0>, 'accuracy': 0.6666666666666666, 'duration': 0.0005700588226318359, 'type': 'CART'}
MIO	Depth: 2	Alpha: 1	Duration: 0.13965487480163574
	Accuracy train: 0.6666666666666666	Accuracy test: -
Root: (1) val: None b: 0.20833333333333326
    L--- (2) val: 0 b: 0.0
        L--- (4) val: 2 b: None
            L--- (8) val: None b: None
            R--- (9) val: None b: None
        R--- (5) val: None b: None
            L--- (10) val: None b: None
            R--- (11) val: None b: None
    R--- (3) val: None b: 0.0
        L--- (6) val: None b: None
            L--- (12) val: None b: None
            R--- (13) val: None b: None
        R--- (7) val: None b: None
            L--- (14) val: None b: None
            R--- (15) val: None b: None

CART	Depth: 2	Alpha: 2	Duration: 0.0007998943328857422
	Accuracy train: 0.95