# OCT Alpha Tuning Problem

## Guidance on Using the OCT Package

The **Optimal Classification Tree (OCT)** is an interpretable decision tree model formulated as a mixed-integer programming (MIP) problem. It determines the optimal splits by solving an optimization problem over the tree’s structure, using warm-start values from a Classification and Regression Tree (CART) to initialize the split decision variables (`a`) and thresholds (`b`).

The `optimal_classification_tree` package implements the key functions for training the OCT, along with data preprocessing tools for importing datasets from the UCI Machine Learning Repository so that there are no missing values and each column is normalized. This notebook demonstrates how to fit the OCT model and benchmark it against standard classifiers such as **CART**, **XGBoost**, and **Random Forest**.
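
The preprocessing described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper `clean_and_normalize` is hypothetical, and min-max scaling to $[0, 1]$ is an assumption, since the normalization method is not stated here. The column name `target` matches the label column used later in this notebook.

```python
import pandas as pd

def clean_and_normalize(df: pd.DataFrame, target_col: str = "target") -> pd.DataFrame:
    """Drop rows with missing values and min-max scale each feature to [0, 1].

    Hypothetical helper: the scaling method is an assumption.
    """
    out = df.dropna().reset_index(drop=True)
    features = [c for c in out.columns if c != target_col]
    for col in features:
        lo, hi = out[col].min(), out[col].max()
        # A constant column has no spread; map it to 0 to avoid dividing by zero.
        out[col] = 0.0 if hi == lo else (out[col] - lo) / (hi - lo)
    return out

raw = pd.DataFrame({
    "x1": [1.0, 3.0, None, 5.0],   # one missing value, so that row is dropped
    "x2": [10.0, 10.0, 10.0, 10.0],
    "target": [0, 1, 0, 1],
})
clean = clean_and_normalize(raw)
print(clean)
```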

---

### 1. Data Preparation

Start by importing the relevant functions from the `optimal_classification_tree` package, along with the other classification models. The OCT class can plot the tree structure when `graphviz` is installed and added to the system path. Specify the following:
- `data_id`: the dataset ID in the UCI Machine Learning Repository,
- `k_folds`: the number of folds for cross-validation,
- `D`: the depth of the OCT tree,
- `n`: the number of observations to sample at random if the original dataset is too large.

Then use the `oct_tts` class to import the dataset from the UCI Machine Learning Repository and create the cross-validation folds.

Example usage:

```python
dt = oct_tts(data_id, k_folds, n)
df_train = dt.df_train  # training sets: 50% of each of the k_folds splits
df_cal = dt.df_cal      # calibration sets: 25% of each split
df_test = dt.df_test    # test sets: 25% of each split
```
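
To make the 50% / 25% / 25% proportions concrete, the sketch below builds such splits by reshuffling the data once per fold. How `oct_tts` actually forms its folds is not shown here, so the repeated random split and the helper name `split_50_25_25` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def split_50_25_25(df, k_folds, seed=0):
    """Return three lists (train, cal, test) holding one DataFrame per fold.

    Hypothetical helper: each fold is an independent random 50/25/25 split.
    """
    rng = np.random.default_rng(seed)
    trains, cals, tests = [], [], []
    n = len(df)
    n_train, n_cal = n // 2, n // 4
    for _ in range(k_folds):
        idx = rng.permutation(n)  # fresh shuffle for every fold
        trains.append(df.iloc[idx[:n_train]].reset_index(drop=True))
        cals.append(df.iloc[idx[n_train:n_train + n_cal]].reset_index(drop=True))
        tests.append(df.iloc[idx[n_train + n_cal:]].reset_index(drop=True))
    return trains, cals, tests

toy = pd.DataFrame({"x": np.arange(8) / 7.0, "target": [0, 1] * 4})
trains, cals, tests = split_50_25_25(toy, k_folds=3)
print([len(d) for d in (trains[0], cals[0], tests[0])])  # [4, 2, 2]
```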

### 2. Fitting the OCT Model

The balanced OCT model (where each node is forced to split) can be fit on a training set as follows:

```python
oct_model = bal_OCT(df_train[i], df_test[i], D)
```
Arguments:
- `df_train[i]`: the i-th training set,
- `df_test[i]`: the i-th test set,
- `D`: the tree depth.

If no warm start is provided, the function internally fits a CART tree and uses its splits to warm-start the optimization.
To provide a custom warm start, pass:
- `warmstart_a`: a list of length $2^D - 1$, where each entry is the index of the feature used to split at the corresponding branch node,
- `warmstart_b`: a list of threshold values for the chosen feature at each branch node.
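
One way to build such lists is to read the splits off a scikit-learn CART tree, as sketched below. The helper `cart_warmstart` is hypothetical, and two choices in it are assumptions rather than the package's documented behavior: branch nodes are numbered as a complete binary tree with the root at position 0, and positions where CART did not split keep a placeholder (feature 0, threshold 0.0).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cart_warmstart(X, y, D, random_state=0):
    """Return (warmstart_a, warmstart_b) read off a CART tree of depth at most D."""
    cart = DecisionTreeClassifier(max_depth=D, random_state=random_state).fit(X, y)
    t = cart.tree_
    n_branch = 2 ** D - 1
    warmstart_a = [0] * n_branch    # placeholder feature index (assumption)
    warmstart_b = [0.0] * n_branch  # placeholder threshold (assumption)

    # Walk the CART tree, mapping each node to its position in a complete
    # binary tree: root = 0, children of position j are 2j + 1 and 2j + 2.
    stack = [(0, 0)]  # (scikit-learn node id, position in the complete tree)
    while stack:
        node, pos = stack.pop()
        is_leaf = t.children_left[node] == -1  # scikit-learn marks leaves with -1
        if is_leaf or pos >= n_branch:
            continue
        warmstart_a[pos] = int(t.feature[node])
        warmstart_b[pos] = float(t.threshold[node])
        stack.append((t.children_left[node], 2 * pos + 1))
        stack.append((t.children_right[node], 2 * pos + 2))
    return warmstart_a, warmstart_b

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.8, 0.1], [0.9, 0.2]])
y = np.array([0, 0, 1, 1])
a, b = cart_warmstart(X, y, D=2)
print(a, b)
```

Note that CART sends a point left when its feature value is at most the threshold; whether the OCT constraints use a strict or non-strict inequality should be checked before reusing the thresholds unchanged.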

The general OCT model, which takes the parameter `alpha` as an additional argument, can be fit on a training set as follows:

```python
oct_model = OCT_Main(df_train[i], df_test[i], D, alpha)
```
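
For context, in the OCT formulation of Bertsimas and Dunn (2017), $\alpha$ is a complexity parameter that penalizes the number of splits. Assuming the package follows that formulation (this is not confirmed here), the objective has the form

$$
\min \; \frac{1}{\hat{L}} \sum_{t \in \mathcal{T}_L} L_t \;+\; \alpha \sum_{t \in \mathcal{T}_B} d_t ,
$$

where $\mathcal{T}_L$ and $\mathcal{T}_B$ are the sets of leaf and branch nodes, $L_t$ is the number of misclassified points in leaf $t$, $\hat{L}$ is a baseline misclassification count (that of predicting the most common class for every point), and $d_t \in \{0, 1\}$ indicates whether branch node $t$ applies a split. Under this reading, a larger $\alpha$ favors trees with fewer splits, and the balanced model corresponds to fixing $d_t = 1$ at every branch node.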

### 3. Model Evaluation
After training, evaluate the in-sample and out-of-sample classification accuracy. The in-sample accuracies, for example, are stored as attributes of the fitted model:

```python
oct_model.oct_accuracy_train  # OCT in-sample accuracy
oct_model.cart_accuracy_train # CART in-sample accuracy
```
To visualize the tree structure, use the following (requires Graphviz installed and configured on your system path):
```python
oct_model.plot_tree_structure()
```
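
What `plot_tree_structure()` draws is not shown here. As a rough picture of the layout of a balanced tree, the sketch below writes Graphviz DOT text by hand, without any plotting library. The helper `tree_to_dot`, the root-at-0 node numbering, and the `<` comparison in the labels are all assumptions made for illustration.

```python
def tree_to_dot(features, thresholds, D):
    """Return Graphviz DOT text for a balanced tree of depth D.

    Hypothetical helper: node j has children 2j + 1 (test holds) and 2j + 2.
    """
    n_branch = 2 ** D - 1
    n_total = 2 ** (D + 1) - 1
    lines = ["digraph OCT {", "  node [shape=box];"]
    for j in range(n_total):
        if j < n_branch:
            label = f"x[{features[j]}] < {thresholds[j]:.3f}"
        else:
            label = f"leaf {j - n_branch}"
        lines.append(f'  n{j} [label="{label}"];')
    for j in range(n_branch):
        lines.append(f'  n{j} -> n{2 * j + 1} [label="yes"];')
        lines.append(f'  n{j} -> n{2 * j + 2} [label="no"];')
    lines.append("}")
    return "\n".join(lines)

dot = tree_to_dot([0, 1, 1], [0.5, 0.25, 0.75], D=2)
print(dot)
```

Saving the text to a file such as `tree.dot` and running `dot -Tpng tree.dot -o tree.png` renders it, provided Graphviz is installed.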

More conveniently, the `OCT` function runs the whole pipeline and reports the prediction accuracies. Its `tuning_problem` argument selects what is calibrated:
- `tuning_problem = 0`: calibrate $\alpha$ by cross-validation,
- `tuning_problem = 1`: calibrate $C$ by cross-validation,
- `tuning_problem = None`: train the balanced OCT model without tuning.

Call it as follows:
```python
OCT(data_id, D, k_folds, n=None, tuning_problem=None, alpha=None)
```
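
The idea of calibrating a parameter on held-out data can be sketched as below. This is not the package's tuning routine: the stand-in model is scikit-learn's CART with cost-complexity pruning (`ccp_alpha`), not the OCT MIP, and the helper `tune_on_calibration`, the candidate grid, and the synthetic data are all hypothetical. It also assumes that candidates are scored on the calibration sets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def tune_on_calibration(folds, candidates, D):
    """Pick the candidate with the best mean calibration accuracy across folds."""
    mean_acc = []
    for alpha in candidates:
        accs = []
        for X_tr, y_tr, X_cal, y_cal in folds:
            # Stand-in model: CART with cost-complexity pruning, NOT the OCT MIP.
            model = DecisionTreeClassifier(max_depth=D, ccp_alpha=alpha, random_state=0)
            accs.append(model.fit(X_tr, y_tr).score(X_cal, y_cal))
        mean_acc.append(float(np.mean(accs)))
    best = int(np.argmax(mean_acc))  # ties go to the first candidate in the list
    return candidates[best], mean_acc

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
folds = []
for seed in range(3):
    X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=1 / 3, random_state=seed)
    folds.append((X_tr, y_tr, X_cal, y_cal))

candidates = [0.0, 0.01, 0.05]
alpha_opt, scores = tune_on_calibration(folds, candidates, D=2)
print(alpha_opt, scores)
```

After the best value is chosen, a final model would typically be refit on the training and calibration sets combined and scored on the test set.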

### 4. Benchmarking with Other Classifiers
This notebook also benchmarks the OCT against XGBoost and Random Forest classifiers using the same cross-validation splits.
Refer to the corresponding sections of the notebook for details on training and evaluating these models.

### Import Packages

In [None]:
import pandas as pd
import numpy as np

from optimal_classification_tree import oct_tts, bal_OCT, OCT_Main, OCT_AT_Sub, OCT

from graphviz import Digraph
import xgboost as xgb

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

data_id = 244
k_folds = 5
D = 2

dt = oct_tts(data_id, k_folds)
df_train = dt.df_train
df_cal = dt.df_cal
df_test = dt.df_test

### OCT vs. CART

In [18]:
OCT(data_id, D, k_folds, tuning_problem=1)

Gurobi Optimizer version 12.0.1 build v12.0.1rc0 (win64 - Windows 11.0 (26100.2))

CPU model: Intel(R) Core(TM) Ultra 9 185H, instruction set [SSE2|AVX|AVX2]
Thread count: 16 physical cores, 22 logical processors, using up to 22 threads

Optimize a model with 699 rows, 261 columns and 4867 nonzeros
Model fingerprint: 0x331ceb6f
Variable types: 7 continuous, 254 integer (242 binary)
Coefficient statistics:
  Matrix range     [6e-02, 5e+01]
  Objective range  [1e+00, 1e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [1e+00, 5e+01]

User MIP start did not produce a new incumbent solution

Presolve removed 484 rows and 191 columns
Presolve time: 0.01s
Presolved: 215 rows, 70 columns, 1188 nonzeros
Variable types: 0 continuous, 70 integer (65 binary)
Found heuristic solution: objective 3.1914894

Root relaxation: objective 0.000000e+00, 39 iterations, 0.00 seconds (0.00 work units)
Another try with MIP start

    Nodes    |    Current Node    |     Objective Bounds      |     Work

{'C_opt': 1,
 'OCT_train_accuracy': np.float64(0.8933333333333333),
 'OCT_test_accuracy': np.float64(0.8400000000000001),
 'CART_train_accuracy': 0.8986666666666666,
 'CART_test_accuracy': 0.8240000000000001}

### XGBoost and Random Forest

In [19]:
xgboost_in = []   # XGBoost in-sample accuracy per fold
xgboost_out = []  # XGBoost out-of-sample accuracy per fold
rf_in = []        # Random Forest in-sample accuracy per fold
rf_out = []       # Random Forest out-of-sample accuracy per fold
for i in range(k_folds):
    # Train on the training and calibration sets combined; evaluate on the test set.
    train = pd.concat([df_train[i], df_cal[i]], ignore_index=True)
    test = df_test[i].reset_index(drop=True)
    x_train = train.drop(['target'], axis=1)
    y_train = train['target'].astype('int')
    x_test = test.drop(['target'], axis=1)
    y_test = test['target'].astype('int')

    # Choose the XGBoost objective from the number of classes in the training labels.
    k = len(y_train.unique())
    if k >= 3:
        model_xgb = xgb.XGBClassifier(objective='multi:softmax', num_class=k, use_label_encoder=False, max_depth=D)
    else:
        model_xgb = xgb.XGBClassifier(objective='binary:logistic', use_label_encoder=False, max_depth=D)
    model_xgb.fit(x_train, y_train)

    y_fit_xgb = model_xgb.predict(x_train)
    y_pred_xgb = model_xgb.predict(x_test)
    xgboost_in.append(accuracy_score(y_train, y_fit_xgb))
    xgboost_out.append(accuracy_score(y_test, y_pred_xgb))

    # Random Forest with 100 trees, each limited to the same depth D.
    model_rf = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=D)
    model_rf.fit(x_train, y_train)
    y_fit_rf = model_rf.predict(x_train)
    y_pred_rf = model_rf.predict(x_test)
    rf_in.append(accuracy_score(y_train, y_fit_rf))
    rf_out.append(accuracy_score(y_test, y_pred_rf))

results = {
    'xgboost_in': float(np.mean(xgboost_in)),
    'xgboost_out': float(np.mean(xgboost_out)),
    'rf_in': float(np.mean(rf_in)),
    'rf_out': float(np.mean(rf_out))
}

results

{'xgboost_in': 0.9706666666666667,
 'xgboost_out': 0.8160000000000001,
 'rf_in': 0.8933333333333333,
 'rf_out': 0.8400000000000001}
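
To compare the models side by side, the accuracies reported in the two outputs above can be collected into one table. The sketch below only copies those printed values (the OCT and CART rows come from the `tuning_problem=1` run) and adds the gap between train and test accuracy. The table layout and the name `summary` are illustrative choices.

```python
import pandas as pd

# Values copied from the outputs printed above; nothing is recomputed here.
summary = pd.DataFrame(
    {
        "train": [0.8933333333333333, 0.8986666666666666,
                  0.9706666666666667, 0.8933333333333333],
        "test": [0.8400000000000001, 0.8240000000000001,
                 0.8160000000000001, 0.8400000000000001],
    },
    index=["OCT", "CART", "XGBoost", "Random Forest"],
)
# A large gap between train and test accuracy suggests overfitting.
summary["gap"] = summary["train"] - summary["test"]
print(summary.round(3))
```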