### Code demo for CART
Classification and regression trees (CART) are very powerful, but they have one major drawback: they are highly
unstable, meaning that small changes in the training data can produce a very different tree. We show this on the
servo data set, which offers a regression task. After the imports, we load the data and look at its first rows:

In [1]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [2]:
data = pd.read_csv('servo.csv', index_col='Unnamed: 0', dtype={'Motor': 'category', 'Screw': 'category'})

In [3]:
data.head()

Unnamed: 0,Motor,Screw,Pgain,Vgain,Class
1,E,E,5,4,4
2,B,D,6,5,11
3,D,D,4,3,6
4,B,A,3,2,48
5,D,B,6,5,6


We’ll fit two CARTs on our data, which we split into training and test sets using two different seeds, resulting in slightly different training and test sets.

Check how the structure of the fitted CART differs between the two seeds:

In [4]:
np.random.seed(1333)
# split: each row goes to training with probability 0.75 (about 75 % train, 25 % test)
train_size = 0.75
train_indices = np.random.rand(len(data)) < train_size
train_1 = data[train_indices]
test_1 = data[~train_indices]

In [5]:
np.random.seed(42)
# split: each row goes to training with probability 0.75 (about 75 % train, 25 % test)
train_size = 0.75
train_indices = np.random.rand(len(data)) < train_size
train_2 = data[train_indices]
test_2 = data[~train_indices]
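The two cells above differ only in the seed, so the split can be wrapped in a small helper. The following is a minimal sketch and not part of the original demo: `split_by_seed` is a hypothetical name, and the toy `DataFrame` stands in for the servo data. Because every row is assigned to the training set independently with probability `train_size`, the split is only approximately 75/25, and two seeds will usually share many, but not all, training rows.

```python
import numpy as np
import pandas as pd


def split_by_seed(data, seed, train_size=0.75):
    """Split `data` into (train, test) using a seeded random boolean mask."""
    # A local generator keeps the global NumPy random state untouched.
    rng = np.random.RandomState(seed)
    train_mask = rng.rand(len(data)) < train_size
    return data[train_mask], data[~train_mask]


# Toy stand-in for the servo data (values are made up for illustration).
toy = pd.DataFrame({"Pgain": np.arange(100) % 4 + 3, "Class": np.arange(100)})

train_a, test_a = split_by_seed(toy, seed=1333)
train_b, test_b = split_by_seed(toy, seed=42)

# Rows that ended up in the training set under both seeds.
shared = train_a.index.intersection(train_b.index)
print(len(train_a), len(train_b), len(shared))
```

Printing the three counts gives a quick sense of how much the two training sets overlap before any tree is fitted.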

In [6]:
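def classifier(data):
    """Fit a preprocessing + decision tree pipeline on `data`; `Class` is the target."""
    X_cat = ['Motor', 'Screw']  # categorical features
    X_num = ['Pgain', 'Vgain']  # numerical features

    y_train = data.Class
    X_train = data.drop('Class', axis=1)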

    categorical_pipe = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    numerical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='mean'))
    ])

    preprocessing = ColumnTransformer(
        [('cat', categorical_pipe, X_cat),
         ('num', numerical_pipe, X_num)])

    clf = Pipeline([
        ('preprocess', preprocessing),
        ('classifier', DecisionTreeClassifier(random_state=42))
    ])

    clf.fit(X_train, y_train)
    
    return clf
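The text describes servo as a regression task, while the function above fits a `DecisionTreeClassifier`, which treats every distinct value of `Class` as a separate label. As a hedged alternative, and not the demo's own method, the sketch below swaps in `DecisionTreeRegressor` behind the same one-hot encoding. The function name `regressor` and the synthetic data frame are assumptions for illustration; the values are not taken from `servo.csv`, and the imputer is left out because the toy data has no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor


def regressor(data):
    """Fit a regression tree on `data`, predicting the numeric `Class` column."""
    preprocessing = ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'), ['Motor', 'Screw']),
         ('num', 'passthrough', ['Pgain', 'Vgain'])])
    reg = Pipeline([
        ('preprocess', preprocessing),
        ('regressor', DecisionTreeRegressor(random_state=42)),
    ])
    reg.fit(data.drop('Class', axis=1), data['Class'])
    return reg


# Synthetic stand-in with the same column layout as the servo data.
rng = np.random.RandomState(0)
n = 60
toy = pd.DataFrame({
    'Motor': pd.Categorical(rng.choice(list('ABCDE'), n)),
    'Screw': pd.Categorical(rng.choice(list('ABCDE'), n)),
    'Pgain': rng.randint(3, 7, n),
    'Vgain': rng.randint(1, 6, n),
})
toy['Class'] = 10 * (7 - toy['Pgain']) + rng.randint(0, 5, n)

reg = regressor(toy)
# For a regressor, `score` returns R^2 instead of accuracy.
train_r2 = reg.score(toy.drop('Class', axis=1), toy['Class'])
print(train_r2)
```

Note that the value printed here is computed on the training rows, so it says nothing about generalisation; it only confirms that the pipeline fits and predicts.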

In [7]:
clf = classifier(train_1)

In [8]:
# score on the held-out rows of the first split
clf.score(test_1.drop('Class', axis=1), test_1.Class)
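To make the instability visible, one option is to fit a tree on each split and compare summary statistics and the printed rules. The sketch below is an illustration under assumptions: `describe_tree` is a hypothetical helper, and a small numeric toy data set replaces the servo data so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text


def describe_tree(tree, feature_names):
    """Return depth, number of leaves, and the root split feature of a fitted tree."""
    root = tree.tree_.feature[0]  # negative if the root is itself a leaf
    return {
        'depth': tree.get_depth(),
        'n_leaves': tree.get_n_leaves(),
        'root_feature': feature_names[root] if root >= 0 else None,
    }


# Synthetic stand-in (not the servo data).
rng = np.random.RandomState(0)
n = 150
X = pd.DataFrame({'Pgain': rng.randint(3, 7, n), 'Vgain': rng.randint(1, 6, n)})
y = ((X['Pgain'] + X['Vgain'] + rng.randint(0, 3, n)) > 8).astype(int)

summaries = {}
for seed in (1333, 42):
    # Same masking scheme as in the demo: each row is a training row with probability 0.75.
    mask = np.random.RandomState(seed).rand(n) < 0.75
    tree = DecisionTreeClassifier(random_state=42).fit(X[mask], y[mask])
    summaries[seed] = describe_tree(tree, list(X.columns))
    print(seed, summaries[seed])
    # Print only the top of the tree; deeper levels are truncated.
    print(export_text(tree, feature_names=list(X.columns), max_depth=2))
```

With the pipeline from the demo, the fitted tree should be reachable as `clf.named_steps['classifier']`; the feature names of the one-hot encoded columns would then have to come from the preprocessing step.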