# Skin of the Orange
**PAGE 431.** We generated 100 observations in each of two classes. The first class has four standard normal independent features $X_1, X_2, X_3, X_4$. The second class also has four standard normal independent features, but conditioned on $9 \le \sum X_j^2 \le 16$. This is a relatively easy problem. As a second harder problem, we augmented the features with an additional six standard Gaussian noise features. Hence the second class almost completely surrounds the first, like the skin surrounding the orange, in a four-dimensional subspace.
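
The notebook loads pre-generated data, so the generating code is not shown. As a rough sketch of the procedure described above, the second class can be drawn by rejection sampling: draw standard normal points and keep only those whose squared norm falls in $[9, 16]$. The function name `sample_skin_of_orange`, the seed, and the label convention (+1 for the inner class, -1 for the shell) are assumptions made for illustration.

```python
import numpy as np


def sample_skin_of_orange(n_per_class, n_noise=0, seed=0):
    """Draw one hypothetical sample for the skin-of-the-orange problem."""
    rng = np.random.default_rng(seed)
    # class 1: four independent standard normal features
    inner = rng.standard_normal((n_per_class, 4))
    # class 2: rejection sampling, keep draws with 9 <= sum of squares <= 16
    shell_parts, n_kept = [], 0
    while n_kept < n_per_class:
        draws = rng.standard_normal((50 * n_per_class, 4))
        r2 = np.sum(draws**2, axis=1)
        accepted = draws[(r2 >= 9) & (r2 <= 16)]
        shell_parts.append(accepted)
        n_kept += len(accepted)
    shell = np.vstack(shell_parts)[:n_per_class]
    X = np.vstack((inner, shell))
    # optionally append pure-noise standard normal columns
    if n_noise > 0:
        X = np.hstack((X, rng.standard_normal((2 * n_per_class, n_noise))))
    # assumed label convention: +1 for the inner class, -1 for the shell
    y = np.concatenate((np.ones(n_per_class), -np.ones(n_per_class)))
    return X, y


X, y = sample_skin_of_orange(100, n_noise=6)
print(X.shape, y.shape)
```

Only about 6% of standard normal draws in four dimensions land in the shell, which is why each pass draws many more candidates than are needed.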

**DATA INFO.** There are two datasets. The first contains only the four signal features; the second adds the six noise features. Each case has 50 simulations, and each simulation has 100 training records and 1000 test records.

|Columns      |Content|
|-------------|-------|
|column 0     |simulation id (from 0 to 49)|
|column 1     |train/test flag (0 for training, 1 for test)|
|column 2     |class id (-1 or 1)|
|columns 3-6  |features|
|columns 7-12 |noise features (second dataset only)|
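
To make the column layout concrete, here is a minimal sketch that pulls the training and test parts of one simulation out of an array in this format. The helper `split_simulation` and the toy array are hypothetical; the toy data is random and only mimics the layout.

```python
import numpy as np


def split_simulation(ds, sim_id):
    """Return (X_train, y_train, X_test, y_test) for one simulation id."""
    in_sim = ds[:, 0] == sim_id
    train = ds[in_sim & (ds[:, 1] == 0)]
    test = ds[in_sim & (ds[:, 1] == 1)]
    # column 2 is the class id, columns 3 onward are the features
    return train[:, 3:], train[:, 2], test[:, 3:], test[:, 2]


# tiny made-up array in the same layout: 2 simulations, 4 features,
# 3 training rows and 5 test rows per simulation
rng = np.random.default_rng(0)
rows = []
for sim_id in range(2):
    for flag, n_rows in ((0, 3), (1, 5)):
        for _ in range(n_rows):
            label = rng.choice([-1, 1])
            rows.append([sim_id, flag, label, *rng.standard_normal(4)])
toy = np.array(rows, dtype=float)

X_train, y_train, X_test, y_test = split_simulation(toy, sim_id=1)
print(X_train.shape, X_test.shape)
```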

In [1]:
import warnings

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# silence library warnings (e.g. SVC convergence) to keep the output readable
warnings.filterwarnings('ignore')

## Load and Prepare Data

In [2]:
loaded = np.load('../data/orange.npz')
no_noise_ds, six_noise_ds = loaded['no_noise'], loaded['six_noise']

## Bayes Error Rate
We know the underlying distribution of the data, so we can compute the error rate of the Bayes-optimal classifier on all the test data.

In [3]:
ds = np.vstack((no_noise_ds, six_noise_ds[:, :7]))
# find the test records
records = ds[(ds[:, 1] == 1), :]
# use only columns 3-6 (the four signal features); noise columns are excluded
features = records[:, 3:7]
# the score is the squared distance from the origin
score = np.sum(features**2, axis=1)
# classify to -1 if it is the second class
y_hat = 1 - 2*((score <= 16) & (score >= 9))
bayes_error_rate = 1 - accuracy_score(records[:, 2], y_hat)
# PAGE 431. The Bayes error rate for this problem is 0.029
#           (irrespective of dimension).
print(f'{bayes_error_rate:.3f}')

0.029
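
The empirical value can be cross-checked analytically. Assuming equal class priors (an assumption, not stated in the data description), the only Bayes errors are first-class points that fall inside the shell, so the error is $\tfrac{1}{2} P(9 \le \chi^2_4 \le 16)$. For four degrees of freedom the chi-square survival function has the closed form $e^{-x/2}(1 + x/2)$, which the sketch below uses; the helper name `chi2_4_sf` is made up for illustration.

```python
import math


def chi2_4_sf(x):
    """P(chi-square with 4 degrees of freedom > x), closed form."""
    return math.exp(-x / 2) * (1 + x / 2)


# probability that a first-class point lands inside the shell 9 <= r^2 <= 16
p_shell = chi2_4_sf(9) - chi2_4_sf(16)
# assumed equal priors: only first-class points in the shell are misclassified
bayes_error_analytic = 0.5 * p_shell
print(f'{bayes_error_analytic:.3f}')  # prints 0.029
```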


## Support Vector Machine Classification

In [4]:
def estimate_svc_on_data(ds, **kwargs):
    """Estimates mean error rate and its standard error of the SVC algorithm
       on a provided dataset.

    Parameters
    ----------
    ds : a dataset with training and test data for 50 simulations
    kwargs : parameters for the Support Vector Classifier

    Returns
    -------
    mean_error : mean test error rate based on 50 simulations
    mean_error_std : standard error of the mean test error rate
    """
    error_rates = np.zeros(50)
    # PAGE 431. The average test errors over 50 simulations, with and without
    #           noise features, are shown in Table 12.2.
    for sim_id in range(50):
        train_ds = ds[(ds[:, 0] == sim_id) & (ds[:, 1] == 0), :]
        test_ds = ds[(ds[:, 0] == sim_id) & (ds[:, 1] == 1), :]
        best_err, best_C = 1, 0
        # PAGE 431. For all support vector procedures, we chose the cost
        #           parameter C to minimize the test error, to be as fair
        #           as possible to the method.
        for C in np.linspace(0.01, 5, 20):
            err = 1 - SVC(C=C, max_iter=100000, **kwargs).fit(
                train_ds[:, 3:], train_ds[:, 2]
            ).score(test_ds[:, 3:], test_ds[:, 2])
            if err < best_err:
                best_err, best_C = err, C
        error_rates[sim_id] = best_err
    return np.mean(error_rates), np.std(error_rates)/np.sqrt(50)


def estimate_svc_without_and_with_noise(**kwargs):
    """Estimates mean error rate and its standard error of the SVC algorithm
       on 1) the dataset with 4 no-noise features; 2) on the dataset with 6
       additional noise features. So, it shows how noise features affect
       SVC performance.

    Parameters
    ----------
    kwargs : parameters for the Support Vector Classifier

    Returns
    -------
    no_noise_mean_error : mean test error rate based on 50 simulations without
                          noise features
    no_noise_mean_error_std : standard error of the mean test error rate
                              without noise features
    noise_mean_error : mean test error rate based on 50 simulations with noise
                       features
    noise_mean_error_std : standard error of the mean test error rate with
                           noise features
    """
    no_noise_err, no_noise_std = estimate_svc_on_data(no_noise_ds, **kwargs)
    six_noise_err, six_noise_std = estimate_svc_on_data(six_noise_ds, **kwargs)
    return no_noise_err, no_noise_std, six_noise_err, six_noise_std
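
The standard error reported by these functions is the standard deviation of the 50 per-simulation error rates divided by $\sqrt{50}$, with NumPy's default population standard deviation (`ddof=0`). A small sketch of that calculation follows, with the sample standard deviation (`ddof=1`) shown as a common alternative. The helper `mean_and_standard_error` and the numbers are made up for illustration and are not results from this notebook.

```python
import numpy as np


def mean_and_standard_error(values, ddof=0):
    """Return the mean of values and the standard error of that mean."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=ddof) / np.sqrt(len(values))


# made-up error rates, not results from the notebook
toy_errors = [0.10, 0.12, 0.08, 0.11, 0.09]
mean0, se0 = mean_and_standard_error(toy_errors)          # population std
mean1, se1 = mean_and_standard_error(toy_errors, ddof=1)  # sample std
print(f'{mean0:.3f} ({se0:.4f})')
print(f'{mean1:.3f} ({se1:.4f})')
```

With 50 simulations the two conventions differ by a factor of $\sqrt{50/49}$, which is about 1%.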

In [5]:
linear_svc_results = estimate_svc_without_and_with_noise(kernel='linear')
poly2_svc_results = estimate_svc_without_and_with_noise(
    gamma='scale', kernel='poly', coef0=1, degree=2)
poly5_svc_results = estimate_svc_without_and_with_noise(
    gamma='scale', kernel='poly', coef0=1, degree=5)
poly10_svc_results = estimate_svc_without_and_with_noise(
    gamma='scale', kernel='poly', coef0=1, degree=10)

## MARS

In [6]:
from pyearth import Earth

In [7]:
def estimate_mars_on_data(ds):
    """Estimates mean error rate and its standard error of the MARS algorithm
       on a provided dataset.

    Parameters
    ----------
    ds : a dataset with training and test data for 50 simulations

    Returns
    -------
    mean_error : mean test error rate based on 50 simulations
    mean_error_std : standard error of the mean test error rate
    """
    error_rates = np.zeros(50)
    for sim_id in range(50):
        train_ds = ds[(ds[:, 0] == sim_id) & (ds[:, 1] == 0), :]
        test_ds = ds[(ds[:, 0] == sim_id) & (ds[:, 1] == 1), :]
        mars = Earth(
            max_terms=20, max_degree=4, enable_pruning=True
        ).fit(train_ds[:, 3:], train_ds[:, 2])
        y_hat = np.sign(mars.predict(test_ds[:, 3:]))
        error_rates[sim_id] = 1 - accuracy_score(test_ds[:, 2], y_hat)
    return np.mean(error_rates), np.std(error_rates)/np.sqrt(50)

In [8]:
mars_results = (*estimate_mars_on_data(no_noise_ds),
                *estimate_mars_on_data(six_noise_ds))

In [9]:
# PAGE 431. TABLE 12.2. Skin of the orange: Shown are mean (standard error of
#           the mean) of the test error over 50 simulations. BRUTO fits an
#           additive spline model adaptively, while MARS fits a low-order
#           interaction model adaptively.
names = ['1  SV Classifier', '2  SVM/poly 2', '3  SVM/poly 5',
         '4  SVM/poly 10', '6  MARS']
results = [linear_svc_results, poly2_svc_results, poly5_svc_results,
           poly10_svc_results, mars_results]

print('                             Test Error (SE)')
print('Methods           No Noise Features   Six Noise Features')
print('----------------------------------------------------------')
for name, result in zip(names, results):
    no_noise_err, no_noise_std, noise_err, noise_std = result
    print(f'{name:<20}{no_noise_err:>.3f} ({no_noise_std:.3f})'
          f'       {noise_err:.3f} ({noise_std:.3f})')
print(f'Bayes {bayes_error_rate:>23.3f} {bayes_error_rate:>19.3f}')

                             Test Error (SE)
Methods           No Noise Features   Six Noise Features
----------------------------------------------------------
1  SV Classifier    0.449 (0.002)       0.469 (0.003)
2  SVM/poly 2       0.075 (0.003)       0.165 (0.004)
3  SVM/poly 5       0.130 (0.004)       0.213 (0.004)
4  SVM/poly 10      0.179 (0.004)       0.346 (0.003)
6  MARS             0.144 (0.005)       0.161 (0.005)
Bayes                   0.029               0.029
