# Chp 3 Introduction to ML

Decision tree construction relies on recursion, where a function calls itself on smaller versions of the problem until reaching a stopping condition. For example, the factorial function can be defined recursively:

$$n! = n \times (n-1)!$$

In the same way, decision trees build subtrees by applying the same process recursively on subsets of the training data, until reaching a leaf node.


In [None]:
# Factorial
def fact(n):
    if (n <= 1):
        return 1
    else:
        return n*fact(n-1)
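
To connect the factorial example to tree building, here is a minimal sketch of recursive splitting on a single feature. The names `build_tree` and `predict_one`, the midpoint threshold rule, and the dictionary layout are illustrative assumptions; this is not how sklearn chooses its splits.

```python
import numpy as np

def build_tree(x, y, depth=0, max_depth=3):
    """Recursively split 1-D feature values x with non-negative integer labels y."""
    # Stopping conditions: pure node, depth limit reached, or no spread left to split.
    if len(np.unique(y)) == 1 or depth == max_depth or x.min() == x.max():
        return {"leaf": int(np.bincount(y).argmax())}   # majority class
    # Hypothetical split rule: threshold at the midpoint of the value range.
    t = (x.min() + x.max()) / 2.0
    left = x <= t
    # The same function is applied to each subset, just as fact(n) calls fact(n-1).
    return {"threshold": float(t),
            "left": build_tree(x[left], y[left], depth + 1, max_depth),
            "right": build_tree(x[~left], y[~left], depth + 1, max_depth)}

def predict_one(tree, value):
    """Follow the path from the root to a leaf for a single value."""
    while "leaf" not in tree:
        tree = tree["left"] if value <= tree["threshold"] else tree["right"]
    return tree["leaf"]

x = np.array([1.0, 1.5, 2.0, 6.0, 6.5, 7.0])
y = np.array([0, 0, 0, 1, 1, 1])
tree = build_tree(x, y)
print(tree)
```

The stopping conditions play the role of `n <= 1` in the factorial: without them the recursion would never reach a leaf.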

# Chp 4 Experiments with Classical Models

## Iris

The iris dataset contains four continuous features (sepal length, sepal width, petal length, and petal width) and three classes corresponding to iris species. It has 150 samples, 50 per class. Using PCA augmentation, we expand the dataset to 1,200 training samples while keeping the same test set.
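
The augmented files loaded below are prepared elsewhere. As a rough sketch of the idea, one way to augment with PCA is to project the samples onto their principal components, add small noise to the least significant components, and project back. The function name `pca_augment`, the parameters `copies`, `keep`, and `noise_scale`, and the synthetic data are all assumptions made for illustration; the procedure that produced the saved files may differ.

```python
import numpy as np

def pca_augment(x, y, copies=9, keep=2, noise_scale=0.1, seed=0):
    """Return the original samples plus `copies` noisy variants of each."""
    rng = np.random.default_rng(seed)
    mean = x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    z = (x - mean) @ vt.T                    # samples in PCA coordinates
    minor_std = z[:, keep:].std(axis=0)      # spread of the minor components
    xs, ys = [x], [y]
    for _ in range(copies):
        zn = z.copy()
        noise = rng.normal(0.0, 1.0, size=zn[:, keep:].shape)
        zn[:, keep:] += noise_scale * minor_std * noise   # perturb minor components only
        xs.append(zn @ vt + mean)            # back to the original feature space
        ys.append(y)                         # labels are unchanged
    return np.vstack(xs), np.concatenate(ys)

# Synthetic stand-in with the same shape as the 120-sample iris training split.
demo_rng = np.random.default_rng(1)
x_demo = demo_rng.normal(size=(120, 4))
y_demo = np.repeat([0, 1, 2], 40)
xa, ya = pca_augment(x_demo, y_demo)
print(xa.shape, ya.shape)
```

With 120 training samples and nine noisy copies of each, the result has 1,200 rows, matching the size quoted above.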

In [None]:
# Iris Experiments
import numpy as np
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC 

def run(x_train, y_train, x_test, y_test, clf):
    clf.fit(x_train, y_train)
    print("    predictions  :", clf.predict(x_test))
    print("    actual labels:", y_test)
    print("    score = %0.4f" % clf.score(x_test, y_test))
    print()

def main():
    x = np.load("../data/iris/iris_features.npy")
    y = np.load("../data/iris/iris_labels.npy")
    N = 120 
    x_train = x[:N]; x_test = x[N:]
    y_train = y[:N]; y_test = y[N:]
    xa_train=np.load("../data/iris/iris_train_features_augmented.npy")
    ya_train=np.load("../data/iris/iris_train_labels_augmented.npy")
    xa_test =np.load("../data/iris/iris_test_features_augmented.npy")
    ya_test =np.load("../data/iris/iris_test_labels_augmented.npy")

    print("Nearest centroid:")
    run(x_train, y_train, x_test, y_test, NearestCentroid())
    print("k-NN classifier (k=3):")
    run(x_train, y_train, x_test, y_test, KNeighborsClassifier(n_neighbors=3))
    print("Naive Bayes classifier (Gaussian):")
    run(x_train, y_train, x_test, y_test, GaussianNB())
    print("Naive Bayes classifier (Multinomial):")
    run(x_train, y_train, x_test, y_test, MultinomialNB())
    print("Decision tree classifier:")
    run(x_train, y_train, x_test, y_test, DecisionTreeClassifier())
    print("Random forest classifier (estimators=5):")
    run(xa_train, ya_train, xa_test, ya_test, RandomForestClassifier(n_estimators=5))

    print("SVM (linear, C=1.0):")
    run(xa_train, ya_train, xa_test, ya_test, SVC(kernel="linear", C=1.0))
    print("SVM (RBF, C=1.0, gamma=0.25):")
    run(xa_train, ya_train, xa_test, ya_test, SVC(kernel="rbf", C=1.0, gamma=0.25))
    print("SVM (RBF, C=1.0, gamma=0.001, augmented)")
    run(xa_train, ya_train, xa_test, ya_test, SVC(kernel="rbf", C=1.0, gamma=0.001))
    print("SVM (RBF, C=1.0, gamma=0.001, original)")
    run(x_train, y_train, x_test, y_test, SVC(kernel="rbf", C=1.0, gamma=0.001))

main()

### Implementing a Nearest-Centroid Classifier

Even without sklearn, we can quickly implement a nearest-centroid classifier for the iris dataset. The process involves calculating the per-feature means (centroids) of each class from the training samples. This is all that is needed to "train" the model. Predictions are made by computing the Euclidean distance from each test sample to the three centroids, assigning the sample to the class with the nearest centroid. 

In [None]:
import numpy as np

def centroids(x,y):
    c0 = x[np.where(y==0)].mean(axis=0)
    c1 = x[np.where(y==1)].mean(axis=0)
    c2 = x[np.where(y==2)].mean(axis=0)
    return [c0,c1,c2]

def predict(c0,c1,c2,x):
    p = np.zeros(x.shape[0], dtype="uint8")
    for i in range(x.shape[0]):
        d = [((c0-x[i])**2).sum(),
             ((c1-x[i])**2).sum(),
             ((c2-x[i])**2).sum()]
        p[i] = np.argmin(d)
    return p

def main():
    x = np.load("../data/iris/iris_features.npy")
    y = np.load("../data/iris/iris_labels.npy")
    N = 120
    x_train = x[:N]; x_test = x[N:]
    y_train = y[:N]; y_test = y[N:]
    c0, c1, c2 = centroids(x_train, y_train)
    p = predict(c0,c1,c2, x_test)
    nc = len(np.where(p == y_test)[0])
    nw = len(np.where(p != y_test)[0])
    acc = float(nc) / (float(nc)+float(nw))
    print("predicted:", p)
    print("actual   :", y_test)
    print("test accuracy = %0.4f" % acc)

main()
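
The listing above hard-codes three classes and loops over the test samples. A possible variant, sketched here with NumPy broadcasting, handles any number of classes and computes all distances at once. The function names `fit_centroids` and `predict_nearest` and the synthetic clusters are assumptions for illustration.

```python
import numpy as np

def fit_centroids(x, y):
    """Per-class feature means; returns the class labels and their centroids."""
    classes = np.unique(y)
    cents = np.array([x[y == c].mean(axis=0) for c in classes])
    return classes, cents

def predict_nearest(classes, cents, x):
    """Assign each row of x to the class with the nearest centroid."""
    # Shape (n_samples, n_classes): squared Euclidean distance to every centroid.
    d = ((x[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# Three well-separated synthetic clusters with four features each.
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal(c, 0.1, (20, 4)) for c in (0.0, 5.0, 10.0)])
y_demo = np.repeat([0, 1, 2], 20)

classes, cents = fit_centroids(x_demo, y_demo)
pred = predict_nearest(classes, cents, x_demo)
acc = (pred == y_demo).mean()
print("accuracy = %0.4f" % acc)
```

Squared distances are enough here because taking the square root does not change which centroid is closest.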

## Breast Cancer

The breast cancer dataset contains 569 samples, each with 30 continuous features, including 212 malignant and 357 benign cases. Before training, we normalize the dataset by subtracting the mean and dividing by the standard deviation for each feature. Normalization ensures all features are on a similar scale, improving performance for many models.
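
The experiments below load features that are already standardized. For reference, the normalization step can be sketched as follows; the function name `standardize`, the small `eps` guard against zero variance, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Subtract each feature's mean and divide by its standard deviation."""
    m = x.mean(axis=0)
    s = x.std(axis=0)
    return (x - m) / (s + eps), m, s

# Synthetic features on very different scales.
rng = np.random.default_rng(0)
x_raw = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.01], size=(569, 3))
x_std, m, s = standardize(x_raw)

print("means:", x_std.mean(axis=0).round(6))
print("stds :", x_std.std(axis=0).round(6))
```

After this step every feature has a mean close to 0 and a standard deviation close to 1, so no single feature dominates distance-based models such as k-NN or the RBF SVM.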

Using an 80/20 train-test split (455 training samples and 114 test samples), we train nine classifiers: nearest centroid, k-NN (k = 3 and k = 7), Gaussian naive Bayes, decision tree, random forest (5 and 50 trees), linear SVM, and RBF SVM. For the SVMs, we set the margin constant C to the default 1.0, and γ for the RBF kernel to 0.0333 (1/30, one over the number of features).


In [None]:
# BC experiments
import numpy as np
from sklearn.neighbors import NearestCentroid, KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC 

def run(x_train, y_train, x_test, y_test, clf):
    clf.fit(x_train, y_train)
    print("    score = %0.4f" % clf.score(x_test, y_test))
    print()

def main():
    x = np.load("../data/breast/bc_features_standard.npy")
    y = np.load("../data/breast/bc_labels.npy")
    N = 455 
    x_train = x[:N];  x_test = x[N:]
    y_train = y[:N];  y_test = y[N:]

    print("Nearest centroid:")
    run(x_train, y_train, x_test, y_test, NearestCentroid())
    print("k-NN classifier (k=3):")
    run(x_train, y_train, x_test, y_test, KNeighborsClassifier(n_neighbors=3))
    print("k-NN classifier (k=7):")
    run(x_train, y_train, x_test, y_test, KNeighborsClassifier(n_neighbors=7))
    print("Naive Bayes classifier (Gaussian):")
    run(x_train, y_train, x_test, y_test, GaussianNB())
    print("Decision tree classifier:")
    run(x_train, y_train, x_test, y_test, DecisionTreeClassifier())
    print("Random forest classifier (estimators=5):")
    run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=5))
    print("Random forest classifier (estimators=50):")
    run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=50))
    print("SVM (linear, C=1.0):")
    run(x_train, y_train, x_test, y_test, SVC(kernel="linear", C=1.0))
    print("SVM (RBF, C=1.0, gamma=0.03333):")
    run(x_train, y_train, x_test, y_test, SVC(kernel="rbf", C=1.0, gamma=0.03333))

main()

### Adding k-Fold Validation

To implement k-fold validation, we first select a value for k. For the breast cancer dataset with 569 samples, a balance is needed: a smaller k ensures each fold has enough samples to represent the data reasonably, while a larger k helps average out the effects of a "bad" split. A common choice is k = 5, giving roughly 113 samples per fold, so each round trains on about 80% of the data and tests on the remaining 20%. The code is designed to allow easy adjustment of k.

In [None]:
# BC K-Fold
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import sys

def run(x_train, y_train, x_test, y_test, clf):
    clf.fit(x_train, y_train)
    return clf.score(x_test, y_test)

def split(x,y,k,m):
    ns = int(y.shape[0]/m)
    s = []
    for i in range(m):
        s.append([x[(ns*i):(ns*i+ns)],
                  y[(ns*i):(ns*i+ns)]])
    x_test, y_test = s[k]
    x_train = []
    y_train = []
    for i in range(m):
        if (i==k):
            continue
        else:
            a,b = s[i]
            x_train.append(a)
            y_train.append(b)
    x_train = np.array(x_train).reshape(((m-1)*ns,30))
    y_train = np.array(y_train).reshape((m-1)*ns)
    return [x_train, y_train, x_test, y_test]

def pp(z,k,s):
    m = z.shape[1]
    print("%-19s: %0.4f +/- %0.4f | " % (s, z[k].mean(), z[k].std()/np.sqrt(m)), end='')
    for i in range(m):
        print("%0.4f " % z[k,i], end='')
    print()

def main():
    x = np.load("../data/breast/bc_features_standard.npy")
    y = np.load("../data/breast/bc_labels.npy")
    idx = np.argsort(np.random.random(y.shape[0]))
    x = x[idx]
    y = y[idx]
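    # Number of folds: first command-line argument if it is an integer, else 5
    # (inside a notebook, sys.argv[1] is a kernel option rather than a number).
    m = int(sys.argv[1]) if len(sys.argv) > 1 and sys.argv[1].isdigit() else 5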
    z = np.zeros((8,m))

    for k in range(m):
        x_train, y_train, x_test, y_test = split(x,y,k,m)
        z[0,k] = run(x_train, y_train, x_test, y_test, NearestCentroid())
        z[1,k] = run(x_train, y_train, x_test, y_test, KNeighborsClassifier(n_neighbors=3))
        z[2,k] = run(x_train, y_train, x_test, y_test, KNeighborsClassifier(n_neighbors=7))
        z[3,k] = run(x_train, y_train, x_test, y_test, GaussianNB())
        z[4,k] = run(x_train, y_train, x_test, y_test, DecisionTreeClassifier())
        z[5,k] = run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=5))
        z[6,k] = run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=50))
        z[7,k] = run(x_train, y_train, x_test, y_test, SVC(kernel="linear", C=1.0))

    pp(z,0,"Nearest"); pp(z,1,"3-NN")
    pp(z,2,"7-NN");    pp(z,3,"Naive Bayes")
    pp(z,4,"Decision tree");    pp(z,5,"Random forest (5)")
    pp(z,6,"Random forest (50)");    pp(z,7,"SVM (linear)")

main()
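
Note that `split` above uses `ns = int(569/m)`, so with five folds each fold has 113 samples and the last 4 samples of the shuffled data are never used. A possible alternative, sketched here, builds the folds from shuffled indices with `np.array_split`, which keeps every sample at the cost of slightly unequal fold sizes. The generator name `kfold_indices` and its `seed` parameter are assumptions for illustration.

```python
import numpy as np

def kfold_indices(n_samples, m, seed=0):
    """Yield (train_idx, test_idx) pairs for m-fold validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)        # shuffle once, up front
    folds = np.array_split(idx, m)          # fold sizes differ by at most one
    for k in range(m):
        test = folds[k]
        train = np.concatenate([folds[i] for i in range(m) if i != k])
        yield train, test

# Fold sizes for the 569-sample breast cancer dataset with five folds.
sizes = [(len(tr), len(te)) for tr, te in kfold_indices(569, 5)]
print(sizes)
```

Each pair of index arrays can then be used as `x[train], y[train], x[test], y[test]` in place of the output of `split`.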

### Fine-Tuning the RBF Kernel SVM

For the RBF (Gaussian) kernel SVM, both C and γ must be optimized. A 2D grid search is performed:

- C uses the same range as the linear SVM.
- γ is set to 2^p times the default 1/30, for integer p ∈ [-4, 3].

For each pair (C, γ), five-fold validation is performed, and the pair with the highest mean accuracy is selected. Repeated runs produce slightly different results due to randomization in the dataset ordering.

One promising combination is (C, γ) = (10, 0.00417), which achieves a grand mean accuracy of 97.70%, the highest among all models tested on the breast cancer dataset.
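
The grid itself can be checked with a few lines of arithmetic. This sketch assumes the C values used in the search listing that follows and simply prints the eight γ values and the total number of model fits.

```python
import numpy as np

Cs = np.array([0.01, 0.1, 1.0, 2.0, 10.0, 50.0, 100.0])
ps = np.arange(-4, 4)               # p = -4, -3, ..., 3
gs = (1.0 / 30) * 2.0 ** ps         # gamma = 2^p / 30

for p, g in zip(ps, gs):
    print("p = %2d  gamma = %0.5f" % (p, g))

# Every (C, gamma) pair is evaluated with five-fold validation.
n_fits = len(Cs) * len(gs) * 5
print("total fits:", n_fits)
```

The reported γ = 0.00417 corresponds to p = -3, that is, one eighth of the default 1/30.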


In [None]:
# BC RBF SVM Search
import numpy as np
from sklearn.svm import SVC 

def run(x_train, y_train, x_test, y_test, clf):
    clf.fit(x_train, y_train)
    return clf.score(x_test, y_test)

def split(x,y,k,m):
    ns = int(y.shape[0]/m)
    s = []
    for i in range(m):
        s.append([x[(ns*i):(ns*i+ns)], y[(ns*i):(ns*i+ns)]])
    x_test, y_test = s[k]
    x_train = []
    y_train = []
    for i in range(m):
        if (i==k):
            continue
        else:
            a,b = s[i]
            x_train.append(a)
            y_train.append(b)
    x_train = np.array(x_train).reshape(((m-1)*ns,30))
    y_train = np.array(y_train).reshape((m-1)*ns)
    return [x_train, y_train, x_test, y_test]

def main():
    m = 5 
    x = np.load("../data/breast/bc_features_standard.npy")
    y = np.load("../data/breast/bc_labels.npy")
    idx = np.argsort(np.random.random(y.shape[0]))
    x = x[idx]
    y = y[idx]

    Cs = np.array([0.01,0.1,1.0,2.0,10.0,50.0,100.0])
    gs = (1./30)*2.0**np.array([-4,-3,-2,-1,0,1,2,3])
    zmax = 0.0 
    for C in Cs: 
        for g in gs: 
            z = np.zeros(m)
            for k in range(m):
                x_train, y_train, x_test, y_test = split(x,y,k,m)
                z[k] = run(x_train, y_train, x_test, y_test, SVC(C=C,gamma=g,kernel="rbf"))
            if (z.mean() > zmax):
                zmax = z.mean()
                bestC = C 
                bestg = g 
    print("best C     = %0.5f" % bestC)
    print("     gamma = %0.5f" % bestg)
    print("   accuracy= %0.5f" % zmax)

main()



## MNIST

The final dataset examined in this chapter is the vector version of MNIST, which contains 28×28 grayscale images of handwritten digits (0–9), one per image. MNIST is a foundational dataset in machine learning and deep learning and will be used throughout the book.

MNIST has 60,000 training images and 10,000 test images, roughly balanced across the 10 digits. Because the dataset is large, classical models are trained directly on the training set and tested on the test set, without using k-fold validation.

The images are converted into vectors of 784 elements (28 × 28 pixels), with values from 0 to 255. Three versions of the dataset are considered:

1. Raw byte values (0–255)
2. Scaled data to [0, 1) by dividing by 256
3. Normalized data, where each pixel has its mean subtracted and is divided by its standard deviation


In [None]:
# MNIST experiments
import time
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn import decomposition

def run(x_train, y_train, x_test, y_test, clf):
    s = time.time()
    clf.fit(x_train, y_train)
    e_train = time.time() - s 
    s = time.time()
    score = clf.score(x_test, y_test)
    e_test = time.time() - s 
    print("score = %0.4f (time, train=%8.3f, test=%8.3f)" % (score, e_train, e_test))

def train(x_train, y_train, x_test, y_test):
    print("    Nearest centroid          : ", end='')
    run(x_train, y_train, x_test, y_test, NearestCentroid())
    print("    k-NN classifier (k=3)     : ", end='')
    run(x_train, y_train, x_test, y_test, KNeighborsClassifier(n_neighbors=3))
    print("    k-NN classifier (k=7)     : ", end='')
    run(x_train, y_train, x_test, y_test, KNeighborsClassifier(n_neighbors=7))
    print("    Naive Bayes (Gaussian)    : ", end='')
    run(x_train, y_train, x_test, y_test, GaussianNB())
    print("    Decision tree             : ", end='')
    run(x_train, y_train, x_test, y_test, DecisionTreeClassifier())
    print("    Random forest (trees=  5) : ", end='')
    run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=5))
    print("    Random forest (trees= 50) : ", end='')
    run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=50))
    print("    Random forest (trees=500) : ", end='')
    run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=500))
    print("    Random forest (trees=1000): ", end='')
    run(x_train, y_train, x_test, y_test, RandomForestClassifier(n_estimators=1000))
    print("    LinearSVM (C=0.01)        : ", end='')
    run(x_train, y_train, x_test, y_test, LinearSVC(C=0.01))
    print("    LinearSVM (C=0.1)         : ", end='')
    run(x_train, y_train, x_test, y_test, LinearSVC(C=0.1))
    print("    LinearSVM (C=1.0)         : ", end='')
    run(x_train, y_train, x_test, y_test, LinearSVC(C=1.0))
    print("    LinearSVM (C=10.0)        : ", end='')
    run(x_train, y_train, x_test, y_test, LinearSVC(C=10.0))

The code uses LinearSVC instead of SVC for runtime efficiency and multiclass handling. Helper functions track both model accuracy and training/testing time, which matters because of the dataset's larger size. Training is repeated for the raw, scaled, and normalized versions of the dataset.

Normalization uses the per-pixel mean and standard deviation of the training set, and the same values are applied to the test data, because the much larger training set gives a better estimate of the true distribution. PCA is also applied, reducing the 784 features to 15 principal components, which capture just over 33% of the variance.


In [None]:
def main():
    x_train = np.load("../data/mnist/mnist_train_vectors.npy").astype("float64")
    y_train = np.load("../data/mnist/mnist_train_labels.npy")
    x_test = np.load("../data/mnist/mnist_test_vectors.npy").astype("float64")
    y_test = np.load("../data/mnist/mnist_test_labels.npy")

    print("Models trained on raw [0,255] images:")
    train(x_train, y_train, x_test, y_test)
    print("Models trained on scaled [0,1) images:")
    train(x_train/256.0, y_train, x_test/256.0, y_test)

    m = x_train.mean(axis=0)
    s = x_train.std(axis=0) + 1e-8
    x_ntrain = (x_train - m) / s
    x_ntest  = (x_test - m) / s

    print("Models trained on normalized images:")
    train(x_ntrain, y_train, x_ntest, y_test)

    pca = decomposition.PCA(n_components=15)
    pca.fit(x_ntrain)
    x_ptrain = pca.transform(x_ntrain)
    x_ptest = pca.transform(x_ntest)
    
    print("Models trained on first 15 PCA components of normalized images:")
    train(x_ptrain, y_train, x_ptest, y_test)

main()

### Experimenting with PCA Components

Previously, 15 PCA components were used, representing about 33% of the dataset's variance. To explore the effect of PCA further, the number of components is varied from 10 to 780, and three models are trained for each setting: Gaussian naive Bayes, random forest (50 trees), and linear SVM (C = 1.0). This process is computationally intensive and took over 10 hours on a low-end machine.


In [None]:
import time
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn import decomposition

def run(x_train, y_train, x_test, y_test, clf):
    s = time.time()
    clf.fit(x_train, y_train)
    e_train = time.time() - s 
    s = time.time()
    score = clf.score(x_test, y_test)
    e_test = time.time() - s 
    return [score, e_train, e_test]

def main():
    x_train = np.load("../data/mnist/mnist_train_vectors.npy").astype("float64")
    y_train = np.load("../data/mnist/mnist_train_labels.npy")
    x_test = np.load("../data/mnist/mnist_test_vectors.npy").astype("float64")
    y_test = np.load("../data/mnist/mnist_test_labels.npy")
    m = x_train.mean(axis=0)
    s = x_train.std(axis=0) + 1e-8
    x_ntrain = (x_train - m) / s 
    x_ntest  = (x_test - m) / s 

    n = 78
    pcomp = np.linspace(10,780,n, dtype="int16")
    nb=np.zeros((n,4))
    rf=np.zeros((n,4))
    sv=np.zeros((n,4))
    tv=np.zeros((n,2))

    for i,p in enumerate(pcomp):
        pca = decomposition.PCA(n_components=p)
        pca.fit(x_ntrain)
        xtrain = pca.transform(x_ntrain)
        xtest = pca.transform(x_ntest)
        tv[i,:] = [p, pca.explained_variance_ratio_.sum()]
        sc,etrn,etst =run(xtrain, y_train, xtest, y_test, GaussianNB())
        nb[i,:] = [p,sc,etrn,etst]
        sc,etrn,etst =run(xtrain, y_train, xtest, y_test, RandomForestClassifier(n_estimators=50))
        rf[i,:] = [p,sc,etrn,etst]
        sc,etrn,etst =run(xtrain, y_train, xtest, y_test, LinearSVC(C=1.0))
        sv[i,:] = [p,sc,etrn,etst]

    np.save("../data/mnist/mnist_pca_tv.npy", tv) 
    np.save("../data/mnist/mnist_pca_nb.npy", nb)
    np.save("../data/mnist/mnist_pca_rf.npy", rf)
    np.save("../data/mnist/mnist_pca_sv.npy", sv)

main()
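
Each saved results array has one row per PCA setting with the columns `[components, score, train_time, test_time]`, as filled in by the loop above. A small helper like the following could be used to find the best setting once the run finishes. The function name `best_row` is an assumption, and the demo array holds placeholder numbers, not measured results.

```python
import numpy as np

def best_row(results):
    """Return (components, score) for the row with the highest score."""
    i = int(results[:, 1].argmax())
    return int(results[i, 0]), float(results[i, 1])

# Placeholder values for illustration only; replace with, for example,
# np.load("../data/mnist/mnist_pca_rf.npy") after running the experiment.
demo = np.array([[10, 0.70, 1.0, 0.1],
                 [20, 0.80, 2.0, 0.1],
                 [30, 0.75, 3.0, 0.2]])

p_best, s_best = best_row(demo)
print("best: %d components, score = %0.4f" % (p_best, s_best))
```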



## Classical Model Summary

The chapter concludes with a summary of pros and cons for the six classical models discussed:

### Nearest Centroid

- **Pros:** Simple implementation, fast training, low memory use, supports multiclass classification, fast inference.
- **Cons:** Assumes each class forms a tight cluster in feature space, often too simplistic for complex data. Variants with multiple centroids per class can improve performance.

### k-Nearest Neighbors (k-NN)

- **Pros:** No explicit training required, can perform well when the training set is large relative to the number of features, supports multiclass classification naturally.
- **Cons:** Slow inference because distances must be computed for every training sample, even with optimized algorithms.

### Naive Bayes

- **Pros:** Fast to train and classify, supports multiclass problems, works for both discrete and continuous features.
- **Cons:** Assumes feature independence, which is rarely true in practice. Continuous features often require additional distributional assumptions (e.g., Gaussian).

### Decision Trees

- **Pros:** Fast training and inference, interpretable, supports multiclass and mixed feature types, can justify decisions with a clear path from root to leaf.
- **Cons:** Prone to overfitting, interpretability decreases with tree size, requires balancing tree depth against accuracy.

### Random Forests

- **Pros:** Robust to overfitting, supports multiclass problems, reasonably fast to train and infer, less sensitive to feature scaling, accuracy improves with more trees.
- **Cons:** Harder to interpret than single decision trees, inference time scales linearly with the number of trees, stochastic performance can vary slightly between trainings.
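
The run-to-run variation of a random forest comes from its internal randomness. When repeatable results are needed, sklearn's `random_state` parameter can be fixed, as in this sketch. The synthetic data, the helper name `fit_predict`, and the seed value 42 are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class problem: 150 training and 50 test samples.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

def fit_predict(seed):
    """Train a five-tree forest with a fixed seed and predict the test rows."""
    clf = RandomForestClassifier(n_estimators=5, random_state=seed)
    clf.fit(x[:150], y[:150])
    return clf.predict(x[150:])

# Two trainings with the same seed give identical predictions.
same = np.array_equal(fit_predict(42), fit_predict(42))
print("identical predictions:", same)
```

Leaving `random_state` unset, as the experiments in this chapter do, is what makes scores differ slightly between runs.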

### Support Vector Machines (SVMs)

- **Pros:** Can achieve excellent performance, fast inference after training.
- **Cons:** Multiclass requires multiple models, only supports continuous features, sensitive to feature scaling, difficult to train on large datasets with non-linear kernels, requires careful hyperparameter tuning.

## When to Use Classical Models

Classical models remain appropriate under certain conditions:

1. **Small datasets:** They perform well when there are only tens or hundreds of examples, unlike deep learning models that require larger datasets.
2. **Limited computational resources:** Simple models (nearest centroid, naive Bayes, decision trees, SVMs) are feasible on low-power devices; k-NN may be too slow unless the dataset is small.
3. **Explainability:** Models like decision trees, k-NN, nearest centroid, and naive Bayes can explain their predictions, unlike deep neural networks.
4. **Vector inputs without structure:** When features are independent and unstructured (not spatially correlated as in images), classical models are suitable.

These are rules of thumb, not hard rules. Deep learning could be used even when these conditions apply, but classical models may provide sufficient performance with less complexity.


