<a href="https://colab.research.google.com/github/avionerman/computational_and_statistical/blob/main/svm_part_a.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports, Data Loading & Data subsets

In [1]:
# import numpy as np
# from tensorflow.keras.datasets import cifar10
# from sklearn.model_selection import train_test_split, GridSearchCV
!pip install cupy-cuda12x




In [2]:
!pip install tensorflow
import numpy as np
from tensorflow.keras.datasets import cifar10
from sklearn.model_selection import train_test_split, GridSearchCV

start_bold = "\033[1m"  # ANSI escape: bold on
end_bold = "\033[0m"    # ANSI escape: reset formatting

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
# x_train.shape, y_train.shape, x_test.shape, y_test.shape

class_names = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]

# # 25% train subset (test_size=0.75 discards 75% of the training data)
# x_train, _, y_train, _ = train_test_split(x_train, y_train, test_size=0.75, stratify=y_train, random_state=42)

# # 10% test subset
# x_test, _, y_test, _ = train_test_split(x_test, y_test, test_size=0.90, stratify=y_test, random_state=42)

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170498071/170498071 ━━━━━━━━━━━━━━━━━━━━ 4s 0us/step


# Preprocessing phase

## Flatten enablement

In [4]:
##### Flatten step #####

print(start_bold + "Flattening explanation:" + end_bold)
print("Flattening is required because neither PCA nor SVM accepts 3D image arrays.\n"
"PCA computes the covariance matrix across features,\n"
"so it needs a 2D array of shape [samples x features], where each feature is a column.\n")

x_train = x_train.reshape(len(x_train), -1)
x_test  = x_test.reshape(len(x_test), -1)
# x_train.shape, x_test.shape

Flattening explanation:
Flattening is required because neither PCA nor SVM accepts 3D image arrays.
PCA computes the covariance matrix across features,
so it needs a 2D array of shape [samples x features], where each feature is a column.
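
The reshape used above can be checked on a small array. This is a minimal sketch with NumPy only; the random `images` batch is a hypothetical stand-in for CIFAR-10, not the real data.

```python
import numpy as np

# Hypothetical stand-in for a CIFAR-10 batch: 4 RGB images of 32x32 pixels.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Keep the sample axis and collapse height x width x channels into one feature axis.
flat = images.reshape(len(images), -1)
print(flat.shape)  # (4, 3072)

# With NumPy's default C order, the first three features of a row are the
# R, G, B values of that image's top-left pixel.
print(np.array_equal(flat[0, :3], images[0, 0, 0, :]))

# The reshape loses no information: it can be undone exactly.
restored = flat.reshape(-1, 32, 32, 3)
print(np.array_equal(restored, images))
```

Flattening keeps every pixel value but discards the explicit 2D neighbourhood structure, which is fine for PCA and SVM because they treat each column as an independent feature.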



## Normalization enablement

In [5]:
##### Normalization step #####

print(start_bold + "Normalization explanation:" + end_bold)
print("I normalize the pixel values to [0, 1] mainly to:\n"
"[1] prevent the upcoming models from being dominated by features with large values.\n"
"[2] give the distance-based models a consistent scale to work with.\n"
"[3] help the PCA step find meaningful directions rather than ones driven by large magnitudes.\n")

x_train = x_train.astype("float32") / 255.0
x_test  = x_test.astype("float32") / 255.0
# print(x_train.min(), x_train.max(), x_test.min(), x_test.max())

Normalization explanation:
I normalize the pixel values to [0, 1] mainly to:
[1] prevent the upcoming models from being dominated by features with large values.
[2] give the distance-based models a consistent scale to work with.
[3] help the PCA step find meaningful directions rather than ones driven by large magnitudes.
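
One practical reason for the `astype("float32")` cast in this step: arithmetic on raw `uint8` pixels wraps around silently, which would corrupt any distance computation. The sketch below uses two made-up pixel vectors (`a` and `b` are illustrative, not taken from the dataset).

```python
import numpy as np

# Two hypothetical 2-pixel "images" stored as raw uint8 values.
a = np.array([200, 10], dtype=np.uint8)
b = np.array([100, 20], dtype=np.uint8)

# uint8 arithmetic wraps modulo 256, so 10 - 20 becomes 246 instead of -10.
wrapped = a - b
print(wrapped)

# Casting to float32 and scaling to [0, 1] gives the expected signed differences.
a_f = a.astype("float32") / 255.0
b_f = b.astype("float32") / 255.0
diff = a_f - b_f
print(diff * 255.0)
```

The division by 255 itself only rescales distances by a constant factor; the cast to a floating-point type is what makes the differences meaningful.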



## Standardization enablement

In [6]:
##### Standardization step #####

print(start_bold + "Standardization explanation:" + end_bold)
print("I standardize the data mainly to:\n"
"[1] make PCA work better, since features with high variance will no longer dominate.\n"
"[2] let the SVM use all features on a common scale.\n"
"*fit is called on the training set only, to prevent data leakage.\n"
"**fit learns the statistics from the data, while transform applies them to new data.\n"
"***after scaling, the training mean should be 0 and the std. dev should be 1.\n")

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print("The train mean (μ) is:", x_train_scaled.mean(),
      "and the std. dev (σ) is:", x_train_scaled.std())

Standardization explanation:
I standardize the data mainly to:
[1] make PCA work better, since features with high variance will no longer dominate.
[2] let the SVM use all features on a common scale.
*fit is called on the training set only, to prevent data leakage.
**fit learns the statistics from the data, while transform applies them to new data.
***after scaling, the training mean should be 0 and the std. dev should be 1.

The train mean (μ) is: -1.4948844e-09 and the std. dev (σ) is: 0.99999964
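
The fit/transform split described above can be written out by hand. This is a NumPy-only sketch on synthetic data (the `train` and `test` arrays and their scales are assumptions for illustration); it mirrors the idea behind `StandardScaler` rather than its exact implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 3 features on very different scales.
loc = [0.0, 50.0, -3.0]
scale = [1.0, 20.0, 0.1]
train = rng.normal(loc=loc, scale=scale, size=(500, 3))
test = rng.normal(loc=loc, scale=scale, size=(100, 3))

# "fit": learn per-feature statistics from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

# "transform": apply the training statistics to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# Per-feature check, stricter than a single global mean/std.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
# The test split is close to, but not exactly, zero mean and unit variance.
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The notebook prints one global mean and standard deviation; checking them per column, as here, also confirms that no single feature was left unscaled.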


## PCA enablement

In [7]:
##### PCA (Principal component analysis) step #####

print(start_bold + "PCA explanation:" + end_bold)
print("To be updated: \n"
"*full: accurate, slow, memory heavy. \n"
"**auto: data shape (n_samples, n_features) -- recommended. \n")

from sklearn.decomposition import PCA
import time

start_time = time.time()
pca = PCA(n_components=0.90, svd_solver="auto", random_state=42)
x_train_pca = pca.fit_transform(x_train_scaled)
x_test_pca = pca.transform(x_test_scaled)
end_time = time.time()

print(f">>> The total PCA time was: {(end_time - start_time):.2f} seconds ({(end_time - start_time)/60:.2f} minutes)")
print(">>>",[float((pca.explained_variance_ratio_.sum())), (x_train_scaled.shape[1]), (x_train_pca.shape[1])] )

print("\n" + start_bold + "PCA results:" + end_bold)
print("Only 103 components are needed to explain just over 90% of the variance.\n"
"That looks good: the dimensionality drops from 3072 to 103, and at the same time "
"we lose less than 10% of the total variance.")

y_train_flat = y_train.ravel()
y_test_flat  = y_test.ravel()

PCA explanation:
To be updated: 
*full: accurate, slow, memory heavy. 
**auto: data shape (n_samples, n_features) -- recommended. 

>>> The total PCA time was: 18.79 seconds (0.31 minutes)
>>> [0.9006754159927368, 3072, 103]

PCA results:
Only 103 components are needed to explain just over 90% of the variance.
That looks good: the dimensionality drops from 3072 to 103, and at the same time we lose less than 10% of the total variance.
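
To make the meaning of `n_components=0.90` concrete, the sketch below picks the number of components from the cumulative explained-variance ratio using a plain SVD. The synthetic `X` and its `scales` are assumptions for illustration, and scikit-learn's exact threshold and tie handling may differ slightly from this version.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 20 features with decaying variance.
scales = np.linspace(5.0, 0.1, 20)
X = rng.normal(size=(200, 20)) * scales

# PCA via SVD of the centered data.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance captured by each principal component, as a fraction of the total.
explained_variance = s**2 / (len(X) - 1)
ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative ratio reaches the 90% target.
k = int(np.searchsorted(cumulative, 0.90) + 1)
X_reduced = Xc @ Vt[:k].T

print(f"components kept: {k} of {X.shape[1]}")
print(f"variance retained: {cumulative[k - 1]:.4f}")
```

The same reading applies to the notebook's output: 103 is the smallest number of components whose cumulative ratio passes 0.90, and the retained fraction (0.9007) refers to variance, not to the share of components kept.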


# Test phase

This phase only checks that the pipeline works end to end: that the accuracy is plausible and that PCA behaves as expected on the dataset.

Optimization comes in the following phase, once this step has produced a sensible baseline to build on.
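
A simple reference point for judging whether an accuracy is plausible is the chance level. The sketch below assumes a balanced 10-class test set with 1,000 samples per class, as in the CIFAR-10 test split; the two baselines are illustrative and are not part of the original pipeline.

```python
import numpy as np

n_classes = 10
# Balanced labels: 1,000 samples for each of the 10 classes.
y_true = np.repeat(np.arange(n_classes), 1000)

# Baseline 1: always predict class 0.
constant_pred = np.zeros_like(y_true)
constant_acc = float((constant_pred == y_true).mean())

# Baseline 2: predict a uniformly random class.
rng = np.random.default_rng(0)
random_pred = rng.integers(0, n_classes, size=y_true.shape)
random_acc = float((random_pred == y_true).mean())

print(f"constant baseline: {constant_acc:.3f}")
print(f"random baseline:   {random_acc:.3f}")
```

Any model in the later phases should sit clearly above this roughly 10% level; an accuracy near it would point to a bug in the pipeline rather than a weak model.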

## Train an SVM with LinearSVC (once)

In [56]:
print(1)
# from sklearn.svm import LinearSVC
# from sklearn.metrics import accuracy_score
# import time
# from sklearn.model_selection import GridSearchCV

# param_grid = {
#     "C": [1]
# }

# grid = GridSearchCV(
#     LinearSVC(max_iter=5000),
#     param_grid=param_grid,
#     n_jobs=1,
#     verbose=3
# )

# start_time = time.time()
# grid.fit(x_train_pca, y_train_flat)
# train_time = time.time() - start_time

# print(f">>> The total LinearSVC time was: {train_time:.2f} seconds")

# y_train_pred_linearsvc = grid.predict(x_train_pca)
# y_test_pred_linearsvc  = grid.predict(x_test_pca)

# train_acc_w_linearsvc = accuracy_score(y_train_flat, y_train_pred_linearsvc)
# test_acc_w_linearsvc  = accuracy_score(y_test_flat, y_test_pred_linearsvc)

# print(f">>> The train accuracy was {train_acc_w_linearsvc:.4f} and the test accuracy was {test_acc_w_linearsvc:.4f}")

# print("\n" + start_bold + "LinearSVC results:" + end_bold)
# print("The train accuracy was close to 40%, same as the test accuracy.\n"
# "That looks good, firstly because we don't have to worry about overfitting, and secondly the pipeline is working as expected.\n")

1


## Train an SVM with SVC(kernel=linear) (once)

In [57]:
print(1)
# from sklearn.svm import SVC
# from sklearn.metrics import accuracy_score
# import time

# model = SVC(kernel="linear", C=1.0, verbose=True)

# start_time = time.time()
# model.fit(x_train_pca, y_train_flat)
# train_time = time.time() - start_time

# print(f">>> The total SVC (linear) time was: {train_time:.2f} seconds")

# y_train_pred = model.predict(x_train_pca)
# y_test_pred  = model.predict(x_test_pca)

# train_acc = accuracy_score(y_train_flat, y_train_pred)
# test_acc  = accuracy_score(y_test_flat, y_test_pred)

# print(f">>> The train accuracy was {train_acc:.2f} and the test accuracy was {test_acc:.2f}")

1


# Model Selection - LinearSVC

In [9]:
import time
import pandas as pd
import cupy as cp

from cuml.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Move data to GPU
X_train_gpu = cp.asarray(x_train_pca)
X_test_gpu  = cp.asarray(x_test_pca)
y_train_gpu = cp.asarray(y_train_flat)

param_grid = {
    "C": [1, 3, 5, 7, 10]
}

grid = GridSearchCV(
    estimator=LinearSVC(max_iter=5000),
    param_grid=param_grid,
    cv=5,
    verbose=3,
    n_jobs=1
)

start_time = time.time()
grid.fit(X_train_gpu, y_train_flat)
grid_time = time.time() - start_time

print(f"\n >>> LinearSVC(GPU) total time: {grid_time:.2f} seconds")

best_model = grid.best_estimator_
best_params = grid.best_params_

print("\n >>> LinearSVC(GPU) Summary")
print("--------------------------")
print("Best mean CV accuracy:", grid.best_score_)
print("Best params:", best_params)

# ---- Predictions on GPU ----
y_train_pred = cp.asnumpy(best_model.predict(X_train_gpu))
y_test_pred  = cp.asnumpy(best_model.predict(X_test_gpu))

train_acc = accuracy_score(y_train_flat, y_train_pred)
test_acc  = accuracy_score(y_test_flat,  y_test_pred)

print("\nFinal Evaluation:")
print("Train acc:", train_acc)
print("Test acc:", test_acc)

results = pd.DataFrame(grid.cv_results_)
cv_table = results[["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]

print("\n >>> Cross-Validation outcome table")
print(cv_table.sort_values("rank_test_score"))


Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5] END ...............................C=1;, score=0.393 total time=   4.8s
[CV 2/5] END ...............................C=1;, score=0.383 total time=   0.2s
[CV 3/5] END ...............................C=1;, score=0.397 total time=   0.2s
[CV 4/5] END ...............................C=1;, score=0.390 total time=   0.2s
[CV 5/5] END ...............................C=1;, score=0.393 total time=   0.2s
[CV 1/5] END ...............................C=3;, score=0.393 total time=   0.2s
[CV 2/5] END ...............................C=3;, score=0.383 total time=   0.2s
[CV 3/5] END ...............................C=3;, score=0.397 total time=   0.2s
[CV 4/5] END ...............................C=3;, score=0.390 total time=   0.2s
[CV 5/5] END ...............................C=3;, score=0.393 total time=   0.2s
[CV 1/5] END ...............................C=5;, score=0.393 total time=   0.2s
[CV 2/5] END ...............................C=5;,
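
The `mean_test_score` that ranks each candidate is the average of its per-fold scores. The sketch below recomputes it for `C=1` from the five fold scores printed in the log above; the standard deviation uses NumPy's default population formula, which is assumed here to match what `cv_results_` reports.

```python
import numpy as np

# Fold scores for C=1, copied from the GridSearchCV log above.
fold_scores_c1 = np.array([0.393, 0.383, 0.397, 0.390, 0.393])

mean_score = float(fold_scores_c1.mean())
std_score = float(fold_scores_c1.std())  # population std (ddof=0)

print(f"mean CV accuracy: {mean_score:.4f}")
print(f"std across folds: {std_score:.4f}")
```

In the visible part of the log, `C=3` produces the same five fold scores as `C=1`, which suggests that accuracy is not sensitive to `C` in this range for the linear model.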

# Model Selection - SVC(kernel=rbf)

In [11]:
import time
import pandas as pd
import cupy as cp

from cuml.svm import SVC                     # GPU SVM (RBF)
from cuml.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# --- Move data to GPU ---
X_train_gpu = cp.asarray(x_train_pca)
X_test_gpu  = cp.asarray(x_test_pca)
y_train_gpu = cp.asarray(y_train_flat)

param_grid_rbf = {
    "C": [1, 3, 5, 7, 10],
    "gamma": ["scale"]
}

rbf_grid = GridSearchCV(
    estimator=SVC(kernel="rbf", verbose=True),
    param_grid=param_grid_rbf,
    cv=5,
    verbose=3,
    n_jobs=1
)

start_time = time.time()
rbf_grid.fit(X_train_gpu, y_train_flat)
rbf_time = time.time() - start_time

print(f"\n >>> SVM(rbf, GPU) with GridSearchCV total execution time: {rbf_time:.2f} seconds")

best_rbf_score  = float(rbf_grid.best_score_)
best_rbf_params = rbf_grid.best_params_
best_rbf_model  = rbf_grid.best_estimator_

print("\n >>> SVM(rbf, GPU) Execution Summary")
print("----------------------------")
print(f"Best mean CV accuracy: {best_rbf_score:.4f}")
print(f"Best hyperparameters: {best_rbf_params}")
print(f"Best model: {best_rbf_model}")

# --- Predict on GPU, then bring back to CPU for metrics ---
y_train_pred_gpu = best_rbf_model.predict(X_train_gpu)
y_test_pred_gpu  = best_rbf_model.predict(X_test_gpu)

y_train_pred_rbf = cp.asnumpy(y_train_pred_gpu)
y_test_pred_rbf  = cp.asnumpy(y_test_pred_gpu)

train_acc_rbf = accuracy_score(y_train_flat, y_train_pred_rbf)
test_acc_rbf  = accuracy_score(y_test_flat,  y_test_pred_rbf)

print("\n >>> Final Evaluation of the best SVM(rbf, GPU)")
print(f"Train accuracy: {train_acc_rbf:.4f}")
print(f"Test accuracy:  {test_acc_rbf:.4f}")

rbf_results = pd.DataFrame(rbf_grid.cv_results_)

rbf_cv_table = rbf_results[
    ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
]
print("\n >>> RBF Cross-Validation outcome table")
print(rbf_cv_table.sort_values("rank_test_score"))


Fitting 5 folds for each of 5 candidates, totalling 25 fits
[2025-12-07 12:20:53.797] [CUML] [debug] Creating working set with 1024 elements
[2025-12-07 12:20:53.848] [CUML] [debug] SMO solver finished after 15 outer iterations, total inner 6237 iterations, and diff 0.000993
[2025-12-07 12:20:53.968] [CUML] [debug] Creating working set with 1024 elements
[2025-12-07 12:20:54.003] [CUML] [debug] SMO solver finished after 17 outer iterations, total inner 7269 iterations, and diff 0.000994
[2025-12-07 12:20:54.023] [CUML] [debug] Creating working set with 1024 elements
[2025-12-07 12:20:54.055] [CUML] [debug] SMO solver finished after 15 outer iterations, total inner 6033 iterations, and diff 0.000957
[2025-12-07 12:20:54.076] [CUML] [debug] Creating working set with 1024 elements
[2025-12-07 12:20:54.110] [CUML] [debug] SMO solver finished after 16 outer iterations, total inner 6476 iterations, and diff 0.000993
[2025-12-07 12:20:54.130] [CUML] [debug] Creating working set with 1024 elem
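
For reference, this is what `gamma="scale"` and the RBF kernel compute. The sketch assumes the scikit-learn definition `gamma = 1 / (n_features * X.var())`, which cuML is understood to follow; the random `X` is a hypothetical stand-in for the 103 PCA features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a few rows of PCA features (6 samples, 103 features).
X = rng.normal(size=(6, 103))

# gamma="scale": inverse of (number of features x variance over all entries).
gamma = 1.0 / (X.shape[1] * X.var())

# Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
sq_norms = (X**2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from rounding

# RBF kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
K = np.exp(-gamma * sq_dists)

print(f"gamma = {gamma:.6f}")
print(K.round(3))
```

Because only `gamma="scale"` is in the grid, the search above effectively tunes `C` alone; adding explicit `gamma` values would widen the search at the cost of more fits.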

# Model Selection - SVC(kernel=linear)

In [None]:
import time
import pandas as pd
import cupy as cp

from cuml.svm import SVC
from cuml.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# ---- Move data to GPU ----
X_train_gpu = cp.asarray(x_train_pca)
X_test_gpu  = cp.asarray(x_test_pca)

param_grid_linear = {
    "C": [1, 3, 5, 7, 10],
}

linear_grid = GridSearchCV(
    estimator=SVC(
        kernel="linear",
        verbose=False,
        tol=1e-3,
        max_iter=100000
    ),
    param_grid=param_grid_linear,
    cv=5,
    verbose=2,
    n_jobs=1
)

start_time = time.time()
linear_grid.fit(X_train_gpu, y_train_flat)
linear_time = time.time() - start_time

print(f"\n >>> SVM(linear, GPU) with GridSearchCV total execution time: {linear_time:.2f} seconds")

best_linear_score  = float(linear_grid.best_score_)
best_linear_params = linear_grid.best_params_
best_linear_model  = linear_grid.best_estimator_

print("\n >>> SVM(linear, GPU) Execution Summary")
print("----------------------------")
print(f"Best mean CV accuracy: {best_linear_score:.4f}")
print(f"Best hyperparameters: {best_linear_params}")
print(f"Best model: {best_linear_model}")

# ---- Predict on GPU, then bring results back to CPU ----
y_train_pred_gpu = best_linear_model.predict(X_train_gpu)
y_test_pred_gpu  = best_linear_model.predict(X_test_gpu)

y_train_pred_linear = cp.asnumpy(y_train_pred_gpu)
y_test_pred_linear  = cp.asnumpy(y_test_pred_gpu)

train_acc_linear = accuracy_score(y_train_flat, y_train_pred_linear)
test_acc_linear  = accuracy_score(y_test_flat,  y_test_pred_linear)

print("\n >>> Final Evaluation of the best SVM(linear, GPU)")
print(f"Train accuracy: {train_acc_linear:.4f}")
print(f"Test accuracy:  {test_acc_linear:.4f}")

linear_results = pd.DataFrame(linear_grid.cv_results_)

linear_cv_table = linear_results[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print("\n >>> Linear Cross-Validation outcome table")
print(linear_cv_table.sort_values("rank_test_score"))


Fitting 5 folds for each of 5 candidates, totalling 25 fits
[2025-12-07 12:27:27.826] [CUML] [debug] Creating working set with 1024 elements
[2025-12-07 12:27:40.707] [CUML] [debug] SMO iteration 500, diff 0.213985
[2025-12-07 12:27:53.321] [CUML] [debug] SMO iteration 1000, diff 0.028822
[2025-12-07 12:28:05.921] [CUML] [debug] SMO iteration 1500, diff 0.007737
[2025-12-07 12:28:11.109] [CUML] [debug] Solver is not converging monotonically. This might be caused by insufficient normalization of the feature columns. In that case MinMaxScaler((0,1)) could help. Alternatively, for nonlinear kernels, you can try to increase the gamma parameter. To limit execution time, you can also adjust the number of iterations using the max_iter parameter.
[2025-12-07 12:28:17.082] [CUML] [debug] SMO iteration 2000, diff 0.001528
[2025-12-07 12:28:22.659] [CUML] [debug] SMO iteration 2500, diff 0.003674
[2025-12-07 12:28:26.109] [CUML] [debug] SMO solver finished after 2982 outer iterations, total inner
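
The solver message in the log suggests `MinMaxScaler((0,1))` when convergence is not monotonic. One plausible contributor here is that unwhitened PCA outputs have very different spreads per column. The sketch below shows the rescaling the message refers to, written with NumPy on synthetic data (`spread`, `train` and `test` are assumptions); it has not been run against this notebook's features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical PCA-like features: later columns have much smaller spread.
spread = np.array([30.0, 10.0, 3.0, 1.0])
train = rng.normal(size=(200, 4)) * spread
test = rng.normal(size=(50, 4)) * spread

# "fit": column-wise minimum and range from the training split only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
col_range[col_range == 0] = 1.0  # guard against constant columns

# "transform": map training columns to [0, 1]; test values may fall slightly outside.
train_mm = (train - col_min) / col_range
test_mm = (test - col_min) / col_range

print(train.std(axis=0).round(2))     # spreads differ by more than an order of magnitude
print(train_mm.std(axis=0).round(2))  # spreads are now comparable
```

Whether this actually speeds up or stabilises the linear-kernel SVC on the PCA features would need to be measured; the rescaling also changes the relative weight of the components, so accuracy may shift.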

# kNN & NCC models

In [14]:
import time
import numpy as np
import cupy as cp

from cuml.neighbors import KNeighborsClassifier as cuKNN  # GPU KNN
from sklearn.neighbors import NearestCentroid             # CPU NCC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

from scipy import sparse

# Ensure dense arrays for CuPy
if sparse.issparse(x_train_pca):
    x_train_dense = x_train_pca.toarray()
    x_test_dense  = x_test_pca.toarray()
else:
    x_train_dense = x_train_pca
    x_test_dense  = x_test_pca

X_train_gpu = cp.asarray(x_train_dense)
X_test_gpu  = cp.asarray(x_test_dense)
y_train_gpu = cp.asarray(y_train_flat)

# Range of k values to search
k_values = list(range(1, 100))  # k = 1..99

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

best_k = None
best_cv_score = -1.0
k_scores = []

start = time.time()

print("\n>>> Starting KNN (GPU) cross-validation over k...\n")
for k in k_values:
    fold_scores = []

    for train_idx, val_idx in cv.split(x_train_pca, y_train_flat):
        # Index GPU arrays with CPU indices
        X_tr_gpu = X_train_gpu[train_idx]
        y_tr_gpu = y_train_gpu[train_idx]
        X_val_gpu = X_train_gpu[val_idx]
        y_val = y_train_flat[val_idx]  # keep validation labels on CPU

        knn_gpu = cuKNN(n_neighbors=k)
        knn_gpu.fit(X_tr_gpu, y_tr_gpu)

        # Predict on GPU, bring back to CPU for accuracy
        y_val_pred_gpu = knn_gpu.predict(X_val_gpu)
        y_val_pred = cp.asnumpy(y_val_pred_gpu)

        fold_scores.append(accuracy_score(y_val, y_val_pred))

    mean_score = float(np.mean(fold_scores))
    k_scores.append((k, mean_score))
    print(f"k={k:3d} | mean CV accuracy={mean_score:.4f}")

    if mean_score > best_cv_score:
        best_cv_score = mean_score
        best_k = k

knn_cv_time = time.time() - start

print(f"\n>>> Best k from CV: {best_k} with mean accuracy {best_cv_score:.4f}")
print(f">>> KNN CV search time (GPU): {knn_cv_time:.2f} sec\n")

# Train final KNN with best k on full training set (GPU)
best_knn_gpu = cuKNN(n_neighbors=best_k)

start = time.time()
best_knn_gpu.fit(X_train_gpu, y_train_gpu)
knn_train_time = time.time() - start

start = time.time()
y_test_pred_gpu = best_knn_gpu.predict(X_test_gpu)
knn_test_time = time.time() - start

# For accuracy, bring preds (and train preds) back to CPU
y_test_pred = cp.asnumpy(y_test_pred_gpu)
y_train_pred = cp.asnumpy(best_knn_gpu.predict(X_train_gpu))

knn_train_acc = accuracy_score(y_train_flat, y_train_pred)
knn_test_acc  = accuracy_score(y_test_flat,  y_test_pred)

print("\n KNN (k-Nearest Neighbors, GPU)\n")
print(f"Best k from CV:             {best_k}")
print(f"CV search time (GPU):       {knn_cv_time:.2f} sec")
print(f"Train time (best k only):   {knn_train_time:.2f} sec")
print(f"Test time:                  {knn_test_time:.2f} sec\n")
print(f"Train accuracy (best k):    {knn_train_acc:.4f}")
print(f"Test accuracy  (best k):    {knn_test_acc:.4f}\n")



ncc = NearestCentroid()

start = time.time()
ncc.fit(x_train_pca, y_train_flat)
ncc_train_time = time.time() - start

start = time.time()
y_test_pred_ncc = ncc.predict(x_test_pca)
ncc_test_time = time.time() - start

ncc_train_acc = ncc.score(x_train_pca, y_train_flat)
ncc_test_acc  = ncc.score(x_test_pca,  y_test_flat)

print("\n NCC (Nearest Class Centroid)\n")
print(f"Train time:      {ncc_train_time:.4f} sec")
print(f"Test time:       {ncc_test_time:.4f} sec\n")
print(f"Train accuracy:  {ncc_train_acc:.4f}")
print(f"Test accuracy:   {ncc_test_acc:.4f}\n")



>>> Starting KNN (GPU) cross-validation over k...

k=  1 | mean CV accuracy=0.3757
k=  2 | mean CV accuracy=0.3367
k=  3 | mean CV accuracy=0.3620
k=  4 | mean CV accuracy=0.3691
k=  5 | mean CV accuracy=0.3739
k=  6 | mean CV accuracy=0.3718
k=  7 | mean CV accuracy=0.3742
k=  8 | mean CV accuracy=0.3760
k=  9 | mean CV accuracy=0.3752
k= 10 | mean CV accuracy=0.3751
k= 11 | mean CV accuracy=0.3741
k= 12 | mean CV accuracy=0.3740
k= 13 | mean CV accuracy=0.3726
k= 14 | mean CV accuracy=0.3715
k= 15 | mean CV accuracy=0.3707
k= 16 | mean CV accuracy=0.3696
k= 17 | mean CV accuracy=0.3687
k= 18 | mean CV accuracy=0.3696
k= 19 | mean CV accuracy=0.3699
k= 20 | mean CV accuracy=0.3692
k= 21 | mean CV accuracy=0.3676
k= 22 | mean CV accuracy=0.3673
k= 23 | mean CV accuracy=0.3668
k= 24 | mean CV accuracy=0.3662
k= 25 | mean CV accuracy=0.3654
k= 26 | mean CV accuracy=0.3651
k= 27 | mean CV accuracy=0.3645
k= 28 | mean CV accuracy=0.3627
k= 29 | mean CV accuracy=0.3627
k= 30 | mean CV accu
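
The two classifiers compared in this cell can be written in a few lines each. This is a NumPy-only sketch on a toy 2D dataset (the points and the helper names `knn_predict` and `ncc_predict` are illustrative); the cuML and scikit-learn implementations differ in details such as tie-breaking and search strategy.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict labels by majority vote among the k nearest training points."""
    sq_dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(sq_dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # bincount(...).argmax() breaks ties in favour of the smallest label.
    return np.array([np.bincount(row).argmax() for row in votes])

def ncc_predict(X_train, y_train, X_query):
    """Predict the class whose centroid (mean vector) is closest to each query."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    sq_dists = ((X_query[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[sq_dists.argmin(axis=1)]

# Toy data: class 0 clustered near (0, 0), class 1 near (5, 5).
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.5, 0.5], [5.5, 5.5]])

knn_labels = knn_predict(X_train, y_train, X_query, k=3)
ncc_labels = ncc_predict(X_train, y_train, X_query)
print(knn_labels, ncc_labels)
```

kNN stores the whole training set and defers the work to prediction time, while NCC compresses each class to a single centroid, which is why its fit and predict are cheap but its decision boundaries are much coarser.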

# Summary results

In [16]:
import pandas as pd

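# Note: for the two SVMs the time column holds the full GridSearchCV wall time,
# while for kNN and NCC it holds a single fit, so the values are not directly comparable.
summary = pd.DataFrame([
    ["LinearSVC",  train_acc, test_acc, grid_time],
    ["RBF SVM",    train_acc_rbf,    test_acc_rbf,    rbf_time],
    [f"kNN (k={best_k})",  knn_train_acc,    knn_test_acc,    knn_train_time],
    ["NCC",        ncc_train_acc,    ncc_test_acc,    ncc_train_time],
], columns=["Model", "Train Accuracy", "Test Accuracy", "Train Time (s)"])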

print("\n Summary results")
print(summary)



 Summary results
       Model  Train Accuracy  Test Accuracy  Train Time (s)
0  LinearSVC         0.39822         0.3967       10.003416
1    RBF SVM         0.86664         0.5665      160.979877
2  kNN (k=8)         0.50710         0.3814        0.002134
3        NCC         0.27120         0.2807        0.071079
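
A quick way to read the table is the gap between train and test accuracy, which indicates how much each model overfits. The sketch below uses the accuracies copied from the summary output above; the `results` dictionary is just a convenient container for them.

```python
# (train accuracy, test accuracy) copied from the summary table above.
results = {
    "LinearSVC": (0.39822, 0.3967),
    "RBF SVM": (0.86664, 0.5665),
    "kNN (k=8)": (0.50710, 0.3814),
    "NCC": (0.27120, 0.2807),
}

# Generalization gap: positive values mean the model does better on data it has seen.
gaps = {model: train - test for model, (train, test) in results.items()}

for model, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<10} gap = {gap:+.4f}")

best_test = max(results, key=lambda m: results[m][1])
largest_gap = max(gaps, key=gaps.get)
print("best test accuracy:", best_test)
print("largest gap:", largest_gap)
```

The RBF SVM has both the best test accuracy and the largest gap, so it is the strongest model here but also the one with the most room for regularization; LinearSVC and NCC barely overfit but are limited by their linear or centroid-based decision rules.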
