# CS3920 Assignment 2

1. Install the needed libraries into the current environment

In [1]:
%pip install scikit-learn matplotlib numpy

Collecting scikit-learn
  Downloading scikit_learn-1.6.0-cp311-cp311-win_amd64.whl.metadata (15 kB)
Collecting matplotlib
  Downloading matplotlib-3.10.0-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting numpy
  Downloading numpy-2.2.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Using cached scipy-1.14.1-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Collecting contourpy>=1.0.1 (from matplotlib)
  Downloading contourpy-1.3.1-cp311-cp311-win_amd64.whl.metadata (5.4 kB)
Collecting cycler>=0.10 (from matplotlib)
  Using cached cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib)
  Downloading fonttools-4.55.3-cp311-cp311-win_amd64.whl.metadata (168 kB)
Collecting kiwisolver>=1.

2. Load the data set into Python using, e.g., load_wine or genfromtxt, as appropriate. In the case of the USPS dataset, merge the original training and test sets into one dataset.

In [1]:
from sklearn.datasets import load_wine
import numpy as np

uspsZip = {}
wine = load_wine()

# Load data from both files
test_data = np.genfromtxt("zip.test", delimiter=" ", usecols=np.arange(1, 257))
train_data = np.genfromtxt("zip.train", delimiter=" ", usecols=np.arange(1, 257))

# Load targets from both files
test_target = np.genfromtxt("zip.test", delimiter=" ", usecols=0, dtype='int')
train_target = np.genfromtxt("zip.train", delimiter=" ", usecols=0, dtype='float').astype(int)

# Combine the two files
uspsZip['data'] = np.vstack((test_data, train_data))
uspsZip['target'] = np.concatenate((test_target, train_target))

3. Divide the dataset into a training set and a test set. You may use the
function train_test_split. Use your birthday in the format DDMM as
random_state (omit leading zeros if any).

In [2]:
from sklearn.model_selection import train_test_split

X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(wine.data, wine.target, random_state=79)

In [3]:
X_zip_train, X_zip_test, y_zip_train, y_zip_test = train_test_split(uspsZip["data"], uspsZip["target"], random_state=79)

4. Using cross-validation and the training set only, estimate the generalization
accuracy of the SVM with the default values of the parameters. You
may use the function cross_val_score.

In [4]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svc = SVC()

zip_score = np.mean(cross_val_score(svc, X_zip_train, y_zip_train))
wine_score = np.mean(cross_val_score(svc, X_wine_train, y_wine_train))

print(f"Cross-validated accuracy (training set only) for ZIP Codes: {zip_score}")
print(f"Cross-validated accuracy (training set only) for Wine dataset: {wine_score}")

Cross-validated accuracy (training set only) for ZIP Codes: 0.9708874181720943
Cross-validated accuracy (training set only) for Wine dataset: 0.6541310541310541


5. Find the test error rate of the SVM with the default values of parameters,
compare it with the estimate obtained in the previous task (task 4), and
write your observations in a markdown cell of your Jupyter notebook.

In [5]:
svc.fit(X_zip_train, y_zip_train)
zip_acc = svc.score(X_zip_test, y_zip_test) * 100

svc.fit(X_wine_train, y_wine_train)
wine_acc = svc.score(X_wine_test, y_wine_test) * 100

print(f"Error-rate for ZIP Code dataset: {100 - zip_acc}%")
print(f"Error-rate for Wine dataset: {100 - wine_acc}%")

print(f"Accuracy for ZIP Code dataset: {zip_acc}%")
print(f"Accuracy for Wine dataset: {wine_acc}%")

Error-rate for ZIP Code dataset: 2.8817204301075208%
Error-rate for Wine dataset: 31.111111111111114%
Accuracy for ZIP Code dataset: 97.11827956989248%
Accuracy for Wine dataset: 68.88888888888889%
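
The comparison this task asks for can be sketched directly from the figures printed in tasks 4 and 5. The snippet below is only an illustration: the dictionary names are hypothetical, the accuracies are copied from the outputs above, and the wine test-set size of 45 is an assumption derived from the 178-sample wine dataset and the default 25% test split of train_test_split.

```python
# Accuracies copied from the outputs of tasks 4 and 5 (as fractions)
cv_accuracy = {"ZIP": 0.9708874181720943, "Wine": 0.6541310541310541}
test_accuracy = {"ZIP": 0.9711827956989248, "Wine": 0.6888888888888889}

# Assumption: 178 wine samples with the default 25% test split gives 45 test samples
wine_test_size = 45

gaps = {}
for name in cv_accuracy:
    cv_error = 100 * (1 - cv_accuracy[name])      # estimated error rate in %
    test_error = 100 * (1 - test_accuracy[name])  # observed test error rate in %
    gaps[name] = test_error - cv_error            # negative: test error is lower
    print(f"{name}: CV error {cv_error:.2f}%, test error {test_error:.2f}%, "
          f"gap {gaps[name]:+.2f} points")

# Weight of a single wine test sample, in percentage points
points_per_wine_sample = 100 / wine_test_size
print(f"One wine test sample = {points_per_wine_sample:.2f} points")
```

On these figures the cross-validation estimate and the test error nearly coincide for the ZIP data, while the wine gap of a few percentage points corresponds to only one or two of the assumed 45 test samples.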


6. Create a pipeline for SVM involving data normalization and SVC, and
use grid search and cross-validation to tune parameters C and gamma for
the pipeline, avoiding data snooping and data leakage. You may use
the scikit-learn class GridSearchCV (along with other scikit-learn
classes). Experiment with different ways of doing normalization (such
as StandardScaler, MinMaxScaler, RobustScaler, and Normalizer).
Which ways are appropriate for either dataset? (The answer, which should
be written in your Jupyter notebook, may depend on the results that you
obtain for the next task.)
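
To make the difference between the four normalization methods concrete, here is a minimal NumPy sketch of what each one computes. These are not the scikit-learn classes themselves: the function names and the toy arrays are hypothetical, and the sketch omits the guards scikit-learn applies to constant features. The first three work per feature (column) and learn their statistics from the training data only; the last one works per sample (row) and learns nothing.

```python
import numpy as np

def fit_standard(X):
    # StandardScaler-like: subtract the column mean, divide by the column std
    mean, std = X.mean(axis=0), X.std(axis=0)
    return lambda Z: (Z - mean) / std

def fit_minmax(X):
    # MinMaxScaler-like: map the training range of each column to [0, 1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    return lambda Z: (Z - lo) / (hi - lo)

def fit_robust(X):
    # RobustScaler-like: subtract the column median, divide by the interquartile range
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    return lambda Z: (Z - median) / (q3 - q1)

def l2_normalise_rows(Z):
    # Normalizer-like: rescale every row to unit Euclidean length
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

# Toy data with two features on very different scales (hypothetical values)
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])
X_test_toy = np.array([[2.5, 400.0]])

# Statistics come from the training rows only and are then applied to the test row
standard_out = fit_standard(X_train_toy)(X_test_toy)
minmax_out = fit_minmax(X_train_toy)(X_test_toy)
robust_out = fit_robust(X_train_toy)(X_test_toy)
row_out = l2_normalise_rows(X_test_toy)

print(standard_out, minmax_out, robust_out, row_out)
```

Fitting on the training rows and only transforming the test row mirrors what a pipeline inside GridSearchCV does on every cross-validation split, which is how it avoids data leakage. The row-wise output also shows that the larger feature dominates each unit-length row, which matters when features are measured on very different scales.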

In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Normalizer, MinMaxScaler, RobustScaler, StandardScaler

normalisers = [Normalizer(), MinMaxScaler(), RobustScaler(), StandardScaler()]
grid_values = [0.01, 0.1, 1, 10, 100]

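def normalise(meth, gridVals, X_test, y_test, X_train, y_train):
    # The scaler sits inside the pipeline, so every CV split fits it on its own training part only
    pipeline = make_pipeline(meth, SVC())
    pipe_param = {"svc__C": gridVals, "svc__gamma": gridVals}
    # 5-fold cross-validation over the C/gamma grid, using the training set only
    g_search = GridSearchCV(pipeline, param_grid=pipe_param, cv=5, n_jobs=-1)
    g_search.fit(X_train, y_train)

    # (test accuracy, best CV accuracy, best parameters), plus the fitted search object
    return (g_search.score(X_test, y_test), g_search.best_score_, g_search.best_params_), g_search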

In [7]:
wine_grids = []
wine_saved_norm = []

for i in normalisers:
    grid, norm = normalise(i, grid_values, X_wine_test, y_wine_test, X_wine_train, y_wine_train)
    wine_grids.append(grid)
    wine_saved_norm.append(norm)

print(wine_grids)
print(wine_saved_norm)

[(0.9555555555555556, np.float64(0.8880341880341881), {'svc__C': 100, 'svc__gamma': 100}), (1.0, np.float64(0.9772079772079773), {'svc__C': 0.1, 'svc__gamma': 1}), (1.0, np.float64(0.9703703703703702), {'svc__C': 0.1, 'svc__gamma': 0.1}), (1.0, np.float64(0.9772079772079773), {'svc__C': 1, 'svc__gamma': 0.01})]
[GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('normalizer', Normalizer()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]}), GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]}), GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('robustscaler', RobustScaler()),


In [8]:
zip_grids = []
zip_saved_norm = []

for i in normalisers:
    grid, norm = normalise(i, grid_values, X_zip_test, y_zip_test, X_zip_train, y_zip_train)
    zip_grids.append(grid)
    zip_saved_norm.append(norm)

print(zip_grids)
print(zip_saved_norm)

[(0.9750537634408603, np.float64(0.9728947923255324), {'svc__C': 10, 'svc__gamma': 1}), (0.9720430107526882, np.float64(0.9698837310953754), {'svc__C': 10, 'svc__gamma': 0.01}), (0.7359139784946237, np.float64(0.7953542833341047), {'svc__C': 10, 'svc__gamma': 0.01}), (0.9333333333333333, np.float64(0.9327402127911222), {'svc__C': 10, 'svc__gamma': 0.01})]
[GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('normalizer', Normalizer()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]}), GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]}), GridSearchCV(cv=5,
             estimator=Pipel

7. Fit the GridSearchCV object of task 6 to the training set and use it to
predict the test labels. Write the resulting test error rate in your Jupyter
notebook.

In [18]:
from sklearn.metrics import accuracy_score

wine_acc = []

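for w_norm in wine_saved_norm:
    w_norm.fit(X_wine_train, y_wine_train)
    w_predict = w_norm.predict(X_wine_test)
    # Store (normaliser step name, test error rate in %)
    wine_acc.append((w_norm.estimator.steps[0][0], 100 - (accuracy_score(y_wine_test, w_predict) * 100)))

# Report the test error rate for each normaliser
for name, error_rate in wine_acc:
    print(f"Wine test error rate with {name}: {error_rate:.2f}%")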

In [19]:
zip_acc = []

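for z_norm in zip_saved_norm:
    z_norm.fit(X_zip_train, y_zip_train)
    z_predict = z_norm.predict(X_zip_test)
    # Store (normaliser step name, test error rate in %)
    zip_acc.append((z_norm.estimator.steps[0][0], 100 - (accuracy_score(y_zip_test, z_predict) * 100)))

# Report the test error rate for each normaliser
for name, error_rate in zip_acc:
    print(f"ZIP Code test error rate with {name}: {error_rate:.2f}%")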

8. Implement a cross-conformal predictor. You may use the KFold class for
splitting into folds (start from 5 or 10 folds). For computing the conformity
scores for each fold, you may use one of the GridSearchCV objects that
you created in task 6 in combination with the decision_function method
(see Section 3 of Lab Worksheet 9 for examples). Run your cross-conformal
predictor on the two datasets, training it on the training set and testing
on the test set.
 - To check its validity, produce a calibration curve, plotting the percentage
of errors made on the test set vs the significance level
ϵ ∈ [0, 1].
 - Compute the average false p-value on the test set.
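
The calibration curve described in the first bullet can be sketched as follows. This is a minimal illustration rather than the required solution: the function names and toy values are hypothetical, and it assumes a p-value matrix with one row per test sample and one column per class, with labels encoded as 0..K-1. A prediction at significance level ϵ counts as an error when the p-value of the true label is at most ϵ, so the true label is left out of the prediction set.

```python
import numpy as np

def calibration_points(p_values, y_true, eps_grid):
    """Fraction of test samples whose true-label p-value is <= eps, for each eps."""
    p_values = np.asarray(p_values, dtype=float)
    y_true = np.asarray(y_true)
    # p-value assigned to the correct label of every test sample
    p_true = p_values[np.arange(len(y_true)), y_true]
    return np.array([np.mean(p_true <= eps) for eps in eps_grid])

def plot_calibration(eps_grid, error_rates, title):
    # matplotlib is imported lazily because it is only needed for plotting
    import matplotlib.pyplot as plt
    plt.plot(eps_grid, 100 * np.asarray(error_rates), label="observed errors")
    plt.plot(eps_grid, 100 * np.asarray(eps_grid), linestyle="--", label="diagonal (error = ϵ)")
    plt.xlabel("Significance level ϵ")
    plt.ylabel("Errors on the test set (%)")
    plt.title(title)
    plt.legend()
    plt.show()

# Toy p-values for four test samples and two classes (hypothetical values)
p_values_demo = np.array([[0.1, 0.9], [0.5, 0.2], [0.05, 0.9], [0.3, 0.6]])
y_demo = np.array([0, 0, 1, 0])
eps_demo = np.array([0.0, 0.2, 0.5, 1.0])

errors_demo = calibration_points(p_values_demo, y_demo, eps_demo)
print(errors_demo)
```

For a real run one would pass a fine grid such as `np.linspace(0, 1, 101)` and call `plot_calibration` on the result. A valid predictor should give a curve that stays close to, or below, the diagonal.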

In [25]:
def calculate_avg_false(y_test, p_values):
    in_range = []
    for i in range(0, len(y_test)):
        for j in np.unique(y_test):
            if y_test[i] != j:
                in_range.append(p_values[i][j])
    
    return np.mean(in_range)
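
As a cross-check on the loop above, the same average can be written with a boolean mask. This is a sketch under the same assumption that labels are the integers 0..K-1 and index the columns of the p-value matrix. The function name and toy values are hypothetical. Unlike the loop, it averages over every column, including classes that happen not to occur in the test labels.

```python
import numpy as np

def average_false_p_value(p_values, y_true):
    p_values = np.asarray(p_values, dtype=float)
    y_true = np.asarray(y_true)
    # Start with every (sample, label) pair marked as a false label ...
    false_mask = np.ones(p_values.shape, dtype=bool)
    # ... then unmark the true label of each sample
    false_mask[np.arange(len(y_true)), y_true] = False
    return p_values[false_mask].mean()

# Toy example: two test samples, three classes (hypothetical values)
p_demo = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.1]])
y_demo = np.array([0, 1])

# False-label p-values are 0.1, 0.2, 0.3 and 0.1
avg_demo = average_false_p_value(p_demo, y_demo)
print(avg_demo)
```

Lower values are better here: a small average false p-value means wrong labels are rejected at small significance levels.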

In [31]:
from sklearn.model_selection import KFold


def conform_p_value(grids: list[GridSearchCV], folds: int, X_train: np.ndarray, y_train: np.ndarray, X_test: np.ndarray, y_test: np.ndarray) -> tuple[list, list]:
    conform_score = []
    p_false_val = []
    # Assumes labels are the integers 0..n_classes-1, matching the decision_function columns
    n_classes = len(np.unique(y_train))
    for n_grid in grids:
        # p_ranks[i, c]: number of held-out conformity scores <= score of test sample i under label c
        p_ranks = np.zeros((X_test.shape[0], n_classes))
        print(f"Using average false P-Value for: {n_grid}")
        for i, j in KFold(shuffle=True, random_state=0, n_splits=folds).split(X_train):
            X_ext, y_ext = X_train[i], y_train[i]
            X_fold, y_fold = X_train[j], y_train[j]

            # Train on everything except the current fold
            n_grid.fit(X_ext, y_ext)

            fold = n_grid.decision_function(X_fold)
            test = n_grid.decision_function(X_test)

            # Conformity score of each held-out sample: decision value for its true label
            alpha = fold[np.arange(X_fold.shape[0]), y_fold]

            # For every postulated label, count held-out scores that are no larger than the test score
            for c in range(n_classes):
                p_ranks[:, c] += np.sum(alpha[None, :] <= test[:, [c]], axis=1)

        # p-values are computed once per grid, after the ranks from all folds have been accumulated
        p_values = (p_ranks + 1) / (X_train.shape[0] + 1)
        conform_score.append(p_values)
        p_false_val.append(calculate_avg_false(y_test, p_values))
    return conform_score, p_false_val

        


In [33]:
w_p_vals, w_p_false_vals = conform_p_value(wine_saved_norm, 3,X_wine_train, y_wine_train, X_wine_test, y_wine_test)

print(f"Wine average false P-Values using;")
for i in range(len(normalisers)):
    print(f"with {normalisers[i]}: {w_p_false_vals[i]}")

Using average false P-Value for: GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('normalizer', Normalizer()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]})
Using average false P-Value for: GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]})
Using average false P-Value for: GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('robustscaler', RobustScaler()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]})
Using avera

In [36]:
z_p_vals, z_p_false_vals = conform_p_value(zip_saved_norm, 3,X_zip_train, y_zip_train, X_zip_test, y_zip_test)

print(f"Zip code average false P-Values using;")
for i in range(len(normalisers)):
    print(f"with {normalisers[i]}: {z_p_false_vals[i]}")

Using average false P-Value for: GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('normalizer', Normalizer()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]})


9. An alternative to implementing a cross-conformal predictor is to experiment with a neural network. Perform tasks 3–6 for the scikit-learn

In [37]:
from sklearn.neural_network import MLPClassifier

mlp_c = MLPClassifier(max_iter=6000)
print(f"Cross-validation score average:")
print(f"Wine Dataset => {np.mean(cross_val_score(mlp_c, X_wine_train, y_wine_train))}")
print(f"Zip Code Dataset => {np.mean(cross_val_score(mlp_c, X_zip_train, y_zip_train))}")

Cross-validation score average:
Wine Dataset => 0.572934472934473
Zip Code Dataset => 0.9609912425499967


In [39]:
mlp_c.fit(X_wine_train, y_wine_train)
w_acc_mlp = mlp_c.score(X_wine_test, y_wine_test)

mlp_c.fit(X_zip_train, y_zip_train)
z_acc_mlp = mlp_c.score(X_zip_test, y_zip_test)

print("Accuracy of MLP on;")
print(f"Wine => {w_acc_mlp}")
print(f"Zip Code => {z_acc_mlp}")

print()

print("Error rate on MLP;")
print(f"Wine => {(1 - w_acc_mlp) * 100}%")
print(f"Zip Code => {(1 - z_acc_mlp) * 100}%")

Accuracy of MLP on;
Wine => 0.13333333333333333
Zip Code => 0.9660215053763441

Error rate on MLP;
Wine => 86.66666666666667%
Zip Code => 3.3978494623655875%


In [None]:
def normalise_m(mlp, normaliser, X_test, y_test, X_train, y_train):
    pipeline_param = {"mlpclassifier__activation": ["identity", "logistic", "tanh", "relu"],
                      "mlpclassifier__solver": ["lbfgs", "sgd", "adam"]}
    # The normaliser sits inside the pipeline so it is refitted on each CV training split
    pipeline = make_pipeline(normaliser, mlp)
    g_search = GridSearchCV(pipeline, param_grid=pipeline_param, cv=21, n_jobs=-1)
    # The search must be fitted before score, best_score_ and best_params_ are available
    g_search.fit(X_train, y_train)

    return (g_search.score(X_test, y_test), g_search.best_score_, g_search.best_params_), g_search