# CS3920 Assignment 2

1. Install the needed libraries into the current environment

In [1]:
%pip install scikit-learn matplotlib numpy


2. Load the data set into Python using, e.g., load_wine or genfromtxt, as appropriate. In the case of the USPS dataset, merge the original training and test sets into one dataset.

In [4]:
from sklearn.datasets import load_wine
import numpy as np

uspsZip = {}
wine = load_wine()

# Load data from both files
test_data = np.genfromtxt("zip.test", delimiter=" ", usecols=np.arange(1, 257))
train_data = np.genfromtxt("zip.train", delimiter=" ", usecols=np.arange(1, 257))

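# Load targets from both files (column 0). Parse as float first, then cast,
# so labels written as "6.0000" and as "6" are both accepted.
test_target = np.genfromtxt("zip.test", delimiter=" ", usecols=0, dtype=float).astype(int)
train_target = np.genfromtxt("zip.train", delimiter=" ", usecols=0, dtype=float).astype(int)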

# Combine the two files
uspsZip['data'] = np.vstack((test_data, train_data))
uspsZip['target'] = np.concatenate((test_target, train_target))
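
The loading and merging steps above can be tried without the real files. The sketch below is a minimal illustration, assuming the USPS files hold one sample per line with the label first and space-separated pixel values after it. The in-memory strings `train_txt` and `test_txt` and the helper `load_split` are made-up stand-ins (3 pixels per row instead of 256), not part of the assignment.

```python
from io import StringIO
import numpy as np

# Hypothetical miniature stand-ins for zip.train / zip.test (3 pixels, not 256)
train_txt = "6.0000 -1.000 0.500 1.000\n5.0000 0.000 -0.250 0.750\n"
test_txt = "9 -1.000 -1.000 0.100\n6 0.300 0.900 -0.400\n"

def load_split(text, n_pixels=3):
    # Column 0 is the label; columns 1..n_pixels are the features
    data = np.genfromtxt(StringIO(text), delimiter=" ", usecols=np.arange(1, n_pixels + 1))
    # Parse labels as float first so both "6.0000" and "9" are accepted
    target = np.genfromtxt(StringIO(text), delimiter=" ", usecols=0, dtype=float).astype(int)
    return data, target

train_data, train_target = load_split(train_txt)
test_data, test_target = load_split(test_txt)

# Same merge order as the notebook cell: test rows first, then training rows
merged = {
    "data": np.vstack((test_data, train_data)),
    "target": np.concatenate((test_target, train_target)),
}
print(merged["data"].shape, merged["target"])
```

Checking that `data` and `target` have the same number of rows after the merge is a cheap way to catch a mismatch between the two `genfromtxt` calls.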

3. Divide the dataset into a training set and a test set. You may use the
function train_test_split. Use your birthday in the format DDMM as
random_state (omit leading zeros if any).

In [5]:
from sklearn.model_selection import train_test_split

X_wine_train, X_wine_test, y_wine_train, y_wine_test = train_test_split(wine.data, wine.target, random_state=79)

In [6]:
X_zip_train, X_zip_test, y_zip_train, y_zip_test = train_test_split(uspsZip["data"], uspsZip["target"], random_state=79)
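
As a rough check of what the split produces, the sketch below repeats the wine split and inspects the result. It relies on the default `test_size` of 0.25 in `train_test_split`; the stratified variant at the end is an optional assumption and is not what the notebook uses.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# Default test_size is 0.25, so the 178 wine samples become 133 train / 45 test
X_tr, X_te, y_tr, y_te = train_test_split(wine.data, wine.target, random_state=79)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split on every call
X_tr2, X_te2, _, _ = train_test_split(wine.data, wine.target, random_state=79)
same_split = bool((X_te == X_te2).all())

# Optional variant (an assumption, not used in the notebook):
# stratify keeps the class proportions similar in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    wine.data, wine.target, random_state=79, stratify=wine.target
)
```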

In [7]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svc = SVC()

zip_score = np.mean(cross_val_score(svc, X_zip_train, y_zip_train))
wine_score = np.mean(cross_val_score(svc, X_wine_train, y_wine_train))

print(f"Cross-validated accuracy (training set) for ZIP Codes: {zip_score}")
print(f"Cross-validated accuracy (training set) for Wine dataset: {wine_score}")

Cross-validated accuracy (training set) for ZIP Codes: 0.9708874181720943
Cross-validated accuracy (training set) for Wine dataset: 0.6541310541310541
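
The two numbers above are means over cross-validation folds, not accuracies measured on the data the model was fitted to. A minimal sketch of what `cross_val_score` returns, assuming the same wine split as in the notebook (for a classifier the default is 5-fold stratified cross-validation):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(wine.data, wine.target, random_state=79)

# One accuracy per fold; each fold is scored by a model that never saw it
scores = cross_val_score(SVC(), X_tr, y_tr)
print("per-fold accuracy:", scores)
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Looking at the per-fold spread as well as the mean gives a feel for how noisy the estimate is on a small dataset such as wine.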


4. Find the test error rate of the SVM with the default values of parameters,
compare it with the estimate obtained in the previous task (task 3), and
write your observations in a markdown cell of your Jupyter notebook.

In [9]:
svc.fit(X_zip_train, y_zip_train)
zip_acc = svc.score(X_zip_test, y_zip_test) * 100

svc.fit(X_wine_train, y_wine_train)
wine_acc = svc.score(X_wine_test, y_wine_test) * 100

print(f"Error-rate for ZIP Code dataset: {100 - zip_acc}%")
print(f"Error-rate for Wine dataset: {100 - wine_acc}%")

print(f"Accuracy for ZIP Code dataset: {zip_acc}%")
print(f"Accuracy for Wine dataset: {wine_acc}%")

Error-rate for ZIP Code dataset: 2.8817204301075208%
Error-rate for Wine dataset: 31.111111111111114%
Accuracy for ZIP Code dataset: 97.11827956989248%
Accuracy for Wine dataset: 68.88888888888889%
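
To make the comparison asked for in this task concrete, the sketch below converts the cross-validated accuracies into error rates and sets them against the test error rates. All numbers are copied from the cell outputs above; the figure of 45 wine test samples assumes the default 25% test split of the 178-sample wine data.

```python
# Numbers copied from the cell outputs above
cv_acc = {"zip": 0.9708874181720943, "wine": 0.6541310541310541}
test_err_pct = {"zip": 2.8817204301075208, "wine": 31.111111111111114}

gaps = {}
for name in cv_acc:
    cv_err_pct = (1 - cv_acc[name]) * 100
    # Positive gap: cross-validation was more pessimistic than the test set
    gaps[name] = cv_err_pct - test_err_pct[name]
    print(f"{name}: CV error {cv_err_pct:.2f}%, "
          f"test error {test_err_pct[name]:.2f}%, gap {gaps[name]:+.2f} points")

# Assuming 45 wine test samples, one sample moves the wine
# test error rate by about 2.2 percentage points
one_sample_pct = 100 / 45
```

On these figures the cross-validation estimate and the test error nearly coincide for the ZIP data, while the wine gap of roughly 3.5 points is less than two test samples' worth of error.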


5. Create a pipeline for SVM involving data normalization and SVC, and
use grid search and cross-validation to tune parameters C and gamma for
the pipeline, avoiding data snooping and data leakage. You may use
the scikit-learn class GridSearchCV (along with other scikit-learn
classes). Experiment with different ways of doing normalization (such
as StandardScaler, MinMaxScaler, RobustScaler, and Normalizer).
Which ways are appropriate for either dataset? (The answer, which should
be written in your Jupyter notebook, may depend on the results that you
obtain for the next task.)

In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import Normalizer, MinMaxScaler, RobustScaler, StandardScaler

normalisers = [Normalizer(), MinMaxScaler(), RobustScaler(), StandardScaler()]
grid_values = [0.01, 0.1, 1, 10, 100]

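def normalise(meth, gridVals, X_test, y_test, X_train, y_train):
    # Scaler and SVC share one pipeline, so the scaler is refit on the
    # training part of every CV fold and never sees the held-out fold.
    pipeline = make_pipeline(meth, SVC())
    pipe_param = {"svc__C": gridVals, "svc__gamma": gridVals}
    # 5-fold CV, set explicitly rather than tied to the length of the grid
    g_search = GridSearchCV(pipeline, param_grid=pipe_param, cv=5, n_jobs=-1)
    g_search.fit(X_train, y_train)

    # Returns (test accuracy, best CV accuracy, best parameters) and the fitted search
    return (g_search.score(X_test, y_test), g_search.best_score_, g_search.best_params_), g_search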
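
An alternative to looping over the normalisers is to treat the scaling step itself as a grid parameter, so a single search compares all four on cross-validation alone. This is a sketch under that assumption, not the approach used in the notebook. The step name `"scaler"` is chosen here for illustration, and the wine split is repeated so the block runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(wine.data, wine.target, random_state=79)

# Named steps so the whole "scaler" step can be swapped by the grid search
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
values = [0.01, 0.1, 1, 10, 100]
param_grid = {
    "scaler": [Normalizer(), MinMaxScaler(), RobustScaler(), StandardScaler()],
    "svc__C": values,
    "svc__gamma": values,
}

# The scaler is chosen by cross-validation on the training set only
search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)

# The test set is touched once, after all choices have been made
test_acc = search.score(X_te, y_te)
```

Choosing the normaliser this way keeps the test set out of the model-selection loop, which is the point of the warning about data snooping in the task text.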

In [13]:
wine_grids = []
wine_saved_norm = []

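for scaler in normalisers:
    # result = (test accuracy, best CV accuracy, best params); search = fitted GridSearchCV
    result, search = normalise(scaler, grid_values, X_wine_test, y_wine_test, X_wine_train, y_wine_train)
    wine_grids.append(result)
    wine_saved_norm.append(search)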

print(wine_grids)
print(wine_saved_norm)

[(0.9555555555555556, np.float64(0.8880341880341881), {'svc__C': 100, 'svc__gamma': 100}), (1.0, np.float64(0.9772079772079773), {'svc__C': 0.1, 'svc__gamma': 1}), (1.0, np.float64(0.9703703703703702), {'svc__C': 0.1, 'svc__gamma': 0.1}), (1.0, np.float64(0.9772079772079773), {'svc__C': 1, 'svc__gamma': 0.01})]
[GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('normalizer', Normalizer()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]}), GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('minmaxscaler', MinMaxScaler()),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.01, 0.1, 1, 10, 100],
                         'svc__gamma': [0.01, 0.1, 1, 10, 100]}), GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('robustscaler', RobustScaler()),
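
In the wine results above, `Normalizer` reaches a best cross-validated score of about 0.888, against roughly 0.97 for the three scalers. One plausible explanation is that `Normalizer` works per sample (row) rather than per feature (column). The toy matrix below is made up for illustration only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two samples pointing in the same direction but with different magnitudes
X = np.array([[3.0, 4.0], [6.0, 8.0]])

# Normalizer rescales each ROW to unit L2 norm: both rows become [0.6, 0.8]
row_normed = Normalizer().fit_transform(X)

# StandardScaler rescales each COLUMN to mean 0 and standard deviation 1
standardised = StandardScaler().fit_transform(X)

# MinMaxScaler maps each COLUMN onto the range [0, 1]
min_maxed = MinMaxScaler().fit_transform(X)

print(row_normed)
print(standardised)
print(min_maxed)
```

After row normalisation the two samples are indistinguishable, whereas the column-wise scalers keep them apart. Features measured on very different scales, as in the wine data, are not put on a common scale by `Normalizer`.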


In [14]:
zip_grids = []
zip_saved_norm = []

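for scaler in normalisers:
    # result = (test accuracy, best CV accuracy, best params); search = fitted GridSearchCV
    result, search = normalise(scaler, grid_values, X_zip_test, y_zip_test, X_zip_train, y_zip_train)
    zip_grids.append(result)
    zip_saved_norm.append(search)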

print(zip_grids)
print(zip_saved_norm)

KeyboardInterrupt: 
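
The cell above ended in a `KeyboardInterrupt`, presumably because the search is slow on the larger dataset. The sketch below counts how many model fits the loop requests and shows one possible way to cut the cost: tuning on a stratified subsample first. The subsample size of 500 is an arbitrary assumption, and scikit-learn's bundled `load_digits` data is used only as a stand-in for the USPS files.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import ParameterGrid, train_test_split

values = [0.01, 0.1, 1, 10, 100]
param_grid = {"svc__C": values, "svc__gamma": values}

# 25 (C, gamma) candidates, each fitted on 5 folds, for each of 4 normalisers
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * 5 * 4
print(f"{n_candidates} candidates -> {n_fits} cross-validation fits (plus 4 refits)")

# Stand-in data (NOT the USPS set): draw a stratified subsample to tune on
X, y = load_digits(return_X_y=True)
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=500, stratify=y, random_state=0
)
print(X_sub.shape)
```

A coarse search on the subsample can narrow the ranges of C and gamma before a final, smaller grid is run on the full training set.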