***4. Multiclass SVM.***

*In this problem, we’ll use support vector machines to classify the MNIST data set of
handwritten digits.*

*(a) Load in the MNIST data: a training set of 60,000 points and a separate test set of 10,000 points.*

In [2]:
from struct import unpack
import numpy as np
import matplotlib.pyplot as plt

def loadmnist(imagefile, labelfile):
    # Open the uncompressed IDX image and label files in binary read mode
    images = open(imagefile, 'rb')
    labels = open(labelfile, 'rb')

    # Get metadata for images
    images.read(4)  # skip the magic_number
    number_of_images = images.read(4)
    number_of_images = unpack('>I', number_of_images)[0]
    rows = images.read(4)
    rows = unpack('>I', rows)[0]
    cols = images.read(4)
    cols = unpack('>I', cols)[0]

    # Get metadata for labels
    labels.read(4)
    N = labels.read(4)
    N = unpack('>I', N)[0]

    # Get data
    x = np.zeros((N, rows*cols), dtype=np.uint8)  # Initialize numpy array
    y = np.zeros(N, dtype=np.uint8)  # Initialize numpy array
    for i in range(N):
        for j in range(rows*cols):
            tmp_pixel = images.read(1)  # Just a single byte
            tmp_pixel = unpack('>B', tmp_pixel)[0]
            x[i][j] = tmp_pixel
        tmp_label = labels.read(1)
        y[i] = unpack('>B', tmp_label)[0]

    images.close()
    labels.close()
    return (x, y)

def displaychar(image):
    plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
    plt.axis('off')
    plt.show()
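The loader above unpacks one byte at a time, which is slow for 60,000 images of 784 pixels each. A minimal vectorized sketch follows, assuming the standard IDX layout (big-endian 32-bit header fields followed by raw unsigned bytes). The helper names `parse_idx_images` and `parse_idx_labels` are hypothetical and not part of the original notebook; the tiny byte strings are synthetic data built only to exercise the parsing.

```python
from struct import pack, unpack
import numpy as np

def parse_idx_images(buf):
    """Parse the bytes of an IDX3 image file into an (N, rows*cols) uint8 array."""
    # Header: magic number, image count, rows, cols (four big-endian uint32 values)
    _magic, n, rows, cols = unpack('>IIII', buf[:16])
    # Interpret the remaining bytes as pixels in one call instead of a Python loop
    pixels = np.frombuffer(buf, dtype=np.uint8, count=n * rows * cols, offset=16)
    return pixels.reshape(n, rows * cols)

def parse_idx_labels(buf):
    """Parse the bytes of an IDX1 label file into an (N,) uint8 array."""
    # Header: magic number, label count (two big-endian uint32 values)
    _magic, n = unpack('>II', buf[:8])
    return np.frombuffer(buf, dtype=np.uint8, count=n, offset=8)

# Synthetic example (assumption: not real MNIST data): two 2x2 "images", labels 7 and 3
image_bytes = pack('>IIII', 2051, 2, 2, 2) + bytes([0, 1, 2, 3, 4, 5, 6, 7])
label_bytes = pack('>II', 2049, 2) + bytes([7, 3])

x_demo = parse_idx_images(image_bytes)
y_demo = parse_idx_labels(label_bytes)
print(x_demo.shape, y_demo.tolist())
```

For real files, the bytes could be obtained with `open(path, 'rb').read()`. Note that `np.frombuffer` returns read-only arrays, so a `.copy()` would be needed before modifying them in place.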

In [3]:
x,y = loadmnist('../MNIST/train-images-idx3-ubyte', '../MNIST/train-labels-idx1-ubyte')
print("Length of x :"+str(len(x))+ " Length of y :"+str(len(y)))

Length of x :60000 Length of y :60000


In [4]:

#Scaling the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x)
x = scaler.transform(x)
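As a rough illustration of what this scaling step computes, the sketch below standardizes each column with the mean and standard deviation of the training data and then applies the same statistics to other data. It is a NumPy approximation of the idea, not the scikit-learn implementation; the `standardize` helper and the small arrays are hypothetical. The handling of constant columns (dividing by 1 when the standard deviation is 0) mirrors what `StandardScaler` does by default, which matters for MNIST because many border pixels are always 0.

```python
import numpy as np

def standardize(train, other):
    """Scale columns to zero mean and unit variance using training statistics only."""
    train = train.astype(np.float64)
    other = other.astype(np.float64)
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population standard deviation (ddof=0)
    # Constant columns have std 0; divide by 1 so they map to 0 instead of NaN
    std[std == 0] = 1.0
    return (train - mean) / std, (other - mean) / std

# Toy data (assumption): column 0 is constant, like an always-black pixel
train_demo = np.array([[0, 10, 0], [0, 20, 4], [0, 30, 8]], dtype=np.uint8)
test_demo = np.array([[0, 20, 4]], dtype=np.uint8)

train_scaled, test_scaled = standardize(train_demo, test_demo)
print(train_scaled.mean(axis=0))
print(test_scaled)
```

The test row equals the training mean, so it maps to all zeros. The key design point is that the test data is transformed with statistics learned from the training data, never re-fit on its own.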

In [5]:
x_test_images,y_test_labels = loadmnist('../MNIST/t10k-images-idx3-ubyte', '../MNIST/t10k-labels-idx1-ubyte')
print("Length of x_test_images :"+str(len(x_test_images))+ " Length of y_test_labels :"+str(len(y_test_labels)))

#scale test data
x_test_images = scaler.transform(x_test_images)


Length of x_test_images :10000 Length of y_test_labels :10000


*(b) Learn a linear SVM classifier using sklearn.svm.LinearSVC. You will need to set loss='hinge'.
How can you choose a suitable value of C? Explain your methodology.*

In [6]:
from sklearn.model_selection import train_test_split
X_train_images, X_validation_images, y_train_labels, y_validation_labels = train_test_split(x, y, train_size=50000,random_state=0)

In [8]:
from sklearn.svm import LinearSVC
C_list=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 1, 10]
header = "{:<10} {:<15}".format('C', 'validation_error')
print(header)
print('-' * len(header))
for C in C_list:
    # Train on the 50,000 training points with the current value of C
    lsvc = LinearSVC(loss='hinge', C=C, random_state=0, tol=1, max_iter=15000)
    lsvc.fit(X_train_images, y_train_labels)
    # Validation error = fraction of misclassified held-out points
    validation_predictions = lsvc.predict(X_validation_images)
    err_predictions = np.not_equal(y_validation_labels, validation_predictions)
    validation_error = float(np.sum(err_predictions)) / len(y_validation_labels)
    print("{:<10} {:<15}".format(C, validation_error))
print('-' * len(header))

C          validation_error
---------------------------
0.0001     0.2073         
0.0005     0.1196         
0.001      0.1056         
0.005      0.0903         
0.01       0.0873         
0.05       0.0831         
1          0.0828         
10         0.0852         
---------------------------


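Methodology:
    I held out 10,000 of the 60,000 training points as a validation set and trained on the remaining 50,000. For each candidate value of C, I trained a classifier and measured its error on the validation set.
    The best value of C appears to be 1, since it gives the lowest validation error (0.0828). The errors for C = 0.05 and C = 10 are only slightly higher, so the result is not very sensitive to C in this range.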
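The selection step can be sketched directly from the validation table above. The snippet below picks the C with the lowest validation error and, as a rough sanity check, computes the binomial standard error of that error rate for 10,000 validation points. The "within one standard error" comparison is an added heuristic for judging how meaningful the differences are, not part of the original methodology.

```python
import math

# Validation errors copied from the table above (10,000 held-out points)
validation_errors = {
    0.0001: 0.2073, 0.0005: 0.1196, 0.001: 0.1056, 0.005: 0.0903,
    0.01: 0.0873, 0.05: 0.0831, 1: 0.0828, 10: 0.0852,
}
n_validation = 10000

# C with the lowest validation error
best_C = min(validation_errors, key=validation_errors.get)
best_err = validation_errors[best_C]

# Binomial standard error of an error rate estimated from n points
std_err = math.sqrt(best_err * (1 - best_err) / n_validation)

# Values of C whose error is within one standard error of the best
near_best = [C for C, err in validation_errors.items() if err - best_err <= std_err]

print(best_C, round(std_err, 4), near_best)
```

Under this heuristic, C = 0.05, 1, and 10 are statistically hard to tell apart on a validation set of this size, which supports treating C = 1 as a reasonable rather than uniquely optimal choice.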

In [7]:
from sklearn.svm import LinearSVC
C = 1
lsvc = LinearSVC(loss='hinge', C=C, random_state=0, tol=1 , max_iter=15000)
lsvc.fit(X_train_images,y_train_labels)
test_predictions = lsvc.predict(x_test_images)
err_predictions = np.not_equal(y_test_labels,test_predictions)
test_error = float(np.sum(err_predictions))/len(y_test_labels)
print ("Final test error for C = {0} is {1}".format(C, test_error))

Final test error for C = 1 is 0.0845


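The data does not appear to be linearly separable. The classifier uses the soft-margin formulation with slack variables, and the validation and test errors stay around 8% for every value of C tried. Using slack does not by itself prove this; a direct check would be whether the training error stays above zero as C grows large.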