# SVM classification using active learning
Diane Lingrand (diane.lingrand@univ-cotedazur)

**DSAI evaluation, March 2022**

Write your name here:

SVM documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [19]:
#necessary imports
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
# ConfusionMatrixDisplay replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score, accuracy_score

## dataset: digits

In [20]:
# loading the dataset: X holds the flattened images, y the digit labels
from sklearn import datasets
(X,y) = datasets.load_digits(return_X_y=True)


In [21]:
X.shape

(1797, 64)

In [22]:
X.ndim

2

**Question 1** 
- What is the dimension of the data space?
- How many samples are in the digits dataset?

Compute these values (even if they are available on the net). Print the results in the following form (10 and 100 are placeholders, not the correct values):

    Data are of dimension: 10.
    There are 100 data in the digits dataset.

In [23]:
#write your solution here
# X.ndim is the number of array axes (2); the dimension of the data space is the number of features
print("Data are of dimension: %d." % X.shape[1])
print("There are %d data in the digits dataset." % X.shape[0])

Data are of dimension: 64.
There are 1797 data in the digits dataset.


In [24]:
# you will consider only 2 classes: the 3's and the 7's
c1 = 3
c2 = 7

In [25]:
# #write your solution here
# mesClasses_X = (y==c1)|(y==c2)
# mesClasses_X.shape
# XClasses = np.array(X[mesClasses_X,:])
# YClasses = np.array((y[mesClasses_X]-c1)/(c2-c1))

**Question 2:**

Set X to contain the part of the data from the original variable X that contains only data with labels 3 or 7. Set y to the corresponding labels: 0 value for class '3' and 1 value for class '7'.

In [38]:
# select on the labels y, not on the pixel values stored in X
mask = (y == c1) | (y == c2)
X = X[mask]
# relabel: 0 for class '3', 1 for class '7'
y = (y[mask] == c2).astype(int)

**Question 3:**

How many samples for class '3' and for class '7'? Print the values this way:
    
    There are ... data in class 3 and ... data in class 7.

In [27]:
print("There are %d data in class 3 and %d data in class 7." % (np.sum(y == 0), np.sum(y == 1)))
X.shape, y.shape

There are 183 data in class 3 and 179 data in class 7.
((362, 64), (362,))

In [28]:
# split into train and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
X_train.shape,y_train.shape, X_test.shape, y_test.shape

((217, 64), (217,), (145, 64), (145,))

In [29]:
# no reshaping is needed: X_train and X_test are already 2-D (n_samples, 64)
# and scikit-learn expects the labels y_train and y_test as 1-D arrays

## Baseline: train a linear SVM on the whole train dataset

**Question 4:**

Using a linear kernel and the default value C = 1, train an SVM classifier that separates the 3's from the 7's on the whole train dataset.


In [30]:
#write your solution here
mysvm = svm.SVC(kernel='linear', C=1)
mysvm.fit(X_train, y_train)
# predict from the test features X_test (not from the test labels)
ypred = mysvm.predict(X_test)


**Question 5:**

Compute the different metrics (F1 score, accuracy and confusion matrix) on the test dataset.

In [34]:
#write your solution here
f1_score(y_test, ypred, average='macro')

In [35]:
accuracy_score(y_test, ypred) #accuracy score

In [36]:
confusion_matrix(y_test, ypred)

## Active learning with SVM

Start with a few annotated samples, then iterate: ask for the labels of new samples and retrain the SVM. Try different strategies for selecting the newly labelled samples.

In [None]:
# short reminder for random integers:
import random
a = random.randint(2, 15)
# a is random integer such that 2 <= a <= 15

In [None]:
#In order to avoid any modification of (X_train, y_train), we will work on a copy in the next cells:
xTrainP = np.copy(X_train)
yTrainP = np.copy(y_train)

**Question 6: Initialisation of the active training dataset**

Construct a new training dataset named (xActif,yActif). To initialise it, randomly take nb0 samples from the copy of the original training dataset (xTrainP, yTrainP). You are allowed to use the labels in yTrainP in order to take nb0/2 samples from each class. These nb0 samples must also be removed from (xTrainP,yTrainP). Removing data can be done using [np.delete](https://numpy.org/doc/stable/reference/generated/numpy.delete.html).

In [None]:
# we assume that nb0 is an even number
nb0 = 4 # number of data in the active training dataset at initialisation
xActif = []
yActif = []

In [None]:
#write your solution here
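
One possible way to build the initial active set is sketched below. It is an illustration rather than the expected answer: the setup lines rebuild the 3-versus-7 split so the cell runs on its own, and the fixed seeds (`random_state=0`, `default_rng(0)`) are assumptions added only for reproducibility.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Setup (assumption: same 3-vs-7 data as above, with a fixed seed)
X, y = load_digits(return_X_y=True)
mask = (y == 3) | (y == 7)
X, y = X[mask], (y[mask] == 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

rng = np.random.default_rng(0)
xTrainP, yTrainP = np.copy(X_train), np.copy(y_train)
nb0 = 4  # assumed even

# Draw nb0 // 2 distinct indices from each class, using the labels in yTrainP
picked = np.concatenate([
    rng.choice(np.flatnonzero(yTrainP == label), size=nb0 // 2, replace=False)
    for label in (0, 1)
])

# Move the picked samples from the pool to the active set
xActif, yActif = xTrainP[picked], yTrainP[picked]
xTrainP = np.delete(xTrainP, picked, axis=0)  # axis=0 removes whole rows
yTrainP = np.delete(yTrainP, picked)

print(xActif.shape, yActif)
```

Passing `axis=0` to `np.delete` matters for the 2-D array: without it, NumPy flattens the array before deleting.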

**Question 7: Iterations of the active learning** 

1. Learn a linear SVM classifier on the active training dataset
2. Compute the accuracy on the test dataset (not modified)
3. Randomly pick nb new samples from (xTrainP, yTrainP), add them to the active training dataset and remove them from (xTrainP, yTrainP)
4. Go back to step 1 (20 times)

In [None]:
nb = 4
#write your solution here
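
The four steps above could be sketched as follows. This is a minimal illustration under stated assumptions: the helper `take` is hypothetical (not part of the notebook), the seeds are fixed for reproducibility, and the setup lines repeat the data preparation so the cell is self-contained.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Setup (assumption: same 3-vs-7 data as above, with a fixed seed)
X, y = load_digits(return_X_y=True)
mask = (y == 3) | (y == 7)
X, y = X[mask], (y[mask] == 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

def take(xPool, yPool, idx):
    """Hypothetical helper: return the samples at idx and the pool without them."""
    return (xPool[idx], yPool[idx],
            np.delete(xPool, idx, axis=0), np.delete(yPool, idx))

rng = np.random.default_rng(0)
nb0, nb, n_iter = 4, 4, 20
xTrainP, yTrainP = np.copy(X_train), np.copy(y_train)

# Balanced initialisation: nb0 // 2 samples per class
first = np.concatenate([
    rng.choice(np.flatnonzero(yTrainP == c), nb0 // 2, replace=False)
    for c in (0, 1)
])
xActif, yActif, xTrainP, yTrainP = take(xTrainP, yTrainP, first)

acc_random = []
for _ in range(n_iter):
    # 1. learn a linear SVM on the active set
    clf = svm.SVC(kernel="linear").fit(xActif, yActif)
    # 2. accuracy on the untouched test set
    acc_random.append(accuracy_score(y_test, clf.predict(X_test)))
    # 3. move nb randomly chosen samples from the pool to the active set
    idx = rng.choice(len(xTrainP), size=nb, replace=False)
    xNew, yNew, xTrainP, yTrainP = take(xTrainP, yTrainP, idx)
    xActif = np.vstack([xActif, xNew])
    yActif = np.concatenate([yActif, yNew])

print(acc_random)
```

The list `acc_random` holds one test accuracy per iteration, measured before the new samples of that iteration are added.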

**Question 8: plot the evolution of the accuracy**

Plot the accuracy with respect to the iterations from the previous question.

In [None]:
#write your solution here

**Question 9: strategy for choosing new data**
    
Same as question 7 but, instead of choosing the new points randomly, at each iteration choose the nb points of (xTrainP, yTrainP) that are the closest to the separating hyperplane of the current SVM. The [decision_function](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.decision_function) from scikit-learn will help you.

In [None]:
#write your solution here
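
A possible sketch of this selection strategy is given below. For a two-class `SVC`, `decision_function` returns one signed score per sample, and the absolute value of that score is used here as a proxy for the distance to the separation. The helper `take` and the fixed seeds are assumptions made for the illustration.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Setup (assumption: same 3-vs-7 data as above, with a fixed seed)
X, y = load_digits(return_X_y=True)
mask = (y == 3) | (y == 7)
X, y = X[mask], (y[mask] == 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

def take(xPool, yPool, idx):
    """Hypothetical helper: return the samples at idx and the pool without them."""
    return (xPool[idx], yPool[idx],
            np.delete(xPool, idx, axis=0), np.delete(yPool, idx))

rng = np.random.default_rng(0)
nb0, nb, n_iter = 4, 4, 20
xTrainP, yTrainP = np.copy(X_train), np.copy(y_train)

# Balanced random initialisation, as in question 6
first = np.concatenate([
    rng.choice(np.flatnonzero(yTrainP == c), nb0 // 2, replace=False)
    for c in (0, 1)
])
xActif, yActif, xTrainP, yTrainP = take(xTrainP, yTrainP, first)

acc_margin = []
for _ in range(n_iter):
    clf = svm.SVC(kernel="linear").fit(xActif, yActif)
    acc_margin.append(accuracy_score(y_test, clf.predict(X_test)))
    # |decision_function| is small for samples near the separation
    scores = np.abs(clf.decision_function(xTrainP))
    idx = np.argsort(scores)[:nb]  # the nb samples with the smallest scores
    xNew, yNew, xTrainP, yTrainP = take(xTrainP, yTrainP, idx)
    xActif = np.vstack([xActif, xNew])
    yActif = np.concatenate([yActif, yNew])

print(acc_margin)
```

Only the selection step differs from the random loop: the labels of the chosen samples are read after the choice is made, so yTrainP is never used to decide which points to query.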

**Question 10: plot the evolution of the accuracy**

Plot the accuracy with respect to the iterations from the previous question.
Compare with question 8. Also compare with the baseline.

In [None]:
#write your solution here

**Question 11: many random starts**
    
Since the initialisation is random, running the previous code again can lead to different curves for questions 8 and 10. Write here the code needed to compute several (e.g. 10) curves for each of questions 8 and 10, and display these new plots. Which strategy is the best?
    

In [None]:
#write your solution here
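
One way to organise the repeated runs is sketched here. It tracks index arrays into the training set instead of deleting rows, which is a design choice of this sketch and not something the questions require. The function name `run`, the seeds 0 to 9 and the plot layout are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Setup (assumption: same 3-vs-7 data as above, with a fixed split)
X, y = load_digits(return_X_y=True)
mask = (y == 3) | (y == 7)
X, y = X[mask], (y[mask] == 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

def run(strategy, seed, nb0=4, nb=4, n_iter=20):
    """Hypothetical helper: one active-learning run, returns the accuracy curve."""
    rng = np.random.default_rng(seed)
    labelled = np.concatenate([
        rng.choice(np.flatnonzero(y_train == c), nb0 // 2, replace=False)
        for c in (0, 1)
    ])
    pool = np.setdiff1d(np.arange(len(y_train)), labelled)
    accs = []
    for _ in range(n_iter):
        clf = svm.SVC(kernel="linear").fit(X_train[labelled], y_train[labelled])
        accs.append(accuracy_score(y_test, clf.predict(X_test)))
        if strategy == "random":
            chosen = rng.choice(pool, size=nb, replace=False)
        else:  # "margin": samples closest to the separation
            scores = np.abs(clf.decision_function(X_train[pool]))
            chosen = pool[np.argsort(scores)[:nb]]
        labelled = np.concatenate([labelled, chosen])
        pool = np.setdiff1d(pool, chosen)
    return accs

# 10 random starts per strategy; each row of an array is one curve
curves = {s: np.array([run(s, seed) for seed in range(10)])
          for s in ("random", "margin")}

fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
for ax, (name, runs) in zip(axes, curves.items()):
    ax.plot(runs.T, alpha=0.4)                       # individual runs
    ax.plot(runs.mean(axis=0), color="black", linewidth=2, label="mean")
    ax.set_title(name)
    ax.set_xlabel("iteration")
    ax.legend()
axes[0].set_ylabel("test accuracy")
```

Using the same seeds for both strategies gives both the same initial samples, so the curves differ only through the selection rule.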

**Question 12: hyperparameters**

So far, you have used the linear kernel with the default parameters. Using the strategy of question 9, how could you choose the kernel and the hyperparameters? Try different experiments such as:
- choose the kernel and hyperparameters using the nb0 initial samples
- update the kernel and hyperparameters after a few iterations
- compare different trials
    

In [None]:
#write your solution here
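
As one possible starting point, the kernel and hyperparameters can be selected by cross-validated grid search on the samples labelled so far. The sketch below only shows that selection step. The labelled-set size (10 per class), the grid values and the 3-fold setting are arbitrary assumptions, and cross-validation on so few samples gives noisy estimates.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split

# Setup (assumption: same 3-vs-7 data as above, with a fixed seed)
X, y = load_digits(return_X_y=True)
mask = (y == 3) | (y == 7)
X, y = X[mask], (y[mask] == 7).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Stand-in for the current active set: 10 labelled samples per class
rng = np.random.default_rng(0)
labelled = np.concatenate([
    rng.choice(np.flatnonzero(y_train == c), 10, replace=False)
    for c in (0, 1)
])

# Candidate kernels and hyperparameters (illustrative values only)
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 1e-3, 1e-2]},
]

# 3-fold cross-validation restricted to the labelled samples
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X_train[labelled], y_train[labelled])

print("selected:", search.best_params_)
test_acc = search.score(X_test, y_test)  # accuracy of the refitted best model
print("test accuracy:", test_acc)
```

Re-running the search every few iterations of the active loop, on the grown active set, is one way to implement the "update after a few iterations" experiment.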