# NTOU Machine learning HW1
## by 00857005 周固廷

In this exercise, we compare the accuracy of two classifiers, k-NN and linear SVM, on the Breast Cancer Wisconsin (Diagnostic) dataset using nested cross-validation. The settings for this exercise are as follows.

1. Hyperparameters:
    The hyperparameter for k-NN is n_neighbors, with candidates [1,3,5,7].
    The hyperparameter for linear SVM is C, with candidates [0.01,0.1,1,10].
2. Nested cross-validation:
   outer loop: 10-fold stratified cross-validation
   inner loop: GridSearchCV with cv=5
3. Dataset:
    import sklearn.datasets as ds
    data, target = ds.load_breast_cancer(return_X_y=True)

Useful links:
    + linear SVM (sklearn.svm.SVC): https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

    + k-nearest neighbor classifier (sklearn.neighbors.KNeighborsClassifier): https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

    + dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

In [7]:
import sklearn.datasets as ds
from sklearn.model_selection import StratifiedKFold,cross_val_score,GridSearchCV
import numpy as np

data, target = ds.load_breast_cancer(return_X_y=True)

# 1. SVM

In [None]:
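# Baseline linear SVM, fitted and scored on the full dataset
# (training accuracy only, not an estimate of generalization)
from sklearn import svm
svmClf = svm.SVC(kernel='linear')
svmClf.fit(data, target)
svmClf.score(data, target)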

In [None]:
# cv stands for cross-validation
outer_kFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)  # outer loop: stratified split into 10 folds
inner_cv_svm = GridSearchCV(svmClf, {'C': [0.01, 0.1, 1, 10]}, cv=5)  # inner loop: hyperparameter tuning with 5-fold validation
svm_scores = cross_val_score(inner_cv_svm, data, target, cv=outer_kFold, scoring='accuracy')
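
As a rough illustration of what this nested setup costs, the sketch below counts how many times the classifier is fitted. It assumes `GridSearchCV` keeps its default `refit=True`, so each outer fold trains every candidate on every inner fold and then refits the best candidate once on the whole outer training split. The variable names are illustrative only.

```python
# Count classifier fits in the nested cross-validation described above.
# Assumption: GridSearchCV uses its default refit=True.
outer_folds = 10                  # StratifiedKFold(n_splits=10)
inner_folds = 5                   # GridSearchCV(cv=5)
candidates = [0.01, 0.1, 1, 10]   # grid for C (the k-NN grid also has 4 values)

# Every candidate is trained once per inner fold, plus one final refit.
fits_per_outer_fold = len(candidates) * inner_folds + 1
total_fits = outer_folds * fits_per_outer_fold

print(fits_per_outer_fold)  # 21
print(total_fits)           # 210
```

Because both grids have four candidates, the same count would apply to the k-NN run under the same assumption.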

In [4]:
print("svm accuracy:",svm_scores)
print("svm accuracy mean:",svm_scores.mean())
print("svm accuracy std:",svm_scores.std())

svm accuracy: [0.87719298 0.92982456 0.89473684 0.92982456 0.87719298 0.9122807
 0.96491228 0.98245614 0.92982456 0.92857143]
svm accuracy mean: 0.9226817042606517
svm accuracy std: 0.032514148720571616


In [5]:
# Compute AUC
svm_auc_scores = cross_val_score(inner_cv_svm, data, target, cv=outer_kFold, scoring='roc_auc')
print(svm_auc_scores)
print("svm auc mean: {}".format(svm_auc_scores.mean()))
print("svm auc std: {}".format(svm_auc_scores.std()))

[0.98181818 0.94545455 0.97883598 0.98677249 0.93518519 1.
 0.99074074 0.99470899 0.98677249 0.99183673]
svm auc mean: 0.979212533498248
svm auc std: 0.020405330753281024
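
For intuition about what the `roc_auc` scores above measure, AUC can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal pure-Python sketch of that pairwise definition. The function `pairwise_auc` and the toy labels and scores are hypothetical and are not part of this homework's pipeline.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (made up for illustration): 2 negatives, 2 positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

In the toy data, three of the four positive/negative pairs are ordered correctly, hence 0.75. An AUC near 1, as in the folds above, means almost every such pair is ordered correctly.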


# 2. k-NN

In [6]:
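# Baseline k-NN (default n_neighbors=5), fitted and scored on the full dataset
# (training accuracy only, not an estimate of generalization)
from sklearn.neighbors import KNeighborsClassifier
knnClf = KNeighborsClassifier()
knnClf.fit(data, target)
knnClf.score(data, target)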

0.9472759226713533

In [7]:
# cv stands for cross-validation
# outer_kFold_knn = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)  # outer loop: stratified split into 10 folds
inner_cv_knn = GridSearchCV(knnClf, {'n_neighbors': [1, 3, 5, 7]}, cv=5)  # inner loop: hyperparameter tuning with 5-fold validation
knn_scores = cross_val_score(inner_cv_knn, data, target, cv=outer_kFold, scoring='accuracy')
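
To make explicit what the `cross_val_score` call above does when it is given a `GridSearchCV` estimator, the loop below unrolls the nested procedure by hand for k-NN. It is a sketch under the same settings as this notebook (same grid, same `random_state=2`). The names `manual_scores`, `reference_scores`, `param_grid`, and `search` are introduced here for illustration.

```python
import numpy as np
import sklearn.datasets as ds
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data, target = ds.load_breast_cancer(return_X_y=True)
outer_kFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
param_grid = {'n_neighbors': [1, 3, 5, 7]}

manual_scores = []
for train_idx, test_idx in outer_kFold.split(data, target):
    # Inner loop: choose n_neighbors using only the outer training split.
    search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    search.fit(data[train_idx], target[train_idx])
    # Outer loop: score the refitted best model on the held-out fold.
    manual_scores.append(search.score(data[test_idx], target[test_idx]))
manual_scores = np.array(manual_scores)

# The library one-liner, for comparison with the hand-written loop.
reference_scores = cross_val_score(
    GridSearchCV(KNeighborsClassifier(), param_grid, cv=5),
    data, target, cv=outer_kFold, scoring='accuracy')
print(np.allclose(manual_scores, reference_scores))
```

The point of the unrolled form is that the held-out outer fold never takes part in choosing `n_neighbors`, which is what keeps the outer accuracy estimate free of tuning bias.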

In [8]:
print("knn accuracy:",knn_scores)
print("knn accuracy mean:",knn_scores.mean())
print("knn accuracy std:",knn_scores.std())

knn accuracy: [0.94736842 0.94736842 0.9122807  0.92982456 0.87719298 0.94736842
 0.94736842 0.98245614 0.92982456 0.92857143]
knn accuracy mean: 0.9349624060150376
knn accuracy std: 0.02610880527963588


In [9]:
# Compute AUC
knn_auc_scores = cross_val_score(inner_cv_knn, data, target, cv=outer_kFold, scoring='roc_auc')
print(knn_auc_scores)
print("knn auc mean: {}".format(knn_auc_scores.mean()))
print("knn auc std: {}".format(knn_auc_scores.std()))



[0.96818182 0.99415584 0.91600529 0.96626984 0.88624339 0.99140212
 0.98082011 0.97222222 0.98941799 0.93741497]
knn auc mean: 0.960213358070501
knn auc std: 0.03404768174744221


# Computing the p-value

In [10]:
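from scipy import stats
# One-sample t-test on the per-fold accuracy differences (equivalent to a paired t-test);
# null hypothesis: the mean difference is 0
t, pValue = stats.ttest_1samp(svm_scores - knn_scores, 0)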

In [11]:
print("t: {}".format(t))
print("pValue: {}".format(pValue))

t: -1.560917707119047
pValue: 0.15297630918246946
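
As a cross-check on the test above, the sketch below recomputes the statistic from the per-fold accuracies printed earlier, using only the standard library. The two-sided p-value is obtained by numerically integrating the Student t density with 9 degrees of freedom (Simpson's rule); this is an illustrative substitute, not a description of how SciPy computes it. The helper names `t_pdf` and `two_sided_p` are hypothetical.

```python
import math
import statistics

# Per-fold accuracies copied from the outputs above.
svm_acc = [0.87719298, 0.92982456, 0.89473684, 0.92982456, 0.87719298,
           0.9122807, 0.96491228, 0.98245614, 0.92982456, 0.92857143]
knn_acc = [0.94736842, 0.94736842, 0.9122807, 0.92982456, 0.87719298,
           0.94736842, 0.94736842, 0.98245614, 0.92982456, 0.92857143]

diffs = [s - k for s, k in zip(svm_acc, knn_acc)]
n = len(diffs)
# t = mean(d) / (sample_std(d) / sqrt(n)), with n - 1 degrees of freedom.
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def t_pdf(x, dof):
    """Density of Student's t distribution with `dof` degrees of freedom."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return coef * (1 + x * x / dof) ** (-(dof + 1) / 2)


def two_sided_p(t, dof, steps=2000):
    """Two-sided p-value: 1 minus the t density integrated over [-|t|, |t|]."""
    a, b = -abs(t), abs(t)
    h = (b - a) / steps  # steps must be even for Simpson's rule
    total = t_pdf(a, dof) + t_pdf(b, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, dof)
    return 1 - total * h / 3


p_value = two_sided_p(t_stat, n - 1)
print(round(t_stat, 4))   # -1.5609
print(round(p_value, 3))  # 0.153
```

Five of the ten per-fold differences are exactly zero, so the test rests on only a handful of folds where the two classifiers disagree.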


1. 210
2.
    (a)

    svm accuracy: [0.87719298 0.92982456 0.89473684 0.92982456 0.87719298 0.9122807
     0.96491228 0.98245614 0.92982456 0.92857143]
    svm accuracy mean: 0.9226817042606517
    svm accuracy std: 0.032514148720571616

    knn accuracy: [0.94736842 0.94736842 0.9122807  0.92982456 0.87719298 0.94736842
     0.94736842 0.98245614 0.92982456 0.92857143]
    knn accuracy mean: 0.9349624060150376
    knn accuracy std: 0.02610880527963588

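    (b)
    p-value: 0.15297630918246946

    (c)
    No. The p-value is greater than 0.05 (a commonly used threshold),
    so we cannot confidently reject the null hypothesis.
    In other words, the accuracies of the two classifiers do not differ significantly.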
3.
    svm average AUC: 0.979212533498248
    knn average AUC: 0.960213358070501
4.
    N=10 e=2
    N=100 e=10
    N=1000 e=100

