<hr style="height:1px;border:none;color:#333;background-color:#333;" />

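### Q3. Define your own classifier class, myGaussianClassifier, in scikit-learn
In this assignment we assume that $P(X|C=i),\forall i$ follows a Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i})$. You must complete the functions below so that myGaussianClassifier can be used within the scikit-learn framework. The first three functions ($\text{__init__,fit,predict}$) are required; the remaining ones can be added as needed. For the values these three functions must return, follow the documentation at<br>
https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html <br>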


+ $\text{__init__(self,alpha=1e-5)}$: the function's parameters must include every configurable parameter together with its default value. Here $alpha$ is the regularization parameter used to regularize the sample covariance matrix.
+ fit(self,train,target): for each class you must estimate $\boldsymbol{\mu}_{i},\boldsymbol{\Sigma}_{i},P(C_{i})$, where $\boldsymbol{\Sigma}_{i}=\text{the sample covariance matrix of class $i$}+\alpha\mathbf{I}$. If the class labels in target are 0,1,2,...,c-1, you can use np.mean and np.cov to obtain the sample mean and sample covariance matrix (see the estimation formulas for the sample mean and sample covariance matrix in the Multivariate Methods chapter):
   + **sample mean**: $m_{i}$=np.mean(train[np.nonzero(target.ravel()==i)],axis=0)
   + **sample covariance matrix**: $S_{i}$=np.cov(train[np.nonzero(target.ravel()==i)],rowvar=False)+alpha*np.eye(train.shape[1])
   + **prior**: $P(C=i)$= np.sum(target.ravel()==i)/target.size
   
   
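+ predict(self,X,y=None) must compute, for each row of $X$, the posterior probabilities $P(C=i|x),\forall i$, and assign x to the class with the highest posterior probability.
+ predict_proba(self,X,y=None) computes the posterior probabilities $P(C|x)$ for each row of data.
+ score(self,X,y) computes the score this model obtains on data X with answers y (a higher score means a better model, e.g. accuracy).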
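
The three estimates listed under fit can be tried on a small example before they go into the class. The snippet below is only a sketch: the toy array `train`, the labels `target`, and the dictionaries `means`, `covs`, and `priors` are made up for illustration. Note that np.cov divides by $N-1$ by default, which may differ from the estimator given in the course chapter.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 features, class labels 0 and 1.
train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0],
                  [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
target = np.array([0, 0, 0, 1, 1, 1])
alpha = 1e-5  # regularization strength added to the diagonal

means, covs, priors = {}, {}, {}
for i in np.unique(target):
    rows = train[np.nonzero(target.ravel() == i)]  # samples of class i
    means[i] = np.mean(rows, axis=0)
    # Regularized sample covariance: S_i + alpha * I
    covs[i] = np.cov(rows, rowvar=False) + alpha * np.eye(train.shape[1])
    priors[i] = np.sum(target.ravel() == i) / target.size

for i in means:
    print("class", i, "mean", means[i], "prior", priors[i])
    print(covs[i])
```

Adding `alpha * np.eye(d)` keeps each covariance matrix invertible even when a class has few samples or strongly correlated features.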

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

### Q3.1 Complete the myGaussianClassifier program.
<pre>
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

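class myGaussianClassifier(ClassifierMixin, BaseEstimator): # must inherit from ClassifierMixin and BaseEstimator (mixins go to the left of BaseEstimator)
    def __init__(self,alpha=1.e-5):        # the initializer's parameters must include every configurable parameter and its default value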
        if isinstance(self,myGaussianClassifier):
            super(myGaussianClassifier,self).__init__()  
        self.alpha = alpha
        
    def fit(self,train,target): # required
        #    
        N,d = train.shape
        label = np.sort(np.unique(target.ravel()))
        self.c_     = label.size
        self.d_     = d
        self.prior_ = np.zeros((self.c_,))
        self.mean_  = np.zeros((self.c_,self.d_))
        self.cov_   = np.zeros((self.c_,self.d_,self.d_))
        # compute the mean and covariance of each class
        for cid,y in enumerate(label):
            idx = np.nonzero(target.ravel()==y)
            self.cov_[cid] =np.cov(train[idx],rowvar=False)+self.alpha*np.eye(d)
            # TODO: complete the mean and the prior
            
        return self # fit must return the object itself
 
    def predict(self,X, y=None): # required
        pass
    
    def predict_proba(self,X, y=None): # add if needed
        pass
    
    def score(self,X,y): # optional
        pass
</pre>
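
One possible way to fill in the skeleton is sketched below. It is not the reference solution: the class name `SketchGaussianClassifier`, the helper `_joint_log_likelihood`, and the demo data `X_demo`/`y_demo` are hypothetical. The sketch assumes the posterior is obtained from Bayes' rule, $P(C=i|x)\propto p(x|C=i)P(C=i)$, and works with logarithms to avoid numerical underflow. `score` is not defined because ClassifierMixin already supplies an accuracy-based one.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class SketchGaussianClassifier(ClassifierMixin, BaseEstimator):
    """Illustrative sketch only; names here are not part of the assignment."""

    def __init__(self, alpha=1e-5):
        self.alpha = alpha  # store hyperparameters unchanged, as sklearn expects

    def fit(self, train, target):
        train = np.asarray(train, dtype=float)
        target = np.asarray(target).ravel()
        self.classes_ = np.unique(target)
        c, d = self.classes_.size, train.shape[1]
        self.prior_ = np.zeros(c)
        self.mean_ = np.zeros((c, d))
        self.cov_ = np.zeros((c, d, d))
        for cid, y in enumerate(self.classes_):
            rows = train[target == y]
            self.mean_[cid] = rows.mean(axis=0)
            self.cov_[cid] = np.cov(rows, rowvar=False) + self.alpha * np.eye(d)
            self.prior_[cid] = rows.shape[0] / target.size
        return self

    def _joint_log_likelihood(self, X):
        # log p(x | C=i) + log P(C=i) for every row of X and every class i
        X = np.asarray(X, dtype=float)
        d = X.shape[1]
        scores = np.zeros((X.shape[0], self.classes_.size))
        for cid in range(self.classes_.size):
            diff = X - self.mean_[cid]
            _, logdet = np.linalg.slogdet(self.cov_[cid])
            # squared Mahalanobis distance of each row to the class mean
            maha = np.sum(diff * np.linalg.solve(self.cov_[cid], diff.T).T, axis=1)
            scores[:, cid] = (-0.5 * (d * np.log(2 * np.pi) + logdet + maha)
                              + np.log(self.prior_[cid]))
        return scores

    def predict_proba(self, X):
        log_joint = self._joint_log_likelihood(X)
        log_joint -= log_joint.max(axis=1, keepdims=True)  # stabilize exp
        post = np.exp(log_joint)
        return post / post.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self._joint_log_likelihood(X), axis=1)]


# Hypothetical demo data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                    rng.normal(10.0, 1.0, size=(20, 2))])
y_demo = np.repeat([0, 1], 20)
clf = SketchGaussianClassifier(alpha=0.1).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)
```

Because `__init__` only stores `alpha`, `get_params` and `set_params` inherited from BaseEstimator work without extra code, which is what GridSearchCV relies on when it clones the estimator.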

<hr style="height:1px;border:none;color:#333;background-color:#333;" />

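### Q3.2, Q3.3 Evaluate performance with GridSearchCV and cross_val_score (as in Assignment 1), and check whether the accuracy differs significantly from that of GaussianNB, kNN, and SVC
In this assignment you will compare the classifiers below, using GridSearchCV to select suitable hyperparameters with cv set to 5: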
+ SVC: {'kernel':['linear'],'C':[0.01, 0.1, 1, 10]}
+ KNeighborsClassifier: {'n_neighbors':[1,3,5,7]}
+ myGaussian: {'alpha':[0.001,0.01,0.1,1,10,100]}
+ GaussianNB: {'var_smoothing':[1e-5,1e-4,1e-3,1e-2,1e-1,1]}

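Questions you must answer:
### Q3.2 What are the mean accuracies of myGaussianClassifier, SVC with linear kernel, kNN, and GaussianNB, respectively?
### Q3.3 Using a paired t-test at significance level 0.05, does the classifier with the highest mean accuracy differ significantly in mean accuracy from each of the other three classifiers? Report the p-values.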
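
A paired t-test on per-fold accuracies can be run with `scipy.stats.ttest_rel`, roughly as sketched below. The arrays `scores_a` and `scores_b` are made-up numbers standing in for two outputs of cross_val_score computed on the same folds; they are not results from this assignment.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies of two classifiers on the SAME 10 folds.
scores_a = np.array([0.95, 0.93, 0.96, 0.91, 0.98, 0.95, 0.93, 0.96, 0.95, 0.96])
scores_b = np.array([0.93, 0.91, 0.95, 0.91, 0.95, 0.93, 0.91, 0.93, 0.95, 0.93])

# Paired (dependent-samples) t-test on the fold-wise differences.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
significant = p_value < 0.05  # reject "equal mean accuracy" at the 0.05 level

print("mean accuracy:", scores_a.mean(), scores_b.mean())
print("t =", t_stat, "p =", p_value, "significant:", significant)
```

The pairing is only meaningful when both classifiers are scored on identical splits, which is why the same StratifiedKFold object should be passed to every cross_val_score call.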


In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn import neighbors, svm, naive_bayes 
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
import sklearn.datasets as ds
import numpy as np

# Load the dataset (return_X_y must be passed by keyword in recent scikit-learn versions)
data,target    = ds.load_breast_cancer(return_X_y=True)
 
# Declare the classifiers
gauss_clf      = myGaussianClassifier()
knn_clf        = neighbors.KNeighborsClassifier(n_neighbors=3,weights='uniform',algorithm='kd_tree',leaf_size=30)
svm_clf        = svm.SVC(kernel='linear', C=1, probability=True)
gaussnb_clf    = naive_bayes.GaussianNB()

# Define the hyperparameters and their candidate values
knn_clf_param = {'n_neighbors':[1,3,5,7]}
svm_clf_param = {'C':[0.01, 0.1, 1, 10]}
gauss_clf_param={'alpha':[0.001,0.01,0.1,1,10,100]}
gaussnb_clf_param={'var_smoothing':np.logspace(-5,0,6)}  # 1e-5, 1e-4, ..., 1, matching the grid listed above

# inner cross-validation for hyper-parameter tuning
# If n_jobs=-1 causes problems on Windows, change it to n_jobs = 1
gauss_gs     = GridSearchCV(estimator=gauss_clf,param_grid = gauss_clf_param, scoring = 'accuracy', cv=5, n_jobs=-1, verbose=1)
knn_gs       = GridSearchCV(estimator=knn_clf,param_grid = knn_clf_param, scoring = 'accuracy', cv=5, n_jobs=-1, verbose=1)
svm_gs       = GridSearchCV(estimator=svm_clf,param_grid = svm_clf_param, scoring = 'accuracy', cv=5,  n_jobs=-1, verbose=1)
svm_pipeline = Pipeline([('scaler',MinMaxScaler()),('svm_gs',svm_gs)])
gaussnb_gs   = GridSearchCV(estimator=gaussnb_clf,param_grid = gaussnb_clf_param, scoring = 'accuracy', cv=5, n_jobs=-1, verbose=1)

# outer cross-validation for estimating the accuracy of the classifier
# the classifiers to be compared must be evaluated by the same k-fold CV
kfold        = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)

# If n_jobs=-1 causes problems on Windows, change it to n_jobs = 1
gauss_scores   = cross_val_score(gauss_gs, data, target, scoring='accuracy',cv = kfold, verbose=10) 
knn_scores     = cross_val_score(knn_gs, data, target, scoring='accuracy',cv = kfold, verbose=10)
svm_scores     = cross_val_score(svm_pipeline, data, target, scoring='accuracy',cv = kfold, verbose=10)
gaussnb_scores = cross_val_score(gaussnb_gs, data, target, scoring='accuracy',cv = kfold, verbose=10)

# Students: continue from here and complete the comparison
# apply the paired t-test (Refer to ppt for Chapter 20 Design and Analysis of Machine Learning Experiments)

#### Performance evaluation