# Machine Learning Theory and Practice: Exercise 4

## Implementing a Pipeline (machine learning workflow)

<img src="https://img-blog.csdn.net/20160115095855517" width="50%">

## [Machine learning: Pipeline in sklearn](https://hk.saowen.com/a/673021f529c2dcc9ff6802c40290787d2d55032049ccc6b6606b1a52dbb3f1d8)

### Load data

In [1]:
from sklearn import datasets

In [2]:
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

In [3]:
digits

{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  2., ..., 12.,  0.,  0.],
        [ 0.,  0., 10., ..., 12.,  1.,  0.]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 'images': array([[[ 0.,  0.,  5., ...,  1.,  0.,  0.],
         [ 0.,  0., 13., ..., 15.,  5.,  0.],
         [ 0.,  3., 15., ..., 11.,  8.,  0.],
         ...,
         [ 0.,  4., 11., ..., 12.,  7.,  0.],
         [ 0.,  2., 14., ..., 12.,  0.,  0.],
         [ 0.,  0.,  6., ...,  0.,  0.,  0.]],
 
        [[ 0.,  0.,  0., ...,  5.,  0.,  0.],
         [ 0.,  0.,  0., ...,  9.,  0.,  0.],
         [ 0.,  0.,  3., ...,  6.,  0.,  0.],
         ...,
         [ 0.,  0.,  1., ...,  6.,  0.,  0.],
         [ 0.,  0.,  1., ...,  6.,  0.,  0.],
         [ 0.,  0.,  0., ..., 10.,  0.,  0.]],
 
        [[ 0

In [6]:
print(digits['DESCR'])   # description of the optical recognition of handwritten digits dataset

Optical Recognition of Handwritten Digits Data Set

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is a
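
The raw printouts above are truncated, so the layout of the dataset is easier to see from its array shapes. A minimal sketch, assuming the copy bundled with scikit-learn, which holds 1,797 samples (a subset of the 5,620 instances mentioned in the description):

```python
from sklearn import datasets

digits = datasets.load_digits()

# Flattened pixels: one row per image, one column per pixel of the 8x8 grid.
print(digits.data.shape)     # (1797, 64)
# The same pixels kept as 8x8 images.
print(digits.images.shape)   # (1797, 8, 8)
# One integer label (0-9) per image.
print(digits.target.shape)   # (1797,)

# `data` is simply `images` reshaped to (n_samples, 64).
n_samples = len(digits.images)
same = (digits.images.reshape(n_samples, -1) == digits.data).all()
print(same)                  # True
```

The flattened `data` array is what the pipeline below consumes; `images` is only convenient for plotting.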

### Model and pipeline

In [9]:
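from sklearn.decomposition import PCA  # principal component analysis; if this dimensionality reduction works poorly, try another method
from sklearn.linear_model import SGDClassifier  # linear classifier that fits regularized linear models with stochastic gradient descent (SGD)
from sklearn.pipeline import Pipeline  # machine learning pipeline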

In [10]:
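# Initialize the models and combine them
# loss='log' gives logistic regression (the option is named 'log_loss' from scikit-learn 1.1 on)
logistic = SGDClassifier(loss='log', penalty='l2', max_iter=10000, tol=1e-5, random_state=0)
pca = PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])  # named steps, run in order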
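
The step names chosen in `steps` matter later, because nested parameters are addressed as `<step name>__<parameter name>`. A minimal sketch of that naming scheme; the version check and the `log_loss_name` variable are illustrative assumptions so the snippet runs on both old and new scikit-learn releases:

```python
import sklearn
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Assumption: choose the loss name by installed version
# ('log' was renamed to 'log_loss' in scikit-learn 1.1).
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
log_loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

pipe = Pipeline(steps=[
    ("pca", PCA()),
    ("logistic", SGDClassifier(loss=log_loss_name, random_state=0)),
])

# Each step is reachable by the name given in `steps`.
print(type(pipe.named_steps["pca"]).__name__)   # PCA

# Nested parameters appear as <step name>__<parameter name>.
params = pipe.get_params()
print("pca__n_components" in params)            # True
print("logistic__alpha" in params)              # True

# set_params uses the same naming scheme.
pipe.set_params(pca__n_components=20, logistic__alpha=0.01)
print(pipe.named_steps["pca"].n_components)     # 20
```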

## [SGDClassifier, SGD (stochastic gradient descent)](https://www.twblogs.net/a/5b8adedf2b71775d1ce99a58)

### Grid search + CV

In [11]:
import numpy as np
from sklearn.model_selection import GridSearchCV

In [12]:
param_grid = {
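    'pca__n_components': [5, 20, 30, 40, 50, 64],   # the prefix must match the step name above; a double underscore joins step and parameter
    'logistic__alpha': np.logspace(-4, 4, 5),  # np.logspace returns numbers evenly spaced on a log scale
}

# GridSearchCV searches the given parameter values exhaustively (brute force: every combination is tried).
# Note: the iid argument was removed in scikit-learn 0.24; drop it on newer versions.
search = GridSearchCV(pipe, param_grid, iid=False,
                      cv=5, n_jobs=-1,               # cv=5: number of folds in the (stratified) K-fold cross-validation
                      return_train_score=False)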
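
Because the search is exhaustive, its cost grows with the product of the list lengths. A small sketch that counts the candidates for the grid above using `ParameterGrid`; the names `n_candidates` and `n_fits` are introduced here only for illustration:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    "pca__n_components": [5, 20, 30, 40, 50, 64],
    "logistic__alpha": np.logspace(-4, 4, 5),
}

# Five values from 1e-4 to 1e4, evenly spaced in the exponent.
print(np.logspace(-4, 4, 5))

# ParameterGrid enumerates every combination the grid search will try.
grid = ParameterGrid(param_grid)
n_candidates = len(grid)       # 6 * 5 = 30 combinations
n_fits = n_candidates * 5      # cv=5 gives 150 fits, plus one final refit
print(n_candidates, n_fits)    # 30 150
```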

### Training

In [13]:
search.fit(X_digits, y_digits)

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logistic', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l...y='l2', power_t=0.5, random_state=0, shuffle=True,
       tol=1e-05, verbose=0, warm_start=False))]),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'pca__n_components': [5, 20, 30, 40, 50, 64], 'logistic__alpha': array([1.e-04, 1.e-02, 1.e+00, 1.e+02, 1.e+04])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
       scoring=None, verbose=0)

### Performance

In [14]:
print("Best parameter (CV score={}):".format(search.best_score_))  # best_score_: mean cross-validated score of the best estimator
print(search.best_params_)  # best_params_: the best parameter combination

Best parameter (CV score=0.9249406085036416):
{'logistic__alpha': 0.0001, 'pca__n_components': 50}
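
Beyond the single best score, `cv_results_` holds the score of every candidate. A minimal sketch of ranking them; to keep it quick it assumes a reduced grid (`small_grid`), `cv=3`, and looser `max_iter`/`tol` than the cells above, so its numbers will not match the output shown:

```python
import numpy as np
import sklearn
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)

# Assumption: choose the loss name by installed version (renamed in 1.1).
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
log_loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

pipe = Pipeline(steps=[
    ("pca", PCA()),
    ("logistic", SGDClassifier(loss=log_loss_name, max_iter=1000,
                               tol=1e-3, random_state=0)),
])

# Hypothetical reduced grid: 2 * 2 = 4 candidates.
small_grid = {"pca__n_components": [20, 40], "logistic__alpha": [1e-4, 1e-2]}
search = GridSearchCV(pipe, small_grid, cv=3)
search.fit(X, y)

# Print candidates from best to worst mean validation score.
results = search.cv_results_
for i in np.argsort(results["rank_test_score"]):
    print(results["params"][i], round(results["mean_test_score"][i], 4))
```

`best_score_` is the top entry of `mean_test_score`, and `best_params_` is the matching entry of `params`.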


### Best model

In [15]:
search.best_estimator_    # the best estimator; not available when refit=False
                          # refit defaults to True: once the best parameters are found, the estimator is fit again on all the data with them

Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=50, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logistic', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='log', max_iter=10000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=0, shuffle=True,
       tol=1e-05, verbose=0, warm_start=False))])
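
With `refit=True`, the fitted search object can be used directly for prediction, since it delegates to `best_estimator_`. A sketch under assumptions not in the notebook: a held-out test split (`train_test_split`), a one-parameter grid, and looser solver settings for speed:

```python
import numpy as np
import sklearn
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)
# Assumption: hold out 25% of the data to score the refit model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Assumption: choose the loss name by installed version (renamed in 1.1).
major, minor = (int(p) for p in sklearn.__version__.split(".")[:2])
log_loss_name = "log_loss" if (major, minor) >= (1, 1) else "log"

pipe = Pipeline(steps=[
    ("pca", PCA()),
    ("logistic", SGDClassifier(loss=log_loss_name, max_iter=1000,
                               tol=1e-3, random_state=0)),
])
search = GridSearchCV(pipe, {"pca__n_components": [20, 40]}, cv=3)
search.fit(X_train, y_train)

# best_estimator_ is a fitted Pipeline carrying the winning parameters.
best = search.best_estimator_
print(best.named_steps["pca"].n_components)

# search.predict delegates to best_estimator_.predict.
same = np.array_equal(search.predict(X_test), best.predict(X_test))
print(same)

# Accuracy on data the search never saw.
test_accuracy = best.score(X_test, y_test)
print(test_accuracy)
```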

# Towards Machine Learning Master!

## [sklearn DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

## [Decision tree](http://www.csie.ntnu.edu.tw/~u91029/Tree.html)

## [sklearn.ensemble: Ensemble Methods (see the `Random*` models)](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)

## [sklearn.ensemble.RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
<font color='red'>oob_score</font> : bool (default=False)

Attributes: <font color='red'>feature_importances_</font> 

<font color='red'>class_weight</font>  : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None). Sets per-class weights; disabled by default.
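
The three highlighted options can be tried together on the digits data. A minimal sketch, assuming 100 trees and `class_weight="balanced"` purely for illustration (the digit classes are already close to balanced):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.load_digits(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,            # score each sample with trees that did not see it
    class_weight="balanced",   # weight classes inversely to their frequency
    random_state=0,
)
forest.fit(X, y)

# Out-of-bag accuracy: a validation estimate without a separate split.
oob = forest.oob_score_
print(oob)

# One importance per input feature (pixel); the values sum to 1.
importances = forest.feature_importances_
print(importances.shape)                   # (64,)
print(importances.reshape(8, 8).round(3))  # viewed as the 8x8 pixel grid
```

Reshaping the importances to 8x8 shows which pixel positions the forest relies on most.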

## [Comparison of Manifold Learning methods](https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py)

## [t-SNE](https://distill.pub/2016/misread-tsne/)