# Machine Learning Theory and Practice: Exercise 4

## Implementing a Pipeline (machine learning workflow)

<img src="pics/pipeline.png" width="80%">

> [picture source](https://blog.csdn.net/lanchunhui/article/details/50521648)

![p](https://img-blog.csdn.net/20160115095855517)

## [Machine Learning: Pipeline in sklearn](https://hk.saowen.com/a/673021f529c2dcc9ff6802c40290787d2d55032049ccc6b6606b1a52dbb3f1d8)

### Load data

In [2]:
from sklearn import datasets

In [3]:
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

### Model and pipeline

In [4]:
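from sklearn.decomposition import PCA  # principal component analysis; if this dimensionality reduction works poorly, try another method
from sklearn.linear_model import SGDClassifier  # linear classifier trained with stochastic gradient descent
from sklearn.pipeline import Pipeline  # chains the machine learning steps into one estimator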

In [6]:
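# Initialize the steps and combine them into one model
# Note: loss='log' (logistic regression) is spelled 'log_loss' from scikit-learn 1.1 onward
logistic = SGDClassifier(loss='log', penalty='l2', max_iter=10000, tol=1e-5, random_state=0)
pca = PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])  # the steps, run in order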
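To make concrete what the `Pipeline` does, here is a minimal sketch (not part of the original notebook) that fits the same two steps once through a `Pipeline` and once by hand, then compares the predictions. The step name `'clf'`, the fixed `n_components=30`, and the seeds are illustrative assumptions. The `loss` argument is left at its default (hinge loss) only because the spelling of the logistic loss differs between scikit-learn versions.

```python
from sklearn import datasets
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)

# Fixed seeds so that both routes below are deterministic (assumed values).
pca = PCA(n_components=30, random_state=0)
clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)  # default hinge loss

# Route 1: let the Pipeline run both steps.
pipe = Pipeline(steps=[('pca', clone(pca)), ('clf', clone(clf))])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Route 2: run the same steps by hand, in the same order.
manual_pca = clone(pca)
manual_clf = clone(clf)
X_reduced = manual_pca.fit_transform(X)   # fit PCA and project the data
manual_clf.fit(X_reduced, y)              # fit the classifier on the projection
manual_pred = manual_clf.predict(manual_pca.transform(X))

# Both routes should give the same predictions.
print((pipe_pred == manual_pred).all())
```

The practical gain of the `Pipeline` is that the whole chain can be passed to tools such as `GridSearchCV` as one estimator, so the PCA step is refit inside every cross-validation fold.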

## [sklearn.pipeline.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

### Grid search + CV

In [7]:
import numpy as np
from sklearn.model_selection import GridSearchCV

In [9]:
param_grid = {
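    'pca__n_components': [5, 20, 30, 40, 50, 64],   # keys must match the step names above, joined to the parameter name by two underscores
    'logistic__alpha': np.logspace(-4, 4, 5),  # 5 values evenly spaced on a log scale, from 1e-4 to 1e4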
}

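# Note: the iid argument was removed in scikit-learn 0.24; omit it on newer versions
search = GridSearchCV(pipe, param_grid, iid=False, cv=5, n_jobs=-1,  # cv=5: 5-fold cross-validation splits of the data
                      return_train_score=False)  # exhaustive search: every parameter combination is tried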
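To get a feel for how much work the exhaustive search implies, the grid can be enumerated with `ParameterGrid` before fitting anything. This is a small sketch that repeats the grid from the cell above; the variable names `n_candidates`, `n_folds`, and `n_fits` are only for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': np.logspace(-4, 4, 5),
}

grid = ParameterGrid(param_grid)
n_candidates = len(grid)          # 6 values x 5 values
n_folds = 5                       # matches cv=5
n_fits = n_candidates * n_folds   # fits during the search, before the final refit

print(n_candidates)  # 30
print(n_fits)        # 150

# Each candidate is a plain dict of 'step__parameter' keys.
first_candidate = list(grid)[0]
print(sorted(first_candidate))
```

With `refit=True` (the default), one more fit on the full data follows the search, using the best combination.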

### Training

In [10]:
search.fit(X_digits, y_digits)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logistic', SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=...dom_state=0, shuffle=True, tol=1e-05,
       validation_fraction=0.1, verbose=0, warm_start=False))]),
       fit_params=None, iid=False, n_jobs=-1,
       param_grid={'pca__n_components': [5, 20, 30, 40, 50, 64], 'logistic__alpha': array([1.e-04, 1.e-02, 1.e+00, 1.e+02, 1.e+04])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
       scoring=None, verbose=0)

### Performance

In [11]:
print("Best parameter (CV score={}):".format(search.best_score_))  # best_score_: the best mean cross-validated score
print(search.best_params_)  # best_params_: the best parameter combination

Best parameter (CV score=0.9175648341973272):
{'logistic__alpha': 1.0, 'pca__n_components': 30}
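Beyond the single best score, `cv_results_` holds the score of every candidate, which helps judge how close the runners-up were. The sketch below is an assumption-laden miniature of the search above: a reduced 2 x 2 grid, `cv=3`, a step named `'clf'`, and the default hinge loss are all chosen here only so that it runs quickly on any recent scikit-learn version.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)

pipe = Pipeline(steps=[
    ('pca', PCA(random_state=0)),
    ('clf', SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)),
])
# Deliberately small grid (illustrative values, not the notebook's grid).
small_grid = {
    'pca__n_components': [20, 40],
    'clf__alpha': [1e-4, 1e-2],
}
search = GridSearchCV(pipe, small_grid, cv=3)
search.fit(X, y)

results = search.cv_results_
# List every candidate from best rank to worst.
for i in np.argsort(results['rank_test_score']):
    print(results['rank_test_score'][i],
          round(results['mean_test_score'][i], 4),
          results['params'][i])
```

`best_score_` is simply the highest entry of `mean_test_score`, and `best_params_` is the matching entry of `params`.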


### Best model

In [12]:
search.best_estimator_

Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=30, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('logistic', SGDClassifier(alpha=1.0, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15,...dom_state=0, shuffle=True, tol=1e-05,
       validation_fraction=0.1, verbose=0, warm_start=False))])
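The CV score above is computed on the same data used to choose the parameters, so a held-out test set gives a more honest estimate. The following sketch is not from the original notebook: the 75/25 split, the reduced grid, the step name `'clf'`, and the default hinge loss are all illustrative assumptions.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = datasets.load_digits(return_X_y=True)
# Hold out 25% of the data; the search never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipe = Pipeline(steps=[
    ('pca', PCA(random_state=0)),
    ('clf', SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)),
])
small_grid = {
    'pca__n_components': [20, 40],
    'clf__alpha': [1e-4, 1e-2],
}
search = GridSearchCV(pipe, small_grid, cv=3)
search.fit(X_train, y_train)

# best_estimator_ is the whole pipeline, refit on all of X_train.
best_pipe = search.best_estimator_
test_pred = best_pipe.predict(X_test)
test_accuracy = best_pipe.score(X_test, y_test)

print(search.best_params_)
print(test_accuracy)
```

Calling `search.predict(X_test)` gives the same result, since the fitted search delegates to `best_estimator_`.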

# Towards Machine Learning Master!

## [sklearn DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

## [Decision tree](http://www.csie.ntnu.edu.tw/~u91029/Tree.html)

## [sklearn.ensemble: Ensemble Methods (see the random forest models)](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)

## [sklearn.ensemble.RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
<font color='red'>oob_score</font> : bool (default=False)

Attributes: <font color='red'>feature_importances_</font> 

<font color='red'>class_weight</font>  : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None). Class weight settings; off by default.
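The three highlighted items can be tried together in a short sketch. It reuses the digits data from the pipeline exercise; `n_estimators=100` and `random_state=0` are assumptions made here for reproducibility, not recommended settings.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

X, y = datasets.load_digits(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,            # score each sample with the trees that did not see it
    class_weight='balanced',   # weight classes inversely to their frequency
    random_state=0,
)
forest.fit(X, y)

# Out-of-bag accuracy: a validation-style estimate without a separate split.
print(forest.oob_score_)

# feature_importances_ has one value per input feature (64 pixels here).
importances = forest.feature_importances_
top5 = np.argsort(importances)[::-1][:5]
for idx in top5:
    print(idx, round(importances[idx], 4))
```

`oob_score_` only exists after fitting with `oob_score=True`, and the importances are normalized so that they sum to 1.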

## [Comparison of Manifold Learning methods](https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html#sphx-glr-auto-examples-manifold-plot-compare-methods-py)

## [t-SNE](https://distill.pub/2016/misread-tsne/)