# C9: Use Keras Models With Scikit-Learn For General Machine Learning

sklearn provides many tools that support model building. A Keras model can be wrapped as a sklearn model and then used with those tools. This section covers:

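* How to wrap a Keras model as a sklearn model so that sklearn's tools can be used with it
* How to evaluate a Keras model with sklearn's cross-validation tools
* How to tune the hyperparameters of a Keras model with sklearn's grid search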


## 9.1 Overview
...

## 9.2 Wrapping a Keras model as a sklearn model

In sklearn, a model is typically used in the following pattern:

```
model = XXXClassifier()  # or model = XXXRegressor()
model.fit(X, Y)
y = model.predict(X)
```

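keras.wrappers.scikit_learn provides two objects, KerasClassifier() and KerasRegressor(), that wrap a Keras model as a sklearn model. Taking KerasClassifier as an example, passing in the function that creates the Keras model together with the training parameters is enough to create a sklearn model. See the [official documentation](https://keras.io/scikit-learn-api/).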
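
To make the wrapping idea concrete, here is a minimal sketch in plain Python. It is not the real KerasClassifier implementation: `BuildFnClassifier` and `MajorityModel` are hypothetical names invented for this illustration, and the toy model stands in for a compiled Keras network. The point is only that the wrapper stores a model-building function plus training arguments, and builds a fresh model each time fit() is called.

```python
class BuildFnClassifier:
    """Hypothetical wrapper sketch; NOT the real KerasClassifier."""

    def __init__(self, build_fn, **fit_params):
        self.build_fn = build_fn      # callable that returns a fresh, untrained model
        self.fit_params = fit_params  # training arguments forwarded to model.fit
        self.model_ = None

    def fit(self, X, y):
        # Build a new model on every fit so repeated fits start from scratch.
        self.model_ = self.build_fn()
        self.model_.fit(X, y, **self.fit_params)
        return self

    def predict(self, X):
        if self.model_ is None:
            raise RuntimeError("call fit() before predict()")
        return self.model_.predict(X)


class MajorityModel:
    """Toy model (an assumption for this sketch) that predicts the most common label."""

    def fit(self, X, y, epochs=1):
        self.epochs_seen = epochs
        self.label = max(set(y), key=list(y).count)

    def predict(self, X):
        return [self.label for _ in X]


clf = BuildFnClassifier(build_fn=MajorityModel, epochs=3)
clf.fit([[0], [1], [2]], [1, 1, 0])
print(clf.predict([[5], [6]]))  # [1, 1]
```

Because the wrapper exposes fit() and predict(), it follows the same usage pattern as the sklearn models shown earlier.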


In [1]:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier

def model_create():
    # model creator
    model = Sequential()
    model.add(Dense(12, input_dim=8, init='uniform', activation='relu'))
    model.add(Dense(8, init='uniform', activation='relu'))
    model.add(Dense(1, init='uniform', activation='sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# wrap Keras model into a sklearn model
skl_model = KerasClassifier(build_fn=model_create, nb_epoch=150, batch_size=10, verbose=0)


Using Theano backend.


## 9.3 Using sklearn's cross-validation tools

The following uses two sklearn tools, cross_val_score() and StratifiedKFold(), to cross-validate skl_model.

In [2]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
import numpy as np

# set random seed
seed = 7
np.random.seed(seed)

# load data
dataset = np.loadtxt('./data_set/pima-indians-diabetes.data', delimiter=',')
X = dataset[:, 0:8]
Y = dataset[:, 8]

# evaluate model by cross validation
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(skl_model, X, Y, cv=kfold)

print(results)
print(np.mean(results))

[ 0.77922079  0.79220779  0.76623378  0.77922077  0.75324676  0.74025974
  0.77922078  0.71428571  0.71052632  0.75      ]
0.756442244065
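
As a rough illustration of what "stratified" means here, the sketch below splits a toy label list into folds that each keep the overall class ratio. It is a simplified stand-in written for these notes, not sklearn's actual StratifiedKFold algorithm; the function name `stratified_folds` and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed):
    """Return n_splits lists of test indices with roughly equal class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


labels = [0] * 20 + [1] * 10  # toy labels with a 2:1 class imbalance
folds = stratified_folds(labels, n_splits=5, seed=7)
for test_idx in folds:
    positives = sum(labels[i] for i in test_idx)
    print(len(test_idx), positives)  # each fold: 6 samples, 2 of them positive
```

Each fold serves once as the test set while the remaining folds are used for training, which is how the ten scores above are produced with n_splits=10.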


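Use the model to predict. Although cross-validation was run above, the model still has to be trained before it can predict, because cross_val_score fits clones of the estimator and leaves skl_model itself unfitted.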

In [3]:
skl_model.fit(X, Y)
y = skl_model.predict(X[0:1,:])
print(y)

[[1]]


## 9.4 Using sklearn's GridSearchCV to select the best model

sklearn's GridSearchCV tool selects the best hyperparameters and produces the best model at the same time. See the [official documentation](http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html).

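     Step 1: create a model object; its parameters can be arbitrary (grid search adjusts them later)
     Step 2: define the hyperparameter grid
     Step 3: create the grid object and fit it
     Step 4: inspect the grid search results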
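
The four steps can be sketched without any library as an exhaustive loop over parameter combinations. This is only an illustration of the idea, not how GridSearchCV is implemented: `toy_cv_score` is a made-up scoring function standing in for "mean cross-validated accuracy", and the parameter names `lr` and `units` are hypothetical.

```python
from itertools import product


def toy_cv_score(params):
    # Hypothetical stand-in for a cross-validated score; peaks at lr=0.1, units=8.
    return 1.0 - abs(params['lr'] - 0.1) - 0.01 * abs(params['units'] - 8)


# Step 2: define the hyperparameter grid
grid = {'lr': [0.01, 0.1, 1.0], 'units': [4, 8, 16]}

# Step 3: try every combination and keep the best one
names = sorted(grid)
best_score, best_params = float('-inf'), None
for values in product(*(grid[name] for name in names)):
    params = dict(zip(names, values))
    score = toy_cv_score(params)
    if score > best_score:
        best_score, best_params = score, params

# Step 4: inspect the result
print(best_params, best_score)  # {'lr': 0.1, 'units': 8} 1.0
```

In the real tool, the scoring call is replaced by fitting and evaluating the model on each cross-validation fold, which is why the search gets expensive quickly.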


In [4]:
from sklearn.model_selection import GridSearchCV
import time

def create_keras_model(optimizer='adam', init='uniform'):
    model = Sequential()
    model.add(Dense(12, input_dim=8, init=init, activation='relu'))
    model.add(Dense(8, init=init, activation='relu'))
    model.add(Dense(1, init=init, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# create a model whose hyperparameters will be tuned by grid search
model = KerasClassifier(build_fn=create_keras_model, verbose=0)

# set the hyperparameter grid
hypermeters_grid = {
    'optimizer': ['adam', 'rmsprop'],
    'init': ['glorot_uniform', 'normal', 'uniform'],
    'nb_epoch': [10, 100],
    'batch_size': [5, 10],
}

start_time = time.time()
# create a grid and fit, using 5 fold cross validation
grid = GridSearchCV(estimator=model, param_grid=hypermeters_grid, cv=5, verbose=100)
grid.fit(X, Y)
end_time = time.time()
print('total time: %s' % (end_time - start_time))

# show grid search result
print("Best: %f using %s" % (grid.best_score_, grid.best_params_))
for params, mean_score, scores in grid.grid_scores_:
    print("%f (%f) with: %r" % (scores.mean(), scores.std(), params))


Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV] init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5 ..
[CV]  init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5, score=0.675325 -   0.3s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.3s remaining:    0.0s
[CV] init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5 ..
[CV]  init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5, score=0.642857 -   0.3s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.5s remaining:    0.0s
[CV] init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5 ..
[CV]  init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5, score=0.694805 -   0.2s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    6.4s remaining:    0.0s
[CV] init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5 ..
[CV]  init=glorot_uniform, optimizer=adam, nb_epoch=10, batch_size=5, score=0.725490 -   0.2s
[Parallel(n_jobs=1)]: Done   4 out of 
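
The "24 candidates, totalling 120 fits" in the first line of the log follows directly from the grid size. A quick check, reusing the grid from the cell above (n_folds is a name introduced here for the cv=5 setting):

```python
# Same grid as passed to GridSearchCV above.
hypermeters_grid = {
    'optimizer': ['adam', 'rmsprop'],
    'init': ['glorot_uniform', 'normal', 'uniform'],
    'nb_epoch': [10, 100],
    'batch_size': [5, 10],
}
n_folds = 5  # matches cv=5

# The number of candidates is the product of the number of values per parameter.
n_candidates = 1
for values in hypermeters_grid.values():
    n_candidates *= len(values)

total_fits = n_candidates * n_folds
print(n_candidates, total_fits)  # 24 120
```

Adding one more value to any parameter multiplies the number of fits, so it is worth doing this arithmetic before starting a long search.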



An experiment: the grid search above took the following time in different environments:

* pc(i7 6700k, gtx1080)
    
        tensorflow, cpu, n_jobs=1, 2159s
        theano, cpu, n_jobs=1, 198s
        theano, gpu, n_jobs=1, 348s

* mac(i7 2.4GHz)

        theano, cpu, n_jobs=1, 400s
        tensorflow, cpu, n_jobs=1, infinite
    
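The GPU cannot be used under TensorFlow yet in this setup. As the numbers show, without distributed computation the GPU still does not beat the CPU here. TensorFlow is also much slower than Theano; I cannot explain why yet.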

n_jobs apparently has to be 1; other values cause problems, which seems to be a Keras bug.