
Cross Validation in Keras #1711

Closed
mahdiman opened this issue Feb 13, 2016 · 21 comments

@mahdiman

Is there any built-in feature in Keras that allows me to do cross-validation, or do I have to do it myself?

@jerheff
Contributor

jerheff commented Feb 13, 2016

There is built-in support to hold out a percentage of the data as a validation set (the validation_split param on fit). My understanding is that most people do not do true k-fold cross-validation due to the computational overhead of building k models.
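For illustration, a minimal sketch of that built-in hold-out split; the network and the dummy arrays below are placeholders, not from this thread:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Dummy data standing in for a real dataset (placeholder shapes).
x = np.random.random((1000, 20))
y = np.random.randint(2, size=(1000, 1))

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# validation_split holds out the last 20% of the data as the validation set;
# note that this slice is taken before any shuffling.
model.fit(x, y, epochs=10, batch_size=32, validation_split=0.2)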

@audiofeature

This is really handy to use just before calling Keras' model.compile(), fit() and predict() functions:

from sklearn.cross_validation import StratifiedKFold

@KeironO

KeironO commented Feb 18, 2016

Hey @mahdiman,

Here is a simplified example of how to perform k-fold CV in Keras using sklearn.

from sklearn.cross_validation import StratifiedKFold

def load_data():
    # load your data using this function
    pass

def create_model():
    # create your model using this function
    pass

def train_and_evaluate_model(model, data_train, labels_train, data_test, labels_test):
    # fit and evaluate here, e.g. model.fit(...) followed by model.evaluate(...)
    pass

if __name__ == "__main__":
    n_folds = 10
    data, labels, header_info = load_data()
    skf = StratifiedKFold(labels, n_folds=n_folds, shuffle=True)

    for i, (train, test) in enumerate(skf):
        print "Running Fold", i + 1, "/", n_folds
        model = None  # clearing the NN
        model = create_model()
        train_and_evaluate_model(model, data[train], labels[train], data[test], labels[test])
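Note for later readers: sklearn.cross_validation has since been removed from scikit-learn. With the current sklearn.model_selection API the same loop looks roughly like this (load_data, create_model and train_and_evaluate_model are the placeholder functions from the example above):

from sklearn.model_selection import StratifiedKFold

n_folds = 10
data, labels, header_info = load_data()
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)

for i, (train, test) in enumerate(skf.split(data, labels)):
    print("Running Fold", i + 1, "/", n_folds)
    # Re-create the model so each fold starts from fresh weights.
    model = create_model()
    train_and_evaluate_model(model, data[train], labels[train], data[test], labels[test])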

@mahdiman
Author

Thanks guys :)

@zhipeng-fan

Is there any way to perform cross-validation when the label is a 2D image instead of a plain 1D label? I mean when the label is not a 1D vector of shape (n,).

@KeironO

KeironO commented Apr 10, 2017

@zhipeng-fan - It's not massively clear what exactly you're describing here.

Can you provide an example?

@olix20

olix20 commented Apr 20, 2017

You may want to check this for an actual example:
https://www.kaggle.com/zfturbo/the-nature-conservancy-fisheries-monitoring/fishy-keras-lb-1-25267

Basically, save the model after each fold and average the individual models' predictions at test time.
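A rough sketch of that idea, assuming one Keras model has been saved per fold; the file name pattern and x_test are assumptions, not taken from the linked kernel:

import numpy as np
from keras.models import load_model

n_folds = 10
# Assumed file names: one saved model per fold, e.g. written by ModelCheckpoint.
fold_files = ["model_fold_%d.h5" % i for i in range(n_folds)]

preds = []
for path in fold_files:
    model = load_model(path)
    preds.append(model.predict(x_test))  # x_test is the held-out test set (assumed)

# Average the per-fold predictions to get the ensemble prediction.
ensemble_pred = np.mean(preds, axis=0)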

@atharvap

atharvap commented Jul 10, 2017

@zhipeng-fan sklearn's StratifiedKFold works only on 1D labels:
stratified means the training set and the test set contain approximately the same proportion of each class, and sklearn cannot do that when the labels are not a 1D class vector.
Try plain KFold instead:
kf = KFold(n_samples, n_folds=n_folds, shuffle=True)  # n_samples = number of train + test points (old sklearn API)
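For later readers, a minimal sketch of the same idea with the current sklearn.model_selection API; data and labels are assumed NumPy arrays, and the labels may be 2D images, since KFold only splits on row indices and never looks at the labels:

from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in kf.split(data):
    x_train, y_train = data[train_idx], labels[train_idx]  # labels may have shape (n, H, W), etc.
    x_test, y_test = data[test_idx], labels[test_idx]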

@ndrmahmoudi

ndrmahmoudi commented Aug 29, 2017

Hi all, @KeironO thanks for your helpful comment. I have one more question and would be very happy if you could answer it as well. I am a bit confused about the validation and test datasets. Will we use data[test] as the validation dataset? Is that enough to show the performance of the model? What should I do to test the model? After splitting the full dataset into 10 folds (using StratifiedKFold), I split the 9 training folds (the train index in your example) into a training group and a validation group, and keep the last fold for testing. You may see what I have done below:

skf = StratifiedKFold(n_splits=10, shuffle=True)
splitted_indices = skf.split(np.zeros(data.shape[0]), labels)
for train, test in splitted_indices:
    train_data = data[train]
    train_labels = labels[train]
    x_train = train_data[:int(-0.1*len(train))]   # first 90% of the training folds
    y_train = train_labels[:int(-0.1*len(train))]
    x_val = train_data[int(-0.1*len(train)):]     # last 10% kept for validation
    y_val = train_labels[int(-0.1*len(train)):]
    x_test = data[test]
    y_test = labels[test]
    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, batch_size=500, callbacks=[early_stop, checkpoint])
    y_pred = loaded_model.predict(x_test, verbose=0)
    # loaded_model is the best model saved to disk during training,
    # selected on the validation dataset by the checkpoint callback.

So, I basically report the performance on the test set using y_pred. Can you please let me know if I am doing something wrong?

Regards,
Nader

@hitzkrieg

Hi @KeironO.
In my application I had done cross-validation just as you described.
However, I had not included the 'model = None' line.
Is there a possibility that the model for my new cross-validation fold is inheriting trained weights from the previous fold?

@oxydron

oxydron commented Dec 20, 2017

There is one problem though: when you train using ImageDataGenerator, it flows images into memory batch by batch (useful for big datasets).

In this case, how do I do cross-validation? Do I need to create 10 folders with the K subsets myself?

@tharuniitk

Hi @KeironO, the comment thread was quite useful. I am working on a regression problem, so the labels are continuous values. How do I perform cross-validation in Keras in that case? If I use StratifiedKFold it raises the following error:

Traceback (most recent call last):
  File "keras_workshop/per_subject_testing.py", line 360, in
    for train, test in kfold.split(X, y1_s1):
  File "/home/tharun/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 332, in split
    for train, test in super(_BaseKFold, self).split(X, y, groups):
  File "/home/tharun/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 95, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/home/tharun/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 634, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
  File "/home/tharun/anaconda2/lib/python2.7/site-packages/sklearn/model_selection/_split.py", line 589, in _make_test_folds
    allowed_target_types, type_of_target_y))
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

@audiofeature

@tharuniitk As you can't do stratification on a regression problem's ground truth, you'll probably have to use plain KFold, as outlined earlier above.
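A minimal sketch of that suggestion, reusing X and y1_s1 from the traceback above and the create_model() placeholder from earlier (assumptions, not a tested snippet):

from sklearn.model_selection import KFold

# KFold never looks at the targets, so continuous values in y1_s1 are fine.
kf = KFold(n_splits=10, shuffle=True)
for train, test in kf.split(X):
    model = create_model()  # fresh weights for every fold
    model.fit(X[train], y1_s1[train],
              validation_data=(X[test], y1_s1[test]),
              epochs=50, batch_size=32)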

@audiofeature

@hitzkrieg Yes, the model inherits all trained weights from the previous fold if it is not re-initialized! Be careful here, otherwise your cross-validation is useless! It all depends on what the create_model() function does. If you re-create the model, overwriting the model variable with a fresh initialization in each fold, you are fine. There is no way to re-initialize a model in place in Keras, so you have to re-create it in each fold; otherwise you keep training the same model, which means your cross-validation is meaningless.
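In code, assuming create_model() builds and compiles a fresh network and folds is a list of (train, test) index pairs as in the examples above, the difference looks like this:

# Wrong: one model shared across folds keeps the weights learned on earlier
# folds, so it has effectively seen every fold's test data during training.
model = create_model()
for train, test in folds:
    model.fit(data[train], labels[train])

# Right: fresh, randomly initialized weights for every fold.
for train, test in folds:
    model = create_model()
    model.fit(data[train], labels[train])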


@tharuniitk

Thank you @audiofeature . KFold is working!!

@abdo-br

abdo-br commented Oct 10, 2018

@KeironO do we have to create the model again for each fold?

@zhipeng-fan

zhipeng-fan commented Oct 12, 2018 via email

Hi! Yes, absolutely. Otherwise the model actually sees all of the data, so even a good result would not prove anything.

@abdo-br

abdo-br commented Oct 13, 2018

@zhipeng-fan I thought fitting the model on new data was enough, but I see now, thanks :)

@audiofeature

audiofeature commented Dec 19, 2018 via email

@AddASecond

But the question still remains: how do you save the model according to the cross-validation loss? Keras' model saving (the ModelCheckpoint callback) only monitors "val_loss", without considering which fold it is.
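One possible workaround (a sketch, not from this thread; data, labels and create_model() are the placeholders used earlier) is to give each fold its own checkpoint file, so "val_loss" is still what ModelCheckpoint monitors, but separately per fold:

from keras.callbacks import ModelCheckpoint
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True)
for i, (train, test) in enumerate(skf.split(data, labels)):
    model = create_model()
    # One checkpoint file per fold; "val_loss" here is the loss on this fold's
    # held-out split, so each fold keeps its own best model on disk.
    checkpoint = ModelCheckpoint("model_fold_%d.h5" % i, monitor="val_loss",
                                 save_best_only=True)
    model.fit(data[train], labels[train],
              validation_data=(data[test], labels[test]),
              epochs=100, callbacks=[checkpoint])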
