[**sklearn.datasets.load_files**](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files)


sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)

container_folder/

    category_1_folder/
    
        file_1.txt file_2.txt … file_42.txt
        
    category_2_folder/
    
        file_43.txt file_44.txt …
        
The folder names are used as supervised signal label names. The individual file names are not important.

This function does not try to extract features into a numpy array or scipy sparse matrix. In addition, if load_content is false it does not try to load the files in memory.

To use text files in a scikit-learn classification or clustering algorithm, you will need to use the sklearn.feature_extraction.text module to build a feature extraction transformer that suits your problem.

If you set load_content=True, you should also specify the encoding of the text using the ‘encoding’ parameter. For many modern text files, ‘utf-8’ will be the correct encoding. If you leave encoding equal to None, then the content will be made of bytes instead of Unicode, and you will not be able to use most functions in sklearn.feature_extraction.text.

Similar feature extractors should be built for other kinds of unstructured input, such as images, audio, and video.
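To make the folder-to-label mapping concrete, here is a minimal standard-library sketch of the idea. `scan_container` is a hypothetical helper written for illustration, not scikit-learn's implementation; it assumes the two-level layout shown above and assigns each file the index of its parent folder in the sorted list of folder names.

```python
import os
import tempfile


def scan_container(container_path):
    """Hypothetical helper: label each file with the index of its parent folder."""
    # Category names are the sorted sub-folder names of the container.
    target_names = sorted(
        d for d in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, d))
    )
    filenames, target = [], []
    for label, name in enumerate(target_names):
        folder = os.path.join(container_path, name)
        for fname in sorted(os.listdir(folder)):
            filenames.append(os.path.join(folder, fname))
            target.append(label)
    return filenames, target, target_names


# Build a tiny throwaway tree to demonstrate the layout (placeholder files only).
with tempfile.TemporaryDirectory() as root:
    layout = {"nevus": ["a.txt"], "melanoma": ["b.txt", "c.txt"]}
    for category, files in layout.items():
        os.makedirs(os.path.join(root, category))
        for fname in files:
            with open(os.path.join(root, category, fname), "w") as fh:
                fh.write("placeholder")
    filenames, target, target_names = scan_container(root)

print(target_names)
print(target)
```

The file names themselves play no role in the labels; only the parent folder does.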



[**to_categorical**](https://keras.io/utils/)

keras.utils.to_categorical(y, num_classes=None, dtype='float32')

Converts a class vector (integers) to a binary class matrix.

E.g. for use with categorical_crossentropy.

Arguments

>y: class vector to be converted into a matrix (integers from 0 to num_classes - 1).

>num_classes: total number of classes.

>dtype: The data type expected by the input, as a string (float32, float64, int32...)

Returns

A binary matrix representation of the input. The classes axis is placed last.
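The conversion described above can be sketched in plain NumPy. `one_hot` below is a hypothetical stand-in that mirrors the documented behaviour (integer labels in, binary matrix out, classes on the last axis); it is not the Keras source.

```python
import numpy as np


def one_hot(y, num_classes=None, dtype="float32"):
    """Hypothetical NumPy stand-in for the conversion described above."""
    y = np.asarray(y, dtype="int64").ravel()
    if num_classes is None:
        # Infer the class count from the largest label.
        num_classes = int(y.max()) + 1
    out = np.zeros((y.shape[0], num_classes), dtype=dtype)
    # Set a single 1 per row, in the column given by the label.
    out[np.arange(y.shape[0]), y] = 1
    return out


labels = [0, 2, 1, 2]
encoded = one_hot(labels, num_classes=3)
print(encoded)
```

Taking `np.argmax(encoded, axis=1)` recovers the original integer labels, which is how the test-accuracy cells later in these notes compare predictions with targets.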



In [4]:
from sklearn.datasets import load_files
from keras.utils import np_utils
import numpy as np
from glob import glob
# Load files according to their sub-folders and one-hot encode the labels
def load_dataset(path):
    data = load_files(path)
    diseases_files = np.array(data['filenames'])
    diseases_targets = np_utils.to_categorical(np.array(data['target']), 3)
    return diseases_files, diseases_targets

train_files, train_targets = load_dataset('./data/train')
valid_files, valid_targets = load_dataset('./data/valid')
test_files, test_targets = load_dataset('./data/test')

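# Category names are the sub-folder names; the slice strips './data/train/' and the trailing '/'
diseases_names = [item[13:-1] for item in sorted(glob('./data/train/*/'))]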

print('There are %d total disease categories.' % len(diseases_names))
print('There are %s total disease images.\n' % len(np.hstack([train_files, valid_files, test_files])))
print('There are %d training disease images.' % len(train_files))
print('There are %d validation disease images.' % len(valid_files))
print('There are %d test disease images.'% len(test_files))

There are 3 total disease categories.
There are 2750 total disease images.

There are 2000 training disease images.
There are 150 validation disease images.
There are 600 test disease images.


In [14]:
diseases_names =  [item[13:-1] for item in sorted(glob('./data/train/*/'))]

In [15]:
diseases_names

['melanoma', 'nevus', 'seborrheic_keratosis']

In [16]:
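import random
random.seed(20)
# NOTE: load_files already shuffles by default (shuffle=True, random_state=0).
# The calls below reorder only the file paths; the *_targets arrays keep their
# previous order, so paths and labels no longer line up after this cell.
random.shuffle(train_files)
random.shuffle(valid_files)
random.shuffle(test_files)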
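Shuffling the paths without applying the same order to the label arrays breaks the pairing between them. One way to keep them aligned, sketched here with small placeholder arrays rather than the real dataset, is to draw a single permutation and index both arrays with it.

```python
import numpy as np

# Placeholder stand-ins for the arrays returned by load_dataset.
files = np.array(["a.jpg", "b.jpg", "c.jpg", "d.jpg"])
targets = np.array(
    [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype="float32"
)
pairs_before = {str(f): t.tolist() for f, t in zip(files, targets)}

# One shared permutation keeps every path next to its own label row.
rng = np.random.RandomState(20)
order = rng.permutation(len(files))
files_shuffled = files[order]
targets_shuffled = targets[order]

pairs_after = {str(f): t.tolist() for f, t in zip(files_shuffled, targets_shuffled)}
print(pairs_before == pairs_after)
```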

In [17]:
train_files

array(['./data/train/melanoma/ISIC_0015110.jpg',
       './data/train/seborrheic_keratosis/ISIC_0014642.jpg',
       './data/train/melanoma/ISIC_0000551.jpg', ...,
       './data/train/nevus/ISIC_0012164.jpg',
       './data/train/nevus/ISIC_0014516.jpg',
       './data/train/melanoma/ISIC_0000517.jpg'], dtype='<U50')

In [23]:
from keras.preprocessing import image
from tqdm import tqdm

def path_to_tensor(img_path):
    img = image.load_img(img_path, target_size = (224, 224))
    x = image.img_to_array(img)
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)
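The shape bookkeeping in these two functions can be checked without loading any images. The sketch below uses a zero-filled placeholder array in place of a decoded 224x224 RGB image; only the NumPy calls are the same as above.

```python
import numpy as np

# Placeholder for one decoded image: height x width x channels.
single = np.zeros((224, 224, 3), dtype="float32")

# expand_dims adds a leading batch axis, giving a 4D tensor with one sample.
batch_of_one = np.expand_dims(single, axis=0)

# vstack concatenates along the first axis, so N single-sample tensors
# become one tensor with N samples.
stacked = np.vstack([batch_of_one] * 5)

print(batch_of_one.shape)
print(stacked.shape)
```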

If we look around line 220 (in your case line 201; perhaps you are running a slightly different version), we see that PIL reads the file in blocks and expects each block to be of a certain size. It turns out that you can ask PIL to be tolerant of truncated files (files that are missing some data from a block) by changing a setting.

Somewhere before your code block, simply add the following:
``` python
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
```

In [24]:
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255

100%|██████████| 2000/2000 [01:45<00:00, 18.88it/s]
100%|██████████| 150/150 [00:14<00:00, 10.25it/s]
100%|██████████| 600/600 [01:26<00:00,  6.95it/s]


In [29]:

from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.layers import Dropout, Flatten, Dense, Activation, BatchNormalization
from keras.models import Sequential

model = Sequential()


### TODO: define your network architecture
model.add(Conv2D(filters=64, kernel_size=(3,3), padding='same', activation='relu', input_shape=(224, 224, 3)))   
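# NOTE: the input is channels-last, so axis=1 is the image height, not the channel axis (that would be axis=-1).
model.add(BatchNormalization(axis = 1 ))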
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(filters=32, kernel_size=(3,3), padding='same', activation='relu'))
model.add(BatchNormalization(axis = 1 ))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))

model.add(Conv2D(filters=16, kernel_size=(3,3), padding='same', activation='relu'))
model.add(BatchNormalization(axis = 1 ))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))

model.add(GlobalAveragePooling2D())
model.add(Dense(3, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 224, 224, 64)      1792      
_________________________________________________________________
batch_normalization_4 (Batch (None, 224, 224, 64)      896       
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 112, 112, 64)      0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 112, 112, 32)      18464     
_________________________________________________________________
batch_normalization_5 (Batch (None, 112, 112, 32)      448       
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 56, 56, 32)        0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 56, 56, 32)        0         
__________
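The parameter counts in the summary above can be reproduced by hand. The sketch below uses two hypothetical helper functions and the standard formulas: a convolution has one weight per kernel position per input channel plus one bias, for each filter, and batch normalization keeps four values (gamma, beta, moving mean, moving variance) per entry along the normalized axis.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def batchnorm_params(axis_size):
    """gamma, beta, moving mean and moving variance per entry on the axis."""
    return 4 * axis_size


conv_1 = conv2d_params(3, 3, 3, 64)    # first Conv2D on a 3-channel input
bn_1 = batchnorm_params(224)           # axis=1 has size 224 at this point
conv_2 = conv2d_params(3, 3, 64, 32)   # second Conv2D on 64 channels
bn_2 = batchnorm_params(112)           # axis=1 has size 112 after pooling

print(conv_1, bn_1, conv_2, bn_2)
```

The batch-normalization counts match the size of axis 1 (224, then 112) rather than the channel counts (64, then 32), which is consistent with `axis = 1` selecting the height axis on channels-last input.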

In [30]:
## Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [32]:
from keras.callbacks import ModelCheckpoint, EarlyStopping



epochs = 150

### Do not modify the code below

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.from_scratch.hdf5',  monitor='val_loss', verbose=1, save_best_only=True, mode='min')
earlystopper = EarlyStopping(monitor='val_loss', patience = 10, verbose = 1)
model.fit(train_tensors, train_targets, 
          validation_data=(valid_tensors, valid_targets),
          epochs=epochs, batch_size=20, callbacks=[checkpointer, earlystopper], verbose=1)

Train on 2000 samples, validate on 150 samples
Epoch 1/150

Epoch 00001: val_loss improved from inf to 1.12568, saving model to saved_models/weights.best.from_scratch.hdf5
Epoch 2/150

Epoch 00002: val_loss improved from 1.12568 to 1.08985, saving model to saved_models/weights.best.from_scratch.hdf5
Epoch 3/150

Epoch 00003: val_loss improved from 1.08985 to 1.04490, saving model to saved_models/weights.best.from_scratch.hdf5
Epoch 4/150

Epoch 00004: val_loss improved from 1.04490 to 1.04097, saving model to saved_models/weights.best.from_scratch.hdf5
Epoch 5/150

Epoch 00005: val_loss did not improve from 1.04097
Epoch 6/150

Epoch 00006: val_loss improved from 1.04097 to 1.03481, saving model to saved_models/weights.best.from_scratch.hdf5
Epoch 7/150

Epoch 00007: val_loss did not improve from 1.03481
Epoch 8/150

Epoch 00008: val_loss did not improve from 1.03481
Epoch 9/150

Epoch 00009: val_loss did not improve from 1.03481
Epoch 10/150

Epoch 00010: val_loss did not improve from

<keras.callbacks.History at 0x7f332474aef0>
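The patience rule used by the early-stopping callback can be sketched in a few lines of plain Python. This is an illustrative re-implementation with a hypothetical function name and made-up loss values, assuming a "lower is better" metric and no minimum-improvement threshold; it is not the Keras source.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch and stop once patience runs out.
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, not taken from the run above.
losses = [1.13, 1.09, 1.04, 1.05, 1.06, 1.07]
print(early_stopping_epoch(losses, patience=3))
print(early_stopping_epoch(losses, patience=10))
```

With `save_best_only=True` on the checkpoint, the weights on disk are those from the best epoch, not from the epoch at which training stopped.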

In [33]:
## Load the model weights with the best validation loss
model.load_weights('saved_models/weights.best.from_scratch.hdf5')

In [35]:
# Get the predicted class index for each image in the test set
predictions = [np.argmax(model.predict(np.expand_dims(tensor, axis=0))) for tensor in test_tensors]

# Report the test accuracy
test_accuracy = 100*np.sum(np.array(predictions)==np.argmax(test_targets, axis=1))/len(predictions)
print('Test accuracy: %.4f%%' % test_accuracy)

Test accuracy: 65.5000%
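The accuracy computation compares the arg-max of each predicted probability row with the arg-max of the matching one-hot target row. A vectorized NumPy sketch of the same arithmetic follows; the probabilities and targets are made-up values for illustration, not outputs of the model above.

```python
import numpy as np

# Hypothetical softmax outputs for four images and three classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
])
# Hypothetical one-hot targets for the same four images.
one_hot_targets = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
])

predicted = np.argmax(probs, axis=1)             # most probable class per row
actual = np.argmax(one_hot_targets, axis=1)      # position of the 1 per row
accuracy = 100.0 * np.mean(predicted == actual)  # share of matches, in percent

print('Test accuracy: %.4f%%' % accuracy)
```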


In [37]:
from keras.preprocessing.image import ImageDataGenerator

# create and configure augmented image generator
datagen_train = ImageDataGenerator(
    width_shift_range=0.5,  # randomly shift images horizontally (up to 50% of total width)
    height_shift_range=0.5,  # randomly shift images vertically (up to 50% of total height)
    horizontal_flip=True) # randomly flip images horizontally

# fit augmented image generator on data
datagen_train.fit(train_tensors)

from keras.callbacks import ModelCheckpoint, EarlyStopping  
earlystopper = EarlyStopping(monitor='val_loss', patience = 10, verbose = 1)
batch_size = 20
epochs = 150

# train the model
checkpointer = ModelCheckpoint(filepath='saved_models/aug_model.weights.best.hdf5', verbose=1, 
                               save_best_only=True)
model.fit_generator(datagen_train.flow(train_tensors, train_targets, batch_size=batch_size),
                    steps_per_epoch=train_tensors.shape[0] // batch_size,
                    epochs=epochs, verbose=2, callbacks=[checkpointer, earlystopper],
                    validation_data=(valid_tensors, valid_targets),
                    validation_steps=valid_tensors.shape[0] // batch_size)
                   

Epoch 1/150
 - 12s - loss: 0.8331 - acc: 0.6860 - val_loss: 1.1215 - val_acc: 0.5200

Epoch 00001: val_loss improved from inf to 1.12152, saving model to saved_models/aug_model.weights.best.hdf5
Epoch 2/150
 - 11s - loss: 0.8345 - acc: 0.6860 - val_loss: 1.0794 - val_acc: 0.5200

Epoch 00002: val_loss improved from 1.12152 to 1.07944, saving model to saved_models/aug_model.weights.best.hdf5
Epoch 3/150
 - 11s - loss: 0.8349 - acc: 0.6860 - val_loss: 1.1079 - val_acc: 0.5200

Epoch 00003: val_loss did not improve from 1.07944
Epoch 4/150
 - 11s - loss: 0.8344 - acc: 0.6860 - val_loss: 1.1303 - val_acc: 0.5200

Epoch 00004: val_loss did not improve from 1.07944
Epoch 5/150
 - 11s - loss: 0.8337 - acc: 0.6860 - val_loss: 1.0520 - val_acc: 0.5200

Epoch 00005: val_loss improved from 1.07944 to 1.05203, saving model to saved_models/aug_model.weights.best.hdf5
Epoch 6/150
 - 12s - loss: 0.8352 - acc: 0.6860 - val_loss: 1.0960 - val_acc: 0.5200

Epoch 00006: val_loss did not improve from 1.05

<keras.callbacks.History at 0x7f332474ae10>
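A few of the numbers behind the augmentation setup can be worked out directly. The sketch below is plain NumPy arithmetic, not the generator's implementation; it assumes that a shift range below 1 is read as a fraction of the image size, and it uses a tiny placeholder array to show what a horizontal flip does.

```python
import numpy as np

# A shift range below 1 is a fraction of the image size.
image_width = 224
width_shift_range = 0.5
max_shift_pixels = width_shift_range * image_width

# Batches drawn per epoch with the settings used above.
num_train = 2000
batch_size = 20
steps_per_epoch = num_train // batch_size

# A horizontal flip reverses the width axis of a height x width x channels array.
tiny = np.arange(6).reshape(2, 3, 1)
flipped = tiny[:, ::-1, :]

print(max_shift_pixels, steps_per_epoch)
print(flipped[:, :, 0])
```

A shift of up to half the image in each direction is a strong augmentation; it can move most of a lesion out of the frame.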

In [38]:
## Load the model weights with the best validation loss
model.load_weights('saved_models/aug_model.weights.best.hdf5')
# Get the predicted class index for each image in the test set
predictions = [np.argmax(model.predict(np.expand_dims(tensor, axis=0))) for tensor in test_tensors]

# Report the test accuracy
test_accuracy = 100*np.sum(np.array(predictions)==np.argmax(test_targets, axis=1))/len(predictions)
print('Test accuracy: %.4f%%' % test_accuracy)

Test accuracy: 65.5000%
