[**sklearn.datasets.load_files**](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files)


sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)

container_folder/

    category_1_folder/
    
        file_1.txt file_2.txt … file_42.txt
        
    category_2_folder/
    
        file_43.txt file_44.txt …
        
The folder names are used as supervised signal label names. The individual file names are not important.

This function does not try to extract features into a numpy array or scipy sparse matrix. In addition, if load_content is false it does not try to load the files in memory.

To use text files in a scikit-learn classification or clustering algorithm, you will need to use the sklearn.feature_extraction.text module to build a feature extraction transformer that suits your problem.

If you set load_content=True, you should also specify the encoding of the text using the ‘encoding’ parameter. For many modern text files, ‘utf-8’ will be the correct encoding. If you leave encoding equal to None, then the content will be made of bytes instead of Unicode, and you will not be able to use most functions in sklearn.feature_extraction.text.

Similar feature extractors should be built for other kinds of unstructured data input such as images, audio, video, …
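As a minimal sketch of the layout described above, the following builds a throwaway two-category folder tree in a temporary directory (the folder names, file names and file contents are illustrative) and loads it with `load_files`. It passes `encoding='utf-8'` so the contents come back as text, and `shuffle=False` only to make the order predictable.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a tiny container folder: one sub-folder per category (names are illustrative).
container = tempfile.mkdtemp()
layout = {
    "category_1_folder": {"file_1.txt": "first text", "file_2.txt": "second text"},
    "category_2_folder": {"file_43.txt": "third text"},
}
for category, files in layout.items():
    os.makedirs(os.path.join(container, category))
    for name, text in files.items():
        with open(os.path.join(container, category, name), "w", encoding="utf-8") as fh:
            fh.write(text)

bunch = load_files(container, encoding="utf-8", shuffle=False)

print(bunch.target_names)   # label names come from the folder names
print(list(bunch.target))   # one integer label per file
print(type(bunch.data[0]))  # text rather than bytes, because an encoding was given
```

In this notebook only `filenames` and `target` are used, since the images are decoded later by Keras rather than by `load_files`.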



[**to_categorical**](https://keras.io/utils/)

keras.utils.to_categorical(y, num_classes=None, dtype='float32')
Converts a class vector (integers) to binary class matrix.

E.g. for use with categorical_crossentropy.

Arguments

>y: class vector to be converted into a matrix (integers from 0 to num_classes - 1).

>num_classes: total number of classes.

>dtype: The data type expected by the input, as a string (float32, float64, int32...)

Returns

A binary matrix representation of the input. The classes axis is placed last.
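To make the conversion concrete, here is a small NumPy sketch of one-hot encoding. It is an illustrative equivalent, not the Keras source; the helper name `one_hot` is made up, and it assumes the labels are integers from 0 to `num_classes - 1`.

```python
import numpy as np

def one_hot(y, num_classes=None, dtype="float32"):
    """Illustrative NumPy equivalent of one-hot encoding (not the Keras source)."""
    y = np.asarray(y, dtype="int64").ravel()
    if num_classes is None:
        num_classes = int(y.max()) + 1  # infer the class count from the largest label
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=dtype)[y]

labels = [0, 2, 1, 2]
encoded = one_hot(labels, num_classes=3)
print(encoded)
print(encoded.argmax(axis=1))  # argmax over the last axis recovers the integer labels
```

The class axis is the last one, which is why `np.argmax(..., axis=1)` is used later in this notebook to turn one-hot targets back into class indices.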



In [None]:
from sklearn.datasets import load_files
from keras.utils import np_utils
import numpy as np
from glob import glob
# Define a function that loads files by subfolder and returns one-hot labels
def load_dataset(path):
    data = load_files(path)
    diseases_files = np.array(data['filenames'])
    diseases_targets = np_utils.to_categorical(np.array(data['target']), 3)
    return diseases_files, diseases_targets

train_files, train_targets = load_dataset('./data/train')
valid_files, valid_targets = load_dataset('./data/valid')
test_files, test_targets = load_dataset('./data/test')

#diseases_names =  [item for item in sorted(glob('./data/train/*/'))]

#print('There are %d total disease categories.' % len(diseases_names))
print('There are %s total disease images.\n' % len(np.hstack([train_files, valid_files, test_files])))
print('There are %d training disease images.' % len(train_files))
print('There are %d validation disease images.' % len(valid_files))
print('There are %d test disease images.'% len(test_files))

In [None]:
diseases_names =  [item[13:-1] for item in sorted(glob('./data/train/*/'))]

In [None]:
diseases_names
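The slice `item[13:-1]` relies on the prefix `./data/train/` being exactly 13 characters long. A sketch that does not depend on the prefix length, assuming the same one-folder-per-class layout (the helper name and the example paths are made up):

```python
import os

def class_names_from_dirs(dir_paths):
    """Return sorted folder names, stripping any trailing separator first."""
    return sorted(os.path.basename(os.path.normpath(p)) for p in dir_paths)

# Hypothetical paths of the kind glob('./data/train/*/') returns.
example_dirs = ["./data/train/class_b/", "./data/train/class_a/"]
print(class_names_from_dirs(example_dirs))
```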

In [None]:
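import random
random.seed(20)

# Shuffle files and targets with the same permutation so labels stay aligned.
def shuffle_together(files, targets):
    order = random.sample(range(len(files)), len(files))
    return files[order], targets[order]
train_files, train_targets = shuffle_together(train_files, train_targets)
valid_files, valid_targets = shuffle_together(valid_files, valid_targets)
test_files, test_targets = shuffle_together(test_files, test_targets)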

In [None]:
train_files

In [None]:
from keras.preprocessing import image
from tqdm import tqdm

def path_to_tensor(img_path):
    img = image.load_img(img_path, target_size = (224, 224))
    x = image.img_to_array(img)
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    list_of_tensors = [path_to_tensor(img_path) for img_path in tqdm(img_paths)]
    return np.vstack(list_of_tensors)
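The shape handling in these two helpers can be checked without any image files. The sketch below uses random arrays as stand-ins for decoded 224x224 RGB images:

```python
import numpy as np

# Stand-ins for three decoded 224x224 RGB images (random data, not real images).
fake_images = [np.random.rand(224, 224, 3) for _ in range(3)]

# What path_to_tensor does to one image: add a leading batch axis.
single = np.expand_dims(fake_images[0], axis=0)
print(single.shape)

# What paths_to_tensor does: stack the per-image tensors along that axis.
batch = np.vstack([np.expand_dims(img, axis=0) for img in fake_images])
print(batch.shape)
```

The result is a 4-D array of shape (number of images, height, width, channels), which is the layout Keras expects with the TensorFlow backend.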

If we look around line 220 (in your case line 201; perhaps you are running a slightly different version), we see that PIL reads the file in blocks and expects each block to be a certain size. It turns out that you can ask PIL to tolerate truncated files (files missing some data from a block) by changing a setting.

Somewhere before your code block, simply add the following:
``` python
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
```

In [None]:
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255

## NB

Image Preprocessing - (https://keras.io/preprocessing/image/)

To use `flow`, the input images must already be tensors (that is, a 4-D array).

`flow`

Returns

An Iterator yielding tuples of (x, y) where x is a numpy array of image data (in the case of a single image input) or a list of numpy arrays (in the case with additional inputs) and y is a numpy array of corresponding labels. If 'sample_weight' is not None, the yielded tuples are of the form  (x, y, sample_weight). If y is None, only the numpy array x is returned.
```python
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
y_train = np_utils.to_categorical(y_train, num_classes)
y_test = np_utils.to_categorical(y_test, num_classes)

datagen = ImageDataGenerator(
    featurewise_center=True,
    featurewise_std_normalization=True,
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True)

# compute quantities required for featurewise normalization
# (std, mean, and principal components if ZCA whitening is applied)
datagen.fit(x_train)

# fits the model on batches with real-time data augmentation:
model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    steps_per_epoch=len(x_train) / 32, epochs=epochs)

# here's a more "manual" example
for e in range(epochs):
    print('Epoch', e)
    batches = 0
    for x_batch, y_batch in datagen.flow(x_train, y_train, batch_size=32):
        model.fit(x_batch, y_batch)
        batches += 1
        if batches >= len(x_train) / 32:
            # we need to break the loop by hand because
            # the generator loops indefinitely
            break
```
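In the example above, `len(x_train) / 32` is generally not an integer. A small sketch of rounding the step count up so the final partial batch is included; the helper name `batches_per_epoch` is illustrative, not a Keras function:

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

print(batches_per_epoch(50000, 32))  # CIFAR-10 has 50000 training images
print(batches_per_epoch(4216, 16))   # training-set size and batch size used later in this notebook
```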
`flow_from_directory`

Returns

A DirectoryIterator yielding tuples of (x, y) where x is a numpy array containing a batch of images with shape  (batch_size, *target_size, channels) and y is a numpy array of corresponding labels.
``` python
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=0.2,
        zoom_range=0.2,
        horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
        'data/train',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
        'data/validation',
        target_size=(150, 150),
        batch_size=32,
        class_mode='binary')

model.fit_generator(
        train_generator,
        steps_per_epoch=2000,
        epochs=50,
        validation_data=validation_generator,
        validation_steps=800)
```

# New dataset is small and similar to the original dataset, or large and similar to the original dataset

## NB 
Keras functional API -(https://keras.io/getting-started/functional-api-guide/)

Flatten-(https://stackoverflow.com/questions/43237124/role-of-flatten-in-keras)

In [None]:
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model 
from keras.layers import Dropout,Conv2D, Flatten, Dense, GlobalAveragePooling2D, Input
from keras import backend as k 
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping

img_width, img_height = 256, 256
train_data_dir = "./data/train"
validation_data_dir = "./data/valid"
nb_train_samples = 4216
nb_validation_samples = 466
batch_size = 16
epochs = 50


### Define your network architecture
model = applications.VGG19(weights = "imagenet", include_top=False, input_shape = (img_width, img_height, 3))

In [None]:
model.summary()

In [None]:
# Freeze the layers which you don't want to train. Here I am freezing the first 5 layers.
for layer in model.layers[:5]:
    layer.trainable = False

#Adding custom Layers 
x = model.output
x = Flatten()(x)
x = Dense(1024, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(1024, activation="relu")(x)
predictions = Dense(3, activation="softmax")(x)

# creating the final model 
model_final = Model(inputs = model.input, outputs = predictions)
model_final.compile(loss = "categorical_crossentropy", optimizer = optimizers.SGD(lr=0.0001, momentum=0.9), metrics=["accuracy"])

## NB
fit_generator-(https://keras.io/models/sequential/)

In [None]:
from keras.preprocessing.image import ImageDataGenerator

# create and configure augmented image generator
# Initiate the train and test generators with data augmentation
train_datagen = ImageDataGenerator(
rescale = 1./255,
horizontal_flip = True,
fill_mode = "nearest",
zoom_range = 0.3,
width_shift_range = 0.3,
height_shift_range=0.3,
rotation_range=30)
train_datagen.fit(train_tensors)

test_datagen = ImageDataGenerator(
rescale = 1./255,
horizontal_flip = True,
fill_mode = "nearest",
zoom_range = 0.3,
width_shift_range = 0.3,
height_shift_range=0.3,
rotation_range=30)

train_generator = train_datagen.flow_from_directory(
train_data_dir,
target_size = (img_height, img_width),
batch_size = batch_size, 
class_mode = "categorical")

validation_generator = test_datagen.flow_from_directory(
validation_data_dir,
target_size = (img_height, img_width),
class_mode = "categorical")

# Save the model according to the conditions  
checkpoint = ModelCheckpoint("vgg19_1.h5", monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', period=1)
early = EarlyStopping(monitor='val_acc', min_delta=0, patience=10, verbose=1, mode='auto')


# Train the model 
model_final.fit_generator(
train_generator,
steps_per_epoch = nb_train_samples // batch_size,
epochs = epochs,
validation_data = validation_generator,
validation_steps = nb_validation_samples // 32,  # validation_generator uses the default batch_size of 32
callbacks = [checkpoint, early])

model_final.load_weights('vgg19_1.h5')
# Get the predicted class index for each image in the test set
# (test_tensors must have the model's input size, 256x256 here)
predictions = [np.argmax(model_final.predict(np.expand_dims(tensor, axis=0))) for tensor in test_tensors]

# Report test accuracy
test_accuracy = 100*np.sum(np.array(predictions)==np.argmax(test_targets, axis=1))/len(predictions)
print('Test accuracy: %.4f%%' % test_accuracy)
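The accuracy calculation can be tried in isolation. The sketch below uses made-up softmax outputs and one-hot targets instead of a trained model; it compares the argmax of each prediction with the argmax of the matching target.

```python
import numpy as np

# Made-up softmax outputs for four images and three classes.
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
# Made-up one-hot targets in the same order as the predictions.
targets = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
])

predicted = probabilities.argmax(axis=1)  # most likely class per image
actual = targets.argmax(axis=1)           # true class per image
accuracy = 100.0 * np.mean(predicted == actual)
print('Test accuracy: %.4f%%' % accuracy)
```

The comparison is only meaningful if predictions and targets are in the same order, so any shuffling has to be applied to the images and their labels together.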

In [None]:

Epoch 1/150
 - 12s - loss: 0.8331 - acc: 0.6860 - val_loss: 1.1215 - val_acc: 0.5200

Epoch 00001: val_loss improved from inf to 1.12152, saving model to saved_models/aug_model.weights.best.hdf5
Epoch 2/150
 - 11s - loss: 0.8345 - acc: 0.6860 - val_loss: 1.0794 - val_acc: 0.5200

Epoch 00002: val_loss improved from 1.12152 to 1.07944, saving model to saved_models/aug_model.weights.best.hdf5
Epoch 3/150
 - 11s - loss: 0.8349 - acc: 0.6860 - val_loss: 1.1079 - val_acc: 0.5200

Epoch 00003: val_loss did not improve from 1.07944
Epoch 4/150
 - 11s - loss: 0.8344 - acc: 0.6860 - val_loss: 1.1303 - val_acc: 0.5200

Epoch 00004: val_loss did not improve from 1.07944
Epoch 5/150
 - 11s - loss: 0.8337 - acc: 0.6860 - val_loss: 1.0520 - val_acc: 0.5200

Epoch 00005: val_loss improved from 1.07944 to 1.05203, saving model to saved_models/aug_model.weights.best.hdf5
Epoch 6/150
 - 12s - loss: 0.8352 - acc: 0.6860 - val_loss: 1.0960 - val_acc: 0.5200

Epoch 00006: val_loss did not improve from 1.05203
Epoch 7/150
 - 11s - loss: 0.8348 - acc: 0.6860 - val_loss: 1.1271 - val_acc: 0.5200

Epoch 00007: val_loss did not improve from 1.05203
Epoch 8/150
 - 11s - loss: 0.8341 - acc: 0.6860 - val_loss: 1.0703 - val_acc: 0.5200

Epoch 00008: val_loss did not improve from 1.05203
Epoch 9/150
 - 11s - loss: 0.8348 - acc: 0.6860 - val_loss: 1.0834 - val_acc: 0.5200

Epoch 00009: val_loss did not improve from 1.05203
Epoch 10/150
 - 11s - loss: 0.8333 - acc: 0.6860 - val_loss: 1.0778 - val_acc: 0.5200

Epoch 00010: val_loss did not improve from 1.05203
Epoch 11/150
 - 11s - loss: 0.8334 - acc: 0.6860 - val_loss: 1.0826 - val_acc: 0.5200

Epoch 00011: val_loss did not improve from 1.05203
Epoch 12/150
 - 11s - loss: 0.8330 - acc: 0.6860 - val_loss: 1.1381 - val_acc: 0.5200

Epoch 00012: val_loss did not improve from 1.05203
Epoch 13/150
 - 11s - loss: 0.8350 - acc: 0.6860 - val_loss: 1.0885 - val_acc: 0.5200

Epoch 00013: val_loss did not improve from 1.05203
Epoch 14/150
 - 11s - loss: 0.8323 - acc: 0.6860 - val_loss: 1.1161 - val_acc: 0.5200

Epoch 00014: val_loss did not improve from 1.05203
Epoch 15/150
 - 11s - loss: 0.8339 - acc: 0.6860 - val_loss: 1.1278 - val_acc: 0.5200

Epoch 00015: val_loss did not improve from 1.05203
Epoch 00015: early stopping

Test accuracy: 65.5000%
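The stop at epoch 15 is consistent with a patience of 10 counted from the last improvement at epoch 5. The sketch below replays the `val_loss` values from the log through a simplified patience counter. It assumes this run monitored `val_loss` with `min_delta=0`, which is what the log messages suggest; it is not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss values copied from the log above.
val_losses = [1.1215, 1.0794, 1.1079, 1.1303, 1.0520, 1.0960, 1.1271, 1.0703,
              1.0834, 1.0778, 1.0826, 1.1381, 1.0885, 1.1161, 1.1278]
print(early_stopping_epoch(val_losses, patience=10))
```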

# New dataset is small but very different from the original dataset

## NB
attrs - (http://www.attrs.org/en/stable/examples.html)

In [25]:
from keras import applications
from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
from keras.models import Sequential, Model 
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D, Input, Conv2D, MaxPooling2D
from keras import backend as k 
from keras.callbacks import ModelCheckpoint, LearningRateScheduler, TensorBoard, EarlyStopping
img_width, img_height = 256, 256
train_data_dir = "./data/train"
validation_data_dir = "./data/valid"
nb_train_samples = 4216
nb_validation_samples = 466
batch_size = 16
epochs = 50


### Build the network 
img_input = Input(shape=(256, 256, 3))
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv1')(img_input)
x = Conv2D(64, (3, 3), activation='relu', padding='same', name='block1_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block1_pool')(x)

# Block 2
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv1')(x)
x = Conv2D(128, (3, 3), activation='relu', padding='same', name='block2_conv2')(x)
x = MaxPooling2D((2, 2), strides=(2, 2), name='block2_pool')(x)


model = Model(inputs = img_input, outputs = x)

model.summary()

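The parameter counts reported by `model.summary()` for this two-block network can be reproduced by hand. A short sketch, assuming standard 3x3 convolutions with one bias per filter (the helper name is illustrative):

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a standard 2-D convolution."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

layers = [
    ("block1_conv1", conv2d_params(3, 3, 3, 64)),
    ("block1_conv2", conv2d_params(3, 3, 64, 64)),
    ("block2_conv1", conv2d_params(3, 3, 64, 128)),
    ("block2_conv2", conv2d_params(3, 3, 128, 128)),
]
for name, count in layers:
    print(name, count)
print("total", sum(count for _, count in layers))
```

The pooling and input layers contribute no parameters, so the total comes from the four convolution layers alone.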



In [8]:

"""
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 256, 256, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 256, 256, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 256, 256, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 128, 128, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 128, 128, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 128, 128, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 64, 64, 128)       0         
=================================================================
Total params: 260,160.0
Trainable params: 260,160.0
Non-trainable params: 0.0
"""

layer_dict = dict([(layer.name, layer) for layer in model.layers])
#[layer.name for layer in model.layers]
"""
['input_1',
 'block1_conv1',
 'block1_conv2',
 'block1_pool',
 'block2_conv1',
 'block2_conv2',
 'block2_pool']
"""

import h5py
weights_path = 'vgg19_weights.h5' # (https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg19_weights_tf_dim_ordering_tf_kernels.h5)
f = h5py.File(weights_path, 'r')

list(f["model_weights"].keys())
"""
['block1_conv1',
 'block1_conv2',
 'block1_pool',
 'block2_conv1',
 'block2_conv2',
 'block2_pool',
 'block3_conv1',
 'block3_conv2',
 'block3_conv3',
 'block3_conv4',
 'block3_pool',
 'block4_conv1',
 'block4_conv2',
 'block4_conv3',
 'block4_conv4',
 'block4_pool',
 'block5_conv1',
 'block5_conv2',
 'block5_conv3',
 'block5_conv4',
 'block5_pool',
 'dense_1',
 'dense_2',
 'dense_3',
 'dropout_1',
 'global_average_pooling2d_1',
 'input_1']
"""

# list all the layer names which are in the model.
layer_names = [layer.name for layer in model.layers]


"""
# Here we extract the weights for each layer from the .h5 file.
>>> f["model_weights"]["block1_conv1"].attrs["weight_names"]
array([b'block1_conv1/kernel:0', b'block1_conv1/bias:0'], 
      dtype='|S21')
# This array is assigned to weight_names below.
>>> f["model_weights"]["block1_conv1"]["block1_conv1/kernel:0"]
<HDF5 dataset "kernel:0": shape (3, 3, 3, 64), type "<f4">
# The list comprehension (weights) collects the kernel and bias datasets of the layer.
>>> layer_names.index("block1_conv1")
1
>>> model.layers[1].set_weights(weights)
# This sets the weights for that particular layer.
# With a for loop we can call set_weights for the entire network.
"""
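# Copy weights into every layer of `model` that also exists in the .h5 file.
for i in layer_names:
    if i not in f["model_weights"]:
        continue  # e.g. an input layer whose auto-generated name differs
    weight_names = f["model_weights"][i].attrs["weight_names"]
    weights = [f["model_weights"][i][j] for j in weight_names]
    index = layer_names.index(i)
    model.layers[index].set_weights(weights)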

'''
import cv2
import numpy as np
import pandas as pd
from tqdm import tqdm
import itertools
import glob

features = []
for i in tqdm(files_location):
        im = cv2.imread(i)
        im = cv2.resize(cv2.cvtColor(im, cv2.COLOR_BGR2RGB), (256, 256)).astype(np.float32) / 255.0
        im = np.expand_dims(im, axis =0)
        outcome = model_final.predict(im)
        features.append(outcome)

'''

for layer in model.layers[:2]:
    layer.trainable = False
    
x = model.output
x = Flatten()(x)
x = Dense(1024, activation="relu", name='dense_1')(x)
x = Dropout(0.5, name='dropout_1')(x)
x = Dense(1024, activation="relu", name='dense_2')(x)
predictions = Dense(3, activation="softmax", name='dense_3')(x)

model_final = Model(inputs = model.input, outputs = predictions)

model_final.summary()
# compile the model 
model_final.compile(loss = "categorical_crossentropy", optimizer = optimizers.SGD(lr=0.0001, momentum=0.9), metrics=["accuracy"])

# Initiate the train and test generators with data augmentation
train_datagen = ImageDataGenerator(
rescale = 1./255,
horizontal_flip = True,
fill_mode = "nearest",
zoom_range = 0.3,
width_shift_range = 0.3,
height_shift_range=0.3,
rotation_range=30)

test_datagen = ImageDataGenerator(
rescale = 1./255,
horizontal_flip = True,
fill_mode = "nearest",
zoom_range = 0.3,
width_shift_range = 0.3,
height_shift_range=0.3,
rotation_range=30)

train_generator = train_datagen.flow_from_directory(
train_data_dir,
target_size = (img_height, img_width),
batch_size = batch_size, 
class_mode = "categorical")

validation_generator = test_datagen.flow_from_directory(
validation_data_dir,
target_size = (img_height, img_width),
class_mode = "categorical")

# Save the model according to the conditions  
checkpoint = ModelCheckpoint("vgg16_1.h5", monitor='val_acc', verbose=1, save_best_only=True, save_weights_only=False, mode='auto', period=1)
early = EarlyStopping(monitor='val_acc', min_delta=0, patience=10, verbose=1, mode='auto')


# Train the model 
model_final.fit_generator(
train_generator,
steps_per_epoch = nb_train_samples // batch_size,
epochs = epochs,
validation_data = validation_generator,
validation_steps = nb_validation_samples // 32,  # validation_generator uses the default batch_size of 32
callbacks = [checkpoint, early])


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_7 (InputLayer)         (None, 256, 256, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 256, 256, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 256, 256, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 128, 128, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 128, 128, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 128, 128, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 64, 64, 128)       0         
__________



TypeError: unhashable type: 'slice'