# Day 3: Mixtures of Experts

This notebook demonstrates how a mixture of experts can be used to boost the performance of an image classifier.

The objective of this lab is to classify images from CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) into one of ten classes: {0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck}


![Cifar10](cifar10_resize.png )

A gating function is trained to route each example to one of two experts, which are trained separately: one expert classifies images within the "natural image" category (e.g. cat, dog) and the other classifies images within the "artificial image" category (e.g. airplane, automobile). The experts are then used to boost the performance of a baseline architecture that classifies each image into one of the 10 classes.

The mixture is built in the following order:
1. A single model is trained to classify all 10 classes. (This is included in the mixture, and is also our evaluation benchmark.)

2. An expert gating function is trained to recognise whether an image is of an artificial or natural subject.

3. An artificial expert is trained to classify artificial objects that have a label in {0, 1, 8, 9}.

4. A natural expert is trained to classify natural objects that have a label in {2, 3, 4, 5, 6, 7}.

5. A gating function is trained to determine the contribution of the experts and the contribution of the baseline architecture to the final output.

6. The mixture is built as illustrated in the figure below.


![](moe_architecture_illus.png)
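
The routing described above can be sketched in plain NumPy. This is only an illustration, not the Keras implementation used later in the notebook: the function name `mix_outputs` and the toy arrays are made up for the example. For each image, the expert gate picks the artificial or the natural expert, and a second gate supplies per-class weights that blend the baseline output with the chosen expert's output.

```python
import numpy as np

def mix_outputs(base, gate0, art, nat, art_gate, nat_gate):
    """Blend baseline and expert predictions for a batch (illustrative helper).

    base, art, nat:      (batch, classes) class scores
    gate0:               (batch, 2) scores for [artificial, natural]
    art_gate, nat_gate:  (batch, classes, 2) weights for [baseline, expert]
    """
    art_mix = base * art_gate[:, :, 0] + art * art_gate[:, :, 1]
    nat_mix = base * nat_gate[:, :, 0] + nat * nat_gate[:, :, 1]
    use_art = gate0[:, 0] > gate0[:, 1]          # one decision per image
    return np.where(use_art[:, None], art_mix, nat_mix)

# Toy batch: 2 images, 3 classes (hypothetical numbers, not real predictions)
base = np.array([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
art = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]])
nat = np.array([[0.0, 0.2, 0.8], [0.0, 0.1, 0.9]])
gate0 = np.array([[0.7, 0.3], [0.1, 0.9]])
half = np.full((2, 3, 2), 0.5)                   # equal weight for baseline and expert

mixed = mix_outputs(base, gate0, art, nat, half, half)
print(mixed)
```

The first image is routed to the artificial expert and the second to the natural expert; in both cases the result is the average of the baseline and the chosen expert because the toy gate weights are all 0.5.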

## Import Prerequisites

In [None]:
import keras
from keras.models import Model
from keras.layers import Dense, Dropout, Flatten, BatchNormalization, Input
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import concatenate, Lambda, Reshape
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.optimizers import Adam
import keras.backend as K
from keras.datasets import cifar10
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model

import numpy as np
import os
import copy
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sn
import pydot
from IPython.display import SVG



In [None]:
# Parameters (not to be changed)
orig_classes = 10 ; gate0_classes = 2

## Mixture Parameters

You can try changing the mixture parameters in the following piece of code. When doing so, consider the following:

1 - Increasing the number of epochs increases the fit to the training data; at some point this should cause over-fitting. Conversely, setting it too low should cause under-fitting.

2 - Increasing the number of training examples gives the models more varied data to learn features from, which usually improves generalization.

3 - Using a large model for the different classifiers increases their capacity to learn. This increases the number of epochs required for training, and also increases the risk of over-fitting.

By changing the parameters, the performance of the mixture of experts and the baseline classifier should change accordingly.

In [None]:
# Number of training/testing examples per batch
batch_size = 50

# Training epochs. A higher number of epochs corresponds to "more fitting to training data"
epochs = 1

# Number of training/testing examples to use
train_examples = 5000 # Max is 50000
test_examples = 1000   # Max is 10000

# Large/small model flags. Set to True to change a classifier to "large"
use_large_experts = False
use_large_gating_mlp = False
use_large_baseline_classifier = False
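
To get a feel for how these settings interact, the number of weight updates performed for one model can be worked out directly. The sketch below is plain arithmetic; the helper name `updates_per_run` is made up for illustration, and it assumes that the final partial batch counts as one update and that early stopping does not cut training short.

```python
import math

def updates_per_run(num_examples, batch_size, epochs):
    """Gradient updates in one training run; a smaller last batch counts as one step."""
    return epochs * math.ceil(num_examples / batch_size)

default_updates = updates_per_run(5000, 50, 1)    # the defaults in the cell above
full_updates = updates_per_run(50000, 50, 1)      # the whole CIFAR-10 training set
print(default_updates, full_updates)  # 100 1000
```

With the defaults the baseline receives only 100 updates, so a single epoch is likely to under-fit. The experts train on subsets of the data and therefore receive fewer updates per epoch.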

In [None]:
# delete previous model checkpoints
import shutil
shutil.rmtree('gate0Cifar10', ignore_errors=True)
shutil.rmtree('moe3Cifar10', ignore_errors=True)
shutil.rmtree('natureCifar10', ignore_errors=True)
shutil.rmtree('baseCifar10', ignore_errors=True)
shutil.rmtree('artCifar10', ignore_errors=True)

# get the newest model file within a directory
def getNewestModel(model, dirname):
    from glob import glob
    target = os.path.join(dirname, '*')
    files = [(f, os.path.getmtime(f)) for f in glob(target)]
    if len(files) == 0:
        return model
    else:
        newestModel = max(files, key=lambda f: f[1])
        model.load_weights(newestModel[0])
        return model
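
The file-selection part of `getNewestModel` can be tried without Keras. The following is a minimal sketch using only the standard library; `newest_file` is a hypothetical helper, and the file names are placeholders rather than real checkpoints.

```python
import os
import tempfile
from glob import glob

def newest_file(dirname):
    """Return the most recently modified path in dirname, or None if it is empty."""
    paths = glob(os.path.join(dirname, '*'))
    return max(paths, key=os.path.getmtime) if paths else None

with tempfile.TemporaryDirectory() as tmp:
    empty_result = newest_file(tmp)               # nothing saved yet
    for i, name in enumerate(['epoch01.hdf5', 'epoch02.hdf5']):
        path = os.path.join(tmp, name)
        open(path, 'w').close()
        # Set explicit modification times so the ordering is unambiguous
        os.utime(path, (1000 + i, 1000 + i))
    newest_name = os.path.basename(newest_file(tmp))

print(empty_result, newest_name)
```

Because checkpoints are written with `save_best_only=True`, the newest file in a directory is also the best one seen so far in that run, which is why loading by modification time is sufficient here.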

## Prepare datasets

In [None]:
# load dataset  ; X: input images,  Y: class label ground truth
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train[:train_examples] ; x_test = x_test[:test_examples]
y_train = y_train[:train_examples] ; y_test = y_test[:test_examples]


In [None]:
# prepare x dataset
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

In [None]:
# Convert class vectors to binary class matrices
y_train0 = keras.utils.to_categorical(y_train, orig_classes)
y_test0 = keras.utils.to_categorical(y_test, orig_classes)

print("y train0:{0}\ny test0:{1}".format(y_train0.shape, y_test0.shape))
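
For readers unfamiliar with one-hot encoding, the conversion can be reproduced in a few lines of NumPy. This is a sketch of the idea only, not the Keras implementation; `one_hot` is an illustrative name and the labels are toy values shaped like the CIFAR-10 label array (one column).

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels of shape (n, 1) or (n,) into an (n, num_classes) matrix."""
    labels = np.asarray(labels).ravel()
    return np.eye(num_classes, dtype='float32')[labels]

labels = np.array([[3], [0], [9]])    # toy labels
encoded = one_hot(labels, 10)
print(encoded.shape)
print(encoded.argmax(axis=1))         # recovers the original labels
```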

## Define architectures

In [None]:
# input layer
cifarInput = Input(shape=(x_train.shape[1:]), name="input")

In [None]:
# Small VGG-like model
def simpleVGG(cifarInput, num_classes, name="vgg"):
    name = [name+str(i) for i in range(12)]
    
    # convolution and max pooling layers
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[0])(cifarInput)
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[1])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[2])(vgg)
    vgg = Dropout(0.25, name=name[3])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[4])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[5])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[6])(vgg)
    vgg = Dropout(0.25, name=name[7])(vgg)

    # classification layers
    vgg = Flatten(name=name[8])(vgg)
    vgg = Dense(512, activation='relu', name=name[9])(vgg)
    vgg = Dropout(0.5, name=name[10])(vgg)
    vgg = Dense(num_classes, activation='softmax', name=name[11])(vgg)
    return vgg

In [None]:
# Large VGG-like model
def fatVGG(cifarInput, num_classes, name="vgg"):
    name = [name+str(i) for i in range(17)]
    
    # convolution and max pooling layers
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[0])(cifarInput)
    vgg = Conv2D(32, (3, 3), padding='same', activation='relu', name=name[1])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[2])(vgg)
    vgg = Dropout(0.25, name=name[3])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[4])(vgg)
    vgg = Conv2D(64, (3, 3), padding='same', activation='relu', name=name[5])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[6])(vgg)
    vgg = Dropout(0.25, name=name[7])(vgg)
    vgg = Conv2D(128, (3, 3), padding='same', activation='relu', name=name[8])(vgg)
    vgg = Conv2D(128, (3, 3), padding='same', activation='relu', name=name[9])(vgg)
    vgg = Conv2D(128, (3, 3), padding='same', activation='relu', name=name[10])(vgg)
    vgg = MaxPooling2D(pool_size=(2,2), name=name[11])(vgg)
    vgg = Dropout(0.25, name=name[12])(vgg)

    # classification layers
    vgg = Flatten(name=name[13])(vgg)
    vgg = Dense(512, activation='relu', name=name[14])(vgg)
    vgg = Dropout(0.5, name=name[15])(vgg)
    vgg = Dense(num_classes, activation='softmax', name=name[16])(vgg)
    return vgg
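
A quick way to compare the two architectures is to work out how many values reach the `Flatten` layer. The sketch below assumes 32x32 CIFAR-10 inputs, `padding='same'` convolutions (which keep the spatial size) and 2x2 max-pooling with the default stride (which halves it); `flattened_size` is an illustrative helper, not part of the notebook.

```python
def flattened_size(side, block_channels):
    """Size of the flattened feature map after one 2x2 max-pool per conv block."""
    for _ in block_channels:
        side //= 2                      # each pooling layer halves height and width
    return side * side * block_channels[-1]

simple_flat = flattened_size(32, [32, 64])        # simpleVGG: two conv blocks
fat_flat = flattened_size(32, [32, 64, 128])      # fatVGG: three conv blocks
print(simple_flat, fat_flat)  # 4096 2048
```

The large model therefore feeds half as many values into its first dense layer; its extra depth comes from the additional convolutional block rather than from a wider classifier.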

In [None]:
# first gating network, to decide artificial or natural object
if use_large_gating_mlp:
    gate0VGG = fatVGG(cifarInput, gate0_classes, "gate0")
else:
    gate0VGG = simpleVGG(cifarInput, gate0_classes, "gate0")
gate0Model = Model(cifarInput, gate0VGG)

# base VGG
if use_large_baseline_classifier:
    baseVGG = fatVGG(cifarInput, orig_classes, "base")
else:
    baseVGG = simpleVGG(cifarInput, orig_classes, "base") 
baseModel = Model(cifarInput, baseVGG)

# artificial expert VGG
if use_large_experts:
    artificialVGG = fatVGG(cifarInput, orig_classes, "artificial")
else:
    artificialVGG = simpleVGG(cifarInput, orig_classes, "artificial")
artificialModel = Model(cifarInput, artificialVGG)

# natural expert VGG
if use_large_experts:
    naturalVGG = fatVGG(cifarInput, orig_classes, "natural")
else:
    naturalVGG = simpleVGG(cifarInput, orig_classes, "natural")

naturalModel = Model(cifarInput, naturalVGG)

## Train networks

### Train 10-Class Classifier

In [None]:
# compile
baseModel.compile(loss='categorical_crossentropy',
                   optimizer=Adam(),
                   metrics=['accuracy'])

In [None]:
# make saving directory for checkpoints
baseSaveDir = "./baseCifar10/"
if not os.path.isdir(baseSaveDir):
    os.makedirs(baseSaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(baseSaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data from the directory if exists
baseModel = getNewestModel(baseModel, baseSaveDir)
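
The effect of `patience=2` can be illustrated with a simplified, pure-Python version of the stopping rule. This sketch ignores options such as `min_delta`, and the loss values are invented for the example; `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training stops, or None if it runs to the end."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0        # improvement: remember it and reset the counter
        else:
            wait += 1                   # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

stopped = stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.7])
short_run = stopping_epoch([1.0])
print(stopped, short_run)  # 4 None
```

In the first run the loss fails to improve on 0.8 for two consecutive epochs, so training stops at epoch 4 and never reaches the better fifth epoch. With the default `epochs = 1`, early stopping can never trigger.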

In [None]:
# train
baseModel.fit(x_train, y_train0,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_test, y_test0),
               callbacks=[es_cb,cp_cb])

In [None]:
# evaluate
baseModel = getNewestModel(baseModel, baseSaveDir)
baseScore = baseModel.evaluate(x_test, y_test0)
print(baseScore)

### Train 2-Class Natural/Artificial Classifier

The expert gating model determines whether an image is "natural" or "artificial".

In [None]:
# Make ground truth for the expert gate: 0 = artificial (labels 0, 1, 8, 9), 1 = natural
y_trainG0 = np.array([0 if i in [0,1,8,9] else 1 for i in y_train])
y_testG0 = np.array([0 if i in [0,1,8,9] else 1 for i in y_test])

y_trainG0 = keras.utils.to_categorical(y_trainG0, 2)
y_testG0  = keras.utils.to_categorical(y_testG0, 2)

print("y trainG0:{0}\ny testG0:{1}".format(y_trainG0.shape, y_testG0.shape))
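
The same grouping can be written in vectorised form. The sketch below is an alternative to the list comprehension above, shown on toy labels; `group_labels` and `ARTIFICIAL` are illustrative names.

```python
import numpy as np

ARTIFICIAL = [0, 1, 8, 9]   # airplane, automobile, ship, truck

def group_labels(y):
    """Map CIFAR-10 labels to 0 (artificial) or 1 (natural)."""
    y = np.asarray(y).ravel()
    return np.where(np.isin(y, ARTIFICIAL), 0, 1)

y = np.array([[0], [3], [8], [6]])   # toy labels with the same (n, 1) shape as y_train
groups = group_labels(y)
print(groups)  # [0 1 0 1]
```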

In [None]:
# compile
gate0Model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(),
                   metrics=['accuracy'])

In [None]:
# make saving directory for check point
gate0SaveDir = "./gate0Cifar10/"
if not os.path.isdir(gate0SaveDir):
    os.makedirs(gate0SaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(gate0SaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data from the directory if exists
gate0Model = getNewestModel(gate0Model, gate0SaveDir)

In [None]:
# train
gate0Model.fit(x_train, y_trainG0,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_test, y_testG0),
               callbacks=[es_cb,cp_cb])

In [None]:
# evaluate
gate0Model = getNewestModel(gate0Model, gate0SaveDir)
gate0Score = gate0Model.evaluate(x_test, y_testG0)
print(gate0Score)

## Train "Natural" and "Artificial" Experts
<br>
The expert networks are specialized in predicting a subset of the classes.<br>
Each network is trained only on images from its own group: the artificial expert on labels 0, 1, 8 and 9; the natural expert on labels 2, 3, 4, 5, 6 and 7.

In [None]:
# get the position of artificial images and natural images in training and test dataset
artTrain = [i for i in range(len(y_train)) if y_train[i] in [0,1,8,9]]
natureTrain = [i for i in range(len(y_train)) if y_train[i] in [2,3,4,5,6,7]]
artTest = [i for i in range(len(y_test)) if y_test[i] in [0,1,8,9]]
natureTest = [i for i in range(len(y_test)) if y_test[i] in [2,3,4,5,6,7]]

In [None]:
# get artificial dataset
x_trainArt = x_train[artTrain]
x_testArt = x_test[artTest]
y_trainArt = y_train[artTrain]
y_testArt = y_test[artTest]

### Artificial expert network

In [None]:
# one-hot encode the artificial labels (all 10 classes are kept)
y_trainArt = keras.utils.to_categorical(y_trainArt, orig_classes)
y_testArt = keras.utils.to_categorical(y_testArt, orig_classes)

print("y train art:{0}\ny test art:{1}".format(y_trainArt.shape, y_testArt.shape))
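
Note that the expert targets are still encoded over all 10 classes, even though only four of them occur in the artificial subset. A small NumPy sketch on toy labels shows which output columns actually receive training signal:

```python
import numpy as np

y = np.array([0, 3, 8, 9, 5, 1])              # toy labels, not real data
art_mask = np.isin(y, [0, 1, 8, 9])           # True for artificial classes
y_art_onehot = np.eye(10)[y[art_mask]]        # one-hot over all 10 classes

used_columns = np.flatnonzero(y_art_onehot.sum(axis=0))
print(y_art_onehot.shape, used_columns)
```

Keeping the 10-way output means the expert's prediction has the same shape as the baseline's, so the two can later be blended class by class.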

In [None]:
# compile
artificialModel.compile(loss='categorical_crossentropy',
                        optimizer=Adam(),
                        metrics=['accuracy'])

In [None]:
# make saving directory for check point
artSaveDir = "./artCifar10/"
if not os.path.isdir(artSaveDir):
    os.makedirs(artSaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(artSaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data if exists
artificialModel = getNewestModel(artificialModel, artSaveDir)

In [None]:
# train
artificialModel.fit(x_trainArt, y_trainArt,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_testArt, y_testArt),
               callbacks=[es_cb,cp_cb])

In [None]:
# evaluate
artificialModel = getNewestModel(artificialModel, artSaveDir)
artScore = artificialModel.evaluate(x_testArt, y_testArt)
print(artScore)

### Natural expert network

In [None]:
# get natural dataset
x_trainNat = x_train[natureTrain]
x_testNat = x_test[natureTest]
y_trainNat = y_train[natureTrain]
y_testNat = y_test[natureTest]

In [None]:
# one-hot encode the natural labels (all 10 classes are kept)
y_trainNat = keras.utils.to_categorical(y_trainNat, orig_classes)
y_testNat = keras.utils.to_categorical(y_testNat, orig_classes)

print("y train nature:{0}\ny test nature:{1}".format(y_trainNat.shape, y_testNat.shape))

In [None]:
# compile
naturalModel.compile(loss='categorical_crossentropy',
                   optimizer=Adam(),
                   metrics=['accuracy'])

In [None]:
# make saving directory for check point
natSaveDir = "./natureCifar10/"
if not os.path.isdir(natSaveDir):
    os.makedirs(natSaveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(natSaveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data if exists
naturalModel = getNewestModel(naturalModel, natSaveDir)

In [None]:
# train
naturalModel.fit(x_trainNat, y_trainNat,
               batch_size=batch_size,
               epochs=epochs,
               validation_data=(x_testNat, y_testNat),
               callbacks=[es_cb,cp_cb])

In [None]:
# evaluate
naturalModel = getNewestModel(naturalModel, natSaveDir)
natScore = naturalModel.evaluate(x_testNat, y_testNat)
print(natScore)

### Freeze the weights of all models trained so far (i.e. the baseline, the experts, and the expert gating model)


In [None]:
for l in baseModel.layers:
    l.trainable = False
for l in gate0Model.layers:
    l.trainable = False
for l in artificialModel.layers:
    l.trainable = False
for l in naturalModel.layers:
    l.trainable = False

## Connecting the overall networks to form the mixture of experts model

In [None]:
# define the sub-gate network (the second gating layer): for every class it outputs one weight per expert
def subGate(cifarInput, orig_classes, numExperts, name="subGate"):
    name = [name+str(i) for i in range(5)]
    subgate = Flatten(name=name[0])(cifarInput)
    subgate = Dense(512, activation='relu', name=name[1])(subgate)
    subgate = Dropout(0.5, name=name[2])(subgate)
    subgate = Dense(orig_classes*numExperts, activation='softmax', name=name[3])(subgate)
    subgate = Reshape((orig_classes, numExperts), name=name[4])(subgate)
    return subgate
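
One detail of `subGate` is worth noticing: the softmax is applied to all `orig_classes*numExperts` units before the reshape, so the weights are normalised jointly rather than per class. The NumPy sketch below contrasts this with a per-class softmax. The per-class variant is an assumption shown for comparison only, not something the notebook does, and the logits are random.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 10 * 2))          # one image, 10 classes x 2 experts

# As in subGate: softmax over all 20 units, then reshape
joint = softmax(logits).reshape(1, 10, 2)

# Alternative for comparison: reshape first, then softmax over the expert axis
per_class = softmax(logits.reshape(1, 10, 2), axis=-1)

print(joint.sum(axis=-1))       # per-class weight pairs; all 20 weights share a total of 1
print(per_class.sum(axis=-1))   # each class's two weights sum to 1
```

With joint normalisation, the two weights of any single class add up to less than 1, so the blended scores generally do not sum to 1 across classes.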

In [None]:
# the artificial gating network
artGate = subGate(cifarInput, orig_classes, 2, "artExpertGate")

# the natural gating network
natureGate = subGate(cifarInput, orig_classes, 2, "natureExpertGate")

In [None]:
# Combine the base VGG output with one expert's output inside a Keras Lambda layer.
# subgate[:, :, 0] holds the per-class importance of the base VGG and subgate[:, :, 1] that of the expert;
# the result is the weighted sum of the two predictions.
def subGateLambda(base, expert, subgate):
    output = Lambda(lambda gx: (gx[0]*gx[2][:,:,0]) + (gx[1]*gx[2][:,:,1]), output_shape=(orig_classes,))([base, expert, subgate])
    return output

In [None]:
# connect all the networks.
# K.switch uses the first gating network to choose, per example, between the artificial branch
# (gate output 0 is larger) and the natural branch
output = Lambda(lambda gx: K.switch(gx[1][:,0] > gx[1][:,1], 
                                    subGateLambda(gx[0], gx[2], gx[4]), 
                                    subGateLambda(gx[0], gx[3], gx[5])), 
                output_shape=(orig_classes,))([baseVGG, gate0VGG, artificialVGG, naturalVGG, artGate, natureGate])

In [None]:
# the mixture of experts model
model = Model(cifarInput, output)

In [None]:
# compile
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])

In [None]:
# show each layer and whether it is trainable
# only the second gating network layers and the last Lambda inference layer are left trainable
for l in model.layers:
    print(l, l.trainable)

In [None]:
# make saving directory for check point
saveDir = "./moe3Cifar10/"
if not os.path.isdir(saveDir):
    os.makedirs(saveDir)
    
# early stopping and model checkpoint
es_cb = EarlyStopping(monitor='val_loss', patience=2, verbose=1, mode='auto')
chkpt = os.path.join(saveDir, 'Cifar10_.{epoch:02d}-{val_loss:.2f}.hdf5')
cp_cb = ModelCheckpoint(filepath = chkpt, monitor='val_loss', verbose=1, save_best_only=True, mode='auto')

# load the newest model data if exists
model = getNewestModel(model, saveDir)

In [None]:
# train
model.fit(x_train, y_train0,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test0),
          callbacks=[es_cb, cp_cb])

In [None]:
# evaluate
mixture_loss_accuracy = model.evaluate(x_test, y_test0)
print(mixture_loss_accuracy)