<div width=50% style="display: block; margin: auto">
    <img src="figures/ucl-logo.svg" width=100%>
</div>

### [UCL-ELEC0135 Applied Machine Learning Systems II - 2025]()
University College London
# Lab 2: Mixture of Experts


<hr width=70% style="float: left">

### Introduction

Mixture of experts (MoE) is a machine learning technique where multiple expert Neural Networks are used to divide a problem space into homogeneous regions. In this lab, we will use MoE in a classification task on [Cifar10](https://www.cs.toronto.edu/~kriz/cifar.html).

We will train a gating function to detect whether images are "natural images" (e.g. cat, dog, etc.) or "artificial images" (e.g. plane, car). This gating function will direct the samples to two different "expert" classifiers, one trained to classify within the natural images category, and the other trained to classify within the artificial images category. Finally, these experts will be used to boost the performance of a baseline 10-class classifier.

All models (gating function, experts, and baseline classifier) will be [Convolutional Neural Networks (CNNs)](https://en.wikipedia.org/wiki/Convolutional_neural_network), a variant of the Multi-Layer Perceptron (MLP) seen in lab 1 that uses [Convolutional layers](https://en.wikipedia.org/wiki/Convolutional_layer) and [Pooling layers](https://en.wikipedia.org/wiki/Pooling_layer) to automate feature extraction.

![](figures/moe_architecture_david.png)


### Intended Learning Outcome
* Define and train Convolutional Neural Networks with Tensorflow Keras.
* Use Layers and Wrappers to define and combine custom gate models.
* Compare the performances of a simple CNN classifier with a MoE classifier.


### Outline

This notebook has 5 parts:

0. [Setting up](#0.-Setting-up)
1. [Baseline 10 classes CNN](#1-baseline-10-classes-cnn)
2. [Experts CNNs](#2-experts-cnns)
3. [Gating models](#3-gating-models)
4. [MoE](#4-moe)

<hr width=70% style="float: left">

# 0. Setting up

## 0.1 Importing libraries

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Make sure you are running this Notebook in the kernel of the virtual environment you created for this module.
- Run the following cell. If some packages have not been installed, add them to the requirements.txt file, then, in a terminal where the virtual environment you created is activated, run the command `pip install -r requirements.txt`.

</div>

In [None]:
# TODO: run this cell and add packages as needed
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import os
from sklearn.model_selection import train_test_split # the package is named scikit-learn when installing with pip (in requirements.txt)
from sklearn.metrics import confusion_matrix
import seaborn as sns



## 0.2 Importing the data

The [Cifar10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset consists of 60000 32x32 colour images in 10 classes {0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck}, with 6000 images per class. There are 50000 training images and 10000 test images.

![](figures/cifar10_resize.png)

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Run the cell below to import the CIFAR-10 dataset, select a subset of the training and test sets, and normalize the data.
- Print the number of samples in the training and test sets.
- Display the distribution of labels. Is this dataset balanced?

</div>
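
As a hedged illustration of the last point, the sketch below counts labels with `np.unique`. The `labels` array is a randomly generated stand-in (an assumption for this example), shaped `(N, 1)` like the integer labels returned by `cifar10.load_data()`; it is not the real dataset.

```python
import numpy as np

# Hypothetical stand-in for y_train: integer labels in [0, 9] with shape (N, 1)
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=(5000, 1))

# Count how many samples fall in each class
classes, counts = np.unique(labels, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples ({n / labels.shape[0]:.1%})")
```

A roughly equal count per class suggests a balanced dataset; the same counts can be passed to a bar plot.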

In [None]:
# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Since the CIFAR-10 dataset is quite large, we only select a smaller subset to reduce the computational load required to finish this lab.
train_examples = 5000 # Max is 50000
test_examples = 1000   # Max is 10000

x_train = x_train[:train_examples] ; x_test = x_test[:test_examples]
y_train = y_train[:train_examples] ; y_test = y_test[:test_examples]

# Normalize pixel values to be between 0 and 1 (helps models converge faster)
x_train = x_train / 255.0
x_test = x_test / 255.0

# Convert class vectors to binary class matrices (one-hot encoding)
y_train0 = tf.keras.utils.to_categorical(y_train, 10)
y_test0 = tf.keras.utils.to_categorical(y_test, 10)

print("y train0:{0}\ny test0:{1}".format(y_train0.shape, y_test0.shape))

In [None]:
# TODO: your code here

# 1. Baseline 10 classes CNN

The baseline model and expert models will have the same CNN architecture, defined in the cell below. The large model adds a third convolutional block, so it should perform better but will require more training time.

In [None]:
def CNN(n_classes = 10, large = False):
    """
    Create a Convolutional Neural Network for image classification

    Parameters:
        - n_classes (int): number of classes in the dataset
        - large (bool): whether to use a large or small model

    Returns:
        - model (tf.keras.models.Sequential): CNN model

    """
    if large:
        model = tf.keras.models.Sequential([
            # Convolutions and Pooling layers for feature extraction
            tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu', name='conv1.1', input_shape=(32, 32, 3)),
            tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu', name='conv1.2'),
            tf.keras.layers.MaxPooling2D((2, 2), name='pool1'),
            tf.keras.layers.Dropout(0.25, name='drop1'),
            tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu', name='conv2.1'),
            tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu', name='conv2.2'),
            tf.keras.layers.MaxPooling2D((2, 2), name='pool2'),
            tf.keras.layers.Dropout(0.25, name='drop2'),
            tf.keras.layers.Conv2D(128, (3, 3), padding='same', activation='relu', name='conv3.1'),
            tf.keras.layers.Conv2D(128, (3, 3), padding='same', activation='relu', name='conv3.2'),
            tf.keras.layers.MaxPooling2D((2, 2), name='pool3'),
            tf.keras.layers.Dropout(0.25, name='drop3'),

            # MLP for classification
            tf.keras.layers.Flatten(name='flatten'),
            tf.keras.layers.Dense(512, activation='relu', name='dense1'),
            tf.keras.layers.Dropout(0.5, name='drop4'),
            tf.keras.layers.Dense(n_classes, activation='softmax', name='output')
        ])

    else:
        model = tf.keras.models.Sequential([
            # Convolutions and Pooling layers for feature extraction
            tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu', name='conv1.1', input_shape=(32, 32, 3)),
            tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu', name='conv1.2'),
            tf.keras.layers.MaxPooling2D((2, 2), name='pool1'),
            tf.keras.layers.Dropout(0.25, name='drop1'),
            tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu', name='conv2.1'),
            tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu', name='conv2.2'),
            tf.keras.layers.MaxPooling2D((2, 2), name='pool2'),
            tf.keras.layers.Dropout(0.25, name='drop2'),

            # MLP for classification
            tf.keras.layers.Flatten(name='flatten'),
            tf.keras.layers.Dense(512, activation='relu', name='dense1'),
            tf.keras.layers.Dropout(0.5, name='drop3'),
            tf.keras.layers.Dense(n_classes, activation='softmax', name='output')
        ])

    return model

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Use the function above to create a baseline CNN.
- Compile the model using the [ADAM optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) with a learning rate of 0.00001 and the [categorical crossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy) loss function.
- Use model.summary() to print a summary of your architecture.
- Train the model and display the confusion matrix on the test set. You can use the function [`confusion_matrix`](https://scikit-learn.org/dev/modules/generated/sklearn.metrics.confusion_matrix.html) from sklearn, and [`sns.heatmap`](https://seaborn.pydata.org/generated/seaborn.heatmap.html).

</div>

<div class="alert alert-block alert-info"> 
<b>💡 Tips</b> 

- Go back to lab 1 for a refresher on how to compile and train a model with tensorflow keras.

</div>
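
To make clear what a confusion matrix contains, here is a minimal NumPy sketch on toy data. It is a stand-in for illustration only: the lab suggests sklearn's `confusion_matrix`, and the helper name `confusion_counts` and the toy arrays are assumptions of this example. Both use the same convention (rows are true classes, columns are predicted classes).

```python
import numpy as np

def confusion_counts(y_true_onehot, y_prob, n_classes):
    """Count (true class, predicted class) pairs; rows = true, columns = predicted."""
    true_idx = np.argmax(y_true_onehot, axis=1)  # one-hot labels -> class indices
    pred_idx = np.argmax(y_prob, axis=1)         # softmax outputs -> predicted class
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (true_idx, pred_idx), 1)       # accumulate one count per sample
    return cm

# Toy example with 3 classes and 4 samples (hypothetical values)
y_true = np.eye(3)[[0, 1, 2, 2]]
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.1, 0.2, 0.7]])
cm = confusion_counts(y_true, y_prob, n_classes=3)
print(cm)
```

Note the `argmax` step: since the labels are one-hot encoded and the model outputs probabilities, both need converting to class indices before building the matrix.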

In [None]:
# TODO: your code here

baseline_model = ...

# 2. Experts CNNs

## 2.1 Natural Images Expert CNN

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Use the function `CNN` defined in task 1 to create the Natural Images Expert.
- Compile the model using the [ADAM optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) with a learning rate of 0.00001 and the [categorical crossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy) loss function.
- Use model.summary() to print a summary of your architecture.

</div>

In [None]:
# TODO: your code here

natural_expert_model = ...



<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Create a subset of the training data containing all samples from the "natural" classes (labels 2, 3, 4, 5, 6 and 7).
- Train the expert on the subset.
- Display 2 confusion matrices, one with the prediction results on the test set when only including the "natural" classes, and one on the entire test set.

</div>
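
One possible way to build such a subset is a boolean mask, sketched below on toy arrays. The names `NATURAL_LABELS`, `toy_x` and `toy_y` are assumptions made for this example; `toy_y` mimics the `(N, 1)` shape of the integer CIFAR-10 labels.

```python
import numpy as np

NATURAL_LABELS = [2, 3, 4, 5, 6, 7]  # bird, cat, deer, dog, frog, horse

# Toy stand-ins: 5 integer labels of shape (N, 1) and 5 fake "images"
toy_y = np.array([[0], [3], [9], [6], [2]])
toy_x = np.arange(5 * 2).reshape(5, 2)

# True where the sample belongs to a natural class
mask = np.isin(toy_y.flatten(), NATURAL_LABELS)

x_nat = toy_x[mask]
y_nat = toy_y[mask]
print(mask, y_nat.flatten())
```

The same mask can index the one-hot label array, which keeps its 10 columns, so the expert still has 10 outputs.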

In [None]:
# TODO: your code here

## 2.2 Artificial Images Expert CNN

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Use the function `CNN` defined in task 1 to create the Artificial Images Expert.
- Compile the model using the [ADAM optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) with a learning rate of 0.00001 and the [categorical crossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy) loss function.
- Use model.summary() to print a summary of your architecture.

</div>

In [None]:
# TODO: your code here

artificial_expert_model = ...

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Create a subset of the training data containing all samples from the "artificial" classes (labels 0, 1, 8 and 9).
- Train the expert on the subset.
- Display 2 confusion matrices, one with the prediction results on the test set when only including the "artificial" classes, and one on the entire test set.

</div>

In [None]:
# TODO: your code here

# 3. Gating models

## 3.1 Gating Natural / Artificial images Experts

The first gating model will take as input an image from the CIFAR-10 dataset and decide which expert to use. In other words, it is a binary classifier trained to tell whether an image falls into the "natural" or "artificial" category.

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- For the training and test sets, create new labels that indicate whether the samples are "natural" objects or "artificial" objects. Use one-hot encoding (see code in task 0.2 for reference).
- Use the function `CNN` defined in task 1 to create a gating binary classifier.
- Use model.summary() to print a summary of your architecture.
- Train the model and display the confusion matrix on the test set.


</div>
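
A minimal sketch of the relabelling step on toy labels follows. It assumes index 0 means "artificial" and index 1 means "natural", which matches the condition used later in the MoE (`gx[1][:,0] > gx[1][:,1]` selects the artificial expert). `np.eye` is used here only as a NumPy stand-in for `tf.keras.utils.to_categorical`, and all variable names are hypothetical.

```python
import numpy as np

ARTIFICIAL_LABELS = [0, 1, 8, 9]  # airplane, automobile, ship, truck

# Toy integer labels with the (N, 1) shape of the CIFAR-10 labels
toy_y = np.array([[0], [3], [9], [6]])

# 0 -> artificial, 1 -> natural (assumed ordering, see lead-in)
is_natural = (~np.isin(toy_y.flatten(), ARTIFICIAL_LABELS)).astype(int)

# One-hot encode the binary labels: shape (N, 2)
gate_onehot = np.eye(2)[is_natural]
print(is_natural)
print(gate_onehot)
```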

In [None]:
# TODO: your code here

expert_choice_gate = ...

## 3.2 Gating Baseline / Experts

We now have a baseline classifier (10-class CNN), 2 expert classifiers (10 classes each, but specialised in classifying either the natural or the artificial category), and a gating classifier (2-class CNN) which decides which of the experts to choose for the given input.

We now want a second gate that takes in the CIFAR-10 input and weighs the outputs of the baseline and of the selected expert to produce the final MoE prediction. To do this, the second gate will be composed of 2 sub-gates, one for each expert model.

Each sub-gate will use an importance model that outputs values for i) the baseline and ii) the chosen expert, i.e. what the sub-gate 'thinks' the relative importance of the baseline classifier and of the expert is for each of the 10 classes.

**Importance model:**
1. Flatten Layer 
2. Dense layer with 512 neurons and relu activation function
3. Dropout layer with dropout rate of 0.5
4. Dense layer with 2 * (number of classes) neurons and softmax activation function
5. Reshape layer that reshapes the output to (number of classes, 2)
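
The effect of steps 4 and 5 on shapes can be previewed with NumPy. This is only a sketch on random numbers: the hand-written `softmax` helper and the `logits` array are assumptions of this example, standing in for the Dense and Reshape layers.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_classes = 10
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2 * n_classes))  # batch of 4, 20 units each

# Step 4: softmax over all 20 units; step 5: reshape to (n_classes, 2)
weights = softmax(logits).reshape(-1, n_classes, 2)

baseline_w = weights[:, :, 0]  # importance of the baseline, per class
expert_w = weights[:, :, 1]    # importance of the expert, per class
print(weights.shape, weights.sum(axis=(1, 2)))
```

Because the softmax is taken over all 2 * (number of classes) units before the reshape, the 20 weights of one image sum to 1 jointly rather than per class.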

<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Implement the function `importance_model` to create a model with the architecture described above.
- Create 2 sub gate models, one for the "natural" expert and one for the "artificial" expert.

</div>

In [None]:
def importance_model(n_classes):
    # TODO: your code here
    pass

# TODO: your code here
artificial_importance_model = ...
natural_importance_model = ...

Now we can implement the subgate as a model that:

- Takes as argument gx, which is a list of 3 tensors:
    - gx[0] -> baseline output tensor (10,): the softmax output for each of the 10 classes.
    - gx[1] -> expert network output tensor (10,): the softmax output for each of the 10 classes. Which expert's output reaches here depends on the binary classifier output of the first gate (we will implement the logic of choosing the expert when we integrate all the models together, see below).
    - gx[2] -> corresponding importance model output tensor (10,2,), holding the importance of each model:
        - gx[2][:,:,0] -> baseline importance tensor of shape (10,), i.e. what the sub-gate thinks the importance is of each of the 10 baseline output classes.
        - gx[2][:,:,1] -> expert importance tensor of shape (10,), i.e. what the sub-gate thinks the importance is of each of the 10 expert output classes.

- Performs the following:
    - Multiplies the baseline's output by the sub-gate's baseline importance -> (10,) tensor.
    - Multiplies the expert's output by the sub-gate's expert importance -> (10,) tensor 
    - Sums these two importance-weighted terms to get the final scores, one for each class -> (10,) tensor.

To implement the subgate, we use a [`tf.keras.layers.Lambda`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda) layer, which wraps a Python [lambda](https://www.w3schools.com/python/python_lambda.asp) function.

In [None]:
def subgate(baseline_model, expert_model, importance_model, n_classes=10):
    output = tf.keras.layers.Lambda(
        lambda gx: (gx[0]*gx[2][:,:,0]) + (gx[1]*gx[2][:,:,1]), output_shape=(n_classes,)
        )([baseline_model, expert_model, importance_model])
    return output
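
The arithmetic inside the Lambda layer above can be checked with NumPy on random arrays. This is a sketch only; the array names are hypothetical, and the leading dimension is the batch dimension that Keras adds to the (10,) and (10,2,) shapes quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_classes = 3, 10

baseline_out = rng.random((batch, n_classes))   # plays the role of gx[0]
expert_out = rng.random((batch, n_classes))     # plays the role of gx[1]
importance = rng.random((batch, n_classes, 2))  # plays the role of gx[2]

# Same expression as in the Lambda: importance-weighted sum of the two outputs
combined = baseline_out * importance[:, :, 0] + expert_out * importance[:, :, 1]
print(combined.shape)

# Sanity check: with all the importance on the baseline, the baseline is returned unchanged
only_baseline = np.zeros((batch, n_classes, 2))
only_baseline[:, :, 0] = 1.0
check = baseline_out * only_baseline[:, :, 0] + expert_out * only_baseline[:, :, 1]
```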

# 4. MoE

At this stage, we have implemented:
- A baseline 10 classes classifier.
- 2 experts 10 classes classifiers.
- A binary classifier that will be used to choose which expert to use.
- Subgates that weigh the outputs of the baseline and of the selected expert, using an importance model, to produce the final MoE prediction.

We now need to implement the logic deciding which expert to use, and integrate all the different components of our MoE together in order to be able to train the whole architecture at once.

<div class="alert alert-block alert-info"> 
<b>💡 Tips</b> 

- The advantage of using a framework like TensorFlow Keras is that all layers inherit from the same abstract classes. As long as we use classes and wrappers properly, and the input and output shapes of the component models match, the resulting model can be trained with the same tools as the individual component models. This spares us extensive and complicated gradient calculations.

</div>


<div class="alert alert-block alert-danger">
<b>👩‍💻👨‍💻 Action required</b>

- Using a [`tf.keras.layers.Lambda`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda) layer and [`tf.where`](https://www.tensorflow.org/api_docs/python/tf/where), in a similar manner to the implementation of the function `subgate`, implement a model that:
    - Takes as argument gx, which is a list of 6 tensors:
        - gx[0] -> baseline output tensor (10,)
        - gx[1] -> first gate binary output tensor (2,)
        - gx[2] -> artificial expert output tensor (10,)
        - gx[3] -> natural expert output tensor (10,)
        - gx[4] -> artificial importance model output tensor (10,2,)
        - gx[5] -> natural importance model output tensor (10,2,)
    - If the binary classifier classifies the image as artificial (`tf.expand_dims(gx[1][:,0], axis=1) > tf.expand_dims(gx[1][:,1], axis=1)`):
        - Outputs the result of the artificial subgate model (10,)
    - Else:
        - Outputs the result of the natural subgate model (10,)

- Compile the model using the [ADAM optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) with a learning rate of 0.00001 and the [categorical crossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy) loss function.
- Use model.summary() to print a summary of your architecture.
- Train the model and display the confusion matrix on the test set. Compare performances with the baseline classifier alone.
    - Have the performances been improved?
    - Is the increased model complexity and training time worth the accuracy gains?

</div>
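
The per-sample selection can be previewed with `np.where`, used here as a NumPy stand-in for `tf.where` (both broadcast a `(batch, 1)` condition across the class axis). The arrays are toy values invented for this sketch, with the two "subgate outputs" filled with constants so the selection is easy to read.

```python
import numpy as np

# Toy gate outputs: column 0 = artificial score, column 1 = natural score
gate = np.array([[0.9, 0.1],   # sample 0: gate prefers artificial
                 [0.2, 0.8]])  # sample 1: gate prefers natural

# Toy subgate outputs (hypothetical constants instead of real predictions)
art_subgate_out = np.full((2, 10), 1.0)
nat_subgate_out = np.full((2, 10), 2.0)

# Condition of shape (batch, 1), broadcast over the 10 classes
condition = gate[:, 0:1] > gate[:, 1:2]
selected = np.where(condition, art_subgate_out, nat_subgate_out)
print(selected)
```

Keeping the condition at shape `(batch, 1)` rather than `(batch,)` is what lets it broadcast against the `(batch, 10)` outputs, which is the role of `tf.expand_dims` in the condition given above.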




In [None]:
def MoE(baseline_model, expert_choice_gate, artificial_expert_model, natural_expert_model, artificial_importance_model, natural_importance_model):
    """
    Create a Mixture of Experts model for image classification

    Args:
        - baseline_model (tf.keras.models.Sequential): baseline CNN model
        - expert_choice_gate (tf.keras.models.Sequential): expert choice gate model
        - artificial_expert_model (tf.keras.models.Sequential): artificial expert model
        - natural_expert_model (tf.keras.models.Sequential): natural expert model
        - artificial_importance_model (tf.keras.models.Sequential): artificial importance model
        - natural_importance_model (tf.keras.models.Sequential): natural importance model

    Returns:
        - model (tf.keras.models.Model): MoE model

    """
    # Define input tensor
    inputs = tf.keras.layers.Input(shape=(32, 32, 3))

    # Get baseline predictions
    baseline_output = baseline_model(inputs)
    
    # Get expert choice gate predictions
    expert_gate_output = expert_choice_gate(inputs)

    # Get expert predictions
    artificial_expert_output = artificial_expert_model(inputs)
    natural_expert_output = natural_expert_model(inputs)

    # Get importance model outputs
    artificial_importance_output = artificial_importance_model(inputs)
    natural_importance_output = natural_importance_model(inputs)

    # Condition selection: If expert_choice_gate prefers artificial expert, use it, else use natural expert
    def moe_logic(gx):
        """
        Mixture of Experts logic, selecting the expert output based on the expert choice gate
        
        Args:
            - gx (list): list of tensors containing baseline, expert_choice, artificial_expert, natural_expert, artificial_importance, natural_importance
            
        Returns:
            - selected_expert_output (tf.Tensor): selected expert output
        """
        # Unpack in the order given in the task description: gx[0] is the baseline output, gx[1] the gate output
        base_out, expert_choice, art_exp_out, nat_exp_out, art_imp_out, nat_imp_out = gx
        condition = tf.expand_dims(expert_choice[:, 0], axis=1) > tf.expand_dims(expert_choice[:, 1], axis=1)
        
        # Select the expert output based on the expert choice gate using tf.where
        # TODO: your code here
        selected_expert_output = ...

        return selected_expert_output

    # MoE decision layer using tf.keras.layers.Lambda
    # TODO: your code here
    output = ...

    # Build the final model
    model = tf.keras.Model(inputs=inputs, outputs=output)
    return model

# TODO: your code here