# Section 2.1 - A Distributed Training Example

We will train a splitNN model that has been distributed across three different hosts. One host, Alice, is the data subject. Alice has the labelled data and will also be the custodian of the network's start and end segments. Claire and Bob are worker hosts. They will feed the activation signals forward from the start of the chain until they reach Alice's end layer, and will do the reverse with gradients during backpropagation.

## Section 2.1.1 - Set up environmental variables

Here we import the required libraries and initialise our model segments and data. We will need:

<img src="images/distributed.png" width="50%">

- A dummy distributed dataset
- 5 model segments
- 3 Virtual Workers

In [1]:
import torch
from torch import nn
from torch import optim
import syft as sy
import time
hook = sy.TorchHook(torch)

#from torchviz import make_dot, make_dot_from_trace
from torch.autograd import Variable

In [2]:
@property
def location(self):
    m = self.__getitem__(0)
    w = m.weight[0]
    return w.location

nn.Sequential.location = location

In [3]:
# A Toy Dataset
x = torch.tensor([[0,0,0,0],[1,0,0,0],[0,1,0,0],[0,0,1,0],[1,1,0,0],[1,0,1,0],[0,1,1,0],[1,1,1,0],[0,0,0,1],[1,0,0,1],[0,1,0,1],[0,0,1,1],[1,1,0,1],[1,0,1,1],[0,1,1,1],[1,1,1,1.]])
y = torch.tensor([[0],[0],[0],[0],[0],[0],[0],[0],[1],[1],[1],[1],[1],[1],[1],[1.]])

torch.manual_seed(1)

# Define 5 chained models
models = [
    nn.Sequential(
        nn.Linear(4, 3),
        nn.Tanh()
    ),
    nn.Sequential(
        nn.Linear(3, 3),
        nn.Sigmoid()
    ),
    nn.Sequential(
        nn.Linear(3, 3),
        nn.Sigmoid()
    ),
    nn.Sequential(
        nn.Linear(3, 2),
        nn.Tanh()
    ),
    nn.Sequential(
        nn.Linear(2, 1),
        nn.Sigmoid()
    )
]

# create some workers
alice = sy.VirtualWorker(hook, id="alice")
bob = sy.VirtualWorker(hook, id="bob")
claire = sy.VirtualWorker(hook, id="claire")
workers = alice, bob, claire

The final predictions are shown above; we can compare them with the output of the same 'split' neural network.

## Section 2.1.2 - Send Variables to Starting Locations

In this example, Alice is the worker with the data and labels. Bob and Claire are intermediary hosts in the chain. Alice has the start and end model segments. Bob and Claire have intermediary segments.

We send the models and data to their respective hosts and store the pointers in associative arrays: the Model Chain (MC) and the xy Chain (xyC). These contain the locations of the data, but no actual values. They are the only parameters needed to coordinate the learning process. A summary is shown below.

<img src="images/Parameters.png" width="50%">

In this experiment, the models and data are initialised locally and then distributed out.
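
As a rough illustration of this bookkeeping, the sketch below uses plain Python strings in place of PySyft workers and pointers (an assumption, made so that it runs without any hosts). `model_chain` mirrors the segment placement used in this section, and the hypothetical helper `plan_hops` lists where an activation has to cross hosts and where it can stay put.

```python
# Hypothetical stand-in for the Model Chain: host names instead of PySyft pointers.
model_chain = ["alice", "alice", "bob", "claire", "alice"]


def plan_hops(chain):
    """Return (segment_index, source, destination, must_move) for each link."""
    hops = []
    for i in range(len(chain) - 1):
        src, dst = chain[i], chain[i + 1]
        # A tensor only needs to travel when consecutive segments sit on different hosts.
        hops.append((i, src, dst, src != dst))
    return hops


hops = plan_hops(model_chain)
for i, src, dst, must_move in hops:
    action = "move" if must_move else "stay"
    print(f"segment {i} -> segment {i + 1}: {src} -> {dst} ({action})")
```

The first link keeps the activation on Alice's host, which is the case where a move to the tensor's own location has to be skipped.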

In [4]:
# Send Model Segments to starting locations
model_locations = [alice, alice, bob, claire, alice]

for model, location in zip(models, model_locations):
    model.send(location)

# Create a remote copy of the dataset for each worker
datasets = [
    sy.BaseDataset(x.send(worker), y.send(worker))
    for worker in (alice, bob, claire)
]

## Section 2.1.3 - Forward Propagation

We will need to define the logic of forward and backward propagation.

Forward propagation feeds the input data into Alice's segment at the beginning of the chain. Alice then sends her activation signal to the location of the next model in the chain. That model propagates the activation and sends it onward to the location of the next segment. The signal eventually reaches Alice's end segment, which produces a prediction. We store pointers to the activations of each layer in the Activation Chain (AC), which lets us retrieve the values when processing gradients. Once the activations have fully propagated through the MC, the method returns the resulting AC for use in the backpropagation function.

<img src="images/activationchain.png" width="50%">
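
The forward pass described above can be sketched with scalars. This is a simplified illustration under stated assumptions: each segment is a single made-up weight followed by an activation function, hosts are plain strings, and `forward_chain` is a hypothetical name rather than part of PySyft.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical scalar segments: (host, weight, activation function).
segments = [
    ("alice", 0.5, math.tanh),
    ("bob", -1.2, sigmoid),
    ("claire", 0.8, sigmoid),
    ("alice", 1.5, sigmoid),
]


def forward_chain(segments, x):
    """Feed x through every segment, recording inputs and outputs per segment."""
    inputs, outputs, route = [], [], []
    signal = x
    for host, weight, activation in segments:
        inputs.append(signal)                 # what this segment received
        signal = activation(weight * signal)
        outputs.append(signal)                # what this segment produced
        route.append(host)                    # where the computation happened
    return inputs, outputs, route


inputs, outputs, route = forward_chain(segments, x=1.0)
print("route:", " -> ".join(route))
print("prediction:", round(outputs[-1], 4))
```

The recorded `inputs` and `outputs` play the role of the Activation Chain: each segment's output is the next segment's input, and both are kept for the backward pass.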

## Section 2.1.4 - Backward Propagation

The backpropagation function takes the MC, xyC and AC as input parameters.

<img src="images/backpropParams.png" width="80%">


First, the backpropagation algorithm computes the loss on Alice's prediction. We use the sum of squared errors as our loss function.

<img src="images/loss.png" width="100%">
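
As a small worked example of this loss, the sketch below computes the sum of squared errors and its derivative with respect to each prediction, 2 * (prediction - target). The prediction and target values are made up for illustration.

```python
def sse_loss(predictions, targets):
    """Sum of squared errors over all samples."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


def sse_grad(predictions, targets):
    """Derivative of the loss with respect to each prediction: 2 * (p - t)."""
    return [2.0 * (p - t) for p, t in zip(predictions, targets)]


predictions = [0.9, 0.2, 0.6]   # made-up outputs of the end segment
targets = [1.0, 0.0, 1.0]       # made-up labels

loss = sse_loss(predictions, targets)
grads = sse_grad(predictions, targets)
print(loss, grads)
```

These per-prediction gradients are the starting point that the end segment propagates back through its own layers.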

We then calculate the gradients for the parameters of the end segment using the chain rule.

<img src="images/chainRule.png" width="40%">

This is done automatically for the layers within each segment, but we have to recalculate the loss for each model segment during the backpropagation phase.

<img src="images/intermediateLoss.png" width="80%">
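
One way to see why a recalculated loss works is to compare gradients numerically. The sketch below is an illustration under stated assumptions: it uses two scalar segments with made-up weights, and finite differences in place of autograd. It checks that differentiating "activation times incoming gradient", with the incoming gradient held fixed, gives the same weight gradient as differentiating the full, unsplit network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x, y = 0.7, 1.0      # made-up input and label
w, v = 0.4, -1.3     # made-up weights: w on the earlier segment, v on the end segment


def activation(w_):
    """Earlier segment: a = tanh(w * x)."""
    return math.tanh(w_ * x)


def end_loss(a):
    """End segment plus squared error: (sigmoid(v * a) - y) ** 2."""
    return (sigmoid(v * a) - y) ** 2


def derivative(f, at, eps=1e-6):
    """Central finite difference, used here instead of autograd."""
    return (f(at + eps) - f(at - eps)) / (2 * eps)


a = activation(w)
g = derivative(end_loss, a)            # gradient handed back by the end segment

# Surrogate loss seen by the earlier segment: activation times the fixed gradient g.
surrogate_grad = derivative(lambda w_: activation(w_) * g, w)

# Reference: differentiate the full, unsplit network with respect to w.
full_grad = derivative(lambda w_: end_loss(activation(w_)), w)

print(surrogate_grad, full_grad)
```

The two values agree up to finite-difference error, so the earlier segment never needs to see the label or the end segment's weights, only the gradient of its own output.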




Each segment feeds the gradient of its input activation back to the segment behind it and updates its own weights with respect to the gradient it received. The segment computes its loss by combining its original activation signal with the incoming gradient as a dot product; the sum of the result is what feeds the error back down the line. Once a segment has finished, the optimiser for that segment steps. The process repeats until the segment at the beginning of the chain is reached and Alice updates the weights of her start segment.
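
The backward walk can be sketched end to end with scalars. This is a minimal illustration, assuming every segment computes tanh(weight * input) and using hand-written derivatives instead of autograd and optimisers. The names `forward_scalar` and `backward_scalar` and all values are hypothetical.

```python
import math

# Hypothetical chain of scalar segments, each computing tanh(weight * input).
weights = [0.5, -0.8, 1.2]
x, y, lr = 1.0, 0.5, 0.1


def forward_scalar(weights, x):
    inputs, outputs = [], []
    signal = x
    for w in weights:
        inputs.append(signal)
        signal = math.tanh(w * signal)
        outputs.append(signal)
    return inputs, outputs


def backward_scalar(weights, inputs, outputs, y, lr):
    """Walk the chain from end to start, passing only the input gradient back."""
    grad = 2.0 * (outputs[-1] - y)              # d(loss)/d(prediction) for squared error
    new_weights = list(weights)
    for i in reversed(range(len(weights))):
        local = (1.0 - outputs[i] ** 2) * grad  # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
        weight_grad = local * inputs[i]         # gradient for this segment's weight
        grad = local * weights[i]               # gradient sent to the segment behind
        new_weights[i] = weights[i] - lr * weight_grad
    return new_weights


inputs, outputs = forward_scalar(weights, x)
loss_before = (outputs[-1] - y) ** 2
weights = backward_scalar(weights, inputs, outputs, y, lr)
loss_after = (forward_scalar(weights, x)[1][-1] - y) ** 2
print(loss_before, loss_after)
```

Each segment only needs its own stored input and output plus the gradient handed to it, which is why the Activation Chain is kept from the forward pass.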

In [5]:
def forward(models, x):

    inputs = []
    outputs = []
    
    # First: provide x as input
    inputs.append(x)
    outputs.append(models[0](x))    
    
#     Update: Move() can crash if self-referencing,
#             This has been fixed with an if-statement.
#             If a variable is not to be moved, it 
#             isn't moved before filling the next input.
            
#             However, not moving the output seems to break
#             the grad system in the backprop function; leaving
#             no gradients attached to this after backward()
#             function. 
            
#             Oddly this doesn't happen when the tensor is moved.
            
    if outputs[-1].location != models[1].location:
        next_input = outputs[-1].copy().move(models[1].location)
    else:
        next_input = outputs[-1].copy()
#         next_input = outputs[-1].copy().get().send(models[1].location)
        
    for i in range(1, len(models)-1):
        inputs.append(next_input)
        outputs.append(models[i](next_input))
        next_input = outputs[-1].copy().get().send(models[i+1].location)
 
    # Last: don't move the result to the next location
    inputs.append(next_input)
    outputs.append(models[len(models)-1](next_input))
    
    return inputs, outputs

In [6]:
def backward(models, optimizers, segment_inputs, segment_outputs, dataset):
    data, targets = dataset.data, dataset.targets
        
    # Destroy pre-existing gradient of final layer
    optimizers[len(optimizers)-1].zero_grad()
   
    loss = (((segment_outputs[-1] - targets)**2).sum())

    # Compute gradients
    loss.backward()
    
    # End layer sends the gradient of the activation signal back to the layer behind
    input_gradient = segment_inputs[-1].grad.clone().get().send(models[len(models)-2].location)
    
    # End layer updates weights
    optimizers[-1].step()

    # Compute Intermediary Layers: repeat the same operations
    for iter in range(len(models)-1, 1, -1): 
        optimizers[iter-1].zero_grad()
        
        # Surrogate loss: element-wise product of the activations with the incoming gradient, summed
        intermediate_loss = (segment_outputs[iter-1] * input_gradient).sum()
        intermediate_loss.backward()
        
        print(iter)
        if iter == 2 and segment_inputs[iter-1].grad is None:
            print("BREAKS ON THIS SEGMENT: Processing input which wasn't moved. \n Try uncommenting the old .get().send() command in the forward propogation function")
            
        input_gradient = segment_inputs[iter-1].grad.clone().get().send(models[iter-2].location)
        optimizers[iter-1].step()

    # Compute the first segment: same steps, but its input is the real input data
    optimizers[0].zero_grad()
    segment_output = segment_outputs[0]
    intermediate_loss = (segment_output * input_gradient).sum()
    intermediate_loss.backward()
    optimizers[0].step()
        
    return segment_outputs[-1], loss

## Section 2.1.5 - Run Training Logic

Now we will run the training process for 300 epochs per data owner, printing our progress every 30 epochs. The start and end segments of the model are handed over to the next data owner once training on the current owner's data is finished.

<img src="images/BatchFlow.png" width="40%">
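
The hand-over of the perimeter segments can be sketched without any hosts. The example below is an illustration that uses plain strings instead of PySyft workers; `rotate_perimeter` is a hypothetical helper that places the first and last segments on the current data owner while the intermediary segments stay where they are.

```python
# Hypothetical host names standing in for PySyft workers.
data_owners = ["alice", "bob", "claire"]
chain = ["alice", "alice", "bob", "claire", "alice"]   # starting segment placement


def rotate_perimeter(chain, owner):
    """Return a new chain with the first and last segments placed on `owner`."""
    updated = list(chain)
    updated[0] = owner
    updated[-1] = owner
    return updated


history = []
for owner in data_owners:
    chain = rotate_perimeter(chain, owner)
    history.append(chain)
    # ... train on this owner's data here ...
    print(f"{owner} trains with chain: {chain}")
```

Only positions 0 and 4 change between rounds, so the labelled data and the segments that touch it always sit on the same host.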


In [7]:
def splitNN_train(models, xyChain):
    
    #   Variables for performance metrics
    start_time = time.time()
    epochs = 300
    lr = 0.2
    counter = 0
    
    # Create optimisers for each segment and link to their segment
    optimizers = [
        optim.SGD(params=model.parameters(),lr=lr)
        for model in models
    ]
    
    for i, local_worker in enumerate(workers):
        
        # Begin work on current data subject
        dataset = datasets[i]
        
        print('*', dataset.location.id, models[0].location.id)
        
        for epoch in range(epochs):
            # Forward propagate through the network until the final layer is reached
            segment_inputs, segment_outputs = forward(models, dataset.data)
            
            # Backward propagate
            predictions, loss = backward(models, optimizers, segment_inputs, segment_outputs, dataset)

            if epoch % 30 == 0:
                print(f"Epoch: {epoch}/{epochs} \tLoss: ", "{:.4f}\tRuntime: {:.2f}s".format(loss.get().data, time.time() - start_time))
        
        # If we are not at the end of the data owner chain send perimeter segments to next data owner
        if i < len(workers)-1:
            models[0].get().send(datasets[i+1].location)
            models[len(models)-1].get().send(datasets[i+1].location)      
            

            print("\nNEXT DATA OWNER\n")
            print("MODEL CHAIN LOCATIONS")
            for iter in range(len(models)):
                print(models[iter].location.id)  
            print("\n")
    
    # Send models back to researcher
    [model.get() for model in models]
    
    # Perform predictions with the updated weights
    out = torch.tensor([[0,0,0,0],[1,0,0,0],[0,1,0,0],[0,0,1,0],[1,1,0,0],[1,0,1,0],[0,1,1,0],[1,1,1,0],[0,0,0,1],[1,0,0,1],[0,1,0,1],[0,0,1,1],[1,1,0,1],[1,0,1,1],[0,1,1,1],[1,1,1,1.]])
    for i in range(len(models)):
        out = models[i](out)
        
    print("\n\nFinal Predictions:", torch.t(out).data)
    

In [8]:
splitNN_train(models, datasets)

* alice alice
4
3
2
BREAKS ON THIS SEGMENT: Processing input which wasn't moved. 
 Try uncommenting the old .get().send() command in the forward propogation function


AttributeError: 'NoneType' object has no attribute 'clone'

# Notes

- I figured out why move() was breaking the forward prop function; it was trying to send to itself which caused an error.
- Gradient system breaks when not moving input tensors.
- Problems of 'knowledge of other models' in the training process can be solved by reversing the order from Claire to Alice. This removes the need for encrypting model segments.
- After I get to the bottom of the gradients, I will plug this into the MNIST and CIFAR datasets for benchmarking.