# Using Flower on AWS with Terraform

Welcome to the federated learning tutorial!

This notebook shows how to build a federated learning system with Flower and deploy it on AWS. At the end of this tutorial, three AWS VMs will be running: one acts as the server and the other two represent clients.

[Terraform](https://www.terraform.io/) will be used to provision the infrastructure. For those who have never heard of Terraform, this quote from its website gives an idea:

>Terraform is an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.

Put simply, Terraform lets you create infrastructure in a repeatable and usually predictable way.

## Infrastructure

Before beginning with any actual code, make sure you have access to an AWS account. You will need an `AWS_ACCESS_KEY_ID` as well as the corresponding `AWS_SECRET_ACCESS_KEY`. This guide explains how to create and retrieve them: [How do I create an AWS access key?](https://aws.amazon.com/de/premiumsupport/knowledge-center/create-access-key)

Make sure NOT to share them with anyone or leak them accidentally, as others could gain full control over your account using these secrets. When you are ready, enter your credentials in the next section, which will enable tools such as Terraform to access your AWS account.

In [None]:
# Set environment variables with `%env` magic command specific to Jupyter
# Outside of Jupyter you would replace `%env` with `export`
%env AWS_ACCESS_KEY_ID=REPLACE_ME
%env AWS_SECRET_ACCESS_KEY=REPLACE_ME
%env AWS_DEFAULT_REGION=eu-central-1
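Before moving on, it can help to confirm that the three variables are set and that no placeholder is left over. The following is a minimal sketch using only the Python standard library; the helper name `missing_aws_settings` and the constants are illustrative and not part of any AWS or Terraform tooling.

```python
import os

# Variable names taken from the cell above; the placeholder is the
# literal string used in this notebook.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
PLACEHOLDER = "REPLACE_ME"


def missing_aws_settings(env=None):
    """Return the required variables that are unset, empty, or still a placeholder."""
    env = os.environ if env is None else env
    return [
        name
        for name in REQUIRED_VARS
        if env.get(name, "").strip() in ("", PLACEHOLDER)
    ]


if __name__ == "__main__":
    problems = missing_aws_settings()
    if problems:
        print("Please set:", ", ".join(problems))
    else:
        print("All AWS settings are present.")
```

The check only looks at whether values exist; it does not verify that the credentials are valid.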

### Installing dependencies

Running this notebook requires Terraform, which is going to be installed using some statements from the official installation [guide](https://learn.hashicorp.com/tutorials/terraform/install-cli?in=terraform/aws-get-started). This will work on Google Colab as well as on any Debian-based system, as apt is used to install the dependencies.

In [None]:
%%bash
export DEBIAN_FRONTEND=noninteractive
sudo apt-get update && sudo apt-get install -y gnupg software-properties-common curl
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -
sudo apt-add-repository "deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main"
sudo apt-get update && sudo apt-get install -y terraform

### Provision EC2 instances

With Terraform installed, you are ready to create the infrastructure necessary to run a Flower server and two clients.

In this tutorial, you will provision three EC2 instances on Amazon Web Services (AWS). EC2 instances are virtual machines running on AWS. They can be started using a variety of machine images. The image used here is based on Ubuntu 20.04.

#### Configuration

To group all configuration files, a directory called `infrastructure` is going to be created. All configuration files will be written into that directory. Additionally, an SSH key will be created, which is later used to connect to the machines.

In [None]:
%%bash
# Create directory infrastructure and use -p so this command becomes idempotent
mkdir -p ./infrastructure

# Create SSH key to be used later to connect to the machines
cd infrastructure

# Create SSH key with name flower_notebook_rsa if it does not exist
if [[ ! -f "./flower_notebook_rsa" ]]; then
    ssh-keygen -b 2048 -t rsa -N '' -f ./flower_notebook_rsa
fi

# Your public key. You will need this later.
echo "Your public key (copy this)"
cat flower_notebook_rsa.pub

When provisioning the machines, it is desirable that they are automatically configured on startup and are ready to run Flower code. Cloud-init is a standard configuration support tool available on most Linux distributions and all major cloud providers. It allows you to pass a shell script to the command which starts a cloud instance. That script can be used to install software on or configure the machine.

In the next step, a shell script will be written to the `infrastructure` directory. It will be referenced in the Terraform configuration so that it runs when the machine is provisioned.

In [None]:
%%writefile ./infrastructure/user_data.sh
#!/bin/bash
set -e

# Install dependencies
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg lsb-release openssh-client python3 python3-pip

# Install Flower
python3 -m pip install flwr==0.18.0 torch==1.11.0 torchvision==0.12.0

Now the Terraform `main.tf` file, which contains the primary entrypoint to our Terraform configuration, will be written. It is not necessary to understand the intricacies of this file, but for all who are interested it is recommended to go through the official Terraform ["Get Started - AWS"](https://learn.hashicorp.com/collections/terraform/aws-get-started) tutorial. Similar tutorials exist for other cloud providers such as Azure and GCP.

The previously created public key needs to be inserted here into the Terraform configuration. Scroll up and copy the public key. Replace the string "REPLACE_ME" in line 30 before executing the next segment and writing the configuration to disk.

When reading the Terraform configuration it is worthwhile to have a deeper look at the security group configuration.

In [None]:
%%writefile ./infrastructure/main.tf

# Configure Terraform
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.27"
    }
  }

  required_version = ">= 0.14.9"
}

# Configure AWS and the default region for all AWS resources
provider "aws" {
  profile = "default"
  region  = "eu-central-1"
}

# This is used so that, in a workshop, the resources
# created by the participants don't collide.
resource "random_pet" "name" {}

# Logging in to the virtual machines requires a public SSH key to
# be registered on the instance. Replace REPLACE_ME down below
# with your public SSH key.
resource "aws_key_pair" "default" {
  key_name   = "flower-${random_pet.name.id}"
  public_key = "REPLACE_ME"
}

# Add one instance for the Flower server. We are using an m5a instance type
# as the m instances are not limited by CPU credits. This is quite important
# as the machines will utilize most of their resources. The m5a.large has
# 2 vCPUs and 8 GiB RAM.
resource "aws_instance" "flower_server" {
  # Use a data reference for cross-region compatibility
  ami           = data.aws_ami.ubuntu.id
  instance_type = "m5a.large"
  key_name      = aws_key_pair.default.key_name

  root_block_device {
    # Size of disk in GiB
    volume_size = "30"
  }

  user_data     = file("user_data.sh")

  vpc_security_group_ids = [
    aws_security_group.flower.id
  ]

  tags = {
    Name = "FlowerServer"
  }
}

# Additionally we are going to start two instances for the Flower clients
resource "aws_instance" "flower_clients" {
  # Use a data reference for cross-region compatibility
  ami           = data.aws_ami.ubuntu.id
  instance_type = "m5a.large"
  key_name      = aws_key_pair.default.key_name
  count         = 2

  user_data     = file("user_data.sh")

  root_block_device {
    # Size of disk in GiB
    volume_size = "30"
  }

  vpc_security_group_ids = [
    aws_security_group.flower.id
  ]

  tags = {
    Name = "FlowerClient"
  }
}

# Define a data source to make sure we get the right AWS AMI
# independent of the region in which we start the EC2 instance.
# The same AMI image might have different IDs in different
# AWS regions.
data "aws_ami" "ubuntu" {
  most_recent = true

  filter {
    name   = "name"
    values = ["*ubuntu-focal-20.04-amd64-server-20211129"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
  # AWS owner id of Canonical
  # Find out with:
  # aws ec2 describe-images \
  # --filters "Name=name,Values=*ubuntu-focal-20.04-amd64-server-20211129"
  owners = ["099720109477"]
}

resource "aws_default_vpc" "default" {
  tags = {
    Name = "Default VPC"
  }
}

# IMPORTANT
# The security groups will configure which ports are externally reachable
resource "aws_security_group" "flower" {
  name        = "flower-${random_pet.name.id}"
  description = "All ports required for a Flower server"
  vpc_id      = aws_default_vpc.default.id

  # Incoming traffic
  # Allow port 22 so developers can connect to the server
  # Allow port 8080 so Flower clients can connect to the Flower server
  ingress = [
    {
      description      = "SSH"
      from_port        = 22
      to_port          = 22
      protocol         = "tcp"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
      security_groups = null
      prefix_list_ids  = null
      self = null
    },
    {
      description      = "HTTP"
      from_port        = 8080
      to_port          = 8080
      protocol         = "tcp"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
      security_groups = null
      prefix_list_ids  = null
      self = null
    }
  ]

  # Outgoing traffic
  # Allow all ports when a connection is made from inside the server
  # to the outside world
  egress = [
    {
      description      = "Any"
      from_port        = 0
      to_port          = 0
      protocol         = "-1"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
      security_groups = null
      prefix_list_ids  = null
      self = null
    }
  ]

  tags = {
    Name = "flower"
  }
}

output "your_pet" {
  description = "Your pet's name"
  value       = random_pet.name.id
}

output "server_ip" {
  description = "Public IP address of server"
  value       = aws_instance.flower_server.public_ip
}

output "client_1_ip" {
  description = "Public IP address of client 1"
  value       = aws_instance.flower_clients[0].public_ip
}

output "client_2_ip" {
  description = "Public IP address of client 2"
  value       = aws_instance.flower_clients[1].public_ip
}
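Instead of pasting the public key by hand, the placeholder can also be replaced programmatically once `main.tf` has been written to disk. This is a minimal sketch under the assumption that the file names from the earlier cells (`flower_notebook_rsa.pub`, `main.tf`) are unchanged; `insert_public_key` and `update_main_tf` are hypothetical helpers, not part of Terraform.

```python
from pathlib import Path

# The exact placeholder assignment as it appears in main.tf.
PLACEHOLDER = 'public_key = "REPLACE_ME"'


def insert_public_key(config_text, public_key):
    """Return config_text with the placeholder key replaced by public_key."""
    if PLACEHOLDER not in config_text:
        raise ValueError("placeholder not found; was the key already inserted?")
    # A public key file holds a single line; strip the trailing newline.
    replacement = f'public_key = "{public_key.strip()}"'
    return config_text.replace(PLACEHOLDER, replacement, 1)


def update_main_tf(directory="./infrastructure"):
    """Read the key and main.tf from `directory` and write the updated file back."""
    base = Path(directory)
    key = (base / "flower_notebook_rsa.pub").read_text()
    main_tf = base / "main.tf"
    main_tf.write_text(insert_public_key(main_tf.read_text(), key))
```

Calling `update_main_tf()` from the notebook's working directory would rewrite `./infrastructure/main.tf` in place; running it a second time raises an error because the placeholder is gone.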

#### Initialize Terraform

Now it's time to initialize Terraform in the infrastructure directory. As long as there are no errors in our Terraform files, this should just work.

In [None]:
%%bash
cd ./infrastructure
terraform init

#### Create Infrastructure - plan & apply Terraform

Now that Terraform is successfully initialized, plan and apply the Terraform configuration. Doing so makes Terraform create infrastructure in the AWS account. As a first step, running `terraform plan` shows what Terraform would do if the configuration were applied. Running `terraform apply -auto-approve` performs the plan step and immediately applies the resulting changes. In a native terminal you can run `terraform apply` directly, as Terraform will then (without the `-auto-approve` option) ask whether the proposed changes should be applied.

In [None]:
%%bash
cd ./infrastructure
terraform plan

The output of `terraform plan` shows what `terraform apply` will do. In a production workflow, the output of the `plan` command would be written to disk and then reviewed. This is not done here to keep things simple.
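A review step of that kind could look roughly like the following sketch. It assumes the plan was saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan > plan.json`; the file names and the helper `summarize_plan` are illustrative. Only the `resource_changes` entries of the JSON plan are read.

```python
import json
from collections import Counter


def summarize_plan(plan_json):
    """Count planned actions (create, update, delete, ...) in a JSON plan."""
    plan = json.loads(plan_json)
    counts = Counter()
    for change in plan.get("resource_changes", []):
        # "actions" is a list such as ["create"] or ["delete", "create"].
        actions = change.get("change", {}).get("actions", [])
        counts["/".join(actions)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hand-written sample in the shape of `terraform show -json` output.
    sample = json.dumps({
        "resource_changes": [
            {"address": "aws_instance.flower_server",
             "change": {"actions": ["create"]}},
            {"address": "aws_instance.flower_clients[0]",
             "change": {"actions": ["create"]}},
            {"address": "aws_security_group.flower",
             "change": {"actions": ["no-op"]}},
        ]
    })
    print(summarize_plan(sample))
```

A summary like this makes unexpected deletions easy to spot before `terraform apply tfplan` is run.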

In [None]:
%%bash
cd ./infrastructure
terraform apply -auto-approve

Note that Terraform returns its output as soon as the machines are running, while the cloud-init scripts only start executing at that point, so the machines may need some more time before Flower is installed. The IPs of these machines will later be used to upload and execute code on them. Copy them into the next code segment so this becomes easier later on.

In [None]:
# Replace the "REPLACE_ME" with the correct IP
%env CLIENT_1_IP=REPLACE_ME
%env CLIENT_2_IP=REPLACE_ME
%env SERVER_IP=REPLACE_ME
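Copying the addresses by hand can be avoided by reading them from `terraform output -json`, which wraps each output in an object with a `value` key. The sketch below parses that JSON and maps the output names defined in `main.tf` to the variable names used in this notebook; `parse_terraform_outputs` and `to_env` are hypothetical helpers, and the sample addresses are documentation-only example IPs.

```python
import json


def parse_terraform_outputs(output_json):
    """Map each Terraform output name to its plain value."""
    raw = json.loads(output_json)
    return {name: entry["value"] for name, entry in raw.items()}


def to_env(outputs):
    """Translate output names from main.tf into this notebook's variable names."""
    mapping = {
        "server_ip": "SERVER_IP",
        "client_1_ip": "CLIENT_1_IP",
        "client_2_ip": "CLIENT_2_IP",
    }
    return {env_name: outputs[tf_name] for tf_name, env_name in mapping.items()}


if __name__ == "__main__":
    # Hand-written sample in the shape of `terraform output -json`.
    sample = json.dumps({
        "server_ip": {"sensitive": False, "type": "string", "value": "203.0.113.10"},
        "client_1_ip": {"sensitive": False, "type": "string", "value": "203.0.113.11"},
        "client_2_ip": {"sensitive": False, "type": "string", "value": "203.0.113.12"},
    })
    print(to_env(parse_terraform_outputs(sample)))
```

In a notebook, the real JSON could be obtained with `subprocess.run(["terraform", "output", "-json"], cwd="./infrastructure", capture_output=True, text=True, check=True).stdout`, and `os.environ.update(...)` on the result should make the values visible to later `%%bash` cells.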

#### Clean-up

Now that the infrastructure is deployed and running, it is also important to understand how to stop it. As long as it is running it will incur cost, which makes it quite important to remember to clean up after our experiments are done. Ideally you should configure as much of the infrastructure as possible so that it is cleaned up automatically. The best way to remove infrastructure created with Terraform is to use the `terraform destroy` command. Terraform is stateful, and the state in our case is stored in `./infrastructure/terraform.tfstate`. Using the state, Terraform will only destroy resources it created and not remove anything else. __Don't run this now!__ Rather, come back here and run it when you stop working on this notebook and want to clean up the infrastructure you have created. If you would like to test how it works, you can of course execute this code block now, but you will have to provision the infrastructure once more by running the `terraform apply` code block.

In [None]:
%%bash
cd ./infrastructure
terraform destroy -auto-approve

## The experiment

Three instances are now available, but no Flower code is deployed yet. Application deployment can be quite complex and depends on various requirements. This tutorial keeps it simple so the concept is understood. The artifacts will be a `server.py` and a `client.py` file, which are uploaded to the respective machines. More advanced setups might use Docker containers which are pushed to a registry and then downloaded and started on the respective machines. Alternatively, tools such as ArgoCD on a Kubernetes cluster could be used. These more advanced setups are unfortunately beyond the scope of this tutorial.

The steps which will be taken in the next section will be:

1. Write server.py and client.py to disk
2. Upload files to respective machines
3. Execute scripts in screen sessions
4. Read logfiles to check progress

>Note: By default, the Flower strategy `FedAvg` waits for two clients before starting a round of federation.

### Preparation

As preparation, we are going to create a directory called `app` where we will store all the files which we are going to upload to the machines.



In [None]:
%%bash
# Create directory app and use -p so this command becomes idempotent
mkdir -p ./app

### Server

The first thing needed is a server. The default Flower server is extremely simple and uses FedAvg by default. For this showcase we will use just that and let you customize the code after we have shown that everything works.

As an important side note, the `server_address="0.0.0.0:8080"` in the next code segment instructs the server to listen on all network interfaces on port `8080` and accept all connections made to it. If `0.0.0.0` were replaced with a specific IP or hostname, the server would only listen for requests made to that specific address. Therefore we are not going to change it.
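To give an idea of what federated averaging does with the client results, here is a minimal pure-Python sketch of the weighting step. It illustrates the general idea only and is not Flower's implementation; `federated_average` is a hypothetical helper, and parameters are simplified to flat lists of floats.

```python
def federated_average(client_updates):
    """Average parameter vectors, weighting each client by its example count.

    client_updates: list of (parameters, num_examples) tuples, where
    parameters is a flat list of floats with the same length for every client.
    """
    total_examples = sum(num for _, num in client_updates)
    if total_examples == 0:
        raise ValueError("no training examples reported")
    averaged = [0.0] * len(client_updates[0][0])
    for parameters, num_examples in client_updates:
        # Clients that trained on more data contribute proportionally more.
        weight = num_examples / total_examples
        for i, value in enumerate(parameters):
            averaged[i] += weight * value
    return averaged


if __name__ == "__main__":
    # Two clients: the second one trained on three times as many examples.
    print(federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)]))
```

This is why the client's `fit` method returns the number of training examples alongside the updated parameters.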

In [None]:
%%writefile ./app/server.py
# Flower Server
import flwr as fl

# Start Flower server
fl.server.start_server(
  server_address="0.0.0.0:8080",
  config={"num_rounds": 3},
)
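The difference between listening on all interfaces and on a single address can be tried locally with plain sockets. This sketch uses only the Python standard library and says nothing about how Flower opens its port internally; `open_listener` is an illustrative helper.

```python
import socket


def open_listener(host, port=0):
    """Open a TCP listening socket on host; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


if __name__ == "__main__":
    # Bound to 0.0.0.0: reachable through every IPv4 interface of the machine.
    with open_listener("0.0.0.0") as everywhere:
        print("all interfaces:", everywhere.getsockname())
    # Bound to 127.0.0.1: reachable only from the machine itself.
    with open_listener("127.0.0.1") as local_only:
        print("loopback only:", local_only.getsockname())
```

On the EC2 instance, a server bound to `127.0.0.1` would be invisible to the clients even though the security group allows port `8080`.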

### Client

Additionally, the client script needs to be written and stored to disk. The client naturally needs to know the IP of the server. Scroll up, look up the server IP address in the Terraform output, and adjust the next code segment so that the correct IP address is inserted.

_Hint: Look at the last line._

In [None]:
%%writefile ./app/client.py
# Flower Client
from collections import OrderedDict
import warnings

import flwr as fl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import Compose, ToTensor, Normalize
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

# #############################################################################
# Regular PyTorch pipeline: nn.Module, train, test, and DataLoader
# #############################################################################

warnings.filterwarnings("ignore", category=UserWarning)
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
  """Model (simple CNN adapted from 'PyTorch: A 60 Minute Blitz')"""

  def __init__(self) -> None:
    super(Net, self).__init__()
    self.conv1 = nn.Conv2d(3, 6, 5)
    self.pool = nn.MaxPool2d(2, 2)
    self.conv2 = nn.Conv2d(6, 16, 5)
    self.fc1 = nn.Linear(16 * 5 * 5, 120)
    self.fc2 = nn.Linear(120, 84)
    self.fc3 = nn.Linear(84, 10)

  def forward(self, x: torch.Tensor) -> torch.Tensor:
    x = self.pool(F.relu(self.conv1(x)))
    x = self.pool(F.relu(self.conv2(x)))
    x = x.view(-1, 16 * 5 * 5)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    return self.fc3(x)

def train(net, trainloader, epochs):
  """Train the model on the training set."""
  criterion = torch.nn.CrossEntropyLoss()
  optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
  for _ in range(epochs):
    for images, labels in trainloader:
      optimizer.zero_grad()
      criterion(net(images.to(DEVICE)), labels.to(DEVICE)).backward()
      optimizer.step()

def test(net, testloader):
  """Validate the model on the test set."""
  criterion = torch.nn.CrossEntropyLoss()
  correct, total, loss = 0, 0, 0.0
  with torch.no_grad():
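    for images, labels in testloader:
      # Move both tensors to DEVICE so the comparison below also works on a GPU
      images, labels = images.to(DEVICE), labels.to(DEVICE)
      outputs = net(images)
      loss += criterion(outputs, labels).item()
      total += labels.size(0)
      correct += (torch.max(outputs.data, 1)[1] == labels).sum().item()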
  return loss / len(testloader.dataset), correct / total

def load_data():
  """Load CIFAR-10 (training and test set)."""
  trf = Compose([ToTensor(), Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
  trainset = CIFAR10("./data", train=True, download=True, transform=trf)
  testset = CIFAR10("./data", train=False, download=True, transform=trf)
  return DataLoader(trainset, batch_size=32, shuffle=True), DataLoader(testset)

# #############################################################################
# Federating the pipeline with Flower
# #############################################################################

# Load model and data (simple CNN, CIFAR-10)
net = Net().to(DEVICE)
trainloader, testloader = load_data()

# Define Flower client
class FlowerClient(fl.client.NumPyClient):
  def get_parameters(self):
    return [val.cpu().numpy() for _, val in net.state_dict().items()]

  def set_parameters(self, parameters):
    params_dict = zip(net.state_dict().keys(), parameters)
    state_dict = OrderedDict({k: torch.tensor(v) for k, v in params_dict})
    net.load_state_dict(state_dict, strict=True)

  def fit(self, parameters, config):
    self.set_parameters(parameters)
    train(net, trainloader, epochs=1)
    return self.get_parameters(), len(trainloader.dataset), {}

  def evaluate(self, parameters, config):
    self.set_parameters(parameters)
    loss, accuracy = test(net, testloader)
    return float(loss), len(testloader.dataset), {"accuracy": float(accuracy)}

# Start Flower client
# Replace REPLACE_THIS_WITH_THE_SERVER_IP with the public IP of the server
fl.client.start_numpy_client("REPLACE_THIS_WITH_THE_SERVER_IP:8080", client=FlowerClient())

### Deployment

After we have defined our `server.py` and `client.py`, we have to upload them to the respective machines. Afterwards we are going to start the server first and then the clients. For uploading the files we are going to use [scp](https://linux.die.net/man/1/scp), which can be found on most UNIX systems.

In [None]:
%%bash
# Use `set -ex` to see the actual command with $SERVER_IP resolved
# which will be executed and stop if any of the commands fail
set -ex

function upload {
    scp -i ./infrastructure/flower_notebook_rsa -o "StrictHostKeyChecking=no" "$@"
}

# Upload code
upload ./app/client.py ubuntu@$CLIENT_1_IP:/home/ubuntu/
upload ./app/client.py ubuntu@$CLIENT_2_IP:/home/ubuntu/
upload ./app/server.py ubuntu@$SERVER_IP:/home/ubuntu/
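The same upload plan can be expressed in Python, which is convenient when the number of clients changes. This is a sketch that only builds the `scp` commands with the flags used above; `scp_command` and `upload_plan` are hypothetical helpers, and the addresses in the example are documentation-only IPs.

```python
import shlex

KEY_PATH = "./infrastructure/flower_notebook_rsa"


def scp_command(source, host, key_path=KEY_PATH, user="ubuntu"):
    """Build the scp argument list for copying source into the remote home directory."""
    return [
        "scp",
        "-i", key_path,
        "-o", "StrictHostKeyChecking=no",
        source,
        f"{user}@{host}:/home/{user}/",
    ]


def upload_plan(server_ip, client_ips):
    """Return one command per machine: server.py for the server, client.py for clients."""
    commands = [scp_command("./app/server.py", server_ip)]
    commands += [scp_command("./app/client.py", ip) for ip in client_ips]
    return commands


if __name__ == "__main__":
    for command in upload_plan("203.0.113.10", ["203.0.113.11", "203.0.113.12"]):
        # shlex.join renders the list as a copy-pasteable shell command.
        print(shlex.join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` to perform the actual copy.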

Next, start the server (wait a few seconds afterwards until the start is finished).

>`screen` is used here so that the command continues to run when the SSH connection closes. When connected to the instance via SSH, you can attach to the screen session with `screen -r`.

In [None]:
%%bash
ssh -i ./infrastructure/flower_notebook_rsa -o "StrictHostKeyChecking=no" ubuntu@$SERVER_IP "screen -d -L -m python3 server.py && sleep 5"
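Rather than waiting a fixed number of seconds, one could poll the server port until it accepts TCP connections. The sketch below uses only the standard library; `is_port_open` and `wait_for_port` are hypothetical helpers, and an open port only shows that something is listening, not that Flower has finished starting.

```python
import socket
import time


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port, attempts=10, delay=1.0):
    """Poll until the port accepts connections; return False after all attempts fail."""
    for _ in range(attempts):
        if is_port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

With the variables from this notebook, a call such as `wait_for_port(os.environ["SERVER_IP"], 8080)` would check the Flower server port opened in the security group.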

Then start the clients.

Client 1:

In [None]:
%%bash
ssh -i ./infrastructure/flower_notebook_rsa -o "StrictHostKeyChecking=no" ubuntu@$CLIENT_1_IP "screen -d -L -m python3 client.py"

Client 2:

In [None]:
%%bash
ssh -i ./infrastructure/flower_notebook_rsa -o "StrictHostKeyChecking=no" ubuntu@$CLIENT_2_IP "screen -d -L -m python3 client.py"

### Logs

Once the server is running, it is interesting to look at its logs. SSH is used to run `cat` on the screenlog file.

In [None]:
%%bash
ssh -i ./infrastructure/flower_notebook_rsa -o "StrictHostKeyChecking=no" ubuntu@$SERVER_IP "cat screenlog.0"

Client 1:

In [None]:
%%bash
ssh -i ./infrastructure/flower_notebook_rsa -o "StrictHostKeyChecking=no" ubuntu@$CLIENT_1_IP "cat screenlog.0"

Client 2:

In [None]:
%%bash
ssh -i ./infrastructure/flower_notebook_rsa -o "StrictHostKeyChecking=no" ubuntu@$CLIENT_2_IP "cat screenlog.0"

## Clean-up

As mentioned previously, the infrastructure needs to be destroyed after the experiments. For this purpose you can run the clean-up code segment from before. It is repeated here for ease of use.

In [None]:
%%bash
cd ./infrastructure
terraform destroy -auto-approve