
flower-via-docker-compose example #2626

Merged
merged 52 commits into from Jan 25, 2024
Changes from 8 commits
c55cdf4
flower-via-docker-compose example
NikosVlachakis Nov 22, 2023
bda2dc1
flower-via-docker-compose minor change in the README.md
NikosVlachakis Nov 22, 2023
d4990bd
flower-via-docker-compose minor change in the README.md v2
NikosVlachakis Nov 22, 2023
08608f7
flower-via-docker-compose minor change in the README.md v3
NikosVlachakis Nov 22, 2023
fcf2aef
flower-via-docker-compose minor change in the README.md v4
NikosVlachakis Nov 22, 2023
d8cd075
adding docker ps view in README.md
NikosVlachakis Nov 23, 2023
cb706bb
grafana configuration for automatic dashboard discovery through grafa…
NikosVlachakis Nov 23, 2023
99ac45d
removing initial docker-compose file
NikosVlachakis Nov 23, 2023
8663dd0
adding UID in grafana,prometheus configs and updating README.md file
NikosVlachakis Nov 26, 2023
a02ed9b
adding flower + docker images
NikosVlachakis Nov 26, 2023
e4c4546
changing grafana's UI default screen and adding system/application me…
NikosVlachakis Nov 28, 2023
fd15bf8
mega dashboard
jafermarq Dec 18, 2023
9a6323a
not stacking
jafermarq Dec 18, 2023
e906b58
fixed time
jafermarq Dec 18, 2023
6a8f88e
updating graphs grafana
NikosVlachakis Dec 18, 2023
1a1b850
Merge branch 'main' of https://github.com/NikosVlachakis/flower
NikosVlachakis Dec 18, 2023
081e833
integrating flwr_datasets in data pipeline and updating README.md file
NikosVlachakis Dec 21, 2023
1ac0b87
simplifying strategy; load data once; bumped flwr 1.6; other minor ch…
jafermarq Dec 21, 2023
f5620b5
updating generate_docker_compose.py by passing the data_percentage as…
NikosVlachakis Dec 22, 2023
e03d5aa
updating README.md + simplifying logic in load_data.py
NikosVlachakis Dec 26, 2023
af60339
argparse to compose generator; minor tweaks readme
jafermarq Dec 27, 2023
711ea0b
adding random argument in generate_docker_compose file
NikosVlachakis Dec 27, 2023
5e4aaf8
small changes in readme.md file
NikosVlachakis Dec 27, 2023
c2ec018
remove unnecessary variables
NikosVlachakis Dec 27, 2023
ee1a076
bump python3.10; --random flag
jafermarq Dec 28, 2023
954df9e
adding % logic in the generate_docker_compose
NikosVlachakis Dec 28, 2023
f99b963
Merge branch 'main' of https://github.com/NikosVlachakis/flower
NikosVlachakis Dec 28, 2023
115f2ba
add % logic in generate_docker_compose.py
NikosVlachakis Dec 28, 2023
f02481d
Merge branch 'main' into main
jafermarq Dec 29, 2023
b9a0a6e
formatting
jafermarq Dec 29, 2023
87f1036
added minimal guide to run example
jafermarq Dec 29, 2023
d8c7f1d
minor tweaks
jafermarq Dec 29, 2023
d72f899
removing space at the end of the main folder's name
NikosVlachakis Dec 29, 2023
51ffed8
removing old directory
jafermarq Dec 29, 2023
dbd7b88
README got lost
jafermarq Dec 29, 2023
382dc5a
bringing back files from ios example
jafermarq Dec 29, 2023
829e5a9
updating readme.md file
NikosVlachakis Dec 30, 2023
7d1e558
fixing readme.md code
NikosVlachakis Dec 30, 2023
dd62259
removing html tags from readme.md code
NikosVlachakis Dec 30, 2023
4fdc64d
Format README.md with mdformat
NikosVlachakis Jan 3, 2024
617365f
format
jafermarq Jan 4, 2024
8910506
Merge branch 'main' into main
jafermarq Jan 4, 2024
9af1f43
Merge branch 'main' into main
jafermarq Jan 18, 2024
632154a
small changes in docker
NikosVlachakis Jan 19, 2024
2b4266a
Merge branch 'main' of https://github.com/NikosVlachakis/flower
NikosVlachakis Jan 19, 2024
9e57400
Merge branch 'main' into main
jafermarq Jan 23, 2024
c24e73f
added reference to top-level reamde; typo fix
jafermarq Jan 23, 2024
8a7b6ac
Remove public folder
NikosVlachakis Jan 25, 2024
0cae140
Merge branch 'main' of https://github.com/NikosVlachakis/flower
NikosVlachakis Jan 25, 2024
4122b39
adding back files from other examples
jafermarq Jan 25, 2024
7dee8ea
format
jafermarq Jan 25, 2024
4a496a2
Merge branch 'main' into main
jafermarq Jan 25, 2024
20 changes: 20 additions & 0 deletions examples/flower-via-docker-compose /.gitignore
@@ -0,0 +1,20 @@
# ignore __pycache__ directories
__pycache__/

# ignore .pyc files
*.pyc

# ignore .vscode directory
.vscode/

# ignore mlflow and mlruns directories
mlflow/
mlruns/
dataset/

# ignore .npz files
*.npz

# ignore .csv files
*.csv

19 changes: 19 additions & 0 deletions examples/flower-via-docker-compose /Dockerfile
@@ -0,0 +1,19 @@
# Use an official Python runtime as a parent image
FROM python:3.8-slim-buster

# Set the working directory in the container to /app
WORKDIR /app

# Copy the requirements file into the container
COPY ./requirements.txt /app/requirements.txt

# Install gcc and other dependencies
RUN apt-get update && apt-get install -y \
gcc \
python3-dev && \
rm -rf /var/lib/apt/lists/*

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt


94 changes: 94 additions & 0 deletions examples/flower-via-docker-compose /README.md
@@ -0,0 +1,94 @@
# Leveraging Flower and Docker for Device Heterogeneity Management in Federated Learning


## Introduction
In this example, we tackle device heterogeneity in federated learning, which arises from differences in memory and CPU capabilities across devices. This diversity affects training efficiency and inclusivity. We simulate the heterogeneity by setting CPU and memory limits for each container in a Docker setup, using a custom Docker Compose generator script. This approach creates a varied training environment and enables us to develop strategies to manage these disparities effectively.


## Handling Device Heterogeneity
1. **System Metrics Access**:
- Effective management of device heterogeneity begins with monitoring the system metrics of each container. We integrate the following services to achieve this:
- **cAdvisor**: Collects comprehensive resource-usage metrics from each Docker container.
- **Prometheus**: Configured via `prometheus.yml`, it scrapes data from cAdvisor at scheduled intervals and serves as a robust time-series database. Users can access the Prometheus UI at `http://localhost:9090` to create and run PromQL queries for detailed insight into container performance.
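As a rough illustration of the kind of PromQL query one might run against this setup, the snippet below builds an instant-query URL for Prometheus's HTTP API (a sketch only: `container_memory_usage_bytes` is a standard cAdvisor metric, and `client1` is one of this example's client container names; the code just constructs the URL, so it does not require the stack to be running):

```python
from urllib.parse import urlencode

# container_memory_usage_bytes is a standard cAdvisor metric;
# "client1" is one of the client container names in this example.
query = 'container_memory_usage_bytes{name="client1"}'

# Prometheus serves instant queries at /api/v1/query on port 9090.
url = "http://localhost:9090/api/v1/query?" + urlencode({"query": query})
print(url)
```

The same expression can be pasted directly into the query box of the Prometheus UI at `http://localhost:9090`.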

2. **Mitigating Heterogeneity**:
- In this basic use case, we address device heterogeneity by establishing rules tailored to each container's system capabilities. This involves modifying training parameters, such as batch sizes and learning rates, based on each device's memory capacity and CPU availability. These settings are specified in the `client_configs` array in the `create_docker_compose` script. For example:

```python
client_configs = [
{'mem_limit': '3g', 'batch_size': 32, "cpus": 3.5, 'learning_rate': 0.001},
{'mem_limit': '4g', 'batch_size': 64, "cpus": 3, 'learning_rate': 0.02},
{'mem_limit': '5g', 'batch_size': 128, "cpus": 2.5, 'learning_rate': 0.09},
{'mem_limit': '6g', 'batch_size': 256, "cpus": 1, 'learning_rate': 0.15}
]
```
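To illustrate how such per-client limits could map onto Compose service definitions, the sketch below builds one service entry per config. This is an illustrative sketch, not the actual `create_docker_compose` script: the service names and build context are assumptions, though `--batch_size` and `--learning_rate` are real flags of this example's `client.py`, and `mem_limit`/`cpus` are standard Compose keys:

```python
client_configs = [
    {"mem_limit": "3g", "batch_size": 32, "cpus": 3.5, "learning_rate": 0.001},
    {"mem_limit": "4g", "batch_size": 64, "cpus": 3, "learning_rate": 0.02},
]


def make_service(index: int, cfg: dict) -> dict:
    """Build one Compose service entry enforcing the given resource limits."""
    return {
        f"client{index}": {
            "build": ".",
            "command": [
                "python",
                "client.py",
                f"--batch_size={cfg['batch_size']}",
                f"--learning_rate={cfg['learning_rate']}",
            ],
            # Compose-level resource caps simulate a weaker device.
            "mem_limit": cfg["mem_limit"],
            "cpus": cfg["cpus"],
        }
    }


services = {}
for i, cfg in enumerate(client_configs, start=1):
    services.update(make_service(i, cfg))
```

Serializing `{"services": services}` to YAML would yield the corresponding fragment of a `docker-compose.yml` file.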


## Installation and Setup
To get the project up and running, follow these steps:

### Prerequisites
Before starting, ensure the following prerequisites are met:

- **Docker Installation**: Docker must be installed and the Docker daemon running on your server. If you don't already have Docker installed, you can get [installation instructions for your specific Linux distribution from Docker](https://docs.docker.com/engine/install/).


### Step 1: Configure Docker Compose
1. **Generate Docker Compose File**:
- Execute the following command to run the `helpers/generate_docker_compose.py` script. This script creates the docker-compose configuration needed to set up the environment.
```bash
python helpers/generate_docker_compose.py
```
- Within the script, specify the number of clients (`total_clients`), the number of training rounds (`number_of_rounds`), and resource limitations for each client in the `client_configs` array.

### Step 2: Build and Launch Containers
1. **Execute Initialization Script**:
- Run the `docker_init.sh` script to build the Docker images and start the Docker Compose process. Use the following command:
```bash
./docker_init.sh
```

2. **Services Startup**:
- The script will launch several services as defined in your `docker-compose.yml` file:
- **Monitoring Services**: Prometheus for metrics collection, cAdvisor for container monitoring, and Grafana for data visualization.
- **Flower Federated Learning Environment**: The Flower server and client containers are initialized and start running.
- After launching the services, verify that all Docker containers are running correctly by executing the `docker ps` command. Here's an example output:
```bash
➜ ~ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
72063c8968d3 flower-via-docker-compose-client3 "python client.py --…" 12 minutes ago Up 13 seconds 0.0.0.0:6003->6003/tcp client3
77ca59fc42e6 flower-via-docker-compose-client2 "python client.py --…" 12 minutes ago Up 13 seconds 0.0.0.0:6002->6002/tcp client2
2dc33f0b4ef6 flower-via-docker-compose-client1 "python client.py --…" 12 minutes ago Up 13 seconds 0.0.0.0:6001->6001/tcp client1
8d87f3655476 flower-via-docker-compose-server "python server.py --…" 12 minutes ago Up 13 seconds 0.0.0.0:6000->6000/tcp, 0.0.0.0:8265->8265/tcp server
dbcd8cf1faf1 grafana/grafana:latest "/run.sh --config=/e…" 12 minutes ago Up 5 minutes 0.0.0.0:3000->3000/tcp grafana
80c4a599b2a3 prom/prometheus:latest "/bin/prometheus --c…" 12 minutes ago Up 5 minutes 0.0.0.0:9090->9090/tcp prometheus
169880ab80bd gcr.io/cadvisor/cadvisor:v0.47.0 "/usr/bin/cadvisor -…" 12 minutes ago Up 5 minutes (healthy) 0.0.0.0:8080->8080/tcp cadvisor
```

3. **Automated Grafana Configuration**:
- Grafana is configured to automatically load pre-defined data sources and dashboards for immediate monitoring, via two provisioning files: `prometheus-datasource.yml` for data sources and `default_dashboard.json` for dashboards. Both live in the `./config/provisioning/` directory of the project and are mounted into the Grafana container through Docker Compose volume mappings, so Grafana starts pre-configured for monitoring without any manual setup.
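For reference, a minimal data-source provisioning file of the kind described above might look like the following (a sketch based on Grafana's standard provisioning format; the exact contents of this example's `prometheus-datasource.yml` may differ, and the `url` assumes the Prometheus container is reachable under the Compose service name `prometheus`):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```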

4. **Begin Training Process**:
- The federated learning training automatically begins once all client containers are successfully connected to the Flower server. This synchronizes the learning process across all participating clients.


By following these steps, you will have a fully functional federated learning environment with simulated device heterogeneity and monitoring capabilities.



## Monitoring with Grafana
1. **Access and Customize Grafana Dashboard**:
- Open `http://localhost:3000` to access Grafana. Thanks to the automated setup, Grafana will already have Prometheus as a data source and a pre-configured monitoring dashboard, similar to the example provided below.
- You can further customize or create new dashboards as per your requirements.

2. **Grafana Dashboard Example**:
Below is an example of a Grafana dashboard showing a bar chart of memory usage for a specific client container:


<img src="public/grafana-memory-usage.png" alt="Grafana Memory Usage Histogram" width="600"/>


This chart offers a visual representation of the container's memory usage over time, highlighting the contrast in resource utilization between training and non-training periods: memory consumption is noticeably higher during active training phases than when the container is idle.

## Conclusion
This project serves as a foundational example of managing device heterogeneity within the federated learning context, employing the Flower framework alongside Docker, Prometheus, and Grafana. It's designed to be a starting point for users to explore and further adapt to the complexities of device heterogeneity in federated learning environments.
90 changes: 90 additions & 0 deletions examples/flower-via-docker-compose /client.py
@@ -0,0 +1,90 @@
import argparse
import logging
import os

# Make TensorFlow log less verbose (must be set before tensorflow is imported)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import flwr as fl
import tensorflow as tf

from helpers.load_data import load_data
from model.model import Model

logging.basicConfig(level=logging.INFO)  # Configure logging
logger = logging.getLogger(__name__)  # Create a logger for the module

# Parse command line arguments
parser = argparse.ArgumentParser(description='Flower client')

parser.add_argument('--server_address', type=str, default="server:8080")
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--learning_rate', type=float, default=0.1)

args = parser.parse_args()

# Create an instance of the model and pass the learning rate as an argument
model = Model(learning_rate=args.learning_rate)

# Compile the model
model.compile()

class Client(fl.client.NumPyClient):
def __init__(self, args):
self.args = args

def get_parameters(self, config):
# Return the parameters of the model
return model.get_model().get_weights()


def fit(self, parameters, config):

# Set the weights of the model
model.get_model().set_weights(parameters)

# Load the training dataset and get the number of examples
train_dataset, _, num_examples_train, _ = load_data(batch_size=self.args.batch_size)

# Train the model
history = model.get_model().fit(train_dataset)

# Calculate evaluation metric
results = {
"accuracy": float(history.history["accuracy"][-1]),
}

# Get the parameters after training
parameters_prime = model.get_model().get_weights()

# Directly return the parameters and the number of examples trained on
return parameters_prime, num_examples_train, results



def evaluate(self, parameters, config):

# Set the weights of the model
model.get_model().set_weights(parameters)

# Use the test dataset for evaluation
_, test_dataset, _, num_examples_test = load_data(batch_size=self.args.batch_size)

# Evaluate the model and get the loss and accuracy
loss, accuracy = model.get_model().evaluate(test_dataset)

# Return the loss, the number of examples evaluated on and the accuracy
return float(loss), num_examples_test, {"accuracy": float(accuracy)}


# Function to Start the Client
def start_fl_client():
try:
fl.client.start_numpy_client(server_address=args.server_address, client=Client(args))
except Exception as e:
logger.error("Error starting FL client: %s", e)
return {"status": "error", "message": str(e)}


if __name__ == "__main__":
# Call the function to start the client
start_fl_client()
12 changes: 12 additions & 0 deletions examples/flower-via-docker-compose /config/grafana.ini
@@ -0,0 +1,12 @@
[security]
allow_embedding = true
admin_user = admin
admin_password = admin

[dashboards]
default_home_dashboard_path = /etc/grafana/provisioning/dashboards/default_dashboard.json

[auth.anonymous]
enabled = true
org_name = Main Org.
org_role = Admin
14 changes: 14 additions & 0 deletions examples/flower-via-docker-compose /config/prometheus.yml
@@ -0,0 +1,14 @@

global:
scrape_interval: 1s
evaluation_interval: 1s

rule_files:
scrape_configs:
- job_name: 'cadvisor'
scrape_interval: 1s
metrics_path: '/metrics'
static_configs:
- targets: ['host.docker.internal:8080']
labels:
group: 'cadvisor'
@@ -0,0 +1,139 @@
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": {
"type": "grafana",
"uid": "-- Grafana --"
},
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"fiscalYearStartMonth": 0,
"graphTooltip": 0,
"id": 2,
"links": [],
"liveNow": false,
"panels": [
{
"datasource": {
"type": "prometheus",
"uid": "db69454e-e558-479e-b4fc-80db52bf91da"
},
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"fillOpacity": 80,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"lineWidth": 1,
"scaleDistribution": {
"type": "linear"
},
"thresholdsStyle": {
"mode": "off"
}
},
"mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "green",
"value": null
},
{
"color": "red",
"value": 80
}
]
}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"id": 1,
"options": {
"barRadius": 0,
"barWidth": 0.97,
"fullHighlight": false,
"groupWidth": 0.7,
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": true
},
"orientation": "auto",
"showValue": "auto",
"stacking": "none",
"tooltip": {
"mode": "single",
"sort": "none"
},
"xTickLabelRotation": 0,
"xTickLabelSpacing": 0
},
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "db69454e-e558-479e-b4fc-80db52bf91da"
},
"disableTextWrap": false,
"editorMode": "builder",
"expr": "container_memory_usage_bytes{name=\"client1\"}",
"fullMetaSearch": false,
"includeNullMetadata": true,
"instant": false,
"legendFormat": "__auto",
"range": true,
"refId": "A",
"useBackend": false
}
],
"title": "Panel Title",
"type": "barchart"
}
],
"refresh": "",
"schemaVersion": 38,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "barchart_memory",
"uid": "cd0d5026-20aa-4614-9dfe-0c14f1d6522f",
"version": 1,
"weekStart": ""
}