## Multiparty XGBoost with Federated Training
We will now discuss running XGBoost in the federated setting. Unlike in the previous exercise, all data stays on its respective machine in the federated setting. This eliminates the need to transfer raw data over the network, which incurs high overhead and requires significant bandwidth. Instead, in each iteration, each party sends a summary of the update made to its model. The central server then aggregates these updates, applies the aggregated update to its model, and broadcasts the new model to all parties. The parties then train locally with the new model and send their updates to the central server.

![title](img/exercise3.png)

In our project, all this is abstracted away. The central server simply starts the training, and everything else is performed automatically.
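
To make the round structure described above more concrete, here is a toy sketch of a single aggregation step. It is not the project's actual protocol: the functions `local_summary` and `aggregate`, the sample labels, and the choice of gradient statistics for the logistic loss as the "summary" are all assumptions made for illustration.

```python
# Toy illustration of one federated round; NOT the project's actual protocol.
# Each party summarizes its local data as gradient statistics for the
# logistic loss, and only these summaries reach the aggregator.

def local_summary(labels, prediction):
    """Return (sum of gradients, sum of Hessians) for one party's labels."""
    grad_sum = sum(prediction - y for y in labels)
    hess_sum = sum(prediction * (1.0 - prediction) for _ in labels)
    return grad_sum, hess_sum

def aggregate(summaries):
    """Aggregator: add up the per-party summaries element-wise."""
    total_grad = sum(g for g, _ in summaries)
    total_hess = sum(h for _, h in summaries)
    return total_grad, total_hess

# Hypothetical local label sets; these never leave their party.
party_labels = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
current_prediction = 0.5  # shared model state broadcast by the aggregator

summaries = [local_summary(labels, current_prediction) for labels in party_labels]
total_grad, total_hess = aggregate(summaries)

# Newton-style update (no regularization) applied by the aggregator.
update = -total_grad / total_hess

# The same update computed centrally on the pooled data, for comparison.
pooled = [y for labels in party_labels for y in labels]
central_grad, central_hess = local_summary(pooled, current_prediction)
central_update = -central_grad / central_hess
print(update, central_update)
```

In this toy case the aggregated update matches the one computed on the pooled data, even though the aggregator only ever sees two numbers per party.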

Import some helper functions.

In [None]:
import pandas as pd
import subprocess
from Utils import network_analysis, start_job

### Edit hosts.config
The `hosts.config` file should contain the IPs and SSH ports of all parties in the federation. 
Retrieve the IPs of all members in the federation from the PKI and write them to the `hosts.config` file.

In [None]:
# Get the IPs of all members in your federation and add them to hosts.config
from Utils import PKI

members = ["chester", "rishabh", "wenting"]
pki = PKI()
with open("hosts.config", "w+") as hosts:
    for member in members:
        ip, _key = pki.lookup(member)

        # Write the member's IP address and port 5522 to hosts.config
        hosts.write(f"{ip}:5522\n")
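
Before starting a job, it can help to read the file back and confirm that every line has the `IP:port` form written above. The sketch below is one possible check. `parse_hosts` is a hypothetical helper, not part of `Utils`, and the demo addresses are placeholders written to a temporary file so that the real `hosts.config` is left untouched.

```python
import ipaddress
import os
import tempfile

def parse_hosts(path):
    """Parse lines of the form '<ip>:<port>' and return (ip, port) tuples."""
    entries = []
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            ip_text, _, port_text = line.rpartition(":")
            ip = ipaddress.ip_address(ip_text)  # raises ValueError if malformed
            port = int(port_text)
            if not 0 < port < 65536:
                raise ValueError(f"line {line_number}: port {port} out of range")
            entries.append((str(ip), port))
    return entries

# Demo with placeholder addresses in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "hosts.config")
    with open(demo_path, "w") as f:
        f.write("10.0.0.1:5522\n10.0.0.2:5522\n")
    entries = parse_hosts(demo_path)

print(entries)
```

The length of the returned list is also the number of parties listed in the file.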

### Set Variables For Network Analysis
We'll also walk you through inspecting packets during this tutorial to make sure that the network topology is indeed federated. For each variable below, fill in the corresponding IP (don't worry about the ordering of the worker nodes).

In [None]:
master = '0'
worker_1 = '1'
worker_2 = '2'
worker_3 = '3'

### Training Script
We will now examine the script that will be run for federated training. We've written the training script for this part for you. Load it in by running the following cell. The contents of the script should appear in the cell. 

You, the aggregator, control the training. Feel free to play with the `params` argument passed into the `train()` function. A list of possible parameters and their descriptions can be found [here](https://xgboost.readthedocs.io/en/latest/parameter.html). 

Note: the top of this cell should look like the following before you run it.

```python
%%writefile train_model.py
from Utils import FederatedXGBoost
...
```

The `%%writefile train_model.py` line is a cell magic that writes the contents of the cell to disk as `train_model.py` when you run the cell. In other words, running the cell saves your changes to disk.


In [2]:
%%writefile train_model.py
from Utils import FederatedXGBoost

# Instantiate a FederatedXGBoost instance
fxgb = FederatedXGBoost()

# Get number of federating parties
print("Number of parties in federation: ", fxgb.get_num_parties())

# Load training data
training_data_path = "/data/insurance/insurance_training.csv"
fxgb.load_training_data(training_data_path)

# Train a model
params = {'max_depth': 3, "objective": "binary:logistic"}
num_rounds = 100
fxgb.train(params, num_rounds)

# Save the model
fxgb.save_model("ex3_model.model")

# Shutdown
fxgb.shutdown()

Overwriting train_model.py
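
As a starting point for experimenting with `params`, the fragment below lists a few standard XGBoost parameters. The values are arbitrary examples rather than tuned recommendations, and the sketch assumes that `fxgb.train()` forwards the dictionary to XGBoost unchanged, which the source does not confirm.

```python
# Alternative settings to try in train_model.py; the values are arbitrary.
params = {
    "objective": "binary:logistic",  # same objective as the script above
    "max_depth": 4,                  # maximum depth of each tree
    "eta": 0.1,                      # learning rate (step size shrinkage)
    "subsample": 0.8,                # fraction of rows sampled per tree
    "eval_metric": "logloss",        # metric reported during evaluation
}
num_rounds = 50  # number of boosting rounds
```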


### Using tcpdump to Capture Packets
We will be using `tcpdump` to monitor the network traffic during training. The cell below spawns a process that records all network traffic on the `eth0` interface and writes it to `capture.pcap`.

In [None]:
# Pass the command as a list (no shell) so that terminate() signals tcpdump itself
tcpdump_cmd = ['tcpdump', '-ni', 'eth0', '-s0', '-w', 'capture.pcap']
tcpdump_process = subprocess.Popen(tcpdump_cmd, stdout=subprocess.PIPE)

### Start Job
After modifying the script, we can start our job! We can use the `start_job()` helper function to do so.
`start_job(num_parties, memory, script_path)` takes in three parameters:
* `num_parties`: The number of parties in the federation. This should be the same as the number of IPs added to `hosts.config`.
* `memory`: The amount of memory to use for this job on each party's machine.
* `script_path`: The absolute path to the script we want to run.

In [None]:
# num_parties must match the number of entries written to hosts.config
start_job(len(members), 3, "/home/$USER/train_model.py")

Kill the `tcpdump` process once training has finished, as we no longer need to monitor network traffic.

In [None]:
tcpdump_process.terminate()
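
`terminate()` only sends the signal; it does not wait for the process to exit. If the capture file is read immediately afterwards, it may be safer to wait for the child first. The sketch below shows the pattern on a stand-in child process (a sleeping Python interpreter, not `tcpdump`), and the 5-second timeout is an arbitrary choice.

```python
import subprocess
import sys

# Stand-in for the tcpdump process: a child that would otherwise run for a minute.
capture = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

capture.terminate()          # request termination, as in the cell above
try:
    capture.wait(timeout=5)  # block until the child has really exited
except subprocess.TimeoutExpired:
    capture.kill()           # escalate if it did not exit in time
    capture.wait()

print("exit status:", capture.returncode)
```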

## Network Analysis Results
In the federated setting, parties don't communicate with each other; they only communicate with the central aggregator. We've monitored network traffic in this section to show this isolated communication, and also to show that communicating updates (as in the federated setting) requires less bandwidth than transferring whole raw datasets (as in the centralized training scenario from Exercise 2). 

Running the cell below converts and preprocesses the `.pcap` file produced by `tcpdump` using pandas, so that we can visualize the bytes transmitted during training.

In [None]:
counts = network_analysis(master, worker_1, worker_2, worker_3)
counts
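
The kind of check described above can be sketched as follows. This is not the implementation of `network_analysis`: the helpers `bytes_per_pair` and `worker_to_worker`, the record format `(source, destination, length)`, and the addresses are all assumptions made for illustration.

```python
from collections import Counter

def bytes_per_pair(packets):
    """Sum packet lengths for each (source, destination) pair."""
    totals = Counter()
    for src, dst, length in packets:
        totals[(src, dst)] += length
    return totals

def worker_to_worker(totals, workers):
    """Return traffic where both endpoints are workers (expected to be empty)."""
    return {pair: n for pair, n in totals.items()
            if pair[0] in workers and pair[1] in workers}

# Hypothetical packet records; real ones would come from the capture file.
aggregator_ip = "10.0.0.1"
worker_ips = {"10.0.0.2", "10.0.0.3"}
packets = [
    ("10.0.0.2", aggregator_ip, 1200),
    (aggregator_ip, "10.0.0.2", 800),
    ("10.0.0.3", aggregator_ip, 1100),
    (aggregator_ip, "10.0.0.3", 800),
    ("10.0.0.2", aggregator_ip, 300),
]

pair_counts = bytes_per_pair(packets)
violations = worker_to_worker(pair_counts, worker_ips)
print(dict(pair_counts))
print("worker-to-worker traffic:", violations)
```

An empty `violations` dictionary is what a federated (star-shaped) topology should produce.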

## Model Evaluation
We'll now use the model we trained in the previous step to make predictions on our test data. Load in the federated model, preprocess your test data, and evaluate the model with the test data.

* Test data for the insurance dataset is at `/data/insurance/insurance_test_{party_id}.csv`

In [None]:
import xgboost as xgb

model_path = "ex3_model.model"
multiparty_model = xgb.Booster()
multiparty_model.load_model(model_path)

In [None]:
test_data_path = "/path/to/test/data" # TODO: replace this with the path to your test data
test_data_subset = pd.read_csv(test_data_path, sep=",", header=None)
y_test_subset = test_data_subset.iloc[:, 0]
x_test_subset = test_data_subset.iloc[:, 1:]
test_data = xgb.DMatrix(x_test_subset, label=y_test_subset)

In [None]:
multiparty_model.eval(test_data)
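
`eval()` reports the model's evaluation metric as a string. If you also want a plain accuracy figure, one option is to threshold predicted probabilities yourself. The sketch below uses a hypothetical `accuracy` helper and made-up probabilities. In the notebook, the probabilities would presumably come from `multiparty_model.predict(test_data)`, which returns probabilities for the `binary:logistic` objective.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(int(p >= threshold) == int(y)
                  for p, y in zip(probabilities, labels))
    return correct / len(labels)

# Made-up probabilities and labels, for illustration only.
example_probs = [0.9, 0.2, 0.6, 0.4]
example_labels = [1, 0, 0, 1]
example_accuracy = accuracy(example_probs, example_labels)
print(example_accuracy)
```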