## Multiparty XGBoost with Centralized Training
In this exercise, we'll demonstrate a workflow in which each party sends a copy of its own data to a central server. All the training data therefore travels over the network to the central server, which collects it and trains a model locally on the combined data. The central server then broadcasts the trained model back to the parties, who load it and test it on their local test datasets.

![title](img/exercise2.png)


We will also measure the number of bytes sent over the network to show how much bandwidth this workflow needs.
The exercise also shows the benefit of training on as much data as possible to make the model more robust.

### Data Transfer
Import the necessary libraries.

In [None]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from Utils import scp, PKI

As the aggregator, you don't strictly need this step, but it's helpful to see how many bytes would cross the network if you weren't the aggregator and had to send your training data to one. Send the training data you used in Exercise 1 over the network to your own `~/shared_data` directory, and note how many bytes are transferred.

* Training data for the insurance dataset is at `/data/insurance/insurance_training_{party_id}.csv`

In [None]:
# Instantiate the PKI to help with IP lookups
pki = PKI()
aggregator = "" # TODO: fill in your username here

In [None]:
# Make sure you use the training data you used in exercise 1
training_data = "/path/to/training_data" # TODO: fill in the path to the training data
my_ip = pki.lookup(aggregator)[0] # Get your IP
dest_dir = "~/shared_data"
scp(training_data, my_ip, dest_dir)

Data will be sent from all parties to your `~/shared_data/` directory.
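
To put a number on the transfer, one option is to read the size of the CSV file before sending it. The sketch below is only an illustration: `file_size_bytes` is a hypothetical helper, not part of `Utils`, and it is run on a throwaway file rather than the real training data. The file size is the payload only; the bytes actually on the wire will be somewhat higher because of protocol overhead.

```python
import os
import tempfile


def file_size_bytes(path):
    """Return the on-disk size of the file at `path` in bytes (hypothetical helper)."""
    return os.path.getsize(os.path.expanduser(path))


# Write a tiny stand-in CSV in binary mode so the byte count is the same on every platform.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"1,0.5,0.25\n0,0.1,0.9\n")
    demo_path = tmp.name

demo_size = file_size_bytes(demo_path)
print(f"{demo_path}: {demo_size} bytes")

# Clean up the throwaway file.
os.remove(demo_path)
```

Calling the same helper on your own training file gives the payload size for one party; summing it over all parties gives a rough lower bound on what the central server receives.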

### Aggregate the Received Data
Wait for all parties to send you their data, then load everything that has been sent to your machine. For example, if three other parties sent you data, make four calls to `read_csv()`: one for your own data and three for the other parties' data.

Concatenate all the data in preparation for training:

In [None]:
training_data_lst = []

# TODO: add the paths to all shared data to shared_data_path_lst
shared_data_path_lst = []

for path in shared_data_path_lst:
    training_data_subset = pd.read_csv(path, sep=",", header=None)
    training_data_lst.append(training_data_subset)

aggregated_training_data = pd.concat(training_data_lst) 
aggregated_training_data.shape

In [None]:
# Split the aggregated training data into features and labels
y_agg_train = aggregated_training_data.iloc[:, 0]
x_agg_train = aggregated_training_data.iloc[:, 1:]
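
As a sanity check on the steps above, here is a minimal sketch of the same concatenate-then-split pattern on made-up data. The two in-memory CSV strings stand in for files from two parties and are assumptions for illustration only; as in the exercise data, column 0 is treated as the label.

```python
import io

import pandas as pd

# Two tiny stand-in "party" files: column 0 is the label, the rest are features.
party_a_csv = "1,0.5,3.0\n0,0.1,4.0\n"
party_b_csv = "0,0.9,1.0\n"

frames = [
    pd.read_csv(io.StringIO(text), sep=",", header=None)
    for text in (party_a_csv, party_b_csv)
]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index.
toy_aggregated = pd.concat(frames, ignore_index=True)

# Positional split: first column is the label, remaining columns are features.
toy_labels = toy_aggregated.iloc[:, 0]
toy_features = toy_aggregated.iloc[:, 1:]
print(toy_aggregated.shape)
```

The row count of the aggregated frame should equal the sum of the row counts of the individual files, which is a quick way to confirm that no party's data was skipped.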

### Train a Model

In [None]:
arg1, arg2 = None, None  # TODO: replace with the aggregated features and labels

multiparty_model = xgb.XGBClassifier()
multiparty_model.fit(arg1, arg2)

### Broadcast the Trained Model
Save the trained model and send it to all parties in the federation. The model will be sent to the home directory of each party.
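
On the receiving side, a party would presumably load the file with `pickle.load`. The sketch below shows the save/load round trip with a plain dictionary standing in for the trained model, so it runs without any training; the temporary directory is also just for illustration. Keep in mind that unpickling a real model requires the same libraries to be installed on the receiving machine, and that pickle files should only be loaded from sources you trust.

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; any picklable Python object round-trips the same way.
stand_in_model = {"name": "ex2_model", "n_estimators": 100}

model_path = os.path.join(tempfile.mkdtemp(), "ex2_model.model")

# Sender side: serialize the object to disk.
with open(model_path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Receiver side: deserialize the object after the file arrives.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)
```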

In [None]:
import pickle

model_name = "ex2_model.model"
# Use a context manager so the file is flushed and closed before it is sent
with open(model_name, "wb") as f:
    pickle.dump(multiparty_model, f)

In [None]:
# If you're the central server, run this cell as many times as needed to send the saved model
# to all parties in the federation
model_file = "ex2_model.model"
dest_dir = "~"
dest_ips = []

# TODO: fill in the usernames of all members of your federation
# No need to include your own username here
members = []

for member in members:
    member_ip = pki.lookup(member)[0]
    dest_ips.append(member_ip)

for ip in dest_ips:
    scp(model_file, ip, dest_dir)

### Model Evaluation
Load in your local test data and preprocess it to split it into features and labels. Then evaluate your model.
* Test data for the insurance dataset is at `/data/insurance/insurance_test_{party_id}.csv`

In [None]:
# Load in your local test data and preprocess it to split it into features and labels
test_data_path = "path/to/test/data" # TODO: fill in the path to your test data
test_data = pd.read_csv(test_data_path, sep=",", header=None)
y_test = test_data.iloc[:, 0]
x_test = test_data.iloc[:, 1:]

In [None]:
arg1, arg2 = None, None  # TODO: set arg1 to the test features, arg2 to the test labels
preds = multiparty_model.predict(arg1)
print(accuracy_score(arg2, preds))
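
For reference, the accuracy reported above is the fraction of test rows whose predicted label matches the true label. The sketch below recomputes that quantity in plain Python on made-up labels; the `accuracy` function is a hypothetical stand-in written for illustration, not the scikit-learn implementation.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up labels and predictions: three of the four positions agree.
eval_labels = [1, 0, 1, 1]
eval_preds = [1, 0, 0, 1]
toy_accuracy = accuracy(eval_labels, eval_preds)
print(toy_accuracy)
```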

Discuss the results with other members of your federation. How did the centrally trained model perform on your local test data compared with the locally trained model? Did adding more data help?

Once you're ready, please move to [Exercise 3](./exercise3-aggregator.ipynb).