## Single Party XGBoost on Data Subset
First, we'll train an XGBoost model on a subset of the data. This simulates the federated setting, in which each party holds only a subset of the data that would be available to a central trusted server for training. We'll then look at the performance of an XGBoost model trained on this subset alone.
![title](img/exercise1.png)

Import the necessary libraries.

In [1]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

Load and examine the comma-separated training data partition belonging to your party to get a better understanding of the data. The training data is located at `/data/<insurance or hospital>/<insurance or hospital>_training_<party ID>.csv`. For example, if you're party 2 and your federation is using the hospital dataset, your training data is at `/data/hospital/hospital_training_2.csv`.

You can use the `pandas.read_csv()` function to read in the data. Make sure to specify the `sep` and `header` arguments. If you're using the insurance dataset, `header=0`; if you're using the hospital dataset, `header=None`.
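As a minimal sketch of what the `header` argument changes, the example below parses two tiny in-memory CSV snippets. The column names `label`, `f1`, and `f2` and all values are made up for illustration and are not taken from the real insurance or hospital files.

```python
import io

import pandas as pd

# Hypothetical CSV snippets standing in for the real files under /data/.
csv_with_header = "label,f1,f2\n1,0.5,0.7\n0,0.1,0.9\n"
csv_without_header = "1,0.5,0.7\n0,0.1,0.9\n"

# header=0: the first line of the file supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_with_header), sep=",", header=0)

# header=None: every line is data, and columns are numbered 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(csv_without_header), sep=",", header=None)

print(list(with_header.columns))
print(list(without_header.columns))
print(with_header.shape, without_header.shape)
```

Passing `header=0` to a file that has no header row would silently consume the first data row as column names, so it is worth checking the shape and the first rows after loading.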

In [3]:
# TODO: Read in the comma separated training data using the pandas.read_csv() function 
# and print out the first few rows of the dataset using the .head() function

training_data_subset = pd.read_csv('/data/hospital/hospital_training_1.csv', sep=",", header=None)
training_data_subset.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 2007 | 25.23214 | -232.77465 | -37.51542 | -40.34335 | 56.11564 | -55.94831 | 43.06882 | 15.46278 | -38.6737 | ... | -15.64058 | 257.69408 | 113.5974 | -90.14988 | -13.41911 | -72.59105 | -185.49959 | 1.16272 | -73.13128 | -6.89193 |
| 1 | 2007 | 27.96974 | -166.08713 | -11.19265 | -28.07397 | -56.10902 | -35.47258 | 23.35854 | 7.19973 | -36.81179 | ... | 21.49227 | 289.05914 | -34.75972 | -19.38242 | 2.44006 | -67.78591 | -46.62749 | 0.38383 | 98.98315 | 13.14364 |
| 2 | 2007 | 24.75152 | -97.45055 | -40.15226 | -43.39929 | -57.25665 | -33.93026 | -1.95605 | 0.93121 | 7.76578 | ... | -5.96584 | 573.94557 | 11.83355 | -107.81947 | -3.42495 | -141.79299 | -150.794 | 0.55715 | 148.7149 | -2.41587 |
| 3 | 2007 | 20.19082 | -162.50028 | -123.04788 | -71.11772 | -8.96605 | -51.72176 | 30.5383 | 15.27979 | -34.99486 | ... | -73.13628 | 18.76005 | 46.07843 | -309.69087 | -24.52842 | -35.79334 | -774.53143 | 3.34849 | -194.68101 | -41.23842 |
| 4 | 2007 | 25.10092 | -189.85543 | -28.69605 | -34.42398 | 24.64007 | -55.86989 | 63.91339 | 17.88235 | -3.39713 | ... | -3.70478 | 40.14964 | 95.55738 | -36.47506 | -8.63102 | -34.57157 | -13.6361 | 8.25615 | 108.42127 | 3.51335 |


In [None]:
# The first column (position 0) is used as the label
y_train_subset = training_data_subset.iloc[:, 0]
y_train_subset.head()

In [None]:
# All remaining columns are used as the features
x_train_subset = training_data_subset.iloc[:, 1:]
x_train_subset.head()

Preprocess the test data.

Test data for the hospital dataset is located at `/data/hospital/hospital_test_{party_id}.csv`<br>
Test data for the insurance dataset is located at `/data/insurance/insurance_test_{party_id}.csv`
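The path patterns for the training and test files can be captured in a small helper so that the dataset name and party ID are set in one place. This is only a sketch: `data_path` is a hypothetical function introduced here for illustration, not part of the exercise environment.

```python
def data_path(dataset: str, split: str, party_id: int) -> str:
    """Build the CSV path for a party's data partition.

    dataset:  "hospital" or "insurance"
    split:    "training" or "test"
    party_id: the party's numeric ID
    """
    if dataset not in ("hospital", "insurance"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    if split not in ("training", "test"):
        raise ValueError(f"unknown split: {split!r}")
    return f"/data/{dataset}/{dataset}_{split}_{party_id}.csv"


print(data_path("hospital", "training", 2))
print(data_path("insurance", "test", 1))
```

The returned string can be passed directly to `pd.read_csv()` together with the `header` value that matches the dataset.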

In [None]:
# TODO: read in the test data and split it into features and labels
test_data_subset = pd.read_csv('/data/hospital/hospital_test_1.csv', sep=",", header=None)
y_test_subset = test_data_subset.iloc[:, 0]
x_test_subset = test_data_subset.iloc[:, 1:]

Train the model on the training data. Feel free to experiment with the hyperparameters. For example, you may want to adjust `n_estimators`, which sets the number of trees in the ensemble. For a full list of hyperparameters, see https://xgboost.readthedocs.io/en/latest/parameter.html

In [None]:
model = xgb.XGBClassifier()
# TODO: fit the model to the training data
model.fit(x_train_subset, y_train_subset)

Get predictions and evaluate the model on the test data. Feel free to use different evaluation metrics. We suggest the sklearn `accuracy_score()` function for classification.
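To make the metric concrete, the sketch below computes accuracy on a handful of made-up labels and predictions and compares it with a manual calculation. The arrays are illustrative only and unrelated to the exercise data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# accuracy_score returns the fraction of predictions that match the labels.
acc = accuracy_score(y_true, y_pred)

# The same quantity computed by hand.
manual = float(np.mean(y_true == y_pred))

# The classification error rate is the complement of accuracy.
error_rate = 1.0 - acc

print(acc, manual, error_rate)
```

Note that accuracy is a score where higher is better, whereas an error rate is lower-is-better, so be explicit about which one you report.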

In [None]:
# TODO: use the model to get predictions for the test set and calculate the prediction accuracy
preds = model.predict(x_test_subset)
print(accuracy_score(y_test_subset, preds))