
# Horizontal Federated XGBoost

> The following code is for demonstration only. Do not use it directly in production.

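In this tutorial, we will learn how to use SecretFlow to train horizontally federated tree models. SecretFlow provides tree modeling for horizontal scenarios through `SFXgboost`. `SFXgboost` is similar to `XGBoost`, so you can easily convert an existing XGBoost program into a SecretFlow federated model.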

## XGBoost

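XGBoost is an optimized distributed gradient boosting library designed to be efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework.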

For details, see the official XGBoost tutorials.
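To make the idea of gradient boosting concrete, here is a minimal sketch in plain Python. It is not how XGBoost is implemented: it assumes a single feature, squared-error loss, and depth-1 trees (stumps), and all function names are hypothetical. Each round fits a stump to the residuals of the current ensemble and adds it with a learning rate `eta`.

```python
# Minimal gradient boosting for squared error with depth-1 trees (stumps).
# Illustrative only: XGBoost additionally uses second-order gradients,
# regularization, and histogram-based split finding.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, num_round=4, eta=0.3):
    """Fit num_round stumps, each on the residuals of the current ensemble."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(num_round):
        # For squared error, the negative gradient is the residual y - pred.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + eta * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
_, preds_1 = boost(xs, ys, num_round=1)
_, preds_4 = boost(xs, ys, num_round=4)
print(mse(ys, preds_1), mse(ys, preds_4))  # training error shrinks with more rounds
```

The `num_round` and `eta` arguments play the same role as the number of boosting rounds and the `eta` hyperparameter used later in this tutorial.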

### Prepare SecretFlow devices

In [1]:
%load_ext autoreload
%autoreload 2

import secretflow as sf

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

# In case you have a running secretflow runtime already.
sf.shutdown()

sf.init(['alice', 'bob', 'charlie'], address='local')
alice, bob, charlie = sf.PYU('alice'), sf.PYU('bob'), sf.PYU('charlie')

The version of SecretFlow: 1.8.0b0


2024-08-21 14:18:11,694	INFO worker.py:1724 -- Started a local Ray instance.


### XGBoost Example

In [2]:
import xgboost as xgb
import pandas as pd
from secretflow.utils.simulation.datasets import dataset

df = pd.read_csv(dataset('dermatology'))
df = df.fillna(value=0)  # fillna returns a new DataFrame, so assign the result
print(df.dtypes)
y = df['class']
y = y - 1
x = df.drop(columns="class")
dtrain = xgb.DMatrix(x, y)
dtest = dtrain
params = {
    'max_depth': 4,
    'objective': 'multi:softmax',
    'min_child_weight': 1,
    'max_bin': 10,
    'num_class': 6,
    'eval_metric': 'merror',
}
num_round = 4
watchlist = [(dtrain, 'train')]
bst = xgb.train(params, dtrain, num_round, evals=watchlist, early_stopping_rounds=2)

erythema                                      int64
scaling                                       int64
definite_borders                              int64
itching                                       int64
koebner_phenomenon                            int64
polygonal_papules                             int64
follicular_papules                            int64
oral_mucosal_involvement                      int64
knee_and_elbow_involvement                    int64
scalp_involvement                             int64
family_history                                int64
melanin_incontinence                          int64
eosinophils_in_the_infiltrate                 int64
pnl_infiltrate                                int64
fibrosis_of_the_papillary_dermis              int64
exocytosis                                    int64
acanthosis                                    int64
hyperkeratosis                                int64
parakeratosis                                 int64
clubbing_of_

In [3]:
df.head()

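| | erythema | scaling | definite_borders | itching | koebner_phenomenon | polygonal_papules | follicular_papules | oral_mucosal_involvement | knee_and_elbow_involvement | scalp_involvement | ... | disappearance_of_the_granular_layer | vacuolisation_and_damage_of_basal_layer | spongiosis | saw-tooth_appearance_of_retes | follicular_horn_plug | perifollicular_parakeratosis | inflammatory_monoluclear_inflitrate | band-like_infiltrate | age | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 2 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 55.0 | 2 |
| 1 | 3 | 3 | 3 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 8.0 | 1 |
| 2 | 2 | 1 | 2 | 3 | 1 | 3 | 0 | 3 | 0 | 0 | ... | 0 | 2 | 3 | 2 | 0 | 0 | 2 | 3 | 26.0 | 3 |
| 3 | 2 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 40.0 | 1 |
| 4 | 2 | 3 | 2 | 2 | 2 | 2 | 0 | 2 | 0 | 0 | ... | 2 | 3 | 2 | 3 | 0 | 0 | 2 | 3 | 45.0 | 3 |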


### So how do we do federated XGBoost in SecretFlow?

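1. Use an iterative federated binning method to compute global bin information jointly over all parties' data. The bins serve as candidate splits for the subsequent tree-building process.
2. Feed the data into each client's XGBoost engine and compute G and H (the gradients and Hessians).
3. Run the federated tree-building process:
   1. Reassign the data to the nodes that are waiting to be split;
   2. Compute `sum_of_grad` and `sum_of_hess` from the precomputed bins;
   3. Send them to the server. The server performs secure aggregation, selects the split information, and sends it back to the clients;
   4. The clients update their local models;
4. Finish training and save the model.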
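Steps 3.2 and 3.3 above can be sketched in plain Python for a single feature. This is an assumption-laden illustration, not SecretFlow's implementation: the function names are hypothetical, plain summation stands in for secure aggregation, `min_child_weight` is ignored, and the gain is the standard XGBoost split gain with the `lambda` and `gamma` regularizers.

```python
# Sketch of steps 3.2-3.3: each client builds per-bin gradient statistics,
# the server sums them and scans the bins for the best split of one feature.

def local_histogram(bin_ids, grads, hesses, num_bins):
    """Sum grad and hess per bin for one client's samples."""
    sum_of_grad = [0.0] * num_bins
    sum_of_hess = [0.0] * num_bins
    for b, g, h in zip(bin_ids, grads, hesses):
        sum_of_grad[b] += g
        sum_of_hess[b] += h
    return sum_of_grad, sum_of_hess


def aggregate(histograms):
    """Element-wise sum of the clients' (sum_of_grad, sum_of_hess) pairs."""
    grads = [sum(col) for col in zip(*(h[0] for h in histograms))]
    hesses = [sum(col) for col in zip(*(h[1] for h in histograms))]
    return grads, hesses


def best_split(sum_of_grad, sum_of_hess, reg_lambda=0.1, gamma=0.0):
    """Return (gain, bin_index); samples with bin <= bin_index go left."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    total_g, total_h = sum(sum_of_grad), sum(sum_of_hess)
    best = (0.0, None)  # (gain, bin_index); None means no useful split
    left_g = left_h = 0.0
    for i in range(len(sum_of_grad) - 1):
        left_g += sum_of_grad[i]
        left_h += sum_of_hess[i]
        right_g, right_h = total_g - left_g, total_h - left_h
        gain = 0.5 * (
            score(left_g, left_h) + score(right_g, right_h) - score(total_g, total_h)
        ) - gamma
        if gain > best[0]:
            best = (gain, i)
    return best


# Two clients hold different rows; bin ids come from the shared global binning.
alice_hist = local_histogram([0, 0, 1], [-1.0, -1.0, -0.5], [1.0, 1.0, 1.0], num_bins=3)
bob_hist = local_histogram([1, 2, 2], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], num_bins=3)
global_grad, global_hess = aggregate([alice_hist, bob_hist])
gain, split_bin = best_split(global_grad, global_hess)
print(global_grad, global_hess, split_bin)
```

Because every client uses the same global bins, the server only needs the summed histograms, never the raw rows, to choose a split.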

Create three entities [Alice, Bob, Charlie] in the SecretFlow environment, where `Alice` and `Bob` are the clients and `Charlie` is the server. You are then ready to start `Federate Boosting`.
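To illustrate why the server can sum the clients' statistics without seeing any individual value, here is a minimal sketch of pairwise additive masking, one common way to build secure aggregation. It is an assumption for illustration only and makes no claim about how SecretFlow's `SecureAggregator` works internally. The shared `seed` stands in for keys that real protocols agree on pairwise, and values are integers (floats would need a fixed-point encoding).

```python
import random

MODULUS = 2**32  # all arithmetic is done modulo this value


def pairwise_masks(parties, seed=0):
    """For each pair (a, b), a adds a random mask and b subtracts the same one."""
    rng = random.Random(seed)
    masks = {p: 0 for p in parties}
    for i, a in enumerate(parties):
        for b in parties[i + 1:]:
            m = rng.randrange(MODULUS)
            masks[a] = (masks[a] + m) % MODULUS
            masks[b] = (masks[b] - m) % MODULUS
    return masks


def mask_value(value, mask):
    """What a client sends to the server instead of its raw value."""
    return (value + mask) % MODULUS


def server_sum(masked_values):
    """The masks cancel in the sum, so the server recovers only the total."""
    return sum(masked_values) % MODULUS


parties = ['alice', 'bob']
local_values = {'alice': 17, 'bob': 25}
masks = pairwise_masks(parties)
masked = {p: mask_value(local_values[p], masks[p]) for p in parties}
total = server_sum(masked.values())
print(masked, total)  # total is 42, the sum of the two local values
```

In the setup used here, the clients would play the role of `parties` and the server would only ever call something like `server_sum`.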

### Prepare data

In [3]:
from secretflow.data.horizontal import read_csv
from secretflow.security.aggregation import SecureAggregator
from secretflow.security.compare import SPUComparator
from secretflow.utils.simulation.datasets import load_dermatology

aggr = SecureAggregator(charlie, [alice, bob])
spu = sf.SPU(sf.utils.testing.cluster_def(['alice', 'bob']))
comp = SPUComparator(spu)
data = load_dermatology(parts=[alice, bob], aggregator=aggr, comparator=comp)
data.fillna(value=0, inplace=True)

INFO:root:Create proxy actor <class 'secretflow.device.proxy.Actor_Masker'> with party alice.


INFO:root:Create proxy actor <class 'secretflow.device.proxy.Actor_Masker'> with party bob.
INFO:root:Create proxy actor <class 'secretflow.device.proxy.ActorPartitionAgent'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.device.proxy.ActorPartitionAgent'> with party bob.


In [4]:
data

HDataFrame(partitions={PYURuntime(alice): <secretflow.data.core.partition.Partition object at 0x7fb30860f1f0>, PYURuntime(bob): <secretflow.data.core.partition.Partition object at 0x7fb30860ffd0>}, aggregator=<secretflow.security.aggregation.secure_aggregator.SecureAggregator object at 0x7fb3a2fcd4e0>, comparator=SPUComparator(device=<secretflow.device.device.spu.SPU object at 0x7fb3a2fccf70>))

### Prepare hyperparameters

In [5]:
params = {
    # XGBoost parameter tutorial
    # https://xgboost.readthedocs.io/en/latest/parameter.html
    'max_depth': 4,  # max depth
    'eta': 0.3,  # learning rate
    'objective': 'multi:softmax',  # objective function; supports "binary:logistic", "reg:logistic", "multi:softmax", "multi:softprob", "reg:squarederror"
    'min_child_weight': 1,  # minimum sum of instance weight (hessian) needed in a child
    'lambda': 0.1,  # L2 regularization term on weights (xgb's lambda)
    'alpha': 0,  # L1 regularization term on weights (xgb's alpha)
    'max_bin': 10,  # Max num of binning
    'num_class': 6,  # Only required in multi-class classification
    'gamma': 0,  # same as min_impurity_split: the minimum gain required to make a split
    'subsample': 1.0,  # Subsample rate by rows
    'colsample_bytree': 1.0,  # Feature selection rate by tree
    'colsample_bylevel': 1.0,  # Feature selection rate by level
    'eval_metric': 'merror',  # supported eval metrics:
    # 1. rmse
    # 2. rmsle
    # 3. mape
    # 4. logloss
    # 5. error
    # 6. error@t
    # 7. merror
    # 8. mlogloss
    # 9. auc
    # 10. aucpr
    # Special params in SFXgboost
    # Required
    'hess_key': 'hess',  # Required. Marks the hess column; choose a column name that is not in the data set
    'grad_key': 'grad',  # Required. Marks the grad column; choose a column name that is not in the data set
    'label_key': 'class',  # Required. Marks the label column
}
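As a small illustration of the `merror` metric selected above, the sketch below computes the multiclass classification error rate as the number of wrong predictions divided by the number of samples. The function name and the sample data are made up for this example; the predictions are class indices, which is what the `multi:softmax` objective outputs.

```python
def merror(y_true, y_pred):
    """Multiclass classification error rate: #(wrong cases) / #(all cases)."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Labels are class indices in [0, num_class), here with num_class = 6.
labels = [0, 1, 2, 5, 3, 4]
predictions = [0, 1, 2, 4, 3, 3]
print(merror(labels, predictions))  # 2 of 6 predictions are wrong
```

Lower is better: a value of 0 means every sample was classified correctly.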

### Create SFXgboost

In [6]:
from secretflow.ml.boost.homo_boost import SFXgboost

bst = SFXgboost(server=charlie, clients=[alice, bob])
bst

INFO:root:Create proxy actor <class 'secretflow.device.proxy.ActorHomoBooster'> with party alice.
INFO:root:Create proxy actor <class 'secretflow.device.proxy.ActorHomoBooster'> with party bob.
INFO:root:Create proxy actor <class 'secretflow.device.proxy.ActorHomoBooster'> with party charlie.


<secretflow.ml.boost.homo_boost.homo_booster.SFXgboost at 0x7fb30867f910>

### Run SFXgboost

In [7]:
bst.train(data, data, params=params, num_boost_round=6)


Our federated XGBoost training is now complete, and `bst` is the federated boosting object.

## Conclusion
* This tutorial introduced how to train horizontally federated tree models with SecretFlow.
* `SFXgboost` encapsulates the federated tree-building logic. Models trained with `SFXgboost` remain compatible with XGBoost, so existing infrastructure can be used directly for online prediction and similar tasks.
* Next, you can try `SFXgboost` on your own data by following the steps in this tutorial.
