[toc]

# GBDT in Python: Hands-On

## Code Walkthrough

Import the classes we need. Because GBDT uses CART as its base learner, we import the regression tree directly from sklearn.

In [1]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import OneHotEncoder
# load_boston was removed in scikit-learn 1.2, so this import needs an older release
from sklearn.datasets import load_boston, load_iris

import numpy as np
from collections import Counter

First, define a Loss class that computes the loss value and its gradient.

In [2]:
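class Loss:
    """Base class: subclasses implement the loss value and its gradient."""
    def fit(self, y, yhat):
        # loss value for targets y and predictions yhat
        raise NotImplementedError
    def gradient(self, y, yhat):
        # gradient of the loss with respect to yhat
        raise NotImplementedError
    def __call__(self, y, yhat):
        return self.fit(y, yhat)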

In [3]:
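class MSE(Loss):
    """
    For MSE, y and yhat are both assumed to have shape (n_samples,).
    """
    def fit(self, y, yhat):
        return 0.5 * np.mean((y - yhat)**2)
    
    def gradient(self, y, yhat):
        # per-sample gradient of 0.5 * (y - yhat)**2; the 1/n factor from the mean is dropped
        return yhat - y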
    
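class CrossEntropyWithLogits(Loss):
    """
    For CrossEntropyWithLogits, y and logits are both assumed to have shape (n_samples, n_categories).
    """
    def softmax(self, x):
        # normalize over the category axis; subtracting the row max keeps exp() from overflowing
        c = np.max(x, axis=1, keepdims=True)
        a = np.exp(x - c)
        sum_a = np.sum(a, axis=1, keepdims=True)
        return a / sum_a
    
    def fit(self, y, logits):
        # the small epsilon guards against log(0)
        output = - np.mean(np.sum(y * np.log(self.softmax(logits) + 1e-12), axis=1))
        return output

    def gradient(self, y, logits):
        # gradient of softmax cross-entropy with respect to the logits
        return self.softmax(logits) - y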
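
For softmax cross-entropy, the gradient with respect to the logits is `softmax(logits) - y`. The sketch below checks that identity against central finite differences. It is a standalone illustration: the helper names `softmax` and `cross_entropy` and the toy arrays are assumptions made for this example, and the loss is summed rather than averaged so that no `1/n` factor appears.

```python
import numpy as np

def softmax(z):
    # row-wise softmax for logits of shape (n_samples, n_categories)
    z = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

def cross_entropy(y, z):
    # summed (not averaged) cross-entropy over all samples
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))        # toy logits
y = np.eye(3)[[0, 2, 1, 2]]        # toy one-hot labels

analytic = softmax(z) - y

# central finite differences, perturbing one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.shape[0]):
    for j in range(z.shape[1]):
        step = np.zeros_like(z)
        step[i, j] = eps
        numeric[i, j] = (cross_entropy(y, z + step) - cross_entropy(y, z - step)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

A tiny `max_abs_diff` indicates that the analytic and numerical gradients agree.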

Define a GBDT class that handles both classification and regression problems.

In [4]:
class GBDT:
    def __init__(self, n_estimator=3, learning_rate=0.01, regression=True, **params):
        """
        n_estimator: number of weak learners.
        regression: whether this is a regression task.
        params: keyword arguments passed to the base learner (a DecisionTreeRegressor in both modes).
        """
        self.n_estimator = n_estimator
        self.regression = regression
        self.learning_rate = learning_rate
        self.trees = [DecisionTreeRegressor(**params) for _ in range(self.n_estimator)]
        self.loss = MSE() if self.regression else CrossEntropyWithLogits()
    
    @staticmethod
    def majority_voting(y):
        """
        Average the distinct rows of y (each distinct row is counted once).

        param:
            y: shape (n_samples, n_categories)

        output:
            yhat: shape=(1, n_categories)
        """
        distinct_rows = Counter(map(tuple, y)).most_common()
        yhat = np.zeros(y.shape[1], dtype=float)
        for val, _ in distinct_rows:
            yhat += np.array(val)
        yhat /= len(distinct_rows)
        yhat = yhat[np.newaxis, :]
        return yhat

    def fit(self, X, y):
        # initial prediction: target mean for regression, averaged one-hot rows for classification
        self.init = np.mean(y) if self.regression else GBDT.majority_voting(y)
        yhat = self.init
        for i in range(self.n_estimator):
            # each tree fits the negative gradient of the loss at the current prediction
            negative_gradient = - self.loss.gradient(y, yhat)
            self.trees[i].fit(X, negative_gradient)
            yhat = yhat + self.learning_rate * self.trees[i].predict(X)
    
    def predict(self, X):
        # start from the value stored by fit(); y is not available at prediction time
        yhat = self.init
        for i in range(self.n_estimator):
            yhat = yhat + self.learning_rate * self.trees[i].predict(X)
        return yhat
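
The boosting loop can be seen in isolation on a toy 1-D regression problem. The sketch below is an illustration under stated assumptions, not the class above: it swaps sklearn's CART for a hand-written depth-1 stump so that it runs with NumPy alone, and the names `fit_stump`, `predict_stump`, `predict`, the synthetic data and the learning rate are all made up for this example.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares depth-1 regression tree on 1-D inputs x with targets r."""
    best = None
    for t in np.unique(x)[:-1]:            # candidate thresholds; both sides stay non-empty
        left = x <= t
        lv, rv = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - lv) ** 2) + np.sum((r[~left] - rv) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv

def predict_stump(stump, x):
    t, lv, rv = stump
    return np.where(x <= t, lv, rv)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)

learning_rate = 0.5
init = y.mean()                            # constant initial prediction
yhat = np.full_like(y, init)
stumps = []
losses = [0.5 * np.mean((y - yhat) ** 2)]
for _ in range(20):
    residual = y - yhat                    # negative gradient of 0.5 * (y - yhat)**2
    stump = fit_stump(x, residual)
    stumps.append(stump)
    yhat = yhat + learning_rate * predict_stump(stump, x)
    losses.append(0.5 * np.mean((y - yhat) ** 2))

def predict(x_new):
    # replay the stored initial value and every stump, as at training time
    out = np.full(x_new.shape, init)
    for s in stumps:
        out = out + learning_rate * predict_stump(s, x_new)
    return out

print(losses[0], losses[-1])
```

Because each stump is a least-squares fit to the current residuals and the learning rate lies between 0 and 2, the training loss cannot increase from one round to the next.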

### Testing

#### Classification

We test on the well-known iris dataset.

In [5]:
data = load_iris()

X = data['data']
y = data['target']
# on scikit-learn >= 1.2 this argument is named sparse_output
y = OneHotEncoder(sparse=False).fit_transform(y[:, np.newaxis])

In [6]:
params = {"max_depth": 2}
gbdt = GBDT(n_estimator=5, **params, regression=False)
gbdt.fit(X,y)

yhat = gbdt.predict(X)
# compare class indices per sample: the category axis of (n_samples, n_categories) is axis=1
acc =  np.mean(np.argmax(y, axis=1) == np.argmax(yhat, axis=1))
print("acc: ", acc)
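
With one-hot labels of shape `(n_samples, n_categories)`, the class of each sample lies along `axis=1`. Reducing along `axis=0` instead returns one row index per category column. The toy check below uses made-up arrays (not the iris predictions) to show the difference.

```python
import numpy as np

# made-up one-hot labels and scores for 4 samples and 3 categories
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]])
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.4, 0.3]])

true_class = np.argmax(y_true, axis=1)   # one class index per sample
pred_class = np.argmax(scores, axis=1)
acc = np.mean(true_class == pred_class)

per_column = np.argmax(y_true, axis=0)   # one row index per category, not per sample

print(true_class, pred_class, acc)
print(per_column.shape)
```

Here `true_class` has one entry per sample, while `per_column` has only one entry per category, so an accuracy computed along `axis=0` compares just `n_categories` values.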


#### Regression

For regression, a standard introductory dataset is the Boston housing price data.

In [7]:
data = load_boston()

X = data['data']
y = data['target']

params = {"max_depth": 10}
gbdt = GBDT(n_estimator=200, **params)
gbdt.fit(X,y)

yhat = gbdt.predict(X)
MSE()(y, yhat)

0.8563770138243728
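
When reading this number, note that `MSE.fit` returns `0.5 * mean((y - yhat)**2)`, which is half the conventional mean squared error, and that it is computed here on the same data the model was fitted on. The sketch below uses made-up numbers (not the Boston results) to show how the two conventions relate.

```python
import numpy as np

# made-up targets and predictions, for illustration only
y = np.array([3.0, 5.0, 8.0])
yhat = np.array([2.0, 5.0, 10.0])

half_mse = 0.5 * np.mean((y - yhat) ** 2)   # the 0.5-scaled convention
mse = np.mean((y - yhat) ** 2)              # conventional mean squared error
rmse = np.sqrt(mse)                         # same units as the target

print(half_mse, mse, rmse)
```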