The AdaBoost Algorithm
======

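AdaBoost addresses two questions: how to change the weights (or probability distribution) of the training data in each round, and how to combine the weak classifiers into a strong classifier.
1. For the first question, AdaBoost raises the weights of the samples misclassified by the previous round's weak classifier and lowers the weights of the samples it classified correctly.
2. For the second question, AdaBoost uses a weighted majority vote. Specifically, it gives a larger weight to weak classifiers with a small classification error rate, so that they play a larger role in the vote.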


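The algorithm proceeds as follows:

Input: training data set $T = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$, where $y_i\in \mathcal{Y}=\{-1,+1\}$

Output: the final classifier $G(x)$

1. Initialize the weight distribution of the training data
$$D_1=(\omega_{11},\ldots,\omega_{1i},\ldots,\omega_{1N}),\quad \omega_{1i}={1 \over N},\quad i=1,2,\ldots,N$$
2. For $m=1,2,\ldots,M$: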

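 (2.1) Learn from the training data set with weight distribution $D_m$ to obtain a base classifier
$$G_m(x): \mathcal{X} \rightarrow \{-1,+1\}$$
 (2.2) Compute the classification error rate of $G_m(x)$ on the training data set
$$e_m=P(G_m(x_i)\neq y_i) = \sum_{i=1}^N\omega_{mi}I(G_m(x_i)\neq y_i)$$
 (2.3) Compute the coefficient of $G_m(x)$
$$\alpha_m = {1\over2}\log{1-e_m \over e_m}$$
 Here $\log$ is the natural logarithm. $\alpha_m$ increases as $e_m$ decreases, so base classifiers with a smaller classification error rate carry more weight in the final classifier.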
 
 (2.4) Update the weight distribution of the training data set
$$D_{m+1} = (\omega_{m+1,1},\ldots,\omega_{m+1,i},\ldots,\omega_{m+1,N})$$
$$\omega_{m+1,i}={\omega_{mi}\over Z_m}\exp(-\alpha_my_iG_m(x_i)),\quad i=1,2,\ldots,N$$
Here $Z_m$ is the normalization factor
$$Z_m = \sum_{i=1}^N\omega_{mi}\exp(-\alpha_my_iG_m(x_i))$$
which makes $D_{m+1}$ a probability distribution.

3. Build the linear combination of the base classifiers
$$f(x)=\sum_{m=1}^M\alpha_mG_m(x)$$
to obtain the final classifier
$$G(x)=\mathrm{sign}(f(x)) = \mathrm{sign}\left(\sum_{m=1}^M\alpha_mG_m(x)\right)$$
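
To make steps (2.2) to (2.4) concrete, here is a minimal plain-Python sketch of a single boosting round on a made-up four-sample example. The helper name `boosting_round` and the toy labels and predictions are assumptions for illustration only; they are not part of the notebook code.

```python
import math

def boosting_round(weights, labels, predictions):
    """One AdaBoost round: return (error, alpha, new_weights).

    Hypothetical helper for illustration; assumes 0 < error < 1.
    """
    # (2.2) weighted error: total weight of the misclassified samples
    error = sum(w for w, y, g in zip(weights, labels, predictions) if y != g)
    # (2.3) coefficient of this base classifier
    alpha = 0.5 * math.log((1.0 - error) / error)
    # (2.4) scale each weight by exp(-alpha * y * G(x)), then divide by Z
    unnormalized = [w * math.exp(-alpha * y * g)
                    for w, y, g in zip(weights, labels, predictions)]
    z = sum(unnormalized)
    return error, alpha, [u / z for u in unnormalized]

# Toy data (assumed): uniform weights, only the second sample is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]

error, alpha, new_weights = boosting_round(weights, labels, predictions)
print(error, alpha, new_weights)
```

With this choice of $\alpha_m$, the misclassified samples together always receive half of the total weight after normalization, which is why the next base classifier is pushed to get them right.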

In [1]:
from numpy import *

def loadSimpData():
    """Return a toy data set (5 samples, 2 features) and its class labels."""
    datMat=matrix([[1.,2.1],
                   [2.,1.1],
                   [1.3,1.],
                   [1.,1.],
                   [2.,1.]])
    classLabels=[1.0,1.0,-1.0,-1.0,1.0]
    return datMat,classLabels

In [2]:
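def stumpClassify(dataMatrix,dimen,threshVal,threshIneq):
    """
    Classify every sample by comparing feature `dimen` with the threshold:
    samples on the `threshIneq` side of threshVal get -1.0, the rest +1.0.
    """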
    retArray = ones((dataMatrix.shape[0],1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:,dimen] > threshVal] = -1.0
    return retArray


In [3]:
from numpy import *
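def buildStump(dataArr,classLabels,D):
    """
    Set minError to infinity
    For every feature in the data set:
        For every step (second loop):
            For each inequality:
                Build a decision stump and test it on the weighted data set
                If its error is lower than minError, keep it as the best stump
    Return the best decision stump
    """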
    dataMatrix = mat(dataArr)
    labelMat =mat(classLabels).T
    m,n = dataMatrix.shape
    numSteps = 10.0;bestStump={};bestClasEst = mat(zeros((m,1)))
    # The loops below implement the pseudocode in the docstring
    minError  = inf
    for i in range(n):
        rangeMin = dataMatrix[:,i].min()
        rangeMax = dataMatrix[:,i].max()
        stepSize = (rangeMax-rangeMin)/numSteps
        for j in range(-1,int(numSteps)+1):
            for inequal in ['lt','gt']:
                threshVal = (rangeMin+float(j)*stepSize )
                # Samples on one side of the threshold are labeled -1, the others +1
                predictVals = stumpClassify(dataMatrix,i,threshVal,inequal)
                errArr = mat(ones((m,1)))
                errArr[predictVals==labelMat]=0
                weightedError  = D.T*errArr
#                 print("split: dim %d, thresh %.2f, thresh inequal: %s, the weighted error: %.3f" \
#                       %(i, threshVal, inequal, weightedError)  )
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictVals.copy()
                    bestStump['dim']=i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump,minError,bestClasEst

In [4]:
from numpy import *
D= mat(ones((5,1))/5)
datamat,classlabels = loadSimpData()
buildStump(datamat,classlabels,D)

({'dim': 0, 'ineq': 'lt', 'thresh': 1.3}, matrix([[ 0.2]]), array([[-1.],
        [ 1.],
        [-1.],
        [-1.],
        [ 1.]]))
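
The returned threshold of 1.3 is one of the candidates that `buildStump` scans. As a rough sketch, the candidate list for feature 0 of the toy data can be reproduced in plain Python. The variable names below are illustrative, and the feature values are copied from `loadSimpData`.

```python
# Feature 0 of the toy data set, copied from loadSimpData
feature0 = [1.0, 2.0, 1.3, 1.0, 2.0]
num_steps = 10  # same value as numSteps in buildStump

range_min, range_max = min(feature0), max(feature0)
step_size = (range_max - range_min) / num_steps

# j runs from -1 to num_steps inclusive, as in buildStump,
# so the first candidate lies one step below the smallest value.
thresholds = [range_min + j * step_size for j in range(-1, num_steps + 1)]
print([round(t, 2) for t in thresholds])
```

Starting at `j = -1` places one candidate below the smallest feature value, and the last candidate equals the largest value. At both ends a stump assigns the same label to every sample, so constant classifiers are part of the search space.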

In [5]:
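def adaBoostTrainDS(dataArr,classLabels,numIt=40):
    """
    For each iteration:
        Find the best decision stump with buildStump
        Append the best stump to the list of weak classifiers
        Compute the classifier coefficient alpha
        Compute the new weight vector D
        Update the aggregate class estimate
        If the error rate is 0.0, break out of the loop
    """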
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m,1))/m)
    aggClassEst  = mat(zeros((m,1)))
    for i in range(numIt):
        bestStump,error,classEst= buildStump(dataArr,classLabels,D)
        print("D:",D.T)
        alpha = float(0.5*log((1.0-error)/max(error,1e-16)))
        bestStump['alpha']=alpha
        weakClassArr.append(bestStump)
        print("classEst:",classEst.T)
        expon = multiply(-1*alpha*mat(classLabels).T,classEst)
        D = multiply(D,exp(expon))
        D = D/D.sum()
        aggClassEst += alpha*classEst
        print("aggClassEst:",aggClassEst.T)
        aggErrors = multiply(sign(aggClassEst)!=mat(classLabels).T,ones((m,1)))
        errorRate = aggErrors.sum()/m
        print("total error:",errorRate,"\n")
        if errorRate==0.0:break
    return weakClassArr

In [6]:
adaBoostTrainDS(datamat,classlabels,9)

D: [[ 0.2  0.2  0.2  0.2  0.2]]
classEst: [[-1.  1. -1. -1.  1.]]
aggClassEst: [[-0.69314718  0.69314718 -0.69314718 -0.69314718  0.69314718]]
total error: 0.2 

D: [[ 0.5    0.125  0.125  0.125  0.125]]
classEst: [[ 1.  1. -1. -1. -1.]]
aggClassEst: [[ 0.27980789  1.66610226 -1.66610226 -1.66610226 -0.27980789]]
total error: 0.2 

D: [[ 0.28571429  0.07142857  0.07142857  0.07142857  0.5       ]]
classEst: [[ 1.  1.  1.  1.  1.]]
aggClassEst: [[ 1.17568763  2.56198199 -0.77022252 -0.77022252  0.61607184]]
total error: 0.0 



[{'alpha': 0.6931471805599453, 'dim': 0, 'ineq': 'lt', 'thresh': 1.3},
 {'alpha': 0.9729550745276565, 'dim': 1, 'ineq': 'lt', 'thresh': 1.0},
 {'alpha': 0.8958797346140273,
  'dim': 0,
  'ineq': 'lt',
  'thresh': 0.90000000000000002}]
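
The printed values can be checked by hand against steps (2.2) to (2.4). The sketch below replays the three `classEst` rows from the log in plain Python and recomputes each `alpha` together with the weight vector used in that round. The function name `replay` is hypothetical, and the rows are copied from the output above.

```python
import math

labels = [1.0, 1.0, -1.0, -1.0, 1.0]   # classLabels from loadSimpData
class_est_log = [                       # classEst rows copied from the log
    [-1.0, 1.0, -1.0, -1.0, 1.0],
    [1.0, 1.0, -1.0, -1.0, -1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
]

def replay(labels, class_est_log):
    """Recompute (alphas, weight history) from logged stump predictions."""
    n = len(labels)
    d = [1.0 / n] * n
    alphas, history = [], [d]
    for est in class_est_log:
        # weighted error of this round's stump under the current weights
        error = sum(w for w, y, g in zip(d, labels, est) if y != g)
        alpha = 0.5 * math.log((1.0 - error) / max(error, 1e-16))
        # reweight and renormalize, as in adaBoostTrainDS
        d = [w * math.exp(-alpha * y * g) for w, y, g in zip(d, labels, est)]
        total = sum(d)
        d = [w / total for w in d]
        alphas.append(alpha)
        history.append(d)
    return alphas, history

alphas, history = replay(labels, class_est_log)
# Each line shows one round's alpha and the weights D that round started with.
for alpha, d in zip(alphas, history):
    print(round(alpha, 4), [round(w, 4) for w in d])
```

In each round the single sample (or pair of samples) that the previous stump got wrong ends up with a total weight of 0.5, which matches the `D` rows in the log.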

In [7]:
def adaClassify(datToClass,classifierArr):
    dataMatrix = mat(datToClass)
    m = dataMatrix.shape[0]
    aggClassEst = mat(zeros((m,1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],\
                                classifierArr[i]['thresh'],\
                                classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha']*classEst
        print(aggClassEst)
    return sign(aggClassEst)

In [8]:
calssifierArr=adaBoostTrainDS(datamat,classlabels,10)
adaClassify([[0,0],[5,5]],calssifierArr)

D: [[ 0.2  0.2  0.2  0.2  0.2]]
classEst: [[-1.  1. -1. -1.  1.]]
aggClassEst: [[-0.69314718  0.69314718 -0.69314718 -0.69314718  0.69314718]]
total error: 0.2 

D: [[ 0.5    0.125  0.125  0.125  0.125]]
classEst: [[ 1.  1. -1. -1. -1.]]
aggClassEst: [[ 0.27980789  1.66610226 -1.66610226 -1.66610226 -0.27980789]]
total error: 0.2 

D: [[ 0.28571429  0.07142857  0.07142857  0.07142857  0.5       ]]
classEst: [[ 1.  1.  1.  1.  1.]]
aggClassEst: [[ 1.17568763  2.56198199 -0.77022252 -0.77022252  0.61607184]]
total error: 0.0 

[[-0.69314718]
 [ 0.69314718]]
[[-1.66610226]
 [ 1.66610226]]
[[-2.56198199]
 [ 2.56198199]]


matrix([[-1.],
        [ 1.]])
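
To see where the running totals of about -0.693, -1.666 and -2.562 come from, the trained stumps can be applied by hand. The following plain-Python sketch copies the three stumps from the training output (with `thresh` written as 0.9) and mirrors the logic of `stumpClassify`. The names `stump_predict` and `ensemble_score` are hypothetical.

```python
# Stumps copied from the output of adaBoostTrainDS
stumps = [
    {'alpha': 0.6931471805599453, 'dim': 0, 'ineq': 'lt', 'thresh': 1.3},
    {'alpha': 0.9729550745276565, 'dim': 1, 'ineq': 'lt', 'thresh': 1.0},
    {'alpha': 0.8958797346140273, 'dim': 0, 'ineq': 'lt', 'thresh': 0.9},
]

def stump_predict(point, stump):
    """Return -1.0 or +1.0 for one point, following stumpClassify."""
    value = point[stump['dim']]
    if stump['ineq'] == 'lt':
        return -1.0 if value <= stump['thresh'] else 1.0
    return -1.0 if value > stump['thresh'] else 1.0

def ensemble_score(point, stumps):
    """Weighted sum f(x) of the stump votes; its sign is the prediction."""
    return sum(s['alpha'] * stump_predict(point, s) for s in stumps)

for point in ([0, 0], [5, 5]):
    score = ensemble_score(point, stumps)
    # A score of exactly zero is not treated specially in this sketch.
    print(point, round(score, 4), 1.0 if score > 0 else -1.0)
```

The point `[0, 0]` falls on the -1 side of all three stumps and `[5, 5]` on the +1 side of all three, so the scores are the negative and positive sum of the three `alpha` values.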