# Adaptive basis function models

## Introduction

We aim to learn useful features $\phi(x)$ directly from the data, without using a kernel function. 
The resulting model is called an adaptive basis-function model (ABM):

$$f ( \mathbf { x } ) = w _ { 0 } + \sum _ { m = 1 } ^ { M } w _ { m } \phi _ { m } ( \mathbf { x } )$$

where $\phi_m(x)$ is the $m$'th basis function, which is learned from data. We can write $\phi _ { m } ( \mathbf { x } ) = \phi \left( \mathbf { x } ; \mathbf { v } _ { m } \right)$ where $\mathbf{v}_m$ are the parameters of the basis function itself. We will use $\boldsymbol { \theta } = \left( w _ { 0 } , \mathbf { w } _ { 1 : M } , \left\{ \mathbf { v } _ { m } \right\} _ { m = 1 } ^ { M } \right)$ to denote the entire parameter set. The resulting model is no longer linear in the parameters, so we will only be able to compute a locally optimal MLE or MAP estimate of $\boldsymbol{\theta}$. 


## Classification and regression trees (CART)
CART models are also known as decision trees.

### Intro
model:
$$f ( \mathbf { x } ) = \mathbb { E } [ y | \mathbf { x } ] = \sum _ { m = 1 } ^ { M } w _ { m } \mathbb { I } \left( \mathbf { x } \in R _ { m } \right) = \sum _ { m = 1 } ^ { M } w _ { m } \phi \left( \mathbf { x } ; \mathbf { v } _ { m } \right)$$

where $R_m$ is the $m$'th region, $w_m$ is the mean response in this region, and $\mathbf{v}_m$ encodes the choice of variable to split on, and the threshold value, on the path from the root to the $m$'th leaf. 

This makes CART an adaptive basis-function model: 
+ basis function: define the regions
+ weight: response value in each region

![](../images/16.DT.png)
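To make the "tree as a sum of indicator basis functions" view concrete, here is a minimal sketch. The one-dimensional regions, the thresholds and the weights are made-up values for illustration, and the names `REGIONS`, `WEIGHTS`, `phi` and `tree_predict` are hypothetical.

```python
# Hypothetical 1-D tree with three leaves; thresholds and weights are made up.
REGIONS = [
    (float("-inf"), 2.0),   # R_1: x <= 2
    (2.0, 5.0),             # R_2: 2 < x <= 5
    (5.0, float("inf")),    # R_3: x > 5
]
WEIGHTS = [1.0, 3.5, 0.5]   # w_m: mean response in each region


def phi(x, region):
    """Indicator basis function I(x in R_m) for a half-open interval (lo, hi]."""
    lo, hi = region
    return 1.0 if lo < x <= hi else 0.0


def tree_predict(x):
    """f(x) = sum_m w_m * I(x in R_m); exactly one term is non-zero."""
    return sum(w * phi(x, r) for w, r in zip(WEIGHTS, REGIONS))
```

Because the regions partition the input space, `tree_predict(3.0)` simply returns the weight of the region containing the point, here `3.5`.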

### Growing a tree: 
Trees are grown with a greedy procedure (this approach is used in C4.5 and ID3). The split function chooses the best feature, and the best value for that feature, as follows:

$$\left( j ^ { * } , t ^ { * } \right) = \arg \min _ { j \in \{ 1 , \ldots , D \} } \min _ { t \in \mathcal { T } _ { j } } \operatorname { cost } \left( \left\{ \mathbf { x } _ { i } , y _ { i } : x _ { i j } \leq t \right\} \right) + \operatorname { cost } \left( \left\{ \mathbf { x } _ { i } , y _ { i } : x _ { i j } > t \right\} \right)$$

![](../images/16.DT_algo.png)

where the worthSplitting function can be based on the reduction in cost achieved by the split: 
$$\Delta \triangleq \operatorname { cost } ( \mathcal { D } ) - \left( \frac { \left| \mathcal { D } _ { L } \right| } { | \mathcal { D } | } \operatorname { cost } \left( \mathcal { D } _ { L } \right) + \frac { \left| \mathcal { D } _ { R } \right| } { | \mathcal { D } | } \operatorname { cost } \left( \mathcal { D } _ { R } \right) \right)$$

+ Regression cost:
$$\operatorname { cost } ( \mathcal { D } ) = \sum _ { i \in \mathcal { D } } \left( y _ { i } - \overline { y } \right) ^ { 2 }$$
where $\overline { y } = \frac { 1 } { | \mathcal { D } | } \sum _ { i \in \mathcal { D } } y _ { i }$ is the mean of the response variable in the specified set of data.

+ Classification cost:
    First, we fit a multinoulli (categorical) model to the data in the leaf satisfying the test $X_j < t$, by estimating the class proportions: $$\hat { \pi } _ { c } = \frac { 1 } { | \mathcal { D } | } \sum _ { i \in \mathcal { D } } \mathbb { I } \left( y _ { i } = c \right)$$

    where $\mathcal{D}$ is the data in the leaf. There are several common error measures for evaluating a proposed partition:

    + Misclassification rate: we define the most probable class label as $\hat { y } = \operatorname { argmax } _ { c } \hat { \pi } _ { c }$. The corresponding error rate is: $$\frac { 1 } { | \mathcal { D } | } \sum _ { i \in \mathcal { D } } \mathbb { I } \left( y _ { i } \neq \hat { y } \right) = 1 - \hat { \pi } _ { \hat { y } }$$
    
    + Entropy (deviance, information gain): $$\mathbb { H } ( \hat { \boldsymbol { \pi } } ) = - \sum _ { c = 1 } ^ { C } \hat { \pi } _ { c } \log \hat { \pi } _ { c }$$
    
    + Gini index: $$\sum _ { c = 1 } ^ { C } \hat { \pi } _ { c } \left( 1 - \hat { \pi } _ { c } \right) = \sum _ { c } \hat { \pi } _ { c } - \sum _ { c } \hat { \pi } _ { c } ^ { 2 } = 1 - \sum _ { c } \hat { \pi } _ { c } ^ { 2 }$$
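The cost measures and the split search above can be sketched in a few lines. This is only an illustration under stated assumptions: candidate thresholds are taken from the observed feature values, splits are ranked by the size-weighted reduction $\Delta$ (one possible choice), and all function names are hypothetical.

```python
import math
from collections import Counter


def class_probs(labels):
    """Empirical class distribution pi_hat_c in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def misclassification(labels):
    return 1.0 - max(class_probs(labels))


def entropy(labels):
    return -sum(p * math.log(p) for p in class_probs(labels))


def gini(labels):
    return 1.0 - sum(p * p for p in class_probs(labels))


def gain(labels, left, right, cost):
    """Delta: cost of the parent minus the size-weighted cost of the children."""
    n = len(labels)
    children = len(left) / n * cost(left) + len(right) / n * cost(right)
    return cost(labels) - children


def best_split(X, y, cost):
    """Greedy search over features j and thresholds t taken from the data.

    Returns (gain, j, t) for the split x_j <= t with the largest gain,
    or None if no threshold separates the data.
    """
    best = None
    for j in range(len(X[0])):
        # Candidate thresholds: every observed value except the largest,
        # so that both children are non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            g = gain(y, left, right, cost)
            if best is None or g > best[0]:
                best = (g, j, t)
    return best
```

For example, with `X = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]` and `y = ["a", "a", "b", "b"]`, calling `best_split(X, y, gini)` picks feature 0 with threshold 2.0, since that split yields two pure children.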
    
### Pros and cons of trees:
+ Pros:
    + easy to interpret
    + easily handle mixed discrete and continuous inputs
    + insensitive to monotone transformations of the inputs (split points are based on ranking the data points)
    + perform automatic variable selection
    + robust to outliers
    + scale well to large datasets
    + can be modified to handle missing inputs
+ Cons:
    + do not predict very accurately compared to other kinds of model.
    + unstable: small changes to the input data can have large effects on the structure of the tree, due to the hierarchical nature of the tree-growing process.
    
### Random forests:
One way to reduce the variance of an estimate is to average together many estimates. 

+ Bagging: we can train $M$ different trees on different subsets of the data, chosen randomly with replacement, and then compute the ensemble: 

    $$f ( \mathbf { x } ) = \sum _ { m = 1 } ^ { M } \frac { 1 } { M } f _ { m } ( \mathbf { x } )$$

    where $f_m$ is the $m$'th tree.

+ Random forests: this technique tries to decorrelate the base learners by learning trees based on a randomly chosen subset of input **features**, as well as a randomly chosen subset of **data** cases. Such models often have very good predictive accuracy.
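A minimal sketch of both ideas, assuming regression stumps (depth-one trees) as base learners to keep the code short. Here each tree gets one bootstrap sample and one random feature subset, as described above; common implementations instead resample the candidate features at every split. The names `fit_stump`, `fit_forest` and `forest_predict` are hypothetical.

```python
import random


def fit_stump(X, y, features):
    """Least-squares regression stump restricted to the given feature indices."""
    mean = sum(y) / len(y)
    best = (float("inf"), None, None, mean, mean)  # (sse, j, t, left, right)
    for j in features:
        for t in sorted({row[j] for row in X})[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if sse < best[0]:
                best = (sse, j, t, ml, mr)
    _, j, t, ml, mr = best
    if j is None:                 # no valid split: predict the mean everywhere
        return lambda row: mean
    return lambda row: ml if row[j] <= t else mr


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]     # bootstrap sample (with replacement)
        features = rng.sample(range(d), n_features)    # random feature subset
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return trees


def forest_predict(trees, row):
    """Ensemble prediction: the average of the individual tree predictions."""
    return sum(tree(row) for tree in trees) / len(trees)
```

Setting `n_features` equal to the number of input features turns this into plain bagging, since every tree then sees all features.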

## Boosting
### Intro:
Boosting is a greedy algorithm for fitting adaptive basis-function models of the form: 
$$f ( \mathbf { x } ) = w _ { 0 } + \sum _ { m = 1 } ^ { M } w _ { m } \phi _ { m } ( \mathbf { x } )$$

where the $\phi_m$ are generated by an algorithm called a weak learner or base learner. The algorithm works by applying the weak learner sequentially to **weighted versions** of the data, where more weight is given to examples that were misclassified in earlier rounds. The weak learner can be any classification or regression algorithm, but it is common to use a CART model. In practice, boosting is often found to be quite resistant to overfitting.

### Forward stagewise additive modeling:
The goal of boosting is to solve the following optimization problem:
$$\min _ { f } \sum _ { i = 1 } ^ { N } L \left( y _ { i } , f \left( \mathbf { x } _ { i } \right) \right)$$

where $L(y, \hat{y})$ is some loss function, and $f$ is assumed to be an ABM.

Some common choices for the loss function:
![](../images/16.boost.png)

Process:
+ First, we initialize:
$$f _ { 0 } ( \mathbf { x } ) = \arg \min _ { \gamma } \sum _ { i = 1 } ^ { N } L \left( y _ { i } , f \left( \mathbf { x } _ { i } ; \gamma \right) \right)$$

    for example: 
    + for squared error loss: $f _ { 0 } ( \mathbf { x } ) = \overline { y }$
    + for log-loss or exponential loss: $f _ { 0 } ( \mathbf { x } ) = \frac { 1 } { 2 } \log \frac { \hat { \pi } } { 1 - \hat { \pi } } , \text { where } \hat { \pi } = \frac { 1 } { N } \sum _ { i = 1 } ^ { N } \mathbb { I } \left( y _ { i } = 1 \right)$

+ We iteratively compute:
$$\left( \beta _ { m } , \gamma _ { m } \right) = \underset { \beta , \gamma } { \operatorname { argmin } } \sum _ { i = 1 } ^ { N } L \left( y _ { i } , f _ { m - 1 } \left( \mathbf { x } _ { i } \right) + \beta \phi \left( \mathbf { x } _ { i } ; \gamma \right) \right)$$
    and set: $$f _ { m } ( \mathbf { x } ) = f _ { m - 1 } ( \mathbf { x } ) + \nu\beta _ { m } \phi \left( \mathbf { x } ; \gamma _ { m } \right)$$
    
    where $0 < \nu \leq 1$ is a step-size parameter (learning rate); it is common to use a small value such as $\nu = 0.1$.
    
+ The key point: we do not go back and adjust earlier parameters; we just add a new function to the current ones.
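The initial value $f_0 = \frac{1}{2} \log \frac{\hat{\pi}}{1 - \hat{\pi}}$ for the exponential loss can be checked by hand. A short derivation, assuming labels coded as $\tilde{y}_i \in \{-1, +1\}$, a constant model $f(\mathbf{x}; \gamma) = \gamma$, and $\hat{\pi}$ the fraction of positive labels:

$$\sum_{i=1}^{N} \exp\left( - \tilde{y}_i \gamma \right) = N \left[ \hat{\pi} e^{-\gamma} + (1 - \hat{\pi}) e^{\gamma} \right]$$

Setting the derivative with respect to $\gamma$ to zero gives:

$$- \hat{\pi} e^{-\gamma} + (1 - \hat{\pi}) e^{\gamma} = 0 \;\Rightarrow\; e^{2\gamma} = \frac{\hat{\pi}}{1 - \hat{\pi}} \;\Rightarrow\; \gamma = \frac{1}{2} \log \frac{\hat{\pi}}{1 - \hat{\pi}}$$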

### L2boosting:
Squared error loss:
$$L \left( y _ { i } , f _ { m - 1 } \left( \mathbf { x } _ { i } \right) + \beta \phi \left( \mathbf { x } _ { i } ; \gamma \right) \right) = \left( r _ { i m } - \phi \left( \mathbf { x } _ { i } ; \gamma \right) \right) ^ { 2 }$$

where $r _ { i m } \triangleq y _ { i } - f _ { m - 1 } \left( \mathbf { x } _ { i } \right)$ is the current residual, and we have set $\beta = 1$ without loss of generality. We can therefore find the new basis function by using the weak learner to predict $\mathbf{r}_m$. 
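As an illustration, here is a minimal sketch of L2boosting on a one-dimensional input, assuming a least-squares stump as the weak learner and the shrinkage factor $\nu$ from the forward stagewise update. It assumes the inputs contain at least two distinct values; the names `fit_stump`, `l2boost` and `predict` are hypothetical.

```python
def fit_stump(x, r):
    """Least-squares stump on a 1-D input: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]


def l2boost(x, y, n_rounds=50, nu=0.1):
    f0 = sum(y) / len(y)                  # f_0 = mean response
    f = [f0] * len(y)                     # current fit at each training point
    stumps = []
    for _ in range(n_rounds):
        r = [yi - fi for yi, fi in zip(y, f)]        # residuals r_im
        t, ml, mr = fit_stump(x, r)                  # weak learner predicts the residuals
        stumps.append((t, ml, mr))
        f = [fi + nu * (ml if xi <= t else mr) for xi, fi in zip(x, f)]
    return f0, nu, stumps


def predict(model, xi):
    f0, nu, stumps = model
    return f0 + nu * sum(ml if xi <= t else mr for t, ml, mr in stumps)
```

Earlier stumps are never revisited: each round only appends one new stump fitted to what the current ensemble still gets wrong.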

### Adaboost: 
Exponential loss, at step $m$:
    $$L _ { m } ( \phi ) = \sum _ { i = 1 } ^ { N } \exp \left[ - \tilde { y } _ { i } \left( f _ { m - 1 } \left( \mathbf { x } _ { i } \right) + \beta \phi \left( \mathbf { x } _ { i } \right) \right) \right] = \sum _ { i = 1 } ^ { N } w _ { i , m } \exp \left( - \beta \tilde { y } _ { i } \phi \left( \mathbf { x } _ { i } \right) \right)$$

where $w _ { i , m } \triangleq \exp \left( - \tilde { y } _ { i } f _ { m - 1 } \left( \mathbf { x } _ { i } \right) \right)$ is a weight applied to data case $i$, and $\tilde { y } _ { i } \in \{ - 1 , + 1 \}$.

![](../images/16.Adaboost.png)
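The reweighting idea can be sketched as follows. This follows the common discrete AdaBoost recipe ($\alpha_m = \log \frac{1 - \text{err}_m}{\text{err}_m}$, with misclassified points up-weighted by $e^{\alpha_m}$) and may differ in detail from the algorithm in the figure. The one-dimensional decision stump and all function names are assumptions made for illustration.

```python
import math


def stump(xi, t, sign):
    """Decision stump: predicts `sign` if xi <= t, otherwise -sign."""
    return sign if xi <= t else -sign


def fit_weighted_stump(x, y, w):
    """Return (weighted_error, threshold, sign) minimising the weighted 0-1 error."""
    best = None
    for t in sorted(set(x)):
        for sign in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w) if stump(xi, t, sign) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best


def adaboost(x, y, n_rounds=10):
    n = len(y)
    w = [1.0 / n] * n                                 # uniform initial weights
    ensemble = []                                     # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        err, t, sign = fit_weighted_stump(x, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)         # guard against log(0)
        alpha = math.log((1 - err) / err)
        ensemble.append((alpha, t, sign))
        # Up-weight the points this stump got wrong, then renormalise.
        w = [wi * math.exp(alpha) if stump(xi, t, sign) != yi else wi
             for xi, yi, wi in zip(x, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, xi):
    score = sum(alpha * stump(xi, t, sign) for alpha, t, sign in ensemble)
    return 1 if score >= 0 else -1
```

On labels such as `[1, 1, -1, -1, 1, 1]` over `x = 0, ..., 5`, no single stump is correct everywhere, but a weighted vote of three stumps is.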

### LogitBoost:
Expected log-loss:
$$L _ { m } ( \phi ) = \sum _ { i = 1 } ^ { N } \log \left[ 1 + \exp \left( - 2 \tilde { y } _ { i } \left( f _ { m - 1 } ( \mathbf { x } _ { i } ) + \phi \left( \mathbf { x } _ { i } \right) \right) \right) \right]$$

![](../images/16.Logitboost.png)

### Gradient Boosting:
Gradient boosting is a generic version that works for any differentiable loss function. The goal is to solve: $\hat { \mathbf { f } } = \underset { \mathbf { f } } { \operatorname { argmin } } L ( \mathbf { f } )$

where $\mathbf { f } = \left( f \left( \mathbf { x } _ { 1 } \right) , \ldots , f \left( \mathbf { x } _ { N } \right) \right)$ are the "parameters". We solve it stagewise, using gradient descent.

At step $m$, let $\mathbf{g}_m$ be the gradient of $L(\mathbf{f})$ evaluated at $\mathbf{f} = \mathbf{f}_{m-1}$:

$$g _ { i m } = \left[ \frac { \partial L \left( y _ { i } , f \left( \mathbf { x } _ { i } \right) \right) } { \partial f \left( \mathbf { x } _ { i } \right) } \right] _ { f = f _ { m - 1 } }$$

We then update: $\mathbf { f } _ { m } = \mathbf { f } _ { m - 1 } - \rho _ { m } \mathbf { g } _ { m }$

where $\rho_m$ is the step length. This is called functional gradient descent.


![](../images/16.Gradientboost.png)
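A minimal sketch of the procedure, assuming a least-squares stump on a one-dimensional input as the weak learner and a fixed step $\nu$ in place of a line search for $\rho_m$. The loss enters only through its gradient; the example gradient is for the log-loss $\log(1 + \exp(-2 \tilde{y} f))$ used in LogitBoost. The inputs are assumed to contain at least two distinct values, and all names are hypothetical.

```python
import math


def fit_stump(x, target):
    """Least-squares stump on a 1-D input: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [v for xi, v in zip(x, target) if xi <= t]
        right = [v for xi, v in zip(x, target) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]


def logloss_grad(y, f):
    """dL/df for L = log(1 + exp(-2 y f)), with y in {-1, +1}."""
    return -2.0 * y / (1.0 + math.exp(2.0 * y * f))


def gradient_boost(x, y, grad, f0, n_rounds=100, nu=0.1):
    f = [f0] * len(y)                     # current function values f(x_i)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the loss w.r.t. f(x_i), evaluated at f_{m-1}.
        r = [-grad(yi, fi) for yi, fi in zip(y, f)]
        t, ml, mr = fit_stump(x, r)       # weak learner approximates -g_m
        stumps.append((t, ml, mr))
        f = [fi + nu * (ml if xi <= t else mr) for xi, fi in zip(x, f)]
    return f0, nu, stumps


def predict(model, xi):
    f0, nu, stumps = model
    return f0 + nu * sum(ml if xi <= t else mr for t, ml, mr in stumps)
```

Fitting the weak learner to the negative gradient, rather than stepping along $-\mathbf{g}_m$ directly, is what lets the model make predictions at inputs outside the training set. Passing `lambda y, f: f - y` as `grad` recovers the residual-fitting behaviour of L2boosting.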