### Constrained Optimization Review
- Convex constrained optimization problem:
- $\min_x f_0(x)$ s.t. $f_i(x) \leq 0, i = [1...M]$, and $h_j(x) = 0, j = [1...N]$
- Define the Lagrangian by introducing one dual variable for each of the $M+N$ constraints: $\alpha_i$ for the inequality constraints and $\beta_j$ for the equality constraints.
- $L(x, \alpha, \beta) = f_0(x) + \sum_{i=1}^{M} \alpha_i f_i(x) + \sum_{j=1}^{N} \beta_j h_j(x)$
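- Consider $\max_{\alpha, \beta, \alpha_i \geq 0} L$ for a fixed $x$. If $x$ violates a primal constraint, the value is $\infty$, because the multiplier on the violated constraint can grow without bound. Otherwise the maximum is exactly $f_0(x)$, since the best choice makes every $\alpha_i f_i(x)$ term zero and the equality terms vanish.
- So computing $\min_x \max_{\alpha, \beta, \alpha_i \geq 0} L$ gives the same answer as the original primal problem, denoted $p^*$.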
- Switching the order of maximization and minimization leads us to the dual problem. 
- Primal for soft margin SVM: $\min_{w, b, \zeta} C \sum_n \zeta_n + \frac{1}{2}\|w\|_2^2$ s.t. $1 - \zeta_n - y_n[w^T\phi(x_n) + b] \leq 0, n = [1...N]$, and $-\zeta_n \leq 0, n = [1...N]$.
- We can immediately write down the Lagrangian (see previous notes for this), with multipliers $\alpha_n$ on the margin constraints and $\lambda_n$ on the constraints $-\zeta_n \leq 0$.
- Then the solution to the primal problem is $p^* = \min_{w, b, \zeta} \max_{\alpha, \lambda, \alpha_n \geq 0, \lambda_n \geq 0} L(w, b, \zeta, \alpha, \lambda)$.
- The solution to the dual problem is given by minimizing over the primal variables first, then maximizing over the dual variables: $d^* = \max_{\alpha, \lambda, \alpha_n \geq 0, \lambda_n \geq 0} \min_{w, b, \zeta} L(w, b, \zeta, \alpha, \lambda)$.
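- In general, we have weak duality: $d^* \leq p^*$. For the SVM, the objective $f_0$ is convex and the constraints are all affine (linear, but with an extra intercept term), so strong duality holds: $d^* = p^*$. More generally, strong duality holds when $f_0$ and the $f_i$ are convex, the equality constraints $h_j$ are affine, and a constraint qualification such as Slater's condition is satisfied.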
- The dual problem is $\max_\alpha \sum_n \alpha_n - \frac{1}{2}\sum_{m,n} \alpha_m \alpha_n y_m y_n \phi(x_m)^T \phi(x_n)$ s.t. $\sum_n \alpha_n y_n = 0$ and $0 \leq \alpha_n \leq C$ for $n \in 1...N$.
- We can recover the primal weights if we know the function $\phi$: $w = \sum_n \alpha_n y_n \phi(x_n)$.
- Also, the KKT conditions hold: $\lambda_n \zeta_n = 0$ and, more importantly, $\alpha_n[1 - \zeta_n - y_n(w^T\phi(x_n) + b)] = 0$.
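- More generally, the KKT conditions at the optimal values are: stationarity, $\frac{\partial L}{\partial x} = 0$; primal feasibility, $f_i(x) \leq 0$ and $h_j(x) = 0$ for the equality constraints $h_j$ (the latter is the same as $\frac{\partial L}{\partial \beta} = 0$); dual feasibility, $\alpha_i \geq 0$; and complementary slackness, $\alpha_i f_i(x) = 0, i \in [1...M]$. The products $\beta_j h_j(x)$ are zero automatically because $h_j(x) = 0$. Note that $\frac{\partial L}{\partial \alpha_i} = f_i(x)$ need not be zero: complementary slackness only forces it to zero when $\alpha_i > 0$.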
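- If $\alpha_n > 0$, then training point $n$ contributes to $w$, which characterizes the learned hyperplane. The feature vector corresponding to $\alpha_n$ is then known as one of the support vectors.
- Since $\alpha_n > 0$ for support vectors, complementary slackness requires $1 - \zeta_n - y_n(w^T\phi(x_n) + b) = 0$, i.e. $y_n(w^T\phi(x_n) + b) = 1 - \zeta_n$. The value of $\zeta_n$ then tells us where the support vector lies: $\zeta_n = 0$ means it is classified correctly and sits on the margin, $0 < \zeta_n < 1$ means it is classified correctly but within the margin, and $\zeta_n > 1$ means it is misclassified. Simply put, support vectors are training data that are misclassified, classified correctly but within the margin, or classified correctly on the margin.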
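
To make the min-max versus max-min picture concrete, here is a minimal sketch on a one-dimensional toy problem. The problem itself ($\min_x x^2$ s.t. $1 - x \leq 0$) is an assumption chosen for illustration and is not from the notes; the function names are hypothetical. Because the problem is convex with an affine constraint, the dual optimum should match the primal optimum, and complementary slackness should hold.

```python
# Toy problem (an illustrative assumption, not from the notes):
#   minimize f0(x) = x^2   subject to   f1(x) = 1 - x <= 0
# Lagrangian: L(x, alpha) = x^2 + alpha * (1 - x), with alpha >= 0.

def lagrangian(x, alpha):
    return x ** 2 + alpha * (1.0 - x)

def dual_function(alpha):
    # Inner minimization over x: dL/dx = 2x - alpha = 0  =>  x = alpha / 2.
    x_min = alpha / 2.0
    return lagrangian(x_min, alpha)

# Primal optimum by inspection: the smallest x^2 with x >= 1 is at x = 1.
x_star = 1.0
p_star = x_star ** 2

# Outer maximization of the dual over a grid of alpha in [0, 5].
alphas = [i / 100.0 for i in range(0, 501)]
d_star, alpha_star = max((dual_function(a), a) for a in alphas)

# Weak duality: every dual value lower-bounds the primal optimum.
weak_duality_holds = all(dual_function(a) <= p_star + 1e-12 for a in alphas)

# Complementary slackness at the optimum: alpha* * f1(x*) should be 0.
slackness = alpha_star * (1.0 - x_star)

print("p* =", p_star, "d* =", d_star, "alpha* =", alpha_star)
```

Here the constraint is active at the optimum ($x^* = 1$), so the multiplier is allowed to be strictly positive, which mirrors how support vectors in the SVM are the points whose margin constraints are active.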

### Ensemble Methods
- Consider a set of predictors/base learners $h_1, \dots, h_L$. We can combine their predictions to hopefully get a more accurate predictor $H$.
- This can work when the predictors make different types of mistakes. Predictors that are based on different sets of assumptions learn different things, and therefore tend to make different errors that can partly cancel out when combined.
- The first way to do this is to train multiple classifiers on the same data.
- Train $h_1, \dots, h_L$ on a given training dataset, and then combine their predictions by majority vote: $H(x) = \text{sign}(\sum_{l=1}^{L} h_l(x))$, where $h_l(x) \in \{+1, -1\}$.
- Or, we could use a weighted majority vote, if we think that some classifiers are more important than others: $H(x) = \text{sign}(w_1 h_1(x) + \dots + w_L h_L(x))$.
    - To learn these weights, we can evaluate candidate weight settings on a validation set and keep the setting that performs best.
- If we are doing regression instead of binary classification, we can use the mean, median, or weighted mean to combine predictions.
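
The two voting rules above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names are hypothetical, ties are broken in favor of $+1$ as an arbitrary choice, and weighting each classifier by its validation accuracy is just one simple heuristic for setting the weights (the notes do not prescribe a specific method).

```python
def sign(z):
    # Ties (z == 0) go to +1; this tie-breaking rule is an arbitrary assumption.
    return 1 if z >= 0 else -1

def majority_vote(predictions):
    """Unweighted vote over predictions in {+1, -1}."""
    return sign(sum(predictions))

def weighted_vote(predictions, weights):
    """Weighted vote: sign(w_1 h_1 + ... + w_L h_L)."""
    return sign(sum(w * p for w, p in zip(weights, predictions)))

def fit_weights_by_accuracy(val_predictions, val_labels):
    """One simple heuristic (an assumption): weight = validation accuracy.

    val_predictions[l][n] is classifier l's prediction on validation point n.
    """
    weights = []
    for preds in val_predictions:
        correct = sum(1 for p, y in zip(preds, val_labels) if p == y)
        weights.append(correct / len(val_labels))
    return weights

# Three classifiers disagree on one input: two say -1, one says +1.
preds = [1, -1, -1]
print(majority_vote(preds))                   # unweighted vote
print(weighted_vote(preds, [0.9, 0.3, 0.3]))  # the first classifier dominates
```

With equal weights the two dissenting classifiers win the vote, while the weighted version lets a single trusted classifier override them.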

### Training same classifier on multiple datasets
- Split the original training dataset into multiple smaller datasets and train a classifier on each.
- Each classifier is trained on only a small dataset, so the individual classifiers tend not to generalize very well.



### Bagging/Bootstrap Aggregating
- Bootstrap resampling: start from a training dataset $D$ with $N$ instances.
- Create $B$ bootstrap training datasets: $D_1, \dots, D_B$.
- Each $D_b$ contains $N$ training examples drawn randomly from $D$ with replacement.
- Train a classifier $h_b$ on each $D_b$ and take the majority vote.
- Basically, we construct a bunch of datasets that are the same size as the original training dataset, but we sample from the original with replacement, which leads to random duplicates (and omissions) in each of the new sampled datasets. Then we train an individual classifier on each of these datasets, which differ from each other, and combine their predictions.
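
The bagging procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the base learner is a one-dimensional threshold classifier (a decision stump), which the notes do not specify for bagging, and the names `bootstrap_sample`, `train_stump`, `bagging`, and `toy_data` are all hypothetical.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def train_stump(data):
    """Fit a 1-D threshold classifier by minimizing the number of training errors.

    The stump predicts `polarity` when x > threshold and `-polarity` otherwise.
    Using stumps as the base learner is an assumption made for this sketch.
    """
    xs = sorted({x for x, _ in data})
    thresholds = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    best = None
    for thr in thresholds:
        for polarity in (1, -1):
            errors = sum(
                1 for x, y in data
                if (polarity if x > thr else -polarity) != y
            )
            if best is None or errors < best[0]:
                best = (errors, thr, polarity)
    _, thr, polarity = best
    return lambda x: polarity if x > thr else -polarity

def bagging(data, num_models, seed=0):
    """Train one stump per bootstrap dataset and combine them by majority vote."""
    rng = random.Random(seed)
    models = [train_stump(bootstrap_sample(data, rng)) for _ in range(num_models)]

    def predict(x):
        votes = sum(h(x) for h in models)
        return 1 if votes >= 0 else -1  # ties go to +1 (arbitrary choice)

    return predict

# Toy 1-D dataset of (x, y) pairs: negatives on the left, positives on the right.
toy_data = [(-2.0, -1), (-1.0, -1), (-0.5, -1), (0.5, 1), (1.0, 1), (2.0, 1)]
predictor = bagging(toy_data, num_models=25, seed=0)
print(predictor(-3.0), predictor(3.0))
```

Because each bootstrap dataset omits some points and duplicates others, the individual stumps end up with slightly different thresholds, and the vote averages over that variation.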

### Adaboost Algorithm
- High-level ideas: combine lots of classifiers 
- construct/identify these classifiers one at a time
- use weak/base classifiers to arrive at complex decision boundaries (strong classifiers)
- Adaboost takes an ensemble of weak/base classifiers and turns them into a single strong classifier
- Given: $N$ training examples $\{x_n, y_n\}$ where $y_n \in \{+1, -1\}$ and some way of constructing weak classifiers (e.g., decision stumps)
- Initialize weights $w_1(n) = \frac{1}{N}$ for every training sample.
- For $t = 1, \dots, T$ (where $T$ is a hyperparameter):
    - Train a weak classifier $h_t(x)$ using the current weights $w_t(n)$ by minimizing the weighted classification error: $\epsilon_t = \sum_n w_t(n) I(y_n \neq h_t(x_n))$
    - Compute the contribution of this classifier to the overall strong classifier: $\beta_t = \frac{1}{2}\log(\frac{1 - \epsilon_t}{\epsilon_t})$. If we have a more accurate classifier, $\epsilon_t$ is small and $\beta_t$ is larger. Similarly, if the classifier is close to random guessing, $\epsilon_t$ approaches $0.5$ and $\beta_t$ approaches 0.
    - Update the weights for the training points: $w_{t+1}(n) = w_t(n)e^{-\beta_t y_n h_t(x_n)}$. If we predicted incorrectly for a particular training point, then $y_n h_t(x_n) = -1$ and its weight is multiplied by $e^{\beta_t} > 1$, so it increases. If we predicted correctly, then $y_n h_t(x_n) = +1$ and its weight is multiplied by $e^{-\beta_t} < 1$, so it decreases. To make sure that the weights don't blow up, we normalize them such that $\sum_n w_{t + 1}(n) = 1$.
- Output the final classifier: $h(x) = \text{sign}\left(\sum_{t = 1}^{T} \beta_t h_t(x)\right)$. Each term in the sum is the weak classifier from one iteration, multiplied by a value that indicates how important it is, i.e. how much it contributes to the overall strong classifier.
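
The loop above can be sketched in plain Python using one-dimensional decision stumps as the weak classifiers. This is a minimal sketch, not a reference implementation: the helper names (`stump_predict`, `fit_weighted_stump`, `adaboost`, `predict`) and the toy data are hypothetical, and clipping $\epsilon_t$ away from 0 and 1 is an implementation choice added here to keep the logarithm finite.

```python
import math

def stump_predict(x, thr, polarity):
    """Decision stump: predict `polarity` if x > thr, otherwise `-polarity`."""
    return polarity if x > thr else -polarity

def fit_weighted_stump(xs, ys, w):
    """Pick the (threshold, polarity) pair with the smallest weighted error."""
    sorted_xs = sorted(set(xs))
    thresholds = [sorted_xs[0] - 1.0] + [
        (a + b) / 2.0 for a, b in zip(sorted_xs, sorted_xs[1:])
    ]
    best = None
    for thr in thresholds:
        for polarity in (1, -1):
            eps = sum(
                wn for x, y, wn in zip(xs, ys, w)
                if stump_predict(x, thr, polarity) != y
            )
            if best is None or eps < best[0]:
                best = (eps, thr, polarity)
    return best

def adaboost(xs, ys, num_rounds):
    n = len(xs)
    w = [1.0 / n] * n          # w_1(n) = 1/N
    ensemble = []              # list of (beta_t, threshold, polarity)
    for _ in range(num_rounds):
        eps, thr, polarity = fit_weighted_stump(xs, ys, w)
        # Clip eps so the log is finite (an assumption, not part of the notes).
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)
        beta = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((beta, thr, polarity))
        # w_{t+1}(n) = w_t(n) * exp(-beta_t * y_n * h_t(x_n)), then normalize.
        w = [
            wn * math.exp(-beta * y * stump_predict(x, thr, polarity))
            for x, y, wn in zip(xs, ys, w)
        ]
        total = sum(w)
        w = [wn / total for wn in w]
    return ensemble

def predict(ensemble, x):
    """Final classifier: sign of the beta-weighted sum of weak predictions."""
    score = sum(beta * stump_predict(x, thr, pol) for beta, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data: labels +1, -1, +1 along the line, which no single stump can fit.
toy_xs = [1.0, 2.0, 3.0]
toy_ys = [1, -1, 1]
ensemble = adaboost(toy_xs, toy_ys, num_rounds=3)
print([predict(ensemble, x) for x in toy_xs])
```

On this three-point dataset every individual stump misclassifies at least one point, but after three rounds the weighted combination classifies all three correctly, which is the sense in which weak classifiers are turned into a strong one.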
