### Lecture 15: Adaboost and PCA

#### More on Adaboost
- Given $N$ samples $\{(x_n, y_n)\}_{n=1}^{N}$ where $y_n \in \{-1, +1\}$ and some way of constructing weak/base classifiers.
- For $t = 1, \dots, T$: 
    - choose a weak learner $h_t(x)$ and compute its contribution $\beta_t$. 
- Output the final classifier: $h(x) = \text{sign}\left(\sum_{t=1}^{T} \beta_t h_t(x)\right)$. 

- Adaboost works by minimizing the exponential empirical risk. 
- The classifier is built additively: $a_t(x) = a_{t-1}(x) + \beta_t h_t(x)$. 
- This iteratively builds a strong classifier. 
- At the $t$th step, how do we choose an optimal $h_t(x)$ and $\beta_t$? 
- Since we want to greedily minimize the exponential risk: 
- $(h_t^*(x), \beta_t^*) = \arg\min_{h_t(x), \beta_t} \sum_n e^{-y_n a_t(x_n)} = \arg\min_{h_t(x), \beta_t} \sum_n e^{-y_n[a_{t-1}(x_n) + \beta_t h_t(x_n)]} = \arg\min_{h_t(x), \beta_t} \sum_n w_t(n)e^{-y_n\beta_t h_t(x_n)}$
- where $w_t(n)$ is shorthand for the term $e^{-y_n a_{t-1}(x_n)}$, normalized to sum to one across all training examples (so the weights don't blow up). Normalizing only rescales the objective by a positive constant, so it does not change the minimizer. 
- So in each of the $T$ iterations, we solve the following problem: 
- $\arg\min_{h_t(x), \beta_t} \sum_n w_t(n)e^{-y_n\beta_t h_t(x_n)}$
- Which is equivalent to: 
- $\arg\min_{h_t(x), \beta_t} (e^{\beta_t} - e^{-\beta_t})\sum_n w_t(n) I(y_n \neq h_t(x_n)) + e^{-\beta_t} \sum_n w_t(n)$ (it's a good exercise to show this: split the sum into correctly and incorrectly classified examples, for which $y_n h_t(x_n)$ is $+1$ and $-1$ respectively). 
- For any $\beta_t > 0$, this gives the optimal $h_t(x)$ by finding $\arg\min_{h_t(x)} \sum_n w_t(n) I(y_n \neq h_t(x_n))$, which is the weighted classification error minimized in each step of Adaboost. 
- This also gives the optimal $\beta_t = \frac{1}{2} \log \frac{1 - \zeta_t}{\zeta_t}$, where $\zeta_t = \sum_n w_t(n) I(y_n \neq h_t(x_n))$ is the weighted error of the chosen $h_t$. This can be shown by setting the derivative with respect to $\beta_t$ to zero and using the fact that the weights are normalized, so they sum to 1. 
- Then, the new classifier is given by $a_t(x) = a_{t-1}(x) + \beta_t h_t(x)$ and the new weights follow from the exponential loss:
- $w_{t+1}(n) \propto w_t(n) e^{\beta_t}$ if $y_n \neq h_t(x_n)$, $w_{t+1}(n) \propto w_t(n) e^{-\beta_t}$ if $y_n = h_t(x_n)$, followed by renormalizing so the weights sum to one. 
- The weight of a training example is decreased if we classified it correctly and increased if we classified it incorrectly. 
- Adaboost is a meta-algorithm: it can use any base/weak classifier that can minimize the weighted classification error. Every iteration, we pick the best weak learner. 
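
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the weak learner is assumed to be a threshold stump on a single feature, and the names `fit_stump`, `stump_predict`, `adaboost`, `predict`, `X_toy`, and `y_toy` are made up for this example.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Threshold stump: predicts -pol where X[:, j] <= thr, and +pol elsewhere."""
    return np.where(X[:, j] <= thr, -pol, pol)

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, weighted error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = np.sum(w[stump_predict(X, j, thr, pol) != y])
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost(X, y, T=10):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)          # start from uniform weights
    ensemble = []
    for _ in range(T):
        j, thr, pol, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # guard against log(0)
        beta = 0.5 * np.log((1 - err) / err)   # optimal contribution
        pred = stump_predict(X, j, thr, pol)
        # y * pred is +1 when correct and -1 when wrong, so this multiplies
        # by e^{-beta} for correct examples and e^{+beta} for mistakes.
        w = w * np.exp(-beta * y * pred)
        w /= w.sum()                           # renormalize to sum to one
        ensemble.append((beta, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(beta * stump_predict(X, j, thr, pol)
                for beta, j, thr, pol in ensemble)
    return np.sign(score)

# Toy 1-D data that no single stump can classify perfectly.
X_toy = np.arange(6, dtype=float).reshape(-1, 1)
y_toy = np.array([1, 1, -1, -1, 1, 1])

for T in (1, 3):
    acc = np.mean(predict(adaboost(X_toy, y_toy, T=T), X_toy) == y_toy)
    print(f"T={T}: training accuracy = {acc:.2f}")
```

On this toy set a single stump has to misclassify at least two points, while a few boosting rounds are enough to combine stumps into a classifier that fits the training data.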

### Unsupervised Learning
- Raw data is often complex and high-dimensional.
- It often has redundant info, such as shoe size in American and European units, which are highly correlated (they would be perfectly correlated if the data had absolutely no noise). 
- We want to eliminate redundant info by finding lower-dimensional representations. This also reduces our computational costs. 

### PCA: Basic Ideas
- Pick a line and project the data onto it. 
- With PCA, we want to minimize the Euclidean distances between the original points and their projections. 
- Another goal: maximize variance. To identify patterns, we want to study the variation across our observations. So when reducing dimensionality, we don't want to lose much variance, since that would indicate losing a lot of the information in the data. 
- It can be shown that finding a projection onto a lower-dimensional space that maximizes variance and finding one that minimizes reconstruction error are mathematically equivalent. 
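
A quick numerical check of this equivalence, as a sketch on synthetic 2-D data (the data and the helper name `variance_and_error` are assumptions for illustration). For centered data and a unit direction $p$, the variance of the projection and the mean squared reconstruction error always add up to the same total, so maximizing one minimizes the other.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic correlated 2-D data, centered so each feature has zero mean.
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
X = X - X.mean(axis=0)

def variance_and_error(X, p):
    """Project centered data onto direction p; return (variance, reconstruction error)."""
    p = p / np.linalg.norm(p)          # enforce unit norm
    z = X @ p                          # 1-D coordinates along p
    X_hat = np.outer(z, p)             # projections back in the original space
    variance = np.mean(z ** 2)
    error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return variance, error

# Total variance of the data: this is fixed, whatever line we choose.
total = np.mean(np.sum(X ** 2, axis=1))

for angle in np.linspace(0, np.pi, 5):
    p = np.array([np.cos(angle), np.sin(angle)])
    v, e = variance_and_error(X, p)
    print(f"angle={angle:.2f}  variance={v:.3f}  error={e:.3f}  sum={v + e:.3f}  total={total:.3f}")
```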

### PCA Formulation: Finding a lower-dimensional representation
- Let $X$ be the raw data, which is $n \times d$ dimensional. Then, $Z = XP$ is the reduced representation, also called the PCA scores. 
- Here, $P \in R^{d \times k}$ and $Z \in R^{n \times k}$. This reduces the dimensionality if $k < d$. 
- Given $n$ training points with $d$ features, we denote $x_j^i$ to be the $j$th feature of the $i$th training example, and we let $\mu_j$ be the mean of the $j$th feature across the dataset. 
- We denote the variance of the $j$th feature as $\sigma_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_j^i - \mu_j)^2$
- We denote the covariance of the first and second features, assuming zero mean, as $\sigma_{1,2} = \frac{1}{n} \sum_{i=1}^{n} x_1^i x_2^i$. 
- The covariance is symmetric ($\sigma_{1,2} = \sigma_{2,1}$). If it's zero, the features are uncorrelated; if it has large magnitude, they are (anti)correlated/redundant. 
- We can generalize this idea to many features with the covariance matrix. For zero-mean features, it's the $d \times d$ matrix: 
- $C_x = \frac{1}{n}X^TX$. In this matrix, the $i$th diagonal entry equals the variance of the $i$th feature, and the $(i,j)$th entry is the covariance between the $i$th and $j$th features. 
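
As a small sketch of the covariance matrix formula, the snippet below builds a two-feature dataset in the spirit of the shoe-size example. The numbers (size range, linear relation between the two units, noise level) are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical shoe sizes: the second unit is a noisy linear function of the
# first. The coefficients here are illustrative, not a real conversion table.
us_size = rng.uniform(6, 13, size=n)
eu_size = 1.27 * us_size + 31 + rng.normal(scale=0.3, size=n)

X = np.column_stack([us_size, eu_size])
X = X - X.mean(axis=0)                 # center: zero-mean features

C = X.T @ X / n                        # d x d covariance matrix

# Diagonal entries are the per-feature variances, off-diagonals the covariance.
print("covariance matrix:\n", C)

# Normalizing the covariance by the standard deviations gives the correlation.
corr = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
print("correlation between the two features:", corr)
```

NumPy's `np.cov(X, rowvar=False, bias=True)` computes the same matrix (it also centers the data and, with `bias=True`, divides by $n$).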

### PCA Goal
- Our goal is to find $P$ such that the variance of the reduced representation, $Z$, is maximized. This is equivalent to finding the $P$ that minimizes the reconstruction (Euclidean distance) error. 
- Quick answer: the columns of $P$ are the top $k$ eigenvectors of $C_x$ (those with the $k$ largest eigenvalues). 
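
The quick answer translates almost directly into code. The following is a minimal sketch using `np.linalg.eigh`; the function name `pca` and the synthetic data are assumptions for the example.

```python
import numpy as np

def pca(X, k):
    """Return PCA scores Z (n x k), projection P (d x k), and all eigenvalues."""
    X_centered = X - X.mean(axis=0)
    n = X_centered.shape[0]
    C = X_centered.T @ X_centered / n          # covariance matrix C_x

    # eigh is for symmetric matrices and returns eigenvalues in ascending
    # order, so we reorder them from largest to smallest.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    P = eigvecs[:, :k]                         # top-k eigenvectors as columns
    Z = X_centered @ P                         # reduced representation
    return Z, P, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
Z, P, eigvals = pca(X, k=2)
print("Z shape:", Z.shape)
print("variance of each score column:", Z.var(axis=0))
print("top-2 eigenvalues:            ", eigvals[:2])
```

The variance of each column of $Z$ matches the corresponding eigenvalue, which is the property the derivation in the next section explains.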

### Deriving PCA Solution with Eigendecomposition
- All covariance matrices have an eigendecomposition: $C_x = U \Lambda U^T$. $U \in R^{d \times d}$ is a matrix whose columns are eigenvectors of $C_x$, sorted by eigenvalue (greatest to smallest). 
- $\Lambda \in R^{d \times d}$ is a matrix whose diagonal entries are the eigenvalues, sorted from greatest to least, and whose off-diagonal entries are zero. So $\Lambda$ is a diagonal matrix. 
- The $d$ eigenvectors are orthonormal directions of max variance. This means that the vectors collectively form an orthonormal eigenbasis for $R^d$, the $d$-dimensional feature space. 
- The associated eigenvalue gives the variance of the data along the direction of its eigenvector. 
- For example, the first eigenvector is the direction of maximal variance, and the variance is given by $\lambda_1$. 
- Since the eigenvectors $(u_1 ... u_d)$ are all orthonormal, this means that $u_j^T u_i = 0$ if i is not equal to j, and $u_j^T u_i = 1$ if i = j. 
- Since $u_i$ is an eigenvector of the covariance matrix $C_x$, we have the following identity: $C_x u_i = \lambda_i u_i$ where $\lambda_i \geq 0$ and $||u_i|| = 1$. 
- Also, the covariance matrix is symmetric PSD. 
- We can define a matrix $U = [u_1 \dots u_d] \in R^{d \times d}$ by arranging the eigenvectors as columns in the matrix. 
- Then, the $(i,j)$th entry of $U^TU$ is the inner product $u_i^T u_j$, so orthonormality gives $U^TU = I$. Since $U$ is square, this also means $UU^T = I$. 
- Now, we can write $C_x U = C_x[u_1 \dots u_d] = [C_x u_1 \dots C_x u_d] = [\lambda_1 u_1 \dots \lambda_d u_d]$. 
- Therefore, $C_x U = U \begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_d \end{bmatrix} = U\Lambda$. 
- Here, $\Lambda$ is the diagonal matrix of eigenvalues. 
- So, $C_x U = U\Lambda$. Since the inverse of an orthonormal matrix is its transpose, we can multiply both sides on the right by $U^{-1} = U^T$: 
- $C_x U U^T = U\Lambda U^T$, and since $UU^T = I$, this gives $C_x = U \Lambda U^T$, which is the eigendecomposition of the covariance matrix. 
- Convention: the eigenvalues $\lambda_1, \dots, \lambda_d$ are ordered in nonincreasing order from left to right. 
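
The identities in this derivation are easy to check numerically. The sketch below does so on a random centered dataset (the data and variable names are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]                 # symmetric PSD covariance matrix

eigvals, U = np.linalg.eigh(C)           # ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # follow the nonincreasing convention
eigvals, U = eigvals[order], U[:, order]
Lam = np.diag(eigvals)                   # diagonal matrix of eigenvalues

checks = {
    "U^T U = I": np.allclose(U.T @ U, np.eye(4)),
    "U U^T = I": np.allclose(U @ U.T, np.eye(4)),
    "C U = U Lambda": np.allclose(C @ U, U @ Lam),
    "C = U Lambda U^T": np.allclose(C, U @ Lam @ U.T),
    # PSD: eigenvalues are nonnegative up to floating-point error.
    "eigenvalues >= 0": bool(np.all(eigvals >= -1e-12)),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```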

### Choosing K and Other practical tips
- Visualization: Use K = 2 or 3. 
- Else, choose some K that captures "most" of the variance in the data. This can be set to be a particular percentage. 
- Since $\lambda_i$ is the variance in the direction of the $i$th eigenvector, the fraction of retained variance is given by $\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$. We can pick the smallest $k$ for which this fraction reaches a threshold, 80% for example. 
- The assumptions for PCA which are linearity and orthogonality may not always be applicable. 
- Centering the data is essential: preprocess features to have 0 mean. 
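- PCA depends on the scale of the data: features measured in larger units have larger variance and tend to dominate the principal components, so features are often rescaled first. 
- PCA relies on the assumption that maximizing retained variance (equivalently, minimizing reconstruction error) is the right goal for the task. 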
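
The retained-variance rule can be sketched as a small helper. The function name `choose_k` and the example eigenvalues are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def choose_k(eigvals, target=0.80):
    """Smallest k whose top-k eigenvalues retain at least `target` of the variance."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # largest first
    retained = np.cumsum(eigvals) / np.sum(eigvals)             # fraction kept for k = 1..d
    # First index where the retained fraction reaches the target (0-based), plus one.
    k = int(np.searchsorted(retained, target)) + 1
    return min(k, len(eigvals)), retained

# Made-up eigenvalues of some covariance matrix.
example_eigvals = [6.0, 2.5, 1.0, 0.5]
k, retained = choose_k(example_eigvals, target=0.80)
print("retained variance for k = 1..d:", retained)
print("chosen k:", k)
```

Here the retained fractions are 0.60, 0.85, 0.95, and 1.00, so an 80% target selects $k = 2$.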

### PCA Formulation, K = 1
- We want to find a one-dimensional representation of the raw data, where $X \in R^{n \times d}$ and $z = Xp$ is $n \times 1$.
- We want to maximize the variance of $z$, which (for centered data) is given by $\sigma_z^2 = \frac{1}{n}\sum_{i=1}^{n} (z^i)^2 = \frac{1}{n} ||z||_2^2$.
- Therefore, we want $\max_p \sigma_z^2$ with the constraint $||p||_2^2 = 1$ (we need this constraint for a reason similar to why we needed it in SVM: otherwise the objective can be made arbitrarily large just by scaling $p$). 
- Since $\sigma_z^2 = \frac{1}{n} z^T z$ and $z = Xp$, we have the new problem $\max_p \frac{1}{n} (Xp)^T Xp = \max_p \frac{1}{n} p^T X^T X p = \max_p p^T C_x p$.
- So our new problem is $\max_p p^T C_x p$ s.t. $||p||_2 = 1$. 
- Consider the connection to eigenvectors: remember that $C_x u_i = \lambda_i u_i$. Since the eigenvectors have unit norm, $u_i^Tu_i = 1$, and therefore $u_i^T C_x u_i = \lambda_i$. 
- But this is the expression that we're optimizing, so we choose $p = u_1$, the top eigenvector of $C_x$, i.e., the eigenvector associated with the largest eigenvalue $\lambda_1$. 
- We can make a similar argument for $k > 1$: choose the 1st, 2nd, ..., $k$th eigenvectors. 
- But our original problem didn't constrain $p$ to be an eigenvector. However, any vector $p \in R^d$ can be written as a linear combination of the eigenvectors that form the (orthonormal) eigenbasis for $R^d$: $p = \sum_{i=1}^{d} \alpha_i u_i$. 
- The coefficient $\alpha_i = u_i^T p$ is the projection of $p$ onto the $i$th eigenvector. Substituting gives $p^T C_x p = \sum_{i=1}^{d} \alpha_i^2 \lambda_i$, and the constraint $||p||_2 = 1$ becomes $\sum_{i=1}^{d} \alpha_i^2 = 1$ because the eigenvectors are orthonormal. The objective is therefore a weighted average of the eigenvalues, which is maximized by setting $\alpha_1 = 1$ and the rest to zero, i.e., $p = u_1$. 
- It is important to note that the $d$ eigenvectors of the covariance matrix $C_x = \frac{1}{n} X^TX$ give us an orthonormal eigenbasis of the $d$-dimensional space $R^d$. 
- More generally: 
- For $i \in \{1, \dots, k\}$: 
    - Find the direction of max variance that is orthogonal to the previously selected directions. 
    - Project onto this direction. 
    - Locations along this direction are the $i$th feature in the new representation. 
- If $k = d$ we don't lose any information, but our new matrix $Z$ is different: its features are uncorrelated. When each feature is additionally rescaled to unit variance, this is usually referred to as whitening. 
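
The claims in this section can be checked numerically. The sketch below uses synthetic 3-D data (an assumption for illustration) to compare the top eigenvector against random unit directions, verify the $\alpha$ expansion, and look at the $k = d$ case.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)
n = X.shape[0]
C = X.T @ X / n

eigvals, U = np.linalg.eigh(C)
eigvals, U = eigvals[::-1], U[:, ::-1]       # largest eigenvalue first

# k = 1: the top eigenvector attains p^T C p = lambda_1 ...
u1 = U[:, 0]
top_variance = u1 @ C @ u1

# ... and no random unit vector does better.
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_variances = np.einsum("ij,jk,ik->i", random_dirs, C, random_dirs)
print("lambda_1:", eigvals[0], " best random direction:", random_variances.max())

# Expansion of a unit vector p in the eigenbasis: alpha_i = u_i^T p.
p = random_dirs[0]
alpha = U.T @ p
print("sum of alpha_i^2:", np.sum(alpha ** 2))
print("p^T C p:", p @ C @ p, " sum alpha_i^2 lambda_i:", np.sum(alpha ** 2 * eigvals))

# k = d: the scores are uncorrelated, with covariance diag(lambda_1..lambda_d).
Z = X @ U
C_z = Z.T @ Z / n

# Dividing each column by sqrt(lambda_i) additionally gives unit variances.
Z_white = Z / np.sqrt(eigvals)
C_white = Z_white.T @ Z_white / n
print("covariance of Z:\n", np.round(C_z, 6))
print("covariance after rescaling:\n", np.round(C_white, 6))
```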

