## Quiz pill 11: Kernels


1. The notion of a kernel allows any linear classifier to be transformed into a non-linear one.
2. Kernels are measures of similarity in a certain transformed space (usually high-dimensional).
3. The kernel trick allows any algorithm that depends on the data only through inner products to be kernelized, by replacing $\langle x, y \rangle$ with $K(x,y)$.
4. We can go beyond the kernel trick by considering RKHS equivalents.

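- **Kernels allow linear models to be used successfully on non-linear problems: TRUE**

 The kernel implicitly maps the data into a Reproducing Kernel Hilbert Space, where a linear model corresponds to a non-linear model in the original input space.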
- **The kernel trick is based on replacing any inner product by a kernel evaluation: TRUE**

 The kernel trick replaces every inner product $\langle x_i,x_j \rangle$ with a kernel evaluation $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, i.e. the inner product of the two points after mapping them into the reproducing kernel Hilbert space. For example, for instance-based learning with the classifier
$$
f(x) = \sum_{i=1}^N \nu_i y_i x_i^Tx
$$
then the **Kernelized version of the classifier** is given as follows:
$$
f(x) = \sum_{i=1}^N \nu_i y_i K(x_i,x)
$$
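The correct-classification condition is $y_i(w^Tx_i) > 0$; in the kernelized setting the linear score $w^Tx_i$ is replaced by a kernel expansion $\sum_j \alpha_j K(x_j, x_i)$.
If the cost function is constant (e.g. equal to 1), there is nothing to minimize and we just look for a point in the feasible set.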
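
To make the kernelized classifier $f(x) = \sum_i \nu_i y_i K(x_i, x)$ concrete, here is a minimal sketch in Python with NumPy. The toy data and the coefficients `nu` are made up for illustration (they are not learned here), and the function names are my own. With the linear kernel the result coincides with the original classifier $w^Tx$, where $w = \sum_i \nu_i y_i x_i$.

```python
import numpy as np

def linear_kernel(a, b):
    """K(a, b) = a^T b."""
    return float(np.dot(a, b))

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / sigma^2), the convention used in these notes."""
    return float(np.exp(-np.sum((a - b) ** 2) / sigma ** 2))

def decision_function(x, X_train, y_train, nu, kernel):
    """f(x) = sum_i nu_i * y_i * K(x_i, x)."""
    return sum(n * y * kernel(xi, x) for n, y, xi in zip(nu, y_train, X_train))

# Toy data, made up for illustration.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y_train = np.array([1, 1, -1])
nu = np.array([0.5, 0.2, 0.7])  # hypothetical coefficients, not learned here
x_new = np.array([0.5, 0.5])

# Same classifier, two kernels: only the similarity measure changes.
f_linear = decision_function(x_new, X_train, y_train, nu, linear_kernel)
f_rbf = decision_function(x_new, X_train, y_train, nu, rbf_kernel)
print(f_linear, f_rbf)
```

Swapping `linear_kernel` for `rbf_kernel` is the whole kernel trick here: the classifier code itself does not change.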
- **The representer theorem states that the minimizer of a regularized loss objective function in the Reproducing Kernel Hilbert Space is a linear combination of the kernel evaluated at the points of the data set: TRUE**
- **The equation $k(x,y) = x^Ty$ corresponds to a linear kernel: TRUE**

 Note that the output is a linear combination of the components of $x$ (or $y$), with the other vector providing the coefficients.
- **Increasing the sigma parameter in an RBF kernel reduces the non-linear capacity for modelling boundaries: TRUE**

 RBF additive models: Kernel Regularized Least Squares. RKHS needs the definition of a positive semidefinite kernel function $K$:
$$
\begin{aligned}
\textbf{linear:}\quad & K(x_i,x_j)=x_i^Tx_j\\
\textbf{polynomial:}\quad & K(x_i,x_j)=(x_i^Tx_j+1)^d\\
\textbf{Gaussian:}\quad & K(x_i,x_j)=\exp\left( -\frac{\|x_i-x_j\|^2}{\sigma^2} \right)
\end{aligned}
$$
Consider the Gaussian RBF kernel, whose parameter is $\sigma$. It looks like values of $\sigma$ close to 1 give the best fit, with a kernel close to the ideal one, whereas much larger values make it worse: as $\sigma$ grows, $K(x_i,x_j) \to 1$ for every pair of points, so the kernel can no longer tell nearby points from distant ones. Increasing $\sigma$ beyond about 1 therefore reduces the non-linear capacity for modelling boundaries.
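
The effect of $\sigma$ can be checked numerically. The sketch below (NumPy, with three made-up points and helper names of my own) computes the Gaussian Gram matrix for several values of $\sigma$ and prints the range of its off-diagonal entries.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / sigma ** 2)

def off_diagonal(K):
    """Return the off-diagonal entries of a square matrix as a flat array."""
    return K[~np.eye(len(K), dtype=bool)]

# Three points, made up for illustration.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

for sigma in (0.1, 1.0, 10.0, 100.0):
    entries = off_diagonal(rbf_gram(X, sigma))
    print(f"sigma={sigma}: off-diagonal range [{entries.min():.4f}, {entries.max():.4f}]")
```

For very small $\sigma$ the off-diagonal entries vanish (each point is similar only to itself), and for very large $\sigma$ they all approach 1 (every point looks like every other), which is the loss of non-linear capacity described above.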
- **A good kernel has a diagonal Gram matrix: FALSE**

 The **Gram matrix** shows the degree of shared information among training samples. **Gram matrix interpretation:**
    - **Bad Kernel:** Mostly diagonal $\implies$ most points are orthogonal to each other, no clusters, no structure.
    - **Good Kernel:** The matrix has structure and shows clusters.
    - **Ideal Kernel:** $\mathbf{K}_{\text{ideal}} = yy^T$
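
As a rough illustration of this interpretation, the sketch below compares a nearly diagonal Gram matrix and a structured one against $K_{\text{ideal}} = yy^T$. The comparison uses the normalized Frobenius inner product (often called kernel alignment); that measure is my choice for the example and is not part of these notes, and the data are made up.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / sigma ** 2)

def alignment(K1, K2):
    """Normalized Frobenius inner product between two matrices (in [-1, 1])."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

# Two tight clusters, one per class, sorted by label (made-up data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
y = np.array([1, 1, -1, -1])
K_ideal = np.outer(y, y)

K_diag = rbf_gram(X, sigma=0.01)    # almost the identity: no shared information
K_struct = rbf_gram(X, sigma=1.0)   # two clear blocks, one per cluster

print(np.round(K_diag, 3))
print(np.round(K_struct, 3))
print(alignment(K_diag, K_ideal), alignment(K_struct, K_ideal))
```

The structured matrix shows the two clusters as blocks and scores higher against $yy^T$ than the diagonal one.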

# Pill 12: Ensemble learning

**Q: What does diversity mean in this context? Name three different ways of achieving diversity (give a one-two lines description of each of them).**

Diversity in an ensemble learning context means that different classifiers should make their errors on different samples, so that a strategic combination of the classifiers can correct possible errors in the predicted class of a particular instance.

Diversity can be obtained in different ways:

+ Using **different training sets**. Resampling strategies are used to obtain different optimal classifiers. This effect is related to the stability of the classifier and to its bias and variance.
+ Using **different training parameters for different classifiers**.
+ Combining **different architectures** (e.g. SVM, decision trees, ...).
+ **Training on different features** (e.g. random subspaces or random projections).
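
A minimal sketch of two of these ideas with NumPy, assuming placeholder data and names of my own: each ensemble member gets a bootstrap resample of the rows (different training sets) and a random subset of the columns (different features). The second part shows why diversity matters, using three hypothetical classifiers that each make one error on a different sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_members = 100, 6, 5
X = rng.normal(size=(n_samples, n_features))  # placeholder data

# One (rows, columns) view of the data per ensemble member.
views = []
for _ in range(n_members):
    rows = rng.integers(0, n_samples, size=n_samples)      # bootstrap: rows drawn with replacement
    cols = rng.choice(n_features, size=3, replace=False)   # random subspace: 3 of the 6 features
    views.append((rows, np.sort(cols)))

def majority_vote(predictions):
    """Combine an (n_members, n_points) array of labels in {-1, +1} by majority."""
    return np.sign(np.sum(predictions, axis=0))

# Three hypothetical classifiers, each wrong on a different sample.
y_true = np.array([1, 1, 1, -1, -1, -1])
preds = np.array([
    [-1, 1, 1, -1, -1, -1],   # wrong on sample 0
    [1, -1, 1, -1, -1, -1],   # wrong on sample 1
    [1, 1, 1, 1, -1, -1],     # wrong on sample 3
])
print(majority_vote(preds))
```

Each classifier is wrong once, but because the errors fall on different samples the vote recovers every label. If all three made the same mistake, voting would not help.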



# Exam 13/01/21

**1.1 Describe two ways of forcing a bottleneck in an Autoencoder**

We can force a bottleneck in an autoencoder using either a physical bottleneck or a logical bottleneck:
- Physical: we literally use a layer structure in which the dimension of the layers decreases throughout the encoder, until all the information is compressed in a reduced middle layer that still holds enough information to reconstruct the input. The dimension of the layers then increases again in the decoder until the output has the desired dimension. Structure ><
- Logical: we build the inverse structure. The dimension of the layers increases up to a high-dimensional middle layer on which sparsity is imposed, so that only the necessary neurons are active. The active neurons are then used to reconstruct the information as the layer dimensions decrease again. Structure <>
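
The two structures can be sketched as untrained forward passes in NumPy. Everything here is an assumption made for illustration: the layer sizes, the ReLU activations, and the use of an L1 penalty on the hidden activations as the way of imposing sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layers(sizes):
    """Random weight matrices for consecutive layer sizes (no training here)."""
    return [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, weights):
    """Return the activations of every layer: ReLU hidden layers, linear output."""
    activations = [x]
    for W in weights[:-1]:
        x = np.maximum(0.0, x @ W)
        activations.append(x)
    activations.append(x @ weights[-1])
    return activations

x = rng.normal(size=(4, 8))  # batch of 4 samples with 8 features

# Physical bottleneck "><": 8 -> 4 -> 2 -> 4 -> 8, the middle layer has only 2 units.
acts_physical = forward(x, init_layers([8, 4, 2, 4, 8]))
loss_physical = np.mean((acts_physical[-1] - x) ** 2)

# Logical bottleneck "<>": 8 -> 32 -> 8, wide middle layer plus a sparsity penalty.
acts_logical = forward(x, init_layers([8, 32, 8]))
sparsity_weight = 1e-2  # hypothetical value
loss_logical = (np.mean((acts_logical[-1] - x) ** 2)
                + sparsity_weight * np.mean(np.abs(acts_logical[1])))

print(acts_physical[2].shape, acts_logical[1].shape)
```

In the first case the compression comes from the architecture itself; in the second it comes from the extra term in the loss, which pushes most hidden units towards zero.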

**1.b Ensemble learning is based on the aggregation of decisions of different classifiers. For ensemble learning to work classifiers must be diverse. What does diversity mean in this context? Name three different ways of achieving diversity (give a one-two lines description of each of them).**

In this context diversity means that each classifier of the ensemble should make its errors on different samples, so that the classifiers can compensate for the errors made by the others. Three ways of achieving diversity are:
- Using different training sets
- Using different parameters on the definition of the classifiers
- Training on different features

**2. We need to design a machine learning technique displaying a non-linear decision boundary to achieve the required accuracy for the new service of predicting whether a new customer will order Miso or Shoyu based Ramen. A good prediction will definitely increase the service speed and improve stock management. This in turn will result in a much better customer experience and the removal of waste storage. We have some constraints**:
- **We do not have direct access to the dataset, only to a streaming API that allows us to query and get the data. (This precludes the potential use of interaction variables).**
- **Unfortunately, the engineering and legal departments are new and inexperienced and can only clear the use of logistic regression as base classifier.**
**Describe a solution detailing how to solve the problem using just logistic regression classifiers. You need to explicitly specify:**
- **The hypothesis space**:
The logistic regression classifier models the log-odds of the positive class as a linear function of the input:
$$
\operatorname{logit} \Pr(Y_i = 1) = a + b x_i, \qquad \operatorname{logit} \Pr(Y_i = 0) = -(a + b x_i)
$$
where $\operatorname{logit} p = \log \frac{p}{1-p}$. Least squares is not the method used for estimating $a$ and $b$; maximum likelihood is, and the MLE is found by iteratively re-weighted least squares. For two input features the hypothesis space is therefore $\mathcal{H} = \{ p_{a,b,c} : a, b, c \in \mathbb{R} \}$, where:
$$p_{a,b,c}(x) = \operatorname{logit}^{-1}(ax_1 + bx_2 + c) = \frac{1}{1+e^{-(ax_1 + bx_2+c)}}$$
- **The objective function**
In classification models the most natural loss function is the 0-1 loss:
$$ \mathcal{L} = \sum_i \delta_i$$
where $\delta_i = 0$ if $h(x_i) = y_i$ (correct classification, for $h \in \mathcal{H}$) and $\delta_i = 1$ if the sample is misclassified. However, this loss is non-convex and not differentiable, so it cannot be minimized directly with gradient-based methods and a surrogate loss is used instead.

We could also use the hinge loss, with labels $y_i \in \{-1, +1\}$ and $h(x_i)$ taken as the real-valued score of the model (the linear term before the sigmoid) rather than a probability:
$$ \mathcal{L} = \sum_i \max(0, 1 - y_i h(x_i))$$

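Finally, note that the natural loss function for a logistic regression classifier is the logistic loss. With labels $y_i \in \{-1, +1\}$ and $s(x_i)$ the linear score before the sigmoid, it reads
$$
\mathcal{L} = \sum_i \log\left(1 + e^{-y_i s(x_i)}\right)
$$
which, with labels $y_i \in \{0, 1\}$ and $h(x_i)$ the predicted probability, is the cross-entropy
$$
\mathcal{L} = -\sum_i \left[ y_i \log h(x_i) + (1-y_i) \log(1 - h(x_i)) \right]
$$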
- **The optimization algorithm**:
Using this last loss function, the derivative is straightforward to compute, so we can use gradient descent as the optimization algorithm. Stochastic gradient descent is an even better fit here: it updates the parameters from one sample (or a small batch) at a time, which matches the streaming access to the data, and the noise in its updates adds some variation.
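
As an illustration of the optimization step only, the sketch below trains a single logistic regression with stochastic gradient descent, one sample at a time. The `stream` generator is a hypothetical stand-in for the streaming API, the data are synthetic, and the learning rate is an arbitrary choice. A single logistic regression gives a linear boundary, so this does not by itself meet the non-linearity requirement of the question.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stream(n, rng):
    """Hypothetical stand-in for the streaming API: yields one (x, y) pair at a time."""
    for _ in range(n):
        x = rng.normal(size=2)
        y = 1 if x[0] + x[1] > 0 else 0   # synthetic labels in {0, 1}
        yield x, y

w = np.zeros(2)   # coefficients a, b
c = 0.0           # intercept
learning_rate = 0.1

for x, y in stream(2000, np.random.default_rng(0)):
    p = sigmoid(w @ x + c)            # predicted probability of class 1
    grad_score = p - y                # derivative of the cross-entropy w.r.t. the score
    w -= learning_rate * grad_score * x
    c -= learning_rate * grad_score

# Accuracy on fresh samples from the same synthetic stream.
test_samples = list(stream(500, np.random.default_rng(1)))
accuracy = np.mean([(sigmoid(w @ x + c) > 0.5) == (y == 1) for x, y in test_samples])
print(w, c, accuracy)
```

Each update touches one sample and then discards it, so the full data set never needs to be held in memory.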

**3. To Think**

**4. Semi-supervised learning considers the problem of classifying but it biases the solution so that the boundary passes through a low data density area. The technique that we are going to use is to automatically label all non-labelled data and then build a classifier on the new data set.**

- **4.a**
The simplest decision boundary using the labelled data is a linear decision boundary, such as the one given by an SVM, which is equidistant to both labelled groups.
- **4.b**
One way to correctly label all the data points into both classes is a deep learning model, since a decision boundary can be built from the superposition of multiple linear models. Remember that any data set can be classified using $N$ linear models, with $N$ big enough. Note, however, that we get a much more straightforward model if we transform the input into polar coordinates, $r = \sqrt{x^2 + y^2}$, $\theta = \arctan(y/x)$: both concentric circumferences are mapped to two parallel straight lines, each at a different constant value of $r$, which can then be separated by a single linear decision boundary. We would then just map back to the original space to visualize the output.
- **4.c**
An alternative is to use a Gaussian kernel, whose Gram matrix has entries $K(x_i, x_j) = \exp\left( -\frac{\| x_i - x_j\|^2}{\sigma^2} \right)$.
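
The polar-coordinate idea from 4.b can be sketched in a few lines of NumPy. The radii, sample sizes and threshold below are made up for illustration, and `np.arctan2` is used instead of $\arctan(y/x)$ so that points with $x = 0$ are handled.

```python
import numpy as np

rng = np.random.default_rng(0)

def circle(radius, n):
    """Sample n points on a circumference of the given radius."""
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

# Two concentric circumferences, one per class (made-up radii).
X = np.vstack([circle(1.0, 50), circle(3.0, 50)])
y = np.array([-1] * 50 + [1] * 50)

# Map to polar coordinates.
r = np.hypot(X[:, 0], X[:, 1])
theta = np.arctan2(X[:, 1], X[:, 0])

# In the (r, theta) plane the classes lie on two parallel lines,
# so a single threshold on r is a linear decision boundary.
threshold = 2.0
pred = np.where(r > threshold, 1, -1)
print(np.mean(pred == y))
```

In the original space the same boundary is the circle $x^2 + y^2 = 4$, which no single linear model on $(x, y)$ can produce.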

# Exam 2022

**1. Explain what a kernel is and how it is used (Either briefly explain the kernel trick or the use of kernels in the primal)**

Kernel trick: the kernel trick consists of replacing each inner product $\langle x_i, x_j \rangle$ with a kernel evaluation $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, i.e. the inner product in the feature space induced by the kernel. For instance, for the learning model with classifier $f(x) = \sum_i \alpha_i y_i x_i^T x$, the corresponding kernel version is $f(x) = \sum_i \alpha_i y_i K(x_i, x)$.

In the primal, we start from the minimization problem with its constraints, apply the kernel trick to the cost function (replacing every inner product with the corresponding kernel evaluation) and use the model $f(x) = \sum_i \alpha_i K(x_i, x)$. We then take data points $(x_i, y_i)$ one at a time and check feasibility: if the point violates its constraint (it does not lie in the feasible set), we project back by updating $\alpha_i \leftarrow \alpha_i + y_i$.
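
The update $\alpha_i \leftarrow \alpha_i + y_i$ on violated points is the rule of a kernel perceptron, which can be sketched as follows. This is a minimal illustration under my own assumptions (Gaussian kernel with $\sigma = 1$, XOR-like toy data, a fixed maximum number of passes), not a reconstruction of the exact procedure seen in class.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / sigma^2)."""
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def kernel_perceptron(X, y, kernel, epochs=20):
    """Learn alpha for f(x) = sum_i alpha_i K(x_i, x) with the update alpha_i += y_i."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            f_i = alpha @ K[:, i]          # f(x_i) with the current coefficients
            if y[i] * f_i <= 0:            # constraint y_i f(x_i) > 0 is violated
                alpha[i] += y[i]
                mistakes += 1
        if mistakes == 0:                  # every point is feasible: stop
            break
    return alpha

def predict(x, X, alpha, kernel):
    return np.sign(sum(a * kernel(xi, x) for a, xi in zip(alpha, X)))

# XOR-like data: not linearly separable in the input space.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

alpha = kernel_perceptron(X, y, rbf_kernel)
preds = np.array([predict(x, X, alpha, rbf_kernel) for x in X])
print(alpha, preds)
```

On this toy set the loop stops once every training point satisfies $y_i f(x_i) > 0$, which a linear perceptron on the raw inputs could never achieve.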

**2. Soft-margin SVM is the formulation used for non-separable data. When paired with an RBF kernel, this formulation is governed by means of two parameters, C (in the objective function) and gamma (for the kernel). Describe the role of these two terms and their influence in the obtained solution.**

Paper

**3. Ensemble learning is based on the aggregation of decisions of different classifiers. For ensemble learning to work classifiers must be diverse. What does diversity mean in this context? Name three different ways of achieving diversity (give a one-two lines description of each of them).**

Done 

**4. Briefly explain two different clustering strategies and provide a use case/example where that strategy fits in.**

Clustering consists of grouping the data samples into different prototypes or classes according to their similarities. Two ways of clustering are K-means and mixtures of Gaussians.
- K-means: the number of clusters $k$ is fixed a priori, with one prototype $m^k$ per cluster. We then minimize the objective function $\min \sum_k \sum_i r_i^k \|x_i - m^k\|^2$ by alternating two steps: first minimize over the responsibilities ($r_i^k = 1$ if $x_i \in C_k$ and 0 otherwise) with the prototype centers $m^k$ fixed, then minimize over the prototype positions with the responsibilities fixed. This alternating scheme is a hard-assignment version of Expectation-Maximization (EM).
- Mixture of Gaussians: we cluster under the assumption that each prototype follows a Gaussian distribution. The parameters of each Gaussian are found by maximizing the likelihood of the assumed mixture.
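
The two alternating steps of K-means can be sketched as follows in NumPy. The initialization (random data points as starting centers), the stopping rule and the toy data are my own choices for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternate between assigning points to the nearest center and recomputing centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Step 1: centers fixed, choose responsibilities (nearest center for each point).
        sq_dist = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        # Step 2: responsibilities fixed, move each center to the mean of its points.
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs (made-up data).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])

centers, labels = kmeans(X, k=2)
print(centers)
```

Each step can only lower (or keep) the objective $\sum_k \sum_i r_i^k \|x_i - m^k\|^2$, which is why the alternation converges.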

**5. Consider that we have a dataset. Describe a technique for generating samples “compatible” with the data in the dataset.** 

We could do so with a generative adversarial network. It consists of a generator, which plays the role of the decoder part of an autoencoder: it is fed samples from a chosen distribution and generates data samples. The artificially generated data are then given, together with a set of real data, to a classifier that is trained to tell the real samples from the fake ones. With the generator frozen, we train the classifier until it works perfectly. Then we freeze the classifier and train the generator to produce data that the classifier cannot tell apart from real data. Then we freeze the generator again and train the classifier one more time. We iterate the process until we get a generator whose samples fit the real dataset and cannot be distinguished from it.