In [1]:
%%javascript
MathJax.Hub.Config({TeX:{equationNumbers:{autoNumber:"AMS"}}});


# BCB Final Project - a.y. 17/18

### <i>Fiaschi Lorenzo, Franco Danilo</i>

### 1. The Problem
The dataset consists of 2741 aligned patients affected by diabetes; it is structured in three different tables:
<ol>
    <li><b>Genetic Features</b>, 347 - Related to the SNPs of 344 genes + 3 composite gene scores [blood pressure, Alzheimer's, CVD]</li>
    <li><b>Retina's Features</b>, 157 - Engineered features extracted from patients' retinas</li>
    <li><b>Clinical Features</b>, 15 - General patients' information</li>
</ol>
The objective of the study is to find a model able to predict whether a patient will experience either a cardiovascular failure or an episode of dementia.
If so, the model is also asked to predict the age at which the episode will occur.
To learn such a model, an output dataset has been provided; for both the cardiovascular failure and the dementia episode it contains two pieces of information: whether the episode happened and at what age.

### 2. Data Preprocessing
Before being fed to the model, the tables were cleaned as follows:
<ul>
    <li>
        <b><i>NaN Filling</i></b>: 
        <ul>
            <li>
                Clinical DS: 
                <ul>
                    <li>
                        ["therapy","gender","precedent CVD", "smoker"] $\rightarrow$ most frequent;
                    </li>
                    <li>
                        Others $\rightarrow$ mean.
                    </li>
                </ul>
            </li>
            <li>
                Genetic DS: 
                <ul>
                    <li>
                        [composite gene scores] $\rightarrow$ mean;
                    </li>
                    <li>
                        Others $\rightarrow$ min.
                    </li>
                </ul>
            </li>
            <li>
                Vampire DS (retina features):
                <ul>
                    <li>
                        All $\rightarrow$ mean.
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li>
        <b><i>One-Hot Encoding</i></b>:
        <ul>
            <li>Clinical DS:
                <ul>
                    <li>
                        "therapy" and "Apoe4Presence".
                    </li>
                </ul>
            </li>
        </ul>
    </li>
    <li><b><i>Boolean: from binary to symmetric</i></b>: 
        <ul>
            <li>Clinical DS:
                <ul>
                    <li>
                        ["gender","precedent CVD", "smoker"] $\rightarrow$ from $\{0,1\}$ to $\{-1,1\}$.
                    </li>
                </ul>
            </li>
            <li>Outputs:
                <ul>
                    <li>
                        ["cvd_fail","dement_fail"] $\rightarrow$ from $\{0,1\}$ to $\{-1,1\}$.
                    </li>
                </ul>
            </li>
        </ul>
    </li>
</ul>
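
The cleaning rules listed above for the clinical table can be sketched with pandas as follows. This is only an illustrative sketch: the tiny data frame and its values are made up, the `age` column is a hypothetical stand-in for the "other" numeric columns, and only the column names quoted in the list come from the project.

```python
import pandas as pd

# Hypothetical miniature of the clinical table (values invented for illustration).
clinical = pd.DataFrame({
    "therapy": ["A", "B", None, "A"],
    "gender": [0, 1, 1, None],
    "precedent CVD": [0, 0, None, 1],
    "smoker": [1, None, 1, 0],
    "age": [61.0, None, 70.0, 55.0],  # stand-in for the remaining numeric columns
})

most_frequent_cols = ["therapy", "gender", "precedent CVD", "smoker"]
boolean_cols = ["gender", "precedent CVD", "smoker"]

# NaN filling: most frequent value for the listed columns, mean for the others.
for col in clinical.columns:
    if col in most_frequent_cols:
        clinical[col] = clinical[col].fillna(clinical[col].mode().iloc[0])
    else:
        clinical[col] = clinical[col].fillna(clinical[col].mean())

# Boolean columns: map {0, 1} to {-1, 1}.
clinical[boolean_cols] = clinical[boolean_cols].astype(float) * 2 - 1

# One-hot encoding of a multi-valued categorical column.
clinical = pd.get_dummies(clinical, columns=["therapy"], dtype=float)
```

The genetic and retina tables would follow the same pattern, swapping the fill rule (mean or min) according to the list above.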

### 3. Multiple Kernel Learning
<b>Multiple Kernel Learning</b> refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm.<br>
Reasons to use multiple kernel learning include:
<ul>
    <li>the ability to select for an optimal kernel and parameters from a larger set of kernels, reducing bias due to kernel selection while allowing for more automated machine learning methods;
    </li>
    <li>combining data from different sources (e.g. sound and images from a video) that have different notions of similarity and thus require different kernels.
    </li>
</ul>
<img src='others/mkl.png'>

### 4. The Model
<ul>
    <li><b><i>Target Function</i></b> $\rightarrow$ Similarity-based function; it uses a similarity metric between the combined kernel matrix and an optimum kernel matrix calculated from the training data, in order to select the combination function parameters that maximize the similarity. The similarity between two kernel matrices can be calculated using kernel alignment, Euclidean distance, Kullback-Leibler (KL) divergence, or any other similarity measure.
    </li>
    <li><b><i>Training Method</i></b> $\rightarrow$ One-step method; it calculates both the combination function parameters and the parameters of the combined base learner in a single pass. One can use a sequential approach or a simultaneous approach. In the sequential approach, the combination function parameters are determined first, and then a kernel-based learner is trained using the combined kernel. In the simultaneous approach, both sets of parameters are learned together.
    </li>
</ul>
The chosen algorithm is called <i>Centered-kernel alignment</i> [1]. It computes a set of $P$ kernels $K_i$, $i = 1, \dots, P$, from the datasets and uses a linear combination of them to approximate an ideal kernel. The approximated kernel can then be used for both classification and regression. The functional is:<br>

\begin{equation}
\max\limits_{\eta \in \mathcal{M}} CA(K_{\eta}, IK)
\label{eq:problem}
\end{equation}

where $IK = y y^T$ is the ideal kernel (with $y$ the column vector of labels), $K_{\eta} = \sum\limits_{i=1}^P \eta_i K_i$ is its approximation, and $\mathcal{M}=\{\eta : ||\eta||_2 = 1\}$ constrains $\eta$ to be a unit-norm vector. In turn, $CA(K_1, K_2)$ is defined as

\begin{equation*}
CA(K_1, K_2) = \frac{\langle K_1^c, K_2^c \rangle_F}{\sqrt{\langle K_1^c, K_1^c\rangle_F \; \langle K_2^c, K_2^c \rangle_F}}
\end{equation*}

where $K^c$ is the centered version of $K$, which can be calculated as

\begin{equation*}
K^c = K - \frac{1}{N} 11^TK - \frac{1}{N} K11^T + \frac{1}{N^2} \left(1^TK1\right)11^T
\end{equation*}

where $1$ is the vector of ones with proper dimension.
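
As a minimal NumPy sketch of the two definitions above (the function names and the small random example are illustrative assumptions, not the project's code), the centering can be written as $K^c = HKH$ with $H = I - \frac{1}{N}11^T$, which expands to the same four terms:

```python
import numpy as np

def center_kernel(K):
    """Center a square kernel matrix: K^c = H K H with H = I - 11^T / N."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K1, K2):
    """Centered alignment: normalized Frobenius inner product of centered kernels."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    num = np.sum(K1c * K2c)  # Frobenius inner product <K1c, K2c>_F
    den = np.sqrt(np.sum(K1c * K1c) * np.sum(K2c * K2c))
    return num / den

# Illustrative data: 4 samples, labels in {-1, 1}, a linear kernel on random features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
K = X @ X.T
y = np.array([1.0, 1.0, -1.0, -1.0])
ideal = np.outer(y, y)  # ideal kernel y y^T

score = centered_alignment(K, ideal)
```

Because of the normalization, the score does not change when a kernel is rescaled by a positive constant, and a kernel is perfectly aligned with itself.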

The optimization problem \eqref{eq:problem} then has a unique analytical solution

\begin{equation}
\eta = \frac{M^{-1}a}{||M^{-1}a||_2}
\label{eq:problem_solution}
\end{equation}

where $M = \{\langle K_m^c, K_h^c\rangle_F\}_{m,h = 1}^P$ is a $P \times P$ matrix and $a = \{\langle K_m^c, IK\rangle_F\}_{m = 1}^P$ is a vector of length $P$, both computed on the centered kernels.
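
A possible NumPy sketch of the closed-form solution \eqref{eq:problem_solution} is given below. It assumes that $M$ and $a$ are built from the centered kernels, as the centered alignment requires; the helper names and the random two-kernel example are hypothetical.

```python
import numpy as np

def center_kernel(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def frobenius(A, B):
    return float(np.sum(A * B))

def centered_alignment(K1, K2):
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    return frobenius(K1c, K2c) / np.sqrt(frobenius(K1c, K1c) * frobenius(K2c, K2c))

def alignment_weights(kernels, y):
    """Return eta = M^{-1} a / ||M^{-1} a||_2 for a list of kernel matrices."""
    centered = [center_kernel(K) for K in kernels]
    ideal = np.outer(y, y)
    M = np.array([[frobenius(Km, Kh) for Kh in centered] for Km in centered])
    a = np.array([frobenius(Km, ideal) for Km in centered])
    v = np.linalg.solve(M, a)  # solves M v = a without forming the inverse
    return v / np.linalg.norm(v)

# Illustrative example: two feature blocks, labels driven by the first one.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 3))
X2 = rng.normal(size=(20, 4))
y = np.sign(X1[:, 0])
kernels = [X1 @ X1.T, X2 @ X2.T]

eta = alignment_weights(kernels, y)
combined = sum(w * K for w, K in zip(eta, kernels))
```

Since any single kernel corresponds to a feasible $\eta$, the alignment of `combined` with the ideal kernel cannot be lower than that of either base kernel on the same data.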

To avoid overfitting, make the learning procedure more robust and gain deeper insight into the problem, we added either an $L_1$ or an $L_2$ penalty to the functional \eqref{eq:problem}, obtaining the following two reformulations

\begin{equation}
\max\limits_{\eta \in \mathcal{M}} CA(K_{\eta}, IK) - \lambda \, ||\eta||_2^2 \;\; (a), \qquad \max\limits_{\eta \in \mathcal{M}} CA(K_{\eta}, IK) - c \, ||\eta||_1 \;\; (b).
\label{eq:problem_penalty}
\end{equation}

While Equation \eqref{eq:problem_penalty}(a) has the unique analytical solution

\begin{equation}
\eta = \frac{(M-\lambda I)^{-1}a}{||(M-\lambda I)^{-1}a||_2}
\label{eq:problem_solution_lambda}
\end{equation}

Equation \eqref{eq:problem_penalty}(b) does not, because the $L_1$ norm is not differentiable. To converge to the optimum we applied the forward-backward splitting method [2][3].
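
The exact objective to which the splitting was applied is not spelled out here, so the following is only a sketch under a stated assumption: the alignment problem is replaced by the quadratic surrogate $\min_{\eta} \; \eta^T M \eta - 2 a^T \eta + c \, ||\eta||_1$, and the result is rescaled to unit norm at the end. Each iteration takes a gradient (forward) step on the smooth part and a soft-thresholding (backward, proximal) step on the $L_1$ term. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward_l1(M, a, c, n_iter=500):
    """Proximal gradient on eta^T M eta - 2 a^T eta + c ||eta||_1 (assumed surrogate)."""
    # Step size 1/L, where L = 2 * lambda_max(M) is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(M).max())
    eta = np.zeros_like(a, dtype=float)
    for _ in range(n_iter):
        grad = 2.0 * (M @ eta - a)                          # forward step
        eta = soft_threshold(eta - step * grad, step * c)   # backward step
    norm = np.linalg.norm(eta)
    return eta / norm if norm > 0 else eta

# Toy check with M = I, where the minimizer is known coordinate-wise:
# eta_i = max(a_i - c / 2, 0) for non-negative a_i.
M = np.eye(3)
a = np.array([1.0, 0.05, 0.5])
eta_sparse = forward_backward_l1(M, a, c=0.2)
eta_dense = forward_backward_l1(M, a, c=0.0)
```

With `c = 0.2` the weakly supported second coefficient is driven exactly to zero, while with `c = 0` the result reduces to the normalized unpenalized solution.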

### 5. The Algorithm

<ol>
    <li><i>Datasets loading</i> $\rightarrow$ (Clinical, Genetic, Vampire, Outputs);</li>
    <li><i>Outer split</i> $\rightarrow$ 75% training + 25% test for final testing purposes;</li>
    <li><i>Kernel definition</i> $\rightarrow$ dictionary list: [{<b>kernelType1:[parameter list]</b>, <b>kernelType2:[parameter list]</b>, <b>...</b> }, <b>...</b> ] </li>
    <li><i>Sampling rounds</i> $\rightarrow$ 3 rounds of 75% training + 25% mid test (extracted from the previous 75% of training) for statistical stability:
        <ol>
            <li>If requested, center and/or normalise the training matrix;</li>
            <li>For each dictionary configuration:
                <ol>
                    <li>Build the ideal kernel from the training outputs;</li>
                    <li>Build all the possible combinations of the kernel parameters;</li>
                    <li>3-fold cross-validation over the 75% sampling training set:
                        <ol>
                            <li>Build the ideal kernel matrices from both training and validation output sets;</li>
                            <li>For each configuration:
                                <ol>
                                    <li>Compute all the kernel matrices for that configuration;</li>
                                    <li>Find the optimal weights for those kernel matrices using the ideal training kernel;</li>
                                    <li>Compute the weighted sum of the kernel matrices;</li>
                                    <li>Compare the obtained matrix with the ideal validation kernel using the Cortes Alignment metric (or just measure the balanced accuracy) and store this similarity measure;</li>
                                </ol>
                            </li>
                        </ol>                       
                    </li>
                    <li>Find the configuration with the greatest mean across the three validation folds;</li>
                    <li>Recompute weights and similarity score for the selected setting using the ideal training kernel built at point <i>a</i>;</li>
                    <li>Compute accuracy, precision and recall for the selected setting against the sampling test set;</li>
                    <li>Store these values, annotated with the configuration and alignment for this dictionary;</li>
                </ol>
            </li>
        </ol>
    </li>
    <li>Across the sampling rounds, sum all the alignment values (or balanced accuracies) related to the same kernel parameters in order to find the configuration that behaved best, keeping each dictionary separate in a list;</li>
    <li>Try the configurations (one per dictionary) against the outer test set and compute accuracy, precision and recall.</li>
</ol>
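
The "build all the possible combinations" step can be sketched with `itertools.product`. The sketch assumes, as the configurations reported in the result tables suggest, that each kernel type receives one parameter per dataset; the dictionary layout and the function name are hypothetical, not the project's actual data structures.

```python
from itertools import product

def expand_configurations(kernel_dict, n_datasets=3):
    """Expand {kernel type: [candidate parameters]} into every configuration.

    Each configuration maps a kernel type to one parameter per dataset;
    parameter-free kernels (empty list) map to None.
    """
    per_type = []
    for name, params in kernel_dict.items():
        if params:
            choices = list(product(params, repeat=n_datasets))
        else:
            choices = [None]  # e.g. the linear kernel has nothing to tune
        per_type.append([(name, choice) for choice in choices])
    return [dict(combo) for combo in product(*per_type)]

configs = expand_configurations({"linear": [], "gaussian": [0.3, 0.6]})
# Two candidate values over three datasets give 2**3 = 8 configurations.
```

The number of configurations grows exponentially with the number of datasets, which is why the candidate lists per kernel type are best kept short.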

### 6. Approach Setting

Kernels used:
    
<ul>
    <li>Gaussian: $\hspace{4mm}K_{\sigma}(x_i, x) = e^{\,-\,\left(\frac{||x_i \, - \, x||^2}{2\sigma}\right)}$</li>
    <li>Linear: $\hspace{9mm}K(x_i, x) = x_i \cdot x^T$</li>
    <li>Polynomial: $\hspace{1mm}K_d(x_i, x) = (1 +  x_i \cdot x^T)^d$</li>
    <li>Laplacian: $\hspace{3mm}K_{\gamma}(x_i, x) = e^{\,-\,(\, \gamma \, ||x_i \, - \, x||_1)}$</li>
    <li>Sigmoid: $\hspace{5mm}K_{\gamma}(x_i, x) = \tanh \,(\,\gamma \, \langle x_i, x \rangle + 1)$</li>
</ul>
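
The five kernels can be written directly in NumPy, following the formulas exactly as stated above (in particular, the Gaussian kernel divides by $2\sigma$). The function names are illustrative; each function takes two sample matrices with one row per sample and returns the full kernel matrix.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, d):
    return (1.0 + A @ B.T) ** d

def gaussian_kernel(A, B, sigma):
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma))  # clip tiny negative round-off

def laplacian_kernel(A, B, gamma):
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)  # pairwise L1 distances
    return np.exp(-gamma * l1)

def sigmoid_kernel(A, B, gamma):
    return np.tanh(gamma * (A @ B.T) + 1.0)

# Two points at distance 1 from each other.
X = np.array([[0.0, 0.0], [1.0, 0.0]])
K_gauss = gaussian_kernel(X, X, sigma=0.5)
K_lap = laplacian_kernel(X, X, gamma=2.0)
K_poly = polynomial_kernel(X, X, d=2)
K_sig = sigmoid_kernel(X, X, gamma=1.0)
```

Passing the training matrix twice yields the square training kernel; passing test and training matrices yields the rectangular kernel needed at prediction time.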

The basic approach was to pick three different kernels and apply them to every dataset, obtaining 9 kernel matrices. Several combinations of three data preprocessing methods (data origin centering, data normalization, kernel normalization) have been tested.

### 7. Toy Testing

To check the correctness of the implemented algorithm, three synthetic datasets were generated and the algorithm was run on them.


#### 7.1 Toy Dataset Generation

For dataset generation we used the sklearn.datasets.make_classification function [4], initialized with the following configuration:

<ul>
    <li>n_samples = 300</li>
    <li>n_features = 30</li>
    <li>n_informative = 10</li>
    <li>n_redundant = 0</li>
    <li>n_classes = 2</li>
</ul>

The data generated in this way were split into three datasets, and the informative variables were distributed evenly between the first two. This procedure lets us check whether the algorithm is able to perceive the irrelevance of the third dataset.

Then, the three datasets were split into training and test sets using sklearn.model_selection.StratifiedShuffleSplit [5], initialized as follows:

<ul>
    <li>n_splits = 1</li>
    <li>test_size = 0.25</li>
</ul>
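
A sketch of how the toy data could be generated with the two scikit-learn functions and the parameters listed above. Several details are assumptions made for illustration only: `shuffle=False` is used so that the informative features occupy the first columns, the three datasets are given ten features each, and `random_state` is fixed for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

# shuffle=False keeps the 10 informative features in the first 10 columns.
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=10,
    n_redundant=0, n_classes=2, shuffle=False, random_state=0,
)
informative, noise = X[:, :10], X[:, 10:]

# Informative features split evenly between the first two datasets;
# the third dataset contains noise only.
views = [
    np.hstack([informative[:, :5], noise[:, :5]]),
    np.hstack([informative[:, 5:], noise[:, 5:10]]),
    noise[:, 10:],
]

# One stratified 75% / 25% split, applied with the same indices to every dataset.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))
train_views = [v[train_idx] for v in views]
test_views = [v[test_idx] for v in views]
y_train, y_test = y[train_idx], y[test_idx]
```

Using the same index split for all three datasets keeps the samples aligned, which is required to combine their kernel matrices.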


#### 7.2 Performance on the Toy Datasets

##### 7.2.1 MKL

The results achieved on the toy datasets are shown below. In particular, they are compared with the learning performance obtained on a single dataset (the first one) using a common learning algorithm, e.g., Logistic Regression or SVM.

<table style="width:100%;">
    <tr>
        <th> 
          <p><font  color="red">Based on CA</font></p>
        </th>
        <th align="justify">
          Configuration
        </th>
        <th>
          Test
        </th>
        <th>
          Train
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            L2 - Centering -$\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 1.3<br>
            Laplacian : 0.2, 0.6, 0.2<br>
            Gaussian: 0.3, 0.3, 0.6
        </td>
        <td>
            Accuracy: 0.8549<br>
            Precision: 0.7826<br>
            Recall: 0.9730
        </td>
        <td>
            CA: 0.2187<br>
            Accuracy: 0.84832<br>
            Precision: 0.8315<br>
            Recall: 0.86905
        </td>
        <td>
            [0.1189, -0.1065, -0.5868, 0.3827,  0.6407, -0.2706]
        </td>
    </tr>
    <tr>
        <th>
            L2 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 0.7<br>
            Polynomial : 2, 2, 7<br>
            Gaussian: 0.3, 0.3, 0.6
        </td>
        <td>
            Accuracy: 0.8275<br>
            Precision: 0.7857<br>
            Recall: 0.8919
        </td>
        <td>
            CA: 0.2029<br>
            Accuracy: 0.8186<br>
            Precision: 0.8213<br>
            Recall: 0.8095
        </td>
        <td>
            [0.6905, -0.4044, 0.3360, 0.2329, 0.4194, 0.1289]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering - $\;\;$ Normalizing
        </th>
        <td>
            Lambda: 0.5<br>
            Linear : -<br>
            Gaussian: 0.3, 0.3, 0.3
        </td>
        <td>
            Accuracy: 0.8410<br>
            Precision: 0.7907<br>
            Recall: 0.9189
        </td>
        <td>
            CA: 0.2448<br>
            Accuracy: 0.8130<br>
            Precision: 0.8051<br>
            Recall: 0.8214
        </td>
        <td>
            [0.1559, 0.0015, 0.2757, 0.0015, 0.9485, 0.0015]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 0.5<br>
            Polynomial: 3, 3, 3<br>
            Gaussian: 0.5, 0.5, 0.1
        </td>
        <td style="background-color: #ccffb3;">
            Accuracy: 0.8937<br>
            Precision: 0.87188<br>
            Recall: 0.9189
        </td>
        <td>
            CA: 0.2430<br>
            Accuracy: 0.7545<br>
            Precision: 0.7458<br>
            Recall: 0.7619
        </td>
        <td>
            [0.2520, 0.4460, 0.2554, 0.4487, 0.5193, 0.4486]
        </td>
    </tr>
</table>

<br>
<br>

<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Accuracy</font></p>
        </th>
        <th align="justify">
          Configuration
        </th>
        <th>
          Test
        </th>
        <th>
          Train
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            L2 - Centering - $\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 1.3<br>
            Linear: -<br>
            Polynomial : 2, 2, 2<br>
            Gaussian: 0.1, 0.1, 0.5
        </td>
        <td style="background-color: #ccffb3;">
            Accuracy: 0.8670<br>
            Precision: 0.8462<br>
            Recall: 0.8649
        </td>
        <td>
            CA: 0.2213<br>
            Accuracy: 0.7779<br>
            Precision: 0.7675<br>
            Recall: 0.7857
        </td>
        <td>
            [0.0469, -0.0620, -0.5322, 0.0106, -0.0275, 0.8237, 0.0185, 0.0777,  0.1581]
        </td>
    </tr>
    <tr>
        <th>
            L2 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 0.7<br>
            Linear: -<br>
            Polynomial : 2, 2, 2<br>
            Gaussian: 0.5, 0.1, 0.7
        </td>
        <td>
            Accuracy: 0.8538<br>
            Precision: 0.8250<br>
            Recall: 0.8919
        </td>
        <td>
            CA: 0.2175<br>
            Accuracy: 0.7956<br>
            Precision: 0.7810<br>
            Recall: 0.8095
        </td>
        <td>
            [0.3343, -0.1759, -0.3996, 0.1541, -0.1073, 0.1076, 0.0791,  0.7195,  0.3563]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering - $\;\;$ Normalizing
        </th>
        <td>
            Lambda: 0.9<br>
            Linear: -<br>
            Gaussian: 0.3, 0.6, 0.3
        </td>
        <td>
            Accuracy: 0.8535<br>
            Precision: 0.8421<br>
            Recall: 0.8649
        </td>
        <td>
            CA: 0.2086<br>
            Accuracy: 0.8013<br>
            Precision: 0.7911<br>
            Recall: 0.8095
        </td>
        <td>
            [0.1821, 0.0015, 0.2643, 0.0013, 0.9471, 0.0015]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 0.5<br>
            Sigmoid: 0.4, 0.9, 0.4<br>
            Gaussian: 0.5, 0.5, 1
        </td>
        <td>
            Accuracy: 0.8403<br>
            Precision: 0.8205<br>
            Recall: 0.8649
        </td>
        <td>
            CA: 0.2144<br>
            Accuracy: 0.8424<br>
            Precision: 0.8346<br>
            Recall: 0.8571
        </td>
        <td>
            [0.2808, 0.4576, 0.2557, 0.4613, 0.4738, 0.4573]
        </td>
    </tr>
</table>


##### 7.2.2 L1 Logistic Regression
<br>
<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Logistic Regression</font ></p>
        </th>
        <th align="justify">
          Best Lambda
        </th>
        <th>
          Test
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering - $\;\;$ Normalizing</p>
        </th>
        <td align="left">
            0.1
        </td>
        <td>
            Accuracy: 0.5473<br>
            Precision: 0.5366<br>
            Recall: 0.5946
        </td>
        <td>
            [0, 0, 0, 0, 0, 0, 0.7094, 0, 0, 0]
        </td>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering</p>
        </th>
        <td align="left">
            1.9
        </td>
        <td>
            Accuracy: 0.5473<br>
            Precision: 0.5366<br>
            Recall: 0.5946
        </td>
        <td>
            [0, 0, 0, 0, 0, 0, 0.7094, 0, 0, 0]
        </td>
    </tr>
</table>

#### 7.3 Results discussion

Some considerations can be drawn from the results:
<ul>
    <li>there seems to be some performance improvement when using an approach based on the CA rather than one based on the accuracy;</li>
    <li>the results obtained with kernel normalisation are generally comparable, in terms of accuracy, with those achieved with data normalisation;</li>
    <li>the selected kernels generally belong to the set K = {Linear, Polynomial};</li>
    <li>even though the informative features are not evenly distributed among the datasets (in particular, none of them is in the third one), the $\eta$ coefficients suggest that the algorithm does not detect this. Indeed, the coefficient magnitudes are generally comparable across kernels (even in the best-performing runs);</li>
    <li>compared with the shallow learning performed on the first dataset alone, MKL's performance is notably higher. This is in line with the fact that this dataset contains only 70% of the overall useful information.</li>
</ul>

### 8. Real Data

The results achieved on the real data are shown and discussed below. Again, the data integration approach is compared with shallow learning.


#### 8.1 Performances 

##### 8.1.1 MKL

<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on CA</font ></p>
        </th>
        <th align="justify">
          Cardio configuration
        </th>
        <th>
          Cardio test
        </th>
        <th>
          Cardio train
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            L2 - Centering - $\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 1.3<br>
            Linear: - <br>
            Gaussian: 0.3, 0.6, 0.6
        </td>
        <td>
            Accuracy: 0.5142<br>
            Precision: 0.3453<br>
            Recall: 0.5044
        </td>
        <td>
            CA: 0.0221<br>
            Accuracy: 0.5173<br>
            Precision: 0.3763<br>
            Recall: 0.1834
        </td>
        <td>
            [1.5359e-01, 2.0512e-04, 9.8813e-01, 2.0513e-04, 6.1960e-08, 2.0513e-04]
        </td>
    </tr>
    <tr>
        <th>
            L2 - Centering -$\;$K Normalizing
        </th>
        <td align="left">
            Lambda: 0.7<br>
            Polynomial: 5, 3, 5<br>
            Gaussian: 1, 1, 1
        </td>
        <td style="background-color: #ccffb3;">
            Accuracy: 0.6336<br>
            Precision: 0.4486<br>
            Recall: 0.6886
        </td>
        <td>
            CA: 0.0219<br>
            Accuracy: 0.5151<br>
            Precision: 0.3496<br>
            Recall: 0.4438
        </td>
        <td>
            [9.9999e-01, 1.5504e-03, 4.8583e-04, 1.7324e-03, -7.0856e-04, 1.7324e-03]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering - $\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 0.9<br>
            Polynomial: 7, 7, 2<br>
            Gaussian: 0.3, 0.3, 0.3
        </td>
        <td>
            Accuracy: 0.5172<br>
            Precision: 0.3534<br>
            Recall: 0.3857
        </td>
        <td>
            CA: 0.0184<br>
            Accuracy: 0.5751<br>
            Precision: 0.3904<br>
            Recall: 0.6391
        </td>
        <td>
            [8.6100e-09, 5.9277e-25, 2.0955e-23, 5.9277e-25, 1.0000e+00, 5.9277e-25]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering -$\;$K Normalizing
        </th>
        <td align="left">
            Lambda: 0.9<br>
            Polynomial: 5, 3, 3<br>
            Gaussian: 0.5, 1, 0.5
        </td>
        <td>
            Accuracy: 0.5523<br>
            Precision: 0.3911<br>
            Recall: 0.4649
        </td>
        <td>
            CA: 0.0214<br>
            Accuracy: 0.5059<br>
            Precision: 0.3522<br>
            Recall: 0.1538
        </td>
        <td>
            [0.0042, 0.5773, 0.0100, 0.5773, -0.0011, 0.5773]
        </td>
    </tr>
</table>

<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on CA</font></p>
        </th>
        <th>
          Dementia configuration
        </th>
        <th>
          Dementia test
        </th>
        <th>
          Dementia train
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            L2 - Centering - $\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 0.9<br>
            Polynomial: 2, 7, 7<br>
            Gaussian: 0.3, 0.3, 0.3
        </td>
        <td>
            Accuracy: 0.5016<br>
            Precision: 0.25<br>
            Recall: 0.0084
        </td>
        <td>
            CA: 0.0196<br>
            Accuracy: 0.5292<br>
            Precision: 0.2273<br>
            Recall: 0.3408
        </td>
        <td>
            [5.2793e-51, 0.0000e+00, 1.3163e-53, 4.8795e-56, 1.0000e+00, 4.8795e-56]
        </td>
    </tr>
    <tr>
        <th>
            L2 - Centering -$\;$K Normalizing
        </th>
        <td align="left">
            Lambda: 0.7<br>
            Polynomial: 3, 3, 3<br>
            Gaussian: 0.5, 0.5, 0.5
        </td>
        <td style="background-color: #ccffb3;">
            Accuracy: 0.6962<br>
            Precision: 0.2825<br>
            Recall: 0.8403
        </td>
        <td>
            CA: 0.0209<br>
            Accuracy: 0.4751<br>
            Precision: 0.1542<br>
            Recall: 0.3745
        </td>
        <td>
            [9.9998e-01, -2.4968e-03, 5.9783e-04, -4.0097e-03, -8.5158e-04, -4.0143e-03]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering - $\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 0.7<br>
            Linear: -<br>
            Gaussian: 0.3, 0.6, 0.3
        </td>
        <td>
            Accuracy: 0.5017<br>
            Precision: 0.18<br>
            Recall: 0.0756
        </td>
        <td>
            CA: 0.0198<br>
            Accuracy: 0.6351<br>
            Precision: 0.2457<br>
            Recall: 0.7566
        </td>
        <td>
            [3.8433e-09, 1.0829e-11, 1.3226e-09, 1.0829e-11, 1.0000e+00, 1.0829e-11]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering -$\;$K Normalizing
        </th>
        <td align="left">
            Lambda: 0.5<br>
            Polynomial: 2, 2, 2<br>
            Gaussian: 0.6, 0.6, 0.3
        </td>
        <td>
            Accuracy: 0.5399<br>
            Precision: 0.1965<br>
            Recall: 0.5630
        </td>
        <td>
            CA: 0.0201<br>
            Accuracy: 0.5170<br>
            Precision: 0.1832<br>
            Recall: 0.5131
        </td>
        <td>
            [0.0019, 0.5773, 0.0058, 0.5773, 0.0027, 0.5773]
        </td>
    </tr>
</table>

<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Accuracy</font></p>
        </th>
        <th align="justify">
          Cardio configuration
        </th>
        <th>
          Cardio test
        </th>
        <th>
          Cardio train
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            L2 - Centering - $\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 1.3<br>
            Polynomial: 7, 2, 2<br>
            Gaussian: 0.3, 0.6, 0.3
        </td>
        <td>
            Accuracy: 0.5106<br>
            Precision: 0.3680<br>
            Recall: 0.4057
        </td>
        <td>
            CA: 0.0163<br>
            Accuracy: 0.5612<br>
            Precision: 0.3639<br>
            Recall: 0.7030
        </td>
        <td>
            [1.0000e+00, 1.1166e-27, 3.6103e-27, 1.1166e-27, 1.9267e-14, 1.1165e-27]
        </td>
    </tr>
    <tr>
        <th>
            L2 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 0.9<br>
            Polynomial: 5, 3, 3<br>
            Gaussian: 0.5, 0.5, 1
        </td>
        <td style="background-color: #ccffb3;">
            Accuracy: 0.6089<br>
            Precision: 0.4503<br>
            Recall: 0.6680
        </td>
        <td>
            CA: 0.0204<br>
            Accuracy: 0.5360<br>
            Precision: 0.3565<br>
            Recall: 0.4970
        </td>
        <td>
            [10.0000e-01, -9.7576e-04, -1.4896e-03, -8.5473e-04, 4.3603e-05, -8.7393e-04]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering - $\;\;$ Normalizing
        </th>
        <td>
            Lambda: 1.3<br>
            Laplacian: 0.2, 0.2, 0.6<br>
            Gaussian: 0.3, 0.3, 0.3
        </td>
        <td>
            Accuracy: 0.5385<br>
            Precision: 0.3993<br>
            Recall: 0.4549
        </td>
        <td>
            CA: 0.0198<br>
            Accuracy: 0.5903<br>
            Precision: 0.3956<br>
            Recall: 0.6505
        </td>
        <td>
            [0.4115, 0.4076, 0.4076, 0.4076, 0.4076, 0.4076]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 1.3<br>
            Laplacian: 0.6, 0.2, 0.2<br>
            Gaussian: 0.6, 0.3, 0.3
        </td>
        <td>
            Accuracy: 0.5209<br>
            Precision: 0.3754<br>
            Recall: 0.5123
        </td>
        <td>
            CA: 0.0260<br>
            Accuracy: 0.5408<br>
            Precision: 0.3452<br>
            Recall: 0.7273
        </td>
        <td>
            [0.4082, 0.4082, 0.4083, 0.408, 0.4083, 0.4082]
        </td>
    </tr>
</table>

<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Accuracy</font></p>
        </th>
        <th align="justify">
          Dementia configuration
        </th>
        <th>
          Dementia test
        </th>
        <th>
          Dementia train
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            L2 - Centering - $\;\;$ Normalizing
        </th>
        <td align="left">
            Lambda: 0.5<br>
            Sigmoid: 0.6, 0.2, 0.6 <br>
            Gaussian: 0.3, 0.6, 0.6
        </td>
        <td>
            Accuracy: 0.5011<br>
            Precision: 0.1738<br>
            Recall: 0.8824
        </td>
        <td>
            CA: 0.0229<br>
            Accuracy: 0.5319<br>
            Precision: 0.1885<br>
            Recall: 0.8127
        </td>
        <td>
            [0.0000, -5.9899e-01, 3.8868e-07, 5.668e-01, 0.0000, 5.6559e-01]
        </td>
    </tr>
    <tr>
        <th>
            L2 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 0.5<br>
            Polynomial: 5, 3, 3 <br>
            Gaussian: 1, 0.5, 1
        </td>
        <td style="background-color: #ccffb3;">
            Accuracy: 0.6374<br>
            Precision: 0.2479<br>
            Recall: 0.7563
        </td>
        <td>
            CA: 0.0236<br>
            Accuracy: 0.5368<br>
            Precision: 0.1945<br>
            Recall: 0.5543
        </td>
        <td>
            [10.0000e-01, 1.0908e-03, -1.1484e-03, 1.1052e-03, 4.5443e-04, 1.0930e-03]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering - $\;\;$ Normalizing
        </th>
        <td>
            Lambda: 0.5<br>
            Sigmoid: 0.6, 0.2, 0.2 <br>
            Gaussian: 0.3, 0.3, 0.3
        </td>
        <td>
            Accuracy: 0.5039<br>
            Precision: 0.1829<br>
            Recall: 0.1261
        </td>
        <td>
            CA: 0.02304<br>
            Accuracy: 0.6507<br>
            Precision: 0.2544<br>
            Recall: 0.7790
        </td>
        <td>
            [0, 0.5774, 0, 0.5773, 0, 0.5773]
        </td>
    </tr>
    <tr>
        <th>
            L1 - Centering -$\;$K Normalizing
        </th>
        <td>
            Lambda: 0.5<br>
            Sigmoid: 0.9, 0.4, 0.4 <br>
            Gaussian: 0.5, 0.5, 1
        </td>
        <td>
            Accuracy: 0.5421<br>
            Precision: 0.1988<br>
            Recall: 0.5462
        </td>
        <td>
            CA: 0.0236<br>
            Accuracy: 0.5682<br>
            Precision: 0.2118<br>
            Recall: 0.5356
        </td>
        <td>
            [0, 0.5774, 0, 0.5773, 0, 0.5773]
        </td>
    </tr>
</table>
<br><br>

##### 8.1.2 L1 Logistic Regression
<br>
<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Logistic Regression</font ></p>
        </th>
        <th align="justify">
          Cardio Best Lambda
        </th>
        <th>
          Cardio test
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering - $\;\;$ Normalizing</p>
        </th>
        <td align="left">
            1.9
        </td>
        <td>
            Accuracy: 0.5846<br>
            Precision: 0.5755<br>
            Recall: 0.2675
        </td>
        <td>
            [0.2430, 0, 2.2032, -0.7541, 0.2334, 0.5259, 0, 0, 0.7105, 2.8409, -1.2948, 2.3778, -3.1866, 8.2225, 0, -0.1001, 0, 0.5114, 0.8202, 0, 1.1853]
        </td>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering</p>
        </th>
        <td align="left">
            1.9
        </td>
        <td>
            Accuracy: 0.6411<br>
            Precision: 0.4723<br>
            Recall: 0.6360
        </td>
        <td>
            [0.2454, 0, 2.2035, -0.7565,  0.2334,  0.5258, 0, 0, 0.7106,  2.8411, -1.2947,  2.3779, -3.1867, 8.2227, 0, -0.0971, 0, 0.5094, 0.8159, 0, 1.1851]
        </td>
    </tr>
</table>
<br>
<br>
<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Logistic Regression</font ></p>
        </th>
        <th align="justify">
          Dementia Best Lambda
        </th>
        <th>
          Dementia test
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering - $\;\;$ Normalizing</p>
        </th>
        <td align="left">
            0.5
        </td>
        <td>
            Accuracy: 0.5049<br>
            Precision: 0.3333<br>
            Recall: 0.0168
        </td>
        <td>
            [0, 0, 0, -0.3494, 0.4792, -4.3349, 0, 0, 1.6866, 0, 0, 0, 0, -0.1756, 0, 0, 0, 0, 0, 0]
        </td>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering</p>
        </th>
        <td align="left">
            0.5
        </td>
        <td>
            Accuracy: 0.6716<br>
            Precision: 0.2758<br>
            Recall: 0.7647
        </td>
        <td>
            [0, 0, 0, -0.3494, 0.4792, -4.3349, 0, 0, 1.6867, 0, 0, 0, 0, 0, -0.1756, 0, 0, 0, 0, 0, 0]
        </td>
    </tr>
</table>
<br>

#### 8.2 Results discussion
Some considerations can be pointed out from the results:
<ul>
    <li>there seem to be some performance improvements when using an approach based on the CA rather than one based on the accuracy;</li>
    <li>the kernel normalisation approach gives better results in terms of accuracy when compared to the others;</li>
    <li>the approach based on the L2 penalization usually picks a Polynomial kernel;</li>
    <li>kernel normalisation always provides the best result when the functional is penalized with an L2-norm (Tikhonov) term;</li>
    <li>the selected kernel tends to belong to the set K = {Laplacian, Sigmoid, Polynomial};</li>
    <li>among the best performing configurations, the Polynomial kernel is always selected and the clinical dataset is pointed out as the most informative; the latter is justified by the coefficient ($\eta_1$) associated with that dataset once transformed by the Polynomial kernel, which is several orders of magnitude greater than the other coefficients;</li>
    <li>since the information is mainly contained in the clinical dataset, the performance achieved with a shallow classification is comparable to that obtained with MKL.</li>
</ul>
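
The CA criterion and the kernel normalisation discussed in this list can be illustrated with a minimal NumPy sketch. It assumes that "CA" denotes the centered kernel alignment of [1] and that "K Normalizing" denotes the cosine normalisation $K_{ij}/\sqrt{K_{ii}K_{jj}}$; the function names and the toy data are hypothetical and are not taken from the project code.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space: H K H with H = I - 11^T / n."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def normalize_kernel(K):
    """Cosine normalisation (assumed meaning of 'K Normalizing'): unit diagonal."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def centered_alignment(K, y):
    """Centered alignment between K and the target kernel y y^T."""
    Kc = center_kernel(K)
    Yc = center_kernel(np.outer(y, y))
    return np.sum(Kc * Yc) / (np.linalg.norm(Kc) * np.linalg.norm(Yc))

# Toy labels and two hypothetical "datasets": one tied to the labels, one not.
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
X_informative = y[:, None] * np.array(
    [[1.0, 0.9], [1.1, 1.0], [0.9, 1.1], [1.0, 1.1], [1.1, 0.9], [0.9, 1.0]]
)
X_uninformative = np.array(
    [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
)

# Degree-2 polynomial kernels, normalised to unit diagonal.
K_inf = normalize_kernel((X_informative @ X_informative.T + 1.0) ** 2)
K_unf = normalize_kernel((X_uninformative @ X_uninformative.T + 1.0) ** 2)

ca_inf = centered_alignment(K_inf, y)
ca_unf = centered_alignment(K_unf, y)
print(f"CA informative: {ca_inf:.4f}, CA uninformative: {ca_unf:.4f}")
```

In this toy setting the kernel built on label-dependent features is strongly aligned with the target, while the one built on features repeated identically in both classes is not, which mirrors the role the alignment plays when ranking datasets.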

### 9 Further considerations and studies

Section 8 suggests that, with high probability, the most (and only!) meaningful dataset for the project's purposes (see Section 1) is the clinical one. An interesting follow-up study would therefore apply a multi-kernel learning procedure to that dataset alone. Such an approach yields multiple points of view (one per kernel used) of the same object, i.e., the clinical dataset, and could provide a better understanding of the correlation between the data and the diseases.
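
The idea of several kernel "views" of a single dataset can be sketched as follows. This is only an illustration under stated assumptions: the weights are set proportionally to each kernel's centered alignment (the simple alignment-proportional heuristic described in [1]), not by the penalized functional used in the rest of this study, and the function names, kernel parameters and toy data are hypothetical.

```python
import numpy as np

def polynomial_kernel(X, degree, c=1.0):
    """Polynomial kernel (x.x' + c)^degree, normalised to unit diagonal."""
    K = (X @ X.T + c) ** degree
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def gaussian_kernel(X, sigma):
    """Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def centered_alignment(K, y):
    """Centered alignment between K and the target kernel y y^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Yc = H @ K @ H, H @ np.outer(y, y) @ H
    return np.sum(Kc * Yc) / (np.linalg.norm(Kc) * np.linalg.norm(Yc))

def alignment_weights(kernels, y):
    """Non-negative weights proportional to each kernel's centered alignment."""
    scores = np.array([max(centered_alignment(K, y), 0.0) for K in kernels])
    if scores.sum() == 0.0:  # no kernel is positively aligned: fall back to uniform
        return np.full(len(kernels), 1.0 / len(kernels))
    return scores / scores.sum()

# Hypothetical stand-in for the clinical dataset (6 patients, 2 features).
X = np.array([[1.0, 0.2], [0.9, -0.1], [1.1, 0.1],
              [-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# One view per kernel, all computed on the same data.
kernels = [polynomial_kernel(X, d) for d in (3, 5)]
kernels += [gaussian_kernel(X, s) for s in (0.5, 1.0)]

eta = alignment_weights(kernels, y)
K_combined = sum(e * K for e, K in zip(eta, kernels))
print("eta:", np.round(eta, 4))
```

The combined matrix `K_combined` could then be passed to any kernel classifier that accepts a precomputed kernel.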


#### 9.1 MKL over the Clinical dataset

Given the results of Section 8, it has been decided to build a wider overall view of the Clinical dataset by combining the Polynomial kernel with the Gaussian one (again, but with a larger set of candidate meta-parameters). The performances are reported in the following.



#### 9.2 Results discussion

### 10 Shallow regression

Having been identified as the most (and only) relevant dataset for dementia and cardiovascular failures, the Clinical dataset has been used for regression purposes. The goal of the regression is to predict the age at which the disease will arise.
The regression algorithms adopted are Lasso and SVM. Their performances and a discussion follow.
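
As a reminder of how the Lasso produces sparse coefficient vectors, the sketch below solves $\min_w \frac{1}{2n}\|Xw - y\|^2 + \lambda\|w\|_1$ with the proximal gradient (ISTA) scheme described in [2] and [3]. It is a generic illustration, not the solver used for the results below; the function names and the toy data are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||Xw - y||^2 + lam * ||w||_1 by proximal gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    # Step 1/L, where L = sigma_max(X)^2 / n is the Lipschitz constant of the gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n                      # gradient of the smooth term
        w = soft_threshold(w - step * grad, step * lam)   # proximal (shrinkage) step
    return w

# Toy design with orthogonal columns: one strong and one weak true coefficient.
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = X @ np.array([2.0, 0.05])

w = lasso_ista(X, y, lam=0.1)
print("Lasso coefficients:", w)
```

With this orthogonal design the weak coefficient is set exactly to zero while the strong one is only shrunk by $\lambda$, which is the selection behaviour visible in the Eta columns of the L1 rows.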


#### 10.1 Shallow regression performances

<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Lasso and SVM</font ></p>
        </th>
        <th align="justify">
          Cardio Best Lambda
        </th>
        <th>
          Cardio test
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            <p>L2 - Centering - $\;\;$ Normalizing</p>
        </th>
        <td align="left">
            0.12
        </td>
        <td>
            Average Error: 0.6927<br>
            Error Variance: 0.7127
        </td>
        <td>
            -
        </td>
    </tr>
    <tr>
        <th>
            <p>L2 - Centering</p>
        </th>
        <td align="left">
            0.12
        </td>
        <td>
            Average Error: 0.6938<br>
            Error Variance: 0.7111
        </td>
        <td>
            -
        </td>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering - $\;\;$ Normalizing</p>
        </th>
        <td align="left">
            0.01
        </td>
        <td>
            Average Error: 0.8382<br>
            Error Variance: 0.1187
        </td>
        <td>
            [0, 0, 0, 0.1439, 0.0559, 0, 0, 0, 0.2578, 0, 0, 0, 0, 0.9938, 0, 0, 0, 0, 0, 0, 0]
        </td>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering</p>
        </th>
        <td align="left">
            0.01
        </td>
        <td>
            Average Error: 2.7799<br>
            Error Variance: 4.3404
        </td>
        <td>
            [0, 0,  0, 0.1439, 0.0559, 0, 0, 0, 0.2578,  0, 0,  0, 0, 0.9938, 0, 0, 0,  0, 0, 0, 0]
        </td>
    </tr>
</table>
<br>
<br>
<table style="width:100%">
    <tr>
        <th>
          <p><font  color="red">Based on Lasso and SVM</font ></p>
        </th>
        <th align="justify">
          Dementia Best Lambda
        </th>
        <th>
          Dementia test
        </th>
        <th>
          Eta
        </th>
    </tr>
    <tr>
        <th>
            <p>L2 - Centering - $\;\;$ Normalizing</p>
        </th>
        <td align="left">
            0.12
        </td>
        <td>
            Average error: 0.6938<br>
            Error variance: 0.7111
        </td>
        <td>
            -
        </td>
    </tr>
    <tr>
        <th>
            <p>L2 - Centering</p>
        </th>
        <td align="left">
            0.12
        </td>
        <td>
            Average error: 0.6938<br>
            Error variance: 0.7111
        </td>
        <td>
            -
        </td>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering - $\;\;$ Normalizing</p>
        </th>
        <td align="left">
            0.1
        </td>
        <td>
            Average error: 0.8433<br>
            Error variance: 0.1335
        </td>
        <td>
            [0, 0, 0, -0.2173, 0.0739, 0, 0, 0, 0.2657, 0, 0, 0, 0, 1.2094, 0, 0, 0, 0, 0, 0, 0]
        </td>
    </tr>
    <tr>
        <th>
            <p>L1 - Centering</p>
        </th>
        <td align="left">
            0.1
        </td>
        <td>
            Average error: 3.3486<br>
            Error variance: 6.0332
        </td>
        <td>
            [0, 0, 0, -0.2173, 0.0739, 0, 0, 0, 0.2657, 0, 0, 0, 0, 1.2094, 0, 0, 0, 0, 0, 0, 0]
        </td>
    </tr>
</table>

#### 10.2 Results discussion

Some considerations can be pointed out from the results:
<ul>
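    <li>the L2-penalized approaches give a lower average error than the L1 ones; their error variance is lower only when compared to the L1 configuration without normalization, since the "L1 - Centering - Normalizing" configuration shows the smallest variance overall;</li>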
    <li>when the Lasso algorithm is used, only four variables are selected, shrunk in roughly the same proportions for both diseases; they are the ones associated with the following information:
        <ul>
            <li>dbp: diastolic blood pressure</li>
            <li>dur_diab: duration of diabetes</li>
            <li>e_age: age at the development of the disease</li>
            <li>pre: whether or not the subject had heart problems before</li>
        </ul>
    </li>
</ul>
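
The observation about the selected variables can be checked directly on the Eta vectors reported in Section 10.1. The sketch below copies those vectors (treating the values separated only by spaces in the tables as distinct entries, which is an assumption) and extracts the positions of the non-zero coefficients together with their share of the total absolute weight. The helper names are hypothetical, and the mapping from positions to variable names is not reproduced here because the column order is not given in this section.

```python
# Eta vectors of the L1 rows in Section 10.1 (21 coefficients each).
eta_cardio = [0, 0, 0, 0.1439, 0.0559, 0, 0, 0, 0.2578, 0, 0,
              0, 0, 0.9938, 0, 0, 0, 0, 0, 0, 0]
eta_dementia = [0, 0, 0, -0.2173, 0.0739, 0, 0, 0, 0.2657, 0, 0,
                0, 0, 1.2094, 0, 0, 0, 0, 0, 0, 0]

def selected_indices(eta, tol=1e-12):
    """Positions of the coefficients that the Lasso did not set to zero."""
    return [i for i, v in enumerate(eta) if abs(v) > tol]

def relative_weights(eta):
    """Share of the total absolute weight carried by each selected coefficient."""
    total = sum(abs(v) for v in eta)
    return {i: abs(eta[i]) / total for i in selected_indices(eta)}

for name, eta in (("cardio", eta_cardio), ("dementia", eta_dementia)):
    shares = {i: round(s, 3) for i, s in relative_weights(eta).items()}
    print(name, selected_indices(eta), shares)
```

Both vectors have non-zero entries at the same four positions, and in both cases the largest share belongs to the same coefficient.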

### 11 Conclusions and further possible studies

This section summarizes the results obtained in the study presented above and suggests further possible studies. Most of the proposed ideas have not been tested, largely because of the authors' time and computational limits.


#### 11.1 What has been learned

1) In the context of the Cortes Alignment, the use of the Polynomial kernel seems to consistently provide higher performance<br>
2) The most informative (and only relevant) dataset seems to be the Clinical one


#### 11.2 Best approaches and configurations

The following table summarizes the best performing approaches and configurations that have been found:
<br>
<br>
<table>
    <tr>
        <th>
            Disease to predict
        </th>
        <th>
            Performance base
        </th>
        <th>
            Best approach
        </th>
        <th>
            Best configuration
        </th>
    </tr>
    <tr>
        <td>
            Heart Attack
        </td>
        <td>
            CA
        </td>
        <td>
            L2 - Centering - K Normalization
        </td>
        <td>
            Lambda: 0.7<br>
            Polynomial: 5, 3, 5<br>
            Gaussian: 1, 1, 1
        </td>
    </tr>
    <tr>
        <td>
            Heart Attack
        </td>
        <td>
            Accuracy
        </td>
        <td>
            L2 - Centering - K Normalization
        </td>
        <td>
            Lambda: 0.9<br>
            Polynomial: 5, 3, 3<br>
            Gaussian: 0.5, 0.5, 1
        </td>
    </tr>
    <tr>
        <td>
            Dementia
        </td>
        <td>
            CA
        </td>
        <td>
            L2 - Centering - K Normalization
        </td>
        <td>
            Lambda: 0.7<br>
            Polynomial: 3, 3, 3<br>
            Gaussian: 0.5, 0.5, 0.5
        </td>
    </tr>
    <tr>
        <td>
            Dementia
        </td>
        <td>
            Accuracy
        </td>
        <td>
            L2 - Centering - K Normalization
        </td>
        <td>
            Lambda: 0.5<br>
            Polynomial: 5, 3, 3<br>
            Gaussian: 1, 0.5, 1
        </td>
    </tr>
</table>

#### 11.3 Suggested further studies

Further procedures, approaches and ideas that could have been either added to the pipeline or swapped with some of its steps are presented in the following:<br>
<ul>
    <li>Introduce the usage of filter methods in the preprocessing phase</li>
    <li>Provide a deeper and more careful analysis of the stability of the configurations' performance</li>
    <li>Study configurations based on kernel combinations that always include the Polynomial one</li>
    <li>Replace the functional based on the kernel alignment with one based on risk minimization</li>
</ul>
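
As an illustration of the first suggestion, a filter method scores each feature independently of any learning model and keeps only the best ones. The minimal sketch below ranks features by the absolute Pearson correlation with the label; this specific criterion, the function name and the toy data are assumptions made for the example and were not part of the study.

```python
import numpy as np

def correlation_filter(X, y, k):
    """Return the indices of the k features most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    scores = np.zeros(X.shape[1])
    valid = denom > 0                      # constant features keep a score of 0
    scores[valid] = np.abs(Xc[:, valid].T @ yc) / denom[valid]
    return np.argsort(-scores)[:k], scores

# Toy data: features 0 and 2 follow the label, feature 1 is unrelated to it.
X = np.array([[1.0, 0.0, -2.0],
              [1.0, 1.0, -2.0],
              [-1.0, 0.0, 2.0],
              [-1.0, 1.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

kept, scores = correlation_filter(X, y, k=2)
print("kept features:", sorted(kept.tolist()), "scores:", scores)
```

Such a step would run before the kernel computation, so that each kernel is built only on the retained columns.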

### References
[1] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh (2010). "Two-stage learning kernel algorithms", In Proceedings of the 27th International Conference on Machine Learning.<br>
[2] https://en.wikipedia.org/wiki/Proximal_gradient_methods_for_learning<br>
[3] Patrick L. Combettes and Valérie R. Wajs (2005). "Signal Recovery by Proximal Forward-Backward Splitting", Multiscale Modeling and Simulation 4 (4): 1168–1200. doi:10.1137/050626090<br>
[4] http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html<br>
[5] http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html