# Joint conf-call report #3

In [1]:
import datetime
print(str(datetime.datetime.today()))

2018-05-09 16:56:40.244625


# Formalism adopted

The following are the most common formalisms:
- an uppercase bold letter like $\mathbf{A}$ indicates a matrix;
- a lowercase bold letter like $\mathbf{x}$ indicates a column vector ($\mathbf{x}^T$ is a row vector);
- a lowercase normal letter like $x$ indicates a one-dimensional variable;
- a subscript usually stands for _index access_, e.g. $x_i$ is the $i$-th component of $\mathbf{x}$;
- a superscript within brackets stands for a sequence that embodies the history of a variable (or anything else), like $\pi^{(0)},\pi^{(1)},...,\pi^{(t-1)},\pi^{(t)}$.

Therefore:
- the training set is $\mathcal{X} = (\mathbf{X}, \mathbf{y}) = \{(\mathbf{x_i}, y_i) : i = 1...n\}$;
- $\mathbf{X}$ is an $n \times p$ matrix where $n$ is the number of _samples_ and $p$ is the number of _predictors_ (or _features_);
- $\mathbf{x_i} = (x_{i1},...,x_{ip}) \in \mathbf{X}$ is a sample of the training set while $y_i$ is its label;
- $\mathbf{w}^{(t+1)}$ is the $(t+1)$-th update of the weight vector $\mathbf{w}$.

# Iterations' velocity lower bound comparison
> #### Why is it a lower bound for $k=1$
> ![Example](media/img/graph-samples/iter-speed-lb-example.png)

> `a` is the most advanced node (i.e. `a.iteration` $>$ `b.iteration` and `a.iteration` $>$ `c.iteration`), so it is waiting for `b` to finish its computation. However, `b` has already received the required input from `c`, so as soon as `b` finishes its current computation, both `a` and `b` can move on to the next step. The next iteration will therefore be computed by the most advanced node (either `a` or `b`) after $\frac{1}{\lambda} + \frac{1}{2 \lambda}$, and not after $\frac{1}{\lambda} + \frac{1}{\lambda}$ as assumed by the lower bound.

> _(from 2nd-conf-call-report)_

## Actually it is not a lower bound
**Counterexample**.
When `a` ends its computation for iteration $t$, it has to wait for `b` to finish iteration $t$ before it can start iteration $t+1$. But there is no guarantee that, by the time `a` finishes, `b` has already received the outcomes it needs from its parent to perform iteration $t$: `b` may still be waiting for `c` to finish iteration $t-1$ while `a` has already finished iteration $t$. Hence this value is not a lower bound. However, it seems to be very close to the lower bound, so for the tests in this report I will use it as the lower bound.

The speed (almost) lower bound is

$$V_l = \frac{\lambda}{1+\sum_{i=1}^{k} \frac{1}{i}}$$
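One possible reading of this formula, under the assumption (not stated in this report) that each node's computation time is exponentially distributed with rate $\lambda$: a node spends $\frac{1}{\lambda}$ on its own step and then waits for the slowest of its $k$ neighbours, and the expected maximum of $k$ i.i.d. exponential times is $\frac{1}{\lambda}\sum_{i=1}^{k}\frac{1}{i}$. The sketch below only checks that identity numerically; the function names `mean_max_exponential` and `harmonic` are hypothetical and not part of the simulator.

```python
import numpy as np


def harmonic(k):
    """Return the k-th harmonic number sum_{i=1}^{k} 1/i (0 for k = 0)."""
    return sum(1.0 / i for i in range(1, k + 1))


def mean_max_exponential(k, lam=1.0, n_trials=200_000, seed=0):
    """Monte Carlo estimate of E[max of k i.i.d. Exp(lam) variables].

    Assumption: computation times are exponential with rate lam.
    """
    rng = np.random.default_rng(seed)
    # One row per trial, one column per neighbour.
    samples = rng.exponential(scale=1.0 / lam, size=(n_trials, k))
    return float(samples.max(axis=1).mean())


if __name__ == "__main__":
    for k in (1, 2, 9):
        print(k, mean_max_exponential(k), harmonic(k))
```

With $\lambda = 1$ the two printed columns should agree up to sampling noise, which is consistent with a mean time per iteration of $\frac{1}{\lambda}\left(1 + \sum_{i=1}^{k}\frac{1}{i}\right)$.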

The lower bound on the number of iterations is

$$\#it = V_l \cdot time$$

so we are dealing with a line $f(x) = V_l x$.

For $n=10$ and $\lambda = 1$:

| Graph | degree $k$ | $V_l$
| :--- |:---: | ---:
| Clique |  $9$  | $\sim 0.26$
| Root and diameter expanders | $2$  | $0.4$
| Cycle |  $1$  | $0.5$
| Diagonal* | $0$  | $1$

_(*) NB: $V_l$ for diagonal graph is not reliable._
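As a quick check, the formula for $V_l$ can be evaluated directly for the degrees listed in the table. This is a minimal sketch; `speed_lower_bound` and `iterations_lower_bound` are hypothetical helper names, not functions of the simulator.

```python
def speed_lower_bound(k, lam=1.0):
    """Evaluate V_l = lam / (1 + sum_{i=1}^{k} 1/i) for a graph of degree k."""
    harmonic = sum(1.0 / i for i in range(1, k + 1))
    return lam / (1.0 + harmonic)


def iterations_lower_bound(k, time, lam=1.0):
    """Evaluate the line f(x) = V_l * x at x = time."""
    return speed_lower_bound(k, lam) * time


if __name__ == "__main__":
    topologies = [("Clique", 9), ("Expanders", 2), ("Cycle", 1), ("Diagonal", 0)]
    for name, k in topologies:
        print(f"{name:10s} k={k}  V_l={speed_lower_bound(k):.4f}")
```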

### Iterations over time LB plot for _10k samples GD test 003_
![Iter/time LB for 10k samples](media/img/tests/test_003_10ksamples_classic/1_iter-lb_time.png)

## Velocity vs degree test
`test_log/test_004_velVSdeg10ksamples10ktime_classic`.

### Test parameters
- `n = 100` (number of computational units).
- `n_samples = 10000`.
- `n_features = 100`.
- `max_time = 10000`.

### Test topologies
Based on degree:
- degree **1**: cycle;
- degree **2**: diameter expander;
- degree **3**, **4**, **8**, **20**, **50**: `generate_d_regular_graph_by_degree(100, k)`;
- degree **99**: clique.

#### How generic regular graphs with degree $k : 3 \leq k \leq N-2$ are generated
They are generated with the following function.
```python
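def generate_d_regular_graph_by_degree(N, K):
    # Build K edge patterns that are applied identically to every node i.
    if N <= 1 or K <= 0:
        edges = []
    else:
        # First pattern: link each node to the next one on the cycle.
        edges = ["i->i+1"]
        # Remaining K-1 patterns: offsets spread evenly over the N nodes.
        for i in range(K-1):
            edges.append("i->i+{}".format(int((i + 1) * N / K)))

    # Turn the patterns into the N x N adjacency matrix.
    return generate_d_regular_graph_by_edges(N, edges)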
```

**Execution examples.**
The body of the function generates the edges of the generic node $i$ as a list of strings, as in the snippets below:

In [2]:
N = 100
K = 10
edges = ["i->i+1"]
for i in range(K-1):
    edges.append("i->i+{}".format(int((i + 1) * N / K)))
print(edges)

['i->i+1', 'i->i+10', 'i->i+20', 'i->i+30', 'i->i+40', 'i->i+50', 'i->i+60', 'i->i+70', 'i->i+80', 'i->i+90']


In [3]:
N = 10
K = 4
edges = ["i->i+1"]
for i in range(K-1):
    edges.append("i->i+{}".format(int((i + 1) * N / K)))
print(edges)

['i->i+1', 'i->i+2', 'i->i+5', 'i->i+7']


These edges are then used by `generate_d_regular_graph_by_edges(N, edges)` to produce the corresponding $N \times N$ adjacency matrix of the graph.

**Example.**
Let's consider the edge list `['i->i+1', 'i->i+2', 'i->i+5', 'i->i+7']` and $N=10$. 

- node $0$ edges will be: $0 \to 1$, $0 \to 2$, $0 \to 5$, $0 \to 7$;
- node $1$ edges will be: $1 \to 2$, $1 \to 3$, $1 \to 6$, $1 \to 8$;
- ...
- node $5$ edges will be: $5 \to 6$, $5 \to 7$, $5 \to 0$, $5 \to 2$, since the $mod\ N$ operation is applied to each edge evaluation;
- etc.

```python
generate_d_regular_graph_by_edges(10, ['i->i+1', 'i->i+2', 'i->i+5', 'i->i+7'])
```

Output:

```
array([[1., 1., 1., 0., 0., 1., 0., 1., 0., 0.],
       [0., 1., 1., 1., 0., 0., 1., 0., 1., 0.],
       [0., 0., 1., 1., 1., 0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 1., 1., 1., 0., 0., 1.],
       [1., 0., 1., 0., 0., 1., 1., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0., 1., 1., 1., 0.],
       [0., 0., 1., 0., 1., 0., 0., 1., 1., 1.],
       [1., 0., 0., 1., 0., 1., 0., 0., 1., 1.],
       [1., 1., 0., 0., 1., 0., 1., 0., 0., 1.]])
```
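The body of `generate_d_regular_graph_by_edges` is not shown in this report. The sketch below is a hypothetical stand-in, named `adjacency_from_edge_patterns`, that reproduces the matrix above under two assumptions read off that output: every pattern has the form `i->i+d`, and every node also has a self-loop (the ones on the diagonal).

```python
import numpy as np


def adjacency_from_edge_patterns(n, edges):
    """Hypothetical stand-in for generate_d_regular_graph_by_edges.

    Assumes patterns of the form "i->i+d" and one self-loop per node.
    """
    adj = np.eye(n)  # self-loops, as on the diagonal of the output above
    for pattern in edges:
        offset = int(pattern.split("+")[1])  # "i->i+5" -> 5
        for i in range(n):
            # Wrap around the last node with the mod N operation.
            adj[i, (i + offset) % n] = 1.0
    return adj


if __name__ == "__main__":
    adj = adjacency_from_edge_patterns(10, ["i->i+1", "i->i+2", "i->i+5", "i->i+7"])
    print(adj)
    # Each node has its 4 outgoing edges plus the self-loop.
    assert (adj.sum(axis=1) == 5).all()
```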

### Velocity vs degree test - iterations over time
![Velocity vs degree iter/time plot.](media/img/tests/test_004_velVSdeg10ksamples10ktime_classic/1_iter_time.png)

### Velocity vs degree test - mse over time
![Velocity vs degree mse/time plot.](media/img/tests/test_004_velVSdeg10ksamples10ktime_classic/3_mse_time.png)

### Velocity vs degree test - rmse over time
![Velocity vs degree rmse/time plot.](media/img/tests/test_004_velVSdeg10ksamples10ktime_classic/3_real-mse_time.png)

# Single node model inspection
Set of tests #003 showed some unexpected outcomes:
1. the diagonal topology outperformed the other topologies in the _MSE/RMSE over iterations_ plots, whereas we expected that a higher degree would make single-step updates more accurate;
2. in the _MSE/RMSE over time_ plots, from a certain time onward all topologies seemed to converge to a high error value (too far from the noise variance), while the diagonal topology seemed to be the only one that kept improving;
3. 

Below I analyze these points in detail, trying to figure out their meaning and causes.

## Single node model inspection
I have noticed:

- averaging at the end of each step seems to count as much as averaging just once at the end, which is what these outcomes appear to suggest;
- 

I decided to inspect single node performance, so below I provide a test whose plots take into account the local metrics of the nodes.

In the following test, MSE (and RMSE) are computed by averaging local errors in nodes:

$$MSE = \frac{1}{n} \sum_{v=1}^{n} MSE_v$$

$$RMSE = \frac{1}{n} \sum_{v=1}^{n} RMSE_v$$

where $MSE_v$ and $RMSE_v$ are respectively the mean squared error and the real mean squared error of node $v$:

$$MSE_v = \frac{1}{|\mathcal{X}_v|} \sum_{i=1}^{|\mathcal{X}_v|} (y_i - \hat{y}_i)^2$$

$$RMSE_v = \frac{1}{|\mathcal{X}_v|} \sum_{i=1}^{|\mathcal{X}_v|} (\tilde{y}_i - \hat{y}_i)^2$$
 
- $\mathcal{X}_v \subset \mathcal{X}$ is the subset of the training set assigned to $v$;
- $y_i$ is the outcome provided by the training set;
- $\hat{y}_i$ is the value computed using the model $\mathbf{w}_v$;
- $\tilde{y}_i$ is the noiseless $y_i$.

These MSE and RMSE will be called "**ALT metrics**" (_alternative_ metrics).
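The definitions above can be sketched as follows. This is a minimal illustration and not the simulator's code: it assumes a linear model $\hat{y}_i = \mathbf{x_i}^T \mathbf{w}_v$, and the function name `alt_metrics` and its arguments are hypothetical.

```python
import numpy as np


def alt_metrics(X_parts, y_parts, y_clean_parts, weights):
    """Average the per-node MSE and RMSE ("real" MSE) over all nodes.

    X_parts[v], y_parts[v], y_clean_parts[v]: samples, noisy labels and
    noiseless labels assigned to node v; weights[v]: local model w_v.
    Assumption: predictions are linear, y_hat = X_v @ w_v.
    """
    mse_per_node, rmse_per_node = [], []
    for X_v, y_v, y_clean_v, w_v in zip(X_parts, y_parts, y_clean_parts, weights):
        y_hat = X_v @ w_v
        mse_per_node.append(np.mean((y_v - y_hat) ** 2))         # MSE_v
        rmse_per_node.append(np.mean((y_clean_v - y_hat) ** 2))  # RMSE_v
    return float(np.mean(mse_per_node)), float(np.mean(rmse_per_node))


if __name__ == "__main__":
    # Two nodes with 2 and 1 local samples respectively.
    X_parts = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 1.0]])]
    y_parts = [np.array([1.0, 3.0]), np.array([2.0])]
    y_clean_parts = [np.array([1.0, 2.0]), np.array([1.0])]
    weights = [np.array([1.0, 2.0]), np.array([0.0, 0.0])]
    print(alt_metrics(X_parts, y_parts, y_clean_parts, weights))
```

Note that every node weighs the same in the average, regardless of how many samples it holds.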

## 10k samples ALT metrics 500k time GD test
- `n_samples = 10000`.
- `max_time = 500000`: simulation stops when time reaches $500000$s.
- All other parameters are left untouched.

### 10k samples ALT metrics 500k time GD - iter/time
![10k samples ALT metrics GD test iter/time.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/1_iter_time.png)

### 10k samples ALT metrics 500k time GD - mse/iter
![10k samples ALT metrics 500k time GD test mse/iter.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/2_mse_iter.png)

![10k samples ALT metrics 500k time GD test mse/iter zoom.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/2_mse_iter_zoom.png)

### 10k samples ALT metrics 500k time GD - rmse/iter
![10k samples ALT metrics 500k time GD test rmse/iter.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/2_real-mse_iter.png)

![10k samples ALT metrics 500k time GD test rmse/iter zoom.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/2_real-mse_iter_zoom.png)

### 10k samples ALT metrics 500k time GD - mse/time

#### 10k samples ALT metrics 500k time GD - mse/time (log scale)
![10k samples ALT metrics 500k time GD test mse/time.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_mse_time.png)

![10k samples ALT metrics 500k time GD test mse/time zoom 1.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_mse_time_zoom_1.png)

#### 10k samples ALT metrics 500k time GD - mse/time (linear scale zoom)
![10k samples ALT metrics 500k time GD test mse/time zoom 2.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_mse_time_zoom_2.png)

![10k samples ALT metrics 500k time GD test mse/time zoom 3.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_mse_time_zoom_3.png)

### 10k samples ALT metrics 500k time GD - rmse/time
![10k samples ALT metrics 500k time GD test rmse/time.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_real-mse_time.png)

![10k samples ALT metrics 500k time GD test rmse/time zoom 1.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_real-mse_time_zoom_1.png)

![10k samples ALT metrics 500k time GD test rmse/time zoom 2.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_real-mse_time_zoom_2.png)

![10k samples ALT metrics 500k time GD test rmse/time zoom 3.](media/img/tests/test_004_10ksamplesALTmetrics500ktime_classic/3_real-mse_time_zoom_3.png)

## 10k samples 500k time GD test
Here MSE and RMSE are computed in the regular way (i.e. not by averaging local errors).

The simulation stops when time reaches $500000$s.

This simulation took 3h 14m 26s.

### 10k samples 500k time GD - iter/time
![10k samples 500k time GD test iter/time.](media/img/tests/test_004_10ksamples500ktime_classic/1_iter_time.png)

### 10k samples 500k time GD - mse/iter
![10k samples 500k time GD test mse/iter.](media/img/tests/test_004_10ksamples500ktime_classic/2_mse_iter.png)

![10k samples 500k time GD test mse/iter zoom.](media/img/tests/test_004_10ksamples500ktime_classic/2_mse_iter_zoom.png)

### 10k samples 500k time GD - rmse/iter
![10k samples 500k time GD test rmse/iter.](media/img/tests/test_004_10ksamples500ktime_classic/2_real-mse_iter.png)

![10k samples 500k time GD test rmse/iter zoom.](media/img/tests/test_004_10ksamples500ktime_classic/2_real-mse_iter_zoom.png)

### 10k samples 500k time GD - mse/time
![10k samples 500k time GD test mse/time.](media/img/tests/test_004_10ksamples500ktime_classic/3_mse_time.png)

![10k samples 500k time GD test mse/time zoom 1.](media/img/tests/test_004_10ksamples500ktime_classic/3_mse_time_zoom_1.png)

![10k samples 500k time GD test mse/time zoom 2.](media/img/tests/test_004_10ksamples500ktime_classic/3_mse_time_zoom_2.png)

### 10k samples 500k time GD - rmse/time
![10k samples 500k time GD test rmse/time.](media/img/tests/test_004_10ksamples500ktime_classic/3_real-mse_time.png)

![10k samples 500k time GD test rmse/time zoom 1.](media/img/tests/test_004_10ksamples500ktime_classic/3_real-mse_time_zoom_1.png)

![10k samples 500k time GD test rmse/time zoom 2.](media/img/tests/test_004_10ksamples500ktime_classic/3_real-mse_time_zoom_2.png)