# AutoScale test case for learning purposes

In [324]:
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import spearmanr

rng = np.random.default_rng()

## Create response matrix consisting of **independent** responses

We draw random observations with enough realizations that each principal component explains approximately 25% of the variance.
Note that quite a few realizations are needed before this is the case.

In [325]:
n_observations = 4
n_realizations = 100
# NB! Note the transposing of Y before SVD.
Y = rng.normal(size=(n_observations, n_realizations)).T
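The claim that many realizations are needed before each component explains roughly 25% can be checked with a small experiment. This is a minimal sketch: the helper `explained_variance_ratio`, the fixed seed, and the chosen ensemble sizes are assumptions made for illustration, not part of the original code.

```python
import numpy as np


def explained_variance_ratio(Y):
    """Per-component share of variance from the SVD of column-centred Y."""
    _, s, _ = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
    return s**2 / np.sum(s**2)


rng = np.random.default_rng(0)  # fixed seed is an assumption, for reproducibility
n_observations = 4

ratios = {}
for n_realizations in (10, 100, 10_000):
    # Same construction as in the notebook: rows are realizations after the transpose.
    Y = rng.normal(size=(n_observations, n_realizations)).T
    ratios[n_realizations] = explained_variance_ratio(Y)
    print(n_realizations, np.round(ratios[n_realizations], 3))
```

With few realizations the leading component tends to take a noticeably larger share than 25%, and the shares should approach 25% each as the number of realizations grows.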

## Response Scaling

Scaling is done as `scaled_responses = responses / obs_errors.reshape(-1, 1)` in the code, but we don't need it here since all observations are drawn from a standard normal distribution.
However, it is worth investigating why we don't use standard scaling instead, i.e., `(responses - mean(responses)) / sd(responses)`.
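To make the difference between the two scalings concrete, here is a small comparison. It is only a sketch: the `obs_errors` values and the response mean of 10 are hypothetical, chosen so that the two scalings give visibly different results.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed is an assumption

# Hypothetical observation errors; in this notebook they would all be 1.
obs_errors = np.array([0.5, 1.0, 2.0, 4.0])
# responses has shape (n_observations, n_realizations), every row with unit spread.
responses = rng.normal(loc=10.0, scale=1.0, size=(4, 100))

# Scaling as quoted from the code: divide each row by its observation error.
scaled_by_error = responses / obs_errors.reshape(-1, 1)

# Standard scaling: centre each row and divide by its sample standard deviation.
standardized = (responses - responses.mean(axis=1, keepdims=True)) / responses.std(
    axis=1, ddof=1, keepdims=True
)

print("spread after error scaling:   ", scaled_by_error.std(axis=1, ddof=1).round(2))
print("spread after standard scaling:", standardized.std(axis=1, ddof=1).round(2))
```

Standard scaling forces every row to unit spread, so the ensemble spread relative to the observation error is lost, while dividing by `obs_errors` keeps it. Whether that is the reason behind the choice in the code is not established here.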

## Find the number of principal components that explain **less** than 95% of the variance

Since we have 4 independent responses here, should we not also get 4 as the number of components?
The current implementation gives 3, even though the first 3 components explain only about 75% of the variance in expectation (about 80% in the run below).
Note also that the number of principal components is later used as the number of clusters,
which means that two of the independent responses will have to be placed in the same cluster.

In [326]:
_, s, _ = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
variance_ratio = np.cumsum(s**2) / np.sum(s**2)
print("variance_ratio: ", variance_ratio)
threshold = 0.95
nr_components = max(len([1 for i in variance_ratio[:-1] if i < threshold]), 1)
print("number of principal components: ", nr_components)

variance_ratio:  [0.29133229 0.56201325 0.80096563 1.        ]
number of principal components:  3
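The gap between 3 and 4 comes from the counting rule. The sketch below reuses the cumulative ratios printed above and compares the rule from the cell with an alternative that returns the smallest number of components whose cumulative ratio reaches the threshold. The alternative and the names `n_below` and `n_to_reach` are assumptions for illustration, not the current implementation.

```python
import numpy as np

# Cumulative variance ratios copied from the output above.
variance_ratio = np.array([0.29133229, 0.56201325, 0.80096563, 1.0])
threshold = 0.95

# Rule used in the cell above: count cumulative ratios strictly below the
# threshold, ignoring the last component, and return at least 1.
n_below = max(int(np.sum(variance_ratio[:-1] < threshold)), 1)

# Alternative (an assumption): the smallest number of components whose
# cumulative ratio reaches the threshold.
n_to_reach = int(np.searchsorted(variance_ratio, threshold, side="left")) + 1

print(n_below, n_to_reach)  # prints: 3 4
```

Because the last component is never counted, the first rule can return at most `n_observations - 1`, whatever the data look like.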


## Find clusters

Since we use a correlation as the measure of distance, in this case Spearman, we will be affected by spurious correlations.
From the documentation of `linkage`:

`The input y may be either a 1-D condensed distance matrix or a 2-D array of observation vectors.`

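We are passing in a square correlation matrix, which is neither of these: `linkage` treats a 2-D input as observation vectors, so each row of the correlation matrix is interpreted as a point and the Euclidean distances between rows are clustered.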

https://github.com/scipy/scipy/blob/e29dcb65a2040f04819b426a04b60d44a8f69c04/scipy/cluster/hierarchy.py#L1024

This needs to be investigated further.
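As a starting point for that investigation, the sketch below checks what `linkage` does with the square matrix and shows one possible alternative. Using `1 - |correlation|` as the dissimilarity is an assumption made for illustration, as are the names `Z_rows`, `Z_pdist`, and `Z_corr` and the fixed seed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)  # fixed seed is an assumption
Y = rng.normal(size=(4, 100)).T
correlation = spearmanr(Y).statistic

# A 2-D input is treated as observation vectors: each row of the correlation
# matrix becomes a point, and Euclidean distances between rows are clustered.
Z_rows = linkage(correlation, "average", "euclidean")
Z_pdist = linkage(pdist(correlation, metric="euclidean"), "average")
print("same linkage:", np.allclose(Z_rows, Z_pdist))

# Assumed alternative: turn correlations into dissimilarities explicitly and
# pass them to linkage in condensed (1-D) form.
dissimilarity = 1 - np.abs(correlation)
np.fill_diagonal(dissimilarity, 0.0)
Z_corr = linkage(squareform(dissimilarity, checks=False), "average")
print(Z_corr)
```

The first comparison confirms that the current call clusters rows of the correlation matrix by Euclidean distance. The second variant clusters on the correlations themselves, which may or may not be what is wanted.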

In [329]:
print("Affected by spurious correlations")
correlation = spearmanr(Y).statistic
print(correlation)
print("---------------------")

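# Maybe we want this instead: linkage_matrix = linkage(Y.T, "average", "correlation")
# A 2-D input is treated as observation vectors, so rows of the correlation matrix are clustered.
linkage_matrix = linkage(correlation, "average", "euclidean")
# Another option (needs `from scipy.spatial.distance import pdist`):
# linkage_matrix = linkage(pdist(Y.T, metric="euclidean"), "average")
print("Note that there are 3 clusters and not 4: ")
# `depth` only applies to the "inconsistent" criterion and has no effect with "maxclust".
fcluster(linkage_matrix, nr_components, criterion="maxclust", depth=2)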

Affected by spurious correlations
[[ 1.         -0.12441644 -0.00621662 -0.04763276]
 [-0.12441644  1.          0.00344434  0.06481848]
 [-0.00621662  0.00344434  1.          0.05968197]
 [-0.04763276  0.06481848  0.05968197  1.        ]]
---------------------
Note that there are 3 clusters and not 4: 


array([3, 1, 2, 1], dtype=int32)

## From each cluster, calculate scaling factor to scale observation errors with

`scaling_factor = sqrt(nr_observations / nr_components)`, computed separately for each cluster.

As far as I know, this is not a well-known method.
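A minimal sketch of this last step follows. It assumes that both counts are taken within each cluster and that the number of components per cluster is found with the same counting rule as above. The helper `count_components` and the fixed seed are hypothetical. The cluster labels are copied from the `fcluster` output.

```python
import numpy as np


def count_components(Y, threshold=0.95):
    """Counting rule from the notebook; Y has shape (n_realizations, n_obs)."""
    _, s, _ = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
    variance_ratio = np.cumsum(s**2) / np.sum(s**2)
    return max(int(np.sum(variance_ratio[:-1] < threshold)), 1)


rng = np.random.default_rng(0)  # fixed seed is an assumption
Y = rng.normal(size=(4, 100)).T
clusters = np.array([3, 1, 2, 1])  # labels from the fcluster output above

scaling_factors = np.ones(Y.shape[1])
for label in np.unique(clusters):
    members = np.flatnonzero(clusters == label)
    n_components = count_components(Y[:, members])
    # Assumption: both counts refer to the observations inside this cluster.
    scaling_factors[members] = np.sqrt(len(members) / n_components)

print(scaling_factors)
```

Under these assumptions the two observations that share a cluster get a factor of `sqrt(2)`, although they were drawn independently, while the single-member clusters keep a factor of 1.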