## Visualization
Rosenbaum's cross-match test has a known, exact null distribution and is distribution-free: it is valid for any underlying distributions F and G whose equality is being tested.
The test constructs an optimal non-bipartite matching of all observations, pairing each data point with exactly one other so that the total within-pair distance is minimized.
If F and G are similar or equal, the matching will contain a high number of cross-matches (pairs with one observation from each group). If they differ, the minimum-distance pairs will mostly come from the same group.
This notebook visualizes the Rosenbaum test with two distributions.
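
As a rough illustration of the idea, the sketch below matches a small toy data set, counts the cross-matches, and evaluates the exact null probability of seeing that few. It is not the implementation used in `src`: the helper names (`optimal_matching`, `count_cross_matches`, `cross_match_pmf`, `cross_match_p_value`) are hypothetical, the matching is obtained from networkx's `max_weight_matching` on flipped distances, and the formula is the standard cross-match null distribution, P(A1 = a1) = 2^a1 · I! / (C(N, n) · a0! · a1! · a2!), with I = N/2 pairs.

```python
from math import comb, factorial

import networkx as nx
import numpy as np
from scipy.spatial.distance import pdist, squareform


def optimal_matching(points):
    """Perfect matching with minimal total squared distance (even number of points)."""
    dist = squareform(pdist(points, metric="sqeuclidean"))
    offset = dist.max() + 1.0
    graph = nx.Graph()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Flip the distances so that a maximum-weight matching
            # corresponds to a minimum-total-distance matching.
            graph.add_edge(i, j, weight=offset - dist[i, j])
    return nx.max_weight_matching(graph, maxcardinality=True)


def count_cross_matches(matching, labels):
    """Number of matched pairs whose members carry different group labels."""
    return sum(labels[i] != labels[j] for i, j in matching)


def cross_match_pmf(a1, n_a, n_b):
    """Exact null probability of exactly a1 cross-matches (n_a + n_b must be even)."""
    if a1 < 0 or a1 > min(n_a, n_b) or (n_a - a1) % 2:
        return 0.0
    pairs = (n_a + n_b) // 2
    a2 = (n_a - a1) // 2  # pairs with both members in group A
    a0 = (n_b - a1) // 2  # pairs with both members in group B
    numerator = 2**a1 * factorial(pairs)
    denominator = comb(n_a + n_b, n_a) * factorial(a0) * factorial(a1) * factorial(a2)
    return numerator / denominator


def cross_match_p_value(a1, n_a, n_b):
    """P(A1 <= a1) under the null: few cross-matches speak against F = G."""
    return sum(cross_match_pmf(a, n_a, n_b) for a in range(a1 + 1))


# Toy data: two groups of 10 points in 2-D, once identical and once shifted.
rng = np.random.default_rng(0)
n_per_group = 10
labels = ["A"] * n_per_group + ["B"] * n_per_group
scenarios = {
    "same distribution": np.vstack([rng.normal(0, 1, (n_per_group, 2)),
                                    rng.normal(0, 1, (n_per_group, 2))]),
    "shifted distribution": np.vstack([rng.normal(0, 1, (n_per_group, 2)),
                                       rng.normal(8, 1, (n_per_group, 2))]),
}
for name, points in scenarios.items():
    matching = optimal_matching(points)
    a1 = count_cross_matches(matching, labels)
    p = cross_match_p_value(a1, n_per_group, n_per_group)
    print(f"{name}: {a1} cross-matches, exact p-value {p:.4f}")
```

Because the null distribution depends only on the group sizes, the p-value needs no assumptions about F and G. Building the complete graph is quadratic in the number of observations, so this sketch is only meant for small examples.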

#### Experiment with the distributions to see the effects on the matching!

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
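import numpy as np
# nx and sc are used below; import them explicitly rather than relying on `from src import *`.
import networkx as nx
import scanpy as sc
from graph_tool.all import Graph, graph_draw
import pandas as pd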
import anndata as ad
import sys
sys.path.append("..")
from src import *
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
import itertools

import time
np.random.seed(42)

will use the CPU to calculate the distance matrix.


In [3]:
metric = "sqeuclidean"

In [4]:
n_obs = 5000
n_var = 1000
samples_A = np.random.normal(0, 1, [n_obs, n_var]) 
samples_B = np.random.normal(0, 1, [n_obs, n_var]) 


groups = ["A"] * n_obs + ["B"] * n_obs  
samples = np.concatenate((samples_A, samples_B)) 
adata = ad.AnnData(samples)
adata.obs["Group"] = groups

In [5]:
print(adata)
start = time.time()
sc.pp.pca(adata)
print("PCA", time.time() - start)

sc.pp.neighbors(adata, metric=metric)
print("Total", time.time() - start)
print(adata)

AnnData object with n_obs × n_vars = 10000 × 1000
    obs: 'Group'
PCA 25.35942578315735




Total 51.8831729888916
AnnData object with n_obs × n_vars = 10000 × 1000
    obs: 'Group'
    uns: 'pca', 'neighbors'
    obsm: 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'


In [6]:
adata.obsp["distances"]

<10000x10000 sparse matrix of type '<class 'numpy.float32'>'
	with 160000 stored elements in Compressed Sparse Row format>

In [None]:
test = "A"
reference = "B"
metric = "sqeuclidean"
# Build the full pairwise distance graph and a matching on it (used for plotting below).
distances = calculate_distances_nx(adata.X, metric)
G = construct_graph_from_distances_nx(distances)
matching = match_nx(G)
# Run the test itself on the same grouping.
p_val, z, support = rosenbaum(adata, group_by="Group", test_group=test, reference=reference, use_nx=True)

using CPU to calculate distance matrix.
creating distance graph.


In [None]:
sc.pp.neighbors(adata, n_neighbors=5, metric=metric)

In [None]:
G_knn = nx.from_scipy_sparse_array(adata.obsp["distances"])

In [None]:
G_matching = nx.from_edgelist(matching)

In [None]:
f, axs = plt.subplots(1, 3, figsize=(10, 4))
used_elements = list(itertools.chain.from_iterable(matching))
# nx.draw expects 2-D positions, so use the first two features of each matched observation.
pos = {i: adata.X[i, :2] for i in used_elements}
palette = {"A": "purple", "B": "orange"}
group_of = adata.obs["Group"].to_numpy()

# Left: full distance graph, middle: kNN graph, right: matching.
for ax, graph in zip(axs, (G, G_knn, G_matching)):
    # node_color must follow each graph's own node order.
    colors = [palette[group_of[n]] for n in graph.nodes()]
    nx.draw(graph, pos=pos, node_color=colors, edge_color=(0, 0, 0, 0.3), node_size=50, ax=ax)
plt.savefig("explanation.jpg")