# statdepth: An Interactive Guide

In this notebook we'll explore `statdepth`, a Python package for computing the statistical depth of univariate functional data, multivariate functional data, and pointcloud data for distributions in $\mathbb{R}^d$.

We'll begin by importing the libraries we need.

In [30]:
import numpy as np
import pandas as pd
from string import ascii_lowercase

from statdepth import FunctionalDepth, PointcloudDepth
from statdepth.testing import generate_noisy_pointcloud, generate_noisy_univariate

We'll now generate some random univariate functions with similar shape and some noise.

In [31]:
df = generate_noisy_univariate(data=[2,3,3.4,4,5,3.1,3,3,2]*3, columns=[f'f{i}' for i in range(20)], seed=42)
df.head()

Unnamed: 0,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19
0,0.74908,1.901429,1.463988,1.197317,0.312037,0.311989,0.116167,1.732352,1.20223,1.416145,0.041169,1.93982,1.664885,0.424678,0.36365,0.366809,0.608484,1.049513,0.86389,0.582458
1,1.12362,2.852143,2.195982,1.795975,0.468056,0.467984,0.174251,2.598528,1.803345,2.124218,0.061753,2.90973,2.497328,0.637017,0.545475,0.550214,0.912727,1.574269,1.295835,0.873687
2,1.273436,3.232429,2.488779,2.035439,0.530463,0.530381,0.197484,2.944999,2.043791,2.407447,0.069987,3.297693,2.830305,0.721953,0.618205,0.623575,1.034424,1.784172,1.468613,0.990179
3,1.49816,3.802857,2.927976,2.394634,0.624075,0.623978,0.232334,3.464705,2.40446,2.83229,0.082338,3.879639,3.329771,0.849356,0.7273,0.733618,1.216969,2.099026,1.72778,1.164917
4,1.872701,4.753572,3.65997,2.993292,0.780093,0.779973,0.290418,4.330881,3.005575,3.540363,0.102922,4.849549,4.162213,1.061696,0.909125,0.917023,1.521211,2.623782,2.159725,1.456146


Now we'll use our library to calculate band depth, using standard containment on $\mathbb{R}^2$.

In [32]:
bd = FunctionalDepth([df], J=2, relax=False, quiet=False)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 16.45it/s]
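To make the idea of band depth with `J=2` concrete, here is a minimal NumPy sketch. It assumes one common definition: the depth of a curve is the fraction of pairs of the other curves whose band (pointwise minimum to pointwise maximum) contains it at every time index. The function name `band_depth_j2` and the toy curves are hypothetical, and this is not necessarily how `statdepth` implements or normalizes the computation.

```python
from itertools import combinations

import numpy as np


def band_depth_j2(curves):
    """Band depth with bands built from pairs of curves (J=2).

    curves: array of shape (n_curves, n_timepoints).
    Returns one depth value in [0, 1] per curve.
    """
    curves = np.asarray(curves, dtype=float)
    n = curves.shape[0]
    depths = np.zeros(n)
    for i in range(n):
        others = [k for k in range(n) if k != i]
        pairs = list(combinations(others, 2))
        inside = 0
        for a, b in pairs:
            lower = np.minimum(curves[a], curves[b])
            upper = np.maximum(curves[a], curves[b])
            # The curve counts only if it stays inside the band everywhere.
            if np.all((lower <= curves[i]) & (curves[i] <= upper)):
                inside += 1
        depths[i] = inside / len(pairs)
    return depths


# Toy data: one base shape scaled by five factors, so the curves are nested.
base = np.array([2, 3, 3.4, 4, 5, 3.1, 3, 3, 2])
scales = np.array([0.2, 0.5, 1.0, 1.5, 1.8])
toy_curves = scales[:, None] * base
toy_depths = band_depth_j2(toy_curves)
```

With nested curves like these, the middle curve is the deepest and the two extreme curves have depth zero, because no pair of other curves can enclose them.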


First, we can look at the $n$ deepest and the $n$ most outlying curves.

In [33]:
bd.deepest(n=5)

f0     0.473684
f18    0.473684
f16    0.463158
f17    0.463158
f19    0.442105
dtype: float64

In [34]:
bd.outlying(n=5)

f7     0.178947
f6     0.094737
f1     0.094737
f11    0.000000
f10    0.000000
dtype: float64
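The rankings above look like ordinary pandas Series indexed by curve name, so the same kind of ranking can be reproduced with plain pandas. The sketch below copies the depth values printed above into a Series by hand; the variable names are hypothetical, and it makes no claim about how `deepest` and `outlying` work internally.

```python
import pandas as pd

# Depth values copied from the two outputs above.
depth_values = pd.Series({
    'f0': 0.473684, 'f18': 0.473684, 'f16': 0.463158, 'f17': 0.463158,
    'f19': 0.442105, 'f7': 0.178947, 'f6': 0.094737, 'f1': 0.094737,
    'f11': 0.0, 'f10': 0.0,
})

top_three = depth_values.nlargest(3)      # highest depth = most central
bottom_two = depth_values.nsmallest(2)    # lowest depth = most outlying
print(top_three)
print(bottom_two)
```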

But this is much more meaningful with visuals!

In [35]:
n=3
fig = bd.plot_deepest(n=n, return_plot=True, title=f'{n} Deepest Curves, Plotted in Red')
fig.update_layout(width=750, height=750)
fig.write_image('ex1_deepest.pdf')
fig.show()

In addition to writing the figure out to an image file, we can display it with Plotly's `.show()` method.

We can also plot the most outlying functions.

In [36]:
n=3
fig = bd.plot_outlying(n=n, return_plot=True, title=f'{n} Most Outlying Curves in Red')
fig.update_layout(width=750, height=750)
fig.write_image('ex1_outlying.pdf')
fig.show()

Or, supposing we've tuned our `FunctionalDepth` to our liking, we can return our data restricted to the `n` deepest curves, with the more outlying ones dropped.

In [37]:
bd.get_deep_data(n=10).head()

Unnamed: 0,f0,f18,f16,f17,f19,f3,f8,f13,f9,f15
0,0.74908,0.86389,0.608484,1.049513,0.582458,1.197317,1.20223,0.424678,1.416145,0.366809
1,1.12362,1.295835,0.912727,1.574269,0.873687,1.795975,1.803345,0.637017,2.124218,0.550214
2,1.273436,1.468613,1.034424,1.784172,0.990179,2.035439,2.043791,0.721953,2.407447,0.623575
3,1.49816,1.72778,1.216969,2.099026,1.164917,2.394634,2.40446,0.849356,2.83229,0.733618
4,1.872701,2.159725,1.521211,2.623782,1.456146,2.993292,3.005575,1.061696,3.540363,0.917023
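As a rough illustration of what "keeping the deepest curves" means, the pandas sketch below selects columns of a small DataFrame by depth. The three columns and their depths are copied from the outputs above; the variable names are hypothetical, and this is a sketch of the idea rather than the body of `get_deep_data`.

```python
import pandas as pd

# First three rows of three curves, copied from the table of generated data.
curves = pd.DataFrame({
    'f0': [0.74908, 1.12362, 1.273436],
    'f10': [0.041169, 0.061753, 0.069987],
    'f18': [0.86389, 1.295835, 1.468613],
})
# Depths of those curves, copied from the ranked outputs.
curve_depths = pd.Series({'f0': 0.473684, 'f10': 0.0, 'f18': 0.473684})

# Keep only the columns belonging to the two deepest curves.
keep = curve_depths.nlargest(2).index
deep_curves = curves[keep]
print(deep_curves)
```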


We can calculate depth for multivariate data using simplex depth, which generalizes the idea of containment in 2 dimensions to functions $f: D \rightarrow \mathbb{R}^n$, where $D$ is a set of discrete time indices.

In [13]:
from statdepth.testing import generate_noisy_multivariate

data = generate_noisy_multivariate(columns=list('ABC'), num_curves=10, seed=42)

In [14]:
data[2]

Unnamed: 0,A,B,C
0,0.024364,0.061845,0.047617
1,0.038944,0.010149,0.010148
2,0.003778,0.056346,0.039103
3,0.046061,0.001339,0.063094
4,0.054152,0.013813,0.011828
5,0.011931,0.019791,0.034136
6,0.028099,0.018945,0.039802
7,0.009074,0.019004,0.023832
8,0.029668,0.051077,0.012989
9,0.033452,0.038538,0.003022


In [15]:
bd = FunctionalDepth(data, containment='simplex', relax=True, quiet=False)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:27<00:00,  2.75s/it]


Again, we can look at our curves ordered by depth.

In [16]:
bd.ordered()

0    0.952381
9    0.952381
6    0.880952
8    0.880952
1    0.722222
5    0.722222
3    0.444444
7    0.444444
2    0.000000
4    0.000000
dtype: float64

Now, let's try calculating depth for some pointcloud data. Maybe you've sampled $n$ points from some distribution in $\mathbb{R}^d$, and you'd like to understand which points are the most "central".

First, let's try this for some points sampled in $\mathbb{R}^2$.

In [38]:
from statdepth import PointcloudDepth
from statdepth.testing import generate_noisy_pointcloud 

df = generate_noisy_pointcloud(n=50, d=2, seed=42)
bd = PointcloudDepth(df, K=7, containment='simplex', quiet=False)
df.head()

Unnamed: 0,0,1
0,0.37454,0.950714
1,0.731994,0.598658
2,0.156019,0.155995
3,0.058084,0.866176
4,0.601115,0.708073


We can look at deepest points

In [39]:
bd.deepest(n=5)

48    0.179592
32    0.159184
9     0.148980
23    0.140306
8     0.109694
dtype: float64

Again, we can plot our data. Here, the darker the color, the deeper (more central) the point.

In [42]:
fig = bd.plot_depths(invert_colors=True, return_plot=True, title='Pointcloud Depths, Deepest are Darkest')
fig.update_layout(width=750, height=750)
fig.write_image('ex2_colored.pdf')
fig.show()

We can also just plot the $n$ deepest points. 

In [41]:
fig = bd.plot_deepest(n=5, return_plot=True, title='5 Deepest Points Plotted in Red')
fig.update_layout(width=750, height=750)
fig.write_image('ex2_deepest.pdf')
fig.show()

Or even the $n$ most outlying points, since it's often useful to know which data we should consider outliers.

In [11]:
n=10
fig = bd.plot_outlying(n=n, title=f'{n} Most Outlying Points Plotted in Red', return_plot=True)
fig.update_layout(width=750, height=750)
fig.write_image('ex2_outlying.pdf')
fig.show()

But of course, if we're just defining depth using a certain measure of containment, there is no reason it shouldn't generalize to arbitrary dimensions. And indeed, this is the case. Let's take a look at some data in $\mathbb{R}^3$.

Notice that we're using sample depth: if we were to compute depth exactly, we'd be checking about 500k simplices for each of our 50 datapoints, which can become unwieldy fast.

However, it turns out that sample depth is quite accurate for $K \ll n$, where $n$ is our number of datapoints, so the trade-off is worth it.
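To illustrate both points (containment generalizes to any dimension, and sampling avoids enumerating every simplex), here is a small NumPy sketch. It tests containment in a $d$-dimensional simplex via barycentric coordinates and estimates depth from randomly drawn simplices. This is a generic Monte Carlo approach with hypothetical function names; it is not necessarily the sampling scheme that the `K` parameter controls in `statdepth`.

```python
import numpy as np


def in_simplex(point, vertices, tol=1e-12):
    """Check whether `point` lies in the simplex spanned by d+1 `vertices` in R^d."""
    vertices = np.asarray(vertices, dtype=float)
    origin = vertices[0]
    edges = (vertices[1:] - origin).T  # d x d matrix of edge vectors
    try:
        weights = np.linalg.solve(edges, np.asarray(point, dtype=float) - origin)
    except np.linalg.LinAlgError:
        return False  # degenerate (flat) simplex
    # Inside iff all barycentric weights are non-negative and sum to at most 1.
    return bool(np.all(weights >= -tol) and weights.sum() <= 1 + tol)


def sampled_simplicial_depth(point, cloud, n_samples=2000, seed=0):
    """Estimate depth as the fraction of random simplices containing `point`."""
    rng = np.random.default_rng(seed)
    cloud = np.asarray(cloud, dtype=float)
    n, d = cloud.shape
    hits = 0
    for _ in range(n_samples):
        idx = rng.choice(n, size=d + 1, replace=False)
        hits += in_simplex(point, cloud[idx])
    return hits / n_samples


# Toy cloud: 50 uniform points in the unit cube.
cloud = np.random.default_rng(42).random((50, 3))
central = sampled_simplicial_depth(np.full(3, 0.5), cloud)
outlier = sampled_simplicial_depth(np.full(3, 5.0), cloud)
```

A point far outside the cube is never contained in any sampled tetrahedron, so its estimate is zero, while the centre of the cube is contained in a noticeable fraction of them.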

In [7]:
df = generate_noisy_pointcloud(n=50, d=3, columns=list('ABC'), seed=42)
bd = PointcloudDepth(df, K=7, containment='simplex')
df.head()

Unnamed: 0,A,B,C
0,0.544143,0.500031,0.99428
1,0.422571,0.184612,0.468136
2,0.24618,0.962745,0.681754
3,0.287997,0.489943,0.784931
4,0.022005,0.11615,0.44796


In [8]:
bd.deepest(n=5)

16    0.053061
9     0.048980
17    0.040816
33    0.036735
7     0.032653
dtype: float64

Well, looking at the 5 deepest points is interesting, but it's a lot more meaningful visually.

In [10]:
fig = bd.plot_depths(invert_colors=True, return_plot=True, title='Pointcloud Depths, Deepest are Darkest')
fig.update_layout(width=750, height=750)
fig.write_image('ex3_colored.pdf')
fig.show()

Or, we could just plot the $n$ deepest points, followed by the $n$ most outlying ones.

In [43]:
n=5
fig = bd.plot_deepest(n=n, title=f'{n} Deepest Points Plotted in Red', return_plot=True)
fig.update_layout(width=750, height=750)
fig.write_image('ex3_deepest.pdf')
fig.show()

fig = bd.plot_outlying(n=n, title=f'{n} Outlying Points Plotted in Red', return_plot=True)
fig.update_layout(width=750, height=750)
fig.write_image('ex3_outlying.pdf')
fig.show()

In [14]:
bd.deepest(n=3)

16    0.053061
9     0.048980
17    0.040816
dtype: float64

In [15]:
bd.get_deep_data(n=5)

Unnamed: 0,A,B,C
16,0.424581,0.622321,0.429581
9,0.437138,0.7112,0.361582
17,0.524423,0.443179,0.577518
33,0.431662,0.592148,0.404195
7,0.78902,0.4767,0.526527


The above uses simplex containment: to find the depth of a point in $\mathbb{R}^2$, we take every possible subset of 3 other points, construct the triangle they span, and compute the proportion of those triangles that contain our point.

We then do this for every other point whose depth we'd like to calculate.
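The procedure just described can be sketched in a few lines for points in the plane. The code below enumerates every triangle formed by the other points and uses a sign test on cross products to check containment. The function names and the five toy points are hypothetical, and the sketch computes exact depth rather than the sampled version used earlier in this notebook.

```python
from itertools import combinations

import numpy as np


def in_triangle(p, a, b, c):
    """True if p lies inside or on the boundary of triangle abc."""
    def cross(o, u, v):
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # p is inside when it is on the same side of all three edges.
    return not (has_neg and has_pos)


def simplicial_depth_2d(points):
    """Fraction of triangles (built from the other points) containing each point."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    depths = np.zeros(n)
    for i in range(n):
        others = [k for k in range(n) if k != i]
        triangles = list(combinations(others, 3))
        hits = sum(in_triangle(points[i], *points[list(t)]) for t in triangles)
        depths[i] = hits / len(triangles)
    return depths


# Four corners of the unit square plus one interior point.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.4]])
square_depths = simplicial_depth_2d(square)
```

In this toy example the interior point lies in two of the four corner triangles, giving it depth 0.5, while each corner has depth 0 because no triangle of other points can contain a vertex of the convex hull.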

But this is not the only definition of depth, so let's use another one below and see how it compares.

In [24]:
bd = PointcloudDepth(df, K=3, containment='l1')

In [25]:
fig = bd.plot_deepest(n=n, title=f'{n} Deepest Points Plotted in Red, L1 Depth', return_plot=True)
fig.update_layout(width=750, height=750)
fig.show()

Notice that this gives different deepest points than simplex depth. 
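For intuition about why a different depth notion can rank points differently, here is a sketch of one common definition of $L^1$ (spatial) depth: one minus the norm of the average unit vector pointing from a point to all the others. Whether the `'l1'` option in `statdepth` computes exactly this quantity is an assumption, and the function name `spatial_depth` is hypothetical.

```python
import numpy as np


def spatial_depth(points):
    """L1 (spatial) depth: 1 - || mean unit vector from each point to the others ||."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    depths = np.empty(n)
    for i in range(n):
        diffs = np.delete(points, i, axis=0) - points[i]
        norms = np.linalg.norm(diffs, axis=1)
        nonzero = norms > 0  # skip exact duplicates of the point
        units = diffs[nonzero] / norms[nonzero, None]
        depths[i] = 1.0 - np.linalg.norm(units.mean(axis=0))
    return depths


# Four corners of the unit square plus its centre.
toy_points = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.5]])
toy_l1_depths = spatial_depth(toy_points)
```

For the centre, the unit vectors to the four corners cancel, so its depth is 1. For a corner, all unit vectors point into the square, so their average is long and the depth is small. Unlike simplex depth, this measure depends on directions rather than on containment, which is one reason the two can disagree.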

In [16]:
bd.get_deep_data(n=20)

Unnamed: 0,A,B,C
16,0.424581,0.622321,0.429581
9,0.437138,0.7112,0.361582
17,0.524423,0.443179,0.577518
33,0.431662,0.592148,0.404195
7,0.78902,0.4767,0.526527
8,0.514519,0.789053,0.511894
48,0.635031,0.753455,0.334445
25,0.653916,0.614086,0.398891
3,0.287997,0.489943,0.784931
28,0.872023,0.460053,0.417494


In [17]:
bd.drop_outlying_data(n=25)

Unnamed: 0,A,B,C
1,0.422571,0.184612,0.468136
3,0.287997,0.489943,0.784931
7,0.78902,0.4767,0.526527
8,0.514519,0.789053,0.511894
9,0.437138,0.7112,0.361582
16,0.424581,0.622321,0.429581
17,0.524423,0.443179,0.577518
25,0.653916,0.614086,0.398891
26,0.366053,0.859981,0.631905
27,0.57311,0.238784,0.203376


## Rat data example

In [47]:
import pandas as pd 
df_a = pd.read_excel('rat-trial-a.xlsx', sheet_name='Body weight').drop(['Trial_ID', 'Animal_ID'], axis=1).T
df_b = pd.read_excel('rat-trial-b.xlsx', sheet_name='Body weight').drop(['Trial_ID', 'Animal_ID'], axis=1).T
df_a

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,150,151,152,153,154,155,156,157,158,159
Weight_Start,146.4,153.85,152.26,145.7,140.17,164.22,150.71,158.05,151.67,155.53,...,138.21,128.31,135.73,139.04,127.48,138.71,137.66,134.64,142.31,133.3
Weight_Week_1,188.82,198.72,202.68,192.86,176.31,217.36,208.47,202.84,203.19,220.21,...,158.77,143.28,165.63,160.99,147.42,151.0,153.68,157.38,164.83,162.89
Weight_Week_2,231.13,242.23,235.57,233.55,210.79,256.71,260.79,255.33,241.13,263.07,...,172.07,160.16,190.76,188.65,165.92,168.78,177.86,180.99,186.37,195.11
Weight_Week_3,273.35,283.23,271.58,284.89,245.62,308.45,312.91,302.27,279.51,299.4,...,188.57,173.64,210.01,197.2,176.7,192.33,188.83,200.18,206.78,215.6
Weight_Week_4,300.72,310.19,290.65,315.04,264.58,345.24,350.01,330.07,302.49,334.9,...,196.98,188.81,228.41,213.92,191.95,197.2,191.36,208.66,216.41,227.93
Weight_Week_5,329.61,338.59,308.29,343.03,282.42,381.67,390.39,361.54,323.3,358.59,...,207.56,206.04,240.18,227.28,194.78,206.31,200.89,219.75,223.66,242.71
Weight_Week_6,342.34,353.64,319.63,365.44,300.09,406.74,416.73,378.97,342.15,381.84,...,214.29,222.01,253.85,235.16,204.68,214.59,209.05,232.51,233.98,244.48
Weight_Week_7,357.31,374.55,333.47,384.2,312.76,427.59,436.35,398.43,361.6,402.37,...,219.92,233.91,262.29,245.37,204.86,225.0,217.16,238.26,243.68,252.04
Weight_Week_8,373.85,387.79,347.71,396.33,321.68,449.7,455.57,417.14,376.46,413.82,...,225.34,240.58,271.68,250.39,207.58,231.16,223.24,240.75,250.59,262.33
Weight_Week_9,385.66,399.3,350.44,406.13,334.04,466.11,472.06,427.34,381.42,422.72,...,228.81,250.53,276.45,259.64,212.94,236.86,233.21,247.36,261.87,261.79
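The `.T` in the loading cell matters: judging from the table above, the transposed frame has one row per time point and one column per animal, which matches the column-per-curve layout of the generated data earlier in this notebook. The sketch below reproduces that reshaping on a tiny hand-built frame. The weights are copied from the first two animals above, while the `Trial_ID` and `Animal_ID` values are hypothetical placeholders.

```python
import pandas as pd

# One row per animal, as the spreadsheet presumably stores it.
raw = pd.DataFrame({
    'Trial_ID': ['a', 'a'],   # placeholder values
    'Animal_ID': [1, 2],      # placeholder values
    'Weight_Start': [146.4, 153.85],
    'Weight_Week_1': [188.82, 198.72],
})

# Drop the identifier columns, then transpose so each column is one growth curve.
weight_curves = raw.drop(['Trial_ID', 'Animal_ID'], axis=1).T
print(weight_curves)
```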


In [49]:
a_depths = FunctionalDepth([df_a], quiet=False, relax=True, K=10)
b_depths = FunctionalDepth([df_b], quiet=False, relax=True, K=10)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:51<00:00,  3.08it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:53<00:00,  3.00it/s]


Now that we've calculated the depths for the rat curves, we can visualize the results, plotting the 3 deepest curves in red.

In [57]:
n=3
fig = a_depths.plot_deepest(n=n, title=f'{n} Deepest Rat Curves, Group A', return_plot=True, yaxis_title='Weight (g)')
fig.update_layout(width=750, height=750)
fig.write_image('rat_a_deepest.pdf')
fig.show()

fig = a_depths.plot_outlying(n=n, title=f'{n} Outlying Rat Curves, Group A', return_plot=True, yaxis_title='Weight (g)')
fig.update_layout(width=750, height=750)
fig.write_image('rat_a_outlying.pdf')

fig.show()

We can do the same for experimental group B.

In [58]:
fig = b_depths.plot_deepest(n=n, title=f'{n} Deepest Rat Curves, Group B', return_plot=True, yaxis_title='Weight (g)')
fig.update_layout(width=750, height=750)
fig.write_image('rat_b_deepest.pdf')
fig.show()

fig = b_depths.plot_outlying(n=n, title=f'{n} Outlying Rat Curves, Group B', return_plot=True, yaxis_title='Weight (g)')
fig.update_layout(width=750, height=750)
fig.write_image('rat_b_outlying.pdf')

fig.show()

However, a visual representation may not always be enough. For this reason, we can run a homogeneity test between the two trials to check whether they are distributionally similar, an assumption we hope holds true.

In [None]:
from statdepth.homogeneity import FunctionalHomogeneity

hom = FunctionalHomogeneity([df_a], [df_b], K=10, J=2, relax=True, method='p2')

 78%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                | 125/160 [00:39<00:11,  3.04it/s]

In [None]:
hom