# statdepth: An Interactive Guide

In this notebook we'll explore `statdepth`, a Python package for computing the statistical depth of univariate functional data, multivariate functional data, and pointcloud data for distributions in $\mathbb{R}^d$.

We'll begin by importing the libraries we need.

In [1]:
import numpy as np
import pandas as pd
from string import ascii_lowercase

from statdepth import FunctionalDepth, PointcloudDepth
from statdepth.testing import generate_noisy_pointcloud, generate_noisy_univariate

We'll now generate some random univariate functions with similar shape and some noise.

In [2]:
df = generate_noisy_univariate(data=[2, 3, 3.4, 4, 5, 3.1, 3, 3, 2] * 3, columns=list(ascii_lowercase[:20]))
df.head()

Unnamed: 0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t
0,1.121749,1.081742,1.851219,1.381189,1.412406,1.468049,1.055825,1.772375,1.645961,0.157278,0.246396,0.044118,0.211,1.598023,1.553491,1.927331,0.525068,1.524293,1.551847,1.162738
1,1.682624,1.622613,2.776829,2.071783,2.11861,2.202073,1.583738,2.658563,2.468942,0.235917,0.369593,0.066177,0.3165,2.397034,2.330237,2.890996,0.787602,2.28644,2.32777,1.744107
2,1.906974,1.838961,3.147072,2.348021,2.401091,2.495683,1.794903,3.013038,2.798135,0.267372,0.418872,0.075001,0.3587,2.716639,2.640935,3.276463,0.892615,2.591298,2.638139,1.976654
3,2.243498,2.163484,3.702438,2.762377,2.824813,2.936097,2.11165,3.544751,3.291923,0.314556,0.492791,0.088236,0.422001,3.196046,3.106982,3.854662,1.050136,3.048586,3.103693,2.325475
4,2.804373,2.704355,4.628048,3.452972,3.531016,3.670122,2.639563,4.430939,4.114904,0.393194,0.615989,0.110295,0.527501,3.995057,3.883728,4.818327,1.31267,3.810733,3.879617,2.906844


In [9]:
import numpy as np
df = generate_noisy_univariate(data=np.random.randint(0, 10, 10))
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,2.351862,1.484719,4.492325,8.095872,7.02061,2.568469,8.674169,5.276279,5.303086,3.426988,6.37649,3.381277,8.779412,5.981027,1.521278,3.483269,6.214455,6.649262,3.005984,4.107519
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.30659,0.824844,2.495736,4.497707,3.900339,1.426927,4.818983,2.931266,2.946159,1.903882,3.542495,1.878487,4.877451,3.322793,0.845154,1.935149,3.452475,3.694034,1.669991,2.281955
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.783954,0.494906,1.497442,2.698624,2.340203,0.856156,2.89139,1.75876,1.767695,1.142329,2.125497,1.127092,2.926471,1.993676,0.507093,1.16109,2.071485,2.216421,1.001995,1.369173
5,0.522636,0.329937,0.998294,1.799083,1.560136,0.570771,1.927593,1.172507,1.178464,0.761553,1.416998,0.751395,1.950981,1.329117,0.338062,0.77406,1.38099,1.477614,0.667996,0.912782
6,2.090544,1.31975,3.993177,7.196331,6.240543,2.283084,7.710372,4.690026,4.713855,3.046212,5.667992,3.005579,7.803922,5.316469,1.352247,3.096239,5.52396,5.910455,2.671985,3.651128
7,1.567908,0.989812,2.994883,5.397248,4.680407,1.712313,5.782779,3.51752,3.535391,2.284659,4.250994,2.254184,5.852942,3.987352,1.014185,2.322179,4.14297,4.432841,2.003989,2.738346
8,2.090544,1.31975,3.993177,7.196331,6.240543,2.283084,7.710372,4.690026,4.713855,3.046212,5.667992,3.005579,7.803922,5.316469,1.352247,3.096239,5.52396,5.910455,2.671985,3.651128
9,1.829226,1.154781,3.49403,6.296789,5.460475,1.997698,6.746576,4.103773,4.124623,2.665435,4.959493,2.629882,6.828432,4.65191,1.183216,2.709209,4.833465,5.171648,2.337987,3.194737


Now we'll use our library to calculate band depth (using standard containment on $\mathbb{R}^2$).

In [10]:
bd = FunctionalDepth([df], J=2, relax=False)
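
To make the idea of band depth concrete, here is a minimal NumPy sketch of the textbook definition with bands formed by pairs of curves. It is an illustration only, not `statdepth`'s implementation: the function name `band_depth_j2` is hypothetical, and the sketch counts every pair of curves (including pairs that contain the curve being scored), so its normalization may differ from the library's output.

```python
from itertools import combinations

import numpy as np


def band_depth_j2(curves):
    """Band depth with bands delimited by pairs of curves.

    curves: array of shape (n_timepoints, n_curves), one curve per column.
    Returns one depth value per curve: the fraction of pairs whose band
    (pointwise min/max envelope) contains the curve at every time index.
    """
    curves = np.asarray(curves, dtype=float)
    n_curves = curves.shape[1]
    pairs = list(combinations(range(n_curves), 2))
    depths = np.zeros(n_curves)
    for i, j in pairs:
        lower = np.minimum(curves[:, i], curves[:, j])
        upper = np.maximum(curves[:, i], curves[:, j])
        # A curve counts only if it stays inside the band at all time indices.
        inside = np.all(
            (curves >= lower[:, None]) & (curves <= upper[:, None]), axis=0
        )
        depths += inside
    return depths / len(pairs)


# Three flat curves at heights 0, 1 and 2: the middle one is the deepest.
example = np.array([[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]])
print(band_depth_j2(example))
```

Requiring containment at every time index corresponds to the strict setting; a relaxed variant would instead score the proportion of time indices at which the curve lies inside the band.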

First, we can look at the $n$ deepest and the $n$ most outlying curves.

In [11]:
bd.deepest(n=5)

2     0.473684
7     0.473684
19    0.463158
8     0.463158
13    0.442105
dtype: float64

In [12]:
bd.outlying(n=5)

0     0.178947
14    0.094737
6     0.094737
12    0.000000
1     0.000000
dtype: float64

But this is much more meaningful with visuals!

In [13]:
bd.plot_deepest(n=2)

We can also plot the most outlying functions.

In [7]:
bd.plot_outlying(n=3)

Or, supposing we've tuned our `FunctionalDepth` to our liking, we can return only the `n` deepest samples, which drops the most outlying ones.

In [8]:
bd.get_deep_data(n=10).head()

Unnamed: 0,b,m,i,r,o,g,k,s,j,d
0,1.224522,1.160155,1.126448,1.403982,1.470256,1.120264,1.643169,1.012147,1.663221,0.952024
1,1.836782,1.740232,1.689672,2.105973,2.205384,1.680396,2.464753,1.51822,2.494831,1.428036
2,2.081687,1.972263,1.914961,2.386769,2.499435,1.904448,2.793387,1.72065,2.827475,1.618441
3,2.449043,2.320309,2.252896,2.807964,2.940512,2.240527,3.286338,2.024294,3.326441,1.904048
4,3.061304,2.900386,2.81612,3.509955,3.67564,2.800659,4.107922,2.530367,4.158052,2.38006
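
Filtering by depth amounts to ranking the curves by their depth scores and keeping one end of the ranking. The pandas sketch below shows that idea under the assumption that curves are stored as DataFrame columns and depths as a Series indexed by column name; the helper names `keep_deepest` and `drop_most_outlying` are hypothetical and are not part of `statdepth`.

```python
import pandas as pd


def keep_deepest(df, depths, n):
    """Return the columns of df belonging to the n curves with largest depth."""
    deepest = depths.sort_values(ascending=False).index[:n]
    return df.loc[:, deepest]


def drop_most_outlying(df, depths, n):
    """Return df without the n curves that have the smallest depth."""
    shallowest = depths.sort_values(ascending=True).index[:n]
    return df.drop(columns=shallowest)


# Toy data: three curves sampled at two time indices, with made-up depths.
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
toy_depths = pd.Series({"a": 0.1, "b": 0.9, "c": 0.5})
print(keep_deepest(toy, toy_depths, n=2))
```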


We can calculate depth for multivariate data using simplex depth, which generalizes the idea of containment in 2 dimensions to functions $f: D \rightarrow \mathbb{R}^n$, where $D$ is a set of discrete time indices.

In [9]:
from statdepth.testing import generate_noisy_multivariate

data = generate_noisy_multivariate(columns=list('ABC'), num_curves=10)

In [10]:
data[2]

Unnamed: 0,A,B,C
0,0.44543,0.309411,0.254958
1,0.262054,0.2605,0.184004
2,0.304871,0.205094,0.235407
3,0.432567,0.476952,0.323592
4,0.00769,0.291786,0.014931
5,0.08906,0.453359,0.103003
6,0.433094,0.240143,0.360531
7,0.437885,0.247091,0.11933
8,0.035934,0.317859,0.383191
9,0.202494,0.091295,0.339578


In [11]:
bd = FunctionalDepth(data, containment='simplex', relax=True)

Again, we can look at our curves ordered by depth.

In [12]:
bd.ordered()

9    0.952381
4    0.952381
5    0.880952
2    0.880952
6    0.722222
3    0.722222
8    0.444444
7    0.444444
1    0.000000
0    0.000000
dtype: float64

Now, let's try calculating band depth for some pointcloud data. Maybe you've sampled $n$ points from some distribution in $\mathbb{R}^d$, and you'd like to understand which points are the most "central".

First, let's try this for some points sampled in $\mathbb{R}^2$.

In [13]:
df = generate_noisy_pointcloud(n=50, d=2)
bd = PointcloudDepth(df, K=7, containment='simplex')
df.head()

Unnamed: 0,0,1
0,0.799422,0.966509
1,0.365066,0.472376
2,0.427279,0.393058
3,0.370023,0.062601
4,0.55536,0.302281


We can look at the deepest points.

In [14]:
bd.deepest(n=5)

39    0.200510
42    0.190816
31    0.164286
1     0.150000
20    0.147959
dtype: float64

Again, we can plot our data. Here, the lighter the color the deeper (more central) the point.

In [15]:
bd.plot_depths(invert_colors=True)

We can also just plot the $n$ deepest points. 

In [16]:
bd.plot_deepest(n=5)

Or even the $n$ most outlying points, since it's often useful to know which data we should treat as outliers.

In [17]:
bd.plot_outlying(n=25)

But of course, if we're just defining depth using a certain measure of containment, there is no reason it shouldn't generalize to arbitrary dimensions. And indeed, this is the case. Let's take a look at some data in $\mathbb{R}^3$.

Notice that we're using sample depth: if we were to compute depth exactly, we'd be checking hundreds of thousands of simplices for each of our 50 datapoints, which becomes unwieldy fast.

However, it turns out that sample band depth is quite accurate for $K \ll n$, where $n$ is our number of datapoints, so the trade-off is definitely worth it.
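
To give a feel for why sampling helps, here is a minimal sketch that approximates simplex depth by testing only a random sample of simplices per point instead of enumerating all of them. It is a generic Monte Carlo approximation, not `statdepth`'s implementation: the names `in_simplex`, `sampled_simplex_depth`, and `n_samples` are hypothetical, and `n_samples` is not claimed to play the same role as the library's `K`.

```python
import numpy as np


def in_simplex(p, vertices, tol=1e-12):
    """True if p lies in the simplex spanned by d + 1 vertices in R^d.

    Solves for the barycentric coordinates of p relative to the first vertex;
    p is inside when they are all non-negative and sum to at most 1.
    """
    v0 = vertices[0]
    edges = (vertices[1:] - v0).T  # d x d matrix of edge vectors
    try:
        lam = np.linalg.solve(edges, p - v0)
    except np.linalg.LinAlgError:
        return False  # degenerate (flat) simplex
    return bool(np.all(lam >= -tol) and lam.sum() <= 1 + tol)


def sampled_simplex_depth(points, n_samples=200, seed=0):
    """Estimate simplex depth from n_samples random simplices per point."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    depths = np.zeros(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        hits = 0
        for _ in range(n_samples):
            # Draw d + 1 distinct other points to act as simplex vertices.
            idx = rng.choice(others, size=d + 1, replace=False)
            hits += in_simplex(points[i], points[idx])
        depths[i] = hits / n_samples
    return depths


# Usage on synthetic data: 50 uniform points in the unit cube.
cloud = np.random.default_rng(1).random((50, 3))
estimated = sampled_simplex_depth(cloud)
print(estimated.argsort()[::-1][:5])  # indices of the 5 deepest points
```

The cost here grows with `n_samples` rather than with the number of possible simplices, which is the trade-off the paragraph above describes.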

In [18]:
df = generate_noisy_pointcloud(n=50, d=3, columns=list('ABC'))
bd = PointcloudDepth(df, K=7, containment='simplex')
df.head()

Unnamed: 0,A,B,C
0,0.771743,0.149359,0.028044
1,0.158763,0.315276,0.152181
2,0.264411,0.640676,0.936701
3,0.750667,0.399114,0.397765
4,0.921378,0.217658,0.024406


In [19]:
bd.deepest(n=5)

12    0.065306
24    0.065306
28    0.032653
25    0.032653
43    0.024490
dtype: float64

Well, looking at the 5 deepest points is interesting, but it's a lot more meaningful visually.

In [20]:
bd.plot_depths(invert_colors=True)

Or, we could just plot the $n$ deepest points

In [21]:
bd.plot_deepest(n=3)

In [22]:
bd.deepest(n=3)

12    0.065306
24    0.065306
28    0.032653
dtype: float64

In [23]:
bd.get_deep_data(n=5)

Unnamed: 0,A,B,C
12,0.354914,0.509508,0.743035
24,0.484699,0.253426,0.639967
28,0.225032,0.492542,0.654891
25,0.718285,0.379978,0.588705
43,0.543705,0.363351,0.804977


The above uses simplex containment: to find the depth of a point in $\mathbb{R}^2$, we take all possible subsets of 3 other points, construct a triangle from each, and compute the proportion of triangles that contain our point.

We then do this for every other point whose depth we'd like to calculate.
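
The procedure just described can be sketched directly for points in the plane. This is an exhaustive, illustrative version written for clarity rather than speed; the function names `in_triangle` and `simplex_depth_2d` are hypothetical, and treating points on a triangle's edge as contained is an assumption of this sketch, not a statement about how `statdepth` handles ties.

```python
from itertools import combinations

import numpy as np


def in_triangle(p, a, b, c):
    """True if 2D point p lies inside or on the boundary of triangle abc."""

    def cross(o, u, v):
        # z-component of (u - o) x (v - o): which side of line o->u is v on.
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = min(d1, d2, d3) < 0
    has_pos = max(d1, d2, d3) > 0
    # p is outside exactly when it is on different sides of different edges.
    return not (has_neg and has_pos)


def simplex_depth_2d(points):
    """Fraction of triangles (built from the other points) containing each point."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    depths = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        triangles = list(combinations(others, 3))
        hits = sum(in_triangle(points[i], *points[list(t)]) for t in triangles)
        depths[i] = hits / len(triangles)
    return depths


# A triangle with one interior point: only the interior point has nonzero depth.
print(simplex_depth_2d([(0, 0), (4, 0), (0, 4), (1, 1)]))
```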

But this is not the only definition of depth. Let's use another below and see how it compares.

In [24]:
bd = PointcloudDepth(df, K=3, containment='l1')

In [25]:
bd.plot_deepest(n=3)

Notice that this gives different deepest points than simplex depth. 

In [26]:
bd.get_deep_data(n=20)

Unnamed: 0,A,B,C
25,0.718285,0.379978,0.588705
24,0.484699,0.253426,0.639967
12,0.354914,0.509508,0.743035
43,0.543705,0.363351,0.804977
37,0.55462,0.192657,0.563665
6,0.367289,0.541579,0.315475
3,0.750667,0.399114,0.397765
41,0.531385,0.246896,0.303379
28,0.225032,0.492542,0.654891
29,0.196697,0.325023,0.514655


In [27]:
bd.drop_outlying_data(n=25)

Unnamed: 0,A,B,C
1,0.158763,0.315276,0.152181
2,0.264411,0.640676,0.936701
3,0.750667,0.399114,0.397765
6,0.367289,0.541579,0.315475
7,0.333215,0.052844,0.630215
9,0.764077,0.8071,0.558157
12,0.354914,0.509508,0.743035
13,0.665708,0.113994,0.566883
17,0.394223,0.249405,0.968503
19,0.893753,0.768649,0.404552
