# Outlier detection based on persistent homology

by Matthew J. Graham (California Institute of Technology/National Optical Astronomy Observatory)

(c) 2017

<b><i>Version: 0.1</i></b>

<i>An outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism</i> (Hawkins 1980).


## Introduction

Characterizing the distributions of features tells us about the behavior of a population. The global <i>shape</i> of the data in the $n$-dimensional parameter space can also provide information about the phenomena the data represent. Planar projections, however, fail to capture features of point clouds when the space is too high-dimensional or too twisted.

<i>Persistent homology</i> is a technique from topological data analysis (TDA) that identifies the topological features (components, holes, graph structure) which persist over a wide range of length scales. Such features are more likely to represent true features of the underlying space than artifacts of sampling, noise, or parameter choice. They are also stable under small perturbations of the data.

From the perspective of TDA, outliers can be regarded as points with minimal connectivity within a data space over a substantial range of scales. Some simpler spatial constructs can serve as proxies for the more complex machinery of persistent homology. The goal of this hack is to explore how well these work with astronomical feature spaces and to see what types of outliers they identify relative to more "traditional" outlier detection techniques, e.g., tails of distributions.

## Methodology

The <i>minimal spanning tree (MST)</i> of a data set is a construct (unique when all pairwise distances are distinct) that encodes all the relevant information about connected components at all resolutions. It is invariant under rotations, uniform scalings, and translations. <i>Pruning</i> is equivalent to compact partitioning and produces clusters that are precisely the persistent components. Algebraic topological measures, e.g., Betti numbers, can be derived from the MST, and the asymptotic behavior of the MST is well understood (mean-field limit approach).
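As a rough illustration of how the MST encodes components at every scale, here is a minimal standard-library sketch. It assumes Euclidean distances and reads "pruning" as dropping every tree edge longer than a chosen threshold; both are assumptions, as are the names `kruskal_mst`, `prune`, `max_length`, and the toy `points`. For real catalogs a library routine such as `scipy.sparse.csgraph.minimum_spanning_tree` would be the more practical choice.

```python
import math
from itertools import combinations


def find(parent, i):
    """Return the root of i's component, compressing the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def kruskal_mst(points):
    """MST of the complete Euclidean graph as a list of (length, i, j) edges."""
    parent = list(range(len(points)))
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for length, i, j in candidates:
        root_i, root_j = find(parent, i), find(parent, j)
        if root_i != root_j:  # edge joins two separate components: keep it
            parent[root_i] = root_j
            tree.append((length, i, j))
    return tree


def prune(n_points, tree, max_length):
    """Components left after dropping MST edges longer than max_length."""
    parent = list(range(n_points))
    for length, i, j in tree:
        if length <= max_length:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(n_points):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)


# Toy data (hypothetical): two tight clumps and one isolated point.
points = [
    (0.0, 0.0), (1.0, 0.1), (0.2, 1.1), (1.3, 1.2),
    (10.0, 10.0), (11.1, 10.2), (10.3, 11.4),
    (30.0, 0.0),
]

tree = kruskal_mst(points)
# Each MST edge length is a scale at which two components merge.
merge_scales = [length for length, _, _ in tree]
components = prune(len(points), tree, max_length=5.0)
print(components)
```

Because Kruskal's algorithm adds edges in order of increasing length, `merge_scales` lists the scales at which connected components merge, so a large gap in that list marks components that persist over a wide range of scales.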

The MST identifies not just low-density regions but also areas of low connectivity. In addition to outliers (marginally connected points, identified by their node degree), <i>hub</i> nodes may also be of interest since they bridge clusters. This approach is non-parametric.
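The node-degree idea can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: distances are Euclidean, an "outlier" is taken to be a leaf (degree 1) whose single edge is longer than `factor` times the median MST edge length, and a "hub" is taken to be a node of maximum degree. The function `prim_mst`, the `factor` value, and the toy `points` are all hypothetical choices, not part of the method described above.

```python
import math
from statistics import median


def prim_mst(points):
    """MST of the complete Euclidean graph as a list of (i, j, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known edge from the tree to each node
    link = [-1] * n        # tree node at the other end of that edge
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Attach the node that is currently closest to the growing tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if link[u] >= 0:
            edges.append((link[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], link[v] = d, u
    return edges


# Toy data (hypothetical): two tight clumps and one isolated point.
points = [
    (0.0, 0.0), (1.0, 0.1), (0.2, 1.1), (1.3, 1.2),
    (10.0, 10.0), (11.1, 10.2), (10.3, 11.4),
    (30.0, 0.0),
]

edges = prim_mst(points)

degree = [0] * len(points)
longest = [0.0] * len(points)  # longest MST edge touching each node
for i, j, length in edges:
    for node in (i, j):
        degree[node] += 1
        longest[node] = max(longest[node], length)

factor = 3.0  # assumed cut; tune for the data at hand
typical = median(length for _, _, length in edges)

# Leaves hanging off the tree by an unusually long edge.
outliers = [i for i in range(len(points))
            if degree[i] == 1 and longest[i] > factor * typical]
# Nodes with the most tree neighbors.
hubs = [i for i in range(len(points)) if degree[i] == max(degree)]

print("outliers:", outliers)
print("hubs:", hubs)
```

On this toy set the isolated point at (30, 0) is the only flagged leaf, and the node through which one clump connects to the other has the highest degree. No density model or distribution is fitted at any step; the only tunable quantity is the assumed `factor`.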


