Skip to content
High-dimensional medians (medoid, geometric median, etc.). Fast implementations in Python.
Python Makefile
Branch: master
Clone or download
Latest commit 8e0e9e3 Oct 27, 2017
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
docs Update README Apr 6, 2017
hdmedians Update __init__.py Oct 26, 2017
.gitignore Change font. Feb 6, 2017
LICENSE Change licence to Apache 2.0 Oct 26, 2017
Makefile Use setuptools to Cythonize Feb 14, 2017
README.md Update README Apr 6, 2017
setup.cfg nose config. Feb 1, 2017
setup.py bump version Feb 21, 2017

README.md

Hdmedians

Did you know there is no unique way to mathematically extend the concept of a median to higher dimensions?

Various definitions for a high-dimensional median exist and this Python package provides a number of fast implementations of these definitions. Medians are extremely useful due to their high breakdown point (up to 50% contamination) and have a number of nice applications in machine learning, computer vision, and high-dimensional statistics.

This package currently has implementations of medoid and geometric median with support for missing data using NaN.

Installation

The latest version of the package is always available on pypi, so can be easily installed by typing:

pip3 install hdmedians

Medoid

Given a finite set of -dimensional observation vectors , the medoid of these observations is given by

The current implementation of medoid is in vectorized Python and can handle any data type supported by ndarray. If you would like the algorithm to take care of missing values encoded as nan then you can use the nanmedoid function.

Examples

Create an 6 x 10 array of random integer observations.

>>> import numpy as np
>>> X = np.random.randint(100, size=(6, 10))
array([[12,  9, 61, 76,  2, 17, 12, 11, 26,  0],
       [65, 72,  7, 64, 21, 92, 51, 48,  9, 65],
       [39,  7, 50, 56, 29, 79, 47, 45, 10, 52],
       [70, 12, 23, 97, 86, 14, 42, 90, 15, 16],
       [13,  7,  2, 47, 80, 53, 23, 59,  7, 15],
       [83,  2, 40, 12, 22, 75, 69, 61, 28, 53]])

Find the medoid, taking the last axis as the number of observations.

>>> import hdmedians as hd
>>> hd.medoid(X)
array([12, 51, 47, 42, 23, 69])

Take the first axis as the number of observations.

>>> hd.medoid(X, axis=0)
array([39,  7, 50, 56, 29, 79, 47, 45, 10, 52])

Since the medoid is one of the observations, the medoid function has the ability to only return the index if required.

>>> hd.medoid(X, indexonly=True)
6
>>> X[:,6]
array([12, 51, 47, 42, 23, 69])

Geometric Median

The geometric median is also known as the 1-median, spatial median, Euclidean minisum, or Torricelli point. Given a finite set of -dimensional observation vectors , the geometric median of these observations is given by

Note there is a subtle difference between the definition of the geometric median and the medoid: the search space for the solution differs and has the effect that the medoid returns one of the true observations whereas the geometric median can be described as a synthetic (not physically observed) observation.

The current implementation of geomedian uses Cython and can handle float64 or float32. If you would like the algorithm to take care of missing values encoded as nan then you can use the nangeomedian function.

Examples

Create an 6 x 10 array of random float64 observations.

>>> import numpy as np
>>> np.set_printoptions(precision=4, linewidth=200)
>>> X = np.random.normal(1, size=(6, 10))
array([[ 1.1079,  0.5763,  0.3072,  1.2205,  0.8596, -1.5082,  2.5955,  2.8251,  1.5908,  0.4575],
       [ 1.555 ,  1.7903,  1.213 ,  1.1285,  0.0461, -0.4929, -0.1158,  0.5879,  1.5807,  0.5828],
       [ 2.1583,  3.4429,  0.4166,  1.0192,  0.8308, -0.1468,  2.6329,  2.2239,  0.2168,  0.8783],
       [ 0.7382,  1.9453,  0.567 ,  0.6797,  1.1654, -0.1556,  0.9934,  0.1857,  1.369 ,  2.1855],
       [ 0.1727,  0.0835,  0.5416,  1.4416,  1.6921,  1.6636,  1.6421,  1.0687,  0.6075, -0.0301],
       [ 2.6654,  1.6741,  1.1568,  1.3092,  1.6944,  0.2574,  2.8604,  1.6102,  0.4301, -0.3876]])
>>> X.dtype
dtype('float64')

Find the geometric median, taking the last axis as the number of observations.

>>> import hdmedians as hd
>>> np.array(hd.geomedian(X))
array([ 1.0733,  0.8974,  1.1935,  0.9122,  0.9975,  1.3422])

Take the first axis as the number of observations.

>>> np.array(hd.geomedian(X, axis=0))
array([ 1.4581,  1.6377,  0.7147,  1.1257,  1.0493, -0.091 ,  1.7907,  1.4168,  0.9587,  0.6195])

Convert to float32 and compute the geometric median.

>>> X = X.astype(np.float32)
>>> m = hd.geomedian(X)

References

You can’t perform that action at this time.