Skip to content
/ kde Public

Python library for kernel density estimation, including interface for WESTPA data

Notifications You must be signed in to change notification settings

ajd98/kde

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

kde

Overview

This repository provides a Python library for kernel density estimation. In comparison to other Python implementations of kernel density estimation, key features of this library include:

  1. Support for weighted samples
  2. A variety of kernels, including a smooth, compact kernel.
  3. Interface for kernel density estimation from WESTPA data sets (https://westpa.github.io/westpa/).

Basics

Kernel density estimation is a technique for estimation of a probability density function based on empirical data. Suppose we have some observations xᵢ ∈ V where i = 1, ..., n and V is some feature space, typically ℝᵈ. Given a metric 𝒹: V × V → ℝ⁺∪{0}, a kernel function K: ℝ → ℝ⁺∪{0} with ∀ x ∈ V, ∫ᵥK(𝒹(x,y))dy = 1, and a bandwidth h ∈ ℝ⁺, the kernel density estimate p: V → ℝ⁺∪{0} is defined as:

p(x) := 1/(hn) ΣᵢK(𝒹(x,xᵢ)/h)

This library simplifies calculation by including only a set of metrics 𝒹 that may be expressed as 𝒹(x,xᵢ) = q(x-xᵢ) for some norm q:V → ℝ⁺∪{0}:

p(x) := 1/(hn) ΣᵢK(q(x-xᵢ)/h)

Similarly, a weighted version of the kernel density estimate may be defined as:

p(x) := 1/h ΣᵢwᵢK(q(x-xᵢ)/h)

where wᵢ is the weight of the ith sample, and Σᵢwᵢ=1.

This package includes the following kernel functions:

kernel equation
bump p(x) ∝ 1Aexp(1/(x²-1))
cosine p(x) ∝ 1Acos(πx/2)
epanechnikov p(x) ∝ 1A(1-x²)
gaussian p(x) ∝ exp(-x²/2)
logistic p(x) ∝ 1/(exp(-x)+2+exp(x))
quartic p(x) ∝ 1A(1-x²)²
tophat p(x) ∝ 1A
triangle p(x) ∝ 1A(1-‖x‖)
tricube p(x) ∝ 1A(1-‖x‖³)³

In the above definitions, 1A is the indicator function and A = {x: ‖x‖ < 1}.

For more information on the mathematical theory of kernel density estimation, see the following references:

  1. Rosenblatt, Murray. Remarks on Some Nonparametric Estimates of a Density Function. Ann. Math. Statist. 27 (1956), no. 3, 832–837. doi:10.1214/aoms/1177728190.
  2. Parzen, Emanuel. On Estimation of a Probability Density Function and Mode. Ann. Math. Statist. 33 (1962), no. 3, 1065–1076. doi:10.1214/aoms/1177704472.

Installation

This library requires Numpy, Scipy, and Cython. In addition, gcc is required for compilation. To install, run make from the directory in which this README file is found.

CUDA installation requires the CUDA toolkit and has been tested with CUDA version 8.0.44 on GTX 1080 cards. To install the CUDA backend, run make cuda from the directory in which this README file is found. The CUDA backend may be used by passing the cuda=True keyword argument to the evaluate method of the KDE class.

Use

Kernel density estimation with arbitrary data

Before using this library, you will need to make sure that it may be imported by Python. To do so, add the top-level directory of this git repository (the directory containing this README file) to your PYTHONPATH environment variable. If this does not work, you may also add the following commands to the top of your Python script:

import sys
sys.path.append("path to this git repository")

Then, import the kde module via Python.

Kernel density estimation is performed via the KDE class, accessible as kde.KDE.

class kde.KDE(training_points, kernel='gaussian', weights=None, metric='euclidean_distance', bw=1)

Parameters:

Parameter Data type Description
training_points numpy.ndarray The values of the samples in ℝⁿ or (S¹)ⁿ = S¹×S¹×...×S¹
kernel string The kernel. Options are:
"bump"
"cosine"
"epanechnikov"
"gaussian"
"logistic"
"quartic"
"tophat"
"triangle"
"tricube"
See above for kernel definitions.
weights numpy.ndarray or None The weights of the samples. If None, the samples are uniformly weighted.
metric string The norm from which to induce the metric for distance between points. Options are 'euclidean_distance' and 'euclidean_distance_ntorus'. 'euclidean_distance_ntorus' assumes the sample space is an n-torus (S¹×S¹×...×S¹) where each dimension runs between -180 and 180, and the distance is the minimum euclidean distance to a periodic image.
bw float The bandwidth of the kernel

Methods:

Method Description
set_kernel_type(kernel) Set the kernel to kernel. See above for options.
evaluate(p, cuda=False) Evaluate the kernel density estimate at each position of p, an m-by-k numpy array, where m is the number of samples and k is the number of features. If cuda=True, use the CUDA backend (requires compilation with the cuda option; see Installation above).

Kernel density estimation with WESTPA data

This library provides classes for interacting with WESTPA data sets, enabling kernel density estimation from WESTPA data via Python scripts and via the command line.

From within a Python script, import the kde module, which provides the kde.WKDE class for interacting with WESTPA data sets. The WKDE class should be initialized as:

kde.WKDE(westh5, first_iter=None, last_iter=None, load_func=None, bw=1)
Parameter Data type Description
westh5 h5py HDF5 File object The WESTPA data file (typically named 'west.h5')
first_iter int or None The first weighted ensemble iteration from which to use data. If None, start at iteration 1.
last_iter int or None The last weighted ensemble iteration from which to use data (inclusive).
load_func Python function or None Load data using the specified Python function. The function will be called as load_func(iter_group, niter) where iter_group is the HDF5 group corresponding to a weighted ensemble iteration, and niter is an integer denoting the index of the weighted ensemble iteration. The function should return a numpy array of shape (nsegs, ntimepoints, ndim) where nsegs is the number of segments in that iteration, ntimepoints is the number of sub-iteration timepoints, and ndim is the number of dimensions of the coordinate. If None (default), load the progress coordinate data.
bw float The bandwidth to use for the kernel. See the bw parameter of the kde.KDE class for more information.

Following initialization, call the evaluate method as <WKDE class instance>.evaluate(points) to evaluate the kernel density estimate at each point in points. A gaussian kernel is set automatically; to use another kernel, use the set_kernel_type method (see documentation for kde.KDE) followed by the evaluate method.

To interact with WESTPA data from the command line, run python -m kde.w_kde; include the -h or --help flag for more information.

About

Python library for kernel density estimation, including interface for WESTPA data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages