Skip to content

Latest commit

 

History

History
80 lines (66 loc) · 2.76 KB

README.md

File metadata and controls

80 lines (66 loc) · 2.76 KB

TreeCat

A Bayesian latent tree model of multivariate multinomial data.

Intended Use

You might want to use TreeCat if: you have medium-sized tabular data with categorical and ordinal values, possibly with missing observations.

TreeCat supports
Feature Types categorical, ordinal, binomial, multinomial
# Rows (n) 1000-100K
# Features (p) 10-1000
# Cells (n × p) <1M
# Categories 2-10ish
# Max Ordinal 10ish
Missing obervations? yes
Repeated observations? yes
Sparse data? there are better methods
Unsupervised yes
Semisupervised yes
Supervised there are better methods

The Model

Let V be a set of vertices (one vertex per feature).
Let C[v] be the dimension of the vth feature.
Let N be the number of datapoints.
Let K[n,v] be the number of observations of feature v in row n (e.g. 1 for a categorical variable, 0 for missing data, or k for an ordinal value with minimum 0 and maximum k).

TreeCat is the following generative model:

E ~ UniformSpanningTree(V)    # An undirected tree.
for v in V:
    Pv[v] ~ Dirichlet(size = [M], alpha = 1/2)
for (u,v) in E:
    Pe[u,v] ~ Dirichlet(size = [M,M], alpha = 1/(2M))
    assume Pv[u] == sum(Pe[u,v], axis=1)
    assume Pv[v] == sum(Pe[u,v], axis=0)
for v in V:
    for i in 1:M:
        Q[v,i] ~ Dirichlet(size = [C[v]])
for n in 1:N:
    for v in V:
        X[n,v] ~ Categorical(Pv[v])
    for (u,v) in E:
        (X[n,u],X[n,v]) ~ Categorical(Pe[u,v])
    for v in V:
        Z[n,v] ~ Multinomial(Q[v,X[n,v]], count = K[n,v])

The Inference Algorithm

This package implements fully Bayesian MCMC inference using subsample-annealed Gibbs sampling. There are two pieces of latent state that are sampled:

  • Latent classes for each row for each vertex. These are sampled by single-site Gibbs sampling with a linear subsample-annealing schedule.

  • The latent tree structure is sampled by randomly removing an edge and replacing it. Since removing an edge splits the graph into two connected components, the only replacement locations that are feasible are those that re-connect the graph.

The single-site Gibbs sampler uses dynamic programming to simultaneously sample the complete latent assignment vector for each row. A dynamic programming schedule is recomputed each time the tree structure changes. This schedule is interpreted by various virtual machines for different purposes (training the model, sampling from the posterior, computing log probability of the posterior). The virtual machine for training is jit-compiled using numba.

License

Copyright (c) 2017 Fritz Obermeyer.
TreeCat is licensed under the Apache 2.0 License.