Skip to content
/ treecat Public
forked from posterior/treecat

A Bayesian latent tree model of multivariate multinomial data

License

Notifications You must be signed in to change notification settings

fritzo/treecat

 
 

Repository files navigation

TreeCat

A Bayesian latent tree model of multivariate multinomial data.

Intended Use

You might want to use TreeCat if: you have medium-sized tabular data with categorical and ordinal values, possibly with missing observations.

TreeCat supports
Feature Types categorical, ordinal, binomial, multinomial
# Rows (n) 1000-100K
# Features (p) 10-1000
# Cells (n × p) <1M
# Categories 2-10ish
# Max Ordinal 10ish
Missing obervations? yes
Repeated observations? yes
Sparse data? there are better methods
Unsupervised yes
Semisupervised yes
Supervised there are better methods

The Model

Let V be a set of vertices (one vertex per feature).
Let C[v] be the dimension of the vth feature.
Let N be the number of datapoints.
Let K[n,v] be the number of observations of feature v in row n (e.g. 1 for a categorical variable, 0 for missing data, or k for an ordinal value with minimum 0 and maximum k).

TreeCat is the following generative model:

E ~ UniformSpanningTree(V)    # An undirected tree.
for v in V:
    Pv[v] ~ Dirichlet(size = [M], alpha = 1/2)
for (u,v) in E:
    Pe[u,v] ~ Dirichlet(size = [M,M], alpha = 1/(2M))
    assume Pv[u] == sum(Pe[u,v], axis=1)
    assume Pv[v] == sum(Pe[u,v], axis=0)
for v in V:
    for i in 1:M:
        Q[v,i] ~ Dirichlet(size = [C[v]])
for n in 1:N:
    for v in V:
        X[n,v] ~ Categorical(Pv[v])
    for (u,v) in E:
        (X[n,u],X[n,v]) ~ Categorical(Pe[u,v])
    for v in V:
        Z[n,v] ~ Multinomial(Q[v,X[n,v]], count = K[n,v])

The Inference Algorithm

This package implements fully Bayesian MCMC inference using subsample-annealed Gibbs sampling. There are two pieces of latent state that are sampled:

  • Latent classes for each row for each vertex. These are sampled by single-site Gibbs sampling with a linear subsample-annealing schedule.

  • The latent tree structure is sampled by randomly removing an edge and replacing it. Since removing an edge splits the graph into two connected components, the only replacement locations that are feasible are those that re-connect the graph.

The single-site Gibbs sampler uses dynamic programming to simultaneously sample the complete latent assignment vector for each row. A dynamic programming schedule is recomputed each time the tree structure changes. This schedule is interpreted by various virtual machines for different purposes (training the model, sampling from the posterior, computing log probability of the posterior). The virtual machine for training is jit-compiled using numba.

License

Copyright (c) 2017 Fritz Obermeyer.
TreeCat is licensed under the Apache 2.0 License.

About

A Bayesian latent tree model of multivariate multinomial data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 99.7%
  • Makefile 0.3%