`dbstan` is glue code for mapping `rstan::stanfit` objects to relational database schemas. It leverages the `DBI` package to provide a simple, DBMS-agnostic interface for:

- `INSERT`ing a representation of a `stanfit` object into a DB.
- Retrieving a particular `stanfit` object's information from the DB.
- Materializing a `dbplyr`-based list of tables which contain all `stanfit` information.

This makes it easy to move (batches of) sampler results into a database, which makes it more convenient to run analyses and share data.
The package maps a `stanfit` object into records stored in nine SQL tables:

- `stanfit.run_ids`
- `stanfit.run_info`
- `stanfit.model_pars`
- `stanfit.stanmodel`
- `stanfit.summary`
- `stanfit.c_summary`
- `stanfit.samples`
- `stanfit.log_posterior`
- `stanfit.sampler_params`

```r
library(rstan)
library(dbstan)

# Establish a DB connection using DBI
conn <- DBI::dbConnect(
  RPostgres::Postgres(),
  user = 'postgres',
  password = 'password',
  host = 'mydb.abcdefg1234567.us-east-2.rds.amazonaws.com'
)

# Get samples from any sampled Stan model. Currently only NUTS is supported.
fit1 <- stan(
  file = "schools.stan", # Stan program
  data = schools_data    # named list of data
)

# INSERT `stanfit` into db and store the generated primary key, optionally
# including all posterior samples.
id <- stanfit_insert(fit1, conn, insert_samples = TRUE)

# Retrieve all relevant tables as a list of dbplyr-backed tables
tbl_dict(conn)

# Or, retrieve all relevant tables for a particular `id`
tables <- get_stanfit(id, conn)
```

This helps avoid situations where:

- Multiple researchers swap `.RDS` archives back and forth to exchange results. Now, just write queries against a database to get the results.
- Researchers want to do analysis across multiple (possibly many) sampled runs, but don't have enough RAM to do so, or don't want to keep track of various slimmed-down representations of the original `stanfit` object. Now, the RDBMS does the heavy lifting, and gracefully adapts to a model whose parameter schema may change over time.
- Researchers are using a computing cluster and want to avoid shuffling around tons of `.RDS` files, running out of space on either the cluster or their dev machine, etc. Now, the cluster can just `INSERT` the `stanfit` object once it is created and move on to the next task, with no write to disk necessary.

Tested with Postgres 14.2, but it should work with many other SQL-based databases.

- Execute `init.sql` against your database, which `CREATE`s a `"stanfit"` schema in your database and `CREATE`s all tables. We recommend reading `init.sql` first! If you're just testing out `dbstan`, you could use a SQLite db for this, or Postgres in a container.

  ```
  psql -f init.sql
  ```

- Make sure you can connect to the database using `DBI::dbConnect()`.
- Pass a `stanfit` object to `stanfit_insert(stanfit_object, conn)`. The returned number is a unique identifier, `id`, which is a field in all the tables.

Results from calls to `rstan::sampling()` or `rstan::stan()` are stored in `stanfit` objects. The contents of the object are described in detail here. Each object is an instance of an S4 class with a number of slots that represent various parts of the sampling process, such as the model code, the samples drawn from the posterior, and diagnostic messages from the NUTS sampler. `dbstan` organizes these slots into a relational model. The slots are summarized below:

- `model_name` - Name of the model
- `model_pars` - Parameters in the model, including the `generated quantities{}` block and `transformed parameters{}` block
- `par_dims` - Dimensions of said parameters
- `mode` - Status code indicating success or failure of the sampler
- `sim` - Matrix containing the individual samples
- `inits` - Initial values of all parameters on the first iteration
- `stan_args` - Arguments for the sampler
- `stanmodel` - Model code, as a `stanmodel` object
- `date` - Date of object creation

The schema can be viewed in `init.sql` - here is a more descriptive mapping between the `stanfit` object and its relational model.
`stanfit.run_ids` is an index of all of the `stanfit` objects represented in the schema. The `method` column identifies the run as a sampled run or an optimized run.

| field | stanfit slot | notes |
|---|---|---|
| id | None | Serial, assigned by the database on insertion using `stanfit_insert` |
| method | None | enum, `sampling`/`optimizing` |

`stanfit.run_info` holds basic information about the run, including how long it took.

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | serial | Foreign key |
| model_name | `@model_name` | text | The name of the file, minus the extension |
| date | `@date` | timestamptz | The time the `stanfit` object was created |
| duration | `get_elapsed_time(r) %>% as_tibble(rownames='chain')` | JSON | Return value is in SECONDS |
| mode | `@mode` | numeric | |
| stan_args | `@stan_args[[1]]` | JSON | Take the first chain and represent it as JSON |

Optimizing runs have a lot less information associated with them - here we only keep track of the optimizer's return code and the value of the log-posterior.

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | serial | Foreign key |
| return_code | `$return_code` | integer | Return code from the optimizing routine |
| log_posterior | `$value` | double precision | The value of the log-posterior at the point estimate |

`stanfit.model_pars` holds the model parameters and their dimensions. It includes variables declared in:

- the `parameters{}` block
- the `transformed parameters{}` block
- the `generated quantities{}` block


| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | serial | Foreign key |
| par | `names(r@par_dims)` | text | Includes all parameters, transformed parameters, and generated quantities |
| dim | `@par_dims` | numeric | 0 represents scalars |

`stanfit.stanmodel` holds the model code.

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | serial | |
| code | `get_stancode(r)` | text | |

`stanfit.summary` holds the output of calling `rstan::summary()` on a `stanfit` object and indexing into the combined-chain summary (`rstan::summary(obj)$summary`).

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | | |
| par | `summary(r)$summary$par` | text | |
| idx | Slightly complicated | int | |
| mean | `summary(r)$summary$mean` | double precision | |
| se_mean | `summary(r)$summary$se_mean` | double precision | |
| sd | `summary(r)$summary$sd` | double precision | |
| P2_5 | `summary(r)$summary[["2.5%"]]` | double precision | |
| P25 | `summary(r)$summary[["25%"]]` | double precision | |
| P50 | `summary(r)$summary[["50%"]]` | double precision | |
| P75 | `summary(r)$summary[["75%"]]` | double precision | |
| P97_5 | `summary(r)$summary[["97.5%"]]` | double precision | |
| n_eff | `summary(r)$summary$n_eff` | double precision | |
| Rhat | `summary(r)$summary$Rhat` | double precision | |

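
The `idx` mapping is described above only as "slightly complicated". One plausible reading, sketched here as an assumption rather than as the package's actual logic, is that rstan's flattened parameter names such as `theta[3]` are split into a base name (`par`) and an integer index (`idx`). The helper `split_par_name()` is hypothetical and only handles scalar and one-dimensional parameters; how multi-dimensional indices are stored is not covered.

```r
# Hypothetical helper, not part of dbstan: split flattened rstan parameter
# names such as "theta[3]" into a base name and an integer index.
split_par_name <- function(x) {
  pattern <- "\\[([0-9]+)\\]$"
  indexed <- grepl(pattern, x)

  # Scalars have no index; NA is used as a placeholder (an assumption).
  idx <- rep(NA_integer_, length(x))
  idx[indexed] <- as.integer(sub("^.*\\[([0-9]+)\\]$", "\\1", x[indexed]))

  data.frame(
    par = sub(pattern, "", x),
    idx = idx,
    stringsAsFactors = FALSE
  )
}

split_par_name(c("mu", "theta[1]", "theta[8]", "lp__"))
```
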
`stanfit.c_summary` holds the per-chain summary.

- There's no `n_eff` or `Rhat`.
- The column names in the object returned by `summary(r)` are confusing, as all variables except `par` are named `[metric].chain:[n]`.

We pivot the data a bit to get it into nearly the same structure as the summary table.

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | | |
| chain | No exact mapping | | |
| par | `summary(r)$summary$par` | text | |
| idx | Slightly complicated | double precision | |
| mean | `summary(r)$summary$mean` | double precision | |
| se_mean | `summary(r)$summary$se_mean` | double precision | |
| sd | `summary(r)$summary$sd` | double precision | |
| P2_5 | `summary(r)$summary[["2.5%"]]` | double precision | |
| P25 | `summary(r)$summary[["25%"]]` | double precision | |
| P50 | `summary(r)$summary[["50%"]]` | double precision | |
| P75 | `summary(r)$summary[["75%"]]` | double precision | |
| P97_5 | `summary(r)$summary[["97.5%"]]` | double precision | |

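
To illustrate the pivot into a per-chain layout, here is a minimal base-R sketch. It assumes the per-chain summary is available as a parameters x statistics x chains array, which is how rstan exposes `summary(fit)$c_summary`. A small simulated array stands in for a real fit, and the actual reshaping code in `dbstan` may differ.

```r
# Simulated stand-in for summary(fit)$c_summary:
# a parameters x statistics x chains array.
c_summary <- array(
  seq_len(2 * 3 * 2),
  dim = c(2, 3, 2),
  dimnames = list(
    parameter = c("mu", "tau"),
    stats     = c("mean", "sd", "50%"),
    chains    = c("chain:1", "chain:2")
  )
)

# Build one block of rows per chain, then stack the blocks so that
# each row is identified by (chain, par).
per_chain <- lapply(seq_len(dim(c_summary)[3]), function(k) {
  stats_k <- c_summary[, , k]
  data.frame(chain = k, par = rownames(stats_k), stats_k,
             check.names = FALSE, stringsAsFactors = FALSE)
})
c_summary_long <- do.call(rbind, per_chain)
rownames(c_summary_long) <- NULL

c_summary_long
```

The percentile columns keep rstan's names (for example `50%`) in this sketch; renaming them to `P50` and friends would be a separate step.
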
`stanfit.samples` holds all samples. Call `stanfit_insert(sft, include_samples = T)` to insert samples from your posterior into the database. Otherwise, the default behavior is to save (potentially a lot of) space and not insert any samples.

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | smallint | using smallint to save space |
| chain | No exact mapping | smallint | |
| iter | see `R/get_table_entries.R` | smallint | |
| par | | text | |
| idx | | smallint | |
| value | | double precision | |

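
The long layout of this table (one row per chain, iteration, and parameter) can be sketched in base R as follows. The sketch assumes the draws come from `rstan::extract(fit, permuted = FALSE)`, which returns an iterations x chains x parameters array; a simulated array stands in for a real fit here. The package's own logic lives in `R/get_table_entries.R` and may differ.

```r
# Simulated stand-in for rstan::extract(fit, permuted = FALSE):
# an iterations x chains x parameters array of draws.
set.seed(1)
draws <- array(
  rnorm(3 * 2 * 2),
  dim = c(3, 2, 2),
  dimnames = list(
    iterations = NULL,
    chains     = c("chain:1", "chain:2"),
    parameters = c("mu", "tau")
  )
)

n_iter  <- dim(draws)[1]
n_chain <- dim(draws)[2]
pars    <- dimnames(draws)[[3]]

# as.vector() unrolls the array with the first dimension varying fastest,
# so the index columns are generated in that same order.
samples_long <- data.frame(
  chain = rep(rep(seq_len(n_chain), each = n_iter), times = length(pars)),
  iter  = rep(seq_len(n_iter), times = n_chain * length(pars)),
  par   = rep(pars, each = n_iter * n_chain),
  value = as.vector(draws),
  stringsAsFactors = FALSE
)

head(samples_long)
```
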
Optimizing results are sufficiently different in structure from sampler results that they are given their own table.

| field | list key | type | notes |
|---|---|---|---|
| id | None | | |
| par | `names(r$par)` | text | |
| idx | `names(r$par)` | integer | |
| point_est | `r$value` | double precision | |

`stanfit.log_posterior` holds the log posterior at each iteration, for each chain. Note: `log_posterior` values for optimized runs are not located here - they're in the `optimizing_run_info` table.

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | integer | |
| chain | `get_logposterior(r)[[1,2,...]]` | numeric | |
| iter | Row number of vector | numeric | |
| value | The vector from each chain | double precision | |

`stanfit.sampler_params` holds diagnostic parameters, by chain.

| field | stanfit slot | type | notes |
|---|---|---|---|
| id | None | integer | |
| chain | No exact mapping | numeric | |
| iter | Row number of vector | numeric | |
| accept_stat | `get_sampler_params(r)[[chain]]$accept_stat__` | double precision | |
| stepsize | `get_sampler_params(r)[[chain]]$stepsize__` | double precision | |
| treedepth | `get_sampler_params(r)[[chain]]$treedepth__` | double precision | |
| n_leapfrog | `get_sampler_params(r)[[chain]]$n_leapfrog__` | double precision | |
| divergent | `get_sampler_params(r)[[chain]]$divergent__` | double precision | 1=divergent, 0=not divergent |
| energy | `get_sampler_params(r)[[chain]]$energy__` | double precision | |

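
The mapping above draws on `get_sampler_params(r)`, which returns one matrix per chain. As a rough sketch of how such a list could be stacked into rows keyed by `chain` and `iter`, the example below uses simulated matrices in place of a real fit. The helper `sim_chain()` is made up for illustration, and stripping the trailing `__` from the column names is an assumption based on the field names in the table.

```r
# Simulated stand-in for rstan::get_sampler_params(fit, inc_warmup = FALSE):
# a list with one matrix of NUTS diagnostics per chain.
set.seed(42)
sim_chain <- function(n) {  # hypothetical helper, for illustration only
  cbind(
    accept_stat__ = runif(n),
    stepsize__    = rep(0.1, n),
    treedepth__   = rep(3, n),
    n_leapfrog__  = rep(7, n),
    divergent__   = rep(0, n),
    energy__      = rnorm(n)
  )
}
sampler_params <- list(sim_chain(4), sim_chain(4))

# Tag each chain's rows with the chain number and iteration, drop the
# trailing "__" from the column names, then stack all chains.
rows <- lapply(seq_along(sampler_params), function(k) {
  m <- sampler_params[[k]]
  colnames(m) <- sub("__$", "", colnames(m))
  data.frame(chain = k, iter = seq_len(nrow(m)), m)
})
sampler_long <- do.call(rbind, rows)

head(sampler_long)
```
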