autocodebook

Automatic codebook and eligibility tracking for data preprocessing pipelines in R.

Write the mutate() — the codebook writes itself.

Built for large-scale epidemiological and social data pipelines using sparklyr, but works equally well with local data frames.

Installation

# From CRAN (after release)
install.packages("autocodebook")

# Development version
# install.packages("devtools")
devtools::install_github("patriciafortesm/autocodebook")

Why autocodebook?

In data preprocessing pipelines, documenting variables is duplicated work. You already wrote the case_when() with all the logic — but then you have to manually write the type, the source columns, the category labels, and the code again in a separate codebook table.

Before (manual codebook — you write everything twice):

# Step 1: Create the variable
df <- df %>%
  mutate(
    sex = case_when(
      cod_sex %in% c(0L, 99L) ~ NA_character_,
      cod_sex == 1L            ~ "Male",
      cod_sex == 2L            ~ "Female",
      TRUE                     ~ NA_character_
    )
  )

# Step 2: Manually document it (duplicated effort!)
register_var("sex",
  type       = "character",
  source     = "cod_sex",
  label      = "Sex",
  categories = "Male; Female; NA (codes 0 and 99)",
  code       = "case_when(cod_sex %in% c(0L, 99L) ~ NA_character_, ...)"
)

After (with autocodebook — you only write the label):

df <- auto_mutate(df,
  labels = list(sex = "Sex"),
  sex = case_when(
    cod_sex %in% c(0L, 99L) ~ NA_character_,
    cod_sex == 1L            ~ "Male",
    cod_sex == 2L            ~ "Female",
    TRUE                     ~ NA_character_
  )
)
# Done. Type, source, categories, and code are captured automatically.

The package uses introspection (rlang) to capture the source code of each expression and infer:

Field	How it's inferred
`type`	Keywords in the code (`NA_character_`, `0L`, `/`)
`source`	Column names referenced in the expression
`categories`	Literal values extracted from `case_when` / `if_else`
`code`	The literal R expression, captured automatically
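
The capture step can be approximated in a few lines of base R. The sketch below is not the package's implementation (which, as stated above, relies on rlang). `capture_exprs()` and `collect_strings()` are hypothetical helpers that only illustrate how an unevaluated expression can yield the code, the source columns, and the category literals.

```r
# Hypothetical helpers for illustration; autocodebook's internals may differ.

# Capture each named argument without evaluating it.
capture_exprs <- function(...) {
  exprs <- as.list(substitute(list(...)))[-1]
  lapply(exprs, function(e) {
    list(
      code       = paste(deparse(e, width.cutoff = 500L), collapse = " "),
      source     = all.vars(e),        # symbols that are not function names
      categories = collect_strings(e)  # character constants found in the call
    )
  })
}

# Recursively collect character constants from an expression.
collect_strings <- function(e) {
  if (is.character(e)) return(e)
  if (is.call(e)) {
    return(unlist(lapply(as.list(e)[-1], collect_strings), use.names = FALSE))
  }
  character(0)
}

# case_when() is never evaluated here, so dplyr is not needed for the sketch.
info <- capture_exprs(
  sex = case_when(
    cod_sex == 1L ~ "Male",
    cod_sex == 2L ~ "Female",
    TRUE          ~ NA_character_
  ),
  crowding = n_people / n_rooms
)

info$sex$source        # "cod_sex"
info$sex$categories    # "Male" "Female" NA
info$crowding$source   # "n_people" "n_rooms"
```

Because the expressions are inspected rather than run, the same idea applies whether the data is a local data frame or a lazy Spark table.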

What you write vs. what is automatic

Field	Who fills it	Example
`label`	You	`"Sex"`, `"Household crowding"`
`block`	You (optional)	`"Demographics"`, `"Migration"`
`type`	Automatic	`"character"`, `"integer"`, `"date"`
`source`	Automatic	`"cod_sex"`, `"n_people, n_rooms"`
`categories`	Automatic	`"Male; Female; NA"`
`code`	Automatic	The full `case_when(...)` expression

Quick example

library(dplyr)
library(autocodebook)

cb_init(id_col = "person_id")

df <- df %>%
  # Track raw data
  auto_filter(step = "1. Raw data", description = "All records", TRUE) %>%
  # Eligibility
  auto_filter(step = "2. Valid sex",
              description = "Exclude records with missing sex",
              !is.na(cod_sex)) %>%
  auto_filter(step = "3. Adults",
              description = "Restrict to age >= 18",
              age >= 18) %>%
  # Create derived variables (auto-documented)
  auto_mutate(
    labels = list(
      sex      = "Sex",
      race     = "Self-declared race / ethnicity",
      crowding = "Household crowding (people per room)"
    ),
    block = "Demographics",
    sex = case_when(
      cod_sex == 1L ~ "Male",
      cod_sex == 2L ~ "Female",
      TRUE          ~ NA_character_
    ),
    race = case_when(
      cod_race == 1L ~ "White",
      cod_race == 2L ~ "Black",
      cod_race == 3L ~ "Brown",
      cod_race == 5L ~ "Indigenous",
      TRUE           ~ NA_character_
    ),
    crowding = n_people / n_rooms
  )

# View and export
cb_render()                                              # Codebook as gt table
cb_export(file.path(tempdir(), "codebook.html"))         # Export to HTML
cb_export(file.path(tempdir(), "codebook.docx"))         # Editable Word table
cb_export(file.path(tempdir(), "codebook.xlsx"))         # Editable Excel spreadsheet
track_render()                                           # Eligibility flow as gt table

# Programmatic access
cb_get()      # Codebook as a tibble
track_get()   # Tracking log as a tibble
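
To make the tracking log concrete, here is a base-R sketch of the bookkeeping that each filtering step implies: count unique IDs before and after, and record the difference. `log_filter()`, `tracker`, and the column names `n_ids` and `n_removed` are hypothetical stand-ins, not the package's API or its actual log schema.

```r
# Hypothetical stand-in for an eligibility log; not autocodebook's implementation.
tracker <- new.env()
tracker$log <- data.frame(step = character(), n_ids = integer(), n_removed = integer())

log_filter <- function(data, step, keep, id_col = "person_id") {
  kept     <- data[!is.na(keep) & keep, , drop = FALSE]  # NA conditions drop the row
  n_before <- length(unique(data[[id_col]]))
  n_after  <- length(unique(kept[[id_col]]))
  tracker$log <- rbind(
    tracker$log,
    data.frame(step = step, n_ids = n_after, n_removed = n_before - n_after)
  )
  kept
}

# Illustrative data only.
people <- data.frame(
  person_id = c(1, 2, 3, 4, 5),
  cod_sex   = c(1L, 2L, NA, 1L, 2L),
  age       = c(34, 17, 52, 61, 25)
)

step1 <- log_filter(people, "1. Raw data", rep(TRUE, nrow(people)))
step2 <- log_filter(step1, "2. Valid sex", !is.na(step1$cod_sex))
step3 <- log_filter(step2, "3. Adults", step2$age >= 18)

tracker$log$n_ids       # 5 4 3
tracker$log$n_removed   # 0 1 1
```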

Standardized HTML report

A single call to generate_report() produces a complete dashboard with eligibility flowchart, codebook, and per-variable inspection — ready to share with collaborators or attach as a supplement.

generate_report(
  data        = df,
  type        = "longitudinal",         # or "cross_sectional"
  id_var      = "person_id",
  time_var    = "year",
  output_html = file.path(tempdir(), "report.html")
)

Eligibility section: automatic flowchart with N per step and the number of records removed.

Codebook section: all derived variables with type, source, categories, and the exact code that produced them.

Variable inspection: distribution by period, missingness pattern, and within-subject variation (Fixed vs. Varies), per variable.

Editable exports for papers and supplements

The codebook can be exported as a fully editable Word table (for paper supplements) or Excel spreadsheet (with filters, for review before publication):

Word (.docx): paste straight into supplementary material.

Excel (.xlsx): filter, sort, edit, then re-import if needed.

CONSORT-style eligibility flowchart

For studies that split the cohort by exposure (and optionally by mediator), track_split() + track_outcomes() capture N and outcome counts at every subgroup combination. flow_diagram() then renders a publication-ready CONSORT-style flowchart directly from the eligibility steps (recorded by auto_filter()) and the flow tree — no manual positioning needed:

df %>%
  auto_filter(step = "age",   description = "Younger than 10 years", age >= 10) %>%
  auto_filter(step = "sinan", description = "No record of violence", has_violence) %>%
  track_split(by = "sgm", label = "SGM status",
              value_labels = c("0" = "Non-SGM", "1" = "SGM")) %>%
  track_outcomes(c("self_harm", "psych"),
                 labels = list(self_harm = "Self-harm",
                               psych     = "Psychiatric hospitalization"))

flow_diagram()       # publication-ready ggplot
flow_table()         # the same data as a tidy tibble (one row per leaf × outcome)

flow_diagram(): vertical trunk (baseline → aggregated exclusions → eligible cohort), one column per subgroup, and outcome boxes stacked beneath each subgroup.

flow_table(): the same information as a tidy tibble, ready for analysis or editable export (CSV, XLSX).
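
The "one row per leaf × outcome" shape can be illustrated with base R. This is only a sketch of the layout: the data is made up, and the column names `n_leaf` and `n_outcome` are assumptions rather than the columns flow_table() actually returns.

```r
# Illustrative cohort: one split variable (sgm) and two binary outcomes.
cohort <- data.frame(
  sgm       = c(0, 0, 0, 1, 1),
  self_harm = c(0, 1, 0, 1, 1),
  psych     = c(0, 0, 1, 0, 1)
)
outcomes <- c("self_harm", "psych")

# One leaf per level of the split variable.
leaves <- split(cohort, cohort$sgm)

# Stack one row per leaf and outcome (hypothetical column names).
flow_rows <- do.call(rbind, lapply(names(leaves), function(leaf) {
  data.frame(
    sgm       = leaf,
    outcome   = outcomes,
    n_leaf    = nrow(leaves[[leaf]]),
    n_outcome = vapply(outcomes,
                       function(o) sum(leaves[[leaf]][[o]] == 1),
                       integer(1), USE.NAMES = FALSE)
  )
}))

flow_rows$n_leaf      # 3 3 2 2
flow_rows$n_outcome   # 1 1 2 1
```

A long layout like this is what makes the table easy to filter, reshape, or hand to another plotting package.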

flow_diagram() returns a ggplot object, so it can be themed, embedded in the standardized report (which does so automatically), or saved with flow_diagram_export(). The export format follows the file extension:

flow_diagram_export("flow.png")    # raster image
flow_diagram_export("flow.pdf")    # vector (also .svg, .eps)
flow_diagram_export("flow.emf")    # editable vector for Word (needs 'devEMF')
flow_diagram_export("flow.docx")   # Word document with the flowchart embedded (needs 'officer')
flow_diagram_export("flow.pptx")   # PowerPoint, fully editable shapes (needs 'rvg' + 'officer')

For the .pptx output, right-click the figure in PowerPoint and choose Ungroup to edit each box and label as a native shape. The tidy table can still be piped into a dedicated diagramming package such as consort or DiagrammeR if you prefer.
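
Choosing the output by file extension is a common pattern in R. The sketch below shows one way such a dispatch could look. `export_kind()` is a hypothetical helper written for this illustration and says nothing about how flow_diagram_export() is implemented internally.

```r
# Hypothetical helper: maps a file name to the kind of output listed above.
export_kind <- function(path) {
  ext <- tolower(tools::file_ext(path))  # extension without the dot, case-insensitive
  switch(ext,
    png  = "raster image",
    pdf  = ,                             # empty entries fall through to the next value
    svg  = ,
    eps  = "vector graphic",
    emf  = "editable vector (needs devEMF)",
    docx = "Word document (needs officer)",
    pptx = "PowerPoint with editable shapes (needs rvg + officer)",
    stop("Unsupported extension: ", ext)
  )
}

export_kind("flow.png")      # "raster image"
export_kind("Figure1.PDF")   # "vector graphic"
```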

Spark example

Works the same way with sparklyr — no API changes:

library(sparklyr)
library(dplyr)
library(autocodebook)

sc <- spark_connect(master = "local")
df <- copy_to(sc, my_data, "my_table")

cb_init(id_col = "person_id")
track_step(df, "1. Raw data")

df <- auto_mutate(df,
  labels = list(
    region_code = "Municipality code (7 digits)",
    state_code  = "State code (first 2 digits)"
  ),
  block = "Geographic variables",
  region_code = lpad(as.character(cod_munic), 7L, "0"),
  state_code  = substring(region_code, 1L, 2L)
)

cb_render()
spark_disconnect(sc)
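
Note that lpad() in the example is a Spark SQL function that sparklyr passes through to Spark; it is not an R function, so the same expression would not run on a local data frame. A minimal base-R sketch of the equivalent logic, assuming integer municipality codes (the values below are illustrative):

```r
# Illustrative municipality codes; the second one needs a leading zero.
cod_munic <- c(3550308L, 120040L)

region_code <- sprintf("%07d", cod_munic)    # left-pad to 7 digits with zeros
state_code  <- substr(region_code, 1L, 2L)   # first 2 characters

region_code   # "3550308" "0120040"
state_code    # "35" "01"
```

On a local data frame, these expressions would presumably take the place of the lpad() and substring() calls inside auto_mutate(); that usage is an assumption and is not shown in the original examples.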

Big-data optimizations

For large Spark pipelines, several helpers reduce wasted recomputation:

cb_set_default_cache(TRUE) — caches intermediate results across the whole session.
auto_filter(..., assume_unique = TRUE) — skips the n_distinct(id) call in tracking when the dataset is already unique by ID (orders of magnitude faster on multi-million-row data).
cb_checkpoint(sdf, mode = "memory") — materializes a lazy tbl_spark to break long chains of transformations.
generate_report(..., cache_data = TRUE) — persists the dataset once before computing all report aggregations.
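
The trade-off behind assume_unique can be shown on a local data frame. In this sketch, `count_ids()` is a hypothetical helper, not a package function: when every row is a distinct ID, the row count equals the distinct count, so the costly de-duplication can be skipped. When IDs repeat, the shortcut overstates the count.

```r
# Hypothetical helper illustrating the shortcut; not part of autocodebook.
count_ids <- function(data, id_col, assume_unique = FALSE) {
  if (assume_unique) {
    nrow(data)                       # cheap: no de-duplication
  } else {
    length(unique(data[[id_col]]))   # exact: number of distinct IDs
  }
}

one_row_per_person <- data.frame(person_id = 1:4)
repeated_visits    <- data.frame(person_id = c(1L, 1L, 2L, 3L))

count_ids(one_row_per_person, "person_id", assume_unique = TRUE)   # 4, same as exact
count_ids(repeated_visits, "person_id", assume_unique = TRUE)      # 4, but only 3 distinct IDs
count_ids(repeated_visits, "person_id")                            # 3
```

The shortcut is therefore only safe when uniqueness by ID is guaranteed upstream.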

API reference

Verb wrappers

Function	Replaces	Registers in	Description
`auto_mutate()`	`mutate()`	Codebook	Creates variables + auto-documents them
`auto_summarise()`	`summarise()`	Codebook	Summarises + auto-documents new columns
`auto_filter()`	`filter()`	Tracking	Filters + logs how many IDs remain

Codebook

Function	Description
`cb_init()`	Initialize session and set the unique ID column
`cb_register()`	Manually register a variable (for edge cases)
`cb_get()`	Returns the full codebook as a tibble
`cb_reset()`	Clears all codebook entries
`cb_render()`	Renders the codebook as a formatted `gt` table
`cb_export()`	Saves to `.html`, `.csv`, `.docx`, or `.xlsx`

Eligibility tracking

Function	Description
`track_step()`	Records a step with unique ID count and number removed
`track_get()`	Returns the tracking log as a tibble
`track_reset()`	Clears the tracking log
`track_render()`	Renders the tracking table as a formatted `gt` table
`track_export()`	Saves to `.html`, `.csv`, `.docx`, or `.xlsx`

Flow tree (CONSORT-style)

Function	Description
`track_split()`	Adds a branching level (e.g., by exposure)
`track_outcomes()`	Stacks outcome counts on the current leaves
`flow_diagram()`	Renders a CONSORT-style flowchart (`ggplot`) from the flow
`flow_diagram_export()`	Saves the flowchart (`.png`, `.pdf`, `.svg`, `.eps`, `.emf`, `.docx`, `.pptx`)
`flow_table()`	Tidy tibble with one row per leaf × outcome
`flow_get()`	Returns the raw flow-tree structure as a list
`flow_reset()`	Clears the flow tree

Reports and session options

Function	Description
`generate_report()`	Builds the full HTML dashboard (+ editable exports)
`cb_checkpoint()`	Materializes a lazy `tbl_spark`
`cb_set_verbose()`	Toggles diagnostic messages
`cb_set_default_cache()`	Sets the session-wide default for `cache`

Parameters for auto_mutate / auto_summarise

auto_mutate(.data,
  labels = list(var1 = "Label for variable 1"),  # the only field you write by hand (optional)
  block  = "Section name",                        # optional: groups in codebook
  var1   = case_when(...)                          # your normal dplyr expressions
)

labels: Named list mapping variable names to descriptions. If omitted, the variable name itself is used.
block: Optional string. Groups variables into sections in the rendered codebook (e.g., "Demographics", "Migration flags").
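
The fallback rule for labels can be sketched as follows. `resolve_labels()` is a hypothetical helper that mirrors the behavior described above (use the supplied label, otherwise the variable name); it is not the package's code.

```r
# Hypothetical helper: use the supplied label, otherwise the variable name itself.
resolve_labels <- function(var_names, labels = list()) {
  vapply(
    var_names,
    function(v) if (is.null(labels[[v]])) v else labels[[v]],
    character(1)
  )
}

resolve_labels(c("sex", "crowding"), labels = list(sex = "Sex"))
# sex = "Sex", crowding = "crowding"
```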

Compatibility

R >= 4.1
Works with both sparklyr (tbl_spark) and local data frames
Compatible with Spark SQL functions (lpad, substring, lag with window_order, etc.)
No stringr dependency — uses only base R internally
Report exports require rmarkdown, ggplot2, patchwork, scales (Suggests)
Editable exports to .docx / .xlsx require officer, flextable, openxlsx (Suggests)
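
Since the report and export dependencies are listed under Suggests, they may not be installed alongside the package. A small base-R check, using only the package names listed above, can show what is missing before calling the export functions:

```r
# Packages named in the compatibility notes above.
suggested <- c("rmarkdown", "ggplot2", "patchwork", "scales",
               "officer", "flextable", "openxlsx")

# requireNamespace() returns FALSE (without error) when a package is absent.
is_installed <- vapply(suggested, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- suggested[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```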

License

MIT
