vectra is an R-native columnar query engine for datasets larger than RAM.
Write dplyr-style pipelines against multi-GB files on a laptop. Data streams through a C11 pull-based engine one row group at a time, so peak memory stays bounded regardless of file size.
Point vectra at any file and query it with dplyr verbs. Nothing runs until `collect()`.
```r
library(vectra)

# CSV — lazy scan with type inference
tbl_csv("measurements.csv") |>
  filter(temperature > 30, year >= 2020) |>
  group_by(station) |>
  summarise(avg_temp = mean(temperature), n = n()) |>
  collect()

# GeoTIFF — climate rasters as tidy data
tbl_tiff("worldclim_bio1.tif") |>
  filter(band1 > 0) |>
  mutate(temp_c = band1 / 10) |>
  collect()

# Point extraction — sample raster values at coordinates, no terra needed
tiff_extract_points("worldclim_bio1.tif",
                    x = c(10.5, 11.2), y = c(47.1, 47.3))

# SQLite — zero-dependency, no DBI required
tbl_sqlite("survey.db", "responses") |>
  filter(year == 2025) |>
  left_join(tbl_sqlite("survey.db", "sites"), by = "site_id") |>
  collect()
```

For repeated queries, convert to vectra's native `.vtr` format for faster reads:
```r
write_vtr(big_df, "data.vtr", batch_size = 100000)

tbl("data.vtr") |>
  filter(x > 0, region == "EU") |>
  group_by(region) |>
  summarise(total = sum(value), n = n()) |>
  collect()
```

Append new data without rewriting the file, or do a key-based diff between two snapshots:
```r
# Append new rows as a new row group — existing data untouched
append_vtr(new_rows_df, "data.vtr")

# Logical diff: what was added or deleted between two snapshots?
d <- diff_vtr("snapshot_old.vtr", "snapshot_new.vtr", key_col = "id")
collect(d$added)  # rows present in new but not old
d$deleted         # key values present in old but not new
```

Fuzzy string matching runs inside the C engine, with no round-trip to R:
```r
tbl("taxa.vtr") |>
  filter(levenshtein(species, "Quercus robur") <= 2) |>
  mutate(similarity = jaro_winkler(species, "Quercus robur")) |>
  arrange(desc(similarity)) |>
  collect()
```

Register a star schema to avoid flat-table column creep. Define the links once, then pull only what you need:
```r
s <- vtr_schema(
  fact    = tbl("observations.vtr"),
  species = link("sp_id", tbl("species.vtr")),
  site    = link("site_id", tbl("sites.vtr"))
)

# Pull columns from any dimension — joins are built automatically
lookup(s, count, species$name, site$habitat) |> collect()
#> species: all 500 keys matched
#> site: 3/500 unmatched keys (X1, X2, X3)
```

Use `explain()` to inspect the optimized plan:
```r
tbl("data.vtr") |>
  filter(x > 0) |>
  select(id, x) |>
  explain()
#> vectra execution plan
#>
#> ProjectNode [streaming]
#>   FilterNode [streaming]
#>     ScanNode [streaming, 2/5 cols (pruned), predicate pushdown, v3 stats]
#>
#> Output columns (2):
#>   id <int64>
#>   x  <double>
```

Querying large datasets in R usually means Arrow (requires compiled binaries matching your platform), DuckDB (links a 30 MB bundled library), or Spark (requires a JVM and cluster configuration).
vectra is a self-contained C11 engine compiled as a standard R extension. No external libraries, no JVM, no runtime configuration. It provides:
- Streaming execution: data flows one row group at a time, never fully in memory
- Zero-copy filtering: selection vectors avoid row duplication
- Query optimizer: column pruning skips unneeded columns at scan; predicate pushdown uses per-row-group min/max statistics to skip entire row groups
- Hash joins: build right, stream left --- join a 50 GB fact table against a lookup table without materializing both
- External sort: 1 GB memory budget with automatic spill-to-disk
- Window functions: `row_number()`, `rank()`, `dense_rank()`, `lag()`, `lead()`, `cumsum()`, `cummean()`, `cummin()`, `cummax()`
- String expressions: `nchar()`, `substr()`, `grepl()` evaluated in the engine without round-tripping to R
- Multiple data sources: `.vtr`, CSV, SQLite, GeoTIFF --- all produce the same lazy query nodes
- Integer TIFF output: write rasters as `int16`/`int32`/`uint8`/`uint16`/`float32` with embedded GDAL metadata for 5-10x smaller files
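The window and string expressions in the list above compose inside a single streaming pipeline. A hedged sketch using the documented verbs (the column names `station` and `temperature` are illustrative, not from a real file):

```r
library(vectra)

# Illustrative sketch — assumes a .vtr file with station/temperature columns
tbl("readings.vtr") |>
  group_by(station) |>
  mutate(
    prev  = lag(temperature),    # window: previous reading per station
    delta = temperature - prev,  # change since previous reading
    seq   = row_number()         # position within each station
  ) |>
  filter(grepl("^A", station), nchar(station) <= 8) |>  # string exprs run in the engine
  collect()
```

Because both the window functions and the string predicates are evaluated in the C engine, the pipeline stays lazy until `collect()`.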
| Category | Verbs |
|---|---|
| Transform | filter(), select(), mutate(), transmute(), rename(), relocate() |
| Aggregate | group_by(), summarise() (n, sum, mean, min, max, sd, var, first, last, any, all, median, n_distinct), count(), tally(), distinct() |
| Join | left_join(), inner_join(), right_join(), full_join(), semi_join(), anti_join(), cross_join(), lookup() |
| Order | arrange(), slice_head(), slice_tail(), slice_min(), slice_max(), slice() |
| Window | row_number(), rank(), dense_rank(), lag(), lead(), cumsum(), cummean(), cummin(), cummax(), ntile(), percent_rank(), cume_dist() |
| Date/Time | year(), month(), day(), hour(), minute(), second(), as.Date() (in filter()/mutate()) |
| String | nchar(), substr(), grepl(), tolower(), toupper(), trimws(), paste0(), gsub(), sub(), startsWith(), endsWith() (in filter()/mutate()) |
| String similarity | levenshtein(), levenshtein_norm(), dl_dist(), dl_dist_norm(), jaro_winkler() — fuzzy matching in filter()/mutate(), with optional max_dist early termination |
| Expression | abs(), sqrt(), log(), exp(), floor(), ceiling(), round(), log2(), log10(), sign(), trunc(), if_else(), between(), %in%, as.numeric(), pmin(), pmax(), resolve(), propagate() (in filter()/mutate()) |
| Combine | bind_rows(), bind_cols(), across() |
| Schema | vtr_schema(), link(), lookup() — star schema definition and dimension lookup with match reporting |
| I/O | tbl(), tbl_csv(), tbl_sqlite(), tbl_tiff(), write_vtr(), write_csv(), write_sqlite(), write_tiff(), tiff_extract_points(), tiff_metadata(), append_vtr(), delete_vtr(), diff_vtr() |
| Inspect | explain(), glimpse(), print(), pull() |
Full tidyselect support in select(), rename(), relocate(), and across(): starts_with(), ends_with(), contains(), matches(), where(), everything(), all_of(), any_of().
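A hedged sketch of the tidyselect helpers in practice (the column names `temp_` and `date` are hypothetical):

```r
library(vectra)

# Illustrative sketch — assumes columns like id, date, temp_min, temp_max
tbl("data.vtr") |>
  select(id, contains("date"), starts_with("temp_")) |>  # helper-based selection
  relocate(starts_with("temp_"), .after = id) |>         # helpers work here too
  collect()
```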
```r
# CRAN
install.packages("vectra")

# Development version
pak::pak("gcol33/vectra")
```

- Getting Started — Full walkthrough with runnable examples
- Format Backends — CSV, SQLite, Excel, GeoTIFF, and streaming conversion pipelines
- Joins — All join types, fuzzy joins, key coercion, and memory model
- Star Schemas — Dimension lookups, match reporting, and avoiding flat-table column creep
- String Operations — Pattern matching, fuzzy matching, and block lookups
- Indexing and Optimization — Hash indexes, zone-map pruning, column pruning, and reading `explain()` output
- Working with Large Data — Streaming pipelines, append/delete/diff, external sort, and memory budgeting
- Engine Reference — Execution model, types, coercion, .vtr format, and limitations
- Function Reference
"Software is like sex: it's better when it's free." -- Linus Torvalds
If this package saved you some time, buying me a coffee is a nice way to say thanks.
MIT (see the LICENSE.md file)
```bibtex
@software{vectra,
  author = {Colling, Gilles},
  title  = {vectra: Columnar Query Engine for Larger-Than-RAM Data},
  year   = {2026},
  url    = {https://github.com/gcol33/vectra}
}
```