Skip to content

aaryannaidu/dataframe_lib

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DataFrameLib

A C++17 DataFrame library built on Apache Arrow. Apache Arrow is used only for I/O (CSV/Parquet read/write) and data storage — all computation is implemented from scratch in pure C++. It offers both an eager API (operations execute immediately) and a lazy API (operations build a query plan that is optimized and executed only on .collect()).


Dependencies

Library Install
Apache Arrow (core + Parquet) brew install apache-arrow
Graphviz (for explain()) brew install graphviz
CMake ≥ 3.14 brew install cmake

Build

mkdir build && cd build
cmake ..
cmake --build . --parallel

This produces:

  • libdataframelib.a — the library
  • dataframe_test — visualization demo (main.cpp)
  • dfl_test_expressions, dfl_test_eager, dfl_test_lazy, dfl_test_optimizer — unit tests

After building, run the demo to generate DAG visualization PNGs in the build/ directory:

./dataframe_test
# produces: build/test_optimizer.png  (constant folding + predicate pushdown example)
#           build/test_explain.png    (join + group-by pipeline example)

Quick Start

Include the single public header:

#include <dataframelib/dataframelib.h>
using namespace dataframelib;

Eager API

Operations run immediately and return a new EagerDataFrame.

// Read
auto df = read_csv("data.csv");

// Select, filter, sort
auto result = df.select({"name", "salary", "department"})
                .filter(col("salary") > 50000)
                .sort({"salary"}, false);   // false = descending

// Add / replace a column
auto df2 = df.with_column("bonus", col("salary") * 0.1);

// Group-by and aggregate
auto summary = df.group_by({"department"})
                 .aggregate({{"salary", "mean"}, {"salary", "count"}});
// Output columns: department, salary_mean, salary_count

// Join
auto joined = df.join(other, {"id"}, "inner");   // inner | left | right | outer

// Write
result.write_csv("out.csv");
result.write_parquet("out.parquet");

Lazy API

Operations build a query plan. Call .collect() to execute.

auto ldf = scan_csv("data.csv")
               .filter(col("department") == "Engineering")
               .select({"name", "salary"})
               .sort({"salary"}, false)
               .head(10);

EagerDataFrame top10 = ldf.collect();

// Visualise the query plan (before + after optimization) as a PNG
ldf.explain("plan.png");

// Sink directly to file without materializing
ldf.sink_csv("top10.csv");

Expressions

col("x") + col("y")          // arithmetic: + - * / %
col("age") >= 18              // comparison: == != < <= > >=
col("active") & ~col("banned")// boolean: & | ~
col("name").to_upper()        // strings: to_lower/upper, length, contains, starts_with, ends_with
col("salary").is_null()       // null checks: is_null, is_not_null
col("x").abs()
(col("x") * 2).alias("double_x")  // alias for computed columns in select()

Build from arrays

auto id_arr   = /* arrow::Int32Array */;
auto name_arr = /* arrow::StringArray */;

auto df = from_columns({{"id", id_arr}, {"name", name_arr}});

Project Structure

include/
  dataframelib/dataframelib.h   public entry-point header
  Expression.hpp                expression AST and operator overloads
  EagerDataFrame.hpp            eager execution API
  LazyDataFrame.hpp             lazy / query-plan API
  DAGNode.hpp                   query plan node types
  QueryOptimizer.hpp            optimizer interface
  IO.hpp                        internal I/O helpers
  Compute.hpp                   manual compute layer (NativeValue + all operation declarations)

src/
  Compute.cpp                   all compute implementations (arithmetic, comparison, aggregation,
                                string functions, sort, filter — no Arrow compute used)
  Expression.cpp                expression AST walker; delegates evaluation to Compute
  EagerDataFrame.cpp            eager operations (filter, join, group-by …)
  LazyDataFrame.cpp             lazy plan builder + DAG executor
  DAGNode.cpp                   node ID management
  QueryOptimizer.cpp            query optimizer passes
  IO.cpp                        CSV / Parquet read & write (only Arrow usage outside of data types)

tests/
  test_expressions.cpp
  test_eager.cpp
  test_lazy.cpp
  test_optimizer.cpp

Query Optimizer

The lazy API automatically applies the following optimizations before execution:

Pass What it does
Projection pushdown Inserts early SELECT nodes near scans to drop unused columns as soon as possible
Projection merging Collapses consecutive redundant SELECT nodes into one
Predicate pushdown Moves FILTER below SORT and below SELECT (when safe) to reduce rows early
Limit pushdown Moves HEAD below SELECT so fewer rows are projected
Constant folding Evaluates literal arithmetic at plan time (2 * 36)
Expression simplification Rewrites x + 0x, x * 1x, x * 00, x - x0

Call .explain("plan.png") on any LazyDataFrame to produce a side-by-side PNG of the unoptimized and optimized query plans. Example outputs from ./dataframe_test (generated in build/):

  • test_optimizer.png — shows constant folding + predicate pushdown transforming FILTER(SORT(SELECT(SCAN))) into SORT(FILTER(SCAN))
  • test_explain.png — shows a join + group-by pipeline with projection and filter pushdown into the join branches

Running Tests

cd build
./dfl_test_expressions
./dfl_test_eager
./dfl_test_lazy
./dfl_test_optimizer

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors