simple-query-engine

A query engine for CSV data with a custom SQL-like syntax.

Load CSV files and query them using projection and filtering operations through an interactive REPL.

Short demo

(asciicast recording)

Requirements

  • Node.js 20.18.0 or higher

Quick start

Install dependencies:

npm install

Run the query engine with the default dataset (data/countries.csv):

npm start

Or with a specific CSV file:

npm start -- -f path/to/your/data.csv

Command line options

  • -f <file> - Specify CSV file to load (default: data/countries.csv)
  • -h, --help - Show help message

Query syntax

The query engine supports a simple SQL-like syntax:

PROJECT column1
PROJECT column1, column2
PROJECT column1 FILTER column2 > "value"
PROJECT column1, column2 FILTER column3 = 5
PROJECT column2 FILTER column3 != column4

Supported filter operators: =, !=, >, >=, <, <=

REPL commands

  • .columns - Show available columns in the loaded dataset
  • .exit - Quit the REPL

Project structure

.
├── src/              
│   ├── index.ts                    # Main entry point
│   ├── cli.ts                      # Command line argument parsing
│   ├── repl.ts                     # REPL implementation
│   ├── query-engine.ts             # Query execution engine
│   ├── query-parser.ts             # Query parser
│   ├── lex.ts                      # Lexer for tokenization
│   ├── csv-loader.ts               # CSV file loader
│   ├── column-schema-inference.ts  # Type inference for columns
│   └── errors/                     # Error types
├── test/                           # Test files
└── data/                           # Sample CSV data files

Development

Run tests:

npm test

Run tests in watch mode:

npm run test:watch

Run tests with coverage:

npm run test:coverage

Build the project:

npm run build

Lint code:

npm run lint:fix

Format code:

npm run format

Questions

What were some of the tradeoffs you made when building this and why were these acceptable tradeoffs?

Using CSV library vs custom CSV parser

Decision: Used @fast-csv/parse library instead of implementing parsing from scratch.

Why:

  • A battle-tested library with streaming support for large files
  • Saved implementation time for other parts of the project

Downside:

  • Goes slightly beyond spec (supports floating point numbers, not just integers)

Acceptable because: The focus of the project is the query engine, not CSV parsing.
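
For illustration, a minimal sketch of how a loader could wrap @fast-csv/parse in a promise - the function name and row shape here are assumptions, not the actual csv-loader.ts API:

import { createReadStream } from 'node:fs';
import { parse } from '@fast-csv/parse';

// Hypothetical loader: reads a CSV file with a header row and resolves
// with all rows as plain objects keyed by column name.
export function loadCsv(path: string): Promise<Record<string, string>[]> {
  return new Promise((resolve, reject) => {
    const rows: Record<string, string>[] = [];
    createReadStream(path)
      .pipe(parse({ headers: true }))
      .on('error', reject)
      .on('data', (row: Record<string, string>) => rows.push(row))
      .on('end', () => resolve(rows));
  });
}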

Full lexer/parser implementation vs regex-based parsing or a library

Decision: Built a custom lexer and parser.

Why:

  • Full control over the grammar of the query language
  • Better error messages with precise token positions
  • Easier to extend (adding new tokens, keywords)
  • Fun challenge

Downside:

  • More code to maintain
  • A lot more tests to write vs using a battle-tested library

Acceptable because: For a query engine project, building my own query parser felt natural, and it shows that I can do more than just wire up libraries.
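
To illustrate why precise token positions matter, here is a hypothetical token shape the lexer could produce (the actual types in lex.ts may differ):

// Hypothetical token representation with position tracking.
type TokenKind =
  | 'project' | 'filter' | 'identifier' | 'number' | 'string'
  | 'comma' | 'operator' | 'eof';

interface Token {
  kind: TokenKind;
  text: string;     // raw lexeme, e.g. "PROJECT" or "name"
  position: number; // zero-based offset into the query string
}

// A parse error can then point at the exact offending token, e.g.
// "Syntax error at position 12: expected <comma>, got <identifier>".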

Support only integers vs integers and decimals

Decision: Support integers and decimals

Why:

  • No limitation on the data side; a numeric value can be either an integer or a decimal

Downside:

  • Slightly more complex lexer implementation

Acceptable because: Small implementation overhead for a very useful feature.
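
The extra lexer work is roughly: after consuming digits, optionally consume a decimal point followed by more digits. A sketch of that logic (assumed, not copied from lex.ts):

// Scans a number literal starting at `start`, accepting an optional fractional part.
function scanNumber(input: string, start: number): { value: number; end: number } {
  let end = start;
  while (end < input.length && /[0-9]/.test(input[end])) end++;
  if (input[end] === '.' && /[0-9]/.test(input[end + 1] ?? '')) {
    end++; // consume the '.'
    while (end < input.length && /[0-9]/.test(input[end])) end++;
  }
  return { value: Number(input.slice(start, end)), end };
}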

Case sensitive vs case insensitive keywords in query language

Decision: Make query keywords case insensitive

Why:

  • Queries are easier to write when you don't need to switch case between keywords and column names
  • Better UX
  • Follows the SQL standard

Downside:

  • Slightly more complex logic
  • More test cases required

Acceptable because: Better UX for a small amount of added complexity
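
The added logic is small: normalize the lexeme before the keyword lookup while leaving identifiers (column names) untouched. A sketch, not the actual implementation:

const KEYWORDS = new Set(['PROJECT', 'FILTER']);

// Keywords match case-insensitively; column names keep their original case.
function classifyWord(lexeme: string): 'keyword' | 'identifier' {
  return KEYWORDS.has(lexeme.toUpperCase()) ? 'keyword' : 'identifier';
}

// classifyWord('project') === 'keyword'
// classifyWord('Name')    === 'identifier'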

Double vs single quotes for string literals

Decision: Support both double and single quotes

Why:

  • Better UX

Downside:

  • Slightly more complex lexer logic
  • More test cases required

Acceptable because: Better UX for a small amount of added complexity

Implementing basic operators (> and =) vs all comparison operators

Decision: Implement all comparison operators

Why:

  • Marginal additional implementation effort
  • Makes queries more useful

Downside:

  • More test cases needed
  • Slightly more complex lexer

Acceptable because: Makes the query engine much more usable on real-world data
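
Once = and > exist, the remaining operators mostly reduce to a single comparison function. A sketch of how the filter step might evaluate any of the six operators (the helper names are assumptions):

type Operator = '=' | '!=' | '>' | '>=' | '<' | '<=';
type Value = string | number;

// Assumed evaluation helper; query-engine.ts may structure this differently.
function compare(left: Value, op: Operator, right: Value): boolean {
  switch (op) {
    case '=':  return left === right;
    case '!=': return left !== right;
    case '>':  return ordering(left, right) > 0;
    case '>=': return ordering(left, right) >= 0;
    case '<':  return ordering(left, right) < 0;
    case '<=': return ordering(left, right) <= 0;
  }
}

// Numbers compare numerically, everything else lexicographically.
function ordering(a: Value, b: Value): number {
  if (typeof a === 'number' && typeof b === 'number') return a - b;
  return String(a).localeCompare(String(b));
}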

Query engine inferring schema vs passing schema to query engine from CSV loader

Decision: Infer schema in query engine

Why:

  • The CSV loader has no additional responsibility to gather type information
  • Type handling is hidden from the public API
  • Less prone to human error; callers cannot pass invalid types to the query engine

Downside:

  • The query engine has to infer types from the data, which adds a performance overhead

Acceptable because: Simpler public API, harder to misuse

Scanning all rows vs using first row vs sampling few rows for schema inference

Decision: Sample the first, middle, and last rows instead of scanning the entire dataset

Why:

  • Checking just one row (the first) felt too brittle; checking multiple rows should cover more edge cases
  • O(k) schema inference instead of O(n * k) for a full scan (n = number of rows, k = number of columns)
  • CSV data is typically homogeneous

Downside:

  • Won't detect type inconsistencies in the full dataset
  • Could misclassify if first/middle/last rows aren't representative

Acceptable because: Sampling three positions should catch most issues while maintaining almost constant-time performance.
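
A sketch of the sampling approach, assuming a column is treated as numeric only when every sampled value parses as a number (column-schema-inference.ts may differ in the details):

type ColumnType = 'number' | 'string';
type Row = Record<string, string>;

// Assumed inference: look only at the first, middle, and last rows.
function inferSchema(columns: string[], rows: Row[]): Record<string, ColumnType> {
  const samples = [rows[0], rows[Math.floor(rows.length / 2)], rows[rows.length - 1]]
    .filter((row): row is Row => row !== undefined);

  const schema: Record<string, ColumnType> = {};
  for (const column of columns) {
    const allNumeric = samples.every(
      (row) => row[column] !== '' && !Number.isNaN(Number(row[column])),
    );
    schema[column] = allNumeric ? 'number' : 'string';
  }
  return schema;
}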

Building simple script showing output vs interactive REPL

Decision: Built a REPL with helper commands instead of a simple script

Why:

  • Better demo experience for reviewers
  • Easier to test queries interactively
  • Shows error messages from parser / query engine
  • Demonstrates the helper function for presenting results nicely
  • Helper commands make it easier to explore the dataset

Downside:

  • Additional REPL-specific code needed
  • Requires manually typing queries to test the implementation rather than just launching a script

Acceptable because: Makes the project more usable; trying different queries is easier than modifying a script and re-running it
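
A stripped-down sketch of the REPL loop using Node's built-in readline module; the dot-commands match the README, but the engine interface (columns, runQuery) is an assumption:

import * as readline from 'node:readline';

// Hypothetical engine interface; the real repl.ts wires this differently.
interface Engine {
  columns: string[];
  runQuery(query: string): unknown[];
}

export function startRepl(engine: Engine): void {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
    prompt: 'query> ',
  });
  rl.prompt();
  rl.on('line', (line) => {
    const input = line.trim();
    if (input === '.exit') {
      rl.close();
      return;
    }
    if (input === '.columns') {
      console.log(engine.columns.join(', '));
    } else if (input.length > 0) {
      try {
        console.table(engine.runQuery(input));
      } catch (error) {
        console.error((error as Error).message);
      }
    }
    rl.prompt();
  });
}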

Use sample data vs allow custom CSV file

Decision: Allow passing custom CSV file via command line argument, with fallback to a sample file

Why:

  • Allows testing the project on different datasets
  • Common practice to support changing config/data through arguments
  • The fallback to a sample file means the user doesn't need their own CSV file to try the project

Downside:

  • Requires command line argument parsing

Acceptable because: Provides an easy way to switch datasets without changing the source code.
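
Node 20 ships util.parseArgs, which keeps the parsing small. A sketch of the option handling (the short flags follow the README; the long --file/--help names and defaults are assumptions):

import { parseArgs } from 'node:util';

// Assumed CLI parsing; cli.ts may use different option names or defaults.
const { values } = parseArgs({
  options: {
    file: { type: 'string', short: 'f', default: 'data/countries.csv' },
    help: { type: 'boolean', short: 'h', default: false },
  },
});

if (values.help) {
  console.log('Usage: npm start -- [-f <file>] [-h]');
  process.exit(0);
}

console.log(`Loading CSV from ${values.file}`);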

Given more time, what improvements or optimizations would you want to add? When would you add them?

Add LIMIT-like clause

Currently there is no way to limit how many results are returned, and for large result sets the display logic might not be able to show all of them.

When to add: Immediately; this prevents OOM issues and is critical for use with large datasets.

Example: PROJECT name FILTER age > 25 LIMIT 10

Better error messages

While some error types contain additional data, that context is not used as much as it could be. Some syntax errors may confuse users who are not familiar with the query language grammar.

Example: PROJECT id, first name currently results in Syntax error: Expected <eof> token, got <identifier>, while it could say Expected comma (,) between column names and display a ^ pointer at the error position. Even Expected <comma> token, got <identifier> would be better.

When to add: before the query language goes public :)
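
A small sketch of how the improved message could look: given the query text and the offending token's offset, print a caret under the error position (the helper name is made up):

// Hypothetical formatter for parser errors with a position pointer.
function formatSyntaxError(query: string, position: number, message: string): string {
  return [`Syntax error: ${message}`, query, `${' '.repeat(position)}^`].join('\n');
}

// formatSyntaxError('PROJECT id, first name', 18, 'Expected comma (,) between column names')
// returns a three-line message with a ^ aligned under the unexpected identifier.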

Query plan caching

Currently every query is lexed and parsed from scratch, even when it is repeated. A simple optimization would be to cache parsed queries, eliminating parsing overhead for repeated queries.

When to add: After implementing performance measurement, only if profiling shows that time spent on parsing is >10% of query time.
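
A minimal sketch of such a cache, keyed by the trimmed query text (parseQuery and ParsedQuery are placeholders standing in for the project's real parser types):

// Placeholders standing in for the real parser types.
type ParsedQuery = unknown;
declare function parseQuery(query: string): ParsedQuery;

const planCache = new Map<string, ParsedQuery>();

// Parse once per distinct query text; repeated queries hit the cache.
function parseWithCache(query: string): ParsedQuery {
  const key = query.trim();
  let plan = planCache.get(key);
  if (plan === undefined) {
    plan = parseQuery(key);
    planCache.set(key, plan);
  }
  return plan;
}

An unbounded Map is fine for a REPL; a long-running service would want an LRU or a size cap.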

What changes are needed to accommodate changes to support other data types, multiple filters, or ordering of results?

Other data types (boolean, Date, null)

  • Lexer: recognize type literals like true / false / null, handle date syntax
  • Schema inference: detect additional types besides string and number
  • Filter validation: type-specific operator restrictions (> is not a valid operator for comparing booleans)
  • Execution: handle null comparison (null != null returns false)

Multiple filters

  • Lexer: recognize additional tokens like AND and OR and parentheses for grouping
  • Parser: implement expression parsing with operator precedence and parentheses, significant parser rewrite needed due to data structure change
  • Execution: significant change in filtering logic to support complex filters (see the sketch below)
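
As a rough idea of the data structure change, the single filter node could become a small expression tree. All names here are assumptions, not the current parser output:

// Hypothetical filter AST supporting AND/OR grouping.
type ComparisonOperator = '=' | '!=' | '>' | '>=' | '<' | '<=';

type FilterExpr =
  | { kind: 'comparison'; column: string; operator: ComparisonOperator; value: string | number }
  | { kind: 'and'; left: FilterExpr; right: FilterExpr }
  | { kind: 'or'; left: FilterExpr; right: FilterExpr };

// Assumed comparison helper for a single column/value pair.
declare function compareValues(left: string | number, op: ComparisonOperator, right: string | number): boolean;

// Evaluation becomes a recursive walk instead of a single comparison.
function matches(expr: FilterExpr, row: Record<string, string | number>): boolean {
  switch (expr.kind) {
    case 'comparison':
      return compareValues(row[expr.column], expr.operator, expr.value);
    case 'and':
      return matches(expr.left, row) && matches(expr.right, row);
    case 'or':
      return matches(expr.left, row) || matches(expr.right, row);
  }
}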

Ordering results

  • Lexer: add new tokens for keywords like ORDER, BY, ASC, DESC
  • Parser: add new logic for order-specific tokens
  • Execution: apply sorting after projection (see the sketch below)
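
The execution part is mostly a comparator applied after projection; a sketch with an assumed ORDER BY clause shape:

// Assumed ORDER BY clause shape and sorting step.
type Row = Record<string, string | number>;

interface OrderBy {
  column: string;
  direction: 'asc' | 'desc';
}

function sortRows(rows: Row[], order: OrderBy): Row[] {
  const sign = order.direction === 'asc' ? 1 : -1;
  return [...rows].sort((a, b) => {
    const left = a[order.column];
    const right = b[order.column];
    if (typeof left === 'number' && typeof right === 'number') return sign * (left - right);
    return sign * String(left).localeCompare(String(right));
  });
}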

What changes are needed to process extremely large datasets?

  • Streaming API: Return results incrementally (e.g. using async generators) instead of a full array - better memory usage, although ordering becomes more complex (it requires collecting all results first); see the sketch after this list
  • Data source abstraction: Abstract data behind interface - don't store all rows in memory, load on-demand
  • Chunk-based processing: Read and process CSV in batches (e.g., 10K rows) rather than loading entire file
  • LIMIT: Essential safety feature - allows early termination and prevents OOM
  • Column indexes: For frequently filtered columns - reduces O(n) scan to O(log n) lookup. High complexity.
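
A sketch of what a streaming result API could look like, built on an async generator over a row source; the interfaces are assumptions, not the current API:

// Hypothetical streaming interfaces for large datasets.
interface RowSource {
  rows(): AsyncIterable<Record<string, string>>;
}

interface CompiledQuery {
  matches(row: Record<string, string>): boolean;
  project(row: Record<string, string>): Record<string, string>;
}

// Rows are filtered and projected one at a time, so memory stays bounded,
// and a LIMIT allows early termination.
async function* executeStreaming(
  source: RowSource,
  query: CompiledQuery,
  limit = Infinity,
): AsyncGenerator<Record<string, string>> {
  let emitted = 0;
  for await (const row of source.rows()) {
    if (!query.matches(row)) continue;
    yield query.project(row);
    if (++emitted >= limit) return;
  }
}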

What do you still need to do to make this code production ready?

  • Error handling: File system errors, malformed CSV, better error messages with context
  • Input validation: Query string length limit (prevent DoS via lengthy query strings)
  • Testing: Integration tests with real CSV files, error scenarios, performance benchmarks
  • Logging: Structured logs with context (query, duration, errors)
  • Observability: Metrics for query count, duration of each step, error rates by type
  • Resource limits: Query timeout, memory limits, result set size limits
  • Documentation: JSDoc on public API, query syntax reference, error code guide
