A query engine for CSV data with a custom SQL-like syntax.
Load CSV files and query them using projection and filtering operations through an interactive REPL.
- Node.js 20.18.0 or higher
Install dependencies:
```bash
npm install
```

Run the query engine with the default dataset (`data/countries.csv`):

```bash
npm start
```

Or with a specific CSV file:

```bash
npm start -- -f path/to/your/data.csv
```

- `-f <file>` - Specify the CSV file to load (default: `data/countries.csv`)
- `-h, --help` - Show the help message
The query engine supports a simple SQL-like syntax:
```
PROJECT column1
PROJECT column1, column2
PROJECT column1 FILTER column2 > "value"
PROJECT column1, column2 FILTER column3 = 5
PROJECT column2 FILTER column3 != column4
```
Supported filter operators: `=`, `!=`, `>`, `>=`, `<`, `<=`
- `.columns` - Show available columns in the loaded dataset
- `.exit` - Quit the REPL
```
.
├── src/
│   ├── index.ts                    # Main entry point
│   ├── cli.ts                      # Command line argument parsing
│   ├── repl.ts                     # REPL implementation
│   ├── query-engine.ts             # Query execution engine
│   ├── query-parser.ts             # Query parser
│   ├── lex.ts                      # Lexer for tokenization
│   ├── csv-loader.ts               # CSV file loader
│   ├── column-schema-inference.ts  # Type inference for columns
│   └── errors/                     # Error types
├── test/                           # Test files
└── data/                           # Sample CSV data files
```
Run tests:
```bash
npm test
```

Run tests in watch mode:

```bash
npm run test:watch
```

Run tests with coverage:

```bash
npm run test:coverage
```

Build the project:

```bash
npm run build
```

Lint code:

```bash
npm run lint:fix
```

Format code:

```bash
npm run format
```

What were some of the tradeoffs you made when building this and why were these acceptable tradeoffs?
Decision: Used the @fast-csv/parse library instead of implementing CSV parsing from scratch.
Why:
- Uses a battle-tested library with streaming support for large files
- Saved implementation time for other parts of the project
Downside:
- Goes slightly beyond spec (supports floating point numbers, not just integers)
Acceptable because: The focus of the project is the query engine, not CSV parsing.
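For illustration, a minimal sketch of how a loader can wrap `@fast-csv/parse` with streaming. The `loadCsv` function below is hypothetical; the real `csv-loader.ts` may differ.

```ts
import { createReadStream } from "node:fs";
import { parse } from "@fast-csv/parse";

// Hypothetical loader: streams the file and collects rows keyed by header name.
export function loadCsv(path: string): Promise<Record<string, string>[]> {
  return new Promise((resolve, reject) => {
    const rows: Record<string, string>[] = [];
    createReadStream(path)
      .pipe(parse({ headers: true }))
      .on("error", reject)
      .on("data", (row: Record<string, string>) => rows.push(row))
      .on("end", () => resolve(rows));
  });
}
```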
Decision: Built a custom lexer and parser
Why:
- Full control over the grammar of the query language
- Better error messages with precise token positions
- Easier to extend (adding new tokens, keywords)
- Fun challenge
Downside:
- More code to maintain
- Many more tests to write compared to using a battle-tested library
Acceptable because: For a query engine project, building my own query parser felt natural. It shows that I can do more than just use libraries.
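To illustrate the "precise token positions" point, this is the kind of token shape that makes it possible. The names and fields below are assumptions, not the actual `lex.ts` types.

```ts
// Hypothetical token shape; carrying the source offset is what enables
// precise, position-aware error messages.
type TokenType =
  | "project" | "filter" | "identifier" | "number" | "string"
  | "comma" | "operator" | "eof";

interface Token {
  type: TokenType;
  value: string;    // raw text, e.g. "PROJECT", "name", ">="
  position: number; // zero-based offset into the query string
}

// A parse error can then point at the exact offset of the unexpected token.
class QuerySyntaxError extends Error {
  constructor(message: string, readonly position: number) {
    super(`${message} (at position ${position})`);
  }
}
```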
Decision: Support integers and decimals
Why:
- No limitation on the data side: a numeric value can be either an integer or a decimal
Downside:
- Slightly more complex lexer implementation
Acceptable because: A small implementation-time overhead for a very useful feature.
Decision: Make query keywords case insensitive
Why:
- Easier to write queries when you don't need to switch case between keywords and column names
- Better UX
- Follows the SQL standard
Downside:
- Slightly more complex logic
- More test cases required
Acceptable because: Better UX for a small performance overhead
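A sketch of the case-insensitive keyword check: only keywords are normalized, so column names keep their original casing. This is illustrative only, not the actual lexer code.

```ts
const KEYWORDS = new Set(["PROJECT", "FILTER"]);

// Keywords match regardless of case; anything else stays a case-sensitive identifier.
function classifyWord(word: string): { type: "keyword" | "identifier"; value: string } {
  const upper = word.toUpperCase();
  return KEYWORDS.has(upper)
    ? { type: "keyword", value: upper }
    : { type: "identifier", value: word };
}

classifyWord("project"); // { type: "keyword", value: "PROJECT" }
classifyWord("Name");    // { type: "identifier", value: "Name" }
```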
Decision: Support both double and single quotes
Why:
- Better UX
Downside:
- Slightly more complex lexer logic
- More test cases required
Acceptable because: Better UX for a small amount of added complexity
Decision: Implement all comparison operators
Why:
- Marginal additional implementation effort
- Makes queries more useful
Downside:
- More test cases needed
- Slightly more complex lexer
Acceptable because: Makes the query engine much more usable for real-world queries
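Since all six operators reduce to a single three-way comparison, the execution side can stay small. An illustrative sketch (not the actual `query-engine.ts` code):

```ts
type Operator = "=" | "!=" | ">" | ">=" | "<" | "<=";

// Reduce every comparison to a three-way result, then map the operator onto it.
// Values are assumed to be pre-coerced to the column's inferred type.
function compare(left: string | number, op: Operator, right: string | number): boolean {
  const diff =
    typeof left === "number" && typeof right === "number"
      ? left - right
      : String(left).localeCompare(String(right));
  switch (op) {
    case "=":  return diff === 0;
    case "!=": return diff !== 0;
    case ">":  return diff > 0;
    case ">=": return diff >= 0;
    case "<":  return diff < 0;
    case "<=": return diff <= 0;
  }
}
```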
Decision: Infer schema in query engine
Why:
- No additional responsibility on the CSV loader to gather type information
- Data type handling is "hidden" from the public API
- Less prone to human error; invalid types cannot be passed to the query engine manually
Downside:
- The query engine has to infer types from the data, which adds a performance overhead
Acceptable because: Simpler public API, harder to misuse
Decision: Sample the first, middle, and last rows instead of scanning the entire dataset
Why:
- Checking just one row (the first) felt too brittle; checking multiple rows should cover more edge cases
- `O(k)` schema inference instead of `O(n * k)` for a full scan (`n` = number of rows, `k` = number of columns)
- CSV files are typically homogeneous
Downside:
- Won't detect type inconsistencies in the full dataset
- Could misclassify if first/middle/last rows aren't representative
Acceptable because: Sampling three positions should catch most issues while maintaining almost constant-time performance.
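A sketch of the sampling approach described above. The regex and function name are illustrative, not the actual `column-schema-inference.ts` code.

```ts
type ColumnType = "string" | "number";

const NUMBER_PATTERN = /^-?\d+(\.\d+)?$/; // matches both integers and decimals

// Infer a column's type by sampling only the first, middle and last rows.
function inferColumnType(rows: Record<string, string>[], column: string): ColumnType {
  if (rows.length === 0) return "string";
  const samples = [rows[0], rows[Math.floor(rows.length / 2)], rows[rows.length - 1]];
  return samples.every((row) => NUMBER_PATTERN.test(row[column])) ? "number" : "string";
}
```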
Decision: Built a REPL with helper commands instead of just a simple script
Why:
- Better demo experience for reviewers
- Easier to test queries interactively
- Shows error messages from parser / query engine
- Demonstrates the helper function that presents results in a readable way
- Helper commands make it easier to explore the dataset
Downside:
- Additional REPL-specific code needed
- Requires manually typing queries to test implementation rather than just launching a script
Acceptable because: Makes the project more usable; testing different queries interactively is easier than modifying a script and re-running it.
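A rough sketch of the REPL loop on top of `node:readline`. The `runQuery` callback, prompt, and command handling are assumptions about `repl.ts`, not the actual code.

```ts
import * as readline from "node:readline";

// Minimal REPL loop: dispatch dot-commands, hand everything else to the engine.
export function startRepl(runQuery: (query: string) => string, columns: string[]): void {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
    prompt: "query> ",
  });
  rl.prompt();
  rl.on("line", (line) => {
    const input = line.trim();
    if (input === ".exit") return rl.close();
    if (input === ".columns") console.log(columns.join(", "));
    else if (input.length > 0) {
      try {
        console.log(runQuery(input));
      } catch (err) {
        console.error(err instanceof Error ? err.message : err);
      }
    }
    rl.prompt();
  });
}
```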
Decision: Allow passing custom CSV file via command line argument, with fallback to a sample file
Why:
- Allows testing the project on different datasets
- Common practice to support changing config/data through arguments
- The fallback to a sample file means the user doesn't need their own CSV file to try the project
Downside:
- Requires command-line argument parsing
Acceptable because: Provides an easy way to quickly change the dataset without changing the source code.
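Node 20 ships `util.parseArgs`, so the flag handling can be a few lines. A sketch along these lines; whether `cli.ts` actually uses `parseArgs` or a hand-rolled parser is not specified here, so treat this as illustrative.

```ts
import { parseArgs } from "node:util";

const DEFAULT_FILE = "data/countries.csv";

// Parse `-f <file>` and `-h/--help`, falling back to the bundled sample file.
export function parseCliArgs(argv: string[]): { file: string; help: boolean } {
  const { values } = parseArgs({
    args: argv,
    options: {
      file: { type: "string", short: "f", default: DEFAULT_FILE },
      help: { type: "boolean", short: "h", default: false },
    },
  });
  return { file: values.file as string, help: values.help as boolean };
}
```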
Currently there is no way to limit how many results are taken from the dataset, and for large result sets the display logic might not be able to show all of them.
When to add: Immediately; this can prevent OOM issues and is critical for use with large datasets.
Example: `PROJECT name FILTER age > 25 LIMIT 10`
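If LIMIT were added, execution could also terminate early instead of filtering the whole dataset and slicing afterwards. A sketch of that idea:

```ts
// Stop scanning as soon as `limit` matching rows have been collected.
function filterWithLimit<T>(rows: T[], matches: (row: T) => boolean, limit: number): T[] {
  const results: T[] = [];
  for (const row of rows) {
    if (!matches(row)) continue;
    results.push(row);
    if (results.length >= limit) break; // early termination
  }
  return results;
}
```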
While some error types contain additional data, that context is not utilized as much as it could be. Some syntax errors may be confusing to a user who is not familiar with the query language grammar.
Example: `PROJECT id, first name` results in a `Syntax error: Expected <eof> token, got <identifier>` error, while it could say `Expected comma (,) between column names` and display a `^` pointer at the error position.
Even `Expected <comma> token, got <identifier>` would be better.
When to add: before the query language goes public :)
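A sketch of what position-aware error formatting could look like, assuming the tokens carry their source offsets (the function name is illustrative):

```ts
// Render a syntax error with a caret pointing at the offending token's offset.
function formatSyntaxError(query: string, position: number, expected: string, got: string): string {
  return [
    `Syntax error: expected ${expected}, got ${got}`,
    query,
    " ".repeat(position) + "^",
  ].join("\n");
}

// Prints the message, echoes the query, and draws "^" under offset 18 (the start of `name`):
console.log(formatSyntaxError("PROJECT id, first name", 18, "comma (,)", "<identifier>"));
```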
Currently every query is lexed and parsed from scratch, even for repeated queries. A simple optimization would be to keep a cache of parsed queries, which would eliminate parsing overhead for repeats.
When to add: After implementing performance measurement, only if profiling shows that time spent on parsing is >10% of query time.
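The cache itself would be small. A sketch assuming a `parse(query)` entry point that returns the parsed query; the names are illustrative.

```ts
// Assumed shape of the parser output and entry point (names are illustrative).
type ParsedQuery = { columns: string[]; filter?: unknown };
declare function parse(query: string): ParsedQuery;

// Repeated query strings skip lexing and parsing entirely.
const parsedQueryCache = new Map<string, ParsedQuery>();

function parseCached(query: string): ParsedQuery {
  const key = query.trim();
  const cached = parsedQueryCache.get(key);
  if (cached) return cached;
  const parsed = parse(key);
  parsedQueryCache.set(key, parsed);
  return parsed;
}
```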
What changes are needed to support other data types, multiple filters, or ordering of results?
Other data types:
- Lexer: recognize type literals like `true`/`false`/`null`, handle date syntax
- Schema inference: detect additional types besides `string` and `number`
- Filter validation: type-specific operator restrictions (`>` is not a valid operator for comparing booleans)
- Execution: handle null comparison (`null != null` returns `false`)
Multiple filters:
- Lexer: recognize additional tokens like `AND` and `OR`, and parentheses for grouping
- Parser: implement expression parsing with operator precedence and parentheses; a significant parser rewrite is needed due to the data structure change (see the sketch after this list)
- Execution: significant change in filtering logic to support complex filters
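An illustrative shape for that data structure change: the filter becomes a recursive expression tree that execution evaluates per row. Type and function names below are assumptions.

```ts
// Today's filter is a single comparison; with AND/OR it becomes a recursive tree.
type Comparison = { kind: "comparison"; left: string; operator: string; right: string | number };
type LogicalExpr = { kind: "and" | "or"; left: FilterExpr; right: FilterExpr };
type FilterExpr = Comparison | LogicalExpr;

// Assumed existing single-comparison logic, elided here.
declare function compareRow(row: Record<string, string | number>, cmp: Comparison): boolean;

// Execution evaluates the tree recursively per row.
function evaluate(expr: FilterExpr, row: Record<string, string | number>): boolean {
  if (expr.kind === "comparison") return compareRow(row, expr);
  return expr.kind === "and"
    ? evaluate(expr.left, row) && evaluate(expr.right, row)
    : evaluate(expr.left, row) || evaluate(expr.right, row);
}
```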
Ordering of results:
- Lexer: add new tokens for keywords like `ORDER`, `BY`, `ASC`, `DESC`
- Parser: add new logic for the order-specific tokens
- Execution: apply sorting after projection (see the sketch below)
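A sketch of that sorting step, using the value's runtime type to choose between numeric and string comparison. Illustrative only.

```ts
// Sorting applied after filtering/projection.
function orderBy(
  rows: Record<string, string | number>[],
  column: string,
  direction: "asc" | "desc",
): Record<string, string | number>[] {
  const sign = direction === "asc" ? 1 : -1;
  return [...rows].sort((a, b) => {
    const left = a[column];
    const right = b[column];
    const diff =
      typeof left === "number" && typeof right === "number"
        ? left - right
        : String(left).localeCompare(String(right));
    return sign * diff;
  });
}
```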
- Streaming API: Return results incrementally (e.g. using async generators, as sketched after this list) instead of a full array - better memory usage, although ordering becomes more complex (it requires collecting all results first)
- Data source abstraction: Abstract the data behind an interface - don't store all rows in memory, load on demand
- Chunk-based processing: Read and process CSV in batches (e.g., 10K rows) rather than loading entire file
- LIMIT: Essential safety feature - allows early termination and prevents OOM
- Column indexes: For frequently filtered columns - reduces O(n) scan to O(log n) lookup. High complexity.
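A sketch of the streaming idea from the first bullet, assuming a row-by-row asynchronous data source (names are illustrative):

```ts
// Stream matching rows one at a time instead of materializing the full result array.
// `source` could be backed by a streaming CSV reader rather than an in-memory array.
async function* executeStreaming(
  source: AsyncIterable<Record<string, string>>,
  matches: (row: Record<string, string>) => boolean,
  columns: string[],
): AsyncGenerator<Record<string, string>> {
  for await (const row of source) {
    if (!matches(row)) continue;
    // Project only the requested columns before yielding.
    const projected: Record<string, string> = {};
    for (const column of columns) projected[column] = row[column];
    yield projected;
  }
}
```

A LIMIT then becomes a simple early `break` in the consumer loop.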
- Error handling: File system errors, malformed CSV, better error messages with context
- Input validation: Query string length limit (prevent DoS via lengthy query strings)
- Testing: Integration tests with real CSV files, error scenarios, performance benchmarks
- Logging: Structured logs with context (query, duration, errors)
- Observability: Metrics for query count, duration of each step, error rates by type
- Resource limits: Query timeout, memory limits, result set size limits
- Documentation: JSDoc on public API, query syntax reference, error code guide