Support JOIN in all of its wonderful incarnations. The initial implementation should focus on correctness on not worry about optimizing the join order based on table statistics.
Has there been any work done on this feature behind the scenes? If not, is there perhaps some design documentation already available?
It could be fun to try to give this one a shot to contribute...
We have been thinking about the more general problem of how to distribute SQL computation across the cluster, there is an RFC at https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/distributed_sql.md
We are aiming to have some limited join support before that, using code that will be reusable later within the distributed SQL framework. But even that will involve quite a bit of code restructuring, TBH this doesn't seem like a good starter project for someone who isn't familiar with the codebase.
Alright. I will let this one slide then, see if there's something interesting with the help-wanted label.
A few options come to mind that might be well suited for getting started with the codebase:
CREATE TABLE ... AS
Have you considered writing a Presto connector for CockroachDB? Presto is a full distributed SQL query engine with pluggable connectors (data sources) and supports distributed joins, including joins between different connectors.
We support batch index joins for tables that are indexed on the join key and otherwise support broadcast and distributed hash joins.
@electrum Presto appears targeted at analytics, while CockroachDB is targeted at transactional workloads. Beyond that, Presto is written in Java while CockroachDB is written in Go. Calling out to Java for SQL execution doesn't seem good from a performance perspective for transactional workloads.
@petermattis You're correct, Presto is definitely targeted at analytics, although the engine itself is capable of low latency queries. We have an internal connector at Facebook based on a sharded MySQL backend that can do complex, multi-way index join queries for reporting workloads in hundreds of milliseconds: https://www.youtube.com/watch?v=Gf9JqvNNRZg
I'm definitely not suggesting calling out to or trying to use Presto within CoackroachDB -- as you say, that would be horrible for transactional workloads, nor is it technically feasible. However, it could be a good complement for other workloads like reporting, analytics, ETL, batch pipelines, combining heterogeneous data sources, etc., and might also serve as a stop gap.
Unrelated, I really like that design document and all the rest of the documentation for the project. It's probably the best documented project I've seen and is a model for others to strive towards.
CockroachDB speaks the postgres wire protocol and our SQL is similar to the PostgreSQL dialect. The existing presto-postgres connector might work (with some adjustments).
Thanks for the note about the documentation. Is it nice to to hear those efforts are being recognized and appreciated.
I'm going to close this now #7202 is in. There's more work to be done, but the spirit of this issue is implemented.