Join GitHub today
[ADAM-1680] Eliminate non-determinism in the ShuffleRegionJoin. #1754
Resolves #1680. In the existing shuffle region join code, the partition start/stop boundaries are determined by sorting the data and looking at the coordinates of the first and last record on each partition. This is done via a sampling process, which is fundamentally non-deterministic. Once these partition bounds are picked, we replicate records from the right side of the join into the partitions they overlap. There was a bug in this step that led to records being dropped from the first partition of each contig (except for the first contig). This was due to a bug in how the IntervalArray was being used to search for repartitioning bounds.
This PR replaces the interval array with a map between contig names and the partitions that contain their data. The first and last partition of each contig are extended to the start/end of each contig.