Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1680] Eliminate non-determinism in the ShuffleRegionJoin. #1754

Merged

Conversation

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Oct 7, 2017

Resolves #1680. In the existing shuffle region join code, the partition start/stop boundaries are determined by sorting the data and looking at the coordinates of the first and last record on each partition. This is done via a sampling process, which is fundamentally non-deterministic. Once these partition bounds are picked, we replicate records from the right side of the join into the partitions they overlap. There was a bug in this step that led to records being dropped from the first partition of each contig (except for the first contig). This was due to a bug in how the IntervalArray was being used to search for repartitioning bounds.

This PR replaces the interval array with a map between contig names and the partitions that contain their data. The first and last partition of each contig are extended to the start/end of each contig.

Resolves #1680. In the existing shuffle region join code, the partition
start/stop boundaries are determined by sorting the data and looking at the
coordinates of the first and last record on each partition. This is done via a
sampling process, which is fundamentally non-deterministic. Once these partition
bounds are picked, we replicate records from the right side of the join into the
partitions they overlap. There was a bug in this step that led to records being
dropped from the first partition of each contig (except for the first contig).
This was due to a bug in how the IntervalArray was being used to search for
repartitioning bounds.

This PR replaces the interval array with a map between contig names and the
partitions that contain their data. The first and last partition of each contig
are extended to the start/end of each contig.
@fnothaft fnothaft added this to the 0.23.0 milestone Oct 7, 2017
@fnothaft fnothaft requested a review from devin-petersohn Oct 7, 2017
@AmplabJenkins
Copy link

@AmplabJenkins AmplabJenkins commented Oct 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2409/
Test PASSed.

// and create a record if the sequence is not in the map
var lastIdx = 0
val missingSequences: Map[String, Seq[(ReferenceRegion, Int)]] =
sequences.records

This comment has been minimized.

@heuermh

heuermh Oct 9, 2017
Member

there wasn't a dependency on sequences before, will this still work if both RDDs have empty sequence dictionaries? I can give it a try with two FeatureRDDs

This comment has been minimized.

@fnothaft

fnothaft Oct 11, 2017
Author Member

Yeah, we can't guarantee correct behavior if you don't have correctly defined sequence dictionaries.

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Oct 11, 2017

Ping for review.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

Looks good, do you have an idea how this affects runtime?

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Oct 12, 2017

I can go back and rerun if you need numbers, but the perf impact was reasonably negligible.

Copy link
Member

@devin-petersohn devin-petersohn left a comment

It doesn't seem like it would add much runtime, but I wanted to see if you saw something different. Looks great!

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Oct 13, 2017

Can we merge this?

@heuermh
Copy link
Member

@heuermh heuermh commented Oct 13, 2017

Can we merge this?

I've been running joins of features to features, to see if behavior changed per #1754 (comment). No answer to that question yet, unfortunately

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Oct 13, 2017

My opinion is that we should require the user to provide valid sequence dictionaries. GIGO.

@fnothaft
Copy link
Member Author

@fnothaft fnothaft commented Oct 17, 2017

Pinging for a merge.

@heuermh
Copy link
Member

@heuermh heuermh commented Oct 17, 2017

I've been trying to cause troubles by joining various features with no sequence dictionaries and haven't found any, in fact the answers are correct whereas before they were not always correct.

@heuermh heuermh merged commit 02f0abe into bigdatagenomics:master Oct 17, 2017
1 of 2 checks passed
1 of 2 checks passed
codacy/pr Not so good... This pull request quality could be better.
Details
default Merged build finished.
Details
@heuermh
Copy link
Member

@heuermh heuermh commented Oct 17, 2017

Thank you, @fnothaft. Thank you for the review, @devin-petersohn.

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.