New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1680] Eliminate non-determinism in the ShuffleRegionJoin. #1754

Merged
merged 1 commit into from Oct 17, 2017

Conversation

Projects
4 participants
@fnothaft
Member

fnothaft commented Oct 7, 2017

Resolves #1680. In the existing shuffle region join code, the partition start/stop boundaries are determined by sorting the data and looking at the coordinates of the first and last record on each partition. This is done via a sampling process, which is fundamentally non-deterministic. Once these partition bounds are picked, we replicate records from the right side of the join into the partitions they overlap. There was a bug in this step that led to records being dropped from the first partition of each contig (except for the first contig). This was due to a bug in how the IntervalArray was being used to search for repartitioning bounds.

This PR replaces the interval array with a map between contig names and the partitions that contain their data. The first and last partition of each contig are extended to the start/end of each contig.

[ADAM-1680] Eliminate non-determinism in the ShuffleRegionJoin.
Resolves #1680. In the existing shuffle region join code, the partition
start/stop boundaries are determined by sorting the data and looking at the
coordinates of the first and last record on each partition. This is done via a
sampling process, which is fundamentally non-deterministic. Once these partition
bounds are picked, we replicate records from the right side of the join into the
partitions they overlap. There was a bug in this step that led to records being
dropped from the first partition of each contig (except for the first contig).
This was due to a bug in how the IntervalArray was being used to search for
repartitioning bounds.

This PR replaces the interval array with a map between contig names and the
partitions that contain their data. The first and last partition of each contig
are extended to the start/end of each contig.

@fnothaft fnothaft added this to the 0.23.0 milestone Oct 7, 2017

@fnothaft fnothaft requested a review from devin-petersohn Oct 7, 2017

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Oct 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2409/
Test PASSed.

AmplabJenkins commented Oct 7, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2409/
Test PASSed.

// and create a record if the sequence is not in the map
var lastIdx = 0
val missingSequences: Map[String, Seq[(ReferenceRegion, Int)]] =
sequences.records

This comment has been minimized.

@heuermh

heuermh Oct 9, 2017

Member

there wasn't a dependency on sequences before, will this still work if both RDDs have empty sequence dictionaries? I can give it a try with two FeatureRDDs

@heuermh

heuermh Oct 9, 2017

Member

there wasn't a dependency on sequences before, will this still work if both RDDs have empty sequence dictionaries? I can give it a try with two FeatureRDDs

This comment has been minimized.

@fnothaft

fnothaft Oct 11, 2017

Member

Yeah, we can't guarantee correct behavior if you don't have correctly defined sequence dictionaries.

@fnothaft

fnothaft Oct 11, 2017

Member

Yeah, we can't guarantee correct behavior if you don't have correctly defined sequence dictionaries.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 11, 2017

Member

Ping for review.

Member

fnothaft commented Oct 11, 2017

Ping for review.

@devin-petersohn

Looks good, do you have an idea how this affects runtime?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 12, 2017

Member

I can go back and rerun if you need numbers, but the perf impact was reasonably negligible.

Member

fnothaft commented Oct 12, 2017

I can go back and rerun if you need numbers, but the perf impact was reasonably negligible.

@devin-petersohn

It doesn't seem like it would add much runtime, but I wanted to see if you saw something different. Looks great!

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 13, 2017

Member

Can we merge this?

Member

fnothaft commented Oct 13, 2017

Can we merge this?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 13, 2017

Member

Can we merge this?

I've been running joins of features to features, to see if behavior changed per #1754 (comment). No answer to that question yet, unfortunately

Member

heuermh commented Oct 13, 2017

Can we merge this?

I've been running joins of features to features, to see if behavior changed per #1754 (comment). No answer to that question yet, unfortunately

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 13, 2017

Member

My opinion is that we should require the user to provide valid sequence dictionaries. GIGO.

Member

fnothaft commented Oct 13, 2017

My opinion is that we should require the user to provide valid sequence dictionaries. GIGO.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 17, 2017

Member

Pinging for a merge.

Member

fnothaft commented Oct 17, 2017

Pinging for a merge.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 17, 2017

Member

I've been trying to cause troubles by joining various features with no sequence dictionaries and haven't found any, in fact the answers are correct whereas before they were not always correct.

Member

heuermh commented Oct 17, 2017

I've been trying to cause troubles by joining various features with no sequence dictionaries and haven't found any, in fact the answers are correct whereas before they were not always correct.

@heuermh heuermh merged commit 02f0abe into bigdatagenomics:master Oct 17, 2017

1 of 2 checks passed

codacy/pr Not so good... This pull request quality could be better.
Details
default Merged build finished.
Details
@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Oct 17, 2017

Member

Thank you, @fnothaft. Thank you for the review, @devin-petersohn.

Member

heuermh commented Oct 17, 2017

Thank you, @fnothaft. Thank you for the review, @devin-petersohn.

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment