New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Region join shows non-determinism #1680

Closed
fnothaft opened this Issue Aug 24, 2017 · 1 comment

Comments

Projects
1 participant
@fnothaft
Member

fnothaft commented Aug 24, 2017

Running on a WGS read file vs. the GTF of gene coordinates, I see non-determinism when running the shuffle region join. Specifically, I see about a 0.03% difference in the number of records that I get at the output of the join.

@fnothaft fnothaft added the bug label Aug 24, 2017

@fnothaft fnothaft added this to the 0.23.0 milestone Aug 24, 2017

@fnothaft fnothaft self-assigned this Aug 24, 2017

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Oct 6, 2017

Member

The non-determinism seems to come from here. Specifically, the records from the right side of the join that fall in the first bin on the (not-first) contig are getting dropped. I'm guessing this is an issue in the "require overlap = false" part of the IntervalArray.get function.

Member

fnothaft commented Oct 6, 2017

The non-determinism seems to come from here. Specifically, the records from the right side of the join that fall in the first bin on the (not-first) contig are getting dropped. I'm guessing this is an issue in the "require overlap = false" part of the IntervalArray.get function.

fnothaft added a commit to fnothaft/adam that referenced this issue Oct 6, 2017

[ADAM-1680] Eliminate non-determinism in the ShuffleRegionJoin.
Resolves bigdatagenomics#1680. In the existing shuffle region join code, the partition
start/stop boundaries are determined by sorting the data and looking at the
coordinates of the first and last record on each partition. This is done via a
sampling process, which is fundamentally non-deterministic. Once these partition
bounds are picked, we replicate records from the right side of the join into the
partitions they overlap. There was a bug in this step that led to records being
dropped from the first partition of each contig (except for the first contig).
This was due to a bug in how the IntervalArray was being used to search for
repartitioning bounds.

This PR replaces the interval array with a map between contig names and the
partitions that contain their data. The first and last partition of each contig
are extended to the start/end of each contig.

fnothaft added a commit to fnothaft/adam that referenced this issue Oct 7, 2017

[ADAM-1680] Eliminate non-determinism in the ShuffleRegionJoin.
Resolves bigdatagenomics#1680. In the existing shuffle region join code, the partition
start/stop boundaries are determined by sorting the data and looking at the
coordinates of the first and last record on each partition. This is done via a
sampling process, which is fundamentally non-deterministic. Once these partition
bounds are picked, we replicate records from the right side of the join into the
partitions they overlap. There was a bug in this step that led to records being
dropped from the first partition of each contig (except for the first contig).
This was due to a bug in how the IntervalArray was being used to search for
repartitioning bounds.

This PR replaces the interval array with a map between contig names and the
partitions that contain their data. The first and last partition of each contig
are extended to the start/end of each contig.

heuermh added a commit that referenced this issue Oct 17, 2017

[ADAM-1680] Eliminate non-determinism in the ShuffleRegionJoin.
Resolves #1680. In the existing shuffle region join code, the partition
start/stop boundaries are determined by sorting the data and looking at the
coordinates of the first and last record on each partition. This is done via a
sampling process, which is fundamentally non-deterministic. Once these partition
bounds are picked, we replicate records from the right side of the join into the
partitions they overlap. There was a bug in this step that led to records being
dropped from the first partition of each contig (except for the first contig).
This was due to a bug in how the IntervalArray was being used to search for
repartitioning bounds.

This PR replaces the interval array with a map between contig names and the
partitions that contain their data. The first and last partition of each contig
are extended to the start/end of each contig.

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment