Region join shows non-determinism #1680

Closed
fnothaft opened this issue Aug 24, 2017 · 1 comment

@fnothaft fnothaft commented Aug 24, 2017

Running on a WGS read file vs. the GTF of gene coordinates, I see non-determinism when running the shuffle region join. Specifically, I see about a 0.03% difference in the number of records that I get at the output of the join.

@fnothaft fnothaft added the bug label Aug 24, 2017
@fnothaft fnothaft added this to the 0.23.0 milestone Aug 24, 2017
@fnothaft fnothaft self-assigned this Aug 24, 2017

@fnothaft fnothaft commented Oct 6, 2017

The non-determinism seems to come from here. Specifically, records from the right side of the join that fall in the first bin of any contig other than the first are getting dropped. I'm guessing this is an issue in the "require overlap = false" part of the IntervalArray.get function.

fnothaft added a commit to fnothaft/adam that referenced this issue Oct 6, 2017
Resolves bigdatagenomics#1680. In the existing shuffle region join code, the partition
start/stop boundaries are determined by sorting the data and looking at the
coordinates of the first and last record on each partition. This is done via a
sampling process, which is fundamentally non-deterministic. Once these partition
bounds are picked, we replicate records from the right side of the join into the
partitions they overlap. There was a bug in this step that led to records being
dropped from the first partition of each contig (except for the first contig).
This was due to a bug in how the IntervalArray was being used to search for
repartitioning bounds.

This PR replaces the interval array with a map between contig names and the
partitions that contain their data. The first and last partition of each contig
are extended to the start/end of each contig.
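
The replacement structure described in the commit message above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not ADAM's Scala implementation: the names `Region`, `build_contig_partition_map`, and `partitions_for` are hypothetical, coordinates are assumed to be 0-based and half-open, and each partition is assumed to report one observed span per contig it holds. The point it shows is that once the first and last partition of each contig are extended to the contig's start and end, a right-side record near the beginning of a later contig still maps to a partition.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Region(NamedTuple):
    contig: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


# (partition index, start, end) for one partition's span on one contig
Bound = Tuple[int, int, int]


def build_contig_partition_map(
    observed: List[Tuple[int, Region]],
    contig_lengths: Dict[str, int],
) -> Dict[str, List[Bound]]:
    """Group observed partition spans by contig, then stretch the first
    span on each contig back to 0 and the last one out to the contig end."""
    by_contig: Dict[str, List[Bound]] = defaultdict(list)
    for idx, region in observed:
        by_contig[region.contig].append((idx, region.start, region.end))

    extended: Dict[str, List[Bound]] = {}
    for contig, bounds in by_contig.items():
        bounds.sort(key=lambda b: b[1])
        first_idx, _, first_end = bounds[0]
        bounds[0] = (first_idx, 0, first_end)
        last_idx, last_start, _ = bounds[-1]
        bounds[-1] = (last_idx, last_start, contig_lengths[contig])
        extended[contig] = bounds
    return extended


def partitions_for(region: Region, contig_map: Dict[str, List[Bound]]) -> List[int]:
    """Return every partition whose (extended) span overlaps the region."""
    return [
        idx
        for idx, start, end in contig_map.get(region.contig, [])
        if region.start < end and start < region.end
    ]


# Hypothetical spans: what each partition saw after the sort.
observed = [
    (0, Region("chr1", 100, 900)),
    (1, Region("chr1", 950, 1800)),
    (2, Region("chr2", 500, 1200)),
    (3, Region("chr2", 1300, 2500)),
]
contig_lengths = {"chr1": 2000, "chr2": 3000}
contig_map = build_contig_partition_map(observed, contig_lengths)

# A record at the very start of the second contig lies before the first
# observed span there, but the extended bound still assigns it a partition.
early_on_chr2 = partitions_for(Region("chr2", 10, 60), contig_map)
# A record straddling two spans is replicated to both partitions.
straddling = partitions_for(Region("chr1", 890, 960), contig_map)
```

Because the lookup is keyed by contig name, finding the partitions for a record never depends on where one contig's spans sit relative to another's. The sampled bounds can still differ from run to run, but under these assumptions every right-side record on a known contig that lies before the first or after the last observed span is still assigned to a partition.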
fnothaft added a commit to fnothaft/adam that referenced this issue Oct 7, 2017
heuermh added a commit that referenced this issue Oct 17, 2017
@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018