loadPartitionedParquetAlignments fails with Reference.all #1967

Closed
heuermh opened this Issue Mar 26, 2018 · 4 comments

heuermh commented Mar 26, 2018

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignments = sc.loadAlignments("adam-core/src/test/resources/small.sam")
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = RDDBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> alignments.saveAsPartitionedParquet("small.partitioned.alignments.adam")

scala> val partitioned = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam")
partitioned: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = DatasetBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> partitioned.sequences
res1: org.bdgenomics.adam.models.SequenceDictionary =
SequenceDictionary{
1->249250621, 0
2->243199373, 1}

scala> partitioned.dataset.count()
res0: Long = 20

scala> partitioned.dataset.filter(partitioned.dataset.col("contigName").equalTo("1")).count()
res2: Long = 20

scala> partitioned.dataset.filter(partitioned.dataset.col("contigName").equalTo("2")).count()
res3: Long = 0

scala> val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion.all("1")))
chr1: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = DatasetBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> chr1.dataset.count()
res4: Long = 0

scala> val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion.fromGenomicRange("1", 0L, 249250621L)))
chr1: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = DatasetBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> chr1.dataset.count()
res5: Long = 20

Ping @jpdna for triage

jpdna commented Mar 26, 2018

Is it expected that:

scala> ReferenceRegion.all("1")
res7: org.bdgenomics.adam.models.ReferenceRegion = ReferenceRegion(1,0,9223372036854775807,INDEPENDENT)

Surely 9223372036854775807 is not correct?

If I try to use that ReferenceRegion directly, I get an error printed to the console:

scala> val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion(1,0,9223372036854775807,INDEPENDENT) )
<console>:1: error: integer number too large
val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion(1,0,9223372036854775807,INDEPENDENT) )

but it would seem that the error is getting swallowed by the loadPartitionedParquetAlignments call.
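A side note on the console error above: in Scala, an integer literal without an `L` suffix is parsed as an `Int`, so the compiler rejects 9223372036854775807 with "integer number too large" before any ADAM code runs. A minimal sketch follows. The `Region` case class is a hypothetical stand-in for `ReferenceRegion`, so the snippet runs without ADAM on the classpath.

```scala
object LongLiteralSketch {
  // Hypothetical stand-in for ADAM's ReferenceRegion (assumption, for illustration only).
  final case class Region(referenceName: String, start: Long, end: Long)

  // The trailing L makes this a Long literal. Without it, the compiler
  // reports "integer number too large" because the value does not fit in an Int.
  val all: Region = Region("1", 0L, 9223372036854775807L)

  def main(args: Array[String]): Unit = {
    // The literal is exactly Long.MaxValue.
    println(all.end == Long.MaxValue)
  }
}
```

With the `L` suffix the value compiles and equals `Long.MaxValue`, so the literal itself is not evidence of a swallowed runtime error.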

jpdna commented Mar 26, 2018

Ah yeah, 9223372036854775807 is Long.MaxValue, which comes from:

ReferenceRegion(referenceName, 0L, Long.MaxValue, strand = strand)

I'm looking further...
I must have an Int rather than a Long somewhere.
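To illustrate how an Int in place of a Long could produce the symptoms reported above (zero rows with `ReferenceRegion.all`, 20 rows with an explicit end of 249250621), here is a small sketch. It is not the actual ADAM code: `overlapsNarrowed` and `overlapsWide` are hypothetical helpers, and the read position is an arbitrary example value.

```scala
object NarrowingSketch {
  // Hypothetical range check that narrows the region bounds to Int
  // (the kind of bug suspected in the thread, not ADAM's real code).
  def overlapsNarrowed(pos: Long, start: Long, end: Long): Boolean =
    pos >= start.toInt && pos < end.toInt

  // The same check using Long arithmetic throughout.
  def overlapsWide(pos: Long, start: Long, end: Long): Boolean =
    pos >= start && pos < end

  def main(args: Array[String]): Unit = {
    val pos = 26472783L // arbitrary read position, for illustration only

    // Narrowing Long.MaxValue keeps only the low 32 bits, which are all ones.
    println(Long.MaxValue.toInt)                        // -1
    println(overlapsNarrowed(pos, 0L, Long.MaxValue))   // false: no position is < -1
    println(overlapsNarrowed(pos, 0L, 249250621L))      // true: this end fits in an Int
    println(overlapsWide(pos, 0L, Long.MaxValue))       // true
  }
}
```

Under this assumption, an end coordinate that fits in an Int survives the conversion unchanged, which would be consistent with `fromGenomicRange("1", 0L, 249250621L)` returning all 20 reads.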

jpdna commented Mar 26, 2018

Should be fixed by:
#1968

fnothaft added this to the 0.24.1 milestone Mar 27, 2018

fnothaft commented Apr 11, 2018

Closed by #1980.

fnothaft closed this Apr 11, 2018
