loadPartitionedParquetAlignments fails with Reference.all #1967

Closed
heuermh opened this issue Mar 26, 2018 · 4 comments

@heuermh heuermh commented Mar 26, 2018

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val alignments = sc.loadAlignments("adam-core/src/test/resources/small.sam")
alignments: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = RDDBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> alignments.saveAsPartitionedParquet("small.partitioned.alignments.adam")

scala> val partitioned = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam")
partitioned: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = DatasetBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> partitioned.sequences
res1: org.bdgenomics.adam.models.SequenceDictionary =
SequenceDictionary{
1->249250621, 0
2->243199373, 1}

scala> partitioned.dataset.count()
res0: Long = 20

scala> partitioned.dataset.filter(partitioned.dataset.col("contigName").equalTo("1")).count()
res2: Long = 20

scala> partitioned.dataset.filter(partitioned.dataset.col("contigName").equalTo("2")).count()
res3: Long = 0

scala> val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion.all("1")))
chr1: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = DatasetBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> chr1.dataset.count()
res4: Long = 0

scala> val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion.fromGenomicRange("1", 0L, 249250621L)))
chr1: org.bdgenomics.adam.rdd.read.AlignmentRecordRDD = DatasetBoundAlignmentRecordRDD with 2 reference sequences, 0 read groups, and 2 processing steps

scala> chr1.dataset.count()
res5: Long = 20

Ping @jpdna for triage

@jpdna jpdna commented Mar 26, 2018

Is it expected that:

scala> ReferenceRegion.all("1")
res7: org.bdgenomics.adam.models.ReferenceRegion = ReferenceRegion(1,0,9223372036854775807,INDEPENDENT)

Surely 9223372036854775807 is not correct?

If I try to use that ReferenceRegion directly I get an error printed to the console

scala> val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion(1,0,9223372036854775807,INDEPENDENT) )
<console>:1: error: integer number too large
val chr1 = sc.loadPartitionedParquetAlignments("small.partitioned.alignments.adam", Seq(ReferenceRegion(1,0,9223372036854775807,INDEPENDENT) )

but it would seem that error is getting swallowed by the loadPartitionedParquetAlignments call.
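A note on the console error above: "integer number too large" is a compile-time complaint about the literal itself, since a Scala integer literal without an `L` suffix is typed as `Int`. The minimal sketch below shows the suffixed form; the object name `LiteralSketch` is hypothetical and nothing here calls into ADAM.

```scala
// Hypothetical illustration; not ADAM code.
object LiteralSketch {
  // Writing 9223372036854775807 without the L suffix does not compile:
  // the compiler reads it as an Int literal and reports
  // "integer number too large".
  val end: Long = 9223372036854775807L

  def main(args: Array[String]): Unit = {
    // The suffixed literal is exactly Long.MaxValue.
    println(end == Long.MaxValue)
  }
}
```

So the REPL error comes from how the region was typed at the prompt, and is probably separate from whatever makes the `ReferenceRegion.all` query return no rows.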

@jpdna jpdna commented Mar 26, 2018

Ah yeah 9223372036854775807 is Long.MaxValue at

ReferenceRegion(referenceName, 0L, Long.MaxValue, strand = strand)

I'm looking further...
I must have an Int rather than a Long somewhere.
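To illustrate why an `Int` in place of a `Long` could produce the symptom reported above (zero rows for `ReferenceRegion.all("1")`, 20 rows for the explicit range), here is a minimal, self-contained sketch. It is an assumption about the class of bug, not the actual ADAM code: `Region`, `overlapsNarrowed`, and `overlaps` are hypothetical names invented for this example.

```scala
// Hypothetical illustration of Long-to-Int narrowing; not ADAM's implementation.
object NarrowingSketch {
  // Stand-in for a genomic region (assumed shape, not ADAM's ReferenceRegion).
  final case class Region(name: String, start: Long, end: Long)

  // Buggy variant: narrows the bounds to Int before comparing.
  def overlapsNarrowed(r: Region, pos: Long): Boolean =
    pos >= r.start.toInt && pos < r.end.toInt

  // Correct variant: compares as Long throughout.
  def overlaps(r: Region, pos: Long): Boolean =
    pos >= r.start && pos < r.end

  def main(args: Array[String]): Unit = {
    val all = Region("1", 0L, Long.MaxValue)   // like ReferenceRegion.all("1")
    val bounded = Region("1", 0L, 249250621L)  // explicit chromosome length

    // Narrowing keeps only the low 32 bits, so Long.MaxValue becomes -1.
    println(Long.MaxValue.toInt)               // prints -1
    println(overlapsNarrowed(all, 1000L))      // prints false: 1000 < -1 fails
    println(overlapsNarrowed(bounded, 1000L))  // prints true: 249250621 fits in Int
    println(overlaps(all, 1000L))              // prints true
  }
}
```

Under this assumption, any end coordinate that fits in an `Int` behaves correctly, which would explain why the `fromGenomicRange` query succeeds while the `Long.MaxValue` bound matches nothing.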

@jpdna jpdna commented Mar 26, 2018

Should be fixed by #1968.

@fnothaft fnothaft added this to the 0.24.1 milestone Mar 27, 2018

@fnothaft fnothaft commented Apr 11, 2018

Closed by #1980.

@fnothaft fnothaft closed this Apr 11, 2018