New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add filterBySequenceDictionary to GenomicRDD #1575

Closed
heuermh opened this Issue Jun 21, 2017 · 6 comments

Comments

Projects
3 participants
@heuermh
Member

heuermh commented Jun 21, 2017

As suggested in comment #1557 (comment), add

def filterBySequenceDictionary(sd: SequenceDictionary): U = {
  ...
}

and

def filterBySequenceDictionary(): U = {
  filterBySequenceDictionary(this.sequences)
}

methods on GenomicRDD.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 5, 2017

Member

So, I didn't quite understand what was going on back on #1557 (comment), and I still don't quite understand what this method would do. Is it "filter items mapped to contigs present in a sequence dictionary"? If so, we should be able to mash filterByOverlappingRegions and the addition of all to the ReferenceRegion object in e98ee2d. My preference would be to overload the filterByOverlappingRegions name; filterBySequenceDictionary sounds weird to me.

Member

fnothaft commented Jul 5, 2017

So, I didn't quite understand what was going on back on #1557 (comment), and I still don't quite understand what this method would do. Is it "filter items mapped to contigs present in a sequence dictionary"? If so, we should be able to mash filterByOverlappingRegions and the addition of all to the ReferenceRegion object in e98ee2d. My preference would be to overload the filterByOverlappingRegions name; filterBySequenceDictionary sounds weird to me.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 5, 2017

Member

I was mistaken earlier in that I thought the sequence dictionary was used in region joins. The question is really what to do when reading in features:

When reading in a BED file:

  • Use an empty sequence dictionary, or
  • Create a sequence dictionary on the fly from the contig names in the BED records and length max(end), or
  • Create a sequence dictionary on the fly from the contig names in the BED records and an infinite length, similar to ReferenceRegion.all

When reading in a BED file with a genome file:

  • Load records without regard for sequence dictionary, or
  • Consider ValidationStringency, validate records against sequence dictionary

The thinking behind this suggestion was to load records without regard for sequence dictionary and then filter after the fact, if desired.

Member

heuermh commented Jul 5, 2017

I was mistaken earlier in that I thought the sequence dictionary was used in region joins. The question is really what to do when reading in features:

When reading in a BED file:

  • Use an empty sequence dictionary, or
  • Create a sequence dictionary on the fly from the contig names in the BED records and length max(end), or
  • Create a sequence dictionary on the fly from the contig names in the BED records and an infinite length, similar to ReferenceRegion.all

When reading in a BED file with a genome file:

  • Load records without regard for sequence dictionary, or
  • Consider ValidationStringency, validate records against sequence dictionary

The thinking behind this suggestion was to load records without regard for sequence dictionary and then filter after the fact, if desired.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 5, 2017

Member

I was mistaken earlier in that I thought the sequence dictionary was used in region joins.

It used to be.

WRT BED, my feeling is empty sequence dictionary in the first case, load records without regard for sequence dictionary in the second case. We don't currently validate reads or variants/genotypes against a sequence dictionary, so why add checking specific to feature file formats?

Member

fnothaft commented Jul 5, 2017

I was mistaken earlier in that I thought the sequence dictionary was used in region joins.

It used to be.

WRT BED, my feeling is empty sequence dictionary in the first case, load records without regard for sequence dictionary in the second case. We don't currently validate reads or variants/genotypes against a sequence dictionary, so why add checking specific to feature file formats?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 5, 2017

Member

+1 to both. @devin-petersohn thoughts?

Member

heuermh commented Jul 5, 2017

+1 to both. @devin-petersohn thoughts?

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Jul 6, 2017

Member

That sounds good to me.

Member

devin-petersohn commented Jul 6, 2017

That sounds good to me.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Jul 6, 2017

Member

ok, created new issue #1588, closing this one, will update pull request #1557

Member

heuermh commented Jul 6, 2017

ok, created new issue #1588, closing this one, will update pull request #1557

@heuermh heuermh closed this Jul 6, 2017

@heuermh heuermh modified the milestone: 0.23.0 Jul 22, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment