New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency on ShuffleRegionJoin returns #1734

Closed
devin-petersohn opened this Issue Sep 18, 2017 · 18 comments

Comments

Projects
3 participants
@devin-petersohn
Member

devin-petersohn commented Sep 18, 2017

I would like to make consistent the return types for all ShuffleRegionJoins, specifically to return GenericGenomicRDDs.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 18, 2017

Member

You mean, GenericGenomicRDD as opposed to GenomicRDD? Can you expand upon your motivations, pros/cons, etc?

Member

fnothaft commented Sep 18, 2017

You mean, GenericGenomicRDD as opposed to GenomicRDD? Can you expand upon your motivations, pros/cons, etc?

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 18, 2017

Member

They should be consistent for proper generalization downstream. In 37b971a it was changed only for inner shuffle join, and I think that change makes sense to propagate through to the remaining shuffle joins.

Member

devin-petersohn commented Sep 18, 2017

They should be consistent for proper generalization downstream. In 37b971a it was changed only for inner shuffle join, and I think that change makes sense to propagate through to the remaining shuffle joins.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 18, 2017

Member

Ah, I see what's going on. Yeah, that change sounds OK to me. Do you know if we can move the inner join back down to GenomicRDD instead? That'd be my preference.

Member

fnothaft commented Sep 18, 2017

Ah, I see what's going on. Yeah, that change sounds OK to me. Do you know if we can move the inner join back down to GenomicRDD instead? That'd be my preference.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 18, 2017

Member

I actually prefer GenericGenomicRDD more here (or some kind of GenomicPairRDD object) here because of the strange type-casting issue referenced in #1735. It is also a much cleaner abstraction than GenomicRDD[(T, Iterable[X]), Z].

Member

devin-petersohn commented Sep 18, 2017

I actually prefer GenericGenomicRDD more here (or some kind of GenomicPairRDD object) here because of the strange type-casting issue referenced in #1735. It is also a much cleaner abstraction than GenomicRDD[(T, Iterable[X]), Z].

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 19, 2017

Member

Still trying to get my head around this.

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

Could the methods in GenericGenomicRDD be moved up to GenomicRDD (would require traitabstract class)?

Would GenomicPairRDD limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).

Member

heuermh commented Sep 19, 2017

Still trying to get my head around this.

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

Could the methods in GenericGenomicRDD be moved up to GenomicRDD (would require traitabstract class)?

Would GenomicPairRDD limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 19, 2017

Member

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

That's exactly their purpose.

Conceptually, we could also support GenericGenomicRDD as a type that one can transmute to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.

Could the methods in GenericGenomicRDD be moved up to GenomicRDD (would require trait → abstract class)?

The ones that call copy require GenericGenomicRDD to be a case class, union requires a concrete constructor/apply method, and the getReferenceRegions implementation doesn't make sense for non-generic GenomicRDDs.

Would GenomicPairRDD limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).

Are you proposing to replace GenericGenomicRDD with GenomicPairRDD, which would be a GenomicRDD that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD?

Member

fnothaft commented Sep 19, 2017

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

That's exactly their purpose.

Conceptually, we could also support GenericGenomicRDD as a type that one can transmute to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.

Could the methods in GenericGenomicRDD be moved up to GenomicRDD (would require trait → abstract class)?

The ones that call copy require GenericGenomicRDD to be a case class, union requires a concrete constructor/apply method, and the getReferenceRegions implementation doesn't make sense for non-generic GenomicRDDs.

Would GenomicPairRDD limit joins to two types? It seems a completely reasonable use case to join reads, features, and variants. Or may be more efficient to join twice two of the same type than to union first then join (say variants and several different RDDs of features).

Are you proposing to replace GenericGenomicRDD with GenomicPairRDD, which would be a GenomicRDD that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD?

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 19, 2017

Member

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

That's exactly their purpose.

Downstream, we need a way to manage the various set theory operations without modifying the internal records, changing the regionFn to be whatever we need.

Conceptually, we could also support GenericGenomicRDD as a type that one can transmute to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.

That would be useful considering BroadcastRegionJoin and ShuffleRegionJoin group by methods do not return the same tuple ordering. Being able to swap key and value is useful.

Are you proposing to replace GenericGenomicRDD with GenomicPairRDD, which would be a GenomicRDD that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD?

The only additional benefit I see is calling ._1/._2 to get the GenomicRDD from the left/right side of the join (i.e. the AlignmentRecordRDD from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD.

Member

devin-petersohn commented Sep 19, 2017

Is there any purpose for GenericGenomicRDD other than the return value for the various join methods?

That's exactly their purpose.

Downstream, we need a way to manage the various set theory operations without modifying the internal records, changing the regionFn to be whatever we need.

Conceptually, we could also support GenericGenomicRDD as a type that one can transmute to, which I think would be reasonable. We don't support this now, but its just a matter of adding the conversion functions.

That would be useful considering BroadcastRegionJoin and ShuffleRegionJoin group by methods do not return the same tuple ordering. Being able to swap key and value is useful.

Are you proposing to replace GenericGenomicRDD with GenomicPairRDD, which would be a GenomicRDD that enforces that the RDD's value is a tuple? If so, what would be the benefit to this over GenericGenomicRDD?

The only additional benefit I see is calling ._1/._2 to get the GenomicRDD from the left/right side of the join (i.e. the AlignmentRecordRDD from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 19, 2017

Member

That would be useful considering BroadcastRegionJoin and ShuffleRegionJoin group by methods do not return the same tuple ordering. Being able to swap key and value is useful.

You can already do that with transform, and you can transmute from a GenericGenomicRDD to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD.

The only additional benefit I see is calling ._1/._2 to get the GenomicRDD from the left/right side of the join (i.e. the AlignmentRecordRDD from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD.

See above; you can already do this with transform or transmute on a GenericGenomicRDD. GenomicPairRDD wouldn't make this easier.

Member

fnothaft commented Sep 19, 2017

That would be useful considering BroadcastRegionJoin and ShuffleRegionJoin group by methods do not return the same tuple ordering. Being able to swap key and value is useful.

You can already do that with transform, and you can transmute from a GenericGenomicRDD to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD.

The only additional benefit I see is calling ._1/._2 to get the GenomicRDD from the left/right side of the join (i.e. the AlignmentRecordRDD from the left). The use case would be post processing on the set theory operation (i.e. I only want the alignments that overlapped my variants in the output). I have no strong preference between keeping it as is or adding the GenomicPairRDD.

See above; you can already do this with transform or transmute on a GenericGenomicRDD. GenomicPairRDD wouldn't make this easier.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 19, 2017

Member

You can already do that with transform, and you can transmute from a GenericGenomicRDD to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD.

I may be mistaken, but don't you need transmute to go from GenericGenomicRDD[(U, T)] to GenericGenomicRDD[(T, U)]? The transform requires a function T => T, or (U, T) => (U, T). Using transform didn't compile when I tried to go from (U, T) => (T, U).

See above; you can already do this with transform or transmute on a GenericGenomicRDD. GenomicPairRDD wouldn't make this easier.

It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD they had, and 2.) use the transmute API. I can just add that functionality to Lime if it doesn't belong here.

Member

devin-petersohn commented Sep 19, 2017

You can already do that with transform, and you can transmute from a GenericGenomicRDD to a non-generic. We are just missing conversion functions for going to a GenericGenomicRDD.

I may be mistaken, but don't you need transmute to go from GenericGenomicRDD[(U, T)] to GenericGenomicRDD[(T, U)]? The transform requires a function T => T, or (U, T) => (U, T). Using transform didn't compile when I tried to go from (U, T) => (T, U).

See above; you can already do this with transform or transmute on a GenericGenomicRDD. GenomicPairRDD wouldn't make this easier.

It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD they had, and 2.) use the transmute API. I can just add that functionality to Lime if it doesn't belong here.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 19, 2017

Member

I may be mistaken, but don't you need transmute to go from GenericGenomicRDD[(U, T)]

Yes, you are correct; my apologies for the confusion. My broader point is that we don't need a new class to support this.

It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD they had, and 2.) use the transmute API.

Ah, are you saying you'd add a function to GenomicPairRDD that would call ._1 or ._2 on all elements in the RDD? Something like .keys or .values in Spark's PairRDDFunctions? If so, that makes sense, but I'd prefer to go with Spark's approach where the functions are added implicitly to any GenericGenomicRDD[(X,Y)] (and implemented via transmute), instead of replacing GenericGenomicRDD with PairGenomicRDD.

Member

fnothaft commented Sep 19, 2017

I may be mistaken, but don't you need transmute to go from GenericGenomicRDD[(U, T)]

Yes, you are correct; my apologies for the confusion. My broader point is that we don't need a new class to support this.

It was meant to be more of a user-facing feature, rather than forcing the user to: 1.) know what type of GenomicRDD they had, and 2.) use the transmute API.

Ah, are you saying you'd add a function to GenomicPairRDD that would call ._1 or ._2 on all elements in the RDD? Something like .keys or .values in Spark's PairRDDFunctions? If so, that makes sense, but I'd prefer to go with Spark's approach where the functions are added implicitly to any GenericGenomicRDD[(X,Y)] (and implemented via transmute), instead of replacing GenericGenomicRDD with PairGenomicRDD.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 19, 2017

Member

Can you sketch out the two use cases above? Does having GenericGenomicRDD/GenomicPairRDD as return value help here?

val reads = sc.loadReads("sample.bam")
val variants = sc.loadVariants("sample.vcf.gz")
val features = sc.loadFeatures("ensembl.gff3.gz")

// 1. join all three types

val genes = sc.loadFeatures("gencode.v21.annotation.gff3.gz")
val polyAs = sc.loadFeatures("gencode.v21.polyAs.gff3.gz")
val pseudogenes = sc.loadFeatures("gencode.v21.2wayconspseudos.gff3.gz")
val tRnas = sc.loadFeatures("gencode.v21.tRNAs.gff3.gz")

// 2. join variants with all GENCODE features
Member

heuermh commented Sep 19, 2017

Can you sketch out the two use cases above? Does having GenericGenomicRDD/GenomicPairRDD as return value help here?

val reads = sc.loadReads("sample.bam")
val variants = sc.loadVariants("sample.vcf.gz")
val features = sc.loadFeatures("ensembl.gff3.gz")

// 1. join all three types

val genes = sc.loadFeatures("gencode.v21.annotation.gff3.gz")
val polyAs = sc.loadFeatures("gencode.v21.polyAs.gff3.gz")
val pseudogenes = sc.loadFeatures("gencode.v21.2wayconspseudos.gff3.gz")
val tRnas = sc.loadFeatures("gencode.v21.tRNAs.gff3.gz")

// 2. join variants with all GENCODE features
@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 19, 2017

Member

To what @heuermh is saying, I would agree that it would be preferable to have the one signature change back to GenomicRDD instead of changing all the signatures to GenericGenomicRDD.

This is separate from the question of opening up the constructor on GenericGenomicRDD.

As an aside, if we go the implicit route, then we add the RDD-level _1/_2 functions while keeping the type as GenomicRDD.

Member

fnothaft commented Sep 19, 2017

To what @heuermh is saying, I would agree that it would be preferable to have the one signature change back to GenomicRDD instead of changing all the signatures to GenericGenomicRDD.

This is separate from the question of opening up the constructor on GenericGenomicRDD.

As an aside, if we go the implicit route, then we add the RDD-level _1/_2 functions while keeping the type as GenomicRDD.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 19, 2017

Member

The implicit route makes sense. I

Moving everything GenomicRDD creates problems for two functionalities that I need downstream:

1.) transform after the join creates a Nothing object. This is because the Z type is inferred to be Nothing after upcasting.
2.) I use a predicate on both the joined data and the regionFn. I cannot access the getReferenceRegions from the outside, and I need a way to get them. Performing a predicate on the regionFn from the GenericGenomicRDD after the join makes more sense to me than opening up permissions on getReferenceRegions.

Member

devin-petersohn commented Sep 19, 2017

The implicit route makes sense. I

Moving everything GenomicRDD creates problems for two functionalities that I need downstream:

1.) transform after the join creates a Nothing object. This is because the Z type is inferred to be Nothing after upcasting.
2.) I use a predicate on both the joined data and the regionFn. I cannot access the getReferenceRegions from the outside, and I need a way to get them. Performing a predicate on the regionFn from the GenericGenomicRDD after the join makes more sense to me than opening up permissions on getReferenceRegions.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 19, 2017

Member

@heuermh I don't think it affects the joins in any way either way.

Member

devin-petersohn commented Sep 19, 2017

@heuermh I don't think it affects the joins in any way either way.

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 19, 2017

Member

Actually, now that I think of it, why is getReferenceRegions protected?

Member

devin-petersohn commented Sep 19, 2017

Actually, now that I think of it, why is getReferenceRegions protected?

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Sep 19, 2017

Member

Actually, now that I think of it, why is getReferenceRegions protected?

For me, that is part of a defensive API strategy. Keep things private or package private or otherwise hidden and final and immutable unless it becomes necessary to change it. That way you keep the public API as small as possible.

Member

heuermh commented Sep 19, 2017

Actually, now that I think of it, why is getReferenceRegions protected?

For me, that is part of a defensive API strategy. Keep things private or package private or otherwise hidden and final and immutable unless it becomes necessary to change it. That way you keep the public API as small as possible.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft
Member

fnothaft commented Sep 19, 2017

@devin-petersohn

This comment has been minimized.

Show comment
Hide comment
@devin-petersohn

devin-petersohn Sep 20, 2017

Member

I understand the idea of keeping the public API small, but isn't it reasonable for a user to want to see the regions for a given record? This is a getter, so mutability isn't the issue. As-is you have to build the ReferenceRegion from the contigName, start, and end, which feels messy considering we already have the plumbing to do it. This is a separate issue than GenericGenomicRDD vs GenomicRDD[((T, U), Z)], so we can revisit it later.

Member

devin-petersohn commented Sep 20, 2017

I understand the idea of keeping the public API small, but isn't it reasonable for a user to want to see the regions for a given record? This is a getter, so mutability isn't the issue. As-is you have to build the ReferenceRegion from the contigName, start, and end, which feels messy considering we already have the plumbing to do it. This is a separate issue than GenericGenomicRDD vs GenomicRDD[((T, U), Z)], so we can revisit it later.

@devin-petersohn devin-petersohn self-assigned this Sep 21, 2017

@heuermh heuermh added this to the 0.23.0 milestone Dec 7, 2017

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment