FeatureRDD instantiation tries to cache the RDD #1321

Closed
laserson opened this Issue Dec 21, 2016 · 11 comments

@laserson
Contributor

laserson commented Dec 21, 2016

Currently, when you create a FeatureRDD via, say, `sc.loadBed()`, the `apply` function tries to cache the RDD because it first computes a `SequenceDictionary` off of it.

I am working on my laptop with a 120 GB BED dataset and this is not going to end well. This feels unexpected... is it necessary? Perhaps allow the user to specify the storage level? Thoughts?
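For context, a minimal sketch of the behavior being discussed (hypothetical names, not ADAM's actual API): building the sequence dictionary forces a full pass over the data, which is why the load path caches unconditionally, and the proposed fix is to let the caller pass an optional storage level instead.

```scala
// Hypothetical sketch only, not ADAM's actual API. It models threading an
// optional storage level through a load path so the caller decides whether
// the data is cached while the sequence dictionary is computed.
object CacheSketch {
  // Stand-in for org.apache.spark.storage.StorageLevel.
  sealed trait StorageLevel
  case object MemoryOnly extends StorageLevel
  case object DiskOnly extends StorageLevel

  final case class Feature(contig: String, start: Long, end: Long)

  // Stand-in for FeatureRDD.apply: returns the distinct contig names
  // (the "sequence dictionary") and whether caching was requested.
  def buildFeatures(
      features: Seq[Feature],
      storageLevel: Option[StorageLevel] = None): (Seq[String], Boolean) = {
    val cached = storageLevel.isDefined // in Spark: features.persist(level)
    // Computing the dictionary forces a full pass over the data, which is
    // why caching was attractive in the first place.
    val dictionary = features.map(_.contig).distinct
    (dictionary, cached)
  }
}
```

With `storageLevel = None` the caller opts out of caching entirely, which is the behavior this issue asks for.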

@laserson


Contributor

laserson commented Dec 21, 2016

To be fair, I'm in the middle of this computation, and I suppose Spark may deal with this issue gracefully. Will update soon.

@heuermh


Member

heuermh commented Dec 21, 2016

That's a fairly large BED file, are you doing wheat genomics now? Hope you have 1 TB RAM on that laptop. :)

@fnothaft fnothaft added this to the 0.23.0 milestone Mar 3, 2017

@heuermh


Member

heuermh commented Mar 5, 2017

@laserson still curious as to the use case here.

Note that #1411 modifies how the SequenceDictionary is created for BED, GFF3, GTF, and NarrowPeak formats, and that may have performance implications.

@fnothaft


Member

fnothaft commented Mar 5, 2017

I mean, in general, I think @laserson is correct here; we shouldn't cache unless we provide an option to control how we cache (a la Transform). It's a simple patch.

@heuermh


Member

heuermh commented Mar 5, 2017

> provide an option

+1

@laserson


Contributor

laserson commented Mar 7, 2017

I guess this is a bit hacky, but I was loading the phyloP data set, which is a wigFix file with a float value at every genome position. I converted it to BED for loading with ADAM.

@fnothaft


Member

fnothaft commented Mar 7, 2017

Oh, that makes a fair bit of sense now.

@heuermh


Member

heuermh commented Mar 7, 2017

To push down a cache flag to the apply method, we'd have to add it to all the loadXxx feature methods, is that worth it?

Transform works differently, in that the `-cache` argument calls `rdd.persist` explicitly:
https://github.com/bigdatagenomics/adam/blob/master/adam-cli/src/main/scala/org/bdgenomics/adam/cli/Transform.scala#L171

@fnothaft


Member

fnothaft commented Mar 7, 2017

> To push down a cache flag to the apply method, we'd have to add it to all the loadXxx feature methods, is that worth it?

I would be fine with us adding that.

@laserson


Contributor

laserson commented Mar 7, 2017

Does that flag make sense for the other loadXXX methods?

@heuermh


Member

heuermh commented Mar 21, 2017

@laserson see also #1447. transformFeatures CLI command now has -cache and -storage_level arguments.

> Does that flag make sense for the other loadXXX methods?

Possibly. This case is a bit unique in that persist is called when constructing the RDD, unlike other places where persist might be called in between transform steps.
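For illustration, validating a `-storage_level` style string argument before persisting could be sketched as below. This is a hypothetical helper, not the ADAM CLI code; Spark's real `org.apache.spark.storage.StorageLevel.fromString` does the actual name-to-level mapping.

```scala
// Hypothetical sketch of validating a -storage_level style CLI argument.
// Spark's StorageLevel.fromString performs the real mapping (and throws on
// unknown names); this stub only normalizes and validates the string.
object StorageLevelArg {
  private val known = Set(
    "NONE", "MEMORY_ONLY", "MEMORY_ONLY_SER",
    "MEMORY_AND_DISK", "MEMORY_AND_DISK_SER", "DISK_ONLY")

  // Returns the canonical level name if recognized, None otherwise.
  def parse(arg: String): Option[String] = {
    val normalized = arg.trim.toUpperCase
    if (known(normalized)) Some(normalized) else None
  }
}
```

Returning `Option` lets the CLI fail fast on a typo instead of persisting at an unintended level.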
