
FeatureRDD instantiation tries to cache the RDD #1321

Closed
laserson opened this issue Dec 21, 2016 · 11 comments

laserson (Contributor) commented Dec 21, 2016

Currently, when you create a FeatureRDD via, say, sc.loadBed(), the apply function tries to cache the RDD because it first computes a SequenceDictionary from it.

I am working on my laptop with a 120 GB BED dataset, and this is not going to end well. This feels unexpected. Is it necessary? Perhaps allow the user to specify the storage level? Thoughts?
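A rough sketch of what an opt-in storage level could look like; the signature and the helpers readBedRecords and computeSequenceDictionary are hypothetical, not ADAM's actual API:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: let the caller decide whether (and how) the
// intermediate RDD is persisted while the SequenceDictionary is computed.
def loadBed(
  pathName: String,
  storageLevel: Option[StorageLevel] = None): FeatureRDD = {
  val rdd = readBedRecords(pathName)       // hypothetical reader helper
  storageLevel.foreach(rdd.persist)        // persist only if requested
  val sd = computeSequenceDictionary(rdd)  // hypothetical dictionary helper
  FeatureRDD(rdd, sd)
}
```

With a default of None, existing callers would see no caching unless they ask for it, e.g. loadBed(path, Some(StorageLevel.MEMORY_AND_DISK)).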

laserson (Contributor, Author) commented Dec 21, 2016

To be fair, I'm in the middle of this computation, and I suppose Spark may deal with this issue gracefully. Will update soon.

heuermh (Member) commented Dec 21, 2016

That's a fairly large BED file, are you doing wheat genomics now? Hope you have 1 TB RAM on that laptop. :)

@fnothaft fnothaft added this to the 0.23.0 milestone Mar 3, 2017
heuermh (Member) commented Mar 5, 2017

@laserson still curious as to the use case here.

Note that #1411 modifies how the SequenceDictionary is created for BED, GFF3, GTF, and NarrowPeak formats, and that may have performance implications.

fnothaft (Member) commented Mar 5, 2017

I mean, in general, I think @laserson is correct here; we shouldn't cache unless we provide an option to control how we cache (a la Transform). It's a simple patch.

heuermh (Member) commented Mar 5, 2017

> provide an option

+1

laserson (Contributor, Author) commented Mar 7, 2017

I guess this is a bit hacky, but I was loading the PhyloP dataset, which is a wigFix file with a float value at every genome position. I converted it to BED for loading with ADAM.

fnothaft (Member) commented Mar 7, 2017

Oh, that makes a fair bit of sense now.

heuermh (Member) commented Mar 7, 2017

To push down a cache flag to the apply method, we'd have to add it to all the loadXxx feature methods. Is that worth it?

Transform works differently, in that the -cache argument calls rdd.persist explicitly:
https://github.com/bigdatagenomics/adam/blob/master/adam-cli/src/main/scala/org/bdgenomics/adam/cli/Transform.scala#L171
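The pattern described above, persisting explicitly between transform steps rather than inside the constructor, could be sketched roughly like this; the helper name and arguments are illustrative, not the actual Transform CLI class:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative sketch: persist an RDD only when the user opts in via
// a -cache flag, with the storage level name supplied as a string
// (Spark's StorageLevel.fromString parses names like "MEMORY_ONLY").
def maybeCache[T](rdd: RDD[T], cache: Boolean, levelName: String): RDD[T] = {
  if (cache) rdd.persist(StorageLevel.fromString(levelName)) else rdd
}
```

Called between transform steps, this keeps the decision with the user instead of baking persist into RDD construction.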

fnothaft (Member) commented Mar 7, 2017

> To push down a cache flag to the apply method, we'd have to add it to all the loadXxx feature methods, is that worth it?

I would be fine with us adding that.

laserson (Contributor, Author) commented Mar 7, 2017

Does that flag make sense for the other loadXXX methods?

heuermh (Member) commented Mar 21, 2017

@laserson see also #1447. transformFeatures CLI command now has -cache and -storage_level arguments.

> Does that flag make sense for the other loadXXX methods?

Possibly. This case is a bit unique in that persist is called when constructing the RDD, rather than in between transform steps, which is where persist might otherwise be called.
