
Use of dataset api in ADAM #1018

Closed
jpdna opened this Issue Apr 26, 2016 · 6 comments


jpdna commented Apr 26, 2016

Here is a prototype using the Spark Dataset API for the ADAM markduplicates operation:
here

The Dataset API version of ADAM markduplicates above is now at least 10% faster than the current RDD implementation.

Prior to the modification described below, the Dataset API version was 30-50% slower than the current RDD version.

Below I describe the modification to our use of the Dataset API that produced this improvement:

A seemingly minor change, returning from a mapGroups function a tuple containing a single Seq rather than the same data split across three separate Seqs (whether held individually or nested inside a case class), makes a large (2x) difference in total processing time.

This magnitude of gain from passing the same data as one Seq instead of three is surprising: the total amount of data returned and then shuffled in the subsequent groupBy should be the same, the only difference being whether it is packaged as a single Seq or split into three.

While I'd expect some overhead from multiple Seqs versus one, I'm surprised by the size of the difference, and curious whether there is a best practice that suggests avoiding Seqs of case classes in a Dataset because of this behavior.
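To make the comparison concrete, here is a minimal sketch of the two return shapes being discussed. All names here are hypothetical, not ADAM's actual code; in the real pipeline these would be the result types coming out of a Dataset `groupByKey(...).mapGroups(...)` before the next `groupBy` shuffle:

```scala
// Hypothetical sketch of the two mapGroups result shapes compared above.
object MapGroupsShapes {
  // A stand-in for a grouped record; fields are illustrative only.
  case class Read(name: String, start: Long)

  // Slow shape: the grouped reads split into three Seqs inside a case class.
  case class Split(primary: Seq[Read], secondary: Seq[Read], unmapped: Seq[Read])

  // Fast shape: the same reads carried in one Seq, tagged by category so the
  // three groups can be recovered after the shuffle.
  case class Packed(reads: Seq[(Int, Read)])

  def pack(s: Split): Packed =
    Packed(s.primary.map(r => (0, r)) ++
           s.secondary.map(r => (1, r)) ++
           s.unmapped.map(r => (2, r)))

  def unpack(p: Packed): Split = Split(
    p.reads.collect { case (0, r) => r },
    p.reads.collect { case (1, r) => r },
    p.reads.collect { case (2, r) => r })
}
```

`pack` and `unpack` show that the two shapes carry exactly the same records; only the packaging seen by the Dataset encoder differs.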

The Spark UI output and .explain() for the fast version can be found here,
and for the slow version here.
Stage 2 goes from 59 seconds to 2.2 minutes.


rxin commented Apr 26, 2016

Can you check how much time is spent in GC in the old vs new version?


Member

jpdna commented Apr 26, 2016

> Can you check how much time is spent in GC in the old vs new version?

Thanks for looking at this @rxin
There doesn't seem to be much difference in GC time.
I have now pasted the Stage details screenshots including the summary metrics in the last two pages of the google docs for slow and fast

Other than the duration metric itself, I don't see any metric that captures the duration difference. It doesn't appear to be stragglers; rather, the slowdown is evenly distributed: from min to median to max, tasks all take around twice as long to complete in the slow version.


hubertp commented Jun 12, 2016

@jpdna is this work stalled atm? Anything an eager person could pick up?

Member

jpdna commented Jun 13, 2016

@hubertp - thanks for your interest in contributing! The cause of the performance oddity between the two different ways of arranging the data prior to running groupBy, as described above (with links to the Google Docs and GitHub), is still not understood. It would be interesting if you want to try to replicate it in a simple, general example. Otherwise, on the general Dataset API front, I plan to return to that shortly, though you are certainly welcome to look at what next steps make sense to you and make a PR - is there a specific area you are interested in being involved with?

One idea for where you could help: similar to https://github.com/jpdna/adam/blob/dataset_api_markdups_v4/adam-core/src/main/scala/org/bdgenomics/adam/dataset/AlignmentRecordLimitProjDS.scala we need a Dataset API case class for the VCF-based variant data like https://github.com/bigdatagenomics/adam/tree/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/variation and then to explore using it for variant-related operations. If you want to take a look, a PR on that subject might help push along efforts. We haven't yet merged my own Dataset API work to the main ADAM repo due to the need to bump to Scala 2.11 to allow larger case classes. I plan today to rebase the work I was doing at https://github.com/jpdna/adam/tree/dataset_api_markdups_v4 onto the current ADAM repo, so that will be a better place to start from if you want to work on Dataset API stuff in ADAM; I'll update this issue with a link when ready.
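As a rough sketch of the variant case class idea: a flat case class with only the fields needed for an operation can back a Dataset directly. The field names below are generic VCF-style fields assumed for illustration, not ADAM's actual schema:

```scala
// Hypothetical Dataset-friendly projection for variant data, in the spirit
// of AlignmentRecordLimitProjDS; fields are illustrative, not ADAM's schema.
case class VariantLimitProjDS(
  contigName: String,
  start: Long,
  end: Long,
  referenceAllele: String,
  alternateAllele: String)

// With a SparkSession in scope, such a flat case class can back a Dataset:
//   import spark.implicits._
//   val ds: Dataset[VariantLimitProjDS] = variantRdd.map(toProj).toDS()
```

Keeping the projection flat (no nested Seqs of case classes) sidesteps the encoder overhead discussed earlier in this issue.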


@fnothaft fnothaft added this to the 0.21.0 milestone Jul 6, 2016

@heuermh heuermh modified the milestones: 0.21.0, 0.22.0 Oct 13, 2016


Fei-Guang commented Nov 2, 2016

Hope this gets merged in the next release.


Member

fnothaft commented Feb 14, 2017

Lo and behold; I have a candidate fix for this. Need to clean it up but will PR tomorrow.

fnothaft added a commit to fnothaft/adam that referenced this issue Feb 15, 2017

[ADAM-1018] Add support for Spark SQL Datasets.
Resolves bigdatagenomics#1018. Adds the `adam-codegen` module, which generates classes that:

1. Implement the Scala Product interface and thus can be read into a Spark SQL
Dataset.
2. Have a complete constructor that is compatible with the constructor that
Spark SQL expects to see when exporting a Dataset back to Scala.
3. And, that have methods for converting to/from the bdg-formats Avro models.

Then, we build these model classes in the `org.bdgenomics.adam.sql` package,
and use them for export from the Avro based GenomicRDDs. With a Dataset, we
can then export to a DataFrame, which enables us to expose data through
Python via RDD->Dataset->DataFrame. This is important since the Avro classes
generated by bdg-formats can't be pickled, and thus we can't do a Java RDD to
Python RDD crossing with them.
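As a rough illustration of the kind of class the commit message describes (a hand-written sketch with invented fields, not the actual generated code): a Scala case class already implements the Product interface and has a complete constructor, which is what Spark SQL needs for encoding; the generated classes additionally carry conversions to/from the Avro models, stubbed here as plain tuples:

```scala
// Hand-written sketch of a codegen-style model class; fields are invented.
case class VariantSql(contigName: String, start: Long, alt: String) {
  // In the generated classes, a method like this would rebuild the
  // bdg-formats Avro model; stubbed here as a plain tuple.
  def toAvroLike: (String, Long, String) = (contigName, start, alt)
}

object VariantSql {
  // Companion conversion in the other direction (Avro model -> SQL model).
  def fromAvroLike(t: (String, Long, String)): VariantSql =
    VariantSql(t._1, t._2, t._3)
}
```

Because the class is a plain Product with a public constructor, Spark SQL can both encode a Dataset[VariantSql] and reconstruct instances when collecting back to Scala.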

fnothaft added a commit to fnothaft/adam that referenced this issue Feb 17, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Feb 18, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

@heuermh heuermh added this to Triage in Release 0.23.0 Mar 8, 2017

fnothaft added a commit to fnothaft/adam that referenced this issue Apr 10, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue May 11, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue May 11, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue May 11, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue May 12, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue May 22, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue May 24, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue May 24, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 21, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 22, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 22, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 22, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 22, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 22, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 23, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 23, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 26, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 26, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jun 26, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jul 7, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jul 10, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jul 10, 2017
[ADAM-1018] Add support for Spark SQL Datasets.

fnothaft added a commit to fnothaft/adam that referenced this issue Jul 10, 2017
[ADAM-1018] Add support for Spark SQL Datasets.
heuermh added a commit that referenced this issue Jul 11, 2017

[ADAM-1018] Add support for Spark SQL Datasets.
Resolves #1018.

@heuermh heuermh moved this from Triage to Completed in Release 0.23.0 Jan 4, 2018
