Fix record oriented shuffle #599

Closed
fnothaft opened this Issue Mar 1, 2015 · 4 comments

Member

fnothaft commented Mar 1, 2015

Because our shuffle is record-oriented, data volume increases by approximately 8-10x when we shuffle. Our data is stored on disk in a columnar representation but is shuffled in a row-oriented format.
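To make the size difference concrete, here is a minimal sketch of the effect. It is a toy model, not ADAM's or Spark's actual serialization: JSON stands in for the wire format, run-length encoding stands in for columnar encoding, and every field name is hypothetical. The ratio it produces is not meant to match the 8-10x figure above.

```python
import json
from itertools import groupby


def make_records(n):
    """Build toy alignment-like records; every field name here is hypothetical."""
    return [
        {
            "contig": "chr1",
            "start": 1000 + i,
            "record_group": "sample_A.lane_1",
            "mapq": 60,
        }
        for i in range(n)
    ]


def row_oriented_size(records):
    """Bytes used when each record is serialized on its own, one row at a time."""
    return sum(len(json.dumps(r).encode("utf-8")) for r in records)


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


def column_oriented_size(records):
    """Bytes used when values are grouped by column and run-length encoded."""
    columns = {name: [r[name] for r in records] for name in records[0]}
    encoded = {name: run_length_encode(vals) for name, vals in columns.items()}
    return len(json.dumps(encoded).encode("utf-8"))


records = make_records(1000)
row_bytes = row_oriented_size(records)
column_bytes = column_oriented_size(records)
print(f"row-oriented: {row_bytes} bytes, column-oriented: {column_bytes} bytes")
```

In the row-oriented form, the field names and the repeated values travel with every record. In the column-oriented form, each repeated column collapses to a single run.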

@fnothaft fnothaft added this to the 0.17.0 milestone Mar 1, 2015

Contributor

tdanford commented Mar 1, 2015

So what's the proposed fix?

Member

fnothaft commented Mar 1, 2015

TBD?

Contributor

tdanford commented Mar 1, 2015

Gotcha.

Member

ryan-williams commented May 31, 2015

FTR: presumably @massie's SPARK-7263 is our best hope here?

@fnothaft fnothaft modified the milestones: 1.0.0, 0.17.0 May 31, 2015

fnothaft added a commit to fnothaft/adam that referenced this issue Dec 27, 2015

[ADAM-599] Eliminate shuffle issues by writing metadata to avro files.
Resolves #599. Since we have added the RecordGroupMetadata fields in
bdg-formats:0.7.0, we can read/write our metadata as separate Avro files. We
process these files when loading/writing the Parquet files where the alignment
data is stored. This allows us to eliminate the bulky metadata that we
currently store in the AlignmentRecord while keeping the Sequence and
RecordGroup dictionaries that we need.
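The idea in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: JSON stands in for Avro and Parquet, and the field names, the `RECORD_GROUPS` dictionary, and the helper functions are all hypothetical.

```python
import json

# Hypothetical record-group dictionary, written once to a side file.
RECORD_GROUPS = {
    "rg1": {
        "sample": "NA12878",
        "library": "lib1",
        "platform": "ILLUMINA",
        "sequencing_center": "center_x",
    },
}


def embedded_record(i):
    """Record that carries a full copy of its record-group metadata."""
    metadata = dict(RECORD_GROUPS["rg1"], name="rg1")
    return {"start": i, "sequence": "ACGT", "record_group": metadata}


def slim_record(i):
    """Record that keeps only the record-group name."""
    return {"start": i, "sequence": "ACGT", "record_group_name": "rg1"}


def serialized_size(records):
    """Total bytes when every record is serialized independently."""
    return sum(len(json.dumps(r).encode("utf-8")) for r in records)


def resolve(record, dictionary):
    """Look the metadata up in the dictionary when it is actually needed."""
    return dictionary[record["record_group_name"]]


n = 1000
embedded = [embedded_record(i) for i in range(n)]
slim = [slim_record(i) for i in range(n)]
side_file_bytes = len(json.dumps(RECORD_GROUPS).encode("utf-8"))

print("embedded:", serialized_size(embedded), "bytes")
print("slim + side file:", serialized_size(slim) + side_file_bytes, "bytes")
```

The slim records are what would move through a shuffle. The dictionary is small, is written once, and can be consulted through `resolve` whenever the full metadata is required.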

fnothaft added a commit to fnothaft/adam that referenced this issue Dec 29, 2015

fnothaft added a commit to fnothaft/adam that referenced this issue Dec 29, 2015

fnothaft added a commit to fnothaft/adam that referenced this issue Jan 11, 2016

fnothaft added a commit to fnothaft/adam that referenced this issue Jan 12, 2016

fnothaft added a commit to fnothaft/adam that referenced this issue Jan 12, 2016

@heuermh heuermh closed this in #906 Jan 12, 2016

@heuermh heuermh modified the milestones: 1.0.0, 0.20.0 Oct 13, 2016
