Upgrade from Parquet 1.4.3 to 1.6.0rc4 #508

Merged
merged 1 commit into bigdatagenomics:master from massie:parquet-1.6.0 on Nov 25, 2014

Conversation

4 participants
@massie
Member

massie commented Nov 24, 2014

This upgrade adds dictionary support which dramatically reduces the
memory footprint.

For example, here's a simple program to load and count mouse mitochondrial DNA reads:

// `ac` is assumed to be an ADAMContext created from the active SparkContext
val reads = ac.loadAlignments("/workspace/data/mouse_chrM.adam").cache()
println(reads.count())

With 1.6.0rc4, this RDD can be cached fully in memory, e.g.

14/11/24 10:37:16 INFO BlockManagerInfo: Added rdd_1_0 in memory on zenfractal:60004 (size: 193.0 MB, free: 72.5 MB)

With 1.4.3, it could not be, e.g.

14/11/24 10:33:07 WARN CacheManager: Not enough space to cache partition rdd_1_0 in memory! Free memory is 278185689 bytes.
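
If you want to verify the cached size programmatically rather than by grepping the logs, here is a sketch of one way to do it, assuming Spark 1.x's SparkContext.getRDDStorageInfo developer API and that sc is the underlying SparkContext:

sc.getRDDStorageInfo.foreach { info =>
  // Reports, per cached RDD, how many partitions are cached and their in-memory size
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
    s"partitions cached, ${info.memSize} bytes in memory")
}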

@fnothaft
Member

fnothaft commented Nov 24, 2014

w00t! +1.

val reads = ac.loadAlignments("/workspace/data/mouse_chrM.adam").cache()
println(reads.count())

Any idea what this looks like on a larger dataset?

@AmplabJenkins

AmplabJenkins commented Nov 24, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/445/
Test PASSed.

@arahuja
Contributor

arahuja commented Nov 24, 2014

@massie could you expand a bit more (or point me to more detail) on what's happening here? For example, once the records are deserialized, does this still have an effect on their in-memory size? And what about once they are reserialized and shuffled?

@massie
Member

massie commented Nov 24, 2014

This memory reduction occurs on string columns that are dictionary encoded (e.g. referenceName). Instead of a new string being created for each record, the dictionary string reference is reused. For example, if the dictionary has "chromosome 1" and "chromosome 2", then every record materialized will hold a reference to one of these two strings instead of its own newly created copy.
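
To make the sharing concrete, here is a minimal sketch with hypothetical values; the real decoding happens inside Parquet's column readers, and dictionary and encodedIds stand in for the reader's internal state:

// The column's dictionary is decoded once.
val dictionary = Array("chromosome 1", "chromosome 2")
// Each record stores only a small integer id into the dictionary.
val encodedIds = Seq(0, 0, 1, 0, 1)
// Materializing records yields references into the shared dictionary,
// not fresh String copies.
val referenceNames = encodedIds.map(dictionary(_))
assert(referenceNames(0) eq referenceNames(1)) // same String instance, not equal copies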

@massie
Member

massie commented Nov 24, 2014

The shuffle is still going to be an issue, though; I'm working on that now. I want to use Parquet for in-flight objects too, instead of plain Avro.

@massie
Member

massie commented Nov 24, 2014

@fnothaft What exactly are you trying to understand about larger datasets? It's not clear to me. You will get memory savings regardless of the dataset size.

@fnothaft
Member

fnothaft commented Nov 24, 2014

@massie Just the % difference. E.g., if you set up a 5-node cluster and load HG00096, what is the before/after memory usage? I don't care so much about the size of the dataset; I'd just like to see the difference when the "before" dataset fits in memory. Sorry if it was unclear.

@massie
Member

massie commented Nov 24, 2014

@fnothaft It would depend on how many dictionary-encoded string columns you have in your records and the cardinality of those columns.

For a dictionary-encoded string column, the size before was string length × number of records. Now, the size is string length × number of dictionary entries.
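
As a back-of-the-envelope example (hypothetical numbers; this also ignores per-String JVM object overhead, which makes the real savings larger still):

// 10 million records whose 12-character referenceName takes one of two values.
val records     = 10000000L
val dictEntries = 2L
val nameLength  = 12L
val charsBefore = nameLength * records      // 120,000,000 chars: one copy per record
val charsAfter  = nameLength * dictEntries  // 24 chars: one copy per dictionary entry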

@massie
Member

massie commented Nov 24, 2014

Once this is merged, we should cut a 0.15.0 release and recommend that people move over. The 0.14.0 release also has an Avro serialization regression that wraps every object in an Avro data file.

@massie
Member

massie commented Nov 25, 2014

@fnothaft Look good enough to merge? Hoping to cut a release soon.

fnothaft added a commit that referenced this pull request Nov 25, 2014

Merge pull request #508 from massie/parquet-1.6.0
Upgrade from Parquet 1.4.3 to 1.6.0rc4

@fnothaft fnothaft merged commit 024ca02 into bigdatagenomics:master Nov 25, 2014

1 check passed: default (Merged build finished.)
@fnothaft
Member

fnothaft commented Nov 25, 2014

Merged! Thanks @massie!

@massie massie deleted the massie:parquet-1.6.0 branch Sep 2, 2015
