transformAlignments cannot repartition files #1808

Closed
akmorrow13 opened this Issue Nov 19, 2017 · 5 comments

Comments

3 participants
@akmorrow13
Contributor

akmorrow13 commented Nov 19, 2017

~/ADAM/adam/bin/adam-submit --packages org.apache.parquet:parquet-avro:1.8.2 --master yarn-client --num-executors 8 --executor-memory 20g -- transformAlignments /data/platinum/NA12877_S1.bam /data/platinum/transformedAlignments/NA12877_S1.bam.adam -repartition 200

or ~/ADAM/adam/bin/adam-submit --master yarn-client --num-executors 8 --executor-memory 20g -- transformAlignments -repartition 200 /data/platinum/NA12877_S1.bam /data/platinum/transformedAlignments/NA12877_S1.bam.adam

Neither command repartitions the output.

@fnothaft

Member

fnothaft commented Nov 19, 2017

What happens instead? What happens if you use -coalesce instead? If you're decreasing the number of partitions, you'll generally want to use coalesce rather than repartition, as coalesce is shuffle-free.

@akmorrow13

Contributor

akmorrow13 commented Nov 19, 2017

I used repartition. The result was generating ~1200 partitions, instead of 200. It's hard to tell the starting partitions from a bam file, so I guess one would calculate the estimated partitions based on resources then call repartition or coalesce accordingly?
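A minimal sketch of the "call repartition or coalesce accordingly" decision, assuming the current and target partition counts are already known. The helper name `choose_flag` is hypothetical and is not part of ADAM. Only the `-coalesce` and `-repartition` flag names come from this thread.

```python
from typing import Optional


def choose_flag(current_partitions: int, target_partitions: int) -> Optional[str]:
    """Hypothetical helper: pick which transformAlignments flag to pass.

    Follows the advice in this thread: coalesce when decreasing the
    partition count (no shuffle needed), repartition when increasing it.
    """
    if current_partitions <= 0 or target_partitions <= 0:
        raise ValueError("partition counts must be positive")
    if target_partitions < current_partitions:
        return "-coalesce"      # shrinking: merge partitions without a shuffle
    if target_partitions > current_partitions:
        return "-repartition"   # growing: a shuffle is required
    return None                 # already at the target count


# Going from ~1200 partitions down to 200, as in this issue:
print(choose_flag(1200, 200))
```

For the case reported here (about 1200 partitions, target of 200) the helper picks `-coalesce`.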

@fnothaft

Member

fnothaft commented Nov 19, 2017

> I used repartition. The result was generating ~1200 partitions, instead of 200.

Can you send me the application ID offline? I'd like to take a look.

> It's hard to tell the starting partitions from a bam file

You should get roughly partitions = file_size / block_size. We have the cluster configured with a 128MB block size, so a 128GB file would have approximately 1024 partitions.
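The rule of thumb above can be checked with a few lines of arithmetic. This is only a sketch: the function name `estimate_partitions` is hypothetical, and it assumes binary units (1 MB = 1024² bytes, 1 GB = 1024³ bytes) and that each block maps to one input partition.

```python
MB = 1024 ** 2  # assumption: binary megabytes
GB = 1024 ** 3  # assumption: binary gigabytes


def estimate_partitions(file_size_bytes: int, block_size_bytes: int = 128 * MB) -> int:
    """Hypothetical helper: partitions ~= file_size / block_size, rounded up."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    # Integer ceiling division: a partially filled final block still counts.
    return -(-file_size_bytes // block_size_bytes)


# A 128GB file with a 128MB block size:
print(estimate_partitions(128 * GB))  # 1024
```

The result is an estimate of the starting partition count only. The actual number depends on how the input format splits the file.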

@akmorrow13

Contributor

akmorrow13 commented Nov 20, 2017

Sent, @fnothaft! That would explain the large number of partitions, given the file was ~120GB.

@fnothaft

Member

fnothaft commented Jan 9, 2018

I think we debugged this locally. Closing. @akmorrow13, please reopen if I was wrong.

@fnothaft fnothaft closed this Jan 9, 2018

@heuermh heuermh added this to the 0.24.0 milestone Jan 9, 2018

@heuermh heuermh added this to Completed in Release 0.24.0 Feb 10, 2018
