Work With ADAM fasta2adam in a distributed mode #881

Closed
guerai opened this Issue Nov 12, 2015 · 6 comments

guerai commented Nov 12, 2015

Hello,
I'm a beginner in genomics, but I'm studying k-mer sequences and I'm using Spark on Hadoop. I want to test ADAM's performance on my cluster. As a first step, I want to convert my FASTA test files into ADAM files using fasta2adam.
I'm using the adam-submit command:
adam-submit --conf spark.yarn.jar ${SPARK_HOME}/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar --master yarn-cluster --driver-memory 4g --num-executors 15 --executor-cores 2 --executor-memory 4g fasta2adam INPUT_FILE OUTPUT_FILE

but the command does not run on the cluster, only on a single node.
Could someone help me?
Thanks for your time.
Best regards

Member

heuermh commented Nov 12, 2015

I'm not sure why ${SPARK_HOME}/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar is there. Might it be mistakenly interpreted as the application jar?

https://spark.apache.org/docs/1.5.1/submitting-applications.html#launching-applications-with-spark-submit

guerai commented Nov 12, 2015

Sorry, it was a misprint. I want to run the fasta2adam conversion on my test cluster and I'm using this command:
adam-submit --master yarn-cluster --driver-memory 4g --num-executors 15 --executor-cores 2 --executor-memory 4g fasta2adam INPUT_FILE OUTPUT_FILE
The cluster works well with Hadoop and Spark, but not when using adam-submit.
Thanks for your time

Member

heuermh commented Nov 12, 2015

You might try

adam-submit \
  --master yarn-cluster \
  --driver-memory 4g \
  --num-executors 15 \
  --executor-cores 2 \
  --executor-memory 4g \
  -- \
  fasta2adam INPUT_FILE OUTPUT_FILE

Note the -- separating the Spark and ADAM options. This feature was added recently and may not be very obvious from documentation (e.g. I don't see it mentioned anywhere in the current version of README.md).

It is shown in the usage message:

adam-submit \
  --master yarn-cluster \
  --driver-memory 4g \
  --num-executors 15 \
  --executor-cores 2 \
  --executor-memory 4g \
  fasta2adam adam-core/src/test/resources/artificial.fa artificial.adam

Using ADAM_MAIN=org.bdgenomics.adam.cli.ADAMMain
Using SPARK_SUBMIT=/usr/local/bin/spark-submit


     e            888~-_              e                 e    e
    d8b           888   \            d8b               d8b  d8b
   /Y88b          888    |          /Y88b             d888bdY88b
  /  Y88b         888    |         /  Y88b           / Y88Y Y888b
 /____Y88b        888   /         /____Y88b         /   YY   Y888b
/      Y88b       888_-~         /      Y88b       /          Y888b

Usage: adam-submit [<spark-args> --] <adam-args>
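
To make the role of the -- separator concrete, here is a minimal sketch of how a wrapper script could split its command line according to the usage line above. It is a hypothetical illustration, not the actual adam-submit source: the function name split_args and its spark/adam selector are made up for this example. The one behavior taken from the usage line is that, when no -- is present, every argument is treated as an ADAM argument.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real adam-submit script.
# Prints, one per line, the arguments that would go to Spark or to ADAM.
# Usage: split_args spark|adam ARG...
# Limitation: arguments that contain newlines are not handled.
split_args() {
  want=$1
  shift
  before=""
  after=""
  seen=no
  for arg in "$@"; do
    if [ "$seen" = no ] && [ "$arg" = "--" ]; then
      # First "--": everything after it belongs to ADAM.
      seen=yes
    elif [ "$seen" = no ]; then
      before="$before$arg
"
    else
      after="$after$arg
"
    fi
  done
  # No separator at all: per "[<spark-args> --] <adam-args>",
  # the whole command line is ADAM arguments.
  if [ "$seen" = no ]; then
    after=$before
    before=""
  fi
  case $want in
    spark) printf '%s' "$before" ;;
    adam)  printf '%s' "$after" ;;
  esac
}
```

Under this sketch, split_args spark --master yarn-cluster -- fasta2adam INPUT_FILE OUTPUT_FILE prints only the two Spark arguments, while the same command line without -- yields no Spark arguments at all. That would be consistent with the job not being submitted to the cluster when the separator is left out.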
guerai commented Nov 13, 2015

Thanks for your answer. Now my job starts on my cluster and I get all the executors that I set on the command line. I'm trying to convert my file now. Thanks for your help, I'll post an update on my progress.
Best regards

guerai commented Nov 14, 2015

Thanks. With your help I was able to convert a FASTA file into an ADAM file, but when I try to count k-mers

adam-submit --master yarn-cluster --driver-memory 4g --num-executors 15 --executor-cores 2 --executor-memory 4g -- count_kmers hdfs://INPUTFILE hdfs://OUTPUTFILE 20

I always receive the same error:
User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://OUTPUTFILE already exists

even if I change the output path on every execution.
It happens both when I use as input the entire directory of the converted file, which contains all the part-** subfiles, and when I use the single merged file created with hadoop fs -getmerge.
I'm looking for another way to solve it.
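
The thread does not establish why the error persists when the output path changes. One thing that may be worth ruling out (an assumption, not something confirmed here) is that the path really is absent before the job starts, since in yarn-cluster mode YARN can re-run a failed application attempt and the second attempt would then find the directory left by the first. A minimal pre-flight check could look like the sketch below. The function name output_path_is_free and the HDFS_CMD override are hypothetical. The only real command used is hdfs dfs -test -e, which exits with status 0 when the path exists.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of ADAM.
# HDFS_CMD can be overridden, e.g. to point at a different hdfs binary.
HDFS_CMD=${HDFS_CMD:-hdfs}

# Returns 0 when the given path does not exist yet, 1 when it does.
output_path_is_free() {
  # "hdfs dfs -test -e PATH" exits 0 if PATH exists.
  if $HDFS_CMD dfs -test -e "$1"; then
    echo "Output path already exists: $1" >&2
    return 1
  fi
  return 0
}

# Example use before submitting (placeholder path from the thread):
#   output_path_is_free hdfs://OUTPUTFILE && adam-submit ... -- count_kmers hdfs://INPUTFILE hdfs://OUTPUTFILE 20
```

If the check passes and the job still fails with FileAlreadyExistsException, the logs of the first YARN application attempt are the place to look for the original failure.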

Member

fnothaft commented Jul 6, 2016

Closing as resolved/not an ADAM bug.

@fnothaft fnothaft closed this Jul 6, 2016
