Fail to transform a VCF file containing multiple genome data (Multiple sample) #2029

Closed
bioinfornatics opened this Issue Aug 27, 2018 · 9 comments

bioinfornatics commented Aug 27, 2018

Dear,

I tried both transformGenotypes and transformVariants to convert a VCF file containing data for multiple genomes, without success. However, transformVariants works on the example VCF you provide, bqsr1.vcf.

$ hdfs dfs -ls -C data
data/multiple_sample.chr1.vcf
$ bin/adam-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 2 \
  --driver-memory 4g \
  --executor-memory 40g \
  --executor-cores 3 \
  --verbose \
  --conf spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console \
  -- transformGenotypes /user/jmercier/data/multiple_sample.chr1.vcf /user/jmercier/multiple_sample.chr1.adam
....
18/08/27 16:59:10 INFO Client:
         client token: N/A
         diagnostics: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/jmercier/multiple_sample.chr1.adam already exists

I do not understand what the problem is, as the directory was created by the tool itself!

Do you have any tips?

Thanks for your help

Best regards

Note: I am using release 0.24.

Member

heuermh commented Aug 27, 2018

Hello @bioinfornatics!

Is /user/jmercier/multiple_sample.chr1.adam a writeable location on HDFS? I note that you are reading from /user/jmercier/data/.


bioinfornatics commented Aug 27, 2018

Hello @heuermh

The directory /user/jmercier/multiple_sample.chr1.adam is created by the tool (transformVariants or transformGenotypes).
The HDFS home directory (/user/jmercier/) should be writeable, since the tool creates a directory there and I have already successfully converted bqsr1.vcf into the home directory /user/jmercier/.

I have tried setting the output directory to both /user/jmercier/ and /user/jmercier/data, with the same result.

I will try splitting my single VCF file into multiple VCF files and then run transformVariants again.

$ hdfs dfs -ls -d
drwxr-xr-x   - jmercier jmercier        4096 2018-08-27 17:28 .
$ hdfs dfs -ls -d data
drwxrwxrwx   - jmercier jmercier       16384 2018-08-27 14:18 data

Member

heuermh commented Aug 27, 2018

Thank you for the clarification. transformGenotypes will complain if the output directory already exists; it shouldn't try to create the directory more than once. I will try to replicate.
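
Since transformGenotypes refuses to write to a path that already exists, a guard run before submission can catch a leftover directory early. The following is only a minimal sketch, assuming the `hdfs` command-line client is on the PATH; `output_is_free` is a hypothetical helper name and not something ADAM provides.

```shell
#!/usr/bin/env bash
# Sketch: fail fast if the HDFS output path already exists.
# output_is_free is a hypothetical helper, not part of ADAM.
output_is_free() {
  local out="$1"
  # `hdfs dfs -test -e PATH` exits with status 0 when PATH exists.
  if hdfs dfs -test -e "$out"; then
    echo "Output path already exists: $out" >&2
    return 1
  fi
  return 0
}

# Example use (output path taken from the report above):
# output_is_free /user/jmercier/multiple_sample.chr1.adam \
#   && bin/adam-submit --master yarn ... -- transformGenotypes ...
```

The function only reports the situation; whether to remove or rename an existing directory is left to the user, because deleting output automatically could discard a previous successful run.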


bioinfornatics commented Aug 27, 2018

Thanks @heuermh

I also tried using data from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
with the same result.

Member

heuermh commented Aug 28, 2018

I have been unable to replicate such a problem. Does the error happen early or late in processing?

$ wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ hadoop fs \
    -put \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ hadoop fs -ls
-rw-r--r--   2 user group   1216886729 2018-08-27 11:56 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ adam-submit --master yarn ... \
    -- \
    transformGenotypes \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.adam

$ hadoop fs -ls
drwxr-xr-x   - user group            0 2018-08-27 17:17 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.adam
-rw-r--r--   2 user group   1216886729 2018-08-27 11:56 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ adam-shell --master yarn ...

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val genotypes = sc.loadGenotypes("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.adam")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD = ParquetUnboundGenotypeRDD with 86 reference sequences and 2504 samples

scala> genotypes.dataset.count()
res0: Long = 16277357168
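
One way to answer the early-or-late question above is to look for the first occurrence of the exception in the application logs and compare its timestamp with the job's start time. This is a rough sketch, assuming YARN log aggregation is enabled so that `yarn logs` can return the logs; `first_exception_line` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the first log line that mentions the exception, so its
# timestamp can be compared with the time the job was submitted.
# first_exception_line is a hypothetical helper name.
first_exception_line() {
  # Reads log text on stdin; -m 1 stops after the first matching line.
  grep -m 1 'FileAlreadyExistsException'
}

# Example use, with the application id that YARN reports at submission:
# yarn logs -applicationId <application id> | first_exception_line
```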

bioinfornatics commented Aug 28, 2018

Thanks @heuermh ,

I finally managed to transform the genotypes with this command:

bin/adam-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 4g \
  --executor-cores 3 \
  --verbose \
  --num-executors 10 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console \
  -- transformGenotypes /user/jmercier/1000G/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf /user/jmercier/ALL.chr1.genotypes.adam

I do not know why the command worked this time; I need to run some more tests.

Thanks a lot for your help; it is a real pleasure.

Have a nice day

Best regards


bioinfornatics commented Aug 28, 2018

A small request, @heuermh: could the check for whether the output directory already exists be done early?
Currently the check only fails after about an hour of computation, so running it up front would save time.

Thanks

Member

heuermh commented Aug 28, 2018

Thank you, @bioinfornatics. Created new issue #2034.

heuermh added this to the 0.24.1 milestone Aug 28, 2018

Contributor

akmorrow13 commented Sep 2, 2018

I am also getting this issue in Cannoli: bigdatagenomics/cannoli#137
