Fail to transform a VCF file containing multiple genome data (multiple samples) #2029

Closed
bioinfornatics opened this issue Aug 27, 2018 · 9 comments

@bioinfornatics bioinfornatics commented Aug 27, 2018

Dear,

I tried both transformGenotypes and transformVariants to convert a VCF file containing data for multiple genomes (multiple samples), without success. However, transformVariants does work on your provided example VCF, bqsr1.vcf.

$ hdfs dfs -ls  -C data
data/multiple_sample.chr1.vcf
$ bin/adam-submit \
  --master yarn  \
  --deploy-mode cluster \
  --num-executors 2 \
  --driver-memory 4g \
  --executor-memory 40g \
  --executor-cores 3 \
  --verbose \
  --conf spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console \
  -- transformGenotypes /user/jmercier/data/multiple_sample.chr1.vcf /user/jmercier/multiple_sample.chr1.adam
....
18/08/27 16:59:10 INFO Client:
         client token: N/A
         diagnostics: User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /user/jmercier/multiple_sample.chr1.adam already exists

I do not understand what the problem is, as the directory was created by the tool itself!

Do you have any tips?

Thanks for your help

Best regards

Note: I am using release 0.24.

Member

@heuermh heuermh commented Aug 27, 2018

Hello @bioinfornatics!

Is /user/jmercier/multiple_sample.chr1.adam a writeable location on HDFS? I note that you are reading from /user/jmercier/data/.

Author

@bioinfornatics bioinfornatics commented Aug 27, 2018

Hello @heuermh

The directory /user/jmercier/multiple_sample.chr1.adam is created by the tool (transformVariants or transformGenotypes).
The HDFS home directory (/user/jmercier/) should be writeable, since the tool creates a directory there and I have already successfully converted bqsr1.vcf into the home directory /user/jmercier/.

I have tried setting the output directory to /user/jmercier/ and to /user/jmercier/data, with the same result.

I will try splitting my single VCF file into multiple VCF files and running transformVariants again.

$ hdfs dfs -ls -d
drwxr-xr-x   - jmercier jmercier        4096 2018-08-27 17:28 .
$ hdfs dfs -ls -d data
drwxrwxrwx   - jmercier jmercier       16384 2018-08-27 14:18 data
Member

@heuermh heuermh commented Aug 27, 2018

Thank you for the clarification. transformGenotypes will complain if the output directory already exists; it shouldn't try to create the directory more than once. I will try to replicate.
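
When an "already exists" error is reported, it can help to find out whether the existing directory is a finished result or the leftover of an earlier, failed attempt. The following is a minimal sketch, not part of ADAM: `output_state` is a hypothetical helper name, and it assumes the job used Hadoop's default output committer, which writes an empty `_SUCCESS` marker when a job finishes cleanly.

```shell
# Classify an HDFS output directory as absent, complete, or incomplete.
# Assumption: the default Hadoop output committer is in use, so a cleanly
# finished job leaves an empty _SUCCESS marker file in the directory.
# `hdfs dfs -test -e PATH` exits with status 0 when PATH exists.
output_state() {
  if ! hdfs dfs -test -e "$1"; then
    echo "absent"
  elif hdfs dfs -test -e "$1/_SUCCESS"; then
    echo "complete"
  else
    echo "incomplete"
  fi
}

# Example, using the output path from this thread:
# output_state /user/jmercier/multiple_sample.chr1.adam
```

A result of "incomplete" would suggest that the directory was created by a run that did not finish, in which case it has to be removed before the command is submitted again.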

Author

@bioinfornatics bioinfornatics commented Aug 27, 2018

Thanks @heuermh

I also tried using data from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ and got the same result.

Member

@heuermh heuermh commented Aug 28, 2018

I have been unable to replicate such a problem. Does the error happen early or late in processing?

$ wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ hadoop fs \
    -put \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ hadoop fs -ls
-rw-r--r--   2 user group   1216886729 2018-08-27 11:56 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ adam-submit --master yarn ... \
    -- \
    transformGenotypes \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
    ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.adam

$ hadoop fs -ls
drwxr-xr-x   - user group            0 2018-08-27 17:17 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.adam
-rw-r--r--   2 user group   1216886729 2018-08-27 11:56 ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ adam-shell --master yarn ...

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val genotypes = sc.loadGenotypes("ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.adam")
genotypes: org.bdgenomics.adam.rdd.variant.GenotypeRDD = ParquetUnboundGenotypeRDD with 86 reference sequences and 2504 samples

scala> genotypes.dataset.count()
res0: Long = 16277357168
Author

@bioinfornatics bioinfornatics commented Aug 28, 2018

Thanks @heuermh,

I finally managed to transform the genotypes with this command:

bin/adam-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 4g \
  --executor-cores 3 \
  --verbose \
  --num-executors 10 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.driver.extraJavaOptions=-Dlog4jspark.root.logger=WARN,console \
  -- transformGenotypes /user/jmercier/1000G/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf /user/jmercier/ALL.chr1.genotypes.adam

I do not know why the command worked this time; I need to run some more tests.

Thanks a lot for your help, it is a real pleasure.

Have a nice day

Best regards

Author

@bioinfornatics bioinfornatics commented Aug 28, 2018

A small request, @heuermh: could the check for whether the output directory already exists be done earlier?
Currently, this check only fails after one hour of computation, so running it up front would save time.

Thanks
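
Until such a check exists in the tool, the same idea can be approximated on the user side before submitting. This is only a sketch of a possible workaround, not something ADAM provides: `check_output_absent` is a hypothetical helper name, and the paths in the usage comment are the ones quoted earlier in this thread.

```shell
# Pre-flight check: fail fast if the output path already exists on HDFS,
# instead of discovering it after the job has been running for a while.
# `hdfs dfs -test -e PATH` exits with status 0 when PATH exists.
check_output_absent() {
  if hdfs dfs -test -e "$1"; then
    echo "Output path already exists: $1" >&2
    return 1
  fi
  return 0
}

# Example: only submit when the output path is free.
# check_output_absent /user/jmercier/multiple_sample.chr1.adam \
#   && bin/adam-submit --master yarn --deploy-mode cluster \
#        -- transformGenotypes \
#        /user/jmercier/data/multiple_sample.chr1.vcf \
#        /user/jmercier/multiple_sample.chr1.adam
```

The check is deliberately read-only: it reports the conflict and leaves the decision to delete or rename the existing directory to the user.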

Member

@heuermh heuermh commented Aug 28, 2018

Thank you, @bioinfornatics. Created new issue #2034.

@heuermh heuermh added this to the 0.24.1 milestone Aug 28, 2018
Contributor

@akmorrow13 akmorrow13 commented Sep 2, 2018

I am also getting this issue in Cannoli: bigdatagenomics/cannoli#137
