Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

results of "genotype count" are different - depending on used programs #140

Open
jerryivanhoe opened this issue Nov 10, 2015 · 2 comments
Open

Comments

@jerryivanhoe
Copy link

Hello,

we are using Adam to count the total number of Genotypes for the input file “ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf” from 1000 Genome project stored on s3

With Adam we run these 2 steps:

Step 1: convert to Adam with vcf2adam :

adam-submit --master yarn-client --driver-memory 8g
--num-executors $TOTAL_EXECUTORS
--executor-cores $CORES_PER_EXECUTOR
--executor-memory $MEMORY_PER_EXECUTOR
--
vcf2adam
-parquet_compression_codec SNAPPY
hdfs:///user/ec2-user/1kg/chr1.vcf \

hdfs:///user/ec2-user/1kg/chr1.adam

then we start Step2: Adam-Shell

adam-shell --master yarn-client --driver-memory 8g
--num-executors $TOTAL_EXECUTORS
--executor-cores $CORES_PER_EXECUTOR
--executor-memory $MEMORY_PER_EXECUTOR
scala> import org.bdgenomics.adam.rdd.ADAMContext
scala> import org.bdgenomics.formats.avro._
scala> val ac = new ADAMContext(sc)
scala> val genotypes = ac.loadGenotypes("/user/ec2-user/1kg/chr1.adam")
scala> genotypes.count

the result : 16277357168

As a comparision we are also using the HTSJDK with this script:

public static void main(String[] args) throws InterruptedException,
                    SQLException, IOException {

             int position;
             String uuid, chromosome;
             VariantContext ctx;
             //HashMap<String,Boolean> gc = new HashMap<String, Boolean>();
             long count = 0;

             final VCFCodec codec = new VCFCodec();
             final FeatureReader<VariantContext> vcfr = AbstractFeatureReader
                           .getFeatureReader("c:/tmp/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", codec, false);

             final CloseableTribbleIterator<VariantContext> i = vcfr.iterator();

             while (i.hasNext()) {
                    ctx = i.next();

                    GenotypesContext g = ctx.getGenotypes();

                    count += g.size();
             }

       }

And this shows a different result:

16196107376

Any idea ? Are we using the wrong commands ?

greetings
-Jerry

@laserson
Copy link
Contributor

I would guess this has to do with how ADAM handles multiallelic variants. You'll notice that the ADAM Variant schema only supports a single alternate allele:

https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L430-L438

If we encounter variants with multiple alternates, we emit a Variant object for each one. Is this right, @fnothaft?

@fnothaft
Copy link
Member

@laserson that is correct!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants