Unable to infer the A and B alleles while parsing the site... #41

Open
AkshajD opened this issue Nov 11, 2023 · 11 comments

Comments

@AkshajD

AkshajD commented Nov 11, 2023

I have a batch of VCF files from a genotyping array, and I am trying to add ALLELE_A and ALLELE_B to them so that I can run them through MoChA. I used the mochatools command shown below to do so:

bcftools +mochatools $input -- -t ALLELE_A,ALLELE_B,GC -f $reference > $output

I am getting the error "Unable to infer the A and B alleles while parsing the site:" for all non-0/0 sites.
Can you please offer some advice on why this might be the case and how to fix it?

P.S. Not sure whether this has anything to do with it, but the VCFs were pre-generated by the org that provides our dataset, and we had to add the LRR and BAF fields manually afterwards.

@freeseek
Owner

The BCFtools/mochatools plugin infers which allele is A and which is B as long as at least one homozygous AA or one homozygous BB genotype is observed at the site. Sites where every sample is heterozygous cannot be inferred; it is simply not possible. If you have enough samples in the VCF, this should not be a problem. Are you running the tool on a single-sample VCF? My advice is to go back to the org that provides your dataset and tell them to do the right thing and give you the IDAT files (or CEL files if it is Affymetrix data).
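To see how many sites are affected before running the plugin, something like the sketch below could help. It is plain Python, not part of MoChA, and it only checks the genotype condition described above (no sample with a called homozygous genotype). The function names are made up for illustration, and the plugin's actual logic may take more into account than this.

```python
def is_hom(gt: str) -> bool:
    """True for a called diploid genotype with two identical alleles."""
    alleles = gt.replace("|", "/").split("/")
    return len(alleles) == 2 and "." not in alleles and alleles[0] == alleles[1]


def uninformative_sites(vcf_lines):
    """Yield (chrom, pos) for records where no sample is a called homozygote.

    Assumes uncompressed, tab-separated VCF text with a GT entry in FORMAT.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # nothing to check without genotypes
        gt_idx = fmt.index("GT")
        gts = [sample.split(":")[gt_idx] for sample in fields[9:]]
        if not any(is_hom(gt) for gt in gts):
            yield fields[0], int(fields[1])
```

Passing an open file handle for an uncompressed VCF (or the decoded output of `bcftools view`) as `vcf_lines` would list the sites where A and B cannot be told apart from genotypes alone.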

@AkshajD
Author

AkshajD commented Nov 14, 2023

I have tried running it both with a single-sample VCF (I now understand why that gives an error) and with a test VCF containing 10 samples. The error persists.

Is this just a result of still not having a sufficient number of samples? Just want to resolve the issue before we run the algo to add LRR and BAF to tons of different VCFs.

Will try to follow up with the org but have had low success with them about this issue in the past.

@freeseek
Owner

With 10 samples in a VCF, a very common variant with minor allele frequency close to 0.5 still has a ~1/1,000 chance (0.5^10 ≈ 1/1,024) that all samples are heterozygous. So you may still be unable to infer which allele is ALLELE_A and which is ALLELE_B for a few markers. To be safe, I think you need a VCF with at least ~30 samples from independent participants. Otherwise it is just not possible to retrieve this information. Remember that the root of the issue is that the org that provides your dataset threw that information away. This is not a limitation of MoChA.
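For reference, the arithmetic behind that estimate can be sketched as follows. This assumes Hardy-Weinberg proportions (my assumption, not something MoChA checks) and independent samples, so the heterozygote frequency is 2p(1-p) and the chance that all n samples are heterozygous is that value raised to the power n. The function name is just illustrative.

```python
def prob_all_het(maf: float, n_samples: int) -> float:
    """Probability that every one of n independent samples is heterozygous.

    Assumes Hardy-Weinberg proportions: P(het) = 2 * p * (1 - p).
    """
    p_het = 2.0 * maf * (1.0 - maf)
    return p_het ** n_samples


if __name__ == "__main__":
    # Compare a single sample, 10 samples, and 30 samples at MAF = 0.5.
    for n in (1, 10, 30):
        print(n, prob_all_het(0.5, n))
```

At a minor allele frequency of 0.5 this gives 1/1,024 for 10 samples and roughly 1 in a billion for 30 samples, which is why ~30 independent samples is a comfortable margin. For rarer variants the probability drops even faster.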

@Tianwen-lab-star

[Attached images: "2", "1"]
Hello, Figure 1 shows the VCF produced by converting a .gtc file, which in turn comes from an .idat file. It is different from the basic VCF format. Could you tell me how to add the ALLELE_A/ALLELE_B/GC/LRR/BAF fields shown in Figure 2?

@freeseek
Owner

freeseek commented Jan 2, 2024

BCFtools/gtc2vcf can automatically add ALLELE_A/ALLELE_B/GC/LRR/BAF when you convert a .gtc file. I have no idea what you are referring to when you say "basic vcf format". One thing is for sure: if a VCF does not have LRR/BAF information, then there is no way to "add" this information.

@Tianwen-lab-star

[Two attached images]
Hello, sorry to bother you. I have another problem. When I run the SHAPEIT step, it says that there is no AC field. My VCF file was converted from GTC, so how should I solve this?

@freeseek
Owner

freeseek commented Jan 3, 2024

SHAPEIT5, unlike SHAPEIT4, requires the AC and AN fields to be filled. You can quickly fill them with either of the following BCFtools commands:

bcftools view -c 0
bcftools +fill-AN-AC
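
In case it is unclear what those two INFO fields hold: AN is the total number of called alleles at a site, and AC is the number of called copies of each ALT allele. Here is a minimal Python sketch of that counting, only to illustrate the definitions. It is not how BCFtools implements it, and the function name is made up.

```python
from collections import Counter


def count_an_ac(genotypes, n_alt):
    """Return (AN, [AC for each ALT allele]) from GT strings like '0/1' or '1|1'.

    Missing alleles ('.') are not counted toward AN.
    """
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[int(allele)] += 1
    an = sum(counts.values())
    ac = [counts[i] for i in range(1, n_alt + 1)]
    return an, ac
```

For example, the genotypes 0/0, 0/1, 1|1 and ./. at a biallelic site give AN=6 and AC=3.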

@Tianwen-lab-star

Thank you. It sounds like SHAPEIT5 is a bit more complicated than SHAPEIT4. I've tried a lot of methods found online to build SHAPEIT4, but none of them succeeded. Could you provide a SHAPEIT4 binary that has already been compiled?

@freeseek
Owner

freeseek commented Jan 3, 2024

SHAPEIT4 and phase_common from SHAPEIT5 are identical other than the latter requiring the AC and AN fields, with the advantage that SHAPEIT5 can handle trios. You can find binaries for SHAPEIT5 here. In the past, to generate binaries for SHAPEIT4, I used the following Dockerfile:

FROM debian:testing-slim
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get -qqy update --fix-missing && \
    apt-get -qqy install --no-install-recommends \
                 wget \
                 g++ \
                 make \
                 libboost-iostreams-dev \
                 libboost-program-options-dev \
                 libhts-dev \
                 libbz2-dev \
                 libssl-dev \
                 libboost-iostreams1.74.0 \
                 libboost-program-options1.74.0 \
                 bcftools && \
    wget --no-check-certificate https://github.com/odelaneau/shapeit4/archive/v4.2.2.tar.gz && \
    tar xzf v4.2.2.tar.gz && \
    cd shapeit4-4.2.2 && \
    sed -i 's/^HTSLIB_INC=\$(HOME)\/Tools\/htslib-1.11$/HTSLIB_INC=-Ihtslib/' makefile && \
    sed -i 's/^HTSLIB_LIB=\$(HOME)\/Tools\/htslib-1.11\/libhts.a$/HTSLIB_LIB=-lhts/' makefile && \
    sed -i 's/^BOOST_LIB_IO=\/usr\/lib\/x86_64-linux-gnu\/libboost_iostreams.a$/BOOST_LIB_IO=-lboost_iostreams/' makefile && \
    sed -i 's/^BOOST_LIB_PO=\/usr\/lib\/x86_64-linux-gnu\/libboost_program_options.a$/BOOST_LIB_PO=-lboost_program_options/' makefile && \
    make && \
    mv bin/shapeit4.2 /usr/bin/ && \
    cd .. && \
    apt-get -qqy purge --auto-remove --option APT::AutoRemove::RecommendsImportant=false \
                 wget \
                 g++ \
                 make \
                 libboost-iostreams-dev \
                 libboost-program-options-dev \
                 libhts-dev \
                 libbz2-dev \
                 libssl-dev && \
    apt-get -qqy clean && \
    rm -rf v4.2.2.tar.gz \
           shapeit4-4.2.2 \
           /var/lib/apt/lists/*

@Tianwen-lab-star

[Attached image]
Hello, when I try to add ALLELE_A or ALLELE_B to a VCF file using the code above, I get this error: "Error: BAF format field is not present, cannot infer ALLELE_A or ALLELE_B"
The VCF files were genotyped and exported by Axiom™ Analysis Suite.

@freeseek
Owner

Your VCF does not include intensity data, so it would be pointless to identify which one is the A allele and which one is the B allele. I would advise you to go back to the table data generated by the Affymetrix Power Tools when you genotyped your samples and then use BCFtools/affy2vcf to generate a VCF with BAF, LRR, ALLELE_A, and ALLELE_B. Then you don't have to worry about file formatting issues.
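
A quick way to tell in advance whether a VCF carries intensity data is to look at which FORMAT fields its header declares. The sketch below is a hypothetical helper in plain Python, not part of BCFtools or MoChA. It only inspects header lines, so a field that is declared but empty in the records would still pass.

```python
import re


def declared_format_fields(header_lines):
    """Collect the IDs of all ##FORMAT declarations in a VCF header."""
    ids = set()
    for line in header_lines:
        match = re.match(r"##FORMAT=<ID=([^,>]+)", line)
        if match:
            ids.add(match.group(1))
    return ids


def missing_intensity_fields(header_lines, required=("BAF", "LRR")):
    """Return the required FORMAT fields that the header does not declare."""
    declared = declared_format_fields(header_lines)
    return [field for field in required if field not in declared]
```

If `missing_intensity_fields` returns a non-empty list for your header, the file needs to be regenerated from the raw array data rather than patched.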
