Genotype calls file manipulation
A set of scripts to manipulate tab-delimited genotype calls files and to convert them to other popular formats.
All Python scripts describe their input and output data formats in the header of each file.
To see the available options, run a script with the --help option:
python script.py --help
Most of these scripts require the custom Python module calls, so make sure that you also download the file calls.py and put it in the same directory as your scripts.
Examples of a tab-delimited genotype calls file (hereafter, tab file).
Two-character coded table (e.g. produced with VariantsToTable from GATK):
CHROM  POS  REF  sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
chr_1  1    A    T/A      ./.      ./.      A/A      ./.      ./.      ./.      ./.
chr_1  2    C    T/C      T/C      ./.      C/C      C/C      ./.      C/C      ./.
chr_1  3    C    C/GCC    C/C      ./.      C/C      C/C      C/C      C/C      C/C
chr_1  4    T    T/T      T/T      ./.      T/T      T/T      T/T      T/T      T/T
chr_2  1    A    A/A      A/A      ./.      A/A      A/A      A/A      A/A      A/A
chr_2  2    C    C/C      C/C      ./.      C/C      C/C      C/C      C/C      C/C
chr_2  3    C    AT/AT    AT/AT    AT/AT    AT/AT    AT/AT    AT/AT    AT/AT    AT/AT
chr_2  4    C    C/C      T/T      C/C      C/C      C/C      C/C      C/C      C/C
chr_2  5    T    T/T      C/C      T/T      C/T      T/T      C/T      T/T      T/T
chr_3  1    G    G/G      ./.      ./.      G/G      ./.      ./.      ./.      ./.
chr_3  2    C    G/C      C/C      ./.      C/C      C/C      ./.      C/C      ./.
chr_3  3    CTT  CTT/CTT  CTT/C    CTT/C    CTT/CTT  CTT/CTT  CTT/CTT  CTT/CTT  CTT/CTT
chr_3  4    TA   T/T      T/T      ./.      T/T      T/T      T/T      T/T      T/TA
chr_3  5    G    */*      G/*      ./.      G/G      G/G      G/G      C/C      G/G
One-character coded tab file, where heterozygous genotypes are represented by the IUPAC ambiguity codes R, Y, M, K, S, W (produced from a two-character coded table with vcfTab_to_callsTab.py):
CHROM  POS  REF  sample1  sample2  sample3  sample4  sample5  sample6  sample7  sample8
chr_1  1    A    W        N        N        A        N        N        N        N
chr_1  2    C    Y        Y        N        C        C        N        C        N
chr_1  3    C    N        C        N        C        C        C        C        C
chr_1  4    T    T        T        N        T        T        T        T        T
chr_2  1    A    A        A        N        A        A        A        A        A
chr_2  2    C    C        C        N        C        C        C        C        C
chr_2  3    C    N        N        N        N        N        N        N        N
chr_2  4    C    C        T        C        C        C        C        C        C
chr_2  5    T    T        C        T        Y        T        Y        T        T
chr_3  1    G    G        N        N        G        N        N        N        N
chr_3  2    C    S        C        N        C        C        N        C        N
chr_3  3    N    N        N        N        N        N        N        N        N
chr_3  4    N    T        T        N        T        T        T        T        N
chr_3  5    G    -        N        N        G        G        G        C        G
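The mapping between the two example tables can be sketched in a few lines of Python. This is only an illustration inferred from the examples above, not the code of vcfTab_to_callsTab.py: the function name to_one_char and the lookup table IUPAC_HET are hypothetical, and the rules (missing calls and indel alleles become N, "*/*" becomes "-") are assumptions read off the tables.

```python
# IUPAC ambiguity codes for heterozygous pairs of single nucleotides.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("AC"): "M",
    frozenset("GT"): "K",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
}


def to_one_char(genotype):
    """Collapse a two-character genotype such as 'T/A' into one character.

    Assumed rules (inferred from the example tables, not from the script):
    '*/*' becomes '-', anything that is not a pair of single A/C/G/T
    alleles (missing calls, indels, 'G/*') becomes 'N'.
    """
    alleles = genotype.split("/")
    if len(alleles) != 2:
        return "N"
    a, b = alleles
    if a == "*" and b == "*":
        return "-"
    if len(a) != 1 or len(b) != 1 or a not in "ACGT" or b not in "ACGT":
        return "N"
    if a == b:
        return a  # homozygous: keep the nucleotide itself
    return IUPAC_HET[frozenset((a, b))]


# Second row of the two-character example (chr_1, position 2).
row = "T/C T/C ./. C/C C/C ./. C/C ./.".split()
print(" ".join(to_one_char(g) for g in row))  # Y Y N C C N C N
```

The printed line matches the corresponding row of the one-character table.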
addGOannotation-to-gff3.py adds GO annotation to the gff3 file.
assessNs_in_callsTab.py calculates missing data (Ns) per position/sample and visualizes the results.
calculateNsPerWindow.py calculates number of positions with missing data (Ns) using the sliding window approach.
calls.py is a custom Python module. It is a dependency for most of the scripts listed here.
callsToBED.py converts a tab-delimited file to a bed file.
callsToFastaPhy_RAM.py converts a genotype calls file to FASTA and PHYLIP with low RAM consumption.
callsToFastaPhy_speed.py converts a genotype calls file to FASTA and PHYLIP quickly but consumes a lot of RAM.
combine_overlapping_BEDintervals.py combines overlapping genetic intervals in the BED format.
Ensembl.dat-to-topGO.db.py converts the Ensembl.dat file to the GO reference file used in the topGO R program.
FastaToPhylip.py converts FASTA to PHYLIP.
FastaToTab.py converts FASTA to a tab-delimited file with columns: Chr, Pos, REF.
filterByNs_callsTab.py removes all sites that contain more than a given amount of missing data (Ns).
find_popSpecificAlleles_in_callsTab.py outputs only the alleles that are unique to one population relative to another.
findCommonAlleles.py outputs common and rare alleles in a given set of samples.
GFFextract.py extracts various info from the gff3 file.
keep_biallelic_in_callsTab.py removes sites with more than two alleles.
merge_phased_callsTab.py merges phased sites into a two-character coded genotype file.
mergeChrPos_in_callsTab.py merges all chromosomes into continuous genomic coordinates.
mergeTabFiles.py merges two tab files by their overlapping positions.
polarizeGT_in_callsTab.py polarizes the genotype data by keeping only derived alleles relative to an outgroup/ancestral sequence.
pseudoPhasingHetero_in_callsTab.py phases the sequences by randomly splitting heterozygous sites.
MAF-Calls_alignment-complement.py processes the Calls-MAF aligned file to complement the reverse-complemented sequences of the MAF and outputs a tab file with the coordinates of the new genome.
MAF-TAB_reference.py transforms the MAF file to a tab file with the Chr and Pos of both sequences. Indels are skipped.
remove_Insertions_from_callsTab.py removes insertions longer than 1 bp and replaces 1 bp deletions marked as "*" with "-".
remove_masked_intervals_from_callsTab.py removes the masked sites from a tab file. The masked sites are provided in a BED file.
remove_masked_intervals_fromBED.py compares a BED interval file with the BED file of masked regions and removes them.
removeMonomorphic_in_callsTab.py removes monomorphic positions, i.e. keeps only SNPs.
select_genes_by_intervals.py extracts gene names from a bed file by provided coordinates.
select_intervals_in_callsTab.py extracts lines from a calls file according to scaffold name, start and end positions.
selectSamples_in_callsTab.py subsamples a genotype calls file by sample names. It can also be used to rearrange samples in a calls file.
slidingWindowSNPs.py cuts a genotype calls file into windows of a given size and outputs a FASTA file for every window.
split_calls_by_chromosomes.py splits a calls file into several files by chromosomes.
summarizeTAB.awk summarizes the genotype file by counting homozygous, heterozygous, missing, etc. genotypes.
vcf_to_SIFT4G.py converts a VCF file to SIFT4G input.
vcfTab_to_callsTab.py converts the two-character coded table produced with VariantsToTable (GATK) to the one-character coded genotype table (calls format).
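To give a feel for the kind of processing the missing-data scripts above perform on a one-character coded tab file, here is a minimal sketch. It is not the implementation of assessNs_in_callsTab.py or filterByNs_callsTab.py: the function names read_tab, count_ns_per_sample and filter_by_ns are hypothetical, and expressing the threshold as a fraction of samples (max_missing) is an assumption. Check each script's --help for its actual options.

```python
import csv
import io


def read_tab(handle):
    """Read a tab file; the first three columns are CHROM, POS, REF."""
    reader = csv.DictReader(handle, delimiter="\t")
    samples = reader.fieldnames[3:]  # remaining columns are samples
    return list(reader), samples


def count_ns_per_sample(rows, samples):
    """Count sites with missing data (N) for every sample."""
    counts = dict.fromkeys(samples, 0)
    for row in rows:
        for sample in samples:
            if row[sample] == "N":
                counts[sample] += 1
    return counts


def filter_by_ns(rows, samples, max_missing):
    """Keep sites whose fraction of N genotypes is <= max_missing."""
    kept = []
    for row in rows:
        missing = sum(row[sample] == "N" for sample in samples)
        if missing / len(samples) <= max_missing:
            kept.append(row)
    return kept


# Small in-memory example taken from the one-character table above.
example = (
    "CHROM\tPOS\tREF\tsample1\tsample2\tsample3\n"
    "chr_1\t1\tA\tW\tN\tN\n"
    "chr_1\t2\tC\tY\tY\tN\n"
    "chr_1\t4\tT\tT\tT\tN\n"
)
rows, samples = read_tab(io.StringIO(example))
ns = count_ns_per_sample(rows, samples)
kept = filter_by_ns(rows, samples, max_missing=0.5)
print(ns)
print([row["POS"] for row in kept])
```

For a real file, pass an open file handle to read_tab instead of the in-memory string.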