Impute missing data on non-recombining DNA using SNP phylogeny
PhyloImpute complements SNP data on non-recombining DNA, such as the human Y chromosome, by leveraging the SNPs' phylogeny as reported in a phylogenetic tree.
Operating system: Linux
Type in the shell:
git clone https://github.com/ZehraKoksal/PhyloImpute.git
cd PhyloImpute/
python PhyloImpute.py -h

The following tools must be installed before running this software.
Install these inside a virtual environment using pip install ...:
- pandas (v2.1.6)
- numpy (v1.26.4)
- matplotlib (v3.9.2)
- scipy (v1.14.1)
- geopandas (v1.0.1)
- shapely (v2.0.6)
These tools are optional and only required if you use the -v parameter (see details below):
- bgzip (v1.19)
- bcftools (v1.20)
- tabix (v1.19)
PhyloImpute imputes missing data by assuming that the SNPs in a clade of the phylogenetic tree leading up to a SNP with a derived allele are derived as well. SNPs on parallel branches are expected to be ancestral.
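The rule above can be illustrated with a minimal sketch. This is not PhyloImpute's own code: the toy tree, the marker names (M1 to M5), and the helper functions are hypothetical, and the sketch assumes that all observed derived markers lie on a single lineage. Imputed states are written in lowercase, as in the output file described further down.

```python
# Toy tree as child -> parent (None marks the root). Marker names are hypothetical.
PARENT = {"M1": None, "M2": "M1", "M3": "M2", "M4": "M1", "M5": "M4"}


def ancestors(marker):
    """Return all markers upstream of `marker`, nearest first."""
    path = []
    node = PARENT[marker]
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def impute(observed):
    """Fill in missing states for every marker in the tree.

    observed maps marker -> 'A', 'D' or 'X'. Imputed states are returned in
    lowercase ('d', 'a'); observed states are kept unchanged.
    Assumes all observed derived markers lie on a single lineage.
    """
    result = {m: observed.get(m, "X") for m in PARENT}
    derived = [m for m, s in result.items() if s == "D"]
    if not derived:
        return result  # nothing to anchor the imputation on
    # The most downstream derived marker defines the lineage.
    terminal = max(derived, key=lambda m: len(ancestors(m)))
    lineage = {terminal, *ancestors(terminal)}
    for marker, state in result.items():
        if state != "X":
            continue  # observed states are never overwritten
        if marker in lineage:
            result[marker] = "d"  # upstream of a derived marker
        elif terminal not in ancestors(marker):
            result[marker] = "a"  # on a parallel branch
        # markers downstream of `terminal` stay 'X'
    return result


print(impute({"M3": "D"}))
# {'M1': 'd', 'M2': 'd', 'M3': 'D', 'M4': 'a', 'M5': 'a'}
```

Markers downstream of the most terminal derived marker are left as X in this sketch, because neither rule says anything about them.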
PhyloImpute can be run with a pre-processed csv input file or vcf files:
python PhyloImpute.py -input_format csv -input ./test_run/testdata.csv -output ./output -tree Y_minimal

Parameters:

| Flag | Description | Required/Optional |
|---|---|---|
| `-input_format` | Input file format (e.g., csv). | Required |
| `-input` | Path to the input file. | Required |
| `-tree` | Name of an available phylogenetic tree. Options: Y_minimal, NAMQY, ISOGG_2020. | Optional* (mutually exclusive with customtree) |
| `-customtree` | Path to a custom phylogenetic tree. | Optional* (mutually exclusive with tree) |
| `-vcf_dic` | Dictionary file for custom tree markers (only used when -customtree is provided). | Optional (used with -customtree) |
| `-output` | Path to an existing folder where output files will be saved. | Required |
| `-nucleotide` | Report nucleotides (A, C, G, T, N) instead of the default allelic states (A, D, X). | Optional |
The user is required to provide the path to the input file in tab-separated .csv format.
Rows represent variants, columns represent individuals.
The header row contains individual labels (blue), and the second column contains variant names (orange).
The table contains observed allelic states (green): ancestral A, derived D, or missing X.
Currently, a pre-processed phylogenetic tree is available for the human Y chromosome:
python PhyloImpute.py -input_format csv -input ./test_run/testdata.csv -output ./output -tree Y_minimal

Alternatively, custom phylogenetic trees can be provided:
python PhyloImpute.py -input_format csv -input ./test_run/testdata.csv -output ./output -customtree ./test_run/minimal_y_tree_hgs_custom_example.csv -vcf_dic ./Y_minimal_dic.csv

The custom phylogenetic tree must be provided in tab-separated .csv format.
It follows the ISOGG tree nomenclature and this structure:
- SNPs that cannot be separated ("equal") are comma-separated in the same branch (green).
- SNPs of downstream branches are indented using a tab (orange).
- SNPs from parallel branches are in separate rows (blue).
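To make the layout concrete, here is a minimal sketch of how such a tab-indented file could be read into parent/child relations. It is an illustration only, not PhyloImpute's parser: the function name `parse_tree` and the marker names are hypothetical, and the sketch assumes that each row holds only SNP names and that indentation increases by one tab per level.

```python
def parse_tree(text):
    """Parse a tab-indented tree into a list of branches.

    Each branch is a dict holding its SNP names, its depth (number of leading
    tabs), and the index of its parent branch (None for top-level branches).
    """
    branches = []
    stack = []  # stack[d] = index of the most recent branch at depth d
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty rows
        content = line.lstrip("\t")
        depth = len(line) - len(content)
        # "Equal" SNPs share one row and are comma-separated.
        snps = [s.strip() for s in content.split(",") if s.strip()]
        del stack[depth:]  # leave only the ancestors of this row
        parent = stack[-1] if stack else None
        branches.append({"snps": snps, "depth": depth, "parent": parent})
        stack.append(len(branches) - 1)
    return branches


# Hypothetical example: M2 and M4 are parallel branches below M1/M1b.
example = "M1,M1b\n\tM2\n\t\tM3\n\tM4\n"
for branch in parse_tree(example):
    print(branch)
```

Two rows with the same parent index are parallel branches; a row whose parent is the previous row is a downstream branch.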
Required when using -customtree. The custom phylogenetic tree needs to be supplemented with a dictionary file (vcf_dic).
Structure:
Columns:
- marker: names used in tree
- GRCh37, GRCh38, T2T: positions
- Anc, Der: alleles
- Hg: haplogroups
- Ref: reference allele according to the chosen reference genome (i.e., GRCh37, GRCh38 or T2T).
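The Anc, Der and Ref columns are what allow a nucleotide or a VCF genotype to be translated into an allelic state. The sketch below shows one way this translation could work. The function names are hypothetical, the marker values are made up, and treating any allele that matches neither Anc nor Der as missing (X) is an assumption rather than documented behaviour.

```python
def state_from_nucleotide(nucleotide, anc, der):
    """Translate an observed nucleotide into A (ancestral), D (derived) or X."""
    base = nucleotide.upper()
    if base == anc.upper():
        return "A"
    if base == der.upper():
        return "D"
    return "X"  # missing call, or an allele not listed for this marker (assumption)


def state_from_genotype(gt, ref, alt, anc, der):
    """Translate a haploid VCF genotype ('0', '1' or '.') into A, D or X."""
    if gt == "0":
        return state_from_nucleotide(ref, anc, der)  # sample carries the reference allele
    if gt == "1":
        return state_from_nucleotide(alt, anc, der)  # sample carries the alternate allele
    return "X"


# Hypothetical marker: ancestral A, derived G, on a reference genome that carries G.
print(state_from_genotype("0", ref="G", alt="A", anc="A", der="G"))  # D
```

The example shows why the reference allele matters: a genotype of 0 means "same as the reference", which is the derived state whenever the reference genome itself carries the derived allele.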
This file contains observed (D, A, X) and imputed (d, a) allelic states for reported SNPs plus additional SNPs from the phylogenetic tree.
If -nucleotide is specified, the file reports nucleotides (A/a, C/c, G/g, T/t, N) instead of the default allelic states (A/a, D/d, X).
PhyloImpute compares allelic states of all observed SNPs with the SNP relationships in the phylogenetic tree to verify the accuracy of the phylogenetic tree and the sequencing data.
It outputs:
- Predicted haplogroup: based on the tree branch with the most SNPs that are present in the analyzed sequence ("main tree branch"). PhyloImpute v1.3 prefixes a haplogroup with an asterisk (*) if Penalty value 1 is higher than the Confidence value (see below), indicating a potentially erroneous prediction that requires manual inspection.
- Confidence value: proportion of derived alleles in the main tree branch. A low value can be due to low sequence coverage.
- Penalty value 1: proportion of ancestral alleles in the main branch. Observed ancestral alleles could be the consequence of back mutations, but a high penalty value could hint at an incorrect haplogroup prediction.
- Penalty value 2: proportion of derived alleles in parallel branches. Markers in parallel branches in the derived allelic state could be the consequence of recurrent mutations that are identical by state rather than by descent. However, a high penalty value 2 could indicate that the sequencing data contains a mixture of DNA from different individuals.
- downstream_ancestral: Indicates downstream haplogroups and the specific SNPs that were found in the ancestral state.
- downstream_unknown: Indicates downstream haplogroups and the specific SNPs that were found in an unknown state. Taking this information into account may help in assessing whether the predicted haplogroup is potentially lacking resolution due to missing downstream alleles.
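As a worked illustration of the three values, the sketch below computes them from lists of observed states. The denominators are an assumption: each proportion is taken over all SNPs of the respective branch set, including missing ones, which is consistent with low coverage lowering the confidence value but is not confirmed by the text above. The function names and the haplogroup label are hypothetical.

```python
def branch_summary(main_states, parallel_states):
    """Return (confidence, penalty 1, penalty 2) from observed states.

    main_states: states ('A', 'D', 'X') of the SNPs on the main tree branch.
    parallel_states: states of the SNPs on parallel branches.
    Assumption: proportions are taken over all SNPs, including missing ones.
    """
    def proportion(states, target):
        return states.count(target) / len(states) if states else 0.0

    confidence = proportion(main_states, "D")
    penalty1 = proportion(main_states, "A")
    penalty2 = proportion(parallel_states, "D")
    return confidence, penalty1, penalty2


def label(haplogroup, confidence, penalty1):
    """Prefix the haplogroup with '*' when penalty 1 exceeds the confidence."""
    return f"*{haplogroup}" if penalty1 > confidence else haplogroup


confidence, penalty1, penalty2 = branch_summary(
    ["D", "D", "X", "A"], ["A", "A", "D", "X", "A"]
)
print(confidence, penalty1, penalty2)  # 0.5 0.25 0.2
print(label("Q", confidence, penalty1))  # Q
```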
Contains SNPs that cause penalty values:
- "(ancestral allele inside main branch)" → penalty value 1
- "(derived allele inside parallel branch)" → penalty value 2
In the main output file (phyloimputed.csv), PhyloImpute keeps the observed allelic states.
Generates a log file containing the commands executed by the tool, with each entry timestamped for easier tracking and reproducibility.
python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree Y_minimal -vcf_ref GRCh37 -vcf_chr NC_000024.9

Parameters:

| Flag | Description | Required/Optional |
|---|---|---|
| `-input_format` | Input file format (vcf). | Required |
| `-input` | Path to the folder containing VCF files. | Required |
| `-output` | Output folder path where results will be saved. | Required |
| `-tree` | Phylogenetic tree to use. Options: Y_minimal, NAMQY, ISOGG_2020. | Optional* (mutually exclusive with customtree) |
| `-customtree` | Path to a custom phylogenetic tree. | Optional* (mutually exclusive with tree) |
| `-vcf_ref` | Reference genome used for VCF files. Options: GRCh37, GRCh38, T2T. | Required |
| `-vcf_chr` | Chromosome ID in the VCF file (e.g., NC_000024.9 for GRCh37). | Required |
| `-vcf_dic` | Dictionary file for custom tree markers (only used when -customtree is provided). | Optional (used with -customtree) |
| `-nucleotide` | Report nucleotides (A, C, G, T, N) instead of the default allelic states (A, D, X). | Optional |
| `-v` | Enables multisample VCF files as output. In the imputed output file, imputed alleles (0, 1, ...) carry a ":999" suffix. | Optional |
Provide a folder containing all .vcf files, or a single multisample VCF file by using the -v parameter.
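When post-processing output produced with -v, the ":999" suffix can be used to tell imputed alleles from observed ones. The sketch below assumes that a sample field consists of the allele optionally followed by ":999" and nothing else; the exact field layout is not specified here, so treat the function (a hypothetical helper) as an illustration only.

```python
def parse_call(field):
    """Split a sample field into (allele, is_imputed).

    Assumption: the field is either '<allele>' or '<allele>:999'.
    """
    allele, _, tag = field.partition(":")
    return allele, tag == "999"


for field in ["0", "1", "1:999", "0:999"]:
    allele, imputed = parse_call(field)
    print(field, "->", allele, "imputed" if imputed else "observed")
```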
Supported trees:
- Minimal Y tree (doi:10.1002/humu.22468)
- NAMQY tree (https://doi.org/10.1155/2024/3046495)
- ISOGG 2020 tree (https://isogg.org/tree/)
python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree Y_minimal -vcf_ref GRCh37 -vcf_chr NC_000024.9
python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree NAMQY -vcf_ref GRCh37 -vcf_chr NC_000024.9
python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree ISOGG_2020 -vcf_ref GRCh37 -vcf_chr NC_000024.9

python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -vcf_ref GRCh37 -vcf_chr NC_000024.9 -customtree ./test_run/minimal_y_tree_hgs_custom_example.csv -vcf_dic ./Y_minimal_dic.csv

Structure of the tree in tab-separated .csv file:

SNPs that cannot be separated ("equal") are divided by commas in the same branch (green). SNPs of downstream branches are presented in the row below with one additional tab of indentation (orange). SNPs from parallel branches are on separate, mutually exclusive branches (blue).
One example can be viewed in the test_run folder provided here.
Required when using -customtree. Structure:
Columns:
- marker: names used in tree
- GRCh37, GRCh38, T2T: positions
- Anc, Der: alleles
- Hg: haplogroups
- Ref: reference allele according to the chosen reference genome (i.e., GRCh37, GRCh38 or T2T).
Same as section 3.1.1.3
The output file of PhyloImpute (_phyloimputed.csv) can further be used to generate an allele frequency map of derived alleles of selected SNPs.

For this, the following code can be run:
python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNPX -f_coordinates ./sample_coordinates_example.csv -continent 'South America' 'North America' -af_map png

| Flag | Description | Required/Optional |
|---|---|---|
| `-freqmap` | Define this to generate allele frequency maps. | Required |
| `-input` | Specify path to PhyloImpute output file (_phyloimputed.csv) or any file in the same format (see image in 3.1.1.3.1). | Required |
| `-output` | Define output file name. | Required |
| `-f_snp` | Define name of SNP for allele frequency map. | Required |
| `-f_coordinates` | Provide path to tab-separated CSV file defining coordinates of samples from the -input file. Format: 1st column = sample names, 2nd = latitude, 3rd = longitude. Example: /test_run/sample_coordinates_example.csv. | Required |
| `-continent` | Specify one or several continents to plot: Oceania, Africa, North America, Asia, South America, Europe. Use single quotes, e.g., 'South America'. | Optional |
| `-country` | Specify one or more countries to plot. Names must match Countries_list.csv. Use single quotes, e.g., 'Ecuador'. | Optional |
| `-whole_world` | Plot the entire world map instead of specific regions. | Optional |
| `-af_map` | Select output file format for the allele frequency map: svg, pdf, or png. Default is svg. | Optional |
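To show how the -input and -f_coordinates files relate, the sketch below computes a derived allele frequency per sampling location for one SNP. It is an assumption that samples sharing the same coordinates are pooled in this way and that imputed states (lowercase) count alongside observed ones; the function name, sample names, and coordinates are hypothetical.

```python
from collections import defaultdict


def derived_frequency(states, coordinates):
    """Return {(lat, lon): derived allele frequency} for one SNP.

    states: sample name -> allelic state ('A', 'a', 'D', 'd' or 'X').
    coordinates: sample name -> (latitude, longitude).
    Samples with a missing state or without coordinates are ignored.
    """
    counts = defaultdict(lambda: [0, 0])  # location -> [derived, called]
    for sample, state in states.items():
        if sample not in coordinates or state.upper() not in ("A", "D"):
            continue
        tally = counts[coordinates[sample]]
        tally[0] += state.upper() == "D"
        tally[1] += 1
    return {loc: derived / called for loc, (derived, called) in counts.items()}


states = {"s1": "D", "s2": "a", "s3": "X", "s4": "d"}
coordinates = {
    "s1": (-0.2, -78.5),
    "s2": (-0.2, -78.5),
    "s3": (-0.2, -78.5),
    "s4": (4.6, -74.1),
}
print(derived_frequency(states, coordinates))
# {(-0.2, -78.5): 0.5, (4.6, -74.1): 1.0}
```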
PhyloImpute maximizes the available information on the allelic states of SNPs by first imputing missing alleles (see above) and then by interpolating the remaining information between sample points using a radial basis function (RBF). This interpolation can be tuned by changing a parameter (epsilon). The default value of epsilon is 2.3, and the higher this value the stronger the "smoothing" of the data.
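The effect of epsilon can be explored with a small, self-contained RBF interpolation. The sketch below is written from scratch for illustration and is not PhyloImpute's implementation: the Gaussian kernel exp(-(r/epsilon)^2), the treatment of latitude and longitude as planar coordinates, and all function names are assumptions. With this kernel, a larger epsilon widens the influence of each sample point, which matches the smoothing behaviour described above.

```python
import math


def gaussian(r, epsilon):
    """Gaussian radial basis function; larger epsilon gives a wider kernel."""
    return math.exp(-((r / epsilon) ** 2))


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_rbf(points, values, epsilon=2.3):
    """Return a function that interpolates `values` given at `points`."""
    kernel = [[gaussian(math.dist(p, q), epsilon) for q in points] for p in points]
    weights = solve(kernel, values)

    def interpolate(query):
        return sum(
            w * gaussian(math.dist(query, p), epsilon)
            for w, p in zip(weights, points)
        )

    return interpolate


# Hypothetical sample points (lat, lon) with derived allele frequencies.
points = [(0.0, 0.0), (0.0, 5.0), (5.0, 0.0), (5.0, 5.0)]
values = [1.0, 0.0, 0.0, 1.0]
for eps in (1.0, 2.3, 4.0):
    surface = fit_rbf(points, values, epsilon=eps)
    print(eps, surface((1.0, 1.0)))
```

The interpolated surface passes through the sample values but is not clipped to the range 0 to 1, so values between sample points can overshoot when neighbouring frequencies differ strongly.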

I recommend illustrating datapoints with ancestral (black dots) and derived alleles (white dots) with the parameters -ancestral_coordinates and -derived_coordinates. Sometimes close-by datapoints can have extreme differences in allele frequencies (e.g., due to sampling strategy), which can cause artifacts in the RBF. The artifacts are visible as high derived allele frequencies of the SNP in a region, where no derived alleles are (i.e., no white dots). In these cases, the user should reduce the smoothing factor (epsilon) incrementally (by defining -smoothing [number]) until the artifact disappears.
python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNPX -f_coordinates ./sample_coordinates.csv -color pink -derived_coordinates -ancestral_coordinates -continent 'South America' -af_map png -smoothing 2

- -color: The color palette can be changed by specifying one of these colors: blue, orange, pink, red, green, yellow, purple, violet, grey [Default: blue]
- -contour: Adjust the number of different shades for the allele frequencies [Default:15]
python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNPX -f_coordinates ./sample_coordinates.csv -color pink -continent 'South America' 'North America' -af_map png

python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNP1 -f_coordinates ./sample_coordinates.csv -color orange -country 'Ecuador' -af_map png

PhyloImpute v1.3 with a graphical user interface (GUI) is available for Windows and Linux at https://zenodo.org/records/15864850. A tutorial for the GUI is available via the same link.




