PhyloImpute

Impute missing data on non-recombining DNA using SNP phylogeny

1) About PhyloImpute

PhyloImpute complements SNP data on non-recombining DNA, such as the human Y chromosome, by leveraging the SNPs' phylogeny as reported in a phylogenetic tree.

2) Installation

Operating system: Linux

Type in the shell:

git clone https://github.com/ZehraKoksal/PhyloImpute.git
cd PhyloImpute/
python PhyloImpute.py -h

The following tools must be installed before running this software.

A) Python libraries

Install these inside a virtual environment using pip install ...:

  • pandas (v2.1.6)
  • numpy (v1.26.4)
  • matplotlib (v3.9.2)
  • scipy (v1.14.1)
  • geopandas (v1.0.1)
  • shapely (v2.0.6)

B) Native command-line tools

These tools are optional and only required if you use the -v parameter (see details below):

  • bgzip (v1.19)
  • bcftools (v1.20)
  • tabix (v1.19)

3) Algorithm and Commands

3.1) Imputation

PhyloImpute imputes missing data by assuming that the SNPs in a clade of the phylogenetic tree leading up to a SNP with a derived allele are derived as well. SNPs on parallel branches are expected to be ancestral.
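The rule above can be sketched on a toy tree. This is only an illustration, not PhyloImpute's actual code: the SNP names and the child-to-parent dictionary are hypothetical, and it assumes that SNPs downstream of the observed derived SNP stay unknown (X) while imputed states are written in lowercase (d, a).

```python
# Toy tree with hypothetical SNP names: child -> parent (the root has parent None).
PARENT = {"M1": None, "M2": "M1", "M3": "M2", "M4": "M2", "M5": "M1", "M6": "M3"}

def ancestors(snp):
    """Return all SNPs on the path from `snp` up to the root (excluding `snp`)."""
    path = []
    while PARENT[snp] is not None:
        snp = PARENT[snp]
        path.append(snp)
    return path

def impute(observed):
    """Fill missing states (X) with imputed d/a; observed A/D states are kept."""
    result = dict(observed)
    derived = [s for s, state in observed.items() if state == "D"]
    upstream = set()
    for snp in derived:
        upstream.update(ancestors(snp))
    for snp in PARENT:
        if result.get(snp, "X") != "X":
            continue  # keep observed states untouched
        if snp in upstream:
            result[snp] = "d"  # leads up to a derived SNP
        elif derived and not any(d in ancestors(snp) for d in derived):
            result[snp] = "a"  # neither upstream nor downstream: parallel branch
        else:
            result[snp] = "X"  # downstream of a derived SNP (assumed to stay unknown)
    return result

sample = {"M1": "X", "M2": "X", "M3": "D", "M4": "X", "M5": "X", "M6": "X"}
imputed = impute(sample)
```

Here M1 and M2 lead up to the derived M3 and become d, M4 and M5 sit on parallel branches and become a, and M6 stays X.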

PhyloImpute can be run with a pre-processed csv input file or vcf files:

3.1.1) CSV file

python PhyloImpute.py -input_format csv -input ./test_run/testdata.csv -output ./output -tree Y_minimal

Parameters:

| Flag | Description | Required/Optional |
|---|---|---|
| -input_format | Input file format (e.g., csv). | Required |
| -input | Path to the input file. | Required |
| -tree | Name of a pre-processed phylogenetic tree. Options: Y_minimal, NAMQY, ISOGG_2020. | Optional* (mutually exclusive with -customtree) |
| -customtree | Path to a custom phylogenetic tree. | Optional* (mutually exclusive with -tree) |
| -vcf_dic | Dictionary file for custom tree markers (only used when -customtree is provided). | Optional (used with -customtree) |
| -output | Path to an existing folder where output files will be saved. | Required |
| -nucleotide | Report nucleotides (A, C, G, T, N) instead of the default allelic states (A, D, X). | Optional |

3.1.1.1) CSV Input file

The user is required to provide the path to the input file in tab-separated .csv format.

Input file style

Rows represent variants, columns represent individuals.
The header row contains individual labels (blue), and the second column contains variant names (orange).
The table contains observed allelic states (green): ancestral A, derived D, or missing X.

3.1.1.2) Phylogenetic tree

3.1.1.2.1) Pre-processed phylogenetic tree

Currently, a pre-processed phylogenetic tree is available for the human Y chromosome:

python PhyloImpute.py -input_format csv -input ./test_run/testdata.csv -output ./output -tree Y_minimal

3.1.1.2.2) Custom phylogenetic tree

Alternatively, custom phylogenetic trees can be provided:

python PhyloImpute.py -input_format csv -input ./test_run/testdata.csv -output ./output -customtree ./test_run/minimal_y_tree_hgs_custom_example.csv -vcf_dic ./Y_minimal_dic.csv

The custom phylogenetic tree must be provided in tab-separated .csv format.
It follows the ISOGG tree nomenclature and this structure:

Input file style

  • SNPs that cannot be separated ("equal") are comma-separated in the same branch (green).
  • SNPs of downstream branches are indented using a tab (orange).
  • SNPs from parallel branches are in separate rows (blue).
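The three rules above can be read back into a parent/child structure with a few lines of Python. This is a minimal sketch, assuming one branch per row and one leading tab per tree level; the marker names are hypothetical and the parser is not part of PhyloImpute.

```python
def parse_tree(text):
    """Parse a tab-indented tree into a list of branches with parent links."""
    branches = []  # each entry: {"snps": [...], "parent": index or None}
    stack = []     # index of the currently open branch at each depth
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip("\t"))  # number of leading tabs
        snps = [s.strip() for s in line.strip().split(",") if s.strip()]
        del stack[depth:]                            # close deeper or equal levels
        parent = stack[-1] if stack else None
        branches.append({"snps": snps, "parent": parent})
        stack.append(len(branches) - 1)
    return branches

# Hypothetical markers: M1 and M1b are "equal", M2 and M5 are parallel branches.
example = "M1,M1b\n\tM2\n\t\tM3\n\tM5\n"
tree = parse_tree(example)
```

In the result, M2 and M5 share the same parent branch (M1,M1b), which is what makes them parallel, while M3 is downstream of M2.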

3.1.1.2.3) Custom tree marker dictionary

Required when using -customtree. The custom phylogenetic tree needs to be supplemented with a dictionary file (vcf_dic).

Structure:

Dictionary file format

Columns:

  • marker: names used in tree
  • GRCh37, GRCh38, T2T: positions
  • Anc, Der: alleles
  • Hg: haplogroups
  • Ref: reference allele according to the chosen reference genome (i.e., GRCh37, GRCh38 or T2T).
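To illustrate why the dictionary lists the Anc and Der alleles, the sketch below maps an observed nucleotide to an allelic state. The entry is hypothetical (only the column names come from the list above), and treating any other nucleotide as unknown (X) is an assumption, not documented PhyloImpute behavior.

```python
# Hypothetical dictionary entry that uses the column names listed above.
entry = {"marker": "M3", "Anc": "C", "Der": "T"}

def allelic_state(nucleotide, entry):
    """Map an observed nucleotide to D (derived), A (ancestral) or X (unknown)."""
    base = nucleotide.upper()
    if base == entry["Der"]:
        return "D"
    if base == entry["Anc"]:
        return "A"
    return "X"  # assumption: anything else, including N, counts as unknown

states = [allelic_state(base, entry) for base in ("T", "C", "N")]
```

Because the reference genome can carry either the ancestral or the derived allele, the Ref column is needed in addition to Anc and Der when reading VCF genotypes.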


3.1.1.3) Output files

3.1.1.3.1) phyloimputed.csv

This file contains observed (D, A, X) and imputed (d, a) allelic states for reported SNPs plus additional SNPs from the phylogenetic tree.

Output preview

When -nucleotide is specified, the file reports nucleotides (A/a, C/c, G/g, T/t, N) instead of the default allelic states (A/a, D/d, X).
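Since observed states are uppercase and imputed states are lowercase, the two can be told apart by letter case. A minimal sketch that counts them for one sample follows; the list of states is made up for illustration.

```python
def summarize(states):
    """Count observed (uppercase), imputed (lowercase) and missing (X or N) states."""
    counts = {"observed": 0, "imputed": 0, "missing": 0}
    for state in states:
        if state in ("X", "N"):
            counts["missing"] += 1
        elif state.islower():
            counts["imputed"] += 1
        else:
            counts["observed"] += 1
    return counts

# Hypothetical column of one sample from phyloimputed.csv.
summary = summarize(["D", "d", "a", "A", "X"])
```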

3.1.1.3.2) haplogroups.csv

PhyloImpute compares allelic states of all observed SNPs with the SNP relationships in the phylogenetic tree to verify the accuracy of the phylogenetic tree and the sequencing data.

It outputs:

  • Predicted haplogroup: based on tree branch with the most SNPs that are present in the analyzed sequence ("main tree branch"). PhyloImpute v1.3 prefixes each haplogroup with an asterisk (*) if the Penalty Value 1 is higher than the Confidence Value (see below) to indicate potentially erroneous predictions that require manual inspection.
  • Confidence value: proportion of derived alleles in main tree branch. Low value can be due to low sequence coverage.
  • Penalty value 1: proportion of ancestral alleles in main branch. The observed ancestral alleles could be the consequence of backmutations, but a high penalty value could hint towards an incorrect haplogroup prediction.
  • Penalty value 2: proportion of derived alleles in parallel branches. Markers in parallel branches in the derived allelic state could be the consequence of recurrent mutations that are identical by state, rather than by descent. However, a high penalty value 2 could indicate that the sequencing data comprises a mixture of DNA from different individuals.
  • downstream_ancestral: Indicates downstream haplogroups and the specific SNPs that were found in ancestral state.
  • downstream_unknown: Indicates downstream haplogroups and the specific SNPs that were found in unknown state. Taking this information into account may help in assessing whether the predicted haplogroup is potentially lacking resolution due to missing downstream alleles.
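The three values can be illustrated with a small calculation. The denominators are an assumption here (all SNPs of the main branch, or of the parallel branches, including missing ones); the README does not state how PhyloImpute normalizes them, and the state lists are made up.

```python
def proportion(states, target):
    """Share of `target` among the given states (0.0 for an empty list)."""
    return states.count(target) / len(states) if states else 0.0

# Hypothetical observed states of one sample.
main_branch = ["D", "D", "X", "A", "D"]  # SNPs on the main tree branch
parallel = ["A", "A", "D", "X"]          # SNPs on parallel branches

confidence = proportion(main_branch, "D")  # derived alleles in the main branch
penalty_1 = proportion(main_branch, "A")   # ancestral alleles in the main branch
penalty_2 = proportion(parallel, "D")      # derived alleles in parallel branches

# Asterisk rule described above: flag the prediction for manual inspection.
flagged = penalty_1 > confidence
```

With these toy states the confidence value (0.6) exceeds penalty value 1 (0.2), so the predicted haplogroup would not be prefixed with an asterisk.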

3.1.1.3.3) conflicting_SNPs.csv

Contains SNPs that cause penalty values:

  • "(ancestral allele inside main branch)" → penalty value 1
  • "(derived allele inside parallel branch)" → penalty value 2

In the main output file (phyloimputed.csv), PhyloImpute keeps the observed allelic states.

3.1.1.3.4) logfile

Generates a log file containing the commands executed by the tool, with each entry timestamped for easier tracking and reproducibility.


3.1.2) VCF file

python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree Y_minimal -vcf_ref GRCh37 -vcf_chr NC_000024.9

Parameters:

| Flag | Description | Required/Optional |
|---|---|---|
| -input_format | Input file format (vcf). | Required |
| -input | Path to the folder containing VCF files. | Required |
| -output | Output folder path where results will be saved. | Required |
| -tree | Phylogenetic tree to use. Options: Y_minimal, NAMQY, ISOGG_2020. | Optional* (mutually exclusive with -customtree) |
| -customtree | Path to a custom phylogenetic tree. | Optional* (mutually exclusive with -tree) |
| -vcf_ref | Reference genome used for VCF files. Options: GRCh37, GRCh38, T2T. | Required |
| -vcf_chr | Chromosome ID in the VCF file (e.g., NC_000024.9 for GRCh37). | Required |
| -vcf_dic | Dictionary file for custom tree markers (only used when -customtree is provided). | Optional (used with -customtree) |
| -nucleotide | Report nucleotides (A, C, G, T, N) instead of the default allelic states (A, D, X). | Optional |
| -v | Allows multisample VCF files as output. The imputed output file changes accordingly, adding a ":999" suffix after each imputed allele (0, 1, ...). | Optional |

3.1.2.1) VCF Input file

Provide a folder containing all .vcf files, or a single multisample VCF file by using the -v parameter.

3.1.2.2) Phylogenetic tree

3.1.2.2.1) Pre-processed phylogenetic tree

Supported trees:

python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree Y_minimal -vcf_ref GRCh37 -vcf_chr NC_000024.9
python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree NAMQY -vcf_ref GRCh37 -vcf_chr NC_000024.9
python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -tree ISOGG_2020 -vcf_ref GRCh37 -vcf_chr NC_000024.9

3.1.2.2.2) Custom phylogenetic tree

python PhyloImpute.py -input_format vcf -input ./test_run/input_vcf/ -output ./output -vcf_ref GRCh37 -vcf_chr NC_000024.9 -customtree ./test_run/minimal_y_tree_hgs_custom_example.csv -vcf_dic ./Y_minimal_dic.csv

Structure of the tree in the tab-separated .csv file:

Tree file format

  • SNPs that cannot be separated ("equal") are divided by commas in the same branch (green).
  • SNPs of downstream branches are presented in the row below with one additional indentation using a tab (orange).
  • SNPs from parallel branches are on separate, mutually exclusive branches (blue).

An example can be found in the test_run folder.

3.1.2.2.3) Custom tree marker dictionary

Required when using -customtree. Structure:

Dictionary file format

Columns:

  • marker: names used in tree
  • GRCh37, GRCh38, T2T: positions
  • Anc, Der: alleles
  • Hg: haplogroups
  • Ref: reference allele according to the chosen reference genome (i.e., GRCh37, GRCh38 or T2T).

3.1.2.3) Output files

Same as section 3.1.1.3

3.2) Allele frequency plots

The output file of PhyloImpute (_phyloimputed.csv) can further be used to generate an allele frequency map of derived alleles of selected SNPs.

general_allele_frequency_map

For this, the following code can be run:

python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNPX -f_coordinates ./sample_coordinates_example.csv -continent 'South America' 'North America' -af_map png

Parameters:

| Flag | Description | Required/Optional |
|---|---|---|
| -freqmap | Define this to generate allele frequency maps. | Required |
| -input | Specify path to PhyloImpute output file (_phyloimputed.csv) or any file in the same format (see image in 3.1.1.3.1). | Required |
| -output | Define output file name. | Required |
| -f_snp | Define name of SNP for allele frequency map. | Required |
| -f_coordinates | Provide path to tab-separated CSV file defining coordinates of samples from the -input file. Format: 1st column = sample names, 2nd = latitude, 3rd = longitude. Example: /test_run/sample_coordinates_example.csv. | Required |
| -continent | Specify one or several continents to plot: Oceania, Africa, North America, Asia, South America, Europe. Use single quotes, e.g., 'South America'. | Optional |
| -country | Specify one or more countries to plot. Names must match Countries_list.csv. Use single quotes, e.g., 'Ecuador'. | Optional |
| -whole_world | Plot the entire world map instead of specific regions. | Optional |
| -af_map | Select output file format for the allele frequency map: svg, pdf, or png. Default is svg. | Optional |
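Before plotting, the allelic states of the selected SNP have to be turned into a derived allele frequency per sampling location. A minimal sketch of that step follows; it assumes that observed and imputed states are counted alike and that missing states are skipped. The sample names and coordinates are made up, and the function is not PhyloImpute's own.

```python
from collections import defaultdict

def derived_frequency(states, coordinates):
    """Derived allele frequency per (latitude, longitude); missing states are skipped."""
    counts = defaultdict(lambda: [0, 0])  # location -> [derived, total]
    for sample, state in states.items():
        if state.upper() not in ("A", "D"):
            continue  # X or anything else carries no information
        location = coordinates[sample]
        counts[location][0] += int(state.upper() == "D")
        counts[location][1] += 1
    return {loc: derived / total for loc, (derived, total) in counts.items()}

# Hypothetical states of one SNP and sample coordinates (latitude, longitude).
states = {"s1": "D", "s2": "a", "s3": "X", "s4": "d"}
coordinates = {"s1": (-0.2, -78.5), "s2": (-0.2, -78.5),
               "s3": (4.7, -74.1), "s4": (4.7, -74.1)}
frequencies = derived_frequency(states, coordinates)
```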

3.2.1) Tune interpolator

PhyloImpute maximizes the available information on the allelic states of SNPs by first imputing missing alleles (see above) and then by interpolating the remaining information between sample points using a radial basis function (RBF). This interpolation can be tuned by changing a parameter (epsilon). The default value of epsilon is 2.3, and the higher this value the stronger the "smoothing" of the data.
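The effect of epsilon can be seen in a small pure-Python RBF interpolation. This is a sketch under stated assumptions, not PhyloImpute's implementation: it assumes a Gaussian basis function of the form exp(-(r/epsilon)^2), in which a larger epsilon widens each sample's influence, and the sample points are made up.

```python
import math

def gaussian(r, epsilon):
    """Gaussian basis function (assumed form); larger epsilon gives a wider bump."""
    return math.exp(-(r / epsilon) ** 2)

def solve(a, b):
    """Solve the linear system a * w = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rbf_fit(points, values, epsilon):
    """Return a function that interpolates `values` given at `points`."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    matrix = [[gaussian(dist(p, q), epsilon) for q in points] for p in points]
    weights = solve(matrix, values)
    return lambda x: sum(w * gaussian(dist(x, p), epsilon)
                         for w, p in zip(weights, points))

# Two hypothetical sample points: derived frequency 1.0 and 0.0.
points, values = [(0.0, 0.0), (10.0, 0.0)], [1.0, 0.0]
narrow = rbf_fit(points, values, epsilon=2.0)
wide = rbf_fit(points, values, epsilon=5.0)
```

Both fits reproduce the sample values exactly, but three units away from the first point the wide fit still predicts a high derived allele frequency while the narrow fit has already dropped close to zero.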

general_allele_frequency_map

I recommend plotting the data points with ancestral alleles (black dots) and derived alleles (white dots) using the parameters -ancestral_coordinates and -derived_coordinates. Neighboring data points can sometimes differ strongly in allele frequency (e.g., due to sampling strategy), which can cause artifacts in the RBF. These artifacts appear as high derived allele frequencies of the SNP in a region where no derived alleles were observed (i.e., no white dots). In these cases, reduce the smoothing factor (epsilon) incrementally (by defining -smoothing [number]) until the artifact disappears.

python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNPX -f_coordinates ./sample_coordinates.csv -color pink -derived_coordinates -ancestral_coordinates -continent 'South America' -af_map png -smoothing 2

3.2.2) Additionally, some parameters can be changed to customize the maps

general_allele_frequency_map

  • -color: The color palette can be changed by specifying one of these colors: blue, orange, pink, red, green, yellow, purple, violet, grey [Default: blue]
  • -contour: Adjust the number of different shades for the allele frequencies [Default: 15]

python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNPX -f_coordinates ./sample_coordinates.csv -color pink -continent 'South America' 'North America' -af_map png

3.2.3) More examples

python PhyloImpute.py -freqmap -input ./sample_data.csv -output ./freqmap -f_snp SNP1 -f_coordinates ./sample_coordinates.csv -color orange -country 'Ecuador' -af_map png

country_specific_allele_frequency_map

4) Graphical user interface

PhyloImpute v1.3 with a graphical user interface (GUI) is available for Windows and Linux at https://zenodo.org/records/15864850. A tutorial for the GUI is available via the same link.


