Detection of structural variants (SVs) in stem rust by whole genome alignment and genotyping using diagnostic k-mers

This repository includes custom python scripts used for detecting SVs in a panel of diverse Puccinia graminis f.sp. tritici isolates that causes stem rust disease in wheat. The results of these analyses are soon to be pubslished (Yuanwen Guo, Bliss Betzen, Andres Salcedo, Fei He, Robert L. Bowden, John P. Fellers, Katherine W. Jordan, Alina Akhunova, Mathew N. Rouse, Les J. Szabo, Eduard Akhunov. Population genomics of Puccinia graminis f.sp. tritici highlights the role of admixture in the origin of virulent wheat rust races. 2022, submitted.)

Questions about the codes from this repository should be addressed to yuanwenguo2015@gmail.com. If you have general questions about diversity analyses in this study or mapping Sr-AvrSr genes, please contact eakhunov@ksu.edu.

Dependencies needed for analysis:
MUMmer (https://mummer4.github.io/index.html)
BEDtools (https://bedtools.readthedocs.io/en/latest/)
Python3+ (with modules of pandas, numpy, pybedtools, fnmatch installed)
Jellyfish2 (https://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual-1.1.pdf)

SV detection

Create reference chromosome files in fasta format for each stem rust isolate haplotype. One file should contain one chromosome. The sequences of each homologous chromosome pair should be in same 5'-to-3' orientation. In our study, we used for haplotype-resolved assemblies: 99KS76A-E, 99KS76-F, Pgt21-A1, and Pgt21-B1. Each haplotype had 18 chromosomes.
Chromosomes for 99KS76A-E are: chr1-E, chr2-E…
Chromosomes for 99KS76A-F are: chr1-F, chr2-F…
Chromosomes for Pgt21-A1 are: a1_chr_1, a1_chr_2…
Chromosomes for Pgt21-B1 are: b1_chr_1, b1_chr_2…

Use MUMMER to align pairs of homologous chromosomes from each haplotype:
nucmer --mum -p ref_chr1-E.que_chr1-F -t 10 $ref $que
Detailed instruction for running MUMMER can be found at: https://mummer4.github.io/index.html.

Filter alignments to keep alignments that show the identify level above 90%:
delta-filter -i 90 -r -q ref_chr1-E.que_chr1-F.delta > ref_chr1-E.que_chr1-F.filter.delta

Convert alignments to "coords" format (see MUMMER manual for details about this format):
show-coords -rc -B ref_chr1-E.que_chr1-F.filter.delta > ref_chr1-E.que_chr1-F.filter.coords

Prepare a meta-file listing all homologous chromosomes used for pair-wise MUMMER alignments in the following format (one homologous chromosome set per line):
chr1-E chr1-F a1_chr_1 b1_chr_1
chr2-E chr2-F a1_chr_2 b1_chr_2
chr3-E chr3-F a1_chr_3 b1_chr_3

Create a directory (for example, "SV_detection") and then copy the following files into this directory: alignments file in .coords format and a meta file with the list of homologous chromosomes. The script's output file will be generated in the same directory.

Run SV parsing script:
python SV_detection.py path_to_your_directory meta_file_name reference_haplotype_name first_query_haplotype_name second_query_haplotype_name third_query_haplotype_name output_file_name_you_specified

Please note, the order of haplotype names “reference_haplotype_name first_query_haplotype_name second_query_haplotype_name third_query_haplotype_name” should correspond to the order of chromosomal haplotypes specified in the meta_file. For example:
python SV_detection.py $/SV_detection/ each_chr_haplotypes.txt 99KS76A-E 99KS76-F Pgt21-A1 Pgt21-B1 SV_infor.txt

The format of the output file SV_infor.txt:
ref ref_start ref_stop que que_start que_stop SV_type SV_size name
chr1-E 376866 381115 chr1-F 113706.0 147078.0 expansion 29123.0 chr1-E_376866-381115
chr1-E 422631 463706 chr1-F 174136.0 196734.0 contraction 18477.0 chr1-E_422631-463706
chr1-E 465024 465075 chr1-F 198037.0 208717.0 expansion 10629.0 chr1-E_465024-465075
chr1-E 1033102 1033106 chr1-F 669334.0 681549.0 insertion 12211.0 chr1-E_1033102-1033106
chr1-E 1414005 1438065 chr1-F 970625.0 1011943.0 expansion 17258.0 chr1-E_1414005-1438065

k-mer-based genotyping of SV sites

Organize your data. Create one folder per stem rust isolate with folders named using the stem rust isolates' IDs. Each folder should contain raw fastq files generated by Illumina technology for a particular isolate. Most of downstream analyses are performed within these folders.

Run jellyfish2 wihtin each folder to create k-mer database for each stem rust isolate in the panel. For example, to create k-mer database for isolate 00M063C_S59 using raw Illumina fastq data, run the following command:
zcat *.fastq.gz | jellyfish count -C -o 00M063C_S59.jf -m 31 -t 30 -s 3G /dev/fd/0
Detailed instruction about running Jellyfish can be found at: https://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual-1.1.pdf

Based on the analyses of SVs using MUMMER alignments, for each insertion/deletion SV, create a diagnostic set of 31 bp k-mers. These diagnostic k-mers should span the boundaries of the SV sites before and after insertion. For each SV, there should be three sets of k-mers: 1 set of site before insertion and two sets for leaft and right boundaries after insertion. More detailed description of k-mer selection procedure could be found in the paper referenced above. The example of the diagnostic k-mer file format:
ID kmer
chr1-E_156237-156412_presence_deletion::chr1-E:156209-156265 CATTTCTCCAGATTCCCATGCACAAGGCAGA
chr1-E_156237-156412_presence_deletion::chr1-E:156209-156265 ATTTCTCCAGATTCCCATGCACAAGGCAGAC
chr1-E_156237-156412_presence_deletion::chr1-E:156209-156265 TTTCTCCAGATTCCCATGCACAAGGCAGACT
chr1-E_156237-156412_presence_deletion::chr1-E:156209-156265 TTCTCCAGATTCCCATGCACAAGGCAGACTC
chr1-E_156237-156412_presence_deletion::chr1-E:156209-156265 TCTCCAGATTCCCATGCACAAGGCAGACTCT

Use this diagnostic k-mer file to query the the k-mer databases generated by jellyfish for each isolate. For each SV, three queries need to be made. A query should include a set of diagnostic k-mers tagging either SV absence allele or each of the two boundaries of SV presence allele.

Example of jellyfish command to obtain the counts of diagnostic k-mers in the k-mer databases generated for each isolate:
jellyfish query 00M063C_S59.jf CATTAAAACCTAGGAAATAAATTTAAAAGGG ATTAAAACCTAGGAAATAAATTTAAAAGGGC TTAAAACCTAGGAAATAAATTTAAAAGGGCC TAAAACCTAGGAAATAAATTTAAAAGGGCCT (... list of all diagnostic k-mers for a SV site ....) >chr1-E_156237-156412_absence_deletion.txt

Please note that the format the output file name from this command should match this example: chr1-E_156237-156412_absence_deletion.txt

Example of the file with k-mer counts generated by the jellyfish query:
AGATTATGATCAGCCTGGACACCACGACACT 1623
AAGATTATGATCAGCCTGGACACCACGACAC 1615
CAAGATTATGATCAGCCTGGACACCACGACA 1604
GTCGTGGTGTCCAGGCTGATCATAATCTTGA 1649
GTCAAGATTATGATCAGCCTGGACACCACGA 1642
AGTCAAGATTATGATCAGCCTGGACACCACG 1660
GAGTCAAGATTATGATCAGCCTGGACACCAC 1
CGAGTCAAGATTATGATCAGCCTGGACACCA 1
GCGAGTCAAGATTATGATCAGCCTGGACACC 0
GGCGAGTCAAGATTATGATCAGCCTGGACAC 0

These k-mer count files should be saved within the folders created for each isolate. For this purpose, create a directory (for example, "SV_kmer_genotyping") with a subdirectory named “kmer_counts_in_each_isolate”. Within this subdirectory, create directories named after each isolate ID. For example, if you specify a directory named “SV_kmer_genotyping”,the k-mer counts for isolate 00M063C_S59 should be saved in directory /SV_kmer_genotyping/kmer_counts_in_each_isolate/00M063C_S59/

Create a file listing each isolate's ID:
00M063C_S59
00MN99C_S38
01MN84-A-1-2_S51
Place this file into the "SV_kmer_genotyping" directory.

Run k-mer genotyping script. This script will use the list of isolate IDs to process k-mer count files in each isolates folder and summarize data as SV genotype calls in a single file.
How to run script:
python SV_kmer_genotyping.py main_directory_you_specified SV_information_file diagnostic_kmer_file isolate_meta_file outputfile_you_specified

Command example:
python SV_kmer_genotyping.py $/SV_kmer_genotyping SV_infor.insertion_deletion.txt SV.insertion_deletion.kmer isolates.ID SV_kmer_genotype.txt

After running, the output file SV_kmer_genotype.txt will be in the main directory (“SV_kmer_genotyping”). The example of the output file format:
SV_ID SV_type SV_size group pos 00M063C_S59 00MN99C_S38 01MN84-A-1-2_S51
chr1-E_156237-156412 deletion 165.0 1 156237 1/0 1/0 1/0
chr1-E_218030-218872 deletion 845.0 1 218030 1/0 1/0 0/0
chr1-E_219741-219742 insertion 94.0 1 219741 1/0 1/0 ./.
chr1-E_298216-298384 deletion 165.0 1 298216 0/0 0/0 1/1
chr1-E_366296-366312 insertion 2478.0 1 366296 1/0 1/0 0/0
chr1-E_484609-484617 insertion 1326.0 1 484609 1/1 1/1 1/0

To test these scripts, please use examples of the provided input and output files.

Y. Guo and E. Akhunov (Aug. 3, 2022).

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
Codes		Codes
Example_files		Example_files
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Codes

Codes

Example_files

Example_files

README.md

README.md

Repository files navigation

Detection of structural variants (SVs) in stem rust by whole genome alignment and genotyping using diagnostic k-mers

SV detection

k-mer-based genotyping of SV sites

About

Releases 1

Packages

Contributors 2

Languages

akhunovlab/stem_rust_diversity

Folders and files

Latest commit

History

Repository files navigation

Detection of structural variants (SVs) in stem rust by whole genome alignment and genotyping using diagnostic k-mers

SV detection

k-mer-based genotyping of SV sites

About

Resources

Stars

Watchers

Forks

Languages