Note: It is strongly recommended to have fisher installed on your system. It is fast in dealing with large datasets.
Citation: If you used PyBSASeq in your pulications, please cite:
- Zhang, J., Panthee, D.R. PyBSASeq: a simple and effective algorithm for bulked segregant analysis with whole-genome sequencing data. BMC Bioinformatics 21, 99 (2020). https://doi.org/10.1186/s12859-020-3435-8
- Zhang, J., Panthee, D.R. Next-generation sequencing-based bulked segregant analysis without sequencing the parental genomes. G3 Genes|Genomes|Genetics, jkab400 (2021), https://doi.org/10.1093/g3journal/jkab400
A novel algorithm with high detection power for BSA-Seq data analysis - the significant structural variant method
Python 3.6 or above is required to run the script.
$ python PyBSASeq.py -i input -o output -p popstrct -b fbsize,sbsize
Here are the details of the options used in the script:
input
– the names of the input files (the GATK4-generated tsv/csv file). If you have variant calling data from both the parents and the bulks, the format is as follows:parents.tsv,bulks.tsv
. If you have only the variant calling data of the bulks, the format is as follows:bulks.tsv
. The scriptPyBSASeq.py
can handle both situations now. The script and the input files should be in the same folder.output
– the name of the output file. Default isBSASeq.csv
popstrct
– population structure; three choices available: F2 for an F2 population, RIL for a population of recombinant inbred lines, or BC for a backcross populationfbsize
– the number of individuals in the first bulksbsize
– the number of individuals in the second bulk
The default cutoff p-value for identifying significant structural variants (sSV) from the SV dataset is 0.01 (alpha
), and the default cutoff p-value for identifying sSVs from the simulated dataset is 0.01 (smalpha
). These values can be changed using the following options:
-v alpha,smalpha
alpha
and smalpha
should be in the range of 0.0 – 1.0, the chosen value should make statistical sense. The greater the smalpha
value, the higher the threshold and the lower the false positive rate.
The default size of the sliding window is 2000000 (base pairs) and the incremental step is 10000 (base pairs), and their values can be changed using the following option:
-s slidingWindowSize,incrementalStep
Four files (sliding_windows.csv
, sv_fagz.csv
, sv_fagz_fep.csv
, and threshold.txt
) and a folder in the date_time
format containing BSASeq.csv
and BSASeq.pdf
will be generated in the ./Results
folder after succesfully running the script. If gaps between subplots in PyBSASeq.pdf
are too wide or too narrow, we can rerun the script to fine-tune the gaps by changing the values of a
and/or b
using the options below:
-a True -g a,b,c,d,e,f
a
- the horizontal gap between subplotsb
- the vertical gap between subplotsc
,d
,e
, andf
- the top, bottom, left, and right margins of the plot, respectively
The default values for a
, b
, c
, d
, e
, and f
are 0.028, 0.056, 0.0264, 0.054, 0.076, 0.002, 0.002, respectively. The script uses the sliding_windows.csv
and threshold.txt
files generated previously for plotting, and it is very fast.
If two or more peaks/valleys and all the values in between are beyond the confidence intervals/thresholds, only the highest peak or the lowerest valley will be identified as the peak/valley of this region. We can rerun the script to identify the positions of the other peaks/valleys and test their significance using the option below:
-e a1,b1,c1,a2,b2,c2,......,an,bn,cn
a
- chromosome idb
- the start point of a chromosomal fragmentc
- the end point of a chromosomal fragment
Right now, this option will not work if the chromosome IDs in the reference genome sequences are not digits, with the exception of sex chromosomes; we can use 1000 - 1005 to respectively represent sex chromosomes X, Y, Z, W, U, and V when specify regions on these chromosomes. Multiple regions on the same chromsome can be selected.
- Structural variant (SV) filtering
- Perform Fisher's exact test using the AD values of each SV in both bulks. A SV would be identified as an sSV if its p-value is less than
alpha
. In the meantime, simulated REF/ALT reads of each SV is obtained via simulation under null hypothesis, and Fisher's exact test is also performed using these simulated AD values; for each SV, it would be an sSV if its p-value is less thansmalpha
. Identification of sSVs from the simulated dataset is for threshold calculation. - Threshold calculation. The result is saved in the
threshold.txt
file. Thethreshold.txt
file needs to be deleted if starting over is desired (e.g, if the size of the sliding window is changed). - Plotting.
Two SV datasets for testing purpose, one from rice while the other from maize, are stored in the Data folder. The rice .csv files were generated via GATK4 using the sequencing data from the work of Lahari et al while the maize .csv files were extracted from .csv/.tsv files generated by Zheng et al.
In the file PyBSASeq.pdf
, peaks/valleys in panels B
, C
, and D
that are beyond their genome-wide thresholds/confidence intervals very likely represent genomic regions controlling the trait of interest. The genomic region-specific thresholds/confidence intervals of these peaks/valleys are calculated to verrify their significance status. The results are presented in the file BSASeq.csv
. A peak/valley is significant if its significance_status is equal to 1, not significant otherwise.
In BSASeq.csv
, each row indicates a sliding window that contains a potentially significant peak/valley. The meanning of each column is shown below.
- firstbulk_id.AvgLD: the average sequencing depth of the first bulk
- secondbulk_id.AvgLD: the average sequencing depth of the second bulk
- sSV: the number of the significant SVs in the sliding window
- totalSV: the number of the total SVs in the sliding window
- sSV/totalSV: the sSV/totalSV ratio in the sliding window
- Threshold_sSV: the threshold of the sSV/totalSV ratio of the sliding window
- GS: the G-statistic value of the sliding window
- Threshold_GS: the threshold of the G-statistic value of the sliding window
- DAF: the allele frequency difference of the sliding window
- DAF_CI_LB: the DAF confidence interval lower bound of the sliding window
- DAF_CI_UB: the DAF confidence interval upper bound of the sliding window
- Threshold_DAF: the DAF threshold of the sliding window when parental genome sequences are not avaiable (or not used)
- pvalue_tt: the t-test p-value of the sliding window
- Significance_SSV: the significant status of the sliding window when using the significant SV method
- Significance_GS: the significant status of the sliding window when using the G-statistic method
- Significance_AF: the significant status of the sliding window when using the allele frequency method
- Significance_TT: the significant status of the sliding window when using the t-test method
Please report here if encounter any issue. I can receive issue notifications now.
BSA-Seq data analysis can be done using either the SNP index (allele frequency) method or the G-statistic method as well. I implemented both methods in Python for the purpose of comparison: PySNPIndex and PyGStatistic. The Python implementation of the original G-statistic method by Magwene can be found here (just found this site today, 6/27/2019), and the R implementation of both methods by Mansfeld can be found here.