This pipeline is part of the publication: Matthey-Doret et al., 2019, Genome Biology and evolution, doi: 10.1093/gbe/evz219
In this project, we use restriction-site associated DNA-sequencing (RAD-seq) and build a custom pipeline with the aim to locate and identify the complementary sex determination (CSD) locus/loci in the parasitoid wasp species Lysiphlebus fabarum. This repository contains a pipeline to map the reads using BWA and build loci using the different components of STACKS with optimal parameters and perform downstream analyses. It was designed to run on a distributed High Performance Computing (HPC) environment with LSF but can also be ran on a personal machine.
To run the pipeline with the data provided:
- Download or clone this repository.
- Download the data archive from zenodo at: [https://zenodo.org/record/1488603#.W-1V9KBReLw]
- Uncompress the downloaded file using
tar xzvf 20181115_csd_data.tar.gz data
and the resultingdata
folder to the root of the repository. - Download the gzipped fastq files from BioProject PRJNA505237, on SRA.
- Move the RADseq files to
data/processed
and the WGS files todata/wgs/raw/
. - Run the pipeline
- On a cluster with LSF:
make
to run the STACKS pipelinemake assoc_mapping
to run the association mapping (needs STACKS data)make collinearity
to compute collinearity blocks (needs STACKS data)make wgs_wild
to run the analysis of wild WGS samples
- On a local machine
make ref_local
to run the STACKS pipelinemake assoc mapping
to run the association mapping (needs STACKS data)make collinearity LOCAL=yes
to compute collinearity blocks (needs STACKS data)make wgs_wild LOCAL=yes
to run the analysis of wild WGS samples
- On a cluster with LSF:
To run the STACKS pipeline with new data in the form of demultiplexed, trimmed single end reads in compressed fastq files (.fq.gz):
-
Describe your samples by writing 2 files named
popmap.tsv
andindividuals.tsv
, respectively. The structure of thepopmap.tsv
file is described on the official STACKS website (here, populations should be the sex of individuals). Theindividuals.tsv
file is a tab delimited text file with 4 columns with the following headers included:- Name: The names of samples. This should be the name of their data files (e.g. if the sample name is SAMPLE1, the corresponding reads file should be named SAMPLE1.fq.gz).
- Sex: F for females and M for males.
- Family: Clutches to which the individual belongs. These can be any combination of alphanumeric characters.
- Generation: Useful if there are mothers and offspring. Values should be F3 for mothers and F4 for offspring.
-
Create an empty folder named data and place the 2 files inside. This folder needs to be located inside the same directory as src.
-
Place your (trimmed, demultiplexed) reads in a subfolder of data named
processed
and your reference genome in a subfolder namedref_genome
. You will also need to edit theREF
path inconfig.mk
accordingly. If you wish to use different folder names, just edit the corresponding paths inconfig.mk
. -
Set the variable
D
inconfig.mk
to 20 (minimum locus depth for STACKS populations). Typemake
in the command line (ormake ref_local
if running on a local machine). Once the pipeline has finished running, set the variableD
back to 5 and typemake ploidy
to infer ploidy from the homozygosity of variant sites. Note the threshold selected to define ploidy is adapted to the dataset presented here. You might want to define a threshold yourself by inspecting the distribution of homozygosity (HOM variable) indata/ploidy/thresholds/fixed.tsv
and the variableHOM_PLOID
inconfig.mk
to this value. Once you have modified the threshold, runmake ploidy
again to update the ploidy classification with the new threshold. -
Type
make -B
to run the pipeline again without haploids.
The easiest and recommended way to install the dependencies is to install Anaconda and setup an environment. For convenience, a YAML environment file is provided in the docs
folder and you can setup the conda environment directly from it using conda env create -f csd_env_conda.yml
. You can then activate it using source activate csd_env
and all dependencies should be enabled.
These dependencies are required to run the main analysis. This include the RADseq pipeline and association mapping.
- FastQC 0.11.5: Quality control of sequencing data.
- BWA 0.7.15
- STACKS 1.48 (Download link)
- SAMtools 1.4
- VCFtools 0.1.15
- BEDtools 2.26
- Trimmomatic 0.36
- R 3.4.3
- Python 2.7.x
These tools are required only to run additional analyses.
The src
folders contains all scripts required to run the analysis along with other programs used for automated report generation and benchmarking. Those scripts are organized into several sub folders:
-
assoc_mapping
: This folder contains scripts used to locate candidate CSD region(s).case_control.R
: Performs the actual association mapping, using the Fisher exact test.chrom_types.R
: Modeling the proportion of recombinant offspring along chromosomes to locate centromeres.process_genomic.py
: Process genomic output from STACKS' populations module by transforming numeric encoding of genotypes into homozygous/heterozygous/missing, removing loci that are either homozygous or missing in mothers from their families and computing proportion of homozygous individuals per sex/family at each site.
-
mapping
: This folder contains scripts used to map processed sequencing reads to the reference genome of Lysiphlebus fabarum.bwa_script.sh
: Coordinates the mapping of all samples using BWA, sending the output tosplit_sam.pl
.split_sam.pl
: Parses the output sam files to split single hits and multiple hits into separate files and convert the files into bam format.
-
misc
: This folder contains various scripts used at different steps of the pipeline.parse_VCF.sh
: Uses vcftools to compute several statistics from the output VCF file returned by the populations module of STACKS and store them in text files inside thevcftools
subfolder. This script is used in theploidy
Makefile rule, when inferring the ploidy of males.
-
ploidy
: This folder contains scripts required to classify males as diploid or haploid based on the genomic data.haplo_males.py
: Uses a threshold to infer the ploidy of males.blacklist_haploloci.py
: Generates a list of loci to blacklist in subsequent runs. Loci blacklisted are those heterozygous in more than 50% of haploid males, by default.
-
process_reads
: This folder contains the script used for demultiplexing, trimming and removing adaptors from raw sequencing reads. These are not implemented in the pipeline as the processed reads should be provided. -
stacks_pipeline
: This folder contains scripts required to run the different components of the STACKS suitepstacks_script.sh
: Produces 'stacks' from processed reads, using the Pstacks module.cstacks_script.sh
: Constructs a catalogue of loci from the output ofpstacks_script.sh
. Only files containing at least 10% of the mean number of RAD-tags (computed over all files) are included in the catalogue to remove poor quality samples.group_sstacks.sh
: Copy pstacks and cstacks output files to the sample folder to provide a working directory formulti_sstacks.sh
.sstacks_script.sh
: Produces 'match' files from pstacks and cstacks output files (stacks and catalogue, respectively).populations.sh
: Uses the populations module to compute populations statistics and generate different outputs from sstacks output files.
Once the data.tar.gz
has been uncompressed, the data folder should contain the following files:
annotations
: Contains genome annotations from the BIPAA.individuals.tsv
: Detailed characteristic of each individuals: Name, Sex, Family and Generation where F4 are son/daughter and F3 is the mother.ploidy
: contains information about the ploidy of individuals in the dataset. Ploidy information is stored in `thresholds/fixed.tsv.popmap.tsv
: Population map required by STACKS to match sample names to population group (i.e. male and female).processed
: This folder contains the processed RAD-tags generated for each sample using process_radtags.ref_genome
: This folder contains the reference genome.wgs
: contains the raw reads from whole genome sequencing of a wild population of L. fabarum andwgs_samples.tsv
, a list of sample names with their sex.
After the pipeline has been running, all intermediary and final output files will be generated and stored in their respective sub-folders inside data
. Notable output files include:
- Association mapping hits:
data/assoc_mapping/case_control/case_control_hits.tsv
- STACKS populations output files:
data/popula
- Locations of centromeres:
data/assoc_mapping/centro
- List of samples including ploidy and coverage statistics:
data/ploidy/thresholds/fixed.tsv
- Nucleotidic diversity stats from WGS samples:
data/wgs_wild/stats/
- List of SNPs with centiMorgan information extracted from the linkage map:
data/linkage_map/bp2cm/bp2cm_csd_snps.tsv