TIGER_Scripts-for-distribution

This repository contains the scripts and documentation for the TIGER pipeline described in Rowan, Patel et al. 2015 in G3 (doi: 10.1534/g3.114.016501) for reconstructing recombinant genomes from low-coverage, short-read sequencing data.

# TIGER Documentation

The TIGER Scripts were written by Vipul Patel and Korbinian Schneeberger.

Scripts for generating TIGER input files were written by Beth Rowan.

The get_subset.pl script was written by Joerg Hagmann.

This documentation file was written by Beth Rowan.

Questions about these scripts can be directed to Beth Rowan (beth.rowan@tuebingen.mpg.de)

Last updated 13 April 2016

# Getting started

1. Create input files and get set up

Convert the SHORE alignment output (map.list) or the BWA alignment file (.bam) into an input file. The input file has a format like this: <#chr> <#pos> <#parent1 allele> <#reads supporting parent1 allele> <#reads supporting parent2 allele>
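As a quick sanity check of this layout, the following Python sketch parses one line of a marker-count file. It assumes the five columns described above are tab-delimited, which this document does not state. The class, function, and field names are hypothetical and are not part of TIGER.

```python
from typing import NamedTuple


class MarkerCount(NamedTuple):
    """One row of a TIGER-style input file (field names are illustrative)."""
    chrom: str
    pos: int
    parent1_allele: str
    parent1_reads: int
    parent2_reads: int


def parse_marker_line(line: str, sep: str = "\t") -> MarkerCount:
    """Parse one input line; the tab delimiter is an assumption."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    chrom, pos, allele, p1_reads, p2_reads = fields
    return MarkerCount(chrom, int(pos), allele, int(p1_reads), int(p2_reads))


# Hypothetical example row: chromosome 1, position 1045, parent 1 allele "A",
# 3 reads supporting parent 1 and 0 reads supporting parent 2.
record = parse_marker_line("1\t1045\tA\t3\t0\n")
print(record)
```

Running a check like this over a whole file before starting the pipeline can catch truncated or mis-delimited lines early.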

After marker creation, contigs/scaffolds with low marker density have to be filtered out by running the following command (the minimum should be at least 100 markers per contig/scaffold):

perl  filter_marker_set_density.pl -c marker_set -n minimum_number_of_markers -o output
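To make the idea of a density filter concrete, here is a minimal Python sketch that keeps only markers on contigs with at least a minimum number of markers. It is not the logic of filter_marker_set_density.pl, which is not shown in this document. It assumes whitespace-delimited lines with the contig name in the first column, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def filter_by_marker_density(lines: Iterable[str], min_markers: int) -> List[str]:
    """Keep marker lines whose contig (column 1) has at least min_markers markers."""
    rows = [ln for ln in lines if ln.strip()]
    # Count markers per contig/scaffold using the first column.
    counts = Counter(ln.split()[0] for ln in rows)
    return [ln for ln in rows if counts[ln.split()[0]] >= min_markers]


# Toy data: contig "c1" has three markers, contig "c2" has one.
markers = ["c1\t10", "c1\t25", "c1\t40", "c2\t7"]
kept = filter_by_marker_density(markers, min_markers=2)
print(kept)
```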

Make one file with all markers with only indels filtered out (this one is called “complete”)

In our example, the file is called input.complete24.txt

Make a second file with all filters applied (this one is called “corrected”)

In our example, the file is called input.corrected24.txt

Put these input files into a separate directory (called input).

You will also need a tab-delimited file showing the chromosome numbers and their lengths.
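The sketch below shows one way to read such a file in Python, for example to check it before running the pipeline. It assumes two tab-delimited columns (chromosome, then length in bp) and no header line. The lengths in the example are made up for illustration, and the function name is hypothetical.

```python
import io
from typing import Dict, TextIO


def read_chr_sizes(handle: TextIO) -> Dict[str, int]:
    """Read a two-column, tab-delimited file: chromosome, then length in bp."""
    sizes = {}
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        chrom, length = line.rstrip("\n").split("\t")
        sizes[chrom] = int(length)
    return sizes


# Hypothetical contents; these lengths are invented for illustration only.
example = io.StringIO("1\t3000000\n2\t2500000\n")
sizes = read_chr_sizes(example)
print(sizes)
```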

We have scripts for generating these input files from either SAM (samtotiger.sh) or BAM (bamtotiger.sh) files. You will need to provide your own filtered and unfiltered marker files. All that is required of the marker files is that the chromosome number is in the first column, the position (in bp) is in the second column, and each chromosome meets the minimum marker density.

Finally, you will need to install a Java 7 runtime environment to run the .jar programs.

What follows is a list of the individual commands for the TIGER pipeline. I have also provided an example of how to run all of the steps in parallel on our compute cluster (tigerrunclust.sh).

Run the base caller on the corrected input file.

General command: java -jar base_caller.jar -r $CORRECTED_INPUT_FILE -o $BASE_CALL_OUTPUT -n bi

Notes:

-r specifies the corrected marker counts input file

-o designates the output file

-n specifies that it is biparental

In the provided example, this was the command I ran:

java -jar base_caller.jar -r ./input/allele_count.corrected.txt -o allele_count.base_call.txt -n bi

This output file is basically a single line for each chromosome with all of the positions assigned a base caller genotype. Note that the genotype notation uses C and L for the two different parental alleles instead of A and B as described in Rowan, Patel et al. 2015.

Run the allele frequency estimator

General command: java -jar allele_freq_estimator.jar -r $CORRECTED_INPUT_FILE -o $ALLELE_FREQUENCIES_FOR_BMM -n bi -w 1000

-r specifies the input file (this is again the corrected marker counts input file)

-o specifies an output file

-n specifies that it’s biparental

-w specifies window size

In the provided example, this was the command I ran:

java -jar allele_freq_estimator.jar -r ./input/allele_count.corrected.txt -o frequencies_for_bmm.txt -n bi -w 1000
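For intuition about what a windowed allele frequency is, here is a generic Python sketch. The internals of allele_freq_estimator.jar are not described in this document, so this is not its algorithm. Treating the window as a number of markers (rather than base pairs) is an assumption, and the function name is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def windowed_parent1_frequency(
    counts: Sequence[Tuple[int, int]], window: int
) -> List[Optional[float]]:
    """For each marker, pool read counts over a window of markers centred on it
    and return the fraction of reads supporting parent 1 (None if no reads)."""
    half = window // 2
    freqs: List[Optional[float]] = []
    for i in range(len(counts)):
        lo, hi = max(0, i - half), min(len(counts), i + half + 1)
        p1 = sum(c[0] for c in counts[lo:hi])
        p2 = sum(c[1] for c in counts[lo:hi])
        total = p1 + p2
        freqs.append(p1 / total if total else None)
    return freqs


# Toy (parent1_reads, parent2_reads) pairs for five consecutive markers.
toy_counts = [(2, 0), (1, 0), (1, 1), (0, 2), (0, 1)]
freqs = windowed_parent1_frequency(toy_counts, window=3)
print(freqs)
```

Pooling counts across neighbouring markers is what makes low-coverage data usable here: a single marker with one or two reads says little on its own, while a window of markers gives a steadier estimate.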

The output file has this format:

Apply the beta mixture model

General command: Rscript --vanilla beta_mixture_model.R $ALLELE_FREQUENCIES_FOR_BMM $BMM_OUTPUT

First argument: the output of the allele_freq_estimator.jar command

Second argument: the output file

The output file will contain two numbers that specify the intersections of the curves

In the example, this was the command that I ran:

Rscript --vanilla beta_mixture_model.R frequencies_for_bmm.txt bmm.intersections.txt

Please note that this step will probably take several hours to process. It takes about three hours for me.
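As an illustration of how two intersection values can act as cut points, the Python sketch below splits allele frequencies into three classes. How hmm_prob.pl actually uses these values is not described in this document, so the class labels, the threshold values, and the function name are all assumptions made for illustration.

```python
def classify_frequency(freq: float, lower: float, upper: float) -> str:
    """Assign a parent 1 allele frequency to one of three classes using two cut points."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if freq < lower:
        return "parent2_homozygous"
    if freq > upper:
        return "parent1_homozygous"
    return "heterozygous"


# Hypothetical intersection values; real ones come from the model output file.
lower, upper = 0.25, 0.75
labels = [classify_frequency(f, lower, upper) for f in (0.05, 0.5, 0.95)]
print(labels)
```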

Prepare files for HMM probability estimation using the BASECALLER output and the output of the beta mixture model.

General command: perl prep_prob.pl -s $LABEL -m $CORRECTED_INPUT_FILE -b $BASE_CALL_OUTPUT -c $CHRSIZES -o $FILE_FOR_PROB

-s specifies sample label (24 in our example case)

-m corrected marker input file

-b file that was the output from the base caller

-c chromosome sizes file

-o specifies output file

The output file looks like this:

In the example, here is the command that I ran:

perl prep_prob.pl -s 24 -m ./input/allele_count.corrected.txt -b allele_count.base_call.txt -c TAIR10_chrSize.txt -o file_for_probabilities.txt

Calculate transition and emission probabilities for the HMM

General command: perl hmm_prob.pl -s $ALLELE_FREQUENCIES_FOR_BMM -p $FILE_FOR_PROB -o sample -a $BMM_OUTPUT -c $CHRSIZES

-s output from the sliding window (allele frequency estimator)

-p output file from previous script (prep_prob.pl)

-o sample (gives a prefix for the output files)

-a output from beta mixture model

-c chromosome sizes file

This gives two output files:

sample_hmm_model (probabilities for the HMM)

sample_sliding_window (genotyping only based on the sliding window)

In the provided example, this is the command I used:

perl hmm_prob.pl -s frequencies_for_bmm.txt -p file_for_probabilities.txt -o sample -a bmm.intersections.txt -c TAIR10_chrSize.txt

Run the HMM

General command: java -jar hmm_play.jar -r $BASE_CALL_OUTPUT -o $HMM_OUTPUT -t bi -z sample_hmm_model

-r output of the base caller file

-o specify output file

-t bi (for biparental)

-z output from last script (probabilities contained in file sample_hmm_model)

Example script:

java -jar hmm_play.jar -r allele_count.base_call.txt -o hmm.out.txt -t bi -z sample_hmm_model

The output file has a single line of base caller genotypes for each chromosome, then two lines of genotypes inferred from HMM, and a fourth line with just the probabilities.
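If you want to inspect this output programmatically, one option is to group the lines into records of four. The Python sketch below does that. It assumes the four lines for each chromosome are consecutive and that the file has no header. The key names and the function name are hypothetical, and the placeholder strings stand in for real line contents, whose exact format is not shown here.

```python
from typing import Dict, Iterable, List


def group_hmm_output(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Group non-empty lines into records of four consecutive lines."""
    keys = ("base_calls", "hmm_genotypes_1", "hmm_genotypes_2", "probabilities")
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    return [dict(zip(keys, rows[i:i + 4])) for i in range(0, len(rows), 4)]


# Placeholder strings; real lines hold genotypes and probabilities.
toy = ["calls_chr1", "hmm1_chr1", "hmm2_chr1", "probs_chr1",
       "calls_chr2", "hmm1_chr2", "hmm2_chr2", "probs_chr2"]
records = group_hmm_output(toy)
print(len(records), records[1]["probabilities"])
```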

Get rough estimate of recombination breakpoint positions

General command:

perl prepare_break.pl -s $LABEL -m $CORRECTED_INPUT_FILE -b $HMM_OUTPUT -c $CHRSIZES -o $ROUGH_CO

-s sample label

-m marker file with corrected markers

-b output file from previous script (hmm_play.jar)

-c chromosome sizes file

-o output file

Example script:

perl prepare_break.pl -s 24 -m input/allele_count.corrected.txt  -b hmm.out.txt -c TAIR10_chrSize.txt -o rough_COs.txt

Two output files: $ROUGH_CO.txt <hmm_inferred_genotype1><hmm_inferred_genotype2>

$ROUGH_CO.breaks.txt

Refine recombination breaks

General command: perl refine_recombination_break.pl $COMPLETE_INPUT_FILE $ROUGH_CO.breaks.txt

First argument: marker input file with complete data

Second argument: “breaks” output file from previous script

Example script:

perl refine_recombination_break.pl input/allele_count.complete.txt rough_COs.breaks.txt

This gives three output files:

$ROUGH_CO.recomb.txt

$ROUGH_CO.refined.breaks.txt

$ROUGH_CO.refined.recomb.txt

Smooth out breaks

General command:

perl breaks_smoother.pl -b $ROUGH_CO.refined.breaks.txt -o $SMOOTH_CO

-b “refined breaks” output from previous script

-o specify output file

Example script:

perl breaks_smoother.pl -b rough_COs.refined.breaks.txt -o corrected.refined.breaks.txt

The output is like the “breaks” output, but with breakpoints corrected using the markers that had been filtered out.

Visualize output

General command:

R --slave --vanilla --args sample_id $VISUALIZATION.pdf $ROUGH_CO.breaks.txt $ROUGH_CO.refined.breaks.txt corrected.refined.breaks.txt $ALLELE_FREQUENCIES_FOR_BMM sample_sliding_window.breaks.txt < plot_genotyping.R

usage:

Example script:

R --slave --vanilla --args 24 visual_out.pdf rough_COs.breaks.txt rough_COs.refined.breaks.txt corrected.refined.breaks.txt frequencies_for_bmm.txt sample_sliding_window.breaks.txt < plot_genotyping.R
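The steps above can be collected into a small driver. The Python sketch below only builds the argument list for each step, using the flags and example file names from this document, and prints them; it does not run anything. The function name and the step names used as dictionary keys are hypothetical.

```python
import shlex
from typing import Dict, List


def tiger_commands(label: str, corrected: str, complete: str,
                   chr_sizes: str) -> Dict[str, List[str]]:
    """Build the argument list for each step; nothing is executed here."""
    return {
        "base_caller": ["java", "-jar", "base_caller.jar", "-r", corrected,
                        "-o", "allele_count.base_call.txt", "-n", "bi"],
        "allele_freq": ["java", "-jar", "allele_freq_estimator.jar", "-r", corrected,
                        "-o", "frequencies_for_bmm.txt", "-n", "bi", "-w", "1000"],
        "beta_mixture": ["Rscript", "--vanilla", "beta_mixture_model.R",
                         "frequencies_for_bmm.txt", "bmm.intersections.txt"],
        "prep_prob": ["perl", "prep_prob.pl", "-s", label, "-m", corrected,
                      "-b", "allele_count.base_call.txt", "-c", chr_sizes,
                      "-o", "file_for_probabilities.txt"],
        "hmm_prob": ["perl", "hmm_prob.pl", "-s", "frequencies_for_bmm.txt",
                     "-p", "file_for_probabilities.txt", "-o", "sample",
                     "-a", "bmm.intersections.txt", "-c", chr_sizes],
        "hmm": ["java", "-jar", "hmm_play.jar", "-r", "allele_count.base_call.txt",
                "-o", "hmm.out.txt", "-t", "bi", "-z", "sample_hmm_model"],
        "rough_breaks": ["perl", "prepare_break.pl", "-s", label, "-m", corrected,
                         "-b", "hmm.out.txt", "-c", chr_sizes, "-o", "rough_COs.txt"],
        "refine_breaks": ["perl", "refine_recombination_break.pl", complete,
                          "rough_COs.breaks.txt"],
        "smooth_breaks": ["perl", "breaks_smoother.pl",
                          "-b", "rough_COs.refined.breaks.txt",
                          "-o", "corrected.refined.breaks.txt"],
        # The plotting step reads plot_genotyping.R from standard input,
        # so the script must be supplied as stdin when this is run.
        "plot": ["R", "--slave", "--vanilla", "--args", label, "visual_out.pdf",
                 "rough_COs.breaks.txt", "rough_COs.refined.breaks.txt",
                 "corrected.refined.breaks.txt", "frequencies_for_bmm.txt",
                 "sample_sliding_window.breaks.txt"],
    }


commands = tiger_commands("24", "./input/allele_count.corrected.txt",
                          "input/allele_count.complete.txt", "TAIR10_chrSize.txt")
for step, argv in commands.items():
    print(step + ":", " ".join(shlex.quote(arg) for arg in argv))
```

To actually run the steps, each list could be passed to subprocess.run(argv, check=True) in order. For the plot step, plot_genotyping.R would also need to be opened and passed as the stdin argument.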
