Skip to content
/ ViR Public

ViR is designed to solve the dispersion of reads due to intrasample variability for a reliable identification of lateral gene transfer events, with a focus on viral integrations

Notifications You must be signed in to change notification settings

epischedda/ViR

Repository files navigation

ViR

Pipeline developped by Elisa Pischedda, while in the Bonizzoni Lab at the University of Pavia (Italy).

see the ViR paper in BMC Bioinformatics: Pischedda et al., 2021

Purpose

ViR is designed to solve the dispersion of reads due to intrasample variability for a reliable identification of lateral gene transfer events, with a focus on viral integrations. Repetitive elements and/or fragmentation of a genome assembly result in intra-host variability leading to dispersion of reads across sequence-identical regions of the genome (here after called 'equivalent regions') when mapping short paired end reads to predict sites of an integration event. ViR solve the dispersions of reads by recognizing the membership of spared chimeric reads across equivalent genomic region and reconstructing the insertion site, based on available reads.

Figure 1: ViR overview

ViR is composed of four scripts, which work in two modules. The first three scripts "ViR_RefineCandidates.sh", "ViR_SolveDispersion.sh" and "ViR_AlignToGroup.sh" work together to overcome the dispersion of reads due to intrasample variability (module 1). "ViR_LTFinder.sh" is designed to run independently from the others, when testing for integration events of non-host sequences which have none or limited (threshold defined by the user) sequence similarity to sequences of the host (module 2).

Docker

The Dockerfile of ViR is available. See below the commands to run the scripts.

Installation

Dependencies

ViR was evaluated in Ubuntu 16.04 LTS Linux environment and uses the following programs which need to be installed and their path reported as values of the input parameters of the scripts.



Module-1: "ViR_RefineCandidates.sh"


Files preparation

Needed:

  1. A sam file (SAM) generated by the alignment of the raw reads of the sample against the host reference genome. Example: './Example_files/ViR_inputFiles/SCL_VI600.sam'. The path of the sam file must be used as value of the parameter -sam_file in "ViR_RefineCandidates.sh" script;

  2. A tab separated table of the chimeric reads (TXT). Example: './Example_files/ViR_inputFiles/SCL_VI600_chimeric_reads.txt'. The path of the text file must be used as value of the parameter -chimeric_reads_file in "ViR_RefineCandidates.sh" script. Example:

#SAMPLE_ID	READ_ID	HR_CHR	HR_START	HR_MQ	VR_SEQ	VIRUS_ID	VIRUS_START	VIRUS_END	VIRAL_SEQ
SSR4	E00338:95:HGNWJCCXY:1:1221:16640:64984	NW_021838865.1	556999	0	GTCATTGCCGCCATCATCAACGGCATTGAGTGGATCCGTGGCATGGGTGAGTGCCGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTAAACATAGGGGGGTAATCTTCCCCCACGAACCCCCAAAAGAAAGGTTTGGTTGTGTCGGG	NC_043569.1	826	863	TTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTTAAA
  1. Host reference genome (FASTA) in fasta format. The path of the fasta file must be used as value of the parameter -reference_fasta in "ViR_RefineCandidates.sh" script. To use the example files please download the latest version of Aedes albopictus reference genome and run the following command to replace the default scaffold names:
cat GCF_006496715.1_Aalbo_primary.1_genomic.fna | awk -F '\\ Aedes' '{print $1}' >  GCF_006496715.1_Aalbo_primary.1_genomic _clean.fasta

The BLAST database of the fasta file is needed; you can produce it with the following command:

/absolute_path_to/ncbi-blast-2.6.0+/bin/makeblastdb -in /absolute_path_to/Host_Reference_Genome.fasta -dbtype nucl

Paramters

Fill the appropriated parameter values; absolute paths will allow to run "ViR_RefineCandidates.sh" from any directory.

parameter name description default
work_files_dir absolute path of ViR directory
sample_name name of the sample
sam_file absolute path of the SAM file
chimeric_reads_file absolute path of the tab separated table of the chimeric reads
out absolute path of the output directory
max_percentage_dinucleotide_in_ViralSeq maximum percentage of dinucleotide accepted in the viral sequence 0.8
minimum_virus_len minimum length of the viral sequence 30
reference_fasta absolute path of the host reference genome
path_to_blastn absolute path of blastn executable
path_to_bedtools absolute path of bedtools executable
blastn_evalue maximum evalue for the reads alignment 1e-15
min_mate_distance minimum distance between mates to be maintained 10000

Running

Run the following command:

nohup bash /absolute_path_to/VIR-master/ViR_RefineCandidates.sh \
-work_files_dir /absolute_path_to/VIR-master/ \
-sample_name SAMPLE_ID \
-sam_file /absolute_path_to/sample_file.sam \
-chimeric_reads_file /absolute_path_to/sample_chimeric_reads.txt \
-out /absolute_path_to/output_directory_refineCandidates \
-max_percentage_dinucleotide_in_ViralSeq 0.8 \
-minimum_virus_len 30 \
-reference_fasta /absolute_path_to/Host_reference_genome.fasta \
-path_to_blastn /absolute_path_to/blastn \
-path_to_bedtools /absolute_path_to/bedtools \
-blastn_evalue 1e-15 \
-min_mate_distance 10000 &

Docker command

Set the absolute path to the local directory of the reference file (PATH_TO_REF_FILE_DIR); Set the absolute path to the local directory of the output (PATH_TO_OUTPUT_DIR);

docker run \
-v /PATH_TO_OUTPUT_DIR/:/out \
-v /PATH_TO_REF_FILE_DIR/:/ref \
-i pipelinetools:vir \
bash ViR_RefineCandidates.sh \
-work_files_dir ./ \
-sample_name SCL_VI600 \
-sam_file ./Example_files/ViR_inputFiles/SCL_VI600.sam \
-chimeric_reads_file ./Example_files/ViR_inputFiles/SCL_VI600_chimeric_reads.txt \
-out /out \
-max_percentage_dinucleotide_in_ViralSeq 0.8 \
-minimum_virus_len 30 \
-reference_fasta /ref/GCF_006496715.1_Aalbo_primary.1_genomic_clean.fasta \
-path_to_blastn blastn \
-path_to_bedtools bedtools \
-blastn_evalue 1e-15 \
-min_mate_distance 850

Output

"ViR_RefineCandidates.sh" outputs three files: "Final_ChimericPairs_Info.txt", "Final_HostReads.fasta" and "Final_ViralReads.fasta". Example output files are in the 'RC_OutputFiles' directory.

  1. The "Final_ChimericPairs_Info.txt" is a tab delimited file including the following columns for the chimeric reads passing the filters;
# SAMPLE_ID	READ_ID	HR_CHR	HR_START	HR_END	HR_SEQ	HR_FLAG	HR_NT_BQ20	HR_MQ	VR_SEQ	VR_FLAG	VR_NT_BQ20	VIRUS_ID	VIRUS_START	VIRUS_END	VIRUS_SEQ	VIRUS_SEQ_LEN	VR_AlignToRef?
SSR4	E00338:95:HGNWJCCXY:4:1122:28716:42481	NW_021837046.1	23938349	23938500	TGTTCTGGGCGGTGGAACGCCATCAGAAAGTTTTGTCCGCTTGCCTCGAAGCCGTAGCAGCATCAGTATTGTTGGGGGCGTATCGCGGCGTGATCCAAGATCTTCCGCCCGAAGTTTTGGTATTCTGTCTGACAATGGGGTGGAAGACGTC	73	147	0	GCTGCCACCGTCGGATGATTGGCTCTCTCTGCGGCCCAGATAGCCCCAGCGCTCGCTAAGGACACAGTCCAAAACCACACGTTCATTCCAGAGTCGCTGAATGAGCTCTTGGGCTCGCGCCAATTCCACACGAATCTGTAGAGCACGTTC	133	135	M91671.1	4038	4071	CCCAGCGCTCGCTAAGGACACAGTCCAAAACCA	33	Yes
  1. The fasta file of the host reads in the sample, "Final_HostReads.fasta";

  2. The fasta file of the viral reads in the sample, "Final_ViralReads.fasta".



Module-1: "ViR_SolveDispersion.sh"

(including "ViR_AlignToGroup.sh")


Files preparation

The followings are needed:

  1. The output directory of ViR_RefineCandidates (DIRECTORY) which has been used as value of the parameter -out in "ViR_RefineCandidates.sh". This directory has to be set as value of -outdict_RefCand in "ViR_SolveDispersion.sh";

  2. The list of the samples (TXT) to analyze together. The path of the text file must be used as value of the parameter -sample_list in "ViR_SolveDispersion.sh" script. A single or multiple samples can be used at the same time. Example: './Example_files/otherFiles/example_SD_sample_list.txt';

  3. Host reference genome (FASTA) in fasta format. The path of the fasta file must be used as value of the parameter -reference_fasta in "ViR_SolveDispersion.sh" script. The same reference genome used for ViR_RefineCandidates;

  4. OPTIONAL Transposable elements (FASTA) in fasta format. The path of the fasta file must be used as value of the parameter -repreg_fasta in "ViR_SolveDispersion.sh" script. If -repreg_fasta is used, also set the -min_TE_al_length parameter. The BLAST database of the fasta file is needed; you can produce it with the following command:

/absolute_path_to/ncbi-blast-XXX+/bin/makeblastdb -in /absolute_path_to/Transposable_elements.fasta -dbtype nucl
  1. OPTIONAL Bed file of the annotated EVEs (BED) in the reference genome. The path of the bed file must be used as value of the parameter -bed_EVE_annotated in "ViR_SolveDispersion.sh" script. If -bed_EVE_annotated is used, also set the -eve_dist parameter. Example: './Example_files/otherFiles/Aalb_nrEVEs.bed';

  2. OPTIONAL Bed file of the annotated piwi clusters (BED) in the reference genome. The path of the fasta file must be used as value of the parameter -bed_piwi_clusters in "ViR_SolveDispersion.sh" script. If -bed_piwi_clusters is used, also set the -piwi_dist parameter.


Paramters

Fill the appropriated parameter values; it is suggest to use absolute paths to allow you to run "ViR_SolveDispersion.sh" from any directory.

parameter name description default
work_files_dir absolute path of ViR directory
outdict_RefCand absolute path of ViR_RefineCandidates output directory
analysis_name name of the analysis
sample_list absolute path of the file including the list of the sample to analyze
out absolute path of the output directory
reference_fasta absolute path of the host reference genome
repreg_fasta absolute path of the repeated element fasta file
min_TE_al_length minimum alignment length in bp of a repeated element in the host genome 100
bed_EVE_annotated absolute path pf the bed file of the annotated EVEs in the host genome
eve_dist minimum distance in bp between the candidate integration and an EVE annotated to be reported 10000
bed_piwi_clusters absolute path of the bed file of the piwi clusters annotated in the host genome
piwi_dist minimum distance in bp between the candidate integration and a piwi cluster annotated to be reported 0
path_to_blastn absolute path of blastn executable
path_to_bedtools absolute path of bedtools executable
trinity_exe absolute path of trinity executable
samtools_exe absolute path of samtools executable
bwa_exe absolute path of bwa executable
merge_dist maximum distance in bp between aligned reads to be considered in the same read group 1000
minReads_inRegion minimum number of reads in a read group 2
percReadsShared_inGroup_union minimum percentage of reads shared by two read groups to be merged 0.8

Running

Run the following command:

nohup bash /absolute_path_to/VIR-master/ViR_SolveDispersion.sh \
-work_files_dir /absolute_path_to/VIR-master/ \
-outdict_RefCand /absolute_path_to/output_directory_refineCandidates \
-analysis_name SAMPLE_ID \
-sample_list /absolute_path_to/sample_list.list \
-out /absolute_path_to/output_directory_solveDispersion \
-reference_fasta /absolute_path_to/Host_Reference_Genome.fa \
-repreg_fasta /absolute_path_to/Transposable_elements.fa \
-min_TE_al_length 100 \
-bed_EVE_annotated /absolute_path_to/EVEs.bed \
-eve_dist 10000 \
-bed_piwi_clusters /absolute_path_to/piwiClusters.bed \
-piwi_dist 0 \
-path_to_blastn /absolute_path_to/blastn \
-path_to_bedtools /absolute_path_to/bedtools \
-trinity_exe /absolute_path_to/Trinity \
-samtools_exe /absolute_path_to/samtools \
-bwa_exe /absolute_path_to/bwa \
-merge_dist 1000 \
-minReads_inRegion 2 \
-percReadsShared_inGroup_union 0.8 &

Docker command

Set the absolute path to the local directory of the reference file (PATH_TO_REF_FILE_DIR); Set the absolute path to the local directory of the output (PATH_TO_OUTPUT_DIR);

docker run \
-v /PATH_TO_OUTPUT_DIR/:/out \
-v /PATH_TO_REF_FILE_DIR/:/ref \
-i pipelinetools:vir \
bash ViR_SolveDispersion.sh \
-work_files_dir ./ \
-outdict_RefCand ./Example_files/ViR_Outputs/ViR_RefineCandidates \
-analysis_name SCL_VI600 \
-sample_list ./Example_files/otherFiles/example_SD_sample_list.txt \
-out /out \
-reference_fasta /ref/GCF_006496715.1_Aalbo_primary.1_genomic_clean.fasta \
-bed_EVE_annotated ./Example_files/otherFiles/Aalb_nrEVEs.bed \
-path_to_blastn blastn \
-path_to_bedtools bedtools \
-trinity_exe Trinity \
-samtools_exe samtools \
-bwa_exe bwa \
-merge_dist 1000 \
-eve_dist 10000 \
-minReads_inRegion 2 \
-percReadsShared_inGroup_union 0.8

Output

"ViR_SolveDispersion.sh" outputs five files ( "All_Equivalent_Regions_per_Read_Group.txt", "Complete_Dataset_Info.txt", "Complete_Read_Groups_Info.txt", "Equivalent_region_per_Read_Group.fasta" and "Equivalent_region_per_Read_Group_sorted.bed") and a directory named 'RG'. Example output files are in the 'SD_OutputFiles' directory.

  1. The "All_Equivalent_Regions_per_Read_Group.txt" includes for each read groups found in the sample, the list of the read ids in the group and all the equivalent regions in the genome in which the reads have an alignements. Usually, equivalent regions are repetivive elements in the genome;

  2. The "Complete_Dataset_Info.txt" is the same as "Final_ChimericPairs_Info.txt" plus a column including the read group assigned to the read pair. If there is not a group, the read pair is tagged as 'Ungrouped';

  3. The "Complete_Read_Groups_Info.txt" is a tab delimited file including the description of the integration site and the related boundaries region. For each sample, the number of reads belonging from each group is reported. Example:

READ_GROUP	SCAFFOLD	START	END	LENGHT	NUM_EQ_RG	near/in_EVEs	in_PIRCS	near_TE	VIRUS_ID	VIRUS_START	VIRUS_END	MAX_VIRAL_SEQ	NUM_READS	NUM_SAMPLES	SSR5
RG1	NW_021838429.1	852077	859110	7033	6	Rhabdo39,Rhabdo63,Rhabdo60,Rhabdo61,Rhabdo40,Rhabdo41,Rhabdo42		rnd-1_family-630#LTR/Gypsy	NC_025378.1	6775	6810	35	5	1	5
RG2	NW_021838154.1	19239220	19239448	228	0	Flavi19			NC_012932.1	4036	4156	120	2	1	2
RG3	NW_021838153.1	165688121	165689244	1123	16				NC_018464.1	837	873	32	6	1	6
Ungrouped									NC_012932.1				1	1	1
Shared													0	
  1. For each group, the coordinates of an equivalent region are reported in "Equivalent_region_per_Read_Group_sorted.bed";

  2. For each group, the sequence of an equivalent region is reported in "Equivalent_region_per_Read_Group.fasta";

  3. The 'RG' directory includes for each sample, the realignment of the reads in the assigned read group produced by "ViR_AlignToGroup.sh". If enough reads are available a consensus sequence of the integration site is created.



Module-2: "ViR_LTFinder.sh"


Files preparation

You will need:

  1. The list of the sample reads (TXT) that you want to analyze in a tab delimited text file. The path of the text file must be used as value of the parameter -read_list in "ViR_LTFinder.sh" script. See the example file "./Example_files/otherFiles/example_LTFinder_sample_list.txt". Structure example:
/absolute_path_to/Reads_R1.fq /absolute_path_to/Reads_R2.fq
  1. A working directory (DIRECTORY). The path of the directory must be used as value of the parameter -working_dir in "ViR_LTFinder.sh" script;

  2. The non-host sequence (FASTA) in fasta format. The path of the fasta file must be used as value of the parameter -non_host_fasta in "ViR_LTFinder.sh" script. Example: "./Example_files/otherFiles/DENV2.fasta".


Paramters

Fill the appropriated parameter values; it is suggest to use absolute paths to allow you to run "ViR_LTFinder.sh" from any directory.

parameter name description
analysis_name sample name
read_list absolute path of the file including the list of the sample reads
working_dir absolute path of the workind directory
th threads
mem RAM
non_host_fasta absolute path of the fasta file of the non host sequence
trinity_exe absolute path of trinity executable

Running

Fill the appropriate parameter values; absolute paths will allow o run "ViR_LTFinder.sh" from any directory. Run the following command:

nohup nohup bash /absolute_path_to/VIR-master/ViR_LTFinder.sh \
-analysis_name SAMPLE_ID \
-read_list /absolute_path_to/sample_reads.txt \
-working_dir /absolute_path_to/working_directory \
-th 1 \
-mem 2G \
-non_host_fasta /absolute_path_to/non-host_sequence.fasta \
-trinity_exe /absolute_path_to/Trinity &

Docker command

Set the absolute path to the local directory of the output (PATH_TO_OUTPUT_DIR);

docker run \
-v /PATH_TO_OUTPUT_DIR/:/out \
-i pipelinetools:vir \
bash ViR_LTFinder.sh \
-analysis_name SSM_Cov_30_SCL_VI600 \
-read_list ./Example_files/otherFiles/example_LTFinder_sample_list.txt \
-working_dir /out \
-th 1 \
-mem 2G \
-non_host_fasta ./Example_files/otherFiles/DENV2.fasta \
-trinity_exe Trinity

Output

If it is possible to create a de novo consensus sequence of the aligned reads, "ViR_LTFinder.sh" creates a "{file}_assembly.fa" including that sequence.




About

ViR is designed to solve the dispersion of reads due to intrasample variability for a reliable identification of lateral gene transfer events, with a focus on viral integrations

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages