TAR-VIR is developed to classify RNA viral reads from viral metagenomic data and to produce assembled viral strains (i.e., haplotypes) from the classified reads. It has two main components: (1) viral read classification using partial or remotely related reference genomes; (2) de novo assembly of viral haplotypes from the recruited reads with PEHaplo, a haplotype reconstruction tool. Because TAR-VIR has a modular structure, users can substitute other assembly tools after the read classification in step (1).

To use TAR-VIR, you need two types of data: (1) a read set, such as viral metagenomic data containing reads from viruses; (2) a reference sequence, which can be a gene or a related genome. In the first step, align the reads against the reference sequence using a read mapping tool. We recommend Bowtie2 with default parameters, except for the allowed error function, which should be set to "L,0,-0.6". The output of this step is a SAM file. This SAM file and the read data set are the input to TAR-VIR.

We provide two methods for installing TAR-VIR and PEHaplo. You can install these tools directly by following the instructions below. We also provide packaged versions of TAR-VIR and PEHaplo via Anaconda, which makes installation more straightforward.

New version

If you installed TAR-VIR via conda, a new version is now available. Please use conda install -c kennethshang overlap_extension to update. The new version adds a parameter that allows TAR-VIR to run in parallel to save time, which may be helpful if you want to run TAR-VIR on a large dataset. The parameter is described below:

  -t  number of threads

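   Due to the implementation of TAR-VIR, the value of '-t' must satisfy the equation:
                                   k = nt     (n = 1, 2, 3 ...)
   where k is the parameter giving the number of partitions of the index, t is the number of threads, and n is any positive integer. In other words, k must be a multiple of t.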
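
The constraint above can be checked before launching a run. The following is a minimal sketch, not part of TAR-VIR: the function name valid_thread_count and the example value k = 8 are hypothetical, chosen only to illustrate the rule k = nt.

```shell
# Hypothetical helper (not part of TAR-VIR): succeeds when the number of
# index partitions is a positive multiple of the thread count (k = n*t).
valid_thread_count() {
    parts=$1     # k, the number of index partitions
    threads=$2   # the value intended for -t
    [ "$threads" -gt 0 ] && [ "$parts" -gt 0 ] && [ $((parts % threads)) -eq 0 ]
}

# Example with an assumed k = 8: report which -t values satisfy k = n*t.
for t in 1 2 3 4 8; do
    if valid_thread_count 8 "$t"; then
        echo "t=$t ok"
    else
        echo "t=$t not allowed"
    fi
done
```

With k = 8, every divisor of 8 is accepted and t = 3 is rejected, since no positive integer n gives 8 = 3n.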

Installing via conda (recommended if you want to install both TAR-VIR and PEHaplo)

Note that all the packages can be found on, which means you can easily install them using conda. Follow the Guidance to install them step by step.

Installation without using conda

If you would like to install all the programs without using conda, please still take a look at the Guidance, which contains a running example.

To download the source code:
git clone --recursive

  1. Install Overlap extension module
    This program requires C++11 support.
    cd TAR-VIR
    cd Overlap_extension

  2. Install PEHaplo
    Please look at the ReadMe file for PEHaplo at:


  1. You need to conduct error correction for the reads. By default, we use Karect ("Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data", Bioinformatics).
  2. As mentioned earlier, please use Bowtie2 to align the input reads against the reference sequence. Use all the default parameters except the error function, which should be "L,0,-0.6". The output SAM file will be used as input to the next step.
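
The alignment step might look like the sketch below. It rests on several assumptions that the text above does not confirm: that the "error function" corresponds to Bowtie2's --score-min option, that the reads are unpaired FASTA (hence -f and -U), and that the file names ref.fa, reads.fa and align.sam are placeholders. The helpers run and align_reads are hypothetical and not part of TAR-VIR.

```shell
# run: print the command when DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# align_reads REF READS SAM  (hypothetical helper)
# Builds a Bowtie2 index for REF, then aligns READS and writes SAM.
align_reads() {
    ref=$1; reads=$2; sam=$3
    index=${ref%.*}_index          # e.g. ref.fa -> ref_index
    run bowtie2-build "$ref" "$index"
    # Assumption: the error function "L,0,-0.6" is passed via --score-min;
    # -f marks FASTA input and -U marks unpaired reads.
    run bowtie2 -x "$index" -f -U "$reads" --score-min L,0,-0.6 -S "$sam"
}

# Usage (prints the commands without running Bowtie2):
#   DRY_RUN=1; align_reads ref.fa reads.fa align.sam
```

Setting DRY_RUN first makes it possible to inspect the exact commands before committing to a long alignment. Paired-end input would need different Bowtie2 input options than the ones assumed here.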


  1. Overlap extension
    After compilation, there will be two binary files: build and overlap
    (1) build reads index
    ./build -f reads.fa -o prefix
    (2) recruit reads
    ./overlap -S align.sam -x prefix -f reads.fa -c overlap_cutoff -o recruited_reads.fa
    align.sam contains the alignment results of reads.fa against the available reference

Test data and running examples
The test data sets are in folder Overlap_extension/test_data/.
The read data set (virus.fa) contains simulated viral reads from HIV-1, HCV genotype 1, and HGV.
The SAM file (HIV.sam) contains a small subset of aligned HIV-1 reads. With these 287 aligned reads, more HIV-1 reads can be recruited from the viral metagenomic data set.
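
To inspect how many alignments a SAM file holds before using it as a seed, one option is to count the records whose FLAG field does not have the "unmapped" bit (0x4) set. This is a minimal sketch; count_mapped_records is a hypothetical helper, and it counts alignment records, which may differ from a count of distinct reads for paired-end data.

```shell
# Hypothetical helper: count SAM records that are mapped, i.e. whose
# FLAG (column 2) does not have bit 0x4 set. Header lines start with '@'.
count_mapped_records() {
    awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_mapped_records test_data/HIV.sam
```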

cd Overlap_extension/
./build -f test_data/virus.fa -o virus
./overlap -S test_data/HIV.sam -x virus -f test_data/virus.fa -c 180 -o virus_recruit.fa
If everything works, the number of recruited reads should be 8008.
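
One way to check the result is to count the FASTA records in the output file. This is a minimal sketch that assumes each recruited read is written as one record with a header line starting with '>'; count_fasta_records is a hypothetical helper name.

```shell
# Hypothetical helper: count FASTA records by counting header lines.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage, after running the example above:
#   count_fasta_records virus_recruit.fa
```

The printed count can then be compared against the expected number of recruited reads stated above.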

  1. Assemble
    The recruited reads usually contain both single-end and paired-end reads; use the '-f' option of PEHaplo to input them as one FASTA file.
    For details, please look at