Human Virus Database Prep

Brian Haas edited this page Jan 27, 2023 · 12 revisions

Preparing CTAT-VIF Genome and Virus Resources

This assumes that you have already installed your chosen CTAT Genome Library.

Obtain our customized human virus database: virus_db.nr.fasta

Install the virus database into the CTAT genome library using the utility included in the CTAT-VirusIntegrationFinder software package like so:

CTAT-VirusIntegrationFinder/prep_genome_lib/ctat-vif-lib-integration.py \
    --virus_db virus_db.nr.fasta \
    --genome_lib_dir /path/to/your/ctat_genome_lib_base_dir \
    --CPU 4 # use however many threads you'd like for the STAR index build process.


(or if running Singularity):

singularity exec -e -B `pwd` -B /path/to/your/ctat_genome_lib_base_dir ctat_vif.simg \
    /usr/local/bin/prep_genome_lib/ctat-vif-lib-integration.py \
    --virus_db virus_db.nr.fasta \
    --genome_lib_dir /path/to/your/ctat_genome_lib_base_dir \
    --CPU 4 


Once the above process completes, you'll find a VIF/ subdirectory in your CTAT genome library that contains the viral fasta database and a STAR index that combines the human genome with the viral database. ctat-vif will leverage these additional resources alongside those provided by the standard CTAT genome library.
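A quick way to confirm that the integration step produced the expected subdirectory is sketched below. The helper name check_vif_dir is hypothetical and not part of CTAT-VIF; it only tests that VIF/ exists and lists whatever is inside, without assuming any particular file names.

```shell
# Hypothetical helper (not part of CTAT-VIF): confirm that the VIF/
# subdirectory exists under the genome library and list its contents.
check_vif_dir() {
    genome_lib_dir="$1"
    if [ -d "${genome_lib_dir}/VIF" ]; then
        ls -l "${genome_lib_dir}/VIF"
    else
        echo "No VIF/ subdirectory found under ${genome_lib_dir}" >&2
        return 1
    fi
}

# Example call (uncomment and point at your own library):
# check_vif_dir /path/to/your/ctat_genome_lib_base_dir
```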

Background Info: How the virus database fasta was originally compiled

No action is needed here to set up CTAT-VIF. This section only documents how we created the virus database fasta file above, which is used with CTAT-VirusIntegrationFinder (CTAT-VIF).

Human viruses were downloaded from http://www.virusite.org

The list of human viruses was downloaded with the parameter setting 'group=human' as the file 'human_viruses.list.csv'.

All virus sequences from virusite.org were downloaded as 'genomes.fasta', and the subset of human viruses was then extracted via:

CTAT-VirusIntegrationFinder/util/misc/extract_human_viruses.py | \
    perl -lane 's/_complete_(sequence|genome)//; s/refseq_//; print;' \
    > human_viruses.fasta

We prepended 143 HPV sequences, obtained from collaborating researchers, to this fasta file.
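The prepending step can be sketched as a simple concatenation. This is an assumption about how the combined file was produced, not a record of the exact command used: the function name prepend_fasta and the input file name hpv_seqs.fasta are hypothetical, and we assume the combined output is the virus_db.fasta file used in the cd-hit step.

```shell
# Hypothetical helper: write the records of the first fasta file,
# followed by those of the second, into a combined output file.
prepend_fasta() {
    first_fasta="$1"
    second_fasta="$2"
    combined_fasta="$3"
    cat "$first_fasta" "$second_fasta" > "$combined_fasta"
}

# Example call (hpv_seqs.fasta is an assumed file name):
# prepend_fasta hpv_seqs.fasta human_viruses.fasta virus_db.fasta
```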

To remove redundant virus entries, we clustered the sequences with cd-hit-est at a 98% identity threshold and kept one representative per cluster, preferentially retaining our HPV-labeled entries over the virusite entries.

cd-hit-est -i virus_db.fasta -o virus_db.fasta.cdhit -c .98 -d 0 

select_cluster_rep_HPV_pref.py > accs.selected

acc_to_FASTA_file.pl accs.selected virus_db.fasta > virus_db.nr.fasta

The final virus fasta file, virus_db.nr.fasta, is included among the supplementary data resources for use with CTAT-VIF.
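One simple sanity check on the deduplication is to compare the number of fasta records before and after it. The sketch below is not part of the original workflow, and count_fasta_records is a hypothetical helper name; it just counts header lines.

```shell
# Hypothetical helper: count fasta records by counting header lines,
# i.e. lines that begin with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example calls (uncomment once the files exist):
# echo "before: $(count_fasta_records virus_db.fasta)"
# echo "after:  $(count_fasta_records virus_db.nr.fasta)"
```

The count for virus_db.nr.fasta should be no larger than the count for virus_db.fasta, since clustering can only merge entries.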