Human Virus Database Prep
This assumes that you have already installed your chosen CTAT Genome Library.
Obtain our customized human virus database: virus_db.nr.fasta
Install the virus database into the CTAT genome library using the utility included in the CTAT-VirusIntegrationFinder software package like so:
CTAT-VirusIntegrationFinder/prep_genome_lib/ctat-vif-lib-integration.py \
--virus_db virus_db.nr.fasta \
--genome_lib_dir /path/to/your/ctat_genome_lib_base_dir \
--CPU 4 # use however many threads you'd like for the STAR index build process.
(or if running Singularity):
singularity exec -e -B `pwd` -B /path/to/your/ctat_genome_lib_base_dir ctat_vif.simg \
/usr/local/bin/prep_genome_lib/ctat-vif-lib-integration.py \
--virus_db virus_db.nr.fasta \
--genome_lib_dir /path/to/your/ctat_genome_lib_base_dir \
--CPU 4
Once the above process completes, you'll find a VIF/ subdirectory in your CTAT genome library that contains the viral fasta database and a STAR index that combines the human genome with the viral database. These additional resources will be leveraged by ctat-vif alongside those provided as part of the standard CTAT genome library.
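As a quick sanity check after the install step, something like the following Python sketch could confirm that the VIF/ subdirectory was created. Only the VIF/ directory name comes from the description above; the function name is hypothetical, and the check does not inspect individual files because their names are not documented here.

```python
from pathlib import Path


def vif_resources_present(genome_lib_dir: str) -> bool:
    """Return True if genome_lib_dir contains a non-empty VIF/ subdirectory.

    Hypothetical helper: it only checks that the directory exists and holds
    at least one entry, not that the STAR index inside it is complete.
    """
    vif_dir = Path(genome_lib_dir) / "VIF"
    return vif_dir.is_dir() and any(vif_dir.iterdir())


# Replace the placeholder with your own CTAT genome library path.
print(vif_resources_present("/path/to/your/ctat_genome_lib_base_dir"))
```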
The rest of this page requires no action for setting up CTAT-VIF; it only describes how the above virus fasta file was constructed.
Below is documentation on how we created the virus database that's used with CTAT-VirusIntegrationFinder (CTAT-VIF).
Human viruses were downloaded from http://www.virusite.org
The list of human viruses was downloaded with the parameter setting 'group=human' as file 'human_viruses.list.csv'.
All virus sequences from virusite.org were downloaded as 'genomes.fasta',
and then the subset of human viruses was extracted via:
CTAT-VirusIntegrationFinder/util/misc/extract_human_viruses.py | \
perl -lane 's/_complete_(sequence|genome)//; s/refseq_//; print;' \
> human_viruses.fasta
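To illustrate what the perl substitutions in the command above do to each line, here is a rough Python equivalent. The example header is made up for illustration and may not match the actual virusite.org naming; the function name is likewise hypothetical.

```python
import re


def clean_header_line(line: str) -> str:
    """Mimic: s/_complete_(sequence|genome)//; s/refseq_//

    Like the perl one-liner (which has no /g flag), only the first match of
    each pattern is removed, and the substitutions are applied to every line.
    """
    line = re.sub(r"_complete_(sequence|genome)", "", line, count=1)
    line = re.sub(r"refseq_", "", line, count=1)
    return line


# Hypothetical header, for illustration only.
example = ">refseq_NC_000000.1_Example_virus_complete_genome"
print(clean_header_line(example))  # >NC_000000.1_Example_virus
```

Sequence lines pass through unchanged unless they happen to contain one of the two patterns, which is not expected for nucleotide data.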
We prepended 143 HPV sequences to this fasta file as obtained from collaborating researchers.
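The prepending step amounts to concatenating two fasta files with the HPV sequences first. A minimal sketch, assuming the HPV sequences live in a file of their own (its name is not given here, so any path you pass in is your own choice) and that the combined output is the virus_db.fasta file used as input in the cd-hit step below:

```python
def prepend_fasta(first_path: str, second_path: str, out_path: str) -> None:
    """Write the contents of first_path, then second_path, to out_path.

    A newline is added to any line that lacks one, so a missing trailing
    newline in the first file cannot fuse its last sequence line with the
    next header.
    """
    with open(out_path, "w") as out:
        for path in (first_path, second_path):
            with open(path) as src:
                for line in src:
                    out.write(line if line.endswith("\n") else line + "\n")


# Usage (the first file name is hypothetical):
# prepend_fasta("hpv_sequences.fasta", "human_viruses.fasta", "virus_db.fasta")
```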
To eliminate redundant virus entries, we clustered the sequences using cd-hit at a 98% identity threshold and selected one representative per cluster, preferentially retaining our HPV-labeled entries over the virusite entries:
cd-hit-est -i virus_db.fasta -o virus_db.fasta.cdhit -c .98 -d 0
select_cluster_rep_HPV_pref.py > accs.selected
acc_to_FASTA_file.pl accs.selected virus_db.fasta > virus_db.nr.fasta
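The representative-selection idea can be sketched in Python as follows. This is not the code of select_cluster_rep_HPV_pref.py, whose logic is not shown here. It assumes the standard cd-hit .clstr layout (a ">Cluster N" line followed by member lines, with the cd-hit representative marked by "*") and assumes that HPV-labeled entries can be recognized by the substring "HPV" in their accession. The accessions in the example are made up.

```python
import re

# Member lines look like: "0<TAB>7906nt, >ACCESSION... *"
# or "1<TAB>7904nt, >ACCESSION... at +/99.50%".
MEMBER_RE = re.compile(r">(.+?)\.\.\. (\*|at .*)$")


def select_cluster_reps(clstr_lines, preferred_tag="HPV"):
    """Pick one accession per cluster, preferring names containing preferred_tag."""
    clusters = []
    for raw in clstr_lines:
        line = raw.strip()
        if line.startswith(">Cluster"):
            clusters.append([])
            continue
        match = MEMBER_RE.search(line)
        if match and clusters:
            # Store (accession, is_cd_hit_representative).
            clusters[-1].append((match.group(1), match.group(2) == "*"))

    selected = []
    for members in clusters:
        if not members:
            continue
        preferred = [acc for acc, _ in members if preferred_tag in acc]
        cd_hit_reps = [acc for acc, is_rep in members if is_rep]
        if preferred:
            selected.append(preferred[0])    # assumed HPV preference
        elif cd_hit_reps:
            selected.append(cd_hit_reps[0])  # fall back to cd-hit's own choice
        else:
            selected.append(members[0][0])
    return selected


# Hypothetical cluster file content.
example = [
    ">Cluster 0",
    "0\t7906nt, >NC_000001.1_Example_papillomavirus... *",
    "1\t7904nt, >HPV_example_A... at +/99.50%",
    ">Cluster 1",
    "0\t5000nt, >NC_000002.1_Example_virus... *",
]
print(select_cluster_reps(example))  # ['HPV_example_A', 'NC_000002.1_Example_virus']
```

In the first cluster the HPV-labeled member is chosen even though cd-hit marked the other sequence as its representative; the second cluster has no HPV-labeled member, so the cd-hit representative is kept.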
This final virus fasta file was named virus_db.nr.fasta and is included among the supplementary data resources for use with CTAT-VIF.