# Brugia Viral Search

The goal of this analysis is to identify, and ideally assemble at least partially, any viruses present in the Brugia RNA-Seq data.

The first step is to run Kraken on the data to identify likely viruses from the RefSeq database. Because Kraken is limited to RefSeq, it is unlikely to be comprehensive enough to help much with de novo assembly, so LMAT is going to be run as well.

All data for this project can be found here: /scratch/at120/brugia-rnaseq/brugia_viral_search
Currently only working with this run: 2015-07-24_H55MKBCXX

The Brugia reads have been filtered out via TopHat, which may or may not be sufficient. We will have to see.

## H55MKBCXX Run

In [None]:
qsub run-kraken.sh

Once Kraken has generated output, the names of reads that are viral and not bacteria, archaea, or phiX can be obtained with the following command:

In [None]:
grep 'Virus' kraken.translate.out | grep -v 'phiX' | cut -f1 > viral-read-names.txt
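The same filter can be sketched in Python, which makes it easier to extend later. This is only a rough equivalent of the grep pipeline above; it assumes each line of `kraken.translate.out` is the read name, a tab, then the taxonomy string. The function name `viral_read_names` is made up for illustration.

```python
def viral_read_names(lines):
    """Yield read names from lines that mention 'Virus' but not 'phiX'.

    Mirrors: grep 'Virus' | grep -v 'phiX' | cut -f1
    Assumes tab-separated lines with the read name in the first column.
    """
    for line in lines:
        line = line.rstrip("\n")
        if "Virus" in line and "phiX" not in line:
            # cut -f1: keep everything before the first tab
            yield line.split("\t", 1)[0]
```

With a real file this would be used as `viral_read_names(open("kraken.translate.out"))`.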

Once these read names are generated, the script below can be executed to extract the reads that are either:
A) viral and not phiX based on the kraken classification
B) not listed in the kraken output since these are "non-Brugia/wolbachia" reads.

In [None]:
python extract-viral-reads.py
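The extraction script itself isn't shown here, so the following is just a sketch of the selection rule described above (keep a read if it is viral and not phiX, or if it is absent from the Kraken output). The function names and the way the name sets are passed in are assumptions, not the actual contents of `extract-viral-reads.py`.

```python
from itertools import islice


def fastq_records(lines):
    """Yield (read_name, record_text) for each 4-line FASTQ record."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        # Read name = header without '@', up to the first whitespace
        name = record[0][1:].split()[0]
        yield name, "".join(record)


def select_reads(lines, viral_names, classified_names):
    """Keep reads that are viral (A) or missing from the Kraken output (B)."""
    for name, text in fastq_records(lines):
        if name in viral_names or name not in classified_names:
            yield text
```

Here `viral_names` would come from `viral-read-names.txt` and `classified_names` from the first column of the Kraken output, both loaded as Python sets.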

Once that's done, we need to sort the reads into pairs and orphans.

I need to add this functionality to the original extraction script, but for now I'm going to use the khmer script.

In [None]:
module load biopython
python extract-paired-reads.py all-unmapped.r1.viral.fastq all-unmapped.r2.viral.fastq all-unmapped.r1.viral.pe.fastq all-unmapped.r2.viral.pe.fastq all-unmapped.viral.se.fastq
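For reference, the pairing logic boils down to comparing read names between the two files. A minimal sketch, assuming the names have already been read from the r1 and r2 FASTQ files (the function name `split_pairs` is hypothetical, and this is not the khmer implementation):

```python
def split_pairs(r1_names, r2_names):
    """Split read names into proper pairs and orphans.

    Returns (paired, r1_orphans, r2_orphans); order follows the input lists.
    """
    r1_set = set(r1_names)
    r2_set = set(r2_names)
    paired = [n for n in r1_names if n in r2_set]
    r1_orphans = [n for n in r1_names if n not in r2_set]
    r2_orphans = [n for n in r2_names if n not in r1_set]
    return paired, r1_orphans, r2_orphans
```

Reads in `paired` would go to the `.pe.fastq` files and both orphan lists to the single `.se.fastq` file.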

For the first run, I'm sending it through IDBA to see if there's anything fruitful without manipulating the data (normalizing, partitioning, more filtering, etc.).

In [None]:
python interleave-fastq-to-fasta.py all-unmapped.r1.viral.pe.fastq all-unmapped.r2.viral.pe.fastq > all-unmapped.viral.interleaved.fasta

qsub -v path=/scratch/at120/brugia-rnaseq/brugia_viral_search,fasta=all-unmapped.viral.interleaved.fasta run-idba.sh
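The interleaving step is simple enough to sketch. This assumes both paired FASTQ files list the same reads in the same order and tags mates with `/1` and `/2`; the real `interleave-fastq-to-fasta.py` may name records differently.

```python
from itertools import islice


def _records(lines):
    """Yield (read_name, sequence) from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return
        yield rec[0][1:].split()[0], rec[1].strip()


def interleave_to_fasta(r1_lines, r2_lines):
    """Yield FASTA records alternating mate 1 and mate 2 of each pair."""
    for (n1, s1), (n2, s2) in zip(_records(r1_lines), _records(r2_lines)):
        if n1 != n2:
            raise ValueError(f"mates out of sync: {n1} vs {n2}")
        yield f">{n1}/1\n{s1}\n"
        yield f">{n2}/2\n{s2}\n"
```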

Putting this through IDBA, BLASTing, and then running Krona yielded 0 virus contigs.

### Choristoneura occidentalis granulovirus

Next to PhiX, this is the most prevalent virus in this run, so I'm going to start with it.

http://www.ncbi.nlm.nih.gov/nuccore/NC_008168.1

#### Mapping

I'm mapping the reads to the genome initially to see what kind of coverage we're getting and whether a near-complete genome is possible.

I think the insert size is about -50 (i.e. the mates overlap). Bowtie2 doesn't accept a negative insert size, which is inconvenient.

In [None]:
module load bowtie2/2.2.7
module load samtools/intel/1.3
cd /scratch/at120/brugia-rnaseq/brugia_viral_search/suspected-viruses/Choristoneura_occidentalis_granulovirus

bowtie2-build choristoneura_genome.fasta choristoneura_genome.fasta

bowtie2 \
-p 12 \
--very-sensitive-local \
--un-conc non-choco-virus.fastq \
-x choristoneura_genome.fasta \
-1 all-unmapped.r1.viral.pe.fastq \
-2 all-unmapped.r2.viral.pe.fastq \
-S choristoneura_genome.sam

module load samtools/intel/1.3
samtools view -b -o choristoneura_genome.bam choristoneura_genome.sam

4011418 reads; of these:
  4011418 (100.00%) were paired; of these:
    968151 (24.13%) aligned concordantly 0 times
    3041538 (75.82%) aligned concordantly exactly 1 time
    1729 (0.04%) aligned concordantly >1 times
    ----
    968151 pairs aligned concordantly 0 times; of these:
      678294 (70.06%) aligned discordantly 1 time
    ----
    289857 pairs aligned 0 times concordantly or discordantly; of these:
      579714 mates make up the pairs; of these:
        579000 (99.88%) aligned 0 times
        580 (0.10%) aligned exactly 1 time
        134 (0.02%) aligned >1 times
92.78% overall alignment rate
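As a sanity check, the overall alignment rate can be reproduced from the numbers above: every mate counts as aligned unless it is among the 579000 mates that aligned 0 times. The function name is just for illustration.

```python
def overall_alignment_rate(pairs, unaligned_mates):
    """Percentage of mates that aligned at least once."""
    total_mates = 2 * pairs
    return 100 * (total_mates - unaligned_mates) / total_mates


# Numbers taken from the bowtie2 summary above
rate = overall_alignment_rate(pairs=4011418, unaligned_mates=579000)
print(f"{rate:.2f}%")  # 92.78%
```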

The majority of these reads mapped to a very small segment, presumably one of the unique regions used by Kraken, which would explain why so many reads were assigned to this virus.

I'm concluding this isn't present in this sample.

![alt text](/Users/alan/projects/brugia_viral_search/choco-virus/igv_snapshot.png)

### Simbu Virus

http://www.ncbi.nlm.nih.gov/genome/?term=Simbu%20virus

All of the reference sequences can be found with the Kraken database: /scratch/at120/shared/db/kraken/standard/library/Viruses/

This one showed the same pattern, so I'm ruling it out.

### All Kraken Viruses

I'm going to map all of the reads to the list of viruses that have more than 500 reads assigned to them and see what comes of it. It'll be quicker this way to interrogate potentially full genomes.
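Picking out the viruses above the read threshold could be scripted along these lines. This is a sketch that assumes per-species counts in the `taxonomy<TAB>count` layout (as produced by an mpa-style Kraken report); the function name and the default threshold argument are illustrative.

```python
def species_above(lines, min_reads=500):
    """Return (species, count) for species with more than min_reads reads."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        taxonomy, count = line.rsplit("\t", 1)
        species = taxonomy.split("|")[-1]
        # Only keep species-level rows, e.g. 's__Simbu_virus'
        if species.startswith("s__") and int(count) > min_reads:
            hits.append((species[3:], int(count)))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```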

In [None]:
cd /scratch/at120/brugia-rnaseq/brugia_viral_search/all-kraken-viruses

cat /scratch/at120/shared/db/kraken/standard/library/Viruses/*/*fna > all-kraken-seqs.fasta

bowtie2-build all-kraken-seqs.fasta all-kraken-seqs.fasta
bowtie2 \
-p 12 \
--very-sensitive-local \
--un-conc unconc.fastq \
-x all-kraken-seqs.fasta \
-1 all-unmapped.r1.viral.pe.fastq \
-2 all-unmapped.r2.viral.pe.fastq \
-S all-kraken-seqs.sam

samtools view -b -o all-kraken-seqs.bam all-kraken-seqs.sam
samtools sort -o all-kraken-seqs.sort.bam all-kraken-seqs.bam
samtools index all-kraken-seqs.sort.bam

The top 15 of these each had a 100 bp region where a lot of reads mapped; nothing fruitful other than that.
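One way to put a number on "reads piled up in one small region" is the fraction of each reference covered by at least one read. A sketch, assuming the three-column `samtools depth` output (reference, position, depth) and a dictionary of reference lengths supplied by the caller; the function and argument names are made up.

```python
from collections import Counter


def covered_fraction(depth_lines, ref_lengths, min_depth=1):
    """Fraction of each reference with depth >= min_depth.

    depth_lines: tab-separated 'ref<TAB>pos<TAB>depth' lines
    ref_lengths: dict mapping reference name to its length in bp
    """
    covered = Counter()
    for line in depth_lines:
        ref, _pos, depth = line.rstrip("\n").split("\t")
        if int(depth) >= min_depth:
            covered[ref] += 1
    return {ref: covered[ref] / length for ref, length in ref_lengths.items()}
```

A reference with thousands of reads but a covered fraction near zero would match the pattern seen in IGV.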

### LMAT

Since Kraken failed pretty miserably at providing any decent results, I'm going to give LMAT a try to see if it does any better at providing a frame of reference for finding viruses.

## All Brugia Runs

### Filter Reads

In [None]:
cd /scratch/at120/brugia-rnaseq/brugia_viral_search/all-runs

cat /data/cgsb/gencore/out/Ghedin/2014-08-12_HAAULADXX/*R1* > all-r1.fastq.gz
cat /data/cgsb/gencore/out/Ghedin/2014-08-12_HAAULADXX/*R2* > all-r2.fastq.gz
cat /data/cgsb/gencore/out/Ghedin/2015-03-24_H2V3NBCXX/*/*n01* >> all-r1.fastq.gz
cat /data/cgsb/gencore/out/Ghedin/2015-03-24_H2V3NBCXX/*/*n02* >> all-r2.fastq.gz
cat /data/cgsb/gencore/out/Ghedin/2015-10-06_H5KNLBGXX/combined/*n01* >> all-r1.fastq.gz
cat /data/cgsb/gencore/out/Ghedin/2015-10-06_H5KNLBGXX/combined/*n02* >> all-r2.fastq.gz

qsub -v path=/scratch/at120/brugia-rnaseq/brugia_viral_search,fastq=all ../run-tophat.sh

This took WAY too long so I used the filter script seen here:

In [None]:
cd /scratch/at120/brugia-rnaseq/brugia_viral_search/filtered-data

qsub -v fastq=all ../scripts/filter-brugia.sh

Since there was a lot of adaptor read-through, I will run Trimmomatic on these data (see the "Weird looking reads" section below).

The adaptor sequences I used are:
GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
ATCTCGTATGCCGTCTTCTGCTTG
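These three sequences need to be in FASTA format for Trimmomatic's `ILLUMINACLIP` step. A small sketch that builds such a file; the record names (`adaptor_1` etc.) are placeholders I made up, and the output file name is left to the caller.

```python
ADAPTORS = {
    "adaptor_1": "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "adaptor_2": "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
    "adaptor_3": "ATCTCGTATGCCGTCTTCTGCTTG",
}


def to_fasta(records):
    """Format a name -> sequence mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


fasta_text = to_fasta(ADAPTORS)
```

Writing `fasta_text` to `adaptors.fa` would give the file referenced in the Trimmomatic command.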

In [None]:
module load trimmomatic

cd /scratch/at120/brugia-rnaseq/brugia_viral_search/filtered-data/trimmed

java -jar /share/apps/trimmomatic/0.32/trimmomatic-0.32.jar \
PE \
all.non-rRNA.deconseq_clean.r1.fastq \
all.non-rRNA.deconseq_clean.r2.fastq \
all.non-rRNA.deconseq_clean.trimmed.r1.fastq \
all.non-rRNA.deconseq_clean.trimmed.se.r1.fastq \
all.non-rRNA.deconseq_clean.trimmed.r2.fastq \
all.non-rRNA.deconseq_clean.trimmed.se.r2.fastq \
ILLUMINACLIP:adaptors.fa:2:30:10 \
LEADING:3 \
TRAILING:3 \
SLIDINGWINDOW:4:15 \
MINLEN:28 

Lots of data thrown out, which was expected after looking at the reads.

Input Read Pairs: 11942310 Both Surviving: 491434 (4.12%) Forward Only Surviving: 47105 (0.39%) Reverse Only Surviving: 19382 (0.16%) Dropped: 11384389 (95.33%)
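The percentages in that summary can be double-checked from the raw counts, and the four categories should add up to the number of input pairs. A quick sketch (the dictionary keys are just labels):

```python
def trimming_summary(input_pairs, counts):
    """Return each category as a percentage of input_pairs, rounded to 2 dp."""
    if sum(counts.values()) != input_pairs:
        raise ValueError("categories do not add up to the input pair count")
    return {k: round(100 * v / input_pairs, 2) for k, v in counts.items()}


# Counts taken from the Trimmomatic output above
summary = trimming_summary(
    11942310,
    {"both": 491434, "forward_only": 47105,
     "reverse_only": 19382, "dropped": 11384389},
)
print(summary)
```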

### LMAT

Since Kraken didn't do well on the individual run I decided to give LMAT a shot since it has much more breadth in terms of search space. Just followed the tutorial off the bat to see what comes of this.

https://sourceforge.net/p/lmat/wiki/Example%20LMAT%20Run/

Example input data once the reads are merged together and bases with quality below 10 are masked:

-bash-4.1$ head all.non-rRNA.deconseq_clean.merged.q10.fasta
>HWI-ST911:229:HAAULADXX:2:2205:15612:20318 1:N:0:CGCTGT
GATCGGAAGAGCACACGTCTGAACTNCAGTCACCGATGTATCTNNTATGCCGTCTTCNGNTNNAAAAAAAANNNNTAANANGTTNNNNNGACNNNANNNTNNGNNANGNNNAGGGGNNNGNNNNGGNANNGNNNNNNATTNCGNNNNCNNNNGNNNNTTTAAAAAANNNNNNANCNACAGCNCAANTTGANNNNNTNGANNNC
>HWI-ST911:252:H2V3NBCXX:1:1111:8969:23340 1:N:0:ACAGTG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACGAGCACACCCACTACGACCCGAAAAATTGTCAACCACACCCCAGCAGGCCCAGATCTGAATATCATACCACCTAAANGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAGCGACAAGGCAATCACGCCCCGTGGCCCCCATTGAATGACAAGAAAGGCCTGTCCATAACATAAAAGACCGGATTACAGGCT
>HWI-ST911:229:HAAULADXX:1:1105:13220:76584 1:N:0:TTAGGC
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTGNAAANANANNNNANANANNCGNTNNTAGNNNNNNTNATNGANNNNNANANNNNNCNNNCGNGGAANNNNAGNNNNNNCGGGNNNNNNNNNNNCNNNTNAAAAAANAAANNNNNNNNNNNNNCNNCNTCNNNATNNNANNN
>HWI-ST911:252:H2V3NBCXX:2:1107:14691:5102 1:N:0:ACAGTG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACATACACAAACTAACAAATAAGAATCAAACATAAAAACACAACAATAACATAAAAAACAAACAAAAAAAAAAAACCANGATCGGAAGAGCGTCGTGGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAACAAAAGAAAAAAACAGCGACAATATACAAGAAACACAAAGAATCTAGCATAAAAACAAGCAGAATCTACACACACACGAAG
>HWI-ST911:229:HAAULADXX:1:2114:10038:59705 1:N:0:ATCACG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAANNNNNNNNANNNNNGNNNAAGGNNCNGANGNTCNGNANANNGNNNNNTTGGGGAANNNNGNNNNNNNCNGNNGGNNNNNNNNNCNTNNAAAAAAAAAAAANNNANANNNNNAGNNNNNNNNNNANNCNNN

If a read contains a lot of N's, it was most likely marked as "ReadTooShort".
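To see how badly a masked read is affected, it helps to measure the N fraction and the longest unmasked stretch. This sketch only measures; the exact cutoff LMAT uses for "ReadTooShort" isn't stated here, so none is assumed. Function names are illustrative.

```python
import re


def n_fraction(seq):
    """Fraction of bases in seq that are N."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def longest_unmasked_run(seq):
    """Length of the longest stretch of A/C/G/T without an N."""
    runs = re.findall(r"[ACGT]+", seq.upper())
    return max(map(len, runs), default=0)
```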

In [None]:
module load lmat
module load seqtk

cd /scratch/at120/brugia-rnaseq/brugia_viral_search/lmat

merge_fastq_reads_with_N_separator.pl all.non-rRNA.deconseq_clean.r1.fastq all.non-rRNA.deconseq_clean.r2.fastq all.non-rRNA.deconseq_clean.merged.fastq

seqtk seq -A -q 10 -n N all.non-rRNA.deconseq_clean.merged.fastq all.non-rRNA.deconseq_clean.merged.q10.fasta

run_rl.sh \
--db_file=/scratch/work/public/gen-data/lmat/runtime_inputs/kML+Human.v4-14.20.g10.db \
--query_file=all.non-rRNA.deconseq_clean.merged.q10.fasta

#### Viral reads from LMAT:

-bash-4.1$ grep -i 'virus' all.non-rRNA.deconseq_clean.merged.q10.fasta.kML+Human.v4-14.20.g10.db.lo.rl_output.0.30.fastsummary.species

Average Read Score      Total Read Score        Read Count      TaxID   Name    Strain Info
1.5235  6.09389 4       12321   Alfalfa mosaic virus
0.7169  2.15073 3       11830   Murine osteosarcoma virus
0.6137  0.613687        1       42478   Saccharomyces cerevisiae virus L-BC (La)
0.2391  0.239123        1       40051   Bluetongue virus        0.239123        1       35328   no rank,Bluetongue virus 2
0.0869  0.0868627       1       11103   Hepatitis C virus       0.0868627       1       31647   no rank,Hepatitis C virus subtype 1b
0.0427  0.0854622       2       39420   Feldmannia species virus
0.0646  0.0645626       1       622416  Avian paramyxovirus 7
0.0576  0.0576421       1       11591   Uukuniemi virus
0.0576  0.0576421       1       36772   Subterranean clover stunt virus
0.0282  0.0564636       2       687367  Torque teno virus 28
0.0192  0.0384451       2       1285600 Nile crocodilepox virus
0.0378  0.0377747       1       12287   Flock house virus
0.0372  0.0371636       1       46607   Andes virus     0.0371636       1       69245   no rank,Lechiguanas virus
0.0331  0.0330648       1       1235314 Megavirus lba
0.0158  0.0157732       1       1330520 Enterovirus F   0.0157732       1       269638  no rank,Bovine enterovirus type 2
0.0035  0.00348104      1       12461   Hepatitis E virus

#### LMAT classification stats

ReadTooShort    1364143
NoDbHits        64385
LowScore        193359

### Kraken

For sake of comparison and thoroughness I'm running Kraken to see if it can pick up any other viruses.

In [None]:
module load kraken

cd /scratch/at120/brugia-rnaseq/brugia_viral_search/kraken-data

kraken \
--db /scratch/at120/shared/db/kraken/standard \
--fastq-input \
--threads 12 \
--output all.non-rRNA.deconseq_clean.kraken.out \
--preload \
--unclassified-out all.non-rRNA.deconseq_clean.kraken.unclassified.fastq \
all.non-rRNA.deconseq_clean.fastq

23884620 sequences (2670.89 Mbp) processed in 2176.171s (658.5 Kseq/m, 73.64 Mbp/m).
10829641 sequences classified (45.34%)
13054979 sequences unclassified (54.66%)

#### Kraken viruses

There is a lot of overlap with the unfiltered data, though the counts are significantly lower.

d__Viruses|f__Microviridae|g__Microvirus|s__Enterobacteria_phage_phiX174_sensu_lato	2227
d__Viruses|o__Picornavirales|f__Dicistroviridae|s__Formica_exsecta_virus_1	869
d__Viruses|f__Partitiviridae|g__Betapartitivirus|s__Pleurotus_ostreatus_virus_1	290
d__Viruses|f__Partitiviridae|g__Betapartitivirus|s__Hop_trefoil_cryptic_virus_2	287
d__Viruses|f__Partitiviridae|s__Dill_cryptic_virus_1	259
d__Viruses|s__Saccharomyces_cerevisiae_killer_virus_M1	190
d__Viruses|f__Flaviviridae|g__Hepacivirus|s__Hepatitis_C_virus	187
d__Viruses|f__Partitiviridae|g__Betapartitivirus|s__Dill_cryptic_virus_2	142
d__Viruses|o__Tymovirales|f__Tymoviridae|g__Tymovirus|s__Dulcamara_mottle_virus	139
d__Viruses|f__Partitiviridae|g__Alphapartitivirus|s__Rosellinia_necatrix_partitivirus_2	127
d__Viruses|f__Partitiviridae|g__Betapartitivirus|s__Red_clover_cryptic_virus_2	122
d__Viruses|f__Partitiviridae|g__Alphapartitivirus|s__Vicia_cryptic_virus	122
d__Viruses|f__Polydnaviridae|g__Bracovirus|s__Cotesia_congregata_bracovirus	108
d__Viruses|f__Partitiviridae|g__Betapartitivirus|s__White_clover_cryptic_virus_2	79
d__Viruses|f__Partitiviridae|g__Betapartitivirus|s__Primula_malacoides_virus_1	66
d__Viruses|s__Pandoravirus_dulcis	65
d__Viruses|o__Caudovirales|f__Myoviridae|s__Enterobacteria_phage_4MG	65
d__Viruses|f__Partitiviridae|s__Red_clover_cryptic_virus_1	62
d__Viruses|o__Picornavirales|f__Picornaviridae|g__Senecavirus|s__Senecavirus_A	56
d__Viruses|o__Herpesvirales|f__Herpesviridae|s__Caviid_herpesvirus_2	49
d__Viruses|f__Anelloviridae|g__Gammatorquevirus|s__Torque_teno_midi_virus_2	45
d__Viruses|f__Astroviridae|g__Mamastrovirus|s__Porcine_astrovirus_2	40
d__Viruses|o__Picornavirales|f__Picornaviridae|s__Fathead_minnow_picornavirus	32
d__Viruses|o__Caudovirales|f__Myoviridae|g__Spounalikevirus|s__Bacillus_phage_SPO1	27
d__Viruses|f__Baculoviridae|g__Deltabaculovirus|s__Culex_nigripalpus_nucleopolyhedrovirus	26
d__Viruses|o__Caudovirales|f__Myoviridae|s__Cronobacter_phage_vB_CsaM_GAP31	24
d__Viruses|f__Bromoviridae|g__Bromovirus|s__Spring_beauty_latent_virus	22
d__Viruses|s__Pandoravirus_salinus	19
d__Viruses|o__Caudovirales|f__Myoviridae|s__Salmonella_phage_PVP-SE1	18
d__Viruses|f__Retroviridae|g__Gammaretrovirus|s__Murine_osteosarcoma_virus	16
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Vibrio_phage_SHOU24	15
d__Viruses|f__Phycodnaviridae|s__Aureococcus_anophagefferens_virus	12
d__Viruses|o__Herpesvirales|f__Alloherpesviridae|g__Cyprinivirus|s__Cyprinid_herpesvirus_2	11
d__Viruses|o__Tymovirales|f__Betaflexiviridae|g__Carlavirus|s__Garlic_common_latent_virus	10
d__Viruses|o__Tymovirales|f__Alphaflexiviridae|g__Sclerodarnavirus|s__Sclerotinia_sclerotiorum_debilitation-associated_RNA_virus	10
d__Viruses|o__Herpesvirales|s__Abalone_herpesvirus_Victoria/AUS/2009	10
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Macavirus|s__Bovine_herpesvirus_6	10
d__Viruses|s__Phytophthora_infestans_RNA_virus_1	9
d__Viruses|g__Tenuivirus|s__Rice_grassy_stunt_virus	9
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Simplexvirus|s__Macacine_herpesvirus_1	8
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Roseolovirus|s__Human_herpesvirus_6B	8
d__Viruses|f__Retroviridae|s__Human_endogenous_retrovirus_K	8
d__Viruses|f__Partitiviridae|s__Sclerotinia_sclerotiorum_partitivirus_S	8
d__Viruses|f__Papillomaviridae|g__Lambdapapillomavirus|s__Lambdapapillomavirus_2	8
d__Viruses|f__Orthomyxoviridae|g__Influenzavirus_A|s__Influenza_A_virus	8
d__Viruses|f__Bromoviridae|g__Alfamovirus|s__Alfalfa_mosaic_virus	8
d__Viruses|o__Caudovirales|s__Sinorhizobium_phage_PBC5	7
d__Viruses|f__Virgaviridae|g__Tobamovirus|s__Hibiscus_latent_Singapore_virus	7
d__Viruses|f__Nudiviridae|s__Oryctes_rhinoceros_nudivirus	7
d__Viruses|f__Baculoviridae|g__Alphabaculovirus|s__Choristoneura_rosaceana_alphabaculovirus	7
d__Viruses|s__Gentian_ovary_ring-spot_virus	6
d__Viruses|o__Picornavirales|f__Picornaviridae|g__Kobuvirus|s__Mouse_kobuvirus_M-5/USA/2010	6
d__Viruses|o__Picornavirales|f__Picornaviridae|g__Cardiovirus|s__Encephalomyocarditis_virus	6
d__Viruses|o__Herpesvirales|f__Alloherpesviridae|g__Cyprinivirus|s__Cyprinid_herpesvirus_1	6
d__Viruses|f__Togaviridae|g__Alphavirus|s__Chikungunya_virus	6
d__Viruses|f__Adenoviridae|g__Aviadenovirus|s__Fowl_aviadenovirus_B	6
d__Viruses|s__Andrographis_yellow_vein_leaf_curl_betasatellite	5
d__Viruses|o__Picornavirales|f__Dicistroviridae|s__Mud_crab_dicistrovirus	5
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Mardivirus|s__Gallid_herpesvirus_2	5
d__Viruses|o__Picornavirales|f__Picornaviridae|s__Eel_picornavirus_1	4
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Simplexvirus|s__Cercopithecine_herpesvirus_2	4
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Muromegalovirus|s__Murid_herpesvirus_8	4
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Mardivirus|s__Gallid_herpesvirus_3	4
d__Viruses|f__Bunyaviridae|g__Tospovirus|s__Impatiens_necrotic_spot_virus	4
d__Viruses|f__Adenoviridae|g__Mastadenovirus|s__Human_mastadenovirus_C	4
d__Viruses|f__Adenoviridae|g__Aviadenovirus|s__Fowl_aviadenovirus_E	4
d__Viruses|s__Vernonia_yellow_vein_Fujian_virus_betasatellite	3
d__Viruses|s__Jingmen_tick_virus	3
d__Viruses|s__Alternaria_alternata_virus_1	3
d__Viruses|s__Ageratum_yellow_vein_Singapore_alphasatellite	3
d__Viruses|o__Picornavirales|s__Carp_picornavirus_1	3
d__Viruses|o__Herpesvirales|f__Alloherpesviridae|g__Cyprinivirus|s__Cyprinid_herpesvirus_3	3
d__Viruses|o__Herpesvirales|f__Alloherpesviridae|g__Cyprinivirus|s__Anguillid_herpesvirus_1	3
d__Viruses|o__Caudovirales|f__Podoviridae|g__T7likevirus|s__Enterobacteria_phage_T7	3
d__Viruses|f__Polydnaviridae|g__Ichnovirus|s__Glypta_fumiferanae_ichnovirus	3
d__Viruses|f__Phycodnaviridae|g__Prymnesiovirus|s__Phaeocystis_globosa_virus	3
d__Viruses|f__Iridoviridae|g__Iridovirus|s__Invertebrate_iridescent_virus_6	3
d__Viruses|f__Flaviviridae|g__Flavivirus|s__Tick-borne_encephalitis_virus	3
d__Viruses|f__Bunyaviridae|g__Orthobunyavirus|s__Simbu_virus	3
d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Choristoneura_occidentalis_granulovirus	3
d__Viruses|f__Baculoviridae|g__Alphabaculovirus|s__Spodoptera_litura_nucleopolyhedrovirus	3
d__Viruses|s__Posavirus_1	2
d__Viruses|s__McMurdo_Ice_Shelf_pond-associated_circular_DNA_virus-8	2
d__Viruses|s__Halovirus_HRTV-8	2
d__Viruses|o__Nidovirales|f__Coronaviridae|g__Bafinivirus|s__White_bream_virus	2
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Varicellovirus|s__Bovine_herpesvirus_1	2
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Simplexvirus|s__Papiine_herpesvirus_2	2
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Rhadinovirus|s__Human_herpesvirus_8	2
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Lactococcus_phage_phiL47	2
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Lactococcus_phage_phi7	2
d__Viruses|o__Caudovirales|f__Myoviridae|s__Bacillus_phage_CampHawk	2
d__Viruses|o__Caudovirales|f__Myoviridae|g__Phikzlikevirus|s__Pseudomonas_phage_EL	2
d__Viruses|g__Emaravirus|s__Rose_rosette_virus	2
d__Viruses|f__Totiviridae|g__Totivirus|s__Saccharomyces_cerevisiae_virus_L-BC_(La)	2
d__Viruses|f__Retroviridae|g__Alpharetrovirus|s__Rous_sarcoma_virus	2
d__Viruses|f__Poxviridae|g__Capripoxvirus|s__Sheeppox_virus	2
d__Viruses|f__Polyomaviridae|g__Polyomavirus|s__Simian_virus_40	2
d__Viruses|f__Phycodnaviridae|g__Chlorovirus|s__Paramecium_bursaria_Chlorella_virus_NY2A	2
d__Viruses|f__Phycodnaviridae|g__Chlorovirus|s__Acanthocystis_turfacea_Chlorella_virus_1	2
d__Viruses|f__Partitiviridae|s__Rhizoctonia_solani_dsRNA_virus_2	2
d__Viruses|f__Papillomaviridae|g__Alphapapillomavirus|s__Alphapapillomavirus_3	2
d__Viruses|f__Nudiviridae|s__Heliothis_zea_nudivirus	2
d__Viruses|f__Nodaviridae|g__Alphanodavirus|s__Black_beetle_virus	2
d__Viruses|f__Microviridae|g__Microvirus|s__Enterobacteria_phage_G4_sensu_lato	2
d__Viruses|f__Hepeviridae|s__Bat_hepevirus	2
d__Viruses|f__Chrysoviridae|g__Chrysovirus|s__Amasya_cherry_disease_associated_chrysovirus	2
d__Viruses|f__Caulimoviridae|g__Badnavirus|s__Sweet_potato_badnavirus_A	2
d__Viruses|f__Caliciviridae|s__St-Valerien_swine_virus	2
d__Viruses|f__Bunyaviridae|g__Tospovirus|s__Melon_yellow_spot_virus	2
d__Viruses|f__Adenoviridae|g__Aviadenovirus|s__Fowl_aviadenovirus_A	2
d__Viruses|s__Tomato_yellow_leaf_curl_Vietnam_betasatellite	1
d__Viruses|s__Tomato_begomovirus_satellite_DNA_beta	1
d__Viruses|s__Tobacco_leaf_curl_disease_associated_sequence	1
d__Viruses|s__Papaya_leaf_curl_alphasatellite	1
d__Viruses|s__Croton_yellow_vein_mosaic_alphasatellite	1
d__Viruses|o__Tymovirales|s__Sclerotinia_sclerotiorum_debilitation-associated_RNA_virus_2	1
d__Viruses|o__Picornavirales|f__Secoviridae|g__Nepovirus|s__Grapevine_Bulgarian_latent_virus	1
d__Viruses|o__Picornavirales|f__Picornaviridae|g__Enterovirus|s__Enterovirus_H	1
d__Viruses|o__Picornavirales|f__Picornaviridae|g__Cardiovirus|s__Theilovirus	1
d__Viruses|o__Picornavirales|f__Picornaviridae|g__Aquamavirus|s__Aquamavirus_A	1
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Roseolovirus|s__Human_herpesvirus_7	1
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Roseolovirus|s__Human_herpesvirus_6A	1
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Mardivirus|s__Falconid_herpesvirus_1	1
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Macavirus|s__Ovine_herpesvirus_2	1
d__Viruses|o__Herpesvirales|f__Herpesviridae|g__Cytomegalovirus|s__Saimiriine_herpesvirus_4	1
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Propionibacterium_phage_PHL114L00	1
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Propionibacterium_phage_PHL113M01	1
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Propionibacterium_phage_P101A	1
d__Viruses|o__Caudovirales|f__Siphoviridae|s__Propionibacterium_phage_P100_A	1
d__Viruses|o__Caudovirales|f__Siphoviridae|g__Lambdalikevirus|s__Enterobacterial_phage_mEp390	1
d__Viruses|o__Caudovirales|f__Siphoviridae|g__Lambdalikevirus|s__Enterobacteria_phage_lambda	1
d__Viruses|o__Caudovirales|f__Myoviridae|s__Streptococcus_phage_EJ-1	1
d__Viruses|g__Bacilladnavirus|s__Chaetoceros_lorenzianus_DNA_virus	1
d__Viruses|f__Virgaviridae|g__Tobamovirus|s__Hibiscus_latent_Fort_Pierce_virus	1
d__Viruses|f__Virgaviridae|g__Hordeivirus|s__Barley_stripe_mosaic_virus	1
d__Viruses|f__Togaviridae|g__Alphavirus|s__Ndumu_virus	1
d__Viruses|f__Tectiviridae|g__Tectivirus|s__Bacillus_phage_Wip1	1
d__Viruses|f__Poxviridae|g__Capripoxvirus|s__Goatpox_virus	1
d__Viruses|f__Poxviridae|g__Betaentomopoxvirus|s__Choristoneura_biennis_entomopoxvirus_'L'	1
d__Viruses|f__Poxviridae|g__Betaentomopoxvirus|s__Amsacta_moorei_entomopoxvirus_'L'	1
d__Viruses|f__Potyviridae|g__Potyvirus|s__Wild_potato_mosaic_virus	1
d__Viruses|f__Potyviridae|g__Potyvirus|s__Chilli_ringspot_virus	1
d__Viruses|f__Partitiviridae|g__Alphapartitivirus|s__Beet_cryptic_virus_1	1
d__Viruses|f__Orthomyxoviridae|g__Isavirus|s__Infectious_salmon_anemia_virus	1
d__Viruses|f__Mimiviridae|s__Moumouvirus	1
d__Viruses|f__Microviridae|g__Microvirus|s__Enterobacteria_phage_ID2_Moscow/ID/2001	1
d__Viruses|f__Microviridae|g__Microvirus|s__Enterobacteria_phage_ID18_sensu_lato	1
d__Viruses|f__Iridoviridae|g__Iridovirus|s__Invertebrate_iridescent_virus_31	1
d__Viruses|f__Hytrosaviridae|g__Glossinavirus|s__Glossina_hytrovirus	1
d__Viruses|f__Hypoviridae|g__Hypovirus|s__Cryphonectria_hypovirus_1	1
d__Viruses|f__Chrysoviridae|g__Chrysovirus|s__Magnaporthe_oryzae_chrysovirus_1	1
d__Viruses|f__Bunyaviridae|g__Tospovirus|s__Tomato_spotted_wilt_virus	1
d__Viruses|f__Bunyaviridae|g__Orthobunyavirus|s__Shamonda_virus	1
d__Viruses|f__Bunyaviridae|g__Orthobunyavirus|s__Brazoran_virus	1
d__Viruses|f__Baculoviridae|g__Betabaculovirus|s__Cryptophlebia_leucotreta_granulovirus	1
d__Viruses|f__Baculoviridae|g__Alphabaculovirus|s__Autographa_californica_multiple_nucleopolyhedrovirus	1

### Blast

Last but not least, a BLAST run to see if the k-mer-based classifiers are missing anything.

In [None]:
module load blast+

cd /scratch/at120/brugia-rnaseq/brugia_viral_search/blast

blastn \
-num_threads 12 \
-db /scratch/at120/shared/db/blast/nt/nt \
-query all.non-rRNA.deconseq_clean.fasta \
-out all.non-rRNA.deconseq_clean.blast.xml \
-outfmt 5 \
-max_target_seqs 1 \
-culling_limit 2 \
-evalue 0.05

Going to blast the trimmed data as well (the only relevant data).

In [None]:
module load blast+ 
module load bioperl

cd /scratch/at120/brugia-rnaseq/brugia_viral_search/blast/trimmed

perl ../../scripts/1line-fasta.pl all-filtered-trimmed.fastq > all-filtered-trimmed.fasta

blastn \
-num_threads 20 \
-db /scratch/at120/shared/db/blast/nt/nt \
-query all-filtered-trimmed.fasta \
-out all-filtered-trimmed.blast.xml \
-outfmt 5 \
-max_target_seqs 1 \
-culling_limit 2 \
-evalue 0.05
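Once the XML output (`-outfmt 5`) exists, the top hit per read can be pulled out with the standard library. A sketch that assumes the usual BLAST XML layout (`Iteration`, `Iteration_query-def`, `Hit_def`, `Hsp_evalue`); the function names and the "virus" keyword filter are my own choices, and for output this large a streaming parser would likely be needed instead of loading the whole tree.

```python
import xml.etree.ElementTree as ET


def top_hits(root):
    """Yield (query, hit description, e-value) for the first hit of each query."""
    for iteration in root.iter("Iteration"):
        hit = iteration.find("Iteration_hits/Hit")
        if hit is None:
            continue  # query had no hits
        query = iteration.findtext("Iteration_query-def")
        evalue = float(hit.findtext("Hit_hsps/Hsp/Hsp_evalue"))
        yield query, hit.findtext("Hit_def"), evalue


def viral_hits(root, keyword="virus"):
    """Keep only hits whose description mentions the keyword."""
    return [h for h in top_hits(root) if keyword in h[1].lower()]
```

For a file on disk, `root` would be `ET.parse("all-filtered-trimmed.blast.xml").getroot()`.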

### Weird looking reads

Almost all of the r1 reads have a conserved sequence at the 5' end followed by a poly(A) stretch, as seen below. I'm assuming there's some adapter read-through here, which may explain the majority of the "synthetic construct" classifications. It may also explain why so few reads are classified as viral (since I'm assuming it's assigning one match per read, which only makes sense), but then again there shouldn't be a poly(A) in a virus, right?

-bash-4.1$ head all.non-rRNA.deconseq_clean.r?.fastq
==> all.non-rRNA.deconseq_clean.r1.fastq <==
@HWI-ST911:229:HAAULADXX:2:2205:15612:20318 1:N:0:CGCTGT
<span class="burk">GATCGGAAGAGCACACGTCTGAACTACAGTCACCGATGTATCTCGTATGCCGTCTTCTGATTGAAAAAAAAAAA</span>CTAACAAGTTGACGGGACAAAAAAATA
+
B<0<BB<BFFBFFB<<BB<FF<BBB'BBF7FFBBFFBFFI<B0''00<F7BB<0BB<'7'0''0770B777''''00<'0'00<'''''070'''0'''0'
@HWI-ST911:252:H2V3NBCXX:1:1111:8969:23340 1:N:0:ACAGTG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAACGAGCACACCCACTACGACCCGAAAAATTGTCAACCACACCCCAGCAGGCCCAGATCTGAATATCATACCACCTAAA
+
GAAGAGIGGGGGAGGGGGGGAGAGGGGGG<A.A.GGGGGG<G.GAG<<GGI<<GAAGGAGAA.AGGGAAAGAG.<...<<<GA.<<<<G.<AA.<.<<......<.<<<<G<...<...<....<...<.<..<<<A..<.<...77..<.
@HWI-ST911:229:HAAULADXX:1:1105:13220:76584 1:N:0:TTAGGC
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTGAAAACAAAAAGTAGAAAAGCGTTTATAGCGACGATGAT

==> all.non-rRNA.deconseq_clean.r2.fastq <==
@HWI-ST911:229:HAAULADXX:2:2205:15612:20318 2:N:0:CGCTGT
GAAATGCAGAGGGGGGGGGAGGGGAAGAGGGGATTATTCCGGTGGCGCGCGGGATTTT<span class="burk">AAAAAAAAAAAAA</span>GCAACAGCACAAGTTGATTATGTCGACAAC
+
0''0'0'''00007'''7''''00'0''0''''''007'00''''0''''0''''0007<7<<<''''''0'0'07077'000'0707'''''0'00'''0
@HWI-ST911:252:H2V3NBCXX:1:1111:8969:23340 2:N:0:ACAGTG
GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAGCGACAAGGCAATCACGCCCCGTGGCCCCCATTGAATGACAAGAAAGGCCTGTCCATAACATAAAAGACCGGATTACAGGCT
+
AGAGAGGAA<AAG.<A.AAG.AA.<AA<AGGGGGGGGIGGAA.GGG.AG<.GGGAGGAGGAG.G.....<..<...<.....<..<..<<.<.<...<...........<.....<.<.<...<...<.....<..<........77..7.
@HWI-ST911:229:HAAULADXX:1:1105:13220:76584 2:N:0:TTAGGC
GAAAAGAAAAGCGGCCTCCCGGGGAACAAAAGCAAACTCGGGGGGCCCCGTGACCGTTAAAAAAACAAAACATAGGCGAATTCTACTTCCACATGATAAGA

These are definitely adapters. The Illumina adaptor is:
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[barcode]ATCTCGTATGCCGTCTTCTGCTTG
and that's what we see.
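A quick way to flag these reads is to search for the adaptor with the barcode as a wildcard. This sketch assumes a 6 bp barcode (as in the read headers above) and exact matching only, so reads with sequencing errors inside the adaptor are missed; the function name is made up.

```python
import re

# Adaptor with a 6 bp barcode captured in the middle (length is an assumption)
ADAPTOR_RE = re.compile(
    r"GATCGGAAGAGCACACGTCTGAACTCCAGTCAC([ACGTN]{6})ATCTCGTATGCCGTCTTCTGCTTG"
)


def find_adaptor(seq):
    """Return (start position, barcode) of the adaptor, or None if absent."""
    match = ADAPTOR_RE.search(seq)
    if match is None:
        return None
    return match.start(), match.group(1)
```

For the second r1 read shown above this should report position 0 with barcode `ACAGTG`, which matches the index in its header.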
