# CoV Pan-genome test alignments
```
Lead     : Ababaian
Issue    : #27
start    : 2020 04 07
complete : 2020 04 07
files    : s3://serratus-public/notebook/200407/
```

## Introduction
We have an initial version of the CoV pan-genome, `cov0r`. As an initial sanity check, align a known CoV-negative transcriptome (cat RNA-seq), a genome (human WGS), and a CoV+ meta-transcriptome against `cov0r` to see how well the alignments 'stick'. More importantly, measure how much bleed there is from a mammalian transcriptome into the `cov0r` pan-genome.


## Materials and Methods


### Sequence Accessions/Sources

- CoV Index: `s3://serratus-public/seq/cov0/cov0r.fa`
- Human WGS reads: `s3://serratus-public/test-data/fq/NA12878.1.fq.gz`
- Cat transcriptome: `fasterq-dump SRR6639048`
- SARS-CoV-2 patient meta-transcriptome: `fasterq-dump SRR11454613`

### Test alignments

```bash
## C4.large Instance
## in serratus-align v0.2 container

# Magic-BLAST alignments
GENOME='cov0r'
ALG_ARG='-splice F -no_unaligned -max_db_word_count 1000000'

## Human WGS
magicblast -infmt fastq $ALG_ARG \
  -query NA12878.1.fq.gz \
  -db $GENOME | \
  samtools view -b - > NA12878.mb.bam

## Cat RNAseq
magicblast -infmt fastq $ALG_ARG \
  -query SRR6639048.fastq \
  -db $GENOME | \
  samtools view -b - > SRR6639048.mb.bam

## CoV+
magicblast -infmt fastq $ALG_ARG -paired \
  -query SRR11454613_1.fastq -query_mate SRR11454613_2.fastq \
  -db $GENOME | \
  samtools view -b - > SRR11454613.mb.bam
```


```bash
## C4.large Instance
## in serratus-align v0.2 container

# Bowtie2 alignments
GENOME='cov0r'
# Note: the commands below expand $BT2_ARG, so the preset must be stored
# under that name; if it is unset, bowtie2 falls back to its default
# end-to-end mode.
BT2_ARG='--very-sensitive-local'

## Human WGS
bowtie2 $BT2_ARG \
  -x $GENOME \
  -U NA12878.1.fq.gz | \
  samtools view -bS - > NA12878.bt2.bam

## Cat RNAseq
bowtie2 $BT2_ARG \
  -x $GENOME \
  -U SRR6639048.fastq | \
  samtools view -bS - > SRR6639048.bt2.bam

## CoV+
bowtie2 $BT2_ARG \
  -x $GENOME \
  -1 SRR11454613_1.fastq -2 SRR11454613_2.fastq | \
  samtools view -bS - > SRR11454613.bt2.bam
```

## Flagstat output from alignments

### Human WGS

```
NA12878.bt2.bam
840909 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
206 + 0 mapped (0.02% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

NA12878.mb.bam
0 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (N/A : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
```

### Cat RNAseq
```
SRR6639048.bt2.bam
31886603 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
169 + 0 mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

SRR6639048.mb.bam
217 + 0 in total (QC-passed reads + QC-failed reads)
47 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
217 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
```

### CoV metaRNAseq
```
SRR11454613.bt2.bam
10837378 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
451808 + 0 mapped (4.17% : N/A)
10837378 + 0 paired in sequencing
5418689 + 0 read1
5418689 + 0 read2
263554 + 0 properly paired (2.43% : N/A)
404092 + 0 with itself and mate mapped
47716 + 0 singletons (0.44% : N/A)
131098 + 0 with mate mapped to a different chr
196 + 0 with mate mapped to a different chr (mapQ>=5)

SRR11454613.mb.bam
61615 + 0 in total (QC-passed reads + QC-failed reads)
42263 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
61615 + 0 mapped (100.00% : N/A)
19352 + 0 paired in sequencing
10061 + 0 read1
9291 + 0 read2
8266 + 0 properly paired (42.71% : N/A)
8296 + 0 with itself and mate mapped
11056 + 0 singletons (57.13% : N/A)
70 + 0 with mate mapped to a different chr
70 + 0 with mate mapped to a different chr (mapQ>=5)
```
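
Because the libraries differ widely in depth, the bleed is easier to compare as a mapped fraction than as a raw count. The following is a minimal sketch and not part of the original analysis: the helper name `flagstat_mapped_fraction` is hypothetical, and it assumes the standard `samtools flagstat` text layout shown above.

```shell
# Hypothetical helper: read `samtools flagstat` text on stdin and print
# "<mapped>/<total> (<percent>%)".
flagstat_mapped_fraction() {
  awk '
    / in total /          { total  = $1 }             # "N + 0 in total (...)"
    / mapped \(/ && !seen { mapped = $1; seen = 1 }   # first "N + 0 mapped (...)" line only
    END {
      if (total > 0) printf "%d/%d (%.4f%%)\n", mapped, total, 100 * mapped / total
      else           printf "0/0 (N/A)\n"
    }'
}

# Usage (assumes samtools is on PATH and the BAM still holds unaligned reads):
#   samtools flagstat NA12878.bt2.bam | flagstat_mapped_fraction
```

This is only meaningful for the Bowtie2 BAM files before filtering. The Magic-BLAST runs use `-no_unaligned`, so their flagstat totals contain mapped reads only and the input read count would have to come from the FASTQ instead.
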



```bash
# Convert to only mapped reads and their pairs
# for each library
samtools view -b -F 4 NA12878.bt2.bam > NA12878.bt.bam
mv NA12878.bt.bam NA12878.bt2.bam

samtools view -b -F 4 SRR6639048.bt2.bam > SRR6639048.bt.bam
mv SRR6639048.bt.bam SRR6639048.bt2.bam

samtools view -b -G 12 SRR11454613.bt2.bam > SRR11454613.bt.bam
mv SRR11454613.bt.bam SRR11454613.bt2.bam
```
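
The single-end libraries are filtered with `-F 4` and the paired library with `-G 12`, and the difference is what keeps "their pairs". `-F 4` drops any record with the unmapped bit (4) set, while `-G 12` drops a record only when both the read-unmapped (4) and mate-unmapped (8) bits are set. The sketch below reproduces that bit logic in plain shell arithmetic for three example FLAG values. The function names are hypothetical and nothing here calls samtools.

```shell
# Exit 0 (keep) if a record with this FLAG survives `samtools view -F 4`:
# it is kept only when bit 4 (read unmapped) is clear.
keeps_F4()  { [ $(( $1 & 4 )) -eq 0 ]; }

# Exit 0 (keep) if a record survives `samtools view -G 12`:
# it is dropped only when bits 4 AND 8 (read and mate unmapped) are both set.
keeps_G12() { [ $(( $1 & 12 )) -ne 12 ]; }

# 73  = paired, mate unmapped, read1         (read itself is mapped)
# 133 = paired, read unmapped, read2         (mate is mapped)
# 77  = paired, read and mate unmapped, read1
for flag in 73 133 77; do
  f4=drop;  keeps_F4  "$flag" && f4=keep
  g12=drop; keeps_G12 "$flag" && g12=keep
  printf 'FLAG %3d  -F 4: %s  -G 12: %s\n' "$flag" "$f4" "$g12"
done
```

FLAG 133 is the case that matters: the unmapped mate of a mapped read is removed by `-F 4` but retained by `-G 12`.
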

```bash
# In dir with only bam files
aws s3 cp --recursive ./ s3://serratus-public/notebook/20200407/
```

## Results & Discussion

Sources of error identified in the analysis:

### KC786228.1

- BT2 aligned 206 reads from human WGS to `cov0` that MB did not. All but one of these reads mapped to `KC786228.1`, and in the human genome the same sequence maps to ribosomal DNA. This is most likely an rDNA contaminant in accession `KC786228.1: UNVERIFIED: Infectious bronchitis virus isolate YN/V90/2012 nucleocapsid protein-like (N) gene, partial sequence`. Recommend removing this accession.

```
ERR194147.505740963     0       KC786228.1      94      0       101M    *       0       0       ACGGGTTACCCGCGCCTGCCGGCGTAGGGTAGGCACACGCTGAGCCAGTCAGTGTAGCGCGCGTGCAGCCCCGGACATCTAAGGGCATCACAGACCTGTTA   CCCFFDFFHHHHHIIJJJJJIIJIGIIJJ@FEHIJIHHGFFDDDEDDDCDDDCC@DDDDDDDDBDDDDDDDDDDB<@CCCDDDDDDDDDDDDDDDDDD@C@   AS:i:-57        XN:i:0  XM:i:11 XO:i:0  XG:i:0  NM:i:11 MD:Z:0G23G1T5A1T2A18A0T27T6C5C2 YT:Z:UU
ERR194147.505741305     0       KC786228.1      94      0       101M    *       0       0       ACGGGTTACCCGCGCCTGCCGGCGTAGGGTAGGCACACGCTGAGCCAGTCAGTGTAGCGCGCGTGCAGCCCCGGACATCTAAGGGCATCACAGACCTGTTA   CCCFFFFFHHHHHJJJJJJJJJIJHIJJJFHIIEHGHHHFFDDDDDDDDDDDDDDFDDDDDDDBDDDDCABDDDB9>CCDDDDDDDDDCCDDDDDDDDCDC   AS:i:-57        XN:i:0  XM:i:11 XO:i:0  XG:i:0  NM:i:11 MD:Z:0G23G1T5A1T2A18A0T27T6C5C2 YT:Z:UU
ERR194147.505741455     0       KC786228.1      97      0       101M    *       0       0       GGTTACCCGCGCCTGCCGGCGTAGGGTAGGCACACGCTGAGCCAGTCAGTGTAGCGCGCGTGCAGCCCCGGACATCTAAGGGCATCACAGACCTGTTATTG   @@CFFFFFHDHGHJJJIJIFI@:FHHBBCGGGGHGGHEFDEFEEE>CCCCC>B>>==B;B<77A@@BBDBD@B>C@CCDCABBDCCDDDDC?BC<+4::@:   AS:i:-59        XN:i:0  XM:i:13 XO:i:0  XG:i:0  NM:i:13 MD:Z:21G1T5A1T2A18A0T27T6C5C2C0A0A0     YT:Z:UU
ERR194147.612172157     0       KC786228.1      93      0       101M    *       0       0       AACGGGTTACCCGCGCCTGCCGGCGTAGGGTAGGCACACGCTGAGCCAGTCAGTGTAGCGCGCGTGCAGCCCCGGACATCTAAGGGCATCACAGACCTGTT   CCCFFFFFHHHHGJIJJJJJJIIIIBGGJJ@EEIFHHEEEFACDDCDDDCCCDCDDFDDDDDDD@DDDDDDDDDDD>@CCDDDCDDDDDDDD>CCBDCCCD   AS:i:-61        XN:i:0  XM:i:12 XO:i:0  XG:i:0  NM:i:12 MD:Z:0C0G23G1T5A1T2A18A0T27T6C5C1       YT:Z:UU
```
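
A per-reference tally is a quick way to see which accession attracts off-target reads like these. This is a minimal sketch rather than the command used for the analysis: `count_by_reference` is a hypothetical helper that counts SAM records by their reference name (column 3).

```shell
# Hypothetical helper: tally alignments per reference (SAM column 3),
# most frequent first. Reads SAM text on stdin; header lines (starting
# with "@") and unmapped records (RNAME "*") are skipped.
count_by_reference() {
  awk -F '\t' '
    !/^@/ && $3 != "*" { n[$3]++ }
    END { for (r in n) printf "%d\t%s\n", n[r], r }
  ' | sort -k1,1nr -k2,2
}

# Usage (assumes samtools is on PATH):
#   samtools view NA12878.bt2.bam | count_by_reference | head
```
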

### AX191447.1 and AX191449.1

These are labelled as Rat Coronavirus, but both correspond to a patent filing related to rat xylosyltransferase. Exclude from cov0. Reads aligning to these accessions, when searched with BLAT against the human genome, map to `XYLT2`.
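
One way to act on these exclusions is to filter the flagged records out of the FASTA before rebuilding the index. The sketch below is illustrative only: `drop_accessions` and the file names in the usage comment are hypothetical, and it assumes each FASTA header starts with the accession (for example `>KC786228.1 ...`), which matches the reference names seen in the SAM records.

```shell
# Hypothetical helper: drop FASTA records whose ID (first word after ">")
# appears in an exclusion list (one accession per line).
# Usage: drop_accessions exclude.txt < in.fa > out.fa
drop_accessions() {
  awk -v list="$1" '
    BEGIN {
      # Load the exclusion list into a lookup table.
      while ((getline acc < list) > 0) if (acc != "") drop[acc] = 1
      close(list)
    }
    /^>/ { id = substr($1, 2); skip = (id in drop) }  # decide once per record
    !skip                                             # print lines of kept records
  '
}

# Example (file names are placeholders, not paths from this notebook):
#   printf 'KC786228.1\nAX191447.1\nAX191449.1\n' > exclude.txt
#   drop_accessions exclude.txt < cov0r.fa > cov0r.filtered.fa
```
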
