In [1]:
import pandas as pd

# Week 1 Homework

## Question 1. Coverage Analysis 

### Question 1a. How long is the reference genome?

Running the command: samtools faidx ref.fa 
on the reference genome ref.fa will generate the index file ref.fa.fai, which reads: 

Halomonas	233806	11	70	71

The first numeric field here is the length of the genome – 233806 bp.
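
As a minimal sketch, the same length can be read out of the index programmatically. The helper name `fai_lengths` is illustrative only, and the code assumes the usual five-column, tab-separated faidx layout (name, length, offset, bases per line, bytes per line).

```python
def fai_lengths(fai_text):
    """Map sequence name -> sequence length from the text of a .fai index."""
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, length = line.split("\t")[:2]
        lengths[name] = int(length)
    return lengths

# The single index line reported above for ref.fa
index_line = "Halomonas\t233806\t11\t70\t71"
ref_lengths = fai_lengths(index_line)
print(ref_lengths["Halomonas"])  # 233806
```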

### Question 1b. How many reads are provided and how long are they? 

Running the command: fastqc file_name 
for all fq files (that is, frag180.1.fq, frag180.2.fq, jump2k.1.fq, and jump2k.2.fq)

Gives the following information: 
frag180.1.fq contains 35178 sequences that are 100 bp long
frag180.2.fq contains 35178 sequences that are 100 bp long 

jump2k.1.fq contains 70355 sequences that are 50 bp long
jump2k.2.fq contains 70355 sequences that are 50 bp long 

### Question 1c. How much coverage do you expect to have?

The frag180 files contain 35178 reads, 100 bp each. This makes for a total of 3517800 bp. 3517800/233806 = 15.05x coverage

The jump2k files contain 70355 reads, 50 bp each. This makes for a total of 3517750 bp. 3517750/233806 = 15.05x coverage as well. 
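
The arithmetic above is the standard estimate coverage = (number of reads × read length) / genome length. A short sketch, using the read counts exactly as stated in this answer (the function name `expected_coverage` is illustrative):

```python
def expected_coverage(n_reads, read_length, genome_length):
    """Expected average depth: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length

GENOME_LENGTH = 233806  # from ref.fa.fai

frag_cov = expected_coverage(35178, 100, GENOME_LENGTH)
jump_cov = expected_coverage(70355, 50, GENOME_LENGTH)
print(f"frag180: {frag_cov:.2f}x, jump2k: {jump_cov:.2f}x")
# frag180: 15.05x, jump2k: 15.05x
```

The function uses whatever read count is passed in; counting both mate files of a library together would double `n_reads` and therefore the result.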

### Question 1d. Plot the average quality value across the length of the reads

Screenshots are in the week 1 homework directory 

## Question 2. Kmer Analysis 

### Question 2a. How many kmers occur exactly 50 times? 
Running the following: 

jellyfish count -m 21 -C -s 1000000 *.fq 

jellyfish histo mer_counts.jf 

Generates a histogram. From this histogram, 1091 21-mers occur 50 times exactly. 
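
To make the two steps concrete, here is a toy pure-Python sketch of what counting canonical k-mers (the `-C` option) and then building a histogram amounts to. It is an illustration on made-up reads, not how jellyfish is implemented, and the names `canonical` and `kmer_histogram` are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_histogram(sequences, k):
    """Histogram mapping occurrence count -> number of distinct canonical k-mers."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[canonical(seq[i:i + k])] += 1
    return Counter(counts.values())

# Toy reads, invented for illustration only
toy_reads = ["ACGTAC", "GTACGT"]
hist = kmer_histogram(toy_reads, k=3)
print(hist[4])  # 2 distinct canonical 3-mers (ACG and GTA) occur exactly 4 times
```

Answering the question for the real data corresponds to looking up the entry for 50 in the histogram produced with k = 21.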

### Question 2b. What are the top 10 most frequently occurring kmers? 

Running the following script generates the answer: 

jellyfish dump -c  mer_counts.jf | sort -n -r -k 2 | head 

The first part of this pipeline outputs a list of 21-mers and their counts. The sort command orders them numerically by count (column 2), in descending order. The head command keeps the first 10 lines, that is, the 10 most common 21-mers, which are: 

GCCCACTAATTAGTGGGCGCC 105
CGCCCACTAATTAGTGGGCGC 104
CCCACTAATTAGTGGGCGCCG 104
ACGGCGCCCACTAATTAGTGG 101
CAGGCCAGCTTATAAGCTGGC 98
AACAGGCCAGCTTATAAGCTG 98
ACAGGCCAGCTTATAAGCTGG 97
AGGCCAGCTTATAAGCTGGCC 95
AGCATCGCCCACATGTGGGCG 83
GCATCGCCCACATGTGGGCGA 82
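
For comparison, a rough Python equivalent of the `sort -n -r -k 2 | head` step is sketched below. It assumes each input line has the `kmer count` layout shown above; the function name `top_kmers` is illustrative.

```python
def top_kmers(dump_lines, n=10):
    """Return the n (kmer, count) pairs with the highest counts."""
    records = []
    for line in dump_lines:
        kmer, count = line.split()
        records.append((kmer, int(count)))
    # Sort by count, largest first, then keep the first n entries
    return sorted(records, key=lambda record: record[1], reverse=True)[:n]

# Three of the lines listed above, deliberately out of order
sample = [
    "CAGGCCAGCTTATAAGCTGGC 98",
    "GCCCACTAATTAGTGGGCGCC 105",
    "AGCATCGCCCACATGTGGGCG 83",
]
print(top_kmers(sample, n=2))
# [('GCCCACTAATTAGTGGGCGCC', 105), ('CAGGCCAGCTTATAAGCTGGC', 98)]
```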

### Question 2c. What is the estimated genome size based on the kmer frequencies? 

Based on the kmer frequencies, the estimated genome size lies between 233,468 bp and 233,805 bp 

### Question 2d. How does the GenomeScope genome size estimate compare to the reference genome?

This estimate is very close to the actual genome size of 233,806 bp: the upper bound is 1 bp short of it, and the lower bound is 338 bp (about 0.14%) short.

## Question 3. De novo assembly

### Question 3a. How many contigs were produced?
The command:

grep -c '>' contigs.fasta 

shows that four contigs were produced.

### Question 3b. What is the total length of the contigs?
The command: 
samtools faidx contigs.fasta

generates the index for the contigs.

In [23]:
# read in the index as a pandas data frame
contig_idx = pd.read_csv('~/qbb2020-answers/week1-homework/asm/asm/contigs.fasta.fai', 
                         sep = '\t', header = None, names = ['name', 'size',2,3,4 ])
# get the sum
contig_idx['size'].sum()

234467

### Question 3c. What is the size of your largest contig?

The command:
cat contigs.fasta.fai | sort -n  -k 2 -r

lists the contigs from largest to smallest. The first line is: 
NODE_1_length_105831_cov_20.671371	105831	36	60	61

so the largest contig is 105831 bp.

### Question 3d. What is the contig N50 size?

Size sorted contigs file made with: 
cat contigs.fasta.fai | sort -n  -k 2 -r > contigs.size_sorted.fasta.fai

In [24]:
# Define half point of genome
genome_size = 233806
half_genome_size = genome_size/2

# Import the size-sorted contig index
sorted_contigs = pd.read_csv('~/qbb2020-answers/week1-homework/asm/asm/contigs.size_sorted.fasta.fai', 
                         sep = '\t', header = None, names = ['name', 'size',2,3,4 ])

# Add contigs, largest first, until the running total passes 50% of the genome
contig_sum_size = 0
counter = 0 
for i in sorted_contigs['size']:
    contig_sum_size += i
    counter += 1
    print(contig_sum_size)
    if contig_sum_size > half_genome_size:
        break

counter

105831
153692


2

It takes the two largest contigs to span half the genome, so the contig N50 is the size of the second-largest contig, NODE_2: 47861 bp (153692 - 105831).
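
The same calculation can be wrapped in a small reusable function. This is a sketch under the usual definition of N50 (the size of the contig at which the running total of size-sorted contigs first reaches half of a target length); the name `n50` and the optional `target_length` argument are illustrative. The contig sizes are the four lengths embedded in the NODE names of this assembly.

```python
def n50(contig_sizes, target_length=None):
    """Size of the contig at which the cumulative length reaches half the target.

    If target_length is None, the total assembly length is used (classic N50);
    passing the reference genome length instead matches the approach above.
    """
    sizes = sorted(contig_sizes, reverse=True)
    if target_length is None:
        target_length = sum(sizes)
    half = target_length / 2
    running_total = 0
    for size in sizes:
        running_total += size
        if running_total >= half:
            return size
    raise ValueError("contigs do not reach half of the target length")

contig_sizes = [105831, 47861, 41352, 39423]
print(n50(contig_sizes))                        # 47861
print(n50(contig_sizes, target_length=233806))  # 47861
```

Both choices of target length give the same answer here because the assembly total (234467 bp) and the reference length (233806 bp) are nearly identical.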

## Question 4. Whole Genome Alignment

### Question 4a. What is the average identity of your assembly compared to the reference? 

Running the command:

dnadiff ref.fa asm/contigs.fasta 

generates a series of output files. Looking through out.report, 99.70% of bases in the query align, with an AvgIdentity of 100.00. 

### Question 4b. What is the longest alignment? 
The commands: 

nucmer ref.fa asm/contigs.fasta
show-coords out.delta

show the following table: 

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
=====================================================================================
  127965   233795  |   105831        1  |   105831   105831  |    99.99  | Halomonas	NODE_1_length_105831_cov_20.671371
   40651    88511  |    47861        1  |    47861    47861  |   100.00  | Halomonas	NODE_2_length_47861_cov_20.231319
       3    26789  |    41352    14566  |    26787    26787  |   100.00  | Halomonas	NODE_3_length_41352_cov_20.588756
   26790    40642  |    13853        1  |    13853    13853  |   100.00  | Halomonas	NODE_3_length_41352_cov_20.588756
   88532   127954  |    39423        1  |    39423    39423  |   100.00  | Halomonas	NODE_4_length_39423_cov_20.384723
   
From this table, the longest alignment spans reference positions 127965 to 233795, with a length of 105831 bp.
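
Picking the longest alignment can also be done in code. The sketch below assumes the default show-coords row layout shown in the table (groups separated by `|`); `parse_coords_row` is a hypothetical helper, and the rows are the five data rows above with whitespace collapsed.

```python
def parse_coords_row(row):
    """Parse one data row of the show-coords table into a dict."""
    # Column groups are separated by '|'; values within a group by whitespace
    groups = [group.split() for group in row.split("|")]
    (s1, e1), (s2, e2), (len1, len2), (idy,), (ref_name, query_name) = groups
    return {
        "ref_start": int(s1), "ref_end": int(e1),
        "query_start": int(s2), "query_end": int(e2),
        "ref_len": int(len1), "query_len": int(len2),
        "identity": float(idy),
        "ref": ref_name, "query": query_name,
    }

rows = [
    "127965 233795 | 105831 1 | 105831 105831 | 99.99 | Halomonas NODE_1_length_105831_cov_20.671371",
    "40651 88511 | 47861 1 | 47861 47861 | 100.00 | Halomonas NODE_2_length_47861_cov_20.231319",
    "3 26789 | 41352 14566 | 26787 26787 | 100.00 | Halomonas NODE_3_length_41352_cov_20.588756",
    "26790 40642 | 13853 1 | 13853 13853 | 100.00 | Halomonas NODE_3_length_41352_cov_20.588756",
    "88532 127954 | 39423 1 | 39423 39423 | 100.00 | Halomonas NODE_4_length_39423_cov_20.384723",
]

alignments = [parse_coords_row(row) for row in rows]
longest = max(alignments, key=lambda aln: aln["ref_len"])
print(longest["ref_start"], longest["ref_end"], longest["ref_len"])
# 127965 233795 105831
```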

### Question 4c. How many insertions and deletions are there in the assembly?

According to out.report, the query (the assembly) has 5 deletions averaging 10.20 bp, and one insertion of 712 bp. 

## Question 5 

### Question 5a. What is the position of the insertion in your assembly? Provide the corresponding position in the reference. (and) 5b. How long is the novel insertion?

show-coords out.delta shows the table displayed above 

From this, we can see that the third contig (NODE_3) aligns twice: one alignment covers bases 1-13853 and the other covers bases 14566-41352 (positions in the query sequence). The unaligned stretch between them, bases 13854-14565 of NODE_3, is 712 bp long, which matches the 712 bp insertion reported in out.report. 

In the reference sequence, the first alignment ends at position 26789 and the second begins at position 26790, so the insertion falls immediately after reference position 26789.
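
The gap arithmetic can be checked with a few lines of Python. This sketch assumes 1-based inclusive coordinates, as reported by show-coords, and accepts start/end in either order because reverse-strand alignments are listed with start greater than end. The name `query_gap` is illustrative.

```python
def query_gap(aln_a, aln_b):
    """Unaligned interval between two alignments on the same contig.

    Each alignment is a (start, end) pair of 1-based inclusive positions,
    in either order. Returns (gap_start, gap_end, gap_length).
    """
    first, second = sorted([tuple(sorted(aln_a)), tuple(sorted(aln_b))])
    gap_start = first[1] + 1
    gap_end = second[0] - 1
    return gap_start, gap_end, gap_end - gap_start + 1

# Query coordinates of the two NODE_3 alignments from the show-coords table
print(query_gap((13853, 1), (41352, 14566)))
# (13854, 14565, 712)
```

A gap length of zero or less would mean the two alignments abut or overlap rather than flank an insertion.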

### Question 5c. What is the DNA sequence of the encoded message? 

The command: 

samtools faidx asm/contigs.fasta NODE_3_length_41352_cov_20.588756:13853-14566 

gives: 
>NODE_3_length_41352_cov_20.588756:13853-14566
CTAACGATTTACATCGGGAAAGCTTAATGCAATTCACGCAGATATTCAGCTTAGAAGGTA
CGCAGCGGTGACGGGGTGCGGTCCATAATCTATGAAGCTATGAATTCGTACCTCAAGTAA
TGTTTTCTTCGCTGCAGTTCAGAAGTGATAAAGGTATCCCGCTTAGCCTGGCATACTTTG
TGCGTTCGTACCGCCCAGCATTAATGACTTGTGTAGGCAAGTAATGAACGACTCTTCTAC
GCCGCGCCTAACCTCCGCACATAATGGCAGCATGTGGTAGTTACATACGCACAGAAGTGG
TTCGGTTTTAACTATAGTCAGATATGAATAAGCTGCGTGTGTCGTTGTGTCGGCGTGTCG
TACTTACCTCCTGACATAGGTGAATTTCAGCCTACTGTAAGTTTGGAGTCGCGCTCTTTT
CTTATTATATTCTTTGGTATGTGTGTGATGGGTTCGGGCGTGTATTGATGTCTCTAAGGC
TCATGTTAGTGTTTATTTGGTCAGTTATGACGGTGTTCCTGTCGTACGTGTTGGCTTAGC
GGACTTGTAGACGGGATCAAGGTTGTCTGACCCTCCGGTCGACCGTGGGTCGGCCGTCCC
GGCCAGAATACAAGCCGCTTAGACTTTCGAAAGAGGGTAAGTTACTACGCGCGAACGTTA
TACCTCGTTTCAGTATGCACTCCCTTAAGTCACTCAGAAAAGACTAAGGGGCTG

samtools faidx asm/contigs.fasta NODE_3_length_41352_cov_20.588756:13853-14565  > to_decode.fa

python ported_decoder.py -d --rev_comp --input to_decode.fa 

Gives the message: The decoded message :  Congratulations to the 2020 CMDB @ JHU class!  Keep on looking for little green aliens...