# Computational Genomics (COMP90016) - Exam part B
Version 1. Last updated 9/6/2022.


## Semester 1, 2022


This exam should be completed by each student individually. Make sure you read this entire document, and ask for help if anything is not clear. Any changes or clarifications to this document will be announced via the LMS.

Please make sure you review the University's rules on academic honesty and plagiarism: https://academichonesty.unimelb.edu.au/

Do not copy any code from other students or from the internet. This is considered plagiarism.

Your completed notebook file containing all your answers will be turned in via LMS. Please also submit an HTML file.

To complete the exam, finish the tasks in this notebook.

The tasks are a combination of writing your own code and answering short and long-answer questions.

In some cases, we have provided test input and test output that you can use to try out your solutions. These tests are just samples and are **not** exhaustive - they may warn you if you've made a mistake, but they are not guaranteed to. It's up to you to decide whether your code is correct.

**Remember to save your work early and often.**

## Marking

Cells that must be completed to receive marks are clearly labelled. Some cells are code cells, in which you must complete the code to solve a problem. Others are markdown cells, in which you must write your answers to short-answer questions. 

Cells that must be completed to receive marks are labelled like this:

`# -- GRADED CELL (1 mark) - complete this cell --`

### Completing code cells

- You will see the following text in graded code cells:

``` python
# YOUR CODE HERE
raise NotImplementedError()
```

- ***You must remove the `raise NotImplementedError()` line from the cell, and replace it with your solution.***


- Only add answers to graded cells. If you want to import a library or use a helper function, this must be included in a graded cell.


- Run-time limits will be imposed for each coding question. The run-time of a code cell can be calculated by including `%time` at the top of your cell. Cells exceeding the run-time limit **will not be marked**. The run-time limits only apply to test cases that are included in this document.


### Editing the notebook

**Only** graded cells will be marked.
- Don't enter solutions outside of graded cells
- Do **NOT** duplicate or remove cells from the notebook
- You may add new cells to test code, but new cells will not be graded.
- Word limits, where stated, will be strictly enforced. Answers exceeding the limit **will not be marked**.



### Marks

No marks are allocated to commenting in your code. We do, however, encourage efficient and well-commented code.

The total marks for the assignment add up to 100, and it will be worth 20% of your overall subject grade.

Part 1: 15 marks

Part 2: 15 marks

Part 3: 30 marks

Part 4: 40 marks

## Submitting

Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and student ID number below:

<div class="alert alert-info">
Name: Benjamin Field
    
Student ID: 831975
</div>

## Overview

In this assignment, you will answer questions about k-mers, assemblies, genetic codes and health-related case studies.

You will use the `skbio` library in your functions. You may want to refer to sections of the `skbio` documentation for additional help (scikit-bio.org/docs/0.5.6/index.html). In addition to `skbio` and standard Python 3 functions and methods, you may also use any other library we have used in COMP90016 including `collections`, `numpy`, `pandas`, `math`, `itertools`, `seaborn`, `pysam` and `matplotlib`.

## Part 1: K-mers

### Setup

The FASTQ file you will be using can be found in the data directory inside the zipped folder available on the LMS. **DO NOT** rename the file.

In [None]:
# Import the skbio library.
import skbio

In [None]:
# Read in the FASTQ file to produce a generator object named registry.
fname = 'data/comp90016_exam_B_readset.fastq'
registry = skbio.io.read(fname, format = 'fastq', phred_offset = 33)

In [None]:
# Append the reads from registry to a list named readset.
readset = []
for r in registry:
    readset.append(r)

### Question 1.1

(5 marks, max 1 min run-time) 

<div class="alert alert-success"> 
Write a Python function to compute the total number of distinct k-mers in a read set for a given value of k. Distinct k-mers are counted only once, even if they appear multiple times. The output should be a count of the total number of k-mers that occur at least once in a read set. Consider overlapping k-mers.
    
- [ ] Assume the value of k is greater than or equal to 1, and less than or equal to the length of the shortest read.
- [ ] Assume the input reads are a list of skbio.sequence.DNA or skbio.sequence.Sequence objects.
- [ ] Return a positive integer.
- [ ] If reads is empty, return None.
</div>

In [None]:
# GRADED CELL 1.1 (5 marks, max 1 min run-time)

def distinct_kmers(reads, k):
    """
    Computes the total number of distinct k-mers in a read set for a given value of k.
    Assume the value of k is greater than or equal to 1, and less than or equal to the length of the shortest read.
    Assume the input reads are a list of skbio.sequence.DNA or skbio.sequence.Sequence objects.
    Return a positive integer.
    If reads is empty, return None.
    """
    
    # YOUR CODE HERE
    if not reads:
        return None

    # A set stores each k-mer once, so its size is the distinct k-mer count.
    kmers = set()
    for read in reads:
        read_str = str(read)  # convert the skbio object to a string once per read
        for i in range(len(read_str) - k + 1):
            kmers.add(read_str[i: i + k])
    return len(kmers)

In [None]:
# Test your function in this cell

# Test the function run time
%timeit distinct_kmers(readset, 9)

# Check expected output on demo data
demo_reads = [skbio.sequence.DNA('AAAAATTC'), skbio.sequence.DNA('CAT')]
print(distinct_kmers(demo_reads, 3)) # should return 5
print(distinct_kmers(readset, 9))

# Write your own tests here:


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


### Question 1.2

(5 marks, max 1 min run-time) 

<div class="alert alert-success"> 
Write a Python function to identify the most abundant k-mer in a read set for a given value of k. Consider overlapping k-mers.
    
- [ ] Assume the value of k is greater than or equal to 1, and less than or equal to the length of the shortest read.
- [ ] Assume the input reads are a list of skbio.sequence.DNA or skbio.sequence.Sequence objects.
- [ ] Return the most abundant k-mer as a string.
- [ ] If there is a tie for the most abundant k-mer, return one of the most abundant k-mers only.
- [ ] If reads is empty, return None.
</div>

In [None]:
# GRADED CELL 1.2 (5 marks, max 1 min run-time)

def top_kmer(reads, k):
    """
    Identify the most abundant k-mer in a read set for a given value of k.
    Assume the value of k is greater than or equal to 1, and less than or equal to the length of the shortest read.
    Assume the input reads are a list of skbio.sequence.DNA or skbio.sequence.Sequence objects.
    Return the most abundant k-mer as a string.
    If there is a tie for the most abundant k-mer, return one of the most abundant k-mers only.
    If reads is empty, return None.
    """
    
    # YOUR CODE HERE
    if not reads:
        return None

    # Count every overlapping k-mer across all reads.
    k_mer_dict = {}
    for read in reads:
        read_str = str(read)  # convert the skbio object to a string once per read
        for i in range(len(read_str) - k + 1):
            kmer = read_str[i: i + k]
            k_mer_dict[kmer] = k_mer_dict.get(kmer, 0) + 1

    # max() returns the first key with the highest count, which settles ties.
    return max(k_mer_dict, key=k_mer_dict.get)

In [None]:
# Test your function in this cell

# Test the function run time
%timeit top_kmer(readset, 9)

# Check expected output on demo data
print(top_kmer(demo_reads, 3)) # should return 'AAA'
print(top_kmer(readset, 9))

# Write your own tests here:


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----



In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----



### Question 1.3

(5 marks, max 100 words)

<div class="alert alert-info">
Explain how a function that finds the most abundant k-mer in a read set could be useful for quality control.
</div>

The most abundant k-mer can provide clues as to whether certain sequences are over-represented in the read set. A k-mer that is far more abundant than the rest recurs across many reads. This could indicate a sequencing or library-preparation artefact in which certain reads have been overproduced (for example, leftover adapter sequence or PCR duplicates), which could negatively affect the biological accuracy of any assembly built from the reads.
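
As a rough illustration of this idea, the sketch below counts k-mers across a handful of made-up reads and reports any k-mer whose share of all k-mers exceeds a chosen fraction. The toy reads, the `flag_overrepresented` helper and the 0.12 threshold are assumptions for illustration only; they are not part of the exam data, and plain strings are used in place of `skbio` objects.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every overlapping k-mer across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def flag_overrepresented(reads, k, max_fraction):
    """Return k-mers whose share of all k-mers exceeds max_fraction (hypothetical QC rule)."""
    counts = count_kmers(reads, k)
    total = sum(counts.values())
    return {kmer: n for kmer, n in counts.items() if n / total > max_fraction}

# Toy reads: three of them end with the same made-up adapter-like suffix 'GATCGG'.
toy_reads = ['ACGTGATCGG', 'TTACGATCGG', 'CCATGATCGG', 'ACGTTACCAT']
flagged = flag_overrepresented(toy_reads, 6, 0.12)
print(flagged)  # {'GATCGG': 3}
```

Only the shared suffix is flagged here: it accounts for 3 of the 20 6-mers, while every other 6-mer appears at most twice.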

### END OF PART 1

## Part 2: Assemblies

### Setup

The FASTA file you will be using can be found in the data directory inside the zipped folder available on the LMS. **DO NOT** rename the file.

In [None]:
# Read in the FASTA file to produce a generator object named registry.
fname = 'data/comp90016_exam_B_assembly.fasta'
registry = skbio.io.read(fname, format = 'fasta')

In [None]:
# Append the reads from registry to a list named assembly_contigs.
assembly_contigs = []
for contig in registry:
    assembly_contigs.append(contig)

### Question 2.1 

(5 marks, max 1 min run-time) 

<div class="alert alert-success"> 
    
The Nx length is defined as the length L for which x% of all bases in the assembly are in contigs that are equal to or longer than L. For example, if x = 50, the Nx (or N50) length might be 800. This means that 50% of the total bases in the assembly are in contigs that are equal to or longer than 800 bases.


Write a Python function to compute the Nx length for an assembly.
    
- [ ] Assume x is an integer between 1 and 99.
- [ ] Assume the input assembly is a list of skbio.sequence.DNA or skbio.sequence.Sequence objects.
- [ ] Assume each contig has length > 0.
- [ ] Return a positive integer.
- [ ] If assembly is an empty list, return None.
</div>

In [None]:
# GRADED CELL 2.1 (5 marks, max 1 min run-time)

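def nx_length(assembly, x):
    """
    Computes the Nx length for an assembly.
    Assume x is an integer between 1 and 99.
    Assume the input assembly is a list of skbio.sequence.DNA or skbio.sequence.Sequence objects.
    Assume each contig has length > 0.
    Return a positive integer.
    If assembly is an empty list, return None.
    """

    if not assembly:
        return None

    # Sort contig lengths from longest to shortest.
    contig_lens = sorted((len(contig) for contig in assembly), reverse=True)

    # Number of bases that must be covered to reach x% of the assembly.
    n_point = sum(contig_lens) * x / 100

    # Accumulate lengths until the running total reaches the threshold.
    running_total = 0
    for contig_len in contig_lens:
        running_total += contig_len
        if running_total >= n_point:
            return contig_len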
    
  

In [None]:
# Test your function in this cell

# Test the function run time
#%timeit nx_length(assembly_contigs, 75)

# Check expected output on demo data
demo_assembly = [skbio.sequence.DNA('GCAGAT'), skbio.sequence.DNA('GAAG'), skbio.sequence.DNA('GCG'), skbio.sequence.DNA('GAT')]
print(nx_length(demo_assembly, 50)) # should return 4

# Write your own tests here:
print(nx_length(assembly_contigs, 75))

In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----





In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


### Question 2.2

(5 marks, max 100 words)

<div class="alert alert-info">
Explain why a high Nx length (for example, N50 length), does not guarantee that an assembly is good quality.
</div>

The Nx length can be easily manipulated by the filtering applied to the assembly. A high Nx length can be artificial: filtering short contigs out of the assembly (which is a perfectly legitimate thing to do) raises the Nx length. This does not mean that the filtered assembly is of any better quality than the unfiltered one, only that the mean contig length has been artificially increased.
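
The effect of filtering can be sketched with a few made-up contig lengths. The `n50` helper and the 500-base cut-off below are assumptions chosen for illustration; the helper works on plain integers rather than `skbio` objects.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (None if the list is empty)."""
    if not lengths:
        return None
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Made-up assembly: two long contigs and twenty short ones.
contig_lengths = [900, 700] + [100] * 20

# Drop every contig shorter than 500 bases (hypothetical filter).
filtered = [length for length in contig_lengths if length >= 500]

n50_before = n50(contig_lengths)
n50_after = n50(filtered)
print(n50_before, n50_after)  # 100 900
```

The two long contigs are identical before and after filtering, yet the N50 rises from 100 to 900 simply because the short contigs no longer count towards the total.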

### Question 2.3

(5 marks, max 100 words)

<div class="alert alert-info">
Describe one other way that the quality of an assembly could be assessed.
</div>

The overall number of contigs could be evaluated. For the same genome, an assembly made up of a large number of small contigs is more fragmented, and generally of lower quality, than an assembly with a small number of large contigs.

### END OF PART 2

## Part 3: Genetic codes

### Setup

The FASTA file you will be using can be found in the data directory inside the zipped folder available on the LMS. **DO NOT** rename the file.

In [None]:
# Read in the FASTA file to produce a generator object named registry.
fname = 'data/comp90016_exam_B_sequence.fasta'
registry = skbio.io.read(fname, format = 'fasta')
registry

In [None]:
# Assign the sequence to an skbio.sequence.DNA object named sequence.
for seq in registry:
    sequence = seq
sequence = skbio.sequence.DNA(seq, lowercase = True)

<br/>

Some organisms use different variations of the genetic code. This must be taken into account when predicting genes ab initio. You can see the differences in the alternative genetic codes at the site below. Note that some genetic codes have multiple initiation codons, and some are more commonly used than others. The first amino acid of any ORF is always methionine, even when an alternative initiation codon is used. In any other position, that codon may encode a different amino acid but, as a start codon, it will always code for methionine.

>https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi 

These genetic codes can be referred to by a number. For example, the standard genetic code is genetic code 1. 

Fortunately, some `skbio` functions can interpret alternate genetic codes. See the documentation for methods like .translate() and .translate_six_frames() 

The GeneticCode function can also be used to help you incorporate the genetic codes into your functions. 

In [None]:
# Display the standard genetic code. Try changing the number to view a different code.
# Note that RNA codons are shown with the GeneticCode function, whereas DNA codons are used on the NCBI site.
skbio.GeneticCode.from_ncbi(1)

### Question 3.1 

(20 marks, max 1 min run-time) 

<div class="alert alert-success"> 

An open reading frame (ORF) is a continuous sequence of DNA that begins with a start codon and ends at a stop codon. ORFs can be in any reading frame on the positive or negative strand, can be overlapping or can be nested within another ORF.

Write a Python function to identify ORFs in a DNA sequence and translate them into amino-acid sequences.

- [ ] Return a list of skbio.sequence.Protein objects where the protein sequence is of a length equal to or longer than min_len.
- [ ] Assume the input sequence is an skbio.sequence.DNA object.
- [ ] Use the genetic code encoded by genetic_code.
- [ ] Assume ORFs can start with any of the initiation codons and finish with any of the stop codons specified in the genetic code.
- [ ] Only consider ORFs wholly contained within the sequence (including the start and stop codons).
- [ ] Assume genetic_code is an integer between 1 and 6.
- [ ] Assume min_len is an integer.
- [ ] If the input seq is empty, return None.
- [ ] If no ORFs are found, return an empty list.
    
</div>

In [None]:
# GRADED CELL 3.1 (20 marks, max 1 min run-time)

def translate_orfs(seq, min_len, genetic_code):
    """
    Return a list of skbio.sequence.Protein objects where the protein sequence is of a length equal to or longer than min_len.
    Assume the input sequence is an skbio.sequence.DNA object.
    Use the genetic code encoded by genetic_code.
    Assume ORFs can start with any of the initiation codons and finish with any of the stop codons specified in the genetic code.
    Only consider ORFs wholly contained within the sequence (including the start and stop codons).
    Assume genetic_code is an integer between 1 and 6.
    Assume min_len is an integer.
    If the input seq is empty, return None.
    If no ORFs are found, return an empty list.
    """
    from skbio import Protein

    # YOUR CODE HERE
    if not bool(seq):
        return None

    # Look up the requested NCBI genetic code and transcribe the DNA to RNA.
    alt_code = skbio.GeneticCode.from_ncbi(genetic_code)
    RNA = seq.transcribe()
    ORFS = []

    # Translate all six reading frames (three forward, three reverse).
    proteins = alt_code.translate_six_frames(RNA, start='optional')

    for protein in proteins:
        protein_str = str(protein)
        # Every 'M' in the translated frame is treated as a candidate start.
        starts = [i for i, aa in enumerate(protein_str) if aa == 'M']
        for start in starts:
            new_prot = ''
            for aa in protein_str[start:]:
                if aa == '*':
                    # Stop codon reached: keep the ORF if it is long enough.
                    if len(new_prot) >= min_len:
                        ORFS.append(Protein(new_prot))
                    break
                new_prot += aa

    return ORFS





In [None]:
# Test your function in this cell

# Test the function run time
#%timeit translate_orfs(sequence, 60, 3)
# Check expected output on demo data

demo_sequence_a = skbio.sequence.DNA('GTTGGATTCATGAAAGA')

print(translate_orfs(demo_sequence_a, 3, 1)) # should give a single sequence: MDS
print(translate_orfs(demo_sequence_a, 3, 2)) # should give a single sequence: MHE
demo_sequence_b = skbio.sequence.DNA('ATGAAATGAATGTCTTGA')
print(translate_orfs(demo_sequence_b, 2, 1)) # should give two sequences: MK and MS
demo_sequence_c = skbio.sequence.DNA('ATGAAAATGTCTTGA')
print(translate_orfs(demo_sequence_c, 2, 1)) # should give two sequences: MKMS and MS


print(translate_orfs(sequence, 60, 3))

# Write your own tests here:


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----




In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


### Question 3.2

(5 marks, max 50 words)

<div class="alert alert-info">
Would identifying and translating ORFs in a DNA sequence be a good candidate use-case for data parallelism? Explain your answer.
</div>

Data parallelism entails subdividing a dataset into smaller pieces and concurrently running the same operations on each piece. Identifying and translating ORFs is a good candidate because the DNA can be split into its six reading frames, which are then processed simultaneously, improving the speed of the analysis.
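
A minimal sketch of this pattern, assuming plain strings rather than `skbio` objects and only the three forward reading frames: the same function is mapped over each frame by a pool of workers. The helper names `count_stops` and `forward_frames` are hypothetical, and the stop codons listed are those of the standard genetic code.

```python
from concurrent.futures import ThreadPoolExecutor

STOP_CODONS = {'TAA', 'TAG', 'TGA'}  # stop codons of the standard genetic code

def count_stops(frame):
    """Count stop codons in one reading frame (a string read in steps of three)."""
    return sum(1 for i in range(0, len(frame) - 2, 3) if frame[i:i + 3] in STOP_CODONS)

def forward_frames(dna):
    """Split a DNA string into its three forward reading frames."""
    return [dna[offset:] for offset in range(3)]

dna = 'ATGAAATGAATGTCTTGA'  # same bases as demo_sequence_b in the Question 3.1 test cell
frames = forward_frames(dna)

# The same function is applied to each frame independently, so the work can be
# handed to a pool of workers. A process pool would be the usual choice for
# CPU-bound work; threads are used here only to keep the sketch portable.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel_counts = list(pool.map(count_stops, frames))

sequential_counts = [count_stops(frame) for frame in frames]
print(parallel_counts, sequential_counts)  # [2, 1, 0] [2, 1, 0]
```

Because no frame depends on the result of another, the parallel and sequential runs give the same counts.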

### Question 3.3

(5 marks, max 100 words)

<div class="alert alert-info">
You have been asked to identify ORFs as part of an ab initio gene prediction analysis. Explain the potential consequences of using the wrong genetic code.
</div>

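If the wrong genetic code is used, incorrect ORFs and amino acid sequences will be identified. Start and stop codons differ between genetic codes, so ORFs may begin or end in the wrong place, be truncated, or be missed entirely, and some codons will be translated into the wrong amino acid. Because ab initio prediction relies only on signals in the sequence itself, there is no external evidence to catch these errors, so downstream analysis of the predicted genes is likely to be incorrect.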
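
To make the consequence concrete, the sketch below scans the forward strand of the first demo sequence from the Question 3.1 test cell using the start and stop codons of NCBI genetic codes 1 and 2. The `CODES` table and the `forward_orfs` helper are assumptions written for this illustration: they cover only these two codes, ignore the reverse strand, and return DNA rather than translated protein.

```python
# Start and stop codons (DNA alphabet) for NCBI genetic codes 1 and 2.
CODES = {
    1: {'starts': {'TTG', 'CTG', 'ATG'}, 'stops': {'TAA', 'TAG', 'TGA'}},
    2: {'starts': {'ATT', 'ATC', 'ATA', 'ATG', 'GTG'}, 'stops': {'TAA', 'TAG', 'AGA', 'AGG'}},
}

def forward_orfs(dna, code_number):
    """Return the DNA of every forward-strand ORF (start codon through stop codon)."""
    starts = CODES[code_number]['starts']
    stops = CODES[code_number]['stops']
    orfs = []
    for i in range(len(dna) - 2):
        if dna[i:i + 3] not in starts:
            continue
        # Walk codon by codon from the start until an in-frame stop is found.
        for j in range(i + 3, len(dna) - 2, 3):
            if dna[j:j + 3] in stops:
                orfs.append(dna[i:j + 3])
                break
    return orfs

demo = 'GTTGGATTCATGAAAGA'  # demo_sequence_a from the Question 3.1 test cell
orfs_code_1 = forward_orfs(demo, 1)
orfs_code_2 = forward_orfs(demo, 2)
print(orfs_code_1)  # ['TTGGATTCATGA']
print(orfs_code_2)  # ['ATTCATGAAAGA']
```

The same DNA yields ORFs in different reading frames with different boundaries, which matches the expected `MDS` (code 1) and `MHE` (code 2) outputs noted in the test cell.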

### END OF PART 3

## Part 4: Computational genomics and disease

### Question 4.1

(20 marks, max 300 words)

<div class="alert alert-info">
    
Melbourne has been experiencing an outbreak of disease in domestic dogs, caused by a bacterial pathogen commonly known as Pokerus. The pathogen is passed between dogs through close contact and causes an illness that affects appetite and growth.


This disease has been observed before, though never in Victoria. Some members of the community believe that the pathogen was first introduced to Melbourne via racing greyhounds (a breed of dog). The pathogen can infect all dog breeds but some breeds suffer more severe symptoms.
 
Concerned for the health of domestic dogs across Melbourne, help has been enlisted from a well-funded Public Health laboratory with a computational genomics unit that has extensive experience in the investigation of disease outbreaks caused by bacterial pathogens.
    
**The computational genomics lab has been assigned the following tasks:**

- Investigate the evolutionary relationships between different Pokerus isolates including isolates from previous outbreaks in other locations and isolates from the current Melbourne outbreak.
- Determine whether there is evidence that the pathogen was introduced to Melbourne via racing greyhounds.
 
**Available resources include:**
 
- 10 purified, cultured Pokerus isolates from other outbreaks around the world.
- 10 purified, cultured Pokerus isolates from the Melbourne outbreak from several dog breed hosts including racing greyhounds.
- The NCBI reference sequence database (RefSeq)
    - Includes the latest annotated *Canis lupus familiaris* (dog) reference genome.
    - Includes the latest annotated Pokerus reference genome.
- Any other databases required by tools covered in COMP90016 this semester.
 
You can assume that the resources available and any biological samples you propose to generate from them are of sufficient quality and quantity for the analysis you propose.
 
**Using the information and resources from the scenario above, describe an experimental approach to carry out the specified tasks.**
    
- Describe and justify all necessary steps in your analysis. 
- Limit your discussion to aspects of the bioinformatics. 
- Include the names of the tools you propose to use and the specific inputs and outputs of each step. 
- Please use tools that have been used or mentioned in COMP90016 this semester. 
- Include your choice of sequencing technology and important quality control measures. 
- Discuss the potential results and how their interpretation could lead to conclusions regarding the two tasks. 
- Answer in text format only, do not include figures or code.
</div>

- Use long-read sequencing to sequence all isolates. Long-read sequencers such as PacBio are well suited to phylogenetics, genome assembly and outbreak analysis, because the long reads help to eliminate amplification bias and are computationally easier to assemble into a genome.
- Long reads can have higher error rates. To help mitigate this, short-read sequencing such as Illumina could also be used to correct errors in certain regions of the long reads.
- Use FastQC on the output FASTQ files for initial quality control.
- Use Snippy to generate an alignment file of the reads against the Pokerus reference genome from NCBI RefSeq.
- Multiple sequence alignment (MSA) should then be used to determine the similarity and evolutionary relationships between the isolates.
- For the MSA, progressive (tree-guided) alignment via Clustal can be used to generate an MSA file.
- The MSA file can then be analysed to assess isolate relationships.
- Generate a phylogenetic tree with IQ-TREE. The output is a Newick file.
- The Newick file will contain a maximum-likelihood tree that shows the genetic relationships between isolates.
- The Newick file, along with supporting CSV metadata, can be uploaded to Microreact to visualise the geographical data.
- If the pathogen was introduced to Melbourne via racing greyhounds, the Melbourne isolates will form a monophyletic group with the greyhound isolates closest to the common ancestor.

### Question 4.2

(20 marks, max 300 words)

<div class="alert alert-info">
    
One particular breed of dog called Growlithes seems to be resistant to the symptoms brought on by the Pokerus bacteria (although they can still be infected).
    
Scientists hypothesise that Growlithes are protected from severe symptoms by genetic variants related to their response to heat. The biological process of "response to heat" is represented by gene ontology term GO:0009408. The same well-funded computational genomics laboratory has been tasked with discovering more about the mechanisms of the resistance in Growlithes.

**The computational genomics lab has been assigned the following tasks:**
 
- Identify a set of genetic variants in Growlithes that differ from the dog reference genome.
- Determine whether there are genetic variants in Growlithe genes that are involved in the "response to heat" and are predicted to affect the function of the gene product.

 
**Available resources include:**
 
- All resources from question 4.1.
- 20 DNA samples from Growlithes.
- The gene ontology database.
- Any other databases required by tools covered in COMP90016 this semester.
 
You can assume that the resources available and any biological samples you propose to generate from them are of sufficient quality and quantity for the analysis you propose.
 
**Using the information and resources from the scenario above, describe an experimental approach to carry out the specified tasks.**
    
- Describe and justify all necessary steps in your analysis. 
- Limit your discussion to aspects of the bioinformatics. 
- Include the names of the tools you propose to use and the specific inputs and outputs of each step. 
- Please use tools that have been used or mentioned in COMP90016 this semester. 
- Include your choice of sequencing technology and important quality control measures. 
- Discuss the potential results and how their interpretation could lead to conclusions regarding the two tasks. 
- Answer in text format only, do not include figures or code.
</div>

- Sequence the Growlithe genomes with a short-read sequencing technology (Illumina). Short-read sequencing is used for its higher accuracy.
- Use FastQC to perform quality control on the FASTQ read sets.
- Use data from the gene ontology database (term GO:0009408, "response to heat") as a reference for the genes of interest.
- Use BWA to map the reads against the dog reference genome. The alignments are then stored in BAM format.
- Improve the alignment with SAMtools via duplicate removal, local realignment and base quality score recalibration.
- Use SAMtools on the output BAM file (sorting and indexing) to prepare for variant calling.
- Use BCFtools to call variants. The output will be a VCF file.
- Filter variants for quality within BCFtools, adjusting parameters for Phred scores and sensitivity to maximise the likelihood of the variants being biologically accurate. The VCF file will now contain higher-quality information.
- The VCF file can be analysed to identify variants of interest between the Growlithe genomes and the dog reference genome.
- These regions of interest can be further analysed using tblastx on the original FASTQ reads. tblastx compares translated nucleotide sequences, giving a protein-level comparison between the wild type and the variant. This can be used to glean information about how the Growlithe variant changes the gene product at the protein level.

### END OF PART 4

### END OF EXAM