Cells that must be completed to receive marks are labelled like this:

`# -- GRADED CELL (1 mark) - complete this cell --`

Some graded cells are code cells, in which you must complete the code to solve a problem. Other graded cells are markdown cells, in which you must write your answers to short-answer questions. 

You will see the following text in graded code cells:

```
# YOUR CODE HERE
raise NotImplementedError()
```

***You must remove the `raise NotImplementedError()` line from the cell, and replace it with your solution.***


Only graded cells will be marked.
**Don't make changes outside graded cells, and don't add or remove cells from the notebook**.

Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and student ID number below:

In [None]:
NAME = "Jiayu Wang"
ID = "1039580"

# COMP90014 Assignment 2
### Semester 2, 2021



This assignment should be completed by each student individually. Make sure you read this entire document, and ask for help if anything is not clear. Any changes or clarifications to this document will be announced via the LMS.

Please make sure you review the University's rules on academic honesty and plagiarism: https://academichonesty.unimelb.edu.au/

Do not copy any code from other students or from the internet. This is considered plagiarism.

To complete the assignment, finish the tasks in this notebook.

The tasks are a combination of writing your own code, interpreting the results and answering related short-answer questions.

In some cases, we have provided test input and test output that you can use to try out your solutions. These tests are just samples and are **not** exhaustive - they may warn you if you've made a mistake, but they are not guaranteed to. It's up to you to decide whether your code is correct.

### Completing the assignment

You will need to finish the tasks in this notebook.

You must not call functions from external packages that implement parts of the algorithms we have asked you to write.

The tasks are a combination of writing your own implementations of algorithms we've discussed in lectures (including adapting or describing them in plain English or pseudocode) and interpreting the results in short answer format.

In some cases, we have provided test input and test output that you can use to try out your solutions. These tests are just samples and are not exhaustive - they may warn you if you've made a mistake, but they are not guaranteed to. It's up to you to decide whether your code is correct.

### Submitting your work

Your completed notebook file containing all your answers must be turned in via LMS in `.ipynb` format.
You must also submit a file in `html` format with the output cleared.
You can do this by using the `Clear all output` option in the menu.


### Marking

Only modify graded cells. If you want to write a helper function, do it in a graded cell.

Word limits, where stated, will be strictly enforced. Answers exceeding the limit **will not be marked**.

No marks are allocated to commenting in your code. We do, however, encourage efficient and well-commented code.


### Pseudocode
Pseudocode for an algorithm is a series of logical steps that any programmer can understand.
Here is the pseudocode for a function fizzbuzz that prints "Fizz" for numbers divisible by 3 and "Buzz" for numbers divisible by 5:

```
function fizzbuzz()
    For i = 1 to 100
        If i is divisible by 3 Then
            Print "Fizz"
        If i is divisible by 5 Then
            Print "Buzz"
```
      
As you can see, the basic steps are shown, but there is no language-specific syntax. <br>
In this manner, pseudocode explains the algorithm procedure in **direct, plain language.** <br>
As a note, if your function calls another function (let's call this function **boom**), write it as **boom()** in the pseudocode. The open/close brackets show that 'boom' is another function call.

There are no real conventions aside from the above, so please use a style which you think is **clearest.**

# Part I

### Background and data 

WGCNA stands for weighted gene co-expression network analysis. It is a data analysis technique used for studying biological networks based on pairwise correlations of gene expression data. WGCNA is good at identifying clusters of genes that may be co-regulated, and therefore may have shared biological function.

For this assignment, you will primarily be using the [FlyAtlas](http://flyatlas.org) dataset. Instead of the probe-wise dataset, we will use the expression value for each gene.



### Import packages

If you are using JupyterLab on CloudStor SWAN, all of the packages in the following cell will be available.
If you are working on your local computer, you may need to install some packages.

If you need to use additional packages, you must import them in a graded cell.



In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import altair
import pandas as pd
import numpy as np
import networkx as nx
import scipy
import re
from io import StringIO
from copy import deepcopy
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import squareform

### Settings
We will set numpy and pandas to display numbers to just two decimal places in this notebook - this won't affect the actual numbers, just their display.

In [None]:
np.set_printoptions(precision=2)
pd.options.display.precision=2

### Read in data

To begin, we will import the FlyAtlas data into a pandas dataframe. We will then inspect the first few rows of the data. It should have gene names as row names and sample names as column names.



In [None]:
# Import data
raw_expression = pd.read_csv('flyatlas_subset.csv.gz', index_col=0)

# Print first 5 rows
raw_expression.head()

The data frame has 3114 rows (genes) and 136 columns (samples) so it is certainly high dimensional. These 136 columns represent 4 replicates each from 34 different tissue types.

### Data labels
The following code snippet removes the replicate name from each sample, so we can use these labels as categories for plotting later.

In [None]:
# Make list of sample names without replicate number
tissues_list = [re.match(r'(.+?)(( biological)? rep\d+)', c).group(1)
                     for c in raw_expression.columns]
tissues = pd.Series(tissues_list, index = raw_expression.columns)
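
To see what the pattern does, the sketch below applies it to a few made-up column names. The names are hypothetical and only assumed to resemble the FlyAtlas headers; the real columns may differ.

```python
import re

# Hypothetical column names, assumed to resemble the FlyAtlas headers
columns = ['Adult Eye biological rep1',
           'Adult Eye biological rep2',
           'Larval Fat Body rep3']

# Lazy group 1 captures the tissue name; group 2 captures the replicate suffix,
# with or without the word "biological"
pattern = re.compile(r'(.+?)(( biological)? rep\d+)')

tissue_names = [pattern.match(c).group(1) for c in columns]
print(tissue_names)
```

Because the first group is lazy, it stops as soon as the rest of the name can be read as a replicate suffix, so both replicates of the same tissue end up with the same label.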

### Transforming the data

It's common practice to take the log of expression values. Here is a visual motivation as to why this may be useful:

In [None]:
log_expression = np.log(raw_expression + 1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
ax1.hist(raw_expression.values.flatten(), bins=200)
ax1.set_title('raw expression')
ax1.set_xlabel('expression value')
ax1.set_ylabel('num occurrences')
ax2.hist(log_expression.values.flatten(), bins=200)
ax2.set_title('log expression')
ax2.set_xlabel('expression value')
ax2.set_ylabel('num occurrences')
plt.show()

<br>
From here on, we will use 'log_expression' as our sample data

In [None]:
log_expression = np.log(raw_expression + 1)
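
As a small numeric illustration of the transformation, the sketch below applies it to a handful of made-up values (they are not taken from the FlyAtlas data). Adding 1 before taking the log keeps zero counts at zero and avoids `log(0)`.

```python
import numpy as np

# Hypothetical expression values spanning several orders of magnitude
raw = np.array([0.0, 9.0, 99.0, 9999.0])

# log(x + 1) maps 0 to 0 and compresses the large values
logged = np.log(raw + 1)
print(logged)
```

The raw values differ by a factor of a thousand or more, while the transformed values all fall between 0 and 10.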

# Task 1 - Building a correlation matrix

## Task 1.1

The [FlyAtlas](http://flyatlas.org) dataset contains four biological replicates for each tissue. Combine the biological replicates by calculating the mean expression value for each gene in each tissue.

In [None]:
# ~~ GRADED CELL (2 marks) - complete this cell ~~

def average_by_tissue(expression, tissues):
    '''
        Given a DataFrame of gene expression data, 
        and a list, array or Series of tissues corresponding to the columns of the dataframe,
        average over the expression values in each gene for each tissue type and
        return the resulting dataframe. 
        The columns of the new dataframe should correspond to the provided tissues.
    '''

    # Reject empty or incorrectly formatted input explicitly instead of returning None
    if not isinstance(expression, pd.DataFrame) or expression.empty:
        raise ValueError('expression must be a non-empty DataFrame')
    if len(tissues) != len(expression.columns):
        raise ValueError('tissues must contain one label per column of expression')

    # Relabel a copy so that the caller's dataframe is not modified
    labelled = expression.copy()
    labelled.columns = list(tissues)

    # Unique tissue names, kept in order of first appearance
    unique_tissues = list(dict.fromkeys(labelled.columns))

    # Average the replicate columns of each tissue, gene by gene.
    # Boolean column selection always returns a DataFrame, even for a single replicate.
    averaged_expression = pd.DataFrame(index=labelled.index)
    for tissue in unique_tissues:
        replicates = labelled.loc[:, labelled.columns == tissue]
        averaged_expression[tissue] = replicates.mean(axis=1)

    return averaged_expression

In [None]:
# Testing cell - Do not alter.

# Inspect mean expression values for tissue "Adult Eye"
log_expression_copy = deepcopy(log_expression)
mean_tissue_expression = average_by_tissue(log_expression_copy, tissues)

print(f'\n1a student:\n{mean_tissue_expression["Adult Eye"]}\n\n')


The below test case should return

```
      A    B
0   4.5  2.5
1  10.0  7.0
```


In [None]:
test_df = pd.DataFrame([[5,4,3,2],[10,10,6,8]])
print(average_by_tissue(test_df, ['A','A','B','B']))

In [None]:
# Calculate average expression for each tissue in the flyatlas data
tissue_expression = average_by_tissue(log_expression, tissues)

In [None]:
# Inspect the new expression dataframe. There should be 34 columns, corresponding to the 34 tissue types.
tissue_expression.head()
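
The same replicate averaging can also be written with a pandas `groupby`. The sketch below is one possible formulation, not the required solution; it reuses the small test case from this task and assumes the labels are given in column order.

```python
import numpy as np
import pandas as pd

# Toy data from the test case in this task: two replicates each of tissues A and B
df = pd.DataFrame([[5, 4, 3, 2], [10, 10, 6, 8]])
labels = np.asarray(['A', 'A', 'B', 'B'])

# Transpose so that samples are rows, group the rows by tissue label, average,
# then transpose back; sort=False keeps tissues in order of first appearance
averaged = df.T.groupby(labels, sort=False).mean().T
print(averaged)
```

Passing the labels as an array (rather than a plain list) makes it unambiguous that they are group keys and not column names.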

## Task 1.2

WGCNA starts by building a pairwise correlation matrix of genes. Using the matrix you just created, produce an *unsigned* correlation matrix where each cell contains the absolute value of the correlation coefficients.

You can calculate the Pearson correlation values yourself, or look up a numpy or scipy function to do so.
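
As a sanity check on what a Pearson coefficient is, the sketch below computes it by hand for the GeneA and GeneB rows of the first test case in this task and compares the result with `np.corrcoef`. It is an illustration for a single pair of genes, not a full solution.

```python
import numpy as np

# GeneA and GeneB rows from the first test case in this task
gene_a = np.array([3.8, 2.7, 4.5])
gene_b = np.array([4.3, 3.4, 6.2])

# Pearson r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
dev_a = gene_a - gene_a.mean()
dev_b = gene_b - gene_b.mean()
r_manual = (dev_a * dev_b).sum() / np.sqrt((dev_a ** 2).sum() * (dev_b ** 2).sum())

# np.corrcoef returns the full 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(gene_a, gene_b)[0, 1]
print(round(r_manual, 2), round(r_numpy, 2))
```

Both values round to 0.95, which matches the GeneA/GeneB entry of the expected output for that test case.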

In [None]:
# ~~ GRADED CELL (2 marks) - complete this cell ~~

def calculate_unsigned_correlation(expression):
    ''' 
        Produce the unsigned correlation matrix for a table of gene expression values.
        Assume that the columns of the expression matrix are samples and the rows are
        genes, and return an array of arrays giving the Pearson correlation between each pair of genes,
        in the same order as the rows of the expression table.
    '''
    # YOUR CODE HERE
    # np.corrcoef treats each row (gene) as a variable and each column (sample) as an observation
    corr_matrix = np.corrcoef(expression.to_numpy())
    # Discard the sign so that only the strength of each correlation remains
    return np.abs(corr_matrix)


In [None]:
# Testing cell - Do not alter.

# Inspect correlation matrix
student_correlation_matrix = calculate_unsigned_correlation(mean_tissue_expression)

print(f'\n1b student:\n{student_correlation_matrix[:5, :5]}\n\n')


The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 1.  ,  0.95,  0.96,  0.44,  0.3 ,  0.15],
       [ 0.95,  1.  ,  1.  ,  0.71,  0.59,  0.46],
       [ 0.96,  1.  ,  1.  ,  0.67,  0.54,  0.41],
       [ 0.44,  0.71,  0.67,  1.  ,  0.99,  0.95],
       [ 0.3 ,  0.59,  0.54,  0.99,  1.  ,  0.99],
       [ 0.15,  0.46,  0.41,  0.95,  0.99,  1.  ]])
```

In [None]:
test_df = pd.DataFrame([[ 3.8,  2.7,  4.5],
                       [ 4.3,  3.4,  6.2],
                       [ 5.3,  4.3,  7. ],
                       [ 4.6,  6. ,  7.7],
                       [ 5.2,  7.3,  8.8],
                       [ 6.2,  8.5,  9.4]], 
                         columns=['Tissue1', 'Tissue2', 'Tissue3'],
                         index=['GeneA', 'GeneB', 'GeneC', 'GeneD', 'GeneE', 'GeneF'])
calculate_unsigned_correlation(test_df)

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
```

In [None]:
test_df = pd.DataFrame([[ 3.8,  2.7,  4.5],
                       [ 4.3,  3.4,  6.2],
                       [ 5.2,  7.3,  8.8],
                       [ 6.2,  8.5,  9.4]], 
                         columns=['Tissue1', 'Tissue2', 'Tissue3'],
                         index=['GeneA', 'GeneB', 'GeneC', 'GeneD'])
calculate_unsigned_correlation(test_df)

In [None]:
# Calculate the correlation matrix for the flyatlas data
unsigned_correlation = calculate_unsigned_correlation(tissue_expression)
_ = plt.hist(unsigned_correlation.flatten(), bins=100)

## TASK 1.3

Why are we using an unsigned correlation matrix instead of a signed correlation matrix? (2 marks; 50 words)

YOUR ANSWER HERE
```
The adjacency calculations below operate on the absolute value of the correlation between genes, so we need an unsigned matrix rather than a signed one. In addition, we care more about the strength of a correlation than about its direction (positive or negative).

```


# Task 2 - Building an adjacency matrix

To use the correlation matrix to create a network, we will transform it into an adjacency matrix. You will create two types of adjacency matrix, a binary adjacency matrix and a weighted adjacency matrix.

## Task 2.1

To create the binary adjacency matrix, transform the correlation matrix such that every correlation greater than or equal to a given threshold value is considered adjacent (represented by a 1 in the matrix), and every correlation below that value is considered not adjacent (represented by a 0). Set the diagonal of the adjacency matrix to 0, so that we don't consider a node to be adjacent to itself.
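
The thresholding step can be sketched with a single array comparison. The example below uses the correlation matrix from the first test case in this task with a threshold of 0.5; it is a minimal illustration and does no input checking.

```python
import numpy as np

# Correlation matrix from the first test case in this task
corr = np.array([[1.0, 0.95, 0.3, 0.15],
                 [0.95, 1.0, 0.59, 0.46],
                 [0.3, 0.59, 1.0, 0.99],
                 [0.15, 0.46, 0.99, 1.0]])

# The comparison yields a new boolean array; corr itself is not modified
adjacency = (corr >= 0.5).astype(float)

# A node is not adjacent to itself
np.fill_diagonal(adjacency, 0)
print(adjacency)
```

Building a new array matters here, because the original correlation matrix is needed again for the weighted adjacency matrix.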

In [None]:
# ~~ GRADED CELL (1 mark) - complete this cell ~~

def calculate_binary_adjacencies(correlation, threshold):
    '''
        Given a correlation matrix between genes of shape (N,N),
        return the corresponding binary adjacency matrix of shape (N,N),
        with a 1 wherever the correlation is at or above the given threshold.
    '''
    
    # YOUR CODE HERE
    # Reject invalid input explicitly instead of silently returning it unchanged
    if not isinstance(correlation, np.ndarray):
        raise TypeError('correlation must be a numpy array')
    # The comparison builds a new array, so the caller's correlation matrix is not
    # overwritten and can still be used for the weighted adjacency matrix
    adjacency = (correlation >= threshold).astype(float)
    # A node is not considered adjacent to itself
    np.fill_diagonal(adjacency, 0)
    return adjacency

In [None]:
# Testing cell - Do not alter.

# Inspect binary adjacency matrix
student_binary = calculate_binary_adjacencies(student_correlation_matrix, 0.5)

print(f'\n2a student:\n{student_binary[:10, :10]}\n\n')


The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  1.,  0.],
       [ 0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  0.]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_binary_adjacencies(test_corr, 0.5)

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.],
       [ 0.,  0.,  1.,  0.]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_binary_adjacencies(test_corr, 0.6)

In [None]:
# Calculate the binary adjacency matrix for the flyatlas data
adjacency_binary = calculate_binary_adjacencies(unsigned_correlation, 0.85)

## Task 2.2

Calculate the connectivity of the adjacency matrix by dividing the total number of edges by the number of possible edges.
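
As a worked example of this ratio, the sketch below counts the edges of a four-node path graph (the same matrix as the first test case in this task). Counting only the upper triangle is one way to avoid counting each undirected edge twice.

```python
import numpy as np

# Path graph on four nodes: edges 0-1, 1-2 and 2-3
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])

n = adjacency.shape[0]
# The upper triangle (k=1 skips the diagonal) counts each undirected edge once
edges = np.triu(adjacency, k=1).sum()
# Number of distinct node pairs: n choose 2
possible_edges = n * (n - 1) // 2
print(edges, possible_edges, edges / possible_edges)
```

Three of the six possible edges exist, giving a connectivity of 0.5.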

In [None]:
# ~~ GRADED CELL (1 mark) - complete this cell ~~

def calculate_connectivity(adjacency):
    '''
        Calculate the number of edges that exist in a given binary adjacency matrix,
        divided by the total number of possible edges between all nodes.
    '''
    
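    # YOUR CODE HERE
    # Reject invalid input explicitly
    if not isinstance(adjacency, np.ndarray):
        raise TypeError('adjacency must be a numpy array')
    n = adjacency.shape[0]
    # A symmetric matrix lists every undirected edge twice; subtracting the trace ignores self-loops
    edges = (adjacency.sum() - np.trace(adjacency)) / 2
    possible_edges = n * (n - 1) / 2
    
    return edges / possible_edges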

In [None]:
# Testing cell - Do not alter.

# Check connectivity for our adjacency matrix

student_connectivity = calculate_connectivity(student_binary)
print(f'\n2b student:\n{student_connectivity}\n\n')


In [None]:
# Should return 0.5
calculate_connectivity(np.array([[ 0.,  1.,  0.,  0.],
                                   [ 1.,  0.,  1.,  0.],
                                   [ 0.,  1.,  0.,  1.],
                                   [ 0.,  0.,  1.,  0.]]))

In [None]:
# Should return 0.33
calculate_connectivity(np.array([[ 0.,  1.,  0.,  0.],
                                   [ 1.,  0.,  0.,  0.],
                                   [ 0.,  0.,  0.,  1.],
                                   [ 0.,  0.,  1.,  0.]]))

In [None]:
# Calculate connectivity for the flyatlas binary adjacency matrix 
calculate_connectivity(adjacency_binary)

## Task 2.3

The weighted adjacency matrix can be created by raising the correlation matrix to some power. Write a function that raises the correlation matrix to some power, `beta`, and sets the diagonal to `0`. For the rest of the assignment we will use `beta = 4` but your function should accept any integer.
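
To illustrate why raising correlations to a power acts as a soft threshold, the sketch below applies several values of `beta` to three made-up correlation values. The numbers are hypothetical and chosen only to show the effect.

```python
import numpy as np

# Hypothetical unsigned correlations, from weak to strong
corr = np.array([0.3, 0.6, 0.9])

# Raising to a power shrinks weak correlations far more than strong ones
for beta in (1, 2, 4):
    print(beta, np.round(corr ** beta, 3))

# Ratio of the strong to the weak correlation, before and after soft thresholding
ratio_before = corr[2] / corr[0]
ratio_after = corr[2] ** 4 / corr[0] ** 4
print(ratio_before, ratio_after)
```

With `beta = 4` the strong correlation goes from about 3 times the weak one to about 81 times, so weak edges fade without being cut off at a hard threshold.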

In [None]:
# ~~ GRADED CELL (2 marks) - complete this cell ~~

def calculate_weighted_adjacencies(correlation, beta):
    '''
        Given a correlation matrix between genes of shape (N,N),
        return the corresponding weighted adjacency matrix of shape (N,N),
        where we use a power-law soft threshold with parameter beta.
    '''

    # YOUR CODE HERE
    # Reject invalid input explicitly
    if not isinstance(correlation, np.ndarray):
        raise TypeError('correlation must be a numpy array')
    # Element-wise power; np.power returns a new array, so the input is unchanged.
    # The values are not rounded: two-decimal precision is a display setting only.
    weighted_adj_matrix = np.power(correlation, beta)
    np.fill_diagonal(weighted_adj_matrix, 0)

    return weighted_adj_matrix

In [None]:
# Testing cell - Do not alter.

# Inspect weighted adjacencies

student_weighted_adjacencies = calculate_weighted_adjacencies(student_correlation_matrix, 4)
print(f'\n2c student:\n{student_weighted_adjacencies[:5, :5]}\n\n')


The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.  ,  0.9 ,  0.09,  0.02],
       [ 0.9 ,  0.  ,  0.35,  0.21],
       [ 0.09,  0.35,  0.  ,  0.98],
       [ 0.02,  0.21,  0.98,  0.  ]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_weighted_adjacencies(test_corr, 2)

The below test case should return (if displayed to a precision of two decimal places)

```
array([[ 0.  ,  0.86,  0.03,  0.  ],
       [ 0.86,  0.  ,  0.21,  0.1 ],
       [ 0.03,  0.21,  0.  ,  0.97],
       [ 0.  ,  0.1 ,  0.97,  0.  ]])
```

In [None]:
test_corr = np.array([[ 1.  ,  0.95,  0.3 ,  0.15],
       [ 0.95,  1.  ,  0.59,  0.46],
       [ 0.3 ,  0.59,  1.  ,  0.99],
       [ 0.15,  0.46,  0.99,  1.  ]])
calculate_weighted_adjacencies(test_corr, 3)

In [None]:
# Calculate the weighted adjacency matrix for the flyatlas data
adjacency_weighted = calculate_weighted_adjacencies(unsigned_correlation, 4)

## Task 2.4

How do you expect the network connectivity would change if the threshold for the binary adjacency matrix is increased or decreased? (2 marks; 50 words)


Increasing the threshold decreases connectivity. Connectivity = number_of_edges/number_of_possible_edges. A higher threshold removes edges, while the number of possible edges stays unchanged, so connectivity falls. Conversely, lowering the threshold adds edges and increases connectivity.

# Task 3 - Dimensionality Reduction

In this task we will perform Principal Component Analysis to determine which gene contributes most to the variance captured by the first principal component.

## Task 3.1

Perform a Principal Component Analysis on the log_expression matrix with the correct number of components to explain 90% of the variance, and print the list of explained variance by component.

In [None]:
# ~~ GRADED CELL (2 marks) - complete this cell ~~

'''
    Hints: 
    - Your solution should set a variable 'n' where n is the 
    number of components required to explain 90% of the variance.
    - You may use the sklearn 'PCA' function we imported earlier. 
'''

# Fit a full PCA first to obtain the explained variance ratio of every component
pca = PCA()
expression_pca = pca.fit_transform(log_expression.T.to_numpy())
var = 0
n = 0
# Add components one at a time until together they explain at least 90% of the variance
while var < 0.9:
    var += pca.explained_variance_ratio_[n]
    n += 1
# Refit the PCA with exactly the n components needed to explain 90% of the variance
pca = PCA(n_components = n)
expression_pca = pca.fit_transform(log_expression.T.to_numpy())

In [None]:
# Testing cell - Do not alter.

print(f'\nExpected components: {n}')
print(f'\nVariance ratio: {pca.explained_variance_ratio_}')
print(f'\nExplained variance: {pca.explained_variance_}')
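
The search for the number of components can also be expressed with a cumulative sum. The sketch below uses made-up variance ratios (not the FlyAtlas results) to show the idea.

```python
import numpy as np

# Hypothetical explained variance ratios, in decreasing order as PCA returns them
ratios = np.array([0.55, 0.25, 0.08, 0.05, 0.04, 0.03])

# Running total of the variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(ratios)

# Index of the first running total that reaches 0.9, plus one to turn it into a count
n_components = int(np.argmax(cumulative >= 0.9)) + 1
print(cumulative, n_components)
```

Here the first three components explain 88% of the variance and the fourth takes the total to 93%, so four components are needed. scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components` and then chooses the number of components itself, which may be a convenient cross-check.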


## Task 3.2

**Print** the gene that contributes most to the first eigenvector (the first principal component) of the PCA.

In [None]:
# ~~ GRADED CELL (2 marks) - complete this cell ~~

# YOUR CODE HERE
# Find the gene (feature) with the largest absolute loading on the first principal component
most_important = np.abs(pca.components_[0]).argmax()
gene = log_expression.index[most_important]
print(f'Gene that contributes most to the first eigenvector: {gene}')
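
To make the idea of a loading concrete, the sketch below runs a PCA by hand on a tiny made-up data set, using a singular value decomposition of the centred data in place of scikit-learn. The gene names and values are hypothetical.

```python
import numpy as np

# Hypothetical data: 4 samples (rows) x 3 genes (columns); gene_b varies the most
genes = ['gene_a', 'gene_b', 'gene_c']
X = np.array([[1.0, 0.0, 5.0],
              [1.1, 10.0, 5.1],
              [0.9, 20.0, 4.9],
              [1.0, 30.0, 5.0]])

# PCA via SVD of the centred data: the rows of vt are the principal axes (eigenvectors)
centred = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)

# The gene with the largest absolute loading contributes most to the first axis
top_gene = genes[int(np.abs(vt[0]).argmax())]
print(top_gene)
```

The absolute value is needed because the sign of an eigenvector is arbitrary: a large negative loading contributes as much as a large positive one.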


In [None]:
# Testing cell - Do not alter.



# Task 4 - Graph Metrics

Graph metrics help to characterise a network as a whole, and can also indicate the relative importance of specific nodes within it.


## Task 4.1

Normalised Degree Centrality

Describe an algorithm in pseudocode that returns the normalised degree centrality of a node (degree divided by the maximum node degree in the graph), receiving as parameters a node index **i** and its binary adjacency matrix **m**.

YOUR ANSWER HERE  
```
// Based on the formula: normalised D(i) = degree(i) / max_degree, where max_degree = n - 1.

function normalised_degree_centrality(i, m):  
    n = length(m)    // number of rows in m, i.e. the number of vertices
    max_degree = n - 1    // a node can be adjacent to at most every other node
    degree = 0
    for x = 1 to n:       // sum the values in row i of m
        degree = degree + m[i][x]  
    normalised_degree_centrality = degree/max_degree  
    return normalised_degree_centrality
       
```    
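
For comparison, a direct Python translation of this kind of pseudocode might look like the sketch below. It assumes the adjacency matrix is a list of lists and that the normalisation is by `n - 1`; note that Python indices start at 0 rather than 1.

```python
def normalised_degree_centrality(i, m):
    """Degree of node i divided by n - 1, the largest degree a node can have."""
    n = len(m)
    # Row i of a binary adjacency matrix holds one 1 per neighbour of node i
    degree = sum(m[i])
    return degree / (n - 1)


# Path graph 0-1-2-3 as a list-of-lists binary adjacency matrix
m = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(normalised_degree_centrality(0, m), normalised_degree_centrality(1, m))
```

The end node has one neighbour out of a possible three, and the inner node has two out of three.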
    


## Task 4.2

Closeness Centrality

Describe an algorithm in pseudocode that returns the closeness centrality of a node, receiving as parameters a node index **i** and its binary adjacency matrix **m**. As part of your answer you can assume the function *min_dist(a,b,m)* is available. This function will return the minimum distance between nodes a and b in a graph represented by the adjacency matrix m. (2 marks)

YOUR ANSWER HERE
```
function closeness_centrality(i, m):
    n = length(m)    // number of rows in m, i.e. the number of vertices
    distance = 0
    for v from 1 to n:
        if v != i then:
            distance = distance + min_dist(i,v,m)
    closeness_centrality = 1/distance
    return closeness_centrality
```
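
A runnable version of this idea is sketched below. The task lets us assume `min_dist` exists; here it is filled in with a breadth-first search, which is one possible implementation for an unweighted graph. Closeness is taken as the reciprocal of the summed distances, as in the pseudocode, and the graph is assumed to have at least two nodes.

```python
from collections import deque


def min_dist(a, b, m):
    """Breadth-first search distance (number of edges) from node a to node b."""
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return dist[u]
        for v, connected in enumerate(m[u]):
            if connected and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return float('inf')  # b cannot be reached from a


def closeness_centrality(i, m):
    """Reciprocal of the summed shortest-path distances from node i to all other nodes."""
    total = sum(min_dist(i, v, m) for v in range(len(m)) if v != i)
    return 1 / total


# Path graph 0-1-2-3
m = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(closeness_centrality(0, m), closeness_centrality(1, m))
```

The inner node is closer to everything else (distances 1, 1, 2) than the end node (distances 1, 2, 3), so it scores higher. An unreachable node makes the sum infinite, so the closeness becomes 0.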

## Task 4.3

Clustering Coefficient and Average Path Length are two important properties to distinguish between different network types.

Consider that you are working with a particular biological network. Describe briefly, in your own words, how you would use these properties to verify whether your network is consistent with a Random Network, Small-World Network or Regular Lattice Network. (Maximum 150 words).


YOUR ANSWER HERE
```
N: number of nodes; C: clustering coefficient; L: average path length; p: probability of neighbours being connected; K: number of nearest neighbours.

If the network has low C and low L, where L ≈ log(N), then it is consistent with a Random Network. This is because edges are formed randomly with uniform probability.
If the network has high C (C >> the C of a Random Network) and low L (L ≈ the L of a Random Network), then it is consistent with a Small-World Network.
If the network has high C (C >> the C of a Random Network) and high L (L >> the L of a Random Network), then it is consistent with a Regular Lattice Network. This is because L ≈ N/2K, which is greater than L in a Random Network.
```
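
The two properties can be computed directly for a small example. The sketch below builds a ring lattice (an assumed toy graph with 12 nodes, each linked to its 4 nearest neighbours) and measures its clustering coefficient and average path length in plain Python. It assumes an even number of neighbours and a connected graph.

```python
from collections import deque
from itertools import combinations


def ring_lattice(n, k):
    """Adjacency sets for n nodes on a ring, each linked to its k nearest neighbours."""
    half = k // 2
    return {u: {(u + d) % n for d in range(-half, half + 1) if d != 0}
            for u in range(n)}


def clustering_coefficient(graph):
    """Average, over nodes, of the fraction of neighbour pairs that are themselves linked."""
    local = []
    for neighbours in graph.values():
        pairs = list(combinations(neighbours, 2))
        linked = sum(1 for a, b in pairs if b in graph[a])
        local.append(linked / len(pairs) if pairs else 0.0)
    return sum(local) / len(local)


def average_path_length(graph):
    """Mean breadth-first search distance over all ordered pairs of distinct nodes."""
    total, count = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        count += len(dist) - 1
    return total / count


lattice = ring_lattice(12, 4)
print(clustering_coefficient(lattice), average_path_length(lattice))
```

For larger networks, the `networkx` functions `average_clustering` and `average_shortest_path_length` compute the same quantities, and the results can be compared with those of a random graph with the same number of nodes and edges.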

# Part II

## Genome assembly

Provide short answers to the following questions about genome assembly algorithms.


### Question 5a.

Despite being slow, overlap-layout-consensus assembly algorithms can tolerate high error rates in the sequencing reads. How do these algorithms handle sequencing errors **during graph construction**? (50 words; 4 marks)


YOUR ANSWER HERE
```
OLC has build, simplify and traverse steps. During overlap detection, OLC allows a few mismatches, so reads containing errors can still be joined. Layout then removes transitively inferrible edges and looks for a Hamiltonian path, and a sensible path is traced to call a consensus.
```

### Question 5b.

In general, why is it faster to construct a de Bruijn graph of *k*-mers than to construct a graph of read overlaps? (50 words; 4 marks)

YOUR ANSWER HERE
```
Building a read-overlap graph requires all-vs-all pairwise comparison of reads, which is computationally expensive. A de Bruijn graph instead chops reads into k-mers and records their neighbouring relations as it goes. Also, OLC must call a consensus from alignments after layout, whereas k-mers already carry consensus information.
```


### Question 5c.

Remember that in the worst case, a single error in a sequencing read will generate *k* erroneous *k*-mers. Describe how these incorrect *k*-mers will appear in the de Bruijn graph. (100 words; 5 marks)


YOUR ANSWER HERE
```
The k erroneous k-mers connect in series, one after another, to create an alternative route through the graph.
For example, if the read AATCGGG is sequenced as AATTGGG, it generates 3 erroneous 3-mers: ATT, TTG and TGG. These three 3-mers connect in series to create an alternative route between nodes AT and GG.
```
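
The example in this answer can be checked with a few lines of Python. The sketch below lists the 3-mers of both reads and treats each k-mer as an edge between its (k-1)-mer prefix and suffix, which is one common way of drawing a de Bruijn graph.

```python
def kmers(read, k):
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


correct = kmers('AATCGGG', 3)
erroneous = kmers('AATTGGG', 3)

# k-mers that exist only because of the sequencing error
error_kmers = [kmer for kmer in erroneous if kmer not in correct]
print(error_kmers)

# Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix
correct_path = [(kmer[:-1], kmer[1:]) for kmer in correct]
error_path = [(kmer[:-1], kmer[1:]) for kmer in error_kmers]
print(correct_path)
print(error_path)
```

The erroneous edges leave the correct path at node AT and rejoin it at node GG, forming a second route between the same two nodes.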


### Question 5d. 

If the sequencing error rate is low, and your read depth is high, how could an algorithm deal with the issues you described in Question 5c during graph simplification? (50 words; 5 marks)


```
We can weight each edge by its supporting read depth, so that edges seen in more reads carry more weight. With a low error rate and high depth, erroneous edges have much lower weight than correct ones, so low-weight edges can be deleted.
```
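
A depth filter of this kind can be sketched as follows. The reads and the cutoff `min_depth` are hypothetical; a real assembler would choose the cutoff from the k-mer depth distribution.

```python
from collections import Counter


def kmers(read, k):
    """All overlapping substrings of length k, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Hypothetical reads: five error-free copies and one copy with a single error
reads = ['AATCGGG'] * 5 + ['AATTGGG']

# Depth of each k-mer = number of times it occurs across all reads
depth = Counter(kmer for read in reads for kmer in kmers(read, 3))

# Hypothetical cutoff: discard k-mers seen fewer than min_depth times
min_depth = 2
kept = {kmer for kmer, count in depth.items() if count >= min_depth}
removed = {kmer for kmer, count in depth.items() if count < min_depth}
print(sorted(removed))
```

Only the three k-mers created by the error fall below the cutoff, so removing them deletes the erroneous route and leaves the correct path intact.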

### Question 5e.

You are assembling a genome for a newly-discovered native bee species. Male bees of this species are haploid, meaning they have no heterozygosity.

You have received 12 gigabases ($12 \times 10^{9} \text{ bases}$) of 100 b sequencing reads, which were produced from DNA from a single male bee.

From these reads, you produce a *k*-mer depth spectrum using $k = 51$. The main peak on the spectrum is at a depth of 40.

Estimate the size of the bee's genome. Hint: you could start by calculating the number of reads. (2 marks)


YOUR ANSWER HERE
```
Number of reads: 12*10^9/ 100 = 12*10^7
Genome size ≈ Number of reads * (read length-k+1) / Dk = 12*10^7 * (100-51+1) / 40 = 15*10^7 b
Hence the answer is 15*10^7 b
```
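
The same arithmetic can be written out step by step. The variable names below are chosen for illustration; the numbers are those given in the question.

```python
total_bases = 12 * 10**9   # 12 gigabases of sequence
read_length = 100          # bases per read
k = 51                     # k-mer size
kmer_depth = 40            # depth of the main peak on the k-mer spectrum

num_reads = total_bases // read_length
kmers_per_read = read_length - k + 1
total_kmers = num_reads * kmers_per_read

# Each genomic k-mer is seen about kmer_depth times, so divide to get the genome size
genome_size = total_kmers // kmer_depth
print(num_reads, kmers_per_read, genome_size)
```

This gives 120 million reads, 50 k-mers per read and an estimated genome size of 150 million bases.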

### Question 5f.

What additional feature would you expect to see on the *k*-mer depth spectrum described in Question 5e if the bee had been a heterozygous diploid? (2 marks)

YOUR ANSWER HERE
```
If the bee had been a heterozygous diploid, then in addition to the peak at a depth of 40 we would expect to see a second peak at around half that depth (about 20). This second peak would be lower than the peak at a depth of 40.
```