# COMP90016 - Assignment 2
Version 1. Last edited 04/04/2022.


## Semester 1, 2022


This assignment should be completed by each student individually. Make sure you read this entire document, and ask for help if anything is not clear. Any changes or clarifications to this document will be announced via the LMS.

Please make sure you review the University's rules on academic honesty and plagiarism: https://academichonesty.unimelb.edu.au/

Do not copy any code from other students or from the internet. This is considered plagiarism.

Your completed notebook file containing all your answers will be turned in via LMS. Please also submit an HTML file.

To complete the assignment, finish the tasks in this notebook.

The tasks are a combination of utilising computational genomics tools, writing your own code, interpreting the results and answering related short-answer questions.

In some cases, we have provided test input and test output that you can use to try out your solutions. These tests are just samples and are **not** exhaustive - they may warn you if you've made a mistake, but they are not guaranteed to. It's up to you to decide whether your code is correct.

**Remember to save your work early and often.**

## Marking

Cells that must be completed to receive marks are labelled like this:

`# -- GRADED CELL (1 mark) - complete this cell --`

Some graded cells are code cells, in which you must complete the code to solve a problem. Other graded cells are markdown cells, in which you must write your answers to short-answer questions. 

You will see the following text in graded code cells:

```
# YOUR CODE HERE
raise NotImplementedError()
```

***You must remove the `raise NotImplementedError()` line from the cell, and replace it with your solution.***

Only add answers to graded cells. If you want to import a library or use a helper function, this must be included in a graded cell.

Only graded cells will be marked.
**Don't make changes outside graded cells, and don't add or remove cells from the notebook**.

>Word limits, where stated, will be strictly enforced. Answers exceeding the limit **will not be marked**.

>Run-time limits will be imposed for each coding question. The run-time of a code cell can be calculated by including `%time` at the top of your cell. Cells exceeding the run-time limit **will not be marked**. The run-time limits only apply to test cases that are included in this document.

No marks are allocated to commenting in your code. We do, however, encourage efficient and well-commented code.

The total marks for the assignment add up to 100, and it will be worth 15% of your overall subject grade.

Part 1: 35 marks

Part 2: 20 marks

Part 3: 45 marks

## Submitting

Before you turn this assignment in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and student ID number below:

In [None]:
NAME = "Benjamin Field"

ID = "831975"

## Overview

In this assignment, you will answer questions about VCF files and BLAST.

You will use the `pysam` and `skbio` libraries in your functions. You may want to refer to sections of the documentation for these tools for additional help. In addition to `pysam`, `skbio` and standard Python 3 functions and methods, you may also use any other library we have used in COMP90016, including `collections`, `numpy`, `pandas`, `math`, `itertools`, `seaborn` and `matplotlib`.

## Part 1: VCF files

### Setup

The VCF files you will be using can be downloaded from the LMS. Download the files and place them in a directory named `data` in the same location as this notebook. **DO NOT** rename the files. 

In [None]:
# Import the pysam module.
# To use your local device, see the install instructions in workshop 5.
import pysam

In [None]:
# Read in the VCF files to produce VariantFile objects.
demo_vcf_a = pysam.VariantFile("data/comp90016_assignment_2_demo_a.vcf")
demo_vcf_b = pysam.VariantFile("data/comp90016_assignment_2_demo_b.vcf")
comp90016_vcf = pysam.VariantFile("data/comp90016_assignment_2.vcf")
comp90016_vcf

VariantFile objects allow us to access the data within VCF files. You may want to refer back to the work you did in workshop 5. You may also want to read the documentation linked below:
>https://pysam.readthedocs.io/en/latest/usage.html#working-with-vcf-bcf-formatted-files

### Questions
In the cells below, complete the following tasks:

### Question 1.1
(5 marks, max 50 words)

Which reference genome (full species name and version) was used for the variant calling in `comp90016_assignment_2.vcf`? Which variant calling tool was used?


-- GRADED CELL (5 marks) - complete this cell --

The *Homo sapiens* (human) reference genome was used; the version is hg38. The variant calling tool was freeBayes version 1.3.1.
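
As a hedged illustration of where this kind of information lives: a VCF header is a block of `##key=value` meta lines, and the reference and the calling tool are conventionally recorded under `##reference` and `##source`. The sketch below parses a made-up header string (the values are placeholders, not the contents of the assignment file). With `pysam`, the same header text should be obtainable with `str(vcf.header)`.

```python
# Hypothetical header text for illustration only; not the assignment file.
EXAMPLE_HEADER = """##fileformat=VCFv4.2
##source=exampleCaller v0.0
##reference=example_reference.fa
##contig=<ID=18,length=1000>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
"""


def header_meta(header_text):
    """Collect simple ##key=value meta lines into a dictionary."""
    meta = {}
    for line in header_text.splitlines():
        if not line.startswith("##"):
            continue
        key, _, value = line[2:].partition("=")
        # Keep the first occurrence of each key; structured lines such as
        # ##contig can repeat.
        meta.setdefault(key, value)
    return meta


meta = header_meta(EXAMPLE_HEADER)
print(meta["reference"])
print(meta["source"])
```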

### Question 1.2 

(10 marks) 

Write a Python function to calculate the ti/tv for a set of SNVs in a VCF file containing variants from a single sample. Disregard all other variant types. Consider multi-allelic SNVs as separate SNVs. Consider a particular SNV allele only once, regardless of how many copies of the allele are present. Assume the input is a pysam VariantFile object. Return a single floating-point number. If there are no transversion SNVs, return None.

**Reminder:**

`The four standard DNA bases can be divided into purines (A, G) and pyrimidines (C,T). A transition is a single base change from a purine to a purine, or from a pyrimidine to a pyrimidine. A transversion is a single base change from a purine to a pyrimidine or from a pyrimidine to a purine. The transition/transversion ratio (ti/tv) is the ratio of the number of transitions to the number of transversions.`
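
The definition above can be sketched in plain Python, independent of `pysam`. The helper name `classify_snv` and the list of (ref, alt) pairs are made up for illustration; this is a sketch of the definition, not the graded solution.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snv(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a standard SNV: {ref}>{alt}")
    # Same chemical class (purine/purine or pyrimidine/pyrimidine) is a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# Made-up (ref, alt) pairs for illustration.
snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A"), ("T", "G")]
labels = [classify_snv(ref, alt) for ref, alt in snvs]
ti = labels.count("transition")
tv = labels.count("transversion")
print(ti, tv, ti / tv)  # 3 transitions, 2 transversions, ratio 1.5
```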

In [None]:
# GRADED CELL 1.2 (10 marks, max 1 min run-time)
def titv(vcf):
    """
    Calculate the ti/tv for a set of SNVs in a VCF file.
    Disregard all other variant types.
    Consider multi-allelic SNVs as separate SNVs.
    Consider a particular SNV allele only once, regardless of how many copies of the allele are present.
    Assume the input is a pysam VariantFile object.
    Return a single floating-point number.
    If there are no transversion SNVs, return None.
    """
    
    # YOUR CODE HERE
    # Counters for transitions and transversions.
    transitions = 0
    transversions = 0

    # Base classes used to label each SNV as a transition or a transversion.
    purines = {'A', 'G'}
    pyrimidines = {'C', 'T'}

    # Loop through each variant record in the VCF file.
    for record in vcf.fetch():

        # Skip records whose reference allele is not a single base.
        if len(record.ref) != 1:
            continue

        # Compare each distinct alternate allele against the reference allele.
        # record.alts is None when a record has no ALT allele, and set()
        # ensures a repeated allele is only counted once per record.
        for alt in set(record.alts or ()):

            # Skip alternate alleles that are not a single base.
            if len(alt) != 1:
                continue

            # Purine <-> pyrimidine changes are transversions.
            if ((record.ref in purines and alt in pyrimidines)
                    or (record.ref in pyrimidines and alt in purines)):
                transversions += 1
            # Changes within the same base class are transitions.
            elif ((record.ref in purines and alt in purines)
                    or (record.ref in pyrimidines and alt in pyrimidines)):
                transitions += 1

    # The ratio is undefined when there are no transversions.
    if transversions == 0:
        return None

    # Calculate the ti/tv ratio and return the value.
    return transitions / transversions


In [None]:
# Test your function in this cell
print(titv(demo_vcf_a)) # should output 2.0
print(titv(demo_vcf_b)) # should output 1.0

print(titv(comp90016_vcf))

In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


### Question 1.3

(10 marks, max 100 words)

Explain why the ti/tv ratio is typically higher in protein-coding regions than in non-coding regions.



-- GRADED CELL (10 marks) - complete this cell --

Transversions are more likely than transitions to change the encoded amino acid, because many transitions at the third codon position are synonymous. In a protein-coding region, a change to the encoded sequence is a missense mutation (or a nonsense mutation if a stop codon is created), so the protein is likely to function less well than before, or not at all. Selection therefore removes such variants and conserves these sequences. Transitions are less likely to alter the amino acid sequence and affect the viability of the protein, so they are seen more often. Consequently, the ti/tv ratio is higher in protein-coding regions.

### Question 1.4

(5 marks, max 100 words)

When working on a variant calling project, you might generate files with the extensions vcf.gz or vcf.tbi. Explain the contents of these files and how they relate to .vcf files.


-- GRADED CELL (5 marks) - complete this cell --

The .gz extension indicates that the file has been compressed with gzip, and the .tbi extension marks the tabix index file, which contains the index of the corresponding vcf.gz file. They relate to .vcf files in that they are the compressed version (and its index) of the original .vcf file, used when the file has been compressed for efficiency.

### Question 1.5

(5 marks, max 100 words)

Computational genomics often involves "big" genomics data sets. Explain how .gz and .tbi files help make working with large files feasible.







-- GRADED CELL (5 marks) - complete this cell --

.gz files and their associated .tbi files allow very large amounts of genomic data to be compressed and indexed. Genomic files can be very large, so compressing them saves storage space. The indexing enabled by tabix files means that the entire file doesn't need to be decompressed each time you need to access data, which again improves the efficiency of working with large files.
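
The idea of block compression plus a position index can be sketched with the standard library alone. This is a conceptual illustration only: it does not reproduce the real BGZF or tabix file formats, and the records, block size and `fetch` helper are all made up.

```python
import gzip

# Made-up records: one variant line per position on contig "18".
records = [f"18\t{pos}\t.\tA\tG\t.\t.\t.\n" for pos in range(1000, 5000)]
raw = "".join(records).encode()

BLOCK_SIZE = 500  # records per independently compressed block (arbitrary)
blocks = []
index = []        # (first_pos, last_pos, byte_offset, byte_length) per block
offset = 0
for i in range(0, len(records), BLOCK_SIZE):
    chunk = records[i:i + BLOCK_SIZE]
    data = gzip.compress("".join(chunk).encode())
    first_pos = int(chunk[0].split("\t")[1])
    last_pos = int(chunk[-1].split("\t")[1])
    index.append((first_pos, last_pos, offset, len(data)))
    blocks.append(data)
    offset += len(data)
compressed = b"".join(blocks)


def fetch(pos):
    """Decompress only the block whose position range covers pos."""
    for first_pos, last_pos, start, length in index:
        if first_pos <= pos <= last_pos:
            text = gzip.decompress(compressed[start:start + length]).decode()
            return [ln for ln in text.splitlines()
                    if int(ln.split("\t")[1]) == pos]
    return []


print(len(raw), len(compressed))  # the compressed data is smaller
print(fetch(3210))
```

Only one of the eight blocks is decompressed to answer the query, which is the property that makes indexed access to large compressed files practical.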


## Part 2: More VCF files

### Setup

The VCF file you will be using for this part can be downloaded from the LMS. Download the `comp90016_assignment_3_MBP.vcf.gz` file and place it in the `data` directory. **DO NOT** rename the file.

The data you will be using is publicly available as part of the 1000 Genomes Project, which was completed in 2015. It was one of the first large-scale projects to try to capture the level of variation in the human genome. Much larger projects have been completed since then. You can read one of the key publications describing this dataset here: https://www.nature.com/articles/nature15393. 

Our gene of interest is *MBP* (Myelin Basic Protein, NM_001025081). Variants in *MBP* from three selected individuals (HG02687, HG03897 and HG04118) have been included.

In [None]:
# Read in the MBP VCF file to produce a VariantFile object.
comp90016_vcf_MBP = pysam.VariantFile("data/comp90016_assignment_2_mbp.vcf")

### Questions
In the cells below, complete the following tasks:

### Question 2.1

(10 marks)

Write a Python function to create a genotype dictionary for a given variant position in a given pysam VariantFile object. The sample names as strings are keys and tuples of the genotypes are the values. Assume vcf is a pysam VariantFile object. Assume chrom is the name of a contig as a string. Assume pos is an integer. Return the genotype dictionary. If the combination of chrom and pos does not exist in the VCF file, return None.
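
To make the target data structure concrete, here is a minimal sketch that builds the same kind of dictionary from raw VCF text rather than from a `pysam` object. The helper name `parse_genotypes`, the sample names and the example lines are hypothetical.

```python
def parse_genotypes(header_line, data_line):
    """Map each sample name to a tuple of allele indices from the GT field."""
    # Sample names start at the tenth column of the #CHROM header line.
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = data_line.rstrip("\n").split("\t")
    # The FORMAT column (ninth) says where GT sits within each sample column.
    gt_index = fields[8].split(":").index("GT")
    genotypes = {}
    for name, call in zip(samples, fields[9:]):
        gt = call.split(":")[gt_index]
        # Alleles are separated by "|" (phased) or "/" (unphased).
        alleles = gt.replace("|", "/").split("/")
        genotypes[name] = tuple(None if a == "." else int(a) for a in alleles)
    return genotypes


# Made-up lines for illustration (sample names S1..S3 are hypothetical).
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
line = "18\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1"
print(parse_genotypes(header, line))
```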

In [None]:
# GRADED CELL 2.1 (10 marks, max 1 min run-time)

def genotype_dict(vcf, chrom, pos):
    """
    Create a genotype dictionary for a given variant position in a given pysam VariantFile object. 
    The sample names as strings are keys and tuples of the genotypes are the values. 
    Assume vcf is a pysam VariantFile object. 
    Assume chrom is the name of a contig as a string. 
    Assume pos is an integer. 
    If the combination of chrom and pos does not exist in the VCF file, return None. 
    Return the genotype dictionary.
    """
    
    # YOUR CODE HERE
    # Empty dictionary to record the output.
    genotypes = {}

    # Loop through each variant record in the VCF file.
    for record in vcf.fetch():
        # Check that the record matches the requested contig and position.
        if record.contig == chrom and record.pos == pos:

            # Record the genotype (GT field) of every sample at this variant.
            # Looking GT up by name avoids relying on it being the first
            # FORMAT field.
            for name, call in record.samples.items():
                genotypes[name] = call['GT']

    # No record matched the requested contig and position.
    if len(genotypes) == 0:
        return None

    return genotypes
    

In [None]:
# ~~ Test your function in this cell ~~
print(genotype_dict(demo_vcf_b, '18', 74690979)) # should output {'HG02687': (0, 0), 'HG03897': (0, 1), 'HG04118': (0, 0)}
print(genotype_dict(demo_vcf_b, '18', 74690999)) # should output {'HG02687': (1, 1), 'HG03897': (0, 0), 'HG04118': (0, 0)}
print(genotype_dict(demo_vcf_b, '2', 1000)) # should output None

print(genotype_dict(comp90016_vcf_MBP, '18', 74694135))


In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----


### Question 2.2

(5 marks, max 100 words)

For each sample, list the total number of homozygous reference, homozygous alternative and heterozygous alleles. Which of the three samples has the most homozygous alternative variants in the MBP gene?


-- GRADED CELL (5 marks) - complete this cell --

Homozygous reference:
- HG02687: 1317
- HG03897: 1349
- HG04118: 1331

Homozygous alternative:
- HG02687: 22
- HG03897: 16
- HG04118: 18

Heterozygous alleles:
- HG02687: 68
- HG03897: 42
- HG04118: 58

HG02687 is the sample with the most homozygous alternative variants. 
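
Counts like these can be produced by labelling each genotype tuple and tallying per sample. The sketch below shows one way to do that on made-up data; the `zygosity` helper, the sample names and the genotypes are hypothetical and are not taken from the MBP file. A genotype with two different alternative alleles, such as (1, 2), is labelled heterozygous here, which is an assumption of this sketch.

```python
from collections import Counter


def zygosity(genotype):
    """Label a genotype tuple such as (0, 1)."""
    if None in genotype:
        return "missing"
    if all(allele == 0 for allele in genotype):
        return "hom_ref"
    if len(set(genotype)) == 1:
        return "hom_alt"
    return "het"


# Made-up genotypes per sample (sample names are hypothetical).
calls = {
    "S1": [(0, 0), (0, 1), (1, 1), (1, 1)],
    "S2": [(0, 0), (0, 0), (1, 0), (1, 1)],
}
counts = {name: Counter(zygosity(gt) for gt in gts)
          for name, gts in calls.items()}
for name, counter in counts.items():
    print(name, dict(counter))
```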

### Question 2.3

(5 marks, max 100 words)

What information is contained in the ID column of the MBP VCF file? Explain how this information is useful to users.


-- GRADED CELL (5 marks) - complete this cell --

The ID column contains unique identifiers for the variants (when such identifiers are available). These could be a dbSNP rs identifier, a COSMIC identifier or a custom identifier. Each identifier is unique and indicates the specific known variant at that location. This is useful to users because it lets them look up existing information about the variant at each location.

## Part 3: BLAST

### Setup

ORF-9b has been detected in multiple isolates of SARS coronavirus as well as in other coronaviruses that infect other animals. We would like to determine whether SARS-CoV-2 (responsible for COVID-19) has an ORF that is homologous to ORF-9b in other coronaviruses. We will be using `BLAST` on the command line for this purpose.

We will be using a sequence from an isolate of SARS-CoV-2 sequenced in December 2019 in Wuhan, China. 

>https://www.viprbrc.org/brc/viprStrainDetails.spg?ncbiAccession=MN908947&decorator=corona

Using the ViPR link above, navigate to and download a FASTA file of the DNA sequence for the **nucleocapsid (N) gene** of the SARS-CoV-2 isolate (Do not download the full genome). Rename the file `comp90016_COVID19_N.fasta` and place it in the `data` directory.

Download the `comp90016_9b.fasta` file from the LMS and place it in the same directory as the `comp90016_COVID19_N.fasta` file.

This file contains ORF-9b sequences from a selection of characterised coronaviruses.

`BLAST` is installed on SWAN. If you would like to use your personal computer, installation instructions can be found here https://www.ncbi.nlm.nih.gov/books/NBK279671/.

If you are using a local Anaconda environment you can install BLAST as:
`conda install -c bioconda blast`

In [None]:
# First, build a custom BLAST database from the ORF-9b sequences. 
# Do this with the following command:

!makeblastdb -dbtype nucl -in data/comp90016_assignment_2_9b.fasta

# You will see some extra files have appeared in the data directory.


In [None]:
# Next, perform the BLAST search with the following command.
# The results will be printed in tabulated form. 
# You can pipe the results to a file if you prefer.

!blastn -query data/comp90016_COVID19_N.fasta -db data/comp90016_assignment_2_9b.fasta -outfmt 7
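
If the tabular results are saved to a file, they can be loaded into Python for sorting and filtering. The sketch below assumes the default 12 columns of `-outfmt 7` (query, subject, % identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E value, bit score). The example text, sequence names and numbers are made up and are not real BLAST output.

```python
COLUMNS = ["query", "subject", "pct_identity", "aln_length", "mismatches",
           "gap_opens", "q_start", "q_end", "s_start", "s_end",
           "evalue", "bit_score"]


def parse_outfmt7(text):
    """Parse tab-separated hit lines, skipping '#' comment lines."""
    hits = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.split("\t")))
        for key in ("pct_identity", "evalue", "bit_score"):
            row[key] = float(row[key])
        for key in ("aln_length", "mismatches", "gap_opens",
                    "q_start", "q_end", "s_start", "s_end"):
            row[key] = int(row[key])
        hits.append(row)
    return hits


# Made-up output for illustration only.
EXAMPLE = (
    "# Query: example_query\n"
    "# 2 hits found\n"
    "example_query\tsubject_A\t95.00\t200\t10\t0\t1\t200\t1\t200\t1e-80\t300\n"
    "example_query\tsubject_B\t80.00\t150\t30\t0\t5\t154\t3\t152\t2e-20\t100\n"
)
hits = parse_outfmt7(EXAMPLE)
best = max(hits, key=lambda hit: hit["bit_score"])
print(best["subject"], best["evalue"])
```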

### Questions
In the cells below, complete the following tasks:

### Question 3.1

(5 marks, max 100 words)

Explain one factor that is important when selecting a BLAST database.



-- GRADED CELL (5 marks) - complete this cell --

It is important to consider the size of the database. BLAST compares the query against every sequence in the database to find statistically significant matches, so the larger the database, the more processing power and time are required. With a large enough database the search slows down considerably and the cost can become prohibitive, so database size should be taken into account.

### Question 3.2

(10 marks, max 100 words)

Is there evidence from the BLAST results that the COVID-19 N gene has an internal ORF homologous to ORF-9b sequences in other coronaviruses? Justify your choice. What was the highest scoring hit?


-- GRADED CELL (10 marks) - complete this cell --

There is evidence to suggest this: the BLAST results show hits on coronaviruses other than the human coronavirus, and the highest scoring hit (score of 366) was on the "bat_coronavirus_HKU3_orf_9b" gene. Additionally, the E values for the hits on the ORF-9b sequences are low, which indicates that we can be confident in the accuracy of these results.

### Question 3.3

(5 marks, max 100 words)

Interpret the E value of the top scoring BLAST hit. How confident are you in your BLAST results?


-- GRADED CELL (5 marks) - complete this cell --

- The E value for the top scoring BLAST hit ("bat_coronavirus_HKU3_orf_9b") is 7.49e-105.
- The E value is the number of hits with an equal or better score expected by chance, so very low E values (approaching 0) indicate that the results are very reliable.
- As the E value for the top scoring hit is extremely close to 0, we can be very confident in the reliability of the results.
- A low E value doesn't indicate absolute certainty but rather a very high degree of confidence.
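
The relationship between score and E value can be sketched with the standard bit-score formula $E = m \cdot n \cdot 2^{-S'}$, where $S'$ is the bit score and $m$ and $n$ are the query and database lengths. The lengths below are hypothetical, and BLAST itself uses adjusted (effective) lengths, so this sketch is not expected to reproduce the exact E values reported above.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search space: a 1,000-letter query against 10,000 letters.
for score in (20, 40, 80):
    print(score, expected_hits(score, 1000, 10000))
```

Each extra bit of score halves the expected number of chance hits, while doubling the database size doubles it.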

### Question 3.4

(5 marks, max 50 words)

Modify the provided BLAST command to work with an input of a protein sequence FASTA file (named comp90016_COVID19_N_protein.fasta). Decrease the word-size parameter from the default value to 2. Please use the same custom database and output format used previously. Note that you are not required to execute this command.


-- GRADED CELL (5 marks) - complete this cell --

!tblastn -query data/comp90016_COVID19_N_protein.fasta -db data/comp90016_assignment_2_9b.fasta -outfmt 7 -word_size 2

### Question 3.5

(10 marks, max 100 words)

What effect does decreasing the word size have on the sensitivity (ability to detect true homologs) and run-time of the search? Explain why this is the case.


-- GRADED CELL (10 marks) - complete this cell --

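Decreasing the word size increases sensitivity. Shorter words are more likely to match exactly between the query and a database sequence, so more seed matches are found and more distantly related homologs can be detected. It also increases the likelihood of finding false positives, because short words match by chance more often.

Decreasing the word size increases the run time of the search, because it increases the number of seed matches that must be extended and scored.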
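
The effect of word size on the number of seeds can be illustrated by counting exact word matches between two short sequences. This is a simplified sketch: protein BLAST also seeds on similar ("neighbourhood") words rather than only exact ones, and the two sequences and the `shared_seeds` helper are made up.

```python
def shared_seeds(query, subject, k):
    """Count (query position, subject position) pairs sharing an exact k-letter word."""
    positions = {}
    for j in range(len(subject) - k + 1):
        positions.setdefault(subject[j:j + k], []).append(j)
    return sum(len(positions.get(query[i:i + k], []))
               for i in range(len(query) - k + 1))


# Made-up protein sequences differing at two positions.
query = "MKTAYIAKQR"
subject = "MKSAYLAKQR"
for k in (2, 3, 5):
    print(k, shared_seeds(query, subject, k))  # 5, 2 and 0 seeds respectively
```

With a word size of 5 no seed is found and the similarity would be missed, whereas a word size of 2 yields five seeds, each of which costs time to extend.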

### Question 3.6

(10 marks)

Write a Python function to calculate the percentage sequence similarity between two aligned protein sequences. 

- Assume that pairwise_aln is an `skbio.alignment.TabularMSA` object. 
- Assume sub_matrix is a 2D dictionary encoding an amino-acid substitution matrix, such as the BLOSUM-62 matrix below.
- Two amino acid residues are similar if there is a value greater than 0 in the corresponding cell of the input substitution matrix. 
- The percentage similarity is defined as the number of positions with similar amino acid residues as a percentage of the total number of positions in the alignment. 
- Return a floating-point number between 0 and 100. 
- If pairwise_aln contains fewer than 2 sequences, return None.
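
As a worked instance of this definition, the sketch below uses plain strings and a tiny score table holding only the four BLOSUM-62 entries it needs (M/I = 1, M/A = -1, Y/Y = 7, W/W = 11). The names `SCORES` and `similar_percentage` are hypothetical, and the sketch does not use `skbio`.

```python
# Only the BLOSUM-62 entries needed for the two examples below.
SCORES = {("M", "I"): 1, ("M", "A"): -1, ("Y", "Y"): 7, ("W", "W"): 11}


def similar_percentage(seq_a, seq_b):
    """Percentage of alignment positions whose residue pair scores above 0."""
    similar = 0
    for a, b in zip(seq_a, seq_b):
        # Positions containing a gap are never similar.
        if "-" in (a, b):
            continue
        if SCORES[(a, b)] > 0:
            similar += 1
    # Gapped positions still count towards the alignment length.
    return 100 * similar / len(seq_a)


print(similar_percentage("MYWIW", "IYW--"))  # 3 of 5 positions: 60.0
print(similar_percentage("MYWIW", "AYW--"))  # 2 of 5 positions: 40.0
```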

In [None]:
# Next we import `skbio` so that we can take advantage of skbio.alignment objects 
# for storing and accessing sequence alignments.
import skbio

For this question, we will be using amino-acid substitution matrices, encoded as 2D dictionaries. The cell below stores the BLOSUM-62 substitution matrix as a dictionary (the default substitution matrix for BLAST). The protein sequences used to test your code will only contain the one-letter codes of the 20 standard amino-acids.

In [None]:
BLOSUM_62 = {
    '*':{'*':1,'A':-4,'C':-4,'B':-4,'E':-4,'D':-4,'G':-4,'F':-4,'I':-4,'H':-4,'K':-4,'M':-4,'L':-4,'N':-4,'Q':-4,'P':-4,'S':-4,'R':-4,'T':-4,'W':-4,'V':-4,'Y':-4,'X':-4,'Z':-4},
    'A':{'*':-4,'A':4,'C':0,'B':-2,'E':-1,'D':-2,'G':0,'F':-2,'I':-1,'H':-2,'K':-1,'M':-1,'L':-1,'N':-2,'Q':-1,'P':-1,'S':1,'R':-1,'T':0,'W':-3,'V':0,'Y':-2,'X':-1,'Z':-1},
    'C':{'*':-4,'A':0,'C':9,'B':-3,'E':-4,'D':-3,'G':-3,'F':-2,'I':-1,'H':-3,'K':-3,'M':-1,'L':-1,'N':-3,'Q':-3,'P':-3,'S':-1,'R':-3,'T':-1,'W':-2,'V':-1,'Y':-2,'X':-1,'Z':-3},
    'B':{'*':-4,'A':-2,'C':-3,'B':4,'E':1,'D':4,'G':-1,'F':-3,'I':-3,'H':0,'K':0,'M':-3,'L':-4,'N':3,'Q':0,'P':-2,'S':0,'R':-1,'T':-1,'W':-4,'V':-3,'Y':-3,'X':-1,'Z':1},
    'E':{'*':-4,'A':-1,'C':-4,'B':1,'E':5,'D':2,'G':-2,'F':-3,'I':-3,'H':0,'K':1,'M':-2,'L':-3,'N':0,'Q':2,'P':-1,'S':0,'R':0,'T':-1,'W':-3,'V':-2,'Y':-2,'X':-1,'Z':4},
    'D':{'*':-4,'A':-2,'C':-3,'B':4,'E':2,'D':6,'G':-1,'F':-3,'I':-3,'H':-1,'K':-1,'M':-3,'L':-4,'N':1,'Q':0,'P':-1,'S':0,'R':-2,'T':-1,'W':-4,'V':-3,'Y':-3,'X':-1,'Z':1},
    'G':{'*':-4,'A':0,'C':-3,'B':-1,'E':-2,'D':-1,'G':6,'F':-3,'I':-4,'H':-2,'K':-2,'M':-3,'L':-4,'N':0,'Q':-2,'P':-2,'S':0,'R':-2,'T':-2,'W':-2,'V':-3,'Y':-3,'X':-1,'Z':-2},
    'F':{'*':-4,'A':-2,'C':-2,'B':-3,'E':-3,'D':-3,'G':-3,'F':6,'I':0,'H':-1,'K':-3,'M':0,'L':0,'N':-3,'Q':-3,'P':-4,'S':-2,'R':-3,'T':-2,'W':1,'V':-1,'Y':3,'X':-1,'Z':-3},
    'I':{'*':-4,'A':-1,'C':-1,'B':-3,'E':-3,'D':-3,'G':-4,'F':0,'I':4,'H':-3,'K':-3,'M':1,'L':2,'N':-3,'Q':-3,'P':-3,'S':-2,'R':-3,'T':-1,'W':-3,'V':3,'Y':-1,'X':-1,'Z':-3},
    'H':{'*':-4,'A':-2,'C':-3,'B':0,'E':0,'D':-1,'G':-2,'F':-1,'I':-3,'H':8,'K':-1,'M':-2,'L':-3,'N':1,'Q':0,'P':-2,'S':-1,'R':0,'T':-2,'W':-2,'V':-3,'Y':2,'X':-1,'Z':0},
    'K':{'*':-4,'A':-1,'C':-3,'B':0,'E':1,'D':-1,'G':-2,'F':-3,'I':-3,'H':-1,'K':5,'M':-1,'L':-2,'N':0,'Q':1,'P':-1,'S':0,'R':2,'T':-1,'W':-3,'V':-2,'Y':-2,'X':-1,'Z':1},
    'M':{'*':-4,'A':-1,'C':-1,'B':-3,'E':-2,'D':-3,'G':-3,'F':0,'I':1,'H':-2,'K':-1,'M':5,'L':2,'N':-2,'Q':0,'P':-2,'S':-1,'R':-1,'T':-1,'W':-1,'V':1,'Y':-1,'X':-1,'Z':-1},
    'L':{'*':-4,'A':-1,'C':-1,'B':-4,'E':-3,'D':-4,'G':-4,'F':0,'I':2,'H':-3,'K':-2,'M':2,'L':4,'N':-3,'Q':-2,'P':-3,'S':-2,'R':-2,'T':-1,'W':-2,'V':1,'Y':-1,'X':-1,'Z':-3},
    'N':{'*':-4,'A':-2,'C':-3,'B':3,'E':0,'D':1,'G':0,'F':-3,'I':-3,'H':1,'K':0,'M':-2,'L':-3,'N':6,'Q':0,'P':-2,'S':1,'R':0,'T':0,'W':-4,'V':-3,'Y':-2,'X':-1,'Z':0},
    'Q':{'*':-4,'A':-1,'C':-3,'B':0,'E':2,'D':0,'G':-2,'F':-3,'I':-3,'H':0,'K':1,'M':0,'L':-2,'N':0,'Q':5,'P':-1,'S':0,'R':1,'T':-1,'W':-2,'V':-2,'Y':-1,'X':-1,'Z':3},
    'P':{'*':-4,'A':-1,'C':-3,'B':-2,'E':-1,'D':-1,'G':-2,'F':-4,'I':-3,'H':-2,'K':-1,'M':-2,'L':-3,'N':-2,'Q':-1,'P':7,'S':-1,'R':-2,'T':-1,'W':-4,'V':-2,'Y':-3,'X':-1,'Z':-1},
    'S':{'*':-4,'A':1,'C':-1,'B':0,'E':0,'D':0,'G':0,'F':-2,'I':-2,'H':-1,'K':0,'M':-1,'L':-2,'N':1,'Q':0,'P':-1,'S':4,'R':-1,'T':1,'W':-3,'V':-2,'Y':-2,'X':-1,'Z':0},
    'R':{'*':-4,'A':-1,'C':-3,'B':-1,'E':0,'D':-2,'G':-2,'F':-3,'I':-3,'H':0,'K':2,'M':-1,'L':-2,'N':0,'Q':1,'P':-2,'S':-1,'R':5,'T':-1,'W':-3,'V':-3,'Y':-2,'X':-1,'Z':0},
    'T':{'*':-4,'A':0,'C':-1,'B':-1,'E':-1,'D':-1,'G':-2,'F':-2,'I':-1,'H':-2,'K':-1,'M':-1,'L':-1,'N':0,'Q':-1,'P':-1,'S':1,'R':-1,'T':5,'W':-2,'V':0,'Y':-2,'X':-1,'Z':-1},
    'W':{'*':-4,'A':-3,'C':-2,'B':-4,'E':-3,'D':-4,'G':-2,'F':1,'I':-3,'H':-2,'K':-3,'M':-1,'L':-2,'N':-4,'Q':-2,'P':-4,'S':-3,'R':-3,'T':-2,'W':11,'V':-3,'Y':2,'X':-1,'Z':-3},
    'V':{'*':-4,'A':0,'C':-1,'B':-3,'E':-2,'D':-3,'G':-3,'F':-1,'I':3,'H':-3,'K':-2,'M':1,'L':1,'N':-3,'Q':-2,'P':-2,'S':-2,'R':-3,'T':0,'W':-3,'V':4,'Y':-1,'X':-1,'Z':-2},
    'Y':{'*':-4,'A':-2,'C':-2,'B':-3,'E':-2,'D':-3,'G':-3,'F':3,'I':-1,'H':2,'K':-2,'M':-1,'L':-1,'N':-2,'Q':-1,'P':-3,'S':-2,'R':-2,'T':-2,'W':2,'V':-1,'Y':7,'X':-1,'Z':-2},
    'X':{'*':-4,'A':-1,'C':-1,'B':-1,'E':-1,'D':-1,'G':-1,'F':-1,'I':-1,'H':-1,'K':-1,'M':-1,'L':-1,'N':-1,'Q':-1,'P':-1,'S':-1,'R':-1,'T':-1,'W':-1,'V':-1,'Y':-1,'X':-1,'Z':-1},
    'Z':{'*':-4,'A':-1,'C':-3,'B':1,'E':4,'D':1,'G':-2,'F':-3,'I':-3,'H':0,'K':1,'M':-1,'L':-3,'N':0,'Q':3,'P':-1,'S':0,'R':0,'T':-1,'W':-3,'V':-2,'Y':-2,'X':-1,'Z':4}}

In [None]:
# GRADED CELL 3.6 (10 marks, max 1 min run-time)

def percentage_similarity(pairwise_aln, sub_matrix):
    """
    Calculate the percentage sequence similarity between two aligned protein sequences. 
    Assume that pairwise_aln is an skbio.alignment.TabularMSA object. 
    Assume sub_matrix is a 2D dictionary encoding an amino-acid substitution matrix. 
    Two amino acid residues are similar if there is a value greater than 0 in the corresponding cell of sub_matrix. 
    The percentage similarity is defined as the number of positions with similar amino acid residues as a percentage of the total number of positions in the alignment.
    Return a floating point number between 0 and 100. 
    If pairwise_aln contains fewer than 2 sequences, return None.
    """
   
    # YOUR CODE HERE
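    # An alignment needs at least two sequences to compare.
    if len(pairwise_aln) < 2:
        return None

    # Convert both aligned sequences to plain strings once.
    first_seq = str(pairwise_aln[0])
    second_seq = str(pairwise_aln[1])
    seq_length = len(first_seq)

    # Count positions whose residue pair scores above 0 in sub_matrix.
    similars = 0
    for amino_a, amino_b in zip(first_seq, second_seq):

        # A position containing a gap is never similar.
        if amino_a == '-' or amino_b == '-':
            continue
        if sub_matrix[amino_a][amino_b] > 0:
            similars += 1

    # Gapped positions still count towards the alignment length.
    return (similars / seq_length) * 100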

In [None]:
# ~~ Test your function in this cell ~~
demo_sequence_a = skbio.sequence.Protein('MYWIW')
demo_sequence_b = skbio.sequence.Protein('IYW--')
demo_pairwise_prot_aln = skbio.TabularMSA([demo_sequence_a, demo_sequence_b])

sequence_a = skbio.sequence.Protein('MYGEGEPGGWQDHVTVLATRRHPKWAQAWVSTMPWGYECGFSRAWVHQTPWINV-----VSLSSHEAYGVVAVRHPWEIFSPYEVYAPYVQDTQHHGNPGQFTTSCYPDE')
sequence_b = skbio.sequence.Protein('MYADGEPGAWQDHMTVLAIYWHHKWAHAWVSTMPWSYECGFSRAWVHQTPWINVIRFTQVSLSSRAWYGILAVRHPWEIFSPYDVYAPYVAATQHHGNPGQFSTSCYP--')
pairwise_prot_aln = skbio.TabularMSA([sequence_a, sequence_b])

print(percentage_similarity(demo_pairwise_prot_aln, BLOSUM_62)) # should return 60.0

print(percentage_similarity(pairwise_prot_aln, BLOSUM_62))

In [None]:
# --- AUTOGRADING CELL DO NOT EDIT ----
