## DexB2 Resit August 2024

In [None]:
# Analysis modules
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Lucy is interested in the evolution of the ARP family of genes.  These are single copy in most species and encode a protein about 370 amino acids long.  There's more information about them here:

https://www.uniprot.org/uniprotkb/O80931/entry


![alternative text](AS1.png)

NCBI (National Center for Biotechnology Information) maintains a database of all protein sequences.  BLAST is a program which searches this database looking for matches.  Matches are given a score for the number of matches, mismatches and gaps, the percentage identity to the search protein, and the chances of finding a match as good as this in the database by chance alone (e-value/E()).




Lucy has searched the NCBI database with two different ARP genes to find all the ARP-like proteins.  AS1 is an ARP family gene from Arabidopsis, BARP2 is an ARP family gene from Begonia.  

The BLAST hits for AS1 are in:
AS1_NCBI.csv

The BLAST hits for BARP2 are in:
BARP2_NCBI.tsv
    
AS1_NCBI.csv is comma delimited.  It does not have a header row, but the columns are:
    
 *Description*,	Information about the protein which has matched  
 *Organism*,	Species the match comes from   
 *Common Name*,	common name of the species the match comes from  
 *Score(Bits)*,	Score for the match (higher = better)  
 *Query Cover*,	The percentage of the Query sequence covered by the match  
 *E()*,	The e-value of the match (likelihood of match by chance alone)  
 *Identities(%)*,	The percentage of identical amino acids in the match  
 *Length*,	The length of the matched protein (in amino acids)  
 *Accession*, The unique accession number of the matched protein in the NCBI database  
    
BARP2_NCBI.tsv is tab delimited.  It does have a header row.  

*Hit*	The number of the match in the blast output  
*DB*	Code for the database searched  
*Accession*	The unique accession number of the matched protein in the NCBI database  
*Description*	Information about the protein which has matched  
*Organism*	Species the match comes from   
*Length*	The length of the matched protein (in amino acids)  
*Score(Bits)*	Score for the match (higher = better)  
*Identities(%)*	The percentage of identical amino acids in the match  
*Positives(%)*	The percentage of similar amino acids in the match  
*E()* The e-value of the match (likelihood of match by chance alone)  


She needs to combine the results of these searches into a single dataframe listing all the ARP proteins found.  
She needs to see how common gene duplication is for this protein.  
She needs to see if there is any evidence that after gene duplication there are changes in the proteins, such as loss/gain of sequence.

#### Checking the searches.  

Read in both files, adding a header row to AS1_NCBI.csv.  Check the head and tail of the dataframes to be sure they've read in correctly
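
As a minimal sketch of the reading step, the example below uses two small in-memory tables in place of the real files, so every row in it is made up. It assumes `pd.read_csv` with `header=None` and `names=` for the header-less comma-delimited file, and `sep="\t"` for the tab-delimited file; with the real data the file names would be passed instead of the `io.StringIO` objects.

```python
import io
import pandas as pd

# Column names for the header-less AS1 file, taken from the list above
as1_columns = ["Description", "Organism", "Common Name", "Score(Bits)",
               "Query Cover", "E()", "Identities(%)", "Length", "Accession"]

# Made-up rows standing in for AS1_NCBI.csv (no header row)
toy_csv = io.StringIO(
    "protein A,Species one,common one,70,100%,0.0,100.0,90,ACC001\n"
    "protein B,Species two,common two,45,98%,1e-50,62.5,85,ACC002\n"
)

# Made-up rows standing in for BARP2_NCBI.tsv (header row present)
toy_tsv = io.StringIO(
    "Hit\tDB\tAccession\tDescription\tOrganism\tLength\t"
    "Score(Bits)\tIdentities(%)\tPositives(%)\tE()\n"
    "1\tNR\tACC003\tprotein C\tSpecies three\t88\t69\t99.0\t99.5\t0.0\n"
)

# header=None stops pandas using the first data row as column names
as1 = pd.read_csv(toy_csv, header=None, names=as1_columns)

# sep="\t" reads tab-delimited text; the first row supplies the header
barp2 = pd.read_csv(toy_tsv, sep="\t")

# Check both ends of each dataframe and the number of rows read
print(as1.head())
print(barp2.tail())
print(as1.shape, barp2.shape)
```

The first element of `.shape` is the number of data rows, which is a quick check that no row was swallowed as a header.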

<div class = "alert alert-danger">
Q1 How many lines of data are in AS1_NCBI.csv?   (1 mark)
 
    
    a 567  
    b 772  
    c 998  
    d 1004 
    e 5768  
    
Enter your answer on LEARN   

These hits include several cases where different proteins from the same species are found.  

<div class = "alert alert-danger">
Q2 How many unique species (labelled 'Organism' in the dataframe) are in the AS1 searches?

    a 332
    b 172
    c 729
    d 468
    e 513
    
Enter your answer on LEARN

Have AS1 and BARP2 found matches in different ranges of species?  Find the overlap in species between the two dataframes.  

Useful commands:

    .unique()
    set(x) - set(y)
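
The two commands above can be combined roughly as follows. This is a sketch on made-up species names, not the real search results; the dataframe names `as1` and `barp2` are assumptions.

```python
import pandas as pd

# Toy stand-ins for the two search results (species names are invented)
as1 = pd.DataFrame({"Organism": ["Species one", "Species two",
                                 "Species two", "Species three"]})
barp2 = pd.DataFrame({"Organism": ["Species two", "Species four"]})

# .unique() drops repeated species; set() allows set arithmetic
as1_species = set(as1["Organism"].unique())
barp2_species = set(barp2["Organism"].unique())

# Species found by one search but not the other, and by both
only_in_as1 = as1_species - barp2_species
shared = as1_species & barp2_species

print(len(as1_species))
print(sorted(only_in_as1))
print(sorted(shared))
```

Set difference is not symmetric, so `set(x) - set(y)` and `set(y) - set(x)` generally give different results.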

<div class = "alert alert-danger">
Q3 The number of species found in the AS1 search, but not in the BARP2 search is:

    a 23
    b 75
    c 156
    d 254
    e 305
    
Enter your answer on LEARN

#### Combine the dataframes  
Simplify the dataframes so they contain only the following columns:  
'Accession', 'Organism', 'Length', 'Score(Bits)', 'Identities(%)', 'E()'

Concatenate the dataframes vertically to add the AS1 search to the BARP2 search.  
You will need to re-set the index.
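
One possible way to do this, shown on invented rows: select the shared columns from each dataframe, stack them with `pd.concat`, and rebuild the index with `reset_index(drop=True)`. The variable names here are assumptions.

```python
import pandas as pd

# The columns to keep, as listed above
keep = ["Accession", "Organism", "Length",
        "Score(Bits)", "Identities(%)", "E()"]

# Toy dataframes with one extra column each (all values are made up)
as1 = pd.DataFrame({
    "Description": ["protein A", "protein B"],
    "Organism": ["Species one", "Species two"],
    "Score(Bits)": [70, 45], "E()": [0.0, 1e-50],
    "Identities(%)": [100.0, 62.5], "Length": [90, 85],
    "Accession": ["ACC001", "ACC002"],
})
barp2 = pd.DataFrame({
    "Hit": [1], "Accession": ["ACC003"], "Organism": ["Species two"],
    "Length": [88], "Score(Bits)": [69],
    "Identities(%)": [99.0], "E()": [0.0],
})

# Selecting with a list of names also puts the columns in the same order
combined = pd.concat([barp2[keep], as1[keep]], axis=0)

# Without this, the index would repeat (0, 0, 1); drop=True discards the old one
combined = combined.reset_index(drop=True)

print(combined)
print(combined["Organism"].nunique())
```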

<div class = "alert alert-danger">
Q4 The number of unique species in this joint dataframe is:

    a 152
    b 267
    c 532
    d 808
    e 1325
    
Enter your answer on LEARN

#### Plot the scores  
The BLAST score counts up the number of exact matches, near matches, mismatches and gaps in an alignment.  'Real' matches have high scores.  Matches which are less likely to be the same protein, or which come from evolutionarily very distant species, have lower scores.  Usually this shows up as a cut-off in the distribution of scores, below which the number of hits sharply increases.

Examine the distribution of scores in the combined dataframe, to see if there is a cut-off which we can use to filter the real ARP proteins from other similar proteins.
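
A labelled histogram can be sketched with matplotlib as below. The scores are invented and on an arbitrary scale, and the label and title text are only suggestions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented scores: a low-scoring cluster and a high-scoring cluster
scores = pd.Series([12, 13, 15, 16, 18, 61, 64, 65, 70],
                   name="Score(Bits)")

fig, ax = plt.subplots()

# hist returns the bin counts, the bin edges and the drawn bars
counts, edges, bars = ax.hist(scores, bins=10)

# Axis labels and title are set on the Axes object
ax.set_xlabel("BLAST score (bits)")
ax.set_ylabel("Number of hits")
ax.set_title("Distribution of BLAST scores (toy data)")
```

Changing `bins` can make a gap between the two groups of scores easier or harder to see, so it is worth trying more than one value.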

<div class = "alert alert-danger">
Q5 Plot a histogram of BLAST scores.  Label the x and y axes and give the graph an appropriate title.  
   Upload the plot to LEARN (2 marks)

<div class = "alert alert-danger">
Q6 A reasonable cut-off for 'good hits' is:

    a >250
    b >375
    c >500
    d >650
    e >700
    
Enter your answer on LEARN

#### Filtering. 

Filter the dataframe to keep only the good hits.
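
Filtering is usually done with a boolean mask. In this sketch both the scores and the value of `cutoff` are placeholders on an arbitrary scale; the real cut-off has to come from your own histogram.

```python
import pandas as pd

# Toy combined dataframe (all values are made up)
combined = pd.DataFrame({
    "Accession": ["ACC001", "ACC002", "ACC003", "ACC004"],
    "Organism": ["Species one", "Species two",
                 "Species two", "Species three"],
    "Score(Bits)": [20, 35, 80, 90],
})

# Placeholder threshold for illustration only
cutoff = 50

# The comparison gives a True/False Series; indexing keeps the True rows.
# .copy() makes an independent dataframe that is safe to modify later.
good = combined[combined["Score(Bits)"] > cutoff].copy()

print(good)
```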

#### How many copies in each species?  
To find the number of copies of ARP genes in each species we need to group the dataframe by species and count the records.  

    use .groupby() and .count()

Tidy this dataframe so you have only two columns - 'Organism' and a single column of counts which is the copy number.  Rename this column 'Copy_number'.  Reset the index.
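
The grouping and tidying steps might look like this on invented data. After `.groupby().count()` every remaining column holds the same count, so the sketch keeps one of them ('Accession', an arbitrary choice) and renames it.

```python
import pandas as pd

# Toy filtered dataframe of good hits (all values are made up)
good = pd.DataFrame({
    "Accession": ["ACC001", "ACC002", "ACC003"],
    "Organism": ["Species one", "Species one", "Species two"],
    "Length": [90, 88, 85],
})

# Count the rows per species; 'Organism' becomes the index
counts = good.groupby("Organism").count()

# Keep one count column, rename it, and turn the index back into a column
copies = (counts[["Accession"]]
          .rename(columns={"Accession": "Copy_number"})
          .reset_index())

print(copies)
```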

<div class = "alert alert-danger">
Q7 Make a histogram to show the count of copies per species. 
    Label the y-axis "Copy number of ARP per species", give an appropriate title.   
    
Upload the plot to LEARN (2 marks)

Subset the dataframe to find the species with a copy number of over 4.

<div class = "alert alert-danger">
Q8 Of the following options, which species has the most copies of the ARP gene?

    a Ananas comosus
    b Vigna angularis var. angularis
    c Zingiber officinale
    d Ziziphus jujuba
    e Tripterygium wilfordii
    
Enter your answer on LEARN

#### What happens to duplicated ARP proteins?
Is there more sequence variation where there are multiple copies of ARP?

Join the copy number dataframe to the hit-score (good hits) filtered dataframe:  

Columns should now be:

        Accession	Organism	Length	Score(Bits)	Identities(%)	E()	Copy_number
        
With one row for each Accession.
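
A join of this kind can be sketched with `DataFrame.merge` on the shared 'Organism' column. The data are invented, and `how="left"` is one reasonable choice because it keeps exactly one row per accession from the hits dataframe.

```python
import pandas as pd

# Toy good-hits dataframe (all values are made up)
good = pd.DataFrame({
    "Accession": ["ACC001", "ACC002", "ACC003"],
    "Organism": ["Species one", "Species one", "Species two"],
    "Length": [90, 88, 85],
    "Score(Bits)": [70, 68, 66],
    "Identities(%)": [100.0, 95.0, 80.0],
    "E()": [0.0, 0.0, 1e-90],
})

# Toy copy-number dataframe, one row per species
copies = pd.DataFrame({
    "Organism": ["Species one", "Species two"],
    "Copy_number": [2, 1],
})

# Each hit picks up the copy number of its species
merged = good.merge(copies, on="Organism", how="left")

print(merged)
```

Comparing `len(merged)` with `len(good)` is a quick way to confirm that the join neither dropped nor duplicated any hits.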

<div class = "alert alert-danger">
Q9 Make a violin plot to show the % Identity by Copy number. Label the X and Y axes and give an appropriate title.
    
    
   Upload the plot to LEARN (2 marks)
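
For reference, a violin plot of one column grouped by another can be drawn with seaborn's `violinplot`. This is a sketch on invented values; the label and title text are only suggestions.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy joined dataframe (all values are made up)
toy = pd.DataFrame({
    "Copy_number": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Identities(%)": [90.0, 92.0, 91.0, 80.0, 85.0,
                      88.0, 60.0, 75.0, 82.0],
})

fig, ax = plt.subplots()

# One violin per copy number, showing the spread of identity values
sns.violinplot(data=toy, x="Copy_number", y="Identities(%)", ax=ax)

ax.set_xlabel("Copy number")
ax.set_ylabel("Identity (%)")
ax.set_title("Identity by copy number (toy data)")
```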

#### Is there a pattern to the length of ARP proteins?

Sometimes BLAST hits detect a full-length protein and sometimes a truncated protein.  It's important to distinguish these, as truncated proteins may have a different function or have lost function.  Usually, full-length proteins will have similar lengths to one another, and truncated proteins will be clearly shorter.

Using the filtered data frame of good hits, next investigate the length of ARP proteins.  

<div class = "alert alert-danger">
Q10 Plot a histogram of the length of ARP proteins in amino acids.  Bin the data into 50 bins.   Label the X and Y axes and give an appropriate title.
    
    
   Upload the plot to LEARN (2 marks)

<div class = "alert alert-danger">
Q11 What is a good dividing line between truncated and full length ARP proteins?

    a 200 amino acids long
    b 300 amino acids long
    c 350 amino acids long
    d 450 amino acids long
    e 500 amino acids long
    
Enter your answer on LEARN (1 mark)

Is there any evidence that the short proteins are not real ARP proteins?  Plot identity by length in a scatter plot.

<div class = "alert alert-danger">
Q12 Plot identity by length in a scatter plot.  Use jointplot to show histograms as well, and make the points small enough to distinguish (for example s=5).  Label the X and Y axes and give an appropriate title.
    
As this is a jointplot you have to set the title and labels with:
    ax.fig.suptitle('Title here')
    ax.set_axis_labels('X label', 'Y label')
    
    
   Upload the plot to LEARN (2 marks)

Categorise each protein as full length or truncated in a new column called 'Size'.  

Assign values based on a conditional length range of your choice, for example using DataFrame.loc[]. 
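
A conditional assignment with `.loc` might be sketched like this. The lengths are invented and on an arbitrary scale, and `length_cutoff` is a placeholder, not a suggested answer.

```python
import pandas as pd

# Toy good-hits dataframe (all values are made up)
toy = pd.DataFrame({
    "Accession": ["ACC001", "ACC002", "ACC003", "ACC004"],
    "Organism": ["Species one", "Species one",
                 "Species two", "Species three"],
    "Length": [40, 95, 45, 100],
})

# Placeholder dividing line for illustration only
length_cutoff = 70

# Start with one label everywhere, then overwrite rows matching the condition
toy["Size"] = "full length"
toy.loc[toy["Length"] < length_cutoff, "Size"] = "truncated"

# For each species, check whether every one of its hits is truncated
all_truncated = toy.groupby("Organism")["Size"].apply(
    lambda sizes: (sizes == "truncated").all())
only_truncated = all_truncated[all_truncated].index.tolist()

print(toy)
print(only_truncated)
```

Individual species can then be checked row by row with a mask such as `toy[toy["Organism"] == "Species one"]`.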

<div class = "alert alert-danger">
Q13 Which of these species has only truncated ARP genes?  Look up the matching dataframe rows to check.
    
    a Sesamum indicum
    b Solanum lycopersicum
    c Momordica charantia
    d Phtheirospermum japonicum
    e Daucus carota subsp. sativus
    
Enter your answer on LEARN (2 marks)

#### Do the truncated proteins show more variation than full length proteins?

Compare the average identity for short and long proteins.
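
As a sketch of the comparison, the example below computes the group means and runs a two-sample t-test with `scipy.stats.ttest_ind`. SciPy is not among the modules imported at the top of this notebook, so it is imported here. The values are invented, and `equal_var=False` (Welch's test, which does not assume equal variances) is a choice made for this sketch rather than a requirement of the question.

```python
import pandas as pd
from scipy import stats

# Toy categorised dataframe (all values are made up)
toy = pd.DataFrame({
    "Size": ["truncated"] * 4 + ["full length"] * 4,
    "Identities(%)": [50.0, 52.0, 48.0, 51.0,
                      70.0, 72.0, 69.0, 71.0],
})

# Mean identity for each size category
print(toy.groupby("Size")["Identities(%)"].mean())

# Split the identity values into the two groups being compared
short = toy.loc[toy["Size"] == "truncated", "Identities(%)"]
full = toy.loc[toy["Size"] == "full length", "Identities(%)"]

# Two-sided test by default; the result carries .statistic and .pvalue
result = stats.ttest_ind(short, full, equal_var=False)

print(result.statistic, result.pvalue)
```

A negative statistic here means the first group (truncated) has the lower mean; the p-value is then compared with the chosen significance level.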

<div class = "alert alert-danger">
Q14 Use a t-test to see if truncated proteins are less similar to ARP than full length proteins (by 'Identities(%)').

    a Yes, pvalue = 0.01
    b No, pvalue > 0.05
    c Yes, pvalue = 0.05
    d No, pvalue > 0.001
    e Yes, pvalue = 5%
    
Enter your answer on LEARN (2 marks)

<div class = "alert alert-danger">
Upload your notebook to LEARN