# Lab 6: Polymorphisms and Mutations

In this lab, we will compare polymorphisms in two genes, HLA-A and H2A. HLA-A is an immune-related protein encoded by the HLA-A gene, and is an MHC antigen. H2A is a histone protein that helps stabilize chromatin structure.

Part 1 of the lab will involve following a tutorial using the UCSC Genome Browser. Part 2 of the lab will include answering coding questions in this notebook.

## Import the Dependencies

In [None]:
# imports
from datascience import Table  # table type used to load the sequence data
import matplotlib
matplotlib.use('Agg')
# render plots inside the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')


## Part 1: Exploring polymorphisms in the UCSC Genome Browser

In this lab, we will use the UCSC Genome Browser to visualize polymorphisms in the HLA-A
and H2A proteins. By the end of this lab, you will have a better understanding of sequence conservation
and how conservation differs across proteins.

### Exploring HLA-A

First, we will learn about the HLA-A gene.


1. To access the UCSC Genome Browser, go to <a href="http://genome.ucsc.edu/" target="_blank">UCSC Genome Browser</a>

2. Under 'Popular Species', select **Human**. Under Human Assembly, select the Dec. 2013 (GRCh38/hg38) genome build. Click **Go**.
![UCSC Home Page](https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/images/lab6/ucscHomepage.png)

3. Click on 'Genome Browser' under 'Our Tools', highlighted in red in the image.
![UCSC Human](https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/images/lab6/human_hg38.png)
4. To reset the view and hide unused tracks, click the **hide all** button below the viewer.
![Hide All](https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/images/lab6/hideAll.png)
5. Type in **HLA-A** in the search bar, highlighted in red.
![UCSC Gateway](https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/images/lab2/ucscGateway.png)
6. Scroll down to the list of RefSeq genes (**Known Genes**) and click on the option in which HLA-A is on chr6. This should
redirect you back to the browser.
![RefSeq HLA](https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/images/lab6/hla_refseq.png)
7. In the HLA-A track, click on the label for HLA-A. This should redirect you to a new screen. Given the information
on this screen, answer the following questions.
![View HLA](https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/images/lab6/viewHLA.png)

### Questions: Exploring HLA-A

<h2 style="color:red">** Question 1**</h2> 
What does 'HLA' stand for? **Note:** You may have to use external resources to find this.


---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 2**</h2> 
What is the location (chromosome and position) of HLA-A?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 3**</h2> 
How many base pairs long is HLA-A?


---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 4**</h2> 
How many exons does HLA-A have?



---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 5**</h2> 
Describe the peptides encoded by exons 1-7 of HLA-A (e.g., exon 2 encodes peptide ~BLANK~). This information can be found in **Summary of HLA-A**.


---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 6**</h2> 
What functional properties of HLA-A make it essential to the immune system response? **Note:** If you find that this page does not provide enough information
on the HLA-A gene, you can use a cited external source.


---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

Now, let's navigate back to the browser to explore variation of HLA-A.

1. Navigate back to the Genome Browser viewer by clicking the back arrow.
2. Scroll down to track options, below the visualization.
3. Navigate to the Variation section.
4. Toggle 'Common SNPs' and 'Flagged SNPs' to 'squish', as shown in the image.
![variation](https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/images/lab6/toggle_variation.png)
5. Scroll back up to the visualization to view HLA-A and associated SNPs.

### Questions: HLA-A SNP Density

<h2 style="color:red">** Question 7**</h2> 
What is a SNP? How can the presence of SNPs affect protein function?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 8**</h2> 
What is a common SNP? A flagged SNP? Click on the labels under the **Variation** section to gather this information.

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 9**</h2> 
Navigate back to the browser, if you are not already there. Make sure you are zoomed out so that you can see the whole HLA-A gene. In what areas do you see the highest density of SNPs?
Is there any reasoning behind this SNP distribution?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 10**</h2> 
Do you see a greater density in any particular exons? If so, which two exons have the highest SNP density?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 11**</h2> 
Navigate to <a href="http://www.microbiologybook.org/bowers/mhc.htm" target="_blank">Microbiology Book</a> in a new tab, and find **Figure 2**. Read the caption and look at the Figure. Given what you know about HLA-A exons and their corresponding peptide domains, what can you say about the exons with the greatest SNP density and the function of the domains that they produce?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

Now, navigate back to the UCSC Genome Browser, zoom in on HLA-A, and click on any clinically associated SNP in the 2nd or 3rd exon. This should take you to a new page.


### Questions: HLA-A SNP Search

<h2 style="color:red">** Question 12**</h2> 
Let's get more information on the specific SNP you clicked on. What is the reference allele for this SNP? What is the position of this SNP?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 13**</h2> 
What, if any, is the **Function** of this SNP? What does this function mean?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

### Exploring H2A
Now, we will explore gene H2A, or HIST3H2A.

1. Navigate back to the Genome Browser Viewer by clicking the back button.
2. Type in **HIST3H2A** in the search bar.
3. In the resulting list of **Known Genes**, click on the **first** option, called HIST3H2A. This should redirect you back to the browser.
4. In the new track, click on the label for HIST3H2A. This should redirect you to a new screen. Given the information on this screen, answer the following questions.

### Questions: Exploring H2A

<h2 style="color:red">** Question 14**</h2> 
What does 'HIST3H2A' stand for? **Note:** You may have to use external resources to find this.

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 15**</h2> 
What is the location (chromosome and position) of H2A?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 16**</h2> 
How long is the gene?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 17**</h2> 
Read the RefSeq Summary for H2A. What important roles does H2A play? **Note:** If this page does not provide enough information
on the H2A gene, google to find more information.

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

Now, let's navigate back to the browser to explore variation of H2A. Make sure that Common SNPs and Flagged SNPs are still being displayed. 


### Questions: H2A SNP Density

Navigate back to the Genome Browser Viewer by clicking the back arrow in your browser.



<h2 style="color:red">** Question 18**</h2> 
Zoom out to see the whole H2A gene. In what areas, if any, do you see the highest density of SNPs?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 19**</h2> 
How does the SNP distribution in H2A compare to the distribution in HLA-A? What functional properties of these genes would cause this difference in SNP distributions?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

## Part 2: Multiple Sequence Alignment of HLA-A sequences

### Multiple Sequence Alignment of HLA-A sequences

Now, we will align three HLA-A sequences from different individuals. This requires a multiple sequence alignment (MSA) algorithm. We have applied an MSA algorithm to these sequences for you. The algorithm we used is called T-Coffee, and can be found here: http://tcoffee.vital-it.ch/apps/tcoffee/do:regular There, you can use T-Coffee to align any number of sequences.

<h2 style="color:red">** Question 20**</h2> 

Take a look at the T-Coffee website (http://tcoffee.vital-it.ch/apps/tcoffee/references.html) and summarize what the algorithm does.


---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

Now, let's load data for 3 HLA-A sequences to align.

In [None]:
# load the aligned HLA-A sequence data
table = Table.read_table('https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/data/lab6/hla-a_msa.csv')
table


At each position of the alignment, the three sequences either match, contain a mismatch, or contain a dash ('-'). A dash marks a gap: either a deletion in that sequence or an insertion in the other sequences.
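
To make these three cases concrete, here is a minimal sketch that labels each column of a small alignment. The function name `classify_columns` and the toy sequences are made up for illustration; they are not part of the lab data or of T-Coffee.

```python
# Hypothetical helper: label each alignment column as a match, mismatch, or gap.
def classify_columns(*aligned):
    labels = []
    # zip(*aligned) walks the alignment one column at a time
    for column in zip(*aligned):
        if "-" in column:
            labels.append("gap")        # at least one sequence has a dash here
        elif len(set(column)) == 1:
            labels.append("match")      # every sequence has the same character
        else:
            labels.append("mismatch")   # no dash, but the characters disagree
    return labels

# Toy aligned sequences (assumed for illustration only)
toy_alignment = ["ACG-T", "ACGAT", "ATGAT"]
toy_labels = classify_columns(*toy_alignment)
print(toy_labels)  # ['match', 'mismatch', 'match', 'gap', 'match']
```

In this toy alignment, the dash in the first sequence could reflect a deletion in that individual or an insertion shared by the other two; the alignment alone does not say which.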

Now, we will visualize the sequence differences by counting the number of differences at each location of the three sequences.

First, we define a function, called getDifferenceVector(), that returns the counts of differences between three sequences at each location.

In [None]:
# Function that counts, at each position, how many of the three sequence pairs differ.
# Sequences must be the same length.
def getDifferenceVector(seq1,seq2,seq3):
    # all three sequences must have equal length
    if not (len(seq1) == len(seq2) == len(seq3)):
        raise ValueError('the input strings are not of the same length.')
    # create lists to hold the positions and the per-position counts
    counts = [0] * len(seq1)
    index = [0] * len(seq1)
    for i in range(len(seq1)):
        # compare each of the three pairs at position i
        if (seq1[i]!=seq2[i]):
            counts[i]+=1
        if (seq1[i]!=seq3[i]):
            counts[i]+=1
        if (seq2[i]!=seq3[i]):
            counts[i]+=1
        index[i] = i
    return (index, counts)
    

Using this function, we will compute the differences between all three sequences and plot this difference vector.
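
Before applying this to the real data, it can help to see the counting rule on a toy input. The sketch below mirrors the pairwise rule of getDifferenceVector() in a compact form; the name `pairwise_difference_counts` and the toy sequences are assumptions for illustration and do not come from the lab data.

```python
from itertools import combinations

# Hypothetical compact version of the pairwise counting rule:
# for each column, count how many sequence pairs disagree.
def pairwise_difference_counts(*seqs):
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("all sequences must have the same length")
    return [
        sum(a != b for a, b in combinations(column, 2))
        for column in zip(*seqs)
    ]

# Toy aligned sequences (assumed for illustration only)
toy1 = "ACGT-AA"
toy2 = "ACCT-AC"
toy3 = "ATGTTAG"
toy_counts = pairwise_difference_counts(toy1, toy2, toy3)
print(toy_counts)  # [0, 2, 2, 0, 2, 0, 3]
```

With three sequences, a column scores 0 when all three agree, 2 when exactly one sequence differs from the other two, and 3 when all three differ. A score of 1 cannot occur, which is worth keeping in mind when reading the plot.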

<h2 style="color:red">** Question 21**</h2> 
Fill in the blanks below to plot the difference vector across the three HLA-A sequences.

In [None]:
# TODO: assign sequences seq1,seq2,seq3

# YOUR CODE HERE
raise NotImplementedError()

(index, counts) = getDifferenceVector(seq1,seq2,seq3)
plt.plot(index, counts)
plt.ylabel('Differences')
plt.xlabel('Position')

plt.show()

<h2 style="color:red">** Question 22**</h2> 
How does this sample distribution of mismatches compare to the actual distribution for HLA-A we saw in the UCSC Genome Browser? Where in HLA-A do these locations of increased mismatches occur? Think about the exon positions and the variability of each exon.


---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 23**</h2> 

Based on the distribution of mismatches for these three sequences for HLA-A, how would you expect the mismatch distribution for H2A to look?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

## Part 3: Extra for Experts

<h2 style="color:red">** Bonus Question 1**</h2> 

Choose a SNP overlapping HLA-A from the UCSC Genome Browser (This can be the same SNP you chose in part 1 of the lab.) Record the dbSNP name.  Does this SNP appear in any of the sequences above? How can you verify this? (Hint: This will require coding)

In [None]:
# YOUR CODE HERE
raise NotImplementedError()