# Lab 7: The Cancer Genome

In [None]:
# imports
from datascience import Table
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')
from sklearn.cluster import KMeans

In this lab, we will explore mutational signatures. A **mutational signature** is a collection of mutations that can be used to categorize and define specific cancer types. Today, we will focus on mutational signatures that are linked to smoking tobacco. The information in this lab is taken from the paper https://www.nature.com/articles/nature12477.pdf.

## Part 1

For Part 1 of the lab, we will compare mutation rates between smokers and nonsmokers in three different types of cancer. The data for this lab were extracted from http://cancer.sanger.ac.uk/cosmic/signatures.

First, let's load in the first dataset. 

In [None]:
# load in data for part 1
table = Table.read_table('https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/data/lab7/Table_S2_all_cancers.csv')


## Because we will be creating complex plots, we will use dataframes for these exercises.
## convert to dataframe
dataframe = table.to_df()

dataframe

### What's in the table?

This table contains mutation rates in three different cancer types. We also include mutation rates for 'All Cancer Types'. Mutation rates are measured in mutations per megabase (Mb).

The second column of our data table tells us which feature the mutation rates come from. For example, we can look at the mean mutation rate in 'All Cancer Types' from Signature 1. We can also look at the mutations that consist of a C -> T substitution in a specific cancer (e.g. larynx).

The 3rd and 4th columns tell us the mean and standard error of the mutation rates found in **non-smokers** for each (cancer type, mutation type) pair. The 5th and 6th columns tell us the mean and standard error of the mutation rates found in **smokers** for each (cancer type, mutation type) pair.
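
To make this layout concrete, here is a minimal sketch on a small made-up table that reuses the column names from this lab. The `toy` dataframe, its numbers, and the exact wording of the `Feature` entries are hypothetical and are not taken from the real dataset. It shows one way to list the distinct entries in the first two columns.

```python
import pandas as pd

# Hypothetical rows: column names follow the lab's table, values are made up.
toy = pd.DataFrame({
    'Cancer Type': ['All Cancer Types', 'All Cancer Types', 'Larynx', 'Larynx'],
    'Feature': ['Signature 1', 'C>T substitutions', 'Signature 1', 'C>T substitutions'],
    'Non-smokers-Mean': [0.5, 1.0, 0.25, 0.75],
    'Non-smokers-StdErr': [0.1, 0.2, 0.05, 0.1],
    'Smokers-Mean': [0.5, 1.5, 0.5, 1.25],
    'Smokers-StdErr': [0.1, 0.2, 0.1, 0.2],
})

# Distinct values of the first two columns, in order of first appearance
cancer_types = toy['Cancer Type'].unique()
features = toy['Feature'].unique()

print(cancer_types)
print(features)
```

The same two calls can be run on `dataframe` to explore the real table.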

First, let's answer some questions about the dataset.





<h2 style="color:red">** Question 1**:</h2> 
What are the three cancer types that we have data for in this table?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---


<h2 style="color:red">** Question 2**:</h2> 
What are the signatures that we have mutation rates for in this table? 

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

### Plotting the Data

Next, we will plot the mutation rate (substitution) means and signature means for each cancer category provided in the table. First, we will plot information for all cancer types. Pay close attention to the code, as you will be plotting a similar figure for the remaining three cancer types as an exercise.

First, let's filter by Cancer Type to subselect information for 'All Cancer Types'.

In [None]:
## keep only the rows for 'All Cancer Types'
allCancers = dataframe[dataframe['Cancer Type'] == 'All Cancer Types']

## show result
allCancers
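
The filter above relies on a boolean mask: the comparison produces one `True`/`False` value per row, and indexing the dataframe with that mask keeps only the rows marked `True`. Here is a minimal sketch on made-up data; the `toy` table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows, only for illustrating the filtering pattern
toy = pd.DataFrame({
    'Cancer Type': ['All Cancer Types', 'Larynx', 'All Cancer Types'],
    'Feature': ['Signature 1', 'Signature 1', 'Signature 4'],
    'Smokers-Mean': [0.5, 0.25, 1.5],
})

# Step 1: the comparison yields one boolean per row
mask = toy['Cancer Type'] == 'Larynx'
print(mask.tolist())  # [False, True, False]

# Step 2: indexing with the mask keeps only the rows where it is True
larynx_toy = toy[mask]
print(larynx_toy)
```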

<h2 style="color:red">** Question 3**:</h2> 
For the remaining three cancer types, create a new variable for each cancer type and filter by cancer type. 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


Now, let's gather information for all signature mutation rates in all cancer types. We will create two dataframes: one will contain the mean mutation rates for smokers and non-smokers, and the other will contain standard error for the mean mutation rates of smokers and non-smokers. We will use these dataframes to plot the distributions of signatures found in smokers and non-smokers.

In [None]:

############################# Gather Signature Information for 'All Cancer Types' ################################

# keep only the signature rows
allCancers_signatures = allCancers[allCancers['Feature'].str.contains("Signature")]

# get signature means
signature_means = allCancers_signatures[['Feature', 'Non-smokers-Mean', 'Smokers-Mean']].set_index(['Feature'])

# get signature std error 
signature_stdErr = allCancers_signatures[['Feature','Non-smokers-StdErr', 'Smokers-StdErr']].set_index(['Feature'])

# rename std error columns to match the mean column names (needed so the error bars line up when plotting)
signature_stdErr.columns = ['Non-smokers-Mean', 'Smokers-Mean']


Now, let's gather information for all **substitution** mutation counts. We will use these dataframes to plot the distributions of mutation types found in smokers and non-smokers.

In [None]:
############################# Gather Substitution Information for 'All Cancer Types' ################################
# keep only the substitution rows
allCancers_substitutions = allCancers[allCancers['Feature'].str.contains("substitutions")]

# get substitution means
substitution_means = allCancers_substitutions[['Feature', 'Non-smokers-Mean', 'Smokers-Mean']].set_index(['Feature'])

# get substitution std error
substitution_stdErr = allCancers_substitutions[['Feature','Non-smokers-StdErr', 'Smokers-StdErr']].set_index(['Feature'])

# rename std error columns to match the mean column names (needed so the error bars line up when plotting)
substitution_stdErr.columns = ['Non-smokers-Mean', 'Smokers-Mean']

Now, let's plot both the signature means and the substitution means along with their standard errors. We will draw the two plots as subplots so we can easily view them next to each other. We will use the **stdErr** dataframes to add error bars to our plots.

In [None]:
########################## plot the data in a double bar chart ########################################

# create a figure with two side-by-side subplots: one for signatures, one for substitutions
fig, axes = plt.subplots(nrows=1, ncols=2)

# plot signatures in the 1st column
signature_means.plot(kind='bar', yerr=signature_stdErr, ax=axes[0], figsize=[15,4])

# plot substitutions in the 2nd column
substitution_means.plot(kind='bar', yerr=substitution_stdErr, ax=axes[1], figsize=[15,4])
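
When `yerr` is given as a dataframe, pandas matches its columns to the plotted columns by name, which is why the standard-error columns were renamed before plotting. The sketch below reproduces the same pattern on made-up numbers. The names `toy_means` and `toy_err` are hypothetical, and setting the figure size once in `plt.subplots` is a stylistic choice here, not a requirement.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line in a notebook that uses %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Made-up means, indexed by feature name like the lab's dataframes
toy_means = pd.DataFrame(
    {'Non-smokers-Mean': [1.0, 2.0, 0.5], 'Smokers-Mean': [1.5, 3.0, 0.5]},
    index=pd.Index(['Signature 1', 'Signature 4', 'Signature 5'], name='Feature'),
)

# Made-up standard errors with the same index
toy_err = pd.DataFrame(
    {'Non-smokers-StdErr': [0.1, 0.2, 0.1], 'Smokers-StdErr': [0.2, 0.3, 0.1]},
    index=toy_means.index,
)

# Rename so each error column has the same name as the mean column it belongs to
toy_err.columns = toy_means.columns

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 4))
toy_means.plot(kind='bar', yerr=toy_err, ax=axes[0], title='With error bars')
toy_means.plot(kind='bar', ax=axes[1], title='Without error bars')
```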


<h2 style="color:red">** Question 4**:</h2> 
Plot the substitution and signature means with their standard errors in a subplot for **lung adenocarcinoma**. Refer to the code above as an example.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<h2 style="color:red">** Question 5**:</h2> 
Plot the substitution and signature means with their standard errors in a subplot for lung squamous.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<h2 style="color:red">** Question 6**:</h2> 
Plot the substitution and signature means with their standard errors in a subplot for larynx cancer.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Questions 7-10** are based on the results from the four graphs above.

<h2 style="color:red">** Question 7**:</h2> 
Which signature has the highest mean mutation rate in each category? You should have four answers, one for each cancer type and one for 'All Cancer Types'.

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---


<h2 style="color:red">** Question 8**:</h2> 
What is the mutation (substitution) type with the highest mean in each category?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 9**:</h2> 
Look at the mutations that are most prevalent in the cancer types where signature 4 has the highest mean mutation rate. Are the most prevalent mutation types for these cancer types similar? If so, what are they?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 10**:</h2> 
Are there any noticeable differences in the signature distributions found in smokers and nonsmokers in these plots? If so, what are they?

For example, think about the following questions:
- Which signatures are the most different between smokers and nonsmokers?
- Is there less variation between smokers and nonsmokers in any of the cancers?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

## Part 2: Signature Exploration

In part 2 of this lab, we will be plotting the frequencies of different mutations found in each signature. We will focus on signatures 2, 4, 5, and 13.

In [None]:
# Load in data for mutations and signatures
signatures = Table.read_table('https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/data/lab7/Table_S5_features_tobacco_smokers.csv').to_df()
# Show the table
signatures

In this table, the first column denotes the mutation. For example, 'C>A' denotes a C mutated to an A. The column labeled **Context** denotes the bases surrounding the mutation. The next four columns give the probability of each mutation in signatures 2, 4, 5, and 13, respectively. The last column gives the probability of each mutation in an in vitro DNA sample treated with the chemical benzo[a]pyrene (B[a]P).


For the next exercises, we will be plotting the mutational probabilities from each signature in the table. First, let's plot the mutation probabilities of signature 2.

In [None]:
# Filter data for signature 2 and set x index to Mutation name
sig_2 = signatures[['Mutation', 'Signature 2']].set_index('Mutation')
sig_2

In [None]:
# plot signature 2

sig_2.plot(kind='bar', figsize=[20,5])
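
If each substitution type appears in several rows, one per sequence context (as the **Context** column suggests), the chart above shows many bars for the same substitution. One possible way to summarize it, sketched here on made-up probabilities rather than the real table, is to sum the probabilities per substitution type. The `toy_sig` table, the `Signature X` column, and all values are hypothetical; only the column names `Mutation` and `Context` come from the lab's table.

```python
import pandas as pd

# Made-up table: two substitution types, each in two made-up contexts
toy_sig = pd.DataFrame({
    'Mutation': ['C>A', 'C>A', 'C>T', 'C>T'],
    'Context': ['ACA', 'TCG', 'ACA', 'TCG'],
    'Signature X': [0.125, 0.125, 0.5, 0.25],
})

# Total probability per substitution type, summed across contexts
per_type = toy_sig.groupby('Mutation')['Signature X'].sum()
print(per_type)

# Label of the substitution type with the largest total
most_common = per_type.idxmax()
print(most_common)  # C>T
```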

<h2 style="color:red">** Question 11**:</h2> 
 What type of mutation seems to be the most prevalent for signature 2?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

Visit <a href="http://cancer.sanger.ac.uk/cosmic/signatures" target="_blank">the Cosmic Database</a> and scroll down to the section and plots for each signature and look for Signature 2. Answer the following questions about Signature 2.

<h2 style="color:red">** Question 12**:</h2> 
What other signature is often found in samples containing signature 2?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 13**:</h2> 
The description of signature 2 indicates that a polymorphism in specific genes leads to a greater presence of signature 2. Which genes are these? Given what we discussed in lecture, what role do the proteins encoded by these genes play in the mutational burden of signature 2?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 14**:</h2> 
 Plot the mutation distribution for signature 5.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<h2 style="color:red">** Question 15**:</h2> 
What type of mutation is most prevalent in signature 5?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 16**:</h2> 
Plot the mutation distribution for signature 13.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<h2 style="color:red">** Question 17**:</h2> 
What type of mutation seems to be the highest for signature 13?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 18**:</h2> 
Plot the mutation distribution for signature 4.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<h2 style="color:red">** Question 19**:</h2> 
What type of mutation seems to be the highest for signature 4?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

Visit <a href="http://cancer.sanger.ac.uk/cosmic/signatures" target="_blank">the Cosmic Database</a> and answer the following questions about Signature 4.

<h2 style="color:red">** Question 20**:</h2> 
What cancer types is signature 4 associated with?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 21**:</h2> 
Which mutational biases of signature 4 are mentioned on the COSMIC signatures web page? Are they reflected in the mutations graphed above for signature 4?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

### Signatures of Carcinogenic Substances
Now, we will look at the signature of a chemical compound, benzo[a]pyrene. This chemical is found in coal tar and tobacco smoke. Its metabolite binds to DNA, which can eventually cause mutations. The mutational signature of benzo[a]pyrene was obtained by exposing DNA to benzo[a]pyrene in vitro, that is, outside of a living organism.


Here, we will investigate the mutational signature of benzo[a]pyrene and compare it to some of the signatures we have seen so far.
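
Visual comparison of the bar charts is enough for this lab, but the resemblance between two signatures can also be quantified. One commonly used measure is cosine similarity, which is 1 for profiles with the same shape and 0 for profiles that share no mutation types. The sketch below uses made-up vectors; `cosine_similarity` is a helper written for this illustration and is not part of the lab's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two mutation-probability vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up profiles over four mutation categories
profile_a = [0.5, 0.25, 0.25, 0.0]
profile_b = [0.5, 0.25, 0.25, 0.0]   # same shape as profile_a
profile_c = [0.0, 0.0, 0.0, 1.0]     # no overlap with profile_a

same = cosine_similarity(profile_a, profile_b)
disjoint = cosine_similarity(profile_a, profile_c)
print(same, disjoint)
```

To apply this to the real table, the two inputs would be two columns of `signatures` that list the mutations in the same row order.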

<h2 style="color:red">** Question 22**:</h2> 
Plot the mutation distribution for the signature from benzo[a]pyrene.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<h2 style="color:red">** Question 23**:</h2> 
What type of mutation is the most prevalent for benzo[a]pyrene?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 24**:</h2> 
Which of the signatures plotted does benzo[a]pyrene's signature most resemble? Explain why this signature is most similar to benzo[a]pyrene's signature.

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---