# Lab 9: Immunotherapy Clinical Trials

In [None]:
# imports
from datascience import Table
import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')
from sklearn.cluster import KMeans
import pandas as pd

Today, we will be investigating results from the paper 'Genomic correlates of response to CTLA-4 blockade in metastatic melanoma' by Van Allen et al, found here: https://www.ncbi.nlm.nih.gov/pubmed/26359337

**This paper analyzes whole-exome data from pretreatment tumor samples of 110 patients with metastatic melanoma who were treated with ipilimumab. Ipilimumab is an antibody that blocks CTLA-4, an immune checkpoint protein that suppresses T cell activation.**

The goal of this lab is to find relationships between a patient's response to drugs such as ipilimumab and the mutation and neoantigen counts in their exome.


First, let's load in the data for this lab.

In [None]:
# load in data for part 1
patients = Table.read_table("https://raw.githubusercontent.com/data-8/mcb-88-connector/gh-pages/data/lab9/TableS2.Clinical_and_genome_characteristics_each_patient.csv")

patients

In this table, we have data for 110 patients who were diagnosed with melanoma. 


For each patient, we will be looking at the following columns:
- overall_survival: total survival time, in days
- progression_free: survival time without tumor progression, in days
- neos500, neos250, neos100, neos50: number of neoantigens in the patient that fall within a given predicted binding affinity range
- hla.a1, hla.a2, hla.b1, hla.b2, hla.c1, hla.c2: HLA A, B and C alleles for each patient

First, we will plot the overall survival times of all the patients against the progression free survival times.


<h2 style="color:red">** Question 1**:</h2> 
Make a scatter plot of the column 'overall_survival' against the column 'progression_free'. (Refer to http://data8.org/datascience/tutorial.html#getting-started if you are unsure how to make a scatter plot.) Make sure 'overall_survival' is on the x-axis.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<h2 style="color:red">** Question 2**:</h2> 
Ideally, if every patient had a positive response to the immunotherapy drug, how would this plot look (i.e., what would be the shape of the plot)?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

## Categories of Patient Responses

For the next questions, we will categorize patients by their response to the CTLA-4-inhibiting drug. We will look at three categories of patients:
- Minimal to no benefit
- Long-term survival with no clinical benefit
- Clinical benefit


### The RECIST Metric

One of the metrics we will use to classify patients is RECIST, found in the table above. RECIST (Response Evaluation Criteria in Solid Tumors) is a standard way of evaluating how solid tumors respond to treatment. The RECIST classification for each patient can be one of the following:
- CR = complete response 
- PR = partial response 
- SD = stable disease
- PD = progressive disease

For more information on RECIST guidelines, please refer to https://ctep.cancer.gov/protocoldevelopment/docs/recist_guideline.pdf.

The categories we will be grouping patients into are defined as follows:
### Minimal to no benefit
Individuals with minimal to no benefit are classified by the following criteria:
- having RECIST criteria of PD 
- or overall_survival < 1 year and RECIST criteria of SD

### Long-term survival with no clinical benefit
Individuals with long-term survival with no clinical benefit are classified as having:
- overall_survival > 2 years 
- and early tumor progression (progression_free < 6 months)

### Clinical benefit
Individuals with a clinical benefit had:
- RECIST of CR or PR
- or overall_survival > 1 year and RECIST criteria of SD


Next, we will filter the patients into each of these three cohorts. We will select the patients with clinical benefit as an example. You will select the patients with long-term survival and with minimal benefit as an exercise.


In [None]:
# First, we will convert our data to a dataframe (df) to make filtering easier. 
patients_df = patients.to_df()

Now let's select the patients with clinical benefit as an example. As stated above, the filtering criteria for this cohort are RECIST of CR or PR, or overall_survival > 1 year with RECIST of SD.

In [None]:
## We want to filter by whether RECIST is CR or PR, or overall_survival > 1 year and RECIST is SD.
# We write this filtering statement as:
# (patients_df['RECIST'] == 'CR') | (patients_df['RECIST'] == 'PR') | ((patients_df['overall_survival'] > 365) & (patients_df['RECIST'] == 'SD'))
# Where (patients_df['RECIST'] == 'CR') | (patients_df['RECIST'] == 'PR') checks if RECIST = CR or RECIST = PR
# and ((patients_df['overall_survival'] > 365) & (patients_df['RECIST'] == 'SD')) checks if the patient survives
# longer than 1 year (365 days) and RECIST = SD
##
# Note that for filtering & means 'and' and | means 'or'.
# Each comparison needs its own parentheses, because & and | bind more tightly than == and >.

benefit = patients_df[(patients_df['RECIST'] == 'CR') 
                      | (patients_df['RECIST'] == 'PR') 
                      | ((patients_df['overall_survival'] > 365) 
                         & (patients_df['RECIST'] == 'SD'))]

benefit.head()
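The parentheses around each comparison in the cell above are easy to forget. The sketch below illustrates why they matter, using a small made-up DataFrame (the name `toy` and all of its values are hypothetical and are not taken from the trial data).

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the trial table.
toy = pd.DataFrame({
    'RECIST': ['CR', 'PR', 'SD', 'SD', 'PD'],
    'overall_survival': [900, 400, 500, 200, 100],
})

# Each comparison is wrapped in parentheses, so | and & combine boolean Series.
mask = ((toy['RECIST'] == 'CR')
        | (toy['RECIST'] == 'PR')
        | ((toy['overall_survival'] > 365) & (toy['RECIST'] == 'SD')))
toy_benefit = toy[mask]
print(toy_benefit)

# Without parentheses, Python tries to evaluate 'CR' | toy['RECIST'] first,
# so the expression raises an error instead of filtering.
try:
    toy[toy['RECIST'] == 'CR' | toy['RECIST'] == 'PR']
    unparenthesized_ok = True
except (TypeError, ValueError) as err:
    unparenthesized_ok = False
    print('Unparenthesized filter failed:', type(err).__name__)
```

If a filter raises an unexpected error, checking the parentheses around each comparison is a good first step.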

<h2 style="color:red">** Question 3**:</h2> 
Select the patients in the long-term survival with no clinical benefit category and store them in a variable named `long_term` (this name is used later in the lab). The criteria for filtering are listed above and are as follows:
overall_survival > 2 years and progression-free survival < 6 months (assume an average month is 30 days).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


<h2 style="color:red">** Question 4**:</h2> 
Select the patients in the minimal to no benefit category and store them in a variable named `minimal` (this name is used later in the lab). The criteria for filtering are listed above and are as follows: RECIST = PD, or (overall_survival < 1 year and RECIST = SD).

In [None]:
# YOUR CODE HERE
raise NotImplementedError()



<h2 style="color:red">** Question 5**:</h2> 
Print out how many individuals are in each of these three cohorts. (Hint: you can get the number of rows in a DataFrame from its .shape attribute by taking the first dimension, i.e. df.shape[0].)
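As a quick illustration of the hint, here is a minimal sketch on a made-up DataFrame (the name `toy` and its values are hypothetical): `.shape` is a `(rows, columns)` tuple, so its first element is the row count.

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the trial table.
toy = pd.DataFrame({
    'RECIST': ['CR', 'SD', 'PD', 'PD'],
    'overall_survival': [800, 300, 150, 90],
})

# .shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = toy.shape
print('rows:', n_rows, 'columns:', n_cols)

# After filtering, shape[0] gives the number of rows that passed the filter.
pd_only = toy[toy['RECIST'] == 'PD']
print('PD patients:', pd_only.shape[0])
```

`len(pd_only)` returns the same row count, if you prefer that form.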

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


## Investigation of mutation and neoantigen load across cohorts

For the next part of the lab, we will compare the mutation and neoantigen loads across the three cohorts we created in the previous section. Our goal is to find out whether certain cohorts have higher mutation and neoantigen loads.

First, we will graph the mutation counts in a box plot across all three cohorts. As an exercise, you will then create a box plot of the neoantigen counts across cohorts.


In [None]:
# select the 'mutations' column from each of the three cohorts
frames = [minimal['mutations'], long_term['mutations'], benefit['mutations']]

# place the three mutations columns side by side (rows are aligned on the patient index,
# so a patient who is not in a cohort gets NaN in that cohort's column)
result = pd.concat(frames, axis=1)

# rename the columns to match the cohort name. This will be displayed on the boxplot.
result.columns = ['minimal', 'long_term', 'benefit']

result.head()
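The table above will likely contain many NaN values. That is expected: `pd.concat(..., axis=1)` lines rows up by index label, and each patient's row label appears only in the cohorts that patient belongs to. A minimal sketch with made-up data (the names `toy`, `group_a`, and `group_b` are hypothetical):

```python
import pandas as pd

# Toy data: hypothetical values, NOT taken from the trial table.
toy = pd.DataFrame({
    'cohort': ['a', 'b', 'a', 'b', 'b'],
    'mutations': [10, 20, 30, 40, 50],
})

group_a = toy[toy['cohort'] == 'a']   # keeps row labels 0 and 2
group_b = toy[toy['cohort'] == 'b']   # keeps row labels 1, 3 and 4

# axis=1 puts the Series side by side, aligned on their row labels.
side_by_side = pd.concat([group_a['mutations'], group_b['mutations']], axis=1)
side_by_side.columns = ['group_a', 'group_b']
print(side_by_side)

# Every row label appears once; cells with no matching row are NaN.
print(side_by_side.isna().sum())
```

Box plots drawn with `DataFrame.plot(kind='box')` ignore the NaN cells, so each box reflects only the patients in that cohort.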

In [None]:
# Now, plot the mutation load for each cohort. Make one box for each cohort.
result.plot(kind='box', figsize=(6,8))
plt.show()
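Medians can be hard to read precisely off a box plot. One way to check them numerically, sketched here on a made-up frame shaped like `result` (the name `toy_result`, the column names, and the values are all hypothetical), is `DataFrame.median()`, which skips NaN cells by default.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, NOT taken from the trial table.
toy_result = pd.DataFrame({
    'cohort_a': [100, 200, 300, np.nan],
    'cohort_b': [150, np.nan, np.nan, 250],
    'cohort_c': [400, 500, np.nan, 900],
})

# Column-wise medians; NaN cells are ignored.
medians = toy_result.median()
print(medians)

# Name of the column with the largest median.
print('highest median:', medians.idxmax())
```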


<h2 style="color:red">** Question 6**:</h2> 
What is the name of the cohort with the highest median number of mutations?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

<h2 style="color:red">** Question 7**:</h2> 
Why do you think the group with the most mutations has a positive benefit from the CTLA-4 inhibiting drug?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---

## Boxplots for Neoantigens

Next, you will generate box plots of the neoantigen counts for each cohort. As you can see in the dataset, we have neoantigen loads for neos500, neos250, neos100 and neos50. Each of these corresponds to neoantigens within a specific binding affinity range. neos500 has the highest binding affinity, while neos50 has the lowest. The table also lists each patient's HLA alleles. These alleles were used to predict the binding affinity of mutated sequences for each of these neoantigen categories.



<h2 style="color:red">** Question 8**:</h2> 
Make a box plot that displays the counts of the neoantigens predicted to have the strongest binding affinity to their HLA molecules (found in the column neos500) for each cohort.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()


<h2 style="color:red">** Question 9**:</h2> 
Which cohort has the highest median number of neoantigens?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---


<h2 style="color:red">** Question 10**:</h2> 
Why would you expect this cohort to have the most neoantigens?

---
## <span style="color:red">Student Answer</span>

*Double-click and add your answer between the lines*

---