# Project 2 - The Cancer Genome Atlas (TCGA) Data Analysis

Notebook version: `25.3` (please don't change)

**IMPORTANT: Before you do anything, save a copy of this notebook to your own Google Drive using the `File -> Save a copy to Drive` button in the menu. Otherwise, you cannot save your changes. Once you've saved a copy to your own drive, it is available there just like a regular Google Docs file, and it is saved automatically.**

The Cancer Genome Atlas (TCGA) is an international endeavor to catalogue genomic and genetic mutations in a variety of cancer tissues. It is generally believed that gathering such information from a large number of patients will improve our ability to diagnose, treat, and prevent cancer through a better understanding of the genomic variation introduced by cancer. Bioinformatic analysis of the data is paramount to arriving at such an understanding. In this project, you are offered the opportunity to contribute to this venture. You have access to the processed TCGA data from 9,648 patients with different forms of cancer.

You are given the `clinical.csv` file, which contains many different types of information about each patient. For example, the field `cancer_type` contains the cancer subtype, `drug_received_treatment` denotes whether the patient received a drug treatment, and `vital_status` denotes whether the patient was still alive during the follow-up. Please note that for some of these variables, information is available for only a subset of the patients. An example is the `her2_immunohistochemistry_level_result` column, which contains the HER2 score (0, 1+, 2+, or 3+), where 3+ denotes HER2-positive. This score is only available for breast cancer patients. For all patients for whom a variable was not measured, the value is set to "NaN" (`np.nan` in Python).
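
As a minimal sketch of how such partially measured variables can be handled, the snippet below builds a small made-up stand-in for the clinical dataframe and keeps only the patients for whom the HER2 score was measured. Only the column names come from the description above; the patient IDs, the `cancer_type` labels, and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the clinical dataframe; IDs, labels, and values are invented.
toy_clinical = pd.DataFrame(
    {
        "cancer_type": ["BRCA", "BRCA", "LUAD", "BRCA"],
        "her2_immunohistochemistry_level_result": ["3+", "1+", np.nan, np.nan],
    },
    index=pd.Index(["P1", "P2", "P3", "P4"], name="patient_id"),
)

her2 = toy_clinical["her2_immunohistochemistry_level_result"]

# Count how many patients have a measured value for this variable.
n_measured = int(her2.notna().sum())

# Keep only the patients with a measured HER2 score.
measured = toy_clinical[her2.notna()]

# Select the HER2-positive patients (a score of 3+ denotes HER2-positive).
her2_positive = measured[measured["her2_immunohistochemistry_level_result"] == "3+"]
```

Checking `notna().sum()` for a column before using it is a quick way to see how many patients an analysis on that variable would actually cover.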

For each patient, you further have access to the following data:

- `/data/expression.pkl` (loaded as `GE` below) - **Gene expression data**: Expression of each gene, measured by RNAseq. The data was normalized to one million counts per sample (counts per million, CPM) to account for different sequencing depths per sample, and then log-transformed. The data was not standardized (i.e. the mean expression of each gene is not zero), so think carefully about whether your analysis requires this.

- `/data/meth.pkl` (loaded as `ME` below) - **DNA Methylation data**: Methylation of each gene, represented as beta-values, which are continuous values between 0 and 1, representing the ratio of the methylated signal intensity to the total (methylated plus unmethylated) intensity.

- `/data/cnv.pkl` (loaded as `CN` below) - **Copy-Number Variation data**: Copy-number variation of each gene.

- `/data/mirna.pkl` (loaded as `MIR` below) - **microRNA expression data**: Expression of each microRNA, measured by RNAseq. Just like the gene expression data, it was normalized to one million counts per sample (CPM) and then log-transformed, but not standardized.
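
If an analysis does require standardized expression values, a per-gene z-score is one common option. The sketch below is only an illustration on invented numbers: it assumes samples are in rows and genes are in columns, and the gene names `GENE_A` and `GENE_B` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for log-CPM expression values (samples in rows, genes in columns).
# Gene names and values are invented for illustration.
toy_expr = pd.DataFrame(
    {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [10.0, 10.0, 13.0]},
    index=["S1", "S2", "S3"],
)

# Z-score each gene: subtract its mean and divide by its standard deviation.
# ddof=0 uses the population standard deviation.
gene_means = toy_expr.mean(axis=0)
gene_stds = toy_expr.std(axis=0, ddof=0)
standardized = (toy_expr - gene_means) / gene_stds
```

Note that a gene with zero variance across samples would produce NaN values here, so such genes may need to be filtered out first.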

To link the data from these files to the patients, you can use the `patient_id` column in each data type's dataframe, which corresponds to the index of the clinical dataframe (use `clinical.index` or `clinical['patient_id']` to access it).
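
A minimal sketch of this linking step, using small made-up dataframes in place of the real ones: it assumes the clinical dataframe is indexed by patient ID and that the data-type dataframe has one row per sample with a `patient_id` column. The IDs, the gene name `GENE_A`, and all values are invented.

```python
import pandas as pd

# Toy clinical dataframe, indexed by patient ID (values are invented).
toy_clinical = pd.DataFrame(
    {"cancer_type": ["BRCA", "LUAD"]},
    index=pd.Index(["P1", "P2"], name="patient_id"),
)

# Toy expression dataframe with one row per sample (values are invented).
toy_expr = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2"],
        "GENE_A": [1.5, 0.2, 3.1],
    }
)

# Attach the clinical columns to each sample by matching the patient_id
# column against the index of the clinical dataframe (a left join).
merged = toy_expr.join(toy_clinical, on="patient_id")
```

Because a patient can have more than one sample, the joined dataframe has one row per sample, with that patient's clinical values repeated on each row.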

Please note that for some of these data types, there are additional samples for some patients that come from healthy reference tissue. These can be identified by the `sample_type` column in the corresponding dataframe.
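
The sketch below shows one way to separate tumor samples from reference samples on a made-up dataframe. The labels `"Primary Tumor"` and `"Solid Tissue Normal"` are assumptions for illustration only; inspect the values that actually occur in `sample_type` (for example with `value_counts()`) before filtering the real data.

```python
import pandas as pd

# Toy stand-in for one of the data-type dataframes (values are invented).
toy_expr = pd.DataFrame(
    {
        "patient_id": ["P1", "P1", "P2"],
        "sample_type": ["Primary Tumor", "Solid Tissue Normal", "Primary Tumor"],
        "GENE_A": [1.5, 0.2, 3.1],
    }
)

# First inspect which sample types occur and how often.
counts = toy_expr["sample_type"].value_counts()

# Hypothetical label for healthy reference tissue; check the real data.
NORMAL_LABEL = "Solid Tissue Normal"

# Split the samples into reference and tumor samples.
is_normal = toy_expr["sample_type"] == NORMAL_LABEL
normal_samples = toy_expr[is_normal]
tumor_samples = toy_expr[~is_normal]
```

Keeping the reference samples in a separate dataframe avoids mixing them into analyses that are meant to compare tumors with each other.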

More information about this dataset and some analyses of it can be found in the paper: Taskesen et al. Pan-cancer subtyping in a 2D-map shows substructures that are driven by specific combinations of molecular characteristics. Scientific Reports, 6:24949, 2016.
doi: 10.1038/srep24949 (also available on BrightSpace).

<br>

---
<br>


> To contribute to the quest for solving cancer, you are asked to analyze this data, which also means that you should think of meaningful and interesting questions that can be answered using the provided data (these are not known beforehand!). Make use of at least one technique you have learned from each of modules 2, 3, and 4. Examples may include looking for differentially expressed markers, clustering the data to discover subtypes, or building predictors for adverse outcomes. More information about TCGA can be found on their website: https://cancergenome.nih.gov/ .
>
> The results should be summarized in a poster. Make sure that you: motivate the choices that you made during the analyses (aim of the performed analysis, type of algorithm, parameter settings etc.); explain and discuss your findings; explain what is represented in figures (what is on the axes etc.).

---

**Hint**: So far you've made your plots with `matplotlib.pyplot`, which is excellent for basic plots, but if you need other types of plots, you may want to look at the `seaborn` library. It offers many different types of visualizations (see some examples [here](https://seaborn.pydata.org/examples/index.html)), and it works well together with pandas.

In [None]:
!mkdir -p /data
!wget -nc -O "/data/clinical.csv" https://surfdrive.surf.nl/files/index.php/s/653xXM13mXQFhnR/download
!wget -nc -O "/data/cnv.pkl" https://surfdrive.surf.nl/files/index.php/s/Gkn21dal4o2mNhd/download
!wget -nc -O "/data/expression.pkl" https://surfdrive.surf.nl/files/index.php/s/OCi3ZI2clscbqIs/download
!wget -nc -O "/data/meth.pkl" https://surfdrive.surf.nl/files/index.php/s/6uzoxlHVVCjHyM1/download
!wget -nc -O "/data/mirna.pkl" https://surfdrive.surf.nl/files/index.php/s/CCtSonICb3O0ByR/download

In [None]:
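import pandas as pd
import pickle

# Copy-number variation data
with open("/data/cnv.pkl", "rb") as f:
    CN = pickle.load(f)

# Gene expression data
with open("/data/expression.pkl", "rb") as f:
    GE = pickle.load(f)

# DNA methylation data
with open("/data/meth.pkl", "rb") as f:
    ME = pickle.load(f)

# microRNA expression data
with open("/data/mirna.pkl", "rb") as f:
    MIR = pickle.load(f)

# Clinical information; the first CSV column becomes the dataframe index
clinical = pd.read_csv("/data/clinical.csv", index_col=0)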

In [None]:
# Just like in the first project, everything is stored in pandas dataframes:
display(clinical.head())