# Project 2 (P2)

Notebook version: `25.1` (please don't change)

**IMPORTANT: Before you do anything, save a copy of this notebook to your own Google Drive using the `File -> Save a copy in Drive` option in the menu. Otherwise you cannot save your changes. Once you have saved a copy to your own Drive, it is available there just like a regular Google Docs file, and it is saved automatically.**

The Cancer Genome Atlas (TCGA) is an international endeavor to catalogue genomic and genetic mutations in a variety of cancer tissues. It is generally believed that gathering such information from a large number of patients will improve our ability to diagnose, treat, and prevent cancer through a better understanding of the genomic variation introduced by cancer. Bioinformatic analysis of the data is essential to arriving at such an understanding. In this project, you are offered the opportunity to contribute to this venture. You have access to normalized TCGA data from 4434 patients with different forms of cancer.


You are also given the `sample_info.pkl` file, which contains information about each patient. For example, the field `cancertype` contains the cancer subtype, `os` the time to follow-up in months, and `osi` has a 1 if a patient was alive at follow-up time and 0 otherwise. Please note that for some of these variables, information is available for only a subset of the patients. An example is the `HER2type` column, which denotes whether a breast cancer patient is of the HER2 type and is naturally only available for breast cancer patients. For all patients for which a variable is not measured, the value is set to "NaN" (`np.nan` in Python).
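
Because many columns are only filled in for a subset of the patients, it usually helps to restrict an analysis to the rows where the variable of interest is measured. The following is a minimal sketch on a small made-up dataframe: the patient IDs and all values are hypothetical, and only the column names `cancertype` and `HER2type` come from the description above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sample_info; IDs and values are made up for illustration.
toy_info = pd.DataFrame(
    {
        "cancertype": ["typeA", "typeA", "typeB", "typeA"],
        "HER2type": ["Positive", "Negative", np.nan, np.nan],
    },
    index=["p1", "p2", "p3", "p4"],
)

# How many patients have a measured (non-NaN) value in each column?
measured_counts = toy_info.notna().sum()

# Keep only the patients for which HER2type is known.
her2_known = toy_info.dropna(subset=["HER2type"])

print(measured_counts)
print(her2_known)
```

Running `sample_info.notna().sum()` on the real dataframe gives a quick overview of which variables are usable for which questions.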

For each patient, you further have access to the following data:
- `GE.pkl` - Gene expression data
- `ME.pkl` - DNA Methylation data
- `CN.pkl` - Copy-Number Variation data
- `MIR.pkl` - microRNA expression data
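
When several of these data types are combined, the analysis needs the same patients in the same order. The following is a minimal sketch, assuming that each dataframe has patients as rows and uses patient identifiers as its index (this is an assumption, so check `.shape` and `.index` on the real data first). The toy frames and all names in them are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-ins for two data types (rows = patients, columns = features).
toy_ge = pd.DataFrame(
    rng.normal(size=(4, 3)),
    index=["p1", "p2", "p3", "p4"],
    columns=["geneA", "geneB", "geneC"],
)
toy_mir = pd.DataFrame(
    rng.normal(size=(3, 2)),
    index=["p2", "p3", "p5"],
    columns=["mirA", "mirB"],
)

# Patients that are present in both data types.
shared = toy_ge.index.intersection(toy_mir.index)

# Select the shared patients in the same order and join the feature columns.
combined = pd.concat([toy_ge.loc[shared], toy_mir.loc[shared]], axis=1)

print(list(shared))
print(combined.shape)
```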

To contribute to the quest for solving cancer, you are asked to analyze this data, which also means that you should think of meaningful and interesting questions that can be answered using the provided data (these are not known beforehand!). Try to make use of the content presented in modules 2, 3 and 4, e.g. looking for differentially expressed markers, clustering the data to discover subtypes, or building predictors for adverse outcomes. More information about TCGA can be found on their website: https://cancergenome.nih.gov/ .
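
As one possible starting point for finding differentially expressed markers, the sketch below ranks features by a Welch t-statistic between two groups of patients. It runs on simulated data only: the matrix, the group labels `"A"` and `"B"`, and the helper `welch_t` are all hypothetical, and this is just one of many reasonable approaches.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy expression matrix: 20 patients x 5 features (values are simulated).
toy_expr = pd.DataFrame(
    rng.normal(size=(20, 5)),
    columns=[f"feature_{i}" for i in range(5)],
)
# Hypothetical group labels: first 10 patients in group "A", the rest in "B".
labels = pd.Series(["A"] * 10 + ["B"] * 10, index=toy_expr.index)

# Shift one feature in group B so there is a real difference to find.
toy_expr.loc[labels == "B", "feature_2"] += 5.0


def welch_t(data, groups, a, b):
    """Welch t-statistic per column, comparing group a against group b."""
    xa, xb = data[groups == a], data[groups == b]
    se = np.sqrt(xa.var(ddof=1) / len(xa) + xb.var(ddof=1) / len(xb))
    return (xa.mean() - xb.mean()) / se


t_stats = welch_t(toy_expr, labels, "A", "B")

# Rank features by the absolute size of the statistic (largest first).
ranking = t_stats.abs().sort_values(ascending=False)
print(ranking)
```

On the real data, `scipy.stats.ttest_ind` with `equal_var=False` computes the same statistic together with p-values. When screening many features at once, remember to correct for multiple testing.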

More information about this dataset and some analyses of it can be found in the paper: Taskesen et al. Pan-cancer subtyping in a 2D-map shows substructures that are driven by specific combinations of molecular characteristics. Scientific Reports, 6:24949, 2016.
doi: 10.1038/srep24949. (also available on BrightSpace)

<br>

---

The results should be summarized in a poster. Make sure that you: motivate choices that
you made during the analyses (aim of the performed analysis, type of algorithm, parameter
settings etc.); explain and discuss your findings; explain what is represented in figures (what
is on the axes etc.).

---

**Hint**: So far you've made your plots with `matplotlib.pyplot`, which is excellent for basic plots, but if you need other types of plots, you may want to look at the `seaborn` library. It offers many different types of visualizations (see some examples [here](https://seaborn.pydata.org/examples/index.html)), and it works well together with pandas.


In [None]:
!wget -nc -O "CN.pkl" https://surfdrive.surf.nl/files/index.php/s/M4ggslgihmkElMQ/download
!wget -nc -O "GE.pkl" https://surfdrive.surf.nl/files/index.php/s/MpDsfJWwwVOENod/download
!wget -nc -O "ME.pkl" https://surfdrive.surf.nl/files/index.php/s/pLIH3F1mnAPPCkQ/download
!wget -nc -O "MIR.pkl" https://surfdrive.surf.nl/files/index.php/s/olb2ZNtPyfLU8qR/download
!wget -nc -O "sample_info.pkl" https://surfdrive.surf.nl/files/index.php/s/Oa71uEshvR9isKL/download

In [56]:
import pandas as pd
import numpy as np
import pickle

with open("CN.pkl", "rb") as f:
  CN = pickle.load(f)

with open("GE.pkl", "rb") as f:
  GE = pickle.load(f)

with open("ME.pkl", "rb") as f:
  ME = pickle.load(f)

with open("MIR.pkl", "rb") as f:
  MIR = pickle.load(f)

sample_info = pd.read_pickle("sample_info.pkl")  # The download cell above saves this file as sample_info.pkl
sample_info[sample_info == "[Not Available]"] = np.nan  # Make sure we only have one type of missing value in the dataframe

In [None]:
# Just like in the first project, everything is stored in Pandas dataframes:
display(sample_info.head())