# MC3

The TCGA project brought together over 10,000 patients from 33 cancer types. Lots of different research labs contributed to the effort -- but one collaboration in particular sought to unify all the mutation calling to create one comprehensive data set. Enter: MC3.

The name MC3 stands for Multi-Center Mutation Calling in Multiple Cancers.

**Citation:** Ellrott, K. et al. Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Systems 6, 271–281.e7 (2018)

**Website:** https://doi.org/10.1016/j.cels.2018.03.002

MC3 is a huge data set, and I'm not sure how rstudio.cloud would handle it, so I've selected samples from a single cancer type: glioblastoma (GBM). We want to use MC3 to identify which genes are mutated in GBM and how often they are mutated.

The whole MC3 file comprises 3,600,963 mutations. The GBM subset has only 69,137 mutations, or 1.9% of the total. Processing a file line by line is one of the strengths of Python -- we only need to hold one line of data in memory at a time, rather than loading the whole file at once as is typical in R.
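
As a rough illustration of the line-by-line idea, the sketch below writes a tiny stand-in file and then streams through it, counting data rows without ever holding more than one line in memory. The file name `tiny_mutations.tsv`, its three-column layout, and the variable names are made up for this example; the last lines simply recheck the 1.9% figure quoted above.

```python
import os
import tempfile

# Hypothetical stand-in for a large mutations file: a header plus two rows.
rows = [
    "tcga_id\tcancer_type\tGene\n",
    "TCGA-06-0185-01\tGBM\tLRRC55\n",
    "TCGA-06-5416-01\tGBM\tARSB\n",
]
path = os.path.join(tempfile.mkdtemp(), "tiny_mutations.tsv")
with open(path, "w") as out:
    out.writelines(rows)

# Stream the file: only the current line is held in memory at any time.
mutation_count = 0
handle = open(path, "r")
handle.readline()          # skip the header line
for line in handle:
    mutation_count += 1    # one line corresponds to one mutation
handle.close()

print(mutation_count)      # 2

# Share of MC3 mutations that belong to the GBM subset, in percent.
gbm_share = round(100 * 69137 / 3600963, 1)
print(gbm_share)           # 1.9
```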

Here's a peek at the data set we are working with (`data/test_mutations.tsv`). The entire GBM data set is `data/gbm_mutations.tsv`.


| tcga_id | cancer_type | Tumor_Sample_Barcode | Gene | Chromosome | Start_Position | Variant_Classification |
| --- | --- | --- | --- | --- | --- | --- |
| TCGA-06-0185-01 | GBM | TCGA-06-0185-01A-01D-1491-08 | LRRC55 | 11 | 56949722 | Missense_Mutation |
| TCGA-06-5416-01 | GBM | TCGA-06-5416-01A-01D-1486-08 | ARSB | 5 | 78181606 | Nonsense_Mutation |
| TCGA-28-1751-01 | GBM | TCGA-28-1751-01A-02W-0643-08 | OR4K14 | 14 | 20483102 | Missense_Mutation |
| TCGA-14-0812-01 | GBM | TCGA-14-0812-01B-01W-0643-08 | AADACL3 | 1 | 12780908 | Missense_Mutation |
| TCGA-19-5956-01 | GBM | TCGA-19-5956-01A-11D-1696-08 | CHGB | 20 | 5904096 | Nonsense_Mutation |
| TCGA-19-1388-01 | GBM | TCGA-19-1388-01A-01W-0643-08 | SETD8 | 12 | 123880952 | Silent |
| TCGA-06-5416-01 | GBM | TCGA-06-5416-01A-01D-1486-08 | PRDM15 | 21 | 43242321 | Missense_Mutation |
| TCGA-12-3646-01 | GBM | TCGA-12-3646-01A-01W-0922-08 | LINGO2 | 9 | 27949327 | Missense_Mutation |
| TCGA-12-3652-01 | GBM | TCGA-12-3652-01A-01D-1495-08 | FKBP8 | 19 | 18649190 | Missense_Mutation |
| TCGA-19-5956-01 | GBM | TCGA-19-5956-01A-11D-1696-08 | CD209 | 19 | 7810925 | Missense_Mutation |
| TCGA-32-1991-01 | GBM | TCGA-32-1991-01A-01D-1353-08 | OR8G1 | 11 | 124120491 | Silent |
| TCGA-14-0866-01 | GBM | TCGA-14-0866-01B-01W-0643-08 | FZD1 | 7 | 90895283 | Missense_Mutation |
| TCGA-28-1760-01 | GBM | TCGA-28-1760-01A-01W-0643-08 | MYOZ1 | 10 | 75391768 | Missense_Mutation |
| TCGA-06-5858-01 | GBM | TCGA-06-5858-01A-01D-1696-08 | NEMF | 14 | 50267402 | Missense_Mutation |
| TCGA-06-5416-01 | GBM | TCGA-06-5416-01A-01D-1486-08 | VWA3A | 16 | 22134405 | Splice_Site |
| TCGA-06-5416-01 | GBM | TCGA-06-5416-01A-01D-1486-08 | GPR179 | 17 | 36483260 | Missense_Mutation |
| TCGA-06-2566-01 | GBM | TCGA-06-2566-01A-01W-0837-08 | MCL1 | 1 | 150551355 | Missense_Mutation |
| TCGA-87-5896-01 | GBM | TCGA-87-5896-01A-01D-1696-08 | F9 | X | 138643011 | Missense_Mutation |
| TCGA-32-2616-01 | GBM | TCGA-32-2616-01A-01D-1495-08 | VWA3A | 16 | 22134956 | Missense_Mutation |
| TCGA-06-2566-01 | GBM | TCGA-06-2566-01A-01W-0837-08 | GNA12 | 7 | 2770753 | 3'UTR |

### Grading rubric:
There are five tasks for you to complete below. How many did you attempt? How many did you seek help for when you got stuck? I need to see the code in order to know you tried, so type it out so that I know what you were thinking!

### Tools in your toolbox:

- Open a file with
```python
file_variable_name = open("path/to_data.tsv", "r")
```
- Close a file with
```python
file_variable_name.close()
```
- Read a single line of a file with
```python
file_variable_name.readline()
```
- Iterate over each line in a file with
```python
for line in file_variable_name:
    print(line)  # replace this with whatever you want to do with each line
```
- Remove whitespace and split up a line by tabs with
```python
stripped_and_split = line.strip().split("\t") # stripped_and_split is a list
```
- Access the value in the nth position of a list with
```python
new_variable = stripped_and_split[n] # remember to start counting at 0
```
- Print the value of a variable to the screen
```python
print(new_variable)
```
- Initialize a new, empty dictionary with
```python
new_dictionary = {}
```
- Check if a key is already in a dictionary with
```python
if my_key in new_dictionary:
    print("seen before")  # do something
else:
    print("new key")      # do something else
```
- Add a key to a dictionary
```python
new_dictionary[my_key] = 1 # for example, if you're keeping count, start with a value of 1
```
- Update a value in a dictionary
```python
new_dictionary[my_key] = new_value
```
or
```python
new_dictionary[my_key] += 1 # this adds one to the current value
```
- Access the key and value pairs of a dictionary in a for loop
```python
for k, v in new_dictionary.items():
    # k means key
    # v means value
    print(k, v)  # do something with k and v together!
```
- Check if a number is greater than, less than, equal to, etc. with
```python
if my_number > 100:
    print(my_number)  # do something
    # use >= for greater than or equal
    # use < or <= for less than / less than or equal to
    # use == for equal to
```
- Remember to indent four spaces inside `for` loops and `if` statements
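
To see how several of these tools fit together, here is a minimal sketch that applies them to a single line of text instead of a file. The line is the first row of the preview table above, written out with tabs; the dictionary name `gene_counts` is just an illustrative choice.

```python
# One line as it might come out of the file (values taken from the preview table).
line = "TCGA-06-0185-01\tGBM\tTCGA-06-0185-01A-01D-1491-08\tLRRC55\t11\t56949722\tMissense_Mutation\n"

stripped_and_split = line.strip().split("\t")  # a list of 7 strings
gene = stripped_and_split[3]                   # fourth column, counting from 0

gene_counts = {}                               # illustrative dictionary name
if gene in gene_counts:
    gene_counts[gene] += 1                     # seen before: add one
else:
    gene_counts[gene] = 1                      # first sighting: start at 1

for k, v in gene_counts.items():
    print(k, v)                                # LRRC55 1
```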

## Your task:

We need to open up the mutations file, read through each line, and keep track of the number of times we see each gene. One line corresponds to one mutation, so count every line. There is a header line with column names which we want to skip over. Follow each step below:

1. Open the mutations file (`data/gbm_mutations.tsv`), or first test with `data/test_mutations.tsv`

In [1]:
file = open("data/gbm_mutations.tsv", "r")

2. Use `.readline()` to read past the header line so it is not counted

In [2]:
file.readline()

'tcga_id\tcancer_type\tTumor_Sample_Barcode\tHugo_Symbol\tChromosome\tStart_Position\tVariant_Classification\n'

3. Create an empty dictionary called `genes_dictionary` to store mutation counts, then iterate over each remaining line of the file in a `for` loop
  - Strip and split (by tabs `\t`) each line to isolate the gene name from each mutation (what position is the gene column, labeled `Hugo_Symbol` in the header printed above?)
  - Keep track of the number of mutations seen for each gene (hint: use a dictionary `genes_dictionary` with gene names as the key, and the number of mutations as the value, just add one each time you see a gene)
  - Remember, if a gene has not been seen before, you need to add it to the dictionary with a value of 1

In [3]:
genes_dictionary = {}

for line in file:
    stripped_and_split = line.strip().split("\t") # stripped_and_split is a list
    gene = stripped_and_split[3] # remember to start counting at 0
    
    if gene in genes_dictionary:
        genes_dictionary[gene] += 1
    else:
        genes_dictionary[gene] = 1

4. Print out any gene with greater than 100 mutations. You can check each gene and its value with `.items()` in a `for` loop.

In [4]:
for k,v in genes_dictionary.items():
    if v > 100:
        print(k, v)

TP53 147
EGFR 130
PTEN 136
TTN 285
MUC16 128
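
A dictionary prints in the order its keys were first added, so the genes above are not ranked. One possible way to list them from most to least mutated is to sort the key and value pairs by value, as in this sketch. The counts are copied from the output above; in the notebook you would pass `genes_dictionary.items()` instead of the hypothetical `top_genes`.

```python
# Counts copied from the output above (stand-in for genes_dictionary).
top_genes = {"TP53": 147, "EGFR": 130, "PTEN": 136, "TTN": 285, "MUC16": 128}

# Sort (gene, count) pairs by count, largest first.
ranked = sorted(top_genes.items(), key=lambda pair: pair[1], reverse=True)

for gene, count in ranked:
    print(gene, count)
# TTN 285
# TP53 147
# PTEN 136
# EGFR 130
# MUC16 128
```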


5. Close the file

In [5]:
file.close()

In [6]:
# if everything worked, you should have 147 mutations for TP53
assert genes_dictionary["TP53"] == 147
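
Once the five steps feel comfortable, the same counting can be written more compactly. The sketch below is one alternative, not the required answer: a `with` block closes the file automatically, and `collections.Counter` from the standard library removes the need for the `if`/`else`. The miniature file `mini_mutations.tsv` is made up for this example, using three rows from the preview table.

```python
import os
import tempfile
from collections import Counter

# Hypothetical miniature file with the same seven columns as the GBM data.
header = "tcga_id\tcancer_type\tTumor_Sample_Barcode\tHugo_Symbol\tChromosome\tStart_Position\tVariant_Classification\n"
rows = [
    "TCGA-06-5416-01\tGBM\tTCGA-06-5416-01A-01D-1486-08\tVWA3A\t16\t22134405\tSplice_Site\n",
    "TCGA-32-2616-01\tGBM\tTCGA-32-2616-01A-01D-1495-08\tVWA3A\t16\t22134956\tMissense_Mutation\n",
    "TCGA-87-5896-01\tGBM\tTCGA-87-5896-01A-01D-1696-08\tF9\tX\t138643011\tMissense_Mutation\n",
]
path = os.path.join(tempfile.mkdtemp(), "mini_mutations.tsv")
with open(path, "w") as out:
    out.write(header)
    out.writelines(rows)

gene_counts = Counter()
with open(path, "r") as handle:   # the file is closed automatically on leaving the block
    handle.readline()             # skip the header line
    for line in handle:
        gene = line.strip().split("\t")[3]
        gene_counts[gene] += 1    # missing keys start at 0, so no if/else is needed

print(gene_counts.most_common())  # [('VWA3A', 2), ('F9', 1)]
```

The explicit `open` / `close` version in the steps above does the same work; this form just makes it harder to forget step 5.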

In [None]:
# END OF MC3 ACTIVITY!

### Dig deeper: look up each of these top genes. Oncogenes or tumor suppressors?

### Any reason to doubt any of them are actually cancer-related genes? hmm