In [24]:
import seaborn as sns
from ggplot import *
from matplotlib import pyplot as plt
import bokeh

import pandas as pd
import dask.dataframe as dd
import numpy as np
import scipy as sc
import statsmodels as sm

import sklearn as sk
import tensorflow as tf
import keras
import xgboost as xgb
import lightgbm as lgbm
import tpot

import sys
import os
import gc

# data sources

## DNA, [Mutation](https://ghr.nlm.nih.gov/primer/mutationsanddisorders/possiblemutations)

A mutation record describes, per genome and chromosome position, how the observed base differs from a normal reference.
Remember that DNA has two base pairs: (Adenine, Thymine) and (Guanine, Cytosine).

The types of mutations include (taken [from here](https://ghr.nlm.nih.gov/primer/mutationsanddisorders/possiblemutations)):

Missense mutation:
A change in one DNA base pair that results in the substitution of one amino acid for another in the protein made by a gene.

Nonsense mutation:
Also a change in one DNA base pair. Instead of substituting one amino acid for another, however, the altered DNA sequence prematurely signals the cell to stop building a protein. This type of mutation results in a shortened protein that may function improperly or not at all.

Insertion: 
An insertion changes the number of DNA bases in a gene by adding a piece of DNA. As a result, the protein made by the gene may not function properly.

Deletion:
A deletion changes the number of DNA bases by removing a piece of DNA. Small deletions may remove one or a few base pairs within a gene, while larger deletions can remove an entire gene or several neighboring genes. The deleted DNA may alter the function of the resulting protein(s).

Duplication:
A duplication consists of a piece of DNA that is abnormally copied one or more times. This type of mutation may alter the function of the resulting protein.

Frameshift mutation:
This type of mutation occurs when the addition or loss of DNA bases changes a gene's reading frame. A reading frame consists of groups of 3 bases that each code for one amino acid. A frameshift mutation shifts the grouping of these bases and changes the code for amino acids. The resulting protein is usually nonfunctional. Insertions, deletions, and duplications can all be frameshift mutations.

Repeat expansion:
Nucleotide repeats are short DNA sequences that are repeated a number of times in a row. For example, a trinucleotide repeat is made up of 3-base-pair sequences, and a tetranucleotide repeat is made up of 4-base-pair sequences. A repeat expansion is a mutation that increases the number of times that the short DNA sequence is repeated. This type of mutation can cause the resulting protein to function improperly.
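
The difference between missense, nonsense and frameshift mutations is easiest to see by translating a short sequence codon by codon. The following is a minimal sketch, not part of the original analysis: the 12-base reference sequence and its three mutated variants are made up for illustration, and `translate` is a hypothetical helper that uses the standard genetic code on the coding strand.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding-strand DNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):  # an incomplete trailing codon is ignored
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


# Hypothetical toy sequences, chosen only to illustrate the mutation types.
reference = "ATGGAAGGATAA"    # ATG GAA GGA TAA
missense = "ATGAAAGGATAA"     # G>A in the second codon: E becomes K
nonsense = "ATGTAAGGATAA"     # G>T in the second codon: premature stop
frameshift = "ATGCGAAGGATAA"  # one inserted base shifts every later codon

for name, seq in [("reference", reference), ("missense", missense),
                  ("nonsense", nonsense), ("frameshift", frameshift)]:
    print(f"{name:10s} {seq:14s} -> {translate(seq)}")
```

In the frameshift case, the single inserted base changes every downstream codon and the original stop codon is no longer read in frame.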

### DATA FIELDS, shape (422553, 11)
``` ID      |  Location        | Change     |  Gene   | Mutation type|  Var.Allele.Frequency  | Amino acid```

```Sample,  | Chr, Start, Stop|  Ref, Alt  | Gene    |    Effect    |  DNA_VAF, RNA_VAF      | Amino_Acid_Change```

```string   |string, int, int | char, char | string  |    string    |  float, float          |  string```

NOTE: this gives us direct insight into how genetic mutations lead to changes in amino acids.
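
To use the `Amino_Acid_Change` field as a feature, it can be split into the reference amino acid, the position and the altered amino acid. The sketch below assumes the simple substitution form `p.<ref><position><alt>` seen in the mutation table preview further down (for example `p.D731N`). `parse_aa_change` is a hypothetical helper; other notations, which may occur for frameshifts or splice-site changes, are returned as `None` rather than guessed at.

```python
import re

# Matches simple substitutions such as "p.D731N": reference residue,
# position, altered residue (one-letter codes, "*" for a stop).
AA_CHANGE = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_aa_change(change):
    """Return (ref_aa, position, alt_aa), or None if the value has another form."""
    if not isinstance(change, str):
        return None  # e.g. NaN for a missing value
    match = AA_CHANGE.match(change)
    if match is None:
        return None
    ref_aa, position, alt_aa = match.groups()
    return ref_aa, int(position), alt_aa


# Values copied from the mutation table preview.
for change in ["p.D731N", "p.E834K", "p.I3148I"]:
    print(change, parse_aa_change(change))
```

A row whose reference and altered residues are identical, such as `p.I3148I`, corresponds to a `Silent` effect in the table.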

## Copy Number Variations

A copy number variation (CNV) is when the number of copies of a particular gene varies from one individual to the next.

### DATA FIELDS, shape (24802, 372)
``` Gene      | Chr, Start, Stop | Strand     |   SampleID 1..SampleID N```

``` string    |string, int, int  | int        |  int..int```


## Methylation, gene expression regulation

The degree of [methylation](https://en.wikipedia.org/wiki/DNA_methylation)
measures the addition of methyl groups to the DNA. Increased methylation is associated with less transcription of the DNA:
as a rule of thumb, methylated means the gene is switched OFF and unmethylated means the gene is switched ON.

Alterations of DNA methylation have been recognized as an important component of cancer development.


### DATA FIELDS, shape (485577, 483) 
``` probeID   | Chr, Start, Stop | Strand  | Gene   |  Relation_CpG_island | SampleID 1..SampleID N```

``` string    |string, int, int  | int     | string |   string             | float..float```


## RNA, gene expression

Again four building blocks: Adenine (A), Uracil (U), Guanine (G), Cytosine (C).

(DNA template strand) --> (RNA)

A --> U 

T --> A

C --> G

G --> C
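
The base mapping above can be written as a translation table. This is a minimal sketch: `transcribe_template` is a hypothetical helper, the input is taken to be the DNA template strand, and strand direction (5' to 3') is ignored for simplicity.

```python
# Complementary base mapping from the DNA template strand to RNA:
# A -> U, T -> A, C -> G, G -> C
DNA_TO_RNA = str.maketrans("ATCG", "UAGC")


def transcribe_template(dna_template):
    """Return the RNA bases complementary to a DNA template strand."""
    return dna_template.upper().translate(DNA_TO_RNA)


print(transcribe_template("TACCTT"))  # AUGGAA
```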

Gene expression profiles: continuous values resulting from the normalisation of counts.

### DATA FIELDS, shape (60531, 477)
``` Gene      | Chr, Start, Stop | Strand  | SampleID 1..SampleID N```

``` string    |string, int, int  | int     |  float..float```
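
The CNV, methylation, gene expression and miRNA tables share this wide layout: a few annotation columns followed by one column per sample. For per-sample modelling it is usually convenient to separate the annotation from the values and transpose the values so that each row is a sample. The sketch below shows one possible way to do this on a made-up toy frame; the gene names, sample names, values and the helper `split_and_transpose` are illustrative assumptions, not the actual data.

```python
import pandas as pd

# Annotation columns as listed in the field description above.
ANNOTATION_COLS = ["Gene", "Chr", "Start", "Stop", "Strand"]

# Hypothetical toy table in the same wide layout (values are invented).
toy = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B"],
    "Chr": ["chr1", "chr2"],
    "Start": [100, 500],
    "Stop": [200, 900],
    "Strand": [1, -1],
    "SAMPLE_1": [0.5, 2.0],
    "SAMPLE_2": [1.5, 0.0],
})


def split_and_transpose(df, annotation_cols=ANNOTATION_COLS, id_col="Gene"):
    """Split a wide table into per-feature annotation and a samples-by-features matrix."""
    annotation = df[annotation_cols].set_index(id_col)
    sample_cols = [c for c in df.columns if c not in annotation_cols]
    values = df.set_index(id_col)[sample_cols].T  # rows: samples, columns: features
    values.index.name = "SampleID"
    return annotation, values


annotation, samples_by_gene = split_and_transpose(toy)
print(samples_by_gene)
```

For the methylation and miRNA tables, `annotation_cols` and `id_col` would need to be adapted to their own annotation fields (for example `probeID` or `MIMATID`).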


## miRNA, transcriptomics

miRNAs are small non-coding RNAs that regulate gene expression after transcription, so they sit between RNA production and protein creation. Perhaps miRNA expression values can be associated with specific proteins.

### DATA FIELDS, shape (2220, 458)
``` MIMATID  | Name   | Chr, Start, Stop | Strand  | SampleID 1..SampleID N```

``` string   | string |string, int, int  | int     |  float..float```


## Proteomes

Protein expression profiles: as above, continuous values resulting from the normalisation of counts.


### DATA FIELDS, shape (282, 355)
``` ProteinID  | SampleID 1..SampleID N```

``` string     | float..float```

### QUIZ, identify our data sets in the following image!


![image.png](_hackathon2018/_images/overview.png)


## GOAL

Some degree of multi-omic analysis and identification of pathways.

![image.png](_hackathon2018/_images/multi_omic.png)


# load in data...

In [61]:
# All Melanoma tables are tab-separated files in the same directory.
DATA_DIR = '/media/koekiemonster/DATA-FAST/genetic_expression/hackathon_2/Melanoma'
# Start/Stop are forced to float64 for some tables, presumably so that
# missing positions do not break dask's dtype inference.
POSITION_DTYPES = {'Start': 'float64', 'Stop': 'float64'}

def read_layer(filename, **kwargs):
    # Lazily read one tab-separated table as a dask dataframe.
    return dd.read_csv(os.path.join(DATA_DIR, filename), sep="\t", **kwargs)

data_clinical = read_layer('Melanoma_Phenotype_Metadata.txt')
data_gene_expression = read_layer('Melanoma_GeneExpression.txt', dtype=POSITION_DTYPES)
data_copy_number = read_layer('Melanoma_CNV.txt', dtype=POSITION_DTYPES)
data_miRNA = read_layer('Melanoma_miRNA.txt')
data_Mutation = read_layer('Melanoma_Mutation.txt')
data_Methylation = read_layer('Melanoma_Methylation.txt', dtype=POSITION_DTYPES)
data_Proteome = read_layer('Melanoma_Proteome.txt')

In [47]:
df_Methylation = data_Methylation.compute()
df_GeneExpression = data_gene_expression.compute()
df_proteome = data_Proteome.compute()
df_mutation = data_Mutation.compute()
df_copy_number = data_copy_number.compute()
df_miRNA = data_miRNA.compute()

In [82]:
df_mutation[:10]

| Unnamed: 0 | Sample | Chr | Start | Stop | Ref | Alt | Gene | Effect | DNA_VAF | RNA_VAF | Amino_Acid_Change |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TCGA-D3-A3ML-06 | chr5 | 140182973 | 140182973 | G | A | PCDHA3 | Missense_Mutation | 0.532468 | | p.D731N |
| 1 | TCGA-D3-A3ML-06 | chr2 | 133541884 | 133541884 | C | T | NCKAP5 | Missense_Mutation | 0.52 | | p.E834K |
| 2 | TCGA-D3-A3ML-06 | chr19 | 51217544 | 51217544 | C | T | SHANK1 | Missense_Mutation | 0.21875 | | p.G179R |
| 3 | TCGA-D3-A3ML-06 | chr8 | 2820757 | 2820757 | G | A | CSMD1 | Silent | 0.515723 | | p.I3148I |
| 4 | TCGA-D3-A3ML-06 | chr19 | 43708353 | 43708353 | C | T | PSG4 | Missense_Mutation | 0.488764 | | p.E39K |
| 5 | TCGA-D3-A3ML-06 | chr5 | 125919676 | 125919676 | A | G | ALDH7A1 | Missense_Mutation | 0.431818 | | p.V114A |
| 6 | TCGA-D3-A3ML-06 | chr6 | 136597581 | 136597581 | G | A | BCLAF1 | Missense_Mutation | 0.368421 | | p.S361L |
| 7 | TCGA-D3-A3ML-06 | chr9 | 137630613 | 137630613 | A | G | COL5A1 | Missense_Mutation | 0.141026 | | p.T485A |
| 8 | TCGA-D3-A3ML-06 | chr6 | 38840840 | 38840840 | G | A | DNAH8 | Missense_Mutation | 0.59375 | | p.E2249K |
| 9 | TCGA-D3-A3ML-06 | chr5 | 140433458 | 140433458 | G | A | PCDHB1 | Silent | 0.487805 | | p.R801R |


# feature manipulation

# feature normalisation

# feature batching and transposition

## per layer clustering

## per layer classification

# feature merging