# Melt mutation data

This notebook melts the wide mutation matrix into a long table of mutated sample/gene pairs, for use by the <code>core-service</code> repository.

Imports...

In [1]:
import pandas as pd
import os

Load mutation matrix...

In [2]:
%%time
mutation_path = os.path.join('data', 'mutation-matrix.tsv.bz2')
mutation_df = pd.read_table(mutation_path)

CPU times: user 1min 2s, sys: 2.73 s, total: 1min 4s
Wall time: 1min 4s


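Take a small subset to illustrate the approach: the first five samples with a mutation in gene <code>1</code>, restricted to the columns up to and including gene <code>12</code>.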

In [3]:
mutation_subset_df = mutation_df[mutation_df['1']==1].head().loc[:, :'12']
mutation_subset_df

,sample_id,1,2,3,9,10,12
152,TCGA-18-3406-01,1,0,0,0,0,0
559,TCGA-38-4631-01,1,0,0,0,0,0
584,TCGA-3A-A9IU-01,1,0,0,0,0,0
849,TCGA-55-8089-01,1,1,0,0,0,1
867,TCGA-55-8507-01,1,0,0,0,0,0


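<a href='http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html'>Melt</a> the mutation data frame. This converts the matrix into a long table with one row per sample/gene combination and three columns: <code>sample_id</code>, <code>entrez_gene_id</code>, and the mutation status, which <code>pd.melt</code> stores in a column named <code>value</code> by default. The table is then filtered to rows where the status is 1, leaving only mutated sample/gene pairs, and the now-constant <code>value</code> column is dropped. Note that this correctly picks out the seven mutated sample/gene pairs in the subset above.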

In [4]:
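# Reshape from wide (one column per gene) to long (one row per sample/gene combination)
melted_mutation_subset_df = pd.melt(mutation_subset_df, id_vars='sample_id', var_name='entrez_gene_id')
# Keep only mutated pairs and drop the now-constant value column
melted_mutation_subset_df[melted_mutation_subset_df.value==1].drop('value', axis=1)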

,sample_id,entrez_gene_id
0,TCGA-18-3406-01,1
1,TCGA-38-4631-01,1
2,TCGA-3A-A9IU-01,1
3,TCGA-55-8089-01,1
4,TCGA-55-8507-01,1
8,TCGA-55-8089-01,2
28,TCGA-55-8089-01,12
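
As a minimal sketch of the same melt-and-filter step on a made-up matrix, the cell below checks that each 1 in the wide table produces exactly one row in the long table. The sample names `S1` to `S3` and the data are hypothetical, and `value_name='mutation_status'` is an optional naming choice that this notebook's own cells do not use.

```python
import pandas as pd

# Hypothetical toy matrix: one row per sample, one column per Entrez gene ID.
toy_df = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3'],
    '1': [1, 0, 1],
    '2': [0, 0, 1],
})

# Melt to long format. value_name is an assumption for readability;
# the notebook itself keeps the default column name 'value'.
long_df = pd.melt(
    toy_df,
    id_vars='sample_id',
    var_name='entrez_gene_id',
    value_name='mutation_status',
)

# Keep only mutated pairs, then drop the now-constant status column.
mutated_df = long_df[long_df['mutation_status'] == 1].drop('mutation_status', axis=1)

# Each 1 in the wide matrix should yield exactly one row in the long table.
n_ones = int(toy_df.drop('sample_id', axis=1).to_numpy().sum())
assert len(mutated_df) == n_ones
print(mutated_df)
```

The retained rows keep their index from the melted frame, which is why the output above shows non-consecutive index values such as 8 and 28.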


Finally, apply the same steps to the full mutation matrix and write the result to a <code>.tsv</code> file.

In [5]:
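melted_mutation_path = os.path.join('data', 'melted-mutations.tsv')
# Reshape the full matrix to one row per sample/gene combination
melted_mutation_df = pd.melt(mutation_df, id_vars='sample_id', var_name='entrez_gene_id')
# Boolean mask selecting the mutated pairs
mutations = melted_mutation_df.value==1
# Drop the now-constant value column and write a tab-separated file without the index
melted_mutation_df[mutations].drop('value', axis=1).to_csv(melted_mutation_path, sep='\t', index=False)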
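
For a consumer of the written file, one detail may be worth checking: the gene IDs are string column labels in the matrix, but a reader that infers types will likely parse them as integers. The sketch below illustrates this with a hypothetical three-row table and an in-memory buffer standing in for `melted-mutations.tsv`; it is an illustration under those assumptions, not a description of how `core-service` actually reads the file.

```python
import io
import pandas as pd

# Hypothetical melted table standing in for the contents of melted-mutations.tsv.
melted_df = pd.DataFrame({
    'sample_id': ['S1', 'S3', 'S3'],
    'entrez_gene_id': ['1', '1', '2'],
})

# Write to an in-memory buffer with the same options as the notebook's to_csv call.
buffer = io.StringIO()
melted_df.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)

# Read the table back; with no dtype given, pandas infers integer gene IDs.
default_df = pd.read_csv(buffer, sep='\t')
buffer.seek(0)

# Passing an explicit dtype keeps gene IDs as strings, matching the matrix column labels.
string_df = pd.read_csv(buffer, sep='\t', dtype={'entrez_gene_id': str})

print(default_df.dtypes)
print(string_df.dtypes)
```

Whether integer or string IDs are preferable depends on the consumer; the point is only that the choice should be made explicitly when reading.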