# TCGA - gene expression - data preprocessing

The gene-expression dataset obtained from Pancancer TCGA has about 60k features. To reduce that number, this notebook applies two unsupervised feature-filtering steps:

First, genes whose expression is constant throughout the dataset, that is, whose standard deviation is zero (std = 0, checked in practice as std < 1e-12), are removed.

Then the Median Absolute Deviation (MAD) of each remaining gene is computed across the dataset, and the 20k genes with the highest MAD are selected.
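The two steps can be sketched on a toy table before running them on the full dataset. This is a minimal illustration, not the notebook's exact code: the helper `filter_genes`, the gene names, and the sample ids are hypothetical, and the MAD here is the plain, unscaled median of absolute deviations from the median.

```python
import numpy as np
import pandas as pd


def filter_genes(df, top_k, tol=1e-12):
    """Drop (near-)constant columns, then keep the top_k columns by MAD."""
    std = df.std(axis=0)
    varying = df.loc[:, std >= tol]
    # Unscaled MAD per column: median of absolute deviations from the column median.
    mad = (varying - varying.median(axis=0)).abs().median(axis=0)
    keep = mad.sort_values(ascending=False).index[:top_k]
    return varying[keep]


# Toy data: 5 samples x 4 hypothetical genes.
toy = pd.DataFrame(
    {
        "gene_a": [1.0, 1.0, 1.0, 1.0, 1.0],      # constant, removed in step 1
        "gene_b": [0.0, 1.0, 2.0, 3.0, 4.0],      # MAD = 1
        "gene_c": [0.0, 10.0, 20.0, 30.0, 40.0],  # MAD = 10
        "gene_d": [5.0, 5.0, 5.0, 5.0, 6.0],      # std > 0 but MAD = 0
    },
    index=["s1", "s2", "s3", "s4", "s5"],
)

filtered = filter_genes(toy, top_k=2)
print(list(filtered.columns))  # ['gene_c', 'gene_b']
```

Note that `gene_d` survives the std filter but has a MAD of zero, which shows why the two criteria are not redundant: MAD ignores variation carried by a minority of samples.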

In [None]:
import pandas as pd
import numpy as np
import plotly.offline as py
import plotly.graph_objs as go

## Data loading

In [None]:
df_gene_exp = pd.read_hdf('data/TCGA_data.h5', key='both_gene_expression')

In [None]:
df_gene_exp.isnull().any().any()

In [None]:
df_gene_exp.shape

## Standard Deviation (std)

In [None]:
std = df_gene_exp.std(axis=0)

In [None]:
py.init_notebook_mode()

data = [go.Histogram(x=np.array(std))]

py.iplot(data)

In [None]:
len(std[std<1e-12])

In [None]:
len(std[std>=1e-12])

In [None]:
len(std[std>=1e-12])+len(std[std<1e-12])

In [None]:
df_gene_exp_1 = df_gene_exp[std[std>=1e-12].index]

In [None]:
df_gene_exp_1.shape

In [None]:
py.init_notebook_mode()

data = [go.Histogram(x=np.array(df_gene_exp_1.std()))]

py.iplot(data)

## Median-Absolute-Deviation (MAD)

In [None]:
from statsmodels import robust

mad = pd.Series(robust.scale.mad(df_gene_exp_1), index=df_gene_exp_1.columns)
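As far as I know, `robust.scale.mad` with its default arguments does not return the raw MAD: it divides it by a constant of about 0.6745 (the 0.75 quantile of the standard normal distribution), so that the result is comparable to a standard deviation for normally distributed data. Because the same constant applies to every gene, the ranking used for the top-20k selection is not affected. The NumPy-only sketch below reproduces that definition under this assumption; `raw_mad` and `scaled_mad` are hypothetical helper names, not statsmodels functions.

```python
from statistics import NormalDist

import numpy as np

# Consistency constant: 0.75 quantile of the standard normal (about 0.6745).
C = NormalDist().inv_cdf(0.75)


def raw_mad(a, axis=0):
    """Median of absolute deviations from the median, unscaled."""
    a = np.asarray(a, dtype=float)
    center = np.median(a, axis=axis, keepdims=True)
    return np.median(np.abs(a - center), axis=axis)


def scaled_mad(a, axis=0):
    """Raw MAD divided by C (the form assumed here for the statsmodels default)."""
    return raw_mad(a, axis=axis) / C


# Two toy columns (genes) over five samples.
x = np.array([[0.0, 0.0], [1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
print(raw_mad(x))     # raw MAD per column: 1 and 10
print(scaled_mad(x))  # the same values divided by C
```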

In [None]:
py.init_notebook_mode()

data = [go.Histogram(x=np.array(mad))]

py.iplot(data)

In [None]:
df_gene_exp_2 = df_gene_exp_1[mad.sort_values(ascending=False)[0:20000].index]

In [None]:
df_gene_exp_2.shape

In [None]:
py.init_notebook_mode()

data = [go.Histogram(x=np.array(mad.sort_values(ascending=False)[0:20000]))]

py.iplot(data)

In [None]:
# Build a std/MAD table restricted to the 20k selected genes
mad_20 = mad.sort_values(ascending=False)[0:20000].to_frame('mad')
std_df = std.to_frame('std')  # separate name, so the `std` Series is not overwritten
std_mad_20 = pd.merge(std_df, mad_20, left_index=True, right_index=True)

In [None]:
# Create a trace
trace = go.Scatter(
    x = list(std_mad_20['std']),
    y = list(std_mad_20['mad']),
    mode = 'markers'
)

data = [trace]
layout = go.Layout(title="Selected features: std-MAD", xaxis=dict(title='Standard Deviation - std'),
                   yaxis=dict(title='Median Absolute Deviation - MAD'))
fig = go.Figure(data=data, layout=layout)

py.iplot(fig)

## Splitting

First, the BRCA and non-BRCA patient IDs are loaded in order to split the preprocessed dataframe.

In [None]:
non_brca_patients = list(pd.read_hdf('data/TCGA_data.h5', key='non_brca_patients')[0])

In [None]:
brca_patients = list(pd.read_hdf('data/TCGA_data.h5', key='brca_patients')[0])

In [None]:
prep_brca_gene_exp = df_gene_exp_2.loc[brca_patients]
prep_non_brca_gene_exp = df_gene_exp_2.loc[non_brca_patients]
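Selecting rows with `.loc` and a list of labels raises a `KeyError` if any label is missing from the index, so it can help to check the id lists before splitting. The following is a minimal sketch on toy data; the helper `split_by_ids` and the patient ids are hypothetical, and the disjointness check is an extra assumption about how the two groups should relate.

```python
import pandas as pd


def split_by_ids(df, ids_a, ids_b):
    """Split df rows into two frames after checking the id lists against the index."""
    ids_a, ids_b = pd.Index(ids_a), pd.Index(ids_b)
    missing = ids_a.union(ids_b).difference(df.index)
    if len(missing) > 0:
        raise KeyError(f"ids not found in the dataframe index: {list(missing)}")
    overlap = ids_a.intersection(ids_b)
    if len(overlap) > 0:
        raise ValueError(f"ids present in both groups: {list(overlap)}")
    return df.loc[ids_a], df.loc[ids_b]


toy = pd.DataFrame({"gene_x": [1.0, 2.0, 3.0]}, index=["p1", "p2", "p3"])
group_a, group_b = split_by_ids(toy, ["p1", "p3"], ["p2"])
print(group_a.shape, group_b.shape)  # (2, 1) (1, 1)
```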

In [None]:
prep_brca_gene_exp.isnull().any().any()

In [None]:
prep_non_brca_gene_exp.isnull().any().any()

In [None]:
prep_brca_gene_exp.shape

In [None]:
prep_non_brca_gene_exp.shape

## Data export

In [None]:
with pd.HDFStore('data/TCGA_gene_exp_20k_std-MAD.h5', 'w') as store:
    store['brca'] = prep_brca_gene_exp
    store['non_brca'] = prep_non_brca_gene_exp