This notebook analyses the NKI gene expression dataset.  
https://www.nature.com/articles/srep01236  

The dataset can be downloaded from https://data.world/deviramanan2016/nki-breast-cancer-data  

This notebook was prepared by Davide Gurnari. 

In [1]:
import numpy as np
import pandas as pd
import networkx as nx
from sklearn.preprocessing import MinMaxScaler

from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
from pyballmapper import BallMapper
from pyballmapper.plotting import graph_GUI
from bokeh.plotting import figure, show

from matplotlib import colormaps as cm

In [3]:
# this cell allows for bokeh plots to be displayed inside jupyter notebooks
from bokeh.io import output_notebook
output_notebook()

## Load data

In [4]:
nki_cleaned_df = pd.read_csv('data/nki_cleaned.csv')
print(nki_cleaned_df.shape)
nki_cleaned_df.head()

(272, 1570)


Unnamed: 0,patient,id,age,eventdeath,survival,timerecurrence,chemo,hormonal,amputation,histtype,...,contig36312_rc,contig38980_rc,nm_000853,nm_000854,nm_000860,contig29014_rc,contig46616_rc,nm_000888,nm_000898,af067420
0,s122,18,43,False,14.817248,14.817248,False,False,True,1,...,0.591103,-0.355018,0.373644,-0.76069,-0.164025,-0.038726,0.237856,-0.087631,-0.369153,0.153795
1,s123,19,48,False,14.261465,14.261465,False,False,False,1,...,-0.199829,-0.001635,-0.062922,-0.682204,-0.220934,-0.100088,-0.466537,-0.231547,-0.643019,-0.014098
2,s124,20,38,False,6.644764,6.644764,False,False,False,1,...,0.328736,-0.047571,0.084228,-0.69595,-0.40284,-0.099965,0.110155,-0.114298,0.258495,-0.198911
3,s125,21,50,False,7.748118,7.748118,False,True,False,1,...,0.648861,-0.039088,0.182182,-0.52464,0.03732,-0.167688,-0.01679,-0.285344,-0.251188,0.86271
4,s126,22,38,False,6.436687,6.31896,False,False,True,1,...,-0.287538,-0.286893,0.057082,-0.565021,-0.105632,-0.108148,-0.405853,-0.053601,-0.677072,0.13416


In [5]:
# the first 17 columns hold patient information; the remaining columns are gene expression data
nki_cleaned_df.columns[:17]

Index(['patient', 'id', 'age', 'eventdeath', 'survival', 'timerecurrence',
       'chemo', 'hormonal', 'amputation', 'histtype', 'diam', 'posnodes',
       'grade', 'angioinv', 'lymphinfil', 'barcode', 'esr1'],
      dtype='object')

In [6]:
# we will use the gene expression data as features
# and 'eventdeath' as target
X = nki_cleaned_df[nki_cleaned_df.columns[17:]]
y = nki_cleaned_df[['eventdeath']].astype(int)
y.mean()

eventdeath    0.283088
dtype: float64

## Euclidean distance

With the standard Euclidean distance, the Ball Mapper graph reveals little structure: at `eps = 10`, 251 of the 272 samples become landmarks.
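
One way to judge whether a radius is reasonable is to compare it with the distribution of pairwise distances. The sketch below uses synthetic Gaussian data as a stand-in (`X_demo`, its shape, and the quantile levels are illustrative assumptions, not part of the original analysis). In high dimensions, Euclidean distances tend to concentrate in a narrow band, so a single `eps` can easily isolate almost every point or merge almost everything.

```python
import numpy as np
from scipy.spatial.distance import pdist

# synthetic stand-in for a high-dimensional expression matrix (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 1500))

# condensed vector of all pairwise Euclidean distances
d = pdist(X_demo, metric="euclidean")

# compare a candidate radius against these quantiles
q05, q50, q95 = np.quantile(d, [0.05, 0.5, 0.95])
relative_spread = (q95 - q05) / q50

print(f"5%: {q05:.2f}  median: {q50:.2f}  95%: {q95:.2f}")
print(f"relative spread: {relative_spread:.3f}")
```

Running the same quantile check on `X.to_numpy()` would show where `eps = 10` falls relative to the typical distance between samples.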

In [10]:
EPS = 10

X = nki_cleaned_df[nki_cleaned_df.columns[17:]]

nki_bm = BallMapper(X = X.to_numpy(), 
                    eps = EPS,
                    verbose=True) 

my_red_palette = cm.get_cmap('Reds')
nki_bm.add_coloring(coloring_df=nki_cleaned_df[['eventdeath', 'esr1']])

euclidean_gui = graph_GUI(nki_bm.Graph,
                          my_red_palette,
                          ['eventdeath', 'esr1'])
euclidean_gui.color_by_variable('eventdeath')


show(euclidean_gui.plot)

Finding vertices...
251 vertices found.
Computing points_covered_by_landmarks...
Running BallMapper 
Finding edges...
Creating Ball Mapper graph...
Done
color by variable eventdeath 
MIN_VALUE: 0.000, MAX_VALUE: 1.000
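
The log above names three stages: finding vertices (landmarks), computing the points covered by each landmark, and finding edges. A minimal sketch of these stages follows, assuming a greedy landmark selection and Euclidean balls. The function `ball_mapper_sketch` is a simplified illustration written for this notebook, not the `pyballmapper` implementation.

```python
import numpy as np
import networkx as nx

def ball_mapper_sketch(points, eps):
    """Simplified Ball Mapper: greedy landmarks, ball cover, overlap edges."""
    landmarks = []  # indices of the points chosen as ball centers
    for i, p in enumerate(points):
        # a point becomes a landmark if no existing landmark is within eps
        if all(np.linalg.norm(p - points[j]) > eps for j in landmarks):
            landmarks.append(i)

    # each ball covers every point within eps of its center
    covered = {
        node: {i for i, p in enumerate(points)
               if np.linalg.norm(p - points[center]) <= eps}
        for node, center in enumerate(landmarks)
    }

    # two nodes are joined when their balls share at least one point
    graph = nx.Graph()
    graph.add_nodes_from(covered)
    for a in covered:
        for b in covered:
            if a < b and covered[a] & covered[b]:
                graph.add_edge(a, b)
    return graph, covered

# ten evenly spaced points on a line
pts = np.arange(10, dtype=float).reshape(-1, 1)
g, cov = ball_mapper_sketch(pts, eps=1.5)
print(g.number_of_nodes(), g.number_of_edges())
```

When `eps` is small relative to the typical distance between samples, nearly every point becomes its own landmark and few balls overlap, which matches the 251 vertices found for 272 samples.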


## Cosine distance

With the cosine distance, however, the graph splits into two clusters.
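
The cosine distance is one minus the cosine of the angle between two vectors, so it compares the shape of two expression profiles and ignores their overall magnitude. The small check below uses made-up vectors (`u`, `v`, `w` are illustrative only) to show that a rescaled profile is at cosine distance zero while its Euclidean distance is large.

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

u = np.array([1.0, 2.0, 3.0])
v = 10.0 * u                     # same direction, larger magnitude
w = np.array([3.0, -1.0, 0.5])   # a different direction

# cosine distance written out by hand: 1 - <u, w> / (|u| |w|)
manual = 1.0 - u @ w / (np.linalg.norm(u) * np.linalg.norm(w))

print("cosine(u, v):   ", cosine(u, v))
print("euclidean(u, v):", euclidean(u, v))
print("cosine(u, w):   ", cosine(u, w), "manual:", manual)
```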

In [11]:
from scipy.spatial.distance import cosine

In [13]:
EPS_cosine = 0.43

X = nki_cleaned_df[nki_cleaned_df.columns[17:]]

nki_cosine_bm = BallMapper(X = X.to_numpy(),   
                           eps = EPS_cosine, 
                           metric=cosine,  # a custom distance function
                           verbose='tqdm') 
nki_cosine_bm.add_coloring(coloring_df=nki_cleaned_df[['eventdeath', 'esr1']])

my_red_palette = cm.get_cmap('Reds')

cosine_gui = graph_GUI(nki_cosine_bm.Graph,
                       my_red_palette,
                       tooltips_variables=['eventdeath', 'esr1']
                       )
cosine_gui.color_by_variable('eventdeath')

show(cosine_gui.plot)

using custom distance <function cosine at 0x1562bbd00>
Finding vertices...
193 vertices found.
Computing points_covered_by_landmarks...
Running BallMapper 
Finding edges...
Creating Ball Mapper graph...
Done
color by variable eventdeath 
MIN_VALUE: 0.000, MAX_VALUE: 1.000


Coloring the nodes by the expression level of the estrogen receptor gene (`esr1`) shows a clear separation between the two clusters.

In [14]:
cosine_gui.color_by_variable('esr1')

show(cosine_gui.plot)

color by variable esr1 
MIN_VALUE: -1.334, MAX_VALUE: 0.588
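
The color scales reported in the logs are consistent with each node being colored by an average over the samples in its ball. The toy sketch below assumes that behaviour (we believe it is the default of `add_coloring`, but treat it as an assumption). The names `points_covered` and `toy_df` and all values in them are made up for illustration and are not taken from the NKI data.

```python
import pandas as pd

# hypothetical cover: node id -> row indices of the samples inside that ball
points_covered = {0: [0, 1, 2], 1: [2, 3], 2: [4]}

# made-up per-sample values for the two coloring variables
toy_df = pd.DataFrame({
    "eventdeath": [0, 0, 1, 1, 1],
    "esr1": [0.4, 0.2, -0.3, -0.9, -1.2],
})

# one row per node, one column per variable, each entry a per-ball mean
node_colors = pd.DataFrame(
    {node: toy_df.iloc[idx].mean() for node, idx in points_covered.items()}
).T

print(node_colors)
```

Under this assumption, the node value for a binary variable such as `eventdeath` is the fraction of covered patients who died, which is why its scale runs from 0 to 1.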


## Correlation distance
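
SciPy's `correlation` function computes the correlation distance, that is, one minus the Pearson correlation of the two vectors. Equivalently, it is the cosine distance after subtracting each vector's mean, so it ignores both the scale and the baseline level of an expression profile. The check below uses random vectors (`u` and `v` are illustrative, not taken from the dataset).

```python
import numpy as np
from scipy.spatial.distance import correlation, cosine

rng = np.random.default_rng(1)
u = rng.normal(size=50)
v = rng.normal(size=50)

# cosine distance of the mean-centered vectors
centered = cosine(u - u.mean(), v - v.mean())

# Pearson correlation coefficient from NumPy
pearson = np.corrcoef(u, v)[0, 1]

print("correlation distance:", correlation(u, v))
print("centered cosine:     ", centered)
print("1 - pearson:         ", 1.0 - pearson)
```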

In [15]:
from scipy.spatial.distance import correlation

In [18]:
EPS_correlation = 0.5

X = nki_cleaned_df[nki_cleaned_df.columns[17:]]

nki_correlation_bm = BallMapper(X = X.to_numpy(),
                                eps = EPS_correlation,
                                metric=correlation,  # a custom distance function
                                verbose=True)

my_red_palette = cm.get_cmap('Reds')
nki_correlation_bm.add_coloring(coloring_df=nki_cleaned_df[['eventdeath', 'esr1']])
correlation_gui = graph_GUI(nki_correlation_bm.Graph,
                            my_red_palette,
                            tooltips_variables=['eventdeath', 'esr1'])
correlation_gui.color_by_variable('eventdeath')

show(correlation_gui.plot)

using custom distance <function correlation at 0x1562bbc70>
Finding vertices...
185 vertices found.
Computing points_covered_by_landmarks...
Running BallMapper 
Finding edges...
Creating Ball Mapper graph...
Done
color by variable eventdeath 
MIN_VALUE: 0.000, MAX_VALUE: 1.000


The flare with lower survival probability corresponds to high-ESR1 patients who do not respond well to therapy.

In [19]:
correlation_gui.color_by_variable('esr1')

show(correlation_gui.plot)

color by variable esr1 
MIN_VALUE: -1.334, MAX_VALUE: 0.588
