Details:

https://academic.oup.com/jnci/article/104/4/311/979947

> Single sample predictors (SSPs) and Subtype classification models (SCMs) are gene expression–based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.
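
The abstract mentions nearest centroid classification as the second step of an SSP. As a rough illustration of that idea only (not the classifier from the paper), here is a minimal sketch with NumPy. The data, the class labels `A`/`B`/`C`, the helper names, and the use of Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def fit_centroids(X, labels):
    """Return {label: mean feature vector} for each class in `labels`."""
    labels = np.asarray(labels)
    return {str(lab): X[labels == lab].mean(axis=0) for lab in np.unique(labels)}

def predict_nearest_centroid(X, centroids):
    """Assign each row of X to the class whose centroid is closest (Euclidean)."""
    names = list(centroids)
    C = np.stack([centroids[n] for n in names])                  # (n_classes, n_features)
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)  # (n_samples, n_classes)
    return [names[i] for i in dist.argmin(axis=1)]

# Made-up training data; columns stand in for (ERBB2, ESR1, AURKA)
X_train = np.array([
    [0.1, 0.1, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.2, 0.7],
    [1.0, 0.1, 0.8],
    [0.2, 0.9, 0.2],
    [0.1, 1.0, 0.1],
])
y_train = ["A", "A", "B", "B", "C", "C"]

centroids = fit_centroids(X_train, y_train)
X_new = np.array([[0.15, 0.05, 0.85], [0.95, 0.15, 0.75], [0.15, 0.95, 0.15]])
pred = predict_nearest_centroid(X_new, centroids)
print(pred)
```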

AURKA (aurora kinase A) is already a gene symbol, so it needs no mapping

ER is ESR1 (Source: https://www.genecards.org/cgi-bin/carddisp.pl?gene=ESR1)

HER2 is ERBB2 (Source: https://www.genecards.org/cgi-bin/carddisp.pl?gene=ERBB2)

In [3]:
%load_ext autoreload
%autoreload 2
%matplotlib inline



### Customize this for each notebook

In [8]:
OUTPUT_DIR='Three-Gene-Model-Lab-Demo/'
OPTIONS={'seed':3}
PREFIX="_".join([f"{key}={OPTIONS[key]}" for key in OPTIONS.keys()])
RESULTS={}
PREFIX

'seed=3'

In [9]:
from pathlib import Path
home = str(Path.home())

In [10]:
KNOWLEDGE_LIB=f'{home}/knowledgelib'

In [12]:
from IPython.display import display, Markdown, Latex
import sys
sys.path.insert(0,f'{KNOWLEDGE_LIB}')
import pyknowledge
import pandas as pd
import scipy.io
import numpy as np
import joblib

## Load the input data

In [14]:
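## Customize this load to read in the data and format it with the correct columns
def load_data_all(seed):
    """Load the expression matrix, attach subtype labels, and reorder rows by the saved index for `seed`."""
    mat = scipy.io.loadmat("/disk/metabric/BRCA1View20000.mat")
    #gene_labels = open("/disk/metabric/gene_labels.txt").read().split("\n")
    gene_labels = [g[0] for g in mat['gene'][0]]
    # Transpose so that rows are samples and columns are genes
    df = pd.DataFrame(mat['data'].transpose(), columns=gene_labels)
    # NOTE: df.shape is (rows, columns) = (samples, genes), so despite its name
    # n_dim holds the number of samples. The loop below therefore min-max scales
    # only the first n_dim gene columns, which would explain why ERBB2 stays on
    # its original scale in the outputs further down.
    [n_dim, n_sample] = df.shape
    for i in range(n_dim):
        m1 = min(df.iloc[:,i])
        m2 = max(df.iloc[:,i])
        df.iloc[:,i] =(df.iloc[:,i] - m1)/(m2 - m1)
    df['target'] = mat['targets']
    # Map the numeric target codes to subtype names and plotting colors
    df['Subtype'] = df.target.map({1:'Basal',2:'HER2+',3:'LumA',4:'LumB',5:'Normal Like',6:'Normal'})
    df['color'] = df.target.map({1:'red',2:'green',3:'purple',4:'cyan',5:'blue',6:'green'})
    df['graph_color'] = df.target.map({1:'#FFFFFF',2:'#F5F5F5',3:'#FFFAFA',4:'#FFFFF0',5:'#FFFAF0',6:'#F5FFFA'})
    # Reorder the rows with the precomputed index for this seed, then renumber them 0..n-1
    index = joblib.load(f'/disk/metabric/index_{seed}.joblib.z')
    df = df.iloc[index,:]
    df = df.set_index(np.arange(len(df)))

    return df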
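
The loader rescales gene columns to the range [0, 1] one column at a time. Below is a minimal vectorized sketch of the same min-max formula on a made-up toy frame (the gene names `G1` to `G4` and all values are hypothetical). It also shows that `DataFrame.shape` returns `(rows, columns)`, which is `(samples, genes)` for this layout.

```python
import pandas as pd

# Toy frame: 3 samples (rows) x 4 genes (columns); values are made up
toy = pd.DataFrame({
    "G1": [2.0, 4.0, 6.0],
    "G2": [10.0, 10.5, 11.0],
    "G3": [0.0, 5.0, 10.0],
    "G4": [7.0, 8.0, 9.0],
})

n_samples, n_genes = toy.shape  # (3, 4): rows first, columns second

# Column-wise min-max scaling: (x - min) / (max - min) for every gene at once
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
scaled = (toy - col_min) / (col_max - col_min)

print(n_samples, n_genes)
print(scaled)
```

A column with a constant value would give 0/0 here and come out as NaN, so real data may need a guard for that case.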

In [17]:
df_all = load_data_all(OPTIONS['seed'])

In [18]:
df_all.Subtype.value_counts() # basal-like, HER2-enriched, luminal A, and luminal B

LumA           721
LumB           491
Basal          330
HER2+          239
Normal Like    202
Normal         150
Name: Subtype, dtype: int64

## Knowledge

#### Genes

In [19]:
knowledge_genes = ["ERBB2","ESR1","AURKA"]

In [22]:
genes_df_all = df_all[knowledge_genes+["Subtype"]]
genes_df_all.head()

| | ERBB2 | ESR1 | AURKA | Subtype |
|---|---|---|---|---|
| 0 | 7.063275 | 0.71751 | 0.378666 | LumB |
| 1 | 7.248318 | 0.560912 | 0.23741 | Normal Like |
| 2 | 6.860123 | 0.651396 | 0.224923 | LumA |
| 3 | 7.010364 | 0.789543 | 0.12955 | LumA |
| 4 | 7.44147 | 0.88332 | 0.182163 | LumA |


In [58]:
means = genes_df_all.groupby('Subtype').mean()
medians = genes_df_all.groupby('Subtype').median()
stdevs = genes_df_all.groupby('Subtype').std()

means

| Subtype | ERBB2 | ESR1 | AURKA |
|---|---|---|---|
| Basal | 6.953369 | 0.150253 | 0.460046 |
| HER2+ | 7.888029 | 0.314071 | 0.414654 |
| LumA | 7.174146 | 0.691527 | 0.242708 |
| LumB | 7.15405 | 0.715256 | 0.393492 |
| Normal | 6.820962 | 0.448411 | 0.090667 |
| Normal Like | 7.120211 | 0.496876 | 0.221378 |


In [59]:
medians

| Subtype | ERBB2 | ESR1 | AURKA |
|---|---|---|---|
| Basal | 6.856068 | 0.104206 | 0.47045 |
| HER2+ | 8.044117 | 0.222098 | 0.404954 |
| LumA | 7.147807 | 0.70671 | 0.231855 |
| LumB | 7.095976 | 0.735066 | 0.384867 |
| Normal | 6.907045 | 0.430171 | 0.072994 |
| Normal Like | 7.065202 | 0.51905 | 0.210372 |


In [39]:
stdevs

| Subtype | ERBB2 | ESR1 | AURKA |
|---|---|---|---|
| Basal | 0.553819 | 0.132523 | 0.122079 |
| HER2+ | 0.638267 | 0.23814 | 0.10567 |
| LumA | 0.291271 | 0.129719 | 0.094623 |
| LumB | 0.452018 | 0.122941 | 0.108063 |
| Normal | 0.359079 | 0.154813 | 0.078081 |
| Normal Like | 0.446458 | 0.188961 | 0.102247 |


In [60]:
baseline = 'Basal'
baseline_means = means.loc[baseline]
baseline_medians = medians.loc[baseline]
baseline_stdevs = stdevs.loc[baseline]
baseline_means

ERBB2    6.953369
ESR1     0.150253
AURKA    0.460046
Name: Basal, dtype: float64

## Scale

In [61]:
df_subtype = genes_df_all.loc[genes_df_all['Subtype'] != baseline]
df = df_subtype.drop('Subtype',axis=1).subtract(baseline_medians,axis=1).divide(baseline_stdevs,axis=1)
df = df.join(df_subtype[['Subtype']])
df

| | ERBB2 | ESR1 | AURKA | Subtype |
|---|---|---|---|---|
| 0 | 0.374142 | 4.627909 | -0.751834 | LumB |
| 1 | 0.708264 | 3.446241 | -1.908924 | Normal Like |
| 2 | 0.007323 | 4.129022 | -2.011207 | LumA |
| 3 | 0.278604 | 5.171463 | -2.792446 | LumA |
| 4 | 1.057028 | 5.879088 | -2.361470 | LumA |
| ... | ... | ... | ... | ... |
| 2128 | 0.106796 | 3.720567 | 1.050065 | LumB |
| 2129 | 0.276012 | 4.138515 | -3.578229 | Normal |
| 2130 | -0.097714 | 3.703847 | -0.306405 | HER2+ |
| 2131 | 0.133961 | 3.740493 | -1.947008 | LumA |
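
The scaling step expresses every non-baseline sample as a distance from the baseline subtype: subtract the Basal median and divide by the Basal standard deviation, gene by gene. For sample 0 and ESR1, that is (0.717510 - 0.104206) / 0.132523 ≈ 4.628, which agrees with the scaled value shown for that sample. The sketch below repeats the same arithmetic on a tiny made-up frame so the result can be checked by hand; the values are hypothetical.

```python
import pandas as pd

# Made-up data: three baseline (Basal) samples and two others
toy = pd.DataFrame({
    "ERBB2": [1.0, 2.0, 3.0, 5.0, 0.0],
    "ESR1": [0.0, 2.0, 4.0, 3.0, 5.0],
    "Subtype": ["Basal", "Basal", "Basal", "LumA", "HER2+"],
})
genes = ["ERBB2", "ESR1"]

# Baseline statistics, computed on the Basal rows only
base = toy.loc[toy["Subtype"] == "Basal", genes]
base_median = base.median()  # ERBB2 2.0, ESR1 2.0
base_std = base.std()        # sample std (ddof=1): ERBB2 1.0, ESR1 2.0

# Scale every non-baseline sample relative to the baseline
others = toy.loc[toy["Subtype"] != "Basal"]
z = others[genes].subtract(base_median, axis=1).divide(base_std, axis=1)
z = z.join(others[["Subtype"]])
print(z)
```

Because the result is a ratio of differences, it does not change if a gene is first shifted or rescaled by a positive constant, so the sign of each scaled value depends only on whether the sample lies above or below the baseline median.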


## EDA

### Each gene individually

In [62]:
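import altair as alt
alt.data_transformers.disable_max_rows()

# Long format: one row per (sample, gene)
source = df.join(df_all[['target']]).melt(id_vars=['Subtype','target'])
source.columns = ["Subtype","target","Gene","Value"]
# Count rows per (Subtype, Gene) so that each histogram sums to 100% within its facet;
# counting per Subtype alone on the melted frame would count every sample once per gene
counts = source.groupby(['Subtype','Gene'])['target'].count().to_frame()
counts.columns = ['Count']
source = source.set_index(['Subtype','Gene']).join(counts).reset_index()

# Overlaid step histograms, one facet row per gene, normalized within each subtype
g = alt.Chart(source).transform_calculate(
    pct='1 / datum.Count'
).mark_area(
    opacity=0.3,
    interpolate='step'
).encode(
    alt.X('Value:Q', bin=alt.Bin(maxbins=100)),
    alt.Y('sum(pct):Q', axis=alt.Axis(format='%'),stack=None),
    alt.Color('Subtype:N'),
    row='Gene:N'
)
g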
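
The chart weights every row by `1 / Count` and sums the weights inside each bin, so a bar's height is the fraction of the group that falls in that bin. For the percentages to add up to 100%, the group used for `Count` presumably has to be the set of rows drawn as one histogram. A minimal sketch of the same weighting with pandas only, on made-up values and with hypothetical bin edges:

```python
import numpy as np
import pandas as pd

# Made-up values for two groups of different sizes
toy = pd.DataFrame({
    "Subtype": ["LumA"] * 4 + ["LumB"] * 2,
    "Value": [0.1, 0.2, 0.6, 0.7, 0.1, 0.9],
})

# Two bins: [0, 0.5] -> 0 and (0.5, 1] -> 1
edges = np.array([0.0, 0.5, 1.0])
toy["bin"] = pd.cut(toy["Value"], bins=edges, labels=False, include_lowest=True)

# Each row carries weight 1 / (size of its group)
toy["pct"] = 1 / toy.groupby("Subtype")["Value"].transform("count")

# Summing the weights per (group, bin) gives the within-group fraction
hist = toy.groupby(["Subtype", "bin"])["pct"].sum().unstack(fill_value=0)
print(hist)
```

Each row of `hist` sums to 1, which makes groups of different sizes directly comparable on the same axis.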

In [69]:
directions = df.drop('Subtype',axis=1)>0
directions.value_counts()

ERBB2  ESR1   AURKA
True   True   False    1303
False  True   False     261
True   True   True      156
       False  False      46
False  True   True       24
True   False  True       11
False  False  False       2
dtype: int64

In [74]:
directions.join(df[['Subtype']]).groupby(knowledge_genes)['Subtype'].value_counts()

ERBB2  ESR1   AURKA  Subtype    
False  False  False  HER2+            1
                     Normal Like      1
       True   False  LumB            81
                     Normal          66
                     LumA            57
                     Normal Like     49
                     HER2+            8
              True   LumB            20
                     HER2+            3
                     Normal           1
True   False  False  HER2+           43
                     Normal Like      3
              True   HER2+           11
       True   False  LumA           648
                     LumB           300
                     Normal Like    148
                     HER2+          124
                     Normal          83
              True   LumB            90
                     HER2+           49
                     LumA            16
                     Normal Like      1
Name: Subtype, dtype: int64
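
The same breakdown can be easier to scan as a pattern-by-subtype table. The following sketch, on made-up scaled values, encodes each sample's direction pattern as a string such as `++-` (one character per gene, in the order ERBB2, ESR1, AURKA) and cross-tabulates it against the subtype with `pd.crosstab`. The string encoding is an assumption introduced for readability.

```python
import pandas as pd

# Made-up scaled values for four samples
toy = pd.DataFrame({
    "ERBB2": [0.4, -0.2, 1.1, 0.3],
    "ESR1": [4.6, 3.1, -0.5, 5.2],
    "AURKA": [-0.8, 1.0, -0.1, -2.8],
    "Subtype": ["LumB", "LumB", "HER2+", "LumA"],
})
genes = ["ERBB2", "ESR1", "AURKA"]

# True where the sample is above the baseline (scaled value > 0)
signs = toy[genes] > 0

# Encode each row as '+' / '-' per gene, e.g. (True, True, False) -> '++-'
pattern = signs.apply(
    lambda row: "".join("+" if up else "-" for up in row), axis=1
).rename("pattern")

# Rows: direction pattern, columns: subtype, cells: number of samples
table = pd.crosstab(pattern, toy["Subtype"])
print(table)
```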

In [87]:
patterns = {
    (False,True,True),
    (True,False,False),
    (True,True,False) # Maybe?
}

In [92]:
(False,True,True) == (False,True,False)

False

In [95]:
# True for the samples whose direction pattern is one of the selected patterns
mask = [tuple(directions.loc[ix1]) in patterns for ix1 in df.index]

In [96]:
from IPython.display import Image

import networkx as nx

G = nx.Graph()
for ix in df_all.index:
    c = 'white'
    G.add_node(ix,color='black',style='filled',fillcolor=c)

for ix1 in df.loc[mask].index:
    directions1 = tuple(directions.loc[ix1])
    if directions1 not in patterns:
        continue
    for ix2 in df.loc[mask].index:
        directions2 = tuple(directions.loc[ix2])
        if directions2 == directions1:
            G.add_edge(ix1,ix2)
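
Since an edge is added between every pair of retained samples that share a direction pattern, each pattern forms a fully connected group, and the number of pairs grows quadratically with group size. If only the group membership is needed, it can be recovered by bucketing sample indices by pattern. A minimal sketch on made-up data follows; the toy `directions` frame and the `components` dictionary are hypothetical names for the example.

```python
import pandas as pd

# Made-up direction flags for five samples
directions = pd.DataFrame(
    {
        "ERBB2": [True, True, False, True, False],
        "ESR1": [True, False, True, True, False],
        "AURKA": [False, False, True, False, False],
    },
    index=[10, 11, 12, 13, 14],
)
patterns = {(False, True, True), (True, False, False), (True, True, False)}

# Bucket sample indices by pattern; samples with other patterns are skipped
components = {}
for row in directions.itertuples(index=True):
    ix, key = row[0], tuple(bool(v) for v in row[1:])
    if key in patterns:
        components.setdefault(key, []).append(ix)

# Number of distinct sample pairs a pairwise loop would connect in each group
pair_counts = {key: len(ixs) * (len(ixs) - 1) // 2 for key, ixs in components.items()}
print(components)
print(pair_counts)
```

Samples that match no selected pattern stay isolated in the graph, so they would each show up as a single-node connected component.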

In [None]:
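def save(A,file="graph.png"):
    # Render the graph with the 'dot' layout; the image format is taken from the file extension
    g = A.draw(format=file.split(".")[-1], prog='dot')
    # Write the rendered bytes and close the file handle explicitly
    with open(file,"wb") as f:
        f.write(g)
    return Image(g)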

#pos = nx.drawing.nx_agraph.graphviz_layout(G, prog='dot')
#A = nx.nx_agraph.to_agraph(G)
#A.graph_attr["rankdir"] = "LR"
# draw it in the notebook
#save(A,file=f"{OUTPUT_DIR}{PREFIX}_graph.png")

In [100]:
!mkdir -p {OUTPUT_DIR}{PREFIX}_graphs


In [None]:
# One subgraph per connected component of G
graphs = [G.subgraph(c).copy() for c in nx.connected_components(G)]

for i,graph in enumerate(graphs):
    A = nx.nx_agraph.to_agraph(graph)
    A.graph_attr["rankdir"] = "LR"
    # render the component and write it to the output directory
    save(A,file=f"{OUTPUT_DIR}{PREFIX}_graphs/graph_{i}.png")
    break  # only the first component is rendered