Details:

https://academic.oup.com/jnci/article/104/4/311/979947

> Single sample predictors (SSPs) and Subtype classification models (SCMs) are gene expression–based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.

AURKA is AURKA (aurora kinase A)

ER is ESR1 (Source: https://www.genecards.org/cgi-bin/carddisp.pl?gene=ESR1)

HER2 is ERBB2 (Source: https://www.genecards.org/cgi-bin/carddisp.pl?gene=ERBB2)

In [156]:
%load_ext autoreload
%autoreload 2
%matplotlib inline


### Customize this for each notebook

In [157]:
OUTPUT_DIR='Three-Gene-Model-Lab-Demo/'
OPTIONS={'seed':3}
PREFIX="_".join([f"{key}={OPTIONS[key]}" for key in OPTIONS.keys()])
RESULTS={}
PREFIX

'seed=3'

In [158]:
from pathlib import Path
home = str(Path.home())

In [159]:
KNOWLEDGE_LIB=f'{home}/knowledgelib'

In [160]:
from IPython.display import display, Markdown, Latex
import sys
sys.path.insert(0,f'{KNOWLEDGE_LIB}')
import pyknowledge
import pandas as pd
import scipy.io
import numpy as np
import joblib

## Load the input data

In [161]:
## Customize this load to read in the data and format it with the correct columns
def load_data_all():
    mat = scipy.io.loadmat("/disk/metabric/BRCA1View20000.mat")
    #gene_labels = open("/disk/metabric/gene_labels.txt").read().split("\n")
    gene_labels = [g[0] for g in mat['gene'][0]]
    df = pd.DataFrame(mat['data'].transpose(), columns=gene_labels)
    # df is samples x genes; min-max scale every gene (column) to [0, 1]
    gene_min = df.min()
    gene_max = df.max()
    df = (df - gene_min) / (gene_max - gene_min)
    # loadmat returns 2-D arrays; flatten the labels to one value per sample
    df['target'] = np.ravel(mat['targets'])
    df['Subtype'] = df.target.map({1:'Basal',2:'HER2+',3:'LumA',4:'LumB',5:'Normal Like',6:'Normal'})
    df['color'] = df.target.map({1:'red',2:'green',3:'purple',4:'cyan',5:'blue',6:'green'})
    df['graph_color'] = df.target.map({1:'#FFFFFF',2:'#F5F5F5',3:'#FFFAFA',4:'#FFFFF0',5:'#FFFAF0',6:'#F5FFFA'})
    df = df.set_index(np.arange(len(df)))
    
    return df
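
As a quick check of the per-gene min-max scaling used in the loader, here is a minimal sketch on a made-up frame. The names `toy`, `GENE_A`, and `GENE_B` are hypothetical and the values are invented; the only assumption carried over from the notebook is that rows are samples and columns are genes.

```python
import pandas as pd

# Toy expression matrix: rows are samples, columns are genes (values are made up)
toy = pd.DataFrame({"GENE_A": [2.0, 4.0, 6.0], "GENE_B": [10.0, 10.5, 12.0]})

# Column-wise min-max scaling: the per-column min/max Series align on column labels
toy_scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_scaled)
```

After scaling, each gene should span exactly 0 to 1, whatever its original range.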

In [None]:
df_all = load_data_all()

In [None]:
df_all.Subtype.value_counts() # basal-like, HER2-enriched, luminal A, and luminal B

## Knowledge

#### Genes

In [None]:
knowledge_genes = ["ERBB2","ESR1","AURKA"]

In [None]:
genes_df_all = df_all[knowledge_genes+["Subtype"]]
genes_df_all.head()

In [None]:
means = genes_df_all.groupby('Subtype').mean()
medians = genes_df_all.groupby('Subtype').median()
stdevs = genes_df_all.groupby('Subtype').std()

means

In [None]:
medians

In [None]:
stdevs

In [None]:
baseline = 'Basal'
baseline_means = means.loc[baseline]
baseline_medians = medians.loc[baseline]
baseline_stdevs = stdevs.loc[baseline]
baseline_means

## Scale

In [None]:
df_subtype = genes_df_all.loc[genes_df_all['Subtype'] != baseline]
df = df_subtype.drop('Subtype',axis=1).subtract(baseline_means,axis=1).divide(baseline_stdevs,axis=1)
df = df.join(df_subtype[['Subtype']])
df
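
The scaling step expresses every non-baseline sample as a z-score relative to the baseline subtype. A minimal sketch of the same idea on invented numbers (the frame `toy` and its values are hypothetical; only the gene and subtype names come from the notebook):

```python
import pandas as pd

# Made-up, already min-max scaled expression values for two genes
toy = pd.DataFrame({
    "ERBB2": [0.2, 0.4, 0.9, 0.8],
    "ESR1": [0.1, 0.3, 0.7, 0.9],
    "Subtype": ["Basal", "Basal", "HER2+", "LumA"],
})
genes = ["ERBB2", "ESR1"]

# Baseline statistics come from the Basal samples only
base = toy.loc[toy["Subtype"] == "Basal", genes]
rest = toy.loc[toy["Subtype"] != "Basal", genes]

# z-score of each remaining sample against the baseline mean and standard deviation
toy_z = (rest - base.mean()) / base.std()
print(toy_z)
```

A positive value means the gene is expressed above the Basal mean in that sample, which is what the direction patterns later in the notebook rely on.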

## EDA

### Each gene individually

In [None]:
source = df.join(df_all[['target']]).melt(id_vars=['Subtype','target'])
source.columns = ["Subtype","target","Gene","Value"]
counts = source.groupby('Subtype')['target'].count().to_frame()
counts.columns = ['Count']
source = source.set_index('Subtype').join(counts).reset_index()
import altair as alt
alt.data_transformers.disable_max_rows()

g = alt.Chart(source).transform_calculate(
    pct='1 / datum.Count'
).mark_area(
    opacity=0.3,
    interpolate='step'
).encode(
    alt.X('Value:Q', bin=alt.Bin(maxbins=100)),
    alt.Y('sum(pct):Q', axis=alt.Axis(format='%'),stack=None),
    alt.Color('Subtype:N'),
    row='Gene:N'
)
g

In [None]:
directions = df.drop('Subtype',axis=1)>0
pattern_counts = directions.value_counts()
pattern_counts

In [None]:
pattern_subtype_counts = directions.join(df[['Subtype']]).groupby(knowledge_genes)['Subtype'].value_counts()
pattern_subtype_frac = pattern_subtype_counts.divide(pattern_counts)
pattern_df = pattern_subtype_frac.to_frame().join(pattern_subtype_counts)
pattern_df.columns=['Fraction','Count']
pattern_df.sort_values(by='Fraction',ascending=False)
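
To make the pattern table easier to read, here is a small sketch that computes, for each up/down pattern, the fraction of samples belonging to each subtype. The z-scores in `toy_z` are invented, and `value_counts(normalize=True)` is used as a compact alternative to dividing two count tables; it is not necessarily how the notebook's own numbers were produced.

```python
import pandas as pd

# Invented baseline-relative z-scores for five samples
toy_z = pd.DataFrame({
    "ERBB2": [1.5, 2.0, -0.5, -1.0, -0.2],
    "ESR1": [-0.3, -1.0, 2.0, 1.2, 0.8],
    "AURKA": [-0.1, -0.4, 0.6, 0.9, 0.3],
    "Subtype": ["HER2+", "HER2+", "LumB", "LumB", "LumA"],
})
genes = ["ERBB2", "ESR1", "AURKA"]

# True where a gene is above the baseline mean
toy_directions = toy_z[genes] > 0

# Within each (ERBB2, ESR1, AURKA) pattern, the share of samples per subtype
fractions = (
    toy_directions.join(toy_z["Subtype"])
    .groupby(genes)["Subtype"]
    .value_counts(normalize=True)
)
print(fractions)
```

A pattern whose fraction is close to 1 for a single subtype is a good candidate for the `patterns` set chosen in the next cell.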

In [None]:
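# Direction patterns to keep, as (ERBB2, ESR1, AURKA) tuples in knowledge_genes order;
# True means the gene is above the Basal baseline mean
patterns = set([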
    (False,True,True),
    (True,False,False)
 #   (True,True,False) # Maybe?
])

In [None]:
# Keep only the samples whose direction pattern is one of the selected patterns
mask = [tuple(row) in patterns for row in directions.itertuples(index=False, name=None)]

In [None]:
from IPython.display import Image

import networkx as nx

#G = nx.Graph()
#for ix in df_all.index:
#    c = 'white'
#    G.add_node(ix,color='black',style='filled',fillcolor=c)

df_mask = df.loc[mask]
directions_mask = directions.loc[mask]

In [None]:
A = pd.DataFrame(index=df_all.index,columns=df_all.index)
columns = directions_mask.columns
pts = directions_mask.reset_index().set_index(list(columns))
patterns_list = list(patterns)
for pattern in patterns_list:
    ixs = pts.loc[pattern,'index']
    A.loc[ixs,ixs] = 1
# Remove self-loops (np.nan, since the np.NaN alias was removed in NumPy 2.0)
np.fill_diagonal(A.values, np.nan)
A.stack()
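
The adjacency matrix connects every pair of samples that share the same selected pattern, so each pattern becomes one fully connected block. A minimal NumPy sketch of that construction, using hypothetical pattern labels `p1` and `p2` in place of the boolean tuples:

```python
import numpy as np
import pandas as pd

# Hypothetical pattern label per sample; None marks samples matching no selected pattern
labels = pd.Series(["p1", "p2", "p1", None, "p2", "p1"])

toy_A = np.zeros((len(labels), len(labels)))
kept = labels.dropna()
for pattern, members in kept.groupby(kept).groups.items():
    idx = members.to_numpy()
    # Connect every pair of samples that share this pattern
    toy_A[np.ix_(idx, idx)] = 1

# No self-loops
np.fill_diagonal(toy_A, 0)
print(toy_A)
```

Samples that match no pattern keep an all-zero row and column, so they end up as isolated nodes once the matrix is turned into a graph.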

In [None]:
#A = A.loc[A.fillna(0).sum()!=0]
#A = A.loc[:,A.index]

In [None]:
A.shape

In [None]:
A.to_csv(f'{OUTPUT_DIR}A.csv')  # OUTPUT_DIR already ends with '/'

**You only need to proceed past this point if you want to visualize the graphs.**

**Proceed with caution: the graphs can be large.**

In [127]:
G = nx.from_pandas_adjacency(A.fillna(0), create_using=nx.Graph)
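
For orientation, here is a small sketch of what `nx.from_pandas_adjacency` and `nx.connected_components` return on a hand-written adjacency frame. The node names `a` to `d` and the frame `toy_A` are made up for illustration.

```python
import networkx as nx
import pandas as pd

# Hand-written adjacency: a and b are connected, c and d are isolated
toy_A = pd.DataFrame(
    [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]],
    index=list("abcd"), columns=list("abcd"),
)

# Zero entries are treated as "no edge"; every index label still becomes a node
toy_G = nx.from_pandas_adjacency(toy_A, create_using=nx.Graph)

components = [sorted(c) for c in nx.connected_components(toy_G)]
# Only components with more than one node are worth drawing
multi = [c for c in components if len(c) > 1]
print(components, multi)
```

Because filling the missing entries with 0 leaves unmatched samples as isolated nodes, filtering on component size keeps the drawing step from producing one image per singleton.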

In [128]:
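def save(agraph,file="graph.png"):
    # Render with Graphviz 'dot'; draw() returns the image bytes when no path is given
    g = agraph.draw(format=file.split(".")[-1], prog='dot')
    with open(file,"wb") as f:
        f.write(g)
    return Image(g)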

#pos = nx.drawing.nx_agraph.graphviz_layout(G, prog='dot')
#A = nx.nx_agraph.to_agraph(G)
#A.graph_attr["rankdir"] = "LR"
# draw it in the notebook
#save(A,file=f"{OUTPUT_DIR}{PREFIX}_graph.png")

In [129]:
!mkdir -p {OUTPUT_DIR}{PREFIX}_graphs


In [130]:
graphs = list(G.subgraph(c).copy() for c in nx.connected_components(G))

for i,graph in enumerate(graphs):
    # Skip isolated nodes; only components with at least one edge are drawn
    if graph.number_of_nodes() > 1:
        AG = nx.nx_agraph.to_agraph(graph)
        AG.graph_attr["rankdir"] = "LR"
        # Render the component to a PNG file
        save(AG,file=f"{OUTPUT_DIR}{PREFIX}_graphs/graph_{i}.png")
