# Graph Creation (Adjacency Matrix) per METABRIC Subtype
This notebook shows how to take the METABRIC clinical data and build one graph per subtype. Here, the graphs are generated from the "TMB (nonsynonymous)" feature.

## Imports and Installations
Collection of any necessary imports/installations

In [3]:
!pip install pandas 

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-1.4.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
     |████████████████████████████████| 11.7 MB 12.0 MB/s            
Installing collected packages: pandas
Successfully installed pandas-1.4.2


In [4]:
import pandas as pd
import numpy as np

## Loading Gene Expression Dataset
This gene expression dataset will be fed to a neural network; it is not used to build the knowledge graphs. We still load it here so that the clinical patient data can be mapped back to the correct patients.

In [5]:
gene_df = pd.read_csv("/large/metabric/expression_with_gene_ids_min_max_no_pam50.csv.gz")
ids = gene_df["Sample ID"].tolist()
gene_df = gene_df.drop(["Sample ID"], axis=1)
gene_df.index = ids
gene_df

Unnamed: 0,A1CF,A2M,A2ML1,A4GALT,A4GNT,AAAS,AACS,AACSP1,AADAC,AADACL2,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
MB-0362,0.121975,0.215056,0.137268,0.319299,0.299318,0.379480,0.412660,0.361855,0.101269,0.172184,...,0.221594,0.299784,0.362357,0.225769,0.367554,0.132938,0.328175,0.573906,0.539070,0.147249
MB-0346,0.192559,0.042307,0.204485,0.103171,0.315292,0.391824,0.544226,0.372350,0.087123,0.164404,...,0.156303,0.516268,0.489675,0.363120,0.457905,0.155936,0.121523,0.483270,0.411438,0.224434
MB-0386,0.129016,0.302035,0.104212,0.478857,0.307694,0.265761,0.191160,0.214396,0.124491,0.135946,...,0.167134,0.210737,0.420390,0.164456,0.268026,0.169930,0.358189,0.616131,0.515510,0.124535
MB-0574,0.186569,0.204583,0.084922,0.155657,0.259502,0.225779,0.276242,0.363504,0.156255,0.174428,...,0.177223,0.211609,0.669651,0.530200,0.548744,0.054719,0.364991,0.480279,0.564949,0.158037
MB-0185,0.110777,0.337835,0.133802,0.265908,0.555257,0.349347,0.222268,0.381949,0.058114,0.111793,...,0.382424,0.450318,0.644640,0.280082,0.305045,0.135671,0.259829,0.531486,0.503557,0.165486
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
MB-0812,0.263569,0.327166,0.149486,0.187280,0.467571,0.130192,0.211174,0.441342,0.250357,0.177241,...,0.187513,0.264235,0.577818,0.347245,0.665362,0.144825,0.538418,0.555090,0.765131,0.169573
MB-1076,0.186013,0.576193,0.131895,0.742197,0.309652,0.264979,0.236222,0.493037,0.385240,0.167714,...,0.091934,0.065556,0.343792,0.243226,0.413232,0.109368,0.571451,0.635040,0.649998,0.298282
MB-0814,0.299798,0.575751,0.106458,0.426221,0.419683,0.335738,0.340183,0.502101,0.133941,0.139156,...,0.082602,0.063475,0.492459,0.214623,0.500081,0.151580,0.538403,0.790849,0.766577,0.400082
MB-1087,0.112526,0.529944,0.092508,0.588968,0.391670,0.130606,0.220825,0.458416,0.578603,0.247752,...,0.055908,0.063976,0.342782,0.257248,0.503659,0.047696,0.713061,0.603247,0.656763,0.243879


## Knowledge Graph Construction Functions
Two functions are provided to turn clinical data into graphs.

The first function, "generate_fully_connected_graph()", takes two arguments: the dataframe of clinical data and the feature we want to build a graph for. It returns a fully connected, weighted graph as an adjacency matrix, where each weight is the absolute difference between two patients' z-scored feature values. This matrix can be used for a Neural Graph Machine.

The second function, "fc_to_knn()", takes two arguments: the graph matrix from the first function and k, the number of neighbors to select for each patient (k defaults to 4, so you only need to pass the graph matrix). It returns an unweighted, undirected graph as an adjacency matrix that can be used for a Graph Convolutional Neural Network. Because edges are added in both directions, a patient can end up with more than k neighbors.

In [7]:
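def generate_fully_connected_graph(df, column):
    """Build a weighted adjacency matrix from one clinical feature.

    Returns (adj, df1, col): the pairwise-distance matrix, the clinical
    dataframe without `column`, and the z-scored feature values.
    """
    # Keep only patients that have a value for this feature.
    df1 = df[df[column].notna()]
    ids = df1["Patient ID"].tolist()
    df1 = df1.reset_index(drop=True)
    col = df1[column]
    # Z-score the feature (pandas uses the sample standard deviation).
    col = (col - col.mean()) / col.std()
    df1 = df1.drop([column], axis=1)
    n = df1.shape[0]
    adj = np.zeros([n, n])
    # Edge weight = absolute difference of the z-scored values; the diagonal stays 0.
    for i in range(n):
        for j in range(i + 1, n):
            adj[i][j] = np.abs(col[i] - col[j])
            adj[j][i] = adj[i][j]
    adj = pd.DataFrame(adj)
    adj.columns = ids
    adj.index = ids
    col.index = ids
    return adj, df1, col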
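
The pairwise loop above can also be written with NumPy broadcasting. The following is a minimal sketch, not part of the original notebook: `pairwise_abs_diff` is a hypothetical helper name, and the three toy values are made up for illustration. It assumes the same z-scoring as above, with `ddof=1` to match the pandas default for `Series.std()`.

```python
import numpy as np
import pandas as pd

def pairwise_abs_diff(values, ids):
    """Hypothetical helper: z-score `values`, then return |z_i - z_j| for all pairs."""
    v = np.asarray(values, dtype=float)
    # ddof=1 gives the sample standard deviation, as pandas' Series.std() does by default.
    z = (v - v.mean()) / v.std(ddof=1)
    # Broadcasting a column vector against a row vector yields the full n x n matrix.
    dist = np.abs(z[:, None] - z[None, :])
    return pd.DataFrame(dist, index=ids, columns=ids)

# Toy feature values (made up), one per patient.
toy_adj = pairwise_abs_diff([0.0, 1.0, 3.0], ["P1", "P2", "P3"])
print(toy_adj)
```

The result is symmetric with a zero diagonal, like the matrix built by the nested loop, but it avoids the Python-level iteration over every pair.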

In [26]:
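def fc_to_knn(fc, k=4):
    """Convert a distance matrix into an unweighted, undirected k-nearest-neighbor graph."""
    fc1 = fc.to_numpy(dtype=float).copy()
    # Entries are distances, so a patient's own entry (0 on the diagonal)
    # must be excluded before picking the closest patients.
    np.fill_diagonal(fc1, np.inf)
    knn = np.zeros(shape=(fc1.shape[0], fc1.shape[0]))
    for i in range(len(fc1)):
        # Indices of the k smallest distances in row i (requires k < number of patients).
        nearest = np.argpartition(fc1[i], k - 1)[:k]
        for idx in nearest:
            knn[i][idx] = 1
            knn[idx][i] = 1  # add the reverse edge so the graph is undirected
    knn = pd.DataFrame(knn)
    knn.columns = fc.columns
    knn.index = fc.columns
    return knn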
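
Since the fully connected matrix stores absolute differences, a smaller entry means a more similar pair of patients. The sketch below illustrates, on made-up toy values, how selecting the k smallest off-diagonal entries per row and then symmetrizing can leave some patients with more than k neighbors. `knn_from_distances` is a hypothetical helper written for this illustration, not the notebook's function.

```python
import numpy as np
import pandas as pd

def knn_from_distances(dist, k=4):
    """Hypothetical helper: link each row to its k smallest off-diagonal distances."""
    d = dist.to_numpy(dtype=float).copy()
    np.fill_diagonal(d, np.inf)  # a patient is never its own neighbor
    # After partitioning at position k-1, the first k columns hold the k smallest entries.
    nearest = np.argpartition(d, k - 1, axis=1)[:, :k]
    adj = np.zeros_like(d)
    rows = np.repeat(np.arange(d.shape[0]), k)
    adj[rows, nearest.ravel()] = 1
    adj = np.maximum(adj, adj.T)  # keep an edge if either endpoint selected it
    return pd.DataFrame(adj, index=dist.index, columns=dist.columns)

# Toy one-dimensional feature values (made up) for four patients.
values = np.array([0.0, 1.0, 2.5, 10.0])
ids = ["P1", "P2", "P3", "P4"]
toy_dist = pd.DataFrame(np.abs(values[:, None] - values[None, :]), index=ids, columns=ids)

toy_knn = knn_from_distances(toy_dist, k=1)
print(toy_knn)
print(toy_knn.sum(axis=1))  # neighbor count per patient
```

With k=1, P2 and P3 each end up with two neighbors: each picks one neighbor itself and is also picked by another patient. If a strict per-node degree matters for the downstream model, this is worth checking on the real graphs.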

## Clinical Data Knowledge Graphs
For this conference, we plan to build the graphs from the patients' clinical data. Other datasets could be used instead, provided their Sample IDs/Patient IDs match those in the gene expression dataset.

In [15]:
clinical_df = pd.read_csv('brca_metabric_clinical_data.tsv',sep='\t')
clinical_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"] != "NC"]
clinical_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"].notna()]
clinical_df = clinical_df[clinical_df["TMB (nonsynonymous)"].notna()]
clinical_df.head()

Unnamed: 0,Study ID,Patient ID,Sample ID,Age at Diagnosis,Type of Breast Surgery,Cancer Type,Cancer Type Detailed,Cellularity,Chemotherapy,Pam50 + Claudin-low subtype,...,Relapse Free Status (Months),Relapse Free Status,Number of Samples Per Patient,Sample Type,Sex,3-Gene classifier subtype,TMB (nonsynonymous),Tumor Size,Tumor Stage,Patient's Vital Status
0,brca_metabric,MB-0000,MB-0000,75.65,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,,NO,claudin-low,...,138.65,0:Not Recurred,1,Primary,Female,ER-/HER2-,0.0,22.0,2.0,Living
1,brca_metabric,MB-0002,MB-0002,43.19,BREAST CONSERVING,Breast Cancer,Breast Invasive Ductal Carcinoma,High,NO,LumA,...,83.52,0:Not Recurred,1,Primary,Female,ER+/HER2- High Prolif,2.615035,10.0,1.0,Living
2,brca_metabric,MB-0005,MB-0005,48.87,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,YES,LumB,...,151.28,1:Recurred,1,Primary,Female,,2.615035,15.0,2.0,Died of Disease
3,brca_metabric,MB-0006,MB-0006,47.68,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,Moderate,YES,LumB,...,162.76,0:Not Recurred,1,Primary,Female,,1.307518,25.0,2.0,Living
4,brca_metabric,MB-0008,MB-0008,76.97,MASTECTOMY,Breast Cancer,Breast Mixed Ductal and Lobular Carcinoma,High,YES,LumB,...,18.55,1:Recurred,1,Primary,Female,ER+/HER2- High Prolif,2.615035,40.0,2.0,Died of Disease
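
Because the graphs and the gene expression matrix must refer to the same patients, it can help to check the ID overlap explicitly. The sketch below uses small made-up frames: `toy_gene` and `toy_clinical` are hypothetical stand-ins for `gene_df` (indexed by Sample ID) and `clinical_df`, and the IDs and values are invented. It is one possible way to align the two tables, not a step taken in this notebook.

```python
import pandas as pd

# Toy stand-in for gene_df: rows indexed by sample ID (values are made up).
toy_gene = pd.DataFrame({"GENE_A": [0.12, 0.19, 0.13]}, index=["S1", "S2", "S3"])

# Toy stand-in for clinical_df: IDs stored in a "Patient ID" column.
toy_clinical = pd.DataFrame({
    "Patient ID": ["S2", "S9", "S1"],
    "TMB (nonsynonymous)": [2.6, 0.0, 1.3],
})

# Keep only clinical rows whose ID also appears in the expression table.
has_expression = toy_clinical["Patient ID"].isin(toy_gene.index)
aligned_clinical = toy_clinical[has_expression].reset_index(drop=True)

# Reorder the expression rows to follow the clinical row order.
aligned_gene = toy_gene.loc[aligned_clinical["Patient ID"]]

print(aligned_clinical)
print(aligned_gene)
```

After this step both tables list the same patients in the same order, so row i of one corresponds to row i of the other.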


## Split Dataset by Subtype

In [18]:
her2_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"] == "Her2"]
basal_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"] == "Basal"]
claudin_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"] == "claudin-low"]
luma_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"] == "LumA"]
lumb_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"] == "LumB"]
normal_df = clinical_df[clinical_df["Pam50 + Claudin-low subtype"] == "Normal"]
her2_df

Unnamed: 0,Study ID,Patient ID,Sample ID,Age at Diagnosis,Type of Breast Surgery,Cancer Type,Cancer Type Detailed,Cellularity,Chemotherapy,Pam50 + Claudin-low subtype,...,Relapse Free Status (Months),Relapse Free Status,Number of Samples Per Patient,Sample Type,Sex,3-Gene classifier subtype,TMB (nonsynonymous),Tumor Size,Tumor Stage,Patient's Vital Status
11,brca_metabric,MB-0035,MB-0035,84.22,MASTECTOMY,Breast Cancer,Breast Invasive Lobular Carcinoma,High,NO,Her2,...,35.79,1:Recurred,1,Primary,Female,ER+/HER2- High Prolif,6.537589,28.0,2.0,Died of Disease
28,brca_metabric,MB-0079,MB-0079,50.42,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,YES,Her2,...,26.28,1:Recurred,1,Primary,Female,,5.230071,40.0,2.0,Died of Disease
45,brca_metabric,MB-0113,MB-0113,36.96,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,Low,YES,Her2,...,42.60,0:Not Recurred,1,Primary,Female,HER2+,0.000000,17.0,2.0,Living
60,brca_metabric,MB-0129,MB-0129,63.53,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,Low,YES,Her2,...,38.06,0:Not Recurred,1,Primary,Female,HER2+,0.000000,24.0,2.0,Died of Other Causes
76,brca_metabric,MB-0148,MB-0148,53.16,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,NO,Her2,...,1.74,0:Not Recurred,1,Primary,Female,HER2+,3.922553,19.0,1.0,Living
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1953,brca_metabric,MB-7262,MB-7262,83.80,BREAST CONSERVING,Breast Cancer,Breast Invasive Ductal Carcinoma,Moderate,NO,Her2,...,102.66,0:Not Recurred,1,Primary,Female,ER+/HER2- High Prolif,13.075177,40.0,,Died of Other Causes
1961,brca_metabric,MB-7273,MB-7273,47.47,BREAST CONSERVING,Breast Cancer,Breast Invasive Lobular Carcinoma,High,YES,Her2,...,205.46,0:Not Recurred,1,Primary,Female,HER2+,6.537589,20.0,,Living
1962,brca_metabric,MB-7275,MB-7275,30.02,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,YES,Her2,...,22.63,1:Recurred,1,Primary,Female,HER2+,10.460142,20.0,,Died of Disease
1966,brca_metabric,MB-7279,MB-7279,86.14,MASTECTOMY,Breast Cancer,Breast Invasive Ductal Carcinoma,High,NO,Her2,...,19.08,1:Recurred,1,Primary,Female,HER2+,5.230071,50.0,,Died of Disease


In [17]:
her2_df["Pam50 + Claudin-low subtype"].unique()

array(['Her2'], dtype=object)
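
The six boolean filters above can also be expressed as a single `groupby`. This is a minimal sketch on a made-up toy frame; the dictionary name `subtype_dfs` and the toy IDs are assumptions for illustration only.

```python
import pandas as pd

SUBTYPE_COL = "Pam50 + Claudin-low subtype"

# Toy stand-in for clinical_df (IDs and subtypes are made up).
toy_clinical = pd.DataFrame({
    "Patient ID": ["P1", "P2", "P3", "P4"],
    SUBTYPE_COL: ["Her2", "LumA", "Her2", "Basal"],
})

# One dataframe per subtype, keyed by the subtype label.
subtype_dfs = {name: group for name, group in toy_clinical.groupby(SUBTYPE_COL)}

print(sorted(subtype_dfs))
print(subtype_dfs["Her2"])
```

A dictionary keyed by subtype makes it straightforward to loop over all subtypes later instead of repeating the same call for each named variable.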

In [None]:
['claudin-low', 'LumA', 'LumB', 'Normal', 'Her2', 'Basal']

## Create K-Nearest Graphs (TMB) for each Subtype

In [27]:
# Both functions already return DataFrames, so no extra pd.DataFrame() wrapper is needed.
her2_fc = generate_fully_connected_graph(her2_df, "TMB (nonsynonymous)")[0]
her2_knn = fc_to_knn(her2_fc, k=3)
basal_fc = generate_fully_connected_graph(basal_df, "TMB (nonsynonymous)")[0]
basal_knn = fc_to_knn(basal_fc, k=3)
claudin_fc = generate_fully_connected_graph(claudin_df, "TMB (nonsynonymous)")[0]
claudin_knn = fc_to_knn(claudin_fc, k=3)
luma_fc = generate_fully_connected_graph(luma_df, "TMB (nonsynonymous)")[0]
luma_knn = fc_to_knn(luma_fc, k=3)
lumb_fc = generate_fully_connected_graph(lumb_df, "TMB (nonsynonymous)")[0]
lumb_knn = fc_to_knn(lumb_fc, k=3)
normal_fc = generate_fully_connected_graph(normal_df, "TMB (nonsynonymous)")[0]
normal_knn = fc_to_knn(normal_fc, k=3)
her2_knn

Unnamed: 0,MB-0035,MB-0079,MB-0113,MB-0129,MB-0148,MB-0152,MB-0188,MB-0221,MB-0230,MB-0249,...,MB-7250,MB-7251,MB-7252,MB-7256,MB-7260,MB-7262,MB-7273,MB-7275,MB-7279,MB-7281
MB-0035,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MB-0079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MB-0113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MB-0129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MB-0148,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
MB-7262,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MB-7273,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MB-7275,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MB-7279,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
