# Validate NN on SIDER unseen database
<b>Author</b>: Ian Coleman <br/>
<b>Function</b>: Take the NN developed in Opa/ and test it on an unseen database

Ways to improve <br>
- Get more chems through DisGeNET
- Get more diseases by re-running OPA2Vec on fresh CTD data

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from random import randint
import random
from sklearn.model_selection import train_test_split
import sklearn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from pandas_ml import ConfusionMatrix
import json
import subprocess
import pickle
import math

#Set random seed
np.random.seed(1606)

In [2]:
# Which databases could supply unseen chem-disease associations?
# Virtual Metabolic Human - nutrients?
# Sider - drugs
# I want environmental chemicals; the EPA toxic list is an option, but it is probably all in the training database
# How this validation works: (i) import trained model, (ii) create features for the chemicals/diseases,
# (iii) predict

In [3]:
# Import database of unique diseases with their vectors from opa-nn

In [4]:
# import new chemicals with their actual disease associations, extract unique chems

## 1. Sider

In [5]:
# Import sider (all side effects)
# SE = side effect
# CID1 - "flat compound", i.e. stereo-isomers have been merged into one compound
# CID2 - stereo-specific compound id
colnames = ['CID1', 'CID2', 'UMLS', 'UMLS2Type', 'UMLS2', 'SEname']
sider = pd.read_csv('../validation/data/meddra_all_se.tsv', sep='\t', names=colnames)

In [6]:
sider.sample(3)

Unnamed: 0,CID1,CID2,UMLS,UMLS2Type,UMLS2,SEname
161261,CID100005314,CID000005314,C0042510,PT,C0042510,Ventricular fibrillation
59539,CID100003143,CID000148123,C0151738,LLT,C0151738,Large intestine perforation
87586,CID100003690,CID000003690,C0020542,PT,C0020542,Pulmonary hypertension


### Get Disease MESH IDs for sider side effects
The problem here is that SIDER identifies side effects with UMLS terms, so these need converting to MESH.
<br> Some of the next few cells are commented out because the mapping process is slow and the resulting map has been saved.

In [7]:
# # Import CTD Chemical-Disease Original CSV to get disease names, try semantic matching to get UMLS-MESH conversion
# Read in CTD sample, skipping the intro rows
cols = ['DiseaseID', 'DiseaseName', 'DirectEvidence']
col_types = {   
    'DiseaseID': 'category',
    'DiseaseName': 'category',
    'DirectEvidence': 'category'
}
df_cd = pd.read_csv('../ctd-to-nt/csvs/CTD_chemicals_diseases.csv', skiprows=27, usecols=cols, dtype=col_types)
df_cd = df_cd.drop(0)
df_cd = df_cd.dropna(subset=['DirectEvidence']) # drop if it doesn't have direct evidence

In [8]:
# df_cd.head()

In [9]:
# Make a mesh disease name to mesh id map for later use
mesh_get_id = dict(zip(df_cd.DiseaseName, df_cd.DiseaseID))

In [10]:
# # Process DiseaseID so as to be usable in url
# df_cd['DiseaseID'] = df_cd['DiseaseID'].str.replace('MESH:', '')

# #Specify type to optimise
# df_cd['ChemicalID'] = df_cd.ChemicalID.astype(str)
# # df_cd['InferenceGeneSymbol'] = df_cd.InferenceGeneSymbol.astype(str)

In [11]:
# Use a measure of distance to match up disease names from ctd (MESH) and from sider (UMLS) 
from difflib import SequenceMatcher
import pdb

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def create_map(std_list, flawed_list):
    # Map each name in flawed_list to its closest match in std_list,
    # keeping only matches with a similarity ratio above 0.8
    team_map = {}
    for team in flawed_list:
        scores = [similar(team, std_team) for std_team in std_list]
        highest = max(scores)
        if highest > 0.8:
            team_map[team] = std_list[scores.index(highest)]
    return team_map
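A minimal sketch of the same name-matching idea using `difflib.get_close_matches`, which skips clearly poor candidates before computing the full ratio. The function name `build_name_map`, the lowercasing step, and the toy name lists are assumptions for illustration, not part of the original pipeline.

```python
from difflib import get_close_matches

def build_name_map(source_names, target_names, cutoff=0.8):
    """Map each source name to its closest target name, ignoring case."""
    # Index targets by lowercase form so 'Abdominal pain' can meet 'Abdominal Pain'
    lowered = {}
    for name in target_names:
        lowered.setdefault(name.lower(), name)
    keys = list(lowered)

    name_map = {}
    for name in source_names:
        # n=1 keeps only the best candidate; note get_close_matches accepts
        # ratio >= cutoff, while create_map above requires ratio > 0.8
        hits = get_close_matches(name.lower(), keys, n=1, cutoff=cutoff)
        if hits:
            name_map[name] = lowered[hits[0]]
    return name_map

# Toy inputs (hypothetical, not the real SIDER/CTD vocabularies)
toy_sider_names = ['Abdominal pain', 'Muscle spasms', 'Headache']
toy_mesh_names = ['Abdominal Pain', 'Muscle Neoplasms', 'Vipoma']
toy_map = build_name_map(toy_sider_names, toy_mesh_names)
print(toy_map)
```

On these toy lists the sketch also pairs 'Muscle spasms' with 'Muscle Neoplasms', because the two strings score just above the 0.8 cutoff. That is the kind of false positive that still needs a manual review pass.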

In [12]:
# umls = sorted(sider.SEname.unique())
# mesh = sorted(df_cd.DiseaseName.unique())

In [13]:
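# Commented out because it takes ages; the map is saved as a pickle object instead
# umls_mesh_map = create_map(umls, mesh)
# # Reverse the filtered map; umls_mesh_map_mod is defined in the filtering cell further down, so run that cell first
# umls_mesh_map_mod = {value:key for (key, value) in umls_mesh_map_mod.items()}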

In [14]:
# print(ummap)

In [15]:
# # These are the incorrect mappings I've identified for a 0.8 similarity cutoff
# remove = ('Agraphia', 'Angina, Stable', 'Cerebrospinal Fluid Otorrhea', 'Confusion',
#          'Endarteritis', 'Fetal Growth Retardation', 'Glucose Intolerance', 'Hearing Disorders',
#          'Hemoperitoneum', 'Hepatitis, Animal', 'Hip Contracture', 'Hyperoxaluria',
#          'Hyperoxia', 'Hyperpigmentation','Hypolipoproteinemias',  'Intestinal Diseases',
#          'Milk Hypersensitivity', 'Mucositis', 'Murine Acquired Immunodeficiency Syndrome', 
#          'Muscle Neoplasms', 'Mycotoxicosis', 'Olfaction Disorders','Osteopetrosis',
#          'Peanut Hypersensitivity', 'Pharyngeal Neoplasms', 'Polycystic liver disease',
#          'Pseudohypoparathyroidism', 'Psychomotor Agitation', 'Pulmonary Emphysema',
#          'Purpura, Thrombocytopenic', 'Renal Insufficiency', 'Sciatic Neuropathy',
#          'Simian Acquired Immunodeficiency Syndrome', 'Spinal Curvatures', 'Sporotrichosis',
#          'Vipoma', 'Vitamin A Deficiency', 'Vitamin D Deficiency', 'Vitamin E Deficiency',
#          'Wheat Hypersensitivity')
# umls_mesh_map_mod = {key: umls_mesh_map[key] for key in umls_mesh_map if key not in remove}
# # Muscle neoplasms is not the same as muscle spasms
# # 'Olfaction Disorders' != 'Ovulation disorder'

In [16]:
# # Export map of UMLS:MESH
# with open('umls_mesh_map'+ '.pkl', 'wb') as f:
#         pickle.dump(umls_mesh_map_mod, f, pickle.HIGHEST_PROTOCOL)

In [17]:
# Loading the map from pickle object - if you haven't created it you may need to uncomment above lines
def load_obj(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

umls_mesh_map_mod = load_obj('umls_mesh_map')

In [18]:
#Use the umls-mesh map to add mesh col to sider

In [19]:
sider['MESH'] = sider.SEname.map(lambda x: umls_mesh_map_mod.get(str(x)))

In [20]:
sider.head()

Unnamed: 0,CID1,CID2,UMLS,UMLS2Type,UMLS2,SEname,MESH
0,CID100000085,CID000010917,C0000729,LLT,C0000729,Abdominal cramps,
1,CID100000085,CID000010917,C0000729,PT,C0000737,Abdominal pain,Abdominal Pain
2,CID100000085,CID000010917,C0000737,LLT,C0000737,Abdominal pain,Abdominal Pain
3,CID100000085,CID000010917,C0000737,PT,C0687713,Gastrointestinal pain,
4,CID100000085,CID000010917,C0000737,PT,C0000737,Abdominal pain,Abdominal Pain


In [21]:
print('total sider rows: ', sider.shape[0])
print('sider rows with mesh value: ', sider[sider.MESH.map(lambda x: x is not None)].shape[0])

total sider rows:  309849
sider rows with mesh value:  146773


In [22]:
sider_mod = sider[sider.MESH.map(lambda x: x is not None)]

In [23]:
sider_mod.sample(2)

Unnamed: 0,CID1,CID2,UMLS,UMLS2Type,UMLS2,SEname,MESH
127407,CID100004595,CID000004595,C0013384,LLT,C0013384,Dyskinesia,Dyskinesias
300125,CID116132446,CID016132446,C0030305,PT,C0030305,Pancreatitis,Pancreatitis


In [24]:
# Split out the two CID columns. NOTE: each original row can now become two rows, one for CID1 and one for CID2
sider1 = sider_mod[['CID1', 'MESH']]
sider2 = sider_mod[['CID2', 'MESH']]
sider1.columns = ['CID', 'MESH']
sider2.columns = ['CID', 'MESH']
sider_mod = pd.concat([sider1, sider2], ignore_index=True)
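The split-and-concat step above could also be written with `DataFrame.melt`, which keeps track of which column each CID came from. This is a sketch on toy rows (CIDs and MESH names copied from the samples shown earlier); the `CIDtype` column is an addition for illustration only.

```python
import pandas as pd

# Toy rows shaped like sider_mod before the split
toy = pd.DataFrame({
    'CID1': ['CID100000085', 'CID116132446'],
    'CID2': ['CID000010917', 'CID016132446'],
    'MESH': ['Abdominal Pain', 'Pancreatitis'],
})

# melt stacks CID1 and CID2 into one 'CID' column and records the source column
stacked = toy.melt(id_vars='MESH', value_vars=['CID1', 'CID2'],
                   var_name='CIDtype', value_name='CID')
stacked = stacked[['CID', 'MESH', 'CIDtype']].drop_duplicates(subset=['CID', 'MESH'])
print(stacked)
```

Keeping `CIDtype` makes it possible to tell flat and stereo-specific identifiers apart later, if that distinction turns out to matter.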

In [25]:
sider_mod.sample(2)

Unnamed: 0,CID,MESH
215305,CID000004927,Pruritus
161941,CID000002540,Hypersensitivity


In [26]:
print('Sider shape: ', sider_mod.shape[0])
sider_mod = sider_mod.drop_duplicates()
print('Total unique correlated chem:dis observations: ', sider_mod.shape[0])
print('Unique chems: ', sider_mod.CID.unique().shape[0])
print('Unique diseases: ', sider_mod.MESH.unique().shape[0])

Sider shape:  293546
Total unique correlated chem:dis observations:  145635
Unique chems:  2968
Unique diseases:  1034


In [27]:
# Remove all chems that are in our training database
# Read in training db chems (opa-nn notebook)
chems_in_nn = pd.read_csv('../opa/chemsInNN.txt', names=['Chem'])
chems_in_nn = chems_in_nn.dropna().drop_duplicates()
chems_in_nn.shape[0]

# Now remove them from the SIDER db (a set makes each membership test constant-time)
nnChems = set(chems_in_nn.Chem)
sider_mod = sider_mod[~sider_mod.CID.isin(nnChems)]
sider_mod = sider_mod[['CID', 'MESH']]
sider_mod = sider_mod.reset_index(drop=True)

In [28]:
print('Total unique correlated chem:dis observations: ', sider_mod.shape[0])
print('Unique chems: ', sider_mod.CID.unique().shape[0])
print('Unique diseases: ', sider_mod.MESH.unique().shape[0])

Total unique correlated chem:dis observations:  145635
Unique chems:  2968
Unique diseases:  1034
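The counts printed here match the counts from before the training-chemical filter (145635 / 2968 / 1034), so the filter removed no rows. One possible cause, which the notebook does not confirm, is that the training list uses a different identifier format from SIDER's CIDs. A small sketch of a filter that reports how many rows it removed; `drop_seen` and the toy data are hypothetical:

```python
import pandas as pd

def drop_seen(df, seen_ids, col='CID'):
    """Drop rows whose identifier is in seen_ids and report how many were removed."""
    seen = set(seen_ids)
    mask = df[col].isin(seen)
    removed = int(mask.sum())
    if removed == 0:
        # A filter that removes nothing may simply mean the two ID formats never match
        print('Warning: no overlap between', col, 'and the seen IDs; check the ID formats')
    return df[~mask].reset_index(drop=True), removed

toy_sider = pd.DataFrame({'CID': ['CID100000119', 'CID100000137'],
                          'MESH': ['Pain', 'Anemia']})

# Hypothetical case: the training list holds CTD-style IDs rather than CIDs
filtered, removed = drop_seen(toy_sider, ['D005680', 'D000622'])

# Same filter with a matching ID format
filtered_ok, removed_ok = drop_seen(toy_sider, ['CID100000119'])
```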


In [29]:
# Intended result: a set of chem:dis pairs whose chems are not in the NN training set (note the counts above did not change)

In [30]:
# Next: Make each vector for these
# Then: Run NN on them

# Chemical entity - Gene Ontology embeddings (via associated genes)
# Disease entity - Gene Ontology embeddings (via associated genes)
# Disease entity - Human Phenotype Ontology embeddings (via associated phenotypes)
# Disease entity - Mammalian Phenotype Ontology embeddings (via associated phenotypes)
# Chemical entity - Chemical Entities of Biological Interest (CHEBI ) Ontology embeddings
# Disease entity - Disease Ontology embeddings
# Chemical entity - Human Interaction Network Ontology embeddings (via associated genes)
# Disease entity - Human Interaction Network Ontology embeddings (via associated genes)
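As a rough sketch of how the embeddings listed above might become one model input per chem-disease pair: this assumes the network takes a single concatenated vector in a fixed column order, which the notebook does not confirm at this point. `build_feature_row` and the toy 3-d vectors are illustrative only.

```python
import numpy as np

def build_feature_row(row, vector_cols):
    """Concatenate the per-feature vectors of one pair into a single 1-D array."""
    # np.asarray with dtype=float also converts vectors stored as strings
    return np.concatenate([np.asarray(row[col], dtype=float) for col in vector_cols])

# Toy pair with 3-d vectors; the column names are illustrative
toy_row = {'ChemGoVec': [0.1, 0.2, 0.3], 'DisGoVec': ['0.4', '0.5', '0.6']}
toy_features = build_feature_row(toy_row, ['ChemGoVec', 'DisGoVec'])
print(toy_features.shape)
```

Whatever order is used here has to match the order used when the model was trained.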


In [31]:
# SIDER-GO vecs
# For this I need chem-gene associations and disease-gene associations
# Sources: CTD, Disgenet

### Sider Go vecs
First get IDs

In [32]:
sider_mod.sample(2)

Unnamed: 0,CID,MESH
119514,CID005311051,Hyperglycemia
5944,CID100002375,Dysuria


In [33]:
# Turn CID to CTD chemical ID with this map I made earlier 
# Load the map from pickle object
def load_obj(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

ctd_cid_map = load_obj('../opa/ctd_cid_map')

In [34]:
# Will need to standardise the CID and decode from bytes
def cid_standardiser(cid):
    # Target format: 'CID' + 9 digits, the first of which appears to be 1
    return 'CID1' + str(int(cid)).zfill(8)

ctd_cid_map_df = pd.DataFrame.from_dict(ctd_cid_map, orient='index')

In [35]:
# Process and reverse map
ctd_cid_map_df[0] = ctd_cid_map_df[0].str.decode('utf-8')
ctd_cid_map_df[0] = ctd_cid_map_df[0].map(lambda x: cid_standardiser(x))
ctd_cid_map = dict(zip(ctd_cid_map_df[0], ctd_cid_map_df.index.values))
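A small worked example of the decode, standardise, and reverse steps, on a hypothetical two-entry slice of the pickled map. The assumption that raw CIDs are stored as bytes comes from the `.str.decode('utf-8')` call above; `to_flat_cid` is just a local stand-in for `cid_standardiser`.

```python
def to_flat_cid(raw_cid):
    """Format a numeric CID as 'CID1' followed by 8 zero-padded digits."""
    return 'CID1' + str(int(raw_cid)).zfill(8)

# Hypothetical slice of the pickled map: CTD chemical ID -> raw CID stored as bytes
toy_ctd_to_cid = {'D005680': b'119', 'D000622': b'137'}

# Decode, standardise, and reverse so SIDER CIDs can be looked up directly
toy_cid_to_ctd = {to_flat_cid(raw.decode('utf-8')): ctd_id
                  for ctd_id, raw in toy_ctd_to_cid.items()}
print(toy_cid_to_ctd)
```

Because every key is built with the `CID1` prefix, SIDER rows carrying a stereo-specific `CID0...` identifier will not find a match in this map.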

In [36]:
# Now we have the map, apply it to our sider df
sider_mod['ChemicalID'] = sider_mod.CID.map(lambda x: ctd_cid_map.get(x))

In [37]:
print('chem:dis combos: ', sider_mod[sider_mod.ChemicalID.map(lambda x: x is not None)].shape[0])
print('unique chems: ',sider_mod[sider_mod.ChemicalID.map(lambda x: x is not None)].ChemicalID.nunique())

chem:dis combos:  31280
unique chems:  595


In [38]:
sider_mod = sider_mod[sider_mod.ChemicalID.map(lambda x: x is not None)]
sider_mod['MESHid'] = sider_mod.MESH.map(lambda x: mesh_get_id.get(x))
sider_mod.head()

Unnamed: 0,CID,MESH,ChemicalID,MESHid
40,CID100000119,Angioedema,D005680,MESH:D000799
41,CID100000119,Pain,D005680,MESH:D010146
42,CID100000119,Urticaria,D005680,MESH:D014581
43,CID100000137,Anemia,D000622,MESH:D000740
44,CID100000137,Aphasia,D000622,MESH:D001037


In [39]:
print('Total unique correlated chem:dis observations: ', sider_mod.shape[0])
print('Unique chems: ', sider_mod.CID.unique().shape[0])
print('Unique diseases: ', sider_mod.MESHid.unique().shape[0])
## Note that we're losing a lot when we take only chems in CTD - see if we can get gene assocs from elsewhere

Total unique correlated chem:dis observations:  31280
Unique chems:  595
Unique diseases:  886


### Get chem-gene-vecs and dis-gene-vecs premade from CTD data

In [40]:
# Import GOFUNC vecs directly (the file holds both chemical and disease entries)
with open('../opa/go-gofuncs.lst', 'r') as file:
    text = file.read()
    
# Strip and split vector data into list of lists [ID, vec]
text = text.replace('\n', '')
text = text.split(']')
text = [item.strip().split(' [') for item in text]

# Turn it into a data frame
df = pd.DataFrame(text)
df.columns = ['ID', 'Vector']

# Clean
df = df.dropna()
df['Vector'] = df.Vector.map(lambda x: x.rstrip().lstrip().replace('    ', ' ').replace('   ', ' ').replace('  ', ' ').replace(' ', ','))

# Turn vector column into a list
df['Vector'] = df.Vector.map(lambda x: x.split(','))
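The same parsing is repeated later for the DO, HINO, and PRO vector files, so it could live in one helper. A sketch using a regular expression, assuming each entry looks like `ID [v1 v2 ...]` with the vector possibly wrapped across lines (a format inferred from the split-and-replace steps above, not checked against the real files). `parse_vec_text` is a hypothetical name.

```python
import re

# Assumed entry shape: an ID, then a bracketed, whitespace-separated vector
ENTRY = re.compile(r'(\S+)\s*\[([^\]]*)\]')

def parse_vec_text(text):
    """Return {ID: [float, ...]} for every 'ID [v1 v2 ...]' entry in text."""
    return {m.group(1): [float(v) for v in m.group(2).split()]
            for m in ENTRY.finditer(text)}

# Toy text with made-up 3-d vectors
toy_text = 'D002955 [-0.19  0.04\n -0.13]\nMESH:D000855 [7.4e-04 1.3e-01\n 2.0e-02]'
toy_vecs = parse_vec_text(toy_text)
print(toy_vecs)
```

This version converts the values to floats while parsing, so no separate string-to-float pass is needed afterwards.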

In [41]:
df[df.ID.map(lambda x: ('MESH' not in x) & ('OMIM' not in x))].shape# 586

(586, 2)

In [42]:
# Get the chemical vecs, delete any row without a chemical vec
chem_go_vecs = df[df.ID.map(lambda x: ('MESH' not in x) & ('OMIM' not in x))]
chem_to_vec = dict(zip(chem_go_vecs.ID, chem_go_vecs.Vector))
sider_mod['ChemGoVec'] = sider_mod.ChemicalID.map(lambda x: chem_to_vec.get(x))
sider_mod = sider_mod[sider_mod.ChemGoVec.map(lambda x: x is not None)]

In [43]:
# Get the disease vecs, delete any row without a disease vec
dis_go_vecs = df[df.ID.map(lambda x: 'MESH' in x)]
dis_to_vec = dict(zip(dis_go_vecs.ID, dis_go_vecs.Vector))
sider_mod['DisGoVec'] = sider_mod.MESHid.map(lambda x: dis_to_vec.get(x))
sider_mod = sider_mod[sider_mod.DisGoVec.map(lambda x: x is not None)]

In [44]:
sider_mod.head()

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec
76,CID100000143,Anorexia,D002955,MESH:D000855,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[-0.03718876, 0.12608664, -0.0080918, -0.14357..."
80,CID100000143,Diarrhea,D002955,MESH:D003967,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[7.43628596e-04, 1.31800339e-01, 2.09368542e-0..."
83,CID100000143,Hypersensitivity,D002955,MESH:D006967,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01048629, 0.12093927, 0.02683925, -0.134863..."
87,CID100000143,Pruritus,D002955,MESH:D011537,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[-4.40565981e-02, 8.48817080e-02, -7.02312589e..."
88,CID100000143,Stomatitis,D002955,MESH:D013280,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[-4.93375286e-02, 8.82932022e-02, 1.08785890e-..."


In [45]:
# notna() treats both None and NaN as missing
has_dis_vec = sider_mod.DisGoVec.notna()
has_chem_vec = sider_mod.ChemGoVec.notna()
sider_mod = sider_mod[has_dis_vec & has_chem_vec]
print('Number of chem-dis pairs with gofuncs: ', sider_mod.shape[0])
print('Number of chems: ', sider_mod.ChemicalID.nunique())
print('Number of diseases: ', sider_mod.MESHid.nunique())

Number of chem-dis pairs with gofuncs:  1366
Number of chems:  62
Number of diseases:  166


### Delete any pairs in the original NN dataset

In [46]:
# To make this a real blind test we must delete any chem-dis pairs that are in the NN db
nn_chem_dis = pd.read_csv('../ctd-to-nt/chem-dis-pos-assocs.csv')
nn_chem_dis.columns = ['ChemicalID', 'MESHid']
nn_chem_dis = nn_chem_dis.drop_duplicates()  # duplicate pairs would add rows in the merge and misalign the mask below

# Remove from sider_mod any chem-dis pairs that exist in nn_chem_dis
combined_cd = pd.merge(sider_mod[['ChemicalID', 'MESHid']], nn_chem_dis, on=['ChemicalID', 'MESHid'], how='left', indicator='Exist')
not_in_nn = (combined_cd.Exist != 'both').values  # positional mask, one entry per sider_mod row
sider_mod = sider_mod[not_in_nn]
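The pair removal above is an anti-join: keep the rows of one table whose key pair does not appear in another. A minimal sketch on toy data; `drop_known_pairs` and the toy frames are hypothetical, and the known pairs deliberately contain a duplicate to show why they are de-duplicated first.

```python
import pandas as pd

def drop_known_pairs(candidates, known, keys):
    """Return the rows of candidates whose key combination is absent from known."""
    # De-duplicate so the left merge yields exactly one row per candidate row
    known_unique = known[keys].drop_duplicates()
    merged = candidates[keys].merge(known_unique, on=keys, how='left', indicator=True)
    keep = (merged['_merge'] == 'left_only').to_numpy()
    # Boolean array is applied by position, so the original index is preserved
    return candidates[keep]

toy_candidates = pd.DataFrame(
    {'ChemicalID': ['D002955', 'D002955', 'D004298'],
     'MESHid': ['MESH:D000855', 'MESH:D003967', 'MESH:D001281']},
    index=[76, 80, 90])
toy_known = pd.DataFrame(
    {'ChemicalID': ['D002955', 'D002955'],
     'MESHid': ['MESH:D003967', 'MESH:D003967']})

toy_unseen = drop_known_pairs(toy_candidates, toy_known, ['ChemicalID', 'MESHid'])
print(toy_unseen)
```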

In [47]:
# notna() treats both None and NaN as missing
has_dis_vec = sider_mod.DisGoVec.notna()
has_chem_vec = sider_mod.ChemGoVec.notna()
sider_mod = sider_mod[has_dis_vec & has_chem_vec]
print('Number of chem-dis pairs with gofuncs: ', sider_mod.shape[0])
print('Number of chems: ', sider_mod.ChemicalID.nunique())
print('Number of diseases: ', sider_mod.MESHid.nunique())

Number of chem-dis pairs with gofuncs:  991
Number of chems:  62
Number of diseases:  151


### Add control rows (all above are correlated)

In [48]:
# Add control rows (all above are correlated)
sider_mod['Correlation'] = 1

In [49]:
sider.head()

Unnamed: 0,CID1,CID2,UMLS,UMLS2Type,UMLS2,SEname,MESH
0,CID100000085,CID000010917,C0000729,LLT,C0000729,Abdominal cramps,
1,CID100000085,CID000010917,C0000729,PT,C0000737,Abdominal pain,Abdominal Pain
2,CID100000085,CID000010917,C0000737,LLT,C0000737,Abdominal pain,Abdominal Pain
3,CID100000085,CID000010917,C0000737,PT,C0687713,Gastrointestinal pain,
4,CID100000085,CID000010917,C0000737,PT,C0000737,Abdominal pain,Abdominal Pain


In [50]:
# Add unrelated pairs - control obs
no_rows = (sider_mod.shape[0]-1)    # This is a parameter to be tuned --> how many uncorrelated pairs do we want
print('Original shape: ', sider_mod.shape)
sider_mod = sider_mod.drop_duplicates(subset=['ChemicalID', 'MESHid'], keep=False)
print('Shape after dropping duplicates: ', sider_mod.shape)

# Randomly select chems and diseases (as many as there are related pairs)
df_chems = sider_mod[['ChemicalID', 'ChemGoVec']].drop_duplicates(subset=['ChemicalID']).reset_index(drop=True)
df_dis = sider_mod[['MESHid', 'DisGoVec', 'MESH']].drop_duplicates(subset=['MESHid']).reset_index(drop=True)
df_chems.columns = ['ID', 'Vector']
df_dis.columns = ['ID', 'Vector', 'MESH']

# print('chem size: ', df_chems.shape[0])
# print('dis size: ', df_dis.shape[0])

# Note: np.random.choice(n, ...) draws from 0..n-1, so the "- 1" here means the last chem and last disease are never drawn
no_chems = len(df_chems) - 1
no_dis = len(df_dis) - 1
rand_chems = np.random.choice(no_chems, no_rows, replace=True)
rand_dis = np.random.choice(no_dis, no_rows, replace=True)

# Add the new pairs as rows, built in one step (DataFrame.append was removed in pandas 2.0)
controls = pd.DataFrame({
    'ChemicalID': df_chems.loc[rand_chems, 'ID'].values,
    'ChemGoVec': df_chems.loc[rand_chems, 'Vector'].values,
    'MESHid': df_dis.loc[rand_dis, 'ID'].values,
    'DisGoVec': df_dis.loc[rand_dis, 'Vector'].values,
    'MESH': df_dis.loc[rand_dis, 'MESH'].values,
    'Correlation': 0,
})
sider_mod = pd.concat([sider_mod, controls], ignore_index=True)

print('Shape after adding controls: ', sider_mod.shape)
# Drop duplicated pairs. keep=False removes every copy, so a control that collides with a known
# correlated pair removes both rows, and controls drawn more than once are removed entirely
sider_mod = sider_mod.drop_duplicates(subset=['ChemicalID', 'MESHid'], keep=False)
print('Shape after dropping duplicates: ', sider_mod.shape)

Original shape:  (991, 7)
Shape after dropping duplicates:  (991, 7)
Shape after adding controls:  (1981, 7)
Shape after dropping duplicates:  (1657, 7)
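The final drop removed 324 rows (1981 to 1657), and with `keep=False` some of those can be known correlated pairs that happened to collide with a random control. An alternative sketch that samples controls only from pairs not already labelled as correlated, so no positives are lost. `sample_control_pairs` is a hypothetical helper, and it uses the standard-library `random` module rather than the NumPy draws above.

```python
import random
from itertools import product

def sample_control_pairs(positive_pairs, n_pairs, seed=1606):
    """Sample n_pairs distinct (chem, disease) pairs that are not known positives."""
    positives = set(positive_pairs)
    chems = sorted({chem for chem, _ in positives})
    diseases = sorted({dis for _, dis in positives})
    # Every chem x disease combination that is not already a known positive
    candidates = [pair for pair in product(chems, diseases) if pair not in positives]
    if n_pairs > len(candidates):
        raise ValueError('not enough unlabelled pairs to sample from')
    # sample() draws without replacement, so controls are never repeated
    return random.Random(seed).sample(candidates, n_pairs)

toy_positives = [('D002955', 'MESH:D000855'),
                 ('D002955', 'MESH:D003967'),
                 ('D004298', 'MESH:D001281')]
toy_controls = sample_control_pairs(toy_positives, n_pairs=3)
print(toy_controls)
```

Enumerating all combinations is fine at this scale (62 chems by 151 diseases), but would need rethinking for a much larger vocabulary.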


In [51]:
sider_mod.head()

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation
0,CID100000143,Hypersensitivity,D002955,MESH:D006967,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01048629, 0.12093927, 0.02683925, -0.134863...",1
2,CID100000143,Urticaria,D002955,MESH:D014581,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[-0.15222934, 0.05032845, -0.21946777, -0.0608...",1
3,CID100000143,Acute Kidney Injury,D002955,MESH:D058186,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01984936, 0.0847866, 0.04233291, -0.0797925...",1
4,CID100000143,Disease Progression,D002955,MESH:D018450,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[1.11526530e-03, 1.28473654e-01, 3.04674823e-0...",1
6,CID100000681,Atrial Fibrillation,D004298,MESH:D001281,"[-6.34962916e-02, 1.08258978e-01, -2.02546176e...","[-1.60670001e-03, 1.17270418e-01, 1.88340824e-...",1


In [52]:
# sider_mod[['MESH', 'ChemicalID', 'Correlation']].sort_values(['ChemicalID'])

In [53]:
# Manually checking chem-dis associations: some pairs that are listed in SIDER, such as
# Hypertension	D019793
# Neoplasms	D019793
# do not seem to exist according to a Google search

In [54]:
sider_mod.sample(5)

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation
423,CID100003394,Skin Ulcer,D005480,MESH:D012883,"[-0.19691008, 0.01246815, -0.21318772, -0.0904...","[-0.03957883, 0.13396174, -0.0498878, -0.09962...",1
1641,,Adenocarcinoma of lung,D005013,MESH:C538231,"[-0.09967206, 0.05885228, -0.1085031, -0.05162...","[-0.05705886, 0.09871884, 0.00668764, -0.11565...",0
156,CID100002818,Hypertriglyceridemia,D003024,MESH:D015228,"[-0.0243297, 0.09980957, -0.01168219, -0.12672...","[-3.41259576e-02, 1.32385924e-01, 4.26359326e-...",1
1646,,Trigeminal Neuralgia,D012968,MESH:D014277,"[1.50684863e-02, 1.09219313e-01, 3.79371457e-0...","[-0.14551874, 0.05946752, -0.18298474, -0.0806...",0
1559,,Hypersensitivity,D004221,MESH:D006967,"[-0.15010051, 0.04692498, -0.16275029, -0.0622...","[0.01048629, 0.12093927, 0.02683925, -0.134863...",0


## SIDER Phenotype Ontology

In [55]:
# First get DOIDs --> importing map and applying it:
mapper = pd.read_csv('../opa/chem_dis_to_CID_DOID.csv')
print(mapper.DOID.nunique()) # 1671
mesh_to_doid = dict(zip(mapper.ID, mapper.DOID))
sider_mod['DOID'] = sider_mod.MESHid.map(lambda x: mesh_to_doid.get(x))

1671


In [56]:
# Standardise the DOIDs
def doid_standardiser (doid):
    doid = doid.replace(':', '_')
    return doid

# Only standardise real DOID strings; missing values (None or NaN) become NaN
sider_mod['DOID'] = sider_mod.DOID.map(lambda x: doid_standardiser(x) if isinstance(x, str) else np.nan)

In [57]:
# Simply load in the premade dis-phenVec maps
def load_obj(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

dis_mpVec = load_obj('../opa/dis_mpVec_map')

dis_hpVec = load_obj('../opa/dis_hpVec_map')

In [58]:
# Apply the maps to add phenVecs to our dataframe
empty_vec = [0] * 200

sider_mod['disPhenVecMP'] = sider_mod.DOID.map(lambda x: dis_mpVec.get(x, empty_vec))
sider_mod['disPhenVecHP'] = sider_mod.DOID.map(lambda x: dis_hpVec.get(x, empty_vec))

In [59]:
sider_mod.head()

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation,DOID,disPhenVecMP,disPhenVecHP
0,CID100000143,Hypersensitivity,D002955,MESH:D006967,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01048629, 0.12093927, 0.02683925, -0.134863...",1,DOID_1205,"[3.05852331e-02, 1.36921287e-01, 5.83447553e-0...","[4.44701687e-02, 1.48065820e-01, 7.01489896e-0..."
2,CID100000143,Urticaria,D002955,MESH:D014581,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[-0.15222934, 0.05032845, -0.21946777, -0.0608...",1,DOID_1555,"[0.02479337, 0.11779714, 0.04578669, -0.118837...","[0.04158028, 0.15021026, 0.05177102, -0.128954..."
3,CID100000143,Acute Kidney Injury,D002955,MESH:D058186,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01984936, 0.0847866, 0.04233291, -0.0797925...",1,DOID_3021,"[2.61312630e-02, 1.06001623e-01, 5.01475073e-0...","[4.42636572e-02, 1.56471714e-01, 5.58946170e-0..."
4,CID100000143,Disease Progression,D002955,MESH:D018450,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[1.11526530e-03, 1.28473654e-01, 3.04674823e-0...",1,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
6,CID100000681,Atrial Fibrillation,D004298,MESH:D001281,"[-6.34962916e-02, 1.08258978e-01, -2.02546176e...","[-1.60670001e-03, 1.17270418e-01, 1.88340824e-...",1,DOID_0060224,"[5.46471141e-02, 1.52429357e-01, 7.47762397e-0...","[0.06875926, 0.15412922, 0.06408796, -0.108529..."


In [60]:
# Right let's add the rest of the features to our dataset

### Add CHEBI vecs

In [61]:
# Import chem2chebivec map made in opa-nn notebook
def load_obj(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

chem2chebi = load_obj('../opa/chem2chebi')

In [62]:
sider_mod.sample(4)

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation,DOID,disPhenVecMP,disPhenVecHP
1121,,Proteinuria,D003024,MESH:D011507,"[-0.0243297, 0.09980957, -0.01168219, -0.12672...","[-1.23689458e-01, 2.06021965e-02, -1.98148027e...",0,DOID_576,"[0.038631, 0.12100562, 0.06092621, -0.11268248...","[5.98250851e-02, 1.54849425e-01, 5.55189103e-0..."
1032,,Otitis Media,D002065,MESH:D010033,"[-0.02266272, 0.10866198, 0.01653025, -0.12538...","[0.01173668, 0.08411296, 0.00439407, -0.083879...",0,DOID_10754,"[2.77346596e-02, 1.19539544e-01, 6.01626299e-0...","[0.04242221, 0.15833516, 0.04688059, -0.118981..."
203,CID100003121,Deafness,D014635,MESH:D003638,"[-0.09201236, 0.06609955, -0.13402744, -0.1180...","[2.99392752e-02, 1.05787165e-01, 5.17426692e-0...",1,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
366,CID100003386,Hyperplasia,D005473,MESH:D006965,"[-0.17009397, 0.01808067, -0.3474555, -0.16473...","[-2.88997646e-02, 1.38664156e-01, 3.66541967e-...",1,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [63]:
sider_mod['CHEBIvec'] = sider_mod.ChemicalID.map(lambda x: chem2chebi.get(x, empty_vec))

### Add DO Vecs

In [64]:
# Import DO vec file
with open('../opa/do-vecs.lst', 'r') as file:
    text = file.read()
    
# Strip and split vector data into list of lists [disease, vec]
text = text.replace('\n', '')
text = text.split(']')
text = [item.strip().split(' [') for item in text]

# Turn it into a data frame
df = pd.DataFrame(text)
df.columns = ['ID', 'Vector']

# Clean
df = df.dropna()
df['Vector'] = df.Vector.map(lambda x: x.rstrip().lstrip().replace('    ', ' ').replace('   ', ' ').replace('  ', ' ').replace(' ', ','))

# Turn vector column into a list
df['Vector'] = df.Vector.map(lambda x: x.split(','))

# Make a map of it (DisID to DOvec)
dis_to_DOvec = dict(zip(df.ID, df.Vector))

In [65]:
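# Fall back to a zero vector, as for the other features, so the float conversion below cannot hit None
sider_mod['DOvec'] = sider_mod.MESHid.map(lambda x: dis_to_DOvec.get(x, empty_vec))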

In [66]:
# Change the DO vec elements from string to floats
sider_mod['DOvec'] = sider_mod.DOvec.map(lambda x: [float(i) for i in x])

In [67]:
sider_mod.head()

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation,DOID,disPhenVecMP,disPhenVecHP,CHEBIvec,DOvec
0,CID100000143,Hypersensitivity,D002955,MESH:D006967,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01048629, 0.12093927, 0.02683925, -0.134863...",1,DOID_1205,"[3.05852331e-02, 1.36921287e-01, 5.83447553e-0...","[4.44701687e-02, 1.48065820e-01, 7.01489896e-0...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.0323501714, 0.0718148202, 0.0226857904, -0...."
2,CID100000143,Urticaria,D002955,MESH:D014581,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[-0.15222934, 0.05032845, -0.21946777, -0.0608...",1,DOID_1555,"[0.02479337, 0.11779714, 0.04578669, -0.118837...","[0.04158028, 0.15021026, 0.05177102, -0.128954...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.02033558, 0.07687499, 0.03339575, -0.061590..."
3,CID100000143,Acute Kidney Injury,D002955,MESH:D058186,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01984936, 0.0847866, 0.04233291, -0.0797925...",1,DOID_3021,"[2.61312630e-02, 1.06001623e-01, 5.01475073e-0...","[4.42636572e-02, 1.56471714e-01, 5.58946170e-0...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.0153563516, 0.0718724281, 0.0305716358, -0...."
4,CID100000143,Disease Progression,D002955,MESH:D018450,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[1.11526530e-03, 1.28473654e-01, 3.04674823e-0...",1,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.00961015, 0.0239394, 0.01371201, -0.0367966..."
6,CID100000681,Atrial Fibrillation,D004298,MESH:D001281,"[-6.34962916e-02, 1.08258978e-01, -2.02546176e...","[-1.60670001e-03, 1.17270418e-01, 1.88340824e-...",1,DOID_0060224,"[5.46471141e-02, 1.52429357e-01, 7.47762397e-0...","[0.06875926, 0.15412922, 0.06408796, -0.108529...","[0.01396446, 0.05335214, 0.02592678, -0.054371...","[0.02116747, 0.06971144, 0.04062635, -0.064116..."


### Get HINO Vecs

In [68]:
# Import HINO vec file
with open('../opa/hinoVecs.lst', 'r') as file:
    text = file.read()
    
# Strip and split vector data into list of lists [entity, vec]
text = text.replace('\n', '')
text = text.split(']')
text = [item.strip().split(' [') for item in text]

# Turn it into a data frame
df = pd.DataFrame(text)
df.columns = ['ID', 'Vector']

# Clean
df = df.dropna()
df['Vector'] = df.Vector.map(lambda x: x.rstrip().lstrip().replace('    ', ' ').replace('   ', ' ').replace('  ', ' ').replace(' ', ','))

# Turn vector column into a list
df['Vector'] = df.Vector.map(lambda x: x.split(','))

# Make a map of it (entity ID to HINO vec; covers both chems and diseases)
entity_to_HINOvec = dict(zip(df.ID, df.Vector))

In [69]:
sider_mod['dis_HINOvec'] = sider_mod.MESHid.map(lambda x: entity_to_HINOvec.get(x))
sider_mod['chem_HINOvec'] = sider_mod.ChemicalID.map(lambda x: entity_to_HINOvec.get(x))

In [70]:
print('HINO dis vecs: ', sider_mod[sider_mod.dis_HINOvec.map(lambda x: x is not None)].shape[0])
print('HINO chem vecs: ', sider_mod[sider_mod.chem_HINOvec.map(lambda x: x is not None)].shape[0])
at_least_one = sider_mod.chem_HINOvec.map(lambda x: x is not None) | sider_mod.dis_HINOvec.map(lambda x: x is not None)
print('At least one hino vec: ', sider_mod[at_least_one].shape[0])

HINO dis vecs:  1294
HINO chem vecs:  1021
At least one hino vec:  1516


In [71]:
# Add empty vecs in place of None
empty_vec = [0] * 200

for col in ['dis_HINOvec', 'chem_HINOvec']:
    sider_mod[col] = sider_mod[col].map(lambda x: empty_vec if x is None else x)

In [72]:
# Change the HINO vec elements from string to floats
sider_mod['dis_HINOvec'] = sider_mod.dis_HINOvec.map(lambda x: [float(i) for i in x])
sider_mod['chem_HINOvec'] = sider_mod.chem_HINOvec.map(lambda x: [float(i) for i in x])
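The lookup, zero-fill, and float-conversion steps are the same for each vector source, so they could be wrapped in one helper. A minimal sketch; `attach_vectors` is a hypothetical name, and the toy map uses made-up 3-d vectors stored as strings in place of the real 200-d ones.

```python
import pandas as pd

def attach_vectors(df, key_col, vec_map, new_col, dim=200):
    """Look up a vector per row, zero-fill misses, and convert entries to float."""
    def lookup(key):
        vec = vec_map.get(key)
        if vec is None:
            return [0.0] * dim  # fresh list per row, so rows never share one object
        return [float(v) for v in vec]

    out = df.copy()  # leave the caller's frame untouched
    out[new_col] = out[key_col].map(lookup)
    return out

toy_df = pd.DataFrame({'MESHid': ['MESH:D006967', 'MESH:D058186']})
toy_hino = {'MESH:D006967': ['0.0131', '0.0484', '0.0218']}  # hypothetical values
toy_out = attach_vectors(toy_df, 'MESHid', toy_hino, 'dis_HINOvec', dim=3)
print(toy_out)
```

Called once per source and entity type (for example `dis_HINOvec`, `chem_HINOvec`), this would replace the three separate cells used for each one.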

In [73]:
sider_mod.head()

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation,DOID,disPhenVecMP,disPhenVecHP,CHEBIvec,DOvec,dis_HINOvec,chem_HINOvec
0,CID100000143,Hypersensitivity,D002955,MESH:D006967,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01048629, 0.12093927, 0.02683925, -0.134863...",1,DOID_1205,"[3.05852331e-02, 1.36921287e-01, 5.83447553e-0...","[4.44701687e-02, 1.48065820e-01, 7.01489896e-0...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.0323501714, 0.0718148202, 0.0226857904, -0....","[0.0131735466, 0.0484740399, 0.0218853857, -0....","[0.02397728, 0.09010118, 0.04078608, -0.085982..."
2,CID100000143,Urticaria,D002955,MESH:D014581,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[-0.15222934, 0.05032845, -0.21946777, -0.0608...",1,DOID_1555,"[0.02479337, 0.11779714, 0.04578669, -0.118837...","[0.04158028, 0.15021026, 0.05177102, -0.128954...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.02033558, 0.07687499, 0.03339575, -0.061590...","[0.01310632, 0.08857583, 0.04576511, -0.082742...","[0.02397728, 0.09010118, 0.04078608, -0.085982..."
3,CID100000143,Acute Kidney Injury,D002955,MESH:D058186,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[0.01984936, 0.0847866, 0.04233291, -0.0797925...",1,DOID_3021,"[2.61312630e-02, 1.06001623e-01, 5.01475073e-0...","[4.42636572e-02, 1.56471714e-01, 5.58946170e-0...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.0153563516, 0.0718724281, 0.0305716358, -0....","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.02397728, 0.09010118, 0.04078608, -0.085982..."
4,CID100000143,Disease Progression,D002955,MESH:D018450,"[-0.1928972, 0.04133245, -0.13697416, -0.05781...","[1.11526530e-03, 1.28473654e-01, 3.04674823e-0...",1,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.01772805, 0.05946666, 0.02581723, -0.057734...","[0.00961015, 0.0239394, 0.01371201, -0.0367966...","[0.0144636966, 0.0569347367, 0.0221220553, -0....","[0.02397728, 0.09010118, 0.04078608, -0.085982..."
6,CID100000681,Atrial Fibrillation,D004298,MESH:D001281,"[-6.34962916e-02, 1.08258978e-01, -2.02546176e...","[-1.60670001e-03, 1.17270418e-01, 1.88340824e-...",1,DOID_0060224,"[5.46471141e-02, 1.52429357e-01, 7.47762397e-0...","[0.06875926, 0.15412922, 0.06408796, -0.108529...","[0.01396446, 0.05335214, 0.02592678, -0.054371...","[0.02116747, 0.06971144, 0.04062635, -0.064116...","[0.014063, 0.05615393, 0.02999277, -0.05821621...","[0.0187989, 0.07305145, 0.0346327, -0.07062967..."


### Get PRO vecs

In [74]:
# Import PRO vec file
with open('../opa/PROVecs.lst', 'r') as file:
    text = file.read()
    
# Strip and split vector data into list of lists [entity, vec]
text = text.replace('\n', '')
text = text.split(']')
text = [item.strip().split(' [') for item in text]

# Turn it into a data frame
df = pd.DataFrame(text)
df.columns = ['ID', 'Vector']

# Clean
df = df.dropna()
df['Vector'] = df.Vector.map(lambda x: x.rstrip().lstrip().replace('    ', ' ').replace('   ', ' ').replace('  ', ' ').replace(' ', ','))

# Turn vector column into a list
df['Vector'] = df.Vector.map(lambda x: x.split(','))

# Make a map of it (entity ID to PROvec; used for both diseases and chemicals)
entity_to_PROvec = dict(zip(df.ID, df.Vector)) 
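The cell above turns raw text of the form `ID [v1 v2 ...]` into an ID-to-vector map. A minimal sketch of the same idea with a regular expression, which does not depend on how many spaces or line breaks separate the numbers. The helper name `parse_vec_text` and the sample IDs are hypothetical, and the record layout is an assumption read off the splitting logic above rather than a documented format.

```python
import re

# One record = an ID token, then a bracketed run of whitespace-separated numbers.
# Assumption: IDs contain no whitespace and no square brackets.
RECORD = re.compile(r"([^\s\[\]]+)\s*\[([^\]]*)\]")

def parse_vec_text(text):
    """Return {entity_id: [float, ...]} parsed from raw vector-file text."""
    vectors = {}
    for entity_id, body in RECORD.findall(text):
        # str.split() with no argument collapses any run of spaces or newlines
        vectors[entity_id] = [float(tok) for tok in body.split()]
    return vectors

sample = "MESH:D000001 [ 0.1   -0.2\n  3.0e-02]\nD000002 [1 2 3]\n"
parsed = parse_vec_text(sample)
```

Because this version yields floats directly, a separate string-to-float pass would not be needed for vectors loaded this way.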

In [75]:
sider_mod['dis_PROvec'] = sider_mod.MESHid.map(lambda x: entity_to_PROvec.get(x))
sider_mod['chem_PROvec'] = sider_mod.ChemicalID.map(lambda x: entity_to_PROvec.get(x))

In [76]:
# Add empty vecs in place of None
empty_vec = [0] * 200

for col in ['dis_PROvec', 'chem_PROvec']:
    sider_mod[col] = sider_mod[col].map(lambda x: empty_vec if x is None else x)

In [77]:
# Change the vec elements from string to floats
sider_mod['dis_PROvec'] = sider_mod.dis_PROvec.map(lambda x: [float(i) for i in x])
sider_mod['chem_PROvec'] = sider_mod.chem_PROvec.map(lambda x: [float(i) for i in x])
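The three cells above (look up, fill missing entries with zeros, cast to float) can be folded into one helper. A sketch, assuming 200-dimensional embeddings as the `[0] * 200` placeholder suggests; `lookup_vec` is a hypothetical name. Unlike a shared `empty_vec`, it returns a fresh list for every missing entity, so rows never alias the same object.

```python
EMBED_DIM = 200  # assumption: matches the [0] * 200 placeholder used above

def lookup_vec(entity_id, table, dim=EMBED_DIM):
    """Return the embedding for entity_id as floats, or a zero vector if absent."""
    raw = table.get(entity_id)
    if raw is None:
        return [0.0] * dim          # new list on each call, safe to mutate
    return [float(v) for v in raw]  # entries may still be strings

toy_table = {"MESH:D014581": ["0.5", "-1.25"]}
hit = lookup_vec("MESH:D014581", toy_table, dim=2)
miss = lookup_vec("MESH:D999999", toy_table, dim=2)
```

With the notebook's data this would be applied as `sider_mod.MESHid.map(lambda x: lookup_vec(x, entity_to_PROvec))`.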

In [78]:
sider_mod.sample(10)

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation,DOID,disPhenVecMP,disPhenVecHP,CHEBIvec,DOvec,dis_HINOvec,chem_HINOvec,dis_PROvec,chem_PROvec
132,CID100002662,Pulmonary Embolism,D000068579,MESH:D011655,"[0.00211891, 0.07899864, 0.02812183, -0.078234...","[-0.09822737, 0.0335036, -0.05875964, -0.03581...",1,DOID_9477,"[0.02974856, 0.10559778, 0.04610523, -0.100018...","[6.53055310e-02, 1.49755850e-01, 4.99520116e-0...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0156938918, 0.0867972076, 0.0221232176, -0....","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.00525787473, 0.0360455811, 0.0133509124, -0...","[0.06025169, 0.15407243, 0.07565367, -0.141278..."
644,CID100027991,Hallucinations,D003894,MESH:D006212,"[4.54983348e-03, 1.33685231e-01, 3.01043820e-0...","[1.38351833e-02, 1.00146689e-01, -1.20333534e-...",1,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1.65594146e-02, 4.69885767e-02, 2.39662267e-0...","[0.00156835, 0.03837546, 0.02150043, -0.032972...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0067672865, 0.046747178, 0.017236983, -0.03...","[0.0143580418, 0.0460643508, 0.0373494849, -0....","[0.01404202, 0.03400609, 0.02124024, -0.044419..."
765,CID100068740,Pain,C088658,MESH:D010146,"[-0.0192352, 0.12388816, 0.05082998, -0.121224...","[-7.52816349e-03, 9.87071991e-02, -3.51563394e...",1,DOID_0060164,"[0.020703, 0.10425372, 0.06019993, -0.10321342...","[0.0870415, 0.18313521, 0.04869605, -0.1033120...","[0.02015562, 0.06090377, 0.02371735, -0.060520...","[0.0221906733, 0.0621656179, 0.0342238694, -0....","[0.0179795623, 0.0671450794, 0.0407162756, -0....","[0.01677481, 0.04568736, 0.01816869, -0.042157...","[0.01871282, 0.08193666, 0.05205654, -0.092367...","[0.04957327, 0.10940924, 0.08774322, -0.119886..."
1832,,Acidosis,C520809,MESH:D000138,"[2.99941711e-02, 1.30992413e-01, -9.42363346e-...","[1.93286445e-02, 1.23003952e-01, 4.37963828e-0...",0,DOID_0050758,"[2.40530279e-02, 1.35453522e-01, 6.51230440e-0...","[4.35994193e-02, 1.77338064e-01, 6.37756735e-0...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.01223107, 0.05447845, 0.03592256, -0.060255...","[0.0152816297, 0.045640111, 0.0169180427, -0.0...","[0.0156728588, 0.0393506847, 0.0277857389, -0....","[0.00258124014, 0.0349172056, 0.0179528464, -0...","[0.00267311581, 0.0269108973, 0.0183382072, -0..."
1003,,Fibrosis,C056516,MESH:D005355,"[-0.10259372, 0.03158976, -0.13119341, -0.0748...","[-0.01172249, 0.13109156, 0.00888951, -0.15944...",0,,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.01953488, 0.05832418, 0.0215925, -0.0734598...","[0.011856921, 0.0321784094, 0.0178192798, -0.0...","[0.0150483958, 0.0510327779, 0.0197701622, -0....","[0.0150026493, 0.0817026943, 0.0370404795, -0....","[0.0154910693, 0.0648032054, 0.0405438244, -0....","[0.03244242, 0.11643924, 0.07927948, -0.103942..."
805,CID100082146,Hypersensitivity,C095105,MESH:D006967,"[-1.14149982e-02, 1.23534575e-01, 4.01898324e-...","[0.01048629, 0.12093927, 0.02683925, -0.134863...",1,DOID_1205,"[3.05852331e-02, 1.36921287e-01, 5.83447553e-0...","[4.44701687e-02, 1.48065820e-01, 7.01489896e-0...","[0.01104784, 0.0811751, 0.02639464, -0.0813666...","[0.0323501714, 0.0718148202, 0.0226857904, -0....","[0.0131735466, 0.0484740399, 0.0218853857, -0....","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0108111752, 0.0490850098, 0.038106367, -0.0...","[-0.00286633009, 0.0755436048, 0.0605314672, -..."
1119,,Dystonia,C108128,MESH:D004421,"[3.36659923e-02, 1.05258629e-01, -3.89491096e-...","[2.21940875e-02, 1.03328131e-01, 4.64014374e-0...",0,DOID_543,"[0.03356696, 0.1378114, 0.0644403, -0.12334467...","[4.53660190e-02, 1.53857276e-01, 5.68589233e-0...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.0312586166, 0.0807410479, 0.0347139612, -0....","[0.0188403018, 0.0431822352, 0.0268400479, -0....","[0.01628698, 0.04260838, 0.01876359, -0.041118...","[0.0148099, 0.0431435779, 0.0281828679, -0.040...","[0.00090492, 0.05994264, 0.03233665, -0.069343..."
603,CID100005978,Intestinal Obstruction,D014750,MESH:D007415,"[0.02965491, 0.10880372, 0.04334658, -0.113439...","[-0.00746287, 0.14074765, -0.00134251, -0.1557...",1,DOID_8437,"[0.02977912, 0.11603101, 0.05039036, -0.108854...","[0.04598249, 0.16349453, 0.06612629, -0.132812...","[4.79505537e-03, 9.33838561e-02, 5.36614992e-0...","[0.0280748326, 0.0803671926, 0.0389995761, -0....","[0.00974175986, 0.0590910651, 0.0298414342, -0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.03160229, 0.07520672, 0.0484142, -0.0722283...","[0.01587362, 0.03591713, 0.02736461, -0.036889..."
393,CID100003386,Hyperuricemia,D005473,MESH:D033461,"[-0.17009397, 0.01808067, -0.3474555, -0.16473...","[1.63829103e-02, 1.21408507e-01, -3.25293676e-...",1,DOID_1920,"[0.02442656, 0.14061204, 0.05963011, -0.115171...","[0.03530318, 0.16593891, 0.05585731, -0.125786...","[0.00727077, 0.05299819, 0.02481131, -0.049112...","[0.0183377, 0.07108285, 0.03803683, -0.0703049...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0363478102, 0.144311264, 0.0613140389, -0.1...","[0.00263378, 0.04712868, 0.0265272, -0.0483419...","[0.0374873, 0.17948729, 0.17558731, -0.1991960..."
1790,,Subarachnoid Hemorrhage,C010792,MESH:D013345,"[0.01489374, 0.11761086, 0.04951559, -0.110714...","[-8.19592327e-02, 1.09936357e-01, -7.64746219e...",0,DOID_0060228,"[0.01079678, 0.09617513, 0.0421772, -0.1027466...","[0.05360616, 0.1687279, 0.0551669, -0.11726358...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0.02268115, 0.06369793, 0.02509757, -0.040526...","[0.00925999, 0.0410583, 0.01971564, -0.0444400...","[0.0077734822, 0.040110637, 0.017096259, -0.03...","[0.00887711, 0.04698301, 0.02177738, -0.030554...","[0.00637148, 0.05133118, 0.02642822, -0.055287..."


## SIDER Run NN

In [170]:
# Load model (saved in opa-nn notebook)
from tensorflow.keras.models import load_model
# nn14022019auc921GoPhenCHEdoHI
# nn15022019auc937GoPhenCHEdoHIpro
# nn14022019auc921GoPhenCHEdoHI --> .586
# nn15022019auc92GoPhenCHEdoHIpro --> .50
model = load_model('../opa/nn15022019auc885GoHINOphen.h5')



In [171]:
# Now let's see if saving and loading the model worked
# Create a vector from go vec + empty vecs for the other desired vecs
# Use NN to make predictions
# Evaluate these

In [172]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_3 (Flatten)          (None, 800)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 200)               160200    
_________________________________________________________________
dense_13 (Dense)             (None, 60)                12060     
_________________________________________________________________
dense_14 (Dense)             (None, 10)                610       
_________________________________________________________________
dense_15 (Dense)             (None, 1)                 11        
Total params: 172,881
Trainable params: 172,881
Non-trainable params: 0
_________________________________________________________________
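The parameter counts in the summary pin down the input width this network expects. As a quick check (plain arithmetic, no Keras needed), a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out + n_out` parameters:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Input width, then the output width of each Dense layer in the summary
layer_widths = [800, 200, 60, 10, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_widths, layer_widths[1:])]
total = sum(per_layer)
```

The first entry reproduces the 160,200 parameters of `dense_12`, which is only possible with 800 inputs, i.e. four vectors per row if each embedding has 200 entries.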


### Preprocess vecs

In [173]:
# # I think this model expects input shape 1600, so add empty vecs for the cols I don't have yet
# empty_vec = [0] * 200

# cols_to_do = ['disPhenVecMP', 'disPhenVecHP', 'CHEBIvec', 'DOvec', 'dis_HINOvec', 'chem_HINOvec', ]

# for col in cols_to_do:
#     sider_mod[col] = np.nan
#     sider_mod[col] = sider_mod[col].map(lambda x: empty_vec)

In [174]:
# Cast every vector element to float (the PRO vecs were already cast above)
all_vecs = ['ChemGoVec', 'DisGoVec', 'disPhenVecMP', 'disPhenVecHP', 'CHEBIvec', 'DOvec', 'dis_HINOvec', 'chem_HINOvec']

for col in all_vecs:
    sider_mod[col] = sider_mod[col].map(lambda x: [float(i) for i in x])

In [175]:
print(sider_mod[sider_mod.Correlation == 1].shape[0])
print(sider_mod[sider_mod.Correlation == 0].shape[0])
print(sider_mod.shape)

881
776
(1657, 16)
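These counts set a reference point for any accuracy figure on this set: a classifier that always predicts the larger class already scores about 53%. A small worked calculation from the numbers printed above:

```python
n_pos, n_neg = 881, 776  # correlated and uncorrelated pairs, as printed above
n_total = n_pos + n_neg

# Accuracy of always predicting the majority class
majority_baseline = max(n_pos, n_neg) / n_total
```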


In [176]:
sider_mod.sample(3)

Unnamed: 0,CID,MESH,ChemicalID,MESHid,ChemGoVec,DisGoVec,Correlation,DOID,disPhenVecMP,disPhenVecHP,CHEBIvec,DOvec,dis_HINOvec,chem_HINOvec,dis_PROvec,chem_PROvec
604,CID100005978,Pneumonia,D014750,MESH:D011014,"[0.02965491, 0.10880372, 0.04334658, -0.113439...","[-0.10981419, 0.02052441, -0.20415413, -0.0357...",1,DOID_552,"[0.03461778, 0.12718843, 0.05749597, -0.114553...","[0.04328105, 0.15122803, 0.05721117, -0.111103...","[0.00479505537, 0.0933838561, 0.0536614992, -0...","[0.01394993, 0.06581535, 0.03156545, -0.063000...","[0.02849239, 0.10294926, 0.04649258, -0.099316...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0310054, 0.11531034, 0.09447175, -0.1186431...","[0.01587362, 0.03591713, 0.02736461, -0.036889..."
1704,,Splenomegaly,D014750,MESH:D013163,"[0.02965491, 0.10880372, 0.04334658, -0.113439...","[-0.0909505263, 0.0255884528, -0.221973851, -0...",0,,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.00479505537, 0.0933838561, 0.0536614992, -0...","[0.0059688692, 0.0401449874, 0.0278134961, -0....","[0.00985397, 0.05246037, 0.03201986, -0.052521...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.03021841, 0.09208963, 0.05248984, -0.095798...","[0.01587362, 0.03591713, 0.02736461, -0.036889..."
665,CID100040973,Urticaria,D017135,MESH:D014581,"[0.01088468, 0.09196024, 0.04892443, -0.092027...","[-0.15222934, 0.05032845, -0.21946777, -0.0608...",1,DOID_1555,"[0.02479337, 0.11779714, 0.04578669, -0.118837...","[0.04158028, 0.15021026, 0.05177102, -0.128954...","[0.01064619, 0.05261517, 0.03455541, -0.053472...","[0.02033558, 0.07687499, 0.03339575, -0.061590...","[0.01310632, 0.08857583, 0.04576511, -0.082742...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.03658045, 0.11332764, 0.07062415, -0.121959...","[0.0057065, 0.04991744, 0.03463568, -0.0479967..."


In [177]:
# # Optionally remove all empty vecs
# empty_vec = [0.0] * 200

# for col in ['DisGoVec', 'ChemGoVec', 'disPhenVecMP', 'disPhenVecHP', 'CHEBIvec', 'DOvec', 'dis_HINOvec', 'chem_HINOvec', 'dis_PROvec']:
#     sider_mod[col] = sider_mod[col].map(lambda x: np.nan if x == empty_vec else x)
    
# sider_mod = sider_mod.dropna(subset=['DisGoVec', 'ChemGoVec', 'disPhenVecMP', 'disPhenVecHP', 'CHEBIvec', 'DOvec', 'dis_HINOvec', 'chem_HINOvec'])
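Before dropping rows, it may help to see how much of each column is placeholder. A sketch, with `zero_fraction` as a hypothetical helper that accepts any iterable of vectors, for example `sider_mod['CHEBIvec']`:

```python
def zero_fraction(vectors):
    """Fraction of vectors whose entries are all zero (placeholder rows)."""
    vectors = list(vectors)
    if not vectors:
        return 0.0
    # any(vec) is False only when every entry is 0 or 0.0
    n_zero = sum(1 for vec in vectors if not any(vec))
    return n_zero / len(vectors)

toy_column = [[0.0, 0.0], [0.1, -0.3], [0, 0], [0.0, 0.2]]
frac = zero_fraction(toy_column)
```

A column that is mostly zeros contributes little signal, which is worth knowing before deciding which feature blocks to feed the model.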

In [178]:
print(sider_mod[sider_mod.Correlation == 1].shape[0])
print(sider_mod[sider_mod.Correlation == 0].shape[0])
print(sider_mod.shape)

881
776
(1657, 16)


In [179]:
# # Download sider_mod to run NN on it in Opa-nn notebook (compare and see if model load component failing)
# sider_mod.to_csv('Sider_val.csv')

In [180]:
# # For Keras, need to turn inputs into numpy arrays instead of pandas df
# # First create single np array of all vecs... not pretty:
# Dvecs = pd.DataFrame(sider_mod.DisGoVec.values.tolist(), index= sider_mod.index)
# Cvecs = pd.DataFrame(sider_mod.ChemGoVec.values.tolist(), index= sider_mod.index)
# gofuncs = Dvecs.merge(Cvecs, how='outer', left_index=True, right_index=True)

# DMPvecs = pd.DataFrame(sider_mod.disPhenVecHP.values.tolist(), index= sider_mod.index)
# DHPvecs = pd.DataFrame(sider_mod.disPhenVecMP.values.tolist(), index= sider_mod.index)
# disPvecs = DMPvecs.merge(DHPvecs, how='outer', left_index=True, right_index=True)

# all_X = disPvecs.merge(gofuncs, how='outer', left_index=True, right_index=True)

# CHEBvecs = pd.DataFrame(sider_mod.CHEBIvec.values.tolist(), index = sider_mod.index)
# all_X = CHEBvecs.merge(all_X, how='outer', left_index=True, right_index=True)

# DOvecs = pd.DataFrame(sider_mod.DOvec.values.tolist(), index = sider_mod.index)
# all_X = DOvecs.merge(all_X, how='outer', left_index=True, right_index=True)

# dHINOvecs = pd.DataFrame(sider_mod.dis_HINOvec.values.tolist(), index=sider_mod.index)
# cHINOvecs = pd.DataFrame(sider_mod.chem_HINOvec.values.tolist(), index=sider_mod.index)
# hinovecs = cHINOvecs.merge(dHINOvecs, how='outer', left_index=True, right_index=True)
# all_X = all_X.merge(hinovecs, how='outer', left_index=True, right_index=True)

# all_X = np.array(all_X)

In [181]:
# # try out dropping na rows --> does of course boost AUC significantly
# sider_mod = sider_mod.dropna()
# # need to re-add uncorrelated rows tho if dropping na

# # Add unrelated pairs - control obs
# no_rows = (sider_mod.shape[0]-1)   # This is a parameter to be tuned --> how many uncorrelated pairs do we want
# print('Original shape: ', sider_mod.shape)
# sider_mod = sider_mod.drop_duplicates(subset=['ChemicalID', 'MESHid'], keep=False)
# print('Shape after dropping duplicates: ', sider_mod.shape)

# # Randomly select chems and diseases (as many as there are related pairs)
# df_chems = sider_mod[['ChemicalID', 'ChemGoVec']].drop_duplicates(subset=['ChemicalID']).reset_index(drop=True)
# df_dis = sider_mod[['MESHid', 'DisGoVec', 'MESH']].drop_duplicates(subset=['MESHid']).reset_index(drop=True)
# df_chems.columns = ['ID', 'Vector']
# df_dis.columns = ['ID', 'Vector', 'MESH']

# # print('chem size: ', df_chems.shape[0])
# # print('dis size: ', df_dis.shape[0])

# no_chems = len(df_chems) - 1
# no_dis = len(df_dis) - 1
# rand_chems = np.random.choice(no_chems, no_rows, replace=True)
# rand_dis = np.random.choice(no_dis, no_rows, replace=True)

# # Add the new pairs as rows
# for x in range(0, no_rows):
#     int1 = rand_chems[x]
#     int2 = rand_dis[x]
#     chem, chemvec = df_chems.loc[int1, 'ID'], df_chems.loc[int1, 'Vector']
#     dis, disvec, mesh = df_dis.loc[int2, 'ID'], df_dis.loc[int2, 'Vector'], df_dis.loc[int2, 'MESH']
#     sider_mod = sider_mod.append({'ChemicalID':chem, 'MESHid':dis, 'ChemGoVec':chemvec, 'DisGoVec': disvec, 'Correlation':0, 'MESH': mesh}, ignore_index=True)

# print('Shape after adding controls: ', sider_mod.shape)
# # Drop any duplicates (removes known correlated pairs accidentally generated as uncorrelated)
# sider_mod = sider_mod.drop_duplicates(subset=['ChemicalID', 'MESHid'], keep=False)
# print('Shape after dropping duplicates: ', sider_mod.shape)

# # and re add empty vecs
# empty_vec = [0] * 200

# sider_mod['disPhenVecMP'] = sider_mod.DOID.map(lambda x: dis_mpVec.get(x, empty_vec))
# sider_mod['disPhenVecHP'] = sider_mod.DOID.map(lambda x: dis_hpVec.get(x, empty_vec))
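Two details of the commented-out sampling above are worth noting: `drop_duplicates(..., keep=False)` discards every copy of a duplicated pair, so a known correlated pair that collides with a sampled control is removed too, and `np.random.choice(len(df) - 1, ...)` never selects the last row. A sketch of an alternative that rejects known positives while sampling; `sample_negative_pairs` and the toy IDs are hypothetical.

```python
import random

def sample_negative_pairs(chems, diseases, positives, n_pairs, seed=1606):
    """Draw up to n_pairs unique (chem, disease) pairs that are not known positives."""
    rng = random.Random(seed)
    positives = set(positives)
    # Upper bound on how many distinct negatives exist, so the loop terminates
    max_available = len(set(chems)) * len(set(diseases)) - len(positives)
    target = min(n_pairs, max(max_available, 0))
    negatives = set()
    while len(negatives) < target:
        pair = (rng.choice(chems), rng.choice(diseases))
        if pair not in positives:   # positives stay untouched in the dataset
            negatives.add(pair)
    return sorted(negatives)

chems = ["C1", "C2"]
diseases = ["D1", "D2", "D3"]
known = {("C1", "D1"), ("C2", "D3")}
controls = sample_negative_pairs(chems, diseases, known, n_pairs=10)
```

Rejection sampling like this slows down when nearly every possible pair is requested, which is unlikely at the scale of this validation set.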

In [182]:
# # Version for phen and gofunc vecs
# # For Keras, need to turn inputs into numpy arrays instead of pandas df
# # First create single np array of all vecs... not pretty:
# Dvecs = pd.DataFrame(sider_mod.DisGoVec.values.tolist(), index= sider_mod.index)
# Cvecs = pd.DataFrame(sider_mod.ChemGoVec.values.tolist(), index= sider_mod.index)
# gofuncs = Dvecs.merge(Cvecs, how='outer', left_index=True, right_index=True)

# DMPvecs = pd.DataFrame(sider_mod.disPhenVecHP.values.tolist(), index= sider_mod.index)
# DHPvecs = pd.DataFrame(sider_mod.disPhenVecMP.values.tolist(), index= sider_mod.index)
# disPvecs = DMPvecs.merge(DHPvecs, how='outer', left_index=True, right_index=True)

# all_X = disPvecs.merge(gofuncs, how='outer', left_index=True, right_index=True)

# all_X = np.array(all_X)

In [183]:
# # Version for HINO, DO, CHEBI, disphen and gofunc vecs
# # For Keras, need to turn inputs into numpy arrays instead of pandas df
# # First create single np array of all vecs... not pretty:
# Dvecs = pd.DataFrame(sider_mod.DisGoVec.values.tolist(), index= sider_mod.index)
# Cvecs = pd.DataFrame(sider_mod.ChemGoVec.values.tolist(), index= sider_mod.index)
# gofuncs = Dvecs.merge(Cvecs, how='outer', left_index=True, right_index=True)

# DMPvecs = pd.DataFrame(sider_mod.disPhenVecHP.values.tolist(), index= sider_mod.index)
# DHPvecs = pd.DataFrame(sider_mod.disPhenVecMP.values.tolist(), index= sider_mod.index)
# disPvecs = DMPvecs.merge(DHPvecs, how='outer', left_index=True, right_index=True)

# all_X = disPvecs.merge(gofuncs, how='outer', left_index=True, right_index=True)

# CHEBvecs = pd.DataFrame(sider_mod.CHEBIvec.values.tolist(), index = sider_mod.index)
# all_X = CHEBvecs.merge(all_X, how='outer', left_index=True, right_index=True)

# DOvecs = pd.DataFrame(sider_mod.DOvec.values.tolist(), index = sider_mod.index)
# all_X = DOvecs.merge(all_X, how='outer', left_index=True, right_index=True)

# dHINOvecs = pd.DataFrame(sider_mod.dis_HINOvec.values.tolist(), index=sider_mod.index)
# cHINOvecs = pd.DataFrame(sider_mod.chem_HINOvec.values.tolist(), index=sider_mod.index)
# hinovecs = cHINOvecs.merge(dHINOvecs, how='outer', left_index=True, right_index=True)
# all_X = all_X.merge(hinovecs, how='outer', left_index=True, right_index=True)

# all_X = np.array(all_X)

In [198]:
# # Version for PRO, HINO, DO, CHEBI, disphen and gofunc vecs
# # For Keras, need to turn inputs into numpy arrays instead of pandas df
# # First create single np array of all vecs... not pretty:
# Dvecs = pd.DataFrame(sider_mod.DisGoVec.values.tolist(), index= sider_mod.index)
# Cvecs = pd.DataFrame(sider_mod.ChemGoVec.values.tolist(), index= sider_mod.index)
# gofuncs = Dvecs.merge(Cvecs, how='outer', left_index=True, right_index=True)

# DMPvecs = pd.DataFrame(sider_mod.disPhenVecHP.values.tolist(), index= sider_mod.index)
# DHPvecs = pd.DataFrame(sider_mod.disPhenVecMP.values.tolist(), index= sider_mod.index)
# disPvecs = DMPvecs.merge(DHPvecs, how='outer', left_index=True, right_index=True)

# all_X = disPvecs.merge(gofuncs, how='outer', left_index=True, right_index=True)

# CHEBvecs = pd.DataFrame(sider_mod.CHEBIvec.values.tolist(), index = sider_mod.index)
# all_X = CHEBvecs.merge(all_X, how='outer', left_index=True, right_index=True)

# DOvecs = pd.DataFrame(sider_mod.DOvec.values.tolist(), index = sider_mod.index)
# all_X = DOvecs.merge(all_X, how='outer', left_index=True, right_index=True)

# dHINOvecs = pd.DataFrame(sider_mod.dis_HINOvec.values.tolist(), index=sider_mod.index)
# cHINOvecs = pd.DataFrame(sider_mod.chem_HINOvec.values.tolist(), index=sider_mod.index)
# hinovecs = cHINOvecs.merge(dHINOvecs, how='outer', left_index=True, right_index=True)
# all_X = all_X.merge(hinovecs, how='outer', left_index=True, right_index=True)

# dPROvecs = pd.DataFrame(sider_mod.dis_PROvec.values.tolist(), index=sider_mod.index)
# cPROvecs = pd.DataFrame(sider_mod.chem_PROvec.values.tolist(), index=sider_mod.index)
# PROvecs = cPROvecs.merge(dPROvecs, how='outer', left_index=True, right_index=True)
# all_X = all_X.merge(PROvecs, how='outer', left_index=True, right_index=True)

# all_X = np.array(all_X)

In [199]:
# # Version for gofunc vecs and CHEBI
# # For Keras, need to turn inputs into numpy arrays instead of pandas df
# # First create single np array of all vecs... not pretty:
# Dvecs = pd.DataFrame(sider_mod.DisGoVec.values.tolist(), index= sider_mod.index)
# Cvecs = pd.DataFrame(sider_mod.ChemGoVec.values.tolist(), index= sider_mod.index)
# all_X = Dvecs.merge(Cvecs, how='outer', left_index=True, right_index=True)

# CHEBvecs = pd.DataFrame(sider_mod.CHEBIvec.values.tolist(), index = sider_mod.index)
# all_X = CHEBvecs.merge(all_X, how='outer', left_index=True, right_index=True)

# all_X = np.array(all_X)

In [200]:
# # Version for gofunc vecs
# # For Keras, need to turn inputs into numpy arrays instead of pandas df
# # First create single np array of all vecs... not pretty:
# Dvecs = pd.DataFrame(sider_mod.DisGoVec.values.tolist(), index= sider_mod.index)
# Cvecs = pd.DataFrame(sider_mod.ChemGoVec.values.tolist(), index= sider_mod.index)
# all_X = Dvecs.merge(Cvecs, how='outer', left_index=True, right_index=True)

# all_X = np.array(all_X)

In [202]:
# Version for gofunc vecs and HINO
# For Keras, need to turn inputs into numpy arrays instead of pandas df
# First create single np array of all vecs... not pretty:
Dvecs = pd.DataFrame(sider_mod.DisGoVec.values.tolist(), index= sider_mod.index)
Cvecs = pd.DataFrame(sider_mod.ChemGoVec.values.tolist(), index= sider_mod.index)
all_X = Dvecs.merge(Cvecs, how='outer', left_index=True, right_index=True)

dHINOvecs = pd.DataFrame(sider_mod.dis_HINOvec.values.tolist(), index=sider_mod.index)
cHINOvecs = pd.DataFrame(sider_mod.chem_HINOvec.values.tolist(), index=sider_mod.index)
hinovecs = cHINOvecs.merge(dHINOvecs, how='outer', left_index=True, right_index=True)
all_X = all_X.merge(hinovecs, how='outer', left_index=True, right_index=True)

all_X = np.array(all_X)
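The merge chain above joins frames on an identical index, so (assuming the index is unique) it amounts to placing the blocks side by side. A sketch of the same step with `np.hstack`; `stack_vector_columns` is a hypothetical helper, and the toy dict stands in for the DataFrame, since any mapping from column name to a sequence of vectors will do.

```python
import numpy as np

def stack_vector_columns(table, column_order):
    """Concatenate list-valued columns side by side into one 2-D float array."""
    blocks = [np.asarray(list(table[name]), dtype=float) for name in column_order]
    return np.hstack(blocks)

toy = {
    "DisGoVec": [[1.0, 2.0], [3.0, 4.0]],
    "ChemGoVec": [[5.0, 6.0], [7.0, 8.0]],
}
X = stack_vector_columns(toy, ["DisGoVec", "ChemGoVec"])
```

The column order has to match the order used at training time. The cell above produces DisGoVec, ChemGoVec, chem_HINOvec, dis_HINOvec; whether that matches the training notebook is not confirmed here.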

In [203]:
# Now create np array of the y output
all_y = np.array(sider_mod.Correlation)

In [204]:
print('y shape: ', all_y.shape)
print('X shape: ', all_X.shape)

y shape:  (1657,)
X shape:  (1657, 2000)
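A width of 2000 corresponds to ten 200-dimensional vectors per row, whereas the loaded model's `Flatten` layer expects 800, i.e. four. A small sketch of that arithmetic, assuming every embedding has 200 entries as the `[0] * 200` placeholder suggests; `feature_width` is a hypothetical helper.

```python
EMBED_DIM = 200  # assumption: every embedding has 200 entries

def feature_width(columns, dim=EMBED_DIM):
    """Width of the concatenated feature matrix for the given vector columns."""
    return len(columns) * dim

go_hino = ["DisGoVec", "ChemGoVec", "chem_HINOvec", "dis_HINOvec"]
all_ten = go_hino + ["disPhenVecMP", "disPhenVecHP", "CHEBIvec", "DOvec",
                     "dis_PROvec", "chem_PROvec"]
widths = {"go_hino": feature_width(go_hino), "all_ten": feature_width(all_ten)}
```

In the notebook itself, a guard such as `assert all_X.shape[1] == model.input_shape[-1]` before calling `evaluate` would surface a mismatch like this one earlier.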


In [205]:
# sider_mod[['ChemicalID', 'MESHid', 'Correlation', 'ChemGoVec', 'DisGoVec']]

In [206]:
# Now I have my validation db (though small...), so run the NN and get predictions and accuracy

In [207]:
# 2. Compile the model (give it a loss function, an optimiser and an evaluation metric)
model.compile(optimizer='adam', # how weights are updated from the loss; the string form works in TF 1.x and 2.x
              loss='binary_crossentropy', # quantity minimised during training
              metrics=['accuracy']) # metric reported by model.evaluate

In [208]:
# Accuracy
test_loss, test_acc = model.evaluate(all_X, all_y)
print('Test accuracy:', test_acc)

ValueError: Error when checking input: expected flatten_3_input to have shape (800,) but got array with shape (2000,)

In [209]:
# Get actual predictions for the validation set
predictions = model.predict(all_X)
# Threshold the sigmoid output at 0.5 to get hard 0/1 labels
rounded_predictions = [int(x[0] > 0.5) for x in predictions]

ValueError: Error when checking input: expected flatten_3_input to have shape (800,) but got array with shape (2000,)

In [196]:
# ROC AUC
print('ROC AUC: ', roc_auc_score(all_y, predictions))
# .52...
# Right, options:
# (i) The model doesn't generalise, and the thesis isn't looking great
# (ii) WRONG -> Loading the model isn't working properly - download the data from here and run it in opa-nn
# (iii) Sample size is too small to detect a pattern, get a bigger validation db... seems very unlikely
# (iv) WRONG (probably) -> Good chance that a significant number of the controls are actually correlated; import
# controls from the opa-nn training database? Is that cheating though? Actually, given that the controls are taken
# from chems and diseases that have approval, it's probably safe to assume they are uncorrelated, as otherwise
# they would be included as correlated
# (v) Some of these side effects occur in fewer than 1% of patients... seems like quite a bullseye! Try another
# dataset
# (vi) CHASE is a genius and highlighted that opa2vec will project the same data in different ways across
# different training runs, so I need to train the validation vecs along with the original vecs

ROC AUC:  0.6148238295283008
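For reading the number above: ROC AUC equals the probability that a randomly chosen correlated pair receives a higher score than a randomly chosen control, so 0.5 is chance level. A minimal pure-Python sketch of that definition (not scikit-learn's implementation, which works from the ROC curve but gives the same value):

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. O(n_pos * n_neg), fine for small validation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

In the toy example, three of the four positive/negative pairs are ranked correctly.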


In [197]:
print('Chems :', sider_mod.ChemicalID.nunique())
print('Dis :', sider_mod.MESH.nunique())
print('chem:dis obs: ', sider_mod.shape[0])
print('of which are uncorrelated: ', sider_mod[sider_mod.Correlation == 0].shape[0])

Chems : 62
Dis : 151
chem:dis obs:  1657
of which are uncorrelated:  776
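Option (vi) in the ROC AUC cell above can be illustrated with a toy model. The sketch below treats a second embedding run as the first run under a random orthogonal rotation. That is a simplifying assumption, not a description of what opa2vec actually does, but it shows the core problem: similarities within a run are preserved, while the coordinates of the same entity differ between runs, so a network trained on one run's coordinates cannot read the other's.

```python
import numpy as np

rng = np.random.default_rng(1606)
dim = 8

# Two toy entities embedded in a first training run
a_run1 = rng.normal(size=dim)
b_run1 = rng.normal(size=dim)

# Second run modelled as the same geometry under a random orthogonal rotation
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
a_run2, b_run2 = q @ a_run1, q @ b_run1

def cosine(u, v):
    """Cosine similarity between two 1-D arrays."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

within_run1 = cosine(a_run1, b_run1)
within_run2 = cosine(a_run2, b_run2)   # preserved by the rotation
across_runs = cosine(a_run1, a_run2)   # same entity, different runs
```

Comparing `within_run1` with `within_run2`, and then either with `across_runs`, makes the point: only vectors produced in the same training run share a coordinate system.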
