# Notebook: Use NN to predict disease from chemicals using Opa2Vec vectors
<b> Author: </b> Ian Coleman <br>
<b> Purpose: </b> Take the vectors created in the opa2vec notebook, which built them from the GO functions of
    chemicals and diseases, and train a NN to predict diseases from the chemical
    vectors.

In [1]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### 1. Import Vectors and Pre-Process them

In [2]:
# TODO: adapt this to account for the fact that AllVectorResults.lst now contains disease vectors as well as
# chemical vectors. The IDs look very similar, but disease IDs are always 8 characters long and chemical IDs
# are 7 or 10 characters long (verified).

In [4]:
# Import vec file
with open('../../opa2vec/AllVectorResults.lst', 'r') as file:
    text = file.read()

In [5]:
# Strip and split it into list of lists [chem, vec]
text = text.replace('\n', '')
text = text.split(']')
text = [item.strip().split(' [') for item in text]

In [6]:
# Turn it into a data frame
df = pd.DataFrame(text)
df.columns = ['ID', 'Vector']
# df.head()

In [7]:
# Clean
df = df.dropna()
df['Vector'] = df.Vector.map(lambda x: ','.join(x.split()))  # collapse each run of whitespace into one comma

In [8]:
# Turn vector column into a list
df['Vector'] = df.Vector.map(lambda x: x.split(','))

# df = df['Vector'].str.split(',', expand=True)
# df = df.join(vec_split, lsuffix='_df', rsuffix='_vec_split')
# df['chemVec'] = np.nan
# for index in range(df.shape[0]):
#     df['chemVec'][index] = df.iloc[index, 2:].tolist()
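
As a compact alternative to the string clean-up in the cells above, the sketch below parses the same `ID [v1 v2 ...]` layout with a regular expression and converts each vector to a NumPy float array. The sample text and the `parse_vectors` helper are hypothetical, and the layout is only inferred from the parsing code above, so treat this as an illustration rather than a drop-in replacement.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample in the layout the cells above assume: an ID followed by a
# bracketed, whitespace-separated vector that may wrap across lines.
sample = """C112297 [ 0.0277  0.0907
  -0.1135]
C1853833 [9.70e-03 8.93e-02 8.91e-02]"""


def parse_vectors(text):
    """Return a DataFrame with an ID column and a Vector column of float arrays."""
    # Each match captures the ID and everything between the square brackets.
    pairs = re.findall(r"(\S+)\s*\[([^\]]*)\]", text)
    ids = [entity_id for entity_id, _ in pairs]
    # str.split() with no argument collapses any run of whitespace, newlines included.
    vectors = [np.array(body.split(), dtype=float) for _, body in pairs]
    return pd.DataFrame({"ID": ids, "Vector": vectors})


parsed = parse_vectors(sample)
print(parsed)
```

Storing floats instead of strings at this point means the vectors can be stacked into a training matrix later without another conversion pass.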

In [9]:
# df.loc[:,0].head()
# BCE binary classification --> The loss function recommended by Jun
# sigmoid output
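
To make the note above concrete, here is a minimal NumPy sketch of what binary cross-entropy (BCE) on a sigmoid output computes for yes/no association labels. The `sigmoid` and `binary_cross_entropy` helpers and the sample logits are hypothetical illustrations, not the model used in this notebook. In Keras the equivalent choice would typically be a final `Dense(1, activation='sigmoid')` layer compiled with `loss='binary_crossentropy'`.

```python
import numpy as np


def sigmoid(z):
    """Squash raw network outputs (logits) into probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean BCE between 0/1 labels and predicted probabilities."""
    # Clip so that log(0) can never occur.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_sample = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(per_sample))


# Hypothetical logits for four chemical-disease pairs and their true labels.
logits = np.array([2.0, -1.0, 0.0, 3.0])
labels = np.array([1.0, 0.0, 1.0, 0.0])

probs = sigmoid(logits)
loss = binary_cross_entropy(labels, probs)
print(probs.round(3), round(loss, 3))
```

The loss grows when a confident prediction is wrong (the last pair here), which is the behaviour wanted for a binary association label.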

In [18]:
# Now we have one row per ID with its vector and the is_chem flag; show the last five rows
df[-5:]

Unnamed: 0,ID,Vector,is_chem
1668,C1853833,"[9.70322266e-03, 8.92579332e-02, 8.90747411e-0...",False
1669,C1853247,"[0.02768757, 0.09068831, 0.02479883, -0.113485...",False
1670,C1847013,"[0.03376195, 0.12275581, 0.02225231, -0.108791...",False
1671,C1836027,"[1.57302748e-02, 1.09188251e-01, 5.02204485e-0...",False
1672,C1840390,"[0.03446435, 0.11030531, 0.04322119, -0.119213...",False


### 2. Create DF for NN
From the ID-Vector DF we will now create a DF that pairs each chem with each disease, with the following columns:
ChemID, DisID, ChemVec, DisVec, PositiveAssociationExists (binary)
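
One way to build that table is a cross join of the chemical rows with the disease rows, followed by a left merge against the known positive associations. The sketch below assumes pandas 1.2 or later (for `how="cross"`), uses small made-up frames (`vecs`, `pos`) in place of the real data, and assumes the disease IDs in both frames share one vocabulary, which is not yet true of the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the ID-vector frame and the positive-association list.
vecs = pd.DataFrame({
    "ID": ["C112297", "C013567", "C1853833", "C1836027"],
    "Vector": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    "is_chem": [True, True, False, False],
})
pos = pd.DataFrame({"ChemID": ["C112297"], "DisID": ["C1853833"]})

chems = vecs.loc[vecs.is_chem, ["ID", "Vector"]].rename(
    columns={"ID": "ChemID", "Vector": "ChemVec"})
dis = vecs.loc[~vecs.is_chem, ["ID", "Vector"]].rename(
    columns={"ID": "DisID", "Vector": "DisVec"})

# Every chemical paired with every disease.
pairs = chems.merge(dis, how="cross")

# Flag pairs that appear in the positive-association list: 1 if present, else 0.
pairs = pairs.merge(pos.drop_duplicates(), on=["ChemID", "DisID"],
                    how="left", indicator=True)
pairs["PositiveAssociationExists"] = (pairs.pop("_merge") == "both").astype(int)

pairs = pairs[["ChemID", "DisID", "ChemVec", "DisVec", "PositiveAssociationExists"]]
print(pairs)
```

The cross join grows as (number of chems) x (number of diseases), so on the full data it may be worth checking the resulting row count before materialising it.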

I'm running into a problem here: DisGeNET uses UMLS IDs for diseases, whereas CTD uses MeSH IDs.

I need to do one of the following:
- convert between MeSH and UMLS (I'm waiting for my UMLS membership and can't see how to do it without it)
- recreate the vectors using only CTD diseases
- create a new chem_dis list from DisGeNET (no such list exists)

In [11]:
# Step 1: Import file of proven chem-dis positive associations (created in ctd-to-nt notebook)
chem_dis = pd.read_csv('../ctd-to-nt/chem-dis-pos-assocs.csv')
chem_dis.head()

Unnamed: 0,ChemicalID,DiseaseID
0,C112297,MESH:D006948
1,C112297,MESH:D012640
2,C425777,MESH:D006948
3,C013567,MESH:D006333
4,C418863,MESH:D013262


In [None]:
# Step 2: Iterate through each chem and create a line for it with each dis

In [12]:
# First create is_chem col in df to differentiate between chem and disease
df['is_chem'] = df.ID.map(lambda x: len(x) != 8) # as len of disease ID is always 8
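
Because the split rests entirely on ID length, it is worth checking the rule against an independent list of chemical IDs, such as the `ChemicalID` column of the CTD association file. The sketch below uses a small hypothetical `known_chems` series in its place and simply confirms that no known chemical ID is 8 characters long.

```python
import pandas as pd

# Hypothetical stand-in for the ChemicalID column of chem_dis.
known_chems = pd.Series(["C112297", "C425777", "D000082", "C000611729"],
                        name="ChemicalID")

# Count how often each ID length occurs among the known chemical IDs.
length_counts = known_chems.str.len().value_counts().sort_index()
print(length_counts)

# The is_chem rule is only safe if no known chemical ID has length 8.
clashes = known_chems[known_chems.str.len() == 8]
print("chemical IDs that would be misread as diseases:", list(clashes))
```

An empty `clashes` list supports the rule for chemicals. It does not by itself prove that every 8-character ID is a disease.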

In [15]:
# We only want the chems and diseases that we have vectors for
df.shape

(1673, 3)

In [None]:
# Reshape chem_dis to only keep lines where both chem and dis have a vec
# NOTE: the map below is an identity (no-op) for now
chem_dis['DiseaseID'] = chem_dis.DiseaseID.map(lambda x: x)

In [16]:
# So iterate through vecs and create a line for it if there is a rel with a dis that has a vec
chem_dis.shape

(62015, 2)

In [None]:
# Step 3: For each line check the chem-dis reference df to see if positive rel exists, if so encode 1 else 0

In [None]:
# # Import disease list (created in opa2vec notebook that created vectors)
# diseases = pd.read_csv('diseases.lst', header=None, skiprows=1) # Skipping first row as will be nan
# diseases.shape # 1264 diseases...

In [None]:
# df.head()

In [None]:
# diseases.head()

In [192]:
# Import directly evidenced chemical-disease positive relationships from CTD
chem_dis = pd.read_csv('../ctd-to-nt/chem-dis-pos-assocs.csv')
chem_dis.head()

Unnamed: 0,ChemicalID,DiseaseID
0,C112297,MESH:D006948
1,C112297,MESH:D012640
2,C425777,MESH:D006948
3,C013567,MESH:D006333
4,C418863,MESH:D013262


In [None]:
df.head()

In [None]:
## Get rid of rows from chem_dis whose chems aren't in df
print(chem_dis.shape)
chem_ids = df.ID.unique()  # df keeps its identifiers in the ID column
chem_dis = chem_dis[chem_dis.ChemicalID.isin(chem_ids)]  # vectorised membership test
chem_dis.shape

In [None]:
print('Number chems: ', len(chem_dis.ChemicalID.unique()))
print('Number diseases: ', len(chem_dis.DiseaseID.unique()))

In [None]:
# Create column for each disease, nan columns
for name in chem_dis.DiseaseID.unique():
    df[name] = np.nan

In [None]:
df.head()

In [None]:
# For each chem-disease relationship set the cell to 1; cells with no relationship stay NaN for now
def check_assoc(row):
    # Diseases positively associated with this row's chemical
    for dis_id in chem_dis.loc[chem_dis.ChemicalID == row.ID, 'DiseaseID']:
        # row.name is this row's index label in df, so the write lands on the right row
        df.loc[row.name, dis_id] = 1


# TODO: convert the remaining np.nan cells to 0 for each disease column in df


In [None]:
chem_dis.head()

In [None]:
df.apply(check_assoc, axis=1)
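
Row-wise `apply` combined with `df.loc` writes can get slow on tens of thousands of associations. As a hedged alternative, the sketch below builds the same chemical-by-disease 0/1 table in one step with `pd.crosstab` and joins it back onto the vector frame. The small frames (`vec_df`, `assoc`) are made up, and the vector frame is assumed to key chemicals by an `ID` column.

```python
import pandas as pd

# Hypothetical stand-ins for the vector frame and the association list.
vec_df = pd.DataFrame({"ID": ["C112297", "C425777", "C013567"]})
assoc = pd.DataFrame({
    "ChemicalID": ["C112297", "C112297", "C425777"],
    "DiseaseID": ["MESH:D006948", "MESH:D012640", "MESH:D006948"],
})

# One row per chemical, one column per disease; clip turns counts into 0/1 flags.
labels = pd.crosstab(assoc.ChemicalID, assoc.DiseaseID).clip(upper=1)

# Left-join onto the vector frame; chemicals with no association get 0 everywhere.
labelled = vec_df.join(labels, on="ID")
labelled[labels.columns] = labelled[labels.columns].fillna(0).astype(int)
print(labelled)
```

This also takes care of the NaN-to-0 conversion in the same pass, since unmatched cells are filled before the cast to integers.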

In [None]:
df.head() 
df["MESH:D048629"].unique()

In [None]:
df.shape