This notebook takes a list of matrices as input and generates the .dat files needed by NCMF. This includes splitting the target matrix into train and test files.

Input given to this notebook:
1. List of matrices used for learning embeddings
2. Matrix used for the link prediction task. It is divided into train and test sets: the embeddings are learnt from the train set, and link prediction is done by a classifier trained on the train set and evaluated on the test set. The matrices must be saved as csv files (the loading cells below also accept pkl and npy files).
3. entity.csv containing all the entity names in the order E0, E1, ... as they appear in the matrices. If nodes of one type can carry different labels (e.g., node type is disease, node label is Asthma), this information must be present in a column called "Entity labels" in this csv. If this column is absent, node_label and node_type are considered to be the same.

Output from notebook:
1. sampled#_node.dat
2. sampled#_link.dat
3. sampled#_link.dat.test
4. sampled#_label.dat
5. sampled#_label.dat.test
6. sampled#_meta.dat
7. sampled#_info.dat

#### User inputs (Values to be modified as per the dataset)

In [1]:
# Fill out the following values as per the dataset
sample_id = 1 # update sample ID number
data_folder = "../../datasets/NCMF/ESP/" # path to the folder with the matrices
all_matrices = ["reduced_X0_train.npy", "X1.npy", "X2.npy"] # provide a comma separated list of file names; each matrix must be mapped to a separate file
# These are used for learning the embeddings
test_matrix = "reduced_X0_test.npy" # provide the test matrix that makes up link.dat.test
test_indices_file = "reduced_drug_drug_se_test_idx.csv"
entity_mapping = {'X0': ['E0', 'E1'],
                 'X1': ['E0', 'E2'],
                 'X2': ['E2', 'E2']}
entity_name_mapping = {'E0': "drug", "E1": "drug_sideeffect", "E2": "protein"} # Modify this as per the dataset for all entities
target_matrix_index = 0 # Set this to match the index of the matrix file as per the list all_matrices; this matrix is divided into train and test
rng = 3 # random number seed; edit this to create different samples
threshold = 0 # change this value based on the test data to decide between positive and negative links

#### Output files generated

In [2]:
# Output files
sampled_node_file = data_folder + f'sampled{sample_id}_node.dat'
sampled_link_file = data_folder + f'sampled{sample_id}_link.dat'
sampled_link_test_file = data_folder + f'sampled{sample_id}_link.dat.test'
sampled_label_file = data_folder + f'sampled{sample_id}_label.dat'
sampled_label_test_file = data_folder + f'sampled{sample_id}_label.dat.test'
sampled_meta_file = data_folder + f'sampled{sample_id}_meta.dat'
sampled_info_file = data_folder + f'sampled{sample_id}_info.dat'

#### Data Processing
The following steps are carried out in the section below:
1. Find number of entities in all the matrices and assign unique IDs to each entity
2. Flatten the matrix into a list of triplets
3. From the target_matrix, sample 20% for use as test data
4. Create node.dat, link.dat, link.dat.test, label.dat, label.dat.test, meta.dat for the dataset
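
Steps 1 and 2 can be illustrated on a toy matrix. The following is a minimal sketch, not the notebook's own implementation: `matrix_to_triplets` and the `toy_*` names are hypothetical, and it uses `np.repeat` and `np.tile` where the cells below use list multiplication. Each entity type occupies a contiguous block of global IDs, so a cell at (row, column) becomes a triplet with both positions shifted by the offset of their entity type.

```python
import numpy as np
import pandas as pd

def matrix_to_triplets(matrix, row_offset, col_offset, link_type):
    """Flatten a dense matrix into (left, right, value, link_type) rows."""
    n_rows, n_cols = matrix.shape
    # Row IDs repeat once per column; column IDs cycle once per row (row-major order).
    left = np.repeat(np.arange(n_rows), n_cols) + row_offset
    right = np.tile(np.arange(n_cols), n_rows) + col_offset
    return pd.DataFrame({
        "left": left,
        "right": right,
        "value": matrix.flatten(),
        "link_type": link_type,
    })

# Toy example: 2 entities of type E0 (global IDs 0-1) and 3 of type E1 (global IDs 2-4).
toy_X0 = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0]])
toy_triplets = matrix_to_triplets(toy_X0, row_offset=0, col_offset=2, link_type=0)
toy_positive = toy_triplets[toy_triplets["value"] != 0]
print(toy_positive)
```

Only the nonzero rows in `toy_positive` would end up in link.dat.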

In [3]:
import pandas as pd
import numpy as np
import random
random.seed(rng)

In [4]:
for i in range(len(all_matrices)):
    if "csv" in all_matrices[i]:
        exec(f"X{i} = pd.read_csv(data_folder + all_matrices[i], header = None).to_numpy()")
    elif "pkl" in all_matrices[i]:
        exec(f"X{i} = pd.read_pickle(data_folder + all_matrices[i])")
    elif "npy" in all_matrices[i]:
        exec(f"X{i} = np.load(data_folder + all_matrices[i])")
# This block assigns each of the matrices to variables X0, X1, X2,...

In [5]:
for i in range(len(all_matrices)):
    print(f"Shape of X{i}", end = " ")
    exec(f"print(X{i}.shape)")

Shape of X0 (645, 425883)
Shape of X1 (645, 22583)
Shape of X2 (22583, 22583)
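
Matrices that share an entity type must agree on that entity's dimension, otherwise the ID assignment in the next cells would be inconsistent. A minimal sketch of such a check, using the shapes printed above; `check_entity_dimensions` and the `example_*` names are hypothetical helpers and not part of the notebook or of NCMF.

```python
def check_entity_dimensions(shapes, mapping):
    """Return {entity: size}, raising if two matrices disagree on a size."""
    sizes = {}
    for matrix_name, entities in mapping.items():
        for entity, dim in zip(entities, shapes[matrix_name]):
            # setdefault records the first size seen and returns it on later visits.
            if sizes.setdefault(entity, dim) != dim:
                raise ValueError(
                    f"{matrix_name}: {entity} has size {dim}, expected {sizes[entity]}"
                )
    return sizes

# Shapes and entity mapping copied from the cells above.
example_shapes = {"X0": (645, 425883), "X1": (645, 22583), "X2": (22583, 22583)}
example_mapping = {"X0": ["E0", "E1"], "X1": ["E0", "E2"], "X2": ["E2", "E2"]}
entity_sizes = check_entity_dimensions(example_shapes, example_mapping)
print(entity_sizes, sum(entity_sizes.values()))
```

The sum of the per-entity sizes is the total number of nodes in the graph.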


In [6]:
# Get entity IDs - uses entity_mapping to get the total number of unique entities
total_entities = 0
matrices_seen = []
for k,v in entity_mapping.items():
    if v[0] not in matrices_seen:
        matrices_seen.append(v[0])
        total_entities += eval(f"{k}.shape[0]")
        exec(f"{v[0]}_size = {k}.shape[0]")
    if v[1] not in matrices_seen:
        matrices_seen.append(v[1])
        total_entities += eval(f"{k}.shape[1]")
        exec(f"{v[1]}_size = {k}.shape[1]")
print(total_entities)


449111


In [7]:
print(E0_size)

645


In [8]:
full_df = pd.DataFrame(columns = ["left", "right", "value", "link_type"])

In [9]:
for k,v in entity_mapping.items(): # iterate over all matrices
    print(f"Considering {k}")
    link_type = int(k[1:])
    total_left_ents_so_far = 0
    for m in matrices_seen:
        if v[0] == m:
            break
        else:
            total_left_ents_so_far += eval(f"{m}_size")
    
    total_right_ents_so_far = 0
    for m in matrices_seen:
        if v[1] == m:
            break
        else:
            total_right_ents_so_far += eval(f"{m}_size")
            
#     for i in range(eval(f"{v[0]}_size")):
#         for j in range(eval(f"{v[1]}_size")):
#             left_id = i + total_left_ents_so_far
#             right_id = j + total_right_ents_so_far
#             value = eval(f"{k}[i][j]")
#             print(f"left = {left_id}, right = {right_id}, value = {value}")
#             full_df = full_df.append({"left": left_id, "right": right_id, "value": value, "link_type": link_type}, ignore_index = True)
    temp_df = pd.DataFrame(columns = ["left", "right", "value", "link_type"])
    temp_df["left"] = sorted(list(range(total_left_ents_so_far, total_left_ents_so_far + eval(f"{v[0]}_size"))) * eval(f"{v[1]}_size"))
    temp_df["right"] = list(range(total_right_ents_so_far, total_right_ents_so_far + eval(f"{v[1]}_size"))) * eval(f"{v[0]}_size")
    temp_df["value"] = eval(f"{k}").flatten()
    temp_df["link_type"] = [link_type] * len(temp_df["left"])
    full_df = pd.concat([full_df, temp_df], axis = 0, ignore_index = True)

Considering X0
Considering X1
Considering X2


In [10]:
full_df.shape

(799252459, 4)

In [11]:
full_df.head()

Unnamed: 0,left,right,value,link_type
0,0,645,0.0,0
1,0,646,0.0,0
2,0,647,0.0,0
3,0,648,1.0,0
4,0,649,0.0,0


In [12]:
full_df.tail()

Unnamed: 0,left,right,value,link_type
799252454,449110,449106,0.0,2
799252455,449110,449107,0.0,2
799252456,449110,449108,0.0,2
799252457,449110,449109,0.0,2
799252458,449110,449110,0.0,2


In [13]:
full_df.iloc[8]

left           0
right        653
value          0
link_type      0
Name: 8, dtype: object

#### Link file generation

The link.dat.test file is of the format left_node_id, right_node_id, link_value. It has both positive and negative links, and is used to train and test the SVM classifier in the downstream link prediction task.

The link.dat file is of the format left_node_id, right_node_id, link_type, link_value. It is made up of only positive links, i.e., links whose link_value is nonzero (1 in this dataset). This file is used to learn the embeddings of all the entities in the graph.
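
The difference between the two formats can be seen on a few toy links. This is a hedged sketch that writes to in-memory buffers instead of files; `toy_train`, `toy_test` and the buffer names are made up for illustration.

```python
import io
import pandas as pd

# Toy links from the training matrices (used for link.dat).
toy_train = pd.DataFrame({
    "left": [0, 0, 1],
    "right": [2, 3, 4],
    "value": [1.0, 0.0, 1.0],
    "link_type": [0, 0, 0],
})
# Toy held-out links from the test matrix (used for link.dat.test).
toy_test = pd.DataFrame({
    "left": [0, 1],
    "right": [4, 3],
    "value": [1.0, 0.0],
})

# link.dat: positive links only, columns left, right, link_type, value.
link_buf = io.StringIO()
positive = toy_train[toy_train["value"] != 0]
positive[["left", "right", "link_type", "value"]].to_csv(
    link_buf, sep="\t", header=False, index=False
)

# link.dat.test: positive and negative links, columns left, right, value.
test_buf = io.StringIO()
for row in toy_test.itertuples(index=False):
    test_buf.write(f"{int(row.left)}\t{int(row.right)}\t{int(round(row.value))}\n")

link_lines = link_buf.getvalue().splitlines()
test_lines = test_buf.getvalue().splitlines()
print(link_lines)
print(test_lines)
```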

In [14]:
np.random.seed(rng)
target_matrix_R_ent_count = eval(f"X{target_matrix_index}.shape[0]")
target_matrix_C_ent_count = eval(f"X{target_matrix_index}.shape[1]")

In [15]:
if "csv" in test_matrix:
    test_data_np = pd.read_csv(data_folder + test_matrix, header = None).values # header = None matches how the training matrices are read
    test_data_indices = pd.read_csv(data_folder + test_indices_file)
elif "pkl" in test_matrix:
    test_data_np = pd.read_pickle(data_folder +  test_matrix)
    test_data_indices = pd.DataFrame(pd.read_pickle(data_folder + test_indices_file))
elif "npy" in test_matrix:
    test_data_np = np.load(data_folder + test_matrix)
    test_data_indices = pd.read_csv(data_folder + test_indices_file)
test_data_indices.columns = ["indices"]

In [16]:
orig_test_data_size = test_data_indices.shape[0]

In [17]:
node1_offset = 0
v = entity_mapping[f"X{target_matrix_index}"]
for m in matrices_seen:
    if v[0] == m:
        break
    else:
        node1_offset += eval(f"{m}_size")

In [18]:
node1_offset

0

In [19]:
node2_offset = 0
v = entity_mapping[f"X{target_matrix_index}"]
for m in matrices_seen:
    if v[1] == m:
        break
    else:
        node2_offset += eval(f"{m}_size")

In [20]:
node2_offset

645
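
The test indices are flat, row-major positions in the target matrix. A minimal sketch of how one such index maps to a pair of global node IDs, assuming the offsets computed above; `flat_index_to_node_ids` and the toy values are hypothetical.

```python
import numpy as np

def flat_index_to_node_ids(flat_index, n_cols, row_offset, col_offset):
    """Map a row-major flat index to (left_node_id, right_node_id)."""
    row, col = divmod(flat_index, n_cols)
    return row + row_offset, col + col_offset

# Toy target matrix with 2 rows (E0, offset 0) and 3 columns (E1, offset 2).
toy_target = np.array([[0.0, 1.0, 0.0],
                       [1.0, 0.0, 1.0]])
toy_pairs = [
    flat_index_to_node_ids(i, toy_target.shape[1], row_offset=0, col_offset=2)
    for i in (1, 3, 5)
]
print(toy_pairs)
```

With the real data, the row offset is 0 and the column offset is 645, so every right-hand node ID in the test set starts at 645.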

In [21]:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# scaler.fit(full_df[full_df["link_type"] == target_matrix_index]["value"])

In [22]:
test_data = pd.DataFrame(columns = ["left", "right", "value", "link_type"])

In [23]:
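# Collect the rows in a list and build the DataFrame once; DataFrame.append was removed in pandas 2.0
test_rows = []
for index in test_data_indices["indices"]:
    r, c = index // target_matrix_C_ent_count, index % target_matrix_C_ent_count
    #value_added = scaler.transform([[test_data_np[r][c]]])[0][0]
    value_added = test_data_np[r][c]
    test_rows.append({"left": r + node1_offset, "right": c + node2_offset, "value": value_added, "link_type": target_matrix_index})
# dtype = float keeps the all-float columns shown in the outputs below
test_data = pd.DataFrame(test_rows, columns = ["left", "right", "value", "link_type"], dtype = float)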

In [24]:
test_data.head()

Unnamed: 0,left,right,value,link_type
0,143.0,645.0,1.0,0.0
1,427.0,646.0,1.0,0.0
2,210.0,647.0,1.0,0.0
3,445.0,648.0,1.0,0.0
4,282.0,649.0,1.0,0.0


In [25]:
test_data.tail()

Unnamed: 0,left,right,value,link_type
1828779,640.0,426525.0,0.0,0.0
1828780,432.0,221541.0,0.0,0.0
1828781,471.0,358298.0,0.0,0.0
1828782,380.0,426526.0,0.0,0.0
1828783,97.0,426527.0,0.0,0.0


In [26]:
test_data.groupby("value").count()

Unnamed: 0_level_0,left,right,link_type
value,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0.0,912110,912110,912110
1.0,916674,916674,916674


In [27]:
# test_data["value"] = test_data["value"].apply(lambda x: 1 if x>threshold else 0)
# test_data.head()

In [28]:
test_data.shape

(1828784, 4)

In [29]:
test_data_numpy = test_data.to_numpy()

In [30]:
print(f"Fraction of non zeros in test data = {np.count_nonzero(test_data_numpy[:,2])/test_data_numpy.shape[0]}")
print(f"Fraction of zeros in test data = {1-(np.count_nonzero(test_data_numpy[:,2])/test_data_numpy.shape[0])}")

Fraction of non zeros in test data = 0.501247823690496
Fraction of zeros in test data = 0.49875217630950397


In [31]:
# Create link.dat.test
# Same test file will be used by HIN2Vec and other algorithms too
number_of_pairs = 0
with open(sampled_link_test_file, "w+") as link_file_for_lp:
    for i in range(test_data_numpy.shape[0]):
        number_of_pairs += 1
        left, right, value = test_data_numpy[i][:3]
        link_file_for_lp.write(f"{int(left)}\t{int(right)}\t{int(round(value))}\n")

In [32]:
full_df.shape

(799252459, 4)

In [33]:
# Create link.dat file
emb_df = full_df[full_df["value"] != 0].copy() # remove negative links; copy() avoids SettingWithCopyWarning on the assignments below
# min_val = min(emb_df["value"])
# max_val = max(emb_df["value"])
#emb_df["value"] = scaler.transform(emb_df[["value"]]) # transform the value of the link to differentiate between positive and negative links
emb_df["left"] = emb_df["left"].astype(int)
emb_df["right"] = emb_df["right"].astype(int)
emb_df["link_type"] = emb_df["link_type"].astype(int)
emb_df = emb_df[["left", "right", "link_type", "value"]]
emb_df.head()

Unnamed: 0,left,right,link_type,value
3,0,648,0,1.0
9,0,654,0,1.0
11,0,656,0,1.0
16,0,661,0,1.0
34,0,679,0,1.0


In [34]:
# print(min(emb_df["value"]))
# print(max(emb_df["value"]))
print(emb_df.shape)

(9913104, 4)


In [35]:
emb_df = emb_df[emb_df["value"] != 0] # retain only positive links for representation learning

In [37]:
emb_df["link_type"].value_counts()

0    8581324
2    1296040
1      35740
Name: link_type, dtype: int64

In [36]:
emb_df.to_csv(sampled_link_file, sep="\t", header=False, index=False)

#### Label files generation

The label.dat and label.dat.test files share the same format: node_id, node_name, node_type, node_label. label.dat.test holds a random 20% sample of the nodes and label.dat holds the remaining 80%.
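
The label split can be sketched on a toy entity table. This is an illustrative example under assumed data: the `toy_*` names and the ten placeholder nodes are invented, while the `sample` and `drop` calls mirror the cells below.

```python
import pandas as pd

toy_entities = pd.DataFrame({
    "Entity Names": [f"node_{i}" for i in range(10)],
    "Node type": [0] * 4 + [1] * 6,
    "Node label": [0] * 4 + [1] * 6,
})

# Hold out 20% of the nodes for label.dat.test; the rest go to label.dat.
toy_test_labels = toy_entities.sample(frac=0.2, replace=False, random_state=3)
toy_train_labels = toy_entities.drop(toy_test_labels.index)

# The DataFrame index is written as the first column, so it serves as node_id.
label_test_text = toy_test_labels.to_csv(sep="\t", header=False)
label_train_text = toy_train_labels.to_csv(sep="\t", header=False)
print(label_test_text)
```

Because the node_id comes from the DataFrame index, the two files stay disjoint and together cover every node exactly once.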

In [88]:
entity_file = data_folder + "reduced_entity_df.csv"

In [89]:
entity_df = pd.read_csv(entity_file)

In [90]:
entity_df.head()

Unnamed: 0.1,Unnamed: 0,Entity Names,entity_ID
0,0,clobetasol,0
1,1,ibandronate,1
2,2,minoxidil,2
3,3,cefuroxime,3
4,4,chloramphenicol,4


In [91]:
entity_df.shape

(449111, 3)

In [92]:
matrices_seen # this is the list of entities in the dataset

['E0', 'E1', 'E2']

In [93]:
if "Entity labels" in entity_df.columns:
    isNodeLabelSeparate = True
    print("Node labels are different from node types")
else:
    isNodeLabelSeparate = False
    print("Node labels are the same as node types")
    

Node labels are the same as node types


In [94]:
count = 0
node_type = []
node_label = []
for entity in matrices_seen:
    print(f"Entity {entity}")
    entity_size = eval(f"{entity}_size")
    node_type_val = int(entity[1:])
    node_type.extend([node_type_val] * entity_size)
    if not isNodeLabelSeparate:
        node_label.extend([node_type_val] * entity_size)
    count += entity_size
    
print(len(node_label))
entity_df["Node label"] = node_label if not isNodeLabelSeparate else list(entity_df["Entity labels"]) # column name must match the check above
entity_df["Node type"] = node_type

Entity E0
Entity E1
Entity E2
449111


In [95]:
entity_df.head()

Unnamed: 0.1,Unnamed: 0,Entity Names,entity_ID,Node label,Node type
0,0,clobetasol,0,0,0
1,1,ibandronate,1,0,0
2,2,minoxidil,2,0,0
3,3,cefuroxime,3,0,0
4,4,chloramphenicol,4,0,0


In [96]:
entity_df.tail()

Unnamed: 0.1,Unnamed: 0,Entity Names,entity_ID,Node label,Node type
449106,449106,729438,449106,2,2
449107,449107,26046,449107,2,2
449108,449108,3417,449108,2,2
449109,449109,26049,449109,2,2
449110,449110,26050,449110,2,2


In [97]:
test_labels_df = entity_df.sample(frac = 0.2, replace = False, random_state = rng)
test_labels_df = test_labels_df[["Entity Names", "Node type", "Node label"]]
test_labels_df.head()

Unnamed: 0,Entity Names,Node type,Node label
269993,"('ibandronate', 'BACTERIAL_INFECTION')",1,1
107823,"('ofloxacin', 'DIFFICULTY_IN_WALKING')",1,1
224851,"('ethacrynic_acid', 'HYPOGLYCAEMIA')",1,1
35092,"('baclofen', 'SKIN_LESION')",1,1
283627,"('pancuronium', 'AORTIC_REGURGITATION')",1,1


In [98]:
test_labels_df.shape

(89822, 3)

In [99]:
remaining_labels_df = entity_df.drop(test_labels_df.index)
remaining_labels_df = remaining_labels_df[["Entity Names", "Node type", "Node label"]]
remaining_labels_df.head()

Unnamed: 0,Entity Names,Node type,Node label
0,clobetasol,0,0
1,ibandronate,0,0
2,minoxidil,0,0
3,cefuroxime,0,0
4,chloramphenicol,0,0


In [100]:
remaining_labels_df.shape

(359289, 3)

In [101]:
test_labels_df.to_csv(sampled_label_test_file, sep = "\t", header = False)

In [102]:
remaining_labels_df.to_csv(sampled_label_file, sep = "\t", header = False)

#### Node file generation

The node file node.dat is of the format node_id, node_name, node_type.
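
A small round-trip check of this format, as a sketch: the three names are taken from the outputs above, while `toy_nodes`, `node_text` and `nodes_read_back` are hypothetical. Names containing commas or single quotes are written unquoted because the separator is a tab, and reading `node_name` as a string keeps purely numeric names such as protein IDs intact.

```python
import io
import pandas as pd

toy_nodes = pd.DataFrame({
    "Entity Names": ["clobetasol", "('ibandronate', 'BACTERIAL_INFECTION')", "729438"],
    "Node type": [0, 1, 2],
})

# The index becomes the node_id column, as in the cell below.
node_text = toy_nodes.to_csv(sep="\t", header=False)

nodes_read_back = pd.read_csv(
    io.StringIO(node_text),
    sep="\t",
    header=None,
    names=["node_id", "node_name", "node_type"],
    dtype={"node_name": str},
)
print(nodes_read_back)
```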

In [103]:
entity_df[["Entity Names", "Node type"]].to_csv(sampled_node_file, sep = "\t", header = False)

#### Meta file generation
This file is of the format

Node Total: Count <total number of entities>
Node Type_0: Count <E0_size>
...
Edge Total: Count <total number of links>
Edge Type_0: Count <number of positive links in X0>
...
Label Total: Count <total number of labeled nodes>
Label Class_0_Total: Count <total number of nodes in E0>
Label Class_0_Type_0: Count <total number of nodes in E0 of label 0>
...
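
The layout above can be sketched as a function that turns the three sets of counts into lines. This is a minimal illustration with invented toy counts; `build_meta_lines` is a hypothetical helper, and the label counts are keyed by (node type, node label) as in the cells below.

```python
def build_meta_lines(node_counts, edge_counts, label_counts):
    """Build the lines of meta.dat from per-type node, edge and label counts."""
    lines = [f"Node Total: Count {sum(node_counts)}"]
    lines += [f"Node Type_{i}: Count {n}" for i, n in enumerate(node_counts)]
    lines.append(f"Edge Total: Count {sum(edge_counts)}")
    lines += [f"Edge Type_{i}: Count {n}" for i, n in enumerate(edge_counts)]
    lines.append(f"Label Total: Count {sum(label_counts.values())}")
    for node_type, n in enumerate(node_counts):
        lines.append(f"Label Class_{node_type}_Total: Count {n}")
        # List every label that occurs within this node type.
        for (t, label), count in sorted(label_counts.items()):
            if t == node_type:
                lines.append(f"Label Class_{t}_Type_{label}: Count {count}")
    return lines

# Toy graph: 2 nodes of type 0, 3 of type 1, and 4 positive links of type 0.
toy_meta = build_meta_lines([2, 3], [4], {(0, 0): 2, (1, 1): 3})
print("\n".join(toy_meta))
```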

In [104]:
# Getting node counts
node_values = list(entity_df.groupby("Node type").count()["Entity Names"])
print(node_values)

[645, 425883, 22583]


In [105]:
# Getting link counts
emb_count = list(emb_df.groupby("link_type").count()["value"])
test_count = list(test_data[test_data["value"] != 0].groupby("link_type").count()['value'])[0]
emb_count[target_matrix_index] += test_count
print(emb_count)

[9497998, 35740, 1296040]


In [106]:
# Getting label counts
entity_df.groupby(["Node type", "Node label"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 0,Entity Names,entity_ID
Node type,Node label,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0,645,645,645
1,1,425883,425883,425883
2,2,22583,22583,22583


In [107]:
label_counts = entity_df.groupby(['Node type','Node label']).count().apply(list).to_dict()
print(label_counts)

{'Unnamed: 0': {(0, 0): 645, (1, 1): 425883, (2, 2): 22583}, 'Entity Names': {(0, 0): 645, (1, 1): 425883, (2, 2): 22583}, 'entity_ID': {(0, 0): 645, (1, 1): 425883, (2, 2): 22583}}


In [108]:
with open(sampled_meta_file, "w+") as meta_file_writer:
    meta_file_writer.write(f"Node Total: Count {sum(node_values)}\n")
    for i in range(len(node_values)):
        meta_file_writer.write(f"Node Type_{i}: Count {node_values[i]}\n")
    meta_file_writer.write(f"Edge Total: Count {sum(emb_count)}\n")
    for i in range(len(emb_count)):
        meta_file_writer.write(f"Edge Type_{i}: Count {emb_count[i]}\n")
    meta_file_writer.write(f"Label Total: Count {sum(label_counts['Entity Names'].values())}\n")

    for i in range(len(node_values)):
        meta_file_writer.write(f"Label Class_{i}_Total: Count {node_values[i]}\n")
        for k in sorted(label_counts["Entity Names"]):
            if k[0] == i:
                meta_file_writer.write(f"Label Class_{k[0]}_Type_{k[1]}: Count {label_counts['Entity Names'][k]}\n")

#### Info file generation

Info file is of the format

node.dat

TYPE    MEANING

0       DRUG

1       PROTEIN

-----------------------------------------------

link.dat

LINK    START   END     MEANING

0       0       0       DRUG-and-DRUG

1       0       1       DRUG-and-PROTEIN

2       1       1       PROTEIN-and-PROTEIN

-----------------------------------------------

label.dat

TYPE    CLASS   MEANING

0       0       DRUG

1       1       PROTEIN
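
The example layout above can be reproduced from two small dictionaries. This is a hedged sketch: `build_info_text` is a hypothetical helper, and its label.dat section assumes every node's label equals its type (no "Entity labels" column).

```python
def build_info_text(entity_names, matrix_entities):
    """Assemble the text of info.dat from the two mapping dictionaries."""
    rule = "-----------------------------------------------"
    lines = ["node.dat", "TYPE\tMEANING"]
    for entity, name in entity_names.items():
        lines.append(f"{entity[1:]}\t{name}")
    lines += ["", rule, "link.dat", "LINK\tSTART\tEND\tMEANING"]
    for matrix, (start, end) in matrix_entities.items():
        meaning = f"{entity_names[start]}-and-{entity_names[end]}"
        lines.append(f"{matrix[1:]}\t{start[1:]}\t{end[1:]}\t{meaning}")
    lines += ["", rule, "label.dat", "TYPE\tCLASS\tMEANING"]
    for entity, name in entity_names.items():
        # Assumption: the label of a node is the same as its type.
        lines.append(f"{entity[1:]}\t{entity[1:]}\t{name}")
    return "\n".join(lines) + "\n"

toy_info = build_info_text(
    {"E0": "DRUG", "E1": "PROTEIN"},
    {"X0": ["E0", "E0"], "X1": ["E0", "E1"], "X2": ["E1", "E1"]},
)
print(toy_info)
```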


In [109]:
with open(sampled_info_file, "w+") as info_file_writer:
    info_file_writer.write("node.dat\n")
    info_file_writer.write("TYPE\tMEANING\n")
    for k, v in entity_name_mapping.items():
        info_file_writer.write(f"{k[1:]}\t{v}\n")

    info_file_writer.write("\n-----------------------------------------------\n")

    info_file_writer.write("link.dat\n")
    info_file_writer.write("LINK\tSTART\tEND\tMEANING\n")
    for k, v in entity_mapping.items():
        # v holds the row and column entity of matrix k, e.g. ['E0', 'E1']
        start_name = entity_name_mapping[f"E{int(v[0][1:])}"]
        end_name = entity_name_mapping[f"E{int(v[1][1:])}"]
        info_file_writer.write(f"{k[1:]}\t{v[0][1:]}\t{v[1][1:]}\t{start_name}-and-{end_name}\n")

    info_file_writer.write("\n-----------------------------------------------\n")
    info_file_writer.write("label.dat\n")
    info_file_writer.write("TYPE\tCLASS\tMEANING\n")
    for i in range(len(node_values)):
        for k in sorted(label_counts["Entity Names"]):
            if k[0] == i:
                info_file_writer.write(f"{i}\t{k[1]}\t{entity_name_mapping[f'E{int(k[1])}']}\n")