This notebook takes in a list of matrices as inputs and generates the dat files needed by NCMF. This also involves sampling the target matrix for creation of train and test files as well.

Input given to this notebook:
1. List of matrices
2. Matrix being used by the link prediction task - this will be divided into train and test sets, with the embeddings being learnt from the train set; link prediction is done by a classifier trained on the train dataset and tested on the test set generated. The matrices must be saved as csv files.
3. entity.csv containing all the entity names in the order E0, E1, ... as they appear in the matrices. If a particular node type can be of differing labels(eg: node type is disease, node label is Asthma), this info must be present in a column called "Entity labels" in this csv. If this column is absent, node_label and node_type are considered to be the same.

Output from notebook:
1. sampled#_node.dat
2. sampled#_link.dat
3. sampled#_link.dat.test
4. sampled#_label.dat
5. sampled#_label.dat.test
6. sampled#_meta.dat
7. sampled#_info.dat

#### User inputs (Values to be modified as per the dataset)

In [1]:
# Fill out the following values as per the dataset
sample_id = 1 # update sample ID number
data_folder = "../../datasets/NCMF/Polypharmacy/" # path to the folder with the matrices
all_matrices = ["drug-drug.csv", "drug-protein.csv", "protein-protein.csv"] # provide a comma separated list of file names; each matrix must be mapped to a separate file
entity_mapping = {'X0': ['E0', 'E0'],
                 'X1': ['E0', 'E1'],
                 'X2': ['E1', 'E1']}
entity_name_mapping = {'E0': "drug", "E1": "protein"} # Modify this as per the dataset for all entities
target_matrix_index = 0 # Set this to match the index of the matrix file as per the list all_matrices; this matrix is divided into train and test
rng = 3 # random number seed; edit this to create different samples

#### Output files generated

In [2]:
# Output files
sampled_node_file = data_folder + f'sampled{sample_id}_node.dat'
sampled_link_file = data_folder + f'sampled{sample_id}_link.dat'
sampled_link_test_file = data_folder + f'sampled{sample_id}_link.dat.test'
sampled_label_file = data_folder + f'sampled{sample_id}_label.dat'
sampled_label_test_file = data_folder + f'sampled{sample_id}_label.dat.test'
sampled_meta_file = data_folder + f'sampled{sample_id}_meta.dat'
sampled_info_file = data_folder + f'sampled{sample_id}_info.dat'

#### Data Processing
The following steps are carried out in the section below:
1. Find number of entities in all the matrices and assign unique IDs to each entity
2. Flatten the matrix into a list of triplets
3. From the target_matrix, sample 20% for use as test data
4. Create node.dat, link.dat, link.dat.test, label.dat, label.dat.test, meta.dat for the dataset

In [3]:
import pandas as pd
import numpy as np
import random
random.seed(rng)

In [4]:
for i in range(len(all_matrices)):
    exec(f"X{i} = pd.read_csv(data_folder + all_matrices[i], header = None).to_numpy()")
# This block assigns each of the matrices to variables X0, X1, X2,...

In [5]:
for i in range(len(all_matrices)):
    print(f"Shape of X{i}", end = " ")
    exec(f"print(X{i}.shape)")

Shape of X0 (645, 645)
Shape of X1 (645, 837)
Shape of X2 (837, 837)


In [6]:
# Get entity IDs - uses matrix_entity_similarity to get total number of unique entries
total_entities = 0
matrices_seen = []
for k,v in entity_mapping.items():
    if v[0] not in matrices_seen:
        matrices_seen.append(v[0])
        total_entities += eval(f"{k}.shape[0]")
        exec(f"{v[0]}_size = {k}.shape[0]")
    if v[1] not in matrices_seen:
        matrices_seen.append(v[1])
        total_entities += eval(f"{k}.shape[1]")
        exec(f"{v[1]}_size = {k}.shape[1]")
print(total_entities)


1482


In [7]:
print(E0_size)

645


In [8]:
full_df = pd.DataFrame(columns = ["left", "right", "value", "link_type"])

In [9]:
for k,v in entity_mapping.items(): # iterate over all matrices
    print(f"Considering {k}")
    link_type = int(k[1:])
    total_left_ents_so_far = 0
    for m in matrices_seen:
        if v[0] == m:
            break
        else:
            total_left_ents_so_far += eval(f"{m}_size")
    
    total_right_ents_so_far = 0
    for m in matrices_seen:
        if v[1] == m:
            break
        else:
            total_right_ents_so_far += eval(f"{m}_size")
            
#     for i in range(eval(f"{v[0]}_size")):
#         for j in range(eval(f"{v[1]}_size")):
#             left_id = i + total_left_ents_so_far
#             right_id = j + total_right_ents_so_far
#             value = eval(f"{k}[i][j]")
#             print(f"left = {left_id}, right = {right_id}, value = {value}")
#             full_df = full_df.append({"left": left_id, "right": right_id, "value": value, "link_type": link_type}, ignore_index = True)
    temp_df = pd.DataFrame(columns = ["left", "right", "value", "link_type"])
    temp_df["left"] = sorted(list(range(total_left_ents_so_far, total_left_ents_so_far + eval(f"{v[0]}_size"))) * eval(f"{v[1]}_size"))
    temp_df["right"] = list(range(total_right_ents_so_far, total_right_ents_so_far + eval(f"{v[1]}_size"))) * eval(f"{v[0]}_size")
    temp_df["value"] = eval(f"{k}").flatten()
    temp_df["link_type"] = [link_type] * len(temp_df["left"])
    full_df = pd.concat([full_df, temp_df], axis = 0, ignore_index = True)

Considering X0
Considering X1
Considering X2


In [10]:
full_df.shape

(1656459, 4)

In [11]:
full_df.head()

Unnamed: 0,left,right,value,link_type
0,0,0,1.0,0
1,0,1,0.0,0
2,0,2,0.0,0
3,0,3,0.0,0
4,0,4,0.0,0


In [12]:
full_df.tail()

Unnamed: 0,left,right,value,link_type
1656454,1481,1477,0.0,2
1656455,1481,1478,0.0,2
1656456,1481,1479,0.0,2
1656457,1481,1480,0.0,2
1656458,1481,1481,0.0,2


In [13]:
full_df.iloc[8]

left         0
right        8
value        1
link_type    0
Name: 8, dtype: object

#### Link file generation

The link.dat.test file is of the format left_node_id, right_node_id, link_value. It has both positive and negative links, and is used to train and test the SVM classifier in the downstream link prediction task.

The link.dat file is of the format left_node_id, right_node_id, link_type, link_value. It is made up of only positive links, i.e link_value is 1. This file is used to learn the embeddings of all the entities in the graph.

In [14]:
np.random.seed(rng)
target_matrix_R_ent_count = eval(f"X{target_matrix_index}.shape[0]")
target_matrix_C_ent_count = eval(f"X{target_matrix_index}.shape[1]")
sample = np.random.choice(target_matrix_R_ent_count * target_matrix_C_ent_count, int(0.2 * target_matrix_R_ent_count * target_matrix_C_ent_count) , replace = False) # sample 20 % of the data

In [15]:
offset = 0
for i in range(target_matrix_index):
    offset += eval(f"X{i}.shape[0] * X{i}.shape[1]")
print(f"Offset for samples is {offset}")

Offset for samples is 0


In [16]:
sample = sample + offset

In [17]:
sample

array([150332, 340896, 203070, ...,  30138, 263981, 130075])

In [18]:
test_data = full_df.iloc[sample]

In [19]:
orig_test_data_size = test_data.shape[0]

In [20]:
# Find all entries that are zeros in the test df
zero_test_data = pd.DataFrame(test_data.groupby("left").sum()["value"]).reset_index()
zero_test_data = zero_test_data[zero_test_data["value"] == 0]

In [21]:
len(zero_test_data["left"].unique())

15

In [22]:
zero_test_data

Unnamed: 0,left,value
28,28,0.0
53,53,0.0
56,56,0.0
57,57,0.0
59,59,0.0
60,60,0.0
65,65,0.0
93,93,0.0
110,110,0.0
223,223,0.0


In [23]:
node1_offset = 0
v = entity_mapping[f"X{target_matrix_index}"]
for m in matrices_seen:
    if v[0] == m:
        break
    else:
        node1_offset += eval(f"{m}_size")

In [24]:
node1_offset

0

In [25]:
node2_offset = 0
v = entity_mapping[f"X{target_matrix_index}"]
for m in matrices_seen:
    if v[1] == m:
        break
    else:
        node2_offset += eval(f"{m}_size")

In [26]:
node2_offset

0

In [27]:
for i in list(zero_test_data["left"].unique()):
    x = eval(f"X{target_matrix_index}[{int(i) - node1_offset},:]")
    print(i)
    non_zeros = np.where(x != 0)[0]
    print(non_zeros)            
    node_1 = int(i)
    try:
        node_2 = float(random.choice(non_zeros)) + node2_offset
        print(f"Node 1: {node_1}, Node 2: {node_2}")
        print(f"sample val = {int((node_1 - node1_offset) * target_matrix_R_ent_count) + int(node_2 - node2_offset) + offset}")
        value_added = eval(f"X{target_matrix_index}[{int(i) - node1_offset}][{node_2 - node2_offset}]")
        test_data = test_data.append({"left": node_1, "right": node_2, "value": value_added, "link_type": target_matrix_index}, ignore_index = True)
        sample = np.append(sample, int((node_1 - node1_offset) * target_matrix_R_ent_count) + int(node_2 - node2_offset) + offset)
    except:
        print(f"Cannot find a non zero entry for {node_1}")

28
[ 28 260 587]
Node 1: 28, Node 2: 28.0
sample val = 18088
Cannot find a non zero entry for 28
53
[ 30  45  46  53 130 204 352 387 407]
Node 1: 53, Node 2: 407.0
sample val = 34592
Cannot find a non zero entry for 53
56
[ 22  34  40  56 153 171 260 398 404 465 634]
Node 1: 56, Node 2: 40.0
sample val = 36160
Cannot find a non zero entry for 56
57
[ 20  42  57  61 209 230]
Node 1: 57, Node 2: 57.0
sample val = 36822
Cannot find a non zero entry for 57
59
[ 20  30  59 370 402 616]
Node 1: 59, Node 2: 402.0
sample val = 38457
Cannot find a non zero entry for 59
60
[ 17  20  37  41  60  61 209 474]
Node 1: 60, Node 2: 474.0
sample val = 39174
Cannot find a non zero entry for 60
65
[20 22 41 65]
Node 1: 65, Node 2: 20.0
sample val = 41945
Cannot find a non zero entry for 65
93
[ 20  93 157]
Node 1: 93, Node 2: 157.0
sample val = 60142
Cannot find a non zero entry for 93
110
[ 37  42 110 155 215 452]
Node 1: 110, Node 2: 37.0
sample val = 70987
Cannot find a non zero entry for 110
223
[186

In [28]:
sample[orig_test_data_size:]

array([], dtype=int64)

In [29]:
test_data["value"] = test_data["value"].apply(lambda x: 1 if x != 0 else 0) # convert value to indicate only presence/absence of link

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [30]:
test_data.head()

Unnamed: 0,left,right,value,link_type
150332,233,47,0,0
340896,528,336,1,0
203070,314,540,0,0
391723,607,208,0,0
2115,3,180,0,0


In [31]:
test_data.shape[0]

83205

In [32]:
test_data_np = test_data.to_numpy()

In [33]:
print(f"Percentage of ones in test data = {np.count_nonzero(test_data_np[:,2])/test_data_np.shape[0]}")
print(f"Percentage of zeros in test data = {1-(np.count_nonzero(test_data_np[:,2])/test_data_np.shape[0])}")

Percentage of ones in test data = 0.30739739198365484
Percentage of zeros in test data = 0.6926026080163452


In [34]:
# Create link.dat.test
link_file_for_lp = open(sampled_link_test_file, "w+") # Same test file will be used by HIN2Vec and other algorithms too

number_of_pairs = 0
for i in range(test_data_np.shape[0]):
        number_of_pairs += 1
        link_file_for_lp.write(str(int(test_data_np[i][0])) + "\t" + str(int(test_data_np[i][1])) + "\t" + str(int(test_data_np[i][2])) + "\n")
            
link_file_for_lp.close()

In [35]:
full_df.drop(sample, inplace = True, axis = 0)

In [36]:
full_df.shape

(1573254, 4)

In [37]:
# Create link.dat file
emb_df = full_df[full_df["value"] != 0]
emb_df["value"] = emb_df["value"]
emb_df["left"] = emb_df["left"].astype(int)
emb_df["right"] = emb_df["right"].astype(int)
emb_df["link_type"] = emb_df["link_type"].astype(int)
emb_df = emb_df[["left", "right", "link_type", "value"]]
emb_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[r

Unnamed: 0,left,right,link_type,value
6,0,6,0,1.0
10,0,10,0,1.0
20,0,20,0,1.0
22,0,22,0,1.0
23,0,23,0,1.0


In [38]:
emb_df.to_csv(sampled_link_file, sep="\t", header=False, index=False)

#### Label files generation

The label.dat file has the following format: node_id, node_name, node_type, node_label.

The file label.dat.test has the following format: node_id, node_name, node_type, node_label

In [39]:
entity_file = data_folder + "entity.csv"

In [40]:
entity_df = pd.read_csv(entity_file)

In [41]:
entity_df.head()

Unnamed: 0,Entity Names
0,CID000000085
1,CID000000119
2,CID000000143
3,CID000000158
4,CID000000159


In [42]:
entity_df.shape

(1482, 1)

In [43]:
matrices_seen # this is the list of entities in the dataset

['E0', 'E1']

In [44]:
if "Entity labels" in entity_df.columns:
    isNodeLabelSeparate = True
    print("Node labels are different from node types")
else:
    isNodeLabelSeparate = False
    print("Node labels are the same as node types")
    

Node labels are the same as node types


In [45]:
count = 0
node_type = []
node_label = []
for entity in matrices_seen:
    print(f"Entity {entity}")
    entity_size = eval(f"{entity}_size")
    node_type_val = int(entity[1:])
    node_type.extend([node_type_val] * entity_size)
    if not isNodeLabelSeparate:
        node_label.extend([node_type_val] * entity_size)
    count += entity_size
    
entity_df["Node label"] = node_label if not isNodeLabelSeparate else list(entity_df["Entity Labels"])
entity_df["Node type"] = node_type

Entity E0
Entity E1


In [46]:
entity_df.head()

Unnamed: 0,Entity Names,Node label,Node type
0,CID000000085,0,0
1,CID000000119,0,0
2,CID000000143,0,0
3,CID000000158,0,0
4,CID000000159,0,0


In [47]:
entity_df.tail()

Unnamed: 0,Entity Names,Node label,Node type
1477,84816,1,1
1478,4190,1,1
1479,92483,1,1
1480,10988,1,1
1481,221656,1,1


In [48]:
test_labels_df = entity_df.sample(frac = 0.2, replace = False, random_state = rng)
test_labels_df = test_labels_df[["Entity Names", "Node type", "Node label"]]
test_labels_df.head()

Unnamed: 0,Entity Names,Node type,Node label
99,CID000002610,0,0
800,139760,1,1
687,56413,1,1
710,2842,1,1
288,CID000004095,0,0


In [49]:
test_labels_df.shape

(296, 3)

In [50]:
remaining_labels_df = entity_df.drop(test_labels_df.index)
remaining_labels_df = remaining_labels_df[["Entity Names", "Node type", "Node label"]]
remaining_labels_df.head()

Unnamed: 0,Entity Names,Node type,Node label
0,CID000000085,0,0
2,CID000000143,0,0
5,CID000000191,0,0
7,CID000000214,0,0
8,CID000000271,0,0


In [51]:
remaining_labels_df.shape

(1186, 3)

In [52]:
test_labels_df.to_csv(sampled_label_test_file, sep = "\t", header = False)

In [53]:
remaining_labels_df.to_csv(sampled_label_file, sep = "\t", header = False)

#### Node file generation

Node file node.dat is of the format node id, node_name, node_type.

In [54]:
entity_df[["Entity Names", "Node type"]].to_csv(sampled_node_file, sep = "\t", header = False)

#### Meta file generation
This file is of the format

Node Total: Count <total number of entities>
Node Type_0: Count <E0_size>
...
Edge Total: Count <total number of links>
Edge Type_0: Count <number of positive links in E0>
...
Label Total: Count <total number of node label types>
Label Class_0_Total: Count <total number of nodes in E0>
Label Class_0_Type_0: Count <total number of nodes in E0 of label 0>
...

In [55]:
# Getting node counts
node_values = list(entity_df.groupby("Node type").count()["Entity Names"])
print(node_values)

[645, 837]


In [56]:
# Getting link counts
emb_count = list(emb_df.groupby("link_type").count()["value"])
test_count = list(test_data[test_data["value"] != 0].groupby("link_type").count()['value'])[0]
emb_count[target_matrix_index] += test_count
print(emb_count)

[127591, 15575, 20298]


In [57]:
# Getting label counts
entity_df.groupby(["Node type", "Node label"]).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Entity Names
Node type,Node label,Unnamed: 2_level_1
0,0,645
1,1,837


In [58]:
label_counts = entity_df.groupby(['Node type','Node label']).count().apply(list).to_dict()
print(label_counts)

{'Entity Names': {(0, 0): 645, (1, 1): 837}}


In [59]:
meta_file_writer = open(sampled_meta_file, "w+")

meta_file_writer.write(f"Node Total: Count {sum(node_values)}" + "\n")
for i in range(len(node_values)):
    meta_file_writer.write(f"Node Type_{i}: Count {node_values[i]}" + "\n")
meta_file_writer.write(f"Edge Total: Count {sum(emb_count)}" + "\n")
for i in range(len(emb_count)):
    meta_file_writer.write(f"Edge Type_{i}: Count {emb_count[i]}" + "\n")
meta_file_writer.write(f"Label Total: Count {sum(label_counts['Entity Names'].values())}" + "\n")

for i in range(len(node_values)):
    meta_file_writer.write(f"Label Class_{i}_Total: Count {node_values[i]}" + "\n")
    for k in sorted(label_counts["Entity Names"]):
        if k[0] == i:
            meta_file_writer.write(f"Label Class_{k[0]}_Type_{k[1]}: Count {label_counts['Entity Names'][k]}" + "\n")
        
meta_file_writer.close()

#### Info file generation

Info file is of the format

node.dat

TYPE    MEANING

0       DRUG

1       PROTEIN

-----------------------------------------------

link.dat

LINK    START   END     MEANING

0       0       0       DRUG-and-DRUG

1       0       1       DRUG-and-PROTEIN

2       1       1       PROTEIN-and-PROTEIN

-----------------------------------------------

label.dat

TYPE    CLASS   MEANING

0       0       DRUG

1       1       PROTEIN


In [60]:
info_file_writer = open(sampled_info_file, "w+")

info_file_writer.write("node.dat\n")
info_file_writer.write("TYPE\tMEANING\n")
for k, v in entity_name_mapping.items():
    info_file_writer.write(f"{k[1:]}"+"\t"+f"{v}"+"\n")

info_file_writer.write("\n-----------------------------------------------\n")

info_file_writer.write("link.dat\n")
info_file_writer.write("LINK\tSTART\tEND\tMEANING\n")
for k,v in entity_mapping.items():
    info_file_writer.write(f"{k[1:]}" + "\t" + f"{v[0][1:]}" + "\t" + f"{v[1][1:]}" + "\t" + f"{entity_name_mapping[f'E{int(v[0][1:])}']}" + "-and-" + f"{entity_name_mapping[f'E{int(v[1][1:])}']}" + "\n")

info_file_writer.write("\n-----------------------------------------------\n")
info_file_writer.write("label.dat\n")
info_file_writer.write("TYPE\tCLASS\tMEANING\n")
for i in range(len(node_values)):
    for k in sorted(label_counts["Entity Names"]):
        if k[0] == i:
            info_file_writer.write(f"{i}" + "\t" + f"{k[1]}" + "\t" + f"{entity_name_mapping[f'E{int(k[1])}']}" + "\n")

info_file_writer.close()