# **UAS BIG DATA**
# **ERLINA NITA SUMADYA**

# **2206130712**

# **Neural Collaborative Filtering for Drug Prediction**




Machine learning has widespread applications across many facets of human life. Discovering new drugs is a costly and time-consuming endeavor that often takes many years. Identifying existing drugs that can effectively treat specific medical conditions is therefore highly advantageous.

This problem parallels recommendation systems, where items are suggested to users based on specific criteria. In a similar vein, the writer proposes drugs for the treatment of particular medical conditions by considering relevant factors.

To achieve this, the writer employs the Neural Collaborative Filtering (NCF) approach to predict suitable drugs for specific medical conditions. The chosen model is the Multilayer Perceptron, as outlined in the referenced research paper [1].

The writer has made certain adaptations to fit the current requirements. With these adjustments, let's begin.



## **Step 1**
## **Importing All Necessary Libraries**




In [21]:
# Importing Libraries
import sys
import multiprocessing
from time import time

import numpy as np
import pandas as pd
import random
import math
import argparse
import heapq
import scipy.sparse as sp

# The following imports are commented out, indicating they are not currently used in the code.
# Importing Theano (numerical computation library)
# import theano
# import theano.tensor as T

# Importing TensorFlow and Keras
import tensorflow as tf
import keras
from keras import layers
from keras.models import Sequential, Model

# Importing Keras Backend
from keras import backend as K

# Importing Keras Components for Layers
from keras import initializers
from keras.regularizers import l1, l2
from keras.layers import Dense, Lambda, Activation

# Importing More Keras Layers (Dense is already imported above)
from keras.layers import Embedding, Input, Reshape, Flatten, Dropout, concatenate

# Importing Keras Optimizers
from keras.optimizers import Adagrad, Adam, SGD, RMSprop





### **Dataset**

We are using the Drugs.com review dataset, which you can download from Kaggle [here](https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018).


This dataset lists different drugs and their effectiveness or uses for different medical conditions.

Now let’s analyze the dataset to get an understanding of the data.


In [10]:
!gdown --id "1fjbMJKg5bMeUf5on9-hg0rG4SH9AxTYL"
# The file should now be downloaded, and you can use it with pandas or any other library

Downloading...
From: https://drive.google.com/uc?id=1fjbMJKg5bMeUf5on9-hg0rG4SH9AxTYL
To: /content/drugsComTrain_raw.csv
100% 83.0M/83.0M [00:01<00:00, 60.6MB/s]


In [11]:
df = pd.read_csv("drugsComTrain_raw.csv", encoding = 'latin-1').drop(['uniqueID','date'],axis = 1)
df


Unnamed: 0,drugName,condition,review,rating,usefulCount
0,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9,27
1,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8,192
2,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5,17
3,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8,10
4,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9,37
...,...,...,...,...,...
161292,Campral,Alcohol Dependence,"""I wrote my first report in Mid-October of 201...",10,125
161293,Metoclopramide,Nausea/Vomiting,"""I was given this in IV before surgey. I immed...",1,34
161294,Orencia,Rheumatoid Arthritis,"""Limited improvement after 4 months, developed...",2,35
161295,Thyroid desiccated,Underactive Thyroid,"""I&#039;ve been on thyroid medication 49 years...",10,79


In [12]:
print(len(df['condition'].unique()))
print(len(df['drugName'].unique()))

885
3436


In [13]:
Medical_Conditions = df['condition'].unique()
Drugs = df['drugName'].unique()



Our dataset contains the drug name and the corresponding medical condition, along with other information such as the review, rating, and useful count.

We now want to find the usefulness of a particular drug for different conditions (i.e., predicting the drug for a given condition).

Here we can see that we have 3436 unique drugs, 885 unique conditions, and 161297 interactions.
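
As a rough sanity check on how sparse this interaction data is, the counts above give an upper bound on the density of the condition-by-drug matrix. This is only a sketch: the variable names are illustrative, and because several reviews can share the same (condition, drug) pair, the true density is lower than this bound.

```python
# Counts reported above
num_conditions = 885
num_drugs = 3436
num_reviews = 161297

# Every review fills at most one distinct (condition, drug) cell,
# so this ratio is an upper bound on the fraction of filled cells.
total_cells = num_conditions * num_drugs
density_upper_bound = num_reviews / total_cells

print(f"cells: {total_cells}, density <= {density_upper_bound:.4f}")
```

Even under this generous bound, most condition-drug cells are empty, which is why a sparse matrix is used for the interactions later on.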



### **Mapping drug and condition names to unique IDs, and unique IDs back to names**

In [14]:
# ID -> name lookups; each ID is the position in the array of unique values
Medical_Conditions_ID_to_name = dict(enumerate(Medical_Conditions))
Drugs_ID_to_name = dict(enumerate(Drugs))

Medical_Conditions_ID_to_NAME = pd.DataFrame(list(Medical_Conditions_ID_to_name.items()))
Drugs_ID_to_NAME = pd.DataFrame(list(Drugs_ID_to_name.items()))

Medical_Conditions_ID_to_NAME.to_csv('Medical_Conditions_ID_to_NAME.csv')
Drugs_ID_to_NAME.to_csv('Drugs_ID_to_NAME.csv')

In [15]:
Medical_Conditions_name_to_ID =  dict([(value, key) for key, value in Medical_Conditions_ID_to_name.items()])
Drugs_name_to_ID =  dict([(value, key) for key, value in Drugs_ID_to_name.items()])
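
The same kind of name-to-ID encoding can be sketched with `pandas.factorize`, which numbers values in order of first appearance. This is an illustrative alternative on toy data, not the notebook's own method; the toy frame and variable names are assumptions. One caveat: by default `factorize` assigns the code -1 to missing values, so it is not a drop-in replacement if the column contains NaN.

```python
import pandas as pd

# Toy stand-in for the real column (values are illustrative only)
toy = pd.DataFrame({"drugName": ["Valsartan", "Guanfacine", "Lybrel", "Guanfacine"]})

# factorize returns one integer code per row plus the array of unique names
codes, names = pd.factorize(toy["drugName"])

# Build both lookup directions from the unique names
id_to_name = dict(enumerate(names))
name_to_id = {name: idx for idx, name in id_to_name.items()}

print(codes.tolist())
print(id_to_name)
```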

Keeping the useful columns


(Here we use `usefulCount` as the interaction signal between a drug and a specific medical condition. When the training matrix is built below, every recorded pair is treated as a positive interaction.)


In [16]:
df = df[['condition','drugName','usefulCount']].copy()
df

Unnamed: 0,condition,drugName,usefulCount
0,Left Ventricular Dysfunction,Valsartan,27
1,ADHD,Guanfacine,192
2,Birth Control,Lybrel,17
3,Birth Control,Ortho Evra,10
4,Opiate Dependence,Buprenorphine / naloxone,37
...,...,...,...
161292,Alcohol Dependence,Campral,125
161293,Nausea/Vomiting,Metoclopramide,34
161294,Rheumatoid Arthritis,Orencia,35
161295,Underactive Thyroid,Thyroid desiccated,79


Mapping

In [17]:
# Replace names with their integer IDs. Assigning whole columns avoids
# chained indexing and the SettingWithCopyWarning it triggers.
df['drugName'] = [Drugs_name_to_ID[name] for name in df['drugName']]
df['condition'] = [Medical_Conditions_name_to_ID[name] for name in df['condition']]

df.sort_values("condition", axis = 0, ascending = True,
                 inplace = True)

df


Unnamed: 0,condition,drugName,usefulCount
0,0,0,27
45609,0,147,16
2044,0,667,31
26551,0,887,43
113009,0,887,50
...,...,...,...
159071,880,1970,1
159319,881,93,11
160791,882,420,62
160921,883,470,92


Since we treat this as a binary classification problem (a drug either has or has not been recorded for a condition), we do not consider the properties of the drugs (this can be done in a later version).

We are going to build a sparse matrix to represent the interactions between drugs and medical conditions.

For testing, we take 200 randomly sampled condition-drug interactions (the conditions play the role of the users in [1]) and rank each one against sampled negative interactions. Please have a look at the "log loss with negative sampling" part of the referenced research paper [1].



In [18]:
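def get_data():
  '''
  Returns the train and test data.
  train -> sparse matrix (conditions x drugs); 1.0 marks a recorded pair
  test  -> list of [condition_id, drug_id] pairs
  '''

  # Derive the sizes from the data instead of hard-coding 885, 3436 and 161296
  num_conditions, num_drugs = len(Medical_Conditions), len(Drugs)
  train = sp.dok_matrix((num_conditions, num_drugs), dtype=np.float32)

  # Mark every recorded (condition, drug) pair as a positive interaction
  for i in range(len(df)):
    ls = list(df.iloc[i])
    train[ls[0],ls[1]] = 1.0

  # Sample 200 recorded pairs at random as test cases
  test = []

  for j in range(200):
    i = random.randint(0, len(df) - 1)
    ls = list(df.iloc[i])
    test.append([ls[0],ls[1]])

  return train , test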
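
Note that `get_data` samples its test pairs from the same rows that fill `train`, so the test pairs are also present in the training matrix. A stricter hold-out setup, closer to the protocol in [1], would remove each test pair from the training matrix. A minimal sketch on toy data follows; the helper name `split_hold_out` and the toy pairs are hypothetical and not part of this notebook.

```python
import random
import numpy as np
import scipy.sparse as sp

def split_hold_out(pairs, shape, num_test, seed=0):
    """Hold out num_test distinct (condition, drug) pairs; build train from the rest."""
    rng = random.Random(seed)
    unique_pairs = sorted(set(pairs))        # drop duplicate reviews of the same pair
    test = rng.sample(unique_pairs, num_test)
    held_out = set(test)

    train = sp.dok_matrix(shape, dtype=np.float32)
    for c, d in unique_pairs:
        if (c, d) not in held_out:           # held-out pairs never reach the model
            train[c, d] = 1.0
    return train, test

# Toy (condition, drug) pairs; the last one repeats an earlier pair
pairs = [(0, 0), (0, 1), (1, 2), (2, 1), (2, 3), (0, 1)]
train, test = split_hold_out(pairs, shape=(3, 4), num_test=2)
print(train.nnz, test)
```

With a split like this, HR and NDCG measure how well the model ranks pairs it has not been trained on.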



### **Evaluation Process**

To evaluate the performance of drug recommendation, we follow an evaluation protocol modelled on the leave-one-out evaluation described in the referenced research paper [1].

HR (Hit Ratio) intuitively measures whether the test item is present in the top-K list (the paper uses K = 10; this notebook uses K = 5), and NDCG (Normalized Discounted Cumulative Gain) accounts for the position of the hit by assigning higher scores to hits at top ranks. We calculate both metrics for each test case and report the average score.
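
For a single test case with one ground-truth drug, the two metrics reduce to the expressions below. Here $i$ is the 0-based position of the ground-truth drug in the ranked top-$K$ list; this notation is introduced only for illustration and mirrors the `math.log(2) / math.log(i+2)` rule used in this notebook.

$$
\mathrm{HR@}K =
\begin{cases}
1 & \text{if the ground-truth drug is in the top-}K\text{ list} \\
0 & \text{otherwise}
\end{cases}
\qquad
\mathrm{NDCG@}K =
\begin{cases}
\dfrac{\log 2}{\log(i + 2)} & \text{if the ground-truth drug is at position } i \\
0 & \text{otherwise}
\end{cases}
$$

For example, a hit at the first position ($i = 0$) scores $1$, while a hit at the third position ($i = 2$) scores $\log 2 / \log 4 = 0.5$.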


In [19]:
def getHitRatio(ranklist, gtItem):
    for item in ranklist:
        if item == gtItem:
            return 1
    return 0

def getNDCG(ranklist, gtItem):
    for i in range(len(ranklist)):
        item = ranklist[i]
        if item == gtItem:
            return math.log(2) / math.log(i+2)
    return 0


In [20]:
def evaluate(model,K):

  HR, NDCG = [],[]
  train,test = get_data()
  num_drugs = train.shape[1]

  for rating in test:

    u = rating[0]          # condition ID of the test pair
    gtdrug = rating[1]     # ground-truth drug ID

    # Sample 99 random drugs with no recorded interaction for this condition
    drugs = []
    while len(drugs) < 99:
      j = random.randint(0, num_drugs - 1)
      if (u,j) in train.keys():
        continue
      drugs.append(j)

    # The ground-truth drug is the 100th candidate
    drugs.append(gtdrug)

    # Get prediction scores
    medical_conditions = np.full(len(drugs), u, dtype = 'int32')
    predictions = model.predict([medical_conditions, np.array(drugs)],
                                 batch_size=64, verbose=0).flatten()

    # Map each candidate drug to its predicted score
    map_drug_score = {drug: score for drug, score in zip(drugs, predictions)}

    # Rank the candidates and score the position of the ground-truth drug
    ranklist = heapq.nlargest(K, map_drug_score, key=map_drug_score.get)
    hr = getHitRatio(ranklist, gtdrug)
    ndcg = getNDCG(ranklist, gtdrug)

    HR.append(hr)
    NDCG.append(ndcg)

  return (HR, NDCG)






### **Preparing the Model**


**Multi-Layer Perceptron** (as described in the paper [1]):

Since NCF adopts two pathways to model drugs and conditions, it is intuitive to combine the features of the two pathways by concatenating them. This design has been widely adopted in multimodal deep learning work. However, a simple vector concatenation does not account for any interactions between drug and condition latent features, which is insufficient for modeling the collaborative filtering effect. To address this issue, we add hidden layers on top of the concatenated vector, using a standard MLP to learn the interaction between drug and condition latent features. In this way, we endow the model with a large degree of **flexibility and non-linearity.**
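
To make the data flow concrete, here is a minimal NumPy sketch of one forward pass through such an MLP: look up the two embeddings, concatenate them, apply ReLU hidden layers, and finish with a sigmoid. The sizes, the random weights, and the names `P`, `Q`, and `predict` are assumptions for illustration; the model is untrained, and this is not the Keras model built below.

```python
import numpy as np

rng = np.random.default_rng(0)

num_conditions, num_drugs, emb_dim = 5, 7, 4   # toy sizes, not the real ones
hidden_sizes = [6, 3]                          # illustrative hidden-layer widths

# Embedding tables: one latent vector per condition and per drug
P = rng.normal(scale=0.1, size=(num_conditions, emb_dim))
Q = rng.normal(scale=0.1, size=(num_drugs, emb_dim))

# Randomly initialised dense layers as (weights, bias) pairs
sizes = [2 * emb_dim] + hidden_sizes
dense = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
         for m, n in zip(sizes[:-1], sizes[1:])]
w_out = rng.normal(scale=0.1, size=(hidden_sizes[-1], 1))

def predict(condition_id, drug_id):
    """Score one (condition, drug) pair with the untrained toy MLP."""
    z = np.concatenate([P[condition_id], Q[drug_id]])   # concatenation layer
    for W, b in dense:
        z = np.maximum(0.0, z @ W + b)                  # ReLU hidden layer
    logit = (z @ w_out).item()
    return 1.0 / (1.0 + np.exp(-logit))                 # sigmoid output in (0, 1)

score = predict(2, 3)
print(score)
```

The hidden layers are what let the model learn interactions between the two latent vectors; concatenation alone only places them side by side.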


In [4]:
def get_model(num_medical_conditions, num_drugs, layers=(16, 8), reg_layers=(0, 0)):

    assert len(layers) == len(reg_layers)

    num_layer = len(layers)                                                                      # Number of layers in the MLP

    medical_condition_input = Input(shape=(1,), dtype='int32', name = 'user_input')
    drug_input = Input(shape=(1,), dtype='int32', name = 'item_input')

    # Keras 2 argument names: embeddings_regularizer, kernel_regularizer, kernel_initializer
    MLP_Embedding_Medical_Conditions = Embedding(input_dim = num_medical_conditions, output_dim = int(layers[0]/2),
                                   name = 'medical_condition_embedding',
                                    embeddings_regularizer = l2(reg_layers[0]), input_length=1)
    MLP_Embedding_Drugs = Embedding(input_dim = num_drugs, output_dim = int(layers[0]/2),
                                   name = 'drug_embedding',
                                    embeddings_regularizer = l2(reg_layers[0]), input_length=1)

    medical_condition_latent = Flatten()(MLP_Embedding_Medical_Conditions(medical_condition_input))   # flattened condition embedding
    drug_latent = Flatten()(MLP_Embedding_Drugs(drug_input))                                          # flattened drug embedding

    # Layer 0 of the MLP: the concatenated condition and drug embeddings
    vector = keras.layers.concatenate([medical_condition_latent,drug_latent])

    # MLP layers
    for idx in range(1, num_layer):
        layer = Dense(layers[idx], kernel_regularizer= l2(reg_layers[idx]), activation='relu', name = 'layer%d' %idx)
        vector = layer(vector)
        #layer1 = Dropout(0.25)
        #vector = layer1(vector)

    # Final prediction layer
    prediction = Dense(1, activation='sigmoid', kernel_initializer='lecun_uniform', name = 'prediction')(vector)

    model = Model(inputs=[medical_condition_input, drug_input],
                  outputs=prediction)

    return model



In [5]:

def get_train_instances(train, num_negatives):

    medical_condition_input, drug_input, labels = [],[],[]
    num_medical_conditions, num_drugs = train.shape   # take num_drugs from the matrix rather than a global

    for (u, i) in train.keys():

        # positive instance
        medical_condition_input.append(u)
        drug_input.append(i)
        labels.append(1)


        # negative instances
        for t in range(num_negatives):
            j = np.random.randint(num_drugs)

            while ( (u,j) in train.keys() ) :
                j = np.random.randint(num_drugs)
            medical_condition_input.append(u)
            drug_input.append(j)
            labels.append(0)


    return medical_condition_input, drug_input, labels
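
The negative-sampling idea can be tried on toy data with the self-contained sketch below. It follows the same pattern (one positive, then `num_negatives` resampled unobserved drugs) but uses a plain Python set for the membership test. The function name `sample_training_instances` and the toy pairs are hypothetical.

```python
import numpy as np

def sample_training_instances(positives, num_drugs, num_negatives, seed=0):
    """Pair every observed (condition, drug) with num_negatives unobserved drugs."""
    rng = np.random.default_rng(seed)
    observed = set(positives)          # fast membership test for observed pairs
    conditions, drugs, labels = [], [], []

    for c, d in positives:
        # positive instance
        conditions.append(c)
        drugs.append(d)
        labels.append(1)

        # negative instances: resample until the pair is unobserved
        for _ in range(num_negatives):
            j = int(rng.integers(num_drugs))
            while (c, j) in observed:
                j = int(rng.integers(num_drugs))
            conditions.append(c)
            drugs.append(j)
            labels.append(0)

    return conditions, drugs, labels

positives = [(0, 1), (0, 3), (1, 2), (2, 0)]   # toy pairs, not from the dataset
conditions, drugs, labels = sample_training_instances(positives, num_drugs=6, num_negatives=2)
print(len(labels), sum(labels))
```

Each positive pair contributes `1 + num_negatives` training rows, so the label balance is controlled directly by `num_negatives`.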

### **Training, testing, and evaluating:**

The neural network architecture has one input layer (formed by concatenating the drug and condition embeddings, 256 units in total), five hidden layers (128, 64, 32, 16, and 8 units), and one output layer.


The optimizer, learning rate, batch size, number of epochs, and so on were chosen after a few experiments; the best-performing values are used in this work.


The process includes:
1. create a model
2. train the model
3. test/evaluate the model
4. calculate HR and NDCG
5. check for the best HR and NDCG and save the model
6. repeat steps 2 to 5 `epochs` times

This saves the best model.
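
To get a feel for the size of this architecture, the trainable parameters implied by `layers = [256, 128, 64, 32, 16, 8]` can be tallied by hand. This sketch assumes default Keras `Dense` layers (with biases) and bias-free `Embedding` layers, as in `get_model_updated`; the variable names are illustrative.

```python
# Sizes taken from this notebook
num_conditions, num_drugs = 885, 3436
layers = [256, 128, 64, 32, 16, 8]

# Each embedding table has layers[0] // 2 columns and no bias
embedding_dim = layers[0] // 2
embedding_params = (num_conditions + num_drugs) * embedding_dim

# A Dense layer with n_in inputs and n_out units has n_in * n_out weights plus n_out biases
widths = layers + [1]                      # layer widths followed by the single output unit
dense_params = sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))

total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)
```

Under these assumptions, most of the parameters sit in the two embedding tables rather than in the dense layers.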


In [31]:
if __name__ == '__main__':
    path = '/content'
    layers = [256, 128, 64, 32, 16, 8]
    reg_layers = [0, 0, 0, 0, 0, 0]
    num_negatives = 6
    learner = 'adam'
    learning_rate = 0.001
    batch_size = 256
    epochs = 10
    verbose = 1

    topK = 5
    model_out_file = 'Pretrain_new.h5'

    # Loading data
    t1 = time()
    train, test = get_data()

    num_medical_conditions, num_drugs = train.shape

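    # Build model (get_model_updated is defined in the cell below; run that cell first)
    model = get_model_updated(num_medical_conditions, num_drugs, layers, reg_layers)

    # Compile model (recent Keras versions expect learning_rate instead of lr)
    model.compile(optimizer=Adam(learning_rate=learning_rate), loss='binary_crossentropy', metrics=['accuracy'])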

    # Check Init performance
    t1 = time()
    (hr, ndcg) = evaluate(model, topK)
    HR, NDCG = np.array(hr).mean(), np.array(ndcg).mean()
    print('Init: HR = %.4f, NDCG = %.4f [%.1f]' % (HR, NDCG, time() - t1))

    # Train model
    best_hr, best_ndcg, best_iter = HR, NDCG, -1

    for epoch in range(epochs):
        t1 = time()

        # Generate training instances
        medical_condition_input, drug_input, labels = get_train_instances(train, num_negatives)

        # Training: each outer iteration runs 20 Keras epochs on freshly sampled negatives
        hist = model.fit([np.array(medical_condition_input), np.array(drug_input)],
                         np.array(labels), batch_size=batch_size, epochs=20, verbose=0, shuffle=True)

        t2 = time()

        # Evaluation
        if epoch % verbose == 0:
            (hr, ndcg) = evaluate(model, topK)
            HR, NDCG, loss = np.array(hr).mean(), np.array(ndcg).mean(), hist.history['loss'][0]
            print('Iteration %d [%.1f s]: HR = %.4f, NDCG = %.4f, loss = %.4f [%.1f s]'
                  % (epoch, t2 - t1, HR, NDCG, loss, time() - t2))

            if HR >= best_hr and NDCG >= best_ndcg:
                best_hr, best_ndcg, best_iter = HR, NDCG, epoch
                model.save(model_out_file)

    print("End. Best Iteration %d:  HR = %.4f, NDCG = %.4f. " % (best_iter, best_hr, best_ndcg))
    print("The best MLP model is saved to %s" % (model_out_file))




Init: HR = 0.0800, NDCG = 0.0494 [30.1]
Iteration 0 [33.1 s]: HR = 0.9800, NDCG = 0.8795, loss = 0.4299 [29.9 s]


Iteration 1 [23.2 s]: HR = 0.9800, NDCG = 0.9160, loss = 0.0986 [32.1 s]
Iteration 2 [41.2 s]: HR = 1.0000, NDCG = 0.9653, loss = 0.0595 [29.1 s]
Iteration 3 [22.8 s]: HR = 1.0000, NDCG = 0.9762, loss = 0.0408 [29.7 s]
Iteration 4 [41.4 s]: HR = 1.0000, NDCG = 0.9566, loss = 0.0386 [29.1 s]
Iteration 5 [41.2 s]: HR = 1.0000, NDCG = 0.9754, loss = 0.0350 [29.2 s]
Iteration 6 [22.7 s]: HR = 1.0000, NDCG = 0.9871, loss = 0.0294 [29.0 s]
Iteration 7 [41.2 s]: HR = 1.0000, NDCG = 0.9908, loss = 0.0239 [30.3 s]
Iteration 8 [21.6 s]: HR = 1.0000, NDCG = 0.9889, loss = 0.0229 [30.1 s]
Iteration 9 [41.2 s]: HR = 1.0000, NDCG = 0.9926, loss = 0.0205 [28.9 s]
End. Best Iteration 9:  HR = 1.0000, NDCG = 0.9926. 
The best MLP model is saved to Pretrain_new.h5


In [32]:
def get_model_updated(num_medical_conditions, num_drugs, layers, reg_layers):
    # Input layers
    medical_condition_input = Input(shape=(1,), dtype='int32', name='medical_condition_input')
    drug_input = Input(shape=(1,), dtype='int32', name='drug_input')

    # Embedding layers
    medical_condition_embedding = Embedding(input_dim=num_medical_conditions, output_dim=layers[0] // 2,
                                           input_length=1, name='medical_condition_embedding')(medical_condition_input)
    drug_embedding = Embedding(input_dim=num_drugs, output_dim=layers[0] // 2,
                               input_length=1, name='drug_embedding')(drug_input)

    # Flatten embeddings
    medical_condition_flat = Flatten()(medical_condition_embedding)
    drug_flat = Flatten()(drug_embedding)

    # Concatenate flattened embeddings
    merged_vector = concatenate([medical_condition_flat, drug_flat], axis=-1)

    # Build the neural network
    for i in range(1, len(layers)):
        merged_vector = Dense(layers[i], activation='relu', name='layer%d' % i)(merged_vector)

    # Output layer with sigmoid activation
    output = Dense(1, activation='sigmoid', kernel_regularizer=l2(reg_layers[-1]), name='output')(merged_vector)

    # Model
    model = Model(inputs=[medical_condition_input, drug_input], outputs=output)

    return model

# Usage
model = get_model_updated(num_medical_conditions, num_drugs, layers, reg_layers)


In [33]:
Best_model = tf.keras.models.load_model(
    '/content/Pretrain_new.h5', compile = True
)

Now let's try to find possible drugs for Rheumatoid Arthritis (RA), an autoimmune disease that can cause joint pain and damage throughout the body.

In [35]:
drugs = list(range(len(Drugs)))                                       # all 3436 drug IDs

condition_id = 18                                                     # ID of "Rheumatoid Arthritis"
medical_conditions = np.full(len(drugs), condition_id, dtype = 'int32')

predictions = Best_model.predict([medical_conditions, np.array(drugs)],
                                 batch_size=100, verbose=0)

# Build the {drug ID: predicted score} dictionary
map_drug_score = {drug: score for drug, score in zip(drugs, predictions.flatten())}

# Keep the five drugs with the highest predicted scores
ranklist = heapq.nlargest(5, map_drug_score, key=map_drug_score.get)



print( str(Medical_Conditions_ID_to_name[18]) +" can be treated by : " )

print("\n")
for i in ranklist:
  print("\n\t"+str(Drugs_ID_to_name[i]))


Rheumatoid Arthritis can be treated by : 



	Mobic

	Acetaminophen / hydrocodone

	Hydroxychloroquine

	Meloxicam

	Celecoxib


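The model has predicted a few correct drugs, such as:
* **Hydroxychloroquine**

and it has also suggested other possible drugs for the treatment of rheumatoid arthritis. In the run shown above, these are:
* Mobic

* Acetaminophen / hydrocodone

* Meloxicam

* Celecoxib

Because the negative sampling and the test sampling are random, other runs may surface different drugs.

Whether these drugs actually work, however, can only be confirmed by the **scientists working in these fields**.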
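
The ranking step used in the prediction cell above can be reproduced on toy scores. The sketch below compares the dictionary plus `heapq.nlargest` route with an equivalent NumPy route; the scores and drug IDs are made up for illustration.

```python
import heapq
import numpy as np

# Made-up scores for five hypothetical drug IDs (array index = drug ID)
scores = np.array([0.12, 0.91, 0.47, 0.88, 0.05])
k = 3

# Dictionary + heapq, the same pattern as the prediction cell
score_by_drug = {drug: float(s) for drug, s in enumerate(scores)}
top_heapq = heapq.nlargest(k, score_by_drug, key=score_by_drug.get)

# Equivalent NumPy route: sort by descending score, keep the first k IDs
top_numpy = np.argsort(-scores)[:k].tolist()

print(top_heapq, top_numpy)
```

Both routes return the drug IDs ordered from highest to lowest score, so either can feed the ID-to-name lookup.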





This was a basic model for drug prediction.

Modifications can be made to learn more complex relations.

  A few possible modifications:
  

*   We can consider the features of the drug (such as protein structure, activity in different environments, ...)
*   A larger dataset can widen the range of possibilities and lead to better predictions
* Using a deeper NN structure
* Fine-tuning the hyperparameters








## **References**

[1] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu and Tat-Seng Chua (2017). Neural Collaborative Filtering. In Proceedings of WWW '17, Perth, Australia, April 03-07, 2017.


