![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/tigerlily_logo.jpg)

# 🐯 TigerLily: Finding drug interactions in silico with the Graph 🐯

### 1. What is TigerLily?
### 2. What do we achieve by using TigerLily?
### 3. Why do we care?

![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring.jpg)

# 1. Imports 🚢

#### The imports below rely on the [TigerLily](https://github.com/benedekrozemberczki/tigerlily) library, whose PyPI description is available [here](https://pypi.org/project/tiger/). It can be installed with pip.

## 1.1. Tigerlily specific imports 🐅

We use **TigerLily** classes and functions from the following namespaces:
    
- ``dataset``: Tools for loading the example dataset.
- ``embedding``: Machine learning tools for creating node embeddings.
- ``pagerank``: Learning personalized pagerank scores of nodes.
- ``operator``: Functions to generate the drug pair features.

In [68]:
from tigerlily.dataset import ExampleDataset
from tigerlily.pagerank import PersonalizedPageRankMachine
from tigerlily.embedding import EmbeddingMachine
from tigerlily.operator import hadamard_operator, concatenation_operator

## 1.2. General data manipulation and machine learning imports 💾

We use imports that help with the following:

- **[pandas](https://pandas.pydata.org/)** - Tabular data manipulation.
- **[lightgbm](https://lightgbm.readthedocs.io/)** - Gradient boosted trees for classification.
- **[scikit-learn](https://scikit-learn.org/stable/)** - Train-test set split generations and evaluation metrics.

In [69]:
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 2. DrugBank DDI 💊 and BioSNAP 🧬 Dataset Loading

We need drug-drug interactions and a biological graph; we use these public dataset sources:
    
- [Biological Graph - BioSNAP](http://snap.stanford.edu/biodata/)
- [Drug Interactions - DrugBank DDI from Therapeutics Data Commons](https://tdcommons.ai/multi_pred_tasks/ddi/)

In [70]:
dataset = ExampleDataset()

In [71]:
edges = dataset.read_edges()
target = dataset.read_target()

In [72]:
target.head(5)

|  | drug_1 | drug_2 | label |
|---|---|---|---|
| 0 | DB00424 | DB08897 | 1 |
| 1 | DB00670 | DB06148 | 1 |
| 2 | DB00391 | DB00517 | 1 |
| 3 | DB01090 | DB09076 | 1 |
| 4 | DB00391 | DB00462 | 1 |
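Before modelling it can help to check how the labels are distributed and how many distinct drugs appear. The sketch below rebuilds only the five rows shown above as a stand-in table (the name `toy_target` is an assumption for illustration, the real `target` table is much larger), so the numbers it prints say nothing about the full dataset.

```python
import pandas as pd

# Stand-in for the first five rows of `target` shown above.
toy_target = pd.DataFrame({
    "drug_1": ["DB00424", "DB00670", "DB00391", "DB01090", "DB00391"],
    "drug_2": ["DB08897", "DB06148", "DB00517", "DB09076", "DB00462"],
    "label": [1, 1, 1, 1, 1],
})

# Share of each label value among the rows.
label_share = toy_target["label"].value_counts(normalize=True)

# Distinct drug identifiers across both columns.
distinct_drugs = pd.unique(toy_target[["drug_1", "drug_2"]].to_numpy().ravel())

print(label_share)
print(f"Distinct drugs: {len(distinct_drugs)}")
```

The same two lines can be run on the full `target` table to see whether the classes are balanced.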


In [73]:
edges.head(5)

|  | node_1 | node_2 | type_1 | type_2 |
|---|---|---|---|---|
| 0 | DB00008 | ENTREZ:995300 | drug | gene |
| 1 | DB00008 | ENTREZ:306914 | drug | gene |
| 2 | DB00009 | ENTREZ:387026 | drug | gene |
| 3 | DB00009 | ENTREZ:13591824 | drug | gene |
| 4 | DB00009 | ENTREZ:37605 | drug | gene |
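A quick way to get a feel for the edge list is to count edges per type pair and the number of genes attached to each drug. This is a minimal sketch on a stand-in table that copies the five rows above (`toy_edges` is a hypothetical name); on the full `edges` table the same calls would also surface the gene-gene edges.

```python
import pandas as pd

# Stand-in for the first five rows of `edges` shown above.
toy_edges = pd.DataFrame({
    "node_1": ["DB00008", "DB00008", "DB00009", "DB00009", "DB00009"],
    "node_2": ["ENTREZ:995300", "ENTREZ:306914", "ENTREZ:387026",
               "ENTREZ:13591824", "ENTREZ:37605"],
    "type_1": ["drug"] * 5,
    "type_2": ["gene"] * 5,
})

# Number of edges for each (source type, target type) combination.
edge_type_counts = toy_edges.groupby(["type_1", "type_2"]).size()

# Number of distinct neighbours per source node.
genes_per_drug = toy_edges.groupby("node_1")["node_2"].nunique()

print(edge_type_counts)
print(genes_per_drug)
```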


# 3. PageRank Computation with TigerGraph 🐯



![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_A.jpg)

## 3.1. Establishing a connection and installing the Personalized PageRank query 📡

You need to provide the following connection details:

- The host of your TigerGraph Cloud instance.
- The name of the drug-gene graph that you want to populate.
- Your username.
- The secret for the graph.
- The password of the TigerGraph Cloud user.

In [91]:
host = "host_here_please_replace"

graphname = "graphname_here_please_replace"

username = "usrname_here_please_replace"

secret = "secret_here_please_replace"

password = "password_here_please_replace"
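Hard-coding credentials in a notebook makes them easy to leak. One alternative, sketched here under the assumption that you export the values as environment variables, is to read them at run time. The variable names (`TG_HOST` and so on) and the helper `load_credentials` are hypothetical and not part of TigerLily.

```python
import os

# Hypothetical environment variable names, one per connection detail.
ENV_TO_ARGUMENT = {
    "TG_HOST": "host",
    "TG_GRAPHNAME": "graphname",
    "TG_USERNAME": "username",
    "TG_SECRET": "secret",
    "TG_PASSWORD": "password",
}


def load_credentials(environment=os.environ):
    """Collect the connection details from a mapping such as os.environ."""
    missing = [name for name in ENV_TO_ARGUMENT if name not in environment]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {argument: environment[name] for name, argument in ENV_TO_ARGUMENT.items()}


# Demonstration with a plain dictionary instead of the real environment.
example_environment = {name: f"{name.lower()}_value" for name in ENV_TO_ARGUMENT}
credentials = load_credentials(example_environment)
print(sorted(credentials))
```

The resulting dictionary uses the same keyword names as the variables above, so it could be unpacked into the constructor with `PersonalizedPageRankMachine(**credentials)`.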

We will compute the Personalized PageRank of drug nodes using a ``PersonalizedPageRankMachine`` instance.

In [74]:
machine = PersonalizedPageRankMachine(host=host,
                                      graphname=graphname,
                                      username=username,
                                      secret=secret,
                                      password=password)

We first connect to TigerGraph Cloud with the ``.connect()`` method.

In [75]:
machine.connect()

We install the Personalized PageRank query with the ``.install_query()`` method, which also allows the installation of other queries.

In [13]:
machine.install_query()

## 3.2. Defining a Graph and computing Personalized PageRank for the drug nodes

We remove the existing edges and upload the edges from the ``edges`` pd.DataFrame, which contains **drug-gene** and **gene-gene** edges.

In [76]:
machine.upload_graph(new_graph=True, edges=edges)

![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_B.jpg)

Using the graph connection we get the identifiers of the drug nodes.

In [77]:
drug_node_ids = machine.connection.getVertices("drug")

We compute the Personalized PageRank scores for the drug nodes with the ``.get_personalized_pagerank()`` method.

In [79]:
pagerank_scores = machine.get_personalized_pagerank(drug_node_ids)





In [80]:
pagerank_scores.head(10)

|  | node_1 | node_2 | score |
|---|---|---|---|
| 0 | DB06262 | DB06262 | 0.50553 |
| 1 | DB06262 | ENTREZ:178200 | 0.0316 |
| 2 | DB06262 | ENTREZ:29371 | 0.03149 |
| 3 | DB06262 | ENTREZ:433201 | 0.03047 |
| 4 | DB06262 | ENTREZ:178896 | 0.02992 |
| 5 | DB06262 | ENTREZ:178194 | 0.02991 |
| 6 | DB06262 | ENTREZ:178196 | 0.02979 |
| 7 | DB06262 | ENTREZ:177807 | 0.02975 |
| 8 | DB06262 | ENTREZ:189937 | 0.02966 |
| 9 | DB06262 | ENTREZ:178198 | 0.02946 |
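To build intuition for what these scores mean, here is a minimal power-iteration sketch of Personalized PageRank on a toy four-node graph. It is not the query that runs on TigerGraph: the function name, the toy adjacency matrix, the restart probability of 0.5 and the fixed iteration count are all assumptions chosen for illustration.

```python
import numpy as np


def personalized_pagerank(adjacency, source, damping=0.5, n_iter=100):
    """Power-iteration sketch of Personalized PageRank for one source node."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Row-normalise so each row holds the transition probabilities of a node.
    # This assumes that every node has at least one edge.
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    restart = np.zeros(adjacency.shape[0])
    restart[source] = 1.0
    scores = restart.copy()
    for _ in range(n_iter):
        # With probability `damping` follow an edge, otherwise jump back to the source.
        scores = (1.0 - damping) * restart + damping * (transition.T @ scores)
    return scores


# Toy undirected graph: node 0 plays the role of a drug, nodes 1-3 are genes.
toy_adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
toy_ppr = personalized_pagerank(toy_adjacency, source=0)
print(toy_ppr.round(4))
```

The scores sum to one and the source node keeps the largest share, which mirrors the pattern in the table above where the drug itself receives the highest score.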


# 4. Embedding learning from Personalized PageRank scores 🤖

![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_C.jpg)

We do not want to compute all of the PageRank scores here, so we load the pre-computed ones to speed things up.

In [81]:
pagerank_scores = dataset.read_pagerank()

We create an ``EmbeddingMachine`` instance, which will learn a node embedding for each drug. We set the hyperparameters manually.

In [82]:
embedding_machine = EmbeddingMachine(seed=42, dimensions=32, max_iter=20)

We learn an embedding with the ``.fit()`` method and look at the embedding matrix.

In [83]:
embedding = embedding_machine.fit(pagerank_scores)
embedding.head(10)

|  | node_id | emb_0 | emb_1 | emb_2 | emb_3 | emb_4 | emb_5 | emb_6 | emb_7 | emb_8 | ... | emb_22 | emb_23 | emb_24 | emb_25 | emb_26 | emb_27 | emb_28 | emb_29 | emb_30 | emb_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DB01620 | -0.079809 | -0.042563 | -0.042563 | -0.125164 | -0.14237 | -0.067374 | -0.052152 | -0.218553 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051984 | -0.127584 | -0.211926 | -0.231118 | 0.48347 | -0.090635 | -0.050701 |
| 1 | DB00521 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.142349 | -0.067374 | -0.052152 | -0.208243 | -0.030083 | ... | -0.054623 | -0.052735 | -0.051751 | -0.051984 | -0.174903 | -0.211926 | -0.154916 | -0.322516 | -0.073146 | -0.050701 |
| 2 | DB00843 | -0.079809 | -0.042563 | -0.042563 | -0.124773 | -0.14237 | -0.067374 | -0.052152 | -0.216274 | -0.030083 | ... | -0.054677 | -0.052692 | -0.051751 | -0.051984 | -0.142617 | -0.209812 | -0.25039 | 0.014273 | -0.090635 | -0.050701 |
| 3 | DB06262 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.14237 | -0.067374 | -0.052152 | 3.546667 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051984 | -0.174903 | -0.211926 | -0.260254 | -0.322516 | -0.090635 | -0.050701 |
| 4 | DB00415 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.14237 | -0.067374 | -0.052152 | 6.872592 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051984 | -0.174903 | -0.211926 | -0.260254 | -0.322516 | -0.090635 | -0.050701 |
| 5 | DB00818 | -0.079809 | -0.042563 | -0.042563 | -0.125154 | -0.140378 | -0.067374 | -0.052152 | -0.218659 | -0.030083 | ... | -0.05461 | -0.052676 | -0.051751 | -0.051255 | 0.096379 | -0.17573 | -0.237928 | 0.231855 | -0.089306 | -0.050701 |
| 6 | DB00754 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.14237 | -0.067374 | -0.052152 | -0.218861 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051984 | -0.174296 | -0.211926 | 0.006208 | -0.322516 | -0.086557 | -0.050701 |
| 7 | DB00956 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.14237 | -0.067374 | -0.052152 | -0.219 | -0.030083 | ... | -0.054677 | -0.05271 | -0.051751 | -0.051889 | -0.174903 | -0.210148 | -0.259712 | 0.662643 | -0.090635 | -0.048227 |
| 8 | DB00204 | -0.079809 | -0.042563 | -0.042563 | -0.122611 | -0.131676 | -0.067374 | -0.052152 | -0.162693 | -0.030083 | ... | -0.053413 | -0.04743 | -0.051751 | -0.046824 | -0.171833 | -0.187591 | -0.231425 | -0.317716 | -0.0852 | -0.050701 |
| 9 | DB00517 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.14209 | -0.067374 | -0.052152 | 0.511181 | -0.030083 | ... | -0.054677 | -0.052399 | -0.050132 | -0.050549 | -0.172673 | -0.080634 | -0.218975 | -0.305161 | -0.090635 | 0.058491 |
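One plausible way to turn a long table of PageRank scores into fixed-size drug vectors is to pivot it into a drug-by-node matrix and factorise that matrix. The sketch below does this with a plain singular value decomposition on made-up scores. It is an assumption-laden illustration of the general idea and should not be read as a description of what ``EmbeddingMachine`` does internally; all node names and values are invented.

```python
import numpy as np
import pandas as pd

# Made-up long-format scores with the same columns as `pagerank_scores`.
toy_scores = pd.DataFrame({
    "node_1": ["drug_a", "drug_a", "drug_b", "drug_b", "drug_c", "drug_c"],
    "node_2": ["gene_x", "gene_y", "gene_x", "gene_z", "gene_y", "gene_z"],
    "score": [0.6, 0.4, 0.7, 0.3, 0.5, 0.5],
})

# One row per source drug, one column per target node, zeros where no score exists.
score_matrix = toy_scores.pivot_table(
    index="node_1", columns="node_2", values="score", fill_value=0.0
)

# Keep the leading singular directions as a low-dimensional representation.
dimensions = 2
left, singular_values, _ = np.linalg.svd(score_matrix.to_numpy(), full_matrices=False)
toy_embedding = pd.DataFrame(
    left[:, :dimensions] * singular_values[:dimensions],
    index=score_matrix.index,
    columns=[f"emb_{i}" for i in range(dimensions)],
)
print(toy_embedding)
```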


# 5. Classifier Training and Inference 🔮


![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_D.jpg)

We generate drug pair features with the TigerLily [Hadamard operator](https://snap.stanford.edu/node2vec/), which combines the embeddings of the two drugs in each pair.

In [84]:
drug_pair_features = embedding_machine.create_features(target, hadamard_operator)
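The difference between the two operators imported earlier is easiest to see on small vectors. The sketch below uses plain NumPy stand-ins rather than the TigerLily functions, with invented three-dimensional embeddings: the Hadamard product multiplies the vectors element by element, while concatenation places them side by side.

```python
import numpy as np

# Invented embeddings for the two drugs of one pair.
drug_1_embedding = np.array([0.5, -1.0, 2.0])
drug_2_embedding = np.array([4.0, 0.5, -1.5])

# Hadamard product: element-wise multiplication, same length as the inputs.
hadamard_features = drug_1_embedding * drug_2_embedding

# Concatenation: the two vectors side by side, twice the length.
concatenated_features = np.concatenate([drug_1_embedding, drug_2_embedding])

print(hadamard_features)      # [ 2.  -0.5 -3. ]
print(concatenated_features)  # [ 0.5 -1.   2.   4.   0.5 -1.5]
```

The Hadamard product gives the same features regardless of the order of the two drugs, whereas concatenation depends on which drug comes first and doubles the feature dimension.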

We define a gradient boosted tree classifier, create dataset splits and fit the model to the training portion.

In [85]:
model = LGBMClassifier(learning_rate=0.01, n_estimators=100)

X_train, X_test, y_train, y_test = train_test_split(drug_pair_features,
                                                    target,
                                                    train_size=0.8,
                                                    random_state=42)

model.fit(X_train,y_train["label"])

LGBMClassifier(learning_rate=0.01)

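We predict the interaction probabilities for the test set with ``predict_proba()``; the second column holds the probability of the positive class.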

In [86]:
predicted_label = model.predict_proba(X_test)

We evaluate the predictions with the area under the ROC curve (AUROC).

In [87]:
auroc_score_value = roc_auc_score(y_test["label"], predicted_label[:, 1])
print(f'AUROC score: {auroc_score_value:.4f}')

AUROC score: 0.9475


Let us take a closer look at those scores!

In [88]:
# Build a new frame instead of writing into the split result in place.
y_test = y_test.assign(prediction=predicted_label[:, 1])
y_test.head(10)

|  | drug_1 | drug_2 | label | prediction |
|---|---|---|---|---|
| 157672 | DB00850 | DB00260 | 0 | 0.360778 |
| 117533 | DB00936 | DB01337 | 0 | 0.428537 |
| 17186 | DB00384 | DB01024 | 1 | 0.589512 |
| 93575 | DB00547 | DB09118 | 1 | 0.650235 |
| 99136 | DB00582 | DB00648 | 1 | 0.607848 |
| 31134 | DB00199 | DB01179 | 1 | 0.700403 |
| 131013 | DB01117 | DB00921 | 0 | 0.542052 |
| 93887 | DB00679 | DB09238 | 1 | 0.669026 |
| 40741 | DB00483 | DB06204 | 1 | 0.750296 |
| 176400 | DB00419 | DB00850 | 0 | 0.479897 |
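The ten rows above can be used to illustrate what AUROC measures: the share of (positive, negative) pairs in which the positive example receives the higher score. The helper `pairwise_auroc` below is a hypothetical pure-Python sketch of that definition, not the scikit-learn implementation, and it only looks at these ten rows rather than the full test set.

```python
# Labels and predictions copied from the ten rows shown above.
labels = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
predictions = [0.360778, 0.428537, 0.589512, 0.650235, 0.607848,
               0.700403, 0.542052, 0.669026, 0.750296, 0.479897]


def pairwise_auroc(labels, scores):
    """Fraction of positive-negative pairs ranked correctly (ties count as half)."""
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if positive > negative else 0.5 if positive == negative else 0.0
        for positive in positives
        for negative in negatives
    )
    return wins / (len(positives) * len(negatives))


sample_auroc = pairwise_auroc(labels, predictions)
print(f"AUROC on the ten rows: {sample_auroc:.4f}")  # 1.0000
```

Every interacting pair in this small sample is scored above every non-interacting pair, so the value is 1.0 here even though the AUROC on the whole test set is lower.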


# 6. Ideas and Readings 🤔

This notebook is ideal to get started with **TigerLily** on a very specific problem. It can be extended and you can also read more about drug interaction prediction models, tasks and datasets!

## 6.1. Ideas, extensions and potential applications 💡

- Try out other classifiers such as logistic regression.

- Tune the Personalized PageRank computation, node embedding and classifier hyperparameters!

- Early warning systems for drug discovery in the pre-clinical phase.

- Augment drug synergy prediction systems and do multi-objective-optimization by using interaction scores.

- Predict polypharmacy side effects of drug pairings with TigerLily.
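As a starting point for the first idea in the list, here is a minimal sketch of a logistic regression baseline evaluated with the same metric. It runs on synthetic data from `make_classification` so that it is self-contained; the sample size, feature count and `max_iter` value are arbitrary assumptions, and in the notebook you would pass `drug_pair_features` and the `label` column instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the drug pair features and labels.
features, labels = make_classification(n_samples=500, n_features=32, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    train_size=0.8,
                                                    random_state=42)

# A linear baseline to compare against the gradient boosted trees.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

baseline_auroc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"AUROC score: {baseline_auroc:.4f}")
```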

## 6.2. Readings 📘 📗 📙

### 6.2.1. Papers 📚

- [Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development](https://arxiv.org/abs/2102.09548)
- [ChemicalX: A Deep Learning Library for Drug Pair Scoring](https://arxiv.org/abs/2202.05240)
- [Modeling Polypharmacy Side Effects with Graph Convolutional Networks](https://academic.oup.com/bioinformatics/article/34/13/i457/5045770)
- [A Unified View of Relational Deep Learning for Drug Pair Scoring](https://arxiv.org/abs/2111.02916)

### 6.2.2. Links 🕸️

- [TigerGraph](https://www.tigergraph.com/)
- [TigerGraph Cloud](https://tgcloud.io/)
- [TigerGraph Data Science](https://www.tigergraph.com/graph-data-science-library/)
- [ChemicalX](https://github.com/AstraZeneca/chemicalx)
- [Therapeutics Data Commons](https://tdcommons.ai/)
- [BioSNAP](http://snap.stanford.edu/biodata/)
- [Awesome Drug Pair Scoring](https://github.com/AstraZeneca/awesome-drug-pair-scoring)

# 7. Author 🦸

- Author: Benedek Rozemberczki
- E-mail: benedek.rozemberczki@gmail.com
- Date: 2022.04.12.