![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/tigerlily_logo.jpg)

# 🐯 TigerLily: Finding drug interactions in silico with the Graph 🐯

### 1. What is TigerLily?
### 2. What do we achieve by using TigerLily?
### 3. Why do we care?

![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring.jpg)

# 1. Imports 🚢

#### The imports that we use depend on the [TigerLily](https://github.com/benedekrozemberczki/tigerlily) library - the PyPI description of the library is available [here](https://pypi.org/project/tiger/). It can be installed with the pip command!

## 1.1. Tigerlily specific imports 🐅

We use **TigerLily** classes and functions from the following namespaces:
    
- ``dataset``: Tools for loading the example dataset.
- ``embedding``: Machine learning tools for creating node embeddings.
- ``pagerank``: Learning personalized pagerank scores of nodes.
- ``operator``: Functions to generate the drug pair features.

In [13]:
from tigerlily.dataset import ExampleDataset
from tigerlily.pagerank import PersonalizedPageRankMachine
from tigerlily.embedding import EmbeddingMachine
from tigerlily.operator import hadamard_operator, concatenation_operator

## 1.2. General data manipulation and machine learning imports 💾

We are going to use imports that help with the following:

- **[pandas](https://pandas.pydata.org/)** - Tabular data manipulation.
- **[lightgbm](https://lightgbm.readthedocs.io/)** - Gradient boosted trees for classification.
- **[scikit-learn](https://scikit-learn.org/stable/)** - Train-test set split generations and evaluation metrics.

In [14]:
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 2. DrugBank DDI 💊 and BioSNAP 🧬 Dataset Loading

We need drug-drug interactions and a biological graph; we use these public dataset sources:
    
- [Biological Graph - BioSNAP](http://snap.stanford.edu/biodata/)
- [Drug Interactions - DrugBank DDI from Therapeutics Data Commons](https://tdcommons.ai/multi_pred_tasks/ddi/)

In [15]:
dataset = ExampleDataset()

In [16]:
edges = dataset.read_edges()
target = dataset.read_target()

In [17]:
target.head(5)

|  | drug_1 | drug_2 | label |
|---|---|---|---|
| 0 | DB00424 | DB08897 | 1 |
| 1 | DB00670 | DB06148 | 1 |
| 2 | DB00391 | DB00517 | 1 |
| 3 | DB01090 | DB09076 | 1 |
| 4 | DB00391 | DB00462 | 1 |


In [18]:
edges.head(5)

|  | node_1 | node_2 | type_1 | type_2 |
|---|---|---|---|---|
| 0 | DB00008 | ENTREZ:995300 | drug | gene |
| 1 | DB00008 | ENTREZ:306914 | drug | gene |
| 2 | DB00009 | ENTREZ:387026 | drug | gene |
| 3 | DB00009 | ENTREZ:13591824 | drug | gene |
| 4 | DB00009 | ENTREZ:37605 | drug | gene |
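
Before uploading the graph, it can help to check how many edges connect each pair of node types. The following is a minimal sketch on a hand-made DataFrame (`toy_edges` is a hypothetical name) that has the same columns as `edges`; the first two rows come from the preview above and the gene-gene rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `edges`; the gene-gene rows are hypothetical.
toy_edges = pd.DataFrame({
    "node_1": ["DB00008", "DB00009", "ENTREZ:1", "ENTREZ:2"],
    "node_2": ["ENTREZ:995300", "ENTREZ:387026", "ENTREZ:2", "ENTREZ:3"],
    "type_1": ["drug", "drug", "gene", "gene"],
    "type_2": ["gene", "gene", "gene", "gene"],
})

# Count the edges for every (source type, target type) pair.
edge_type_counts = toy_edges.groupby(["type_1", "type_2"]).size()
print(edge_type_counts)
```

The same `groupby` call should work on the real `edges` DataFrame, assuming it keeps the `type_1` and `type_2` columns shown above.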


# 3. PageRank Computation with TigerGraph 🐯



![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_A.jpg)

## 3.1. Establishing a connection and installing the Personalized PageRank query 📡

To connect, you need to provide the following:

- The host address of your TigerGraph Cloud instance.
- The name of the drug-gene graph that you want to populate.
- The secret for the graph.
- The password for the TigerGraph Cloud user.

In [20]:
host = "https://tigerlily.i.tgcloud.io"

graphname = "tester"

secret = "secret_goes_here"

password = "password_goes_here"
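
Hard-coding secrets in a notebook makes them easy to leak. One option, sketched below, is to read the four settings from environment variables and fall back to the placeholders. The helper `read_setting` and the `TIGERLILY_*` variable names are hypothetical and not part of TigerLily.

```python
import os


def read_setting(name, default):
    """Return the environment variable `name`, or `default` if it is unset."""
    return os.environ.get(name, default)


# Hypothetical variable names; the defaults mirror the placeholders above.
host = read_setting("TIGERLILY_HOST", "https://tigerlily.i.tgcloud.io")
graphname = read_setting("TIGERLILY_GRAPHNAME", "tester")
secret = read_setting("TIGERLILY_SECRET", "secret_goes_here")
password = read_setting("TIGERLILY_PASSWORD", "password_goes_here")
```

With this in place the notebook can be shared without editing the cell, as long as each user exports the variables in their own shell.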

We will compute the Personalized PageRank of drug nodes using a ``PersonalizedPageRankMachine`` instance.

In [None]:
machine = PersonalizedPageRankMachine(host=host,
                                      graphname=graphname,
                                      secret=secret,
                                      password=password)

We first connect to the TigerGraph Cloud instance by using the ``.connect()`` method.

In [None]:
machine.connect()

We install the Personalized PageRank query; the ``.install_query()`` method also allows the installation of other queries.

In [None]:
machine.install_query()

We remove the existing edges and upload the edges from the ``edges`` ``pd.DataFrame``, which contains **drug-gene** and **gene-gene** edges.

In [None]:
machine.upload_graph(new_graph=True, edges=edges)

## 3.2. Defining a Graph and computing Personalized PageRank for the drug nodes

![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_B.jpg)

Using the graph connection we get the identifiers of the drug nodes.

In [None]:
drug_node_ids = machine.connection.getVertices("drug")

We compute the Personalized PageRank scores for the drug nodes with the ``.get_personalized_pagerank()`` method.

In [None]:
pagerank_scores = machine.get_personalized_pagerank(drug_node_ids)

In [28]:
pagerank_scores.head(10)

|  | node_1 | node_2 | score | node_1_num | node_2_num |
|---|---|---|---|---|---|
| 0 | DB01620 | DB01620 | 0.15897 | 1005 | 3813 |
| 1 | DB01620 | ENTREZ:510296 | 0.07913 | 1005 | 2954 |
| 2 | DB01620 | ENTREZ:458542 | 0.07542 | 1005 | 7 |
| 3 | DB01620 | DB02709 | 0.00927 | 1005 | 1441 |
| 4 | DB01620 | DB00366 | 0.00902 | 1005 | 828 |
| 5 | DB01620 | DB00737 | 0.00902 | 1005 | 2918 |
| 6 | DB01620 | DB01708 | 0.00875 | 1005 | 3151 |
| 7 | DB01620 | DB01026 | 0.00848 | 1005 | 2508 |
| 8 | DB01620 | DB00836 | 0.00844 | 1005 | 2999 |
| 9 | DB01620 | DB00257 | 0.00835 | 1005 | 515 |
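
To build intuition for what these scores mean, here is a minimal sketch of Personalized PageRank computed by power iteration on a tiny toy graph. It is not the query that TigerLily installs on TigerGraph; the function name, the restart probability of 0.15 and the toy graph are all assumptions made for illustration.

```python
import numpy as np


def personalized_pagerank(adjacency, source, restart=0.15, n_iter=100):
    """Power iteration for Personalized PageRank with restarts at `source`."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Column-normalise so that each column is a distribution over neighbours.
    out_degree = adjacency.sum(axis=0)
    transition = adjacency / np.where(out_degree == 0, 1.0, out_degree)
    # The random walker jumps back to the source node with probability `restart`.
    restart_vector = np.zeros(adjacency.shape[0])
    restart_vector[source] = 1.0
    scores = restart_vector.copy()
    for _ in range(n_iter):
        scores = restart * restart_vector + (1.0 - restart) * (transition @ scores)
    return scores


# Toy undirected path graph 0 - 1 - 2 - 3 (hypothetical).
toy_adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = personalized_pagerank(toy_adjacency, source=0)
print(scores.round(4))
```

Each row of the table above can be read the same way: for a given source drug in `node_1`, the `score` column says how much of the random walk's probability mass lands on `node_2`.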


# 4. Embedding learning from Personalized PageRank scores 🤖

![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_C.jpg)

We do not want to compute all of the PageRank scores, so let us load the pre-computed ones to speed things up.

In [23]:
pagerank_scores = dataset.read_pagerank()

We create an ``EmbeddingMachine`` instance that will learn a node embedding for each drug; we set the hyperparameters manually.

In [44]:
embedding_machine = EmbeddingMachine(seed=42, dimensions=32, max_iter=20)

We learn an embedding with the ``fit()`` method and look at the embedding matrix.

In [35]:
embedding = embedding_machine.fit(pagerank_scores)
embedding.head(10)

|  | node_id | emb_0 | emb_1 | emb_2 | emb_3 | emb_4 | emb_5 | emb_6 | emb_7 | emb_8 | ... | emb_22 | emb_23 | emb_24 | emb_25 | emb_26 | emb_27 | emb_28 | emb_29 | emb_30 | emb_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DB01620 | -0.079809 | -0.042563 | -0.042563 | -0.125073 | -0.142232 | -0.067374 | -0.052152 | -0.216095 | -0.030083 | ... | -0.054677 | -0.052727 | -0.051751 | -0.051983 | -0.139338 | -0.211693 | -0.245172 | 0.186335 | -0.050412 | -0.090606 |
| 1 | DB00521 | -0.079809 | -0.042563 | -0.042563 | -0.125216 | -0.139214 | -0.067374 | -0.052152 | -0.217174 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051983 | -0.108249 | -0.211809 | -0.258564 | 0.295233 | -0.05005 | -0.090606 |
| 2 | DB00843 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.142371 | -0.067374 | -0.052152 | -0.216126 | -0.030083 | ... | -0.054677 | -0.052711 | -0.051751 | -0.051983 | -0.174959 | 0.049575 | -0.260398 | -0.322347 | -0.050412 | 1.762666 |
| 3 | DB06262 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.142371 | -0.067374 | -0.052152 | -0.17535 | -0.030083 | ... | -0.054677 | -0.044198 | -0.051751 | -0.051983 | -0.174959 | 0.730457 | -0.165776 | -0.322347 | -0.044207 | -0.090606 |
| 4 | DB00415 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | 5.974596 | -0.067374 | -0.052152 | -0.219001 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051983 | -0.174959 | -0.211809 | -0.260398 | -0.322347 | -0.050412 | -0.090606 |
| 5 | DB00818 | -0.079809 | -0.042563 | -0.042563 | -0.1241 | -0.142371 | -0.067374 | -0.052152 | -0.208823 | -0.030083 | ... | -0.054677 | -0.052656 | -0.051751 | -0.051983 | -0.137955 | -0.206205 | -0.248846 | 0.022307 | -0.050412 | -0.090606 |
| 6 | DB00754 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.142364 | -0.067374 | -0.052152 | -0.218645 | -0.030083 | ... | -0.054677 | -0.052731 | -0.051751 | -0.051949 | -0.133832 | -0.211368 | -0.25896 | -0.303721 | -0.049431 | -0.090606 |
| 7 | DB00956 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.142371 | -0.067374 | -0.052152 | -0.219001 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051983 | -0.174959 | -0.211809 | 4.888365 | -0.322347 | -0.050412 | -0.090606 |
| 8 | DB00204 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.142371 | -0.067374 | -0.052152 | -0.219001 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051983 | -0.174959 | -0.211809 | -0.260398 | -0.322347 | -0.050412 | -0.090606 |
| 9 | DB00517 | -0.079809 | -0.042563 | -0.042563 | -0.125227 | -0.142371 | -0.067374 | -0.052152 | -0.219001 | -0.030083 | ... | -0.054677 | -0.052735 | -0.051751 | -0.051983 | -0.174959 | -0.211809 | -0.260398 | 3.350473 | -0.050412 | -0.090606 |
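
The general idea of this step is to turn long-format scores (`node_1`, `node_2`, `score`) into one fixed-length vector per source node. The sketch below illustrates that idea with a plain truncated SVD in NumPy. This is a stand-in chosen for illustration and not necessarily what ``EmbeddingMachine`` does internally; the toy scores and all variable names are hypothetical.

```python
import numpy as np

# Toy long-format scores (hypothetical values): source node, target node, score.
toy_scores = [
    ("A", "A", 0.6), ("A", "B", 0.3), ("A", "C", 0.1),
    ("B", "B", 0.7), ("B", "A", 0.2), ("B", "C", 0.1),
    ("C", "C", 0.8), ("C", "A", 0.1), ("C", "B", 0.1),
]

# Map node identifiers to row and column positions.
nodes = sorted({node for row in toy_scores for node in row[:2]})
index = {node: position for position, node in enumerate(nodes)}

# Dense source-by-target matrix; a sparse matrix would be preferable at real scale.
matrix = np.zeros((len(nodes), len(nodes)))
for source, target_node, score in toy_scores:
    matrix[index[source], index[target_node]] = score

# Keep the top `dimensions` singular directions as the embedding of each source node.
dimensions = 2
u, s, _ = np.linalg.svd(matrix, full_matrices=False)
toy_embedding = u[:, :dimensions] * s[:dimensions]
print(toy_embedding.shape)  # (3, 2)
```

Nodes with similar score profiles end up with similar rows in `toy_embedding`, which is what makes the vectors useful as classifier inputs.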


# 5. Classifier Training and Inference 🔮


![title](https://github.com/benedekrozemberczki/datasets/raw/master/images/pair_scoring_D.jpg)

We will generate drug pair features using the TigerLily [Hadamard operator](https://snap.stanford.edu/node2vec/).

In [38]:
drug_pair_features = embedding_machine.create_features(target, hadamard_operator)
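
For readers unfamiliar with the two operators imported earlier, here is a minimal NumPy sketch of what a Hadamard product and a concatenation do to a pair of embedding vectors. The vectors are hypothetical, and the snippet illustrates the general operations rather than the exact TigerLily implementation.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings for two drugs.
drug_1_embedding = np.array([1.0, 2.0, 0.0, -1.0])
drug_2_embedding = np.array([0.5, -1.0, 3.0, 2.0])

# Hadamard product: element-wise multiplication, keeps the dimensionality.
hadamard_features = drug_1_embedding * drug_2_embedding

# Concatenation: stacks both vectors, doubles the dimensionality.
concatenated_features = np.concatenate([drug_1_embedding, drug_2_embedding])

print(hadamard_features.tolist())   # [0.5, -2.0, 0.0, -2.0]
print(concatenated_features.shape)  # (8,)
```

The Hadamard product gives the same features regardless of the order of the two drugs, while concatenation depends on which drug comes first.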

We define a gradient boosted tree classifier, create dataset splits and fit the model to the training portion.

In [39]:
model = LGBMClassifier(learning_rate=0.01, n_estimators=100)

X_train, X_test, y_train, y_test = train_test_split(drug_pair_features,
                                                    target,
                                                    train_size=0.8,
                                                    random_state=42)

model.fit(X_train,y_train["label"])

LGBMClassifier(learning_rate=0.01)

We predict the interaction probabilities for the test set.

In [40]:
predicted_label = model.predict_proba(X_test)

We compute the area under the ROC curve (AUROC) on the test set as the performance metric for the predictive task.

In [41]:
auroc_score_value = roc_auc_score(y_test["label"], predicted_label[:,1])
print(f'AUROC score: {auroc_score_value :.4f}')

AUROC score: 0.9510
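
AUROC can be read as the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair. The pure Python sketch below computes it by brute force on toy data to make that interpretation concrete. The function `auroc` is a hypothetical helper; in practice `roc_auc_score` is the better choice because this version is quadratic in the number of samples.

```python
def auroc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label.")
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): one positive is ranked below one negative.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.1]
print(auroc(toy_labels, toy_scores))  # 0.75
```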


Let us take a closer look at those scores!

In [42]:
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
y_test = y_test.assign(prediction=predicted_label[:, 1])
y_test.head(10)

|  | drug_1 | drug_2 | label | prediction |
|---|---|---|---|---|
| 157672 | DB00850 | DB00260 | 0 | 0.314553 |
| 117533 | DB00936 | DB01337 | 0 | 0.321906 |
| 17186 | DB00384 | DB01024 | 1 | 0.66596 |
| 93575 | DB00547 | DB09118 | 1 | 0.608919 |
| 99136 | DB00582 | DB00648 | 1 | 0.551621 |
| 31134 | DB00199 | DB01179 | 1 | 0.635063 |
| 131013 | DB01117 | DB00921 | 0 | 0.3595 |
| 93887 | DB00679 | DB09238 | 1 | 0.584536 |
| 40741 | DB00483 | DB06204 | 1 | 0.624107 |
| 176400 | DB00419 | DB00850 | 0 | 0.404385 |
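
To turn the predicted probabilities into hard labels, a cut-off is needed. The sketch below applies an assumed threshold of 0.5 to the ten rows shown above and checks how many match the true label. Ten rows are far too few to judge the model, so this only illustrates the mechanics.

```python
# Labels and predicted probabilities copied from the ten rows shown above.
labels = [0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
predictions = [0.314553, 0.321906, 0.66596, 0.608919, 0.551621,
               0.635063, 0.3595, 0.584536, 0.624107, 0.404385]

threshold = 0.5  # assumed cut-off, not tuned

# A pair is flagged as interacting when its probability reaches the threshold.
predicted_labels = [int(probability >= threshold) for probability in predictions]
accuracy = sum(p == l for p, l in zip(predicted_labels, labels)) / len(labels)
print(accuracy)  # 1.0
```

In a real application the threshold would typically be chosen on a validation set, depending on how costly missed interactions are compared with false alarms.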


# 6. Ideas and Readings 🤔

This notebook is a starting point for using **TigerLily** on one specific problem. It can be extended, and you can also read more about drug interaction prediction models, tasks and datasets!

## 6.1. Ideas, extensions and potential applications 💡

- Try out other classifiers such as logistic regression.

- Tune the Personalized PageRank computation, node embedding and classifier hyperparameters!

- Early warning systems for drug discovery in the pre-clinical phase.

- Augment drug synergy prediction systems and do multi-objective-optimization by using interaction scores.

- Predict polypharmacy side effects of drug pairings with TigerLily.
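
As a starting point for the first idea, the sketch below swaps in a scikit-learn logistic regression. It runs on synthetic stand-in data so that it is self-contained; the arrays `toy_features` and `toy_labels` are hypothetical and would be replaced by the drug pair features and the label column from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for drug pair features and labels (hypothetical data).
rng = np.random.default_rng(42)
toy_features = rng.normal(size=(500, 8))
toy_labels = (toy_features[:, 0] + 0.5 * toy_features[:, 1] > 0).astype(int)

train_x, test_x, train_y, test_y = train_test_split(
    toy_features, toy_labels, train_size=0.8, random_state=42
)

# A linear baseline; max_iter is raised to give the solver room to converge.
linear_model = LogisticRegression(max_iter=1000)
linear_model.fit(train_x, train_y)

toy_auroc = roc_auc_score(test_y, linear_model.predict_proba(test_x)[:, 1])
print(f"AUROC score: {toy_auroc:.4f}")
```

A linear model like this is a useful baseline: if it comes close to the gradient boosted trees, the embedding features are carrying most of the signal.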

## 6.2. Readings 📘 📗 📙

### 6.2.1. Papers 📚

- [Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development](https://arxiv.org/abs/2102.09548)
- [ChemicalX: A Deep Learning Library for Drug Pair Scoring](https://arxiv.org/abs/2202.05240)
- [Modeling Polypharmacy Side Effects with Graph Convolutional Networks](https://academic.oup.com/bioinformatics/article/34/13/i457/5045770)
- [A Unified View of Relational Deep Learning for Drug Pair Scoring](https://arxiv.org/abs/2111.02916)

### 6.2.2. Links 🕸️

- [TigerGraph](https://www.tigergraph.com/)
- [TigerGraph Cloud](https://tgcloud.io/)
- [TigerGraph Data Science](https://www.tigergraph.com/graph-data-science-library/)
- [ChemicalX](https://github.com/AstraZeneca/chemicalx)
- [Therapeutics Data Commons](https://tdcommons.ai/)
- [BioSNAP](http://snap.stanford.edu/biodata/)
- [Awesome Drug Pair Scoring](https://github.com/AstraZeneca/awesome-drug-pair-scoring)

# 7. Author 🦸

- Author: Benedek Rozemberczki
- E-mail: benedek.rozemberczki@gmail.com
- Date: 2022.04.12.