<a href="https://colab.research.google.com/github/ersilia-os/event-fund-ai-drug-discovery/blob/main/notebooks/session4_breakout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Session 4: Generative Models

This notebook explores generative models based on similarity searches: given a molecule as an initial hit, each model looks for similar molecules in a virtually generated library.
We will use our top 10 hits from Session 2 to explore how these models work.

## Initial Hits

Each generative model requires a starting point, a molecule that will serve as a blueprint for the generation of novel molecules.

In this exercise, we will use the top 10 hits from Session 2 that you selected from the MMV Malaria Box. These should be stored in your Drive under h3d_ersilia_ai_workshop/data/session2.
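Before relying on the hits file, it can help to confirm that it has the `input` column read in the cells below and that it contains no empty or repeated entries. The following is a minimal, dependency-free sketch of that check. The helper name `read_smiles_column` and the toy rows are hypothetical and only stand in for the real selection file.

```python
import csv
import io

def read_smiles_column(handle, column="input"):
    """Return the non-empty, de-duplicated values of `column`, in file order."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    seen = set()
    smiles = []
    for row in reader:
        value = (row[column] or "").strip()
        # Skip blank cells and molecules that already appeared higher up
        if value and value not in seen:
            seen.add(value)
            smiles.append(value)
    return smiles

# Toy file contents standing in for the real selection (made-up rows)
toy_csv = io.StringIO("input,score\nCCO,0.9\nc1ccccc1,0.8\nCCO,0.7\n,0.1\n")
toy_smiles = read_smiles_column(toy_csv)
print(toy_smiles)
```

With the real file, you would pass an open file handle instead of the `io.StringIO` object.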

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import the necessary packages
import pandas as pd

# Open the selection as a pandas DataFrame
data = "drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/mmv_malariabox_selection.csv"
df = pd.read_csv(data)
df.head()

In [None]:
smiles = df["input"].tolist()

In [None]:
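# Select the top hit and look at its structure.
# Python lists are zero-indexed, so the first row is smiles[0]
# (this assumes the file is sorted with the best hit first).
smi = smiles[0]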

In [None]:
smi

In [None]:
%%capture
# Revisit the Session 2 skills development notebook if you are unsure how to visualise molecules with RDKit
!pip install rdkit

from rdkit import Chem
from rdkit.Chem import Draw

In [None]:
mol = Chem.MolFromSmiles(smi)
Draw.MolToImage(mol)

## Ersilia Model Hub
First, we need to install Ersilia in this Google Colab notebook.

In [None]:
%%capture
#@markdown Click on the play button to install Ersilia in this Colab notebook.
%env MINICONDA_INSTALLER_SCRIPT=Miniconda3-py37_4.12.0-Linux-x86_64.sh
%env MINICONDA_PREFIX=/usr/local
%env PYTHONPATH={PYTHONPATH}:/usr/local/lib/python3.7/site-packages
%env CONDA_PREFIX=/usr/local
%env CONDA_PREFIX_1=/usr/local
%env CONDA_DIR=/usr/local
%env CONDA_DEFAULT_ENV=base
!wget https://repo.anaconda.com/miniconda/$MINICONDA_INSTALLER_SCRIPT
!chmod +x $MINICONDA_INSTALLER_SCRIPT
!./$MINICONDA_INSTALLER_SCRIPT -b -f -p $MINICONDA_PREFIX
!python -m pip install git+https://github.com/ersilia-os/ersilia.git
!python -m pip install requests --upgrade
import sys
_ = (sys.path.append("/usr/local/lib/python3.7/site-packages"))

### Fetching Similarity Models
We will work with two similarity models:
* eos4b8j: gdbchembl-similarity
* eos7jlv: gdbmedchem-similarity

In short, each of these models searches a virtually generated library of molecules to identify the 100 most similar to the starting point. You can read more about them in their respective publications ([gdbchembl](https://www.frontiersin.org/articles/10.3389/fchem.2020.00046/full) and [gdbmedchem](https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201900031)).

GDBChEMBL contains a collection of 166.4 billion possible molecules of up to 17 atoms and is browsable [here](http://faerun.gdb.tools/). GDBMedChem is a curated version of GDBChEMBL that restricts the search space to 10 million [molecules](http://gdb.unibe.ch).
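To make "the 100 most similar" concrete, here is a minimal sketch of a nearest-neighbour search that ranks a small library by Tanimoto similarity. It is an illustration only: the fingerprints are made-up sets of feature ids, the function names are hypothetical, and Tanimoto scoring is an assumption for teaching purposes. The two models run their own search on a remote server and may score similarity differently.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of features."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def most_similar(query_fp, library, k=100):
    """Return the k library entries most similar to the query, best first."""
    scored = [(tanimoto(query_fp, fp), name) for name, fp in library.items()]
    # Sort by descending similarity, breaking ties by name for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k]]

# Hypothetical toy fingerprints: each molecule is a set of made-up feature ids
toy_library = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 9},
    "mol_c": {7, 8, 9},
    "mol_d": {1, 2},
}
query = {1, 2, 3, 4}
top2 = most_similar(query, toy_library, k=2)
print(top2)
```

Scoring every entry like this is only practical for small libraries, which is one reason the search over the full databases is delegated to the hosted models.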

*Disclaimer: both of these models post predictions online. If you are concerned about IP privacy, check the publications for more information on their data policy.*

## Generating 100 molecules from the top hit
Together we will walk through an example of how to generate new candidates from the best molecule we found in the MMV Malaria Box, using the GDBChEMBL similarity search.

In [None]:
# Write the molecule you want predictions for here:
molecule = "CCCCC"

In [None]:
#@title GDBChEMBL Similarity
#@markdown Press the play button to run a prediction!
!ersilia fetch eos4b8j
from ersilia import ErsiliaModel

model = ErsiliaModel("eos4b8j")
model.serve()
output = model.predict(input=molecule, output="pandas")
model.close()
output.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/eos4b8j.csv", index=False)

In [None]:
#@title GDBMedChem Similarity
#@markdown Press the play button to run a prediction!
!ersilia fetch eos7jlv
from ersilia import ErsiliaModel

model = ErsiliaModel("eos7jlv")
model.serve()
output = model.predict(input=molecule, output="pandas")
model.close()
output.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/eos7jlv.csv", index=False)
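Once both models have run, you may want a single list of candidates. The sketch below merges several lists of SMILES strings, dropping the query molecule, empty entries, and repeats. The function name `merge_candidates` and the example lists are hypothetical stand-ins. How you extract SMILES strings from each model's output table depends on its columns, which are not assumed here.

```python
def merge_candidates(query, *result_lists):
    """Merge several lists of SMILES, dropping the query, empties, and repeats."""
    merged = []
    seen = {query}
    for results in result_lists:
        for smi in results:
            # Keep only non-empty strings that have not been seen yet
            if smi and smi not in seen:
                seen.add(smi)
                merged.append(smi)
    return merged

# Made-up stand-ins for the molecules returned by the two models
from_gdbchembl = ["CCCCCO", "CCCCC", "CCCCN"]
from_gdbmedchem = ["CCCCN", "CCCC=O", None]
candidates = merge_candidates("CCCCC", from_gdbchembl, from_gdbmedchem)
print(candidates)
```

This compares SMILES as plain strings, so the same molecule written in two different ways would be kept twice. Converting each string to a canonical SMILES with RDKit first (`Chem.MolToSmiles(Chem.MolFromSmiles(smi))`) avoids that.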

In [None]:
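# Run a single prediction without saving it and print the raw results
from ersilia import ErsiliaModel

model = ErsiliaModel("eos7jlv")
model.serve()
output = model.predict(input="CCCC")
for x in output:
    print(x)
# Close the model once all results have been read
model.close()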