# FEgrow: An Open-Source Molecular Builder and Free Energy Preparation Workflow

**Authors: Mateusz K Bieniek, Ben Cree, Rachael Pirie, Joshua T. Horton, Natalie J. Tatum, Daniel J. Cole**

## Overview

For parallelisation we employ Dask. This happens inside ChemSpace, which spreads the work across all the CPU cores on the workstation.

Parallelisation covers not only the building, parameterising and scoring of the molecules, but also the heaviest parts of active learning, such as fingerprint computation and Tanimoto distances.
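To make the fingerprint step concrete, here is a minimal sketch of Tanimoto distances computed over a pool of workers. It is an illustration only: plain Python sets stand in for fingerprints, a standard-library thread pool stands in for Dask, and the names `tanimoto` and `pairwise_distances` are hypothetical rather than part of FEgrow.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity: shared 'on' bits divided by all 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pairwise_distances(fingerprints: list) -> dict:
    """Return {(i, j): 1 - similarity} for every unordered pair."""
    pairs = list(combinations(range(len(fingerprints)), 2))

    def distance(pair):
        i, j = pair
        return 1.0 - tanimoto(fingerprints[i], fingerprints[j])

    # Each pair is independent, so the work can be mapped over a pool.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(zip(pairs, pool.map(distance, pairs)))


# Toy fingerprints: each set holds the indices of the bits that are on.
fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7, 8})]
distances = pairwise_distances(fps)
print(distances)
```

Threads are used here only to keep the sketch short. Pure-Python work like this gains little from threads, which is one reason to hand it to worker processes instead.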

Dask, however, can work with more than one workstation. It can:
 - schedule jobs on an HPC system and run them directly there
 - connect to many different PCs via SSH and run the jobs there
 - run jobs in the cloud (AWS and others)
 - and more
 
Here we showcase a few ways to tell Dask which computing platform to use.
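One way to keep the choice of platform in a single place is a small lookup table. The sketch below is an assumption about how you might organise this, not something FEgrow provides: `CLUSTER_SPECS`, `cluster_spec` and `make_cluster` are hypothetical names, and the keyword values are placeholders. The cluster classes themselves (`LocalCluster`, `SSHCluster`, `SLURMCluster`) are the real ones used in the rest of this notebook.

```python
import importlib

# Module and class names of real Dask cluster classes; the keyword
# arguments are placeholder values chosen for this sketch.
CLUSTER_SPECS = {
    "local": ("dask.distributed", "LocalCluster", {"n_workers": 2}),
    "ssh": ("dask.distributed", "SSHCluster", {"hosts": ["localhost", "localhost"]}),
    "slurm": ("dask_jobqueue", "SLURMCluster", {"cores": 8, "memory": "16GB"}),
}


def cluster_spec(platform: str):
    """Look up (module, class name, kwargs) for a platform keyword."""
    try:
        return CLUSTER_SPECS[platform]
    except KeyError:
        known = sorted(CLUSTER_SPECS)
        raise ValueError(f"unknown platform {platform!r}; choose from {known}") from None


def make_cluster(platform: str):
    """Import the cluster class lazily and instantiate it.

    Needs dask (and dask_jobqueue for "slurm") to be installed.
    """
    module_name, class_name, kwargs = cluster_spec(platform)
    cluster_class = getattr(importlib.import_module(module_name), class_name)
    return cluster_class(**kwargs)


print(cluster_spec("local"))
```

The resulting cluster would then be handed over with `ChemSpace(dask_cluster=make_cluster("local"))`, inside an `if __name__ == "__main__":` block. Importing lazily means that `dask_jobqueue` is only required when the SLURM option is actually chosen.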

In [None]:
import pandas as pd
import prody
from rdkit import Chem

import fegrow
from fegrow import ChemSpace

from fegrow.testing import core_5R83_path, rec_5R83_path, data_5R83_path

# Prepare the ligand template

In [None]:
scaffold = Chem.SDMolSupplier(core_5R83_path)[0]

Because we are using pre-prepared SMILES that already contain the scaffold as a substructure, there is no need to set a growing vector.

<div class="alert alert-block alert-warning">
    <b>ALWAYS</b> create the cluster inside an <b>if __name__ == "__main__":</b> block in your code
</div>
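The reason for the guard is easiest to see with the standard library. With the "spawn" start method, each worker process is a fresh interpreter that re-imports the main script, so any cluster or pool created at the top level would be created again in every worker. Dask's local worker processes are, by default, started in a similar way. The sketch below uses `multiprocessing` rather than Dask and is meant to be saved and run as a script; `square` and `run_pool` are hypothetical names.

```python
import multiprocessing as mp


def square(x: int) -> int:
    return x * x


def run_pool(values):
    # "spawn" starts fresh interpreters that re-import this file.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # Only the original process reaches this branch. The re-imported copies
    # in the workers skip it, so they do not try to start pools of their own.
    print(run_pool([1, 2, 3]))
```

Moving the `run_pool` call out of the guarded block would make every worker attempt to start its own pool on import, which Python rejects with an error about the bootstrapping phase.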

In [None]:
from dask.distributed import LocalCluster


# "False and" keeps this cell from running; remove it to start the cluster
if False and __name__ == "__main__":
    lc = LocalCluster(n_workers=2)
    # create the chemical space
    cs = ChemSpace(dask_cluster=lc)

# From now on ChemSpace uses your own cluster. A local cluster is also what ChemSpace uses by default.

<div class="alert alert-block alert-info">
<b>SSH workers? </b> Yes, use SSHCluster! </div>
You can use SSH to add workstations as workers. First, I recommend setting up your ~/.ssh/config file with your hosts. Then ensure that they all share the same conda environment, i.e. the same versions of Python, Dask and other packages.
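A mismatch between environments is easy to miss, so it helps to compare the versions each host reports. The sketch below shows the comparison itself on made-up data; `find_version_mismatches` is a hypothetical helper, not a Dask or FEgrow function. On a running cluster, `Client.get_versions(check=True)` from `dask.distributed` performs a comparable check for you.

```python
def find_version_mismatches(versions_by_host: dict) -> dict:
    """Return {package: {host: version}} for packages whose version differs."""
    packages = set()
    for versions in versions_by_host.values():
        packages.update(versions)

    mismatches = {}
    for package in sorted(packages):
        # A host that lacks the package is recorded as None.
        seen = {host: v.get(package) for host, v in versions_by_host.items()}
        if len(set(seen.values())) > 1:
            mismatches[package] = seen
    return mismatches


# Made-up example data, not output from a real cluster.
reported = {
    "localhost": {"python": "3.10.13", "dask": "2023.12.0"},
    "larch": {"python": "3.10.13", "dask": "2024.1.0"},
}
print(find_version_mismatches(reported))
```

An empty result means every host reported the same version of every package.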

In [None]:
# see https://docs.dask.org/en/stable/deploying-ssh.html for more
from dask.distributed import SSHCluster

if False and __name__ == "__main__":
    lc = SSHCluster(
        [
        # NOTE: add your public key to ~/.ssh/authorized_keys
        'localhost', #  first: scheduler
        'localhost', #  workers from now on. Run workers on localhost too
        'larch'      #  keep adding workstations, use ~/.ssh/config to define PCs
         # NOTE: you can attach many workstations here as a list! 
        ],
        # NOTE: update your firewall (UFW, iptables, firewalld, etc)
        #       to allow the scheduler port TCP 8343
        scheduler_options={"port": 8343, "dashboard_address": ":8989"}, 
        # processes per host
        worker_options={"n_workers": 3}, 
        # best to ensure that the python path is universal across the PCs        
        # remote_python='/home/nmb1063/mamba/envs/fegrow/bin/python'
       )
    
    cs = ChemSpace(dask_cluster=lc)

In [None]:
# Here is an example for Archer2 that might be a good start
from dask_jobqueue import SLURMCluster  # check out the documentation!
import dask

def create_archer_cluster():
    # Archer has its own instructions for Dask;
    # adapt these settings to the jobs you run
    cluster = SLURMCluster(account='e123-proj', 
                           queue='standard', 
                           job_extra_directives=['--nodes=1', '--qos=standard'], 
                           #n_workers=2,
                           #silence_logs='debug',
                           processes=8,
                           cores=128, 
                           job_cpu=128, 
                           memory="256GB", 
                           job_directives_skip=['--mem', '-n 1', '-N 1'], 
                           walltime='15:10:00', 
                           interface='hsn0', 
                           shebang="#!/bin/bash --login",
                           local_directory='$PWD',
                           job_script_prologue=['hostname', 
                                                'ip addr',
                                                'eval "$(/work/../conda shell.bash hook)"', 
                                                'conda activate env1', 
                                                'export OPENMM_CPU_THREADS=1'
                                               ], 
                           scheduler_options={'dashboard_address': 'localhost:9224'})
    print("JOB Script: ", cluster.job_script(), "END")
    # request 5 nodes
    cluster.scale(jobs=5)
    return cluster
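The numbers passed to `SLURMCluster` interact with each other. As far as we understand dask-jobqueue, `cores` and `memory` describe one whole job and are divided among its `processes`. The arithmetic below works through the values used above under that assumption; the variable names are ours and purely illustrative.

```python
cores_per_job = 128      # cores=128
memory_gb_per_job = 256  # memory="256GB"
processes_per_job = 8    # processes=8
jobs = 5                 # cluster.scale(jobs=5)

# Each job is split into `processes` worker processes.
threads_per_worker = cores_per_job // processes_per_job
memory_gb_per_worker = memory_gb_per_job / processes_per_job

# Totals across all requested jobs.
total_workers = jobs * processes_per_job
total_threads = total_workers * threads_per_worker

print(threads_per_worker, memory_gb_per_worker, total_workers, total_threads)
```

Under this reading, each of the 5 jobs runs 8 worker processes with 16 threads and 32 GB each, giving 40 worker processes and 640 threads in total.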

In [None]:
# we're not growing the scaffold, we're superimposing bigger molecules on it
cs.add_scaffold(scaffold)
cs.add_protein(rec_5R83_path)

In [None]:
# load 50k smiles dataset from the study
smiles = pd.read_csv(data_5R83_path).Smiles.to_list()

# for testing, sort by size and pick small
smiles.sort(key=len)
# take 5 smallest smiles
smiles = smiles[:5]
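Sorting by `len` orders the SMILES by character count, which is only a rough proxy for molecular size because ring-closure digits, brackets and bond symbols also count as characters. A small sketch with made-up SMILES, unrelated to the dataset, shows the effect:

```python
# benzene (6 heavy atoms), heptane (7 heavy atoms), ethanol (3 heavy atoms)
smiles = ["c1ccccc1", "CCCCCCC", "CCO"]

# In-place sort, shortest string first.
smiles.sort(key=len)
print(smiles)

# Character counts: the ring-closure digits make benzene the longest string,
# even though heptane has more heavy atoms.
lengths = [len(s) for s in smiles]
print(lengths)
```

For a quick test selection this is good enough. A stricter ordering would count heavy atoms instead of characters.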

In [None]:
# here we add SMILES which should already have been matched
# to the scaffold (RDKit Mol.HasSubstructMatch)
cs.add_smiles(smiles[:3], protonate=False)
evaluated = cs.evaluate()