# `ShapeLinker`: Sampling mode
Sample from a trained ShapeLinker model. This notebook is adapted from [ReinventCommunity](https://github.com/MolecularAI/ReinventCommunity).

In [9]:
# load dependencies
import os
import re
import json
import tempfile

root = os.path.dirname(os.getcwd())
# --------- change these path variables as required
reinvent_dir = os.path.join(root, 'Reinvent')

reinvent_env = os.path.expanduser("/opt/conda/envs/shapelinker")
output_dir = os.path.join(root, "results/sampling/5T35_test")
data_folder = os.path.join(root, 'utils/data')
agents_folder = os.path.join(root, 'models/agents')

# if required, generate a folder to store the results
os.makedirs(output_dir, exist_ok=True)

## Setting up the configuration
`REINVENT` has an entry point that loads a specified `JSON` file on startup. `JSON` is a plain-text data format that makes it possible to specify a fairly large number of parameters in a nested fashion. The parameters are structured into *blocks*, which can in turn contain further blocks or simple values such as *True* or *False*, strings and numbers. In this tutorial, we go through the different blocks step by step, explaining their purpose and the values chosen for the parameters. Note that while we write out the configuration as a `JSON` file in the end, in `python` we handle the same information as a simple `dict`.
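
To make the idea of blocks within blocks concrete, here is a minimal sketch with a toy dictionary. The name `toy_config` and its values are made up for illustration and are not part of the notebook's configuration; the sketch only shows how a nested `dict` maps to `JSON` text and back.

```python
import json

# hypothetical toy configuration, for illustration only
toy_config = {
    "run_type": "sampling",          # simple value (string)
    "logging": {                     # a block that holds further values
        "recipient": "local",
    },
    "parameters": {
        "num_samples": 10,           # simple value (number)
        "with_likelihood": True,     # Python True is written as JSON true
    },
}

# serialize the nested dictionary to JSON text
toy_json = json.dumps(toy_config, indent=4, sort_keys=True)
print(toy_json)

# parsing the text again gives back an equal dictionary
assert json.loads(toy_json) == toy_config
```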

In [10]:
# initialize the dictionary
configuration = {
    "version": 3,                          # we are going to use REINVENT's newest release
    "run_type": "sampling",                # other run types: "scoring", "validation",
                                           #                  "transfer_learning",
                                           #                  "reinforcement_learning" and
                                           #                  "create_model"
    "model_type": "link_invent"
}

In [11]:
# add block to specify whether to run locally or not and
# where to store the results and logging
configuration["logging"] = {
    "sender": "http://127.0.0.1",          # only relevant if "recipient" is set to "remote"
    "recipient": "local",                  # either log locally ("local") or use a remote REST interface ("remote")
    "logging_path": os.path.join(output_dir, "progress.log"), # where the run's output is stored
    "job_name": "Sampling prior",          # set an arbitrary job name for identification
    "job_id": "N/A"                        # only relevant if "recipient" is set to "remote"
}

We will need to specify a path to an agent (parameter `model_path`), which can be a prior or focused agent.

In [12]:
# set up the file path where the generated molecules are
# deposited (as SMILES)
output_SMILES_path = os.path.join(output_dir, "sampled", "sampled.csv")

# add the "parameters" block
configuration["parameters"] = {
    "model_path": os.path.join(agents_folder, 'Agent.shapelinker_5T35.ckpt'),
    "output_path": output_SMILES_path,     # output path where SMILES will be deposited (as csv)
    "num_samples": 10000,                  # number of molecules to sample per warhead pair
    "batch_size": 32,                      # internal batch size; should match the one used
                                           # during training
    "with_likelihood": True,               # also write a per-sample likelihood column
    "temperature": 1.5,                    # temperature for sampling
    "randomize_warheads": True,            # randomize the warheads
    "warheads": [                          # warhead pairs, each written as "warhead1|warhead2"
        '*C(=O)N[C@H](C(=O)N1C[C@H](O)C[C@H]1C(=O)NCc1ccc(C2=C(C)NCS2)cc1)C(C)(C)C|*NC(=O)C[C@@H]1N=C(c2ccc(Cl)cc2)c2c(sc(C)c2C)-n2c(C)nnc21'
    ],
    }
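
Each entry in `warheads` packs both warheads into one string separated by `|`, and each warhead carries a `*` that appears to mark where the linker attaches (the `input` column of the sampled output uses the same layout). The sketch below is a plain string check for that layout. `split_warheads` is a hypothetical helper, not a REINVENT function, and it performs no chemical validation.

```python
from typing import Tuple


def split_warheads(warhead_pair: str) -> Tuple[str, str]:
    """Split 'warheadA|warheadB' and check that each part has exactly one '*'."""
    parts = warhead_pair.split("|")
    if len(parts) != 2:
        raise ValueError(
            "expected two warheads separated by '|', got %d part(s)" % len(parts)
        )
    for part in parts:
        if part.count("*") != 1:
            raise ValueError("expected exactly one '*' in warhead: " + part)
    return parts[0], parts[1]


# shortened, made-up fragments used only to demonstrate the helper
first, second = split_warheads("*C(=O)NC|*NC(=O)CC")
print(first)
print(second)
```

In the notebook, the same check could be applied to every entry of `configuration["parameters"]["warheads"]` before the configuration is written out.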

In [13]:
# write the configuration file to disk
configuration_JSON_path = os.path.join(output_dir, "sampling_config.json")
with open(configuration_JSON_path, 'w') as f:
    json.dump(configuration, f, indent=4, sort_keys=True)
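
Before launching a run, it can help to check that the paths in the `parameters` block point somewhere sensible. The following is a minimal sketch under stated assumptions: `check_sampling_paths` is a hypothetical helper, and whether REINVENT creates a missing output directory on its own is not stated here, so the second check is only a precaution.

```python
import os
import tempfile


def check_sampling_paths(parameters):
    """Return a list of human-readable problems found in a 'parameters' block."""
    problems = []
    # the agent checkpoint has to exist before sampling can start
    if not os.path.isfile(parameters["model_path"]):
        problems.append("model_path does not exist: " + parameters["model_path"])
    # precaution only: flag a missing output directory
    out_dir = os.path.dirname(parameters["output_path"])
    if out_dir and not os.path.isdir(out_dir):
        problems.append("output directory does not exist: " + out_dir)
    return problems


# demonstration with placeholder paths inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_parameters = {
        "model_path": os.path.join(tmp, "missing.ckpt"),
        "output_path": os.path.join(tmp, "sampled", "sampled.csv"),
    }
    demo_problems = check_sampling_paths(demo_parameters)

for problem in demo_problems:
    print(problem)
```

In the notebook, the call would be `check_sampling_paths(configuration["parameters"])`; an empty list means both checks passed.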

## Run
Now it is time to execute the model locally. Since no weights are updated, execution should be fast. The result is a CSV file of `SMILES` strings in the `sampled` directory specified above.

The command-line execution looks like this:
```
# activate environment
conda activate shapelinker

# execute file
python <your_path>/input.py <config>.json
```
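
Outside a notebook, the same call can be issued from Python with the standard library's `subprocess` module. This is a minimal sketch: `build_command` and `run_sampling` are hypothetical helpers, and the paths in the example are placeholders that need to be adapted to the local setup.

```python
import os
import subprocess


def build_command(env_dir, reinvent_dir, config_path):
    """Assemble the interpreter, entry point and configuration file as a list."""
    return [
        os.path.join(env_dir, "bin", "python"),
        os.path.join(reinvent_dir, "input.py"),
        config_path,
    ]


def run_sampling(cmd):
    """Run the command and return (return code, stdout, stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


# placeholder paths; only the command is built here, nothing is executed
cmd = build_command("/opt/conda/envs/shapelinker", "Reinvent", "sampling_config.json")
print(" ".join(cmd))
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces. Calling `run_sampling(cmd)` would start the actual run, which requires the environment and the REINVENT checkout to be present.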

In [14]:
%%capture captured_err_stream --no-stderr

# execute from the command-line
!{reinvent_env}/bin/python {reinvent_dir}/input.py {configuration_JSON_path}

In [15]:
# write the captured output (stdout) to a file, just to have it for documentation
with open(os.path.join(output_dir, "run.err"), 'w') as file:
    file.write(captured_err_stream.stdout)

In [16]:
# print the first lines of the resulting SMILES file
!head -n 15 {output_dir}/sampled/sampled.csv

molecules,input,linker,likelihood
CC1=C(c2ccc(CNC(=O)C3CC(O)CN3C(=O)C(NC(=O)NC(=O)c3ccc(NS(=O)(=O)COc4cccc(S(=O)(=O)Nc5ccc(COC(=O)NC(=O)CC6N=C(c7ccc(Cl)cc7)c7c(sc(C)c7C)-n7c(C)nnc76)cc5)c4)cc3)C(C)(C)C)cc2)SCN1,C1(c2ccc(CNC(=O)C3CC(O)CN3C(=O)C(C(C)(C)C)NC(*)=O)cc2)=C(C)NCS1|O=C(N*)CC1c2n(c(C)nn2)-c2sc(C)c(C)c2C(c2ccc(Cl)cc2)=N1,O(CS(=O)(=O)Nc1ccc(C(N[*])=O)cc1)c1cccc(S(Nc2ccc(COC(=O)[*])cc2)(=O)=O)c1,40.88142776489258
CC1=C(c2ccc(CNC(=O)C3CC(O)CN3C(=O)C(NC(=O)OS(=O)(=O)Oc3ccc(S(=O)(=O)c4cn(CC(=O)NC(=O)CC5N=C(c6ccc(Cl)cc6)c6c(sc(C)c6C)-n6c(C)nnc65)nc4S(=O)(=O)C4(C)CCC4)cc3)C(C)(C)C)cc2)SCN1,C1(c2ccc(CNC(=O)C3CC(O)CN3C(=O)C(C(C)(C)C)NC(*)=O)cc2)=C(C)NCS1|O=C(N*)CC1c2n(c(C)nn2)-c2sc(C)c(C)c2C(c2ccc(Cl)cc2)=N1,O(S(O[*])(=O)=O)c1ccc(S(=O)(=O)c2cn(CC(=O)[*])nc2S(=O)(=O)C2(C)CCC2)cc1,37.759666442871094
CC1=C(c2ccc(CNC(=O)C3CC(O)CN3C(=O)C(NC(=O)NS(=O)(=O)OCCSCCSCCSCCNS(=O)(=O)c3cccc(S(=O)(=O)NC(=O)CC4N=C(c5ccc(Cl)cc5)c5c(sc(C)c5C)-n5c(C)nnc54)c3)C(C)(C)C)cc2)SCN1,C1(c2ccc(CNC(=O)C3CC(O)CN3C(=O