# `ShapeLinker`: Sampling mode
Sample from a trained ShapeLinker model. This notebook is adapted from [ReinventCommunity](https://github.com/MolecularAI/ReinventCommunity).

In [37]:
# load dependencies
import os
import re
import json
import tempfile

root = os.path.dirname(os.getcwd())
# --------- change these path variables as required
reinvent_dir = os.path.join(root, 'Reinvent')

reinvent_env = os.path.expanduser("/opt/conda/envs/shapelinker")
output_dir = os.path.join(root, "results/sampling/5T35_test")
data_folder = os.path.join(root, 'utils/data')
agents_folder = os.path.join(root, 'models/agents')

# create the results folder (including any missing parent folders) if required
os.makedirs(output_dir, exist_ok=True)

## Setting up the configuration
`REINVENT` has an entry point that loads a specified `JSON` file on startup. `JSON` is a lightweight text format that lets you specify a fairly large number of parameters in a nested fashion. The parameters are structured into *blocks*, which can in turn contain blocks or simple values such as *True* or *False*, strings and numbers. In this tutorial, we go through the different blocks step by step, explaining their purpose and the values used for the given parameters. Note that while we write out the configuration as a `JSON` file in the end, in `python` we handle the same information as a simple `dict`.
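
To make the block structure concrete, here is a minimal sketch of how a nested `dict` maps to `JSON` and back. The keys and values in `example_config` are illustrative only and are not a complete `REINVENT` configuration.

```python
import json

# Hypothetical miniature configuration: top-level values plus one nested block.
example_config = {
    "run_type": "sampling",
    "logging": {                      # a block containing simple values
        "recipient": "local",
        "job_name": "demo",
    },
}

# Blocks can be added later, exactly like any other dictionary entry.
example_config["parameters"] = {"with_likelihood": True, "num_samples": 10}

# Python's True/False are written as JSON true/false on serialization.
serialized = json.dumps(example_config, indent=4, sort_keys=True)

# Reading the text back yields an equal dictionary.
restored = json.loads(serialized)

print(serialized)
```

Because the round trip is lossless for these value types, the configuration can be built up and checked in `python` before anything is written to a file.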

In [38]:
# initialize the dictionary
configuration = {
    "version": 3,                          # REINVENT version targeted by this configuration
    "run_type": "sampling",                # other run types: "scoring", "validation",
                                           #                  "transfer_learning",
                                           #                  "reinforcement_learning" and
                                           #                  "create_model"
    "model_type": "link_invent"
}

In [39]:
# add a block that specifies whether to log locally or remotely and
# where to store the results and logs
configuration["logging"] = {
    "sender": "http://127.0.0.1",          # only relevant if "recipient" is set to "remote"
    "recipient": "local",                  # either "local" logging or a "remote" REST interface
    "logging_path": os.path.join(output_dir, "progress.log"), # where the run's output is stored
    "job_name": "Sampling prior",          # set an arbitrary job name for identification
    "job_id": "N/A"                        # only relevant if "recipient" is set to "remote"
}

We will need to specify a path to an agent (parameter `model_path`), which can be a prior or focused agent.

In [40]:
# set up the file path where the generated molecules are
# deposited (as SMILES)
output_SMILES_path = os.path.join(output_dir, "sampled", "sampled.csv")

# add the "parameters" block
configuration["parameters"] = {
    "model_path": os.path.join(agents_folder, 'Agent.shapelinker_5T35.ckpt'),
    "output_path": output_SMILES_path,              # output path where the SMILES are deposited (as csv)
    "num_samples": 10000,                           # how many molecules to sample per warhead pair
    "batch_size": 32,                               # internal batch size; should match the one used
                                                    # during training
    "with_likelihood": True,                        # also provide the log-likelihood
    "temperature": 1.5,                             # temperature for sampling
    "randomize_warheads": True,                     # randomize the warheads
    "warheads": ['*C(=O)N[C@H](C(=O)N1C[C@H](O)C[C@H]1C(=O)NCc1ccc(C2=C(C)NCS2)cc1)C(C)(C)C|*NC(=O)C[C@@H]1N=C(c2ccc(Cl)cc2)c2c(sc(C)c2C)-n2c(C)nnc21'], # warheads to use
    }
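
Judging from the example above, each entry in `warheads` holds two fragments separated by `|`, and each fragment marks its attachment point with a single `*`. The sketch below checks an entry against that reading before a run is started. The helper name `split_warhead_pair` is hypothetical and the format rules are inferred from the example, not taken from the `REINVENT` documentation.

```python
from typing import Tuple


def split_warhead_pair(pair: str) -> Tuple[str, str]:
    """Split a 'warhead|warhead' string and check each attachment point.

    Assumption: exactly two fragments, each with exactly one '*'.
    """
    parts = pair.split("|")
    if len(parts) != 2:
        raise ValueError(f"expected 2 warheads separated by '|', got {len(parts)}")
    for fragment in parts:
        if fragment.count("*") != 1:
            raise ValueError(f"expected one '*' attachment point in {fragment!r}")
    return parts[0], parts[1]


# The warhead pair used in the "parameters" block above.
pair = (
    "*C(=O)N[C@H](C(=O)N1C[C@H](O)C[C@H]1C(=O)NCc1ccc(C2=C(C)NCS2)cc1)C(C)(C)C"
    "|*NC(=O)C[C@@H]1N=C(c2ccc(Cl)cc2)c2c(sc(C)c2C)-n2c(C)nnc21"
)
first, second = split_warhead_pair(pair)
print(first)
print(second)
```

This only checks the string layout; it does not verify that the fragments are chemically valid SMILES.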

In [41]:
# write the configuration file to disk
configuration_JSON_path = os.path.join(output_dir, "sampling_config.json")
with open(configuration_JSON_path, 'w') as f:
    json.dump(configuration, f, indent=4, sort_keys=True)
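
The configured `output_path` points into a `sampled` subfolder that the cells above never create. Whether `REINVENT` creates missing folders itself is not stated here, so the following defensive sketch creates the parent folder up front. The helper name `prepare_output_location` is hypothetical, and the demonstration uses a throwaway temporary directory rather than the real `output_dir`.

```python
import os
import tempfile


def prepare_output_location(output_path: str) -> str:
    """Create the parent folder of output_path if needed and return it."""
    parent = os.path.dirname(os.path.abspath(output_path))
    os.makedirs(parent, exist_ok=True)   # no error if the folder already exists
    return parent


# Demonstration in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as scratch:
    demo_path = os.path.join(scratch, "sampled", "sampled.csv")
    created = prepare_output_location(demo_path)
    created_ok = os.path.isdir(created)

print(created_ok)
```

In the notebook, the equivalent call would be `prepare_output_location(output_SMILES_path)` before starting the run.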

## Run
Now we can run the model locally. Because sampling does not update any weights, execution should be fast. The result is a CSV file of `SMILES` in the `sampled` directory specified above.

The command-line execution looks like this:
```
# activate environment
conda activate shapelinker

# execute file
python <your_path>/input.py <config>.json
```
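
Outside a notebook, the same call can be made from `python` with the standard `subprocess` module. This is a sketch, not part of the original workflow: the helper names `build_sampling_command` and `run_sampling` are hypothetical, and the paths shown are the ones assumed in this notebook.

```python
import subprocess
import sys
from typing import List


def build_sampling_command(python_bin: str, entry_point: str, config_path: str) -> List[str]:
    """Assemble the argument list: <python> <input.py> <config>.json."""
    return [python_bin, entry_point, config_path]


def run_sampling(command: List[str]) -> subprocess.CompletedProcess:
    """Run the command, capturing stdout and stderr as text."""
    # check=False: the caller inspects returncode instead of catching an exception
    return subprocess.run(command, capture_output=True, text=True, check=False)


# The command the notebook would run (paths are assumptions from this notebook).
command = build_sampling_command(
    "/opt/conda/envs/shapelinker/bin/python",
    "Reinvent/input.py",
    "sampling_config.json",
)
print(" ".join(command))

# Harmless stand-in run with the current interpreter, to show the return object.
demo = run_sampling([sys.executable, "-c", "print('ok')"])
print(demo.returncode, demo.stdout.strip())
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.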

In [43]:
%%capture captured_err_stream --no-stderr

# execute from the command-line
!{reinvent_env}/bin/python {reinvent_dir}/input.py {configuration_JSON_path}

In [44]:
# write the captured output to a file, just to have it for documentation
with open(os.path.join(output_dir, "run.err"), 'w') as file:
    file.write(captured_err_stream.stdout)

In [45]:
# print the first lines of the resulting SMILES file
!head -n 15 {output_dir}/sampled/sampled.csv

molecules,input,linker,likelihood
Cc1cc(C(C(=O)N2CC(O)CC2C(=O)NC(NCC2CON2C(=O)OCCC2CC(N(C(=O)c3cc(-c4cc(Cl)ccc4C(=O)N4Cc5ccccc5CC4C)n(C)c3C)c3ccccc3)C2)c2ccc(-c3scnc3C)cc2)C(C)C)on1,C1C(O)CC(C(NC(*)c2ccc(-c3scnc3C)cc2)=O)N1C(C(C(C)C)c1onc(C)c1)=O|c1c(N(C(=O)c2cc(-c3cc(Cl)ccc3C(=O)N3C(C)Cc4ccccc4C3)n(C)c2C)C2CC(*)C2)cccc1,O1CC(CN[*])N1C(=O)OCC[*],18.37080192565918
Cc1cc(C(C(=O)N2CC(O)CC2C(=O)NC(CC(=O)NC(C=O)CCC2CC(N(C(=O)c3cc(-c4cc(Cl)ccc4C(=O)N4Cc5ccccc5CC4C)n(C)c3C)c3ccccc3)C2)c2ccc(-c3scnc3C)cc2)C(C)C)on1,C1C(O)CC(C(NC(*)c2ccc(-c3scnc3C)cc2)=O)N1C(C(C(C)C)c1onc(C)c1)=O|c1c(N(C(=O)c2cc(-c3cc(Cl)ccc3C(=O)N3C(C)Cc4ccccc4C3)n(C)c2C)C2CC(*)C2)cccc1,C(=O)C(NC(C[*])=O)CC[*],11.440457344055176
Cc1cc(C(C(=O)N2CC(O)CC2C(=O)NC(CC=Cc2cc(NC3CC(N(C(=O)c4cc(-c5cc(Cl)ccc5C(=O)N5Cc6ccccc6CC5C)n(C)c4C)c4ccccc4)C3)cnn2)c2ccc(-c3scnc3C)cc2)C(C)C)on1,C1C(O)CC(C(NC(*)c2ccc(-c3scnc3C)cc2)=O)N1C(C(C(C)C)c1onc(C)c1)=O|c1c(N(C(=O)c2cc(-c3cc(Cl)ccc3C(=O)N3C(C)Cc4ccccc4C3)n(C)c2C)C2CC(*)C2)cccc1,n1c(C=CC[*])cc(
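
The output has four columns: `molecules`, `input`, `linker` and `likelihood`. A minimal sketch for loading and ranking such a file with the standard `csv` module is shown below. The two sample rows reuse the linker and likelihood values from the output above, with the long `molecules` and `input` entries replaced by placeholders. Whether a lower or higher `likelihood` value is better depends on how `REINVENT` reports it, which is not stated here, so the ascending sort is only an assumption.

```python
import csv
import io

# Two rows abbreviated from the output above; "<...>" entries are placeholders.
sample_csv = """molecules,input,linker,likelihood
<molecule 1>,<warheads>,O1CC(CN[*])N1C(=O)OCC[*],18.37080192565918
<molecule 2>,<warheads>,C(=O)C(NC(C[*])=O)CC[*],11.440457344055176
"""


def load_samples(handle):
    """Read sampled rows and convert the likelihood column to float."""
    rows = []
    for row in csv.DictReader(handle):
        row["likelihood"] = float(row["likelihood"])
        rows.append(row)
    return rows


rows = load_samples(io.StringIO(sample_csv))

# Assumption: sort ascending by the reported likelihood value.
ranked = sorted(rows, key=lambda r: r["likelihood"])
for r in ranked:
    print(r["linker"], r["likelihood"])
```

For the real file, replace the `io.StringIO` object with `open(output_SMILES_path, newline="")`.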