# Ligand ADMET and Potency (Property Prediction)

The [ADMET](https://polarishub.io/competitions/asap-discovery/antiviral-admet-2025) and [Potency](https://polarishub.io/competitions/asap-discovery/antiviral-potency-2025) Challenges of the [ASAP Discovery competition](https://polarishub.io/blog/antiviral-competition) take the shape of property prediction tasks. Given the SMILES (or, to be more precise, the CXSMILES) of a molecule, you are asked to predict numerical properties of that molecule. This is a relatively straightforward application of ML, and this notebook will quickly get you up and running!

To begin with, choose one of the two challenges! The code will look the same for both. 

In [1]:
CHALLENGE = "antiviral-admet-2025"  # or "antiviral-potency-2025"

## Load the competition

Let's first load the competition from Polaris.

Make sure you are logged in! If not, simply run `polaris login` and follow the instructions. 

In [2]:
import polaris as po

competition = po.load_competition(f"asap-discovery/{CHALLENGE}")

As suggested in the logs, we'll cache the dataset. Note that this is not strictly necessary, but it does speed up later steps.

In [3]:
competition.cache()

'C:\\Users\\agitter\\AppData\\Local\\polaris\\polaris\\Cache\\datasets\\2bd81341-152f-40b5-ae1d-31df564cb5ea'

Let's get the train and test set and take a look at the data structure.

In [4]:
train, test = competition.get_train_test_split()

In [5]:
train[0]

('COC1=CC=CC(Cl)=C1NC(=O)N1CCC[C@H](C(N)=O)C1 |a:16|',
 {'LogD': 0.3, 'KSOL': nan, 'HLM': nan, 'MDR1-MDCKII': 2.0, 'MLM': nan})

In [6]:
test[0]

'CC(C)[C@H]1C2=C(CCN1C(=O)CC1=CN=CC3=CC=CC=C13)SC=C2 |o1:3|'
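
Both examples above carry a CXSMILES extension: the part wrapped in `|` characters after the SMILES string. In ChemAxon's extended SMILES notation, `a:16` marks atom 16 as having absolute stereochemistry, while `o1:3` places atom 3 in an "or" enhanced-stereo group. If a tool in your pipeline does not understand the extension, one option is to split it off first. The sketch below assumes the extension is separated from the SMILES by whitespace; `split_cxsmiles` is a hypothetical helper, not part of Polaris or Datamol.

```python
def split_cxsmiles(cxsmiles):
    """Split a CXSMILES string into (smiles, extension).

    Hypothetical helper: assumes the extension block, if present,
    follows the SMILES after a space and is wrapped in '|' characters.
    """
    smiles, _, extension = cxsmiles.strip().partition(" ")
    return smiles, extension.strip().strip("|")


smiles, extension = split_cxsmiles(
    "CC(C)[C@H]1C2=C(CCN1C(=O)CC1=CN=CC3=CC=CC=C13)SC=C2 |o1:3|"
)
print(smiles)     # the plain SMILES, without the extension
print(extension)  # the enhanced-stereo annotation
```

Note that dropping the extension also drops the enhanced-stereo information, so this is only appropriate when your features do not depend on it.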

In [7]:
len(train)

434

In [8]:
len(test)

126

### Raw data dump
We've decided to sacrifice some completeness of the scientific data to make the dataset easier to use. If you are interested, you can also access the raw data dump from which this dataset was created.

In [9]:
import io
import zipfile

import fsspec

# The server does not support range requests, so zipfile cannot seek over HTTP.
# Stream the whole archive into memory first (block_size=0), then unzip from the buffer.
with fsspec.open("https://fs.polarishub.io/2025-01-asap-discovery/raw_data_package.zip", block_size=0) as fd:
    buffer = io.BytesIO(fd.read())

with zipfile.ZipFile(buffer, "r") as zip_ref:
    zip_ref.extractall("./raw_data_package/")

In [8]:
import pandas as pd
from pathlib import Path

subdir = "admet" if CHALLENGE == "antiviral-admet-2025" else "potency"

path = Path("./raw_data_package")
path = path / subdir

csv_files = list(path.glob("*.csv"))
pd.read_csv(csv_files[0]).head(3)

Unnamed: 0,in-vitro_MDR1-MDCKII-Papp_bienta: mean_Papp_A_to_B (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: mean_Papp_A_to_B (Num) (10^-6 cm/s),in-vitro_MDR1-MDCKII-Papp_bienta: SD_Papp_A_to_B (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: SD_Papp_A_to_B (Num),in-vitro_MDR1-MDCKII-Papp_bienta: mean_percent_recovery_A_to_B (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: mean_percent_recovery_A_to_B (Num),in-vitro_MDR1-MDCKII-Papp_bienta: SD_percent_recovery_A_to_B (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: SD_percent_recovery_A_to_B (Num),in-vitro_MDR1-MDCKII-Papp_bienta: mean_Papp_B_to_A (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: mean_Papp_B_to_A (Num) (10^-6 cm/s),in-vitro_MDR1-MDCKII-Papp_bienta: SD_Papp_B_to_A (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: SD_Papp_B_to_A (Num),in-vitro_MDR1-MDCKII-Papp_bienta: mean_percent_recovery_B_to_A (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: mean_percent_recovery_B_to_A (Num),in-vitro_MDR1-MDCKII-Papp_bienta: SD_percent_recovery_B_to_A (Mod),in-vitro_MDR1-MDCKII-Papp_bienta: SD_percent_recovery_B_to_A (Num),Molecule Name,CXSMILES (CDD Compatible),Batch Created Date
0,=,9.46,=,0.359309,=,82.0,=,0.453,,,,,,,,,ASAP-0023274,O=C(NC1=CC(Cl)=CC(C(=O)NC2=CC=CC(N3N=NNC3=O)=C...,2024-04-01
1,=,5.06,=,0.917089,=,85.5,=,2.78,,,,,,,,,ASAP-0023270,O=C(NC1=CC(Cl)=CC(C(=O)NC2=CC=CC(C3=NN=NN3)=C2...,2024-04-01
2,=,2.79,=,0.024695,=,76.0,=,1.78,,,,,,,,,ASAP-0023266,CC[C@H](CC1=NN=NN1)C1=CC=C(NC(=O)C2=CC(Cl)=CC(...,2024-04-01
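
In the raw export, each measurement is split across a `(Mod)` and a `(Num)` column. The rows above only show `=` as the modifier. A modifier column of this kind typically also allows `<` or `>` for values outside the quantifiable range of the assay, but that reading is an assumption here, so check it against the files you downloaded. A minimal sketch of pairing the two columns, using a hypothetical `pair_measurement` helper and a small inline CSV with shortened column names and toy values rather than the real file:

```python
import csv
import io

# Toy stand-in for the raw export; column names are shortened for readability
# and the values for mol-b and mol-c are made up.
RAW = """Molecule Name,Papp_A_to_B (Mod),Papp_A_to_B (Num)
mol-a,=,9.46
mol-b,<,0.5
mol-c,,
"""


def pair_measurement(row, prefix):
    """Return (value, is_exact) for one measurement, or None if it is missing."""
    modifier = row[f"{prefix} (Mod)"].strip()
    number = row[f"{prefix} (Num)"].strip()
    if not number:
        return None
    # Assumption: anything other than '=' marks a censored (bounded) value.
    return float(number), modifier == "="


rows = list(csv.DictReader(io.StringIO(RAW)))
parsed = {r["Molecule Name"]: pair_measurement(r, "Papp_A_to_B") for r in rows}
print(parsed)
```

Keeping the `is_exact` flag around lets you decide later whether to drop censored values or treat them separately.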


## Build a model
Next, we'll train a simple baseline model using scikit-learn. 

You'll notice that the challenge has multiple targets.

In [10]:
train.target_cols

['LogD', 'MDR1-MDCKII', 'MLM', 'KSOL', 'HLM']

An interesting idea would be to build a multi-task model to leverage shared information across tasks.

For simplicity, however, we'll build one model per target here.
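
Because each molecule is only measured for some endpoints, a per-target model sees a different subset of the training set for each target. A minimal sketch of that bookkeeping in plain Python: the first row uses the labels of `train[0]` shown earlier, while the other two rows are made-up toy values.

```python
import math

targets = ["LogD", "KSOL", "HLM", "MDR1-MDCKII", "MLM"]
nan = float("nan")

# One dict of labels per molecule, mirroring the structure of train[i][1].
labels = [
    {"LogD": 0.3, "KSOL": nan, "HLM": nan, "MDR1-MDCKII": 2.0, "MLM": nan},
    {"LogD": 1.2, "KSOL": 150.0, "HLM": nan, "MDR1-MDCKII": nan, "MLM": nan},   # toy
    {"LogD": nan, "KSOL": 80.0, "HLM": 12.0, "MDR1-MDCKII": nan, "MLM": 30.0},  # toy
]

# For each target, keep only the row indices that have a measured value.
rows_per_target = {
    tgt: [i for i, row in enumerate(labels) if not math.isnan(row[tgt])]
    for tgt in targets
}

for tgt, rows in rows_per_target.items():
    print(f"{tgt}: {len(rows)} labelled molecules -> rows {rows}")
```

A multi-task model would instead keep all rows and mask the missing entries in the loss, which is what lets it share information across targets.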

In [11]:
import datamol as dm
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor

# Prepare the input data. We'll use Datamol to compute ECFP fingerprints for both the train and test sets.
X_train = np.array([dm.to_fp(dm.to_mol(smi)) for smi in train.X])
X_test = np.array([dm.to_fp(dm.to_mol(smi)) for smi in test.X])

y_pred = {}

# For each of the targets...
for tgt in competition.target_cols:

    # We get the training targets
    # Note that we need to mask out NaNs since the multi-task matrix is sparse.
    y_true = train.y[tgt]
    mask = ~np.isnan(y_true)

    # We'll train a simple baseline model
    model = GradientBoostingRegressor()
    model.fit(X_train[mask], y_true[mask])

    # And then use that to predict the targets for the test set
    y_pred[tgt] = model.predict(X_test)
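
The test set only exposes the input molecules, so it can help to estimate performance locally before submitting. A minimal sketch of a random hold-out split and a mean absolute error, both in plain Python. The helper names `holdout_split` and `mean_absolute_error` are hypothetical (they are not Polaris functions), and MAE is used for illustration only; it is not necessarily the metric the competition uses.

```python
import random


def holdout_split(n_samples, valid_fraction=0.2, seed=0):
    """Return (train_idx, valid_idx) as two disjoint, sorted lists of indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = max(1, round(n_samples * valid_fraction))
    return sorted(indices[n_valid:]), sorted(indices[:n_valid])


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between paired values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


train_idx, valid_idx = holdout_split(434)  # 434 = len(train) above
print(len(train_idx), len(valid_idx))
print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```

In the per-target loop, the split would be applied after masking out the NaNs, so that `n_samples` is the number of labelled molecules for that target.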

In [12]:
y_pred

{'LogD': array([1.27351288, 1.7750958 , 1.60581776, 1.68979137, 2.22260256,
        1.98565186, 1.62269058, 1.72852641, 1.46558608, 2.79611543,
        1.78994201, 1.27138386, 2.32976974, 1.70854626, 2.85149588,
        1.60267896, 1.68979137, 1.69309276, 2.01102358, 1.45665708,
        1.84532247, 1.3058265 , 1.35403707, 1.72852641, 2.43961195,
        2.13192484, 1.13133154, 1.43635829, 1.55644648, 2.46849154,
        2.40235522, 2.00363513, 2.40235522, 1.13133154, 1.60581776,
        1.15667716, 1.52111828, 1.98877801, 1.93665769, 2.20229261,
        1.67533478, 1.60581776, 1.51791896, 1.99692246, 1.99944845,
        1.60581776, 1.49087086, 1.49087086, 1.67533478, 2.54709902,
        1.77762637, 1.98840725, 1.94973128, 1.13006841, 2.3555056 ,
        1.55644648, 1.99692246, 1.97039445, 1.0418421 , 2.01976573,
        1.685116  , 1.55644648, 1.61252384, 2.20229261, 1.00331458,
        1.84532247, 1.4866602 , 2.46451033, 1.71377435, 1.57012582,
        1.73953818, 1.39871127, 0.356647

## Submit your predictions
Submitting your predictions to the competition is simple.

In [13]:
competition.submit_predictions(
    predictions=y_pred,
    prediction_name="tutorial-predictions",
    prediction_owner="agitter",
    report_url="https://github.com/agitter/asap-polaris-admet-challenge", 
    github_url="https://github.com/agitter/asap-polaris-admet-challenge",
    description="Submission using the tutorial Jupyter notebook",
    tags=["tutorial"],
    user_attributes={"Framework": "Scikit-learn", "Method": "Gradient Boosting"}
)
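
Before submitting, a quick sanity check on the predictions can catch shape mistakes early. A minimal sketch, assuming the submission expects one finite value per test molecule for every target, as the `y_pred` dictionary above provides; `check_predictions` is a hypothetical helper, not part of the Polaris API.

```python
import math


def check_predictions(predictions, target_cols, n_test):
    """Raise ValueError if a target is missing, mis-sized, or contains NaN/inf."""
    for tgt in target_cols:
        if tgt not in predictions:
            raise ValueError(f"missing predictions for target {tgt!r}")
        values = list(predictions[tgt])
        if len(values) != n_test:
            raise ValueError(f"{tgt!r}: expected {n_test} values, got {len(values)}")
        if not all(math.isfinite(float(v)) for v in values):
            raise ValueError(f"{tgt!r}: contains NaN or infinite values")


# Toy example with two targets and three test molecules.
toy_pred = {"LogD": [1.2, 1.7, 1.6], "KSOL": [120.0, 95.5, 210.0]}
check_predictions(toy_pred, ["LogD", "KSOL"], n_test=3)
print("predictions look consistent")
```

With the objects from this notebook, the call would be `check_predictions(y_pred, competition.target_cols, len(test))`.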

For the ASAP competition, we will only evaluate your latest submission. 

The results will only be disclosed after the competition ends.

The End.