## Step 1
### About the Dataset

Polaris ([polarishub.io](https://polarishub.io/datasets/asap-discovery/antiviral-potency-2025-unblinded)) is an open-source platform that provides high-quality datasets for machine learning in drug discovery.

The chosen dataset ([source](https://asapdiscovery.org/pipeline/)) contains 1,328 pIC50 values against both the MERS-CoV Mpro target and the SARS-CoV-2 Mpro target. I chose it because it is intended to reflect a realistic dataset of the kind used in drug discovery settings, unlike other sources such as the [MoleculeNet](https://moleculenet.org/) datasets, which have been criticised for their flaws (see [this blog post](https://practicalcheminformatics.blogspot.com/2023/08/we-need-better-benchmarks-for-machine.html) by Pat Walters).

This dataset was used in a blind challenge and was split into training and test sets, which are labelled in the 'Set' column. I will hold out the test set entirely until my cross-validation (CV) procedure is complete. That is, I will train and cross-validate on the datapoints labelled 'Train', then evaluate the model on the datapoints labelled 'Test'.

Some issues have been highlighted with this dataset. In particular, some enantiomers were given the same SMILES strings, so even if they had different bioactivities, a machine learning model wouldn't be able to distinguish between the two. As such, I will filter out those cases when testing the model, as was done for the challenge.

In [None]:
from data.dataset import dataset, dataset_df

dataset_df

| Unnamed: 0 | CXSMILES | Molecule Name | Set | pIC50 (MERS-CoV Mpro) | pIC50 (SARS-CoV-2 Mpro) |
|---|---|---|---|---|---|
| 0 | COC[C@]1(C)C(=O)N(C2=CN=CC3=CC=CC=C23)C(=O)N1C... | ASAP-0000141 | Train | 4.19 | |
| 1 | C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC... | ASAP-0000142 | Train | 4.92 | 5.29 |
| 2 | CNC(=O)CN1C[C@]2(C[C@H](C)N(C3=CN=CC=C3C3CC3)C... | ASAP-0000143 | Train | 4.73 | |
| 3 | C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC... | ASAP-0000144 | Train | 4.90 | 6.11 |
| 4 | C=C(CN1CCC2=C(C=C(Cl)C=C2)C1C(=O)NC1=CN=CC2=CC... | ASAP-0000145 | Train | 4.81 | 5.62 |
| ... | ... | ... | ... | ... | ... |
| 1323 | O=C(CC1=CN=CC2=CC=CC=C12)N1CCCC[C@H]1[C@H]1CCC... | ASAP-0032561 | Test | 4.54 | 4.20 |
| 1324 | O=C(CC1=CN=CC2=CC=CC=C12)N1CCCC[C@H]1[C@H]1CCC... | ASAP-0032562 | Test | 4.42 | |
| 1325 | O=C(CC1=CN=CC2=CC=CC=C12)N1CCC[C@H]2CCCC[C@@H]... | ASAP-0032572 | Test | 4.84 | 5.18 |
| 1326 | COC1=CC=CC=C1[C@H]1C[C@H](C)CCN1C(=O)CC1=CN=CC... | ASAP-0032604 | Test | 5.53 | 5.59 |
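
The hold-out described earlier could be sketched roughly as follows. This is only an illustration on a made-up toy frame: `toy_df` and `split_by_set` are hypothetical names, the rows are invented, and only the column names and the 'Train'/'Test' labels are taken from the table above.

```python
import pandas as pd

# Toy stand-in for dataset_df. Column names follow the table above;
# the rows themselves are invented for illustration.
toy_df = pd.DataFrame({
    "CXSMILES": ["CCO", "CCN", "CCC", "CCCl"],
    "Set": ["Train", "Train", "Test", "Train"],
    "pIC50 (MERS-CoV Mpro)": [4.2, 4.9, 5.5, None],
    "pIC50 (SARS-CoV-2 Mpro)": [None, 5.3, 5.6, 6.1],
})


def split_by_set(df, set_column="Set"):
    """Return (train, test) frames using the labels in `set_column`.

    The original index is kept so that rows can still be matched
    against entry numbers in the full dataset.
    """
    train = df[df[set_column] == "Train"].copy()
    test = df[df[set_column] == "Test"].copy()
    return train, test


train_df, test_df = split_by_set(toy_df)
```

Keeping the original index, rather than resetting it, means any entries flagged by position in the full dataset can still be located after the split.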


## Step 2

### Data Preprocessing

As mentioned in the previous step, enantiomers that share the same SMILES string need to be removed. Polaris noted entries `1036, 1039, 1219, 1225, 1306`, but I will write a function, `clean_smiles`, to verify that these are the only such cases. Multiple SMILES strings can represent the same molecule, so I will canonicalise the SMILES by converting each one into an `rdkit.Chem.Mol` object and converting that back into a SMILES string. I will also check for other duplicate entries.

In [None]:
def clean_smiles(smiles):
    '''
    Canonicalise SMILES strings and flag duplicate entries.

    Arguments
    ----------
    smiles: array-like, list of SMILES strings to 'clean'

    Returns
    ----------
    clean_smiles: array-like, list of cleaned SMILES
    duplicates: indices of duplicate entries
    '''
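
One way the canonicalise-then-compare idea might look is sketched below. It is not the notebook's actual implementation: `rdkit_canonical` and `canonicalise_and_find_duplicates` are hypothetical names, and passing the canonicaliser in as an argument is my own choice so that the grouping logic can be exercised without RDKit installed. The RDKit calls used are `Chem.MolFromSmiles` and `Chem.MolToSmiles`; the latter writes plain canonical SMILES, so any CXSMILES extension block in the input is not carried over.

```python
from collections import defaultdict


def rdkit_canonical(smi):
    """Canonicalise one SMILES string via an RDKit Mol (lazy import)."""
    from rdkit import Chem  # imported here so the grouping code runs without RDKit

    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smi!r}")
    return Chem.MolToSmiles(mol)


def canonicalise_and_find_duplicates(smiles, canonical=rdkit_canonical):
    """Return (cleaned, duplicates).

    cleaned    -- canonical form of every input string, in input order
    duplicates -- list of position groups whose canonical forms are identical
    """
    cleaned = [canonical(s) for s in smiles]

    # Group input positions by canonical string.
    groups = defaultdict(list)
    for position, canonical_smi in enumerate(cleaned):
        groups[canonical_smi].append(position)

    duplicates = [positions for positions in groups.values() if len(positions) > 1]
    return cleaned, duplicates
```

The positions returned are offsets into the input sequence, so if a DataFrame column is passed in, they would need to be mapped back to the frame's index before being compared with the entries Polaris flagged.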

## Step 3

### Dataset Exploration

## Step 4

### Model Training

## Step 5

### Performance Testing

## Step 6

### Optimisation

Improving performance by means of feature selection/engineering and hyperparameter optimisation.