# Task Specific MSPM Model Fine-Tuning (Optional)

This stage is optional for
MolPMoFiT. For QSAR tasks, the target dataset (e.g., toxicity data, drug activity data) may have a
distribution different from that of the ChEMBL dataset. In this stage, the goal is to fine-tune the
general-domain MSPM on the target QSAR dataset to create a task-specific (endpoint-specific) MSPM.

**Experiments on the BBBP and HIV datasets showed that task-specific MSPM fine-tuning was not beneficial for model performance.**

This notebook presents an example of fine-tuning the general-domain MSPM on the **BBBP** dataset.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rdkit import RDLogger 
RDLogger.DisableLog('rdApp.*') # switch off RDKit warning messages

from sklearn.model_selection import train_test_split

from fastai import *
from fastai.text import *
from utils import *

torch.cuda.set_device(1)  # change to 0 if you only have one GPU



Create a path to save results.

In [2]:
result_path = Path('../results')
name = 'BBBP'
path = result_path/name
path.mkdir(exist_ok=True, parents=True)

Model_path = path/'models'
Model_path.mkdir(exist_ok=True, parents=True)

Load the data.

In [3]:
bbbp_data = pd.read_csv('../data/QSAR/bbbp.csv')
print('Dataset:', bbbp_data.shape)
bbbp_data.head(1)

Dataset: (2039, 2)


| | smiles | p_np |
|---|---|---|
| 0 | [Cl].CC(C)NCC(O)COc1cccc2ccccc12 | 1 |


Similar to training the general-domain MSPM, we use randomized SMILES to augment the data. Since the dataset is small, we generate up to 100 randomized SMILES for each molecule (duplicates are dropped).

In [4]:
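def lm_smiles_augmentation(df, N_rounds):
    # Collect N_rounds randomized SMILES (with the matching label) per molecule
    dist_aug = {col_name: [] for col_name in df}

    for i in range(df.shape[0]):
        for j in range(N_rounds):
            dist_aug['smiles'].append(randomize_smiles(df.iloc[i].smiles))
            dist_aug['p_np'].append(df.iloc[i]['p_np'])
    df_aug = pd.DataFrame.from_dict(dist_aug)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df_aug = pd.concat([df_aug, df], ignore_index=True)
    # Randomization can produce the same string twice; keep each SMILES once
    return df_aug.drop_duplicates('smiles')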
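
The `randomize_smiles` helper called above is imported from `utils` and is not shown in this notebook. As a rough illustration of what such a helper typically does, the sketch below shuffles the atom order with RDKit and writes a non-canonical SMILES. The name `randomize_smiles_sketch` is hypothetical, and the actual implementation in `utils` may differ.

```python
import random

from rdkit import Chem


def randomize_smiles_sketch(smiles, rng=random):
    """Return a randomized (non-canonical) SMILES for the same molecule.

    Hypothetical stand-in for the `randomize_smiles` helper in `utils`;
    the real helper is not shown in this notebook and may differ.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"RDKit could not parse SMILES: {smiles!r}")
    # Shuffle the atom indices and renumber the molecule accordingly.
    new_order = list(range(mol.GetNumAtoms()))
    rng.shuffle(new_order)
    shuffled = Chem.RenumberAtoms(mol, new_order)
    # canonical=False keeps the shuffled atom order when writing the string.
    return Chem.MolToSmiles(shuffled, canonical=False)


original = "CC(C)NCC(O)COc1cccc2ccccc12"
# A set keeps only the distinct strings produced over 20 draws.
variants = {randomize_smiles_sketch(original) for _ in range(20)}
```

Every string in `variants` encodes the same molecule as `original`, which is why the augmentation can copy the `p_np` label unchanged.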

Randomly split the data into training and validation sets. **Note: this split is for training the structure prediction model, not the QSAR model.**

In [5]:
bbbp_train, bbbp_val = train_test_split(bbbp_data, test_size=0.05, random_state=42)

In [None]:
%%time
bbbp_train_aug = lm_smiles_augmentation(bbbp_train, 100)
bbbp_val_aug = lm_smiles_augmentation(bbbp_val, 100)  # augment the validation split, not the training split
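
Since the validation set is meant to measure how well the structure prediction model handles molecules it has not trained on, it can be worth checking that the two augmented sets share no SMILES strings. The sketch below is an illustration, not part of the original workflow. `shared_smiles` and the toy lists are hypothetical names standing in for the `smiles` columns of `bbbp_train_aug` and `bbbp_val_aug`. It compares raw strings only, so two different randomized SMILES of the same molecule would not be flagged.

```python
def shared_smiles(train_smiles, val_smiles):
    """Return the SMILES strings that occur in both collections."""
    return set(train_smiles) & set(val_smiles)


# Toy strings standing in for bbbp_train_aug.smiles and bbbp_val_aug.smiles.
toy_train = ["CCO", "OCC", "c1ccccc1"]
toy_val = ["CCN", "NCC"]

leaked = shared_smiles(toy_train, toy_val)
print(f"{len(leaked)} SMILES strings appear in both sets")
```

In the notebook, the equivalent call would be `shared_smiles(bbbp_train_aug.smiles, bbbp_val_aug.smiles)`. A non-empty result suggests the two sets were built from overlapping molecules.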