<a href="https://colab.research.google.com/github/bhatnira/Interpretation-of-Best-Classification-Models-Acetylcholinesterase-Inhibitor-Discovery-/blob/main/FineTunedChemberta(DeepChem_ChemBERTa_10M_MLM).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fined Tuned pretained Chemberta(DeepChem_ChemBERTa_10M_MLM) -- classification modeling

## Introduction

 While not outperforming existing methods, ChemBERTa demonstrates competitive results when trained on larger datasets. It enhances interpretability and introduces an innovative attention-based visualization for elucidating the model's decision-making process. By directly accepting SMILES strings as input, ChemBERTa eliminates the need for extensive featurization, thus facilitating rapid screening(Ahmad et al., 2022; Chithrananda et al., 2020). We utilized twelve different variants of CheMERTa, pre-trained in different size and type of dataset, along with rigorous hyper parameter optimization via Optuna module (Chithrananda et al., 2020). Preferred tokenizer was the default RoBERTA tokenizer as there was no significant difference in performance across different tokenizers.


## Data loading and Preprocessing



In [1]:
import numpy as np
# Reproducibility
np.random.seed(42)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
!pip install --pre deepchem
import deepchem
deepchem.__version__



Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead


'2.8.1.dev'

In [4]:
import pandas as pd
df=pd.read_excel('/content/drive/MyDrive/Predictive-Generative-transfer learning/Part_2_standarizationOfMolecule/StandarizedSmiles_cutOFF800daltonMolecularweight.xlsx')
df.head()

Unnamed: 0,Molecule ChEMBL ID,Smiles,IC50,classLabel,IsValidSMILES,Morgan_FP,Molecule,Fingerprint,PCA1,PCA2,tSNE1,tSNE2,MolecularWeight,Frequency,cleanedMol
0,CHEMBL94,CNC(=O)Oc1ccc2c(c1)[C@]1(C)CCN(C)[C@@H]1N2C,28.0,1,True,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,<rdkit.Chem.rdchem.Mol object at 0x7bcfd8e880b0>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,-1.738898,-1.494784,-52.7617,-42.736099,275.352,27,CNC(=O)Oc1ccc2c(c1)[C@]1(C)CCN(C)[C@@H]1N2C
1,CHEMBL207777,Cc1ccccc1NC(=O)Oc1ccc2c(c1)[C@]1(C)CO[C@@H](C1)O2,97.0,1,True,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,<rdkit.Chem.rdchem.Mol object at 0x7bcfd8e88190>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,-1.014164,-1.325272,-63.027809,-53.765778,325.364,1,Cc1ccccc1NC(=O)Oc1ccc2c(c1)[C@]1(C)CO[C@@H](C1)O2
2,CHEMBL205967,CCNC(=O)Oc1ccc2c(c1)[C@]1(C)CO[C@@H](C1)O2,2420.0,0,True,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,<rdkit.Chem.rdchem.Mol object at 0x7bcfd8e88270>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,-1.235565,-1.559108,-61.968063,-54.078575,263.293,1,CCNC(=O)Oc1ccc2c(c1)[C@]1(C)CO[C@@H](C1)O2
3,CHEMBL60119,Cc1ccc2c(N)c3c(nc2c1)CCCC3,100.0,1,True,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,<rdkit.Chem.rdchem.Mol object at 0x7bcfd8e88430>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,2.705664,0.148113,57.758297,-8.911607,212.296,4,Cc1ccc2c(N)c3c(nc2c1)CCCC3
4,CHEMBL294525,CCCCCCCNc1c2c(nc3cc([N+](=O)[O-])ccc13)CCCC2,290.0,1,True,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,<rdkit.Chem.rdchem.Mol object at 0x7bcfd8e884a0>,<rdkit.DataStructs.cDataStructs.ExplicitBitVec...,3.600017,-0.978574,61.242554,5.783804,341.455,3,CCCCCCCNc1c2c(nc3cc([N+](=O)[O-])ccc13)CCCC2


In [5]:
df['classLabel'].value_counts()

classLabel
0    2330
1    1747
Name: count, dtype: int64

In [6]:
df_selected= df[['Smiles', 'classLabel']]
df_selected

Unnamed: 0,Smiles,classLabel
0,CNC(=O)Oc1ccc2c(c1)[C@]1(C)CCN(C)[C@@H]1N2C,1
1,Cc1ccccc1NC(=O)Oc1ccc2c(c1)[C@]1(C)CO[C@@H](C1)O2,1
2,CCNC(=O)Oc1ccc2c(c1)[C@]1(C)CO[C@@H](C1)O2,0
3,Cc1ccc2c(N)c3c(nc2c1)CCCC3,1
4,CCCCCCCNc1c2c(nc3cc([N+](=O)[O-])ccc13)CCCC2,1
...,...,...
4072,O=C(Nc1ccc(CN2CCOCC2)cc1C(=O)C(=O)N1C(=O)CCC1=...,1
4073,COc1cccc2c1C=[N+](c1ccccc1C(F)(F)F)CC2.[Br-],0
4074,COc1cccc2cc[n+](-c3ccc(C)cc3)cc12.[Br-],0
4075,COc1ccc(-[n+]2ccc3cccc(OC)c3c2)cc1.[Br-],1


 The following code for roberta tokenizer and chemberta model building is adopted from deepchem's tutorial: (https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb)

In [7]:
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import Draw, PyMol, rdFMCS
from rdkit.Chem.Draw import IPythonConsole
from rdkit import rdBase
import numpy as np
import deepchem as dc

In [8]:
df.to_csv('inhibitor.csv', index=False)

In [9]:
import deepchem as dc
import pandas as pd
dataset_file = 'inhibitor.csv'
task = ['classLabel']
featurizer_func = dc.feat.ConvMolFeaturizer()
loader = dc.data.CSVLoader(tasks=task, feature_field='cleanedMol', featurizer=featurizer_func)
dataset = loader.create_dataset(dataset_file)

In [10]:
transformer = dc.trans.BalancingTransformer(dataset=dataset)
dataset = transformer.transform(dataset)

In [11]:
from rdkit import Chem

In [12]:
!git clone https://github.com/NVIDIA/apex
!cd /content/apex
!pip install -v --no-cache-dir /content/apex
!pip install transformers
!pip install simpletransformers
!pip install wandb
!cd ..

fatal: destination path 'apex' already exists and is not an empty directory.
Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Processing ./apex
  Running command pip subprocess to install build dependencies
  Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Collecting setuptools
    Using cached setuptools-70.1.1-py3-none-any.whl (883 kB)
  Collecting wheel
    Using cached wheel-0.43.0-py3-none-any.whl (65 kB)
  Installing collected packages: wheel, setuptools
    Creating /tmp/pip-build-env-v5p1ngbq/overlay/local/bin
    changing mode of /tmp/pip-build-env-v5p1ngbq/overlay/local/bin/wheel to 755
  ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
  ipython 7.34.0 requires jedi>=0.16, which is not installed.
  Successfully installed setuptools-70.1.1 wheel-0.43.0
  Installing build dependencies ... 

In [13]:
import sys
!test -d bertviz_repo && echo "FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo"
!test -d bertviz_repo || git clone https://github.com/jessevig/bertviz bertviz_repo
if not 'bertviz_repo' in sys.path:
  sys.path += ['bertviz_repo']
!pip install regex

FYI: bertviz_repo directory already exists, to pull latest version uncomment this line: !rm -r bertviz_repo


In [14]:
!git clone https://github.com/seyonechithrananda/bert-loves-chemistry.git

Cloning into 'bert-loves-chemistry'...
remote: Enumerating objects: 1566, done.[K
remote: Counting objects: 100% (202/202), done.[K
remote: Compressing objects: 100% (111/111), done.[K
remote: Total 1566 (delta 96), reused 92 (delta 91), pack-reused 1364[K
Receiving objects: 100% (1566/1566), 55.35 MiB | 11.29 MiB/s, done.
Resolving deltas: 100% (1000/1000), done.
Updating files: 100% (122/122), done.


In [15]:
%cd /content/bert-loves-chemistry

/content/bert-loves-chemistry


In [16]:
import os
import numpy as np
import pandas as pd
from typing import List
from rdkit import Chem

In [17]:
!pip install --upgrade transformers



In [33]:
import deepchem as dc
import pandas as pd

task = ['classLabel']
dataset = dc.data.NumpyDataset(X=df['cleanedMol'], y=df['classLabel'])
splitter = dc.splits.RandomSplitter()
frac_train = 0.7
frac_valid = 0.15
frac_test = 0.15

# Split the dataset
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(
    dataset, frac_train=frac_train, frac_valid=frac_valid, frac_test=frac_test,seed=42
)

In [34]:
test_dataset

<NumpyDataset X.shape: (612,), y.shape: (612,), w.shape: (612,), ids: [3356 2744 3859 ... 860 3507 3174], task_names: [0]>

In [35]:
train_df = train_dataset.to_dataframe()
valid_df = valid_dataset.to_dataframe()
test_df = test_dataset.to_dataframe()

In [36]:
train_df

Unnamed: 0,X,y,w,ids
0,Clc1ccc(CN2CCN(c3nccc(NCc4ccccc4)n3)CC2)cc1,0,1.0,1749
1,CCCCNC(=O)Oc1cccc(CN(C)CCCOc2ccc3ccc(=O)oc3c2)c1,1,1.0,2053
2,Cc1cccc(C[n+]2ccc(C(=O)NCCc3c[nH]c4ccccc34)cc2)c1,0,1.0,538
3,CN(C)CCCCCCCCCCCCNc1c2c(nc3ccccc13)CCCC2,1,1.0,438
4,COc1cc2c(cc1OC)SC(C(=O)CCc1cc[n+](Cc3ccsc3)cc1)C2,1,1.0,2685
...,...,...,...,...
2848,CC1=Nc2nc3c(c(N)c2C(c2ccccc2F)C1C(=O)OC1CCC1)C...,1,1.0,2927
2849,COc1cc2c(cc1OC)C(c1ccccc1)N(CCCCCCc1cc(C)nc(C=...,0,1.0,375
2850,CC1CN(C(=O)Oc2ccc(Oc3ccc(C(F)(F)F)cn3)cc2)CC(C)O1,0,1.0,1908
2851,CN(C)Cc1ccc(CSCCCCCCCCCCSCc2ccc(CN(C)C)o2)o1,0,1.0,333


In [37]:
train_df=train_df[['X','y']]
valid_df=valid_df[['X','y']]
test_df=test_df[['X','y']]

## Cross species evaluation

### Upload and prep for species dataset

Species datasets: eel (CHEMBL4078), cow (CHEMBL4768), mouse (CHEMBL3199), ray (CHEMBL4780), mosquito (CHEMBL2046266), mouse (CHEMBL3198), and independent  human dataset containing molecules not present on  ChEMBL22  were obtained from Vignaux et al(2023) for validation and specificity inferences.





In [38]:
df_humanIndependent=pd.read_excel('/content/drive/MyDrive/Predictive-Generative-transfer learning/TransferAll/CleanedTestDatasetSmiles/df_humanIndependent.xlsx')
df_eel=pd.read_excel('/content/drive/MyDrive/Predictive-Generative-transfer learning/TransferAll/CleanedTestDatasetSmiles/df_eel.xlsx')
df_mouse=pd.read_excel('/content/drive/MyDrive/Predictive-Generative-transfer learning/TransferAll/CleanedTestDatasetSmiles/df_mouse.xlsx')
df_cow=pd.read_excel('/content/drive/MyDrive/Predictive-Generative-transfer learning/TransferAll/CleanedTestDatasetSmiles/df_cow.xlsx')
df_ray=pd.read_excel('/content/drive/MyDrive/Predictive-Generative-transfer learning/TransferAll/CleanedTestDatasetSmiles/df_ray.xlsx')
df_mosquito=pd.read_excel('/content/drive/MyDrive/Predictive-Generative-transfer learning/TransferAll/CleanedTestDatasetSmiles/df_mosquito.xlsx')

In [39]:
df_humanIndependent = dc.data.NumpyDataset(X=df_humanIndependent['cleanedMol'], y=df_humanIndependent['binary_activities'])
df_eel = dc.data.NumpyDataset(X=df_eel['cleanedMol'], y=df_eel['single-class-label'])
df_mouse = dc.data.NumpyDataset(X=df_mouse['cleanedMol'], y=df_mouse['single-class-label'])
df_cow = dc.data.NumpyDataset(X=df_cow['cleanedMol'], y=df_cow['single-class-label'])
df_ray = dc.data.NumpyDataset(X=df_ray['cleanedMol'], y=df_ray['single-class-label'])
df_mosquito = dc.data.NumpyDataset(X=df_mosquito['cleanedMol'], y=df_mosquito['single-class-label'])

In [40]:
df_humanIndependent = df_humanIndependent.to_dataframe()
df_eel = df_eel.to_dataframe()
df_mouse = df_mouse.to_dataframe()
df_cow = df_cow.to_dataframe()
df_ray = df_ray.to_dataframe()
df_mosquito = df_mosquito.to_dataframe()

In [41]:
df_humanIndependent=df_humanIndependent[['X','y']]
df_eel=df_eel[['X','y']]
df_mouse=df_mouse[['X','y']]
df_cow=df_cow[['X','y']]
df_ray=df_ray[['X','y']]
df_mosquito=df_mosquito[['X','y']]

In [42]:
df_humanIndependent

Unnamed: 0,X,y
0,COc1cc(N)c(Cl)cc1C(=O)CCC1CCN(Cc2cccc(C)c2)CC1,1
1,CCN(CCCCCn1c(C)cc(=O)n(CCCCCN(CC)Cc2ccccc2C#N)...,1
2,CCN(CCCCCCn1c(C)cc(=O)n(CCCCCCN(CC)Cc2ccccc2C#...,1
3,CCN(CCCCCn1c(=O)c2ccccc2n(CCCCCN(CC)Cc2ccccc2C...,1
4,C[n+]1c2c(c(N)c3ccccc31)CCCC2,0
...,...,...
203,COc1ccc2c(=O)cc(C(=O)Nc3ccc(CN(C)Cc4ccccc4)cc3...,0
204,COc1cc2[nH]c(C(=O)Nc3ccc(CN(C)Cc4ccccc4)cc3)cc...,0
205,CCN(CCCCn1c(C)cc(=O)n(CCCCN(CC)Cc2ccccc2C#N)c1...,1
206,Cc1cc(=O)n(CCCCCNCc2ccccc2C#N)c(=O)n1CCCCCNCc1...,1


## Model optimization

In [28]:
import tensorflow as tf
import torch
import gc

# Clear session for TensorFlow
tf.keras.backend.clear_session()

# Collect garbage and release unreferenced memory for PyTorch
gc.collect()
torch.cuda.empty_cache()

print("Memory cleared for TensorFlow and PyTorch.")


Memory cleared for TensorFlow and PyTorch.


In [29]:
!pip install Optuna

Collecting Optuna
  Downloading optuna-3.6.1-py3-none-any.whl (380 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m380.1/380.1 kB[0m [31m7.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting alembic>=1.5.0 (from Optuna)
  Downloading alembic-1.13.2-py3-none-any.whl (232 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m233.0/233.0 kB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting colorlog (from Optuna)
  Downloading colorlog-6.8.2-py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->Optuna)
  Downloading Mako-1.3.5-py3-none-any.whl (78 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m78.6/78.6 kB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: Mako, colorlog, alembic, Optuna
Successfully installed Mako-1.3.5 Optuna-3.6.1 alembic-1.13.2 colorlog-6.8.2


In [30]:
import os
import optuna
from sklearn.metrics import roc_auc_score
from simpletransformers.classification.classification_model import ClassificationModel
classification_args = {
    'overwrite_output_dir': True,  # Set to True to overwrite the output directory if it exists
}

# Define the objective function for Optuna
def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    num_train_epochs = trial.suggest_int("num_train_epochs", 1, 10)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    warmup_steps = trial.suggest_int("warmup_steps", 0, 1000)
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.3)
    adam_epsilon = trial.suggest_float("adam_epsilon", 1e-9, 1e-7, log=True)
    # Generate a unique output directory for each trial
    output_dir = f'/content/AChE_transferLearning_trial_{trial.number}'

    # Create and train the model with the given hyperparameters
    model = ClassificationModel('roberta', 'DeepChem/ChemBERTa-10M-MLM', use_cuda=True, args={**classification_args, 'learning_rate': learning_rate, 'num_train_epochs': num_train_epochs, 'train_batch_size': batch_size, 'dropout': dropout, 'warmup_steps': warmup_steps, 'weight_decay': weight_decay, 'adam_epsilon': adam_epsilon})

    model.train_model(train_df, eval_df=valid_df)

    # Evaluate the model and calculate ROC AUC
    eval_results = model.eval_model(valid_df)
    predictions, _ = model.predict(valid_df['X'].tolist())
    roc_auc = roc_auc_score(valid_df['y'].tolist(), predictions)

    # Return ROC AUC for optimization
    return roc_auc

# Create an Optuna study and optimize hyperparameters
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)

# Get the best hyperparameters
best_params = study.best_params
print("Best Hyperparameters:", best_params)

# Train the final model with the best hyperparameters
final_model = ClassificationModel('roberta', 'DeepChem/ChemBERTa-10M-MLM', use_cuda=True, args={**classification_args, **best_params})
final_model.train_model(train_df, eval_df=valid_df)


[I 2024-06-27 00:36:11,870] A new study created in memory with name: no-name-6ab78d23-8c2b-4a09-b60a-f2d2b3631f75
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/631 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/13.7M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/6.96k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/8.26k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/420 [00:00<?, ?B/s]

  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/8 [00:00<?, ?it/s]

Running Epoch 1 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 2 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 3 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 4 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 5 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 6 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 7 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 8 of 8:   0%|          | 0/90 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:37:04,996] Trial 0 finished with value: 0.7104562014028601 and parameters: {'learning_rate': 9.975444368705209e-06, 'num_train_epochs': 8, 'batch_size': 32, 'dropout': 0.155882333003871, 'warmup_steps': 527, 'weight_decay': 0.25057731287451673, 'adam_epsilon': 1.1745894619938858e-09}. Best is trial 0 with value: 0.7104562014028601.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 10 of 10:   0%|          | 0/90 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:37:41,695] Trial 1 finished with value: 0.8428416073079223 and parameters: {'learning_rate': 6.396360297483605e-05, 'num_train_epochs': 10, 'batch_size': 32, 'dropout': 0.27463130054425233, 'warmup_steps': 652, 'weight_decay': 0.1976018703490457, 'adam_epsilon': 2.140889809676454e-08}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 10 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:38:42,525] Trial 2 finished with value: 0.8117938121907454 and parameters: {'learning_rate': 2.035544778307895e-05, 'num_train_epochs': 10, 'batch_size': 16, 'dropout': 0.2628633455496934, 'warmup_steps': 928, 'weight_decay': 0.20843728719565494, 'adam_epsilon': 4.082990856683028e-08}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 10 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:40:36,709] Trial 3 finished with value: 0.8246261758468816 and parameters: {'learning_rate': 1.53165733263887e-05, 'num_train_epochs': 10, 'batch_size': 8, 'dropout': 0.10155790683831666, 'warmup_steps': 847, 'weight_decay': 0.2537755716491721, 'adam_epsilon': 2.4549695173590605e-08}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Running Epoch 1 of 4:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 2 of 4:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 3 of 4:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 4 of 4:   0%|          | 0/90 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:40:53,146] Trial 4 finished with value: 0.5269588385623403 and parameters: {'learning_rate': 1.5981805172850975e-06, 'num_train_epochs': 4, 'batch_size': 32, 'dropout': 0.1213412772664722, 'warmup_steps': 92, 'weight_decay': 0.23742116704589494, 'adam_epsilon': 8.117890244138235e-09}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Running Epoch 1 of 5:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 2 of 5:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 3 of 5:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 4 of 5:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 5 of 5:   0%|          | 0/90 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:41:13,295] Trial 5 finished with value: 0.5842857919634604 and parameters: {'learning_rate': 3.793233363690425e-06, 'num_train_epochs': 5, 'batch_size': 32, 'dropout': 0.2376886655900421, 'warmup_steps': 35, 'weight_decay': 0.21988032552809936, 'adam_epsilon': 1.0080890980455058e-08}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 10 of 10:   0%|          | 0/179 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:42:15,865] Trial 6 finished with value: 0.82994399434506 and parameters: {'learning_rate': 0.0004579227899388307, 'num_train_epochs': 10, 'batch_size': 16, 'dropout': 0.47915765331253135, 'warmup_steps': 703, 'weight_decay': 0.12663107244111224, 'adam_epsilon': 1.9674401463967128e-09}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 1 of 1:   0%|          | 0/179 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:42:27,930] Trial 7 finished with value: 0.7333043336414551 and parameters: {'learning_rate': 0.0009285305456021237, 'num_train_epochs': 1, 'batch_size': 16, 'dropout': 0.4942382375684159, 'warmup_steps': 679, 'weight_decay': 0.20739256429467473, 'adam_epsilon': 5.786351010807755e-09}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 1 of 2:   0%|          | 0/90 [00:00<?, ?it/s]

Running Epoch 2 of 2:   0%|          | 0/90 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:42:44,723] Trial 8 finished with value: 0.81532814963841 and parameters: {'learning_rate': 0.00034959801039136904, 'num_train_epochs': 2, 'batch_size': 32, 'dropout': 0.46044348359959597, 'warmup_steps': 87, 'weight_decay': 0.1831592316373526, 'adam_epsilon': 8.186798833422698e-08}. Best is trial 1 with value: 0.8428416073079223.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/179 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/179 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/7 [00:00<?, ?it/s]

[I 2024-06-27 00:43:08,186] Trial 9 finished with value: 0.7907998477516176 and parameters: {'learning_rate': 0.0009298556927642271, 'num_train_epochs': 3, 'batch_size': 16, 'dropout': 0.14487331753159022, 'warmup_steps': 617, 'weight_decay': 0.21396287319351726, 'adam_epsilon': 3.49812914404003e-09}. Best is trial 1 with value: 0.8428416073079223.


Best Hyperparameters: {'learning_rate': 6.396360297483605e-05, 'num_train_epochs': 10, 'batch_size': 32, 'dropout': 0.27463130054425233, 'warmup_steps': 652, 'weight_decay': 0.1976018703490457, 'adam_epsilon': 2.140889809676454e-08}


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  self.pid = os.fork()


  0%|          | 0/5 [00:00<?, ?it/s]

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

Running Epoch 10 of 10:   0%|          | 0/357 [00:00<?, ?it/s]

(3570, 0.35303583502200764)

In [57]:
best_params

{'learning_rate': 0.000645685210909375,
 'num_train_epochs': 10,
 'batch_size': 16,
 'dropout': 0.27572484082762017,
 'warmup_steps': 251,
 'weight_decay': 0.20011624141211595,
 'adam_epsilon': 5.439154836943154e-09}

### Optimized model cross species evaluation

In [43]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef, cohen_kappa_score
import pandas as pd

# Assuming test_df and df_humanIndependent are predefined DataFrames containing the test datasets
datasets = {
    'Human Test': test_df,
    'Human Independent': df_humanIndependent,
}

evaluation_results = {}

# Assuming final_model is a predefined model with an eval_model method
for dataset_name, dataset in datasets.items():
    result_final, model_outputs_final, wrong_predictions_final = final_model.eval_model(dataset, acc=accuracy_score)
    evaluation_results[dataset_name] = {
        'result': result_final,
        'model_outputs': model_outputs_final,
        'wrong_predictions': wrong_predictions_final
    }

    # Add prediction probabilities and binary predictions to the DataFrame
    dataset[f'Class_0_Prob'] = model_outputs_final[:, 0]
    dataset[f'Class_1_Prob'] = model_outputs_final[:, 1]
    dataset[f'Binary_Prediction'] = np.argmax(model_outputs_final, axis=1)

    # Save the updated DataFrame to an Excel file
    dataset.to_excel(f'{dataset_name}_with_predictions.xlsx', index=False)

# Initialize a dictionary to store evaluation metrics for each dataset
evaluation_metrics = {}

# Evaluate and store evaluation metrics for each dataset
for dataset_name, result in evaluation_results.items():
    y_true = datasets[dataset_name]['y'].ravel()
    y_pred_binary = np.argmax(result['model_outputs'], axis=1)  # Convert probabilities to binary predictions

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_true, y_pred_binary)
    precision = precision_score(y_true, y_pred_binary)
    recall = recall_score(y_true, y_pred_binary)
    f1 = f1_score(y_true, y_pred_binary)
    mcc = matthews_corrcoef(y_true, y_pred_binary)
    cohen_kappa = cohen_kappa_score(y_true, y_pred_binary)

    # Store the evaluation metrics in the dictionary
    evaluation_metrics[dataset_name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'MCC': mcc,
        "Cohen's Kappa": cohen_kappa
    }

# Create a DataFrame from the evaluation_metrics dictionary
df_metrics = pd.DataFrame(evaluation_metrics).transpose()

# Save the evaluation metrics DataFrame to an Excel file
df_metrics.to_excel('evaluation_metrics.xlsx', index=True)

# Display the resultant DataFrame with different names
print("Resultant DataFrame with Evaluation Metrics:")
print(df_metrics)


  self.pid = os.fork()


  0%|          | 0/1 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/7 [00:00<?, ?it/s]

  self.pid = os.fork()


0it [00:00, ?it/s]

Running Evaluation:   0%|          | 0/3 [00:00<?, ?it/s]

Resultant DataFrame with Evaluation Metrics:
                   Accuracy  Precision    Recall  F1 Score       MCC  \
Human Test         0.833333   0.777778  0.833333  0.804598  0.660793   
Human Independent  0.701923   0.544444  0.700000  0.612500  0.384257   

                   Cohen's Kappa  
Human Test              0.659600  
Human Independent       0.376402  


In [45]:
test_df.rename(columns={'X': 'cleanedMol'}, inplace=True)
test_df.rename(columns={'y': 'single-class-label'}, inplace=True)

In [46]:
df_humanIndependent.rename(columns={'X': 'cleanedMol'}, inplace=True)
df_humanIndependent.rename(columns={'y': 'single-class-label'}, inplace=True)

## Prediction analysis

In [47]:
pip install -U kaleido

Collecting kaleido
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m79.9/79.9 MB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: kaleido
Successfully installed kaleido-0.2.1


### Distribution plot for class 1 prediction probabilities(Human Independent)

In [48]:
import pandas as pd
import plotly.subplots as sp
import plotly.graph_objects as go

# Extract True Positives, True Negatives, False Positives, and False Negatives for Human Independent dataset
true_positives_hi = df_humanIndependent[(df_humanIndependent['single-class-label'] == 1) & (df_humanIndependent['Binary_Prediction'] == 1)]
true_negatives_hi = df_humanIndependent[(df_humanIndependent['single-class-label'] == 0) & (df_humanIndependent['Binary_Prediction'] == 0)]
false_positives_hi = df_humanIndependent[(df_humanIndependent['single-class-label'] == 0) & (df_humanIndependent['Binary_Prediction'] == 1)]
false_negatives_hi = df_humanIndependent[(df_humanIndependent['single-class-label'] == 1) & (df_humanIndependent['Binary_Prediction'] == 0)]

# Create subplots for Human Independent dataset
fig_hi = sp.make_subplots(rows=1, cols=1, subplot_titles=['Human Independent'])

# Add violin traces for Human Independent
fig_hi.add_trace(go.Violin(x=true_positives_hi['Binary_Prediction'], y=true_positives_hi['Class_1_Prob'], name='True Positives', box_visible=True, points='all', jitter=0.3, marker=dict(size=6)))
fig_hi.add_trace(go.Violin(x=false_positives_hi['Binary_Prediction'], y=false_positives_hi['Class_1_Prob'], name='False Positives', box_visible=True, points='all', jitter=0.3, marker=dict(size=6)))
fig_hi.add_trace(go.Violin(x=true_negatives_hi['Binary_Prediction'], y=true_negatives_hi['Class_1_Prob'], name='True Negatives', box_visible=True, points='all', jitter=0.3, marker=dict(size=6)))
fig_hi.add_trace(go.Violin(x=false_negatives_hi['Binary_Prediction'], y=false_negatives_hi['Class_1_Prob'], name='False Negatives', box_visible=True, points='all', jitter=0.3, marker=dict(size=6)))

# Update layout for better presentation
fig_hi.update_layout(
    template='plotly_white',
    title='Violin Plots with Data Points for True Positives, False Positives, True Negatives, and False Negatives (Human Independent)',
    xaxis=dict(title='Binary Prediction', showgrid=False),
    yaxis=dict(title='Class 1 Probability', showgrid=False),
    violingap=0
)


### Distribution plot for class 1 prediction probabilities(Human Test)

In [49]:
# Extract True Positives, True Negatives, False Positives, and False Negatives for Human Test dataset
true_positives_ht = test_df[(test_df['single-class-label'] == 1) & (test_df['Binary_Prediction'] == 1)]
true_negatives_ht = test_df[(test_df['single-class-label'] == 0) & (test_df['Binary_Prediction'] == 0)]
false_positives_ht = test_df[(test_df['single-class-label'] == 0) & (test_df['Binary_Prediction'] == 1)]
false_negatives_ht = test_df[(test_df['single-class-label'] == 1) & (test_df['Binary_Prediction'] == 0)]

# Create a figure for Human Test dataset
fig_ht = go.Figure()

# Add violin traces for Human Test
fig_ht.add_trace(go.Violin(x=true_positives_ht['Binary_Prediction'], y=true_positives_ht['Class_1_Prob'], name='True Positives', box_visible=True, points='all', jitter=0.2, marker=dict(size=6)))
fig_ht.add_trace(go.Violin(x=false_positives_ht['Binary_Prediction'], y=false_positives_ht['Class_1_Prob'], name='False Positives', box_visible=True, points='all', jitter=0.2, marker=dict(size=6)))
fig_ht.add_trace(go.Violin(x=true_negatives_ht['Binary_Prediction'], y=true_negatives_ht['Class_1_Prob'], name='True Negatives', box_visible=True, points='all', jitter=0.2, marker=dict(size=6)))
fig_ht.add_trace(go.Violin(x=false_negatives_ht['Binary_Prediction'], y=false_negatives_ht['Class_1_Prob'], name='False Negatives', box_visible=True, points='all', jitter=0.2, marker=dict(size=6)))

# Update layout for better presentation
fig_ht.update_layout(template='plotly_white', title='Violin Plots with Data Points for True Positives, False Positives, True Negatives, and False Negatives (Human Test)')
fig_ht.update_xaxes(title_text='Binary Prediction', showgrid=False)  # Remove gridlines on x-axis
fig_ht.update_yaxes(title_text='Class 1 Probability', showgrid=False)  # Remove gridlines on y-axis
fig_ht.update_layout(violingap=0)  # Set gap between violins to zero for closer appearance


### Distribution of class 1 probabilities for positive prediction(Human Indenpendent)





In [50]:
import pandas as pd
import plotly.subplots as sp
import plotly.graph_objects as go
from IPython.display import display, Image

# Extract True Positives and False Positives for Human Independent dataset
true_positives_hi = df_humanIndependent[(df_humanIndependent['single-class-label'] == 1) & (df_humanIndependent['Binary_Prediction'] == 1)]
false_positives_hi = df_humanIndependent[(df_humanIndependent['single-class-label'] == 0) & (df_humanIndependent['Binary_Prediction'] == 1)]

# Create subplots for Human Independent dataset
fig_hi = sp.make_subplots(rows=1, cols=1, subplot_titles=['True Positives and False Positives'])

# Add violin traces for True Positives and False Positives
fig_hi.add_trace(go.Violin(x=true_positives_hi['Binary_Prediction'], y=true_positives_hi['Class_1_Prob'], name='True Positives', box_visible=True, points='all', jitter=0.2, marker=dict(size=5), line=dict(color='blue')), row=1, col=1)
fig_hi.add_trace(go.Violin(x=false_positives_hi['Binary_Prediction'], y=false_positives_hi['Class_1_Prob'], name='False Positives', box_visible=True, points='all', jitter=0.2, marker=dict(size=5), line=dict(color='orange')), row=1, col=1)

# Update layout for better presentation
fig_hi.update_layout(template='plotly_white', title='Violin Plot with Data Points for True Positives and False Positives (Human Independent)', height=1000, legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01))
fig_hi.update_xaxes(title_text='Binary Prediction', row=1, col=1)
fig_hi.update_yaxes(title_text='Class 1 Probability', row=1, col=1)
fig_hi.update_xaxes(showgrid=False, row=1, col=1)
fig_hi.update_yaxes(showgrid=False, row=1, col=1)
fig_hi.update_layout(violingap=0)
fig_hi.update_traces(meanline_visible=True)


### Distribution of class 1 probabilities for positive prediction(Human Test)

In [51]:
import pandas as pd
import plotly.subplots as sp
import plotly.graph_objects as go
from IPython.display import display, Image
# Extract True Positives and False Positives for Human Independent dataset
true_positives_ht = test_df[(test_df['single-class-label'] == 1) & (test_df['Binary_Prediction'] == 1)]
false_positives_ht = test_df[(test_df['single-class-label'] == 0) & (test_df['Binary_Prediction'] == 1)]

# Create subplots for Human Independent dataset
fig_ht = sp.make_subplots(rows=1, cols=1, subplot_titles=['True Positives and False Positives'])

# Add violin traces for True Positives and False Positives
fig_ht.add_trace(go.Violin(x=true_positives_ht['Binary_Prediction'], y=true_positives_ht['Class_1_Prob'], name='True Positives', box_visible=True, points='all', jitter=0.2, marker=dict(size=5), line=dict(color='blue')), row=1, col=1)
fig_ht.add_trace(go.Violin(x=false_positives_ht['Binary_Prediction'], y=false_positives_ht['Class_1_Prob'], name='False Positives', box_visible=True, points='all', jitter=0.2, marker=dict(size=5), line=dict(color='orange')), row=1, col=1)

# Update layout for better presentation
fig_ht.update_layout(template='plotly_white', title='Violin Plot with Data Points for True Positives and False Positives (Human Test)', height=1000, legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01))
fig_ht.update_xaxes(title_text='Binary Prediction', row=1, col=1)
fig_ht.update_yaxes(title_text='Class 1 Probability', row=1, col=1)
fig_ht.update_xaxes(showgrid=False, row=1, col=1)
fig_ht.update_yaxes(showgrid=False, row=1, col=1)
fig_ht.update_layout(violingap=0)
fig_ht.update_traces(meanline_visible=True)


## References


*  https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Transfer_Learning_With_ChemBERTa_Transformers.ipynb
*   https://huggingface.co/DeepChem

*  Ramsundar, B., Eastman, P., Walters, P., Pande, V., Leswing, K., & Wu, Z. (2019). Deep Learning for the Life Sciences. O’Reilly Media. https://www.amazon.com/Deep-Learning-Life-Sciences-Microscopy/dp/1492039837

*  Vignaux, P. A., Lane, T. R., Urbina, F., Gerlach, J., Puhl, A. C., Snyder, S. H., & Ekins, S. (2023). Validation of Acetylcholinesterase Inhibition Machine Learning Models for Multiple Species. Chemical Research in Toxicology, 36(2), 188–201. https://doi.org/10.1021/acs.chemrestox.2c00283


*   Ahmad, W., Simon, E., Chithrananda, S., Grand, G., & Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models (arXiv:2209.01712). arXiv. http://arxiv.org/abs/2209.01712
*   Chithrananda, S., Grand, G., & Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2010.09885







