## **Environment Setup**

To execute this notebook, activate the main conda environment in a new bash shell:

```bash
conda activate main_DE
```

In [1]:
import pandas as pd
from rdkit import Chem

# PBT Prediction with Chemprop
This notebook shows how to **predict Persistence, Bioaccumulation and Toxicity (PBT) using a deep learning model trained with Chemprop.**

To run the model prediction with `chemprop`, open a new bash shell and execute the following commands:

1. **Activate the Conda Environment**  
   First, make sure to activate the `chemprop_DE` Conda environment to access the required dependencies:
   
   ```bash
   conda activate chemprop_DE
   ```

2. **Run the Prediction Command**

   ```bash
   chemprop_predict --test_path your_dataset.csv \
                    --features_generator rdkit_2d_normalized \
                    --no_features_scaling \
                    --checkpoint_dir 0fold_outputs/CC_results_nok/fold_0/ \
                    --preds_path your_final_predictions.csv
   ```

After the **prediction command** has finished in bash, return to this notebook.

In [2]:
import pandas as pd
# Example of a dataset that can be given as input to the model prediction
test_set = pd.read_csv('5130smiles_PBT-nonPBT.csv')
test_set

Unnamed: 0,standardized_smiles
0,CCCC[Sn](CCCC)(CCCC)O[Sn](CCCC)(CCCC)CCCC
1,C=C(F)C(=O)OC
2,Cc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl
3,Clc1ccc2c(Cl)ccnc2c1
4,S=c1[nH]c2ccccc2s1
...,...
5125,CN(C)CCCN(C)C
5126,C[N+](C)(C)C1CCCCC1
5127,c1ccc(N(CC2CO2)CC2CO2)cc1
5128,CCCCCCCCCCCC(=O)N(CCO)CCO
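
The `Unnamed: 0` column in the preview usually indicates that the CSV was saved together with a pandas index. As a minimal sketch, assuming you build your own input table in pandas, the snippet below writes only the SMILES column by passing `index=False`. The three-row `demo_input` frame and the in-memory buffer are illustrative stand-ins for a real file. Whether `chemprop_predict` needs the SMILES in the first column depends on the Chemprop version and on any SMILES-column option you pass, so check this against your installation.

```python
import io

import pandas as pd

# Hypothetical miniature input table, reusing SMILES from the preview above.
demo_input = pd.DataFrame(
    {
        "standardized_smiles": [
            "C=C(F)C(=O)OC",
            "Clc1ccc2c(Cl)ccnc2c1",
            "CN(C)CCCN(C)C",
        ]
    }
)

# Write without the pandas index so no "Unnamed: 0" column appears on reload.
# An in-memory buffer stands in for a file path such as your_dataset.csv.
buffer = io.StringIO()
demo_input.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reload to confirm that the SMILES column is the only (and first) column.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(list(reloaded.columns))
```

For a real dataset, replace the buffer with the file path you intend to pass to `--test_path`.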


Explanation of Each Parameter:

- `--test_path your_dataset.csv`: Path to the dataset file you want to use for predictions. Replace `your_dataset.csv` with the path to your own dataset.

- `--features_generator rdkit_2d_normalized`: Specifies the features generator for Chemprop. `rdkit_2d_normalized` means that Chemprop will use 2D molecular features generated by RDKit and normalize them, ensuring consistency with the training process.

- `--no_features_scaling`: Disables additional scaling of features, preserving the normalization applied by `rdkit_2d_normalized`. This ensures that feature transformations remain consistent with the trained model.

- `--checkpoint_dir 0fold_outputs/CC_results_nok/fold_0/`: Specifies the directory containing the trained model files. Here, `0fold_outputs/CC_results_nok/fold_0/` is the location of the model trained on cluster centroids. Ensure this path matches the actual location of your trained model files.

- `--preds_path your_final_predictions.csv`: Defines the output file for the prediction results. With a bare file name such as `your_final_predictions.csv`, the file is created in the current directory.
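
If you prefer to assemble the call from Python, the parameters above can be sketched as an argument list. This is only an illustration: `build_predict_command` is a hypothetical helper, not part of Chemprop, and it uses only the flags listed in this notebook. The command is printed rather than executed, since running it requires the `chemprop_DE` environment.

```python
import shlex


def build_predict_command(test_path, checkpoint_dir, preds_path):
    """Assemble the chemprop_predict call described above as an argument list."""
    return [
        "chemprop_predict",
        "--test_path", test_path,
        "--features_generator", "rdkit_2d_normalized",
        "--no_features_scaling",
        "--checkpoint_dir", checkpoint_dir,
        "--preds_path", preds_path,
    ]


command = build_predict_command(
    test_path="your_dataset.csv",
    checkpoint_dir="0fold_outputs/CC_results_nok/fold_0/",
    preds_path="your_final_predictions.csv",
)

# Print a shell-ready version of the command for copy and paste.
print(shlex.join(command))

# To run it from Python inside the chemprop_DE environment (not executed here):
# import subprocess
# subprocess.run(command, check=True)
```

Keeping the arguments in a list avoids shell-quoting problems when paths contain spaces.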

The following cell shows an example of the output predictions and their binary classification using the chosen cutoff value (here 0.51).

In [3]:
output_pred = pd.read_csv('outputs_pred_interpr/compiled_dataset_totalpred.csv')
# binary classification with the chosen cutoff value 
output_pred['PBT_bin_pred'] = (output_pred['PBT_label'] > 0.51).astype(int)
# create an RDKit Mol object column (None where a SMILES string cannot be parsed)
output_pred['mol'] = output_pred['standardized_smiles'].apply(Chem.MolFromSmiles)
output_pred

Unnamed: 0,standardized_smiles,PBT_label,PBT_bin_pred,mol
0,CCCC[Sn](CCCC)(CCCC)O[Sn](CCCC)(CCCC)CCCC,0.298063,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f87f6f0>
1,C=C(F)C(=O)OC,0.000273,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f87f510>
2,Cc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl,0.996734,1,<rdkit.Chem.rdchem.Mol object at 0x7fae8f889330>
3,Clc1ccc2c(Cl)ccnc2c1,0.222799,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f889390>
4,S=c1[nH]c2ccccc2s1,0.005313,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f8892d0>
...,...,...,...,...
5125,CN(C)CCCN(C)C,0.006039,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f7d7150>
5126,C[N+](C)(C)C1CCCCC1,0.025639,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f7d71b0>
5127,c1ccc(N(CC2CO2)CC2CO2)cc1,0.011846,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f7d7210>
5128,CCCCCCCCCCCC(=O)N(CCO)CCO,0.000478,0,<rdkit.Chem.rdchem.Mol object at 0x7fae8f7d7270>
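
As a small self-contained sketch of the cutoff step, the snippet below applies the same strict `> 0.51` comparison to three scores copied from the table above and counts the resulting classes. The `demo` frame and the `CUTOFF` name are illustrative and not part of the original notebook.

```python
import pandas as pd

CUTOFF = 0.51  # cutoff value used in this notebook

# Three rows copied from the prediction table above.
demo = pd.DataFrame(
    {
        "standardized_smiles": [
            "C=C(F)C(=O)OC",
            "Cc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl",
            "Clc1ccc2c(Cl)ccnc2c1",
        ],
        "PBT_label": [0.000273, 0.996734, 0.222799],
    }
)

# Scores strictly above the cutoff become class 1 (PBT), all others class 0.
demo["PBT_bin_pred"] = (demo["PBT_label"] > CUTOFF).astype(int)

# Count how many molecules fall into each predicted class.
class_counts = demo["PBT_bin_pred"].value_counts().to_dict()
print(class_counts)
```

Because the comparison is strict, a score exactly equal to the cutoff is assigned to class 0.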
