## Environment Setup

To run this notebook, use the main conda environment (created from `environment_main.yaml`). If you are working in a terminal, activate it with:

```bash
conda activate main_DE
```

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import rdkit
from rdkit import Chem
from rdkit.Chem import Draw

from rdkit import DataStructs
from rdkit.Chem import Fragments, rdMolDescriptors, rdchem, MCS, rdFingerprintGenerator, Descriptors
from rdkit.ML.Cluster import Butina

from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score

# Local imports
from chem_utils import neutralize_atoms, get_largest_frag




## **Dataset Collection**

In this section we collect datasets from several sources to build the compiled dataset. The compiled dataset is then sanitized with the standardized procedure described below (***Filtering and Pre-Processing Procedure***) to obtain **valid_df_PBT**, with 5130 compounds (2710 labelled as PBT and 2420 as non-PBT).

Data is sourced from:

* Chemicals from the European Chemicals Agency (ECHA) PBT/vPvB assessments under the previous EU chemicals legislation
* ECHA-registered substances
* ECHA PBT assessment list
* ECHA list of substances subject to the POPs Regulation
* Stockholm Convention on POPs
* European databases SMILECAS, EINECS and ELINCS

### **Potential PBT chemicals identified by Strempel et al**

In [2]:
Stremper_et_al = pd.read_csv('Compiled_dataset/Strempel_etal_PBTcompounds.csv')
Stremper_et_al

Unnamed: 0,CAS,R,SMILES,PBT_label
0,50-29-3,pr,Clc1ccc(cc1)C(c1ccc(Cl)cc1)C(Cl)(Cl)Cl,1
1,50-41-9,pr,CCN(CC)CCOc1ccc(cc1)C(=C(Cl)c1ccccc1)c1ccccc1,1
2,50-52-2,pr,CSc1ccc2Sc3ccccc3N(CCC3CCCCN3C)c2c1,1
3,53-19-0,pr,Clc1ccc(cc1)C(C(Cl)Cl)c1ccccc1Cl,1
4,53-69-0,pr,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1,1
...,...,...,...,...
2780,187348-02-3,n,ClCC12C(Cl)C(Cl)C(CC1(Cl)Cl)C2(CCl)C(Cl)Cl,1
2781,189084-62-6,n,Brc1cccc(Br)c1Oc1ccc(Br)c(Br)c1,1
2782,190383-43-8,n,CCC(Cc1ccccc1)c1cc(O)c(c(=O)o1)C(C1CC1)c1cccc(...,1
2783,226256-56-0,n,C[C@H](NCCCc1cccc(c1)C(F)(F)F)c1cccc2ccccc12,1


### **Expert-verified PBT chemicals from the European Chemicals Agency (ECHA) PBT/vPvB assessments**

In [3]:
ECHA_expert_verified = pd.read_csv('Compiled_dataset/expert-verified_PBT_chemicals_ECHA.csv')
ECHA_expert_verified

Unnamed: 0,SMILES,Source,PBT_label
0,c1ccc(-c2cccc(-c3ccccc3)c2)cc1,https://echa.europa.eu/pbt,1
1,O=S(=O)(c1ccc(Cl)cc1)c1ccc(Cl)cc1,https://echa.europa.eu/pbt,1
2,COc1ccc(C(c2ccc(OC)cc2)C(Cl)(Cl)Cl)cc1,https://echa.europa.eu/pbt,1
3,c1cc2ccc3cccc4ccc(c1)c2c34,https://echa.europa.eu/pbt,1
4,c1ccc2sc(SN(C3CCCCC3)C3CCCCC3)nc2c1,https://echa.europa.eu/pbt,1
...,...,...,...
76,N.O=S(=O)(O)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F...,https://echa.europa.eu/pbt,1
77,O=C1OC(=O)c2c(Br)c(Br)c(Br)c(Br)c21,https://echa.europa.eu/pbt,0
78,O=C([O-])C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(...,https://echa.europa.eu/pbt,1
79,c1ccc2c(c1)-c1cccc3cccc-2c13,https://echa.europa.eu/pbt,1


### **ECHA PBT assessment list**

In [4]:
ECHA_PBT_assessment_list = pd.read_csv('Compiled_dataset/ECHA_PBT_assessment_list.csv')
ECHA_PBT_assessment_list

Unnamed: 0,SMILES,Source,PBT_label
0,Cc1ccc2ccccc2c1,https://echa.europa.eu/information-on-chemical...,0
1,O=[N+]([O-])c1ccc(Oc2ccc(Cl)cc2Cl)cc1,https://echa.europa.eu/information-on-chemical...,1
2,CCCC[Sn](CCCC)(CCCC)O[Sn](CCCC)(CCCC)CCCC,https://echa.europa.eu/information-on-chemical...,1
3,C[Pb](C)(C)C,https://echa.europa.eu/information-on-chemical...,1
4,Clc1ccc(C(Cl)(Cl)Cl)cc1,https://echa.europa.eu/information-on-chemical...,0
...,...,...,...
59,CCCCCCCCCCCCCCCCCCOC(=O)CCc1cc(C(C)(C)C)c(O)c(...,https://echa.europa.eu/information-on-chemical...,0
60,CC(C)(C)c1cc(CCC(=O)OCC(COC(=O)CCc2cc(C(C)(C)C...,https://echa.europa.eu/information-on-chemical...,0
61,CCCCCCCCCCCCCCCS(=O)(=O)Oc1ccccc1,https://echa.europa.eu/information-on-chemical...,0
62,CC[Pb](CC)(CC)CC,https://echa.europa.eu/information-on-chemical...,0


### **The ECHA list of substances subject to POP Regulation**

In [5]:
ECHA_POP_regulation = pd.read_csv('Compiled_dataset/ECHA_substances_POP_Regulation.csv')
ECHA_POP_regulation

Unnamed: 0,SMILES,Source,PBT_label
0,Br[C@H]1CC[C@@H](Br)[C@H](Br)CC[C@@H](Br)[C@@H...,https://www.echa.europa.eu/list-of-substances-...,1
1,Br[C@H]1CC[C@H](Br)[C@H](Br)CC[C@@H](Br)[C@H](...,https://www.echa.europa.eu/list-of-substances-...,1
2,Brc1cc(Br)c(Oc2ccc(Br)c(Br)c2Br)cc1Br,https://www.echa.europa.eu/list-of-substances-...,1
3,Brc1cc(Br)c(Oc2ccc(Br)c(Br)c2)c(Br)c1,https://www.echa.europa.eu/list-of-substances-...,1
4,Brc1ccc(Oc2ccc(Br)c(Br)c2Br)c(Br)c1,https://www.echa.europa.eu/list-of-substances-...,1
...,...,...,...
213,O=C([O-])C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(C(F)...,https://www.echa.europa.eu/list-of-substances-...,1
214,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,https://www.echa.europa.eu/list-of-substances-...,1
215,c1ccc2c(c1)-c1cccc3c1c-2cc1ccccc13,https://www.echa.europa.eu/list-of-substances-...,1
216,O=C(O)CC(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C(F)...,https://www.echa.europa.eu/list-of-substances-...,1


### **The new POP list under the Stockholm Convention**

In [6]:
POP_list_stockolm = pd.read_csv('Compiled_dataset/new POP list under the Stockholm Convention.csv')
POP_list_stockolm

Unnamed: 0,SMILES,Source,PBT_label
0,CCCC(Cl)CCCC(Cl)CCC(Cl)CCC(Cl)CCC(Cl)CCCC(Cl)CCC,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
1,CCCC(Cl)CCC(Cl)CC(Cl)C(Cl)CCC(Cl)CCC(Cl)C(Cl)C...,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
2,COc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
3,CC(Cl)CCCC(Cl)CCCC(Cl)CCCC(Cl)CCCC(Cl)CCCC(Cl)...,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
4,CCCCCCCCCCCC(=O)Oc1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
5,O.[Na+].[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
6,[Na+].[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
7,CCC(Cl)C(Cl)C(Cl)CC(Cl)C(Cl)C(C)Cl,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
8,Brc1cc(Br)c(Oc2cc(Br)c(Br)cc2Br)c(Br)c1,http://www.pops.int/TheConvention/ThePOPs/TheN...,1
9,Brc1cc(Br)c(Oc2c(Br)cc(Br)cc2Br)c(Br)c1,http://www.pops.int/TheConvention/ThePOPs/TheN...,1


In [7]:
len(POP_list_stockolm)

37

### **ECHA-registered substances**

In [8]:
ECHA_reg_substances = pd.read_csv('Compiled_dataset/ECHA-registered_substances.csv')
ECHA_reg_substances

Unnamed: 0,SMILES,Source,PBT_label
0,CC1(C)[C@@H]2CC[C@@]1(C)C(=O)C2,https://echa.europa.eu/information-on-chemical...,0
1,Cl.NC(N)=NCCC[C@H](N)C(=O)O,https://echa.europa.eu/information-on-chemical...,0
2,O=C(O)[C@H](O)[C@@H](O)C(=O)O,https://echa.europa.eu/information-on-chemical...,0
3,O=C(O)[C@@H](O)[C@H](O)C(=O)O,https://echa.europa.eu/information-on-chemical...,0
4,CC1=CCC(/C=C/C(C)(C)C(C)O)C1(C)C,https://echa.europa.eu/information-on-chemical...,0
...,...,...,...
2882,CC1(C)[C@@H]2CC[C@@]1(C)[C@@H](O)C2,https://echa.europa.eu/information-on-chemical...,0
2883,CC1(C)[C@@H]2CC[C@]1(C)[C@H](O)C2,https://echa.europa.eu/information-on-chemical...,0
2884,CC1(C)C2CC[C@]1(C)C(O)C2,https://echa.europa.eu/information-on-chemical...,0
2885,CC1(C)[C@@H]2CC[C@@]1(C)[C@H](O)C2,https://echa.europa.eu/information-on-chemical...,0


In [9]:
data_concat = pd.concat([Stremper_et_al, ECHA_expert_verified, ECHA_PBT_assessment_list, ECHA_POP_regulation, POP_list_stockolm, ECHA_reg_substances], ignore_index=True)
data_concat

Unnamed: 0,CAS,R,SMILES,PBT_label,Source
0,50-29-3,pr,Clc1ccc(cc1)C(c1ccc(Cl)cc1)C(Cl)(Cl)Cl,1,
1,50-41-9,pr,CCN(CC)CCOc1ccc(cc1)C(=C(Cl)c1ccccc1)c1ccccc1,1,
2,50-52-2,pr,CSc1ccc2Sc3ccccc3N(CCC3CCCCN3C)c2c1,1,
3,53-19-0,pr,Clc1ccc(cc1)C(C(Cl)Cl)c1ccccc1Cl,1,
4,53-69-0,pr,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1,1,
...,...,...,...,...,...
6067,,,CC1(C)[C@@H]2CC[C@@]1(C)[C@@H](O)C2,0,https://echa.europa.eu/information-on-chemical...
6068,,,CC1(C)[C@@H]2CC[C@]1(C)[C@H](O)C2,0,https://echa.europa.eu/information-on-chemical...
6069,,,CC1(C)C2CC[C@]1(C)C(O)C2,0,https://echa.europa.eu/information-on-chemical...
6070,,,CC1(C)[C@@H]2CC[C@@]1(C)[C@H](O)C2,0,https://echa.europa.eu/information-on-chemical...
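
The empty `CAS`, `R`, and `Source` cells above arise because `pd.concat` aligns the frames on column names and fills any column missing from a frame with `NaN`. A minimal sketch with two hypothetical toy frames (the values are placeholders, not taken from the datasets):

```python
import pandas as pd

# Hypothetical toy frames mimicking the two column layouts used above.
with_cas = pd.DataFrame({"CAS": ["000-00-0"], "SMILES": ["CCO"], "PBT_label": [1]})
with_source = pd.DataFrame({"SMILES": ["CCC"], "Source": ["example"], "PBT_label": [0]})

# Columns are unioned; cells with no counterpart in a frame become NaN.
combined = pd.concat([with_cas, with_source], ignore_index=True)
print(combined)
print(combined.isna().sum())
```

This is why the next cell drops `Source`, `CAS`, and `R`: only `SMILES` and `PBT_label` are populated for every source.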


In [10]:
data_PBT_concat= data_concat.drop(['Source', 'CAS', 'R'], axis= 1)
data_PBT_concat

Unnamed: 0,SMILES,PBT_label
0,Clc1ccc(cc1)C(c1ccc(Cl)cc1)C(Cl)(Cl)Cl,1
1,CCN(CC)CCOc1ccc(cc1)C(=C(Cl)c1ccccc1)c1ccccc1,1
2,CSc1ccc2Sc3ccccc3N(CCC3CCCCN3C)c2c1,1
3,Clc1ccc(cc1)C(C(Cl)Cl)c1ccccc1Cl,1
4,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1,1
...,...,...
6067,CC1(C)[C@@H]2CC[C@@]1(C)[C@@H](O)C2,0
6068,CC1(C)[C@@H]2CC[C@]1(C)[C@H](O)C2,0
6069,CC1(C)C2CC[C@]1(C)C(O)C2,0
6070,CC1(C)[C@@H]2CC[C@@]1(C)[C@H](O)C2,0


## **Filtering and Pre-Processing Procedure**

The following steps have been applied to the `data_PBT_concat` dataset to standardize and filter molecules. These steps should be repeated for any other datasets, such as `agrochemicals` and `DrugBank_data`.

1. **Convert SMILES to Molecules**: 
   - Each SMILES string is converted to an RDKit molecule object using `Chem.MolFromSmiles`.
   
2. **Standardization**:
   - **Largest Fragment**: Only the largest fragment in each molecule is retained, using the `get_largest_frag` function.
   - **Neutralization**: Each molecule is neutralized by adjusting formal charges with the `neutralize_atoms` function.
   - **Canonical SMILES**: A canonical SMILES representation (`comparator_smiles`) is generated for each standardized molecule.

3. **Duplicate Identification**:
   - Identify and group duplicated molecules based on their `comparator_smiles` values. 
   - For each group of duplicates, create pairs of SMILES strings and their corresponding standardized versions along with their labels (`PBT_label`).

4. **Consistency Filtering**:
   - For each unique standardized SMILES in the pairs, check if all labels in the group are consistent. 
   - If the labels are consistent (i.e., they all share the same `PBT_label`), keep this standardized SMILES with its label.

5. **Data Filtering**:
   - Remove from the original dataset every entry whose standardized SMILES belongs to a duplicate group, whether or not the group's labels were consistent.
   - Add back one entry (standardized SMILES and label) for each consistent group kept in the previous step to create the combined, standardized dataset.

6. **Deduplication**:
   - Remove any duplicate rows in the standardized dataset to ensure each molecule is unique.

7. **Validity Check**:
   - Define a function to check if each SMILES can be converted to a valid molecule.
   - Filter the dataset to keep only rows where the standardized SMILES can be successfully converted to valid molecule objects.

The resulting dataset will contain unique, valid, and standardized SMILES with consistent labels.
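
Steps 1 and 2 rely on `get_largest_frag` and `neutralize_atoms` from the local `chem_utils` module, which is not shown in this notebook. The following is only a sketch of what such helpers might look like, assuming the neutralization follows the recipe in the RDKit Cookbook and that "largest" means the fragment with the most heavy atoms; the actual implementations may differ.

```python
from rdkit import Chem

# SMARTS from the RDKit Cookbook: charged atoms that can be neutralized by
# adding or removing hydrogens (a quaternary N+ without H is left untouched).
_NEUTRALIZE_PATTERN = Chem.MolFromSmarts(
    "[+1!h0!$([*]~[-1,-2,-3,-4]),-1!$([*]~[+1,+2,+3,+4])]"
)

def get_largest_frag(mol):
    """Return the fragment with the most heavy atoms (None stays None)."""
    if mol is None:
        return None
    frags = Chem.GetMolFrags(mol, asMols=True)
    if not frags:  # empty molecule: nothing to select
        return mol
    return max(frags, key=lambda frag: frag.GetNumHeavyAtoms())

def neutralize_atoms(mol):
    """Neutralize charged atoms in place by adjusting hydrogen counts."""
    if mol is None:
        return None
    for (idx,) in mol.GetSubstructMatches(_NEUTRALIZE_PATTERN):
        atom = mol.GetAtomWithIdx(idx)
        charge = atom.GetFormalCharge()
        h_count = atom.GetTotalNumHs()
        atom.SetFormalCharge(0)
        atom.SetNumExplicitHs(h_count - charge)
        atom.UpdatePropertyCache()
    return mol

# Sodium pentachlorophenolate, one of the Stockholm Convention entries.
salt = Chem.MolFromSmiles("[Na+].[O-]c1c(Cl)c(Cl)c(Cl)c(Cl)c1Cl")
standardized = neutralize_atoms(get_largest_frag(salt))
print(Chem.MolToSmiles(standardized))
```

Both helpers pass `None` through unchanged, which matters here because some input SMILES fail to parse and produce `None` in the `mol` column.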


1. **Convert SMILES to Molecules**
2. **Standardization**

In [11]:
# 1. Convert SMILES to RDKit molecules
data_PBT_concat['mol'] = data_PBT_concat['SMILES'].apply(Chem.MolFromSmiles)
# 2. Get the largest fragment
data_PBT_concat['mol_no-mc'] = data_PBT_concat['mol'].apply(get_largest_frag)
# 3. Neutralize molecules
data_PBT_concat['mol_no-mc-neutral'] = data_PBT_concat['mol_no-mc'].apply(neutralize_atoms)
# 4. Canonicalize everything to a new SMILES column
data_PBT_concat['comparator_smiles'] = data_PBT_concat['mol_no-mc-neutral'].apply(
    lambda mol: Chem.MolToSmiles(mol) if mol else None)

data_PBT_concat.head()

[11:20:16] Can't kekulize mol.  Unkekulized atoms: 9 10 11
[11:20:16] Explicit valence for atom # 7 N, 4, is greater than permitted
[11:20:16] Explicit valence for atom # 1 C, 5, is greater than permitted
[11:20:16] Explicit valence for atom # 8 N, 4, is greater than permitted
[11:20:16] Can't kekulize mol.  Unkekulized atoms: 2 3 4 5 6 7 9
[11:20:16] Explicit valence for atom # 9 N, 4, is greater than permitted
[11:20:16] Explicit valence for atom # 7 N, 4, is greater than permitted
[11:20:16] Explicit valence for atom # 10 N, 4, is greater than permitted
[11:20:16] Explicit valence for atom # 8 O, 3, is greater than permitted
[11:20:16] Explicit valence for atom # 6 N, 4, is greater than permitted
[11:20:16] Explicit valence for atom # 22 O, 3, is greater than permitted
[11:20:16] Can't kekulize mol.  Unkekulized atoms: 17 19 20 21 22 23 24
[11:20:16] Explicit valence for atom # 11 O, 3, is greater than permitted
[11:20:16] Can't kekulize mol.  Unkekulized atoms: 4 5 6 7 8 10 11 12 1

Unnamed: 0,SMILES,PBT_label,mol,mol_no-mc,mol_no-mc-neutral,comparator_smiles
0,Clc1ccc(cc1)C(c1ccc(Cl)cc1)C(Cl)(Cl)Cl,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c090>,<rdkit.Chem.rdchem.Mol object at 0x7f55667819f0>,<rdkit.Chem.rdchem.Mol object at 0x7f55667819f0>,Clc1ccc(C(c2ccc(Cl)cc2)C(Cl)(Cl)Cl)cc1
1,CCN(CC)CCOc1ccc(cc1)C(=C(Cl)c1ccccc1)c1ccccc1,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c690>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781c30>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781c30>,CCN(CC)CCOc1ccc(C(=C(Cl)c2ccccc2)c2ccccc2)cc1
2,CSc1ccc2Sc3ccccc3N(CCC3CCCCN3C)c2c1,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c6f0>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781c90>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781c90>,CSc1ccc2c(c1)N(CCC1CCCCN1C)c1ccccc1S2
3,Clc1ccc(cc1)C(C(Cl)Cl)c1ccccc1Cl,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c390>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781b70>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781b70>,Clc1ccc(C(c2ccccc2Cl)C(Cl)Cl)cc1
4,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c1b0>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781f90>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781f90>,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1


3. **Duplicate Identification**

In [12]:
df = data_PBT_concat[data_PBT_concat.duplicated(keep=False, subset=['comparator_smiles'])]

pair_list = df.groupby(by=['comparator_smiles']).apply(lambda x: tuple(x.index))
pair_list = pair_list.tolist()
print(pair_list)

[(4777, 5938), (549, 2930, 2931, 3165), (2683, 2684), (1364, 1758), (3157, 3167), (3161, 3171), (3156, 3166), (3158, 3168), (2933, 3130), (3162, 3172), (2932, 3129), (2877, 2889), (806, 2695), (3159, 3169), (3164, 3174), (3061, 3132), (2781, 3062, 3133), (3064, 3135), (3063, 3134), (2934, 3131), (3163, 3173), (335, 368), (3160, 3170), (3583, 5898), (145, 880), (2798, 2843), (2879, 2913), (3281, 6006), (4990, 6014), (3544, 5887), (5412, 5413), (627, 2967, 3098), (3615, 5903), (3031, 3108), (3032, 3109), (3008, 3106), (3009, 3107), (3048, 3101), (449, 4067, 5876), (3939, 4023, 6050), (205, 766), (3662, 5997), (3326, 6016), (4022, 6059), (4020, 4521), (4520, 5700), (3050, 3117), (3049, 3116), (3033, 3120), (3026, 3119), (3007, 3118), (1180, 2482, 2966, 3110), (907, 4066, 5867), (2814, 2842), (252, 363), (1252, 3038, 3097), (4064, 5866), (394, 613), (215, 604), (4250, 6008), (3373, 6053), (4190, 4191), (3826, 3966), (3925, 3926), (4197, 4198), (4746, 4747, 4748, 4927, 5333), (4749, 4901), 

In [13]:
len(pair_list)

593

4. **Consistency Filtering**:

In [14]:
pairs_dicts = []

for pair in pair_list:
    # Enumerate every unordered pair of row indices within a duplicate group
    for i in range(len(pair)):
        for j in range(i + 1, len(pair)):
            i1, i2 = pair[i], pair[j]
            # Record original and re-canonicalized SMILES plus labels for both rows
            d = {
                'smi1_original': df.loc[i1, 'SMILES'],
                'smi2_original': df.loc[i2, 'SMILES'],
                'smi1_standardized': Chem.MolToSmiles(Chem.MolFromSmiles(df.loc[i1, 'comparator_smiles'])),
                'smi2_standardized': Chem.MolToSmiles(Chem.MolFromSmiles(df.loc[i2, 'comparator_smiles'])),
                'label_1': df.loc[i1, 'PBT_label'],
                'label_2': df.loc[i2, 'PBT_label'],
            }
            pairs_dicts.append(d)

# Create a DataFrame from the list of dictionaries and save it as a CSV file
pd.DataFrame(pairs_dicts).to_csv('pair_comparisons_newdataPBT.csv', index=False)

In [15]:
pairs = pd.read_csv('pair_comparisons_newdataPBT.csv')
pairs

Unnamed: 0,smi1_original,smi2_original,smi1_standardized,smi2_standardized,label_1,label_2
0,BrC(Br)Br,BrC(Br)Br,BrC(Br)Br,BrC(Br)Br,0,0
1,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,Br[C@H]1CC[C@@H](Br)[C@H](Br)CC[C@@H](Br)[C@@H...,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,1,1
2,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,Br[C@H]1CC[C@H](Br)[C@H](Br)CC[C@@H](Br)[C@H](...,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,1,1
3,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,Br[C@H]1CC[C@@H](Br)[C@H](Br)CC[C@H](Br)[C@H](...,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,1,1
4,Br[C@H]1CC[C@@H](Br)[C@H](Br)CC[C@@H](Br)[C@@H...,Br[C@H]1CC[C@H](Br)[C@H](Br)CC[C@@H](Br)[C@H](...,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,BrC1CCC(Br)C(Br)CCC(Br)C(Br)CCC1Br,1,1
...,...,...,...,...,...,...
2008,c1ccc2cc3c(cc2c1)c1cccc2cccc3c12,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,1,1
2009,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,c1ccc2cc3c(cc2c1)-c1cccc2cccc-3c12,1,1
2010,c1ccc2cc3c(cc2c1)c1ccccc1c1ccccc31,c1ccc2cc3c(cc2c1)c1ccccc1c1ccccc31,c1ccc2cc3c4ccccc4c4ccccc4c3cc2c1,c1ccc2cc3c4ccccc4c4ccccc4c3cc2c1,1,1
2011,c1ccc2sc(SN(C3CCCCC3)C3CCCCC3)nc2c1,c1ccc2sc(SN(C3CCCCC3)C3CCCCC3)nc2c1,c1ccc2sc(SN(C3CCCCC3)C3CCCCC3)nc2c1,c1ccc2sc(SN(C3CCCCC3)C3CCCCC3)nc2c1,1,1
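
The table has more rows (2012) than there are duplicate groups (593) because a group of n row indices yields n(n-1)/2 pairs. The nested loops used to build the pairs are equivalent to `itertools.combinations`; a minimal sketch using three of the index groups printed earlier:

```python
from itertools import combinations

# Three index groups copied from the printed pair_list.
groups = [(4777, 5938), (549, 2930, 2931, 3165), (2781, 3062, 3133)]

# Every unordered pair of indices within each group.
index_pairs = [pair for group in groups for pair in combinations(group, 2)]

for group in groups:
    n = len(group)
    print(group, "->", n * (n - 1) // 2, "pairs")
print(len(index_pairs), "pairs in total")
```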


In [16]:
unique_smis = list(set(pairs['smi1_standardized']))
keep = {}
for smi in unique_smis:
    rows = pairs[pairs['smi1_standardized']==smi]
    labels = list(rows['label_1']) + list(rows['label_2'])
    if len(set(labels)) <= 1:
        keep[smi] = labels[0]

In [17]:
len(unique_smis)

593
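
As a cross-check on the pairwise approach, the same consistency rule can be expressed with a single groupby: count the distinct labels per standardized SMILES and keep only the groups with exactly one. A minimal sketch on a hypothetical toy frame (the column names match the notebook; the rows and the `keep_toy` name are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "comparator_smiles": ["CCO", "CCO", "CCC", "CCC", "CCN"],
    "PBT_label":         [0,     0,     1,     0,     1],
})

# Restrict to SMILES that occur more than once, as in the duplicate step.
dups = toy[toy.duplicated(subset=["comparator_smiles"], keep=False)]

# Number of distinct labels per duplicated SMILES.
n_labels = dups.groupby("comparator_smiles")["PBT_label"].nunique()

# Consistent groups keep their single label; conflicting groups are dropped.
consistent = n_labels[n_labels == 1].index
keep_toy = (
    dups[dups["comparator_smiles"].isin(consistent)]
    .groupby("comparator_smiles")["PBT_label"]
    .first()
    .to_dict()
)
print(keep_toy)
```

Here `CCO` is kept with label 0, `CCC` is dropped because its labels conflict, and `CCN` is not a duplicate, so it never enters the check.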

In [18]:
# Copy the canonical comparator SMILES into the column used for filtering and output
data_PBT_concat['standardized_smiles'] = data_PBT_concat['comparator_smiles']

In [19]:
data_PBT_concat = data_PBT_concat[~data_PBT_concat['standardized_smiles'].isin(unique_smis)]
data_PBT_concat

Unnamed: 0,SMILES,PBT_label,mol,mol_no-mc,mol_no-mc-neutral,comparator_smiles,standardized_smiles
4,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c1b0>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781f90>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781f90>,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1,Cc1cc(C)c2nc3ccc4ccccc4c3cc2c1
5,c1ccc2ccc3cc4c(ccc5ccccc45)cc3c2c1,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c5d0>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781f30>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781f30>,c1ccc2c(c1)ccc1cc3c(ccc4ccccc43)cc12,c1ccc2c(c1)ccc1cc3c(ccc4ccccc43)cc12
6,Cc1ccc2cc3c4ccccc4ccc3c3CCc1c23,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c150>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781ed0>,<rdkit.Chem.rdchem.Mol object at 0x7f5566781ed0>,Cc1ccc2cc3c(ccc4ccccc43)c3c2c1CC3,Cc1ccc2cc3c(ccc4ccccc43)c3c2c1CC3
9,O=C(CCC1CCCC1)OC1CCC2C3CCC4=CC(=O)CCC4(C)C3CCC12C,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c570>,<rdkit.Chem.rdchem.Mol object at 0x7f55667840f0>,<rdkit.Chem.rdchem.Mol object at 0x7f55667840f0>,CC12CCC(=O)C=C1CCC1C2CCC2(C)C(OC(=O)CCC3CCCC3)...,CC12CCC(=O)C=C1CCC1C2CCC2(C)C(OC(=O)CCC3CCCC3)...
10,CCN(CC)CCN1c2ccccc2Sc2ccc(Cl)cc12,1,<rdkit.Chem.rdchem.Mol object at 0x7f556686c930>,<rdkit.Chem.rdchem.Mol object at 0x7f5566784150>,<rdkit.Chem.rdchem.Mol object at 0x7f5566784150>,CCN(CC)CCN1c2ccccc2Sc2ccc(Cl)cc21,CCN(CC)CCN1c2ccccc2Sc2ccc(Cl)cc21
...,...,...,...,...,...,...,...
5860,CN(C)CCCN(C)C,0,<rdkit.Chem.rdchem.Mol object at 0x7f556677c930>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d4b0>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d4b0>,CN(C)CCCN(C)C,CN(C)CCCN(C)C
5861,C[N+](C)(C)C1CCCCC1.[OH-],0,<rdkit.Chem.rdchem.Mol object at 0x7f556677c990>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d510>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d510>,C[N+](C)(C)C1CCCCC1,C[N+](C)(C)C1CCCCC1
5862,c1ccc(N(CC2CO2)CC2CO2)cc1,0,<rdkit.Chem.rdchem.Mol object at 0x7f556677c9f0>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d570>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d570>,c1ccc(N(CC2CO2)CC2CO2)cc1,c1ccc(N(CC2CO2)CC2CO2)cc1
5863,CCCCCCCCCCCC(=O)N(CCO)CCO,0,<rdkit.Chem.rdchem.Mol object at 0x7f556677ca50>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d5d0>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d5d0>,CCCCCCCCCCCC(=O)N(CCO)CCO,CCCCCCCCCCCC(=O)N(CCO)CCO


5. **Data Filtering**:

In [20]:
# Add the consistent duplicate groups (from the `keep` dict) back to the filtered data
starting_PBTdat = pd.concat([pd.DataFrame([{'standardized_smiles':k, 'PBT_label':v} for k,v in keep.items()]), data_PBT_concat])
starting_PBTdat

Unnamed: 0,standardized_smiles,PBT_label,SMILES,mol,mol_no-mc,mol_no-mc-neutral,comparator_smiles
0,COc1ccc(COC(=O)NC(CO)C(=O)Oc2c(Cl)c(Cl)c(Cl)c(...,1,,,,,
1,O=C(O)c1ccccc1,0,,,,,
2,O=c1oc(=O)c2cc3c(=O)oc(=O)c3cc12,0,,,,,
3,NN=C(c1ccccc1)c1ccccc1,0,,,,,
4,CC1(C)C2CCC1(CS(=O)(=O)O)C(=O)C2,0,,,,,
...,...,...,...,...,...,...,...
5860,CN(C)CCCN(C)C,0,CN(C)CCCN(C)C,<rdkit.Chem.rdchem.Mol object at 0x7f556677c930>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d4b0>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d4b0>,CN(C)CCCN(C)C
5861,C[N+](C)(C)C1CCCCC1,0,C[N+](C)(C)C1CCCCC1.[OH-],<rdkit.Chem.rdchem.Mol object at 0x7f556677c990>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d510>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d510>,C[N+](C)(C)C1CCCCC1
5862,c1ccc(N(CC2CO2)CC2CO2)cc1,0,c1ccc(N(CC2CO2)CC2CO2)cc1,<rdkit.Chem.rdchem.Mol object at 0x7f556677c9f0>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d570>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d570>,c1ccc(N(CC2CO2)CC2CO2)cc1
5863,CCCCCCCCCCCC(=O)N(CCO)CCO,0,CCCCCCCCCCCC(=O)N(CCO)CCO,<rdkit.Chem.rdchem.Mol object at 0x7f556677ca50>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d5d0>,<rdkit.Chem.rdchem.Mol object at 0x7f556670d5d0>,CCCCCCCCCCCC(=O)N(CCO)CCO


In [21]:
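# Keep only the standardized SMILES and label columns (selected by name rather than position)
starting_PBT = starting_PBTdat[['standardized_smiles', 'PBT_label']]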
starting_PBT

Unnamed: 0,standardized_smiles,PBT_label
0,COc1ccc(COC(=O)NC(CO)C(=O)Oc2c(Cl)c(Cl)c(Cl)c(...,1
1,O=C(O)c1ccccc1,0
2,O=c1oc(=O)c2cc3c(=O)oc(=O)c3cc12,0
3,NN=C(c1ccccc1)c1ccccc1,0
4,CC1(C)C2CCC1(CS(=O)(=O)O)C(=O)C2,0
...,...,...
5860,CN(C)CCCN(C)C,0
5861,C[N+](C)(C)C1CCCCC1,0
5862,c1ccc(N(CC2CO2)CC2CO2)cc1,0
5863,CCCCCCCCCCCC(=O)N(CCO)CCO,0


6. **Deduplication**:

In [22]:
starting_PBT = starting_PBT.drop_duplicates()
starting_PBT

Unnamed: 0,standardized_smiles,PBT_label
0,COc1ccc(COC(=O)NC(CO)C(=O)Oc2c(Cl)c(Cl)c(Cl)c(...,1
1,O=C(O)c1ccccc1,0
2,O=c1oc(=O)c2cc3c(=O)oc(=O)c3cc12,0
3,NN=C(c1ccccc1)c1ccccc1,0
4,CC1(C)C2CCC1(CS(=O)(=O)O)C(=O)C2,0
...,...,...
5860,CN(C)CCCN(C)C,0
5861,C[N+](C)(C)C1CCCCC1,0
5862,c1ccc(N(CC2CO2)CC2CO2)cc1,0
5863,CCCCCCCCCCCC(=O)N(CCO)CCO,0


7. **Validity Check**:

In [23]:
# Convert a SMILES string to an RDKit molecule; return None if conversion fails
def to_mol(smi):
    try:
        return Chem.MolFromSmiles(smi)
    except Exception:
        return None

valid_df_PBT = starting_PBT[starting_PBT["standardized_smiles"].apply(lambda smi: to_mol(smi) is not None)]
valid_df_PBT = valid_df_PBT.reset_index(drop=True)
valid_df_PBT

[11:20:30] Can't kekulize mol.  Unkekulized atoms: 3 4 5 6 7
[11:20:30] Explicit valence for atom # 4 B, 5, is greater than permitted
[11:20:30] Can't kekulize mol.  Unkekulized atoms: 0 1 2 3 4
[11:20:30] Explicit valence for atom # 3 B, 5, is greater than permitted
[11:20:30] Explicit valence for atom # 1 B, 5, is greater than permitted


Unnamed: 0,standardized_smiles,PBT_label
0,COc1ccc(COC(=O)NC(CO)C(=O)Oc2c(Cl)c(Cl)c(Cl)c(...,1
1,O=C(O)c1ccccc1,0
2,O=c1oc(=O)c2cc3c(=O)oc(=O)c3cc12,0
3,NN=C(c1ccccc1)c1ccccc1,0
4,CC1(C)C2CCC1(CS(=O)(=O)O)C(=O)C2,0
...,...,...
5125,CN(C)CCCN(C)C,0
5126,C[N+](C)(C)C1CCCCC1,0
5127,c1ccc(N(CC2CO2)CC2CO2)cc1,0
5128,CCCCCCCCCCCC(=O)N(CCO)CCO,0
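
The log lines printed by this cell show how RDKit reports unparsable SMILES: `Chem.MolFromSmiles` logs the problem and returns `None` rather than raising, so the `try`/`except` in `to_mol` mainly guards against non-string input such as `NaN`. A small sketch with illustrative inputs (not taken from the dataset):

```python
from rdkit import Chem

def to_mol(smi):
    """Return an RDKit Mol, or None if the input cannot be parsed."""
    try:
        return Chem.MolFromSmiles(smi)
    except Exception:  # e.g. non-string input such as NaN
        return None

# A valid SMILES, a five-valent carbon, and a missing value.
examples = ["CCO", "C(C)(C)(C)(C)C", float("nan")]
validity = [to_mol(smi) is not None for smi in examples]
print(validity)
```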


In [24]:
#valid_df_PBT.to_csv('Datasets/5130compounds_PBT.csv', index=False)