## Molecular Property Prediction

Molecular property prediction plays a crucial role in drug discovery: it estimates key characteristics such as solubility, permeability, and bioavailability without extensive experimental work. These predictions help prioritize compounds for synthesis and testing, saving time and resources.

This project utilizes the following tools:
- **RDKit**: A cheminformatics library for handling chemical information and generating molecular descriptors.
- **Mordred**: A Python library for calculating thousands of molecular descriptors from chemical structures.

### Objectives of this Project
- Predict aqueous solubility (logS) values of small drug-like molecules.
- Build a regression model using experimental data and molecular descriptors.
- Visualize relationships between descriptors and solubility values.

In [1]:
%pip install requests
%pip install pandas
%pip install mordred
%pip install statsmodels

Note: you may need to restart the kernel to use updated packages.



### Aqueous solubility (logS)

Aqueous solubility (logS) is a critical molecular property in drug discovery as it directly impacts a compound's absorption, distribution, and overall bioavailability. Poor solubility is a major reason why many promising drug candidates fail during development.

The experimental logS values for this project were sourced from:  
[**The AAPS Journal 2005; 7 (1) Article 10**](https://link.springer.com/article/10.1208/aapsj070110)
  
This dataset provides high-quality solubility data for drug-like organic compounds under standardized conditions.


In [2]:
import pandas as pd        
df = pd.read_csv('logS_dataset.csv', sep=';') 
df.head(5)  

Unnamed: 0,Name,logS
0,"1,2,3-Trichlorobenzene",–3.76
1,"1,3,5-Trichlorobenzene",–4.44
2,"1,4-Dibromobenzene",–4.07
3,17Alpha-ethynylestradiol,–4.484
4,1-Butyltheobromine,–1.625
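
The logS values in the preview above appear to be written with an en dash (U+2013) rather than an ASCII minus sign. If the CSV file contains the same character, pandas would read the column as text instead of numbers. The helper below is a minimal sketch of one way to normalize such values; `to_float` is a hypothetical name and is not part of the original notebook.

```python
def to_float(value):
    """Convert a value such as '\u20133.76' to a float, accepting dash-like minus signs."""
    text = str(value).strip()
    # Map the en dash (U+2013) and the Unicode minus sign (U+2212) to ASCII '-'.
    for dash in ("\u2013", "\u2212"):
        text = text.replace(dash, "-")
    return float(text)


print(to_float("\u20133.76"))  # en dash used as a minus sign
print(to_float("-1.625"))      # plain ASCII minus is left unchanged
```

Assuming the column is named `logS` as in the preview, it could be applied with `df['logS'] = df['logS'].map(to_float)`.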


In [3]:
import requests
import time

### Retrieving Canonical SMILES from PubChem

Canonical SMILES for the compounds in the dataset were retrieved using the PubChem PUG REST API. According to PubChem's usage policies:
- The request rate must be limited to **five requests per second** or fewer to avoid being temporarily blocked from accessing PubChem (or NCBI) resources.
- Each request has a standard time limit of **30 seconds**, after which a time-out error will occur if the request is not completed.

To comply with these policies, requests were processed in a controlled manner with a delay between queries.
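
One way to enforce the five-requests-per-second limit is to keep a minimum interval between consecutive requests. The class below is a minimal sketch under that assumption; `Throttle` and its parameters are hypothetical names, and the injectable `clock` and `sleep` arguments exist only to make the behavior easy to check without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first call

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each `requests.get(...)` would keep the request rate at or below the configured limit, regardless of how fast the individual responses arrive.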

In [4]:
prolog = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
name_list = df['Name'].values.tolist()

smiles = []

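def get_smiles(start, end):
    """Return SMILES for name_list[start:end]; 'None' marks names PubChem cannot resolve."""
    smiles_batch = []
    for name in name_list[start:end]:

        url = prolog + '/compound/name/' + name + '/property/CanonicalSMILES/txt'
        res = requests.get(url, timeout=30) # match PubChem's 30-second request limit

        if res.status_code == 200:
            # Use a set to remove duplicates (w/ different CIDs).
            # Note: if a name resolves to several distinct SMILES, all of them are
            # appended, so the result can be longer than the input slice.
            smiles_batch += list(set(res.text.split()))
        else:
            smiles_batch.append('None')

        # Pause after every request so the rate stays at or below five requests per second
        time.sleep(0.2)

    return smiles_batch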
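
The request URL above is built by concatenating the raw compound name. Names in this dataset contain spaces, commas, and parentheses, so percent-encoding the name before inserting it into the path is a reasonable precaution. The following is a minimal sketch; `smiles_url` and `PUG_REST` are hypothetical names, and whether encoding changes which names PubChem resolves has not been checked here.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def smiles_url(name):
    """Build the name-to-SMILES request URL with the compound name percent-encoded."""
    # safe='' also encodes '/', which would otherwise be read as a path separator.
    encoded = quote(name, safe="")
    return f"{PUG_REST}/compound/name/{encoded}/property/CanonicalSMILES/txt"


print(smiles_url("5-Methyl barbiturate"))
```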

In [5]:
num_batch = len(name_list) // 50 # Batch size of 50 compounds

for i in range(num_batch):
    start = i*50
    end = (i+1)*50
    smiles += get_smiles(start, end)
    print(f'Compounds {start+1} - {end} Done!')
    time.sleep(1)

last_batch_start = 50*num_batch
last_batch_end = 50*num_batch + len(name_list)%50
smiles += get_smiles(last_batch_start, last_batch_end)
print(f'Compounds {last_batch_start+1} - {last_batch_end} Done!')

print(len(smiles))

Compounds 1 - 50 Done!
Compounds 51 - 100 Done!
Compounds 101 - 150 Done!
Compounds 151 - 200 Done!
Compounds 201 - 250 Done!
Compounds 251 - 300 Done!
Compounds 301 - 322 Done!
322
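
The full batches and the final partial batch above are handled by separate code paths. As a sketch of an alternative, a single generator can yield every `(start, end)` pair, including the last partial one, and yields nothing extra when the number of compounds is an exact multiple of the batch size. `batch_bounds` is a hypothetical helper, not part of the original notebook.

```python
def batch_bounds(total, size=50):
    """Yield (start, end) index pairs covering range(total) in chunks of at most `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)


for start, end in batch_bounds(322):
    print(f"Compounds {start + 1} - {end}")
```

Each pair could be passed directly to `get_smiles(start, end)` inside the loop.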


The retrieved Canonical SMILES were added as a new column to the dataset. This ensures that molecular structures are readily accessible for further descriptor calculations and analysis.

In [6]:
df['Canonical_SMILES'] = smiles
df

Unnamed: 0,Name,logS,Canonical_SMILES
0,"1,2,3-Trichlorobenzene",–3.76,C1=CC(=C(C(=C1)Cl)Cl)Cl
1,"1,3,5-Trichlorobenzene",–4.44,C1=C(C=C(C=C1Cl)Cl)Cl
2,"1,4-Dibromobenzene",–4.07,C1=CC(=CC=C1Br)Br
3,17Alpha-ethynylestradiol,–4.484,CC12CCC3C(C1CCC2(C#C)O)CCC4=C3C=CC(=C4)O
4,1-Butyltheobromine,–1.625,CCCCN1C(=O)C2=C(N=CN2C)N(C1=O)C
...,...,...,...
317,Uric acid,–3.402,C12=C(NC(=O)N1)NC(=O)NC2=O
318,Vinbarbital,–2.458,CCC=C(C)C1(C(=O)NC(=O)NC1=O)CC
319,Xanthine,–2.483,C1=NC2=C(N1)C(=O)NC(=O)N2
320,Zidovudine,–1.029,CC1=CN(C(=O)NC1=O)C2CC(C(O2)CO)N=[N+]=[N-]


### Handling Missing SMILES

Thirty compounds could not be resolved by name in PubChem. Their SMILES were looked up manually and added to the dataset. After this update, the dataframe is checked for any remaining missing values.

In [7]:
indices = [i for i, val in enumerate(smiles) if val == 'None']

name_not_found = []
for i in indices:
    name_not_found.append(df['Name'].iloc[i])

print(len(indices))
print(name_not_found)

30
['1-Propyltheobromine', '2-Aminopteridine', '2-Hydroxypteridine', '5,5-Diethylbarbiturate', '5,5-Dimethylbarbiturate', '5,5-Diphenylbarbiturate', '5,5-Dipropylbarbiturate', '5-Allyl-5-phenylbarbiturate', '5-Ethyl-5-(3-methylbut-2-enyl)barbiturate', '5-Ethyl-5-allylbarbiturate', '5-Ethyl-5-nonylbarbiturate', '5-Ethyl-5-octylbarbiturate', '5-Ethyl-5-pentylbarbiturate', '5-Ethyl-5-propylbarbiturate', '5-Ethyl-barbiturate', '5-i-Propyl-5-(3-methylbut-2enyl)barbiturate', '5-Methyl barbiturate', '5-Methyl-5-(3-methylbut-2enyl)barbiturate', '5-Methyl-5-ethylbarbiturate', '5-t-Butyl-5-(3-methylbut-2enyl)barbiturate', 'Cyclobutane-spirobarbiturate', 'Cycloethane-spirobarbiturate', 'Cycloheptane-spirobarbiturate', 'Cyclohexane-spirobarbiturate', 'Cyclopentane-spirobarbiturate', 'Cyclopropane-spirobarbiturate', 'Isopropylbarbiturate', 'Pteridine-2-methyl-thiol', 'Pteridine-4-methyl-thiol', 'Pteridine-7-methyl-thiol']


In [8]:
smiles_missing = {'1-Propyltheobromine': 'CCCN1C(=O)C2=C(N=CN2C)N(C1=O)C',
                  '2-Aminopteridine': 'N=1C=CN=C2C=NC(=NC12)N',
                  '2-Hydroxypteridine': 'O=C1N=CN=C2NC=CN=C12',
                  '5,5-Diethylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(CC)CC',
                  '5,5-Dimethylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(C)C',
                  '5,5-Diphenylbarbiturate': 'O=C1NC(=O)C(C=2C=CC=CC2)(C=3C=CC=CC3)C(=O)N1',
                  '5,5-Dipropylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(CCC)CCC',
                  '5-Allyl-5-phenylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(C=2C=CC=CC2)CC=C',
                  '5-Ethyl-5-(3-methylbut-2-enyl)barbiturate': 'O=C1NC(=O)C(C=CC(C)C)(C(=O)N1)CC',
                  '5-Ethyl-5-allylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(CC=C)CC',
                  '5-Ethyl-5-nonylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(CC)CCCCCCCCC',
                  '5-Ethyl-5-octylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(CC)CCCCCCCC',
                  '5-Ethyl-5-pentylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(CC)CCCCC',
                  '5-Ethyl-5-propylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(CC)CCC',
                  '5-Ethyl-barbiturate': 'O=C1NC(=O)C(C(=O)N1)CC',
                  '5-i-Propyl-5-(3-methylbut-2enyl)barbiturate': 'O=C1NC(=O)C(C=CC(C)C)(C(=O)N1)C(C)C',
                  '5-Methyl barbiturate': 'O=C1NC(=O)C(C(=O)N1)C',
                  '5-Methyl-5-(3-methylbut-2enyl)barbiturate': 'O=C1NC(=O)C(C=CC(C)C)(C(=O)N1)C',
                  '5-Methyl-5-ethylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)(C)CC',
                  '5-t-Butyl-5-(3-methylbut-2enyl)barbiturate': 'O=C1N(C(=O)C(C=CC(C)C)(C(=O)N1)C(C)(C)C)',
                  'Cyclobutane-spirobarbiturate': 'O=C1NC(=O)C2(C(=O)N1)CCCC2',
                  'Cycloethane-spirobarbiturate': 'O=C1NC(=O)C2(C(=O)N1)CC2',
                  'Cycloheptane-spirobarbiturate': 'O=C1NC(=O)C2(C(=O)N1)CCCCCCC2',
                  'Cyclohexane-spirobarbiturate': 'O=C1NC(=O)C2(C(=O)N1)CCCCCC2',
                  'Cyclopentane-spirobarbiturate': 'O=C1NC(=O)C2(C(=O)N1)CCCCC2',
                  'Cyclopropane-spirobarbiturate': 'O=C1NC(=O)C2(C(=O)N1)CCC2',
                  'Isopropylbarbiturate': 'O=C1NC(=O)C(C(=O)N1)C(C)C',
                  'Pteridine-2-methyl-thiol': 'N=1C=CN=C2C=NC(=NC12)SC',
                  'Pteridine-4-methyl-thiol': 'N=1C=NC(SC)=C2N=CC=NC12',
                  'Pteridine-7-methyl-thiol': 'N=1C=NC=2N=C(SC)C=NC2C1'
                  }

In [9]:
for name, smi in smiles_missing.items():  # 'smi' avoids overwriting the global 'smiles' list
    df.loc[(df["Name"] == name) & (df["Canonical_SMILES"] == "None"), "Canonical_SMILES"] = smi

print(df[df['Canonical_SMILES'] == 'None'])

Empty DataFrame
Columns: [Name, logS, Canonical_SMILES]
Index: []
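
The same fill-and-check logic can be expressed without a dataframe, which makes it easy to see which names remain unresolved after the manual entries are applied. This is a minimal sketch; `fill_missing` is a hypothetical helper, and "Unknown compound" is a made-up placeholder used only to show the unresolved case.

```python
def fill_missing(names, found, overrides, missing="None"):
    """Return (filled_smiles, unresolved_names) after applying manual overrides."""
    filled, unresolved = [], []
    for name, smi in zip(names, found):
        if smi == missing:
            # Fall back to the manually collected SMILES, if one exists.
            smi = overrides.get(name, missing)
            if smi == missing:
                unresolved.append(name)
        filled.append(smi)
    return filled, unresolved


names = ["Xanthine", "Isopropylbarbiturate", "Unknown compound"]
found = ["C1=NC2=C(N1)C(=O)NC(=O)N2", "None", "None"]
overrides = {"Isopropylbarbiturate": "O=C1NC(=O)C(C(=O)N1)C(C)C"}

filled, unresolved = fill_missing(names, found, overrides)
print(unresolved)
```

An empty `unresolved` list corresponds to the empty dataframe printed above.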


The final updated dataframe, now complete with Canonical SMILES, is saved as a CSV file for further use in the project.

In [10]:
df.to_csv("logS_dataset_updated.csv", index=False)

In [20]:
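# A single compound name can resolve to several CIDs,
# which is why get_smiles de-duplicates the SMILES it receives.
prolog = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
url = prolog + "/compound/name/" + 'Ceftazidime' + "/cids/txt"
res = requests.get(url)

print(res.text.split())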

['5481173', '5484131', '2650', '91713', '44298077', '133687818', '66509024']
