### __2. Data Preprocessing__

In the previous notebook (step_1), I identified the target, PfDHODH, and retrieved its compound data. In this notebook, I clean that data: I remove rows with null values in the `standard_value` (IC50) column and, for each entry in `canonical_smiles`, keep the longest SMILES string where an entry contains more than one, since it is likely the most informative. The cleaned SMILES are stored in a new pandas Series for further analysis and manipulation, such as standardization.

First, import the necessary libraries and modules for data manipulation.

In [83]:
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

#### __2.1 Remove `null values` in the 'standard_value' (IC50) column__

In [84]:
# read the raw csv file
df = pd.read_csv('../data/chembl_dataset/00_PfDHODH_raw_data.csv')
# Display the DataFrame
df[['standard_value', 'molecule_chembl_id','canonical_smiles' ]]

Unnamed: 0,standard_value,molecule_chembl_id,canonical_smiles
0,42600.00,CHEMBL199572,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O
1,142600.00,CHEMBL199574,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1
2,93400.00,CHEMBL372561,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O
3,153500.00,CHEMBL370865,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1
4,200000.00,CHEMBL199575,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O
...,...,...,...
597,250000.00,CHEMBL4569109,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1
598,250000.00,CHEMBL4568957,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1
599,250000.00,CHEMBL4449622,Cn1nc(O)c(C(N)=O)c1COc1ccccc1
600,10.00,CHEMBL1956285,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...


In [85]:
# Drop rows with missing values in the 'standard_value' column
df.dropna(subset=['standard_value'],inplace=True) 
# Save the DataFrame to a CSV file
df.to_csv('../data/chembl_dataset/01_PfDHODH_compds.csv', index=False) 

In [86]:
# Load the CSV file saved in the previous cell into a pandas DataFrame and display it
data = pd.read_csv('../data/chembl_dataset/01_PfDHODH_compds.csv')
data

Unnamed: 0,action_type,activity_comment,activity_id,activity_properties,assay_chembl_id,assay_description,assay_type,assay_variant_accession,assay_variant_mutation,bao_endpoint,...,target_organism,target_pref_name,target_tax_id,text_value,toid,type,units,uo_units,upper_value,value
0,,,1662473,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,42.60
1,,,1662477,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,142.60
2,,,1662479,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,93.40
3,,,1662481,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,153.50
4,,,1662493,[],CHEMBL863916,Binding affinity to Plasmodium falciparum DHODH,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,200.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
597,,,18951322,[],CHEMBL4331522,Inhibition of Plasmodium falciparum recombinan...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,250.00
598,,,18951323,[],CHEMBL4331522,Inhibition of Plasmodium falciparum recombinan...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,250.00
599,,,18951324,[],CHEMBL4331522,Inhibition of Plasmodium falciparum recombinan...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,250.00
600,,,19145102,[],CHEMBL4378107,Inhibition of Plasmodium falciparum DHODH expr...,B,,,BAO_0000190,...,Plasmodium falciparum,Dihydroorotate dehydrogenase,5833,,,IC50,uM,UO_0000065,,0.01


In [87]:
# Display the shape of the DataFrame
data.shape

(602, 46)

##### __The data looks good and there are no missing values in the 'standard_value' (IC50) column!__
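
A quick way to confirm that no missing values remain in `standard_value` is to count them before and after the drop. The sketch below uses a small toy DataFrame (the values and IDs are illustrative, not taken from the ChEMBL file), and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the ChEMBL file
toy = pd.DataFrame({
    "molecule_chembl_id": ["A", "B", "C"],
    "standard_value": [42600.0, np.nan, 10.0],
})

# Count missing IC50 values before dropping them
n_missing_before = int(toy["standard_value"].isna().sum())

# Drop rows without an IC50 value and count again
cleaned = toy.dropna(subset=["standard_value"]).reset_index(drop=True)
n_missing_after = int(cleaned["standard_value"].isna().sum())

print(n_missing_before, n_missing_after, cleaned.shape)
```

The same two `isna().sum()` calls can be run on `df` and `data` in this notebook to verify the statement directly.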

#### __2.1.1 Classify compounds based on 'standard_value' (IC50) column__
The bioactivity data is reported as IC50 values in nM (nanomolar) units in the `standard_value` column. Compounds with IC50 values of 1,000 nM or less are considered active, compounds with values of 10,000 nM or more are labeled inactive, and compounds with IC50 values between 1,000 and 10,000 nM are classified as intermediate active.

Understanding the compounds based on this classification would be valuable. Let's classify them.

In [88]:
bioactivity_class = [] # List to store the bioactivity class of each compound
for i in data.standard_value: # For each value in the 'standard_value' column
  if float(i) >= 10000:   # If the value is greater than or equal to 10000  
    bioactivity_class.append("inactive") # Then append 'inactive' to the bioactivity_class list
  elif float(i) <= 1000:  # If the value is less than or equal to 1000
    bioactivity_class.append("active") # Then append 'active' to the bioactivity_class list
  else: # If the value is between 1000 and 10000
    bioactivity_class.append("intermediate_active") # Then append 'intermediate_active' to the bioactivity_class list

In [89]:
# Just visualize the classification
bioactivity_class[:5]

['inactive', 'inactive', 'inactive', 'inactive', 'inactive']

In [90]:
# Count the number of occurrences for each bioactivity class
from collections import Counter 
Counter(bioactivity_class)

Counter({'inactive': 279, 'active': 234, 'intermediate_active': 89})
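
The same thresholds can also be applied without an explicit Python loop. The following is a minimal sketch using `np.select`; the helper name `classify_ic50` is hypothetical and the input values are chosen only to exercise each branch, including the two boundary values.

```python
import numpy as np
import pandas as pd

def classify_ic50(values_nm):
    """Label IC50 values (in nM) with the thresholds used in this notebook."""
    values = pd.to_numeric(pd.Series(values_nm), errors="coerce")
    conditions = [
        values >= 10000,  # inactive
        values <= 1000,   # active
    ]
    choices = ["inactive", "active"]
    # Anything matching neither condition falls through to the default,
    # which mirrors the 'else' branch of the loop version
    labels = np.select(conditions, choices, default="intermediate_active")
    return pd.Series(labels, index=values.index, name="bioactivity_class")

example = classify_ic50([42600.0, 10.0, 5300.0, 1000.0, 10000.0])
print(example.tolist())
```

The conditions are evaluated in order, so the result should agree with the loop as long as the thresholds are kept identical.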

##### Now, create a new DataFrame that includes the new 'bioactivity_class' column along with the 'molecule_chembl_id', 'canonical_smiles', and 'standard_value' columns.

In [91]:
data['bioactivity_class'] = bioactivity_class
data2 = data[['molecule_chembl_id', 'canonical_smiles', 'bioactivity_class', 'standard_value']] 
data2.reset_index(inplace=True) # Reset the index of the DataFrame

In [92]:
data2

Unnamed: 0,index,molecule_chembl_id,canonical_smiles,bioactivity_class,standard_value
0,0,CHEMBL199572,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O,inactive,42600.00
1,1,CHEMBL199574,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1,inactive,142600.00
2,2,CHEMBL372561,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O,inactive,93400.00
3,3,CHEMBL370865,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1,inactive,153500.00
4,4,CHEMBL199575,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O,inactive,200000.00
...,...,...,...,...,...
597,597,CHEMBL4569109,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1,inactive,250000.00
598,598,CHEMBL4568957,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1,inactive,250000.00
599,599,CHEMBL4449622,Cn1nc(O)c(C(N)=O)c1COc1ccccc1,inactive,250000.00
600,600,CHEMBL1956285,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...,active,10.00


#### Save the prepared data as csv file

In [93]:
data2.to_csv('../data/chembl_dataset/02_PfDHODH_bioactivity_prepared.csv', index=False) # Save the DataFrame to a CSV file 

#### __2.2 Clean the 'canonical_smiles' column__

##### Extract the longest SMILES string from each entry in the 'canonical_smiles' column. An entry can contain several components separated by '.', for example a compound and its counterion, and the longest SMILES string is assumed to be the most complete representation of the compound structure.

In [94]:
#Drop the 'canonical_smiles' column from the DataFrame and store the result in 'df_no_smiles'
df_no_smiles = data2.drop(columns='canonical_smiles')
df_no_smiles

Unnamed: 0,index,molecule_chembl_id,bioactivity_class,standard_value
0,0,CHEMBL199572,inactive,42600.00
1,1,CHEMBL199574,inactive,142600.00
2,2,CHEMBL372561,inactive,93400.00
3,3,CHEMBL370865,inactive,153500.00
4,4,CHEMBL199575,inactive,200000.00
...,...,...,...,...
597,597,CHEMBL4569109,inactive,250000.00
598,598,CHEMBL4568957,inactive,250000.00
599,599,CHEMBL4449622,inactive,250000.00
600,600,CHEMBL1956285,active,10.00


In [95]:
# Extract the longest SMILES from the 'canonical_smiles' column and store it in a pandas Series
smiles = []
for i in data2.canonical_smiles.tolist(): #Iterate over each compound in the 'canonical_smiles' column
  cpd = str(i).split('.') #Split the compound into individual SMILES strings
  cpd_longest = max(cpd, key = len) #Select the longest SMILES string
  smiles.append(cpd_longest)  #Append the longest SMILES string to the 'smiles' list 

smiles = pd.Series(smiles, name = 'canonical_smiles')   #Convert the 'smiles' list into a pandas Series
smiles

0             CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O
1                 O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1
2              CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O
3                O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1
4                  CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O
                             ...                        
597                  Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1
598               Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1
599                        Cn1nc(O)c(C(N)=O)c1COc1ccccc1
600    Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...
601    Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)c(F)c2)n2nc(C(C)(F...
Name: canonical_smiles, Length: 602, dtype: object
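
To make the effect of the split concrete, here is a small sketch on a multi-component SMILES string. The helper name `longest_fragment` and the sodium acetate input are assumptions chosen for illustration; they do not come from the dataset.

```python
def longest_fragment(smiles):
    """Return the longest '.'-separated component of a SMILES string."""
    parts = str(smiles).split(".")
    return max(parts, key=len)

# Sodium acetate written as two components: the acetate ion and the sodium ion
salt = "CC(=O)[O-].[Na+]"
parent = longest_fragment(salt)
print(parent)

# A single-component SMILES is returned unchanged
single = longest_fragment("Cn1nc(O)c(C(N)=O)c1COc1ccccc1")
print(single)
```

String length is only a proxy for fragment size. If a stricter choice is needed, RDKit's `rdMolStandardize` module offers a `LargestFragmentChooser`, which selects the parent fragment from the parsed molecule rather than from the text.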

In [96]:
# Concatenate the DataFrame without 'canonical_smiles' and the cleaned 'canonical_smiles' Series to form a new DataFrame,
#then display the new DataFrame
df_clean_smiles = pd.concat([df_no_smiles,smiles], axis=1)
df_clean_smiles

Unnamed: 0,index,molecule_chembl_id,bioactivity_class,standard_value,canonical_smiles
0,0,CHEMBL199572,inactive,42600.00,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O
1,1,CHEMBL199574,inactive,142600.00,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1
2,2,CHEMBL372561,inactive,93400.00,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O
3,3,CHEMBL370865,inactive,153500.00,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1
4,4,CHEMBL199575,inactive,200000.00,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O
...,...,...,...,...,...
597,597,CHEMBL4569109,inactive,250000.00,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1
598,598,CHEMBL4568957,inactive,250000.00,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1
599,599,CHEMBL4449622,inactive,250000.00,Cn1nc(O)c(C(N)=O)c1COc1ccccc1
600,600,CHEMBL1956285,active,10.00,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...


#### 2.2.1 Calculate Lipinski descriptors
Lipinski's Rule of Five is a guideline used in drug discovery to assess whether a compound is likely to be orally bioavailable, meaning it can be absorbed into the bloodstream when taken orally. The rule states that a compound is more likely to be orally bioavailable if it meets the following criteria:

1) Molecular Weight: The compound's molecular weight should be less than 500 daltons
2) Hydrogen Bond Donors: The compound should have no more than 5 hydrogen bond donors
3) Hydrogen Bond Acceptors: The compound should have no more than 10 hydrogen bond acceptors
4) Lipophilicity (LogP): The compound's logarithm of the partition coefficient (LogP) should be less than 5.

If a compound meets these criteria, it is considered more likely to have favorable oral bioavailability, which is an important factor in drug development.

In [97]:
# Calculate Lipinski descriptors for each SMILES string and return them as a DataFrame
def lipinski(smiles, verbose=False):
    # 'verbose' is kept in the signature for compatibility; it is not used

    rows = []
    for elem in smiles:
        mol = Chem.MolFromSmiles(elem)  # Returns None if the SMILES cannot be parsed
        if mol is None:
            raise ValueError(f"Could not parse SMILES: {elem}")

        # One row of descriptors per molecule
        rows.append([Descriptors.MolWt(mol),
                     Descriptors.MolLogP(mol),
                     Lipinski.NumHDonors(mol),
                     Lipinski.NumHAcceptors(mol)])

    columnNames = ["MW", "LogP", "NumHDonors", "NumHAcceptors"]
    # dtype=float keeps all four columns as floats, as in the displayed output
    descriptors = pd.DataFrame(data=rows, columns=columnNames, dtype=float)

    return descriptors

In [98]:

df_lipinski = lipinski(df_clean_smiles.canonical_smiles)
df_lipinski
     

Unnamed: 0,MW,LogP,NumHDonors,NumHAcceptors
0,331.37,4.33,1.00,2.00
1,370.20,4.55,2.00,2.00
2,384.23,4.58,1.00,2.00
3,317.34,4.30,2.00,2.00
4,305.33,3.81,1.00,2.00
...,...,...,...,...
597,302.33,2.49,1.00,5.00
598,338.36,3.28,1.00,5.00
599,247.25,0.80,2.00,5.00
600,415.34,5.95,1.00,5.00
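
With the four descriptors available, the rule can be turned into a per-compound violation count. The sketch below applies the thresholds as stated in this notebook to two rows copied from the table above (rows 0 and 600); the function name `count_ro5_violations` and the output column name are hypothetical.

```python
import pandas as pd

def count_ro5_violations(desc):
    """Count Rule of Five violations per row, using this notebook's thresholds."""
    violations = (
        (desc["MW"] >= 500).astype(int)             # MW should be less than 500
        + (desc["LogP"] >= 5).astype(int)           # LogP should be less than 5
        + (desc["NumHDonors"] > 5).astype(int)      # no more than 5 donors
        + (desc["NumHAcceptors"] > 10).astype(int)  # no more than 10 acceptors
    )
    return violations.rename("ro5_violations")

# Two rows copied from the descriptor table above (rows 0 and 600)
sample = pd.DataFrame({
    "MW": [331.37, 415.34],
    "LogP": [4.33, 5.95],
    "NumHDonors": [1.0, 1.0],
    "NumHAcceptors": [2.0, 5.0],
})

sample_violations = count_ro5_violations(sample)
print(sample_violations.tolist())
```

In this sample, the second compound exceeds the LogP limit and is otherwise within the stated ranges. The same function could be called on `df_lipinski` to see how many compounds in the full set break one or more criteria.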


In [99]:
df_clean_smiles

Unnamed: 0,index,molecule_chembl_id,bioactivity_class,standard_value,canonical_smiles
0,0,CHEMBL199572,inactive,42600.00,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O
1,1,CHEMBL199574,inactive,142600.00,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1
2,2,CHEMBL372561,inactive,93400.00,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O
3,3,CHEMBL370865,inactive,153500.00,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1
4,4,CHEMBL199575,inactive,200000.00,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O
...,...,...,...,...,...
597,597,CHEMBL4569109,inactive,250000.00,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1
598,598,CHEMBL4568957,inactive,250000.00,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1
599,599,CHEMBL4449622,inactive,250000.00,Cn1nc(O)c(C(N)=O)c1COc1ccccc1
600,600,CHEMBL1956285,active,10.00,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...


##### Now, let's combine these two DataFrames into one, df_combined

In [100]:
# Combine the DataFrames 'df_clean_smiles' and 'df_lipinski' into a single DataFrame
df_combined = pd.concat([df_clean_smiles,df_lipinski], axis=1) 

In [101]:
# Display the combined DataFrame
df_combined

Unnamed: 0,index,molecule_chembl_id,bioactivity_class,standard_value,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors
0,0,CHEMBL199572,inactive,42600.00,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O,331.37,4.33,1.00,2.00
1,1,CHEMBL199574,inactive,142600.00,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1,370.20,4.55,2.00,2.00
2,2,CHEMBL372561,inactive,93400.00,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O,384.23,4.58,1.00,2.00
3,3,CHEMBL370865,inactive,153500.00,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1,317.34,4.30,2.00,2.00
4,4,CHEMBL199575,inactive,200000.00,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O,305.33,3.81,1.00,2.00
...,...,...,...,...,...,...,...,...,...
597,597,CHEMBL4569109,inactive,250000.00,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1,302.33,2.49,1.00,5.00
598,598,CHEMBL4568957,inactive,250000.00,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1,338.36,3.28,1.00,5.00
599,599,CHEMBL4449622,inactive,250000.00,Cn1nc(O)c(C(N)=O)c1COc1ccccc1,247.25,0.80,2.00,5.00
600,600,CHEMBL1956285,active,10.00,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...,415.34,5.95,1.00,5.00


In [102]:
#Generate descriptive statistics for the 'standard_value' column in the combined DataFrame
pd.set_option('display.float_format', '{:.2f}'.format)
df_combined['standard_value'].describe()

count       602.00
mean      34995.51
std       91097.85
min           6.00
25%         242.50
50%        5300.00
75%       28000.00
max     1071519.31
Name: standard_value, dtype: float64

#### __3. Standardize and convert IC50 values to pIC50 values__

##### __3.1 Standardize/Normalize the IC50__
IC50 values greater than 100,000,000 nM are capped at 100,000,000 nM; otherwise the negative logarithm could become negative. For example, -np.log10((10**-9) * 10000000000) = -1.0, whereas -np.log10((10**-9) * 100000000) = 1.0. Fortunately, in our case the maximum IC50 value is 1071519.31, as shown by the describe() output above, so capping is not strictly necessary. However, for the sake of statistical analysis, the normalization step is still performed.

I will first apply the norm_value() function so that the values in the standard_value column are normalized.

In [103]:
# Define a function to normalize the 'standard_value' column in the DataFrame,
# capping values at 100,000,000, and then drop the original 'standard_value' column
def norm_value(input): #define a function called norm_value with input as a parameter
    norm = [] #Create an empty list called 'norm'

    for i in input['standard_value']: #Iterates over each item in the 'standard_value' column
        if i > 100000000: #If the value is greater than 100,000,000
          i = 100000000 #Set the value to 100,000,000
        norm.append(i) #Appends the value to the 'norm' list

    input['standard_value_norm'] = norm #Creates a new column in the DataFrame called 'standard_value_norm' and assigns the list 'norm' to it  
    x = input.drop('standard_value',axis=1) #Removes the 'standard_value' column from the DataFrame
        
    return x
     

In [104]:
## Apply the normalization function to the combined DataFrame 
# and store the result in 'df_norm', then display the normalized DataFrame
df_norm = norm_value(df_combined) 
df_norm     

Unnamed: 0,index,molecule_chembl_id,bioactivity_class,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors,standard_value_norm
0,0,CHEMBL199572,inactive,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O,331.37,4.33,1.00,2.00,42600.00
1,1,CHEMBL199574,inactive,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1,370.20,4.55,2.00,2.00,142600.00
2,2,CHEMBL372561,inactive,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O,384.23,4.58,1.00,2.00,93400.00
3,3,CHEMBL370865,inactive,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1,317.34,4.30,2.00,2.00,153500.00
4,4,CHEMBL199575,inactive,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O,305.33,3.81,1.00,2.00,200000.00
...,...,...,...,...,...,...,...,...,...
597,597,CHEMBL4569109,inactive,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1,302.33,2.49,1.00,5.00,250000.00
598,598,CHEMBL4568957,inactive,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1,338.36,3.28,1.00,5.00,250000.00
599,599,CHEMBL4449622,inactive,Cn1nc(O)c(C(N)=O)c1COc1ccccc1,247.25,0.80,2.00,5.00,250000.00
600,600,CHEMBL1956285,active,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...,415.34,5.95,1.00,5.00,10.00
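
The capping step can also be written in vectorized form. This is a minimal sketch using `Series.clip`, not a replacement for `norm_value()`; the helper name `cap_ic50` and the `cap_nm` parameter are assumptions, and the third input value is invented to show a case where the cap takes effect.

```python
import pandas as pd

def cap_ic50(values_nm, cap_nm=100_000_000):
    """Cap IC50 values (in nM) at cap_nm, leaving smaller values unchanged."""
    return pd.Series(values_nm, dtype=float).clip(upper=cap_nm)

# 10 and 1071519.31 appear in this dataset; the last value is made up
capped = cap_ic50([10.0, 1071519.31, 10_000_000_000.0])
print(capped.tolist())
```

Only the value above the cap is changed, which is consistent with the observation that capping leaves this dataset untouched.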


##### __3.2 Convert IC50 values to pIC50 values__
To improve the model's ability to predict activity, I will convert the IC50 values to pIC50 values using the formula $\mathrm{pIC50} = -\log_{10}(\mathrm{IC50} \times 10^{-9})$, where IC50 is given in nM. This transformation is often used in pharmacology to express IC50 values on a logarithmic scale, making them more suitable for analysis and modeling.

In [105]:
#Convert IC50 to pIC50
def pIC50(input): #define a function called pIC50 with input as a parameter
    pIC50 = [] #Create an empty list called 'pIC50'

    for i in input['standard_value_norm']:  #Iterates over each item in the 'standard_value_norm' column
        molar = i*(10**-9) # Converts nM to M by multiplying the value by 10^-9
        pIC50.append(-np.log10(molar)) #Converts IC50 to pIC50 and appends it to the 'pIC50' list

    input['pIC50'] = pIC50  #Creates a new column in the DataFrame called 'pIC50' and assigns the list 'pIC50' to it
    #x = input.drop('standard_value_norm', axis=1) #Removes the 'standard_value_norm' column from the DataFrame
    x = input.copy() #Copies the DataFrame to avoid the SettingWithCopyWarning message    
    return x    


In [106]:
df_final = pIC50(df_norm)
df_final     

Unnamed: 0,index,molecule_chembl_id,bioactivity_class,canonical_smiles,MW,LogP,NumHDonors,NumHAcceptors,standard_value_norm,pIC50
0,0,CHEMBL199572,inactive,CN(C(=O)c1ccc(-c2ccccc2)cc1)c1ccccc1C(=O)O,331.37,4.33,1.00,2.00,42600.00,4.37
1,1,CHEMBL199574,inactive,O=C(Nc1ccccc1C(=O)O)c1ccc2cc(Br)ccc2c1,370.20,4.55,2.00,2.00,142600.00,3.85
2,2,CHEMBL372561,inactive,CN(C(=O)c1ccc2cc(Br)ccc2c1)c1ccccc1C(=O)O,384.23,4.58,1.00,2.00,93400.00,4.03
3,3,CHEMBL370865,inactive,O=C(Nc1ccccc1C(=O)O)c1ccc(-c2ccccc2)cc1,317.34,4.30,2.00,2.00,153500.00,3.81
4,4,CHEMBL199575,inactive,CN(C(=O)c1ccc2ccccc2c1)c1ccccc1C(=O)O,305.33,3.81,1.00,2.00,200000.00,3.70
...,...,...,...,...,...,...,...,...,...,...
597,597,CHEMBL4569109,inactive,Cn1nc(OCC2CC2)c(C(=O)O)c1COc1ccccc1,302.33,2.49,1.00,5.00,250000.00,3.60
598,598,CHEMBL4568957,inactive,Cn1nc(OCc2ccccc2)c(C(=O)O)c1COc1ccccc1,338.36,3.28,1.00,5.00,250000.00,3.60
599,599,CHEMBL4449622,inactive,Cn1nc(O)c(C(N)=O)c1COc1ccccc1,247.25,0.80,2.00,5.00,250000.00,3.60
600,600,CHEMBL1956285,active,Cc1cc(Nc2ccc(S(F)(F)(F)(F)F)cc2)n2nc(C(C)(F)F)...,415.34,5.95,1.00,5.00,10.00,8.00
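
As a cross-check of the conversion, note that for IC50 in nM the formula is equivalent to pIC50 = 9 - log10(IC50). The sketch below computes it in vectorized form for two IC50 values from the table above (42600 nM and 10 nM); the helper name `ic50_nm_to_pic50` is hypothetical.

```python
import numpy as np
import pandas as pd

def ic50_nm_to_pic50(ic50_nm):
    """Convert IC50 values in nM to pIC50 (negative log10 of the molar value)."""
    values = pd.Series(ic50_nm, dtype=float)
    molar = values * 1e-9  # nM to M
    return -np.log10(molar)

example = ic50_nm_to_pic50([42600.0, 10.0])
print(example.round(2).tolist())
```

The results should be approximately 4.37 and 8.00, in line with the `pIC50` column shown for those two compounds.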


In [107]:
df_final.pIC50.describe()

count   602.00
mean      5.54
std       1.20
min       2.97
25%       4.55
50%       5.28
75%       6.62
max       8.22
Name: pIC50, dtype: float64

##### Save the DataFrame to a CSV file

In [108]:
df_final.to_csv('../data/chembl_dataset/03_pfDHODH_bioactivity_data_3category_norm_pIC50.csv') 