<a href="https://colab.research.google.com/github/dibdin/malebirthcontrol/blob/main/Part2_Pubchem_Descriptor_Dataset_Preparation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bioinformatics Project - Computational Drug Discovery [Part 2] Descriptor Calculation and Dataset Preparation**

Original code credit: Chanin Nantasenamat

[*'Data Professor' YouTube channel*](http://youtube.com/dataprofessor)


## **Download PaDEL-Descriptor**




In [None]:
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.zip
! wget https://github.com/dataprofessor/bioinformatics/raw/master/padel.sh

In [None]:
! unzip padel.zip

## **Upload `bioactivity_preprocessed_data_spem1.csv` from GitHub**
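
The cells below refer to a DataFrame named `df_combined`, which this notebook never creates. What follows is a minimal sketch of one way to load it, assuming the uploaded CSV is the source of `df_combined` and that it carries the columns later cells read (`molecule_chembl_id`, `canonical_smiles`, `standard_value`, `bioactivity_class`). The helper name `load_bioactivity` is hypothetical.

```python
from pathlib import Path

import pandas as pd

# Assumption: the uploaded file sits in the working directory under this name.
CSV_PATH = Path("bioactivity_preprocessed_data_spem1.csv")

# Columns that later cells in this notebook read.
REQUIRED_COLUMNS = [
    "molecule_chembl_id",
    "canonical_smiles",
    "standard_value",
    "bioactivity_class",
]


def load_bioactivity(path=CSV_PATH):
    """Hypothetical helper: read the CSV and fail early if a column is missing."""
    df = pd.read_csv(path)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df


# Only load when the file has actually been uploaded.
if CSV_PATH.exists():
    df_combined = load_bioactivity()
```

Checking the columns up front surfaces a wrong or incomplete upload here rather than several cells later.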

### **Convert IC50 to pIC50**
To make the **IC50** data more uniformly distributed, we convert **IC50** to the negative logarithmic scale, **pIC50 = -log10(IC50)**, with IC50 expressed in molar units.

The custom function pIC50() accepts a DataFrame as input and will:
* Take the capped IC50 values from the ``standard_value_norm`` column (created by norm_value() below) and convert them from nM to M by multiplying by 10$^{-9}$
* Take the molar value and apply -log10
* Delete the ``standard_value_norm`` column and create a new ``pIC50`` column

In [None]:
# https://github.com/chaninlab/estrogen-receptor-alpha-qsar/blob/master/02_ER_alpha_RO5.ipynb

import numpy as np

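def pIC50(df):
    """Add a pIC50 column to df and return it without standard_value_norm."""
    pIC50_values = []

    for i in df['standard_value_norm']:
        molar = i * (10**-9)  # Convert nM to M
        pIC50_values.append(-np.log10(molar))

    df['pIC50'] = pIC50_values
    # drop(..., 1) passed the axis positionally, which pandas 2.0 no longer accepts
    x = df.drop(columns='standard_value_norm')

    return x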
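
The pIC50() function above loops over the column one value at a time. The same conversion can be written as a single vectorized expression. This is a minimal sketch on made-up values, not a replacement for the function the later cells call; `demo` and `result` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up capped IC50 values in nM.
demo = pd.DataFrame({"standard_value_norm": [1.0, 1000.0, 1.0e8]})

# nM -> M, then -log10, applied to the whole column at once.
demo["pIC50"] = -np.log10(demo["standard_value_norm"] * 1e-9)

# Keep only the transformed column, as pIC50() does.
result = demo.drop(columns="standard_value_norm")
```

For these inputs the pIC50 values come out at roughly 9, 6 and 1: a more potent compound (lower IC50) gets a higher pIC50.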

Point to note: values greater than 100,000,000 nM are capped at 100,000,000 nM; otherwise, sufficiently large values would give a negative pIC50.

In [None]:
df_combined.standard_value.describe()

In [None]:
-np.log10( (10**-9)* 100000000 )

In [None]:
-np.log10( (10**-9)* 10000000000 )
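
The two cells above evaluate the conversion at the cap and at a value 100 times larger. Written out, with IC50 in nM:

$$
\mathrm{pIC50} = -\log_{10}\left(\mathrm{IC50}_{\mathrm{nM}} \times 10^{-9}\right) = 9 - \log_{10}\left(\mathrm{IC50}_{\mathrm{nM}}\right)
$$

At 10$^{8}$ nM this gives 9 - 8 = 1, and at 10$^{10}$ nM it gives 9 - 10 = -1. The result only turns negative above 10$^{9}$ nM (1 M), so a cap at 10$^{8}$ nM keeps every pIC50 at 1 or higher.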

In [None]:
def norm_value(df):
    norm = []

    for i in df['standard_value']:
        if i > 100000000:
            i = 100000000  # Cap at 10^8 nM
        norm.append(i)

    df['standard_value_norm'] = norm
    # Use columns= because pandas 2.0 no longer accepts a positional axis
    x = df.drop(columns='standard_value')

    return x
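
The capping loop in norm_value() has a direct pandas equivalent in `Series.clip`. This is a minimal sketch on made-up values; `demo`, `capped` and `CAP_NM` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Made-up IC50 values in nM; the last two exceed the 10^8 nM cap.
demo = pd.DataFrame({"standard_value": [50.0, 1.0e8, 2.5e9, 1.0e10]})

CAP_NM = 100_000_000  # same threshold as norm_value()

# clip(upper=...) leaves smaller values untouched and replaces larger ones with the cap.
demo["standard_value_norm"] = demo["standard_value"].clip(upper=CAP_NM)

# Drop the uncapped column, as norm_value() does.
capped = demo.drop(columns="standard_value")
```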

We first apply the norm_value() function so that the values in the standard_value column are capped at 100,000,000 and stored in a new standard_value_norm column.

In [None]:
df_norm = norm_value(df_combined)
df_norm

In [None]:
df_norm.standard_value_norm.describe()

In [None]:
df_final = pIC50(df_norm)
df_final

In [None]:
df_final.pIC50.describe()

### **Removing the 'intermediate' bioactivity class**
Here, we will be removing the ``intermediate`` class from our data set.

In [None]:
df_2class = df_final[df_final.bioactivity_class != 'intermediate']
df_2class

In [None]:
selection = ['canonical_smiles','molecule_chembl_id']
df_2class_selection = df_2class[selection]  # df3 was never defined in this notebook
df_2class_selection.to_csv('molecule.smi', sep='\t', index=False, header=False)

In [None]:
df_2class

In [None]:
! cat molecule.smi | head -5

In [None]:
! cat molecule.smi | wc -l
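
The shell checks above can also be done in pandas by reading the file back. The sketch below uses an in-memory buffer and two placeholder compounds instead of the real `molecule.smi`, so every value in it is made up. It confirms the layout written above: SMILES, a tab, then the identifier, with no header row.

```python
import io

import pandas as pd

# Hypothetical two-compound table; the SMILES and IDs are placeholders.
demo = pd.DataFrame({
    "canonical_smiles": ["CCO", "c1ccccc1"],
    "molecule_chembl_id": ["CHEMBL_A", "CHEMBL_B"],
})

# Write in the same layout as molecule.smi: SMILES, tab, identifier, no header.
buffer = io.StringIO()
demo.to_csv(buffer, sep="\t", index=False, header=False)

# Read it back and confirm one row per compound and exactly two columns.
buffer.seek(0)
smi = pd.read_csv(buffer, sep="\t", header=None, names=["smiles", "name"])
```

With the real file, the row count of `smi` should match the `wc -l` output.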

## **Calculate fingerprint descriptors**


### **Calculate PaDEL descriptors**

In [None]:
! cat padel.sh

In [None]:
! bash padel.sh

In [None]:
! ls -l

## **Preparing the X and Y Data Matrices**

### **X data matrix**

In [None]:
import pandas as pd

df_2class_X = pd.read_csv('descriptors_output.csv')

In [None]:
df_2class_X

In [None]:
df_2class_X = df_2class_X.drop(columns=['Name'])
df_2class_X

### **Y variable**

The Y variable is the ``pIC50`` column of the two-class data set.

In [None]:
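# df3 was never defined; take pIC50 from df_2class and reset the index so it
# lines up positionally with the descriptor rows (assumes the same row order)
df_2class_Y = df_2class['pIC50'].reset_index(drop=True)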
df_2class_Y

## **Combining X and Y variables**

In [None]:
dataset3 = pd.concat([df_2class_X,df_2class_Y], axis=1)
dataset3
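
`pd.concat` with `axis=1` aligns rows by index label, not by position. If the pIC50 column comes from a filtered DataFrame, its index has gaps while the descriptor table read from CSV has a default 0..n-1 index, and the mismatch shows up as NaN rows. The sketch below shows the effect on made-up values. It assumes the descriptor rows are in the same order as the compounds written to `molecule.smi`, which is not verified here.

```python
import pandas as pd

# Descriptor table as read from CSV: default 0..n-1 index (values are made up).
X = pd.DataFrame({"PubchemFP0": [1, 0, 1]})

# Target column taken from a filtered DataFrame: the index has gaps.
y_filtered = pd.Series([5.2, 6.8, 4.1], index=[0, 2, 5], name="pIC50")

# concat(axis=1) aligns on index labels, so mismatched labels produce NaN rows.
misaligned = pd.concat([X, y_filtered], axis=1)

# Resetting the index restores positional alignment.
aligned = pd.concat([X, y_filtered.reset_index(drop=True)], axis=1)
```

A quick check such as `dataset3.isna().sum()` on the real data shows whether any rows were misaligned.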

In [None]:
dataset3.to_csv('bioactivity_data_spem1_correct_combined_pubchem_fp.csv', index=False)

## **Copy the CSV file to Google Drive for Part 3B (Model Building)**

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

In [None]:
! cp bioactivity_data_spem1_correct_combined_pubchem_fp.csv "/content/gdrive/My Drive/Colab Notebooks/databirthcontrol"

In [None]:
! ls "/content/gdrive/My Drive/Colab Notebooks/databirthcontrol"