## Physicochemical Properties of Chemicals

`Log P` is an experimental measure of lipophilicity of small molecules.

`cLog P` is a computationally determined value of the same lipophilicity measure, obtained with a variety of software tools that employ different algorithms. The ‘c’ stands for calculated, to distinguish it from experimentally determined values.

The chemical structure is a 2D graphical representation of each compound, which you do not need for your assignment.

The alternative measure of lipophilicity, `ICHI`, which is included as the ‘output variable’ in the first Excel sheet, is the one relevant for your task. The publications I sent you would have used either `Log P` or `cLog P` as the output variable for their modelling exercise.

The hypothesis behind the study is that the `ICHI` value, which was obtained experimentally, represents a more biomimetic measure of lipophilicity than the conventional `Log P` value. This hypothesis is based on specific elements of the experimental design, which differs from the experimental setup used to determine `Log P`.

In [1]:
# Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings('ignore')


In [2]:
train = pd.read_csv('train_set.csv',encoding='latin-1')
train.head()

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI
0,pyrazinamide,94.5 + 3.0,31.36 + 3.0,12.43 + 0.5 x 10-24,68.9,-0.092
1,dapsone,182.3 + 3.0,67.51 + 0.4,26.76 + 0.5 x 10-24,94.6,0.027
2,phenobarbitone,188.1 + 3.0,59.21 + 0.3,23.47 + 0.5 x 10-24,75.3,-0.003
3,sulphamethoxazole,173.1 + 3.0,62.45 + 0.4,24.75 + 0.5 x 10-24,107.0,-0.106
4,theophylline,122.9 + 3.0,43.14 + 0.3,17.10 + 0.5 x 10-24,69.3,-0.11


In [3]:
test = pd.read_csv('test_set.csv',encoding='latin-1')
test.head()

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI
0,metronidazole,117.8 + 7.0,40.98 + 0.5,16.24 + 0.5 x 10-24,83.9,-0.025
1,prednisolone,274.7 + 5.0,95.48 + 0.4,37.85 + 0.5 x 10-24,94.8,0.32
2,diazepam,225.8 + 7.0,80.91 + 0.5,32.07 + 0.5 x 10-24,32.7,0.61
3,chlorpheniramine,211.4 + 3.0,71.35 + 0.3,28.28 + 0.5 x 10-24,16.1,0.99


In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   name                       40 non-null     object 
 1   Molar Volume  (cm3)        40 non-null     object 
 2   Molar Refractivity  (cm3)  40 non-null     object 
 3   Polarizability  (cm3)      40 non-null     object 
 4   TPSA  (Å2)                 40 non-null     float64
 5   ICHI                       40 non-null     float64
dtypes: float64(2), object(4)
memory usage: 2.0+ KB
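
The `object` dtype reported for the three `(cm3)` columns indicates that pandas read them as strings rather than numbers, so they need parsing before they can be used as model inputs. A minimal sketch of how such columns could be listed programmatically follows; the small `demo` frame is a hypothetical stand-in built from two rows shown earlier, not the real data file.

```python
import pandas as pd

# Hypothetical stand-in for `train`, using two rows shown in the output above
demo = pd.DataFrame({
    'name': ['pyrazinamide', 'dapsone'],
    'Molar Volume  (cm3)': ['94.5 + 3.0', '182.3 + 3.0'],
    'TPSA  (Å2)': [68.9, 94.6],
})

# Non-numeric columns are stored as text and still need parsing
text_cols = demo.select_dtypes(exclude='number').columns.tolist()
numeric_cols = demo.select_dtypes(include='number').columns.tolist()
print(text_cols)
print(numeric_cols)
```

Of the text columns, `name` is an identifier; the others hold `mean + uncertainty` strings.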


In [5]:
train.describe()

Unnamed: 0,TPSA (Å2),ICHI
count,40.0,40.0
mean,75.7475,0.285375
std,30.650026,0.364413
min,23.5,-0.5
25%,53.125,-0.01425
50%,69.2,0.34
75%,93.25,0.5425
max,159.0,1.1


### Data Cleaning/Feature Engineering

In [6]:
def split_mean_uncertainty(df, column, suffix):
    # Split strings such as '94.5 + 3.0' into a mean and an uncertainty
    df[[f'Mean {suffix}', f'Uncertainty {suffix}']] = df[column].str.split('+', expand=True)

    df[f'Mean {suffix}'] = df[f'Mean {suffix}'].astype(float)
    df[f'Uncertainty {suffix}'] = df[f'Uncertainty {suffix}'].astype(float)

    df[f'UpperBound {suffix}'] = df[f'Mean {suffix}'] + df[f'Uncertainty {suffix}']
    df[f'LowerBound {suffix}'] = df[f'Mean {suffix}'] - df[f'Uncertainty {suffix}']

    # Relative uncertainty
    df[f'RelativeUncertainty {suffix}'] = df[f'Uncertainty {suffix}'] / df[f'Mean {suffix}']

    return df


In [7]:
# Call once per column instead of once per row via .apply
train = split_mean_uncertainty(train, 'Molar Volume  (cm3)', 'MV')
train = split_mean_uncertainty(train, 'Molar Refractivity  (cm3)', 'MR')
train.head()


Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR
0,pyrazinamide,94.5 + 3.0,31.36 + 3.0,12.43 + 0.5 x 10-24,68.9,-0.092,94.5,3.0,97.5,91.5,0.031746,31.36,3.0,34.36,28.36,0.095663
1,dapsone,182.3 + 3.0,67.51 + 0.4,26.76 + 0.5 x 10-24,94.6,0.027,182.3,3.0,185.3,179.3,0.016456,67.51,0.4,67.91,67.11,0.005925
2,phenobarbitone,188.1 + 3.0,59.21 + 0.3,23.47 + 0.5 x 10-24,75.3,-0.003,188.1,3.0,191.1,185.1,0.015949,59.21,0.3,59.51,58.91,0.005067
3,sulphamethoxazole,173.1 + 3.0,62.45 + 0.4,24.75 + 0.5 x 10-24,107.0,-0.106,173.1,3.0,176.1,170.1,0.017331,62.45,0.4,62.85,62.05,0.006405
4,theophylline,122.9 + 3.0,43.14 + 0.3,17.10 + 0.5 x 10-24,69.3,-0.11,122.9,3.0,125.9,119.9,0.02441,43.14,0.3,43.44,42.84,0.006954
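
The cells above add the derived columns to `train` only, and `test` would need the same features before any model could score it. One way to keep the two in step is a helper that takes the frame as an argument and returns a copy. The sketch below assumes that approach; `add_uncertainty_features` and `demo_test` are hypothetical names, and the two rows are copied from the `test.head()` output shown earlier.

```python
import pandas as pd

def add_uncertainty_features(df, column, suffix):
    """Return a copy of df with numeric columns parsed from 'mean + uncertainty' strings."""
    out = df.copy()
    parts = out[column].str.split('+', expand=True)
    out[f'Mean {suffix}'] = parts[0].str.strip().astype(float)
    out[f'Uncertainty {suffix}'] = parts[1].str.strip().astype(float)
    out[f'UpperBound {suffix}'] = out[f'Mean {suffix}'] + out[f'Uncertainty {suffix}']
    out[f'LowerBound {suffix}'] = out[f'Mean {suffix}'] - out[f'Uncertainty {suffix}']
    out[f'RelativeUncertainty {suffix}'] = out[f'Uncertainty {suffix}'] / out[f'Mean {suffix}']
    return out

# Hypothetical stand-in for `test`, using two rows from the output shown earlier
demo_test = pd.DataFrame({
    'name': ['metronidazole', 'diazepam'],
    'Molar Volume  (cm3)': ['117.8 + 7.0', '225.8 + 7.0'],
})
demo_test = add_uncertainty_features(demo_test, 'Molar Volume  (cm3)', 'MV')
print(demo_test[['name', 'Mean MV', 'Uncertainty MV', 'RelativeUncertainty MV']])
```

Returning a copy leaves the input frame untouched, so the same call can be repeated safely on either data set.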


In [8]:
for col in train.columns:
    print(col)

name
Molar Volume  (cm3)
Molar Refractivity  (cm3)
Polarizability  (cm3)
TPSA  (Å2)
ICHI 
Mean MV
Uncertainty MV
UpperBound MV
LowerBound MV
RelativeUncertainty MV
Mean MR
Uncertainty MR
UpperBound MR
LowerBound MR
RelativeUncertainty MR


In [9]:
import re

# Normalise the notation, e.g. '0.5 x 10-24' -> '0.5 * 10^-24'
train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace('x','*', regex=False)
#train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace(r'10-(\d+)',r'10^\1', regex=True)
# (A regex replace on r'10^(\d+)' was dropped here: an unescaped '^' is a start-of-string anchor, so it never matched.)
train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace('10-','10^-', regex=False)
# train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace('*','e', regex=False)

# def convert(value):
#     return value.replace('10^','10^-')

# train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].apply(convert)

# def process_value(value):
#     match =re.match(r"([0-9.]+)\s*+\s*([0-9.]+)\s*10\^(-?[0-9]+)", value)

#     if match:
#         mean_value = float(match.group(1)) # To extract mean value
#         uncertainty = float(match.group(2)) * 10**int(match.group(3)) # To convert uncertainty to float

#         return mean_value, uncertainty
#     return None, None # Return None if the format doesn't match

# for row in train['Polarizability  (cm3)']:
#     mean, uncertainty = process_value(row)
#     print(f"Mean: {mean}, Uncertainty:{uncertainty}")

In [10]:
train.head(5)

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR
0,pyrazinamide,94.5 + 3.0,31.36 + 3.0,12.43 + 0.5 * 10^-24,68.9,-0.092,94.5,3.0,97.5,91.5,0.031746,31.36,3.0,34.36,28.36,0.095663
1,dapsone,182.3 + 3.0,67.51 + 0.4,26.76 + 0.5 * 10^-24,94.6,0.027,182.3,3.0,185.3,179.3,0.016456,67.51,0.4,67.91,67.11,0.005925
2,phenobarbitone,188.1 + 3.0,59.21 + 0.3,23.47 + 0.5 * 10^-24,75.3,-0.003,188.1,3.0,191.1,185.1,0.015949,59.21,0.3,59.51,58.91,0.005067
3,sulphamethoxazole,173.1 + 3.0,62.45 + 0.4,24.75 + 0.5 * 10^-24,107.0,-0.106,173.1,3.0,176.1,170.1,0.017331,62.45,0.4,62.85,62.05,0.006405
4,theophylline,122.9 + 3.0,43.14 + 0.3,17.10 + 0.5 * 10^-24,69.3,-0.11,122.9,3.0,125.9,119.9,0.02441,43.14,0.3,43.44,42.84,0.006954


In [31]:
def split_p(x):
    # Polarizability (cm3)
    train[['Mean P','Uncertainty P']] = train['Polarizability  (cm3)'].str.split('+',expand=True)
    train['Mean P'] = train['Mean P'].astype(float)
    # train['Uncertainty P'] = train['Uncertainty P'].astype(float)
    # train['Uncertainty P'] = train['Uncertainty P'].apply(lambda value:
    #                                                       value.replace('* 10^','e'))
    # train['Uncertainty P'] = train['Uncertainty P'].str.strip().astype(float)
    
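    # Note: isdigit() is False for strings such as '0.5e-24' (they contain '.', 'e' and '-'),
    # so the expression below yields None for every row and 'Uncertainty P' stays empty.
    train['Uncertainty P']= train['Uncertainty P'].apply(lambda x: float(x.strip()) if x.strip().replace(' ', '', 1).replace('* 10^', 'e', 1).isdigit() else None)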
 


    # train['UpperBound P'] = train['Mean P'] + train ['Uncertainty P']
    # train['LowerBound P'] = train['Mean P'] - train ['Uncertainty P']

    # # Relative Uncertainty
    # train['RelativeUncertainty P'] = train['Uncertainty P']/train['Mean P']
    # return x

In [32]:
train['Polarizability  (cm3)'].apply(split_p)
train.head(5)

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR,Mean P,Uncertainty P
0,pyrazinamide,94.5 + 3.0,31.36 + 3.0,12.43 + 0.5 * 10^-24,68.9,-0.092,94.5,3.0,97.5,91.5,0.031746,31.36,3.0,34.36,28.36,0.095663,12.43,
1,dapsone,182.3 + 3.0,67.51 + 0.4,26.76 + 0.5 * 10^-24,94.6,0.027,182.3,3.0,185.3,179.3,0.016456,67.51,0.4,67.91,67.11,0.005925,26.76,
2,phenobarbitone,188.1 + 3.0,59.21 + 0.3,23.47 + 0.5 * 10^-24,75.3,-0.003,188.1,3.0,191.1,185.1,0.015949,59.21,0.3,59.51,58.91,0.005067,23.47,
3,sulphamethoxazole,173.1 + 3.0,62.45 + 0.4,24.75 + 0.5 * 10^-24,107.0,-0.106,173.1,3.0,176.1,170.1,0.017331,62.45,0.4,62.85,62.05,0.006405,24.75,
4,theophylline,122.9 + 3.0,43.14 + 0.3,17.10 + 0.5 * 10^-24,69.3,-0.11,122.9,3.0,125.9,119.9,0.02441,43.14,0.3,43.44,42.84,0.006954,17.1,
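
The `Uncertainty P` column comes out empty because the `isdigit()` check in `split_p` never passes for strings such as `0.5 * 10^-24`. One possible alternative is to pull the three numbers out with a single regular expression. The sketch below is an assumption about how that could look, not the notebook's method; `PATTERN` and `parse_polarizability` are hypothetical names. It accepts both the raw notation (`x 10-24`) and the normalised one (`* 10^-24`).

```python
import re

# Matches e.g. '12.43 + 0.5 x 10-24' or '12.43 + 0.5 * 10^-24'
PATTERN = re.compile(
    r'^\s*([0-9.]+)\s*\+\s*([0-9.]+)\s*[x*]\s*10\^?(-?[0-9]+)\s*$'
)

def parse_polarizability(text):
    """Return (mean, uncertainty, exponent), or (None, None, None) if the format differs."""
    match = PATTERN.match(text)
    if match is None:
        return None, None, None
    return float(match.group(1)), float(match.group(2)), int(match.group(3))

mean, uncertainty, exponent = parse_polarizability('12.43 + 0.5 * 10^-24')

# Assumption: as in the cell above, the power of ten scales the uncertainty only
scaled_uncertainty = uncertainty * 10.0 ** exponent
print(mean, uncertainty, exponent, scaled_uncertainty)
```

The exponent is returned separately because the notebook does not settle whether the power of ten applies to the uncertainty alone or to the whole `mean + uncertainty` value. Keeping it separate leaves that choice to the caller.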
