## Physiochemical Properties of Chemicals

`Log P` is an experimental measure of lipophilicity of small molecules.

`cLog P` is a computationally determined parameter for the same measure of lipophilicity (using variety of software tools, employing different algorithms. the ‘c’ stands for calculated, to distinguish it from experimentally determined values.

Chemical structure is a 2D graphic representation of the compounds, which you do not need for your assignment .

The alternative measure of lipophilicity, `ICHI`, which is included as ‘output variable’ in the first excel sheet is the one relevant for your task. The publications I sent to  you would have used either `Log P` or `cLog P` as output variable for their modelling exercise

The hypothesis behind the study is that, the `ICHI` value, which was obtained experimentally,  represents a more biomimetic measure of lipophilicity than the conventional `Log P` value. This hypothesis is based on specific elements of the experimental design,  which is different from the experimental set up for the determination of `Log P`.

In [1]:
# Importations

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
sns.set_style('darkgrid')
import warnings
warnings.filterwarnings('ignore')


In [2]:
#Loading Train Dataset
train = pd.read_csv('train_set.csv',encoding='latin-1')
train.head()

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI
0,pyrazinamide,94.5 + 3.0,31.36 + 3.0,12.43 + 0.5 x 10-24,68.9,-0.092
1,dapsone,182.3 + 3.0,67.51 + 0.4,26.76 + 0.5 x 10-24,94.6,0.027
2,phenobarbitone,188.1 + 3.0,59.21 + 0.3,23.47 + 0.5 x 10-24,75.3,-0.003
3,sulphamethoxazole,173.1 + 3.0,62.45 + 0.4,24.75 + 0.5 x 10-24,107.0,-0.106
4,theophylline,122.9 + 3.0,43.14 + 0.3,17.10 + 0.5 x 10-24,69.3,-0.11


In [3]:
# Loading Test Dataset
test = pd.read_csv('test_set.csv',encoding='latin-1')
test.head()

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI
0,metronidazole,117.8 + 7.0,40.98 + 0.5,16.24 + 0.5 x 10-24,83.9,-0.025
1,prednisolone,274.7 + 5.0,95.48 + 0.4,37.85 + 0.5 x 10-24,94.8,0.32
2,diazepam,225.8 + 7.0,80.91 + 0.5,32.07 + 0.5 x 10-24,32.7,0.61
3,chlorpheniramine,211.4 + 3.0,71.35 + 0.3,28.28 + 0.5 x 10-24,16.1,0.99


In [4]:
# Training Set Information
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   name                       40 non-null     object 
 1   Molar Volume  (cm3)        40 non-null     object 
 2   Molar Refractivity  (cm3)  40 non-null     object 
 3   Polarizability  (cm3)      40 non-null     object 
 4   TPSA  (Å2)                 40 non-null     float64
 5   ICHI                       40 non-null     float64
dtypes: float64(2), object(4)
memory usage: 2.0+ KB


In [5]:
# Training Set Description
train.describe()

Unnamed: 0,TPSA (Å2),ICHI
count,40.0,40.0
mean,75.7475,0.285375
std,30.650026,0.364413
min,23.5,-0.5
25%,53.125,-0.01425
50%,69.2,0.34
75%,93.25,0.5425
max,159.0,1.1


### Data Cleaning/Feature Engineering

In [6]:
# Molar Velocity (cm3) Column Splitting Function
def split_mv(x):
    train[['Mean MV','Uncertainty MV']] = train['Molar Volume  (cm3)'].str.split('+',expand=True)
    train['Mean MV'] = train['Mean MV'].astype(float)

    train['Uncertainty MV'] = train['Uncertainty MV'].astype(float)
    train['UpperBound MV'] = train['Mean MV'] + train['Uncertainty MV']
    train['LowerBound MV'] = train['Mean MV'] - train['Uncertainty MV']

     # Relative Uncertainty
    train['RelativeUncertainty MV'] = train['Uncertainty MV']/train['Mean MV']

    return x

# Molar Refractivity (cm3) Column Splitting Function
def split_mr(x):
    train[['Mean MR','Uncertainty MR']] = train['Molar Refractivity  (cm3)'].str.split('+',expand=True)
    train['Mean MR'] = train['Mean MR'].astype(float)

    train['Uncertainty MR'] = train['Uncertainty MR'].astype(float)
    train['UpperBound MR'] = train['Mean MV'] + train['Uncertainty MR']
    train['LowerBound MR'] = train['Mean MV'] - train['Uncertainty MR']

     # Relative Uncertainty
    train['RelativeUncertainty MR'] = train['Uncertainty MR']/train['Mean MR']
    return x

#For Test Data
# Molar Velocity (cm3) Column Splitting Function
def split_mvt(x):
    test[['Mean MV','Uncertainty MV']] = test['Molar Volume  (cm3)'].str.split('+',expand=True)
    test['Mean MV'] = test['Mean MV'].astype(float)

    test['Uncertainty MV'] = test['Uncertainty MV'].astype(float)
    test['UpperBound MV'] = test['Mean MV'] + test['Uncertainty MV']
    test['LowerBound MV'] = test['Mean MV'] - test['Uncertainty MV']

     # Relative Uncertainty
    test['RelativeUncertainty MV'] = test['Uncertainty MV']/test['Mean MV']

    return x

# Molar Refractivity (cm3) Column Splitting Function
def split_mrt(x):
    test[['Mean MR','Uncertainty MR']] = test['Molar Refractivity  (cm3)'].str.split('+',expand=True)
    test['Mean MR'] = test['Mean MR'].astype(float)

    test['Uncertainty MR'] = test['Uncertainty MR'].astype(float)
    test['UpperBound MR'] = test['Mean MV'] + test['Uncertainty MR']
    test['LowerBound MR'] = test['Mean MR'] - test['Uncertainty MR']

     # Relative Uncertainty
    test['RelativeUncertainty MR'] = test['Uncertainty MR']/test['Mean MR']
    return x

In [7]:
# Applying split_mc and split_mr Functions
train['Molar Volume  (cm3)'] = train['Molar Volume  (cm3)'].apply(split_mv)
train['Molar Volume  (cm3)'] = train['Molar Refractivity  (cm3)'].apply(split_mr)

# For Test Data
test['Molar Volume  (cm3)'] = test['Molar Volume  (cm3)'].apply(split_mvt)
test['Molar Refractivity  (cm3)'] = test['Molar Refractivity  (cm3)'].apply(split_mrt)

train.head()

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR
0,pyrazinamide,31.36 + 3.0,31.36 + 3.0,12.43 + 0.5 x 10-24,68.9,-0.092,94.5,3.0,97.5,91.5,0.031746,31.36,3.0,97.5,91.5,0.095663
1,dapsone,67.51 + 0.4,67.51 + 0.4,26.76 + 0.5 x 10-24,94.6,0.027,182.3,3.0,185.3,179.3,0.016456,67.51,0.4,182.7,181.9,0.005925
2,phenobarbitone,59.21 + 0.3,59.21 + 0.3,23.47 + 0.5 x 10-24,75.3,-0.003,188.1,3.0,191.1,185.1,0.015949,59.21,0.3,188.4,187.8,0.005067
3,sulphamethoxazole,62.45 + 0.4,62.45 + 0.4,24.75 + 0.5 x 10-24,107.0,-0.106,173.1,3.0,176.1,170.1,0.017331,62.45,0.4,173.5,172.7,0.006405
4,theophylline,43.14 + 0.3,43.14 + 0.3,17.10 + 0.5 x 10-24,69.3,-0.11,122.9,3.0,125.9,119.9,0.02441,43.14,0.3,123.2,122.6,0.006954


In [8]:
# Column Names
for row in train.columns:
    print(row)

name
Molar Volume  (cm3)
Molar Refractivity  (cm3)
Polarizability  (cm3)
TPSA  (Å2)
ICHI 
Mean MV
Uncertainty MV
UpperBound MV
LowerBound MV
RelativeUncertainty MV
Mean MR
Uncertainty MR
UpperBound MR
LowerBound MR
RelativeUncertainty MR


In [9]:
# Column Names
for row in test.columns:
    print(row)

name
Molar Volume  (cm3)
Molar Refractivity  (cm3)
Polarizability  (cm3)
TPSA  (Å2)
ICHI 
Mean MV
Uncertainty MV
UpperBound MV
LowerBound MV
RelativeUncertainty MV
Mean MR
Uncertainty MR
UpperBound MR
LowerBound MR
RelativeUncertainty MR


In [10]:
test.head()

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR
0,metronidazole,117.8 + 7.0,40.98 + 0.5,16.24 + 0.5 x 10-24,83.9,-0.025,117.8,7.0,124.8,110.8,0.059423,40.98,0.5,118.3,40.48,0.012201
1,prednisolone,274.7 + 5.0,95.48 + 0.4,37.85 + 0.5 x 10-24,94.8,0.32,274.7,5.0,279.7,269.7,0.018202,95.48,0.4,275.1,95.08,0.004189
2,diazepam,225.8 + 7.0,80.91 + 0.5,32.07 + 0.5 x 10-24,32.7,0.61,225.8,7.0,232.8,218.8,0.031001,80.91,0.5,226.3,80.41,0.00618
3,chlorpheniramine,211.4 + 3.0,71.35 + 0.3,28.28 + 0.5 x 10-24,16.1,0.99,211.4,3.0,214.4,208.4,0.014191,71.35,0.3,211.7,71.05,0.004205


In [11]:
# Importing Regex
import re

train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace('x','*', regex=False)
test['Polarizability  (cm3)'] = test['Polarizability  (cm3)'].str.replace('x','*', regex=False)
#train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace(r'10-(\d+)',r'10^\1', regex=True)

train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace(r'10^(\d+)',r'10^-\1', regex=True)
test['Polarizability  (cm3)'] = test['Polarizability  (cm3)'].str.replace(r'10^(\d+)',r'10^-\1', regex=True)

train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace('10-','10^-', regex=False)
test['Polarizability  (cm3)'] = test['Polarizability  (cm3)'].str.replace('10-','10^-', regex=False)


# train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].str.replace('*','e', regex=False)

# def convert(value):
#     return value.replace('10^','10^-')

# train['Polarizability  (cm3)'] = train['Polarizability  (cm3)'].apply(convert)

# def process_value(value):
#     match =re.match(r"([0-9.]+)\s*+\s*([0-9.]+)\s*10\^(-?[0-9]+)", value)

#     if match:
#         mean_value = float(match.group(1)) # To extract mean value
#         uncertainty = float(match.group(2)) * 10**int(match.group(3)) # To convert uncertainty to float

#         return mean_value, uncertainty
#     return None, None # Return None if the format doesn't match

# for row in train['Polarizability  (cm3)']:
#     mean, uncertainty = process_value(row)
#     print(f"Mean: {mean}, Uncertainty:{uncertainty}")

In [12]:
train.head(5)

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR
0,pyrazinamide,31.36 + 3.0,31.36 + 3.0,12.43 + 0.5 * 10^-24,68.9,-0.092,94.5,3.0,97.5,91.5,0.031746,31.36,3.0,97.5,91.5,0.095663
1,dapsone,67.51 + 0.4,67.51 + 0.4,26.76 + 0.5 * 10^-24,94.6,0.027,182.3,3.0,185.3,179.3,0.016456,67.51,0.4,182.7,181.9,0.005925
2,phenobarbitone,59.21 + 0.3,59.21 + 0.3,23.47 + 0.5 * 10^-24,75.3,-0.003,188.1,3.0,191.1,185.1,0.015949,59.21,0.3,188.4,187.8,0.005067
3,sulphamethoxazole,62.45 + 0.4,62.45 + 0.4,24.75 + 0.5 * 10^-24,107.0,-0.106,173.1,3.0,176.1,170.1,0.017331,62.45,0.4,173.5,172.7,0.006405
4,theophylline,43.14 + 0.3,43.14 + 0.3,17.10 + 0.5 * 10^-24,69.3,-0.11,122.9,3.0,125.9,119.9,0.02441,43.14,0.3,123.2,122.6,0.006954


In [13]:
test.head(5)

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR
0,metronidazole,117.8 + 7.0,40.98 + 0.5,16.24 + 0.5 * 10^-24,83.9,-0.025,117.8,7.0,124.8,110.8,0.059423,40.98,0.5,118.3,40.48,0.012201
1,prednisolone,274.7 + 5.0,95.48 + 0.4,37.85 + 0.5 * 10^-24,94.8,0.32,274.7,5.0,279.7,269.7,0.018202,95.48,0.4,275.1,95.08,0.004189
2,diazepam,225.8 + 7.0,80.91 + 0.5,32.07 + 0.5 * 10^-24,32.7,0.61,225.8,7.0,232.8,218.8,0.031001,80.91,0.5,226.3,80.41,0.00618
3,chlorpheniramine,211.4 + 3.0,71.35 + 0.3,28.28 + 0.5 * 10^-24,16.1,0.99,211.4,3.0,214.4,208.4,0.014191,71.35,0.3,211.7,71.05,0.004205


In [14]:
# Polarization (cm3) Column Splitting Function
def split_p(x):
    
    train[['Mean P','Uncertainty P']] = train['Polarizability  (cm3)'].str.split('+',expand=True)
    train['Mean P'] = train['Mean P'].astype(float)

    def convert(value):
        value = re.sub(r'\s*\*\s*10\^', 'e',value)
        try:
            return float(value)
        except ValueError:
            return None
        
    train['Uncertainty P']=train['Uncertainty P'].apply(lambda x:convert(x))
    train['UpperBound P'] = train['Mean P'] + train['Uncertainty P']
    train['LowerBound P'] = train['Mean P'] - train['Uncertainty P']

#Test Data
def split_pt(x):
    
    test[['Mean P','Uncertainty P']] = test['Polarizability  (cm3)'].str.split('+',expand=True)
    test['Mean P'] = test['Mean P'].astype(float)

    def convert(value):
        value = re.sub(r'\s*\*\s*10\^', 'e',value)
        try:
            return float(value)
        except ValueError:
            return None
        
    test['Uncertainty P']=test['Uncertainty P'].apply(lambda x:convert(x))
    test['UpperBound P'] = test['Mean P'] + test['Uncertainty P']
    test['LowerBound P'] = test['Mean P'] - test['Uncertainty P']
 

In [15]:
# Applying split_p function
train['Polarizability  (cm3)']=train['Polarizability  (cm3)'].apply(split_p)
test['Polarizability  (cm3)']=test['Polarizability  (cm3)'].apply(split_pt)

In [16]:
train.head()

Unnamed: 0,name,Molar Volume (cm3),Molar Refractivity (cm3),Polarizability (cm3),TPSA (Å2),ICHI,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR,Mean P,Uncertainty P,UpperBound P,LowerBound P
0,pyrazinamide,31.36 + 3.0,31.36 + 3.0,,68.9,-0.092,94.5,3.0,97.5,91.5,0.031746,31.36,3.0,97.5,91.5,0.095663,12.43,5.0000000000000005e-25,12.43,12.43
1,dapsone,67.51 + 0.4,67.51 + 0.4,,94.6,0.027,182.3,3.0,185.3,179.3,0.016456,67.51,0.4,182.7,181.9,0.005925,26.76,5.0000000000000005e-25,26.76,26.76
2,phenobarbitone,59.21 + 0.3,59.21 + 0.3,,75.3,-0.003,188.1,3.0,191.1,185.1,0.015949,59.21,0.3,188.4,187.8,0.005067,23.47,5.0000000000000005e-25,23.47,23.47
3,sulphamethoxazole,62.45 + 0.4,62.45 + 0.4,,107.0,-0.106,173.1,3.0,176.1,170.1,0.017331,62.45,0.4,173.5,172.7,0.006405,24.75,5.0000000000000005e-25,24.75,24.75
4,theophylline,43.14 + 0.3,43.14 + 0.3,,69.3,-0.11,122.9,3.0,125.9,119.9,0.02441,43.14,0.3,123.2,122.6,0.006954,17.1,5.0000000000000005e-25,17.1,17.1


In [17]:
# data = {'values':['0.25 * 10^-24', '0.31 * 10^-24']}
# df = pd.DataFrame(data)

# def convert(value):
#     value = re.sub(r'\s*\*\s*10\^', 'e',value)
#     try:
#         return float(value)
#     except ValueError:
#         return None
    
# df['FloatValues']=df['values'].apply(lambda x:convert(x))

# print(df)

In [18]:
# Column Names 
print('Train Columns =',train.columns)
print('Test Columns =',test.columns)

Train Columns = Index(['name', 'Molar Volume  (cm3)', 'Molar Refractivity  (cm3)',
       'Polarizability  (cm3)', 'TPSA  (Å2)', 'ICHI ', 'Mean MV',
       'Uncertainty MV', 'UpperBound MV', 'LowerBound MV',
       'RelativeUncertainty MV', 'Mean MR', 'Uncertainty MR', 'UpperBound MR',
       'LowerBound MR', 'RelativeUncertainty MR', 'Mean P', 'Uncertainty P',
       'UpperBound P', 'LowerBound P'],
      dtype='object')
Test Columns = Index(['name', 'Molar Volume  (cm3)', 'Molar Refractivity  (cm3)',
       'Polarizability  (cm3)', 'TPSA  (Å2)', 'ICHI ', 'Mean MV',
       'Uncertainty MV', 'UpperBound MV', 'LowerBound MV',
       'RelativeUncertainty MV', 'Mean MR', 'Uncertainty MR', 'UpperBound MR',
       'LowerBound MR', 'RelativeUncertainty MR', 'Mean P', 'Uncertainty P',
       'UpperBound P', 'LowerBound P'],
      dtype='object')


In [19]:
# Renaming Columns
train=train.rename(columns={'name':'Name','TPSA  (Å2)':'TPSA', 'ICHI ':'ICHI'})
test=test.rename(columns={'name':'Name','TPSA  (Å2)':'TPSA', 'ICHI ':'ICHI'})

In [20]:
new_column =['Name','Mean MV',
       'Uncertainty MV', 'UpperBound MV', 'LowerBound MV',
       'RelativeUncertainty MV', 'Mean MR', 'Uncertainty MR', 'UpperBound MR',
       'LowerBound MR', 'RelativeUncertainty MR', 'UpperBound P', 'LowerBound P','TPSA','ICHI']

In [21]:
train[new_column].head()
test[new_column].head()

Unnamed: 0,Name,Mean MV,Uncertainty MV,UpperBound MV,LowerBound MV,RelativeUncertainty MV,Mean MR,Uncertainty MR,UpperBound MR,LowerBound MR,RelativeUncertainty MR,UpperBound P,LowerBound P,TPSA,ICHI
0,metronidazole,117.8,7.0,124.8,110.8,0.059423,40.98,0.5,118.3,40.48,0.012201,16.24,16.24,83.9,-0.025
1,prednisolone,274.7,5.0,279.7,269.7,0.018202,95.48,0.4,275.1,95.08,0.004189,37.85,37.85,94.8,0.32
2,diazepam,225.8,7.0,232.8,218.8,0.031001,80.91,0.5,226.3,80.41,0.00618,32.07,32.07,32.7,0.61
3,chlorpheniramine,211.4,3.0,214.4,208.4,0.014191,71.35,0.3,211.7,71.05,0.004205,28.28,28.28,16.1,0.99


## Modelling

In [22]:
# Extracting Feature Names
features =['Name','UpperBound MV', 'LowerBound MV','UpperBound MR','LowerBound MR','UpperBound P', 'LowerBound P','TPSA','ICHI']
train_df = train[features].head()
test_df= test[features].head()

In [23]:
print('Train Columns: ')
train_df.head()

Train Columns: 


Unnamed: 0,Name,UpperBound MV,LowerBound MV,UpperBound MR,LowerBound MR,UpperBound P,LowerBound P,TPSA,ICHI
0,pyrazinamide,97.5,91.5,97.5,91.5,12.43,12.43,68.9,-0.092
1,dapsone,185.3,179.3,182.7,181.9,26.76,26.76,94.6,0.027
2,phenobarbitone,191.1,185.1,188.4,187.8,23.47,23.47,75.3,-0.003
3,sulphamethoxazole,176.1,170.1,173.5,172.7,24.75,24.75,107.0,-0.106
4,theophylline,125.9,119.9,123.2,122.6,17.1,17.1,69.3,-0.11


In [24]:
print('Test Columns: ')
test_df.head()

Test Columns: 


Unnamed: 0,Name,UpperBound MV,LowerBound MV,UpperBound MR,LowerBound MR,UpperBound P,LowerBound P,TPSA,ICHI
0,metronidazole,124.8,110.8,118.3,40.48,16.24,16.24,83.9,-0.025
1,prednisolone,279.7,269.7,275.1,95.08,37.85,37.85,94.8,0.32
2,diazepam,232.8,218.8,226.3,80.41,32.07,32.07,32.7,0.61
3,chlorpheniramine,214.4,208.4,211.7,71.05,28.28,28.28,16.1,0.99


In [25]:
# Scikit-Learn Importations
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb


from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.metrics import mean_squared_error


In [26]:
# Separating Features and Target in train dataset
X_train = train_df.drop('ICHI',axis=1) 
y_train = train_df['ICHI'] 

# Separating Features in test dataset 
X_test = test_df.drop('ICHI',axis=1) 
y_test = test_df['ICHI'] 

In [27]:
# ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), ['UpperBound MV', 'LowerBound MV','UpperBound MR','LowerBound MR','UpperBound P', 'LowerBound P','TPSA']),  # Scaling numeric features
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Name'])      # One-Hot Encoding categorical features
    ])

### Random Forest Pipeline

In [28]:
# Pipeline
p1 = Pipeline(steps=[
    ('preprocessor', preprocessor),  
    ('model', RandomForestRegressor())  
])

# Fitting the pipeline on training data
p1.fit(X_train, y_train)

In [29]:
# Predict on the test data
y_pred = p1.predict(X_test)

# Output predictions
print(y_pred)

[-0.09775 -0.03617 -0.03724 -0.03724]


In [30]:
 # MAE Evaluation
RandomF = (mean_squared_error(y_test, y_pred))
print(RandomF)

0.40157281664999994


In [31]:
test_df['ICHI'].tolist()

[-0.025, 0.32, 0.61, 0.99]

### Linear Regression Pipeline

In [32]:
p2= Pipeline(steps=[
    ('preprocessor', preprocessor),  
    ('model', LinearRegression())  
])

# Fitting the pipeline on training data
p2.fit(X_train, y_train)

In [33]:
# Predict on the test data
y_pred = p2.predict(X_test)

# Output predictions
print(y_pred)

[-0.10097006  0.0233399   0.01300821  0.00467292]


In [34]:
 # MAE Evaluation
LinearR = (mean_squared_error(y_test, y_pred))
print(LinearR)

0.35526182812384116


### Decision Tree Pipeline

In [35]:
p3= Pipeline(steps=[
    ('preprocessor', preprocessor),  
    ('model', DecisionTreeRegressor())  
])

# Fitting the pipeline on training data
p3.fit(X_train, y_train)

In [36]:
# Predict on the test data
y_pred = p3.predict(X_test)

# Output predictions
print(y_pred)

[-0.092 -0.092 -0.092 -0.092]


In [37]:
DecisionT = (mean_squared_error(y_test, y_pred))
print(DecisionT)

0.45944025


### Xgboost Pipeline

In [38]:
p4= Pipeline(steps=[
    ('preprocessor', preprocessor),  
    ('model', xgb.XGBRegressor())  
])

# Fitting the pipeline on training data
p4.fit(X_train, y_train)

In [39]:
# Predict on the test data
y_pred = p4.predict(X_test)

# Output predictions
print(y_pred)

[-0.09205928 -0.00047073 -0.00047073 -0.00047073]


In [40]:
Xgbt = (mean_squared_error(y_test, y_pred))
print(Xgbt)

0.36522630407551204


### Gradient Boosting Pipeline

In [41]:
p5= Pipeline(steps=[
    ('preprocessor', preprocessor),  
    ('model', GradientBoostingRegressor())  
])

# Fitting the pipeline on training data
p5.fit(X_train, y_train)

In [42]:
# Predict on the test data
y_pred = p5.predict(X_test)

# Output predictions
print(y_pred)

[-0.10669543 -0.01758003 -0.02141588 -0.02141588]


In [43]:
Gbt = (mean_squared_error(y_test, y_pred))
print(Gbt)

0.3855706315252491


In [44]:
print(f'The Linear regression MSE is {LinearR}\nThe Decision Tree Classifier MSE is {DecisionT}\nThe Random Forest MSE is {RandomF}\nThe Xgboost MSE is {Xgbt}\nThe Gradient Boost MSE is {Gbt}')

The Linear regression MSE is 0.35526182812384116
The Decision Tree Classifier MSE is 0.45944025
The Random Forest MSE is 0.40157281664999994
The Xgboost MSE is 0.36522630407551204
The Gradient Boost MSE is 0.3855706315252491


In [45]:
# # Save predictions to CSV
# output = pd.DataFrame({'Predictions': y_pred})
# output.to_csv('predictions.csv', index=False)