<a href="https://colab.research.google.com/github/ebertsch123/LiProS/blob/main/LiProS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LiProS**

<img src="https://i.imgur.com/FjTNXlA.png" width="75">

---

**A FAIR workflow to determine accurate lipophilicity profiles for small molecules.**


$\log{D}$ is a physicochemical property of growing importance in drug design, environmental chemistry, medicinal chemistry, and food chemistry. It is often not possible to measure the $\log{D}$ of a molecule directly, so thermodynamically derived equations are used to calculate it instead. Two main equations are in use.

### **Equation 1:**

$$\log{D_{\text{pH}}} = \log{P_{\text{N}}}-\log{\left(1+10^{\delta}\right)}$$

### **Equation 2:**

$$\log{D_{\text{pH}}} = \log{\left(P_{\text{N}}+P_{\text{I}}^{\text{app}}\cdot10^{\delta}\right)}-\log{\left(1+10^{\delta}\right)}$$

**Equation 1** is often used because of its simplicity. **Equation 2** often gives more accurate results, but it requires additional experimental data: the apparent partition coefficient of the ionic species ($P_{\text{I}}^{\text{app}}$).
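
As a rough illustration of how the two equations differ, the sketch below evaluates both for a single hypothetical acid. It assumes $\delta = \text{pH} - \text{p}K_{\text{a}}$ for acids and $\delta = \text{p}K_{\text{a}} - \text{pH}$ for bases, which is how the cells later in this notebook compute `delta`. The function names and the numerical values ($\log{P_{\text{N}}} = 4.0$, $\log{P_{\text{I}}^{\text{app}}} = 1.0$, $\text{p}K_{\text{a}} = 4.4$, pH 7.4) are made up for the example and are not part of LiProS.

```python
import math

def delta(kind, pKa, pH):
    """Signed distance between pH and pKa (assumed convention, see text)."""
    if kind == "acid":
        return pH - pKa
    if kind == "base":
        return pKa - pH
    raise ValueError("kind must be 'acid' or 'base'")

def log_d_eq1(log_p_n, d):
    """Equation 1: only the neutral species is assumed to partition."""
    return log_p_n - math.log10(1 + 10 ** d)

def log_d_eq2(log_p_n, log_p_i_app, d):
    """Equation 2: the ionic species also partitions, with coefficient P_I^app."""
    p_n = 10 ** log_p_n
    p_i_app = 10 ** log_p_i_app
    return math.log10(p_n + p_i_app * 10 ** d) - math.log10(1 + 10 ** d)

# Hypothetical acid: pKa 4.4, evaluated at pH 7.4
d = delta("acid", pKa=4.4, pH=7.4)
print("Eq. 1:", round(log_d_eq1(4.0, d), 2))
print("Eq. 2:", round(log_d_eq2(4.0, 1.0, d), 2))
```

For this hypothetical acid, Equation 2 gives a value about 0.3 log units higher than Equation 1, because the ionic term $P_{\text{I}}^{\text{app}}\cdot10^{\delta}$ is as large as $P_{\text{N}}$.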

*Which equation should be used, and in which cases?* This script helps you decide, with the aim of making lipophilicity calculations as accurate and efficient as possible.



---



 **WARNING:** *This model was created for small organic molecules. Inaccurate results may occur if large molecules (M > 1000 Da), salts, or organometallic complexes are evaluated.*



---


## **1. Tools Installation**

These cells do not need any input. You just have to run them.

You can run a cell by pressing its *play* button or by pressing `Ctrl + Enter`.







This first cell might run for a couple of minutes.

In [None]:
%%capture
#install required packages
!pip install condacolab
import condacolab
condacolab.install()
!mamba install openbabel
!mamba install rdkit
!pip install jazzy
!mamba install r-rcdk
!mamba install bioconda::bioconductor-chemminer=3.50
!mamba install r-dplyr
!mamba install rpy2=3.5.1
!mamba install r-caret
!mamba install r-Metrics
!mamba install r-randomForest
!mamba install r-e1071

#import the files and script for our models from repositories
!git clone https://github.com/Anto3006/CalculateDescriptors.git
!cp CalculateDescriptors/* .
!rm -r CalculateDescriptors
!git clone https://github.com/ebertsch123/LiProS.git

Keep running these cells. No input is required.

In [None]:
%%capture
#activate the R environment to import our models
%load_ext rpy2.ipython

Keep running these cells. No input is required.

In [None]:
%%capture
%%R
source('/content/LiProS/models.R') #load our models

Keep running...

In [None]:
import pandas as pd
pd.DataFrame.iteritems = pd.DataFrame.items
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()
from google.colab import files

#transform R dataframes into pandas dataframes
acids_d = ro.conversion.rpy2py(ro.r['acids_d']).drop(columns=['cond'])
bases_d = ro.conversion.rpy2py(ro.r['bases_d']).drop(columns=['cond'])

#calculate descriptors function
from calculateDescriptors import calculateDescriptors

def create_descriptors_table(smiles):
    if len(smiles) == 0:
      return
    unique_smiles = len(smiles) == 1
    if unique_smiles:
      smiles = smiles + ["C"]
    descriptors = calculateDescriptors(smiles)
    if unique_smiles:
      descriptors.drop(descriptors.tail(1).index,inplace = True)
      smiles = smiles[:-1]
    return descriptors



---


## **2. Import your molecules**


### **2.1. Acidic Molecules**

**In this section, you will only import your acidic molecules.**

In the following cells, you will input your molecules.

The next cell will show you three prompts:


1.   `Enter the SMILES of your ACIDIC molecules:`
Paste the SMILES code of your molecule(s) here. To input multiple molecules, separate the SMILES codes with spaces. You need to enter at least **two molecules**.
2.   `Enter the (estimated) pKa:`
Enter the $\text{p}K_{\text{a}}$ of each molecule. You can estimate the value if needed, or use a $\text{p}K_{\text{a}}$ prediction tool (https://playground.calculators.cxn.io/). **Separate the values with spaces**.
3.   `Enter the desired pH:`
Enter the pH at which you want to calculate the $\log{D}$ value of each molecule. **Separate the values with spaces**.
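
Because the three answers are matched by position, each one must contain the same number of space-separated values. A minimal sketch of that check follows; `parse_inputs` is a hypothetical helper that is not part of LiProS, and the molecules (acetic acid and benzoic acid) are only examples.

```python
def parse_inputs(smiles_text, pKa_text, pH_text):
    """Split the three answers on whitespace and check that they line up."""
    smiles = smiles_text.split()  # split() with no argument ignores repeated spaces
    pKa = [float(v) for v in pKa_text.split()]
    pH = [float(v) for v in pH_text.split()]
    if not (len(smiles) == len(pKa) == len(pH)):
        raise ValueError(
            f"Got {len(smiles)} SMILES, {len(pKa)} pKa and {len(pH)} pH values; "
            "the three counts must match."
        )
    return smiles, pKa, pH

# Example: acetic acid and benzoic acid, both evaluated at pH 7.4
smiles, pKa, pH = parse_inputs("CC(=O)O OC(=O)c1ccccc1", "4.76 4.20", "7.4 7.4")
print(list(zip(smiles, pKa, pH)))
```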

In [None]:
%%capture
from google.colab import data_table
data_table.enable_dataframe_formatter()
smiles = input("Enter the SMILES of your ACIDIC molecules:\n")
pKa = input("Enter the (estimated) pKa:\n")
pH = input("Enter the desired pH:\n")
pd.set_option('display.max_rows', None) #show all rows when a table is displayed
descriptors = create_descriptors_table(smiles.split(" ")) #calculate descriptors from SMILES
descriptors['pKa'] = pKa.split(" ")
descriptors['pH'] = pH.split(" ")
descriptors['pKa'] = descriptors['pKa'].astype(float)
descriptors['pH'] = descriptors['pH'].astype(float)
descriptors['delta'] = descriptors.apply(lambda row: row['pH'] - row['pKa'], axis=1)
descriptors = descriptors[acids_d.columns] #filter the descriptors to only get the necessary ones to predict our outcomes

Keep running these cells.

In [None]:
%%R -i descriptors
LR_fit = as.numeric(predict(model_LR_A, descriptors))
RF_fit = as.numeric(predict(model_RF_A, descriptors))
SVML_fit = as.numeric(predict(model_SVML_A, descriptors))
descriptors$Prediction = ifelse(as.factor(round((LR_fit + RF_fit + SVML_fit)/3-1,0))=='0','Use Eq. 2', 'Use Eq. 1')

The following cell will give you the **predictions**. Each prediction recommends the most appropriate equation for the molecule at the desired pH.

A CSV file named `LiProS_acids.csv` will be created and downloaded automatically.
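
For acidic molecules, the recommendation comes from three models (`model_LR_A`, `model_RF_A`, `model_SVML_A`) whose class predictions are averaged and rounded, which amounts to a majority vote. The sketch below reproduces that logic in Python, assuming each model returns the class label 0 (Equation 2) or 1 (Equation 1); the function name is hypothetical.

```python
def ensemble_recommendation(lr_class, rf_class, svm_class):
    """Majority vote over three binary class labels (assumed to be 0 or 1)."""
    votes = (lr_class, rf_class, svm_class)
    if any(v not in (0, 1) for v in votes):
        raise ValueError("each vote must be 0 or 1")
    mean_vote = sum(votes) / 3      # one of 0, 1/3, 2/3, 1
    majority = round(mean_vote)     # never a tie with three voters
    return "Use Eq. 2" if majority == 0 else "Use Eq. 1"

# Two of the three hypothetical models vote for class 0
print(ensemble_recommendation(0, 0, 1))
```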

In [None]:
descriptors = ro.conversion.rpy2py(ro.r['descriptors'])

output = {'smiles': smiles.split(" "),'pH': pH.split(" "), 'Prediction': descriptors['Prediction']}
output = pd.DataFrame(output)
output.to_csv('LiProS_acids.csv', index = False)
files.download('LiProS_acids.csv')
output



---


### **2.2. Basic Molecules**

**In this section, you will only import your basic molecules.**

In the following cells, you will input your molecules.

The next cell will show you three prompts:


1.   `Enter the SMILES of your BASIC molecules:`
Paste the SMILES code of your molecule(s) here. To input multiple molecules, separate the SMILES codes with spaces.
2.   `Enter the (estimated) pKa:`
Enter the $\text{p}K_{\text{a}}$ of each molecule. You can estimate the value if needed, or use a $\text{p}K_{\text{a}}$ prediction tool (https://playground.calculators.cxn.io/). **Separate the values with spaces**.
3.   `Enter the desired pH:`
Enter the pH at which you want to calculate the $\log{D}$ value of each molecule. **Separate the values with spaces**.

In [None]:
%%capture
from google.colab import data_table
data_table.enable_dataframe_formatter()
smiles = input("Enter the SMILES of your BASIC molecules:\n")
pKa = input("Enter the (estimated) pKa:\n")
pH = input("Enter the desired pH:\n")
pd.set_option('display.max_rows', None) #show all rows when a table is displayed
descriptors = create_descriptors_table(smiles.split(" "))
descriptors['pKa'] = pKa.split(" ")
descriptors['pH'] = pH.split(" ")
descriptors['pKa'] = descriptors['pKa'].astype(float)
descriptors['pH'] = descriptors['pH'].astype(float)
descriptors['delta'] = descriptors.apply(lambda row: row['pKa'] - row['pH'], axis=1)
descriptors = descriptors[bases_d.columns]

In [None]:
%%R -i descriptors
descriptors$Prediction = ifelse(predict(model_SVML_B, descriptors)=='0','Use Eq. 2', 'Use Eq. 1')

The following cell will give you the **predictions**. Each prediction recommends the most appropriate equation for the molecule at the desired pH.

A CSV file named `LiProS_bases.csv` will be created and downloaded automatically.

In [None]:
descriptors = ro.conversion.rpy2py(ro.r['descriptors'])

output = {'smiles': smiles.split(" "),'pH': pH.split(" "), 'Prediction': descriptors['Prediction']}
output = pd.DataFrame(output)
output.to_csv('LiProS_bases.csv', index = False)
files.download('LiProS_bases.csv')
output

### **2.3. Insert your data in a `.xlsx` file**

Alternatively, you can input your data as a `.xlsx` file.

Your file must contain the following columns:

1. **`SMILES`:** The SMILES code of each molecule.
2. **`type`:** Is your molecule acidic or basic? You must enter "acid" if the molecule is acidic or "base" if the molecule is basic.
3. **`pKa`:** The $\text{p}K_{\text{a}}$ of your molecule. It can be an estimated value.
4. **`pH`:** The pH at which you want the $\log{D}$ of this molecule.
5. **`id`:** An identifier for each molecule. The final cell copies it into the results file.
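
Before uploading, it may help to check your rows against these requirements. The sketch below does so on plain Python dictionaries; `validate_rows` is a hypothetical helper, the example values are illustrative, and the `id` column is included because the final cell of this section reads `df['id']` when it builds the results file.

```python
REQUIRED = ("id", "SMILES", "type", "pKa", "pH")

def validate_rows(rows):
    """Return a list of problems; an empty list means the rows look usable."""
    problems = []
    for n, row in enumerate(rows, start=1):
        missing = [c for c in REQUIRED if c not in row]
        if missing:
            problems.append(f"row {n}: missing {', '.join(missing)}")
            continue
        # The notebook compares 'type' with exactly 'acid' and 'base'
        if row["type"] not in ("acid", "base"):
            problems.append(f"row {n}: type must be 'acid' or 'base', got {row['type']!r}")
        for col in ("pKa", "pH"):
            try:
                float(row[col])
            except (TypeError, ValueError):
                problems.append(f"row {n}: {col} is not a number")
    return problems

rows = [
    {"id": 1, "SMILES": "CC(=O)O", "type": "acid", "pKa": 4.76, "pH": 7.4},
    {"id": 2, "SMILES": "NCc1ccccc1", "type": "Base", "pKa": 9.3, "pH": 7.4},
]
print(validate_rows(rows))
```

The second row is flagged because the comparison is case-sensitive. If your data is already in a pandas DataFrame, `df.to_dict('records')` produces rows in this shape.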

First, upload your `.xlsx` file to this Colab session using the "Files" menu on the left of the screen.

Then, run the next cell and type the name of your `.xlsx` file.

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()
fileName = input("Enter the name of the file: ") #enter the name of your file


#read the SMILES codes
fileFound = False
try:
  df = pd.read_excel(fileName)
  fileFound = True
except FileNotFoundError:
  print("File not found")

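if fileFound and "SMILES" in df.columns:
    smiles = df["SMILES"].tolist() #plain list, as in the previous sections
else:
    #stop here with a clear message instead of failing further down
    raise ValueError("File or SMILES column not found. Check the file name and its column names.")


#calculate descriptors
descriptors = create_descriptors_table(smiles)
descriptors['pKa'] = df["pKa"]
descriptors['pH'] = df["pH"]
#delta is pH - pKa for acids and pKa - pH for bases
descriptors['delta'] = descriptors.apply(lambda row: row['pH'] - row['pKa'] if df['type'][row.name] == 'acid' else row['pKa'] - row['pH'], axis=1)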

#split the data into acidic and basic molecules
acids = descriptors[df['type'] == 'acid']
acids = acids[acids_d.columns]
bases = descriptors[df['type'] == 'base']
bases = bases[bases_d.columns]

Keep running this cell.

In [None]:
%%R -i acids -i bases
#predict for acids
LR_fit = as.numeric(predict(model_LR_A, acids))
RF_fit = as.numeric(predict(model_RF_A, acids))
SVML_fit = as.numeric(predict(model_SVML_A, acids))
acids$Prediction = ifelse(as.factor(round((LR_fit + RF_fit + SVML_fit)/3-1,0))=='0','Use Eq. 2', 'Use Eq. 1')

#predict for bases
bases$Prediction = ifelse(predict(model_SVML_B, bases)=='0','Use Eq. 2', 'Use Eq. 1')

This final cell will give you a `.csv` file that will automatically get downloaded to your computer.

In [None]:
acids = ro.conversion.rpy2py(ro.r['acids'])
bases = ro.conversion.rpy2py(ro.r['bases'])


# Reset indices to ensure proper alignment
acids = acids.reset_index(drop=True)
bases = bases.reset_index(drop=True)
df = df.reset_index(drop=True)

# Create a list to store the combined data
combined_data = []

# Iterate over the original DataFrame and extract predictions
acid_counter = 0
base_counter = 0
for i in range(len(df)):
  if df['type'][i] == 'acid':
    combined_data.append([df['id'][i], df['SMILES'][i], df['type'][i], df['pH'][i], acids['Prediction'][acid_counter]])
    acid_counter += 1 # Increment acid counter
  elif df['type'][i] == 'base':
    combined_data.append([df['id'][i], df['SMILES'][i], df['type'][i], df['pH'][i], bases['Prediction'][base_counter]])
    base_counter += 1 # Increment base counter

# Create the output DataFrame
output = pd.DataFrame(combined_data, columns=['id', 'SMILES', 'type', 'pH', 'Prediction'])

#transform the output to csv and download it
output.to_csv('LiProS_results.csv', index = False)
files.download('LiProS_results.csv')
output