# ACE2-Spike protein interaction

This notebook shows the steps to clean up a dataset obtained from a published study (Yang et al., Nature Microbiology, 2023) and train a machine learning model on that data.

_This notebook is prepared to run locally after cloning this repository. To run it in Colab, upload the datasets to Colab and adjust the paths as needed._

## 1. Data Cleaning

We have downloaded two of the supplementary files, for Fig 2b (natural product screening) and Extended Data Figure 1b (synthetic drugs). We have kept the columns "Drug Cas", "Drug Name", "Log2FoldChange" and "-Log10Pvalue" in individual .csv files.
We will start the cleaning process from those.

In [3]:
import os
import pandas as pd

DATAPATH = "../data/m3_datasets/ace2-spike"

In [None]:
datasets = ["natural", "synthetic"]

In [3]:
# these files use ";" as the separator (the drug names contain ",") and "," as the decimal mark
np = pd.read_csv(os.path.join(DATAPATH, "original", "natural_products_screening.csv"), sep=";", decimal=",")
sp = pd.read_csv(os.path.join(DATAPATH, "original", "synthetic_drugs_screening.csv"), sep=";", decimal=",")
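
To see why both `sep=";"` and `decimal=","` matter, here is a minimal sketch on an in-memory toy table. The row is made up for illustration (placeholder CAS number, invented values), and the `drug name` column header is an assumption about the file layout; only `drug CAS`, `Log2FoldChange` and `mLog10Pvalue` are names used elsewhere in this notebook.

```python
import io

import pandas as pd

# Toy content mimicking the layout of the original files.
# The row below is NOT real data; "drug name" is an assumed header.
raw = (
    "drug CAS;drug name;Log2FoldChange;mLog10Pvalue\n"
    "0000-00-0;Example compound, salt form;-1,25;2,5\n"
)

toy = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

# The comma inside the drug name is preserved, and the numeric
# columns are parsed as floats instead of strings.
print(toy.dtypes)
print(toy.iloc[0].tolist())
```

Without `decimal=","` the two numeric columns would be read as text, which would break any later numeric filtering.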

In [4]:
# we need a function to get the SMILES from the CAS number
# PubChem stores both identifiers, so we can query it to obtain the SMILES

import requests
from tqdm import tqdm

def get_smiles_from_cas(cas_number):
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{cas_number}/property/CanonicalSMILES/TXT"
    # a timeout avoids hanging forever if PubChem does not respond
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        return response.text.strip()
    else:
        return None
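
If the same CAS number appears in more than one row, each repeat triggers another network request. One possible way to avoid that, sketched below, is to cache results per CAS number. The helper name `lookup_with_cache` is hypothetical and not part of the original notebook; `fetch` stands for any function with the same interface as `get_smiles_from_cas` (one CAS string in, SMILES or `None` out).

```python
def lookup_with_cache(cas_numbers, fetch):
    """Look up each CAS number with `fetch`, calling it once per unique value.

    `fetch` is assumed to behave like get_smiles_from_cas: it takes one
    CAS string and returns a SMILES string or None.
    """
    cache = {}
    results = []
    for cas in cas_numbers:
        if cas not in cache:
            # first time we see this CAS number: do the (slow) lookup
            cache[cas] = fetch(cas)
        results.append(cache[cas])
    return results
```

In this notebook it could be called as `lookup_with_cache(df["drug CAS"].tolist(), get_smiles_from_cas)`; failed lookups (`None`) are cached too, so they are not retried.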

In [None]:
# we will run a small loop to process both datasets
for d in datasets:
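    # the original file names differ: "natural_products_..." vs "synthetic_drugs_..."
    suffix = "drugs" if d == "synthetic" else "products"
    df = pd.read_csv(os.path.join(DATAPATH, "original", "{}_{}_screening.csv".format(d, suffix)), sep=";", decimal=",")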
    smiles = []
    # iterate over the dataset loaded in this iteration (df), not over sp
    for cas in tqdm(df["drug CAS"].tolist()):
        smi = get_smiles_from_cas(cas)
        smiles += [smi]
    smiles_ = []
    for smi in smiles:
        if smi is None:
            smiles_ += [None]
        else:
            smiles_ += [smi.split("\n")[0]]
    df["SMILES"] = smiles_
    df.to_csv(os.path.join(DATAPATH, "processed", "{}_products_screening_with_smiles.csv".format(d)), index=False)

In [None]:
# next, we obtain the Canonical format of the SMILES
# We use the standardiser package for that

from rdkit import Chem
from standardiser import standardise

for d in datasets:
    df = pd.read_csv(os.path.join(DATAPATH,"processed", "{}_products_screening_with_smiles.csv".format(d)), sep=",")
    df = df[df["SMILES"].notnull()]
    std_smi = []
    for smi in df["SMILES"].tolist():
        mol = Chem.MolFromSmiles(smi)
        try:
            mol = standardise.run(mol)
            smi = Chem.MolToSmiles(mol)
        except Exception:  # standardisation failed (e.g. the SMILES could not be parsed)
            smi = None
        std_smi += [smi]
    data = {
        "CAS": df["drug CAS"].tolist(),
        "SMILES": std_smi,
        "Log2FoldChange": df["Log2FoldChange"].tolist(),
        "mLogPvalue": df["mLog10Pvalue"].tolist(),
    }
    data = pd.DataFrame(data)
    data = data.sample(frac=1)  # shuffle the rows
    data = data[data["SMILES"].notnull()]
    data.to_csv(os.path.join(DATAPATH, "processed", "{}_products_ace2_spike_processed.csv".format(d)), index=False)
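
The shuffle above gives a different row order on every run. If a reproducible order is wanted, a fixed seed can be passed to `sample`. The sketch below illustrates this on a toy table; the helper name `shuffle_and_clean`, the seed value and the example rows are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

# Toy table for illustration only; the values are made up.
toy = pd.DataFrame(
    {
        "SMILES": ["CCO", None, "c1ccccc1", "CC(=O)O"],
        "Log2FoldChange": [0.1, -0.3, -1.2, 0.8],
    }
)


def shuffle_and_clean(df, seed=42):
    """Shuffle rows reproducibly, then drop rows without a SMILES."""
    shuffled = df.sample(frac=1, random_state=seed)
    cleaned = shuffled[shuffled["SMILES"].notnull()]
    # reset the index so it does not carry the pre-shuffle positions
    return cleaned.reset_index(drop=True)


first = shuffle_and_clean(toy)
second = shuffle_and_clean(toy)
print(first.equals(second))  # same seed, same order
```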

## 2. Model Training

We will use the lazyQSAR package to train a model. We will first analyse the datasets to understand the distribution of our activity of interest, and what might be a good cut-off for our problem.
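
As a rough sketch of what exploring a cut-off could look like, the snippet below counts how many compounds would be labelled active at a few candidate thresholds on `Log2FoldChange`. The values are toy numbers, not data from the study. The direction of activity (lower fold change means active) and the candidate cut-offs are assumptions made only for illustration, and `label_actives` is a hypothetical helper.

```python
import pandas as pd

# Toy values for illustration only; they are NOT taken from the study.
toy = pd.DataFrame({"Log2FoldChange": [-2.1, -1.2, -0.4, 0.0, 0.3, 1.5]})


def label_actives(values, cutoff):
    """Return 1 where the value is at or below the cutoff, else 0.

    Assumes that lower fold change means "active"; flip the comparison
    if the assay readout points the other way.
    """
    return (values <= cutoff).astype(int)


# Compare how the class balance changes with the cut-off.
for cutoff in (-2.0, -1.0, -0.5):
    n_active = int(label_actives(toy["Log2FoldChange"], cutoff).sum())
    print(f"cutoff={cutoff}: {n_active} actives out of {len(toy)}")
```

A very strict cut-off leaves few actives to learn from, while a loose one may mix weak and strong effects, so looking at the counts first helps pick a sensible threshold.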