# Intro

This notebook shows how to prepare your data for use in our ML training pipeline. The workflow is to download data from a CDD saved search and from Fragalysis, then process this data to build the appropriate data objects. Notes along the way cover data coming from other sources.

# Downloading data from CDD

The ASAP project uses the CDD Vault for storing our compound data, so the first step will be to pull our data down from CDD. The main process for this is:

1. Using the online CDD UI, perform a search for the molecules you want to download. Ensure that all the information you want for each compound is selected in the report.
2. Save the search and keep a note of the search's URL.
3. Use the `asapdiscovery` API to download, filter, and process your data as desired. The code for this step is shown below.
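
The saved-search URL from step 2 is not passed to the download code directly: the download cell in this notebook uses the same vault and search identifiers under an `/api/v1` prefix, along with an `X-CDD-token` header. The sketch below shows one way to derive both. The helper names `saved_search_api_url` and `cdd_header` are hypothetical (they are not part of `asapdiscovery`), and reading the token from an environment variable called `CDDTOKEN` is an assumption made here to keep the token out of the notebook.

```python
import os
from urllib.parse import urlparse

CDD_HOST = "app.collaborativedrug.com"


def saved_search_api_url(ui_url):
    """Convert a saved-search UI URL into the matching API URL.

    Assumes the UI URL path looks like /vaults/<vault_id>/searches/<search>.
    """
    parts = [p for p in urlparse(ui_url).path.split("/") if p]
    if len(parts) != 4 or parts[0] != "vaults" or parts[2] != "searches":
        raise ValueError(f"Unexpected saved-search URL: {ui_url}")
    vault_id, search = parts[1], parts[3]
    return f"https://{CDD_HOST}/api/v1/vaults/{vault_id}/searches/{search}"


def cdd_header(env=os.environ):
    """Build the request header from a CDDTOKEN environment variable (assumed name)."""
    token = env.get("CDDTOKEN")
    if not token:
        raise RuntimeError("Set the CDDTOKEN environment variable to your CDD API token")
    return {"X-CDD-token": token}
```

If your saved-search URL has a different shape, the helper raises a `ValueError` rather than guessing, so you can fill in `vault_id` and `search` by hand instead.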

Note also that you can of course skip all the CDD-related steps if you have an alternative data source or already have your molecules in a local CSV file.
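
If you start from a local CSV file, it can help to confirm that the ID and SMILES columns you plan to pass as `id_fieldname` and `smiles_fieldname` are actually present. The following is a minimal sketch using only the standard library. The helper `missing_columns` is hypothetical, and the file contents and column names in the demonstration are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def missing_columns(csv_path, required):
    """Return the required column names that are absent from the CSV header."""
    with Path(csv_path).open(newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]


# Demonstration with a throwaway file whose SMILES column has a different name
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "my_molecules.csv"
    demo_csv.write_text("molecule_id,SMILES\nmol-1,CCO\n")
    demo_missing = missing_columns(demo_csv, ["molecule_id", "molecule_smiles"])

print(demo_missing)
```

In this demonstration the file uses `SMILES` rather than `molecule_smiles`, so you would either rename the column or pass `smiles_fieldname="SMILES"` to the filtering step.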

The code in the next cell should run as is once you fill in `vault_id`, `search`, and `CDDTOKEN`. You may also need to change `id_fieldname`, `smiles_fieldname`, and `assay_name` to match your vault.

In [None]:
from asapdiscovery.data.services.cdd.cdd_download import download_url
from asapdiscovery.data.util.utils import (
    filter_molecules_dataframe,
    parse_fluorescence_data_cdd,
)
from pathlib import Path

import pandas

"""
There are 3 separate functions that perform the downloading, filtering,
and processing. We also provide a convenient wrapper function that calls all 3.

We will first show an example using the individual functions, and then show the
equivalent single function that will get the same results.

For a saved search with URL https://app.collaborativedrug.com/vaults/<vault_id>/searches/<search>
the code will be:
"""

# Download the molecules in CSV format (you will need to set the CDDTOKEN variable to
#  your CDD API token)
response = download_url(
    "https://app.collaborativedrug.com/api/v1/vaults/<vault_id>/searches/<search>",
    header={"X-CDD-token": CDDTOKEN},
)

# Save the downloaded text as a CSV file
cache_file = Path("cdd_unfiltered.csv")
cache_file.write_text(response.content.decode())

# Load the data back as a pandas DataFrame
mol_df = pandas.read_csv(cache_file)

"""
In this example, we will ultimately use this data to train a structure-based SchNet
model, so we will keep all achiral and enantiopure molecules, including any molecule
with semiquantitative fluorescence values.

In the below function call, we use "molecule_id" as the id_fieldname,
"molecule_smiles" as the smiles_fieldname, and "fluorescence_exp" as the assay_name.
You will likely need to replace these with whatever the corresponding data field in
your CDD vault is called.
"""
mol_df_filt = filter_molecules_dataframe(
    mol_df,
    id_fieldname="molecule_id",
    smiles_fieldname="molecule_smiles",
    assay_name="fluorescence_exp",
    retain_achiral=True,
    retain_enantiopure=True,
    retain_semiquantitative_data=True,
)

"""
In addition to being appropriately filtered, mol_df_filt now contains some identifying
columns with standardized names, so we only need to pass along the assay_name.

The parse_fluorescence_data_cdd function standardizes the fluorescence assay results,
adding a number of columns to the data frame. In addition to IC50 and pIC50 values, it
will also calculate deltaG in kcal/mol and kT units. If you know your fluorescence
assay conditions and they were consistent across all molecules, you can supply the
information to the cp_values arg as a tuple of (substrate_concentration, Km), which the
function will use in the Cheng-Prusoff equation to calculate the deltaG values. If you
don't have these values, the function will use a less accurate approximation. We will
exclude the Cheng-Prusoff values in this example for simplicity.

More details on the columns that the function expects the input to have and that it
adds to the output can be found in the function's docstring.
"""
mol_df_filt_processed = parse_fluorescence_data_cdd(
    mol_df_filt, assay_name="fluorescence_exp"
)

# Save the processed data
mol_df_filt_processed.to_csv("cdd_filtered_processed.csv", index=False)
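
To make the Cheng-Prusoff conversion mentioned in the cell above concrete, here is a small worked sketch of the textbook form for competitive inhibition, Ki = IC50 / (1 + [S]/Km), followed by deltaG = RT ln(Ki). This is not the `asapdiscovery` implementation: the function names, the 298.15 K temperature, the 1 M standard state, and the example numbers are all assumptions, and the library may use different constants or conventions.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def ic50_to_ki(ic50_molar, substrate_conc, km):
    """Cheng-Prusoff for competitive inhibition: Ki = IC50 / (1 + [S]/Km).

    substrate_conc and km must be in the same units.
    """
    return ic50_molar / (1.0 + substrate_conc / km)


def delta_g(ki_molar, temperature=298.15):
    """Return (deltaG in kT units, deltaG in kcal/mol) for Ki given in molar."""
    dg_kt = math.log(ki_molar)
    return dg_kt, dg_kt * R_KCAL * temperature


ic50 = 1e-6  # hypothetical IC50 of 1 uM, expressed in molar
pic50 = -math.log10(ic50)
# With [S] equal to Km, the correction factor is 2, so Ki is half the IC50
ki = ic50_to_ki(ic50, substrate_conc=1.0, km=1.0)
dg_kt, dg_kcal = delta_g(ki)
print(f"pIC50={pic50:.2f} Ki={ki:.2e} M dG={dg_kt:.2f} kT ({dg_kcal:.2f} kcal/mol)")
```

The correction only matters when the substrate concentration is comparable to or larger than Km. When [S] is much smaller than Km, Ki is approximately equal to the IC50.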

In [None]:
from asapdiscovery.data.services.cdd.cdd_download import download_molecules

# As an alternative to the previous cell, we can accomplish the same thing with a call
#  to our wrapper function
mol_df_filt_processed = download_molecules(
    header={"X-CDD-token": CDDTOKEN},
    vault=vault_id,
    search=search,
    fn_out="cdd_filtered_processed.csv",
    fn_cache="cdd_unfiltered.csv",
    id_fieldname="molecule_id",
    smiles_fieldname="molecule_smiles",
    assay_name="fluorescence_exp",
    retain_achiral=True,
    retain_enantiopure=True,
    retain_semiquantitative_data=True,
)

# Preparing the experimental data

Now that we have our experimental data downloaded and processed, we need to convert it into the format that the ML pipeline expects. We provide a function that does exactly this: it takes the previously generated CSV file as input and produces a JSON file that we will load later.

In [None]:
from asapdiscovery.data.util.utils import cdd_to_schema

_ = cdd_to_schema(cdd_csv="cdd_filtered_processed.csv", out_json="cdd_filtered_processed.json")

# Downloading data from Fragalysis

The ASAP project uses Fragalysis for storing our crystal structures, so we will also need to pull those files. This process is much less involved than for the CDD data, as the structures in Fragalysis are already aligned and formatted nicely. We will simply download the archive and extract it.

As before, you can of course skip this step and instead provide a directory of your own PDB files.

In [None]:
import copy
from pathlib import Path

from asapdiscovery.data.services.fragalysis.fragalysis_download import (
    API_CALL_BASE,
    download,
)

# Copy the base call in case something tries to use it elsewhere
api_call = copy.deepcopy(API_CALL_BASE)
# We will be downloading the SARS-CoV-2 Mpro structures for this example
api_call["target_name"] = "Mpro"

# The archive will be extracted in the same directory that the zip file is downloaded
#  to, so make a new directory for everything
out_dir = Path("./mpro_fragalysis/")
out_dir.mkdir(exist_ok=True)
# Download and extract the files
download(out_fn=str(out_dir / "fragalysis.zip"), api_call=api_call)
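
After the download and extraction step above, a quick check that structure files actually landed in the output directory can save debugging time later. The sketch below searches recursively because the internal layout of the extracted archive is not described here. The helper `find_pdb_files` is hypothetical and not part of `asapdiscovery`.

```python
from pathlib import Path


def find_pdb_files(root):
    """Return all .pdb files found anywhere under root, sorted by path."""
    root = Path(root)
    if not root.is_dir():
        # Nothing has been downloaded or extracted here yet
        return []
    return sorted(root.rglob("*.pdb"))


pdb_files = find_pdb_files("./mpro_fragalysis/")
print(f"Found {len(pdb_files)} PDB files")
```

The same helper works on a directory of your own PDB files if you skipped the Fragalysis download.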

# Next steps
This should be enough to get you ready to train ML models on ASAP data using either the `CLI` (see `guides`) or the API layer (see the example notebook).