# Training ML models

## Prerequisites

Before starting this tutorial, you should have already worked through the tutorials on [Interfacing with databases and systems](https://asapdiscovery.readthedocs.io/en/latest/tutorials/index.html#interfacing-with-databases-and-systems) and [Docking and scoring](https://asapdiscovery.readthedocs.io/en/latest/tutorials/index.html#docking-and-scoring).

## Intro

In this guide, we will start with a CSV file downloaded from CDD and a directory of docked protein-ligand complex PDB files. These are the two outputs from the guides mentioned in the Prerequisites section, so be sure to complete those if you haven't already.

## Preparing the experimental data

Before training, we will filter and process the experimental data to ensure that everything is in the correct format. We will use the default column names throughout, which come from the [COVID Moonshot project](https://www.science.org/doi/10.1126/science.abo7201). See the documentation for the individual functions to learn how to adjust these values for your use case.
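
Before filtering, it can help to look at what the download actually contains. The snippet below is a minimal sketch using only pandas: `summarize_columns` is a hypothetical helper (not part of `asapdiscovery`), and the example frame uses made-up placeholder column names rather than the real CDD defaults.

```python
import pandas


def summarize_columns(df: pandas.DataFrame) -> pandas.DataFrame:
    """Return one row per column with its dtype and number of missing values."""
    return pandas.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
        }
    )


# Tiny made-up frame standing in for a CDD export. The column names are
# placeholders for illustration only, not the CDD/Moonshot defaults.
example_df = pandas.DataFrame(
    {
        "smiles_placeholder": ["CCO", "c1ccccc1", None],
        "ic50_placeholder": [1.5, None, 0.2],
    }
)

summary = summarize_columns(example_df)
print(summary)
```

On real data, you would pass the frame returned by `pandas.read_csv` instead of `example_df`, then compare the listed columns against the ones named in the function docstrings.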

In [None]:
from asapdiscovery.data.util.utils import (
    cdd_to_schema,
    filter_molecules_dataframe,
    parse_fluorescence_data_cdd,
)
from pathlib import Path

import pandas

# Replace this name with whatever you've saved your CDD download as
mol_df = pandas.read_csv("cdd_unfiltered.csv")

"""
In this example, we will ultimately use this data to train both 2D and structure-based
models, so we will keep all achiral and enantiopure molecules, including any molecules
with semiquantitative fluorescence values.
"""
mol_df_filt = filter_molecules_dataframe(
    mol_df,
    retain_achiral=True,
    retain_enantiopure=True,
    retain_semiquantitative_data=True,
)

"""
In addition to being appropriately filtered, mol_df_filt now contains some identifying
columns with standardized names.

The parse_fluorescence_data_cdd function standardizes the fluorescence assay results,
adding a number of columns to the data frame. In addition to IC50 and pIC50 values, it
will also calculate deltaG in kcal/mol and kT units. If you know your fluorescence
assay conditions and they were consistent across all molecules, you can supply the
information to the cp_values arg as a tuple of (substrate_concentration, Km), which the
function will use in the Cheng-Prusoff equation to calculate the deltaG values. If you
don't have these values, the function will use a less accurate approximation. We will
exclude the Cheng-Prusoff values in this example for simplicity.

More details on the columns that the function expects the input to have and that it
adds to the output can be found in the function's docstring.
"""
mol_df_filt_processed = parse_fluorescence_data_cdd(mol_df_filt)

# Save the processed data
mol_df_filt_processed.to_csv("cdd_filtered_processed.csv", index=False)

"""
The last step in this process is to convert the data into the format that the ML
pipeline expects. The function below does that, taking the previously generated CSV
file as input and producing a JSON file that we will load later.
"""
_ = cdd_to_schema(
    cdd_csv="cdd_filtered_processed.csv", out_json="cdd_filtered_processed.json"
)
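
To make the Cheng-Prusoff step mentioned above more concrete, here is a minimal sketch of the textbook relationships between IC50, pIC50, Ki, and deltaG. It is not the implementation inside `parse_fluorescence_data_cdd`. The function names, the temperature of 298.15 K, and the 1 M standard state are assumptions made for illustration. We also assume that the fallback approximation amounts to treating Ki as equal to IC50, which the text above does not state explicitly.

```python
import math

# Assumed constants for this sketch (not taken from the library)
R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)
TEMPERATURE_K = 298.15  # assumed assay temperature


def pic50_from_ic50(ic50_molar: float) -> float:
    """pIC50 is the negative base-10 log of the IC50 in molar units."""
    return -math.log10(ic50_molar)


def ki_cheng_prusoff(ic50_molar: float, substrate_conc: float, km: float) -> float:
    """Cheng-Prusoff for a competitive inhibitor: Ki = IC50 / (1 + [S] / Km).

    substrate_conc and km must be given in the same units.
    """
    return ic50_molar / (1.0 + substrate_conc / km)


def delta_g_kt(ki_molar: float) -> float:
    """Binding free energy in units of kT, relative to a 1 M standard state."""
    return math.log(ki_molar)


def delta_g_kcal(ki_molar: float, temperature: float = TEMPERATURE_K) -> float:
    """Binding free energy in kcal/mol, relative to a 1 M standard state."""
    return R_KCAL_PER_MOL_K * temperature * math.log(ki_molar)


ic50 = 1e-6  # illustrative IC50 of 1 micromolar, expressed in molar

# Without assay conditions: approximate Ki by IC50 (assumed fallback)
dg_approx_kt = delta_g_kt(ic50)

# With assay conditions: here [S] equals Km, so Ki is half of IC50
ki = ki_cheng_prusoff(ic50, substrate_conc=1.0, km=1.0)
dg_cp_kt = delta_g_kt(ki)

print(f"pIC50: {pic50_from_ic50(ic50):.2f}")
print(f"deltaG (Ki ~ IC50): {dg_approx_kt:.2f} kT")
print(f"deltaG (Cheng-Prusoff): {dg_cp_kt:.2f} kT")
print(f"deltaG (Cheng-Prusoff): {delta_g_kcal(ki):.2f} kcal/mol")
```

Because the correction factor `1 + [S] / Km` is always greater than one, the Cheng-Prusoff Ki is smaller than the IC50 and the resulting deltaG is more negative than the uncorrected estimate. In this sketch, the two differ by ln(2) in kT units when the substrate concentration equals Km.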