# Interfacing with databases and systems

ASAP's workflows interface with many different services, data providers, and integrations. As with our base-level abstractions, we aim to provide high-level abstractions that make working with these databases and integrations seamless.

## Reading lots of molecules from files

One often wants to read a large file filled with molecule data, e.g. an `SDF` or `mol2` file. We provide a `MolFileFactory` to quickly read these into a list of `Ligand` objects.

In [1]:
from asapdiscovery.data.readers.molfile import MolFileFactory
from asapdiscovery.data.testing.test_resources import fetch_test_file

big_sdf_file = fetch_test_file("Mpro_combined_labeled.sdf") # SDF file filled with COVID Moonshot compounds

factory = MolFileFactory(filename=big_sdf_file)
ligands = factory.load()



In [2]:
print(len(ligands)) # loaded 576 ligands into a list 

576
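
For intuition about what is being parsed: an SDF file stores one molecule per record, and each record ends with a line containing `$$$$`. The sketch below counts records in a small in-memory SDF using only the standard library. It is not how `MolFileFactory` is implemented, and the two-record `sdf_text` is made up for illustration.

```python
# Minimal sketch: count molecule records in SDF text by the `$$$$` delimiter.
# `sdf_text` is a made-up example, not data from the ASAP test suite.


def count_sdf_records(text: str) -> int:
    """Return the number of records terminated by a `$$$$` line."""
    return sum(1 for line in text.splitlines() if line.strip() == "$$$$")


sdf_text = """\
methane
  example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
$$$$
water
  example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
$$$$
"""

n_records = count_sdf_records(sdf_text)
print(n_records)
```

A quick count like this can serve as a sanity check against `len(ligands)` after loading a real file.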


## Reading structures from Fragalysis 

Diamond Light Source uses the [Fragalysis](https://fragalysis.diamond.ac.uk/viewer/react/landing) platform to display its crystallography results. ASAP makes extensive use of Diamond's high-throughput crystallography pipeline, and we have therefore developed easy ways to download and parse Fragalysis data in our workflows.

To get a Fragalysis-format dump, click the `Download` button on the desired target in the Fragalysis UI. For ease of use, we have vendored a SARS-CoV-2-Mpro Fragalysis file in our testing suite.

In [3]:
import shutil

from asapdiscovery.data.testing.test_resources import fetch_test_file

mpro_fragalysis_zipped = fetch_test_file("mpro_fragalysis-04-01-24_zipped.zip")

# unzip the archive into the current working directory
extract_dir = "."
shutil.unpack_archive(mpro_fragalysis_zipped, extract_dir)

Downloading file 'mpro_fragalysis-04-01-24_zipped.zip' from 'https://asap-discovery-test-files.s3.amazonaws.com/mpro_fragalysis-04-01-24_zipped.zip' to '/Users/joshua/Library/Caches/asapdiscovery_testing'.
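
To see what `shutil.unpack_archive` does without downloading anything, the sketch below builds a tiny zip file in a temporary directory and unpacks it into an explicit target directory. The archive name and the `metadata.csv` entry are hypothetical placeholders and do not reflect the real layout of a Fragalysis dump.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)

    # Build a small stand-in archive (hypothetical contents).
    archive = tmp_path / "example_dump.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("example_dump/metadata.csv", "crystal_name,smiles\n")

    # Unpack into an explicit directory; the format is inferred from ".zip".
    extract_dir = tmp_path / "unpacked"
    shutil.unpack_archive(archive, extract_dir)

    # Collect the extracted files, relative to the target directory.
    extracted = sorted(
        p.relative_to(extract_dir).as_posix()
        for p in extract_dir.rglob("*")
        if p.is_file()
    )

print(extracted)
```

Unpacking into a dedicated directory rather than `"."` keeps extracted files from mixing with the rest of the working directory.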


In [4]:
from asapdiscovery.data.services.fragalysis.fragalysis_reader import FragalysisFactory

frag_factory = FragalysisFactory.from_dir("mpro_fragalysis-04-01-24_zipped")
complexes = frag_factory.load(use_dask=True) # we can use dask to speed this up a lot

# we now have a list of about 800 complexes from Fragalysis to use!
print(len(complexes))




803


## Loading compounds from Postera

Postera's [Manifold](https://app.postera.ai/) platform is the primary home for much of our DMTA (design-make-test-analyze) cycle.

We need to be able to push and pull data from there with ease. To use this example you will need a valid `POSTERA_API_KEY`, which you can create after making an account.

In [5]:
%env POSTERA_API_KEY=EXAMPLE

from asapdiscovery.data.services.postera.postera_factory import PosteraFactory, PosteraSettings

ps = PosteraSettings()
ps

env: POSTERA_API_KEY=EXAMPLE


PosteraSettings(POSTERA_API_KEY='EXAMPLE', POSTERA_API_URL='https://api.asap.postera.ai', POSTERA_API_VERSION='v1')

In [6]:
pf = PosteraFactory(settings=ps, molecule_set_name="MY_MOLSET")
# pf.pull() will return a list of Ligands
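
The settings object above picks up its values from environment variables. The general pattern can be sketched with the standard library alone. `ExampleServiceSettings`, the `EXAMPLE_SERVICE_*` variable names, and the default URL are all hypothetical, and this is not how `PosteraSettings` itself is implemented.

```python
import os
from dataclasses import dataclass

DEFAULT_API_URL = "https://example.invalid"  # hypothetical placeholder URL


@dataclass(frozen=True)
class ExampleServiceSettings:
    """Hypothetical settings container populated from environment variables."""

    api_key: str
    api_url: str = DEFAULT_API_URL

    @classmethod
    def from_env(cls, prefix: str = "EXAMPLE_SERVICE") -> "ExampleServiceSettings":
        key = os.environ.get(f"{prefix}_API_KEY")
        if not key:
            # Fail early with a clear message instead of at the first API call.
            raise RuntimeError(f"{prefix}_API_KEY is not set")
        url = os.environ.get(f"{prefix}_API_URL", DEFAULT_API_URL)
        return cls(api_key=key, api_url=url)


# Equivalent of `%env EXAMPLE_SERVICE_API_KEY=EXAMPLE` outside a notebook.
os.environ["EXAMPLE_SERVICE_API_KEY"] = "EXAMPLE"
example_settings = ExampleServiceSettings.from_env()
print(example_settings)
```

Reading credentials from the environment keeps API keys out of notebooks and scripts that may be shared or committed.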

## Pushing data to Postera

Pushing data to Postera is similar to loading. However, we only allow certain tags that follow our design specification to be updated in Manifold. These can be queried with `ManifoldAllowedTags`:

In [7]:
from asapdiscovery.data.services.postera.manifold_data_validation import ManifoldAllowedTags
ManifoldAllowedTags.get_values()[::10]

['SMILES',
 'biochemical-activity_EV-A71-3Cpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_EV-D68-3Cpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_MERS-CoV-Mpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_SARS-CoV-2-Mpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_ZIKV-NS2B-NS3pro_computed-SchNet-pIC50_msk',
 'in-silico_DENV-NS2B-NS3pro_ligand-conformer-strain-szybki-kcal-mol_msk',
 'in-silico_EV-A71-3Cpro_docking-structure-POSIT_msk',
 'in-silico_EV-A71-Capsid_docking-pose-fitness-POSIT_msk',
 'in-silico_EV-D68-3Cpro_docking-hit_msk',
 'in-silico_EV-D68-3Cpro_md-pose_msk',
 'in-silico_EV-D68-Capsid_ligand-local-strain-szybki-kcal-mol_msk',
 'in-silico_MERS-CoV-Mpro_ligand-conformer-strain-szybki-kcal-mol_msk',
 'in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk',
 'in-silico_SARS-CoV-2-Mpro_docking-pose-fitness-POSIT_msk',
 'in-silico_SARS-CoV-2-N-protein_docking-hit_msk',
 'in-silico_SARS-CoV-2-N-protein_md-pose_msk',
 'in-silico_ZIKV-NS2B-NS3pro_liga
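
Before building a dataframe to upload, it can help to check column names against the allowed tags. The sketch below does this with plain Python. The `allowed_tags` set is a small subset copied from the output above (in practice one would pass `ManifoldAllowedTags.get_values()`), `my-custom-score` is a made-up column name, and how the uploader itself treats disallowed columns is not covered here.

```python
# Subset of allowed tags copied from the output above, for illustration only.
allowed_tags = {
    "SMILES",
    "in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk",
    "in-silico_SARS-CoV-2-Mpro_docking-pose-fitness-POSIT_msk",
}


def split_columns(columns, allowed):
    """Partition column names into (accepted, rejected), preserving order."""
    accepted = [c for c in columns if c in allowed]
    rejected = [c for c in columns if c not in allowed]
    return accepted, rejected


columns = [
    "SMILES",
    "in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk",
    "my-custom-score",  # hypothetical column that is not an allowed tag
]
accepted, rejected = split_columns(columns, allowed_tags)
print("accepted:", accepted)
print("rejected:", rejected)
```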

Let's push some mock data to Postera! You have to provide a SMILES and also a `ligand_id`, which will be propagated to the Postera backend if it matches a UUID already present in Postera.

In [8]:
from asapdiscovery.data.services.postera.postera_uploader import PosteraUploader
import pandas as pd
data = {
    "SMILES": ["CCC", "CCCC"],
    "ligand_id": ["abcderf1244134jasdasda", "asidaosidasdnalsd"],
    "in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk": ["structure1", "structure2"],
}
df = pd.DataFrame(data)


In [9]:
df

| | SMILES | ligand_id | in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk |
|---|---|---|---|
| 0 | CCC | abcderf1244134jasdasda | structure1 |
| 1 | CCCC | asidaosidasdnalsd | structure2 |


In [10]:
pu = PosteraUploader(settings=ps, molecule_set_name="MY_MOLSET")
# pu.push(df) # will push data to remote 
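
Since the `ligand_id` is matched against UUIDs already present in Postera, one might check that an identifier is at least a well-formed UUID before pushing. The sketch below uses the standard-library `uuid` module. Whether Postera requires this exact format is an assumption, and `is_well_formed_uuid` is a hypothetical helper that is not part of `asapdiscovery`.

```python
import uuid


def is_well_formed_uuid(value: str) -> bool:
    """Return True if `value` parses as a UUID string (hypothetical helper)."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


generated_id = str(uuid.uuid4())  # a freshly generated, well-formed UUID
mock_id = "abcderf1244134jasdasda"  # mock identifier from the example above

print(is_well_formed_uuid(generated_id))
print(is_well_formed_uuid(mock_id))
```

The mock identifiers used in this example are not well-formed UUIDs, so under this assumption they would not match any existing entry.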

## Reading data from CDD

At ASAP we use the [CDD Vault](https://www.collaborativedrug.com/) to store assay information on tested molecules, and we often need to search and pull data. To use this service, export your `CDD_API_KEY` and `CDD_VAULT_NUMBER`, which will be automatically picked up by our `CDDSettings`:

In [11]:
%env CDD_API_KEY=EXAMPLE
%env CDD_VAULT_NUMBER=1

from asapdiscovery.data.services.cdd.cdd_api import CDDAPI, CDDSettings
settings = CDDSettings()
settings

env: CDD_API_KEY=EXAMPLE
env: CDD_VAULT_NUMBER=1


CDDSettings(CDD_API_KEY='EXAMPLE', CDD_VAULT_NUMBER=1, CDD_API_URL='https://app.collaborativedrug.com', CDD_API_VERSION='v1')

We can now use the `CDDAPI` interface to query our vault. Let's start by searching for molecules. Note that they are returned as raw dictionary data from CDD, which can be converted into ligand objects using the `CXSmiles` or `Smiles` data:

In [None]:
cdd_api = CDDAPI.from_settings(settings=settings)
# search for a specific molecule in the vault by SMILES; only one SMILES search can be run at a time
benzene = cdd_api.get_molecules(smiles="c1ccccc1")
# search for a list of molecules by their name in CDD
molecules_by_name = cdd_api.get_molecules(names=['org-id-1', 'org-id-2'])
# or search for molecules using the CDD compound-id
molecules_by_id = cdd_api.get_molecules(compound_ids=[1, 2, 3, 4])
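
Because the results come back as raw dictionaries, a small helper can pull out the string to build a ligand from. The sketch below prefers a CXSMILES entry (which can carry enhanced stereochemistry information) and falls back to plain SMILES. The flat record layout and the lowercase `cxsmiles`, `smiles`, and `name` keys are assumptions made for illustration, so inspect a real response before relying on them.

```python
from typing import Optional


def pick_smiles(record: dict) -> Optional[str]:
    """Prefer the CXSMILES string, fall back to plain SMILES, else None."""
    return record.get("cxsmiles") or record.get("smiles")


# Hypothetical records; the real CDD response layout may differ.
raw_records = [
    {"name": "org-id-1", "smiles": "c1ccccc1"},
    {"name": "org-id-2", "smiles": "CC(O)CC", "cxsmiles": "C[C@H](O)CC |&1:1|"},
    {"name": "org-id-3"},  # no structure information at all
]

smiles_by_name = {r["name"]: pick_smiles(r) for r in raw_records}
print(smiles_by_name)
```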

Another common task is to download all `IC50` data for a given protocol, for use in ML model development or in benchmarking binding affinity calculations. This is straightforward using the API:

In [None]:
ic50_dataframe = cdd_api.get_ic50_data(protocol_name='assay-1')
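
For ML or benchmarking work, IC50 values are commonly converted to pIC50, the negative base-10 logarithm of the IC50 expressed in mol/L. The sketch below assumes the input is in micromolar. The units and column names of the dataframe returned by `get_ic50_data` are not stated here, so check them before applying a conversion like this.

```python
import math


def ic50_to_pic50(ic50_micromolar: float) -> float:
    """Convert an IC50 in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_micromolar <= 0:
        raise ValueError("IC50 must be positive")
    # 1 uM = 1e-6 M, so -log10(x * 1e-6) = 6 - log10(x)
    return 6.0 - math.log10(ic50_micromolar)


# Made-up IC50 values in micromolar, for illustration only.
example_ic50s = [10.0, 1.0, 0.1]
example_pic50s = [ic50_to_pic50(v) for v in example_ic50s]
print(example_pic50s)
```

Working on the log scale means a tenfold change in potency corresponds to a one-unit change in the target value.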

We also provide a utility function that quickly downloads all of the molecules in a given protocol, keeping only non-covalent ligands with fully defined stereochemistry, and returns the results as ASAP `Ligand` objects:

In [None]:
from asapdiscovery.alchemy.cli.utils import get_cdd_molecules

molecules = get_cdd_molecules(protocol_name='assay-1', defined_stereo_only=True, remove_covalent=True)