# Interfacing with databases and systems

ASAP's workflows interface with many different services, data providers and integrations. As with our base-level abstractions, we aim to provide high-level abstractions that make working with these databases and integrations seamless.

# Reading lots of molecules from files

One often wants to read a very large file of molecule data, e.g. an `SDF` or `mol2` file. We provide a `MolFileFactory` to quickly read these into a list of `Ligand` objects.

In [1]:
from asapdiscovery.data.readers.molfile import MolFileFactory
from asapdiscovery.data.testing.test_resources import fetch_test_file

big_sdf_file = fetch_test_file("Mpro_combined_labeled.sdf") # SDF file filled with COVID Moonshot compounds

factory = MolFileFactory(filename=big_sdf_file)
ligands = factory.load()



In [2]:
print(len(ligands)) # loaded 576 ligands into a list 

576
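For readers unfamiliar with the format, the sketch below illustrates why one SDF file can hold many molecules: each record ends with a line containing only `$$$$`. This is plain Python and is not how `MolFileFactory` is implemented; the helper `split_sdf_records` and the two hand-written records are hypothetical and for illustration only.

```python
import tempfile
from pathlib import Path

# Two tiny, hand-written SDF-style records (illustrative, not real Moonshot data).
# Each record ends with a line containing only "$$$$".
EXAMPLE_SDF = """\
mol_a
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
$$$$
mol_b
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
$$$$
"""


def split_sdf_records(path):
    """Return the text of each record in an SDF file, split on '$$$$' lines."""
    records, current = [], []
    for line in Path(path).read_text().splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


with tempfile.TemporaryDirectory() as tmp:
    sdf_path = Path(tmp) / "example.sdf"
    sdf_path.write_text(EXAMPLE_SDF)
    records = split_sdf_records(sdf_path)

# The first line of each record is the molecule title.
titles = [record.splitlines()[0] for record in records]
print(len(records), titles)
```

In practice, prefer `MolFileFactory`, which returns fully parsed ligands rather than raw text.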


# Reading structures from Fragalysis 

Diamond Light Source uses the [Fragalysis](https://fragalysis.diamond.ac.uk/viewer/react/landing) platform to display its crystallography results. ASAP makes extensive use of Diamond's high-throughput crystallography pipeline, and has therefore developed easy ways to download and parse Fragalysis data in our workflows.

To get a Fragalysis-format dump, use the `Download` button on the desired target in the Fragalysis UI. For ease of use, we have vendored a SARS-CoV-2-Mpro Fragalysis file in our testing suite.

In [7]:
import shutil

from asapdiscovery.data.testing.test_resources import fetch_test_file

mpro_fragalysis_zipped = fetch_test_file("mpro_fragalysis-04-01-24_zipped.zip")

# unzip into the current working directory
extract_dir = "."
shutil.unpack_archive(mpro_fragalysis_zipped, extract_dir)
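If you would rather not unpack into the current working directory, `shutil.unpack_archive` accepts any target directory. The sketch below builds a small stand-in archive and extracts it into a temporary directory; the archive contents (`aligned/structure_0.pdb`) are hypothetical placeholders, not the layout of a real Fragalysis dump.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Build a small stand-in archive (hypothetical contents, not a real Fragalysis dump).
    source_dir = tmp / "example_dump"
    (source_dir / "aligned").mkdir(parents=True)
    (source_dir / "aligned" / "structure_0.pdb").write_text("REMARK placeholder\n")
    archive_path = shutil.make_archive(str(tmp / "example_dump"), "zip", root_dir=source_dir)

    # Unpack into an explicit directory rather than the current working directory.
    extract_dir = tmp / "unpacked"
    shutil.unpack_archive(archive_path, extract_dir)

    # Collect the extracted files while the temporary directory still exists.
    extracted = sorted(
        p.relative_to(extract_dir).as_posix()
        for p in extract_dir.rglob("*")
        if p.is_file()
    )

print(extracted)
```

Using an explicit extraction directory keeps large dumps out of the notebook's working directory and makes cleanup straightforward.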

In [10]:
from asapdiscovery.data.services.fragalysis.fragalysis_reader import FragalysisFactory

frag_factory = FragalysisFactory.from_dir("mpro_fragalysis-04-01-24_zipped")
complexes = frag_factory.load(use_dask=True) # we can use dask to speed this up a lot

# we now have a list of about 800 complexes from Fragalysis to use!
print(len(complexes))




803


# Loading compounds from Postera

Postera's [Manifold](https://app.postera.ai/) platform is the primary home for much of our design-make-test-analyze (DMTA) cycle.

We need to be able to push and pull data from it with ease. To use this example you will need a valid `POSTERA_API_KEY`, which you can create after making an account.

In [11]:
%env POSTERA_API_KEY=EXAMPLE

from asapdiscovery.data.services.postera.postera_factory import PosteraFactory, PosteraSettings

ps = PosteraSettings()
ps

env: POSTERA_API_KEY=EXAMPLE


PosteraSettings(POSTERA_API_KEY='EXAMPLE', POSTERA_API_URL='https://api.asap.postera.ai', POSTERA_API_VERSION='v1')
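The general pattern of reading credentials from environment variables can be sketched with the standard library alone. `ExampleSettings` below is a hypothetical stand-in written for illustration, not the real `PosteraSettings` class, and how `PosteraSettings` is implemented internally is not assumed here. Only the variable names and the default URL are taken from the output above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleSettings:
    """Hypothetical settings holder; not the real PosteraSettings class."""

    api_key: str
    api_url: str

    @classmethod
    def from_env(cls, environ=os.environ):
        # Fail early with a clear message if the key is missing.
        api_key = environ.get("POSTERA_API_KEY")
        if not api_key:
            raise RuntimeError("POSTERA_API_KEY is not set")
        # The default URL mirrors the value printed by PosteraSettings above.
        api_url = environ.get("POSTERA_API_URL", "https://api.asap.postera.ai")
        return cls(api_key=api_key, api_url=api_url)


# Pass a plain dict here so the sketch does not depend on the real environment.
settings = ExampleSettings.from_env({"POSTERA_API_KEY": "EXAMPLE"})
print(settings.api_url)
```

Keeping the key in the environment rather than in the notebook avoids committing credentials alongside your code.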

In [12]:
pf = PosteraFactory(settings=ps, molecule_set_name="MY_MOLSET")
# pf.pull() will return a list of Ligands

# Pushing data to Postera

Pushing data to Postera is similar to loading; however, we only allow certain tags that follow our design specification to be updated in Manifold. These can be queried with `ManifoldAllowedTags`.

In [14]:
from asapdiscovery.data.services.postera.manifold_data_validation import ManifoldAllowedTags
ManifoldAllowedTags.get_values()[::10]

['SMILES',
 'biochemical-activity_EV-A71-3Cpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_EV-D68-3Cpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_MERS-CoV-Mpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_SARS-CoV-2-Mpro_computed-SchNet-pIC50_msk',
 'biochemical-activity_ZIKV-NS2B-NS3pro_computed-SchNet-pIC50_msk',
 'in-silico_DENV-NS2B-NS3pro_ligand-conformer-strain-szybki-kcal-mol_msk',
 'in-silico_EV-A71-3Cpro_docking-structure-POSIT_msk',
 'in-silico_EV-A71-Capsid_docking-pose-fitness-POSIT_msk',
 'in-silico_EV-D68-3Cpro_docking-hit_msk',
 'in-silico_EV-D68-3Cpro_md-pose_msk',
 'in-silico_EV-D68-Capsid_ligand-local-strain-szybki-kcal-mol_msk',
 'in-silico_MERS-CoV-Mpro_ligand-conformer-strain-szybki-kcal-mol_msk',
 'in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk',
 'in-silico_SARS-CoV-2-Mpro_docking-pose-fitness-POSIT_msk',
 'in-silico_SARS-CoV-2-N-protein_docking-hit_msk',
 'in-silico_SARS-CoV-2-N-protein_md-pose_msk',
 'in-silico_ZIKV-NS2B-NS3pro_liga

Let's push some mock data to Postera! You have to provide a SMILES and a `ligand_id`, which will be propagated to the Postera backend if it matches a UUID already present in Postera.

In [18]:
from asapdiscovery.data.services.postera.postera_uploader import PosteraUploader
import pandas as pd
data = {"SMILES": ["CCC", "CCCC"], "ligand_id":["abcderf1244134jasdasda", "asidaosidasdnalsd"], "in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk":["structure1", "structure2"]}
df = pd.DataFrame(data)


In [19]:
df

| | SMILES | ligand_id | in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk |
|---|---|---|---|
| 0 | CCC | abcderf1244134jasdasda | structure1 |
| 1 | CCCC | asidaosidasdnalsd | structure2 |


In [20]:
pu = PosteraUploader(settings=ps, molecule_set_name="MY_MOLSET")
# pu.push(df) # will push data to remote 
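Before pushing, it can help to check column names locally. The sketch below is a hypothetical pre-flight check, not part of `PosteraUploader`: `check_columns` is an illustrative helper, `allowed_tags` is a hand-picked subset standing in for `ManifoldAllowedTags.get_values()`, and treating `ligand_id` as an identifier exempt from the tag check is an assumption.

```python
REQUIRED_COLUMNS = {"SMILES", "ligand_id"}

# Assumed subset of allowed tags for this sketch; in practice use
# ManifoldAllowedTags.get_values().
allowed_tags = {
    "SMILES",
    "in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk",
}


def check_columns(columns, allowed_tags):
    """Return (missing_required, not_allowed) for a collection of column names."""
    columns = set(columns)
    missing_required = sorted(REQUIRED_COLUMNS - columns)
    # ligand_id is an identifier rather than a tag, so it is exempt here (assumption).
    not_allowed = sorted(columns - set(allowed_tags) - {"ligand_id"})
    return missing_required, not_allowed


good = ["SMILES", "ligand_id", "in-silico_SARS-CoV-2-Mac1_docking-structure-POSIT_msk"]
bad = ["SMILES", "my-custom-score"]

print(check_columns(good, allowed_tags))  # ([], [])
print(check_columns(bad, allowed_tags))   # (['ligand_id'], ['my-custom-score'])
```

With a DataFrame, the same check can be run as `check_columns(df.columns, allowed_tags)`.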

# Reading data from CDD

In [None]:
# TODO @jthorton fill in 