# Get Datasets Stored on the MDF
Pull the datasets used in our [previous molecular design tests](https://github.com/exalearn/multi-site-campaigns), which we have published on the Materials Data Facility.
They are smaller in scale (MOSES ~1M molecules, QM9 ~0.1M molecules), which makes for easier tests of the system.

In [1]:
from tempfile import TemporaryDirectory
from shutil import copyfileobj
from typing import Iterator
from pathlib import Path
import requests

Configuration

In [2]:
base_url = 'https://data.materialsdatafacility.org/mdf_open/multiresource_ai_v2.1/multisite/data/moldesign/search-space'
dataset_names = ['QM9', 'MOS']

## Make the Functions
A function to iterate over the SMILES strings in a dataset

In [3]:
def get_smiles_strings(name: str) -> Iterator[str]:
    """Iterate over all of the SMILES strings in a dataset stored on the MDF
    
    Args:
        name: Name of the dataset
    Yields:
        SMILES string of a molecule
    """
    with TemporaryDirectory(prefix='smiles') as tmp:
        file_path = Path(tmp) / 'space.csv'
        with requests.get(f'{base_url}/{name}-search.csv', stream=True) as req, file_path.open('wb') as fo:
            req.raise_for_status()  # Fail early on an HTTP error instead of saving an error page
            copyfileobj(req.raw, fo)
    
        with open(file_path, 'rt') as fp:
            header = fp.readline()  # Header
            assert header.startswith('smiles,')
            for line in fp:
                smiles = line.split(",")[0]
                yield smiles
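
The parsing step in the function above can be tried without any download. The following is a minimal sketch, assuming the files have a header that starts with `smiles,` followed by comma-separated rows; the helper name `iter_first_column`, the `extra` column, and the sample rows are made up for illustration and are not part of the MDF data.

```python
import io
from typing import Iterator, TextIO


def iter_first_column(fp: TextIO) -> Iterator[str]:
    """Yield the first comma-separated field of each row after the header."""
    header = fp.readline()
    if not header.startswith('smiles,'):
        raise ValueError(f'Unexpected header: {header!r}')
    for line in fp:
        yield line.split(',')[0].strip()


# Made-up rows for illustration; not taken from the MDF files
sample = io.StringIO('smiles,extra\nC,1\nCCO,2\n')
parsed = list(iter_first_column(sample))
print(parsed)
```

Splitting on the first comma is enough here because a SMILES string does not contain commas, so the first field is never quoted.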

## Download the Data
Save the SMILES strings from each dataset to a `.smi` file in the `output` directory.

In [4]:
Path('output').mkdir(exist_ok=True)  # Make sure the output directory exists
for name in dataset_names:
    with open(f'output/mdf-{name.lower()}.smi', 'w') as fp:
        for smiles in get_smiles_strings(name):
            print(smiles.strip(), file=fp)
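
A quick sanity check after the download is to count the molecules in each output file. This is a sketch only: `count_molecules` is a hypothetical helper, not part of the original notebook, and it is demonstrated on a throwaway file with made-up contents.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def count_molecules(path: Path) -> int:
    """Count the non-empty lines in a .smi file, one SMILES string per line."""
    with path.open() as fp:
        return sum(1 for line in fp if line.strip())


# Demonstrate on a throwaway file with made-up contents
with TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'mdf-demo.smi'
    demo_path.write_text('C\nCCO\n\n')
    demo_count = count_molecules(demo_path)
print(demo_count)
```

In the notebook, something like `count_molecules(Path('output/mdf-qm9.smi'))` should give a number on the order of the dataset sizes quoted at the top.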